ALGORITHMS FOR DEEP PACKET INSPECTION

By

Jignesh D. Patel

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Computer Science

2012

ABSTRACT

ALGORITHMS FOR DEEP PACKET INSPECTION

By

Jignesh D. Patel

The core operation in network intrusion detection and prevention systems is Deep Packet Inspection (DPI), in which each security threat is represented as a signature, and the payload of each data packet is matched against the set of current security threat signatures. DPI is also used for other networking applications such as advanced QoS mechanisms and protocol identification. In the past, attack signatures were specified as strings. Today most DPI systems use Regular Expressions (REs) to represent signatures. RE matching for networking applications is difficult for several reasons. First, the DPI application is usually implemented in network devices, which have limited computing resources. Second, as new threats are discovered, the size of the signature set grows over time. Last, the matching needs to be done at network speeds, the growth of which outpaces improvements in computing speed; so there is a need for novel solutions that can deliver higher throughput. As a result, RE matching for DPI is a very important and active research area. We study existing methods proposed for RE matching, identify their limitations, and propose new methods to overcome these limitations.

RE matching remains a fundamentally challenging problem due to the difficulty of compactly encoding Deterministic Finite state Automata (DFA). While the DFA for any one RE is typically small, the DFA that corresponds to the entire set of REs is usually too large to be constructed or deployed. To address this issue, many alternative automata implementations that compress the size of the final automaton have been proposed. We improve upon previous research in three ways. First, we propose a more efficient "Minimize then Union" framework for constructing compact alternative automata, which minimizes smaller automata before combining them. Previously proposed automata construction algorithms employ a "Union then Minimize" framework in which the automata for each RE are joined before minimization occurs. This leads to expensive minimization on a large automaton and a large intermediate memory footprint. Our minimize-then-union approach requires much less time and memory, allowing us to handle a much larger RE set. Second, we propose the first hardware-based RE matching approach that uses Ternary Content Addressable Memory (TCAM). Prior hardware-based RE matching algorithms typically use FPGA. The main drawback of FPGA is that resynthesizing and updating FPGA circuitry to handle RE updates is slow and difficult. In contrast, TCAM supports easy RE updates, and we show that we can achieve very high throughput. Furthermore, TCAMs are widely used in modern networking devices for tasks such as packet classification, so no major architectural modifications are needed to implement our approach in existing networking devices. Finally, we propose new overlay automata models that effectively address the replication of DFA states that occurs when multiple REs are combined. The idea is to group the replicated DFA structures together instead of repeating them multiple times. The result is a final automaton whose size is close to that of an NFA (which is linear in the size of the RE set), while simultaneously achieving the fast, deterministic matching speed of a DFA.
ACKNOWLEDGMENTS

I would like to take this opportunity to thank all the people who have helped me during my graduate career and made this dissertation possible. First and foremost, I would like to thank my advisor, Dr. Eric Torng, for his constant guidance, support and encouragement. I would like to express my earnest gratitude to my thesis committee members, Dr. Richard Enbody, Dr. Alex Liu and Dr. Peter Magyar, for being there for me whenever I needed them. I would also like to thank the staff of the CSE department for all their help and support. Finally, I would like to thank my friends and family for all their support and encouragement.

TABLE OF CONTENTS

List of Tables
List of Figures

Chapter 1  Introduction
    1.1 Problem Statement
    1.2 Research Problems
    1.3 Research Goals

Chapter 2  Related Work

Chapter 3  Background
    3.1 DFA for RE Matching
    3.2 Understanding DFA Space Explosion
        3.2.1 Transition Sharing
        3.2.2 State Replication
    3.3 The D²FA
        3.3.1 D²FA Definition
        3.3.2 Original D²FA Algorithm
        3.3.3 Limiting Deferment Depth in Original D²FA Algorithm
        3.3.4 Backpointer D²FA Algorithm
    3.4 Classifiers
        3.4.1 Classifier Definition
            3.4.1.1 Prefix Classifier
            3.4.1.2 Ternary Classifier
            3.4.1.3 Weighted Classifier
        3.4.2 Classifier Minimization
    3.5 TCAM Introduction

Chapter 4  Software Implementation
    4.1 Introduction/Motivation
        4.1.1 Solution Goals
        4.1.2 Summary and Limitations of Prior Art
        4.1.3 Summary of Our Approach
            4.1.3.1 Advantages of Our Algorithm
    4.2 Minimum State PMDFA Construction
    4.3 Efficient D²FA Construction
        4.3.1 Improved D²FA Construction for One RE
        4.3.2 D²FA Merge Algorithm
        4.3.3 Direct D²FA Construction for an RE Set
        4.3.4 Optional Final Compression Algorithm
    4.4 D²FA Merge Algorithm Properties
        4.4.1 Proof of Correctness
        4.4.2 Limiting Deferment Depth
        4.4.3 Deferment to a Lower Level
        4.4.4 Algorithmic Complexity
    4.5 Experimental Results
        4.5.1 Methodology
            4.5.1.1 Data Sets
            4.5.1.2 Metrics
            4.5.1.3 Measuring Space
            4.5.1.4 Correctness
        4.5.2 D²FAMERGE versus ORIGINAL
        4.5.3 Assessment of Final Compression Algorithm
        4.5.4 D²FAMERGE versus ORIGINAL with Bounded Maximum Deferment Depth
        4.5.5 D²FAMERGE versus BACKPTR
        4.5.6 Scalability Results

Chapter 5  TCAM Implementation
    5.1 Introduction/Motivation
        5.1.1 TCAM Architecture for RE Matching
        5.1.2 Reducing TCAM Size
            5.1.2.1 Transition Sharing
            5.1.2.2 Table Consolidation
        5.1.3 Increasing Matching Throughput
        5.1.4 Comparison of Transition Sharing with D²FA
    5.2 Transition Sharing
        5.2.1 Character Bundling
        5.2.2 Shadow Encoding
            5.2.2.1 Observations
            5.2.2.2 Determining Table Order
            5.2.2.3 Shadow Encoding Algorithm
            5.2.2.4 Choosing Transitions
    5.3 Table Consolidation
        5.3.1 Observations
        5.3.2 Computing a k-decision Table
        5.3.3 Choosing States to Consolidate
            5.3.3.1 Greedy Matching
        5.3.4 Effectiveness of Table Consolidation
    5.4 Variable Striding
        5.4.1 Observations
        5.4.2 Eliminating State Explosion
        5.4.3 Controlling Transition Explosion
            5.4.3.1 Self-Loop Unrolling Algorithm
            5.4.3.2 k-var-stride Transition Sharing Algorithm
        5.4.4 Variable Striding Selection Algorithm
    5.5 Implementation and Modeling
    5.6 Experimental Results
        5.6.1 Methodology
        5.6.2 Results on 1-stride DFAs
        5.6.3 Results on 7-var-stride DFAs

Chapter 6  Overlay Automata
    6.1 Introduction
        6.1.1 Limitations of Prior Automata Models
        6.1.2 Summary of Overlay Automata Approach
            6.1.2.1 Overlay DFA
            6.1.2.2 Overlay D²FA
            6.1.2.3 Building OD²FA
            6.1.2.4 Implementing OD²FA
    6.2 Overlay DFA
    6.3 Overlay D²FA
        6.3.1 OD²FA Multiplicative Compression
        6.3.2 Effectiveness of OD²FA on Ideal RE Set
    6.4 OD²FA Construction
        6.4.1 OD²FA Construction from One RE
        6.4.2 OD²FA Construction from 2 OD²FAs
        6.4.3 Direct OD²FA Construction from 2 OD²FAs
    6.5 Building Super-state Transitions
        6.5.1 Combining State Transitions
            6.5.1.1 Computing State Transitions
        6.5.2 Creating Overlay Classifier
        6.5.3 Minimizing Overlay Classifier
            6.5.3.1 Pre-merging Bits
            6.5.3.2 Bit Merging Algorithm
        6.5.4 Overlay Discussion
            6.5.4.1 Restricting Overlay Count to Power of 2
            6.5.4.2 Eliminating Overlay Bits
    6.6 OD²FA Software Implementation
        6.6.1 Implementing OD²FA
        6.6.2 Overlay Classifier Storage and Lookup
        6.6.3 Space Requirement
    6.7 OD²FA Implementation in TCAM
        6.7.1 Generating Super-state IDs and Codes
        6.7.2 Implementing Super-state Transitions
        6.7.3 TCAM Table Generation
        6.7.4 Variable Striding
            6.7.4.1 Self-loop Unrolling
            6.7.4.2 Full Variable Striding
    6.8 Experimental Results
        6.8.1 Effectiveness of OverlayCAM
        6.8.2 Results on 7-var-stride
            6.8.2.1 Self-loop Unrolling
            6.8.2.2 Full Variable Striding
        6.8.3 Scalability of OverlayCAM

Chapter 7  Conclusion

Appendix
    Glossary
    Acronyms
    Notation

Bibliography

LIST OF TABLES

Table 4.1  Performance data of ORIGINAL and D²FAMERGE
Table 4.2  Comparing D²FAMERGE and D²FAMERGEOPT with ORIGINAL
Table 4.3  Performance data of D²FAMERGEOPT
Table 4.4  The D²FA size and D²FA average deferment depth ψ for ORIGINAL and D²FAMERGE on our eight primary RE sets, given maximum deferment depth bounds of 1, 2 and 4
Table 4.5  Comparing D²FAMERGE with ORIGINAL given maximum deferment depth bounds of 1, 2 and 4
Table 4.6  Performance data for both variants of BACKPTR and D²FAMERGE with the back-pointer property
Table 4.7  Comparing D²FAMERGE with both variants of BACKPTR
Table 5.1  TCAM size and latency
Table 5.2  TCAM size and throughput for 1-stride DFAs
Table 6.1  Experimental results of OverlayCAM on 8 RE sets in comparison with RegCAM-TC and RegCAM+TC
Table 6.2  Number of TCAM rules for RegCAM-TC and OverlayCAM for 1-stride, with self-loop unrolling and with 7-var-stride
Table 6.3  Average stride values for self-loop unrolling and 7-var-stride for RegCAM-TC and OverlayCAM for pM = 0, 50 and 95

LIST OF FIGURES

Figure 3.1  Example of DFA and state replication
Figure 3.2  D²FA example
Figure 4.1  Edge weight distribution in a typical SRG
Figure 4.2  Example showing a D²FA with non-self-looping root states
Figure 4.3  D²FA merge example
Figure 4.4  Algorithm D2FAMerge(D1, D2) for merging two D²FAs
Figure 4.5  Memory and time required to build the D²FA versus the number of Scale REs used, for ORIGINAL's D²FA and D²FAMERGE's D²FA
Figure 5.1  A DFA with its TCAM table
Figure 5.2  TCAM table with shadow encoding
Figure 5.3  D²FA, SRG, and deferment tree of the DFA in Figure 5.1
Figure 5.4  Shadow encoding example
Figure 5.5  Shadow encoding algorithm
Figure 5.6  3-decision table for 3 states in Figure 5.1
Figure 5.7  Consolidating two trees
Figure 5.8  Algorithm for consolidating trees
Figure 5.9  D²FA for RE set {/abc/, /abd/, /e.*f/}
Figure 5.10  3-var-stride transition table for s0
Figure 5.11  States s1 and s2 share transition aa
Figure 5.12  Uncompressed 2-var-stride transition tables for the D²FA in Figure 5.3(a) (a = 97, o = 111)
Figure 5.13  TCAM entries per DFA state (a) and compute time per DFA state (b) for Scale 26 through Scale 34
Figure 5.14  Consolidation times for Scale 26 through Scale 34 for the optimal and greedy consolidation algorithms
Figure 5.15  The throughput and average stride length of RE sets
Figure 6.1  Relationship of automata models
Figure 6.2  Example of DFA, state replication and overlay DFA
Figure 6.3  OD²FA example
Figure 6.4  OD²FA construction from one RE
Figure 6.5  D²FA and OD²FA for RE /cd[^n]*pr/
Figure 6.6  Merged OD²FA construction example
Figure 6.7  Algorithm OD2FAMerge(D1, D2) for merging two OD²FAs
Figure 6.8  Algorithm DirectOD2FAMerge(D1, D2) for merging two OD²FAs
Figure 6.9  Overlay classifier and corresponding super-state transitions for the super-states in the OD²FA in Figure 6.6(c)
Figure 6.10  Algorithm CreateOverlayClassifier(Dec, Reqd)
Figure 6.11  Minimizing overlay classifier example
Figure 6.12  Algorithm MinimizeOverlayClassifier(C)
Figure 6.13  Overlay padding example
Figure 6.14  TCAM rules for RegCAM and OD²FA
Figure 6.15  Root super-state self-loop unrolling example for the TCAM rules in Figure 6.14
Figure 6.16  Variable stride transitions generated for super-state 0 from the 1-stride transitions in Figure 6.9
Figure 6.17  Algorithm BuildVarStrideOD2FA(D) to build k-var-stride rules
Figure 6.18  (a) TEF vs. number of NFA states for OverlayCAM and RegCAM; (b) SEF vs. number of NFA states for OverlayCAM
Chapter 1

Introduction

1.1 Problem Statement

Deep Packet Inspection (DPI) is the core component of many networking devices on the Internet, such as Network Intrusion Detection (or Prevention) Systems (NIDS/NIPS), firewalls, and layer 7 switches. In DPI, in addition to examining the packet headers, the entire contents of each packet are compared against a set of signatures to check whether any signature is found in the packet. For instance, for security applications, each individual virus or attack threat is represented using one signature.
The payload of each packet passing through the network device is compared against the set of signatures, and a match indicates that the corresponding threat is found. Necessary action to neutralize the threat can then be taken. Application level signature analysis is also used for providing advanced QoS mechanisms, detecting peer-to-peer traffic, and for general application protocol identification.

In the past, DPI typically used string matching as the core operation, in which signatures are specified as simple strings. Today, DPI typically uses Regular Expression (RE) matching as the core operation, in which signatures are specified as REs. REs are used instead of simple string patterns because REs are fundamentally more expressive and thus are able to describe a wider variety of attack signatures [43]. Most open source and commercial intrusion detection and prevention systems such as Snort [2,39], Bro [37], HP TippingPoint and Cisco networking appliances use RE matching. Likewise, some operating systems such as Cisco IOS and Linux [1] have built RE matching into their layer 7 filtering functions. So the problem we are trying to solve is as follows: given a set of REs, R, and an input stream, we want to quickly find all occurrences of each RE from R in the input stream.

1.2 Research Problems

There are several challenges in implementing RE matching parsers for network applications. First, for many DPI applications, the signature set size grows rapidly over time. For example, for security applications, new attack threats are regularly discovered, so the signature set keeps growing. The current release of the Snort rules has close to 2000 REs in it. So the DPI engine should be able to handle a large RE set, and it also needs to be scalable. Second, since each packet needs to be scanned in real time as it is processed, the DPI engine needs to be able to process packets at a fast and deterministic rate. As network speed increases, this becomes an increasingly difficult and important problem to solve. Finally, the DPI engine is typically implemented in a network device, like a router, which usually has limited memory and processing power. So the DPI engine needs to achieve high throughput using limited hardware resources.

As both traffic rates and signature set sizes are rapidly growing over time, fast and scalable RE matching is now a core network security issue. As a result, there has been a lot of recent work on implementing high speed RE parsers for network applications. The straightforward approach to performing RE matching is to convert the RE set into an equivalent automaton and use the packet payloads as input strings for the automaton. Two standard choices are Deterministic Finite state Automata (DFA) and Nondeterministic Finite state Automata (NFA). The DFA has the advantage of maintaining only a single active state at any time. Thus processing each input character requires only a single lookup, so the throughput achieved is fast and deterministic. However, DFAs experience state explosion, where the number of states in the DFA can be exponential in the number of REs. Thus, DFAs can require too much memory to store. The NFA has the advantage of small size: the number of states in the NFA is typically linear in the number of REs, hence requiring little memory. However, the NFA has no limit on the number of active states, which means that the number of lookups needed to process each input character is high and unpredictable. So NFAs cannot achieve high and deterministic throughput.
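To make the lookup-cost contrast concrete, the following minimal Python sketch (ours, not from the dissertation; the table layouts are assumed) shows the per-character work in each model:

```python
# Per-character work: a DFA does exactly one lookup; an NFA does one lookup
# per active state, and the size of the active set is input dependent.

def dfa_step(delta, state, ch):
    # delta: dict mapping (state, ch) -> the unique next state.
    return delta[(state, ch)]

def nfa_step(moves, active, ch):
    # moves: dict mapping (state, ch) -> set of possible next states.
    nxt = set()
    for s in active:                       # work grows with |active|
        nxt |= moves.get((s, ch), set())
    return nxt
```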
1.3 Research Goals

As high and deterministic throughput is the primary requirement on networking devices, high speed RE matching is typically based on the DFA. But the high memory requirement of DFAs limits the number of REs in the ruleset that can be parsed simultaneously. In this thesis, we propose algorithmic solutions to implement RE matching based on the DFA that simultaneously achieve high throughput and a low memory requirement. Storing a DFA requires a large amount of memory because (1) the number of states grows exponentially with the number of REs, and (2) more states imply more transitions need to be stored, since each state needs to store 256 = 2^8 transitions.

The first research goal was to develop efficient algorithms that reduce the number of DFA transitions that need to be stored. The Delayed Input DFA (D²FA) proposed by Kumar et al. [26] reduces the number of stored transitions by exploiting redundancy among the transitions. This and other previous techniques employ a "union then minimize" framework, in which they first build a large automaton corresponding to all the REs in the ruleset, and then perform an expensive minimization on the large automaton. We develop algorithms that use a "minimize then union" framework to build the D²FA. In this approach we first minimize the automata corresponding to each individual RE in the ruleset, which is an inexpensive step because the automata are very small. We then use a fast algorithm to union the minimized automata together in such a way that the minimization is not lost. The D²FA can be used for a software implementation of a DPI engine. The compressed transition table is stored in RAM, and the processor does a RAM lookup for each transition of the automaton. The drawback of implementing a D²FA in software is that the throughput is reduced (we explain this in Section 3.3.3).

The second research goal was to find an efficient implementation of RE matching in networking device hardware. To this end, we develop techniques to implement the D²FA for RE matching using Ternary Content Addressable Memory (TCAM). TCAMs are already widely used in networking devices for header based packet forwarding, so our techniques can be implemented on current TCAM hardware without requiring major modifications. We also develop techniques to increase throughput by processing more than one input character in each cycle.

While the D²FA is much smaller than a DFA, its memory requirement is still proportional to the number of DFA states, which grows exponentially with the number of REs. The ultimate goal for RE matching is to develop an automata model that achieves throughput close to that of a DFA but only requires space close to that of an NFA. Our final research goal was to develop such an automata model. For this, we have developed two new automata models, the Overlay Deterministic Finite state Automaton (ODFA) and the Overlay Delayed Input DFA (OD²FA), as well as algorithms to implement OD²FA automata in both software and hardware. Our hardware OD²FA implementation achieves the speed of a DFA and the memory requirement of an NFA for many RE sets.

The rest of this dissertation is organized as follows. In Chapter 2 we discuss related problems and research. Background about the DFA, D²FA and TCAM is presented in Chapter 3. Our research related to the D²FA and to implementing RE matching in TCAM is presented in Chapters 4 and 5, respectively. Chapter 6 presents our research on the OD²FA automata model and its implementation. Finally, Chapter 7 ends the dissertation with concluding remarks.
Chapter 2

Related Work

In the past, DPI typically used string matching (often called pattern matching) as a core operator; string matching solutions have been extensively studied [4,5,44,46,48,49,52]. Several TCAM-based solutions have been proposed for string matching [5,12,46,52], but they do not generalize to RE matching because they only deal with independent strings and do not use DFAs. Sommer and Paxson [43] first proposed using REs instead of strings to specify attack signatures. Today most DPI engines use RE matching as a core operator because strings are not adequate to precisely describe attack signatures.

There are two main approaches in previous work to developing RE matching solutions. One is to start with a DFA and compress it. The second is to start with an NFA and develop methods for coping with multiple active states.

We first review DFA compression work. Much work has been done in reducing the number of transitions stored per DFA state, such as the D²FA [6,8,17,26,27]. These techniques exploit transition redundancy between states to compress the size of the DFA. We present a novel "minimize then union" approach of building the D²FA incrementally. Our approach can build much larger D²FAs in a fraction of the time compared to the previous solutions. This work is presented in [36]. Recently and independently, Liu et al. proposed to construct DFAs by hierarchical merging [29]. That is, they essentially propose the "minimize then union" framework for DFA construction. They consider merging multiple DFAs at a time rather than just two. However, they do not consider the D²FA, and they do not prove any properties about their merge algorithm, including that it results in minimum state DFAs.

Another approach to reducing the number of transitions stored per DFA state is alphabet encoding. In this approach the input characters are mapped to a new alphabet such that input characters which are always treated identically in the DFA are combined into one new character, thus reducing the size of the alphabet [8,9,13,22]. This work is orthogonal to our techniques, and the two can be used together to improve the results.

In [32] we present our current RE matching solution using TCAMs. Here we exploit both inter-state and intra-state transition redundancy to minimize the number of transitions stored per DFA state. There has been work to increase throughput by creating multi-stride DFAs and NFAs that scan multiple characters per transition [9,13]. This work primarily applies to FPGA NFA implementations, since multiple character SRAM based DFAs have only been evaluated for a small number of REs. The ability to increase stride has been limited by the constraint that all transitions must be increased in stride; this leads to excessive memory explosion for strides larger than 2. In [32] we present the technique of variable striding, in which we increase stride selectively on a state by state basis while carefully controlling the increase in required space. Alicherry et al. have explored variable striding for TCAM-based string matching solutions [5], but not for DFAs that apply to arbitrary RE sets. Our techniques in [32] achieve very high transition compression, requiring close to just 1 transition per state. However, that might still not be practical if the number of states grows exponentially with the number of REs.
Some work has attempted to address the state explosion that occurs due to extensive state replication. One approach is to simply partition the REs into groups, building an automaton for each group [7,42,51]. With this approach, at run time, each automaton must process all packet payloads; that is, similar to an NFA, multiple active states must be maintained. The one advantage this approach has compared to an NFA is that the number of active states at any given time is known in advance, so a system can be designed to accommodate the increased bandwidth requirements for processing packet payloads. This approach is usually combined with any of the RE matching techniques when all REs cannot be compiled into a single automaton. Our goal is to conquer state explosion so that such partitioning is not needed. If we cannot fully achieve our goal, our work should at least reduce the number of partitions required. In particular, because our techniques achieve greater compression of DFAs than previous software-based techniques, less partitioning of REs will be required.

A second approach is to use "scratch memory" to manage state replication and avoid state explosion [10,25,41]. However, there are several issues with this approach. First, the size of the required scratch memory may itself be significant. Second, the processing required to update the scratch memory after each transition may be significant. Finally, many of these approaches are not fully automated. For example, as Yang et al. write in [50] about XFA, "... prior work on improved signature representations has required manual analysis of REs (e.g., to identify and eliminate ambiguity [41]) ...".

Liu et al. developed a new method for RE matching that was the first to introduce relative state addressing through the use of offset transitions [28]. In their work, they significantly reduce the number of stored transitions by exploiting state replication and transition sharing without using TCAM. However, they do require the use of bitmaps for each DFA state, which means they still require at least one bit per DFA state, so they ultimately do not address the state explosion problem. The current best approach for coping with state explosion is that of Peng et al. [38], though they do not offer an automata model. We propose a new automata model, the ODFA, which facilitates reasoning about state replication and provides a systematic way of handling it. Some preliminary results indicate that our technique requires significantly fewer TCAM entries than the technique in [38].

Much of the NFA work has exploited the parallel processing capabilities of FPGA technology to cope with the multiple active states that arise from NFAs [7,9,14,15,33,34,40,45]. However, it is not clear that FPGAs can cope with the large number of active states required when processing large signature sets. Furthermore, FPGAs cannot be quickly reconfigured when the RE sets change, and they have relatively slow clock speeds. Also, FPGAs are not commonly embedded in network processors as TCAMs are. One recent work in this direction is that of Yang et al. [50], where they use ordered binary decision diagrams to facilitate updating a set of active states in one operation. This is an intriguing idea that merits further study and comparison with DFA compression approaches.

Chapter 3

Background

In this chapter, we discuss the background material for the research presented in the later chapters.
3.1 DFA for RE Matching

Most RE parsers use some variant of the Deterministic Finite state Automata (DFA) representation of REs. Any set of REs can be converted into an equivalent DFA with the minimum number of states [19,20]. Traditionally, a DFA is defined as a 5-tuple D = (Q, Σ, q0, A, δ), where Q is the set of states, Σ is the alphabet, q0 ∈ Q is the start state, A ⊆ Q is the set of accepting states, and δ : Q × Σ → Q is the transition function.

DFAs have the property of needing a constant number of memory accesses per input symbol, and hence result in predictable and fast throughput. The main problem with DFAs is space explosion: a huge amount of memory is needed to store the transition function δ, which has |Q| × |Σ| entries. Specifically, the number of states can be very large (state explosion), and the number of transitions per state is large (|Σ| = 256). A straightforward approach to implementing DFAs is to store the transition function δ in a two dimensional (|Q| by |Σ|) array. However, |Q| is very large (typically ten thousand or larger) and |Σ| = 2^(8k), where k ≥ 1, for k-stride DFAs that process k 8-bit characters per transition. Thus, although a |Q| by |Σ| array is fast in theory, it is not in practice, because it consumes so much memory (hundreds of megabytes) that it has to be stored in DRAM instead of SRAM, and DRAM is an order of magnitude slower than SRAM.

In a standard DFA, each state is only marked as either accepting or non-accepting. Given the set of REs R, reaching an accepting state only tells us that some RE in R matched, but does not tell us specifically which RE in R matched. However, in DPI applications we must keep track of which REs in R have been matched. For example, each RE may correspond to a unique security threat that requires its own processing routine. This leads us to define the Pattern Matching Deterministic Finite State Automaton (PMDFA). The key difference between a PMDFA and a DFA is that for each state q in a PMDFA, we cannot simply mark it as accepting or rejecting; instead, we must record which REs from R are matched when we reach q.

Definition 1 (Pattern Matching DFA (PMDFA)). Given as input a set of REs R, a PMDFA is a 5-tuple (Q, Σ, q0, M, δ), where the term M is defined as M : Q → 2^R. For each state q in the DFA, M gives the set of REs from R that are matched when we reach q. All the other terms are defined in the same way as in a DFA.

In a PMDFA, there can be many pairs of states that are equivalent except for the sets of REs accepted by the two states. In a DFA, such a pair of states would be merged since they would be completely equivalent. Because of this, the resulting minimum state PMDFA is typically larger than the minimum state DFA. Since we always use a PMDFA, in the rest of this dissertation we use the term DFA to mean a PMDFA.
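The following sketch (our own illustrative layout, not code from the dissertation) shows a PMDFA stored as a |Q| × |Σ| table together with the match function M, and the one-lookup-per-byte scan loop that gives the DFA its deterministic throughput:

```python
SIGMA = 256  # byte alphabet

class PMDFA:
    """Illustrative PMDFA: delta is a |Q| x |Sigma| next-state table and
    match[q] is M(q), the set of RE IDs matched on reaching state q."""

    def __init__(self, num_states, start=0):
        self.delta = [[0] * SIGMA for _ in range(num_states)]
        self.match = [set() for _ in range(num_states)]
        self.start = start

    def scan(self, payload):
        # Exactly one table lookup per input byte; report every RE match.
        s = self.start
        for pos, b in enumerate(payload):
            s = self.delta[s][b]
            for re_id in self.match[s]:
                yield (pos, re_id)
```

The |Q| × 256 table also makes the space problem visible: memory grows linearly with |Q|, and |Q| itself can grow exponentially with the number of REs.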
The source states for these transitions are denoted as \From [x..y]" which represents the set 14 From [0..4] 3/1 From [1..4] a fail 0 a c b 1 2 d 4/2 (a) DFA for RE set f/abc/, /abd/g. From [0..4] From [1..4] a fail d a b 1 c 8/1 9/2 From [6..10] From [1..4] a e a b 6 fail From [5..10] f 10/3 4/2 2 e 5 3/1 d 0 c 7 f From [6..10] (b) DFA for RE set f/abc/, /abd/, /e. f/g. Figure 3.1: Example of DFA and state replication. (For interpretation of the references to color in this and all other gures, the reader is referred to the electronic version of this dissertation.) of states with state IDs in the range [x..y]. For example, we represent four transitions starting in states 1 through 4 that end in state 1 on character `a' using double arrows beneath \From [1..4]" and an `a' next to the double arrow. When the text next to a double arrow is \fail", this represents all character transitions not explicitly shown in the 15 gure. For example, the \fail" transition in Figure 3.1(a) represents all transitions out of state 0 for characters that are not `a', all transitions out of state 1 for characters that are not 'b', and so on. Finally, in an accepting state, the number(s) following the `/' represents σ the ID(s) of the RE matched by that accepting state. We also use the notation s1 − s2 → to denote the transition δ(s1 , σ) = s2 . We de ne a self-looping state as a state which has more than Σ/2 (= 128) of its outgoing transitions going back to itself. Self-looping states are the \failure states" on which the DFA stays when the current input character does not advance the (partial) matching of any of the REs in the RE set. For example in Figure 3.1(b) states 0 and 5 are self-looping states. The transitions in a DFA can be categorized into three types: 1. Failure transitions are those that go to the self-looping states. It indicated that the current input character does not advance (or start) the matching of any RE. In Figure 3.1(a), all the incoming transitions of state 0 are failure transitions. 2. Restartable transitions are those that go to a state at a lower level than the current state, usually a non self-looping state. It indicates that the current partial matches are lost but there is a new partial match of another (possibly the same) RE. In Figure 3.1(b), the incoming transitions of state 5 on character `e' from states [1..4] e are restartable transitions. For instance the transitions 2 − 5 means that we had a → partial match (ab) of REs/abc/ and /abd/ (since the current state is 2), and the current input `e' does not advance the match of either of these REs, but it starts the matching of a new RE /e. f/. 16 3. Forward transitions are the those that go from one state to the next in a chain of states that identify a RE. These transitions advance the current partial match of the RE by one character. In Figure 3.1(b), the outgoing transition of state 0 on characters `a' and `e' are forward transitions. 3.2.1 Transition Sharing We say two transitions are shared when, out of the three values in a transition (source state, input character, destination state), they di er in only one value. Two shared transitions can only possibly di er in either the input character or the source state (since a DFA has only one transition per source state and input character pair). This gives us two causes of transition sharing: character redundancy and state redundancy. Character redundancy is when two shared transitions di er in only the input character value. 
3.2.1 Transition Sharing

We say two transitions are shared when, out of the three values in a transition (source state, input character, destination state), they differ in only one value. Two shared transitions can only differ in either the input character or the source state (since a DFA has only one transition per source state and input character pair). This gives us two causes of transition sharing: character redundancy and state redundancy.

Character redundancy is when two shared transitions differ in only the input character value. That is, for a state q ∈ Q, we often have δ(q, σ1) = δ(q, σ2) for characters σ1 and σ2 in Σ. A DFA has a lot of character redundancy since, for most states, most of their transitions are failure transitions going to the same self-looping state. Only a few transitions of most states are either restartable or forward transitions. In addition, if an RE has a character range (like '[a-z]') in it, then it leads to character redundant forward transitions. For example, in Figure 3.1(a), 254 of the 256 transitions for state 1 go to the same state 0.

State redundancy is when two shared transitions differ in only the source state value. That is, for a character σ ∈ Σ, we have δ(p, σ) = δ(q, σ) for states p and q in Q. The cause of the large amount of state redundancy is failure and restartable transitions, because both of these types of transitions go to the same next state for many different states in the DFA. For example, in Figure 3.1(a), for all the states in the DFA, their failure transitions go to state 0, and their transitions on input character 'a' go to state 1.

3.2.2 State Replication

When an NFA is converted to an equivalent DFA, the number of states typically increases exponentially. This happens because most of the states in the NFA are replicated many times in the DFA. To understand this, consider the DFAs in Figure 3.1. Figure 3.1(a) shows the DFA for the RE set {/abc/, /abd/}, and Figure 3.1(b) shows the DFA after the RE /e.*f/ is added to this RE set. As we can see, the entire DFA in Figure 3.1(a) is repeated twice in the DFA in Figure 3.1(b). Each state is replicated twice because of the wildcard closure '.*' in the new RE that is added. In general, when building the DFA for an RE set where some REs contain '.*'s, the states in the DFAs that correspond to individual REs are replicated multiple times. And when a state is replicated, we automatically get replication of the transitions of that state, causing transition replication.

3.3 The D²FA

The Delayed Input DFA (D²FA) was proposed by Kumar et al. [26] to compress the size of the DFA transition function δ by exploiting state redundancy. The basic idea of the D²FA is that in a typical DFA for a real world RE set, given two states u and v, δ(u, σ) = δ(v, σ) for many symbols σ ∈ Σ. We can remove all the transitions for v from δ for which δ(u, σ) = δ(v, σ) and make a note that v's transitions were removed based on u's transitions. When the D²FA is later processing input and is in state v and encounters input symbol σ, if δ(v, σ) is missing, the D²FA can use δ(u, σ) to determine the next state. We can do the same thing for most states in the DFA, and this results in tremendous transition compression. Kumar et al. observe an average decrease of 97.6% in the amount of memory required to store a D²FA when compared to its corresponding DFA.

In more detail, to build a D²FA from a DFA, we just do the following two steps:

1. For each state u ∈ Q, pick a deferred state, denoted by F(u). (We can have F(u) = u.)

2. For each state u ∈ Q for which F(u) ≠ u, remove all the transitions for u for which δ(u, σ) = δ(F(u), σ).

When traversing the D²FA, if on current state u and current input symbol σ the transition δ(u, σ) is missing (i.e. has been removed), we can use δ(F(u), σ) to get the next state. Of course, δ(F(u), σ) might be missing too, in which case we then use δ(F(F(u)), σ) to get the next state, and so on.
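The two construction steps and the deferred lookup translate directly into code. The sketch below is ours (a dict stands in for the partial transition function), not the dissertation's implementation:

```python
def build_rho(delta, F):
    """Step 2 above: keep a transition of u only when u is a root
    (F[u] == u) or the transition differs from the deferred state's."""
    rho = {}
    for u, row in enumerate(delta):
        for c, t in enumerate(row):
            if F[u] == u or t != delta[F[u]][c]:
                rho[(u, c)] = t
    return rho

def next_state(rho, F, u, c):
    # Deferred lookup: follow deferment pointers until a stored transition
    # is found. The input character is not consumed while deferring (hence
    # "delayed input"). Root states keep all transitions, so this halts.
    while (u, c) not in rho:
        u = F[u]
    return rho[(u, c)]
```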
Figure 3.2(a) shows a DFA for the RE set {/.*a.*bcb/, /.*c.*bcb/}, and Figure 3.2(c) shows the D²FA built from the DFA. The dashed lines represent deferred states. The DFA has 13 × 256 = 3328 transitions, whereas the D²FA only has 1030 actual transitions and 9 deferred transitions.

[Figure 3.2: D²FA example. (a) DFA for RE set {/.*a.*bcb/, /.*c.*bcb/}. (b) SRG for the DFA; edges with weight ≤ 1 are not shown, and unlabeled edges have weight 255. (c) The corresponding D²FA; dashed edges represent deferment.]

3.3.1 D²FA Definition

We formally define a D²FA and introduce some notation here.

Definition 2 (D²FA). Let D = (Q, Σ, q0, M, δ) be a DFA. A corresponding D²FA is defined as a 6-tuple D′ = (Q, Σ, q0, M, ρ, F). The first four terms here are defined the same way as in the DFA. The function F : Q → Q defines a unique deferred state for each state in Q, and ρ : Q × Σ → Q is a partially defined transition function. Together, the deferment function F and the partial transition function ρ are equivalent to the DFA transition function δ.

We use dom(ρ) to denote the domain of ρ, i.e. the values for which ρ is defined. The key property of the D²FA D′ that corresponds to DFA D is as follows:

∀(s, σ) ∈ Q × Σ : (s, σ) ∈ dom(ρ) ⟺ (F(s) = s ∨ δ(s, σ) ≠ δ(F(s), σ))

That is, for each state, ρ only has those transitions that differ from those of its deferred state in the underlying DFA. When defined, ρ(s, σ) = δ(s, σ).

The function F defines a directed graph on the states of Q, which we call the deferment forest. A D²FA is well defined if and only if there are no cycles of length > 1 in the deferment forest (i.e. there are no cycles except self-loops). The total transition function δ′ for the D²FA (derived from ρ) is defined as

δ′(s, σ) = ρ(s, σ) if (s, σ) ∈ dom(ρ), and δ′(s, σ) = δ′(F(s), σ) otherwise.

It is easy to see that δ′ is well defined and equal to δ if the D²FA is well defined. We need the restriction that the deferment forest cannot have a cycle other than a self-loop because otherwise all states on a cycle might have their transitions on some σ ∈ Σ removed, and there would be no way of finding the next state.

We also use the term deferment pointer to refer to the deferred state of a state. That is, if F(u) = v ∧ u ≠ v, we say the deferment pointer of state u is set to state v. If F(u) = u, we say the deferment pointer for state u is not set. States that defer to themselves (i.e. whose deferment pointer is not set), which we call root states, must have all their transitions defined. Each connected component of the deferment forest is called a deferment tree. It is easy to see that each deferment tree has exactly one root state in it, and the deferment pointers of all the other states in the deferment tree are set towards the root state.

We use u → v to denote F(u) = v, i.e. u directly defers to v. In this case, we say state u is a child of state v, and state v is the parent of state u, in the deferment forest. We use u ⇝ v to denote that there is a path from u to v in the deferment forest defined by F. In this case we say state u is a descendant of state v, and state v is an ancestor of state u, in the deferment forest.
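Checking the well-definedness condition is a simple pointer chase; the helper below (ours) rejects any deferment cycle of length greater than 1:

```python
def is_well_defined(F):
    # F[u] == u marks a root. From every state, following deferment
    # pointers must reach a root without revisiting any state.
    for start in range(len(F)):
        seen = set()
        u = start
        while F[u] != u:
            if u in seen:
                return False   # cycle of length > 1: lookup could loop forever
            seen.add(u)
            u = F[u]
    return True

assert is_well_defined([0, 0, 1])    # 2 -> 1 -> 0 (root): a deferment tree
assert not is_well_defined([1, 0])   # 0 <-> 1 is an illegal cycle
```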
The deferment depth of state u, denoted ψ(u), is the distance, in the deferment tree containing u, of state u from the root state of that deferment tree. The (maximum) deferment depth of a D²FA D′, denoted Ψ(D′), is the maximum deferment depth among all the states in D′. We use ψ(D′) to denote the average deferment depth among all the states in D′. We use u ∩ v to denote the number of transitions in common between states u and v; i.e. u ∩ v = |{σ | σ ∈ Σ ∧ δ(u, σ) = δ(v, σ)}|.

We only consider D²FAs that correspond to minimum state DFAs, though the definition applies to all DFAs.

3.3.2 Original D²FA Algorithm

In this section we explain the original D²FA construction algorithm proposed by Kumar et al. [26]. They first build a DFA for the given RE set. The amount of transition compression achieved by the D²FA depends on the number of common transitions between each (non-root) state and its deferred state. So next, in order to maximize transition compression, they essentially solve a maximum weight spanning tree problem on the following weighted graph, which they call a Space Reduction Graph (SRG). The SRG is a complete graph with the DFA states, Q, as its vertices. The weight of any edge (u, v) in the SRG is equal to the number of common transitions between DFA states u and v. They use Kruskal's algorithm [23] to construct the maximum weight spanning tree. Edges with weight ≤ 1 are not considered (selecting an edge with weight 1 does not reduce the transition function, since it would result in the removal of one actual transition and the addition of the deferment pointer transition). For this reason the maximum weight spanning tree construction might result in a forest. Once the spanning forest is constructed, (one of) the state(s) in the center of each tree is selected as the root for that tree, and all edges are directed towards the root. These directed edges give the deferred state for each state.
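A straightforward rendering of this construction (ours; the diameter-aware tie breaking and center-based root selection described below are omitted for brevity) makes its cost visible: building the SRG alone examines all Θ(|Q|²) state pairs, which is one of the scalability problems addressed in Chapter 4.

```python
def srg_edges(delta):
    # SRG edge weight = number of transitions two states have in common;
    # edges of weight <= 1 cannot reduce the transition table, so skip them.
    n, edges = len(delta), []
    for u in range(n):
        for v in range(u + 1, n):
            w = sum(1 for c in range(256) if delta[u][c] == delta[v][c])
            if w > 1:
                edges.append((w, u, v))
    edges.sort(reverse=True)            # Kruskal: heaviest edges first
    return edges

def max_spanning_forest(n, edges):
    parent = list(range(n))             # union-find over DFA states
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    forest = []
    for w, u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:                    # adding (u, v) creates no cycle
            parent[ru] = rv
            forest.append((u, v))
    return forest                       # undirected; roots are chosen afterwards
```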
propose the following method to generate D2 FA with deferment depth within the bound: during the maximum weighted spanning tree construction, an edge is only added to the spanning tree if it does not cause the tree diameter to go over 2 × Ω. Since the tree center is chosen as the root state, this guarantees that Ψ(D ) ≤ Ω. 3.3.4 Backpointer D2 FA Algorithm The level of a state u in a DFA is the length of the shortest string that takes the DFA from the start state to state u. Becchi and Crowley [8] propose an algorithm to build the D2 FA based on the following idea: each state in the DFA should defer to a state that is at a lower level than itself. Because of this, every deferred transition followed will decrease the level of the current state by at least 1. Any actual transition taken can only increase the level of the current state by 1. Therefore, when processing any input string of length n, at most n − 1 deferred transitions will be followed. So this method guarantees an amortized cost of at most 2 lookups per input character. To build the D2 FA, they build the DFA for the given RE set rst. Next, for each state u, among all the states at a lower level than u, they set F(u) to be the state which shares the most transitions with u. Since each state defers to a state at a lower level than itself, the deferment forest can never have a cycle, so the D2 FA is well de ned. The resulting D2 FA is typically a bit larger in size than the D2 FA built using the algorithm proposed by Kumar et al.. 26 3.4 Classifiers In this section we de ne a classi er, related terminology and describe a classi er minimization problem. A classi er is essentially a mapping function from the source domain to the target domain. In a d-dimensional classi er, the input value is composed of d elds. A classi er is traditionally de ned for the (header based) packet classi cation problem. The input value is the packet header, which has ve elds: Protocol type, Source IP address, Source port number, Destination IP address and Destination port number. The output is the decision or action to be taken for the packet, which typically has values like accept, discard, accept and log, discard and log etc.. So the classi er is de ned as a 5-dimensional classi er, with the set of possible packet headers as the source domain, and set of possible actions as the target domain. For each possible packet header, the classi er gives the action to be taken. 3.4.1 Classifier definition We now formally de ne a d-dimensional classi er and related terminology. A eld Fi is a nite width variable. The domain of eld Fi of w bit width is dom(Fi ) = [0..2w − 1]. The domain of a d-dimensional classi er, f, de ned over the d elds F1 , . . . , Fd is dom(f) = dom(F1 ) × · · · × dom(Fd ). A packet is a d-tuple (p1 , . . . , pd ), where, for 1 ≤ i ≤ d, pi ∈ dom(Fi ). A rule has the form predicate → decision . A rule predicate is a d-tuple (S1 , . . . , Sd ), 27 where, for 1 ≤ i ≤ d, Si ⊆ dom(Fi ); and it covers the set of packets S1 ×· · ·×Sd ⊆ dom(f). A packet p matches rule r if and only if the predicate of r covers p. The set of possible rule decisions is denoted by H. The classi er f = r1 , . . . , rn is speci ed as a sequence of rules. For packet p, the rst rule in the sequence that p matches is said to be the binding rule for p. If p does not match any rule in f, then p does not have any binding rule (or is unbound). For a bound packet p, the output of the classi er, f(p), is given by the decision of the binding rule for p. 
For unbound packets, p, f(p) is unde ned. The cost of a classi er f, denoted Cost(f), is the number of rules in f. The Cover of a classi er f, denoted Cover(f), is de ned as the set of packets in dom(f) that have a binding rule in f (i.e. set of packets that match at least one rule in f.) A classi er f, is said to be a complete classi er if Cover(f) = dom(f), otherwise f is said to be an incomplete classi er. Clearly, two rules in a classi er can be overlapping (i.e. at least one packet matches both rules), as well as con icting (i.e. overlapping and having di erent decisions). But that is ok, since the classi er output for a bound packet is uniquely de ned by its binding rule. 3.4.1.1 Prefix Classifier A pre x {0, 1}k {∗}w−k with k leading bits (i.e. 0s or 1s), for a eld of width w, denotes the range of values [{0, 1}k {0}w−k , {0, 1}k {1}w−k ]. A rule is said to be a pre x rule if and 28 only if every Si in the rule predicate (S1 , . . . , Sd ) is represented as a pre x. A classi er f is said to be a pre x classi er if and only if every rule in f is a pre x rule. 3.4.1.2 Ternary Classifier A ternary value for a eld of width w is of the form {0, 1, ∗}w , and denotes the set of values obtained by replacing the ∗'s with 0's and 1's in all possible combinations (if there are k ∗'s, there are 2k ways to replace the ∗'s with 0's and 1's.) A rule is said to be a ternary rule if and only if every Si in the rule predicate (S1 , . . . , Sd ) is represented as a ternary value. A classi er f is said to be a ternary classi er if and only if every rule in f is a ternary rule. A pre x classi er is a special case of a ternary classi er, since every pre x is also a ternary value. 3.4.1.3 Weighted Classifier In a weighted classi er, each decision in H has a weight associated with it. The cost of a classi er f is then equal to the sum of the weights of decisions of all the rules in f. The unweighted classi er is a special case of weighted classi er with weights of all the decisions set to 1. 29 3.4.2 Classifier Minimization Two classi ers f1 and f2 are equivalent, denoted f1 ≡ f2 , if and only if Cover(f1 ) = Cover(f2 ) and ∀p ∈ Cover(f1 ), f1 (p) = f2 (p). For a classi er f, we use {f} to denote the set of all classi ers that are equivalent to f. The classi er minimization problem is then de ned as follows. Definition 3 (Classi er Minimization Problem). Given a classi er f1 , nd a pre x classi er f2 ∈ {f1 } such that for any pre x classi er f ∈ {f1 }, Cost(f2 ) ≤ Cost(f). Multi-dimensional classi er minimization has been shown to be NP-hard. An optimal solution for 1-dimensional complete classi er minimization was proposed by Suri et al. [47]. Meiners et al. [30, 31] proposed algorithms for 1-dimensional complete weighted classi er minimization and 1-dimensional incomplete weighted classi er minimization. 3.5 TCAM Introduction In any regular memory, the input is the memory address location, and the output is the contents of the memory at that location. In a Ternary Content Addressable Memory (TCAM), as the name suggests, it is the exact opposite. The input to a TCAM is binary value, and the output of the TCAM is the address of the location, if any, at which the given value occurs. The ternary refers to the fact that the contents of the memory are ternary bits, i.e. 0, 1 or ∗ (don't care). The ∗ matches both a 0 and a 1. 30 If more than one location matches the given (binary) value, then the address of the rst location that matches the value is returned. We call this the rst match semantics of TCAM. 
A defining feature of TCAMs is that a lookup completes in constant time. A TCAM contains massively parallel hardware that compares the given input against all stored entries at once and returns the address of the first match. For this reason, TCAM chips have very limited capacity: the largest available chip is about 72 Mb, and typical sizes range from 1 Mb to 8 Mb. TCAM chips also consume much more energy than regular memory. A TCAM chip is usually paired with a corresponding SRAM that stores output values; the matching address from the TCAM is used as the input to the SRAM to retrieve the output value.

TCAM chips are widely used in networking devices for packet classification. A ternary classifier for packet classification can be naturally implemented in a TCAM: all the rule predicates are stored, in order, in the TCAM, and the corresponding rule decisions are stored in the SRAM. The packet header is then used as the lookup key for the TCAM, and the matching SRAM value gives the decision for the packet.

Chapter 4

Software Implementation

In this chapter we present our work on the software implementation of RE matching. A software solution typically uses a DFA to achieve deterministic throughput. The software solution can be implemented on general purpose processors or on customized ASIC chips.

4.1 Introduction/Motivation

The straightforward way to implement a DFA in software is to store the DFA transition table δ in a two-dimensional Q × Σ array. But DFAs suffer from space explosion when multiple REs are combined, making them impractical even for moderately sized RE sets. D2FAs are very effective at dealing with the space explosion problem of the DFA. In particular, D2FAs exhibit tremendous transition compression, reducing the size of the DFA by a huge factor. This makes D2FAs much more practical for a software implementation of RE matching than DFAs. In our work we focus on the D2FA.

4.1.1 Solution Goals

For a software implementation of RE matching, given as input a set of REs R, we need to be able to build a compact D2FA as efficiently as possible that also supports frequent updates. Efficiency is important because RE matching solutions are typically implemented in networking devices, which usually have very limited computing resources. Current methods for constructing D2FAs may be so expensive in both time and space that they may not be able to construct the final D2FA even if that D2FA is small enough to be deployed in networking devices with limited computing resources. Such issues become doubly important when we consider the frequent updates (typically additions) to R that occur as new security threats are identified.

4.1.2 Summary and Limitations of Prior Art

Given the input RE set R, any solution that builds a D2FA for R has to perform the following two operations: (a) union the automata corresponding to each RE in R, and (b) minimize the automata, both in terms of the number of states and the number of edges. Previous solutions [8, 26] (discussed in Section 3.3) employ a "Union then Minimize" framework in which: (1) they first build automata for each RE within R and perform union operations on these automata to arrive at one combined automaton for all the REs in R, and (2) they then minimize the resulting combined automaton. In particular, previous solutions first construct the combined NFA for the RE set.
Then they perform a computationally expensive NFA to DFA subset construction on the large combined NFA, followed by, or composed with, DFA minimization (for states). Last, they perform the D2FA minimization (for edges).

There are three fundamental limitations of prior solutions, due to which they do not meet our goals. First, they perform the minimization on the large combined automaton, which is expensive in both time and space. Second, prior methods build the corresponding minimum state DFA before constructing the final D2FA. This is very costly in both space and time. The D2FA is typically 50 to 100 times smaller than the DFA, so even if the D2FA would fit in available memory, the intermediate DFA might be too large, making it impractical to build the D2FA. This is exacerbated in the case of the Kumar et al. algorithm, which needs the SRG; the SRG ranges from about the size of the DFA itself to over 50 times the size of the DFA. The resulting space and time required to build the DFA and SRG impose serious limits on the D2FAs that can be practically constructed. We do observe that the method proposed in [8] does not need to create the SRG. Furthermore, as the authors have noted, there is a way to go from the NFA directly to the D2FA, but implementing such an approach is still very costly in time because many transition tables need to be repeatedly recreated in order to realize these space savings. In addition, this direct NFA to D2FA construction would still need to perform the expensive subset construction on the large combined NFA. Third, none of the previous methods support updating the D2FA when a new RE is added to R; the whole D2FA has to be rebuilt whenever the RE set is updated.

4.1.3 Summary of Our Approach

To address the limitations of prior solutions, we propose a "Minimize then Union" framework. Specifically, we first minimize the small automata corresponding to each RE from R, and then union the minimized automata together. In particular, given R, we first build a DFA and D2FA for each individual RE in R. The heart of our technique is the D2FA merge algorithm that performs the union: it merges two smaller D2FAs into one larger D2FA such that the merged D2FA is equivalent to the union of the RE sets that the two input D2FAs were equivalent to. Starting from the initial D2FAs for each RE, we use this D2FA merge subroutine to merge two D2FAs at a time until we are left with just one final D2FA. The initial D2FAs are each equivalent to their respective REs, so the final D2FA is equivalent to the union of all the REs in R.

A key property of our D2FA merge algorithm is that it automatically produces a minimum state D2FA without explicit state minimization. Likewise, it creates efficient state deferment in the merged D2FA using state deferment information from the input D2FAs. Together, these optimizations lead to a vastly more efficient D2FA construction algorithm in both time and space. The D2FA produced by our merge algorithm can be larger than the minimal D2FA produced by the Kumar et al. algorithm. This is because the Kumar et al. algorithm does a global optimization over the whole DFA (using the SRG), whereas our merge algorithm efficiently computes state deferment in the merged D2FA based on state deferment in the two input D2FAs. In most cases, the D2FA produced by our approach is sufficiently small to be deployed.
However, in situations where more compression is needed, we offer an efficient final compression algorithm that produces a D2FA very similar in size to that produced by the Kumar et al. algorithm. This final compression algorithm uses an SRG; we improve efficiency by using the deferment already computed in the merged D2FA to greatly reduce the size of this SRG and thus significantly reduce the time and memory required to do this compression.

4.1.3.1 Advantages of our algorithm

One of the main advantages of our algorithm is a dramatic increase in time and space efficiency. These efficiency gains are partly due to our use of the Minimize then Union framework instead of the Union then Minimize framework. More specifically, our improved efficiency comes from the following four factors. First, other than for the initial DFAs that correspond to individual REs in R, we build D2FAs bypassing DFAs. The initial DFAs are very small (typically < 50 states), so the memory and time required to build the initial DFAs and D2FAs is negligible. The D2FA merge algorithm directly merges the two input D2FAs to get the output D2FA without creating the DFA first. Second, other than for the initial DFAs, we never have to perform the NFA to DFA subset construction. Third, other than for the initial DFAs, we never have to perform DFA state minimization. Fourth, when setting deferred states in the D2FA merge algorithm, we use deferment information from the two input D2FAs. This typically involves performing only a constant number of comparisons per state rather than a number of comparisons per state that is linear in the number of states, as required by previous techniques. All told, our algorithm has a practical time complexity of O(n|Σ|), where n is the number of states in the final D2FA and |Σ| is the size of the input alphabet. In contrast, Kumar et al.'s algorithm [26] has a time complexity of O(n^2(log(n) + |Σ|)), and Becchi and Crowley's algorithm [8] has a time complexity of O(n^2|Σ|) just for setting the deferment state for each state, ignoring the cost of the NFA subset construction and DFA state minimization. Section 4.4.4 has a more detailed complexity analysis. Because of these advantages in time and space complexity, given the same limited resources, our algorithm can build much larger D2FAs than are possible with previous methods.

Besides being much more efficient at constructing a D2FA from scratch, our algorithm is very well suited for frequent RE updates. When an RE needs to be added to the current set, we just need to merge the D2FA for the new RE into the current D2FA using our merge routine, which is a very fast operation.

4.2 Minimum State PMDFA construction

Before we present our algorithm for efficient D2FA construction, we consider the problem of constructing a minimum state DFA for a given RE set. Given a set of REs R, we can build the corresponding minimum state DFA using the standard Union then Minimize framework: first build a combined NFA for all the REs in R, then convert the NFA to a DFA, and finally minimize the DFA. This method can be very slow, mainly due to the subset construction in the NFA to DFA conversion, which often results in an exponential growth in the number of states. Instead, we propose a more efficient Minimize then Union framework. Let R1 and R2 denote any two disjoint subsets of R, and let D1 and D2 be their corresponding minimum state DFAs. We use the standard union cross product construction for DFAs to construct a minimum state DFA D3 that corresponds to R3 = R1 ∪ R2.
Specifically, suppose we are given the two DFAs D1 = (Q1, Σ, q01, M1, δ1) and D2 = (Q2, Σ, q02, M2, δ2). The union cross product DFA of D1 and D2, denoted UCP(D1, D2), is given by D3 = UCP(D1, D2) = (Q3, Σ, q03, M3, δ3) where

Q3 = Q1 × Q2
q03 = ⟨q01, q02⟩
∀qi ∈ Q1, ∀qj ∈ Q2: M3(⟨qi, qj⟩) = M1(qi) ∪ M2(qj)
∀σ ∈ Σ, ∀qi ∈ Q1, ∀qj ∈ Q2: δ3(⟨qi, qj⟩, σ) = ⟨δ1(qi, σ), δ2(qj, σ)⟩

Each state in D3 corresponds to a pair of states, one from D1 and one from D2. For notational clarity, we use ⟨ and ⟩ to enclose an ordered pair of states. The transition function δ3 simply simulates both δ1 and δ2 in parallel. Many states in Q3 might not be reachable from the start state q03; thus, while constructing D3, we only create states that are reachable from q03.

We now argue that this construction is correct. This is a standard construction, so the fact that D3 is a DFA for R3 = R1 ∪ R2 is straightforward and covered in standard automata theory textbooks (e.g. [20]). We now show that D3 is also a minimum state DFA for R3 assuming R1 ∩ R2 = ∅. Recall that we are using DFA to mean a PMDFA (see Section 3.1). For a traditionally defined DFA, the UCP construction is not guaranteed to produce a minimum state DFA.

Theorem 1. Given two RE sets, R1 and R2, and equivalent minimum state DFAs, D1 and D2, the union cross product DFA D3 = UCP(D1, D2), with only reachable states constructed, is the minimum state DFA equivalent to R3 = R1 ∪ R2 if R1 ∩ R2 = ∅.

Proof. First, since only reachable states are constructed, D3 cannot be trivially reduced. Now assume D3 is not minimum. That would mean there are two distinct states in D3, say ⟨p1, p2⟩ and ⟨q1, q2⟩, that are indistinguishable. This implies that

∀x ∈ Σ*, M3(δ3(⟨p1, p2⟩, x)) = M3(δ3(⟨q1, q2⟩, x)).

Working on both sides of this equality, we get

∀x ∈ Σ*, M3(δ3(⟨p1, p2⟩, x)) = M3(⟨δ1(p1, x), δ2(p2, x)⟩) = M1(δ1(p1, x)) ∪ M2(δ2(p2, x))

as well as

∀x ∈ Σ*, M3(δ3(⟨q1, q2⟩, x)) = M3(⟨δ1(q1, x), δ2(q2, x)⟩) = M1(δ1(q1, x)) ∪ M2(δ2(q2, x)).

This implies that

∀x ∈ Σ*, M1(δ1(p1, x)) ∪ M2(δ2(p2, x)) = M1(δ1(q1, x)) ∪ M2(δ2(q2, x)).

Now, since R1 ∩ R2 = ∅, this gives us

∀x ∈ Σ*, M1(δ1(p1, x)) = M1(δ1(q1, x)) and ∀x ∈ Σ*, M2(δ2(p2, x)) = M2(δ2(q2, x)).

This implies that p1 and q1 are indistinguishable in D1 and that p2 and q2 are indistinguishable in D2. Since ⟨p1, p2⟩ ≠ ⟨q1, q2⟩, we have p1 ≠ q1 ∨ p2 ≠ q2, implying that at least one of D1 or D2 is not a minimum state DFA, which is a contradiction, and the result follows.

Our efficient construction algorithm works as follows. First, for each RE r ∈ R, we build an equivalent minimum state DFA D for r using the standard method, resulting in a set of DFAs D. Then we repeatedly merge two DFAs from D at a time using the above UCP construction until there is just one DFA left in D. The merging is done in a greedy manner: in each step, the two DFAs with the fewest states are merged together. Note that the condition R1 ∩ R2 = ∅ is always satisfied in every merge, so Theorem 1 ensures that we always have a minimized DFA.

In our experiments, our Minimize then Union technique runs exponentially faster than the standard Union then Minimize technique because we only apply the NFA to DFA subset construction to the NFAs that correspond to each individual RE rather than to the combined NFA for all the REs. This makes a significant difference even when we have a relatively small number of REs. For example, for the C7 RE set, which contains 7 REs, the standard technique requires 385.5 seconds to build the DFA, but our technique builds the DFA in only 0.66 seconds.
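The following Python sketch illustrates the UCP construction with only reachable states created. It is a minimal sketch under an encoding we assume for illustration: a DFA is given by a nested dict delta (delta[state][char] yields the next state), a map M from states to sets of matched RE ids, and a start state.

from collections import deque

def ucp(delta1, M1, q01, delta2, M2, q02, alphabet):
    # Union cross product of two DFAs, building only states reachable
    # from the start pair <q01, q02>.
    q03 = (q01, q02)
    delta3, M3 = {}, {}
    queue = deque([q03])
    while queue:
        state = queue.popleft()
        if state in delta3:          # already constructed
            continue
        q1, q2 = state
        M3[state] = M1[q1] | M2[q2]  # match set is the union of match sets
        delta3[state] = {}
        for c in alphabet:
            nxt = (delta1[q1][c], delta2[q2][c])  # simulate both DFAs in parallel
            delta3[state][c] = nxt
            if nxt not in delta3:
                queue.append(nxt)
    return delta3, M3, q03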
4.3 Efficient D2FA Construction

In this section, we describe how we extend the Minimize then Union technique to directly build the D2FA, bypassing DFA construction. We first build the D2FA for each individual RE in the RE set, and then merge these D2FAs together to get the combined D2FA for the entire RE set.

4.3.1 Improved D2FA Construction for One RE

To build the initial D2FA for each RE in R, we can use the original D2FA algorithm proposed in [26]. However, we propose several improvements to the original algorithm that facilitate our D2FA merge algorithm, our techniques for hardware implementation of RE matching presented in Chapter 5, and the overlay automata approach presented in Chapter 6.

[Figure 4.1: Edge weight distribution in a typical SRG (count, on a log scale, versus edge weight).]

Figure 4.1 shows the typical distribution of the weights of the edges in the SRG. The distribution is typically bimodal: edge weights are either very high (> 128) or very low (< 20). The reason is that, for all state pairs in which both states have their failure transitions going to the same self-looping state, the two states have most of their transitions in common, and hence produce a very high weight edge in the SRG. Likewise, for all state pairs in which the two states have their failure transitions going to different self-looping states, the states have none (or very few) of their transitions in common, and hence produce a very low weight edge in the SRG. If we remove the low weight edges from the SRG, we get a natural partitioning of the states based on the self-looping state they fail to. Let us call this partitioning of states P. Each partition in P has at most one self-looping state.

Multiple deferment trees: We remove the low weight (< 20) edges from the SRG before building the maximum weight spanning tree. As a result, the deferment forest has multiple deferment trees, one tree for each partition in P. This causes only a small increase in the number of transitions in the resulting D2FA, since the edges removed from the SRG have very low weight. For each partition in P, the unique self-looping state (if any) within the partition is chosen as the root of the corresponding deferment tree.

Handling non-self-looping roots: A partition in P may not contain any self-looping state; in such cases a non-self-looping state is selected as the root for the partition. This happens for REs that have a `.' (or a large range like [^a]) without the closure `*'. For example, consider the D2FA shown in Figure 4.2(a) for the RE /a.*b..c/. The deferment forest has 4 root states: 0, 1, 2 and 3. States 0 and 1 are self-looping. However, states 2 and 3 are not self-looping, and are only root states because they have no transitions in common with other states. In such cases, we make these states non-root states and set their deferment as follows. We look at the deferment of the next state that the transition on the `.' goes to. If there is more than one consecutive `.', we use the state that the last `.' transitions to. In our example, the next state of the last `.' is state 4.
We follow the deferment chain of this state until we reach its root, and select that root as the deferred state of the non-self-looping roots. In our example, the deferment chain of state 4 ends in state 1, so state 1 is chosen as the deferred state for both states 2 and 3. Figure 4.2(b) shows the resulting D2FA.

[Figure 4.2: Example showing a D2FA with non-self-looping root states: (a) the D2FA for RE /a.*b..c/ with non-self-looping roots; (b) the D2FA after setting the deferment for the non-self-looping roots.]

Setting the deferment of non-self-looping roots in this manner does not reduce the size of the D2FA, since these states have no transitions (or very few transitions) in common with their deferred states. However, it yields a better structure for the deferment forest, and it ensures the condition that all root states are self-looping states and vice versa.

Improved edge weight tie breaking: Recall that during the construction of the maximum weight spanning tree using Kruskal's algorithm, at any time there are usually many edges with the current maximum weight. We use the following tie breaking strategy (made concrete in the sketch at the end of this subsection). For each state u, we store a value deg*(u), which is initially set to 0. During Kruskal's algorithm, when an edge e = (u, v) is added to the current spanning tree, deg*(u) is incremented by 2 if level(u) ≤ level(v); otherwise it is incremented by 1. Recall that level(u) is the length of the shortest string that takes the DFA from the start state to state u. We update deg*(v) similarly. We then use the following tie breaking order among edges having the current maximum weight.

1. Edges that have a self-looping state as one of their endpoints are given the highest priority.

2. Next, priority is given to edges with a higher sum of the deg* values of their endpoints.

3. Next, priority is given to edges with a higher difference between the levels of their endpoints.

The sum of the deg* values of the endpoints is used for tie breaking in order to prioritize states that are already highly connected. However, we also want to prioritize connecting to states at lower levels, so we use deg* instead of just the degree. Using the difference between the levels of the endpoints for tie breaking also prioritizes states at a lower level. This helps reduce the deferment depth and the D2FA size for RE sets whose D2FAs have a higher average deferment depth.

These improvements have several benefits.

1. Having the self-looping states in the center helps minimize the average height of the deferment trees. Also, prioritizing edges with well connected endpoints increases the fanout, which again reduces tree height. The result is a D2FA that has a much lower deferment depth.

2. The state partitioning P identifies a natural partitioning of states such that all replications of one NFA state are in different partitions. So typically all partitions in P have sizes close to each other, and because of our tie breaking strategy, all the deferment trees have very similar structure. This property improves the effectiveness of our D2FA merge algorithm explained in the next section, and of our table consolidation technique explained in Section 5.3.

3. Having self-looping states as roots improves the effectiveness of our variable striding technique, which we describe in Section 5.4. And the condition that all root states are self-looping states and vice versa is needed for our overlay automata approach described in Chapter 6.
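The following Python sketch makes the tie breaking order concrete. It is our own illustration under assumed inputs: precomputed level and deg* maps, and a set of self-looping states; edges are processed in decreasing order of this key.

def edge_key(edge, weight, deg_star, level, self_looping):
    # Priority key for an SRG edge during Kruskal's algorithm: weight
    # first, then the three tie-breaking rules of this subsection.
    u, v = edge
    return (
        weight,                                      # primary: SRG edge weight
        (u in self_looping) or (v in self_looping),  # rule 1: self-looping endpoint
        deg_star[u] + deg_star[v],                   # rule 2: sum of deg* values
        abs(level[u] - level[v]),                    # rule 3: level difference
    )

def update_deg_star(deg_star, level, u, v):
    # Called when edge (u, v) is added to the spanning forest.
    deg_star[u] += 2 if level[u] <= level[v] else 1
    deg_star[v] += 2 if level[v] <= level[u] else 1

Sorting the current maximum weight edges by edge_key in decreasing order yields the processing order described above.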
4.3.2 D2FA Merge Algorithm

The UCP construction merges two DFAs together. We extend the UCP construction to merge two D2FAs together as follows. To build a D2FA from a DFA, we basically just need to set the deferment pointer F(u) for each state u. During the UCP construction, as each new state u is created, we define F(u) at the same time. We then define ρ to include only those transitions of u that differ from the transitions of F(u).

To help explain our algorithm, Figure 4.3 shows an example execution of the D2FA merge algorithm. Figures 4.3(a) and 4.3(b) show the D2FAs for the REs /.*a.*bcb/ and /.*c.*bcb/, and Figure 4.3(c) shows the merged D2FA. We use the following conventions when depicting a D2FA. Dashed lines point to the deferred state of a given state. For each state in the merged D2FA, the pair of numbers above the line refers to the states in the original D2FAs that correspond to the state in the merged D2FA, and the number below the line is the state ID in the merged D2FA. The number(s) after the `/' in an accepting state give the id(s) of the pattern(s) matched. Figure 4.3(d) shows how the deferred state is set for a few states in the merged D2FA D3. We explain the notation in this figure as we give our algorithm description.

[Figure 4.3: D2FA merge example: (a) D1, the D2FA for RE /.*a.*bcb/; (b) D2, the D2FA for RE /.*c.*bcb/; (c) D3, the merged D2FA; (d) illustration of setting the deferment for some states in D3.]

For each state u ∈ D3, we set the deferred state F(u) as follows. While merging D2FAs D1 and D2, let state u = ⟨p0, q0⟩ be the new state currently being added to the merged D2FA D3. Let p0 → p1 → · · · → pl be the maximal deferment chain DC1 in D1 starting at p0 (i.e. pl defers to itself), and let q0 → q1 → · · · → qm be the maximal deferment chain DC2 in D2 starting at q0. For example, in Figure 4.3(d), we see the maximal deferment chains for u = 5 = ⟨0, 2⟩, u = 7 = ⟨2, 2⟩, u = 9 = ⟨4, 2⟩, and u = 12 = ⟨4, 4⟩. For u = 9 = ⟨4, 2⟩, the top row is the deferment chain of state 4 in D1 and the bottom row is the deferment chain of state 2 in D2. We will choose some state ⟨pi, qj⟩ with 0 ≤ i ≤ l and 0 ≤ j ≤ m to be F(u). In Figure 4.3(d), we represent these candidate F(u) pairs with edges between the nodes of the deferment chains. For each candidate pair, the number on the top is the corresponding state number in D3 and the number on the bottom is the number of transitions in D3 that the pair has in common with state u. For example, for u = 9 = ⟨4, 2⟩, the two candidate pairs represented are state 7 (⟨2, 2⟩), which shares 256 transitions with state 9, and state 4 (⟨1, 1⟩), which shares 255 transitions with state 9. Note that a candidate state pair is considered only if it is reachable in D3. In Figure 4.3(d) with u = 9 = ⟨4, 2⟩, three of the candidate pairs, corresponding to ⟨4, 1⟩, ⟨2, 1⟩, and ⟨1, 2⟩, are not reachable, so no edge is included for these candidate pairs. Ideally, we want i and j to be as small as possible, though not both 0. For example, our best choices are typically ⟨p0, q1⟩ or ⟨p1, q0⟩.
In the first case, ⟨p0, q1⟩ agrees with u = ⟨p0, q0⟩ in the first component, and we already have the deferment q0 → q1 in D2; in the second case, ⟨p1, q0⟩ agrees with u in the second component, and we already have the deferment p0 → p1 in D1. In Figure 4.3(d), we set F(u) to ⟨p0, q1⟩ for u = 5 = ⟨0, 2⟩ and u = 12 = ⟨4, 4⟩, and we use ⟨p1, q0⟩ for u = 9 = ⟨4, 2⟩. However, it is possible that both of these states are not reachable from the start state in D3. This leads us to consider other possible pairs ⟨pi, qj⟩. For example, in Figure 4.3(d), both ⟨2, 1⟩ and ⟨1, 2⟩ are not reachable in D3, so we use the reachable state ⟨1, 1⟩ as F(u) for u = 7 = ⟨2, 2⟩.

We consider a few different algorithms for choosing ⟨pi, qj⟩. The first algorithm, which we call the first match method, finds a pair of states (pi, qj) for which ⟨pi, qj⟩ ∈ Q3 and i + j is minimum. Stated another way, we find the minimum z ≥ 1 such that the set of states Z = {⟨pi, qz−i⟩ | (max(0, z − m) ≤ i ≤ min(l, z)) ∧ (⟨pi, qz−i⟩ ∈ Q3)} is nonempty. From the set of states Z, we choose the state that has the most transitions in common with ⟨p0, q0⟩, breaking ties arbitrarily. If Z is empty for every z, then we just pick ⟨p0, q0⟩, i.e. the deferment pointer is not set (the state defers to itself). The idea behind the first match method is that the number of transitions ⟨p0, q0⟩ shares with ⟨pi, qj⟩ tends to decrease as i + j increases. In Figure 4.3(d), all the selected F(u) correspond to the first match method.

A second, more complete algorithm for setting F(u) is the best match method, where we always consider all (l+1) × (m+1) − 1 candidate pairs and pick the pair that is in Q3 and has the most transitions in common with ⟨p0, q0⟩. The rationale for the best match method is that it is not always true that ⟨p0, q0⟩ shares at least as many transitions with ⟨px, qy⟩ as with ⟨px+i, qy+j⟩ for i + j > 0. For instance, p0 may share fewer transitions with p2 than with p3, which would mean ⟨p0, q0⟩ shares fewer transitions with ⟨p2, q0⟩ than with ⟨p3, q0⟩. In such cases, the first match method will not find the pair along the deferment chains with the most transitions in common with ⟨p0, q0⟩. In Figure 4.3(d), all the selected F(u) also correspond to the best match method; it is difficult to create a small example where first match and best match differ.

When adding the new state u to D3, it is possible that some state pairs along the deferment chains that were not in Q3 while finding the deferred state for u are added to Q3 later on. This means that after all the states have been added to Q3, the deferment for u can potentially be improved. Thus, after all the states have been added, we find a deferred state for each state again. If the new deferred state is better than the old one, we reset the deferment to the new deferred state.

Algorithm 4.4 shows the pseudocode for the D2FA merge algorithm with the first match method for choosing a deferred state. Note that we use u and ⟨u1, u2⟩ interchangeably to indicate a state in the merged D2FA D3, where u is a state in Q3, and u1 and u2 are the states in Q1 and Q2, respectively, that state u corresponds to.

4.3.3 Direct D2FA construction for RE set

As with efficient DFA construction, we first build the D2FA for each RE in R. We then need to merge the D2FAs together using the D2FAMerge algorithm from the previous section. We considered a variety of methods for merging the D2FAs together, including a greedy "Huffman" approach in which, at each step, the two smallest D2FAs are merged together. The best approach we have found experimentally is to merge all the D2FAs in a balanced binary tree fashion, because a binary tree minimizes the worst-case number of merges that any RE experiences. We use two different variations of our D2FAMerge algorithm while merging D2FAs.
For all merges except the final merge, we use the first match method for setting F(u). When doing the final merge to get the final D2FA, we use the best match method for setting F(u). It turns out that using the first match method results in a better deferment forest structure in the D2FA, which helps when the D2FA is further merged with other D2FAs; the local optimization achieved by the best match method only helps when used in the final merge.

Input: A pair of D2FAs, D1 = (Q1, Σ, ρ1, q01, M1, F1) and D2 = (Q2, Σ, ρ2, q02, M2, F2), corresponding to RE sets R1 and R2, with R1 ∩ R2 = ∅.
Output: A D2FA corresponding to the RE set R1 ∪ R2.

1   Initialize D3 to an empty D2FA;
2   Initialize queue as an empty queue;
3   queue.push(⟨q01, q02⟩);
4   while queue not empty do
5       u = ⟨u1, u2⟩ ← queue.pop();
6       Q3 ← Q3 ∪ {u};
7       foreach c ∈ Σ do
8           nxt ← ⟨δ1(u1, c), δ2(u2, c)⟩;
9           if nxt ∉ Q3 ∧ nxt ∉ queue then queue.push(nxt);
10          Add (u, c) → nxt transition to ρ3;
11      M3(u) ← M1(u1) ∪ M2(u2);
12      F3(u) ← FindDefState(u);
13      Remove those transitions of u from ρ3 that are in common with F3(u);
14  foreach u ∈ Q3 do
15      newDptr ← FindDefState(u);
16      if (newDptr ≠ F3(u)) ∧ (newDptr shares more transitions with u than F3(u) does) then
17          F3(u) ← newDptr;
18          Reset all transitions for u in ρ3 and then remove the ones in common with F3(u);
19  return D3;
20  Function FindDefState(⟨v1, v2⟩)
21      Let p0 = v1, p1, . . . , pl be the states on the deferment chain from v1 to its root in D1;
22      Let q0 = v2, q1, . . . , qm be the states on the deferment chain from v2 to its root in D2;
23      for z = 1 to (l + m) do
24          S ← {⟨pi, qz−i⟩ | (max(0, z − m) ≤ i ≤ min(l, z)) ∧ (⟨pi, qz−i⟩ ∈ Q3)};
25          if S ≠ ∅ then
26              return the state in S that shares the most transitions with ⟨v1, v2⟩;
27      return ⟨v1, v2⟩;

Figure 4.4: Algorithm D2FAMerge(D1, D2) for merging two D2FAs.
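To complement the pseudocode, here is a minimal Python rendering of the FindDefState subroutine with the first match method. It is our own illustration under an assumed encoding: F1 and F2 are deferment-pointer maps (a root defers to itself), Q3 is the set of merged states created so far, and common_transitions is an assumed helper that counts the transitions two states share.

def defer_chain(v, F):
    # Maximal deferment chain v, F(v), F(F(v)), ... ending at a root.
    chain = [v]
    while F[chain[-1]] != chain[-1]:
        chain.append(F[chain[-1]])
    return chain

def find_def_state(v1, v2, F1, F2, Q3, common_transitions):
    p = defer_chain(v1, F1)   # p[0] = v1, ..., p[l] = root in D1
    q = defer_chain(v2, F2)   # q[0] = v2, ..., q[m] = root in D2
    l, m = len(p) - 1, len(q) - 1
    for z in range(1, l + m + 1):
        # All reachable candidate pairs whose chain indices sum to z
        # (line 24 of Algorithm 4.4).
        S = [(p[i], q[z - i])
             for i in range(max(0, z - m), min(l, z) + 1)
             if (p[i], q[z - i]) in Q3]
        if S:
            # Among candidates at this depth, keep the one sharing the
            # most transitions with <v1, v2>.
            return max(S, key=lambda s: common_transitions((v1, v2), s))
    return (v1, v2)  # no candidate found: the state defers to itself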
4.3.4 Optional Final Compression Algorithm

When there is no bound on the deferment depth (see Section 4.4.2), the original D2FA algorithm proposed in [26] produces a D2FA of the smallest possible size because it runs Kruskal's algorithm on a large SRG. Our D2FA merge algorithm produces a slightly larger D2FA because it uses a greedy approach to determine deferment. We can further reduce the size of the D2FA produced by our algorithm by running the following compression algorithm on the D2FA produced by the D2FA merge algorithm. We construct an SRG and perform a maximum weight spanning tree construction on the SRG, but we only add edges to the SRG that have the potential to reduce the size of the D2FA. More specifically, let u and v be any two states in the current D2FA, and let w(x, y) denote the number of transitions that states x and y have in common. We only add the edge e = (u, v) to the SRG if its weight w(u, v) is at least min(w(u, F(u)), w(v, F(v))), where F(u) is the deferred state of u in the current D2FA. As a result, very few edges are added to the SRG, so we only need to run Kruskal's algorithm on a small SRG. This saves both space and time compared to previous D2FA construction methods. However, this compression step does require more time and space than the D2FA merge algorithm alone, because it constructs an SRG and then runs Kruskal's algorithm on it.

4.4 D2FA Merge Algorithm Properties

We now discuss some properties of the D2FA merge algorithm itself and of the resulting D2FA.

4.4.1 Proof of Correctness

The D2FA merge algorithm exactly follows the UCP construction to create the states, so the correctness of the underlying DFA follows from the correctness of the UCP construction. Theorem 2 shows that the merged D2FA is also well defined (there are no cycles in the deferment forest). In the following, u ⇝ v denotes that state u (transitively) defers to state v, i.e. v lies on the deferment chain starting at u.

Lemma 1. In the D2FA D3 = D2FAMerge(D1, D2), ⟨u1, u2⟩ ⇝ ⟨v1, v2⟩ ⇒ u1 ⇝ v1 ∧ u2 ⇝ v2.

Proof. If ⟨u1, u2⟩ = ⟨v1, v2⟩, then the lemma is trivially true. Otherwise, let ⟨u1, u2⟩ → ⟨w1, w2⟩ ⇝ ⟨v1, v2⟩ be the deferment chain in D3. When selecting the deferred state for ⟨u1, u2⟩, D2FAMerge always chooses a state that corresponds to a pair of states along the deferment chains of u1 and u2 in D1 and D2, respectively. Therefore, we have ⟨u1, u2⟩ → ⟨w1, w2⟩ ⇒ u1 ⇝ w1 ∧ u2 ⇝ w2. By induction on the length of the deferment chain and the fact that the ⇝ relation is transitive, we get our result.

Theorem 2. If the D2FAs D1 and D2 are well defined, then the D2FA D3 = D2FAMerge(D1, D2) is also well defined.

Proof. Since D1 and D2 are well defined, there are no cycles in their deferment forests. Now assume that D3 is not well defined, i.e. there is a cycle in its deferment forest. Let ⟨u1, u2⟩ and ⟨v1, v2⟩ be two distinct states on the cycle. Then we have

⟨u1, u2⟩ ⇝ ⟨v1, v2⟩ ∧ ⟨v1, v2⟩ ⇝ ⟨u1, u2⟩.

Using Lemma 1, we get

(u1 ⇝ v1 ∧ u2 ⇝ v2) ∧ (v1 ⇝ u1 ∧ v2 ⇝ u2),

i.e.

(u1 ⇝ v1 ∧ v1 ⇝ u1) ∧ (u2 ⇝ v2 ∧ v2 ⇝ u2).

Since ⟨u1, u2⟩ ≠ ⟨v1, v2⟩, we have u1 ≠ v1 ∨ u2 ≠ v2, which implies that at least one of D1 or D2 has a cycle in its deferment forest, a contradiction.

4.4.2 Limiting Deferment Depth

Since no input is consumed while traversing a deferred transition, in the worst case the number of lookups needed to process one input character is given by the deferment depth of the D2FA. As proposed in [26], we can guarantee a worst case performance by limiting the deferment depth of the D2FA. Recall that ψ(u) denotes the deferment depth of state u and Ψ(D) denotes the deferment depth of the D2FA D.

Lemma 2. In the D2FA D3 = D2FAMerge(D1, D2), ∀⟨u1, u2⟩ ∈ Q3, ψ(⟨u1, u2⟩) ≤ ψ(u1) + ψ(u2).

Proof. Let ψ(⟨u1, u2⟩) = d. If d = 0, then ⟨u1, u2⟩ is a root and the lemma is trivially true. So we consider d ≥ 1 and assume the lemma is true for all states with ψ < d. Let ⟨u1, u2⟩ → ⟨w1, w2⟩ ⇝ ⟨v1, v2⟩ be the deferment chain in D3. Using the inductive hypothesis, we have

ψ(⟨w1, w2⟩) ≤ ψ(w1) + ψ(w2).

Given ⟨u1, u2⟩ ≠ ⟨w1, w2⟩, we assume without loss of generality that u1 ≠ w1. Using Lemma 1, we get u1 ⇝ w1, and therefore ψ(w1) ≤ ψ(u1) − 1. Combining the above, we get

ψ(⟨u1, u2⟩) = ψ(⟨w1, w2⟩) + 1 ≤ ψ(w1) + ψ(w2) + 1 ≤ (ψ(u1) − 1) + ψ(u2) + 1 ≤ ψ(u1) + ψ(u2).

Lemma 2 directly gives us the following theorem.

Theorem 3. If D3 = D2FAMerge(D1, D2), then Ψ(D3) ≤ Ψ(D1) + Ψ(D2).

For an RE set R, if the initial D2FAs have Ψ = d, in the worst case the final merged D2FA corresponding to R can have Ψ = d × |R|. Although Theorem 3 gives the worst case value of Ψ, in practical cases Ψ(D3) is very close to max(Ψ(D1), Ψ(D2)); thus the deferment depth of the final merged D2FA is usually not much higher than d.

Let Ω denote the desired upper bound on Ψ. To guarantee Ψ(D3) ≤ Ω, we modify the FindDefState subroutine in Algorithm 4.4 as follows: when selecting candidate pairs for the deferred state, we only consider states with ψ < Ω. Specifically, we replace line 24 with

S ← {⟨pi, qz−i⟩ | (max(0, z − m) ≤ i ≤ min(l, z)) ∧ (⟨pi, qz−i⟩ ∈ Q3) ∧ (ψ(⟨pi, qz−i⟩) < Ω)}.

When we do the second pass (lines 14-18), we may increase the deferment depth of nodes that defer to nodes whose deferment we readjust. We record the affected nodes and then do a third pass to reset their deferment states so that the maximum depth bound is satisfied. In practice, this happens very rarely.
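Under the same assumed encoding as the earlier FindDefState sketch, the depth bound amounts to a single extra filter on the candidate set of line 24; psi (mapping each merged state to its deferment depth) and Omega are our illustrative names.

def bounded_candidates(p, q, z, l, m, Q3, psi, Omega):
    # Depth-bounded variant of the candidate set on line 24 of
    # Algorithm 4.4: keep only reachable pairs whose deferment depth
    # is strictly below the bound Omega.
    return [(p[i], q[z - i])
            for i in range(max(0, z - m), min(l, z) + 1)
            if (p[i], q[z - i]) in Q3
            and psi[(p[i], q[z - i])] < Omega]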
When constructing a D2FA with a given bound Ω, we first build the D2FAs without the bound, and only apply the bound Ω when performing the final merge of two D2FAs to create the final D2FA.

4.4.3 Deferment to a Lower Level

Becchi and Crowley [8] propose a D2FA algorithm in which each state defers to a state at a lower level than itself (see Section 3.3.4). More formally, they ensure that for all states u, level(u) > level(F(u)) if F(u) ≠ u. We call this property the back-pointer property. If the back-pointer property holds, then every deferred transition taken decreases the level of the current state by at least 1. Since a regular transition on an input character can increase the level of the current state by at most 1, fewer deferred transitions than regular transitions are taken on the entire input string. This gives an amortized cost of at most 2 transitions taken per input character.

Unfortunately, if D2FAs D1 and D2 have the back-pointer property, the merged D2FA D3 = D2FAMerge(D1, D2) is not guaranteed to have the back-pointer property. A simple counterexample arises when merging the D2FAs corresponding to the REs /(aaa)+/ and /(aaaa)+/. Typically, for practical cases, if the initial D2FAs have the back-pointer property, then almost all of the states in the final merged D2FA have the back-pointer property. In order to guarantee that the D2FA D3 has the back-pointer property, we modify the FindDefState subroutine in Algorithm 4.4 in a similar manner as when limiting the maximum deferment depth: when selecting candidate pairs for the deferred state, we only consider states at a lower level. Specifically, we replace line 24 with

S ← {⟨pi, qz−i⟩ | (max(0, z − m) ≤ i ≤ min(l, z)) ∧ (⟨pi, qz−i⟩ ∈ Q3) ∧ (level(⟨v1, v2⟩) > level(⟨pi, qz−i⟩))}.

For states for which no candidate pairs are found, we simply search through all states in Q3 at a lower level for the deferred state. In practice, this search through all states needs to be done for very few states because, if D2FAs D1 and D2 have the back-pointer property, then almost all the states in D3 have the back-pointer property. As with limiting the maximum deferment depth, we only apply this restriction when performing the final merge of two D2FAs to create the final D2FA.

4.4.4 Algorithmic Complexity

The time complexity of the original D2FA algorithm proposed in [26] is O(n^2(log(n) + |Σ|)): the SRG has O(n^2) edges, O(|Σ|) time is required to add each edge to the SRG, and O(log(n)) time is required to process each edge in the SRG during the maximum weight spanning tree routine. The time complexity of the D2FA algorithm proposed in [8] is O(n^2|Σ|): each state is compared with O(n) other states, and each comparison requires O(|Σ|) time.

The time complexity of our new D2FAMerge algorithm for merging two D2FAs is O(nΨ1Ψ2|Σ|), where n is the number of states in the merged D2FA and Ψ1 and Ψ2 are the maximum deferment depths of the two input D2FAs. When setting the deferment for any state u = ⟨u1, u2⟩, in the worst case the algorithm compares ⟨u1, u2⟩ with all the pairs along the deferment chains of u1 and u2, which are at most Ψ1 and Ψ2 in length, respectively; each comparison requires O(|Σ|) time. In practice, the time complexity is O(n|Σ|), as each state needs to be compared with very few states, for the following three reasons. First, the maximum deferment depth Ψ is usually very small.
The largest value of Ψ among our 8 primary RE sets in Section 4.5 is 7. Second, the length of the deferment chains for most states is much smaller than Ψ; the largest value of the average deferment depth ψ among our 8 RE sets is 2.54. Finally, many of the state pairs along the deferment chains are not reachable in the merged D2FA; among our 8 RE sets, the largest value of the average number of comparisons needed is 1.47.

When merging all the D2FAs together for an RE set R, the total time required in the worst case is O(nΨ1Ψ2|Σ| log(|R|)). The worst case happens when the RE set contains strings and there is no state explosion; in this case, each merged D2FA has a number of states roughly equal to the sum of the sizes of the D2FAs being merged. When there is state explosion, the last D2FA merge is the dominating factor, and the total time is just O(nΨ1Ψ2|Σ|).

When modifying the D2FAMerge algorithm to maintain back-pointers, the worst case time is O(n^2|Σ|), because we may have to compare each state with O(n) other states if none of the candidate pairs is found at a lower level than the state. In practice, this search needs to be done for very few states, typically less than 1%.

The worst case time complexity of the final compression step is the same as that of Kumar et al.'s D2FA algorithm, O(n^2(log(n) + |Σ|)), since both involve computing a maximum weight spanning tree on the SRG. However, because we only consider edges which improve upon the existing deferment forest, the actual size of the SRG in practice is typically linear in the number of nodes. In particular, for the real-world RE sets that we consider in the experiments section, the SRG generated by our final compression step is on average 100 times smaller than the SRG generated by Kumar et al.'s algorithm. As a result, the optimization step requires much less memory and time than the original algorithm.

4.5 Experimental Results

In this section, we evaluate the effectiveness of our algorithms on real-world and synthetic RE sets. We consider two variants of our D2FA merge algorithm: the main variant D2FAMERGE, which just merges the D2FAs, and D2FAMERGEOPT, which applies our final compression algorithm after running D2FAMERGE. We compare our algorithms with the original D2FA construction algorithm proposed in [26], denoted ORIGINAL, which optimizes transition compression, and the D2FA construction algorithm proposed in [8], denoted BACKPTR, which enforces the back-pointer property described in Section 4.4.3.

4.5.1 Methodology

4.5.1.1 Data Sets

Our main results are based on eight real RE sets: four proprietary RE sets, C7, C8, C10, and C613, from a large networking vendor, and four public RE sets, Bro217, Snort24, Snort31, and Snort34. We partition these into three groups, STRING, WILDCARD, and SNORT, based on their RE composition. For each RE set, the number indicates the number of REs in the set. The STRING RE sets, C613 and Bro217, contain mostly string matching REs. The WILDCARD RE sets, C7, C8 and C10, contain mostly REs with multiple wildcard closures `.*'. The SNORT RE sets, Snort24, Snort31, and Snort34, contain a more diverse set of REs, roughly 40% of which have wildcard closures.

To test scalability, we use Scale, a synthetic RE set consisting of 26 REs of the form /.*c_u0123456.*c_l789!#%&/, where c_u and c_l range over the 26 uppercase and lowercase alphabet letters, respectively. Even though all the REs are nearly identical, differing only in the characters after the two `.*'s, we still get the full multiplicative effect, where the number of states in the corresponding minimum state DFA roughly doubles for every RE added.
4.5.1.2 Metrics

We use the following metrics to evaluate the algorithms. First, we measure the resulting D2FA size (number of transitions) to assess transition compression performance. Our D2FAMERGE algorithm typically performs almost as well as the other algorithms even though it builds the D2FA incrementally rather than compressing the final minimum state DFA. Second, we measure the maximum deferment depth (Ψ) and the average deferment depth (ψ) of the D2FA to assess how quickly the resulting D2FA can perform regular expression matching. Smaller Ψ and ψ mean that fewer deferred transitions, which process no input characters, need to be traversed when processing an input string. Our D2FAMERGE significantly outperforms the other algorithms on these metrics. Finally, we measure the space and time required to build the final automaton. Again, our D2FAMERGE significantly outperforms the other algorithms.

When comparing the performance of D2FAMERGE with another algorithm A on a given RE or RE set, we define the following quantities: transition increase is (D2FAMERGE D2FA size − A D2FA size) divided by A D2FA size; transition decrease is (A D2FA size − D2FAMERGE D2FA size) divided by A D2FA size; average (maximum) deferment depth ratio is A's average (maximum) deferment depth divided by D2FAMERGE's average (maximum) deferment depth; space ratio is A's space divided by D2FAMERGE's space; and time ratio is A's build time divided by D2FAMERGE's build time.

4.5.1.3 Measuring Space

When measuring the space required by an algorithm, we measure the maximum amount of memory required at any point in time during the construction and final storage of the automaton. This is a difficult quantity to measure exactly; we approximate the required space for each algorithm as follows.

For D2FAMERGE, the dominant data structure is the D2FA. For a D2FA, the transitions for each state can be stored as pairs of input character and next state ID, so the memory required to store a D2FA is calculated as (#transitions) × 5 bytes. However, the maximum amount of memory required while running D2FAMERGE may be higher than the final D2FA size for two reasons. First, when merging two D2FAs, we need to maintain the two input D2FAs as well as the output D2FA. Second, we may create an intermediate output D2FA that has more transitions than needed; these extra transitions are eliminated once all D2FA states have been added. We keep track of the worst case required space for our algorithm during D2FA construction; this maximum typically occurs when merging the final two intermediate D2FAs to form the final D2FA.

For ORIGINAL, we measure the space required by the minimized DFA and the SRG. For the DFA, the transitions for each state can be stored as an array of size |Σ| with each array entry requiring four bytes to hold the next state ID. For the SRG, each edge requires 17 bytes, as observed in [8]. This leads to a required memory for building the D2FA of |Q| × |Σ| × 4 + (#edges in SRG) × 17 bytes.

For D2FAMERGEOPT, the space required is the size of the final D2FA resulting from the merge step plus the size of the SRG used by the final compression algorithm; the sizes are computed as in the cases of D2FAMERGE and ORIGINAL.
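Since these estimates are simple arithmetic, the following small Python sketch (our own helpers, with an illustrative example) computes the two main ones in bytes.

def d2fa_bytes(num_transitions):
    # Each stored D2FA transition = (input character, next state ID) = 5 bytes.
    return num_transitions * 5

def original_bytes(num_states, alphabet_size, srg_edges):
    # Minimized DFA: 4 bytes per next-state array entry;
    # SRG: 17 bytes per edge, as observed in [8].
    return num_states * alphabet_size * 4 + srg_edges * 17

# Hypothetical example: a 25,000-state DFA over a 256-character alphabet
# with a 1,000,000-edge SRG needs about 40.6 MB just to be built.
print(original_bytes(25_000, 256, 1_000_000) / 2**20)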
For BACKPTR, we consider two variants. The first variant builds the minimized DFA directly from the NFA and then sets the deferment for each state. For this variant, no SRG is needed, so the space required is the space needed for the minimized DFA, which is |Q| × |Σ| × 4 bytes. The second variant goes directly from the NFA to the final D2FA; this variant uses less space but is much slower, as it stores incomplete transition tables for most states. Thus, when computing the deferment state for a new state, the algorithm must recreate the complete transition tables of the candidate states to determine which one has the most transitions in common with the new state. For this variant, we assume the only space required is the space to store the final D2FA, which is (#transitions) × 5 bytes, even though more memory is definitely needed at various points during the computation. We also note that both implementations must perform the NFA to DFA subset construction on a large NFA, which means that even the faster variant runs much more slowly than D2FAMERGE.

4.5.1.4 Correctness

We tested the correctness of our algorithms by verifying that the final D2FA is equivalent to the corresponding DFA. Note that we can only perform this check for the RE sets for which we were able to compute the corresponding DFA; thus, we only verified correctness of the final D2FA for our eight real RE sets and the smaller Scale RE sets.

4.5.2 D2FAMERGE versus ORIGINAL

We first compare D2FAMERGE with ORIGINAL, which optimizes transition compression, when both algorithms have unlimited maximum deferment depth. These results are shown in Table 4.1 for our 8 primary RE sets, and Table 4.2 summarizes them by RE set group.

                    ORIGINAL                                              D2FAMERGE
RE set    # States   # Trans  Avg. ψ  Max. Ψ  RAM (MB)  Time (s)   # Trans  Avg. ψ  Max. Ψ  RAM (MB)  Time (s)
Bro217      6533       9816    3.42      8      179.3     119.4      11737    2.15      5      0.13      3.2
C613       11308      21633    8.43     16     1039.5     326.0      26709    2.69      7      0.23      9.7
C7         24750     205633   19.18     30       47.4     397.7     207540    1.14      3      1.07      0.9
C8          3108      23209    8.95     13        4.9      14.5      23334    1.14      2      0.14      0.2
C10        14868      96793   13.68     27       25.5     141.0      97296    1.18      3      0.52      0.6
Snort24    13886      38485    9.53     20      861.2     299.2      39409    1.56      4      0.32      0.2
Snort31    20068      70701   11.41     23      298.5     244.3      92284    2.00      6      1.29      2.6
Snort34    13825      40199    9.99     17      795.4     309.9      43141    1.38      5      0.27      1.8

Table 4.1: The D2FA size, average (ψ) and maximum (Ψ) deferment depths, space estimate, and time required to build the D2FA for ORIGINAL and D2FAMERGE.

                 D2FAMERGE vs. ORIGINAL                          D2FAMERGEOPT vs. ORIGINAL
RE set     Trans     Def. depth ratio   Space    Time     Trans     Def. depth ratio   Space    Time
group      increase   Avg.    Max.      ratio    ratio    increase   Avg.    Max.      ratio    ratio
All        10.8%       7.5     5.2     1499.8    154.5     0.4%       7.4     5.4      113.1     9.4
STRING     21.5%       2.4     1.9     2994.8     35.4     0.0%       2.1     1.6      103.5     0.8
WILDCARD    1.0%      12.1     8.5       42.8    246.6     1.0%      12.1    10.0       16.8    10.8
SNORT      13.3%       6.3     4.1     1960.3    141.8     0.0%       6.1     3.3      215.8    13.7

Table 4.2: Average values of transition increase, deferment depth ratios, space ratios, and time ratios for D2FAMERGE and D2FAMERGEOPT compared with ORIGINAL.

We make the following observations.

(1) D2FAMERGE uses much less space than ORIGINAL. On average, D2FAMERGE uses 1500 times less memory than ORIGINAL to build the resulting D2FA. This difference is most extreme when the SRG is large, which is true for the two STRING RE sets and for Snort24 and Snort34: for these RE sets, D2FAMERGE uses between 1422 and 4568 times less memory than ORIGINAL. For the RE sets with relatively small SRGs, such as those in WILDCARD and Snort31, D2FAMERGE uses between 35 and 231 times less space than ORIGINAL.

(2) D2FAMERGE is much faster than ORIGINAL.
On average, D2FAMERGE builds the D2FA 155 times faster than ORIGINAL. This time difference is largest when the deferment chains are shortest. For example, D2FAMERGE requires an average of only 0.05 msec and 0.09 msec per state for the WILDCARD and SNORT RE sets, respectively, so D2FAMERGE is, on average, 247 and 142 times faster than ORIGINAL for these RE sets. For the STRING RE sets, the deferment chains are longer, so D2FAMERGE requires an average of 0.67 msec per state and is, on average, 35 times faster than ORIGINAL.

(3) D2FAMERGE produces D2FAs with much smaller average and maximum deferment depths than ORIGINAL. On average, D2FAMERGE produces D2FAs whose average deferment depths are 7.5 times smaller and whose maximum deferment depths are 5.2 times smaller than ORIGINAL's. In particular, the average deferment depth for D2FAMERGE is less than 2 for all but the two STRING RE sets, where the average deferment depths are 2.15 and 2.69. Thus, the expected number of deferred transitions traversed when processing a length n string is less than n. One reason D2FAMERGE works so well is that it eliminates low weight edges from the SRG, so the deferment forest has many shallow deferment trees instead of one deep tree. This is particularly effective for the WILDCARD RE sets and, to a lesser extent, the SNORT RE sets. For the STRING RE sets, the SRG is fairly dense, so D2FAMERGE has a smaller advantage relative to ORIGINAL.

(4) D2FAMERGE produces D2FAs with only slightly more transitions than ORIGINAL, particularly on the RE sets that need transition compression the most. On average, D2FAMERGE produces D2FAs with roughly 11% more transitions than ORIGINAL does. D2FAMERGE works best when state explosion from wildcard closures creates DFAs composed of many similar repeating substructures, which is precisely when transition compression is most needed. For example, for the WILDCARD RE sets, which experience the greatest state explosion, D2FAMERGE has only 1% more transitions than ORIGINAL. On the other hand, for the STRING RE sets, D2FAMERGE has, on average, 22% more transitions. For this group, ORIGINAL built a very large SRG and thus used much more space and time to achieve the improved transition compression. Furthermore, transition compression is typically not needed for such RE sets, as all string matching REs can be placed into a single group and the resulting DFA can be built directly.

In summary, D2FAMERGE achieves its best performance relative to ORIGINAL on the WILDCARD RE sets (except for the space used to construct the D2FA) and its worst performance relative to ORIGINAL on the STRING RE sets (again except for construction space). This is desirable, as the space and time efficient D2FAMERGE is most needed on RE sets like those in WILDCARD, because those RE sets experience the greatest state explosion.

4.5.3 Assessment of Final Compression Algorithm

We now assess the effectiveness of our final compression algorithm by comparing D2FAMERGEOPT to ORIGINAL and D2FAMERGE. The results are shown in Table 4.3 for our 8 primary RE sets.
RE set    # States   # Trans  Avg. ψ  Max. Ψ  RAM (MB)  Time (s)
Bro217      6533       9816    2.44      7      2.64      99.2
C613       11308      21633    3.04      8      7.48     940.4
C7         24750     207540    1.14      3      2.49      45.7
C8          3108      23334    1.14      2      0.32       1.0
C10        14868      97296    1.17      2      1.61      14.8
Snort24    13886      38601    1.57      4      2.67      19.9
Snort31    20068      70780    2.17      8     15.61      59.1
Snort34    13825      40387    1.42      8      2.60      14.2

Table 4.3: The D2FA size, average (ψ) and maximum (Ψ) deferment depths, space estimate, and time required to build the D2FA for D2FAMERGEOPT.

Table 4.2 summarizes these results by RE group. As expected, D2FAMERGEOPT produces a D2FA that is almost as small as that produced by ORIGINAL; on average, the number of transitions increases by only 0.4%. There is a very small increase for WILDCARD and SNORT because ORIGINAL also considers all edges with weight > 1 in the SRG, whereas D2FAMERGEOPT does not use edges with weight < 10. There is a significant benefit to not using these low weight SRG edges: the deferment depths are much higher for the D2FA produced by ORIGINAL than for the D2FA produced by D2FAMERGEOPT.

The final compression algorithm of D2FAMERGEOPT does require more resources than D2FAMERGE alone. In some cases, this may limit the size of the RE set that D2FAMERGEOPT can be used for. However, as explained earlier, D2FAMERGE performs best on WILDCARD (which has the most state explosion) and worst on STRING (which has little or no state explosion), so the final compression algorithm is only needed for, and is most beneficial for, RE sets with limited state explosion. Finally, we observe that D2FAMERGEOPT requires on average 113 times less RAM than ORIGINAL and, on average, runs 9 times faster than ORIGINAL.

4.5.4 D2FAMERGE versus ORIGINAL with Bounded Maximum Deferment Depth

We now compare D2FAMERGE and ORIGINAL when they impose a maximum deferment depth bound Ω of 1, 2, or 4. Because time and space do not change significantly, we focus only on the number of transitions and the average deferment depth. These results are shown in Table 4.4. Note that for these data sets, the resulting maximum depth Ψ is typically identical to the maximum depth bound Ω (the only exception is D2FAMERGE with Ω = 4); thus we omit the maximum deferment depth from Table 4.4.

                   ORIGINAL # Trans             ORIGINAL avg. ψ       D2FAMERGE # Trans            D2FAMERGE avg. ψ
RE set      Ω=1        Ω=2       Ω=4       Ω=1   Ω=2   Ω=4      Ω=1      Ω=2      Ω=4       Ω=1   Ω=2   Ω=4
Bro217     698229    296433     52628      0.62  1.18  2.09     50026    15087    11757      1.00  1.83  2.15
C613      1204831    507613    102183      0.62  1.17  2.16    154548    51858    27735      1.00  1.94  2.64
C7        2044171    597544    206814      0.71  1.24  2.07    215940   208044   207540      0.97  1.13  1.14
C8         206897     40411     23261      0.77  1.32  2.51     24090    23334    23334      0.98  1.14  1.14
C10       1105160    325536     97137      0.75  1.31  2.39    101556    97326    97296      0.98  1.18  1.18
Snort24   1376779    543378    106211      0.66  1.25  2.39     68906    42176    39409      0.99  1.47  1.56
Snort31   2193679   1102693    405785      0.62  1.11  2.08    208136   119810    95496      1.00  1.52  1.97
Snort34   1357697    559255     85800      0.66  1.19  2.17     57187    44607    43231      1.00  1.34  1.38

Table 4.4: The D2FA size and average deferment depth ψ for ORIGINAL and D2FAMERGE on our eight primary RE sets, given maximum deferment depth bounds of 1, 2 and 4.

Table 4.5 summarizes the results by RE group, highlighting how much better or worse D2FAMERGE does than ORIGINAL on the two metrics of number of transitions and average deferment depth ψ.
                  Ω = 1                     Ω = 2                     Ω = 4
RE set      Trans    Avg. def.       Trans    Avg. def.       Trans    Avg. def.
group       decr.    depth ratio     decr.    depth ratio     decr.    depth ratio
All         91.3%       0.7          79.4%       0.9          42.5%       1.5
STRING      90.0%       0.6          92.5%       0.6          75.5%       0.9
WILDCARD    89.3%       0.8          59.0%       1.1           0.0%       2.0
SNORT       94.0%       0.7          91.0%       0.8          63.0%       1.4

Table 4.5: Average values of transition decrease and average deferment depth ratios for D2FAMERGE compared with ORIGINAL for our RE set groups, given maximum deferment depth bounds of 1, 2 and 4.

Overall, D2FAMERGE performs very well when given a bound Ω. In particular, the average increase in the number of transitions for D2FAMERGE with Ω equal to 1, 2 and 4, relative to D2FAMERGE with unbounded maximum deferment depth, is only 131%, 20% and 1%, respectively. Stated another way, when D2FAMERGE is required to have a maximum deferment depth of 1, this results in only slightly more than twice the number of transitions in the resulting D2FA. The corresponding values for ORIGINAL are 3121%, 1216% and 197%. These results can be partially explained by examining the average deferment depth data. Unlike in the unbounded maximum deferment depth scenario, here D2FAMERGE has a larger average deferment depth ψ than ORIGINAL, except for WILDCARD when Ω is 1 or 2. This means that D2FAMERGE has more states that defer to at least one other state than ORIGINAL does, which leads to the lower number of transitions in the final D2FA. Overall, for Ω = 1, D2FAMERGE produces D2FAs with roughly 91% fewer transitions than ORIGINAL for all RE set groups. For Ω = 2, D2FAMERGE produces D2FAs with roughly 59% fewer transitions than ORIGINAL for the WILDCARD RE sets and roughly 92% fewer transitions for the other RE sets.

4.5.5 D2FAMERGE versus BACKPTR

                            BACKPTR                                                   D2FAMERGE with back-pointer
RE set    # Trans  Avg. ψ  Max. Ψ  RAM (MB)  Time (s)  RAM2 (MB)  Time2 (s)   # Trans  Avg. ψ  Max. Ψ  RAM (MB)  Time (s)
Bro217     11247    2.61      6      6.38      88.08      0.05      273.95      13567    2.33      6      0.13      6.24
C613       26222    2.50      5     11.04      55.91      0.13      971.45      33777    2.30      5      0.25     10.78
C7        217812    5.94     13     24.17     277.80      1.04     1950.00     219684    1.15      4      1.12      4.51
C8         34636    2.44      8      3.04      12.61      0.17       27.76      35476    1.20      4      0.19      0.69
C10       157139    2.13      7     14.52      96.86      0.75      476.54     158232    1.21      4      0.80     11.94
Snort24    46005    8.74     17     13.56      70.95      0.22     1130.00      58273    1.62      8      0.41     47.77
Snort31    82809    2.87      8     19.60     109.56      0.39     1110.00     124584    1.74      6      1.29      3.61
Snort34    46046    7.05     14     13.50      94.19      0.22      983.98      51557    1.42      5      0.30      6.06

Table 4.6: The D2FA size, average (ψ) and maximum (Ψ) deferment depths, space estimate, and time required to build the D2FA for both variants of BACKPTR (RAM/Time for the first variant, RAM2/Time2 for the second) and for D2FAMERGE with the back-pointer property.

We now compare D2FAMERGE with BACKPTR, which enforces the back-pointer property described in Section 4.4.3; we adapt D2FAMERGE to also enforce this property. The results for all our metrics are shown in Table 4.6 for our 8 primary RE sets. We consider the two variants of BACKPTR described in Section 4.5.1.3, one which constructs the minimum state DFA corresponding to the given NFA and one which bypasses the minimum state DFA and goes directly from the NFA to the D2FA. We note that the second variant appears to use less space than D2FAMERGE. This is partially true, since BACKPTR creates a smaller D2FA than D2FAMERGE. However, we underestimate the actual space used by this BACKPTR variant by simply assuming that its required space is the final D2FA size; we ignore, for instance, the space required to store intermediate complete transition tables or to perform the NFA to DFA subset construction. Table 4.7 summarizes these results by RE group, displaying ratios for many of our metrics that highlight how much better or worse D2FAMERGE does than BACKPTR.
RE set      Trans      Def. depth ratio   Space   Time    Space2   Time2
group       increase   Avg.    Max.       ratio   ratio   ratio    ratio
All         17.9%      2.9     1.9        30.4    19.3    0.7      142.5
STRING      25.0%      1.1     1.0        47.3     9.7    0.5       67.0
WILDCARD     1.3%      3.0     2.3        18.5    29.3    0.9      170.8
SNORT       29.7%      4.0     2.1        31.1    15.8    0.5      164.5

Table 4.7: Average values of transition increase, deferment depth ratios, space ratios, and time ratios for D2FAMERGE compared with both variants of BACKPTR for our RE set groups.

Similar to D2FAMERGE versus ORIGINAL, we find that D2FAMERGE with the back-pointer property performs well when compared with both variants of BACKPTR. Specifically, with an average increase in the number of transitions of roughly 18%, D2FAMERGE runs on average 19 times faster than the fast variant of BACKPTR and 143 times faster than the slow variant of BACKPTR. For space, D2FAMERGE uses on average almost 30 times less space than the first variant of BACKPTR and on average roughly 42% more space than the second variant of BACKPTR. Furthermore, D2FAMERGE creates D2FA with average deferment depth 2.9 times smaller than BACKPTR and maximum deferment depth 1.9 times smaller than BACKPTR. As was the case with ORIGINAL, D2FAMERGE achieves its best performance relative to BACKPTR on the WILDCARD RE sets and its worst performance relative to BACKPTR on the STRING RE sets. This is desirable because the space and time efficient D2FAMERGE is most needed on RE sets like those in the WILDCARD group, which experience the greatest state explosion.

4.5.6 Scalability results

Finally, we assess the improved scalability of D2FAMERGE relative to ORIGINAL using the Scale RE set, assuming that we have a maximum memory size of 1GB. For both ORIGINAL and D2FAMERGE, we add one RE at a time from Scale until the space estimate to build the D2FA exceeds the 1GB limit. For ORIGINAL, we are only able to add 12 REs; the final D2FA has 397,312 states and requires over 71 hours to compute. As explained earlier, we include the SRG edges in the RAM size estimate. If we exclude the SRG edges and only include the DFA size in the RAM size estimate, we would only be able to add one more RE before reaching the 1GB limit. For D2FAMERGE, we are able to add 19 REs; the final D2FA has 80,216,064 states and requires only 77 minutes to compute. This data set highlights the quadratic versus linear running times of ORIGINAL and D2FAMERGE, respectively. Figure 4.5 shows how the space and time requirements grow for ORIGINAL and D2FAMERGE as REs from Scale are added one by one until 19 have been added.

Figure 4.5: Memory and time required to build the D2FA versus the number of Scale REs used for ORIGINAL's D2FA and D2FAMERGE's D2FA. (Two log-scale plots: RAM (MB) and build time (s) versus the number of REs.)

Chapter 5

TCAM Implementation

In this chapter we present our work on the hardware implementation of RE matching using TCAM, which we call RegCAM.

5.1 Introduction/Motivation

Previous hardware solutions for RE matching have been based on FPGAs. Although FPGA-based solutions can be modified, resynthesizing and updating FPGA circuitry in a deployed system to handle RE updates is slow and difficult. This makes FPGA-based solutions difficult to deploy in many networking devices (such as NIDS/NIPS and firewalls) where the REs need to be updated frequently. We propose the first TCAM based RE matching solution.
TCAMs are prevalent in networking devices because TCAM-based packet classification is the de facto industry standard for high-speed packet classification, i.e., header-based filtering. We show that TCAMs are also very effective for high-speed DPI, i.e., payload-based filtering.

5.1.1 TCAM Architecture for RE matching

We first explain the straightforward implementation of RE matching using TCAM without any compression. Given an RE set, we first construct an equivalent minimum state DFA. Second, we build a two column TCAM lookup table where each column encodes one of the two inputs to δ: the source state ID and the input character. Third, for each TCAM entry, we store the destination state ID in the same entry of the associated SRAM. Figure 5.1 shows an example DFA, its TCAM lookup table, and its SRAM decision table. We illustrate how this DFA processes the input stream "01101111, 01100011". We form a TCAM lookup key by appending the current input character to the current source state ID; in this example, we append the first input character "01101111" to "00", the ID of the initial state s0, to form "0001101111". The first matching entry is the second TCAM entry, so "01", the destination state ID stored in the second SRAM entry, is returned. We form the next TCAM lookup key "0101100011" by appending the second input character "01100011" to this returned state ID "01", and the process repeats.

Figure 5.1: A DFA with its TCAM table. ((a) An example DFA with states s0, s1, and s2; (b) the corresponding two-column TCAM table, with 11 entries grouped by source state, and its SRAM decision table.)

Directly encoding a DFA in a TCAM using one TCAM entry per transition is infeasible. For example, consider a DFA with 25,000 states that consumes one 8 bit character per transition. Each state has 2^8 = 256 outgoing transitions, and each transition needs 8 bits for the character and ⌈log2 25000⌉ = 15 bits for the source state ID. Thus, we would need a total of 140.38 Mb (= 25000 × 2^8 × (8 + ⌈log2 25000⌉)). This is infeasible given that the largest available TCAM chip has a capacity of only 72 Mb. To address this challenge, we use two techniques that minimize the TCAM space for storing a DFA: transition sharing and table consolidation.

5.1.2 Reducing TCAM size

Recall that the two sources of DFA space explosion are transition sharing and state replication (Section 3.2). We propose two techniques to reduce the size of the TCAM required to implement a DFA: Transition Sharing, which exploits transition sharing, and Table Consolidation, which exploits state replication. The basic idea is to combine multiple transitions into one such that we use the ternary nature and first-match semantics of TCAMs to encode multiple DFA transitions using one TCAM entry.

5.1.2.1 Transitions Sharing

The two reasons for transition sharing are character redundancy and state redundancy.

Character redundancy: Prior work exploits character redundancy mainly by alphabet encoding, where the alphabet Σ is mapped to a smaller alphabet Σ′. Alphabet encoding cannot fully leverage all the compression opportunities presented by character redundancy, as it can only exploit global character redundancy that is common to all states in the DFA. Specifically, alphabet encoding can map two characters σ1 and σ2 in Σ to the same character σ′ in Σ′ if and only if ∀q ∈ Q, δ(q, σ1) = δ(q, σ2).
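To make this all-states requirement concrete, the following Python sketch groups characters into the classes that a global alphabet encoding could merge (the layout is an assumption for illustration: delta[q][c] holds δ(q, c) for states q in Q and characters 0..255). Two characters land in the same class only when every state sends them to the same destination.

from collections import defaultdict

def alphabet_classes(delta, Q):
    # Characters c and c' may share a code in the smaller alphabet only
    # when the two columns of the transition table are identical, i.e.,
    # delta(q, c) = delta(q, c') for every state q.
    classes = defaultdict(list)
    for c in range(256):
        column = tuple(delta[q][c] for q in Q)
        classes[column].append(c)
    return list(classes.values())

A single state pair that distinguishes σ1 from σ2 forces them into different classes; this per-state redundancy is exactly what the character bundling technique below can still capture.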
To exploit character redundancy at each state, we propose the technique of character bundling. In character bundling, we leverage the ternary nature and first-match semantics of TCAMs on the input character field to represent multiple characters, and thus multiple transitions that share the same source and destination states, with a single entry.

State redundancy: Prior work exploits state redundancy mainly by deferred transitions, where one state p might defer most of its transitions to another state q. Existing deferred transition based solutions cannot fully exploit state redundancy because of the speed penalty, i.e., traversal of a deferred transition consumes no input. Thus, to alleviate this speed penalty, such solutions often choose deferred transitions in a way that does not fully compress the transition table. To exploit state redundancy, we propose the technique of shadow encoding. In shadow encoding, we leverage the ternary nature and first-match semantics of TCAMs on the source state ID field to encode many incoming transitions of a state from different states using only one TCAM entry.

5.1.2.2 Table Consolidation

We get state explosion in a DFA because each NFA state may be replicated multiple times in the DFA. Table Consolidation exploits state replication in a DFA based on the following observation: two DFA states that are replications of the same NFA state will usually have transitions remaining in the D2FA (i.e., non-deferred transitions) on the same set of input characters (although the corresponding transitions in the two states might go to different states). In this case, the TCAM tables for the two states will be exactly the same except for the state IDs. If the corresponding transitions go to different next states, then the SRAM tables for the two states will be different. The idea is that we can merge the TCAM tables for the two states into one TCAM table and store both SRAM tables side by side. This results in a reduction in TCAM size at the cost of possibly increasing SRAM size, which is acceptable since TCAM size is much more critical than SRAM size.

5.1.3 Increasing Matching Throughput

Another challenge that we address is improving RE matching speed and thus throughput. One way to improve the throughput by up to a factor of k is to use k-stride DFAs that consume k input characters per transition. However, this leads to an exponential increase in both the state and transition spaces. For example, a k-stride DFA requires 2^(8k) transitions per state, so the transition space grows exponentially in k. Previous multi-stride DFAs suffer from such a significant increase in the number of states and the number of transitions that only 2-stride DFAs are achieved in practice [9, 13]. To avoid this space explosion, we use the novel idea of variable striding. The basic idea is to use transitions with variable strides, i.e., different transitions can consume different numbers of input characters. This allows us to increase the average number of characters consumed per transition while ensuring that all the transitions fit within the allocated TCAM space. This idea is based on two key observations. First, for many states, we can capture many but not all k-stride transitions using relatively few TCAM entries, whereas capturing all k-stride transitions requires prohibitively many TCAM entries. Second, with TCAMs, we can easily store transitions with different strides in the same TCAM lookup table. Variable striding would be very difficult to implement without TCAMs, and thus it is not surprising that variable striding has not been considered before.
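The following minimal Python sketch pins down the variable-stride lookup semantics. It is illustrative only: states and characters are plain bit strings, a linear scan emulates TCAM first-match semantics, and the table layout is a simplification of the encodings developed later in this chapter.

def ternary_match(bits, pattern):
    # '*' in the pattern matches either bit value
    return all(p in ('*', b) for b, p in zip(bits, pattern))

def var_stride_run(table, state, text_bits, k):
    # table: list of (source ID pattern, k-character pattern, next state
    # ID, stride), searched in order (first-match semantics).
    # text_bits: list of 8-bit strings, one per input character.
    i = 0
    while i < len(text_bits):
        window = ''.join(text_bits[i:i + k]).ljust(8 * k, '0')  # pad tail
        for src, chars, nxt, stride in table:
            if ternary_match(state, src) and ternary_match(window, chars):
                state, i = nxt, i + stride   # consume 1 to k characters
                break
        else:
            raise ValueError('incomplete transition table')
    return state

Because the decision carries both the next state and the stride, a single table can freely mix 1-stride and multi-stride entries.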
5.1.4 Comparison of Transition Sharing with D2FA

The observation behind the transition sharing technique, namely that many states share a large number of outgoing transitions, is the same observation behind deferred transitions in a D2FA. We use a D2FA as the starting point for transition sharing, and transition sharing can be viewed as a way of implementing a D2FA in TCAM. But there are several differences between transition sharing and a D2FA: (1) The transitions stored at each state are given by the D2FA, but our character bundling technique achieves further compression, so the total number of TCAM rules is significantly less than the number of transitions in the D2FA. (2) A D2FA suffers from a speed penalty, as no input is consumed when a deferred transition is taken; the number of lookups needed in the worst case is given by the deferment depth of the current state. Because of our shadow encoding technique, there is no speed penalty in transition sharing: only one TCAM lookup is needed for each character, irrespective of the deferment depth of the current state. (3) Because of the speed penalty in the D2FA, for a practical implementation, the deferment depth of the D2FA is bounded, which significantly increases the number of transitions in the D2FA. For transition sharing, we build the D2FA without any limit on the deferment depth, achieving maximum transition compression.

We now explain each of our techniques in detail.

5.2 Transition Sharing

The basic idea of transition sharing is to combine multiple transitions into a single TCAM entry. We propose two transition sharing ideas: character bundling and shadow encoding. Character bundling exploits intra-state optimization opportunities and minimizes TCAM tables along the input character dimension. Shadow encoding exploits inter-state optimization opportunities and minimizes TCAM tables along the source state dimension.

5.2.1 Character Bundling

Character bundling exploits character redundancy by combining multiple transitions from the same source state to the same destination into one TCAM entry. Character bundling consists of four steps. (1) Assign each state a unique ID of ⌈log2 |Q|⌉ bits. (2) For each state, enumerate all 256 transition rules where, for each rule, the predicate is a transition's label and the decision is the destination state ID. (3) For each state, treating the 256 rules as a 1-dimensional packet classifier and leveraging the ternary nature and first-match semantics of TCAMs, minimize the number of transitions using the optimal 1-dimensional TCAM minimization algorithm (Section 3.4.2). (4) Concatenate the |Q| 1-dimensional minimal prefix classifiers together by prepending each rule with its source state ID. The resulting list can be viewed as a 2-dimensional classifier where the two fields are the source state ID and the transition label, and the decision is the destination state ID. Figure 5.1 shows an example DFA and its TCAM lookup table built using character bundling. The three chunks of TCAM entries encode the 256 transitions for s0, s1, and s2, respectively. Because each TCAM entry matches one or more input characters, we need only 11 total TCAM entries instead of the 256 × 3 = 768 entries required by the naive implementation.
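As a concrete but deliberately simplified illustration of step (3), the sketch below bundles one state's 256-entry transition column into prefix rules by recursive range splitting. It is not optimal: the chapter's actual step uses the optimal 1-dimensional minimization algorithm of Section 3.4.2, which can additionally exploit a final default rule.

def bundle_state(column):
    # column[c] = destination state for character c (0..255).
    # Emits (8-bit ternary pattern, decision) prefix rules: one rule per
    # maximal aligned character range with a single destination.
    rules = []
    def emit(lo, hi, bits):
        if len(set(column[lo:hi])) == 1:       # range is consistent
            rules.append((bits + '*' * (8 - len(bits)), column[lo]))
        else:                                  # split into two half-ranges
            mid = (lo + hi) // 2
            emit(lo, mid, bits + '0')
            emit(mid, hi, bits + '1')
    emit(0, 256, '')
    return rules

Prepending the state's binary ID to each emitted rule, step (4), yields the two-field classifier that is stored in the TCAM.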
5.2.2 Shadow Encoding

Whereas character bundling encodes multiple transitions with the same source and destination states using one TCAM entry, shadow encoding encodes multiple transitions with the same character label and destination state using one TCAM entry. This technique is based upon the observation of state redundancy. More specifically, character bundling uses ternary codes in the input character field to encode multiple input characters, and shadow encoding uses ternary codes in the source state ID field to encode multiple source states.

5.2.2.1 Observations

We use our running example in Figure 5.1 to illustrate shadow encoding. We observe that all transitions with source states s1 and s2 have the same destination state except for the transitions on character c. Likewise, source state s0 differs from source states s1 and s2 only in the character range [a..o]. This implies there is a lot of state redundancy. The table in Figure 5.2 shows how we can exploit state redundancy to further reduce the required TCAM space. First, since states s1 and s2 are most similar, we give them the state IDs 00 and 01, respectively. State s2 uses the ternary code 0∗ in the state ID field of its TCAM entries to share transitions with state s1. We give state s0 the state ID 10, and it uses the ternary code ∗∗ in the state ID field of its TCAM entries to share transitions with both states s1 and s2. Second, we order the state tables in the TCAM so that state s1's table is first, state s2's is second, and state s0's is last. This facilitates the sharing of transitions among different states, where earlier states have incomplete tables deferring some transitions to later tables. Specifically, s1 has an incomplete table with only a single TCAM entry to specify the transitions it does not share with s2, and s2 has an incomplete table with only 3 TCAM entries to specify the transitions it (and s1) does not share with s0.

Figure 5.2: TCAM table with shadow encoding. (Seven entries: one for s1, three for s2, and three for s0, with shadow codes 00, 0∗, and ∗∗ in the source state ID field.)

Implementing shadow encoding requires solving three key problems: (1) Find the best order of the state tables in the TCAM (any order is allowed). (2) Choose binary IDs and ternary codes for each state given the state table order. (3) Identify entries to remove from each state table.

5.2.2.2 Determining Table Order

We first describe how we compute the order of tables within the TCAM. In order to exploit inter-state transition sharing, we first build a D2FA for the given RE set. If p ≺ q (i.e., state p is a descendant of state q in the deferment forest), we say that state p is in state q's shadow. We use the partial order of the deferment forest of the D2FA to determine the order of state transition tables in the TCAM. Specifically, state q's transition table must be placed after the transition tables of all states in state q's shadow. That is, the state order is given by a post-order depth-first traversal of the deferment forest.

Figure 5.3 shows the D2FA, SRG, and the deferment tree, respectively, for the DFA in Figure 5.1.

Figure 5.3: D2FA, SRG, and deferment tree of the DFA in Figure 5.1. ((a) The D2FA, with default transitions from s1 to s2 and from s2 to s0; (b) the SRG, with edge weights 242, 243, and 255; (c) the deferment tree rooted at s0.)
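A minimal sketch of this ordering step, assuming children maps each state to its children in the deferment forest and roots lists the tree roots (both hypothetical names): a post-order walk emits every state after all states in its shadow.

def table_order(children, roots):
    # Descendants (the shadow) come before their ancestors, so each
    # incomplete table can defer to a table that appears later in the TCAM.
    order = []
    def visit(s):
        for c in children.get(s, []):
            visit(c)
        order.append(s)
    for r in roots:
        visit(r)
    return order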
5.2.2.3 Shadow Encoding Algorithm

We now describe our shadow encoding algorithm, which takes as input a deferment forest F and outputs the state IDs and shadow codes. We also use the term nodes to refer to states in the description of the algorithm. To ensure that proper sharing of transitions occurs, we need to compute a shadow encoding for the given deferment forest. In a valid shadow encoding, each state q is assigned a binary state ID (ID(q)) and a ternary shadow code (SC(q)). Binary state IDs are used in the destination state ID field (in the SRAM) of transition rules. Ternary shadow codes are used in the source state ID field (in the TCAM) of transition rules. The shadow length of a shadow encoding is the common length of every state ID and shadow code. A valid shadow encoding for a given deferment forest F must satisfy the following four Shadow Encoding Properties (SEP):

1. Uniqueness Property: For any two distinct states p and q, ID(p) ≠ ID(q) and SC(p) ≠ SC(q).

2. Self-Matching Property: For any state p, ID(p) ∈ SC(p) (i.e., ID(p) matches SC(p)).

3. Deferment Property: For any two states p and q, p ≺ q (i.e., q is an ancestor of p in the given deferment forest) if and only if SC(p) ⊂ SC(q).

4. Non-interception Property: For any two distinct states p and q, p ≺ q if and only if ID(p) ∈ SC(q).

Lemma 3. Given a valid shadow encoding for deferment forest F, for any state q and all states p in q's shadow, ID(p) ∈ SC(q).

Proof. The deferment property implies that SC(p) ⊂ SC(q). The self-matching property implies that ID(p) ∈ SC(p). Thus, the result follows.

Lemma 4. Given a valid shadow encoding for deferment forest F, for any state q and all states p not in q's shadow, ID(p) ∉ SC(q).

Proof. This follows immediately from the non-interception property.

Intuitively, q's shadow code must match the state ID of all states in q's shadow and cannot match the state ID of any state not in q's shadow.

Theorem 4. Given a DFA, a deferment forest F, a valid shadow encoding for F, and a TCAM classifier C that uses only binary state IDs for both source and destination state IDs in transition rules and that orders the state tables according to F, the TCAM classifier formed by replacing each source state ID in C with the corresponding shadow code and each destination state ID in C with the corresponding state ID is equivalent to C.

Proof. This follows from the first-match nature of TCAMs, the fact that the state tables are ordered according to F, and Lemmas 3 and 4.
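As a sanity check, the SEP can be verified mechanically. The sketch below tests all four properties by brute force; the data layout is assumed for illustration: IDs as '0'/'1' strings, shadow codes as '0'/'1'/'*' strings, and ancestors[p] the set of p's proper ancestors in the deferment forest.

def matches(bid, code):
    # a binary ID matches a ternary code if they agree on every fixed bit
    return all(c in ('*', b) for b, c in zip(bid, code))

def covers(cp, cq):
    # the keys matched by cp form a subset of the keys matched by cq
    return all(q == '*' or p == q for p, q in zip(cp, cq))

def check_sep(ID, SC, ancestors):
    for p in ID:
        assert matches(ID[p], SC[p]), 'self-matching'
        for q in ID:
            if p == q:
                continue
            assert ID[p] != ID[q] and SC[p] != SC[q], 'uniqueness'
            below = q in ancestors[p]                # p is in q's shadow
            strict = covers(SC[p], SC[q]) and SC[p] != SC[q]
            assert strict == below, 'deferment'
            assert matches(ID[p], SC[q]) == below, 'non-interception'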
We give a shadow encoding algorithm for the case where the deferment forest is a single deferment tree DT. We handle deferment forests by simply creating a virtual root node whose children are the roots of the deferment trees in the forest and then running the algorithm on this tree. Our algorithm uses the following internal variables for each node v: a local binary ID denoted L(v), a global binary ID denoted G(v), and an integer weight denoted W(v) that is the shadow length we would use for the subtree of DT rooted at v. Intuitively, the state ID of v will be G(v)|L(v), where | denotes concatenation, and the shadow code of v will be the prefix string G(v) followed by the required number of ∗'s; some extra padding characters may be needed. We use #L(v) and #G(v) to denote the number of bits in L(v) and G(v), respectively.

Our algorithm works as follows. For all v, we initially set L(v) = G(v) = ∅ and W(v) = 0. Our algorithm works recursively in a bottom-up fashion. We mark nodes red when they have been processed. We begin by marking each leaf node of DT as processed. We process an internal node v when all its children v1, · · · , vn are marked red. Once a node v is processed, its weight W(v) and its local ID L(v) are fixed, but we will prepend additional bits to its global ID G(v) when we process its ancestors in DT.

While processing v, we assign v and each of its n children a variable-length binary code, called an HCode, that is prefix free (i.e., no HCode is a prefix of another HCode). One option is to assign each of the (n + 1) nodes a binary number from 0 to n using ⌈log2(n + 1)⌉ bits. To minimize the shadow length W(v), we instead use a Huffman coding style algorithm to compute the HCodes and W(v). This algorithm uses two data structures: a binary encoding tree T with n + 1 leaf nodes, one for v and each of its children, and a min-priority queue PQ, initialized with n + 1 elements (one for v and each of its children), that is ordered by node weight. While PQ has more than one element, we remove the two elements x and y with the lowest weights from PQ, create a new internal node z in T with two children x and y, set weight(z) = max(weight(x), weight(y)) + 1, and then put element z into PQ. When PQ has only one element, T is complete. The HCode assigned to each leaf node is the path in T from the root node to that leaf, where left edges have value 0 and right edges have value 1.

We update the internal variables of v and its descendants in DT as follows. We set L(v) to be its HCode and W(v) to be the weight of the root node of T; G(v) is left empty. For each child vi, we prepend vi's HCode to the global ID of every node in the subtree rooted at vi, including vi itself. We then mark v as red. This continues until all nodes in DT are red.

We now set the state IDs and shadow codes. The shadow length is k, the weight of the root node of DT. We use {∗}^m to denote a ternary string of m ∗'s and {0}^m to denote a binary string of m 0's. For each node v, we compute v's state ID and shadow code as follows:

ID(v) = G(v) | L(v) | {0}^(k − #G(v) − #L(v)),
SC(v) = G(v) | {∗}^(k − #G(v)).

We illustrate our shadow encoding algorithm in Figure 5.4. Figure 5.4(a) shows all the internal variables just before v1 is processed. Figure 5.4(b) shows the Huffman style binary encoding tree T built for node v1 and its children v2, v3, and v4, and the resulting HCodes. Figure 5.4(c) shows each node's final weight, global ID, local ID, state ID, and shadow code.

Figure 5.4: Shadow encoding example. ((a) The deferment tree just before v1 is processed: leaves v2, v5, v6, v7 have W = 0; v3 has L = 0, W = 1, with child v5 (G = 1); v4 has L = 00, W = 2, with children v6 (G = 01) and v7 (G = 10). (b) The Huffman style tree built while processing v1: leaves v1, v2, v3, v4 with weights 0, 0, 1, 2 receive HCodes 000, 001, 01, and 1; the root weight is 3 = max(2, 2) + 1. (c) The final assignment: v1: ID = 000, SC = ∗∗∗; v2: ID = 001, SC = 001; v3: ID = 010, SC = 01∗; v5: ID = 011, SC = 011; v4: ID = 100, SC = 1∗∗; v6: ID = 101, SC = 101; v7: ID = 110, SC = 110.)

The pseudo-code for the shadow encoding algorithm is given in Figure 5.5.

Input: Deferment forest DF with n states s1, . . . , sn.
Output: ID[1..n] and SC[1..n] for each state.

1   Add state s0 to DF with all the tree roots as its children;
2   Set all ID[1..n] and SC[1..n] to the empty string;
3   ShadowEncode(s0);
4   return ID[1..n] and SC[1..n];

5   Function ShadowEncode(s)
    // Base case
6   if s has no children then return 0;
    // Recursive case
7   r ← number of children of s;
8   CHILD[1..r] ← list of children of s;
9   for i = 1 to r do
10      W[i] ← ShadowEncode(CHILD[i]);
11  W[0] ← 0;
12  G ← HCode(W);
13  l ← max over 0 ≤ i ≤ r of (|G[i]| + W[i]);
14  for i = 1 to r do
15      Append 0's at the end of G[i] to make |G[i]| + W[i] = l;
16      Attach G[i] in front of ID and SC for each state in the subtree of CHILD[i];
17  ID(s) ← {0}^l; SC(s) ← {∗}^l;
18  return l;

19  Function HCode(W[0..r])
20  Initialize Q as a min-priority queue of binary tree nodes;
21  for i = 0 to r do
22      Insert leaf node ni in Q with value V[ni] ← W[i];
23  while |Q| > 1 do
24      nl ← pop(Q); nr ← pop(Q);
25      Insert node n in Q with nl and nr as left and right children, and value V[n] ← max(V[nl], V[nr]) + 1;
26  n ← pop(Q);
27  Generate the codes based on the Huffman tree rooted at n;
28  return the codes assigned to the leaf nodes;

Figure 5.5: Shadow Encoding Algorithm.
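For concreteness, here is a compact runnable Python sketch of the same bottom-up procedure. It follows the prose description (each node's ID is G(v)|L(v) padded with 0's, rather than the pseudo-code's all-zero ID for internal nodes), and children is an assumed dict mapping each node to its deferment-tree children.

import heapq
from itertools import count

def hcodes(weights):
    # Prefix-free binary codes chosen Huffman-style so that
    # max_i(len(code_i) + weights[i]) is minimized: repeatedly merge the
    # two lightest trees; the merged weight is the max of the two plus one.
    tie = count()
    heap = [(w, next(tie), i) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    kids, fresh = {}, count(len(weights))
    while len(heap) > 1:
        wl, _, l = heapq.heappop(heap)
        wr, _, r = heapq.heappop(heap)
        n = next(fresh)
        kids[n] = (l, r)
        heapq.heappush(heap, (max(wl, wr) + 1, next(tie), n))
    codes = [''] * len(weights)
    def walk(n, code):
        if n in kids:
            walk(kids[n][0], code + '0')
            walk(kids[n][1], code + '1')
        else:
            codes[n] = code
    walk(heap[0][2], '')
    return codes

def subtree(children, v):
    yield v
    for c in children.get(v, []):
        yield from subtree(children, c)

def encode(children, v, ID, SC):
    # Returns W(v); ID/SC hold suffixes that ancestors extend by
    # prepending their child HCodes (the global ID G).
    kids = children.get(v, [])
    if not kids:
        ID[v], SC[v] = '', ''
        return 0
    w = [encode(children, c, ID, SC) for c in kids]
    codes = hcodes([0] + w)                    # index 0 is v itself
    l = max(len(h) + wi for h, wi in zip(codes, [0] + w))
    for h, wi, c in zip(codes[1:], w, kids):
        g = h + '0' * (l - len(h) - wi)        # pad so subtrees line up
        for d in subtree(children, c):
            ID[d] = g + ID[d]
            SC[d] = g + SC[d]
    ID[v] = codes[0] + '0' * (l - len(codes[0]))
    SC[v] = ''                                 # v's own suffix is all *'s
    return l

def shadow_encoding(children, root):
    ID, SC = {}, {}
    k = encode(children, root, ID, SC)         # k is the shadow length
    for v in subtree(children, root):
        SC[v] += '*' * (k - len(SC[v]))
    return ID, SC, k

On the deferment tree of Figure 5.4 (children = {'v1': ['v2', 'v3', 'v4'], 'v3': ['v5'], 'v4': ['v6', 'v7']}), shadow_encoding returns a shadow length of 3; the exact codes can differ from the figure because Huffman tie-breaking is arbitrary, but they have the same lengths and satisfy the SEP.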
We now prove two properties of our shadow encoding algorithm using induction on the height n of the deferment tree T. In both proofs, in the inductive case, we let s denote the root node of T, s1 through sc denote the c children of s, and Ti for 1 ≤ i ≤ c denote the subtree rooted at si.

Theorem 5. The state IDs and shadow codes generated by our shadow encoding algorithm satisfy the SEP.

Proof. We prove this by induction on the height n of T. The base case where n = 0 is trivial since there is only a single node. For the inductive case, our inductive hypothesis is that the shadow codes and state IDs generated for Ti for 1 ≤ i ≤ c satisfy the SEP. Note, we do not process the root node s in this assumption. We now consider what happens when we process s. For each node v ∈ Ti for 1 ≤ i ≤ c, HCode(si) is prepended to SC(v) and ID(v). Thus, the SEP still holds for all the nodes within Ti for 1 ≤ i ≤ c. For any nodes p and q from different subtrees Ti and Tj, it follows that ID(p) ∉ SC(q) and ID(q) ∉ SC(p) because HCode(si) and HCode(sj) are not prefixes of each other. Finally, for all nodes v ∈ T, ID(v) ∈ SC(s) because SC(s) contains only ∗'s.

We define a prefix shadow encoding as a shadow encoding where all shadow codes are prefix strings; that is, all ∗'s come after any 0's or 1's. For any prefix shadow encoding E of T, E_Ti denotes the subset of state IDs and shadow codes for all v ∈ Ti. For any state ID or shadow code X, we write the first p characters of X as pX and the last p characters of X as Xp. We define E_Ti^p = {Xp | X ∈ E_Ti}.

Lemma 5. Consider a deferment tree T with a valid length-x prefix shadow encoding E that satisfies the SEP. For every child si, 1 ≤ i ≤ c, of the root of T, there exist two values pi and qi such that:

1. For all i: 0 < pi ≤ x, 0 ≤ qi < x, and pi + qi = x.
2. For all i and all v ∈ Ti: the first pi characters of ID(v), of SC(v), and of SC(si) are all identical.
3. For all i: E_Ti^qi, the encoding formed by the last qi characters of each code in E_Ti, is a valid prefix shadow encoding of Ti.
4. The set E_ID = {the first pi characters of SC(si) | 1 ≤ i ≤ c} is prefix free.

Proof. Since E is a prefix shadow encoding, for any child si, SC(si) must be of the form {0,1}^a {∗}^(x−a). Let pi = a and qi = x − a. Now, pi > 0; otherwise we would have SC(si) = {∗}^x, which is not possible as it would violate the deferment and non-interception properties. This proves (1). Also, since E satisfies the deferment and self-matching properties, we must have (2) and (3). And we must have (4) because of the non-interception property.

Our shadow encoding algorithm produces minimum length encodings.
Theorem 6. For any deferment tree T, our shadow encoding algorithm generates the shortest possible prefix shadow encoding that satisfies the SEP.

Proof. First, our shadow encoding algorithm clearly generates a prefix shadow encoding. We prove by induction on the height n of T that it is the shortest possible prefix shadow encoding. The base case where n = 0 is trivial since the encoding for a single node is empty and thus optimal. For the inductive case, our inductive hypothesis is that the prefix shadow encoding for each Ti, 1 ≤ i ≤ c, is the shortest possible. Let E be the prefix shadow encoding generated by our shadow encoding algorithm and F be an optimal prefix shadow encoding. Let l and m be the lengths of E and F, respectively. Let gi and wi be the values defined by Lemma 5 for E, and let pi and qi be the corresponding values for F. By the inductive hypothesis, we have wi ≤ qi for 1 ≤ i ≤ c. If m < l, then the optimal prefix shadow encoding for T must compute a better set of HCode equivalents for each child node si. In particular, we would have max_i(pi + qi) < max_i(gi + wi). That is, given equal or larger initial lengths {qi}, the optimal prefix shadow encoding computes prefix-free codes F_ID for the children that are shorter than the prefix-free codes E_ID computed by the HCode subroutine. However, this is a contradiction, since the Huffman style encoding used to compute the HCodes minimizes the term max_i(gi + wi) [21]. Therefore, we must have l ≤ m.

Experimentally, we found that our shadow encoding algorithm is effective at minimizing shadow length. No DFA had a shadow length larger than ⌈log2 |Q|⌉ + 3, where ⌈log2 |Q|⌉ is the shortest possible shadow length.

5.2.2.4 Choosing Transitions

Section 5.2.1 describes how the TCAM tables are generated for states with all 256 transitions (i.e., for root states) using 1-dimensional complete classifier minimization. But non-root states do not have complete tables. We now describe how we apply the character bundling technique to generate the TCAM tables for non-root states. For a given DFA and a corresponding deferment forest, we construct a D2FA by choosing which transitions to encode in each transition table as follows. If state p has a default transition to state q, we identify p's deferrable transitions, which are the transitions that are common to both p's transition table and q's transition table. These deferrable transitions are optional for p's transition table; that is, they can be removed to create an incomplete transition table, or included if that results in fewer TCAM entries. Figure 5.2 is an example where including a deferrable transition produces a smaller classifier. The second entry in s2's table in Figure 5.2 can be deferred to state s0's transition table. However, this results in a classifier with at least 4 TCAM entries, whereas specifying the transition allows a classifier with just 3 TCAM entries. This leads us to the following problem, for which we give an optimal solution.
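Identifying the deferrable transitions themselves is straightforward. A one-function sketch, assuming delta[s][c] holds the DFA transition δ(s, c):

def deferrable(delta, p, q):
    # Characters on which p agrees with its deferment target q; each such
    # transition may either be dropped from p's TCAM table (deferred) or
    # kept, whichever yields fewer entries after bundling.
    return {c for c in range(256) if delta[p][c] == delta[q][c]}

The kept-or-dropped choice for each deferrable character is exactly what the dynamic program below optimizes.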
Definition 4 (Partially Deferred Incomplete One-dimensional TCAM Minimization Problem). Given a one-dimensional packet classifier f on {∗}^b and a subset D ⊆ {∗}^b, find the minimum cost prefix classifier f′ such that Cover(f′) ⊇ {∗}^b \ D and f′ is equivalent to f over Cover(f′).

Here b is the field width (in bits), D is the set of packets that can be deferred, and Cover(c) is the union of the predicates of all the rules in c (i.e., all the packets matched by c). For simplicity of description, we assume that f has a flattened rule set (i.e., one rule for each packet, with the packet as the rule predicate). Assuming the packet is a one byte character, this implies f has 256 rules.

We provide a dynamic programming formulation for solving this problem that is similar to the dynamic programming formulations used in [31] and [47] to solve the related problem where all transitions must be specified. In those previous solutions for complete classifiers, for each prefix, the dynamic program maintains an optimal solution for each possible final decision. It then specifies how to combine the optimal solutions for two matching prefixes into an optimal solution for the prefix that is the union of the two matching prefixes; in this step, two final rules, one per prefix, that have the same decision can be replaced by a single final rule for the combined prefix, resulting in a savings of one TCAM entry. The main change here is to also maintain an optimal solution for each prefix where we defer some transitions within the prefix.

We now formally specify this dynamic program, introducing the following notation. Let di, i ≥ 1, denote the actual decisions in a classifier. For a prefix P = {0,1}^k {∗}^(b−k), we use P0 to denote the prefix {0,1}^k 0 {∗}^(b−k−1) and P1 to denote the matching prefix {0,1}^k 1 {∗}^(b−k−1). For a classifier f on {∗}^b and a prefix P ⊆ {∗}^b, f_P denotes a classifier on P that is equivalent to f (i.e., the subset of rules in f with predicates that are in P); so f = f_{∗}^b. For i ≥ 1, f_P^di denotes a classifier on P that is equivalent to f and whose last rule has decision di; note that all packets in P are specified by such classifiers. Classifier f_P^d0 denotes the optimal classifier that is equivalent to f except that it possibly defers some packets within D. We use C(f_P^di) to denote the cost of the minimum classifier equivalent to f_P^di for i ≥ 0. For a statement S, [S] evaluates to 1 when S is true and to 0 otherwise. We use x to represent a packet in the prefix P currently being considered.

Theorem 7. Given a one-dimensional classifier f on {∗}^b with a set of possible decisions {d1, d2, . . . , dz}, a subset D ⊆ {∗}^b, and a prefix P ⊆ {∗}^b, C(f_P^di) is calculated as follows.

(1) For i > 0:

    C(f_P^di) = 1 + [f(x) ≠ di]                                          if f is consistent on P,
    C(f_P^di) = min over j = 1..z of ( C(f_P0^dj) + C(f_P1^dj) − 1 + [j ≠ i] )   otherwise.

(2) For i = 0:

    C(f_P^d0) = 0                                                        if P ⊆ D,
    C(f_P^d0) = min( min over i = 1..z of C(f_P^di), C(f_P0^d0) + C(f_P1^d0) )   otherwise.

Proof. (1) When i > 0, we just build a minimum cost complete classifier. The recursion and the proof are exactly the same as given in Theorem 4.1 of [31] (with decision weights = 1). (2) We consider two possibilities: either the optimal classifier is a complete classifier, or it is an incomplete classifier. If the optimal classifier is incomplete, we consider two cases. If the entire prefix P is contained within D and can be deferred, the minimum cost classifier defers all transitions and has cost 0. Otherwise, the minimum cost classifier for P is the minimum cost classifier for P0 concatenated with the minimum cost classifier for P1; this is represented by the last term in the minimization for case (2). The possibility that the optimal classifier is a complete classifier is handled by the first term in the minimization for case (2).
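The recurrence translates directly into a memoized program. Below is a runnable sketch under simplifying assumptions: f is a flattened list of decisions whose length is a power of two (256 for one byte), deferrable is a parallel list of booleans marking D, and prefixes are represented as aligned half-open ranges.

from functools import lru_cache

def min_incomplete_cost(f, deferrable):
    # Cost, in TCAM entries, of an optimal partially deferred prefix
    # classifier for the flattened table f (Theorem 7).
    decisions = sorted(set(f))

    @lru_cache(maxsize=None)
    def complete(lo, hi, d):
        # C(f_P^di) for i > 0: complete classifier on P = [lo, hi) whose
        # final (default) rule has decision d.
        if len(set(f[lo:hi])) == 1:                 # f is consistent on P
            return 1 + (f[lo] != d)
        mid = (lo + hi) // 2
        return min(complete(lo, mid, dj) + complete(mid, hi, dj) - 1
                   + (dj != d) for dj in decisions)

    @lru_cache(maxsize=None)
    def partial(lo, hi):
        # C(f_P^d0): may leave packets inside D unmatched (deferred).
        if all(deferrable[lo:hi]):
            return 0
        best = min(complete(lo, hi, d) for d in decisions)
        if hi - lo > 1:
            mid = (lo + hi) // 2
            best = min(best, partial(lo, mid) + partial(mid, hi))
        return best

    return partial(0, len(f))

For example, min_incomplete_cost([1, 1, 2, 2], [False, False, False, True]) returns 2: one rule covering the prefix 0∗ with decision 1 and one rule for packet 10 with decision 2, with packet 11 deferred.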
5.3 Table Consolidation

We now present table consolidation, where we combine multiple transition tables for different states into a single transition table such that the combined table takes less TCAM space than the total TCAM space used by the original tables. To define table consolidation, we need two new concepts: the k-decision rule and the k-decision table. A k-decision rule is a rule whose decision is an array of k decisions. A k-decision table is a sequence of k-decision rules following the first-match semantics. Given a k-decision table T and i (0 ≤ i < k), if for every rule r in T we delete all the decisions except the i-th decision, we get a 1-decision table, which we denote as T[i]. In table consolidation, we take a set of k 1-decision tables T0, · · · , Tk−1 and construct a k-decision table T such that for any i (0 ≤ i < k), the condition Ti ≡ T[i] holds, where Ti ≡ T[i] means that Ti and T[i] are equivalent (i.e., they have the same decision for every search key). We call the process of computing the k-decision table T table consolidation, and we call T the consolidated table.

5.3.1 Observations

Table consolidation is based on three observations. First, semantically different TCAM tables may share common entries with possibly different decisions. For example, the three tables for s0, s1, and s2 in Figure 5.1 have three entries in common: 01100000, 0110∗∗∗∗, and ∗∗∗∗∗∗∗∗. Table consolidation provides a novel way to remove such information redundancy. Second, given any set of k 1-decision tables T0, · · · , Tk−1, we can always find a k-decision table T such that for any i (0 ≤ i < k), the condition Ti ≡ T[i] holds. This is easy to prove, as we can use one entry per possible binary search key in T. Third, a TCAM chip typically has a built-in SRAM module that is commonly used to store lookup decisions. For a TCAM with n entries, the SRAM module is arranged as an array of n entries, where SRAM[i] stores the decision of TCAM[i] for every i. A TCAM lookup returns the index of the first matching entry in the TCAM, which is then used as the index to directly find the corresponding decision in the SRAM. In table consolidation, we essentially trade SRAM space for TCAM space because each SRAM entry needs to store multiple decisions. As SRAM is cheaper and more efficient than TCAM, moderately increasing SRAM usage to decrease TCAM usage is worthwhile.

Figure 5.6 shows the TCAM lookup table and the SRAM decision table for a 3-decision consolidated table for states s0, s1, and s2 in Figure 5.1. In this example, table consolidation reduces the number of TCAM entries from 11 to 5 for storing the transition tables of states s0, s1, and s2. This consolidated table has an ID of 0. As both the table ID and a column ID are needed to encode a state, we use the notation <Table ID>@<Column ID> to represent a state.

Figure 5.6: 3-decision table for the 3 states in Figure 5.1. (Five TCAM entries with consolidated table ID 0 and input character patterns 0110 0000, 0110 0010, 0110 0011, 0110 ∗∗∗∗, and ∗∗∗∗ ∗∗∗∗; each SRAM entry stores three decisions, one per column ID 00, 01, and 10.)

We illustrate the processing of an input character stream with table consolidation using this example 3-decision table. Suppose the input character string is "01101111, 01100011". The initial state is state s0, which is represented as 0@00. We prepend s0's table ID of 0 to the first character 01101111 to form the lookup key 001101111. This matches the fourth TCAM entry in the 3-decision table. We now need to find the decision. We use s0's column ID 00 to determine that the first decision is the correct decision. This gives us the state s1, which is represented as 0@01. We then prepend s1's table ID of 0 to the second character 01100011 to form the lookup key 001100011. This matches the third TCAM entry. We use s1's column ID of 01 to determine that the second decision is the correct decision. This gives us the next state s2, which has code 0@10. Because s2 is an accepting state, we would accept the input string. Note that because this DFA has only 3 states, which have all been consolidated together, all three states have the same table ID of 0. In general, with more states than just those consolidated together, we would have more table IDs.
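A sketch of this <table>@<column> lookup convention in Python, with ternary patterns as '0'/'1'/'*' strings and a linear scan standing in for the TCAM:

def consolidated_lookup(tcam, sram, state, char_bits):
    # tcam: list of (table-ID bits, ternary character pattern);
    # sram: per-entry arrays of k decisions, one per column;
    # state: (table_id_bits, column_index).
    match = lambda bits, pat: all(p in ('*', b) for b, p in zip(bits, pat))
    tid, col = state
    for i, (tpat, cpat) in enumerate(tcam):
        if match(tid, tpat) and match(char_bits, cpat):
            return sram[i][col]        # the column ID selects the decision
    raise ValueError('no matching entry')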
There are two key technical challenges in table consolidation. The first challenge is how to consolidate k 1-decision transition tables into a k-decision transition table. The second challenge is deciding which 1-decision transition tables should be consolidated together. Intuitively, the more similar two 1-decision transition tables are, the more TCAM space we save by consolidating them together. However, we have to consider the deferment relationships among states. We now present our solutions to these two challenges.

5.3.2 Computing a k-decision table

In this section, we assume we know which states need to be consolidated together and present a local state consolidation algorithm that takes a k1-decision table for state set Si and a k2-decision table for another state set Sj as its input and outputs a consolidated (k1 + k2)-decision table for state set Si ∪ Sj. For ease of presentation, we first assume that k1 = k2 = 1. Let s1 and s2 be the two input states, which have default transitions to states s3 and s4, respectively. The consolidated table will be assigned a common table ID X. We assign state s1 column ID 0 and state s2 column ID 1. Thus, we encode s1 as X@0 and s2 as X@1. We enforce the constraint that if we do not consolidate s3 and s4 together, then s1 and s2 cannot defer any transitions at all. If we do consolidate s3 and s4 together, then s1 and s2 may have incomplete transition tables due to their default transitions to s3 and s4, respectively.

The key concepts underlying this algorithm are breakpoints and critical ranges. To define breakpoints, it is helpful to view Σ as numbers ranging from 0 to |Σ| − 1; given 8 bit characters, |Σ| = 256. For any state s, we define a character i ∈ Σ to be a breakpoint for s if δ(s, i) ≠ δ(s, i − 1). For the end cases, we define 0 and |Σ| to be breakpoints for every state s. Let b(s) be the set of breakpoints for state s. We then define b(S) = ∪_{s∈S} b(s) to be the set of breakpoints for a set of states S ⊆ Q. Finally, for any set of states S, we define r(S) to be the set of ranges defined by b(S): r(S) = {[0, b2 − 1], [b2, b3 − 1], . . . , [b_{|b(S)|−1}, |Σ| − 1]}, where bi is the ith smallest breakpoint in b(S). Note that 0 = b1 is the smallest breakpoint and |Σ| is the largest breakpoint in b(S). Within r(S), we label the range beginning at breakpoint bi as ri for 1 ≤ i ≤ |b(S)| − 1. If δ(s, bi) is deferred, then ri is a deferred range for s.

When we consolidate s1 and s2 together, we compute b({s1, s2}) and r({s1, s2}). For each r ∈ r({s1, s2}) where r is not a deferred range for both s1 and s2, we create a consolidated transition rule where the decision of the entry is the ordered pair of decisions for states s1 and s2 on r. For each r ∈ r({s1, s2}) where r is a deferred range for one of s1 and s2 but not the other, we fill in r in the incomplete transition table where it is deferred, and we create a consolidated entry where the decision of the entry is the ordered pair of decisions for states s1 and s2 on r. Finally, for each r ∈ r({s1, s2}) where r is a deferred range for both s1 and s2, we do not create a consolidated entry. This produces a non-overlapping set of transition rules that may be incomplete if some ranges do not have a consolidated entry. If the final consolidated transition table is complete, we minimize it using the optimal 1-dimensional TCAM minimization algorithm in [30, 47]. If the table is incomplete, we minimize it using the 1-dimensional incomplete classifier minimization algorithm in [31]. We generalize this algorithm to cases where k1 > 1 or k2 > 1 by simply considering k1 + k2 states when computing the breakpoints and ranges.
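A sketch of the breakpoint computation and rule generation, under simplifying assumptions: dfa[s][c] holds the full DFA transition δ(s, c), table[s][c] holds state s's possibly incomplete table with None marking a deferred character, and each range is either wholly deferred or wholly specified for a given state.

def consolidate(table, dfa, s1, s2):
    # b({s1, s2}): characters where either state's DFA transition changes.
    bp = {0, 256}
    for s in (s1, s2):
        bp.update(c for c in range(1, 256) if dfa[s][c] != dfa[s][c - 1])
    bp = sorted(bp)
    rules = []
    for lo, hi in zip(bp, bp[1:]):              # the ranges of r({s1, s2})
        deferred1 = table[s1][lo] is None       # deferred range for s1?
        deferred2 = table[s2][lo] is None
        if deferred1 and deferred2:
            continue                            # deferred by both: no entry
        # one-sided deferrals are filled in from the DFA transition
        rules.append(((lo, hi - 1), (dfa[s1][lo], dfa[s2][lo])))
    return rules

The resulting list of 2-decision range rules is then handed to the complete or incomplete 1-dimensional minimization algorithm, as described above.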
5.3.3 Choosing States to Consolidate

We now describe our global consolidation algorithm for determining which states to consolidate together. As we observed earlier, if we want to consolidate two states s1 and s2 together, we need to consolidate their parent nodes in the deferment forest as well, or else we lose all the benefits of shadow encoding. Thus, we propose to consolidate two deferment trees together. A consolidated deferment tree must satisfy the following properties. First, each node is consolidated with at most one node in the second tree; some nodes may not be consolidated with any node in the second tree. Second, a level i node in one tree must be consolidated with a level i node in the second tree. The level of a node is its distance from the root; we define the root to be a level 0 node. Third, if two level i nodes are consolidated together, their level i − 1 parent nodes must also be consolidated together. An example legal matching of nodes between two deferment trees is depicted in Figure 5.7.

Figure 5.7: Consolidating two trees. (A legal level-by-level matching between the nodes x0, . . . , x9 of one deferment tree and the nodes y0, . . . , y7 of another.)

Given two deferment trees, we start the consolidation process from the roots. After we consolidate the two roots, we need to decide how to pair their children together. For each pair of nodes that are consolidated together, we again must choose how to pair their children together, and so on. We make an optimal choice using a combination of dynamic programming and matching techniques. Suppose we wish to compute the minimum cost C(x, y), measured in TCAM entries, of consolidating two subtrees rooted at nodes x and y, where x has u children X = {x1, . . . , xu} and y has v children Y = {y1, . . . , yv}. We first recursively compute C(xi, yj) for 1 ≤ i ≤ u and 1 ≤ j ≤ v, using our local state consolidation algorithm as a subroutine. We then construct a complete bipartite graph K_{X,Y} such that each edge (xi, yj) has edge weight C(xi, yj) for 1 ≤ i ≤ u and 1 ≤ j ≤ v. Then C(x, y) is the cost of a minimum weight matching [24, 35] of K_{X,Y} plus the cost of consolidating x and y. When |X| ≠ |Y|, to make the sets equal in size, we pad the smaller set with null states that defer all transitions.

Finally, we must decide which trees to consolidate together. We assume that we produce k-decision tables where k is a power of 2. We describe how we solve the problem for k = 2 first. We create an edge-weighted complete graph where each deferment tree is a node and where the weight of each edge is the cost of consolidating the two corresponding deferment trees together. We find a minimum weight matching [16, 18] of this complete graph to give us an optimal pairing for k = 2. For larger k = 2^l, we repeat this process l − 1 more times. Our matching is not necessarily optimal for k > 2. In some cases, the deferment forest may have only one tree. In such cases, we consider consolidating the subtrees rooted at the children of the root of the single deferment tree. We also consider similar options if we have a few deferment trees but they are not structurally similar. The pseudo-code for the tree consolidation algorithm is given in Figure 5.8.

Input: Deferment forest DF with r tree roots s1, . . . , sr.
Output: An optimal matching of the r roots.

1   For each pair of roots si and sj, compute C(si, sj);
2   Construct the complete graph Kr with the roots as vertices and C(si, sj) as edge weights;
3   return MinimumWeightMatching(Kr);

4   Function C(s1, s2)
    // Base case
5   if s1 and s2 have no children then
6       return ConsolidatedCost(s1, s2);
    // Recursive case
7   Attach NULL children so that both s1 and s2 have the same number of children, q;
8   Construct the complete bipartite graph Kq,q with the children of s1 and s2 as the vertices and C(sx, sy) as the edge weight between vertices sx and sy;
9   M ← MinimumWeightBipartiteMatching(Kq,q), giving the matching of the children;
10  count ← 0;
11  foreach matched pair (sx, sy) ∈ M do
12      count ← count + C(sx, sy);
13  return count + ConsolidatedCost(s1, s2);

Figure 5.8: Algorithm for Consolidating Trees.
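A compact sketch of C(x, y) using SciPy's Hungarian-algorithm routine for the bipartite matching step. Assumptions: SciPy is available, pair_cost stands in for the local state consolidation cost (the pseudo-code's ConsolidatedCost), None acts as a null state that defers everything, and pair_cost(None, None) = 0.

import numpy as np
from scipy.optimize import linear_sum_assignment

def consolidation_cost(x, y, children, pair_cost):
    # Minimum TCAM cost of consolidating the subtrees rooted at x and y.
    cx = list(children.get(x, [])) if x is not None else []
    cy = list(children.get(y, [])) if y is not None else []
    n = max(len(cx), len(cy))
    cx += [None] * (n - len(cx))            # pad with null states
    cy += [None] * (n - len(cy))
    cost = pair_cost(x, y)
    if n:
        w = np.array([[consolidation_cost(a, b, children, pair_cost)
                       for b in cy] for a in cx])
        rows, cols = linear_sum_assignment(w)   # min-weight perfect matching
        cost += int(w[rows, cols].sum())
    return cost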
5.3.3.1 Greedy Matching

Our algorithm using the matching subroutines gives the optimal pairing of deferment trees but can be relatively slow on larger DFAs. When running time is a concern, we use a greedy matching routine instead. When we need to match the children of two nodes x and y, we consider one child at a time from the node with fewer children (say x). First, all children of y are set unmarked. For each child xi of x, we find the best match among the unmarked children of y, match them up, and mark the matched child of y. The best match for xi is given by

argmin over yj ∈ {unmarked children of y} of C(xi, yj) / (C(xi) + C(yj)),

where C(x) is just the cost (in TCAM entries) of the subtree rooted at x. If C(xi) + C(yj) = 0, then we set the ratio to 0.5. All children of y that remain unmarked at the end are matched with null states. We consider the children of x in decreasing order of C(xi) to prioritize the larger children of x. We use the same approach for matching roots: first, all roots are set unmarked; each time, we consider the largest unmarked root, find the best match for it, and then mark the newly matched roots.

In our experiments, this greedy approach runs much faster than the optimal approach, and the resulting classifier size is not much larger. We also observe that another greedy approach, one that uses C(xi, yj) instead of C(xi, yj) / (C(xi) + C(yj)), produces classifiers with much larger TCAM sizes. That approach often matches a large child of x with a small child of y that it does not align well with.

5.3.4 Effectiveness of Table Consolidation

We now explain why table consolidation works well on real-world RE sets. Most real-world RE sets contain REs with wildcard closures '.∗', where the wildcard '.' matches any character and the closure '∗' allows for unlimited repetitions of the preceding character. Wildcard closures create deferment trees with lots of structural similarity. For example, consider the D2FA in Figure 5.9 for the RE set {/abc/, /abd/, /e.∗f/}, where we use dashed arrows to represent the default transitions. The second wildcard closure '.∗' in the RE /e.∗f/ duplicates the entire DFA sub-structure for recognizing REs /abc/ and /abd/. Thus, table consolidation of the subtree (0, 1, 2, 3, 4) with the subtree (5, 6, 7, 8, 9, 10) leads to significant space savings.

Figure 5.9: D2FA for the RE set {/abc/, /abd/, /e.∗f/}. (States 0-4 recognize /abc/ and /abd/ directly, with accepting states 3/1 and 4/2; after an e, states 5-10 replicate that structure, with accepting states 8/1, 9/2, and 10/3 for /abc/, /abd/, and /e.∗f/; dashed arrows are default transitions.)
5.4 Variable Striding

We explore ways to improve RE matching throughput by consuming multiple characters per TCAM lookup. One possibility is a k-stride DFA, which uses k-stride transitions that consume k characters per transition. Although k-stride DFAs can speed up RE matching by up to a factor of k, the number of states and transitions can grow exponentially in k. To limit the state and transition space explosion, we propose variable striding using variable-stride DFAs. A k-var-stride DFA consumes between 1 and k characters in each transition, with at least one transition consuming k characters. Conceptually, each state in a k-var-stride DFA has 256^k transitions, and each transition is labeled with (1) a unique string of k characters and (2) a stride length j (1 ≤ j ≤ k) indicating the number of characters consumed. In TCAM-based variable striding, each TCAM lookup uses the next k consecutive characters as the lookup key, but the number of characters consumed in the lookup varies from 1 to k; thus, the lookup decision contains both the destination state ID and the stride length.

There are many technical challenges in implementing variable striding. First, we need to control the exponential growth in the number of states. Second, we need to control the exponential growth in the number of transitions. Third, we need to carefully choose which transitions to expand from 1-stride to multi-stride given a specific amount of available TCAM space. Fourth, we need to carefully decide on the maximum stride length k. Increasing k can help by increasing average RE matching throughput; however, increasing k can hurt by requiring more TCAM space. Specifically, implementing a k-var-stride DFA in TCAM requires 8k bits for the k input characters in each lookup key. The width of a TCAM chip is configurable, but not arbitrary: commercially available TCAM chips typically can be configured with widths of 36, 72, 144, 288, or 576 bits. We must choose k so that we optimize throughput while not wasting bits in each TCAM entry.

5.4.1 Observations

We use an example to show how variable striding can achieve a significant RE matching throughput increase with a small and controllable space increase. Figure 5.10 shows a 3-var-stride transition table that corresponds to state s0 in Figure 5.1. This table has only 7 entries, as opposed to 116 entries in a full 3-stride table for s0. If we assume that each of the 256 characters is equally likely to occur, the average number of characters consumed per 3-var-stride transition of s0 is 1 × 1/16 + 2 × 15/256 + 3 × 225/256 = 2.82.

                      TCAM                                      SRAM
Src state  Inp char 1   Inp char 2   Inp char 3      Dest state   Stride
s0         0110 0000    ∗∗∗∗ ∗∗∗∗    ∗∗∗∗ ∗∗∗∗   →   s0           1
s0         0110 ∗∗∗∗    ∗∗∗∗ ∗∗∗∗    ∗∗∗∗ ∗∗∗∗   →   s1           1
s0         ∗∗∗∗ ∗∗∗∗    0110 0000    ∗∗∗∗ ∗∗∗∗   →   s0           2
s0         ∗∗∗∗ ∗∗∗∗    0110 ∗∗∗∗    ∗∗∗∗ ∗∗∗∗   →   s1           2
s0         ∗∗∗∗ ∗∗∗∗    ∗∗∗∗ ∗∗∗∗    0110 0000   →   s0           3
s0         ∗∗∗∗ ∗∗∗∗    ∗∗∗∗ ∗∗∗∗    0110 ∗∗∗∗   →   s1           3
s0         ∗∗∗∗ ∗∗∗∗    ∗∗∗∗ ∗∗∗∗    ∗∗∗∗ ∗∗∗∗   →   s0           3

Figure 5.10: 3-var-stride transition table for s0.

5.4.2 Eliminating State Explosion

We first explain how converting a 1-stride DFA to a k-stride DFA causes state explosion. For a source state and destination state pair (s, d), a k-stride transition path from s to d may contain k − 1 intermediate states (excluding d); for each unique combination of accepting states that appears on a k-stride transition path from s to d, we need to create a new destination state, because a unique combination of accepting states implies that the input has matched a unique combination of REs. This can be a very large number of new states. We eliminate state explosion by ending any k-var-stride transition path at the first accepting state it reaches. Thus, a k-var-stride DFA has the exact same state set as its corresponding 1-stride DFA.
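The following sketch pins down that cutoff rule (hypothetical layout: delta[s][c] is the 1-stride transition and accepting[s] a boolean). It enumerates a state's k-var-stride transitions, cutting each path at the first accepting state; the enumeration itself is exponential in k and is shown only to make the definition precise.

def var_stride_transitions(delta, accepting, s, k):
    # Each path from s ends at the first accepting state it reaches, or
    # after k characters, so no new destination states are ever created.
    out = []
    def walk(state, chars):
        for c in range(256):
            t = delta[state][c]
            if accepting[t] or len(chars) + 1 == k:
                out.append((chars + [c], t, len(chars) + 1))
            else:
                walk(t, chars + [c])
    walk(s, [])
    return out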
Ending k-var-stride transitions at accepting states does have subtle interactions with table consolidation and shadow encoding. We end any k-var-stride consolidated transition path at the first accepting state reached on any one of the paths being consolidated, which can reduce the expected throughput increase of variable striding. There is a similar but even more subtle interaction with shadow encoding, which we describe in the next section.

5.4.3 Controlling Transition Explosion

In a k-stride DFA converted from a 1-stride DFA with alphabet Σ, a state has |Σ|^k outgoing k-stride transitions. Although we can leverage our techniques of character bundling and shadow encoding to minimize the number of required TCAM entries, the rate of growth tends to be exponential with respect to the stride length k. We have two key ideas to control transition explosion: self-loop unrolling and k-var-stride transition sharing.

5.4.3.1 Self-Loop Unrolling Algorithm

We now consider root states, all of which are self-looping states. We have two methods to compute the k-var-stride transition tables of root states. The first is direct expansion (stopping transitions at accepting states); since these states do not defer to other states, this results in an exponential increase in table size with respect to k. The second method, which we call self-loop unrolling, scales linearly with k. Self-loop unrolling increases the stride of all the self-loop transitions encoded by the last default TCAM entry. It starts with a root state's j-var-stride transition table encoded as a compressed TCAM table of n entries whose final entry is a default entry representing most of the self-loops of the root state. (Note that given any complete TCAM table where the last entry is not a default entry, we can always replace that last entry with a default entry without changing the semantics of the table.) We generate the (j+1)-var-stride transition table by expanding the last default entry into n new entries, obtained by prepending 8 ∗'s as an extra default field to the beginning of the original n entries. This produces a (j+1)-var-stride transition table with 2n − 1 entries.

We next illustrate the idea of self-loop unrolling using an example. Consider state s0 of Figure 5.1. The default transition in s0's table is a self-loop that is matched by 240 characters; one more self-loop is matched by the first TCAM entry in s0's table. We can "unroll" this self-loop and increase the stride of many but not all 2-stride and 3-stride transitions as follows. First, we leave in place the first two 1-stride transitions. We then make 2-stride copies of these transitions, where we shift the characters over by one position and put a default character in the first position. These 2-stride transitions capture the case where the first character self-loops but is not 01100000, and the second character leaves state s0 or is 01100000. We then make 3-stride copies of these transitions, where we shift the characters over by one position again and put default characters in the first two positions. Finally, we include a stride-3 default transition that self-loops back to state s0. The resulting 7-entry variable-stride table is shown in Figure 5.10. In this example, we could continue applying self-loop unrolling to create even larger stride transitions at an additional cost of only 2 TCAM entries per extra character consumed.
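A sketch of one unrolling round, with entries as (character bit pattern, destination, stride) tuples whose final entry is the all-∗ default; patterns are left-aligned and conceptually padded with trailing ∗'s to the full 8k-bit key width. Repeated rounds regenerate some already-present patterns, which first-match semantics makes redundant, so the sketch drops duplicates; under these assumptions, two rounds on s0's 3-entry 1-stride table yield exactly the 7 entries of Figure 5.10.

def unroll_once(entries, char_bits=8):
    # Expand the final default entry: every entry gets a copy with one
    # character's worth of *'s prepended and its stride increased by one;
    # the shifted default becomes the new final default.
    *body, default = entries
    wild = '*' * char_bits
    shifted = [(wild + pat, dest, stride + 1) for pat, dest, stride in body]
    new_default = (wild + default[0], default[1], default[2] + 1)
    out, seen = [], set()
    for e in body + shifted + [new_default]:
        if e[0] not in seen:        # drop re-created (shadowed) patterns
            seen.add(e[0])
            out.append(e)
    return out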
5.4.3.2 k-var-stride Transition Sharing Algorithm

Similar to 1-stride DFAs, there are many transition sharing opportunities in a k-var-stride DFA. Consider two states s1 and s2 in a 1-stride DFA where s1 defers to s2. The deferment relationship implies that s1 shares many common 1-stride transitions with s2. In the k-var-stride DFA constructed from the 1-stride DFA, all k-var-stride transitions that begin with these common 1-stride transitions are also shared between s1 and s2. Furthermore, two transitions that do not begin with these common 1-stride transitions may still be shared between s1 and s2. For example, in the 1-stride DFA fragment in Figure 5.11, although s1 and s2 do not share a common transition on character a, when we construct the 2-var-stride DFA, s1 and s2 share the same 2-stride transition on the string aa, which ends at state s5.

Figure 5.11: States s1 and s2 share the transition aa. (On a, s1 goes to s3 and s2 goes to s4, but both s3 and s4 go to s5 on a, so the 2-stride transition on aa from either s1 or s2 ends at s5.)

To promote transition sharing among states in a k-var-stride DFA, we first need to decide on the deferment relationship among states. The ideal deferment relationship would be calculated based on the SRG of the final k-var-stride DFA. However, the k-var-stride DFA cannot be finalized before we compute the deferment relationship among states, because the final k-var-stride DFA is subject to many factors such as the available TCAM space. There are two approximation options for the final k-var-stride DFA when calculating the deferment relationship: the 1-stride DFA and the full k-stride DFA. We have tried both options in our experiments, and the difference in the resulting TCAM space is negligible. Thus, we simply use the deferment forest of the 1-stride DFA in computing the transition tables for the k-var-stride DFA.

Second, for any two states s1 and s2 where s1 defers to s2, we need to compute s1's k-var-stride transitions that are not shared with s2, because those transitions will constitute s1's k-var-stride transition table. Although this computation is trivial for 1-stride DFAs, it is a significant challenge for k-var-stride DFAs because each state has too many (256^k) k-var-stride transitions. The straightforward algorithm that enumerates all transitions has a time complexity of O(|Q|^2 |Σ|^k), which grows exponentially with k. We propose a dynamic programming algorithm with a time complexity of O(|Q|^2 |Σ| k), which grows linearly with k. Our key idea is that the non-shared transitions for a k-stride DFA can be quickly computed from the non-shared transitions of a (k−1)-var-stride DFA. For example, consider the two states s1 and s2 in Figure 5.11, where s1 defers to s2. On character a, s1 transits to s3 while s2 transits to s4. Assuming that we have computed all (k−1)-var-stride transitions of s3 that are not shared with the (k−1)-var-stride transitions of s4, if we prepend all these (k−1)-var-stride transitions with character a, the resulting k-var-stride transitions of s1 are all not shared with the k-var-stride transitions of s2, and therefore should all be included in s1's k-var-stride transition table.

Formally, using n(si, sj, k) to denote the number of k-stride transitions of si that are not shared with sj, our dynamic programming algorithm uses the following recursive relationship between n(si, sj, k) and n(si, sj, k − 1):

n(si, sj, 0) = 0 if si = sj, and 1 if si ≠ sj;    (5.1)

n(si, sj, k) = Σ over c ∈ Σ of n(δ(si, c), δ(sj, c), k − 1).    (5.2)
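A memoized sketch of this recursion (assuming delta[s][c] holds δ(s, c)); for simplicity it ignores the accepting-state cutoff discussed next.

from functools import lru_cache

def nonshared(delta, si, sj, k):
    # n(si, sj, k): the number of k-stride transitions of si that are not
    # shared with sj, per Equations (5.1) and (5.2).
    @lru_cache(maxsize=None)
    def n(p, q, depth):
        if depth == 0:
            return 0 if p == q else 1
        return sum(n(delta[p][c], delta[q][c], depth - 1)
                   for c in range(256))
    return n(si, sj, k)

Because the memo table is indexed by state pairs and depths, each of the O(|Q|^2 k) subproblems is solved once at a cost of |Σ| additions, matching the stated O(|Q|^2 |Σ| k) bound.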
Formally, using n(si, sj, k) to denote the number of k-stride transitions of si that are not shared with sj, our dynamic programming algorithm uses the following recursive relationship between n(si, sj, k) and n(si, sj, k − 1):

    n(si, sj, 0) = 0 if si = sj, and 1 if si ≠ sj                    (5.1)

    n(si, sj, k) = Σ_{c ∈ Σ} n(δ(si, c), δ(sj, c), k − 1)            (5.2)

The above formulas assume that the intermediate states on the k-stride paths starting from si or sj are all non-accepting. For state si, we stop increasing the stride length along a path whenever we encounter an accepting state on that path or on the corresponding path starting from sj. The reason is similar to why we stop a consolidated path at an accepting state, but the reasoning is more subtle. Let p be the string that leads sj to an accepting state. The key observation is that any k-var-stride path that starts from sj and begins with p ends at that accepting state. This means that si cannot exploit transition sharing on any string that begins with p.

Figure 5.12 shows the resulting 2-var-stride transition tables for all three states s0, s1, and s2 of the D2FA in Figure 5.3(a). Note that the one transition out of state s1 and the two self-loop transitions of state s2 have stride 1 because they end at s2, an accepting state.

    TCAM                                       SRAM
    Src state  Inp char1    Inp char2          Dest state  Stride
    s1         [c]          ∗                  s2          1
    s2         [b..c]       [c]                s2          2
    s2         [a]          ∗                  s2          1
    s2         [d..o]       ∗                  s2          1
    s0         [a..o]       [0..96]            s0          2
    s0         [a..o]       [a]                s2          2
    s0         [a..o]       [b]                s1          2
    s0         [a..o]       [c..o]             s2          2
    s0         [a..o]       [112..255]         s0          2
    s0         [0..96]      [0..96]            s0          2
    s0         [0..96]      [a..o]             s1          2
    s0         [0..96]      [112..255]         s0          2
    s0         [112..255]   [0..96]            s0          2
    s0         [112..255]   [a..o]             s1          2
    s0         [112..255]   [112..255]         s0          2

Figure 5.12: Uncompressed 2-var-stride transition tables for the D2FA in Figure 5.3(a) (a = 97, o = 111)

The above dynamic programming algorithm produces non-overlapping and incomplete transition tables, which we compress using the 1-dimensional incomplete classifier minimization algorithm in [31].
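The recurrence translates directly into a memoized procedure. The following Python sketch is illustrative only; it assumes the 1-stride transition function is given as a nested mapping delta[state][c], and it omits the accepting-state cutoff described above. Memoization gives the claimed O(|Q|^2 |Σ| k) behavior, since there are at most |Q|^2 k distinct subproblems, each combining |Σ| smaller results.

    from functools import lru_cache

    def make_nonshared_counter(delta, alphabet):
        @lru_cache(maxsize=None)
        def n(si, sj, k):
            # Equation (5.1): a 0-stride "transition" is shared iff the
            # two states are identical.
            if k == 0:
                return 0 if si == sj else 1
            # Equation (5.2): extend each 1-stride prefix c by one stride.
            return sum(n(delta[si][c], delta[sj][c], k - 1) for c in alphabet)
        return n

For example, make_nonshared_counter(delta, alphabet)(s1, s2, k) returns the number of k-stride transitions of s1 that are not shared with s2, assuming states are hashable identifiers.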
5.4.4 Variable Striding Selection Algorithm

We now propose solutions for the third key challenge: which states should have their stride lengths increased, and by how much; that is, how should we compute the transition function δ? Note that each state can independently choose its variable stride length as long as the final transition tables are composed together according to the deferment forest. This is easily proven from the way we generate k-var-stride transition tables. For any two states s1 and s2 where s1 defers to s2, the way we generate s1's k-var-stride transition table may seem to assume that s2's transition table is also k-var-stride; in fact, we make no such assumption. For example, if we choose k-var-stride (k ≥ 2) for s1 and 1-stride for s2, all strings from s1 will be processed correctly; the only issue is that strings deferred to s2 will process only one character. We view this as a packing problem: given a TCAM capacity C, for each state s we select a variable stride length value Ks such that Σ_{s ∈ Q} |T(s, Ks)| ≤ C, where T(s, Ks) denotes the Ks-var-stride transition table of state s. This packing problem has a flavor of the knapsack problem, but an exact formulation of an optimization function is impossible without making assumptions about the input character distribution. We propose the following algorithm for finding a feasible δ that strives to maximize the minimum stride of any state.

First, we use all the 1-stride tables as our initial selection. Second, for each j-var-stride (j ≥ 2) table t of state s, we create a tuple (l, d, |t|), where l denotes the variable stride length, d denotes the distance from state s to the root of the deferment tree that s belongs to, and |t| denotes the number of entries in t. As the stride length l increases, the individual table size |t| may increase significantly, particularly for the complete tables of root states. To balance table sizes, we set limits on the maximum allowed table size for root states and for non-root states. If a root state table exceeds the root state threshold when we create its j-var-stride table, we instead apply self-loop unrolling once to its (j−1)-var-stride table to produce the j-var-stride table. If a non-root state table exceeds the non-root state threshold when we create its j-var-stride table, we simply use its (j−1)-var-stride table as its j-var-stride table. Third, we sort the tables by these tuple values in increasing order, first by l, then by d, then by |t|, and finally by a pseudorandom coin flip to break ties. Fourth, we consider each table t′ in this order. Let t be the table for the same state s in the current selection. If replacing t by t′ does not exceed our TCAM capacity C, we make the replacement.
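The following Python sketch illustrates this greedy selection. The tuple layout and names are our own illustration; we assume the candidate j-var-stride tables (with the root and non-root size thresholds already applied) are given, and we track only table sizes.

    import random

    def select_tables(one_stride_sizes, candidates, capacity):
        # one_stride_sizes: {state: size of its 1-stride table}, the
        # initial selection.  candidates: (stride, depth, size, state)
        # tuples for every j-var-stride table with j >= 2.
        chosen = dict(one_stride_sizes)
        used = sum(chosen.values())
        # Sort by stride length, then deferment-tree depth, then table
        # size, with a pseudorandom coin flip breaking remaining ties.
        candidates.sort(key=lambda t: (t[0], t[1], t[2], random.random()))
        for stride, depth, size, state in candidates:
            # Replace the state's currently selected table if the TCAM
            # capacity C still permits it.
            if used - chosen[state] + size <= capacity:
                used += size - chosen[state]
                chosen[state] = size
        return chosen  # state -> size of the table selected for it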
5.5 Implementation and Modeling

We now describe some implementation issues associated with our TCAM-based RE matching solution. First, the only hardware required to deploy our solution is the off-the-shelf TCAM (and its associated SRAM). Many deployed networking devices already have TCAMs, but these TCAMs are likely being used for other purposes. Thus, to deploy our solution on existing network devices, we would need to share an existing TCAM with another application. Alternatively, new networking devices can be designed with an additional dedicated TCAM chip.

Second, we describe how we update the TCAM when an RE set changes. The first step is to compute the new DFA and its corresponding TCAM representation. For the moment, we recompute the TCAM representation from scratch, but we believe a better solution can be found; this is something we plan to work on in the future. We report some timing results in our experimental section. Fortunately, this is an offline process during which the DFA for the original RE set can still be used. The second step is loading the new entries into the TCAM. If we have a second TCAM to support updates, this rewrite can occur while the first TCAM chip is still processing packet flows. If not, RE matching must halt while the new entries are loaded; this step can be performed very quickly, so the delay will be very short. In contrast, updating FPGA circuitry takes significantly longer.

We have not developed a full implementation of our system. Instead, we have only developed the algorithms that take an RE set and construct the associated TCAM entries. Thus, we can only estimate the throughput of our system using TCAM models. We use Agrawal and Sherwood's TCAM model [3], assuming that each TCAM chip is manufactured with a 0.18µm process, to compute the estimated latency of a single TCAM lookup based on the number of TCAM entries searched. These model latencies are shown in Table 5.1. We recognize that some processing must be done besides the TCAM lookup, such as composing the next state ID with the next input character; however, because the TCAM lookup latency is much larger than that of any other operation, we focus only on this parameter when evaluating the potential throughput of our system.

    Entries   Chip size (36-bit wide)   Chip size (72-bit wide)   Latency (ns)
    1024      0.037 Mb                  0.074 Mb                  0.94
    2048      0.074 Mb                  0.147 Mb                  1.10
    4096      0.147 Mb                  0.295 Mb                  1.47
    8192      0.295 Mb                  0.590 Mb                  1.84
    16384     0.590 Mb                  1.18 Mb                   2.20
    32768     1.18 Mb                   2.36 Mb                   2.57
    65536     2.36 Mb                   4.72 Mb                   2.94
    131072    4.72 Mb                   9.44 Mb                   3.37

Table 5.1: TCAM size and latency

5.6 Experimental Results

In this section, we evaluate our TCAM-based RE matching solution on real-world RE sets, focusing on two metrics: TCAM space and RE matching throughput.

5.6.1 Methodology

We use the same 8 RE sets used in Section 4.5 for the main results. To test the scalability of our algorithms, we use one family of 34 REs from a recent public release of the Snort rules with headers ($EXTERNAL_NET, $HTTP_PORTS, $HOME_NET, any), most of which contain wildcard closures `.*'. We added REs one at a time until the number of DFA states reached 305,339. We name this family Scale.

We calculate TCAM space by multiplying the number of entries by the TCAM width: 36, 72, 144, 288, or 576 bits. For a given DFA, we compute a minimum width by summing the number of state ID bits required and the number of input bits required. In all cases, we needed at most 16 state ID bits. For 1-stride DFAs, we need exactly 8 input character bits, and for 7-var-stride DFAs, we need exactly 56 input character bits. We then calculate the TCAM width by rounding the minimum width up to the smallest legal TCAM width that is at least as large. For all our 1-stride DFAs, we use TCAM width 36. For all our 7-var-stride DFAs, we use TCAM width 72.

We estimate the potential throughput of our TCAM-based RE matching solution by using the model TCAM lookup speeds computed in Section 5.5 to determine how many TCAM lookups can be performed per second for a given number of TCAM entries, and then multiplying this number by the number of characters processed per TCAM lookup. With 1-stride DFAs, the number of characters processed per lookup is 1. For 7-var-stride DFAs, we measure the average number of characters processed per lookup on a variety of input streams. We use Becchi et al.'s network traffic generator [11] to generate a variety of synthetic input streams. This traffic generator includes a parameter pM that models the probability of malicious traffic. With probability pM, the next character is chosen so that it leads away from the start state. With probability (1 − pM), the next character is chosen uniformly at random.

5.6.2 Results on 1-stride DFAs

                        TS                      TS + TC2                TS + TC4
    RE set   #states    tcam   #rows   thru     tcam   #rows   thru     tcam   #rows   thru
                        Mbits  /state  Gbps     Mbits  /state  Gbps     Mbits  /state  Gbps
    Bro217     6533     0.31   1.40    3.64     0.21   0.94    4.35     0.17   0.78    4.35
    C613      11308     0.63   1.61    3.11     0.52   1.35    3.64     0.45   1.17    3.64
    C10       14868     0.61   1.20    3.11     0.31   0.61    3.64     0.16   0.32    4.35
    C7        24750     1.00   1.18    3.11     0.53   0.62    3.64     0.29   0.34    3.64
    C8         3108     0.13   1.20    5.44     0.07   0.62    5.44     0.03   0.33    8.51
    Snort24   13886     0.55   1.16    3.64     0.30   0.64    3.64     0.18   0.38    4.35
    Snort31   20068     1.43   2.07    2.72     0.81   1.17    2.72     0.50   0.72    3.64
    Snort34   13825     0.56   1.18    3.11     0.30   0.62    3.64     0.17   0.36    4.35

Table 5.2: TCAM size and throughput for 1-stride DFAs

Table 5.2 shows our experimental results on the 8 RE sets using 1-stride DFAs. We use TS to denote our transition sharing algorithm, which includes both character bundling and shadow encoding. We use TC2 and TC4 to denote our table consolidation algorithm when we consolidate at most 2 and at most 4 transition tables together, respectively.
For each RE set, we measure the number of states in its 1-stride DFA, the resulting TCAM space in megabits, the average number of TCAM table entries per state, and the projected RE matching throughput; the number of TCAM entries is the number of states times the average number of entries per state. The TS columns show our results when we apply TS alone to each RE set. The TS+TC2 and TS+TC4 columns show our results when we apply both TS and TC, with consolidation limits of 2 and 4, respectively, to each RE set.

We draw the following conclusions from Table 5.2. (1) Our RE matching solution is extremely effective in saving TCAM space. Using TS+TC4, the maximum TCAM size for the 8 RE sets is only 0.50 Mb, which is two orders of magnitude smaller than the current largest commercially available TCAM chip size of 72 Mb. More specifically, the number of TCAM entries per DFA state ranges between 0.32 and 1.17 when we use TC4. We require 16, 32, or 64 SRAM bits per TCAM entry for TS, TS+TC2, and TS+TC4, respectively, as we need to record 1, 2, or 4 16-bit state IDs in each decision. (2) Transition sharing alone is very effective. With the transition sharing algorithm alone, the maximum TCAM size is only 1.43 Mb for the 8 RE sets. Furthermore, we see a relatively tight range of TCAM entries per state, from 1.16 to 2.07. Transition sharing works extremely well on all 8 RE sets, including those with wildcard closures and those consisting primarily of strings. (3) Table consolidation is very effective. On the 8 RE sets, adding TC2 to TS improves compression by an average of 41% (ranging from 16% to 49%), where the maximum possible is 50%; we measure improvement by computing (TS − (TS+TC2))/TS. Replacing TC2 with TC4 improves compression by an average of 36% (ranging from 13% to 47%), where we measure improvement by computing ((TS+TC2) − (TS+TC4))/(TS+TC2). Here we do observe a difference in performance, though. For the two RE sets Bro217 and C613, which are primarily strings, the average improvements from TC2 and TC4 are only 24% and 15%, respectively. For the remaining six RE sets, which have many wildcard closures, the average improvements are 47% and 43%, respectively. The reason, as we touched on in Section 5.3.4, is that wildcard closure creates multiple deferment trees with almost identical structure. Thus wildcard closures, the prime source of state explosion, are particularly amenable to compression by table consolidation. In such cases, doubling our table consolidation limit does not greatly increase the SRAM cost: while the number of SRAM bits per TCAM entry doubles as we double the consolidation limit, the number of TCAM entries required almost halves. (4) Our RE matching solution achieves high throughput even with 1-stride DFAs. For the TS+TC4 algorithm, on the 8 RE sets, the average throughput is 4.60 Gbps (ranging from 3.64 Gbps to 8.51 Gbps).

We use our Scale dataset to assess the scalability of our algorithms' performance, focusing on the number of TCAM entries per DFA state. Figure 5.13(a) shows the number of TCAM entries per state for TS, TS+TC2, and TS+TC4 for the Scale RE sets containing 26 REs (with DFA size 1275) through 34 REs (with DFA size 305,339).
Figure 5.13: TCAM entries per DFA state (a) and compute time per DFA state (b) for Scale 26 through Scale 34.

The DFA size roughly doubled for every RE added. In general, the number of TCAM entries per state is roughly constant and actually decreases with table consolidation. This is because table consolidation performs better as more REs with wildcard closures are added, since there are then more trees with similar structure in the deferment forest.

We now analyze running time. We ran our experiments on the Michigan State University High Performance Computing Center (HPCC). The HPCC has several clusters; most of our experiments were executed on the fastest cluster, whose nodes each have two quad-core Xeons running at 2.3GHz and 8GB of total RAM. Figure 5.13(b) shows the compute time per state in milliseconds. The build times are the times per DFA state required to build the non-overlapping sets of transitions (applying TS and TC); these per-state times increase linearly in the number of states because the algorithms are quadratic in the number of DFA states. For our largest DFA, Scale 34 with 305,339 states, the total time required for TS, TS+TC2, and TS+TC4 is 19.25 minutes, 118.6 hours, and 150.2 hours, respectively. These times are cumulative; that is, going from TS+TC2 to TS+TC4 requires an additional 31.6 hours. This second table consolidation time is roughly one fourth of the first table consolidation time because the first table consolidation cuts the number of DFA states in half and table consolidation has a running time quadratic in the number of DFA states. The BW times are the times per DFA state required to minimize these transition tables using the Bitweaving algorithm in [31]; these times are roughly constant because Bitweaving depends on the size of the transition tables of each state, not on the size of the DFA. For our largest DFA, Scale 34 with 305,339 states, the total Bitweaving optimization time on TS, TS+TC2, and TS+TC4 is 10 hours, 5 hours, and 2.5 hours, respectively. These times are not cumulative and fall by a factor of 2 because each table consolidation step cuts the number of DFA states by a factor of 2.

Figure 5.14: Consolidation times for Scale 26 through Scale 34 for the optimal and greedy consolidation algorithms.

Figure 5.14 shows the time required per state for the greedy and optimal consolidation algorithms on the Scale dataset. The greedy algorithm runs roughly 6 times faster than the optimal algorithm, while the average increase in the number of resulting TCAM rules is only around 4% for TC2 and around 9% for TC4.

The partially deferred algorithm given in Section 5.2.2.4 always performs at least as well as the completely deferred minimization algorithm given in [31]. For the three Snort RE sets and C613, the partially deferred algorithm reduces the number of TCAM entries by 1, 2, 152, and 194, respectively, relative to the completely deferred algorithm. For the other RE sets, both algorithms perform equally well.
The partially deferred algorithm is slower than the completely deferred algorithm because there are more unique decisions during minimization. We therefore use the completely deferred minimization algorithm for computing classifier sizes during consolidation, and the partially deferred minimization algorithm for generating the final TCAM classifiers for each state.

5.6.3 Results on 7-var-stride DFAs

We consider two implementations of variable striding, assuming we have a 2.36 Mb TCAM with TCAM width 72 bits (32,768 entries). From Table 5.1, the latency of a lookup is 2.57 ns. Thus, the potential RE matching throughput of a 7-var-stride DFA with average stride S is 8 × S / 0.00000000257 = 3.11 × S Gbps.

In our first implementation, we only use self-loop unrolling of the root states in the deferment forest. Specifically, for each RE set, we first construct the 1-stride DFA using transition sharing. We then apply self-loop unrolling to each root state of the deferment forest to create a 7-var-stride transition table. Because of the linear increase in transition table size, we know that the resulting TCAM table will increase in size by at most a factor of 7. In all our experiments, the size never increased by more than a factor of 2.25, and the largest DFA (for C7) required only 2.25 Mb. We could decrease the TCAM space by using table consolidation, which was very effective for all RE sets except the string matching RE sets Bro217 and C613; however, this was unnecessary, since all the self-loop unrolled tables fit within our available TCAM space.

Second, we apply full variable striding. Specifically, we first create 1-stride DFAs using transition sharing and then apply variable striding with no table consolidation, with table consolidation using 2-decision tables, and with table consolidation using 4-decision tables. We use the best result that fits within the 2.36 Mb TCAM space. For the RE sets Bro217, C8, C613, Snort24, and Snort34, no table consolidation is used. For C10 and Snort31, we use table consolidation with 2-decision tables. For C7, we use table consolidation with 4-decision tables.

We run both implementations of our 7-var-stride DFAs on traces of length 287484 to compute the average stride. For each RE set, we generate 4 traces using Becchi et al.'s trace generator tool with the default values 35%, 55%, 75%, and 95% for the parameter pM. These generate increasingly malicious traffic that is more likely to move away from the start state towards the distant accepting states of the DFA. We also generate a completely random string to model completely uniform traffic, such as binary traffic patterns, which we treat as pM = 0. We group the 8 RE sets into 3 groups: group (a) contains the two string matching RE sets Bro217 and C613; group (b) contains the three RE sets C7, C8, and C10, whose REs all contain wildcard closures; group (c) contains the three RE sets Snort24, Snort31, and Snort34, in which roughly 40% of the REs contain wildcard closures. Figure 5.15 shows the average stride length and throughput for the three groups of RE sets as a function of the parameter pM (the random string trace is plotted as pM = 0).

Figure 5.15: The throughput and average stride length of the RE sets under self-loop unrolling and under full variable striding.

We make the following observations. (1) Self-loop unrolling is extremely effective on the uniform trace.
For the non-string-matching sets, it achieves average stride lengths of 5.97 and 5.84 and RE matching throughputs of 18.58 and 18.15 Gbps for groups (b) and (c), respectively. For the string matching sets in group (a), it achieves an average stride length of 3.30 and a resulting throughput of 10.29 Gbps. Even though only the root states are unrolled, self-loop unrolling works very well because the non-root states that defer most transitions to a root state still benefit from that root state's unrolled self-loops. In particular, it is likely that there will be long stretches of the input stream that repeatedly return to a root state and take full advantage of the unrolled self-loops. (2) The performance of self-loop unrolling degrades steadily as pM increases for all RE sets except those in group (b). This occurs because as pM increases, we are more likely to move away from any default root state; thus, fewer transitions are able to leverage the unrolled self-loops at root states. (3) For the uniform trace, full variable striding does little to increase RE matching throughput. Of course, for the non-string-matching RE sets, there was little room for improvement. (4) As pM increases, full variable striding does significantly increase throughput, particularly for groups (b) and (c). For example, for groups (b) and (c), the minimum average stride length is 2.91 over all values of pM, which leads to a minimum throughput of 9.06 Gbps. Also, for all groups of RE sets, the average stride length for full variable striding is much higher than that for self-loop unrolling for large pM. For example, when pM = 95%, full variable striding achieves average stride lengths of 2.55, 2.97, and 3.07 for groups (a), (b), and (c), respectively, whereas self-loop unrolling achieves average stride lengths of only 1.04, 1.83, and 1.06 for groups (a), (b), and (c), respectively. These results indicate the following. First, self-loop unrolling is extremely effective at increasing throughput for random traffic traces. Second, other variable striding techniques can mitigate many of the effects of malicious traffic that leads away from the start state.

Chapter 6

Overlay Automata

In this chapter we present our overlay automata model for handling DFA state replication, and the implementation of the overlay automata in both software and hardware.

6.1 Introduction

As discussed in Section 3.2, the main cause of redundancy in a DFA is state replication, which causes the exponential increase in the size of the DFA as multiple REs are combined. Ideally, we would like to build an automaton whose size is proportional to that of an NFA with a matching speed close to that of a DFA. We achieve this goal with our new overlay automata model.

6.1.1 Limitations of Prior Automata Models

DFA-based automata models have been developed to address DFA space explosion. Two representative models are D2FA proposed by Kumar et al. [26] and XFA proposed by Smith et al. [41]. D2FAs reduce the number of transitions stored per state by using deferred transitions to compactly represent common transitions, i.e., transitions with the same input character and destination state. This elegant solution can be automated; however, it only handles transition sharing: it addresses neither state replication nor the resulting replicated transitions. So although there is a huge reduction in the space required, the space is still proportional to the number of DFA states, which grows exponentially with the number of REs in the RE set.
XFAs deal with state replication by using scratch memory and auxiliary code stored at each state that must be executed before or after each transition. This interesting solution models state replication; however, it cannot be fully automated [50]. Furthermore, the code that needs to be executed on each transition limits the throughput that can be achieved.

Our table consolidation technique presented in Section 5.3 actually exploits state replication to reduce the TCAM space required, but it does so accidentally. That is, table consolidation works well because of state replication, but the technique is oblivious to state replication. The algorithm does not explicitly search for replicated states; it only looks for state pairs that are good matches for consolidation. Replicated states are usually good matches for consolidation, so the states that are consolidated together are usually replications of the same NFA state. Several limitations of table consolidation, however, prevent it from fully exploiting state replication. First, there is a practical limit on the number of TCAM tables that can be consolidated; for instance, we only consider consolidating up to 4 tables together. Thus, table consolidation can only yield a constant factor reduction in TCAM storage no matter how much state replication exists in the DFA, so the final TCAM size can still be exponential in the size of the RE set. Ideally, we would like to combine all the replications of an NFA state. Second, table consolidation does not reduce the associated SRAM required to store decisions because, although the TCAM entries are merged, the decisions are not. Furthermore, the SRAM required by table consolidation might even increase due to imperfect merging of tables.

6.1.2 Summary of Overlay Automata Approach

We developed a new overlay automata model that exploits state replication to compress the size of the DFA. The idea is to group the replicated DFA structures together instead of repeating them multiple times. We briefly describe here the overlay automata model and how the automata are implemented in software and hardware.

6.1.2.1 Overlay DFA

We propose the Overlay Deterministic Finite state Automata (ODFA) to model state replication in DFAs. The basic idea is to vertically overlay all the DFA states that are replications of the same NFA state into what we call a super-state. If we view a DFA as a 2-D object, then an ODFA can be viewed as a 3-D object. Figure 6.2 depicts the DFA and ODFA for the RE set {/abc/, /abd/, /e.*f/}.

The ODFA model gives us the following key benefits. First, it allows us to easily identify replications of the same NFA state, as they are all in the same super-state. For example, in Figure 6.2, we merge states 0 and 5 and states 1 and 6 into super-states S0 and S1, respectively. Second, it allows us to represent replications of the same NFA transition by one super-state transition between two super-states. For any NFA transition from s1 to s2 on character σ, in the corresponding ODFA, all replications of state s1 are in the same super-state, say S1, all replications of state s2 are in the same super-state, say S2, and all replications of state s1 have a transition on σ to their corresponding replications of state s2. We merge these replicated transitions into one combined super-state transition from super-state S1 to super-state S2 on character σ.
For example, in Figure 6.2, we merge the two transitions from states 0 and 5 on character `a' into one super-state transition on character `a'.

6.1.2.2 Overlay D2FA

Combining our overlay idea, which models state replication and replicated transitions, with the delayed input idea of D2FA, which models the sharing of non-replicated transitions among non-replicated DFA states (i.e., transition sharing) through a state deferment relationship, we propose the Overlay Delayed Input DFA (OD2FA) to model state replication, replicated transitions, and transition sharing. The relationship among the automata models DFA, D2FA, ODFA, and OD2FA is illustrated in Figure 6.1. A key benefit of OD2FA is that we can represent the deferment relationship among D2FA states more compactly using deferment among OD2FA super-states. From the perspective of transitions, OD2FA optimizes both deferred transitions (i.e., common transitions among states) and replicated transitions.

Figure 6.1: Relationship of automata models: D2FA models transition sharing, ODFA models state replication, and OD2FA models both state replication and transition sharing.

6.1.2.3 Building OD2FA

To build an OD2FA, we propose algorithms for constructing it from a given set of REs incrementally. We first construct the equivalent OD2FA for each RE. We then efficiently merge OD2FAs until only a single OD2FA for the entire set of REs is left. We propose an incremental construction algorithm that builds the OD2FA D for RE set R1 ∪ R2 by merging the OD2FA D1 for R1 with the OD2FA D2 for R2. This algorithm automatically identifies and groups replicated states in D into super-states and replicated transitions into super-state transitions, without having to perform an expensive analysis of the final DFA structure.

6.1.2.4 Implementing OD2FA

We develop techniques for implementing the OD2FA in both software and hardware. We extend the software implementation of a D2FA to OD2FA. The main problem we need to solve is the following: since an OD2FA only stores super-state transitions, how do we efficiently look up state transitions from the super-state transitions? Our efficient encoding of super-state transitions allows us to perform this lookup very quickly. For the hardware implementation, we develop a solution that we call OverlayCAM by extending RegCAM to implement the OD2FA in TCAM. Again, our efficient encoding of super-state transitions allows us to implement each super-state transition using only one TCAM entry. Thus, OverlayCAM not only encodes multiple deferred state transitions using one TCAM entry but also encodes multiple non-deferred state transitions that are replications of the same NFA transition using only one TCAM entry. We also extend the variable striding technique of RegCAM for use with OverlayCAM to increase the matching throughput.

6.2 Overlay DFA

In this section, we formally define a new automaton, the Overlay Deterministic Finite state Automata (ODFA), which we propose to deal with state explosion in DFAs. There are two ideas behind an ODFA. The first is to group all DFA states that are replications of the same NFA state into a single super-state. The second is to merge as many transitions from the replicated states within a super-state as possible. To define ODFA, we use the concepts of super-states, overlays, super-state transitions, and overlay offsets. We begin by informally defining ODFA and these concepts using the ODFA in Figure 6.2 as a running example.
Figure 6.2: Example of DFA, state replication, and Overlay DFA: (a) DFA for RE set {/abc/, /abd/}; (b) DFA for RE set {/abc/, /abd/, /e.*f/}; (c) corresponding ODFA; (d) ODFA with super-state transitions.

Figure 6.2(a) shows the DFA for the RE set {/abc/, /abd/} from Figure 3.1(a). The notation used in the figure is explained in Section 3.2. Figure 6.2(b) shows the DFA after the RE /e.*f/ is added to the RE set (same as Figure 3.1(b)). This DFA illustrates the potential for ODFA, as the entire DFA for the RE set {/abc/, /abd/} is replicated twice. The corresponding ODFA is shown in Figure 6.2(c), in which we overlay the two copies of the DFA for the RE set {/abc/, /abd/} on top of each other. Each pair of replicated DFA states is a super-state in the ODFA. Each layer of states is called an overlay. The ODFA in Figure 6.2(c) has six super-states S0, ..., S5 and two overlays. Each overlay contains a subset of the states in the entire DFA; in Figure 6.2(c), the first overlay does not contain a state from super-state S5.

We now introduce the concept of super-state transitions. One super-state transition represents multiple DFA transitions, much as one super-state represents a group of DFA states. In a standard DFA transition, the source state is a DFA state. In a super-state transition, the source is an ODFA super-state, and the transition represents transitions from all the replicated DFA states within that super-state. The destination is usually an ODFA super-state but can sometimes be a DFA state. The two super-state transition forms are S1 −σ→ (S2, o, 1) and S1 −σ→ (S2, O, 0), distinguished by the last bit value 1/0. In the first form, the semantics are that each DFA state q in super-state S1 transitions on character σ to a DFA state q′ in super-state S2, with o = (overlay of q′ − overlay of q) mod #overlays. We call this difference in the overlay value the overlay offset (or just offset for short). The value of the overlay offset o is usually 0. In the second form, the semantics are that each DFA state q in super-state S1 transitions on character σ to the DFA state located in super-state S2 at overlay O. For example, consider the two DFA state transitions 1 −b→ 2 and 6 −b→ 7 in Figure 6.2(c). These two transitions can be represented by one super-state transition S1 −b→ (S2, 0, 1); the offset 0 denotes no change in overlay. As a second example, consider the two DFA state transitions 3 −e→ 5 and 8 −e→ 5 in Figure 6.2(c). These two transitions can be represented by one super-state transition S3 −e→ (S2, 1, 0).

In the ideal case, all DFA transitions can be replaced by super-state transitions, which reduces the total number of transitions by a factor equal to the number of overlays in the ODFA. In some cases, not all states in a super-state have transitions that can be merged. We therefore generalize super-state transitions to allow a super-state transition to be defined for a specific subset of overlays X within a given super-state.
Technically, traditional transitions from a single state s are super-state transitions where X contains only s's overlay. We refer to these as singleton super-state transitions.

Figure 6.2(d) shows the ODFA for our running example with non-singleton super-state transitions denoted by thick edges. For example, the two transitions 0 −a→ 1 and 5 −a→ 6 from Figure 6.2(c) are represented by one super-state transition S0 −a→ (S1, 0, 1). For super-state transitions of the form S1 −σ→ (S2, o, 1) (i.e., the destination is also a super-state), the number beside the thick edge gives the overlay offset o. Just as we use double arrows to represent multiple transitions, we use thick double arrows to represent multiple non-singleton super-state transitions. For example, the two transitions 0 −e→ 5 and 5 −e→ 5 from Figure 6.2(c) are included in one super-state transition S0 −e→ (S0, 1, 0), which is part of the thick double arrow labeled with `e' ending at state 5. The DFA in Figure 6.2(b) has 11 × 256 = 2816 total transitions; the ODFA in Figure 6.2(d) has 1542 total super-state transitions, which is close to the best possible result of 2816/2 = 1408 total super-state transitions; only a few of these transitions are singleton super-state transitions.

Recall that a DFA is defined as a 5-tuple (Q, Σ, q0, M, δ) (Section 3.1). We now formally define the ODFA.

Definition 5 (Overlay Deterministic Finite state Automata (ODFA)). An ODFA for a set of REs R is defined as a 7-tuple D = (Q, Σ, q0, S, O, M, ∆). The first three terms are the same as those in the above DFA definition. The next two terms define the overlay structure on top of a DFA: S = {S0, ..., S|S|−1} is a set of super-states that partitions Q, while O = {O0, ..., O|O|−1} is a set of overlays that also partitions Q. We treat each overlay as a unique number in the range [0..|O|). We overload notation and define S: Q → S and O: Q → O as functions mapping states to super-states and overlays, respectively. For any two states si ≠ sj, it must be the case that (S(si), O(si)) ≠ (S(sj), O(sj)). For any super-state S and overlay O, S ∩ O is either empty or contains exactly one state s ∈ Q. The term M: S → 2^R gives the subset of REs matched by each super-state; the set of REs matched by any state s ∈ Q is then given by M(S(s)). The final term, ∆: S × 2^O × Σ → S × [0..|O|) × {0, 1}, is a partial function that defines the super-state transition function. For any s ∈ Q and any σ ∈ Σ, all the transitions (S(s), X, σ) ∈ dom(∆) with O(s) ∈ X must have the same value; i.e., if we have two transitions (S(s), X, σ) ∈ dom(∆) and (S(s), Y, σ) ∈ dom(∆) with O(s) ∈ X ∩ Y, then we must have ∆(S(s), X, σ) = ∆(S(s), Y, σ).

We define the derived total state transition function δ′(s, σ) based on this unique transition value, say (S′, o, b), as follows. If b = 0, we call the transition a non-offset transition, and δ′(s, σ) = S′ ∩ o. Otherwise (b = 1), we call the transition an offset transition, and δ′(s, σ) = S′ ∩ ((O(s) + o) mod |O|). The value b is called the offset bit. In either case, it must hold that the resulting overlay does intersect S′. Normally, for offset transitions, o = 0, so the resulting overlay is just O(s). We use the notation (S1, O) −σ→ (S2, o, b) to denote the super-state transition ∆(S1, O, σ) = (S2, o, b).

Even though an ODFA has super-states and overlays, an ODFA processes an input string much like a DFA does. That is, the ODFA is always in a unique state, and each character processed moves the ODFA to a potentially new state.
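To illustrate Definition 5 operationally, the following Python sketch evaluates the derived transition function δ′. The data layout is an assumption for illustration: S and O map each state to its super-state and overlay, Delta maps a (super-state, character) pair to a list of (X, (S2, o, b)) entries, and state_at(S2, overlay) is an assumed helper returning the state at that position of the super-state/overlay grid.

    def odfa_delta(s, ch, S, O, Delta, num_overlays, state_at):
        for X, (S2, o, b) in Delta.get((S[s], ch), []):
            if O[s] in X:
                if b == 1:
                    # Offset transition: destination overlay is relative.
                    return state_at(S2, (O[s] + o) % num_overlays)
                # Non-offset transition: o names the destination overlay.
                return state_at(S2, o)
        return None  # unreachable in a well-formed (total) ODFA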
The main difference is that the ODFA hopefully compresses multiple DFA transitions into a single ODFA super-state transition, and the RE matching information is stored at the super-state level rather than at the state level. For example, given the ODFA in Figure 6.2(d) and the input string abea, the ODFA begins in state 0. After processing character a, the ODFA moves to state 1. After processing character b, the ODFA moves to state 2. After processing character e, the ODFA moves to state 5. Finally, after processing character a, the ODFA moves to state 6. The first and fourth transitions are actually the same super-state transition. The third transition corresponds to the form of super-state transition with a specified destination, state 5. In all cases, M(S(s′)) = ∅ for each state s′ visited, so no RE is matched at any point in time.

Overlays and super-states are two orthogonal partitionings of the states in Q; intuitively, super-states partition Q vertically and overlays partition Q horizontally. There exist many possible ways to partition the states of a DFA into super-states and overlays. The benefits of an ODFA are only realized by a careful partitioning, for example, by grouping replications of the same NFA state together in a super-state.

Note that some super-states may not have DFA states in every overlay. If overlay O in super-state S is empty, we denote this by S ∩ O = ⊥ (i.e., ⊥ denotes an empty location). In Figure 6.2(d), super-state S5 contains only one DFA state, state 10, which belongs to the second overlay.

The compressive power of a super-state transition increases with the number of overlays that it includes. In the best case, all overlays are included in a super-state transition. In Figure 6.2(d), most super-state transitions include all overlays; there are only a few singleton super-state transitions. In more complex ODFAs, there may be cases where a given super-state transition includes more than one overlay but not all overlays.

In an ODFA, the RE matching information is stored at the super-state level (i.e., in M), and state matching is derived from M. So when constructing an ODFA D′ for a given DFA D, we must create the super-states such that the following condition is satisfied:

    ∀S ∈ SD′, ∀s1, s2 ∈ S: MD(s1) = MD(s2).    (C1)

6.3 Overlay D2FA

In this section we present another new automaton, the Overlay Delayed Input DFA (OD2FA), which we propose to deal with both state and transition explosion in DFAs. Recall that, given a DFA D = (Q, Σ, q0, M, δ), its corresponding D2FA is defined as a 6-tuple (Q, Σ, q0, M, ρ, F) (Section 3.3). ODFAs address state explosion and D2FAs address transition explosion; we propose the OD2FA to address both.

Definition 6 (Overlay D2FA (OD2FA)). We define an OD2FA as an 8-tuple (Q, Σ, q0, F, S, O, M, ∆), where the first three terms are the same as in the definition of a D2FA, and the last four terms are the same as in the definition of an ODFA. The only difference is that we derive a partial state transition function ρ′: Q × Σ → Q from ∆; since ∆ is a partial function, we do not require the existence of a covering transition in ∆ for each s ∈ Q and σ ∈ Σ. The term F: S → S is the super-state deferment function, which gives the deferred super-state of each super-state. We overload notation and define the D2FA state deferment function F: Q → Q from it as F(s) = F(S(s)) ∩ O(s). To ensure this is a valid deferment function, F must satisfy the following two conditions. First,

    ∀s ∈ Q, F(S(s)) ∩ O(s) ≠ ⊥.    (C2)

Second, the deferment forest of super-states defined by F must have no cycles other than self-loops.
Finally, ρ′ and F define the derived total state transition function δ′ as follows:

    δ′(s, σ) = ρ′(s, σ) if (s, σ) ∈ dom(ρ′), and δ′(F(s), σ) otherwise.

We say that (s, σ) ∈ dom(ρ′) if there exists a transition (S(s), X, σ) ∈ ∆ with O(s) ∈ X. When (s, σ) ∈ dom(ρ′), ρ′(s, σ) is defined just as δ′ is defined for an ODFA.

We say that super-state S′ overlay covers super-state S if ∀O ∈ O, (S′ ∩ O = ⊥) → (S ∩ O = ⊥); that is, every overlay that is empty in S′ is also empty in S. Condition (C2) then says that for every super-state S, super-state F(S) overlay covers S. The transition function δ′ is computed by finding the transition (S(s), X, σ) ∈ ∆ with O(s) ∈ X if such a transition exists; if it does not, the OD2FA follows the super-state deferment function. As defined, we store the super-state deferment function rather than a per-state deferment function, so deferment information is stored only at the super-state level. Likewise, we store the RE matching information M only at the super-state level. Finally, with ∆, many super-state transitions represent multiple singleton transitions. Combined, these yield significant savings.

Figure 6.3: OD2FA example: (a) D2FA for RE set {/abc/, /abd/, /e.*f/}, with deferment transitions shown as dashed edges; (b) corresponding OD2FA.

Figure 6.3(a) shows the D2FA for the RE set {/abc/, /abd/, /e.*f/}; the dashed edges are deferment transitions. Figure 6.3(b) shows the corresponding OD2FA. The D2FA needs to store 518 actual transitions and 10 deferment transitions, while the OD2FA only needs to store 260 actual transitions, most of which are non-singleton super-state transitions, and 5 super-state deferment transitions. For this example, we achieve near optimal compression relative to the D2FA given that there are only two overlays in the OD2FA.

6.3.1 OD2FA Multiplicative Compression

OD2FA multiplies the compressive effects of D2FA and ODFA to significantly reduce the space required to store transitions. ODFA reduces the storage space for transitions among DFA replicates by storing one super-state transition for each set of replicated transitions; the compression limit for ODFA is the number of DFA replicates. D2FA reduces the storage space for transitions within each DFA replicate using deferment transitions; the compression limit for D2FA is the number of states within each DFA replicate. OD2FA does both simultaneously. Its compression limit is the number of DFA replicates multiplied by the number of states within each replicate, which is essentially the total number of DFA states.

To illustrate this multiplicative compression, consider again the OD2FA in Figure 6.3(b). The original DFA for this RE set requires 11 × 256 = 2816 transitions. The corresponding ODFA in Figure 6.2(d) reduces the number of transitions by almost a factor of 2 by storing one super-state transition for each pair of replicated transitions. The corresponding D2FA in Figure 6.3(a) reduces the number of transitions by more than a factor of 5 using deferment transitions; in particular, in both replicates, almost all of the transitions of all states except the self-looping start states are eliminated. Finally, the OD2FA in Figure 6.3(b) multiplies both effects and ends up with 260 super-state transitions and 5 super-state deferment transitions. This is almost a factor of 11 smaller than the original DFA, where 11 is the compression limit since the DFA has 11 states.
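To summarize Definitions 5 and 6 operationally, the following Python sketch (same assumed data layout as the ODFA sketch above) resolves an OD2FA transition: it first looks for a stored super-state transition covering the current overlay, and otherwise follows super-state deferment on the same input character.

    def od2fa_delta(s, ch, S, O, Delta, F, num_overlays, state_at):
        while True:
            for X, (S2, o, b) in Delta.get((S[s], ch), []):
                if O[s] in X:
                    if b == 1:
                        return state_at(S2, (O[s] + o) % num_overlays)
                    return state_at(S2, o)
            # Condition (C2) guarantees the deferred state exists, and the
            # self-deferring root super-state covers every character, so
            # this loop terminates.
            s = state_at(F[S[s]], O[s])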
Starting from the D2FA, the OD2FA is able to merge all the self-looping transitions out of the two self-looping states in the D2FA into super-state transitions (adding one singleton transition on `f' for state 5). This is critical, since the vast majority of the transitions remaining in many D2FAs are self-looping transitions.

6.3.2 Effectiveness of OD2FA on an Ideal RE Set

We can further demonstrate the effectiveness of OD2FA using an example set of n REs where each RE is of the form /Ai,1 Ai,2 ··· Ai,p .* Bi,1 Bi,2 ··· Bi,p/, 1 ≤ i ≤ n; that is, each RE has p characters followed by `.*' and another p characters, and all 2np characters are unique. This is a simple RE set in the sense that there is no interaction between the REs in the set, and we get a simple exponential increase in the size of the DFA relative to the number of REs n because of state replication. In this case, the NFA has (2p+1)n + 2 = O(pn) states, and the DFA has ((2p−1)n + 2) · 2^(n−1) = O(pn·2^n) states. The D2FA has ((p−1)n + 256) · 2^n = O(pn·2^n) transitions, and our RegCAM presented in Section 5.2 will generate (pn+1) · 2^n = O(pn·2^n) TCAM entries. The OD2FA has only pn + 1 = O(pn) super-states and 2pn + 256 = O(pn) super-state transitions, and a straightforward TCAM implementation of these transitions needs only 2pn + 1 = O(pn) TCAM entries. The number of rules for the OD2FA is thus of the same order as the NFA size, which is a lower bound on the compression that any method can achieve.

6.4 OD2FA Construction

In this section we present our algorithms for constructing an OD2FA for a set of REs. Given a set of REs, we construct its equivalent OD2FA incrementally in two phases. In the first phase, we construct an equivalent individual OD2FA for each RE. In the second phase, we merge the individual OD2FAs in a binary tree fashion; i.e., we merge two OD2FAs into one OD2FA at a time until only one OD2FA for the entire given RE set remains.

Constructing an OD2FA involves three main steps: (1) creating the super-states (i.e., assigning a (super-state, overlay) pair to each DFA state), (2) setting the deferment for each super-state, and (3) creating, for each super-state, the (combined) super-state transitions from the (singleton) state transitions. The algorithms for the first two steps (creating super-states and setting deferment) differ between the two phases; however, the algorithms for the third step (creating super-state transitions) are almost identical for the two phases. We therefore describe the OD2FA construction algorithms in two parts. In this section we demonstrate how the super-states are created and how super-state deferment is set (i.e., steps 1 and 2) during both phases. In the next section we show how super-state transitions are built from state transitions (i.e., step 3).

6.4.1 OD2FA Construction from One RE

Given one RE, we first build its equivalent D2FA using the technique described in Section 4.3.1. The deferment relationship among states in this D2FA defines a deferment forest. The root states in this forest are all self-looping states, meaning that they transit to themselves on more than |Σ|/2 = 128 characters. Most failure transitions end in self-looping states. For example, in the D2FA in Figure 6.4, states 0 and 2 are self-looping states. An important property of the D2FA constructed using the technique described in Section 4.3.1 is that each self-looping state in the DFA is the root of a tree in the deferment forest of the D2FA, and vice versa.
Furthermore, all the states whose failure transitions go to a self-looping state s are in the deferment tree rooted at s.

We now describe our algorithm for constructing the OD2FA from a D2FA, using the example in Figure 6.4 for the RE /ab[^n]*pq/. A key observation is that any D2FA is also a valid OD2FA with only a single overlay, singleton super-states, and singleton super-state transitions. We gradually convert the D2FA into a more compact OD2FA, first creating valid overlays and super-states and then updating the super-state transition function to combine multiple transitions into one super-state transition.

We begin by specifying the number of deferment trees in the super-state deferment forest and the number of overlays in a super-state. We accomplish these tasks by partitioning the self-looping root states of the D2FA into two groups: accepting root states and rejecting root states. If either partition is empty, we create one deferment tree in the OD2FA; otherwise there are two deferment trees. The number of overlays in the OD2FA is the larger of the number of accepting root states and the number of rejecting root states. For each non-empty partition, we merge the root states in that partition into a single root super-state in the OD2FA. Typically, self-looping states are failure states, so the accepting root state partition is empty and the corresponding root super-state is not formed; this observation holds for all of our experimental RE sets. Thus, the deferment forest of the OD2FA typically has one deferment tree rooted at the rejecting root super-state. For example, the OD2FA in Figure 6.4 has one deferment tree with two overlays, 0 and 1, and the rejecting root super-state is the super-state containing states 0 and 2.

Figure 6.4: OD2FA construction from one RE: the D2FA for RE /ab[^n]*pq/, its deferment forest, and the corresponding OD2FA (singleton super-state transitions not shown).

There are two reasons we group root states into super-states even though the self-looping states in the D2FA are usually not replications of the same NFA state. First, all the common self-loops can be merged into super-state transitions; we specify this more precisely in Section 6.5. Second, since self-looping states are typically the "replication points" when combining REs, grouping self-looping states into a common super-state helps us automatically identify the state replications and replicated transitions when we merge two OD2FAs. We elaborate on this in Section 6.4.2. Condition (C2) is satisfied because the root super-state defers to itself.

We now describe how we assign the remaining states to super-states and overlays while ensuring that Condition (C2) is maintained. Given a super-state S that is already in the OD2FA deferment forest, our algorithm groups the children of the states in S into new super-states that defer to S. This grouping is applied recursively to the new super-states until all states are assigned to super-states.

We now specify how the children of the states of S are grouped into super-states. Let n be the number of non-empty overlays in S, and let s1, ..., sn be the states in these overlays. Let Ci = F^(−1)(si) be the set of children of each state si in S, and let U = ∪_{i=1}^{n} Ci be the total set of states to be grouped into super-states. To ensure that all states in a super-state match the same REs, we partition U into accepting states and rejecting states and work with each partition independently.
Without loss of generality, assume U has one partition. We create super-states with the following two goals in mind: group together states u ∈ U from different Ci so as to (1) maximize the number of super-state transitions that can be formed and (2) minimize the total number of super-states formed. We propose the following greedy strategy (see the sketch below). We start with an arbitrary state u from the first non-empty Ci, removing u from Ci and creating a new super-state S′ with just u in overlay O(si). From each of the remaining non-empty Ck, we pick the state uk that has the most common non-deferred transitions with u, delete uk from Ck, and add uk to super-state S′ in overlay O(sk); state uk must have at least one common non-deferred transition with u to be selected. We repeat this process until all the Ci are empty. Condition (C2) is maintained because a state in a new super-state S′ is added to overlay O if and only if the corresponding state in F(S′) is in overlay O.

For the D2FA in Figure 6.4 with the root super-state (0 2) as S, we have C0 = {1} and C1 = {3, 4}, and we create three super-states, (1 ⊥), (⊥ 3), and (⊥ 4), each of which defers to (0 2). No super-states with more than one occupied overlay are formed because states 1 and 3, as well as states 1 and 4, do not have any common non-deferred transitions.

After the super-states have been created, we greedily merge together compatible pairs of super-states. Two super-states are compatible if there is no overlay that is non-empty in both super-states. For our example in Figure 6.4, the super-states (1 ⊥) and (⊥ 3) are merged together, giving us two final super-states, (1 3) and (⊥ 4). The last step is to create the super-state transitions, which is discussed in Section 6.5.

We use greedy algorithms in several of our steps. This does not have much effect on overall compression because most compression opportunities here are accidental; they are not the result of replications of the same NFA state. The key compression results from grouping the root states together and combining the resulting self-loops into super-state transitions; everything else is a bonus.
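The following Python sketch illustrates the greedy grouping step. The inputs are assumptions for illustration: children[k] is the child set Ck of the parent state in overlay k, and shared(u, v) counts the common non-deferred transitions of u and v.

    def group_children(children, shared):
        groups = []
        while any(children.values()):
            # Seed a new super-state with an arbitrary child of the first
            # non-empty C_i, placed in that parent state's overlay.
            i = next(k for k, C in children.items() if C)
            u = children[i].pop()
            group = {i: u}
            for k, Ck in children.items():
                if k == i or not Ck:
                    continue
                best = max(Ck, key=lambda v: shared(u, v))
                # A state must share at least one non-deferred transition
                # with u to join this super-state; otherwise it waits for
                # a later group.
                if shared(u, best) >= 1:
                    Ck.remove(best)
                    group[k] = best
            groups.append(group)
        return groups  # each {overlay: state} dict is one new super-state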
6.4.2 OD2FA Construction from 2 OD2FAs

We now present our OD2FA merge algorithm, which we call OD2FAMerge. Given two OD2FAs, D1 with underlying D2FA D1 for RE set R1 and D2 with underlying D2FA D2 for RE set R2, where R1 ∩ R2 = ∅, OD2FAMerge constructs the OD2FA D3 with underlying D2FA D3 for the RE set R3 = R1 ∪ R2.

Figure 6.5: D2FA and OD2FA for RE /cd[^n]*pr/ (singleton super-state transitions not shown).

The first step is to create the merged D2FA D3 using the D2FA merge algorithm described in Section 4.3.2. For example, Figure 6.6(a) shows the D2FA constructed from the D2FAs in Figure 6.4 and Figure 6.5. For each state, the number below the line is the state ID in D3, and the two numbers above the line are the state IDs of the states in D1 and D2 that this state corresponds to.

Figure 6.6: Merged OD2FA construction example: (a) D2FA merged from the D2FAs in Figures 6.4 and 6.5; (b) OD2FA merged from the OD2FAs in Figures 6.4 and 6.5; (c) corresponding optimized OD2FA.

We now construct the OD2FA D3 = (Q3, Σ, q03, F3, S3, O3, M3, ∆3) from the input OD2FAs D1 = (Q1, Σ, q01, F1, S1, O1, M1, ∆1) and D2 = (Q2, Σ, q02, F2, S2, O2, M2, ∆2) as well as the merged D2FA D3. The first three terms of D3 are derived from D3. We then set S3 = S1 × S2 and O3 = O1 × O2. We reduce S3 to include only reachable super-states (a super-state is reachable if it contains at least one reachable state). We discuss how we handle empty overlays in Section 6.5.4.

Recall that the notation S3 = ⟨S1, S2⟩ means that super-state S3 in D3 corresponds to the pair of super-states S1 from D1 and S2 from D2; both S3 and ⟨S1, S2⟩ refer to the same super-state in D3. For any super-state S3 = ⟨S1, S2⟩ ∈ S3, we set M3(S3) = M1(S1) ∪ M2(S2). Condition (C1) holds because all the states in super-state S1 match the REs in M1(S1) and all the states in super-state S2 match the REs in M2(S2).

Just as each state in D3 corresponds to a pair of states from D1 and D2, each super-state in D3 corresponds to a pair of super-states from D1 and D2, and similarly each overlay in D3 corresponds to a pair of overlays from D1 and D2. Any state in D3 is assigned to a super-state and an overlay as follows. Let u = ⟨v, w⟩ be a state in D3. Then S3(u) ← ⟨S1(v), S2(w)⟩ and O3(u) ← ⟨O1(v), O2(w)⟩. That is, we assign u to the super-state (overlay) that corresponds to the pair of super-states (overlays) that v and w belong to in D1 and D2, respectively.

Figure 6.6(b) shows the OD2FA D3 constructed from the OD2FA D1 in Figure 6.4 and the OD2FA D2 in Figure 6.5. In this figure, for each super-state, the number below the line is the super-state ID in D3, and the pair of numbers above the line are the super-state IDs of the super-states in D1 and D2 that this super-state corresponds to. For instance, consider state 7 in D3, which corresponds to state 1 in D1 and state 2 in D2. As we can see from Figures 6.4 and 6.5, state 1 ∈ D1 belongs to super-state 1 and overlay 0, and state 2 ∈ D2 belongs to super-state 0 and overlay 1. Therefore, in OD2FA D3, we assign state 7 to super-state 3, which corresponds to super-state 1 from D1 and super-state 0 from D2; similarly, we assign state 7 to overlay 1, which corresponds to overlay 0 from D1 and overlay 1 from D2.
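The following Python sketch shows this super-state and overlay assignment for the merged states. The dictionary layout is our own illustration; the overlay pair is flattened to the single index O1(v)·|O2| + O2(w), mirroring line 7 of the pseudocode given in Figure 6.7 below.

    def assign_merged(pairs, S1, O1, S2, O2, num_overlays2):
        # pairs: {u: (v, w)} maps each merged state u in D3 to its
        # constituent states v in D1 and w in D2.
        S3, O3 = {}, {}
        for u, (v, w) in pairs.items():
            S3[u] = (S1[v], S2[w])                 # super-state pair
            O3[u] = O1[v] * num_overlays2 + O2[w]  # flattened overlay index
        return S3, O3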
After defining F3, we need to adjust the deferment relationship F3 of the D2FA D3. Specifically, for each state s in a super-state S, where S defers to super-state S′, we let s defer to the state s′ of S′ in the same overlay as s, provided s′ ≠ ⊥. If s′ = ⊥, we split S into two super-states, S1 = S \ {s} and S2 = {s}, where S2 defers to the super-state that contains the state that s defers to (i.e., F3(S2) := S3(F3(s))). The case s′ = ⊥ rarely happens in our experimental RE sets. This super-state splitting ensures that Condition (C2) holds for D3. We show how the super-state transitions are created for the merged OD2FA in Section 6.5. Pseudo-code for our OD2FAMerge algorithm is given in Algorithm 6.7 (Figure 6.7).

    Input: OD2FAs, D1 and D2, with underlying D2FAs D1 and D2, corresponding to RE sets R1 and R2.
    Output: An OD2FA and its underlying D2FA corresponding to the RE set R1 ∪ R2.
     1  Let D3 ← D2FAMerge(D1, D2)            // algorithm from Section 4.3.2
     2  Set #overlays in D3: |O3| = n ← |O1| × |O2|
        // Create the super-states
     3  foreach Si ∈ S1 × Sj ∈ S2 do
     4    Initialize super-state S = ⟨Si, Sj⟩ with n NULL states
     5    foreach Ok ∈ O1, 0 ≤ k < |O1| × Ol ∈ O2, 0 ≤ l < |O2| do
     6      if state s = ⟨Si ∩ Ok, Sj ∩ Ol⟩ ∈ Q3 then
     7        Assign s to overlay O(k×|O2|+l) in super-state S
     8    if at least one non-NULL state in S then
     9      Add S to S3
    10      M3(S) ← M1(Si) ∪ M2(Sj)
        // Set super-state deferment
    11  foreach S ∈ S3 do
    12    Set F3(S) ← mode({S3(F3(s)) | s ∈ S})
    13    Let P = {s | (s ∈ S) ∧ (F3(S) ∩ O3(s) = ⊥)}
    14    foreach state u ∈ P do
    15      Remove u from super-state S
    16      Create new super-state S′ with just state u in overlay O3(u) and add S′ to S3
    17      Set M3(S′) ← M_D3(u)
    18      Set F3(S′) ← S3(F3(u))
    19    foreach state s ∈ S with F3(s) ≠ F3(S) ∩ O3(s) do
    20      Set F3(s) ← F3(S) ∩ O3(s), and regenerate the non-deferred transitions in ρ3 of D3 for state s
        // Create super-state transitions
    21  foreach S ∈ S3 × c ∈ Σ do
    22    CreateSuperStateTrans(S, c)
    23  Function CreateSuperStateTrans(S, c)
    24    C ← CreateSuperStateTransClassifier(S, F3(S), c)
    25    foreach rule ri ∈ C do add super-state transition ∆3(S, P(ri), c) = D(ri)
    26  Function CreateSuperStateTransClassifier(S, DS, c)
          /* Generate transitions for character c and super-state S when it defers to DS */
    27    Let ODec[n] be the offset decision vector, initialized to the wildcard decision ⊛ (Section 6.5.1.1)
    28    Let NODec[n] be the non-offset decision vector, initialized to ⊛
    29    Let Reqd[n] be the required vector, initialized to False
    30    foreach O ∈ O3 do
    31      if S ∩ O ≠ ⊥ then
    32        u = ⟨u1, u2⟩ ← S ∩ O             // current state
    33        nu ← δ3(u, c)                    // next state
    34        if ρ3(u, c) is defined then      // not deferred
    35          if S ≠ DS ∨ u ≠ nu then Reqd[O] ← True
    36        ODec[O] ← (S3(nu), (O3(nu) − O) mod n, 1)
    37        NODec[O] ← (S3(nu), O3(nu), 0)
    38    if #unique values in ODec ≤ #unique values in NODec then
    39      return CreateOverlayClassifier(ODec, Reqd)
    40    else
    41      return CreateOverlayClassifier(NODec, Reqd)

Figure 6.7: Algorithm OD2FAMerge(D1, D2) for merging two OD2FAs.

We now consider the following optimization for D3. Among the super-states that defer to the same super-state, we merge two compatible super-states into one super-state if merging them lets more state transitions be combined into super-state transitions. This is commonly the case when we lose a D2FA state that we expect to generate from a self-looping state. For example, in the D2FA in Figure 6.6(a), we lost the expected states ⟨2, 3⟩ and ⟨3, 2⟩, getting instead the single state 12 = ⟨3, 3⟩.
As a result, in Figure 6.6(b), the super-states 1₃ = ⟨2, 8, 5, ⊥⟩ and 3₃ = ⟨1, 7, 6, ⊥⟩ have ⊥ in overlay 3, and there is a super-state 4₃ = ⟨⊥, ⊥, ⊥, 12⟩ with just state 12 in overlay 3; super-state 4₃ is compatible with both super-states 1₃ and 3₃. We can create new super-state transitions by merging super-state 4₃ with either 1₃ or 3₃. In Figure 6.6(c), we show the resulting OD2FA when we merge 4₃ from Figure 6.6(b) with 3₃, adding the super-state transition out of super-state 0₃ on 'p' to super-state 3₃ for overlays 2 and 3 with offset o = 0, and the super-state transition out of super-state 3₃ on 'q' to super-state 5₃ (renamed 4₃ in Figure 6.6(c)) for overlays 2 and 3 with offset o = 0. Alternatively, we could have merged super-state 4₃ from Figure 6.6(b) with super-state 1₃ and added a super-state transition out of super-state 0₃ on 'p' to super-state 1₃ for overlays 1 and 3 with offset o = 0, and a super-state transition out of super-state 1₃ on 'r' to super-state 2₃ for overlays 1 and 3 with offset o = 0. After merging super-states, we regenerate the super-state transitions for all the super-states, not just the super-states that were merged, as merging super-states can create additional transition merging opportunities in other super-states too.

Theorem 8. Given as input OD2FAs D1 and D2, with corresponding equivalent D2FAs D1 and D2 for RE sets R1 and R2, the OD2FAMerge algorithm outputs an OD2FA D3 that is equivalent to the D2FA D3 for RE set R1 ∪ R2.

Proof. The D2FA D3 constructed by merging D2FAs D1 and D2 using the D2FAMerge algorithm is equivalent to the RE set R1 ∪ R2 [36]. Line 20 only changes the deferred state of some states, so D3 remains equivalent to RE set R1 ∪ R2. We now show that the generated OD2FA D3 is equivalent to the D2FA D3. To show equivalence, we need to show that for each state s ∈ Q3, the deferred state for s, the non-deferred transitions for s, and the matched REs for s derived from D3 are the same as in D3.

Let s = ⟨s1, s2⟩ ∈ Q3 be any state in D3. First, S3(s) and O3(s) are defined because we take a complete cross product of S1 × S2 and O1 × O2. The super-state transitions are directly generated from the D2FA state transitions, so it is easy to see that, ∀σ ∈ Σ, ρ3(s, σ) is defined in the OD2FA D3 if and only if it is defined in the D2FA D3, and when defined the two agree. Then we have the following two cases.

Case 1: S3(s) was added to S3 on line 16. Then the REs matched by s derived from D3 are M3(S3(s)) = M_D3(s) (set on line 17), and the deferred state of s derived from D3 is F3(S3(s)) ∩ O3(s) = S3(F3(s)) ∩ O3(F3(s)) = F3(s).

Case 2: S3(s) was added on line 9. Let S3(s) = S = ⟨S1, S2⟩. Then the REs matched by s derived from D3 are M3(S) = M1(S1) ∪ M2(S2) = M_D1(s1) ∪ M_D2(s2) = M_D3(s), and the deferred state of s derived from D3 is F3(S) ∩ O3(s) = F3(s).

6.4.3 Direct OD2FA Construction from 2 OD2FAs

Our OD2FA merge algorithm presented in Section 6.4.2 requires the underlying D2FA to be stored along with the OD2FA. This underlying D2FA requirement for merging OD2FAs is problematic for two main reasons. First, in most practical cases, we need to update the RE set over time. If the underlying D2FA is discarded, then when a new RE is added to the RE set, we cannot use the merge algorithm to merge the OD2FA for the new RE into the existing OD2FA; instead, we have to build the entire OD2FA again. This defeats one of the main advantages of the merge approach to building the OD2FA, which is automatic support for updating the RE set.
The second problem is that, because the underlying D2FA is generally orders of magnitude larger than the OD2FA, the size of the D2FA limits the scalability of the algorithm.

We now present our algorithm, called DirectOD2FAMerge, to merge two OD2FAs without storing the underlying D2FA. After the initial OD2FAs have been built for each individual RE, we only store the OD2FA at each merge step. The input is two OD2FAs, D1 = (Q1, Σ, q01, F1, S1, O1, M1, ∆1) for RE set R1 and D2 = (Q2, Σ, q02, F2, S2, O2, M2, ∆2) for RE set R2, where R1 ∩ R2 = ∅, and we construct OD2FA D3 = (Q3, Σ, q03, F3, S3, O3, M3, ∆3) for the RE set R3 = R1 ∪ R2.

Just as in our OD2FAMerge algorithm in Section 6.4.2, each state (super-state) in D3 corresponds to a pair of states (super-states) from D1 and D2. The first step is to compute Q3, i.e., to find which states in the underlying DFA for D3 are reachable. The set Q3 is not stored explicitly but is implicit in the set of non-empty overlays of each super-state. If we stored the set of non-empty overlays of each super-state as a list, the total size would be proportional to |Q3|, which can be very large. So the set of non-empty overlays of each super-state is stored as a ternary classifier (similar to how we store super-state transitions, which is discussed in Section 6.5).

One option for finding the reachable states is to simulate a UCP construction on the underlying DFAs of D1 and D2: we do the UCP construction, but after computing the transitions of each merged state, we do not store them. The UCP construction also gives the state-to-super-state and overlay assignments. The problem with this method is that the queue of unexplored states during the UCP construction can be proportional to |Q3|. To avoid this, we simulate the UCP construction focusing on super-states instead of states.

The construction works as follows. For each discovered super-state in D3, we maintain two sets of overlays: (1) the Explored set, containing the overlays that hold a reachable DFA state that has already been explored, and (2) the Unexplored set, containing the overlays that hold a reachable DFA state that has not yet been explored. We maintain a queue, Queue, of super-states in D3 that currently need to be explored, and explore one super-state from the queue at a time. For the super-state S currently being explored, we explore all the states corresponding to the overlays in S's Unexplored set, and then move all those overlays from the Unexplored set to the Explored set. When a new state, say s′ = S′ ∩ O′, is discovered, it is processed as follows. If S′ is a newly discovered super-state, we add it to Queue and set Explored(S′) = ∅ and Unexplored(S′) = {O′}. Otherwise, S′ has already been discovered and so is in S3; in this case, if O′ ∈ Explored(S′) or O′ ∈ Unexplored(S′), we do not have to do anything, as the state has already been discovered. Otherwise, this is a newly discovered state, so we add O′ to Unexplored(S′), and add S′ to Queue if it is not already there. A super-state may be added to Queue and explored multiple times because the non-empty overlays within a super-state are not all discovered at the same time. As mentioned earlier, the Explored and Unexplored overlay sets are maintained as ternary classifiers; as new overlays are added to the sets, the classifiers are minimized using the bit merging algorithm explained in Section 6.5.3.
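A simplified C++ sketch of this super-state-level exploration follows. It is illustrative only: the Explored/Unexplored sets are kept as plain hash sets rather than the ternary classifiers the chapter actually uses, and SuperStateOf, OverlayOf, and SuccessorsOfState are hypothetical helpers over the two input OD2FAs (SuccessorsOfState returns the merged successor state for each input character).

    #include <deque>
    #include <map>
    #include <set>
    #include <utility>
    #include <vector>

    using SuperStateId = std::pair<int, int>;  // pair of input super-states
    using Overlay = int;
    using MergedState = std::pair<int, int>;   // pair of input states

    struct SuperStateInfo {
      std::set<Overlay> explored;    // overlays already explored
      std::set<Overlay> unexplored;  // overlays discovered, not yet explored
      bool queued = false;
    };

    SuperStateId SuperStateOf(MergedState u);
    Overlay OverlayOf(MergedState u);
    std::vector<MergedState> SuccessorsOfState(SuperStateId S, Overlay O);

    std::map<SuperStateId, SuperStateInfo> DiscoverReachable(MergedState start) {
      std::map<SuperStateId, SuperStateInfo> info;
      std::deque<SuperStateId> queue;

      auto discover = [&](SuperStateId S, Overlay O) {
        SuperStateInfo& si = info[S];
        if (si.explored.count(O) || si.unexplored.count(O)) return;  // known
        si.unexplored.insert(O);
        if (!si.queued) { si.queued = true; queue.push_back(S); }
      };

      discover(SuperStateOf(start), OverlayOf(start));
      while (!queue.empty()) {
        SuperStateId S = queue.front();
        queue.pop_front();
        SuperStateInfo& si = info[S];
        si.queued = false;
        std::set<Overlay> todo;
        std::swap(todo, si.unexplored);  // explore S's pending overlays
        si.explored.insert(todo.begin(), todo.end());
        for (Overlay O : todo)
          for (MergedState nxt : SuccessorsOfState(S, O))
            discover(SuperStateOf(nxt), OverlayOf(nxt));
      }
      return info;
    }

Note that, as in the text, a super-state can re-enter the queue if new overlays of it are discovered after it has already been explored once.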
After computing the reachable states, we have all the terms of D3 constructed except F3 and ∆3. For the OD2FAs in Figures 6.4 and 6.5, this new merge algorithm results in the same OD2FA as shown earlier in Figure 6.6(b).

To set the super-state deferment, we use a method similar to that used in Section 4.3.2 to set state deferment when merging D2FAs. Let S = ⟨S0, T0⟩ be the current super-state in D3 for which we need to compute the deferment. Let S0 → S1 → ··· → Sl be the maximal deferment chain DC1 in D1 starting at S0 (i.e., Sl is the root super-state), and T0 → T1 → ··· → Tm be the maximal deferment chain DC2 in D2 starting at T0. We choose some super-state ⟨Si, Tj⟩, where 0 ≤ i ≤ l and 0 ≤ j ≤ m, to be F3(S). We only consider a candidate super-state pair if it is reachable in D3 and it overlay covers super-state S (so that Condition (C2) holds). Ideally, we want i and j to be as small as possible, though not both 0; our best choices are typically ⟨S0, T1⟩ or ⟨S1, T0⟩. However, it is possible that neither of these super-states is eligible (not reachable, or not overlay covering S), which leads us to consider other possible ⟨Si, Tj⟩.

For any candidate super-state pair ⟨Si, Tj⟩, we build the super-state transitions for super-state S as if it deferred to super-state ⟨Si, Tj⟩ in D3 (we show how to build the super-state transitions in Section 6.5). The number of super-state transitions built gives us a measure of the effectiveness of the deferment: the fewer transitions built, the better. One strategy (the best match method) is to consider all candidate super-state pairs and pick the one that results in the fewest super-state transitions built for super-state S. A faster strategy (the first match method) is to consider the 'distance sum' z = i + j in increasing order, from 1 to l + m. For the current distance sum z, we consider all super-state pairs at that distance, i.e., the set of super-states Z = {⟨Si, Tz−i⟩ | (max(0, z − m) ≤ i ≤ min(l, z)) ∧ (⟨Si, Tz−i⟩ ∈ S3) ∧ (⟨Si, Tz−i⟩ overlay covers S)}. From the set Z, we choose the super-state that results in the fewest super-state transitions built for super-state S. We can always find an eligible super-state to set as F3(S), since the root super-state pair ⟨Sl, Tm⟩ is always reachable in D3 and overlay covers all other super-states. For example, in Figure 6.6(b), for super-state 4 = ⟨1, 1⟩ there are three reachable super-state pairs along the deferment chains: 1 = ⟨0, 1⟩, 3 = ⟨1, 0⟩ and 0 = ⟨0, 0⟩. However, super-states 1 = ⟨0, 1⟩ and 3 = ⟨1, 0⟩ do not overlay cover super-state 4 = ⟨1, 1⟩, leaving super-state 0 = ⟨0, 0⟩ as the only candidate pair, which is chosen as the deferred super-state.

How the super-state transitions are created for the merged OD2FA is shown in Section 6.5. Pseudo-code for our DirectOD2FAMerge algorithm is given in Algorithm 6.8 (Figure 6.8).

    Input: OD2FAs, D1 = (Q1, Σ, q01, F1, S1, O1, M1, ∆1) and D2 = (Q2, Σ, q02, F2, S2, O2, M2, ∆2), corresponding to RE sets R1 and R2.
    Output: An OD2FA corresponding to the RE set R1 ∪ R2.
     1  Initialize D3 to an empty OD2FA
     2  Set #overlays in D3: |O3| = n ← |O1| × |O2|
        // Create the super-states
     3  Initialize queue as an empty queue
     4  queue.push(⟨q01, q02⟩)
     5  while queue not empty do
     6    u = ⟨u1, u2⟩ ← queue.pop()
     7    Q3 ← Q3 ∪ {u}
     8    S1 ← S1(u1); O1 ← O1(u1)
     9    S2 ← S2(u2); O2 ← O2(u2)
    10    if super-state S = ⟨S1, S2⟩ ∉ S3 then
    11      Initialize super-state S = ⟨S1, S2⟩ with n NULL states
    12      Add S to S3
    13      M3(S) ← M1(S1) ∪ M2(S2)
    14    Assign u to overlay (O1 × |O2| + O2) in super-state S
    15    foreach c ∈ Σ do
    16      nxt ← ⟨δ1(u1, c), δ2(u2, c)⟩
    17      if nxt ∉ Q3 ∧ nxt ∉ queue then queue.push(nxt)
        // Set super-state deferment
    18  foreach S ∈ S3 do F3(S) ← FindDefState(S)
        // Create super-state transitions
    19  foreach S ∈ S3 × c ∈ Σ do
    20    CreateSuperStateTrans(S, c)
    21  Function FindDefState(⟨S1, S2⟩)
    22    Let p0 = S1, p1, ..., pl be the super-states on the deferment chain from S1 to the root super-state in D1
    23    Let q0 = S2, q1, ..., qm be the super-states on the deferment chain from S2 to the root super-state in D2
    24    for z = 1 to (l + m) do
    25      S ← {⟨pi, qz−i⟩ | (max(0, z − m) ≤ i ≤ min(l, z)) ∧ (⟨pi, qz−i⟩ ∈ S3)}
    26      if S ≠ ∅ then return argmin_{DS∈S} (Σ_{c∈Σ} Cost(CreateSuperStateTransClassifier(⟨S1, S2⟩, DS, c)))
    27    return ⟨S1, S2⟩
    28  Function CreateSuperStateTrans(S, c)
    29    C ← CreateSuperStateTransClassifier(S, F3(S), c)
    30    foreach rule ri ∈ C do add super-state transition ∆3(S, P(ri), c) = D(ri)
    31  Function CreateSuperStateTransClassifier(S, DS, c)
          /* Generate transitions for character c and super-state S when it defers to DS */
    32    Let ODec[n] be the offset decision vector, initialized to ⊛
    33    Let NODec[n] be the non-offset decision vector, initialized to ⊛
    34    Let Reqd[n] be the required vector, initialized to False
    35    foreach O ∈ O3 do
    36      if S ∩ O ≠ ⊥ then
    37        u = ⟨u1, u2⟩ ← S ∩ O                      // current state
    38        nu1 ← δ1(u1, c); nu2 ← δ2(u2, c)          // next state
    39        if S = DS then                            // for the root super-state
    40          if (u1 ≠ nu1) ∨ (u2 ≠ nu2) then Reqd[O] ← True   // not a self-loop
    41        else
    42          du = ⟨du1, du2⟩ ← DS ∩ O
    43          if (δ1(du1, c) ≠ nu1) ∨ (δ2(du2, c) ≠ nu2) then Reqd[O] ← True   // not deferred
    44        ODec[O] ← (S3(⟨nu1, nu2⟩), (O3(⟨nu1, nu2⟩) − O) mod n, 1)
    45        NODec[O] ← (S3(⟨nu1, nu2⟩), O3(⟨nu1, nu2⟩), 0)
    46    if #unique values in ODec ≤ #unique values in NODec then
    47      return CreateOverlayClassifier(ODec, Reqd)
    48    else
    49      return CreateOverlayClassifier(NODec, Reqd)

Figure 6.8: Algorithm DirectOD2FAMerge(D1, D2) for merging two OD2FAs.

At the end, we apply the same optimization of merging sibling super-states together as in the case of our OD2FAMerge algorithm.
6.5 Building Super-state Transitions

In this section we describe how we combine state transitions to create super-state transitions, after the super-states have been created. The OD2FA captures similarity among states in different overlays within a super-state, so we expect state transitions (which are just singleton super-state transitions) to be combined over the overlay field; i.e., multiple singleton super-state transitions with the same current super-state, current input character, and decision values but different overlay values are combined. The super-state transitions are created for one super-state and one input character at a time. In the rest of the section, S refers to the current super-state and σ refers to the current input character for which we want to build the super-state transitions; T refers to the current (or potential) deferred super-state of S.
6.5.1 Combining State Transitions

To combine the state (singleton super-state) transitions, we first need to identify the (subsets of) overlays that have the same decision, that is, the same next super-state, overlay value, and offset bit.

A trivial way to combine state transitions is to create one super-state transition for each unique decision value among all the overlay decisions; all the overlays (i.e., state transitions) having the same decision are combined into the super-state transition for that decision. In this case, we have the smallest possible number of super-state transitions, equal to the number of unique decisions. The problem with this approach is that a super-state transition may then contain an arbitrary subset of overlays, and we would need to represent arbitrary subsets of overlays. This is problematic because any such representation has a size linear in the size of the overlay set O; the combined memory requirement of such a representation over all super-state transitions of all super-states would essentially be linear in the number of state transitions, which defeats the purpose of combining the state transitions.

To address this issue, we only combine state transitions whose overlay sets can be concisely represented. Specifically, we only create overlay subsets that can be represented as a ternary value; i.e., the set of overlays in each combined super-state transition is equal to the ternary expansion of a ternary value (see the sketch below). Recall that we treat the overlays as integers in the range [0..|O|) and that |O| is always a power of 2. In most cases this restriction does not cost us combining opportunities: in almost all cases, we are able to combine all state transitions with the same decision into a single super-state transition.
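For concreteness, the following C++ fragment (an illustrative sketch, not the dissertation's code) shows one way to store a ternary value over log2|O| bits as a mask/value pair and to enumerate the overlay set it expands to:

    #include <cstdint>
    #include <vector>

    // A ternary value: starMask has a 1 at every * position; valueBits
    // holds the binary bits (ignored at * positions).
    struct Ternary {
      uint32_t starMask;
      uint32_t valueBits;
    };

    // The "ternary expansion": every overlay that agrees with t on all
    // non-* positions. numOverlays must be a power of 2.
    std::vector<uint32_t> Expand(Ternary t, uint32_t numOverlays) {
      std::vector<uint32_t> overlays;
      for (uint32_t o = 0; o < numOverlays; ++o)
        if (((o ^ t.valueBits) & ~t.starMask) == 0)
          overlays.push_back(o);
      return overlays;
    }
    // Example: with |O| = 4, the ternary value 0* (starMask = 0b01,
    // valueBits = 0b00) expands to {0, 1}; ** expands to {0, 1, 2, 3}.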
6.5.1.1 Computing State Transitions

For each overlay O ∈ O, we have one of the following three cases: (a) S ∩ O = ⊥, i.e., the overlay is empty; (b) S ∩ O = s and δ(s, σ) ≠ δ(T ∩ O, σ), i.e., the state transition is not deferred; and (c) S ∩ O = s and δ(s, σ) = δ(T ∩ O, σ), i.e., the state transition is deferred. Of ⊆ O denotes the set of filled overlays, and Or ⊆ Of denotes the set of overlays for which the state transition is not deferred. Note that Of depends on S, while Or depends on S, T and σ. The super-state transitions generated for super-state S need to cover all the overlays in Or. We make the following two observations, which help us combine the state transitions into fewer super-state transitions.

• We never do a lookup on the OD2FA for any overlay O ∈ O \ Of of super-state S. Because of this, empty overlays can have any decision, and so can be 'merged' with any overlay. For example, suppose we have |O| = 4, where overlay 2 = (10)2 is empty, and overlays 0 = (00)2, 1 = (01)2 and 3 = (11)2 all have the same decision. If we try to combine just the filled overlays, we get two super-state transitions, with overlay sets 0∗ and 11. But since we would never do a lookup on the empty overlay, we can include it in the super-state transition, which results in only one transition with overlay set ∗∗. For every empty overlay we designate a special wildcard decision, denoted by ⊛, that matches any actual decision. Also note that if we include empty overlays in super-state transitions, Condition (C2) is necessary and sufficient to ensure that transition deferment works correctly.

• It is not necessary to defer transitions that match the deferred state. When combining state transitions, including transitions that could be deferred can result in fewer super-state transitions. For example, suppose we have |O| = 4, where all four overlays are filled and all have the same decision, but the transition for overlay 2 = (10)2 is deferred, whereas the transitions for overlays 0 = (00)2, 1 = (01)2 and 3 = (11)2 are not. If we require that the transition for overlay 2 be deferred, we need two super-state transitions, with overlay sets 0∗ and 11, to cover the remaining overlays; including the state transition for overlay 2 in the combined super-state transition results in only one super-state transition, with overlay set ∗∗.

Before we can combine state transitions, we first need to compute the state transition and deferment for each overlay. We create a Decision array, which records the decision for each overlay, and a corresponding Boolean Required array, which records whether the decision is necessary (i.e., whether it must be specified or it can be deferred). For empty overlays, the Decision value is set to ⊛ and Required is set to false. For filled overlays, how the state transitions are computed depends on the stage of the OD2FA construction.

During initial OD2FA construction for one RE: the underlying D2FA is available during the initial OD2FA construction, so the state transitions and deferments are determined by the D2FA.

During OD2FAMerge: since the underlying D2FA is stored in OD2FAMerge, the state transitions and deferments are again determined by the stored D2FA. The D2FA lookup from the underlying D2FA corresponds to lines 33 and 34 in Algorithm 6.7.

During DirectOD2FAMerge: we do a lookup on the two input OD2FAs to compute the state transitions and deferments. This lookup corresponds to lines 38 and 43 in Algorithm 6.8.

For the root super-state, we set the Required value to false for self-loop state transitions, even though these transitions are not deferred. As a result, the root super-state does not store the self-looping super-state transitions. If a lookup fails for a non-root super-state, we follow the deferment pointer and do a lookup on its deferred super-state. If a lookup fails for the root super-state, there is no deferment pointer to follow; however, we know that the missing transition is a self-loop (on the root super-state), so the destination super-state is the root super-state and the destination overlay is the current overlay. Since most transitions for the root super-state are self-loops, this greatly reduces the resulting number of super-state transitions.

We need to determine which of the two forms of super-state transitions (offset transitions or non-offset transitions) to create. Clearly we should use the form that results in fewer super-state transitions, so we create a Decision array for both the offset and the non-offset decisions, and use the one with fewer unique values to create the super-state transitions. In most cases, using the offset decisions results in fewer super-state transitions.

We only compute and store the transitions for the states of one super-state at a time. Once the super-state transitions have been constructed, the state transitions are discarded; hence we never store the state transitions for all states of the OD2FA at the same time.

For example, consider super-state 1 and input character d in the OD2FA in Figure 6.6(c). The OD2FA has four overlays, so O = {0, 1, 2, 3}.
In this case we have Of = {0, 1, 2} and Or = {0, 2}. The Decision array will be [(0, 1, 1), (0, 0, 1), (0, 1, 1), ⊛] and the Required array will be [true, false, true, false].

6.5.2 Creating Overlay Classifier

The set of state transitions over the overlays, for super-state S and input character σ, essentially forms a 1-dimensional classifier over the overlay field. The problem of creating a minimum set of covering super-state transitions then boils down to finding an equivalent ternary minimized classifier. We introduce some standard terminology first. A 1-dimensional classifier is defined over a field F and consists of a list of rules. Each rule r has a predicate P(r) ⊆ F and a decision D(r). A packet p ∈ F matches rule r if p ∈ P(r). The decision of the classifier C for a packet p is given by the first rule in C that matches p. For our purpose of using a classifier to build super-state transitions, we define a generalized version of a classifier that we call an overlay classifier.

Definition 7 (Overlay classifier). An overlay classifier, C, is a 1-dimensional classifier over the field O. Each rule r additionally has a Boolean flag, denoted by R(r), that indicates whether the rule is required or not; rules with decision ⊛ have their flag R(r) set to false. The rules in C satisfy the following properties:

• Ternary classifier: for each rule r ∈ C, its predicate P(r) is a ternary value.

• Non-conflicting property: for every packet p ∈ Of, all the rules that match p (if any) have matching decisions (note that ⊛ matches any actual decision).

• Covering property: for every packet p ∈ Or, there is at least one rule r ∈ C that matches p with R(r) true (which also implies D(r) ≠ ⊛).

• Restricted equivalence: two overlay classifiers are equivalent if, for every packet in Of for which both overlay classifiers have a match, they both have the same decision.

Given the Decision and Required values for each overlay, we first construct an overlay classifier with one rule per overlay. Specifically, we create an empty overlay classifier C over O; then for each overlay O, we add the rule Rule(O, Decision[O], Required[O]) to C, where Rule(x, y, z) creates a rule r with P(r) = x, D(r) = y and R(r) = z. Next we minimize the rules in C to get an equivalent overlay classifier C′ (discussed in the next section). After minimizing, each rule r ∈ C′ with R(r) = true gives us a combined super-state transition ∆(S, P(r), σ) = D(r) in the OD2FA. The covering property of overlay classifiers ensures that super-state S has a super-state transition covering every overlay in Or; the non-conflicting property ensures that each overlay in Of has at most one decision. Note that we may have more than one super-state transition covering an overlay, but in that case the non-conflicting property ensures that they all have the same decision.

For example, with super-state 1 and input character d in the OD2FA in Figure 6.6(c), the overlay classifier created has just one required rule, ∗0 → (0, 1, 1), which gives us the super-state transition (1, ∗0) --d--> (0, 1, 1). Figure 6.9 shows the overlay classifiers and corresponding super-state transitions generated for all the super-states in the OD2FA in Figure 6.6(c).

Figure 6.9: Overlay classifiers and corresponding super-state transitions for the super-states in the OD2FA in Figure 6.6(c).
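A concrete C++ rendering of this initial construction (an illustrative sketch mirroring the pseudo-code that follows; the type names and the wildcard flag standing in for ⊛ are this sketch's own):

    #include <cstdint>
    #include <vector>

    struct Decision {
      uint32_t nextSuperState = 0;
      uint32_t nextOverlay = 0;
      bool offsetBit = false;
      bool wildcard = true;  // stands in for the wildcard decision (empty overlay)
    };

    struct Ternary { uint32_t starMask; uint32_t valueBits; };

    struct OverlayRule {
      Ternary predicate;  // P(r)
      Decision decision;  // D(r)
      bool required;      // R(r); always false when the decision is the wildcard
    };

    // Build the initial classifier: one rule per overlay, whose predicate is
    // the overlay's binary value (no * bits yet). Minimization follows.
    std::vector<OverlayRule> CreateOverlayClassifier(
        const std::vector<Decision>& dec, const std::vector<bool>& reqd) {
      std::vector<OverlayRule> C;
      for (uint32_t o = 0; o < dec.size(); ++o)
        C.push_back({Ternary{0, o}, dec[o], reqd[o]});
      return C;  // to be passed to the minimization step of Section 6.5.3
    }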
The pseudo-code for creating the overlay classifier is given in Algorithm 6.10 (Figure 6.10).

    Input: The decision, Dec[], and required value, Reqd[], for each overlay.
    Output: An equivalent ternary minimized overlay classifier.
    1  n ← len(Dec)        // number of overlays; always a power of 2
    2  w ← log2(n)         // number of bits
    3  Create empty overlay classifier C with field width w
    4  foreach overlay o ∈ [0..n) do
    5    Insert Rule(o, Dec[o], Reqd[o]) in C
    6  return MinimizeOverlayClassifier(C)   // minimize the rules and return

Figure 6.10: Algorithm CreateOverlayClassifier(Dec, Reqd).

6.5.3 Minimizing Overlay Classifier

We now explain how we minimize the initial overlay classifier created from the Decision and Required arrays. We generalize the bit merging algorithm proposed in [31] to handle the wildcard decision and optional deferment. We introduce some standard terminology first.

For a ternary value T, the ternary position mask of T, denoted τ(T), is the binary value obtained by replacing all binary bits in T by 0 and all ternary bits (∗) in T by 1; it indicates the positions in T that hold a ternary bit. The binary bit mask of T, denoted β(T), is the binary value obtained by replacing all ternary bits in T by 1. The ternary position mask and binary bit mask together represent a ternary value using two binary values: if bit location b is a 1 bit in τ(T), then T has a ∗ in location b; otherwise T has the same binary bit in location b as β(T). So we can represent a ternary value T as the pair of binary values (τ(T), β(T)).

Two ternary values, T1 and T2, are said to be ternary adjacent if τ(T1) = τ(T2) and β(T1) and β(T2) differ in exactly one bit; in other words, T1 and T2 differ in exactly one location, which holds a binary bit in both T1 and T2. The ternary cover of T1 and T2 is the ternary value (τ(T1) | (β(T1) ^ β(T2)), β(T1) | (β(T1) ^ β(T2))) (here | is the bitwise OR, and ^ is the bitwise XOR); that is, the ternary cover is the ternary value obtained by replacing the differing binary bit location in T1 (or in T2) by the ternary bit ∗. Two rules are said to be ternary adjacent if their predicates are ternary adjacent and their decisions match.
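These definitions translate directly into a few bit operations; the C++ sketch below (illustrative) represents a ternary value by the (τ, β) pair just described.

    #include <cstdint>

    // A ternary value T stored as the pair (tau, beta) defined above:
    // tau has 1s at the * positions; beta is T with every * replaced by 1.
    struct Ternary {
      uint32_t tau;
      uint32_t beta;
    };

    // Ternary adjacent: same * positions, binary bit masks differing in
    // exactly one bit.
    bool TernaryAdjacent(Ternary t1, Ternary t2) {
      if (t1.tau != t2.tau) return false;
      uint32_t diff = t1.beta ^ t2.beta;
      return diff != 0 && (diff & (diff - 1)) == 0;  // exactly one bit differs
    }

    // Ternary cover: replace the differing binary bit by *.
    Ternary TernaryCover(Ternary t1, Ternary t2) {
      uint32_t diff = t1.beta ^ t2.beta;
      return Ternary{t1.tau | diff, t1.beta | diff};
    }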
We first minimize the rules in the overlay classifier and then remove the rules that are not required (i.e., have the R(r) flag set to false). Minimizing the overlay classifier is done in two steps, pre-merging bits and bit merging, which we explain using the example in Figure 6.11.

    Column 1 (initial classifier):
      0000 → A, 0001 → ⊛, 0010 → A?, 0011 → A, 0100 → ⊛, 0101 → ⊛, 0110 → B, 0111 → B,
      1000 → B, 1001 → ⊛, 1010 → ⊛, 1011 → A?, 1100 → ⊛, 1101 → ⊛, 1110 → ⊛, 1111 → B?
    Column 2 (bit 0 eliminated by pre-merging):
      000∗ → A, 001∗ → A, 010∗ → ⊛, 011∗ → B, 100∗ → B, 101∗ → A?, 110∗ → ⊛, 111∗ → B?
    Column 3 (bit merge, first pass):
      00∗∗ → A, ∗01∗ → A?, 0∗0∗ → A?, ∗11∗ → B, 01∗∗ → B?, 1∗0∗ → B, 11∗∗ → B?
    Column 4 (bit merge, second pass):
      00∗∗ → A, ∗01∗ → A?, 0∗0∗ → A?, ∗11∗ → B, 1∗0∗ → B, ∗1∗∗ → B?
    Column 5 (non-required rules removed):
      00∗∗ → A, ∗11∗ → B, 1∗0∗ → B

Figure 6.11: Minimizing overlay classifier example.

The pseudo-code for minimizing the overlay classifier is given in Algorithm 6.12 (Figure 6.12).

    Input: An initial overlay classifier C with n = |O| rules.
    Output: An equivalent overlay classifier with the rules minimized.
     1  w ← log2(n)                         // number of bits
     2  foreach bit k ∈ [0..w) do           // first try pre-merging bits
     3    premerge ← True
     4    foreach pair of rules ri, rj such that P(ri) and P(rj) differ only in bit k do
     5      if ri and rj are not ternary adjacent then   // i.e., their decisions do not match
     6        premerge ← False
     7        break
     8    if premerge then                  // bit k is pre-merged
     9      foreach pair of rules ri, rj such that P(ri) and P(rj) differ only in bit k do
    10        Remove rules ri and rj from C
    11        Insert rule MergedRule(ri, rj) in C
    12  C ← BitMerge(C)                     // then do bit merging
    13  foreach rule ri ∈ C do
          if R(ri) = False then remove ri from C   // remove non-required rules
    14  return C
    15  Function BitMerge(C)
    16    Create empty overlay classifier C′
    17    foreach rule ri ∈ C do initialize covered[i] ← False
    18    PM ← partition of the rules in C based on the ternary position masks of the rule predicates
    19    foreach partition pm ∈ PM do
    20      PD ← partition of the rules in pm based on rule decision
    21      foreach partition pd ∈ PD with corresponding decision d do
    22        foreach pair of rules ri, rj ∈ pd do
    23          if ri and rj are ternary adjacent then
    24            Insert MergedRule(ri, rj) in C′
    25            covered[i] ← covered[j] ← True
    26            R(ri) ← R(rj) ← False
    27        if d ≠ ⊛ then
    28          p⊛ ← partition in PD corresponding to ⊛
    29          foreach pair of rules ri ∈ pd × rj ∈ p⊛ do
    30            if ri and rj are ternary adjacent then
    31              Insert MergedRule(ri, rj) in C′
    32              covered[i] ← covered[j] ← True
    33              R(ri) ← R(rj) ← False
    34    if C′ is empty then return C      // no rules merged
    35    foreach rule ri ∈ C do
    36      if covered[i] = False then insert ri in C′
    37    Remove duplicate rules from C′
    38    return BitMerge(C′)               // recursively call BitMerge and return the result
    39  Function MergedRule(r1, r2)
    40    T ← ternary cover of P(r1) and P(r2)
    41    if D(r1) ≠ ⊛ then D ← D(r1) else D ← D(r2)
    42    reqd ← R(r1) ∨ R(r2)
    43    return Rule(T, D, reqd)

Figure 6.12: Algorithm MinimizeOverlayClassifier(C).

6.5.3.1 Pre-merging Bits

The initial overlay classifier created from the Decision and Required arrays has |O| rules, one rule for each overlay, and the predicate of any rule ri is i (the corresponding overlay value in binary). For our example, the first column in Figure 6.11 shows the initial overlay classifier; we have |O| = 16, there are two unique actual decisions, denoted A and B, and a '?' next to an actual decision indicates that the rule is not required (rules with decision ⊛ are always not required).

At this point we could directly apply the bit merging algorithm, which would produce a minimized set of rules. But in most cases, all but a few overlays have the same decision, so only the few bits that distinguish overlays having different decisions vary in the minimized rules; all the other bits are merged to ∗ in all the minimized rules. We can accelerate the bit merging step by identifying these bits and pre-merging them, so that bit merging only needs to work on the few remaining bits that are not pre-merged.

Pre-merging works as follows. For a binary value p, let 0̂b(p) denote the value obtained by inserting a 0 bit at location b, and 1̂b(p) the value obtained by inserting a 1 bit at location b. Bit location b is pre-merged if the following condition is true: ∀p ∈ [0..|O|/2), D(r at 0̂b(p)) matches D(r at 1̂b(p)). That is, for every pair of rules whose predicates differ only in bit location b, their decisions match. Since the decisions for every such pair of rules match, we merge each such pair: for a pair ri and rj, we create a new merged rule rk with P(rk) set to the ternary cover of P(ri) and P(rj); if D(ri) ≠ ⊛ then we set D(rk) ← D(ri), otherwise D(rk) ← D(rj); and we set R(rk) ← R(ri) ∨ R(rj). Rules ri and rj are replaced with the merged rule rk. We test and pre-merge one bit location at a time; every time a bit is pre-merged, the number of rules is halved. In our example in Figure 6.11, bit location 0 gets pre-merged, and the resulting rules are shown in the second column.
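A sketch of the pre-merge test for one bit location (illustrative C++; DecisionsMatch treats the wildcard decision ⊛ as matching anything):

    #include <cstdint>
    #include <vector>

    struct Rule {
      uint32_t decision;  // stand-in for the (super-state, overlay, offset) triple
      bool wildcard;      // true for the wildcard decision
      bool required;
    };

    bool DecisionsMatch(const Rule& a, const Rule& b) {
      return a.wildcard || b.wildcard || a.decision == b.decision;
    }

    // Insert bit 'bit' (0 or 1) into p at location b: the 0^b / 1^b maps.
    uint32_t InsertBit(uint32_t p, unsigned b, uint32_t bit) {
      uint32_t low = p & ((1u << b) - 1);   // bits below location b
      uint32_t high = (p >> b) << (b + 1);  // bits at/above b, shifted up
      return high | (bit << b) | low;
    }

    // Bit location b can be pre-merged iff, for every p in [0..n/2), the
    // decisions of the two rules whose predicates differ only in bit b match.
    // 'rules' is the initial classifier, rule i having predicate i.
    bool CanPreMerge(const std::vector<Rule>& rules, unsigned b) {
      uint32_t half = rules.size() / 2;
      for (uint32_t p = 0; p < half; ++p)
        if (!DecisionsMatch(rules[InsertBit(p, b, 0)], rules[InsertBit(p, b, 1)]))
          return false;
      return true;
    }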
6.5.3.2 Bit Merging Algorithm

The bit merging algorithm runs in several iterations. The input to each iteration is an overlay classifier C, and the output is an equivalent overlay classifier C′. Each iteration works as follows. We first initialize a Covered flag to false for each rule in C; for rule ri, Covered[ri] indicates whether rule ri is covered by some rule in C′. Then, for every pair of rules ri and rj in C that are ternary adjacent, we insert the merged rule rk in C′ (the merged rule is created in the same way as during the pre-merging step). After inserting rk into C′, we set Covered[ri] and Covered[rj] to true and set R(ri) and R(rj) to false. The idea behind setting the required flags of ri (and rj) to false is that, since a rule covering ri has already been added to C′, any further rules we add to C′ should not be marked required on account of ri. To speed up bit merging, we partition the rules based on the ternary position mask of each rule's predicate and on each rule's decision; this reduces the number of rule pairs we need to check for merging. After all pairs have been checked, any rules left in C with their Covered flag false are added to C′. The bit merging iterations continue as long as at least one merged rule is added to C′; when no pair of rules is merged, we stop and return the current overlay classifier.

For our example in Figure 6.11, we have two iterations of bit merging. After the first iteration, we get the rules in column 3. The first rule in column 3 is obtained by merging the first two rules in column 2; after merging them, both rules are marked as non-required, so when the third rule in column 3 is created by merging the first and third rules in column 2, it is marked as non-required. We get the rules in column 4 after the second iteration of bit merging. No more rules can be merged after that, so bit merging stops. Finally, we remove the non-required rules to get the final overlay classifier shown in column 5.

6.5.4 Overlay Discussion

6.5.4.1 Restricting Overlay Count to Power of 2

We keep the number of overlays in all intermediate OD2FAs and the final OD2FA a power of 2, and number the overlays starting with 0 and ending with |O| − 1. We achieve this by modifying the algorithm that constructs an OD2FA from one RE to pad empty overlays at the end if necessary. The OD2FA merge algorithm requires no modification, since the number of overlays in the merged OD2FA is equal to the product of the numbers of overlays in the two input OD2FAs.

We explain by example the benefit of requiring the number of overlays to be a power of 2. Figure 6.13(a) shows the D2FA for the RE /x.*y.*z/ and Figure 6.13(b) shows two possible overlay structures for the OD2FA. Since there are three self-looping states in the D2FA (0, 1 and 2), our algorithm places them in the root super-state. The overlay structure on the left has three overlays, holding the three self-looping states, with no padding.

Figure 6.13: Overlay padding example. (a) D2FA for RE /x.*y.*z/. (b) Possible overlay structures for the corresponding OD2FA, without and with padding. (c) Merged super-state, without and with padding. (d) TCAM rules, without and with padding.
The OD2 FA merge algorithm requires no modi cation since the number of overlays in the merged OD2 FA is equal to the product of the number of overlays in the two input OD2 FAs. We explain by example the bene t of requiring the number of overlays to be a power of 2. Figure 6.13(a) shows the D2 FA for the RE /x. y. z/ and Figure 6.13(b) shows two possible overlay structures for the OD2 FA. Since there are three self-looping states in the D2 FA, 0, 1 and 2, our algorithm places them in the root super-state. The overlay structure on the left has three overlays, with the three self-looping states in them, with no padding. ‐x 0 ‐y x 1 y 2 z (a) D2 FA for RE /x. y. x/. 3/3 1 2 0 1 2 0 0 ‐z 0 1 2 1   3 Without padding 0 0 1 2 0 1 2  0 1 2 3 1   3 3  With padding (b) Possible overlay structures for the corresponding OD2 FA. Figure 6.13: Overlay Padding Example. 182 0 3,0 2 1 3 1 7 6 12 0 1 2 3 3 X 4 5 0 6 2 0 0 1 1 2 7 8 9 10 11 1,0 1,1 1,2 7,0 7,1 7,2 6,0 6,1 6,2 12,012,112,2 Without padding 0 3,0 0 1 2 1 3 1 7 6 12 2 1,0 1,1 1,2 3  4 3 5 X 6 7 7,0 7,1 7,2  0 8 2 0 0 1 1 2 9 10 6,0 6,1 6,2 11 3  12 13 14 15  12,012,112,2  With padding (c) Merged super-state. Overlay 0 1 2 3 4 5 OID 0000 0001 0010 0011 0100 0101 Overlay 0 1 2 3 4 5 6 7 OID 00 010 Without padding OID 0000 0001 0010 0011 0100 0101 0110 0111 OID 0 With padding (d) TCAM rules. Figure 6.13: Overlay Padding Example (cont'd). In the right overlay structure, we pad one empty overlay, so that the number of overlays is a power of 2. Now consider what happens when this new OD2 FA in Figure 6.13(b) with and without padding, is merged with the OD2 FA in Figure 6.6(c). As an example, we consider the merging of super-state 3 in Figure 6.6(c), which we call S3 and super-state 0 for the new OD2 FA, which we call S0 . For both cases, Figure 6.13(c) shows the resulting 183 super-state in the merged OD2 FA, which we call Sm . In both cases, there will be 12 states in the merged super-state. The rst three of these states are replications of state 1 in S3 , the next three states are replications of state 7 in S3 , and so on. Furthermore, states 1 and 7 in S3 were itself replications of the state 1 of the D2 FA in Figure 6.4. Hence, the rst six states in Sm are replications of the same state (i.e. state 1) of the D2 FA in Figure 6.4. For the case without padding, Sm has 12 overlays, with one state in each overlay. For the case with padding, Sm has 16 overlays, with the overlays 3, 7, 11 and 15 being empty. Now, since the rst six states in Sm are replications of state 1 of the D2 FA in Figure 6.4, in the merged OD2 FA, they all will have one non-deferred transitions on input character a. In both cases, the overlay o sets will also be the same for all six state transitions. So all six overlays will have the same decision, and will bit-merge in the overlay classi er. Figure 6.13(d) shows the (predicates of the) rules in the minimized overlay classi er for both cases. For the case without padding, we can only get down to two rules from six rules. In the case with padding, the overlays 3 = 0011 and 7 = 0111 are empty overlays, and hence will have decision during bit-merging. As a result, we can merge all six rules into a single rule. 6.5.4.2 Eliminating Overlay Bits We modify the OD2 FA merge algorithm to eliminate unnecessary overlay ID bits and thus reduce the required TCAM entry width. The idea behind doing a cross product of overlays while merging is to capture the replication of states. 
Replicated states get assigned to different overlays in the same super-state. However, sometimes there is no replication, and we do not need the extra overlays. For example, consider merging the OD2FAs for the REs /ab.*cd/ and /ab.*ef/. The two input OD2FAs both have two overlays, 0 and 1, so in the merged OD2FA we create four overlays, 0, 1, 2, and 3. In this case, since both REs have a common prefix, there is no state replication, and overlays 1 and 2 are empty in the merged OD2FA. The two filled overlays, 0 and 3, have overlay IDs 00 and 11; since these two overlays differ in both bits, either bit is redundant and can be removed from the overlay ID, producing only two overlays, 0 and 1. In general, after merging two OD2FAs, we eliminate as many overlay ID bits as possible by searching for overlay ID bits i such that, in every pair of overlays whose overlay IDs differ only in bit i, at least one of the two overlays is empty. If bit i is eliminated, one empty overlay from each pair that differs in bit i is removed. We note that the overlay count stays a power of 2.

6.6 OD2FA Software Implementation

In this section we discuss the implementation of OD2FA in software on a general purpose processor. We first review the implementation of DFA and D2FA in software, then present our proposed implementation of OD2FA. Implementing any finite automaton mainly involves choosing a data structure to store the transition function and then implementing the lookup function on that data structure.

In a DFA (Q, Σ, q0, M, δ), each state in Q has |Σ| transitions. The transition function δ can be stored in memory as a 2-dimensional array of next-state values, indexed over Q and Σ; looking up the next state requires just one memory lookup in the array, using the current state and input character as indices. If we assume a 4-byte state ID value, the amount of memory required to implement the transition function is |Q| × |Σ| × 4 bytes.

For a D2FA (Q, Σ, q0, M, ρ, F), each state in Q has 0 to |Σ| transitions plus the deferment pointer, and most states have only a couple of transitions. So the transitions for each state can be stored as a list of (current character, next state) pairs in memory. To do a lookup, we go through the list of transitions for the current state to check whether there is a transition on the current input character; if there is one, we get the next state, otherwise we go to the deferred state of the current state and check its transition list. The amount of memory required to implement the transition function is (# transitions in ρ) × 5 bytes for the transitions plus |Q| × 4 bytes for the deferment pointers.
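The two lookup schemes can be sketched as follows (illustrative C++, assuming 1-byte input characters and the flat layouts just described):

    #include <cstdint>
    #include <utility>
    #include <vector>

    // DFA: one 2-D array access per input character.
    struct Dfa {
      std::vector<int32_t> next;  // |Q| x 256 next-state ids, one row per state
      int32_t Lookup(int32_t state, uint8_t c) const {
        return next[state * 256 + c];
      }
    };

    // D2FA: per-state list of (character, next state) pairs plus a deferment
    // pointer; on a miss, follow the deferment pointer and retry. The loop
    // terminates because the deferment root stores all |Sigma| transitions.
    struct D2fa {
      std::vector<std::vector<std::pair<uint8_t, int32_t>>> trans;  // rho
      std::vector<int32_t> defer;                                   // F
      int32_t Lookup(int32_t state, uint8_t c) const {
        for (;;) {
          for (const auto& t : trans[state])
            if (t.first == c) return t.second;
          state = defer[state];  // transition deferred: follow the pointer
        }
      }
    };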
6.6.1 Implementing OD2FA

We now discuss the implementation of an OD2FA (Q, Σ, q0, F, S, O, M, ∆). All of the fields of an OD2FA are simple to implement except for ∆. To implement ∆, we use a structure similar to that of a D2FA, except that instead of storing next-state values, we store pointers to overlay classifiers. Specifically, for each super-state, we store in memory a list of (current character, pointer to overlay classifier) pairs, one for each character that is not deferred. Note that a character may be deferred for some overlays, but we say it is not deferred if there is at least one overlay where it is not deferred.

Given the current super-state S, current overlay O and current character σ, the lookup is done as follows. We go through the transition list for super-state S to check whether there is an entry for character σ. If there is no entry for σ, we perform the lookup using the deferred super-state of S, F(S). If there is an entry for σ, it gives us the location of the overlay classifier to use, and we do a lookup in this overlay classifier for overlay O (we discuss next how to do this). If we find a match, the decision gives us the next super-state and overlay values. If we do not find a match, then overlay O is deferred for character σ, so we again perform the lookup using the deferred super-state F(S).

6.6.2 Overlay Classifier Storage and Lookup

An overlay classifier is just a list of rules. Each rule has a rule predicate, which is a ternary value, and a rule decision, which is a triple of next super-state, overlay value and offset bit. If we use 4-byte overlay ID values, the rule predicate can be stored using two 4-byte values: one is the ternary position mask of the rule predicate and the other is the binary bit mask of the rule predicate. The rule decision can also be stored as two 4-byte values, one for the next super-state and the other for the overlay value; the single offset bit can be encoded in either of these two values. We simply store the list of rules in memory, requiring 16 bytes per rule.

The lookup for an overlay O is done as follows. We go through the list of rules and check whether any rule matches overlay O. To check whether a rule r matches overlay O, we need to check whether the rule predicate P(r) covers O; P(r) covers O if all the bit locations that contain a binary bit in P(r) have the same bit in both P(r) and O. This check may be done very efficiently using just one bitwise OR, by testing (O | τ(P(r))) = β(P(r)).
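In code, the match test is exactly this one-OR-one-compare check (an illustrative sketch reusing the 16-byte rule layout just described):

    #include <cstdint>

    // One 16-byte overlay classifier rule: ternary predicate as (tau, beta),
    // decision as (next super-state, next overlay), with the offset bit
    // packed here into the top bit of the overlay word.
    struct OverlayRule {
      uint32_t tau;          // ternary position mask of P(r)
      uint32_t beta;         // binary bit mask of P(r)
      uint32_t nextSuper;
      uint32_t nextOverlay;  // bit 31 holds the offset bit in this sketch
    };

    // First-match lookup of overlay O in a classifier of n rules.
    // Returns the matching rule, or nullptr if O is deferred for this
    // character (no rule covers O).
    const OverlayRule* LookupOverlay(const OverlayRule* rules, int n, uint32_t O) {
      for (int i = 0; i < n; ++i)
        if ((O | rules[i].tau) == rules[i].beta)  // P(r) covers O
          return &rules[i];
      return nullptr;
    }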
6.6.3 Space Requirement

For the OD2FA, we need |S| × 4 bytes to store the super-state deferment pointers and roughly |S| bytes to store the super-state match function M. If m = Σ_{S∈S} (number of non-deferred characters for S), then we need m × 5 bytes to store the overlay classifier pointers. We optimize the size required to store the overlay classifiers using the following observation: the same overlay classifier may be used by multiple super-states for multiple characters. Rather than storing the same overlay classifier multiple times, we store one copy of each unique overlay classifier; in each super-state transition list, every entry that points to the same overlay classifier uses the same pointer. The memory required to store the overlay classifiers is then 16 bytes times the total number of rules among all the unique overlay classifiers.

6.7 OD2FA Implementation in TCAM

In this section, we describe how OD2FA can be implemented in TCAM and present our OverlayCAM algorithm for doing so. We extend our RegCAM algorithm described in Chapter 5 to implement the OD2FA in TCAM. The RegCAM implementation uses two tables to represent an automaton: a TCAM lookup table with a source state ID column and an input character column, and a corresponding SRAM decision table that contains the next state ID. To implement OD2FA in TCAM, we use the pair of super-state ID and overlay ID as the source state ID in the TCAM lookup table, and the next state ID (the pair of next super-state ID and next overlay ID) in the SRAM decision table. The super-state ID and overlay ID columns in TCAM are filled with ternary values that together match multiple states rather than a single state, whereas the super-state ID and overlay ID columns in SRAM are binary values that together give a single state. We add an extra bit in the SRAM decision table to hold the offset bit of the super-state transition decision. Just as in RegCAM, we leverage the first-match feature of TCAMs to ensure that the correct transition is found in the TCAM lookup table: specifically, if super-state S defers to super-state S′, then we list all the super-state transitions for super-state S before those of super-state S′. We describe the specific challenges of implementing OD2FA in TCAM, including dealing with super-states, overlays, and super-state transitions, in the remainder of this section.

6.7.1 Generating Super-state IDs and Codes

For the super-states, we apply the shadow encoding algorithm described in Section 5.2.2.3 to the super-state deferment forest of the given OD2FA. This generates a binary super-state ID SSID(S) and a ternary super-state shadow code SSCD(S) for each super-state S, satisfying the Shadow Encoding Properties (SEP). Figure 6.6(c) shows the SSIDs and SSCDs generated for that OD2FA.

6.7.2 Implementing Super-state Transitions

We now address the implementation of super-state transitions in TCAM. Let (S1, X) --σ--> (S2, o, b) be the super-state transition we want to implement. In the TCAM table, we use SSCD(S1) in the super-state ID column; since we restrict the set of overlays in any super-state transition to a ternary value, we can use X directly in the overlay ID column. For the SRAM, we use SSID(S2) in the super-state ID column, the binary representation of the overlay value o in the overlay ID column, and the offset bit b in the offset bit location.

The RE matching process works as follows. Let S be the current super-state, O the current overlay, and σ the current input character, so s = SSID(S) · O denotes the current state; s concatenated with σ is used as the TCAM lookup key. Let uid be the SSID stored in the super-state ID column of the matching SRAM entry, o the value stored in its overlay ID column, and b the value of its offset bit. We compute the next super-state ID and overlay ID as follows. The next super-state ID is uid, and the next overlay ID is (b × O(s) + o) mod |O|: if b = 0, the next overlay ID is simply o; if b = 1, the next overlay ID is (O(s) + o) mod |O|. In the most common case, b = 1 and o = 0, so the next overlay ID is (O(s) + 0) mod |O| = O(s). For example, consider the OD2FA in Figure 6.6(c). We represent the super-state transition ∆(0₃, {0, 1}, a) = (3₃, 0, 1) as follows: the TCAM super-state ID column is filled with SSCD(0₃) = ∗∗∗, the TCAM overlay ID column is 0∗, the SRAM super-state ID column is filled with SSID(3₃) = 011, the overlay ID column is filled with 0, and the offset bit is set to 1.
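The next-state computation reduces to a couple of arithmetic operations (illustrative sketch; numOverlays is |O|, a power of 2, so the mod is a bit mask):

    #include <cstdint>

    struct NextState { uint32_t superId; uint32_t overlay; };

    // Apply one SRAM decision (uid, o, b) to the current overlay, as
    // described above.
    NextState ApplyDecision(uint32_t curOverlay, uint32_t uid, uint32_t o,
                            uint32_t b, uint32_t numOverlays) {
      // b = 0: next overlay is o.  b = 1: next overlay is (curOverlay + o) mod |O|.
      uint32_t overlay = (b * curOverlay + o) & (numOverlays - 1);
      return NextState{uid, overlay};
    }
    // E.g., with |O| = 4: b = 1, o = 0 keeps the current overlay, while
    // b = 0 jumps unconditionally to overlay o.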
6.7.3 TCAM Table Generation

We now explain how we generate the TCAM entries for an OD2FA, one super-state at a time. Say S is the current super-state. We use the overlay classifiers of super-state S to generate its TCAM rules: for each character for which S has an overlay classifier, we add a TCAM entry for each rule in the overlay classifier, as described in the previous section. After building this initial TCAM table for S, we reduce the TCAM entries as follows. We apply the bit merging algorithm explained in Section 6.5.3.2 to the TCAM entries generated for the super-state. The predicate of each rule corresponding to the TCAM entries has three parts: the current super-state code SSCD(S), the overlay set X, and the current input character. The SSCD(S) part is the same in all TCAM rules for S, and the bit merging algorithm was already applied to the overlay field while building the overlay classifiers, so we cannot merge TCAM rules using any bits from these two fields. However, we can merge rules based on the current input character field. This mostly helps with case-insensitive searches, where transitions on the alphabet characters mostly occur in pairs, and such pairs can be merged because they differ in only one bit of their ASCII encoding.

We order the TCAM tables of the super-states according to the super-state deferment relationship (every super-state's table occurs before its deferred super-state's table). The overlay classifiers for the root super-state exclude all the self-looping transitions; all of these transitions are handled by the last rule added to the TCAM, which is all ∗s.

Figure 6.14 shows the final TCAM and SRAM tables for the OD2FA in Figure 6.6 and, for comparison purposes, the TCAM and SRAM tables generated by the RegCAM algorithm for the same RE set {/ab[^n]*pq/, /cd[^n]*pr/}.

Figure 6.14: TCAM rules for RegCAM and OD2FA.

6.7.4 Variable Striding

In this section, we describe how we adapt the technique of variable striding, introduced in Section 5.4, for use with OD2FA. We first explain the basic idea of variable striding in a DFA. Creating a full k-stride DFA leads to space explosion for two reasons. First, each state in a k-stride DFA has |Σ|^k transitions, which leads to transition explosion. Second, any time a k-stride transition passes through an accepting state, we might need to create multiple copies of the destination state in order to record the matching, which leads to state explosion. A k-var-stride DFA handles both of these problems by generating variable-stride transitions (of stride between 1 and k); the transition decision stores the stride length of the transition along with the destination state. The problem of transition explosion is managed by selectively extending the stride of a limited number of transitions. The problem of state explosion is eliminated by never extending a transition past an accepting state. There are two implementations of variable striding that we considered in Section 5.4: self-loop unrolling and full variable striding.

6.7.4.1 Self-loop Unrolling

The self-loop unrolling technique for the OD2FA works in the same way as for the D2FA, as presented in Section 5.4. The basic idea behind self-loop unrolling is as follows. The last rule in the TCAM table for the root super-state is always the self-loop rule, which handles all the self-looping transitions for all the states in the root super-state.
For example, consider the TCAM table for the root super-state (0) in Figure 6.14, which is also shown in Figure 6.15(a), and consider the lookup when the next two input characters are 'xa' and 0 is the current super-state. On the first input character, x, we match the last self-loop rule, which indicates that after processing the current character we return to the same state. We can replace the last self-loop rule with another copy of super-state 0's TCAM table, with the input character over the second stride and ∗s in the first stride; this is shown in Figure 6.15(b), with the second copy of the rules marked as stride-2. If we now do a lookup for 'xa', we match the first stride-2 rule. Thus, instead of performing two lookups in the 1-stride table, we get the same decision by performing one lookup in the unrolled 2-stride table. If we unroll the self-loop rule at the end of the second copy of the TCAM rules one more time, we get the table shown in Figure 6.15(b). We can further unroll the self-loop rule to extend to a k-stride table. If the 1-stride TCAM table has n rules, the self-loop unrolled k-stride table has only (n − 1)k + 1 rules; for example, the 6-rule 1-stride table in Figure 6.15(a) unrolls to (6 − 1) × 3 + 1 = 16 rules in the 3-var-stride table in Figure 6.15(b).

Figure 6.15: Root super-state self-loop unrolling example for the TCAM rules in Figure 6.14. (a) 1-stride table for super-state 0. (b) Super-state 0 table unrolled to 3-var-stride.

6.7.4.2 Full Variable Striding

Adapting the full variable striding technique to the OD2FA is more challenging. The k-var-stride transition sharing algorithm presented in Section 5.4 generates k-var-stride tables that correctly handle state deferment in the D2FA. What we mean by this is the following. Suppose s1 is the current state and it defers to state s2. If we look up a character and match a rule from state s2's TCAM table giving the next state s3, then state s1 also transitions to state s3 on the same input. In general, a match found in the TCAM table of an ancestor of s1 when doing a lookup for s1 is always correct.

We cannot extend the k-var-stride transition sharing algorithm to OD2FA to generate tables that correctly handle deferment. The difficulty arises from the following: in an OD2FA, each super-state has multiple states, and on the same input, different states in the same super-state might transition to states in different super-states. Thus, we propose an alternate technique to generate variable-stride tables. For each super-state S, we generate a k-var-stride table in addition to its 1-stride table. When the k-var-stride table is implemented in TCAM, we use SSID(S) instead of SSCD(S) in the current super-state column of the TCAM; that way, the k-var-stride rules of super-state S only match when doing a lookup for S itself, and never match when doing a lookup for any other super-state.
So the k-var-stride rules only have to be correct for S. The k-var-stride table for S is placed just before its 1-stride table in TCAM, so the k-var-stride rules take priority over the 1-stride rules.

We now explain our algorithm to generate the k-var-stride table for a super-state. We define the variable stride transition function as Γ : S × 2^O × (∪_{1≤i≤k} Σ^i) → S × [0..|O|) × {0, 1}, which is the same as ∆ except that Γ transitions over a string of between 1 and k characters. Let S be the super-state for which we are generating the k-var-stride transitions. For each 1-stride transition of super-state S, we build k-var-stride transitions by extending the transitions of its destination super-state with that transition in two ways: first by composing with the destination's k-var-stride table, then by composing with the destination's 1-stride table. More specifically, let (S, X) −σ→ (S1, o1, 1) ∈ ∆ be any 1-stride transition for S such that S < S1 and M(S1) = ∅. We add the condition S < S1 because we only want to extend forward transitions, and this condition is true for most forward transitions. We add the condition M(S1) = ∅ because we stop a variable stride transition at matching super-states. If we have not already built the k-var-stride transition table for super-state S1, we recursively build it first. Then we first extend the transitions in the k-var-stride table of S1: for each transition (S1, Y) −w→ (S2, o2, 1) in the k-var-stride transition table of S1, if |X ∩ Y| is large enough and len(w) < k, we add the extended transition (S, X ∩ Y) −σ.w→ (S2, (o1 + o2) mod |O|, 1) to the k-var-stride transition table for S. Next we extend the transitions in the 1-stride table of S1: for each transition (S1, Y) −σ2→ (S2, o2, 1) in the 1-stride transition table of S1, if |X ∩ Y| is large enough, we add the extended transition (S, X ∩ Y) −σ.σ2→ (S2, (o1 + o2) mod |O|, 1) to the k-var-stride transition table for S. In our experiments, we use the condition |X ∩ Y| ≥ min(|X|, |Y|)/4 as the measure of "large enough". When we extend one transition by another, the extended transition can only cover the overlays that are common to both initial transitions. Ideally, we would like both transitions to cover the exact same set of overlays (in most cases this is true). But even when the overlay sets differ, if the size of the intersection is significant compared to the number of overlays covered by the two initial transitions, it is worthwhile to add the extended transition. We do not extend 1-stride transitions on whitespace characters. We have found experimentally that extending 1-stride transitions on these characters significantly increases the number of TCAM rules while only marginally (if at all) increasing the average stride.

Figure 6.16 shows the k-var-stride transition table built for super-state 0 from the 1-stride transition tables in Figure 6.9.

[Figure 6.16: Variable stride transitions generated for super-state 0 from the 1-stride transitions in Figure 6.9. Each row pairs a 1-stride rule of super-state 0 with a rule of the next super-state and shows the resulting extended var-stride rule, yielding 2-stride rules for the strings ab, cd, pq, and pr.]

The pseudo-code of our algorithm for building the k-var-stride transition tables is shown in Figure 6.17.

    Input: OD2FA D = (Q, Σ, q0, F, S, O, M, ∆).
    Output: Builds multi-stride transitions (Γ) for D.

    foreach Si ∈ S do Built[Si] ← False;
    foreach Si ∈ S do
        if Built[Si] = False then BuildVarStrideTrans(Si);

    Function BuildVarStrideTrans(S):
        foreach offset transition (S, X) −σ→ (Si, o, 1) ∈ ∆ for super-state S do
            if Si ≤ S then continue;        // skip backward transitions
            if M(Si) ≠ ∅ then continue;     // stop at accepting super-states
            if Built[Si] = False then BuildVarStrideTrans(Si);
            // extend the var-stride transitions of the destination super-state
            foreach transition (Si, Y) −w→ (Sj, o2, 1) ∈ Γ for super-state Si do
                if |X ∩ Y| ≥ min(|X|, |Y|)/4 and len(w) < k then  // stride limit not reached
                    add transition (S, X ∩ Y) −σ.w→ (Sj, (o + o2) mod |O|, 1) to Γ;
            // extend the 1-stride transitions of the destination super-state
            foreach offset transition (Si, Y) −σ2→ (Sj, o2, 1) ∈ ∆ for super-state Si do
                if |X ∩ Y| ≥ min(|X|, |Y|)/4 then
                    add transition (S, X ∩ Y) −σ.σ2→ (Sj, (o + o2) mod |O|, 1) to Γ;
        Built[S] ← True;

Figure 6.17: Algorithm BuildVarStrideOD2FA(D) to build k-var-stride rules.
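To make the composition step concrete, here is a minimal C++ sketch of the core extension test, under the |X ∩ Y| ≥ min(|X|, |Y|)/4 heuristic described above. The type and function names (OverlaySet, VarTrans, extend) are illustrative, and the {0,1} offset-value bit of Γ's decision is elided for brevity.

    #include <algorithm>
    #include <iterator>
    #include <set>
    #include <string>

    using OverlaySet = std::set<int>;

    // A variable-stride transition, mirroring Gamma in the text: covered
    // overlays X, input string w (1..k characters), destination
    // super-state, and overlay offset (taken mod |O|).
    struct VarTrans {
        OverlaySet X;
        std::string w;
        int dest;
        int offset;
    };

    static OverlaySet intersect(const OverlaySet& a, const OverlaySet& b) {
        OverlaySet r;
        std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                              std::inserter(r, r.begin()));
        return r;
    }

    // Try to extend the 1-stride transition (S, X) -sigma-> (S1, o1, 1)
    // by a transition t2 already in S1's table. Succeeds only if the
    // overlay intersection is "large enough" and the stride limit k is
    // respected.
    bool extend(const OverlaySet& X, char sigma, int o1,
                const VarTrans& t2, int numOverlays, int k, VarTrans& out) {
        OverlaySet common = intersect(X, t2.X);
        // |X ∩ Y| >= min(|X|, |Y|)/4, written without floating point.
        if (4 * common.size() < std::min(X.size(), t2.X.size())) return false;
        if (1 + t2.w.size() > static_cast<std::size_t>(k)) return false;
        out.X = std::move(common);
        out.w = sigma + t2.w;                        // stride grows by one
        out.dest = t2.dest;
        out.offset = (o1 + t2.offset) % numOverlays; // compose overlay offsets
        return true;
    }

Extending by a 1-stride transition of S1 is simply the special case where t2.w is a single character; the recursive construction in Figure 6.17 determines when extend is attempted.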
6.8 Experimental Results

We implemented OverlayCAM using C++ and conducted experiments to evaluate its effectiveness and scalability. We verify our results by confirming that the TCAM table generated by OverlayCAM is equivalent to the original DFA. That is, for every pair of current state and input character, the next state returned by the TCAM lookup matches the next state returned by the DFA.

6.8.1 Effectiveness of OverlayCAM

We use the same 8 RE sets used in Section 4.5 for the main results. We define the following metric for measuring the amount of state replication in the DFA that corresponds to an RE set. For any RE set R, we define SR(R) to be the number of states in the minimum state DFA corresponding to R divided by the number of states in the standard NFA (without ε-transitions) corresponding to R. Based on the characteristics of the REs, these eight sets are partitioned into three groups: STRING = {C613, Bro217}, which contains mostly strings, causing little state replication (SR(Bro217) = 3.0, SR(C613) = 2.1); WILDCARD = {C7, C8, C10}, which contains multiple wildcard closures '.*', causing lots of state replication (SR(C7) = 231, SR(C8) = 43, and SR(C10) = 162); and SNORT = {Snort24, Snort31, Snort34}, which contains a diverse set of REs, roughly 40% of which have wildcard closures, causing moderate state replication (SR(Snort24) = 24, SR(Snort31) = 22, and SR(Snort34) = 16).

We conducted side-by-side comparisons with RegCAM-TC (RegCAM without Table Consolidation) and RegCAM+TC (RegCAM with Table Consolidation) on all 8 real-world RE sets. For RegCAM+TC, we consolidated 4 tables together. The results are shown in Table 6.1. For TCAM space, we report only the number of TCAM entries, because the TCAM width is the same for all TCAM tables generated by RegCAM-TC, RegCAM+TC, and OverlayCAM on all 8 RE sets. Since TCAM width typically can only be configured as 36, 72, or 144 bits, we use a TCAM width of 36 bits in all cases.
(In the three column groups below, the sub-columns are, in order: RegCAM-TC, RegCAM+TC, OverlayCAM.)

    RE set   #NFA     SR     #NFA   #Over- #Super-  #TCAM entries            SRAM size (Kb)           Throughput (Gbps)
             states          trans. lays   states
    C8          72   43.17    2177    72      85     3722   1012    125      47.25   51.39    1.83    5.44  8.51  12.50
    C10         92  161.61    2982   288     133    17824   4739    263     261.09  277.68    4.62    3.11  4.35  12.12
    C7         107  231.31    3261   648     127    29196   8315    234     456.19  519.69    4.57    3.11  3.64  12.31
    Snort24    575   24.15    4054    30     897    16130   5310   1426     236.28  331.88   26.46    3.64  4.35   7.27
    Snort34    891   15.52    4731    48    1151    16297   5026   2293     238.73  294.49   42.55    3.64  4.35   5.44
    Snort31    917   21.88    5738    32    2395    41539  14464   9478     689.61  960.50  185.12    2.72  3.64   3.64
    Bro217    2132    3.06    5424     2    3401     9143   5087   6028     133.93  317.94   88.30    3.64  4.35   4.35
    C613      5343    2.12   14563     1   11308    18256  13182  18256     320.91  978.35  338.73    3.11  3.64   3.11

Table 6.1: Experimental results of OverlayCAM on 8 RE sets in comparison with RegCAM-TC and RegCAM+TC.

TCAM lookup speed is typically higher for smaller TCAM chips. We use the TCAM model discussed in Section 5.5 to calculate RE matching throughput.

For the two string-based RE sets Bro217 and C613, we observe that OverlayCAM does not significantly outperform the two RegCAM algorithms. This is expected, as OverlayCAM is designed to handle state replication, and string-based RE sets have little state replication. For the other RE sets, OverlayCAM significantly outperforms RegCAM and often outperforms NFAs. (1) OverlayCAM uses orders of magnitude less TCAM and SRAM than RegCAM. On average, OverlayCAM uses 41 times less TCAM and 33 times less SRAM than RegCAM-TC, and 12 times less TCAM and 38 times less SRAM than RegCAM+TC. (2) OverlayCAM has significantly higher throughput than RegCAM. On average, OverlayCAM has 2.5 and 1.93 times higher throughput than RegCAM-TC and RegCAM+TC, respectively. (3) The total number of TCAM entries used by OverlayCAM is often (far) smaller than the total number of NFA transitions. For C7, OverlayCAM's number of TCAM entries is 14 times less than the number of NFA transitions.

We now describe why OverlayCAM performs so well. (4) OverlayCAM is very effective in conquering state replication. OverlayCAM effectively and automatically identifies all NFA state replicates and groups them together into super-states. The number of super-states is, on average, 1.55 times the number of NFA states, and is never more than 2.61 times the number of NFA states. Because of this, the larger SR(R) is, the more that OverlayCAM outperforms RegCAM. For C7, OverlayCAM uses 125 times less TCAM and 100 times less SRAM than RegCAM-TC, and 36 times less TCAM and 114 times less SRAM than RegCAM+TC. (5) OverlayCAM effectively multiplies the compression benefits of conquering state replication and transition sharing. That is, OverlayCAM effectively multiplies the benefits of ODFA and D2FA. The average number of TCAM entries per super-state is only 2.14, even when super-states have hundreds of constituent states.

We wanted to conduct a side-by-side comparison with Peng et al.'s scheme [38]; however, we do not have access to their code. Fortunately, Peng et al. have reported their results on the two public RE sets Snort24 and Snort34. For these two sets, OverlayCAM requires 2.15 and 1.44 times less TCAM and SRAM space, respectively.

6.8.2 Results on 7-var-stride

We now compare the results of applying the variable striding technique with k = 7 on OverlayCAM with the results for RegCAM-TC. We compare the average stride values achieved, using the same traces that were used for the experiments in Section 5.6.3, as well as the number of TCAM rules. We only compare using the RE sets in the WILDCARD and SNORT groups, since the RE sets in the STRING group have no (or limited) state replication.
6.8.2.1 Self-loop Unrolling

The root states in both RegCAM-TC and OverlayCAM are exactly the same, since the self-looping states are selected as the root states. As a result, the resulting TCAM rules after unrolling the root states are semantically equivalent. Hence we get the exact same average stride values for both algorithms (shown in Table 6.3). Table 6.2 shows the number of TCAM rules required without self-loop unrolling (i.e., for 1-stride) and with self-loop unrolling for both algorithms.

    RE set      RegCAM-TC                         OverlayCAM
                1-stride   Unroll  7-var-stride   1-stride   Unroll  7-var-stride
    C8             3722     7794       8192           125      310        814
    C10           17824    36336      65536           263      590       1113
    C7            29196    64356      65536           234      442       1381
    Snort24       16130    18627      32768          1426     1482       6942
    Snort34       16297    19825      32768          2293     2577       9654
    Snort31       41539    43920      65536          9478     9819      32243

Table 6.2: Number of TCAM rules for RegCAM-TC and OverlayCAM for 1-stride, with self-loop unrolling, and with 7-var-stride.

Compared to RegCAM-TC, OverlayCAM requires on average 77 times fewer TCAM rules for the WILDCARD group and 8 times fewer TCAM rules for the SNORT group. The average percentage increase in the number of TCAM rules resulting from unrolling the roots for the SNORT group is 14.3% for RegCAM-TC but only 6.6% for OverlayCAM. This is because in RegCAM-TC there are many root states that are unrolled, whereas in OverlayCAM there is only one root super-state that is unrolled.

6.8.2.2 Full Variable Striding

Table 6.2 shows the number of TCAM rules required for full variable striding, and Table 6.3 shows the average stride values for RegCAM-TC and OverlayCAM. As we can see, OverlayCAM requires many fewer TCAM rules than RegCAM-TC. On average, OverlayCAM requires 38.8 times fewer rules for the WILDCARD group and 3.4 times fewer TCAM rules for the SNORT group.

    RE set     Self-loop unroll     7-var-stride
                                    RegCAM-TC         OverlayCAM
                 0    50    95       0    50    95     0    50    95
    C8          6.1   2.9   1.8     6.1   4.1   2.9   6.1   3.8   3.7
    C10         5.9   3.4   1.9     6.0   4.5   3.2   5.9   4.1   3.6
    C7          6.1   1.9   1.8     6.1   3.7   3.8   6.1   2.7   3.8
    Snort24     5.6   1.7   1.1     5.7   2.9   3.6   5.6   2.4   4.0
    Snort34     5.9   1.7   1.1     5.9   3.4   3.7   5.9   2.5   4.1
    Snort31     6.1   1.7   1.1     6.2   2.8   2.3   6.1   2.3   2.9

Table 6.3: Average stride values for self-loop unrolling and 7-var-stride for RegCAM-TC and OverlayCAM for pM = 0, 50, and 95.

In general, OverlayCAM is able to achieve nearly the same average stride values as RegCAM-TC. For random traffic (pM = 0), OverlayCAM has nearly identical average stride values to RegCAM-TC. This is because with random traffic, most of the transitions taken are self-loops around the root state, which are unrolled to 7-stride in both algorithms. For pM = 95, OverlayCAM achieves equal or higher average stride values than RegCAM-TC for all the RE sets. This is because with pM = 95, most of the transitions taken are forward transitions, and OverlayCAM is able to selectively combine longer chains of forward transitions into higher stride transitions than RegCAM-TC. The average ratio of the stride values across all RE sets and pM values is only 1.09.

6.8.3 Scalability of OverlayCAM

We evaluated the scalability of OverlayCAM on synthetic RE sets constructed by adding, one at a time, REs from a set of 13 REs taken from a recent release of the Snort rules. Each RE contains a closure on a wildcard or a range; these closures cause the DFA size to double as each RE is added. The final DFA has 225,040 states. We first define the TCAM Expansion Factor (TEF) of an RE set to be the number of TCAM entries divided by the number of NFA transitions.
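As a quick numeric illustration of this metric (the function name below is illustrative, and the example numbers come from the C7 row of Table 6.1):

    #include <cstdio>

    // TEF = #TCAM entries / #NFA transitions, as defined above. A TEF
    // well below 1 means the TCAM table is smaller than the NFA itself.
    double tcamExpansionFactor(int tcamEntries, int nfaTransitions) {
        return static_cast<double>(tcamEntries) / nfaTransitions;
    }

    int main() {
        // C7 (Table 6.1): 234 OverlayCAM TCAM entries, 3261 NFA transitions.
        double tef = tcamExpansionFactor(234, 3261);
        std::printf("TEF(C7) = %.3f (about 1/14)\n", tef);  // prints ~0.072
        return 0;
    }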
In Figure 6.18(a), we plot the TEF for RegCAM-TC, RegCAM+TC, and OverlayCAM. We omit the first 5 data points because the corresponding 5 DFAs are too small. As expected, the TEF of the RegCAM algorithms grows exponentially with the number of NFA states due to state replication. In contrast, the TEF of OverlayCAM grows linearly, and very slowly, with the number of NFA states. We next define the Super-state Expansion Factor (SEF) of an RE set to be the number of super-states divided by the number of NFA states. Figure 6.18(b) shows that the SEF of OverlayCAM also grows linearly and slowly with the number of NFA states. Note that for any RE set, the number of NFA states is a lower bound on the size of any equivalent automaton.

[Figure 6.18: (a) TEF vs. # NFA states for OverlayCAM, RegCAM-TC, and RegCAM+TC; (b) SEF vs. # NFA states for OverlayCAM.]

Chapter 7

Conclusion

In this dissertation, we consider the problem of RE matching in DPI for networking applications. We survey current solutions for RE matching for DPI and identify their limitations. We then develop several techniques and algorithms for fast and efficient RE matching.

For a software solution to RE matching, we use an existing automata model, the D2FA. We propose a novel Minimize then Union framework and develop efficient algorithms for building D2FAs based on this framework. Our approach requires a fraction of the memory and time required by current algorithms. This allows us to build much larger D2FAs than was possible with previous techniques. Our algorithm naturally supports frequent RE set updates. We conducted experiments on real-world and synthetic RE sets that verify our claims. For example, our algorithm requires an average of 1400 times less memory and 300 times less time than the original D2FA construction algorithm of Kumar et al. We believe our Minimize then Union framework can be incorporated with other alternative automata for RE matching.

We propose the first TCAM-based RE matching solution. We demonstrate that this unexplored direction works very well for RE matching. We implemented our techniques and conducted experiments on real-world RE sets. We show that small TCAMs are capable of storing large DFAs. For example, in our experiments, we were able to store a DFA with 25K states in a 0.5 Mb TCAM chip. We also develop multi-striding techniques to increase matching throughput without significantly increasing the memory requirement. We are able to achieve a matching throughput of nearly 20 Gbps.

The D2FA and our TCAM-based solution only partially handle the problem of state replication in a DFA. We propose a new overlay automata model called the OD2FA, which fully exploits state replication in a DFA. We develop algorithms for efficiently constructing the OD2FA. We also develop techniques to implement the OD2FA in software and in hardware using TCAMs. Our experiments indicate that the OD2FA is able to effectively manage state replication. This results in a memory requirement proportional to that of an NFA while maintaining fast and deterministic matching throughput like that of a DFA.

APPENDICES

Glossary

character redundancy: Redundant/shared transitions within a state.

deferment forest: The directed graph with states as vertices and edges given by the deferment relation F.
deferment pointer: The deferred state, F(s), of a state s.

deferment tree: A tree (connected component) in the deferment forest.

self-looping state: A state with more than |Σ|/2 of its transitions looping back to itself.

state redundancy: Redundant/shared transitions between two states.

state replication: Multiple replications of the same NFA state in a DFA when the DFAs for two REs are combined.

transition sharing: Multiple transitions, within a state or between different states, going to the same next state.

Acronyms

D2FA: Delayed Input DFA.
DFA: Deterministic Finite state Automata.
DPI: Deep Packet Inspection.
NFA: Nondeterministic Finite state Automata.
OD2FA: Overlay Delayed Input DFA.
ODFA: Overlay Deterministic Finite state Automata.
RE: Regular Expression.
SEP: Shadow Encoding Properties.
SRG: Space Reduction Graph.
TCAM: Ternary Content Addressable Memory.

Notation

D: A DFA/D2FA.
D: An ODFA/OD2FA.
Q: Set of states in the DFA/D2FA/ODFA/OD2FA.
Σ: The input alphabet.
S: The set of super-states in an ODFA/OD2FA.
O: The set of overlays in an ODFA/OD2FA.
s, q, u: A DFA/D2FA/ODFA/OD2FA state.
S: An ODFA/OD2FA super-state.
O: An ODFA/OD2FA overlay.
X: A set of overlays in an ODFA/OD2FA.
M(s): Set of REs accepted by state s.
M(S): Set of REs accepted by all states in super-state S.
F(s): Deferred state of state s.
F(S): Deferred super-state of super-state S.
u → v: State u defers to state v.
u ⇝ v: State u is a descendant of state v.
⊥: NULL state/empty location.
δ(s, σ): The state transition function for a DFA.
ρ(s, σ): Partial state transition function for a D2FA.
∆(S, X, σ): Super-state transition function for an ODFA/OD2FA.
ρ′(s, σ): Partial state transition function derived from ∆ for an OD2FA.
δ′(s, σ): Total transition function derived from ρ for a D2FA.
δ′(s, σ): Total transition function derived from ∆ (ρ′) for an ODFA (OD2FA).

BIBLIOGRAPHY

[1] Application layer packet classifier for Linux. http://l7-filter.clearfoundation.com/.

[2] Snort. http://www.snort.org/.

[3] B. Agrawal and T. Sherwood. Modeling TCAM power for next generation network devices. In Proc. IEEE Int. Symposium on Performance Analysis of Systems and Software, pages 120–129, 2006.

[4] A. V. Aho and M. J. Corasick. Efficient string matching: an aid to bibliographic search. Communications of the ACM, 18(6):333–340, 1975.

[5] M. Alicherry, M. Muthuprasanna, and V. Kumar. High speed pattern matching for network IDS/IPS. In Proc. 2006 IEEE International Conference on Network Protocols, pages 187–196. IEEE, 2006.

[6] M. Becchi and S. Cadambi. Memory-efficient regular expression search using state merging. In Proc. INFOCOM. IEEE, 2007.

[7] M. Becchi and P. Crowley. A hybrid finite automaton for practical deep packet inspection. In Proc. ACM Int. Conf. on emerging Networking EXperiments and Technologies (CoNEXT). ACM Press, 2007.

[8] M. Becchi and P. Crowley. An improved algorithm to accelerate regular expression evaluation. In Proc. ACM/IEEE ANCS, 2007.

[9] M. Becchi and P. Crowley. Efficient regular expression evaluation: Theory to practice. In Proc. ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), pages 50–59, 2008.

[10] M. Becchi and P. Crowley.
Extending finite automata to efficiently match Perl-compatible regular expressions. In Proc. ACM Int. Conf. on emerging Networking EXperiments and Technologies (CoNEXT). ACM Press, 2008. Article Number 25.

[11] M. Becchi, M. Franklin, and P. Crowley. A workload for evaluating deep packet inspection architectures. In Proc. IEEE IISWC, 2008.

[12] A. Bremler-Barr, D. Hay, and Y. Koral. CompactDFA: Generic state machine compression for scalable pattern matching. In Proc. IEEE INFOCOM, pages 1–9. IEEE, 2010.

[13] B. C. Brodie, D. E. Taylor, and R. K. Cytron. A scalable architecture for high-throughput regular-expression pattern matching. SIGARCH Computer Architecture News, 2006.

[14] C. R. Clark and D. E. Schimmel. Efficient reconfigurable logic circuits for matching complex network intrusion detection patterns. In Proc. Field-Programmable Logic and Applications, pages 956–959, 2003.

[15] C. R. Clark and D. E. Schimmel. Scalable pattern matching for high speed networks. In Proc. 12th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), Washington, DC, 2004.

[16] J. Edmonds. Paths, trees, and flowers. Canad. J. Math., 17:449–467, 1965.

[17] D. Ficara, S. Giordano, G. Procissi, F. Vitucci, G. Antichi, and A. D. Pietro. An improved DFA for fast regular expression matching. Computer Communication Review, 38(5):29–40, 2008.

[18] H. N. Gabow. An efficient implementation of Edmonds' algorithm for maximum matching on graphs. J. ACM, 23:221–234, April 1976.

[19] J. E. Hopcroft. The Theory of Machines and Computations, chapter An n log n algorithm for minimizing the states in a finite automaton, pages 189–196. Academic Press, 1971.

[20] J. E. Hopcroft, R. Motwani, and J. D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, 2000.

[21] D. E. Knuth. Huffman's algorithm via algebra. Journal of Combinatorial Theory, Series A, 32(2):216–224, 1982.

[22] S. Kong, R. Smith, and C. Estan. Efficient signature matching with multiple alphabet compression tables. In Proc. 4th Int. Conf. on Security and Privacy in Communication Networks (SecureComm), page 1. ACM Press, 2008.

[23] J. B. Kruskal. On the shortest spanning subtree of a graph and the traveling salesman problem. Proc. American Mathematical Society, 7:48–50, 1956.

[24] H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2:83–97, 1955.

[25] S. Kumar, B. Chandrasekaran, J. Turner, and G. Varghese. Curing regular expressions matching algorithms from insomnia, amnesia, and acalculia. In Proc. ACM/IEEE ANCS, pages 155–164, 2007.

[26] S. Kumar, S. Dharmapurikar, F. Yu, P. Crowley, and J. Turner. Algorithms to accelerate multiple regular expressions matching for deep packet inspection. In Proc. SIGCOMM, pages 339–350, 2006.

[27] S. Kumar, J. Turner, and J. Williams. Advanced algorithms for fast and scalable deep packet inspection. In Proc. IEEE/ACM ANCS, pages 81–92, 2006.

[28] T. Liu, Y. Yang, Y. Liu, Y. Sun, and L. Guo. An efficient regular expressions compression algorithm from a new perspective. In Proc. IEEE INFOCOM, pages 2129–2137, 2011.

[29] Y. Liu, L. Guo, M. Guo, and P. Liu. Accelerating DFA construction by hierarchical merging. In Proc. IEEE 9th Int. Symposium on Parallel and Distributed Processing with Applications, 2011.

[30] C. R. Meiners, A. X. Liu, and E. Torng. TCAM Razor: A systematic approach towards minimizing packet classifiers in TCAMs. In Proc. 15th IEEE Conf. on Network Protocols (ICNP), pages 266–275, October 2007.

[31] C. R. Meiners, A. X. Liu, and E.
Torng. Bit weaving: A non-prefix approach to compressing packet classifiers in TCAMs. In Proc. 17th IEEE Conf. on Network Protocols (ICNP), October 2009.

[32] C. R. Meiners, J. Patel, E. Norige, E. Torng, and A. X. Liu. Fast regular expression matching using small TCAMs for network intrusion detection and prevention systems. In Proc. 19th USENIX Security Symposium (USENIX Security), pages 111–126, Washington, DC, August 2010.

[33] A. Mitra, W. Najjar, and L. Bhuyan. Compiling PCRE to FPGA for accelerating SNORT IDS. In Proc. 3rd ACM/IEEE Symposium on Architecture for Networking and Communications Systems (ANCS). ACM Press, 2007.

[34] J. Moscola, J. Lockwood, R. P. Loui, and M. Pachos. Implementation of a content-scanning module for an internet firewall. In Proc. 11th IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 31–38. IEEE Computer Society, 2003.

[35] J. Munkres. Algorithms for the assignment and transportation problems. Journal of the Society of Industrial and Applied Mathematics, 5(1):32–38, March 1957.

[36] J. Patel, A. X. Liu, and E. Torng. Bypassing space explosion in regular expression matching for network intrusion detection and prevention systems. In Proc. Network and Distributed System Security Symposium (NDSS'12), February 2012.

[37] V. Paxson. Bro: a system for detecting network intruders in real-time. Computer Networks, 31(23-24):2435–2463, 1999.

[38] K. Peng, S. Tang, M. Chen, and Q. Dong. Chain-based DFA deflation for fast and scalable regular expression matching using TCAM. In Proc. ACM ANCS, pages 24–35, 2011.

[39] M. Roesch. Snort: Lightweight intrusion detection for networks. In Proc. 13th Systems Administration Conference (LISA), USENIX Association, pages 229–238, November 1999.

[40] R. Sidhu and V. K. Prasanna. Fast regular expression matching using FPGAs. In Proc. IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 227–238, 2001.

[41] R. Smith, C. Estan, and S. Jha. XFA: Faster signature matching with extended automata. In Proc. IEEE Symposium on Security and Privacy, pages 187–201, 2008.

[42] R. Smith, C. Estan, S. Jha, and S. Kong. Deflating the big bang: fast and scalable deep packet inspection with extended finite automata. In Proc. SIGCOMM, pages 207–218, 2008.

[43] R. Sommer and V. Paxson. Enhancing byte-level network intrusion detection signatures with context. In Proc. 10th ACM Conf. on Computer and Communications Security (CCS), pages 262–271, 2003.

[44] I. Sourdis and D. Pnevmatikatos. Fast, large-scale string match for a 10Gbps FPGA-based network intrusion detection system. In Proc. Int. Conf. on Field Programmable Logic and Applications, pages 880–889, 2003.

[45] I. Sourdis and D. Pnevmatikatos. Pre-decoded CAMs for efficient and high-speed NIDS pattern matching. In Proc. 12th IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 258–267. IEEE, 2004.

[46] J.-S. Sung, S.-M. Kang, Y. Lee, T.-G. Kwon, and B.-T. Kim. A multi-gigabit rate deep packet inspection algorithm using TCAM. In Proc. IEEE GLOBECOM, pages 453–457, 2005.

[47] S. Suri, T. Sandholm, and P. Warkhede. Compressing two-dimensional routing tables. Algorithmica, 35:287–300, 2003.

[48] L. Tan and T. Sherwood. A high throughput string matching architecture for intrusion detection and prevention. In Proc. 32nd Annual Int. Symposium on Computer Architecture (ISCA), pages 112–122, 2005.

[49] N. Tuck, T. Sherwood, B. Calder, and G. Varghese. Deterministic memory-efficient string matching algorithms for intrusion detection.
In Proc. IEEE INFOCOM, pages 333–340, 2004.

[50] L. Yang, R. Karim, V. Ganapathy, and R. Smith. Fast, memory-efficient regular expression matching with NFA-OBDDs. Computer Networks, 55(55):3376–3393, 2011.

[51] F. Yu, Z. Chen, Y. Diao, T. V. Lakshman, and R. H. Katz. Fast and memory-efficient regular expression matching for deep packet inspection. In Proc. ACM/IEEE Symposium on Architecture for Networking and Communications Systems (ANCS), pages 93–102, 2006.

[52] F. Yu, R. H. Katz, and T. V. Lakshman. Gigabit rate packet pattern-matching using TCAM. In Proc. 12th IEEE Int. Conf. on Network Protocols (ICNP), pages 174–183, 2004.