f a? m“. A ' .z. fi\’v ‘Dl‘. .7. *1: ... ~ .3. .3 19.. all I .I: I—I Dru-1n I “3:“ Michigan State 200(1 _ Unwersuty This is to certify that the dissertation entitled ALGORITHMIC APPROACHES FOR FAST AND EFFICIENT PACKET CLASSIFICATION presented by Chad R. Meiners has been accepted towards fulfillment of the requirements for the PhD. degree in Computer Science Mp Major Professor’s Signature M3}, l1: LOO ‘1 Date MSU is an Affirmative Action/Equal Opportunity Employer . - -._-—.~.—-—.-.--.—.—.—.’-.- sequential decomposition . . 78 Compression ratio for each classifier in It]. ................ 84 Compression ratio for each classifier in [if/4U ............... 85 Example of topological transformations ................. 94 Multi-lookup architecture ........................ 96 Parallel pipelined—lookup architecture .................. 97 8.4 8.5 8.6 8.7 8.8 8.9 8.10 8.11 8.12 8.13 8.14 8.15 Chained pipelined—lookup architecture ................. 97 Step 1 of domain compression ...................... 117 Step 2 of domain compression ...................... 118 Step 3 of domain compression ...................... 119 Example of 1-D prefix alignment ..................... 120 Summary of notation ........................... 120 Compression ratio of RLa and RLaU .................. 121 Compression ratio of RLb and RLbU .................. 121 Rule size ratio of RLa and RLaU .................... 121 Rule size ratio of RLb and RLbU .................... 121 Rule number ratio of RLa and RLaU .................. 121 Rule number ratio of RL-b and RLbU .................. 121 viii Chapter 1 Introduction Packet classification, which is widely used on the Internet, is the core n’iechanism that enables routers to perform many networking services such as firewall packet filtering, virtual private networks (VPNs). network address translation (N AT). quality of ser- vice (QoS). load balancing. traffic: accounting and monitoring. differentiated services (Diffserv), etc. As more services are (.leployed on the Internet. packet. classification grows in demand and importance. The function of a packet classification system is to map each packet to a decision (i.e., action) according to a sequence (z'.e., ordered list) of rules, which is called a packet classifier. Each rule in a. packet classifier has a predicate over some packet. header fields and a decision to be performed upon the packets that. match the pred— icate. To resolve possible conflicts among rules in a classifier. the decision for each packet is the decision of the first (i.e., highest priority) rule that. the packet matches. Table 1.1 shows an example packet classifier of two rules. The format of these rules is based upon the format. used in Access Control Lists on Cisco routers. Rule Source IP Destination IP Source Port Destination Port Protocol Action 7‘1 1.2.3.0/24 192108.01 [1.65534] [1.05534] TCP accept * * * * * discard 7.2 Table 1.1: An example packet classifier Rule Source IP Destination IP Source Port Destination Port Protocol Action 7‘1 1.2.3.0/24 192168.01 0 * * discard r2 1.2.3.0/24 192168.01 65535 * * discard r3 1.2.3.0/24 192168.01 * 0 * discard 7‘4 1.2.3.0/24 192168.01 * 65535 * discard r5 1.2.3.0/24 192168.01 [0,65535] [0.65535] TCP accept 7'6 * * * * * discard Table 1.2: TCAM Razor output for the example packet classifier in Table 1.1 1 .1 Motivation To classify the never—ending supply of packets at wire speed, Ternary Content Ad— dressable Memories (TCAMs) have become the de facto standard for high-speed routers on the Internet. [15]. A TCAM is a memory chip where each entry can store a packet classification rule that is encoded in ternary format. Given a packet, the TCAM hardware can compare the packet with all stored rules in parallel and then return the decision of the first rule that the packet matches. Thus, it takes 0(1) time to find the decision for any given packet. In 2003, most packet classification de- vices shipped were TCAlVI—based. More than 6 million TCAM devices were deployed worldwide in 2004. A traditional random access memory chip receives an address and returns the content of the memory at that address. A TCAM chip works in a reverse manner: it receives content and returns the address of the first entry where the content lies in the TCAM in constant time (1.6., a few CPU cycles). Exploiting this hardware feature, TCAM-based packet classifiers store a. rule in each entry as an array of 0’8, 1’8, or *’s (don"t—care values). A packet header (216., a search key) matches an entry if and only if their corresponding 0’s and 1’s match. Given a search key to a TCAM, the hardware circuits compare the key with all its occupied entries in parallel and return the index (or the content, depending on the chip architecture and configt,1ra.ti(_)n,) of the first matching entry. Despite their high speed, TCAMs have their own limitations with respect to packet classification. Range expansion TCAMs can only store rules that. are encoded in ternary for— mat. In a typical packet classification rule, source IP address, destination IP ad- dress, and protocol type are specified in prefix format, which can be directly stored in TCAMs, but source and destination port numbers are specified in ranges (216., integer intervals), which need to be converted to one or more prefixes before being stored in TCAMs. This can lead to a significant increase in the number of TCAM entries needed to encode a rule. For example, 30 prefixes are needed to represent the single range [1, 65534], so 30 X 30 = 900 TCAM entries are required to represent the single rule 7‘1 in Table 1.1. Low capacity TCAMS have limited capacity. The largest TCAM chip available on the market has 18Mb while 2Mb and 1Mb chips are most popular. Given that each TCAM entry has 144 bits and a packet classification rule may have a worst expansion factor of 900, it is possible that an 18Mb TCAM chip cannot store all the required entries for a modest. packet classifier of only 139 rules. While the worst case may not happen in reality. this is certainly an alarming issue. Furthermore, TCAM capacity is not expected to increase dramatically in the near future due to other limitations that we will discuss next. High power consumption and heat generation TCAM chips consume large amounts of power and generate large amounts of heat. For example. a le TCAM chip consumes 15-30 watts of pox-ver. Power consumption together with the con- sequent heat generation is a serious problem for core routers and other networking devices. Large board Space occupation: TCAMs occupy much more board space than SRAMs. For networking devices such as routers. area efficiency of the circuit board is a critical issue. High hardware cost TCAMs are expensive. For example, a 1.\'Ib TCAM chip costs about 200 ~ 250 US. dollars. TCAM cost is a significant fraction of router cost. 1.2 Contribution This work describes two methods of addressing packet classification and the related TCAM based issues: equivalent transformation techniques and new architectural approaches. Equivalent transformation techniques seek to find semantically equivalent but more efficient classifiers. Two methods of equivalent transformation are TCAM Razor and All—Match Redundancy Removal. TCAM Razor decomposes a multi-field problem into a series of single-field problems; these problems are solved optimally and then recomposed into a greedy multi-field solution. In contrast, All—match Redundancy Removal identifies a maximal set of rules that can be removed from a packet classifier without changing the packet classifier’s semantics. New architectural approaches seek to modify how the TCAM based packet classi- fiers operate in order to improve efficiency. We propose two approaches: sequential decomposition and topological transformation. Sequential decomposition decom- poses a single d-field packet classification TCAM lookup into a sequence of d 1-field TCAM lookups. Topological transformations provide methods to translate the do- main of each packet field into a more efficient representation. Both techniques allow for the efficient utilization of TCAM space. These techniques mitigate the effects of range expansion; however, they also have the unique advantage that they find optimizations beyond range expansion. This advantage allows for sublinear com- pression. Chapter 2 Background We now formally define the concepts of fields, packets, and packet classifiers. A field F,- is a variable of finite length (i.c., of a finite number of bits). The domain of field F,- of w bits, denoted D(F,-), is [0,2"" — 1]. A packet over the (1 fields F1, - - - , Fd is a d—tuple (pl, - - . ,pd) where each p,- (1 g i g d) is an element of D(F,;). Packet classifiers usually check the following five fields: source IP address, destination IP address, source port number, destination port number, and protocol type. The lengths of these packet fields are 32, 32, 16, 16, and 8, respectively. We use 2' to denote the set of all packets over fields 171,- - - , Fd' It follows that E is a finite set. and [27] = ]D(F1)| X - - - x |D(Fd)]. where |Z| denotes the number of elements in set X and |D(Fz)| denotes the number of elements in set D(F,-). A rule has the form (predicate) —> (decision). A (predicate) defines a set of packets over the fields F1 through Pd and is specified as F1 6 $1 /\ - - - /\ Fd E Sd where each S,- is a subset of l)(F.,-) and is specified as either a prefix or a nonnegative integer interval. A prefix {0, 1}l"{*}“’“k with. A“ leading (ls or ls for a packet field of length to denotes the integer interval [{0, 1}k{0}“’_k, {0. 1}A'{1}“’_k]. For example. prefix 01** denotes the interval [0100,0111]. A rule. F1 6 .91 /\ - -- /\ Fd E 5d —» (decision) is a prcfi'r rule if and only if each S,- is represented as a prefix. A packet matches a rule if and only if the packet. matches the predicate of the rule. A packet (p1. - - - .pd) "mic/ms a predicate F] E 5'1 /\ /\ Fd E Sd if and only if the condition [)1 E 91 /\ - - - /\ Pd 6 Sd holds. \Ve use US to denote the set CH of possible values that (decision) can be. Typical elements of 05 include accept, discard, accept with logging, and discard with logging. A sequence of rules (r1, - - - , rn) is complete if and only if for any packet p, there is at least one rule in the sequence that p matches. To ensure that a sequence of rules is complete and thus a packet classifier, the predicate of the last rule is usually specified as F1 6 D(F1) /\ - - - Fd E AD(Fd). A packet classifier C is a sequence of rules that is complete. The size of C, denoted ICI, is the number of rules in C. A packet classifier C is a prefix packet classifier if and only if every rule in C is a prefix rule. A classifier with d fields is called a (ll-dimensional packet classifier. Two rules in a packet classifier may overlap; that is, a single packet may match both rules. Furthermore, two rules in a packet. classifier may conflict; that is, the two rules not only overlap but also have different decisions. Packet classifiers typi- cally resolve such conflicts by employing a first-match resolution strategy where the decision for a packet p is the decision of the first (i.e., highest priority) rule that p matches in C. The decision that packet classifier C makes for packet p is denoted CU?)- VVe can think of a packet classifier C as defining a many-to-one mapping function from E to DS. Two packet classifiers C1 and C2 are equivalent, denoted C1 E C2, if and only if they define the same mapping function from E to DS; that is, for any packet p E Z, we have C1(p) = C2(p). A rule is redundant in a. classifier if and only if removing the rule does not change the semantics of the classifier. Furthermore, we define the equivalence relation that. classifier C defines on each field domain and the resulting equivalence classes. We use the notation 2;,- to denote the set of all (d— 1)-tuple packets over the fields (F1, - - -, Fi_1, Fi+b -~-, Fat) and p_,- to denote an element of 23_,;. Then we use C(pi,p_,-) to denote the decision that packet classifier C makes for the packet p that is formed by combining p,- E DlFi) and p_,-. Definition 2.0.1 (Equivalence Class). Given a packet classifier C over fields F1, --«, Fd, we say that 3:, y E D(F,-) for 1 S i g d are equivalent. with respect to C if and only ifC(;1:,p_z-) = C(y,p_,;) for any p__,; E E_,;. It follows that C partitions D(F,~) into equivalence classes. W 3 use the notation C{.:r} to denote the equivalence class that :1: belongs to as defined by classifier C. In a typical packet. classifier rule, the fields of source IP, destination IP, and pro- tocol type are specified in prefix format, which can be directly stored in TCAMs; however, the remaining two fields of source port and destination port are specified as ranges (i.e., non-negative integer intervals), which are typically converted to pre- fixes before being stored in TCAMs. This leads to range expansion, the process of converting a non-prefix rule to prefix rules. In range expansion, each field of a rule is first expanded separately. The goal is to find a minimum set of prefixes such that the union of the prefixes corresponds to the range. For example, if one 3-bit field of a rule is the range [1, 6]. a corresponding minimum set of prefixes would be 001, 01*, 10*, 110. The worst-case range expansion of a w—bit range results in a set. containing 2w — 2 prefixes [13]. The next step is to compute the cross product of the set of prefixes for each field, resulting in a potentially large number of prefix rules. 2.1 Firewall decision diagrams A crucial data structure required for this work is the Firewall Decision Diagram (FDD) [9] A Firewall Decision Diagram (FDD) with a decision set 03 and over fields F1, - ~ - , Fd is an acyclic and directed graph that. has the following five proper- ties: (1) There. is exactly one node that has no incoming edges. This node is called the root. The nodes that have no outgoing edges are called terminal nodes. (2) Each node v has a label. denoted F (U), such that {F12 - . - , Fd} if v is a nonterminal node. F(U) E DS if v is a terminal node. (3) Each edge 6:11 —> U is labeled with a nonempty set of integers. denoted 1(e), where 1(e) is a subset of the domain of 11‘s label (i.e., I((’) C_: D(F(u))). (4) A directed path from the root to a tern‘iinal node is called a decision path. No two I' nodes on a decision path have the same label. (a) The set of all outgoing edges of a node v, denoted 13(1)). satisfies the following two conditions: (i) Consistency: I (e) fl [(6’) = Q) for any two distinct edges e and e’ in E(u). 066%) Me) = Darn». (ii) Completeness: F1 F 2 Decision 000 000 accept 000 111 accept 110 000 accept 110 111 accept 010 0** accept 100 0** accept ** 1** accept ** *** discard it FDD Construction 6 001 01 1 101 000 1 10 010 100 1 1 1 @ 6 6 000 001 000 001 0” 1“ O” 1“ 1' 0” 01 ' 01 ' 20;.) 1a a a d a a d a d a d a d Figure 2.1: Illustration of FDD construction We define a full—length ordered F DD as an FDD where in each decision path all fields appear exactly once and in the same order. For ease of presentation, we use the term “FDD” to mean “full-length ordered FDD” if not otherwise specified. Given a packet classifier (C, the F DD construction algorithm in [17] can convert it. to an equivalent full-length ordered FDD f. Figure 2.1(a) contains a sample classifier, and Figure 2.1(b) shows the resultant F DD from the construction process. After an FDD f is constructed, we can reduce f ’5 size by merging isomorphic subgraphs. A full-length ordered FDD f is reduced if and only if it satisfies the following two conditions: (1) no two nodes in f are isomorphic; (2) no two nodes have more than one edge between them. Two nodes '1) and v' in an FDD are isomorphic if and only if v and 'U’ satisfy one of the following two conditions: (1) both 1) and 'v’ are terminal nodes with identical labels; (2) both 1) and v’ are nonterminal nodes and there is a one-to-one correspondence between the outgoing edges of 'v and the outgoing edges of U, such that every pair of corresponding edges have identical labels and they both point to the same node. A reduced F DD is essentially a canonical representation for packet classifiers. Figure 2.1(c) shows the reduced FDD from Figure 2.1(b). 2.2 One-Dimensional Classifier Minimization The special problem of weighted one—field TCAM I'ninirnization is used as a building block for multi-dimensional TCAM minimization. Given a one-field packet classifier f of n prefix rules (r1, r2, - - - , r"), where {Decision(r1), De(.:ision(r2), - - - . Decision(rn) } = {d1,(12.- - -,d;} and each decision (1,; is z-issociated with a cost. Cost(d,;) (for 1 S i S .3). we define the cost. of packet classifier f as follows: n. Cost(f) = Z Cost(Decision(ri)) i=1 Based upon the above definition, the problem of weighted one-diInensional TCAM minimization is stated as follows. Definition 2.2.1. Weighted ()nc—Dimensional Prefix: Min.i-Inization Problem. Given a one-field packet classifier f1 where each (lccismn is associated with a cost. find a 9 prefix packet classifier f2 6 {f1} such that for any prefix packet classifier f E {f1}. the condition Cost(f2) g Cost(f) holds. The problem of one-dimensional prefix minimization (with uniform cost) has been studied in [6,27] in the context of compressing routing tables. I generalize the dynamic programming solution in [27] to solve the weighted one-dimensional TCAM minimization. There are three key observations: 1. For any one-dimensional packet classifier f on {*}111, we can always change the predicate of the last rule to be {*}w without changing the semantics of the packet classifier. This follows from the completeness property of packet classifiers. 2. Consider any one-dimensional packet. classifier f on {*}w. Let f1 be f ap- pended with rule {1*}111 —> d, where d can be any decision. The ol;)servation is that f E f’. This is because the new rule is redundant in f ’ since f must be complete. A rule in a packet classifier is redundant if and only if removing the rule from the packet. classifier does not change the semantics of the packet. classifier. 3. For any prefix 79 E {0,1}k{*}1"'_1" (0 S I; g in), one and only one of the following ccmditions holds: (a) r e {0,1}1~‘0{*}w-k-1, (b) 73 E {0,1}k1{*}“’—k—1. (c) ”P = {0,1}k{*}w—k. This property allows us to divide a problem of {0, l}k{*}1“‘k into two sub- problems: {0. 1}1‘70{*}w—k—1. and {0. 1}]"1{*}“1‘k—1. This divide—arid—conquer strategy can be applied recursively. We formulate an o )timal dynamic )rtwrammine‘ solution to the weighted one- . C) C") F) dimensional TCAM minimization problem. 10 Let. P denote a prefix {0,1}k{*}w_k. We use 2 to denote the prefix {0,1}k 0{*}11’_k—1, and P to denote the prefix {0,1}k1{*}w—k_1. Given a one—dimensional packet classifier f on {*}w, we use f7; to denote a packet classifier on P such that for any :1: E P, frp(1:) = f (1:), and we use ff; to denote a similar packet classifier on P with the additional restriction that the final decision is d. C ( fp) denotes the minimum cost of a packet classifier t that is equivalent to fp, and C ( fig) denotes the minimum cost of a packet classifier t’ that is eq1.1ivalent to fp and the decision of the last rule in t’ is d. Given a one-dimensional packet classifier f on {*}1“ and a prefix P where P Q {*}1”, f is consistent on P if and only if V1.3 y E P, f(:r) = f(y). The dynamic programming solution to the weighted one—dimensional TCAM minimization problem is based on the following theorem. The proof of the theorem shows how to divide a problem into sub-problems and how to combine solutions to sub-problems into a solution to the original problem. Theorem 2.2.1. Given a one—dimensional packet classifier f on {*}w, a prefix? P where P Q {*}w, the set of all possible decisions {(11, (12, - - - , dz} where each decision d,- has a cost wdi (1 g i S 2:). we have that ., Z: , (l, C(fp) : mm ((fp) 221 where each C(fgi) is calculated as follows: (I) Iff is consistent on P, then an __ War) was, 1p>—, , ,. , U)f(1,) + ”(12: iff(.z.) # ,1 '23) [ff is not consistent on P, then 11 f (11 (11 CUP. )+ C( 7—3— ) — 'wdl +u'1‘1i’ ' 7 1- 1- C(f;1_1)+ C(fg—l) — 111,17;1 + 1111,17,, , 11,- . Hi 11 ' C(fpl) = min < C(fp1)+ C(fffi‘) — 1111,12,, C0211 )+C( 51+ l—‘U’11,+1 +1111,» 0', 1. c C(f‘g) + C( [3“) —: 111-'11,, + “’1 \ "i Proof. (1) The base case is when f is consistent on P. In this case, the minimum cost prefix packet classifier in {fp} is clearly (P ——) f (13)), and the cost of this packet classifier is w “1.). Furthermore, for d,- # [(17), the minimum cost prefix packet classifier in {fp} with decision 11,- in the last rule is (P —> f (:17),P —+ dz) where the second rule is redundant. The cost of this packet classifier is 1Uf(-r) + wdi' (2) If f is not consistent on P, divide P into E and P. The crucial observation is that an optimal solution f * to {fp} is essentially an optimal solution f1 to the sul_)-problem of minimizing f2 appended with an optimal solution f2 to the sub- problem of minimizing f5. The only interaction that can occur between f1 and f2 is if their final rules have the same decision, in which case both final rules can be replaced with one final rule covering all of P with the same decision. Let d3; be the decision of the last rule in f1 and dy be the decision of the last rule. in f2. Then we can compose f * whose last rule has decision (1, from f1 and f2 based on the following cases: (A) (1.5,- = dy :- (17;: In this case, f can be constructed by listing all the rules in f1 except the last. rule, followed by all the rules in [2 except the last rule, and then the last rule P ——+ (1,. Thus, Cost(f) = Cost(f1) + Cost(f2) — (11(1):. (B) (1,: : (1y # di‘ In this case, f can be constructed by listing all the rules in f1 except the last rule, followed by all the rules in f2 except the last. rule, then rule P —> 11,-. and finally rule P -—_> d,. Thus. Cost(f) = Cost(f1)+ (70-51(12)‘1L'11I‘1’U’11 '11. (C) (11,- 75 (1;).(11- = d,,dy # 11,-: we do not, need to consider this case because , .(l' V (1 (l- 1 (l 1 (1: >1 (1: (11211 + ( (1,511) = cognac-,9) +1.11 — 1.1 2 ( 11151 + 1 (1,33) — 11.1 D d; (1 ,dl- d-, d1 = (1: Similarly, this case need not be considered. 9 z J i (E) d1: aé dy, d1: 75 di, dy # dZ-z Similarly, this case need not be considered. C] Figure 2.2 shows the illustration of a one—dimensional TCAM minimization prob— lem, where the black bar denotes decision “accept” and the white bar denotes deci- sion “discard”. Figure 2.3 illustrates how the dynamic progrannning works on this example. L:ll::l —— 00 01 10 11 Figure 2.2: An example one-dimensional TCAM 11111111111281.1011 problem HI IHIH \><01./ \><._/ I: E HUI lll EU _r f _1_ J. |oo o1||oo o1] |1o11||1o11| 1:] [:1 - — — 4 —-L 10 11 loo o1 1o11| loo 0 Figure 2.3: Illustration of dynamic program 2.3 Experimental Data we performed experiments on a. set, of '25 real—world packet classifiers, which is denoted by R1,. The classifiers in 11L were chosen from a larger set. of real-world 13 classifiers obtained from various network service providers. where the classifiers range in size from a handful of rules to thousands of rules. We partition the original classifiers into 25 groups where the classifiers in each group share similar structure. For example, the ACLs configured for the different interfaces of a router often share a similar structure. we created RL by randomly choosing one classifier from each of the 25 groups. We did this because classifiers with similar structure often exhibit. similar results for the proposed algorithms. If all classifiers were used, the results would be skewed by the relative size of each group. Because packet classifiers are considered confidential due to security concerns, which makes it difficult to acquire a large quantity of real-world classifiers, we gen- erated a set of synthetic classifiers SYN with the number of rules ranging from 250 to 8000. The predicate of each rule has five fields: source IP, destination IP, source port, destination port, and protocol type. We based our generation method upon Singh et al.’s [23] model of synthetic rules. We choose this model over Tay- lor&Turner’s Classbench [30] because Classbench does not generate decisions, and there is no rationale or guidelines for assigning decisions to each rule. To stress test. the sensitivity of our algorithms to the number of decisions in a classifier. we created a. set. of classifiers RLU by replacing the decision of every rule in each classifier by a unique decision. Similarly, we created the set S YNU. Thus, each classifier in RLU (or S YNU) has the maximum possible number of distinct decisions. Such classifiers might arise in the context of rule logging where the system monitors the frequency that each rule is the first matching rule for a packet. 14 Chapter 3 Related Work There is significant prior work on packet classification for both TCAM based packet classification and software based packet classification. While TCAM based sys- tems are more immediately relevant, software based classification shares a degree of commonality with the sequential decomposition technique; however, differences in available hardware result in very different design decisions. 3.1 TCAM Based Classifiers There is significant. prior work on minimizing the TCAM space. occupied by a single classifier. Such work falls into three broad categories: (1) classifier minimization (c.g., [1, 5, 6, 18, 27]), which converts a. given classifier to a semantically equivalent classifier that. requires fewer TCAM entries; (2) range encoding (e.g., [4.15.19.21,31]), which encodes the ranges (i.e., source port and destination port) in a manner that reduces range expansion; and (3) circuit modification (e.g., [24]). which modifies TCAIV'I circuits to ac(_:(:)mmodate range comparisons. 3.1.1 Classifier Minimization: The basic idea is to convert a given packet classifier to another semantically equiva- lent packet classifier that. requires fewer TCAM entries. Several classifier minimiza— tion schemes have been proposed [1, 5.6. 18. 27]. The work in [1, 6.27] focuses on one-dimensional and two dimensional packet. classifiers. Construction Optimal IP Tables In [6], Draves et al. present a polynomial algorithm for generating a minimum equivalent packet classifier for one-dimensional prefix match classifiers. Their algorithm, Optimal Routing Table Constructor(ORTC), works by reducing a longest matching prefix trie to its minimal representation via. three traversals. This minimal trie can then be used to use to generate a minimum single field prefix classifier. A longest match prefix classifier can be trivially con- verted into a first match prefix classifier by sorting the rules such that the longer prefix rules appear before shorter prefix rules. Compressing Two-Dimensional Routing Tables In [27], Suri et al. present. a polynomial time dynamic program that generates a minimum equivalent packet classifier for one-dimensional prefix classifiers. This dynamic program is equivalent to the dynamic program presented in Chapter 2.2. Furthermore, Suri et al. present a. generalization of this dynamic program for two or more fields. They show that this generalization produces optimal two field classifiers when the the solution space of classifiers is restricted such that the predicates of any two rules in the a classifier are either disjoint, or one predicate is a subset of the other. However, these generalized algorithms have a significant time requirements, 0(NIDS (wl x x 111(1)), where wd is the number of bits used for 17,. As a. result, the dynamic program ceases to be usable for more than two fields. Complete Redundancy Detection in Firewalls In [18]. Liu and Gouda, pro- pose the first algorithm that is guaranteed to detect and remove a maximal set of redundant rules within a classifier. They propose two types of redundant rules. upward redundant rules and downward redundant rules. These two types of rules are shown to completely categorize the set of all redundant rules. Liu and Gouda’s algorithm first uses an iterative F DD construction technique to remove all upward redundant rules. and it uses a different iterative FDD ccmstruction technique to remove all downward redundant rules. By removing both types of rules, the al- 16 gorithm produces a classifier free of redundant rules. Note, that this algorithm is not guaranteed to remove the maximum number of redundant rules since there can be interdependencies between redundant rules. However, this algorithm is effective for all types of classifiers, and its efficiency scales well as the number of fields in a classifier increases. Packet Classifiers in Ternary CAMS can be Smaller In [5], Dong et al. propose the first algorithm that modifies rules within a classifier in an attempt to reduce the effects of range expansion. They propose four types of operations: trim- ming rules, expanding rules, merging rules, and adding rules. The basic idea of their algorithm is that by trimming or expanding the space covered by a predi- cate, the range expansion for each rule can be reduced. They propose a two stage algorithm that first trims the predicate space of every rule and then expands the predicate space of each rule going from last to first. This algorithm is significant in that it accommodates classifiers with more than two fields. However, it is unknown whether or not the algorithm is optimal given a one-dimensional classifier. Further- more, The algorithm requires repeated applications for a classifier to converge upon a minimal set of rules and depends heavily upon repeated applications of the redun- dancy removal technique found in [18]. This suggests that their algorithm requires a significant amount of con’irmtational overhead. Compressing Rectilinear Pictures and Minimizing Access Control Lists In [1], Applegate et al. propose an optimal solution for two field classifiers composed entirely of strip rules. Strip rules have a wild card for at. least one field. However, while this work is of theoretical interest, it does not. scale to d—dimensional classifiers, and it is not clear that packet classifiers can be efficiently represented with strip rules 3.1.2 Range Encoding: The basic idea is to first encode ranges that appear in a classifier and store the en— coded rules in a TCAM. When a packet. comes, the packet needs to be. preprocessed 17 so that the resulting encoded packet can be used as a search key for the TCAM. Previous range encoding schemes fall into two categories: database independent en- coding schemes [4,15], where the encoding of each rule is independent of other rules in the classifier, and database dependent encoding schemes [19, 21,31], where the encoding of each rule may depend on other rules in the classifier. The advantage of database independent encoding schemes is that they allow fast. incremental up- dates to the classifier since each rule is encoded indepemlently. However, database dependent schemes have the potential for better space savings since they can utilize the low number of unique ranges that appear in real life classifiers to achieve lower range expansion. Algorithms for Advanced Packet Classification In [15], Lakslnninarayanan et al. propose a scheme called fence encoding, which encodes interval ranges as a range of unary numbers. All ranges under fence encoding have an expansion factor of one, which implies that all ranges can be encoded with one rule, but the number of unary bits required for each rule is prohibitive since a field with length 11) requires 2‘“ bits per rule. To reduce the required number of bits in a rule, Lakshminarayan et al. proposed the technique called DIRPE, which con'mresses the size of fence encodings at the expense of increasing the average expansion ratio. DIRPE works by dividing a field into equally sized sub-fields, which are called chunks, and fence encoding these chunks. Selecting the number of chunks provides a trade-off between range expansion and TCAM entry width. The authors also propose a method of combining DIRPE with Liu’s range encoding scheme found in [19] to handle ranges that have a large expansion factor under DIRPE. However. this combination negates DIRPE'S ability to allow fast updates for classifiers. Space-efficient TCAM-based Classification using Gray Coding Bremler- Barr and Hendler, in [4], propose a scheme in which field domains are encoded using binary reflected gray codes(BRCC). While there is no advantage or disadvantage to using a BRGC for fields that contain only prefix ranges. using BRGC on non-prefix ranges breaks up the range in such a way that additional ternary bits can be used to 1,8 eliminate some of the prefixes needed to represent a range. The result is that some of the prefixes ranges required to represent a range are merged together into a single ternary entry. The authors note that this encoding technique is especially effective for small ranges and name their encoding algorithm short range gray encoding or SGRE. Since SGRE does not require any additional TCAM bits to encode ranges, Bremler-Barr and Hendler also propose a method of combining SCRE with Liu’s range encoding technique. Like DIRPE this combination negates SGRE’S ability to support fast updates for classifiers, but it allows for the technique to concisely encode ranges with large expansion factors under SRGE. Efficient Mapping of Range Classifiers into Ternary-CAM In [19], Liu pro- poses an encoding method that designates specific ternary bits within each TCAM entry to represent a specific range. A packet field is encoded via an SRAM lookup table that maps each field value to a codeword that has a designated bit set to 1 if and only if the value is an element of the corresponding range. Each rule’s range predicate can then be encoded such that the designated bit is set to 1 and every other bit is set to *. This technique eliminates range expansion completely; however, it also requires n bits per TCAM entry when a classifier has n unique ranges in a field. This technique quickly becomes impractical as the number of unique ranges within a field increases. To combat the explosive growth in required bits, Liu pro- poses splitting a field domain into I: disjoint ranges such that each disjoint range intersects with a small number of unique ranges. Since [logk + 1] bits are needed to encode these disjoint ranges, this scheme allows for a field to be encoded using [log k + 1] + 71’ bits where n’ is the maximum number of unique ranges that inter— sect with a given disjoint range. Using this scheme means that rule predicates that. intersects with more than one disjoint range must. be replicated for each intersection. To manage the trade off between rule expansion and bit expansion, Liu proposes a heuristic algorithm that repeatedly finds and merges the pair of disjoint ranges that reduces rule expansion the most. These merges continue until a budget. of b bits is exhausted. 19 Fast and Scalable Packet Classification In [31], van Lunteren and Engbersen propose an encoding method similar to [19]. In their encoding scheme, they control the required number of bits by partitioning the unique ranges into 1 layers. The ranges within each layer are then broken into disjoint ranges so that each layer can be encoded in [log ni] bits where n,- is the number of disjoint ranges in layer i. Each field then become the concatenated encoding for each layer. The authors also note ’ in another layer, that if a disjoint range r in one layer contains a disjoint range r this information can be used to reduce the number of bits needed to encode r”s layer. Unfortunately, no algorithms are given for partitioning unique ranges into layers. An Encoding Scheme for TCAM-based Packet Classification In [21], Pao et al. propose an encoding algorithm called. prefix inclusion encoding(PIC). PIC utilizes van Lunteren and Engbersen’s observation that containment information for one layer can reduce the number of bits required to encode ranges in the next layer. That is the scheme produces a series of 1 layers L1, . . . , Ll such that each disjoint range in L,- is a subset of a single range of Li—l for i E {2, . . . , I}. With this property, PIC can encode a field predicate into a compact prefix range. PIC was designed for encoding fields with prefix ranges and large domains such as the source IP for IPv6 headers. However, the authors do suggests techniques for adapting their scheme to encode range fields. These techniques require breaking overlapping ranges into disjoint ranges, which in turn requires that encoded rules are replicated in a manner similar to other encoding techniques. 3.1.3 Circuit Modification: The basic idea is to modify TCAM circuits to accon‘m‘iodate range comparisons. For example, Spitznagel ct al. proposed adding comparators at each entry level to better accommodate range matching [24]. While this research direction is important. such solutions are hard to deploy due to high cost. [15], and modified TCAMs may be less applicable to applications other than packet processing. 20 3.2 Software Based Techniques The simplest. software based technique for packet classification is a linear search, which has excellent storage requirement but. becomes too slow for wire speed packet. classification for even modest sized classifiers. As a results, there is a rich body of software based packet classification techniques [2,3,7',11,14,22,23,25,26,28,32], and an extensive survey of these techniques can be found in [29]. These techniques trade storage space for an improvement in search time via special preprocessing of the classifier rules. Techniques can be partitioned into two categories: parallel decomposition and decision tree classification. 3.2.1 Parallel decomposition The objective of parallel decomposition techniques [3,11,14,25,26,28] is to break the classification process into several steps that can performed in parallel. The above techniques perform the decomposition along the field boundaries of a packet header. This in effect allows for fast and efficient. single field classification solutions to encode each field in parallel. These new values are then composed via one or more additional classifications stages to yield a. correct classification. High-speed Policy-based Packet Forwarding using Efficient Multi-dimen- sional Range Matching In [14]. Lakshman and Stiliadis propose encoding each field’s value into a bitmap that specifies a containment relationship among values and rules [14]. This bitmap indicates whether or not an encoded value intersects with a. given rule’s field predicate. Once each field is encoded, this method uses customized parallel AND gates to perform an intersection of these bitmaps and ultimately finds the first matching rule. This technique. is effective; however it requires a bit. line for each rule in the classifier and must be implemented on customized hardware. Scalable Packet Classification In [3]. Baboescu and Varghese improve on the above technique by observing that for classifiers with a low occurrence of wildcards. bitmaps will be sparely populated with 1‘s. They group bits within each bitmap 1)1 ‘— into chunks and represent. each chuck with a single bit which is the logical OR of all the bits within the chunk. This allows the second stage to skip the comparison of a significant number of bits. For classifiers that have a low occurrence of wildcards, this technique is very effective at reducing the number of memory access needed to perform the second stage processing; however, this reduction is diminished once wildcards occur more frequently within a classifier. Fast and Scalable Layer Four Switching In [26], Srinivasan et al. propose an encoding method called crossproducting that assigns a unique number to each maximal disjoint range within a classifier field and constructs a lookup table for the cross product of the numbers associated with each field. This technique is fast; however, its storage requirements multiplicatively increases as the number of fields and ranges increases. As a result, the authors only intend crossproducting for small classifiers with two fields. Packet Classification on Multiple Fields In [11], Gupta and McKeown pro- pose an encoding method called Recursive Flow Classification (RFC) that. is an op- timized version of the cross-producting scheme. This uses recursive cross-producting tables to reduce the space requirements of regular cross-producting tables. F urther- more, they map disjoint ranges that are contained by the same set of rules into a single value. RFC’S mapping tables define. an equivalence relation; however, this equivalence relation is less general than the domain compression technique discussed in Chapter 8, so they are unable to achieve a maximum compression for each field domain in most cases. Furthermore, the recursive cross-producting scheme requires a significant amount of space to store in memory. Packet Classification using Tuple Space Search In [25], Srinivasan et al. propose a tuple based search approach. This approach transforms each rule predicate with d fields into a d—tuple, which is in essence a hash of the predicate. The idea is that this initial hashing divides the. search space into regions that. can be searched in parallel. Perfect hashing functions are used to find exact matches in each tuple’s [\D [\J search space. The authors propose two methods of determining appropriate tuples to search. The first method is an exhaustive search of each tuple, and the second uses a set pruning trie for each field that returns a set of candidate tuples. With set pruning, the intersections of the results from each field is the set of tuple spaces that need to be searched. Scalable Packet Classification using Distributed Crossproducting of Field Labels In [28], Taylor and Turner propose the Distributed Crossproducting of Field Labels (DCFL) method that assigns each locally unique range within a field a locally unique number. Each field value is encoded into a set of numbers, which rep- resents the ranges that contains the value. These sets are crossproducted together and then intersected with the set of unique tuples generated from the classifier’s field predicates. The resulting intersection provides a. list of rules that the packet header matches. The authors optimize this technique by incrementally perform- ing the crossproduct and filtering the intermediate results after each incremental crossproduct. This optimization can dramatically reduce that number of false pos- itive tuples that are generated. Since overlapping ranges diminish the incremental crossproducts’ ability to keep the number of candidate matches low, the technique’s perforn'iance depends on classifiers having a. low number of overlapping ranges. 3.2.2 Decision Trees Decision tree methods [2, 7, 12, 22,23, 32] use tree structures to successively prune the search space to a single rule. or a small number of rules. which are then searched linearly to find a match. Decision tree methods such as HiCuts [12] and Hypercuts [23] are similar in flavor to our sequential decomposition approach in that they use a sequence of searches where each search uses a portion of the packet predicate to classify a packet. However, scfltware—based methods are constrained by a complex tradeoff among how many searches need to be performed. the time. required to perform] a search and the space required to store the data structure that facilitates the search. In the worst case, these methods require many searches, slow searches. 23 or tremendous amounts of memory. Classification Using Hierarchical Intelligent Cuttings In [12], Gupta and McKeown present a decision tree algorithm called HiCuts. This algorithm builds a decision tree similar to an unordered FDD with the following differences: Each node makes a decision based on a partition of a field’s domain, and leaves are allowed to store a list of rules. The rationale for both of these decisions is derived directly from the limitations of SRAM lookup methods. The authors implement each node as a lookup table so that the next node in the tree can be found in constant time. However, since a field domain of size 232 is prohibitively large, the field domain must be cut into subsets to limit the size of each tree node. This technique successively prunes the set of candidate rules with the decision tree until the set of rules is below a certain threshold. Once this threshold is reached. the list of remaining rules is stored in the leaf at the end of the decision path. The rationale for this decision is to save storage space since small lists usually result in big subtrees. The authors also present a parameterized construction algorithm that. allows the user to trade maximum lookup time for storage space. Packet Classification using Multidimensional Cutting In [23], Singh et al. present HyperCuts, which is an improvement upon HiCuts. The authors contribution is to allow each node in the decision tree to build a nmltidimensional lookup table from cuts in multiple fields. This improvement allows for a. more effective pruning of the list of candidate rules. The authors show that HyperCuts significantly improves upon the performance of HiCuts. A Modular Approach to Packet Classification In [32], Woo uses a three stage approach to classifying packets. The first. stage is a lookup table that distributes packet value among a set of decision trees by matching m. bits within a rule predicate. These decision trees are binary trees where the nodes select. the appropriate bit within the predicate to determine which edge to foll(;)w. The second stage traverses the decision tree until the third staéige is reached when a. leaf in the decision tree is 24 found. These leaves contain a list of one or more candidate matches that is searched sequentially until a match is found. One key assumption in VVoo’s work is that each classifier predicate needs to be transformed into a ternary bit. string. This assumption implies that classifiers with significant amounts of range expansion will degrade the storage efficiency of this technique. Fast Firewall Implementations for Software-based and Hardware-based Routers In [22], Qiu et al. revisit two trie-based lookup schemes for packet clas- sification that have been traditionally dismissed as being inefficient and show that for real packet classifiers, they offer predictable classification speeds. Longest prefix matching tries are an efficient data structure for performing an exact match for a single field packet; however, once packet. classifiers requires multiple field packets, some packets will not match against the longest prefix in all dimensions. The first technique that they examine uses a backtracking search on a multi-field trie to find every candidate rule. They provide a set of optimizations for the multi-field trie that speeds up the backtracking search; however, the number of memory accesses required to classify each packet range from 117 to 196 for real-life classifiers. The second technique that they examine uses set pruning tries, which enumerate all deci- sion paths so that each packet value can only follow a single path. The authors also propose two compression algorithms that help to reduce the storage requirements for set pruning tries and backtracking tries. Set pruning tries outperform backtracking search at the expense of additional memory storage requirements; however, experi- mental results suggest that backtracking tries offer a better performance for storage trade off. The experimental results suggest that these techniques offer a 2 to 5 times speedup over linear search. Tradeoffs for Packet Classification In [7], Feldmann and Muthukrishnan pro- pose building lookup—up trees similar to HiCuts; however, instead of using a. lookup table at each node, they employ an inverted lookup tree call a Fat Imrcrtcd Seg— ment(F IS) tree to store a complete set of cuts of each field. This technique allows for a more compact re])resentation of the classifier, but. it can significamtly increasing the number memory accesses needed to classify a packet when compared to HiCuts or HyperCuts. Packet Classification for Core Routers: Is there an Alternative to CAMS? In [2], Baboescu et al. propose the Extended Grid—0f- T1‘2'es(EGT) technique. EGT uses a two-field trie to prune the candidate rule list and uses a path compression algorithm to minimize the amount of memory needed to store the trie. For core router tables, EGT provides reasonable performance; however, EGT’S performance depends on the structural properties of the core routing tables. Packets classifiers used in other applications (6.9. firewalls) may not have acceptable performance with EGT. 26 Part I Equivalent Transformation Techniques 27 Consider the following TCAM lVIinimization Problem: given a packet classifier, how can we generate another semantically equivalent packet classifier that requires the least number of TCAM entries? Two packet classifiers are (semantically) equiv— alent if and only if they have the same decision for every packet. For example, the two packets classifiers in Tables 1.1 and 1.2 are equivalent; however, the one in Ta— ble 1.1 requires 900 TCAM entries, and the one in Table 1.2 requires only 6 TCAM entries. Solving this problem helps to address the limitations of TCAMs. As we reduce the number of TCAM entries required, we can use smaller TCAMs, which results in less board space and lower hardware cost. Furthermore, reducing the number of rules in a TCAM directly reduces power consumption and heat generation because the energy consumed by a TCAM grows linearly with the number of ternary rules it stores [33]. While the optimal solution to the above problem is conceivably NP-hard, in this thesis, we propose a practical algorithmic solution using two techniques. Our first technique, TCAM Razor, generates new but equivalent classifiers, whereas our second technique. all-match redundancy removal. finds a set of rules that can be safely removed from a classifier Chapter 4 TCAM Razor TCAM Razor consists of the following four basic steps. First, convert a given packet. classifier to a reduced decision diagram, which is the canonical representation of the semantics of the given packet classifier. Second, for every nonterminal node in the decision diagram, minimize the number of prefixes associated with its outgoing edges using dynamic programming. Third, generate rules from the decision diagram. Last, remove redundant rules. As an example, running our algorithms on the packet classifier in Table 1.1 will yield the one in Table 1.2. Our solution is effective, efficient, and practical. In terms of effectiveness, our ap- proach achieves a total compression ratio of 3.9% on real—life packet classifiers, which is significantly better than the previously published best. result of 54% [5]. In terms of efficiency, our approach runs in seconds, even for large packet classifiers. Finally, in terms of practicality, our approach can be easily deployed as it does not require any modification of existing packet classification systems. In comparison, a number of previous solutions require hardware and architecture modifications to existing packet classification systems, making their adoption by networking manufz—icturers and ISPS much less likely. The solution is named “TCAM Razor” f(’)llowing the principle of Occams razor: “0f two equivalent theories or eJ‘planat-ions. all other things being equal, the simpler one is to be preferred.” In our context, of all packet classifiers that are equivalent, the one with the least number of TCAM entries is preferred. 29 4.1 Multi-dimensional TCAM Minimization: The Basics In this section, we present TCAM Razor, our algorithm for minimizing multi- dimensional prefix packet classifiers. A key idea behind TCAM Razor is processing one dimension at a time using the weighted one-dimensional TCAM minimization algorithm in Section 2.2 to greedily identify a local minimum for the current dimen- sion. Although TCAM Razor is not guaranteed to achieve a global minimum across all dimensions, it does significantly reduce the number of prefix rules in real-life packet classifiers. 4.1.1 Conversion to Firewall Decision Diagrams To facilitate processing a packet classifier one dimension at. a time, we first convert. a given packet classifier to an equivalent reduced Firewall Decision Diagram (FDD) [10]. Given a packet classifier f1, we can construct. an equivalent. FDD f2 using the FDD construction algorithm in [17]. F i "ure 4.1: A firewz-ill decision diagram g o .30 4.1.2 Multi-dimensional TCAM Minimization We start the discussion of our greedy solution by examining the reduced FDD in Figure 4.1. We first look at the subgraph rooted at node '02. This subgraph can be seen as representing a one-dimension packet classifier over field F2. We can use the weighted one-dimensional TCAM minimization algorithm in Section 2.2 to minimize the number of prefix rules for this one-dimensional packet classifier. The algorithm takes the following 3 prefixes as input: 10 * * (with decision accept and cost 1), O * ** (with decision discard and cost 1), 11 * * (with decision discard and cost 1). The one—dimensional TCAM minimization algorithm will produce a minimum (one- dimensional) packet classifier of two rules as shown in Table 4.1. Rule # F1 Decision 1 10** accept 2 * ** * discard Table 4.1: A minimum packet classifier corresponding to 212 in Fig. 4.1 Similarly, from the subgraph rooted at. node 123, we can get. a minimum packet classifier of one rule as shown in Table 4.2. Rule# F1 Decision 1 **** discard Table 4.2: A minimum packet classifier corresponding to U3 in Fig. 4.1 Next, we look at the root. ’Ul- As shown in Figure 4.2, we view the subgraph rooted at ’02 as a decision with a n‘iultiplication factor or cost of 2, and the subgraph rooted at ‘03 as another decision with a. cost. of 1. Thus. the graph rooted at 1.21 can be thought of as a “virtual” one-(lin'iensioiial packet classifier over field F1 where each child has a multiplicative cost. 31 F2610“ -’ accept F26 **** -* discard F26 **** —* discard Figure 4.2: “Virtual” one—dimensional packet classifier Now we are ready to use the one-dimensional TCAM 11111111111281.1011 algorithm in Section 2.2 to minimize the number of rules for this “virtual” one-dimensional packet. classifier. The algorithm takes the following 5 prefixes and associated costs as input: 1000 101* with decision '02 and cost 2 , with decision 2:2 and cost 2 , 1001 11* * with decision 123 and cost 1 . ( l ( l 0 * ** (with decision ()3 and cost. 1) ( l ( l with decision '03 and cost 1 . / Running the weighted one-(iimensional TCAh-l minimization algorithm on the above input. will produce the “virtual” one-dimensional packet classifier of three rules as shown in Table 4.3. Rule # F1 Decision 1 1001 go to node. (’3 2 10** go to node (22 3 **** go to node U3 Table 4.3: A minimum packet classifier (:(n‘respontling to ”1 in Fig. 4.1 32 Combining the “virtual” packet classifier in Table 4.3 and the two packet clas- sifiers in Table 4.1 and 4.2, we get a packet classifier of 4 rules as shown in Table 4.4. Rule # F1 F2 Decision 1 1001 **** discard 2 10** 10** accept 3 10** **** discard 4 **** **** discard Table 4.4: Packet classifier generated from the FDD in Figure 4.1 4.1.3 Removing Redundant Rules Next, we observe that rule r3 in the packet classifier in Table 4.4 is redundant. If we remove rule 73, all the packets that used to be resolved by 13 (that is, all the packets that match 13 but do not match r1 and r2) are now resolved by rule 7‘4, and r4 has the same decision as 73. Therefore. removing rule 73 does not change the semantics of the packet. classifier. Redundant rules in a packet classifier can be removed using the algorithms in [18] or the algorithm in the next chapter. Finally, after removing redundant rules, we get a packet classifier of 3 rules from the FDD in Figure 4.1. 4.1.4 The Algorithm To summarize, TCAM Razor, our multi-dimensional TCAM minimization algo- rithm, consists of the following four steps: 1. Convert the given packet classifier to an equivalent. FDD. 2. Use the FDD reduction algorithm described in the next. section to reduce the size of the FDD. This step will be explained in more detail in the next section. 3. Generate a packet (jtlassifier from the F DD in the following bottom up fashion. For every terminal node, assign a cost of 1. For a non-terminal node U with 33 ~ ~ outgoing edges {(31, - - - , ez}, formulate a one-dimensional TCAM n'iinimization problem as follows. For every prefix ’P in the label of edge e.-, (1 g j S .3), we set the decision of ”P to be j, and the cost of ’P to be the cost of the node that edge ej points to. For node v, we use the weighted one-dimensional TCAM minimization algorithm in Section 2.2 to compute a one—dimensional prefix packet classifier with the minimum cost. We then assign this minimum cost to the cost of node 1). After the root node is processed, generate a. packet classifier using the prefixes computed at each node in a depth first. traversal of the F DD. The cost of the root indicates the total number of prefix rules in the resulting packet classifier. 4. Remove all the redundant rules from the resulting packet classifier. 4.1.5 TCAM Update Packet classification rules periodically need to be updated. The common practice for updating rules is to run two TCAMs in tandem where one TCAM is used while the other is updated [16]. TCAM Razor is compatible with this current practice. Because TCAM Razor is efficient and the resultant TCAM lookup table is small, TCAM updating can be efficiently performed by rerunning TCAM Razor on the updated rules. When rules are frequently added to a classifier, we suggest the following lazy update strategy. First, after running TCAM Razor, store the resulting rules in the lower portion of the TCAM. Let n denote the total number of entries in the TCAM, m denote the total number of TCAM entries needed by a packet classifier after applying Razor, and let array T denote the TCAM. Initially, the m entries are stored from T[n — m] to T[n — 1]. When a new rule 7‘ needs to be added to the classifier. we first. perform range expz—msion on r. Let ""1 be the number of prefix rules that. are created. We store these rules in locations T[n — m — ml] to T[n — m — 1]. As new rules are added, this process continues until the TCAM is filled up. Thus, TCAM Razor only needs to run periodically rather than when each new rule is added. 34 Chapter 5 All-Match Redundancy Removal we present an all-match based complete redundancy removal algorithm. This is the first algorithm that attempts to solve first—match problems from an all-match per— spective. We formally prove that the resulting packet classifiers have no redundant rules after running our redundancy removal algorithm. We conducted extensive ex- periments on both real-life and synthetic packet classifiers. The experimental results show that our redundancy removal algorithm achieves an average compression ratio of 41.8% for TCAM entries. We have improved upon [18] in two ways. First. the redundancy theorem be- comes simpler. The redundancy theorem in [18] distinguishes upward and downward redundant rules, and detects them separately. In contrast, the redundancy theorem presented here gives a single criterion that. can detect both upward and downward redundant rules. Second, the new redundancy removal algorithm is more efficient. The algorithm in [18] scans a packet classifier twice and build F DDS twice in order to remove the two types of redundant rules. In comparison. the new algorithm only scans a packet classifier once and builds one all—match FDD with a cost similar cost. to building an FDD. The new algorithm is about twice as efficient. as the algorithm in [18]. 5.1 All-Match Based Redundancy Theorem In this section, we introduce the concept. of all—match FDDs and the all-match based redundancy theorem. 5.1.1 All-Match FDDs Definition 5.1.1 (All—Match FDD). An all—match FDD t for a packet classifier f : (r1,r2,-~,rn) over fields F1,---,Fd is an FDD that has the following five properties: 9": Each node v is labeled with a packet field denoted F(v). If v is a nonterminal node, then [7(0) is a packet field. va is a terminal node, then F(v) is a list of integer values (121, i2, . - - , ik) where 1 S i1 < i2 . -- < ik g n. Each edge e:u —+ v is labeled with a nonempty set of integers, denoted [(6), where [(6) is a subset of the domain ofu ’s label (i.e., I(e) g I)(F(u))). The set of all outgoing edges of a node “U in t, denoted E(v), satisfies the following two conditions: (a) Consistency: I(e) fl I(e’) = (D for any two distinct edges 8 and e’ in E(v). (b) Completeness: UeEE(v) I(e) = D(F(v)). A directed path from the root to a te'miinal node is called a decision pat-h. No two nodes on a decision path have the same label. Given a decision path P : (2218112262 - - - 'Umemu,,,,+1), the matching set ofP is defined as the set of all packets that satisfy (F(ul) E [(61)) /\(F(v2) E I(e2))/\- - -/\(F(Um) E I(e-m)). We use M(P) to denote the matching set of P. For any decision path. P : (‘Ulelvgeg - - - Uy'neynvnl+1) where F(vm+1) =2 (i1, i2, ...,ik.) and for any rule rJ-(l g j S n). if i‘\l(P)fli‘ll(rJ-) 75 o. then ll/(P) Q 111(7‘j) andj E {i1,l2,- - -,'lk}. C] 36 r1 : F1 E [1, 5] /\ F2 E [1,10] ——> accept r2 : F1 E [1, 5] /\ F2 E [5,10] —> accept r3 : F1 E [6, 10] /\ F2 E [1,3] —> discard r4 : F1 E [1, 10] /\ F2 E [1,10]—> discard Figure 5.1: A simple packet classifier [1,5] [6,10] [1,4] V0] [1,3] [4,10] 1,4 (1,2,4)( 3,4 ) Figure 5.2: A11 all—match FDD for the packet classifier in Fig 5.1 Fig 5.2 shows an all-match F DD for the simple packet classifier in Fig 5.1. In this example, we assume every packet has only two fields F1 and F2. and the domain of each field is [1. 10]. In an all—match F DD for a packet classifier f . for any decision path 73 : (01611}282 - -- vme-,-nu,,.,y+1) where F(um+1) = (i1, i2, ik), if a packet p satisfies this path P, then {731,7}: .. "i k} are exactly all the rules in f that p mat(_:hes. 1le IS why we 2, .. call such a FDD an “all-Inat.(':h FDD”. 37 5.1.2 The All-Match Based Redundancy Theorem Before we present the All-I\‘Iat(_:l1 Based Redundancy Theorem, we first prove the following lemma. Lemma 5.1.1. Let t be an all-722atch FDD for packet classifier f : (7‘1,r2, - - - ,rn). For any rule rz- in f, let P1, P2, - - - , Ph be all the decision paths whose terminal node contains r23. then the following condition holds: 111(73): U _1 lil(P j') [3 Proof: (1) According to prope1tv5 oin the definition of all- match FDDs, we have 111(Pj) Q Al(ri) for everyj (1 < j\ (dec), and make ek+1 point to the first. node in this path; 7: Add i to the end of the label of the terminal node of this decision path; 8: end if 9: forj:=1t.okdo 10: if I(ej) C Sm then 11 APPEND(eJ-’s target,(F1 E 81) /\ - - - /\ (Fd E Sd) —> (dec),m+1,i); 12: end if 13: Add one outgoing edge e to v, and label e with I (ej) fl Sm; 14: Make a copy of the subgraph rooted at the target node of ej, and make 6 points to the root of the copy; 15: Replace the label of ej by I(ej) — Sm; 16: APPEND(e’s target,(F1 E 81) /\ - - - /\ (Fd E 5(1) —1 (dec).m+1,i); 17’: return 18: end for that. has 771 terminal nodes, we assign a. unique sequence number in [1,772.] to each terminal node. In the containment. list. each entry consists of a terminal node sequence number and the rule sequence numbers crmtained in the terminal node. 41 Terminal Rule # node # ,._ 1 1 f——'\ r———fl 2 1,2,4 2 2 \____J m 3 3,4 3 3 (——\ 4 4 4 1,2,3,4 Containment List Residency List Figure 5.4: The containment. list. and the residency List for the all-match FDD in Figure 5.2 In the residency list, each entry consists of a. rule sequence number and the set of terminal nodes which contains this rule. The all-match list and the residency list for the all-match FDD in Figure 5.2 are in Figure 5.4. The all—match based redundancy removal algorithm works as follows. Given a packet. classifier f : (r1, r2, - - - , 7'”), this algorithm scans f from r-n to r1, and checks whether each rule is redundant using Theorem 5.1.1. Whenever a rule is detected as redundant, the rule is removed from f. The pseudocode of the algorithm is shown in Algorithm 3. Consider the packet classifier in Figure 5.1 and its all-match FDD in Figure 5.2. The all—111atch list and the residency list are in Figure 5.4. We, next demonstrate the process of determining whether 7'4 is redundant using the all-Iriatch based redun- dancy removal algorithm shown in Figure 3. From the residency list, we know that rule r4 is contained in terminal nodes 1.2.3 and 4. 111 t(:1.rminal node 4. 7‘4 is the only value, and thus is not redundant. Next, we check whether 7'3 is redundant. From the. residency list. we know that. 73 is contained in terminal node 3. In terminal node 3. the first. value is 3, and is innnediatt-1ly followed by a 4. and 7‘3 and 7'4 have 42 Algorithm 3 All—Match Based Redundancy Removal Algorithm Input: A packet classifier f : (r1, r2, - - - , rn), and an all-match FDD t for f. Output: A packet classifier f I where f E f ’ and there is no redundant rule in f ' . 1: Build the containment list ConList[1..m] from t. 2: Build the residency list ResList[1..n] from t. 3: for i := n to 1 do 4: redundant :2 true 5: for each terminal node sequence number tn in ResList[i] do 6: if i is the only value in ConList[tn]) or i is the first value in Co-22,List[tn] and the second value in ConList[tn], say j , satisfies the condition that r,- and rj have different decisions then 7: redundant :2 false; 8: break; 9: end if 10: end for 11: if redundant then 12: remove rz- from f7; 13: for each terminal node sequence number in in ResLis-t[i] do 14: delete i from (YonListltnl; 15: end for 16: end if 17: end for the same decision. According to the Theorem 5.1.1, r3 is redundant. Subsequently, we remove r3 from the packet classifier and delete 3 from the third entry of the all-match list. In a similar fashion, we can further detect that r2 is redundant and r1 is not. redundant. The resulting packet classifier is shown in Figure 5.5. 43 F1 E [1, 5] /\ F2 E [1,10] —2 accept F1 E [1, IO] /\ F2 E [1,10]—+ discard Figure 5.5: The resulting packet classifier after removing redundant rules from the packet. classifier in Figure 5.1 5.2.3 Proof of Complete Redundancy Removal A packet classifier redundancy removal algorithm is a complete redundancy removal algorithm if and only if for any packet classifier the algorithm produces a. semantically equivalent packet classifier in which no rule is redundant. Theorem 5.2.1. The All—Match Based Redundancy Removal Algorithm is a com- plete redundancy removal algorithm. C] Proof: Let. f be a given packet classifier and let t be an all—match FDD for f. Let f ' ' be the resulting packet classifier after running the all-match based redundancy removal algorithm. Suppose f I I has a rule rz- that is redundant in f / I . Let f I be the resulting packet classifier after the algorithm has examined all the rules from ”+1 to r-n and the redundant rules from n+1 to rn, has been removed. Because the algorithm does not remove ri, 7",; is not redundant. in f’. According to Theorem 5.1.1, there is at least a terminal node u that satisfies one of the following conditions: 1. this terminal node only contains i, 2. this terminal node has i as its first value and i is inunediately followed by another value j such that 7",; and rj have different (_'l(-?(_tlSlODS. If U satisfies one of the two conditions, then u still satisfies that condition after the algorithm removes all the redundant rules above Ti» because i will never be deleted from ’12 according to the. algorithm. Therefore, r7; is not redundant. in f I ’ according to Theorem 5.1.1. [:1 44 It is worth noting that the order from n to 1 in detecting redundant rules is critical. If we choose another order, the algorithm may not. be able to guarantee complete redundancy removal. Take the order from 1 to n as an example. When we check whether rz- is redundant, suppose rz- is not redundant because there is one and only one terminal node in the all-match F DD that has i as its first value and i is immediately followed by another value j such that r,- and rj have different decisions. We further suppose j is immediately followed by another value k where 7‘2‘ and rk have the same decision. After moving all the redundant rules after 71,-, j is possibly removed from the terminal node and consequently rz- and rk become the first two values in the terminal node and they have the same decision. Thus, rz- becomes redundant. Chapter 6 Experimental Results In this section, we evaluate the effectiveness of TCAM Razor and All-Match Re- dundancy Removal on both real-life and synthetic packet classifiers. Note that in cases where TCAM Razor cannot produce smaller packet classifiers than redundancy removal alone, TCAM Razor will return the classifier produced by redundancy re- moval. Thus, T CAM Razor always performs at least as well as redundancy ren'ioval. 6. 1 Methodology We first define the metrics that. we used to measure the effectiveness of TCAM Razor and the redundancy removal technique by Liu and Gouda [18]. 111 this paragraph, f denotes a packet classifier, S denotes a set. of packet. classifiers, and A denotes either TCAM Razor or the redundancy removal technique. We then let I f | denote the number of rules in f, A( f ) denote the prefix classifier produced by applying A on f, and Direct( f) denote the prefix classifier produced by applying direct range expansion on f. We define the following four metrics for assessing the performance of A on a set of classifiers S'. o The average compression ratio of A over S = (MN Efesm lSl 46 o The total compression ratio of A over S = EfeSlAffll Efengirect(f)|' o The average expansion ratio of A over S 2 Wm Efes m ISI o The total expansion ratio of A over S = EfeSlAffll EfeSlfl . Variable Ordering The variable order that. we used to convert a packet classifier into an equivalent F DD affects the effectiveness of TCAM Razor. There are 5! = 120 different permutations of the five packet fields (source IP address, destination IP address, source port number, destination port number, and protocol type). A question that naturally arises is: which variable order achieves the best com- pression ratio? To answer this question, for each pern‘iutation , we computed the total compression ratio for TCAM Razor over RL. The maximum total compres- sion ratio is 10.8% and the minimum total compression ratio is 7.7%. When we use the best permutation for each classifier, the total compression ratio is 6.7%. Furthermore, the four permutations that start with the field Destination IP address and Source IP address have a total compression of less than 7.8%. Using the best permutation from these four permutations results in a. total compression ratio of 7.5%. The best single permutation is the order (Destination IP address, Source IP address, Source Port, Destination Port, Protocol) and has a total compression ratio of 7.7%. We use this pernuitation to represent TCAM Razor in the experiment. 47 results, which is denoted Razor. When we report results using the best permuation for each individual classifier, we denote this algorithm as BRazor. The next natural question to ask is: is this permutation the best order for most packet classifiers? The answer for BL is yes. In Figure 6.1(a), for each packet classifier in RL, we show the compression ratios of TCAM Razor, BRazor, and redundancy removal. The results show that the single best permutation achieves almost the best compression ratio for each packet classifier group. 10 500 .93 '- RR ‘6 g 80» I Razor 8 4000* t :5 f5 , 1:1 BRazor g 9 0 60- c 3000 3 33 = 9 : 93$ 40» % 2 2000 ' ‘5‘ 2 ' 3 l , ° 20’ 1 u’i 1000 REE i u -— ; ~ . _ 0 , W ,, 7 7 _ .. 0 10 15 20 25 0 10 15 20 25 Classifier Classifier (a) (b) Figure 6.1: Compression and Expansion ratios of RL Compression Ratio Expansion Ratio Average Total Average Total RR Razor RR Razor RR Razor RR Razor RL 42.4 % 23.3 % 32.5 % 7.7 ‘70 1373.4 % 60.8 ‘70 111.7 ‘70 26.5 ‘70 RLU 66.2 “/0 32.2 ‘70 56.9 ‘70 13.4 % 3217.0 % 153.3 ‘70 195.3 % 46.0 % SYN 11.1 ‘70 6.8 ‘70 8.4 % 5.6 % 13.2 (70 8.1 ‘70 10.0 70 6.6 % SYNU 44.1 % 42.8 ‘70 39.4 % 38.6 % 52.6 ‘70 50.9 ‘70 47.0 ‘70 46.0 ‘70 Table 6.1: Statistics for experimental classifiers Compression Ratio Table 6.1 demonstrate that TCAM Razor outperforms just redundancy removal. Figure 6. 1 (a) shows that only 7 out 25 classifiers so no improvement over redundancy 48 removal; however, BRazor does achieve an improvenumt over redui‘idancy removal for four of these classifiers. We can also see that both TCAM Razor and redundancy removal are less effec- tive on RLU due to each rule having unique decisions. TCAM Razor is still twice as effective as redundancy on RLU, which demonstrates that TCAM Razor remains twice effective for RL as the number of unique decisions becomes large. The results for the synthetic classifiers show that TCAM Razor is an improvement over redun- dancy removal and that synthetic generation technique produces a large number of redundant rules. Expansion Ratio Table 6.1 and Figure 6.1(a) demonstrate similar results for Razor, BRazor and re- dundancy removal. Note that. twelve classifiers are significantly effected by range expansion. and for these classifiers TCAM Razor is highly effective and combating range expansion. 6.2 Comparison with other techniques It is difficult. to compare our results directly with those of Deng ct at. [5] because we do not have access to their programs or the packet classifiers they experimented with. However. Razor has a total compression ratio of 7.70},- on our real—life packet classifiers. IIi contrast. Dong et al. reported a total compression ratio of 54‘ c on their real-life packet classifiers. 49 Part II New Architectural Approaches 50 New architectural approaches seek to modify how the TCAM based packet classi- fiers operate in order to improve efficiency. We propose two approaches: sequential decomposition and topological transformation. Sequential decomposition decom- poses a single d—field packet classification TCAM lookup into a sequence of d l-field TCAM lookups. Topological transformations provide methods to translate the do— main of each packet field into a more efficient representation. Both techniques allow for the efficient utilization of TCAM space. These techniques mitigate the effects of range; however, they have the unique advantage that they find optimizations beyond range expansion. This advantage allows for sublinear compression. Chapter 7 Sequential Decomposition The problem that sequential decomposition tries to solve in this chapter can be stated intuitively as follows: how can we squeeze the most information possible into TCAM chips with little or no perfownance loss? Solving this problem helps to ad- dress almost all limitations of TCAMs. Given a packet classifier, if we can represent. it using fewer bits, we can use a smaller TCAM chip, which will result in lower power consumption, less heat generation. less board space, and lower hardware cost. Furthermore, reducing the number of TCAM rules results in less power consumption and heat generation because the energy consumed by a TCAM grows linearly with the number of ternary rules it stores [33] Sequential decomposition is composed of four algorithmic approaches to re- thinking and redesigning TCAM-based packet classification systems: multi-lookup, pipelined-lookup, packing, and table consolidation. These approaches move beyond the traditional paradigm that performs a. single lookup on a single TCAM for each search key. The following two observations of information redundancy and a ternary search key, which have mostly been ignored in prior work, form the theoretical basis for our new approaches. Information Redundancy Inforn'iation stored in TCAMS tends to have high redundancy from an information theory perspective. Specifi(‘:ally, we observe that the same ternary string for a specific field may be repetitively stored in I'nultiple TCAM entries. Figure 7.1(a), the strings 001, 010, and 100 from the first field are each stored three For example, in the simple two-dimensional packet classifier in times in the TCAM, and the strings 001, 010, and *** from the second field are each stored three times in the TCAM as well. Such information redundancy is primarily due to the multi—dimensional nature of packet classification rules. One source of information redundancy is range expansion in two or more fields. Single-lookup Multi-lookup 001 ,001 accept t1 001,010 accept w. log 001 t; 001,*** discard 010 t2 010,001 accept 100 t2 010,010 accept w. log ** t3 010,*** discard t2 100,001 accept 001 accept 100,010 accept w. log 010 accept w. 10g 100,*** discard an: discard ***,*** discard w. log t3 *** discard w. log (a) (b) Figure 7.1: Reducing information redundancy Ternary Search Key A TCAM chip typically has a built—in Global Mask Register (GMR) that supports ternary search keys. The GMR of a TCAM chip contains a bit. mask that specifies which bit columns in the chip participate in a search. For example, in a TCAM chip that contains two entries 1010 and 0100, if the search key is 0101 and the GMR specifies that only the first two columns participate in this search, the TCAM chip will return that. the lookup key matches the second entry 0100. In essence, the GMR allows the user to specify the search key in ternary format. In the above example, the GMR transforms the. search key 0101 into 01**. The GMR opens new opportunities for further improving TCAM space efficiency. Intuitively, the GMR allows multiple lookup tables to be packed into one TCAM chip where the GMR can be used at run time to dynamically select the right table to search. The multi—lookup approach is based on three key observations. First, breaking a multi-dimensional packet classifier into multiple one-dimensional classifiers greatly reduces information redundancy in TCAMs. Second, multiple lookup tables can co—reside in TCAM as long as extra bits are set to distinguish them. Third, a search key can be segmented into multiple search keys where each is searched in a one-dimensional classifier. Breaking the two-dimensional classifier in Figure 7.1(a) results in the three one-dimensional classifiers in Figure 7.1(b). Although in this particular example, the number of entries in the original single—lookup table is only two more than the total number of entries in the multi—lookup tables, the savings will increse significantly as the repetition in each field increases. Furthermore, note that the width of the multi—lookup tables is much smaller than that of the single lookup table. The space efficiency achieved by the multi-lookup approach comes with the price of more clock cycles to perform each search. Our pipelined—lookup approach speeds up the multi-lookup approach by pipelining the multiple lookups using multiple TCAM chips. Interestingly, the pipelined-lookup approach achieves even higher packet classification throughput than the traditional single—lookup approach because the narrower TCAM entries now fit on the data bus. The packing approach is based on the following three observations. First, TCAM chips have limited configurability on their width. This prevents us from configuring the TCAM width to exactly the table width, which could cause a significant number of bits in each TCAM entry to be unused. Second, the multi-lookup and pipelined- lookup approaches produce “thin” tables of varying width. Third, search keys for TCAM chips can. be ternary, which allows TCAM columns to be dynamically selected for each lookup. The basic idea of the packing approach is that multiple tables can be placed within the same TCAM entries. These tables will be distinguished by the GMR. In addition to the packing approach, table consolidation allows one TCAM table to store multiple classifiers efficiently at the expense of extra SRAM. Table consol- idation is based on the two observations. First, TCAM is far more expensive and consumes much more power than SRAM; it makes sense to use a large SRAM with a small TCAM rather than a small SRAM with a large TCAM. Second, semantically different TCAM tables may share common entries, and transferring this redundancy to SRAM removes the information redundancy from the more expensive TCAM. 7 .1 Multi-Lookup Approach Prior work and current practice have assumed the use of a single-lookup for TCAM based systems. we observe that relaxing this assumption could yield unexpected savings on TCAM space with minor throughput degradation. In this section, we propose a multi-lookup approach to redesigning TCAM based systems. We present two algorithms to support this approach: an algorithm for constructing a. multi- lookup table and an algorithm for processing packets. 7.1.1 Constructing Multi—lookup Table The algorithm for constructing a multi-lookup TCAM table from a given packet classifier consists of the following four steps, which are illustrated in Figure 7.2 and Figure 7.3: (1) F DD Construction: Constructing a tree—like representation, called a Firewall Decision Diagram (FDD), of the packet.- classifier. (2) F DD Reduction: Reducing the size of the FDD. (‘3) Table Generation: Generating a TCAM table from each nonterminal node in the reduced F DD. (4) Table Meiyence: Merging the generated TCAM tables into a single multi-lookup TCAM table. FDD Construction To generate a multi—lookup TCAM table, we first convert a. given packet classifier to an equivalent firewall decision diagram.Figure 7.2( b) shows the FDD constructed Ci"! CJI F1 F 2 Decision 000 000 accept 000 1 11 accept 110 000 accept 1 10 1 1 I accept 0 10 0* * accept 100 0* * accept ** 1** accept ** *** discard ll FDD Construction 011101111 000 110 010 100 000 001 000 001 0" 1" 0“ 1"? 0" 01* 01' 1* * 1“ 1010 ‘11 11010 a a d Lallalldl laHdHaHdla d l) FDD Reduction Figure 7.2: FDD generation and reduction from the packet classifier in Figure 7.2(a). 56 [I 1) Table Generation 000 a 111 ll Table Mergence 000 01 f 00000 01 010 10 00010 10 00 10010 "NM 00100 10 110 01 11 00110 01 00*" 11 000 a 01000 a 01 1‘18 ‘--- 01111 a (b) m d 01... d .0 139:: f; d - 111" a 11 1“ a L,-—"‘[11m d d l Figure 7.3: The h-"Iulti—lookup scheme FDD Reduction Reduction is an important step in reducing the total number of TCAM entries in the final multi-lookup table because the reduction step reduces the number of nonter- minal nodes, which consequently reduces the number of TCAM entries generated. Figure 7.2(c) shows the resultant FDD after FDD reduction. A brute force algo- rithm FDD reduction algorithm can be found in [9]; however, we provide a more. efficient reduction algorithm in Section 7.6. Table Generation Suppose the reduced F DD has n nonterminal nodes. Consider any nonterminal node v. Since v is complete with respect to its labeled field, we can view v as a one-dimensional packet classifier in which its outgoing edges point to its classifier’s decisions. We will construct a corresponding TCAM table Table(v) for each non— terminal node v in the FDD, and we assign a unique I D in the range 0 to n — 1 to both v and T able(v). We will refer to ID as both node v’s ID and table Table(v)’s ID. The meaning should be clear from context. For example, the IDs for the four nonterminal nodes in Figure 7.3(a) are 00, 01, 10, and 11. For any table t, we define its height h(t) to be the number of entries in t, and we define its width w(t) to be the number of bits in each TCAM entry. In the single lookup approach, most people assume w(t) = 144 because the five packet fields require 104 bits. Using the multi-lm1kup approach, we will show we can make w(t) = 72. We generate Table(v) in two steps. We first generate a correct packet classifier by making one entry for each prefix on each edge. That is, for each of v’s outgoing edges e from v to v' and for each prefix p on edge e, we generate a rule r as follows: the predicate of r is p; if v’ is a terminal node, the decision of r is the label of v'; if v, is a nonterminal node, the decision of r is the ID of v’. We then minimize the number of TCAM entries in Table(v) by using an optimal, polynomial-time algorithm for minimizing one-dimensional prefix packet classifiers [27]. Figure 7.3(a) shows the four minimal TCAM tables that correspond to each of the four nonterminal nodes in the FDD. Table Mergence The final step is to merge all n TCAM tables into a single Imllti-lookup table. For every nonterminal node v, we prepend v‘s ID to the predicate of each rule in Table(v). Since each table ID provides a unique signature that distinguishes that table‘s entries from all other table entries, all n tables can be merged into a. single n’mlti-lookup table. Figure 7.3(b) shows the resultant multi—lookup table from merging the four 58 TCAM tables in Figure 7.3(a). 7.1.2 Packet Processing After the multi-lookup TCAM table is built for a d-din‘iensional packet. classifier, the decision for a d-dimensional packet (p1,...,pd) can be found by d searches on the TCAM. The first search key kl is formed by concatenating the root node’s ID and p1. Let f (131) denote the search result of 11:1. The second search key, k2 is formed by concatenating f (131) and 172. This process continues until we compute f (kd), which is the decision for the packet. For example, given the two dimensional multi—lookup table in Figure 7.2(e) and a packet (010, 001), the first search key is 00010, which returns 10. The second search key is 10010, which returns accept as the decision for the packet. 7.1.3 Analysis We analyze the impact of the 11'uilt.i-lookup approach on TCAM space and packet classification throughput. Space We define the space used by a packet classifier in a TCAM chip as the number of classifier entries or rules multiplied by the width of the TCAM chip in bits: space 2 # of entries x TCAM width Given a packet classifier f, let Single(f) denote the resulting single lookup TCAM table and let Multi(f) denote the resulting multi—lcmkup TCAM table. It follows that. width w(Single( f )) = 144 because the table must accommodate the 104 bits in the five packet fields, and the number of entries is h(Si/iglc( f )) Thus, the number of bits required by the single lookup approach is h(S/fnglc(f)) x 144. On the other hand, we can safely set width u.r(111'u.lti(f)) = 72. This follows as the maximum width of the five packet fields is 32, which leaves 40 bits for storing a table. ID and optionally, the decision. This is more than sufficient for any realistic TCAM for the forseeable future. Thus, the number of bits required by the multi-lookup approach is h(Multi( f )) x 72. The multi-lookup table starts with a 50% reduction in width. Throughput Based on the above analysis that there are at least 40 bits to store table IDs plus the decision, there is sufficient space to store the decision for each rule in the TCAM entry and still have each TCAM entry fit within the 72 bits of the typical TCAM bus width. Thus, it will require two bus cycles to process each packet field: one cycle to send the search key and one cycle to perform the search and return the result. Given there are five packet fields that need to be processed, the total packet processing time will require ten TCAM bus cycles. The single lookup approach requires either four TCAM bus cycles or five TCAM bus cycles to find the decision for a packet: four bus cycles if the decision is stored in TCAM, five bus cycles if the decision is stored in SRAM. Note that the overall packet processing throughput for the multi- lookup approach may actually be closer to the packet processing throughput of the single lookup approach because TCAM lookup is normally not the bottleneck of such systems; instead other operations such as moving a packet in and out of queues are the real bottlenecks, so taking a few more bus cycles to process a packet may not have a significant impact on throughput. 7 .2 Pipelined-lookup Approach The multi-lookup approach is an effective method for reducing TCAM space needed for packet classifiers. However, this reduction in space reduces packet classification throughput by requiring multiple lookups on a single TCAM chip. In this section, we present our pipelined-lookup approach, which improves packet throughput. by using one TCAM chip for each field. That is. we will use five TCAM chips where, for 1 S i g 5. chip i stores table t,- which is the merger of all tables of F,- nodes. Having one merged table per field in a separate TCAM chip enables us to pipeline 60 the multiple lookups needed for processing each packet. Surprisingly, the pipelined- lookup approach can be four to five times faster than the traditional single-lookup approach. Furthermore, separating the tables from different fields yields new op— portunities to save bits. The result is that while more TCAM chips are needed, the pipelined-lookup approach can be even more space efficient than the multi-lookup approach. Next, we present the technical details of the pipelined-lookup approach to redesigning TCAM based systems. In particular, we present two algorithms to support this approach: an algorithm for constructing a sequence of d TCAM tables and an algorithm for processing packets. 7.2.1 Pipelined-Table Construction Our algorithm for constructing a sequence of d tables t1 through t5 consists of four steps. The first two steps are FDD construction and F DD reduction. which are similar to the first two steps in the multi-lookup approach. The last two steps, table generation and table mergence, require some modifications as described below. Table Generation This step differs from the table generation step in the multi—lookup approach in assigning node IDs in the constructed F DD. Here, we assign each node an ID that can uniquely discriminate that node from all other nodes of the same field. Let m,- be the number of nodes with label F,- in the constructed FDD. The ID assigned to each F,- node consists of [log mi] bits. For example, the IDs of the three F2 nodes in Figure 7.4(a) are 00, 01, and 10. In contrast, in the table generation step of the multi-lookup approach, the ID assigned to each node with label Fi consists of [log(Zgl':1 771,-)] bits. Note that each F, node. has a unique ID in the context of 17,-. We also observe that for field F1, there will always be a single table. Therefore, we do not need an ID to distinguish this table from tables of other fields. In the remainder of this section, we assume no ID is needed for the F1 table. 61 Table Mergence Similar to the table mergence step in the multi-lookup approach, for every nontermi- nal node v, we first prepend v’s ID to each rule in Table(v). Second, for every field Fi, we combine all tables of Pi nodes into one table ti- For example, Figure 7.4(b) shows the two pipelined TCAM tables generated from the F DD in Figure 7.4(a). 000 a f 000 00 111 010 01 m 100 01 110 00 Hi 10 , (a) [1“ a 0“ I.“ d it Table Mergence 000 01 f 000 00 l 010 10 010 01 1:, 10010 ----->( 100 01 it, 110 01 110 00 *1". 11 \ ,“ 10 J 000 a 00000 a ) i F2: 00 111 a ‘-— 00111 a ()) d 00*" d o" a ___ 010" a >1, 13:01 d 01... d [1" 101" a Fzz10 I“ Z —" 10*" d Figure 7.4: Table generation and table mergence in the pipelined-lookup approach 7.2.2 Packet Processing Similar to the multi-lookup approach, in the pipelined-lookup approach, a til-dimensional packet search is separated into d searches; however. with the pipelined-lookup ap- 62 proach, these searches can be pipelined. That is, the d TCAM chips are chained together into a pipeline such that the search result of the i-th chip is part of the search key for the (i + 1)-th chip, and the result of the last chip is the decision for the packet. With such a chain, d packets can be processed in parallel in the pipeline. Figure 7.5 illustrates the packet processing algorithm for the two tables t1 and t2 in Figure 7.4(b). Suppose two packets (010,001) and (111,010) arrive one after the other. When (010, 001) arrives, the first. search key, 010, is formed and sent to t1 while the rest of the the packet (001) is forwarded to t2. When the next packet (111, 010) arrives, table t1 has sent the search result 01 to table t2. When the first search key for the second packet. 111 is formed, the second search key for the first. packet 01001 is formed in parallel, and both are sent to tables t1 and t2, respectively. This cycle will yield a result of accept for the first packet, and a result of 10 for the second packet. The above process continues for every received packet. (010,001) ——> 010 V a a f 000 00 d 010 01 a F,( 100 01 >t1 : t2 110 00 d H 10 a t J a 01 d t I i 001 > 01001 accept Figure 7.5: Example, of a. pipelined—lookup 7.2.3 Analysis We next analyze the impact of the 1)ipelined—lookup approach on TCAM space. and classification throughput. 63 Space The pipelined—lookup approach is at. least as space efficient as the multi—lookup approach, and there are many cases where it uses even fewer TCAM bits. We first observe that the pipelined-lookup approach generates the same number of TCAM entries as the multi-lookup approach. That is, if t is the table formed by the multi- lookup approach and t1 through t5 are the tables formed by the pipelined-lookup approach, we have that h(t) = [3:1 h(t,;). The space savings of the pipelined— lookup approach results from requiring fewer bits per entry. The first opportunity for saving space comes from the fact that the number of bits needed to encode a table ID in the pipelined-lookup approach is less than or equal to that for the multi- lookup approach as we only need to distinguish a table from other tables with the same field label. The second Opportunity for savings comes from the three fields with width 16 or 8 bits. For these fields, we. may use width w(ti) = 36 whereas w(t) = 72. Specifically, for the source port and destination port fields, we have 36 -— 16 = 20 bits to represent a table ID and optionally the decision of each rule. For the protocol field, we have 36 —— 8 = ‘28 bits for this purpose. For most classifiers, these bits should suffice. In summary, the pipelined—lookup approach is at least as space efficient as its multi-lookup counterpart; fm'thermore, there are cases where it is more space efficient. Throughput The pipelined—lookup approach clearly leads to higher throughl‘mt than the multi- lookup approach. More so, the pipelined—lookup approach actually achieves four or five times higher throughput than the traditional single-lookup approach. In the pipelined—lookup approach, because a search key can always be transmitted over the bus in one cycle, the packet classification throughput. is one packet per cycle. In contrast, the packet classification throughput. of the traditional single-l(')(_)kup approach is one packet per four or five cycles. 64 7 .3 Packing Approach In this section, we present the TCAM packing approach. This approach reduces space consumption by allowing multiple rules from different TCAM tables to co- reside in the same TCAM entry. The TCAM packing approach is orthogonal to the multi—lookup and pipelined—lookup approaches in that it can be combined with the two approaches to further improve TCAM space efficiency. The TCAM packing idea is based on the following three key observations: First, the reconfigurability of TCAM widths is limited. TCAM chips typically only allow entry widths of 36, 72, 144, or 288 bits. This leads to wasted space as typical TCAM tables rarely can be configured to exactly one of these widths. In the standard single lookup approach, up to 40 bits might be unused given the predicate will be 104 bits. Second, the multi—lookup and pipelined—lookup approaches produce “thin” tables of varying widths. We say the tables are thin because each table focuses on a single packet field. Thus, the table widths are much smaller than 104 bits. The widths vary because the packet fields have different lengths: 8, 16, and 32, and these predicate bits form a significant fraction of each table entry. Having multiple fields of varying widths provides opportunities to better approach the standard TCAM widths. Third, the search key for TCAM chips can be ternary. In other words, TCAM columns are dynamically selectable for each lookup. A typical TCAM chip has a global mask register, which dynamically selects the columns that participate in a lookup. The global mask register allows multiple entries from different lookup tables to co—reside in the same TCAM entry without conflicting with each other. We developed two TCAM packing schemes, which we call strict partitioning and shadow packing. 7 .3.1 Strict Partitioning Basic Strict Partitioning The basic idea of strict partitioning is to divide a TCAM chip into multiple columns and distribute multi-lookup or pipelined-lookup tables among these columns. The distribution needs to satisfy the following two conditions: (1) all tables in the same column must have different node IDs, (2) all the roles in a table are stored in only one column. Multiple tables in the same column are discriminated by their table ID. Multiple columns in the same TCAM chip are discriminated by the GMR of the chip. The appropriate GMR can be selected by using a column [D which must have enough bits to discriminate all the columns. We illustrate the strict partitioning scheme using the four multi-lookup tables in Figure 7.2(d) in a TCAM chip with entry width 21 bits. Figure 7.6 shows a possible arrangement of the four tables: table 00 in column 1, table 01 in column 2, and tables 10 and 11 in column 3. We use column IDs 00, 01, and 10 for columns 1, 2. and 3, respectively. We decode the first entry 00 000 01’@01 in column 1 as follows. The first two bits 00 encode the table ID, the next three bits 000 are the rule predicate, and the last four bits 01L®01 encode the rule decision where the first two bits encode the next table ID and the last two bits encode the column ID of the next table. By partitioning the TCAM chip into three separate columns, the TCAM chip is essentially divided into three TCAM chips. Lookups in the TCAM chip are performed by padding the search key with the appropriate ternary bits via the GMR. For example, to lookup 111 in table 10 of column 10, the lookup key is ***************10111*. Storing Decisions: Because packing schemes can use previously unused bits in the multi-lookup and pipelined—lookup approaches to store rules. storing the decision of each rule in the TCAM entry is not space or cost effective. Therefore, for all our packing schemes, we assume that the rule decisions are stored in SRAM. Figure 7.7 shows a version of the strict partitioning in Figure 7.6 with decisions stored in SRAM. Packing multi-lookup tables vs. packing pipelined-lookup tables: Recall 66 TCAM chip Col 00 C01 01 C01 10 00 000 01@01 01 000 a 10 0** a 00 010 10411710 01 111 a 10 *** d 00 100 10@10 01 *** d 11 1** a 00 110 OIIJQOI 11 *** d 00 *** IIQIO Figure 7.6: Strict. partitioning of a TCAM chip TCAM chip SRAM Table Col 00 C01 01 C01 10 C01 1 Col 2 Col 3 00 000 01 000 10 0** 01@01 a a 00 010 01 111 10 *** 101310 a d 00 100 01 *** ll 1** 10115.1? 10 d a ()0 110 ll *** 01(0101 d 00 *** IIQIO Figure 7.7: Decisions in SRAM that all our packing schemes can be applied to both the multi—lookup and pipelined- lookup approaches. The only difference is that when we pack multi-lookup tables into one TCAM chip, we need to deal with tables of variable width. In contrast, when we pack pipelined-lookup tables into d TCAM chips, we only deal with tables of the same width for each chip, ignoring minor differences induced by table IDs. Reassigning Table IDs With Fewer Bits: The original table. IDs were used to distinguish a table from either all other tables (in the multi—lookup approach) or all other tables of the same field (in the pipelined-lookim amn'oach). However, in a packing schen'ie, we only need to distinguish a table from the other tables in the same column. Therefore, we can often use fewer bits for tables IDs. In our packing schemes, after tables are allocated to (70111111118. we reassign table IDs using the least number of needed bits, and the decisions for the rules have to be. updated to reflect the new table IDs. Note that different columns may have different table ID widths, and rule decisions may have different lengths. In the strict partitioning scheme, for a column with 71 tables, the number of bits in the reassigned ID of each table is [log n]. Figure 7.8 shows a version of strict partitioning in Figure 7.7 with table IDs reassigned with fewer bits. TCAM chip Decision Table Col 00 C01 01 C01 10 C01 00 C01 01 C01 10 000 000 0 0** @01 a a 010 111 0 *** 0@10 a d 100 *** 1 1** O@1O d a 110 1 *** @01 d *** 1’Q1O Figure 7.8: Reassignincr the table IDs O O 0 Processing Packets: We. describe the algorithm for processing packets un- der the strict partitioning approach using examples. Suppose the given packet is (000, 110), the first TCAM lookup is 000******* and the lookup result is the index value of O. This index value is used to find entry 0 in the column 00 in the SRAM to find the decision of iii-()1, which means that the second lookup should be performed on column 01. To further perform the second lookup, the GMR is modifed to make the second lookup key ***11()****. The result. of the second lookup is the index value of 1, which means the decision is stored in the second entry of column 01 in SRAM. The second entry of column 01 in SRAM is “(1.”, which means that the final decision for the given packet is accept. Optimized Strict Partitioning Given a set of TCAM tables to be packed in a single. TCAM chip. there are many ways to do strict partitioning. First, we can choose the TCAM width to be. 36, 72, 144, or 288. Second. for each possible TCAM width, there are many possible ways to divide the TCAM. Third, for each possible division of the TCAM, there are many 68 possible ways to allocate TCAM tables to columns. Among all possible strict parti- tioning solutions for a TCAM chip, we want to find the solution that uses the least TCAM space under throughput constraints. Note that the classification throughput may decrease as the TCAM width increases. We formally define the Strict Parti- tioning Optimization Problem. as follows. We omit throughput constraints in the definition for ease of presentation. Definition 7.3.1. Given n TCAM tables t1, . . . , tn, find a partition of the tables to 771 sets c1, . . . , cm that minimizes the objective TW X maxgll Rules(c,;) such that . L211 C7: = {t1. . . . , in} . 23:11(rnaxtjeatwcj>> + log 1cm s TW where TW 6 {36,72,144,288}, w(tj) denotes the width of table tj, |tj| denotes the number of rules within t j? and Rates(ci) denotes the total number of table entries in Ci, which is théci |tj|. The problem of makespan scheduling on multiple identical machines [8]. which is NP—complete, is a special case of this problem. Thus, the strict partitioning optimization problem is NP-complete. It belongs to NP because the solution can be verified in polynomial time. 7.3.2 Shadow Packing In strict partitioning. we. viewed columns as the 1:)1'imary dimension of TCAM chips and sought to pack tables into fixed width columns. In shadow packing, on the other hand, we view rows as the primary dimension of TCAM chips and seek to pack tables within fixed height. rows. we consider shadow packing because of the following two observations. First, with strict partitioning. when tables of varying width are. allocated to the same column, the number of bits assigned to each table I. is equal to Mr) X w(t’) where t’ is the widest. table assigned to that. column. This leads to many unused bits if tables of different widths are assigned to the same colunm. On the other hand, 69 horizontally packed tables can be placed next to each other as keeping the vertical boundaries across multiple tables is unnecessary. Of course, there may be wasted bits if tables of different heights are packed in the same column. We will allow tables to be stacked in the same row if they fit within the row boundaries. Second, with strict partitioning, the table ID’s between tables in different columns cannot be shared. Thus, the number of bits used for table IDs grows essentially lin- early with the number of columns. On the other hand, horizontally aligned tables in the same row can potentially share some “row ID” bits in their table IDs; these tables would be distinguished by their horizontal offsets. Based on the above two observations, we design the shadow packing scheme that achieves more space efficiency by packing tables horizontally and allowing multiple tables to share bits in their IDs. We first define the concept of shadowing for table ID inheritance and the concept of shadow packing trees for representing table ID inheritance. Then, we present the shadow packing algorithm and discuss the procedure for processing packets with shadow packing. Shadowing In Figure 7.9(a), table to shadows tables ‘00 and t01. We define the concept of shadowing as follows: Definition 7.3.2 (Shadowing Relationship). For a table t stored in a TCAM, we use VBegin(t) and VEnd (t) to denote the vertical indexes of the TCA M entries where the table begins and ends respectively, and use HBegin(t) and HEnd (t) to denote horizon- tal indexes of the TCA M bit columns where the table begins and ends respectively. For any two tables t1 and t2 where [VBcgin.(t2), VEnd(t2)] (_Z [VBegin(t1), VEnd(t1)] and HEnd(t1) < HBcgin(t2), we say t1 shadows t2. When table t1 shadows t2, the ID of t1 can be reused as part. of 12’s ID. Suppose table t shadows tables t1, - - - , tm. bees-ruse [’5 ID defines the vertical TCAM region [Begin(t), End(t)], each t, (1 g i g m) can use t’s ID to distinguish t,- from tables outside [Begin(t), End(t)] vertically, and use [log ml bits to distii‘iguish t,- from tables 70 inside [Begin( t,) End(t(t)])ve1tically. Horizontally, table t and each table t,- can be distinguished by properly setting the GMR of the TCAIV’I. Definition 7.3.3 (Shadow Packing). Given a region defined vertically by [01,112] and horizontally by [h1, ’22].- all tables completely contained within this region are shadow packed if and only if there exist rn (rn>1) tables t1,~ -,tm in the region such that the following three conditions hold: 1. 221: VBegin(t1), VEnd(t il+1 = VBegin(t,-+1)for1 S i S rn—l, VEnd(tm) g (U2 ,0 2. no tables are allocated to the region defined vertically by [VEnd(tm) + 1,122] and horizontally by [h1, hgl; 3. for eachi (1 g i< m), the region d(fi’lttd vertically by [VBcgin(t i), VEnd(t )] and horizontally by [HEnd(t.,-)+1. [72] either has no tables or the tables allocated to the region are also shadow packed. For example, the tables in Figure 7.9(a) are shadow packed. Figure 7.9(b) shows the tree representation of the shadowing relationship among the tables in Figure 7.9(a). 0 t t / 000 O 00 t / F’ tom 0 O \ ~k / 1 t01'_’ to1 x y two 01 t:1-"—"' t101 10 1:110 (a) (b) Figure 7.9: Shadow packed tables &t shadow packing tree Shadow Packing Algorithm Given a. set of tables and a TCAM region, the shadow packing algorithm allocates the tables into the region. The goal of a shadow packing algorithm is to minimize the number of TCAM entries occupied by the tables, i.e., to minimize VEnd(tm). We call this minimization problem the Shadow Packing Optimization Problem. This problem becomes more difficult as we recurse because we must also address which tables should be allocated to which region. Whether this problem can be solved in polynomial time is an open problem. In this chapter, we present a shadow packing algorithm SPack, which has been shown to be effective in our experiments on real-life packet classifiers. The basic idea of SPack is as follows. Given a set of tables S and a TCAM region, SPack first finds the tallest table t that will fit in the region where ties are broken by choosing the fattest table. SPack returns when there are no such tables. Otherwise, SPack places t in the top left corner of the region, and SPack is recursively applied to S — {t} in the region to the right of t. After that, let S’ be the set of tables in S that have not yet been allocated. SPack is applied to S, in the region below t. Intuitively, SPack greedily packs the tallest (and fattest) possible table horizontally. The pseudocode of SPack is shown in Algorithm 4. We, however, must compute the initial SPack region. The height of the initial region is the total number of rules within the set of tables. We do not need to set this value carefully because SPack only moves to another row when all the remaining tables do not fit in any of the current shadows. The width is more complicated and must be computed iteratively. For each valid TCAM width w 6 {36,72, 144,288}, we set the initial width to be w and run SPack. Once we have a packing, we determine the number of bits I) that. are needed for node IDs. If the packing could accommodate these extra b bits, we are done. Otherwise, we choose an aggressive backoff scheme by recursing with a width of w —— b. It is possible, particularly for w = 36, that no solution will be found. To detern'iine which TCAM width we should use, we choose the width w E {36, 72, 144, 288} whose final successful value resulted in the fewest. number of entries. Note that. there are other possible strategies for 72 Algorithm 4 Shadow Packing (SPack) Input: S : a set of tables, and a region [01,122], [h1, hg]. Output: 8': the set of tables in S that have not been packed. 1: Find the tallest table t E S that will fit in [vbv2], [12.1,h2] such that ties are broken by choosing the fattest table; : if no table is found then to 3: return S; 4: else 5: Place t in the top left corner of ['v1,v2], [h1, hg]; 6: s” <— SPack(S', VBegin(t), VEnd(t), HEnd(t) + 1, 22.2); 7: return SPack(S",VEnd(t) + 1, '02, h1, hg); 8: end if determining the width of the SPack regions; for instance, instead of reducing the region width by b, the width could be reduced by 1. Furthermore, to speed up this process, SPack can be modified to abort the packing once it detects that the table packing and IDs can not fit within the region. Reassignng Table IDS and Rule Decisions: Because shadow packing estab- lishes a hierarchy of table IDs, each table needs a new ID, and all the rule decisions need to be remapped to reflect these new IDs. Each table ID is determined by a tree representation similar to the one found in Figure 7.9(b), which we call a shadow packing tree. For each node v in a shadow packing tree, if v has m > 1 outgoing edges, each outgoing edge is uniquely labeled using llogrn) bits; if v has only one outgoing edge, that edge is labeled *. For each table t, let u be the corresponding node in the shadow packing tree. All the bits along the path from the root to v are all the bits needed to distinguish t from all other tables. Note that. the * corre- sponds to a table where no additional ID bits are needed. In our shadow packing algorithm, we reserve l bit columns in the. TCAM where l is the maximum number of bits needed to the distinguish a table. Reserving some bit columns for storing table IDs has the advantage of simplifying the processing of packets since the bit columns containing the table IDs are fixed in the TCAM. Figure 7.10(a) shows the shadow packing tree for the four tables in Figure 7.2(d) and their reassigned table IDs. Figure 7.10(b) shows the final memory layout in the TCAM chip after shadow packing and the conceptual memory layout of the decision table within SRAM. The one bit ID column in Figure 7.10(b) is needed to distinguish between the tables with original IDs 01 and 11. Note that table 10 shares the table ID 0 with table 01 as it is the only table in table 01’s shadow. To make the decision table in Figure 7.10(b) easier to understand, we encode it in a memory inefficient manner using columns. Table ID Reassignment 0 1:01;. tlo Table ID . —*"’ too 00 * (a) l\ tn 01 O 10 0 11 1 ll Construct tables TCAM chip Decision Table 0 123 456 789 Col 00 C01 01 C01 10 O 000 000 0** 0@4 : 01 a a 0 010 111 *** oer: 10 a d (b) O 100 *** 0@7: 10 d 1 110 1** 0@4:10 a 1 *** *** 1@4 :01 d Figure 7.10: The shadow packing process. Processing Packets: We describe the algorithm for processing packets under the shadow packing approach using examples. Given a packet (000,111), the first TCAM lookup is *000******, and the lookup result is the index value of 0. This index value is used to find entry 0 in the column 00 in the SRAM which contains the decision of 0:024 : ()1. The 01334 means that the second lookup key should occur in table ID 0 at horizontal offset of 4, and the 01 means that decision of the next search is located in column 01 in SRAM. To perform the second lookup, the GMR is modified to make the second lookup key 0***111***. The result of the second lookup is the index value of 1, and the decision stored in the second entry of column 01 in SRAM is retrieved, which is accept. 7 .3.3 Strict Partitioning vs. Shadow Packing We now compare the space efficiency of strict partitioning and shadow packing. The sole advantage of strict partitioning is that it has no horizontal boundaries. On the other hand, shadow packing has two key advantages. It has no vertical boundaries, and tables in the same row can share some table ID bits. Furthermore, we mitigate some of the disadvantage of horizontal boundaries by greedily packing tables in the shadow of other tables. 7 .4 Table Consolidation The basic idea of table consolidation is to use one TCAM table to represent multiple TCAM tables. Table consolidation is motivated by the following two observations. First, two TCAM tables may share common entries, which result in the same infor— mation being stored multiple times. Second, existing TCAM-based packet classifi— cation systems are based on a “fat” TCAM and “thin” SRAM architecture, which means that the majority of the information (i.e., the predicates of rules) representing a packet classifier is stored in TCAMS and little information (i.e., the decision of rules) is stored in SRAMs. However, because TCAMs are much more expensive than SRAMs, we ideally would store more infm'n‘iation in SRAMs and less information in TCAMS. We begin with two new concepts: hi-dCCISIOn rule and kr~decision classifier. A k-decision rule is a classification rule whose decision is an array of k decisions. A [st—decision classifier is a sequence of k-decision rules following the first-match semantics. W'e formally define the table consolidation problem as follows: Definition 7.4.1 (Table Consolidation Problem). Given k I-decision classifiers C1,--°,(Ck, find a k-decision classifier (C such that for any i {1 S i S k), the condition (C,- E G[i] holds. We emphasize that a h-decision classifier can be viewed as a l—decision classifier if we view the array of k decisions of each rule as one decision. In general, a k-(IBCISIOII classifier can be viewed as a k’-decision classifier where k, < k, if we treat some. decisions as one decision. 7.4.1 Table Consolidation Algorithm We use multi—match FDDs to facilitate table consolidation. A multi-match F DD sat— isfies all the properties of an all—match FDD except the condition j 6 {21, i2, - - - , ik} in the 5th property. Our table consolidation algorithm works as follows. First, given a set of k clas- sifiers C1,-- ~, t2 1 * d F, F2 decision ,2 * , (1 t2 1 * d * a * 1 d 9: * a F, F2 decision 0 0 a, a l l d, d a, d Figure 7.11: Consolidation of 2-dimensional tables 7.4.2 Hierarchical Table Consolidation In building a multi-match FDD from a set of classifiers, each classifier influences the shape of the FDD and causes the FDD to grow. To localize the expansion impact of one classifier on others, we propose the following hierarchical table consolidation strategy. Given k classifiers, first, we equally divide them into [ls/m] buckets where every bucket has m classifiers except for one bucket that may have less than m clas- sifiers. Second, for the classifiers in each bucket, we apply the table consolidation algorithm and get an m-decision classifier. Thus. we get [ls/ml classifiers. By treat- ing each m—decision classifier as a l-decision classifier, we apply the above process again on the [ls/ml classifiers. This process repeats until we get the final k-decision classifier. We refer to the strategy of choosing m = k as flat table consolidation. Theorem 7.4.2 establishes the correctness of hierarchical table consolidation. Theorem 7.4.2 (Hierarchical Table Consolidation Theorem). Given the same input 77 of k J-decision classifiers and the same table consolidation algorithm, the strategies of hierarchical table consolidation and flat table consolidation output the same k- decision classifier. o ,601‘t—vd, t 001 ° oo1'o1o'->d2 {9° 010 0 0 ‘H Vii/”d: 1”,... 100 0 iii 1 _ 001 000 .L_ —> I 0 1601 \ d1 TCAM for root {010 ,'->d2 \* **’_’d => V0‘*-. _/ V1 => " 3 " 3 3 ' \ T'" __ o ooh-m, 001 0'1 ___,j i 100’:010"_,d2 010 d2 : um d4 I 100 \“'\“”*I—>d3 *** d3 TCAM for V, *** ***—>d TCAMfOl'VO ’ 4 (a) Classifier (b) Constructing FDD and tables F1: 010 0 010 0 *t 1 _001 100 o 100 o ‘ l .__, , _1_ 1 001 0 [001 d1 d4 I 0 1 010 0 1 001 I001 d. d. s F. 100 0 a. 5,9,0 :2 :4} F :0 010 |o1o d2 d4| , 3 ‘ 1“” Jan" d3 d4 t2 _‘J l . d1 ..~,. 1- a; L— (c) Table Consolidation (d) Pipelined lookup process Figure 7.12: Table consolidation for pipelined-lookup sequential decomposition 78 7.4.3 TCAM/SRAM Space Tradeoff via Bounded Consoli- dation Table consolidation creates a tradeoff between TCAM storage and SRAM storage. On one hand, merging multiple classifiers into one classifier results in less TCAM storage. On the other hand, each entry in the resulting classifier requires more SRAM for storing the decision list. We could simply merge all classifiers into a single classifier; however, this may require more SRAM space than what is avail- able. To address this issue, we propose the following bounded consolidation scheme. The basic idea of bounded consolidation is to limit the number of classifiers that we combine. Given a set of k l-decision classifiers, we first sort the classifiers in decreasing (or increasing) order according to size (i.e., the number of rules in each classifier). Second, we partition the sorted classifiers into [Ir/ml chunks where the first [kt/ml -— 1 chunks are of uniform size 772. (1 S m g A). Third, for every chunk, we apply the table consolidation algorithm to the classifiers that it contains. Finally, we get. [ls/ml multi-decision classifiers. Note that m is an adjustable parameter. 7 .5 One-Dimensional Table Consolidation While table consolidation can be applied to d—dirnensional classifiers for arbitrary d, we show that it is especially effective for consolidating one-dimensional classifiers. In particular, Theorem 7.5.1 shows that table consolidation is guaranteed to reduce TCAM space occupied when applied to one-din'iensional classifiers. Table consolida— tion’s effectiveness on one-dimensional classifiers implies that. it is especially effective when combined with sequential decomposition to minimize the space required by any single classifier. W’e defer the proof of Theorem 7.5.1 to the appendix. Theorem 7.5.1. Given any set of k7 I-decision I-dimension(Ll (.rlo..<;sifi(::rs C1, - - - , (Ck: the [sf-decision 1-dimensional classifier (C output by the. TCAM distillation algorithm satisfies the following condition: [C] S I(l'll + - - - + I('A.| — A: + l. 79 7.5.1 Table Consolidation with Sequential Decomposition Because table consolidation is guaranteed to work well on one-dimensional classifiers, table consolidation can be integrated with the sequential decomposition scheme to further reduce the space required by a TCAM—based packet classifier. Basically, in the third step of pipelined-lookup sequential decomposition, for each field 17,-, we group all the tables of F,- nodes and apply our table consolidation algorithm to produce a multi-decision table rather than using table ID’s. For example, given the two tables of the two F2 nodes ”0 and v1, our table consolidation algorithm outputs the 2-decision table t2 in Figure 7.12(c). In this case, table consolidation reduced the required number of TCAM entries from 4 to 3. The packet lookup process on the d multi-decision TCAM tables proceeds as follows. The first table has only one column of decisions and each decision is the decision column index of the second table. Similarly, each decision in the i-th (1 g i < d) table is the decision column index of the next table. We illustrate the packet. lookup process using the example in Figure 7.12(d). Given a packet (010,001), we first use 010 to search the table t1 and get result 0. Second, we use 001 to search table t2 and get. search result (d1, d4). The final result is the first element of (d1, d4), which is d1. Note that each TCAM chip has its own SRAM. Therefore, the packet lookup process can be pipelined to achieve the throughput of one packet per cycle. 7.5.2 Coping with More Fields than TCAM chips If we have fewer TCAM chips than packet fields, we have two choices. One is to strategically combine selected fields into one field until the number of fields is equal to the number of TCAM chips. We can use the FDD structure to facilitate combination of multiple fields. For example, given a. d—dimensional classifier over fields 171,172, - - - , Fd and the corresponding FDD, we can combine the first. let (1 < k: < d) fields into one field by treating the subgraph rooted at. each Fk+1 node as a. terminal node and then use our TCAM Razor algorithm presented in [20] to generate a kT-din'iensional table for the k-dimensional F DD. Then, for each subgraph rooted 80 at an F k +1 node, we can apply the above process recursively. A second choice is to employ the sequential decomposition multi—lookup approach where we store tables from multiple fields on a single TCAM chip. For example, if we have d dimensions and q chips, we would store tables from either [d/ql or (d/qj fields in each chip in the pipeline. Each stage of the pipeline would require either [d/ql or [d/qj lookups. This would result in a classification throughput of one packet per [d/ql TCAM bus cycles. 7 .6 Implementation Issues 7 .6.1 Fast FDD Reduction F DD reduction is an effective way to reduce the number of multi-lookup tables and pipelined-lookup tables generated from an FDD. A brute force deep comparison algorithm for FDD reduction was proposed in [9]. Here we use a more efficient. FDD reduction algorithm that uses signatures to speed up node comparisons. The algorithm works as follows. The algorithm processes the nodes level by level starting from the terminal nodes. At the current level, we compute a signature for each node. For a terminal node v. set. o’s signature to be its label. For a non-terminal node v, suppose v has A? children 121, vk, in increasing order of signature (Sig(-o,j) < Sig(o,+1) for 1 g i g k. — l), and the edge between 1) and its child u,- is labeled with E, a sequence of non- overlapping prefixes in increasing order. Set the signature of node v as follows: Sig(v) = h.(Sig(ol), E1, . - -, Sigh/7,), Ek) where h is a hash function. After we have assigned signatures to all nodes at a given level, we search for isomorphic subgraphs as follows. For every pair of nodes o, and U]- (1 S i 3A j S A) at. this level, if Sig(u,;) 31$ Sig(ej), then we can conclude that o,- and ”j are not isomorphic; otherwise, we explicitly determine if 2),: and DJ are isomorphic. If u,- and 'o]- are isomorphic, we delete node oj and its outgoing edges. and redirect. all the edges that point to o.)- to point to 12,. Further, we eliminate double edges between node 1),- and its parents. 81 7.6.2 TCAM Update Packet classification rules periodically need to be updated. The common practice for updating rules is to run two TCAMS in tandem where one TCAM is used while the other is updated [16]. All our approaches are compatible with this current practice. Because our algorithms are efficient (running in milliseconds) and the resultant TCAM lookup tables are small, updating TCAM tables can be efficiently performed. If an application requires very frequent rule update (at. a frequency less than a second, for example), we can handle such updates in a batch manner by chaining the TCAM chips in our proposed architecture after a TCAM chip of normal width (144 bits), which we call the “hot” TCAM chip. When a new rule comes, we add the rule to the top of the hot TCAM chip. When a packet comes, we first use the packet as the key to search in the hot chip. If the packet has a match in the hot chip, then the decision of the first matching rule is the decision of the packet. Otherwise, we feed the packet to the TCAM chips in our architecture described as above to find the decision for the packet. Although the lookup on the hot TCAM chip adds a constant delay to per packet latency, the throughput. can be much improved by pipelining the hot chip with other TCAM chips. Using batch updating, only when the hot. chip is about to fill up, we need to run our topological transformation algorithms to recornpute the TCAM lookup tables. 7.6.3 Non-ordered FDDs Recall that the FDD construction algorithm that. we used produces ordered F DDs, that is, in each decision path all fields appear in the same order. However, ordered FDDs may not. be the smallest. when compared to non-ordered FDDs. The F DD construction algorithm can be easily modified so that different. subtrees may use different. field ordering. By adding field information to the decisions in each table entry, we can easily accommodate different field orderings. Thus. the packet pro— cessing algorithm for both nuilti~lookup and pipelined~lookup can select. the correct 82 field for each lookup. The size advantage of non—ordered FDDs comes at the cost. that FDD reduction will not be able to process subtrees that have different field orders. Nevertheless, the use of non—ordered F DDs does open new possibilities for further optimizations. 7 .6.4 Lookup Short Circuiting So far, we have assumed the use of full-length F DDs where in each decision path all fields appear exactly once. Actually, this constraint can be relaxed so that some paths may omit unnecessary fields when a node in the path contains only one outgo- ing edge. In this case, the node along with singleton outgoing edge can be pruned. Using F DDS that are not full—length has the advantage of reducing FDD size and consequently reducing the total number of tables. Furthermore, this optimization al- lows some specific decision paths to be performed with a reduced number of lookups, which will allow for faster packet processing when the tables are processed in a multi- lookup fashion. Therefore, we call this optimization technique lookup short circuit- ing. Similar to the use of non—ordered FDDs, this optimization technique requires storing field information in the decisions. 7 .6.5 Best Variable Ordering In converting a packet classifier to an equivalent FDD, the order of the fields used by decision paths has a significant impact on the size of the resulting FDD. Given that fewer nodes in an F DD normally lead to a smaller multi—lookup or pipelined- lookup table, choosing a good variable order (i.e., field order) is important in FDD construction. Given a packet classifier that has five fields, we can easily try all 5! = 120 permutations to find the best. permutation for that particular packet classifier. 7 .7 Experimental Results In this section, we evaluate the effectiveness and efficiency of our sequential decom- position approach, as well as our three optimizations of lookup 1')i1_)elining. TCAM 83 kih'mbo’o'wks Compression Ratio 0 O O O O H H H 00 N H O H N 0 2 4 6 8 10 12 14 4 6 Classifier Classifier (a) (b) 'N'A'aaoa'o'nks Compression Ratio 0 O O O O H H H 00 6 8 10 12 14 4 6 Classifier Classifier (C) ((1) Figure 7.13: Compression ratio for each classifier in RL. packing, and table consolidation, on both real—life and synthetic classifiers. 7.7.1 Methodology We first define the metrics for measuring the effectiveness of our algorithms. Let C denote a classifier, 5' denote a set of classifiers, |S| denote the number of classi— fiers in S, and A denote an algorithm. We use A(C') and Direct(C) to denote the number of TCAM bits used for classifier C by algorithm A and direct expansion, respectively. We use A(C')R and Direct(C) to denote the number of SRAM bits used by algorithm A and direct expansion, respectively. Vi'e define the Compression ratio of algorithm A on C as . v A (') CRatzo(A.C) = m 84 02468101214 6 Classifier Classifier (a) (b) 16 r 314- .MC £12 IMcs~ 510- UPC .9 a 8- IP05 4 ‘1’ . . a 6 i- E 4L .4 ' O . ': , _ ‘ .. U 2L 0.000 4 6 8 101214 00 2 4 6 8 10 12 Classifier Classifier (C) ((0 Figure 7.14: Compression ratio for each classifier in RLU. the average compression ratio of algorithm A on a set of classifiers S as —' ZCES CRati0(/l, C) A( )= ISI . and the total compression ratio of algorithm A on S as A“. _ EGGS/4(0) A(S) _ ZCES Direct(C)' Likewise we define similar metrics for measuring the compression ratio for SRAM space. We define the SRAM/TCAM ratio of A on C as A(Cln STRatio(A,C) = A(C') . We denote the respective average and total SRAM/TCAM ratio as _ ZCES STR(Ll’l0(/l, (1') A(Sla S and "T/ _ EGGS/“(7)12 .4(S)R— ZCESA(C) , respectively. We use All to denote the multi—lookup approach, and use P to denote the pipelined lookup algorithm. We annotate M and P with subscripts C and S if the table consolidation and shadow packing optimizations are used. For example, M S denotes multi—lookup with the shadow packing optimization while PC S denotes the pipelined lookup with the optimizations of shadow packing and table consolida- tion. Given three optimizations, we study 8 variants (M, INC, 1113, JUGS, P, PC, PS, and PCS) in total and compare them with results from TCAM Razor (Razor) and direct expansion (DE). For the optimization technique of table consolidation, in our experiments, we performed bounded table consolidation with bound 4 because it has been found to be an effective trade off between TCAM and SRAM space. Note that when table consolidation and shadow packing are used together, table consolidation must be performed before shadow packing. The variable order that we used to convert a classifier to an equivalent. FDD affects the number of tables generated by our algorithms, which consequently affects the TCAM space efficiency. There are 5! = 120 different pernuitations of the five packet fields (source IP address, destination IP address, source port number, desti- nation port number. and protocol type). For HL, we investigate the effectiveness of the 5! = 120 perrmitations. A question that naturally arises is: which variable order achieves the best. (or close to best.) compression ratio for all our algoritlnns‘? By computing the average compression ratios of our algorithms on Illa, we discovered that the best order is (Destination IP, Source IP, Destination Port. Protocol, Source Port); we label this as permutation 31402. For any of our algorithms, the difference between the average cmnparison ratio achieved by 31402 and that achieved by the best. order for that particular algorithm is no more than 0.005. 86 7 .7.2 Effectiveness Table 7.1 shows the average and total compression ratios of TCAM Razor and our 8 variants 1W, 1113, P, PS, MC, [1105, PC, and PCS, all using permutation 31402, on the four classifier sets of RL, RLU, SYN, SYNU. we focus our discussion on average compression ratios; we primarily use total compression ratios to facilitate comparison of our techniques with other approaches. Figures 7.13(a) and 7.13(b) show the compression ratios of each of the 25 classifiers in RL for the four algorithms of Ill, 1115, P, and PS. Similarly, Figures 7.13(c) and 7.13(d) show the compression ratios of each of the 25 classifiers in RL for the four algorithms of 1110, MCS? PC, and PCS. For illustration purposes, we show 13 classifiers with low compression ratios in Figures 7.13(a) and 7.13(c) and the remaining 12 classifiers in Figures 7.13(b) and 7.13(d). A] AIS P PS 1110 11105 PC PCS Razor 12L 0.171 0.062 0.127 0.086 0.109 0.048 0.083 0.074 0.245 RIJU 1.642 0.147 0.928 0.245 0.558 0.082 0.327 0.128 0.319 SYN 0.028 0.007 0.016 0.010 0.021 0.008 0.013 0.012 0.083 SYNU 0.263 0.016 0.139 0.021 0.112 0.014 0.062 0.021 0.408 average compression ratio Ill 1113 P PS IUC [1105 PC PCS Razor RL 0.052 0.018 0.036 0.024 0.029 0.010 0.021 0.017 0.088 RLU 0.649 0.045 0.351 0.060 0.210 0.020 0.116 0.029 0.131 SYN 0.022 0.003 0.013 0.004 0.015 0.002 0.009 0.003 0.061 SYNU 0.254 0.013 0.133 0.014 0.106 0.009 0.058 0.010 0.382 total compression ratio Table 7.1: Compression ratios 87 M MS P PS MC AICS PC PCS Razor RL 1.504 5.138 1.504 7.471 2.840 15.631 2.840 24.537 0.245 RLU 3.926 2.861 3.926 4.830 4.697 7.301 4.697 10.573 0.319 SYN 0.268 0.645 0.268 0.975 0.614 2.707 0.614 4.123 0.083 SYNU 0.526 0.215 0.526 0.301 0.893 0.810 0.893 1.186 0.408 average ratio M AIS P PS NC 11105 PC PCS Razor RL 0.513 1.614 0.513 2.146 0.871 3.421 0.871 5.305 0.088 RLU 2.071 0.964 2.071 1.273 2.594 1.658 2.594 2.396 0.131 SYN 0.293 0.326 0.293 0.432 0.589 0.784 0.589 1.345 0.061 SYNU 0.508 0.157 0.508 0.171 0.850 0.452 0.850 0.510 0.382 total ratio Table 7.2: SRAM compression ratios ll] MS P PS MC MCS PC PCS DE RL 0.054 0.660 0.077 0.660 0.178 2.489 0.249 2.489 0.008 RLU 0.096 0.829 0.170 0.829 0.358 3.209 0.612 3.209 0.044 SYN 0.070 0.723 0.120 0.723 0.216 2.642 0.362 2.642 0.007 SYNU 0.172 1.131 0.325 1.131 0.687 4.525 1.240 4.525 0.086 average ratio 11] AIS P PS MC MCS PC PCS DE EL 0.086 0.800 0.123 0.776 0.259 2.849 0.362 2.755 0.009 RLU 0.146 0.976 0.270 0.976 0.568 3.842 1.024 3.778 0.046 SYN 0.091 0.801 0.159 0.791 0.272 2.761 0.466 2.767 0.007 SYNU 0.200 1.242 0.381 1.236 0.800 4.935 1.473 4.902 0.100 total ratio Table 7.3: SRAM/TCAM ratios 88 Effectiveness of Each Optimization Technique we have a total of 32 average compression ratio data points in Table 7.1 for our new techniques. For each of our three optimization techniques, we can pair together two data points to determine if the optimization improves space compression or not. For example, we can compare m with and W to determine if shadow packing improves the performance of multi-lookup on the RL data set. For each optimization technique, this leads to 16 comparison pairs. The first important insight is that shadow packing always improve space com- pression, and the typical reduction is quite large. Table consolidation improves space compression for 13 of the 16 pairs; for the other three pairs, table consolidation either has no effect or slightly degrades compression. Note, this may change if we used a different bound than 4. Finally, the lookup pipelining optimization improves space compression for the eight pairs that do not include shadow packing. However, for the eight pairs that do include shadow packing, it. actually hurts space compression, sometimes quite significantly. The experimental results show that shadow packing is the most important space compression optimization. The improvement gained from implementing shadow packing is almost always quite significant ranging between 0.06 and 0.92 with an average of 0.38. The improvement gained from implementing table consolidation ranges between 0.34 and 1.2 with an average of 0.71. Given that the lookup pipelin— ing optimization does not. perform well in combination with shadow packing, we should not use the pipelining optimization unless the packet processing throughput improvement is significant enough to warrant the extra expense of multiple TCAM chips. Best variant for space compression The best overall variant for space compression is 1)./CS, multi-lookup with the shadow packing and table consolidation optimizations. The only data set where MOS is not the best variant is the SYN dataset. where M S slightly outperforms MOS- Note again that. with a. different bound than 4. AIC S might outperform ill S 89 on this data set. Effectiveness on Classifiers with Many Distinct Decisions The effectiveness of our algorithms is relatively insensitive to the number of distinct decisions in packet classifiers. Even with the extreme examples of RLU and SYNU where every rule has a distinct decision, our best variant 1110 S achieves average compression ratios of 0.082 and 0.014, respectively. There are two main reasons that explain the insensitivity of our algorithms to the number of distinct decisions in a classifier. First, through sequential decomposition, we still produce relatively short and thin tables that can be efficiently shadow packed. Second, table consolidation neutralizes some of the effects of distinct decisions as entries in different tables with different decisions can be consolidated together. SRAM usage Our techniques reduce the number of required TCAM bits by using inexpensive SRAM space to replace expensive TCAM space. Table 7.2 shows the average and total SRAM compression ratios for all eight algorithms. For real-life classifiers, our techniques typically increase the amount of SRAM space used. There are four main reasons for this increase in SRAM space. The first is the possible increase in the total number of “thin” rules. The second is the increase in the number of bits needed to store decisions, since a decision in our algorithms is often a table ID. For fast indexing purposes, our algorithms require a uniform number of bits per TCAM entry, which is essentially the maximum number of bits to encode a decision. The third reason is the unused SRAM space for fast indexing shadow packed tables. With shadow packing, the total number of required SRAM bits is the number of TCAM entries times the maximum number of tables packed in a row times the maximum number of bits needed to encode a. decision. The fourth reason is ascribed to table. consolidation due to reasons similar to shadow packing. From Table 7.2, we can see that. shadow packing and table consolidation significantly increase SRAM usage. However, SRAM (‘zompression ratios do not account for the fact that SRAMs are 90 often under-utilized by the traditional single-lookup scheme. That is. the number of SRAM bits needed to store a classifier are significantly less than the number of needed TCAM bits. Table 7.3 shows the SRAM/TCAM ratios for all eight algo— rithms as well as direct expansion. From this table, we can see that all the algo- rithms, except 111105 and PCS, use less SRAM bits than TCAM bits on average. Even for [MC S and PCS, having an SRAM chip that is 4 times larger than the corre- sponding TCAM chip is sufficient. Given the relative costs and power consumption of SRAM and TCAM chips, this is a good tradeoff. Furthermore, in our experiments on real-life classifiers, no classifier demands more than 0.6 Mb of SRAM. Comparing with the state-of-the—art It is difficult to directly compare our results with those in other prior work, such as Dong et al.’s work [5], Lakshminarayan et al.’s DIRPE scheme [15], and Bremler- Barr and Hendler’s SRGE scheme [4]. First, the real-life classifier sets are different and are privately owned. Second, the evaluation metrics are not the same. For example. Dong et al.’s [5] only reported a total compression ratio of 0.54 while the range encoding papers only reported expansion ratio results. Fortunately, the expansion ratios reported in prior range encoding schemes can be used to extrapolate their total compression ratios. The expansion ratio of an algorithm on a classifier is defined as the ratio between the number prefix rules produced by the algorithm for the classifier and the number of original rules in the classifier. Thus. the compression ratio of an algorithm can be calculated by the ratio between the expansion ratio of the algorithm and the expansion ratio of direct expansion. For the best—known range encoding method, i.e., hybrid-SRGE [4], we calculated its total compression ratio to be 0.396. As we can see, our methods achieve significantly better total compression ratios, albeit on different data sets. 91 Chapter 8 Topological Transformations One approach for mitigating the effects of range expansion has been to reencode critical ranges. The basic idea is to reencode a given packet and use the reencoded packet as the TCAM search key. For instance, Liu [19], Lunteren and Engbern [31], and Pao et al. [21] all proposed methods of representing specific ranges as special bit- strings using extra TCAM bits. Lakshminarayan et al. [15] and Bremler-Barr and Hendler [4] proposed to replace the prefix encoding format with alternative ternary encoding formats, called DIRPE and SRGE, respectively. Previous reencoding schemes suffer from two fundamental limitations. First, they only consider range fields and ignore all other fields; thus, they miss many optimiza- tion opportunities that can be applied to prefix fields as well. It was not realized that packet classifiers often have the potential of being minimized in TCAM even when no fields are specified in ranges. Second, they require either con'iputationally or economically expensive reencoding steps that do not easily integrate into exist- ing packet classification systems. As each packet needs to be reencoded before it can be used as a search key, previous range reencoding schemes propose to perform packet reencoding using software, which greatly increases packet processing time, or customized hardware, which is expensive from a design, cost, and implementation perspective. In this chapter, we take two novel views on range reencoding that are funda- mentally different from previous range reencoding schemes. First, we view range 92 reencoding as a topological transforn’iation process from one colored hyperrectangle to another. W'hereas previous range reencoding schemes only deal with range fields, we perform reencoding on every packet field. Specifically, we propose two orthogo— nal, yet composable, reencoding schemes: domain compression and prefix alignment. In domain compression, we transform a given colored hyperrectangle, which repre— sents the semantics of a given classifier, to the smallest possible “equivalent” colored hyperrectangle. In prefix alignment, on the other hand, we strive to transform a colored hyperrectangle to an equivalent “prefix-friendly” colored hyperrectangle where the ranges align well with prefix boundaries, minimizing the costs of range expansion. Second, we view range reencoding as a classification process that can be implemented with small TCAM tables. Thus, while a preprocessing step is still required, it can be easily integrated into existing packet classification systems using the same underlying TCAM technology. Furthermore, implementing our schemes on a. pipeline of TCAM chips even increases packet classification throughput. because our schemes enable the use of TCAM chips of small width. Domain Compression: The fundamental observation is that in most packet classifiers, many coordinates (i.e., values) within a field domain are equivalent. The idea of domain compression is to reencode the domain so as to eliminate as many re- dundant coordinates as possible. This type of reduction not only leads to fewer rules, but also narrower rules, which results in smaller TCAM tables. From a geometric perspective, domain compression “squeezes” a colored hyperrectangle as much as possible. For example, consider the colored rectangle in Figure 8.1(A) that repre- sents the classifier in Figure 8.1(H). In field F1 represented by the X—axis, all values in [0, 7] U [66, 99] are equivalent; that is, for any y 6 F2 and any 1171,12 E [0, 7] U [66, 99] , packets (.1‘1,y) and (£2.11) have the same decision. Therefore, when reencoding F1, we can map all values in [0,7] U [66,99] to a single value, say 0. By identifying such equivalences along all dimensions, the rectangle in Figure 8.1(A) is reenc.o(ile(_l to the one in Figure 8.1(D), whose corresponding classifier is shown in Figure 8.1(I). Figures 8.1(B) and (C) show the two transforming tables for [’1 and F2. respec- st tively; note that these tables can be implemented as TCAM tables. We use a as 93 a shorthand for “accept” and “d” as a shorthand for “discard”. E uivalent class I 47””:1 l 0 : ' : 6 .__-L_ ______ 0 15 .-- l _____ Domain 0 Prefix 1 75.-- _____ Compression 1 Ali nment 2 [0, 7|=>0 [0, 51=>o 2 10.01=>0 [0.0] =>03 5“ g [8, l9]=>1[6, 141=>1 0 1 2 [1,11=>2 u,1|=>2 3 1 2 3 99 l m [20,44] =>2 [15.75|=>2 (D) [2,21=>3 [2,2|=>3 (G) 3 20 44 5 99 [45,65] =>1 [76,88] =>1 (E) (F) (A) [66,99] =>0 [39.99) =>o [8, 65]/\[15,75|=>a (B) (C) [1, 2]/\[2,21 =>a l2,3|l\[3,3]=>a [20,44]/\|6, 88] =>a [2. 2|/\[1,2] =>a [3,3|/\[2,3]=>a [0, 99mm, 99|=>d [0, 2mm, 2] =>d [0,3]/\[0,3|=>d (I) (J) Figure 8.1: Example of topological transformations Prefix Alignment: The basic idea. of prefix alignment is to “shift”, “shrink”, or “stretch” ranges by transforming the domain of each field to a new “prefix—friendly” domain so that the majority of the reencoded ranges either are prefixes or can be expressed by a. small number of prefixes. In other words, we want to transform a colored hyperrectangle to another one where the ranges align well with prefix boundaries. This will reduce the costs of range expansion. For example, consider the packet classifier in Figure 8.1(1), whose corresponding rectangle is in Figure 8.1(D). Range expansion will yield 5 prefix rules because interval [1. 2] or [01. 10] cannot be combined into a prefix. However. by transforming the rectangle in Figure 8.1(D) to the one in Figure 8.1(G), the range expansion of the resulting classifier. as shown in Figure 8.1(J). will have 3 prefix rules because [2,3] is expanded to 1*. Figures 8.1(D) and (E) show the two transforming tables for F1 and F2, respectively. Again, these tables can be implemented in TC AM. We implemented our algorithms and conducted experiments on both real-world and synthetic packet classifiers. Our experimental results show that our combined algorithm on real-world classifiers achieves an average expansion ratio of 0.07 ex- cluding TCAM space. for transformers, and an expansion ratio of 0.21 including TCAM space for transformers. This is much better than the best result. in prior 94 work, the database dependent scheme hybrid-SRGE proposed by Bremler—Barr and Hendler [4], where the average expansion ratio is 1.03 (this figure ignores the cost of the reencoding process and is thus most comparable to our 0.07 average expansion ratio). 8.1 Topological Transformation The basic idea of our transformation approach is to transform a given packet classifier into another classifier that can be stored more efficiently in TCAM. Furthermore, we need a transformer that can take any packet and transform it into a new packet that is then used as the search key on the transformed classifier. Of course, the decision that the transformed classifier makes for the transformed packet must be the same as the decision that the original classifier makes for the original packet. We also require that the transformer itself be a packet classifier that can be efficiently stored in TCAM. This is one of the features that differentiates our work from previous reencoding approaches. More formally, given a d—dimensional packet. classifier (C over fields 171,- - - ,Fd, a topological transformation process produces two separate components. The first component is a set of transfomzers T = {'ll‘i | 1 g 2' S d} where transformer T2- transforms D(F,-) into a new domain D’(FZ-). Together, the set of transformers T transforms the original packet space E into a new packet space 23'. The second component is a transformed (ll-dimensional classifier (C’ over packet space 23’ such that for any packet (p1, - ‘ - ,pd) E E, the following condition holds: C(pli ' ' ° 3])(1) : C’(T1(P1)WHT(I(P(1» a Each of the d transformers T,- and the transformed packet classifier (C, are imple- mented in TCAM. TCAM Chip 01............ 01”,! 0 01"T1”"—>P1’—~ 01............ 10............ “1 10.. 10|1>2 —0—~ T2 ——>P2'—w 10.."........ 00............ 00.. C’ 00............ 00w Ir; Figure 8.2: Multi-lookup architecture 8. 1.1 Architectures There are two possible architectures for storing the (1+1 TCAM tables (Cf, T1, - - - , 'll‘d: the maria—lookup architecture and the pipelined-lookup architecture, each of which is described below. In the multi-lookup architecture, which is similar to the multi-lookup architecture in Section 7.1, we store all the d+ 1 tables in one TCAM chip. To identify tables, for each table, we prepend a [log(d + 1)] bit string, which we call the table ID, to every entry in the table. Figure 8.2 illustrates the packet classification process using the Inulti-lookup architecture when d = 2. Suppose the three tables are (C’, T1, and T2, and their table IDs are 00, 01, and 10, respectively. Given a packet (p1, p2), we first concatenate 'll‘l’s table ID 01 with p1 and use the resulting bit string Ollpl as the search key for the TCAM. Let p1, denote the search result. Second, we concatenate 'll‘g’s table ID 10 with p2 and use the resulting bit string 10] [)2 as the search key for the TCAM. Let pg’ denote the search result. Third, we concatenate the table ID 00 of (C, with m, and p2,, and use the resulting bit string ()()]])1']])2' as the search key for the TCAM. The search result is the final decision for the given packet (7)1,pg). We recommend two pipelined-lookup architectures for implementing our trans- formation approaches: parallel I_)ipelined-looki1p and chained pipelined—looku]). In 96 . ecision Figure 8.3: Parallel pipelined-lookup architecture TCAM TCAM TCAM 1 [fl] 1—. P P P’le P’1|P’2 - , decision T1 T2 - C —-> Figure 8.4: Chained pipelined—lookup architecture both architectures, we store each of the d + 1 tables in separate TCAMs. As one TCAM stores only one table, we do not need to prepend table entries with table IDs for either approach. In the parallel pipelined—lookup architecture, the d transformer tables T, laid out in parallel, form a two-element pipeline with the transformed clas- sifier C’. Figure 8.3 illustrates the packet classification process using the parallel pipelined-lookup architecture when d = 2. Given a packet (p1,p2), we send p1 and p2, in parallel over separate buses, to T1 and T2, respectively. Then, the search result. 191’po, is used as a key to search on (C’. This second search result is the final decision for the given packet (p1,p2). The (d + 1)-stage chained pipelined-lookup architecture is similar to the pre- viously proposed pipelined-lookup architecture. Figure 8.4 illustrates the packet classification process using the chained pipelined-lookup architecture when d = 2. In comparison with the pipelined-lookup architecture, The main advantage of the multi-lookup architecture is that it can be easily deployed since it requires minimal modification of existing TCAM-based packet processing systems. Its main drawback is a. modest. slowdown in packet processing throughput because (1 +1 TCAM searches are required to process a d—dimensional packet. In contrast, the main advantage of the two pipelined-lookup architectures is high packet processing throughput. Their main drawback is that the hardware needs to be modified to accommodate (1+ 1 97 TCAM chips. Implementing our reencoding schemes on pipelined—lookup architectures actually improves packet processing throughput over conventional TCAM implementations. While the width of TCAM entries can be set to 36, 72, 144, or 288 bits, the typical TCAM bus width is 72 bits. Thus the conventional TCAM lookup approach, which uses a TCAM entry width of 144 bits, requires either four or five TCAM bus cycles to process a packet: four bus cycles if the decision is stored in TCAM, five bus cycles if the decision is stored in SRAM. Because all the tables produced by our reencoding schemes have width less than 36 bits, we can set TCAM entry width to be 36. Thus, using pipelined-lookup architectures, our reencoding approaches achieve a classification throughput of one packet per cycle; using multi-lookup architectures, our reencoding approaches achieve a classification throughput of one packet per twelve cycles. 8.1.2 Measuring TCAM space The TCAM space needed by our transformation approach is measured by the total TCAM space needed by the (1 +1 tables: C’, T1, - ' - , Td. We define the space used by a classifier or transformer in a TCAM as the number of entries (i.e., rules) multiplied by the width of the TCAM in bits: space 2 # of entries X TCAM width Although TCAMs can be configured with varying widths, they do not allow arbitrary widths. The width of a TCAM typically can be set at 36, 72, 144, and 288 bits (per entry). The primary goal of the transformation approach is to produce C’, T1, - - - , Td such that. the TCAM space needed by these (1 + 1 TCAM tables is much smaller than the TCAM space needed by the original classifier (C. Note that most previous reencoding approaches ignore the space required by the. transformers and only focus on the space required by the transformed classifier (C’. 98 8.1.3 TCAM Update Packet classification rules periodically need to be updated. The common practice for updating rules is to run two TCAMs in tandem where one TCAM is used while the other is updated [16]. All our approaches are compatible with this current practice. Because our algorithms are efficient and the resultant TCAM lookup tables are small, updating TCAM tables can be efficiently performed. If an application requires very frequent rule update (at a frequency less than a second, for example), we can handle such updates in a batch manner by chainng the TCAM chips in our proposed architecture after a TCAM chip of normal width (144 bits), which we call the “hot” TCAM chip. When a new rule comes, we add the rule to the top of the hot TCAM chip. When a packet comes, we first use the packet as the key to search in the hot chip. If the packet has a match in the hot chip, then the decision of the first matching rule is the decision of the packet. Otherwise, we feed the packet to the TCAM chips in our architecture described as above to find the decision for the packet. Although the lookup on the hot TCAM chip adds a constant delay to per packet latency, the throughput can be much improved by pipelining the hot chip with other TCAM chips. Using batch updating, only when the hot chip is about to fill up, we need to run our topological transformation algorithms to recompute the TCAM lookup tables. 8.2 Domain Compression In this section, we describe our new reencoding scheme called dommfin compression. The basic idea of domain con‘ipression is to simplify the logical structure of a clas- sifier by mapping the domain of each field D(F,-) to the smallest possible domain D’(F,-). We formalize this process by showing how a classifier (C defines an equiv- alence relation on the domain of each of its fields. This equivalence relation allows us to define equivalence classes within each field domain that domain compression will exploit. Domain compression has several benefits. First, with a compressed domain 99 D'(Fz-), we require fewer bits to encode each packet field. This allows us to set TCAM entries widths to be 36 bits rather than 144 bits, which saves both space in the TCAM and time as each entry fits on the 72 bit TCAM bus. Second, each transformed rule r’ in classifier (C, will contain fewer equivalence classes than the original rule 7‘ did in classifier (C. This leads to reduced range expansion and the complete elimination of some rules, which allows us to achieve expansion ratios less than one. Our domain compression algorithm consists of three steps: (1) computing equiv- alence classes, (2) constructing transformer Ti for each field Ft: and (3) constructing the transformed classifier (3'. 8.2.1 Step 1: Compute Equivalence Classes In domain compression, we compress every equivalence class in each domain D(F,-) to a single point in D’(F,-). The crucial tool of the domain compression algorithm is the Firewall Decision Diagram (FDD). The first step of our domain compression algorithm is to convert a. given (1- dimensional packet classifier (C to (1 equivalent reduced FDDs f1 through fd where the root of F DD fi is labeled by field F1. Figure 8.5(a) shows an example packet classifier over two fields F1 and F2 where the domain of each field is [0,63]. Figures 8.5(b) and (c) show the two FDDs f1 and f2, respectively. The FDDs f1 and f2 are almost reduced except that the terminal nodes are not merged together for illustration purposes. The crucial observation is that each edge of reduced FDD f2- corresponds to one equivalence class of domain D(.F,-). For example, consider the the classifier in Figure 8.5(a) and the corresponding FDD f1 in Figure 8.5(b). Obvi01.1sly, for any p1 and p1, in [7, 11] U [16, 19] U [39,40] U [43,60], we have C(p1,p2) = C(pl’, p2) for any [)2 in [0,63], so it. follows that (C{-p1} = C(pl’}. Theorem 8.2.1 (Equivalence Class Theorem). For any packet classifier C over clds F --F and an etuiualent reduced F DD rooted at an F: node v. the l, (I. 1 t a 100 labels ofv ’s outgoii‘zg edges are all the equivalence classes over field F,- as defined by (3. 8.2.2 Step 2: Construct Transformers Given a packet. classifier (C over fields F1, - - - , F d and the (1 equivalent reduced FDDs f1, - - - , fd where the root node of f,- is labeled P}, we compute transformer T,- as follows. Let i) be the root of ft: and suppose v has m outgoing edges 61,- - -,em. First, for each edge ej out of u, as all the ranges in ej’s label belong to the same equivalent class according to Theorem 8.2.1, we choose one of the ranges in cj’s label to be the representative, which we call the landmark. Any of the ranges in e is label can be chosen as the landmark. For each equivalence class, we choose the range that intersects the fewest number of rules in C as the landmark breaking ties arbitrarily. We then sort edges in the increasing order of their landmarks. We use L]- and ej to denote the landmark range and corresponding edge in sorted order where edge el has the smallest valued landmark Ll and edge em has the largest valued landn‘iark Lm. Our transformer T,- then maps all values in ej’s label to value j where 1 S j S m. For example, in Figures 85(1)) and (c), the greyed ranges are chosen as the landmarks of their corresponding equivalence classes, and Figures 8.6(a) and (b) show transformers T1 and T2 that result from choosing those landmarks. 8.2.3 Step 3: Construct Transformed Classifier we now construct transformed classifier (C, from classifier (C using transformers T7; for l g i g d as follows. Let F1 6 81 /\ - - - /\ Fd E Sd ——> (decision) be an original rule in (C. The domain compression algorithm converts F,- E S,- to Ft, 6 Si, such that for any landmark range LJ- (0 g j g m. — l), 1.]- H S,- # (ll if and only ifj 6 Si’. Stated another way, we re[_)lace range S, with range [(1. b] C_: D’(F,-) where a is the smallest number in [0, in — 1] such that. La (1 S, # (0 and b is the largest. number in [0,712, — 1] such that Lb fl 8,; 74 (0. Note, it is possible no landmark ranges intersect range Si; in this case. a. and b are undefined and 37;, = 0. For a. converted rule 101 r, 2 F1, E 81’ /\ - - - /\ Fd’ E 8d, ——> (decision) in C], if there exists 1 S i. S d such that S,’ = (0, then this converted rule r’ can be deleted from (C’. For example, consider the rule F1 E [7,60] /\ F2 E [10,58] —-> discard in the example classifier in Figure 8.5(a). For field F1, the five landmarks are the five greyed intervals in 8.5(b), namely [0,0], [1,6], [7,11], [12,15], and [63, 63]. Among these five landmarks, [7,60] overlaps with [7,11] and [12,15], which are mapped to 2 and 3 respectively by transformer T1. Thus, F1 E [7,60] is converted to F1, E [2, 3]. Similarly, for field F2, [10,58] overlaps with only one of Fg’s landmark, [10,19], which is mapped to 3 by F2’s mapping table. Thus, F2 E [10,58] is converted to F2, E [3, 3]. Figure 8.7 shows the resultant domain compressed classifier. Next, we prove that. (C, and T are semantically equivalent to (C. Theorem 8.2.2. Consider any classifier (C and the resulting transformers 'II‘ and transformed classifier (C’. For any packet p 2 (p1, . - - , pd). we have (C(Ill, ' ° ' 7pd) : (Cl/(T1 (p1) ' ° ' 7 Td(p(1)) Proof. For each field F, for 1 S i g (1, consider p’s field value 1),. Let. L(p,) be the landmark range for [3(1),]. We set. 1“, = min(L(p,)). We now consider the packet :1." = (.‘1‘1.---;r.d) and the packets :L'(j) = (.r1,...:rj__1,pJ-,...,pd) for O S j g d; that. is, in packet (EU), the first j fields are identical to packet :r and the last d —j fields are identical to packet p. Note 33(0) = p and I((l’) = .‘L‘. We now show that C(])) 2 (C(17). This follows from C(:r(0)) = C(;r(1)) = --- = C(I(d)). Each equality follows from the fact that. 17]- and p j belong to the same equivalence class within l)(FJ-). Let r be the first rule in (C that packet :17 matches. we argue that 7), will match the transformed rule r, E (C’. Consider the conjunction F, E S,- of rule r. Since :1: matches rule r, it must be the case. that 1:, E 3,. This implies that L(]),) D S, % (ll. Thus. by our construction p,’ = 'll‘,(p,) = 'll‘,(.r,) E S,’. Since this holds for all fields F,. packet. p, matches rule 7". “7e also argue that packet. [2’ will not match any rule before transformed rule 7" E C’. Suppose packet 19’ matches some rule. r1, E C’ that occurs before rule 7". This implies that, for each conjunction F, E S, of the 102 corresponding rule r1 E C that L(p,) F) S, 3é (0. However, this implies that 1:,- E 5', since if any point in L(p,) is in 8,, then all points in L(p,) are in 8,. It follows that 1' matches rule r1 E C, contradicting our assumption that. rule r was the first rule that :1: matches in C. Thus, it follows that p’ cannot match rule rl’. It then follows that r' will be the first rule in C that p’ matches and the theorem follows. C] 8.3 Prefix Alignment In this section, we describe a new topological transformation approach called prefix alignment. When applying this approach, we assume that we have a classifier C that needs to be converted into a prefix classifier via range expansion. we observe that range explosion happens when ranges do not align well with prefix boundaries. The basic idea of prefix alignment is to “shift”, “shrink”, or “stretch” ranges by transforming the domain of each field to a new “prefix-friendly" domain so that the majority of the reencoded ranges either are prefixes or can be expressed by a. small number of prefixes. This will reduce the costs of range expansion. Of course, we must guarantee that our prefix alignment transformation preserves the semantics of the original classifier. we first consider the special case where the classifier has only one field F. We develop an optimal solution for this problem using dynamic programming techniques. We then describe how we use this solution as a building block for performing prefix alignment on multi-dimensional classifiers. Finally, we discuss how to compose the two transformations of domain compression and prefix alignn‘ient. 8.3.1 Prefix Alignment Overview The one—dimensimial prefix alignment problem can be described as the following “cut” problem. Consider the three ranges [0,12]. [5,15]. and [0.15] over domain D(F1) = [0. 15] in classifier C in Figure 8.8(A), and suppose the transformed domain 0’ (F1) : [00,11] in binary format. Because D'(F1) has a total of 4 elements, we want to identify three cut points 0 E .171 < 1‘2 < 1:3 g 15 such that if [0,171] E 103 D(F1) transforms to 00 E D’(F1), [3:1 + 1,332] E D(F1) transforms to 01 E D’(F1), [332+ 1,1:3] E D(F1) transforms to 10 E D’(F1), and [13+ 1, 15] E D(F1) transforms to 01 E D’ (F 1 ), the range expansion of the transformed ranges will have as few rules as possible. For this simple example, there are two families of optimal solutions: those with 1:1 anywhere in [0, 3], 2:2 = 4, and $3 = 12, and those with 171 = 4, x2 = 12, and 1173 anywhere in [13,15]. For the first family of solutions, range [0, 12] is transformed to [00, 10] = 0* U10, range [5, 15] is transformed to [10. 11] = 1*, and range [0, 15] is transformed to [00, 11] = M. In the second family of solutions, range [0, 12] is transformed to [00, 01] = 0*, range [5, 15] is transformed to [01, 11] = 01U1*, and range [0, 15] is transformed to [00,11] = **. The classifier C, in Figure 8.8(A) shows the three transformed ranges using the first family of solutions. Thus, in both examples, the range expansion of the transformed ranges only has 4 prefix rules while the range expansion of the original ranges has 7 prefix rules. We now illustrate how to compute an optimal solution using a divide and conquer strategy. The first observation is that we can divide the original problem into two subproblems by choosing the middle cut point. The second observation is that a cut point should be the starting or ending point of a range if possible in order to reduce range expansion. Suppose the target domain D’ (F1) is [0.2b — 1]. we first need to choose the middle cut point x2b_1, which will divide the problem into two subproblems with target domains [0, 2b—1 — 1] = 0{=i<}b""1 and [21)“1. 2b — 1] = 1{*}b-1 respectively. Consider the example in Figure 8.8(A), the 172 cut point partitions [0,15] into [0, 51:2], which transforms to prefix 0*, and [1'2 + 1,15], which transforms to prefix 1*. The first observation implies either 332 = 4 or .r2 = 12. Suppose we choose r2 2 4; that is, we choose the dashed line as shown in Figure 8.8(A). This then divides the original problem into two subproblems where we need to identify the 171 cut point in the range [0.4] and the 3:3 cut point. in [5, 15]. Furthermore. in the two subproblems, we include each range trimmed to fit the restricted domain. For example, in the first subproblem, ranges [0, 12] and [0. 15] are trimmed to [0,4], and in the second subproblem. ranges [5,15] and [0.15] are trimmed to [5, 15] and range [0, 12] is trimmed to [5. 12]. It is important. to maintain 104 each trimmed range, even though there may be multiple copies of the same trimmed range. We then see in the first subproblem that the choice of 11:1 is immaterial since both trimmed ranges span the entire restricted domain. In the second subproblem, the range [5, 12] dictates that 1173 = 12 is the right choice. This divide and conquer process of computing cut points can be represented as a binary cut tree. For example, Figure 8.8(B) depicts the tree where we select $2 = 4 and 11:3 = 12. This binary cut. tree also encodes the transformation from the original domain to the target domain: all the values in a terminal node will be mapped to the prefix represented by the path from the root to the terminal node. For example, as the path from the root to the terminal node of [0,4] is 0, all values in [0,4] E D(F1) are transformed to 0*. Note that in the domain compression technique, we considered transformers that mapped points in D(F,) to points in D’(F,). In prefix alignment, we consider transformers that map points in [)(F,) to prefix ranges in D'(F,). If this seems confusing, we can also work with transformers that map points in D(F,) to points in D’(F,) with no change in results; however, transformers that map to prefixes more accurately represent the idea of prefix alignment than transformers that map to points. Note also that since we will perform range expansion on C’ with no redundancy removal, we can ignore rule order. We can then view a one-dimensional classifier C as a multiset of ranges S in I)(F1). 8.3.2 One-dimensional Prefix Alignment We next present the technical details of our dynamic programming solution to the prefix aligmnent problem by ansmrring a series of four questions. Correctness of Prefix Alignment The first question is: why is the prefix alignment transformation process correct? In other words, how does the prefix alignment transformation preserve the semantics of the original classifier? We first define the concept of prefix transformers and then show that. if prefix transformers are used. the prefix alignment transformation 105 process is correct. Given a prefix P, we use min P and max P to denote the smallest and the largest values in P, respectively. Definition 8.3.1 (Prefix transformers). A transformer T, is an order—presenting prefix transformer from D(F,) to D'(F,) for a packet classifier C if T, satisfies the following three properties. (I) (prefix property) Vx E D(F,), 'Il‘,(x) = P where P is a prefix in domain D’(F,),' (2) (order-preserving property) Vx, y E D(F,), x < y implies either 'll‘,(x) = 'll‘,(y) or max T,(x) < min ’ll‘,(y); {3) (consistency property) Vx,y E D(F,), T,(x) = T,(y) implies C{x} = C{y}. The following Lemma 8.3.1 and Theorem 8.3.1 easily follow the definition of prefix transformers. Lemma 8.3.1. Given any prefix transformer 11', for a field F,, for any a,b,x E D(F,), x E [a, b] if and only if 'll‘,(.r) E [1'1‘1in'll‘,(a), max T,(b)]. Theorem 8.3.1 (Topological Alignment Theorem). Given a packet classifier C over fields F1, - - - , F,,, and d prefix transformers T = (T, | 1 g i S d}, and the classifier C, constructed by replacing any range [a, b] over field F, (1 S i S d) by the range [min T,(a), max 'll‘,(b)], the condition C(p1.- - -.pd) = C’(T1(p1), ---, 'll‘d(pd)) holds. Find Candidate Cut Points The second question is: what cut points need to be considered? To answer this question, we first introduce the concept of atomic ranges. For any multiset of ranges S (a multiset. may have duplicate entries) and any range :17 over domain D(F 1), we use S‘in'x to denote the set of ranges in S that contain x. Definition 8.3.2 (Atomic Range. Set). Given a multiset S of ranges, the union of which constitute a range denoted U S. and a set of ranges S’, S, is the atomic range set of S if and only if the following four conditions hold: (1) (coverage property) U S = US’; (2) (disjoint property) VJ), y E S', x {’1 y = (ll; (3) (atomicity property) Vx E S and Vy E S', :1: H y 74 0 implies y g x. (.4) (maximality property) V17. 1) E S’ and max x + 1 = min y implies Std-J: 7£ S‘Qy. 106 For any multiset of ranges S, there is one and only one atomic range set of S, which we denote as AR(S). Because of the maximality property of atomic range set, the candidate out points correspond to the end points of ranges in AR(S). We now show how to compute S-start points and S-end points. For any range [x,y] E S, define the points x — 1 and y to be S-end points, and define the points x and y + 1 to be S-start points. Note that we ignore x - 1 if x is the minimum element of U S; likewise, we ignore y + 1 if y is the maximum element of U S. Let (31,- - - , sm) and (e1, - - - , em) be the ordered list of S-start points and S-end points. It follows that for 1 g i g m — 1 that s, g e, = s,+1 +1. Thus, AR(S) = {[sl,e1],- - - , [sm,em]}. For example, if we consider the three ranges in classifier C in example Figure 8.8(A), range [0, 12] creates S-start point 13 and S-end point 12, range [5, 15] creates S-end point 4 and S—start point 5, and range [0, 15] creates no S-start points or S- end points. Finally, 0 is an S—start point. and 15 is an S-end point. This leads to AH(S) = {[0,4].[5.12],[13,15]}. Choose Target Domain Size The third question is: how many bits should be used to encode domain 0' (F 1 )? The number of bits b used to encode the domain D'(F1) may impose some constraints on possible prefix transformers. Consider the example from C in Figure 8.8(A) with ranges [0,12], [5, 15]. and [0, 15]. Suppose there were a fourth range [5,7]. For this multiset of ranges S, we then have AR(S) = {[0. 4], [5, 7], [8, 12], [13, 15]}. If we allow only 2 bits to encode D’ (F 1), then there is only one possible prefix transformer. We must have [0,4] map to 00, [5,7] map to ()1, [8,12] map to 10, and [13,15] map to 11. On the other hand, if we allow 3 bits. we can also allow additional prefix transformers such as [0, 4] map to 000. [5. 7] map to 001, [8,12] map to 01*, and [13,15] map to 1**. In this case, the first prefix transformer is superior to this second prefix transformer. However. if the original ranges had been [0, 4], [0,7], [0. 12], and [0, 15], the second prefix transformer would have been superior. and this prefix transformer is only possible if we encode D, (F 1) with at least 3 bits. We will include the number of bits b used to encode D’( F1) as an input parameter 107 to our prefix alignment problem. we determine the best b through an iterative process of repeatedly incrementing b and computing an optimal solution for that b. we start by choosing = [log |AR(S)|], which is the smallest possible number of bits for any legal prefix transformer. Once we have a solution, we increment b and repeat the process until we cannot reduce the range expansion any further. we choose a linear search as opposed to a binary search for efficiency reasons. As we shall see in a moment, any solution using b bits will require a sub-solution using b— 1 bits. Thus, when we fail to find a solution using b bits and try to find a solution using 2b bits, we will require a sub-solution for each number from b + 1 to 2b - 1 (otherwise we would have found a solution using b bits). Furthermore, the binary search may miss the best b by a large factor, which will lead to a large amount of unnecessary computation. Choose Optimal Cut Points The fourth question is: How do we choose the optimal cut points? As we noted before. we can view a one-dimensional classifier C as a multiset of ranges S in D(F 1). We then formulate the one-dimensional prefix alignment problem as follows: Given a multiset of ranges S over field F1 and a number of bits b. find prefix transformer T1 such that the range expansion of the transformed multiset of ranges S' has the minimum number of prefix rules and D’ (F 1) can be encoded using only b bits. We now present an optimal solution for this problem using dynamic program- ming. Given a multiset of ranges S, we first compute its atomic range set AR(S). Suppose there are m atomic ranges R1, - - - , Hm with S-start points 81 through sm and S-end points e1 through cm sorted in increasing order. For any S-start point. sf and S—end point sy where 1 g .r g y g m, we define S rm [.1:, y] to be the multiset. of ranges from S that intersect. range [817,531]; furthernmre. we. assume that each range in S (n) [.r. y] is trimmed so that. its start point. is at least s1,- and its end point is at most sy. We then define a collection of subproblems as follows. For every 1 g .‘I? g y g m, we define a prefix alignment problem P.4(r, y. b) where the problem is to find a prefix transformer T1 for [54.5, cu] Q I)(F1) such that the range expansion 108 of (S rm [x, y])’ has the smallest possible number of prefix rules and the transformed domain D' (F 1) can be encoded in b bits. We use cost(x, y. b) to denote the number of prefix rules in the range expansion of the optimal (S n [x, y])’. The original prefix alignment problem then corresponds to PA(1, m, b) where b can be arbitrarily large. The key observation that allows the use of dynamic programming is that the prefix alignment problem obeys the optimal substructure property. For example, consider PA(1, m, b). As we employ the divide and conquer strategy to locate a middle cut point that will establish what the prefixes O{*}b_1 and 1{*}b—1 cor- respond to, there are m — 1 choices of cut points to consider: namely e1 through em_1. Suppose the optimal cut point is ek where 1 S k S m — 1. Then the op- timal solution to PA( 1, m, b) will build upon the optimal solutions to subproblems PA(1,k,b — 1) and PA(k + 1,m,b — 1). That is, the optimal prefix transformer for PA(1, m. b) will simply append a 0 to the start of all prefixes in the optimal prefix transformer for PA(1, k, b — 1), and similarly it will append a 1 to the start of all prefixes in the optimal prefix transformer for PA“: + 1, 777., b — 1). Moreover, cos/(1,771, b) = cost(1,k,b — 1) + cost(k + 1,7‘n,b — 1) — [.S'<;Ci'[1, 771]]. We subtract [S'iQ[1, m][ in the above cost equation because ranges that include all of [81, em] are counted twice, once in cost(1,k,b — 1) and once in cost(k + 1, m, b — 1). However, as [s], ck] transforms to 0{*}b-1 and l3k+17 em] transforms to 1{*}b—1, the range [31,em] can be expressed by one prefix {*}b = 0{*}b_1 U 1{*}b"1. Based on this analysis, Theorem 8.3.1 shows how to compute the optimal cuts and binary tree. As stated earlier, the optimal prefix transformer T1 can then be computed from the binary cut tree. Theorem 8.3.1. Given a multiset of ranges S with [AR(S) = 777.. cost(l,r, b) for any b 2 0. 1 S l S r S m can be computed as follows. For any 1 S l < r S m, and 1S 1’: Sm. and b 2 0: cost(l, 7', 0) = 00, cost(lsf, 11‘, b) = $th kll’ 109 and for any 1 S l < 7‘ S 777. and b 2 1 ( cost(l,]c,b—1) \ + cost l,r,b = min nth 1, ,b—1 ( ) kE{l,...,r—1} COS< + T ) t new“ ) C] Note that we set cost(k, k, 0) to |S@[k, It“ for the convenience of the recursive case. The interpretation is that with a 0-bit domain, we can allow only a single value in D’ (F 1); this single value is sufficient to encode the transformation of an atomic interval. 8.3.3 Multi-Dimensional Prefix Alignment We now consider the multi-dimensional prefix alignment problem. Unfortunately, while we can optimally solve the one—dimensional problem, there are complex inter- actions between the dimensions that make solving the multi-dimensional problem optimally extremely difficult. In particular, the total range expansion required for each rule is the product of the range expansion required for each field. Thus, there may be complex tradeoffs where we sacrifice one field of a rule but align another field so that the costs do not multiply. However, we have not found a polynomial algo- rithm for optimally choosing which rules to align well in which fields. It is an open problem to prove whether the optimal multi-dimensional prefix alignment problem is NP-hard. In this chapter, we present. a hill—climbing solution where we itm'atively apply our one—dimensional prefix alignment algorithm one field at a time to improve our solution. The basic idea is to perform prefix alignment one field at a time; however, because the range expansion of one field affects the numbers of ranges that appear in the other fields. we run prefix alignment for each field more than once. Running prefix alignment more than once allows each field to use more and more accurate information about the number of times each range appears in a field. 110 For a classifier C over fields F1, . . . , Fd, we first create (1 identity prefix trans— formers TQ, . . . .Tg. We define a multi-field prefix alignment iteration k as follows. For i from 1 to d, generate the optimal prefix transformer Tf assuming the prefix ., . , k—l k—l k—l k—l - transformers for the other fields are {T1 , ..., Ti—l , Ti+1 , ..., Td }. Our 1t- erative solution starts at k = 1 and preforms successive multi-field prefix alignment iterations until no improvement. is found for any field. 8.3.4 Composing with Domain Compression Although the two transformation approaches proposed in this chapter can be used individually to save TCAM space, we advocate combining them together to achieve higher TCAM reduction. Given a classifier C over fields F1. . . . , Fd, we first per- form domain compression resulting in a transformed classifier C, and d transformers Talc, . . . , Tgc; then, we perform prefix alignment on the classifier C, resulting in a transformed classifier C” and (1 transformers Tllm, . . . ,TZG. To combine the two transformat.ion processes into one, we merge each pair of transformers TE!“ and T? a into one transformer T, for 1 S i S (1. One nice property of their composition is that since the transformer for domain compression is a function from D(F,) to a point in D’(F,) and each point in D’(F,) will belong to its own equivalence class in D’(F,) for 1 S i S d, each point. x E D'(F,) defines an atomic range [x, x]. A good property of the two proposed topological transformation approaches is that they are composable with many other reencoding or TCAM optimization tech- niques. For example, we can apply previous TCAM minimization schen‘ies (such as [5, 18, 20]) to a transformed classifier to further reduce TCAM space. Further- more. as new classifier minimization algorithms are developed, our transformations can potentially leverage these future results. Finally, we can apply the optimal algorithm in [27] to compute the minimum possible transformers T, for 1 S i S d. 111 8.4 Experimental Results We evaluate the effectiveness and efficiency of our topological t.ransformation ap— proaches on both real-world and synthetic classifiers. Although our two approaches can be used independently, they are much more effective when used together. Thus, we only report results for both techniques combined, and we finish by running the redundancy removal algorithm in [18] on the transformed classifier C”. 8.4.1 Evaluation Metrics Given a TCAM optimization algorithm A and a classifier C, let A(C) denote the re- sulting classifier, W(A(C)) denote the number of bits to represent each rule in A(C), TW(A(C)) denote the minimum TCAM entry width for storing A(C) given choices 36, 72, 144, or 288. [A(C)| denote the number of rules in A(C), and B(A(C)) = TW(A(C)) X|A(C)[, which represents the total number of TCAM bits required to store A(C). The main goal of TCAM optimization algorithms is to minimize B (A(C)). we use Direct to denote direct range expansion algorithm, so B (Direct(C)) represents the baseline we compare against, W(Direct(C)) = 104, TW(Direct(C)) = 144. and B(Direct(C)) = 144 x |Direct(C)|. Below is the summary of our notations: For any A and C, we measure overall effectiveness by the compression ratio _ BfA ((3)) GIN/KC” _ B(Direct(C) our approaches at compressing classifiers, we define the Rule Number Ratio of A on C to be RNR(A(C)) = '18:“, which is often referred to as expansion ratio, and the ). To isolate the factors that contribute to the success of Rule Width Ratio of A on C to be RWR(A(C)) = W(A(C) . When we consider a S set of classifiers S where denotes the number of classifiers in S , we generalize our metrics as follows. Average compression ratio of A for S is = ZCeS Cliff/1(9) C'R(A(S)) 8" average rule number ratio of A for S is .. RNR A C RNR(A(S)) = DUES s ( ( )), 112 and average rule width ratio of A for S is 2 Does RWRMUCD s ' awry/4(5)) We split RL into two groups, RLa and RLb where RN R(Direct(C)) S 2 for all C E RLa and RN R(Direct(C)) > 40 for all C E RLb. We have no classifiers where 2 S RNR(Direct(C)) S 40. It turns out [RLa] = 12 and [RLb] = 13. By separating these classifiers into two groups, we can determine how well our techniques work on classifiers that do suffer significantly from range expansion as well as those that do not. 8.4.2 Effectiveness Results on real-world and synthetic classifiers Table 8.1 shows the average compression ratio, rule size ratio, and rule number ratio for our algorithm on all eight data sets. Figures 8.10 through 8.15 show the specific compression ratios, rule width ratios, and rule number ratios for all of our real-world classifiers; the black bars represent the increases in each quantity that arise from assigning each rule a. unique decision. In each figure, we sort the classifiers by the number of rules in the original classifier. We present compression ratio and rule number ratio data with and without transformers. The data without transform- ers facilitate comparison with most previous reencoding schemes. The data with transformers depicts the true space savings of our methods. Our algorithm achieves significant compression on both real-world and synthetic (_rlassifiers. On RL, our algorithm achieves an average compression ratio of 10.3% if we count TCAM space for transformers and 2.6% if we do not. These savings are attributable to both rule width and rule number compression. The average rule width compression ratio is 10.6%. which means that a typical encoded classifier only requires 11 bits, instead of 104 bits, to store a rule. However, the actual savings that rule width compression contributes to average comjjn'ession ratio is only 25% because the encoded classifiers will always use 36 bit wide. TCAM entries of 36, which is the 113 compression rule width rule number we. T with T we. T with T RL 2.6% 10.3% 10.6% 27.6% 123.3% RLU 7.0% 16.2% 14.5% 63.2% 176.8% RLa 5.3% 20.8% 14.2% 22.7% 87.4% RLaU 14.4% 33.1% 18.5% 62.5% 140.3% RLb 0.1% 0.5% 7.2% 32.2% 156.4% RLbU 0.2% 0.6% 10.8% 63.8% 210.6% SYN 0.6% 2.5% 10.4% 2.7% 11.8% SYNU 9.3% 12.4% 16.0% 43.9% 58.9% Table 8.1: Average compression ratio, rule width ratio, and rule number ratio for 9 data. sets (with transformers included and excluded) smallest possible TCAM width. In comparison, direct range expansion would use 144 bit wide TCAM entries. That is, TW(A(C)) = 32 for all the classifiers in RL (actually in all data sets including RLU, SYN, and SYNU). The remaining savings is due to rule number compression. Note that the average rule number compression ratio without transformers is 27.6%; that is, domain compression and redudancy removal eliminate an average of 72% of the rules from our real-life classifier sets. In comparison, the goal of all other reencoding schemes is an average rule number compression ratio without transformers of 100%. On other data sets, our algorithm also performs well. For example, for Taylor’s rule set TRS, we achieve an average compression ratio of 2.7% with transformers included and 1.0% with transformers excluded. Sensitivity to classifier efficiency Our algorithm is effective for both efficiently specified classifiers and inefficiently specified classifiers. The efficiently specified classifiers in RLa experience relatively little range expansion; the inefficiently specified classifiers in RLb experience signif— icant range expansion. Not surprisingly, our algorithm provides roughly 40 times 114 better compression for RLb than for RLa with average compression ratios of 0.5% and 20.8%, respectively. In both sets, TCAM width compression contributes 25% savings. The difference is rule number compression. On efficient classifiers, our al- gorithm provides modest rule number compression (even though the average rule number ratio without transformers for RLa is 22.7%). On inefficient classifiers, our algorithm provides tremendous rule number compression. Sensitivity to number of unique decisions Our algorithm '3 effectiveness is only slightly diminished as we increase the number of unique decisions in a classifier. In the extreme case where we assign each rule a. unique decision in RLU, our algorithm achieves an average compression ratio of 16.2% with transformers included and 7.0% with transformers excluded; and on SYNU, our algorithm achieves an average compression ratio of 12.4% with trans- formers included and 9.3% with transformers excluded. In particular, the TCAM width is unaffected as our algorithm still uses 36 bit wide TCAM entries. Comparison with state—of—the—art results Oar algorithm outpevforvns all existing reencoding schemes by at least a factor of 3.16 including transformers and by at least a factor of 7.24 ea'elading transformers. We first. consider the width of TCAM entries. We have 36 bit. wide TCAM entry width while the smallest TCAM width achieved by prior work is 72 [21]. There- fore, on TCAM entry width, our algorithm is 2 times better than the best known result. Next, we consider the number of T CAM entries. Excluding TCAM entries for transformers, the best rule number ratio that any other method can achieve on It]. is 100% whereas we achieve 27.6%. Therefore. e:1.‘(_'l'u(ling TCAM entries for transform("I's. our algorithm is at least 7.24 (= 2 X 100%./27.6%) times better than the optimal TCA M reencoding algorithm that does not consider classifier semantics. In comparison with PIC [21], the best. known TCAM-based reencoding alg(_)ritlin'1, the transformers in PIC use at. least the same number of TCAM entries as our al- gorithm because our domain compression technique may map multiple intervals to 115 one decision whereas PIC maps each interval to a unique decision. Thus, including TCAM entries for transformers, the best average rule number ratio that PIC can achieve on RL is 195.7%(= 123.3% — 27.6% + 100%). Therefore, including TCAM entries for transformers. our algorithm is at least 3.16 (= 2 x 195.7%/ 123.3%) times better than PIC. 116 F1 F2 Decision [12, 15] [7,60] Discard [41 , 42] [7, 60] Discard [20, 38] [0, 63] Accept [0, 63] [20, 38] Accept [7,60] [10, 58] Discard [1, 63] [0, 62] Accept [0, 62] [1, 63] Accept [0, 63] [0, 63] Discard [0,0] [1,6] [7,11] [63,63] [20,38 [16,19] [6l,62 [39,40] [43,60] [0.01 “.61 [7.91 [10,19] [61,62 [00 1,63] [0,63] E! E] El Figure 8.5: Step 1 of domain compression 117 F1 Decision l0, 0] [1,6] 0 [20,38] 0 [61,62] [7, 11] 0 [16,19] 0 [39,40] 0 [43,60] [12,15] 0 [41,42] [63,63] rbCJGNJV-H F2 Decision [0, 0] [1,6] 0 [61,62] [7, 9] 0 [20,38] 0 [59,60] [10, 1910 [39,58] [63,63] .5me Figure 8.6: Step 2 of domain compression 118 F1 F 2 Decision [3, 3] [2, 3] Discard (l) [2, 3]] Discard (l) [0, 4] Accept [0, 4] (ll Accept [2, 3] [3, 3] Discard [1, 4] [0, 3] Accept. [0, 3] [1, 4] Accept [0, 4] [0, 4] Discard U F1 F 2 Decision [3, 3] [2, 3] Discard [2,3] [3, 3] Discard [1, 4] [0, 3] Accept, [0, 3] [1, 4] Accept [0, 4] [0, 4] Discard Figure 8.7: Step 3 of domain compression 119 ‘ u}... ‘I'wmwfl-lflm‘ I m.“:. It?" fauu—u-emuu-r -. .. ~Culll'1‘. .--".. ’5“- - " ‘1 55"": ' glass-54 1-. 5.771 31.3.4.1? fa? '1' ' :93 1 - 3}." “ME 1 -' .4 0 l 4 6 7 8 9 10 ll 12 13 map .t'é 0' map to‘W. map to'." as" s 14 15 (A) Figure 8.8: Example of 1-D prefix alignment. A TCAM opt. scheme Direct direct range expansion (C packet classifier A(C) resulting classifier l‘l/"(.4(