f

a?
m“. A

' .z.

ﬁ\’v
‘Dl‘. .7.
*1: ... ~
.3.
.3

 

 

19..

all
I .I:

 

I—I Dru-1n I

 

“3:“ Michigan State
200(1 _ Unwersuty

This is to certify that the
dissertation entitled

ALGORITHMIC APPROACHES FOR FAST AND EFFICIENT
PACKET CLASSIFICATION

presented by

Chad R. Meiners

has been accepted towards fulfillment
of the requirements for the

PhD. degree in Computer Science

 

 

Mp

 

Major Professor’s Signature

M3}, l1: LOO ‘1

Date

MSU is an Affirmative Action/Equal Opportunity Employer

. - -._-—.~.—-—.-.--.—.—.—.’-.-<A

—.—--o---o— - 4 4 4

 

PLACE IN RETURN BOX to remove this checkout from your record.
a TO AVOID FINES return on or before date due.
MAY BE RECALLED with earlier due date if requested.

 

DATE DUE DATE DUE DATE DUE

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

5/08 K IProg/Acc8Pres/ClRC/DateDue undd

ALGORITHMIC APPROACHES FOR FAST AND EFFICIENT PACKET
CLASSIFICATION

By
Chad R. Meiners

A DISSERTATION

Submitted to
helichigan State University
in partial fulﬁllment of the requirements
for the degree of

DOCTOR OF PHILOSOPHY
Computer Science

2009

ABSTRACT

ALGORITHMIC APPROACHES FOR FAST AND EFFICIENT PACKET
CLASSIFICATION

By
Chad R. Meiners

Packet classification is the core mechanism that enables many networking services
such as packet ﬁltering and trafﬁc accounting. Using Ternary Content Addressable
Memories (TCAMs) to perform high-speed packet. classification has become the de
facto standard in industry because TCAMs can facilitate constant. time classiﬁca-
tion by comparing a packet with all rules of ternary encoding in parallel. Despite
their high speed, TCAMs have limitations of small capacity, large power consump—
tion, and relatively slow access times. The well—known range expansion problem in
converting range rules to ternary rules significantly exacerbates these TCAM lim-
itations. W’hile we can expect some gain in TCAM performance from improved
hardware, the demands on TCAM performance as measured by the number of rules
in packet classifiers increase far more rapidly due to the explosive growth of Internet.
services and threats.

Space reduction is key to addressing these three issues facing TCAMs because
power consumption and access time are determined by the capacity of the TCAM.
This dissertation describes four methods in which to reduce the space that classifiers
occupy within TCAMs: TCAM Razor, All-Match Redundancy Removal. Sequential
Deconmosition, and Topological Transformations. These methods demonstrate that.

in most cases a. substantial reduction of space is achieved.

TABLE OF CONTENTS

List of Tables ..................................
List of Figures ..................................

Introduction
1.1 Motivation .................................
1.2 Contribution ................................

Background

2.1 Firewall decision diagrams ........................
2.2 One-Dimensional Classiﬁer l\/Iinimization ................
2.3 Experimental Data ............................

Related Work
3.1 TCAM Based Classifiers .........................

3.1.1 Classiﬁer Minimization: .....................
3.1.2 Range Encoding: .........................
3.1.3 Circuit Modification: .......................

3.2 Software Based Techniques ........................
3.2.1 Parallel decomposition ......................
3.2.2 Decision Trees ...........................

Equivalent Transformation Techniques

TCAM Razor

4.1 Multi-dimensional TCAM lV‘Iinimization: The Basics ..........
4.1.1 Conversion to Firewall Decision Diagrams ...........
4.1.2 Multi-dimensional TCAM lVIinimization ............
4.1.3 Removing Redundant Rules ...................
4.1.4 The Algorithm ..........................
4.1.5 TCAM Update ..........................

All-Match Redundancy Removal
5.1 All—Match Based Redundancy Theorem .................

5.1.1
5.1.2

at
[0

5.2.1
5.2.2
5.2.3

All-Match F DDs .........................
The All-Match Based Redundancy Theorem ..........

All—Match Based Redundancy Removal .................

The All-Match F DD Construction Algorithm ..........
The All-Match Based Redundancy Removal Algorithm
Proof of Complete Redundancy Removal ............

Experimental Results
6.1 Methodology ...............................
6.2 Comparison with other techniques ....................

iii

15
15
15
17
2O
21
21
23

27

29
30
30
31
33
33
34

35
36
36
38
39
39
40
44

46
46
49

iv

II New Architectural Approaches 50
7 Sequential Decomposition 52
7.1 Multi-Lookup Approach ......................... 55
7.1.1 Constructing Multi-lookup Table ................ 55

7.1.2 Packet Processing ......................... 59
7.1.3 Analysis .............................. 59

7.2 Pipelined—lookup Approach ........................ 60
7.2.1 Pipelined—Table Construction .................. 61
7.2.2 Packet Processing ......................... 62
7.2.3 Analysis .............................. 63

7.3 Packing Approach ............................. 65
7.3.1 Strict Partitioning ........................ 66
7.3.2 Shadow Packing .......................... 69
7.3.3 Strict Partitioning vs. Shadow Packing ............. 75

7.4 Table Consolidation ............................ 75
7.4.1 Table Consolidation Algorithm ................. 76
7.4.2 Hierarchical Table Consolidation ................ 77
7.4.3 TCAM/SRAM Space Tradeoff via Bounded Consolidation 79

7.5 One-Dimensional Table Consolidation .................. 79
7.5.1 Table Consolidation with Sequential Decomposition ...... 80
7.5.2 Coping with More Fields than TCAM chips .......... 80

7.6 Implementation Issues .......................... 81
7.6.1 Fast FDD Reduction ....................... 81
7.6.2 TCAM Update .......................... 82
7.6.3 Non—ordered FDDs ........................ 82
7.6.4 Lookup Short Circuiting ..................... 83
7.6.5 Best Variable Ordering ...................... 83

7.7 Experimental Results ........................... 83
7.7.1 Methodology ........................... 84
7.7.2 Effectiveness ............................ 87

8 Topological Transformations 92
8.1 Topological Transformation ....................... 95
8.1.1 Architectures ........................... 96
8.1.2 Measuring TCAM space ..................... 98
8.1.3 TCAM Update .......................... 99

8.2 Domain Compression ........................... 99
8.2.1 Step 1: Compute Equivalence Classes .............. 100
8.2.2 Step 2: Construct Transformers ................. 101
8.2.3 Step 3: Construct. Transformed Classifier ............ 101

8.3 Preﬁx Alignment ............................. 103
8.3.1 Preﬁx Alignment Overview .................... 103
8.3.2 One—dimensional Prefix Alignment ............... 105
8.3.3 Multi-Dimensional Prefix Alignment .............. 110
8.3.4 Composing with Domain Con‘ipression ............. 111

8.4 Experimental Results ........................... 112

8.4.1 Evaluation l\-‘Ietrics ........................ 112
8.4.2 Effectiveness ............................ 113
BIBLIOGRAPHY 122

1.1
1.2
4.1
4.2
4.3
4.4
6.1
7.1
7.2
7.3
8.1

LIST OF TABLES

An example packet classifier ....................... 1
TCAM Razor output for the example packet classifier in Table 1.1 . . 2
A minimum packet classiﬁer corresponding to 212 in Fig. 4.1 ..... 31
A minimum packet classifier corresponding to 113 in Fig. 4.1 ..... 31
A minimum packet classifier corresponding to '01 in Fig. 4.1 ..... 32
Packet classiﬁer generated from the F DD in Figure 4.1 ........ 33
Statistics for experimental classifiers .................. 48
Compression ratios ............................ 87
SRAM compression ratios ........................ 88
SRAM/TCAM ratios ........................... 88
Compression ratios of Topological Transformations .......... 114

vi

LIST OF FIGURES

Illustration of FDD construction ..................... 8
An example one-dimensional TCAM minimization problem ...... 13
Illustration of dynamic program ..................... 13
A firewall decision diagram ........................ 30
“Virtual” one-dimensional packet classifier ............... 32
A simple packet classifier ......................... 37
An all-match FDD for the packet classifier in Fig 5.1 ......... 37
Constructing an all-match FDD ..................... 40

The containment list and the residency List for the all-match F DD . 42

The resulting packet classifier after removing redundant rules ..... 44
Compression and Expansion ratios of RL ................ 48
Reducing information redundancy .................... 53
FDD generation and reduction ...................... 56
The Multi—lookup scheme ......................... 57

Table generation and table mergence in the pipelined-lookup approach 62

Example of a pipelined-lookup ..................... 63
Strict partitioning of a TCAM chip ................... 67
Decisions in SRAM ............................ 67
Reassigning the table IDs ........................ 68
Shadow packed tables 8; shz-ulow packing tree ............. 71
The shadow packing process ........................ 74
Consolidation of 2-dimeusional tables .................. 77
Table consolidation for pipelined-looktq> sequential decomposition . . 78
Compression ratio for each classifier in It]. ................ 84
Compression ratio for each classifier in [if/4U ............... 85
Example of topological transformations ................. 94
Multi-lookup architecture ........................ 96
Parallel pipelined—lookup architecture .................. 97

8.4
8.5
8.6
8.7
8.8
8.9
8.10
8.11
8.12
8.13
8.14
8.15

Chained pipelined—lookup architecture ................. 97

Step 1 of domain compression ...................... 117
Step 2 of domain compression ...................... 118
Step 3 of domain compression ...................... 119
Example of 1-D prefix alignment ..................... 120
Summary of notation ........................... 120
Compression ratio of RLa and RLaU .................. 121
Compression ratio of RLb and RLbU .................. 121
Rule size ratio of RLa and RLaU .................... 121
Rule size ratio of RLb and RLbU .................... 121
Rule number ratio of RLa and RLaU .................. 121
Rule number ratio of RL-b and RLbU .................. 121

viii

Chapter 1

Introduction

Packet classiﬁcation, which is widely used on the Internet, is the core n’iechanism that
enables routers to perform many networking services such as firewall packet ﬁltering,
virtual private networks (VPNs). network address translation (N AT). quality of ser-
vice (QoS). load balancing. trafﬁc: accounting and monitoring. differentiated services
(Diffserv), etc. As more services are (.leployed on the Internet. packet. classiﬁcation
grows in demand and importance.

The function of a packet classification system is to map each packet to a decision
(i.e., action) according to a sequence (z'.e., ordered list) of rules, which is called a
packet classifier. Each rule in a. packet classifier has a predicate over some packet.
header ﬁelds and a decision to be performed upon the packets that. match the pred—
icate. To resolve possible conflicts among rules in a classiﬁer. the decision for each
packet is the decision of the ﬁrst (i.e., highest priority) rule that. the packet matches.
Table 1.1 shows an example packet classiﬁer of two rules. The format of these rules

is based upon the format. used in Access Control Lists on Cisco routers.

 

Rule Source IP Destination IP Source Port Destination Port Protocol Action

 

7‘1 1.2.3.0/24 192108.01 [1.65534] [1.05534] TCP accept

* * * * * discard

 

 

 

7.2

 

 

Table 1.1: An example packet classiﬁer

 

 

Rule Source IP Destination IP Source Port Destination Port Protocol Action
7‘1 1.2.3.0/24 192168.01 0 * * discard
r2 1.2.3.0/24 192168.01 65535 * * discard
r3 1.2.3.0/24 192168.01 * 0 * discard
7‘4 1.2.3.0/24 192168.01 * 65535 * discard
r5 1.2.3.0/24 192168.01 [0,65535] [0.65535] TCP accept
7'6 * * * * * discard

 

 

 

 

 

Table 1.2: TCAM Razor output for the example packet classiﬁer in Table 1.1

1 .1 Motivation

To classify the never—ending supply of packets at wire speed, Ternary Content Ad—
dressable Memories (TCAMs) have become the de facto standard for high-speed
routers on the Internet. [15]. A TCAM is a memory chip where each entry can store
a packet classiﬁcation rule that is encoded in ternary format. Given a packet, the
TCAM hardware can compare the packet with all stored rules in parallel and then
return the decision of the ﬁrst rule that the packet matches. Thus, it takes 0(1)
time to ﬁnd the decision for any given packet. In 2003, most packet classification de-
vices shipped were TCAlVI—based. More than 6 million TCAM devices were deployed
worldwide in 2004.

A traditional random access memory chip receives an address and returns the
content of the memory at that address. A TCAM chip works in a reverse manner:
it receives content and returns the address of the ﬁrst entry where the content lies
in the TCAM in constant time (1.6., a few CPU cycles). Exploiting this hardware
feature, TCAM-based packet classiﬁers store a. rule in each entry as an array of
0’8, 1’8, or *’s (don"t—care values). A packet header (216., a search key) matches an
entry if and only if their corresponding 0’s and 1’s match. Given a search key to
a TCAM, the hardware circuits compare the key with all its occupied entries in
parallel and return the index (or the content, depending on the chip architecture

and configt,1ra.ti(_)n,) of the ﬁrst matching entry.

Despite their high speed, TCAMs have their own limitations with respect to

packet classiﬁcation.

Range expansion TCAMs can only store rules that. are encoded in ternary for—
mat. In a typical packet classiﬁcation rule, source IP address, destination IP ad-
dress, and protocol type are speciﬁed in preﬁx format, which can be directly stored
in TCAMs, but source and destination port numbers are speciﬁed in ranges (216.,
integer intervals), which need to be converted to one or more preﬁxes before being
stored in TCAMs. This can lead to a signiﬁcant increase in the number of TCAM
entries needed to encode a rule. For example, 30 preﬁxes are needed to represent
the single range [1, 65534], so 30 X 30 = 900 TCAM entries are required to represent

the single rule 7‘1 in Table 1.1.

Low capacity TCAMS have limited capacity. The largest TCAM chip available
on the market has 18Mb while 2Mb and 1Mb chips are most popular. Given that
each TCAM entry has 144 bits and a packet classification rule may have a worst
expansion factor of 900, it is possible that an 18Mb TCAM chip cannot store all
the required entries for a modest. packet classifier of only 139 rules. While the worst
case may not happen in reality. this is certainly an alarming issue. Furthermore,
TCAM capacity is not expected to increase dramatically in the near future due to

other limitations that we will discuss next.

High power consumption and heat generation TCAM chips consume large
amounts of power and generate large amounts of heat. For example. a le TCAM
chip consumes 15-30 watts of pox-ver. Power consumption together with the con-
sequent heat generation is a serious problem for core routers and other networking
devices. Large board Space occupation: TCAMs occupy much more board space
than SRAMs. For networking devices such as routers. area efficiency of the circuit

board is a critical issue.

High hardware cost TCAMs are expensive. For example, a 1.\'Ib TCAM chip
costs about 200 ~ 250 US. dollars. TCAM cost is a signiﬁcant fraction of router

cost.

1.2 Contribution

This work describes two methods of addressing packet classiﬁcation and the related
TCAM based issues: equivalent transformation techniques and new architectural
approaches.

Equivalent transformation techniques seek to ﬁnd semantically equivalent but
more efﬁcient classiﬁers. Two methods of equivalent transformation are TCAM
Razor and All—Match Redundancy Removal. TCAM Razor decomposes a multi-ﬁeld
problem into a series of single-ﬁeld problems; these problems are solved optimally
and then recomposed into a greedy multi-ﬁeld solution. In contrast, All—match
Redundancy Removal identiﬁes a maximal set of rules that can be removed from a
packet classiﬁer without changing the packet classiﬁer’s semantics.

New architectural approaches seek to modify how the TCAM based packet classi-
ﬁers operate in order to improve efﬁciency. We propose two approaches: sequential
decomposition and topological transformation. Sequential decomposition decom-
poses a single d-ﬁeld packet classiﬁcation TCAM lookup into a sequence of d 1-field
TCAM lookups. Topological transformations provide methods to translate the do-
main of each packet ﬁeld into a more efficient representation. Both techniques allow
for the efﬁcient utilization of TCAM space. These techniques mitigate the effects
of range expansion; however, they also have the unique advantage that they ﬁnd
optimizations beyond range expansion. This advantage allows for sublinear com-

pression.

Chapter 2

Background

We now formally deﬁne the concepts of ﬁelds, packets, and packet classiﬁers. A ﬁeld
F,- is a variable of ﬁnite length (i.c., of a ﬁnite number of bits). The domain of ﬁeld
F,- of w bits, denoted D(F,-), is [0,2"" — 1]. A packet over the (1 ﬁelds F1, - - - , Fd
is a d—tuple (pl, - - . ,pd) where each p,- (1 g i g d) is an element of D(F,;). Packet
classiﬁers usually check the following ﬁve ﬁelds: source IP address, destination IP
address, source port number, destination port number, and protocol type. The
lengths of these packet ﬁelds are 32, 32, 16, 16, and 8, respectively. We use 2' to
denote the set of all packets over ﬁelds 171,- - - , Fd' It follows that E is a ﬁnite set.
and [27] = ]D(F1)| X - - - x |D(Fd)]. where |Z| denotes the number of elements in set
X and |D(Fz)| denotes the number of elements in set D(F,-).

A rule has the form (predicate) —> (decision). A (predicate) deﬁnes a set of
packets over the ﬁelds F1 through Pd and is speciﬁed as F1 6 $1 /\ - - - /\ Fd E Sd
where each S,- is a subset of l)(F.,-) and is speciﬁed as either a preﬁx or a nonnegative
integer interval. A preﬁx {0, 1}l"{*}“’“k with. A“ leading (ls or ls for a packet ﬁeld of
length to denotes the integer interval [{0, 1}k{0}“’_k, {0. 1}A'{1}“’_k]. For example.
preﬁx 01** denotes the interval [0100,0111]. A rule. F1 6 .91 /\ - -- /\ Fd E 5d —»
(decision) is a prcﬁ'r rule if and only if each S,- is represented as a prefix.

A packet matches a rule if and only if the packet. matches the predicate of the
rule. A packet (p1. - - - .pd) "mic/ms a predicate F] E 5'1 /\ /\ Fd E Sd if and

only if the condition [)1 E 91 /\ - - - /\ Pd 6 Sd holds. \Ve use US to denote the set

CH

of possible values that (decision) can be. Typical elements of 05 include accept,
discard, accept with logging, and discard with logging.

A sequence of rules (r1, - - - , rn) is complete if and only if for any packet p, there
is at least one rule in the sequence that p matches. To ensure that a sequence of
rules is complete and thus a packet classiﬁer, the predicate of the last rule is usually
speciﬁed as F1 6 D(F1) /\ - - - Fd E AD(Fd). A packet classiﬁer C is a sequence of
rules that is complete. The size of C, denoted ICI, is the number of rules in C. A
packet classiﬁer C is a preﬁx packet classiﬁer if and only if every rule in C is a preﬁx
rule. A classiﬁer with d ﬁelds is called a (ll-dimensional packet classiﬁer.

Two rules in a packet classiﬁer may overlap; that is, a single packet may match
both rules. Furthermore, two rules in a packet. classiﬁer may conﬂict; that is, the
two rules not only overlap but also have different decisions. Packet classiﬁers typi-
cally resolve such conflicts by employing a ﬁrst-match resolution strategy where the
decision for a packet p is the decision of the ﬁrst (i.e., highest priority) rule that p
matches in C. The decision that packet classiﬁer C makes for packet p is denoted
CU?)-

VVe can think of a packet classiﬁer C as deﬁning a many-to-one mapping function
from E to DS. Two packet classiﬁers C1 and C2 are equivalent, denoted C1 E C2,
if and only if they deﬁne the same mapping function from E to DS; that is, for any
packet p E Z, we have C1(p) = C2(p). A rule is redundant in a. classiﬁer if and only
if removing the rule does not change the semantics of the classiﬁer. Furthermore,
we deﬁne the equivalence relation that. classiﬁer C deﬁnes on each ﬁeld domain and
the resulting equivalence classes. We use the notation 2;,- to denote the set of all
(d— 1)-tuple packets over the ﬁelds (F1, - - -, Fi_1, Fi+b -~-, Fat) and p_,- to denote
an element of 23_,;. Then we use C(pi,p_,-) to denote the decision that packet

classiﬁer C makes for the packet p that is formed by combining p,- E DlFi) and p_,-.

Deﬁnition 2.0.1 (Equivalence Class). Given a packet classifier C over ﬁelds F1,
--«, Fd, we say that 3:, y E D(F,-) for 1 S i g d are equivalent. with respect to C if
and only ifC(;1:,p_z-) = C(y,p_,;) for any p__,; E E_,;. It follows that C partitions

D(F,~) into equivalence classes. W 3 use the notation C{.:r} to denote the equivalence

class that :1: belongs to as deﬁned by classiﬁer C.

In a typical packet. classiﬁer rule, the ﬁelds of source IP, destination IP, and pro-
tocol type are speciﬁed in preﬁx format, which can be directly stored in TCAMs;
however, the remaining two ﬁelds of source port and destination port are speciﬁed
as ranges (i.e., non-negative integer intervals), which are typically converted to pre-
ﬁxes before being stored in TCAMs. This leads to range expansion, the process of
converting a non-preﬁx rule to preﬁx rules. In range expansion, each ﬁeld of a rule
is ﬁrst expanded separately. The goal is to ﬁnd a minimum set of preﬁxes such that
the union of the preﬁxes corresponds to the range. For example, if one 3-bit ﬁeld
of a rule is the range [1, 6]. a corresponding minimum set of preﬁxes would be 001,
01*, 10*, 110. The worst-case range expansion of a w—bit range results in a set.
containing 2w — 2 preﬁxes [13]. The next step is to compute the cross product of the

set of preﬁxes for each ﬁeld, resulting in a potentially large number of preﬁx rules.

2.1 Firewall decision diagrams

A crucial data structure required for this work is the Firewall Decision Diagram
(FDD) [9] A Firewall Decision Diagram (FDD) with a decision set 03 and over
ﬁelds F1, - ~ - , Fd is an acyclic and directed graph that. has the following ﬁve proper-
ties: (1) There. is exactly one node that has no incoming edges. This node is called
the root. The nodes that have no outgoing edges are called terminal nodes. (2) Each

node v has a label. denoted F (U), such that

{F12 - . - , Fd} if v is a nonterminal node.
F(U) E
DS if v is a terminal node.

(3) Each edge 6:11 —> U is labeled with a nonempty set of integers. denoted 1(e),

where 1(e) is a subset of the domain of 11‘s label (i.e., I((’) C_: D(F(u))). (4) A

directed path from the root to a tern‘iinal node is called a decision path. No two
I'

nodes on a decision path have the same label. (a) The set of all outgoing edges

of a node v, denoted 13(1)). satisﬁes the following two conditions: (i) Consistency:

I (e) ﬂ [(6’) = Q) for any two distinct edges e and e’ in E(u).

066%) Me) = Darn».

 

 

 

 

 

 

 

 

 

 

 

(ii) Completeness:

 

 

 

F1 F 2 Decision
000 000 accept
000 111 accept
110 000 accept
110 111 accept
010 0** accept
100 0** accept
** 1** accept
** *** discard
it FDD Construction
6 001 01 1 101
000 1 10 010 100 1 1 1
@ 6 6
000 001 000 001 0” 1“ O” 1“ 1' 0”
01 ' 01 '
20;.) 1a
a a d a a d a d a d a d

 

 

 

 

 

 

 

 

Figure 2.1: Illustration of FDD construction

We deﬁne a full—length ordered F DD as an FDD where in each decision path

all ﬁelds appear exactly once and in the same order. For ease of presentation, we
use the term “FDD” to mean “full-length ordered FDD” if not otherwise speciﬁed.
Given a packet classiﬁer (C, the F DD construction algorithm in [17] can convert it. to
an equivalent full-length ordered FDD f. Figure 2.1(a) contains a sample classiﬁer,
and Figure 2.1(b) shows the resultant F DD from the construction process.

After an FDD f is constructed, we can reduce f ’5 size by merging isomorphic
subgraphs. A full-length ordered FDD f is reduced if and only if it satisﬁes the
following two conditions: (1) no two nodes in f are isomorphic; (2) no two nodes have
more than one edge between them. Two nodes '1) and v' in an FDD are isomorphic
if and only if v and 'U’ satisfy one of the following two conditions: (1) both 1) and
'v’ are terminal nodes with identical labels; (2) both 1) and v’ are nonterminal nodes
and there is a one-to-one correspondence between the outgoing edges of 'v and the
outgoing edges of U, such that every pair of corresponding edges have identical labels
and they both point to the same node. A reduced F DD is essentially a canonical
representation for packet classiﬁers. Figure 2.1(c) shows the reduced FDD from

Figure 2.1(b).

2.2 One-Dimensional Classiﬁer Minimization

The special problem of weighted one—ﬁeld TCAM I'ninirnization is used as a building
block for multi-dimensional TCAM minimization. Given a one-ﬁeld packet classiﬁer
f of n preﬁx rules (r1, r2, - - - , r"), where {Decision(r1), De(.:ision(r2), - - - . Decision(rn)
} = {d1,(12.- - -,d;} and each decision (1,; is z-issociated with a cost. Cost(d,;) (for

1 S i S .3). we deﬁne the cost. of packet classiﬁer f as follows:

n.
Cost(f) = Z Cost(Decision(ri))
i=1

Based upon the above deﬁnition, the problem of weighted one-diInensional TCAM

minimization is stated as follows.

Deﬁnition 2.2.1. Weighted ()nc—Dimensional Preﬁx: Min.i-Inization Problem. Given

a one-ﬁeld packet classiﬁer f1 where each (lccismn is associated with a cost. ﬁnd a

9

preﬁx packet classiﬁer f2 6 {f1} such that for any preﬁx packet classiﬁer f E {f1}.
the condition Cost(f2) g Cost(f) holds.

The problem of one-dimensional preﬁx minimization (with uniform cost) has
been studied in [6,27] in the context of compressing routing tables. I generalize the
dynamic programming solution in [27] to solve the weighted one-dimensional TCAM

minimization. There are three key observations:

1. For any one-dimensional packet classiﬁer f on {*}111, we can always change
the predicate of the last rule to be {*}w without changing the semantics of
the packet classiﬁer. This follows from the completeness property of packet

classiﬁers.

2. Consider any one-dimensional packet. classiﬁer f on {*}w. Let f1 be f ap-
pended with rule {1*}111 —> d, where d can be any decision. The ol;)servation
is that f E f’. This is because the new rule is redundant in f ’ since f must
be complete. A rule in a packet classiﬁer is redundant if and only if removing
the rule from the packet. classiﬁer does not change the semantics of the packet.

classiﬁer.

3. For any preﬁx 79 E {0,1}k{*}1"'_1" (0 S I; g in), one and only one of the
following ccmditions holds:
(a) r e {0,1}1~‘0{*}w-k-1,
(b) 73 E {0,1}k1{*}“’—k—1.
(c) ”P = {0,1}k{*}w—k.
This property allows us to divide a problem of {0, l}k{*}1“‘k into two sub-

problems: {0. 1}1‘70{*}w—k—1. and {0. 1}]"1{*}“1‘k—1. This divide—arid—conquer

strategy can be applied recursively.

We formulate an o )timal dynamic )rtwrammine‘ solution to the weighted one-
. C) C") F)

dimensional TCAM minimization problem.

10

Let. P denote a preﬁx {0,1}k{*}w_k. We use 2 to denote the preﬁx {0,1}k
0{*}11’_k—1, and P to denote the preﬁx {0,1}k1{*}w—k_1.

Given a one—dimensional packet classiﬁer f on {*}w, we use f7; to denote a
packet classiﬁer on P such that for any :1: E P, frp(1:) = f (1:), and we use ff; to
denote a similar packet classiﬁer on P with the additional restriction that the ﬁnal
decision is d.

C ( fp) denotes the minimum cost of a packet classiﬁer t that is equivalent to fp,
and C ( ﬁg) denotes the minimum cost of a packet classiﬁer t’ that is eq1.1ivalent to
fp and the decision of the last rule in t’ is d.

Given a one-dimensional packet classiﬁer f on {*}1“ and a preﬁx P where P Q
{*}1”, f is consistent on P if and only if V1.3 y E P, f(:r) = f(y).

The dynamic programming solution to the weighted one—dimensional TCAM
minimization problem is based on the following theorem. The proof of the theorem
shows how to divide a problem into sub-problems and how to combine solutions to

sub-problems into a solution to the original problem.

Theorem 2.2.1. Given a one—dimensional packet classiﬁer f on {*}w, a preﬁx? P
where P Q {*}w, the set of all possible decisions {(11, (12, - - - , dz} where each decision

d,- has a cost wdi (1 g i S 2:). we have that
., Z: , (l,
C(fp) : mm ((fp)

221

where each C(fgi) is calculated as follows:
(I) Iff is consistent on P, then
an __ War) was,
1p>—, , ,. ,
U)f(1,) + ”(12: iff(.z.) # ,1

'23) [ff is not consistent on P, then

11

f (11 (11
CUP. )+ C( 7—3— ) — 'wdl +u'1‘1i’

' 7

1- 1-
C(f;1_1)+ C(fg—l) — 111,17;1 + 1111,17,,
, 11,- . Hi 11 '
C(fpl) = min < C(fp1)+ C(ffﬁ‘) — 1111,12,,
C0211 )+C( 51+ l—‘U’11,+1 +1111,»

0',

1. c
C(f‘g) + C( [3“) —: 111-'11,, + “’1

 

\ "i
Proof. (1) The base case is when f is consistent on P. In this case, the minimum
cost preﬁx packet classiﬁer in {fp} is clearly (P ——) f (13)), and the cost of this
packet classiﬁer is w “1.). Furthermore, for d,- # [(17), the minimum cost preﬁx
packet classiﬁer in {fp} with decision 11,- in the last rule is (P —> f (:17),P —+ dz)
where the second rule is redundant. The cost of this packet classiﬁer is 1Uf(-r) + wdi'
(2) If f is not consistent on P, divide P into E and P. The crucial observation
is that an optimal solution f * to {fp} is essentially an optimal solution f1 to the
sul_)-problem of minimizing f2 appended with an optimal solution f2 to the sub-
problem of minimizing f5. The only interaction that can occur between f1 and f2
is if their ﬁnal rules have the same decision, in which case both ﬁnal rules can be
replaced with one ﬁnal rule covering all of P with the same decision. Let d3; be the
decision of the last rule in f1 and dy be the decision of the last rule. in f2. Then
we can compose f * whose last rule has decision (1, from f1 and f2 based on the
following cases:
(A) (1.5,- = dy :- (17;: In this case, f can be constructed by listing all the rules in f1
except the last. rule, followed by all the rules in [2 except the last rule, and then the
last rule P ——+ (1,. Thus, Cost(f) = Cost(f1) + Cost(f2) — (11(1):.
(B) (1,: : (1y # di‘ In this case, f can be constructed by listing all the rules in f1
except the last rule, followed by all the rules in f2 except the last. rule, then rule

P —> 11,-. and ﬁnally rule P -—_> d,. Thus. Cost(f) = Cost(f1)+ (70-51(12)‘1L'11I‘1’U’11

'11.

(C) (11,- 75 (1;).(11- = d,,dy # 11,-: we do not, need to consider this case because
, .(l' V (1 (l- 1 (l 1 (1: >1 (1:
(11211 + ( (1,511) = cognac-,9) +1.11 — 1.1 2 ( 11151 + 1 (1,33) — 11.1

D d; (1 ,dl- d-, d1 = (1: Similarly, this case need not be considered.
9 z J i
(E) d1: aé dy, d1: 75 di, dy # dZ-z Similarly, this case need not be considered. C]

Figure 2.2 shows the illustration of a one—dimensional TCAM minimization prob—
lem, where the black bar denotes decision “accept” and the white bar denotes deci-
sion “discard”. Figure 2.3 illustrates how the dynamic progrannning works on this
example.

L:ll::l
——

00 01 10 11

 

 

 

 

 

Figure 2.2: An example one-dimensional TCAM 11111111111281.1011 problem

HI

IHIH

\><01./ \><._/

I:

E
HUI
lll
EU

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

_r f _1_ J.
|oo o1||oo o1] |1o11||1o11|
1:] [:1 - —
— 4

 

 

 

 

 

 

—-L

10 11

loo o1 1o11| loo 0

 

 

 

 

 

Figure 2.3: Illustration of dynamic program

2.3 Experimental Data

we performed experiments on a. set, of '25 real—world packet classiﬁers, which is

denoted by R1,. The classiﬁers in 11L were chosen from a larger set. of real-world

13

classiﬁers obtained from various network service providers. where the classiﬁers range
in size from a handful of rules to thousands of rules. We partition the original
classiﬁers into 25 groups where the classiﬁers in each group share similar structure.
For example, the ACLs configured for the different interfaces of a router often share
a similar structure. we created RL by randomly choosing one classiﬁer from each
of the 25 groups. We did this because classiﬁers with similar structure often exhibit.
similar results for the proposed algorithms. If all classiﬁers were used, the results
would be skewed by the relative size of each group.

Because packet classiﬁers are considered conﬁdential due to security concerns,
which makes it difﬁcult to acquire a large quantity of real-world classiﬁers, we gen-
erated a set of synthetic classiﬁers SYN with the number of rules ranging from
250 to 8000. The predicate of each rule has ﬁve ﬁelds: source IP, destination IP,
source port, destination port, and protocol type. We based our generation method
upon Singh et al.’s [23] model of synthetic rules. We choose this model over Tay-
lor&Turner’s Classbench [30] because Classbench does not generate decisions, and
there is no rationale or guidelines for assigning decisions to each rule.

To stress test. the sensitivity of our algorithms to the number of decisions in a
classiﬁer. we created a. set. of classiﬁers RLU by replacing the decision of every rule in
each classiﬁer by a unique decision. Similarly, we created the set S YNU. Thus, each
classiﬁer in RLU (or S YNU) has the maximum possible number of distinct decisions.
Such classiﬁers might arise in the context of rule logging where the system monitors

the frequency that each rule is the ﬁrst matching rule for a packet.

14

Chapter 3

Related Work

There is signiﬁcant prior work on packet classiﬁcation for both TCAM based packet
classiﬁcation and software based packet classiﬁcation. While TCAM based sys-
tems are more immediately relevant, software based classiﬁcation shares a degree of
commonality with the sequential decomposition technique; however, differences in

available hardware result in very different design decisions.

3.1 TCAM Based Classiﬁers

There is signiﬁcant. prior work on minimizing the TCAM space. occupied by a single
classiﬁer. Such work falls into three broad categories: (1) classiﬁer minimization
(c.g., [1, 5, 6, 18, 27]), which converts a. given classiﬁer to a semantically equivalent
classiﬁer that. requires fewer TCAM entries; (2) range encoding (e.g., [4.15.19.21,31]),
which encodes the ranges (i.e., source port and destination port) in a manner that
reduces range expansion; and (3) circuit modiﬁcation (e.g., [24]). which modifies

TCAIV'I circuits to ac(_:(:)mmodate range comparisons.

3.1.1 Classiﬁer Minimization:

The basic idea is to convert a given packet classiﬁer to another semantically equiva-
lent packet classifier that. requires fewer TCAM entries. Several classiﬁer minimiza—

tion schemes have been proposed [1, 5.6. 18. 27]. The work in [1, 6.27] focuses on

one-dimensional and two dimensional packet. classiﬁers.

Construction Optimal IP Tables In [6], Draves et al. present a polynomial
algorithm for generating a minimum equivalent packet classiﬁer for one-dimensional
preﬁx match classiﬁers. Their algorithm, Optimal Routing Table Constructor(ORTC),
works by reducing a longest matching preﬁx trie to its minimal representation via.
three traversals. This minimal trie can then be used to use to generate a minimum
single ﬁeld preﬁx classiﬁer. A longest match preﬁx classiﬁer can be trivially con-
verted into a ﬁrst match preﬁx classiﬁer by sorting the rules such that the longer

preﬁx rules appear before shorter preﬁx rules.

Compressing Two-Dimensional Routing Tables In [27], Suri et al. present.
a polynomial time dynamic program that generates a minimum equivalent packet
classiﬁer for one-dimensional preﬁx classiﬁers. This dynamic program is equivalent
to the dynamic program presented in Chapter 2.2. Furthermore, Suri et al. present
a. generalization of this dynamic program for two or more ﬁelds. They show that this
generalization produces optimal two ﬁeld classiﬁers when the the solution space of
classiﬁers is restricted such that the predicates of any two rules in the a classiﬁer are

either disjoint, or one predicate is a subset of the other. However, these generalized

 

algorithms have a signiﬁcant time requirements, 0(NIDS (wl x x 111(1)), where
wd is the number of bits used for 17,. As a. result, the dynamic program ceases to be

usable for more than two ﬁelds.

Complete Redundancy Detection in Firewalls In [18]. Liu and Gouda, pro-
pose the ﬁrst algorithm that is guaranteed to detect and remove a maximal set of
redundant rules within a classifier. They propose two types of redundant rules.
upward redundant rules and downward redundant rules. These two types of rules
are shown to completely categorize the set of all redundant rules. Liu and Gouda’s
algorithm ﬁrst uses an iterative F DD construction technique to remove all upward
redundant rules. and it uses a different iterative FDD ccmstruction technique to

remove all downward redundant rules. By removing both types of rules, the al-

16

gorithm produces a classiﬁer free of redundant rules. Note, that this algorithm is
not guaranteed to remove the maximum number of redundant rules since there can
be interdependencies between redundant rules. However, this algorithm is effective
for all types of classiﬁers, and its efﬁciency scales well as the number of ﬁelds in a

classiﬁer increases.

Packet Classiﬁers in Ternary CAMS can be Smaller In [5], Dong et al.
propose the ﬁrst algorithm that modiﬁes rules within a classiﬁer in an attempt to
reduce the effects of range expansion. They propose four types of operations: trim-
ming rules, expanding rules, merging rules, and adding rules. The basic idea of
their algorithm is that by trimming or expanding the space covered by a predi-
cate, the range expansion for each rule can be reduced. They propose a two stage
algorithm that ﬁrst trims the predicate space of every rule and then expands the
predicate space of each rule going from last to ﬁrst. This algorithm is signiﬁcant in
that it accommodates classiﬁers with more than two ﬁelds. However, it is unknown
whether or not the algorithm is optimal given a one-dimensional classiﬁer. Further-
more, The algorithm requires repeated applications for a classiﬁer to converge upon
a minimal set of rules and depends heavily upon repeated applications of the redun-
dancy removal technique found in [18]. This suggests that their algorithm requires

a signiﬁcant amount of con’irmtational overhead.

Compressing Rectilinear Pictures and Minimizing Access Control Lists
In [1], Applegate et al. propose an optimal solution for two ﬁeld classiﬁers composed
entirely of strip rules. Strip rules have a wild card for at. least one ﬁeld. However,
while this work is of theoretical interest, it does not. scale to d—dimensional classiﬁers,

and it is not clear that packet classiﬁers can be efﬁciently represented with strip rules

3.1.2 Range Encoding:

The basic idea is to ﬁrst encode ranges that appear in a classiﬁer and store the en—

coded rules in a TCAM. When a packet. comes, the packet needs to be. preprocessed

17

so that the resulting encoded packet can be used as a search key for the TCAM.
Previous range encoding schemes fall into two categories: database independent en-
coding schemes [4,15], where the encoding of each rule is independent of other rules
in the classiﬁer, and database dependent encoding schemes [19, 21,31], where the
encoding of each rule may depend on other rules in the classiﬁer. The advantage
of database independent encoding schemes is that they allow fast. incremental up-
dates to the classiﬁer since each rule is encoded indepemlently. However, database
dependent schemes have the potential for better space savings since they can utilize
the low number of unique ranges that appear in real life classiﬁers to achieve lower

range expansion.

Algorithms for Advanced Packet Classiﬁcation In [15], Lakslnninarayanan
et al. propose a scheme called fence encoding, which encodes interval ranges as a
range of unary numbers. All ranges under fence encoding have an expansion factor
of one, which implies that all ranges can be encoded with one rule, but the number
of unary bits required for each rule is prohibitive since a ﬁeld with length 11) requires
2‘“ bits per rule. To reduce the required number of bits in a rule, Lakshminarayan
et al. proposed the technique called DIRPE, which con'mresses the size of fence
encodings at the expense of increasing the average expansion ratio. DIRPE works
by dividing a ﬁeld into equally sized sub-ﬁelds, which are called chunks, and fence
encoding these chunks. Selecting the number of chunks provides a trade-off between
range expansion and TCAM entry width. The authors also propose a method of
combining DIRPE with Liu’s range encoding scheme found in [19] to handle ranges
that have a large expansion factor under DIRPE. However. this combination negates

DIRPE'S ability to allow fast updates for classiﬁers.

Space-efﬁcient TCAM-based Classiﬁcation using Gray Coding Bremler-
Barr and Hendler, in [4], propose a scheme in which ﬁeld domains are encoded using
binary reﬂected gray codes(BRCC). While there is no advantage or disadvantage to
using a BRGC for ﬁelds that contain only preﬁx ranges. using BRGC on non-preﬁx

ranges breaks up the range in such a way that additional ternary bits can be used to

1,8

eliminate some of the preﬁxes needed to represent a range. The result is that some
of the preﬁxes ranges required to represent a range are merged together into a single
ternary entry. The authors note that this encoding technique is especially effective
for small ranges and name their encoding algorithm short range gray encoding or
SGRE. Since SGRE does not require any additional TCAM bits to encode ranges,
Bremler-Barr and Hendler also propose a method of combining SCRE with Liu’s
range encoding technique. Like DIRPE this combination negates SGRE’S ability
to support fast updates for classiﬁers, but it allows for the technique to concisely

encode ranges with large expansion factors under SRGE.

Efﬁcient Mapping of Range Classiﬁers into Ternary-CAM In [19], Liu pro-
poses an encoding method that designates speciﬁc ternary bits within each TCAM
entry to represent a speciﬁc range. A packet ﬁeld is encoded via an SRAM lookup
table that maps each ﬁeld value to a codeword that has a designated bit set to 1 if
and only if the value is an element of the corresponding range. Each rule’s range
predicate can then be encoded such that the designated bit is set to 1 and every
other bit is set to *. This technique eliminates range expansion completely; however,
it also requires n bits per TCAM entry when a classiﬁer has n unique ranges in a
ﬁeld. This technique quickly becomes impractical as the number of unique ranges
within a ﬁeld increases. To combat the explosive growth in required bits, Liu pro-
poses splitting a ﬁeld domain into I: disjoint ranges such that each disjoint range
intersects with a small number of unique ranges. Since [logk + 1] bits are needed
to encode these disjoint ranges, this scheme allows for a ﬁeld to be encoded using
[log k + 1] + 71’ bits where n’ is the maximum number of unique ranges that inter—
sect with a given disjoint range. Using this scheme means that rule predicates that.
intersects with more than one disjoint range must. be replicated for each intersection.
To manage the trade off between rule expansion and bit expansion, Liu proposes a
heuristic algorithm that repeatedly ﬁnds and merges the pair of disjoint ranges that
reduces rule expansion the most. These merges continue until a budget. of b bits is

exhausted.

19

Fast and Scalable Packet Classiﬁcation In [31], van Lunteren and Engbersen
propose an encoding method similar to [19]. In their encoding scheme, they control
the required number of bits by partitioning the unique ranges into 1 layers. The
ranges within each layer are then broken into disjoint ranges so that each layer can
be encoded in [log ni] bits where n,- is the number of disjoint ranges in layer i. Each
ﬁeld then become the concatenated encoding for each layer. The authors also note

’ in another layer,

that if a disjoint range r in one layer contains a disjoint range r
this information can be used to reduce the number of bits needed to encode r”s
layer. Unfortunately, no algorithms are given for partitioning unique ranges into

layers.

An Encoding Scheme for TCAM-based Packet Classiﬁcation In [21], Pao
et al. propose an encoding algorithm called. prefix inclusion encoding(PIC). PIC
utilizes van Lunteren and Engbersen’s observation that containment information for
one layer can reduce the number of bits required to encode ranges in the next layer.
That is the scheme produces a series of 1 layers L1, . . . , Ll such that each disjoint
range in L,- is a subset of a single range of Li—l for i E {2, . . . , I}. With this property,
PIC can encode a ﬁeld predicate into a compact preﬁx range. PIC was designed for
encoding ﬁelds with preﬁx ranges and large domains such as the source IP for IPv6
headers. However, the authors do suggests techniques for adapting their scheme
to encode range ﬁelds. These techniques require breaking overlapping ranges into
disjoint ranges, which in turn requires that encoded rules are replicated in a manner

similar to other encoding techniques.

3.1.3 Circuit Modiﬁcation:

The basic idea is to modify TCAM circuits to accon‘m‘iodate range comparisons. For
example, Spitznagel ct al. proposed adding comparators at each entry level to better
accommodate range matching [24]. While this research direction is important. such
solutions are hard to deploy due to high cost. [15], and modiﬁed TCAMs may be less

applicable to applications other than packet processing.

20

3.2 Software Based Techniques

The simplest. software based technique for packet classiﬁcation is a linear search,
which has excellent storage requirement but. becomes too slow for wire speed packet.
classiﬁcation for even modest sized classiﬁers. As a results, there is a rich body
of software based packet classiﬁcation techniques [2,3,7',11,14,22,23,25,26,28,32],
and an extensive survey of these techniques can be found in [29]. These techniques
trade storage space for an improvement in search time via special preprocessing
of the classiﬁer rules. Techniques can be partitioned into two categories: parallel

decomposition and decision tree classification.

3.2.1 Parallel decomposition

The objective of parallel decomposition techniques [3,11,14,25,26,28] is to break the
classiﬁcation process into several steps that can performed in parallel. The above
techniques perform the decomposition along the ﬁeld boundaries of a packet header.
This in effect allows for fast and efﬁcient. single ﬁeld classiﬁcation solutions to encode
each ﬁeld in parallel. These new values are then composed via one or more additional

classiﬁcations stages to yield a. correct classiﬁcation.

High-speed Policy-based Packet Forwarding using Efﬁcient Multi-dimen-
sional Range Matching In [14]. Lakshman and Stiliadis propose encoding each
ﬁeld’s value into a bitmap that specifies a containment relationship among values and
rules [14]. This bitmap indicates whether or not an encoded value intersects with a.
given rule’s ﬁeld predicate. Once each ﬁeld is encoded, this method uses customized
parallel AND gates to perform an intersection of these bitmaps and ultimately ﬁnds
the ﬁrst matching rule. This technique. is effective; however it requires a bit. line for

each rule in the classiﬁer and must be implemented on customized hardware.

Scalable Packet Classification In [3]. Baboescu and Varghese improve on the
above technique by observing that for classiﬁers with a low occurrence of wildcards.

bitmaps will be sparely populated with 1‘s. They group bits within each bitmap

1)1

‘—

into chunks and represent. each chuck with a single bit which is the logical OR of all
the bits within the chunk. This allows the second stage to skip the comparison of
a signiﬁcant number of bits. For classiﬁers that have a low occurrence of wildcards,
this technique is very effective at reducing the number of memory access needed
to perform the second stage processing; however, this reduction is diminished once

wildcards occur more frequently within a classiﬁer.

Fast and Scalable Layer Four Switching In [26], Srinivasan et al. propose
an encoding method called crossproducting that assigns a unique number to each
maximal disjoint range within a classiﬁer ﬁeld and constructs a lookup table for
the cross product of the numbers associated with each ﬁeld. This technique is fast;
however, its storage requirements multiplicatively increases as the number of ﬁelds
and ranges increases. As a result, the authors only intend crossproducting for small

classiﬁers with two ﬁelds.

Packet Classiﬁcation on Multiple Fields In [11], Gupta and McKeown pro-
pose an encoding method called Recursive Flow Classiﬁcation (RFC) that. is an op-
timized version of the cross-producting scheme. This uses recursive cross-producting
tables to reduce the space requirements of regular cross-producting tables. F urther-
more, they map disjoint ranges that are contained by the same set of rules into
a single value. RFC’S mapping tables define. an equivalence relation; however, this
equivalence relation is less general than the domain compression technique discussed
in Chapter 8, so they are unable to achieve a maximum compression for each ﬁeld
domain in most cases. Furthermore, the recursive cross-producting scheme requires

a signiﬁcant amount of space to store in memory.

Packet Classiﬁcation using Tuple Space Search In [25], Srinivasan et al.
propose a tuple based search approach. This approach transforms each rule predicate
with d ﬁelds into a d—tuple, which is in essence a hash of the predicate. The idea is
that this initial hashing divides the. search space into regions that. can be searched

in parallel. Perfect hashing functions are used to ﬁnd exact matches in each tuple’s

[\D
[\J

search space. The authors propose two methods of determining appropriate tuples
to search. The ﬁrst method is an exhaustive search of each tuple, and the second
uses a set pruning trie for each ﬁeld that returns a set of candidate tuples. With
set pruning, the intersections of the results from each ﬁeld is the set of tuple spaces

that need to be searched.

Scalable Packet Classiﬁcation using Distributed Crossproducting of Field
Labels In [28], Taylor and Turner propose the Distributed Crossproducting of
Field Labels (DCFL) method that assigns each locally unique range within a ﬁeld a
locally unique number. Each ﬁeld value is encoded into a set of numbers, which rep-
resents the ranges that contains the value. These sets are crossproducted together
and then intersected with the set of unique tuples generated from the classiﬁer’s
ﬁeld predicates. The resulting intersection provides a. list of rules that the packet
header matches. The authors optimize this technique by incrementally perform-
ing the crossproduct and ﬁltering the intermediate results after each incremental
crossproduct. This optimization can dramatically reduce that number of false pos-
itive tuples that are generated. Since overlapping ranges diminish the incremental
crossproducts’ ability to keep the number of candidate matches low, the technique’s

perforn'iance depends on classiﬁers having a. low number of overlapping ranges.

3.2.2 Decision Trees

Decision tree methods [2, 7, 12, 22,23, 32] use tree structures to successively prune
the search space to a single rule. or a small number of rules. which are then searched
linearly to ﬁnd a match. Decision tree methods such as HiCuts [12] and Hypercuts
[23] are similar in flavor to our sequential decomposition approach in that they use
a sequence of searches where each search uses a portion of the packet predicate to
classify a packet. However, scﬂtware—based methods are constrained by a complex
tradeoff among how many searches need to be performed. the time. required to
perform] a search and the space required to store the data structure that facilitates

the search. In the worst case, these methods require many searches, slow searches.

23

or tremendous amounts of memory.

Classiﬁcation Using Hierarchical Intelligent Cuttings In [12], Gupta and
McKeown present a decision tree algorithm called HiCuts. This algorithm builds a
decision tree similar to an unordered FDD with the following differences: Each node
makes a decision based on a partition of a ﬁeld’s domain, and leaves are allowed
to store a list of rules. The rationale for both of these decisions is derived directly
from the limitations of SRAM lookup methods. The authors implement each node
as a lookup table so that the next node in the tree can be found in constant time.
However, since a ﬁeld domain of size 232 is prohibitively large, the ﬁeld domain must
be cut into subsets to limit the size of each tree node. This technique successively
prunes the set of candidate rules with the decision tree until the set of rules is below
a certain threshold. Once this threshold is reached. the list of remaining rules is
stored in the leaf at the end of the decision path. The rationale for this decision is
to save storage space since small lists usually result in big subtrees. The authors

also present a parameterized construction algorithm that. allows the user to trade

maximum lookup time for storage space.

Packet Classiﬁcation using Multidimensional Cutting In [23], Singh et al.
present HyperCuts, which is an improvement upon HiCuts. The authors contribution
is to allow each node in the decision tree to build a nmltidimensional lookup table
from cuts in multiple ﬁelds. This improvement allows for a. more effective pruning of
the list of candidate rules. The authors show that HyperCuts signiﬁcantly improves

upon the performance of HiCuts.

A Modular Approach to Packet Classiﬁcation In [32], Woo uses a three stage
approach to classifying packets. The first. stage is a lookup table that distributes
packet value among a set of decision trees by matching m. bits within a rule predicate.
These decision trees are binary trees where the nodes select. the appropriate bit
within the predicate to determine which edge to foll(;)w. The second stage traverses

the decision tree until the third staéige is reached when a. leaf in the decision tree is

24

found. These leaves contain a list of one or more candidate matches that is searched
sequentially until a match is found. One key assumption in VVoo’s work is that
each classiﬁer predicate needs to be transformed into a ternary bit. string. This
assumption implies that classiﬁers with signiﬁcant amounts of range expansion will

degrade the storage efﬁciency of this technique.

Fast Firewall Implementations for Software-based and Hardware-based
Routers In [22], Qiu et al. revisit two trie-based lookup schemes for packet clas-
siﬁcation that have been traditionally dismissed as being inefﬁcient and show that
for real packet classiﬁers, they offer predictable classiﬁcation speeds. Longest preﬁx
matching tries are an efficient data structure for performing an exact match for a
single ﬁeld packet; however, once packet. classiﬁers requires multiple ﬁeld packets,
some packets will not match against the longest preﬁx in all dimensions. The ﬁrst
technique that they examine uses a backtracking search on a multi-ﬁeld trie to ﬁnd
every candidate rule. They provide a set of optimizations for the multi-ﬁeld trie
that speeds up the backtracking search; however, the number of memory accesses
required to classify each packet range from 117 to 196 for real-life classiﬁers. The
second technique that they examine uses set pruning tries, which enumerate all deci-
sion paths so that each packet value can only follow a single path. The authors also
propose two compression algorithms that help to reduce the storage requirements for
set pruning tries and backtracking tries. Set pruning tries outperform backtracking
search at the expense of additional memory storage requirements; however, experi-
mental results suggest that backtracking tries offer a better performance for storage
trade off. The experimental results suggest that these techniques offer a 2 to 5 times

speedup over linear search.

Tradeoffs for Packet Classiﬁcation In [7], Feldmann and Muthukrishnan pro-
pose building lookup—up trees similar to HiCuts; however, instead of using a. lookup
table at each node, they employ an inverted lookup tree call a Fat Imrcrtcd Seg—
ment(F IS) tree to store a complete set of cuts of each ﬁeld. This technique allows

for a more compact re])resentation of the classiﬁer, but. it can significamtly increasing

the number memory accesses needed to classify a packet when compared to HiCuts

or HyperCuts.

Packet Classiﬁcation for Core Routers: Is there an Alternative to CAMS?
In [2], Baboescu et al. propose the Extended Grid—0f- T1‘2'es(EGT) technique. EGT
uses a two-ﬁeld trie to prune the candidate rule list and uses a path compression
algorithm to minimize the amount of memory needed to store the trie. For core
router tables, EGT provides reasonable performance; however, EGT’S performance
depends on the structural properties of the core routing tables. Packets classiﬁers

used in other applications (6.9. ﬁrewalls) may not have acceptable performance with

EGT.

26

Part I

Equivalent Transformation

Techniques

27

Consider the following TCAM lVIinimization Problem: given a packet classiﬁer,
how can we generate another semantically equivalent packet classiﬁer that requires
the least number of TCAM entries? Two packet classiﬁers are (semantically) equiv—
alent if and only if they have the same decision for every packet. For example, the
two packets classiﬁers in Tables 1.1 and 1.2 are equivalent; however, the one in Ta—
ble 1.1 requires 900 TCAM entries, and the one in Table 1.2 requires only 6 TCAM
entries.

Solving this problem helps to address the limitations of TCAMs. As we reduce
the number of TCAM entries required, we can use smaller TCAMs, which results
in less board space and lower hardware cost. Furthermore, reducing the number of
rules in a TCAM directly reduces power consumption and heat generation because
the energy consumed by a TCAM grows linearly with the number of ternary rules
it stores [33].

While the optimal solution to the above problem is conceivably NP-hard, in
this thesis, we propose a practical algorithmic solution using two techniques. Our
ﬁrst technique, TCAM Razor, generates new but equivalent classiﬁers, whereas our
second technique. all-match redundancy removal. ﬁnds a set of rules that can be

safely removed from a classiﬁer

Chapter 4

TCAM Razor

TCAM Razor consists of the following four basic steps. First, convert a given packet.
classiﬁer to a reduced decision diagram, which is the canonical representation of the
semantics of the given packet classiﬁer. Second, for every nonterminal node in
the decision diagram, minimize the number of preﬁxes associated with its outgoing
edges using dynamic programming. Third, generate rules from the decision diagram.
Last, remove redundant rules. As an example, running our algorithms on the packet
classiﬁer in Table 1.1 will yield the one in Table 1.2.

Our solution is effective, efﬁcient, and practical. In terms of effectiveness, our ap-
proach achieves a total compression ratio of 3.9% on real—life packet classiﬁers, which
is signiﬁcantly better than the previously published best. result of 54% [5]. In terms
of efﬁciency, our approach runs in seconds, even for large packet classiﬁers. Finally,
in terms of practicality, our approach can be easily deployed as it does not require
any modiﬁcation of existing packet classiﬁcation systems. In comparison, a number
of previous solutions require hardware and architecture modiﬁcations to existing
packet classiﬁcation systems, making their adoption by networking manufz—icturers
and ISPS much less likely.

The solution is named “TCAM Razor” f(’)llowing the principle of Occams razor:
“0f two equivalent theories or eJ‘planat-ions. all other things being equal, the simpler
one is to be preferred.” In our context, of all packet classiﬁers that are equivalent,

the one with the least number of TCAM entries is preferred.

29

4.1 Multi-dimensional TCAM Minimization: The
Basics

In this section, we present TCAM Razor, our algorithm for minimizing multi-
dimensional preﬁx packet classiﬁers. A key idea behind TCAM Razor is processing
one dimension at a time using the weighted one-dimensional TCAM minimization
algorithm in Section 2.2 to greedily identify a local minimum for the current dimen-
sion. Although TCAM Razor is not guaranteed to achieve a global minimum across
all dimensions, it does signiﬁcantly reduce the number of preﬁx rules in real-life

packet classiﬁers.

4.1.1 Conversion to Firewall Decision Diagrams

To facilitate processing a packet classiﬁer one dimension at. a time, we ﬁrst convert.
a given packet classiﬁer to an equivalent reduced Firewall Decision Diagram (FDD)
[10]. Given a packet classiﬁer f1, we can construct. an equivalent. FDD f2 using the

FDD construction algorithm in [17].

 

F i "ure 4.1: A ﬁrewz-ill decision diagram
g o

.30

4.1.2 Multi-dimensional TCAM Minimization

We start the discussion of our greedy solution by examining the reduced FDD in
Figure 4.1. We ﬁrst look at the subgraph rooted at node '02. This subgraph can be
seen as representing a one-dimension packet classiﬁer over ﬁeld F2. We can use the
weighted one-dimensional TCAM minimization algorithm in Section 2.2 to minimize
the number of preﬁx rules for this one-dimensional packet classiﬁer. The algorithm

takes the following 3 preﬁxes as input:

10 * * (with decision accept and cost 1),
O * ** (with decision discard and cost 1),
11 * * (with decision discard and cost 1).

The one—dimensional TCAM minimization algorithm will produce a minimum (one-

dimensional) packet classiﬁer of two rules as shown in Table 4.1.

 

Rule # F1 Decision

 

1 10** accept

2 * ** * discard

 

 

 

 

 

Table 4.1: A minimum packet classiﬁer corresponding to 212 in Fig. 4.1

Similarly, from the subgraph rooted at. node 123, we can get. a minimum packet

classiﬁer of one rule as shown in Table 4.2.

 

Rule# F1 Decision

1 ****

 

discard

 

 

 

 

 

Table 4.2: A minimum packet classiﬁer corresponding to U3 in Fig. 4.1

Next, we look at the root. ’Ul- As shown in Figure 4.2, we view the subgraph
rooted at ’02 as a decision with a n‘iultiplication factor or cost of 2, and the subgraph
rooted at ‘03 as another decision with a. cost. of 1. Thus. the graph rooted at 1.21 can
be thought of as a “virtual” one-(lin'iensioiial packet classifier over field F1 where

each child has a multiplicative cost.

31

   
    

 
    

F2610“ -’ accept

F26 **** -* discard

F26 **** —* discard

 

Figure 4.2: “Virtual” one—dimensional packet classiﬁer

Now we are ready to use the one-dimensional TCAM 11111111111281.1011 algorithm

in Section 2.2 to minimize the number of rules for this “virtual” one-dimensional

packet. classifier. The algorithm takes the following 5 preﬁxes and associated costs

as input:

1000
101*

with decision '02 and cost 2 ,
with decision 2:2 and cost 2 ,

1001
11* *

with decision 123 and cost 1 .

( l
( l
0 * ** (with decision ()3 and cost. 1)
( l
( l

with decision '03 and cost 1 .

/

Running the weighted one-(iimensional TCAh-l minimization algorithm on the above

input. will produce the “virtual” one-dimensional packet classifier of three rules as

shown in Table 4.3.

 

Rule # F1 Decision

 

 

 

 

 

1 1001 go to node. (’3
2 10** go to node (22
3 **** go to node U3

 

Table 4.3: A minimum packet classiﬁer (:(n‘respontling to ”1 in Fig. 4.1

32

Combining the “virtual” packet classiﬁer in Table 4.3 and the two packet clas-

siﬁers in Table 4.1 and 4.2, we get a packet classiﬁer of 4 rules as shown in Table

 

 

4.4.
Rule # F1 F2 Decision
1 1001 **** discard
2 10** 10** accept
3 10** **** discard
4 **** **** discard

 

 

 

 

 

Table 4.4: Packet classiﬁer generated from the FDD in Figure 4.1

4.1.3 Removing Redundant Rules

Next, we observe that rule r3 in the packet classiﬁer in Table 4.4 is redundant. If we

remove rule 73, all the packets that used to be resolved by 13 (that is, all the packets

that match 13 but do not match r1 and r2) are now resolved by rule 7‘4, and r4 has

the same decision as 73. Therefore. removing rule 73 does not change the semantics

of the packet. classiﬁer. Redundant rules in a packet classiﬁer can be removed using

the algorithms in [18] or the algorithm in the next chapter. Finally, after removing

redundant rules, we get a packet classiﬁer of 3 rules from the FDD in Figure 4.1.

4.1.4 The Algorithm

To summarize, TCAM Razor, our multi-dimensional TCAM minimization algo-

rithm, consists of the following four steps:

1. Convert the given packet classiﬁer to an equivalent. FDD.

2. Use the FDD reduction algorithm described in the next. section to reduce the

size of the FDD. This step will be explained in more detail in the next section.

3. Generate a packet (jtlassifier from the F DD in the following bottom up fashion.

For every terminal node, assign a cost of 1. For a non-terminal node U with

33

~
~

outgoing edges {(31, - - - , ez}, formulate a one-dimensional TCAM n'iinimization
problem as follows. For every preﬁx ’P in the label of edge e.-, (1 g j S .3),
we set the decision of ”P to be j, and the cost of ’P to be the cost of the
node that edge ej points to. For node v, we use the weighted one-dimensional
TCAM minimization algorithm in Section 2.2 to compute a one—dimensional
preﬁx packet classiﬁer with the minimum cost. We then assign this minimum
cost to the cost of node 1). After the root node is processed, generate a. packet
classiﬁer using the preﬁxes computed at each node in a depth first. traversal
of the F DD. The cost of the root indicates the total number of preﬁx rules in

the resulting packet classiﬁer.

4. Remove all the redundant rules from the resulting packet classiﬁer.

4.1.5 TCAM Update

Packet classiﬁcation rules periodically need to be updated. The common practice
for updating rules is to run two TCAMs in tandem where one TCAM is used while
the other is updated [16]. TCAM Razor is compatible with this current practice.
Because TCAM Razor is efficient and the resultant TCAM lookup table is small,
TCAM updating can be efficiently performed by rerunning TCAM Razor on the
updated rules. When rules are frequently added to a classiﬁer, we suggest the
following lazy update strategy. First, after running TCAM Razor, store the resulting
rules in the lower portion of the TCAM. Let n denote the total number of entries
in the TCAM, m denote the total number of TCAM entries needed by a packet
classiﬁer after applying Razor, and let array T denote the TCAM. Initially, the m
entries are stored from T[n — m] to T[n — 1]. When a new rule 7‘ needs to be added
to the classifier. we ﬁrst. perform range expz—msion on r. Let ""1 be the number of
preﬁx rules that. are created. We store these rules in locations T[n — m — ml] to
T[n — m — 1]. As new rules are added, this process continues until the TCAM is
ﬁlled up. Thus, TCAM Razor only needs to run periodically rather than when each

new rule is added.

34

Chapter 5

All-Match Redundancy Removal

we present an all-match based complete redundancy removal algorithm. This is the
ﬁrst algorithm that attempts to solve ﬁrst—match problems from an all-match per—
spective. We formally prove that the resulting packet classiﬁers have no redundant
rules after running our redundancy removal algorithm. We conducted extensive ex-
periments on both real-life and synthetic packet classiﬁers. The experimental results
show that our redundancy removal algorithm achieves an average compression ratio
of 41.8% for TCAM entries.

We have improved upon [18] in two ways. First. the redundancy theorem be-
comes simpler. The redundancy theorem in [18] distinguishes upward and downward
redundant rules, and detects them separately. In contrast, the redundancy theorem
presented here gives a single criterion that. can detect both upward and downward
redundant rules. Second, the new redundancy removal algorithm is more efficient.
The algorithm in [18] scans a packet classiﬁer twice and build F DDS twice in order
to remove the two types of redundant rules. In comparison. the new algorithm only
scans a packet classiﬁer once and builds one all—match FDD with a cost similar cost.
to building an FDD. The new algorithm is about twice as efficient. as the algorithm

in [18].

5.1

All-Match Based Redundancy Theorem

In this section, we introduce the concept. of all—match FDDs and the all-match based

redundancy theorem.

5.1.1 All-Match FDDs

Deﬁnition 5.1.1 (All—Match FDD). An all—match FDD t for a packet classiﬁer

f :

(r1,r2,-~,rn) over ﬁelds F1,---,Fd is an FDD that has the following ﬁve

properties:

9":

Each node v is labeled with a packet ﬁeld denoted F(v). If v is a nonterminal
node, then [7(0) is a packet ﬁeld. va is a terminal node, then F(v) is a list

of integer values (121, i2, . - - , ik) where 1 S i1 < i2 . -- < ik g n.

Each edge e:u —+ v is labeled with a nonempty set of integers, denoted [(6),

where [(6) is a subset of the domain ofu ’s label (i.e., I(e) g I)(F(u))).

The set of all outgoing edges of a node “U in t, denoted E(v), satisﬁes the

following two conditions:
(a) Consistency: I(e) ﬂ I(e’) = (D for any two distinct edges 8 and e’ in E(v).
(b) Completeness: UeEE(v) I(e) = D(F(v)).

A directed path from the root to a te'miinal node is called a decision pat-h.
No two nodes on a decision path have the same label. Given a decision path
P : (2218112262 - - - 'Umemu,,,,+1), the matching set ofP is deﬁned as the set of all
packets that satisfy (F(ul) E [(61)) /\(F(v2) E I(e2))/\- - -/\(F(Um) E I(e-m)).
We use M(P) to denote the matching set of P.

For any decision path. P : (‘Ulelvgeg - - - Uy'neynvnl+1) where F(vm+1) =2 (i1,

i2, ...,ik.) and for any rule rJ-(l g j S n). if i‘\l(P)ﬂi‘ll(rJ-) 75 o. then ll/(P) Q

111(7‘j) andj E {i1,l2,- - -,'lk}. C]

36

r1 : F1 E [1, 5] /\ F2 E [1,10] ——> accept
r2 : F1 E [1, 5] /\ F2 E [5,10] —> accept
r3 : F1 E [6, 10] /\ F2 E [1,3] —> discard
r4 : F1 E [1, 10] /\ F2 E [1,10]—> discard

Figure 5.1: A simple packet classiﬁer

[1,5] [6,10]

[1,4] V0] [1,3] [4,10]
1,4 (1,2,4)( 3,4 )

Figure 5.2: A11 all—match FDD for the packet classiﬁer in Fig 5.1

Fig 5.2 shows an all-match F DD for the simple packet classiﬁer in Fig 5.1. In this
example, we assume every packet has only two ﬁelds F1 and F2. and the domain of
each ﬁeld is [1. 10].

In an all—match F DD for a packet classiﬁer f . for any decision path 73 : (01611}282
- -- vme-,-nu,,.,y+1) where F(um+1) = (i1, i2, ik), if a packet p satisfies this path P,

then {731,7}: .. "i k} are exactly all the rules in f that p mat(_:hes. 1le IS why we

2, ..
call such a FDD an “all-Inat.(':h FDD”.

37

5.1.2 The All-Match Based Redundancy Theorem

Before we present the All-I\‘Iat(_:l1 Based Redundancy Theorem, we ﬁrst prove the

following lemma.

Lemma 5.1.1. Let t be an all-722atch FDD for packet classiﬁer f : (7‘1,r2, - - - ,rn).
For any rule rz- in f, let P1, P2, - - - , Ph be all the decision paths whose terminal node
contains r23. then the following condition holds: 111(73): U _1 lil(P j') [3

Proof:

(1) According to prope1tv5 oin the deﬁnition of all- match FDDs, we have 111(Pj) Q
Al(ri) for everyj (1 < j\ <h). Thus, we have Uj=11l1(Pj ) C 111(7' i-)

(2) Consider a. packet p in A102) According to the consistency and completeness
properties of all-match F DDS, there exists one and only one decision path that p
matches. Let P be this decision path. Thus, we have p E ll[(rz-) ﬂ 111(P). According
to property 5 in the deﬁnition of all-match F DDS, i is in the label of P’s terminal
node. Thus, we have P E {P1 P2, - - - , Ph}. Therefore, we have p E U521 111(Pj).
Thus we get. lll(r ) C Uh__1M(P) [3

Theorem 5.1.1 (All—Match Based Redundancy Theorem). Let t be an all-72iatch
F DD for packet classiﬁer f : (2‘1.7‘2. - - - ,rn). Rule rz- is redundant in f if and only
if in all te77ni72al nodes oft that have i as their ﬁrst value, i is immediately followed

by another integer j such that rz- and rJ have the same deczsion.

Proof. (1) Suppose in all terminal nodes of t that have i as their ﬁrst. value. I
is immediately followed by another integer j such that r and 7'] haw the same
decision. We next. prove that 7’1‘ is redundant in f.

We observe that removing a. rule r2- only possibly affects the decisions for the
packets in 111(2)). Let P1.Pg. - - 1 , P}, be all the decision paths in t whose terminal
node contains 1'. According to 11111111121501 1 we ham M(7 =Ul-I’__ 1 ’l[( PJ- ). Con-
sider an arbitrary packet p in 111/(r1) Suppose we have. p E 1\I(Pj). (M f be the
resulting packet. classiﬁer after removing Ti from f. To prove that 7',‘ is redundant
in f, we only need to prove f(p) = f’(p). Let the. label of the terminal node of P]-

be (i1, i2, - - 1.11.). Because i E (11.19.. - - , ik}, there. are two cases:

38

1. 21 74 i. In tins case, Til is the ﬁrst rule 1n f that p matches. Thus. removmg

rz- does not affect the decision for p. In this case, we have f (p) = f ' (p).

2. i1 = i, and rz- has the same decision as 7‘2-2. In f, 7‘2- is the ﬁrst rule that p

111atches. In f ’ , rz- is the ﬁrst rule that p 111atches. Because rz- and 732 has the

2
same decision, we have f (p) = f I (p) in this case.

Therefore, rz- is redundant in f.

(2) Suppose rule rz- is redundant in f and there exists a terminal node in t whose
ﬁrst two values are i followed by j , and rz- and rj have different decisions. Let P
denote the decision path from the root to this terminal node. Consider a packet
p E 1W (P) Thus, rz- is the first rule that p matches in f and rj is the ﬁrst rule that
p matches in f’. Because rz- and rj have different decisions, we have f (p) 51$ f’ (p)
This conﬂicts with the assumption that 21,; is redundant. Therefore, if rz- is redundant
in f, then in all terminal nodes of t: that have i as their ﬁrst value, i is immediately

followed by another integer j such that 7i and rj have the same decision. El

5.2 All-Match Based Redundancy Removal

In this section, we ﬁrst present an algorithm for constructing all-111atch FDDS from

packet classiﬁers. Second, we present. a redundancy removal algorithm based on

Theorem 5.1.1. Third we rove that the resulting )acket classiﬁer does not have
7 C

any redundant rules.

5.2.1 The All-Match FDD Construction Algorithm

According to Theorem 5.1.1, in order to detect and remove redundant. rules in a
packet classifier, we ﬁrst need to construct an all-match FDD for that packet clas-
siﬁer. The pseudocode for the. all—111atch F DD construction algorithm is shown in
Algorithms 1 and 2.

Consider the packet classifier in Figure 5.1. The process of constructing the.

corresponding all—match FDD is shown in Figure 5.3.

39

 

Algorithm 1 All—Match FDD Construction Algorithm
Input: A packet classiﬁer f : (r1,712, - - -,rn).

 

Output: A all-match FDD t for packet classiﬁer f.

1: Build a path from rule 7‘1. Let. 1) denote the root. The label of the terminal node
is (1).

2: for i z: 2 to 72. do

3: APPEND( v, r21, 1, i);

4: end for

 

0 6

[1,5] - [1,5]

E e

[1.10] [1.41 [5,10]

 

 

 

 

Figure 5.3: Constructing an all-111atch FDD

5.2.2 The All-Match Based Redundancy Removal Algorithm

We ﬁrst introduce two auxiliary lists that are used in the all—match based redundancy

removal algorithm: containment list and residency list. Given an all—match FDD

40

 

Algorithm 2 APPEND

 

Input: A vertex v, a rule (F1 E 81) /\ - - - /\ (Fd E 84) —1 (dec), a depth m, and a

rule number i.

Output: 1) includes the rule (F1 E 51) /\ ~ - 1 /\ (Fd E 5(1) ——+ (dec) in its all-match

structure.

{F(v) 2 Fm and E(U)={€1,-1-,€k}}

1: ifmzd+1then

2: Add i to the end of v’s label.

3: return

4: else if (Sm -— (I(el) U - - - U I(ek))) +4 (0 then

5: Add an outgoing edge ek+1 with label Sm — (I(el) U - - - U I(ek)) to u;

6: Build a decision path from (Fm+1 E Sm+1) /\ - - - /\ (Fd E 5d) -—> (dec), and
make ek+1 point to the ﬁrst. node in this path;

7: Add i to the end of the label of the terminal node of this decision path;

8: end if

9: forj:=1t.okdo

10: if I(ej) C Sm then

11 APPEND(eJ-’s target,(F1 E 81) /\ - - - /\ (Fd E Sd) —> (dec),m+1,i);

12: end if

13: Add one outgoing edge e to v, and label e with I (ej) ﬂ Sm;

14: Make a copy of the subgraph rooted at the target node of ej, and make 6
points to the root of the copy;

15: Replace the label of ej by I(ej) — Sm;

16: APPEND(e’s target,(F1 E 81) /\ - - - /\ (Fd E 5(1) —1 (dec).m+1,i);

17’: return

18: end for

 

that. has 771 terminal nodes, we assign a. unique sequence number in [1,772.] to each

terminal node. In the containment. list. each entry consists of a terminal node

sequence number and the rule sequence numbers crmtained in the terminal node.

41

Terminal Rule #
node #

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

,._
1 1
f——'\ r———ﬂ
2 1,2,4 2 2
\____J m
3 3,4 3 3
(——\
4 4 4 1,2,3,4
Containment List Residency List

Figure 5.4: The containment. list. and the residency List for the all-match FDD in

Figure 5.2

In the residency list, each entry consists of a. rule sequence number and the set of
terminal nodes which contains this rule. The all-match list and the residency list
for the all-match FDD in Figure 5.2 are in Figure 5.4.

The all—match based redundancy removal algorithm works as follows. Given a
packet. classiﬁer f : (r1, r2, - - - , 7'”), this algorithm scans f from r-n to r1, and checks
whether each rule is redundant using Theorem 5.1.1. Whenever a rule is detected as
redundant, the rule is removed from f. The pseudocode of the algorithm is shown
in Algorithm 3.

Consider the packet classiﬁer in Figure 5.1 and its all-match FDD in Figure 5.2.
The all—111atch list and the residency list are in Figure 5.4. We, next demonstrate the
process of determining whether 7'4 is redundant using the all-Iriatch based redun-
dancy removal algorithm shown in Figure 3. From the residency list, we know that
rule r4 is contained in terminal nodes 1.2.3 and 4. 111 t(:1.rminal node 4. 7‘4 is the only
value, and thus is not redundant. Next, we check whether 7'3 is redundant. From
the. residency list. we know that. 73 is contained in terminal node 3. In terminal

node 3. the first. value is 3, and is innnediatt-1ly followed by a 4. and 7‘3 and 7'4 have

42

 

Algorithm 3 All—Match Based Redundancy Removal Algorithm
Input: A packet classiﬁer f : (r1, r2, - - - , rn), and an all-match FDD t for f.

 

Output: A packet classiﬁer f I where f E f ’ and there is no redundant rule in f ' .

1: Build the containment list ConList[1..m] from t.

2: Build the residency list ResList[1..n] from t.

3: for i := n to 1 do

4: redundant :2 true

5: for each terminal node sequence number tn in ResList[i] do

6: if i is the only value in ConList[tn]) or i is the ﬁrst value in Co-22,List[tn]
and the second value in ConList[tn], say j , satisﬁes the condition that r,-

and rj have different decisions then

7: redundant :2 false;
8: break;
9: end if

10: end for

11: if redundant then

12: remove rz- from f7;

13: for each terminal node sequence number in in ResLis-t[i] do
14: delete i from (YonListltnl;

15: end for

16: end if

17: end for

 

the same decision. According to the Theorem 5.1.1, r3 is redundant. Subsequently,
we remove r3 from the packet classiﬁer and delete 3 from the third entry of the
all-match list. In a similar fashion, we can further detect that r2 is redundant and

r1 is not. redundant. The resulting packet classiﬁer is shown in Figure 5.5.

43

F1 E [1, 5] /\ F2 E [1,10] —2 accept
F1 E [1, IO] /\ F2 E [1,10]—+ discard

Figure 5.5: The resulting packet classiﬁer after removing redundant rules from the

packet. classiﬁer in Figure 5.1

5.2.3 Proof of Complete Redundancy Removal

A packet classiﬁer redundancy removal algorithm is a complete redundancy removal
algorithm if and only if for any packet classiﬁer the algorithm produces a. semantically

equivalent packet classifier in which no rule is redundant.

Theorem 5.2.1. The All—Match Based Redundancy Removal Algorithm is a com-

plete redundancy removal algorithm. C]

Proof:

Let. f be a given packet classiﬁer and let t be an all—match FDD for f. Let f ' ' be
the resulting packet classiﬁer after running the all-match based redundancy removal
algorithm. Suppose f I I has a rule rz- that is redundant in f / I . Let f I be the resulting
packet classiﬁer after the algorithm has examined all the rules from ”+1 to r-n and
the redundant rules from n+1 to rn, has been removed. Because the algorithm does
not remove ri, 7",; is not redundant. in f’. According to Theorem 5.1.1, there is at

least a terminal node u that satisﬁes one of the following conditions:
1. this terminal node only contains i,

2. this terminal node has i as its ﬁrst value and i is inunediately followed by

another value j such that 7",; and rj have different (_'l(-?(_tlSlODS.

If U satisﬁes one of the two conditions, then u still satisﬁes that condition after the
algorithm removes all the redundant rules above Ti» because i will never be deleted
from ’12 according to the. algorithm. Therefore, r7; is not redundant. in f I ’ according

to Theorem 5.1.1. [:1

44

It is worth noting that the order from n to 1 in detecting redundant rules is
critical. If we choose another order, the algorithm may not. be able to guarantee
complete redundancy removal. Take the order from 1 to n as an example. When we
check whether rz- is redundant, suppose rz- is not redundant because there is one and
only one terminal node in the all-match F DD that has i as its ﬁrst value and i is
immediately followed by another value j such that r,- and rj have different decisions.
We further suppose j is immediately followed by another value k where 7‘2‘ and rk
have the same decision. After moving all the redundant rules after 71,-, j is possibly
removed from the terminal node and consequently rz- and rk become the ﬁrst two
values in the terminal node and they have the same decision. Thus, rz- becomes

redundant.

Chapter 6

Experimental Results

In this section, we evaluate the effectiveness of TCAM Razor and All-Match Re-
dundancy Removal on both real-life and synthetic packet classiﬁers. Note that in
cases where TCAM Razor cannot produce smaller packet classiﬁers than redundancy
removal alone, TCAM Razor will return the classifier produced by redundancy re-

moval. Thus, T CAM Razor always performs at least as well as redundancy ren'ioval.

6. 1 Methodology

We ﬁrst deﬁne the metrics that. we used to measure the effectiveness of TCAM Razor
and the redundancy removal technique by Liu and Gouda [18]. 111 this paragraph,
f denotes a packet classiﬁer, S denotes a set. of packet. classiﬁers, and A denotes
either TCAM Razor or the redundancy removal technique. We then let I f | denote
the number of rules in f, A( f ) denote the preﬁx classiﬁer produced by applying A
on f, and Direct( f) denote the preﬁx classiﬁer produced by applying direct range
expansion on f. We deﬁne the following four metrics for assessing the performance

of A on a set of classiﬁers S'.

o The average compression ratio of A over S =

(MN
Efesm

lSl

 

46

o The total compression ratio of A over S =

EfeSlAffll
Efengirect(f)|'

 

o The average expansion ratio of A over S 2

Wm
Efes m

ISI

 

 

o The total expansion ratio of A over S =

EfeSlAffll
EfeSlfl .

 

Variable Ordering

The variable order that. we used to convert a packet classiﬁer into an equivalent F DD
affects the effectiveness of TCAM Razor. There are 5! = 120 different permutations
of the ﬁve packet ﬁelds (source IP address, destination IP address, source port
number, destination port number, and protocol type).

A question that naturally arises is: which variable order achieves the best com-
pression ratio? To answer this question, for each pern‘iutation , we computed the
total compression ratio for TCAM Razor over RL. The maximum total compres-
sion ratio is 10.8% and the minimum total compression ratio is 7.7%. When we
use the best permutation for each classiﬁer, the total compression ratio is 6.7%.
Furthermore, the four permutations that start with the ﬁeld Destination IP address
and Source IP address have a total compression of less than 7.8%. Using the best
permutation from these four permutations results in a. total compression ratio of
7.5%. The best single permutation is the order (Destination IP address, Source IP
address, Source Port, Destination Port, Protocol) and has a total compression ratio

of 7.7%. We use this pernuitation to represent TCAM Razor in the experiment.

47

results, which is denoted Razor. When we report results using the best permuation

for each individual classiﬁer, we denote this algorithm as BRazor.

The next natural question to ask is: is this permutation the best order for most

packet classiﬁers?

The answer for BL is yes. In Figure 6.1(a), for each packet classiﬁer in RL, we

show the compression ratios of TCAM Razor, BRazor, and redundancy removal. The

results show that the single best permutation achieves almost the best compression

ratio for each packet classiﬁer group.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

10 500
.93 '- RR
‘6 g 80» I Razor 8 4000* t
:5 f5 , 1:1 BRazor g
9 0 60- c 3000
3 33 = 9 :
93$ 40» % 2 2000 '
‘5‘ 2 ' 3 l ,
° 20’ 1 u’i 1000 REE i
u -— ; ~
. _ 0 , W ,, 7 7 _ ..
0 10 15 20 25 0 10 15 20 25
Classifier Classifier
(a) (b)
Figure 6.1: Compression and Expansion ratios of RL
Compression Ratio Expansion Ratio
Average Total Average Total
RR Razor RR Razor RR Razor RR Razor
RL 42.4 % 23.3 % 32.5 % 7.7 ‘70 1373.4 % 60.8 ‘70 111.7 ‘70 26.5 ‘70
RLU 66.2 “/0 32.2 ‘70 56.9 ‘70 13.4 % 3217.0 % 153.3 ‘70 195.3 % 46.0 %
SYN 11.1 ‘70 6.8 ‘70 8.4 % 5.6 % 13.2 (70 8.1 ‘70 10.0 70 6.6 %
SYNU 44.1 % 42.8 ‘70 39.4 % 38.6 % 52.6 ‘70 50.9 ‘70 47.0 ‘70 46.0 ‘70

 

 

 

 

 

 

 

 

 

 

 

Table 6.1: Statistics for experimental classiﬁers

Compression Ratio

Table 6.1 demonstrate that TCAM Razor outperforms just redundancy removal.

Figure 6. 1 (a) shows that only 7 out 25 classifiers so no improvement over redundancy

48

removal; however, BRazor does achieve an improvenumt over redui‘idancy removal
for four of these classiﬁers.

We can also see that both TCAM Razor and redundancy removal are less effec-
tive on RLU due to each rule having unique decisions. TCAM Razor is still twice
as effective as redundancy on RLU, which demonstrates that TCAM Razor remains
twice effective for RL as the number of unique decisions becomes large. The results
for the synthetic classiﬁers show that TCAM Razor is an improvement over redun-
dancy removal and that synthetic generation technique produces a large number of

redundant rules.

Expansion Ratio

Table 6.1 and Figure 6.1(a) demonstrate similar results for Razor, BRazor and re-
dundancy removal. Note that. twelve classiﬁers are signiﬁcantly effected by range
expansion. and for these classiﬁers TCAM Razor is highly effective and combating

range expansion.

6.2 Comparison with other techniques

It is difﬁcult. to compare our results directly with those of Deng ct at. [5] because
we do not have access to their programs or the packet classiﬁers they experimented
with. However. Razor has a total compression ratio of 7.70},- on our real—life packet
classiﬁers. IIi contrast. Dong et al. reported a total compression ratio of 54‘ c on

their real-life packet classiﬁers.

49

Part II

New Architectural Approaches

50

New architectural approaches seek to modify how the TCAM based packet classi-
ﬁers operate in order to improve efficiency. We propose two approaches: sequential
decomposition and topological transformation. Sequential decomposition decom-
poses a single d—ﬁeld packet classiﬁcation TCAM lookup into a sequence of d l-ﬁeld
TCAM lookups. Topological transformations provide methods to translate the do—
main of each packet ﬁeld into a more efﬁcient representation. Both techniques allow
for the efﬁcient utilization of TCAM space. These techniques mitigate the effects of
range; however, they have the unique advantage that they ﬁnd optimizations beyond

range expansion. This advantage allows for sublinear compression.

Chapter 7

Sequential Decomposition

The problem that sequential decomposition tries to solve in this chapter can be
stated intuitively as follows: how can we squeeze the most information possible into
TCAM chips with little or no perfownance loss? Solving this problem helps to ad-
dress almost all limitations of TCAMs. Given a packet classiﬁer, if we can represent.
it using fewer bits, we can use a smaller TCAM chip, which will result in lower
power consumption, less heat generation. less board space, and lower hardware cost.
Furthermore, reducing the number of TCAM rules results in less power consumption
and heat generation because the energy consumed by a TCAM grows linearly with
the number of ternary rules it stores [33]

Sequential decomposition is composed of four algorithmic approaches to re-
thinking and redesigning TCAM-based packet classiﬁcation systems: multi-lookup,
pipelined-lookup, packing, and table consolidation. These approaches move beyond
the traditional paradigm that performs a. single lookup on a single TCAM for each
search key. The following two observations of information redundancy and a ternary
search key, which have mostly been ignored in prior work, form the theoretical basis

for our new approaches.

Information Redundancy Inforn'iation stored in TCAMS tends to have high
redundancy from an information theory perspective. Speciﬁ(‘:ally, we observe that

the same ternary string for a speciﬁc ﬁeld may be repetitively stored in I'nultiple

TCAM entries.
Figure 7.1(a), the strings 001, 010, and 100 from the ﬁrst ﬁeld are each stored three

For example, in the simple two-dimensional packet classiﬁer in

times in the TCAM, and the strings 001, 010, and *** from the second ﬁeld are each
stored three times in the TCAM as well. Such information redundancy is primarily
due to the multi—dimensional nature of packet classification rules. One source of

information redundancy is range expansion in two or more ﬁelds.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Single-lookup Multi-lookup
001 ,001 accept t1
001,010 accept w. log 001 t;
001,*** discard 010 t2
010,001 accept 100 t2
010,010 accept w. log ** t3
010,*** discard t2
100,001 accept 001 accept
100,010 accept w. log 010 accept w. 10g
100,*** discard an: discard
***,*** discard w. log t3
*** discard w. log
(a) (b)

Figure 7.1: Reducing information redundancy

Ternary Search Key A TCAM chip typically has a built—in Global Mask Register
(GMR) that supports ternary search keys. The GMR of a TCAM chip contains a
bit. mask that speciﬁes which bit columns in the chip participate in a search. For
example, in a TCAM chip that contains two entries 1010 and 0100, if the search
key is 0101 and the GMR speciﬁes that only the ﬁrst two columns participate in
this search, the TCAM chip will return that. the lookup key matches the second

entry 0100. In essence, the GMR allows the user to specify the search key in ternary

format. In the above example, the GMR transforms the. search key 0101 into 01**.

The GMR opens new opportunities for further improving TCAM space efﬁciency.
Intuitively, the GMR allows multiple lookup tables to be packed into one TCAM
chip where the GMR can be used at run time to dynamically select the right table
to search.

The multi—lookup approach is based on three key observations. First, breaking a
multi-dimensional packet classiﬁer into multiple one-dimensional classiﬁers greatly
reduces information redundancy in TCAMs. Second, multiple lookup tables can
co—reside in TCAM as long as extra bits are set to distinguish them. Third, a
search key can be segmented into multiple search keys where each is searched in a
one-dimensional classiﬁer. Breaking the two-dimensional classiﬁer in Figure 7.1(a)
results in the three one-dimensional classiﬁers in Figure 7.1(b). Although in this
particular example, the number of entries in the original single—lookup table is only
two more than the total number of entries in the multi—lookup tables, the savings
will increse signiﬁcantly as the repetition in each ﬁeld increases. Furthermore, note
that the width of the multi—lookup tables is much smaller than that of the single
lookup table.

The space efficiency achieved by the multi-lookup approach comes with the price
of more clock cycles to perform each search. Our pipelined—lookup approach speeds
up the multi-lookup approach by pipelining the multiple lookups using multiple
TCAM chips. Interestingly, the pipelined-lookup approach achieves even higher
packet classiﬁcation throughput than the traditional single—lookup approach because
the narrower TCAM entries now ﬁt on the data bus.

The packing approach is based on the following three observations. First, TCAM
chips have limited conﬁgurability on their width. This prevents us from conﬁguring
the TCAM width to exactly the table width, which could cause a signiﬁcant number
of bits in each TCAM entry to be unused. Second, the multi-lookup and pipelined-
lookup approaches produce “thin” tables of varying width. Third, search keys for
TCAM chips can. be ternary, which allows TCAM columns to be dynamically selected
for each lookup. The basic idea of the packing approach is that multiple tables can

be placed within the same TCAM entries. These tables will be distinguished by the

GMR.

In addition to the packing approach, table consolidation allows one TCAM table
to store multiple classiﬁers efﬁciently at the expense of extra SRAM. Table consol-
idation is based on the two observations. First, TCAM is far more expensive and
consumes much more power than SRAM; it makes sense to use a large SRAM with a
small TCAM rather than a small SRAM with a large TCAM. Second, semantically
different TCAM tables may share common entries, and transferring this redundancy

to SRAM removes the information redundancy from the more expensive TCAM.

7 .1 Multi-Lookup Approach

Prior work and current practice have assumed the use of a single-lookup for TCAM
based systems. we observe that relaxing this assumption could yield unexpected
savings on TCAM space with minor throughput degradation. In this section, we
propose a multi-lookup approach to redesigning TCAM based systems. We present
two algorithms to support this approach: an algorithm for constructing a. multi-

lookup table and an algorithm for processing packets.

7.1.1 Constructing Multi—lookup Table

The algorithm for constructing a multi-lookup TCAM table from a given packet
classiﬁer consists of the following four steps, which are illustrated in Figure 7.2 and
Figure 7.3: (1) F DD Construction: Constructing a tree—like representation, called
a Firewall Decision Diagram (FDD), of the packet.- classiﬁer. (2) F DD Reduction:
Reducing the size of the FDD. (‘3) Table Generation: Generating a TCAM table
from each nonterminal node in the reduced F DD. (4) Table Meiyence: Merging the

generated TCAM tables into a single multi-lookup TCAM table.

FDD Construction

To generate a multi—lookup TCAM table, we ﬁrst convert a. given packet classiﬁer

to an equivalent ﬁrewall decision diagram.Figure 7.2( b) shows the FDD constructed

Ci"!
CJI

 

 

F1 F 2 Decision
000 000 accept
000 1 11 accept
110 000 accept
1 10 1 1 I accept
0 10 0* * accept
100 0* * accept

** 1**

 

accept

** *** discard

 

 

 

 

 

 

 

 

ll FDD Construction

 

 

011101111
000 110 010 100
000 001 000 001 0" 1" 0“ 1"? 0"
01* 01'
1* *
1“ 1010 ‘11 11010

 

 

a a d Lallalldl laHdHaHdla d

 

 

l) FDD Reduction

 

 

 

 

 

Figure 7.2: FDD generation and reduction

from the packet classiﬁer in Figure 7.2(a).

56

[I

1) Table Generation

 

 

000 a
111

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

ll Table Mergence

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

000 01 f 00000 01
010 10 00010 10
00 10010 "NM 00100 10
110 01
11 00110 01
00*" 11
000 a 01000 a
01 1‘18 ‘--- 01111 a (b)
m d 01... d
.0 139:: f;
d
- 111" a
11 1“ a L,-—"‘[11m d
d l

 

 

 

 

 

 

Figure 7.3: The h-"Iulti—lookup scheme
FDD Reduction

Reduction is an important step in reducing the total number of TCAM entries in the
ﬁnal multi-lookup table because the reduction step reduces the number of nonter-
minal nodes, which consequently reduces the number of TCAM entries generated.
Figure 7.2(c) shows the resultant FDD after FDD reduction. A brute force algo-
rithm FDD reduction algorithm can be found in [9]; however, we provide a more.

efﬁcient reduction algorithm in Section 7.6.

Table Generation

Suppose the reduced F DD has n nonterminal nodes. Consider any nonterminal
node v. Since v is complete with respect to its labeled ﬁeld, we can view v as a
one-dimensional packet classiﬁer in which its outgoing edges point to its classiﬁer’s
decisions. We will construct a corresponding TCAM table Table(v) for each non—
terminal node v in the FDD, and we assign a unique I D in the range 0 to n — 1 to
both v and T able(v). We will refer to ID as both node v’s ID and table Table(v)’s
ID. The meaning should be clear from context. For example, the IDs for the four
nonterminal nodes in Figure 7.3(a) are 00, 01, 10, and 11.

For any table t, we deﬁne its height h(t) to be the number of entries in t, and
we deﬁne its width w(t) to be the number of bits in each TCAM entry. In the
single lookup approach, most people assume w(t) = 144 because the ﬁve packet
ﬁelds require 104 bits. Using the multi-lm1kup approach, we will show we can make
w(t) = 72.

We generate Table(v) in two steps. We ﬁrst generate a correct packet classiﬁer
by making one entry for each preﬁx on each edge. That is, for each of v’s outgoing
edges e from v to v' and for each preﬁx p on edge e, we generate a rule r as follows:
the predicate of r is p; if v’ is a terminal node, the decision of r is the label of v'; if v,
is a nonterminal node, the decision of r is the ID of v’. We then minimize the number
of TCAM entries in Table(v) by using an optimal, polynomial-time algorithm for
minimizing one-dimensional preﬁx packet classiﬁers [27]. Figure 7.3(a) shows the

four minimal TCAM tables that correspond to each of the four nonterminal nodes

in the FDD.

Table Mergence

The ﬁnal step is to merge all n TCAM tables into a single Imllti-lookup table. For
every nonterminal node v, we prepend v‘s ID to the predicate of each rule in Table(v).
Since each table ID provides a unique signature that distinguishes that table‘s entries
from all other table entries, all n tables can be merged into a. single n’mlti-lookup

table. Figure 7.3(b) shows the resultant multi—lookup table from merging the four

58

TCAM tables in Figure 7.3(a).

7.1.2 Packet Processing

After the multi-lookup TCAM table is built for a d-din‘iensional packet. classiﬁer, the
decision for a d-dimensional packet (p1,...,pd) can be found by d searches on the
TCAM. The ﬁrst search key kl is formed by concatenating the root node’s ID and
p1. Let f (131) denote the search result of 11:1. The second search key, k2 is formed by
concatenating f (131) and 172. This process continues until we compute f (kd), which
is the decision for the packet. For example, given the two dimensional multi—lookup
table in Figure 7.2(e) and a packet (010, 001), the ﬁrst search key is 00010, which
returns 10. The second search key is 10010, which returns accept as the decision for

the packet.

7.1.3 Analysis

We analyze the impact of the 11'uilt.i-lookup approach on TCAM space and packet

classiﬁcation throughput.

Space

We deﬁne the space used by a packet classiﬁer in a TCAM chip as the number of

classiﬁer entries or rules multiplied by the width of the TCAM chip in bits:
space 2 # of entries x TCAM width

Given a packet classiﬁer f, let Single(f) denote the resulting single lookup TCAM
table and let Multi(f) denote the resulting multi—lcmkup TCAM table. It follows
that. width w(Single( f )) = 144 because the table must accommodate the 104 bits in
the ﬁve packet ﬁelds, and the number of entries is h(Si/iglc( f )) Thus, the number
of bits required by the single lookup approach is h(S/fnglc(f)) x 144. On the other
hand, we can safely set width u.r(111'u.lti(f)) = 72. This follows as the maximum

width of the ﬁve packet ﬁelds is 32, which leaves 40 bits for storing a table. ID and

optionally, the decision. This is more than sufﬁcient for any realistic TCAM for the
forseeable future. Thus, the number of bits required by the multi-lookup approach

is h(Multi( f )) x 72. The multi-lookup table starts with a 50% reduction in width.

Throughput

Based on the above analysis that there are at least 40 bits to store table IDs plus
the decision, there is sufﬁcient space to store the decision for each rule in the TCAM
entry and still have each TCAM entry ﬁt within the 72 bits of the typical TCAM bus
width. Thus, it will require two bus cycles to process each packet ﬁeld: one cycle to
send the search key and one cycle to perform the search and return the result. Given
there are ﬁve packet ﬁelds that need to be processed, the total packet processing
time will require ten TCAM bus cycles. The single lookup approach requires either
four TCAM bus cycles or ﬁve TCAM bus cycles to ﬁnd the decision for a packet:
four bus cycles if the decision is stored in TCAM, five bus cycles if the decision is
stored in SRAM. Note that the overall packet processing throughput for the multi-
lookup approach may actually be closer to the packet processing throughput of the
single lookup approach because TCAM lookup is normally not the bottleneck of
such systems; instead other operations such as moving a packet in and out of queues
are the real bottlenecks, so taking a few more bus cycles to process a packet may

not have a signiﬁcant impact on throughput.

7 .2 Pipelined-lookup Approach

The multi-lookup approach is an effective method for reducing TCAM space needed
for packet classiﬁers. However, this reduction in space reduces packet classiﬁcation
throughput by requiring multiple lookups on a single TCAM chip. In this section,
we present our pipelined-lookup approach, which improves packet throughput. by
using one TCAM chip for each ﬁeld. That is. we will use ﬁve TCAM chips where,
for 1 S i g 5. chip i stores table t,- which is the merger of all tables of F,- nodes.

Having one merged table per ﬁeld in a separate TCAM chip enables us to pipeline

60

the multiple lookups needed for processing each packet. Surprisingly, the pipelined-
lookup approach can be four to ﬁve times faster than the traditional single-lookup
approach. Furthermore, separating the tables from different ﬁelds yields new op—
portunities to save bits. The result is that while more TCAM chips are needed, the
pipelined-lookup approach can be even more space efﬁcient than the multi-lookup
approach. Next, we present the technical details of the pipelined-lookup approach
to redesigning TCAM based systems. In particular, we present two algorithms to
support this approach: an algorithm for constructing a sequence of d TCAM tables

and an algorithm for processing packets.

7.2.1 Pipelined-Table Construction

Our algorithm for constructing a sequence of d tables t1 through t5 consists of four
steps. The first two steps are FDD construction and F DD reduction. which are
similar to the ﬁrst two steps in the multi-lookup approach. The last two steps, table

generation and table mergence, require some modiﬁcations as described below.

Table Generation

This step differs from the table generation step in the multi—lookup approach in
assigning node IDs in the constructed F DD. Here, we assign each node an ID that
can uniquely discriminate that node from all other nodes of the same ﬁeld. Let m,-
be the number of nodes with label F,- in the constructed FDD. The ID assigned to
each F,- node consists of [log mi] bits. For example, the IDs of the three F2 nodes
in Figure 7.4(a) are 00, 01, and 10. In contrast, in the table generation step of
the multi-lookup approach, the ID assigned to each node with label Fi consists of
[log(Zgl':1 771,-)] bits. Note that each F, node. has a unique ID in the context of
17,-. We also observe that for ﬁeld F1, there will always be a single table. Therefore,
we do not need an ID to distinguish this table from tables of other ﬁelds. In the

remainder of this section, we assume no ID is needed for the F1 table.

61

Table Mergence

Similar to the table mergence step in the multi-lookup approach, for every nontermi-
nal node v, we ﬁrst prepend v’s ID to each rule in Table(v). Second, for every ﬁeld
Fi, we combine all tables of Pi nodes into one table ti- For example, Figure 7.4(b)

shows the two pipelined TCAM tables generated from the F DD in Figure 7.4(a).

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

000 a f 000 00
111 010 01
m 100 01
110 00
Hi 10
, (a)
[1“ a
0“ I.“ d
it Table Mergence
000 01 f 000 00 l
010 10 010 01
1:, 10010 ----->( 100 01 it,
110 01 110 00
*1". 11 \ ,“ 10 J
000 a 00000 a ) i
F2: 00 111 a ‘-— 00111 a ())
d 00*" d
o" a ___ 010" a >1,
13:01 d 01... d
[1" 101" a
Fzz10 I“ Z —" 10*" d

 

 

 

 

 

 

 

 

 

 

Figure 7.4: Table generation and table mergence in the pipelined-lookup approach

7.2.2 Packet Processing

Similar to the multi-lookup approach, in the pipelined-lookup approach, a til-dimensional

packet search is separated into d searches; however. with the pipelined-lookup ap-

62

proach, these searches can be pipelined. That is, the d TCAM chips are chained
together into a pipeline such that the search result of the i-th chip is part of the
search key for the (i + 1)-th chip, and the result of the last chip is the decision for
the packet. With such a chain, d packets can be processed in parallel in the pipeline.

Figure 7.5 illustrates the packet processing algorithm for the two tables t1 and
t2 in Figure 7.4(b). Suppose two packets (010,001) and (111,010) arrive one after
the other. When (010, 001) arrives, the ﬁrst. search key, 010, is formed and sent to
t1 while the rest of the the packet (001) is forwarded to t2. When the next packet
(111, 010) arrives, table t1 has sent the search result 01 to table t2. When the ﬁrst
search key for the second packet. 111 is formed, the second search key for the ﬁrst.
packet 01001 is formed in parallel, and both are sent to tables t1 and t2, respectively.
This cycle will yield a result of accept for the ﬁrst packet, and a result of 10 for the

second packet. The above process continues for every received packet.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

(010,001) ——> 010 V
a
a
f 000 00 d
010 01 a
F,( 100 01 >t1 : t2
110 00
d
H 10 a
t J a
01 d
t I i

 

001 > 01001 accept

Figure 7.5: Example, of a. pipelined—lookup

7.2.3 Analysis

We next analyze the impact of the 1)ipelined—lookup approach on TCAM space. and

classiﬁcation throughput.

63

Space

The pipelined—lookup approach is at. least as space efﬁcient as the multi—lookup
approach, and there are many cases where it uses even fewer TCAM bits. We ﬁrst
observe that the pipelined-lookup approach generates the same number of TCAM
entries as the multi-lookup approach. That is, if t is the table formed by the multi-
lookup approach and t1 through t5 are the tables formed by the pipelined-lookup
approach, we have that h(t) = [3:1 h(t,;). The space savings of the pipelined—
lookup approach results from requiring fewer bits per entry. The ﬁrst opportunity
for saving space comes from the fact that the number of bits needed to encode a
table ID in the pipelined-lookup approach is less than or equal to that for the multi-
lookup approach as we only need to distinguish a table from other tables with the
same ﬁeld label. The second Opportunity for savings comes from the three ﬁelds
with width 16 or 8 bits. For these ﬁelds, we. may use width w(ti) = 36 whereas
w(t) = 72. Speciﬁcally, for the source port and destination port ﬁelds, we have
36 -— 16 = 20 bits to represent a table ID and optionally the decision of each rule.
For the protocol ﬁeld, we have 36 —— 8 = ‘28 bits for this purpose. For most classiﬁers,
these bits should suffice. In summary, the pipelined—lookup approach is at least as
space efﬁcient as its multi-lookup counterpart; fm'thermore, there are cases where it

is more space efﬁcient.

Throughput

The pipelined—lookup approach clearly leads to higher throughl‘mt than the multi-
lookup approach. More so, the pipelined—lookup approach actually achieves four or
ﬁve times higher throughput than the traditional single-lookup approach. In the
pipelined—lookup approach, because a search key can always be transmitted over
the bus in one cycle, the packet classiﬁcation throughput. is one packet per cycle.
In contrast, the packet classiﬁcation throughput. of the traditional single-l(')(_)kup

approach is one packet per four or ﬁve cycles.

64

7 .3 Packing Approach

In this section, we present the TCAM packing approach. This approach reduces
space consumption by allowing multiple rules from different TCAM tables to co-
reside in the same TCAM entry. The TCAM packing approach is orthogonal to the
multi—lookup and pipelined—lookup approaches in that it can be combined with the
two approaches to further improve TCAM space efﬁciency. The TCAM packing idea
is based on the following three key observations:

First, the reconﬁgurability of TCAM widths is limited. TCAM chips typically
only allow entry widths of 36, 72, 144, or 288 bits. This leads to wasted space as
typical TCAM tables rarely can be conﬁgured to exactly one of these widths. In the
standard single lookup approach, up to 40 bits might be unused given the predicate
will be 104 bits.

Second, the multi—lookup and pipelined—lookup approaches produce “thin” tables
of varying widths. We say the tables are thin because each table focuses on a single
packet ﬁeld. Thus, the table widths are much smaller than 104 bits. The widths vary
because the packet ﬁelds have different lengths: 8, 16, and 32, and these predicate
bits form a signiﬁcant fraction of each table entry. Having multiple ﬁelds of varying
widths provides opportunities to better approach the standard TCAM widths.

Third, the search key for TCAM chips can be ternary. In other words, TCAM
columns are dynamically selectable for each lookup. A typical TCAM chip has
a global mask register, which dynamically selects the columns that participate in a
lookup. The global mask register allows multiple entries from different lookup tables
to co—reside in the same TCAM entry without conflicting with each other.

We developed two TCAM packing schemes, which we call strict partitioning and

shadow packing.

7 .3.1 Strict Partitioning
Basic Strict Partitioning

The basic idea of strict partitioning is to divide a TCAM chip into multiple columns
and distribute multi-lookup or pipelined-lookup tables among these columns. The
distribution needs to satisfy the following two conditions: (1) all tables in the same
column must have different node IDs, (2) all the roles in a table are stored in only
one column. Multiple tables in the same column are discriminated by their table
ID. Multiple columns in the same TCAM chip are discriminated by the GMR of the
chip. The appropriate GMR can be selected by using a column [D which must have
enough bits to discriminate all the columns.

We illustrate the strict partitioning scheme using the four multi-lookup tables
in Figure 7.2(d) in a TCAM chip with entry width 21 bits. Figure 7.6 shows a
possible arrangement of the four tables: table 00 in column 1, table 01 in column 2,
and tables 10 and 11 in column 3. We use column IDs 00, 01, and 10 for columns
1, 2. and 3, respectively. We decode the ﬁrst entry 00 000 01’@01 in column 1 as
follows. The ﬁrst two bits 00 encode the table ID, the next three bits 000 are the
rule predicate, and the last four bits 01L®01 encode the rule decision where the ﬁrst
two bits encode the next table ID and the last two bits encode the column ID of
the next table. By partitioning the TCAM chip into three separate columns, the
TCAM chip is essentially divided into three TCAM chips. Lookups in the TCAM
chip are performed by padding the search key with the appropriate ternary bits via
the GMR. For example, to lookup 111 in table 10 of column 10, the lookup key is
***************10111*.

Storing Decisions: Because packing schemes can use previously unused bits in
the multi-lookup and pipelined—lookup approaches to store rules. storing the decision
of each rule in the TCAM entry is not space or cost effective. Therefore, for all our
packing schemes, we assume that the rule decisions are stored in SRAM. Figure 7.7
shows a version of the strict partitioning in Figure 7.6 with decisions stored in SRAM.

Packing multi-lookup tables vs. packing pipelined-lookup tables: Recall

66

 

TCAM chip
Col 00 C01 01 C01 10
00 000 01@01 01 000 a 10 0** a
00 010 10411710 01 111 a 10 *** d
00 100 10@10 01 *** d 11 1** a
00 110 OIIJQOI 11 *** d
00 *** IIQIO

 

 

 

 

 

 

 

Figure 7.6: Strict. partitioning of a TCAM chip

 

 

 

 

 

 

TCAM chip SRAM Table
Col 00 C01 01 C01 10 C01 1 Col 2 Col 3
00 000 01 000 10 0** 01@01 a a
00 010 01 111 10 *** 101310 a d
00 100 01 *** ll 1** 10115.1? 10 d a
()0 110 ll *** 01(0101 d
00 *** IIQIO

 

 

 

 

 

 

 

 

 

Figure 7.7: Decisions in SRAM

that all our packing schemes can be applied to both the multi—lookup and pipelined-
lookup approaches. The only difference is that when we pack multi-lookup tables
into one TCAM chip, we need to deal with tables of variable width. In contrast,
when we pack pipelined-lookup tables into d TCAM chips, we only deal with tables
of the same width for each chip, ignoring minor differences induced by table IDs.
Reassigning Table IDs With Fewer Bits: The original table. IDs were used
to distinguish a table from either all other tables (in the multi—lookup approach) or
all other tables of the same ﬁeld (in the pipelined-lookim amn'oach). However, in
a packing schen'ie, we only need to distinguish a table from the other tables in the
same column. Therefore, we can often use fewer bits for tables IDs. In our packing
schemes, after tables are allocated to (70111111118. we reassign table IDs using the least

number of needed bits, and the decisions for the rules have to be. updated to reflect

the new table IDs. Note that different columns may have different table ID widths,
and rule decisions may have different lengths. In the strict partitioning scheme, for
a column with 71 tables, the number of bits in the reassigned ID of each table is
[log n]. Figure 7.8 shows a version of strict partitioning in Figure 7.7 with table IDs

reassigned with fewer bits.

 

 

 

 

 

 

TCAM chip Decision Table
Col 00 C01 01 C01 10 C01 00 C01 01 C01 10
000 000 0 0** @01 a a
010 111 0 *** 0@10 a d
100 *** 1 1** O@1O d a
110 1 *** @01 d
*** 1’Q1O

 

 

 

 

 

 

 

 

 

 

Figure 7.8: Reassignincr the table IDs
O O 0

Processing Packets: We. describe the algorithm for processing packets un-
der the strict partitioning approach using examples. Suppose the given packet is
(000, 110), the first TCAM lookup is 000******* and the lookup result is the index
value of O. This index value is used to find entry 0 in the column 00 in the SRAM to
find the decision of iii-()1, which means that the second lookup should be performed
on column 01. To further perform the second lookup, the GMR is modifed to make
the second lookup key ***11()****. The result. of the second lookup is the index
value of 1, which means the decision is stored in the second entry of column 01 in
SRAM. The second entry of column 01 in SRAM is “(1.”, which means that the ﬁnal

decision for the given packet is accept.

Optimized Strict Partitioning

Given a set of TCAM tables to be packed in a single. TCAM chip. there are many
ways to do strict partitioning. First, we can choose the TCAM width to be. 36, 72,
144, or 288. Second. for each possible TCAM width, there are many possible ways

to divide the TCAM. Third, for each possible division of the TCAM, there are many

68

possible ways to allocate TCAM tables to columns. Among all possible strict parti-
tioning solutions for a TCAM chip, we want to ﬁnd the solution that uses the least
TCAM space under throughput constraints. Note that the classification throughput
may decrease as the TCAM width increases. We formally define the Strict Parti-
tioning Optimization Problem. as follows. We omit throughput constraints in the

definition for ease of presentation.

Deﬁnition 7.3.1. Given n TCAM tables t1, . . . , tn, ﬁnd a partition of the tables to

771 sets c1, . . . , cm that minimizes the objective TW X maxgll Rules(c,;) such that

. L211 C7: = {t1. . . . , in}
. 23:11(rnaxtjeatwcj>> + log 1cm s TW

where TW 6 {36,72,144,288}, w(tj) denotes the width of table tj, |tj| denotes the

number of rules within t j? and Rates(ci) denotes the total number of table entries in
Ci, which is théci |tj|.

The problem of makespan scheduling on multiple identical machines [8]. which
is NP—complete, is a special case of this problem. Thus, the strict partitioning
optimization problem is NP-complete. It belongs to NP because the solution can be

verified in polynomial time.

7.3.2 Shadow Packing

In strict partitioning. we. viewed columns as the 1:)1'imary dimension of TCAM chips
and sought to pack tables into fixed width columns. In shadow packing, on the
other hand, we view rows as the primary dimension of TCAM chips and seek to
pack tables within fixed height. rows. we consider shadow packing because of the
following two observations.

First, with strict partitioning. when tables of varying width are. allocated to the
same column, the number of bits assigned to each table I. is equal to Mr) X w(t’)
where t’ is the widest. table assigned to that. column. This leads to many unused bits

if tables of different widths are assigned to the same colunm. On the other hand,

69

horizontally packed tables can be placed next to each other as keeping the vertical
boundaries across multiple tables is unnecessary. Of course, there may be wasted
bits if tables of different heights are packed in the same column. We will allow tables
to be stacked in the same row if they ﬁt within the row boundaries.

Second, with strict partitioning, the table ID’s between tables in different columns
cannot be shared. Thus, the number of bits used for table IDs grows essentially lin-
early with the number of columns. On the other hand, horizontally aligned tables
in the same row can potentially share some “row ID” bits in their table IDs; these
tables would be distinguished by their horizontal offsets.

Based on the above two observations, we design the shadow packing scheme that
achieves more space efficiency by packing tables horizontally and allowing multiple
tables to share bits in their IDs. We first deﬁne the concept of shadowing for
table ID inheritance and the concept of shadow packing trees for representing table
ID inheritance. Then, we present the shadow packing algorithm and discuss the

procedure for processing packets with shadow packing.

Shadowing

In Figure 7.9(a), table to shadows tables ‘00 and t01. We deﬁne the concept of

shadowing as follows:

Deﬁnition 7.3.2 (Shadowing Relationship). For a table t stored in a TCAM, we use
VBegin(t) and VEnd (t) to denote the vertical indexes of the TCA M entries where the
table begins and ends respectively, and use HBegin(t) and HEnd (t) to denote horizon-
tal indexes of the TCA M bit columns where the table begins and ends respectively. For
any two tables t1 and t2 where [VBcgin.(t2), VEnd(t2)] (_Z [VBegin(t1), VEnd(t1)]
and HEnd(t1) < HBcgin(t2), we say t1 shadows t2.

When table t1 shadows t2, the ID of t1 can be reused as part. of 12’s ID. Suppose
table t shadows tables t1, - - - , tm. bees-ruse [’5 ID deﬁnes the vertical TCAM region
[Begin(t), End(t)], each t, (1 g i g m) can use t’s ID to distinguish t,- from tables

outside [Begin(t), End(t)] vertically, and use [log ml bits to distii‘iguish t,- from tables

70

inside [Begin( t,) End(t(t)])ve1tically. Horizontally, table t and each table t,- can be

distinguished by properly setting the GMR of the TCAIV’I.

Deﬁnition 7.3.3 (Shadow Packing). Given a region deﬁned vertically by [01,112]
and horizontally by [h1, ’22].- all tables completely contained within this region are
shadow packed if and only if there exist rn (rn>1) tables t1,~ -,tm in the region

such that the following three conditions hold:

1. 221: VBegin(t1), VEnd(t il+1 = VBegin(t,-+1)for1 S i S rn—l, VEnd(tm) g

(U2 ,0

2. no tables are allocated to the region deﬁned vertically by [VEnd(tm) + 1,122]
and horizontally by [h1, hgl;

3. for eachi (1 g i< m), the region d(ﬁ’lttd vertically by [VBcgin(t i), VEnd(t )]
and horizontally by [HEnd(t.,-)+1. [72] either has no tables or the tables allocated

to the region are also shadow packed.

For example, the tables in Figure 7.9(a) are shadow packed. Figure 7.9(b)
shows the tree representation of the shadowing relationship among the tables in

Figure 7.9(a).

0 t
t / 000
O 00

t / F’ tom
0
O \ ~k
/ 1 t01'_’ to1

x y two

01
t:1-"—"' t101

10 1:110

 

(a) (b)

Figure 7.9: Shadow packed tables &t shadow packing tree

Shadow Packing Algorithm

Given a. set of tables and a TCAM region, the shadow packing algorithm allocates
the tables into the region. The goal of a shadow packing algorithm is to minimize
the number of TCAM entries occupied by the tables, i.e., to minimize VEnd(tm).
We call this minimization problem the Shadow Packing Optimization Problem. This
problem becomes more difficult as we recurse because we must also address which
tables should be allocated to which region. Whether this problem can be solved in
polynomial time is an open problem.

In this chapter, we present a shadow packing algorithm SPack, which has been
shown to be effective in our experiments on real-life packet classiﬁers. The basic idea
of SPack is as follows. Given a set of tables S and a TCAM region, SPack ﬁrst ﬁnds
the tallest table t that will ﬁt in the region where ties are broken by choosing the
fattest table. SPack returns when there are no such tables. Otherwise, SPack places
t in the top left corner of the region, and SPack is recursively applied to S — {t} in
the region to the right of t. After that, let S’ be the set of tables in S that have not
yet been allocated. SPack is applied to S, in the region below t. Intuitively, SPack
greedily packs the tallest (and fattest) possible table horizontally. The pseudocode
of SPack is shown in Algorithm 4.

We, however, must compute the initial SPack region. The height of the initial
region is the total number of rules within the set of tables. We do not need to set
this value carefully because SPack only moves to another row when all the remaining
tables do not ﬁt in any of the current shadows. The width is more complicated and
must be computed iteratively. For each valid TCAM width w 6 {36,72, 144,288},
we set the initial width to be w and run SPack. Once we have a packing, we
determine the number of bits I) that. are needed for node IDs. If the packing could
accommodate these extra b bits, we are done. Otherwise, we choose an aggressive
backoff scheme by recursing with a width of w —— b. It is possible, particularly for
w = 36, that no solution will be found. To detern'iine which TCAM width we should
use, we choose the width w E {36, 72, 144, 288} whose final successful value resulted

in the fewest. number of entries. Note that. there are other possible strategies for

72

 

Algorithm 4 Shadow Packing (SPack)
Input: S : a set of tables, and a region [01,122], [h1, hg].

 

Output: 8': the set of tables in S that have not been packed.

1: Find the tallest table t E S that will ﬁt in [vbv2], [12.1,h2] such that ties are
broken by choosing the fattest table;

: if no table is found then

to

3: return S;

4: else

5: Place t in the top left corner of ['v1,v2], [h1, hg];

6: s” <— SPack(S', VBegin(t), VEnd(t), HEnd(t) + 1, 22.2);
7: return SPack(S",VEnd(t) + 1, '02, h1, hg);

8: end if

 

determining the width of the SPack regions; for instance, instead of reducing the
region width by b, the width could be reduced by 1. Furthermore, to speed up this
process, SPack can be modiﬁed to abort the packing once it detects that the table
packing and IDs can not ﬁt within the region.

Reassignng Table IDS and Rule Decisions: Because shadow packing estab-
lishes a hierarchy of table IDs, each table needs a new ID, and all the rule decisions
need to be remapped to reflect these new IDs. Each table ID is determined by a
tree representation similar to the one found in Figure 7.9(b), which we call a shadow
packing tree. For each node v in a shadow packing tree, if v has m > 1 outgoing
edges, each outgoing edge is uniquely labeled using llogrn) bits; if v has only one
outgoing edge, that edge is labeled *. For each table t, let u be the corresponding
node in the shadow packing tree. All the bits along the path from the root to v
are all the bits needed to distinguish t from all other tables. Note that. the * corre-
sponds to a table where no additional ID bits are needed. In our shadow packing
algorithm, we reserve l bit columns in the. TCAM where l is the maximum number
of bits needed to the distinguish a table. Reserving some bit columns for storing

table IDs has the advantage of simplifying the processing of packets since the bit

columns containing the table IDs are ﬁxed in the TCAM.

Figure 7.10(a) shows the shadow packing tree for the four tables in Figure 7.2(d)
and their reassigned table IDs. Figure 7.10(b) shows the ﬁnal memory layout in
the TCAM chip after shadow packing and the conceptual memory layout of the
decision table within SRAM. The one bit ID column in Figure 7.10(b) is needed
to distinguish between the tables with original IDs 01 and 11. Note that table 10
shares the table ID 0 with table 01 as it is the only table in table 01’s shadow. To
make the decision table in Figure 7.10(b) easier to understand, we encode it in a

memory inefﬁcient manner using columns.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Table ID Reassignment
0 1:01;. tlo Table ID
. —*"’ too 00 * (a)
l\ tn 01 O
10 0
11 1
ll Construct tables
TCAM chip Decision Table
0 123 456 789 Col 00 C01 01 C01 10
O 000 000 0** 0@4 : 01 a a
0 010 111 *** oer: 10 a d (b)
O 100 *** 0@7: 10 d
1 110 1** 0@4:10 a
1 *** *** 1@4 :01 d

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Figure 7.10: The shadow packing process.

Processing Packets: We describe the algorithm for processing packets under
the shadow packing approach using examples. Given a packet (000,111), the ﬁrst
TCAM lookup is *000******, and the lookup result is the index value of 0. This
index value is used to ﬁnd entry 0 in the column 00 in the SRAM which contains

the decision of 0:024 : ()1. The 01334 means that the second lookup key should occur

in table ID 0 at horizontal offset of 4, and the 01 means that decision of the next
search is located in column 01 in SRAM. To perform the second lookup, the GMR
is modiﬁed to make the second lookup key 0***111***. The result of the second
lookup is the index value of 1, and the decision stored in the second entry of column

01 in SRAM is retrieved, which is accept.

7 .3.3 Strict Partitioning vs. Shadow Packing

We now compare the space efﬁciency of strict partitioning and shadow packing. The
sole advantage of strict partitioning is that it has no horizontal boundaries. On the
other hand, shadow packing has two key advantages. It has no vertical boundaries,
and tables in the same row can share some table ID bits. Furthermore, we mitigate
some of the disadvantage of horizontal boundaries by greedily packing tables in the

shadow of other tables.

7 .4 Table Consolidation

The basic idea of table consolidation is to use one TCAM table to represent multiple
TCAM tables. Table consolidation is motivated by the following two observations.
First, two TCAM tables may share common entries, which result in the same infor—
mation being stored multiple times. Second, existing TCAM-based packet classiﬁ—
cation systems are based on a “fat” TCAM and “thin” SRAM architecture, which
means that the majority of the information (i.e., the predicates of rules) representing
a packet classiﬁer is stored in TCAMS and little information (i.e., the decision of
rules) is stored in SRAMs. However, because TCAMs are much more expensive than
SRAMs, we ideally would store more infm'n‘iation in SRAMs and less information in
TCAMS.

We begin with two new concepts: hi-dCCISIOn rule and kr~decision classifier. A
k-decision rule is a classiﬁcation rule whose decision is an array of k decisions.
A [st—decision classiﬁer is a sequence of k-decision rules following the ﬁrst-match

semantics. W'e formally define the table consolidation problem as follows:

Deﬁnition 7.4.1 (Table Consolidation Problem). Given k I-decision classiﬁers
C1,--°,(Ck, ﬁnd a k-decision classiﬁer (C such that for any i {1 S i S k), the
condition (C,- E G[i] holds.

We emphasize that a h-decision classiﬁer can be viewed as a l—decision classiﬁer if
we view the array of k decisions of each rule as one decision. In general, a k-(IBCISIOII
classiﬁer can be viewed as a k’-decision classiﬁer where k, < k, if we treat some.

decisions as one decision.

7.4.1 Table Consolidation Algorithm

We use multi—match FDDs to facilitate table consolidation. A multi-match F DD sat—
isﬁes all the properties of an all—match FDD except the condition j 6 {21, i2, - - - , ik}
in the 5th property.

Our table consolidation algorithm works as follows. First, given a set of k clas-
siﬁers C1,-- ~,<Ck., we concatenate the set of classiﬁers into one classiﬁer C1_ k =
C1| - - - ICk. Second, we construct a multi—match FDD f from ([3143 such that f
satisﬁes the following additional condition: for any decision path ’P, the terminal
node of ’P consists of k numbers (7721,7712, - - - , 771k} where, for each i (1 S i S 1:),
rule rmi is the ﬁrst rule in (C,- that contains 73. In other words, for any decision
path ’P in the multi-match FDD f and for any classiﬁer (Ci: only the index of the
ﬁrst rule in (Ci that contains P is included in the label of ”P’s terminal node. Third,
after the multi-match FDD f is constructed, we run the TCAM Razor algorithm
presented in [20] on f and generate the final compact k-decision classiﬁer G. F ig-
ure 7.11 shows the process of consolidating two TCAM tables t1 and t2. The final
2-decision classiﬁer is TCAM table t4. The correctness of the table consolidation

algorithm is based on Theorem 7.4.1.

Theorem 7 .4.1. Given k I-decision classiﬁers C1, - - - ,(Ck, the table consolidation
algorithm generates a k-decision classifier (C where for any i (1 S i S k), the

condition Ci E (7'[i] holds.

76

 

 

 

 

 

 

 

 

 

 

 

 

 

 

F, F2 decision W table F, F2 decision
t] 0 * 3 t1 0 * a

* 0 a t, * 0 a

* * d t, v: * (1 t3

+ l => t2 1 * d

F, F2 decision ,2 * , (1
t2 1 * d * a

* 1 d

9: * a

 

 

 

 

 

 

 

F, F2 decision
0 0 a, a
l l d, d
a, d

 

 

 

 

 

 

 

 

Figure 7.11: Consolidation of 2-dimensional tables

7.4.2 Hierarchical Table Consolidation

In building a multi-match FDD from a set of classiﬁers, each classiﬁer inﬂuences the
shape of the FDD and causes the FDD to grow. To localize the expansion impact
of one classiﬁer on others, we propose the following hierarchical table consolidation
strategy. Given k classiﬁers, ﬁrst, we equally divide them into [ls/m] buckets where
every bucket has m classiﬁers except for one bucket that may have less than m clas-
siﬁers. Second, for the classiﬁers in each bucket, we apply the table consolidation
algorithm and get an m-decision classiﬁer. Thus. we get [ls/ml classiﬁers. By treat-
ing each m—decision classiﬁer as a l-decision classiﬁer, we apply the above process
again on the [ls/ml classiﬁers. This process repeats until we get the ﬁnal k-decision
classiﬁer. We refer to the strategy of choosing m = k as flat table consolidation.

Theorem 7.4.2 establishes the correctness of hierarchical table consolidation.

Theorem 7.4.2 (Hierarchical Table Consolidation Theorem). Given the same input

77

of k J-decision classiﬁers and the same table consolidation algorithm, the strategies
of hierarchical table consolidation and ﬂat table consolidation output the same k-

decision classiﬁer.

 

 

 

 

o ,601‘t—vd, t 001 °
oo1'o1o'->d2 {9° 010 0
0 ‘H Vii/”d: 1”,... 100 0

iii 1

 

 

 

 

_ 001 000 .L_
—>
I 0 1601 \ d1 TCAM for root

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

{010 ,'->d2
\* **’_’d => V0‘*-. _/ V1 =>
" 3 " 3 3 '
\ T'" __

o ooh-m, 001 0'1 ___,j i
100’:010"_,d2 010 d2 : um d4 I
100 \“'\“”*I—>d3 *** d3 TCAM for V,
*** ***—>d TCAMfOl'VO

’ 4
(a) Classiﬁer (b) Constructing FDD and tables
F1: 010 0 010 0 *t 1 _001
100 o 100 o ‘ l
.__, ,
_1_ 1 001 0 [001 d1 d4 I
0 1 010 0
1
001 I001 d. d. s F. 100 0 a. 5,9,0 :2 :4}
F :0 010 |o1o d2 d4| , 3 ‘
1“” Jan" d3 d4 t2 _‘J l
. d1
..~,. 1- a; L—

 

 

 

 

 

 

 

 

(c) Table Consolidation (d) Pipelined lookup process

Figure 7.12: Table consolidation for pipelined-lookup sequential decomposition

78

7.4.3 TCAM/SRAM Space Tradeoff via Bounded Consoli-

dation

Table consolidation creates a tradeoff between TCAM storage and SRAM storage.
On one hand, merging multiple classiﬁers into one classiﬁer results in less TCAM
storage. On the other hand, each entry in the resulting classiﬁer requires more
SRAM for storing the decision list. We could simply merge all classiﬁers into a
single classiﬁer; however, this may require more SRAM space than what is avail-
able. To address this issue, we propose the following bounded consolidation scheme.
The basic idea of bounded consolidation is to limit the number of classiﬁers that
we combine. Given a set of k l-decision classiﬁers, we ﬁrst sort the classiﬁers in
decreasing (or increasing) order according to size (i.e., the number of rules in each
classiﬁer). Second, we partition the sorted classiﬁers into [Ir/ml chunks where the
ﬁrst [kt/ml -— 1 chunks are of uniform size 772. (1 S m g A). Third, for every chunk,
we apply the table consolidation algorithm to the classiﬁers that it contains. Finally,

we get. [ls/ml multi-decision classiﬁers. Note that m is an adjustable parameter.

7 .5 One-Dimensional Table Consolidation

While table consolidation can be applied to d—dirnensional classiﬁers for arbitrary d,
we show that it is especially effective for consolidating one-dimensional classiﬁers.
In particular, Theorem 7.5.1 shows that table consolidation is guaranteed to reduce
TCAM space occupied when applied to one-din'iensional classiﬁers. Table consolida—
tion’s effectiveness on one-dimensional classiﬁers implies that. it is especially effective
when combined with sequential decomposition to minimize the space required by any

single classiﬁer. W’e defer the proof of Theorem 7.5.1 to the appendix.

Theorem 7.5.1. Given any set of k7 I-decision I-dimension(Ll (.rlo..<;siﬁ(::rs C1, - - - , (Ck:
the [sf-decision 1-dimensional classiﬁer (C output by the. TCAM distillation algorithm
satisﬁes the following condition: [C] S I(l'll + - - - + I('A.| — A: + l.

79

7.5.1 Table Consolidation with Sequential Decomposition

Because table consolidation is guaranteed to work well on one-dimensional classiﬁers,
table consolidation can be integrated with the sequential decomposition scheme to
further reduce the space required by a TCAM—based packet classiﬁer. Basically,
in the third step of pipelined-lookup sequential decomposition, for each ﬁeld 17,-,
we group all the tables of F,- nodes and apply our table consolidation algorithm to
produce a multi-decision table rather than using table ID’s. For example, given the
two tables of the two F2 nodes ”0 and v1, our table consolidation algorithm outputs
the 2-decision table t2 in Figure 7.12(c). In this case, table consolidation reduced
the required number of TCAM entries from 4 to 3.

The packet lookup process on the d multi-decision TCAM tables proceeds as
follows. The ﬁrst table has only one column of decisions and each decision is the
decision column index of the second table. Similarly, each decision in the i-th (1 g
i < d) table is the decision column index of the next table. We illustrate the packet.
lookup process using the example in Figure 7.12(d). Given a packet (010,001), we
ﬁrst use 010 to search the table t1 and get result 0. Second, we use 001 to search
table t2 and get. search result (d1, d4). The ﬁnal result is the ﬁrst element of (d1, d4),
which is d1. Note that each TCAM chip has its own SRAM. Therefore, the packet

lookup process can be pipelined to achieve the throughput of one packet per cycle.

7.5.2 Coping with More Fields than TCAM chips

If we have fewer TCAM chips than packet ﬁelds, we have two choices. One is
to strategically combine selected ﬁelds into one ﬁeld until the number of ﬁelds is
equal to the number of TCAM chips. We can use the FDD structure to facilitate
combination of multiple ﬁelds. For example, given a. d—dimensional classiﬁer over
ﬁelds 171,172, - - - , Fd and the corresponding FDD, we can combine the ﬁrst. let (1 <
k: < d) ﬁelds into one ﬁeld by treating the subgraph rooted at. each Fk+1 node as a.
terminal node and then use our TCAM Razor algorithm presented in [20] to generate

a kT-din'iensional table for the k-dimensional F DD. Then, for each subgraph rooted

80

at an F k +1 node, we can apply the above process recursively.

A second choice is to employ the sequential decomposition multi—lookup approach
where we store tables from multiple ﬁelds on a single TCAM chip. For example,
if we have d dimensions and q chips, we would store tables from either [d/ql or
(d/qj ﬁelds in each chip in the pipeline. Each stage of the pipeline would require
either [d/ql or [d/qj lookups. This would result in a classiﬁcation throughput of

one packet per [d/ql TCAM bus cycles.

7 .6 Implementation Issues

7 .6.1 Fast FDD Reduction

F DD reduction is an effective way to reduce the number of multi-lookup tables and
pipelined-lookup tables generated from an FDD. A brute force deep comparison
algorithm for FDD reduction was proposed in [9]. Here we use a more efﬁcient.
FDD reduction algorithm that uses signatures to speed up node comparisons. The
algorithm works as follows.

The algorithm processes the nodes level by level starting from the terminal nodes.
At the current level, we compute a signature for each node. For a terminal node v.
set. o’s signature to be its label. For a non-terminal node v, suppose v has A? children
121, vk, in increasing order of signature (Sig(-o,j) < Sig(o,+1) for 1 g i g k. — l),
and the edge between 1) and its child u,- is labeled with E, a sequence of non-
overlapping preﬁxes in increasing order. Set the signature of node v as follows:
Sig(v) = h.(Sig(ol), E1, . - -, Sigh/7,), Ek) where h is a hash function.

After we have assigned signatures to all nodes at a given level, we search for
isomorphic subgraphs as follows. For every pair of nodes o, and U]- (1 S i 3A j S A)
at. this level, if Sig(u,;) 31$ Sig(ej), then we can conclude that o,- and ”j are not
isomorphic; otherwise, we explicitly determine if 2),: and DJ are isomorphic. If u,- and
'o]- are isomorphic, we delete node oj and its outgoing edges. and redirect. all the
edges that point to o.)- to point to 12,. Further, we eliminate double edges between

node 1),- and its parents.

81

7.6.2 TCAM Update

Packet classiﬁcation rules periodically need to be updated. The common practice
for updating rules is to run two TCAMS in tandem where one TCAM is used while
the other is updated [16]. All our approaches are compatible with this current
practice. Because our algorithms are efﬁcient (running in milliseconds) and the
resultant TCAM lookup tables are small, updating TCAM tables can be efﬁciently
performed.

If an application requires very frequent rule update (at. a frequency less than a
second, for example), we can handle such updates in a batch manner by chaining the
TCAM chips in our proposed architecture after a TCAM chip of normal width (144
bits), which we call the “hot” TCAM chip. When a new rule comes, we add the rule
to the top of the hot TCAM chip. When a packet comes, we ﬁrst use the packet as
the key to search in the hot chip. If the packet has a match in the hot chip, then the
decision of the ﬁrst matching rule is the decision of the packet. Otherwise, we feed
the packet to the TCAM chips in our architecture described as above to ﬁnd the
decision for the packet. Although the lookup on the hot TCAM chip adds a constant
delay to per packet latency, the throughput. can be much improved by pipelining the
hot chip with other TCAM chips. Using batch updating, only when the hot. chip
is about to ﬁll up, we need to run our topological transformation algorithms to

recornpute the TCAM lookup tables.

7.6.3 Non-ordered FDDs

Recall that the FDD construction algorithm that. we used produces ordered F DDs,
that is, in each decision path all ﬁelds appear in the same order. However, ordered
FDDs may not. be the smallest. when compared to non-ordered FDDs. The F DD
construction algorithm can be easily modiﬁed so that different. subtrees may use
different. ﬁeld ordering. By adding ﬁeld information to the decisions in each table
entry, we can easily accommodate different ﬁeld orderings. Thus. the packet pro—

cessing algorithm for both nuilti~lookup and pipelined~lookup can select. the correct

82

ﬁeld for each lookup. The size advantage of non—ordered FDDs comes at the cost.
that FDD reduction will not be able to process subtrees that have different ﬁeld
orders. Nevertheless, the use of non—ordered F DDs does open new possibilities for

further optimizations.

7 .6.4 Lookup Short Circuiting

So far, we have assumed the use of full-length F DDs where in each decision path
all ﬁelds appear exactly once. Actually, this constraint can be relaxed so that some
paths may omit unnecessary ﬁelds when a node in the path contains only one outgo-
ing edge. In this case, the node along with singleton outgoing edge can be pruned.
Using F DDS that are not full—length has the advantage of reducing FDD size and
consequently reducing the total number of tables. Furthermore, this optimization al-
lows some speciﬁc decision paths to be performed with a reduced number of lookups,
which will allow for faster packet processing when the tables are processed in a multi-
lookup fashion. Therefore, we call this optimization technique lookup short circuit-
ing. Similar to the use of non—ordered FDDs, this optimization technique requires

storing ﬁeld information in the decisions.

7 .6.5 Best Variable Ordering

In converting a packet classiﬁer to an equivalent FDD, the order of the ﬁelds used
by decision paths has a signiﬁcant impact on the size of the resulting FDD. Given
that fewer nodes in an F DD normally lead to a smaller multi—lookup or pipelined-
lookup table, choosing a good variable order (i.e., ﬁeld order) is important in FDD
construction. Given a packet classiﬁer that has ﬁve ﬁelds, we can easily try all 5! =

120 permutations to ﬁnd the best. permutation for that particular packet classiﬁer.

7 .7 Experimental Results

In this section, we evaluate the effectiveness and efﬁciency of our sequential decom-

position approach, as well as our three optimizations of lookup 1')i1_)elining. TCAM

83

 

 

kih'mbo’o'wks

  
  

 

Compression Ratio
0 O O O O H H H

00
N
H
O
H
N

0 2 4 6 8 10 12 14 4 6
Classifier Classiﬁer

(a) (b)

 

 

 

 

 

 

'N'A'aaoa'o'nks

 

Compression Ratio
0 O O O O H H H

00

6 8 10 12 14 4 6
Classifier Classiﬁer

(C) ((1)

Figure 7.13: Compression ratio for each classiﬁer in RL.

packing, and table consolidation, on both real—life and synthetic classiﬁers.

7.7.1 Methodology

We ﬁrst deﬁne the metrics for measuring the effectiveness of our algorithms. Let
C denote a classiﬁer, 5' denote a set of classiﬁers, |S| denote the number of classi—
ﬁers in S, and A denote an algorithm. We use A(C') and Direct(C) to denote the
number of TCAM bits used for classiﬁer C by algorithm A and direct expansion,
respectively. We use A(C')R and Direct(C) to denote the number of SRAM bits
used by algorithm A and direct expansion, respectively. Vi'e deﬁne the Compression
ratio of algorithm A on C as

. v A (')
CRatzo(A.C) = m

84

 

 

 

 

 

 

 

 

 

 

 

 

   

02468101214 6
Classiﬁer Classiﬁer

(a) (b)

 

 

 

 

 

 

 

 

 

 

 

 

 

16 r
314- .MC
£12 IMcs~
510- UPC
.9
a 8- IP05 4
‘1’ .
. a 6
i- E 4L .4
' O
. ': , _ ‘ .. U 2L
0.000 4 6 8 101214 00 2 4 6 8 10 12
Classiﬁer Classiﬁer
(C) ((0

Figure 7.14: Compression ratio for each classiﬁer in RLU.

the average compression ratio of algorithm A on a set of classiﬁers S as

—' ZCES CRati0(/l, C)
A( )= ISI .

 

and the total compression ratio of algorithm A on S as

A“. _ EGGS/4(0)
A(S) _ ZCES Direct(C)'

Likewise we deﬁne similar metrics for measuring the compression ratio for SRAM

space. We deﬁne the SRAM/TCAM ratio of A on C as

 

A(Cln

STRatio(A,C) = A(C') .

We denote the respective average and total SRAM/TCAM ratio as

 

_ ZCES STR(Ll’l0(/l, (1')

A(Sla S

 

 

and

"T/ _ EGGS/“(7)12
.4(S)R— ZCESA(C) ,

 

respectively.

We use All to denote the multi—lookup approach, and use P to denote the
pipelined lookup algorithm. We annotate M and P with subscripts C and S if
the table consolidation and shadow packing optimizations are used. For example,
M S denotes multi—lookup with the shadow packing optimization while PC S denotes
the pipelined lookup with the optimizations of shadow packing and table consolida-
tion. Given three optimizations, we study 8 variants (M, INC, 1113, JUGS, P, PC,
PS, and PCS) in total and compare them with results from TCAM Razor (Razor)
and direct expansion (DE). For the optimization technique of table consolidation, in
our experiments, we performed bounded table consolidation with bound 4 because
it has been found to be an effective trade off between TCAM and SRAM space.
Note that when table consolidation and shadow packing are used together, table
consolidation must be performed before shadow packing.

The variable order that we used to convert a classiﬁer to an equivalent. FDD
affects the number of tables generated by our algorithms, which consequently affects
the TCAM space efficiency. There are 5! = 120 different pernuitations of the ﬁve
packet ﬁelds (source IP address, destination IP address, source port number, desti-
nation port number. and protocol type). For HL, we investigate the effectiveness of
the 5! = 120 perrmitations. A question that naturally arises is: which variable order
achieves the best. (or close to best.) compression ratio for all our algoritlnns‘? By
computing the average compression ratios of our algorithms on Illa, we discovered
that the best order is (Destination IP, Source IP, Destination Port. Protocol, Source
Port); we label this as permutation 31402. For any of our algorithms, the difference
between the average cmnparison ratio achieved by 31402 and that achieved by the

best. order for that particular algorithm is no more than 0.005.

86

7 .7.2 Effectiveness

Table 7.1 shows the average and total compression ratios of TCAM Razor and our
8 variants 1W, 1113, P, PS, MC, [1105, PC, and PCS, all using permutation 31402,
on the four classiﬁer sets of RL, RLU, SYN, SYNU. we focus our discussion on
average compression ratios; we primarily use total compression ratios to facilitate
comparison of our techniques with other approaches. Figures 7.13(a) and 7.13(b)
show the compression ratios of each of the 25 classiﬁers in RL for the four algorithms
of Ill, 1115, P, and PS. Similarly, Figures 7.13(c) and 7.13(d) show the compression
ratios of each of the 25 classiﬁers in RL for the four algorithms of 1110, MCS? PC,
and PCS. For illustration purposes, we show 13 classiﬁers with low compression

ratios in Figures 7.13(a) and 7.13(c) and the remaining 12 classiﬁers in Figures

7.13(b) and 7.13(d).

 

A] AIS P PS 1110 11105 PC PCS Razor

 

12L

0.171 0.062 0.127 0.086 0.109 0.048 0.083 0.074 0.245

 

RIJU

1.642 0.147 0.928 0.245 0.558 0.082 0.327 0.128 0.319

 

SYN

0.028 0.007 0.016 0.010 0.021 0.008 0.013 0.012 0.083

 

SYNU

0.263 0.016 0.139 0.021 0.112 0.014 0.062 0.021 0.408

 

average compression ratio

 

 

Ill 1113 P PS IUC [1105 PC PCS Razor

 

RL

0.052 0.018 0.036 0.024 0.029 0.010 0.021 0.017 0.088

 

RLU

0.649 0.045 0.351 0.060 0.210 0.020 0.116 0.029 0.131

 

SYN

0.022 0.003 0.013 0.004 0.015 0.002 0.009 0.003 0.061

 

SYNU

0.254 0.013 0.133 0.014 0.106 0.009 0.058 0.010 0.382

 

 

 

total compression ratio

 

Table 7.1: Compression ratios

87

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

M MS P PS MC AICS PC PCS Razor
RL 1.504 5.138 1.504 7.471 2.840 15.631 2.840 24.537 0.245
RLU 3.926 2.861 3.926 4.830 4.697 7.301 4.697 10.573 0.319
SYN 0.268 0.645 0.268 0.975 0.614 2.707 0.614 4.123 0.083
SYNU 0.526 0.215 0.526 0.301 0.893 0.810 0.893 1.186 0.408
average ratio
M AIS P PS NC 11105 PC PCS Razor
RL 0.513 1.614 0.513 2.146 0.871 3.421 0.871 5.305 0.088
RLU 2.071 0.964 2.071 1.273 2.594 1.658 2.594 2.396 0.131
SYN 0.293 0.326 0.293 0.432 0.589 0.784 0.589 1.345 0.061
SYNU 0.508 0.157 0.508 0.171 0.850 0.452 0.850 0.510 0.382
total ratio
Table 7.2: SRAM compression ratios
ll] MS P PS MC MCS PC PCS DE
RL 0.054 0.660 0.077 0.660 0.178 2.489 0.249 2.489 0.008
RLU 0.096 0.829 0.170 0.829 0.358 3.209 0.612 3.209 0.044
SYN 0.070 0.723 0.120 0.723 0.216 2.642 0.362 2.642 0.007
SYNU 0.172 1.131 0.325 1.131 0.687 4.525 1.240 4.525 0.086
average ratio
11] AIS P PS MC MCS PC PCS DE
EL 0.086 0.800 0.123 0.776 0.259 2.849 0.362 2.755 0.009
RLU 0.146 0.976 0.270 0.976 0.568 3.842 1.024 3.778 0.046
SYN 0.091 0.801 0.159 0.791 0.272 2.761 0.466 2.767 0.007
SYNU 0.200 1.242 0.381 1.236 0.800 4.935 1.473 4.902 0.100
total ratio

 

 

 

Table 7.3: SRAM/TCAM ratios

88

 

 

 

 

Effectiveness of Each Optimization Technique

we have a total of 32 average compression ratio data points in Table 7.1 for our
new techniques. For each of our three optimization techniques, we can pair together
two data points to determine if the optimization improves space compression or not.
For example, we can compare m with and W to determine if shadow
packing improves the performance of multi-lookup on the RL data set. For each
optimization technique, this leads to 16 comparison pairs.

The ﬁrst important insight is that shadow packing always improve space com-
pression, and the typical reduction is quite large. Table consolidation improves space
compression for 13 of the 16 pairs; for the other three pairs, table consolidation either
has no effect or slightly degrades compression. Note, this may change if we used a
different bound than 4. Finally, the lookup pipelining optimization improves space
compression for the eight pairs that do not include shadow packing. However, for
the eight pairs that do include shadow packing, it. actually hurts space compression,
sometimes quite signiﬁcantly.

The experimental results show that shadow packing is the most important space
compression optimization. The improvement gained from implementing shadow
packing is almost always quite signiﬁcant ranging between 0.06 and 0.92 with an
average of 0.38. The improvement gained from implementing table consolidation
ranges between 0.34 and 1.2 with an average of 0.71. Given that the lookup pipelin—
ing optimization does not. perform well in combination with shadow packing, we
should not use the pipelining optimization unless the packet processing throughput
improvement is signiﬁcant enough to warrant the extra expense of multiple TCAM

chips.

Best variant for space compression

The best overall variant for space compression is 1)./CS, multi-lookup with the
shadow packing and table consolidation optimizations. The only data set where
MOS is not the best variant is the SYN dataset. where M S slightly outperforms

MOS- Note again that. with a. different bound than 4. AIC S might outperform ill S

89

on this data set.

Effectiveness on Classiﬁers with Many Distinct Decisions

The effectiveness of our algorithms is relatively insensitive to the number of distinct
decisions in packet classiﬁers. Even with the extreme examples of RLU and SYNU
where every rule has a distinct decision, our best variant 1110 S achieves average
compression ratios of 0.082 and 0.014, respectively. There are two main reasons that
explain the insensitivity of our algorithms to the number of distinct decisions in a
classiﬁer. First, through sequential decomposition, we still produce relatively short
and thin tables that can be efﬁciently shadow packed. Second, table consolidation
neutralizes some of the effects of distinct decisions as entries in different tables with

different decisions can be consolidated together.

SRAM usage

Our techniques reduce the number of required TCAM bits by using inexpensive
SRAM space to replace expensive TCAM space. Table 7.2 shows the average and
total SRAM compression ratios for all eight algorithms. For real-life classiﬁers, our
techniques typically increase the amount of SRAM space used. There are four main
reasons for this increase in SRAM space. The ﬁrst is the possible increase in the
total number of “thin” rules. The second is the increase in the number of bits
needed to store decisions, since a decision in our algorithms is often a table ID. For
fast indexing purposes, our algorithms require a uniform number of bits per TCAM
entry, which is essentially the maximum number of bits to encode a decision. The
third reason is the unused SRAM space for fast indexing shadow packed tables. With
shadow packing, the total number of required SRAM bits is the number of TCAM
entries times the maximum number of tables packed in a row times the maximum
number of bits needed to encode a. decision. The fourth reason is ascribed to table.
consolidation due to reasons similar to shadow packing. From Table 7.2, we can see
that. shadow packing and table consolidation signiﬁcantly increase SRAM usage.

However, SRAM (‘zompression ratios do not account for the fact that SRAMs are

90

often under-utilized by the traditional single-lookup scheme. That is. the number
of SRAM bits needed to store a classiﬁer are signiﬁcantly less than the number of
needed TCAM bits. Table 7.3 shows the SRAM/TCAM ratios for all eight algo—
rithms as well as direct expansion. From this table, we can see that all the algo-
rithms, except 111105 and PCS, use less SRAM bits than TCAM bits on average.
Even for [MC S and PCS, having an SRAM chip that is 4 times larger than the corre-
sponding TCAM chip is sufficient. Given the relative costs and power consumption
of SRAM and TCAM chips, this is a good tradeoff. Furthermore, in our experiments

on real-life classiﬁers, no classiﬁer demands more than 0.6 Mb of SRAM.

Comparing with the state-of-the—art

It is difficult to directly compare our results with those in other prior work, such as
Dong et al.’s work [5], Lakshminarayan et al.’s DIRPE scheme [15], and Bremler-
Barr and Hendler’s SRGE scheme [4]. First, the real-life classiﬁer sets are different
and are privately owned. Second, the evaluation metrics are not the same. For
example. Dong et al.’s [5] only reported a total compression ratio of 0.54 while
the range encoding papers only reported expansion ratio results. Fortunately, the
expansion ratios reported in prior range encoding schemes can be used to extrapolate
their total compression ratios. The expansion ratio of an algorithm on a classiﬁer is
deﬁned as the ratio between the number preﬁx rules produced by the algorithm for
the classiﬁer and the number of original rules in the classiﬁer. Thus. the compression
ratio of an algorithm can be calculated by the ratio between the expansion ratio of
the algorithm and the expansion ratio of direct expansion. For the best—known range
encoding method, i.e., hybrid-SRGE [4], we calculated its total compression ratio to
be 0.396. As we can see, our methods achieve signiﬁcantly better total compression

ratios, albeit on different data sets.

91

Chapter 8

Topological Transformations

One approach for mitigating the effects of range expansion has been to reencode
critical ranges. The basic idea is to reencode a given packet and use the reencoded
packet as the TCAM search key. For instance, Liu [19], Lunteren and Engbern [31],
and Pao et al. [21] all proposed methods of representing speciﬁc ranges as special bit-
strings using extra TCAM bits. Lakshminarayan et al. [15] and Bremler-Barr and
Hendler [4] proposed to replace the preﬁx encoding format with alternative ternary
encoding formats, called DIRPE and SRGE, respectively.

Previous reencoding schemes suffer from two fundamental limitations. First, they
only consider range ﬁelds and ignore all other ﬁelds; thus, they miss many optimiza-
tion opportunities that can be applied to preﬁx ﬁelds as well. It was not realized
that packet classiﬁers often have the potential of being minimized in TCAM even
when no ﬁelds are speciﬁed in ranges. Second, they require either con'iputationally
or economically expensive reencoding steps that do not easily integrate into exist-
ing packet classiﬁcation systems. As each packet needs to be reencoded before it
can be used as a search key, previous range reencoding schemes propose to perform
packet reencoding using software, which greatly increases packet processing time, or
customized hardware, which is expensive from a design, cost, and implementation
perspective.

In this chapter, we take two novel views on range reencoding that are funda-

mentally different from previous range reencoding schemes. First, we view range

92

reencoding as a topological transforn’iation process from one colored hyperrectangle
to another. W'hereas previous range reencoding schemes only deal with range ﬁelds,
we perform reencoding on every packet ﬁeld. Speciﬁcally, we propose two orthogo—
nal, yet composable, reencoding schemes: domain compression and preﬁx alignment.
In domain compression, we transform a given colored hyperrectangle, which repre—
sents the semantics of a given classiﬁer, to the smallest possible “equivalent” colored
hyperrectangle. In preﬁx alignment, on the other hand, we strive to transform
a colored hyperrectangle to an equivalent “preﬁx-friendly” colored hyperrectangle
where the ranges align well with preﬁx boundaries, minimizing the costs of range
expansion. Second, we view range reencoding as a classiﬁcation process that can
be implemented with small TCAM tables. Thus, while a preprocessing step is still
required, it can be easily integrated into existing packet classiﬁcation systems using
the same underlying TCAM technology. Furthermore, implementing our schemes on
a. pipeline of TCAM chips even increases packet classiﬁcation throughput. because
our schemes enable the use of TCAM chips of small width.

Domain Compression: The fundamental observation is that in most packet
classiﬁers, many coordinates (i.e., values) within a ﬁeld domain are equivalent. The
idea of domain compression is to reencode the domain so as to eliminate as many re-
dundant coordinates as possible. This type of reduction not only leads to fewer rules,
but also narrower rules, which results in smaller TCAM tables. From a geometric
perspective, domain compression “squeezes” a colored hyperrectangle as much as
possible. For example, consider the colored rectangle in Figure 8.1(A) that repre-
sents the classiﬁer in Figure 8.1(H). In ﬁeld F1 represented by the X—axis, all values in
[0, 7] U [66, 99] are equivalent; that is, for any y 6 F2 and any 1171,12 E [0, 7] U [66, 99]
, packets (.1‘1,y) and (£2.11) have the same decision. Therefore, when reencoding
F1, we can map all values in [0,7] U [66,99] to a single value, say 0. By identifying
such equivalences along all dimensions, the rectangle in Figure 8.1(A) is reenc.o(ile(_l
to the one in Figure 8.1(D), whose corresponding classiﬁer is shown in Figure 8.1(I).
Figures 8.1(B) and (C) show the two transforming tables for [’1 and F2. respec-

st

tively; note that these tables can be implemented as TCAM tables. We use a as

93

a shorthand for “accept” and “d” as a shorthand for “discard”.

E uivalent class
I 47””:1 l

 

 

     

 

 

 

 

 

 

 

 

0 : ' :
6 .__-L_ ______ 0
15 .-- l _____ Domain 0 Preﬁx 1
75.-- _____ Compression 1 Ali nment 2
[0, 7|=>0 [0, 51=>o 2 10.01=>0 [0.0] =>03
5“ g [8, l9]=>1[6, 141=>1 0 1 2 [1,11=>2 u,1|=>2 3 1 2 3
99 l m [20,44] =>2 [15.75|=>2 (D) [2,21=>3 [2,2|=>3 (G)
3 20 44 5 99 [45,65] =>1 [76,88] =>1 (E) (F)
(A) [66,99] =>0 [39.99) =>o
[8, 65]/\[15,75|=>a (B) (C) [1, 2]/\[2,21 =>a l2,3|l\[3,3]=>a
[20,44]/\|6, 88] =>a [2. 2|/\[1,2] =>a [3,3|/\[2,3]=>a
[0, 99mm, 99|=>d [0, 2mm, 2] =>d [0,3]/\[0,3|=>d
(I) (J)

Figure 8.1: Example of topological transformations

Preﬁx Alignment: The basic idea. of preﬁx alignment is to “shift”, “shrink”, or
“stretch” ranges by transforming the domain of each ﬁeld to a new “preﬁx—friendly”
domain so that the majority of the reencoded ranges either are preﬁxes or can be
expressed by a. small number of preﬁxes. In other words, we want to transform
a colored hyperrectangle to another one where the ranges align well with preﬁx
boundaries. This will reduce the costs of range expansion. For example, consider the
packet classiﬁer in Figure 8.1(1), whose corresponding rectangle is in Figure 8.1(D).
Range expansion will yield 5 preﬁx rules because interval [1. 2] or [01. 10] cannot be
combined into a preﬁx. However. by transforming the rectangle in Figure 8.1(D) to
the one in Figure 8.1(G), the range expansion of the resulting classiﬁer. as shown
in Figure 8.1(J). will have 3 preﬁx rules because [2,3] is expanded to 1*. Figures
8.1(D) and (E) show the two transforming tables for F1 and F2, respectively. Again,
these tables can be implemented in TC AM.

We implemented our algorithms and conducted experiments on both real-world
and synthetic packet classiﬁers. Our experimental results show that our combined
algorithm on real-world classiﬁers achieves an average expansion ratio of 0.07 ex-
cluding TCAM space. for transformers, and an expansion ratio of 0.21 including

TCAM space for transformers. This is much better than the best result. in prior

94

work, the database dependent scheme hybrid-SRGE proposed by Bremler—Barr and
Hendler [4], where the average expansion ratio is 1.03 (this ﬁgure ignores the cost of
the reencoding process and is thus most comparable to our 0.07 average expansion

ratio).

8.1 Topological Transformation

The basic idea of our transformation approach is to transform a given packet classiﬁer
into another classiﬁer that can be stored more efﬁciently in TCAM. Furthermore, we
need a transformer that can take any packet and transform it into a new packet that
is then used as the search key on the transformed classiﬁer. Of course, the decision
that the transformed classiﬁer makes for the transformed packet must be the same
as the decision that the original classiﬁer makes for the original packet. We also
require that the transformer itself be a packet classiﬁer that can be efﬁciently stored
in TCAM. This is one of the features that differentiates our work from previous
reencoding approaches.

More formally, given a d—dimensional packet. classiﬁer (C over ﬁelds 171,- - - ,Fd,
a topological transformation process produces two separate components. The ﬁrst
component is a set of transfomzers T = {'ll‘i | 1 g 2' S d} where transformer T2-
transforms D(F,-) into a new domain D’(FZ-). Together, the set of transformers T
transforms the original packet space E into a new packet space 23'. The second
component is a transformed (ll-dimensional classiﬁer (C’ over packet space 23’ such

that for any packet (p1, - ‘ - ,pd) E E, the following condition holds:

C(pli ' ' ° 3])(1) : C’(T1(P1)WHT(I(P(1»

a

Each of the d transformers T,- and the transformed packet classiﬁer (C, are imple-

mented in TCAM.

TCAM Chip
01............

01”,! 0 01"T1”"—>P1’—~

01............
10............ “1

10..
10|1>2 —0—~ T2 ——>P2'—w

10.."........

00............
00.. C’

00............

 

 

 

 

00w Ir;

 

 

 

 

 

Figure 8.2: Multi-lookup architecture

8. 1.1 Architectures

There are two possible architectures for storing the (1+1 TCAM tables (Cf, T1, - - - , 'll‘d:
the maria—lookup architecture and the pipelined-lookup architecture, each of which is
described below.

In the multi-lookup architecture, which is similar to the multi-lookup architecture
in Section 7.1, we store all the d+ 1 tables in one TCAM chip. To identify tables, for
each table, we prepend a [log(d + 1)] bit string, which we call the table ID, to every
entry in the table. Figure 8.2 illustrates the packet classiﬁcation process using the
Inulti-lookup architecture when d = 2. Suppose the three tables are (C’, T1, and T2,
and their table IDs are 00, 01, and 10, respectively. Given a packet (p1, p2), we ﬁrst
concatenate 'll‘l’s table ID 01 with p1 and use the resulting bit string Ollpl as the
search key for the TCAM. Let p1, denote the search result. Second, we concatenate
'll‘g’s table ID 10 with p2 and use the resulting bit string 10] [)2 as the search key for
the TCAM. Let pg’ denote the search result. Third, we concatenate the table ID 00
of (C, with m, and p2,, and use the resulting bit string ()()]])1']])2' as the search key
for the TCAM. The search result is the ﬁnal decision for the given packet (7)1,pg).

We recommend two pipelined-lookup architectures for implementing our trans-

formation approaches: parallel I_)ipelined-looki1p and chained pipelined—looku]). In

96

    
 

. ecision

Figure 8.3: Parallel pipelined-lookup architecture

TCAM TCAM TCAM
1 [ﬂ] 1—.
P P P’le P’1|P’2 - , decision

T1 T2 - C —->

 

 

 

 

 

 

 

 

Figure 8.4: Chained pipelined—lookup architecture

both architectures, we store each of the d + 1 tables in separate TCAMs. As one
TCAM stores only one table, we do not need to prepend table entries with table IDs
for either approach. In the parallel pipelined—lookup architecture, the d transformer
tables T, laid out in parallel, form a two-element pipeline with the transformed clas-
siﬁer C’. Figure 8.3 illustrates the packet classiﬁcation process using the parallel
pipelined-lookup architecture when d = 2. Given a packet (p1,p2), we send p1 and
p2, in parallel over separate buses, to T1 and T2, respectively. Then, the search
result. 191’po, is used as a key to search on (C’. This second search result is the ﬁnal
decision for the given packet (p1,p2).

The (d + 1)-stage chained pipelined-lookup architecture is similar to the pre-
viously proposed pipelined-lookup architecture. Figure 8.4 illustrates the packet
classiﬁcation process using the chained pipelined-lookup architecture when d = 2.

In comparison with the pipelined-lookup architecture, The main advantage of the
multi-lookup architecture is that it can be easily deployed since it requires minimal
modiﬁcation of existing TCAM-based packet processing systems. Its main drawback
is a. modest. slowdown in packet processing throughput because (1 +1 TCAM searches
are required to process a d—dimensional packet. In contrast, the main advantage of
the two pipelined-lookup architectures is high packet processing throughput. Their

main drawback is that the hardware needs to be modiﬁed to accommodate (1+ 1

97

TCAM chips.

Implementing our reencoding schemes on pipelined—lookup architectures actually
improves packet processing throughput over conventional TCAM implementations.
While the width of TCAM entries can be set to 36, 72, 144, or 288 bits, the typical
TCAM bus width is 72 bits. Thus the conventional TCAM lookup approach, which
uses a TCAM entry width of 144 bits, requires either four or ﬁve TCAM bus cycles to
process a packet: four bus cycles if the decision is stored in TCAM, ﬁve bus cycles if
the decision is stored in SRAM. Because all the tables produced by our reencoding
schemes have width less than 36 bits, we can set TCAM entry width to be 36.
Thus, using pipelined-lookup architectures, our reencoding approaches achieve a
classiﬁcation throughput of one packet per cycle; using multi-lookup architectures,
our reencoding approaches achieve a classiﬁcation throughput of one packet per

twelve cycles.

8.1.2 Measuring TCAM space

The TCAM space needed by our transformation approach is measured by the total
TCAM space needed by the (1 +1 tables: C’, T1, - ' - , Td. We deﬁne the space used by
a classiﬁer or transformer in a TCAM as the number of entries (i.e., rules) multiplied

by the width of the TCAM in bits:
space 2 # of entries X TCAM width

Although TCAMs can be conﬁgured with varying widths, they do not allow arbitrary
widths. The width of a TCAM typically can be set at 36, 72, 144, and 288 bits (per
entry). The primary goal of the transformation approach is to produce C’, T1, - - - , Td
such that. the TCAM space needed by these (1 + 1 TCAM tables is much smaller
than the TCAM space needed by the original classiﬁer (C. Note that most previous
reencoding approaches ignore the space required by the. transformers and only focus

on the space required by the transformed classiﬁer (C’.

98

8.1.3 TCAM Update

Packet classiﬁcation rules periodically need to be updated. The common practice for
updating rules is to run two TCAMs in tandem where one TCAM is used while the
other is updated [16]. All our approaches are compatible with this current practice.
Because our algorithms are efﬁcient and the resultant TCAM lookup tables are
small, updating TCAM tables can be efﬁciently performed.

If an application requires very frequent rule update (at a frequency less than a
second, for example), we can handle such updates in a batch manner by chainng the
TCAM chips in our proposed architecture after a TCAM chip of normal width (144
bits), which we call the “hot” TCAM chip. When a new rule comes, we add the rule
to the top of the hot TCAM chip. When a packet comes, we ﬁrst use the packet as
the key to search in the hot chip. If the packet has a match in the hot chip, then the
decision of the ﬁrst matching rule is the decision of the packet. Otherwise, we feed
the packet to the TCAM chips in our architecture described as above to ﬁnd the
decision for the packet. Although the lookup on the hot TCAM chip adds a constant
delay to per packet latency, the throughput can be much improved by pipelining the
hot chip with other TCAM chips. Using batch updating, only when the hot chip
is about to ﬁll up, we need to run our topological transformation algorithms to

recompute the TCAM lookup tables.

8.2 Domain Compression

In this section, we describe our new reencoding scheme called dommﬁn compression.
The basic idea of domain con‘ipression is to simplify the logical structure of a clas-
siﬁer by mapping the domain of each ﬁeld D(F,-) to the smallest possible domain
D’(F,-). We formalize this process by showing how a classifier (C deﬁnes an equiv-
alence relation on the domain of each of its ﬁelds. This equivalence relation allows
us to deﬁne equivalence classes within each ﬁeld domain that domain compression
will exploit.

Domain compression has several beneﬁts. First, with a compressed domain

99

D'(Fz-), we require fewer bits to encode each packet ﬁeld. This allows us to set
TCAM entries widths to be 36 bits rather than 144 bits, which saves both space
in the TCAM and time as each entry ﬁts on the 72 bit TCAM bus. Second, each
transformed rule r’ in classiﬁer (C, will contain fewer equivalence classes than the
original rule 7‘ did in classiﬁer (C. This leads to reduced range expansion and the
complete elimination of some rules, which allows us to achieve expansion ratios less
than one.

Our domain compression algorithm consists of three steps: (1) computing equiv-
alence classes, (2) constructing transformer Ti for each ﬁeld Ft: and (3) constructing

the transformed classiﬁer (3'.

8.2.1 Step 1: Compute Equivalence Classes

In domain compression, we compress every equivalence class in each domain D(F,-)
to a single point in D’(F,-). The crucial tool of the domain compression algorithm
is the Firewall Decision Diagram (FDD).

The ﬁrst step of our domain compression algorithm is to convert a. given (1-
dimensional packet classiﬁer (C to (1 equivalent reduced FDDs f1 through fd where
the root of F DD fi is labeled by ﬁeld F1. Figure 8.5(a) shows an example packet
classiﬁer over two ﬁelds F1 and F2 where the domain of each ﬁeld is [0,63]. Figures
8.5(b) and (c) show the two FDDs f1 and f2, respectively. The FDDs f1 and
f2 are almost reduced except that the terminal nodes are not merged together for
illustration purposes.

The crucial observation is that each edge of reduced FDD f2- corresponds to one
equivalence class of domain D(.F,-). For example, consider the the classiﬁer in Figure
8.5(a) and the corresponding FDD f1 in Figure 8.5(b). Obvi01.1sly, for any p1 and
p1, in [7, 11] U [16, 19] U [39,40] U [43,60], we have C(p1,p2) = C(pl’, p2) for any [)2
in [0,63], so it. follows that (C{-p1} = C(pl’}.

Theorem 8.2.1 (Equivalence Class Theorem). For any packet classiﬁer C over

clds F --F and an etuiualent reduced F DD rooted at an F: node v. the
l, (I. 1 t a

100

labels ofv ’s outgoii‘zg edges are all the equivalence classes over ﬁeld F,- as deﬁned by

(3.

8.2.2 Step 2: Construct Transformers

Given a packet. classiﬁer (C over ﬁelds F1, - - - , F d and the (1 equivalent reduced FDDs
f1, - - - , fd where the root node of f,- is labeled P}, we compute transformer T,- as
follows. Let i) be the root of ft: and suppose v has m outgoing edges 61,- - -,em.
First, for each edge ej out of u, as all the ranges in ej’s label belong to the same
equivalent class according to Theorem 8.2.1, we choose one of the ranges in cj’s
label to be the representative, which we call the landmark. Any of the ranges in
e is label can be chosen as the landmark. For each equivalence class, we choose the
range that intersects the fewest number of rules in C as the landmark breaking ties
arbitrarily. We then sort edges in the increasing order of their landmarks. We use L]-
and ej to denote the landmark range and corresponding edge in sorted order where
edge el has the smallest valued landmark Ll and edge em has the largest valued
landn‘iark Lm. Our transformer T,- then maps all values in ej’s label to value j where
1 S j S m. For example, in Figures 85(1)) and (c), the greyed ranges are chosen
as the landmarks of their corresponding equivalence classes, and Figures 8.6(a) and

(b) show transformers T1 and T2 that result from choosing those landmarks.

8.2.3 Step 3: Construct Transformed Classiﬁer

we now construct transformed classiﬁer (C, from classiﬁer (C using transformers T7;
for l g i g d as follows. Let F1 6 81 /\ - - - /\ Fd E Sd ——> (decision) be an original
rule in (C. The domain compression algorithm converts F,- E S,- to Ft, 6 Si, such
that for any landmark range LJ- (0 g j g m. — l), 1.]- H S,- # (ll if and only ifj 6 Si’.
Stated another way, we re[_)lace range S, with range [(1. b] C_: D’(F,-) where a is the
smallest number in [0, in — 1] such that. La (1 S, # (0 and b is the largest. number in
[0,712, — 1] such that Lb ﬂ 8,; 74 (0. Note, it is possible no landmark ranges intersect

range Si; in this case. a. and b are undeﬁned and 37;, = 0. For a. converted rule

101

r, 2 F1, E 81’ /\ - - - /\ Fd’ E 8d, ——> (decision) in C], if there exists 1 S i. S d such
that S,’ = (0, then this converted rule r’ can be deleted from (C’.

For example, consider the rule F1 E [7,60] /\ F2 E [10,58] —-> discard in the
example classiﬁer in Figure 8.5(a). For ﬁeld F1, the ﬁve landmarks are the ﬁve
greyed intervals in 8.5(b), namely [0,0], [1,6], [7,11], [12,15], and [63, 63]. Among
these ﬁve landmarks, [7,60] overlaps with [7,11] and [12,15], which are mapped to 2
and 3 respectively by transformer T1. Thus, F1 E [7,60] is converted to F1, E [2, 3].
Similarly, for ﬁeld F2, [10,58] overlaps with only one of Fg’s landmark, [10,19],
which is mapped to 3 by F2’s mapping table. Thus, F2 E [10,58] is converted to
F2, E [3, 3]. Figure 8.7 shows the resultant domain compressed classifier.

Next, we prove that. (C, and T are semantically equivalent to (C.

Theorem 8.2.2. Consider any classiﬁer (C and the resulting transformers 'II‘ and

transformed classiﬁer (C’. For any packet p 2 (p1, . - - , pd). we have

(C(Ill, ' ° ' 7pd) : (Cl/(T1 (p1) ' ° ' 7 Td(p(1))

Proof. For each ﬁeld F, for 1 S i g (1, consider p’s ﬁeld value 1),. Let. L(p,) be the
landmark range for [3(1),]. We set. 1“, = min(L(p,)). We now consider the packet
:1." = (.‘1‘1.---;r.d) and the packets :L'(j) = (.r1,...:rj__1,pJ-,...,pd) for O S j g d;
that. is, in packet (EU), the ﬁrst j ﬁelds are identical to packet :r and the last d —j

ﬁelds are identical to packet p. Note 33(0) = p and I((l’) = .‘L‘. We now show that

C(])) 2 (C(17). This follows from C(:r(0)) = C(;r(1)) = --- = C(I(d)). Each equality
follows from the fact that. 17]- and p j belong to the same equivalence class within
l)(FJ-).

Let r be the ﬁrst rule in (C that packet :17 matches. we argue that 7), will match
the transformed rule r, E (C’. Consider the conjunction F, E S,- of rule r. Since :1:
matches rule r, it must be the case. that 1:, E 3,. This implies that L(]),) D S, % (ll.
Thus. by our construction p,’ = 'll‘,(p,) = 'll‘,(.r,) E S,’. Since this holds for all fields
F,. packet. p, matches rule 7". “7e also argue that packet. [2’ will not match any rule
before transformed rule 7" E C’. Suppose packet 19’ matches some rule. r1, E C’

that occurs before rule 7". This implies that, for each conjunction F, E S, of the

102

corresponding rule r1 E C that L(p,) F) S, 3é (0. However, this implies that 1:,- E 5',
since if any point in L(p,) is in 8,, then all points in L(p,) are in 8,. It follows that
1' matches rule r1 E C, contradicting our assumption that. rule r was the ﬁrst rule
that :1: matches in C. Thus, it follows that p’ cannot match rule rl’. It then follows

that r' will be the ﬁrst rule in C that p’ matches and the theorem follows. C]

8.3 Preﬁx Alignment

In this section, we describe a new topological transformation approach called preﬁx
alignment. When applying this approach, we assume that we have a classiﬁer C
that needs to be converted into a preﬁx classiﬁer via range expansion. we observe
that range explosion happens when ranges do not align well with preﬁx boundaries.
The basic idea of preﬁx alignment is to “shift”, “shrink”, or “stretch” ranges by
transforming the domain of each ﬁeld to a new “preﬁx-friendly" domain so that the
majority of the reencoded ranges either are preﬁxes or can be expressed by a. small
number of preﬁxes. This will reduce the costs of range expansion. Of course, we
must guarantee that our preﬁx alignment transformation preserves the semantics of
the original classiﬁer.

we ﬁrst consider the special case where the classiﬁer has only one ﬁeld F. We
develop an optimal solution for this problem using dynamic programming techniques.
We then describe how we use this solution as a building block for performing preﬁx
alignment on multi-dimensional classiﬁers. Finally, we discuss how to compose the

two transformations of domain compression and preﬁx alignn‘ient.

8.3.1 Preﬁx Alignment Overview

The one—dimensimial preﬁx alignment problem can be described as the following
“cut” problem. Consider the three ranges [0,12]. [5,15]. and [0.15] over domain
D(F1) = [0. 15] in classiﬁer C in Figure 8.8(A), and suppose the transformed domain
0’ (F1) : [00,11] in binary format. Because D'(F1) has a total of 4 elements, we

want to identify three cut points 0 E .171 < 1‘2 < 1:3 g 15 such that if [0,171] E

103

D(F1) transforms to 00 E D’(F1), [3:1 + 1,332] E D(F1) transforms to 01 E D’(F1),
[332+ 1,1:3] E D(F1) transforms to 10 E D’(F1), and [13+ 1, 15] E D(F1) transforms
to 01 E D’ (F 1 ), the range expansion of the transformed ranges will have as few rules
as possible. For this simple example, there are two families of optimal solutions:
those with 1:1 anywhere in [0, 3], 2:2 = 4, and $3 = 12, and those with 171 = 4,
x2 = 12, and 1173 anywhere in [13,15]. For the ﬁrst family of solutions, range [0, 12]
is transformed to [00, 10] = 0* U10, range [5, 15] is transformed to [10. 11] = 1*, and
range [0, 15] is transformed to [00, 11] = M. In the second family of solutions, range
[0, 12] is transformed to [00, 01] = 0*, range [5, 15] is transformed to [01, 11] = 01U1*,
and range [0, 15] is transformed to [00,11] = **. The classiﬁer C, in Figure 8.8(A)
shows the three transformed ranges using the ﬁrst family of solutions. Thus, in both
examples, the range expansion of the transformed ranges only has 4 preﬁx rules
while the range expansion of the original ranges has 7 preﬁx rules.

We now illustrate how to compute an optimal solution using a divide and conquer
strategy. The ﬁrst observation is that we can divide the original problem into two
subproblems by choosing the middle cut point. The second observation is that a
cut point should be the starting or ending point of a range if possible in order
to reduce range expansion. Suppose the target domain D’ (F1) is [0.2b — 1]. we
ﬁrst need to choose the middle cut point x2b_1, which will divide the problem
into two subproblems with target domains [0, 2b—1 — 1] = 0{=i<}b""1 and [21)“1. 2b —
1] = 1{*}b-1 respectively. Consider the example in Figure 8.8(A), the 172 cut
point partitions [0,15] into [0, 51:2], which transforms to preﬁx 0*, and [1'2 + 1,15],
which transforms to preﬁx 1*. The ﬁrst observation implies either 332 = 4 or .r2 =
12. Suppose we choose r2 2 4; that is, we choose the dashed line as shown in
Figure 8.8(A). This then divides the original problem into two subproblems where
we need to identify the 171 cut point in the range [0.4] and the 3:3 cut point. in
[5, 15]. Furthermore. in the two subproblems, we include each range trimmed to ﬁt
the restricted domain. For example, in the ﬁrst subproblem, ranges [0, 12] and [0. 15]
are trimmed to [0,4], and in the second subproblem. ranges [5,15] and [0.15] are

trimmed to [5, 15] and range [0, 12] is trimmed to [5. 12]. It is important. to maintain

104

each trimmed range, even though there may be multiple copies of the same trimmed
range. We then see in the ﬁrst subproblem that the choice of 11:1 is immaterial since
both trimmed ranges span the entire restricted domain. In the second subproblem,
the range [5, 12] dictates that 1173 = 12 is the right choice.

This divide and conquer process of computing cut points can be represented as a
binary cut tree. For example, Figure 8.8(B) depicts the tree where we select $2 = 4
and 11:3 = 12. This binary cut. tree also encodes the transformation from the original
domain to the target domain: all the values in a terminal node will be mapped to the
preﬁx represented by the path from the root to the terminal node. For example, as
the path from the root to the terminal node of [0,4] is 0, all values in [0,4] E D(F1)
are transformed to 0*.

Note that in the domain compression technique, we considered transformers that
mapped points in D(F,) to points in D’(F,). In preﬁx alignment, we consider
transformers that map points in [)(F,) to preﬁx ranges in D'(F,). If this seems
confusing, we can also work with transformers that map points in D(F,) to points
in D’(F,) with no change in results; however, transformers that map to prefixes
more accurately represent the idea of preﬁx alignment than transformers that map
to points. Note also that since we will perform range expansion on C’ with no
redundancy removal, we can ignore rule order. We can then view a one-dimensional

classiﬁer C as a multiset of ranges S in I)(F1).

8.3.2 One-dimensional Preﬁx Alignment

We next present the technical details of our dynamic programming solution to the
preﬁx aligmnent problem by ansmrring a series of four questions.

Correctness of Preﬁx Alignment

The ﬁrst question is: why is the prefix alignment transformation process correct? In
other words, how does the preﬁx alignment transformation preserve the semantics
of the original classiﬁer? We first deﬁne the concept of preﬁx transformers and

then show that. if prefix transformers are used. the prefix alignment transformation

105

process is correct.
Given a preﬁx P, we use min P and max P to denote the smallest and the largest

values in P, respectively.

Deﬁnition 8.3.1 (Preﬁx transformers). A transformer T, is an order—presenting
preﬁx transformer from D(F,) to D'(F,) for a packet classiﬁer C if T, satisﬁes the
following three properties. (I) (preﬁx property) Vx E D(F,), 'Il‘,(x) = P where P
is a preﬁx in domain D’(F,),' (2) (order-preserving property) Vx, y E D(F,), x < y
implies either 'll‘,(x) = 'll‘,(y) or max T,(x) < min ’ll‘,(y); {3) (consistency property)

Vx,y E D(F,), T,(x) = T,(y) implies C{x} = C{y}.

The following Lemma 8.3.1 and Theorem 8.3.1 easily follow the deﬁnition of

preﬁx transformers.

Lemma 8.3.1. Given any preﬁx transformer 11', for a ﬁeld F,, for any a,b,x E
D(F,), x E [a, b] if and only if 'll‘,(.r) E [1'1‘1in'll‘,(a), max T,(b)].

Theorem 8.3.1 (Topological Alignment Theorem). Given a packet classiﬁer C over
ﬁelds F1, - - - , F,,, and d preﬁx transformers T = (T, | 1 g i S d}, and the classiﬁer
C, constructed by replacing any range [a, b] over ﬁeld F, (1 S i S d) by the range
[min T,(a), max 'll‘,(b)], the condition C(p1.- - -.pd) = C’(T1(p1), ---, 'll‘d(pd)) holds.

Find Candidate Cut Points

The second question is: what cut points need to be considered? To answer this
question, we ﬁrst introduce the concept of atomic ranges. For any multiset of ranges
S (a multiset. may have duplicate entries) and any range :17 over domain D(F 1), we

use S‘in'x to denote the set of ranges in S that contain x.

Deﬁnition 8.3.2 (Atomic Range. Set). Given a multiset S of ranges, the union of
which constitute a range denoted U S. and a set of ranges S’, S, is the atomic range
set of S if and only if the following four conditions hold: (1) (coverage property)
U S = US’; (2) (disjoint property) VJ), y E S', x {’1 y = (ll; (3) (atomicity property)
Vx E S and Vy E S', :1: H y 74 0 implies y g x. (.4) (maximality property) V17. 1) E S’

and max x + 1 = min y implies Std-J: 7£ S‘Qy.

106

For any multiset of ranges S, there is one and only one atomic range set of S,
which we denote as AR(S). Because of the maximality property of atomic range set,
the candidate out points correspond to the end points of ranges in AR(S). We now
show how to compute S-start points and S-end points. For any range [x,y] E S,
deﬁne the points x — 1 and y to be S-end points, and deﬁne the points x and y + 1
to be S-start points. Note that we ignore x - 1 if x is the minimum element of U S;
likewise, we ignore y + 1 if y is the maximum element of U S. Let (31,- - - , sm) and
(e1, - - - , em) be the ordered list of S-start points and S-end points. It follows that
for 1 g i g m — 1 that s, g e, = s,+1 +1. Thus, AR(S) = {[sl,e1],- - - , [sm,em]}.

For example, if we consider the three ranges in classiﬁer C in example Figure
8.8(A), range [0, 12] creates S-start point 13 and S-end point 12, range [5, 15] creates
S-end point 4 and S—start point 5, and range [0, 15] creates no S-start points or S-
end points. Finally, 0 is an S—start point. and 15 is an S-end point. This leads to

AH(S) = {[0,4].[5.12],[13,15]}.

Choose Target Domain Size

The third question is: how many bits should be used to encode domain 0' (F 1 )? The
number of bits b used to encode the domain D'(F1) may impose some constraints
on possible preﬁx transformers. Consider the example from C in Figure 8.8(A) with
ranges [0,12], [5, 15]. and [0, 15]. Suppose there were a fourth range [5,7]. For this
multiset of ranges S, we then have AR(S) = {[0. 4], [5, 7], [8, 12], [13, 15]}. If we allow
only 2 bits to encode D’ (F 1), then there is only one possible preﬁx transformer. We
must have [0,4] map to 00, [5,7] map to ()1, [8,12] map to 10, and [13,15] map
to 11. On the other hand, if we allow 3 bits. we can also allow additional prefix
transformers such as [0, 4] map to 000. [5. 7] map to 001, [8,12] map to 01*, and
[13,15] map to 1**. In this case, the ﬁrst prefix transformer is superior to this
second preﬁx transformer. However. if the original ranges had been [0, 4], [0,7],
[0. 12], and [0, 15], the second preﬁx transformer would have been superior. and this
preﬁx transformer is only possible if we encode D, (F 1) with at least 3 bits.

We will include the number of bits b used to encode D’( F1) as an input parameter

107

to our preﬁx alignment problem. we determine the best b through an iterative
process of repeatedly incrementing b and computing an optimal solution for that
b. we start by choosing = [log |AR(S)|], which is the smallest possible number
of bits for any legal preﬁx transformer. Once we have a solution, we increment b
and repeat the process until we cannot reduce the range expansion any further. we
choose a linear search as opposed to a binary search for efﬁciency reasons. As we
shall see in a moment, any solution using b bits will require a sub-solution using b— 1
bits. Thus, when we fail to ﬁnd a solution using b bits and try to ﬁnd a solution
using 2b bits, we will require a sub-solution for each number from b + 1 to 2b - 1
(otherwise we would have found a solution using b bits). Furthermore, the binary
search may miss the best b by a large factor, which will lead to a large amount of

unnecessary computation.

Choose Optimal Cut Points

The fourth question is: How do we choose the optimal cut points? As we noted
before. we can view a one-dimensional classiﬁer C as a multiset of ranges S in D(F 1).
We then formulate the one-dimensional preﬁx alignment problem as follows: Given
a multiset of ranges S over ﬁeld F1 and a number of bits b. ﬁnd preﬁx transformer
T1 such that the range expansion of the transformed multiset of ranges S' has the
minimum number of preﬁx rules and D’ (F 1) can be encoded using only b bits.

We now present an optimal solution for this problem using dynamic program-
ming. Given a multiset of ranges S, we ﬁrst compute its atomic range set AR(S).
Suppose there are m atomic ranges R1, - - - , Hm with S-start points 81 through sm
and S-end points e1 through cm sorted in increasing order. For any S-start point.
sf and S—end point sy where 1 g .r g y g m, we define S rm [.1:, y] to be the multiset.
of ranges from S that intersect. range [817,531]; furthernmre. we. assume that each
range in S (n) [.r. y] is trimmed so that. its start point. is at least s1,- and its end point
is at most sy. We then define a collection of subproblems as follows. For every
1 g .‘I? g y g m, we deﬁne a prefix alignment problem P.4(r, y. b) where the problem

is to find a preﬁx transformer T1 for [54.5, cu] Q I)(F1) such that the range expansion

108

 

of (S rm [x, y])’ has the smallest possible number of preﬁx rules and the transformed
domain D' (F 1) can be encoded in b bits. We use cost(x, y. b) to denote the number
of preﬁx rules in the range expansion of the optimal (S n [x, y])’. The original preﬁx
alignment problem then corresponds to PA(1, m, b) where b can be arbitrarily large.

The key observation that allows the use of dynamic programming is that the
preﬁx alignment problem obeys the optimal substructure property. For example,
consider PA(1, m, b). As we employ the divide and conquer strategy to locate a
middle cut point that will establish what the preﬁxes O{*}b_1 and 1{*}b—1 cor-
respond to, there are m — 1 choices of cut points to consider: namely e1 through
em_1. Suppose the optimal cut point is ek where 1 S k S m — 1. Then the op-
timal solution to PA( 1, m, b) will build upon the optimal solutions to subproblems
PA(1,k,b — 1) and PA(k + 1,m,b — 1). That is, the optimal preﬁx transformer
for PA(1, m. b) will simply append a 0 to the start of all preﬁxes in the optimal
preﬁx transformer for PA(1, k, b — 1), and similarly it will append a 1 to the start
of all preﬁxes in the optimal preﬁx transformer for PA“: + 1, 777., b — 1). Moreover,
cos/(1,771, b) = cost(1,k,b — 1) + cost(k + 1,7‘n,b — 1) — [.S'<;Ci'[1, 771]]. We subtract
[S'iQ[1, m][ in the above cost equation because ranges that include all of [81, em] are
counted twice, once in cost(1,k,b — 1) and once in cost(k + 1, m, b — 1). However,
as [s], ck] transforms to 0{*}b-1 and l3k+17 em] transforms to 1{*}b—1, the range
[31,em] can be expressed by one preﬁx {*}b = 0{*}b_1 U 1{*}b"1.

Based on this analysis, Theorem 8.3.1 shows how to compute the optimal cuts
and binary tree. As stated earlier, the optimal preﬁx transformer T1 can then be

computed from the binary cut tree.

Theorem 8.3.1. Given a multiset of ranges S with [AR(S)

 

= 777.. cost(l,r, b) for
any b 2 0. 1 S l S r S m can be computed as follows. For any 1 S l < r S m, and
1S 1’: Sm. and b 2 0:

cost(l, 7', 0) = 00,

cost(lsf, 11‘, b) = $th kll’

 

109

and for any 1 S l < 7‘ S 777. and b 2 1

( cost(l,]c,b—1) \

+

cost l,r,b = min nth 1, ,b—1
( ) kE{l,...,r—1} COS< + T )

t new“ )

 

 

C]

Note that we set cost(k, k, 0) to |S@[k, It“ for the convenience of the recursive
case. The interpretation is that with a 0-bit domain, we can allow only a single value
in D’ (F 1); this single value is sufﬁcient to encode the transformation of an atomic

interval.

8.3.3 Multi-Dimensional Preﬁx Alignment

We now consider the multi-dimensional preﬁx alignment problem. Unfortunately,
while we can optimally solve the one—dimensional problem, there are complex inter-
actions between the dimensions that make solving the multi-dimensional problem
optimally extremely difﬁcult. In particular, the total range expansion required for
each rule is the product of the range expansion required for each ﬁeld. Thus, there
may be complex tradeoffs where we sacriﬁce one ﬁeld of a rule but align another ﬁeld
so that the costs do not multiply. However, we have not found a polynomial algo-
rithm for optimally choosing which rules to align well in which ﬁelds. It is an open
problem to prove whether the optimal multi-dimensional preﬁx alignment problem
is NP-hard.

In this chapter, we present. a hill—climbing solution where we itm'atively apply
our one—dimensional prefix alignment algorithm one ﬁeld at a time to improve our
solution. The basic idea is to perform preﬁx alignment one ﬁeld at a time; however,
because the range expansion of one ﬁeld affects the numbers of ranges that appear
in the other ﬁelds. we run preﬁx alignment for each ﬁeld more than once. Running
prefix alignment more than once allows each ﬁeld to use more and more accurate

information about the number of times each range appears in a ﬁeld.

110

For a classiﬁer C over ﬁelds F1, . . . , Fd, we ﬁrst create (1 identity preﬁx trans—
formers TQ, . . . .Tg. We deﬁne a multi-ﬁeld preﬁx alignment iteration k as follows.
For i from 1 to d, generate the optimal preﬁx transformer Tf assuming the preﬁx

., . , k—l k—l k—l k—l -
transformers for the other ﬁelds are {T1 , ..., Ti—l , Ti+1 , ..., Td }. Our 1t-

erative solution starts at k = 1 and preforms successive multi-ﬁeld preﬁx alignment

iterations until no improvement. is found for any ﬁeld.

8.3.4 Composing with Domain Compression

Although the two transformation approaches proposed in this chapter can be used
individually to save TCAM space, we advocate combining them together to achieve
higher TCAM reduction. Given a classiﬁer C over ﬁelds F1. . . . , Fd, we ﬁrst per-
form domain compression resulting in a transformed classiﬁer C, and d transformers
Talc, . . . , Tgc; then, we perform preﬁx alignment on the classiﬁer C, resulting in a
transformed classiﬁer C” and (1 transformers Tllm, . . . ,TZG. To combine the two
transformat.ion processes into one, we merge each pair of transformers TE!“ and T? a
into one transformer T, for 1 S i S (1. One nice property of their composition is
that since the transformer for domain compression is a function from D(F,) to a
point in D’(F,) and each point in D’(F,) will belong to its own equivalence class in
D’(F,) for 1 S i S d, each point. x E D'(F,) deﬁnes an atomic range [x, x].

A good property of the two proposed topological transformation approaches is
that they are composable with many other reencoding or TCAM optimization tech-
niques. For example, we can apply previous TCAM minimization schen‘ies (such
as [5, 18, 20]) to a transformed classiﬁer to further reduce TCAM space. Further-
more. as new classiﬁer minimization algorithms are developed, our transformations
can potentially leverage these future results. Finally, we can apply the optimal

algorithm in [27] to compute the minimum possible transformers T, for 1 S i S d.

111

8.4 Experimental Results

We evaluate the effectiveness and efﬁciency of our topological t.ransformation ap—
proaches on both real-world and synthetic classiﬁers. Although our two approaches
can be used independently, they are much more effective when used together. Thus,
we only report results for both techniques combined, and we finish by running the

redundancy removal algorithm in [18] on the transformed classiﬁer C”.

8.4.1 Evaluation Metrics

Given a TCAM optimization algorithm A and a classiﬁer C, let A(C) denote the re-
sulting classiﬁer, W(A(C)) denote the number of bits to represent each rule in A(C),
TW(A(C)) denote the minimum TCAM entry width for storing A(C) given choices
36, 72, 144, or 288. [A(C)| denote the number of rules in A(C), and B(A(C)) =
TW(A(C)) X|A(C)[, which represents the total number of TCAM bits required
to store A(C). The main goal of TCAM optimization algorithms is to minimize
B (A(C)). we use Direct to denote direct range expansion algorithm, so B (Direct(C))
represents the baseline we compare against, W(Direct(C)) = 104, TW(Direct(C)) =
144. and B(Direct(C)) = 144 x |Direct(C)|. Below is the summary of our notations:

For any A and C, we measure overall effectiveness by the compression ratio

_ BfA ((3))
GIN/KC” _ B(Direct(C)
our approaches at compressing classiﬁers, we deﬁne the Rule Number Ratio of A on

C to be RNR(A(C)) = '18:“, which is often referred to as expansion ratio, and the

 

). To isolate the factors that contribute to the success of

 

Rule Width Ratio of A on C to be RWR(A(C)) = W(A(C) . When we consider a
S

 

 

set of classiﬁers S where denotes the number of classiﬁers in S , we generalize our

metrics as follows. Average compression ratio of A for S is

= ZCeS Cliff/1(9)

 

 

 

 

C'R(A(S)) 8"
average rule number ratio of A for S is
.. RNR A C
RNR(A(S)) = DUES s ( ( )),

 

 

112

and average rule width ratio of A for S is

2 Does RWRMUCD
s '

 

awry/4(5))

 

 

We split RL into two groups, RLa and RLb where RN R(Direct(C)) S 2 for all
C E RLa and RN R(Direct(C)) > 40 for all C E RLb. We have no classiﬁers where
2 S RNR(Direct(C)) S 40. It turns out [RLa] = 12 and [RLb] = 13. By separating
these classiﬁers into two groups, we can determine how well our techniques work on
classiﬁers that do suffer signiﬁcantly from range expansion as well as those that do

not.

8.4.2 Effectiveness
Results on real-world and synthetic classiﬁers

Table 8.1 shows the average compression ratio, rule size ratio, and rule number ratio
for our algorithm on all eight data sets. Figures 8.10 through 8.15 show the speciﬁc
compression ratios, rule width ratios, and rule number ratios for all of our real-world
classiﬁers; the black bars represent the increases in each quantity that arise from
assigning each rule a. unique decision. In each ﬁgure, we sort the classiﬁers by the
number of rules in the original classiﬁer. We present compression ratio and rule
number ratio data with and without transformers. The data without transform-
ers facilitate comparison with most previous reencoding schemes. The data with
transformers depicts the true space savings of our methods.

Our algorithm achieves signiﬁcant compression on both real-world and synthetic
(_rlassiﬁers. On RL, our algorithm achieves an average compression ratio of 10.3%
if we count TCAM space for transformers and 2.6% if we do not. These savings
are attributable to both rule width and rule number compression. The average rule
width compression ratio is 10.6%. which means that a typical encoded classiﬁer only
requires 11 bits, instead of 104 bits, to store a rule. However, the actual savings that
rule width compression contributes to average comjjn'ession ratio is only 25% because

the encoded classiﬁers will always use 36 bit wide. TCAM entries of 36, which is the

113

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

compression rule width rule number

we. T with T we. T with T
RL 2.6% 10.3% 10.6% 27.6% 123.3%
RLU 7.0% 16.2% 14.5% 63.2% 176.8%
RLa 5.3% 20.8% 14.2% 22.7% 87.4%
RLaU 14.4% 33.1% 18.5% 62.5% 140.3%
RLb 0.1% 0.5% 7.2% 32.2% 156.4%
RLbU 0.2% 0.6% 10.8% 63.8% 210.6%
SYN 0.6% 2.5% 10.4% 2.7% 11.8%
SYNU 9.3% 12.4% 16.0% 43.9% 58.9%

 

 

Table 8.1: Average compression ratio, rule width ratio, and rule number ratio for 9

data. sets (with transformers included and excluded)

smallest possible TCAM width. In comparison, direct range expansion would use
144 bit wide TCAM entries. That is, TW(A(C)) = 32 for all the classiﬁers in RL
(actually in all data sets including RLU, SYN, and SYNU). The remaining savings
is due to rule number compression. Note that the average rule number compression
ratio without transformers is 27.6%; that is, domain compression and redudancy
removal eliminate an average of 72% of the rules from our real-life classiﬁer sets.
In comparison, the goal of all other reencoding schemes is an average rule number
compression ratio without transformers of 100%. On other data sets, our algorithm
also performs well. For example, for Taylor’s rule set TRS, we achieve an average
compression ratio of 2.7% with transformers included and 1.0% with transformers

excluded.

Sensitivity to classiﬁer efﬁciency

Our algorithm is effective for both efficiently speciﬁed classiﬁers and inefﬁciently
speciﬁed classiﬁers. The efﬁciently specified classifiers in RLa experience relatively
little range expansion; the inefficiently speciﬁed classifiers in RLb experience signif—

icant range expansion. Not surprisingly, our algorithm provides roughly 40 times

114

better compression for RLb than for RLa with average compression ratios of 0.5%
and 20.8%, respectively. In both sets, TCAM width compression contributes 25%
savings. The difference is rule number compression. On efficient classiﬁers, our al-
gorithm provides modest rule number compression (even though the average rule
number ratio without transformers for RLa is 22.7%). On inefficient classiﬁers, our

algorithm provides tremendous rule number compression.

Sensitivity to number of unique decisions

Our algorithm '3 effectiveness is only slightly diminished as we increase the number
of unique decisions in a classiﬁer. In the extreme case where we assign each rule
a. unique decision in RLU, our algorithm achieves an average compression ratio of
16.2% with transformers included and 7.0% with transformers excluded; and on
SYNU, our algorithm achieves an average compression ratio of 12.4% with trans-
formers included and 9.3% with transformers excluded. In particular, the TCAM

width is unaffected as our algorithm still uses 36 bit wide TCAM entries.

Comparison with state—of—the—art results

Oar algorithm outpevforvns all existing reencoding schemes by at least a factor of
3.16 including transformers and by at least a factor of 7.24 ea'elading transformers.
We first. consider the width of TCAM entries. We have 36 bit. wide TCAM entry
width while the smallest TCAM width achieved by prior work is 72 [21]. There-
fore, on TCAM entry width, our algorithm is 2 times better than the best known
result. Next, we consider the number of T CAM entries. Excluding TCAM entries
for transformers, the best rule number ratio that any other method can achieve
on It]. is 100% whereas we achieve 27.6%. Therefore. e:1.‘(_'l'u(ling TCAM entries for
transform("I's. our algorithm is at least 7.24 (= 2 X 100%./27.6%) times better than
the optimal TCA M reencoding algorithm that does not consider classiﬁer semantics.
In comparison with PIC [21], the best. known TCAM-based reencoding alg(_)ritlin'1,
the transformers in PIC use at. least the same number of TCAM entries as our al-

gorithm because our domain compression technique may map multiple intervals to

115

one decision whereas PIC maps each interval to a unique decision. Thus, including
TCAM entries for transformers, the best average rule number ratio that PIC can
achieve on RL is 195.7%(= 123.3% — 27.6% + 100%). Therefore, including TCAM

entries for transformers. our algorithm is at least 3.16 (= 2 x 195.7%/ 123.3%) times

better than PIC.

116

 

F1 F2 Decision
[12, 15] [7,60] Discard
[41 , 42] [7, 60] Discard
[20, 38] [0, 63] Accept
[0, 63] [20, 38] Accept
[7,60] [10, 58] Discard
[1, 63] [0, 62] Accept
[0, 62] [1, 63] Accept
[0, 63] [0, 63] Discard

 

 

 

 

 

 

 

 

        
 
 
  

[0,0] [1,6] [7,11] [63,63]
[20,38 [16,19]
[6l,62 [39,40]
[43,60]

 

 

 

 

 

        
 

    

[0.01 “.61 [7.91 [10,19]
[61,62
[00 1,63] [0,63]
E! E] El

 

 

Figure 8.5: Step 1 of domain compression

117

 

F1

Decision

 

 

l0, 0]
[1,6] 0 [20,38] 0 [61,62]

[7, 11] 0 [16,19] 0 [39,40] 0 [43,60]

[12,15] 0 [41,42]
[63,63]

 

rbCJGNJV-H

 

 

 

F2

Decision

 

[0, 0]
[1,6] 0 [61,62]
[7, 9] 0 [20,38] 0 [59,60]
[10, 1910 [39,58]
[63,63]

 

 

.5me

 

 

Figure 8.6: Step 2 of domain compression

118

 

 

 

 

 

 

 

 

 

 

 

 

F1 F 2 Decision
[3, 3] [2, 3] Discard

(l) [2, 3]] Discard

(l) [0, 4] Accept
[0, 4] (ll Accept
[2, 3] [3, 3] Discard
[1, 4] [0, 3] Accept.
[0, 3] [1, 4] Accept
[0, 4] [0, 4] Discard

U

F1 F 2 Decision
[3, 3] [2, 3] Discard
[2,3] [3, 3] Discard
[1, 4] [0, 3] Accept,
[0, 3] [1, 4] Accept
[0, 4] [0, 4] Discard

 

 

Figure 8.7: Step 3 of domain compression

119

‘
u}... ‘I'wmwﬂ-lﬂm‘
I

m.“:.

It?"

fauu—u-emuu-r -. .. ~Culll'1‘.

 

 

 

   
 
   

.--".. ’5“- - " ‘1 55"": ' glass-54 1-. 5.771 31.3.4.1? fa? '1' ' :93 1 - 3}." “ME 1 -' .4
0 l 4 6 7 8 9 10 ll 12 13

map .t'é 0' map to‘W. map to'."

 

    

as" s

14 15

 

(A)

Figure 8.8: Example of 1-D preﬁx alignment.

 

 

 

A TCAM opt. scheme
Direct direct range expansion
(C packet classiﬁer

A(C) resulting classiﬁer
l‘l/"(.4(<C)) width of rules in A(C)
[.4(<C)| number of rules in A((C)

TW(A(C)) minimum TCAM width for rules in A(C)
B(A(C)) TW(A(C)) x |.4,(<C)| total bits of A((C)

 

 

 

Figure 8.9: Summary of notation

120

 

 

_g ’55 El Excluding transformer
2‘? 80 Ilncluding transformer <
C
E g 60
"' a)
E29. 40
Q.
g 20
u
00 2 10

4 6 8
Classifier

 

 

Figure 8.10: Compression ratio of RLa

and RLaU

 

U!

0

010

(Percentage)

HHNNWUJ
U1

Rule Size Ratio
0.

 

 

(DUI

 

6
Classifier

 

Figure 8.12: Rule size ratio of RLa and

 

 

 

RLaU
.98 3;” :1 Excluding transformer
iii % I Including transformer
:05 300
53 250
.D L
83 2
3V
2
2
D
a:

 

 

Figure 8.14: Rule number ratio of RLa

and 11’ LaU

121

 

   

 

 

 

£37 1. Excluding transformer <
g g 1. I Including transformer

‘3 1.
.5 § 1.
5;: 0.8
‘d 0.6
E 0.4
8 0.2

0'00 2 10 12 14

6 8
Classifier

Figure 8.11: Compression ratio of RLb

and RLbU

 

 

 

Imam.“ and. .e.

i
3

 

 

Rule Size Ratio
(Percentage)

H H

O N

b m—

N]lmm-cxn

6 8
Classifier

 

 

V:—

Figure 8.13: Rule size ratio of RLb and

 

   

RLbU
3g 3‘; Cl Excluding transformer [
g? 300 I Including transformer <
E8 250
E213 200
35 15
z
1) 100
g 50

 

12 14

6 8
Classifier

Figure 8.15: Rule number ratio of RLb

and RLbU

BIBLIOGRAPHY

122

l1]

l3]

l4]

l6]

l7]

[8]

[9]

[10]

[W

BIBLIOGRAPHY

David A. Applegate, Gruia. Calinescu, David S. Johnson, Howard Karloff, Ka-
trina Ligett, and Jia Wang. Compressing rectilinear pictures and minimizing
access control lists. In Proc. ACM—SIAM Symposium on Discrete Algorithms
(SODA), January 2007.

Florin Baboescu, Sumeet Singh, and George Varghese. Packet classiﬁcation for
core routers: Is there an alternative to CAMS? In Proc. IEEE INFOCOM,
2003.

Florin Baboescu and George Varghese. Scalable packet classiﬁcation. In Proc.
ACM SICCO/l/Illrl, pages 199 210, 2001.

Anat Bremler—Barr and Danny Hendler. Space-efﬁcient TCAM-based classi-
ﬁcation using gray coding. In Proc. 26th Annual IEEE conf. on Computer
Communications (lnfocom), May 2007.

Qunfeng Dong, Suman Banerjee, Jia W’ang, Dheeraj Agrawal, and Ashutsh
Shukla. Packet classifiers in ternary CAMS can be smaller. In Proc. ACM
Sigmetrics, pages 311—322, 2006.

Richard Draves. Christopher King. Venkatachary Srinivasan, and Brian Zill.
Constructing optimal IP routing tables. In Proc. IEEE INFOCOAII, pages 88—
97, 1999.

Anja Feldmann and S. l\luthukrishnan. Tradeoffs for packet classiﬁcation. In

Proc. 19th IEEE INFOCOM, March 2000.

M. R. Carey and D. S. Johnson. Strong NP-completeness results: motivation,
examples, and implications. Journal of ACM, 25(3):499—508, 1978.

Mohamed G. Gouda and Alex X. Liu. Firewall design: consistency. complete-
ness and compactness. In Proc. filth IEEE Int. conf. on Distributed Computing
Systems (ICDCS-W), pages 320- 327, March 2004.

.\Iohz~imed G. Gouda and Alex X. Liu. Structured ﬁrewall design. Computer
Networks Journal (Elscuicr). 51(4):1106—1120, March 2007.

Pankaj Gupta and Nick McKeown. Packet. classification on multiple ﬁelds. In
Proc. ACM SIGCOMli-l. pages 1477160, 1999.

123

[12]

[13]

[14]

[16]

[17]

[18]

[19]

[20]

[21]

[22]

[23]

[241

Pankaj Gupta and Nick McKeown. Packet classiﬁcation using hierarchical in-
telligent. cuttings. In Proc. Hot Interconnects VII, Aug. 1999.

Pankaj Gupta. and Nick IVIcKetm’n. Algorithms for packet classification. IEEE
Network, 15(2):24—32, 2001.

T. V. Lakshman and Dimitrios Stiliadis. High-speed [whey-based packet for-
warding using efﬁcient multi-dimensional range matching. In Proc. ACM SIC-

COMM, pages 203—214, 1998.

Karthik Lakshminarayanan, Anand Rangarajan, and Srinivasan Venkatachary.
Algorithms for advanced packet classiﬁcation with ternary CAMS. In Proc.
ACM SICCOMM, pages 193 — 204, August 2005.

Panos C. Lekkas. Network Processors - Architectures, Protocols, and Platforms.
McGraw—Hill, 2003.

Alex X. Liu and Mohamed G. Gouda. Diverse ﬁrewall design. In Proc. Int. conf.
on Dependable Systems and Networks (DSN—OI), pages 595—604, June 2004.

Alex X. Liu and Mohamed G. Gouda. Complete redundancy detection in ﬁre-
walls. In Proc. 19th Annual IFIP conf. on Data and Applications Security,
LN CS 3654, pages 196—209, August 2005.

Huan Liu. Efﬁcient mapping of range classiﬁer into Ternary—CAM. In Proc.
Hot Interconnects, pages 95— 100, 2002.

Chad R. Meiners, Alex X. Liu, and Eric Torng. TCAM razor: A systematic
approach towards minimizing packet. classiﬁers in TCAMS. In Proc. 15th IEEE
conf. on Network Protocols ( I CNP), pages 266—275, October 2007.

Derek Pao, Yiu Keung Li, and Peng Zhou. An encoding scheme for TCAM-
based packet classiﬁcation. In Proc. 8th IEEE Int. conf. on Advanced Commu-
nication Technology (I CA C T), February 2006.

Lili Qiu, George Varghese, and Subhash Suri. Fast ﬁrewall implementations
for software-based and hardware-based routers. In Proc. the 9th Int. conf. on
Network Protocols (ICNP), 2001.

Sumeet Singh, Florin Baboescu. Gem'ge Varghese, and Jia “’ang. Packet clas-
siﬁcation using multidimensional cutting. In Proc. ACM SICCOMM, pages
213-224, 2003.

Ed Spitznagel, David Taylor, and Jonathan 'Ihrner. Packet. classiﬁcation using
extended TCAMS. In Proc. 11th I EEE Int. conf. on Network Protocols ( I CNP).
pages 120--- 131, November 2003.

V. Srinivasan. Subhash Suri, and George V-u‘ghese. Packet classiﬁcation using
tuple Space search. In Proc. A C111 S] C ("0111141. pages 133146. 1999.

124

 

[26] Venkatachary Srinivasan, George Varghese, Subhash Suri, and Marcel Waldvo—
gel. Fast and scalable layer four switching. In Proc. ACM SICCOMM, pages
191—202, 1998.

[27] Subhash Suri, Tuomas Sandholm, and Priyank W'arkhede. Compressing two-
dimensional routing tables. Algorithmica, 35:287~300, 2003.

[28] David Taylor and Jonathan Turner. Scalable packet classiﬁcation using dis—
tributed crossproducting of ﬁeld labels. In Proc. IEEE INF OCOM, pages 269—
280, 2005.

[29] David E. Taylor. Survey & taxonomy of packet classiﬁcation techniques. ACM
Computing Surveys, 37(3):238—275, 2005.

[30] David E. Taylor and Jonathan S. Turner. Classbench: A packet classiﬁcation
benchmark. In Proc. IEEE Infoeom, March 2005.

[31] Jan van Lunteren and Ton Engbersen. Fast and scalable packet classiﬁcation.
IEEE Journals on Selected Areas in Communications, 21(4):560— 571, 2003.

[32] Thomas Y. C. Woo. A modular approach to packet. classiﬁcation: Algorithms
and results. In Proc. IEEE INFOCOM, pages 1213—1222, 2000.

[33] Fang Yu, T. V. Lakshman, Marti Austin Motoyama, and Randy H. Katz. SSA:
A power and memory efficient scheme to multi-match packet classiﬁcation. In

Proc. Symposium on Architectures for Networking and Communications Sys-
tems (ANCS), pages 105-113, October 2005.

93 03062 7354

[Illillilllllll

ill]

31

 

I
I
I

E “I
H
|iI
"
H
H
Ill