ON DESIGN AND IMPLEMENTATION OF FAST & SECURE NETWORK PROTOCOLS FOR DATACENTERS

By

Ali Munir

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Computer Science – Doctor of Philosophy

2019

ABSTRACT

ON DESIGN AND IMPLEMENTATION OF FAST & SECURE NETWORK PROTOCOLS FOR DATACENTERS

By Ali Munir

My PhD work focuses on improving the performance and security of networked systems. For network performance, my research focuses on scheduling and transport in datacenter networks. For network security, my research focuses on multipath TCP security.

To improve the performance of datacenter transport, I proposed PASE, a near-optimal and deployment-friendly transport protocol. To this end, I first identified the underlying strategies used by existing datacenter transports. Next, I showed that these strategies are complementary to each other, rather than substitutes, as they have different strengths and can address each other's limitations. Unfortunately, prior datacenter transports use only one of these strategies, and as a result they achieve either near-optimal performance or deployment friendliness, but not both. Based on this insight, I designed a datacenter transport protocol called PASE, which carefully synthesizes these strategies by assigning different transport responsibilities to each strategy.

To further improve the performance of datacenter transport in multi-tenant networks, I proposed Stacked Congestion Control (SCC), which achieves performance isolation and objective scheduling simultaneously. SCC is a distributed host-based bandwidth allocation framework, where an underlay congestion control layer handles contention among tenants, and a private congestion control layer for each tenant optimizes its performance objective. To the best of my knowledge, no prior work supported performance isolation and objective scheduling simultaneously.

To improve task scheduling performance in datacenters, I proposed NEAT, a task scheduling framework that leverages information from the underlying network scheduler to make task placement decisions. Existing datacenter schedulers optimize either the placement of tasks or the scheduling of network flows. Inconsistent assumptions of the two schedulers can compromise the overall application performance. The core of NEAT is a task completion time predictor that estimates the completion time of a task under a given network condition and a given network scheduling policy. Next, a distributed task placement framework leverages the predicted task completion times to make task placement decisions and minimize the average completion time of active tasks.

To improve multipath TCP (MPTCP) security, I reported vulnerabilities in MPTCP that arise because of cross-path interactions between MPTCP subflows. MPTCP allows two endpoints to simultaneously use multiple paths between them. An attacker eavesdropping on one MPTCP subflow can infer the throughput of other subflows and can also inject forged MPTCP packets to change the priority of any MPTCP subflow. An attacker can exploit these vulnerabilities to launch connection hijack attacks on paths he has no access to, or to divert traffic from one path to other paths. My proposed fixes for these vulnerabilities, expressed as changes to the MPTCP specification, guarantee that MPTCP is at least as secure as TCP and the original MPTCP design, and they have been adopted by the IETF.

Copyright by
ALI MUNIR
2019

This thesis is dedicated to my wife and parents. Thank you for always believing in me.
ACKNOWLEDGEMENTS

Working towards a Ph.D. has been a deeply enriching and rewarding experience. Looking back, many people have helped shape my journey. I would like to extend them my thanks.

• First and foremost, my advisor, Prof. Alex X. Liu. My work would not have been possible without his constant guidance, unwavering encouragement, his many insights, and his exceptional resourcefulness. And most importantly, his friendship. I have been very fortunate to have an advisor who has also been a close friend. For all of this, Alex, thank you.

• I would also like to thank the rest of my thesis committee, Profs. Subir Biswas, Abdol Esfahanian, and Eric Torng, for their encouragement and insightful comments during my qualifier and comprehensive exams.

• I would also like to thank Dr. Fahad Dogar. I really enjoyed working with him during my summer internship at Microsoft Research Cambridge and during our collaboration after the internship.

• I would also like to thank Dr. Jia Wang. I really enjoyed working with her during my summer internship at AT&T Labs - Research.

• I would also like to thank Drs. Ting He, Franck Le, Ramya Raghavendra, and Mudhakar Srivatsa. I really enjoyed working with them during my summer internship at IBM T.J. Watson Research Center and during our collaboration after the internship.

• I would also like to thank Mike Schlansker and Dr. Puneet Sharma. I really enjoyed working with them during my summer internship at Hewlett Packard Enterprise Labs and during our collaboration after the internship.

• Throughout my Ph.D., I was supported by various NSF research grants. Thanks NSF!

• I would also like to thank Michigan State University, and specifically the Department of Computer Science and Engineering, for providing me with financial support to attend various conferences during my Ph.D.

• Many thanks to my colleagues in the Systems and Security Lab at Michigan State University. In particular, I would like to thank Faraz Ahmed, Salman Ali, Kamran Ali, and Xinyu Le for numerous insightful discussions and collaborations on various projects.

• I am also very thankful to Drs. Saad Qaisar, S.M. Hassan Zaidi, and Savera Tanvir, who advised my undergraduate and masters theses and encouraged me to pursue a Ph.D.

• I am also deeply indebted to Dr. Ihsan Ayyub Qazi, my mentor during my masters and research internship at LUMS, Pakistan. His passion for science and scholarly pursuit has been an inspiration to me (and surely, many of his other students) and helped set me on the path on which I find myself today. I feel fortunate to have him as a close friend and an advisor. For all of your support, Dr. Ihsan, thank you.

• Finally, I do not know how I can thank my family enough: my wife and parents, from whom I realized that kindness and devotion are endless; my sisters, who always support me no matter what; and the rest of the family members who were always supportive of my studies.

• I dedicate my thesis to my wife, parents and Dr. Ihsan Ayyub Qazi. Thank you all for being an amazing part of my journey!

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER 1 BACKGROUND AND MOTIVATION
  1.1 PASE: A Near-Optimal Transport for Datacenters
  1.2 NEAT: Task Placement in Datacenters
  1.3 Bandwidth Allocation in Multi-tenant Datacenters
  1.4 Multipath TCP Attacks & Countermeasures

CHAPTER 2 REVIEW OF RELATED RESEARCH
  2.1 PASE: A Near-Optimal Transport for Datacenters
  2.2 NEAT: Task Placement in Datacenters
  2.3 Bandwidth Allocation in Multi-tenant Datacenters
  2.4 Multipath TCP Attacks & Countermeasures

CHAPTER 3 OVERVIEW
  3.1 PASE: A Near-Optimal Transport for Datacenters
    3.1.1 Transport Strategies in Isolation
    3.1.2 Transport Strategies in Unison
  3.2 NEAT: Task Placement in Datacenters
  3.3 Bandwidth Allocation in Multi-tenant Datacenters
    3.3.1 Interference when Sharing Network
      3.3.1.1 Performance in Isolation
      3.3.1.2 Coexistence of Different Objectives
      3.3.1.3 Same Objective Different Tenants
    3.3.2 Current Isolation won't Work
  3.4 Multipath TCP Attacks & Countermeasures
    3.4.1 MPTCP Background
    3.4.2 Threat Model
    3.4.3 Backup Flag Vulnerability
    3.4.4 Throughput Inference Vulnerability

CHAPTER 4 PASE: A NEAR-OPTIMAL TRANSPORT FOR DATACENTERS
  4.1 Arbitration Control Plane
    4.1.1 Basic Arbitration Algorithm
    4.1.2 Bottom Up Arbitration
  4.2 End-host Transport
  4.3 Discussion
  4.4 Performance Evaluation and Comparison
    4.4.1 Implementation
  4.5 Evaluation
  4.6 Macro-benchmarks
    4.6.1 Comparison with Deployment Friendly Schemes
    4.6.2 Comparison with Best Performing Scheme
    4.6.3 Evaluation under the Web Search and Data Mining Workloads
    4.6.4 PASE under the Incast scenario
  4.7 Micro-benchmarks
    4.7.1 Arbitration Optimizations
    4.7.2 In-network Prioritization and Transport Micro-benchmarks
  4.8 Distributed vs. Centralized Arbitration
  4.9 Testbed Evaluation
  4.10 Discussion
  4.11 Conclusion

CHAPTER 5 NEAT: TASK PLACEMENT IN DATACENTERS
  5.1 Task Performance Predictor
  5.2 FCT Prediction for Flow Scheduling
    5.2.1 FCT under FCFS Scheduling
    5.2.2 FCT under Fair or LAS Scheduling
    5.2.3 FCT under SRPT Scheduling
    5.2.4 Invariance Condition
  5.3 CCT Prediction for Coflow Scheduling
    5.3.1 CCT under FCFS Scheduling
    5.3.2 CCT under Fair or LAS Scheduling
    5.3.3 CCT under Permutation Scheduling
    5.3.4 Invariance Condition
  5.4 NEAT Task Placement Framework
    5.4.1 Flow Placement
    5.4.2 Coflow Placement
    5.4.3 Application to MapReduce Scheduling
    5.4.4 Application to DAG Scheduling
  5.5 NEAT Optimizations
  5.6 Evaluation
    5.6.1 Simulation Settings
  5.7 Macrobenchmarks
  5.8 Microbenchmarks
  5.9 Testbed Evaluation
  5.10 Discussion
  5.11 Conclusion

CHAPTER 6 BANDWIDTH ALLOCATION IN MULTI-TENANT DATACENTERS
  6.1 Underlay Congestion Control
  6.2 Private Congestion Control
  6.3 PCC Tunnel Algorithms
    6.3.1 Deadline-Sensitive Tunnel
    6.3.2 Completion-Sensitive Tunnel
    6.3.3 Latency-Sensitive Tunnel
  6.4 Testbed Verification
  6.5 Large-scale NS-2 Simulations
  6.6 Comparison in Coexistence Scenario
  6.7 Comparison in Policy Enforcement
  6.8 Comparison in Isolation Scenario
  6.9 SCC Dynamics
  6.10 Conclusions

CHAPTER 7 MULTIPATH TCP ATTACKS & COUNTERMEASURES
  7.1 Experimental Setup
  7.2 Connection Hijack Attacks
    7.2.1 Motivation
    7.2.2 Overview
    7.2.3 Attack Validation
  7.3 Directed Traffic Diversion Attacks
    7.3.1 Motivation
    7.3.2 Overview
    7.3.3 Attack Validation
    7.3.4 Step-wise traffic divergence
  7.4 Countermeasures
    7.4.1 Current MPTCP Security Mechanism
    7.4.2 Backup Flag Vulnerability Prevention
    7.4.3 Throughput Inference Vulnerability Prevention
  7.5 Security Analysis
  7.6 Linux Implementation
  7.7 Conclusions

CHAPTER 8 FUTURE WORK
  8.1 Automatic signature generation of network protocols on-the-fly
    8.1.1 Motivation and Problem Statement
    8.1.2 Limitations of Prior Art
    8.1.3 Project Objectives

BIBLIOGRAPHY

LIST OF TABLES

Table 3.1: Comparison of different transport strategies.
Table 4.1: Priority Queues and ECN support in popular commodity top-of-rack switches. The numbers are per interface.
Table 4.2: Default simulation parameter settings.
Table 5.1: Default parameter settings used in simulation.
Table 6.1: Notation Table

LIST OF FIGURES

Figure 1.1: Network scheduling aware placement
Figure 1.2: NEAT Architecture
Figure 3.1: Comparison of a) D2TCP, DCTCP and pFabric, b) PDQ (an arbitration based approach) with DCTCP.
Figure 3.2: Toy example illustrating problem with pFabric.
Figure 3.3: Loss rate for pFabric as a function of load.
Figure 3.4: Comparison of minDist and minLoad placement under SRPT and Fair network scheduling policy
Figure 3.5: Interference demonstration, when alone and coexist.
Figure 3.6: Example 2-path MPTCP connection setup
Figure 3.7: Subflow pause using MP_PRIO option
Figure 4.1: The bottom-up arbitration approach employed by PASE for intra-rack and inter-rack traffic scenarios.
Figure 4.2: (a) The early pruning optimization used to reduce the arbitration overhead. Note that si and ki represent flows that are mapped to the highest priority queue at the senders and the ToR arbitrators, respectively, and (b) the delegation optimization used to reduce the setup latency and control overhead.
Figure 4.3: (a) Comparison of PASE with L2DCT and DCTCP in terms of AFCTs under the (left-right) inter-rack scenario, (b) the CDF of FCTs at 70% load in case of (a), and (c) deadline-constrained flows: comparison of application throughput for PASE, D2TCP and DCTCP under the intra-rack scenario.
Figure 4.4: Comparison of PASE and pFabric for the inter-rack scenario: a) AFCT, b) 99th percentile with % gain of PASE over pFabric.
Figure 4.5: Comparison of PASE and pFabric for the intra-rack scenario: a) AFCT, b) 99th percentile with % gain of PASE over pFabric.
Figure 4.6: Comparison of PASE with pFabric, L2DCT and DCTCP under web search and data mining workloads.
Figure 4.7: AFCT comparison of PASE with pFabric, L2DCT and DCTCP under the web search workload. Note: the y-axes of the plots have different ranges.
Figure 4.8: AFCT comparison of PASE with pFabric, L2DCT and DCTCP under the data mining workload. Note that the y-axes of the plots have different labels.
Figure 4.9: Performance of PASE under (a) the incast scenario and (b) AFCT improvement due to PASE optimizations as a function of offered load.
Figure 4.10: AFCT improvement and overhead reduction as a function of load for PASE with its optimizations in the left-right scenario.
Figure 4.11: (a) PASE with varying switch buffer size and (b) PASE with varying number of priority queues (left-right scenario).
Figure 4.12: AFCT improvement under distributed arbitration over centralized arbitration under (a) intra-rack scenario (b) inter-rack scenario.
Figure 4.13: PASE AFCT performance.
Figure 5.1: NEAT components
Figure 5.2: Flow placement under DCTCP
Figure 5.3: Flow placement under Hadoop workload
Figure 5.4: CoFlow placement for Hadoop workload
Figure 5.5: Fair prediction vs SRPT prediction
Figure 5.6: Benefits of using preferred hosts placement
Figure 5.7: Completion time prediction error
Figure 5.8: Testbed: minLoad vs NEAT
Figure 6.1: SCC framework with tenant-objective tunnel abstraction.
Figure 6.2: Testbed Results w/ & w/o SCC.
Figure 6.3: Performance under CS/DS coexistence scenario
Figure 6.4: Benefits of In-tunnel algorithm
Figure 6.5: Coexistence of CS/DS/LS flows with datamining workload
Figure 6.6: Coexistence with Data-mining workload
Figure 6.7: Comparison under Data-mining workload
Figure 6.8: Isolation scenario.
Figure 6.9: Coexistence with and without SCC support for different workloads
Figure 6.10: SCC Dynamics.
Figure 6.11: Objective isolation using priority queues
Figure 7.1: Experiment Topology
Figure 7.2: MPTCP connection hijack attack using backup flag
Figure 7.3: Demonstration of the traffic diversion attack
Figure 7.4: Demonstration of stepwise traffic diversion attack.
Figure 7.5: Secure MP_PRIO TCP option
Figure 7.6: Example Use of Secure MP_PRIO
Figure 7.7: Secure DSS MPTCP option
Figure 7.8: Explicit Signaling of GSN Encryption on the First Subflow.

CHAPTER 1 BACKGROUND AND MOTIVATION

Datacenters have become a critical infrastructure for hosting user-facing online services such as web search, stock trading, social networking, and product advertising. Such services usually generate a large number of short but latency-sensitive flows and a small number of long but delay-tolerant flows [12, 17, 20]. For short flows, even a fraction of a second increase in latency can make a quantifiable difference in application performance, which, in turn, impacts user experience and operator revenue. For example, Amazon found that every additional 100 ms of latency costs them a 1% loss in business revenue [70]. With today's TCP, which treats short and long flows equally, short flows suffer from high latency because long flows tend to occupy most of the buffer space in switches [17, 126]. Thus, datacenter networks need a new transport protocol and task scheduling mechanisms that can minimize the latency for short flows yet achieve high throughput for long flows. Similarly, with today's task scheduling policies, network state is ignored while making task placement decisions; therefore, there is a need for a network-state-aware task placement framework. Finally, as new protocols and technologies are designed, they can introduce new security vulnerabilities. In this work, we discover two vulnerabilities in multipath TCP (MPTCP) that enable attackers to launch various attacks on MPTCP connections.
Below, I present four works that address these issues for datacenter protocols.

1.1 PASE: A Near-Optimal Transport for Datacenters

Unfortunately, prior datacenter transport protocols either achieve near-optimal performance (in terms of minimizing flow completion times or maximizing the number of deadlines met) or deployment friendliness (i.e., require no changes in switch hardware), but not both. For example, protocols like PDQ [70] and pFabric [20] can minimize average flow completion times (AFCTs) but require switches to perform complex operations that are not available in commercial switches and thus mandate modifications to switch hardware. On the other hand, protocols like DCTCP [17], L2DCT [97], and D2TCP [126] achieve deployment friendliness but lack global coordination between flows, which leads to sub-optimal flow scheduling.

In this paper, we ask: how can we design a datacenter transport protocol that achieves both near-optimal flow scheduling as well as deployment friendliness? Towards this end, we first identify three underlying strategies used in prior datacenter transport protocols, namely in-network prioritization (used in pFabric [20]), arbitration (used in D3 [134] and PDQ [70]), and self-adjusting at endpoints (used in DCTCP [17] and L2DCT [97]). With in-network prioritization, switches schedule and drop packets based on the priority of each packet, thus allowing each switch to locally prioritize short flows over long flows. In the arbitration strategy, every network link has an arbitrator (implemented at a switch or a host) that allocates rates to each flow in real-time based on the global view of all active flows. With the self-adjusting at end-points strategy, each sender independently decides to increase or decrease its rate based on the observed congestion in the network.

Our key insight is that these three strategies, which were used individually before, are in fact complementary to each other and can be unified to work together nicely. For example, approaches that rely on arbitration alone have high flow switching overhead because flows need to be explicitly paused and unpaused. With in-network prioritization using strict priority queues inside switches, switching from a high priority flow to the next is seamless. Conversely, in-network prioritization approaches typically need to support a large number of priority levels whereas existing switches only have a limited number of priority queues. Arbitration addresses this problem by dynamically changing the mapping of flows to queues. Flows whose turn is far away are all mapped to the lowest priority queue while flows whose turn is about to come are mapped to the high priority queues.

Based on the above insights, we propose PASE, a transport protocol for datacenters that achieves near-optimal performance as well as deployment friendliness by synthesizing the three strategies: in-network Prioritization, Arbitration, and Self-adjusting at Endpoints. The design of PASE is based on the underlying principle that each strategy should focus on what it is best at doing. Arbitrators should do inter-flow prioritization at coarse time-scales. They should not be responsible for computing precise rates or for doing fine-grained prioritization. End-points should react to congestion or spare capacity on their own, without involving any other entity. Further, given their lack of global information, they should not try to achieve inter-flow prioritization. We have observed that protocols that try to do this have poor performance.
The in-network prioritization mechanism should focus on per-packet prioritization at short, sub-RTT timescales. This division of responsibilities among the three strategies follows the well-known separation of concerns principle and makes PASE perform comparably to the state-of-the-art transport protocols while also being deployment friendly, i.e., requiring no changes to the network switches.

A key aspect of PASE is a scalable control plane for arbitration. PASE uses distributed arbitrators that decide the priority of a flow given other flows in the system. For every link in the datacenter topology, PASE uses a dedicated arbitrator for arbitration. This functionality can be implemented at the endpoints themselves (e.g., for their own links to the switch) or on dedicated nodes within the datacenter. The outcome of arbitration is the priority queue and reference rate for a flow. At endpoints, PASE uses a TCP-like endpoint transport protocol, which leverages the priority queue and reference rate information for its rate control. At switches, PASE leverages existing priority queues to prioritize the scheduling of packets over network links.

To achieve its goals, the first challenge for PASE is to minimize the arbitration latency and arbitration overhead. Arbitration latency depends on the distance between the end-hosts and their corresponding arbitrators. This delay matters most during flow setup time as it can end up increasing the flow completion times, especially for short flows. Due to a separate control plane, each arbitration message is potentially processed as a separate packet by the switches, which consumes their bandwidth and also requires processing at the arbitrators. We need to ensure that this overhead is kept low, that it does not cause network congestion for our primary traffic, and that it does not add any significant processing delay.

To overcome these challenges, we exploit the typical tree-based datacenter topology features [12, 15] to make the arbitration decisions in a bottom-up fashion, starting from the end-points and going up to the root. This has several performance and scalability benefits. First, for intra-rack communication, which constitutes a sizeable share of datacenter traffic [17], only the source and destination are involved, obviating the need to communicate with any other entity. Second, lower level arbitrators (those closer to the leaf nodes) can do early pruning by discarding those flows that are unlikely to become part of the top priority queue. Third, high level arbitrators (those closer to the root) can delegate their arbitration decisions to lower level arbitrators. Both early pruning and delegation reduce the arbitration latency and arbitration overhead, of course, at the cost of potentially less accurate decisions. The PASE evaluation shows that these optimizations improve performance by up to 10% and reduce arbitration overhead by up to 60%.

Loss recovery in PASE is more challenging because packets can be delayed in a lower priority queue, which may trigger spurious timeouts if we use today's timeout mechanisms. Thus, for lower priority flows, instead of retransmitting the data packet, PASE uses small probe packets that help in determining whether the packet was lost or was delayed.
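To make the earlier queue-mapping idea concrete, the sketch below shows one way an arbitrator could map ranked flows onto a handful of switch priority queues: flows whose turn is near are placed in the top queues, and everything else collects in the lowest-priority queue. The queue count and the per-queue byte budget are assumptions for illustration; this is not PASE's actual arbitration algorithm.

    def map_flows_to_queues(flows, num_queues=8, link_capacity_bps=1e9, window_secs=0.001):
        """Map arbitrated flows to a small number of switch priority queues.

        flows: list of (flow_id, remaining_bytes); smaller remaining size = higher priority.
        Returns {flow_id: queue_index}, where queue 0 is the highest priority and
        queue num_queues - 1 collects every flow whose turn is still far away.
        """
        # Rank flows as an SRPT-like arbitrator would: shortest remaining size first.
        ranked = sorted(flows, key=lambda f: f[1])

        assignment = {}
        queue, bytes_in_queue = 0, 0
        budget = link_capacity_bps / 8 * window_secs   # bytes one queue drains per window

        for flow_id, remaining in ranked:
            if queue < num_queues - 1 and bytes_in_queue + remaining > budget:
                queue += 1                 # this queue's "turn budget" is full; spill over
                bytes_in_queue = 0
            assignment[flow_id] = queue    # far-away flows all end up in the last queue
            bytes_in_queue += remaining
        return assignment

    # Example: short flows land in the high-priority queues, long ones move down.
    print(map_flows_to_queues([("a", 20_000), ("b", 1_000_000), ("c", 50_000), ("d", 5_000_000)]))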
We evaluate PASE using a small testbed and ns2 [7] simulations. In terms of flow completion times, compared to deployment-friendly protocols, PASE improves the average FCT (AFCT) by 40% to 60% for various scenarios; compared to near-optimal protocols, PASE performs within 6% of pFabric in scenarios where pFabric is close to optimal, while in other scenarios PASE outperforms pFabric by almost 80% both in terms of the AFCT and the 99th percentile FCT.

PASE shows the promise of combining existing transport strategies in a single transport framework. We view it as a first step towards a more holistic approach for building the next generation of datacenter transport protocols.

1.2 NEAT: Task Placement in Datacenters

Data transfer time has a significant impact on task completion times within a datacenter because most datacenter applications (such as MapReduce [48], Pregel [91], MPI [64], Dryad [72]) are data-intensive and they need to access data that are distributed throughout the network. For example, for MapReduce, 30% of task completion time is spent on transferring data across the network [138]. For large commercial datacenters, the transfer of shuffle data is the dominant source of network traffic [13]. In Facebook [138] MapReduce clusters, 20% of jobs are shuffle-heavy (i.e., generate a large amount of shuffle data), and in Yahoo! clusters [38] this goes up to 60%. The network congestion caused by shuffle data is a key factor that degrades the performance of MapReduce jobs [43]. Therefore, data transfer time in datacenters is critical for improving application performance.

Prior work on minimizing data transfer time falls into two categories: task placement ([13, 23, 73, 76, 123, 138]) and network scheduling ([21, 27, 40–42, 51, 69, 94, 96, 141, 144]). The idea of task placement is to place compute tasks close to input data so that data locality is maximized and network traffic is minimized [23, 73, 138]. The idea of network scheduling is to schedule the flows or groups of flows (i.e., coflows) generated by tasks, at shared links, based on given flow properties (such as size, deadline, and correlation among flows) and a given task placement, to minimize flow completion times [21, 42, 69, 94, 96]. The limitation of prior task placement policies is that they design the traffic matrix assuming fair sharing of network bandwidth and ignore the priorities assigned to different flows by the network scheduler; meanwhile, network schedulers schedule flows based on flow priorities (such as size or deadline), and the flow completion time of a low priority flow can increase if it is placed on a path sharing links with high priority flows. The limitation of network schedulers is that the source/destination of each flow is independently decided by the task schedulers and is not necessarily optimal.

We propose NEAT, a Network-schEduling-Aware Task placement framework that leverages the network scheduling policy and network state in making task placement decisions for data-intensive applications. The placement by NEAT, coupled with a properly selected network scheduling policy, ensures that tasks are finished as quickly as possible. The intuition behind NEAT is that a good placement of data-intensive tasks should spread the flows to minimize the sharing of network links. When sharing is inevitable, however, the preferred placement strategy depends on how the network schedules flows, as illustrated by the following example. Let us consider a simple scenario in Figure 1.1, where we want to place a task R that needs to read data from task M running on node 2 onto candidate hosts node 1 or node 3.
At the current time, the network has one flow with a remaining size of 4 Gb on path 2 → 3, which requires four more seconds (on a 1 Gbps link) to finish if running alone. There are two other flows on path 2 → 1, each of remaining size 10 Gb, thus requiring ten more seconds if running alone. Suppose that the input data of candidate task R is of size 5 Gb and our goal is to finish the task as quickly as possible. Assuming First Come First Serve (FCFS) scheduling in the network, we will place task R on node 3, as it provides the smaller completion time of 9 seconds as compared to 25 seconds when placed on node 1, because flow R will not be scheduled until the existing flows finish their data transfer. Assuming fair scheduling (Fair) in the network, we will again place task R on node 3, as it requires 9 seconds to complete flow R on node 3 compared to 15 seconds on node 1. Assuming the network uses shortest remaining processing time (SRPT) scheduling, we will place task R on node 1, as it can finish in 5 seconds on node 1 by preempting both of the existing flows. We assume flows with the same priority share the network fairly, as in existing solutions [21, 94].

Figure 1.1: Network scheduling aware placement. Setup: task M runs on node 2; candidate hosts for task R are node 1 or node 3; new flow (R) size: 5 Gb; link capacity: 1 Gbps. Path 2 → 1 carries two active 10 Gb flows; path 2 → 3 carries one active 4 Gb flow. The resulting predictions are:

  Scheduling policy   Placement of R   Completion time of R (s)   Increase in total completion time (s)
  FCFS                Node 1           25                         25
  FCFS                Node 3           9                          9
  Fair                Node 1           15                         25
  Fair                Node 3           9                          13
  SRPT                Node 1           5                          15
  SRPT                Node 3           9                          9

However, if the goal is to minimize the increase in the total (i.e., sum) flow completion time of all the active flows in the network, then under SRPT the scheduler will choose node 3 to minimize the increase in completion time of flows in the network (i.e., the completion time of the new flow plus the increased completion times of the existing flows). We see that the optimal task placement in a network can vary depending on the network scheduling policy and the network performance metric.
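The numbers above can be reproduced with a small single-link simulation. The sketch below is purely illustrative: it assumes one shared 1 Gbps bottleneck with all flows present at time zero, and it is not NEAT's predictor, but it shows how the preferred node flips with the scheduling policy.

    def completion_times(flows, policy):
        """Finish times (seconds) of flows sharing one 1 Gbps link from t = 0.
        flows: dict {name: size_gb}; policy: "FCFS", "Fair", or "SRPT".
        Ties under SRPT share the link fairly, as assumed in the example."""
        remaining = dict(flows)
        arrival_order = list(flows)            # FCFS serves flows in the given order
        finish, t = {}, 0.0
        while remaining:
            if policy == "FCFS":
                active = [next(n for n in arrival_order if n in remaining)]
            elif policy == "SRPT":
                shortest = min(remaining.values())
                active = [n for n, r in remaining.items() if r == shortest]
            else:                              # "Fair": processor sharing among all flows
                active = list(remaining)
            rate = 1.0 / len(active)           # Gb/s given to each active flow
            dt = min(remaining[n] for n in active) / rate
            t += dt
            for n in active:
                remaining[n] -= rate * dt
                if remaining[n] <= 1e-9:
                    finish[n] = t
                    del remaining[n]
        return finish

    existing = {"node 1": {"f1": 10.0, "f2": 10.0}, "node 3": {"f3": 4.0}}
    for policy in ("FCFS", "Fair", "SRPT"):
        for node, flows in existing.items():
            alone = completion_times(flows, policy)                 # existing flows only
            both = completion_times({**flows, "R": 5.0}, policy)    # with R placed here
            increase = both["R"] + sum(both[f] - alone[f] for f in flows)
            print(f"{policy:4s} {node}: R finishes in {both['R']:.0f} s, "
                  f"total completion time rises by {increase:.0f} s")

Running this prints completion times of 25/9 s (FCFS), 15/9 s (Fair), and 5/9 s (SRPT) for R on node 1/node 3, matching the table in Figure 1.1.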
NEAT leverages these insights in its design and provides a task placement framework to minimize the average flow completion time (AFCT) in the network. NEAT employs a task performance predictor that predicts the completion time of a given task under an arbitrary network scheduling policy. NEAT is quite pluggable in terms of better resource or application models and predictors. NEAT makes task placement decisions in two steps: first, it predicts task performance by hypothetically placing a task on each candidate node and analyzing its performance based on the network scheduling policy and the network state. Next, it selects a node for the task based on the predicted task completion time and a node state that characterizes the sizes of existing flows. The proposed task placement mechanism focuses on network performance in task placement decisions and only uses node properties (e.g., CPU, memory) to determine whether a node is a candidate host. We leave joint placement to future work.

To realize this idea, NEAT addresses two challenges. The first technical challenge in designing NEAT is to accurately predict the performance of a task for a given network state and scheduling policy. We use the completion time of data transfer, i.e., the flow completion time (FCT) or the coflow completion time (CCT), to measure the overall task performance. Compared to network-centric performance metrics such as delay, jitter, and available bandwidth, completion time is task-centric and better models network conditions in terms of their impact on task performance (§5.6). The second technical challenge is to efficiently collect network state information and make placement decisions. As illustrated in Figure 1.2, NEAT uses a distributed control framework with two components: a local controller (network daemon) that maintains a compressed state of all active flows (or coflows) on each node to predict task completion times, and a global controller (task placement daemon) that gathers the predicted task completion times from the local controllers to make task placement decisions. Here, a "node" refers to an endhost (e.g., server) capable of running tasks. Using the flow state information, the local controllers predict the FCT/CCT on edge links (i.e., links directly connected to nodes), based on which the global controller uses a greedy algorithm to pick the node whose edge link has the smallest FCT/CCT. To reduce communications with the local controllers, the global controller also maintains a state for each node, defined as the remaining size of the smallest active flow on that node, such that only nodes with preferred states (no less than the size of the current flow) are contacted.

Figure 1.2: NEAT architecture. The task placement daemon sends task properties to the per-node network daemons and receives the predicted performance and node state in return.

NEAT does not require any changes to the network. However, it requires task information (e.g., size) from applications, similar to prior works [21, 40, 94], to accurately predict task completion time. For recurrent workloads, this information can be inferred by monitoring task execution [76]. NEAT's performance is upper-bounded by the underlying network (flow/coflow) scheduling policy. When the network scheduling policy is suboptimal for FCT (e.g., Fair, LAS), there is plenty of room for improvement and NEAT is able to significantly improve task performance; when the network scheduling policy is near-optimal (SRPT), the room for improvement is smaller, but NEAT still achieves notable improvement in such scenarios.

We evaluate NEAT using both a trace-driven simulator and a 10-machine testbed, based on a variety of production workloads [21, 48] and network scheduling policies such as DCTCP [18], L2DCT [96], PASE [94], and Varys [42]. We evaluate two other state-of-the-art task placement policies: minLoad, which always selects a node with the minimum load, measured by the total size of flows scheduled on that node, and minDist, which always selects a node closest to the input data. NEAT performs 3.7x better than the alternative strategies when the underlying network scheduling policy is Fair (DCTCP [18]), 3x better when the policy is least attained service (LAS) (L2DCT [96]), and 30% better when the policy is SRPT (PASE [94]). In all cases, NEAT improves performance by being aware of the network scheduling policy. In particular, since Fair is the underlying policy in most commercial datacenters, NEAT can significantly improve the performance of data-intensive applications in the majority of cases.
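As a rough sketch of the selection step described above, the function below filters candidates by the node state (remaining size of the smallest active flow) and then greedily picks the candidate with the smallest predicted completion time. The function name, dictionary fields, and fallback behaviour are illustrative assumptions rather than NEAT's actual interfaces.

    def pick_host(flow_size_gb, candidates):
        """Greedy NEAT-style placement: consider only nodes whose state is 'preferred'
        (smallest active flow no smaller than the new flow), then pick the candidate
        with the smallest predicted completion time for the new flow.

        candidates: list of dicts like
            {"node": "h3", "smallest_active_flow_gb": 8.0, "predicted_fct_s": 9.0}
        """
        preferred = [c for c in candidates
                     if c["smallest_active_flow_gb"] >= flow_size_gb]
        pool = preferred or candidates          # fall back if no node is preferred
        return min(pool, key=lambda c: c["predicted_fct_s"])["node"]

    # Example: h1 already carries smaller (higher-priority) flows, so h3 is chosen.
    print(pick_host(5.0, [
        {"node": "h1", "smallest_active_flow_gb": 2.0, "predicted_fct_s": 7.0},
        {"node": "h3", "smallest_active_flow_gb": 8.0, "predicted_fct_s": 9.0},
    ]))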
1.3 Bandwidth Allocation in Multi-tenant Datacenters

In datacenter networks, flows can have different performance objectives. A private datacenter is shared by various tenants, such as search engine, advertising, and e-Business applications. Each tenant can run many service entities (e.g., Virtual Machines, Containers, Java processes) that communicate over the underlying network. The flows generated by these services have different performance objectives due to their service requirements. Some flows are Latency-Sensitive (LS): a service can enqueue copies of a task on multiple servers to combat computation time variability [46]; to minimize resource wastage, a cancellation message should be sent to the counterpart servers as soon as the first replica finishes. On the other hand, some flows are Deadline-Sensitive (DS): the partition-aggregate architecture of Online Data Intensive (OLDI) applications [49, 75] and real-time analytics [52, 92] enforce deadline semantics for every leaf-to-parent flow. Furthermore, for many other applications, minimizing the average flow completion time (AFCT) can significantly improve their performance [21, 69, 94], and we call these flows Completion-Sensitive (CS) flows. We use a tenant-objective division to denote all the flows of a tenant that share the same performance objective.

Bandwidth allocation in datacenters should support not only performance isolation among divisions but also objective-oriented scheduling among flows within the same division. Bandwidth allocation design, in essence, defines how flows behave when congestion happens. Most datacenter networks are oversubscribed [54] and congestion is not uncommon: packet drops due to congestion can be observed when the whole network utilization is around only 25% [121]. To achieve performance isolation, administrators can assign weights to different divisions that share the underlying network [120]. For example, upon congestion, an administrator may prefer tenant A's DS flows over tenant B's DS flows, or all tenants' LS flows over their CS flows. Various techniques can be used to support objective-oriented flow scheduling: some reduce the tail latency of messages [18, 19, 65, 77], some add deadline awareness [69, 127, 135, 142], and others focus on reducing AFCT [21, 27, 59, 69, 93, 94, 96, 106, 141].

This paper studies the Multi-Tenant Multi-Objective (MT-MO) bandwidth allocation problem in datacenter networks. To the best of our knowledge, no existing work supports performance isolation and objective scheduling simultaneously. Many of the existing objective-oriented approaches [18, 19, 21, 27, 59, 65, 69, 77, 93, 94, 96, 106, 141] are designed to achieve only a single performance objective at a time, and there could be severe interference if approaches with different objectives coexist without isolation. This happens because these approaches may detect congestion differently (e.g., packet drop, or ECN) or react to congestion differently (e.g., the ECN co-existence problem in production clouds [80, 122]). pFabric [21] and Karuna [35] evaluate the coexistence of DS and CS flows by setting absolute priority for DS flows over CS flows. However, performance isolation among flows with the same objective but of different tenants is not considered. Furthermore, existing performance isolation approaches cannot optimize performance objectives for an individual tenant-objective division. Neither bandwidth guarantees [66, 71, 79, 84, 108, 114] nor proportional sharing [120] can perform bandwidth allocation at flow-level granularity.

Bandwidth allocation design should be practical and readily deployable. Many works either require non-trivial switch modifications [19, 21, 51, 69, 141] or assume a non-blocking network, which is not widely available in production datacenters [59].
A centralized scheduling mechanism (e.g., Fastpass [106]) may perform the optimization for all divisions, but it requires an ideally scalable control plane and time synchronization for arbitration.

In this work, we propose Stacked Congestion Control (SCC), a distributed host-based bandwidth allocation framework. SCC can support both performance isolation and objective scheduling, and is readily deployable. SCC has two layers to perform congestion control: (i) an underlay congestion control (UCC) layer handles contention among divisions (i.e., performance isolation), and (ii) a private congestion control (PCC) layer for each division optimizes its performance objective (i.e., objective scheduling).

UCC supports the following abstraction: a weight (obtained via administrator policy or utility function maximization [107]) is attached to each division; each division should get a bandwidth share that is proportional to its weight. The first challenge is how to achieve weighted bandwidth allocation by controlling the active flows within each division in a distributed and transparent way. Irrespective of the tenant-objective division semantics, TCP (and its variants) only allocates bandwidth in proportion to the contending flows. SCC thus introduces the tenant-objective tunnel as an edge-based abstraction (between every eligible pair of source and destination hosts, flows of each division are aggregated inside a dedicated point-to-point tunnel). Each tunnel is assigned a weight (dynamically updated by the division's PCC) and SCC allocates the bandwidth share to a tunnel in proportion to its weight. Because of the state-of-the-art multi-path load balancing schemes (e.g., Presto [67]) in datacenters (e.g., Clos networks), the share of a tunnel in each network tier is statistically proportional to its weight. Our key insight is that the division's bandwidth share can be achieved by cooperatively scaling each tunnel's network weight if the tunnels in the same division ensure that the sum of their weights equals the global weight of the division.

The second challenge is how to support two congestion control mechanisms (UCC and PCC) simultaneously. For distributed host-based bandwidth allocation, two factors define congestion control behaviour: (i) the congestion signal, i.e., what method is used to detect network congestion, and (ii) the control law, i.e., what rate increase (decrease) law is used to grab (yield) network bandwidth. The dilemma is that, to respect performance isolation, all tunnels should follow the same UCC mechanism; there is only one uniform set of congestion signals available from the network, which reflects the contention among tunnels, and it cannot be directly exploited for objective-oriented congestion control inside each tunnel. Our technical contribution here is to add a rate-limiting queue at the ingress of each tunnel to completely decouple the two congestion control mechanisms. UCC uses ECN as the congestion signal and uses weighted network sharing algorithms (similar to Seawall [120]) to derive the tunnel rate according to the network condition. Moreover, each sender queue can locally generate the desired congestion signal for the PCC layer. Further, we develop PCC algorithms for every objective to dynamically derive the per-tunnel weight, so that each tunnel is allocated a weight according to the demand of its flows as compared to other flows within the same division.
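The weight bookkeeping described above can be sketched in a few lines: the tunnels of a division cooperatively split the division's global weight so that their weights sum to it, and each tunnel then receives a link share proportional to the weight it holds. Splitting the weight in proportion to per-tunnel demand is an assumption made here for illustration; in SCC the per-tunnel weights are derived by the division's PCC algorithm.

    def split_division_weight(division_weight, tunnel_demands):
        """Scale per-tunnel weights so they sum to the division's global weight.
        tunnel_demands: {tunnel_id: current demand (e.g., backlog in bytes)}."""
        total = sum(tunnel_demands.values()) or 1.0
        return {t: division_weight * d / total for t, d in tunnel_demands.items()}

    def link_shares(link_capacity_bps, tunnel_weights):
        """Weighted (Seawall-like) share of one bottleneck link among the tunnels crossing it."""
        total_w = sum(tunnel_weights.values())
        return {t: link_capacity_bps * w / total_w for t, w in tunnel_weights.items()}

    # Tenant A's deadline-sensitive division (weight 2) vs. tenant B's (weight 1):
    a = split_division_weight(2.0, {"a->h1": 3e6, "a->h2": 1e6})
    b = split_division_weight(1.0, {"b->h1": 4e6})
    print(link_shares(1e9, {**a, **b}))

On a shared 1 Gbps link, tenant A's tunnels jointly receive about two thirds of the capacity, i.e., the 2:1 ratio of the division weights, regardless of how many tunnels each division opens.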
Besides existing objective-oriented scheduling approaches, we also develop new in-tunnel scheduling algorithms to exploit the extra flexibility provided by SCC.

We implement SCC as a Linux kernel module using NetFilter. We evaluate SCC on a small-scale testbed with 16 Dell servers and a commodity PICA-8 Gigabit Ethernet switch with ECN enabled. To complement our small-scale testbed experiments, we further conduct large-scale simulations using NS-2. SCC can meet the requirements of the different performance objectives when flows of different objectives coexist in the datacenter network. Compared to the direct coexistence cases, SCC reduces latency by up to 40% for Latency-Sensitive flows, the deadline miss ratio by up to 3.2× for Deadline-Sensitive flows, and the average flow completion time by up to 53% for Completion-Sensitive flows.

1.4 Multipath TCP Attacks & Countermeasures

Multipath TCP (MPTCP) is an IETF-standardized suite of TCP extensions that allow an MPTCP connection between two hosts to simultaneously use multiple available paths [56]. This parallelism capability significantly improves throughput and reliability while maintaining the same TCP interface to the applications. For example, smartphones can use MPTCP to stream videos over both WiFi and 3G/4G connections simultaneously for a better user experience. Since its inception, MPTCP has been fueled with great interest and excitement from both academia [82, 105, 112, 136] and industry [56, 116]. For example, Apple iOS (version 7 onwards) supports MPTCP and Google Android now has modified ROMs with MPTCP support [10, 11].

Compared to TCP, MPTCP splits the security risk across multiple subflows. If only one subflow is attacked, the MPTCP connection as a whole may not be severely affected. In theory, if only one subflow is eavesdropped, the attacker should not be able to eavesdrop the traffic of the whole MPTCP connection. On the flip side, MPTCP has a larger attack surface than TCP – as long as any one of the subflows is under attack (e.g., by a man-in-the-middle attacker), the whole MPTCP connection can be affected, which puts MPTCP at a disadvantage compared to TCP. Prior work lacks analysis comparing the attack surface of MPTCP with that of TCP [25, 26]. To date, it is still largely unclear whether new attacks are possible or whether prior known attacks on TCP (e.g., connection hijack [8, 88, 110, 111]) can be more easily launched on MPTCP [56].

To fill this gap, in this paper, we specifically analyze MPTCP's cross-path attack surface, i.e., what an attacker controlling one path can do to affect other paths or the MPTCP connection as a whole. We show that there are indeed worrisome cross-path attacks that make the current version of MPTCP less secure than TCP. For instance, we show that an attacker controlling any one of the subflows can take full control over the MPTCP connection (see §7.2), which makes many attacks that require full man-in-the-middle capability much more likely to occur in MPTCP [24, 37, 119]. We subsequently analyze the root causes of such cross-path vulnerabilities, and propose and implement a set of changes to MPTCP to prevent this class of attacks. As the weaknesses are fundamental to MPTCP, we argue that these changes should be made at the MPTCP layer itself (instead of being pushed to higher layers).
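Part of this cross-path exposure comes from the sequence numbering itself: besides its own subflow-level sequence number, every subflow carries a mapping to the connection-level global sequence number. The toy model below (with simplified, hypothetical field names, not the Linux MPTCP data structures) shows how an eavesdropper on a single path can estimate how many bytes the connection sent over all other paths; dividing by the observation window turns this into the throughput estimate exploited by the attacks reported below.

    class ObservedSubflow:
        """What an on-path eavesdropper sees on ONE subflow: for each packet, the
        subflow-level sequence number and the connection-level global sequence
        number (GSN) carried in the DSS-style mapping."""
        def __init__(self):
            self.samples = []            # list of (subflow_seq, global_seq)

        def observe(self, subflow_seq, global_seq):
            self.samples.append((subflow_seq, global_seq))

        def bytes_on_other_paths(self):
            (s0, g0), (s1, g1) = self.samples[0], self.samples[-1]
            # Bytes the whole connection advanced, minus bytes seen on this path,
            # approximates the traffic carried by every other subflow.
            return (g1 - g0) - (s1 - s0)

    eve = ObservedSubflow()
    eve.observe(subflow_seq=1_000, global_seq=50_000)
    eve.observe(subflow_seq=4_000, global_seq=70_000)   # +3 KB here, +20 KB overall
    print(eve.bytes_on_other_paths())                   # ~17 KB went over other subflows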
Our key contributions and findings are as follows.

• Vulnerabilities: We report two cross-path MPTCP vulnerabilities. First, an attacker eavesdropping on one MPTCP subflow can infer the throughput of other MPTCP subflows by analyzing the local and global sequence numbers (GSN) embedded in MPTCP headers. Second, an attacker can send forged packets with the MPTCP MP_PRIO option containing the backup flag to set any MPTCP subflow as a backup, causing it to be stalled completely. Even though the vulnerabilities are verified on the Linux MPTCP implementation (version 0.91), they are actually caused by the current design of MPTCP, which calls for changes to the specification rather than simple implementation fixes.

• Attacks: Based on these vulnerabilities, we present two practical attacks on MPTCP: directed traffic diversion attacks and hijack attacks. We show that an attacker can onload or offload a controlled amount of traffic on specific MPTCP subflows. We also show that an attacker controlling one path can easily force all traffic to route through itself by setting all other subflows as backup. We implement and validate these attacks on a real network testbed using the MPTCP Linux kernel implementation (v0.91) [101].

• Attack Prevention and Countermeasures: We show that the two cross-path vulnerabilities can in fact be prevented by leveraging "secret splitting" across multiple paths. The intuition is that it is highly unlikely that an attacker will be able to monitor and control all paths used by an MPTCP connection. Therefore, secrets can be split across multiple paths to prevent complete eavesdropping or tampering. We propose a set of changes to MPTCP based on this intuition and implement them in the Linux MPTCP code base to verify the ability to defend against this class of cross-path attacks.

CHAPTER 2 REVIEW OF RELATED RESEARCH

In this chapter, I review the existing schemes for each project.

2.1 PASE: A Near-Optimal Transport for Datacenters

We can categorize prior works in terms of the underlying transport strategies they use.

Self-Adjusting Endpoints: Several datacenter transports use this strategy [17, 36, 97, 126]. DCTCP uses an adaptive congestion control mechanism based on ECN to maintain low queues. D2TCP and L2DCT add deadline-awareness and size-awareness to DCTCP, respectively. MCP [36] improves performance over D2TCP by assigning more precise flow rates using ECN marks. These rates are based on the solution of a stochastic delay minimization problem. These protocols do not support flow preemption and in-network prioritization, which limits their performance.

Arbitration: PDQ [70] and D3 [134] use network-wide arbitration but incur high flow switching overhead. PASE's bottom-up approach to arbitration has similarities with EyeQ [78], which targets a different problem of providing bandwidth guarantees in multi-tenant cloud datacenters. PASE's arbitration mechanism generalizes EyeQ's arbitration by dealing with scenarios where contention can occur at links other than the access links.

In-network Prioritization: pFabric [20] uses in-network prioritization by doing priority-based scheduling/dropping of packets. DeTail [140], a cross-layer network stack, focuses on minimizing the tail latency but does not target the average FCT. In [83], the authors propose virtual shapers to overcome the challenge of a limited number of rate limiters. While both DeTail and virtual shapers use in-network prioritization, they do not provide mechanisms for achieving network-wide arbitration.
In general, prior proposals target one specific transport strategy; PASE uses all these strategies in unison to overcome the limitations of the individual strategies and thus achieves high performance across a wide range of scenarios while being deployment friendly.

2.2 NEAT: Task Placement in Datacenters

The related work can be divided into two categories: task placement (or scheduling) and network scheduling.

Task Scheduling: State-of-the-art task schedulers follow the principle of "maximizing data locality" to minimize data transfer over the network. For the first stage of jobs (e.g., Map), techniques such as delay scheduling [138] and flow-based scheduling [73] try to place tasks on the machines or racks where most of their input data are located. Mantri instead optimizes the data locality while placing reducers [23]. However, when the data is spread throughout the cluster (e.g., in distributed file systems like HDFS [30]), the subsequent job stages (e.g., the shuffle phase) have to read data using cross-rack links, and often suffer from heavy network congestion [39]. To alleviate the problem, techniques like ShuffleWatcher [13] attempt to localize the Map tasks of a job to one or a few racks, and thus reduce cross-rack shuffling. However, such techniques end up increasing the cross-rack transfer of the input data. A similar idea was exploited in Sinbad [39] to place the endpoints of flows to avoid congested links, but Sinbad targets a special type of flows (replicated file system writes) and is agnostic to the underlying network scheduling policy. When the input data of a job is known beforehand, techniques like Corral [76] can be used to pack the input data onto a few racks such that all computation can be performed on local data. Similarly, many fairness-based cluster scheduling frameworks [60–62, 68, 128] and straggler mitigation techniques [22, 23, 113, 139] have been proposed. However, these works assume fair sharing in the network and do not consider the effect of the underlying network scheduling policy. Another category of works [33, 34] performs joint network and task execution scheduling; however, these works assume knowledge of task arrival times and make offline task scheduling decisions.

Network Scheduling: Generally, data-intensive jobs spend a substantial fraction of their execution time on data transfer [138], and consequently, their performance critically depends on the performance of network scheduling. In recent years, a variety of network scheduling systems such as DCTCP [18], L2DCT [96], PASE [94], Varys [42], and Baraat [51] have been proposed. Many of these protocols provide near-optimal data transfer time, but the improvement to the overall task performance is limited as they do not optimize the placement of the sources/destinations of the flows. NEAT addresses these limitations in two ways: it places the destinations of flows (via task placement) while taking into account the underlying network scheduling policy to minimize the overall data transfer time, and it uses a performance predictor that captures the network state and the essence of the network scheduling policy.

2.3 Bandwidth Allocation in Multi-tenant Datacenters

We can divide the related work based on the objective.

Latency Sensitive (LS): Silo [77] proposes a placement algorithm and hypervisor-based packet pacing to provide latency guarantees. However, it does not differentiate among flows and applications with other performance goals such as deadlines and relative priorities among flows.
HULL [19] trades bandwidth for ultra-low latency, and it requires modifications to the switch ASIC. Fastpass [106] uses a centralized arbiter to decide when and along which path each packet should be sent. Its potential in large-scale datacenters is yet to be proven. QJUMP [65] uses multiple physical switch queues to provide different combinations of latency and throughput. It supports both latency-sensitive and throughput-sensitive flows, but with constraints. DCTCP can keep the switch queue length at a low level to reduce packet latency [18].

Deadline Sensitive (DS): D3 pioneers the idea of incorporating deadline awareness into network scheduling [135]. However, D3 and RACS [95] are not deployment friendly as they require modifications to commodity switch hardware. D2TCP [127] is a distributed deadline-sensitive protocol, which uses both ECN feedback and deadlines to modulate the congestion window.

Completion Sensitive (CS): PDQ [69] uses preemptive flow scheduling to minimize average FCT. pFabric [21] and PASE [94] assume flow size is known a priori, and attempt to approximate Shortest Job First (SJF), which is the optimal scheduling discipline for minimizing average FCT in a single bottleneck scenario. L2DCT [96] and PIAS [27], without prior knowledge of flow size information, reduce AFCT by approximating the Least Attained Service (LAS) scheduling discipline.

2.4 Multipath TCP Attacks & Countermeasures

TCP Attacks: We review TCP attacks aimed at terminating or hijacking TCP connections. In such attacks, an attacker first finds an in-window sequence number and then forges packets (such as TCP reset packets) to terminate or hijack a TCP connection. To find an in-window sequence number, the attacker first generates candidate sequence numbers (e.g., using random guessing) and then uses a feedback mechanism to identify which candidate sequence number falls into the TCP window. Several feedback mechanisms to identify the correctly guessed sequence number have been used in prior work. In [8], the feedback mechanism is based on the behavior of the Linux 2.0.X kernel implementation, i.e., the kernel dropped the packets if the sequence number was too small and reset the connection if the sequence number was too large; this vulnerability has been fixed in newer versions of the Linux kernel. In [88, 111], the feedback mechanisms are based on the behavior of recent Windows and Linux TCP stack implementations, respectively. In [110], the feedback mechanism is based on the behavior of some stateful firewalls that drop packets with out-of-window sequence numbers. The attacks in [111] and [110] assume that a piece of unprivileged malware runs on the host; the malware can read packet counters from procfs (which is publicly accessible) in the Linux operating system and notify the attacker when an ACK packet with an in-window sequence number is received.

MPTCP Attacks: RFC 6181 [25] and RFC 7430 [26] discuss some MPTCP-specific attacks including flooding, hijacking, and Denial of Service (DoS) attacks. However, they do not cover the threat model and the problems discussed in this paper. Shafiq et al. [118] proposed cross-path throughput inference attacks on MPTCP. The authors showed that MPTCP subflows are interdependent due to the congestion control coupling. The authors demonstrated that the resulting side-channels can be exploited by an attacker controlling one of the subflows to infer the throughput of other subflows.
In this paper, we build on their work by proposing novel traffic diversion attacks, optionally based on cross-path MPTCP throughput inference that provides feedback to the attacker. We also design and implement countermeasures to defend against these cross-path throughput inference based attacks.

Popovici et al. [109] discuss the use of MPTCP congestion control for diverting traffic from one network to another to gain profit. The authors introduce a policy drop strategy that a network operator can use to divert traffic from one path to another by introducing packet drops on one of the paths. This allows network operators to reduce the traffic they carry, and divert some of their own traffic to the other path(s). However, this work does not consider per-connection policy drops and assumes that all connections experience the same level of packet drops. Our MPTCP connection throughput inference attack allows more fine-grained control of traffic diversion attacks. Network operators can use throughput inference to monitor the overall connection throughput from the subflow passing through their network, and divert only an amount of traffic that does not hurt the overall MPTCP connection throughput. Moreover, using the MP_PRIO option, an attacker can divert traffic to or from a specific path, which cannot be done with other existing attack scenarios.

Jadin et al. [74] propose MPTCPSec, which adds authentication and encryption to MPTCP for both the data and the MPTCP options. MPTCPSec is a complete suite that can be used to prevent the attacks discussed in this paper. However, it is a relatively heavyweight solution (e.g., it requires key distribution/sharing and more complete encryption) compared to the small changes we propose to prevent the reported attacks.

Since MPTCP transparently distributes the traffic of an end-to-end TCP connection across multiple paths, network monitoring devices such as intrusion detection systems (IDS) can only observe partial traffic. Therefore, MPTCP will negatively impact the functionality of network monitoring devices [104]. For example, attackers can evade signature-based detection by splitting malicious payloads across multiple paths. Distributed signature-based intrusion detection approaches (e.g., [89]) have been investigated to overcome these difficulties.

CHAPTER 3 OVERVIEW

In this chapter, I give an overview of the limitations of existing approaches using some case studies.

3.1 PASE: A Near-Optimal Transport for Datacenters

To achieve high performance (i.e., to minimize completion times, maximize throughput, or reduce the deadline miss rate [17, 20, 70]), existing datacenter transports use one or two of the following three transport strategies: (1) self-adjusting at endpoints, (2) arbitration, or (3) in-network prioritization. In this section, we first introduce these strategies and discuss their limitations when they are employed in isolation. Then, we discuss how these limitations can be addressed if these strategies are used together.

3.1.1 Transport Strategies in Isolation

Each transport strategy has its own strengths and limitations, as shown in Table 3.1. We now introduce the basic working of each strategy, discuss its strengths, and validate its limitations through simulation experiments.

Self-Adjusting at Endpoints: With this strategy, endpoints themselves decide the amount of data to send based on the network congestion status, which is determined through congestion signals that could be implicit (such as packet loss) or explicit (such as ECN).
In case of congestion, if fairness is the objective, the window size is reduced by the same factor for all flows; if flow prioritization is the objective, the window size is reduced based on other parameters (such as the remaining flow size [20, 97] or deadline [126]). Protocols in this category are easy to deploy because they do not require any changes to the network infrastructure. However, when considering flow prioritization, their performance is inferior to the state-of-the-art datacenter transport protocols (such as pFabric [20] and PDQ [70]) as they do not provide strict priority scheduling: even low priority flows, which should be paused, continue to send at least one packet per RTT. This hurts transport performance, especially at high loads when multiple flows are simultaneously active.

Table 3.1: Comparison of different transport strategies.
Transport Strategy | Pros | Cons | Examples
Self-Adjusting Endpoints | Ease of deployment | Lack of support for strict priority scheduling | DCTCP [17], D2TCP [126], L2DCT [97]
Arbitration | (1) Support for strict priority scheduling; (2) fast convergence to allocated rates | (1) High flow switching overhead; (2) hard to compute precise flow rates | D3 [134], PDQ [70]
In-network Prioritization | (1) Work conservation; (2) low flow switching overhead | (1) Limited number of priority queues in switches; (2) switch-local decisions lead to sub-optimal performance | pFabric [20]

To demonstrate this behavior, we take two protocols that follow the self-adjusting end-point strategy, DCTCP [17] and D2TCP [126] (a deadline-aware version of DCTCP), and compare their performance with pFabric [20], the state-of-the-art datacenter transport with the best reported performance. We replicate a deadline-oriented scenario in ns2, corresponding to the experiment in Section 4.1.3 of the D2TCP paper [126]: it represents an intra-rack scenario, where the source and destination of each flow is picked randomly and the flow sizes are uniformly distributed between [100 KB, 500 KB] in the presence of two background long flows. The deadlines are uniformly distributed from 5 ms to 25 ms.

Figure 3.1(a) shows the fraction of deadlines met (or application throughput) as a function of load for the three schemes. While at low loads D2TCP is able to meet deadlines (i.e., achieve prioritization), at higher loads its performance approaches its fair-sharing counterpart, DCTCP. Moreover, both these protocols perform much worse than pFabric at high loads, highlighting their limitations in achieving priority scheduling across a wide range of network loads.

Figure 3.1: Comparison of (a) D2TCP, DCTCP and pFabric, and (b) PDQ (an arbitration based approach) with DCTCP.

Arbitration: Explicit rate protocols, like PDQ [70], use this transport strategy. Instead of endpoints making decisions on their own, arbitration based approaches [70, 134] require the switches to make the scheduling decision, keeping in view all network flows and their individual priorities (e.g., deadline, flow size). The scheduling decision is communicated as a rate at which flows should send data. The rate could be zero, if the flow needs to be paused because of its low priority, or it could be the full link capacity, if the flow has the highest priority. While arbitration is a centralized problem in general, prior work [70, 134] shows that it can be done in a decentralized fashion: each switch along the path of a flow adds its allocated rate to the packet header, and the sender picks the minimum rate for transmitting data.
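To make the decentralized arbitration concrete, the following is a minimal sketch in Python (illustrative names only, not PASE's or PDQ's implementation): each link arbitrator grants a flow whatever capacity is left after higher priority flows, and the sender transmits at the minimum of the per-link allocations.

class LinkArbitrator:
    """Toy per-link arbitrator: grants a flow the capacity left over after
    higher priority flows (an assumption made for illustration)."""
    def __init__(self, capacity_mbps, higher_priority_demand_mbps=0.0):
        self.capacity = capacity_mbps
        self.higher_priority_demand = higher_priority_demand_mbps

    def allocate_rate(self, demand_mbps):
        spare = max(self.capacity - self.higher_priority_demand, 0.0)
        return min(demand_mbps, spare)

def sender_rate(path, demand_mbps):
    # Each switch on the path advertises an allocation in the packet header;
    # the sender picks the minimum (the bottleneck allocation).
    return min(link.allocate_rate(demand_mbps) for link in path)

# Example: a 1000 Mbps flow crossing two links; the second link already carries
# 600 Mbps of higher priority traffic, so the sender is granted 400 Mbps.
path = [LinkArbitrator(1000), LinkArbitrator(1000, higher_priority_demand_mbps=600)]
print(sender_rate(path, 1000))   # -> 400.0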
The explicit nature of arbitration based approaches ensures that flows achieve their desired rate quickly (typically in one RTT). Moreover, the ability to pause and unpause flows enables strict priority scheduling of flows: the highest priority flow gets the full link capacity (if it can saturate the link) while other flows are paused. However, the explicit rate assignment comes with its own set of problems. For example, calculating accurate rates for flows is challenging as flows could be bottlenecked at other non-network resources (e.g., the source application or the receiver). Another important issue is the flow switching overhead, which refers to the overhead of pausing and unpausing flows. This overhead is typically around 1-2 RTTs, which can be significant in scenarios involving short flows (when flows last for a small duration) and at high network load (when flows need to be frequently preempted).

We illustrate the impact of flow switching overhead in a practical scenario through a simulation experiment. We consider PDQ [70], which is considered the best performing arbitration based scheme, and compare its performance with DCTCP [17]. (Our PDQ results are based on the simulator code obtained from the PDQ authors; it supports all the optimizations that reduce the flow switching overhead.) The scenario is a repeat of the previous intra-rack, all-to-all experiment, except the metric here is flow completion time. Figure 3.1(b) shows the average flow completion time as a function of network load. At low loads, PDQ outperforms DCTCP because of its fast convergence to the desired rate. However, at high loads, the flow switching overhead becomes significant as more flows contend with each other, thereby requiring more preemptions in the network. As a result, PDQ's performance degrades and the completion time becomes even higher than that of DCTCP.

In-network Prioritization: In transport protocols that use in-network prioritization (e.g., pFabric [20]), packets carry flow priorities, such as the flow deadline or size, and the switches use this priority to decide which packet to schedule or drop (in case of congestion). This behavior ensures two desirable properties: work conservation, so a lower priority packet is scheduled if there is no packet belonging to a higher priority flow, and preemption, which ensures that when a higher priority packet arrives, it gets precedence over a lower priority packet.

The well known downside to in-network prioritization is the limited number of priority queues available in switches–typically on the order of 8-12 [134] (also see Table 4.1). For most practical scenarios, this number is much smaller than the number of unique flow priorities in the system. Proposals that can support a larger number of priority levels require changing the network fabric [20], which makes them hard to deploy. Another shortcoming of this strategy is that switches make local decisions about prioritization, which can lead to sub-optimal performance in multi-link scenarios. This is shown in Figure 3.2 through a simple toy example involving three flows. Flow 1 has the highest priority; Flow 2 has medium priority and Flow 3 has the lowest priority. Flows 1 and 2 share link B, so only Flow 1 can progress while Flow 2 should wait.
A protocol like pFabric continues to send packets of Flow 2 on link A even though these packets are dropped at link B. These unnecessary transmissions stall Flow 3, which could have run in parallel with Flow 1 as the two flows do not share any link.

Figure 3.2: Toy example illustrating the problem with pFabric (Flow 1: Src 1 to Dest 1, highest priority; Flow 2: Src 2 to Dest 1, medium priority; Flow 3: Src 2 to Dest 2, lowest priority).

The above toy example highlights a common use case present in all-to-all traffic patterns (e.g., MapReduce [47], Search) where a node typically has data to send to many other nodes. To quantify this problem under such practical settings, we simulate the interaction between workers and aggregators within a single rack of a search application. Each worker-aggregator flow is uniformly distributed between [2, 198] KB. We focus on the loss rate of pFabric when the network load is increased. Figure 3.3 shows that the loss rate shoots up as the load on network links is increased. For a load of 80%, more than 40% of packets are dropped. These lost packets translate into throughput loss as we could have used these transmissions for packets belonging to other flows. Section 5.6 shows how this high loss rate results in poor flow completion time for pFabric.

Figure 3.3: Loss rate for pFabric as a function of load.

3.1.2 Transport Strategies in Unison

We now discuss how combining these transport strategies offers a simple solution to the problems identified earlier.

In-network Prioritization complementing Arbitration: The high flow switching overhead of arbitration-only approaches can be avoided with the help of in-network prioritization. As today's arbitration-only approaches, like PDQ [70], assume no prioritization within the network, they achieve priority scheduling by communicating explicit rates to end-hosts, which takes time, and thus results in a high flow switching overhead. If we have in-network prioritization, the arbitrator can just assign relative priorities to the flows (e.g., high priority flow vs. low priority flow), leaving it up to the switches to enforce this relative priority through a suitable scheduling and dropping mechanism. The current in-network prioritization mechanisms (e.g., priority queues) provide seamless switching between flows of different priorities, so there is no flow switching overhead and link utilization remains high during this period.

A simple example illustrates this benefit. Assume we have two flows–F1 (higher priority) and F2 (lower priority). With arbitration-only approaches, F1 is initially assigned the entire link capacity while F2 is paused during this time. When F1 finishes, we have to explicitly signal F2 to unpause. With in-network prioritization, we can just assign these flows to different priority classes–F1 is mapped to the high priority class while F2 is mapped to the low priority class. The switch ensures that as soon as there are no more packets of F1 in the high priority queue, it starts scheduling packets of F2 from the lower priority queue.

Arbitration aiding In-network Prioritization: The small number of priority queues causes performance degradation when multiple flows get mapped to the high priority queue [20]. This results in multiplexing of these flows instead of strict priority scheduling (i.e., one flow at a time).
This problem can be avoided with the help of arbitration. Instead of statically mapping flows to queues, an arbitrator can do a dynamic mapping, so a flow's priority queue keeps changing during the flow's lifetime. Flows whose "turn" is far away are mapped to lower priority queues. As a flow's turn is about to come, it moves up to a higher priority queue.

We explain this idea through a simple two-queue example. Suppose queue A (QA) is the high priority queue and queue B (QB) is the lower priority queue. We have four flows to schedule (F1, F2, F3, and F4, with F1 having the highest priority and F4 the lowest)–as the number of flows is more than the number of queues, any static mapping of flows to queues will result in sub-optimal performance. With the help of arbitration, we can initially map F1 to QA and the other three flows to QB. When F1 finishes, we can change the mapping of F2 from QB to QA while flows F3 and F4 are still mapped to QB. A similar process is applied when F2 (and later on, F3) finishes. In short, the highest priority queue is used for the active, high priority flow while the lower priority queue is used primarily to keep link utilization high (i.e., work-conservation). The example shows how arbitration can help leverage the limited number of priority queues without compromising on performance.

Arbitration helping Self-Adjusting Endpoints: With arbitration-only approaches, calculating precise flow rates can be hard as the arbitrator may not have accurate information about all the possible bottlenecks in the system [70, 134]. Thus, we can end up underestimating or overestimating the available capacity. Unfortunately, in arbitration-only approaches, end-points–which typically have a better idea of path conditions–are dumb: they always transmit at the rate assigned by the arbitrator, so even if they are in a position to detect congestion or spare capacity in the network, they cannot respond. The self-adjusting end-point strategy naturally addresses this problem as it constantly probes the network: if there is any spare capacity, it will increase its rate, and if there is congestion, it will back off. For example, suppose there are two flows in the system with different priorities. The higher priority flow is assigned the full link capacity but it is unable to use it. The lower priority flow will remain paused if we do not use self-adjusting end-points. However, if the end-point uses a self-adjusting policy, it will detect spare capacity and increase its rate until the link is saturated. Note that arbitration also helps the self-adjusting end-point strategy: instead of just blindly probing the network for its due share, a flow can use information from the arbitrator to "bootstrap" the self-adjusting behavior.

3.2 NEAT: Task Placement in Datacenters

Task placement policies perform differently when used with different network scheduling policies. To illustrate this, we compare the performance of two task scheduling policies under two different network scheduling policies. For task scheduling policies, we consider minimum-distance placement (minDist), which places each task to minimize its distance to the input data, and minimum-load placement (minLoad), which places each task on the server with the minimum traffic load. For network scheduling policies, we consider DCTCP [18], which approximates the Fair sharing policy, and PASE [94], which approximates the shortest remaining processing time (SRPT) policy in the network.
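As a point of reference for the comparison that follows, the two placement heuristics can be sketched in a few lines of Python (a toy model with made-up server names and metrics, not the NEAT implementation):

def min_dist_placement(servers, hops_to_input_data):
    # minDist: place the task on the server closest to its input data.
    return min(servers, key=lambda s: hops_to_input_data[s])

def min_load_placement(servers, traffic_load_mbps):
    # minLoad: place the task on the server with the least traffic load.
    return min(servers, key=lambda s: traffic_load_mbps[s])

servers = ["s1", "s2", "s3"]
hops_to_input_data = {"s1": 0, "s2": 2, "s3": 4}       # network distance to the task's input
traffic_load_mbps = {"s1": 800, "s2": 100, "s3": 300}  # existing load on each server

print(min_dist_placement(servers, hops_to_input_data))   # -> s1
print(min_load_placement(servers, traffic_load_mbps))    # -> s2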
We simulate a topology of 160 machines, connected using a folded CLOS topology as in [94]. We configure the links to have 200 µs RTT and 1 Gbps (edge) or 10 Gbps (core) bandwidth. We evaluate each combination of the above task/network scheduling policies under flow demands generated according to the data-mining workload [21]. Figure 3.4(a) and Figure 3.4(b) show the ratio between the FCT of minDist and the FCT of minLoad for various flow sizes under the SRPT and Fair network scheduling policies, respectively. A bar with coordinates (x, y) indicates that flows of size x have an FCT ratio of y, where y < 1 indicates that minDist outperforms minLoad and y > 1 indicates the opposite.

Figure 3.4: Comparison of minDist and minLoad placement under the SRPT and Fair network scheduling policies ((a) SRPT (PASE), (b) Fair (DCTCP)).

The results show that when the network follows the SRPT policy (Figure 3.4(a)), minDist placement works better (ratio less than 1). In contrast, when the network follows a Fair policy (Figure 3.4(b)), minLoad placement works better for the majority of the flow sizes. The results can be explained by the following observations: when the network follows the SRPT policy (Figure 3.4(a)), short flows preempt long flows, and receive the full link bandwidth. Short flows therefore achieve near-optimal completion time. We verified this in the FCT logs but omit the details for brevity. After the short flows complete, more bandwidth becomes available to the long flows, allowing them to finish quickly. The minDist placement outperforms minLoad placement because it minimizes the total network load, defined as the sum of the product of the size and the hop count of each flow. In contrast, when the network follows the Fair policy (Figure 3.4(b)), short flows no longer preempt long flows. The minLoad placement works better – especially for long flows – because it ensures that long flows avoid being placed on nodes with existing long flows. However, Figure 3.4(b) shows that short flows may suffer longer FCTs under the minLoad placement compared to the minDist placement. This usually happens when newly arriving long flows are placed on nodes with ongoing short flows, thus increasing the FCTs of short flows.

In conclusion, this experiment demonstrates that the performance of task placement depends on the underlying network scheduling policy. Therefore, it is important to design a task placement framework that takes into account the underlying network scheduling policy.

3.3 Bandwidth Allocation in Multi-tenant Datacenters

3.3.1 Interference when Sharing Network

To demonstrate the interference that arises when flows with different objectives coexist in the same network, we use a single bottleneck topology as shown in Fig. 3.5(a). This represents a common datacenter scenario where edge-to-core links might be the bottleneck among flows of different applications. We set the switch ECN marking threshold to 20, the switch buffer size to 250 packets, and the round-trip time to 300 µs. We consider traffic from a LS application (from SRC1 to DST1), a DS application (from SRC2 to DST2), and a CS application (from SRC3 to DST3).

Figure 3.5: Interference demonstration, when flows run alone and when they coexist ((a) example topology, (b) LS flows' RTT, (c) DS flows' missed deadlines, (d) CS flows' finish time).

We use the
instantaneous end-to-end packet RTT, the fraction of missed deadlines, and the average FCT (AFCT) as the performance metrics for evaluating the LS, DS and CS flows, respectively. We use NS-2 for performance evaluation. We consider state-of-the-art deployment-friendly protocols for each of the performance goals and implement them in NS-2. Specifically, we use DCTCP for LS flows, D2TCP for DS flows, and L2DCT for CS flows. For LS flows, we generate small (2 KB) flows that arrive at intervals following a Poisson distribution and require an aggregate bandwidth of 400 Mbps on average. For DS and CS, the flows are generated uniformly within the range 2 KB-198 KB, with deadlines in the range 5-25 ms (for DS only). Flows arrive at intervals generated following a Poisson distribution, incurring low (20%), medium (50%) and high (80%) load on the link.

3.3.1.1 Performance in Isolation

We first evaluate the application performance in a scenario where each protocol has exclusive access to the underlying network. When used in isolation, we denote the flows as Base-LS, Base-DS, and Base-CS, respectively. In these experiments, the bottleneck link capacity is 500 Mbps.

Base-LS: As shown in Fig. 3.5(b), the LS flow packets experience almost zero queuing delay and the round-trip latency is very close to the RTT (300 µs).

Base-DS: Fig. 3.5(c) shows the deadline miss rate of DS flows across a range of loads. We can see that for light to moderate loads, almost no flows miss deadlines. However, at high load many flows miss deadlines.

Base-CS: Fig. 3.5(d) shows the AFCT under different network load conditions.

3.3.1.2 Coexistence of Different Objectives

Next, we evaluate the scenarios with two performance objectives coexisting in the network. The bottleneck link capacity is 1000 Mbps, which is large enough to meet the demands of both types of flows.

Coexist-LS/DS: In this scenario, the performance of LS flows is severely affected, as shown in Fig. 3.5(b). LS flows experience large queuing delays and, as a result, their tail latency increases by more than 300%; the average latency increases by more than 50%. On the other hand, DS flows also experience performance degradation and many flows miss their deadlines, as shown in Fig. 3.5(c).

Coexist-DS/CS: We consider the coexistence of DS and CS flows in the network. In this scenario, DS flows experience performance degradation, as shown in Fig. 3.5(c): even at low loads, some of the flows miss their deadline due to interference from CS flows, and the degradation increases up to 4x for medium and higher network loads. Similarly, the completion times of CS flows are also affected and flows take longer to finish compared to the scenario with only CS flows in the network (Fig. 3.5(d)). Specifically, at higher loads, the AFCT of CS flows increases by more than 2x.

3.3.1.3 Same Objective Different Tenants

We also consider the scenarios where multiple tenants with the same performance objective coexist. We evaluate both Coexist-LS/LS and Coexist-DS/DS scenarios, where the bottleneck capacity is not large enough to meet the requirements of all flows.
Lacking the ability to favor one tenant over another, flows of both tenants experience performance degradation. We omit the results due to space limitations.

3.3.2 Current Isolation won't Work

An intuitive solution is to segregate application flows with different objectives to separate physical queues in a switch. Such a segregation approach requires a large number of traffic classes, which are not supported by the currently available CoS tags in packet formats; existing commodity switches also do not have a sufficient number of queues [120].

Bandwidth Guarantee: Several bandwidth guarantee approaches have been proposed, which provide network abstraction models and placement algorithms for tenants or applications to express their bandwidth requirements [66, 79, 84, 108, 114, 120]. Bandwidth guarantees are coarse-grained: they must meet the peak requirements, which is either inefficient or suboptimal in terms of flow objectives.

Proportional Sharing: Seawall provides per-entity weight enforcement for each VM-to-VM tunnel [120]. However, it does not support bandwidth allocation at flow-level granularity, and hence cannot optimize the performance objectives of individual flow divisions. Further, it does not support tenant-level network sharing.

3.4 Multipath TCP Attacks & Countermeasures

Below, I first provide a brief overview of MPTCP, and then discuss the security vulnerabilities of MPTCP.

Figure 3.6: Example 2-path MPTCP connection setup (MP_CAPABLE, MP_JOIN, and ADD_ADDR exchanges between Host A's addresses A1, A2 and Host B).

3.4.1 MPTCP Background

MPTCP is an extended version of TCP, implemented using the TCP option field, which allows a TCP connection between two hosts to use multiple paths simultaneously [56]. Below we explain the operation of an example 2-path MPTCP connection, illustrated in Figure 3.6.

Connection Establishment. The client uses its IP address A1 to establish an MPTCP subflow with the server. The SYN, SYN/ACK, and ACK handshake packets include the MP_CAPABLE option to indicate MPTCP support. Hosts also exchange 64-bit keys during the handshake to authenticate the establishment of additional subflows in the future. The client notifies the server that it has another IP address A2 using the ADD_ADDR option. Note that these address advertisements can be refreshed periodically and sent on any subflow. A host receiving the advertisements will then establish new MPTCP subflows using the MP_JOIN option. The SYN, SYN/ACK, and ACK handshake packets include the MP_JOIN option to indicate the establishment of the new MPTCP subflow, using address A2. Hosts use 32-bit tokens to identify the MPTCP connection. Hosts also use 64-bit truncated Hash-based Message Authentication Codes (HMACs) for authentication. Finally, MPTCP uses address identifiers (advertised via ADD_ADDR), instead of IP addresses, to cope with possible IP address changes due to the presence of a NAT on the path. Each address has a corresponding address identifier to uniquely identify it within a connection.

Data Transfer. MPTCP packets include a 32-bit local sequence number and a 64-bit global sequence number (also called the data sequence number, or DSN) to ensure reliable and in-order data delivery. The 32-bit local sequence number is used for flow control at the subflow level and the 64-bit global sequence number is used for flow control at the connection level.
MPTCP also signals both connection-level acknowledgments and subflow-level acknowledgments to implement flow control.

Subflow Priority. MPTCP uses the MP_PRIO option to dynamically change the priority of subflows. Hosts can request a change in priority to use a subflow as regular or backup. A backup subflow is used only if there is no regular subflow available for data transmission [56]. MP_PRIO carries an optional address ID field that can be used to specify the subflow that needs to be set as backup. To set a subflow as backup, a host can request a change in the subflow priority by sending the MPTCP MP_PRIO option to the other host with the corresponding address identifier. The key property of MP_PRIO messages is that they can be sent on any subflow. The reason is that if a subflow is already congested, it may not be possible to deliver the message to pause itself.

3.4.2 Threat Model

Figure 3.7 illustrates an MPTCP connection where the attacker is on one of the two paths used in the MPTCP connection. In the more general case, if there are m paths (with m subflows), an attacker may be on a subset of the paths. Our threat model assumes that the attacker can eavesdrop or inject traffic on the paths. This threat model is commonly found in the case of open WiFi access points, where an attacker connected to the same access point can eavesdrop or inject traffic on the subflow established through the access point. Our threat model does not assume that the attacker can modify existing traffic.

3.4.3 Backup Flag Vulnerability

MPTCP supports using subflows as backup – i.e., using a subflow to send data only if there is no other subflow available. Specifically, an MPTCP host can send a request to the other host to set any subflow as a backup by using the MPTCP MP_PRIO option. As mentioned earlier, this request can be sent via any MPTCP subflow. After receiving such a control packet, the sender will stop sending data and the corresponding subflow's throughput will drop to zero. Unfortunately, unlike MP_JOIN, such a control packet has no authentication required by the specification whatsoever [56], allowing an attacker controlling only one path to set any subflow (using the corresponding address ID) as backup, as shown in Figure 3.7. For example, for the two-path MPTCP connection shown in Figure 3.7, the attacker on path1 can send a forged MP_PRIO packet with the address ID of path2 to set the subflow on path2 as backup.

The backup flag vulnerability allows an attacker to divert traffic among MPTCP subflows. An attacker can offload traffic from the eavesdropped subflow to other MPTCP subflows by sending forged packets with the MP_PRIO option to set the eavesdropped subflow as a backup.

Figure 3.7: Subflow pause using the MP_PRIO option.

An attacker can also onload traffic from non-eavesdropped subflows to the eavesdropped subflow by sending forged packets with the MP_PRIO option to set all other subflows as backup. This attack is similar to a connection hijack attack, where an attacker diverts all the traffic to pass through the subflow on the eavesdropped path.

3.4.4 Throughput Inference Vulnerability

If there are two subflows in an MPTCP connection, Shafiq et al. showed that an attacker eavesdropping a subflow can learn the throughput of the other subflow [118]. For example, an attacker can infer the throughput of the non-eavesdropped subflow on most phones that have both WiFi and cellular network interfaces.
The subflows of an MPTCP connection are seemingly independent, but in reality they are fundamentally coupled because they carry data of the same MPTCP connection. Take the example of an MPTCP connection with two subflows: since the two subflows carry packets of the same MPTCP connection, the destination host needs to reassemble the packets from both subflows into that MPTCP connection. Thus, each packet in a subflow needs to have two sequence numbers: a local sequence number for reliable transmission over that subflow, which is the traditional subflow sequence number encoded in the TCP sequence field, and a global sequence number for assembling packets from both subflows into that MPTCP connection. By passively eavesdropping one subflow, an attacker can calculate the throughput of the eavesdropped subflow from the local sequence numbers and that of the MPTCP connection from the global sequence numbers.

In this paper, we show that it is not hard to generalize the result to more than two subflows. Specifically, if an attacker eavesdrops on m − 1 subflows of an m-path MPTCP connection (m ≥ 2), it is possible to infer the throughput of the non-eavesdropped subflow. For an eavesdropped subflow, from the local sequence numbers, which indicate the number of bytes that have been transferred over that subflow, the attacker can calculate the throughput of the subflow; from the global sequence numbers, which indicate the number of bytes that have been transferred over all subflows, the attacker can calculate the throughput of the overall MPTCP connection. Let t1, ..., tm denote the throughputs of the m subflows over paths p1, ..., pm, respectively, and let t denote the throughput of the overall MPTCP connection over the m paths. Then we have t1 + ... + tm = t. Suppose the attacker can eavesdrop the m − 1 paths p1, ..., pm−1. Then the attacker can calculate the throughput tm of the subflow over path pm as tm = t − (t1 + ... + tm−1).

The ability to infer throughput can help an attacker validate the success of the connection hijack attack and the traffic diversion attacks. Note that a simple approach to validate the attacks is to check that there are no holes in the DSS space; however, it works only for a connection with two subflows or when all the subflows have been hijacked. For a connection with more than two subflows, or when only specific subflows are targeted, we still need throughput inference to validate the attacks and detect changes in throughput. An attacker eavesdropping a single path can infer the total throughput of the MPTCP connection (t) and the throughput of the subflow (tm) on that path. Therefore, using this information, an attacker can validate that the backup flag attack was successful. For example, for an MPTCP connection with two subflows, t will be equal to tm if the backup attack was successful. Similarly, from this the attacker can calculate the total throughput of the MPTCP connection on the other paths (as tpaths = t − tm); however, an attacker cannot calculate the throughput of a particular subflow on a non-eavesdropped path unless it can eavesdrop m − 1 paths. For example, an attacker eavesdropping m − 2 paths cannot calculate the exact throughput of the other two paths, as it does not have sufficient information to compute the throughput.
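The inference itself requires nothing more than differencing the sequence numbers observed on the eavesdropped subflow over a time window. The following Python snippet is a minimal sketch of that calculation (illustrative code with made-up numbers, not the attack implementation):

def throughput(counter_start, counter_end, interval_s):
    # Bytes per second implied by how far a sequence number advanced.
    return (counter_end - counter_start) / interval_s

def hidden_subflow_throughput(t_connection, eavesdropped_subflow_throughputs):
    # tm = t - (t1 + ... + t(m-1)) for the one subflow the attacker cannot observe.
    return t_connection - sum(eavesdropped_subflow_throughputs)

# Example for a 2-path connection over a 1-second window: the local sequence
# number advanced by 500 kB and the global sequence number (DSN) by 1.2 MB.
t_sub = throughput(1_000, 501_000, 1.0)            # eavesdropped subflow: 500 kB/s
t_conn = throughput(5_000, 1_205_000, 1.0)         # whole MPTCP connection: 1.2 MB/s
print(hidden_subflow_throughput(t_conn, [t_sub]))  # -> 700000.0 bytes/s on the other path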
This information can help an attacker to offload or onload traffic from one subflow to another subflow while maintaining the same overall throughput of the MPTCP connection. Note that the connection hijack attack can work without throughput information; however, throughput information can help an attacker make sure that the attack was successful. For example, in some cases, the host may or may not decide to stop sending data on the backup path. Therefore, it is important for the connection hijack attack that the attacker is able to infer the throughput.

CHAPTER 4 PASE: A NEAR-OPTIMAL TRANSPORT FOR DATACENTERS

PASE is a transport framework that synthesizes the three transport strategies, namely in-network Prioritization, Arbitration, and Self-adjusting Endpoints. The underlying design principle behind PASE is that each transport strategy should focus on what it is best at doing:

• Arbitrators should do inter-flow prioritization at coarse time-scales. They should not be responsible for computing precise rates or for doing fine-grained prioritization.

• Endpoints should probe for any spare link capacity on their own, without involving any other entity. Further, given their lack of global information, they should not try to do inter-flow prioritization (protocols that do this have poor performance, as shown in Section 2).

• The in-network prioritization mechanism should focus on per-packet prioritization at short, sub-RTT timescales. The goal should be to obey decisions made by other strategies while keeping the data plane simple and efficient.

Given the above roles for each strategy, the high-level working of PASE is as follows. Every RTT, each source invokes the arbitration algorithm to get its priority queue and reference rate. The arbitration decision is based on the flows currently in the system and their priorities (e.g., deadline, flow size). As the name suggests, the reference rate is not binding on the sources, so depending on the path conditions, the sources may end up sending at higher or lower than this rate (i.e., self-adjusting endpoints). A key benefit of PASE is that we do not require any changes to the data plane: switches on the path use existing priority queues to schedule packets and employ explicit congestion notification (ECN) to signal congestion. As shown in Table 4.1, most modern switches support these two features.

Table 4.1: Priority queues and ECN support in popular commodity top-of-rack switches. The numbers are per interface.
Switch | Vendor | Num. Queues | ECN
BCM56820 [3] | Broadcom | 10 | Yes
G8264 [5] | IBM | 8 | Yes
7050S [2] | Arista | 7 | Yes
EX3300 [6] | Juniper | 5 | No
S4810 [4] | Dell | 3 | Yes

To achieve high performance while being deployment friendly, PASE incorporates two key components: a control plane arbitration mechanism and an end-host transport protocol. While existing arbitration mechanisms operate in the data plane (and hence require changes to the network fabric), we implement a separate control plane for performing arbitration in a scalable manner. To this end, we introduce optimizations that leverage the typical tree structure of datacenter topologies to reduce the arbitration overhead. Finally, PASE's transport protocol has an explicit notion of a reference rate and priority queues, which leads to new rate control and loss recovery mechanisms. In the following sections, we describe the control plane and the end-host transport protocol of PASE, followed by the details of its implementation.
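As a rough illustration of this per-RTT interaction (a sketch with a toy arbitrator and hypothetical names; the actual components are described in the following sections), a PASE source obtains a priority queue and a non-binding reference rate each RTT, tags its packets with the queue, and keeps self-adjusting around the reference rate:

class ToyArbitrator:
    def arbitrate(self, flow_size_bytes, demand_bps):
        # Pretend this flow is the highest priority flow on an idle 1 Gbps link.
        return 1, demand_bps   # (priority queue, reference rate in bps)

def source_round(arbitrator, flow_size, demand_bps, rtt_s, ack_ecn_marked, alpha=0.0625):
    prio_queue, ref_rate = arbitrator.arbitrate(flow_size, demand_bps)
    cwnd = ref_rate * rtt_s / (8 * 1460)   # reference rate -> packets per RTT (1460 B packets assumed)
    # The reference rate is not binding: the endpoint still self-adjusts to congestion.
    cwnd = cwnd * (1 - alpha / 2) if ack_ecn_marked else cwnd + 1.0 / cwnd
    return prio_queue, cwnd

print(source_round(ToyArbitrator(), 50_000, 1_000_000_000, 300e-6, ack_ecn_marked=False))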
4.1 Arbitration Control Plane

While a centralized arbitrator is an attractive option for multiple reasons, making it work in scenarios involving short flows is still an open problem [14, 45]. Prior work [70] shows that the problem of flow arbitration can indeed be solved in a distributed fashion: each switch along the path of a flow independently makes the arbitration decision and returns the allocated rate for its own link, and the source picks the minimum rate. While prior work implements arbitration as part of the data plane, PASE supports this as part of the control plane because experience with prior protocols (e.g., XCP [81]) shows that even small changes to the data plane are hard to deploy.

We now describe (a) the basic arbitration algorithm employed by each PASE arbitrator in the control plane and the key sources of overhead that limit the scalability of the arbitration process, (b) how arbitrators exchange information to determine the priority queue and the reference rate for each flow across the end-to-end path, and (c) two optimizations that reduce the arbitration overhead by exploiting the characteristics of typical datacenter topologies.

4.1.1 Basic Arbitration Algorithm

For each link in the datacenter topology, there is an arbitrator that runs the arbitration algorithm and makes decisions for all flows that traverse the particular link. The arbitrator can be implemented on any server in the datacenter or on dedicated controllers that are co-located with the switches. The arbitration algorithm works at the granularity of a flow, where a flow could be a single RPC in a typical client-server interaction, or a long running TCP-like connection between two machines.

Algorithm 1 Arbitration Algorithm
Input: <FlowSize, FlowID, demand (optional)>
Output: <PrioQue, Rref>
The arbitrator locally maintains a sorted list of flows.
Step 1: Sort/update the flow entry based on the FlowSize.
Step 2: Compute the PrioQue and Rref of the flow.
// Link capacity is C, and AggregateDemandHigher (ADH) is the sum of the demands of flows with priority higher than the current flow.
1: Assign rate:
   if ADH < C then
      Rref = min(demand, C − ADH)
   else
      Rref = baserate
   end if
2: Assign PrioQue:
   PrioQue = ⌈ADH/C⌉
   if PrioQue > LowestQueue then
      PrioQue = LowestQueue
   end if
3: return <PrioQue, Rref>

The interaction between a flow's source and the arbitrator(s) is captured in Algorithm 1. The source provides the arbitrator(s) with two pieces of information: i) the flow size (FlowSize), which is used as the criterion for scheduling (i.e., shortest flow first); to support other scheduling techniques, the FlowSize can be replaced by the deadline [134] or a task-id for task-aware scheduling [50]; and ii) the demand, which represents the maximum rate at which the source can send data. For long flows that can saturate the link, this is equal to the NIC rate, while for short flows that do not have enough data to saturate the link, this is set to a lower value. The demand and flow size are inputs to the arbitration algorithm, whereas the flow's reference rate (Rref) and priority queue (PrioQue) are the output. To compute Rref and PrioQue, the arbitrator locally maintains a sorted list of flows based on their sizes. This list is updated based on the latest FlowSize information of the current flow. The flow's priority queue and reference rate depend on the aggregate demand (AggregateDemandHigher or ADH) of flows that have higher priority compared to the current flow.
An ADH value less than the link capacity C implies that there is some spare capacity on the link and the flow can be mapped to the top queue. Thus, if the flow's demand is less than the spare capacity, we set the reference rate equal to the demand. Otherwise, we set the reference rate equal to the spare capacity. The other case is when the ADH exceeds the link capacity. This happens when a link is already saturated by higher priority flows, so the current flow cannot make it to the top queue. In this case, the flow's reference rate is set to a base value, which is equal to one packet per RTT. This allows such low priority flows to make progress in case some capacity becomes available in the network, and to even increase their rate in the future based on the self-adjusting behavior. Finally, if the current flow cannot be mapped to the top queue, it is either mapped to the lowest queue (if ADH exceeds the aggregate capacity of the intermediate queues) or is mapped to one of the intermediate queues. Thus, each intermediate queue accommodates flows with an aggregate demand of C and the last queue accommodates all the remaining flows.

Challenges. The above design described how each arbitrator determines the priority queue and reference rate for a flow at a particular link. To carry out arbitration across the end-to-end path of a flow, we design a distributed control plane for arbitration. There are three sources of overhead that limit the scalability of such a distributed control plane.

• Communication Latency. The communication latency between the source and the arbitrator depends on their physical distance within the datacenter. This delay matters the most during flow setup time as it can end up increasing the FCT, especially for short flows. To keep this delay small, arbitrators must be placed carefully, such that they are located as close to the sources as possible.

• Processing Overhead. The second challenge is the processing overhead of arbitration messages, which can potentially add non-negligible delay, especially under high load scenarios.

• Network Overhead. Due to a separate control plane, each arbitration message is potentially processed as a separate packet by the switches, which consumes link capacity. We need to ensure that this overhead is kept low and that it does not cause network congestion for our primary traffic.

To reduce these overheads, PASE's distributed control plane has been structured to carry out arbitration in a bottom-up manner, which we describe next.

Figure 4.1: The bottom-up arbitration approach employed by PASE for intra-rack and inter-rack traffic scenarios.

4.1.2 Bottom Up Arbitration

We exploit the typical tree structure of datacenter topologies. Sources obtain information about the PrioQue and Rref in a bottom-up fashion, starting from the leaf nodes up to the core, as shown in Figure 4.1. For this purpose, the end-to-end path between a source and destination is divided into two halves—one from the source up to the root (srcHalf) and the other from the destination up to the root (dstHalf). For each half, the respective leaf nodes (i.e., the source and the destination) initiate the arbitration at the start of a flow (the destination node initiates the arbitration when it receives an arbitration request message from the source node). They start off with their link to the ToR switch and then move upwards. The arbitration request messages move up the arbitration hierarchy until they reach the top-level arbitrator. The responses move downwards in the opposite direction.
A source receives arbitration responses for the srcHalf and the dstHalf containing the priority queue and reference rate assigned to the flow. The source then uses the minimum of these reference rates and the lowest of the priority queues assigned by the arbitration at the srcHalf and the dstHalf. This bottom-up approach ensures that for intra-rack communication, arbitration is done solely at the endpoints. Thus, in this scenario, flows incur no additional network latency for arbitration. This is particularly useful as many datacenter applications have communication patterns that have an explicit notion of rack affinity [17, 28]. For the inter-rack scenario, the bottom-up approach facilitates two other optimizations, early pruning and delegation, to reduce the arbitration overhead and latency. Both early pruning and delegation exploit a trade-off between low overhead and high accuracy of arbitration. As our evaluation shows, by giving away some accuracy, they can significantly decrease the arbitration overhead.

Early Pruning
The network and processing overheads can be reduced by limiting the number of flows that contact the arbitrators for PrioQue and Rref information. In early pruning, only flows that are mapped to the highest priority queue move upwards for arbitration. Thus, in Algorithm 1 a lower-level arbitrator only sends the arbitration message to its parent if the flow is mapped to the top queue(s). This results in lower priority flows being pruned at lower levels, as soon as they are mapped to lower priority queues (see Figure 4.2(a)). The intuition behind this is that a flow mapped to a lower priority queue on one of its links will never make it to the highest priority queue, irrespective of the arbitration decision on other links. This is because a flow always uses the lowest of the priority queues assigned by all the arbitrators (i.e., the bottleneck on the path). Thus, we avoid the overhead of making arbitration decisions for the flows mapped to lower priority queues.

There are two key benefits of early pruning. First, it reduces the network overhead as arbitration messages for only high priority flows propagate upwards. Second, it reduces the processing load on higher level arbitrators. In both cases, the reduction in overhead can be significant, especially in the more challenging heavy load scenarios, where such overhead can hurt system performance. In general, early pruning makes the overhead independent of the total number of flows in the system. Instead, the overhead depends on the number of children that a higher level arbitrator may have, because each child arbitrator only sends a limited number of flows (the top ones) upwards. Due to the limited port density of modern switches (typically less than 100), the number of children, and hence the overhead, is quite small. This leads to a significant reduction in the overhead compared to the non-pruning case, where the overhead is proportional to the total number of flows traversing a link, which can be in the thousands (or more) for typical datacenter settings [17, 28].

However, the above overhead reduction comes at the cost of less precise arbitration. As we only send the top flows to the higher-level arbitrators, flows that are pruned do not get complete and accurate arbitration information. Our evaluation shows that sending flows belonging to the top two queues upwards (rather than just the top queue) provides the right balance: there is little performance degradation while the reduction in overhead is still significant.
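A minimal sketch of the early-pruning rule (illustrative names, not the PASE code): a child arbitrator forwards a flow's request to its parent only if the flow was mapped to the top queue(s) locally.

TOP_QUEUES = {1, 2}   # forwarding the top two queues gave the right balance in our evaluation

def forward_upward(local_decisions):
    """local_decisions: list of (flow_id, assigned_queue) pairs at this arbitrator."""
    # Flows already pushed to lower queues can never end up in the top queue,
    # since a flow always uses the lowest queue assigned on its path, so they
    # are pruned here instead of being sent to the parent arbitrator.
    return [(flow_id, queue) for flow_id, queue in local_decisions if queue in TOP_QUEUES]

decisions = [("f1", 1), ("f2", 2), ("f3", 5), ("f4", 8)]
print(forward_upward(decisions))   # only f1 and f2 travel up the hierarchy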
Delegation
Delegation is specially designed to reduce the arbitration latency. While early pruning can significantly reduce the arbitration processing overhead as well as the network overhead, it does not reduce the latency involved in the arbitration decision, because top flows still need to go all the way up to the top arbitrator. In delegation, a higher level link (i.e., closer to the core) is divided into smaller "virtual" links of lower capacity – each virtual link is delegated to one of the child arbitrators, who then becomes responsible for all arbitration decisions related to the virtual link. Thus, it can make a local decision about a higher level link without going to the corresponding arbitrator. On each virtual link, we run the same arbitration algorithm (i.e., Algorithm 1) as we do for normal links.

As a simple example, the aggregation-core link of capacity C shown in Figure 4.2(b) can be divided into N virtual links of capacity aiC each (where ai is the fraction of capacity allocated to virtual link i) and then delegated to one of the child arbitrators. The capacity of each virtual link is updated periodically, reflecting the PrioQue of flows received by each child arbitrator. For example, one child that is consistently observing flows of higher priorities can get a virtual link of higher capacity.

Figure 4.2: (a) The early pruning optimization used to reduce the arbitration overhead. Note that si and ki represent flows that are mapped to the highest priority queue at the senders and the ToR arbitrators, respectively. (b) The delegation optimization used to reduce the setup latency and control overhead.

Delegation provides two benefits. First, it reduces the flow setup delay because arbitration decisions for higher level links are made at lower level arbitrators, which are likely to be located close to the sources (e.g., within the rack). Second, it reduces the control traffic destined towards higher level arbitrators. Note that this happens because only aggregate information about flows is sent by the child arbitrators to their parents for determining the new share of virtual link capacities. The impact on processing load has both positive and negative dimensions. While it reduces the processing load on higher level arbitrators, it ends up increasing the load on lower level arbitrators, as they need to do arbitration for their parents' virtual links too. However, we believe this may be acceptable as lower level arbitrators typically deal with fewer flows compared to top level arbitrators.

Like early pruning, delegation also involves the overhead-accuracy trade-off. The capacity assigned to a specific virtual link may not be accurate, which may lead to performance degradation. For example, we may have assigned a lower virtual capacity to a child that may suddenly start receiving higher priority flows. These flows would need to wait for an update to the virtual link capacity before they can get their due share. On the other hand, there could be cases where the virtual capacity may remain unused if the child does not have enough flows to use this capacity. This is especially true in scenarios where a link is delegated all the way to the end-host and the end-host may have a bursty flow arrival pattern. Given the above trade-off, PASE only delegates the Aggregation-Core link capacity to its children (the ToR-Aggregation arbitrators).
These child arbitrators should typically be located within the rack of the source/destination (or co-located with the ToR switch). Thus, for any inter-rack communication within a typical three-level tree topology, the source and destination only need to contact their ToR-Aggregation arbitrator, which can do the arbitration all the way up to the root. In fact, flows need not wait for the feedback from the other half (i.e., destination-root). Thus, in PASE, a flow starts as soon as it receives arbitration information from the child arbitrator. This approach is reasonable in scenarios where both halves of the tree are likely to have similar traffic patterns. If that is not true, then PASE's self-adjusting behavior ensures that flows adjust accordingly.

4.2 End-host Transport

PASE's end-host transport builds on top of existing transports that use the self-adjusting endpoints strategy (e.g., DCTCP). Compared to existing protocols, the PASE transport has to deal with two new additional pieces: (a) a priority queue (PrioQue) on which a flow is mapped and (b) a reference sending rate (Rref). This impacts two aspects of the transport protocol: rate control and loss recovery. Rate control in PASE is more guided and is closely tied to the Rref and the PrioQue of a flow. Similarly, the transport requires a new loss recovery mechanism because flows mapped to lower priority queues may experience spurious timeouts, as they have to wait for a long time before they get an opportunity to send a packet. We now elaborate on these two aspects.

Rate Control: A PASE source uses the Rref and the PrioQue assigned by the arbitrators to guide its transmission rate. The rate control uses congestion window (cwnd) adjustment, based on the Rref and the flow RTT, to achieve the average reference rate at RTT timescales. Algorithm 2 describes the rate control mechanism in PASE. For the flows mapped to the top queue, the congestion window is set to Rref × RTT in order to reflect the reference rate assigned by the arbitrator. For all other flows, the congestion window is set to one packet. Note that for flows belonging to the top queue, the reference rate Rref is generally equal to the access link capacity unless flows have smaller sizes. For the flows mapped to lower priority queues (except the bottom queue), the subsequent increase or decrease in the congestion window cwnd is based on the well-studied control laws of DCTCP [17]. In particular, when an unmarked ACK is received, the window size is increased as

cwnd = cwnd + 1/cwnd.   (4.1)

When a marked ACK (i.e., with the ECN-Echo flag set) is received, the window size is reduced as

cwnd = cwnd × (1 − α/2),   (4.2)

where α is the weighted average of the fraction of marked packets. This self-adjusting behavior for higher priority queues is important for ensuring high fabric utilization at all times, because the Rref may not be accurate and there may be spare capacity or congestion along the path. For the flows mapped to the bottom queue, the window size always remains one. This is because under high loads all flows that cannot make it to the top queues are mapped to the bottom queue, so the load on the bottom queue is usually high.
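For reference, the window-update laws (4.1) and (4.2) can be written as a small Python helper (a sketch only; Algorithm 2 below gives the complete per-queue logic):

def update_cwnd(cwnd, ack_ecn_marked, alpha):
    if ack_ecn_marked:
        return cwnd * (1 - alpha / 2)   # Eq. (4.2): DCTCP-style decrease
    return cwnd + 1.0 / cwnd            # Eq. (4.1): DCTCP-style increase

def update_alpha(alpha, fraction_marked, g=1.0 / 16):
    # alpha is a moving average of the fraction of ECN-marked packets, as in DCTCP
    # (g = 1/16 is the gain used in the DCTCP paper; an assumption here).
    return (1 - g) * alpha + g * fraction_marked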
Algorithm 2 Rate Control
Input: <PrioQue, R_ref>
Output: <cwnd>  // congestion window
// Priority queues q_1, q_2, ..., q_k, where q_1 is the highest priority queue and q_2, q_3, ..., q_{k-1} are intermediate queues
if ACKmarked == 1 then  // an ACK with the ECN-Echo flag set is received
    cwnd = cwnd * (1 - alpha/2);  // use DCTCP decrease law
else
    if PrioQue == q_1 then
        cwnd = R_ref * RTT;
        isInterQueue = 0;  // not an intermediate queue
    else if PrioQue in {q_2, q_3, ..., q_{k-1}} then
        if isInterQueue == 1 then  // already mapped to an intermediate queue
            cwnd = cwnd + 1/cwnd;  // use DCTCP increase law
        else
            isInterQueue = 1;
            cwnd = 1;
        end if
    else if PrioQue == q_k then
        cwnd = 1;
        isInterQueue = 0;
    end if
end if

Loss Recovery: For flows belonging to the top queue, we use existing loss recovery mechanisms (i.e., timeout based retransmissions). However, flows that get mapped to the lower priority queues can time out for two reasons: (a) their packet(s) could not be transmitted because higher priority queues kept the link fully utilized, or (b) a packet was lost. In case of scenario (a), a sender should avoid sending any new packets into the network, as doing so increases the buffer occupancy and the likelihood of packet losses, especially at high network loads. In case of scenario (b), a sender should retransmit the lost packet as early as possible so that the flows can make use of any available capacity without under-utilizing the network. However, differentiating between these two scenarios is challenging without incurring additional overhead.

We use small probe packets instead of retransmitting the entire packet whenever a flow mapped to one of the lower priority queues times out. If we receive an ACK for the probe packet (but not the original packet), it is a sign that the data packet was lost, so we retransmit the original data packet. If the probe packet is also delayed (no ACK received), we increase our timeout value (and resend another probe), just as we do for successive losses of data packets. An alternative approach is to set suitable timeout values: smaller values for flows belonging to the top queue and larger values for the other flows.

Finally, a related issue is that of packet reordering. When a flow from a low priority queue gets mapped to a higher priority queue, packet reordering may occur, as earlier packets may still be buffered at the former queue. This can lead to unnecessary backoffs, which degrades throughput. To address this, we ensure that when a flow moves to a higher priority queue, we wait for the ACKs of already sent packets to be received before sending packets with the updated priority.

4.3 Discussion

Handling arbitrator failures: When an arbitrator fails, the arbitration result can be inaccurate, which may lead to network congestion (e.g., if the reference rate is greater than the available capacity). To handle such scenarios, PASE sources fall back to applying DCTCP control laws upon detecting the failure of one or more arbitrators along a flow's path. In addition, such flows are mapped to the top queue to avoid starvation. When the arbitrator becomes alive again, the source arbitrator detects this within the arbitration period (equal to one RTT), and the end-to-end arbitration resumes. As each arbitration message carries flow state information, arbitrators can rebuild the sorted list of flows as soon as arbitration messages are received.
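A rough sketch of this fallback behavior is shown below (Python); ARBITRATION_PERIOD, the flow fields, and the mode names are illustrative placeholders, not the actual kernel implementation, which realizes the same logic inside the transport module.

```python
import time

ARBITRATION_PERIOD = 300e-6   # one RTT, i.e., the arbitration period above


class ArbitrationFallback:
    """Sketch of PASE's handling of arbitrator failures: if no arbitration
    feedback arrives within the arbitration period, fall back to DCTCP
    control laws and map the flow to the top queue to avoid starvation."""

    def __init__(self, flow):
        self.flow = flow
        self.last_feedback = time.monotonic()

    def on_arbitration_feedback(self, prio_que, r_ref):
        # Feedback received: resume guided rate control (Algorithm 2).
        self.last_feedback = time.monotonic()
        self.flow.mode = "GUIDED"
        self.flow.prio_que, self.flow.r_ref = prio_que, r_ref

    def on_tick(self):
        # No feedback for more than one arbitration period: assume an
        # arbitrator on the path has failed.
        if time.monotonic() - self.last_feedback > ARBITRATION_PERIOD:
            self.flow.mode = "DCTCP"   # apply DCTCP increase/decrease laws
            self.flow.prio_que = 0     # top queue, so the flow is not starved
```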
Handling losses of arbitration messages: Loss of arbitration messages can negatively impact the arbitration process. Thus, we use TCP SACK for reliably sending arbitration messages. Moreover, to ensure that arbitration messages get delivered in a timely manner, they are mapped to the highest priority queue.

Overhead of arbitration messages: The size of each arbitration message is 52 bytes (including 40 bytes of TCP and IP headers). Sending 52-byte messages every RTT = 300 μs equates to a sending rate of ≈1.4 Mbps. On 1 Gbps and 10 Gbps links, this amounts to only 0.14% and 0.014% overhead, respectively. With the delegation optimization, arbitration messages do not percolate beyond the ToR-aggregation arbitrators, which prevents such control traffic from aggregating at higher level links. With early pruning, arbitration is only carried out for flows that are mapped to the top queues. This further limits the control traffic. (In our evaluation, the path BDP was 38 KB. Even if a single source generates 38 1 KB flows every RTT, this would lead to a control traffic overhead of 0.53% on a 10 Gbps link.)

4.4 Performance Evaluation and Comparison

4.4.1 Implementation

We implement PASE on a small-scale Linux testbed as well as in the ns2 simulator. Our ns2 implementation supports all the optimizations discussed in Section III, whereas our testbed implementation only supports the basic arbitration algorithm that runs on the endhosts (source/destination only); we do not implement separate (external) arbitrators for our Linux implementation.

Client: For the testbed evaluation, we implement the transport protocol and the arbitration logic as Linux kernel modules on the clients. The Linux transport module, which is built on top of DCTCP, communicates with the arbitration module to obtain the packet priority and the reference rate. It then adds the PASE header on outgoing packets. For the ns2 evaluation, we implement the client side similar to the Linux implementation; additionally, the arbitration module at the client communicates with the arbitrator that is co-located with the ToR switch. The arbitration module communicates with the arbitrator using packets containing the PASE header.

Arbitrator: The arbitration module implements Algorithm 1. It maintains a sorted list of flows based on flowSize or flowDeadline and assigns a priority queue and reference rate using the logic described in Algorithm 1. It returns these values to the arbitration module running on the clients, which communicates with the rate control module to control each flow's priority.

Switch: For the desired switch support, we use the PRIO [9] queue and CBQ (class-based queueing) implementations, on top of the RED queue implementation in Linux and ns2. We use eight priority queues/classes and classify packets based on the DSCP field in the IP header. Out of these eight queues, a separate, strictly lower priority queue is maintained for background traffic. For the RED queue, we set the low and high thresholds of the RED queue to K and perform marking based on the instantaneous rather than the average queue length, as done in DCTCP [17].
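For reference, a toy model of the switch-side behavior just described (strict priority across eight queues, DSCP-based classification, ECN marking on the instantaneous queue length exceeding K) is sketched below in Python; the class name and the DSCP-to-queue mapping are illustrative, not the exact testbed configuration.

```python
from collections import deque

K = 20            # marking threshold in packets (1 Gbps setting)
NUM_QUEUES = 8    # one queue reserved for strictly lower priority background traffic


class PrioritySwitchPort:
    """Toy model of a PASE switch port: strict priority scheduling across
    queues, with ECN marking based on the instantaneous queue length."""

    def __init__(self, num_queues=NUM_QUEUES, mark_thresh=K):
        self.queues = [deque() for _ in range(num_queues)]
        self.mark_thresh = mark_thresh

    def enqueue(self, pkt):
        q = pkt["dscp"] % len(self.queues)      # illustrative DSCP-to-queue map
        if len(self.queues[q]) >= self.mark_thresh:
            pkt["ecn_ce"] = True                # mark on instantaneous length, not average
        self.queues[q].append(pkt)

    def dequeue(self):
        # Strict priority: always serve the highest-priority non-empty queue
        # (queue 0 is the highest priority).
        for q in self.queues:
            if q:
                return q.popleft()
        return None
```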
4.5 Evaluation

We evaluate the performance of PASE using ns2 simulations [7] as well as through small-scale testbed experiments. First, we conduct macro-benchmark experiments in ns2 to compare PASE's performance against existing datacenter transports. We compare PASE against both deployment-friendly protocols, including DCTCP [17], D2TCP [126], and L2DCT [97] (§4.6.1), as well as the best performing transport, namely, pFabric [20] (§4.6.2). Second, we micro-benchmark the internal working of PASE (e.g., benefits of arbitration optimizations, use of the reference rate, etc.) using simulations (§4.7).

Datacenter Topology: We use a 3-tier topology for our evaluation comprising layers of ToR (Top-of-Rack) switches, aggregation switches, and a core switch. This is a commonly used topology in datacenters [15, 126, 134]. The topology interconnects 160 hosts through 4 ToR switches that are connected to 2 aggregation switches, which in turn are interconnected via a core switch. Each host-ToR link has a capacity of 1 Gbps whereas all other links are of 10 Gbps. This results in an oversubscription ratio of 4:1 for the uplink from the ToR switches. In all the experiments, the arbitrators are co-located with their respective switches. The end-to-end round-trip propagation delay (in the absence of queueing) between hosts via the core switch is 300 μs.

[Figure 4.3: (a) Comparison of PASE with L2DCT and DCTCP in terms of AFCTs under the (left-right) inter-rack scenario, (b) the CDF of FCTs at 70% load in case of (a), and (c) deadline-constrained flows: comparison of application throughput for PASE, D2TCP and DCTCP under the intra-rack scenario.]

Traffic Workloads: We consider traffic workloads that are derived from patterns observed in production datacenters. Flows arrive according to a Poisson process and flow sizes are drawn from the interval [2 KB, 198 KB] using a uniform distribution, as done in prior studies [70, 134]. This represents query traffic and latency sensitive short messages in datacenter networks. In addition to these flows, we generate two long-lived flows in the background, which represents the 75th percentile of multiplexing in datacenters [17]. Note that we always use these settings unless specified otherwise. We consider two kinds of flows: deadline-constrained flows and deadline-unconstrained flows. They cover typical application requirements in today's datacenters [126]. We also evaluate PASE under the web search [17] and data mining [20] workloads. Compared to web search, the data mining workload is much more skewed, with ~80% of flows having a size of less than 10 KB and the largest 3.5% of flows containing more than 95% of all bytes.

Protocols Compared: We compare PASE with several datacenter transports including DCTCP [17], D2TCP [126], L2DCT [97], and pFabric [20]. We implemented DCTCP, D2TCP and L2DCT in ns2 and use the source code of pFabric provided by the authors to evaluate their scheme. The parameters of these protocols are set according to the recommendations provided by the authors, or reflect the best settings, which we determined experimentally (see Table 4.2).

Scheme                   Parameters
DCTCP, D2TCP, L2DCT      qSize = 100 pkts (packet size is 1 KB); markingThresh = 65 pkts (10 Gbps) and 20 pkts (1 Gbps); minRTO = 10 ms; initCwnd = 10 pkts
pFabric                  qSize = 76 pkts (= 2 x BDP); initCwnd = 38 pkts (= BDP); minRTO = 1 ms (~3.3 x RTT)
PASE                     qSize = 100 pkts; minRTO (flows in top queue) = 10 ms; minRTO (flows in other queues) = 200 ms; numQue = 8
Table 4.2: Default simulation parameter settings.
Performance Metrics: For traffic without any deadlines, we use the FCT as a metric. We consider the AFCT as well as the 99th percentile FCT for small flows. In case of the web search and data mining workloads, we report the AFCT across all flows as well as separately for short and long flows. For deadline-constrained traffic, we use application throughput as our metric, which is defined as the fraction of flows that meet their deadlines. We use the number of control messages per second to quantify the arbitration overhead.

4.6 Macro-benchmarks

4.6.1 Comparison with Deployment Friendly Schemes

PASE is a deployment friendly transport that does not require any changes to the network fabric. Therefore, we now compare PASE's performance with deployment friendly datacenter transports, namely, DCTCP and L2DCT. DCTCP [17] is a fair sharing protocol that uses ECN marks to infer the degree of congestion and employs adaptive backoff factors to maintain small queues. The goal of L2DCT [97] is to minimize FCT; it builds on DCTCP by prioritizing short flows over long flows through the use of adaptive control laws that depend on the sizes of the flows.

Deadline-unconstrained flows: We consider an inter-rack communication scenario (termed left-right) where 80 hosts in the left subtree of the core switch generate traffic towards hosts in the right subtree. This is a common scenario in user-facing web services where the front-end servers and the back-end storage reside in separate racks [140]. The generated traffic comprises flows with sizes drawn from the interval [2 KB, 198 KB] using a uniform distribution. In addition, we generate two long-lived flows in the background.

Figure 4.3(a) shows the improvement in AFCT as a function of network load. Observe that PASE outperforms L2DCT and DCTCP by at least 50% and 70%, respectively, across a wide range of loads. At low loads, PASE performs better primarily because of its quick convergence to the correct rate for each flow. At higher loads, PASE ensures that shorter flows are strictly prioritized over long flows, whereas with L2DCT, all flows, irrespective of their priorities, continue to send at least one packet into the network. This lack of support for strict priority scheduling in L2DCT leads to larger FCTs for short flows. DCTCP does not prioritize short flows over long flows; thus, it results in the worst FCTs across all protocols. Figure 4.3(b) shows the CDF of FCTs at 70% load. Observe that PASE results in better FCTs for almost all flows compared to L2DCT and DCTCP.

Deadline-constrained flows: We now consider latency-sensitive flows that have specific deadlines associated with them. Thus, we compare PASE's performance with D2TCP, a deadline-aware transport protocol. We replicate the D2TCP experiment from [126], which considers an intra-rack scenario with 20 machines, and generate short query traffic with flow sizes drawn from the interval [100 KB, 500 KB] using a uniform distribution. Figure 4.3(c) shows the application throughput as a function of network load. PASE significantly outperforms D2TCP and DCTCP, especially at high loads, because of the large number of active flows in the network. Since each D2TCP and DCTCP flow sends at least one packet per RTT, these flows consume significant network capacity, which makes it difficult for a large fraction of flows to meet their respective deadlines.
PASE, on the other hand, ensures that flows with the earliest deadlines are given the desired rates and are strictly prioritized inside the switches.

4.6.2 Comparison with Best Performing Scheme

We now compare the performance of PASE with pFabric, which achieves close to optimal performance in several scenarios but requires changes in switches. With pFabric, packets carry a priority number that is set independently by each flow. Based on this, pFabric switches perform priority-based scheduling and dropping. All flows start at the line rate and back off only under persistent packet loss. PASE performs better than pFabric in two important scenarios: (a) multi-link (single rack) scenarios with all-to-all traffic patterns and (b) at high loads (generally > 80%), whereas it achieves similar performance (< 6% difference in AFCTs) compared to pFabric in the following two cases: (a) single bottleneck scenarios and (b) when the network load is typically less than 80%.

We first consider the left-right inter-rack scenario where the aggregation-core link becomes the bottleneck. Figure 4.4 shows the AFCT and the 99th percentile FCT as a function of load. Observe that PASE achieves AFCTs similar to pFabric (Figure 4.4(a)); for the 99th percentile FCT (Figure 4.4(b)), pFabric achieves smaller FCTs for loads up to 50% and PASE achieves comparable performance. However, at loads ≥ 60%, PASE results in smaller FCTs than pFabric. At 90% load, this improvement is more than 85%. This happens due to the high and persistent loss rate of pFabric at high loads.

[Figure 4.4: Comparison of PASE and pFabric for the inter-rack scenario: (a) AFCT, (b) 99th percentile FCT with % gain of PASE over pFabric.]

Next we consider an all-to-all intra-rack scenario, which is common in applications like web search where responses from several worker nodes within a rack are combined by an aggregator node before a final response is sent to the user. Moreover, any node within a rack can be an aggregator node for a user query, and the aggregators are picked in a round robin fashion to achieve load balancing [12]. Figure 4.5 shows that PASE provides up to 85% improvement in 99th percentile FCT over pFabric (Figure 4.5(b)) and results in lower AFCTs across all loads (Figure 4.5(a)). This happens because with pFabric multiple flows sending at line rate can collide at a downstream ToR-host link. This causes a significant amount of network capacity to be wasted on the host-ToR links, which could have been used by other flows (as this is an all-to-all scenario). With PASE, flows do not incur any arbitration latency in the intra-rack scenario, as new flows start sending traffic at line rate based on the information (priority and reference rate) from their local arbitrator. After one RTT, all flows obtain their global priorities, which helps in avoiding any persistent loss of throughput in case the local and global priorities are different.

[Figure 4.5: Comparison of PASE and pFabric for the intra-rack scenario: (a) AFCT, (b) 99th percentile FCT with % gain of PASE over pFabric.]

4.6.3 Evaluation under the Web Search and Data Mining Workloads

We now evaluate PASE's performance under the web search [17] and data mining [20] workloads and compare it with DCTCP, L2DCT and pFabric. These empirical workloads contain a diverse mix of short and long flows with heavy-tailed characteristics. In the web search workload, more than 95% of all bytes are from the 30% of the flows with sizes in the range [1, 20] MB.
The data mining workload is more skewed: ~80% of the flows are less than 7 KB in size and 95% of all bytes are in the 4% of flows that are larger than 35 MB. Note that the data mining workload leads to a large number of tiny flows. PASE's arbitration may incur greater overhead in this case (as a large number of flows get mapped to the top queue) as well as inflate the AFCT of short flows (due to the arbitration latency being a greater fraction of their FCTs). Thus, it is important to evaluate how PASE performs under such workloads.

We consider the left-right inter-rack scenario, where the aggregation-core link becomes the bottleneck. Figure 4.6 shows the AFCT as a function of load under the data mining and web search workloads. Observe that PASE consistently outperforms DCTCP and L2DCT and achieves similar performance compared to pFabric across all loads. Note that using a larger initial window size (= 10) in case of DCTCP and L2DCT improves performance, but the lack of in-network priority scheduling and arbitration leads to large AFCTs, especially under medium to high loads.

[Figure 4.6: Comparison of PASE with pFabric, L2DCT and DCTCP under (a) the web search workload and (b) the data mining workload.]

Figures 4.7 and 4.8 show the breakdown of AFCT across small and large flows for the same scenario. Observe that PASE not only achieves similar performance to pFabric for flows less than 100 KB (i.e., short flows), it also improves the AFCT of long flows under the web search workload. This happens because long flows with PASE experience a lower packet loss rate than with pFabric. These results show that PASE achieves comparable or better performance than the existing state-of-the-art protocols, like pFabric, yet does not require any changes to the switch hardware.

[Figure 4.7: AFCT comparison of PASE with pFabric, L2DCT and DCTCP under the web search workload, for (a) flows in (0, 100 KB) and (b) flows in (10 MB, ∞). Note: the y-axes of the plots have different ranges.]

[Figure 4.8: AFCT comparison of PASE with pFabric, L2DCT and DCTCP under the data mining workload, for (a) flows in (0, 100 KB) and (b) flows in (10 MB, ∞). Note: the y-axes of the plots have different labels.]

4.6.4 PASE under the Incast scenario

We now evaluate PASE under synchronized flow arrivals, which are fairly common in datacenters due to the partition-aggregate structure followed by many datacenter applications [17, 100]. We generate multiple synchronized short flows with sizes uniformly distributed between 50 KB and 150 KB and vary the number of flows from 2 to 100 (where each flow originates from exactly one sender). Figure 4.9(a) shows that PASE performs well across a range of conditions. PASE is able to tolerate large packet bursts as it maps only those flows to the high priority queues that can saturate the link and maps the rest of the flows to the lower priority queues.
As a result, PASE essentially eliminates network congestion and substantially improves AFCT.

[Figure 4.9: Performance of PASE under (a) the incast scenario and (b) AFCT improvement due to PASE optimizations as a function of offered load.]

4.7 Micro-benchmarks

In this section, we evaluate the basic components of PASE with the help of several micro-benchmarks. Results show that PASE optimizations significantly reduce the control traffic overhead and the number of messages that arbitrators need to process. In addition, other features of PASE, such as its rate control, are important for achieving high performance.

4.7.1 Arbitration Optimizations

PASE introduces early pruning and delegation for reducing the arbitration overhead of update messages. We now evaluate the overhead reduction brought about by these optimizations as well as study their impact on the performance of PASE. Figure 4.10(a) shows the overhead reduction that is achieved by PASE when all its optimizations are enabled. Observe that these optimizations provide up to 50% reduction in arbitration overhead, especially at high loads. This happens because when these optimizations are enabled, higher level arbitrators delegate some portion of the bandwidth to lower level arbitrators, which significantly reduces the control overhead on ToR-aggregation links. In addition, due to early pruning, updates are propagated only for those flows that map to the highest priority queues. Figure 4.9(b) shows that these optimizations also improve the AFCT by 4-10% across all loads. This happens because of delegation, which reduces the arbitration delays.

[Figure 4.10: AFCT improvement and overhead reduction as a function of load for PASE with its optimizations in the left-right scenario: (a) overhead, (b) AFCT.]

Benefit of end-to-end arbitration: PASE enables global prioritization among flows through its scalable end-to-end arbitration mechanism. This arbitration, however, requires additional update messages to be sent on the network. It is worth asking if most of the benefits of PASE are realizable through only local arbitration, which can be done solely by the endpoints. Thus, we now compare the performance of end-to-end arbitration and local arbitration in the left-right inter-rack scenario. Figure 4.10(b) shows that end-to-end arbitration leads to significant improvements (up to 60%) in AFCTs across a wide range of loads. This happens because local arbitration cannot account for scenarios where contention does not occur on the access links, thus leading to sub-optimal performance.

In PASE, sources contact the arbitrators periodically, every RTT. The Arbitration=OFF case can also be viewed as (a) a scenario in which the arbitration period is so large that none of the short flows get any end-to-end arbitration feedback during their lifetime or (b) a scenario in which one or more arbitrators have failed. As a result, a source cannot get any end-to-end arbitration feedback.
Instead, only the local arbitrator (i.e., the one at the end-host) assigns a priority and rate to a flow based on the host-ToR link. Figure 4.10(b) shows that even in such cases, PASE outperforms DCTCP.

[Figure 4.11: (a) PASE with varying switch buffer size and (b) PASE with varying number of priority queues (left-right scenario).]

4.7.2 In-network Prioritization and Transport Micro-benchmarks

In this section, we evaluate the impact of different components in PASE.

Impact of switch buffer size: We now evaluate the impact of reducing the switch buffer size. Figure 4.11(a) shows the AFCT for two switch buffer sizes (i.e., 100 packets and 500 packets). Observe that for loads less than 70%, the AFCT remains unchanged. However, at higher loads the AFCT increases, by up to 9% at 90% load. These results show that even with small buffer sizes, PASE keeps AFCTs small.

Impact of number of priority queues: We now evaluate the impact of changing the number of priority queues in the switches. To test this scenario, we repeat the left-right inter-rack scenario with different numbers of queues. Figure 4.11(b) shows that using four queues provides significant improvement in AFCT at loads ≥ 70%. However, increasing the number of queues beyond this provides only marginal improvement in AFCT. These results further reinforce the ability of PASE to achieve high performance with existing switches that support a limited number of priority queues.

Reference Rate: We now evaluate the benefit of using the reference rate information. We compare PASE with PASE-DCTCP, where all flows (including the ones mapped to the top queue as well as the lowest priority queue) behave as DCTCP sources and do not use the reference rate. However, these flows are mapped to different priority queues through the normal arbitration process. To illustrate this, we consider the intra-rack scenario with 20 nodes and flow sizes uniformly distributed in [100 KB, 500 KB]. As shown in Figure 4.13(a), leveraging the reference rate results in AFCTs for PASE that are 50% smaller than the AFCTs for PASE-DCTCP.

Impact of RTO and Probing: Flows mapped to the lower priority queues may experience a large number of timeouts, which can affect performance. We implement probing, in which flows mapped to the lowest priority queue send a header-only probe packet every RTT rather than a full-sized packet. Using probing improves performance by ≈2.4% and ≈11% at 80% and 90% loads, respectively, in the all-to-all intra-rack scenario. Note that unlike pFabric, PASE does not require small RTOs, which obviates the need for high resolution timers.

[Figure 4.12: AFCT improvement of distributed arbitration over centralized arbitration under (a) the intra-rack scenario and (b) the inter-rack scenario.]

4.8 Distributed vs. Centralized Arbitration

In this section, we compare the performance of PASE under a single centralized arbitrator (which makes decisions for all flows in the DC network) with (multiple) distributed arbitrators.
For distributed arbitration, we use the same settings as mentioned in Section 5.6. Our evaluation shows that distributed arbitration improves AFCT under both the all-to-all and left-right traffic scenarios (see Figure 4.12). Under distributed arbitration, local arbitrators (that are close to the sources) can assign priorities to the sources, and flows start sending data without waiting for feedback from other arbitrators along the path. This reduces the arbitration latency and improves AFCT. In the centralized approach, the arbitrator may not be located close to all sources, which can significantly increase the arbitration latency for some traffic. To study the impact of controller/arbitrator placement under the centralized approach, we consider placing the arbitrator at the aggregation and core levels (i.e., co-located with the switches). The evaluation results (Figure 4.12) show that the application performance improves by up to 50% when a centralized arbitrator is placed at the aggregation switch level as compared to when it is placed at the core switch level.

4.9 Testbed Evaluation

We now present a subset of results from our testbed evaluation. Our testbed comprises a single rack of ten nodes (nine clients, one server), with 1 Gbps links, 250 μs RTT and a queue size of 100 packets on each interface. We set the marking threshold K to 20 packets and use eight priority queues. We compare PASE's performance with the DCTCP implementation (provided by its authors). To emulate datacenter-like settings, we generate flow sizes that are uniformly distributed between 100 KB and 500 KB, as done in [126]. We start 1000 short flows and vary the flow arrival rate to generate a load between 10% and 90%. In addition, we also generate a long-lived background flow from one of the clients. We compare PASE with DCTCP and report the average of ten runs. Figure 4.13(b) shows the AFCT for both schemes. Observe that PASE significantly outperforms DCTCP: it achieves ≈50-60% smaller AFCTs compared to DCTCP. This also matches the ns2 simulation results.

[Figure 4.13: PASE AFCT performance: (a) PASE vs PASE-DCTCP and (b) testbed evaluation.]

4.10 Discussion

Multipath PASE: The existing arbitration mechanism of PASE assumes that each flow uses a single path. However, it can be extended to allow striping of a flow across multiple network paths. To achieve this, every arbitrator can maintain a flow list for each of its outgoing links. In addition to the reference rate and priority queue assignment, the arbitration algorithm will also return the path information. The source arbitrators will use this information to allow routing of flows across multiple paths. For flowlet-based schemes, like CONGA [16], PASE arbitrators can track flowlets instead of flows. With packet-level load balancing schemes, such as DRB [32], arbitrators can assign an average rate to a flow on each path it uses. However, we leave a detailed treatment of multipath PASE for future work.

4.11 Conclusion

In this chapter, we make three key contributions.
First, we extract the three underlying strategies used in datacenter transport protocols and find that these strategies, which were used individually in earlier transports, are in fact complementary to each other and can be unified to work together. Second, we design PASE, a datacenter transport protocol that synthesizes existing transport strategies. PASE includes two new components: a scalable arbitration control plane for datacenter networks, and an endpoint transport protocol that is explicitly aware of priority queues and uses a guided rate control mechanism. Third, we conducted a comprehensive evaluation of PASE, which includes both macro-benchmarks that compare PASE's performance against multiple existing transport protocols, and micro-benchmarks that focus on the internal working of the system.

CHAPTER 5
NEAT: TASK PLACEMENT IN DATACENTERS

NEAT is a Network-schEduling-Aware Task placement framework that considers the network scheduling policy and the network state while making task placement decisions. NEAT places both flows and coflows in the network by placing tasks on the nodes that act as the destinations of these flows (coflows). The placement framework, coupled with the underlying network scheduling policy (e.g., Fair, FCFS, LAS, SRPT), ensures that network resources are properly allocated among tasks to minimize their average completion time.

NEAT uses a distributed architecture to make task placement decisions, as shown in Figure 5.1. It leverages the master-slave architecture of existing task scheduling systems [68, 128] to achieve its functionality in a distributed fashion. NEAT has two components: (i) a global task placement daemon and (ii) a per-node network daemon. The two components exchange information to perform network-aware task scheduling.

[Figure 5.1: NEAT components.]

Network Daemon: The NEAT network daemon (similar to the node manager in Hadoop) runs on all the nodes (i.e., endhosts) in the network and uses the completion time of data transfer (FCT for a flow, CCT for a coflow) as a metric for predicting task performance. The prediction is based on the network state, e.g., the (residual) sizes of all active flows, and the network scheduling policy. Therefore, the network daemon performs two functions: (i) maintaining network information of all the active tasks on the node (such as the number of flows and their residual sizes), and (ii) predicting the performance of a new task when requested by the task placement daemon, based on the analysis in Section 5.1. NEAT considers only the edge links for task performance prediction. Therefore, the task performance prediction is based on the assumption that only the edge links are bottlenecks in the network and the core is congestion free. As a result, NEAT needs to maintain flow information at the end-hosts only and does not require any feedback from the network.

Task Placement Daemon: The NEAT task placement daemon makes placement decisions in two steps: First, it queries the network daemons on nodes that are valid hosts (determined by node state, CPU and memory) of a given task for their predicted task performance.
Next, it chooses the best node or subset of nodes for placing the task using the predicted task completion times and node states (i.e., the smallest residual flow size of active tasks on each node). It also caches the node states locally to facilitate future placement decisions, as discussed below. The node states cache is updated whenever a new task is placed on the node or an existing task is completed. NEAT Workflow: As illustrated in Figure 5.1, when the task placement daemon receives a request to place a new task (step 1), it identifies a set of preferred hosts based on local copies of node states and contacts the respective network daemons to get predicted performance of the new task (step 2). Each network daemon maintains the local network state (such as the number and the sizes of flows starting/ending at the node) and uses this state together with knowledge of the network scheduling policy to predict the performance of the new task. The task placement daemon then gathers this information (step 3) and places the task on the best of the preferred hosts based on the predicted performance (step 4). Discussions: It is worthwhile to clarify a few points. First of all, NEAT is a task placement framework and as such, it does not impose any changes in the network switches or network 64 scheduling mechanism. The network congestion is managed by the underlying network scheduling policy. However, NEAT helps to mitigate the network congestion by properly distributing traffic loads across the network. Moreover, in the current design, NEAT abstracts network as a single- switch topology, as assumed in earlier works [21, 76]. The single-switch assumption means that only edge links are the bottleneck and thus for task completion time (FCT/CCT) prediction, it only considers edge links. However, such restriction is not fundamental to NEAT, and we discuss in Section 5.10 how it can be extended to other topologies. In the next two sections, we discuss task performance prediction and NEAT design in more detail. 5.1 Task Performance Predictor NEAT has a pluggable task performance predictor that can predict the completion time of a given task under an arbitrary network scheduling policy. An interesting insight of our analysis is that: for flow-based scheduling, it suffices to predict FCT according to the Fair policy under mild conditions even if the underlying policy is not Fair; for coflow-based scheduling, however, this is generally not true. From a networking perspective, placing a task onto a server is equivalent to placing a flow onto a path (or a coflow onto a set of paths). We now define the objective function for flow placement; the objective function for coflow placement is analogous. For flow placement, our goal is to place each flow onto the path that minimizes the sum (and hence average) FCT over all active flows. Consider a flow f placed on path p f with size s f (bits). Let FCT„ f ; l” denote the completion time of flow f on link l 2 p f (i.e., time taken for the flow to finish transmission over link l). Now consider placing a new flow f0 onto path p f0. Let ∆FCT„ f ; l” denote the increase in the completion time of f on link l due to the scheduling of f0. Let F be the set of cross-flows in the network (i.e., flows that share at least one link with f0). 
Then the increase in the sum FCT over all active flows due to the placement of $f_0$ equals

$$\max_{l\in p_{f_0}} FCT(f_0,l) + \sum_{f\in F}\Big(\max_{l\in p_f}\big(FCT(f,l)+\Delta FCT(f,l)\big)-\max_{l\in p_f} FCT(f,l)\Big), \qquad (5.1)$$

and the optimal online placement is to place $f_0$ such that (5.1) is minimized (note that (5.1) depends on $p_{f_0}$).

The original objective (5.1) requires calculation of the FCT before and after placing $f_0$ for every flow $f\in F$ and every hop $l$ traversed by $f$. To simplify computation, we exchange the order of summation and maximization to obtain an alternative objective function

$$\max_{l\in p_{f_0}}\Big(FCT(f_0,l)+\sum_{f\in F_l}\Delta FCT(f,l)\Big), \qquad (5.2)$$

where $F_l$ is the set of flows traversing link $l$. Note that (5.2) is only an approximation of (5.1), and thus minimizing (5.2) is not guaranteed to minimize the sum FCT. However, as we show later, the alternative objective (5.2) has a nice property that it is invariant, up to a constant scaling, to the network scheduling policy under certain conditions (see Propositions 5.2.1 and 5.3.1), and the placement that minimizes (5.2) achieves superior performance in minimizing the FCT for a variety of flows and network scheduling policies.

For simplicity of analysis, we do not consider the impact of future task arrivals or task completions on the prediction. We choose this design in favor of applicability (not requiring prediction capability) and stability (not moving existing flows) of the NEAT design (see remark in §5.4.1).

5.2 FCT Prediction for Flow Scheduling

Given a placement of flow $f_0$ and its cross-flows, the FCT of $f_0$ is determined by the scheduling policy employed by the network. We therefore study FCT prediction under several widely-adopted scheduling policies, including FCFS, Fair, LAS, and SRPT. For ease of presentation, we fix a (candidate) placement and consider the prediction for a given link $l$ with bandwidth $B_l$.

The key idea of the NEAT predictor is to compute the total number of bytes that will be transmitted across the bottleneck link by the time the current flow finishes transmission, which is then divided by the link bandwidth to predict the completion time of the current flow. For example, to predict the completion time of a flow $f$ when placed onto a path $p$, we compute for each link $l$ on $p$ the total traffic volume $V_l$ from $f$ and coexisting flows that will cross $l$ upon the completion of $f$, and then the maximum of $V_l/B_l$ ($B_l$ is the bandwidth of $l$) over all the links $l\in p$ gives the predicted completion time of $f$. This approach applies to any work-conserving scheduling policy (i.e., a link is never idle when there is pending traffic on that link), while different policies differ in the computation of $V_l$.

5.2.1 FCT under FCFS Scheduling

Under FCFS scheduling, flows are served in the order they arrive, and the FCT of the new flow can be predicted as:

$$FCT_{FCFS}(f_0,l)=\frac{1}{B_l}\Big(s_{f_0}+\sum_{f\in F_l}s_f\Big), \qquad (5.3)$$

and $\Delta FCT_{FCFS}(f,l)\equiv 0$ for all $f\in F_l$, since the new flow does not affect the completion of existing flows. Hence, the two objectives in (5.1) and (5.2) coincide in this case, both equal to $\max_{l\in p_{f_0}} FCT_{FCFS}(f_0,l)$.

5.2.2 FCT under Fair or LAS Scheduling

Under a scheduling policy that performs fair sharing, all flows will receive equal service until completion, i.e., by the time flow $f_0$ completes transmission over link $l$, each existing flow $f\in F_l$ will have transmitted $\min(s_f,s_{f_0})$ bytes over $l$.
The FCT of $f_0$ can thus be predicted as:

$$FCT_{FAIR}(f_0,l)=\frac{1}{B_l}\Big(s_{f_0}+\sum_{f\in F_l}\min(s_f,s_{f_0})\Big). \qquad (5.4)$$

Meanwhile, the fair sharing rule implies that for each flow $f\in F_l$ of size $s_f<s_{f_0}$, scheduling $f_0$ increases its FCT by imposing an additional traffic load of $s_f$ on link $l$ (during the lifetime of $f$); for each flow $f\in F_l$ of size $s_f\ge s_{f_0}$, $f_0$ will finish earlier, causing an additional load of $s_{f_0}$. Thus, the change in FCT of existing flows is:

$$\Delta FCT_{FAIR}(f,l)=\frac{\min(s_f,s_{f_0})}{B_l}. \qquad (5.5)$$

Substituting (5.4) and (5.5) into (5.1) gives the overall increase in the sum FCT. The alternative objective (5.2) accepts a more compact form in this case. Specifically, for a given $l$,

$$FCT_{FAIR}(f_0,l)+\sum_{f\in F_l}\Delta FCT_{FAIR}(f,l)=\frac{s_{f_0}}{B_l}+\frac{2}{B_l}\sum_{f\in F_l}\min(s_f,s_{f_0})\approx 2\,FCT_{FAIR}(f_0,l), \qquad (5.6)$$

where the approximation is accurate when $s_{f_0}$ is small relative to $\sum_{f\in F_l}\min(s_f,s_{f_0})$.

Remark: Since LAS scheduling (with preemption) is equivalent to fair sharing, the above result also applies to LAS scheduling.

5.2.3 FCT under SRPT Scheduling

If the network employs SRPT scheduling, then only flows of remaining sizes smaller than or equal to the current flow will be served before the current flow finishes (assuming the FCFS rule is applied to break ties among flow sizes), and larger flows will be preempted. Hence, the FCT of the new flow $f_0$ under SRPT can be predicted as:

$$FCT_{SRPT}(f_0,l)=\frac{1}{B_l}\Big(s_{f_0}+\sum_{f\in F_l:\,s_f\le s_{f_0}}s_f\Big). \qquad (5.7)$$

Meanwhile, each flow of size $s_f>s_{f_0}$ will be delayed by $s_{f_0}/B_l$ due to the placement of $f_0$, while flows of size $s_f\le s_{f_0}$ will not be affected, implying:

$$\Delta FCT_{SRPT}(f,l)=\frac{s_{f_0}}{B_l}\,\mathbb{1}_{\{s_f>s_{f_0}\}}, \qquad (5.8)$$

where $\mathbb{1}_{\{\cdot\}}$ is the indicator function. Applying (5.7) and (5.8) to (5.1) gives the first objective. For the second objective, we have

$$FCT_{SRPT}(f_0,l)+\sum_{f\in F_l}\Delta FCT_{SRPT}(f,l)=\frac{1}{B_l}\Big(s_{f_0}+\sum_{f\in F_l}\min(s_f,s_{f_0})\Big)=FCT_{FAIR}(f_0,l). \qquad (5.9)$$

5.2.4 Invariance Condition

Although the predicted FCT varies for different network scheduling policies, we show that the optimal placement can be invariant under certain conditions. Specifically, we have shown that under SRPT scheduling, the alternative objective (5.2) reduces to the fair-sharing FCT of the newly arrived flow (see (5.9)), and under Fair/LAS scheduling, this objective is approximately twice the fair-sharing FCT (see (5.6)). Since constant-factor scaling does not affect the selection of placement, these results imply that we can use the same objective function to place flows no matter whether the network performs Fair, LAS, or SRPT scheduling.

Proposition 5.2.1 If the network performs Fair, LAS, or SRPT flow scheduling and each flow is small relative to the total load on each link, then the optimal placement that minimizes (5.2) is always the one that minimizes the bottleneck fair-sharing FCT of the newly arrived flow as predicted by (5.4).

Remark: Surprisingly, we have shown that when the network schedules flows by LAS or SRPT, we should place flows to minimize their predicted FCTs under Fair scheduling in order to optimize the objective (5.2).
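For concreteness, the per-link FCT predictions above can be summarized in a short sketch (Python); the argument names are illustrative, with flows assumed to hold the residual sizes in bytes of the flows currently crossing a link.

```python
def fct_fcfs(s_new, flows, bandwidth):
    """Eq. (5.3): under FCFS, every existing byte is served before the new flow."""
    return (s_new + sum(flows)) / bandwidth

def fct_fair(s_new, flows, bandwidth):
    """Eq. (5.4): under Fair/LAS, each existing flow f contributes
    min(s_f, s_new) bytes before the new flow finishes."""
    return (s_new + sum(min(s, s_new) for s in flows)) / bandwidth

def fct_srpt(s_new, flows, bandwidth):
    """Eq. (5.7): under SRPT, only flows no larger than the new flow are
    served before it."""
    return (s_new + sum(s for s in flows if s <= s_new)) / bandwidth

def predict_fct(s_new, path_state, policy):
    """Bottleneck FCT of a new flow of size s_new placed on a path;
    path_state is a list of (flows_on_link, link_bandwidth) pairs."""
    per_link = {"FCFS": fct_fcfs, "FAIR": fct_fair,
                "LAS": fct_fair, "SRPT": fct_srpt}[policy]
    return max(per_link(s_new, flows, bw) for flows, bw in path_state)
```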
5.3 CCT Prediction for Coflow Scheduling

We now revise the above formulas for the case where the network schedules traffic at the level of coflows. We study the following policies applied to coflow scheduling: FCFS [51], Fair, LAS [41], and an extension of SRPT called permutation scheduling [42].

For a coflow $c$, let $s_c$ denote its total size (in number of bytes), and $s_{c,l}$ denote the sum size of the individual flows in $c$ that traverse link $l$. We show that the idea of computing FCT can be generalized to predict the performance of coflows. We make the following assumptions: (i) all flows of a coflow have the same priority, and (ii) all flows of a coflow complete simultaneously. The first assumption is based on a principle adopted by state-of-the-art coflow schedulers [42, 51] that flows within a coflow should progress together; the second assumption is based on a state-of-the-art coflow rate adaptation mechanism [42], which slows down all but the slowest flows in a coflow to match the completion time of the slowest flow so as to conserve bandwidth without increasing the CCT. In particular, assumption (ii) allows us to apply the same objective functions (5.1), (5.2) in placing coflows, with $f$ ($f_0$) replaced by $c$ ($c_0$), FCT replaced by CCT, and $p_f$ replaced by $p_c$ (denoting the set of links traversed by any flow of coflow $c$). The computation of $CCT(c,l)$ and $\Delta CCT(c,l)$ is, however, different from their single-flow counterparts, and will be detailed below. We redefine $F_l$ to be the set of coflows with at least one constituent flow traversing link $l$.

5.3.1 CCT under FCFS Scheduling

Under FCFS scheduling, each link will serve traffic from existing coflows before serving the newly arrived coflow, thus

$$CCT_{FCFS}(c_0,l)=\frac{1}{B_l}\Big(s_{c_0,l}+\sum_{c\in F_l}s_{c,l}\Big), \qquad (5.10)$$

and $\Delta CCT_{FCFS}(c,l)\equiv 0$ for all $c\in F_l$.

5.3.2 CCT under Fair or LAS Scheduling

Under fair sharing or LAS scheduling, all coflows of size smaller than $c_0$ will have finished and all coflows of size larger than $c_0$ will have transmitted $s_{c_0}$ bytes when coflow $c_0$ completes. Under assumption (ii), flows within a coflow make progress proportionally to their sizes, and thus when a coflow traversing link $l$ finishes $b$ bytes over all its constituent flows, $b\,s_{c,l}/s_c$ bytes must be transmitted over link $l$. Therefore, each coflow $c\in F_l$ introduces a load of $\min(s_c,s_{c_0})\,s_{c,l}/s_c$ on link $l$ during the lifetime of coflow $c_0$, and vice versa. The above arguments imply that

$$CCT_{FAIR}(c_0,l)=\frac{1}{B_l}\Big(s_{c_0,l}+\sum_{c\in F_l:\,s_c\le s_{c_0}}s_{c,l}+\sum_{c\in F_l:\,s_c>s_{c_0}}\frac{s_{c_0}s_{c,l}}{s_c}\Big) \qquad (5.11)$$

for the newly arrived coflow $c_0$, and

$$\Delta CCT_{FAIR}(c,l)=\frac{\min(s_c,s_{c_0})\,s_{c_0,l}}{B_l\,s_{c_0}} \qquad (5.12)$$

for each existing coflow $c\in F_l$. From the above, we see that

$$\sum_{c\in F_l}\Delta CCT_{FAIR}(c,l)=\frac{1}{B_l}\Big(\sum_{c\in F_l:\,s_c\le s_{c_0}}\frac{s_c\,s_{c_0,l}}{s_{c_0}}+\sum_{c\in F_l:\,s_c>s_{c_0}}s_{c_0,l}\Big), \qquad (5.13)$$

which is different from (5.11) unless $s_{c,l}/s_c=s_{c_0,l}/s_{c_0}$ for all $c\in F_l$. It means that, compared to single-flow Fair/LAS scheduling, where minimizing the objective (5.2) reduces to minimizing the FCT of the newly arrived flow, we have to explicitly consider the impact on existing coflows through (5.13) (in addition to (5.11)) under coflow Fair/LAS scheduling.

5.3.3 CCT under Permutation Scheduling

Permutation scheduling [42] serves the coflows sequentially and includes many scheduling policies as special cases (e.g., FCFS and all variations of SRPT). Given a permutation $\pi=(\pi_c)_{c\in F_l\cup\{c_0\}}$ of all the coflows sharing link $l$, where $\pi_c$ is the order of scheduling coflow $c$, the CCT of the newly arrived coflow $c_0$ equals

$$CCT_{\pi}(c_0,l)=\frac{1}{B_l}\sum_{c:\,\pi_c\le\pi_{c_0}}s_{c,l}, \qquad (5.14)$$

and the increase in CCT for each existing coflow $c\in F_l$ equals

$$\Delta CCT_{\pi}(c,l)=\frac{s_{c_0,l}}{B_l}\,\mathbb{1}_{\{\pi_c>\pi_{c_0}\}}, \qquad (5.15)$$

which can be used to evaluate the first objective (5.1).
For the second objective (5.2), we have

$$CCT_{\pi}(c_0,l)+\sum_{c\in F_l}\Delta CCT_{\pi}(c,l)=\frac{1}{B_l}\Big(\sum_{c:\,\pi_c\le\pi_{c_0}}s_{c,l}+s_{c_0,l}\,\big|\{c\in F_l:\pi_c>\pi_{c_0}\}\big|\Big). \qquad (5.16)$$

In particular, for a counterpart of SRPT called smallesT-Coflow-First (TCF) [42], (5.16) becomes

$$CCT_{TCF}(c_0,l)+\sum_{c\in F_l}\Delta CCT_{TCF}(c,l)=\frac{1}{B_l}\Big(s_{c_0,l}+\sum_{c\in F_l:\,s_c\le s_{c_0}}s_{c,l}+\sum_{c\in F_l:\,s_c>s_{c_0}}s_{c_0,l}\Big), \qquad (5.17)$$

which is similar to the CCT under fair sharing (5.11) if $s_{c,l}/s_c=s_{c_0,l}/s_{c_0}$ for all $c\in F_l$ with $s_c>s_{c_0}$.

5.3.4 Invariance Condition

The above analysis shows that in the special case of $s_{c,l}/s_c=s_{c_0,l}/s_{c_0}$ for all $c\in F_l$, i.e., all coflows split traffic among traversed links in the same way, the objective (5.2) satisfies $CCT_{FAIR}(c_0,l)+\sum_{c\in F_l}\Delta CCT_{FAIR}(c,l)\approx 2\,CCT_{FAIR}(c_0,l)$ under Fair/LAS scheduling (assuming $s_{c_0,l}\ll\sum_{c\in F_l:\,s_c\le s_{c_0}}s_{c,l}+\sum_{c\in F_l:\,s_c>s_{c_0}}s_{c_0,l}$), and $CCT_{TCF}(c_0,l)+\sum_{c\in F_l}\Delta CCT_{TCF}(c,l)=CCT_{FAIR}(c_0,l)$ under TCF scheduling. By arguments similar to Proposition 5.2.1, we have the following statement.

Proposition 5.3.1 If the network performs Fair, LAS, or TCF coflow scheduling, each coflow imposes a small load on each link (relative to its total load), and $s_{c,l}/s_c$ are identical among all coflows $c\in F_l$ for each link $l$, then the optimal placement that minimizes (5.2) is always the one that minimizes the bottleneck fair-sharing CCT of the newly arrived coflow as predicted by (5.11).

Remark: We note that compared to the case of flow scheduling (Proposition 5.2.1), the invariance property for coflow scheduling only holds in very special cases, when all the coflows split traffic in the same way. It means that in contrast to flow placement, there is rarely a uniformly optimal placement for coflows under different coflow scheduling policies, and it is crucial to customize coflow placement according to the underlying coflow scheduling policy in the network.

5.4 NEAT Task Placement Framework

NEAT targets data-intensive applications such as MapReduce or directed acyclic graph (DAG) based systems (such as Spark, Tez, etc.). Such applications run in multiple stages, and NEAT aims at placing tasks in each stage with the goal of minimizing the average completion time of all the active tasks in the system. For example, a MapReduce job has two stages, Map (where input data is processed) and Reduce (where results of the Map stage are summarized), and each stage may have a data shuffle part and a data processing part. Both data shuffle and data processing contribute to the total completion time of a job. Similarly, DAG-based applications can be considered an extension of MapReduce applications, which can have multiple Map/Reduce phases [76]. Therefore, in this work, for simplicity, we use MapReduce as the base model and then extend it to DAGs for multi-stage jobs.

5.4.1 Flow Placement

NEAT places a new task in two steps: First, it identifies a set of candidate hosts for task placement and next, it selects the best host among these for task execution.

1) Identification of preferred hosts: For a given task, NEAT divides nodes into preferred hosts and non-preferred hosts based on the size of the current task and the node state (defined as the smallest remaining flow size on each node). If the task generates a flow of size s, then a node is a preferred host if it is idle or if all the flows currently scheduled on this node are no smaller than s. Using preferred hosts provides various benefits.
First, it reduces the communication overhead among the network daemons and task placement daemons. For each new task, the task daemon contacts only a subset of hosts to get predicted task completion times. Thus, it improves the time taken to search the right candidate such that the selection time is linear in the number of candidate 73 hosts. Secondly, using preferred hosts compliments the benefits of different network scheduling policies. For example, under the Fair or LAS policy, it ensures that long flows are not placed with existing short flows (although short flows can be placed with existing long flows under LAS scheduling), thus improving separation between short and long flows and reducing the switch queuing delays experienced by the short flows. Similarly, under the SRPT policy, it makes sure that the new task can start data transfer immediately. The preferred hosts are further compared in the next step. If there is no preferred host, such that every node has at least one flow smaller than the flow of the current task, then this preference is ignored, i.e., all the nodes are considered preferred hosts. NEAT, in case of large set of preferred hosts, can also consider input data placement to limit the number of preferred hosts for each task. 2) Selection of a best host: Given a set of preferred hosts, the next step is to select a host that achieves the minimum task completion time. The completion time of a task depends on: i) the time required to transfer input data, and ii) the time required to process the data. The data transfer time is captured by the predicted FCT/CCT, which is obtained from the network daemon (see Section 3). The data processing time depends on the available node resources (e.g., CPU and memory), which can be obtained from an existing per-node resource manager (e.g., Hadoop node manager). In the current design, NEAT compares the available node resources of each preferred host with requirements imposed by the application (e.g., minimum CPU and memory) to identify candidate hosts, and then selects the one that gives the minimum data transfer time. However, this design can be easily extended with a processing time predictor to minimize the total task completion time; we leave the details to future work. Remark: NEAT is an online scheduler that greedily optimizes the performance of each new task. We choose this design because: (i) it significantly simplifies the computation per placement, and (ii) it approximates the optimal placement under certain conditions as specified in Proposition 5.2.1 and 5.3.1. Note that NEAT only places each task once at the beginning of its lifetime, and does not move any task during its execution. 74 5.4.2 Coflow Placement Compared to flow placement, coflow placement poses additional challenges. Ideally, we want to jointly place all the flows in a coflow to minimize the completion time of the slowest flow. However, in the case of one-to-many or many-to-many coflows, joint placement of all the flows in a coflow has exponential complexity due to the exponentially many possible solutions. To solve this problem, we use a sequential heuristic, where we first place the largest flow within the coflow using the flow placement algorithm, and then place the second largest flow using the same algorithm based on the updated network state, and so on. This sequential heuristic used for coflow placement should have quadratic complexity. 
The reason for placing flows in descending order of their sizes is that larger flows are more likely to be the critical flows determining the completion time of the coflow and hence should be placed on nodes that have more available resources. We leave evaluation of other coflow placement algorithms as a future work. Note that many-to-one coflows do not have the complexity problem and thus can be placed optimally. 5.4.3 Application to MapReduce Scheduling NEAT considers each MapReduce job as a concatenation of two tasks, one for the Map stage and one for the Reduce stage. Since there are generally multiple Map “tasks” in a job1, this requires NEAT to place two coflows, one many-to-many coflow for reading input data into Map tasks and one many-to-one coflow for shuffling intermediate results to Reduce task (or many-to-many if there are multiple Reduce tasks). In both the cases, NEAT strives to achieve data locality if possible (as a node with the input data will have zero FCT/CCT), and minimizes data transfer time otherwise. For example, for placing Map tasks, it can leverage the data placement policy of the cluster filesystem (such as HDFS in Apache Yarn). In HDFS, data is usually placed on multiple locations for fault tolerance and to provide better data locality. During Map task placement, NEAT can either place task on the nodes that has the input data available or choose the nodes using it’s placement heuristic. 1Note that NEAT defines “task” differently from the MapReduce framework. 75 5.4.4 Application to DAG Scheduling DAG-style applications can be modelled as an extension of MapReduce applications, except that a DAG may have multiple shuffle stages. NEAT considers each stage of a DAG job as a separate task and places flows (coflows) within a task using the procedure discussed in sections 5.4.1 and 5.4.2. The placement algorithm used by NEAT encourages placing tasks of same job on nodes close to the input data. 5.5 NEAT Optimizations NEAT needs the state of all the active flows in the network to make optimal task placement decisions, which can incur significant overhead at both the network daemons and the task placement daemon. To reduce the overhead, NEAT introduces two optimizations: (i) instead of keeping individual states of all the active flows, each network daemon only maintains a compressed state with a constant size, (ii) instead of contacting all the network daemons, the task placement daemon determines a subset of nodes as candidate hosts based on local information and only contacts the network daemons at the candidate hosts. Compressed Flow State: Keeping the original flow state requires space that grows linearly with the number of active flows, which limits scalability of the network daemon (with respect to network load). NEAT addresses this issue by approximating the FCT/CCT prediction using a compressed flow state. Each network daemon compresses information of its local flows (coflows) by quantizing the flow sizes into a fixed number of bins and maintaining summary statistics (e.g., total size and number of flows) in each bin. Compared with the original flow state that keeps track of individual flows, this compressed state has a size that is determined by the number of bins, regardless of the number of flows in the network. We refer to the set of active flows Fl on link l as the flow state of link l. The flow states of all the links form the overall flow state F of the network. 
Note that under FCFS, we can easily compress F_l by storing only the total load (\sum_{f \in F_l} s_f for flow scheduling and \sum_{c \in F_l} s_{c,l} for coflow scheduling), which suffices for FCT/CCT prediction. We thus focus on the other scheduling policies in the sequel.

For FCT prediction, our idea is to divide flows on each link into a finite number of bins n = 1, ..., N, where we keep the following parameters for each bin: the minimum and maximum flow size (s^{(1)}_{l,n}, s^{(2)}_{l,n}), the total flow size (in number of bytes) b_{l,n}, and the number of flows c_{l,n}. In essence, we compress a set of flow sizes into a histogram of flow sizes with flexible bin boundaries (s^{(1)}_{l,n}, s^{(2)}_{l,n}). Using the compressed flow state, we can approximate (5.4) by:

\mathrm{FCT}_{\mathrm{FAIR}}(f_0, l) \approx \frac{1}{B_l}\Big( s_{f_0} + \sum_{n=1}^{p} b_{l,n} + s_{f_0} \sum_{n=p+1}^{N} c_{l,n} \Big),    (5.18)

where p = m_l(s_{f_0}) and m_l(s) \in {1, ..., N} is the index of the bin containing flow size s (i.e., s^{(1)}_{l,m_l(s)} \le s < s^{(2)}_{l,m_l(s)}). Other predictions in § 5.2 can be approximated similarly, although we only need to approximate (5.4) if the invariance condition in Proposition 5.2.1 is satisfied. Note that storing b_{l,n} is optional, as it can be approximated by c_{l,n} \bar{s} for some \bar{s} \in [s^{(1)}_{l,n}, s^{(2)}_{l,n}).

For CCT prediction, the compressed flow state in each bin has two additional attributes: d_{l,n}, denoting the total load on link l (i.e., the sum of s_{c,l} over coflows in this bin), and e_{l,n}, denoting the total normalized load on link l (i.e., the sum of s_{c,l}/s_c over coflows in the bin). Assuming q = m_l(s_{c_0}), for Fair/LAS scheduling, we can approximate (5.11) by

\mathrm{CCT}_{\mathrm{FAIR}}(c_0, l) \approx \frac{1}{B_l}\Big( s_{c_0,l} + \sum_{n=1}^{q} d_{l,n} + s_{c_0} \sum_{n=q+1}^{N} e_{l,n} \Big),    (5.19)

and (5.13) by

\frac{1}{B_l}\Big( s_{c_0,l} + \sum_{n=1}^{q} d_{l,n} + s_{c_0,l} \sum_{n=q+1}^{N} c_{l,n} \Big).    (5.20)

Under TCF scheduling, we can approximate the right-hand side of (5.17) by

\Delta \mathrm{CCT}_{\mathrm{FAIR}}(c, l) \approx \frac{s_{c_0,l}}{B_l\, s_{c_0}} \Big( \sum_{n=1}^{q} b_{l,n} + s_{c_0} \sum_{n=q+1}^{N} c_{l,n} \Big).    (5.21)

Here storing e_{l,n} is also optional, as it can be approximated by d_{l,n}/\bar{s} for some \bar{s} \in [s^{(1)}_{l,n}, s^{(2)}_{l,n}).

Remark: The compressed flow state has a size that is linear in the number of bins (which is a design parameter), regardless of the number of flows in the network. The cost of this compression is a loss of accuracy in the FCT/CCT prediction, caused by uncertainty in the sizes of flows (coflows) that are in the same bin as the newly arrived flow (coflow). The bin sizes can be set based on the datacenter traffic distribution. For example, for heavy-tailed traffic [18], one can use exponentially growing bin sizes so that bins are smaller for short flows and larger for long flows.

Reduced Communication Overhead: NEAT uses distributed components (network daemons) spread across the network to maintain the flow states (see Figure 1.2). To minimize the communication overhead between the task placement daemon and the network daemons, the network daemons do not always report their updated flow states; instead, the task placement daemon pings the network daemons to get the predicted task performance when it needs to place a new task. In a large cluster, the task placement daemon has to contact a large number of network daemons to place a task. To further reduce the overhead, the task placement daemon uses local information to reduce the number of network daemons it needs to contact.
Specifically, it only contacts a network daemon if (i) the node it resides on is sufficiently close to the input data (e.g., in the same rack or a rack one hop away from the input data), and (ii) the node state (i.e., the smallest residual size of flows currently scheduled on this node) is no smaller than the size of the current task. Note that without contacting the network daemon, the task placement daemon does not know the current node state. In our design, the task placement daemon caches the node states previously reported by the network daemons and uses the cached values as estimates.

5.6 Evaluation

We evaluate NEAT using trace-driven simulations based on ns2 [7] as well as experiments on a small-scale testbed.

5.6.1 Simulation Settings

Datacenter Topology: We use a 160-node, 3-tier multi-rooted topology for our evaluation, comprising layers of ToR (Top-of-Rack) switches, aggregation switches, and core switches, similar to [15,127,135]. Each host-ToR link has a capacity of 1 Gbps, whereas all other links are 10 Gbps. The end-to-end round-trip propagation delay (in the absence of queuing) between hosts via the core switch is 300 μs.

Traffic Workloads: We consider traffic workloads that are derived from patterns observed in production datacenters. We evaluate NEAT using the benchmark Hadoop (map-reduce) [48] and web-search [21] workloads. These workloads contain a diverse mix of short and long flows with a heavy-tailed flow size distribution. In the web-search workload, more than 75% of all bytes are from 50% of the flows with sizes in the range [1, 20 MB]. The Hadoop workload is less skewed: about 50% of the flows are less than 100 MB in size and 4% of the flows are larger than 80 GB. Note that we always use these settings unless specified otherwise.

Protocols Compared: We compare NEAT with two task scheduling strategies: load-aware placement (minLoad) and locality-aware placement (minDist). We consider several state-of-the-art datacenter network scheduling strategies including DCTCP (which implements Fair) [18], L2DCT (which implements LAS) [96], and PASE (which implements SRPT) [94] for flow scheduling, and Varys [42] for coflow scheduling. minLoad places each task on the node with the minimum network load, defined as the utilization ratio of its link to the ToR. minDist places each task as close as possible to its input data, as proposed by earlier schemes [76]. We also test coflow performance using the smallest coflow first (SCF) heuristic. We implemented DCTCP, L2DCT, PASE, and Varys using the source code provided by the authors. The parameters of these protocols are set according to the recommendations provided by the authors, or reflect the best settings that we determined experimentally (see Table 5.1). For FCT prediction, we use the fair-sharing-based prediction model unless stated otherwise, and for CCT prediction, we use the prediction models corresponding to each evaluated coflow scheduling scheme.

Performance Metrics: We consider the average flow completion time (AFCT) for flow-based scheduling and the coflow completion time (CCT) for coflow-based scheduling schemes. We use the gap from optimal as the performance metric.
For example, for FCT-based schemes, the gap from optimal is calculated as (FCT − FCT_opt)/FCT_opt, where FCT_opt is the optimal FCT, defined as the time it takes to complete the data transfer when that is the only flow (coflow) in the network.

Table 5.1: Default parameter settings used in simulation.
DCTCP: qSize = 250 pkts, markingThresh = 65
L2DCT: minRTO = 10 msec
PASE: qSize = 250 pkts, minRTO (flows in top queue) = 10 msec, minRTO (flows in other queues) = 200 msec, numQue = 8

Figure 5.2: Flow placement under DCTCP ((a) Hadoop, (b) Websearch; gap from optimal (%) vs. flow size).

Note that the gap from optimal equals slowdown (a.k.a. stretch) minus one. The gap from optimal tells us the margin for performance improvement under different kinds of network scheduling policies. For example, we observe in our evaluation that the room for improvement is larger for scheduling policies that are suboptimal in terms of FCT (such as Fair and LAS) and smaller for near-optimal network scheduling policies (such as SRPT).

5.7 Macrobenchmarks

NEAT improves application performance by up to 3.7x compared to existing task placement algorithms that only consider data and compute, by incorporating network state and network scheduling policies into task placement decisions. Below, we first discuss NEAT performance in placing flows and then its performance in placing coflows.

Flow placement performance under Fair sharing: Figure 5.2 shows that NEAT outperforms minLoad and minDist by up to 3.6x for the web-search workload and up to 3.7x for the Hadoop workload when the network employs the Fair sharing scheduling policy. This is because NEAT chooses the node with the minimum predicted FCT to finish flows faster and uses node state to avoid placing short flows on the same links as long flows. Note that many of the existing policies deployed in datacenters employ the Fair sharing discipline. This shows that even if a datacenter does not implement any optimized network scheduling mechanism, NEAT can improve application performance significantly by being aware of network state.

Figure 5.3: Flow placement under Hadoop workload ((a) LAS (L2DCT), (b) SRPT (PASE); gap from optimal (%) vs. flow size).

Flow placement performance under different network scheduling policies: Figure 5.3 shows NEAT performance for the Hadoop workload when the network uses L2DCT (LAS) and PASE (SRPT) as network scheduling policies. NEAT improves performance by 2.7x compared to minLoad and by up to 3.2x compared to minDist when used with L2DCT. NEAT also improves performance by 30% compared to minLoad and by up to 20% compared to minDist when used with PASE.

Note that PASE is a near-optimal algorithm and there is very little performance improvement margin for NEAT. Also, observe that minDist and minLoad perform differently under different scheduling policies, whereas NEAT consistently performs better than both for all the evaluated network scheduling policies. We observe similar performance trends with the web-search workload.
CoFlow placement performance under different network scheduling policies: NEAT improves CCT performance for the Hadoop workload by up to 25% for two coflow scheduling policies: 1) Varys (Figure 5.4(a)) and 2) Smallest Coflow First (SCF) (Figure 5.4(b)). Coflow placement requires a group of flows to be placed in a batch. For coflow placement, NEAT works as described in Section 5.1, while minLoad and minDist are modified as follows. For minLoad, we place each flow of the coflow separately, in decreasing order of flow sizes, on a node that has the minimum network load. For minDist, our goal is to place flows of the same coflow within the same rack and close to the input data, and we start by placing the largest flow of a coflow first. NEAT performs better under both coflow scheduling policies; however, its gains are limited by the performance of the underlying coflow scheduling policy. For example, minDist with Varys scheduling performs better than NEAT with SCF scheduling. The reason is that Varys, which adjusts the flow rates based on the smallest-bottleneck heuristic, is a much better heuristic than SCF. However, we observe that the gap from optimal is substantial (30% to 50%) for these strategies: for large flow sizes with Varys, and for nearly all flow sizes with SCF. The reason for this gap is that the optimal completion time is not always achievable, even under optimal scheduling, as our definition of optimal ignores resource competition from other flows/coflows. Also, many of these protocols implement certain approximations, instead of the actual optimal scheduling algorithm, to be more deployment friendly.

Figure 5.4: CoFlow placement for Hadoop workload ((a) Varys, (b) SCF).

5.8 Microbenchmarks

In this section, we discuss various aspects of the NEAT design with the help of several micro-benchmarks. We analyze NEAT dynamics using the Hadoop workload under the SRPT (PASE) network scheduling policy, unless stated otherwise. Our evaluation shows that the NEAT optimizations (such as prediction assuming the Fair policy only, and preferred hosts) improve application performance, and that NEAT can predict task completion times quite accurately.

Figure 5.5: Fair prediction vs SRPT prediction.

Benefits of using Fair predictor: Figure 5.5 shows NEAT performance when task completion time is predicted using the Fair-sharing model (Eq. 5.4) or the SRPT-sharing model (Eq. 5.7) while the underlying network scheduling mechanism is SRPT. Both predictors achieve similar performance; therefore, even though the network performs SRPT flow scheduling, NEAT can perform FCT prediction by assuming the fair-sharing principle. This validates our proposition in §5.2.4, which states that task placement using the minimum predicted task completion time, assuming the Fair predictor, can provide better performance for any flow-based network scheduling policy.

Figure 5.6: Benefits of using preferred hosts placement.

Benefits of using preferred hosts aware placement: Figure 5.6 shows the benefits of using preferred hosts for making task placement decisions. To evaluate the impact on performance, we compare NEAT performance to minFCT.
The minFCT strategy makes task placement decisions based on the predicted task completion times alone and ignores the node states, i.e., it places each task onto the node with the smallest predicted FCT and does not consider the priority of the tasks already scheduled on that node. Figure 5.6 shows that minFCT placement degrades application performance by up to 50%, and performs even worse than minDist placement. NEAT, on the other hand, achieves better performance by being aware of node states. Intuitively, minFCT hurts performance in two ways: i) it groups short flows together, thus increasing fair sharing among short flows; ii) for long flows, it selects nodes with many short flows, so long flows experience more preemption by the currently scheduled and newly arriving short flows. We note that the comparison between minFCT and minDist is workload-dependent; e.g., for workloads where short and long flows have proportional numbers of bytes, minFCT performs similar to or better than minDist placement. Thus, using preferred-hosts-based task placement not only minimizes communication overhead, but also improves application performance.

Flow completion time prediction accuracy: NEAT can predict flow completion times with reasonable accuracy, as shown in Figure 5.7. NEAT performance depends on accurately predicting completion times. It is important to keep the prediction error small because NEAT only considers currently active tasks while predicting FCTs, and future task arrivals may make current placement decisions suboptimal. Figure 5.7 reports the prediction accuracy, which is calculated as (FCT_flow − FCT_pred)/FCT_pred, where FCT_flow is the actual FCT and FCT_pred is the FCT predicted by NEAT.

Figure 5.7: Completion time prediction error ((a) short flows, (b) long flows).

Figure 5.7 shows the prediction accuracy for short flows (Figure 5.7(a)) and long flows (Figure 5.7(b)), and we observe that the prediction error increases with the flow size. This is because, with the increase in flow size, the flow spends more time in the network and is affected more by the tasks arriving later in time. In many practical datacenter applications, most tasks generate short flows, for which NEAT can predict the FCT within 5% error. This experiment shows that NEAT is not very sensitive to mis-prediction in the task completion time estimate and is able to achieve good performance even in the presence of small inaccuracies in task completion time estimates.

5.9 Testbed Evaluation

We discuss the testbed implementation and evaluation results.

Implementation: NEAT follows a master/slave architecture similar to Hadoop-Yarn [1], and we implement NEAT in Scala using support from Akka for messaging between the task placement daemon (TD) and network daemons (ND). We currently implement NEAT as an application-layer program by extending the Varys implementation [42] and leave the integration with Hadoop-Yarn as future work. Similar to Yarn, the TD acts as the resource manager (RM) and makes task placement decisions, and NDs act as node managers, which maintain the flow state of the active jobs and share the network state upon receiving requests from the task placement daemon.
The TD accepts a user job, similar to the RM, and invokes network daemon APIs at the nodes (endhosts). Cluster framework (e.g., Spark or Hadoop) drivers can submit a job using the register() API to interact with NEAT. Upon receiving a request from the TD, NDs update their local states and reply with the predicted task performance and node state. The TD then runs our placement algorithm to schedule flows on the selected nodes. To support multiple network scheduling policies, we reuse Varys' put() and get() methods to send and receive data between different nodes. Cluster frameworks can use the NEAT InputStream and OutputStream to leverage various network scheduling policies. The desired network scheduling policy can be specified during the network configuration phase.

Evaluation Results: We evaluate NEAT using a 10-node single-rack cluster consisting of DELL servers (with 1G NICs) and a PICA-8 Gigabit switch with ECN support enabled. We consider a scenario with all-to-all traffic based on the Hadoop workload [48] and generate traffic such that the average network load is 50%. We first generate the traffic using ns2 and replay the same traffic in the testbed. The Hadoop workload generates many long flows, as about 50% of the flows are less than 100 MB in size and 4% of the flows are larger than 80 GB. We compare NEAT to minLoad task placement under the Fair (implemented by DCTCP) and LAS (implemented by L2DCT) network scheduling policies. Because of the small scale of the testbed, we do not compare with minDist (since all node pairs have the same distance), and we only evaluate using flow-based network scheduling policies. Our evaluation shows that, compared to minLoad, NEAT can improve application performance by up to 30% under the Fair scheduling policy and up to 27% under the LAS scheduling policy (see Figure 5.8). The gain in the testbed scenario is much smaller than in the large-scale simulations, and the reason is the small scale of the testbed. In our evaluation, we observe that many long flows are generated that run simultaneously in the network and share links with short flows. Both placement algorithms spread large flows across all the nodes in the network over time, and therefore increase the completion time of short flows because they share the network links and switch buffers, most of which are occupied by long flows. This effect is not observed in large-scale networks, where a flow can usually choose from many available nodes. Although the size of the testbed limits the amount of improvement, NEAT is able to improve performance under both network scheduling policies by making better placement decisions. We also compare testbed results to ns2 simulations using the same settings as the testbed setup and observe similar performance gains.

5.10 Discussion

Input Data Placement: NEAT can leverage the knowledge of input data locations to improve scalability and make more accurate placement decisions. However, input data may not be optimally placed, which limits the performance of NEAT. In such cases, NEAT can be combined with strategies like Corral [76] to make input data and task placement decisions based on the task execution history.

Flow Size Information: NEAT requires flow size information to predict task performance; however, flow size may not always be available.
For example, the motivation behind using LAS-based network scheduling is to address this challenge [96]. In such scenarios, NEAT can use approximate flow sizes based on the task execution history for performance prediction, as we have shown that NEAT minimizes task completion time even with some inaccuracy in the predicted task completion times (§5.8). We leave evaluation of such scenarios for future work.

Generalization of Network Topologies: Currently, NEAT assumes a single-switch topology and predicts task performance based on the edge links only; however, any link in the network can be the bottleneck in other topologies. To make better placement decisions, one can use a distributed state maintenance mechanism similar to PASE [94], where a dedicated arbitrator maintains flow state for each link in the network. NEAT can then choose the destination that gives the minimum completion time along the path from source to destination. Nevertheless, our evaluation shows that the single-switch abstraction already provides substantial performance improvement over existing solutions (Section 5.6).

NEAT Scalability: NEAT introduces several optimizations to make the system more scalable. For example, to mitigate the communication overhead among daemons, it uses node states while contacting nodes for performance prediction estimates. This reduces the number of overhead messages sent over the network between the network and task placement daemons. Also, to reduce the overhead of task state maintenance by the network daemons, NEAT uses the compressed flow state while predicting task performance. As shown in §5.8, these optimizations help improve NEAT performance; however, we leave large-scale scalability analysis of these optimizations as future work.

5.11 Conclusion

This work proposes a task placement framework that takes into account the network scheduling policy and shows that significant performance gains can be obtained over task schedulers that are oblivious to network scheduling. The NEAT task scheduler makes task placement decisions by using a pluggable task performance predictor that predicts the data transfer time of a given task under given network conditions and a given network scheduling policy. Despite several simplifying assumptions, NEAT achieves promising performance improvements under several commonly used network scheduling policies. Its pluggable architecture allows easy incorporation of more sophisticated resource and application models.

CHAPTER 6 BANDWIDTH ALLOCATION IN MULTI-TENANT DATACENTERS

In this work, we propose Stacked Congestion Control (SCC), a distributed host-based bandwidth allocation framework. SCC can support both performance isolation and objective scheduling, and is readily deployable. SCC has two layers to perform congestion control: (i) an underlay congestion control (UCC) layer handles contention among divisions (i.e., performance isolation), and (ii) a private congestion control (PCC) layer for each division optimizes its performance objective (i.e., objective scheduling). Below, we explain these components in more detail.

Network Model: This work targets the prevalent Clos-network fabrics. We assume the adoption of state-of-the-art load balancing schemes such as Presto [67]: a large flow can be divided into fine-grained, near-uniform units of data (i.e., flowcells), and load is balanced across almost every available path between two neighboring tiers of network switches.
Allocation of network weights among divisions is outside the scope of this work. We assume each division weight is given either as input by an algorithm that enforces administrator policy or by a utility maximization function [98,107].

Data Path: An example data path is shown in Figure 6.1. There are two CS tunnels and one DS tunnel among three pairs of hosts: division 1 contains two CS tunnels; division 2 contains one DS tunnel; all CS/DS flow(s) enter the CS/DS tunnel(s). Each tunnel has a rate-limiting queue at its ingress port in the endhost. A tunnel weight W_t is attached to each tunnel. Note that a tunnel is only a logical abstraction: it does not require any additional packet headers, hence NIC offload features such as TCP Segmentation Offload (TSO) are not affected.

Figure 6.1: SCC framework with tenant-objective tunnel abstraction.

6.1 Underlay Congestion Control

Congestion Signal: UCC uses ECN as the congestion signal and uses weighted network sharing algorithms to derive the contemporary tunnel rate. For all outgoing packets in a tunnel, the ECN-enable bits in the IP headers are set by the tunnel. At the receiver end of the tunnel, the ECN mark ratio is recorded and the destination host periodically sends ECN feedback; for all incoming TCP acknowledgement packets, the ECN bit is removed before handing them to the upper transport protocols. Note that if the upper transport protocol supports ECN, UCC should pass this information to the PCC layer.

Tunnel Rate: SCC provides each tunnel with a bandwidth share that is proportional to its weight; this allocation is an end-to-end and work-conserving design. Each tunnel has a rate-limiting queue attached to the ingress port in the source host. Based on the ECN feedback, each tunnel estimates the available bandwidth and adjusts the rate of the send queue. Similar to Seawall, we mimic the DCTCP behaviour in tunnel-level congestion control. If there is no congestion in the network, the rate R_t increases as:

R_t = (1 + W_t) \cdot R_t,    (6.1)

where W_t is the weight assigned to the tunnel. Upon detecting congestion, the rate reduces as:

R_t = (1 - \alpha \cdot W_t) \cdot R_t,    (6.2)

where \alpha is the level of congestion in the network.

Scaling Tunnel Weight: As a distributed solution, each tunnel in SCC periodically generates its reference tunnel weight W'_t according to its PCC layer algorithm. This calculation is objective-oriented, and the corresponding algorithms are presented in Section 6.3. All tunnels in the same division should ensure that the global weight W_d allocated to this division is respected, by cooperatively scaling each tunnel's self-derived reference network weight. A tracker is elected from all the source hosts in each division, and each source host periodically sends W'_t to this tracker. A scale factor is calculated as:

\lambda_d = W_d \Big/ \sum_{\forall t \in d} W'_t,    (6.3)

and \lambda_d is periodically sent back to the hosts in the division. Accordingly, each source host derives its contemporary weight value as follows:

W_t = \lambda_d \cdot W'_t.    (6.4)

In this way, the division weight is dynamically divided among all its constituent tunnels. As shown by the example in Figure 6.1, the sum of the weights for CS tunnel 1 and CS tunnel 2 equals 3, which is equal to division 1's allocated weight.
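The following sketch restates the tunnel-level rate control and weight scaling above (Eqs. (6.1)–(6.4)). It is a simplified illustration under the stated assumptions — ecn_fraction is the fraction of ECN-marked packets reported by the receiver-side shim, and alpha is a DCTCP-style congestion-level estimate — rather than the actual SCC kernel-module implementation; all names are illustrative.

    class UccTunnel:
        """DCTCP-like weighted rate control for one SCC tunnel."""

        def __init__(self, init_rate_bps, weight=1.0):
            self.rate = init_rate_bps    # R_t, enforced by the per-tunnel rate limiter
            self.weight = weight         # W_t, derived from the reference weight W'_t

        def on_feedback(self, ecn_fraction, alpha):
            # alpha tracks the level of congestion (e.g., a moving average of
            # the ECN mark fraction, as in DCTCP).
            if ecn_fraction == 0:
                self.rate *= (1 + self.weight)                  # Eq. (6.1): increase
            else:
                self.rate *= max(0.0, 1 - alpha * self.weight)  # Eq. (6.2): decrease


    def scale_division_weights(division_weight, reference_weights):
        """Tracker-side scaling: split the division weight W_d among its tunnels
        in proportion to their self-derived reference weights W'_t."""
        lam = division_weight / sum(reference_weights.values())    # Eq. (6.3)
        return {t: lam * w for t, w in reference_weights.items()}  # Eq. (6.4)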
Due to the impact of load balancing, in each tier of the network, the share of bandwidth obtained by a tunnel is statistically proportional to its weight. Since the sum of the tunnel weights equals the global weight allocated to the division, the division's bandwidth share is achieved.

6.2 Private Congestion Control

Congestion Signal: SCC is completely transparent to the upper-layer transport protocols. The congestion signal for objective-oriented scheduling is generated from the underlying host send queue instead of the network. Each division tunnel has full control over its own congestion signal and control law; it can either rely on existing objective-oriented transport protocols, or directly schedule flows in the tunnel. The upper transport protocols may enable ECN support, as in D2TCP, DCTCP and L2DCT. Note that in SCC, ECN bits are used for division-level congestion control and, before packets are handed over to the upper layers, all ECN marks are removed. For this scenario, SCC remarks the ECN bit based on the current queue length and the properties of flows inside the tunnel. No operation is required if the transport protocols rely on round-trip time. Otherwise, if the transport protocols rely on packet drops, a queue length limit should be set (e.g., 100 packets).

Generate Reference Tunnel Weight and In-Tunnel Scheduling: To optimize the performance objective, there are two algorithm options for a particular division: (i) generate the reference tunnel weight W'_t according to the contained flows, and rely on upper-layer transport protocols to schedule flows inside the tunnel; or (ii) an integrated design of both reference tunnel weight and in-tunnel scheduling. We present our PCC algorithms in Section 6.3. Note that SCC is a general architecture: researchers can develop algorithms for different objectives, or algorithms that also target LS/DS/CS flows but with better performance.

6.3 PCC Tunnel Algorithms

In this section, we present our designs of the tunnel algorithms for the DS/CS/LS divisions. To optimize the performance objective, we adopt an integrated design of both reference tunnel weight and in-tunnel scheduling. Even if upper-layer transport protocols are preferred to schedule flows inside the tunnel, the reference tunnel weight algorithms can still be used. A goal of the algorithms is that, for a division using a datacenter network exclusively, the control mechanism should achieve comparable performance with and without SCC. Assume four DS flows with flow priority f1 > f2 > f3 > f4; f1 and f2 share the same tunnel, and f3 and f4 share the same tunnel. Prior to SCC, and without the concern of coexistence, all DS flows may compete independently. With SCC, the f1/f2 tunnel now competes with another tunnel that contains f3/f4. This metric reflects the overhead/gain of the SCC algorithms. We omit the detailed proofs of the optimality of the algorithms due to space limitations. The symbols used are listed in Table 6.1.

6.3.1 Deadline-Sensitive Tunnel

Deadline-sensitive flows require transferring a flow of a certain size within a deadline.
It is well known that for a single flow session s with remaining size M_s and deadline δ_s, a guaranteed bandwidth x_s no less than M_s/δ_s can meet its deadline. Therefore, SCC dynamically updates the required minimum bandwidth of each DS tunnel based on the sizes and deadlines of all active flows in the tunnel (steps 1 and 2, Algo. 6.3.1). This requirement is updated periodically (on a millisecond timescale).

Table 6.1: Notation Table
M_s(t): remaining data size for session s at time t
δ_s(t): remaining time till deadline for session s at time t
x_s(t): requested source rate for session s at time t
Q_s(t): buffer status for session s at time t
λ_s: average data arrival rate for session s
Q*_s: target average queue length
BW_req(t): request from the host on the bandwidth at time t
BW_res(t): response from the coordinator on the bandwidth at time t
τ: decision slot duration

Algorithm 3 Rate Control for DS Tunnel
At the beginning of each decision slot, the tunnel
Step 1: Updates the flow information {M_s(t), δ_s(t)}.
Step 2: Estimates the rates x*_s(t) = M_s(t)/δ_s(t) and BW_req(t) = Σ_s x*_s(t).
Step 3: Lets W'_t = BW_req(t), gets W_t from W'_t, and sets BW_res(t) = W_t.
In-tunnel scheduling

The tunnel then adapts to the allowed bandwidth, which may or may not be equal to the required bandwidth (step 3, Algo. 6.3.1). PCC can choose to either use or not use an in-tunnel scheduling algorithm, irrespective of the upper-layer transport protocols. We demonstrate later in the evaluation that in-tunnel scheduling can provide significant performance improvement, which we attribute to the flexibility of the SCC framework.

6.3.2 Completion-Sensitive Tunnel

The bandwidth sharing mechanism for CS tunnels is similar to that of the DS tunnels, except that only flow sizes are used to compute the required bandwidth. To mimic the Shortest-Flow-First (SFF) strategy, the inverse of the flow size is used as each flow's weight (step 2, Algo. 6.3.2). Inside the CS tunnel, flows are simply scheduled by SFF.

Algorithm 4 Rate Control for CS Tunnel
At the beginning of each decision slot, the tunnel
Step 1: Updates the flow information {M_s(t)}.
Step 2: Estimates the weight W'_t = Σ_s 1/M_s(t).
Step 3: Gets W_t from W'_t.
In-tunnel scheduling: Shortest-Flow-First

6.3.3 Latency-Sensitive Tunnel

For LS tunnels, the buffer length in the host/network queues defines the flow latency. This algorithm considers queuing at the end-hosts only, not at the switches in the network. We rely on the tunnel-level UCC protocol to maintain switch queues at a low level, and an additional option is to put LS flows into a higher-priority physical queue. PCC maintains small queues in the tunnel by using ECN-based notification. To compute the bandwidth requirement, PCC leverages insights from how existing transport protocols work. For example, most existing datacenter transport protocols start with an initial congestion window of 10 packets (such as DCTCP, L2DCT, etc.), and, to avoid timeouts, they need at least one ACK per RTT. Therefore, we set the average bandwidth of a tunnel to be at least 10 × N_f, where N_f is the number of LS flows in the tunnel. Additionally, we also consider the queue size to determine the target rate for the LS flows, as listed in Algo. 6.3.3.
Algorithm 5 Rate Control for LS Tunnel
At the beginning of each decision slot, the tunnel
Step 1: Updates the current queue status {Q_s(t)}.
Step 2: Estimates the rates x*_s(t) = (Q_s(t) − ϵ_s Q*_s)/τ if Q_s(t) > ϵ_s Q*_s (and 0 otherwise), and BW_req(t) = Σ_s x*_s(t).
Step 3: Lets W'_t = BW_req(t), gets W_t from W'_t, and sets BW_res(t) = W_t.
In-tunnel scheduling

6.4 Testbed Verification

In this section, we first discuss the SCC system implementation and testbed evaluation. Then, we compare simulation results to testbed results.

System Implementation: We implement the SCC shim layer as a Netfilter kernel module in Linux. Two hash-based flow tables, one for send and one for receive, are used for packet classification and for tracking per-flow state. To enforce accurate rates over short timescales, we use the Linux high-resolution kernel timer, HRTIMER, for our rate limiters. At the sender, we use the LOCAL_OUT hook to intercept all outgoing packets to enforce the virtual tunnel abstraction. To support ECN, the ECN-capable (ECT) codepoint is marked in every packet's IP header. The packet is then forwarded to a rate-limited per-tunnel queue. At the receiver end, we use the LOCAL_IN hook to intercept all incoming packets. Each packet is matched against the receive table, and its ECN bit is checked. The receiver shim layer calculates the fraction of ECN-marked packets and delivers this information back to the sender, which uses it to perform tunnel-level congestion control. We have not implemented the in-tunnel algorithms for the testbed.

Testbed Experiments: For the testbed experiments, we use a single rack with 16 DELL servers (with 1G NICs) and a PICA-8 Gigabit switch with ECN support enabled. For evaluation, we consider three kinds of traffic: latency-sensitive (DCTCP), deadline-sensitive (D2TCP) and completion-sensitive (L2DCT). We use the same settings as Base-DS, Base-LS and Base-CS for each traffic class.

First, we consider the CS application in isolation, with and without SCC support. Flows are generated such that they incur an average load of 400 Mbps (low load). Figure 6.2(a) (the CS-only scenario) shows that the performance of CS flows remains the same with or without SCC support in this scenario. The reason is that, in a network with low load, flows do not experience interference and finish quickly. Next, we consider the coexistence of CS/DS (medium load) and CS/DS/LS (high load) flows in the network. With SCC support enabled (Figure 6.2(a)), the AFCT of L2DCT flows is improved by 17% and 53% for the CS/DS and CS/DS/LS coexistence scenarios, respectively. With SCC, the deadline miss ratio of DS flows is reduced by 2× and 1× for the CS/DS and CS/DS/LS coexistence scenarios, respectively.

Figure 6.2: Testbed results w/ & w/o SCC ((a) CS flows, (b) DS flows).

Figure 6.3: Performance under the CS/DS coexistence scenario ((a) DS, (b) CS).

Verification using Simulations: In this section, we verify the NS2 simulations against the testbed results when the CS and DS flows coexist in the network, under the motivation example scenario. We use the same settings as Base-DS and Base-CS for each application and vary the network load. In this scenario, as shown earlier, DS flows and CS flows experience degraded performance (Fig. 3.5(c) and (d)). Figure 6.3 shows that with SCC support, fewer flows miss their deadlines; more specifically, at higher loads, we observe 4× performance improvement compared to when D2TCP is used. Similarly, the completion time of flows is improved by 5× at higher loads.
This is due to the in-tunnel queuing at the end-hosts and the adaptive ECN marking at the shim layer. We observe similar performance trends as the testbed at low and medium loads, while simulations show better improvement at higher loads.

6.5 Large-scale NS-2 Simulations

Our evaluation addresses the following questions:

1. Can SCC improve the performance of different flows when they coexist? We evaluate scenarios where flows with different performance objectives coexist in the datacenter network. SCC is able to meet the requirements of objectives like CS, DS or LS simultaneously across a wide range of workloads such as Web-search [18], Data-mining [63] and MapReduce [38]. SCC improves performance by up to 3.2× for DS flows and 40% for CS flows as compared to state-of-the-art protocols like pFabric [21].

2. Can SCC achieve the same or even better performance when only flows of the same objective exist in the network? In our evaluation, we show that SCC can achieve similar or better performance for protocols when they exist in the network alone. At high loads, SCC can reduce the deadline miss rate by up to 2× and achieve similar AFCTs.

3. Does SCC consistently perform well? We test SCC performance in oversubscribed settings and with different transport protocols. The results demonstrate that with SCC enabled, most protocols deliver better performance.

Evaluation Setup: We use a Clos topology for the NS2 simulations, unless specified otherwise. The capacity of edge links is 1 Gbps and that of core links is 10 Gbps. We assume ECN-capable switches with 250 KB of buffering and a maximum end-to-end RTT of 300 μs [94]. We model two traffic classes. Class-I traffic belongs to a deadline-sensitive small-message application and requires deadline guarantees. Class-II traffic is similar to Class-I, but has an objective other than deadlines, such as minimizing flow completion times or latency guarantees. We replicate the Web-search [18], Data-mining [63] and MapReduce [38] workloads for Class-I and Class-II traffic. This setting coarsely models the workload of OLDI applications and distributed storage. In our experiments, deadline-sensitive traffic uses D2TCP [127], latency-sensitive traffic uses DCTCP [18], and completion-sensitive traffic uses L2DCT [96] as the transport protocol at the endhosts. We use the default parameters from the respective papers for each protocol. The applications generate traffic according to a Poisson process, such that the average bandwidth requirement of each class is 300 Mbps. For deadline flows, the deadlines are exponentially distributed using guidelines from [96].

Figure 6.4: Benefits of the in-tunnel algorithm ((a) DS flows, (b) CS flows).

6.6 Comparison in Coexistence Scenario

In this section, we evaluate the impact of interference on application performance when flows with different performance objectives coexist in the network.

Coexistence Performance w/ and w/o In-Tunnel Algorithms: In this section, we show the benefits of using in-tunnel scheduling algorithms on top of the objective-tunnel scheduling mechanism. To illustrate the benefits, we reuse the simulation setup of the motivation example and show the performance when DS and CS flows coexist in Fig. 6.4.
As shown in Figure 6.4(a), the deadline miss ratio of DS flows reduces by up to 8% when in-tunnel scheduling is used. Similarly, as shown in Figure 6.4(b), the AFCT of CS flows reduces by up to 6% when in-tunnel scheduling is used. This shows that simple in-queue scheduling can provide further benefits on top of the tunnel-level abstraction. We observe similar gains in other scenarios. In all other experiments, we use the in-tunnel scheduling algorithms.

Coexist-LS/DS/CS Performance: In the first setup, we consider the data-mining workload and assume that flows of the three objectives (i.e., latency-sensitive, deadline-sensitive and completion-sensitive) coexist in the network. The aggregate workload generated by traffic from the three objectives is 900 Mbps, generating 90% network load. Figure 6.5 shows that when the three objectives coexist without SCC support, they affect each other's performance. The data-mining workload is heavily skewed, with more than 80% of the flows smaller than 2 KB; SCC completes 3.2× more flows than without SCC (Figure 6.5(a)). SCC mitigates network interference and improves the AFCT by 40% compared to a network without SCC support (Figure 6.5(b)). SCC also improves the latency of latency-sensitive flows by up to 40%; we omit these results due to space limitations.

Figure 6.5: Coexistence of CS/DS/LS flows with the data-mining workload ((a) CS and DS, (b) CDF of CS flows).

Figure 6.6: Coexistence with the Data-mining workload ((a) DS, (b) CS).

The performance of the different objectives is also hurt in the presence of interfering traffic without SCC. To evaluate the impact of interference, we consider the same settings as above and use the Data-mining workload. When coexisting, DS flows miss many deadlines because of interference (Figure 6.6(a)); when they coexist with CS and LS flows, SCC reduces the deadline miss ratio by up to 3×. CS performance is affected more in the presence of DS flows in the network (Figure 6.6(b)). However, when coexisting with only LS flows, the AFCT is not degraded. The reason is that there is sufficient capacity available in the network and flows do not experience queuing delays.

Comparison with State-of-the-Art: We also compare the SCC performance with state-of-the-art protocols such as pFabric [21] that consider the coexistence of different objectives like CS/DS. We consider the coexistence of CS and DS flows using the data-mining workload, where each application generates 40% load in the network. For pFabric, we assign high priority to DS flows as compared to the CS flows. Figure 6.7(a) shows that SCC improves the performance of the CS flows by 30% compared to pFabric. However, SCC misses 2% more deadlines for DS flows. This is expected as, in this scenario, SCC does not enforce any specific prioritization policy in the network.

6.7 Comparison in Policy Enforcement

In this section, we consider the coexistence performance of the same tenants or objectives, but with constraints on the available network bandwidth and weight.
We consider two tenants (A and B) with the DS objective; both tenants require 600 Mbps of bandwidth to meet the deadline requirements of their applications. We assume that tenant A has higher priority in the network; thus, the operator assigns more bandwidth to A and assigns the rest of the available bandwidth to tenant B. Figure 6.7(b) shows that SCC meets almost all the deadlines of Tenant A, while the performance degrades for Tenant B. Overall, SCC meets more deadlines for both tenants combined.

Figure 6.7: Comparison under the Data-mining workload ((a) pFabric, (b) Admin policy).

6.8 Comparison in Isolation Scenario

In this setup, we evaluate the performance of SCC for the deadline-sensitive and completion-sensitive objectives when they have exclusive access to the network, and compare it to the performance of the D2TCP and L2DCT protocols, respectively. The goal is to evaluate whether, in the absence of coexistence, SCC can achieve similar performance as the transport protocols across a range of network loads. We use the data-mining workload and the same network setup as in Section 6.5. For deadline-sensitive applications, SCC meets more deadlines across all network loads and reduces the deadline miss ratio by up to 2.5× (Figure 6.8(a)). This shows that SCC not only eliminates interference among flows of different objectives, but also improves performance via the abstraction of an objective-tenant congestion-free tunnel. For completion-sensitive applications, SCC minimizes the AFCT and achieves performance similar to L2DCT, as shown in Figure 6.8(b). SCC achieves similar performance as L2DCT at low loads and improves AFCT at higher loads by eliminating inter-flow interference via its priority sub-queues.

6.9 SCC Dynamics

In this section, we evaluate SCC performance under different workloads, network oversubscription, different transport protocols, and strict priority.

Figure 6.8: Isolation scenario ((a) DS flows' missed deadlines, (b) CS flows' AFCT).
In this topology, we have 4 ToR switches, each connecters to 30 hosts with 1 Gbps links and connects to 102 0 300 600 900 1200 1500DataMin.W.SearchMapRed.AFCT (ms)SCCL2DCT 0 10 20 30 40 50 60 70DataMin.W.SearchMapRed.Deadline Missed (%)SCCD2TCP 0 20 40 60 80 100CS FlowsDS Flows 0 5 10 15 20AFCT (msec)Deadline Missed (%)SCCL2DCTD2TCP 0 100 200 300 400 500CS FlowsDS Flows 0 10 20 30 40 50AFCT (msec)Deadline Missed (%)SCCL2DCTD2TCPDCTCP (a) CS/DS Figure 6.11: Objective isolation using priority queues (b) LS latency core switches using 10 Gbps links. We generate east-to-west traffic: all traffic is inter-rack. Note that load is calculated based on network core load compared to previous scenarios, where the load was generated based on the edge links. We consider three applications, CS, DS and LS in the network. All the flows follow the Web-search workload and have average network load of 300 Mbps each. Figure 6.10(a) shows that SCC reduces AFCT of CS flows by 20% and reduces deadline missing ratio of DS traffic only marginally. Different Transport Protocol: Figure 6.10(b) shows that even with DCTCP as transport protocol, SCC improves the application performance compared to the original protocols. The AFCT of CS flows improves by 3(cid:2) as these flows get equal share of the network bandwidth as DS flows. DCTCP is more aggressive in increasing rate than the L2DCT, for large flows DCTCP increases congestion window by 1 pkt/RTT whereas L2DCT increases by the factor k, where k 2 0:5;1, which depends on the flow size. The performance of DS flows is also better than D2TCP protocol, however, the deadline performance degrades compared to the scenario where SCC uses D2TCP as the transport protocol. The reason is that now both CS and DS flows share network fairly and benefits of D2TCP at the transport layer are lost. Isolating Tenants using Priority Queues: In this section, we compare and contrast the tradeoffs of using priority queues inside the switches, to isolate LS flows from other objectives to meet its requirements. To illustrate the benefits, we consider a network with two priority queues and 103 0 20 40 60 80 100CS FlowsDS Flows 0 5 10 15 20AFCT (msec)Deadline Missed (%)SCC w/o prio que.SCC w/ prio que. 0.2 0.4 0.6 0.8 1 1.2 0 0.05 0.1 0.15 0.2Instantaneous RTT (ms)Time (sec)LS w/o prio queLS w/ prio que coexistence of LS, DS and CS flows. Here, LS flows are mapped to the higher priority queue and CS/DS flows share the lower priority queue. Figure 6.11(a) shows that the performance of DS and CS flows is hurt a little when LS flows are prioritized over them. However, latency requirements of LS flows are better met when strict priority is enforced in switch queues, in Figure 6.11(b). 6.10 Conclusions There exist two kinds of stakeholders in a datacenter networks: the tenants and the administrators. An individual tenant’s congestion control tries to achieve its own performance objective. With a global view of the network, the administrators need to maximize its utility function. In this paper, we provide a framework to decouple the congestion control among different performance objectives, from the congestion control among flows with the same objective. This work demonstrate that simple solutions can be adopted to eliminate the interference among different performance objectives and different tenants. This work is inspired by the Application-Driven Network (ADN) [130]. 
CHAPTER 7 MULTIPATH TCP ATTACKS & COUNTERMEASURES

In this chapter, we first demonstrate how an attacker can leverage passive cross-path throughput inference and the backup flag vulnerability to launch a connection hijack attack. Next, we demonstrate how an attacker can launch directed traffic diversion attacks. In the hijack attack, an attacker diverts all the traffic through his own path to take full control of the connection. In the directed traffic diversion attack, an attacker manipulates traffic on specific individual subflows while maintaining the same throughput of the MPTCP connection. Note that the connection throughput can itself vary in real life. However, using throughput inference, an attacker can ensure that the overall throughput does not change too much (for long) and that any unusual activity is not discovered by the attack victim.

Figure 7.1: Experiment Topology.

7.1 Experimental Setup

To validate the hijack and traffic diversion attacks under varying network settings, we conduct experiments on a testbed running the MPTCP Linux implementation (v0.91) [101]. We consider the scenario shown in Figure 7.1, where a client and server can communicate using two paths configured as WiFi (path p1) and Wired (path p2). We set up and configure the loss rate and throughput on these paths to approximate WiFi and Wired connections. MPTCP connections compete with single-path TCP connections on paths p1 and p2. The client and the server can have a 2-path MPTCP connection whose subflows use paths p1 and p2. Let t1 and t2 denote the throughput of the MPTCP subflows over paths p1 and p2, respectively. Let T1 and T2 denote the throughput of the TCP connections passing through paths p1 and p2, respectively. In our experiments, we assume that there is a malicious attacker on path p1 who can eavesdrop and inject forged packets on path p1.

7.2 Connection Hijack Attacks

First, we show a special case of traffic diversion, where an attacker can launch a connection hijack attack to onload traffic to the subflow on its eavesdropped path.

7.2.1 Motivation

In these scenarios, an attacker's motivation could be to observe all the data in the MPTCP connection for surveillance, or, as a cellular ISP, to charge users for additional data usage. To gain full control over the MPTCP connection, an attacker can set the non-eavesdropped subflows between the two hosts as backup and thereby pause them. By pausing the non-eavesdropped subflows, the attacker gains full control over the MPTCP connection.

7.2.2 Overview

In the connection hijack attack, the attacker uses the backup flag vulnerability to onload traffic to its own path and uses cross-path throughput inference to validate the success of the hijack attack. Consider the scenario shown in Figure 7.1, where two subflows of an MPTCP connection pass through paths p1 and p2. An attacker can hijack the non-eavesdropped subflow on p2 by dynamically changing the priority of the subflow and declaring it as backup. To launch this attack, an attacker only needs to know the address identifier of the host, which can be obtained by eavesdropping other paths or easily guessed, as address identifiers are set incrementally in the current Linux implementation of MPTCP. Since the identifier has only 8 bits, it can even be brute-forced.
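To make the forged-message mechanics concrete, the sketch below assembles the raw bytes of an MP_PRIO TCP option as specified in RFC 6824 (option kind 30, subtype 0x5, a backup 'B' bit, and an optional 8-bit address ID). Injecting it requires a correctly sequenced and checksummed TCP segment, which is omitted here; the helper name is illustrative.

    def mp_prio_option(addr_id=None, backup=True):
        """Build the raw bytes of an MPTCP MP_PRIO TCP option (RFC 6824).

        Kind 30 identifies an MPTCP option; subtype 0x5 is MP_PRIO.  The low
        bit of the subtype byte is the 'B' (backup) flag, and the optional
        trailing byte is the address ID of the subflow whose priority is being
        changed -- the only field the attacker has to guess."""
        KIND, SUBTYPE = 30, 0x5
        flags = 0x01 if backup else 0x00
        body = bytes([(SUBTYPE << 4) | flags])
        if addr_id is not None:
            body += bytes([addr_id])        # 8-bit address ID (brute-forceable)
        return bytes([KIND, 2 + len(body)]) + body

    # Example: mark the subflow with address ID 1 as backup.
    forged = mp_prio_option(addr_id=1, backup=True)
    assert forged == bytes([30, 4, 0x51, 1])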
To set a non-eavesdropped subflow between hosts A and B as backup, an attacker can request a change in the subflow priority by sending the MPTCP MP_PRIO option to host B with the address identifier of host A. Since, by design, MP_PRIO messages can be sent on any subflow, an attacker eavesdropping on any path and capable of sending such a forged backup message can pause any other path. Note that a backup MPTCP subflow may still be used for data transmission later, which the attacker can detect by observing the overall MPTCP throughput. Effectively, this attack degrades an MPTCP connection to a regular TCP connection, as all traffic will be routed through the attacker-controlled path. Given that MPTCP has an increased attack surface, this vulnerability makes MPTCP more likely to be hijacked compared to TCP. Unfortunately, the MPTCP specification does not include any authentication mechanism whatsoever regarding the MP_PRIO message.

Figure 7.2: MPTCP connection hijack attack using the backup flag.

7.2.3 Attack Validation

To validate the attack, we consider a scenario where the client and the server have a 2-path MPTCP connection whose two subflows use paths p1 and p2. Figure 7.2 plots the results of the hijack attack using the backup flag for the scenario shown in Figure 7.1. In Figure 7.2, both t1 and t2 achieve similar throughput until t = 2 minutes. To gain full control, at t = 2 minutes the attacker launches the backup flag attack and, as a result, the p2 throughput drops to zero, which can be observed by the attacker on p1 using our throughput inference scheme. The attacker can unset the backup flag to resume the subflow on p2 (t = 6 minutes).

7.3 Directed Traffic Diversion Attacks

Next, we show how an attacker can launch a directed traffic diversion attack on a specific subflow to offload or onload traffic from one subflow to another subflow. Compared to simple traffic diversion, the goal of the directed traffic diversion attack is to hurt the performance of a subflow on a specific path or a subflow used by a specific network. Our threat model is different from (and more generic than) that of [109], and can be used to launch both the simple and the directed traffic diversion attacks.

7.3.1 Motivation

There are both malicious and benign motivations for traffic diversion. A malicious attacker eavesdropping an MPTCP subflow can launch the directed traffic diversion (offloading) attack to gain more bandwidth, i.e., diverting traffic to other paths so that he can use his own path to carry more of his own traffic while at the same time affecting the performance of connections using a specific network to which he has no access. This improves the network performance of the attacker; however, it hurts the performance of the users in the targeted network to whom traffic is diverted. Likewise, a malicious ISP can launch the traffic diversion attack to reduce its load by diverting traffic to other ISPs. This allows the malicious ISP to carry more traffic of its own users and degrades the performance of users in the other network. In a benign scenario, a cellular ISP can use traffic diversion to reduce users' data usage over 3G/4G links by diverting traffic to WiFi links. This helps users keep their cellular data usage under control and avoid overage charges, and also helps the cellular ISP offload traffic during peak hours. In traffic diversion attacks, the attackers can make sure that the end-user quality of experience is not affected.
This can be made possible by diverting only a limited amount of traffic, such that the overall throughput of the MPTCP connection remains unchanged. An attacker can monitor the overall connection throughput from the global sequence numbers using our proposed throughput inference model.

7.3.2 Overview

To launch the directed offloading or onloading attack, an attacker needs to identify the target path to offload data to or onload data from. To offload data to a specific path, an attacker can either throttle traffic on the paths he has access to (e.g., by introducing packet loss) or set the subflows going through these paths as backup using the MP_PRIO option in MPTCP. To onload traffic, an attacker can set the targeted paths as backup to direct traffic to his own paths. To decide which path to onload from or offload to, an attacker can gather the mapping between IP addresses and address IDs from the MP_JOIN and ADD_ADDR options. Both options contain an IP address and the corresponding address ID and are sent over the existing connection in plaintext. Therefore, an attacker eavesdropping on one of the subflows may be able to see this information, if it is advertised on the eavesdropped path, and recover the mapping between an IP address and its corresponding address ID to launch targeted traffic diversion attacks.

7.3.3 Attack Validation

We first consider a scenario where the client and the server have a 2-path MPTCP connection whose two subflows use paths p1 and p2. Let t denote the throughput of the MPTCP connection, where t = t1 + t2. We also set up TCP connections (acting as background traffic) passing through paths p1 and p2, with throughput T1 and T2, respectively. The attacker can eavesdrop on path p1 but not on p2. By eavesdropping, the attacker can calculate the overall throughput t of the MPTCP connection. Using our MPTCP throughput inference scheme, the attacker can then calculate t2 = t - t1. The attacker can use this information during traffic diversion to make sure that the backup flag vulnerability is exploited correctly and that the attack was successful.

Figure 7.3 plots the results of the traffic diversion attack for the two-path connection scenario. To launch the traffic diversion attack, the MPTCP subflow passing through p1 or p2 is throttled using the MP_PRIO option.

Figure 7.3: Demonstration of the traffic diversion attack ((a) Offloading, (b) Onloading)

In Figure 7.3(a), t1 and t2 achieve a throughput of about 6 Mbps and 2 Mbps, respectively, and the spare capacity is utilized by the TCP flows on both paths. Later, when the attacker sets the MPTCP subflow on its own path p1 as backup (e.g., to free up more bandwidth on its own path by offloading its traffic to the other path), t1 decreases significantly, and t2 increases to make up for the reduced throughput of t1. However, because the available bandwidth on p2 is not sufficient, not all the traffic is offloaded to p2, and the overall throughput t = t1 + t2 drops during the traffic diversion attack. The attacker can detect this change and make sure the attack was successful. Similarly, Figure 7.3(b) demonstrates that when the attacker sets the other path as backup, i.e., sets the MPTCP subflow on path p2 as backup to onload traffic from the other path, t2 drops to zero and t1 increases to make up for the reduced throughput of t2. In this case the available bandwidth on p1 is sufficient; therefore, more than 80% of the traffic is offloaded to p1, and the overall throughput t remains almost unchanged during the traffic diversion attack.
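To make the inference step concrete, the following sketch shows one way an eavesdropper on p1 could estimate t, t1, and t2 = t - t1 from a window of captured packets. It is a simplified model under assumptions of our own: the capture is represented by hypothetical records carrying a timestamp, the cleartext data sequence number (DSN) from the DSS option, the subflow sequence number of the eavesdropped subflow, and the payload length; a real implementation would also have to parse the options and handle wraparound, reordering, and retransmissions.

from dataclasses import dataclass
from typing import List

@dataclass
class CapturedPacket:
    ts: float        # capture timestamp (seconds)
    dsn: int         # connection-level data sequence number (cleartext in DSS)
    ssn: int         # subflow-level sequence number of the eavesdropped subflow
    payload_len: int # bytes of payload carried

def infer_throughputs(pkts: List[CapturedPacket]):
    """Estimate connection throughput t, eavesdropped-subflow throughput t1,
    and the hidden subflow's throughput t2 = t - t1 (bytes/second)."""
    if len(pkts) < 2:
        return 0.0, 0.0, 0.0
    pkts = sorted(pkts, key=lambda p: p.ts)
    duration = pkts[-1].ts - pkts[0].ts
    # DSN progression covers data sent on *all* subflows.
    t = (pkts[-1].dsn + pkts[-1].payload_len - pkts[0].dsn) / duration
    # SSN progression covers only the subflow the attacker can observe.
    t1 = (pkts[-1].ssn + pkts[-1].payload_len - pkts[0].ssn) / duration
    return t, t1, max(t - t1, 0.0)

# Toy example: roughly 1 MB was seen on p1 while the DSN advanced by 3 MB in
# one second, so about 2 MB/s must have travelled on the hidden subflow.
window = [CapturedPacket(0.0, 0, 0, 1448),
          CapturedPacket(1.0, 3_000_000, 1_000_000, 1448)]
t, t1, t2 = infer_throughputs(window)
print(f"t={t:.0f} B/s  t1={t1:.0f} B/s  t2={t2:.0f} B/s")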
The attacker can detect the drop in t2 using our throughput inference model and confirm that the attack was successful.

7.3.4 Stepwise Traffic Diversion

Stepwise diversion can be used to offload data in steps without causing any noticeable change in the overall throughput. An attacker can launch this attack by gradually throttling the throughput of the desired subflow. By gradually throttling a two-path MPTCP connection, the attacker can identify the minimum value of t1 that does not impact t. There are multiple ways to throttle a subflow; for example, the attacker can rate-limit a subflow by dropping its packets, which degrades its throughput through the congestion control mechanism. This is the method we use in our experiments. Our experiment demonstrates that an attacker can offload data to other paths without hurting t using the proposed traffic diversion method. Note that the magnitude of possible traffic diversion depends on the properties of path p2, i.e., an attacker might not be able to offload enough data during periods of higher congestion.

To illustrate this, we repeat the above experiment but throttle t1 in a stepwise manner. We throttle the subflow by dropping packets of the MPTCP subflow and increase the drop rate stepwise to achieve different levels of throttling. A similar effect can be achieved by sending duplicate ACKs on the path, which also reduces the subflow throughput. From Figure 7.4, we can observe that initially all of the throttled traffic is diverted to the other path (i.e., to t2 in this case); however, as the attacker increases the throttling level, the fraction of the traffic diverted to the other path decreases (Figure 7.4(b)). In Figure 7.4(a), when the attacker throttles the MPTCP subflow on path p1 by 20% at time = 2 minutes, t1 decreases and t2 increases to make up for the reduced throughput of t1. However, when the attacker increases the throttling level at time = 3, 5, and 7 minutes, not all of the throttled traffic is diverted to path p2. This experiment shows that the amount of diverted traffic may vary depending on path conditions and that all of the traffic may not be diverted if enough capacity is not available.

Figure 7.4: Demonstration of the stepwise traffic diversion attack ((a) With stepwise diversion, (b) Throughput diverted (%))

7.4 Countermeasures

The problems of throughput inference and connection hijack are rooted in the MPTCP specification. In this section, we propose countermeasures that address these vulnerabilities and, as a result, prevent the traffic diversion and connection hijack attacks. At a high level, our security improvement hinges on the simple fact that a secret can be split and distributed across multiple paths, so that an attacker controlling only a subset of the paths will not be able to know the assembled secret. We show how this idea can be applied to authenticating the MP_PRIO control packets as well as encrypting the global sequence number (GSN).

7.4.1 Current MPTCP Security Mechanism

MPTCP does have a built-in authentication mechanism that prevents unauthorized subflows from being established.
At the beginning of a new MPTCP connection (i.e., the very first subflow), the sender and the receiver exchange a set of keys (i.e., the sender key and the receiver key) in plaintext at connection setup time through MP_CAPABLE messages during the TCP handshake. Additional subflows start in the same way as a normal TCP connection, but the SYN, SYN/ACK, and ACK packets also carry the MP_JOIN option as well as the authentication information. Specifically, for a new subflow, the MP_JOIN option on the SYN packet contains a token, a random number, and an address ID. The token is generated from a hash of the receiver's key and is used to identify the MPTCP connection to which the new subflow belongs. The random number (nonce) is sent as a challenge to the receiver, who needs to use it to compute a subsequent HMAC, which prevents replay attacks. Upon receiving an MP_JOIN SYN with a valid token, the host responds with an MP_JOIN SYN/ACK containing a random number and a truncated (leftmost 64 bits) Hash-based Message Authentication Code (HMAC) computed over the sender's and receiver's keys as well as the sender's random number. If the token or HMAC is incorrect, the initial sender can reset the subflow and fall back to TCP, or it can deny the request for new subflow setup.

As we can see, the only "secrets" (keys) are exchanged during the very first subflow; subsequent subflows only authenticate themselves, and no additional new secrets are shared. This means that as long as an attacker controls the first subflow, he or she can manipulate all subflows (e.g., by creating fake subflows or by sending other control packets, such as ADD_ADDR, that require authentication). As will be discussed next, our proposed solution requires new secrets to be shared on newly established and authenticated subflows. Further, the changes mostly piggyback on the existing security mechanisms without a substantial redesign of the protocol.

7.4.2 Backup Flag Vulnerability Prevention

As mentioned before, the fundamental reason for the backup flag attack is the lack of authentication of MP_PRIO messages. We present progressively more secure changes to the way MP_PRIO messages are generated and handled.

1. Address ID removal. MP_PRIO messages use an address identifier (ID) to specify the target subflow whose priority needs to be changed. A simple solution to prevent this attack is to remove the address ID from the MP_PRIO option, i.e., the MP_PRIO option must be sent on the same subflow as the one to which it applies. This implies that the attacker must be on the path of that subflow to cause it to become a backup (which he can do anyway). This defense is simple to implement; however, as noted in RFC 6824 [56], the optional address ID exists for the scenario where a path is so extremely congested (e.g., due to a radio coverage issue) that the MP_PRIO message has to be delivered on a different path. This proposal has been adopted in RFC 6824bis [57].

2. Invalid address ID rejection. One possibility to counter an attacker that sends random address IDs is for an MPTCP implementation to reject a subflow on which an MP_PRIO refers to an invalid address ID. Rejecting or terminating a subflow on which an MP_PRIO with one or more invalid address IDs is observed can be a simple solution, as it indicates a compromised path. However, it is unclear whether this can lead to false alarms (especially when significant dynamics are present, e.g., new subflows joining and old ones leaving).

3. Address ID randomization.
Currently, address IDs are assigned incrementally in the MPTCP Linux implementation (this is not required by RFC 6824 [56]), which makes it easier for the attacker to guess the address ID of the subflows on the non-eavesdropped path. For example, if the first address is assigned an address ID of 1, the next address of the same connection will be assigned an address ID of 2. This behavior is not defined in the MPTCP specification but is observed in the current Linux MPTCP implementation. We can mitigate this flaw by randomizing the address IDs assigned to the addresses in an MPTCP connection. Although this approach will not completely eliminate the threat, it can at least make it harder for the attacker to guess the correct address ID of the target subflow. We can make guessing the correct address ID even more difficult by adopting techniques such as limiting the number of IDs that can be attempted per fixed time interval. Firewalls and middleboxes can also be configured to avert this attack. However, the current address ID field is only 8 bits, so an attacker can still launch this attack quickly.

4. Adding a standard/static badge for authentication. A better solution is to require a subflow-level authentication badge, sent along with the MP_PRIO packet, to ensure that an attacker controlling only one subflow cannot spoof a valid packet. This badge can be similar to the HMAC computed in MP_JOIN messages during subflow setup. However, there are two issues with using a standard badge similar to the current token. First, the current token is static, as it is simply a hash of the received key (and will be the same no matter on which subflow it is used). Unfortunately, the current token is exposed during subflow setup in either the MP_JOIN or the MP_CAPABLE packet, allowing an attacker controlling any subflow to replay the token. The problem is that the current token is supposed to be used only once, during the subflow setup phase, while it would now be reused multiple times (as a badge) to authenticate control packets such as MP_PRIO. Second, even if the badge is replay-resistant, as long as it is generated from the key exchanged on the first subflow, an attacker knowing that key can always compute the badge easily and set any other subflow as backup.

5. Adding a secure/dynamic badge for authentication. Next, we propose a secure badge-based authentication that must meet the following requirements: first, the badge needs to be replay-resistant; second, the badge generation should not depend on keys shared on only a specific subflow (e.g., the primary one).

Choice of badge generation. To satisfy the first requirement, we propose to generate an authentication badge using an HMAC derived not only from the keys but also from an additional 32-bit nonce that changes with every packet (e.g., a new field in the MP_PRIO message, Figure 7.5). This avoids sending a fixed badge over the MPTCP connection and prevents replay attacks. We propose a badge size of 8 bytes, similar to the truncated HMAC in the MP_JOIN option of the current MPTCP specification. Similar to the way the current token is generated, we can use a truncated HMAC (leftmost 64 bits) computed over the key material plus the nonce, as shown in the example below.

Figure 7.5: Secure MP_PRIO TCP option (Kind, Length, Subtype, backup flag B, optional AddrID, optional 64-bit Authentication Badge, optional 32-bit Nonce)
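As a concrete illustration of this badge computation, the sketch below derives a 64-bit badge from the combined key material, the per-message nonce, and the target address ID, truncating the HMAC just as MP_JOIN does. It is a minimal sketch under stated assumptions: the combined key material anticipates the secret-splitting choice described next, HMAC-SHA1 is used because that is the HMAC algorithm of RFC 6824, and the exact byte layout of the HMAC input and the helper name make_badge are illustrative choices rather than part of any specification.

import hmac
import hashlib
import os
import struct

def make_badge(key_primary: bytes, key_subflow: bytes,
               nonce: int, addr_id: int) -> bytes:
    """Compute the 64-bit MP_PRIO badge proposed above.

    The key material is the concatenation of the key exchanged on the
    subflow carrying the MP_PRIO and the key exchanged on the subflow
    being set as backup; the message is the per-packet nonce plus the
    target address ID. Only the leftmost 64 bits of the HMAC are sent.
    """
    key_material = key_primary + key_subflow
    message = struct.pack("!IB", nonce, addr_id)
    digest = hmac.new(key_material, message, hashlib.sha1).digest()
    return digest[:8]   # truncated (leftmost 64 bits), as for MP_JOIN

# Example: 64-bit keys as exchanged in MP_CAPABLE / reused from MP_JOIN nonces.
key_a  = os.urandom(8)   # Key-A from MP_CAPABLE on subflow 1
key_a2 = os.urandom(8)   # Key-A2 (nonce-A) from MP_JOIN on subflow 2
nonce  = struct.unpack("!I", os.urandom(4))[0]  # fresh 32-bit nonce for this MP_PRIO
badge  = make_badge(key_a, key_a2, nonce, addr_id=2)
print(badge.hex())

The receiving host recomputes the badge from the same inputs and silently ignores an MP_PRIO message whose badge does not match.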
Choice of secret splitting. To satisfy the second requirement, we propose to exchange independent keys on individual subflows and use the combined key to compute the badge. Specifically, since only the primary subflow exchanges keys explicitly through MP_CAPABLE messages, we need a way to exchange new keys for non-primary subflows. A simple strategy is to reuse the nonces exchanged in MP_JOIN as keys (as they are random in nature too). Once we have a pair of independent keys exchanged on each subflow, the new badge can be computed by combining the keys of two subflows: (1) the one on which MP_PRIO is sent and (2) the one that is to be set as backup. This means that even if the attacker eavesdrops on the first subflow, it is impossible to compute a valid badge to set any other subflow as backup, so long as the attacker is not eavesdropping on the other subflows. For bookkeeping, the additional pair of keys for each subflow needs to be maintained until the subflow is terminated.

Example scenario. Now let us consider an example to understand how everything works together, as shown in Figure 7.6. Assume an MPTCP connection between hosts A and B with two subflows 1 and 2, where the attacker controls subflow 1. First, the keying material is taken from the keys in the MP_CAPABLE option of subflow 1 and the random numbers from the MP_JOIN of subflow 2. To change the priority of subflow 2, the secure MP_PRIO will contain an authentication badge generated using this keying material and the random nonce, as shown in Fig. 7.5. The authentication badge is the HMAC derived from the MP_PRIO nonce and the keys exchanged on both subflows. For example, in this case, the badge for subflow 2 can be generated as Badge2 = HMAC(keyA + keyA2, NONCE + ID), where keyA and keyA2 are simply concatenated. If a legitimate MP_PRIO message has been sent on subflow 1 (to change the priority of subflow 2), an attacker eavesdropping on subflow 1 may learn the badge. However, in order to inject a subsequent MP_PRIO message, the attacker needs to compute a new badge (using the new nonce). Further, since the badge also includes the keys exchanged on subflow 2 during MP_JOIN, the attacker cannot easily compute the new badge unless the attacker is also eavesdropping on subflow 2 (in which case the attacker already has full control of both subflows and the attack is unnecessary).

Figure 7.6: Example use of secure MP_PRIO (key exchange via MP_CAPABLE on subflow 1 and MP_JOIN on subflow 2, followed by an MP_PRIO carrying the address ID, badge, and nonce)

7.4.3 Throughput Inference Vulnerability Prevention

The fundamental enabler for cross-path throughput inference is the GSN embedded in MPTCP packets. On the one hand, GSNs are needed for reassembling packets from multiple subflows into an MPTCP connection and for detecting packet losses. On the other hand, as we demonstrated, the GSN creates a vulnerability that allows an attacker to infer the throughput of a subflow that he has no access to. Packet header encryption using solutions such as IPsec [58] could prevent attackers from inferring throughput. However, this requires additional keys to be established ahead of time, and the VPN server can be a single point of failure. In addition, from the VPN server to the real destination, MPTCP may still be used and would therefore encounter the same problem.
Note that using SSL/TLS will not help, as it operates above the transport layer and the proposed attacks will still work at the transport layer. Therefore, we propose a simple solution that addresses, in MPTCP itself, the very problem introduced at the MPTCP layer: encrypt only the DSN (and DSAck) in the MPTCP header, as shown in Figure 7.7. As will be detailed later, we can simply combine the keys exchanged on multiple paths (similar to the idea in the backup flag attack prevention) to encrypt and decrypt the GSN (symmetric-key encryption). The symmetric key can be used with any standard symmetric-key encryption algorithm, such as AES [99] (e.g., in CTR mode to eliminate the fixed block size [87]). Note that the attacker might have some knowledge about the plaintext GSN, which constitutes a "known-plaintext attack" (i.e., knowing a pair of corresponding plaintext and ciphertext). However, modern symmetric encryption algorithms such as AES are resistant to such attacks [99].

Figure 7.7: Secure DSS MPTCP option (Kind, Length, Subtype, flags F|m|M|a|A, encrypted Data ACK (4 or 8 octets, depending on flags), encrypted Data Sequence Number (4 or 8 octets, depending on flags), Subflow Sequence Number (4 octets), Data-Level Length (2 octets), Checksum (2 octets))

However, the challenge here is that, unlike the authentication token, which is sent sporadically, the GSN is embedded in every data packet. Once a key has been agreed on between the sender and the receiver, it is not trivial to update the keys on the fly. For instance, when the very first subflow is established, only the keys exchanged on the first path can be used to encrypt the GSN, which is susceptible to man-in-the-middle attacks on that path (as the keys are exchanged in cleartext and the GSN can be decrypted). Later on, when the second subflow is established, we can combine the keys exchanged on the second path into the key material for GSN encryption to make the encryption more secure. This is easy for the second subflow, as the key material is readily available. However, it is not the case for the first subflow, as it needs to switch from unencrypted GSNs to encrypted ones. Therefore, an additional signaling mechanism is required to switch to the newer key material.

Choice of key material change signaling. To address this problem, we can make use of subflow setup or of the unused header fields in the MPTCP DSS option or the MP_JOIN TCP option. For example, when a new subflow is set up, the MP_JOIN option or the first DSS MPTCP option on the new path can be used as a signal to start the encryption of the first subflow's GSN, without sending any explicit signal. A simple strategy is to start the encryption of the GSN on the first subflow upon seeing the MP_JOIN packet of the second subflow, or upon receiving the first DSS packet on the second subflow. Such implicit signaling, however, may cause ambiguity for packets that are in flight together with the MP_JOIN packet (it is unclear what key material should be used to decrypt them). To avoid such confusion, we encode an explicit signal in the MP_JOIN message of the second subflow, indicating when exactly the encryption should start on the first subflow. This way, as long as the specified DSS is larger than any of the outstanding packets, the receiver knows exactly when to start decrypting the DSS. Note that the signaled DSS needs to be part of the HMAC input during MP_JOIN to prevent tampering (as we will discuss later in §7.5).
We can add a 4-byte or 8-byte sequence number to the MP_JOIN packet and set the option Length field to 16 or 20, respectively. We assume that the signal has to be initiated by the client to avoid race conditions where both sides simultaneously initiate new subflows. When the server acknowledges the message, a reverse signal can be piggybacked in the response MP_JOIN to decide when to perform GSN encryption in the reverse direction. We outline this process in Figure 7.8. For backward compatibility, we propose using a flag in MP_CAPABLE to signal whether GSN encryption is supported by both the client and the server. If either the client or the server does not support encryption, GSN encryption will not be employed.

Figure 7.8: Explicit signaling of GSN encryption on the first subflow (the Enc flag is negotiated in MP_CAPABLE, the DSS-A/DSS-B thresholds are signaled in MP_JOIN, and DSS/DSAck values are sent encrypted once they exceed the signaled thresholds)

Choice of secret splitting. As described earlier, we want to combine the keys of multiple paths to form the key material for GSN encryption. For example, for an MPTCP connection with 2 subflows, the keys of path 1 and path 2 can be used for encrypting the GSN on both path 1 and path 2. Similarly, for an MPTCP connection with 3 subflows, the keys of path 1 and path 3 can be used to encrypt the GSN on path 3. The strategy is to combine the keys exchanged on the primary subflow with the ones exchanged on the current subflow where GSN encryption takes place. For more security, an alternative strategy could be to combine the keys of all active subflows; for example, for an MPTCP connection with 3 subflows, we could use the keys of path 1, path 2, and path 3 for encrypting the GSN on path 1. However, this may require frequent updates to the keys generated for the paths as more subflows join or leave the MPTCP connection. For example, the key material for the first subflow would be updated twice as subflows 2 and 3 are established; it gets tricky when both new subflows are being established simultaneously, because the client and the server may not have a consistent view of which of these two subflows gets established first (and a different key material may be formed on the two ends).

Choice of key generation. Based on the above discussion, we propose combining the keys exchanged on only two subflows: (1) the current subflow and (2) the primary subflow. For example, for a connection between hosts A and B with two subflows 1 and 2, the encryption key of subflow 2 (encryption at A and decryption at B) can be generated as EncKey2A = keyA + keyA2 (as shown in Figure 7.8). The encryption starts as soon as the original value is greater than the signaled DSS value (DSS-A in the example). Similarly, the encryption of the DSAck starts when its value becomes greater than the signaled DSS value in the opposite direction (DSS-B in the example). We omit the details for the reverse direction, where the encryption key will be EncKey2B = keyB + keyB2. Given that all such numbers are randomly generated, we can simply concatenate them to form the key.
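To illustrate the key construction and the per-packet encryption step, the sketch below encrypts a 4-byte DSN with AES in CTR mode, using the concatenation of the two 64-bit keys as the 128-bit AES key. It is a minimal sketch under assumptions of our own: the counter block is derived here from the cleartext subflow sequence number so that each packet gets a fresh keystream, the pyca/cryptography library is used purely for illustration, and the exact cipher, counter construction, and byte layout are not fixed by the proposal.

import os
import struct
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def gsn_transform(key_a: bytes, key_a2: bytes, subflow_seq: int, dsn: int) -> int:
    """Encrypt (or, CTR mode being symmetric, equally decrypt) a 4-byte DSN.

    The AES key is the concatenation of the key exchanged on the primary
    subflow and the key exchanged on the current subflow; the counter
    block is derived from the cleartext subflow sequence number so that
    every packet uses a fresh keystream.
    """
    aes_key = key_a + key_a2                             # 8 + 8 = 16 bytes
    counter_block = struct.pack("!QQ", 0, subflow_seq)   # 16-byte CTR counter block
    encryptor = Cipher(algorithms.AES(aes_key), modes.CTR(counter_block)).encryptor()
    keystream = encryptor.update(b"\x00" * 4) + encryptor.finalize()
    return dsn ^ int.from_bytes(keystream, "big")

key_a, key_a2 = os.urandom(8), os.urandom(8)
dsn_enc = gsn_transform(key_a, key_a2, subflow_seq=1000, dsn=0x12345678)
assert gsn_transform(key_a, key_a2, 1000, dsn_enc) == 0x12345678  # same op decrypts

Because CTR mode turns AES into a stream cipher, the encrypted DSN and DSAck keep the original 4- or 8-octet width of the DSS fields.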
Example scenario. Let us consider the 2-path MPTCP connection setup in Figure 7.8 to understand how an MPTCP connection progresses when GSN encryption is enabled. All subflows are set up and authenticated as in the original MPTCP connection setup. The required keying material is exchanged in MP_CAPABLE (i.e., the keys Key-A and Key-B) and MP_JOIN (i.e., nonce-A and nonce-B), as described before. At the end of subflow setup, all the keying material required for encryption has been exchanged. Moreover, during the MP_CAPABLE exchange, the hosts also negotiate whether encryption is supported by both hosts A and B (using the Enc bit). The MP_JOIN option on subflow 2 also carries the signaled DSS after which the encryption starts at host A (denoted DSS-A). Similarly, host B informs host A of the DSS after which the encryption starts at host B (denoted DSS-B). Next, the keying material is combined to form the encryption keys used to produce the encrypted fields DSS-Enc and DSAck-Enc on each subflow. When data transmission starts, GSNs are encrypted on path 2 from the beginning because the key material is ready (details are omitted in the figure). On path 1, in contrast, GSN (and GSN-ACK) encryption has to start at a later point. Note that some data packets might be transferred on subflow 1 before the setup of subflow 2 is completed. The GSNs of these few packets will not be encrypted and will be visible to the attacker on path 1, as shown in Figure 7.8. However, as soon as the setup of subflow 2 completes, the encryption can start on path 1, as indicated by "DSS-Enc" in Figure 7.8, so long as DSS > DSS-A. Similarly, the encryption of the DSAck can start as soon as DSAck > DSS-B.

7.5 Security Analysis

We now discuss the potential attacks that can be launched against our security improvements and show that we do not introduce any new vulnerabilities. We follow the same threat model as before, where a MITM attacker is on one or more communication paths and can both passively eavesdrop on and actively tamper with the subflow(s).

First, when an attacker is capable of tampering with the first subflow, the whole MPTCP connection can be forced to revert back to a regular TCP connection and will thus be subject to the total control of the attacker. Our proposed changes do not change this result. However, if the attacker is capable of tampering with subflows other than the first, our changes will prevent the two classes of cross-path attacks that were possible without the changes.

Second, an active attacker may tamper with a subset of the shared secrets (when the corresponding paths are controlled by the attacker), in which case the wrong key material may be used for the authentication token and GSN encryption. The HMAC mechanism introduced earlier during subflow setup already mitigates this threat. For instance, if the key on the second subflow is tampered with, the attacker also needs to know the key exchanged on the first subflow to be able to compute an updated HMAC (as the key is part of the input). Even if the attacker can indeed successfully cause wrong keys to be used on a particular subflow, the worst-case scenario is denial of service: the authentication token generated by either end will be considered invalid, and the encrypted GSN may be decrypted using the wrong key material. However, we argue that such attacks are no worse than before, as both are possible if an attacker is on-path.

Third, we consider two possible targeted attacks against GSN encryption:

• Attacking the MPTCP option that indicates support for GSN encryption in MP_CAPABLE messages.
An attacker has to tamper with the first subflow to be able to strip such an option or fake such an option, either to disable GSN encryption or to cause confusion later on. However, as we discussed, if an attacker is already capable of tampering with the first subflow, then complete control or denial of service can be achieved easily anyway.

• Attacking the signal carried in an MP_JOIN message (e.g., during second subflow setup) indicating the GSN of the first subflow after which GSN encryption will begin. An attacker capable of tampering with this subflow can strip or modify the sequence number. However, the HMAC of MP_JOIN includes the signaled GSN as input, so any tampering will be immediately detected, unless a powerful attacker also eavesdrops on the first subflow and knows the initial keys exchanged there, allowing it to compute the updated HMAC. Even for such a powerful attacker, the worst case is again DoS, which is no worse than before.

Finally, we want to ensure that no unnecessary security features are added in reference to TCP. Since MPTCP and TCP both reside in the same transport layer, security guarantees that are not offered by TCP do not have to be offered by MPTCP either. Otherwise, the proposed features would likely be redundant with higher-layer ones (e.g., SSL/TLS). Specifically, TCP does not provide any confidentiality or integrity for connection data. The proposed changes to MPTCP do not attempt to provide confidentiality or integrity for any subflow either (e.g., a MITM attacker can read or modify the payload or any other fields in an MPTCP packet). What we do address are the security flaws that manifest as cross-path vulnerabilities, where a MITM attacker on one path can infer or tamper with the subflow on another path. Such flaws effectively make MPTCP much more susceptible to MITM attacks: if one of the subflows of an MPTCP connection is under a MITM attacker, the whole MPTCP connection may be subject to attacks. We consider this an increased attack surface for MPTCP, which therefore makes it less secure than TCP (as TCP traverses only a single path and is "less likely" to be subject to MITM attacks).

7.6 Linux Implementation

We implement a simple version of the proposed countermeasures in Linux kernel 4.3 using MPTCP implementation v0.91 [101]. We add a fixed 8-byte badge field to the MP_PRIO option to prevent the backup flag attack. Our current implementation supports both the existing MP_PRIO option formats and the proposed format with the badge field; the option Length field can be changed to choose the desired format. For the length control, we add a new parameter, MPTCP_SUB_LEN_PRIO_B. In our current implementation, we use a static badge value and have not implemented the secure badge generation discussed above. Upon receiving an MP_PRIO packet with the length field MPTCP_SUB_LEN_PRIO_B and a correct ADDR_ID and badge, the receiver sets the corresponding path as backup; otherwise, it silently ignores the request. We modify the required logic in the mptcp_parse_options function in mptcp_input.c, and we check the badge field and then verify the token in the mptcp_handle_options function.

We encrypt the data sequence numbers in the MP_DSS option to prevent the throughput inference attack. At the sender side, for an outgoing packet with the MP_DSS option, we encrypt the DSN using the encryption key. For this, we modify the mptcp_options_write function in the mptcp_output.c file and update the mptcp_write_dss_data_ack and mptcp_write_dss_data_seq functions.
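As a user-space model of the per-packet transform applied at this point in the send path, the short sketch below applies a keyed XOR to the DSN before it is written into the DSS option; applying the same operation again recovers the plaintext value. The function name and the placeholder key are ours, and the sketch only mirrors the simplified static-key XOR used in our prototype, not the full AES-CTR scheme proposed in §7.4.3.

STATIC_GSN_KEY = 0x5A5A5A5A  # placeholder static key assumed shared by both ends

def xor_dsn(dsn: int, width_bits: int = 32) -> int:
    """Keyed XOR applied to the DSN on send; applying it again on receive
    restores the original value (XOR is its own inverse)."""
    mask = (1 << width_bits) - 1
    key = STATIC_GSN_KEY & mask
    return (dsn ^ key) & mask

assert xor_dsn(xor_dsn(0x00ABCDEF)) == 0x00ABCDEF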
At the receiver end, upon receiving a packet with the MP_DSS option, we first decrypt the DSNs before further processing the packet. We modify the required logic in the mptcp_parse_options function in mptcp_input.c. We do not keep the encrypted DSNs at the end hosts, as the DSN is encrypted just before the packet is sent out into the network and is decrypted immediately upon receiving the packet at the destination. In our current implementation, we assume a static key for encryption (an XOR of the DSN and the key) and assume that both the sender and the receiver have this key. We leave the exact implementation of key generation and encryption as future work.

Evaluation: To evaluate, we repeat the attack experiments and observe that the proposed countermeasures prevent the connection hijack attacks by making it impossible to infer the connection throughput and to exploit the backup flag vulnerability. We omit the results for brevity. We also evaluate the overhead of the proposed modifications on MPTCP connection throughput. First, we observe that adding the badge field to MP_PRIO does not affect the throughput of the MPTCP connection, as the MP_PRIO message is sent only once or a few times over the lifetime of the connection. Second, in our current implementation, the per-packet encryption of the DSN can incur processing overhead and may decrease the overall connection throughput. To quantify this, we measure the MPTCP connection throughput between two Linux machines with 1 Gbps Ethernet cards. We observe only minimal variation in MPTCP connection throughput. For example, we observe a throughput of 912 Mbps without the encryption scheme and a throughput of 907 Mbps with GSN encryption. This shows that even at high packet transmission speeds the computational overhead of encryption is minimal and does not significantly affect the connection. Note that the connection speed in real scenarios (e.g., cellular networks) is much lower; therefore, adding encryption will not hurt the overall connection throughput.

7.7 Conclusions

MPTCP has already been widely deployed (including on Android and iOS); however, its security is not well understood. In this chapter, we described MPTCP vulnerabilities that can be exploited to divert traffic and hijack MPTCP subflows. We implemented and evaluated our attacks on a testbed using the MPTCP Linux kernel (v0.91) implementation. We also proposed lightweight encryption-based countermeasures based on secret splitting to mitigate these attacks. Our countermeasures do not introduce any new security flaws into MPTCP. We plan to explore opportunities to leverage secret splitting as a way to bootstrap keys for higher-level protocols such as TLS.

CHAPTER 8
FUTURE WORK

In this chapter, I provide an overview of the planned future work.

8.1 Automatic signature generation of network protocols on-the-fly

8.1.1 Motivation and Problem Statement

Inferring unknown application signatures in real time is crucial to combat zero-day attacks and polymorphic worms, as their traffic is typically carried by unknown application protocols. Zero-day attacks, i.e., previously unknown attacks such as Aurora, Duqu, and Stuxnet, have been ever increasing. For example, 27 zero-day attacks were identified in 2014, as reported by Symantec's Internet Security Threat Report [53]. Protocol signatures can be used to protect against zero-day attacks, as intrusion detection/prevention systems (IDSs/IPSs) [103, 115] identify malicious traffic by comparing packet contents against protocol signatures, which can be generated from protocol formats using signature generation tools [29, 86, 102]. However, for real-time protection, once an attack is detected, an IDS should be able to quickly infer protocol formats from the network traffic and generate signatures, which can then be used by the IPS systems [31, 85, 117, 129] deployed across the network to block malicious traffic. Application protocol signatures can also be used to combat polymorphic worms that may implement many variants of the same protocol format. For example, the Code Red based polymorphic worm follows the format given by the regular expression "ida?NNNNNNNNNNNN...", where the long string of Ns is used to trigger a buffer overflow vulnerability; the worm may have many variants in which each N is replaced by an X. By inferring a signature that is invariant to the protocol format and extracting the string beginning with "ida?", an IDS can detect any variant of the Code Red polymorphic worm by measuring the length of the string.
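As a small illustration of such a length-based, format-invariant signature, the sketch below flags any payload in which "ida?" is followed by an unusually long run of a single repeated filler byte, regardless of which filler character a particular variant uses. The 200-byte threshold and the exact pattern are illustrative assumptions, not a signature taken from a deployed IDS.

import re

# Match "ida?" followed by a long run of one repeated filler byte
# (e.g., N...N in the original Code Red, X...X in a variant).
CODE_RED_VARIANT = re.compile(rb"ida\?(.)\1{199,}", re.DOTALL)

def looks_like_code_red(payload: bytes) -> bool:
    """Return True if the payload carries a Code Red style overflow filler."""
    return CODE_RED_VARIANT.search(payload) is not None

print(looks_like_code_red(b"GET /default.ida?" + b"N" * 240 + b"%u9090"))  # True
print(looks_like_code_red(b"GET /default.ida?" + b"X" * 240 + b"%u9090"))  # True
print(looks_like_code_red(b"GET /default.ida?page=1"))                      # False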
8.1.2 Limitations of Prior Art

Existing protocol format inference based approaches have the following limitations. First, many of the existing approaches [44, 55, 90, 131, 133] have low computational efficiency and take a long time to infer protocol formats, which renders them unsuitable for real-time signature generation. Second, some of the existing approaches assume prior knowledge of the protocol formats [124, 125] or consider traffic from only a single application while inferring formats [90]. Third, some approaches, like Securitas [137] and ProHacker [132], do perform real-time traffic classification; however, Securitas requires more packet processing time to extract the protocol keywords, and ProHacker is a semi-supervised approach that assumes prior information about the protocol. Lastly, some approaches, like Unexpected [90] and ProWord [143], only infer the important keywords found in the network trace and do not generate protocol signatures.

8.1.3 Project Objectives

This work concerns the problem of inferring unknown protocol formats from unseen network traffic in real time. Real-time inference means that the system should have the following properties. First, it should start the format inference using only a small number of flows or messages. This is required for quick format generation using the minimum traffic observed from only a few monitoring points in the network. Second, it should generate formats quickly using a lightweight inference algorithm. This is important so that the system can infer protocol formats quickly and deploy appropriate rules across the network to filter malicious traffic in a timely manner and minimize the attack surface. Third, it should infer the protocol formats incrementally and should not assume any prior knowledge of the protocol formats. This is important because the formats of some malware, like the Code Red polymorphic worm example above, may have many variants, and the format inference system should be able to infer new protocol formats as more traffic appears.

BIBLIOGRAPHY

[1] Apache hadoop yarn. https://goo.gl/qEdpo5.
[2] Arista 7050s switch. http://www.aristanetworks.com/docs/Manuals/ConfigGuide.pdf.
[3] Broadcom bcm56820 switch. http://www.broadcom.com/collateral/pb/56820-PB00-R.pdf.
[4] Dell force10 s4810 switch. http://www.force10networks.com/CSPortal20/KnowledgeBase/DOCUMENTATION/CLIConfig/FTOS/Z9000_CLI_8.3.11.4_23-May-2012.pdf.
[5] Ibm rackswitch g8264 application guide. http://www.bladenetwork.net/userfiles/file/G8264_AG_6-8.pdf.
[6] Juniper ex3300 switch. http://www.juniper.net/us/en/products-services/switching/ex-series/ex3300/.
[7] The network simulator - ns-2. http://www.isi.edu/nsnam/ns/.
[8] Linux Blind TCP Spoofing Vulnerability. The U.S. Department of Energy, Computer Incident Advisory Capability, March 1999.
[9] Prio qdisc linux. http://linux.die.net/man/8/tc-prio, 2006.
[10] iOS: Multipath TCP Support in iOS 7. http://support.apple.com/kb/HT5977, Jan. 2014.
[11] MPTCP on Android Devices. http://multipath-tcp.org/pmwiki.php/Users/Android, January 2014.
[12] Dennis Abts and Bob Felderman. A guided tour of data-center networking. Commun. ACM, 55(6):44–51, June 2012.
[13] Faraz Ahmad, Srimat T Chakradhar, Anand Raghunathan, and TN Vijaykumar. Shufflewatcher: Shuffle-aware scheduling in multi-tenant mapreduce clusters. In USENIX ATC, pages 1–12. USENIX, 2014.
[14] M. Al-Fares, S. Radhakrishnan, B. Raghavan, N. Huang, and A. Vahdat. Hedera: dynamic flow scheduling for data center networks. In NSDI'10.
[15] Mohammad Al-Fares, Alexander Loukissas, and Amin Vahdat. A scalable, commodity data center network architecture. In ACM SIGCOMM Computer Communication Review, volume 38, pages 63–74. ACM, 2008.
[16] M. Alizadeh, T. Edsall, S. Dharmapurikar, R. Vaidyanathan, K. Chu, A. Fingerhut, T. Lam, F. Matus, R. Pan, N. Yadav, and G. Varghese. Conga: Distributed congestion-aware load balancing for datacenters. In SIGCOMM'14.
[17] M. Alizadeh, A. G. Greenberg, D. A. Maltz, J. Padhye, P. Patel, B. Prabhakar, S. Sengupta, and M. Sridharan. Data center TCP (DCTCP). In SIGCOMM'10.
[18] Mohammad Alizadeh, Albert Greenberg, David A Maltz, Jitendra Padhye, Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari Sridharan. Data center tcp (dctcp). In Proc. ACM SIGCOMM 2011.
[19] Mohammad Alizadeh, Abdul Kabbani, Tom Edsall, Balaji Prabhakar, Amin Vahdat, and Masato Yasuda. Less is more: trading a little bandwidth for ultra-low latency in the data center. In NSDI, pages 19–19. USENIX, 2012.
[20] Mohammad Alizadeh, Shuang Yang, Milad Sharif, Sachin Katti, Nick McKeown, Balaji Prabhakar, and Scott Shenker. pfabric: Minimal near-optimal datacenter transport. In SIGCOMM'13.
[21] Mohammad Alizadeh, Shuang Yang, Milad Sharif, Sachin Katti, Nick McKeown, Balaji Prabhakar, and Scott Shenker. pfabric: Minimal near-optimal datacenter transport. In ACM SIGCOMM 2013.
[22] Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, and Ion Stoica. Effective straggler mitigation: Attack of the clones. In NSDI, volume 13, pages 185–198. USENIX, 2013.
[23] Ganesh Ananthanarayanan, Srikanth Kandula, Albert G Greenberg, Ion Stoica, Yi Lu, Bikas Saha, and Edward Harris. Reining in the outliers in map-reduce clusters using mantri. In OSDI, volume 10, page 24, 2010.
[24] Nimrod Aviram, Sebastian Schinzel, Juraj Somorovsky, Nadia Heninger, Maik Dankel, Jens Steube, Luke Valenta, David Adrian, J. Alex Halderman, Viktor Dukhovni, Emilia Käsper, Shaanan Cohney, Susanne Engels, Christof Paar, and Yuval Shavitt. Drown: Breaking tls using sslv2. In USENIX Security 16, 2016.
[25] M. Bagnulo. Threat analysis for TCP extensions for multipath operation with multiple addresses. RFC 6181, IETF, March 2011.
[26] M. Bagnulo, C. Paasch, F. Gont, O. Bonaventure, and C. Raiciu.
Analysis of Residual Threats and Possible Fixes for Multipath TCP (MPTCP). RFC 7430. [27] Wei Bai, Li Chen, Kai Chen, Dongsu Han, Chen Tian, and Hao Wang. Information-agnostic flow scheduling for commodity data centers. In NSDI. USENIX, 2015. [28] Theophilus Benson, Aditya Akella, and David A Maltz. Network traffic characteristics of data centers in the wild. In ACM IMC, pages 267–280. ACM, 2010. [29] N. Borisov, D.J. Brumley, H.J. Wang, J. Dunagan, P. Joshi, C. Guo, and ICE Nanjing. A generic application-level protocol analyzer and its language. In 14h Symposium on Network and Distributed System Security (NDSS), 2007. [30] Dhruba Borthakur. The hadoop distributed file system: Architecture and design. Hadoop Project Website, 11(2007):21, 2007. [31] David Brumley, James Newsome, Dawn Song, Hao Wang, and Somesh Jha. Towards IEEE Symposium Security and automatic generation of vulnerability-based signatures. Privacy, 2007. 129 [32] Jiaxin Cao, Rui Xia, Pengkun Yang, Chuanxiong Guo, Guohan Lu, Lihua Yuan, Yixin Zheng, Haitao Wu, Yongqiang Xiong, and Dave Maltz. Per-packet load-balanced, low- latency routing for clos-based data center networks. In CoNEXT’13. [33] Hyunseok Chang, Murali Kodialam, Ramana Rao Kompella, TV Lakshman, Myungjin Lee, and Sarit Mukherjee. Scheduling in mapreduce-like systems for fast completion time. In INFOCOM, 2011 Proceedings IEEE, pages 3074–3082. IEEE, 2011. [34] Fangfei Chen, Murali Kodialam, and TV Lakshman. Joint scheduling of processing and shuffle phases in mapreduce systems. In INFOCOM, 2012 Proceedings IEEE, pages 1143– 1151. IEEE, 2012. [35] Li Chen, Kai Chen, Wei Bai, and Mohammad Alizadeh. scheduling mix-flows in commodity datacenters with karuna. In ACM SIGCOMM, 2016. [36] Li Chen, Shuihai Hu, Kai Chen, Haitao Wu, and Danny H. K. Tsang. Towards minimal-delay deadline-driven data center tcp. In Hotnets’13. [37] Shuo Chen, Rui Wang, XiaoFeng Wang, and Kehuan Zhang. Side-channel leaks in web applications: A reality today, a challenge tomorrow. In SP, pages 191–206. IEEE, 2010. [38] Yanpei Chen, Archana Ganapathi, Rean Griffith, and Randy Katz. The case for evaluating mapreduce performance using workload suites. In IEEE MASCOTS’11. [39] Mosharaf Chowdhury, Srikanth Kandula, and Ion Stoica. Leveraging endpoint flexibility in data-intensive clusters. In ACM SIGCOMM, volume 43, pages 231–242. ACM, 2013. [40] Mosharaf Chowdhury and Ion Stoica. Coflow: a networking abstraction for cluster applica- tions. In ACM Hotnets, pages 31–36. ACM, 2012. [41] Mosharaf Chowdhury and Ion Stoica. Efficient coflow scheduling without prior knowledge. In ACM SIGCOMM, pages 393–406. ACM, 2015. [42] Mosharaf Chowdhury, Yuan Zhong, and Ion Stoica. Efficient coflow scheduling with varys. In ACM SIGCOMM, pages 443–454. ACM, 2014. [43] Paolo Costa, Austin Donnelly, Antony Rowstron, and Greg O’Shea. Camdoop: exploiting in- network aggregation for big data applications. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, pages 3–3. USENIX Association, 2012. [44] Weidong Cui, Jayanthkumar Kannan, and Helen J. Wang. Discoverer: automatic protocol re- verse engineering from network traces. In Proceedings of 16th USENIX Security Symposium, SS’07, pages 14:1–14:14, Berkeley, CA, USA, 2007. USENIX Association. [45] A.R. Curtis, J.C. Mogul, J. Tourrilhes, P. Yalagandula, P. Sharma, and S. Banerjee. Devoflow: scaling flow management for high-performance networks. In SIGCOMM’11. Jeffrey Dean and Luiz André Barroso. The tail at scale. 
Communications of the ACM, 2013. [46] 130 [47] [48] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI’04. Jeffrey Dean and Sanjay Ghemawat. Mapreduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008. [49] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: amazon’s highly available key-value store. In in ACM SIGOPS 2007. [50] Fahad R. Dogar, Thomas Karagiannis, Hitesh Ballani, and Ant Rowstron. Decentralized Task-aware Scheduling for Data Center Networks. In SIGCOMM’14. [51] Fahad R Dogar, Thomas Karagiannis, Hitesh Ballani, and Antony Rowstron. Decentralized task-aware scheduling for data center networks. In ACM SIGCOMM. [52] Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael J Franklin, Scott Shenker, and Ion Stoica. Shark: fast data analysis using coarse-grained distributed memory. In ACM SIGMOD 2012. [53] Symantec Enterprise. Internet security threat report 2014, 2015. [54] Nathan Farrington and Alexey Andreyev. Facebook data center network architecture. In IEEE Optical Interconnects Conf. Citeseer, 2013. [55] Alessandro Finamore, Marco Mellia, Michela Meo, and Dario Rossi. Kiss: Stochastic packet inspection classifier for udp traffic. IEEE/ACM Transactions on Networking, 18(5):1505– 1515, 2010. [56] A. Ford, C. Raiciu, M. Handley, and O. Bonaventure. TCP Extensions for Multipath Opera- tion with Multiple Addresses. RFC 6824. [57] A. Ford, C. Raiciu, M. Handley, and O. Bonaventure. TCP Extensions for Multipath Opera- tion with Multiple Addresses. RFC 6824bis-09. [58] S. Frankel and S. Krishnan. IP Security (IPsec) and Internet Key Exchange (IKE) Document Roadmap. RFC 6071. [59] Peter X Gao, Akshay Narayan, Gautam Kumar, Rachit Agarwal, Sylvia Ratnasamy, and Scott Shenker. phost: Distributed near-optimal datacenter transport over commodity network fabric. In Proceedings of the CoNEXT. 2015. [60] Ali Ghodsi, Vyas Sekar, Matei Zaharia, and Ion Stoica. Multi-resource fair queueing for packet processing. In ACM SIGCOMM, volume 42, pages 1–12. ACM, 2012. [61] Ali Ghodsi, Matei Zaharia, Benjamin Hindman, Andy Konwinski, Scott Shenker, and Ion Stoica. Dominant resource fairness: Fair allocation of multiple resource types. In NSDI, volume 11, pages 24–24. USENIX, 2011. 131 [62] Ali Ghodsi, Matei Zaharia, Scott Shenker, and Ion Stoica. Choosy: Max-min fair sharing for datacenter jobs with constraints. In ACM EuroSys. [63] Albert Greenberg, James R Hamilton, Navendu Jain, Srikanth Kandula, Changhoon Kim, Parantap Lahiri, David A Maltz, Parveen Patel, and Sudipta Sengupta. Vl2: a scalable and flexible data center network. In ACM SIGCOMM 2009. [64] William Gropp, S Huss-Lederman, A Lumsdaine, E Lusk, B Nitzberg, W Saphir, and M Snir. Mpi. The Complete Reference. The MPI-2 Extensions, 2, 1998. [65] Matthew P Grosvenor, Malte Schwarzkopf, Ionel Gog, Robert NM Watson, Andrew W Moore, Steven Hand, and Jon Crowcroft. Queues don’t matter when you can jump them! In NSDI. USENIX, 2015. [66] Chuanxiong Guo, Guohan Lu, Helen J Wang, Shuang Yang, Chao Kong, Peng Sun, Wenfei Wu, and Yongguang Zhang. Secondnet: a data center network virtualization architecture with bandwidth guarantees. In ACM CoNext 2010. [67] Keqiang He, Eric Rozner, Kanak Agarwal, Wes Felter, John Carter, and Aditya Akella. Presto: Edge-based load balancing for fast datacenter networks. 
ACM SIGCOMM Computer Communication Review, 2015. [68] Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D Joseph, Randy H Katz, Scott Shenker, and Ion Stoica. Mesos: A platform for fine-grained resource sharing in the data center. In NSDI, volume 11, pages 22–22. USENIX, 2011. [69] Chi-Yao Hong, Matthew Caesar, and P Godfrey. Finishing flows quickly with preemptive scheduling. In Proc. ACM SIGCOMM 2012. [70] C.Y. Hong, M. Caesar, and P. Godfrey. Finishing flows quickly with preemptive scheduling. In SIGCOMM’12. [71] Shuihai Hu, Wei Bai, Kai Chen, Chen Tian, Ying Zhang, and Haitao Wu. Providing bandwidth guarantees, work conservation and low latency simultaneously in the cloud. In Proceedings of infocom, 2016. [72] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: dis- tributed data-parallel programs from sequential building blocks. In ACM SIGOPS Operating Systems Review, volume 41, pages 59–72. ACM, 2007. [73] Michael Isard, Vijayan Prabhakaran, Jon Currey, Udi Wieder, Kunal Talwar, and Andrew Goldberg. Quincy: fair scheduling for distributed computing clusters. In ACM SIGOPS, pages 261–276. ACM, 2009. [74] Mathieu Jadin, Gautier Tihon, Olivier Pereira, and Olivier Bonaventure. Securing multipath tcp: Design & implementation. In IEEE INFOCOM 2017, 2017. [75] Virajith Jalaparti, Peter Bodik, Srikanth Kandula, Ishai Menache, Mikhail Rybalkin, and In ACM SIGCOMM Chenyu Yan. Speeding up distributed request-response workflows. 2013. 132 [76] Virajith Jalaparti, Peter Bodik, Ishai Menache, Sriram Rao, Konstantin Makarychev, and Matthew Caesar. Network-aware scheduling for data-parallel jobs: Plan when you can. In ACM SIGCOMM, pages 407–420. ACM, 2015. [77] Keon Jang, Justine Sherry, Hitesh Ballani, and Toby Moncaster. Silo: Predictable message latency in the cloud. In in Proc. ACM SIGDC, 2015. [78] Vimalkumar Jeyakumar, Mohammad Alizadeh, David Mazieres, Balaji Prabhakar, Changhoon Kim, and Albert Greenberg. EyeQ: Practical Network Performance Isolation at the Edge. In NSDI’13. [79] Vimalkumar Jeyakumar, Mohammad Alizadeh, David Mazieres, Balaji Prabhakar, Changhoon Kim, and Albert Greenberg. Eyeq: practical network performance isolation at the edge. In NSDI. USENIX, 2013. [80] Glenn Judd. Attaining the promise and avoiding the pitfalls of tcp in the datacenter. In USENIX NSDI 2015. [81] Dina Katabi, Mark Handley, and Charlie Rohrs. Congestion Control for High Bandwidth- Delay Product Networks. In SIGCOMM’02. [82] Ramin Khalili, Nicolas Gast, Miroslav Popovic, Utkarsh Upadhyay, and Jean-Yves Le In Boudec. MPTCP is not pareto-optimal: performance issues and a possible solution. ACM CoNEXT, 2012. [83] Gautam Kumar, Srikanth Kandula, Peter Bodik, and Ishai Menache. Virtualizing traffic shapers for practical resource allocation. In HotCloud’13. [84] Jeongkeun Lee, Yoshio Turner, Myungjin Lee, Lucian Popa, Sujata Banerjee, Joon-Myung Kang, and Puneet Sharma. Application-driven bandwidth guarantees in datacenters. In ACM SIGCOMM, pages 467–478. ACM, 2014. [85] Zhichun Li, Lanjia Wang, Yan Chen, and Zhi Fu. Network-based and attack-resilient length signature generation for zero-day polymorphic worms. IEEE Int. Conference Network Pro- rocols (ICNP), pages 164–173, 2007. [86] Zhichun Li, Gao Xia, Hongyu Gao, Yi Tang, Yan Chen, Bin Liu, Junchen Jiang, and Yuezhou Lv. NetShield: Massive semantics-based vulnerability signature matching for high-speed networks. In Proceedings of AM SIGCOMM, 2010. [87] Helger Lipmaa, Phillip Rogaway, and David Wagner. 
Comments to nist concerning aes modes of operations: Ctr-mode encryption. Technical report, NIST, 2000. [88] [89] lkm. Blind TCP/IP hijacking is still alive, 2007. Jiefei Ma, Franck Le, Alessandra Russo, and Jorge Lobo. Detecting Distributed Signature- based Intrusion: The Case of Multi-Path Routing Attacks. In IEEE INFOCOM, 2015. 133 [90] Justin Ma, Kirill Levchenko, Christian Kreibich, Stefan Savage, and Geoffrey M Voelker. In Proceedings of the 6th ACM SIGCOMM Unexpected means of protocol inference. conference on Internet measurement, pages 313–326. ACM, 2006. [91] Grzegorz Malewicz, Matthew H Austern, Aart JC Bik, James C Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: a system for large-scale graph processing. In ACM SIGMOD 2010, pages 135–146. [92] Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, and Theo Vassilakis. Dremel: interactive analysis of web-scale datasets. in Proc. of the VLDB 2010. [93] Radhika Mittal, Nandita Dukkipati, Emily Blem, Hassan Wassel, Monia Ghobadi, Amin Vahdat, Yaogong Wang, David Wetherall, David Zats, et al. Timely: Rtt-based congestion control for the datacenter. In Proc. ACM SIGDC 2015. [94] Ali Munir, Ghufran Baig, Syed M Irteza, Ihsan A Qazi, Alex X Liu, and Fahad R Dogar. Friends, not foes: synthesizing existing transport strategies for data center networks. In ACM SIGCOMM 2014. [95] Ali Munir, Ihsan Ayyub Qazi, and S Bin Qaisar. On achieving low latency in data centers. In ICC, pages 3721–3725. IEEE, 2013. [96] Ali Munir, Ihsan Ayyub Qazi, Zartash Afzal Uzmi, Aisha Mushtaq, Saad N Ismail, M Safdar Iqbal, and Basma Khan. Minimizing flow completion times in data centers. In in Proc. IEEE INFOCOM, 2013. [97] Ali Munir, Ihsan Ayyub Qazi, Zartash Afzal Uzmi, Aisha Mushtaq, Saad Naveed Ismail, Muhammad Safdar Iqbal, and Basma Khan. Minimizing Flow Completion Times in Data Centers. In INFOCOM’13. [98] Kanthi Nagaraj, Dinesh Bharadia, Hongzi Mao, Sandeep Chinchali, Mohammad Alizadeh, and Sachin Katti. Numfabric: Fast and flexible bandwidth allocation in datacenters. In ACM SIGCOMM 2016. [99] James Nechvatal, Elaine Barker, Lawrence Bassham, William Burr, and Morris Dworkin. Report on the development of advanced encryption standard (aes), 2000. [100] Rajesh Nishtala, Hans Fugal, Steven Grimm, Marc Kwiatkowski, Herman Lee, Harry C. Li, Ryan McElroy, Mike Paleczny, Daniel Peek, Paul Saab, David Stafford, Tony Tung, and Venkateshwaran Venkataramani. Scaling memcache at facebook. In NSDI’13. [101] Christoph Paasch, Sébastien Barré, et al. Multipath tcp in the linux kernel. Available from fwww. multipath-tcp. org.g, 2013. [102] R. Pang, V. Paxson, R. Sommer, and L. Peterson. binpac: A yacc for writing applica- tion protocol parsers. In Proceedings of the 6th ACM SIGCOMM conference on Internet measurement, pages 289–300. ACM, 2006. 134 [103] Vern Paxson. Bro: a system for detecting network intruders in real-time. Computer networks, 31(23):2435–2463, 1999. [104] Catherine Pearce and Sherali Zeadally. Ancillary Impacts of Multipath TCP on Current and Future Network Security. Internet Computing, 2015. [105] Qiuyu Peng, Anwar Walid, and Steven H. Low. Multipath TCP algorithms: theory and design. In ACM SIGMETRICS, pages 305–316, 2013. [106] Jonathan Perry, Amy Ousterhout, Hari Balakrishnan, Devavrat Shah, and Hans Fugal. Fast- pass: A centralized zero-queue datacenter network. In ACM SIGCOMM. 
[107] Lucian Popa, Gautam Kumar, Mosharaf Chowdhury, Arvind Krishnamurthy, Sylvia Rat- In ACM nasamy, and Ion Stoica. Faircloud: sharing the network in cloud computing. SIGCOMM 2012. [108] Lucian Popa, Praveen Yalagandula, Sujata Banerjee, Jeffrey C Mogul, Yoshio Turner, and Jose Renato Santos. Elasticswitch: practical work-conserving bandwidth guarantees for cloud computing. In ACM SIGCOMM 2013. [109] Matei Popovici and Costin Raiciu. Exploiting multipath congestion control for fun and profit. In Hot Topics in Networks, pages 141–147. ACM, 2016. [110] Zhiyun Qian and Z. Morley Mao. Off-Path TCP Sequence Number Inference Attack - How Firewall Middleboxes Reduce Security. In SP, 2012. [111] Zhiyun Qian, Z. Morley Mao, and Yinglian Xie. Collaborated off-path tcp sequence number inference attack — how to crack sequence number under a second. In CCS, 2012. [112] C. Raiciu, C. Paasch, S. Barre, Alan Ford, Michio Honda, F. Duchene, O. Bonaventure, and Mark Handley. How hard can it be? Designing and implementing a deployable Multipath TCP. In USENIX NSDI, 2012. [113] Xiaoqi Ren, Ganesh Ananthanarayanan, Adam Wierman, and Minlan Yu. Hopper: Decen- tralized speculation-aware cluster scheduling at scale. In ACM SIGCOMM, pages 379–392. ACM, 2015. [114] Henrique Rodrigues, Jose Renato Santos, Yoshio Turner, Paolo Soares, and Dorgival Guedes. In Gatekeeper: Supporting bandwidth guarantees for multi-tenant datacenter networks. WIOV, 2011. [115] Martin Roesch et al. Snort: Lightweight intrusion detection for networks. In Lisa, volume 99, pages 229–238, 1999. [116] M. Scharf and A. Ford. Multipath TCP (MPTCP) Application Interface Considerations. RFC 6897. [117] Nabil Schear, David R. Albrecht, and Nikita Borisov. High-speed matching of vulnerability signatures. In Intl. Symposium on Recent Advances in Intrusion Detection (RAID), 2008. 135 [118] M. Zubair Shafiq, Franck Le, Mudhakar Srivatsa, and Alex X. Liu. Cross-path inference attacks on multipath tcp. In HotNets-XII, 2013. [119] Y. Sheffer, R. Holz, and P. Saint-Andre. Summarizing Known Attacks on Transport Layer Security (TLS) and Datagram TLS (DTLS). RFC 7454, Internet Engineering Task Force (IETF), 2015. [120] Alan Shieh, Srikanth Kandula, Albert G Greenberg, Changhoon Kim, and Bikas Saha. Sharing the data center network. In NSDI, 2011. [121] Arjun Singh, Joon Ong, Amit Agarwal, Glen Anderson, Ashby Armistead, Roy Bannon, Seb Boving, Gaurav Desai, Bob Felderman, Paulie Germano, et al. Jupiter rising: A decade of clos topologies and centralized control in google’s datacenter network. In Proc. ACM SIGDC 2015. [122] Anirudh Sivaraman, Keith Winstein, Suvinay Subramanian, and Hari Balakrishnan. No silver bullet: extending sdn to the data plane. In in ACM HotNets 2013. [123] Jian Tan, Alicia Chin, Zane Zhenhua Hu, Yonggang Hu, Shicong Meng, Xiaoqiao Meng, and Li Zhang. Dynmr: Dynamic mapreduce with reduce task interleaving and maptask backfilling. In ACM EuroSys, page 2. ACM, 2014. [124] Alok Tongaonkar, Ram Keralapura, and Antonio Nucci. Santaclass: A self adaptive network traffic classification system. In IFIP Networking Conference, 2013, pages 1–9. IEEE, 2013. [125] Alok Tongaonkar, Ruben Torres, Marios Iliofotou, Ram Keralapura, and Antonio Nucci. Towards self adaptive network traffic classification. Computer Communications, 56:35–46, 2015. [126] Balajee Vamanan, Jahangir Hasan, and T.N. Vijaykumar. Deadline-aware datacenter tcp (d2tcp). In SIGCOMM’12. [127] Balajee Vamanan, Jahangir Hasan, and TN Vijaykumar. Deadline-aware datacenter tcp (d2tcp). 
[128] Vinod Kumar Vavilapalli, Arun C. Murthy, Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, Jason Lowe, Hitesh Shah, Siddharth Seth, et al. Apache Hadoop YARN: Yet another resource negotiator. In ACM SoCC, page 5. ACM, 2013.
[129] Helen J. Wang, Chuanxiong Guo, Daniel R. Simon, and Alf Zugenmaier. Shield: Vulnerability-driven network filters for preventing known vulnerability exploits. In Proceedings SIGCOMM, 2004.
[130] Yi Wang, Dong Lin, Changtai Li, Junping Zhang, Peng Liu, Chengchen Hu, and Gong Zhang. Application driven network: Providing on-demand services for applications. In ACM SIGCOMM 2016 Poster.
[131] Yipeng Wang, Xiaochun Yun, M. Zubair Shafiq, Liyan Wang, Alex X. Liu, Zhibin Zhang, Danfeng Yao, Yongzheng Zhang, and Li Guo. A semantics aware approach to automated reverse engineering unknown protocols. In 2012 20th IEEE International Conference on Network Protocols (ICNP), pages 1–10. IEEE, 2012.
[132] Yipeng Wang, Xiaochun Yun, and Yongzheng Zhang. Rethinking robust and accurate application protocol identification: A nonparametric approach. In 2015 IEEE 23rd International Conference on Network Protocols (ICNP), pages 134–144. IEEE, 2015.
[133] Yipeng Wang, Zhibin Zhang, Danfeng Daphne Yao, Buyun Qu, and Li Guo. Inferring protocol state machine from network traces: A probabilistic approach. In International Conference on Applied Cryptography and Network Security, pages 1–18. Springer, 2011.
[134] C. Wilson, H. Ballani, T. Karagiannis, and A. Rowstron. Better never than late: Meeting deadlines in datacenter networks. In SIGCOMM'11.
[135] Christo Wilson, Hitesh Ballani, Thomas Karagiannis, and Ant Rowtron. Better never than late: Meeting deadlines in datacenter networks. In ACM SIGCOMM, pages 50–61, 2011.
[136] Damon Wischik, Costin Raiciu, Adam Greenhalgh, and Mark Handley. Design, implementation and evaluation of congestion control for multipath TCP. In USENIX NSDI, 2011.
[137] Xiaochun Yun, Yipeng Wang, Yongzheng Zhang, and Yu Zhou. A semantics-aware approach to the automated network protocol identification. IEEE/ACM Transactions on Networking, 24(1):583–595, 2016.
[138] Matei Zaharia, Dhruba Borthakur, Joydeep Sen Sarma, Khaled Elmeleegy, Scott Shenker, and Ion Stoica. Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling. In ACM EuroSys, pages 265–278. ACM, 2010.
[139] Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy H. Katz, and Ion Stoica. Improving MapReduce performance in heterogeneous environments. In OSDI, volume 8, page 7, 2008.
[140] David Zats, Tathagata Das, Prashanth Mohan, Dhruba Borthakur, and Randy Katz. DeTail: Reducing the flow completion time tail in datacenter networks. In SIGCOMM'12.
[141] David Zats, Tathagata Das, Prashanth Mohan, Dhruba Borthakur, and Randy Katz. DeTail: Reducing the flow completion time tail in datacenter networks. In ACM SIGCOMM, volume 42, pages 139–150, 2012.
[142] Hong Zhang, Kai Chen, Wei Bai, Dongsu Han, Chen Tian, Hao Wang, Haibing Guan, and Ming Zhang. Guaranteeing deadlines for inter-datacenter transfers. In Proceedings of the Tenth European Conference on Computer Systems, page 20. ACM, 2015.
[143] Zhuo Zhang, Zhibin Zhang, Patrick P. C. Lee, Yunjie Liu, and Gaogang Xie. ProWord: An unsupervised approach to protocol feature word extraction. In IEEE INFOCOM, 2014.
[144] Yangming Zhao, Kai Chen, Wei Bai, Chen Tian, Yanhui Geng, Yiming Zhang, Dan Li, and Sheng Wang. Rapier: Integrating routing and scheduling for coflow-aware data center networks. In Proc. IEEE INFOCOM, 2015.