PERFORMANCE ANALYSIS AND PRIVACY PROTECTION OF NETWORK DATA

By

Faraz Ahmed

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science—Doctor of Philosophy

2018

ABSTRACT

PERFORMANCE ANALYSIS AND PRIVACY PROTECTION OF NETWORK DATA

By Faraz Ahmed

The goal of this thesis is to address network management research challenges faced by operational networks, with specific focus on cellular networks, content delivery networks, and online social networks. Next, I give an overview of my research on network management of these networks.

Cellular networks utilize existing service quality management systems for detecting performance degradation issues inside the network; however, under certain conditions degradation in End-to-End (E2E) performance may go undetected. These conditions may arise due to problems in the mobile device hardware, smartphone applications, and content providers. In this thesis, I present a system for detecting and localizing E2E performance degradation at cellular service providers across four administrative domains: cellular network, content providers, device manufacturers, and smartphone applications.

Cellular networks also need systems that can prioritize performance degradation issues according to the number of customers impacted. Cell tower outages are performance degradation issues that directly impact connectivity of cellular network users. In this thesis, we design and evaluate a cell tower outage monitoring system that analyzes and estimates device level impact during cell tower outages.

Content delivery networks (CDNs) maintain multiple transit routes from content distribution servers to eyeball ISP networks, which provide Internet connectivity to end users. Two major considerations for CDNs are transit prices and performance dynamics of delivering content to end users. The dynamic nature of transit pricing and performance makes it challenging to optimize the cost and performance tradeoff. There are thousands of eyeball ISPs which are reachable via different transit routes and different geographical locations. Each choice of transit route for a particular eyeball ISP and geographical location has distinct cost and performance characteristics, which makes the problem of developing a transit routing strategy challenging. In this thesis, I present a measurement approach to actively collect client-perceived network performance and then use these measurements for optimal transit route selection for CDNs.

Online Social Networks (OSNs) often refuse to publish their social network graphs due to privacy concerns. Differential privacy has been the widely accepted criterion for privacy preserving data publishing. In this thesis, I present a random matrix approach to OSN graph publishing, which achieves storage and computational efficiency by reducing the dimensions of adjacency matrices and achieves differential privacy by adding a small amount of noise.

Dedicated to all those who value selflessness, friendship, and sacrifice; patience, tolerance, kindness, and compassion; perseverance, prudence, and hope; and integrity, honesty, humility, gratitude, and contentment.

ACKNOWLEDGEMENTS

Working towards a Ph.D. has been a deeply enriching and rewarding experience. Looking back, many people have helped shape my journey. I would like to extend them my thanks.

• First and foremost, my advisor, Prof. Alex X. Liu.
My work would not have been possible without his constant guidance, his unwavering encouragement, his many insights, and his exceptional resourcefulness. And most importantly, his friendship. I have been very fortunate to have an advisor who has also been a close friend. For all of this, Alex, thank you.

• I would also like to thank the rest of my thesis committee, Profs. Pang Ning Tan, Richard Lawrence Wash, and Jiayu Zhou, for their encouragement and insightful comments during my comprehensive exams.

• I would also like to thank Dr. Zubair Shafiq. I learned a lot from him. My collaboration with him has been one of the most fruitful and fun engagements I have experienced.

• I would also like to thank Dr. Charles B. Owen. Working as his Teaching Assistant has been one of my most valuable teaching experiences. I learned a lot and had a lot of fun.

• I would also like to thank Drs. Jia Wang, Zihui Ge, Jeffrey Erman, and He Yan. I really enjoyed working with them during my summer internship at AT&T Labs - Research.

• Throughout my Ph.D., I was supported by various NSF research grants. Thanks NSF!

• I would also like to thank Michigan State University, and specifically the Department of Computer Science and Engineering, for providing me financial support to attend various conferences during my Ph.D.

• Many thanks to my colleagues in the Systems and Security Lab at Michigan State University. In particular, I would like to thank Ali Munir, Ann Wang, Kamran Ali, Xinyu Lei, and Salman Ali for numerous insightful and inspirational discussions on various projects.

• I would like to thank my mentors, colleagues, and friends who helped me navigate life as a Ph.D. student. In particular, I would like to thank Drs. Hassan Aqeel Khan, Muhammad Shahzad, Muhammad Zeeshan, and Mudassir Shabbir.

• I must say that I owe my great time at Michigan State University to all of my fabulous friends. It is simply not feasible to list all of them here. I would like to thank them all for their friendship and support.

• I am also very thankful to Drs. Muhammad Abulaish and Muddassar Farooq, who advised me before my Ph.D. and encouraged me to pursue a Ph.D.

• I am also very thankful to these great personalities from my high school in my hometown Bahawalpur, Pakistan: Mr. Saleem, Mr. Rauf, and Mrs. Khalida Sarfraz. Looking back at the trajectory of my professional pursuits, I always find them at its origin and as its cause.

• Finally, I do not know how I can thank my family enough. First and foremost my parents, from whom I realized that kindness and devotion are endless; my siblings, who have always supported me; my wife; and the rest of the family members who have been a source of happiness.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS

1 Introduction
  1.1 Contributions
    1.1.1 Performance Degradation Detection and Localization in Cellular Networks
    1.1.2 Assessing Impact of Cell Tower Outages in Cellular Networks
    1.1.3 Optimizing Transit Route Selection for CDNs
    1.1.4 Measurement and Analysis of Adult Traffic
    1.1.5 Privacy Preserving Social Network Graph Data Publishing
  1.2 Published Material

2 Performance Degradation Detection and Localization in Cellular Networks
  2.1 Introduction
    2.1.1 Background and Motivation
    2.1.2 Proposed Approach
    2.1.3 Technical Challenges and Solutions
  2.2 Related Work
  2.3 Monitoring E2E Performance Dynamics and Data Analysis
    2.3.1 Data Collection
    2.3.2 Data Analysis
  2.4 Modeling E2E Performance Dynamics
    2.4.1 Aggregate E2E Performance Matrix
    2.4.2 Coarse Grained E2E Modeling
    2.4.3 Fine Grained E2E Modeling
      2.4.3.1 Deviating E2E Instance Identification
      2.4.3.2 Deviating E2E Instance Grouping
      2.4.3.3 Recursive E2E Instance Grouping
  2.5 E2E Performance Degradation
    2.5.1 Detection
    2.5.2 Localization
  2.6 Evaluation
    2.6.1 Synthetic Anomalies
    2.6.2 Anomaly Detection in the Wild
    2.6.3 Real-World Anomalies
  2.7 Conclusions

3 Assessing Impact of Cell Tower Outages in Cellular Networks
  3.1 Introduction
    3.1.1 Background and Motivation
    3.1.2 Limitations of Prior Art
    3.1.3 Proposed Approach
    3.1.4 Technical Challenges and Solutions
    3.1.5 Key Novelty and Advantages
  3.2 Related Work
  3.3 Background and Data Collection
    3.3.1 Dataset
  3.4 Spatio-Temporal Characterization
  3.5 System Design
    3.5.1 Activity Profiling
    3.5.2 Location Profiling
    3.5.3 Quantifying Temporal Impact
    3.5.4 Quantifying Spatial Impact
  3.6 Evaluation
    3.6.1 Synthetic Network Outages
    3.6.2 Real Network Outages
  3.7 Conclusion

4 Optimizing Transit Route Selection for CDNs
  4.1 Introduction
  4.2 Background
    4.2.1 Architecture
    4.2.2 Internet Transit Dynamics
  4.3 Performance Measurements & Analysis
    4.3.1 Measurement Methodology
    4.3.2 Analysis and Discussions
  4.4 Problem Formulation
  4.5 Results
    4.5.1 Tradeoff analysis
    4.5.2 Transit selection analysis
  4.6 Pareto Front Generation
  4.7 Related Work
  4.8 Conclusions

5 Measurement and Analysis of Adult Traffic
  5.1 Introduction
    5.1.1 Background
    5.1.2 Limitation of Prior Art
    5.1.3 Main Findings
  5.2 Related Work
  5.3 Measurement & Analysis
    5.3.1 Aggregate Analysis
      5.3.1.1 Content Composition
      5.3.1.2 Traffic Composition
      5.3.1.3 Temporal Access Patterns
      5.3.1.4 Device/OS Usage
    5.3.2 Content Dynamics
      5.3.2.1 Content Size
      5.3.2.2 Content Popularity
      5.3.2.3 Impact of Content Injection and Aging on Popularity
    5.3.3 User Dynamics
      5.3.3.1 User Request Inter-arrival Time
      5.3.3.2 User Session Length
      5.3.3.3 User Addiction
  5.4 Implications
    5.4.1 CDN Cache Hit Ratios
    5.4.2 HTTP Response Codes
  5.5 Conclusion
6 Privacy Preserving Social Network Graph Data Publishing
  6.1 Introduction
    6.1.1 Background and Motivation
    6.1.2 Problem Statement
    6.1.3 Limitations of Prior Art
    6.1.4 Proposed Approach
    6.1.5 Technical Challenges
    6.1.6 Key Contributions
  6.2 Related Work
  6.3 Random Matrix Approach
    6.3.1 Theoretical Guarantee on Differential Privacy
    6.3.2 Proof of Theorem 1
    6.3.3 Theoretical Guarantee on Eigenvector Approximation
    6.3.4 Proof of Theorem 2
  6.4 Experimental Results
    6.4.1 Dataset
    6.4.2 Node Clustering
    6.4.3 Node Ranking
  6.5 Conclusions

7 Conclusion

BIBLIOGRAPHY

LIST OF TABLES

Table 2.1 RMSE values for different weighting functions and regression methods
Table 2.2 Single dimension anomalies
Table 2.3 Multi-dimension anomalies
Table 4.1 Data set summary statistics
Table 4.2 Summary of assignments for different pricing rates and POP locations
Table 6.1 Dataset description
Table 6.2 Memory utilization and running time for the proposed algorithm
Table 6.3 NMI for clustering using LNPP approach [141] for σ = 1
Table 6.4 n × MSE using the proposed approach
Table 6.5 n × MSE using baseline LNPP

LIST OF FIGURES

Figure 2.1 Normalized TCP Loss Ratio
Figure 2.2 Normalized Round Trip Time (RTT)
Figure 2.3 Cellular network architecture
Figure 2.4 CDF of Round Trip Time (RTT) in 4G LTE networks
Figure 2.5 Heatmap plots for combination of dimensions
Figure 2.6 Residual error distribution
Figure 2.7 Real and predicted performance curves
Figure 2.8 No. of outliers
Figure 2.9 Single model
Figure 2.10 Multiple models
Figure 2.11 Synthetic anomaly detection
Figure 2.12 Real-world anomaly 1: root cause outside the cellular network
Figure 2.13 Real-world anomaly 2: root cause inside the cellular network
Figure 3.1 Spatio-temporal characterization of M2M devices
Figure 3.2 Active device density
Figure 3.3 Overview of system architecture
Figure 3.4 Example devices
Figure 3.5 Estimation accuracy of start times and duration of synthetic outages
Figure 3.6 Estimation accuracy of user impact in synthetic outages
Figure 3.7 Average of absolute difference in timings
Figure 3.8 Device impact of real outages
Figure 3.9 Heatmap visualization of real outage for different p-value threshold
Figure 3.10 A special case of real world site outage
Figure 4.1 A CDN interconnects with multiple transit ISPs at IXPs
Figure 4.2 Transit performance measurements
Figure 4.3 Deployment summary
Figure 4.4 Spatial performance variations for different POPs
Figure 4.5 Performance difference of transit providers for different billing periods
Figure 4.6 Difference in transit performance for AS-POP pair: Telefonica-Madrid
Figure 4.7 Difference in transit performance for AS-POP pair: Comcast-San Jose
Figure 4.8 Cost-performance tradeoff curves for γ ranging from 0 to 5
Figure 4.9 Transit provider selection for different pricing scenarios
Figure 4.10 Transit provider selection for varying price ratio: TEL price / NTT price
Figure 4.11 Effect of γ values on transit provider selection (Paris)
Figure 4.12 Analysis of transit provider bandwidth usage
Figure 4.13 Pareto fronts for different hours
Figure 5.1 Content composition of five adult websites
Figure 5.2 Traffic composition of five adult websites
Figure 5.3 Hourly traffic volume timeseries of five adult websites
Figure 5.4 Device type composition
Figure 5.5 Content size distributions
Figure 5.6 Content popularity distributions
Figure 5.7 Fraction of total objects requested for adult websites at different ages
Figure 5.8 Content clustering dendrograms of two adult websites
Figure 5.9 Cluster medoids for V-2 adult website
Figure 5.10 Cluster medoids for P-2 adult website
Figure 5.11 User request inter-arrival time distributions
Figure 5.12 User session length distributions
Figure 5.13 Repeated access of objects
Figure 5.14 CDF of repeated content access by users
Figure 5.15 Hit ratio for image and video objects
Figure 5.16 Response codes
Figure 6.1 Degree distribution of three datasets
Figure 6.2 NMI values for Facebook
Figure 6.3 NMI values for Live Journal
Figure 6.4 NMI values for Pokec
Figure 6.5 Cluster distribution for Facebook
Figure 6.6 Cluster distribution for Live Journal
Figure 6.7 Cluster distribution for Pokec
Figure 6.8 Percentage of preserved ranks using random matrix approach

LIST OF ALGORITHMS

Algorithm 1 Estimating outage start time and duration
Algorithm 2 Generating Pareto fronts recursively
Algorithm 3 Â = Publish(A, m, σ²)

1 Introduction

Operational networks, such as Cellular Networks, Content Delivery Networks (CDNs), and Online Social Networks (OSNs), are rapidly changing in terms of network topology, networking infrastructure, and network traffic characteristics. Network operators frequently change peering agreements, routing policies, networking equipment, etc., to improve their network performance and end-user Quality of Experience (QoE). In addition, external entities like content providers, end-user device manufacturers, and firmware developers also contribute to the constantly changing Internet landscape. Upgrades in network infrastructure, new network protocols, new content types, etc., inadvertently create new network management challenges for operational networks.

In this thesis, I address network management needs by measuring network performance, and by modeling and analyzing network data to improve network performance. For cellular networks, I focus on modeling of network performance. I have used statistical and machine learning methods to develop systems that solve network management issues involving detection and localization of network performance degradation. I also present my work on assessing the customer impact of cell tower outages. For content distribution networks (CDNs), I focus on developing approaches to actively collect client-perceived network performance. I then use these measurements along with server side logs for modeling and analysis. I present approaches that solve research problems involving quantification and optimization of CDN performance. For online social networks, I present my work on developing privacy preserving approaches for graph data publishing.

1.1 Contributions

This thesis takes an in-depth look at the following research problems.

1.1.1 Performance Degradation Detection and Localization in Cellular Networks

Cellular network users can experience performance degradation caused by issues outside the cellular network: the issue may lie in the user device itself (such as the device OS or application software) or with the content provider (such as application servers or the datacenter network). When users experience degradation in performance, they call customer care representatives of the cellular network.
When cellular service providers receive such calls, they have to go through lengthy and labor intensive processes to first verify the issue and then solve it. Cellular networks utilize existing service quality management systems for detecting performance degradation issues inside the network; however, degradation in performance due to issues outside the network may go undetected. In this work, I build a comprehensive and holistic system that monitors E2E service performance across four dimensions: user locations, content providers, device types, and application types [23, 24, 17]. I model the expected service performance for each combination of user location, content provider, device type, and application type using training data, then compare ongoing E2E performance measurements with the model to detect performance degradation as it arises, and finally localize the service degradation using an association rule mining based approach. The system detected several performance degradation instances over a period of one week in an operational cellular network. In 80% of the detected degraded instances, content providers, device types, and application types were the only factors of performance degradation. The evaluation using both real network traces and synthetically generated performance degradation shows that our method is highly effective.

1.1.2 Assessing Impact of Cell Tower Outages in Cellular Networks

A major network management challenge faced by network operators is to identify service disruptions, such as cell tower outages, in a timely manner and to assess the impact of service disruptions on cellular network users. Typically, many service disruptions can occur concurrently in a cellular network. Network operators want to prioritize these outages according to their customer impact. Due to scalability constraints, only spatially and temporally sampled customer level measurements are collected, which are typically insufficient for cellular network operators to determine the customer impact. I design and evaluate a cell tower outage monitoring system that analyzes and estimates the device level impact during cell tower outages [18]. I propose to look at the network activity of Machine-to-Machine (M2M) devices that communicate using the same cellular network as human operated devices such as mobile phones. We first identify a candidate set of devices with highly predictable mobility and traffic patterns using data collected from a large-scale cellular provider in North America. We then use these qualified M2M devices to observe the impact of cell tower outages from the end users' perspective. We evaluate the system using cell tower outages in a two-month time window, and our operational experience reveals that: (i) network quality can be analyzed at a higher granularity by monitoring M2M device communication; (ii) customer impact during network outages varies both in space and time; and (iii) customer impact estimates can be used by cellular network providers to improve network coverage and network resiliency. To the best of our knowledge, this is the first work that employs M2M devices to assess the impact of service issues in operational cellular networks at a large scale.

1.1.3 Optimizing Transit Route Selection for CDNs

Two major considerations for CDNs are cost and performance of delivering content to end users. Pricing and performance of transit providers vary with respect to time and geographical location.
Due to the dynamics of varying performance and pricing on transit routes, CDNs need to implement a transit route selection strategy to optimize performance and cost tradeoffs. CDNs have multi-homed servers at Internet Exchange Points (IXPs), which provide explicit control over content routing via multiple transit routes. There are thousands of eyeball ISPs which are reachable via different transit routes and different geographical locations. Each choice of transit route for a particular eyeball ISP and geographical location has distinct cost and performance characteristics. In this work, I first implement a client-side performance measurement JavaScript to obtain simultaneous performance measurements across multiple transit routes, and then formalize the transit routing problem using a multi-attribute objective function to simultaneously optimize end-to-end performance and cost [22, 20]. The measurement methodology allows CDNs to simultaneously capture user perceived end-to-end performance via multiple transit routes. The optimization methodology allows CDNs to navigate the cost and performance tradeoff in transit routing. I evaluate the approach using real-world measurements from CDN servers located at 19 geographically distributed IXPs. The results show that CDNs can achieve significant cost and performance benefits using the measurement and optimization approach.

1.1.4 Measurement and Analysis of Adult Traffic

Adult (or pornographic) websites attract a large number of visitors and account for a substantial fraction of global Internet traffic. However, little is known about the makeup and characteristics of online adult traffic. In this work, we present the first large-scale measurement study of online adult traffic using HTTP logs collected from a major commercial content delivery network [21]. We analyze several characteristics of online adult traffic including content and traffic composition, device type composition, temporal dynamics, content popularity, content injection, and user engagement. Our analysis reveals several unique characteristics of online adult traffic. We also analyze the implications of our findings on adult content delivery. Our findings suggest several content delivery and cache performance optimizations for adult traffic, e.g., modifications to website design, content delivery, cache placement strategies, and cache storage configurations.

1.1.5 Privacy Preserving Social Network Graph Data Publishing

Online Social Networks (OSNs) often refuse to publish their social network graphs due to privacy concerns. Differential privacy has been the widely accepted criterion for privacy preserving data publishing. In this work, we propose a random matrix approach to OSN graph publishing, which achieves storage and computational efficiency by reducing the dimensions of adjacency matrices and achieves differential privacy by adding a small amount of noise [19]. The key idea is to first project each row of an adjacency matrix into a low dimensional space using random projection, then perturb the projected matrix with random noise, and finally publish the perturbed and projected matrix. We prove that random projection plus random perturbation preserve differential privacy, and also that the random noise required to achieve differential privacy is small. We validate the proposed approach and evaluate the utility of the published data for three different applications, namely node clustering, node ranking, and node classification, using publicly available OSN graphs of Facebook, Live Journal, and Pokec.
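To make the project-then-perturb idea concrete, the following is a minimal sketch in Python with NumPy. The function name publish and the parameters m (projected dimension) and sigma (noise scale) are illustrative stand-ins; Chapter 6 gives the exact algorithm and the constants required for the differential privacy guarantee.

```python
import numpy as np

def publish(A, m, sigma, seed=0):
    """Sketch of the random matrix approach: project each row of the
    n x n adjacency matrix A into an m-dimensional space (m << n),
    then perturb the projected matrix with Gaussian noise."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    # Random projection; N(0, 1/m) entries approximately preserve
    # row geometry while shrinking the matrix from n x n to n x m.
    P = rng.normal(0.0, 1.0 / np.sqrt(m), size=(n, m))
    projected = A @ P
    noise = rng.normal(0.0, sigma, size=(n, m))
    return projected + noise  # the perturbed, projected matrix is published
```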
1.2 Published Material

The chapters of this dissertation are based in part on the following publications.

• Faraz Ahmed, Zubair Shafiq, Amir Khakpour, and Alex X. Liu. "Optimizing Internet Transit Routing for Content Delivery Networks", IEEE/ACM Transactions on Networking (ToN), 2018.

• Faraz Ahmed, Jeffrey Erman, Zihui Ge, Alex X. Liu, Jia Wang, and He Yan. "Detecting and Localizing End-to-End Performance Degradation for Cellular Data Services Based on TCP Loss Ratio and Round Trip Time", IEEE/ACM Transactions on Networking (ToN), 2017.

• Faraz Ahmed, Jeffrey Erman, Zihui Ge, Alex X. Liu, Jia Wang, and He Yan. "Monitoring Quality-of-Experience for Operational Cellular Networks Using Machine-to-Machine Traffic", IEEE INFOCOM, 2017.

• Faraz Ahmed, Zubair Shafiq, Amir Khakpour, and Alex X. Liu. "Optimizing Internet Transit Routing for Content Delivery Networks", IEEE ICNP, 2016.

• Faraz Ahmed, Jeffrey Erman, Zihui Ge, Alex X. Liu, Jia Wang, and He Yan. "Detecting and Localizing End-to-End Performance Degradation for Cellular Data Services", IEEE INFOCOM, 2016.

• Faraz Ahmed, Alex X. Liu, and Rong Jin. "Social Graph Publishing with Privacy Guarantees", IEEE ICDCS, 2016.

• Faraz Ahmed, M. Zubair Shafiq, and Alex X. Liu. "The Internet is for Porn: Measurement and Analysis of Online Adult Traffic", IEEE ICDCS, 2016.

• Faraz Ahmed, Jeffrey Erman, Zihui Ge, Alex X. Liu, Jia Wang, and He Yan. "Detecting and Localizing End-to-End Performance Degradation for Cellular Data Services", IEEE SIGMETRICS [Extended Abstract], 2015.

2 Performance Degradation Detection and Localization in Cellular Networks

2.1 Introduction

2.1.1 Background and Motivation

Internet access through cellular data services has become an essential part of people's everyday life, including emailing, web browsing, video streaming, and online shopping. Nowadays cellular network customers using data services expect seamless service with high application performance, i.e., high end-to-end (E2E) performance. To best serve their customers, it is critical for cellular service providers to provide high E2E performance and maintain their competitive edge. In the context of cellular data services, E2E performance means the performance that customers experience for a specific location, content provider, device type, and application type. A user location means the Radio Network Controller (RNC) that the user's device connects to. A content provider means the Internet domain that is serving the user. A device type means a specific brand and model of the user's device, such as Apple iPhone 5. An application type means the categorical type of the application that the user is running on their cellular device, such as emailing or Web browsing. Prior work has shown that E2E performance in cellular networks varies across the above mentioned attributes [27, 72]. Along with the cellular network's performance, the hardware and software performance of different device types and applications also largely defines the E2E performance over the cellular network. A group of users may experience bad service due to problems in their device firmware or applications, and/or due to problems at the network side.
For example, degradation in performance can be due to issues such as the limited processing power of smartphone devices in combination with the device operating systems and applications running on them [73]. Thus, the inherently complex nature of cellular data services creates challenges for cellular data service providers to manage Quality of Experience (QoE) from the subscribers' point of view.

In this work, we present a system for detecting and localizing E2E performance degradation (such as slow webpage loading and unsmooth video playing) at cellular service providers across four administrative domains: cellular service providers, content providers, device manufacturers, and application developers. For detection, we want to detect E2E performance degradation before cellular network operators receive complaint calls. For localization, we want to find the problematic domain that is causing performance degradation, such as user location, content provider, device type, and/or application type. For example, if all users connecting to an RNC are experiencing performance degradation, regardless of content providers, device types, and application types, then probably the RNC is causing the performance degradation. For another example, if all users of iPhone 5 are experiencing performance degradation for their email application, regardless of user locations and content providers, then probably iPhone 5 is having issues with email applications.

Detection and localization of E2E performance degradation is crucial for all administrative domains to jointly troubleshoot root causes. When users experience E2E performance degradation, they have no clue whom they should "blame". For example, when a user in New York experiences performance degradation for Yahoo email on his iPhone 5, he does not know whether it is the cellular provider, Yahoo, the iPhone 5, or his email client app that is causing the problem. When users experience such E2E performance degradations, they blankly ascribe the fault to their cellular service providers and issue complaints to the cellular network customer service centers, which may result in both reputation damage and financial losses for the cellular service providers. When cellular service providers receive such calls, they have to go through a lengthy, manual, labor intensive process to localize the issue, and the actual issue may not be a cellular service provider problem; it may be due to the user device itself (such as the device OS or application software) or due to the content provider (such as application servers or the datacenter network). For example, the E2E performance degradation may be caused by a content provider upgrading its service, a device type having updated its OS with incompatibility issues, or an app having been patched with buggy code. The localization findings allow cellular service providers to quickly mitigate service problems, effectively communicate with customers complaining about service problems, and engage content providers, device manufacturers, or application developers and operators to jointly troubleshoot for root causes. They sometimes can work around the E2E performance issues on their own. For example, if the issue is caused by a neighboring network, then they may adjust their intra/inter domain routing to avoid that network.

There are several key challenges in detecting and localizing E2E performance degradation at cellular service providers.
First, there is a wide range of elements (such as mobile devices, cell towers, radio resource controllers, routers, switches, fibers, media gateways, firewalls, multicast servers, name servers, and content servers) at various layers (such as the physical layer, link layer, transport layer, and application layer) that may cause E2E performance degradation. Furthermore, the heterogeneity of mobile devices increases the number of factors contributing to E2E performance degradation. For example, there are over 24k distinct Android device types, and over 2 million Android applications access content from millions of content providers through cellular networks [110]. Similarly, devices running other operating systems, such as Windows and iOS, come in numerous models, and the applications developed for the different operating systems running on these devices further increase the dimensionality.

Second, it is practically infeasible to gain complete visibility of E2E performance issues because content providers, devices, and applications belong to different administrative domains. For example, issues on mobile devices and content servers are not visible to cellular service providers.

Third, the expected E2E performance varies significantly depending on which application (such as web browsing, video streaming, or gaming) is requesting information from which content provider (such as youtube.com or google.com), running on which mobile device (such as an Apple iPhone 5 or Motorola V3 Razr), from which geographic location (such as New Jersey suburban areas or NYC downtown areas), at what time of day, and during which day of the week [71, 72, 125].

2.1.2 Proposed Approach

In this chapter, we propose a holistic approach to detecting and localizing E2E performance degradation at cellular service providers across the four administrative domains of user locations, content providers, device types, and application types. Our approach consists of three steps: modeling, detection, and localization.

First, we build models using training data to capture the normal performance of each individual E2E instance, which means the flows corresponding to a particular user location, content provider, device type, and application type. Each E2E instance has 24 × 7 models, where each model corresponds to a specific hour of a day and a specific day of a week. We build different models for each hour in a day and each day in a week because studies have shown that people have distinct daily and weekly diurnal patterns in their use of cellular services [125]. Our training data, which is collected from a major cellular service provider in North America, contains anonymized TCP flow level information such as standard coordinated universal time (UTC), the serving RNC, the type of the cellular device, the type of the application, and the Internet domain that is being accessed.

Second, we use our models to detect performance degradation for each E2E instance on an hourly basis. For each E2E instance, if the actual performance in the testing phase is worse than the expected performance obtained through our models, we label it as degrading.

Third, after marking each E2E instance non-degrading or degrading, we use association rule mining techniques to localize the source of performance degradation. For example, the rule (iPhone 5, Email) → degrading shows that at a particular time instant, for all locations and content providers, the cellular users of iPhone 5 are experiencing significant degradation in performance when they use the email application.
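A schematic rendering of the detection and localization steps is sketched below, with hypothetical inputs: instances maps each (location, provider, device, app) tuple to its measured hourly performance, expected holds the model predictions, and threshold and min_support are illustrative tuning knobs. Sections 2.4 and 2.5 define the actual models and tests.

```python
from collections import Counter
from itertools import combinations

def detect_and_localize(instances, expected, threshold, min_support):
    """One hourly pass: label degrading E2E instances, then surface
    attribute combinations (association-rule style) that cover many
    of them, e.g. ('iPhone 5', 'Email')."""
    degrading = [key for key, measured in instances.items()
                 if measured > expected[key] + threshold]
    counts = Counter()
    for key in degrading:
        # Count every non-trivial subset of the instance's attributes;
        # attribute values are assumed distinct across dimensions.
        for r in range(1, len(key)):
            for subset in combinations(key, r):
                counts[subset] += 1
    return [(rule, n) for rule, n in counts.most_common() if n >= min_support]
```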
2.1.3 Technical Challenges and Solutions

To implement our approach, we face three key technical challenges. The first challenge is to trade off model accuracy against model complexity and to deal with data sparsity. We represent E2E performance in a four dimensional matrix, called the E2E matrix and denoted E_I[1..L, 1..P, 1..D, 1..A], where the four dimensions are L user locations, P content providers, D device types, and A application types. An element E_I[l, p, d, a] in this matrix represents the E2E instance corresponding to user location l, content provider p, device type d, and application a. On one extreme end, we build only one model for all E2E instances in the E2E matrix, which gives us the least accuracy and the least complexity; on the other extreme end, we build one model for each individual E2E instance, which in theory gives us the most accuracy and the most complexity. In practice, for many E2E instances, we do not have enough flow data to build meaningful models. To deal with data sparsity, we first aggregate all flows belonging to the same E2E instance. We use the aggregated flows to obtain a two-dimensional aggregate E2E performance matrix denoted E_A. The E_A matrix is our design matrix used as input to the regression algorithm. To deal with the tradeoff between model accuracy and complexity, we use the aggregate E2E matrix E_A to build a baseline model for all E2E instances, identify the E2E instance groups that have significantly different performance, and then model these groups separately, leaving the rest of the E2E instances still being modeled by the baseline model. Within each group, the performance of some E2E instances may be relatively different from that of others in the group; thus, we apply this grouping strategy recursively among identified groups.

The second challenge is to localize E2E performance degradation issues. After we label each E2E instance as either non-degrading or degrading, we cannot directly present such a labeled E2E matrix to network operators because the number of degrading E2E instances may be too large. To address this challenge, we use association rule mining to summarize the E2E matrix using simple rules such as (iPhone 5, Email) → degrading.

The third challenge is to quantitatively evaluate the effectiveness of our approach, because we are short of ground truth data. For detection, we do not have pre-labeled testing data because labeling such a prohibitive amount of data is practically infeasible. For localization, the customer tickets that we have access to mostly document hard failures, such as an RNC being down. When such hard failures happen, typically there is no flow for the affected E2E instances; therefore, our localization scheme will not find such failures. To address this challenge, we first performed manual inspection of some E2E performance degradation cases, and second, we injected synthetic performance degradation cases into the test data to serve as the ground truth.
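The per-instance aggregation step behind the first challenge can be made concrete with a short sketch. The dimension sizes and the flows iterable below are hypothetical; the actual construction of the design matrix E_A is given in Section 2.4.1.

```python
import numpy as np

# Illustrative dimension sizes: L locations, P providers, D devices, A apps.
L, P, D, A = 78, 20, 10, 10

def aggregate_instances(flows):
    """Average the per-flow loss ratio of every E2E instance E_I[l, p, d, a].
    `flows` is a hypothetical iterable of (l, p, d, a, loss_ratio) tuples."""
    loss_sum = np.zeros((L, P, D, A))
    flow_cnt = np.zeros((L, P, D, A))
    for l, p, d, a, loss in flows:
        loss_sum[l, p, d, a] += loss
        flow_cnt[l, p, d, a] += 1
    with np.errstate(invalid="ignore"):
        # NaN marks instances with no flows -- the sparsity the text describes.
        return loss_sum / flow_cnt
```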
2.2 Related Work

To the best of our knowledge, the detection and localization of E2E performance degradation at cellular service providers has not been previously studied. Efforts related to ours can be categorized into detecting and troubleshooting network or service issues, and cellular service E2E performance measurements.

Detecting and Troubleshooting Network or Service Issues: SCORE [89], Shrink [80], and [90] focus on diagnosing network failures through the inference of underlying lower-layer failures using a Shared Risk Link Group (SRLG) model. Pinpoint focuses on diagnosing the root causes of service failures at the application server end, i.e., the content provider end [41]. Sherlock [32] and NetMedic [81] both focus on diagnosing the root causes of service failures within the elements (such as DNS servers, firewall configurations, and application servers) of an enterprise network, and both use dependency graphs. Sherlock and NetMedic conduct diagnosis at the granularity of machines and processes, respectively. In [98], Mahimkar et al. focused on characterizing and troubleshooting performance issues in one of the largest IPTV networks in North America. In [99], Mahimkar et al. designed and implemented a tool for detecting the impact of network upgrades on performance. Our work is different from these efforts in terms of both the problem being solved and the solution being used. From the problem perspective, our problem is monitoring, detecting, and localizing E2E performance degradation for cellular data service users along the four dimensions of user locations, content providers, device types, and application types at cellular service providers, without full visibility of the E2E service path. From the solution perspective, our approach consists of passive cellular traffic monitoring, data centric modeling, model based performance degradation detection, and association rule based root cause localization, whereas these efforts use variations of dependency graphs.

Cellular Service E2E Performance Measurements: There is work on measuring the impact on cellular service E2E performance of factors such as resource usage by mobile applications [114], content providers [50], and cellular technologies [72]. In [114], Qian et al. designed and implemented a tool for discovering inefficient resource usage by smartphone applications using cross-layer interaction among various layers, including the radio resource channel state, transport layer, application layer, and user interaction layer. In [50], Finamore et al. performed a measurement study that compares YouTube traffic generated by mobile devices with that generated by PCs, and correlated user behavior with system performance. In [72], Huang et al. studied the interactions among applications, network transport protocols, and radio layers for the 4G LTE technology and their impact on performance using both active and passive measurements, and showed that E2E performance (such as TCP loss ratio) in cellular services varies across different content providers, device types, applications, connection types, and wireless carriers. Compared with these efforts, we go one step further to localize the causes of E2E performance degradation.
The closest work to ours is 13 1 0.8 0.6 0.4 0.2 e t a R s s o L P C T T T R d e z i l a m r o N 1 0.8 0.6 0.4 0.2 0 e t a R s s o L P C T 1 0.8 0.6 0.4 0.2 0 Sa Su M Tu W Th F Time e t a R s s o L P C T 1 0.8 0.6 0.4 0.2 0 Sa Su M Tu W Th F Time e t a R s s o L P C T 1 0.8 0.6 0.4 0.2 0 Sa Su M Tu W Th F Time Sa Su M Tu W Th F Time (a) User Location (b) Content Provider (c) Device Type (d) Application Type Figure 2.1 Normalized TCP Loss Ratio 1 0.8 0.6 0.4 0.2 T T R d e z i l a m r o N Sa Su M Tu W Th F Sa Su M Tu W Th F Time Time T T R d e z i l a m r o N 1 0.8 0.6 0.4 0.2 0 1 0.8 0.6 0.4 0.2 T T R d e z i l a m r o N Sa Su M Tu W Th F Sa Su M Tu W Th F Time Time (a) User Location (b) Content Provider (c) Device Type (d) Application Type Figure 2.2 Normalized Round Trip Time (RTT) presented in [27]. The authors look at aggregated Round Trip Time (RTT) measurements in the UMTS network to identify poorly performing attributes including RNC, device, ap- plication and content provider. In comparison, we look at two metrics RTT and TCP loss ratio. We present a comprehensive approach that includes measurement and modeling to detect and localize performance degradation on hourly basis. In addition to real world case studies, we evaluate our approach by detecting and localizing anomalies in the wild over a period of one week. 2.3 Monitoring E2E Performance Dynamics and Data Analysis 2.3.1 Data Collection In this study, we utilize anonymized flow level data collected from the core network of a major cellular service provider in the United States. Figure 2.3 illustrate the architectural overview 14 UE UE UE UE UE UE Gn Figure 2.3 Cellular network architecture of the core of the mobile network, which consists of two main types of nodes: the Serving GPRS Support Node (SGSN) and the Gateway GPRS Support Node (GGSN). The GGSN is the root node in the hierarchy of the cellular data network. GGSN is responsible for sending and receiving Internet traffic to and from the cellular network. SGSN is an intermediate node that connects the lower level nodes to the GGSN through the Gn interface. Typically, a single SGSN is connected to multiple Radio Network Controllers (RNCs) and each RNC serves a geographical region through cell towers. The data was collected at the Gn interface which connects the SGSNs to the GGSNs. The data collection system logs TCP flow level information for each user equipment. Therefore, our system takes as input the incoming data feed and outputs the detection and localization results. As our goal in this work is to detect and localize the E2E performance degradation in the cellular service, we collect TCP flow level information for each TCP connection which has been aggregated by user equipment (UE) or handheld device type, the particular RNC, SGSN, and GGSN. Each aggregated record in addition contains the timestamp of the (1hr) bin, the ap- plication type (e.g. web browsing, streaming video, etc) and content provider (e.g. www.something.com). In order to deal with data sparsity, we limit our time granularity to 1hr time bins. This limits our system to detect anomalies on hourly basis. For each TCP 15 flow, we collect information including standard coordinated universal time (UTC), ID of the serving RNC that describes the user access point, the device type, the application type, and the content provider being accessed. In this work, we focus on two most important E2E performance metrics in a cellular service: TCP loss ratio and Round Trip Time (RTT). 
For each flow, we calculate its TCP loss ratio using the following formula:

TCP loss ratio = (observed # bytes in the flow) / (actual # bytes in the flow) − 1

Here the observed number of bytes in a flow means the number of bytes transmitted in the network for the flow, including those retransmitted due to packet loss. The actual number of bytes in a flow means the total number of bytes in the flow, excluding retransmissions. We detect retransmitted packets by tracking packet sequence numbers. In this formula, the observed number of bytes is equal to the actual number of bytes in the flow plus the number of retransmitted bytes. For each flow, we calculate its RTT using two RTT measurements. The first measurement corresponds to the cellular network side of the Gn interface. The second measurement corresponds to the Internet side of the Gn interface. The E2E RTT of a flow is equal to the sum of the cellular network side RTT and the Internet side RTT. Note that all user/device identifiers (such as IMSI and IMEI) are completely anonymized to protect user privacy.
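A minimal sketch of both the per-flow metric and its aggregate form (used in Section 2.3.2), assuming each flow record exposes the two byte counters described above; the helper names are illustrative:

```python
def flow_loss_ratio(observed_bytes, actual_bytes):
    """Per-flow TCP loss ratio; observed_bytes includes retransmissions,
    actual_bytes excludes them, so the ratio is the retransmitted share."""
    return observed_bytes / actual_bytes - 1.0

def aggregate_loss_ratio(flows):
    """Aggregate hourly loss ratio over downlink flows, where each flow is
    a hypothetical (observed_bytes, actual_bytes) pair. Summing before
    dividing weights each flow by its size."""
    observed = sum(o for o, _ in flows)
    actual = sum(a for _, a in flows)
    return observed / actual - 1.0
```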
2.3.2 Data Analysis

To understand the characteristics of TCP loss ratio and RTT across the four dimensions (namely user location, content provider, device type, and application type), we examine one week of data collected from one northeast region of the United States. We now analyze the dynamics in TCP loss ratio and RTT for each of the four dimensions. The insights that we gain in this analysis will be useful in our modeling of E2E performance.

User Location: Figure 2.1(a) shows the aggregate hourly TCP loss ratios for all user locations. We use the RNC ID as an identifier for user location. Each RNC connects hundreds of cell towers to the core network, covering a large geographical region. Figure 2.1(a) plots the TCP loss ratios of 78 different RNCs covering multiple states in the United States. Here each grey line represents the aggregate hourly TCP loss ratio of a single RNC, and the black line represents the average. The aggregate hourly TCP loss ratio of a user location is calculated from the sum of the observed number of bytes transmitted for all downlink flows divided by the sum of the actual number of bytes in all downlink flows:

(Σ_{f ∈ downlink flows} observed # bytes in f) / (Σ_{f ∈ downlink flows} actual # bytes in f) − 1

From this figure, we first observe that the aggregate TCP loss ratios of all user locations follow a similar diurnal pattern. Second, we observe that the aggregate hourly TCP loss ratios for all user locations demonstrate moderate deviation from their average. Figure 2.2(a) shows the aggregate hourly RTT for all user locations. From this figure, we observe that the RTTs of all user locations have moderate deviations from their average. This is because the aggregate end-to-end RTT is dominated by the wireless portion of the cellular network [27]. Therefore, there is little deviation in aggregate RTTs across locations.

Content Provider: Figure 2.1(b) shows the aggregate hourly TCP loss ratios for the top 20 popular content providers, where the grey lines represent the individual ratios and the black line represents the average. The aggregate hourly TCP loss ratio of a content provider is calculated in a similar fashion to that of a user location, except that the aggregation is per content provider. From this figure, we first observe that the aggregate TCP loss ratios of these content providers follow a similar diurnal pattern to those of user locations. Second, we observe that the aggregate hourly TCP loss ratios of these top 20 content providers demonstrate significant deviation from their average, as compared to those of user locations. Third, we observe that a few content providers constantly perform significantly worse than others. Figure 2.2(b) shows the aggregate hourly RTT values for the top 20 popular content providers. From this figure, we observe that some content providers demonstrate different RTT values over time as compared to their average. Also, similar to TCP loss ratio, the RTT values of some content providers are significantly worse than others. The variations of TCP loss ratio and RTT across different content providers show that E2E performance can be affected by the content provider. This could be because of several reasons, such as content provider popularity, content type (e.g., video, image, text, etc.), and different content delivery strategies which may not be optimized for the mobile platform.

Device Type: Figure 2.1(c) shows the aggregate hourly TCP loss ratios for the top 10 popular handheld cellular device types, where the grey lines represent the individual ratios and the black line represents the average. The aggregate hourly TCP loss ratio of a device type is calculated in a similar fashion to that of a user location, except that the aggregation is per device type. From this figure, we first observe that the aggregate TCP loss ratios of these device types follow a similar diurnal pattern to those of user locations. Second, we observe that the aggregate hourly TCP loss ratios of these device types demonstrate moderate deviation from their average, slightly higher than those of user locations. Third, we observe that a few device types constantly perform moderately worse than others. Figure 2.2(c) shows the aggregate hourly RTT values for the top 10 popular handheld cellular device types. From this figure, we observe that some device types demonstrate different RTT values over time as compared to their average. The variations can be associated with different hardware capabilities, such as different screen resolutions, processing power, and other differences in mobile device hardware.

Application Type: Figure 2.1(d) shows the aggregate hourly TCP loss ratios for the top 10 popular application types, where the grey lines represent the individual ratios and the black line represents the average. The aggregate hourly TCP loss ratio of an application type is calculated in a similar fashion to that of a user location, except that the aggregation is per application type. From this figure, we first observe that the aggregate TCP loss ratios of these application types follow a similar diurnal pattern to those of user locations. Second, we observe that the aggregate hourly TCP loss ratios of these top 10 application types demonstrate significant deviation from their average, as compared to those of user locations. Third, we observe that one application type, email, constantly performs significantly worse than others.

Figure 2.4 CDF of Round Trip Time (RTT) in 4G LTE networks: (a) User Location, (b) Device Type, (c) Operating System

Figure 2.2(d) shows the aggregate hourly RTT values for the top 10 popular application types.
From this figure, we observe that some applications demonstrate different RTT values over time as compared to their average. Also, similar to TCP loss ratio, the RTT values of some applications are significantly worse than others. Variations in application performance can be attributed to the resource usage and communication behavior of individual applications [114].

These performance differences also manifest in 4G LTE networks. Our data is collected on a 3G network; to verify the applicability of these findings in 4G LTE networks, we utilize the publicly available Google mobile networks measurement dataset [106]. The dataset covers 295 locations, 75 operating system versions, and 33 device types. The data is collected from 144 carriers and 9 network technologies all over the world. It contains 6.6 million RTT measurements. Due to the limited information available in the data, we look at performance differences across locations, device types, and operating system versions. For simplicity, we only show the top 10 locations, devices, and operating systems. Figure 2.4 shows the cumulative distribution function of the RTT measurements. Each line in the figure corresponds to a single location, device type, or operating system.

[Figure 2.4: CDF of Round Trip Time (RTT) in 4G LTE networks. Panels: (a) User Location, (b) Device Type, (c) Operating System.]

Figure 2.4(a) shows that performance differences exist across the top 10 locations. In comparison, we observe large performance differences across device types and operating system versions in Figures 2.4(b) and (c), respectively. These observations show that, irrespective of the cellular network technology, device hardware, content providers, and smartphone applications affect E2E performance.

We have analyzed the temporal dynamics of TCP loss rates and RTT by aggregating performance along each individual dimension (user locations, content providers, device types, or application types). It is also useful to understand E2E performance differences that manifest due to the interplay of these dimensions. To understand the relationships between different combinations of dimensions, we obtain average TCP loss rates for all combinations of two dimensions. For each combination of dimensions, we obtain an aggregate performance matrix, where each row represents one attribute of a dimension (e.g., a device in Device Types) and each column represents one attribute of another dimension (e.g., an application in Application Types). Each element of the matrix contains the average E2E performance over a period of one week. The average is obtained by selecting all TCP flows identified by the two attributes. We provide a visualization of these matrices in Figure 2.5, which shows the heat maps of different combinations of dimensions. The values of the matrix are normalized and the color bar shows the range of values in the heat map. Figure 2.5(a) shows the heat map of 50 locations and 25 devices. The y-axis represents locations and the x-axis represents devices. Each pixel in the figure represents the average TCP loss rate of a particular device type at a location. From this figure we observe that the majority of devices have similar average performance across all locations. For example, devices 1 and 5 have relatively poor performance as compared to device 4; however, all three devices have consistent performance characteristics across all locations. We observe similar performance characteristics for applications and content providers in Figures 2.5(b) and (c), respectively. For example, applications 7 and 8 have worse performance across most locations.
Similarly, content providers 9 and 16 have worse performance across all locations. Figure 2.5(d) shows the heat map for combinations of devices and applications. Here, we observe weaker trends; e.g., application 2 has worse performance across most devices, and devices 5 and 23 are weakly consistent across all applications. We observe no trends in Figure 2.5(e). Figure 2.5(f) shows that most combinations are invalid, represented as the darkest color with a value of zero on the color bar. This is because most applications are specific to some content provider, except browser applications.

[Figure 2.5: Heatmap plots for combinations of dimensions. Panels: (a) Locations - Devices, (b) Locations - Applications, (c) Locations - Content Providers, (d) Devices - Applications, (e) Devices - Content Providers, (f) Applications - Content Providers.]

In summary, these observations show that locations have little impact on E2E performance, and that variations in performance occur across content providers, device types, and applications.

2.4 Modeling E2E Performance Dynamics

Our approach is to first build baseline E2E performance models and then detect and localize E2E performance anomalies based on these models. To build the models, we first compute a two-dimensional aggregate E2E performance matrix that captures the average performance along the dimensions of user locations, content providers, device types, and application types over each hour in a week. We choose a week as the duration because people have weekly diurnal patterns in their use of cellular services [125]. This aggregate E2E performance matrix is used as the input predictor variable of a regression algorithm, which uses an iteratively re-weighted least squares approach, to obtain a single baseline model for all users. Based on this single model, we filter out extreme outlier data points, which are mostly errors and noise introduced in our data collection process. A single model cannot accurately describe the performance of all E2E instances because some E2E instances have different performance than others. An E2E instance refers to the flows of a specific location, content provider, device type, and application type. To accurately model different groups of E2E instances (where within each group all instances have relatively similar performance, but across groups instances have relatively different performance), we use an association rule mining approach to partition users into such groups. Furthermore, within each such group of E2E instances, there are subgroups where within each subgroup all E2E instances have relatively similar performance, but across subgroups E2E instances have relatively different performance. Thus, we recursively apply the association rule mining approach within groups and subgroups. We then build fine-grained models for each group/subgroup and detect and localize E2E performance anomalies based on these models.
Weighting Function | Ordinary Least Square Method | Robust Regression Method
Talwar   | 20.06 | 1.23
Bisquare | 20.06 | 1.40
Cauchy   | 20.06 | 1.64
Fair     | 20.06 | 2.44
Huber    | 20.06 | 1.99
Logistic | 20.06 | 1.44
Welsch   | 20.06 | 2.04

Table 2.1 RMSE values for different weighting functions and regression methods

2.4.1 Aggregate E2E Performance Matrix

Given TCP flow data for a certain time period of W weeks at a certain region as the training data, let L denote the total number of user locations, P the total number of content providers, D the total number of device types, and A the total number of application types. For each user location, we first calculate the average performance (packet loss rate or RTT) across all content providers, all device types, and all application types, for each hour in the W weeks of 24 ∗ 7 ∗ W hours; then, for each location and for each hour in a week of 24 ∗ 7 hours, we calculate the median of the W values. Thus, for each hour in a week of 24 ∗ 7 hours, we obtain a vector of L median values. Similarly, for each content provider, we first calculate the average performance across all user locations, all device types, and all application types, for each hour in the W weeks of 24 ∗ 7 ∗ W hours; then, for each content provider and for each hour in a week of 24 ∗ 7 hours, we calculate the median of the W values; thus, for each hour in a week of 24 ∗ 7 hours, we obtain a vector of P median values. We apply the same calculation to device types and application types. In the end, we obtain a two-dimensional aggregate E2E performance matrix EA = [1..24 ∗ 7, {L, P, D, A}], which has 24 ∗ 7 rows and four columns with indices denoted L, P, D, and A. In this matrix, the elements EA[i, L], EA[i, P], EA[i, D], and EA[i, A] are four vectors of L, P, D, and A median values, respectively, as calculated above. This matrix EA is the input to the robust regression algorithm described below.
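As an illustration of this construction, the sketch below collapses W weeks of hourly per-attribute averages into one median week for a single dimension; doing this for all four dimensions yields the column vectors of EA. The array shapes, names, and example sizes are my own (the sizes echo the counts reported later in Section 2.6.2), not the original implementation:

```python
import numpy as np

HOURS = 24 * 7  # hours in a week

def weekly_median_profile(hourly_perf: np.ndarray) -> np.ndarray:
    """Collapse W weeks of hourly averages into one median week.

    hourly_perf has shape (W, HOURS, K): W weeks, 24*7 hours per week,
    and K attributes of one dimension (e.g., K = L user locations).
    Returns shape (HOURS, K): the median of the W values for each
    hour-of-week and attribute."""
    return np.median(hourly_perf, axis=0)

# Hypothetical usage with random stand-in data: one median profile per
# dimension; together these hour-indexed vectors form the matrix EA.
W, L, P, D, A = 6, 78, 36, 51, 13
rng = np.random.default_rng(0)
ea_locations = weekly_median_profile(rng.gamma(2.0, 1.0, (W, HOURS, L)))
ea_providers = weekly_median_profile(rng.gamma(2.0, 1.0, (W, HOURS, P)))
print(ea_locations.shape, ea_providers.shape)  # (168, 78) (168, 36)
```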
2.4.2 Coarse Grained E2E Modeling

Taking the aggregate E2E performance matrix as input, we build a single baseline model for all E2E instances, based on which we can remove extreme outlier data points. Note that these extreme outlier data points are not the E2E anomalies that we are looking for, because such data points are mostly errors and noise introduced in our data collection process. We use robust regression for this baseline modeling because it minimizes the impact of extreme outlier data points on the produced model.

Robust regression uses the standard regression model y = EA β + ε, where y is the response vector of size 24 ∗ 7, β is the coefficient vector of size 4, and ε is the residual vector of size 24 ∗ 7. It computes a robust estimate of β such that the impact of extreme outlier data points on the produced model is minimized. To compute the coefficient β, we use the well-known iteratively re-weighted least squares (IRLS) algorithm [69]. This algorithm first obtains the residual errors based on the standard least squares method. Second, it uses a weighting function over the residuals to mark outliers. Third, it ignores outliers and re-estimates the regression coefficients. It repeats this process until the difference between the coefficient estimates obtained in two successive iterations falls below a minimum threshold. Mathematically, the robust estimate of β after n + 1 iterations is:

\[ \beta^{n+1} = \arg\min_{\beta} \sum_{i=1}^{p} w_i^n \left| y_i - EA_i \, \beta \right|^2 \tag{2.1} \]

Here w_i^n is the weight assigned to the i-th observation at the n-th iteration, and EA_i denotes the i-th row of EA. The weight assignment is done through a weight function w(r_i), where r_i is the scaled residual calculated using the method proposed in [46]. We choose weighting functions that minimize the overall error by assigning smaller weights to outlying points. We used our training datasets to evaluate the performance of several well-known weighting functions [69]. We calculated the Root Mean Squared Error (RMSE) of the regression models developed using seven different weighting functions. Table 2.1 compares the RMSE values obtained through the Ordinary Least Square (OLS) method and the IRLS method. It is clear that the Talwar weighting function has the lowest RMSE. This is because the Talwar weighting function assigns a weight of 1 or 0 to each data instance. Equation 2.2 describes the weighting criterion, which is based on the scaled residual values r_i:

\[ w(r_i) = \begin{cases} 1 & \text{if } |r_i| < 1 \\ 0 & \text{otherwise} \end{cases} \tag{2.2} \]

Using Talwar in IRLS, a data instance with an anomalous residual value gets a weight of 0, which means that all data instances with zero weights do not affect the regression coefficients in the next iteration. Therefore, the baseline developed for the initial training set is not affected by these outliers.

After we obtain the robust regression model, we calculate the residual error for all data points, plot their distribution, and then remove the data points whose residual error is outside the 99.99th percentile of the residual error distribution. Figure 2.6 plots the cumulative distribution function (CDF) of the residual errors obtained from the original data, from which we observe that the noisy instances have very high percentage loss errors, reaching up to 500% with extremely low frequency. Figure 2.6 also plots the CDF of the residual errors obtained from the filtered data after we remove the outlier data points. From the figure we observe that the maximum residual error for the filtered data is less than 150, whereas the original data has several instances with very large residual error values.

[Figure 2.6: Residual error distribution (CDF of residual errors for the original and filtered data).]
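A compact sketch of IRLS with the Talwar weighting of Equations 2.1 and 2.2 follows. The MAD-based residual scaling here is a simple stand-in for the scaling method of [46], so this is illustrative rather than a reproduction of the actual implementation:

```python
import numpy as np

def irls_talwar(X: np.ndarray, y: np.ndarray, iters: int = 50,
                tol: float = 1e-6) -> np.ndarray:
    """IRLS with Talwar weights (Eq. 2.2): w(r) = 1 if the scaled
    residual |r| < 1, else 0, re-estimated after every fit."""
    n, k = X.shape
    w = np.ones(n)
    beta = np.zeros(k)
    for _ in range(iters):
        # Weighted least-squares step of Eq. 2.1, via sqrt-weighting.
        sw = np.sqrt(w)
        new_beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
        if np.linalg.norm(new_beta - beta) < tol:
            return new_beta
        beta = new_beta
        r = y - X @ beta
        # MAD-based scale: a stand-in for the scaling method of [46].
        scale = 1.4826 * np.median(np.abs(r - np.median(r))) + 1e-12
        w = (np.abs(r / scale) < 1.0).astype(float)  # Talwar: 1 or 0
    return beta
```

A data instance whose scaled residual exceeds 1 receives weight 0 and simply drops out of the next least-squares fit, which is why the Talwar baseline is insensitive to the extreme outliers filtered above.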
2.4.3 Fine Grained E2E Modeling

We have defined the aggregate E2E performance matrix; we now define the E2E matrix. An E2E matrix, denoted EI = [1..L, 1..P, 1..D, 1..A], is a four-dimensional matrix where the four dimensions are L user locations, P content providers, D device types, and A application types. An element EI[l, p, d, a] in this matrix represents the E2E instances (or flows) corresponding to user location l, content provider p, device type d, and application type a. Describing the performance of all E2E instances in the E2E matrix using only one model is not accurate enough because the performance of some E2E instances is significantly different from that of other E2E instances. For example, Figures 2.7(a) and (b) show the observed and predicted performance curves for two different cells of the individual E2E matrix EI. We obtain the predicted performance curve using the single model from the coarse grained E2E performance modeling above. From Figure 2.7(a), we observe that for E2E matrix cell 1 the predicted performance curve is fairly consistent with the real one, but from Figure 2.7(b), we observe that for E2E matrix cell 2 the predicted performance curve significantly deviates from the real one.

[Figure 2.7: Real and predicted performance curves (TCP loss rate over the days of the week). Panels: (a) Cell 1, (b) Cell 2. Figure 2.8: Fraction of outliers per subgroup across top-level groups (baseline model, 1st-iteration, and 2nd-iteration outliers). Figure 2.9: CDF of residual errors with a single model. Figure 2.10: CDF of residual errors with multiple models.]

Our approach to addressing this issue is to partition E2E instances into groups such that each group has distinct performance. The more groups we partition, the more accurate the models we can obtain, but at the same time the number of models and the complexity increase. At one extreme, we have only one model for all E2E instances, which gives the least accuracy and the least complexity; at the other extreme, we have one model for each individual E2E matrix cell, which gives the most accuracy and the most complexity. Our strategy for trading off model accuracy against model complexity is to identify the E2E instance groups that have significantly different performance and model these groups separately, leaving the remaining E2E instances modeled by a single E2E model. Within each group, the performance of some E2E instances may be relatively different from that of others; thus, we apply this grouping strategy recursively within the identified groups. In this work, we identify E2E instances that deviate from the baseline model; we then group such deviating E2E instances using association rule mining; and within each group, we apply this strategy recursively. Next, we present the details of these three steps: deviating E2E instance identification, deviating E2E instance grouping, and recursive E2E instance grouping.

2.4.3.1 Deviating E2E Instance Identification

To identify the E2E instances that have deviating performance, we compare the real performance of each E2E instance with its predicted performance based on the baseline model for each hour of the 24 ∗ 7 hours in a week. To take the standard performance deviation across W weeks into consideration, we compute a two-dimensional matrix, denoted ĒA, that captures this deviation. Here ĒA differs from EA only in that each value in ĒA is the standard deviation of the W values, whereas each value in EA is the median of the W values. In ĒA, the elements ĒA[i, L], ĒA[i, P], ĒA[i, D], and ĒA[i, A] are four vectors of L, P, D, and A standard deviation values, respectively. We feed ĒA to the same robust regression algorithm and obtain the standard deviation σ. For each E2E instance and each hour of the 24 ∗ 7 hours in a week, if the real performance is too far outside the range [predicted performance - σ, predicted performance + σ], we label this E2E instance as deviating for this hour. In this work, we use [predicted performance - 2σ, predicted performance + 2σ] as the guideline for quantifying "too far". For each E2E instance, if this E2E instance has deviating performance for a large percentage of the 24 ∗ 7 hours of a week, we label this E2E instance as deviating. In this work, we use 50% as the guideline for quantifying "large percentage".
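The two guidelines above translate directly into a labeling routine; a minimal sketch (the array names are mine):

```python
import numpy as np

def is_deviating(real: np.ndarray, predicted: np.ndarray, sigma: np.ndarray,
                 k: float = 2.0, hour_fraction: float = 0.5) -> bool:
    """real, predicted, sigma: length 24*7 arrays (one value per
    hour-of-week) for a single E2E instance.  An hour deviates when the
    real value falls outside [predicted - k*sigma, predicted + k*sigma];
    the instance is deviating when >= hour_fraction of hours deviate."""
    outside = np.abs(real - predicted) > k * sigma
    return float(np.mean(outside)) >= hour_fraction
```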
2.4.3.2 Deviating E2E Instance Grouping

Now each instance of the E2E matrix EI has been labeled "not deviating" or "deviating". Visually, we mark non-deviating instances as white and deviating instances as black. The next step is to group the deviating E2E instances together based on their performance. In this work, we model the E2E matrix EI as a transactional database and use association rule mining to perform the grouping. An instance EI[l, p, d, a] with color c is modeled as a transaction with five items ⟨l, p, d, a, c⟩. Given this transaction database as input, we use the class-based Apriori association rule mining algorithm to produce rules with only the color as the consequence. In association rule mining, a rule has the form antecedent → consequence, where the antecedent and consequence are disjoint sets of items. The meaning of the rule is that most transactions that contain the antecedent also contain the consequence. The quality of a rule is measured by two metrics: support and confidence. The support of a rule is the percentage of transactions that contain all the items in both the antecedent and the consequence of the rule among all transactions in the database. The confidence of a rule is the percentage of transactions that contain all the items in both the antecedent and the consequence of the rule among the transactions that contain all the items in the antecedent. We are only interested in rules with the color black as the consequence and therefore ignore all rules with the color white. Each rule represents a group. For example, the rule NewYork, Google, iPhone → black represents the group of all E2E instances whose user location is New York, content provider is Google, and device type is iPhone. This rule means that most E2E instances corresponding to user location New York, content provider Google, and device type iPhone, regardless of application type, have deviating performance from the baseline model. Here New York is just a metaphor for a specific RNC. For simplicity, we use NewYork, Google, iPhone, ∗ to denote this group, where the wildcard ∗ denotes all application types. The size of this group is A, the number of all application types. As another example, the size of the group NewYork, Google, ∗, ∗ is D ∗ A. The size of each group is exactly the support of the corresponding rule. Note that some of the E2E instances in a group may be white. The percentage of black E2E instances in a group is exactly the confidence of the corresponding rule. To trade off model accuracy against model complexity, we want to find groups with large support and high confidence. In this work, we set the support threshold to 0.1% and the confidence threshold to 80%. We pay attention only to the rules whose support and confidence are above these thresholds and ignore the other rules. For simplicity, in the rest of this chapter, we use the terms "rule" and "group" interchangeably.

After we filter out the rules whose support or confidence is below the corresponding threshold, we still have too many groups to model each individually. We do not choose to reduce the number of rules by simply increasing the support or confidence thresholds, because a rule with larger support and/or confidence does not necessarily have a larger impact on the baseline model. Instead, we choose to select the groups that have the most impact on the baseline model.
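Before quantifying group impact, the grouping step itself can be sketched. The exhaustive enumeration below computes support and confidence for every antecedent that fixes one, two, or three of the four dimensions; a real implementation would use Apriori pruning, and all names here are illustrative:

```python
from collections import Counter
from itertools import combinations

def mine_black_rules(transactions, min_support=0.001, min_confidence=0.8):
    """transactions: tuples (l, p, d, a, color) with color in
    {'black', 'white'}.  Returns rules (fixed-dimension antecedent,
    support, confidence) whose consequence is 'black' and whose support
    and confidence clear the thresholds."""
    n = len(transactions)
    dims = range(4)  # 0: location, 1: provider, 2: device, 3: application
    rules = []
    for r in (1, 2, 3):  # fixing all 4 dims would yield single instances
        for fixed in combinations(dims, r):
            match, black = Counter(), Counter()
            for t in transactions:
                key = tuple(t[i] for i in fixed)
                match[key] += 1
                if t[4] == "black":
                    black[key] += 1
            for key, b in black.items():
                support, confidence = b / n, b / match[key]
                if support >= min_support and confidence >= min_confidence:
                    rules.append((dict(zip(fixed, key)), support, confidence))
    return rules
```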
Next, we present an exclusion method and an inclusion method to quantify the impact of a group on the baseline model. In the exclusion method, first, for the set of all E2E instances, denoted EA, we use the simple regression algorithm to build the baseline model. Second, for each group of E2E instances R, we use the simple regression algorithm to build a model for EA - R. Third, we calculate the absolute difference of the RMSEs of the two models for EA - R and EA. The result quantifies the impact of group R. Note that here we use the simple regression algorithm, instead of the robust regression algorithm, because robust regression ignores outliers, which have a large effect on the RMSE.

In the inclusion method, let Σ denote the union of all groups. First, for EA - Σ, we use the simple regression algorithm to build the baseline model. Second, for each group of E2E instances R, we use the simple regression algorithm to build a model for (EA - Σ) ∪ R. Third, we calculate the absolute difference of the RMSEs of the two models for EA - Σ and (EA - Σ) ∪ R. The result quantifies the impact of group R.

After we quantify the impact of each group on the baseline model, we can rank all groups based on their quantified impact. Note that groups may overlap. For example, the groups NewYork, Google, ∗, ∗ and ∗, Google, iPhone, ∗ overlap. For any two overlapping groups R1 and R2, if R1 is ranked higher than R2, then all E2E instances in R1 ∩ R2 are modeled using R1's model. Given the n ranked groups in non-ascending order of their impact, denoted R1, R2, ..., Rn, for each 1 ≤ i ≤ n, we build a separate model for the group Ri - ∪_{1≤j≤i-1} Rj. Note that if there exists 1 ≤ j ≤ i - 1 such that Ri ⊂ Rj, then Ri - ∪_{1≤j≤i-1} Rj = ∅.
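A minimal sketch of the two impact scores, using ordinary least squares as the text prescribes. Boolean masks select the data points belonging to a group; `fit_ols` and `rmse` are small helpers defined here, not the original code:

```python
import numpy as np

def fit_ols(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def rmse(X, y, beta):
    return float(np.sqrt(np.mean((y - X @ beta) ** 2)))

def exclusion_impact(X, y, group):
    """|RMSE of the model on EA - R minus RMSE of the model on EA|."""
    base = rmse(X, y, fit_ols(X, y))
    keep = ~group
    return abs(rmse(X[keep], y[keep], fit_ols(X[keep], y[keep])) - base)

def inclusion_impact(X, y, union_of_groups, group):
    """|RMSE of the model on (EA - Sigma) U R minus RMSE on EA - Sigma|."""
    rest = ~union_of_groups
    base = rmse(X[rest], y[rest], fit_ols(X[rest], y[rest]))
    with_r = rest | group
    return abs(rmse(X[with_r], y[with_r], fit_ols(X[with_r], y[with_r])) - base)
```

Ranking the groups by either score and modeling Ri minus the union of higher-ranked groups then follows directly.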
2.4.3.3 Recursive E2E Instance Grouping

Just as having one model for all E2E instances in EI is not accurate enough, for some groups, having one separate model is also not accurate enough. Therefore, we apply our fine-grained E2E modeling algorithm recursively on each group identified above. We now empirically show that through recursive E2E instance grouping we achieve high modeling accuracy. We use training data collected over a duration of six weeks (i.e., W = 6). We calculate the aggregate E2E performance matrix using the six weeks of data and build a single baseline model for all E2E instances. We then identify a first set of deviating E2E instance groups. For each of these E2E instance groups, we calculate the total number of E2E instances that deviate from the single baseline model. In Figure 2.8, the black bars show the fraction of deviating E2E instances belonging to a particular group, and the grey bars show the fraction of deviating E2E instances when a separate model is used for each of the E2E instance groups. For each of the E2E instance groups, we develop a separate set of fine-grained models through recursion. The white bars show the fraction of deviating E2E instances in the original E2E instance group after separate sets of fine-grained models are used for each E2E instance group after recursion. We observe that after at most three levels of recursion, we do not get any further E2E instance grouping. Figure 2.9 shows the CDF of residual errors obtained when a single baseline model is used. Figure 2.10 shows the CDF of residual errors obtained when separate baseline models are used. We observe that recursive E2E instance grouping and fine-grained modeling greatly reduce residual errors.

2.5 E2E Performance Degradation

2.5.1 Detection

So far we have discussed how to build E2E performance models based on training data. We now present solutions to detect and localize degradations in an on-going data feed on an hourly basis. To determine whether a particular E2E instance in the on-going data feed has degrading performance in the latest hour, we compare the performance of that E2E instance in the latest hour with its predicted performance based on the fine-grained model for that E2E instance for the same hour. For a given E2E instance at a particular hour, if the value of its performance metric exceeds an upper bound, we label this E2E instance as "degrading". We use the predicted performance + 2σ as the upper bound for detecting performance degradation.

2.5.2 Localization

For each new hour in the on-going data feed, we first label each E2E instance as degrading or normal and then apply the association rule mining algorithm to the E2E instances labeled as degrading. Unlike in the training phase, we do not apply the association rule mining recursively. The output rules show the localization of performance degradations. For example, the following rule shows that for a particular hour, for all content providers, all device types, and all applications, the cellular users in the New York area are experiencing significantly degrading performance for their cellular services:

NewYork → degrading

This rule is useful for cellular network operators to investigate potential issues in their cellular services in the New York area. Again, here New York is just a metaphor for a specific RNC. As another example, the following rule shows that for a particular hour, for all locations and all content providers, the cellular users of iPhone are experiencing significantly degrading performance for their email application:

iPhone, Email → degrading

This rule is useful for cellular network operators to contact iPhone manufacturers to investigate potential issues in their email application.
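The hourly detection check itself is a one-line comparison; a sketch:

```python
def is_degrading(observed: float, predicted: float, sigma: float,
                 k: float = 2.0) -> bool:
    """Flag an E2E instance for the latest hour when the observed metric
    (loss ratio or RTT) exceeds predicted + k*sigma."""
    return observed > predicted + k * sigma
```

Localization then reruns the rule mining sketched in Section 2.4.3.2 over the instances flagged by this check, this time without recursion.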
2.6 Evaluation

In this section, we evaluate our approach using the operational data collected from a large US-based cellular service provider. The main challenge in our evaluation is the lack of ground truth anomalies. Due to the prohibitively large size of the data, it is practically infeasible to manually label all anomalous events. To address this challenge, we introduce synthetic anomalies into the collected operational data to effectively evaluate the accuracy of our approach. Specifically, we first apply our approach on the operational data collected during six consecutive weeks to learn the fine-grained E2E performance models and then use these models to detect and localize synthetic anomalies. Besides synthetic anomalies, we also present a few real-world anomalies detected and localized using the learned fine-grained E2E performance models. To identify real-world anomalies, we use the past six weeks of data to build our training model and detect anomalies in the seventh week. Operational cellular networks constantly update their networks by adding new cell towers; similarly, new devices and applications are launched on a daily basis. We update our E2E instance matrix EI on a weekly basis by adding new elements along the four dimensions of the matrix and reflect these updates in the aggregate E2E performance matrix EA.

2.6.1 Synthetic Anomalies

To evaluate our system we introduce synthetic anomalies in the data. We created three sets of scenarios: the first set has 19 scenarios, in each of which we introduce a one-dimensional anomaly; the second set involves 26 scenarios with two dimensions; and the third set involves 19 scenarios with three dimensions. In each scenario we introduced a one-hour anomaly that is associated with a group of instances in the huge E2E performance matrix. For example, a particular scenario introduces performance degradations at a particular hour for all users accessing their email (application type) from a Motorola V3 Razr (device type). Specifically, we first collect another week's operational data (different from the six weeks we used to learn the fine-grained E2E performance models) and calculate the mean and standard deviation of the TCP loss ratio for each E2E instance using the 168 hours of that week. Then, for each E2E instance in each particular scenario, we create a one-hour synthetic anomaly using Anom = µ + 2σ + |N(0, 2)|.

[Figure 2.11: Synthetic anomaly detection. Panels: (a) 1 dimension, (b) 2 dimensions, (c) 3 dimensions.]

For each scenario, we first used the fine-grained E2E performance models learned from the six-week data to detect anomalies for each instance of the E2E matrix, and then used association rule mining to localize the root cause of all detected anomalies. As we use association rule mining for localization, we can tune the support and confidence threshold values to control the number of antecedents in the output rule. We choose support threshold values for each type of scenario separately. This is because, on one hand, the threshold may be too high for three-dimensional scenarios, which should be identified but are never selected as output rules because they do not satisfy the high support threshold. On the other hand, if the threshold value is too low, we may see many undesirable rules. For example, for a minimum support threshold of 0, we obtain rules with the number of items in the antecedent and consequence equal to the number of items in a transaction, which effectively represents a single E2E instance. Therefore, we use three sets of support threshold values, one for each type of scenario. We use the frequency of occurrence of itemsets in the transaction dataset to select the support thresholds. We choose the frequency of the least frequent 1-itemset in the data as the minimum support threshold for 1-dimensional scenarios. Similarly, we choose the frequencies of the least frequent 2- and 3-itemsets as the minimum support thresholds for 2- and 3-dimensional scenarios, respectively. We use confidence as our evaluation metric for the rules obtained through association rule mining and select the top ranked rules according to their confidence values.
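For concreteness, the per-instance injection rule Anom = µ + 2σ + |N(0, 2)| can be sketched as follows; I read the second parameter of N as the scale of the normal distribution, which the text does not make explicit:

```python
import numpy as np

rng = np.random.default_rng(7)

def inject_anomaly(series_168h: np.ndarray, hour: int) -> np.ndarray:
    """Overwrite one hour of a 168-hour TCP loss ratio series with a
    synthetic anomaly: mean + 2*std + |N(0, 2)|."""
    mu, sigma = series_168h.mean(), series_168h.std()
    out = series_168h.copy()
    out[hour] = mu + 2.0 * sigma + abs(rng.normal(0.0, 2.0))
    return out
```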
As we can see from Figures 2.11(a), 2.11(b), and 2.11(c), all 64 synthetic anomalies that we injected into the collected operational data were successfully detected and localized. We obtain all scenarios as association rules with 90% confidence for one-dimensional scenarios. Figure 2.11(a) shows the confidence of the top 50 rules obtained for one-dimensional scenarios using a minimum confidence threshold of 10%. The red bars represent the synthetic scenarios introduced in the data. The blue bars represent false positives, which are identified due to noisy instances in the data. From this figure we observe that by selecting a minimum confidence threshold of 40% we do not obtain any false positives. Figures 2.11(b) and 2.11(c) show the confidence of the top 50 rules obtained for two-dimensional and three-dimensional scenarios, respectively, using a minimum confidence threshold of 10%. We select the confidence threshold using the optimal operating point of the Receiver Operating Characteristic (ROC) curve. For all three scenario types we obtain ROC operating points separately and choose the confidence threshold that gives the minimum false positive rate and the maximum true positive rate. We achieve zero false positives for minimum confidence thresholds of 70% and 80% for two- and three-dimensional scenarios, respectively.

2.6.2 Anomaly Detection in the Wild

In this section, we characterize the performance anomalies detected on an operational network over a period of one week in August 2014. Through this characterization analysis, we show how frequent E2E performance anomalies are and how often multiple dimensions are the root cause of performance degradation. We implemented and deployed our system on an operational network and detected performance anomalies for packet loss ratio and RTT separately. Overall, we monitored 78 different regions, 51 device types, 13 application types, and 36 content providers.

It is interesting to see how often a single dimension is responsible for performance degradation. We use the output rules from the localization step to count the numbers of single-dimensional and multi-dimensional anomalies. For each rule, we count the total number of E2E instances and the number of E2E instances marked as degrading that belong to that rule. Table 2.2 gives the number of one-dimensional anomalies detected on an hourly basis. For each dimension, the number of anomalies is the number of E2E instances which were marked as degrading in the detection and localization process discussed in Section 2.5. We also report the average confidence, that is, the fraction of all E2E instances which were marked as degrading. We can see that there are no performance anomalies in the user location dimension, which means that a single user location cannot be blamed for the anomalies occurring during the one-week time period. We observe that one-dimensional anomalies are rare for content providers and device types. This is mainly because content providers use content distribution networks to serve content to remotely located users. It is very unlikely that the performance of a content provider will degrade across all user locations; a problematic content server can only affect the performance of a subset of user locations. For applications, single-dimension anomalies are relatively frequent. This observation highlights the performance impact of application patches and bug fixes, which are released periodically and frequently; bugs in these updates can cause performance issues at all user locations, across all device types and content providers.
Dim | Loss: Anom | Loss: Conf | RTT: Anom | RTT: Conf
Loc |  0 | 0    |  0 | 0
Dev |  5 | 0.80 | 11 | 0.82
App | 98 | 0.68 | 82 | 0.68
CP  | 27 | 0.67 |  0 | 0

Table 2.2 Single dimension anomalies

Dim   | Loss: Anom | Loss: Conf | RTT: Anom | RTT: Conf
L,D   | 134 | 0.94 |   57 | 0.80
L,A   |   0 | 0    |   13 | 0.85
L,P   | 215 | 0.93 |   56 | 0.84
D,A   |  79 | 0.76 |  355 | 0.83
D,P   | 280 | 0.71 |    0 | 0
A,P   | 263 | 0.83 |    0 | 0
L,D,A | 378 | 0.82 |  233 | 0.76
L,D,P |  32 | 0.82 |  444 | 0.72
L,A,P |  44 | 0.80 |  962 | 0.78
D,A,P |  20 | 0.96 | 4627 | 0.78

Table 2.3 Multi-dimension anomalies

Next we look at the multi-dimensional anomalies reported in Table 2.3. We observe that anomalies involving multiple dimensions are more frequent than single-dimension anomalies. We detected a total of 8415 E2E instances as anomalous. Our analysis reveals that in 83% of these E2E instances the content provider was one of the dimensions, 86% involved device types, and 86% involved application types. Overall, in 80% of these instances, the RNC or location dimension was not involved. In other words, 80% of the time the anomalies are not caused by problems in the cellular network. This composition of E2E performance anomalies shows that most of the time content providers, device types, and applications are involved in the performance degradations. These results are extremely useful for network operators as they highlight the nature of the majority of performance anomalies. Additionally, the findings point network operators to the problematic dimensions and reduce the search space of root causes.

Verification of detected E2E performance anomalies requires manual analysis of hundreds of time series. This manual analysis acts as the ground truth for the detected E2E instances. For each E2E instance, we plot all the time series and verify the occurrence of anomalies in those time series. We also verified the network anomalies against the cellular network outage reporting system. However, due to the large manual effort required to verify each anomaly from the network outage reporting system, we verify two real anomalies.

[Figure 2.12: Real-world anomaly 1: root cause outside the cellular network (TCP loss rate with predicted performance and upper bound, and rule confidence, over the hours of the week). Figure 2.13: Real-world anomaly 2: root cause inside the cellular network.]

2.6.3 Real-World Anomalies

In this section, we present two interesting real-world anomalies detected and localized by the learned fine-grained E2E performance models (the same models used for the synthetic anomalies) and their possible underlying root causes. The first real-world anomaly involves three dimensions, namely content provider, device type, and application type. Our system detected and localized a one-hour anomaly across all 78 user locations (corresponding to 78 RNCs) but specific to apple.com (content provider), iPhone 4s (device type), and browsing (application type). Figure 2.12 shows the time series plot of the TCP loss ratio for all 78 user locations, apple.com, iPhone 4s, and browsing. We also plot the predicted performance and the upper bound of the TCP loss ratio for apple.com, iPhone 4s, and browsing across all locations. One can clearly see that the anomaly occurred around the 8th hour of the week for all 78 user locations. The figure also shows the confidence values that we obtain for this particular case.
We observe that the confidence value peaks during the 8th hour, indicating that all E2E instances belonging to apple.com, iPhone 4s, and browsing were marked as anomalous. As the peak in loss ratio is across all locations and specific to apple.com, iPhone 4s, and browsing, the root cause is most likely outside of the cellular network and might be Apple related. There are other time instances where we observe non-zero confidence values, indicating anomalies occurring across a small fraction of the E2E instances belonging to apple.com, iPhone 4s, and browsing.

In the second real-world anomaly, only one dimension is involved. Our system detected and localized a 15-hour anomaly for four user locations across all content providers, device types, and application types. Figure 2.13 shows the time series plot of the TCP loss ratio for all locations, aggregated over the other three dimensions. Due to space limits, we picked only one user location to plot the predicted performance and the upper bound of the TCP loss ratio for that user location, apple.com, iPhone 4s, and browsing. One can clearly see that only four curves (corresponding to 4 RNCs) had a significant spike starting around hour 108. Besides the four significant spikes, there are several locations whose actual performance exceeds the upper bound. The figure also shows the confidence values that we obtain for the performance degradation across different locations. We observe that the confidence value peaks at 0.7, indicating that 70% of the locations were marked as anomalous. In contrast to the first real-world anomaly, the anomalies here are specific only to user locations, regardless of the other three dimensions. Thus the root cause here is most likely something inside the cellular network impacting the user locations (corresponding to RNCs).

2.7 Conclusions

We make the following key contributions in this chapter. First, we design and implement a comprehensive and holistic measurement system that monitors E2E service performance across four dimensions: user locations, content providers, device types, and application types. Second, we propose fine-grained models that capture the normal performance for every combination of user locations, content providers, device types, and application types, and use the models to detect performance degradation. Third, we propose an association rule mining based approach to localize performance degradation. Fourth, we use both real network traces and synthetic performance degradations to show that our method is highly effective.

3 Assessing Impact of Cell Tower Outages in Cellular Networks

3.1 Introduction

3.1.1 Background and Motivation

The ubiquity of cellular data networks has fueled a rapid increase in the number of cellular network users who connect to the Internet for an increasingly diverse set of applications, ranging from entertainment, such as streaming videos and online gaming, to commercial applications, such as online shopping, and even to more critical applications, such as VoIP calls. Cellular network operators need to maintain high reliability and performance. Despite the constant efforts of cellular data network providers in hardening and upgrading their network infrastructures, it remains a hard challenge for cellular network operators to identify service disruptions and assess the impact of service disruptions from the end-users' point of view.
Under normal conditions, each user is typically "connected" to a set of nearby cell sites via multiple radio links for fast data transmission and service redundancy. Users are connected to multiple sites to facilitate seamless handoffs. With such an inherent service redundancy mechanism built in, the impact of a site outage may be "absorbed" by the surrounding sites through a rippling series of signaling, with the result that users perceive no service impact. On the other hand, it is also possible that users associated with fully functional sites may suddenly suffer performance degradations resulting from the sudden migration of traffic load from a nearby site that failed. This makes the impact analysis of a site failure dependent on its nearby site locations, antenna configurations, power settings, user mobility, local landscape features (e.g., hills, buildings), and a wealth of other potential factors. Detailed modeling of each of these features can be extremely challenging, if not impossible.

3.1.2 Limitations of Prior Art

The state-of-the-art monitoring systems deployed in cellular networks mostly rely on cell site level measurements [117, 94]. As a result, the granularity of current monitoring systems is limited to the level of cell sites, which is typically an aggregation of tens or hundreds of users depending on the locations of the cell sites. Unfortunately, cell site level measurements are insufficient for cellular network operators to truly understand the fine-grained impact of cell site outages from the users' perspective, due to the unique challenge in cellular networks stemming from the built-in redundancy of radio links. Typically a user is within the radio reach of multiple cell sites, and upon the failure of his or her associated site, he or she may switch to get service from other nearby cell sites [67]. This type of radio migration during an outage is very hard to estimate without user level measurements. Due to scalability constraints, only spatially and temporally sampled user level measurements are collected, which are typically insufficient for cellular network operators to determine user perceived network quality in the field given one or multiple cell site outages [123]. Even worse, accurate locations are hard to obtain for the sampled user level measurements, as cellular users are moving around most of the time [10, 7]. Consequently, the impact analysis result may be quite inaccurate, for example when the operator misjudges the failure scenario, or under- or over-estimates the user impact. When inaccurate impact analysis results are used in outage prioritization and escalation, many severe outages may not receive adequate attention from the various teams of operators, vendor support, customer care and public relations officials, and executive personnel, causing prolonged user dissatisfaction.

3.1.3 Proposed Approach

In this work, we propose to leverage the existing M2M devices deployed in the field as approximations of human users to enhance service quality monitoring in operational cellular networks. We refer to this approach as M2MSourcing in this chapter. Our key idea is to monitor the communication activity of M2M devices over the cellular network infrastructure and identify gaps in that activity to estimate the quality of service perceived by human users. The ubiquity of cellular data networks has attracted vendors from the machine-to-machine (M2M) device market to develop innovative applications that use operational cellular networks.
Popular applications of M2M devices in cellular networks include home security provided by DIGILIFE, utility monitoring through the Smart Grid, and transportation services by OnStar. Some of these M2M devices, such as those used by home security applications, do not sleep and perform automated data transfers to an application server in the Internet. This automated data transfer between machines is commonly known as Machine Type Communication (MTC). Service disruptions in cellular networks directly impact the MTC activity of M2M devices. Additionally, some applications of M2M devices require the M2M device to be fixed at a particular location. Examples of such applications are home security, fire alarms, smart meters, etc. These applications require the device to be installed at a fixed location and then communicate periodically with a central server through the cellular data network. These devices serve as ideal candidates for monitoring network service quality at specific geographical locations. Given the locations of stationary M2M devices, network operators can monitor the MTC activity of these devices to estimate service quality around those coordinates. Therefore, we propose to utilize existing M2M devices deployed in the field as sensor nodes for monitoring network service quality and to analyze user impact at specific geographical locations.

In this chapter we design and evaluate an M2MSourcing based cellular network monitoring system called M2MScan that monitors the communication activity of M2M devices to analyze and estimate the user impact during cell site outages. For monitoring purposes, we want to identify a significant number of M2M devices satisfying certain criteria such as stationarity, communication frequency (always awake), and spatial distribution. For estimation of user impact, we want to utilize the communication patterns of the identified set of stationary M2M devices to measure network quality in different coverage regions. Our approach consists of three steps: identification of candidate M2M devices, detection of network outages, and measurement of user impact. First, we use call-data records of a major cellular service provider to identify a set of M2M devices where each device in the set satisfies the following two criteria: (i) the M2M device should be fixed at a particular location and should be stationary, and (ii) the M2M device should communicate continuously with its application server. In this chapter we refer to this set of M2M devices as stationary and active M2M devices, or simply qualified M2M devices. We use the qualified M2M devices as network sensors to monitor network quality. Second, to detect network outages, we develop a spatial map of candidate M2M devices that allows us to monitor network quality at multiple locations within a cell sector. We monitor the communication activity of all M2M devices in a given region to detect outages. Third, we use a non-parametric statistical approach to identify those devices in the region which have discontinuities in their communication patterns due to an outage. Additionally, our approach allows us to measure the duration during which a device was offline. Using this information we measure the spatial and temporal nature of the impact of a service disruption on users. Our operational experience reveals that our M2MSourcing based system M2MScan is able to provide cellular network operators with fine-grained spatial and temporal aspects of user impact for cell site outages.
3.1.4 Technical Challenges and Solutions

We face several key technical challenges in monitoring M2M devices for assessing service quality. The first challenge is to accurately infer the locations of candidate M2M devices. The locations of candidate M2M devices are required for creating a spatial map. To deal with this challenge we use approximate location samples which are obtained using the signal strength of a device. The signal strength is measured at multiple site locations, and the location is then approximated through triangulation. We use multiple location samples taken over a period of one month for each device and then select the most representative location sample by identifying the medoid of the measurements.

The second challenge is to infer the customer impact of an outage, which is measured in terms of outage start time and duration. Typically, the documented outage start times reflect the times when outages are first reported by users or through existing outage reporting systems. Users' calling times are inherently inaccurate because users may experience bad service at a time other than the actual start time, and users' calls are strongly contextual to users' tolerance levels. Similarly, existing reporting systems have to go through a series of manual verifications before finally reporting the outage start times. To deal with this challenge we utilize the communication activity of candidate M2M devices to infer the outage start time. We denote the set of devices in a region R as D_R. For each device d ∈ D_R, we use a p-value based statistical approach to compute a score table P^d of size i × W, where each row of the table is identified by a start-time value and each column by a duration value. Each entry of the table P^d represents the p-value at a given start time and duration. We compute the number of impacted devices for different values of the ⟨i, W⟩ pairs. The pair with the maximum number of impacted devices represents the outage start time and duration.

The third challenge is to quantitatively evaluate the effectiveness of our approach. To deal with this challenge, we introduce synthetic cell site outages in real data and use M2MScan to identify these outages and estimate their user impact. We evaluate M2MScan using three different metrics, namely accuracy of estimating the outage start time, the outage duration, and the user impact. We also use real network outage reports collected over a period of 6 months for evaluation purposes. We compare the reported outage hours with those identified by M2MScan and then measure the user impact using frequency histograms.
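The search over ⟨i, W⟩ pairs can be sketched as follows. The actual p-value statistic is non-parametric and is specified later in the chapter; the empirical tail probability below is only a placeholder for it, and the names, significance level, and candidate grids are all illustrative:

```python
import numpy as np

def empirical_pvalue(activity: np.ndarray, start: int, width: int) -> float:
    """Placeholder p-value for 'device went quiet' during
    [start, start + width): the fraction of same-width historical windows
    whose total activity is at most the observed window's total."""
    observed = activity[start:start + width].sum()
    history = np.array([activity[s:s + width].sum()
                        for s in range(len(activity) - width)])
    return float(np.mean(history <= observed))

def localize_outage(device_series, starts, widths, alpha=0.01):
    """Return the (start, width) pair maximizing the number of devices
    whose p-value falls below alpha: the estimated outage start time,
    duration, and impacted-device count."""
    best = (None, None, -1)
    for i in starts:
        for w in widths:
            impacted = sum(empirical_pvalue(a, i, w) < alpha
                           for a in device_series)
            if impacted > best[2]:
                best = (i, w, impacted)
    return best
```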
3.1.5 Key Novelty and Advantages

To the best of our knowledge, monitoring user experience via M2M devices in a large scale operational cellular network has not been previously studied. M2MSourcing is a simple, cost-effective, yet pragmatic approach to measure the user impact of cell site outages by discovering abnormal communication activities of stationary M2M devices. This approach has several advantages over state-of-the-art network monitoring tools. First, no changes are required in the existing infrastructure. Prior approaches like Jigsaw and WiScape deploy sensor nodes in the field to measure network performance [42, 123]. This kind of deployment for measuring network performance in nation-wide operational cellular networks is infeasible. Second, a large percentage of M2M devices are deployed and maintained by third parties; therefore, from the perspective of cellular networks, there are no maintenance costs associated with these M2M devices. Thus M2M devices deployed in the field are free user-experience sensors which can be used to closely measure the user impact of cell site outages. Third, we use stationary M2M devices for monitoring network conditions. Prior work has shown that mobility of end-user equipment can cause up to a 79% difference in measured throughput [55, 106]. Utilizing stationary M2M devices allows us to obtain consistent measurements over time. Fourth, we filter out M2M devices whose communication activity is intermittent. This is an advantage over prior crowd-sourcing based approaches, as M2MSourcing does not depend on user input to obtain measurements.

3.2 Related Work

Prior work on measuring the service quality of large operational networks has adopted two popular measurement methodologies: user feedback, and user-level performance measurements. The authors of [79] use a tool named HostView to obtain user feedback and then predict service quality through passive measurements of system and network performance metrics. In [78], Jin et al. study the effectiveness of using a smartphone application for reporting service problems in large operational networks. The authors use a Location based reporting tool (LRT) to obtain user feedback about service quality issues in cellular networks. User feedback based methodologies are limited by the accuracy and completeness of the information provided by users. Furthermore, in such cases feedback is recorded only when the user initiates the feedback process, thus limiting visibility into network conditions over time. Our work utilizes M2M devices which communicate continuously, thereby allowing us to analyze network service quality at any given time.

There is a significant amount of work on assessing network quality by proactively measuring different network performance indicators [55, 106, 100, 127]. Q-score is a service quality assessment framework that quantifies service quality by correlating network performance indicators with user feedback in the form of customer trouble tickets [128]. Netradar is a mobile network measurement platform that performs active and passive user level measurements to obtain insights about user-perceived service quality [129]. It measures signal strength and relates it to user service quality. Jigsaw and WiScape are client assisted wide area wireless network performance monitoring frameworks that deploy monitoring nodes in the field to measure network performance [42, 123, 33]. In this work, we propose to utilize existing M2M devices already connected to the Internet through an operational cellular network for monitoring user experience. This allows us to monitor network health at any given time at any given location. To the best of our knowledge, M2MScan is the first tool to utilize M2M devices for network monitoring.

3.3 Background and Data Collection

Figure 2.3 shows the UMTS cellular network architecture. A UMTS cellular network consists of two main sub-networks: the radio access network and the core network. The radio access network consists of NodeBs, which provide the radio access link to the user equipment. The UMTS core network consists of two main components, the Serving GPRS Support Node (SGSN) and the Gateway GPRS Support Node (GGSN).
The GPRS Tunneling Protocol (GTP) is used at the Gn interface to send and receive data to and from external networks. GTP is an IP protocol used in UMTS and LTE core networks. In a UMTS network, a Packet Data Protocol (PDP) context is established between the SGSN and the GGSN for packet transfer to and from the User Equipment (UE) [15]. At the UE side, connectivity to the Internet is achieved by establishing a radio resource channel (RRC) with a nearby cell site. Typically, due to the built-in redundancy of the cellular network architecture, multiple cell sites are available for RRC establishment.

Despite the built-in redundancy of cellular networks, cellular network users can experience poor or no service due to a variety of reasons. One of the major reasons is cell site outages, which may occur due to bad weather conditions, power outages, cable cuts, etc. During a cell site outage, users in the affected coverage regions would connect to nearby correctly functioning towers. Thus a cell site outage may not have any measurable user impact, i.e., users do not experience service disruption. In the worst cases, multiple sites located in the same geographical region may fail together and render services inaccessible to some users.

3.3.1 Dataset

In this study we use data collected at the core of a nation-wide cellular network provider in the United States. The data consists of: (1) M2M device activity over time; (2) approximate M2M device locations; (3) cell site outage information; and (4) blackout regions where users experience no service during cell site outages. Note that all data sources were anonymized to protect the privacy of cellular network users.

Device activity. We use call and data records maintained by the cellular network for analyzing device activity. For each device the dataset consists of information about data and call sessions. For data records, the measurement system is deployed at the core of the cellular network, where it monitors the PDP context session of each device. A PDP context established by a device can remain active for hours, depending on the activity of that device. The measurement system creates a record whenever a connection is torn down or the session length reaches a specified time limit, typically 1 hour. For data sessions, each record consists of the time stamp in GMT at which the record is created, an anonymized device identifier, the PDP context duration in seconds, bytes uploaded and downloaded, connection type, and cell and cell-sector identifiers. We use this information to monitor the communication activity of devices connected to the cellular network. Communication activity is a good indicator for differentiating a smartphone from an M2M device. Several M2M applications require M2M devices to communicate continuously with the application server. As a result, the PDP contexts established by these devices remain active 24 hours a day, 7 days a week. Therefore, by monitoring the communication activity of M2M devices for a reasonable period of time, we can identify active M2M devices. Furthermore, we use activity information for detecting cell site outages. When an outage occurs, M2M devices in the outage region are unable to access the cellular network; as a result, the measurement system does not record any activity for the impacted M2M devices. By looking at the time series, we can identify gaps in the communication activity of groups of M2M devices. We also measure the impact of an outage by computing the fraction of M2M devices impacted out of all the M2M devices in the outage region.
Device location. For each device, this data set consists of approximate latitude and longitude information. The approximate location coordinates are computed by measuring the signal strength of the device at various cell sites and then approximating the location through triangulation [29]. Each record consists of the time in GMT at which the measurement was taken and the approximate latitude and longitude. The measurement system records multiple location samples of each device over time. Multiple location samples taken at different time instances are useful for identifying whether a device is moving or stationary. Using these location samples we can create a map of active and stationary M2M devices which can act as sensors for monitoring network health.

Network coverage reports. We also have information about the network coverage of cell sites. Given a set of sites, we can obtain the geographical regions which are expected to be completely out of service (blackout regions). Additionally, we use network outage reports recorded by various network performance monitoring tools and by customer calls. Each record consists of the network outage ID, outage start time, outage end time, type of outage (cell site, RNC, etc.), list of cell sites affected, technology of the cell sites (i.e., 2.5G, 3G, 4G, or LTE), and cause of the outage (power outage, weather, etc.). Note that not all types of outages affect device connectivity; therefore, we only focus on those types of outages which directly impact user activity. We use this information to evaluate M2MScan.

Figure 3.1 Spatio-temporal characterization of M2M devices: CDF plots of (a) median pairwise distances (meters), (b) device density (no. of devices), (c) device activity (records per hour), and (d) active device density (no. of devices per cell).

3.4 Spatio-Temporal Characterization

In this section we characterize devices connected to a cellular network according to their communication activity and mobility patterns. Our characterization analysis allows us to differentiate M2M devices from smartphones. We are interested in a specific set of M2M devices which satisfy the following two requirements. First, the devices should be stationary. M2M devices used for metering, building security, and fire alarms are installed at fixed locations and act as reliable information sources. Typically, these devices use only cellular data services and are inactive over the voice plane. A major benefit of monitoring stationary devices is that the service quality landscape can be characterized more accurately. Second, the devices should be active 24/7. M2M devices communicating continuously provide useful hourly information about network accessibility conditions. Discontinuities in the communication activity are clear indications of potential network accessibility issues such as cell site outages. In the rest of this chapter we use the term qualified for devices which satisfy these requirements.

Mobility characteristics. We first identify devices which are inactive over the voice plane of cellular service for a period of one month. This gives us a set of devices which are active only over the data plane. Next, we utilize the location samples collected during a one-month period and calculate pairwise distances between all location samples of each device.
For each device we obtain the median of all pairwise distance values. Figure 3.1(a) shows the cumulative distribution function of the median distance computed for each device. From this figure, we observe that 40% of the devices which are inactive over the voice plane have a median distance of less than 100 meters. We also observe that a small percentage of the devices (< 5%) have a distance between 100 and 1000 meters. Therefore, we consider stationary all those devices whose median pairwise distance is less than a threshold of 100 meters. We also look at the density of stationary devices. Figure 3.1(b) shows the CDF of the number of devices present in a single hexagonal coverage region (cell site). The figure shows that 80% of cell sites have at least 10 stationary devices.

Communication activity. Besides stationarity, we require that the qualified set of devices should also communicate continuously over time. This is necessary for continuous network quality monitoring. A discontinuity in the communication pattern of a device indicates that the device is offline. The discontinuity can be either due to device application behavior or poor network service. A device which has discontinuities in its communication under normal conditions is not a reliable candidate. For monitoring network service quality in the face of a cell site outage, we want to identify devices which went offline due to the outage and not because of application behavior. Therefore, we filter out devices which have discontinuities in their communication pattern. To filter out less active devices we simply look at the number of records generated by each device per hour. Figure 3.1(c) shows the CDF plot of the number of records per device over a period of one week. The figure shows that approximately 18% of the qualified devices generate at least one record in every hour, i.e., 7% of the original set of devices used in Figure 3.1(a). These devices serve as our qualified set of devices which are stationary and active over the data plane.

Figure 3.2 Active device density (county-wise; legend: NA, 0−10, 10−50, 50−100, 100−500, >500 devices per cell sector)

Geo-spatial characteristics. We next look at the spatial distribution of the qualified devices. Figure 3.1(d) shows the CDF plot of the number of qualified devices present in a cell. The figure shows that more than 60% of cell sites have at least one device which is stationary and communicates continuously. In Figure 3.2 we visualize the county-wise geographical map of device density. The colored regions do not have M2M devices that satisfy our criteria. On average, the majority of the regions have 10 to 50 devices per cell sector. Counties with large metropolitan regions have device densities greater than 100 devices per cell sector. These results show that sites located in large metropolitan regions have higher device densities that allow service quality monitoring at a higher granularity.

Figure 3.3 Overview of system architecture

3.5 System Design

Our approach is to monitor the communication activity of stationary M2M devices to measure service quality and detect cell site outages. This approach has several advantages over traditional monitoring approaches. First, network operators can obtain a service quality landscape at a higher resolution. Second, activity patterns of M2M devices can be used to detect cell site outages. Third, M2MSourcing allows network operators to measure the customer impact of an outage. Figure 3.3 shows an overview of the M2MScan system architecture.
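Before any profiling begins, M2MScan must select qualified devices. The following is a minimal sketch of the two qualification tests of Section 3.4: stationarity (median pairwise distance under 100 meters) and continuous activity (at least one record every hour). The haversine distance and the data layout are our illustrative choices, not necessarily the production implementation.

```python
import itertools
import math
import statistics

def haversine_m(p, q):
    """Great-circle distance in meters between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 6371000 * 2 * math.asin(math.sqrt(a))

def is_stationary(location_samples, threshold_m=100):
    """Median pairwise distance of all location samples below threshold."""
    if len(location_samples) < 2:
        return False                     # too few samples to judge mobility
    dists = [haversine_m(p, q)
             for p, q in itertools.combinations(location_samples, 2)]
    return statistics.median(dists) < threshold_m

def is_qualified(location_samples, records_per_hour):
    """Stationary and continuously active over the monitoring window."""
    return (is_stationary(location_samples)
            and all(n >= 1 for n in records_per_hour))
```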
3.5.1 Activity Profiling

For each device, we measure its activity on an hourly basis by looking at the number of bytes uploaded and downloaded in that hour. At any given hour, if the total number of bytes uploaded or downloaded is zero, then we consider the device inactive. We consider any inactivity as an impact on user experience. We can infer the impact duration on an hourly basis by looking at the inactivity start time and end time. Figure 3.4(a) shows the time series plot of some example devices impacted due to a cell site outage. The discontinuities in the time series plot of each device represent the durations during which the device was unable to download/upload data to its application server. We make two main observations. First, the impact duration of individual devices located within the same outage region can vary. From the figure, we observe that all three devices were impacted for different time durations. Second, a device may be able to communicate inconsistently during the complete outage duration. From the figure, we observe that Device-A is impacted during two different time windows. These observations highlight the dynamic nature of the impact of network outages on cellular network users. M2MScan allows network operators to monitor variation in impact durations and identify user impact at a higher granularity both in space and time.

Figure 3.4 Example devices: (a) time series of traffic volume (bytes) over time (hours) for Devices A, B, and C; (b) location samples, nearby towers, and the medoid of a device (latitude vs. longitude).

3.5.2 Location Profiling

We use the location samples collected for each device and take the most representative location sample as the device's location. For each device, all location samples form a cluster on the geographical map. We use the medoid to identify the most representative sample of each cluster. A medoid is defined as the most centrally located point of a cluster. Figure 3.4(b) shows an example of the location samples and the medoid of a device. The figure also shows the sites near the measured samples; the majority of each device's samples are clustered together.

3.5.3 Quantifying Temporal Impact

In this section we discuss our approach for inferring the temporal impact of cell site outages. We denote the set of devices in a region R as D_R. For each device d ∈ D_R, we monitor the device activity in terms of traffic volume on an hourly basis. Assuming two time instances a and b, where b − a = N hours, we compute sums of traffic volume using a sliding window of W hours. This gives N − W + 1 samples of the sum of traffic volume between a and b. At time i we denote the sample as S_{i,W}, where i represents the starting time index of the sample window and W represents the window size. Next, we rank the N − W + 1 samples while breaking ties, and use the ranks to compute a p-value at various time instances between a and b using Equation 3.1:

pval(i, W) = rank(S_{i,W}) / (N − W + 1)    (3.1)

Here i ≤ b − W, and W << N. For each device we compute p-values for N values of i and M values of W. Let P^d_{i,W} contain the p-values for all pairs of i and W for device d. We use all sets of pairs P^d_{i,W} ∀ d ∈ D_R to compute a frequency histogram of devices, denoted H_{i,W}. We select the pairs of values of i and W for which we get p-values less than a specific threshold th. We select the threshold th according to the total number of samples between the two time instances a and b, i.e., th = W/(N − W + 1). The threshold is chosen such that the samples with the smallest rank values are selected. The pair with the maximum frequency represents the outage start time and duration ⟨St, D⟩ during which the maximum number of devices were affected. Note that the sets of devices belonging to different start time and duration pairs may not be disjoint, i.e., a device may contribute to the frequency of more than one pair. We consider the devices belonging to the pair with the highest frequency for quantifying the user impact in space. Algorithm 1 describes our methodology for obtaining estimates of outage start times, durations, and user impact.

Algorithm 1 Estimating outage start time and duration
Result: ⟨St, D⟩, |D′| = Estimate(D_R, th, M)
Input: (1) set of devices D_R in region R; (2) p-value threshold th; (3) maximum duration M
Output: (1) start time and duration pair ⟨St, D⟩; (2) number of devices impacted |D′|
1: for i ← 1 to N do
2:   for W ← 1 to M do
3:     for d ∈ D_R do
4:       P^d_{i,W} ← pval(i, W)
5:       if P^d_{i,W} < th then H_{i,W} ← H_{i,W} + 1
6: ⟨St, D⟩ ← argmax(H)
7: |D′| ← max(H)
8: return ⟨St, D⟩, |D′|
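To make the estimation procedure concrete, the following is a compact NumPy sketch of Equation 3.1 and Algorithm 1. The data layout (a devices × hours traffic array) and all names are our own assumptions, not the production implementation.

```python
import numpy as np

def pvals_for_device(volume, W):
    """p-values of all N - W + 1 windows of size W for one device."""
    N = len(volume)
    sums = np.convolve(volume, np.ones(W), mode="valid")  # window sums
    # rank 1 = smallest window sum; stable sort breaks ties by position
    ranks = np.argsort(np.argsort(sums, kind="stable")) + 1
    return ranks / (N - W + 1)                            # Equation 3.1

def estimate_outage(traffic, th, M):
    """Return the <start, duration> pair flagged by the most devices."""
    H = {}                                                # histogram H_{i,W}
    for volume in traffic:
        for W in range(1, M + 1):
            p = pvals_for_device(volume, W)
            for i in np.nonzero(p < th)[0]:               # low-traffic windows
                H[(int(i), W)] = H.get((int(i), W), 0) + 1
    if not H:
        return None, 0                                    # nothing below th
    (start, duration), impacted = max(H.items(), key=lambda kv: kv[1])
    return (start, duration), impacted
```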
3.5.4 Quantifying Spatial Impact

To quantify the impact of a network outage we utilize the list of devices obtained through the p-value computation. First we obtain a spatial map of impacted devices; then we use a 2D Gaussian kernel to obtain a quantifiable measure of impact regions. For a device d ∈ D_R we compute the impact score in its surrounding region using a 2D Gaussian kernel centered over the device location. We obtain the impact score as the value of the function G_d(x, y) defined as follows:

G_d(x, y) = (1 / (2πσ_d²)) e^{−(x² + y²) / (2σ_d²)}    (3.2)

Here x and y represent the coordinates at which the score is being computed, and x² + y² is the squared distance between the device and that location. We define σ_d as the distance between the device and its nearest neighbor. The impact score of a device at x = 0 and y = 0, i.e., at the device's location, is simply 1/(2πσ_d²). To compute the impact score at any given location (x′, y′), we compute the scores G_d(x, y) ∀ d ∈ D_R. Using the values of G_d(x, y) we quantify the strength of the outage impact at any particular location.
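The following is a small sketch of Equation 3.2, assuming projected (x, y) device coordinates in meters; the per-device scores it returns can then be summed or compared at a location, as in the evaluation of Section 3.6. The coordinates and helper names are illustrative.

```python
import numpy as np

def nearest_neighbor_sigma(locations):
    """sigma_d: distance from each device to its nearest neighbor."""
    diffs = locations[:, None, :] - locations[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)               # ignore self-distances
    return dists.min(axis=1)

def impact_scores(point, locations, sigma):
    """Per-device kernel G_d of Equation 3.2 evaluated at `point`."""
    sq = ((locations - point) ** 2).sum(axis=1)   # squared distance to point
    return np.exp(-sq / (2 * sigma ** 2)) / (2 * np.pi * sigma ** 2)

# The self term at a device's own location equals 1 / (2 * pi * sigma^2):
locations = np.array([[0.0, 0.0], [50.0, 0.0], [0.0, 120.0]])
sigma = nearest_neighbor_sigma(locations)
print(impact_scores(locations[0], locations, sigma))
```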
Figure 3.5 Estimation accuracy of start times and durations of synthetic outages, for outage lengths of (a) 1 hour, (b) 2 hours, (c) 5 hours, and (d) 10 hours: accuracy (%) vs. p-value threshold (0.01, 0.02, 0.04).

3.6 Evaluation

We use real data collected at a nation-wide cellular network operator in our evaluation. First, we introduce synthetic network outages into the real data and evaluate our system using three different metrics: network outage start time, network outage duration, and outage impact in terms of the fraction of devices affected by the outage. Second, we deployed our system in an operational network and detected and validated network outage events and their impact. For validation purposes we used external sources of information such as tickets generated by user complaints and network outage databases maintained by the cellular network operator.

3.6.1 Synthetic Network Outages

A naïve way to simulate an outage in a geographical region is to remove the activity records of all qualified devices in that region for a specific duration. In real outage scenarios, devices in an outage region typically connect to nearby working sites and remain active. Therefore, not all devices in an outage region are impacted. To enable realistic outage simulations we use network coverage reports to obtain blackout regions. These are geographical regions where customers experience a complete network outage when the cell sites in those regions fail.

Evaluation data. We identified six geographical regions covered by 102 sites and 1881 qualified devices. Each geographical region belongs to a large metropolitan city in the United States. The smallest region consists of 3 sites, and the largest region is covered by 37 sites. We obtain blackout regions for each set of sites. Approximately 5% of the qualified devices are in the blackout regions. Using these sets of sites and devices we created 120 different outage scenarios. For each region we used device activity data for a period of one week. The length of a real network outage typically ranges from a few minutes to tens of hours. Since our dataset consists of hourly measurements, we introduce synthetic network outages with durations ranging from 1 to 15 hours.

Outage scenarios. In each outage scenario, we choose a set of sites and introduce synthetic network outages at a given starting time and duration. For a given set of sites T in a geographical region we obtain a set of blackout regions R from the network coverage report dataset. These regions are 100 × 100 meter boxes where users are likely to experience absolute service inaccessibility if the cell sites in T are out of service. Next, we obtain the set of qualified devices D served by the sites in T and identify the subset of devices Db ⊆ D which are located in the blackout regions. Finally, we introduce a network outage of duration d at time t by removing the activity records of the devices in Db. For simulation purposes we assume that devices not in the blackout region are able to switch to a nearby site not in the set T and continue communicating over the data plane seamlessly. For each set of sites T, we created 20 different scenarios with four different durations and five different starting times. Each outage scenario serves as the ground truth for evaluation purposes. Finally, we evaluate detection accuracy by recording the time and duration parameters returned by M2MScan for each outage scenario.

Evaluation metrics. For a given outage scenario, M2MScan returns the ⟨St, D⟩ pair that has the maximum number of impacted devices. We compare the ⟨St, D⟩ values for each scenario with the ground truth and compute the estimation accuracy. An accurate estimate of outage start time and duration is essential for measuring the temporal impact of an outage. Here we define estimation accuracy as the percentage of outage scenarios whose ⟨St, D⟩ parameters were correctly identified. We also evaluate the accuracy of estimating user impact, measured as the percentage of devices affected during the outage: we compute the fraction of impacted devices correctly identified by M2MScan out of the total number of devices actually impacted.
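A minimal sketch of the outage-injection and recall computations just described follows; the data layout is hypothetical, with `traffic` mapping device IDs to hourly byte counts and `blackout_ids` standing for the set Db.

```python
def inject_outage(traffic, blackout_ids, t, d):
    """Zero out hours [t, t + d) for every blackout-region device."""
    corrupted = {dev: hours[:] for dev, hours in traffic.items()}
    for dev in blackout_ids:
        for h in range(t, t + d):
            corrupted[dev][h] = 0    # ground truth: no activity recorded
    return corrupted

def impact_recall(detected, ground_truth):
    """Fraction of truly impacted devices that the system identified."""
    return len(set(detected) & set(ground_truth)) / len(ground_truth)
```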
Figure 3.6 Estimation accuracy of user impact in synthetic outages: accuracy (%) vs. p-value threshold (0.01, 0.02, 0.04) for 1, 2, 5, and 10 hour outage durations.

Figure 3.6 shows the accuracy of estimating user impact. We observe that the accuracy of user impact estimation increases with the p-value threshold, because higher p-values allow inconsistent devices to be classified as impacted.

Figure 3.7 Average of the absolute difference (hours) between estimated and actual start times and durations vs. p-value threshold.

Figure 3.8 Device impact (%) of real outages by outage ID, for p-value thresholds 0.01, 0.02, and 0.03.

Simulation results. Figure 3.5 shows the accuracy of estimating outage starting times and durations. We report estimation accuracies for four different durations. Each group of bars represents accuracies for different p-value thresholds. We also draw error bars which represent the percentage of scenarios whose estimates were incorrect. The top portion of an error bar represents the percentage of scenarios whose estimated parameter values are greater than the actual value; similarly, the bottom portion of an error bar represents the percentage of scenarios whose estimated parameter values are less than the actual value. For example, Figure 3.5(a) shows that the accuracy for outage scenarios with a duration of one hour is approximately 67%.

Figure 3.9 Heatmap visualization of a real outage (window size vs. start time in hours) for p-value thresholds of (a) 0.01, (b) 0.02, and (c) 0.04.

Figure 3.10 A special case of a real-world site outage (window size vs. start time in hour of week).

3.6.2 Real Network Outages

We implemented and deployed M2MScan at the core of a cellular network and analyzed outages during a two-month time period. We analyzed 14 different outages of varying durations at different geographical locations. In each outage, more than 3 cell sites were involved. Across all outages, the average device density is 38 devices per cell site, with a maximum of 145 qualified devices per cell site and a minimum of 20 devices per cell site. We utilize the information available in network outage reports as the ground truth for real outages. We evaluate the accuracy of M2MScan on real network outages for different p-values. Figure 3.7 shows the average absolute difference between the estimated and actual start time and duration values. For large p-values we observe a large average difference, which indicates lower estimation accuracy. These results correlate with the results obtained through the simulations of synthetic outages. For each network outage we also look at the user impact. Figure 3.8 shows the percentage of impacted devices detected by M2MScan for all 14 outages. We observe that for some outages an increase in the p-value threshold has little effect on the percentage of impacted devices. For example, for outages 4 to 6 there is a small increase of approximately 1% in user impact. Our analysis also reveals that outages with large user impact involve a large number of failed sites. For example, outages 12 to 14 have more than 10% of devices impacted, and all of these outages have more than 10 failed sites. We also visualize the histogram of start time and duration pairs in Figure 3.9. Here the x-axis represents the start time value, denoted as i in our formulation, and the y-axis represents the window size, denoted as W.
The hottest points represent the outage start-time and duration pair for which the maximum number of devices are classified as impacted. We observe that for a change in the p-value threshold from 0.01 to 0.02, the estimate of user impact doubles, while we do not observe a significant shift in the outage start time and duration values. However, for a p-value of 0.04 we observe a small change in the user impact estimate and a large shift in the outage start-time and duration estimates. These observations highlight that small p-values are better suited for estimating outage start times and durations, while larger p-value thresholds provide better estimates of user impact. However, the downside of increasing the threshold value is an increase in the number of false positives. The heatmap also shows that the majority of devices were impacted with a similar duration and starting time. A geographical analysis of the outage reveals that the outage occurred in a region with a large number of cell sites located geographically close to each other. Therefore, in this case devices either were able to switch to other nearby sites or were directly impacted. These examples show the potential of M2MScan to identify the temporal impact of an outage.

To evaluate the spatial impact of outages, we compute impact scores using the Gaussian smoothing kernel. For each outage in a region R, we create 10 random subsets containing equal numbers of devices, including both impacted and non-impacted devices. We use the devices in 9 subsets, denoted as d′, to obtain the impact scores G_d′(x, y) ∀ d′ at the locations of the devices in the 10th subset. We then label each device in the 10th subset as impacted or non-impacted based on the impact scores. For a given device in the 10th subset we obtain |d′| scores. A device is labeled impacted if the maximum score at its location belongs to an impacted device; otherwise we label the device as non-impacted. For 6 outages the impact scores correctly labeled 85% of the impacted devices. For the remaining outages the impact scores correctly labeled less than 10% of the impacted devices. To investigate this variation we look at the geographical map of device and site locations. Our analysis reveals that due to terrain conditions and the coverage planning done by the cellular network, devices located within the outage region may connect to sites outside the outage region.

Next we discuss the temporal characteristics of a special case of a real site outage. Figure 3.10 shows the heat map plot generated from the frequency histograms. We observe that the maximum number of devices are impacted with a starting time of 68 and a duration of 20. However, we also observe that a significant number of devices do not have this starting time and duration. For example, one particular group of devices has the same starting time with a duration of 5 time units. We also observe that the devices belonging to the two different impact durations are disjoint sets. This shows that devices located in the same outage region can have different impact durations. To understand the varying impact durations we looked at the geographical locations of all qualified devices in the region. We found that the devices with smaller impact durations are located in the outer region of the outage and the devices with longer impact durations are located in the central region of the outage. In summary, M2M devices enable us to analyze network outages at a higher granularity in both space and time.
The impact duration and impact region of a given cell site outage can vary both in space and time. These variations depend on the geographical terrain and the cell site locations in the outage region. Besides obtaining estimates of outage start time, duration, and user impact, M2MScan can be used to characterize network outages and their user impact.

3.7 Conclusion

In this chapter, we present our design and prototype of an M2MSourcing-based monitoring system called M2MScan for a large-scale cellular network. We first identify a qualified set of M2M devices which are stationary and have consistent communication activity, using data collected from a large-scale cellular provider in North America. We then use these qualified M2M devices to measure the user impact of cell site outages. Our analysis reveals that the impact of cell site outages is highly dynamic in space and time. We show that M2MScan is highly effective in estimating the temporal impact of cell site outages. M2MScan is also useful in estimating the percentage of users impacted due to a cell site outage. Spatial impact is highly dynamic due to geographical terrain and site locations: devices tend to hand off to nearby working cell sites and remain active even if they are in the outage region. To the best of our knowledge, large-scale cellular network monitoring using M2M devices at operational cellular networks has not been previously studied.

4 Optimizing Transit Route Selection for CDNs

4.1 Introduction

Content publishers usually rely on third-party Content Delivery Networks (CDNs) for efficiently delivering content to end users. A significant fraction of web content on the Internet is served by CDNs. Cisco estimates that the share of Internet video traffic served by CDNs will increase from 61% in 2015 to 73% by 2020 [12]. Two major considerations for CDNs are the cost and performance of delivering content to end users. CDNs maintain copies of content at cache servers that are deployed at carefully selected geographical locations. CDNs also maintain multiple transit routes from cache servers to access ISPs (or “eyeball networks”) which provide Internet connectivity to end users. When a user requests an object, the request is redirected to a nearby cache server containing the requested object. The cache server sends the object to the end user via one of multiple transit routes. Since the performance of transit routes changes over time, the end-to-end performance achieved by a CDN is dependent on the choice of transit route [133]. Furthermore, the price of Internet transit also varies from one transit provider to another. Thus, a major challenge faced by CDNs is to develop a transit routing strategy that simultaneously optimizes cost and performance.

The dynamic nature of transit pricing and performance makes it challenging to optimize the cost and performance tradeoff. There are thousands of eyeball ISPs which are reachable via different transit routes and different geographical locations. Each choice of transit route for a particular eyeball ISP and geographical location has distinct cost and performance characteristics, which makes the problem of developing a transit routing strategy challenging. Therefore, it is important for CDNs to carefully design and adopt a transit route selection strategy by analyzing the dynamically changing cost and performance tradeoffs. To the best of our knowledge, the problem of optimal Internet transit routing for CDNs considering both cost and performance tradeoffs has not been addressed in prior literature.
Prior work on route selection has studied multi-homed access networks [44, 54, 149]. Unlike multi-homed access networks, CDNs have multi-homed servers at IXPs (via multiple transit providers) which provide explicit control over content routing. The state of the art for transit routing at IXPs by CDNs does not use a fully automated approach. CDN operators mostly configure routing strategies manually to optimize cost and performance. Due to performance dynamics across thousands of access ISPs and numerous transit routes, it is practically impossible for CDN operators to manually achieve an optimal tradeoff between cost and performance.

In this work, we present the measurement, analysis, design, and evaluation of optimal transit route selection for CDNs. First, we present an approach to measure end-to-end performance across multiple transit routes simultaneously. We measure end-to-end performance as the delay between the CDN server located at an IXP and the end user. Second, we analyze the temporal and spatial dynamics of performance differences between multiple transit routes. Measuring performance differences across multiple transit routes allows us to characterize the performance dynamics of Internet transit. Finally, we present our proposed approach for optimal transit routing. We formulate the transit route selection problem using a multi-attribute objective function that optimizes cost and performance simultaneously. Using linear programming, we obtain tradeoff curves between cost and performance for various routing strategies. Our results show that CDNs can achieve significant cost and performance benefits using our measurement and optimization approach.

We face two key technical challenges in optimizing Internet transit routing. The first challenge is to obtain simultaneous performance measurements across multiple transit routes. This is necessary because performance on each transit route depends on multiple external factors. Performance differences between transit routes can vary due to congestion at the ISP-transit interconnections or congestion at intra-ISP links. Furthermore, performance can vary due to the traffic engineering policies of customer ISPs. To solve this challenge, we capture the dynamics in performance differences by exploiting the multi-homing capability of CDN servers. Specifically, we implement a client-side performance measurement JavaScript that is embedded in client-requested web pages by the CDN. The JavaScript downloads multiple copies of a pixel tag simultaneously via multiple transit routes. This measurement methodology allows us to capture user-perceived end-to-end performance via multiple transit routes. We measure the performance of multiple transit routes simultaneously and focus on pixel tags whose download time is representative of the Round Trip Time (RTT).

The second challenge is to obtain a transit routing strategy that provides an optimal tradeoff between cost and performance. For each IXP location, thousands of destination ISPs are reachable, and for each destination ISP there are numerous transit routes. We solve this challenge by formulating the Internet transit routing problem as a multi-objective optimization problem. The objective function computes the utility of a particular routing strategy and optimizes it over all possible strategies. We define utility as the sum of the overall cost incurred by the CDN and performance weighted by a factor γ. By varying the value of γ, we obtain various selection strategies.
We obtain an optimal strategy by looking at the tradeoffs between cost and performance for different γ values. The ability to analyze and automatically control transit routing has significant cost and performance benefits for CDNs. Our approach allows CDN operators to analyze cost-performance tradeoffs and accordingly choose an optimal routing strategy. We summarize our key contributions below.

• First, we characterize Internet transit performance from multiple IXP vantage points. Our comparative analysis of two transit providers reveals that one transit route significantly outperforms the other for more than 50% of users at a European IXP and more than 30% of users at a North American IXP.

• Second, we propose an optimization approach that allows CDN operators to navigate cost and performance tradeoffs in transit routing through a control knob. Our results show that CDNs can reduce their transit costs on average by 57% without incurring any performance degradation.

4.2 Background

4.2.1 Architecture

The CDN caching infrastructure consists of servers located at multiple geographically distributed locations. A content provider pushes copies of its content to the CDN. The CDN maintains copies of the content at geographically distributed cache servers. Client requests for content are typically directed to suitable CDN cache servers based on geographic proximity using DNS redirection or anycast [91].

There are two common server deployment strategies employed by most commercial CDNs [70]. CDNs either deploy servers inside many access ISPs that are closer to users (enter-deep strategy) or deploy servers at a few carefully chosen geographical locations (bring-home strategy). For example, Akamai has adopted the enter-deep deployment strategy and has deployed 100,000+ servers across thousands of ASes [4, 132]. Several major content providers have also adopted the enter-deep deployment strategy. For example, Netflix's Open Connect Appliance (OCA) servers are deployed at ISPs delivering over 5 Gbps in peak daily Netflix traffic [8]. Google Global Cache (GGC) servers are also installed inside large ISP networks [6]. In the bring-home strategy, CDNs deploy large clusters of servers at fewer sites and connect these sites with high-speed connectivity. Instead of deploying these clusters inside large ISPs, these CDNs strategically place their server clusters near Internet Exchange Points (IXPs). At IXPs, CDNs can interconnect with a large number of ISPs using peering or transit [9, 96, 39]. According to a snapshot of PeeringDB in August 2013, 76% of ASes use Open peering, 21% use Selective, and 3% use Restrictive [96]. For example, Limelight has 18,000+ servers at dozens of Points of Presence (POPs) around the world. CDNs interconnect with backbone transit ISPs at IXPs to efficiently deliver content to end users.

Figure 4.1 provides an architectural overview of the bring-home CDN that we study in this work. As discussed earlier, the CDN cache servers are located near major IXPs, where they interconnect with backbone transit providers (or transit ISPs).1 CDNs buy transit services from multiple transit providers. CDNs can use one transit provider or simultaneously use multiple transit providers in order to minimize their transit costs and maximize performance for end users. Unlike the enter-deep strategy, a bring-home CDN has more control over content delivery servers because cache servers are located at a small number of key geographical locations.
However, as shown in prior literature [70], the CDN has to deal with larger end-to-end delays and higher transit costs as compared to enter-deep CDNs like Akamai. Therefore, it is crucial for bring-home CDNs, like the one discussed here, to carefully choose transit routes to optimize both performance and cost.

Figure 4.1 A CDN interconnects with multiple transit ISPs at IXPs

1 Note that the CDN can peer with an access ISP (or “eyeball network”) to eliminate transit costs if the access ISP has presence at the IXP. However, small access ISPs typically do not have presence at multiple large IXPs [107]. Moreover, large access ISPs may not directly peer with the CDN due to the intricacies of peering [107]. In this chapter, unless stated otherwise, we restrict ourselves to transit routing for CDNs.

4.2.2 Internet Transit Dynamics

Pricing and performance of transit providers vary with respect to time and geographical location. Below, we provide an overview of both pricing and performance dynamics in the Internet transit market.

Pricing Dynamics. Internet transit prices have steadily decreased over the years due to technological advances and increased competition in the Internet transit market. Usage-based and tiered pricing models are commonly used in the Internet transit market [137]. In the usage-based pricing model, Internet transit is a metered service, i.e., transit providers charge their customers by measuring the amount of traffic sent or received during the billing period. Some transit providers may charge customers differently based on traffic volume and destination. In the tiered pricing model, transit providers charge customers based on geographical region, traffic commit levels, type of traffic (i.e., on-net vs. off-net), etc. [137]. Customers who commit to higher bandwidths are able to negotiate lower per-Mbps costs than customers who commit to lower bandwidths [108]. The most commonly used pricing scheme in the Internet transit market is 95th-percentile pricing [131]. In this scheme, usage over a fixed billing period (typically one month) is measured on a megabit-per-second basis using the 95th percentile value. Unlike capped or fixed billing, where customers pay a fixed amount regardless of usage, 95th percentile charging is flexible: service providers do not have to implement various charging policies, and customers pay only for what they utilize. Note that customers with bursty traffic are likely to pay higher costs than customers with consistent bandwidth utilization, even though the overall traffic volume transferred by bursty customers may be less than that of consistent ones. When considering the costs incurred due to the effects of bursty traffic on traffic engineering policies, the 95th percentile charging method balances the tradeoff between flexibility and the amount charged to customers.
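As a concrete illustration of 95th-percentile charging, the following sketch bills the highest usage sample after discarding the top 5%. The sampling interval, the ceiling convention for the percentile index, and the rates are our assumptions; actual contracts vary.

```python
import math

def percentile_95_bill(mbps_samples, rate_per_mbps):
    """Bill the 95th-percentile usage sample at the given per-Mbps rate."""
    ordered = sorted(mbps_samples)
    # index of the 95th-percentile sample (ceiling convention; varies)
    idx = math.ceil(0.95 * len(ordered)) - 1
    return ordered[idx] * rate_per_mbps

# A bursty customer pays for its 95th-percentile rate, not its peak:
samples = [100] * 95 + [900] * 5      # 5% of intervals burst to 900 Mbps
print(percentile_95_bill(samples, rate_per_mbps=0.50))  # billed at 100 Mbps
```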
Performance Dynamics. There are several factors that cause performance differences across transit routes. For instance, a transit route may simply be longer (more IP hops) than others, resulting in consistently higher propagation delays. A CDN can easily identify such cases when a transit route is consistently worse than others. Transit performance is also affected by congestion at ISP-transit interconnections or congestion at intra-ISP links, resulting in larger queueing delays and packet losses due to buffer overflows. The congestion can be temporary (e.g., during peak hours) or long-lasting, indicating link under-provisioning. Such changes in transit performance are not under the control of CDNs, because contractual agreements between ISPs and changes in inter- and intra-domain routing policies are considered confidential information. From a CDN's perspective, it is important to continually monitor performance across different transit providers and choose transit routes accordingly to optimize end-to-end performance.

Overall, in addition to optimizing performance, CDNs also have to consider the financial aspects of Internet transit routing. Different transit providers may charge differently, use different pricing models, and set up contractual agreements with or without performance SLAs. Thus, CDNs have to navigate the cost-performance tradeoff. A CDN can choose the cheapest transit route by sacrificing performance, or it can pay more for better performance. In this work, our goal is to understand the cost-performance tradeoff from the perspective of CDNs.

4.3 Performance Measurements & Analysis

4.3.1 Measurement Methodology

In this section, we discuss our methodology to measure and analyze the performance of transit providers. To measure the performance of multiple transit providers serving a particular ISP (e.g., an AS in a specific geographical region), we utilize the multi-homing capabilities of the CDN's cache servers located at IXPs. Specifically, we embed a JavaScript in client-requested web pages to conduct active performance measurements. The client-side JavaScript generates HTTP requests to the multi-homed IXP server and downloads a pixel tag via different transit providers. The same pixel tag is downloaded simultaneously via multiple transit providers; this allows us to capture the performance differences of various transit routes simultaneously.

Figure 4.2 illustrates our measurement methodology. The JavaScript is embedded in the HTML pages served to end users. Once a client downloads the HTML page, the JavaScript executes in the background and sends an HTTP GET request for a pixel tag. To avoid additional delays incurred due to DNS lookups, we hard-code the IP addresses of the measurement server in the JavaScript. To avoid local cache hits, we add a nonce check in the client's HTTP request. This ensures that the pixel tag is served only from the multi-homed measurement server and not from the local browser's cache. We set the size of the pixel tag to 10 kilobytes, which is less than the server's initial TCP congestion window. Thus, it takes approximately 2 RTTs to download the pixel tag. The JavaScript records the RTTs at the client side and periodically uploads the measurements to a database server.

Figure 4.2 Transit performance measurements: (1) the host requests http://example.com/index.html; (2) the server returns the web page with embedded dynamic JavaScript; (3) the client requests the pixel tag via each transit provider (http://IP-address1/pixel_tag?nonce1 and http://IP-address2/pixel_tag?nonce2); (4) the pixel tags are downloaded via Transit Provider 1 and Transit Provider 2; (5) the measurement data is uploaded to a database.

Our dataset consists of active measurements conducted from 19 measurement servers which are located at different IXPs. Each record in the dataset consists of timestamps indicating the measurement time, identifiers for the transit providers, download time values obtained for each transit provider, client IP address, client AS number, and ISP name.
In total, the dataset consists of 6 million entries which were recorded from more than 2 million IP addresses distributed across 16,752 ASes. All user identifiers in the dataset are anonymized to protect the privacy of users. The measurement time span is distributed over a period of one year. Note that a measurement is recorded only when a client sends a request to download a web page. Therefore, the scope and size of our measurements depend on content popularity and user demographics. However, the CDN discussed in this study has thousands of clients covering a large number of ASes, which include most major residential broadband providers. We aggregate performance measurements on an hourly basis and use one hour as the time resolution for the rest of our analysis.

As shown in Figure 4.3(a), the 19 measurement servers are located at geographically distributed IXPs and cover all major continents, including North America, Europe, Asia, and Australia. We used publicly available IP geolocation databases to locate host IP addresses. Figure 4.3(b) visualizes the IP address space of all host IP addresses in our data set. The plot divides the IP space by the first two most significant bytes as axes. The color indicates the number of records from each /16 IP block, where brighter colors represent more records and darker colors represent fewer records. The plot shows that our measurements cover a large chunk of the IP space. Some of the empty portions represent the space for reserved IP addresses. Other large empty portions correspond to unobserved (i.e., not served by the CDN servers), non-allocated, or inactive IP addresses.

Figure 4.3 Deployment summary: (a) red dots represent the locations of measurement servers at IXPs and black dots represent the locations of end users; (b) each dot indicates the count of measurements from a /16 IPv4 range block.

Table 4.1 provides summary statistics of our measurements. Each row enumerates the total number of measurements, number of unique host IPs, number of unique ASes, and median download time for each transit provider. The three-letter abbreviations in the first column denote the various transit providers used by the CDN at various IXPs throughout the world. We note that Nippon Telegraph and Telephone (NTT) and Telia (TEL) provide the most coverage (in terms of number of IPs and ASes) to the CDN. Other transit providers are used at IXPs when NTT and TEL do not cover certain IXPs, or when NTT or TEL are temporarily unavailable. The median download times for NTT and TEL are 288 and 286 milliseconds, respectively.

Table 4.1 Data set summary statistics

Transit Provider | #Records (x1000) | #IPs (x1000) | #ASes  | Median Time (ms)
NTT              | 5,435.7          | 2,013.2      | 15,559 | 288
TEL              | 5,217.5          | 1,923.5      | 15,616 | 286
DTA              | 1,869.9          | 772.3        | 7,969  | 297
PAC              | 552.9            | 236.8        | 1,910  | 488
PCC              | 533.7            | 228.8        | 1,775  | 427
AAP              | 192.3            | 814.3        | 189    | 241

4.3.2 Analysis and Discussions

We first analyze the performance characteristics of different transit providers using our measurements. We are particularly interested in understanding where and when one transit provider outperforms another. To this end, we explore spatial (where) and temporal (when) variations in transit provider performance differences.
In this work, we use the time to download the pixel tag on each transit route as the measure of its performance. If t1 and t2 denote the download times of the pixel tag via two transit providers, then let δt = |t2 − t1| denote the performance difference in terms of download time. In the rest of this section we use the performance difference δt as the performance metric. Below, we focus our analysis on two major transit providers (NTT and TEL) across four popular POP locations (Paris, Madrid, San Jose, Chicago).

Spatial variations. Figure 4.4 plots the cumulative distribution of the percentage performance difference between transit providers for users across the four POP locations. The x-axis represents users and the y-axis represents the percentage performance difference between the two transit providers for a given user. The plot indicates the portion of users who receive better performance from one transit provider versus the other. A positive percentage difference indicates that NTT has better performance, and a negative percentage difference indicates that TEL has better performance. Each curve is aggregated over all ASes that interconnect at the corresponding POP location. We observe that NTT provides significantly better performance than TEL for users on the left of the x-axis, while TEL outperforms NTT for users on the right side of the x-axis. For the Madrid POP, we note that NTT outperforms TEL by 10% for approximately 40% of users and TEL outperforms NTT by 10% for approximately 10% of users. In sum, 50% of users experience a significant performance difference across the two transit providers; the remaining 50% would fare similarly on either transit provider. Using the same 10% performance difference threshold, we note that one of these transit providers outperforms the other for 55% of users at the Paris POP, 30% of users at the Chicago POP, and 30% of users at the San Jose POP. Our findings highlight that there is no outright best transit provider for all users, and a careful choice of transit provider is necessary.

Figure 4.4 Spatial performance variations for different POPs: percentage performance difference vs. users (%); positive values mean NTT is better, negative values mean TEL is better.

Temporal variations. To investigate temporal variations in performance differences between transit providers, we analyze our performance measurements over time. For each POP location, we again plot the distribution of the percentage performance difference between transit providers for users in different one-month billing periods. Figure 4.5 plots the curves for the four popular POP locations and for three different billing periods. We observe that for the Paris and Madrid POPs, one transit provider outperforms the other for a vast majority of users across all billing periods. For example, TEL outperforms NTT for up to 70% of users at the Paris POP and NTT outperforms TEL for up to 65% of users at the Madrid POP. While the overall trend remains the same for all POPs, we observe changes across different billing periods. For instance, the performance difference between NTT and TEL increases over time for the San Jose POP. More specifically, NTT has equal or better performance as compared to TEL for most users in July. However, TEL's performance improves relative to NTT's over the next billing months. By November, TEL outperforms NTT for more than 25% of users while NTT outperforms TEL for approximately 10% of users.
Similar temporal variations in performance differences can be observed for all POPs. During certain consecutive months, two transit providers may have similar performance difference characteristics; as a result, we may not observe variations over consecutive months. Therefore, we choose different months for different POPs to show variations across different geographical locations at different times.

Figure 4.5 Performance difference of transit providers for different billing periods at (a) Paris (January, February, October), (b) Madrid (February, May, September), (c) San Jose (July, August, November), and (d) Chicago (July, August, November); positive values mean NTT is better, negative values mean TEL is better.

To understand finer-grained temporal variations in transit performance, we analyze the time series of the percentage difference between transit providers across AS-POP pairs. We plot the hourly performance time series for the Telefonica-Madrid AS-POP pair in Figure 4.6. A positive value of t_TEL − t_NTT indicates that NTT's performance is better than TEL's performance. For this particular AS-POP pair, we observe that NTT's performance is consistently better than TEL's performance. We plot the hourly performance time series for the Comcast-San Jose AS-POP pair in Figure 4.7. We observe that NTT's performance is generally better than TEL's performance during the first three weeks. However, the trend is quickly reversed in the fourth week. The reversal in performance seems to be due to improved performance of TEL (e.g., an infrastructure upgrade) rather than degraded performance of NTT (e.g., a traffic pattern shift). For the rest of the measurement period, TEL consistently outperforms NTT. Figures 4.6(b) and 4.7(b) show the zoomed-in weekly time series plots. We note diurnal variations in t_TEL − t_NTT, which indicate congestion during peak hours.

Figure 4.6 Difference in transit performance for AS-POP pair Telefonica-Madrid: (a) performance during 9 weeks; (b) zoomed-in plot for week no. 2.

Figure 4.7 Difference in transit performance for AS-POP pair Comcast-San Jose: (a) performance during 9 weeks; (b) zoomed-in plot for week no. 4.

4.4 Problem Formulation

Due to the rich connectivity at IXPs, a CDN can choose from multiple (often dozens of) transit options to route traffic to end users. In particular, the traffic to an AS can be routed via one of many transit providers.
As discussed earlier, the dynamics of cost and performance put a manual solution out of reach: we need to optimize both cost and performance for each destination AS. Our discussions with network engineers revealed that this optimization is often done manually on an hourly basis, and misconfigurations and performance issues are quite common. Due to the large number of ASes, network engineers typically do the manual optimization only for large ASes or the ones with poor performance [149, 2, 5]. Thus, we need automated solutions to the cost-performance optimization problem that can be configured for different cost-performance tradeoffs.

We formally define the dynamic cost-performance optimization problem for Internet transit selection using constraint programming. In a constraint programming approach, an objective function is formulated with a set of constraints on the variables of the objective function. Then a solution is obtained such that the variable values are within the specified constraints [53]. Constraint programming allows us to formulate the optimization problem in a form that can be used to obtain a tradeoff curve between cost and performance. More specifically, we design an objective function that minimizes a utility cost function. The function is a weighted sum of the cost and performance of a transit provider selection strategy. Based on the tradeoff curve, CDNs can adapt transit provider selection to obtain the desired tradeoff.

For a POP Pj, consider a transit provider i that charges r_i units of cost to the CDN. Each POP serves multiple ASes and each AS can be served through multiple transit providers. For an AS-POP pair ⟨Ak, Pj⟩, let b^i_{k,j} and d^i_{k,j} respectively be the estimated bandwidth usage and performance through transit provider i. For usage-based pricing, r_i is a fixed amount that transit provider i charges on an hourly basis. For 95th percentile pricing, we consider a billing period of one week and compute the 95th percentile of hourly bandwidth usage; we use this bandwidth usage to compute the transit costs. Specifically, for transit provider i the transit cost is Σ_{k,j} r_i × b^i_{k,j}, where b^i_{k,j} represents the 95th percentile value. We use the number of records observed in each hour as an estimate of hourly bandwidth usage. The performance d^i_{k,j} is measured in terms of download time in milliseconds.

The CDN needs to select a transit provider for each AS-POP pair ⟨Ak, Pj⟩. Let x^i_{k,j} be the optimization variable that assigns transit provider i to AS-POP pair ⟨k, j⟩; thus x^i_{k,j} ∈ {0, 1}. We formulate the minimization objective function as follows:

minimize Σ_{k,j} Σ_i ( r_i x^i_{k,j} b^i_{k,j} + γ x^i_{k,j} d^i_{k,j} )    (4.1)

The objective function minimizes the utility cost of assigning transit providers to POP-AS pairs. The utility cost function consists of two terms. The first term computes the cost in dollars charged by the selected transit providers, weighted by the bandwidth usage. The second term is the median download time observed on the transit provider, multiplied by a tunable parameter γ. We choose to model performance as part of the objective function because modeling performance as a constraint may render the problem unsolvable, as the performance of all transit providers may fail to satisfy the constraint. It is also possible that multiple transit providers satisfy the constraint, resulting in the selection of a transit provider with poorer performance. The CDN operator can vary γ to obtain the desired tradeoff between cost and performance.
Essentially, the overall utility is the weighted sum of cost and performance. Smaller values of γ push the optimizer towards a minimum-cost solution, whereas larger values of γ push it towards a better-performance solution. The objective function is subject to the following constraints:

Σ_i x^i_{k,j} = 1  ∀ k, j    (4.2)

x^i_{k,j} ∈ {0, 1}    (4.3)

Σ_{k,j} x^i_{k,j} b^i_{k,j} ≤ C_i    (4.4)

Here C_i is the capacity of transit provider i. The first constraint ensures that only one transit provider is assigned to a POP-AS pair. The second constraint ensures that assignments are integral. The third constraint sets the limit on the maximum bandwidth utilization of each transit provider.

Note that the second constraint, which ensures integral assignments, results in a combinatorial optimization problem, where the optimal solution can be obtained by evaluating the objective function over all possible assignments. We can utilize openly available Mixed-Integer Programming (MIP) solvers to obtain the optimal assignment. However, in practice each POP connects to thousands of ASes through multiple transit providers. Since Integer Programming (IP) is NP-complete, it is not feasible to solve it for large input sizes [83]. We provide a proof of its NP-completeness by reduction from the boolean satisfiability problem 3-SAT. The boolean satisfiability problem has been proven to be NP-complete using the Cook-Levin theorem. By reducing 3-SAT to IP we prove that IP is NP-complete. A SAT instance is composed of variables and clauses. Variables in the SAT instance are boolean, and their values should be such that each clause evaluates to boolean TRUE.

Proof. For the 3-SAT problem, consider the boolean variables b1, b2, ..., bn. Consider a similar IP instance with the same number of integer variables, denoted x1, x2, ..., xn. The value of each integer variable is constrained by the inequality:

0 ≤ x_i ≤ 1  ∀ i    (4.5)

For any 3-SAT clause, there is a corresponding constraint. Consider the 3-SAT clause:

b1 ∨ ¬b2 ∨ b3    (4.6)

The corresponding constraint is given by the following inequality:

x1 + (1 − x2) + x3 ≥ 1    (4.7)

This reduction can be done in polynomial time.

Thus, we relax the integrality constraint to allow fractional assignments, i.e., x^i_{k,j} ∈ [0, 1]. This relaxation allows us to obtain an optimal solution with fractional assignments in polynomial time using standard Linear Programming (LP). The following lemmas state the relationships between the solutions obtained through the LP relaxation and the Integer LP (ILP) [122, 138].

Lemma 4.4.1. Let S_LP be the set of feasible solutions of the LP and let S_ILP be the set of feasible solutions of the ILP. All feasible solutions of the ILP are also feasible solutions of the LP, i.e., S_ILP ⊂ S_LP. For the minimization problem, the solution obtained through the relaxed LP has a smaller (or equal) value than the solution obtained through the original ILP, i.e., min(LP) ≤ min(ILP).

An optimal solution of the LP may be found at the boundaries of the convex hull formed by the linear constraints. However, the feasible solutions of the ILP are given by a set of points inside the convex hull formed by the linear constraints. This set of points does not form a convex set. Hence, the value of the optimal solution of the minimization problem in the LP is less than or equal to the value of the ILP optimal solution. Following the inequality in Lemma 4.4.1, we next state the condition under which the optimal solutions of the LP and the ILP are equal.
If the optimization variable x^i_{k,j} of the LP takes integral values in X^l, then X^l is also an optimal solution of the ILP and hence an optimal solution of the original problem, i.e., X^l = X^o.

A fully integral LP optimum, as stated in the lemma, generally does not exist. Several heuristics have been proposed in the literature to obtain an integral solution from the relaxed LP [43]. In this work, we use the relaxed version of the minimization problem; in our empirical evaluation, the solution to the relaxed problem contains only a small percentage of fractional values. The fractional assignments can be dealt with heuristically through randomized rounding or greedy assignments [116]. Another approach to dealing with fractional assignments is to employ fractional routing, which can be realized by hash-based splitting or through multi-homing agents [25, 104].

4.5 Results

In this section, we discuss the cost and performance benefits of our proposed optimization approach for transit selection by CDNs. For pricing, we use the usage-based and 95th percentile pricing models to compute the costs incurred by the CDN in using the various transit providers. On an hourly basis, we compute the median bandwidth usage for each AS-POP pair. For usage-based pricing, we study transit provider selection in two different scenarios: first, when all transit providers charge equal pricing rates (per Mbps), and second, when the transit providers charge different pricing rates. The optimization framework works on an hourly basis, and the optimization runs for each POP independently over all ASes and transit providers; therefore, differences in pricing across POP locations (which are very common) do not affect transit provider selection across POPs. For 95th percentile charging, we use one week as the billing period in our experiments. 95th percentile charging is based on the 95th percentile of all measurements over the billing period; therefore, the optimization framework works on a weekly basis under this charging model. For performance, we use the download times of active measurements over a year as input to the optimization problem. For bandwidth, we use the number of records observed in each hour as an estimate of bandwidth usage in that hour: since the number of records depends on clients requesting the pixel tags, a larger number of requests indicates greater bandwidth usage. Specifically, we compute the median download time in every hour for all transit providers for each POP-AS pair; for a given transit provider, our system records several measurements for each AS-POP pair during a particular hour, and we use the median value of these measurements in our experiments. We use an open source implementation of the Embedded Conic Solver (ECOS) for solving the optimization problem. Given the solver output, we can compute the performance and cost of traffic via the transit providers. While our optimization framework scales to more than two transit providers, we limit our analysis to two transit providers (namely NTT and TEL) for simplicity. A minimal sketch of the relaxed formulation, as it would be passed to such a solver, is shown below.
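The following is a hedged sketch of the relaxed LP, assuming the cvxpy modeling library with its ECOS backend; the toy instance (two providers, three ASes, and the arrays r, b, d, C) is a hypothetical placeholder for the measured data.

import cvxpy as cp
import numpy as np

# Hypothetical toy instance for one POP: 2 providers, 3 destination ASes.
n_prov, n_as = 2, 3
r = np.array([1.0, 1.2])                          # r_i: price per unit of bandwidth
b = np.array([[4.0, 2.0, 5.0], [3.0, 2.0, 6.0]])  # b[i, k]: estimated bandwidth usage
d = np.array([[120.0, 90.0, 200.0],
              [100.0, 140.0, 150.0]])             # d[i, k]: median download time (msec)
C = np.array([10.0, 10.0])                        # C_i: provider capacities
gamma = 1.0

x = cp.Variable((n_prov, n_as))                   # relaxed assignment, x in [0, 1]
objective = cp.Minimize(cp.sum(cp.multiply(r[:, None] * b, x))
                        + gamma * cp.sum(cp.multiply(d, x)))
constraints = [x >= 0, x <= 1,
               cp.sum(x, axis=0) == 1,                  # Eq. 4.2 (Eq. 4.3 relaxed)
               cp.sum(cp.multiply(b, x), axis=1) <= C]  # Eq. 4.4
cp.Problem(objective, constraints).solve(solver=cp.ECOS)
print(x.value)  # mostly integral in practice, per Lemma 4.4.2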
4.5.1 Tradeoff Analysis

We first study the tradeoffs between cost and performance in Internet transit selection. By varying γ, we obtain a tradeoff curve between cost and performance for each POP. For each hour, we compute the cost and download time using the first and second terms of our objective function, respectively. We repeat this process for different γ values and obtain a tradeoff curve between total cost and average download time per AS. Each point on the tradeoff curve represents a transit provider selection strategy for all ASes served by the POP. We solve the optimization problem every hour and average over hours to obtain the tradeoff curve.

[Figure 4.8: Cost-performance tradeoff curves for γ ranging from 0 to 5, for the Paris, Madrid, San Jose, and Chicago POPs. (a) Usage-based pricing; (b) 95th percentile pricing. Each panel plots normalized cost against download time (msec).]

Figure 4.8 plots the cost-performance tradeoff curves for selection between two transit providers at four POP locations. We plot the tradeoff curves for the usage-based and 95th percentile pricing models in Figures 4.8(a) and (b), respectively. The x-axis is the average download time per AS (in milliseconds) and the y-axis represents the incurred transit costs (normalized). Each point on a curve represents the average cost and performance over a period of one month for a given γ value. To plot the tradeoff curves, we vary the value of γ from 0 to 5 in increments of 0.1. For smaller values of γ, we obtain solutions with lower costs but worse performance. For γ = 0, we get the lowest-cost solution, which selects transit providers with minimum costs irrespective of their performance. As we increase the value of γ, we give higher weight to performance and obtain higher-cost solutions with better performance. For γ = 5, we get the best-performance solution, which always selects the transit providers with the lowest download time. The pairs of dots on each curve mark the minimum-cost solution for γ = 0 and the best-performance solution for γ = 5. The lines are extrapolated to show solutions that would be obtained for values of γ less than 0 and greater than 5. Choosing larger values of γ produces solutions on the vertical portion of the tradeoff curve, meaning that we always obtain the same best-performance solution regardless of the cost. Therefore, we only choose values of γ that produce meaningful tradeoffs between cost and performance.

Comparing Figures 4.8(a) and (b), we note that the CDN ends up paying more under the 95th percentile pricing model. For the Paris and Madrid POPs, we observe smooth tradeoff curves, indicating that the CDN can select among a number of strategies by varying γ around the knee of the tradeoff curve. We also observe that, regardless of the pricing model, the CDN can save more than 50% on cost with a performance degradation of less than 40 milliseconds. For the Chicago and San Jose POPs, we observe sharp knees in the tradeoff curves, which indicate that slightly changing γ has a substantial impact on cost and performance. For these POPs, the CDN can save on average 30% on cost with a performance degradation of less than 20 milliseconds. For large values of γ, the CDN may end up paying much more even though there is room for savings without significantly degrading performance. Thus, there is very limited flexibility for CDNs in terms of optimal transit provider selection. In summary, the tradeoff curves under the usage-based pricing model and the 95th percentile charging model show that the CDN can reduce transit cost by 57% and 35%, respectively, without significant degradation in performance. A sketch of this γ sweep is shown below.
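As a sketch, the tradeoff curve can be traced by sweeping γ and recording the two terms of the objective separately; solve_relaxed_lp is a hypothetical wrapper around the LP sketch shown earlier, returning the assignment matrix x for a given γ.

import numpy as np

def tradeoff_curve(r, b, d, C, gammas, solve_relaxed_lp):
    """Sweep gamma and record one (avg download time, total cost) point each."""
    points = []
    for g in gammas:
        x = solve_relaxed_lp(r, b, d, C, gamma=g)  # hypothetical LP wrapper
        cost = float(np.sum(r[:, None] * b * x))   # first term of Eq. 4.1
        perf = float(np.sum(d * x)) / b.shape[1]   # average download time per AS
        points.append((perf, cost))
    return points

# e.g., gammas = np.arange(0.0, 5.1, 0.1), as used for Figure 4.8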
4.5.2 Transit Selection Analysis

Recall that transit provider selections are the outputs of our optimization framework, i.e., the values of the target variable x^i_{k,j} provide the selection results. We analyze transit provider selections for three different usage-based pricing scenarios: (1) both transit providers charge an equal rate, (2) NTT charges a higher rate than TEL, and (3) TEL charges a higher rate than NTT. For the equal pricing scenario, the optimization framework provides the performance-optimal solution. For the unequal pricing scenarios, the optimization framework selects the best transit provider in terms of both cost and performance.

Table 4.2: Summary of assignments (% of hours) for different pricing rates and POP locations (NTT/TEL: integral selections; Frac: fractional assignments).

                              Paris              Madrid             San Jose           Chicago
                          NTT   TEL   Frac   NTT   TEL   Frac   NTT   TEL   Frac   NTT   TEL   Frac
NTT and TEL charge equal  59.5  75.3  2.1    75.3  62.7  1.8    72.3  67.6  7.8    63.6  79.6  9.9
NTT charges higher price  40.0  87.4  0.1    59.6  78.1  0.0    41.7  89.7  0.1    41.3  90.5  0.0
TEL charges higher price  78.4  57.3  0.86   87.3  38.8  0.0    91.6  34.3  0.2    84.2  52.9  0.0

[Figure 4.9: Transit provider selection for different pricing scenarios: (a) equal pricing rate; (b) NTT charges a higher pricing rate; (c) TEL charges a higher pricing rate. Each panel plots the TEL−NTT performance difference (msec, left axis) and the TEL/NTT assignments (right axis) over three weeks.]

Figure 4.9 plots the timeseries of the performance difference between TEL and NTT for an example POP-AS pair. The primary y-axis (left side) shows the difference in performance measurements between the two transit providers. The secondary y-axis (right side) shows the output selection values for transit provider TEL, where 1 indicates that TEL is selected and 0 indicates that NTT is selected. Blue dots represent transit provider selections (right y-axis). Figure 4.9(a) shows selections when both transit providers charge equal rates. We observe that during the first four weeks TEL performs worse than NTT, and as a result TEL is not selected. In the later half, the performance difference becomes negative, indicating that TEL is performing better than NTT; thus TEL is selected during the later half. Figures 4.9(b) and (c) show selections when NTT and TEL charge different rates. In both cases, we selected a γ value that provides the best tradeoff between performance and cost, i.e., the knee of the tradeoff curve. When NTT charges more, TEL is mostly selected even when TEL has worse performance. When TEL charges more, we see selections of both NTT and TEL. This shows the impact of the performance difference and of γ on transit provider selection. In the later half, the performance difference is larger, and as a result the optimal selection is the best-performing transit provider. Therefore, even though TEL charges more than NTT, TEL is selected because the performance improvement trumps the additional cost.

In order to systematically quantify the impact of pricing on transit route selection, we compare selections for varying pricing rates. Figure 4.10 plots the percentage of selections obtained from the optimization framework for a varying price ratio between NTT and TEL. The y-axis shows the fraction of the total number of hours in the input data. Each pair of symmetric red/blue lines corresponds to a given γ value. The fraction of NTT selections is shown in blue; the fraction of TEL selections is shown in red and is symmetric to the corresponding blue line. The price ratio is calculated as TEL price / NTT price. The plot shows that when TEL is 10 times cheaper than NTT, TEL has the larger percentage of selections.
Furthermore, when TEL is 10 times more expensive than NTT, selections of NTT reach more than 90% for smaller γ values and more than 85% for larger γ values. Overall, we observe that as γ increases, the fraction of NTT selections also increases regardless of the price ratio. This finding indicates that NTT is overall favorable in terms of performance.

[Figure 4.10: Transit provider selection for varying price ratio (TEL price / NTT price): fraction of NTT and TEL selections for increasing γ values.]

We also observe some fractional transit assignments produced by the optimization framework. We summarize the integral and fractional assignments as follows. Let δ_tp denote the total number of hours during which transit provider tp is better. Let Sel_tp denote the number of hours during which transit provider tp is better and is selected, i.e., x^i_{k,j} = 1 for i = tp. Then Sel_tp / δ_tp is the fraction of hours during which the performance difference was large enough to select transit provider tp. For cases with non-zero values of γ and unequal transit provider pricing rates, the magnitude of the performance difference has to be large enough to affect transit route selection. In summary, for the equal pricing scenario, NTT and TEL were selected integrally 72.3% and 67.6% of the time, with fractional assignments 7.8% of the time. When NTT charges more, these percentages are 41.7%, 89.7%, and 0.1%. When TEL charges more, these percentages are 91.6%, 34.3%, and 0.2%.

We summarize the transit provider selections for the four POP locations in Table 4.2. We observe that pricing rates have a major impact on transit provider selections at all POP locations. We also observe that the percentage of fractional assignments is larger when both transit providers charge equal pricing rates. Since the percentage of fractional assignments is generally small, we deem it unnecessary to use heuristics to obtain integral solutions in polynomial time.

We next analyze the effect of varying the weight factor γ on transit provider selections. We plot normalized frequency histograms of TEL and NTT selections against performance difference values. Figure 4.11 shows the histograms for the three charging scenarios at two POP locations. Figure 4.11(a) shows that under equal cost, changing γ values does not affect transit provider selections.
[Figure 4.11: Effect of γ values on transit provider selection (Paris): normalized histograms of TEL and NTT selections (%) against the TEL−NTT performance difference for (a) equal cost, (b) NTT charges more, and (c) TEL charges more.]

We observe that the percentage of TEL selections exceeds that of NTT when TEL provides better performance (negative x-axis). Similarly, the percentage of NTT selections exceeds that of TEL when NTT provides better performance (positive x-axis). In Figure 4.11(b), we observe no change in the percentage of TEL selections when TEL provides better performance; i.e., when TEL is both better and cheaper, TEL is always selected irrespective of the choice of γ. However, we observe that as γ increases, the fraction of NTT selections also increases when NTT provides better performance. This pattern indicates that when NTT is better but more expensive, increasing γ gives more weight to performance, and as a result the percentage of NTT selections increases. We observe similar patterns in Figure 4.11(c). In summary, a transit provider's selection does not significantly change with varying γ values if it provides better performance and is cheaper.

We further analyze the timeseries of bandwidth usage for competing transit providers in Figure 4.12. Note that bandwidth usage is calculated using the transit provider selection information and the traffic volume for each AS-POP pair.

[Figure 4.12: Analysis of transit provider bandwidth usage (Paris): estimated bandwidth (Mbps) of TEL and NTT over time under equal pricing, ↑TEL/↓NTT, and ↓TEL/↑NTT scenarios.]

The ↑ symbol denotes the more expensive transit provider and the ↓ symbol denotes the cheaper transit provider; no arrows indicate that both transit providers have equal costs. For these AS-POP pairs, we observe that under equal costs the best-performing transit provider is selected most of the time. We also observe some cases where both transit providers are used, indicating fractional assignments. The timeseries also show short-term and long-term traffic shifts. In Figure 4.12, we note long-term traffic shifts when transit provider TEL is more expensive: most of the traffic is routed through TEL during the first three weeks, whereas during the later half most of the traffic is routed through NTT, which is cheaper.

[Figure 4.13: Pareto fronts for different hours (01:00 am, 03:00 am, 09:00 am), plotting normalized cost against normalized download time: (a) solutions from pre-selected γ values; (b) solutions obtained from the recursive method (Algorithm 2).]

4.6 Pareto Front Generation

In this section, we extend our tradeoff results by generating sets of Pareto-optimal solutions on an hourly basis. The aggregate tradeoff curves in Figure 4.8 are obtained by averaging the Pareto fronts of several hours. For any particular hour, a Pareto front is generated by using a set of pre-selected values of the weighting factor γ. CDN operators need to perform transit route selection on an hourly basis; therefore, they have to navigate the cost and performance tradeoffs on an hourly basis.
Operators can analyze the Pareto front for any hour and navigate the tradeoffs to choose a solution from the set of Pareto solutions. A pre-selected set of γ values is not enough to navigate the cost-performance tradeoffs: there may be several unexplored regions of the Pareto front where better tradeoffs can be achieved. Figure 4.13(a) shows the Pareto fronts obtained for different hours using the pre-selected set of γ values. The figure shows that for some hours the Pareto-optimal solutions are not evenly distributed over the Pareto front. In addition, the solutions are far apart, and the tradeoff characteristics of the Pareto front are not completely visible. In order to obtain a well-distributed set of solutions on the Pareto front, the set of γ values has to be selected carefully. To find solutions in the unexplored regions of the Pareto front, we use a recursive approach proposed by Kim et al. [88], which explores the Pareto front by recursively selecting a new set of γ values for the unexplored regions.

Algorithm 2 describes the recursive method for Pareto front generation. An initial set of Pareto solutions P_0, corresponding to a pre-specified set of γ values, is used as input to the method. First, the Euclidean distance D_i of each segment formed by consecutive points in P is calculated. If the length of segment i is less than a threshold ε, the two solutions (p_i, p_{i+1}) are nearly overlapping; solutions this close to each other on the Pareto front offer no significant cost-performance tradeoff, so we do not explore the solution space between the two points. For a segment i formed by Pareto solutions (p_i, p_{i+1}) with length greater than ε, we obtain a set of γ values G_i. All γ values in G_i range from the γ value corresponding to p_i to the γ value corresponding to p_{i+1}. Next, we obtain a set of Pareto solutions P_i for all γ values in G_i. Finally, we use the Pareto solutions in P_i to recursively obtain solutions in the unexplored regions of the Pareto front.

Algorithm 2 Generating Pareto fronts recursively
Input: (1) Initial set of γ values: G_0 = {γ_1, γ_2, ..., γ_n}
       (2) Initial set of Pareto solutions: P_0 = {p_1, p_2, ..., p_n}
       (3) Minimum step size ε
Output: Final set of Pareto solutions: P
 1  Function ParetoFront(G_0, P_0, ε)
 2    for i = 1, ..., n-1 do
 3      D_i = Distance(p_i, p_{i+1})
 4    D_avg = (Σ_i D_i) / n
 5    for i = 1, ..., n-1 do
 6      if D_i > ε then
 7        step_i = round(D_i / D_avg)
 8        G_i = ComputeSet(step_i, p_i, p_{i+1})
 9        P_i = Minimize(G_i)
10        P = ParetoFront(G_i, P_i, ε)
11        return P
12      else
13        step_i = 0
14        P = {p_i, p_{i+1}}
15    return P

Figure 4.13(b) shows the Pareto fronts obtained through the recursive approach. In comparison to Figure 4.13(a), we observe that several solutions are obtained in the previously unexplored regions of the Pareto fronts. For example, the Pareto front of Paris at 1 am has new solutions in the region between normalized download times of 0.6 and 0.9. We compare the Pareto front solutions obtained before and after applying the recursive approach using the variance of the lengths of the segments formed by solutions on the Pareto front. The solutions in Figure 4.13(b) have approximately three orders of magnitude lower variance. This shows that the cost-performance tradeoff is more flexible for CDN operators, as they can choose from a variety of solutions that are evenly distributed on the Pareto front. A sketch of the recursion follows.
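The following is a minimal, self-contained sketch of the recursion under stated assumptions: solve_for_gamma stands in for an LP solve (Minimize in Algorithm 2) that returns a (download time, cost) point for a given γ, ComputeSet is realized as evenly spaced γ values between the endpoints of a segment, and a depth guard bounds the recursion.

import math

def pareto_front(gammas, points, solve_for_gamma, eps, depth=5):
    """Recursively refine a Pareto front (sketch of Algorithm 2).

    gammas -- sorted gamma values; points -- one (perf, cost) point per gamma;
    solve_for_gamma -- hypothetical function: gamma -> (perf, cost);
    eps -- minimum segment length worth exploring.
    """
    result = list(points)
    if depth == 0 or len(points) < 2:
        return result
    dists = [math.dist(points[i], points[i + 1]) for i in range(len(points) - 1)]
    d_avg = sum(dists) / len(dists)
    for i, dist in enumerate(dists):
        if dist > eps:
            # number of new gamma values for this under-explored segment
            steps = max(2, round(dist / d_avg) + 1)
            lo, hi = gammas[i], gammas[i + 1]
            new_g = [lo + (hi - lo) * t / steps for t in range(1, steps)]
            new_p = [solve_for_gamma(g) for g in new_g]
            result += pareto_front(new_g, new_p, solve_for_gamma, eps, depth - 1)
    return result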
4.7 Related Work

There is prior work on route selection from the perspective of stub ISPs, edge networks, and enterprise networks. Several studies have measured and analyzed the performance and pricing aspects of such multi-homed networks. In contrast, this chapter studies transit route selection at IXPs by CDNs.

Multi-homing: Akella et al. presented a large-scale measurement-based performance analysis of multihoming [25]. The authors evaluated multihoming performance from the perspectives of enterprises and content providers, and found that multihoming improves performance. In [44], Dhamdhere et al. proposed a two-step methodology for optimizing cost and performance in multi-homing. Tao et al. also explored the performance benefits of path switching in multi-homing and overlay network scenarios [134]. In [56], Yang et al. proposed smart routing algorithms to optimize cost and performance for multihomed enterprises and stub ISPs; the customer ISPs are assumed to be multihomed to a set of ISPs whose links they share for multihoming purposes, and the proposed algorithms dynamically assign traffic to achieve optimal cost and performance. In [54], the authors showed that route control systems in multihomed networks can cause traffic oscillations between available paths, which has adverse effects on performance; the authors used bandwidth information and randomization in their algorithms to avoid oscillations. In [140], Wang et al. studied the monetary aspects of multihoming: they proposed a dynamic programming algorithm to obtain a subset of ISPs for multihoming such that the customer's cost is minimized, and they also studied the dynamics of ISP pricing strategies in response to customer subscriptions. More recently, in [149] and [104], the authors proposed measurement and optimization frameworks for service provider selection for content delivery from content providers to their users.

Compared to prior work on multi-homing, our work differs in two major dimensions. First, we focus on multi-homed CDN servers at IXPs, which offer a large number of possible interconnections between CDNs and ISPs; prior work on multi-homing assumes the multi-homed network to be an edge network, whereas in our case we deal with multi-homed CDN servers at IXPs. Second, our light-weight measurement methodology is unique in that it allows us to measure performance across multiple transit routes simultaneously and at a large scale.

Transit pricing: Pricing of Internet transit has been studied from two perspectives. The first is the transit ISP perspective, where transit ISPs aim to maximize their profit while satisfying service quality requirements and customer traffic demands. Prior work has studied the impact of various pricing strategies on ISP profits, such as time-dependent pricing, tiered pricing, and pricing differentiation [77, 87, 62]. In particular, Valancius et al. studied the effect of destination-based tiered pricing on ISP profit [137]; the authors developed demand and cost models using real-world traffic data and showed the impact of various traffic bundling strategies on ISP profits. In contrast, we study pricing from the transit ISP customer's perspective, where CDNs aim to minimize transit costs while achieving the best performance. For transit cost minimization, prior work has studied various transit link sharing strategies [130, 60, 26, 37]. Our work focuses on cost reduction by selecting the best available transit route that also provides the best performance.

Server deployment: Some prior work focuses on optimizing server deployments by CDNs.
In [115], the authors study the problem of online placement of servers to minimize cost in terms of latency, hop count, and economic cost. In [51], the authors propose a system that enables CDN-ISP collaboration, leading to informed server placement and end-user-to-server assignment. In [65], the authors formulate and solve the cache deployment optimization problem to minimize CDN deployment costs while maintaining end-user performance.

4.8 Conclusions

In this chapter, we studied the problem of optimal Internet transit selection from the perspective of CDNs. To the best of our knowledge, transit route selection at IXPs by CDNs has not been studied in prior literature. We make the following key contributions in this chapter. First, we propose a method to conduct simultaneous performance measurements across the multiple transit routes that CDNs maintain for delivering content to access ISPs. Second, we provide a spatio-temporal characterization of the performance differences observed across multiple transit routes; our findings highlight that there is no outright best transit provider. Third, we formulate optimal transit route selection as an optimization problem and analyze the tradeoff curves and resulting selection strategies for different transit route pricing. We observe sharp knees in the tradeoff curves for some geographical regions, indicating that there is very limited flexibility for CDNs in choosing a suitable transit provider. Without an optimal selection strategy, CDNs may end up sacrificing performance and/or cost. Using our proposed approach, CDNs can reduce transit costs on average by 57% without sacrificing performance.

5 Measurement and Analysis of Adult Traffic

5.1 Introduction

5.1.1 Background

As the saying goes: "The Internet is for porn" [30]. While it is difficult to estimate how much porn content is available on the Internet [142], several widely varying estimates are available. According to one estimate, there are at least 4 million adult websites on the Internet, constituting approximately 12% of all websites [13]. It has also been reported that adult websites have more monthly unique visitors than Netflix, Twitter, and Amazon combined [14, 109]. Furthermore, multiple websites in Alexa's top-500 global list serve adult content [1]. Moreover, a recent measurement study of a tier-1 ISP reported that at least 15% of all mobile video traffic in the United States is from adult content providers [48]. Overall, these statistics indicate that online adult content attracts a large number of users and accounts for a substantial fraction of global Internet traffic. However, despite its significant volume, little is known about the makeup and characteristics of online adult traffic. Understanding online adult traffic is important for optimizing its content delivery, which involves complex interactions between content delivery networks and ISPs.

5.1.2 Limitation of Prior Art

There is little prior work dedicated to the analysis of online adult websites. Only recently, Tyson et al. [135, 136] studied behavioral aspects of two popular adult sites: YouPorn and PornHub. These studies reveal several unique characteristics of adult content, such as the elasticity of adult content consumption – users consume whatever they find on the front pages of adult websites.
However, the utility of these studies is limited for three main reasons: (1) they rely on website data obtained by crawling, which is limited in terms of both temporal coverage and granularity; (2) their analysis cannot distinguish among users because it relies on aggregate view counts of adult videos; and (3) these studies focus on the behavioral and demographic aspects of two specific adult websites, and it is unclear whether their findings are representative of other adult websites.

5.1.3 Main Findings

In this chapter, we conduct the first large-scale measurement study of online adult traffic using HTTP logs collected from a major commercial content delivery network. The week-long HTTP logs include traffic from several dozen major adult websites and their users in four different continents. Overall, our HTTP logs account for approximately 323 terabytes worth of traffic from 80 million users. Based on these traces, we present detailed characteristics of five popular adult websites. We choose these websites on the basis of their popularity; e.g., several of them have been ranked in the global Alexa top-500 list [1]. This selection represents a broad variety of adult websites, ranging from adult video services and adult image sharing services to adult social networking services. We provide an in-depth analysis of online adult traffic, including its aggregate, content, and user dynamics. Below, we summarize our key findings and their implications for adult content delivery.

• Aggregate Analysis: Adult traffic primarily comprises video and image multimedia content. For many popular adult websites, up to 99% of traffic volume consists of video and image content. While the majority of users access adult websites from desktop devices, smartphones and other mobile devices account for a non-trivial fraction of visitors.

• Content Analysis: Adult content has widely varying sizes: images are generally less than 1 megabyte, while videos are on the order of tens of megabytes. Content popularity distributions exhibit the expected skewness. The temporal access patterns of adult websites are unique and differ from the typical diurnal access patterns of traditional web content. Our clustering analysis of content popularity reveals groups of objects with diurnal, long-lived, and short-lived temporal access patterns.

• User Analysis: User engagement with adult websites is shorter than with non-adult websites. However, adult content can be addictive, i.e., some users repeatedly access certain content. For example, at least 10% of video objects have more than 10 requests per unique user.

• Implications: Our findings have implications for adult website design and content delivery infrastructure management. For example, a vast majority of users do not visit adult websites on smartphones; this finding highlights the need for adult websites to improve their web interfaces and content delivery strategies for mobile devices. As another example, due to their unique diurnal access patterns, it is important to account separately for adult traffic in traffic forecasting models and network resource allocation. We also find that adult content providers cannot rely on the browser cache to store locally popular content because of the prevalent use of incognito/private web browsing. Adult content providers can instead optimize content caching performance through customized networked cache configuration.
For example, content delivery networks can improve performance and reduce network traffic by pushing copies of popular adult objects to locations closer to their end-users.

5.2 Related Work

Despite a large number of studies on general web traffic analysis (e.g., [103, 28, 148, 145, 112, 126]), little is known about the makeup and characteristics of online adult traffic. To fill this gap, this chapter presents the first in-depth and large-scale characterization of online adult traffic. To provide context for our study, we review prominent related work below.

Wondracek et al. studied the economic and security issues of the online adult industry [144]. In their study, the authors used manual inspection and automated crawling to investigate the characteristics of adult websites. They classified adult websites based on their functionality into the following categories: paysites, link collections, search engines, domain redirector services, keyword-based redirectors, and traffic brokers. They also created two adult websites from scratch for traffic profiling and vulnerability assessment. They concluded that several prevalent practices in the online adult industry are questionable and can be used to conduct malicious activities. This work provides a broad understanding of the economic and security issues of the online adult industry. In contrast, we analyze HTTP access logs of several popular adult websites to understand their content and user traffic patterns.

More recently, Tyson et al. conducted measurement studies of two Web 2.0 adult websites, YouPorn and PornHub [135, 136]. In [135], the authors crawled YouPorn to understand user interactions with the website. This measurement study shows that the fraction of user-generated content on YouPorn is much smaller compared to non-adult video websites like YouTube and Vimeo. However, they found that YouPorn has more average views per video than other non-adult video websites. The authors concluded that video viewing behavior on YouPorn depends on two main factors: front-page browsing patterns and the number of categories assigned to a particular video. They found that the videos with the most views are located on the front page of the website, and that the number of views a video receives is correlated with the number of categories assigned to it. In [136], Tyson et al. crawled PornHub, an adult social networking website, to analyze the demographic makeup of its users. The study shows that a majority of profiles on PornHub belong to young males; however, female profiles are more popular in terms of receiving comments and profile views. The authors also found that active profiles have larger social groups and that there is a positive correlation between the content uploaded by a profile and its number of connections. The data analyzed in these studies was collected by periodically crawling the two adult websites, which is generally limited to meta-data such as view counts, ratings, and user profile information. In contrast, our work is based on an analysis of detailed HTTP access logs of multiple adult websites, which allows us to track individual user content requests at a fine-grained timescale.

Some prior web traffic measurement studies have reported basic statistics of adult traffic in the context of overall traffic makeup. For example, Du et al. studied HTTP traffic from Internet kiosks in two developing countries and reported that adult content accounts for less than 1% of total traffic volume [45]. Erman et al.
studied video traffic in a large 3G cellular network and reported that adult content accounts for approximately 15% of total mobile video traffic [48]. While these studies are useful, they do not specifically analyze adult traffic in terms of its unique characteristics as compared to non-adult traffic. To the best of our knowledge, our work provides the first analysis of online adult traffic at scale.

5.3 Measurement & Analysis

Most major adult and non-adult content publishers use third-party content delivery networks (CDNs) to efficiently deliver their content to end-users. According to recent estimates, a significant fraction of Internet traffic is served by CDNs [93]; for instance, Akamai alone delivers between 15-30% of all web traffic [3]. A CDN operator typically places content at multiple geographically distributed data centers. A user's request for content (such as a web page, an image, or a video file) is redirected to the closest data center via DNS redirection, anycast, or other CDN-specific methods [111, 52].

For this study, we collected HTTP access logs from multiple data centers of a major commercial CDN for the duration of one week. The HTTP logs include traffic from several dozen major adult websites and their users in four different continents. Overall, our HTTP logs account for more than 323 terabytes worth of traffic from 80 million users. All personally identifiable information in the HTTP logs (e.g., IP addresses) is anonymized to protect the privacy of end users without affecting the usefulness of our analysis. Each record in our trace includes information about an HTTP request, containing the publisher identifier, hashed URL, object file type, object size in bytes, user agent, and the timestamp when the request was received. We use the user agent field to distinguish between different device types, operating systems, and web browsers [49]. Each record in our trace also includes information about the corresponding HTTP response sent by the CDN server, containing the cache status for the requested object: a HIT value indicates that the requested object was found in the CDN cache, and a MISS value indicates that the object did not exist in the CDN cache. We use the HTTP response codes and cache status information to measure the caching performance of the CDN's proprietary caching algorithms.

Through an extensive manual analysis of publisher identifiers, we separated adult content publishers from the rest (dubbed "non-adult"). We select five popular adult websites in our data set for further in-depth analysis; their names are anonymized due to business confidentiality agreements. Two of the websites primarily serve adult video content (termed V-1 and V-2), two provide image-heavy adult content (termed P-1 and P-2), and one is an adult social networking website (termed S-1). In this section, we analyze the aggregate statistics of adult websites (e.g., content composition), content dynamics (e.g., popularity, new content injection), and user dynamics (e.g., session length, addiction to adult content).

[Figure 5.1: Content composition of five adult websites: number of video, image, and other objects (log scale) with percentages per website.]

5.3.1 Aggregate Analysis

5.3.1.1 Content Composition

We first analyze the content composition of adult websites on CDN servers.
To this end, we categorize objects based on their file types into three categories: (1) video (e.g., FLV, MP4, MPG, AVI, WMV), (2) image (e.g., JPG, PNG, GIF, TIFF, BMP), and (3) other (e.g., text, audio, HTML, CSS, XML, JS), where the other category includes objects that are classified as neither video nor image. V-1 uses 6.6K objects, V-2 uses 55.6K objects, P-1 uses 16.3K objects, P-2 uses 29.6K objects, and S-1 uses 22.9K objects. From Figure 5.1, we note that V-1 primarily stores video objects on the CDN servers, e.g., 98% of all its objects are videos. In contrast, V-2 stores a mix of image (84%) and video (15%) objects; V-2 uses a large number of GIF images to show a video summary when users hover the cursor over a video. P-1, P-2, and S-1 predominantly store images (99%) on CDN servers.

5.3.1.2 Traffic Composition

We next analyze the traffic composition of adult websites in terms of request count and request size during the data collection period. Request count is the total number of requests received from website visitors for all objects. Request size is the total size of the objects requested by website visitors. We again break down content across the video, image, and other categories.

[Figure 5.2: Traffic composition of five adult websites: (a) request count and (b) request size (KB), broken down by video, image, and other content.]

Figure 5.2(a) shows the traffic composition with respect to request count. We observe that the majority of traffic on adult websites consists of video and images. Only V-1 has significantly more video traffic than other content types – V-1 traffic includes 3.1M requests for video objects. V-2 has a smaller percentage of video content traffic compared to the image and other categories: for V-2, 359K requests are for video content whereas 657K requests are for image content. For P-1 and P-2, 719K and 175K requests are for images, respectively. For S-1, 231K requests are for images.

Figure 5.2(b) shows the traffic composition with respect to request size. In contrast to Figure 5.2(a), we note that video content accounts for disproportionately more traffic volume. Since video files are significantly larger than image files, videos tend to dominate the traffic in terms of byte volume: video traffic in V-1 alone accounts for 258 gigabytes worth of traffic. This traffic mix is composed mostly of multimedia content and is representative of popular free (ad-driven) and subscription-based adult websites [144]. Our findings highlight that adult content publishers and the respective CDNs need to design and provision their infrastructure primarily to serve multimedia content.

[Figure 5.3: Hourly traffic volume timeseries of five adult websites (percentage of traffic volume per hour of day).]

5.3.1.3 Temporal Access Patterns

We next analyze the temporal access patterns of adult websites. Figure 5.3 plots the normalized hourly timeseries of traffic volume across the day. We converted the timestamps to local timezones to calculate hourly traffic volumes. Overall, we observe that the access patterns of adult websites do not follow typical diurnal patterns.
Prior literature (e.g., [145, 112, 126]) reported content access peaks during 7-11 pm and troughs in the late night and early morning hours. In contrast, we note, for example, that V-1 traffic volume peaks in the late-night and early morning hours – almost the opposite of the typical diurnal pattern reported in prior literature. The temporal access patterns of V-2, P-1, P-2, and S-1 have less pronounced variations than V-1, yet they too differ from typical diurnal patterns. The differences between the peak access times of adult and other content are likely due to the unique nature of, and viewing preferences for, adult content. Thus, it is important for CDNs to account separately for adult traffic in traffic forecasting models and network resource allocation.

5.3.1.4 Device/OS Usage

Web traffic is known to have been gradually shifting from traditional desktop to smartphones and tablets over the last several years [95]. Recall that we extract user agent information from HTTP headers to identify the device/OS of a user. We next investigate the device/OS composition of user requests to adult websites. All adult websites discussed in this chapter have mobile-friendly websites/apps.

[Figure 5.4: Device type composition (% of users on desktop, Android, iOS, and miscellaneous devices) for the five adult websites.]

Figure 5.4 plots the device distribution of users accessing the adult websites. We observe that the desktop category dominates the smartphone (Android and iOS) and miscellaneous (tablets and other mobile devices) categories. For instance, V-2 has more than 95% of users accessing content from traditional desktop devices. We observe that image-heavy and social networking websites receive relatively more visitors from smartphone devices than video websites do; for instance, more than one-third of users access S-1 from the smartphone and miscellaneous device categories. The differences across adult websites can be partially explained by user preference for viewing adult content on larger screens. CDNs can customize content delivery (e.g., compression, encoding) for different video playback devices. Since a vast majority of users do not visit adult websites on smartphones, our findings highlight the need for adult websites to further improve their web interfaces and content delivery strategies for mobile devices.

5.3.2 Content Dynamics

5.3.2.1 Content Size

We next investigate content sizes for adult websites. Figure 5.5 plots the cumulative distribution functions (CDFs) of content sizes.

[Figure 5.5: Content size distributions (CDFs of file size in bytes) for (a) video and (b) image objects.]

We note bi-modal distributions for image objects: small images are low-resolution thumbnails and large images are high-resolution pictures. Overall, we observe that content sizes vary from a few kilobytes (KB) to hundreds of megabytes (MB), where the majority of requested video objects are larger than 1 MB and image objects are smaller than 1 MB. Figure 5.5(a) shows the content size distribution of video objects. The video adult websites V-1 and V-2 have a majority of objects larger than 1 MB. P-1 and S-1 have a relatively small number of video objects. P-2 has the largest video object sizes. Figure 5.5(b) shows the content size distribution of image objects.
We note that multiple adult websites have bi-modal image size distributions, comprising thumbnail-sized images as well as large images of up to 1 MB. These observations have significant implications for CDN/ISP caching optimization, for instance, using separate caching platforms to optimally serve small and large objects: the platform for small objects can be optimized for high-throughput I/O, while the platform for large objects can be optimized for greater storage capacity.

5.3.2.2 Content Popularity

We now investigate object popularity for adult websites. We quantify object popularity in terms of request count. Understanding the popularity of objects is important because CDNs typically optimize their caching performance by focusing on popular objects and reduce storage costs by ignoring unpopular and dynamically changing content. Additionally, analyzing the popularity of adult objects can provide insights for identifying similarities and differences between adult and non-adult objects.

[Figure 5.6: Content popularity distributions (CDFs of number of requests) for (a) video and (b) image objects.]

We plot the request count distributions of video and image objects for adult websites in Figure 5.6. We observe long-tailed distributions for all adult websites. This observation indicates that a significant fraction of adult objects are requested infrequently while a small fraction of adult objects are very popular. From a content delivery perspective, this information is useful because CDNs can improve their caching performance by caching heavily requested objects. The long-tailed distribution is similar to those reported for traditional web and video content in prior literature, where a small fraction of viral media content is heavily requested. The long-tailed distribution also potentially highlights some social aspects of adult content. In comparison, the large number of requests for typical non-adult objects is mainly due to word-of-mouth content sharing on online social networking websites [119]. We would not typically expect adult content viewers to share adult content on popular social networking websites, though recent work has shown that users do use social features within adult websites [136]. We next investigate unique aspects of adult content popularity by further analyzing individual object request patterns.

[Figure 5.7: Fraction of total objects requested for adult websites at different ages (days).]

[Figure 5.8: Content clustering dendrograms of two adult websites: (a) video objects of V-2, with diurnal (16%), long-lived (25%), flash crowd (14%), and outlier (33%) clusters; (b) image objects of P-2, with diurnal-A (11%), long-lived (22%), short-lived (20%), and diurnal-B (14%) clusters.]

5.3.2.3 Impact of Content Injection and Aging on Popularity

We would expect more requests for objects when they are new, and as content ages we expect its popularity to decrease accordingly. To understand this phenomenon for adult websites, we plot the fraction of adult objects requested at different ages in Figure 5.7. The plot shows that a declining fraction of objects is requested as their age increases. In particular, about 20% of objects are no longer requested after 3 days for most adult websites.
Only about 10% of objects are requested throughout the entire one-week trace duration.

We further explore the popularity trends of objects by clustering them with respect to their temporal popularity patterns. To identify temporal popularity patterns, we analyze the time series of request counts of individual objects. Characterizing the request count time series helps us assess how fast an adult object reaches its maximum popularity, identify the governing popularity trends of image and video objects, and develop insights for improving the caching performance of CDNs.

[Figure 5.9: Cluster medoids for the V-2 adult website: normalized request count over the week for the (a) diurnal-A, (b) long-lived, and (c) short-lived clusters.]

[Figure 5.10: Cluster medoids for the P-2 adult website: normalized request count over the week for the (a) diurnal, (b) long-lived, and (c) short-lived clusters.]

We identify distinct popularity trends by measuring the similarity in shape between the normalized request count time series. We use Dynamic Time Warping (DTW) to compute the similarity between two request count time series [102]. DTW uses a dynamic programming approach to obtain a minimum-distance alignment between two time series. More specifically, for two time series T1 and T2, DTW obtains an optimal alignment by warping the time dimension of T1 and T2; it achieves this by obtaining a non-linear mapping of points of one time series to points of the other. This non-linear mapping is also known as the warping path. For T1 = (a_1, a_2, ..., a_N) and T2 = (b_1, b_2, ..., b_M) of lengths N and M respectively, a warping path w = (w_1, w_2, ..., w_L) defines an alignment between T1 and T2, where, e.g., w_1 = (a_1, b_1) defines the mapping of element a_1 ∈ T1 to b_1 ∈ T2. A cost function is used to compute the optimality of a warping path: large cost values indicate low similarity between the shapes of the two time series, and small cost values indicate high similarity. The total cost of a warping path w is defined as:

    C_w = \sum_{i=1}^{L} c(a_i, b_i)

Intuitively, the cost function c is defined as the area between the time-warped time series. Using a dynamic programming approach, DTW computes all possible sets of mappings (warping paths) between the two time series. The optimal warping path is the path w′ with the minimum total cost among all possible warping paths, and the total cost of the optimal warping path is defined as the DTW distance. We use the DTW distance as a metric for quantifying the similarity between two request count time series. A minimal sketch of this computation is shown below.
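The following is a minimal sketch of the DTW distance using the classic dynamic programming recurrence with an absolute-difference point cost; the variable names are illustrative only, and this is a sketch rather than the exact implementation used in our experiments.

import numpy as np

def dtw_distance(t1, t2):
    """DTW distance between two request count time series (sketch).

    Fills the standard dynamic programming table in which cell (i, j)
    holds the minimum cumulative cost of aligning t1[:i] with t2[:j].
    """
    n, m = len(t1), len(t2)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(t1[i - 1] - t2[j - 1])  # point-wise cost c(a_i, b_j)
            # extend the cheapest of the three allowed alignment moves
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]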
We compute pairwise DTW distances between all request count time series. We then use the pairwise DTW distance matrix to obtain hierarchical clusters of the request count time series, applying agglomerative hierarchical clustering to obtain a dendrogram for each adult website. Figures 5.8(a) and (b) show two example dendrograms, of video and image objects for V-2 and P-2, respectively. The x-axis of a dendrogram shows the cluster labels and their memberships, and the y-axis represents the DTW similarity metric. We observe three dominant popularity trends in our clustering analysis: diurnal, long-lived, and short-lived. We also observe that some objects of the V-2 and P-2 websites have other popularity trends that cannot be neatly categorized into any of the aforementioned categories; we categorize these objects as outliers.

To visualize the unique popularity trends, we next identify a representative sample object from each cluster and plot its normalized request count time series with point-wise standard deviations. To identify the representative sample object of each cluster, we identify its medoid, where a medoid is defined as the most centrally located point of a cluster [85]. We calculate the point-wise standard deviations over the normalized request count timeseries of all objects in the cluster. We plot the normalized request count time series of the medoids of the clusters in Figures 5.9 and 5.10: the shaded regions represent the standard deviation of all timeseries from the mean of the cluster, and the solid line represents the medoid of the cluster. Figures 5.9(a) and 5.10(a) show the medoids of the diurnal popularity clusters; the diurnal popularity trends highlight that certain objects are requested continuously with regular day/night variations. Figures 5.9(b) and 5.10(b) show the cluster medoids of objects with long-lived popularity trends: the request count of these objects reaches its maximum within the first day after injection, decays in a diurnal fashion, and dies down completely after a few days. Figures 5.9(c) and 5.10(c) show the cluster medoids of objects with short-lived popularity trends: the request count of these objects reaches its maximum during the first day after arrival but decays sharply and dies down completely within a few hours. Our further analysis reveals that video objects with diurnal trends are smaller in size than video objects with long-lived and short-lived trends; long-lived video objects have the largest sizes, followed by short-lived video objects.

Overall, our analysis reveals that objects served by adult websites have diverse popularity trends, and each website has a different composition of these trends. A large fraction of image and video objects have diurnal request patterns, for which there are two possible explanations. First, prior work [135] suggests that users discover content on adult websites through front-page browsing; objects with diurnal patterns are most likely objects displayed on the front page of a website, which are delivered whenever users visit the website. Second, the potentially addictive nature of adult content could drive users to access specific content repeatedly, similar to non-adult media content.

CDNs can utilize this information to optimize cache control by re-validating diurnal objects less frequently and other objects more frequently, for example, hourly for objects with short-lived access patterns and daily for objects with long-lived access patterns. This can also be achieved by setting longer expiry times for objects with diurnal and long-lived access patterns.
Furthermore, CDNs can reduce network traffic by pushing copies of objects with diurnal and long-lived request patterns to locations closer to end-users.

5.3.3 User Dynamics

5.3.3.1 User Request Inter-arrival Time

We characterize the user request arrival process in terms of its inter-arrival time (IAT) distribution.

[Figure 5.11: User request inter-arrival time distributions (CDFs, 1 sec to 1 week) for the five adult websites.]

Figure 5.11 plots the user request IAT distributions for all adult websites. Comparing the different adult websites, we observe that video adult websites have shorter request IATs than image-heavy adult websites: for video adult websites, the median request IAT is less than 10 minutes, whereas it is more than 1 hour for image-heavy adult websites. We later use these observations for estimating user session lengths.

5.3.3.2 User Session Length

A key metric from the perspective of content publishers and CDNs is user engagement [124], which is typically quantified in terms of website bounce time [57]. From the network-side logs, we can estimate user engagement in terms of user session length, where a session consists of consecutive user requests within a timeout interval. (Note that the user session length is a strict lower bound on the traditional bounce time, because network-side logs cannot tell how long a user continues to watch downloaded content.) We set the timeout value for user sessions at 10 minutes based on our earlier analysis of the user request IAT distributions.

[Figure 5.12: User session length distributions (CDFs, 1 sec to 1 hr) for the five adult websites.]

We plot the user session length distributions of the adult websites in Figure 5.12. We observe that the median session lengths for most adult websites are around one minute. Our findings indicate that user engagement with adult content consists of relatively short-lived sessions compared to non-adult content. We further verified our observations using the engagement statistics reported by Alexa: average session lengths for popular non-adult websites tend to be much larger than for popular adult websites. For example, the average session length for YouTube [147] is approximately two minutes, whereas the average session length for XVIDEOS [146] is less than one minute. An illustrative sessionization sketch is shown below.
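As an illustration, the following sketch groups one user's request timestamps into sessions using the 10-minute timeout; the input format (sorted per-user timestamps in seconds) is a hypothetical stand-in for the anonymized log records.

SESSION_TIMEOUT = 10 * 60  # seconds

def session_lengths(timestamps):
    """Split one user's sorted request timestamps into sessions and
    return the length (last minus first request) of each session."""
    lengths, start, prev = [], None, None
    for t in timestamps:
        if prev is None or t - prev > SESSION_TIMEOUT:
            if start is not None:
                lengths.append(prev - start)  # close the previous session
            start = t                         # open a new session
        prev = t
    if start is not None:
        lengths.append(prev - start)
    return lengths

# e.g., three requests form one session, then a fourth opens a new one
print(session_lengths([0, 40, 300, 5000]))  # -> [300, 0]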
5.3.3.3 User Addiction

To further investigate user engagement, we next analyze user addiction by examining repeated content accesses by individual users. For each object, we compute the total number of requests and the total number of unique users who make these requests. Figure 5.13(a) shows the scatter plot highlighting repeated access of video objects for the V-1 adult website.

[Figure 5.13: Repeated access of objects (# of requests vs. # of users, log-log); (a) V-1, (b) P-1]

Each data point in the plot represents a distinct video object. We observe that certain video objects are requested by a user multiple times, i.e., data points above the diagonal. In some cases, an object is requested a large number of times by an individual user; for example, some objects have up to two orders of magnitude more requests than unique users. This observation indicates that some video objects are popular due to repeated access by a certain user (i.e., addiction), whereas other video objects are popular due to multiple users accessing the content (i.e., viral). Figure 5.13(b) shows the scatter plot of repeated access of objects for the P-1 adult website. A comparison of repeated access reveals that video objects are more likely to get repeated user requests than image objects. This also highlights that video content is more addictive/engaging as compared to image content.

To further analyze user addiction to content, we plot the distribution of repeated user access for all objects. Figure 5.14 plots the CDF of the number of requests per user for all adult websites.

[Figure 5.14: CDF of repeated content access by users (requests per user); (a) Video, (b) Image]

We note that less than 1% of image objects are requested more than 10 times by a user, whereas at least 10% of video objects have more than 10 requests per unique user. From a CDN's perspective, this information is particularly useful, as it can differentiate between objects that are popular only due to requests from multiple users and objects that are popular because of repeated accesses by a user. This information is also useful in optimizing local browser caching and the proxy caches deployed by ISPs: objects accessed multiple times by a group of users should be cached locally, closer to end-users.

5.4 Implications

We discuss potential implications of our measurement and analysis of online adult traffic. We are particularly interested in understanding the impact of different content access patterns on CDN caching. To this end, we analyze the caching performance for adult websites by looking at server-side HTTP response codes and cache hit ratios.

5.4.1 CDN Cache Hit Ratios

We first investigate the cache performance of CDN servers by analyzing server-side cache hit ratios. Figure 5.15 plots the distributions of cache hit ratios for objects in all adult websites.

[Figure 5.15: Hit ratio for image and video objects (CDF of per-object hit ratio); (a) Image, (b) Video]

A hit ratio value of 0 indicates that the requested objects were not present in the CDN cache. A hit ratio value of 1 indicates that all requested objects were served from the CDN cache. Note that the CDN treats video chunks as separate objects for the sake of caching. Comparing image and video objects, we observe that image objects have a better overall cache hit ratio than video objects. S-1 has the smallest percentage of objects added to the CDN cache. To understand the cache performance of objects with different popularity, we compute correlations between hit ratio and object popularity. As expected, we find that popular objects tend to have higher hit ratios (correlation coefficients above 0.9 for all adult websites). Thus, while the distributions in Figure 5.15 may indicate otherwise, overall CDN cache hit ratios range between 80-90% for adult websites. CDNs often customize cache configuration and performance for individual publishers; thus, differences in cache hit ratios may reflect cache prioritization for different content publishers.
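The per-object hit ratios underlying Figure 5.15 can be tallied directly from server-side logs. The sketch below (Python) assumes each log record carries an object identifier and a HIT/MISS cache status; the field names and record layout are illustrative, since real CDN log formats vary.

```python
from collections import defaultdict

def per_object_hit_ratio(log_records):
    """Compute per-object cache hit ratios from server-side log records.
    Each record is an (object_id, cache_status) pair, where cache_status
    is 'HIT' or 'MISS' (illustrative field values)."""
    hits, total = defaultdict(int), defaultdict(int)
    for obj, status in log_records:
        total[obj] += 1
        hits[obj] += (status == "HIT")
    return {obj: hits[obj] / total[obj] for obj in total}

records = [("v1", "MISS"), ("v1", "HIT"), ("v1", "HIT"), ("img9", "MISS")]
print(per_object_hit_ratio(records))  # {'v1': 0.666..., 'img9': 0.0}
```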
Furthermore, the CDN can also implement customized caching strategies for streaming video content.

5.4.2 HTTP Response Codes

We next analyze HTTP response codes for adult websites. Figure 5.16 shows the number of requests associated with each type of response code.

[Figure 5.16: Response codes (request counts for 200, 204, 206, 304, 403, 416 per website); (a) Video, (b) Image]

The most common HTTP response codes for both video and image objects include 200, 206, 304, and 403. We note that a majority of response codes are 200. Of particular interest to a CDN operator is the 304 response code, which indicates that the client's requested object has not been modified and the local cached copy is up to date. We note that 304 responses constitute a small fraction of all requests, which indicates the potential for improved localized content caching to improve user performance and reduce the traffic load on CDN content replica servers. Despite the cacheability of popular adult objects, 304 response counts are particularly low for adult websites because users are known to browse adult content in incognito/private browsing modes [16]. Web browsers discard locally cached content when users close incognito/private browser windows. In contrast, note that Facebook reported that more than 65% of their photo requests are served from local browser caches [74]. Unfortunately, while traditional non-adult websites can fully utilize the browser cache to improve performance and reduce traffic, adult content publishers cannot solely rely on it in designing their content and caching mechanisms.

5.5 Conclusion

In this chapter, we presented a large scale and in-depth measurement and analysis of online adult traffic, covering its aggregate, content, and user dynamics. We find that the temporal access patterns of adult websites are unique and different from typical diurnal access patterns. While a majority of users access adult websites from desktops, smartphones and other mobile devices account for a non-trivial fraction of visitors. Our clustering analysis of content popularity reveals groups of objects with diurnal, long-lived, and short-lived temporal access patterns. User engagement on adult websites is shorter than on non-adult websites; however, adult content can be addictive, i.e., some users repeatedly access certain content. Our findings have implications for adult website design and content delivery infrastructure management. For instance, adult content providers cannot rely on the browser cache to store locally popular content because of the prevalent usage of incognito/private web browsing. Content delivery networks can also improve performance and reduce network traffic by pushing copies of popular adult objects to locations closer to the end-users.

6 Privacy Preserving Social Network Graph Data Publishing

6.1 Introduction

6.1.1 Background and Motivation

Online Social Networks (OSNs) have become an essential part of modern life. Billions of users connect and share information using OSNs such as Facebook and Twitter. Graphs obtained from these OSNs can provide useful insights on various fundamental societal phenomena such as epidemiology, information dissemination, marketing, and sentiment flow [58, 118]. Various analysis methods [38, 92] have been applied to OSNs by explicitly exploring their graph structure, such as clustering analysis for automatically identifying online communities and node influence analysis for recognizing the influential nodes in social networks.
The basis of all these analyses is to represent a social network graph by an adjacency matrix and then represent individual nodes by vectors derived from the top eigenvectors of the adjacency matrix. Thus, all these analysis methods require real social network graphs. Unfortunately, OSN companies often refuse to publish their social network graphs due to privacy concerns. Social network graphs contain sensitive information about individuals, such as a user's topological characteristics in the graph (e.g., number of social ties, influence in a community, etc.). From the user perspective, the sensitive information revealed from a social network graph can be exploited in many ways, such as fraud and spam. From the OSN perspective, disclosing sensitive user information may put them at risk of violating privacy laws. A natural way to bridge the gap is to anonymize original social network graphs (by means such as removing identifiers) and publish the anonymized ones. For example, Netflix published anonymized movie ratings of 500,000 subscribers and AOL published search queries of 658,000 users [63, 105]. However, such anonymization is vulnerable to privacy attacks where attackers can identify personal information by linking two or more databases [31].

6.1.2 Problem Statement

In this work, we aim to develop a scheme for publishing social network graphs with differential privacy guarantees. Recently, differential privacy has become the widely accepted criterion for privacy preserving data publishing because it provides the strongest privacy guarantees for publishing sensitive datasets [47]. The concept of differential privacy was proposed by Dwork in the context of statistical databases, where a trusted party holds a dataset D containing sensitive information (e.g., medical records) and wants to publish a dataset D′ that provides the same global statistical information as D while preserving the private information of each individual user. Given a database D and a neighbor database D′ differing from D in only one record, differential privacy requires that any information that can be learned from D can also be learned from D′, irrespective of the presence or absence of any one record.

Our targeted differential privacy preserving social graph publishing scheme should satisfy the following two requirements. First, the published dataset should maintain the utility of the original dataset. Second, the scheme should guarantee differential privacy; that is, an adversary should learn nothing more about any individual from the published data, regardless of the presence or absence of the record of any particular individual in the dataset. We emphasize that these two goals are often conflicting: to achieve differential privacy, a sufficiently large amount of random noise has to be added to the published data, which could potentially result in a large error in approximating the top eigenvectors of the original data. The goal of this work is to achieve a proper tradeoff between privacy and utility.

6.1.3 Limitations of Prior Art

Given an original n × n matrix, some schemes have been proposed to perturb this matrix by adding random noise and then publish the perturbed n × n matrix [40, 82]. The key limitation of such schemes is that they are computationally impractical, as they are not designed to handle large matrices such as the adjacency matrices of OSN graphs, which can be on the scale of millions or even billions. Recently, Wang et al.
proposed a scheme called LNPP, which perturbs the eigenvectors of the original matrices by adding random noise and then publishes the perturbed eigenvectors [141]. Although this reduces computation cost and storage space, the scheme requires a large amount of random perturbation in order to preserve differential privacy, which leads to poor estimation of the top eigenvectors.

6.1.4 Proposed Approach

We propose a random matrix approach by leveraging the theories of random matrices and differential privacy. Our key idea is to first project each row of an adjacency matrix into a low dimensional space using random projection, then perturb the projected matrix with random noise, and finally publish the perturbed and projected matrix. Given an n × n adjacency matrix A of a social network graph, with m projections and k eigenvectors, we output an n × m (0 < k ≤ m ≪ n) randomly perturbed matrix Â such that the top k eigenvectors of A are close to the k eigenvectors of Â (in terms of both vector directions and magnitude). The random projection is critical in our approach. First, it significantly reduces the size of the matrix to be published, because n × m ≪ n × n, addressing the limitations of prior art. Second, based on the theory of random matrices [61], we can preserve the top eigenvectors of the adjacency matrix using random projection. Third, the random projection step by itself has the ability to achieve differential privacy, which makes it possible to ensure differential privacy in the second step by introducing only a small random perturbation [113, 34].

6.1.5 Technical Challenges

There are three key technical challenges. The first challenge is to prove that random projection plus random perturbation preserves differential privacy. The second challenge is to prove that the random noise required to achieve differential privacy is small. We address these two challenges by leveraging the theory of random matrices and differential privacy. The third challenge is to validate the proposed approach and demonstrate the utility preservation of the eigen-spectrum. We address this challenge by evaluating the utility of the published data for two different applications that require the spectral information of an OSN graph. We use publicly available datasets of the Facebook, LiveJournal, and Pokec social networks [11, 143]. The first application is OSN node clustering, which has been widely used for OSN applications such as community detection [139]. We focus on spectral clustering algorithms, as they are based on the eigenvectors of adjacency matrices. The second application is OSN node ranking, which has been widely used for OSN applications such as influential node identification [75]. We focus on principal component centrality based algorithms, as they are based on the eigenvectors of adjacency matrices.

6.1.6 Key Contributions

We make three key contributions in this chapter. First, we propose a random matrix approach to OSN graph publishing, which achieves space efficiency by reducing the dimensions of adjacency matrices and achieves differential privacy by adding a small amount of noise. Second, we formally prove that our scheme achieves differential privacy and that the amount of added noise is small; we present theoretical error bounds for approximating the top-k eigenvectors. Third, we validate our random matrix approach on two different applications, node clustering and node ranking, which require the spectral information of a graph.
Our experimental results show that even for high values of noise variance (σ = 1), our scheme preserves the utility of the dataset. For node clustering, our scheme achieves high clustering quality as measured by normalized mutual information, which is more than 0.7. For node ranking, our scheme can correctly recover 80% of the most influential nodes. For node classification with 16 classes, our scheme obtains an accuracy of 70% with 5-fold cross validation. We also compare our results with the LNPP scheme, which represents the state-of-the-art [141]. We show that our approach achieves better results for both clustering and ranking. The LNPP scheme directly perturbs the eigenvectors of the original data with Laplacian noise; in contrast, we first use random projections to reduce the size of the adjacency matrix and then compute the eigenvectors. Our approach achieves a high clustering quality of 0.74 in terms of normalized mutual information; for LNPP, the same amount of noise almost completely destroys the clustering of nodes in the original dataset. For node ranking, our approach correctly identifies 80% of the most influential nodes, whereas LNPP correctly identifies less than 1% of the most influential nodes.

6.2 Related Work

In the area of differential privacy in data publishing, some schemes have been proposed to perturb an n × n matrix by adding random noise and then publish the perturbed n × n matrix [40, 82]. Given a large n × n matrix, whether dense or sparse, all these schemes output another large dense matrix of size n × n. The key limitation of such schemes is that they are computationally impractical, as in large OSNs n is on the scale of millions or even billions. Blum et al. propose to publish the covariance matrix of the original data contaminated by random noise [35]. [113] and [34] show that random projection by itself can preserve both the differential privacy and the eigen-spectrum of a given matrix, provided appropriate modification is made to the original matrix. Hardt et al. proposed an iterative algorithm to compute differentially private eigenvectors [64]; this scheme generates a large dense n × n matrix at each iteration. Sampling approaches based on the exponential mechanism have been proposed for computing differentially private singular vectors [40, 82]. Since these approaches require sampling very high dimensional vectors from a random distribution, they are computationally infeasible for large social networks. Wang et al. propose to publish eigenvectors perturbed by Laplacian random noise, which unfortunately requires a large amount of random perturbation for differential privacy preservation and consequently leads to poor utility of the data [141].

Some other work has been done on publishing social network data with differential privacy; however, none of it addresses the utility of preserving the eigenvectors of social network graphs, which is the main goal of our work [120, 66, 101, 86]. Sala et al. presented a scheme to share meaningful graph datasets, based on the dK-graph model, while preserving differential privacy [120]. Hay et al. proposed a scheme to preserve the degree distribution of a social network graph [66]. Mir and Wright proposed to guarantee differential privacy by perturbing Kronecker model parameters [101]. Kenthapadi developed a differentially private algorithm that preserves the distance between any two samples in a given database [86].
Although these studies deal with differentially private publication of social network data, none of them addresses the utility of preserving the eigenvectors of the graph, the central theme of this work.

6.3 Random Matrix Approach

In this section, we first present the proposed approach for differentially private publication of social network graphs based on random matrix theory. We then present its guarantees on differential privacy and on the approximation of eigenvectors.

Let G be a binary graph representing the connectivity of a social network, and let A ∈ {0, 1}^{n×n} be the adjacency matrix representing the graph, where A_{i,j} = 1 if there is an edge between nodes i and j, and A_{i,j} = 0 otherwise. The first step of our approach is to generate two Gaussian random matrices P ∈ R^{n×m} and Q ∈ R^{n×m}, where m ≪ n is the number of random projections. Here, each entry of P is sampled independently from the Gaussian distribution N(0, 1/m), and each entry of Q is sampled independently from another Gaussian distribution N(0, σ²), where the value of σ will be discussed later. Using the Gaussian random matrix P, we compute the projected matrix A_p ∈ R^{n×m} as A_p = A × P, which projects each row of A from the high dimensional space R^n into the low dimensional space R^m. We then perturb A_p with the Gaussian random matrix Q as Â = A_p + Q, and publish Â to the external world. Algorithm 3 highlights the key steps of the proposed routine for publishing the social network graph.

Algorithm 3: Â = Publish(A, m, σ²)
Input: (1) symmetric adjacency matrix A ∈ R^{n×n}; (2) the number of random projections m < n; (3) the variance σ² for the random noise
Output: Â
1: Compute a random projection matrix P, with P_{i,j} ∼ N(0, 1/m)
2: Compute a random perturbation matrix Q, with Q_{i,j} ∼ N(0, σ²)
3: Compute the projected matrix A_p = AP
4: Compute the randomly perturbed matrix Â = A_p + Q

Compared to the existing approaches for differentially private publication of social network graphs, the proposed algorithm is advantageous in three aspects. First, it is computationally efficient, as it does not require either storing or manipulating a dense n × n matrix. Second, the random projection matrix P allows us to preserve the top eigenvectors of A, due to the theory of random matrices. Third, the joint effect of the random projection P and the random perturbation Q preserves differential privacy. This unique feature allows us to introduce only a small amount of random perturbation for differential privacy preservation, thus improving the utility of the data. Next, we give a theoretical analysis of the two main aspects of publishing differentially private graphs of OSNs. First, we formally prove that our random matrix approach to publishing OSN graphs guarantees differential privacy. Second, we give theoretical error bounds for approximating the top-k eigenvectors.
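The following is a minimal NumPy sketch of the four steps of Algorithm 3. It uses a small dense array for illustration; a production implementation would keep A in a sparse format and compute AP without materializing a dense n × n matrix. The toy graph at the bottom is hypothetical.

```python
import numpy as np

def publish(A: np.ndarray, m: int, sigma: float, seed: int = 0) -> np.ndarray:
    """Sketch of Algorithm 3: random projection followed by Gaussian
    perturbation. A is the n x n adjacency matrix; the n x m result
    is the matrix that would be published."""
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    P = rng.normal(0.0, np.sqrt(1.0 / m), size=(n, m))  # P_ij ~ N(0, 1/m)
    Q = rng.normal(0.0, sigma, size=(n, m))             # Q_ij ~ N(0, sigma^2)
    return A @ P + Q                                    # A_hat = A_p + Q

# Toy usage: a random symmetric 0/1 graph with n = 100 nodes, m = 20
rng = np.random.default_rng(1)
A = np.triu(rng.integers(0, 2, size=(100, 100)), 1)
A = (A + A.T).astype(float)
A_hat = publish(A, m=20, sigma=1.0)
print(A_hat.shape)  # (100, 20)
```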
6.3.1 Theoretical Guarantee on Differential Privacy

Before we show the guarantee on differential privacy, we first introduce the definition of (ε, δ)-differential privacy. A (randomized) algorithm 𝒜 satisfies (ε, δ)-differential privacy if, for all inputs X and X′ differing in at most one user's one attribute value, and for all sets of possible outputs D ⊆ Range(𝒜), we have

Pr(𝒜(X) ∈ D) ≤ e^ε Pr(𝒜(X′) ∈ D) + δ,   (6.1)

where the probability is computed over the random coin tosses of the algorithm.

To understand the implication of (ε, δ)-differential privacy, consider the database X ∈ {0, 1}^{n×m} as a binary matrix. Let p_{i,j} := Pr(X_{i,j} = 1) represent the prior knowledge of an attacker about X, and let p′_{i,j} := Pr(X_{i,j} = 1 | 𝒜(X)) represent his knowledge about X after observing the output 𝒜(X) of algorithm 𝒜. If an algorithm 𝒜 satisfies (ε, δ)-differential privacy, then with probability 1 − δ, for any i ∈ [n] and j ∈ [m], the following condition holds:

|ln p_{i,j} − ln p′_{i,j}| ≤ ε.

In other words, the additional information gained by observing 𝒜(X) is bounded by ε. Thus, the parameter ε > 0 determines the degree of differential privacy: the smaller the ε, the less information will be revealed. The parameter δ ∈ (0, 1) is introduced to account for the rare events in which the two probabilities Pr(𝒜(X) ∈ D) and Pr(𝒜(X′) ∈ D) may differ significantly from each other.

Theorem 1. Assume δ < 1/2, n ≥ 2, and

σ ≥ (1/ε) √( 10 (ε + ln(1/(2δ))) ln(n/δ) ).

Then Algorithm 3 satisfies (ε, δ)-differential privacy with respect to a change in an individual person's attribute.

The key feature of Theorem 1 is that the variance for generating the random perturbation matrix Q is O(ln n), independent of the size of the social network. As a result, we can ensure differential privacy for the published Â of a very large social network by introducing only Gaussian noise with a small variance, an important feature that allows us to simultaneously preserve both utility and differential privacy. Our definition of differential privacy is a generalized version of ε-differential privacy, which can be viewed as (ε, 0)-differential privacy.

6.3.2 Proof of Theorem 1

To prove that Algorithm 3 is differentially private, we need the following result from [86].

Lemma 6.3.1. (Theorem 1 in [86]) Define the ℓ₂-sensitivity of the projection matrix P as w₂(P) = max_{1≤i≤n} |P_{i,∗}|₂, where P_{i,∗} represents the i-th row of matrix P. Assume δ < 1/2 and

σ ≥ (w₂(P)/ε) √( 2 (ε + ln(1/(2δ))) ).

Then Algorithm 3 satisfies (ε, δ)-differential privacy with respect to a change in an individual person's attribute.

In order to bound w₂(P), we rely on the following concentration result for the χ² distribution.

Lemma 6.3.2. (Tail bound for the χ² distribution) Let X₁, ..., X_d be independent draws from N(0, 1). For any 0 < δ < 1, we have, with probability 1 − δ,

Σ_{i=1}^d X_i² ≤ d + 2 √( d ln(1/δ) ) + 2 ln(1/δ).

Define

z_i² = Σ_{j=1}^m P_{i,j}².

Evidently, by the definition of w₂²(P), we have w₂²(P) = max_{1≤i≤n} z_i². Since P_{i,j} ∼ N(0, 1/m), the quantity m z_i² follows the χ² distribution with m degrees of freedom. Using Lemma 6.3.2, we have, with probability 1 − δ,

z_i² ≤ 1 + 2 √( ln(1/δ)/m ) + (2/m) ln(1/δ).   (6.2)

By taking the union bound, we have, with probability 1 − δ,

w₂²(P) = max_{1≤i≤n} z_i² ≤ 1 + 2 √( ln(n/δ)/m ) + (2/m) ln(n/δ) ≤ 2,   (6.3)

where the last inequality follows from m ≥ 4 ln(n/δ). We complete the proof by combining the result of Lemma 6.3.1 with inequality (6.3). Inequality (6.3) follows from inequality (6.2): according to inequality (6.2), each z_i can violate the bound with probability at most δ; to ensure that the chance of any z_i violating the bound is at most δ, we simply replace δ in (6.2) with δ/n, which immediately leads to inequality (6.3). Therefore, Theorem 1 holds because of Lemma 6.3.1 and inequality (6.3).
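Since the required σ grows only logarithmically with n, it stays small even for very large graphs. The following sketch (Python) simply evaluates the bound of Theorem 1 as reconstructed above; the function name and the example parameter values are illustrative.

```python
import math

def sigma_for_privacy(n: int, eps: float, delta: float) -> float:
    """Noise standard deviation suggested by Theorem 1:
    sigma >= (1/eps) * sqrt(10 * (eps + ln(1/(2*delta))) * ln(n/delta))."""
    return math.sqrt(10.0 * (eps + math.log(1.0 / (2.0 * delta)))
                     * math.log(n / delta)) / eps

# For a graph of a million nodes with eps = 1 and delta = 0.01,
# the required sigma is only about 30, despite n = 10^6:
print(round(sigma_for_privacy(10**6, 1.0, 0.01), 1))
```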
6.3.3 Theoretical Guarantee on Eigenvector Approximation

Let u₁, ..., u_n be the eigenvectors of the adjacency matrix A, ranked in descending order of the eigenvalues λ₁, ..., λ_n. Let k be the number of top eigenvectors of interest, and let ũ₁, ..., ũ_k be the first k eigenvectors of Â. Define the approximation error for the first k eigenvectors as

E² = max_{1≤i≤k} |u_i − ũ_i|².

Our goal is to show that the approximation error E² is small when the number of random projections m is sufficiently large.

Theorem 2. Assume (i) m ≥ c(k + k ln k), where c is a universal constant given in [121], (ii) n ≥ 4(m + 1) ln(12m), and (iii) λ_k − λ_{k+1} ≥ 2σ√(2n). Then, with probability at least 1/2, we have

E² ≤ 16σ²n / (λ_k − λ_{k+1})² + (32/λ_k²) Σ_{i=k+1}^n λ_i².

The corollary below simplifies the result of Theorem 2 by assuming that λ_k is significantly larger than the eigenvalues λ_{k+1}, ..., λ_n.

Corollary 6.3.1. Assume (i) λ_k = Θ(n/k) and (ii) Σ_{i=k+1}^n λ_i² = O(n). Under the same assumptions on m and n as in Theorem 2, we have, with probability at least 1/2,

E ≤ O( k [ σ/√n + 1/√n ] ).

As indicated by Theorem 2 and Corollary 6.3.1, under the assumptions that (i) λ_k is significantly larger than the eigenvalues λ_{k+1}, ..., λ_n, (ii) the number of random projections m is sufficiently larger than k, and (iii) n is significantly larger than the number of random projections m, the approximation error in recovering the eigenvectors of the adjacency matrix A is E ∝ O(k/√n). We also note that, according to Corollary 6.3.1, the approximation error is proportional to σ, which measures the amount of random perturbation needed for differential privacy preservation. This is consistent with our intuition: the smaller the random perturbation, the more accurate the approximation of the eigenvectors.
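An empirical analogue of the error E² can be measured directly on small graphs. The sketch below (Python with NumPy) is a simplified check under stated assumptions: since the published Â is rectangular, its left singular vectors stand in for its top eigenvectors, and columns are sign-aligned because eigenvectors are defined only up to a factor of −1. This is not the evaluation code used in the experiments.

```python
import numpy as np

def eigvec_error(A: np.ndarray, A_hat: np.ndarray, k: int) -> float:
    """Empirical analogue of E^2 = max_i |u_i - u~_i|^2 between the
    top-k eigenvectors of A and the top-k left singular vectors of
    the published matrix A_hat (a simplifying stand-in)."""
    _, U = np.linalg.eigh(A)            # eigenvalues ascending
    U = U[:, ::-1][:, :k]               # top-k eigenvectors of A
    Ut, _, _ = np.linalg.svd(A_hat, full_matrices=False)
    Ut = Ut[:, :k]
    # Align signs column by column before comparing
    signs = np.sign(np.sum(U * Ut, axis=0))
    return float(np.max(np.sum((U - Ut * signs) ** 2, axis=0)))
```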
6.3.4 Proof of Theorem 2

Next, we prove Theorem 2. Let A ∈ R^{n×n} be the adjacency matrix, A_p = AP, and Â = A_p + Q. Let û₁, ..., û_k be the first k eigenvectors of the matrix A_p. Define U = (u₁, ..., u_k), Û = (û₁, ..., û_k), and Ũ = (ũ₁, ..., ũ_k). For each of these matrices, we define a projection operator, denoted P_k, P̂_k, and P̃_k respectively, as

P_k = Σ_{i=1}^k u_i u_iᵀ = UUᵀ,  P̂_k = Σ_{i=1}^k û_i û_iᵀ = ÛÛᵀ,  P̃_k = Σ_{i=1}^k ũ_i ũ_iᵀ = ŨŨᵀ.

We first bound the approximation error E² by the difference between projection operators, i.e.,

E² = max_{1≤i≤k} |u_i − ũ_i|² ≤ ‖UUᵀ − ŨŨᵀ‖₂ = ‖P_k − P̃_k‖₂,

where ‖·‖₂ stands for the spectral norm of a matrix. Using the fact that

‖P_k − P̃_k‖₂² = ‖P_k − P̂_k + P̂_k − P̃_k‖₂² ≤ 2‖P_k − P̂_k‖₂² + 2‖P̂_k − P̃_k‖₂² ≤ 2‖P_k − P̂_k‖_F² + 2‖P̂_k − P̃_k‖₂²,   (6.4)

where ‖·‖_F stands for the Frobenius norm of a matrix, we bound ‖P_k − P̂_k‖_F and ‖P̂_k − P̃_k‖₂ separately.

To bound ‖P_k − P̂_k‖_F, we need the following theorem for random matrices.

Lemma 6.3.3. (Theorem 14 in [121]) Assume 0 < ε ≤ 1 and m ≥ c(k/ε + k ln k), where c is some universal constant. Then, with probability at least 2/3, we have

‖A − P̂_k(A)‖_F ≤ (1 + ε) ‖A − P_k(A)‖_F.

By the triangle inequality,

‖A − P̂_k(A)‖_F ≥ ‖P_k(A) − P̂_k P_k(A)‖_F − ‖P̂_k(A − P_k(A))‖_F − ‖A − P_k(A)‖_F ≥ ‖(P_k − P̂_k P_k)(A)‖_F − 2‖A − P_k(A)‖_F.

Combining this with the result of Lemma 6.3.3, we have, with probability at least 2/3,

‖(P_k − P̂_k P_k)(A)‖_F ≤ (3 + ε) ‖A − P_k(A)‖_F.   (6.5)

Since

‖(P_k − P̂_k P_k)(A)‖_F = ‖(P_k − P̂_k) P_k(A)‖_F ≥ ‖P_k − P̂_k‖_F ‖P_k(A)‖₂ = λ_k ‖P_k − P̂_k‖_F,

combining with inequality (6.5), we have, with probability at least 2/3,

‖P_k − P̂_k‖_F ≤ ((3 + ε)/λ_k) ‖A − P_k(A)‖_F.   (6.6)

To bound ‖P̂_k − P̃_k‖₂, we use the Davis-Kahan sin Θ theorem, given below.

Lemma 6.3.4. Let A and Ã be two symmetric matrices. Let {u_i}_{i=1}^k and {ũ_i}_{i=1}^k be the first k eigenvectors of A and Ã, respectively, and let λ_k(A) denote the k-th eigenvalue of A. Then we have

‖P_k − P̃_k‖₂ ≤ ‖A − Ã‖₂ / (λ_k(A) − λ_{k+1}(Ã))

if λ_k(A) > λ_{k+1}(Ã), where P_k = Σ_{i=1}^k u_i u_iᵀ and P̃_k = Σ_{i=1}^k ũ_i ũ_iᵀ.

Using Lemma 6.3.4 and the fact that

λ_{k+1}(Â) ≤ λ_{k+1}(A_p) + ‖A_p − Â‖₂ = λ_{k+1} + ‖Q‖₂,

we have

‖P̂_k − P̃_k‖₂ ≤ ‖A_p − Â‖₂ / (λ_k(A_p) − λ_{k+1}(Â)) ≤ ‖Q‖₂ / (λ_k − λ_{k+1} − ‖Q‖₂).

Under the assumption that λ_k − λ_{k+1} ≥ 2‖Q‖₂, we have

‖P̂_k − P̃_k‖₂ ≤ 2‖Q‖₂ / (λ_k − λ_{k+1}).

To bound the spectral norm of Q, we need the following lemma from random matrix theory.

Lemma 6.3.5. Let A ∈ R^{r×m} be a standard Gaussian random matrix. For any 0 < ε ≤ 1/2, with probability at least 1 − δ, we have

‖(1/m) AAᵀ − I‖₂ ≤ ε,

provided m ≥ (4(r + 1)/ε²) ln(2r/δ).

Using Lemma 6.3.5 and the fact that Q_{i,j} ∼ N(0, σ²), we have, with probability at least 5/6,

‖QQᵀ‖₂ ≤ (1 + η) σ² n,

where n ≥ (4(m + 1)/η²) ln(12m). As a result, we have, with probability at least 5/6, ‖Q‖₂ ≤ σ √((1 + η)n), and therefore

‖P̂_k − P̃_k‖₂ ≤ (2σ/(λ_k − λ_{k+1})) √((1 + η)n).   (6.7)

We complete the proof by combining the bounds for ‖P_k − P̂_k‖_F and ‖P̂_k − P̃_k‖₂ in (6.6) and (6.7) and plugging them into inequality (6.4).

6.4 Experimental Results

To demonstrate the effectiveness of our differentially private random matrix approach and to illustrate the utility preservation of the eigen-spectrum, we perform experiments over graphs obtained from three different online social networks. We analyze the impact of perturbation by evaluating the utility of the published data for two different applications which require the spectral information of a graph. First, we consider clustering of social networks, which has been widely used for community detection in social networks. We choose the spectral clustering algorithm in our study, which depends on the eigenvectors of the adjacency matrix. Next, we examine the problem of identifying the ranks of influential nodes in a social network graph. For evaluation purposes, we obtain clusters and node ranks from the published graph and compare the results against those obtained from the original graph. We give a brief description of the results obtained for each of these applications of graph spectra in the subsequent sections.

[Figure 6.1: Degree distribution of the three datasets (log-log; node degree vs. node index, reverse sorted by degree)]

6.4.1 Dataset

In our evaluation we use three different social network graphs, from Facebook, LiveJournal, and Pokec. We use the Facebook dataset collected by Wilson et al. [143]. The social graphs of LiveJournal and Pokec were obtained from the publicly available SNAP graph library. The choice of these social networks is based on two main requirements. First, the network should be large enough to be a true representation of the real online social structure; a small network not only under-represents the social structure but also produces biased results. Second, the number of edges in the network should be sufficiently large in order to reveal the interesting structure of the network. For all three benchmark datasets, the ratio of the number of edges to the number of nodes is between 7 and 20. Table 6.1 provides the basic statistics of the social network graphs. Figure 6.1 shows the degree distributions of the three online social networks on a log-log scale. We can see that the data follows a power law distribution, which is a characteristic of social network degree distributions.
Table 6.1 Dataset description

Network     | Nodes     | Edges
Facebook    | 3,097,165 | 23,667,394
Pokec       | 1,632,803 | 30,622,564
LiveJournal | 3,997,962 | 34,681,189

[Figure 6.2: NMI values for Facebook, for k = 2, 4, 8, 16; (a) σ = 0.1, (b) σ = 0.5, (c) σ = 1 (original data O vs. m = 200 and m = 20)]

[Figure 6.3: NMI values for LiveJournal; (a) σ = 0.1, (b) σ = 0.5, (c) σ = 1]

[Figure 6.4: NMI values for Pokec; (a) σ = 0.1, (b) σ = 0.5, (c) σ = 1]

6.4.2 Node Clustering

Clustering is a widely used technique for identifying groups of similar instances in a dataset, with applications in community detection, targeted marketing, bioinformatics, etc. Social networks possess a large amount of information which can be utilized in extensive data mining applications, and large complex graphs representing relationships among individual users can be obtained from them. One of the key research questions is the understanding of the community structure present in large social network graphs. Social networking platforms possess a strong community structure of users, which can be captured by clustering the nodes of a social network graph. Detecting communities can help in identifying the structural position of nodes in a community. Nodes with a central position in a community have influence in the community. Similarly, nodes lying at the intersection of two communities are important for maintaining links between communities. Disclosure of the identity of such nodes having important structural properties results in serious privacy issues. Therefore, in order to protect an individual's privacy, it is crucial for data publishers to provide rigorous privacy guarantees for the data to be published.

In our experiments, we use spectral clustering for evaluating our privacy-preserving random matrix approach. Spectral clustering has many fundamental advantages over other clustering algorithms. Unlike other clustering algorithms, spectral clustering is particularly suitable for social networks, since it requires an adjacency matrix as input rather than a feature representation of the data. For a social network graph G represented by the binary adjacency matrix A, spectral clustering techniques utilize the eigen-spectrum of A to perform clustering. The basic idea is to view clustering as a graph partition problem and divide the graph into several disjoint subgraphs by removing only the edges that connect nodes with small similarities.

In order to evaluate the utility of the published data for clustering, we utilize normalized mutual information (NMI) as a measure of clustering quality [59]. Although purity is a simpler evaluation measure, high purity is easy to achieve for a large number of clusters, and purity cannot be used to evaluate the tradeoff between clustering quality and the number of clusters. NMI allows us to evaluate this tradeoff by normalizing the mutual information I(ω; C), as described in Equation 6.8:
NMI = I(ω; C) / ( [H(ω) + H(C)] / 2 ),   (6.8)

where H denotes entropy, which measures the uniformity of the distribution of nodes over a set of clusters; ω = {w₁, ..., w_k} is a set of clusters; and C = {c₁, ..., c_k} is a set of classes, i.e., the ground truth. NMI is bounded between 0 and 1, and the larger the NMI, the better the clustering performance.

We perform extensive experiments over the datasets to evaluate our approach. We now give a stepwise explanation of our evaluation protocol. Since we do not have ground truth about the communities in the datasets, we employ an exhaustive approach: we evaluate clustering over the original data and generate the ground truth communities from it. First, for a given value of k we generate 5 different sets of clusters with the spectral clustering algorithm, represented as C_i for i = 1, ..., 5. Since spectral clustering employs k-means, each set C_i can have a different cluster distribution. Therefore, to evaluate the consistency of the cluster distributions, NMI values are obtained for the C(5, 2) different pairs of sets (C_i, C_j) with i ≠ j, and the average value is reported. Then, another 5 cluster sets are obtained through differentially private spectral clustering, represented as ω_i for i = 1, ..., 5. Finally, to evaluate the cluster sets ω_i, NMI values are obtained using the C_i as the ground truth; in this case NMI values are obtained for each pair (ω_i, C_j) for all i, j ∈ {1, ..., 5}, and the average value is reported.

We evaluate the clustering results for three different values of σ, suggested in [82]. For each σ, we evaluate clustering for two different numbers of random projections, m = 20 and m = 200. Figures 6.2, 6.3, and 6.4 show the NMI values obtained for four different values of k, where the symbol O represents the NMI values obtained by using the original data. It is not surprising to observe that the clustering quality deteriorates with an increasing number of clusters: the larger the number of clusters, the more challenging the problem. Overall, we observe that m = 200 yields significantly better clustering performance than m = 20.

[Figure 6.5: Cluster distribution for Facebook (percentage of nodes per cluster, original vs. published); (a) 2-clusters, (b) 4-clusters, (c) 8-clusters, (d) 16-clusters]

When the random perturbation is small (i.e., σ = 0.1), our approach with m = 200 random projections yields clustering performance similar to spectral clustering using the original data. This is consistent with our theoretical result given in Theorem 2: with a sufficiently large number of random projections, the approximation error in recovering the eigenvectors of the original data can be as small as O(σ/√n). Note that the NMI values represented by the symbol O are obtained for pairs of clusters (C_i, C_j); these NMI values represent the quality of clustering when clusters C_i are compared with clusters C_j. The NMI values for m = 20 and m = 200 are obtained when clusters C_j are compared with clusters ω_i. Finally, we observe that the clustering performance declines with larger noise for the random perturbation. However, even with random noise as large as σ = 1, clustering using the differentially private copy of the social network graph still yields decent performance, with NMI ≥ 0.70. This is again consistent with our theoretical result: the approximation error of the eigenvectors is O(σ/√n), and therefore it will be small as long as σ is significantly smaller than √n.
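To make the evaluation protocol concrete, the following sketch (Python with NumPy and scikit-learn) runs a basic spectral-clustering step, k-means on the leading singular vectors, on both the original and the published matrix, and scores their agreement with normalized_mutual_info_score. The SVD-plus-k-means step is a simplified stand-in for the full spectral clustering pipeline, and the toy graph sizes are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def spectral_labels(M, k, seed=0):
    """k-means on the top-k left singular vectors of M, where M is the
    adjacency matrix or the published n x m matrix (simplified sketch)."""
    U, _, _ = np.linalg.svd(np.asarray(M, dtype=float), full_matrices=False)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed)
    return km.fit_predict(U[:, :k])

# Toy check on a small random graph and its published counterpart
rng = np.random.default_rng(2)
A = np.triu(rng.integers(0, 2, size=(60, 60)), 1)
A = (A + A.T).astype(float)
A_hat = A @ rng.normal(0, np.sqrt(1 / 20), (60, 20)) + rng.normal(0, 1.0, (60, 20))
print(normalized_mutual_info_score(spectral_labels(A, 4),
                                   spectral_labels(A_hat, 4)))
```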
Table 6.2 shows that both the memory requirement and the running time increase significantly with an increasing number of random projections.

Table 6.2 Memory utilization and running time of the proposed algorithm

Dataset     | Memory (MB), m = 200 | Memory (MB), m = 20 | Time (sec), m = 200 | Time (sec), m = 20
Facebook    | 4955 | 495 | 150 | 6.15
Pokec       | 2612 | 261 | 97  | 4.60
LiveJournal | 6396 | 639 | 211 | 8.15

To show the variation in the cluster distributions, we select the clusters obtained from the Facebook data for m = 200 and σ = 1. Figures 6.5, 6.6, and 6.7 show the percentage of nodes present in the clusters obtained from the original and the published data. Note that the perturbation has little to no effect for a small number of clusters, as the distribution of nodes is identical.

[Figure 6.6: Cluster distribution for LiveJournal (original vs. published); (a) 2-clusters, (b) 4-clusters, (c) 8-clusters, (d) 16-clusters]

[Figure 6.7: Cluster distribution for Pokec (original vs. published); (a) 2-clusters, (b) 4-clusters, (c) 8-clusters, (d) 16-clusters]

We compare our results with the approach presented in [141], which directly perturbs the eigenvectors of the original data with Laplacian noise; we refer to this approach as LNPP for short. We note that we did not compare against the other approaches for differentially private eigen decomposition because they are computationally infeasible for the large social networks studied in our experiments. We implement the LNPP mechanism and evaluate its clustering performance by comparing it to the clustering results generated from the original adjacency matrix. Table 6.3 gives the NMI results using LNPP over the different datasets for σ = 1. It is clear that LNPP performs significantly worse than the proposed algorithm in clustering. Note that we did not include the clustering performance of LNPP in Figures 6.2, 6.3, and 6.4 because of its poor performance, which would basically overlap with the horizontal axis.

Table 6.3 NMI for clustering using the LNPP approach [141] for σ = 1

Clusters | Facebook | LiveJournal | Pokec
2        | 9.1E−8   | 9.7E−7      | 1.1E−7
4        | 8.8E−7   | 3.2E−6      | 3.5E−6
8        | 4.1E−6   | 3.6E−6      | 5.8E−6
16       | 1.3E−5   | 1.1E−5      | 2.6E−5

[Figure 6.8: Percentage of preserved ranks using the random matrix approach (top 10, 100, 1000, 10000 nodes vs. k); (a) Facebook, (b) LiveJournal, (c) Pokec]

6.4.3 Node Ranking

Identifying information hubs in a social network is an important problem. An information hub refers to a node which occupies a central position in the community and has a large number of connections with other users. Such central nodes play an important role in information diffusion.
Advertising agencies can utilize information about the top-t influential nodes for word-of-mouth advertisement [97]. Therefore, preserving the privacy of such influential nodes is important. Influential node analysis requires information about the eigen-spectrum of the social network graph. Eigenvector centrality (EVC) is a measure to quantify the influence of a node in a social network [36]; it is mathematically related to several other influence measures [84, 68]. EVC requires the computation of eigenvectors and assigns ranks to nodes according to their location in the most dominant community. The EVC of an adjacency matrix is defined as its principal eigenvector. We employ principal component centrality (PCC), which is based on the EVC measure, to rank the nodes [76]. Let k denote the number of eigenvectors to be computed, and let U denote the n × k matrix whose i-th column is the i-th eigenvector of the n × n adjacency matrix A. Then the PCC can be expressed as

C_k = √( ((A U_{n×k}) ⊙ (A U_{n×k})) 1_{k×1} ),   (6.9)

where ⊙ denotes the element-wise product, the square root is taken element-wise, and C_k is an n × 1 vector containing the PCC score of each node. The nodes with the highest PCC scores are considered the influential nodes. As for spectral clustering, a differentially private matrix Â of the social network A is computed using Algorithm 3. The published matrix is then used to compute the eigenvectors and obtain the PCC scores Ĉ_k using Equation 6.9.

We evaluate the utility preservation of the published data by evaluating the accuracy with which the highly ranked influential nodes are identified. First, for a given value of k, the eigenvectors corresponding to the k largest eigenvalues are computed from the original adjacency matrix and from the published data, i.e., after applying matrix randomization using Algorithm 3. We then obtain the PCC scores of all the nodes in the graph from the original and the published data (denoted C_k and Ĉ_k, respectively). The original scores C_k and the published scores Ĉ_k are then compared in two different ways. For all experiments, we compute PCC scores by varying the number of eigenvectors in the range k = 2, 4, 8, 16. In the first evaluation, we use the mean square error (MSE) to compute the error between the score values of C_k and Ĉ_k; we report n × MSE in order to alleviate the scaling factor induced by the size of the social networks. In the second evaluation, we identify two sets of top-t influential nodes based on the PCC scores computed from the original data and from the published data, and we evaluate the performance of our algorithm by measuring the percentage of overlapping nodes between these two sets.
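The sketch below (Python with SciPy) illustrates Equation 6.9 on a symmetric matrix, together with the top-t overlap metric used in the second evaluation. It computes eigenvectors of the original adjacency matrix with eigsh; how the eigenvectors are derived from the rectangular published matrix is not shown here, and the function names are illustrative.

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def pcc_scores(A, k: int) -> np.ndarray:
    """Principal component centrality (Equation 6.9): for each node, the
    root of the summed squared projections of its adjacency row onto the
    top-k eigenvectors. A may be a dense array or a SciPy sparse matrix
    (symmetric), with k < n."""
    _, U = eigsh(A, k=k, which="LA")    # eigenvectors of the k largest eigenvalues
    B = np.asarray(A @ U)               # n x k projections
    return np.sqrt(np.sum(B ** 2, axis=1))

def top_t_overlap(c_orig: np.ndarray, c_pub: np.ndarray, t: int) -> float:
    """Fraction of the top-t nodes by original PCC score that also appear
    in the top-t by the published scores."""
    top = lambda c: set(np.argsort(c)[-t:])
    return len(top(c_orig) & top(c_pub)) / t
```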
Then the percentage of 139 # of Eigenvectors 2 Facebook Live Journal Pokec 2.6e−26 4.0e−4 3.0e−4 4 2.9e−4 0.006 0.005 8 0.021 0.034 0.009 16 0.013 0.719 0.019 Table 6.4 n × MSE using the proposed approach # of Eigenvectors Facebook Live Journal Pokec 2 1.83 1.96 1.79 4 1.83 1.96 1.63 8 1.67 1.88 1.62 16 1.64 1.92 1.55 Table 6.5 n × MSE using baseline LNPP nodes common to both T and ˆT is computed. We consider top 10, 100, 1000 and 10000 ranked nodes. Figure 6.8 shows the percentage of nodes correctly identified as the top−t influential nodes for the three datasets. We can see that for all cases, the proposed algorithm is able to recover at least 80% of the most influential nodes. In contrast, LNPP fails to preserve the most influential nodes as the percentage of nodes correctly identified as the top−t influential nodes is less than 1% for all cases. 6.5 Conclusions In this chapter, we propose a random matrix approach to OSN graph publishing, which achieves space efficiency by reducing the dimensions of adjacency matrices and achieves differential privacy by adding a small amount of noise. The key advantage of our work over prior art is that our proposed random matrix approach achieves both space efficiency and utility preservation, whereas prior work either loses space efficiency (due to outputting large n × n matrices) or utility preservation (due to the addition of a large amount of noise). We formally prove that our scheme achieves differential privacy and that the amount of added noise is small. Furthermore, we validated our random matrix approach on two different applications: node clustering, and node ranking, which require the spectral information of a graph. Our experimental results show that even for high values of noise variance σ = 1, our scheme preserves the utility of the dataset. 140 8 Conclusion In this thesis, I presented network systems that solve network management needs of three types of operational networks namely; cellular networks, content delivery networks, and online social networks. For cellular networks, I presented an approach to modeling of end- to-end network performance. I used statistical, data mining and machine learning methods to develop models for detection and localization of network performance degradation issues. For content delivery networks (CDNs), I presented approaches to actively collect client perceived network performance and use these measurements along with server side logs for modeling and analysis for quantification and optimization of CDN performance. The solutions that were developed have been tested using real data which is collected at operational networks. For online social networks, my research is on developing privacy preserving approaches for graph data publishing. The vision of this thesis can be extended to many other similar research directions. Net- work management involves monitoring other important set of metrics that measure end-user QoE, that is measured in terms of qualitative metrics such as user engagement and abandon- ment rate, and quantitative metrics which are specific to an application e.g., page load time for web browsing and buffering ratio for video streaming. The modeling and analysis fra- mework adopted in this thesis can be leveraged to develop network management systems that automate the process of debugging, quantifying, and optimizing QoE for operational networks. 141 BIBLIOGRAPHY 142 BIBLIOGRAPHY [1] Alexa Top 500 Global Sites. http://www.alexa.com/topsites. 
[2] Being good stewards of the Internet. https://www.verizondigitalmedia.com/blog/2015/02/being-good-stewards-of-the-internet/.
[3] Facts & Figures - Akamai. http://www.akamai.com/html/about/facts_figures.html.
[4] Facts & Figures - Akamai Technologies, Inc. http://www.akamai.com/html/about/facts_figures.html.
[5] Four Ways to Avoid Cascading Failures on a CDN. https://www.verizondigitalmedia.com/blog/2017/06/four-ways-to-avoid-cascading-failures-on-a-cdn/.
[6] Google Global Cache (GGC). https://peering.google.com/about/ggc.html.
[7] Mark the spot. http://itunes.apple.com/us/app/at-t-mark-the-spot/id338307313?mt=8.
[8] Netflix Open Connect Content Delivery for ISPs. https://openconnect.netflix.com.
[9] PeeringDB. http://www.peeringdb.com.
[10] Root Wireless Inc. http://www.rootwireless.com.
[11] Stanford large network dataset collection. http://snap.stanford.edu/data/.
[12] White paper: Cisco visual networking index forecast and methodology 2015-2020. Technical report, Cisco.
[13] Internet Filter: Internet Pornography Statistics. http://internet-filter-review.toptenreviews.com/internet-pornography-statistics.html, 2006.
[14] Porn Sites Get More Visitors Each Month Than Netflix, Amazon And Twitter Combined. http://www.huffingtonpost.com/2013/05/03/internet-porn-stats_n_3187682.html, 2013.
[15] 3GPP. General packet radio service (GPRS); GPRS tunnelling protocol (GTP) across the Gn and Gp interface. TS 29.060, 3rd Generation Partnership Project (3GPP), 2011.
[16] Gaurav Aggarwal, Elie Bursztein, Collin Jackson, and Dan Boneh. An analysis of private browsing modes in modern browsers. In USENIX Security Symposium, 2010.
[17] F. Ahmed, J. Erman, Z. Ge, A. X. Liu, J. Wang, and H. Yan. Detecting and localizing end-to-end performance degradation for cellular data services based on TCP loss ratio and round trip time. IEEE/ACM Transactions on Networking, 2017.
[18] F. Ahmed, J. Erman, Z. Ge, A. X. Liu, J. Wang, and H. Yan. Monitoring quality-of-experience for operational cellular networks using machine-to-machine traffic. In 36th Annual IEEE International Conference on Computer Communications (INFOCOM), 2017.
[19] F. Ahmed, A. X. Liu, and R. Jin. Social graph publishing with privacy guarantees. In IEEE 36th International Conference on Distributed Computing Systems (ICDCS), 2016.
[20] F. Ahmed, M. Z. Shafiq, A. R. Khakpour, and A. X. Liu. Optimizing internet transit routing for content delivery networks. IEEE/ACM Transactions on Networking, 2017.
[21] F. Ahmed, M. Z. Shafiq, and A. X. Liu. The internet is for porn: Measurement and analysis of online adult traffic. In IEEE 36th International Conference on Distributed Computing Systems (ICDCS), 2016.
[22] F. Ahmed, Z. Shafiq, A. Khakpour, and A. X. Liu. Optimizing internet transit routing for content delivery networks. In IEEE 24th International Conference on Network Protocols (ICNP), 2016.
[23] Faraz Ahmed, Jeffrey Erman, Zihui Ge, Alex X. Liu, Jia Wang, and He Yan. Detecting and localizing end-to-end performance degradation for cellular data services. In ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), 2015.
[24] Faraz Ahmed, Jeffrey Erman, Zihui Ge, Alex X. Liu, Jia Wang, and He Yan. Detecting and localizing end-to-end performance degradation for cellular data services. In 35th Annual IEEE International Conference on Computer Communications (INFOCOM), 2016.
[25] Aditya Akella, Bruce Maggs, Srinivasan Seshan, Anees Shaikh, and Ramesh Sitaraman. A measurement-based analysis of multihoming.
In Proceedings of the SIGCOMM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, pages 353-364, 2003.
[26] Isabel Amigo, Pablo Belzarena, and Sandrine Vaton. On the problem of revenue sharing in multi-domain federations. In Proceedings of IFIP Networking, pages 252-264, 2012.
[27] C. Amrutkar, M. Hiltunen, T. Jim, K. Joshi, O. Spatscheck, P. Traynor, and S. Venkataraman. Why is My Smartphone Slow? On the Fly Diagnosis of Underperformance on the Mobile Internet. In 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pages 1-8, 2013.
[28] Xueli An and Gerald Kunzmann. Understanding mobile Internet usage behavior. In IFIP Networking, 2014.
[29] M. Austin, J. Fix, S. Meredith, S. Puthenpura, and G. Meempat. Location estimation of a mobile device in a UMTS network, 2012. US Patent App. 12/870,254.
[30] AvenueQ. The Internet Is for Porn. https://www.youtube.com/watch?v=LTJvdGcb7Fs.
[31] Lars Backstrom, Cynthia Dwork, and Jon Kleinberg. Wherefore art thou r3579x?: anonymized social networks, hidden patterns, and structural steganography. In Proceedings of the 16th International Conference on World Wide Web, pages 181-190. ACM, 2007.
[32] Paramvir Bahl, Ranveer Chandra, Albert Greenberg, Srikanth Kandula, David A. Maltz, and Ming Zhang. Towards highly reliable enterprise network services via inference of multi-level dependencies. ACM SIGCOMM Computer Communication Review, 37(4):13-24, 2007.
[33] Dziugas Baltrunas, Ahmed Elmokashfi, and Amund Kvalbein. Measuring the reliability of mobile broadband networks. In Proceedings of the 2014 Internet Measurement Conference (IMC '14), pages 45-58. ACM, 2014.
[34] Jeremiah Blocki, Avrim Blum, Anupam Datta, and Or Sheffet. The Johnson-Lindenstrauss transform itself preserves differential privacy. In Proceedings of the 53rd IEEE Annual Symposium on Foundations of Computer Science, pages 410-419. IEEE, 2012.
[35] Avrim Blum, Cynthia Dwork, Frank McSherry, and Kobbi Nissim. Practical privacy: the SuLQ framework. In Proceedings of the Twenty-Fourth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 128-138. ACM, 2005.
[36] Phillip Bonacich. Factoring and weighting approaches to status scores and clique identification. Journal of Mathematical Sociology, 2(1):113-120, 1972.
[37] I. Castro and S. Gorinsky. T4P: Hybrid interconnection for cost reduction. In IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), pages 178-183, 2012.
[38] Meeyoung Cha, Alan Mislove, and Krishna P. Gummadi. A measurement-driven analysis of information propagation in the Flickr social network. In Proceedings of the 18th International Conference on World Wide Web, pages 721-730. ACM, 2009.
[39] Nikolaos Chatzis, Georgios Smaragdakis, Anja Feldmann, and Walter Willinger. There is more to IXPs than meets the eye. ACM SIGCOMM Computer Communication Review, 43(5):19-28, 2013.
[40] Kamalika Chaudhuri, Anand Sarwate, and Kaushik Sinha. Near-optimal differentially private principal components. In Advances in Neural Information Processing Systems 25, pages 998-1006, 2012.
[41] M. Chen, E. Kiciman, E. Fratkin, A. Fox, and E. Brewer. Pinpoint: Problem determination in large, dynamic Internet services. In Proceedings of the International Conference on Dependable Systems and Networks, pages 595-604, 2002.
[42] Yu-Chung Cheng, John Bellardo, Péter Benkö, Alex C. Snoeren, Geoffrey M. Voelker, and Stefan Savage. Jigsaw: solving the puzzle of enterprise 802.11 analysis. ACM SIGCOMM Computer Communication Review, 36(4), 2006.
[43] Y. Crama and P.L. Hammer. Boolean Models and Methods in Mathematics, Computer Science, and Engineering. 2010.
[44] Amogh Dhamdhere and Constantinos Dovrolis. ISP and egress path selection for multihomed networks. In Proceedings of IEEE INFOCOM, pages 1-12, 2006.
[45] Bowei Du, Michael Demmer, and Eric Brewer. Analysis of WWW Traffic in Cambodia and Ghana. In WWW, 2006.
[46] William Dumouchel and Fanny O'Brien. Integrating a robust option into a multiple regression computing environment. In Proceedings of the 21st Symposium on the Interface of Computer Science and Statistics, page 41, 1991.
[47] Cynthia Dwork. Differential privacy: A survey of results. In Theory and Applications of Models of Computation, pages 1-19. Springer, 2008.
[48] Jeffrey Erman, Alexandre Gerber, K.K. Ramakrishnan, Subhabrata Sen, and Oliver Spatscheck. Over The Top Video: The Gorilla in Cellular Networks. In ACM Internet Measurement Conference (IMC), 2011.
[49] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, and T. Berners-Lee. RFC 2616: Hypertext Transfer Protocol - HTTP/1.1. Technical report, Network Working Group, Internet Engineering Task Force, 1999.
[50] Alessandro Finamore, Marco Mellia, Maurizio M. Munafò, Ruben Torres, and Sanjay G. Rao. YouTube everywhere: Impact of device and infrastructure synergies on user experience. In Proceedings of ACM IMC, pages 345-360, 2011.
[51] Benjamin Frank, Ingmar Poese, Yin Lin, Georgios Smaragdakis, Anja Feldmann, Bruce Maggs, Jannis Rake, Steve Uhlig, and Rick Weber. Pushing CDN-ISP collaboration to the limit. ACM SIGCOMM Computer Communication Review, 43(3):34-44, 2013.
[52] Benjamin Frank, Ingmar Poese, Georgios Smaragdakis, Anja Feldmann, Bruce M. Maggs, Steve Uhlig, Vinay Aggarwal, and Fabian Schneider. Recent Advances in Networking, chapter Collaboration Opportunities for Content Delivery and Network Infrastructures. ACM SIGCOMM, 2013.
[53] Eugene C. Freuder and Mark Wallace. Constraint Programming, pages 369-401. Springer US, 2014.
[54] Ruomei Gao, Constantinos Dovrolis, and Ellen W. Zegura. Avoiding oscillations due to intelligent route control systems. In Proceedings of IEEE INFOCOM, pages 1-12, 2006.
[55] Aaron Gember, Aditya Akella, Jeffrey Pang, Alexander Varshavsky, and Ramon Caceres. Obtaining in-context measurements of cellular network performance. In Proceedings of the ACM Internet Measurement Conference, pages 287-300. ACM, 2012.
[56] David K. Goldenberg, Lili Qiuy, Haiyong Xie, Yang Richard Yang, and Yin Zhang. Optimizing cost and performance for multihoming. ACM SIGCOMM Computer Communication Review, 34(4):79-92, 2004.
[57] Ilya Grigorik. Breaking the 1000 ms Time to Glass Mobile Barrier. In SF HTML5 meetup, 2013.
[58] Daniel Gruhl, Ramanathan Guha, David Liben-Nowell, and Andrew Tomkins. Information diffusion through blogspace. In Proceedings of the 13th International Conference on World Wide Web, pages 491-501, 2004.
[59] Silviu Guiaşu. Information theory with applications. McGraw-Hill, New York, 1977.
[60] Laszlo Gyarmati, Rade Stanojevic, Michael Sirivianos, and Nikolaos Laoutaris. Sharing the cost of backbone networks: cui bono? In Proceedings of ACM IMC, pages 509-522, 2012.
[61] N. Halko, P. G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217-288, 2011.
[62] Prashanth Hande, Mung Chiang, Robert Calderbank, and Junshan Zhang. Pricing under constraints in access networks: Revenue maximization and congestion management. In Proceedings of IEEE INFOCOM, pages 1–9, 2010.
[63] Saul Hansell. AOL removes search data on vast group of web users. New York Times, August 2006.
[64] Moritz Hardt and Aaron Roth. Beyond worst-case analysis in private singular vector computation. arXiv preprint arXiv:1211.0975, 2012.
[65] S. Hasan, S. Gorinsky, C. Dovrolis, and R. K. Sitaraman. Trade-offs in optimizing the cache deployments of CDNs. In Proceedings of IEEE INFOCOM, pages 460–468, 2014.
[66] Michael Hay, Chao Li, Gerome Miklau, and David Jensen. Accurate estimation of the degree distribution of private networks. In Proceedings of the 9th IEEE International Conference on Data Mining, pages 169–178. IEEE, 2009.
[67] He Yan, Zihui Ge, Matt Osinski, and Jennifer Yates. When cell towers fail: Quantifying the customer impact. 2013.
[68] Cornelis Hoede. A new status score for actors in a social network. Twente University Department of Applied Mathematics (Memorandum no. 243), 1978.
[69] Paul W. Holland and Roy E. Welsch. Robust regression using iteratively reweighted least-squares. Communications in Statistics - Theory and Methods, 6(9):813–827, 1977.
[70] Cheng Huang, Angela Wang, Jin Li, and Keith W. Ross. Measuring and evaluating large-scale CDNs. In Proceedings of ACM IMC, pages 15–29, 2008.
[71] Junxian Huang, Feng Qian, Alexandre Gerber, Z. Morley Mao, Subhabrata Sen, and Oliver Spatscheck. A close examination of performance and power characteristics of 4G LTE networks. In Proceedings of the 10th International Conference on Mobile Systems, Applications, and Services (MobiSys), pages 225–238, 2012.
[72] Junxian Huang, Feng Qian, Yihua Guo, Yuanyuan Zhou, Qiang Xu, Z. Morley Mao, Subhabrata Sen, and Oliver Spatscheck. An in-depth study of LTE: Effect of network protocol and application behavior on performance. In Proceedings of ACM SIGCOMM, pages 363–374, 2013.
[73] Junxian Huang, Qiang Xu, Birjodh Tiwana, Z. Morley Mao, Ming Zhang, and Paramvir Bahl. Anatomizing application performance differences on smartphones. In Proceedings of the 8th International Conference on Mobile Systems, Applications, and Services, pages 165–178, 2010.
[74] Qi Huang, Ken Birman, Robbert van Renesse, Wyatt Lloyd, Sanjeev Kumar, and Harry C. Li. An Analysis of Facebook Photo Caching. In SOSP, 2013.
[75] Muhammad U. Ilyas and Hayder Radha. Identifying influential nodes in online social networks using principal component centrality. In Proceedings of the IEEE International Conference on Communications, pages 1–5. IEEE, 2011.
[76] Muhammad U. Ilyas, M. Zubair Shafiq, Alex X. Liu, and Hayder Radha. A distributed and privacy preserving algorithm for identifying information hubs in social networks. In INFOCOM, 2011 Proceedings IEEE, pages 561–565. IEEE, 2011.
[77] Libin Jiang, Shyam Parekh, and Jean Walrand. Time-dependent network pricing and bandwidth trading. In Proceedings of IEEE Network Operations and Management Symposium Workshops (NOMS), pages 193–200, 2008.
[78] Yu Jin, Nick Duffield, Alexandre Gerber, Patrick Haffner, Wen-Ling Hsu, Guy Jacobson, Subhabrata Sen, Shobha Venkataraman, and Zhi-Li Zhang. Large-scale app-based reporting of customer problems in cellular networks: Potential and limitations. In Proceedings of the ACM SIGCOMM Workshop on Measurements up the Stack, pages 13–18. ACM, 2011.
[79] Diana Joumblatt, Jaideep Chandrashekar, Branislav Kveton, Nina Taft, and Renata Teixeira. Predicting user dissatisfaction with internet application performance at end-hosts. In Proceedings of INFOCOM, pages 235–239. IEEE, 2013.
[80] S. Kandula, D. Katabi, and J. P. Vasseur. Shrink: A tool for failure diagnosis in IP networks. In Proceedings of the ACM SIGCOMM Workshop on Mining Network Data, pages 173–178, 2005.
[81] Srikanth Kandula, Ratul Mahajan, Patrick Verkaik, Sharad Agarwal, Jitendra Padhye, and Paramvir Bahl. Detailed diagnosis in enterprise networks. In Proceedings of ACM SIGCOMM, pages 243–254, 2009.
[82] Michael Kapralov and Kunal Talwar. On differentially private low rank approximation. In Proceedings of the 24th Symposium on Discrete Algorithms (SODA), 2012.
[83] Richard M. Karp. Reducibility among Combinatorial Problems. Springer US, Boston, MA, 1972.
[84] Leo Katz. A new status index derived from sociometric analysis. Psychometrika, 18(1):39–43, 1953.
[85] Leonard Kaufman and Peter J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis, volume 344. John Wiley & Sons, 2009.
[86] Krishnaram Kenthapadi, Aleksandra Korolova, Ilya Mironov, and Nina Mishra. Privacy via the Johnson-Lindenstrauss transform. arXiv preprint arXiv:1204.2606, 2012.
[87] George Kesidis, Apurba Das, and Gustavo de Veciana. On flat-rate and usage-based pricing for tiered commodity internet services. In Proceedings of IEEE CISS, pages 304–308, 2008.
[88] I. Y. Kim and O. L. de Weck. Adaptive weighted-sum method for bi-objective optimization: Pareto front generation. Structural and Multidisciplinary Optimization, 29(2):149–158, 2005.
[89] Ramana Rao Kompella, Jennifer Yates, Albert Greenberg, and Alex C. Snoeren. IP fault localization via risk modeling. In Proceedings of NSDI, pages 57–70, 2005.
[90] R. R. Kompella, J. Yates, A. Greenberg, and A. C. Snoeren. Detection and localization of network black holes. In Proceedings of IEEE INFOCOM, pages 2180–2188, 2007.
[91] Rupa Krishnan, Harsha V. Madhyastha, Sridhar Srinivasan, Sushant Jain, Arvind Krishnamurthy, Thomas Anderson, and Jie Gao. Moving beyond end-to-end path information to optimize CDN performance. In Proceedings of ACM IMC, pages 190–201, 2009.
[92] Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. What is Twitter, a social network or a news media? In Proceedings of the 19th International Conference on World Wide Web, pages 591–600. ACM, 2010.
[93] Craig Labovitz. First Data on Changing Netflix and CDN Market Share, June 2012. http://www.deepfield.net/2012/06/first-data-on-changing-netflix-and-cdn-market-share/.
[94] Markus Laner, Philipp Svoboda, Stefan Schwarz, and Markus Rupp. Users in cells: A data traffic analysis. In Wireless Communications and Networking Conference (WCNC), pages 3063–3068. IEEE, 2012.
[95] Andrew Lipsman. Major Mobile Milestones in May: Apps Now Drive Half of All Time Spent on Digital. http://www.comscore.com/Insights/Blog/Major-Mobile-Milestones-in-May-Apps-Now-Drive-Half-of-All-Time-Spent-on-Digital, June 2014.
[96] A. Lodhi, N. Larson, A. Dhamdhere, C. Dovrolis, and kc claffy. Using PeeringDB to understand the peering ecosystem. ACM SIGCOMM Computer Communication Review, 44(2):20–27, 2014.
[97] Hao Ma, Haixuan Yang, Michael R. Lyu, and Irwin King. Mining social networks using heat diffusion processes for marketing candidates selection. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, pages 233–242. ACM, 2008.
[98] Ajay Anil Mahimkar, Zihui Ge, Aman Shaikh, Jia Wang, Jennifer Yates, Yin Zhang, and Qi Zhao. Towards automated performance diagnosis in a large IPTV network. In Proceedings of ACM SIGCOMM, pages 231–242, 2009.
[99] Ajay Anil Mahimkar, Han Hee Song, Zihui Ge, Aman Shaikh, Jia Wang, Jennifer Yates, Yin Zhang, and Joanne Emmons. Detecting the performance impact of upgrades in large operational networks. ACM SIGCOMM Computer Communication Review, pages 303–314, 2011.
[100] Mahesh K. Marina, Valentin Radu, and Konstantinos Balampekos. Impact of indoor-outdoor context on crowdsourcing based mobile coverage analysis. In Proceedings of the Workshop on All Things Cellular: Operations, Applications and Challenges, pages 45–50. ACM, 2015.
[101] Darakhshan Mir and Rebecca N. Wright. A differentially private estimator for the stochastic Kronecker graph model. In Proceedings of the Joint EDBT/ICDT Workshops, pages 167–176. ACM, 2012.
[102] Meinard Müller. Dynamic Time Warping. Information Retrieval for Music and Motion, pages 69–84, 2007.
[103] Diala Naboulsi, Marco Fiore, Stephane Ribot, and Razvan Stanica. Large-scale Mobile Traffic Analysis: A Survey. IEEE Communications Surveys & Tutorials, 2015.
[104] Srinivas Narayana, Wenjie Jiang, Jennifer Rexford, and Mung Chiang. Joint server selection and routing for geo-replicated services. In Proceedings of the IEEE/ACM Conference on Utility and Cloud Computing, pages 423–428, 2013.
[105] Arvind Narayanan and Vitaly Shmatikov. Robust de-anonymization of large sparse datasets. In Proceedings of the IEEE Symposium on Security and Privacy, pages 111–125. IEEE, 2008.
[106] Ashkan Nikravesh, David R. Choffnes, Ethan Katz-Bassett, Z. Morley Mao, and Matt Welsh. Mobile network performance from user devices: A longitudinal, multidimensional analysis. In Passive and Active Measurement, pages 12–22. Springer, 2014.
[107] W. B. Norton. The 2014 Internet Peering Playbook: Connecting to the Core of the Internet. DrPeering Press, 2014.
[108] William B. Norton. The Internet Peering Playbook: Connecting to the Core of the Internet. DrPeering Press, 2011.
[109] Ogi Ogas and Sai Gaddam. A Billion Wicked Thoughts: What the World's Largest Experiment Reveals about Human Desire. Dutton, New York, NY, 2011.
[110] OpenSignal. http://opensignal.com/reports/fragmentation-2013/.
[111] Mukaddim Pathan and Rajkumar Buyya. Content Delivery Networks, Chapter 2: A Taxonomy of CDNs. Springer, 2008.
[112] Utpal Paul, Anand Prabhu Subramanian, Milind Madhav Buddhikot, and Samir R. Das. Understanding Traffic Dynamics in Cellular Data Networks. In IEEE INFOCOM, 2011.
[113] Davide Proserpio, Sharon Goldberg, and Frank McSherry. A workflow for differentially-private graph synthesis. In Proceedings of the ACM Workshop on Online Social Networks, pages 13–18. ACM, 2012.
[114] Feng Qian, Zhaoguang Wang, Alexandre Gerber, Zhuoqing Mao, Subhabrata Sen, and Oliver Spatscheck. Profiling resource usage for mobile applications: A cross-layer approach. In Proceedings of the 9th MobiSys, pages 321–334, 2011.
[115] Lili Qiu, V. N. Padmanabhan, and G. M. Voelker. On the placement of Web server replicas. In Proceedings of IEEE INFOCOM, pages 1587–1596, 2001.
[116] Prabhakar Raghavan and Clark D. Tompson. Randomized rounding: A technique for provably good algorithms and algorithmic proofs. Combinatorica, 7(4):365–374, 1987.
[117] Fabio Ricciato. Traffic monitoring and analysis for the optimization of a 3G network. IEEE Wireless Communications, pages 42–49, 2006.
[118] Matthew Richardson and Pedro Domingos. Mining knowledge-sharing sites for viral marketing. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 61–70. ACM, 2002.
[119] Tiago Rodrigues, Fabrício Benevenuto, Meeyoung Cha, Krishna P. Gummadi, and Virgilio Almeida. On Word-of-Mouth Based Discovery of the Web. In ACM Internet Measurement Conference (IMC), 2011.
[120] Alessandra Sala, Xiaohan Zhao, Christo Wilson, Haitao Zheng, and Ben Y. Zhao. Sharing graphs using differentially private graph models. In Proceedings of the ACM SIGCOMM Internet Measurement Conference, pages 81–98. ACM, 2011.
[121] Tamas Sarlos. Improved approximation algorithms for large matrices via random projections. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS '06), pages 143–152, 2006.
[122] Alexander Schrijver. Theory of Linear and Integer Programming. John Wiley & Sons, 1998.
[123] Sayandeep Sen, Jongwon Yoon, Joshua Hare, Justin Ormont, and Suman Banerjee. Can they hear me now?: A case for a client-assisted approach to monitoring wide-area wireless networks. In Proceedings of the ACM Internet Measurement Conference, pages 99–116. ACM, 2011.
[124] M. Zubair Shafiq, Jeffrey Erman, Lusheng Ji, Alex X. Liu, Jeffrey Pang, and Jia Wang. Understanding the Impact of Network Dynamics on Mobile Video User Engagement. In ACM SIGMETRICS, 2014.
[125] M. Zubair Shafiq, Lusheng Ji, Alex X. Liu, and Jia Wang. Characterizing and modeling internet traffic dynamics of cellular devices. In Proceedings of ACM SIGMETRICS, pages 305–316, San Jose, California, June 2011.
[126] M. Zubair Shafiq, Lusheng Ji, Alex X. Liu, and Jia Wang. Characterizing and Modeling Internet Traffic Dynamics of Cellular Devices. In ACM SIGMETRICS, 2011.
[127] Joel Sommers and Paul Barford. Cell vs. WiFi: On the performance of metro area mobile connections. In Proceedings of the ACM Internet Measurement Conference, pages 301–314. ACM, 2012.
[128] Han Hee Song, Zihui Ge, Ajay Mahimkar, Jia Wang, Jennifer Yates, Yin Zhang, Andrea Basso, and Min Chen. Q-score: Proactive service quality assessment in a large IPTV system. In Proceedings of the ACM Internet Measurement Conference, pages 195–208. ACM, 2011.
[129] Sebastian Sonntag, Lennart Schulte, and Jukka Manner. Mobile network measurements - it's not all about signal strength. In Proceedings of the Wireless Communications and Networking Conference (WCNC), pages 4624–4629. IEEE, 2013.
[130] Rade Stanojevic, Ignacio Castro, and Sergey Gorinsky. CIPT: Using tuangou to reduce IP transit costs. In Proceedings of ACM CoNEXT, page 17, 2011.
[131] Rade Stanojevic, Nikolaos Laoutaris, and Pablo Rodriguez. On economic heavy hitters: Shapley value analysis of 95th-percentile pricing. In Proceedings of ACM IMC, pages 75–80, 2010.
[132] Ao-Jan Su, David R. Choffnes, Aleksandar Kuzmanovic, and Fabián E. Bustamante. Drafting behind Akamai: Inferring network conditions based on CDN redirections. IEEE/ACM Transactions on Networking (TON), 17(6):1752–1756, 2009.
[133] Peng Sun, Minlan Yu, Michael J. Freedman, and Jennifer Rexford. Identifying performance bottlenecks in CDNs through TCP-level monitoring. In Proceedings of the ACM SIGCOMM Workshop on Measurements up the Stack, pages 49–54, 2011.
[134] Shu Tao, Kuai Xu, Ying Xu, Teng Fei, Lixin Gao, Roch Guerin, Jim Kurose, Don Towsley, and Zhi-Li Zhang. Exploring the performance benefits of end-to-end path switching. In Proceedings of IEEE ICNP, pages 304–315, 2004.
[135] Gareth Tyson, Yehia Elkhatib, Nishanth Sastry, and Steve Uhlig. Demystifying Porn 2.0: A look into a major adult video streaming website. In ACM Internet Measurement Conference (IMC), pages 417–426. ACM, 2013.
[136] Gareth Tyson, Yehia Elkhatib, Nishanth Sastry, and Steve Uhlig. Are People Really Social on Porn 2.0? In AAAI Conference on Web and Social Media (ICWSM), 2015.
[137] Vytautas Valancius, Cristian Lumezanu, Nick Feamster, Ramesh Johari, and Vijay V. Vazirani. How many tiers?: Pricing in the Internet transit market. In Proceedings of ACM SIGCOMM, pages 194–205, 2011.
[138] R. J. Vanderbei. Linear Programming: Foundations and Extensions. International Series in Operations Research & Management Science. Springer, 2007.
[139] Deepak Verma and Marina Meila. A comparison of spectral clustering algorithms. Technical report, 2003.
[140] Hao Wang, Haiyong Xie, Lei Qiu, Avi Silberschatz, and Yang Richard Yang. Optimal ISP subscription for Internet multihoming: Algorithm design and implication analysis. In Proceedings of IEEE INFOCOM, pages 2360–2371, 2005.
[141] Yue Wang, Xintao Wu, and Leting Wu. Differential privacy preserving spectral graph analysis. In Advances in Knowledge Discovery and Data Mining, volume 7819 of Lecture Notes in Computer Science, pages 329–340. Springer Berlin Heidelberg, 2013.
[142] Mark Ward. Web porn: Just how much is there? http://www.bbc.com/news/technology-23030090, July 2013.
[143] C. Wilson, B. Boe, A. Sala, K. P. N. Puttaswamy, and B. Y. Zhao. User interactions in social networks and their implications. In Proceedings of the ACM European Conference on Computer Systems, 2009.
[144] Gilbert Wondracek, Thorsten Holz, Christian Platzer, Engin Kirda, and Christopher Kruegel. Is the Internet for Porn? An Insight Into the Online Adult Industry. In Workshop on the Economics of Information Security (WEIS), 2010.
[145] Qiang Xu, Jeffrey Erman, Alexandre Gerber, Zhuoqing Mao, Jeffrey Pang, and Shobha Venkataraman. Identifying diverse usage behaviors of smartphone apps. In ACM Internet Measurement Conference (IMC), 2011.
[146] xvideos.com Site Overview - Alexa. http://www.alexa.com/siteinfo/xvideos.com.
[147] youtube.com Site Overview - Alexa. http://www.alexa.com/siteinfo/youtube.com.
[148] Ying Zhang and Ake Arvidsson. Understanding the characteristics of cellular data traffic. ACM SIGCOMM Computer Communication Review, 42(4):461–466, 2012.
[149] Zheng Zhang, Ming Zhang, Albert G. Greenberg, Y. Charlie Hu, Ratul Mahajan, and Blaine Christian. Optimizing cost and performance in online service provider networks. In Proceedings of USENIX NSDI, pages 33–48, 2010.