MEASUREMENT AND MODELING OF LARGE
SCALE NETWORKS
By
Muhammad Zubair Shafiq

A DISSERTATION
Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of
Computer Science and Engineering—Doctor of Philosophy
2014

ABSTRACT
MEASUREMENT AND MODELING OF LARGE SCALE
NETWORKS
By
Muhammad Zubair Shafiq
The goal of this thesis is to identify measurement, modeling, and optimization opportunities for large scale networks – with specific focus on cellular networks and online social
networks. These networks are facing unprecedented operational challenges due to their very
large scale.
Cellular networks are experiencing an explosive increase in the volume of traffic for the
last few years. This unprecedented increase in the volume of mobile traffic is attributed to
the increase in the subscriber base, improving network connection speeds, and improving
hardware and software capabilities of modern smartphones. In contrast to the traditional
fixed IP networks, mobile network operators are faced with the constraint of limited radio
frequency spectrum at their disposal. As the communication technologies evolve beyond 3G
to Long Term Evolution (LTE), the competition for the limited radio frequency spectrum
is becoming even more intense. Therefore, mobile network operators increasingly focus on
optimizing different aspects of the network by customized design and management to improve
key performance indicators (KPIs).
Online social networks are increasing at a very rapid pace, while trying to provide more
content-rich and interactive services to their users. For instance, Facebook currently has
more than 1.2 billion monthly active users and offers news feed, graph search, groups, photo
sharing, and messaging services. The information for such a large user base cannot be
efficiently and securely managed by traditional database systems. Social network service
providers are deploying novel large scale infrastructure to cope with these scaling challenges.
In this thesis, I present novel approaches to tackle these challenges by revisiting the cur-

rent practices for the design, deployment, and management of large scale network systems
using a combination of theoretical and empirical methods. I take a data-driven approach in
which the theoretical and empirical analyses are intertwined. First, I measure and analyze
the trends in data and then model the identified trends using suitable parametric models.
Finally, I rigorously evaluate the developed models and the resulting system design prototypes using extensive simulations, realistic testbed environments, or real-world deployment.
This methodology is to used to address several problems related to cellular networks and
online social networks.

ACKNOWLEDGEMENTS

Working towards a Ph.D. has been a deeply enriching experience; at times it has been
exciting, at times depressing (particularly after crushing paper rejections), but it has always
been very rewarding. Looking back, many people have helped shape my journey. I would
like to extend them my thanks.
• First and foremost, my advisor, Prof. Alex X. Liu. I do not know where to start.
My work would not have been possible without his constant guidance – like when he
nudged me into considering measurement research more seriously – his unwavering
encouragement, his many insights, and his exceptional resourcefulness. And most
importantly, his friendship. I have been very fortunate to have an advisor who has also
been a close friend. For all of this, Alex, thank you.
• I would also like to thank the rest of my thesis committee Profs. Eric Torng, Guoliang
Xing, and Habib Salehi for their encouragement and insightful comments during my
qualifier and comprehensive exams.
• I would also like to thank Drs. Jeffrey Pang, Jia Wang, and Lusheng Ji. I learned
a lot from them during my summer internship at AT&T Labs - Research and during
our collaboration throughout my Ph.D. My collaboration with them has been one of
the most fruitful and fun engagements I have experienced. I have learned a lot writing
papers with them.
• I would also like to thank Drs. Franck Le and Mudhakar Srivatsa. I really enjoyed
working with them during my summer internship at IBM T.J. Watson Research Center
and during our collaboration after the internship.
• Throughout my Ph.D., I was supported by various NSF research grants. Thanks NSF!

iv

• I would also like to thank Michigan State University, and specifically Department of
Computer Science and Engineering for providing me financial support to attend various
conferences during my Ph.D.
• Many thanks to my colleagues in Systems and Security Lab and WAVES lab at Michigan State University. In particular, I would like to thank Ahmed Majeed Khan, Amir
Khakpour, Chad Meiners, Eric Norige, Faraz Ahmed, Fei Chen, Hassan Aqeel Khan,
Muhammad Shahzad, and Muhammad Usman Ilyas for numerous insightful discussions
and collaborations on various projects.
• I must say that I owe my great time in Michigan State University to all of my fabulous
friends. It is simply not feasible to list all of them here. I would like to thank them all
for their friendship and support.
• I am also very thankful to Drs. Syed Ali Khayam and Muddassar Farooq, who advised
my undergraduate thesis and encouraged me to pursue Ph.D.
• I am also deeply indebted to Mr. Saleem, my high school physics teacher in Bahawalpur, Pakistan. His passion for science and scholarly pursuit has been an inspiration to me (and surely, many of his other students) and helped set me on the path on
which I find myself today.
• Finally, I do not know how I can thank my family enough: my wife and parents, from
whom I realized that kindness and devotion is endless, my brother, Omair, who always
supports me no matter what, and the rest of the family members who were always
supportive of my studies.

v

TABLE OF CONTENTS
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

ix

LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

x

1

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . .
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . .
1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . .
1.2.1 Quality-of-Experience for Mobile Video . . . . . . . .
1.2.2 Mobile Network Performance during Crowded Events
1.2.3 Breaching Privacy in Encrypted IM Networks . . . .
1.2.4 Information Hub Identification in Social Networks . .
1.3 Published Material . . . . . . . . . . . . . . . . . . . . . . .

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

. . .
. . .
. . .
. . .
. . .
. . .
. . .
. . .

1
1
2
2
2
3
3
3

2

Quality of Experience for Mobile Video . . . . . . .
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . .
2.2 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.2.1 Cellular Network Background . . . . . . . . . .
2.2.2 Data Collection and Pre-processing . . . . . . .
2.2.3 Video Traffic Statistics . . . . . . . . . . . . . .
2.2.4 Quantifying User Engagement . . . . . . . . . .
2.3 Analysis of Network Factors . . . . . . . . . . . . . . .
2.4 Modeling User Engagement . . . . . . . . . . . . . . .
2.4.1 Background and Problem Statement . . . . . .
2.4.2 Proposed Approach . . . . . . . . . . . . . . . .
2.4.3 Experimental Setup . . . . . . . . . . . . . . . .
2.4.4 Evaluation . . . . . . . . . . . . . . . . . . . . .
2.4.5 Discussion . . . . . . . . . . . . . . . . . . . . .
2.5 Related Work . . . . . . . . . . . . . . . . . . . . . . .
2.5.1 Network-side Instrumentation . . . . . . . . . .
2.5.2 Client-side Instrumentation . . . . . . . . . . .
2.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . .

5
5
7
7
9
11
11
16
21
21
22
23
24
27
31
31
32
33

3

Mobile Network Performance during Crowded
3.1 Introduction . . . . . . . . . . . . . . . . . . . .
3.2 Data Set . . . . . . . . . . . . . . . . . . . . . .
3.3 Characterizing Performance Issues . . . . . . . .
3.3.1 Pre-connection Network Performance . .
3.3.2 Post-connection Network Performance .
3.4 Understanding Performance Issues . . . . . . . .
3.4.1 Aggregate Network Load . . . . . . . . .
3.4.2 User-level Sessions . . . . . . . . . . . .
3.5 Evaluating Mitigation Schemes . . . . . . . . .

vi

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

. . .
. . .
. . .
. . .
. . .
. . .
. . .
. . .
. . .
. . .
. . .
. . .
. . .
. . .
. . .
. . .
. . .
. . .

Events
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .
. . . . .

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

. . . 34
. . . 34
. . . 37
. . . 39
. . . 40
. . . 42
. . . 46
. . . 46
. . . 50
. . . 51

3.6
3.7

3.5.1 Radio Network Parameter Tuning .
3.5.2 Opportunistic Connection Sharing .
3.5.3 Limitations . . . . . . . . . . . . .
Related Work . . . . . . . . . . . . . . . .
Conclusion . . . . . . . . . . . . . . . . . .

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

Networks
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

. . . 61
. . . 61
. . . 64
. . . 64
. . . 65
. . . 65
. . . 65
. . . 67
. . . 69
. . . 69
. . . 70
. . . 72
. . . 72
. . . 74
. . . 75
. . . 77
. . . 80
. . . 81

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

. . .
. . .
. . .
. . .
. . .
. . .
. . .
. . .
. . .
. . .
. . .
. . .
. . .
. . .
. . .
. . .
. . .
. . .
. . .

4

Breaching Privacy in Encrypted Instant Messaging
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . .
4.2 Related Work . . . . . . . . . . . . . . . . . . . . . .
4.2.1 Mix Network De-anonymization . . . . . . . .
4.2.2 Social Network De-anonymization . . . . . . .
4.3 Problem Description and Attack Scenarios . . . . . .
4.3.1 IM Service Architecture . . . . . . . . . . . .
4.3.2 Attack Scenarios . . . . . . . . . . . . . . . .
4.4 COLD: COmmunication Link De-anonymization . . .
4.4.1 Architecture . . . . . . . . . . . . . . . . . . .
4.4.2 Details . . . . . . . . . . . . . . . . . . . . . .
4.5 Experimental Results . . . . . . . . . . . . . . . . . .
4.5.1 Data Set . . . . . . . . . . . . . . . . . . . . .
4.5.2 Evaluation Metrics . . . . . . . . . . . . . . .
4.5.3 Results . . . . . . . . . . . . . . . . . . . . . .
4.5.4 Discussions . . . . . . . . . . . . . . . . . . .
4.6 Evasion and Countermeasures . . . . . . . . . . . . .
4.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . .

5

Information Hub Identification in Social Networks .
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . .
5.1.1 Background and Motivation . . . . . . . . . . .
5.1.2 Limitations of Prior Art . . . . . . . . . . . . .
5.1.3 Proposed Solution . . . . . . . . . . . . . . . .
5.1.4 Experimental Results and Findings . . . . . . .
5.1.5 Key Contributions . . . . . . . . . . . . . . . .
5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . .
5.3 Our Proposed Solution . . . . . . . . . . . . . . . . . .
5.3.1 Eigenvector Centrality . . . . . . . . . . . . . .
5.3.2 Motivation for Principal Component Centrality
5.3.3 Definition of PCC . . . . . . . . . . . . . . . . .
5.3.4 Selection of Number of Eigenvectors . . . . . . .
5.3.5 Decentralized Eigendecomposition Algorithm . .
5.4 Experimental Results . . . . . . . . . . . . . . . . . . .
5.4.1 Data Set . . . . . . . . . . . . . . . . . . . . . .
5.4.2 Selection of PCC Parameter . . . . . . . . . . .
5.4.3 Comparison With Ground Truth . . . . . . . .
5.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . .

vii

.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

51
53
58
58
60

83
83
83
84
84
86
87
88
89
89
90
91
93
94
99
99
101
102
107

6

Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

viii

LIST OF TABLES
Table 2.1

Core network features. i denotes the flow index of a session with N flows. 15

Table 2.2

Radio access network features. . . . . . . . . . . . . . . . . . . . . . .

16

Table 2.3

Accuracy of 4-way classification . . . . . . . . . . . . . . . . . . . . . .

25

Table 2.4 Accuracy of completed vs. abandoned and completed, non-skipped
vs. rest classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

26

Table 2.5

Root-mean-square error of regression . . . . . . . . . . . . . . . . . . .

27

Table 3.1

Description of voice call error codes . . . . . . . . . . . . . . . . . . .

44

Table 4.1

Data set statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

73

Table 5.1

Basic statistics of the friendship graphs analyzed in this study . . . . . 100

ix

LIST OF FIGURES
Figure 2.1 Cellular network architecture. For interpretation of the references to
color in this and all other figures, the reader is referred to the electronic version
of this dissertation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

8

Figure 2.2 Illustration of a video streaming session. Gray rectangles represent
distinct flows in a session. . . . . . . . . . . . . . . . . . . . . . . . . . . . .

12

Figure 2.3

Distributions of container type, video duration, and video player type

12

Figure 2.4 Examples of video streaming session classes. Y-axis limits are set to
the video sizes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

13

Figure 2.5

Distribution of video download completion . . . . . . . . . . . . . . .

14

Figure 2.6 Abandonment rate distributions. Shaded areas represent confidence
intervals. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

18

Figure 2.7 Skip rate distributions. Shaded areas represent the 95% confidence
interval. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

19

Figure 2.8

ROC threshold plots for various class pairs . . . . . . . . . . . . . . .

24

Figure 2.9

Accuracy vs. feature set size for completed vs. abandoned classification 28

Figure 2.10 Pruned decision and regression tree models for τ ≤ 10 seconds using
All feature set. The tuples below leaf nodes represent (error%,population
size%). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

29

Figure 3.1

(Normalized) Timeseries of common types of RRC failures . . . . . .

39

Figure 3.2 RRC failure ratios plotted as a function of distance to the venue and
for time intervals before, during, and after the event . . . . . . . . . . . . . .

39

Figure 3.3

(Normalized) Voice performance measurements . . . . . . . . . . . . .

41

Figure 3.4

(Normalized) Data performance measurements . . . . . . . . . . . . .

43

Figure 3.5

(Normalized) Network load measurements . . . . . . . . . . . . . . .

46

Figure 3.6

Flow count histograms for top content publishers in our data set . . .

48

Figure 3.7

(Normalized) Session Count, Average Length, Average Inter-arrival Time 49

Figure 3.8 (Normalized) Per-Session Downlink Bytes (Bdown ), Uplink Bytes
(Bup ), Ratio (Bdown /(Bup ) . . . . . . . . . . . . . . . . . . . . . . . . . . .

49

Figure 3.9

50

Experimental results for radio network parameter tuning . . . . . . .

x

Figure 3.10 Tradeoff between performance metrics for varying RRC timeout (α)
values. Y-axis is max-normalized for each metric. α values corresponding to
black circles achieve better performance tradeoff. . . . . . . . . . . . . . . . .

54

Figure 3.11 Experimental results for opportunistic connection sharing . . . . . . .

56

Figure 4.1

Transforming logged traffic traces to user traffic signals . . . . . . . .

66

Figure 4.2

Two attack scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . .

68

Figure 4.3 Time series plot of traffic volume, in bytes and number of packets, over
the entire 60 minute time period from 8 − 9 a.m. . . . . . . . . . . . . . . .

73

Figure 4.4

Node degree distribution in our Yahoo! Messenger data set. . . . . .

74

Figure 4.5 Hit rates of COLD for vertices of degree 1 in the (a) 10 minute data
set, (b) 20 minute data set, (c) 30 minute data set, (d) 40 minute data set,
(e) 50 minute data set, and (f) 60 minute data set. . . . . . . . . . . . . . .

77

Figure 4.6 Hit rates of TSC for vertices of degree 1 in the (a) 10 minute data set,
(b) 20 minute data set, (c) 30 minute data set, (d) 40 minute data set, (e) 50
minute data set, and (f) 60 minute data set. . . . . . . . . . . . . . . . . . .

79

Figure 5.1 Conceptual depiction of the friendship graph between users and the
overlaid interaction graph. . . . . . . . . . . . . . . . . . . . . . . . . . . . .

85

Figure 5.2 PCC of nodes in a network consisting of two Barab´asi-Albert graphs
of 100 and 50 nodes connected by a few links when computed using the most
significant (a) 1, (b) 5, (c) 10, and (d) 100 eigenvectors. . . . . . . . . . . . .

92

Figure 5.3 An illustrative example of message exchanges in power iteration algorithm for (a) 1st eigenvector, (b) 2nd eigenvector, and (c) subsequent eigenvectors. Reference node is colored black, neighbor nodes grey and non-neighbor
nodes white. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

97

Figure 5.4 Average Mean Squared Error (MSE) for the KM algorithm reported
for varying values of number of eigenvectors (k) and number of iterations (t).

98

Figure 5.5

Degree distribution of friendship graph for Facebook data set A . . . 101

Figure 5.6

Degree distribution of friendship graph for Facebook data set B . . . 101

Figure 5.7 Plot of the phase angle φ(P ) between PCC vectors CP and EVC vector
CE plotted against number of feature vectors P for (a) Facebook data set A,
and (b) Facebook data set B. . . . . . . . . . . . . . . . . . . . . . . . . . . 102

xi

Figure 5.8 Correlation coefficients ρ of PCC CP and, (a) flow count of Facebook
data set A (ϑ(A)), (b) flow count of Facebook data set B (ϑ(B)). The correlation coefficients are plotted as functions of the number of eigenvectors P
and plotted separately for each interaction graph. . . . . . . . . . . . . . . . 103
Figure 5.9 Size of the intersection set in (a) Facebook data set A, and (b) Facebook
data set B, for varying number of eigenvectors used in computation of PCC.

105

Figure 5.10 Cardinality of the intersection set in (a) Facebook data set A, and (b)
Facebook data set B, for varying fraction of nodes in graph. . . . . . . . . . 106
Figure 5.11 Distance between ordered lists computed by PCC and interaction data
using (a) Facebook data set A, and (b) Facebook data set B, for varying
fraction of nodes in graph. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

xii

1

Introduction

1.1

Background

Big data has become the cornerstone of the revolution in computing systems that is transforming how network infrastructure, services, and applications are designed, deployed, and
managed. Many conventional wisdoms on which we have based the designs of existing network systems are either not well understood or they are simply outdated due to the rapidly
changing nature of workloads, performance metrics, and user requirements. Therefore, existing systems face unprecedented challenges in terms of efficiency, performance, reliability,
adaptability, and scalability due to their inferior and outdated designs.
This thesis aims to tackle these challenges by revisiting the current practices for the design, deployment, and management of network systems using a combination of theoretical
and empirical methods. To model dynamic workload behaviors, my work takes a data-driven
approach in which the theoretical and empirical analyses are intertwined. First, I measure
and analyze the trends in data and then model the identified trends using suitable parametric
models. Finally, I rigorously evaluate the developed models and the resulting system design
prototypes using extensive simulations, realistic testbed environments, or real-world deployment. I have successfully applied this methodology to address several problems related to
mobile and cellular networks and online social networks – especially for the issues that arise
from dynamic workload behaviors.

1

1.2

Contributions

This thesis takes an in-depth look at the following research problems.

1.2.1

Quality-of-Experience for Mobile Video

Mobile network operators have a significant interest in the performance of streaming video on
their networks because network factors directly influence the quality-of-experience. However,
unlike video service providers, network operators are not privy to the client- or serverside logs typically used to measure user engagement. To address this limitation, I analyze
the impact of cellular network performance on mobile video abandonment (a measure of
user engagement) from the perspective of a network operator. I develop predictive models
to enable mobile network operators to monitor mobile video user engagement using only
standard radio network statistics and/or TCP/IP flow records, a necessity for continuous
monitoring at scale [88].

1.2.2

Mobile Network Performance during Crowded Events

Cellular network usage is at an all-time high even under normal operating conditions and
projections show that traffic volume will further increase 26x by 2015 as compared to 2010.
During crowded events, cellular networks face voice and data traffic volumes that are often
orders of magnitude higher than what they face during routine days. I conduct a large
scale investigation of cellular network performance during crowded events to identify the
root cause of performance degradation. I design and evaluate practical mitigation schemes,
such as radio resource allocation tuning and opportunistic connection sharing, which do not
require making significant changes to the current cellular network infrastructure [90].

2

1.2.3

Breaching Privacy in Encrypted IM Networks

The proliferation of online social networks has attracted the interest of computer scientists to
mine the available social network data for developing behavior profiles of people. IM services
– such as Yahoo! Messenger, Skype, IRC, and ICQ – are popular tools to privately communicate with friends and family over the Internet. I demonstrate how to de-anonymizing
friendship links in encrypted instant messaging (IM) networks using timing analysis on the
collected network traffic logs [48]. This work is also relevant to related problems such as mix
network de-anonymization.

1.2.4

Information Hub Identification in Social Networks

Identifying top-k information hubs is crucial for many applications such as advertising in
social networks where advertisers are interested in identifying hubs to whom free samples
can be given. Centralized computation of top-k information hubs is mostly unrealistic for
parties such as advertisers because online social networking companies are reluctant to share
their interaction or friendship graphs due to privacy concerns and regulations. I develop
distributed and privacy-preserving algorithms for computing top-k information hubs in online
social networks [46, 47].

1.3

Published Material

The chapters of this dissertation are based in part on the following publications.
• M. Zubair Shafiq, Jeffrey Erman, Lusheng Ji, Alex X. Liu, Jeffrey Pang, Jia Wang.
Understanding the Impact of Network Dynamics on Mobile Video User Engagement.
ACM SIGMETRICS, 2014.
• M. Zubair Shafiq, Alex X. Liu, Amir Khakpour. Revisiting Caching in Content Delivery Networks: Measurement, Design, and Evaluation (Extended Abstract). ACM

3

SIGMETRICS, 2014.
• M. Zubair Shafiq, Lusheng Ji, Alex X. Liu, Jeffrey Pang, Shobha Venkataraman, Jia
Wang. A First Look at Cellular Network Performance during Crowded Events. ACM
SIGMETRICS, 2013.
• M. Zubair Shafiq, Lusheng Ji, Alex X. Liu, Jeffrey Pang, Jia Wang. A First Look at
Cellular Machine-to-Machine Traffic - Large Scale Measurement and Characterization.
ACM SIGMETRICS, 2012.
• M. Zubair Shafiq, Lusheng Ji, Alex X. Liu, Jia Wang. Characterizing and Modeling
Internet Traffic Dynamics of Cellular Devices. ACM SIGMETRICS, 2011.
• Muhammad U. Ilyas, M. Zubair Shafiq, Alex X. Liu, Hayder Radha. Who are You
Talking to? Breaching Privacy in Encrypted IM Networks. IEEE ICNP, 2013.
• Yipeng Wang, Xiaochun Yun, M. Zubair Shafiq, Liyan Wang, Alex X. Liu, Zhibin
Zhang, Danfeng(Daphne) Yao, Yongzheng Zhang, Li Guo. A Semantics Aware Approach to Automated Reverse Engineering Unknown Protocols. IEEE ICNP, 2012.
• M. Zubair Shafiq, Muhammad U. Ilyas, Alex X. Liu, Hayder Radha. Identifying Leaders and Followers in Online Social Networks. IEEE Journal on Selected Areas in
Communications (JSAC), 2013.
• Muhammad U. Ilyas, M. Zubair Shafiq, Alex X. Liu, Hayder Radha. A Distributed
Algorithm for Identifying Information Hubs in Social Networks. IEEE Journal on
Selected Areas in Communications (JSAC), 2013.
• M. Zubair Shafiq, Lusheng Ji, Alex X. Liu, Jeffrey Pang, Jia Wang. Characterizing
Geospatial Dynamics of Application Usage in a 3G Cellular Data Network. IEEE
INFOCOM, 2012.

4

2

Quality of Experience for Mobile

Video
2.1

Introduction

Online video services such as YouTube, Netflix, and Hulu are very popular on mobile networks. It has been estimated that video currently makes up more than half of all mobile
data traffic and will grow by a factor of 16 by 2017 [13]. Therefore, it is crucial for mobile
network operators to monitor the user experience, or Quality of Experience (QoE), of video
streaming and understand how network characteristics and performance influence it.
Unfortunately, prior approaches for monitoring and understanding the user experience of video streaming are insufficient for mobile network operators.

Recent seminal

work [15, 16, 31, 55] investigated how video streaming quality influences important user engagement metrics, such as video abandonment rate. However, these studies rely on client-side
instrumentation to measure video quality metrics such as buffering, startup delay, and bitrate. This instrumentation is not available to network operators, so the ability to measure
user engagement using only network-side measurements is crucial from their perspective.
Other work used network traffic analysis to study video streaming volume and abandonment
rates in wired [36,39] and wireless networks [33]. However, these techniques use deep-packetinspection to extract information beyond TCP/IP headers, which requires significant computational resources to employ at the scale of network carriers and can pose privacy problems

5

in practice. Moreover, these studies did not provide insight into how network characteristics
and performance influence abandonment rates.
To redress these limitations, this work presents the first large-scale study to characterize
video streaming performance in cellular networks and its impact on user engagement. Our
study is based on month-long anonymized data sets collected from the core network and radio
access network of a tier-1 cellular network in the United States. We analyze 27 terabytes of
video streaming traffic from nearly half a million users in this data set. Our analysis makes
two main contributions.
First, to the best of our knowledge, our analysis is the first to quantify the impact that
network characteristics have on mobile video user engagement in the wild. We quantify the
effect that 31 different cellular network factors have on video abandonment rate and video
skip (e.g., fast forward) rate. In particular, we quantify user engagement by labeling video
streaming sessions in our data set as completed/abandoned and skipped/non-skipped,
and then evaluate the extent to which core network and radio network factors correlate with
abandonment rate and skip rate. These factors include TCP flow throughput, flow duration,
handover rate, signal strength, and the physical location’s land cover type. Our results
provide network operators insights and direct guidance on how to improve user engagement.
For example, improving mean signal-to-interference ratio by 1 dB reduces the likelihood
of video abandonment by 2%. Moreover, reducing the load in a cell sector by 10 active
users reduces the likelihood of video abandonment by 7%. Through these insights, network
operators can identify and prioritize network factors that have the most impact on user
engagement.
Second, we are the first to show how a network operator can monitor mobile video user
engagement using only standard radio network statistics and/or TCP/IP flow records, a necessity for continuous monitoring at scale and for mitigating privacy concerns. Moreover, we
show that our approach can predict video abandonment very early in a video session, which
can help future networks decide which users to optimize performance for (e.g., using LTE

6

self-organizing networks [1]). Specifically, we model the complex relationships between network factors and video abandonment. We find that the C4.5/M5P algorithm with bootstrap
aggregation can build decision/regression tree models that accurately predict video abandonment. Our results show that it can predict whether a video streaming session is abandoned
or skipped with more than 87% accuracy by observing only the initial 10 seconds. Our
model achieves significantly better accuracy than prior models that require video service
provider logs [15, 16], while only using standard radio network statistics and/or TCP/IP
headers readily available to network operators.
The rest of this chapter is organized as follows. In Section 2.2, we present a brief background and details of the data collection process. Section 2.3 presents the characterization of
video mobile video streaming performance and its impact on user engagement. We develop a
machine learning model for user engagement and present the results in Section 2.4. Section
3.6 reviews related work and the chapter is concluded in Section 4.7.

2.2

Data

To study mobile video streaming performance, we collected anonymized flow-level logs from
a tier-1 cellular network in the United States. Next, we first provide a brief background of
video streaming in cellular networks, description of our data collection methodology, and
some high-level statistics of the collected data set.

2.2.1

Cellular Network Background

A typical UMTS cellular network, shown in Figure 2.1, can be visualized as consisting of two
major components: Radio Access Network (RAN) and Core Network (CN). RAN consists of
NodeBs and Radio Network Controllers (RNCs). Each NodeB has multiple antennas, where
each antenna corresponds to a different cell sector. A user via user equipment (UE) connects
to an active set of one or more cell sectors in the RAN. The UE periodically selects a primary

7

Radio Access Network
(RAN)

UEs

Core Network
(CN)

Radio link
NodeBs

Internet

RNC

Gn
SGSN

UEs

Radio link
NodeBs

Gn

RNC
SGSN

UEs Radio link NodeBs

GGSN
External
Networks

RNC

Figure 2.1. Cellular network architecture. For interpretation of the references to color in
this and all other figures, the reader is referred to the electronic version of this dissertation.
or serving cell among the active set based on their signal strength information. From the
active set, only the primary cell actually transmits downlink data to the UE. The traffic
generated by a UE is sent to the corresponding NodeB by cell sectors. Each RNC controls
and exchanges traffic with multiple NodeBs, each of which serves many users in its coverage
area. RNCs manage control signaling such as Radio Access Bearer (RAB) assignments,
transmission scheduling, and handovers. Each UE negotiates allocation of radio resources
with the RAN based on a wide range of factors, such as available radio resources and signal
strength [14].
CN consists of Serving GPRS Support Nodes (SGSNs) facing the user and Gateway GPRS
Support Nodes (GGSNs) facing the Internet and other external networks. RNCs send traffic
from NodeBs to SGSNs, which then send it to GGSNs. GGSNs eventually send traffic to
external networks, such as the Internet. For data connections, the IP layer of a UE is peered
with the IP layer of GGSNs in the form of tunnels known as Packet Data Protocol (PDP)
contexts. These tunnels, implemented as GPRS Tunneling Protocol (GTP) tunnels, carry
IP packets between the UEs and their peering GGSNs. From the perspective of an external
network such as the Internet, a GGSN connecting CN to the Internet appears just like an IP
router and the UEs that connect through the GGSN appear as IP hosts behind the router.

8

2.2.2

Data Collection and Pre-processing

For our study, we simultaneously collected two anonymized data sets from the RAN and
CN of a tier-1 cellular network in the United States. Our data collection covers a major
metropolitan area in the Western United States over the duration of one month in 2012.
The RAN data set is collected at the RNCs and contains event-driven signaling information
such as current active set, RAB state, handovers, bitrate, signal strength, and RRC requests
from users and corresponding responses from the network. The CN data set is collected from
the Gn interfaces between SGSNs and GGSNs, and contains flow-level information of video
streaming traffic such as server IP and port, client IP and port, flow duration, TCP flags,
anonymized user identifier (IMSI), and anonymized device identifier (IMEI). These fields
require only TCP/IP or GTP level information, which is efficiently collected.
In order to determine the ground-truth of video abandonment, we also collected the following HTTP-level information: URL, host, user agent, content type, content length, and
byte-range request from clients and response from servers. Large scale monitoring tools often
do not collect HTTP information because it requires processing hundreds of bytes of text
beyond the 40-byte TCP/IP header. Thus, it is important that day-to-day monitoring does
not require its collection at scale. All device and user identifiers (e.g., IMSI, IMEI) in our
data sets are anonymized to protect privacy without affecting the usefulness of our analysis.
The data sets do not permit the reversal of the anonymization or re-identification of users.
To minimize the confounding factors that different content providers (live vs. video-ondemand), connectivity (cellular vs. cable), and device type (mobile vs. desktop) could have
on our network-centric analysis, we chose to focus on the most popular video service provider
in our cellular network data set. This provider (anonymized for business confidentiality)
serves user generated content on demand, and according to a recent study [33], it serves
over 37% of all video objects. This provider streams videos using progressive download with
byte-range requests, which is the most common protocol currently in use. We believe the
conclusions we draw in this work apply to 9 of the 14 most popular mobile video content

9

providers as they use the same protocol [33]. Previous work found the top providers that
use this protocol behave similarly in wired networks [83].
Since our collected data contains traffic records for all types of content, we first need to
separate video streaming traffic from the rest. Towards this end, we use the HTTP host and
content-type headers to separate the video streaming traffic from other TCP/IP traffic. We
can also separate video traffic based only on the server IP and port, since all video streaming
traffic is served by a known block of CDN cache servers.
A video is progressively downloaded in one or multiple HTTP byte-range requests, which
represent different portions of the video [33]. Figure 2.2 illustrates a video streaming session
that involves multiple HTTP byte-range server response flows. The x-axis represents time,
which starts with the first HTTP byte-range server response flow. The y-axis represents byterange of the video file with maximum value same as the video file size, which is highlighted
by the horizontal broken line. Consequently, each gray rectangle represents a distinct HTTP
byte-range server response flow. Note that flows may have different byte-range lengths and
durations, they may be overlapping, and there may be time gaps between consecutive flows.
For the purpose of our analysis, we group HTTP flows into video sessions based on a
unique ID that is the same in the URLs of each video session. In practice, we found that we
can group flows into sessions without any HTTP information. In particular, by looking for
a change in the server IP to determine when a new video session for a user starts, we can
detect session starts correctly with 98% accuracy. This is because videos are served from a
CDN and different videos are likely served from different cache servers. Even if all videos
were served from a single server, we found that we can still detect session starts with 97%
accuracy using a simple decision tree classifier trained on the size of and inter-arrival time
gap between HTTP flows. (We omit details due to space constraints.) Thus, we conclude
that TCP/IP information would be sufficient to detect and group HTTP flows into video
sessions.

10

2.2.3

Video Traffic Statistics

Next we present some details of our collected data set such as aggregate statistics, container
types, encoding bitrates, and video player types. Overall, our data set consists of more than
27 terabytes worth of video streaming traffic, from more than 37 million flows, from almost
half a million users over the course of one month.
Our data set mostly contains standard definition video streaming streaming traffic. Figure
2.3(a) shows the distribution of video streaming traffic with respect to container types. The
most common container types [5] are: (1) 3GP (3GPP file format), (2) MPEG-4, (3) FLV
(Flash), (4) WebM, and (5) MP2T (MPEG transport stream). We observe that a vast
majority, almost 70%, of video streaming traffic uses the 3GPP container type – followed
by MPEG-4 and Flash container types as distant 2nd and 3rd most popular, respectively.
Only a small fraction, less than 2%, of the video streaming traffic belongs to containers
types used for live content. We exclude these from our analysis since our focus is on videoon-demand streaming. Further analysis of video encoding bitrate showed that a majority
of video streaming traffic belongs to lower bitrates, which correspond to 144/240p video
resolution. 240p is the most commonly used video resolution. Only a small fraction of
video streaming traffic belongs to higher video resolutions. For example, less than 5% video
streaming traffic belongs to high definition (HD) 720p content.
Short duration videos account for most of the streaming traffic in our data set. In Figure
2.3(b), we plot the cumulative distribution function (CDF) of video duration. We observe
that more than 70% videos are less than 5 minutes long and only 10% videos are longer than
15 minutes. This type of skewed distribution is expected for content providers that serve
user generated content [33].

2.2.4

Quantifying User Engagement

As a first step towards analyzing user engagement, we discuss two ways to quantify it:
discrete and continuous.
11

Video
Size

Δtn

Byte Range

Bn
Δt2
B2

Δt1
B1
t(1,start)

t(1,end) t(2,start) t(2,end)
Time

t(n,start) t(n,end)

Figure 2.2. Illustration of a video streaming session. Gray rectangles represent distinct flows
in a session.
1
0.8

0.6

0.6
CDF

Fraction of Byte Volume

0.8

0.4

0.4

0.2

0

0.2

3GP

MP4

FLV
MP2T
Container Type

0

WebM

(a) Container type

5

10 15 20 25 30 35 40 45 50 55 60
Video Duration (minutes)

(b) Video duration

Figure 2.3. Distributions of container type, video duration, and video player type
For discrete quantification, we first use a nominal variable that represents the following
classes: completed and abandoned. The completed class represents video streaming sessions in which the download process reaches the end-point. The abandoned class represents
video streaming sessions in which the download process is abandoned before reaching the
end-point. In our data set, 21.2% sessions belong to the completed class and 78.8% sessions
belong to the abandoned class. Since users tend to skip videos when streaming gets stuck, we
also use a nominal variable that represents the following classes: skipped and non-skipped.

12

5

x 10

6

x 10
10

12
8
Byte Range

Byte Range

10
8
6

6
4

4
2

2
0
0

10

20
Time (sec)

0
0

30

10 20 30 40 50 60 70 80 90 100 110 120
Time (sec)

(a) completed, non-skipped

(b) abandoned, non-skipped

6

x 10

7

x 10

14
12
Byte Range

Byte Range

15

10
8
6
4

10

5

2
0
0

10

20

30

40
50
Time (sec)

60

70

0
0

80

(c) completed, skipped

30

60

90

120 150 180 210 240 270
Time (sec)

(d) abandoned, skipped

Figure 2.4. Examples of video streaming session classes. Y-axis limits are set to the video
sizes.
The skipped class represents video streaming sessions in which the download process includes at least one seek-forward between the start-point and the last byte downloaded. The
non-skipped class represents video streaming sessions in which the download process does
not include seek-forward between the start-point and the last byte downloaded. In our data
set, 33.9% sessions belong to the skipped class and 66.1% sessions belong to the non-skipped
class. Combining the aforementioned user engagement classification schemes, we can define
the following four non-overlapping classes: (1) completed, non-skipped, (2) abandoned,
non-skipped, (3) completed, skipped, and (4) abandoned, skipped. In our data set,
13

aggregate

< 1 min

1−5 min

5−15 min

> 15 min

1

CDF

0.8
0.6
0.4
0.2
0
0

10

20

30
40
50
60
70
Video Download Completion (%)

80

90

100

Figure 2.5. Distribution of video download completion
17.6% sessions belong to the completed, non-skipped class, 48.5% sessions belong to the
abandoned, non-skipped class, 3.6% sessions belong to the completed, skipped class,
and 30.3% sessions belong to the abandoned, skipped class. Figure 2.4 illustrates examples of video streaming sessions for the four user engagement classes. As mentioned earlier
and observable in Figure 2.4, sessions generally consist of more than one flow. On average,
a video streaming session in our data set consists of 11 flows, where earlier flows tend to
be larger than the following flows. This trend is because video players tend to aggressively
download larger chunks to fill up the available buffer during the initial buffering phase of
a video streaming session [83]. The download rate in this initial phase is limited by the
end-to-end available bandwidth. Afterwards in the steady state phase, the remaining video
is generally downloaded in multiple smaller flows. The download rate in this phase depends
on the video encoding rate and the playback progress.
For continuous quantification, we use a continuous variable (∈ [0,1]) representing the fraction of video download completion. Figure 2.5 shows the CDF of video download completion. Comparing videos of different durations, we observe that shorter videos achieve higher
download completion than longer videos. For aggregate distribution, almost 15% of video
streaming sessions are abandoned with less than 5% download completion. However, after

14

Table 2.1. Core network features. i denotes the flow index of a session with N flows.
Feature

Description

Flow volume

(Bi ) The number of bytes transferred during the ith flow. (Summary stats)

Flow duration

(ti ) The duration (in seconds) from the SYN packet to the last packet in the ith flow.
(Summary stats)

Flow TCP throughput

(Ti ) The ratio of flow volume to flow duration in the ith flow, in KB/s. (Summary stats)

Flow inter-arrival time

(Ii ) Time (in seconds) between the end of the ith flow and the start of the i + 1th flow.
(Summary stats)

Flow flags

FINi and RSTi respectively denote the number of packets with TCP-Finish (no more
data from sender indicating completion) and TCP-Reset (reset the connection indicating
some unexpected error) flags set in the ith flow. Based on the direction of packet transfer,
we distinguish between client-to-server (c→s) and server-to-client (s→c) flags. (Summary
stats)

Largest flow volume

(Bj ) The largest flow volume among all flow volumes, where j denotes the index of this
flow.

Largest flow duration

(tj ) The duration of the j th flow.

Largest flow TCP throughput

(Tj ) The throughput of the j th flow.

Largest flow flags

FINj and RSTj respectively denote the number of packets with TCP-Finish and TCPReset flags set in the j th flow. We distinguish between c→s and s→c flags.

Number of flows

(N ) The total number of flows in a session.

Session volume

(B) The sum of all flow volumes in a session.

Session duration

(t) The sum of all flow durations in a session. t =

Session TCP throughput

(T) The average throughput of a session. T =

Session inter-arrival time

(I) The sum of all flow inter-arrival times (in seconds) in a session

Session flags

(FIN and RST) respectively denote the number of packets with TCP-Finish and TCPReset flags set in a session. We distinguish between c→s and s→c flags.

N
i=1

N
i=1 ti .

Bi /

N
i=1 ti .

the 5% completion mark, the distribution is fairly uniform until the 80% completion mark.
The initial modality in the distribution indicates abandonment that is likely either because
users tend to sample videos [15] or due to longer join times [31]. The later modality in
the distribution (excluding the 100% completion mark) indicates abandonment that is likely
either because users lose interest in the content (e.g., due to video closing credits) or because
shorter videos achieve higher download completion due to aggressive initial buffering.
We note that our definitions of user engagement detect abandonment and skips only during
the download phase of a video. We cannot detect a video abandonment or skip if these events
occur after a video has downloaded completely (e.g., due to lack of user interest). However,
network operators are typically not interested in those events because they are unlikely to
be influenced by network factors.

15

Table 2.2. Radio access network features.
Feature

Description

# soft handovers

(HS ) This handover occurs when a cell is added or removed from the active set [105].
(Session- and cell-level)

# inter-frequency handovers

(HIF ) This type of handover occurs when a UE switches to cell sector of the same or
different NodeB with different operating frequency [105]. (Session- and cell-level)

# IRAT handovers

(HRAT ) This type of handover occurs when a UE switches between different radio access
technologies (e.g., UMTS and GPRS) [105]. (Session- and cell-level)

# RRC failure events

A RRC failure event is logged when a request by a user to allocate more radio resources
is denied by the respective RNC due to network overload or other issues [90]. (Sessionand cell-level)

# admission control failures

These events occur when a user cannot finish the admission control procedure often due
to lack of available capacity. (Session- and cell-level)

Received signal code power

RSCP is the RF energy of the downlink signal obtained after the correlation and descrambling process [105]. It is usually measured in dBm. (Summary stats)

Signal energy to interference

This ratio (Ec /Io ) denotes the ratio of the received energy to the interference level of the
downlink common pilot channel [105]. It is usually measured in dB. (Summary stats)

Received signal strength

RSSI takes into account both RSCP and Ec /Io [105]. It is usually measured in dBm. It
is defined as: RSSI = RSCP - Ec /Io . (Summary stats)

Size of active set

(SAS ) The number of unique cell sectors in the active set. (Summary stats)

Uplink RLC throughput

(TU ) The uplink data rate for UE in the DCH state (in kbps). (Session- and cell-level
summary stats)

Downlink RLC throughput

(TD ) The downlink data rate for UE in the DCH state (in kbps). (Session- and cell-level
summary stats)

# Users in DCH state

(UDCH ) Users served by the representative cell over a window of 1 hour.

Landcover

A nominal variable that defines the geographical terrain of a cell. 2006 National Land
Cover Database contains the 16-class geographical terrain categorization of the United
States at a spatial resolution of 30 meters [3, 37]. The categories include developed-open
space, developed-high intensity, perennial ice/snow, deciduous forest, open water, etc.
We extract the top-3 most common landcover categories in terms of spatial area within
1 km of the representative cell.

Elevation

Elevation of a cell is extracted from the National Elevation Dataset (NED) at a spatial
resolution of 30 meters [6]. We use average elevation of the representative cell as a
feature.

2.3

Analysis of Network Factors

Our main goal is to understand the influence of network factors on user engagement. Towards
this end, this section presents an in-depth analysis of the relationships between network
factors and video abandonment.
We first itemize a wide range of factors that can potentially impact or be influenced
by mobile video user engagement. We compile a comprehensive list of features from the
information available in both CN and RAN data sets. It is noteworthy that while features

16

extracted from the RAN data set are only applicable for cellular networks, features extracted
from the CN data set are applicable for other kinds of wired and wireless networks as well.
For each video streaming session, we can extract CN features for individual flows and
the whole session (labeled as F low and Session features in Table 2.1, respectively). Since
sessions may have different number of flows, we compute the following statistical measures to
summarize the flow-level features for whole sessions: mean, standard deviation, minimum,
maximum, 25th percentile, median (50th percentile), and 75th percentile. Hence each flowlevel feature listed in Table 2.1 (labeled with “Summary stats”) represents 7 summary values.
We also extract these features for the largest flow (in terms of byte volume) of a video
streaming session, as a single flow typically dominates each video session.
For each video streaming session, we also extract RAN features for the user and the cell
sectors that service the user during the session. The RAN features are described in Table 2.2.
For session-level features, the RAN data set records belonging to a user can be identified
using the combination of IMSI and session start and end timestamp information. For celllevel features, however, the selection criterion of the representative cell for a session is not
obvious because the active set and the primary cell may change between the start and end
of a session. Towards this end, we select the most common cell sector (in terms of number
of records) to be the representative cell for a session. For each session, cell-level features are
computed for all users served by the representative cell in the time window at the session
start. Features in Table 2.2 labeled with “Session- and cell-level” indicate features that we
compute both a session-level value and cell-level value, as defined above. For example, for #
soft handovers, we compute one feature as the number of soft handovers for the user during
the session, and another as the number of soft handovers for all users in the representative cell
of that session. For features that can hold multiple values during a session (e.g., RSSI), we
compute the same 7 summary statistic values listed above for flow features. These features
are labeled with “Summary stats” in Table 2.2.
To better understand the relationship between features and user engagement, we plot the

17

100

95

95

90
85
80
75
70
65

90
85
80
75
70
65

60
< 1 sec

5 sec
15 sec 30 sec
Session Duration

1 min

5 sec
15 sec
Average Flow Duration

75
70

30 sec

60

1 min

95

85
80
75
70

Abandonment Rate (%)

95
Abandonment Rate (%)

95
90

90
85
80
75
70
65

2

3

4
5
6
7
FIN Flag Count (s−>c)

8

9

80
75
70

10

20
30
40
DCH User Count (/hour)

50

60
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
Hour of Day

60

(e) DCH user count

(f) Hour of day

95

95

75
70
65
60
0

1

2

3
4
5
6
7
8
Cell RLC Uplink Throughput (Mb/s)

9

10

(g) Cell RLC uplink throughput

Abandonment Rate (%)

95

Abandonment Rate (%)

100

80

90
85
80
75
70

90
85
80
75
70

65

65

60
−17 −16 −15 −14 −13 −12 −11 −10 −9 −8 −7 −6 −5 −4 −3
Average Session Signal Energy to Interference Ratio (dB)

60
−96 −92 −88 −84 −80 −76 −72 −68 −64 −60 −56 −52 −48
Average Received Signal Strength Indicator (dBm)

(h) Signal-to-interference ratio

(i) Average received signal strength
100

95

95

95

85
80
75
70
65

Abandonment Rate (%)

100

Abandonment Rate (%)

100

90

90
85
80
75
70
65

60

−90

−86 −82 −78 −74 −70 −66 −62
Average Received Signal Code Power (dBm)

60
0

−58

(j) Received signal code power

90
85
80
75
70
65

1

2

3
4
5
6
Cell Soft Handover Count

7

8

60
1

9

(k) Cell soft handover count

95

95

95

80
75
70

Abandonment Rate (%)

100

Abandonment Rate (%)

100

85

90
85
80
75
70
65

65
60
0

1
Cell RRC Error Count

(m) Cell RRC error count

2

60

1.5

2
2.5
Average Active Set Size

3

(l) Average active set size

100

90

30

85

100

85

25

90

100

90

15
20
Number of Flows

65

60

10

10

(c) Number of flows
100

1

5

(b) Average flow duration

(d) FIN flag count (s→c)
Abandonment Rate (%)

80

100

60
0

Abandonment Rate (%)

85

100

65

Abandonment Rate (%)

90

65

60
< 1 sec

2 min

(a) Session duration
Abandonment Rate (%)

Abandonment Rate (%)

100

95
Abandonment Rate (%)

Abandonment Rate (%)

100

90
85
80
75
70
65

10

20

30
40
50
60
70
80
Session TCP throughput (KB/sec)

90

(n) Session TCP throughput

100

60
0

100
200
300
400
500
Session RLC Downlink Throughput (kb/s)

600

(o) Session downlink throughput

Figure 2.6. Abandonment rate distributions. Shaded areas represent confidence intervals.
abandonment rate distributions of prominent features in Figure 2.6. The abandonment rate
is defined as the fraction of sessions in the data set that are abandoned. The shaded areas
18

90

80

80

70

70

70

60
50
40

Skip Rate (%)

90

80

Skip Rate (%)

Skip Rate (%)

90

60
50
40

50
40

30

30

30

20

20

20

10
< 1 sec

5 sec
15 sec
30 sec
Maximum Flow Inter−arrival Time

10

1 min

5

(a) Max. flow inter-arrival time

10

15
20
Number of Flows

25

30

10
< 1 sec

80

80

80

70

70

70

40

60
50
40

30

30

20

20

10
< 1 sec

5 sec
15 sec
Average Flow Duration

30 sec

(d) Average flow duration

1 min

Skip Rate (%)

90

50

10
0

1 min

2 min

(c) Session duration

90

60

5 sec
15 sec 30 sec
Session Duration

(b) Number of flows

90

Skip Rate (%)

Skip Rate (%)

60

60
50
40
30
20

1

2

3

4
5
6
7
RST Flag Count (s−>c)

8

9

(e) RST flag count (s→c)

10

10
0

1

2

3
4
5
6
7
8
Cell RLC Uplink Throughput (Mb/s)

9

10

(f) Cell RLC uplink throughput

Figure 2.7. Skip rate distributions. Shaded areas represent the 95% confidence interval.
represent the 95% confidence interval [112]. The horizontal line in each plot denotes the
average abandonment rate. Figure 2.6 suggests the following implications:
Abandoned sessions are shorter. Although this result is expected, we find that each measure
of session length provides unique information. Figure 2.6(a) shows sessions shorter than
15 seconds are significantly more likely to be abandoned. The sharp inflection point may
be due to automated failure of sessions that do not complete the initial buffering phase.
Similarly, Figure 2.6(b) shows a sharp drop in abandonment rate for sessions with average
flow duration longer than 1-3 seconds. Figures 2.6(c) and 2.6(d), both measures of flow
count, show that sessions with more flows are less likely to be abandoned. Thus, each of
these features provides information useful for detecting abandonment.
Network load increases the abandonment rate. Despite the low bitrate of video streams relative to the capacity of a cell (∼500 kbps vs. 3-14 Mbps), we find there is a nearly linear
relationship between various measures of RAN load and abandonment rate. For example,
Figure 2.6(e) shows that the abandonment rate goes up by roughly 7% for each 10 active
users in a sector, even though these resources are scheduled in a proportional fair manner

19

every 2 ms [14]. This load relationship can also be seen in Figure 2.6(f), which shows that
abandonment rate is highest during the peak load hours of the day and much lower during the
off-peak hours. This effect can be explained by Figure 2.6(g), which shows that the abandonment rate begins to grow when aggregate cell uplink throughput is just 50 kbps, significantly
less than the cell capacity. This is likely because even small amounts of uplink traffic can
cause interference, and Figure 2.6(h) shows that abandonment rate decreases by 2% for each
dB increase in the signal-to-interference ratio (Ec /Io ). Furthermore, Figures 2.6(i) and 2.6(j)
show that the abandonment rate increases as RSSI and RSCP increase, contrary to the general belief that higher received power means a better user experience. These Ec /Io , RSSI,
and RSCP results strongly suggest that users with higher received power also experience
more interference. Hence, user engagement in our data set is more limited by interference
rather than poor coverage. In summary, these results suggest that measures a cellular operator takes to reduce per-sector load and interference will improve user engagement in a
roughly linear manner.
Handovers increase the abandonment rate. Another important question for operators is
whether cell handovers disrupt the user experience. Our results suggest that all handover
types are correlated with a decrease in user engagement. Figure 2.6(k) shows that cells with
soft handovers, which are “make-before-break” handovers, have significantly higher abandonment rates. This result is supported by Figure 2.6(l), which shows increase abandonment
rates for non-integral mean active set values (i.e., sessions that incurred active set additions
or deletions during soft handovers). These effects may be partially due to the RRC signalling required for handover. Figure 2.6(m) shows that when RRC signalling errors occur,
abandonment rate increases as well.
Higher throughput does not always mean lower abandonment. Although measured throughput is often used as a proxy for network quality (e.g., [39]), our results suggest higher average
throughput does not always indicate lower abandonment. Figures 2.6(n) and 2.6(o) show
that abandonment rate decreases as average TCP and RLC throughput increases up to a

20

point. However, the abandonment rate is lowest at TCP throughput equal to the steady
state streaming rate, and it grows for higher throughput values. This pattern is because
early abandonment, while the video is still in the non-rate-limited buffering phase, actually
results in higher average throughput than watching a video through the rate-limited steady
state phase.
In Figure 2.7, we plot the skip rate distributions of prominent features. The skip rate is
defined as the fraction of sessions in the data set that are skipped. Due to space constraints,
we only plot the skip rate curves for features that have different trends than the respective
abandonment rate curves. The shaded areas represent the 95% confidence interval [112].
The horizontal line in these plots denotes the average skip rate.
We note that skip rate has a direct relationship with maximum flow inter-arrival time
(Figure 2.7(a)) and number of flows (Figure 2.7(b)). This is likely because skips result in
more flows and larger gaps between them. Skip rate peaks at session and flow durations
of just a few seconds (Figures 2.7(c) and 2.7(d)), suggesting that users chose to skip early
in a session, either due to network issues or lack of interest. Figure 2.7(e) shows larger
RST flag count correlated with higher skip rate likely because skips cause connection resets.
These contrasting patterns imply that it is more challenging to measure both skips and
abandonment than a single engagement metric.

2.4

Modeling User Engagement

In this section, we develop models to accurately predict user engagement using only standard
radio network statistics and/or TCP/IP header information.

2.4.1

Background and Problem Statement

Network operators would like to predict user engagement for three main applications. First,
directly estimating these metrics from network traffic requires cost-prohibitive collection of

21

sensitive data (requiring deep-packet-inspection) beyond TCP/IP headers. Thus, cost and
privacy concerns would be alleviated with a model that accurately predicts these engagement
metrics using only standard radio network statistics and/or TCP/IP header information
that is already collected. A simple and efficient model would be able to monitor video
engagement metrics over an entire network in real-time to facilitate trending and alarming
applications. Second, self-organizing networks [1] (SON) enable mobile networks to adapt
resource allocation dynamically. Thus, the ability to accurately predict video abandonment
early in a video session can help guide SONs to provide more resources to the most vulnerable
sessions. Third, an interpretable model that relates network factors to user engagement
metrics can help network operators in prioritizing infrastructure upgrades by identifying the
combination of factors that need to be adjusted for improving engagement.
Our goal is to jointly use the available features to accurately model both nominal and
continuous measures of user engagement (defined in Section 2.2). Moreover, we want our
models to make the prediction decisions as early as possible in a video session. Therefore,
we define the modeling problem as follows: given the feature set computed over the initial τ
seconds (τ ≤ t) of a video session, predict the user engagement metric.

2.4.2

Proposed Approach

As we observed in Section 2.3, many network features are not independent of each other and
the relationships among them can be non-linear. Therefore, modeling user engagement given
all available features is a non-trivial prediction task. Furthermore, our modeling approach
should answer pertinent questions such as: Which features are more useful for prediction?
How many features do we need to reap a substantial accuracy gain?
To address these challenges, we use a machine learning approach for modeling the complex
relationships between network features and user engagement metrics. The choice of learning
algorithm is crucial to successfully modeling feature interdependence and non-linearity. After
some pilot experiments, we found that decision tree algorithms with bootstrap aggregation

22

(or bagging) [113] work well for both nominal (classification) and continuous (regression)
user engagement metrics. Other commonly used Bayes and linear regression algorithms
were outperformed by the decision tree algorithms in our pilot experiments. Decision trees
do not require feature independence assumption and can handle non-linearities by employing
multiple splits/breaks for each feature. Furthermore, decision tree models comprise of simple
if-then-else branches, which can process data efficiently. For our experiments, we used C4.5
decision tree algorithm [81] and M5P regression tree algorithm [82].

2.4.3

Experimental Setup

We evaluate the effectiveness of classification models in terms of the following standard
Receiver Operating Characteristic (ROC) metrics [35]: (1) True Positives (TP), (2) True
Negatives (TN), (3) False Positives (FP), and (4) False Negatives (FN). We summarize the
|TP|

classification results in terms of the following ROC metrics: True positive rate = |TP|+|FN| ,
|TP|+|TN|

|FP|

False positive rate = |FP|+|TN| , and Accuracy = |TP|+|TN|+|FP|+|FN| . We also plot the
standard ROC threshold curves in our evaluation. An ideal ROC threshold curve approaches
the top-left corner corresponding to 100% true positive rate and 0% false alarm rate. The
Area Under Curve (AUC ∈ [0, 1]) metric summarizes the classification effectiveness of an
ROC threshold curve, where the AUC values approaching 1 indicate better accuracy. Besides,
we evaluate the effectiveness of regression models in terms of the standard root-mean-square
error (RMSE ∈ [0, 1]) metric.
To avoid class imbalance and over-fitting during model training, we use k-fold crossvalidation with class resampling [113]. In our pilot experiments, different values of k yielded
very similar results. All experimental results reported in this work are presented for k = 10.
Furthermore, we evaluate the feature sets on varying initial time window sizes: τ = t (i.e.,
use all available data), τ ≤ 60 seconds, and τ ≤ 10 seconds. We expect the classification
accuracy to degrade for smaller initial time windows.
We separately evaluate the core network feature set (abbreviated as CN), the radio net-

23

1

0.9

0.9

0.9

0.8

0.7
completed,non−skipped
abandoned,non−skipped
completed,skipped
abandoned,skipped

0.6

0.5
0

0.1

0.2
0.3
False Positive Rate

0.4

(a) 4-way

0.5

True Positive Rate

1

True Positive Rate

True Positive Rate

1

0.8

0.7

0.6

0.5
0

0.8

0.7

0.6

0.1

0.2
0.3
False Positive Rate

0.4

0.5

(b) completed vs. abandoned

0.5
0

0.1

0.2
0.3
False Positive Rate

0.4

(c) completed, non-skip.

0.5

vs.

rest

Figure 2.8. ROC threshold plots for various class pairs
work feature set (abbreviated as RAN), and the combined feature set (abbreviated as All).
The radio network and core network features are separately grouped because they require
different types of instrumentation. Recall from Section 2.2, measuring radio network features
requires instrumentation at RNCs and measuring core network features requires instrumentation at Gn interfaces in a cellular network.

2.4.4

Evaluation

Our experimental results demonstrate that the proposed machine learning model can predict
both video abandonment and skips with high accuracy using only the initial 10 seconds of
a session, while meeting the constraints of network operators. We find that although some
features are more useful than the rest for prediction, using all available features results in
significant accuracy gain as compared to using only a few top features. Moreover, our decision
and regression tree models are interpretable; they inform us about the relative usefulness of
features by ordering them at different tree levels and we can understand the specific set of
network conditions that impact user engagement by following each path in the tree. Many
of these conditions can be influenced by a network operator, and thus provide guidance on
how to improve user engagement in different situations.
Next we present detailed results for both classification and regression. For classification,
we build decision tree models to predict both individual classes and their combinations. For

24

Table 2.3. Accuracy of 4-way classification
Feature
Set

completed
non-ski.
(%)

abandoned
non-ski.
(%)

CN
RAN
All

72.0
64.1
73.1

78.4
53.7
77.8

CN
RAN
All

69.5
62.6
70.4

CN
RAN
All

69.5
60.5
69.6

completed
skipped
(%)

abandoned
skipped
(%)

Avg.
(%)

76.2
73.2
77.4

73.4
55.7
74.4

75.0
61.7
75.7

64.6
57.0
65.4

65.2
56.5
66.3

65.3
57.4
65.5

64.4
55.9
65.2

τ =t

τ ≤ 60 seconds
62.7
47.8
63.7

63.8
58.5
65.7

τ ≤ 10 seconds
59.6
46.6
60.7

63.3
59.0
64.9

individual classes, we train the decision tree algorithm for 4-way classification. By combining
classes, we change the granularity at which the model predicts user engagement. We use
the following two class pairs: completed vs. abandoned and completed, non-skipped
vs. rest. For combined classes, we train the decision tree algorithm for 2-way classification.
Naturally, we expect better accuracy for 2-way classification than 4-way classification because
the model is trained at a coarser granularity.
For 4-way classification, we observe that the core network feature set outperforms the
radio network feature set. Combining the core and radio network feature sets improves
the average accuracy. In Table 2.3, we observe the best average accuracy of 75.7% for the
combined feature set at τ = t seconds. In practice, improvement in accuracy means that
fewer sessions need to be measured before the network operator can be confident that a
real change in user engagement has occurred and an alarm can be raised. For a cell sector
serving only a handful of users simultaneously, this can mean a significant reduction in
time to detection of issues since video sessions may not be frequent. For the combined
feature set, ROC threshold curves are plotted in Figure 2.8(a). The ordering of ROC curves
conforms with the class-wise accuracy results in Table 2.3. The best operating accuracy of
77.8% is observed for abandoned, non-skipped class, which corresponds to 95.5% AUC. As
expected, in Table 2.3 we observe that the average accuracy degrades for smaller values of

25

Table 2.4. Accuracy of completed vs. abandoned and completed, non-skipped vs. rest
classification
Feature
Set

comp.

abandoned

(%)

(%)

Avg.
(%)

completed,
non-ski.
(%)

rest

Avg.

(%)

(%)

77.2
71.9
76.9

92.3
88.8
92.4

88.5
84.5
88.5

78.0
72.7
77.6

91.5
88.4
91.4

88.1
84.4
88.0

77.1
73.2
76.6

90.7
89.2
90.7

87.3
85.1
87.2

τ =t
CN
RAN
All

80.5
73.9
80.5

85.9
77.9
86.5

83.2
75.9
83.5

τ ≤ 60 seconds
CN
RAN
All

79.5
74.1
78.8

82.1
78.4
82.6

80.8
76.3
80.7

τ ≤ 10 seconds
CN
RAN
All

79.6
74.2
77.8

80.7
78.9
82.1

80.1
76.5
79.9

τ . We observe the best average accuracy of 65.2% for the combined feature set at τ ≤ 10
seconds, representing more than 10% accuracy reduction as compared to τ = t seconds.
Since operators may only be interested in predicting video abandonment, it is also important to build models to accurately predict completed vs. abandoned and completed,
non-skipped vs. rest class pairs (instead of all four classes). These class pairs compare
the scenarios when users either abandon or skip the video streaming session. Table 2.4
presents the classification results for these two class pairs. As expected, we observe significant improvement in accuracy for both class pairs as compared to 4-way classification due
to reduced number of classes. Moreover, we observe that the average accuracy suffers only
minor degradation (less than 5%) as τ is reduced. For completed vs. abandoned class pair,
we observe the best average accuracy of 83.5% for the combined feature set at τ = t seconds.
For the combined feature set, the ROC threshold curve is plotted in Figure 2.8(b), which
corresponds to 93.4% AUC. For completed, non-skipped vs. rest class pair, we observe
the best average accuracy of 88.5% for the combined feature set at τ = t seconds. For the
combined feature set, the ROC threshold curve is plotted in Figure 2.8(c), which corresponds
to 95.1% AUC.
For regression, we build regression tree models to predict video download completion.

26

Table 2.5. Root-mean-square error of regression
Feature Set

Linear Regression

CN
RAN
All

0.25
0.30
0.23

CN
RAN
All

0.27
0.36
0.24

CN
RAN
All

0.29
0.37
0.28

M5P Regression Tree

τ =t
0.15
0.27
0.14

τ ≤ 60 seconds
0.18
0.34
0.17

τ ≤ 10 seconds
0.22
0.34
0.21

Overall, we observe similar patterns across feature sets and varying initial window sizes
for regression results as observed for classification results earlier. Table 2.5 presents the
results of M5P regression tree algorithm and a simple linear regression algorithm. We note
that M5P regression tree algorithm consistently outperforms the simple linear regression
algorithm, indicating that M5P can successfully capture the non-linear dependencies between
features and video download completion that are not modeled by the simple linear regression
algorithm. RMSE is lower for larger τ values, and All feature set has the lowest RMSE as
compared to individual CN and RAN feature sets. We observe the best RMSE of 0.14 for
τ = t and All feature set.

2.4.5

Discussion

Our evaluation highlighted that using all features together results in better classification/regression accuracy than using their subsets. To systematically analyze the utility
of adding features to the classification/regression model, we plot accuracy versus feature set
size for completed vs. abandoned classification in Figure 2.9. Towards this end, we iteratively rank the feature set using the following greedy approach: for k th iteration, we evaluate
the accuracy gain of the model by separately adding candidate features and selecting the
k + 1th feature which provides the best accuracy. The top features are related to session size
and TCP throughput which we believe are correlated with user engagement, as sufficient
27

90

Accuracy (%)

80

70

60
Without Bagging
With Bagging
50
1

5
10
Feature Set Size (%)

25

50

100

Figure 2.9. Accuracy vs. feature set size for completed vs. abandoned classification
throughput is required for video streaming and abandonment results in low volume sessions.
The plot shows that a few top features provide most of the accuracy gain. However, the
gains in accuracy we achieve from including 5% to 100% of features are not diminishing.
Thus, it makes sense to use all available features because the computational overheads of
feature extraction and testing for additional features is low (in order of milliseconds).
The decision/regression tree models also provide actionable insights. The pruned versions
of the decision tree model for completed vs. abandoned class pair and the regression tree
model for video download completion are plotted in Figures 2.10(a) and (b), respectively.
The tuples below the rectangular leaf nodes represent their error and population size. Due
to space constraints, we only plot the tree models for τ ≤ 10 seconds which are useful for
network operators to predict user engagement by observing only the initial 10 seconds of
video streaming sessions.
From the model, network operators can identify network factors that may have the most
impact on user engagement and make decisions to prioritize certain infrastructure upgrades.
The features at the higher levels of a tree tend to have more distinguishing power and account
for more population size than lower level features. The root node is session duration for
Figure 2.10(a) and largest flow volume in Figure 2.10(b). However, it is noteworthy that the
ordering of network factors in Figure 2.10 does not strictly determine their importance. First,
trees shown in Figure 2.10 represent one of many candidate tree models generated during
28

Session duration
> 5.9 sec

≤ 5.9 sec
Abandoned

Maximum flow volume

(16%,27%)

≤ 90 KB

Abandoned

> 90 KB

Flow throughput standard deviation

(19%,17%)

> 25 KB/s

≤ 25 KB/s

Cell DCH user count

Abandoned
(22%,4%)

≤ 60 KB/s

≤ 9 am

Completed

> 9 am

(26%,5%)

> -83.5 dBm

≤ -83.5 dBm

Completed

Abandoned

(28%,6%)

(27%,7%)

≤ 2.9 MB

Session throughput
≤ 56 KB/s

>0

Median flow volume

≤ 48 Mbps > 48 Mbps

Session maximum RSSI

Completed

≤0

Cell downlink throughput

(27%,2%)

> 24

Largest flow FIN flag (c→s)

> 1.4 MB

≤ 1.4 MB
Abandoned

≤ 8 pm

> 8 pm

Mean flow volume

(22%,3%)

Hour of day

Cell DCH user count
≤ 24

> 60 KB/s

Abandoned

Hour of day

(27%,4%)

Largest flow throughput

(24%,3%)

> 0.5

> 0.9

≤ 0.9

Abandoned

Flow FIN flag mean (s→c)
≤ 0.5

Flow FIN flag mean (s→c)

> 13

≤ 13

Completed

Abandoned

(20%,7%)

(25%,2%)

> 2.9 MB

> 56 KB/s
Session volume
≤ 5 MB > 5 MB

Completed

Abandoned

Abandoned

Completed

Abandoned

Completed

(26%,2%)

(32%,2%)

(29%,2%)

(17%,3%)

(32%,3%)

(26%,1%)

(a) C4.5 decision tree model for completed vs. abandoned class pair
Largest flow volume
> 0.6 MB

≤ 0.6 MB
Largest flow duration
≤ 5.3 s

Minimum flow volume
> 377 B

> 5.3 s

0.174

0.492

(25%,15%)

(44%,8%)

≤ 377 B
Largest flow RST flag (c→s)

Mean flow volume
> 2.5 MB

Largest flow FIN flag (c→s)

≤ 2.5 MB

> 0.5

≤ 0.5
0.653

0.777

(37%,6%)

(44%,4%)

≤0

Median flow RST flag (c→s)

≤ 10.8 s

> 10.8 s

Median flow volume
≤ 38 KB

> 0.75

≤ 0.75
Minimum flow duration

> 38 KB

>0

Mean flow FIN flag (c→s)
≤ 0.34

Largest flow volume
> 3.9 MB ≤ 3.9 MB

> 0.34

0.879

Largest flow throughput

0.638

0.830

(23%,6%)

(49%,7%)

≤ 51 kB/s

0.638

0.896

(28%,4%)

(23%,7%)

(40%,7%)

> 51 kB/s

Largest flow volume
> 2.8 MB

≤ 2.8 MB

0.591

0.424

0.945

(26%,2%)

(25%,6%)

(26%,6%)

Largest flow RST flag (s→c)
≤ 0.5

> 0.5

0.735

0.562

(27%,5%)

(25%,7%)

Maximum flow throughput
≤ 58 kB/s

> 58 kB/s

0.928

0.967

(17%,2%)

(14%,8%)

(b) M5P regression tree model for video download completion

Figure 2.10. Pruned decision and regression tree models for τ ≤ 10 seconds using All feature
set. The tuples below leaf nodes represent (error%,population size%).
bagging – other candidate trees have different feature ordering. Second, these features are
not independent – session duration and maximum flow volume (top two features in Figure
2.10(a)) jointly account for session throughput and largest flow throughput to some extent.
The paths from the root node to leaves represent the equivalent rule sets, which inform
29

network operators of the interdependence among multiple features. For the regression tree in
Figure 2.10(b), if largest flow volume is ≤ 0.6 MB and largest flow duration is ≤ 5.3 seconds
then video download completion prediction is 17.4%. In contrast, if largest flow volume
is ≤ 0.6 MB and largest flow duration is > 5.3 seconds then video download completion
prediction is increased to 49.2%. Moreover, for the decision tree in Figure 2.10(a), if session
duration > 5.9 seconds and maximum flow volume is ≤ 90 KB after the first 10 seconds then
our model predicts that the video session will be abandoned with 19% error probability.
A network operator can influence many network factors to improve user engagement. The
feature splits in Figure 2.10 provide network operators actionable insights for this purpose.
For example, the decision tree predicts a session to be abandoned if cell DCH user count is
larger than a threshold under certain conditions. Most cellular network users are covered by
multiple cell sectors, and handover algorithms use signal quality and sector load to determine
which sector each user should receive data from. The thresholds used for handover are
typically fixed at a single global value. However, the feature splits in Figure 2.10(a) suggest
that the network operator can tolerate a higher cell sector load threshold for sessions with
higher throughput variance than sessions with lower throughput variance (24 vs. 13 users
occupying DCH channels).
Below, we discuss limitations of our analysis and results. First, our results are based
on traces from a single video service provider that uses progressive download with byterange requests. Therefore, our findings may not be representative of video service providers
that use other streaming methods. Second, our user engagement model cannot differentiate
between video abandonment due to network-related issues and due to lack of user interest.
Distinguishing between these two cases requires either client- or server-side information,
which is not available to network operators.

30

2.5

Related Work

Prior studies can be categorized based on whether they use network-side or user-side instrumentation.

2.5.1

Network-side Instrumentation

Our study builds upon previous work by Gill et al. [39], Finamore et al. [36], and Erman et
al. [33]. Each of these studies characterized the abandonment rate of video streaming sessions by collecting passive network traces at a campus edge network, 5 different wired edge
locations, and a cellular network, respectively. To estimate video quality, these studies use
deep-packet-inspection techniques (e.g., [86]) to understand the video provider protocol. Our
finding that 77% of video sessions are not completely streamed is closest to Finamore’s result
(80%), whereas Gill and Erman found lower abandonment rates (50% and 60%, respectively).
All these results indicate that abandonment rates are high. Based on the ratio of download
rate to encoding bitrate of video, Gill et al. concluded that approximately 20% of the video
streaming sessions were interrupted due to poor performance. However, we find that average
throughput is not always a good indicator of abandonment rate.
Our work makes two significant contributions on top of these studies. First, in order to
measure abandonment, previous studies relied on deep-packet-inspection to extract information beyond TCP/IP headers, which requires prohibitive computational resources to employ
at the scale of network carriers and can pose privacy problems in practice. Our work demonstrates that we can accurately measure abandonment without such information. Second,
these studies did not provide insight into how network characteristics and performance impact abandonment rates. Our study is the first to examine the relationship between mobile
network factors and user engagement and the first to provide guidance on how operators can
reduce video abandonment.

31

2.5.2

Client-side Instrumentation

In [83], Rao et al. conducted an active measurement study of video streaming traffic from
YouTube and Netflix. They proposed models to express various properties of completed
and interrupted video streaming traffic as a function of the video parameters. However, the
authors did not study user engagement because this work is based on active measurement
data.
Dobrian et al. conducted a large scale, passive, user-side study to understand the impact
of video streaming quality on user engagement [31]. They used video player instrumentation
to quantify video streaming quality metrics such as join time, buffering ratio, average bitrate,
rendering quality, and rate of buffering. Their analysis showed that buffering ratio has the
largest impact on user engagement for non-live content and average bitrate significantly
impacts user engagement for live content. Krishnan and Sitaraman also conducted a large
scale, passive, user-side study to understand the impact of video streaming quality on user
engagement [55]. They quantified the impact of video streaming quality metrics on user
engagement using quasi-experimental designs. In [15, 16], Balachandran et al. developed
a QoE model using various user-side video quality metrics. Specifically, they developed
a decision tree based machine learning model to predict the extent of the video watched
by users. For the two-class problem (completed vs. interrupted/abandoned), their trained
model achieved up to 70% accuracy which progressively decreases as the number of classes
is increased.
While these studies analyzed user engagement using data collected via video player instrumentation, our work focuses on characterizing and modeling user engagement using
network-side measurements. Modeling user engagement using network-side data is particularly important for network operators because they do not have access to video player instrumentation data. Interestingly, our model based on network-side data can predict whether
a user completely downloads a video with more than 87% accuracy, which is significantly
better than the client-side model developed by Balachandran et al. Furthermore, since our

32

model does not require client-side instrumentation, it can be used by any network operator,
not just the video content provider.

2.6

Conclusions

This work represents the first characterization of mobile video streaming performance and
models its impact on user engagement from the perspective of network operators. We observed that many network features exhibit strong correlation with abandonment rate and
skip rate. Our proposed model achieved more than 87% accuracy by observing only the
initial 10 seconds of video streaming sessions. Overall, we conclude that the proposed model
based on standard radio network statistics and/or TCP/IP header information can be successfully used by network operators to predict video abandonment. Our model is useful for
network operators to continuously monitor at scale to proactively mitigate the factors that
can adversely impact user engagement.

33

3

Mobile Network Performance

during Crowded Events
3.1

Introduction

Crowded events, such as football games, public demonstrations, and political protests, put an
extremely high demand for communication capacity on cellular networks around the duration
of the events [11]. Cellular networks are facing unprecedent challenges in dealing with such
spiky demand. First, cellular network utilization has already been rapidly approaching its
full capacity throughout the world due to the increasing prevalence of cellular devices such as
smartphones, tablets, and Machine-to-Machine (M2M) devices. Even in the United States,
cellular network usage is at an all-time high even under normal operating conditions and
projections show that traffic volume will further increase by 26 times by 2015 as compared
to 2010 [10,58]. Second, the spiky demand caused by crowded events is often extremely high
because there maybe a large number (often tens of thousands) of users gathered in a small
region (such as a football stadium) that is covered by only a small number of cell towers.
Even worse, people tend to use their cellular devices more than usual during the events to
either talk with their friends or access the Internet (such as uploading a photo to Facebook
or a video clip to YouTube during a football game). Third, it is critical for cellular networks
to cope with such high demand during crowded events because poor performance will affect a
large number of people and cause widespread user dissatisfaction. Although cellular network

34

operators have deployed remediation solutions, such as portable base stations called Cells on
Wheels (COWs) for temporarily increasing communication capacity and free Wi-Fi access
points for offloading Internet traffic from cellular base stations, crowded events still remain
a major challenge for cellular network operators.
To the best of our knowledge, this work presents the first thorough investigation of cellular
network performance during crowded events. Based on the real-world voice and data traces
that we collected from a tier-1 cellular network in the United States during two high-profile
crowded events in 2012, we aim to answer the following three key questions.
How does cellular network performance degrade during crowded events as compared to
routine days? To answer this question, we characterize cellular network performance during
both the pre- and post-connection phases, which helps us to understand user experience
before and after acquiring radio resources. For pre-connection phase, we find that preconnection failures dramatically increase during the crowded events by 100-5000 times as
compared to their average on routine days. These failures occur because when too many users
attempt to acquire radio resources at the same time, they exhaust the limited bandwidth of
the signaling channel resulting in connection timeouts and failures. We find that this resource
exhaustion occurs not only at the event venue, but also as far as 10 miles around the event
as users arrive and depart. Moreover, some failures, such as dropped and blocked voice
calls, are most likely to occur in bursts just before, after, and during event intermissions.
For post-connection phase, we find that voice network performance in terms of dropped and
blocked calls degrades during crowded events by 7-30 times, and data network performance
in terms of packet loss ratio and round trip time (RTT) degrades during crowded events by
1.5-7 times, compared to their average on routine days.
What causes the performance degradation? To answer this question, we analyze user traffic
patterns in terms of both aggregate network load and user-level session characteristics. For
aggregate network load, we find that uplink traffic volume increases by 4-8 times, and both
downlink traffic volume and the number of users increase by 3 times, during the crowded

35

events as compared to their average on routine days. We conclude that the large number
of users trying to access radio resources at the same time is a major cause of the observed
excessive pre-connection failures. For user-level session characteristics, we find that the
average byte volume per session decreases by 0.5 times during the events even though the
average session length increases. Our investigation suggests that this change in workload is
due to a change in application usage during these events, such as the increased use of online
social networks. We conclude that lower byte volume per session, despite an increase in
average session length, is a major cause of the waste of radio resources in the post-connection
phase.
How would practical mitigation schemes perform in real-life? To answer this question, we
investigate two practical mitigation schemes that do not require making significant changes
to the cellular infrastructure: radio resource allocation tuning and opportunistic connection
sharing. Radio resource allocation tuning addresses the issue of inefficient radio resource
allocation in the post-connection phase by adjusting cellular network resource allocation parameters. Cellular networks allocate resources to each user using a Radio Resource Control
(RRC) state machine, which is synchronously maintained by the network and devices. Different states of the RRC state machine correspond to different amount of radio resources
allocated by the network and energy consumption by cellular devices. Since a large number of users contend for limited radio resources during crowded events, we show that more
aggressive release of radio resources via 1-2 seconds shorter RRC timeouts helps to achieve
a better tradeoff between wasted radio resources, energy consumption, and delay during
crowded events. Note that cellular network operators often know the time and location of
large crowded events beforehand; thus, it is practical for them to adjust cellular network
parameters before events and restore them after events. Opportunistic connection sharing
addresses increased pre-connection failures by aggregating traffic from multiple devices into
a single cellular connection. That is, by having some devices share their cellular connection
with nearby devices over their Wi-Fi or Bluetooth interface (i.e., “tethering”), opportunistic

36

connection sharing should reduce the number of overall cellular connection requests, thereby
reducing request congestion and connection failures. Using trace-driven simulations, we show
that connection sharing can reduce connection failures by more than 95% when employed
by a small number of devices in each cell sector. Although much work has been done on
opportunistic connection sharing to address issues such as mobility, energy use, and incentives [2, 42, 67], no prior work has demonstrated the significant benefit that such connection
sharing can achieve based on real-life cellular network data.
The rest of this chapter is organized as follows. In Section 3.2, we present details of
the data collection process. Section 3.3 presents the characterization of performance issues
during the crowded events and Section 3.4 presents various aspects of user traffic patterns
to study the underlying causes of performance issues. We conduct trace-driven simulations
to evaluate radio network parameter tuning and opportunistic connection sharing in Section
3.5. Section 3.6 reviews related work and the work is concluded in Section 3.7.

3.2

Data Set

The data set used in this study contains anonymized logs collected from RAN and CN of
a tier-1 cellular network in the United States serving over 100 million customers. Our data
set consists of two separate collections, each covering a metropolitan area during a highprofile event in 2012. The collections include information from hundreds of thousands of
users and thousands of cell locations over multiple days including the event days. The first
event, referred to as Event A hereafter, is a sporting event that consists of two segments of
activities separated by an intermission. The second event, referred to as Event B hereafter, is
a conference event that consists of multiple segments of activities separated by intermissions
of varying lengths. In terms of publicly available attendance statistics, event A is roughly
twice the size of event B. The activity segments in both events are illustrated by gray bars in
all timeseries figures presented in this work. Furthermore, it is noteworthy that free Wi-Fi

37

service was provided to all users during both of the events to offload as much cellular network
traffic as possible. However, we do not have measurements on the network traffic that was
offloaded to these Wi-Fi services; thus, we acknowledge that our results may be biased by
this offloading.
The anonymized logs collected at an RNC in RAN contain throughput and RRC protocol
request/response information. Using RRC requests from UEs and responses from the RNC,
the RAB status of all UEs can be monitored. The anonymized logs collected from the CN
contain TCP header information of IP flows carried in PDP context tunnels. They are
collected from the Gn interfaces between SGSNs and GGSNs in the core network. They
contain timestamp, per-flow traffic volume, content publisher, RTT computed during TCP
handshake [49], and estimated packet loss ratio for each TCP flow aggregated in 5 minute
bins. All device and user identifiers (e.g., IMSI, IMEI) are anonymized to protect privacy
without affecting the usefulness of our analysis. The data set does not permit the reversal
of the anonymization or re-identification of users. We note that logs collected at RNCs
encompass both voice and data traffic, whereas logs collected from the CN contain only data
traffic information.
Next, we characterize performance issues during the aforementioned two high-profile events
in Section 3.3. To study the underlying causes of the identified performance issues, we then
correlate network performance with various aspects of user traffic patterns in Section 3.4.
Throughout, we present results of the event day in relation to a routine day for baseline
comparison. We normalize the actual measurement values by their mean values on the
routine day (unless stated otherwise); our results thus effectively represent how the event
differs from routine conditions. We omit absolute numbers from some non-normalized plots
due to proprietary reasons.

38

800
Event
Routine

350
300
250
200
150
100
50

700

Event
Routine

600
500
400
300
200
100

0
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

0
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

t (hour)

Too Many Serving Cell Users

400

Radio Link Addition Failure

Radio Link Setup Failure

450

120
100

Event
Routine

80
60
40
20
0
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

t (hour)

t (hour)

150

Radio Link Addition Failure

Radio Link Setup Failure

2500
Event
Routine
2000
1500
1000
500

Event
Routine

100

50

0
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

0
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

Too Many Serving Cell Users

(a) Event A
6000
Event
Routine

5000
4000
3000
2000
1000

0
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

t (hour)

t (hour)

t (hour)

(b) Event B

10

2

10

1

10

0

10

−1

10

−2

10 −1
10

0

10
Distance from Venue (mi)

1

10

During Event

3

Event Failures/Routine Failures

Before Event

3

Event Failures/Routine Failures

Event Failures/Routine Failures

Figure 3.1. (Normalized) Timeseries of common types of RRC failures
10

2

10

1

10

0

10

−1

10

−2

10 −1
10

0

10
Distance from Venue (mi)

1

10

After Event

3

10

2

10

1

10

0

10

−1

10

−2

10 −1
10

0

10
Distance from Venue (mi)

1

10

4

10

2

10

0

10

−2

10

−1

10

0

10
Distance from Venue (mi)

1

10

During Event

6

10

4

10

2

10

0

10

−2

10

−1

10

0

10
Distance from Venue (mi)

1

10

Event Failures/Routine Failures

Before Event

6

10

Event Failures/Routine Failures

Event Failures/Routine Failures

(a) Event A
After Event

6

10

4

10

2

10

0

10

−2

10

−1

10

0

10
Distance from Venue (mi)

1

10

(b) Event B

Figure 3.2. RRC failure ratios plotted as a function of distance to the venue and for time
intervals before, during, and after the event

3.3

Characterizing Performance Issues

Generally speaking, a user’s experience about network performance can be divided into two
phases. The pre-connection phase is characterized by the UE attempting to establish a

39

connection with the cellular network, or in other words establishing a RAB. In this phase,
the user waits for connection establishment, while not being able to exchange traffic at
this time. The post-connection phase starts after a RAB is assigned. In this phase, user
experience is related to more traditional voice call performance metrics such as call drop and
block rate or end-to-end TCP performance metrics, such as delay and packet loss. Below,
we separately discuss both pre- and post-connection network performance experienced by
users during both events.

3.3.1

Pre-connection Network Performance

Users may experience difficulty in establishing RABs in the pre-connection phase due to a
wide variety of reasons. Every time a request to allocate more radio resources is denied by
the RNC, a RRC failure and its underlying reason is logged by our measurement apparatus.
In our analysis, we study the logs of various types of RRC failures that are collected at the
RNC. Each type of RRC failure corresponds to a specific problem in the cellular network
operation. The 3 most common types of failures observed in our data set are the following.
1. Radio link setup failures occur when a user’s request to setup a radio link is not served
due to poor RF channel quality, which is often caused by increased interference.
2. Radio link addition failures occur when a user’s request to add a radio link to an
existing radio connection for soft handovers is denied.
3. Too many serving cell users indicates blocking for new users which results when all
available RABs are occupied by existing users.
Figure 3.1 plots the timeseries of the most common types of RRC failures on the event and
routine days for both events. We observe that RRC failures increase sharply on the event
days, whereas they are negligible (and steady) on the routine days for both events. For both
events, RRC failures start occurring around noon and generally reach their peak either just
before or during the event. Specifically, radio link addition failures peak at more than 700x
40

40
Event
Routine

30
25
20
15
10
5

35

Event
Routine

Event
Routine

30
25

Count

35

# Voice Calls Blocked

# Voice Calls Dropped

40

20
15
10

0
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

5
0
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

t (hour)

1

2

3

4

t (hour)

5 6 7 8 9 10 11 12 13
Error Code Index

(a) Event A
8

10

5

0
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

t (hour)

7

Event
Routine

Event
Routine

6
5

Count

Event
Routine

# Voice Calls Blocked

# Voice Calls Dropped

15

4
3
2
1
0
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

t (hour)

1

2

3

4

5 6 7 8 9 10 11 12 13
Error Code Index

(b) Event B

Figure 3.3. (Normalized) Voice performance measurements
their average on the routine day for event A and too many serving cell users peak at more
than 5000x their average on the routine day for event B.
The nature of RRC failures for both events indicates that their potential root cause is high
network load and congestion due to a large number of competing users at cell sector level.
Therefore, we next analyze RRC failures at cell sector level before, during, and after the
events as a function of distance from the venue. Figure 3.2 shows the scatter plots between
the distance of cell sectors from the venue (in miles) and the ratio of the number of RRC
failures on the event day to that on the routine day. The horizontal dashed line at y = 1 is a
reference for the data points where RRC failures on the event and routine days are equal. So
the data points above the reference line represent cell sectors that have more RRC failures
on the event day than the routine day. Likewise, the data points below the reference line
represent cell sectors that have less RRC failures on the event day than the routine day.
Both x- and y-axes are converted to logarithmic scale for the sake of clarity. Note that
there are many cell sectors equidistant from the venue, especially those cell sectors that are
close to the venue. These cell sectors are mounted at the same cell tower but face different
directions, and have different tilt angles and frequencies.
41

Overall, we observe that cell sectors closer to the venue have 2-3 orders of magnitude more
RRC failures on the event day than the routine day. The RRC failure ratios progressively
decrease as the distance of cell sectors to the venue increases. For both events, we observe
interesting dynamics across the scatter plots for time intervals before, during, and after the
event. For event A, we observe that the failure ratios generally increase by 2-3 orders of
magnitude for cell sectors less than half a mile from the venue throughout the event day.
In contrast, for the cell sectors that are far from the venue, their failure ratios drop during
the event and jump by 1-2 orders of magnitude after the event finishes. The aforementioned
observations can be linked to the sporting nature of event A, where people swarm the venue
before and during the event, creating a void in surrounding areas. The post-event jump
in the failure ratio is likely correlated with most people leaving the venue and using their
devices to share their experience with others via voice calls or social network posts (we show
later in this section that the observed user activity supports this hypothesis). We observe
similar trends for event B as well; however, the post-event jump in the failure ratio is clearly
visible only for cells within 1 mile of the venue. For these reasons, while characterizing user
network traffic in the next section, we focus our attention on the cell sectors that are within
1 mile radius of the venues for both events.
Pre-connection failures (especially those pertaining radio link addition and indicating too
many serving cell users) peak by a factor of 700 (for event A) and 5000 (for event B) relative
to their average on the routine days. These failures increase by 2-3 orders of magnitude in
cell sectors very close to the venues before and during the events, but only increase in cell
sectors further away after the venues.

3.3.2

Post-connection Network Performance

As discussed in Section 3.2, during the RAB setup phase, the RNC verifies that the needed
radio resource for the request actually exists before it assigns a RAB. In other words, if a
device has successfully acquired a RAB for communication, its performance should theoreti-

42

2.2

4
Event
Routine

3.5

1.8

Event
Routine

1.3

3

1.4
1.2

2.5
2
1.5

1

Event
Routine

1.2

R−RTT

1.6

I−RTT

Packet loss ratio

2

1.1
1
0.9

1

0.8

0.8

0.6
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

0.5
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

t (hour)

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

t (hour)

t (hour)

(a) Event A
7

1.4

1.8

Event
Routine

1.6

Event
Routine

1.3

Event
Routine

1.2

5
4
3

1.4

R−RTT

I−RTT

Packet loss ratio

6

1.2

0.8

0.8

1
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

t (hour)

1
0.9

1

2

1.1

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

t (hour)

0.7
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

t (hour)

(b) Event B

Figure 3.4. (Normalized) Data performance measurements
cally remain acceptable per operator’s configuration even if the overall network demand level
exceeds network capacity. This is because excessive demand requests will get blocked off by
the RNC from acquiring any RAB. However, network conditions can quickly change even
for UEs that have already acquired a RAB because of factors such as interference, mobility,
etc. Such dynamic network conditions can force UEs to request a change in current RAB
status, initiating a series of RRC failures which could in turn result in degraded voice and
data performance. Below, we separately discuss voice and data performance.
To quantify voice performance, we study voice call drop and block rates for both events
in Figure 3.3. Similar to our observations about pre-connection network performance, the
number of voice call drops and blocks increase substantially on the event days as compared
to the routine days for both events. Specifically, we observe peak increases of more than
30x and 7x relative to their average on the routine days for events A and B, respectively. It
is noteworthy that voice call drop and block rates peak just before the start of the events,
during the intermissions, and at the end of the events. This observation is consistent with
our expectation that users are less likely to make voice calls during event activities and more
likely to make voice calls either before the start of events, after the end of events, or during
43

Table 3.1. Description of voice call error codes
Index

Category

Description

1

Unspecified

2

Radio Connection
Supervision
Radio Connection
Supervision
Radio Connection
Supervision
Radio Connection
Supervision
Operations &
Management
Soft Handover
Soft Handover
Soft Handover
Soft Handover
Soft Handover

All cases which do not map
to the ones described below
Radio Link Control (RLC)
unrecoverable
Maximum number of
RLC retransmissions
Expiry of timer

3
4
5
6
7
8
9
10
11
12
13

Inter-Frequency
Handover
Channel Switching

Radio link failure indication
Cell lock indication
No active set addition update
No active set deletion update
No active set replacement update
Cell not in the neighbor set
High speed-downlink shared
channel cell change failure
Inter-frequency handover failure
Transition to DCH
state not completed

intermissions between event activities. To further investigate the root causes of voice call
blocks and drops, we also plot the histogram of their error codes in Figure 3.3. The error
code descriptions in Table 3.1 indicate that the two most common categories of error codes
for both events are related to radio connection supervision and soft handovers, which in turn
point to interference and mobility as the root cause.
To quantify data performance, we study two key end-to-end TCP performance metrics:
packet loss ratio and RTT for both events.1 Packet loss ratio quantifies network reliability.
We only have packet loss ratio measurements for TCP flows, which constitute approximately
95% of all flows in our data set. RTT quantifies network delay and is defined as the duration
of time taken by a packet to reach the server from the UE plus the duration of time taken by
a packet to reach the UE from the server. It is important to note that RTT measurements
are biased by differences in the paths between different UEs and the external servers they
communicate with. Similar to packet loss ratio measurements, we only have RTT measure1

Because end-to-end TCP performance also involves additional parameters such as back-haul bandwidth
and even remote server load, we leave a more detailed investigation of TCP performance to future work.

44

ments for TCP flows. RTT measurements for TCP flows are estimated by SYN, SYN-ACK,
and ACK packets in the TCP handshake. In a cellular network, RTT essentially consists of
two components: radio network RTT and Internet RTT. Radio network RTT (R-RTT) is
the time duration between the SYN-ACK packet from server passing the Gn interface and
the ACK packet from the UE passing the Gn interface. Internet RTT (I-RTT) is the time
duration between the SYN packet from the UE passing the Gn interface and the SYN-ACK
packet from the server passing the Gn interface. Thus, RTT = R-RTT + I-RTT.
Figure 3.4 shows the timeseries plots of packet loss ratio, Internet RTT, and radio network
RTT for both events. Packet loss ratio peaks at 2x and 7x relative to its average on the
routine days for events A and B, respectively. We observe different trends for radio network
RTT and Internet RTT for both events. There is only a minor increase in radio network RTT
on the event days. Internet RTT increases during the intermissions for both events; however,
this increase indicates congestion at remote servers caused by increased event-driven traffic.
Overall, data performance results indicate that users experience data connection performance
issues to varying extents during the two events.
Post-connection performance degradation is observed for both voice and data network
during the events. Specifically, voice call failures (dropped calls and call blocks) increase
by a factor of as much as 30 (for event A) and 7 (for event B). Moreover, packet loss ratio
increases by a factor of 2 (for event A) and 7 (for event B); while the RTT increases by
a factor of 3.5 (for event A) and 1.5 (for event B). While these indicate a degradation in
performance experienced by users already connected to the network, this is substantially
smaller than the pre-connection failures discussed in Section 3.3.1. Overall, pre- and postconnection network performance results highlight that limited radio resources are the major
bottleneck during crowded events.

45

9
8

4

Event
Routine

3

7

Event
Routine

3.5

2.5

3

2

2.5

5
4
3

U

Tdown

Tup

6

1.5
1

1

0.5

0
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

2
1.5

2
1

Event (AC)
Event (RAB)
Event (DCH)
Routine (AC)
Routine (RAB)
Routine (DCH)

0.5
0
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

0
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

t (hour)

t (hour)

t (hour)

(a) Event A
5

3

3

Event
Routine

2.5

4

Event
Routine

2.5
2

Tdown

Tup
2
1
0
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

U

2

3

1.5

1.5

1

1

0.5

0.5

0
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

t (hour)

Event (AC)
Event (RAB)
Event (DCH)
Routine (AC)
Routine (RAB)
Routine (DCH)

0
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

t (hour)

t (hour)

(b) Event B

Figure 3.5. (Normalized) Network load measurements

3.4

Understanding Performance Issues

Next, we characterize user network traffic to identify patterns that correlate with the observed pre- and post-connection performance degradation during the events. Using the insights obtained from this characterization, we aim to identify network optimization opportunities that can potentially improve end-user experience in crowded locations. We characterize
network traffic in terms of both aggregate network load and user-level session characteristics.

3.4.1

Aggregate Network Load

We quantify aggregate network load in terms of the following two metrics: throughput and
user counters. Throughput or bit-rate is sampled for all UEs at the RNC every couple of
seconds. Based on the direction of traffic, we can split the throughput into uplink throughput (Tup ) and downlink throughput (Tdown ). Figure 3.5 plots the timeseries of uplink and
downlink throughput on the event and routine days for both events. For the routine days,
both uplink and downlink throughput peak around the noon time and decline steadily afterwards, reaching the bottom during late night and early morning. We observe a different

46

trend for uplink throughput on the event days. For instance, the peak uplink throughput on
the event day is more than 8x and 4x the average throughput on the routine day for events A
and B, respectively. We also observe that the uplink throughput peaks and event activities
are approximately aligned. For instance, uplink throughput sharply increases at the start
and end of the second segment for event A. Similar, though less pronounced, patterns are
also observable for event B. In contrast to the uplink throughput, increases in the downlink
throughput timeseries are steadier for both events.
To further analyze traffic volume characteristics, we plot the traffic flow count histograms
for top content publishers in Figure 3.6. We focus on flows rather than bytes to avoid bias
towards high volume applications, such as video streaming. We observe that flow counts
of social networking content publishers more than double on the event day as compared to
the routine day for event A. Likewise, social networking content accounts for most flows on
the event day for event B. Our further investigation (not shown here) revealed that social
networking content is at least 2x more upstream heavy as compared to other content types,
which explains the increase in uplink throughput during both events.
We also analyze user counters for the event and routine days for both events. Users
are classified into the following overlapping categories based on their RRC states: admission
control (AC), radio access bearer (RAB), and dedicated channel (DCH). AC category includes
the users who have completed the admission control procedure. RAB category includes the
users who have been assigned a RAB after admission control. Such users are typically in
either FACH or DCH state. Finally, DCH category only includes the users who are in
DCH state. Let U denote the number of users, also let UAC , URAB , and UDCH denote the
number of users in the aforementioned categories. As a general rule, UAC ≥ URAB ≥ UDCH .
Figure 3.5 plots the timeseries of number of users in AC, RAB, and DCH categories. These
timeseries show a trend similar to the throughput measurements. All user counters have
higher values on the events days as compared to the respective routine days for both events.
Specifically, the number of users with admission control peaks at more than 3x during the

47

cdn 3

cdn 3

cdn 2

cdn 2

cdn 1

cdn 1

search

search

audio

audio

video

video

social 3

social 3

social 2

social 2

social 1
0

social 1
0.5

1
1.5
Flow Count Ratio

2

2.5

Flow Count on the Event Day

(a) Event A
cdn 3

cdn 3

cdn 2

cdn 2

cdn 1

cdn 1

search

search

audio

audio

video

video

social 3

social 3

social 2

social 2

social 1
0

social 1
0.5

1
1.5
Flow Count Ratio

2

2.5

Flow Count on the Event Day

(b) Event B

Figure 3.6. Flow count histograms for top content publishers in our data set
events as compared to its average on the routine days.
Both aggregate uplink and downlink throughput increase during the event days; uplink
throughput increases by a factor of as much as 8 and 4 (for events A and B respectively),
while downlink throughput increases by a factor of 3 (for both events). Moreover, there is
a substantial increase in the traffic volume of social networking content during the events,
which is relatively more upstream heavy. Likewise, number of users with admission control

48

Average Session Length

Session Count

1.5
Event
Routine

3
2.5
2
1.5
1
0.5
0

1.4
1.3
1.2
1.1
1
0.9
0.8
0.7

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

Event
Routine

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

t (hour)

Avg. Session Inter−arrival Time

4
3.5

1.5
Event
Routine

1

0.5
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

t (hour)

t (hour)

4

Session Count

Average Session Length

Event
Routine

3.5
3
2.5
2
1.5
1
0.5
0

1.15

Event
Routine

1.1
1.05
1
0.95
0.9
0.85
0.8

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

t (hour)

Avg. Session Inter−arrival Time

(a) Event A
1.5

Event
Routine

1

0.5
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

t (hour)

t (hour)

(b) Event B

Figure 3.7. (Normalized) Session Count, Average Length, Average Inter-arrival Time
2

2
Event
Routine

1.5

Event
Routine

Event
Routine
1

up

/B )

1.5

down

0.5

log(B

Bup

Bdown

1.5
1

1
0.5

0
−0.5

0

0.5
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

t (hour)

−1
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

t (hour)

t (hour)

(a) Event A
1.5

2

1.5
Event
Routine

Event
Routine

log(Bdown/Bup)

Event
Routine

Bup

B

down

1.5
1

1

0.5
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

t (hour)

0.5
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

t (hour)

1

0.5

0
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

t (hour)

(b) Event B

Figure 3.8. (Normalized) Per-Session Downlink Bytes (Bdown ), Uplink Bytes (Bup ), Ratio
(Bdown /(Bup )
increase by a factor of 3 for both event days. Overall, our aggregate network load characterization shows that increased user activity during the events, specifically in terms of
uplink throughput and user counters, is correlated with increased pre-connection failures.
To reduce the impact of increased network load during crowded events, we will investigate

49

Power

α=2
α=4
α=6
α=8
α=10

Promotion delay

TDCH−IDLE

α=2
α=4
α=6
α=8
α=10

α=2
α=4
α=6
α=8
α=10

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

t (hour)

t (hour)

t (hour)

(a) Event A

Power

α=2
α=4
α=6
α=8
α=10

Promotion delay

T

DCH−IDLE

α=2
α=4
α=6
α=8
α=10

α=2
α=4
α=6
α=8
α=10

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

t (hour)

t (hour)

t (hour)

(b) Event B

Figure 3.9. Experimental results for radio network parameter tuning
the effectiveness of opportunistic connection sharing in Section 3.5.2.

3.4.2

User-level Sessions

We now analyze characteristics of user-level traffic sessions for both events. A session consists
of consecutive time intervals with uplink or downlink byte transfer and its end is marked by an
inactivity timeout of τ seconds. The results presented in this section are computed for τ = 5
seconds. Changing the value of τ does not qualitatively affect the analysis results. Figure 3.7
shows the timeseries of session count, average session length, and average session inter-arrival
time for both events. Session count follows a similar trend to the earlier aggregate network
load metrics – at peak, there is more than 3.5x increase relative to the average on the routine
days for both events. Furthermore, we observe an increase in average session length on the
event days as compared to the routine days, e.g., there is more than 1.4x increase for event
A. On the contrary, average session inter-arrival time decreases sharply on the event days as
compared to the routine days – this indicates that users are initiating sessions much more
frequently during the events. To further investigate the nature of changing session patterns,

50

we plot the timeseries of average downlink bytes per session (Bdown ), average uplink bytes
per session (Bup ), and the average ratio of downlink bytes to uplink bytes per session in
Figure 3.8. We observe that average downlink bytes per session sharply decreases up to 0.5x
during the event days; whereas, average uplink bytes per session exhibits a mixed trend. The
ratio (Bdown /Bup ) also sharply decreases during the events, which is due to the increased
traffic volume of upstream-heavy social networking content.
User sessions are on average longer during both events (by a factor of as much as 1.4) – as
well as more numerous and more frequently initiated. However, users exchange only as much
as half the bytes per session on average. This change in workload is due to a change in the
application usage during these events, such as greater proportion of social networking flows
observed earlier. These trends point to potential waste of radio resources by UEs, which can
be mitigated by tuning radio network parameters. Towards this end, we will investigate the
effectiveness of varying RRC timeouts in Section 3.5.1.

3.5

Evaluating Mitigation Schemes

In this section, we evaluate two proposals to mitigate cellular network performance degradation during crowded events.

3.5.1

Radio Network Parameter Tuning

We first investigate whether tuning radio network parameters can result in more efficient
radio resource usage during crowded events. As mentioned in Section 3.2, UEs acquire and
release radio resources by transitioning to different RRC states. A UE is promoted to a
higher energy state depending on buffer occupancy and it is demoted to a lower energy state
depending on timeouts. Here, we study how RRC timeouts can be tuned for more efficient
radio resource utilization, without explicit feedback from individual UEs. Recall from Figures
3.7 and 3.8 that average bytes per session decreases during the events, despite the increase

51

in average session length. This observation highlights potential waste of radio resources and
UE energy consumption in crowded locations. Therefore, a natural suggestion would be
to reduce RRC timeouts to mitigate the radio resource wastage. However, reducing RRC
timeouts can result in more frequent state promotions, which can introduce state promotion
delays resulting in degraded user experience [59, 80]. Hence, there is a tradeoff between
performance and resource efficiency.
We conduct trace-driven simulations to quantitatively study the tradeoffs involved in
changing RRC timeouts. We simulate the RRC state machine of every user using the RNC
logs while focusing on the DCH state, which has the highest allocated radio resources and energy consumption among all RRC states. Specifically, we study DCH→FACH RRC timeout
parameter, which is denoted by α hereafter. As mentioned earlier, changing RRC timeouts
introduces tradeoffs among radio resource wastage, user experience, and UE energy consumption. We use the following three performance metrics to quantify these factors. (1)
The DCH state idle occupation time, denoted by TDCH-IDLE , quantifies the radio resources
wasted by UEs in DCH state. (2) The promotion delay quantifies the additional delay caused
when UEs transition to DCH state from FACH state. (3) The power consumption quantifies
the total energy consumed by UEs during DCH state occupation and in FACH to DCH transitions. We use the following simulation parameters in our experiments (inferred by Qian
et al. in [80]): (1) FACH→DCH promotion radio power = 700mW, (2) DCH state power
= 800mW, (3) FACH→DCH promotion delay = 2 sec, and (4) RLC buffer threshold = 500
bytes.
Similar to the evaluation of opportunistic connection sharing, we evaluate radio network
parameter tuning for a subset of cell sectors that are within 1 mile radius of the venues. We
conduct trace-driven simulations of individual users’ RRC state machines on this subset for
both event and routine days. Figure 3.9 shows the timeseries plots of the aforementioned
three performance metrics for varying α values. We observe that the DCH state idle occupation time and UE energy consumption increase for larger α values. On the other hand,

52

promotion delay decreases for larger α values. These observations indicate that decreasing
the RRC timeout values reduces the waste of scarce DCH channels and UE energy consumption. However, this benefit is achieved at the cost of increased promotion delay that may
degrade user experience, especially for applications that are not delay-tolerant.
To systematically study the tradeoffs between these performance metrics on the event days
and compare them to routine days, we plot them as a function of α. Figure 3.10 plots the
max-normalized average of the performance metrics as a function of α for the event and
routine days. In theory, we want to select a value of α which simultaneously minimizes
the values of all performance metrics. In this case, the crossover points (highlighted by
black circles in Figure 3.10) and their corresponding α values represent suitable performance
tradeoff. We find that these crossover points shift to smaller α values – by 1-2 seconds –
on the event days as compared to the routine days. In practice, however, α is typically
set to achieve a target delay or resource overhead. In this case, as observable from Figure
3.10, we can tune α to smaller values to achieve the same targets and achieve strictly better
performance during crowded events.

3.5.2

Opportunistic Connection Sharing

We now evaluate a simple opportunistic connection sharing scheme to reduce the network
load at individual cell sectors for eradicating RRC failures observed in Section 3.3.1. The
basic idea is that users can share their connection to NodeBs with other users to reduce the
overall network load in terms of occupied radio channels. In this scheme, a selected set of
UEs act as Wi-Fi hotspots for other UEs in their vicinity. Therefore, other UEs, instead of
wastefully establishing separate connections, can connect to NodeBs via the UEs acting as
Wi-Fi hotspots. Using this approach, we aim to reduce the number of UEs that are directly
connected to NodeBs to free up channels, although the overall throughput carried by the
network remains the same.
To evaluate the potential benefit of the opportunistic connection sharing scheme, we con-

53

Event Day
TDCH−IDLE / Promotion delay / Power

0.9
0.8
0.7
0.6
0.5
TDCH−IDLE

0.4

Promotion delay
Power

T

DCH−IDLE

/ Promotion delay / Power

Routine Day
1

2

4

6
α

8

1
0.9
0.8
0.7
0.6
0.5
TDCH−IDLE

0.4

10

2

Promotion delay
Power
4

6
α

8

10

(a) Event A
Event Day
TDCH−IDLE / Promotion delay / Power

0.9
0.8
0.7
0.6
0.5
TDCH−IDLE

0.4

Promotion delay
Power

T

DCH−IDLE

/ Promotion delay / Power

Routine Day
1

2

4

6
α

8

1
0.9
0.8
0.7
0.6
0.5
TDCH−IDLE

0.4

10

2

Promotion delay
Power
4

6
α

8

10

(b) Event B

Figure 3.10. Tradeoff between performance metrics for varying RRC timeout (α) values.
Y-axis is max-normalized for each metric. α values corresponding to black circles achieve
better performance tradeoff.
duct cell sector level trace-driven simulations. With respect to the mobility of the users, we
assume that the users are static within 1 minute time bins. This is a reasonable assumption for crowded events in stadiums, auditoriums, and conference rooms. We do not have

54

fine-grained location information of users in our data set; therefore, we have to simulate the
locations of users. In this work, we aim to generate the locations of users in a grid-like scenario – similar to how people are typically seated in stadiums and conferences. Towards this
end, we use Complete Spatial Randomness (CSR) point generation model with hard-core
inhibition [20]. CSR with hard-core inhibition does not allow neighbors within a pre-defined
radius around the randomly generated points, resulting in a grid-like setting. The points in
the realizations denote the locations of users in our simulations. In our simulations, Wi-Fi
hotspots are selected randomly because we do not have access to other relevant information,
such as battery life and signal strength, that may be used to optimize this selection. Once
a UE connects to a Wi-Fi hotspot, it is disconnected after 1 minute of inactivity. The locations of inactive users are updated using the above CSR model. In our simulations, the cell
sectors are set to have 2, 250, 000 ft2 coverage area, the inhibition radius is set to 2 ft, the
Wi-Fi range of users is simulated as N ∼ (200 ft, 20 ft), and the upper limit on the number
of simultaneous connections for each Wi-Fi hotspot is set to 5. The cell sector coverage
area is in typical range for crowded urban locations, the inhibition radius is set to be reasonably large, and the Wi-Fi range and the maximum number of simultaneous connections
are conservatively set. We assess the benefit of the connection sharing scheme in terms of
the following metrics: the number of users in DCH state (UDCH ) and the number of RRC
failures.
Since we are primarily interested in deploying this scheme in congested locations, we
focus our evaluations on a subset of cell sectors in our data set that are within 1 mile
radius of the venues. We evaluate the opportunistic connection sharing scheme using tracedriven simulations on this subset on the event days. The results plotted in Figure 3.11 are
the average of 1000 independent simulation runs. We plot the timeseries of the number
of occupied DCH channels (UDCH ) for varying number of Wi-Fi hotspots per cell sector
(denoted by N ). As expected, we observe that UDCH values become smaller for larger values
of N , freeing up DCH channels that are now available for UEs unable to transition to the

55

N=0
N=2
N=4
N=6
N=8
N=10

# RRC Failures

UDCH

N=0
N=2
N=4
N=6
N=8
N=10

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

t (hour)

t (hour)

N=0
N=2
N=4
N=6
N=8
N=10

N=0
N=2
N=4
N=6
N=8
N=10

# RRC Failures

UDCH

(a) Event A

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

t (hour)

t (hour)

(b) Event B

Figure 3.11. Experimental results for opportunistic connection sharing
DCH state due to RRC failures. We also plot the number of RRC failures for varying values
of N in Figure 3.11. Again, as expected, we observe that RRC failures decrease for increasing
values of N . Consequently, based on instantaneous load conditions, the cellular network can
dynamically vary the required number of users acting as Wi-Fi hotspots to minimize RRC
failures. We note that this connection sharing scheme successfully eradicates more than 95%
RRC failures for both events when N = 10. This substantial reduction in the number of
RRC failures in congested cell sectors will likely result in improved performance for users.
Below, we discuss some practical issues of opportunistic connection sharing.
• Wi-Fi Hotspot Selection: The selection of Wi-Fi hotspots can be mediated by the cellular
network based on a variety of factors, such as battery life and signal strength. UEs acting
as Wi-Fi hotspots may experience high energy drain and may run out of battery power.
56

To address this issue, the role of Wi-Fi hotspot can be periodically rotated among the user
pool by the cellular network. The cellular network should prefer UEs with better signal
strength because UEs consume significantly more energy and suffer reduced effective bit
rate when the signal strength is poor [87]. On the other hand, the UEs that are unable to
get RAB assignments can discover Wi-Fi hotspots in their range using the standard Wi-Fi
discovery methods. In case of multiple options, UEs should prefer hotspots with better signal
strength.
• Initial Connection Delay: After a device connects to a Wi-Fi hotspot, similar to RRC
protocol, it disconnects after a pre-defined inactivity timer expires. However, the value of
this timer should be set much higher than the corresponding RRC timers so that the device
does not have to incur initial delay, which is up to several seconds, for every data transfer.
In our simulations, the inactivity timer was set to be 1 minute.
• Out of Range: A device has to request RAB assignment when it moves out of a hotspot’s
Wi-Fi range. If it is unable to get a RAB due to congestion then the RNC can dynamically
assign more Wi-Fi hotspots in the cell sector to provide connectivity to more users.
• Radio Technologies: Opportunistic connection sharing is only usable when a majority of
devices in the cellular network have built-in Wi-Fi capability. In our simulations, we assume
that all devices have Wi-Fi capability. In case Wi-Fi is not available, other technologies such
as Bluetooth can also be used. Bluetooth has lower power consumption, smaller radio range,
and supports less data rate as compared to Wi-Fi. Consequently, it can be used as a low
power alternative for small transmissions such as tweets.
• Wi-Fi-Cellular Handovers: Working extensions to the Wi-Fi standard already address the
issue of smooth handovers between Wi-Fi and cellular networks, including 3GPP Access
Network Discovery and Selection Function (ANDSF), Hotspot 2.0 initiative [76], and other
techniques [4].
• Voice Traffic Offloading: In this opportunistic connection sharing scheme, voice traffic can
be tunneled via the Wi-Fi connection using the well-known Voice over Wi-Fi solutions, such

57

as Wi-Fi certified Voice-Enterprise [12].
• Incentives: Cellular network operators may provide billing based incentives to users for
participating in this opportunistic connection sharing scheme.

3.5.3

Limitations

Below, we briefly mention two limitations of our trace-driven simulation evaluations. First,
our simulation based evaluations cannot account for changes in traffic workload resulting
from different network conditions due to our proposed mitigation schemes. Second, they
also cannot account for low-level dependencies between performance metrics and network
load. For example, some types of RRC failures are impacted by interference, which in
turn is a function of network load. Addressing these limitations requires experiments on
operational cellular networks, which are beyond the scope of this work. However, despite
these limitations, we believe that the sheer magnitude of the improvements observed in our
simulations indicates that the mitigation schemes discussed in this work would accrue some
benefit in practice.

3.6

Related Work

We divide related work into the following categories.
Cellular Performance Characterization: The areas of cellular performance characterization
have recently received much attention by the research community. For example, small-scale
studies have characterized application performance [44, 114] and fairness [14]. Large-scale
studies have characterized throughput and airtime [78], smartphone traffic [94], M2M device
traffic [91], smartphone app traffic [92,116], and heavy users [25]. In contrast to these studies,
we believe that we are the first to analyze cellular performance changes specifically during
crowded events.
Radio Network Parameter Tuning: Prior work on radio network parameter tuning study

58

the impact of RRC timers on network performance and smartphone energy consumption.
Most prior work is based on user-end measurements performed using a few cellular devices.
For instance, Liu et al. characterized performance in a 1xEV-DO network using measurements obtained from two laptops equipped with Sierra Wireless data cards [65]. Balasubramanian et al. proposed a UE based approach, called TailEnder, to alter traffic patterns based
on the prior knowledge of RRC state machine [17]. Some studies are based on theoretical
analysis and simulation. For instance, Liers et al. proposed a scheme to adaptively tune
RRC timeout parameters based on the demand and load situation, and validated it using
simulations [64]. Yeh et al. proposed a scheme to tune RRC timeout parameters using analytical models based on available radio resources, energy consumption, quality of service, and
processing overheads of the radio access network [118]. Qian et al. conducted trace-driven
RRC state machine simulations using network-end measurements to investigate the optimality of RRC timeout parameters [80]. Furthermore, they proposed an application-aware tail
optimization protocol to simultaneously optimize radio and energy resources [79]. Similar
to the prior work by Qian et al. [79, 80], our analysis of radio network parameter tuning is
based on trace-driven RRC state machine simulations. However, we focus on network-end
tuning of RRC timeouts without any cooperation from UEs.
Opportunistic Connection Sharing: We build on existing work on opportunistic traffic
offloading [42, 67]. Luo et al. proposed a unified architecture, where mobile clients use both
3G cellular link and Wi-Fi based peer-to-peer links for routing packets via peer-to-peer links
to the appropriate destinations [67]. Han et al. proposed content-specific opportunistic
communication scheme to offload cellular traffic via Wi-Fi or Bluetooth [42]. However,
neither of these proposals were evaluated using real-world traces, and both approaches require
architectural changes to network protocols and hardware. Our work complements these
proposals by showing that their simplest and most practical instantiation — a simple onehop connection sharing scheme that does not require architectural changes — can be very
effective in real-life crowded events. To the best of our knowledge, this work is the first to

59

evaluate practical connection sharing techniques on real-world traces.

3.7

Conclusion

This work presents the first performance characterization of an operational cellular network
during crowded events. We make three key contributions in this study based on the realworld voice and data traces that we collected from a tier-1 cellular network in the United
States during two high-profile crowded events in 2012. First, we measured how cellular
network performance degrades during crowded events as compared to routine days. Second,
we analyzed what causes the observed performance degradation. Third, we evaluated how
practical mitigation schemes for the observed performance degradation would perform in
real-life crowded events using trace-driven simulations. Our findings from this study are
crucial for cellular design, management, and optimization during crowded events.

60

4

Breaching Privacy in Encrypted

Instant Messaging Networks
4.1

Introduction

The proliferation of online social networks has attracted the interest of computer scientists
to mine the available social network data for developing behavior profiles of people. These
profiles are often used for targeted marketing [62, 117, 119], web personalization [77], and
even price discrimination on e-commerce portals [70, 75]. Recently, there has been increased
interest in more fine-grained profiling by leveraging information about people’s friendship
networks. It has been shown that information from people’s friendship networks can be used
to infer their preferences and religious beliefs, and political affiliations [18, 43, 72, 120].
There has been a lot of research on de-anonymizing people’s friendship networks in online
social networks such as Facebook, MySpace, Twitter [29, 63]. Surprisingly, little prior work
has focused on de-anonymizing people’s friendship link in instant messaging (IM) networks.
IM services – such as Yahoo! Messenger, Skype, IRC, and ICQ – are popular tools to
privately communicate with friends and family over the Internet. IM networks are different
than other online social networks in various respects. For example, in contrast to online
social networks, communication among users in IM networks is synchronous in nature and
messages between two communicating users are routed through relay servers of the IM service
provider.

61

The goal of this work is to identify the set of most likely IM users that a given user is
communicating with during a fixed time period. Note that packet payloads in IM traffic are
encrypted; therefore, payload information cannot be used for the identification. Therefore,
to infer who a user is talking to, we will rely only on the information in packet header
traces. Packet header traces contain information such as timestamp, source IP address,
destination IP address, source port, destination port, and protocol type, and payload size
of each packet. It is noteworthy that each packet in the IM traffic has as its source and
destination IP addresses of a user computer and an IM relay server (or vice versa). At
no point do two users exchange packets directly with each other, i.e., there are no packets
in which the two communicating users’ IP addresses appear in the same packet. For this
attack, we assume that IM service acts neutral, i.e., it neither facilitates the attacker nor
actively participates in providing anonymity to the users using non-standard functionality.
Our specific goal is to accurately identify a candidate set of top-k users with whom a given
user possibly talked to using only the information available in packet header traces.
A natural approach to tackle this problem would be to match header information of packets
entering and leaving IM relay servers. However, simply matching header information of
packets entering and leaving IM servers is not feasible due the following reasons. First, a
user may be talking to multiple users simultaneously. Second, IM relay servers typically serve
thousands of users at a time. Third, the handling of duplicate packets that are the result
of packet losses followed by re-transmissions. Forth, the handling of out-of-order packets.
Finally, the handling of variable transmission delays, which are introduced by the IM relay
servers.
In this work, we propose a wavelet-based scheme, called COmmunication Link Deanonymization (COLD), to accurately infer who’s talking to whom using only the information
available in packet header traces. Wavelet transform is a standard method for simultaneous
time-frequency analysis and helps to correlate the temporal information in one-way (i.e.
user-to-server or server-to-user) traffic logs across multiple time scales [68]. Wavelet anal-

62

ysis allows decomposition of traffic time series between a user and an IM relay server into
several levels. All levels are associated with a coefficient value and contain different levels
of frequency information starting from low to high. The original traffic time series can be
reconstructed by combining all levels after weighing them with their respective coefficients.
COLD leverages the multi-scale examination of traffic time series provided by wavelet analysis to overcome the aforementioned technical challenges. Given two candidate time series
between an IM relay server and two users, COLD computes correlation between the vectors
of wavelet coefficients for both time series to determine whether these users talked to each
other.
We evaluate the effectiveness of COLD on a Yahoo! Messenger data set comprising of
traffic collected over 10, 20, 30, 40, 50 and 60 minute periods. We also compare COLD’s
performance to a baseline time series correlation (TSC) scheme, which represents the state
of the art. The effectiveness is quantified in terms of hit rate for a fix-sized candidate set.
The results of our experiments show that COLD achieves a hit rate of more than 90% for a
candidate set size of 10. For slightly larger candidate set size of 20, COLD achieves almost
100% hit rate. In contrast, a baseline method using time series correlation could only achieve
less than 5% hit rate for similar candidate set sizes.
We summarize the major contributions of this work as follows.
1. We define an attack for breaching communication privacy in encrypted IM networks
using only the information available in packet header traces.
2. We propose COLD to to infer who’s talking to whom using wavelet based multi-scale
analysis.
3. We conducted experiments using a real-world Yahoo! Messenger data set to evaluate
the effectiveness of our proposed approach.
The rest of this chapter is organized as follows. Section 4.2 summarizes the related work.
A detailed description of attack scenarios is provided in Section 4.3. Section 4.4 provides
63

details of the proposed attack. In Section 4.5, we present the evaluation results on a realworld Yahoo! Messenger data set. Possible evasion techniques and their countermeasures
are discussed in Section 4.6. Finally, Section 4.7 concludes the chapter.

4.2

Related Work

In this section, we provide details of the research work related to our study. To the best of our
knowledge, no prior work has reported a successful attack to breach users’ communication
privacy in encrypted IM networks using only the information available in packet header
traces. However, there is some relevant work in the area of mix network de-anonymization.
We discuss it and other related studies below.

4.2.1

Mix Network De-anonymization

In the area of mix network, several studies have used correlation techniques for deanonymization. However, most of these studies are limited to computing temporal correlation between traffic of two user-network links to find user-user links. Furthermore, deanonymization of mix networks is fundamentally different from our problem in the following
two aspects. First, mix network de-anonymization techniques require traffic logs from multiple points inside a mix network. In contrast, this study treats IM relay servers as a black
box. Second, the size of user populations in mix network de-anonymization studies is only
of the order of tens or hundreds. However, in real-life IM networks, thousands of users can
simultaneously communicate with other users; therefore, presenting a more challenging problem. In [108], Troncoso and Danezis build a Markov Chain Monte Carlo inference engine
to calculate probabilities of who is talking to whom in a mix network using network traces.
However, they log network traces from multiple points in a mix network and the maximum
network size studied in their paper is limited to 10. In [122], Zhu et al. compute mutual
information between aggregate inflow and outflow traffic statistics to decide if two users are

64

talking to each other in a mix network. Similar to this study, they also log traffic from the
edges of a mix network. However, their proposed approach requires traffic logs for longer
time durations. In this work, we compare the results of COLD and the method proposed by
Zhu et al. [122].

4.2.2

Social Network De-anonymization

There is also some related work in the field of social network de-anonymization. Narayanan
and Shamitkov developed an algorithm to utilize sparsity in high-dimensional data sets for
de-anonymization [73]. Later they developed a user re-identification algorithm that operated
on anonymized social network data sets [74]. Other related studies use group membership
information to identify users in a social network [115, 120]. IM networks also fall under the
broader category of online social networks; however, our problem and the nature of the data
available to us is different from those tackled in the aforementioned papers. These studies
focus on user identification using mainly topological information; whereas, we focus on link
identification using dynamic user communication traffic.

4.3

Problem Description and Attack Scenarios

In this section, we first provide a summary of architectural details of IM services. We then
provide the details of information available from traffic traces logged near IM relay servers.
Finally, we describe two scenarios in which traffic can be logged for link de-anonymization.

4.3.1

IM Service Architecture

We first describe the architecture of a typical IM service. Consider the scenario depicted in
Figure 4.1 where two users v1 and v2 are communicating with each other via an IM service.
When v1 sends a message to v2 , the source IP address in packets containing this message
correspond to v1 and the destination IP address correspond to the IM relay server. These
65

User 2

Time

Packet
Size

IM
Relay
Server

User 1

Packet
Size

User 1

User 2

Time

Time

Figure 4.1. Transforming logged traffic traces to user traffic signals
packets are received by the IM relay server after a random delay. After receiving a packet
from v1 , the IM server first looks up the IP address of v2 . It then creates new packets with its
IP address as source and IP address of v2 as destination. These packets containing message
from v1 are then relayed by the IM relay server to v2 and have the same contents. This
process incurs an additional delay after which the new packet reaches v2 .
The network traffic logged near the IM relay server only contains header information because the packet payload contents are not useful due to encryption. The statistics recorded by
the well-known traffic logging tools like Cisco’s NetFlow include IP addresses, port numbers,
protocol, packet size, and timestamp information [28]. As mentioned before, IP addresses
are used to identify individual users of the IM service. IM traffic is filtered from rest of the
traffic using a combination of protocol and port number information. We are left with only
aggregated packet sizes and timestamp information for each flow. A logged entry for a flow
is an aggregation of packets which may be sent to or received from the IM server. Due to
aggregation, information about the direction of flow is lost for individual packets. Therefore,
we make a realistic assumption that the direction information is not available in the logged
traffic. An example of a similar publicly available data set is the Yahoo! Network Flows
Data [7].

66

Referring to Figure 4.1, each flow in the data set comprises of information about incoming
and outgoing packets between an IM relay server and a user. Furthermore, individual users
can be distinguished based on IP addresses in the IM traffic. In Figure 4.1, traffic exchanged
between v1 and the IM relay server is represented by blue arrows and traffic exchanged
between v2 and the IM relay server is represented by red arrows. The timestamps and
packet sizes are both discrete and in units of milliseconds. The packet sizes are typically
recorded in bytes. The resulting signal for each flow is discrete in both time and amplitude
as shown in Figure 4.1. These sparse time domain traces of network traffic are referred to as
traffic signals from now-onwards. It is interesting to simultaneously analyze traffic signals
for both users v1 and v2 . Note that every entry in v1 ’s traffic signal has a time-shifted (time
delayed or advanced) matching entry of equal magnitude in v2 ’s traffic signal. These matches
between each pair of entries are marked by broken lines joining traffic signals in Figure 4.1.
Matching entries across both traffic signals may not have the same order due to random
end-to-end delays. For example, 3rd message flow entry in v2 ’s trace appears as 4th entry in
v1 ’s trace in Figure 4.1.

4.3.2

Attack Scenarios

We now consider two different scenarios in which traffic information necessary for the proposed attack can be obtained.
The first scenario assumes the capability to monitor incoming and outgoing traffic of an
IM relay server or server farm. Figure 4.2(a) shows four users v1 , v2 , v3 and v4 connected
to an IM relay server. The shaded circular region around the IM relay server marks the
boundary across which network traffic is logged. For the scenario depicted in Figure 4.2(a),
v1 is communicating with v2 and v3 is communicating with v4 . Traffic signals for all users
that are obtained after pre-processing their traffic flow logs. For each flow represented in
a user’s traffic signal, a corresponding flow entry can be observed in the traffic flow log.
The IM relay servers also introduces a random delay between the time a message arrives at

67

(a) Collecting all incoming and outgoing traffic from IM
relay server

1

2

(b) Collecting all incoming and outgoing traffic near
border gateway routers of an organizational network

Figure 4.2. Two attack scenarios
the IM relay server and the time it is relayed to the other user. Therefore, there will be a
mismatch in the timestamps of the occurrences of a message in communicating users’ traffic
signals.
The second scenario assumes that all IM users communicating with each other are located
in the same network. Many organizations, such as universities, connect to external networks
and the Internet through one or more gateway routers. The incoming and outgoing traffic

68

has to pass through a small number of gateway routers. In this scenario, it is possible
to collect flow logs near the gateway routers of an organizational network. Figure 4.2(b)
depicts the above-mentioned scenario. Here, v1 and v2 are in the same network and are
communicating with each other via an IM relay server. All incoming and outgoing traffic
of the network passes through the border gateway router near which it can be logged. The
region near border gateway router is represented by the shaded region in Figure 4.2(b). The
traffic signals obtained from pre-processing the flow logs have the same characteristics as
described for the first scenario.

4.4

COLD: COmmunication Link De-anonymization

In this section, we present the details of our proposed method (COLD) to detect communication links between users in IM networks. We first introduce the overall architecture of
COLD. We then provide details of each of its modules. Finally, we provide an easy-to-follow
toy example of COLD on a small set of three IM users.

4.4.1

Architecture

As mentioned in Section 4.3, the logged traffic traces are separated for all users based on IP
address information. These user-wise separated traffic traces are further pre-processed and
converted to traffic signals. The traffic signals for all users are stored in a database. Note
that traffic signals of users may span different time durations. To overcome this problem, we
use zero-padding so that the lengths of traffic signals are consistent for all users. After this
pre-processing, wavelet transform is separately applied to all users’ traffic signals [68]. We
then construct feature vectors for all users using the computed wavelet coefficients. Now, to
compare two given users, we compute the correlation coefficient between their constructed
feature vectors. Finally, the values of the correlation coefficient are sorted to generate the
candidate set. The details of all modules of COLD are separately discussed below.

69

4.4.2

Details

After pre-processing the traffic traces, we compute the discrete wavelet transform (DWT) of
each user’s traffic signal. The wavelet transform enables us to conduct a simultaneous timefrequency analysis. A traffic signal is decomposed into multiple time series, each containing
information at different scales that range from coarse to fine. A time series at a coarse
scale represents the low frequency or low pass information regarding the original time series.
Likewise, a time series at a fine scale represents the high frequency or high pass information
regarding the original time series. This allows us to compare traffic patterns of users at
multiple time scales.
We have to select an appropriate wavelet function for our given problem. Since we are
processing traffic signals of a large number of users, we want to select an efficient wavelet
type. For our study, we have chosen the Haar wavelet function for wavelet decomposition
[66]. We have chosen the Haar wavelet function because it is simple and is computationally
and memory-wise efficient. Furthermore, the wavelet transform can be applied for varying
decomposition levels to capture varying levels of detail. Choosing the optimal number of
decomposition levels is important because this may lead to suppressing relevant and critical
information that might be contained in one or more levels of the wavelet decomposition.
Below, we discuss the method to select the optimal number of decomposition levels.
Let D ∈ Z+ denote the optimal number of decomposition levels. Different methods have
been proposed in the literature to select the optimal number of decomposition levels. In
this work, we have used Coifman and Wickerhauser’s well-known Shannon entropy-based
method to select the optimal number of decomposition levels [30]. We applied this method
to traffic signals of all users and then selected the optimal decomposition level at the 95th
percentile. Now that we have selected the optimal number of decomposition levels, we can
apply the wavelet transform on user traffic signals.
Once we have obtained the wavelet coefficients after applying the wavelet transform to a
user’s traffic signal, we need to convert them to a standard feature vector so that we can

70

compare users’ signals. Let FX denote the feature vector of a user X. The coefficients that
contain high frequency information are more numerous and such coefficients are assigned
lower weights. Similarly, the coefficients that contain low frequency information are fewer
and are assigned higher weights. The time signal corresponding to level 1 of the wavelet
decomposition represents the coarsest features containing low frequency information, and
level D refers to the highest level describing the most detailed features containing high
frequency information. The level D feature coefficients are assigned weight 1, the level D − 1
coefficients are assigned weight 2, etc., and the level 1 coefficients are assigned weight 2D−1 .
In general, the level d features are assigned a weight of 2D−d−1 . To produce the standard
feature vector in which each coefficient is given the appropriate weight, we replace each
coefficient by a vector of its copies of length equal to its weight, i.e. a wavelet coefficient of
decomposition level d is replaced by a vector containing 2D−d−1 copies. This is equivalent to
using the undecimated wavelet transform of users’ traffic signals. By following this procedure,
the total length of the feature vectors of all traffic signals becomes consistent.
After applying the wavelet transform and post-processing coefficients to a user X’s traffic
signal, we obtain a feature vector denoted FX . To compare the feature vectors FX and
FY for two users X and Y , we have to compute their correlation. The sample correlation
coefficient rX,Y of two discrete signals FX and FY , both of length L, is defined as,

rX,Y =

L
i=1 (FX (i) − FX )(FY (i) − FY ) .

(L − 1)sX sY

(4.1)

Here, FX (i) is the ith element of the feature vector FX , FX is the sample mean of
its elements, and sX is the sample standard deviation of its elements. The values of the
correlation coefficient lie in the closed interval [−1, 1]. The correlation coefficient values close
to zero indicate no correlation; whereas, the values close to 1 and −1 respectively highlight
strong correlation and anti-correlation. For this study, we only consider the magnitude of
the correlation coefficient and discard its sign. After computing the correlation coefficient

71

for all pairs of users, we get the upper triangular correlation matrix R0 . ri,j is written into
the ith row and the jth column of the correlation matrix R0 . Conceptually, this correlation
matrix is similar to the adjacency matrix of a weighted graph. We add to R0 its transpose
to obtain R
After obtaining the correlation matrix R whose elements are in the range of [0, 1] we
need to generate, for each node, a sorted list of nodes in decreasing order of probability of
communicating. This is done by sorting the node indices in descending order of correlation
coefficients in every column of R. The resulting matrix will have the same size as R and
is labeled I ↓. Suppose that the S most likely users that are communicating with user i is
required. Then the user IDs contained in the top S rows of the i-th column of I ↓ is the
sorted list of users i is most likely communicating with.

4.5

Experimental Results

In this section, we first describe the data set used for evaluating COLD, then define evaluation
metrics, and finally present evaluation results.

4.5.1

Data Set

We collected a data set from Yahoo! Messenger IM network to validate our proposed approach. To keep the volume of logged data manageable, the users of Yahoo! Messenger
were filtered by geographic location and restricted to the New York City area. This data set
consists of traffic logs of Yahoo! Messenger user activity over a period of 60 minutes from
the greater New York area, between 8 a.m. to 9 a.m. Using this data set, we create six data
sets that are the subsets of the entire data. These consist of data over the only the first 10,
20, 30, 40, 50 and 60 minutes, i.e. from 8 − 8 : 10 a.m., 8 − 8 : 20 a.m., 8 − 8 : 30 a.m.,
8 − 8 : 40 a.m., 8 − 8 : 50 a.m. and 8 − 9 a.m. To gauge the effect of the duration over which
a data set is collected we evaluated our proposed COLD scheme on all six data sets. Table

72

Table 4.1. Data set statistics
Time
8-8:10a
8-8:20a
8-8:30a
8-8:40a
8-8:50a
8-9a

Duration
10 mins
20 mins
30 mins
40 mins
50 mins
60 mins

Users
3,420
5,405
7,438
9,513
11,684
13,953

Messages
15,370
33,192
53,649
75,810
99,721
126,694

Sessions
1,968
3,265
4,661
6,179
7,669
9,264

Traffic (Packet)

Traffic (Bytes)

4

x 10
2.5
2
1.5
1
0.5
0

0

500

1000

1500

2000

2500

3000

3500

2500

3000

3500

Time (Sec)
100

50

0

0

500

1000

1500

2000

Time (Sec)

Figure 4.3. Time series plot of traffic volume, in bytes and number of packets, over the entire
60 minute time period from 8 − 9 a.m.
4.1 lists, along with the time of day and duration, the number of logged users, number of
messages exchanged between them, and the number of instant messaging sessions included
in each data set.
The collected data is divided into two parts: input data and ground truth data, to systematically evaluate our proposed approach. Both data sets were collected with the assistance
of Yahoo! and are described in the following text.
The input data consists of user-to-server traffic traces that were collected similar to the
scenario described in Figure 4.2(a). Figure 4.3 plots the volume of traffic logged in these
traffic traces. The figure on top plots number of bytes per second against time. Similarly,
the plot in the bottom figure plots the traffic volume in packets per second for the same
period of time.
73

5

10

8−8:10a
8−8:20a
8−8:30a
8−8:40a
8−8:50a
8−9a

4

Node count

10

3

10

2

10

1

10

0

10 0
10

1

10

Node degree

Figure 4.4. Node degree distribution in our Yahoo! Messenger data set.
The verification data contains a record of the actual user-to-user connections resulting
from conversations between users. Therefore, the verification data contains the ground truth
for given problem. Our proposed COLD scheme attempts to recreate the link structures
between users contained in the verification data by only using information in the input data.
Figure 4.4 is a plot of the degree distribution of users observed in the verification data
collected over 10 and 60 minute time periods. The distribution is approximately linear on
log-log scale over the range of degrees from 1 to 9 for the 10 minute data, and from 1 to 11
for the 60 minute data.

4.5.2

Evaluation Metrics

Let V denote the set of Yahoo! Messenger users v1 , v2 , . . . , vN . Furthermore, let E denote
the set of actual communication links u1 , u2 , . . . , uM of size M between N users captured
in the verification data. Then G(V, E) is the graph of users (or vertices) connected by
the communication links (or edges) between them. Recall that the goal of the attack is
to detect communication links U that estimates the actual set of communication links in
the verification data U . The graph G(V, U ) is the outcome of the scheme that constitutes
the attack. In the rest of this section, we compare our proposed COLD scheme with the

74

baseline time series correlation (denoted by TSC here onwards). A graph that is obtained
using COLD will be denoted by GC (V, UC ). A graph obtained using TSC is denoted by
GT (V, UT ).
Consider the subset of vertices with degree δ in a graph G V, U

obtained using either

schemes. Now consider a candidate set Ci of size S ≥ δ for every vertex vi of degree δ.
The candidate set Ci of a vertex vi contains S vertices most likely to be vi ’s neighbors,
as determined by the COLD or TSC. We also define a neighborhood function denoted by
ΓG (). ΓG (vi ) returns the set of vertices in the graph G that are connected to vertex vi .
Furthermore, we define the node hit rate of a vertex vi as the fraction of vertices in ΓG (vi )
that are also elements of candidate set Ci of size S. The node hit rate of vertex vi is denoted
hi (S) an is defined formally as follows.

hi (S) =

|ΓG (vi ) ∩ Ci (S)|
|ΓG (vi )|

(4.2)

The node hit rate can take values in the range of the closed interval [0, 1]. We also define
the hit rate HG (S, δ) for degree δ vertices of a graph G V, U

as the average of their node

hit rates hi (S) when candidate set sizes are S.

HG (S, δ) =

N
i=1,δi =δ hi (S)

nd

(4.3)

Here nd is the number of vertices in G of degree δ. Just like the node hit rate, the hit rate
can take values in the range of the closed interval [0, 1].

4.5.3

Results

We compute the hit rates achieved using COLD on the 10, 20, 30, 40, 50 and 60 minute data
sets and compare them with the hit rates achieved by TSC. We further separate vertices by
the number of packets they exchange over the duration of the data set, i.e. hit rates are
computed separately for vertices exchanging 1 − 60, 61 − 70, 71 − 80, 81 − 90, 91 − 100,
101 − 110, and 111 − 120 packets. As we observed in the degree distributions of nodes in
75

figure 4.4, data sets for all six durations are dominated by nodes of degree 1. Therefore, in
our evaluation we focus primarily on degree 1 vertices. Figures 4.5(a), 4.5(b), 4.5(c), 4.5(d),
4.5(e), and 4.5(f) plot the hit rates of degree 1 vertices as a function of set size S for COLD
on 10, 20, 30, 40, 50, and 60 minute data sets, respectively. Within each figure, hit rates are
segregated according to the number of packets users send and receive over the duration the
data was collected. As these six figures consistently show, the hit rate reaches between 0.9
and 1.0 for users exchanging 71 or more packets over the duration of the data sets. In case
of the 20, 30, 40, 50, and 60 minute data sets in Figures 4.5(b), 4.5(c), 4.5(d), 4.5(e), and
4.5(f), this set of users is further extended to those exchanging 61 or more packets. In the
10 minute data set in figure 4.5(a) users with 61-70 packets in their trace have a high hit
rate of more than 0.80. However, the candidate set size S has to be increased all the way to
40 for the hit rate to reach close to 1.0. For users exchanging between 1-60 packets the hit
rate starts out between 0.20 and 0.40. As the candidate set size is increased from 1 upward,
the hit rate rises at a very similar rate in all six data sets.
We compare the accuracy of our proposed approach to that of the time series correlation
(TSC) method. Similarly, figures 4.6(a), 4.6(b), 4.6(c), 4.6(d), 4.6(e) and 4.6(f) plot the hit
rates of degree 1 vertices as a function of set size S for TSC on 10, 20, 30, 40, 50 and 60
minute data sets, respectively. The baseline TSC method, which represents the state of the
art, fails to deliver sufficient performance to be useful for any conceivable application, across
all six data sets. With one slight exception, TSC fails to achieve a hit rate of even 0.20 even
for candidate set size of as large as 100. The only exception is the group of users exchanging
between 71-80 packets in the 10 minute data set. However, even for this subset of users,
TSC provides a hit rate of less than 0.30 at a set size greater than 70, i.e. at best, for users
messaging with only one other person, in a set of 70 candidates TSC will include the actual
instant messaging partner with a probability of only 0.30.

76

1

0.8

0.8
Hit Rate HC(S,1)

Hit Rate HC(S,1)

1

0.6
60
70
80
90
100
110
120

0.4

0.2

0

20

40

60
Set Size S

80

0.6
60
70
80
90
100
110
120

0.4

0.2

0

100

20

1

1

0.8

0.8

0.6
60
70
80
90
100
110
120

0.4

0.2

0

20

40

60
Set Size S

80

0.2

0

100

20

0.8
Hit Rate HC(S,1)

Hit Rate HC(S,1)

0.8

0.6
60
70
80
90
100
110
120

0.4

0.2

60
Set Size S

40

60
Set Size S

80

100

(d) 8-8:40 a.m
1

40

100

60
70
80
90
100
110
120

0.4

1

20

80

0.6

(c) 8-8:30 a.m

0

60
Set Size S

(b) 8-8:20 a.m

Hit Rate HC(S,1)

Hit Rate HC(S,1)

(a) 8-8:10 a.m

40

80

0.6
60
70
80
90
100
110
120

0.4

0.2

0

100

(e) 8-8:50 a.m

20

40

60
Set Size S

80

100

(f) 8-9 a.m

Figure 4.5. Hit rates of COLD for vertices of degree 1 in the (a) 10 minute data set, (b) 20
minute data set, (c) 30 minute data set, (d) 40 minute data set, (e) 50 minute data set, and
(f) 60 minute data set.

4.5.4

Discussions

These results provide us with several insights into the working of COLD. We separately
discuss these insights in the following text.

77

First, there appears to be a very clear threshold value for the number of recorded packets
beyond which the de-anonymization attack using COLD yields high hit rates. From the
plots in figure 4.5 we observe that the hit rate for users containing more than 60 packets in
their traffic traces is significantly higher, above 90%, even at very small candidate set sizes.
On the other hand, the hit rate of users containing 60 packets or less in their traffic trace is
significantly lower. This threshold value holds across all six data sets of different durations.
More packet entries in traffic traces provide more points to match two communicating users’
traces with each other. The greater number of data points also reduces the probability of a
false match. Therefore, it is easier to identify communicating users that message each other
more frequently.
Second, the hit rate of users, classified by the traffic they generate, is largely independent
of the time duration over which the traces were collected. Rather, it is the actual number
of message packets exchanged during that period that determines the hit rate. Hit rates for
users exchanging the same number of packets over different periods of time are very similar.
Therefore, we can state that we can identify two communicating users using COLD with
great certainty as soon as they exchange more than 60 message packets.
Third, while we have already stated that the time period over which traffic traces are
collected have only a weak effect on the hit rate. However, looking at the hit rate functions
of users with 61 − 70, 71 − 80 and 81 − 90 packets in their traffic trace across different data
sets, we observe that the hit rate function rises close to 1.0 at a faster rate in data sets
collected over longer durations.
Fourth, judging by the time durations of the data sets (between 10-60 minutes), we conclude that the amount of data necessary to achieve a high hit rate by COLD can be collected
in a relatively short period of time. Therefore, COLD does not require an extensive data
collection effort to achieve high accuracy.
Finally, we observe that when TSC is applied to all data sets, the hit rate remains almost 0
for vertices of all traffic levels. This leads us to the conclusion that TSC is effectively unable

78

1

0.6

0.8
Hit Rate HT(S,1)

Hit Rate HT(S,1)

0.8

1
60
70
80
90
100
110
120

0.4

0.2

0

0.6

60
70
80
90
100
110
120

0.4

0.2

20

40

60
Set Size S

80

0

100

20

(a) 8-8:10 a.m

0.8

0.4

0.2

0

0.6

20

40

60
Set Size S

80

0

100

20

60
Set Size S

100

1
60
70
80
90
100
110
120

0.8
Hit Rate HT(S,1)

Hit Rate HT(S,1)

40

(d) 8-8:40 a.m

0.4

0.2

0

80

0.4

0.2

1

0.6

100

60
70
80
90
100
110
120

(c) 8-8:30 a.m

0.8

80

1
60
70
80
90
100
110
120

Hit Rate HT(S,1)

Hit Rate HT(S,1)

0.6

60
Set Size S

(b) 8-8:20 a.m

1

0.8

40

0.6

60
70
80
90
100
110
120

0.4

0.2

20

40

60
Set Size S

80

0

100

(e) 8-8:50 a.m

20

40

60
Set Size S

80

100

(f) 8-9 a.m

Figure 4.6. Hit rates of TSC for vertices of degree 1 in the (a) 10 minute data set, (b) 20
minute data set, (c) 30 minute data set, (d) 40 minute data set, (e) 50 minute data set, and
(f) 60 minute data set.
to detect any communication links among users. We attribute this failure to the random
phase delay of packet entries in traffic traces of two communicating users. These delays are
a result of the bidirectional flow of traffic and jitter in the end-to-end delay.

79

4.6

Evasion and Countermeasures

This section presents some possible techniques that an adversary may utilize to evade the deanonymization attack by COLD. We also discuss countermeasures to such evasion techniques
below.
1. Evasion by using proxy or NAT. An adversary may access instant messaging network
behind a proxy or a NAT to bypass the detection by the COLD attack algorithm.
However, in this scenario, COLD will still detect the external IP address, which appears
in the traffic traces collected outside the proxy or NAT. Once the external IP address
is detected, our proposed approach will require additional traces collected inside the
proxy or NAT to specifically pin-point the end-host.
2. Evasion by IP spoofing. An adversary may try to spoof source IP address to evade
COLD. However, IP spoofing will not be successful because every end-user has to setup
a connection with the IM relay server, which is not possible with spoofed IP address.
3. Evasion by fragmentation/aggregation. An adversary may try to break-down a large
message into multiple smaller messages. However, fragmentation at the end-host into
smaller packets will not adversely affect COLD because our approach relies on correlating the traffic traces that are collected entering and leaving the IM service. The
smaller packets created due to fragmentation will appear the same in both sets of traffic traces. In fact, the increased number of packets would improve COLD’s accuracy.
On the other hand, an adversary may try to aggregate as many messages as possible
into a single message to minimize the data available. However, the maximum packet
size is limited by the IM service provider and maximum transmission unit (MTU) of
the network.
4. Evasion by changing packet sizes. If an adversary tries to deliberately change packet
sizes, e.g. by inserting garbage, they will appear the same in the two sets of traffic

80

traces correlated by COLD. Therefore, changing packets sizes will not affect COLD.
5. Evasion by random delays. Adversaries may also add random small or long delays
between their communications. The time delays introduced by end-host will not affect
COLD because these delays appear the same in the two sets of traffic traces. In
another scenario, random delays may be introduced by the IM network due to network
congestion or other processing delays. These delays will affect COLD because they will
be different across the two correlated traffic traces. However, COLD is robust to such
delays as well because it utilizes binning techniques, which reduces their effect.
6. Evasion by injecting noise packets. Injecting random noise packets is unlikely to affect
the accuracy of COLD as long as the noise packets follow the protocol utilized by the
IM network. Packets that do not follow the protocol utilized by the IM network will
be discarded by the IM network after sanity checks and will not appear in the second
traffic trace collecting traffic exiting the IM network. To mitigate the effect of such
noise packets, similar sanity checks can be deployed to check if the logged packets
follow the protocol utilized by the IM network under study.
7. Evasion by encryption. Encryption is only applicable to the packet payloads and packet
headers remain unaffected. The use of encryption cannot evade COLD because our
proposed approach only utilizes fields in the packet header.

4.7

Conclusions

In this work, we present a novel attack to breach the privacy of IM communication services
that allows an attacker to infer who’s talking to whom with high accuracy. We proposed
a wavelet-based scheme, called COLD, that allows us to examine and compare the time
series of one-way (user-server) traffic logs at multiple scales. We evaluated the COLD attack
algorithm using a real-world Yahoo! Messenger data set, which was specifically collected for

81

this study. Our experimental results showed that COLD clearly outperforms the baseline
time series correlation scheme.
Our proposed approach can also be applied to the related problems such as mix network
de-anonymization. In the mix network de-anonymization problem, a set of mix servers can
be treated as the black box and the traffic logs at the edges of the mix network can be
correlated using COLD to detect communication links among end-users [109, 123].

82

5

Information Hub Identification in

Social Networks
5.1
5.1.1

Introduction
Background and Motivation

In a social network, a user that has a large number of interactions with other users is defined
as an information hub (or simply a hub) [26]. An interaction refers to the transmission of
information by one user to another user. For example, an interaction from user A to user
B in online social networks may be the action when user A posts a message or comment
on user B’s profile. Hubs play important roles in the spread or subversion of propaganda,
ideologies, or gossips in social networks. Taking the advertising industry as an example,
instead of giving free product samples to random people, to improve the effectiveness of
word of mouth advertising and increase recommendation based product adoption, they may
want to give free samples to hubs only [40]. For example, CNN reported that Samsung
used social networks information to target dissatisfied owners of Apple iPhone 4 in a recent
advertisement campaign [38]. Samsung first monitored Twitter feeds to identify dissatisfied
iPhone 4 owners who are the most active in terms of communicating with their friends (i.e.
hubs) and are therefore most influential in spreading word of mouth recommendation, then
offered free GalaxyS phones to some of them. Furthermore, observing adoption of products
or trends at hubs helps to predict the eventual total sale of a product [40]. For instance,
83

advertisers can observe the impact of distributing free samples to hubs to predict the future
successfulness of a product. Due to limited advertisement budget (e.g., free product samples),
advertisers want to identify the top-k nodes in a social network. Therefore, identifying top-k
information hubs in social networks is an important problem.

5.1.2

Limitations of Prior Art

Prior methods for computing top-k information hubs (e.g., [61] and [41]), are mostly centralized assuming the availability of either interaction or friendship graphs. The interaction
graph of a social network is a directed multigraph [21] whose nodes represent users and directed links represent the existence of a directed pair-wise interaction. Each link is labeled
with a time stamp that indicates when the interaction occurred. The friendship graph of
a social network consists of nodes representing users and undirected links representing the
friend relationship among users. Figure 5.1 shows the conceptual depiction of the friendship
graph between users and the overlaid interaction graph. However, centralized computation
of top-k information hubs is mostly unrealistic for parties such as advertisers because online
social networking companies are reluctant to share their interaction or friendship graphs due
to privacy concerns and regulations [8]. Furthermore, advertisers cannot even directly collect
interaction or friendship information from social network sites by means such as crawling
because for many online social networking companies such as Facebook [9], unauthorized
data collection is a violation of their terms of service.

5.1.3

Proposed Solution

In this work, we propose a distributed and privacy preserving algorithm for computing
top-k information hubs in social networks. Distributed algorithms for computing top-k
information hubs have to be privacy preserving because users are typically hesitant to disclose
explicit information about their friendship links or interaction information due to privacy
concerns [121]. To preserve the privacy of user interactions, our algorithm is distributed and
84

Figure 5.1. Conceptual depiction of the friendship graph between users and the overlaid
interaction graph.
does not require the advertiser to know users’ friendship associations or interactions. There
are three technical challenges in designing such an algorithm. First, the problem of inferring
a user’s salience (whose ground truth resides in the interaction graph) from the corresponding
friendship graph is inherently difficult because an interaction graph has more information
than its corresponding friendship graph. Furthermore, a friendship graph is undirected and
un-weighted, whereas the interaction graph is a directed multigraph. Second, the complete
friendship graph itself may not be available to the parties interested in identifying hubs.
Third, preserving users’ privacy in this computation is difficult as any information exchange
involved in this computation should not contain any personal information.
We now present an overview of our proposed solutions to the above-mentioned technical
challenges. To address the first challenge, we apply principal component centrality (PCC),
a new measure of centrality we introduced in [45], to the friendship graphs. The intuition
behind PCC is that a user who is connected to well-connected users (even if the user himself
is poorly connected) has a more central status. For example, a poorly connected person
who has a direct connection with a well-connected representative in some population may be
capable of propagating an opinion well by simply convincing the representative. Unlike other
measures of user influence (e.g., eigenvector centrality [22,23]), PCC takes into consideration
the fact that social networks can be multi-polar consisting of multiple communities with few
85

connections between them.
To address the second challenge of friendship graph data availability, we distribute the
computation of PCC among users and therefore do not require a central entity to access
the friendship graph. Advertisers can utilize existing functionality in popular online social
networks (such as groups and pages in Facebook [8]) to implement our proposed distributed
method. Motivation for user participation in decentralized PCC computation may range
from tangible incentives such as receiving free samples from advertisers (e.g. [38]), to intangible incentives such as bragging rights about one’s popularity (e.g. [102]). We decentralize
the PCC computation using the Kempe-McSherry (KM) algorithm [50]. These iterative
algorithms compute eigenvalues and eigenvectors that are essential for computing nodes’
PCCs. Our decentralized algorithm restricts the set of users that a particular user has to
communicate with to its immediate friends. Furthermore, the memory requirement at each
user of this algorithm grows only linearly with the number of friends. Hence, one of the
contributions of this work is extending the original centralized PCC approach to a more
practical distributed PCC form. This new distributed PCC form is an accurate and robust
centrality measure that is capable of identifying all salient users in a social graph using a
truly decentralized and scalable method.
Finally, to address the issue of user privacy, only real numbers representing PCC intermediate scores are exchanged between users. It is impossible to reverse-engineer users’ friendship
associations from these intermediate scores.

5.1.4

Experimental Results and Findings

We evaluated the effectiveness of our proposed technique using real-world Facebook data sets
[111] containing about 6 million users and more than 40 million friendship links. We have four
major findings from our experimental results. First, there is indeed close correlation between
the PCC of nodes in the friendship graph and corresponding dynamic user interaction data.
We envision that this correlation can be exploited for other purposes as well. Second, the

86

computation of PCC can be effectively distributed across individual users in a social graph
without compromising its accuracy. This eliminates the requirement of a central authority
for identifying hubs. Third, the accuracy of PCC improves as we use more eigenvectors in its
computation. Further, the appropriate number of eigenvectors required in the computation
of PCC for real-world social networks is around 10-20. Fourth, the accuracy (in terms of
number of correctly identified top-k users and their estimated rank) of PCC improves as the
duration of interaction data used for comparison is increased from 1 month, to 6 month to
more than a year. This essentially shows that PCC scores reflect the flow of information
between users of a social network over long time periods.

5.1.5

Key Contributions

We make four key contributions in this work. First, we propose a novel method to infer
information lying in the interaction graph (i.e. hub identification) from the friendship graph
in social networks without using the interaction data. Earlier works are limited to solving
this problem using complete interaction graph data. Second, our proposed method, first
of its kind, allows third parties (other than social network owners) to solve this problem.
We use a distributed method to overcome the requirement of a central authority. Third,
our proposed method preserves the privacy of users, i.e., users do not release any personal
information to other users. We achieve this objective by letting each user share only some
real numbers that cannot be reverse-engineered. Finally, we evaluate the effectiveness of
our proposed technique using real-world Facebook data sets that are publicly available. The
results of our experiments show that the proposed approach improves the accuracy of finding
the top-k user set by approximately 50% over existing measures. Furthermore, the proposed
technique accurately estimates the rank of individual users.
The rest of this chapter proceeds as follows. In Section 5.2, we present an overview of
related work. We provide the details of our proposed approach in Section 5.3. We also
provide the analysis of the data set used for evaluating our proposed technique in Section

87

5.4.1. We then provide the detailed results of our evaluation in the rest of Section 5.4.
Finally, we conclude the chapter in Section 5.5.

5.2

Related Work

Besides work on hub identification using user interaction data ( [41,61]) mentioned in Section
5.1, we provide an overview of rest of the relevant research on influence maximization.
Several algorithms have been proposed for identifying top-k influential users in social
networks [34,51–53,106,124]. The objective function for this influence maximization problem
is to maximize the number of users that are part of information flows initiated by the top-k
influential users. In contrast, our method uses friendship graphs, is fully distributed, and is
privacy-preserving, while such work uses user interaction data and is centralized and is not
privacy-preserving.
Kempe et al. studied this influence maximization problem for the first time, proved it
is NP-hard, and proposed a heuristics-based algorithm that achieves 63% of the optimal
result in most cases and outperforms degree and distance centrality heuristics [51]. Suri and
Narahari later proposed Shapley value based heuristic for solving this problem [106]. Zou et
al. studied the same problem with an additional constraint of latency [124]. Estevez et al.
proposed an algorithm called the Set Covering Greedy (SCG) algorithm, which takes into
account the intuition that we should prefer to select nodes in different neighborhoods rather
than selecting highly connected nodes lying in the same neighborhood [34]. Kimura et al.
studied the influence maximization problem with respect to two widely-used fundamental
stochastic information diffusion models in networks, and proposed a solution utilizing tools
from bond percolation and graph theory [52, 53].
Algorithms that forgo using interaction data use structural information like the friendship
graph. They are based on centrality measures computed from friendship graph topologies.
Marsden [69] used degree, closeness, betweenness and eigenvector centrality measures. This

88

is followed by Shi et al. in [104] who used the same centrality measures, i.e. degree, closeness,
betweenness and pagerank [54,57,85] (which is just an iterative algorithm to compute eigenvector centrality). However, as Borgatti showed in [24], degree, closeness and betweenness
centrality are inappropriate measures of centrality for influence processes. Degree centrality is a good measure of the rate of immediate rate of spread of influence from nodes in
the short-term. Betweenness and closeness centrality are ill-suited for the problem at hand
because the definitions underlying them assume that the flow on the network does not replicate and occurs only along shortest paths. Therefore, the performance of Marsden’s use of
eigenvector centrality and Shi’s use of Pagerank form the baseline for comparison against
our proposed algorithm. Canright, Engø-Monsen and Jelasity [27] described a distributed
and privacy preserving algorithm for the computation of eigenvector centrality/PageRank.

5.3

Our Proposed Solution

This section presents our proposed technique for identifying information hubs in social networks. We model the information flow as an influence process. The underlying rationale
for doing so is rooted in the assumption that in social networks people (nodes) with more
friends (connections) send and receive more messages. Furthermore, people will receive more
messages from friends that send/receive a lot of traffic than from those that send/receive
fewer messages. This information flow can be modeled as an influence process. According
to Borgatti’s two dimensional taxonomy of node centrality measures in [24], the appropriate
measure to quantify nodes’ influence is eigenvector centrality (EVC) [22, 23].

5.3.1

Eigenvector Centrality

Let A denote the adjacency matrix of a graph G(V, E) consisting of the set of nodes V =
{v1 , v2 , v3 , . . . , vN } of size N and set of undirected edges E. When a link is present between
two nodes vi and vj , both Ai,j and Aj,i are set to 1 and set to 0 otherwise. Let Γ (vi )

89

denote the neighborhood of vi , the set of nodes vi is connected to directly. EVC of a node is
recursively defined as proportional to the number of its neighbors and their respective EVCs.
Let x(i) be the EVC score of a node vi . Then,

x(i) =

1
λ1

x(j) =
vj ∈Γ (vi )

1
λ1

N

Ai,j x(j)
j=1

(5.1)

Here λ1 is a constant (later found to be the principal eigenvalue of A). Equation 5.1
can be rewritten in vector form Equation 5.2 where x1 = [x(1), x(2), x(3), . . . , x(N )]T is the
vector of EVC scores of all nodes.
x1 =

1
Ax1 ⇐⇒
λ1

λ1 x1 = Ax1
(5.2)

Equation 5.2 is the well-known eigenvector equation where this centrality takes its name
from. Obviously several eigenvalue/eigenvector pairs exist for an adjacency matrix A. Here,
λ1 is the largest of all eigenvalues of A by magnitude. If λi is any other eigenvalue of
A then |λ1 | > |λi |. The eigenvector x1 = [x1 (1), x1 (2), . . . , x1 (N )]T corresponding to the
principal eigenvalue is the principal eigenvector. Thus, the vector of node EVCs is equivalent
to the principal eigenvector. The EVC of a node vi is the corresponding element x1 (i) of
the principal eigenvector x1 .

5.3.2

Motivation for Principal Component Centrality

As we demonstrated in [45], in networks of multiple communities with sparse connectivity
between communities, EVC assigns centrality scores to nodes according to their location
with respect to the most dominant community. When applied to large networks, EVC fails
to assign significant scores to a large fraction of nodes. The principal eigenvector is “pulled”
in the direction of the largest community. The motivation for using PCC as a measure of

90

node influence may be understood by looking at EVC in the context of principal component
analysis (PCA) [32]. In PCA, when feature vectors are extracted from an N × N covariance
matrix of N random variables, the principal eigenvector is the most dominant feature vector,
i.e. the direction in N -dimensional hyperspace along which the spread of data points is
greatest. Similarly, the second eigenvector (corresponding to the second largest eigenvalue)
is representative of the second most significant feature of the data set. The second eigenvector
may also be thought of as the most significant feature after the data points are collapsed
along the direction of the principal eigenvector. When the covariance matrix is computed
empirically from a set of data points, the eigendecomposition is the well-known PCA. Here
PCA is used to find the eigenvectors x1 , x2 , x3 , . . . , xN and eigenvalues λ1 , λ2 , λ3 , . . . , λN of
the graph G’s adjacency matrix A.

5.3.3

Definition of PCC

While EVC assigns centrality to nodes according to their location with respect to the most
dominant community in a graph G, PCC takes into consideration additional communities.
We define the PCC of a node in a graph as its Euclidean distance/ℓ2 norm from the origin in
the P -dimensional eigenspace. The basis vectors of that eigenspace are the P most significant
eigenvectors of the adjacency matrix A of the graph G under consideration. For a graph
G, its N eigenvalues |λ1 | ≥ |λ2 | ≥ . . . ≥ |λN | correspond to the normalized eigenvectors
x1 , x2 , . . . , xN , respectively. The eigenvector/eigenvalue pairs are indexed in descending
order of magnitude of eigenvalues. When P = 1, PCC equals a scaled version of EVC. The
parameter P in PCC can be used as a tuning parameter to adjust the number of eigenvectors
included in PCC.
Let X denote the N × N matrix of concatenated eigenvectors X = [x1 x2 . . . xN ] and let
Λ = [λ1 λ2 . . . λN ]T be the vector of eigenvalues. Furthermore, if P < N (typically P ≪ N )
and if matrix X has dimensions N ×N , then XN ×P will denote the submatrix of X consisting
of the first N rows and first P columns. Then PCC can be expressed in matrix form as:

91

CP =

((AXN ×P ) ⊙ (AXN ×P )) 1P ×1

(5.3)

The ‘⊙’ operator is the Hadamard (or entrywise product or Schur product) operator and
1P ×1 is a vector of 1s of length P . Equation 5.3 can also be expressed in terms of the
eigenvalue and eigenvector matrices Λ and X, of the adjacency matrix A:
CP =

(XN ×P ⊙ XN ×P ) (ΛP ×1 ⊙ ΛP ×1 ).

(a) C1 (i)

(b) C5 (i)

(c) C10 (i)

(d) C100 (i)

(5.4)

Figure 5.2. PCC of nodes in a network consisting of two Barab´asi-Albert graphs of 100 and
50 nodes connected by a few links when computed using the most significant (a) 1, (b) 5,
(c) 10, and (d) 100 eigenvectors.
Node PCCs as defined in equations 5.3 and 5.4 are not normalized. To allow interpretation
of centrality scores, Ruhnau advocated in [84] that they should be normalized by either the
Euclidean norm (ℓ2 norm) or the maximum norm (ℓ∞ i.e. the maximum centrality score)
92

of the centrality vector. For the remainder of this work the PCC vector will be normalized
by the ℓ∞ norm, thereby restricting all entries to the range [0, 1].
We demonstrate PCC on a small-scale example, a graph consisting of two Barab´asi-Albert
graphs [19], one consisting of 100 nodes in one community that is sparsely connected with
another Barab´asi-Albert graph of 50 nodes. Figure 5.2 demonstrates the effect of changing
number of eigenvectors P for PCC CP from 1 (Figure 5.2(a)) for EVC to 5 (Figure 5.2(b)),
10 (Figure 5.2(c)) and 100 (Figure 5.2(d)). As this example shows, EVC is only able to assign
significant centrality to the most well connected node in the larger of the two subgraphs. As
P is raised to 5 and 10, gradually more nodes are assigned significant centrality scores, even
some in the smaller subgraph of 50 nodes.

5.3.4

Selection of Number of Eigenvectors

The cost of computing an eigenvector can be significant for large matrices, favoring the
use of as few eigenvectors for PCC as are necessary. To determine appropriate number of
eigenvectors (Papp ), we consider the phase angle φ as a function of P . The phase angle
φ(P ) of a PCC vector CP is defined as its angle with the EVC vector CE and is defined
mathematically in equation 5.5.
φ(P ) = arccos

CP
C
· E
|CP | |CE |

(5.5)

When the phase angle function is plotted for a range of P , the value of P at which φ begins
approaching its final steady value is used for that particular graph (Papp ) [45]. The selection
of Papp can be made as,
Papp = min{φ(P + 1) − φ(P )}

∈ [−ǫ, ǫ], ∀ [P, N ],

(5.6)

where ǫ is a small real number. It is our observation that the value of Papp is close to the
number of well-connected communities in a social graph.

93

5.3.5

Decentralized Eigendecomposition Algorithm

The massive sizes of social networks require a method whose space and time complexity
scales well with the number of nodes and links between them. According to Equation 5.3,
for a node to compute its own PCC score it needs to know its corresponding entries in the
first P eigenvectors x1 , x2 ... xP of the adjacency matrix A, as well as who its neighbors
are, i.e. the entries in its corresponding row of A. Although many decentralized algorithms
for computing eigenvectors of a matrix exist, many of them are not designed to minimize the
communication overhead between participating nodes. We discuss 2 well-known distributed
algorithms in the following text.
A method that lends itself to a decentralized implementation is the power iteration algorithm [56]. This algorithm avoids computing a matrix decomposition and is suitable for very
large, sparse matrices. Canright, Engø-Monsen and Jelasity [27] describe a fully distributed
implementation of the power iteration algorithm for computing the principal eigenvector.
The power iteration algorithm exploits the fact that the principal eigenvector of a symmetric matrix can be interpreted as the steady state distribution of an ergodic Markov chain.
The transition matrix of this Markov model is the stochastic, row-normalized version of the
adjacency matrix A. The power iteration algorithm initializes x as a random vector and
Ax
iteratively computes Ax
→ x until x ∞ becomes stable within a margin of tolerance
∞
(tol). The denominator term Ax ∞ is the eigenvalue λ corresponding to the eigenvector

x. To compute subsequent eigenvectors using the power iteration algorithm, as mentioned
x xT

in [60], the adjacency matrix is “deflated,” i.e. Ai+1 = Ai − λi Ti i . The deflation operx x
i

i

ation removes from adjacency matrix A = A1 the structure that is explained by the first
eigenvector x1 . The principal eigenvector of the deflated matrix A2 is computed using the
power iteration algorithm again. A2 ’s principal eigenvector is then the second eigenvector
of the original matrix A = A1 . Thus, starting from A, P applications of the power iteration
algorithm and P − 1 deflations will produce the P most significant eigenvalues λ1 , λ2 , ..., λP
and eigenvectors x1 , x2 , ..., xP .
94

The pseudo code in Algorithm 1 describes the power iteration and deflation algorithms. In
the application at hand, the algorithm takes friendship graph’s adjacency matrix A and P
number of eigenvectors as input. Typically, friendships in graphs obtained from online social
networks are recorded as bidirectional, making A a symmetric matrix. This ensures that all
its eigenvalues will be real. The principal eigenvector x1 can be initialized arbitrarily [27].
Note that according to Algorithm 1, the messages exchanged between nodes do not contain
any link information or other structural information about the graph. Nodes do not reveal
either their own or any other nodes’ friends. The fact that only a numeric value is exchanged
by communicating nodes preserves privacy.
While the power iteration algorithm provides sufficiently accurate results for the principal
eigenvector, it is not suitable for computing many subsequent eigenvectors/eigenvalues for
two reasons. First, it suffers from rounding error that becomes progressively worse with the
computation of each subsequent eigenvector. Second, the number of messages that have to be
exchanged by a node grows larger for every successive eigenvector that is computed. Consider
an illustrative example in Figure 5.3 where the node under consideration is colored black.
The neighbor nodes it is connected with in the friendship graph are colored grey and nonneighbor nodes are colored white. Figure 5.3(a) illustrates the exchange of messages to/from
the node when the first eigenvector is computed. In this scenario the node does not need to
communicate outside its circle of friends. Recall that to compute the second eigenvector and
eigenvalue the power iteration algorithm is applied to the once deflated adjacency matrix
A2 , which will include non-zero entries that will require nodes to venture outside of their
circle of neighbors and exchange messages with nodes to whom they are not connected to in
the friendship graph. This is shown in Figure 5.3(b). However, the content of the message
exchange is still only numeric values. After a few deflation operations Ai will not remain
sparse anymore, thus increasing the number of messages that have to be sent/received by
every node in the friendship graph. This case is illustrated in Figure 5.3(c). The number of
message exchanges can be reduced at the cost to the rounding error by rounding very small

95

Algorithm 1 Power iteration algorithm with deflation
Input: N = Number of nodes in graph
Input: A = Friendship graph adjacency matrix of size N × N
Input: P = Number of eigenvectors to be computed
Output: x1 , x2 , ..., xP = P largest eigenvectors
Output: λ1 , λ2 , ..., λP = P largest eigenvalues
1:

x1 ← [1 1 1 ... 1]T

2:

A1 = A

3:

for i = 1 to P do

4:

λi,prev ← 0

5:

λi ← 1

6:
7:

λ −λ
while i λi,prev > tol do
i
λi,prev = λi

8:

x′i ← Ai x1

9:

λi = x′i ∞
x′

11:

xi = λ i
i
end while

12:

xx
Ai+1 ← Ai − λi Ti i

10:

T

13:

xi xi

end for

entries in eigenvectors xi and deflated matrices Ai to zero.
In [50] Kempe and McSherry developed a decentralized algorithm for the computation of
the first P most significant eigenvectors. Their approach differs from other algorithms in
that each node is only required to communicate with neighbor nodes, as shown in Figure
5.3(a). This means that the computational complexity of the algorithm at every node, and
the volume of messages exchanged by each node scales only linearly with the number of its
neighbors and linearly with the number of eigenvectors that are computed. Furthermore, the
time for the algorithm to converge is O(τmix log N ), where N is the total number of nodes in

96

(a) 1st eigenvector

(b) 2nd eigenvector (c) 3rd

(subsequent)

eigenvectors

Figure 5.3. An illustrative example of message exchanges in power iteration algorithm for
(a) 1st eigenvector, (b) 2nd eigenvector, and (c) subsequent eigenvectors. Reference node is
colored black, neighbor nodes grey and non-neighbor nodes white.
the graph and τmix is the mixing time of the Markov chain with a transition matrix that is
the row-normalized version of A. Although the power method’s overhead and convergence
properties vary greatly from those of the KM algorithm, the iterative components of the KM
algorithm are very similar to those of the power method (lines 6 to 11 in Algorithm 1) when
it is used in the computation of the principal eigenvector only. Both algorithms perform a
deterministic simulation of a random walk. For a detailed coverage of the KM algorithm we
refer the reader to [50].
Kempe reported the error of the ℓ2 norm of the space spanned by RP , the projection of P
most significant eigenvectors on A, and RP ′ , the projection of P most significant eigenvectors
by KM-algorithm onto A, with high probability as follows.
λP +1 t
·N
λP

RP − RP ′ 2 ≤ O

+ 3ǫ4t
(5.7)

Here, t denotes the number of iterations for which the KM algorithm executes. Clearly, since
λP +1 < λP , the fractional term will be decreasing with t at a geometric rate. Figure 5.4
shows a plot of average mean squared error (MSE) between the actual and estimated top-k
eigenvectors (using the KM algorithm) for random graphs of 100 nodes. We report average
97

MSE values for varying number of eigenvectors (k) and number of iterations (t). Each point
in the plot is an average of 1000 independent runs and the confidence intervals are too small
to be shown. As expected from equation 5.7, we observe that average MSE values sharply
decrease approximately at a geometric rate for increasing number of iterations. Furthermore,
for a given number of iterations, average MSE values increase for larger k values.
1.4

k=1
k=2
k=3
k=4
k=5
k=6
k=7
k=8
k=9
k=10

Average MSE

1.2
1
0.8
0.6
0.4
0.2
0
10

20

30

40

50
60
Iterations

70

80

90

100

Figure 5.4. Average Mean Squared Error (MSE) for the KM algorithm reported for varying
values of number of eigenvectors (k) and number of iterations (t).
From the above discussions of the power method with deflation and the KM algorithm,
we conclude the following:
• The power method with deflation has a communication overhead that grows exponentially with each additional eigenvector computation. This limits the number of
eigenvectors that can be used in the computation of PCC. On the other hand, the KM
algorithm never requires a node to communicate beyond its immediate neighbors. This
implies that the communication overhead of the KM algorithm scales linearly with the
number of computed eigenvectors.
• The power method with deflation suffers from compounding of round-off errors in
the computation of subsequent eigenvectors. This limits the maximum number of
eigenvectors that can be used in computation of PCC. On the other hand Kempe et
98

al. reported near perfect convergence for their algorithm, which is also verified from
our observations in Figure 5.4.
For these reasons, we choose to use the KM algorithm for the distributed computation of
eigenvectors for PCC.

5.4
5.4.1

Experimental Results
Data Set

We now present details of the data sets used to evaluate the efficacy of our proposed technique. In our study, we use two independently collected data sets from Facebook [111]. We
use both data sets to demonstrate that our proposed solution is not biased in favor of any
particular data set. The data sets are labeled data set A and data set B here-onwards. As
Wilson et al. describe in [111], at the time of collection in April 2008 Facebook had 67
million subscribers of whom 44.3 million belonged to a regional network (regional networks
were defined on the basis of geography and institutions). Each regional network forms a
community of nodes that are strongly intra-connected but sparsely connected to other communities. Their crawler performed a breadth-first-search and collected data from the 22
largest regional networks. The crawler was initialized with 50 randomly seeded user profiles.
Wilson et al. verified the completeness of their coverage of regional networks by performing 5 simultaneous crawls of the San Francisco regional network, each seeded by a different
number of seed user IDs varying from 50 to 5000. The difference in the number of users
discovered between crawls was a mere 0.1%. Therefore, we can conclude that the coverage
of users in these data sets is fairly complete. The data contained in data set A and data set
B is from different regions.
Each data set further consists of two types of graphs. First, we have an undirected friendship graph where the nodes represent users and links represent the friendship between two
users. Second, we have a directed pair-wise user interaction graph where the nodes repre99

sent users and the directed links represent the interaction from one user to another. The
interaction data spans a time duration of one year. Note that we use the interaction data
only to evaluate the ground truth.
Table 5.1 provides the basic statistics of the friendship graphs analyzed in this study. We
note that the number of users in data set A are slightly more than those in data set B. We
also note that the ratio of the number of friendship links to the number of users for data set
A is ∼ 7.6, which is slightly more than ∼ 7.1 for data set B. This statistic is also reflected
in the values of average clustering coefficients of both data sets. Moreover, the number of
cliques in the friendship graph of data set A is more than those in data set B. However,
we observe that the transitivity value (defined as the fraction of possible triangles that are
actually triangles) for data set A is less than the respective value of data set B.
Table 5.1. Basic statistics of the friendship graphs analyzed in this study
Property

Data set A

Data set B

# Users
# Friendship Links
Average Clustering Coefficient
# Cliques
Transitivity

3097165
23667394
0.0979
28889110
0.0477

2937612
20959854
0.0901
27593398
0.04832

Figures 5.5 and 5.6 show the plots of degree distributions for friendship graphs of both
networks. In Figures 5.5(a) and 5.6(a), we plot the histograms of one thousand bins in which
we have grouped users. Although the distribution does not follow a power-law exactly, it fits
it reasonably well as shown by straightness on log-log scale and verified by high goodnessof-fit (R2 ) values for data set A and data set B. This observation is in accordance with the
result of recent studies that have shown that the degree distribution of many online social
networks is power-law [71]. An equivalent representation is shown in Figures 5.5(b) and
5.6(b) where users are reverse-sorted by their degree. Note that the estimated values of
model parameters are similar for both data sets.

100

10

4

10

y(x) = a x^n
a = 8.2555e+007
n = −2.3646
R = 0.95198

5

Degree

Count

10

10

0

10

2

10

y(x) = a x^n + c
a = 4266.1
c = −84.674
n = −0.27257
R = 0.95412

0

0

10

1

2

10
10
Bin Index (reverse−sorted)

10 0
10

3

10

2

4

6

8

10
10
10
10
Node Index (reverse−sorted by degree)

(a) Power law

(b) Zipf-like

Figure 5.5. Degree distribution of friendship graph for Facebook data set A
10

4

10

y(x) = a x^n
a = 3.7189e+007
n = −2.2168
R = 0.95833

5

Degree

Count

10

10

0

10

2

10

y(x) = a x^n + c
a = 4626.4
c = −78.608
n = −0.28508
R = 0.94416

0

0

10

1

2

10
10
Bin Index (reverse−sorted)

3

10

(a) Power law

10 0
10

2

4

6

(b) Zipf-like

Figure 5.6. Degree distribution of friendship graph for Facebook data set B

5.4.2

Selection of PCC Parameter

We can compute the PCC vector CP for a range of number of eigenvectors P . Note that at
P = 1 the PCC C1 is the EVC CE , which serves as the measure of baseline comparison, as
mentioned in [104] and [69]. Although we will be comparing PCC with EVC for a range of
values of P in some of our subsequent analysis, we will try to determine the “appropriate”
number of eigenvectors for PCC (denoted by Papp ). We do this by means of plotting the
phase angle function defined in Equation 5.5. Figures 5.7(a) and 5.7(b) plot the phase angle
functions of Facebook data sets A and B, respectively, for the range of P = 1 to 100. For
101

8

10
10
10
10
Node index (reverse−sorted by degree)

data sets A and B, the phase angle function rises quickly initially until P = 6 and rises only
very slowly thereafter. Using Equation 5.6, the Papp values are 10 and 20 for data sets A
and B, respectively. The difference in Papp values for data sets A and B can be explained
by differences in their network structures. In Table 5.1, we showed that the friendship graph
of data set A has higher average clustering coefficient values than the friendship graph of
data set B. This essentially shows that the friendship graph of data set A is on average more
tightly connected than the friendship graph of data set B. In terms of PCC computation,
this indicates that all nodes can be reached from a given node in lesser number of hops
on average. In other words, fewer number of eigenvectors (denoted by Papp ) are enough to
approximate the steady-state PCC value.
1
0.8
0.6

φ (rad)

φ (rad)

0.8

0.4

0.6
0.4
0.2

0.2
0
0

20

40

60

80

100

0
0

P

20

40

60

80

100

P

(a) Facebook data set A

(b) Facebook data set B

Figure 5.7. Plot of the phase angle φ(P ) between PCC vectors CP and EVC vector CE
plotted against number of feature vectors P for (a) Facebook data set A, and (b) Facebook
data set B.

5.4.3

Comparison With Ground Truth

Now that we have identified an appropriate number of eigenvectors for PCC for both data
sets, we devote the remaining section to evaluating its accuracy by comparing the results to
the ground truth, i.e. interaction data. For both data sets A and B, we have interaction

102

graphs spanning 1 month, 6 months, and 1 year time periods.
Recall that the PCC scores for individual users are calculated using only information from
the friendship graph. We compare the PCC of nodes against their actual flows over various
time periods to get a sense of the time period over which PCC best predicts the flow.
We have used a symmetric measure called Pearson’s product-moment coefficient to quantify the similarity between the output of PCC and the ground truth from interaction data.
The Pearson’s product-moment coefficient ρ is defined in Equation 5.8. Here E is the expectation operator, σ refers to standard-deviation, and µ denotes mean value.
ρ(CP , ϑ) =

E[(CP − µC )(ϑ − µϑ )]
P
σC σϑ

(5.8)

P

Figure 5.8 shows the plots of correlation coefficients ρ(CP , ϑ) as a function of number of
eigenvectors for the range 1 ≤ P ≤ 100. Figure 5.8(a) plots ρ(CP , ϑ) for flows collected
over 1 month, 6 months and the entire collection time period (labeled ‘All’) for data set A.
Figure 5.8(b) does the same for data set B.
0.6

0.5
0.4
ρ(CP,ϑ)

ρ(CP,ϑ)

0.5
0.4
0.3
All
6 mo
1 mo

0.2
0.1

20

40

60

80

0.3
All
6 mo
1 mo

0.2
0.1

100

P

20

40

60

80

100

P

(a) Facebook data set A

(b) Facebook data set B

Figure 5.8. Correlation coefficients ρ of PCC CP and, (a) flow count of Facebook data
set A (ϑ(A)), (b) flow count of Facebook data set B (ϑ(B)). The correlation coefficients
are plotted as functions of the number of eigenvectors P and plotted separately for each
interaction graph.
We make two major observations from these plots. First, we note that the value of ρ
103

generally increases with increasing number of eigenvectors P for computing PCC. It rises
quickly to reach its steady-state value for both data sets. For Facebook data set A, ρ
reaches close to its steady-state value at around 10 eigenvectors. Whereas, for Facebook
data set B, ρ reaches close to its steady-state value at around 20 eigenvectors. Note that
the steady-state values for ρ are reached at Papp values selected in the previous subsection.
This observation verifies the merit of using phase angle for selection of appropriate value of
P in PCC computation.
Second, we note that the correlation coefficients are higher for interaction data collected
over longer periods of time. This observation follows our intuition that the trends in shortterm interaction data can deviate from our expectations in steady-state friendship graph;
however, the trends in long-term interaction data show greater similarity with the underlying
friendship graph. This observation remains consistent across both Facebook data sets.
To further evaluate the accuracy of PCC in finding information hubs, we analyze the
overlap between the set of top-2000 users by PCC (denoted by S2000 (CP )) and the ground
truth. Note that the choice of 2000 nodes in the following analysis is purely arbitrary. The
results of our analysis for different set sizes are qualitatively similar. Let the cardinality of
the intersection set of the first k nodes by PCC and the first k nodes by flow ϑ be denoted
by Ik (CP , ϑ) and defined in Equation 5.9 below.
Ik (CP , ϑ) =

|Sk (CP ) ∩ Sk (ϑ)|
k

(5.9)

Figures 5.9(a) and 5.9(b) plot I2000 for data sets A and B, respectively. We evaluate separately for interaction data of different durations. As expected, the cardinality of the intersection set increases with the number of eigenvectors used in computation of PCC. In both
figures, the data points at P = 1 represent the baseline for our comparison, i.e. EVC.
For data set A, the cardinality of the intersection set of the top-2000 nodes by EVC
and top-2000 nodes by flow ϑ, the cardinality of the intersection set I2000 (CE , ϑ) is 342.
At P = 10, I2000 (C10 , ϑ) = 513 for data set A. These represent increases of 50.0%. For
104

Facebook data set B intersection cardinality of EVC set with flow are I2000 (CE , ϑ) = 358.
At P = 20, I2000 (C20 , ϑ) = 426 for data set A, an increase of 19.0%. For the remainder
of this section, we fix the values of P at Papp for both data sets A and B. We see greater
agreement between the list of nodes generated by PCC score with flow data collected over a
longer durations.
500

All
6 mo
1 mo

400

All
6 mo
1 mo

400
I2000(CP,ϑ)

I2000(CP,ϑ)

600

200

300
200
100

0

20

40

60

80

0

100

20

40

P

60

80

100

P

(a) Facebook data set A

(b) Facebook data set B

Figure 5.9. Size of the intersection set in (a) Facebook data set A, and (b) Facebook data
set B, for varying number of eigenvectors used in computation of PCC.
Figures 5.10(a) and 5.10(b) plots Ik set for top-k users for data sets A and B, respectively.
We observe an increasing trend for Ik as we increase the bracket size of top-k users. We
also note that the cardinality of the intersection set increases for increasing durations of
interaction data. The overlap approaches 40% of k mark for top-1% users. Moreover, we
observe that the results for Facebook data set A are slightly better than those of Facebook
data set B.
The evaluation described till now focuses on the number of users that are common in topk set assembled with respect to PCC scores and node degree in directed interaction graph.
In a more fine-grained analysis, we are also interested in quantifying the accuracy of ranks
assigned using PCC scores. Towards this end, we compute the difference between ranks
assigned by PCC and those determined using data from interaction graph. Moreover, the
significance of correct ranking of high-ranked users is more important than low-ranked users.
105

50
40
Ik(C20,ϑ) (%)

Ik(C10,ϑ) (%)

40
30
All PCC
All EVC
6 mo PCC
6 mo EVC
1 mo PCC
1 mo EVC

20
10
0

0.2

0.4
0.6
k (%)

30

All PCC
All EVC
6 mo PCC
6 mo EVC
1 mo PCC
1 mo EVC

20
10
0

0.8

0.2

0.4

0.6

0.8

1

k (%)

(a) Facebook data set A

(b) Facebook data set B

Figure 5.10. Cardinality of the intersection set in (a) Facebook data set A, and (b) Facebook
data set B, for varying fraction of nodes in graph.
To accomplish these objectives, we have devised a distance metric to compare the relevance
of two ordered lists. We denote the list of nodes of length k in descending order of CP by
Rk (CP ) and the list of nodes of length k in descending order of interaction graph degree
by Rk (ϑ). The distance is normalized in the range [0, 1], where 0 correspond to the perfect
match between two given order lists, and vice-versa. We define the normalized distance
d ∈ [0, 1] between these two ordered lists as:

d (Rk (CP ), Rk (ϑ)) =

i∈Rk (ϑ)

wi |Rk (CP (i))−Rk (ϑ(i))|
N −2i+1
i∈Rk (ϑ) wi

(5.10)

Here wi is the degree of user i in the interaction graph and N is the total number of users.
Figures 5.11(a) and 5.11(b) show the variation in distance between two ordered lists as we
increase its size k for data sets A and B, respectively. Similar to the intersection results, we
first note that the best results are achieved when comparison is done with interaction data of
longer time duration. Second, we note that the results slightly degrade for increasing values
of k. Third, it is evident that the results for Facebook data set A are better than those
for Facebook data set B. For example, d ≈ 0.01 at k = 0.5% of N for data set A, whereas
d ≈ 0.03 at k = 0.5% of N for data set B.
106

0.06

All
6 mo
1 mo

0.04

d(Rk(C20),Rk(ϑ))

d(Rk(C10),Rk(ϑ))

0.06

0.02

0

0.2

0.4
0.6
k (%)

0.05
0.04
0.03
0.02

0.8

All
6 mo
1 mo

0.2

0.4

0.6

0.8

1

k (%)

(a) Facebook data set A

(b) Facebook data set B

Figure 5.11. Distance between ordered lists computed by PCC and interaction data using
(a) Facebook data set A, and (b) Facebook data set B, for varying fraction of nodes in graph.

5.5

Conclusions

Information hubs in social networks play important roles in the speed and depth of information diffusion. Identifying hubs helps us to harness their power to pursue social good at local,
national, and global levels. In this work, we propose the first friendship graph based, fully
distributed, and privacy-preserving method for identifying hubs in online social networks.
Unlike prior work, our method can be used to identify hubs by parties other than social network owners as we do not require any entity to access interaction or friendship graphs. We
conducted experiments using data collected from Facebook. The data sets used in this study
were collected over the period of more than a year and contain data from about 6 million
users. The results of our experiments using this data showed that our proposed protocol
accurately (in terms of number of correctly identified top-k nodes and their estimated rank)
identifies the top-k information hubs in a social network.

107

6

Conclusion

In this thesis, I measured and modeled various network systems using a combination of
theoretical and empirical methods. Specifically, I focused on understanding various aspects
of cellular networks and online social networks. For cellular networks, I first presented an
approach to modeling QoE for mobile video and then characterized cellular network performance during crowded events. For online social networks, I first presented an approach to
de-anonymize communication among users in an IM network and then presented a model to
identify information hubs in online social networks. The primary challenge in characterizing
and modeling large scale networks is “dynamics” — dynamics of user behavior, dynamics
of networks, and dynamics of external factors. In this thesis, I show that we can effectively model these dynamics using machine learning and signal processing approaches to
gain substantial performance benefits.
The vision of this thesis can be extended to many other similar research directions. For
cellular networks, several performance optimization opportunities arise due to unintended
interactions among protocols operating at different layers. For example, TCP’s performance
generally suffers on wireless networks due to their high link layer losses. These cross-layer
inefficiencies present characterization and modeling opportunities, e.g., protocol parameter
tuning, traffic routing, caching, etc. Many of these research directions are part of this thesis
and subject to my ongoing work on malware detection [95, 96, 100, 101, 107], social network
modeling [46–48, 89, 98], operational network performance optimization [88, 90–94, 99], and
protocol vulnerability analysis [97, 103, 110].

108

BIBLIOGRAPHY

109

BIBLIOGRAPHY
[1] 3GPP Self-Organizing Networks. http://www.3gpp.org/SON.
[2] Architecture enhancements for non-3GPP accesses.
Specs/html-info/23402.htm.

http://www.3gpp.org/ftp/

[3] Multi-Resolution Land Characterization (MRLC) consortium, National Land Cover
Database 2006 (NLCD 2006). http://www.mrlc.gov/nlcd2006.php.
[4] Offload service. http://www.devicescape.com/offload-service.
[5] RFC 2046, Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types.
http://tools.ietf.org/html/rfc2046.
[6] U.S. Geological Survey, National Elevation Dataset. http://ned.usgs.gov/.
[7] Yahoo! network flows data, version 1.0. Yahoo! Research Webscope Data Sets.
[8] Facebook Advertising, http: // www. facebook. com/ advertising/ , 2010.
[9] Statement of rights and responsibilities. http://www.facebook.com/terms.php, 2010.
[10] Cisco Visual Networking Index: Global Mobile Data Traffic Forecast Update, 20102015. White Paper, February 2011.
[11] Actix Press Release. http://www.actix.com/sites/www. actix.com/files/Actix_
Hotspots_Study_Findings.pdf, June 2012.
[12] Wi-Fi CERTIFIED Voice-Enterprise, Delivering Wi-Fi voice to the enterprise. White
Paper, May 2012.
[13] Cisco Visual Networking Index: Global Mobile Data Traffic Forecast Update 2012–
2017. Technical report, Cisco, 2013.
[14] Vaneet Aggarwal, Rittwik Jana, Kadangode Ramakrishnan, Jeffrey Pang, and
N. Shankaranarayanan. Characterizing fairness for 3G wireless networks. In IEEE
LANMAN, 2011.
[15] Athula Balachandran, Vyas Sekar, Aditya Akella, Srinivasan Seshan, Ion Stoica, and
Hui Zhang. A quest for an internet video quality-of-experience metric. In 11th ACM
Workshop on Hot Topics in Networks (HotNets-IX), 2012.
[16] Athula Balachandran, Vyas Sekar, Aditya Akella, Srinivasan Seshan, Ion Stoica, and
Hui Zhang. Developing a predictive model of quality of experience for internet video.
In ACM SIGCOMM, 2013.
110

[17] N. Balasubramanian, A. Balasubramanian, and A. Venkataramani. Energy consumption in mobile phones: A measurement study and implications for network applications.
In ACM IMC, 2009.
[18] Marco Balduzzi, Christian Platzer, Thorsten Holz, Engin Kirda, Davide Balzarotti,
and Christopher Kruegel. Abusing social networks for automated user profiling. In
Recent Advances in Intrusion Detection, 2010.
[19] A.L. Barab´asi and R. Albert. Emergence of scaling in random networks. Science,
286(5439):509, 1999.
[20] Roger S. Bivand, Edzer J. Pebesma, and Virgilio Gomez-Rubio. Applied Spatial Data
Analysis with R. Springer, 2008.
[21] B. Bollobas. Modern graph theory. Springer Verlag, 1998.
[22] P. Bonacich. Factoring and weighting approaches to status scores and clique identification. Journal of Mathematical Sociology, 2(1):113–120, 1972.
[23] Phillip Bonacich. Technique for analyzing overlapping memberships. Sociological
Methodology, 4:176–185, 1972.
[24] S.P. Borgatti. Centrality and network flow. Social Networks, 27(1):55–71, 2005.
[25] Alessio Botta, Antonio Pescape, Giorgio Ventre, Ernst Biersack, and Stefan Rugel.
Performance footprints of heavy-users in 3G networks via empirical measurement. In
International Symposium on Modeling and Optimization in Mobile, Ad Hoc and Wireless Networks (WiOpt), 2010.
[26] Guido Caldarelli. Scale-Free Networks. Oxford University Press, 2007.
[27] G. Canright, K. Engø-Monsen, and M. Jelasity. Efficient and robust fully distributed
power method with an application to link analysis. Department of Computer Science,
University of Bologna, Tech. Rep. UBLCS-2005-17, pages 2005–17, 2005.
[28] B. Claise. Cisco systems NetFlow services export version 9. Wikipedia, the free
encyclopedia, October 2004.
[29] Aaron Clauset, Cristopher Moore, and M. E. J. Newman. Hierarchical structure and
the prediction of missing links in networks. Nature, 453:98–101, 2008.
[30] RR Coifman and MV Wickerhauser. Entropy-based algorithms for best basis selection.
IEEE Transactions on Information Theory, 38(2 Part 2):713–718, 1992.
[31] Florin Dobrian, Vyas Sekar, Asad Awan, Ion Stoica, Dilip Joseph, Aditya Ganjam,
Jibin Zhan, and Hui Zhang. Understanding the impact of video quality on user engagement. In ACM SIGCOMM, 2011.
111

[32] R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classification. Wiley New York, 2nd
edition, 2001.
[33] Jeffrey Erman, Alexandre Gerber, K.K. Ramakrishnan, Subhabrata Sen, and Oliver
Spatscheck. Over the top video: The gorilla in cellular networks. In ACM IMC, 2011.
[34] Pablo A. Estevez, Pablo Vera, and Kazumi Saito. Selecting the most influential nodes
in social networks. In proceedings of IJCNN, 2007.
[35] Tom Fawcett. ROC Graphs: Notes and Practical Considerations for Researchers.
Technical report, HP Laboratories, 2004.
[36] A. Finamore, M. Mellia, M. Munafo, R. Torres, and S. R. Rao. YouTube Everywhere:
Impact of Device and Infrastructure Synergies on User Experience. In ACM IMC,
2011.
[37] J. Fry, G. Xian, S. Jin, J. Dewitz, C. Homer, L. Yang, C. Barnes, N. Herold, and
J. Wickham. Completion of the 2006 National Land Cover Database for the Conterminous United States. Photogrammetric Engineering & Remote Sensing (PE&RS),
77(9):858–864, 2011.
[38] Duncam Geere. Samsung offers free phones to frustrated iPhone users. CNN Tech,
http://www.cnn.com/2010/TECH/mobile/07/24/ samsung.replacing.iphones/, 2010.
[39] P. Gill, M. Arlitt, Z. Li, and A. Mahanti. YouTube Traffic Characterization: A View
From the Edge. In ACM IMC, 2007.
[40] Jacob Goldenberg, Sangman Han, Donald R. Lehmann, and Jae Weon Hong. The
role of hubs in the adoption processes. Journal of Marketing, American Marketing
Association, 2008.
[41] A. Goyal, F. Bonchi, and L.V.S. Lakshmanan. Discovering leaders from community
actions. In Proceeding of the 17th ACM Conference on Information and Knowledge
Management (CIKM), 2008.
[42] Bo Han, Pan Hui, V. S. Anil Kumar, Madhav V. Marath, Guanhong Pei, and Aravind
Srinivasan. Cellular traffic offloading through opportunistic communications: A case
study. In ACM MobiCom Workshop on Challenged Networks, 2011.
[43] Raymond Heatherly, Murat Kantarcioglu, and Bhavani M. Thuraisingham. Preventing private information inference attacks on social networks. IEEE Transactions on
Knowledge and Data Engineering, 2012.
[44] Junxian Huang, Q. Xu, B. Tiwana, Z. Morley Mao, Ming Zhang, and Victor Bahl.
Anatomizing application performance differences on smartphones. In ACM MobiSys,
2010.
112

[45] M. U. Ilyas and H. Radha. A KLT-inspired node centrality for identifying influential neighborhoods in graphs. In Conference on Information Sciences and Systems,
Princeton, NJ, 2010. Princeton University.
[46] Muhammad U. Ilyas, M. Zubair Shafiq, Alex X. Liu, and Hayder Radha. A Distributed
and Privacy-Preserving Algorithm for Identifying Information Hubs in Social Networks.
In 30th Annual IEEE Conference on Computer Communications (INFOCOM) MiniConference, 2011.
[47] Muhammad U. Ilyas, M. Zubair Shafiq, Alex X. Liu, and Hayder Radha. A Distributed
Algorithm for Identifying Information Hubs in Social Networks. In IEEE Journal on
Selected Areas in Communications (JSAC), 2013.
[48] Muhammad Usman Ilyas, M. Zubair Shafiq, Alex X. Liu, and Hayder Radha. Who
are You Talking to? Breaching Privacy in Encrypted IM Networks. In 21st IEEE
International Conference on Network Protocols (ICNP), 2013.
[49] Hao Jiang and Constantinos Dovrolis. Passive estimation of TCP round-trip times.
SIGCOMM CCR, 32(3), 2002.
[50] D. Kempe and F. McSherry. A decentralized algorithm for spectral analysis. Journal
of Computer and System Sciences, 74(1):70–83, 2008.
[51] David Kempe, Jon Kleinberg, and Eva Tardos. Maximizing the spread of influence
through a social network. In proceedings of KDD, 2003.
[52] Masahiro Kimura, Kazumi Saito, and Ryohei Nakano. Extracting influential nodes for
information diffusion on a social network. In proceedings of AAAI, 2007.
[53] Masahiro Kimura, Kazumi Saito, Ryohei Nakano, and Hiroshi Motoda. Finding influential nodes in a social network from information diffusion data. Springer Social
Computing and Behavioral Modeling, 2009.
[54] C. Kohlsch¨
utter, P. Chirita, and W. Nejdl. Efficient parallel computation of pagerank.
Lecture Notes in Computer Science, 3936:241, 2006.
[55] S. Shunmuga Krishnan and Ramesh K. Sitaraman. Video stream quality impacts
viewer behavior: Inferring causality using quasi-experimental designs. In ACM IMC,
2012.
[56] C. Lanczos. An iteration method for the solution of the eigenvalue problem of linear
differential and integral operators. J. Res. Nat. Bur. Standards, 45(4):255–282, 1950.
´
[57] A.N. Langville, C.D. Meyer, and P. FernAndez.
Googles pagerank and beyond: the
science of search engine rankings. The Mathematical Intelligencer, 30(1):68–69, 2008.

113

[58] Stephen Lawson. Wireless networks are near capacity. http://www.pcworld.
com/businesscenter/article/235964/survey_wireless_networks_are_near_
capacity.html, July 2011. IDG News Service.
[59] Patrick P. C. Lee, Tian Bu, and Thomas Woo. On the detection of signaling DoS
attacks on 3G wireless networks. In IEEE Infocom, 2007.
[60] R.B. Lehoucq and D.C. Sorensen. Deflation Techniques for an Implicitly Restarted
Arnoldi Iteration. SIAM Journal on Matrix Analysis and Applications, 17:789, 1996.
[61] Jure Leskovec, Andreas Krause, Carlos Guestrin, Christos Faloutsos, Jeanne VanBriesen, and Natalie Glance. Cost-effective outbreak detection in networks. In KDD,
2007.
[62] Xiaotong Li. Informational cascades in IT adoption. Communications of the ACM,
47(4), 2004.
[63] David Liben-Nowell and Jon Kleinberg. The link prediction problem for social networks. In CIKM ’03: Proceedings of the12th International Conference on Information
and Knowledge Management, pages 556–559, New York, NY, USA, 2003. ACM.
[64] F. Liers and A. Mitschele-Thiel. UMTS data capacity improvements employing dynamic rrc timeouts. In 16th IEEE International Symposium on Personal, Indoor and
Mobile Radio Communications (PIMRC), 2005.
[65] X. Liu, A. Sridharan, S. Machiraju, M. Seshadri, and H. Zang. Experiences in a 3G
network: Interplay between the wireless channel and applications. In ACM MobiCom,
2008.
[66] Wei Lu and Ali A. Ghorbani. Network anomaly detection based on wavelet analysis.
EURASIP Journal on Advances in Signal Processing, 2009.
[67] Haiyun Luo, Ramachandran Ramjeey, Prasun Sinhaz, Li (Erran) Liy, and Songwu Lu.
UCAN: A unified cellular and adhoc network architecture. In ACM MobiCom, 2003.
[68] S. Mallat. A wavelet tour of signal processing. Academic press, 1999.
[69] P.V. Marsden. Egocentric and sociocentric measures of network centrality. Social
Networks, 24(4):407–422, 2002.
[70] Jakub Mikians, Laszlo Gyarmati, Vijay Erramilli, and Nikolaos Laoutaris. Detecting
price and search discrimination on the internet. In HotNets, 2012.
[71] Alan Mislove, Massimiliano Marcon, Krishna P. Gummadi, Peter Druschel, and Bobby
Bhattacharjee. Measurement and analysis of online social networks. In Proceedings of
IMC, San Diego, CA, October 2007.
114

[72] Alan Mislove, Bimal Viswanath, Krishna P. Gummadi, and Peter Druschel. You are
who you know: inferring user profiles in online social networks. In ACM International
Conference on Web Search and Data Mining (WSDM), 2010.
[73] A. Narayanan and V. Shmatikov. Robust de-anonymization of large sparse datasets.
In IEEE Symposium on Security and Privacy, 2008.
[74] A. Narayanan and V. Shmatikov. De-anonymizing social networks. In IEEE Symposium on Security and Privacy, 2009.
[75] A. M. Odlyzko. Privacy, economics, and price discrimination on the internet. In Fifth
International Conference on Electronic Commerce (ICEC), 2003.
[76] Barbara Orlandi and Frank Scahill. Wi-Fi Roaming – Building on ANDSF and Hotspot
2.0. Technical report, Alcatel-Lucent and BT, 2012.
[77] Eli Pariser. The Filter Bubble: How the New Personalized Web Is Changing What We
Read and How We Think [Paperback]. Penguin Books, 2012.
[78] Utpal Paul, Anand Prabhu Subramanian, Milind Madhav Buddhikot, and Samir R.
Das. Understanding traffic dynamics in cellular data networks. In IEEE Infocom, 2011.
[79] Feng Qian, Zhaoguang Wang, Alexandre Gerber, Z. Morley Mao, Subhabrata Sen,
and Oliver Spatscheck. TOP: Tail optimization protocol for cellular radio resource
allocation. In IEEE ICNP, 2010.
[80] Feng Qian, Zhaoguang Wang, Alexandre Gerber, Zhuoqing Morley Mao, Subhabrata
Sen, and Oliver Spatscheck. Characterizing radio resource allocation for 3G networks.
In ACM IMC, 2010.
[81] J. Ross Quinlan. C4.5: programs for machine learning. Morgan Kaufmann, 1993.
[82] Ross J. Quinlan. Learning with continuous classes. In 5th Australian Joint Conference
on Artificial Intelligence, 1992.
[83] Ashwin Rao, Yeon sup Lim, Chadi Barakat, Arnaud Legout, Don Towsley, and Walid
Dabbous. Network characteristics of video streaming traffic. In ACM CoNEXT, 2011.
[84] B. Ruhnau. Eigenvector-centrality–a node-centrality? Social networks, 22(4):357–365,
2000.
[85] K. Sankaralingam, S. Sethumadhavan, and J.C. Browne. Distributed pagerank for p2p
systems. In Proceedings of the 12th IEEE International Symposium on High Performance Distributed Computing, pages 58–68, 2003.

115

[86] Raimund Schatz, Tobias Hobfeld, and Pedro Casas. Passive YouTube QoE Monitoring
for ISPs. In Sixth International Conference on Innovative Mobile and Internet Services
in Ubiquitous Computing (IMIS), 2012.
[87] Aaron Schulman, Vishnu Navda, Ramachandran Ramjee, Neil Spring, Pralhad Deshpande, Calvin Grunewald, Kamal Jain, and Venkata N. Padmanabhan. Bartendr: A
practical approach to energy-aware cellular data scheduling. In ACM MobiCom, 2010.
[88] M. Zubair Shafiq, Jeffrey Erman, Lusheng Ji, Alex X. Liu, Jeffrey Pang, and Jia Wang.
Understanding the Impact of Network Dynamics on Mobile Video User Engagement.
In ACM SIGMETRICS, 2014.
[89] M. Zubair Shafiq, Muhammad U. Ilyas, Alex X. Liu, and Hayder Radha. Identifying
Leaders and Followers in Online Social Networks. In IEEE Journal on Selected Areas
in Communications (JSAC), 2013.
[90] M. Zubair Shafiq, Lusheng Ji, Alex X. Liu, Jeffrey Pang, Shobha Venkataraman, and
Jia Wang. A First Look at Cellular Network Performance during Crowded Events. In
ACM SIGMETRICS, 2013.
[91] M. Zubair Shafiq, Lusheng Ji, Alex X. Liu, Jeffrey Pang, and Jia Wang. A First Look at
Cellular Machine-to-Machine Traffic - Large Scale Measurement and Characterization.
In ACM SIGMETRICS/Performance, 2012.
[92] M. Zubair Shafiq, Lusheng Ji, Alex X. Liu, Jeffrey Pang, and Jia Wang. Characterizing
Geospatial Dynamics of Application Usage in a 3G Cellular Data Network. In IEEE
INFOCOM, 2012.
[93] M. Zubair Shafiq, Lusheng Ji, Alex X. Liu, Jeffrey Pang, and Jia Wang. Large
Scale Measurement and Characterization of Cellular Machine-to-Machine Traffic. In
IEEE/ACM Transactions on Networking (ToN), 2013.
[94] M. Zubair Shafiq, Lusheng Ji, Alex X. Liu, and Jia Wang. Characterizing and Modeling
Internet Traffic Dynamics of Cellular Devices. In ACM SIGMETRICS, 2011.
[95] M. Zubair Shafiq, Syed Ali Khayam, and Muddassar Farooq. Embedded Malware Detection using Markov n-grams. In International Conference on Detection of Intrusions,
Malware and Vulnerability Assessment (DIMVA), 2008.
[96] M. Zubair Shafiq, Syed Ali Khayam, and Muddassar Farooq. Improving Accuracy
of Immune Inspired Malware Detectors using Intelligent Features. In Genetic and
Evolutionary Computation Conference (GECCO), 2008.
[97] M. Zubair Shafiq, Franck Le, Mudhakar Srivatsa, and Alex X. Liu. Cross-Path Inference Attacks on Multipath TCP. In Twelfth ACM Workshop on Hot Topics in
Networks (HotNets-XII), 2013.
116

[98] M. Zubair Shafiq and Alex X. Liu. A Random Walk Approach to Modeling the Dynamics of the Blogosphere. In 10th IFIP/TC6 Networking, 2011.
[99] M. Zubair Shafiq, Alex X. Liu, and Amir Khakpour. Caching in content delivery
networks: Measurement, design, and evaluation. In ACM SIGMETRICS, 2014.
[100] M. Zubair Shafiq, Syeda Momina Tabish, and Muddassar Farooq. PE-Probe: Leveraging packer detection and structural information to detect malicious portable executables. In Virus Bulletin Conference (VB), 2009.
[101] M. Zubair Shafiq, Syeda Momina Tabish, Fauzan Mirza, and Muddassar Farooq. PEMiner: Mining Structural Information to Detect Malicious Executables in Realtime.
In 12th International Symposium On Recent Advances In Intrusion Detection (RAID),
2009.
[102] Rumah
Shahbaz.
how
high
is
your
popularity?
http://www.facebook.com/apps/application.php?id=174042725891, 2010.
12120
monthly users.
[103] Muhammad Shahzad, M. Zubair Shafiq, and Alex X. Liu. A Large Scale Exploratory
Analysis of Software Vulnerability Life Cycles. In 34th International Conference on
Software Engineering (ICSE), 2012.
[104] X. Shi, M. Bonner, L.A. Adamic, and A.C. Gilbert. The very small world of the wellconnected. In Proceedings of the 19th ACM conference on Hypertext and hypermedia
(HT), 2008.
[105] Lingyang Song and Jia Shen, editors. Evolved Cellular Network Planning and Optimization for UMTS and LTE. CRC Press, 2010.
[106] N. Rama Suri and Y. Narahari. Determining the top-k nodes in social networks using
the shapley value. In proceedings of AAMAS, 2008.
[107] Syeda Momina Tabish, M. Zubair Shafiq, and Muddassar Farooq. Malware Detection
using Statistical Analysis of Byte-Level File Content. In Workshop on CyberSecurity
and Intelligence Informatics (CSI), ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2009.
[108] Carmela Troncoso and George Danezis. The Bayesian traffic analysis of mix networks.
In ACM Conference on Computer and Communications Security (CCS), 2009.
[109] M-H. Wang V. Shmatikov. Timing analysis in low-latency mix networks: Attacks
and defenses. In European Symposium on Research in Computer Security (ESORICS),
2006.

117

[110] Yipeng Wang, Xiaochun Yun, M. Zubair Shafiq, Liyan Wang, Alex X. Liu, Zhibin
Zhang, Danfeng(Daphne) Yao, Yongzheng Zhang, and Li Guo. A Semantics Aware
Approach to Automated Reverse Engineering Unknown Protocols. In 20th IEEE International Conference on Network Protocols (ICNP), 2012.
[111] C. Wilson, B. Boe, A. Sala, K.P.N. Puttaswamy, and B.Y. Zhao. User interactions in
social networks and their implications. In Proceedings of ACM European Conference
on Computer Systems, 2009.
[112] Edwin B. Wilson. Probable inference, the law of succession, and statistical inference.
Journal of the American Statistical Association, 22(158):209–212, 1927.
[113] Ian H. Witten, Eibe Frank, and Mark A. Hall. Data Mining: Practical Machine
Learning Tools and Techniques. Morgan Kaufmann, 2011.
[114] Mike P. Wittie, B. Stone-Gross, K.C. Almeroth, and E.M. Belding. MIST: Cellular
data network measurement for mobile applications. In IEEE BROADNETS, 2007.
[115] Gilbert Wondracek, Thorsten Holz, Engin Kirda, and Christopher Kruegel. A practical
attack to de-anonymize social network users. In IEEE Symposium on Security and
Privacy, 2010.
[116] Qiang Xu, Alexandre Gerber, Z. Morley Mao, Jeffrey Pang, and Shobha Venkataraman. Identifying diverse usage behaviors of smartphone apps. In ACM IMC, 2011.
[117] Wan-Shiou Yang, Jia-Ben Dia, Hung-Chi Cheng, and Hsing-Tzu Lin. Mining social
networks for targeted advertising. In 39th Annual Hawaii International Conference on
System Sciences (HICSS), 2006.
[118] J.-H. Yeh, J.-C. Chen, and C.-C. Lee. Comparative analysis of energy saving techniques
in 3GPP and 3GPP2 systems. IEEE Transactions on Vehicular Technology, 58(1):432–
438, 2009.
[119] Yu Zhang, Zhaoqing Wang, and Chaolun Xia. Identifying key users for targeted marketing by mining online social network. In Advanced Information Networking and
Applications Workshops (WAINA), 2010.
[120] E. Zheleva and L. Getoor. To join or not to join: The illusion of privacy in social
networks with mixed public and private user profiles. In World Wide Web (WWW)
Conference, 2009.
[121] Elena Zheleva and Lise Getoor. To join or not to join: The illusion of privacy in
social networks with mixed public and private user profiles. In Proceedings of the 18th
International World Wide Web conference (WWW), 2009.

118

[122] Y. Zhu, Xinwen Fu, R. Bettati, and Wei Zhao. Anonymity analysis of mix networks against flow-correlation attacks. In IEEE Global Communications Conference
(GLOBECOM), 2005.
[123] Ye Zhu, Xinwen Fu, Bryan Gramham, Riccardo Bettati, and Wei Zhao. Correlationbased traffic analysis attacks on anonymity networks. IEEE Transactions on Parallel
and Distributed Systems, 2009.
[124] Feng Zou, Zhao Zhang, and Weili Wu. Latency-bounded minimum influential node
selection in social networks. In proceedings of WASA, 2009.

119