TOWARDS MACHINE LEARNING BASED SOURCE IDENTIFICATION OF ENCRYPTED VIDEO TRAFFIC

By

Yan Shi

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Electrical Engineering - Doctor of Philosophy

2019

ABSTRACT

TOWARDS MACHINE LEARNING BASED SOURCE IDENTIFICATION OF ENCRYPTED VIDEO TRAFFIC

By

Yan Shi

The rapid growth of the Internet has helped to popularize video streaming services, and video streaming has become the dominant type of content on the Internet. The management of video streaming traffic is complicated by its enormous volume, diverse communication protocols and data formats, and the widespread adoption of encryption. The aim of this thesis is to develop a novel firewall framework, named the Soft-margin Firewall, for managing encrypted video streaming traffic without violating user privacy. The system distinguishes itself from conventional firewall systems by incorporating machine learning and Traffic Analysis (TA) as its traffic detection and blocking mechanism. The goal is to detect unknown network traffic, including traffic that is encrypted, tunneled through a Virtual Private Network (VPN), or obfuscated, in realistic application scenarios. Existing TA methods are limited in that they can deal only with simple traffic patterns: usually, only a single source of traffic is allowed in a tunnel, and a trained classifier is not portable between network locations, requiring redundant training. This work aims to address these limitations with new techniques in machine learning.

The three main contributions of this work are: 1) new statistical features built around traffic surge periods that can better identify websites with dynamic contents; 2) a two-stage classifier architecture that solves the mixed-traffic problem with state-of-the-art TA features; and 3) a novel natural-language-inspired feature that solves the mixed-traffic problem using deep-learning methods. A fully working Soft-margin Firewall with the above distinctive features has been designed, implemented, and verified for both conventional classifiers and the proposed deep-learning based classifiers. The efficacy of the proposed system is confirmed via experiments conducted on actual network setups with a custom-built prototype firewall and OpenVPN servers. The proposed feature-classifier combinations show superior performance compared to previous state-of-the-art results. The solution that combines the natural-language-inspired traffic feature with deep learning is demonstrated to solve the mixed-traffic problem and is capable of predicting multiple labels associated with one sample. Additionally, the classifier can classify traffic recorded at locations different from where the training traffic was collected. These results are the first of their kind and are expected to lead the way toward next-generation TA-based firewall systems.

ACKNOWLEDGMENTS

I would like to express my sincere gratitude to Dr. Subir Biswas for his extraordinary support and guidance during my work. He has advised me not only on my research but also on my career and life. I feel fortunate to have worked with him and am proud to be a member of his research group. I would also like to thank my committee, Dr. Tongtong Li, Dr. Daniel Morris, and Dr. Eric Torng, for their time and support.
Thanks must be given to my lab mates Stephan Lorenz, Bo Dong, Faezeh Hajiaghajani, Saptarshi Das, Benjamin Sanda, Dezhi Feng, Brandon Harrington, Henry Griffith, Rui Wang, and Tianyi Wu, for all of the brainstorming and implementation discussions. Finally, I would like to thank my parents and family for their unconditional love, support, and encouragement throughout my life. My achievements would not have been possible without them.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
Chapter 1: Introduction
1.1 Motivation
1.2 Proposed Solution
1.3 Main Contribution
1.4 Organization of the Chapters
Chapter 2: Related Works
2.1 Introduction
2.2 Firewall Systems
2.2.1 Types of Firewall
2.2.2 Netfilter based Firewall
2.2.3 Cisco NetFlow
2.3 Privacy Enhancing Networking Protocols
2.3.1 Virtual Private Network (VPN)
2.3.2 The Onion Router Network (Tor)
2.3.3 Traffic Obfuscation
2.4 Traffic Analysis
2.5 Video Streaming Protocols
2.6 Summary
Chapter 3: Source Identification of Dynamic Web Traffic
3.1 Introduction
3.2 Experimental Setup
3.3 Traffic Pattern for Web Transactions
3.4 Feature Definition and Extraction
3.4.1 Existing Features
3.4.2 Surge Period
3.5 Performance
3.5.1 Feature Analysis
3.5.2 Website Fingerprinting
3.6 Summary and Conclusions
Chapter 4: Protocol Independent Source Identification of Video Streaming Traffic
4.1 Introduction
4.2 Problem Definition
4.3 System Architecture
4.4 Packet Arrival Interval (PAI) Feature
4.5 Experimental Setup
4.6 Experimental Results
4.7 Summary and Conclusions
Chapter 5: Characterizing Common Traffic Features
5.1 Introduction
5.2 Problem Definition
5.3 Experimental In-laboratory Setup
5.4 Feature Analysis
5.5 Experimental Results
5.6 Experiments with Commercial Service Providers
5.7 Summary and Conclusions
Chapter 6: Effects of Location Variability on Single-Stage Classifier
6.1 Introduction
6.2 Problem Definition
6.3 System Components
6.4 Experimental Setup
6.5 Feature Analysis
6.6 Experimental Results
6.7 Summary
Chapter 7: Two-stage Classifier Design for Mixed-Traffic Problem
7.1 Introduction
7.2 Problem Definition
7.3 Classifier Design
7.3.1 Two-Stage Classifier
7.3.2 Feature Set
7.3.3 Automated Feature Selection
7.4 Experimental Setup
7.5 Experimental Results
7.5.1 Feature Selection
7.5.2 Scenario #1 (Stage1_PacketCount+Stage2_Pure)
7.6 Scenario #2 (Stage1_CFS+Stage2_Pure)
7.6.1 Scenario #3 (Stage1_PacketCount+Stage2_Mixed)
7.6.2 Scenario #4 (Stage1_CFS+Stage2_Mixed)
7.6.3 Discussions
7.7 Summary
Chapter 8: Deep-Learning Based Traffic Analysis
8.1 Introduction
8.2 Deep Neural Network
8.2.1 Deep Learning in Natural Language Processing
8.2.2 Word Embedding
8.3 Language-Like Feature for Traffic Analysis
8.4 Neural Network Architecture for Traffic Analysis
8.5 Experimental Setup
8.6 Experimental Results
8.6.1 Binary Classification
8.6.2 Multi-label Classification
8.6.3 Knockout Test
8.6.4 Testing at a Different Location
8.7 Summary
BIBLIOGRAPHY

LIST OF TABLES

Table 3-1: Feature sets used in experimentation
Table 4-1: Video streaming services and server locations
Table 5-1: Protocol Identification Rate (%)
Table 5-2: Protocol/Codec Identification Rate (%)
Table 5-3: Protocol Identification Rate (%)
Table 5-4: Site Identification Rate (%)
Table 6-1: Collected video samples
Table 7-1: Confusion matrix (Stage1_PacketCount)
Table 7-2: Confusion matrix (Stage1_CFS)
Table 8-1: NLP Processing Methods Used
Table 8-2: Available Features by M. Cruz et al.
Table 8-3: Traffic Data Format
Table 8-4: Classes based on video streaming sources
Table 8-5: Sample counts by No. of Clients
Table 8-6: Training Performance
Table 8-7: AUC of Binary Classifiers using Y. Kim Method
Table 8-8: Video Streaming Sites for Testing on a Different Location

LIST OF FIGURES

Figure 1-1: Firewall circumvention via virtual private networks
Figure 2-1: Onion Router Network
Figure 2-2: Evolution of Video Streaming Protocol Adoption, 2012-2017
Figure 3-1: Experimental traffic probing arrangement
Figure 3-2: Target dynamic websites as the experimental group
Figure 3-3: Target static websites as the control group
Figure 3-4: Decomposition of the trace from an example transaction
Figure 3-5: Example of Surge Period for amazon.com traffic
Figure 3-6: Traffic rate traces for web transaction to different sites
Figure 3-7: Surge Period for dynamic and static sites
Figure 3-8: First 15 Haar components for dynamic and static site
Figure 3-9: Website fingerprinting classification workflow
Figure 3-10: Class-wise true positive rates for the control group
Figure 3-11: Class-wise true positive rates for the experimental group
Figure 3-12: Average true positive rates for both the site groups
Figure 4-1: Classification architecture for traffic analysis
Figure 4-2: Relationship between PAI and other delay metrics
Figure 4-3: Example distributions of PAI for different servers
Figure 4-4: Example distribution of IPDV
Figure 4-5: On-off patterns in video streaming traffic samples
Figure 4-6: Experimental setup with OpenVPN
Figure 4-7: Number of collected slices from different servers
Figure 4-8: Classification accuracy for source identification
Figure 4-9: PAI distribution for data sample collected at Client 1
Figure 4-10: Class-average PAI distribution at Client 1
Figure 4-11: Classification Accuracy with Sub-features
Figure 5-1: Network environment and the experimental setup
Figure 5-2: Collected video samples
Figure 5-3: Packet size distribution of 4 "Sita Sings the Blues" classes
Figure 5-4: RTMP/H.264 Intra-Group Confusion
Figure 5-5: Smooth Streaming Intra-Group Confusion (a: H.264, b: VC-1)
Figure 5-6: Long-haul experiments
Figure 5-7: Intra-site Confusion in Long-haul Experiments
Figure 6-1: Data collection and feature extraction
Figure 6-2: Setup of Network Environment
Figure 6-3: Burst characteristics of YouTube, Hulu, and Netflix
Figure 6-4: Samples of Consecutive Packet Size Pair Distribution
Figure 6-5: Results using packet size distribution feature
Figure 6-6: Results using consecutive packet size pair feature
Figure 6-7: Recognition Results for the Open-world scenario
Figure 7-1: Two-stage traffic source classification scheme
Figure 7-2: Sample distribution of packet counts
Figure 7-3: Sample packet size distribution
Figure 7-4: Experimental setup with OpenVPN
Figure 7-5: Video Streaming Providers in Dataset
Figure 7-6: Feature formatting with ground truth tags
Figure 7-7: Training the stage-1 classifier
Figure 7-8: Training the stage-2 classifier
Figure 7-9: Testing the classifiers at both stages
Figure 7-10: Optimal feature subset (stage-1) chosen by CFS
Figure 7-11: Confusion matrix (Stage2_Pure)
Figure 7-12: Performance indices for 4 Test Scenarios
Figure 7-13: Confusion matrix (Stage2_Mixed)
Figure 8-1: Extracting Language-Like Traffic Feature
Figure 8-2: Word Frequency of Traffic Feature
Figure 8-3: Hierarchical Attention Network (Figure 2 in [51])
Figure 8-4: Network Architecture by Yoon Kim (Figure 1 in [55])
Figure 8-5: Network Architecture by Mark Berger
Figure 8-6: Network Architecture by M. Cruz et al. (Figure 3 in [31])
Figure 8-7: Experimental Setup
Figure 8-8: Binary Classifier Accuracy
Figure 8-9: Effect of Embedding Size on HAN-based Binary Classifier (YouTube)
Figure 8-10: Multi-label Classifier ROC
Figure 8-11: Difference in Multi-label Classifier AUC (No. of Clients ≥ 2)
Figure 8-12: Multi-label Classifier ROC (No. of Clients ≥ 3)
Figure 8-13: Selected Results from Knockout Tests
Figure 8-14: Experimental Setup for Testing a Different Location
Figure 8-15: Classification Performance (Two Locations)
Figure 8-16: Classification Performance (Two Locations & Clients ≥ 3)

Chapter 1: Introduction

1.1 Motivation

The rapid growth of the Internet has helped to popularize various data-intensive applications. One prominent example is video streaming, which has become the dominant type of application on today's Internet in terms of traffic volume, according to a 2016 research report [1]. For example, the two most heavily accessed video streaming sites, Netflix and YouTube, constitute roughly 50% of North American Internet traffic (according to the same report). As the popularity of video streaming sites grows, their usage also spreads into private enterprises, where watching certain videos may be undesirable. For example, employers may not want their employees to watch videos containing political, sexual, or violence-related messages. Often the administrator of an enterprise network may want to block users from watching videos in order to conserve bandwidth and/or to maintain enterprise productivity. At the same time, an administrator may wish to allow users to access some video streaming sites for other business reasons. As such, the rise of video streaming applications poses an interesting problem in network traffic management.

Enforcing regulations on network traffic first requires the classification of traffic, a problem that has drawn much interest from both the networking community and industry and has been well investigated in the past. Various solutions have been proposed and applied to the problem [2], [3]. Existing solutions fall into two categories: rule-based solutions and Traffic Analysis (TA) based solutions. They are effective in their respective problem domains, and they have found applications in commercial networking devices.

Figure 1-1: Firewall circumvention via virtual private networks

However, existing solutions face several challenges. Firstly, the adversaries, who want to avoid detection, upgrade their methods over time, resulting in an arms race between them and the network administrators (an example scenario is shown in Figure 1-1). Traffic encryption technologies such as VPN [4] and The Onion Router (Tor) [5] are widely accessible nowadays, and researchers are finding new ways to masquerade traffic [6]. Therefore, the current methods need reevaluation and improvement. Secondly, the amount and diversity of network traffic have increased drastically during the last decade. Existing solutions might not scale well and need reevaluation under the new traffic conditions. Growth in video streaming traffic is arguably the most significant recent change in network traffic, but existing studies, including recent ones [7], [8], only target web browsing traffic. Analysis of heterogeneous traffic (where multiple types of network traffic occur at the same time) is also left out of the existing research, even though it happens quite often in real-world situations. Therefore, the targeted traffic types need to be extended to cover these changes. Lastly, the traffic classification function is still limited to professional networking devices [9], while a need for traffic monitoring exists outside of the professional environment (for example, from parents who want to manage the traffic going in and out of the home network).
The technology needs miniaturization and cost reduction for it to reach a wider population. At the same time, the increasing amount of data has turned this into a problem that requires the processing of big datasets. The divide between processing power and cost-effectiveness needs addressing. This research sets out to tackle these challenges by combining the latest machine-learning technology with a distributed data collection and analysis framework.

1.2 Proposed Solution

This research proposes a solution to the traffic classification problem: a new-concept firewall extension system that we will call the "Soft-margin Firewall" (SMFW) for the rest of this dissertation. The system uses machine learning and TA to classify traffic that cannot be easily classified by a rule-based firewall. The term "soft-margin" refers to the way the firewall classifies traffic. While a rule-based firewall can classify each packet with no ambiguity (within its capacity), an SMFW must base its decisions on statistical features extracted from a number of packets. Therefore, some packets might escape classification. In exchange, an SMFW can detect higher-order traffic patterns that a rule-based firewall cannot, allowing more fine-grained knowledge gathering and policy enforcement on the traffic.

Aspects of the system are designed to address the challenges that traffic classification systems are facing. The traffic classifier is based on TA methodology, which works on types of traffic that are difficult for a rule-based firewall system. Targeted traffic types include encrypted traffic, tunneled traffic, and aggregated heterogeneous traffic, as well as compound traffic in which any of the aforementioned conditions can occur at the same time. The classifier uses Deep Neural Networks (DNNs) and other machine-learning tools to accurately classify traffic.

The proposed system is also specially designed for classifying heterogeneous traffic. Having a mixture of application layer protocols in the traffic stream increases the chance of the traffic being misclassified. With a novel two-stage classifier design, only the part of the traffic that looks sufficiently homogeneous is classified. The architecture reduces the false positive rate drastically, while still being able to use a considerable amount of traffic for classification.

Lastly, the implementation of the system is an extension to the iptables firewall framework, which is an integral part of the Linux operating system. Since Linux is widely used in networking devices, an SMFW probe can be easily plugged into an existing network infrastructure. The extension only takes a minimal amount of data from the traffic and therefore can scale to high network load situations. The system can off-load computationally intensive data processing, such as feature extraction and training the neural network, to a computing cloud so that the end devices can remain cost-effective. Iptables also provides ways to act on traffic, such as blocking, throttling, or prioritization, once the SMFW makes a decision.

1.3 Main Contribution

The main contribution of the research is three-fold. It proposes the concept of the Soft-margin Firewall as a solution to the traffic classification problem. On top of proposing the idea, it also demonstrates its feasibility by showcasing a prototype implementation. In this work, we also reevaluate several state-of-the-art Traffic Analysis techniques against network traffic in contemporary settings.
The evaluations especially target video streaming traffic for its prevalence and lack of representation in the literature. The goal is to find ways to extend the capability of existing TA methods to work with video streaming protocols. The limitations of existing methods when classifying heterogeneous traffic are also explored, to provide clues for better system design. These evaluations result either in the method being incorporated into the proposed system or in improvements being made to the system design. Evaluations of two novel traffic feature extraction methods proposed by the author for the system are also presented. One method is evaluated against conventional classifiers, while the other is proposed for use with a deep neural network (DNN) based classifier that achieves better noise robustness and is capable of multi-label classification.

1.4 Organization of the Chapters

The rest of this dissertation is organized as follows. Chapter 2 reviews the existing body of research on Traffic Analysis, video streaming protocols, firewall systems, and machine-learning technology, which form the basis of the research. Chapter 3 and Chapter 4 discuss the use of the Packet Arrival Interval as well as a newly proposed feature named Surge Period in video source identification. Chapter 5 explores the feasibility of the main goal of traffic classification using existing features. Chapter 6 discusses the performance of a single-stage classifier for source identification. Chapter 7 introduces the concept of two-stage classification, which improves the performance of source identification on mixed-traffic streams. Chapter 8 introduces the DNN-based classifier and highlights its unique capabilities when working with mixed traffic and multi-label traffic data.

Chapter 2: Related Works

2.1 Introduction

This engineering solution to the problem introduced in Chapter 1 is built on top of techniques developed in several research fields, including TA, machine learning, deep learning, and firewall technology. A review of relevant technologies in those fields is provided in this chapter. In particular, developments in TA techniques are reviewed in detail. To provide pointers for improvement, we focus on the technical challenges that TA has to overcome before it can be used in practical scenarios. Reviews are also provided for the intended targets of the solution, namely VPNs and video streaming protocols.

2.2 Firewall Systems

2.2.1 Types of Firewall

A firewall is a network security system that monitors and controls the traffic coming in or going out of a certain domain [10]. Firewalls operate on top of pre-defined rules. The firewall matches traffic with these rules and executes corresponding actions when it detects a match. Firewalls can be divided into two major categories: network layer firewalls and application layer firewalls.

Network layer firewalls work at the network layer of the Open Systems Interconnection (OSI) model [11] and handle packet-level traffic. A packet is a small chunk of data with the information required for routing it to its destination (the network layer header) attached; it is the data unit of the network layer. Network layer firewalls work at a lower layer than application layer firewalls. As such, their understanding of high-level protocol information is limited, but they make faster per-packet decisions. For a network layer firewall, a "traffic flow" is a series of packet exchanges between two ports (a port is an identifier for an application running on a host) on two network hosts.
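To make the flow abstraction concrete, the following minimal Python sketch (an illustration added here, not part of the original text) groups hypothetical packet records into flows keyed by their two (address, port) endpoints; the record layout and all names are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical packet records: (timestamp, src_ip, src_port, dst_ip, dst_port, size)
packets = [
    (0.001, "10.0.0.2", 51512, "93.184.216.34", 443, 517),
    (0.004, "93.184.216.34", 443, "10.0.0.2", 51512, 1460),
    (0.009, "10.0.0.2", 40211, "8.8.8.8", 53, 74),
]

def flow_key(pkt):
    # Sort the two (ip, port) endpoints so both directions map to the same flow.
    _, src_ip, src_port, dst_ip, dst_port, _ = pkt
    return tuple(sorted([(src_ip, src_port), (dst_ip, dst_port)]))

flows = defaultdict(list)
for pkt in packets:
    flows[flow_key(pkt)].append(pkt)

for key, pkts in flows.items():
    print(key, "->", len(pkts), "packet(s)")
```

A direction-agnostic key is used here so that both halves of a bidirectional exchange land in the same flow, which mirrors how the packet-level statistics in later chapters are accumulated per flow.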
Application layer firewalls work at the application layer of the OSI model. Their data units are higher-level (a file, a stream, etc.). As a result, they can match traffic based on higher-level protocol information than network layer firewalls can. For example, Deep Packet Inspection (DPI) is a popular subcategory of application layer firewalls [12] that has seen many applications.

2.2.2 Netfilter based Firewall

Netfilter [13] offers support for packet filtering in kernel mode on GNU/Linux operating systems. It forms a firewall solution that is widely used in Linux-based networking systems. Due to its popularity among embedded Linux systems in home and small-business routers, Netfilter is available on most of the devices that the proposed firewall targets. This makes Netfilter an ideal platform for the proposed firewall. Netfilter provides an interface for third-party drivers to extend its functionality, which allows highly customized processing of packets. This extensibility is key to the prototype implementation.

2.2.3 Cisco NetFlow

The NetFlow protocol [14] supports the transmission of traffic statistics to remote data aggregators, letting network administrators collect data from many routers for analysis at a central location. This architecture of NetFlow, which separates routers and data aggregators, is a good fit for the proposed SMFW system. The author chose to base the prototype implementation of the firewall probe on an open-source implementation of Cisco's NetFlow protocol provided by the OpenWall Project.

2.3 Privacy Enhancing Networking Protocols

2.3.1 Virtual Private Network (VPN)

A VPN is a means to connect isolated private networks through the public Internet [15]. A VPN keeps sensitive information private by transmitting the network traffic through a secure tunnel. The tunnel ensures that the message body and the real source/destination information are encrypted between the two ends of the tunnel. The tunnel can be created on top of public networks, where it is possible for adversaries to tap into the traffic, yet they remain unable to decrypt the information or reveal the actual source and destination. VPNs vary in the tunneling protocol used to create the secure tunnel. Two popular protocols in use are Internet Protocol Security (IPsec) and Secure Socket Layer/Transport Layer Security (SSL/TLS) [16]. OpenVPN [17], the VPN solution used in this work, supports an SSL-based tunneling protocol.

Video streaming traffic tunneled through VPN services is difficult to identify because of the added obfuscation, padding, encryption, packet fragmentation, etc. Different VPN solutions may obfuscate traffic in different ways. Combined with the diverse types of video streaming protocols, these operations can introduce significant variation into the traffic stream. In [6], the authors survey several obfuscation techniques and categorize them into four groups: encryption, randomization (making traffic behave like random data), mimicry (masquerading the traffic to look like another protocol), and tunneling (embedding traffic as the payload of another protocol). VPNs belong to the tunneling group, but they can also employ techniques from other groups to further obfuscate traffic.
2.3.2 The Onion Router Network (Tor)

Tor is an anonymous routing protocol and network that hides the client's identity by routing each packet, encrypted in layers with the keys of multiple routers, through a chain of routers, where each router decrypts the data only enough to reveal the next hop in the routing chain. Only at the exit node can the packet be restored and routed accordingly. Because Tor changes the route randomly, it is hard to trace. The packet that Tor creates has an onion-like nested encryption structure: one layer of encryption encloses another, keeping the payload hidden at each stop. To further confuse TA attackers, Tor randomly changes routes every 10 minutes, pads data into 512-byte cells, and tries to hide the trace of the HTTPS handshake by pre-establishing the encryption keys to separate that process from the actual traffic. The Tor browser is a modified Firefox browser with HTTPOS-like capability. It is used with the Tor router to combine several TA countermeasures in order to provide maximum anonymity. A depiction of a Tor virtual route and how the packet is structured is shown in Figure 2-1.

Figure 2-1: Onion Router Network

2.3.3 Traffic Obfuscation

An example of obfuscation technology is Obfsproxy [6], a pluggable obfuscation layer developed by the same team that develops Tor. Obfsproxy extends the traffic mimicry functionality of Tor, which makes its traffic look like SSL traffic, to be able to masquerade Tor traffic as a wide range of protocols. Obfsproxy makes the identification of traffic more difficult by padding packets with misleading information, changing delays between packets, and inserting dummy packets into the traffic flow.

To keep the study focused, we choose OpenVPN, which is widely used by inexpensive personal VPN services as well as corporate VPN services, as the target platform [16]. OpenVPN does not apply additional protocol obfuscation besides tunneling. The concepts proposed in this work can scale to other streaming protocols and VPN/proxy mechanisms.

2.4 Traffic Analysis

Traffic Analysis (TA) is a machine-learning based technique for inspecting network traffic and determining its nature, such as application layer protocol, traffic source, and destination, among other properties. TA works with metadata such as packet direction, packet size, packet count, and timing. Interesting pieces of information are extracted from the traffic and combined into a feature vector in the feature space. A classifier is then trained on this feature space to distinguish between various classes. The classifier can then classify an unknown traffic trace and assign it to one of the classes. Since totally masking traffic activities is not practical (it would require the channel to be constantly active at maximum capacity to hide the activities), TA is effective in situations where extracting information from encrypted and obfuscated traffic is the goal. TA has proven effective in identifying security-enhancing protocols such as the Onion Router (Tor) protocol [3], [7], [18] and other commonplace protocols including FTP, IMAP, and HTTP [19]. The work of Herrmann et al. [2] is an early piece that established that good recognition can be achieved against a conventional VPN system. Panchenko et al. [3] targeted the Tor protocol with a combination of features, including traffic volume, packet count, and packet size, among others, and achieved a significantly higher recognition rate (55%) compared to previous works.
This is considered a breakthrough and helped popularize Tor as a target for further TA-based research, including [7], [18], [20]. At the same time, other researchers have expressed concern about the practicality of TA attacks, since most proponents use idealized scenarios that are not representative of real-world traffic [8]. All of the above works assume the role of an attacker when performing TA, and the targets are usually specific web pages, since revealing the browsing history of the target is the attacker's goal.

On the other hand, there is relatively little research on streaming video source identification. In [21], the authors present a TA-based framework for recognizing video content from streaming traffic. Other work by the same researchers [22] used a wavelet-based feature to identify video content. The application of both methods is limited because the video content needs to be known in advance, which is an unrealistic scenario for an SMFW configuration. In [23], the authors present detection results on traffic data for several popular video streaming sites with various combinations of classifiers and features. In [24], several different aspects of the feature set are explored. It identifies the packet size distribution as one of the effective classification features for identifying protocols used by tunneled video streaming traffic. It also shows that, for the same streaming protocol in a secure tunnel, there exists enough difference between various video streaming sites to permit a classifier to separate them. The work in [25] explores the Packet Arrival Interval (PAI) as a feature for video streaming source identification. A detailed analysis of the PAI feature in [25] also revealed that the network path of a traffic flow can leave a signature in the PAI. The results in [25], [26] pave the way for site-based TA applications, as attempted in this work. The classifier in this work is designed based on the previous results described in the aforementioned works. It uses the classification method that was proven to work well in this context, and the feature set includes the features that were proven to work well (packet size and PAI).

Another issue with previous works on TA is that they assume homogeneity in network traffic. In other words, the traffic flow in the tunnel is assumed to contain traffic of only one protocol type, from one source, to one destination. The reality of tunneled traffic (e.g., through an encrypted VPN or proxy server) is that it often contains traffic of different protocol types, from different sources, and to different destinations. The mixture of traffic can alter features significantly and has been identified as a limiting factor in the real-world applicability of TA [8]. This mixed-flow problem, which to the best of our knowledge has not been addressed by other works, is the target of this work. Mixed (heterogeneous) traffic in a tunnel is defined as a traffic flow in which a video streaming flow is coupled with an unspecified volume of non-video traffic in the tunnel. Pure (homogeneous) traffic in a tunnel is defined as a traffic flow that contains just one video streaming flow. Traffic that contains more than one video stream is excluded from the scope of this chapter but will be addressed in later chapters.

We propose two solutions to the mixed-flow problem. The first is based on classic machine-learning techniques and builds on the classifier documented in [27].
It is a two-stage classifier that adds a pre-filtering stage to mitigate the effect of mixed flows on classification. It has been demonstrated to classify with high confidence in the presence of BitTorrent traffic and video streaming traffic, at the expense of having fewer usable samples due to the pre-filtering. In this work, the design of the two-stage classifier is significantly augmented with new methods for feature selection (i.e., Correlation Feature Selection (CFS)) and various forms of mix-and-match training of both classifier stages. Second, a new, much larger dataset that consists of web traffic and BitTorrent traffic is included to evaluate the performance of the classifier architecture. Third, new architecture-level performance indices are developed for generalized evaluation of the two-stage classifier architecture. Finally, the Packet Arrival Interval (PAI) is analyzed in detail to reveal its contribution to the feature set in enhancing the performance of source identification.

The second solution is a DNN-based classifier. The rapid development of deep learning has attracted researchers to apply it to TA problems. The features used in deep learning are much higher in dimensionality, and technologies like word embedding help handle even higher-dimensional data with no need for manual feature engineering. The increased resolution also means that it might be possible to identify individual streams within a heterogeneous flow. There is potential in using deep-learning techniques to solve the feature engineering problem and the mixed-flow problem at once.

Researchers have applied deep-learning techniques to traffic analysis problems before. Existing research can be divided into three groups according to how the features are constructed. The first group contains studies that use purely flow-based statistical features, like the ones used in traditional TA studies. They are a natural progression from classical machine-learning technology but fail to exploit the unique capabilities of DNNs. Such studies include [28], [29]. The second group consists of studies that focus on analyzing the packets themselves. The features used in this group are commonly based directly on the packet payload and can therefore be regarded as a natural extension of deep packet inspection with deep learning; Deep-Packet is a famous example [30]. Studies in this group are less applicable to the scenario that this work aims to tackle, since analysis of the packet payload is ruled out by the presence of the encrypted tunnel. The third group is where new flow-based features are developed to take advantage of the feature selection capability of DNNs. There are few works in this group. In [31], the researchers extracted a time series of statistical features from the traffic flow. Each point in the series represents 3 seconds of traffic, and the resulting time series is used to train a multi-layer LSTM network to distinguish between two protocols (BitTorrent and plain HTTP) tunneled through an encrypted proxy. The experimental results show that the new features could yield 92% accuracy, which rivals state-of-the-art non-DNN based methods.

The proposed DNN method in this work applies several widely used DNN architectures on top of a novel natural-language-inspired feature that the author created with the intention of solving the mixed-flow problem.
This feature allows the classifier to exploit the temporal characteristics of the traffic flow. For more detailed information on developments in deep-learning based Natural Language Processing, which provided the inspiration and tools for the proposed method, please refer to Section 8.2.

The idea of integrating traffic analysis capability into firewalls has been realized in commercial products before, but existing implementations focus on the basic task of performing protocol identification on encrypted traffic. Applying fine-grained classification faces numerous obstacles, the most significant being that home and small-business routers do not have the computing power to carry out the analysis. To address this, the proposed solutions assume that the devices have access to cloud-based computing power to perform sophisticated analysis. The separation of operation and analysis is the key to keeping the deployment cost of the technology low. On the other hand, this separation also demands a novel feature format that is efficient to compute and transmit, to avoid adding overhead to the network. Both solutions are proposed with efficient implementation in restricted environments in mind, as will be detailed in the following chapters.

2.5 Video Streaming Protocols

Streaming protocols are networking protocols designed to ensure good streaming video quality under fluctuating network bandwidth limitations. They are important to the proposed solutions since the protocol has a significant effect on the features of the traffic flow. The most important change in the field in the last 10 years is the rising popularity of HTTP adaptive streaming (Figure 2-2). HTTP adaptive streaming transfers video data and control data through the HTTP protocol rather than a special-purpose one. At the beginning of the 2010s, the video streaming protocol that comes with Adobe Flash, the Real-time Messaging Protocol (RTMP) [32], was used by most providers due to the dominant position Flash held in web development. Microsoft's Smooth Streaming [33] (an HTTP adaptive streaming protocol) was used by only two providers (Amazon and Netflix). By 2017, however, Adobe Flash was no longer widely used, and HTTP adaptive streaming dominated the scene. This change is due to several reasons: 1) the HTML5 standard that HTTP adaptive streaming is based on makes both development and deployment easier, since the application will work on virtually all mainstream browsers; 2) the protocol enables providers to utilize existing HTTP servers to deliver video content, which saves cost; and 3) developer tools (for example, JavaScript frameworks) improved vastly during this time period, which also makes development easier.

Figure 2-2: Evolution of Video Streaming Protocol Adoption, 2012-2017

As the following chapters will show, the proposed methods work under all of these protocols. Experiments in Chapters 3 through 5 feature RTMP and Smooth Streaming, much like the "2012" part of Figure 2-2, while experiments in Chapter 7 and Chapter 8 feature datasets that resemble the "2017" part. Experiments in Chapter 6 were conducted at a time when YouTube was still using RTMP. The TA technique is therefore shown to be robust against changes in protocol.

2.6 Summary

The works reviewed in this chapter are crucial for the development of the proposed systems. We design the systems with the intention of eventual deployment in a firewall that can analyze the diverse and mixed traffic types in today's networks.
Existing firewall technologies provide tools for such a development. The solution is a TA-based traffic probe built on machine-learning and deep-learning technologies. However, a new TA technique has to be developed to address the critiques that existing ones have received before it can be used practically. These reviews provide pointers for the following chapters to investigate and build upon. Finally, we focus on widely used VPN systems serving the dominant traffic type on today's Internet, video streaming traffic, as a good target for testing the system.

Chapter 3: Source Identification of Dynamic Web Traffic

3.1 Introduction

Website Fingerprinting (WFP) is a form of Traffic Analysis that is widely researched since it has many administrative applications, including preventing access to forbidden websites and site-specific Quality of Service (QoS) provisioning. Previous works in this area mainly looked at the problem of identifying traffic from relatively static websites, thus limiting the applicability of the technique for websites with dynamically changing contents. In this chapter, we attempt to generalize the mechanism to dynamic sites by introducing a new classification feature, the traffic Surge Period, and by adapting the first n components of the Haar wavelet transformation, which is commonly used in traditional signal processing applications. The results in this chapter show the newly proposed features to be superior when classifying web pages with dynamic content.

3.2 Experimental Setup

In the experimental setup in Figure 3-1, a web client is used from home to browse different servers through an SSL/TLS VPN service provided by one of Michigan State University's Juniper VPN servers. All packets are encrypted through the VPN tunnel, which terminates at MSU's VPN server. The server interprets the packets sent by the client through an SSL session within the tunnel and forwards them to the intended target web server. When the target server responds, the VPN server uses the same SSL session for sending traffic back to the client. From the client application's perspective, the content of the web page comes from the target server.

The SSL/TLS tunneled traffic is probed at an egress router by the traffic capture software Wireshark. At the probing point, packets do not carry any destination-specific identifiers or addresses, since those are embedded within the application layer payload of the tunnel. To separate the traffic stream, it is assumed that the client IP address as well as the VPN server IP address is known to the probing entity, which is consistent with the QoS provisioning or site blocking applications mentioned before. The clients run Internet Explorer 11 on a Windows 7 machine to access the target web pages. Note that this arrangement also emulates scenarios in which probing is done by an ISP that intends to perform website blocking or site-specific QoS allocation based on TA-based website fingerprinting.

Figure 3-1: Experimental traffic probing arrangement

Traffic is collected for the front pages of 11 popular websites that represent dynamic content, listed in Figure 3-2. This group of websites is referred to as the experimental group. Content dynamism is ensured by accessing personal home pages at intervals long enough that the home page changes significantly in terms of text and images (newsfeeds, status updates, etc.). Quantification of such intra-site dynamism is presented in Section 3.5.
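As a minimal sketch of this separation step (an illustration, not the dissertation's actual tooling), the following Python code uses scapy to filter a Wireshark-exported capture down to the packets exchanged between assumed client and VPN-server addresses; the file name and both addresses are hypothetical.

```python
from scapy.all import rdpcap, IP  # pip install scapy

# Hypothetical addresses; both are assumed known to the probing entity.
CLIENT_IP = "192.168.1.10"
VPN_SERVER_IP = "35.9.0.1"

packets = rdpcap("capture.pcap")  # trace exported from Wireshark

# Keep only packets exchanged between the client and the VPN server.
# Record (timestamp, signed size): negative = upstream, positive = downstream.
trace = []
for pkt in packets:
    if IP not in pkt:
        continue
    src, dst = pkt[IP].src, pkt[IP].dst
    if src == CLIENT_IP and dst == VPN_SERVER_IP:
        trace.append((float(pkt.time), -len(pkt)))
    elif src == VPN_SERVER_IP and dst == CLIENT_IP:
        trace.append((float(pkt.time), len(pkt)))

print(f"{len(trace)} packets belong to the client-VPN tunnel")
```

The signed-size convention used in the comments anticipates the trace representation defined in Section 3.4.1.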
Data collection is done over a time span of 10 days, with an average of 30 samples of each page taken per day. Consecutive data collections are separated by at least 20 minutes to ensure that the web pages are sufficiently different. During the data acquisition phase, we collected 3250 valid samples. The 11 chosen sites represent major classes of websites with dynamic contents.

Figure 3-2: Target dynamic websites as the experimental group (bar chart of sample counts for www.amazon.com, www.ebay.com, www.facebook.com, news.google.com, www.linkedin.com, www.reddit.com, www.slashdot.org, www.stackoverflow.com, en.wikipedia.org/wiki/Main_Page, www.washingtonpost.com, and www.yahoo.com)

As for the control group, we also collect data for several static web pages, as shown in Figure 3-3. The web pages in the control group do not change much over time. Traces from the control and experimental groups are collected under identical conditions. In Section 3.5, we will assess the impact of dynamic pages on WFP performance with respect to the static control group.

Figure 3-3: Target static websites as the control group (bar chart of sample counts for http://www.audioheritage.org/html/history/altec/altec-1963.htm, http://rationalwiki.org/wiki/Global_warming_denialism, http://grandcanyonhistory.clas.asu.edu/sites_adjacentlands.html, http://docs.oracle.com/javase/7/docs/api/, http://rationalwiki.org/wiki/Moon_landing_hoax, http://en.wikipedia.org/wiki/Panellus_stipticus, http://en.wikipedia.org/wiki/Plato, http://tools.ietf.org/html/rfc5246, http://www.speex.org, http://www.voynich.nu/descr.html, and http://www.fcps.edu/islandcreekes/ecology/water_bear.htm)

3.3 Traffic Pattern for Web Transactions

A typical web transaction is initiated by a request from a web browser to a web server, which replies to the request with objects that are parts of a web page via the HTTP protocol. In the beginning, the client sends a request including the URL of the web page and any needed parameters to the server. The server responds to the client with the main textual content of the web page in Hypertext Markup Language (HTML) form. The client (browser) then parses the main content, discovers references to any additional resources needed to show the page, and sends one request per resource. The server responds in the same order as it receives the requests. This process continues recursively until all references are responded to, and the transaction concludes.

Several additional complexities arise from the following optimizations in the transaction process. First, at the client side, downloading and parsing often proceed in parallel for improved viewing responsiveness. This means that requests are usually sent out as soon as a reference is revealed by the parsing process. It also implies that the timing of a request should indicate the position of its reference in the HTML text stream. Second, for higher performance, a client can choose to establish several HTTP connections to download resources in parallel. Also, if the server supports HTTP pipelining, the client can send several requests in one TCP segment, and the server can respond to such requests in one batch. These optimizations can interfere with the order of request/response pairs, and the timing of requests can be mingled with response data from other connections. Third, the underlying network conditions can change the shape of traffic, especially in terms of the inter-packet delay jitter.
This means that the timing of each burst can change because of different download times. Fourth, a TCP connection can be reused for multiple requests in order to avoid the delay involved in successive connection setup and termination. This means that it is not safe to assume that one connection corresponds to one resource. If the protocol is plain HTTPS, then a connection can still reveal the actual object sizes transferred, while if the protocol is modified by the VPN client, for example by distributing one resource among multiple connections, even connection-wise burst sizes can be unreliable. Provided that encryption does not change the traffic signature drastically, all the above optimization-related traffic pattern changes also apply to encrypted web traffic within a VPN tunnel.

Figure 3-4 shows an example of a 4-second-long traffic trace for a transaction from the LinkedIn web server, collected using Wireshark as shown in Figure 3-1. Decomposition of the trace was performed as follows. In the figure, the stripe to the left marks the timing of the different HTTP requests sent by the client to the server. The middle plot area is divided into several lanes, and each TCP connection used in this transaction is plotted in one lane. A lane is further divided into two halves vertically, with the left half used for upstream traffic and the right half for downstream traffic. Dark stripes mark the points in time where the connection is downloading. The figure shows how requests are processed by different TCP connections (numbered #1 to #8) and how the traffic of different connections interleaves. To the right, the traffic of all TCP connections combined is shown. This example trace shows the complexity of the underlying process and the challenges of finding traffic signature patterns in the presence of deviations from the baseline transaction process.

3.4 Feature Definition and Extraction

Pre-processing of the probed traffic is performed following the approach in [3]. Pre-processing removes the non-TCP and pure ACK packets that might affect the results.

3.4.1 Existing Features

The term trace is used to describe a time-stamped sequence in which every packet is associated with the time it appears at the traffic probing point. It can be represented as a series of timestamps $T = (t_1, t_2, \ldots, t_n)$. We assume that the content of the traffic is generally encrypted and that only the direction and packet size are of importance. Following the convention in [2], we combine the packet size and direction into one scalar whose absolute value equals the size of the packet: all upstream packets have negative signs and downstream packets have positive signs. This series of signed packet sizes is defined as $S = (s_1, s_2, \ldots, s_n)$. The following features have been used in the literature.

Burst Size: In [18], a burst is defined as the interval during which there are only packets in one direction, preceded and followed immediately by packets in the opposite direction. There are upstream and downstream bursts. In Figure 3-4, upstream bursts are shown on the negative half of the graph while downstream bursts are on the positive half. Each time the stream switches direction from the negative half to the positive half or vice versa, a burst boundary is recorded. Each burst is marked by two boundaries, and the direction of the packets inside a burst becomes the direction of the burst itself. Burst is also used in [3], albeit named differently. It is used as part of the feature set in two different ways.
1) The number of bytes transmitted in each burst is summed up into a histogram and contained in the feature. 2) The number of packets transmitted in each burst is summed up into a histogram and contained in the feature.

Figure 3-4: Decomposition of the trace from an example transaction (time axis 0-4 s; left stripe: HTTP request timing in the HTML stream; middle lanes: TCP connections #1-#8 with upstream/downstream halves; right: combined net traffic in kB/10 ms)

First n Components of Haar Wavelet Transformation: The Haar wavelet transform is often used in signal analysis to analyze local information of a signal. We use it as a feature to represent the coarse-scale variation information of a trace. Each trace is first converted to a bit-rate series using a 100 ms time window, then interpolated to 4096 data points before transforming. We take the first 15 components of the transformed vector as the feature.

3.4.2 Surge Period

We propose a new feature, termed the Surge Period, which, unlike the packet-direction-based bursts, is defined based on the timing of surges in traffic density, i.e., of high bandwidth utilization. A Surge Period marks the parts of a traffic trace where the channel is busy transmitting back-to-back packets upstream or downstream. Any packet except the first and the last within a Surge Period should be separated from its predecessor and successor by no more than a pre-defined time window. The window size depends on the network condition and typically ranges from 10 milliseconds to several hundred milliseconds. During a web transaction, the timing of a Surge Period corresponds to one object or a group of objects being downloaded together. We speculate that, on a coarse scale, the timing of a Surge Period corresponds to the location of the references that resulted in it. This is because modern browsers want to render the contents as fast as possible and therefore send requests for resources quickly after the references are discovered by the parser.

As demonstrated in Figure 3-4, while the probed data can be decomposed into individual TCP connections, the Surge Period feature is extracted from the combined traffic pattern (the far-right column in Figure 3-4). This is done so that the feature is prepared for situations, such as Tor [5], in which the traffic cannot always be separated into individual connections. As done in [2], [3], we assume that all traffic from or to a browser client belongs to the same transaction; that is, we ignore the interleaving of multiple web transactions.

Figure 3-5: Example of Surge Period for amazon.com traffic

The Surge Period feature is extracted as follows. First, a probed trace is converted into a time-stamped series of (time, packet-size) pairs. Second, the time-stamped series is converted into a bit-rate time series computed over 100 ms non-overlapping windows. Third, the most significant Surge Periods are extracted from the bit-rate time series. To do this, we apply the following adaptive method: a continuous period of bit rate higher than a certain threshold is counted as a Surge Period, and multiple thresholds are experimented with, starting from the highest possible value and gradually descending until a threshold is reached at which more than 80% of all the bytes transferred are covered by Surge Periods. Finally, the number of bytes transmitted in a Surge Period is summed up as the 'size' of that period, and all Surge Periods are lined up according to their timing order. The top N periods by size are then chosen, where N is a tunable parameter, and the resulting vector of period sizes is used as the feature that represents the sample. If there are not enough Surge Periods even at the final threshold, the vector is padded with zeroes at the end.
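For concreteness, the adaptive extraction procedure just described can be sketched as follows. This is our own minimal Python illustration, not the implementation used in the experiments; the function name and the choice to sweep the threshold over the observed window rates are assumptions about details the text leaves open.

import numpy as np

def surge_period_feature(timestamps, sizes, bin_ms=100, coverage=0.8, top_n=10):
    # timestamps: packet arrival times in seconds; sizes: packet sizes in bytes.
    t = np.asarray(timestamps, dtype=float)
    s = np.abs(np.asarray(sizes, dtype=float))
    # Steps 1-2: convert the trace into a byte-rate series over 100 ms windows.
    idx = ((t - t.min()) / (bin_ms / 1000.0)).astype(int)
    rate = np.bincount(idx, weights=s)
    total = rate.sum()
    # Step 3: lower the threshold until Surge Periods cover >= 80% of all bytes.
    for thr in np.sort(np.unique(rate))[::-1]:
        mask = rate >= thr
        if rate[mask].sum() >= coverage * total:
            break
    # A maximal run of above-threshold windows is one Surge Period; its 'size'
    # is the number of bytes it carries.
    surges, run = [], 0.0
    for r, m in zip(rate, mask):
        if m:
            run += r
        elif run > 0.0:
            surges.append(run)
            run = 0.0
    if run > 0.0:
        surges.append(run)
    # Step 4: keep the top-N surges by size, in timing order, zero-padded.
    top = sorted(range(len(surges)), key=lambda i: surges[i], reverse=True)[:top_n]
    feature = [surges[i] for i in sorted(top)]
    return feature + [0.0] * (top_n - len(feature))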
Figure 3-5 shows an example computation of the Surge Period feature on an actual trace captured from amazon.com traffic. The bit rate vs. time graph is shown in the upper half of Figure 3-5 with a descending threshold. When the threshold is lowered to a level where 80% of the traffic is included in the surges, the threshold is fixed. With this threshold level, the surges are extracted and then rounded up to form the vector shown in the lower part of Figure 3-5. The arrows indicate the correspondence between Surge Periods and feature components.

3.5 Performance

3.5.1 Feature Analysis

In this section, we present performance comparisons of classifying websites using different features. First, we demonstrate the dynamism of the dataset we collected by presenting sample probed traffic traces for web transactions corresponding to different destination sites. Figure 3-6 shows the data rate over time during the progression of a transaction. Traffic in both directions is considered in the graphs. It can be observed that the data rate signatures for different destination sites vary significantly in terms of their burstiness, the number of bursts, and the aggregated average rate. These variations in the traffic patterns translate into distinguishable features across various target websites, thus yielding the good classification performance reported later in this section. To depict the dynamism of pages from the same website, in what follows we present intra-site feature variation by showing samples of burst sizes, Haar wavelet transformation components, and Surge Period features from various target sites.

Surge Periods for two sites are presented in Figure 3-7. One row in the graph corresponds to one site; the top row comes from the experimental dynamic group while the bottom one comes from the static control group. The feature captures the intra-site variation, as well as the obvious inter-site variation between the two sites. Samples of Haar wavelet transformation components are presented in Figure 3-8. The distinction between the two sites, and between their two access sessions, is evident for this feature. Due to space constraints, the intra- and inter-site variations in Figure 3-7 and Figure 3-8 are shown only for a limited number of sites and samples. Similar variations were observed across all experimented sites and across all sessions for the same sites.
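For reference, the first-15-Haar-components feature examined in Figure 3-8 can be computed along the following lines. The snippet is our own illustration and assumes the PyWavelets (pywt) package; resampling the bit-rate series to 4096 points by linear interpolation is likewise an assumption about the original procedure.

import numpy as np
import pywt

def haar_feature(timestamps, sizes, n_components=15, bin_ms=100, length=4096):
    t = np.asarray(timestamps, dtype=float)
    s = np.abs(np.asarray(sizes, dtype=float))
    # Bit-rate series over 100 ms windows, as in Section 3.4.1.
    idx = ((t - t.min()) / (bin_ms / 1000.0)).astype(int)
    rate = np.bincount(idx, weights=s)
    # Resample to a fixed power-of-two length before transforming.
    x = np.interp(np.linspace(0.0, len(rate) - 1.0, length),
                  np.arange(len(rate)), rate)
    # Full Haar decomposition; coefficients are ordered coarsest-first, so the
    # leading entries carry the coarse-scale variation of the trace.
    coeffs = np.concatenate(pywt.wavedec(x, 'haar'))
    return coeffs[:n_components]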
Figure 3-6: Traffic rate traces for web transactions to different sites (bytes vs. time, 0-20 s, for amazon (sample #11), aol (sample #75), linkedin (sample #134), stackoverflow (sample #15), AltecLansing (sample #15), GlobalWarm (sample #14), MoonLanding (sample #36), and Mushroom (sample #75))

Figure 3-7: Surge Period for dynamic and static sites (period sizes in kB for amazon samples #15 and #175, and Java7Doc samples #5 and #59)

Figure 3-8: First 15 Haar components for dynamic and static sites (linkedin samples #50 and #112, and MoonLanding samples #20 and #100)

3.5.2 Website Fingerprinting

We use the Bayesian network classifier (BayesNet) from the Weka [34] machine learning library for executing WFP. As shown in Figure 3-9, the classifier is trained offline with individual features from Table 3-1 or combinations of them. The training is supervised, and the features extracted from each trace are marked with a class label that represents the URL the trace is from. There are 11 classes (i.e., URLs) in the experimental set and 11 classes in the control set. A full list of the feature sets experimented with is provided in Table 3-1.

Table 3-1: Feature sets used in experimentation

1  Histogram of burst sizes
2  First n components of Haar wavelet transform (n = 15)
3  Surge Period (N = 10)

Figure 3-9: Website fingerprinting classification workflow

To test the performance after the training phase, we do a pseudo-random 2:1 training/test partitioning before the classifier is trained, then train the classifier with the training set and test it with the test set. The whole process is shown in Figure 3-9.

Figure 3-10: Class-wise true positive rates for the control group (accuracy, 0.7-1.0, for classes 1-Altec, 2-RationalWiki, 3-GrandCanyon, 4-Oracle, 5-RationalWiki2, 6-Wikipedia, 7-Wikipedia2, 8-IETF, 9-SPEEX, 10-Voynich, 11-FCPS; bars for burst size, burst size + surge period, and burst size + Haar)

Given their lower intra-site variations in the traffic signatures, it is expected that the classification results for the static sites would be better than those for the dynamic sites. We first experiment with the static sites under three feature scenarios, namely, burst size, burst size plus Surge Period, and burst size plus Haar. Class-wise results for the control group are shown in Figure 3-10. Observe that all three feature scenarios generally work very well for static website fingerprinting. This is consistent with previous results published in [2], [3], [18]. The next step is to apply the same feature scenarios to the experimental dynamic sites. Class-wise results for the dynamic sites are shown in Figure 3-11.

Figure 3-11: Class-wise true positive rates for the experimental group (accuracy, 0.8-1.0, for classes 1-amazon.com, 2-ebay.com, 3-facebook.com, 4-news.google.com, 5-linkedin.com, 6-reddit.com, 7-slashdot.org, 8-stackoverflow.com, 9-en.wikipedia.org, 10-washingtonpost.com, 11-yahoo.com; bars for burst size, burst size + surge period, and burst size + Haar)

It should be observed that when the traditional burst size feature is used alone, the fingerprinting performance degrades compared to the static sites shown in Figure 3-10. For certain sites (e.g., sites 2 and 7), the loss of performance was substantial.
The reason for such loss is the high intra-session diversity for the same target site, as depicted in Figure 3-6. Figure 3-12 shows results when the burst size feature is aided with our proposed Surge Period feature and with the components of the Haar transform. It is evident that these new features improve the overall fingerprinting performance by being able to characterize and identify the inter-site variation in the probed traffic data.

Figure 3-12: Average true positive rates for both the site groups (accuracy, 0.82-1.0, for the experimental and control groups under burst size, burst size + surge period, and burst size + Haar)

Figure 3-12 reports the overall true positive rates averaged over all the sites within each group. These figures confirm the general observations made for the individual site-specific performance reported in Figure 3-10 and Figure 3-11. Two summary observations can be made. First, fingerprinting performance on the static (i.e., control) sites is generally better than on the dynamic experimental sites. Second, adding the new features, namely the Surge Period and the components of the Haar transform, improves results for both static and dynamic sites, although more notably for the dynamic group.

3.6 Summary and Conclusions

We have presented a characterization and performance enhancement of a website fingerprinting mechanism. Using experiments carried out over an SSL VPN, it was first shown that a widely used classification feature, namely traffic burst size, is not able to perform well when the target website contents are sufficiently dynamic. This effect was mitigated by using a new classification feature, the traffic Surge Period, and by adapting the first n components of the Haar wavelet transformation, which is commonly used in traditional signal processing applications. The presented results show that the addition of these features can bridge the fingerprinting performance gap between static and dynamic websites. Overall, the true positive rates for dynamic website fingerprinting were enhanced from about 87% with traditional features to up to 97% by leveraging the new features. The results in this chapter help to establish the experimental environment that will be used in other parts of this thesis and provide benchmark results for reference as the focus turns to the classification of video streaming protocols.

Chapter 4 : Protocol Independent Source Identification of Video Streaming Traffic

4.1 Introduction

In this chapter, we test the hypothesis that identifying the source of traffic is equivalent to identifying the route that the traffic travels (given that one end of the traffic route is fixed at the firewall). It is hypothesized that the packet arrival interval (PAI) would be useful in identifying the route. A variation of the PAI feature is proposed in this chapter to capture the characteristics of both the route and the traffic source. We will test this feature in a video source detection context. The chapter also provides an analysis of the nature of the PAI feature, revealing that it can be used to estimate the distribution of the inter-packet delay variation (IPDV), an established end-to-end metric of network condition, without measuring end-to-end packet delay.

4.2 Problem Definition

In this chapter, we address the problem of identifying encrypted and tunneled video traffic at source-level granularity.
Such identification should be performed in a manner that is agnostic to the underlying streaming protocol, coding, and actual content. We are particularly focused on streaming video traffic to achieve our overarching goal: to identify and block traffic from a specific video-streaming source (e.g., Netflix, YouTube) passing through encrypted tunnels. Blocking should occur at the firewall through which the encrypted tunnel passes.

4.3 System Architecture

The proposed system for streaming video traffic source detection is shown in Figure 4-1. The system comprises three major components: 1) a data collection agent that is inside the domain guarded by a firewall; 2) a traffic probe that is placed at the firewall; and 3) a data processing engine at the firewall.

Figure 4-1: Classification architecture for traffic analysis

The data collection agent within the domain is used for collecting traffic for training purposes. For training data collection, sessions from the agent are created to different external video streaming servers, one at a time, through the same VPN server. Collected traffic samples are distinctly labeled with the corresponding server (i.e., YouTube, Netflix, etc.) so that the samples can be used for supervised training of a classifier. Agents in our experiments are computers that run automated scripts so that the training data is periodically collected at different times. Such temporal diversity in data collection is needed to ensure that the classifiers can cope with changing network conditions over time.

The classification engine in Figure 4-1 uses the labeled reference samples collected by the collection agent for training and generating classification models. Such models are then used for classifying unknown traffic probed from the video downloaded by real clients. If a classified video-streaming server appears in the enterprise's blacklist, the firewall attempts to block or throttle the traffic from that server. In this chapter, we focus only on the source identification part. The mechanisms for throttling/blocking will be handled later.

4.4 Packet Arrival Interval (PAI) Feature

The objective of the arrangement in Figure 4-1 is to be able to classify video streaming servers in a protocol- and content-independent manner. Being able to identify a specific client-server route would satisfy that objective. Since an encrypted tunnel prevents any route information from being directly revealed to the firewall, a traffic signature or feature that indirectly indicates the route information is needed. The most common features are packet and burst size, packet count, single-trip delay, round-trip delay, and Packet Arrival Interval (PAI). Since burst and packet sizes depend on the underlying streaming protocol and the video content, they are not applicable for protocol- and content-independent server identification. The single-trip delay is ruled out because it requires coordination between the firewall and the destination server, as the probe can be placed only at the client end. In addition, since the server needs to be identified only by observing traffic at the client end, there is no way of measuring round-trip delay. This leaves the Packet Arrival Interval (PAI) as the primary feasible feature for the source server classification scenario. In what follows we show that PAI contains much useful timing information, which contributes to its usefulness in source server classification.
Figure 4-2 depicts two consecutive packet transmissions from the server to the client. The packets are sent from the server separated by a generation interval $g$, they traverse the route with path delays $d_1$ and $d_2$, and they reach the client-side firewall at time instants $T_1$ and $T_2$. The PAI is defined as $\Delta T = T_2 - T_1$. The other timing parameter of interest is the Inter Packet Delay Variation (IPDV) [35], represented by $D = d_2 - d_1$. The parameter $g$ is termed the Inter Packet Generation Delay (IPGD), which depends on the specific video content and coding methods. The path delays $d_1$ and $d_2$ represent the properties of the specific route, including the congestion situations in the intermediate routers. The quantity $\Delta T$ depends on both $g$ and $D$. From Figure 4-2, $\Delta T = g + D$ when the packets arrive in order (i.e., when $g + D \geq 0$). Considering both in-order and out-of-order cases, and noting that the measurement captures only the magnitude of the interval, PAI can be generalized as:

$\Delta T = |g + D| = |g + (d_2 - d_1)|$     (4-1)

Figure 4-2: Relationship between PAI and other delay metrics

Both $g$ and $D$ can be modeled as stochastic processes. Figure 4-3 depicts example PAI distributions for a video being downloaded from Amazon, YouTube, and Netflix. The traffic samples presented in the figure are taken from 1-minute slices of actual traffic data from the mentioned sites, and the distribution of PAI is extracted over all consecutive packet pairs. Note that the distributions are clearly bimodal, with a stable saddle point at approximately 0.66 ms between the two modes. The general observation here is that PAI, as observed at the firewall near the client, generally shows a bimodal distribution irrespective of the content and of the streaming protocol and video coding combination used by the server.

To understand the distribution, we start by examining the shifted gamma distribution [36], which is an accepted realistic model for expressing an end-to-end route delay. This model assumes that the path delays (i.e., $d_1$ and $d_2$) have a constant component and another component that follows a gamma distribution. It works well when all links in a path have similar delays. The probability density function of this delay model is shown in Eqn. (4-2), where $\delta$ is the shift while $\alpha$ and $\beta$ are the respective shape and rate factors of the gamma distribution:

$f(x) = \dfrac{\beta^{\alpha} (x - \delta)^{\alpha - 1} e^{-\beta (x - \delta)}}{\Gamma(\alpha)}, \quad x \geq \delta$     (4-2)

With the assumption that the end-to-end delays are independent, the probability density function of IPDV, $f_D(t)$, can be derived from Eqn. (4-2) as its autocorrelation function. Because of the symmetrical nature of IPDV, the part of the PDF where $t < 0$ is a mirror of the part where $t > 0$:

$f_D(t) = \int_{-\infty}^{\infty} f(x)\, f(x + t)\, dx = f_D(-t)$     (4-3)

The shape of $f_D(t)$ is shown in Figure 4-4. It is Gaussian-like because the underlying delay is the sum of several independent, identically distributed random variables and is therefore subject to the central limit theorem; for comparison, a dotted line shows a Gaussian density function with identical parameters. Since $D$ depends on the properties of an end-to-end route, it can be assumed stationary as long as the network conditions change with a time constant that is larger than the classifier-training interval. The author of [37] points out that the stationarity assumption can be effective for up to an hour, which is sufficient for the classifier model to be updated with new training done based on data collected within the hour.

Figure 4-3: Example distributions of PAI for different servers

Observe that the distribution in Figure 4-4 has only one mode, while the PAI distribution in Figure 4-3 has two. The mode in Figure 4-4 actually corresponds to the lower-time mode in Figure 4-3; because the measurement captures only absolute values, the corresponding mode in Figure 4-3 represents only half of the mode in Figure 4-4.
The other mode in Figure 4-3 is contributed by the Inter Packet Generation Delay (IPGD), which can also generally be assumed to be stationary when the coded video generation is modeled as an on-off process, as shown below.

Figure 4-4: Example distribution of IPDV (α = 5, β = 2×10⁻⁶), with a Gaussian density of identical parameters shown for comparison

To investigate this further, we examine probed (i.e., at the firewall in Figure 4-1) traffic samples from various video-streaming services, as depicted in Figure 4-5. A common visual pattern across all the samples is that the traffic switches between two distinct states: one with short periods of high intensity, and the other with longer durations of lower intensity. This alternating behavior arises because data chunks in video streaming protocols are downloaded in bursts, and Figure 4-5 generally confirms it. The observation can be captured using an on-off model [38], under which the packet generation process can be assumed to be stationary while the system is in either of the two states. The IPGD can then be modeled as the two-state mixture in Eqn. (4-4), where $p_{on}$ is the probability of the source being in the on state and $f_{on}$ and $f_{off}$ are the stationary generation-interval densities in the two states:

$f_G(g) = p_{on} f_{on}(g) + (1 - p_{on}) f_{off}(g)$     (4-4)

Note that the position of the first peak in Figure 4-3 is close to 10 μs. This is because, for a 1 Gbps node, the transmission (serialization) delay of a typical IP packet (i.e., 1500 bytes) is about 10 μs. We can use this value as an estimate of the on-state generation interval. The shape of this peak carries the IPDV information with little distortion, because the highly concentrated on-state IPGD distribution acts essentially as a Dirac impulse, so the shape of the peak is very similar to that of $f_D(t)$. The second peak in Figure 4-3 exists because the source occasionally switches to the off state, either because packets are queued but not yet sent, or because the streaming protocol has finished transmitting a data block and is waiting for the time to transmit the next block.

Figure 4-5: On-off patterns in video streaming traffic samples

To summarize, the above analysis confirms the statement in Eqn. (4-1), which claims that the Packet Arrival Interval (PAI) feature captures information about both the Inter Packet Delay Variation (IPDV) and the Inter Packet Generation Delay (IPGD). While IPDV reflects network conditions, IPGD indicates application-layer properties. The hypothesis here is that, with such information diversity about the underlying streams, PAI would be able to classify video sources with high enough accuracy for our application.
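The bimodality predicted by this analysis can be reproduced numerically. The sketch below is our own illustration with arbitrarily chosen parameters (not measured values): path delays are drawn from the shifted gamma model of Eqn. (4-2), generation gaps from a simple on-off mixture in the spirit of Eqn. (4-4), and PAI is formed per Eqn. (4-1). The resulting distribution shows one mode near the 10 μs serialization delay and a second mode produced by the off-state gaps.

import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Shifted-gamma path delays, Eqn. (4-2): constant shift plus a gamma part.
delta, alpha, scale = 20e-3, 5.0, 2e-6        # illustrative values only
d1 = delta + rng.gamma(alpha, scale, n)
d2 = delta + rng.gamma(alpha, scale, n)

# On-off generation gaps, Eqn. (4-4): back-to-back packets (~10 us) in the
# on state, occasional long inter-block pauses in the off state.
on = rng.random(n) < 0.9
g = np.where(on, 10e-6, rng.exponential(50e-3, n))

pai = np.abs(g + (d2 - d1))                   # Eqn. (4-1)

# The mass splits into two modes around the ~0.66 ms saddle point.
print("below 0.66 ms:", np.mean(pai < 0.66e-3))
print("above 0.66 ms:", np.mean(pai >= 0.66e-3))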
4.5 Experimental Setup

Figure 4-6 depicts the experimental setup in which video is streamed from multiple streaming servers to two distinct client locations, both through an OpenVPN server. OpenVPN, a popular open-source VPN server, uses Secure Socket Layer (SSL) encryption. In OpenVPN version 1.5, a tunnel can be either UDP or TCP based; we use the UDP tunnel. The server is set up within Michigan State University's Engineering network. Client-1 is placed within MSU's Spartan Village apartment network, which is a part of the broader MSU network. An encrypted OpenVPN tunnel is created from the client to the VPN server, and the videos are streamed from commercial streaming servers such as YouTube, Netflix, etc. (see Table 4-1) to the client via the tunnel. Traffic traces are collected at the client using the Wireshark [39] traffic probe.

Figure 4-6: Experimental setup with OpenVPN

For Client-2, a similar arrangement is used except that it is located at a home that is completely outside of MSU's network. The combinations of two client locations and many video-streaming servers provide many possible end-to-end network paths for investigating the performance of source identification under diverse network conditions.

Since most modern video streaming services use their own content distribution networks, the actual server machine from which the content is streamed depends on the location of the client. Table 4-1 depicts the actual geographical locations (obtained using IP-to-location mapping services) for the services when accessed from the two client locations in Figure 4-6. This client-dependent server location adds additional end-to-end path diversity, positively contributing towards the server identification process.

Table 4-1: Video streaming services and server locations

Service      | Client 1          | Client 2
DailyMotion  | New York City     | New York City
CNN News     | Ann Arbor, MI     | Ann Arbor, MI
Fox News     | Cambridge, MA     | Unknown
YouTube      | Mountain View, CA | Mountain View, CA
Netflix      | Southfield, MI    | Ann Arbor, MI
Amazon       | Seaford, Delaware | Seaford, Delaware
Vevo         | Ann Arbor, MI     | Ann Arbor, MI

For each server in Table 4-1, a randomly chosen list of video titles was downloaded at the clients at different times of the day. The collected traffic streams are sliced into 1-minute chunks and a feature vector is extracted from each slice. 10-fold cross-validation is used when testing the performance. Three different classification algorithms, the J48 tree classifier, SVM, and 1-Nearest Neighbor, are used to train models with the labeled feature vectors. Note that the Wireshark probe collected the traffic traces in the above setup without any background traffic at the client. In other words, the downloaded video streams are the only traffic present at the client. Video source identification in the presence of background traffic will be a topic of future investigation.

4.6 Experimental Results

Feature Extraction: As per the analysis in Section 4.4, we use the measured Packet Arrival Interval (PAI) distribution at the client end as the classification feature. We use 50 bins, with the smallest one covering 0-3 μs and each subsequent bin having an upper limit 1.5 times larger than that of the previous one; the largest bin covers 56.6 s to 85 s. This spread covers most of the interesting activity of the traffic stream. Upstream packets and downstream packets are counted separately, and the resulting features are concatenated together, so the actual feature distribution has 100 bins. Traffic streams are divided into 1-minute slices, where a slice is a record of the time and direction of each packet within that one minute. The set of slices is filtered by network activity before features are extracted: only those slices that have more than 1500 packets/min are included in the feature set. The distribution data from all 100 bins are used both for classifier training and during the classification process. We collected more than 60 hours of data at each client location. The number of slices collected at each location is listed in Figure 4-7.

Figure 4-7: Number of collected slices from different servers (per-service sample counts, 0-1200, for each client)
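A sketch of this feature computation follows (our own Python rendering; the helper name, the per-direction pairing of consecutive packets, and the normalization step are assumptions about details the text leaves open):

import numpy as np

# Exponential bin edges: the first bin covers 0-3 us, and each subsequent
# upper limit is 1.5 times the previous one (50 bins per direction).
EDGES = np.concatenate(([0.0], 3e-6 * 1.5 ** np.arange(50)))

def pai_feature(timestamps, directions):
    # timestamps: arrival times in seconds; directions: +1 downstream, -1 upstream.
    t = np.asarray(timestamps, dtype=float)
    d = np.asarray(directions)
    parts = []
    for direction in (1, -1):
        gaps = np.diff(t[d == direction])     # PAI between consecutive packets
        counts, _ = np.histogram(gaps, bins=EDGES)
        parts.append(counts / max(1.0, counts.sum()))
    return np.concatenate(parts)              # 50 + 50 = 100 bins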
Classification Accuracy: Classification results, in terms of successfully identifying the video streaming servers in Table 4-1, are reported in Figure 4-8. The high true positive rates for both clients indicate that PAI as a feature contains enough information about the end-to-end network path to uniquely identify the routes between specific client-server pairs. It should be noted that this high true positive rate is obtained when the classifier is trained and tested with different video contents, thus providing a source-identification mechanism that is content-agnostic.

Figure 4-8: Classification accuracy for source identification ((a) true positive rates, 86-96%, and (b) false positive rates, 0-3%, for the J48, SVM, and 1-NN classifiers at both clients)

Figure 4-8 also shows false positive rates. The low FP values confirm that when the proposed method is eventually used for blocking undesirable video servers, very few legitimate streams will be blocked mistakenly.

Sub-feature Analysis: As analyzed in Section 4.4, the PAI feature includes information about both IPDV, which reflects network conditions, and IPGD, which reflects application-level properties. In this subsection, we investigate the individual contributions of those two components, or sub-features, to the classification process. After analyzing the PAI distribution of many samples, it was found that the time 0.66 ms approximately represents the boundary between the two modes shown in Figure 4-3. This signifies that all data below 0.66 ms in the PAI distribution represents IPDV, and all data above 0.66 ms represents IPGD.

Figure 4-9: PAI distribution for data samples collected at Client 1 ((a) from Amazon.com, (b) from Netflix.com, (c) from YouTube.com)

Figure 4-9(a) shows the PAI distribution of the samples collected at Client 1 from Amazon.com. Each horizontal line in the plot represents one sample, and the shade of a pixel represents the value of one particular bin in the PAI distribution for that sample: black represents 0, white represents 1, and other shades of gray represent values in between. The two vertical lines indicate the 0.66 ms division between IPDV and IPGD. Figure 4-9(b) and Figure 4-9(c) depict similar results for Netflix.com and YouTube.com. These, as well as similar plots for other servers, show a very similar boundary between the two sub-features.

In Figure 4-10, the mean vectors of the slices from all three sites in Figure 4-9 are plotted as histograms. For a given site, the bin values of all samples are averaged and the resulting mean distribution is plotted. These graphs confirm the significance of the 0.66 ms division point illustrated in Figure 4-3 and Figure 4-9.

Figure 4-10: Class-average PAI distribution at Client 1 (probability vs. time, 0 to 85 s, for Amazon, Netflix, and YouTube, with the 0.66 ms division point marked)

Classification results with the individual sub-features are shown in Figure 4-11. The following observations can be made. First, IPGD serves as a better classification sub-feature than IPDV. Second, the true positive and false positive numbers in Figure 4-11(b) are close to those in Figure 4-8. This means that IPGD, by itself, would have been an effective classification feature for video source identification.
Figure 4-11: Classification accuracy with sub-features ((a) IPDV and (b) IPGD; true positive and false positive rates, 0-100%, for the J48, SVM, and 1-NN classifiers at both clients)

However, neither IPGD nor IPDV is directly measurable at the client; they are available only at the server side. In our setting, since the video source needs to be identified at the client side using only the information available there, the Packet Arrival Interval (PAI) serves as the only feasible feature. The above analysis demonstrates that PAI, in fact, embeds two very important classification sub-features, namely IPDV and IPGD, both of which contain appreciable classification ability.

4.7 Summary and Conclusions

This chapter investigated the Packet Arrival Interval (PAI) as a classification feature for identifying the sources of tunneled video streaming traffic. The structure of the PAI feature was analyzed to reveal its composition and its potential as a server-identifying feature. Extensive experiments were conducted with an OpenVPN server and a number of commercial video streaming services to evaluate the efficacy of PAI as the classification feature. In addition to validating its efficacy, we have also demonstrated that PAI embeds two very important classification sub-features, namely the Inter-Packet Delay Variation (IPDV) and the Inter-Packet Generation Delay (IPGD), both of which have good classification abilities.

Chapter 5 : Characterizing Common Traffic Features

5.1 Introduction

In this chapter, we explore the characterization of the factors that influence the performance of TA when identifying sources of tunneled video streaming traffic. Such identification can be used in enterprise firewalls for blocking unauthorized viewing of tunneled video. We attempt to characterize and evaluate the impacts of the primary influencing factors, namely the streaming protocol, the codec, and the actual video content. Analysis of data obtained from the test environment shows that the streaming protocol provides the most dominant source identification distinction. Also, while the codecs provide some weak distinctions, the influence of video content is marginal. In addition to the in-laboratory experiments, a real-world verification of those observations is made with commercial streaming service providers. Such "long-haul experiments" indicate that the end-to-end network conditions between the streaming server and the video client can act as an additional influencing factor for TA towards video stream source identification. Overall, the results suggest the feasibility of TA for unknown video stream source identification with diverse video examples.

5.2 Problem Definition

The problem tackled in this chapter is the characterization of Traffic Analysis (TA) for identifying sources of tunneled video streaming traffic. Its purpose is to answer the question of whether network traffic contains information to identify its source. Such identification is useful in enterprise firewalls for blocking unauthorized viewing of tunneled video. While we attempted to answer the question with mathematical analysis in Chapter 4, here we attempt to characterize and evaluate the impacts of the primary TA-influencing factors experimentally. Three major factors are evaluated: the streaming protocol, the codec, and the actual video content, all while using packet size distribution as the feature.
A test environment is built so that we can control these factors. As a second step, another factor that is important in this context, the route between the server and client computers, is evaluated as well. The corresponding experiments are named "long-haul experiments" because of the long, artificially created path between the servers and the clients. This step is designed to answer the question of whether signatures of the server location can be detected in traffic while all other parameters are identical.

5.3 Experimental In-laboratory Setup

Network Configuration: An experimental setup is built with the following components: 1) a privately-owned streaming server running Linux, and 2) a client computer that runs Windows. Both machines are connected to the same router and are therefore just one hop from each other. The server runs both an instance of "C++ RTMP Server" (crtmpserver) and a modified instance of "GStreamer Streaming Server" (GSS) that supports the newest version of the Smooth Streaming protocol and recognizes VC-1 encoded video files. To allow the client to stream the video, an Adobe Flash-based player, JWPlayer, is used to stream from crtmpserver, while the reference implementation of the Smooth Streaming player found in the Silverlight Media Framework (SMF) is used for streaming from GSS. The client and server communicate through an internal VPN server that runs the IPSec-based Juniper VPN service. The entire setup is shown in Figure 5-1.

Figure 5-1: Network environment and the experimental setup

Traffic Data Collection: Traffic traces are collected by running the Wireshark protocol analyzer on the client machine. Only the traffic between the client and the testbed server is collected, while other packets are dropped. Since the server is not running services other than video streaming (no more than a single HTTP browsing session), and the client is not running any significant networking application other than the video streaming players, we consider the traffic to be free from background noise.

Several public-domain or Creative Commons (CC) licensed movies are chosen for this experiment, and they are converted to different formats to be served on both crtmpserver and GSS. The complete list of video files and formats is shown in Figure 5-2, where Smooth Streaming is shown as "SStream". Care is taken to choose video files with diverse qualities and picture sizes. We chose "Sita Sings Blues" to be encoded into two different picture sizes (originally 852x480 pixels, reduced to 500x282 pixels) to study the effect of quality on recognition results. Video files are converted into the H.264/AAC format by the Handbrake video converter with a fixed keyframe interval of 2 seconds, or into the VC-1/WMA format by Microsoft Expression Encoder (MEE). While MEE encodes a video file into multiple quality levels when encoding for Smooth Streaming, only one randomly chosen file from the encoded files is used for each video class to represent the diversity of real-life video files.

Data Preprocessing: Packet size distribution is used as the classification feature following its usage in [2]. Samples are sliced into 1-minute chunks and a packet size distribution is extracted from each chunk. Features are labeled by their respective video file sources. The number of samples extracted is also shown in Figure 5-2. The sample count is in general proportional to the original length of the video, with some variance because of missing data. Half of the H.264/AAC portion of "Sita Sings Blues" is collected with a video file that has a 640x480 frame size.
This is done to test the effect of content format on the features. A detailed description of the extraction method can be found in Section 5.4.

Figure 5-2: Collected video samples (per-format sample counts, 0-3500, for Ooh Diese Ferien, Elephant's Dream, Deadlock, Insecurity, Oceania, Big Buck Bunny, Route 66, and Sita Sings Blues, in the RTMP/H.264/AAC, RTMP/H.264/AAC (640x480), SStream/H.264/AAC, and SStream/VC-1/WMA formats)

5.4 Feature Analysis

Both RTMP and Smooth Streaming are TCP based, and therefore leave the task of splitting data into packets to the underlying layers. Although the splitting is transparent to the application and transport layer protocols, results have shown that the packet size pattern is distinctive enough for recognition of not only the application layer protocol, but also webpage content and the website accessed.

Figure 5-3: Packet size distribution of 4 "Sita Sings Blues" classes (grayscale map of the 10-bin signed size histograms, -1500 to +1500 bytes, for the RTMP/H.264, RTMP/H.264 (width 500), SStream/H.264, and SStream/VC-1 classes)

The distribution of packet sizes is obtained by dividing all the packets in a sample into different bins using their direction and size. The number of packets in each bin is counted and a histogram is built. In this chapter, 10 bins that evenly divide the range [-1500, +1500] bytes are used to calculate the feature. The sign of the packet size indicates the direction of the packet (minus is upstream, plus is downstream). The choice of 10 bins stems from concerns about the robustness of the features as well as the computational cost. Before feature computation, the stream first passes through a filter that removes pure TCP ACK packets, following the practice in [3]. Histogram calculation follows, and then each histogram is normalized, i.e., the value of each bin is replaced by its value divided by the sum of all bins. This mitigates the influence of traffic load on the feature.

A few example features are shown in Figure 5-3, in which all samples of the four classes that belong to "Sita Sings Blues" are plotted as a grayscale map. The x-axis corresponds to the 10 bins in the histogram, while the y-axis is the count of the samples. Samples are divided into 4 classes by horizontal lines that are most clearly visible at the right edge of the figure, and each area carries a label that names the class it corresponds to. A dark stripe indicates that the histogram has a high value in that bin. Even before classification, we can see that the classes are clearly divided into 2 groups: the RTMP/H.264 group and the Smooth Streaming/VC-1 or H.264 group. The difference between the two can be spotted on the map even without the help of a classifier. This corroborates the established fact that Traffic Analysis can be effective at recognizing the application layer protocol.

To get higher resolution and more consistent results, a classifier must be trained. A Support Vector Machine (SVM) is used in [3], but [23] finds that packet size distribution also works well with other types of classifiers. A J48 tree classifier is used here because it is quick to train and has shown good performance. To simulate the real-world situation where the classifier has to deal with a large test set relative to a small training set, the dataset is split between the training set and the test set at a 15%-85% ratio: only 15% of the dataset is used for training, while the rest is used for testing.
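The complete pipeline for this feature, from ACK filtering to normalization, can be sketched as follows (our own illustration; the 54-byte cutoff used to drop pure ACKs is a simplification of the filtering rule in [3]):

import numpy as np

def packet_size_feature(sizes, directions, n_bins=10):
    # sizes: packet lengths in bytes; directions: -1 upstream, +1 downstream.
    s = np.asarray(sizes, dtype=float) * np.asarray(directions)
    # Drop pure TCP ACKs (simplified: header-only packets of about 54 bytes).
    s = s[np.abs(s) > 54]
    # 10 equal-width bins over the signed range [-1500, +1500] bytes.
    counts, _ = np.histogram(s, bins=n_bins, range=(-1500.0, 1500.0))
    # Normalize so the bins sum to 1, removing the traffic-load effect.
    return counts / max(1.0, counts.sum())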
5.5 Experimental Results

We first train a J48 tree classifier using 15% of the dataset and then test it with the remaining 85%. We seek answers to the following questions: how good is the classifier at distinguishing between 1) different protocols, 2) different codecs encapsulated in the same protocol, and 3) different video content for the same protocol/codec combination?

Table 5-1: Protocol identification rate (%)

Protocol  | As RTMP | As SStream
RTMP      | 93.8    | 7.2
SStream   | 4.8     | 95.1

The first result concerns the protocol combinations and is shown in Table 5-1. The classifier recognizes the two protocols very well, with more than 90% accuracy. This is to be expected, since not only is the protocol identification ability already established by earlier studies, but the two protocols are also visibly different in terms of their features.

In the second set of results, shown in Table 5-2, classes are grouped according to their protocol/codec combinations. Each group is assigned a number (1~3 in the first column), and the percentage of that class being classified as the nth group is shown in the column "As #n". The majority of RTMP/H.264 samples are classified correctly, which is expected. The interesting point in this table is that, while the overall performance is worse than that of RTMP/H.264, the two Smooth Streaming based groups are also separated relatively well. Considering that the codec only affects the traffic load (the size of slices) in the Smooth Streaming protocol but not the underlying HTTP protocol, a result like this is unexpected; the reason for this difference is still under investigation. However, given the higher than 90% accuracy in the protocol identification case, the difference between the two codecs is localized within the Smooth Streaming protocol group. This can also be confirmed from the confusion rates between class #2 and class #3, which are significantly higher than the confusion rates between class #1 and either of the two Smooth Streaming classes. The distinguishing impact of the codec is clearly visible but weaker than that of the streaming protocol. This is one of the key observations: the streaming protocol is more influential than the codec for distinguishing video sources using traffic analysis when packet size distribution is used as the classification feature.

Table 5-2: Protocol/codec identification rate (%)

    Protocol/Codec | As #1 | As #2 | As #3
1   RTMP/H.264     | 94.8  | 3.8   | 1.4
2   SStream/H.264  | 3.1   | 85.7  | 11.2
3   SStream/VC-1   | 2.4   | 20.1  | 77.5

We also want to know the distinctiveness of the feature among different video files of the same protocol/codec combination. For this, we calculate the conditional probability of a certain class being classified as another class in the same group. Let $G$ be the set of all classes within one protocol/codec group, and let each class be identified as $c_i \in G$. Then, with $N_{ij}$ denoting the number of samples of class $c_i$ classified as class $c_j$, the intra-group confusion probability can be expressed as:

$P(c_j \mid c_i) = \dfrac{N_{ij}}{\sum_{c_k \in G} N_{ik}}, \quad c_i, c_j \in G$     (5-1)

Figure 5-4: RTMP/H.264 intra-group confusion

Following this calculation method, we compute Figure 5-4 and Figure 5-5, which show the confusion within the RTMP group and the Smooth Streaming group, respectively. It is clear from the results that the classifier faces difficulty in separating classes with the same protocol/codec combination. This is an indication that the degree by which the classes overlap in the feature space is
high. Judging by the performance, the content does have a minimal impact on the feature, but most of the samples do not deviate much from their protocol/codec combination (hence the good performance with protocol/codec).

Figure 5-5: Smooth Streaming intra-group confusion (a: H.264, b: VC-1)

Special attention is paid to the two "Sita Sings Blues" classes within the RTMP group but with different picture sizes. The confusion between these two classes is only marginally higher than the confusion between other classes. This shows that a change in quality makes a video stream almost as different as another video stream within the same protocol/codec group.

5.6 Experiments with Commercial Service Providers

To test whether the results obtained from the controlled scenarios in Section 5.5 scale to real-world long-haul scenarios, the following experiment, shown in Figure 5-6, is set up. In this setup, the client is put into a separate network that is at least 10 hops away from the VPN server to allow network variations. Collecting data from the internal test server is replaced by collecting from the commercial servers of YouTube, Netflix, and Hulu, while the VPN server in use is still the same Juniper VPN server described in Section 5.3. This setup allows us to view the difference between the single-server setup in Section 5.3 and a multi-provider scenario. Two of the three chosen sites run the same protocol/codec (RTMP/H.264 is used by both Hulu and YouTube) but run from different locations. This introduces additional variability in terms of network delay and jitter caused by the different end-to-end server-client routes. In all, 10 video titles are randomly chosen from the three video streaming sites. For Hulu, we have chosen a playlist that randomizes its content in each playback session, in addition to the other two randomly chosen titles. Data has been collected in a similar fashion as in Section 5.3. Considering the additional intermediate nodes, as well as the fact that all 3 sites are hosted on multiple servers and potentially in different data centers, this setup is more complex and practical than the isolated test environment used in the previous sections.

We focus on the same results presented in Section 5.5. The RTMP group in Table 5-3 comprises all video samples from YouTube and Hulu, while the SStream group contains all samples from Netflix. The recognition rates between the two groups shown in Table 5-3 are around 86%, which is lower compared to the isolated test results, but still decent. The result of identifying samples from different sites is shown in Table 5-4. Note that this result was not presented in the isolated test scenario in Section 5.5, because there is only one server in that case. Interestingly, the confusion between Hulu and YouTube is low, even though they share the same protocol/codec pair. The observation that the recognition rate between Hulu and YouTube is higher than what can be obtained from protocol and codec alone (shown in Section 5.5) can be explained using the following hypothesis: the network condition (i.e., in terms of end-to-end delay, jitter, etc.) of the route between the streaming server and the client has an influence on the traffic analysis feature. If true, this would be good news for a firewall that intends to block certain sites while not affecting others that use the same protocol/codec combination. The hypothesis needs to be verified further in future work in an isolated environment by artificially introducing network condition variations.
Figure 5-6: Long-haul experiments

Table 5-3: Protocol identification rate (%)

Protocol  | As RTMP | As SStream
RTMP      | 86.3    | 13.7
SStream   | 13.2    | 86.8

Table 5-4: Site identification rate (%)

    Site    | As #1 | As #2 | As #3
1   Hulu    | 80.4  | 5.6   | 14
2   Netflix | 12.1  | 86.8  | 1.1
3   YouTube | 9.1   | 10.7  | 80.2

Results on multiple videos from the same site are shown in Figure 5-7. They show a trend similar to that of the isolated test scenario in Section 5.5: the classifier still has difficulty recognizing individual video titles, but the accuracy is statistically better than random guessing, indicating a minor influence from the video content. The figures are higher than those in the isolated test, suggesting that the samples from the video streaming service providers are less homogeneous. A possible reason for the higher heterogeneity is that the providers run multiple servers to serve their content.

Figure 5-7: Intra-site confusion in long-haul experiments

To summarize, the results from the long-haul experiments with commercial video streaming servers corroborate the basic findings from the in-laboratory isolated testing in Section 5.5; that is, the streaming protocol, the codec, and the video content have classification influence in decreasing order. Additionally, it also turns out that the underlying network condition can be an important factor that influences the classification feature. Further work is needed to characterize and understand it better.

5.7 Summary and Conclusions

The goal of this chapter is to answer the question of what factors contribute to the effectiveness of the features used in Traffic Analysis based streaming video source identification. The study focuses on the analysis of encrypted tunneled video streaming protocols. Normalized packet size distribution is used as the feature for measurement. The results show that the video streaming protocol has the most distinctive impact on the feature among the three factors studied. In addition, the video codec and the video content are also identified as sources of video source distinctions: while the protocol has the greatest impact on feature distinctiveness, the impact of the codec is less significant but clearly visible, and individual content contributes only marginally. These observations provide insights into the effectiveness of Traffic Analysis methods and provoke questions about the nature of the variations. A real-world verification of the observations is made by analyzing data collected from major video streaming service providers over a long-haul network. Not only do the results corroborate the observations made in the isolated study, they also reveal the difference between a single-server streaming test environment and the real-world massively distributed streaming environment. The samples become less homogeneous in this case, as samples from different service providers show more diversity than the protocol and codec measured in the test environment can explain.

Chapter 6 : Effects of Location Variability on Single-Stage Classifier

6.1 Introduction

This chapter explores the capability of a single-stage classifier to identify video streaming traffic, as in a conventional Traffic Analysis setup. This will provide a basis for constructing other, improved classification schemes. The classifiers and features used in this chapter are chosen following several state-of-the-art works in this field. The experimental setup is unique in that it collects data between two client locations and one VPN server.
The dataset obtained from the experiment is used to verify the generalizability of the classifier between different geographical locations. Results show that the features are affected by the client location, and therefore training data for a firewall has to be collected locally.

6.2 Problem Definition

This chapter aims to build a TA-based blocking framework in which site-specific signatures are identified first by analyzing traffic from different video streaming sites, and then such signatures are used for blocking traffic. The sites subject to this study include YouTube, Netflix, and Hulu. We review several popular features used by established works in the TA field, evaluate their effectiveness against video traffic, and develop a framework for recognizing video streaming sites. We also present the performance of applying TA methods to traffic from or to new places that are unknown to a trained classifier.

6.3 System Components

In Figure 6-1, the flow of data in a typical TA application involving video streams is presented up to the point of feature extraction. As the client computer accesses a video-streaming server through a secure tunnel, the firewall is able to obtain the timing, size, and direction of each packet. The traffic then goes through the preprocessing stage to become a time-stamped series of packet sizes. Packet sizes in the timed series are signed to indicate their direction: upstream packets are assigned negative sizes and downstream packets are assigned positive sizes. A variety of feature extraction methods are then applied to the preprocessed data to obtain the features. Finally, a classifier is trained using such features to identify specific content and/or patterns in the video stream.

Figure 6-1: Data collection and feature extraction

For example, to recognize traffic from YouTube, the firewall needs to collect enough samples of traffic from YouTube vs. non-YouTube streams, and then extract the relevant features as shown in the figure. A classifier is then trained using the extracted features. When unknown stream traffic is detected, the same feature extraction process as in the training phase is applied to the unknown traffic to extract features for classification. The classifier determines the source of the traffic to be YouTube or non-YouTube. Based on the classification results, the firewall can then take further actions, including throttling or blocking the stream, or providing priority bandwidth allocation to it.

6.4 Experimental Setup

Network Configuration: We set up an experimental environment to implement the case in which an enterprise firewall collects traffic from clients inside its domain. As shown in Figure 6-2, the setup includes two clients, one inside the same class B subnet (MSUNet) as the VPN server (a Juniper VPN server) and another outside of the class B network. The setup introduces a variation in client positioning with respect to the network; the intention is to be able to observe the influence of the network structure on the features described in Figure 6-1. The Juniper VPN server supports both IPsec-based and SSL-based VPN; we choose the SSL-based VPN service for the described experiments. Traffic traces are collected at different times of the day (morning, afternoon, and night) in order to capture realistic network traffic conditions. We collect data across a 10-day period with three sessions each day.
Figure 6-2: Setup of Network Environment

Choice of Sites and Content: In this chapter, we focus the effort on discerning YouTube traffic from the traffic of a few other popular video streaming sites and from HTTP based web traffic of other popular websites. The type of content may have an influence on how well the content can be compressed by the codec and, therefore, on its immediate bitrate. In general, simple images are compressed better than complex images, and stills are compressed better than dynamic scenes. To take content type into consideration, we choose videos from different sites. The sites and the video clips we choose from each site are listed in Table 6-1. The sites are deliberately chosen to represent the most popular video streaming sites and the types of protocols in use. Hulu is chosen as an example of a video streaming site using the same streaming protocol as YouTube (RTMP), and Netflix is chosen because it is the most popular video streaming site in North America. Netflix also uses Smooth Streaming, a streaming protocol different from what YouTube uses.

Table 6-1: Collected video samples

Site      Video Title                              Type
YouTube   Epic funny cats 20 mins                  Home Video
YouTube   10 Weird Facts about Human Evolution     Slides
YouTube   Lucy Official Trailer #1                 Movie
YouTube   Katy Perry – Birthday                    Music Video
YouTube   Underwater Bullets at 27,000fps
Hulu      Bones season 9 episode 21                TV series
Hulu      Deadbeat season 1 episode 4              TV series
Netflix   Futurama season 1 episode 1              TV series
Netflix   The Silence of the Lamb                  Movie

Collection of traffic: Video-streaming traffic to a client machine is usually accompanied by other types of traffic caused by activities such as browsing websites, checking E-mails, or downloading data from FTP sites. From the video traffic standpoint, these are background traffic. For the reported experiments in this chapter, we assume that there is no background traffic while the video plays, because video streaming is usually so bandwidth intensive that most of the packets going through the channel belong to the video stream being watched. We also assume that web browsing is the only type of other traffic from which the video needs to be distinguished. This is because, on the Internet, web browsing happens to be the most dominant network usage other than watching streaming video [1]. For representative web browsing, we collect traffic traces from popular websites including Facebook, Yahoo, and Google.

Data preprocessing: The goal for the presented experiments is to be able to distinguish YouTube traffic from other video streaming traffic, such as Netflix or Hulu, and from web browsing traffic. After collecting packet traces for different traffic in different sessions, each such trace is sliced into one-minute-long chunks that can be used for classifier training and validation purposes. Such slicing is performed for both video and non-video web traffic. For each network layer packet within a slice, its size, direction, and timing are recorded. Classification features are then constructed using those packet-specific parameters.
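As a minimal sketch of this preprocessing, the following Python fragment turns a raw packet trace into the signed, time-stamped series and one-minute slices described above. The record layout and helper names are assumptions made for illustration; only the signing convention (upstream negative, downstream positive) and the slice length come from the text.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    t: float        # arrival time in seconds (hypothetical record layout)
    size: int       # packet size in bytes
    upstream: bool  # True if the packet travels from client to server

def to_signed_series(packets):
    """Signed, time-stamped series: upstream sizes negative, downstream positive."""
    return [(p.t, -p.size if p.upstream else p.size) for p in packets]

def slice_series(series, slice_len=60.0):
    """Cut a signed series into non-overlapping slices (60 s each, per the text)."""
    slices = {}
    for t, s in series:
        slices.setdefault(int(t // slice_len), []).append((t, s))
    return [chunk for _, chunk in sorted(slices.items())]
```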
6.5 Feature analysis

Due to the tunneling of traffic through VPN, features of the traffic must be built from the two basic properties that are not concealed by tunneling: the size and the timing of each packet. In this section, we evaluate the traffic analysis features established in [2] and [40].

Packet Size: A video server splits a data stream into network layer data packets before sending the stream to its clients. In certain conditions, the upper layer protocol can be directly discerned by looking at characteristic data packet lengths; for example, the characteristic 54-byte ACK packets are present in large numbers when the transport layer protocol is TCP [3]. Identification of even higher layer protocols can be achieved when one looks at the distribution of the different packet sizes present in the traffic. Packet size may also be influenced by the specific implementation of a protocol. As previous works show, packet size is also useful in identifying the website a certain traffic trace is from, especially when the underlying protocol is known.

In our experiments, we calculate the packet size distribution from the data collected for each 1-minute video slice of all the collected sessions. We record the distribution of packet sizes for each video in a 40-bin histogram. The bins are evenly distributed in the range of -1500 bytes to +1500 bytes, with negative values indicating upstream traffic and positive values indicating downstream traffic. The histograms are then normalized before being stored as the final feature.
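A minimal sketch of this feature computation, assuming the signed size series produced earlier; numpy's histogram does the binning:

```python
import numpy as np

def packet_size_histogram(signed_sizes, n_bins=40, limit=1500):
    """Normalized histogram of signed packet sizes over [-1500, +1500] bytes."""
    counts, _ = np.histogram(signed_sizes, bins=n_bins, range=(-limit, limit))
    total = counts.sum()
    return counts / total if total > 0 else counts.astype(float)
```

Normalizing by the total count makes the feature independent of the absolute traffic volume in a slice.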
In Figure 6-3, the burst characteristics of YouTube, Hulu, and Netflix traffic are presented. The first three graphs reveal that all three of them download the video in relatively large chunks instead of a continuous stream of packets. YouTube traffic is characterized by two periodically recurring bursts, each appearing every 30 seconds in the graph. Traffic from Hulu has an average burst interval of about 10 seconds, while Hulu frequently switches into a second mode of operation, in which it receives a continuous stream of traffic with a lower peak bitrate than occurs in the burst mode. Traffic from Netflix has shorter burst intervals, and both the interval and the size of a burst change frequently. The problem with using bursts as a feature is that bursts can be influenced by the end-to-end route. In the bottom part of Figure 6-3, we present another trace of the same video on Netflix. The only difference is that the client can only access Netflix over a longer route. The burst characteristic is almost invisible here. Considering the above observations, we exclude the distribution of burst sizes as a feature for the video server identification/classification attempted in this chapter.

Figure 6-3: Burst characteristics of YouTube, Hulu, and Netflix (panels: burst mode, commercial, continuous mode, continuously playing)

Consecutive packet size: A consecutive packet size pair is defined as the 2-tuple of two packet sizes appearing one after another in a traffic trace. In [40], the 2-tuples are constituted of packets appearing immediately adjacent to each other. The reason behind the effectiveness of this feature in [40] is the abundance of recurring phrases in natural languages and the variable bit-rate compression that leaves a trace of information about the original voice signal even after encryption. For other protocols that do not use variable bit-rate compression, consecutive packet sizes may still be effective as an indication of the protocol, because many protocols have recurring sequences of packets, much like recurring phrases in natural languages.

Figure 6-4: Samples of Consecutive Packet Size Pair Distribution (panels: Hulu, YouTube, Netflix, and web traffic; both axes in bytes)

We calculate this feature as follows. The packet size pairs are recorded in a 20x20-bin 2-D histogram. The bins are distributed evenly from -1500 bytes to +1500 bytes for each element of the pair. The resulting histogram is normalized before constructing the final feature.
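A sketch of the pair-feature computation under the same assumptions as the earlier histogram sketch; numpy's 2-D histogram does the binning:

```python
import numpy as np

def packet_pair_histogram(signed_sizes, n_bins=20, limit=1500):
    """Normalized 20x20 histogram of (size_i, size_{i+1}) pairs of signed sizes."""
    first, second = signed_sizes[:-1], signed_sizes[1:]
    counts, _, _ = np.histogram2d(first, second, bins=n_bins,
                                  range=[(-limit, limit), (-limit, limit)])
    total = counts.sum()
    return counts / total if total > 0 else counts
```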
Some of the sample features are shown in Figure 6-4. Darker points on the map correspond to a denser distribution of packet size pairs around that bin. The distinctiveness of the feature can be clearly observed from these sample distributions in Figure 6-4. While RTMP is used by both Hulu and YouTube, YouTube traffic has fewer small packets followed by another small packet, while Hulu traffic has more. Hulu traffic also has more middle-sized packets compared to other sites.

6.6 Experimental results

Experiments are done in the following two phases. During the first phase, the focus is put on the performance of distinguishing YouTube traffic from the rest of the sites. This is done using either the packet size distribution or the consecutive packet size pair as the classification feature. In this phase, the training dataset and the test dataset contain traces collected from client #1 only (see Figure 6-2). For the non-video traffic class (HTTP accesses), we put traces from the same set of popular websites in both the training set and the test set. This phase is named "closed-world" since the classifier does not deal with unknown locations or unknown websites. The second phase focuses on the performance of distinguishing YouTube traffic in an "open-world" scenario. In this phase, unlike in the closed-world scenario, a classifier is trained using traces collected from client #1 but tested with traces collected from client #2. Adding to the difficulty of the problem, the non-video traffic traces in the test dataset are collected from a completely different list of websites than those in the training set. The goal of this two-phase setup is to best mimic the real-world situation that the detection engine will face: classifying traffic from a large number of websites to many internal clients with only a limited set of traces to train the classifier.

We also present the performance on traffic collected across different physical sites. The features for training are extracted from the data obtained from client #1 according to the methods introduced in Section 6.5. The trained classifiers are then tested on features extracted from data obtained from client #2. This is an open-world scenario since not all the clients are known to the classifier a priori.

Figure 6-5: Results using packet size distribution feature (TP-rate and FP-rate of the Y and NY classes vs. training split percentage, for the BayesNet, SMO, RandomForest, and 1-NN classifiers)

Results using packet size feature: Results obtained from dataset #1 are shown in Figure 6-5. The following abbreviations are used: TP-rate = true positive rate; FP-rate = false positive rate; Y stands for the YouTube class while NY stands for the non-YouTube class. All other figures in this section follow the same convention. The "split" percentage indicates how much of the data set is used to train the classifier. The entire dataset contains 4019 samples, of which 546 samples are collected from YouTube (1 sample = 1 minute). By splitting 1% for training, the training set contains only 40 minutes of data, of which 6 minutes are from YouTube. The results show that the packet size distribution has good generalization performance in the closed-world scenario even when the training set is small. Only 3% of the data (120 minutes in all, 17 minutes of which are from YouTube) is enough for the classifier to yield a good result on the rest of the dataset. The problem with having a small training set is that the false positive rate (FP rate) tends to go up. The FP rate for the YouTube class needs special attention because we want to minimize the innocent victims of the classifier. We can see from the results that the BayesNet classifier and the 1-Nearest Neighbor classifier yield lower FP rates than the SMO classifier and the J48 tree classifier. Again, a 3% split is enough for these two classifiers to yield FP rates as low as 10^-3.
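The closed-world procedure can be reproduced in outline with any machine learning toolkit. The thesis uses Weka; the following scikit-learn sketch with stand-in classifiers (RandomForest and 1-NN) conveys the idea of training on a small split and testing on the remainder. The binary labeling (1 = YouTube, 0 = non-YouTube) and the variable names are assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

def closed_world_eval(X, y, split=0.03, seed=0):
    """Train on a small 'split' fraction of the data, test on the rest."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=split, stratify=y, random_state=seed)
    for clf in (RandomForestClassifier(random_state=seed),
                KNeighborsClassifier(n_neighbors=1)):
        clf.fit(X_tr, y_tr)
        pred = clf.predict(X_te)
        tp = np.mean(pred[y_te == 1] == 1)  # TP rate for the YouTube class
        fp = np.mean(pred[y_te == 0] == 1)  # FP rate for the YouTube class
        print(type(clf).__name__, f"TP={tp:.3f} FP={fp:.3f}")
```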
Results using consecutive packet size pair feature: Following the analysis done in Section 6.5, the consecutive packet size pair preserves the temporal relationship of packet sizes and is expected to be effective in distinguishing the traffic of different sites. The results of experiments with this feature are shown in Figure 6-6.

Figure 6-6: Results using consecutive packet size pair feature (TP-rate and FP-rate of the Y and NY classes vs. training split percentage, for the same four classifiers)

The results show that while still effective, the feature performs slightly worse than the packet size distribution. The added temporal information does not contribute to the recognition rate.

Results for different locations/sites: In Figure 6-2, we introduced the experimental setup, which includes two clients at two different physical sites. The goal of this setup is to evaluate whether classification/detection can still work when a classifier is trained with data from a specific physical location in the network and the test data comes from a different location. We first collect video traffic data on client #2, then collect web traffic data on client #2 on the next day. This forms a data set consisting of 1219 traces, of which 408 are from YouTube, 309 are from Hulu, and the rest are web-browsing traces from various non-video websites. Netflix data is absent here because of a technical issue. None of the non-video websites is included in the site list used when collecting data from client #1. Combined with the unknown client, this makes the classification problem more difficult. The packet size distribution is extracted from the dataset for recognition because the previous experiments have shown that it works better than the consecutive packet size pair. Additional tests also revealed that the positive rate for the consecutive packet size pair in this case is usually lower than 1%, making it unfavorable. We then use data collected from client #1 to train a classifier and apply it to the data set from client #2. The results are shown in Figure 6-7. Following the convention in the earlier part of this section, we provide results with classifiers trained with different training set sizes up to 10% of the size of the data set from client #1.

The performance in this situation is significantly worse than that obtained from the closed-world tests. The results show an interesting behavior: larger training sets do not necessarily produce better results in this case. For the SMO and 1-NN classifiers, the TP rate of the YouTube class is lower than 30% even with a 10% split, while for the RandomForest classifier, the TP rate for YouTube stays low until the training data size reaches a 10% split. The problem this phenomenon signifies is that while an increased training set size makes the classifier perform better on data obtained from the same location, it makes the classifier perform worse on data obtained from other places.

Figure 6-7: Recognition Results for the Open-world scenario (TP-rate and FP-rate of the Y and NY classes vs. training split percentage, for the same four classifiers)

We also investigated the behavior of the RandomForest classifier with training data sets as large as 30% of dataset #1, which means more than 1200 minutes of data is used for training. The results show that the recognition rate does not correlate well with the increase in training set size. While the TP rates are not high, the FP rates at higher split percentages are very low for the BayesNet classifier and the RandomForest classifier. This is good news for a detector that wants to detect video traffic. In such applications, the number of samples is so large that misclassifying part of the samples as negative does not affect the performance, while misclassifying legitimate traffic as positive affects other uses of the network and is a more severe problem. Another interesting point here is that the most favorable classifiers are different from those of the closed-world tests.

6.7 Summary

This chapter addresses the problem of detecting video streaming traffic tunneled through VPN or proxy servers. It was shown that with Traffic Analysis (TA) based recognition methods, it is possible to recognize video streaming traffic tunneled through a certain VPN service with a low false positive rate, even when the classifier deals with traffic from hosts that are unknown in the training process. It was also shown that the packet size distribution feature works better for video streaming protocols than the consecutive packet size pair distribution. The results indicate the possibility of a broad range of applications in the field of rule enforcement and surveying. The results also show that TA methods can be successfully applied to protocols other than HTTP with decent results when the same-location rule holds.

Chapter 7 : Two-stage Classifier Design for Mixed-Traffic Problem

7.1 Introduction

In this chapter, we present a novel two-stage classifier design that tackles the mixed-traffic problem. The mixed-traffic problem (also called heterogeneous traffic in this work) is defined as the traffic analysis problem in which the traffic contains more than one type of protocol. Traditional methods tend to lose performance in the presence of mixed traffic. In this chapter, however, a two-stage classifier is shown to classify traffic with higher accuracy at the expense of some discarded samples.
7.2 Problem Definition

The mixed-traffic problem in this chapter is defined as the Traffic Analysis of tunneled traffic in which a video streaming traffic flow (the main target for identification) and a web traffic flow (considered a distraction) are mixed together. The presence of the web traffic flow is assumed to be sporadic, so there will be small pockets of undisturbed video streaming traffic. The two-stage classifier is designed to take advantage of this fact. Experiments are designed to test this hypothesis, as well as to verify the design of the two-stage classifier.

7.3 Classifier Design

The key part of the architecture is a novel classifier, which is shown in Figure 7-1. This section explains its components in detail.

Figure 7-1: Two-stage traffic source classification scheme

7.3.1 Two-Stage Classifier

Since the composition of a network traffic flow can change over time, the classifier evaluates traffic in small time slices, which are termed samples. Each sample is a temporal portion of a traffic flow recorded within a certain period. The proposed classifier is split into two stages. In the first stage, a classifier is fed a combination of features extracted from the sampled traffic flow, and it determines whether the sample corresponds to mixed traffic flows or a pure flow. For a pure sample, the second stage determines the sample's video source. Only those samples that are classified as pure in stage-1 are passed on to stage-2. Two conditions are assumed for the classifier to work: 1) the target video streaming traffic forms the majority of the traffic in a tunnel, and 2) non-video streaming traffic occurs only sporadically. These assumptions are based on the observation that video consumption demands a high level of user attention; therefore, the web traffic will be sporadic. A sample may be mixed or pure depending on the composition of traffic in the tunnel during that period. We assume 50% of the video watching sessions are accompanied by web browsing activities. We believe these assumptions to be reasonably close to real human behavior.

One goal of the 2-stage classifier is to maximally reduce the chance of false positives. False positives are costlier since they disrupt normal services going through an SMFW. Following the example of [41], this is done in the stage-1 classifier by setting the cost of false positives higher than that of false negatives; in the stage-2 classifier, it is done by first converting the n-class classifier into n binary classifiers and then assigning the false positive detection (detecting a certain video streaming class when its traffic is not present in the data) a higher cost than the false negative detection. Since there are 7 classes in the experiment, the stage-2 classifier is actually an ensemble of 7 binary classifiers. The ensemble is based on a max-confidence rank algorithm, which takes the classifier output that has the highest confidence rank as the final output. For a RandomForest [42] classifier, that means the output that has the most popular votes wins. The classifier is designed to detect traffic samples from the targeted video streaming providers with a minimal amount of false positives.
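A condensed sketch of the stage-2 ensemble logic, using scikit-learn RandomForests in place of the Weka implementation actually used in this work; the max-confidence rule picks the positive binary classifier with the highest vote fraction, and samples rejected by all classifiers fall through to a default class. The 0.5 acceptance threshold and the omission of the cost weighting (shown separately later) are simplifications:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

class MaxConfidenceEnsemble:
    """One binary RandomForest per video source; highest-confidence positive wins."""
    def __init__(self, n_classes=7, default_label=8):
        self.forests = [RandomForestClassifier() for _ in range(n_classes)]
        self.default_label = default_label

    def fit(self, X, y):
        for i, forest in enumerate(self.forests):
            forest.fit(X, (y == i + 1).astype(int))  # one-vs-rest binary labels
        return self

    def predict(self, X):
        # Column 1 of predict_proba is the vote fraction for the positive class.
        conf = np.column_stack([f.predict_proba(X)[:, 1] for f in self.forests])
        labels = conf.argmax(axis=1) + 1
        labels[conf.max(axis=1) < 0.5] = self.default_label  # rejected by all
        return labels
```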
7.3.2 Feature Set

For each packet within an encrypted tunnel, the SMFW can observe a 3-tuple: {time, direction, size}. Time is the observation timestamp, direction is a binary value indicating the direction of that packet (either from inside the SMFW to the outside, or vice versa), and size is the number of bytes in that packet. Converting the raw data to this tuple series is the first step in feature extraction. Although the tuple series is already a lot smaller than the raw traffic data, it can still be quite large, and the raw parameters may lack the robustness to compensate for the pseudo-randomness of network traffic. Thus, practical features are usually combinations of one or more statistical characteristics of the tuple series. Criteria for a good feature set include: 1) distinctiveness: the feature should be able to capture the difference between the various traffic classes; 2) robustness: the feature set should minimize the influence of the inherent randomness in traffic; and 3) ease of computation: it should fit in a resource-limited environment such as a router. From the collected {time, direction, size} information, the following features are computed.

Figure 7-2: Sample distribution of packet counts

Packet Count: The number of packets in each sample is treated as one of the classification features. Data collected from the experiment shows the distribution of the packet count to be appreciably correlated to the presence of a mixture in video streaming traffic. A few examples of the difference in packet count distribution between mixed traffic and pure traffic are shown in Figure 7-2. Amazon Prime video traffic and BlinkX video traffic, collected with or without mixed-in web traffic, are shown. While it is more evident in the case of Amazon that pure traffic has a higher packet count, BlinkX traffic also shows the same tendency. This trend has been generally observed for other video sources and different traffic mixes. The difference between pure and mixed traffic is due to rate throttling when a client initiates multiple download sessions. The video streaming client can detect that the network interface is crowded and choose to play a lower bit-rate version of the video instead, which results in less video streaming traffic and, in turn, a lower overall packet count [43]. The traffic in Figure 7-2 is captured outside the tunnel, and it reflects the difference in traffic clearly.

Packet Size Distribution: The distribution of packet size is obtained by counting the number of packets in a sample as they fall into different bins according to their direction and size. This results in a histogram. In this chapter, 30 bins that uniformly divide the range [-1500, +1500] bytes are used to compute the feature. The packet size has a sign associated with it to indicate the direction of the packet (minus is upstream, and plus is downstream). In order to remove bias, before feature computation, all TCP ACK packets, identified by packet sizes below a threshold of 53 bytes, are removed from the stream [3]. The feature histogram is then calculated and normalized, i.e., the value of each bin is replaced by its value divided by the sum of all bins. This mitigates the influence of traffic load on the feature.

Figure 7-3: Sample packet size distribution

Several sample features collected from the experimental environment are shown in Figure 7-3. Samples from 3 different sites, both with and without the background traffic (labeled as Mixed and Pure, respectively), are shown. The variation of the feature between different sites is clearly visible. Background traffic has an observable influence on the feature.
Packet Arrival Interval (PAI): PAI is defined as the distribution of inter-packet time. In what follows we show that PAI contains useful timing information, which can be helpful in source server classification. Please refer to Chapter 4 for a detailed analysis of the PAI feature.

7.3.3 Automated Feature Selection

We use an automated process by which the most relevant features in a feature set can be selected by a feature selector algorithm for a particular classification problem. Due to the high number of features involved in Traffic Analysis, and the low signal-to-noise ratio of some features, feature selection is conducted automatically rather than manually. A naive approach would be to generate different subsets of the original feature set using all possible combinations of attributes, train classifiers based on them, and measure the performance of those classifiers against a common test set. The subset that performs best would win the selection process. However, the cost of such a search would be too high if the original feature set contains many attributes. Therefore, a more practical approach to feature selection involves the use of a suboptimal search of the feature space. Suboptimal search makes the process faster. The search can also be further optimized with a proxy measure chosen as the optimization goal rather than using the classification performance as the goal. In this chapter, we use the Correlation Feature Selection (CFS) [44] measure to select the optimal features. CFS is only used for the stage-1 classifier because stage-1 needs to run quickly, while the stage-2 classifier uses the full feature set since it needs more resolution to classify the traffic source. The stages of our classification architecture were summarized in Figure 7-1.

CFS constructs subsets based on the assumptions that 1) a good subset should contain mostly uncorrelated features, in that a feature should provide independent information to the subset instead of merely being a combination of other features in the subset, and 2) features can be weighed individually against the classification problem, and a combination of features that perform well individually indicates a subset that could perform well. We use CFS and the "BestFirst" search [44] from the Weka [34] library for this purpose. Weka is a tool for performing machine-learning tasks developed by the University of Waikato. It provides common tools for machine learning applications, including the CFS algorithm used in this chapter. With the "BestFirst" search, the CFS selector greedily searches the attribute space for the next attribute that yields the best evaluation of the selected subset of attributes. The algorithm also backtracks to try a different path when there is no improvement for several iterations, until a stop condition is reached.
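The CFS merit of a candidate subset balances feature-class correlation against feature-feature redundancy. A simplified numpy sketch of the merit function and a greedy forward search (without the backtracking of BestFirst) might look as follows; the use of absolute Pearson correlation as the association measure, and continuous-valued attributes, are assumptions:

```python
import numpy as np

def cfs_merit(X, y, subset):
    """CFS merit: k*r_cf / sqrt(k + k*(k-1)*r_ff) for the selected columns."""
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return r_cf
    r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                    for i, a in enumerate(subset) for b in subset[i + 1:]])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

def greedy_cfs(X, y):
    """Greedy forward selection maximizing the CFS merit."""
    remaining, chosen, best = list(range(X.shape[1])), [], -np.inf
    while remaining:
        merit, j = max((cfs_merit(X, y, chosen + [j]), j) for j in remaining)
        if merit <= best:
            break  # no improvement: stop (BestFirst would backtrack here)
        best, chosen = merit, chosen + [j]
        remaining.remove(j)
    return chosen
```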
7.4 Experimental Setup

To test the performance of the proposed classification scheme, a series of experiments is set up as follows. An OpenVPN server [17] is installed on the campus of Michigan State University. The detail of the setup is shown in Figure 7-4. The videos are streamed from multiple streaming providers to a client located off campus (so that the client is in a different domain than the VPN server). In OpenVPN version 1.5, a tunnel can be either UDP or TCP based. We use the UDP tunnel because it is more efficient and therefore used by more service providers. An encrypted OpenVPN tunnel is created from the client to the VPN server, and then the videos are streamed from commercial streaming servers such as YouTube, Netflix, etc. (see Table 4-1) to the client via the OpenVPN tunnel. A router with a Wireshark [39] probe is installed at the client's home. The traffic is collected at this router. The setup emulates a real-world operational scenario of the proposed SMFW. Numerous video streaming servers are chosen for data collection during the experiment.

Figure 7-4: Experimental setup with OpenVPN

Most of the video streaming servers use a content distribution network in order to speed up service delivery and save cost. As a result, the actual server machine that a client downloads from is a function of the client's location, the network load, and various other factors. Servers from our experiments are resolved and matched against an IP geolocation dataset to reveal their actual locations. The number of samples collected during the test period is shown in Figure 7-5. The difference in sample size from one site to another is a result of the length of the chosen video stream (each sample is 20 seconds long) and the condition by which samples are filtered (samples with fewer than 1500 packets/sec are discarded). The dataset is collected over a 6-week period at the client location. The set contains 147563 samples, each 20 seconds long. Normal Internet usage by human participants within the same period is sampled as "Profile" traffic, which provides a background for the samples to be compared against. The data is collected using an automated script that generates random web page views during video playbacks. The average rate is set to 1 view/minute, and 50% of the video streams are randomly chosen to have mixed-in web traffic.

Figure 7-5: Video Streaming Providers in Dataset (pure and mixed sample counts for Amazon (Seaford, Delaware), BlinkX (Seattle, Washington), CNN News (Ann Arbor, MI), DailyMotion (New York, NY), Fox News (Cambridge, MA), Netflix (Southfield, MI), YouTube (Mountain View, CA), and Background traffic)

Data Processing: All traffic streams collected from the experimental setup are cut into 20-second slices, and a feature vector is extracted from each slice. The feature vector contains the packet size distribution (30 attributes), the packet count (1 attribute), and the PAI (40 attributes) of each sample. The samples are collected either with or without background traffic. The resulting dataset is tagged with 2 labels: one indicates whether the sample was collected with or without background traffic present, and the other is the video streaming site the traffic originates from. The packet count, packet size distribution, and PAI features are extracted from every sample; the features are then concatenated into a feature vector whose format is shown in Figure 7-6.

Figure 7-6: Feature formatting with ground truth tags
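A sketch of the per-slice feature assembly (71 attributes plus the two ground-truth tags). The helper below is a generic normalized histogram; the uniform PAI binning over one second is an assumption for illustration, not the Chapter 4 definition:

```python
import numpy as np

def hist(values, n_bins, lo, hi):
    """Normalized histogram helper."""
    counts, _ = np.histogram(values, bins=n_bins, range=(lo, hi))
    return counts / max(counts.sum(), 1)

def build_feature_vector(signed_sizes, times, site_label, mixed_label):
    """Concatenate 30-bin size distribution, packet count, and a 40-bin PAI."""
    psd = hist(signed_sizes, 30, -1500, 1500)   # 30 attributes
    count = np.array([len(signed_sizes)])       # 1 attribute
    pai = hist(np.diff(times), 40, 0.0, 1.0)    # 40 attributes (assumed binning)
    return np.concatenate([psd, count, pai]), (site_label, mixed_label)
```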
For a fair comparison, all classification models in this experiment are trained with the RandomForest classifier from the Weka machine learning library. A RandomForest [42] classifier is an aggregation of tree classifiers that form a "forest". Each tree in the forest is trained on a random subset of the attributes. The classifier output is the result of a popular vote of all the tree classifiers. A tree classifier is a type of classifier that memorizes the training set as a hierarchy of conditions. A sample is classified by starting at the "root" of the tree, which contains the broadest condition, and then following the conditional branches that match the sample until the evaluation reaches one of the leaves of the tree. The label associated with that leaf is the output for the sample.

The dataset is separated into 2 different partitions. One partition, which contains only "pure" samples and "profile" samples, is for the training of the stage-1 classifiers, while the other, which contains "pure" samples, "mixed" samples, as well as "profile" samples, is used for training the stage-2 classifiers.

Figure 7-7: Training the stage-1 classifier

The stage-1 classifier (Figure 7-7) solves a binary classification problem: distinguishing traffic samples with background traffic from those samples that do not have background traffic. Before training, the partition for stage-1 is further split into 3 sub-partitions: one for CFS feature selection, one for training, and one for testing. A feature selector is first trained using the CFS partition. The 2 candidates for the stage-1 classifier are trained using different subsets of attributes chosen from the training set. The first candidate is trained using the packet count feature alone, with the 1st label (video streaming site) erased. The second candidate is trained using an optimal subset of attributes selected by the feature selector, with the 1st label also erased. To decrease the chance of misclassifying mixed samples as pure, the cost of misclassifying a mixed sample as pure is set to 1.5 times the cost of misclassifying a pure sample as mixed. The data flow for training the stage-1 classifier is shown in Figure 7-7. In the following text, the 1st candidate classifier will be referred to as Stage1_PacketCount, and the 2nd candidate as Stage1_CFS. The test set will be used in the testing procedure described in Figure 7-9.

Figure 7-8: Training the stage-2 classifier

After passing the first-stage classifier, samples are labeled as either "mixed" or "pure". The stage-2 classifier examines only the samples labeled as pure by the stage-1 classifier. Two candidates for the stage-2 classifier are trained. The first candidate is trained with only the subset of all pure samples from the stage-2 partition, and the second is trained with the entire stage-2 partition. Both candidates are comprised of 7 RandomForest classifiers, as specified in Section 7.3. Each of the RandomForest classifiers is cost sensitive: the cost of making a false positive prediction is 10 times that of making a false negative prediction. Finally, the prediction for a sample is determined by the "winning" classifier, which is the classifier that 1) made a positive prediction, and 2) made the prediction with the highest confidence among the classifiers. Also, since there is a default class in the training set (class 8, which is comprised of "profile" samples), every sample that is rejected by all other classifiers is regarded as class 8. Henceforth, the candidates will be called Stage2_Pure and Stage2_Mixed, respectively. The data flow for training the stage-2 classifier is shown in Figure 7-8.
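The asymmetric misclassification costs map naturally onto per-class weights. A hedged scikit-learn approximation of the cost-sensitive training (the thesis itself uses Weka's cost-sensitive facilities) could be:

```python
from sklearn.ensemble import RandomForestClassifier

# Class 0 = mixed, class 1 = pure (assumed encoding). Weighting the 'mixed'
# class 1.5x approximates making a mixed-as-pure error 1.5 times costlier
# than a pure-as-mixed error.
stage1 = RandomForestClassifier(class_weight={0: 1.5, 1: 1.0}, random_state=0)

# Stage-2 binary detectors: false positives cost 10x false negatives, so the
# negative class is weighted 10x relative to the positive class.
stage2_binary = RandomForestClassifier(class_weight={0: 10.0, 1: 1.0},
                                       random_state=0)
```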
The experiments are run on a personal computer with an AMD 7700K CPU. The Java heap size for running the Weka classifiers is 1 GB. The runtime for training the stage-1 classifier is about 165 seconds for the CFS dataset and 22 seconds for the packet count dataset, while the stage-2 classifiers range from the shortest 75 seconds (BlinkX) to the longest 367 seconds (YouTube), depending on the size of the dataset. Once trained, the evaluation time of both the stage-1 classifier and the stage-2 classifier on the test set, which contains 29512 samples, is within a few seconds.

Because there are two classification stages, the performance evaluation is more complex than in a usual single-stage classification scheme. First, the same test set is used in testing both stages. The stage-1 classifier receives a selected subset of the attributes (packet count or the CFS-chosen subset), while the stage-2 classifier receives the full range of attributes. Second, the result from the stage-1 classifier has an impact on the stage-2 classifier, so the test set has to be separated into subsets based on the predictions made by the stage-1 classifier. From here on, $T$ is defined as the entire test set, $T_p$ is defined as the part of the test set that is pure according to the ground truth, while $\hat{T}_p$ is the part of the test set that is pure according to the prediction of the stage-1 classifier. Similarly, $T_m$ and $\hat{T}_m$ are the mixed parts of the test set according to the ground truth and the prediction of the stage-1 classifier, respectively. The performance of the stage-2 classifier has to be gauged in a way that respects the output of the stage-1 classifier. The overall process is shown in Figure 7-9. After the samples are tagged by the stage-1 classifier, the stage-2 classifier is tested twice, the first time with $T$ as input and the second time with $\hat{T}_p$. Note that the design goal of the 2-stage classifier is to filter out unqualified samples; as such, only $\hat{T}_p$ is the actually intended input for the stage-2 classifier. However, evaluating the performance of the stage-2 classifier on $T$ allows us to measure its performance gain with respect to the scenario where there is no stage-1 classifier. The output of the stage-2 classifier is the estimated label of the sample (possible values are listed in Table 4-1). We define the performance indicators in Eqns. (7-1) through (7-4).

Figure 7-9: Testing the classifiers at both stages

Filtered Accuracy: defined as the accuracy of the stage-2 classifier on $\hat{T}_p$,
$$\mathrm{Acc}_{filtered} = \mathrm{Acc}\left(\hat{T}_p\right). \quad (7\text{-}1)$$

Unfiltered Accuracy: defined as the accuracy of the stage-2 classifier on $T$,
$$\mathrm{Acc}_{unfiltered} = \mathrm{Acc}(T). \quad (7\text{-}2)$$

Filtered False Positive Rate (FPR): defined as the weighted average of the FPRs of the 7 classes on $\hat{T}_p$, where $C_i$ stands for the subset of samples that belong to class $i$ in $\hat{T}_p$,
$$\mathrm{FPR}_{filtered} = \sum_{i=1}^{7} \frac{|C_i|}{|\hat{T}_p|}\,\mathrm{FPR}_i\left(\hat{T}_p\right). \quad (7\text{-}3)$$

Unfiltered False Positive Rate: defined as the weighted average of the FPRs of the 7 classes on $T$, where $C_i$ stands for the subset of samples that belong to class $i$ in $T$,
$$\mathrm{FPR}_{unfiltered} = \sum_{i=1}^{7} \frac{|C_i|}{|T|}\,\mathrm{FPR}_i(T). \quad (7\text{-}4)$$

The definition of these 4 performance indices is intended to reveal a full picture of the effect of the 2-stage classification process. It is known [23] that mixture leads to a lower recognition rate in the stage-2 classifier, but it may also result in fewer false positives because potentially noisy (and therefore lower-quality) samples are removed early on. It is important to evaluate the performance gain from introducing the stage-1 classifier, which is defined in Eqn. (7-5):
$$\eta = \frac{\mathrm{FPR}_{unfiltered}}{\mathrm{FPR}_{filtered}}. \quad (7\text{-}5)$$

This index shows the improvement introduced by the 2-stage classifier, in terms of the reduction of the false positive detection rate. The larger the ratio, the more the stage-1 classifier has contributed to the reduction of false positives.
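A sketch of how these indices can be computed from prediction arrays; the array names are hypothetical, and the weights follow the class proportions as in Eqns. (7-3) and (7-4):

```python
import numpy as np

def weighted_fpr(y_true, y_pred, classes=range(1, 8)):
    """Weighted average of per-class false positive rates, per Eqn. (7-3)/(7-4)."""
    fpr = 0.0
    for c in classes:
        negatives = y_true != c
        if negatives.any():
            fpr += np.mean(y_true == c) * np.mean(y_pred[negatives] == c)
    return fpr

def eta(y_true, y_pred_all, pure_mask, y_pred_filtered):
    """Reduction rate: unfiltered FPR over FPR on the predicted-pure subset."""
    unfiltered = weighted_fpr(y_true, y_pred_all)
    filtered = weighted_fpr(y_true[pure_mask], y_pred_filtered)
    return unfiltered / filtered
```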
7.5 Experimental Results

7.5.1 Feature Selection

The Correlation Feature Selection (CFS) algorithm is first applied to the CFS slice of the stage-1 dataset. The result of the CFS selector is then applied to the stage-1 training set and also saved for the testing stage, since the same selector has to be applied to the test set as well. The streaming-site label (see Figure 7-6) is erased when invoking CFS. The attributes chosen by CFS for stage-1 are shown in Figure 7-10. Each square block in the figure represents one attribute in the combined feature vector, and those marked in black are the ones chosen by the CFS algorithm. The CFS algorithm chose the features partly from the packet size distribution and partly from the packet arrival interval to form the optimal subset. The packet count is not included in the selected subset.

Figure 7-10: Optimal feature subset (stage-1) chosen by CFS

An interesting observation is the tendency of the CFS algorithm to focus on certain regions of the PAI feature. In both cases, the chosen subset of PAI concentrates on certain regions. Those regions correspond to the small end of the time scale (< 50 µs) and the large end of the time scale (> 0.17 s). According to the observations formulated in [25], the smaller end is where the influence of the network condition is coded, while the larger end is where the source characteristic is coded. The fact that the CFS algorithm picks up those regions corroborates the observations in [25].

7.5.2 Scenario #1 (Stage1_PacketCount+Stage2_Pure)

Scenario #1 simulates the following operational condition in a real-world traffic classification system. When collecting training data from the SMFW, all traces are collected without any background traffic mixed in, and the packet count is used as the indicator of the pure/mixed nature of a traffic trace in stage-1 of the classification process. In this case, the confusion matrix of the stage-1 classifier is shown in Table 7-1. The matrix is normalized by the sum of each row, so the table shows the true positive, false positive, true negative, and false negative rates. Because the training set is generated differently for the Stage2_Pure and Stage2_Mixed classifiers, there are 2 different confusion matrices. The accuracy of Stage1_PacketCount is 41.33% and the false positive rate is 6.97%. On top of the results of Stage1_PacketCount, the confusion matrix of Stage2_Pure is shown in Figure 7-11(a). The ratio of accepted samples is shown in part (b) of the same figure. Samples that are not accepted default to class 8. The acceptance rate is low due to the fact that the 2nd-stage classifier is trained on pure samples only and is likely to reject ambiguous samples. The filtered accuracy of Stage2_Pure, with the filtering of Stage1_PacketCount, is 91.30%, compared to the unfiltered 80.49%, while only 11.95% of the samples passed the filter. The η value in this case is 18.37. The accuracies of the different subsets are shown in Figure 7-12.
Table 7-1: Confusion matrix (Stage1_PacketCount)

Test passes           As Mixed    As Pure
Stage2_Pure   Mixed   93.03%      6.97%
              Pure    85.47%      14.53%
Stage2_Mixed  Mixed   93.03%      6.97%
              Pure    88.18%      11.82%

Figure 7-11: Confusion matrix (Stage2_Pure)

We can see that while the accuracy of Stage1_PacketCount is low, the filtering still boosted the filtered accuracy of Stage2_Pure from the unfiltered 80.49% to 91.30%, at the expense of passing only 11.95% of the samples to the 2nd stage.

7.5.3 Scenario #2 (Stage1_CFS+Stage2_Pure)

Scenario #2 simulates the following operational condition. When collecting training data from the SMFW, all traces are collected without any background traffic mixed in, and the CFS subset is used as the indicator of the pure/mixed nature of a traffic trace in stage-1 of the classification process. In this case, the confusion matrix of the stage-1 classifier is shown in Table 7-2.

Figure 7-12: Performance indices for the 4 test scenarios (unfiltered vs. filtered accuracy, and unfiltered vs. filtered FPR, for scenarios #1 through #4)

The accuracy of Stage1_CFS is 58.05% and its false positive rate is 2.08%. The confusion matrix of Stage2_Pure is shown in Figure 7-11. Accuracies on the different subsets are shown in Figure 7-12. We see that while the filtering of Stage1_CFS performs moderately well, the filtering boosted the filtered accuracy of Stage2_Pure from the unfiltered 80.49% to 95.95%, while only 25.33% of the samples passed the filter. The η value in this case is 18.96.

Table 7-2: Confusion Matrix (Stage1_CFS)

Test passes           As Mixed    As Pure
Stage2_Pure   Mixed   97.92%      2.08%
              Pure    62.62%      37.38%
Stage2_Mixed  Mixed   97.92%      2.08%
              Pure    74.90%      25.10%

7.5.4 Scenario #3 (Stage1_PacketCount+Stage2_Mixed)

Scenario #3 simulates the following operational condition. When collecting training data from the SMFW, all traces are collected with background traffic mixed in, and the packet count is used as the indicator of the pure/mixed nature of a traffic trace in stage-1 of the classification process. In this case, the confusion matrix of the stage-1 classifier is shown in Table 7-1, since it is the same Stage1_PacketCount classifier. The accuracy of Stage1_PacketCount is 32.50% and the false positive rate is 6.97%. The confusion matrix of Stage2_Mixed is shown in Figure 7-13(a), while the acceptance rate of samples is shown in part (b) of the same figure.

Figure 7-13: Confusion Matrix (Stage2_Mixed)

It can be observed that Stage2_Mixed accepts more samples than Stage2_Pure while simultaneously showing less confusion. The performance indices are shown in Figure 7-12. The filtering boosted the positive accuracy of Stage2_Mixed from the unfiltered 81.57% to a higher value (89.88%), while only 10.58% of the samples passed the stage-1 classifier. The η value in this case is 16.79.

7.5.5 Scenario #4 (Stage1_CFS+Stage2_Mixed)

Scenario #4 simulates the following operational condition. When collecting training data from the SMFW, the collection agent adds background traffic to the tunnel, and the optimal subset of features chosen by the CFS algorithm is used as the indicator of the pure/mixed nature of a traffic trace in stage-1 of the classification process. In this case, the confusion matrix of the stage-1 classifier is shown in Table 7-2, since the first-stage classifier is also Stage1_CFS here.
On top of the results of Stage1_CFS, the confusion matrix of Stage2_Mixed, including both accepted samples and rejected samples, is shown in Figure 7-13. The overall accuracy of Stage2_Mixed, without the filtering of Stage1_CFS, is 83.22%. Accuracies on the different subsets are shown in Figure 7-12. We see that while the filtering of Stage1_CFS only performs moderately well, the filtering boosted the filtered accuracy of Stage2_Mixed from the unfiltered 83.22% to 95.56%, while only 33.16% of the samples passed the filter.

7.5.6 Discussions

Comparing the 4 different scenarios, it can be observed that, with appropriate tuning, the 2-stage classifier configuration can boost the RandomForest classifier to higher accuracy and a significantly lower false positive rate. As can be seen in Figure 7-12, the combination that got the most boost is the Stage1_PacketCount + Stage2_Mixed pair. However, this comes at the expense of Stage1_PacketCount passing fewer samples to the 2nd stage than Stage1_CFS does. This low pass rate affects the response time of the classifier, since it takes more time for the classifier to make a decision. In this regard, the CFS classifiers perform much better. The best performance is achieved by scenario #4 (Stage1_CFS+Stage2_Mixed). The reduction rate (η) in that case is 16.79, and the absolute false positive rate is as low as 0.14%. The performance of the filtered classifier on a VPN tunnel is comparable to that of [2], which assumes homogeneous traffic. The conclusion is that the classifier can choose the subset of mixed-traffic data on which performance comparable to that on pure traffic data can be achieved.

7.6 Summary

This chapter proposes and analyzes a TA framework for identifying the sources of tunneled video streaming traffic. The main contribution is a two-stage classifier that combines the power of a pre-filter classifier, which filters traffic samples according to their pureness (i.e., whether the traffic is heterogeneous or pure), with a video source classifier. Using OpenVPN servers for creating encryption tunnels, extensive experiments were conducted on many popular video streaming sources with various combinations of feature extraction and data processing techniques to verify the effectiveness of the 2-stage classifier. The results confirm the effectiveness of the design. It was demonstrated that the system works best when an optimal feature subset is chosen using a Correlation Feature Selector (CFS) and the training data contains mixed traffic samples. Accuracy of up to 95% is achieved using this setup. Future work includes: a) studying models of background traffic in order to guide the addition of noise during training, b) handling multiple video streams, possibly from multiple machines, in addition to the background web traffic, and c) verifying the design of the classifier in a custom-built SMFW setting that closely reflects real-world scenarios. An implementation of the proposed SMFW system that can detect a mixture of video streams in the traffic flow is being developed and experimented upon. Future work will present it in more detail.

Chapter 8 : Deep-Learning Based Traffic Analysis

8.1 Introduction

In this chapter, we introduce a Natural Language Processing (NLP)-based feature and its usage with deep-learning based traffic classifiers. The introduction of deep learning methods into Traffic Analysis will profoundly change the way TA methods are applied.
So far, the existing methods rely on hand-engineered features in the hope that they will capture some identifiable trait of the underlying traffic flow. These features are usually low in dimensionality compared to the traffic flow. For example, a traffic flow lasting 30 seconds could have more than 10,000 packets, each described by several properties, yet a packet size distribution like the one introduced in Chapter 7 contains only 10 to 40 values. Most of the information is discarded in the feature extraction process. Defining a feature that is good at capturing traits of the traffic flow is a painful trial-and-error process, and finding a good feature requires a substantial amount of resources as well as luck. With the introduction of deep learning, the machine learning subsystem is able to handle inputs that are much higher in dimensionality. On top of that, deep learning is good at finding subsets of the input that are useful for identifying the classes. In other words, the classifier can perform feature selection by itself as it is trained. To exploit this advantage, we need a new way of encoding traffic into a data format that not only keeps more fine-grained information about the original traffic but is also easy for deep learning to handle. This chapter documents one such effort by the author.

8.2 Deep Neural Network

In this section, we review relevant literature from Deep Learning. We focus on the application of deep learning to NLP problems. The rationale behind this decision is that 1) network traffic flows and text streams are similar in structure, and 2) it enables the classifier to borrow techniques from NLP.

8.2.1 Deep Learning in Natural Language Processing

Deep learning is a family of biologically inspired machine learning algorithms that feature multiple layers of cascading non-linear representations of data that can be trained. The most popular of the deep-learning algorithms is the Deep Neural Network (DNN), in which the layers are made of artificial neurons that can be trained through error backpropagation [45]. The learning of multi-layer representations is why a DNN is "deep" and is also the reason behind its classification power. Each layer depends on the previous layer(s) and can therefore be considered a feature extractor. Having multiple layers means there is potential for learning a deep hierarchy of features directly from a high-dimensional representation of the data. In other words, instead of relying on manually engineered features, deep learning algorithms can learn features from raw data. For this reason, deep learning classifiers are commonly trained against densely sampled representations of the data. Images and time series, for example, are often used with minimal preprocessing. For natural language data, two neural network architectures are used: the Convolutional Neural Network (CNN) [46] and the Recurrent Neural Network (RNN) [47]. A CNN performs a convolution operation on the data with a kernel function that is shared throughout the series; the most significant component is then picked as the input for subsequent layers via max-pooling. An RNN keeps internal states so that it can react to input sequences. The states are updated at each input step by a combination of the past state and the current input. CNNs and variations of RNNs (especially Long Short-Term Memory, LSTM [48], and Gated Recurrent Unit, GRU [49]) have had great success in NLP [50]. The most notable recent development in DNN based NLP is the Attention-Based Network (ATTN).
ATTNs define attention as a neural network (potentially deep) that accepts some form of input and outputs a weight factor that regulates how much weight a part of that input has on the output. An ATTN can be chained with other networks to regulate their inputs. Researchers have found that adding attention to a network increases its memory capacity and accuracy. As an example, [51] is regarded as one of the state-of-the-art text classification results to date.

8.2.2 Word Embedding

An efficient word representation is important for NLP. A straightforward method to map words to vectors is one-hot encoding. In one-hot encoding, the ith word in a vocabulary that contains N words is converted to an N-element-long vector, with only the ith element being 1 and all others being 0. One-hot encoding is simple to implement, but scales poorly to larger vocabularies and does not maintain information on the correlation between words. Word embedding [52] was developed to reduce dimensionality while maintaining the correlation. A word embedding is a unique projection from a vocabulary of N words onto N dense vectors in an M-dimensional space (usually M ≤ N). The embedding ensures that related words (i.e., those that appear close to each other in the original sequence) also have vector representations that are similar to each other. An embedding can be trained from a dataset separately from the classifier and then reused for different tasks to save time. Widely used methods to train word embeddings include word2vec [53] and Global Vectors for Word Representation (GLOVE) [54]. Word embedding is an example of the NLP techniques that TA can benefit from.

8.3 Language-Like Feature for Traffic Analysis

To bridge the gap between NLP and traffic analysis, a language-like feature is defined to encode network traffic. The goals of this feature are: 1) to achieve a high data reduction ratio before the traffic reaches the classifier, which helps reduce the network load when the classifier is eventually deployed in a real-world networking system; 2) to produce a language-like feature that existing NLP techniques can process; and 3) to achieve high classification accuracy.

Figure 8-1: Extracting the Language-Like Traffic Feature (each 300 ms slice of the flow maps the upstream/downstream packet and byte counts to letters, producing an output sequence of four-letter words such as aaaa, abcc, aabc, abae)

To achieve these goals, we define the feature as depicted in Figure 8-1. A 1-minute-long bidirectional traffic stream is split into 200 non-overlapping windows, each 300 ms long. Within each window, the sum of the upstream/downstream packet counts is taken, as well as the upstream/downstream byte counts. Each value is mapped onto the English alphabet using the formula defined in Eqn. (8-1), which uses a base-10 log-scale to convert an input value to a letter index. For a 100 Mbps link, which is used in the data collection, the highest possible letter is "g" (the range between $10^7$ and $10^8$). The reason for a log-scale is that it makes the error tolerance in scale with the absolute values. It also helps to reduce the number of symbols in the feature. The 4 letters are concatenated to form a single symbol ("word") in the output sequence.
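A sketch of the feature extraction under these definitions. Since Eqn. (8-1) is not reproduced here, the floor-of-log10 mapping below, with zero mapped to "a", is an assumption chosen to match the idle-channel behavior described next:

```python
import math

def to_letter(value):
    """Base-10 log-scale mapping of a count to a letter (assumed binning)."""
    if value <= 0:
        return 'a'  # an idle window maps to 'a'
    return chr(ord('a') + int(math.floor(math.log10(value))))

def flow_to_words(windows):
    """Each window: (up_packets, down_packets, up_bytes, down_bytes) -> one word."""
    return [''.join(to_letter(v) for v in w) for w in windows]

# Example: an idle 300 ms window becomes 'aaaa'.
print(flow_to_words([(0, 0, 0, 0), (3, 12, 180, 15000)]))  # ['aaaa', 'abce']
```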
Figure 8-2: Word Frequency of Traffic Feature

After the feature extraction, every sample is turned into a sequence of 200 words. Figure 8-2 shows the word frequency distribution of the dataset. The distribution is sorted by popularity, and every 1 in 5 words is labeled on the vertical axis. We can observe that 1) the resulting feature has a vocabulary of only 68 words, a much smaller number than the number of possible words, and 2) the word frequency follows an exponential distribution: each word is about 83.83% as popular as the previous one. A significant anomaly happens at aaaa. The over-representation of aaaa is because it is the word that represents an idle channel, and it therefore occurs more often in the data.

8.4 Neural Network Architecture for Traffic Analysis

We verify the effectiveness of the feature with the DNN-based text classification methods listed in Table 8-1. We also apply the traffic classifier by Cruz et al. [31] to our dataset as a comparison.

Table 8-1: NLP Processing Methods Used

Method        Architecture     Optimizer
HAN           GRU+Attention    Adam
Yoon Kim      CNN              Adam
Mark Berger   GRU              RMSProp

Word2vec embedding is used across all NLP methods. The embedding layer is a part of each network and is trained alongside it. The first architecture used is based on the Hierarchical Attention Network (HAN) proposed by Zhichao Wang, et al. [51]. We define 2 types of classification problems in this chapter. The first type is a binary classification problem: given unknown traffic, classify it as either containing traffic from a certain video provider or not. The second type is the multi-label problem: given unknown traffic, identify the probability of it containing traffic from each of the video providers in a list. In both cases, the class labels are video streaming providers, but the outputs are categorical in the case of the binary problem and binary in the case of the multi-label problem. We use the binary_crossentropy loss function to train the models in both cases. The performance is measured by categorical_accuracy in the binary case and by the Receiver Operating Characteristic (ROC) area in the multi-label case.

Figure 8-3: Hierarchical Attention Network (Figure 2 in [51]; n = 200 embedded vectors feed a bidirectional GRU (64 units) whose outputs v1...vn are gated through a tanh/softmax attention network before a dense output layer with dropout (l units))

In this network (Figure 8-3), the output is a weighted sum of the outputs of a bidirectional GRU layer, and the weights come from the output of an ATTN. This ATTN is equivalent to the word-level attention network in the original work. We do not need a document-level ATTN as the original work does, since there is no higher-level structure in our dataset. A 200-dimensional Word2Vec embedding is used as input to this network and all other network models. The output is a dense sigmoid layer of labels, with the labels either in categorical format or binary format.

The second method used in this chapter is proposed by Yoon Kim [55]. This network (depicted in Figure 8-4) has multiple 1D-CNN layers with ReLU activation, each having a different kernel shape. According to this method, a layer with kernel shape n is defined for the n-grams in the input data. We use kernels with lengths from 1 to 5 to focus on unigrams and up to 5-grams. Each 1D-CNN layer contains 256 units. The input is a sequence of 200 10-dimensional Word2Vec embedded vectors. The CNN layers are flattened to a dense vector with max-pooling-over-time before connecting to the output layer. The output layer has the same format as in the HAN.

Figure 8-4: Network Architecture by Yoon Kim (Figure 1 in [55]; n = 200 embedded vectors (k = 10) pass through parallel 1D-CNN layers of variable size and kernel shape, max-pooling over time, and a dense output layer with dropout (l units))
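A hedged Keras sketch of this multi-kernel CNN as described; the vocabulary size, dropout rate, and label count are assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_kim_cnn(vocab_size=68, seq_len=200, embed_dim=10, n_labels=7):
    """Parallel 1D-CNN branches over 1- to 5-grams with max-pooling over time."""
    inputs = keras.Input(shape=(seq_len,))
    x = layers.Embedding(vocab_size, embed_dim)(inputs)  # trained with the network
    branches = []
    for kernel in range(1, 6):  # unigrams up to 5-grams
        conv = layers.Conv1D(256, kernel, activation="relu")(x)
        branches.append(layers.GlobalMaxPooling1D()(conv))  # max-pooling over time
    merged = layers.Dropout(0.5)(layers.concatenate(branches))
    outputs = layers.Dense(n_labels, activation="sigmoid")(merged)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```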
Figure 8-4: Network Architecture by Yoon Kim (Figure 1 in [55]) [n = 200 embedded vectors (k = 10) → 1D-CNN layers of variable size and kernel shape → max-pooling over time → dense output layer with dropout (l units)]
Figure 8-5: Network Architecture by Mark Berger
The third method (Figure 8-5) is proposed by Mark Berger [56]. The network accepts the same 10-dimensional embedded vectors as in Figure 8-4 and consists of a GRU layer with ReLU activation and a dense output layer that follows the same format as in Figure 8-3.
We feature the traffic classification network by M. Cruz et al. [31] as a recent example of a DNN designed for TA. The network accepts a vector of traffic features. Each feature vector is calculated from 3 seconds of traffic and contains many components. To adapt it for our dataset, the TCP-specific features (for example, round-trip time and TCP flags) are discarded because we do not assume a TCP-based tunneling protocol. The remaining 60 feature values are listed in Table 8-2.
Table 8-2: Available Features by M. Cruz et al.
Field                                      Description
Packet count 1                             Bidirectional packet counts of the 1st second of traffic
Packet count 2                             Same as above for the 2nd second
Packet count 3                             Same as above for the 3rd second
Byte count 1                               Bidirectional byte counts of the 1st second of traffic
Byte count 2                               Same as above for the 2nd second
Byte count 3                               Same as above for the 3rd second
Bidirectional packet size stats over a     Each stat contains the following values: maximum,
3-second interval & bidirectional          minimum, mean, median, variance (σ²), extreme outlier
inter-arrival interval stats over a        (>2σ) count, mild outlier (>σ) count, Shannon entropy,
3-second interval                          and 2-, 3-, 4- and 5-permutation entropy
The network (Figure 8-6) contains a dense input layer with ReLU activation, then a chain of 6 LSTM layers with various hidden state sizes and tanh activation, and another dense layer with ReLU activation before the sigmoid output layer. The LSTM layers (except for the last one) return their entire output sequences so that they can be chained. The original network uses a binary output format and the binary_crossentropy loss function for the binary classification problem. We change the output layer to use the categorical output (in binary problems only) to match the other networks in this chapter.
Figure 8-6: Network Architecture by M. Cruz et al. (Figure 3 in [31]) [feature vectors (n = 20, k = 60); layers include a Time Distributed Dense layer (256 units), LSTM layers (32, 128, 16, and 64 units), and a Dense layer (256 units)]
8.5 Experimental Setup
We construct the network shown in Figure 8-7 (a) for the experiments. A Wi-Fi access point runs an OpenVPN [17] client that tunnels to an OpenVPN server. The client and the server communicate using the SSL-over-UDP protocol. The tunnel is monitored by a probe router with the prototype SMFW installed (the architecture is shown in Figure 8-7 (b)), which records metadata for each packet coming through and calculates features for each flow on the fly. A brief sample of the recording that the router produces is shown in Table 8-3. Up to 4 clients can connect to the Internet through this tunnel to do different activities. The clients are randomly assigned to watch a video (of various lengths) from one of the six video streaming sites listed in Table 8-4, to access a website (1 min/access), or to stay idle for reference. We reserve one client for accessing a web page every minute to create noise in the dataset.
Any sample could contain 0 to 3 video streaming traffic flows and some web traffic flows. The tunnel is also used for day-to-day Internet access when data collection is idle, and the traffic from this casual usage is recorded to create reference "background" data.
Table 8-3: Traffic Data Format
t(ms)   size(byte)   src ip         dst ip         sport   dport
0.000   1514         54.192.39.46   192.168.0.95   443     59666
0.060   1514         54.192.39.46   192.168.0.95   443     59666
0.180   1514         54.192.39.46   192.168.0.95   443     59666
0.590   66           192.168.0.95   54.192.39.46   59666   443
...     ...          ...            ...            ...     ...
2.490   1514         54.192.39.46   192.168.0.95   443     59666
The data collection continued for about a month, during which time about 19k 1-minute-long samples are collected. A breakdown of the collected data is shown in Table 8-5. The accessed video streaming sites are recorded and later used to assign labels to the samples. Each sample might be assigned 1-4 labels depending on the client activities at the time of recording. 4-label samples are rare in the dataset (28 samples) because they only occur during transitions between videos. These samples are then processed with the proposed feature extraction method to be classified by the 3 NLP-based methods. The HAN implementation is by Lei Li from the Language Computing and Machine Learning Group (LANCO), Peking University. The Y. Kim and M. Berger implementations are from the INSPIRE-HEP project by the European Organization for Nuclear Research (CERN). Features that resemble Table 8-2 are also extracted to be classified by Cruz et al.'s classification method. In all cases, we randomly choose 80% of the dataset as the training set and 20% as the test set. The training set is balanced when evaluating the binary classification problems but is left as is when evaluating the multi-label classification problems (a sketch of this split appears at the end of this section).
Figure 8-7: Experimental Setup [(a) network topology: a local network behind a VPN-enabled access point, an encrypted tunnel through a simulated ISP router to the OpenVPN server, and the Internet; (b) probe architecture: traffic database and neural network]
Table 8-4: Classes based on video streaming sources
Video Source   Number of videos in the playlist
Amazon         2
CNN News       1 (autoplays other videos)
Fox News       1 (autoplays other videos)
DailyMotion    1 (autoplays other videos)
Netflix        2
YouTube        2 (autoplays other videos)
Table 8-5: Sample counts by No. of Clients
All     1 Client   2 Clients   3 Clients   4 Clients
19321   8703       9128        1402        28
In the following sections, we study the following aspects of the performance: 1) the accuracy of the method when solving a binary classification problem; 2) the performance when solving a multi-label classification problem, compared with the binary classification case; 3) the performance on testing samples that bear a combination of labels absent from the training set (called the "knockout problem" from here on); and 4) the performance of the classifier when the training and testing sets are not from the same network location. The rationale behind aspects 1) and 2) is obvious, while aspects 3) and 4) are studied because they allude to the zero-shot learning ability of the network. Zero-shot learning is a name coined for the ability of a classifier to classify classes not present in the training set. It is a desirable trait because it greatly reduces the training overhead of the network by allowing the network to be trained on an incomplete subset of all possible label combinations. Evaluating aspect 4) also lets us revisit the same-location rule that is generally established in TA studies and is discussed in Chapter 6 of this work.
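Returning to the dataset preparation above, the sketch below illustrates the 80%/20% partition and the class balancing used for the binary experiments. Undersampling the majority class is an assumption here, since the text does not specify the balancing method, and the variable names are illustrative.

import random

def split_dataset(labels, balance, train_frac=0.8, seed=7):
    """Return (train, test) index lists; `labels` holds one 0/1 target per sample."""
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    rng.shuffle(idx)
    cut = int(train_frac * len(idx))
    train, test = idx[:cut], idx[cut:]
    if balance:  # binary problems only; multi-label sets are left as is
        pos = [i for i in train if labels[i] == 1]
        neg = [i for i in train if labels[i] == 0]
        n = min(len(pos), len(neg))
        train = rng.sample(pos, n) + rng.sample(neg, n)
    return train, test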
8.6 Experimental Results
8.6.1 Binary Classification
The accuracies of all binary classification experiments are shown in Figure 8-8. The result shows that the newly defined feature is effective. Accuracy in the binary problems is defined as follows: if the activation level of the neuron corresponding to the true label (0 or 1) is higher than the activation level of the other output neuron, the sample is considered a hit. The accuracies of the binary classifiers vary from 86.3% to 95.7% depending on the video streaming site being focused on. The Cruz et al. method performs erratically: its accuracy is 65.3% on Netflix and 64.4% on Amazon, but its highest accuracy reaches 92.3% (on Fox). The newly defined feature performs either comparably or significantly better than the Cruz et al. method in all cases.
Table 8-6: Training Performance
Method         Epoch Time(s)   #Epochs
M. Berger      54              50
Y. Kim         35              50
HAN            17              30
Cruz et al.    14              50
Table 8-6 shows the time needed for each network to complete one training epoch and the number of epochs needed for convergence. The training/testing is done on a setup that consists of TensorFlow 1.11 running on a GeForce GTX 750 Ti GPU with 4GB of video memory. While Cruz et al.'s method is the fastest to complete an epoch, HAN is the fastest to converge among the methods, with good classification performance achieved at 30 epochs, while the other methods take about 50 epochs to reach good performance.
Figure 8-8: Binary Classifier Accuracy [validation accuracy (60%-100%) of HAN, Y. Kim, M. Berger, and Cruz et al. on Amazon, YouTube, Netflix, CNN, Fox, and DailyMotion]
Effect of Word Embedding: We highlight the fact that the embedding size used in the HAN method is set to k=200, while there are only 68 unique words in the dataset. The choice is based on the observations featured in Figure 8-9, which show that the oversized embedding improves the accuracy (from 88.4% when the embedding size is 10 to 92.1% when k=200). The observation is sensible since, in machine learning, mapping features into a higher-dimensional space in order to find better separations between classes is an established practice. Increasing the embedding size further has the opposite effect. The other methods also benefit from a larger embedding size, albeit to a lesser degree. For example, Y. Kim's method gains another 1.3% accuracy on YouTube when its embedding size is increased to 200. A large embedding size, however, costs more memory, which is why the embedding sizes for the other methods are set to 10. These observations hint at potentially interesting uses of embedding in TA and other applications.
Figure 8-9: Effect of Embedding Size on HAN-based Binary Classifier (YouTube) [verification accuracy (0.65-0.95) over 30 epochs for k=10, k=100, and k=200]
8.6.2 Multi-label Classification
To evaluate multi-label classification performance, we obtain the activation levels at the output stage for all of the methods and convert them into ROC curves. The AUCs indicate the overall performance. We let the classifiers classify the 6 video streaming sites together. For comparison, the AUCs of Y. Kim's binary classifiers are provided in Table 8-7. We choose them since Y. Kim's classifier performs the best in the binary experiments. These numbers provide the performance reference for the multi-label classifiers.
Table 8-7: AUC of Binary Classifiers using Y. Kim Method
Site          AUC
Amazon        0.984
YouTube       0.941
Netflix       0.883
CNN           0.887
Fox           0.984
DailyMotion   0.937
The ROC curves and the respective AUCs of the multi-label classifiers can be seen in Figure 8-10. The horizontal axis in these graphs is the false positive rate, and the vertical axis is the true positive rate. The AUCs can be found in the legends of the figures.
Figure 8-10: Multi-label Classifier ROC
Y. Kim's method again performs well across all classes, with HAN following closely. M. Berger's method performs moderately on Amazon and YouTube while showing declining performance on the other classes. Cruz et al.'s method performs well enough in only one class (Fox). The AUCs of the Y. Kim method do not change much from the binary AUC values listed in Table 8-7, showing that the proposed feature is effective in multi-label classification.
Figure 8-12 shows the ROC curves and AUCs of the same multi-label classifiers evaluated on the subset of samples with 3 or more active clients. These samples are considered challenging because of the traffic mixture. We also observe that the AUCs do not change much from those in Figure 8-10 when we limit the evaluation to the subset of samples with 2 or more active clients. The difference in AUCs between this subset and the whole set is provided in Figure 8-11 for reference. The only significant difference happens on the Netflix samples.
Figure 8-11: Difference in Multi-label Classifier AUC (No. of Clients ≥ 2) [difference in AUC from the whole set (0.1 to -0.3), per site, for HAN, Y. Kim, M. Berger, and Cruz et al.]
Figure 8-12: Multi-label Classifier ROC (No. of Clients ≥ 3)
According to Figure 8-12, the resilience of the classifiers is challenged on the subset where the number of clients is greater than 2. Y. Kim's method again performs well compared to the overall results, suffering only minor drops in AUC. YouTube is a major exception, where the method suffers a drop in AUC from 0.964 to 0.808. It still performs better than the other methods in all cases. The Cruz et al. method achieves better performance on YouTube and DailyMotion but does not perform well on Amazon, Netflix, or CNN. The observations here show that the effectiveness of the proposed feature extends to multi-label classification.
8.6.3 Knockout Test
To evaluate the zero-shot performance of the classifier, we remove samples bearing certain pairs of labels from the training set, one pair at a time, and train the HAN and Y. Kim classifiers under the same conditions as the multi-label tests using these modified datasets. The trained classifiers are then evaluated on the augmented test sets that contain those previously removed samples. The ideal output is for both output neurons (corresponding to the pair of knocked-out classes) to have high activation levels. The before-and-after comparisons of two pairs, YouTube-Amazon on the left and Amazon-DailyMotion on the right, are shown in Figure 8-13. The 1st row is the result before the knockout for reference, the 2nd row is the result of Y. Kim's method after the knockout, and the 3rd row is the result of HAN after the knockout. The observations here are generalizable to other cases.
Before the knockout, the network can be trained to drive both neurons to high activation levels. While the network can still identify the sites after the removal, it becomes weaker at identifying both at once. For many samples, the network is only able to drive one of the two neurons to high activation levels at a time.
This effect is more visible with HAN, where no sample can drive the network output to the top-right corner after the knockout.
Figure 8-13: Selected Results from Knockout Tests [panels: Before (Y. Kim), After (Y. Kim), After (HAN)]
The observation might be a result of the max-pooling mechanism or the attention mechanism: both help the network focus on the most significant input components (assumed to be the site generating the most traffic in that sample) while suppressing other components, producing the observed behavior. The performance might be improved if a different network architecture that can focus on both sources is used.
8.6.4 Testing at a Different Location
To test the performance of the network when the test traffic is not from the same location as where the training set is collected, we made changes to the experimental setup, as depicted in Figure 8-14, to collect a new dataset. The VPN server is migrated to an Amazon Web Services (AWS) instance in the US East (Ohio) region. We chose a residential location in East Lansing, MI, in addition to the lab at Michigan State University, as the client locations. The access point and probe router are installed at the lab for about one week to collect the first half of the dataset, then transported to the residential location to collect the second half.
Figure 8-14: Experimental Setup for Testing a Different Location
The video streaming sites in this dataset are listed in Table 8-8. Due to the limitations AWS imposes on the VPN server, Amazon Prime and Netflix are removed from the list of sites, while more sites are added to make up for the loss. Twitch and Vimeo Livestream are added to evaluate the performance of the classifier on live streaming services, while all other services serve video-on-demand (VOD). The dataset also serves as a chance to test on a larger number of video streaming providers and evaluate the generalizability of the proposed technique.
Table 8-8: Video Streaming Sites for Testing on a Different Location
Video Source       Number of videos in the playlist
VEOH               2
Vimeo              2
Fox News           1 (autoplays other videos)
DailyMotion        1 (autoplays other videos)
NatGeo (ng)        2
Twitch             1
Vimeo Livestream   1
YouTube            2 (autoplays other videos)
Results: The performance figures in the rest of this subsection are derived from training the HAN and Y. Kim classifiers on the "lab" part of the dataset while testing on the "residential" part. We use Y. Kim's method and HAN for comparison since they are the two methods verified to perform well in the single-location cases. The ROC curves of the HAN and Y. Kim methods are shown in Figure 8-15. All 8 classes are trained at once, the same as in the multi-label cases of the single-location experiments.
Figure 8-15: Classification Performance (Two Locations)
Compared to the single-location results, we can see that the change in location impacts performance. Some sites (YouTube and DailyMotion) are impacted more, having their ROC areas reduced to about 0.74, compared to the generally >0.9 areas in the single-location experiments. On the other hand, the classifiers perform well on VEOH, Vimeo, and Fox. The tests on live streaming sites (Livestream and Twitch) show very poor performance compared to the VOD sites. Another observable trend is that, for VOD sites, about 30% to 40% of the positive samples can be identified with small false positive rates, while the rest of the samples can become ambiguous.
This means that classifiers can still be used across locations by adjusting the thresholds, albeit with a performance penalty. For completeness, the ROC areas of the two-location test on samples with more than 2 active clients are shown in Figure 8-16. The networks struggle to classify YouTube in this case, while the other sites suffer only mild degradation.
Figure 8-16: Classification Performance (Two Locations & No. of Clients ≥ 3)
8.7 Summary
In this chapter, we propose a novel classification method for identifying the sources of heterogeneous network traffic flows using deep-learning based classifiers. The method consists of a novel feature that is inspired by research in NLP. The feature is a concise, text-like representation of the encrypted network traffic. Once calculated, the feature can be processed by several NLP techniques that outperform traditional methods. We combine the feature with neural networks that take advantage of NLP methods such as word embedding and attention to build several classifiers as our test subjects.
Using a dedicated experimental setup, we test the classifiers we built with the novel feature and 3 NLP methods, as well as a traffic classification method that uses hand-crafted traffic features. The results show that the new feature performs better on binary classification problems. We then test the performance of the methods on multi-label classification problems. Again, our method is shown to work well.
Extra evaluations to test the robustness of the feature are also conducted. First, we explore the possibility of classifying traffic containing a combination of labels that is not present in the training dataset (the "knockout" problem). The method is shown to be able to identify the predominant component of the traffic. Second, we train the classifier on a dataset containing traffic from one location but evaluate it against a dataset collected at another location that contains traffic from the same video streaming sites. The evaluation shows that, while the performance is impacted by the discrepancy between the training location and the test location, the classifier nonetheless can classify about 30-40% of the VOD data with tolerable false positives. This is a breakthrough compared to the erratic performance figures obtained when evaluating across 2 locations as presented in Chapter 6. The proposed deep-learning based traffic analysis is significant in overcoming the obstacles toward practical TA applications. Future work should include improving the feature to carry more information while maintaining a concise format, and analyzing and improving its performance when classifying live streaming services.
BIBLIOGRAPHY
[1] Sandvine Incorporated, “Global Internet Phenomena Report, 1H 2016,” Apr. 2016.
[2] D. Herrmann, R. Wendolsky, and H. Federrath, “Website fingerprinting,” in Proceedings of the 2009 ACM Workshop on Cloud Computing Security, 2009, p. 31.
[3] A. Panchenko, L. Niessen, A. Zinnen, and T. Engel, “Website Fingerprinting in Onion Routing Based Anonymization Networks,” in Proceedings of the 10th Annual ACM Workshop on Privacy in the Electronic Society, 2011, pp. 103–114.
[4] S. Khanvilkar and A. Khokhar, “Virtual Private Networks: An Overview with Performance Evaluation,” IEEE Commun. Mag., vol. 42, no. 10, pp. 146–154, Oct. 2004.
[5] R. Dingledine, N. Mathewson, and P. Syverson, “Tor: The Second-generation Onion Router,” in Proceedings of the 13th Conference on USENIX Security Symposium - Volume 13, 2004, p. 21.
[6] L. Dixon, T.
Ristenpart, and T. Shrimpton, “Network Traffic Obfuscation and Automated Internet Censorship,” IEEE Secur. Priv., vol. 14, no. 6, pp. 43–53, Nov. 2016.
[7] T. Wang and I. Goldberg, “Improved Website Fingerprinting on Tor,” in Proceedings of the 12th ACM Workshop on Workshop on Privacy in the Electronic Society, 2013, pp. 201–212.
[8] M. Juarez, S. Afroz, G. Acar, C. Diaz, and R. Greenstadt, “A Critical Evaluation of Website Fingerprinting Attacks,” in Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security - CCS ’14, 2014, pp. 263–274.
[9] N. Namdev, S. Agrawal, and S. Silkari, “Recent Advancement in Machine Learning Based Internet Traffic Classification,” Procedia Comput. Sci., vol. 60, pp. 784–791, 2015.
[10] R. Oppliger, “Internet security: firewalls and beyond,” Commun. ACM, vol. 40, no. 5, pp. 92–102, May 1997.
[11] ISO/IEC, “Information technology - Open Systems Interconnection - Basic Reference Model: The Basic Model,” Jun. 1994.
[12] R. Bendrath and M. Mueller, “The end of the net as we know it? Deep packet inspection and internet governance,” New Media Soc., vol. 13, no. 7, pp. 1142–1160, Apr. 2011.
[13] M. Rash, Linux Firewalls. 2003.
[14] R. Hofstede et al., “Flow Monitoring Explained: From Packet Capture to Data Analysis With NetFlow and IPFIX,” IEEE Commun. Surv. Tutorials, vol. 16, no. 4, pp. 2037–2064, 2014.
[15] R. Stanton, “Securing VPNs: Comparing SSL and IPsec,” Comput. Fraud Secur., vol. 2005, no. 9, pp. 17–19, Sep. 2005.
[16] T. Rowan, “VPN technology: IPSEC vs SSL,” Netw. Secur., vol. 2007, no. 12, pp. 13–17, Dec. 2007.
[17] M. Feilner, OpenVPN: Building and Integrating Virtual Private Networks. Packt Publishing, 2006.
[18] K. P. Dyer, S. E. Coull, T. Ristenpart, and T. Shrimpton, “Peek-a-Boo, I Still See You: Why Efficient Traffic Analysis Countermeasures Fail,” in Proceedings of the 2012 IEEE Symposium on Security and Privacy, 2012, pp. 332–346.
[19] T. T. T. Nguyen and G. Armitage, “A survey of techniques for internet traffic classification using machine learning,” IEEE Commun. Surv. Tutorials, vol. 10, no. 4, pp. 56–76, Oct. 2008.
[20] R. Morla, P. Gonçalves, and J. Barbosa, “High-performance network traffic analysis for continuous batch intrusion detection,” J. Supercomput., vol. 72, no. 11, pp. 4107–4128, 2016.
[21] Y. Liu, A.-R. Sadeghi, D. Ghosal, and B. Mukherjee, “Video Streaming Forensic – Content Identification with Traffic Snooping,” in Information Security, vol. 6531, M. Burmester, G. Tsudik, S. Magliveras, and I. Ilić, Eds. Springer Berlin Heidelberg, 2011, pp. 129–135.
[22] Y. Liu, C. Ou, Z. Li, C. Corbett, B. Mukherjee, and D. Ghosal, “Wavelet-Based Traffic Analysis for Identifying Video Streams over Broadband Networks,” in Global Telecommunications Conference, 2008 (IEEE GLOBECOM 2008), 2008, pp. 1–6.
[23] Y. Shi and S. Biswas, “Detecting tunneled video streams using traffic analysis,” in 2015 7th International Conference on Communication Systems and Networks (COMSNETS 2015) - Proceedings, 2015.
[24] Y. Shi and S. Biswas, “Characterization of Traffic Analysis based video stream source identification,” in 2015 IEEE International Conference on Advanced Networks and Telecommunications Systems (ANTS), 2015, pp. 1–6.
[25] Y. Shi and S. Biswas, “Protocol-independent identification of encrypted video traffic sources using traffic analysis,” in 2016 IEEE International Conference on Communications (ICC 2016), 2016.
[26] Y. Shi and S.
Biswas, “Characterization of Traffic Analysis based video stream source identification,” in International Symposium on Advanced Networks and Telecommunication Systems (ANTS), 2016.
[27] Y. Shi and S. Biswas, “Using traffic analysis for simultaneous detection of BitTorrent and streaming video traffic sources,” in 2017 9th International Conference on Communication Systems and Networks (COMSNETS 2017), 2017, pp. 79–86.
[28] R. Vinayakumar, K. P. Soman, and P. Poornachandrany, “Secure shell (SSH) traffic analysis with flow based features using shallow and deep networks,” in 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI 2017), 2017, pp. 2026–2032.
[29] J. Yan and J. Kaur, “Feature Selection for Website Fingerprinting,” Proc. Priv. Enhancing Technol., vol. 2018, no. 4, pp. 200–219, 2018.
[30] M. Lotfollahi, R. S. H. Zade, M. J. Siavoshani, and M. Saberian, “Deep Packet: A Novel Approach For Encrypted Traffic Classification Using Deep Learning,” Sep. 2017.
[31] M. Cruz, R. Ocampo, I. Montes, and R. Atienza, “Fingerprinting BitTorrent Traffic in Encrypted Tunnels Using Recurrent Deep Learning,” in Proceedings - 2017 5th International Symposium on Computing and Networking (CANDAR 2017), 2018, pp. 434–438.
[32] M. T. H. Parmar, “Adobe’s Real Time Messaging Protocol,” Adobe Systems Incorporated, 2012.
[33] A. Zambelli, “IIS Smooth Streaming Technical Overview,” Mar. 2009.
[34] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, “The WEKA Data Mining Software: An Update,” SIGKDD Explor. Newsl., vol. 11, no. 1, pp. 10–18, Nov. 2009.
[35] C. Demichelis and P. Chimento, “IP Packet Delay Variation Metric for IP Performance Metrics (IPPM),” RFC 3393, Nov. 2002.
[36] D. Chen, X. Fu, W. Ding, H. Li, N. Xi, and Y. Wang, “Shifted gamma distribution and long-range prediction of Round Trip Time delay for Internet-based teleoperation,” in Robotics and Biomimetics, 2008 (ROBIO 2008), IEEE International Conference on, 2009, pp. 1261–1266.
[37] R. Castro, M. Coates, G. Liang, R. Nowak, and B. Yu, “Network Tomography: Recent Developments,” Stat. Sci., vol. 19, no. 3, pp. 499–517, 2004.
[38] T. M. Chen, “Network Traffic Modeling,” in Handbook of Computer Networks, vol. 3, Hoboken, NJ, USA: John Wiley & Sons, Inc., 2012, pp. 326–339.
[39] “Practical Packet Analysis: using Wireshark to solve real-world network problems,” Netw. Secur., vol. 2011, no. 8, p. 4, 2011.
[40] C. V. Wright, L. Ballard, F. Monrose, and G. M. Masson, “Language identification of encrypted VoIP traffic: Alejandra y Roberto or Alice and Bob?,” in Proceedings of the 16th USENIX Security Symposium, 2007.
[41] X. Gong, N. Kiyavash, N. Schear, and N. Borisov, “Website Detection Using Remote Traffic Analysis,” Sep. 2011.
[42] L. Breiman, “Random Forests,” Mach. Learn., vol. 45, no. 1, pp. 5–32, 2001.
[43] B. Vandalore, W. Feng, R. Jain, and S. Fahmy, “A Survey of Application Layer Techniques for Adaptive Streaming of Multimedia,” Real-Time Imaging, vol. 7, no. 3, pp. 221–235, Jun. 2001.
[44] M. Hall, “Correlation-based Feature Selection for Machine Learning,” 1998.
[45] Q. Zhang, L. T. Yang, Z. Chen, and P. Li, “A survey on deep learning for big data,” Inf. Fusion, vol. 42, pp. 146–157, Jul. 2018.
[46] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” in Advances in Neural Information Processing Systems, 2012.
[47] A. L. Caterini and D. E.
Chang, “Recurrent neural networks,” in SpringerBriefs in Computer Science, 2018.
[48] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[49] K. Cho et al., “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation,” Jun. 2014.
[50] D. W. Otter, J. R. Medina, and J. K. Kalita, “A Survey of the Usages of Deep Learning in Natural Language Processing,” Jul. 2018.
[51] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, “Hierarchical Attention Networks for Document Classification,” in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 1480–1489.
[52] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient Estimation of Word Representations in Vector Space,” arXiv preprint arXiv:1301.3781, Jan. 2013.
[53] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, “Distributed Representations of Words and Phrases and their Compositionality,” in Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.
[54] J. Pennington, R. Socher, and C. Manning, “GloVe: Global Vectors for Word Representation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
[55] Y. Kim, “Convolutional Neural Networks for Sentence Classification,” in Proceedings of EMNLP, Aug. 2014.
[56] M. J. Berger, “Large Scale Multi-label Text Classification with Semantic Word Vectors,” Tech. Rep., pp. 1–8, 2014.