A Graph Theoretic Approach to Malware Detection
By
Syeda Momina Tabish

A THESIS
Submitted to
Michigan State University
in partial fulﬁllment of the requirements
for the degree of
MASTER OF SCIENCE
Computer Science
2012

ABSTRACT
A Graph Theoretic Approach to Malware Detection
By
Syeda Momina Tabish

Current malware detection approaches (i.e. anti-virus software) deployed at end
hosts utilize features of a speciﬁc malware instance. These approaches suﬀer from
poor accuracy because such features are easily evaded by trivial obfuscation such
as garbage insertion and re-ordering of instruction or call sequences. In this paper,
we introduce a novel graph theoretic approach to detect malware from instruction
or call sequences. In our approach, we map the instruction or call sequence of an
executable program to a graph. We then extract features from the constructed
graphs at three levels: (1) vertex level, (2) sub-graph level, and (3) graph level.
These features act as footprints of the behavior of an executable program and are
leveraged to diﬀerentiate between benign and malware programs. The results of our
experiments show that our graph-theoretic approach diﬀerentiates between benign
and malware programs with ≈ 100% accuracy.

Copyright by
SYEDA MOMINA TABISH
2012

TABLE OF CONTENTS

List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

v

List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

vi

1 Introduction

1

1.1

Background and Motivation . . . . . . . . . . . . . . . . . . . . . . .

1

1.2

Limitations of Prior Art . . . . . . . . . . . . . . . . . . . . . . . . .

2

1.3

Proposed Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . .

2

1.4

Key Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . .

3

2 Related Work

5

2.1

Static Detection Techniques . . . . . . . . . . . . . . . . . . . . . . .

5

2.2

Dynamic Detection Techniques . . . . . . . . . . . . . . . . . . . . .

6

3 Proposed Approach

10

3.1

Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

3.2

Graph Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

3.3

Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

3.4

Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

4 Experimental Evaluation

19

4.1

Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

4.2

Feature Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

4.3

Classiﬁcation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

4.4

Obfuscation Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

5 Conclusion

32

BIBLIOGRAPHY

33

iv

LIST OF TABLES

Table 4.1

Basic statistics of the API call dataset used in this study. . . . 20

Table 4.2

Classiﬁcation accuracy results for diﬀerent feature sets. . . . . . 29

v

LIST OF FIGURES

Figure 3.1

Architecture diagram of our proposed approach. For interpre-

tation of the references to color in this and all other ﬁgures, the reader
is referred to the electronic version of this thesis. . . . . . . . . . . . . 11
Figure 4.1

Timeseries and behavior graph of an example benign program. 21

Figure 4.2

Timeseries and behavior graph of an example malware program. 22

Figure 4.3

Distribution of some graph level features. . . . . . . . . . . . . 24

Figure 4.4

Q-Q probability plot of sub-graph level features. . . . . . . . . 26

Figure 4.5

Out-degree: vertex level feature of LocalAlloc API call. . . . 27

Figure 4.6

Number Of Triangles: vertex level feature of LocalAlloc API

call. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Figure 4.7

Closeness Centrality: vertex level feature of LocalAlloc API

call. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
Figure 4.8

ROC plots of classiﬁcation accuracy. . . . . . . . . . . . . . . 30

Figure 4.9

AUC of Obfuscation Analysis . . . . . . . . . . . . . . . . . . 31

vi

Chapter 1
Introduction
1.1

Background and Motivation

Malware analysis and detection has recently been a prime research area due to the
widespread increase in the number of new and previously unknown malware. The
huge monetary gains involved and ease of writing and distributing a malicious piece
of code have made it a cat and mouse chase, where the malware writers are ahead
and the security experts are closely following behind. Although a number of behavior
based malware detection techniques have been proposed recently, signature based
malware detection is still the most widely deployed defence mechanism by commercial anti-virus software. This is mainly because behavior based malware detection
techniques suﬀer from false alarms; whereas, signature based malware detection
techniques have zero false alarm rate. Furthermore, behavior based malware detection techniques suﬀer from large processing overheads; on the other hand, signature
based malware detection techniques have acceptable processing overheads.
However, behavior based malware detection techniques are being actively investigated because of their ability to detect zero-day (or previously unknown) malware as
they do not rely on any a-priori knowledge about new malware. Some of the behavior based malware detection approaches have been integrated into the commercial
anti-virus software with limited success. The state-of-the-art commercial anti-virus
software still rely heavily on exact signature matching for malware detection and

1

hence they typically fail whenever there is a new malware outbreak. At the moment
the only line of defense for the naive end user does not provide suﬃcient protection
and therefore an eﬃcient and eﬀective non-signature based line of defense is needed
that can protect end hosts from previously unseen or zero-day malware. There is
still room for non-signature based malware detection techniques that have high detection accuracy, are computationally inexpensive, and are easily scalable with the
increasing number of new malware.

1.2

Limitations of Prior Art

Existing commercial anti-virus software generally rely on separate signature or behavioral pattern extraction for every malware sample. This approach requires signature extraction for every malware sample and the size of signature database has
a tendency to explode and hence is unscalable due the increasing number of new
malware. According to a survey by TrendMicro, almost 5, 000 new malware samples
are submitted daily to the anti-virus software vendors for analysis [18].
Some behavior based malware detection techniques have also been proposed in
prior literature. A majority of these techniques rely on extracting or mining unique
patterns or signatures from the instruction or call graphs, as discussed later in
Section 2. The detection is typically done by graph matching algorithms or subgraph based features, which limits the granularity of such approaches to a particular
malware instance only.

1.3

Proposed Approach

In this paper, we propose a graph theoretic approach to malware detection at end
hosts using instruction sequences or API call sequences. The instruction or API call
sequences are mapped to graphs and structural properties of graphs are leveraged
to extract relevant features. The distinguishing feature sets, forming the basis of
our detection methodology, are extracted at three levels: (1) vertex level, (2) sub-

2

graph level, and (3) graph level. Our proposed vertex level features can be divided
into degree, path, and connectivity categories. Examples of these three categories
respectively are degree, diameter, and clustering coeﬃcient. The aforementioned
and other features also give us interesting insights about general graph properties.
We extract a similar set of graph level features. At sub-graph level, we identify and
extract features (program instructions or call sequences) based on Markov chains.
We also use the concept of typicality of Markov chain states to identify program
instruction or call sequences of varying lengths. The concept of typicality is important in the context of eﬃciently identifying a small subset of program instruction
or call sequences from a very large sample space. All of the aforementioned features, extracted at various graph levels, are then used along with machine learning
algorithms for automatically learning classiﬁcation models.

1.4

Key Contributions

The key contributions of our proposed two-step approach are described below.
1. The ﬁrst step of our approach is feature extraction. To our best knowledge,
this work is the ﬁrst attempt to extract and capture graph properties at various
levels starting from vertex level, to sub-graph level, and to graph level. The
canvas of feature extraction in our approach is wide enough to portray suﬃcient
information for accurate classiﬁcation. In the second step, this comprehensive
feature set is further used in conjunction with machine learning algorithms, so
as to make full use of their sophistication and classiﬁcation capabilities.
2. We evaluate the eﬀectiveness and accuracy of our proposed approach using a
comprehensive data set of benign and malware ﬁles, which is collected keeping
in mind that this data set should be a good representation of a real-world
end host. We made sure to include majority of the normal daily use benign
application with wide range of functionality some of which also match closely
with the malware behavior and vice versa.

3

The rest of this paper is organized as follows. We provide an overview of the
related work in Section 2. The details of our proposed approach are explained in
Section 3. In Section 4, we evaluate the accuracy of our proposed approach. We
ﬁnally conclude this paper in Section 5.

4

Chapter 2
Related Work
Malware detection has been one of the most pressing security problem in recent
years. A lot of work has been done in this ﬁeld, which can be divided into two
broad categories; static analysis and dynamic analysis techniques. Static analysis
is based on the concept of detecting or classifying an executable before it is run.
On the other hand, dynamic analysis schemes base their detection process while the
executable is running. Both analysis schemes have their pros and cons, for instance,
in static analysis it is not guaranteed that the code analyzed is the same as the
code executed, many polymorphic worms change the execution pattern on the ﬂy.
Likewise, in most of the dynamic analysis techniques, the malicious program is ﬁrst
run in a virtual environment or sand box to determine detection check points. We
separately discuss both categories of work in the following text.

2.1

Static Detection Techniques

We now discuss and summarize the most relevant static analysis techniques.
In [9], Christodorescu et al. presented an approach to detect malicious patterns
in executables using static analysis. They also claim that the proposed approach is
resilient to common obfuscation transformations. They ﬁrst generalize the malicious
code into an automaton and then an internal representation of malicious code is
generated using abstraction patterns. The third process in the proposed framework

5

is that using IDA pro and Code surfer, control ﬂow graphs (CFGs) are created which
are further passed as an input to the annotator. The annotator creates an annotated
CFG which contains the information such as where an abstraction pattern is found
in the code. The last step is the detector which uses the annotated CFG’s and
the malicious code automaton and decides using code uniﬁcation if the malware
pattern matches with a pattern in the annotated CFG. The evaluation is done on
4 viruses and their obfuscated versions and has good accuracy in terms of false
positive and false negative rates. The limitation of the proposed approach is that it is
computationally expensive and can handle only very simple obfuscation techniques.
In [10], the authors presented a malware detection scheme which uses semantics
of the executables instead of exact string or signature matching. The executable
is parsed through IDA pro and CFG’s are constructed and an intermediate representation, which is a library of x86 instructions also called a template, is created.
The detection algorithm keeps ﬁnding template nodes which match the program
template and returns with the instructions if a match is found. The limitations of
their approach are in the intermediate representation of the instructions, where it
is computationally hard to generate a complete set of every possible instruction
representation and take signiﬁcant time for processing as well.

2.2

Dynamic Detection Techniques

Below we describe some dynamic analysis techniques that are most relevant to our
approach.
In [4] and [3], Bayer et al. proposed an analysis tool for malware security experts
so that they do not have to manually analyze the suspicious executable and which
can give them enough information about it as to render it malware or benign. The
tool is based on a PC emulator Qemu, which can handle self modifying code. The
proposed tool works by analyzing Windows API and native API calls made by the
malicious process, most importantly it keeps track of ﬁle created or opened using
function arguments. The analysis is compiled in the form of a comprehensive report

6

which contains, general information (ﬁle size, command line arguments, etc), ﬁle
activity, registry activity, service activity, process activity, and network activity.
The evaluation is done on ten malware samples and the analysis report matched
closely with the descriptions of majority of them.
In [20] a taint graph based malware detection technique was proposed by Yin
and Song. The malicious program is run in an emulator containing a test engine on
which the test scripts are run which ﬁrst introduce important taint information (like
password input, TCP/UDP/ICMP traﬃc etc) and then monitor the activities of
the sample program and the overall system under observation. The behavior of the
program in a system wide context is further represented in the form of graphs which
help to identify if the program tries to read or misuse the taint information. The
evaluation is done on 42 real-world malware samples containing root kits, password
thieves, key loggers etc and the proposed approach detected all the malware, however, falsely detected 3 benign programs, which closely matched with the malware
behavior.
In [6], Bose et al. proposed a behavioral detection framework for mobile malware.
In this framework, system calls are logged and correlated to create dependency
graphs that are further pruned and aggregated for behavioral signature generation.
The classiﬁcation of partial or incomplete malware behavior, when the malware has
not yet ﬁnished execution is done using machine learning algorithm support vector
machine for binary classiﬁcation. The evaluation of the proposed approach is done
on custom made programs which emulate real world Symbian worms and other
well known malware. The results show that the proposed framework achieved good
detection accuracy with decreasing number of false negatives as the training data
size is increased.
In [16], Kolbitsch et al. proposed a dynamic malware detection technique which
using system calls information automatically creates ﬁne-grained behavior models
and match any given unknown program against these models to classify it as benign
or malware. The behavior models are represented in the form of graphs in which
nodes denote interesting system calls and edges denote the data dependency between

7

two system calls when the input of one is dependent on the output of the other.
The ﬁrst part of the approach is malware analysis that is carried out in a custom
made dynamic malware analysis tool Anubis, which records all the disassembled
instructions, system calls, data dependencies using taint analysis, and memory logs.
All of the above information is used to create behavior graphs. The second part
of the approach is detection which comprises of a scanner that monitors system
call invocations and checks if the monitored program matches any of the behavior
graphs. Upon matching, the malicious program is terminated. The evaluation of the
proposed approach is done on 6 malware families and it achieves reasonable detection
accuracy.
In [1], Ahmed et al. proposed a runtime malware analysis and detection tool
that detects malware using spatio-temporal information in API call logs. The spatial
information (from arguments and return values of calls), and temporal information
(from sequence of calls) is further used to build formal models to distinguish malware
from benign programs on the basis of API calls. The most discriminative spatial
and temporal features of the API calls are selected on the basis of information
gain, and are then passed on to the standard machine learning and data mining
classiﬁers using 10-fold cross validation. The evaluation is done on 416 malware
and 100 benign samples. The proposed approach achieves up to approximately 98%
detection accuracy.
In [8] a graph based malware detection approach is presented, which focuses on
mining patterns in a dependence graph after reducing its data space. The dependence graph is constructed using system calls and is often quite large in practice.
The proposed approach ﬁrst drastically reduces the size of dependence graph by randomly and repeatedly summarizing it, and then it uses subgraph mining algorithm
to traverse the pattern space. The evaluation of the proposed approach is done on 6
malware families and it is also compared to gSpan, a state-of-the-art graph miner.
The results show that the proposed approach outperforms gSpan by showing stable
running time and eﬃciency and reveals interesting malware ﬁngerprints.
In [15], Islam et al. proposed a behavior based malware detection technique,

8

which extracts strings (system call names and function arguments) from malware
and benign trace logs, which are then used for classiﬁcation. Each ﬁle of the dataset
is run in a virtual environment and using a trace tool HookMe, the system call logs
are obtained. The trace ﬁles are used in feature extraction, which generates a list of
strings present in the binary and stores them in a hash table. The feature vectors
are created on the basis of the absence or presence of a string in a particular ﬁle.
These features are then passed onto data mining and machine learning base classiﬁers, which are decision tree, rule learner, support vector machine, instance based
classiﬁer, and meta classiﬁer to perform k-fold cross validation. The experimental
evaluation is done on about one thousand malware and about ﬁve hundred benign
samples. The proposed approach achieves 97.3% detection accuracy.
In [14], Fredrikson et al. proposed an approach which focuses on automatically
extracting optimal speciﬁcations for behavior based malware detection. They use
graph mining (leap mining) and concept analysis algorithms to analyze a set of
malicious and benign programs, that extracts signiﬁcant malicious behaviors, and
creates an optimally discriminative speciﬁcation. The experimental evaluation is
done on 912 malware samples, collected using an internet honeypot and 49 benign
samples. The proposed approach achieves 86% detection accuracy for previously
unknown malware samples with 0% false positive rate. The comparison of their
technique with commercial anti-virus software detectors shows that their proposed
approach outperforms commercial software.

9

Chapter 3
Proposed Approach
In this section, we present the details of our proposed graph theoretic approach
towards malware detection. We ﬁrst provide an overview of its architecture, which
consists of three main modules: (1) graph construction, (2) feature extraction, and
(3) detection.

3.1

Overview

We now introduce our graph-theoretic approach to detect malware from instruction
or call sequences. Figure 3.1 shows the architecture of our proposed approach, which
consists of three main modules. In the ﬁrst module, we map the instruction or call
sequence of an executable program to a graph. The next module extracts features
from the constructed graphs at three levels: (1) vertex level, (2) sub-graph level, and
(3) graph level. These features act as footprints of the behavior of an executable
program and are leveraged in the detection module to diﬀerentiate between benign
and malware programs.

3.2

Graph Construction

The prior solutions for behavioral malware detection can be broadly classiﬁed into
two categories: static analysis (e.g. [9, 10]) and dynamic analysis (e.g. [3, 6, 17, 20]).
Depending on the category of solution, static or dynamic, the input is a sequence of
10

Feature Extraction

Instruction or
Call Sequence

1. LocalAlloc
2. RegOpenKeyExW
3. RegSetValueExW
4. RegCloseKey

Graph-level
Subgraph-level
Vertex-level

Graph
Construction

1

2

3

6

5

f1

f2

f3

Detection

…

4

5. CreateProcessW
6. GetProcAddress

Benign Malware

Figure 3.1: Architecture diagram of our proposed approach. For interpretation of the
references to color in this and all other ﬁgures, the reader is referred to the electronic
version of this thesis.
disassembled code instructions or system calls. A sequence of disassembled code instructions or system calls can be mapped to a multi-digraph where vertices represent
elements of the sequence and directed edges appear between consecutive sequence
elements. Let G = (V, E) denote the graph, where V is the set of vertices and E
is the set of directed edges. The graph corresponding to a sequence of length l will
have the edge set, such that |E| = l − 1. Each edge in the edge set has an associated
time index, which is deﬁned by the function T (eij ), where eij is an edge incident on
vertex vj from vertex vi . In the rest of this paper, we will refer to the constructed
graphs as behavior graphs.

3.3

Feature Extraction

We characterize the properties of behavior graphs in terms of feature sets that capture their structure at diﬀerent levels of granularity. More speciﬁcally, we extract
features quantifying their behavior at three levels: (1) vertex level, (2) sub-graph
level, and (3) graph level. The details of these feature sets are separately provided
below.

11

Vertex-level Features
The vertex-level features can be classiﬁed into the following categories: degree, path,
and connectivity. Important degree based features that we analyze in our proposed
approach are: in-degree, out-degree, degree, and reciprocity. Important path based
features that we analyze in our proposed approach are: betweenness centrality and
closeness centrality. Important connectivity based features that we analyze in our
proposed approach are: number of triangles, clustering coeﬃcient, and eigenvector
centrality. These features are separately computed for all unique vertices (representing instructions or API calls) in the behavior graphs. Below we formally deﬁne these
properties.
• Degree: The degree of a vertex is deﬁned as the number of edges incident on
it. The degree δi of a vertex vi is deﬁned as:

δi =

(3.1)

ejk
∀j=i∨k=i

where ejk denotes an inwards or outwards edge between vertices vj and vk .
• In-Degree: The degree of a vertex is deﬁned as the number of inwards edges
incident on it. The in-degree δ↓i of a vertex vi is deﬁned as follows.

δ↓i =

ejk

(3.2)

∀k=i

• Out-Degree: The degree of a vertex is deﬁned as the number of outwards
edges incident on it. The out-degree δ↑i of a vertex vi is deﬁned as follows.

δ↑i =

ejk

(3.3)

∀j=i

• Reciprocity: The reciprocity of a vertex is deﬁned as the ratio of its outdegree to its in-degree. The reciprocity Ri of a vertex vi is deﬁned as:

Ri = δ↑i /δ↓i =

ejk /
∀j=i

12

ejk
∀k=i

(3.4)

where ejk denotes an inwards or outwards edge between vertices vj and vk .
• Number of Triangles: A triangle is deﬁned as the subset of any three vertices
in a graph that are completely connected. The triangle count ∆i of a vertex
vi is deﬁned as the number of triangles that contain the given vertex as one of
their vertices. Let Γi denote the set of vertices that a vertex vi is connected to
then its triangle count ∆i is deﬁned as:

ejk

∆i =

(3.5)

vj ,vk ∈Γi

• Clustering Coeﬃcient: The clustering coeﬃcient of a vertex is deﬁned as
the ratio of number of triangles it is a part of to the total number of possible
triangles. If δi denotes the degree of a vertex and Ti denotes the number of
triangles it is a part of then its clustering coeﬃcient Ci is deﬁned as:

Ci =

∆i
δi
2

=

2∆i
δi (δi − 1)

(3.6)

• Eigenvector Centrality: The eigenvector centrality of a vertex is a measure
of its importance in a network. To deﬁne eigenvector centrality ei of vertex vi ,
let W denote the adjacency matrix of the graph G where wi,j is 1 if an edge
exists between vertices vi and vj , and 0 otherwise.

1
ei =
λ

N

wi,j xj

(3.7)

j=1

where λ is the principal eigenvalue of matrix W.
• Betweenness Centrality: The betweenness centrality of a vertex is deﬁned as
the fraction of all pair shortest paths, except those originating or terminating
at it, that pass through it. Let Pjk denote the shortest path from vertex vj to
vertex vk , where Pjk = (vj , vl , vm , vn , ..., vk ). The betweenness centrality bi of
a vertex vi is deﬁned as:

bi =

2I(Pjk , i)
|V |(|V | − 1)
13

(3.8)

where I(Pjk , i) is an indicator function such that I(Pjk , i) = 1 when vi ∈ Pjk
and I(Pjk , i) = 0 when vi ∈ Pjk .
/
• Closeness Centrality: The closeness centrality of a vertex is deﬁned as the
average length of shortest paths to all vertices reachable from it. Let |Pij |
denote the shortest path length from vertex vi to vertex vj . The closeness
centrality ci of a vertex vi is deﬁned as follows.
N
j=1 |Pij |

ci =

|V |

(3.9)

Sub-graph level Features
Let P (x1 , x2 , ..., xn ) denote the probability of ﬁnding a particular n-gram x1 , x2 , ..., xn
in a given sequence and k denote the sample space size of each element in the sequence. The size of sample space for a sequence containing n elements is k n . Equivalently, it can be represented as the joint probability P (x1 ∩ x2 ... ∩ xn ). Using Bayes
theorem, this joint probability can be deﬁned by a combination of the marginal
and conditional probabilities. More speciﬁcally, P (x1 ∩ x2 ... ∩ xn ) = P (x1 |x2 ... ∩
xn )P (x2 ... ∩ xn ). In fact, the conditional probability contains more precise information due to its reduced sample space compared to the joint probability. These
conditional n-gram probabilities can be conveniently represented using a discrete
time Markov chain of order n − 1 containing k n−1 states. Note that each conditional
n-gram sequence corresponds to a unique state in the multi-order Markov chain.
Now let us assume that the presence of a state in Markov chain is represented by a
binary indicator random variable Ii , i = 1, 2, ..., k n−1 , where k n−1 is the total number of states of the Markov chain of order n − 1. Hence, P (Ii = 1) represents the
probability for the presence of state Xi .
It is important to select an appropriate order for a Markov chain model. It is
also important to note that we are typically interested in employing a single Markov
chain to model a set of multiple sequences S. Here let |S| denote the size of the set
of sequences we want to model. For each sequence, autocorrelation is a well-known
heuristic for selecting appropriate order for its Markov chain model [7]. For a given
14

lag t, the autocorrelation function of a sequence, Sm (where m is the index), is
deﬁned as:

ρ[t] =

E{S0 St } − E{S0 }E{St }
,
σS0 σSt

(3.10)

where E(Si ) and σSi respectively represent the expectation and standard deviation
of S at lag i. The value of the autocorrelation function lies in the range [−1, 1], where
|ρ[t]| = 1 indicates perfect correlation at lag t and ρ[t] = 0 means no correlation at
lag t. The minimum value of lag tmin for which ρ[tmin ] falls inside the 95% conﬁdence
interval band is selected to be the appropriate order for a Markov chain. For a set of
multiple sequences, let T denote the set of selected orders as per the aforementioned
criterion. We select the maximum value in T, denoted by Tmax , as the order of a
single Markov chain model that we want to employ.
As mentioned earlier, the number of states in a Markov chain increase exponentially for higher orders and so does the complexity of the underlying model.
Furthermore, higher order Markov chains require a large amount of training data to
identify a subset of states that actually appear in the training data. In other words,
a Markov chain model trained with limited data is typically sparse. To overcome
these challenges, we can combine multiple states in a higher order Markov chain to
reduce its number of states. We are essentially using states from lower order Markov
chains as we combine diﬀerent states in a multi-order Markov chain. We also need
to establish a criterion to combine states in a multi-order Markov chain.
Towards this end, we use the concept of “typicality” of Markov chain states [7].
The basic idea of typicality is that we can identify a “typical” subset of Markov
chain states by generating its realizations. Before delving into further details, we
ﬁrst state the well-known “typicality theorem” below:
For any stationary and irreducible Markov process X and a constant c, the sequence x1 , x2 , ..., xm is almost surely (n, )-typical for every n ≤ c log m as m → ∞. A
ˆ
sequence x1 , x2 , ..., xm is called (n, )-typical for a Markov process X if P (x1 , x2 , ..., xn ) =
0, whenever P (x1 , x2 , ..., xn ) = 0, and
ˆ
P (x1 , x2 , ..., xn )
− 1 < , when P (x1 , x2 , ..., xn ) > 0.
P (x1 , x2 , ..., xn )
15

ˆ
Here P (x1 , x2 , ..., xn ) and P (x1 , x2 , ..., xn ) are the empirical relative frequency and
the actual probability of the sequence x1 , x2 , ..., xn , respectively. In other words,
ˆ
P (x1 , x2 , ..., xn ) ≈ P (x1 , x2 , ..., xn ).
A direct consequence of the aforementioned typicality theorem is that we can empirically identify “typical” sample paths of arbitrary length for a given Markov process.
Towards this end, we can generate realizations (or sample paths) of arbitrary lengths
from the transition matrix of the Markov process. By generating suﬃciently large
number of sample paths of a given length, we can accurately identify a relatively
small subset of sample paths that are typical.
Using this criterion, we select a subset of top-10000 typical states X1000 as potential features, whose lengths vary in the range [0, Tmax ]. To further cut down the
number of sub-graph level features (up to 100), we use an information-theoretic
measure called information gain to rank features [11]. Information gain is used to
quantify the diﬀerentiation power of features (Markov chain states in our case). In
this context, information gain is the mutual information between a given feature
Xi and the class variable Y . For a given feature Xi and the class variable Y , the
information gain of Xi with respect to Y is deﬁned as:

IG(Xi ; Y ) = H(Y ) − H(Y |Xi ),

(3.11)

where H(Y ) denotes the marginal entropy of the class variable Y and H(Y |Xi )
represents the conditional entropy of Y given feature Xi . In other words, information
gain quantiﬁes the reduction in the uncertainty of the class variable Y given that
we have complete knowledge of the feature Xi . Note that, in this paper, the class
variable Y is {Benign, Malware} for our application. Using information gain, we
ﬁnally select a subset of top-100 typical states X100 as features.
Graph-level Features
We now separately deﬁne the graph-level features below.

16

• Clique Number: A clique is a sub-graph such that all vertices in it are
directly connected to each other by an edge. The clique number ω of a graph
G = (V, E) is deﬁned as the number of vertices in its largest clique.
¯
• Average Clustering Coeﬃcient: The average clustering coeﬃcient C of a
graph G(V, E) is deﬁned as:

1
¯
C=
|V |

|V |

Ci

(3.12)

i=1

• Diameter: The diameter D of a graph G(V, E) is deﬁned as:

D(G) = max(|Pjk |),
∀j,k

(3.13)

where Pjk is the shortest path length between vertices vj and vk .
• Average Path Length: The average path length l of a graph G(V, E) is
deﬁned as:

l(G) =

∀j,k (|Pjk |)

|V |(|V | − 1)

,

(3.14)

where Pjk is the shortest path length between vertices vj and vk .

3.4

Detection

We can now jointly use all features to diﬀerentiate between benign and malware
programs. In particular, given separate feature sets characterizing vertex-level, subgraph level, and graph-level properties, we ﬁrst separately classify behavior graphs
by deploying a machine learning classiﬁer for each of them. Given the classiﬁcation
results for each of the feature sets, we then use a meta machine learning classiﬁer
for ﬁnal classiﬁcation. In this study, we use Na¨ Bayes algorithm as the baseıve
and the meta-classiﬁer. Na¨ Bayes is a popular probabilistic classiﬁer that has
ıve
been widely used for problems like text and malware classiﬁcation and is known to
outperform more complex techniques in terms of accuracy [19]. It trains using two
sets of probabilities: the prior, which represents the marginal probability P (Y ) of the
17

class variable Y ; and the a-priori conditional probabilities P (Xi |Y ) of the features
Xi given the class variable Y . As previously explained, these probabilities can be
computed from the training set. Now, for a given test behavior graph with observed
features Xi , i = 1, 2, ..., n, the a-posteriori probability P (Y |X (n) ) can be computed
for both classes Y ∈ {Benign, Malware}, where X (n) = (X1 , X2 , ..., Xn ) is the vector
of observed features in the test cascade under consideration:

P (Y |X

(n)

P (X (n) , Y ) P (X (n) |Y )P (Y )
)=
=
.
P (X (n) )
P (X (n) )

(3.15)

The Na¨ Bayes classiﬁer then combines the a-posteriori probabilities by assuming
ıve
conditional independence (hence the “na¨
ıve” term) among the features.
n

P (X

(n)

|Y ) =

P (Xi |Y ).

(3.16)

i=1

Although the independence assumption among features makes it feasible to evaluate the a-posteriori probabilities with much lower complexity, it is unlikely that
this assumption truly holds all the time. For our study, we mitigate the eﬀect of
the independence assumption by pre-processing the features using the well-known
Karhunen-Loeve Transform (KLT) to un-correlate them [12].

18

Chapter 4
Experimental Evaluation
In this section, we ﬁrst provide details of the dataset used to evaluate our proposed
approach. We then provide the classiﬁcation results of individual vertex-level, subgraph level, and graph-level feature sets along with those of the meta-classiﬁer. We
also provide the results of obfuscation analysis in this section.

4.1

Dataset

The behavior of a program can be characterized using either of the two well-known
approaches: static and dynamic analysis. For static analysis, oﬀ-the-shelf executable
program loaders can be used to construct a sequence of code instructions. IDA
Pro [13] and CodeSurfer [2] are two well-known software to construct instruction call
graphs. For dynamic analysis, an open source processor emulator, such as QEMU [5],
can be used to log system calls in case of the Unix operating system and Application
Programming Interface (API) calls in case of the Windows operating system. QEMU
eﬃciently program calls through dynamic translation and caching [5]. Other analysis
systems such as TTAnalyze [3] and Panorama [20] have been built on the top of
QEMU.
To evaluate the eﬀectiveness of our proposed approach, we have collected the
API call sequences of benign and malware programs. A set of one hundred programs
from freshly installed Windows are chosen to represent benign class. We have chosen

19

a diverse subset of more than three hundred malware programs from the dataset provided by OﬀensiveComputing.net. These malware programs are further categorized
into the categories of virus, trojans, and worm based on their functionality. Table 4.1
shows basic statistics of the API call data used in this study. In our data set, we log
more than 51 million API calls belonging to 237 unique API functions. We note from
Table 4.1 that benign programs tend to have larger number of calls to more unique
API functions on average than malware programs. This shows that benign programs
tend to have more complex functionality than malware programs in general. Within
malware programs, we observe that trojans have the largest number of calls to most
unique API functions on average. On the other hand, worms have the smallest number of calls to least unique API functions on average. This observation is linked to
the functionality of program, where worms are concisely implemented to facilitate
automated propagation and trojans covertly perform some backdoor activity with
some benign upfront functionality. Finally, it is interesting to compare the average
entropy of API call distributions for benign and malware programs. We note that
benign programs have larger average entropy than malware programs, which points
to more complex functionality of benign programs than malware programs.
Table 4.1: Basic statistics of the API call dataset used in this study.
Program

Average

Average

Average

Average

Category

Number of

Number

Number of

Entropy of

Programs

of Calls

Unique Calls

Call Dist.

Benign

100

271, 694

135

0.96

Malware

312

77, 250

43

0.57

Virus

104

59, 086

43

0.57

Trojan

104

152, 527

47

0.54

Worm

104

33, 232

39

0.60

As mentioned in Section 3, instruction or call sequences can be mapped to behavior graph. Figure 4.1 shows the plots of timeseries and behavior graph of an example
benign program. Likewise, Figure 4.2 shows the plots of timeseries and behavior
20

80

Call Index

60

40

20

0
0

500

1000

1500
Time Index

2000

2500

3000

(a) Timeseries

(b) Behavior Graph

Figure 4.1: Timeseries and behavior graph of an example benign program.
graph of an example malware program. The indices of calls are ﬁxed in the alphabetical order in the timeseries plots. We visually observe interesting patterns in the
timeseries and behavior graph of benign and malware programs. For instance, we
observe repetitive call subsequence blocks in timeseries of both benign and malware
programs. However, the sizes of blocks are signiﬁcantly smaller for benign programs

21

80

Call Index

60

40

20

0
0

500

1000

1500
Time Index

2000

2500

3000

(a) Timeseries

(b) Behavior Graph

Figure 4.2: Timeseries and behavior graph of an example malware program.
as compared to malware programs. Note that the layout of behavior graphs in Figures 4.2 and 4.1 is radial in nature. In a radial layout, we randomly choose a call
as a center vertex and the remaining vertices are put in concentric circles based on
the distance from the center vertex. A visual comparison of benign and malware
behavior graphs also highlights interesting diﬀerences. For instance, we note that
22

the degree distribution of behavior graph of malware program is signiﬁcantly more
skewed compared to the behavior graph of benign program. In addition, we observe
signiﬁcantly deeper branches of vertices in the behavior graph of example benign
program compared to that of example malware program. We aim to capture such
diﬀerences in an automated fashion by leveraging the three feature sets encompassing properties of behavior graphs at various levels (graph, sub-graph, and vertex).
Below we explore the distribution of these feature sets across benign and malware
programs.

4.2

Feature Analysis

Figure 4.3 shows the distribution of two graph level features. In Figure 4.3(a), we
observe that behavior graphs of benign programs tend to have larger values of average clustering coeﬃcient compared malware programs. Recall from Section 3 that
average clustering coeﬃcient is a measure of connectivity of a graph, where larger
values of average clustering coeﬃcient indicate more connectivity and smaller values
indicate less connectivity. The class-wise distribution of average clustering coeﬃcient
shows that benign behavior graphs are more tightly connected than malware behavior graphs. This observation also provides an insight into the peculiar functional
characteristics of malware programs. Figure 4.3(b) shows the class-wise distribution
of average path length of benign and malware programs. We would expect the average path length of benign behavior graphs to be smaller than that of malware
behavior graphs because the former are more tightly connected (quantiﬁed by clustering coeﬃcient) compared to the latter. However, we observe an opposite trend in
Figure 4.3(b) where malware behavior graphs having larger average path length than
benign behavior programs. This indicates that benign behavior graphs are tightly
connected such that average path length are relatively larger. This observation is explained by the reasoning that benign behavior graphs have tightly connected clusters
which have fewer connections across them.
We now analyze the properties of sub-graph level features of behavior graphs.

23

0.4
Benign
Malware
Probability

0.3

0.2

0.1

0
0.56

0.62
0.69
0.76
Average Clustering Coefficient

0.82

(a) Average Clustering Coeﬃcient

Benign
Malware

0.3

Probability

0.25
0.2
0.15
0.1
0.05
0
1.86

1.96
2.07
Average Path Length

2.17

2.28

(b) Average Path Length

Figure 4.3: Distribution of some graph level features.
Recall from Section 3 that we can identify a typical sample paths of Markov chain
model of behavior graphs. The presence or absence of these typical sample paths act
as sub-graph level binary features. Here we are interested in studying the distribution
of their empirically estimated probabilities. Ideally, we want the sample paths in
typical set to have high probabilities. Figure 4.4 shows the Q-Q probability plot

24

of sub-graph level features. We observe that most sample paths have moderate to
high occurrence probabilities which is desirable. This shows that we can successfully
capture variable length Markov chain sample paths have
We now investigate the class-wise distribution of some vertex level features. Figures 4.5, 4.6 and 4.7 show the distribution of some vertex level features including
out-degree, number of triangles, and closeness centrality for LocalAlloc API call.
This particular API call is important for memory management and allocates the
speciﬁed number of bytes from the heap. A common trend observed for all of these
vertex level features is that we observe smaller values for malware programs compared to benign programs. Similar to the analysis of class-wise distribution of graph
level features, these observations provide insight into the diﬀerences between benign
and malware program behavior.

4.3

Classiﬁcation Results

We now provide the classiﬁcation results of diﬀerent feature sets proposed in Section 3. To systematically evaluate the eﬀectiveness of these feature sets in classifying
benign and malware programs, we ﬁrst evaluate standalone feature sets and then
evaluate their all possible combinations. Figure 4.8(a) shows the Receiver Operating
Characteristics (ROC) plots for the Na¨ Bayes classiﬁer with standalone feature
ıve
sets. Figure 4.8(b) shows the ROC plots for the Na¨ Bayes classiﬁer with combinaıve
tions of feature sets. ROC performance is characterized in terms of false positive rate
(fraction of benign programs falsely detected as malware programs) and true positive rate (fraction of malware programs correctly detected as malware programs). In
ROC plot, the x-axis represents the false positive rate and the y-axis represents the
true positive rate. An ideal classiﬁer will approach the operating point in the top-left
corner with 1.0 true positive rate and 0.0 false positive rate. Both true positive rate
and false positive rate are jointly incorporated in a single metric called Area Under
ROC Curve (AUC). The AUC for an ideal classiﬁer is 1.0. Another alternate measure of classiﬁcation performance is called precision, which is deﬁned as the fraction

25

Quantile Probability

0.999
0.98
0.75
0.50
0.25
0.02
0.001
0

0.2

0.4

0.6

0.8

1

0.8

1

Data
(a) Benign transition matrix

Quantile Probability

0.999
0.98
0.75
0.50
0.25
0.02
0.001
0

0.2

0.4

0.6
Data

(b) Malware transition matrix

Figure 4.4: Q-Q probability plot of sub-graph level features.
of correctly classiﬁed instances.
In Figure 4.8(a), we observe that ROC curve for sub-graph feature set is the

26

0.8
Benign
Malware

0.7
0.6

Probability

0.5
0.4
0.3
0.2
0.1
0
12

21

30
Out−degree

39

48

Figure 4.5: Out-degree: vertex level feature of LocalAlloc API call.
0.8
Benign
Malware
Probability

0.6

0.4

0.2

0
75

160
245
Number of Triangles

330

415

Figure 4.6: Number Of Triangles: vertex level feature of LocalAlloc API call.
closest to the top-left operating point. It is followed by the vertex feature set and
the ROC curve of graph feature set is furthest from the top-left operating point.
In Figure 4.8(b), we observe that combinations of feature sets does improve the
classiﬁcation performance. In fact, the best classiﬁcation performance is achieved

27

Benign
Malware

Probability

0.4
0.3
0.2
0.1
0
0.58

0.64
0.70
Closeness Centrality

0.77

0.83

Figure 4.7: Closeness Centrality: vertex level feature of LocalAlloc API call.
when all feature sets are combined. Table 4.2 also provides the detailed classiﬁcation
accuracy results for diﬀerent feature sets. It is interesting to note that combining
sub-graph and graph feature sets actually degrades the classiﬁcation performance
compared to standalone feature sets. This is because redundant features can confuse
the classiﬁer resulting in degraded classiﬁcation accuracy. However, we do note that
combining all feature sets consistently provides the best classiﬁcation performance
in terms of all metrics. With combined feature set, our proposed approach achieves
precision up to 99.5%.

4.4

Obfuscation Analysis

We have also done obfuscation analysis on our scheme so that we can observe the effect of random removal/addition of API calls edges; that is in our case to the graphs.
We speciﬁcally want to see whether a crafty malware writer would be successful in
trying to break the sequence of API calls.
Figure 4.9 shows the AUC of the obfuscation analysis on our scheme. We obfuscated API calls ranging from 10 - 100. The accuracy gracefully degrades but no
drastic eﬀect even if 40 API calls are obfuscated.

28

Table 4.2: Classiﬁcation accuracy results for diﬀerent feature sets.
Feature

AUC

Set

True Positive False Positive Precision
Rate

Rate

Graph

0.892

0.923

0.343

0.918

Vertex

0.949

0.923

0.02

0.995

Sub-graph

0.989

0.983

0.02

0.995

Vertex,

0.949

0.923

0.02

0.995

0.989

0.983

0.02

0.995

0.89

0.92

0.141

0.965

0.993

0.998

0.02

0.995

Graph
Sub-graph,
Graph
Vertex,
Sub-graph
All

29

True Positive Rate

1
0.9
0.8
0.7
Vertex
Sub−graph
Graph

0.6
0.5
0

0.1

0.2
0.3
False Positive Rate

0.4

0.5

(a) Individual Feature Sets

True Positive Rate

1
0.9
0.8
0.7
Vertex,Graph
Sub−graph,Graph
Vertex,Sub−graph
All

0.6
0.5
0

0.1

0.2
0.3
False Positive Rate

0.4

0.5

(b) Combinations of Feature Sets

Figure 4.8: ROC plots of classiﬁcation accuracy.

30

(a) Table of AUC for obfuscating API’s ranging from 10 to 100

(b) AUC’s plotted for diﬀerent obfuscation values

Figure 4.9: AUC of Obfuscation Analysis

31

Chapter 5
Conclusion
In this paper, we have proposed a graph theoretic malware detection technique
that operates on the program instruction sequences or the API call sequences. The
instruction sequences are mapped to a graph from which distinguishing feature sets
are extracted at three levels, namely vertex level, sub-graph level, and graph level.
These feature sets are then given as an input to machine learning algorithms, both
individually and as a whole, to generate training models. The experimental results
show that the proposed approach achieves approximately 100% detection rate and
0% false positive rate when the three feature sets are jointly used.

32

BIBLIOGRAPHY

33

BIBLIOGRAPHY

[1] Faraz Ahmed, Haider Hameed, M. Zubair Shaﬁq, and Muddassar Farooq:. Using
spatio-temporal information in API calls with machine learning algorithms for
malware detection. In ACM Workshop on Security and Artiﬁcial Intelligence,
2009.
[2] Paul Anderson and Tim Teitelbaum. Software inspection using CodeSurfer. In
Workshop on Inspection in Software Engineering, 2001.
[3] Ulrich Bayer, Christopher Kruegel, and Engin Kirda. TTAnalyze: A tool for
analyzing malware. In European Institute for Computer Antivirus Research
Annual Conference, 2006.
[4] Ulrich Bayer, Andreas Moser, Christopher Kruegel, and Engin Kirda. Dynamic
analysis of malicious code. Journal in Computer Virology, 2(1):67–77, 2006.
[5] F. Bellard. Qemu, a fast and portable dynamic translator. In USENIX Annual
Technical Conference, 2005.
[6] Abhijit Bose, Xin Hu, Kang G. Shin, and Taejoon Park. Behavioral detection of malware on mobile handsets. In ACM Conference on Mobile systems,
Applications, and Services (MobiSys), 2008.
[7] Pierre Bremaud. Markov Chains. Springer, 2008.
[8] C. Chen, C. Lin, M. Fredrikson, M. Christodorescu, X. Yan, and J. Han. Mining
graph patterns eﬃciently via randomized summaries. In ACM Conference on
Very Large Data Bases, 2009.
[9] Mihai Christodorescu and Somesh Jha. Static analysis of executables to detect
malicious patterns. In USENIX Security Symposium, 2003.
[10] Mihai Christodorescu, Somesh Jha, Sanjit A. Seshia, Dawn Song, and Randal E.
Bryant. Semantics-aware malware detection. In IEEE Symposium on Security
and Privacy, 2005.
[11] T. M. Cover and J. A. Thomas. Elements of Information Theory. WileyInterscience, 1991.
[12] R.D. Dony. The Transform and Data Compression Handbook, Chapter 1. CRC
Press, 2001.

34

[13] Chris Eagle. The IDA Pro Book: The Unoﬃcial Guide to the World’s Most
Popular Disassembler. No Starch Press, 2011.
[14] Matt Fredrikson, Somesh Jha, Mihai Christodorescu, Reiner Sailer, and Xifeng
Yan. Synthesizing near-optimal malware speciﬁcations from suspicious behaviors. In IEEE Symposium on Security and Privacy, 2010.
[15] Raﬁqul Islam, Ronghua Tian, Lynn M Batten, and Steve Versteeg. Diﬀerentiating malware from cleanware using behavioural analysis. In IEEE International
Conference on Malicious and Unwanted Software, 2010.
[16] Clemens Kolbitsch, Paolo Milani Comparetti, Christopher Kruegel, Engin
Kirda, Xiaoyong Zhou, and XiaoFeng Wang. Eﬀective and eﬃcient malware
detection at the end host. In USENIX Security Symposium, 2009.
[17] Andrea Lanzi, Davide Balzarotti, Christopher Kruegel, Mihai Christodorescu,
and Engin Kirda. Accessminer: Using system-centric models for malware protection. In ACM conference on Computer and Communications Security (CCS),
2010.
[18] D. Perry. Here comes the ﬂood or end of the pattern ﬁle. Virus Bulletin VB,
2008.
[19] Ian H. Witten, Eibe Frank, and Mark A. Hall. Data Mining: Practical Machine
Learning Tools and Techniques. Morgan Kaufmann, 2011.
[20] Heng Yin and Dawn Song. Panorama: Capturing system-wide information
ﬂow for malware detection and analysis. In ACM conference on Computer and
Communications Security (CCS), 2007.

35