ENHANCING AUTOMATED FAULT DISCOVERY AND ANALYSIS

By Jared David DeMott

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY

Computer Science

2012

ABSTRACT

ENHANCING AUTOMATED FAULT DISCOVERY AND ANALYSIS

By Jared David DeMott

Creating quality software is difficult. Likewise, offensive researchers look to penetrate quality software. Both parties benefit from a scalable bug hunting framework. Once bugs are found, an equally expensive task is debugging. To debug faults, analysts must identify statements involved in the failures and select suspicious code regions that might contain the fault. Traditionally, this tedious task is performed manually. An automated technique to locate the true source of the failure is called fault localization.

The thesis of this research is that an automated process to find software bugs and quickly localize the root cause of the failure is possible by improving upon existing techniques. This research is most interested in bugs that lead to security vulnerabilities. These bugs are of high value to offensive researchers and to the typical software test engineer. In particular, memory corruption bugs characterized by an application crash are the subset of all bugs focused on in this work.

Existing distributed testing frameworks do not integrate with fault localization tools. Also, existing fault localization tools fail to localize certain difficult bugs. The overall goal of this research is to: (1) build a dynamic testing framework powerful enough to find new bugs in commercial software; (2) integrate an existing fault localization technique into the framework that can operate on code without requiring source code or pre-generated test cases; (3) create a novel fault localization algorithm that operates better on difficult-to-localize flaws; and (4) test the improvement on benchmark and real-world code.

Those objectives were achieved, and empirical studies were conducted to verify the goals of this research. The constructed distributed bug hunting and analysis platform is called ClusterFuzz. The enhanced fault localization process is called Execution Mining. Test results show the novel fault localization algorithm to be an important improvement, and to be more effective than prior approaches. This research also achieved ancillary goals: visualizing fault localization in a new environment, and operating on assembly basic blocks for fully compiled code. A pipeline approach to finding and categorizing bugs paves the way for future work in the areas of automated vulnerability discovery, triage, and exploitation.

ACKNOWLEDGEMENTS

I would like to thank my advisors, Dr. Enbody and Dr. Punch, for their guidance, encouragement, and assistance. I would like to thank my company, the Crucial Security division at Harris Corporation, for invaluable support. I would like to thank the various coworkers, acquaintances, and friends who have encouraged me with personal and professional support; particularly, Drs. Pauli and Engebretson of Dakota State University. A special thank you goes to my family, without whose support this would not have been possible, especially my wife, Michelle. Finally, thank you to God the Father, Son, and Spirit for working in my life and the lives of others to make this degree possible.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS
LIST OF CODE LISTINGS
Chapter 1 Introduction
    Thesis Statement
    Contribution
Chapter 2 Software Security Testing Background
    Secure Development Lifecycle
    Static, Dynamic, and Manual Auditing
    Fuzzing Background
    Fuzzing Basics
    Mutation Fuzzing
    Intelligent Fuzzing
    Feedback Fuzzing
    Whitebox Fuzzing
    Distributed Fuzzing
Chapter 3 Enhancing Vulnerability Discovery
    Distributed Fuzzing and Analysis Framework
    Initial Results for the Fuzzer Component
    New Vulnerabilities Discovered
Chapter 4 Fault Localization Background
    Set Union and Set Intersection
    Nearest Neighbor
    SDG-ranking Technique
    Cause Transitions
    SOBER
    Failure-Inducing Chops
    Towards Locating Omission Errors
    Tarantula
    Top-K LEAP
    Test Data Creation
    Using Input Data for Fault Localization
    Database Fault Localization
Chapter 5 Building In Fault Localization and Visualization
    Automatic Fault Localization and Visualization
    Example 1: A Simple Bug with Code Coverage Differences
    Example 2: A Simple Data-only Bug
    Motivation for an Enhancement
Chapter 6 Enhancing Fault Localization with Select Input Tainting and Tracking
    Dynamic Information Flow Tracking
    Dynamic Binary Instrumentation
    Data Flow Tracking with Dynamic Binary Instrumentation
    Our Sources of Tainted Data
    How We Emit Tainted Basic Blocks
    Fault Localization Algorithm Details
    FL-Basic
    FL-Enhanced
    FL-DFT
Chapter 7 Test Results
    Contrived
    GZIP
    Pure-FTPd Network Server
    Siemens Benchmark
    VLC Media Player
    Firefox
Chapter 8 Significance of our Primary Fault Localization Enhancement
    Lack of Focus on Data
    Improvement on Data-only and Noise Problems
    Size Estimate
    Second Estimate
    Methodology Examples
    Second Estimate Results
    Reasonableness
    Summary
Chapter 9 Conclusions
    Merit
    Threats to Validity
    Bug Hunting Limitations
    Fault Localization Limitations
    Future Work
APPENDICES
    ClusterFuzz GUI Walkthrough
    Significance Data
BIBLIOGRAPHY

LIST OF TABLES

Table 1: Sample Coverage Data
Table 2: Top 5 Suspicious Code Blocks from Figure 6
Table 3: Top 5 Suspicious Code Blocks from Figure 7
Table 4: Top 10 Suspicious Code Blocks from FL-Basic, FL-Enhanced, and FL-DFT
Table 5: Results for Pure-FTPd
Table 6: Comparing FL Types (1=Success; 0=Failure)
Table 7: Results for Schedule2, Version 8
Table 8: Results for VLC
Table 9: Results for Firefox PNG
Table 10: Data-flow Estimate
Table 11: Reasoning about the Accuracy of the Data-flow Estimate
Table 12: Bugzilla-Redhat Details

LIST OF FIGURES

Figure 1: The Microsoft Security Development Lifecycle
Figure 2: High Level Fuzzing Diagram
Figure 3: ClusterFuzz
Figure 4: Visualization of Java Bug
Figure 5: Execution Mining
Figure 6: An Example with Covered Code Differences
Figure 7: Data-only Example
Figure 8: get_suffix Function from GZIP
Figure 9: dotype Function from Pure-FTPd
Figure 10: Visualized parse_master Function in IDA Pro via FL-DFT
Figure 11: File ./fs/nfsd/vfs.c
Figure 12: File ./tools/libxc/xc_hvm_build.c
Figure 13: ClusterFuzz Primary View
Figure 14: Clicked on Job Results for Java
Figure 15: Clicked on Specific Java Results

LIST OF ALGORITHMS

ALGORITHM 1. ClusterFuzz: Bug Hunting and Analysis System
ALGORITHM 2. Coverage-based Fault Localization with Input Data Tainting and Tracking

LIST OF CODE LISTINGS

CODE LISTING 1. SpTagToPublic Function
CODE LISTING 2. Code with a Bug because of Path Differences
CODE LISTING 3. Data-only Bug
CODE LISTING 4. get_suffix Function
CODE LISTING 5. Relevant code from dotype function
CODE LISTING 6. Relevant code from put_end function
CODE LISTING 7. Relevant code from parse_master function
CODE LISTING 8. Relevant code from row_callback function

CHAPTER 1
INTRODUCTION

A study conducted by NIST in 2002 reports that software bugs cost the U.S. economy $59.5 billion annually [Tassey]. (Throughout this document, a software mistake or defect may be referred to as an error, bug, or fault.) More than a third of this cost could be avoided if better software testing were available. Software can never be completely tested [Lucas 1961]. However, a practical process for continuous improvement is possible [Lewis 2000]. The improvement process calls for novel or enhanced tools and techniques that can be practically applied. Distributed dynamic testing frameworks are one practical way researchers have been able to scale the automated discovery of software coding mistakes [Gallagher 2009, Nagy 2009, Randolph 2009, Molnar and Opstad 2010, Eddington et al. 2011]. The first portion of this research provides a new distributed framework, which enhances the ability to locate software mistakes.

Debugging located faults is a large portion of the overall cost, as part of the repair of identified bugs [Vessey 1985]. Debugging is traditionally a labor-intensive activity in which testers or developers seek to identify the erroneous code for which a given input is generating an error. Consequently, researchers seek techniques that automatically help locate the root of bugs for the humans responsible for maintaining the code.

Opposite to software repair is software exploitation. The tools and techniques used to find and understand code mistakes are similar to those for testing and repair, except that once errors are understood they are weaponized into working software exploits. Countries such as China, Russia, and the United States are actively seeking to create cyber weapons [Clark and Knake 2010, Goel 2011]. The Stuxnet worm is a prime example of a cyber-weapon that leveraged memory corruption bugs [Greengard 2010] in the Microsoft Windows operating system. Therefore, cyber attackers also seek a system that can automatically find and understand security relevant bugs.

Given test cases [Voas and Miller 1992], dynamic code coverage information can be used to aid automatic, root-cause discovery of coding mistakes. (Code coverage (CC) is a measure commonly used in software testing; it describes the degree to which the source code of a program has been tested. In this case, CC refers to the specific blocks that were executed for a given test.) This process is known as fault localization (FL), and it is an active research area. Many papers on debugging and fault localization have been published in academic conferences and journals [Pan et al. 1992, Agrawal et al. 1995, Liblit et al. 2003, Renieris and Reiss 2003, Cleve and Zeller 2005, Gupta et al. 2005, Jones and Harrold 2005, Liu et al. 2005, Zhang et al. 2007, Hsu et al. 2008, Jeffrey et al. 2008, Cheng et al. 2009, Santelices et al. 2009, and Artzi et al. 2010]. Despite this work, FL is difficult and needs to be improved upon to see higher utilization [Wong and Debroy 2009].

In particular, existing coverage-based fault localization (CBFL) algorithms fail on certain bug types: data-only bugs and noisy code bugs. Data-only bugs are those for which measured code coverage does not differ between test cases. Thus, CBFL cannot reliably determine the location of the fault. Noisy code bugs are those which are harder to localize because of noisy coverage information. Noise in this context indicates sporadic and non-relevant code blocks that show up in failed test cases, but not in passing tests. CBFL algorithms key in on those differences. But if there are many differences, and they do not relate to the bug(s), the CBFL algorithm is led astray and many false locations get reported to the human analyst.
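To make the distinction concrete, the following Python sketch shows a hypothetical data-only bug (it is not drawn from any program studied in this work): every statement is executed by both the passing and the failing run, so a purely coverage-based comparison of the two traces yields nothing to rank.

    # Hypothetical example of a data-only bug: the failing input executes
    # exactly the same statements as the passing input, so coverage-based
    # fault localization sees no difference between the two traces.

    def parse_record(data: bytes) -> bytes:
        count = data[0]              # attacker-controlled length field
        payload = data[1:]
        out = bytearray()
        for i in range(count):       # same statements execute either way
            out.append(payload[i])   # fails only when count > len(payload)
        return bytes(out)

    if __name__ == "__main__":
        print(parse_record(b"\x03abc"))      # passing run: count matches payload
        try:
            parse_record(b"\x7fabc")         # failing run: same coverage, bad data
        except IndexError as err:
            print("crash:", err)

The difference between the two runs lives entirely in the data (the length byte), not in which code was covered, which is exactly the situation in which a coverage-only comparison breaks down.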
The second portion of this research provides an improvement to fault localization.

Thesis Statement

The thesis of this research is that popular coverage-based fault localization (CBFL) techniques can be enhanced. The enhancement is done by providing better information about which code blocks are operating on important information. Important information is defined as select portions of untrusted input. The input is tracked throughout a program's execution; this is called tainting. Code blocks that operate on tainted data score as more suspicious in our new CBFL than those that do not. The novel CBFL is also enhanced by pairing it with a distributed testing framework to provide a pipeline of prioritized software bugs.

Contribution

This research is most interested in bugs that lead to security vulnerabilities. These bugs have value to offensive researchers, and to the typical software test engineer. In particular, memory corruption bugs characterized by an application crash are the subset of all bugs focused on in this dissertation. This research also shows that data-only and noisy bugs are a significant portion of the memory corruption subset. This research provides the following contributions:

1. A powerful distributed fuzzing framework capable of more quickly finding and triaging bugs in large real-world applications. For example, zero-day vulnerabilities were found in Java. A zero-day is a security researcher's term for a new bug, not known to the vendor, and capable of being translated into a new attack for which no current defense has an exact protection.

2. The combination of fault localization with the distributed fuzzer. To our knowledge we are the first to do this. Furthermore, our system does not depend on source code or pre-generated test cases.

3. Two novel fault localization algorithms:
   a. The first FL improvement cuts through basic noise associated with gathering code coverage information to provide better results.
   b. The more impressive new FL algorithm pairs select input data tracking with coverage-based fault localization. The results show its superior effectiveness on a number of benchmark applications.

4. The test results of the new FL algorithms on two challenging, real-world problems:
   a. The first is a data-only bug in VLC's TiVo parsing code.
   b. The second is a bug in the Firefox PNG code, which resides in a large and noisy module.
   In both cases, the new approach significantly outperformed other coverage-based fault localization algorithms.

5. An analysis of the significance of the data-only and noisy subset. We show that our primary novel FL technique works well for code and data bugs. However, an additional goal was to understand how significant data-flow bugs are as compared to the relative size of control-flow bugs.
We reason and empirically show that data bugs are a significant portion of the studied faults, giving further credence to our work.

CHAPTER 2
SOFTWARE SECURITY TESTING BACKGROUND

Software testing is an investigation conducted to provide stakeholders with information about the quality of the product or service under test [Myers 1979]. Security testing is a process to determine that an information system protects data and maintains functionality as intended [Mercedes and Winograd 2008]. The most common subset of security requirements or security goals consists of the following: confidentiality, integrity, and availability (CIA) [Gasser 1988]. To meet those challenges, Microsoft has developed a well-known secure development lifecycle (SDL) [Microsoft 2011].

Secure Development Lifecycle

Creating robust and trustworthy applications is difficult. Figure 1 shows Microsoft's SDL process. Each step of Figure 1 is critical. Quality training raises the security IQ of an organization and makes security part of the culture. In the requirements stage, threat modeling ensures appropriate security measures are implemented on a per-application basis. Auditing the design diagrams is the first security and quality control measure. During implementation, programmers should use best practices and up-to-date build tools. Static analysis upon code check-in provides programmers with instant feedback and accountability. Dynamic, or runtime, testing is first used in the verification phase. This is the ideal time for development organizations to apply our research. Fuzz testing has become a popular dynamic testing technique [Miller et al. 1990]. Before release, a final security review is conducted for most applications. Applications meeting certain threat thresholds receive a manual code review. Finally, plans to maintain, patch, and respond to post-release events are put in place. That is required for bugs naturally found, and because hackers, or legitimate offensive professionals, are known to implement bug hunting frameworks (such as the one presented in this dissertation) to uncover coding errors after software has been fielded.

Figure 1: The Microsoft Security Development Lifecycle

For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation.

Static, Dynamic, and Manual Auditing

Static testing tools are best integrated into the development process. These tools are configurable and can be used to enforce custom rules, such as denying the use of dangerous API functions. Static tools are known to raise the overall security posture, or baseline security, of a code base. However, static tools often have false negatives and positives. Therefore, dynamic testing is required once the code base is stable. Fuzzing is discussed in the next section. Manual review for high threat applications is desirable, because each type of testing tends to catch errors missed by other types. Manual review is expensive because humans are not as efficient as computers. But highly skilled humans often catch obvious and non-obvious coding mistakes missed by automated techniques.

Fuzzing Background

Fuzzing is a testing technique known for its ability to uncover memory corruption bugs, which tend to have security implications [Takanen et al. 2008]. Hackers and security researchers have used fuzzing for years to find bugs in widely available commercial (closed source) and open source software.
Because of its effectiveness, fuzzing is included in many development environments; e.g., Microsoft incorporates fuzzing into their Secure Development Lifecycle [Microsoft 2011].

Fuzzing Basics

Fuzzing can be defined as a runtime testing technique where semi-invalid input is delivered while the system under test (SUT) is monitored, often with a debugger, for common exceptions such as memory access violations.

Figure 2 shows the fuzzing methodology from a high level [Oehlert 2005]. In Figure 2, data is created or mutated and given to the SUT, which is monitored. If the SUT crashes, the current context information from the program along with the input data is saved. Fuzzing continues until a stopping point. Random based fuzzers often have no defined stopping point.

Figure 2: High Level Fuzzing Diagram (get data; submit to the application; if the application fails, save the input and crash info; otherwise continue)

Fuzzing can create totally random input. Barton Miller et al. did this over twenty years ago against UNIX utilities, and for graphical user interfaces [Miller et al. 1990, Miller et al. 1995, Miller et al. 2006]. They had much success. They found post-release bugs missed by the manufacturers. However, much of today's fuzzing is less random. Data is either constructed based on samples (mutation fuzzing), or from the protocol specification (intelligent fuzzing).

Fuzzing is mostly concerned with software faults that tend to have security implications. Crash bugs caused by memory access violations are the most common type of bug discovered. These bugs can be easily caught in a debugger, and can often be turned into either a DoS (Denial of Service) or an access tool (software exploit). Because of this, fuzzing would not be an ideal tool for all testing, since there is typically no fuzzing oracle to catch non-crashing bugs. A non-crashing oracle could be constructed, as some have done for web application fuzzing. Web fuzzing is discussed in Chapters 9 and 10 of a recent fuzzing book [Sutton et al. 2007]. Finally, there is no completeness guarantee with fuzzing. It is a heuristic technique that has been empirically proven successful.

Mutation Fuzzing

Mutation fuzzing uses data samples as a starting point. Slight mutations are made with each iteration, before the data is delivered to the application. In this way the notion of "semi-valid" data is maintained. That is, to achieve sufficient protocol penetration, most of the input needs to pass the application's basic data sanity checks.

File fuzzing is the most popular type of mutation fuzzing in use today. Samples for client-side applications such as Adobe Reader, Microsoft Word, etc. are collected and used to fuzz the application. Much success has been reported with mutation fuzzing [Sutton 2006, Miller 2010]. Fuzzing with many different samples has been shown to be more effective than using a single sample. Intelligently selecting samples for mutation has been shown to be even more effective and efficient than simply using lots of samples [Molnar and Opstad 2010]. The method for choosing samples wisely is sometimes referred to as the MinSet operation. MinSet refers to finding the minimum set of inputs that covers the maximum amount of code. This increases effectiveness while decreasing runtime. MinSet is covered later in Chapter 3 as part of Algorithm 1.

Intelligent Fuzzing

Intelligent fuzzing means building a data model that describes the input data. That way, each data type can be mutated in a way that makes sense for the current data element. For example, it is common for binary protocols to expect data in what is called the Type/Length/Value (TLV) format. To fuzz such a protocol, the data model must specify the number of bytes included in each TLV element. Then, when the fuzzer mutates an element, TLV integrity will be maintained; e.g., the length field will be correctly updated with a new value if the value data is made longer or shorter than it originally was. Intelligent frameworks should also be able to correctly handle other complicated data elements such as checksums, hashes, and more.
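To illustrate the fix-up step that a data model enables, the following Python sketch mutates a single, hypothetical TLV record and then rewrites its length field so the result still passes basic sanity checks. It is a simplified illustration, not code from Peach or Sulley, and the record layout (1-byte type, 2-byte big-endian length) is an assumption made for the example.

    import random
    import struct

    # Hypothetical TLV record: 1-byte type, 2-byte big-endian length, then value bytes.
    def mutate_tlv(record: bytes) -> bytes:
        rtype = record[0]
        length = struct.unpack(">H", record[1:3])[0]
        value = bytearray(record[3:3 + length])

        # Dumb mutation: grow, shrink, or flip bytes in the value field.
        choice = random.choice(("flip", "grow", "shrink"))
        if choice == "flip" and value:
            value[random.randrange(len(value))] ^= 0xFF
        elif choice == "grow":
            value += bytes(random.randrange(256) for _ in range(random.randint(1, 64)))
        elif choice == "shrink" and value:
            del value[random.randrange(len(value)):]

        # The "intelligent" step: re-emit a consistent length so the target's
        # sanity checks still pass and the mutated value is actually parsed.
        return bytes([rtype]) + struct.pack(">H", len(value)) + bytes(value)

    if __name__ == "__main__":
        random.seed(1)
        original = b"\x01" + struct.pack(">H", 4) + b"ABCD"
        print(mutate_tlv(original).hex())

Without the length repair, most mutated records would be rejected by the first length check in the target, so the deeper parsing code would never be exercised.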
Two common intelligent fuzzing frameworks are the Peach Fuzzing Framework [Peach 2011] and the Sulley Fuzzing Framework [Amini and Portnoy 2010]. Sulley is geared toward network fuzzing. Sulley data models are constructed directly in the Python language that Sulley is written in. Peach is a more robust tool, but developing the Peach data models in XML (as required) is considered more difficult. One notable improvement over Sulley is that Peach can leverage a crash sorting plug-in by Microsoft called !exploitable (pronounced "bang exploitable") [Exploitable 2011]. !exploitable is discussed further in Chapter 3. Peach also can perform file fuzzing without framework modifications, unlike Sulley.

Researchers have reported success with intelligent fuzzing [Sutton and Greene 2005, Mulliner and Miller 2009]. While intelligent fuzzing is able to handle complex elements like data checksums, some researchers believe that with enough samples and iterations intelligent-like results can be achieved via mutation fuzzing, in sufficiently complex software, without the effort of constructing a data model [Miller 2010, Molnar and Opstad 2010]. Regardless, experts agree that different techniques find different bugs [Molnar and Opstad 2010]. A common recommendation is to use multiple fuzzing approaches, since no one testing technique captures all bugs.

Feedback Fuzzing

Fuzzers that adapt, or create dynamic inputs, based on information from the target are called feedback fuzzers. Evolutionary fuzzing is one such example. Evolutionary fuzzers operate based on a fitness function, such as covered code. As the fuzzer runs, new inputs are bred and delivered using genetic algorithms [McMinn and Holcombe 2006]. One of the first published approaches to evolutionary fuzzing came from our work at Michigan State University [DeMott et al. 2007]. (A first place research poster award was given at MSU for this work in 2007: http://www.cse.msu.edu/?Pg=50&Col=3&Inc=1&Nid=115.) The approach was a grey-box technique (assuming no access to source code, instead using analyzed runtime information). Using a Python debugger to measure covered code blocks, new inputs were generated and delivered that achieved generationally better code coverage to find bugs.

Whitebox Fuzzing

Researchers at Microsoft created a tool called SAGE (Scalable, Automated, Guided Execution), an application of whitebox file fuzzing for x86 Windows programs [Godefroid et al. 2008]. SAGE works by starting with an initial input. This input is then symbolically executed by the program. Information about how the data was used is stored. Constraint (branch condition) information is an important part of the recorded data. Each constraint is negated one at a time until the entire system is solved. This results in a new input to the program which has a different execution path. This is repeated for each constraint in the program.
In theory, this technique should provide 100% code coverage for the entire attack surface. In practice, this is not always the case, as will be discussed later in this section. The following is a contrived function for which SAGE can quickly get complete code coverage, where mutation fuzzers would struggle:

    void tricky_function ( char input[4] )
    {
        int cnt = 0;
        if(input[0] == 'b') cnt++;
        if(input[1] == 'a') cnt++;
        if(input[2] == 'd') cnt++;
        if(input[3] == '!') cnt++;
        if(cnt >= 3) abort(); //This line is the error.
    }

The above code illustrates a point: by using random inputs, the probability of finding the error is approximately 2^(-30). Therefore, mutation fuzzers would require a sample containing the input "bad!" to follow this code path. Obtaining such a sample may or may not be likely, depending on available samples for this hypothetical protocol.

To continue the example, we walk through how SAGE would generate inputs for each constraint. Suppose the initial input is "root". The resulting constraints are: {input[0] != 'b', input[1] != 'a', input[2] != 'd', input[3] != '!'}. SAGE will systematically negate each of the constraints to get new inputs. Eventually, the following set of constraints will be generated: {input[0] == 'b', input[1] == 'a', input[2] == 'd', input[3] == '!'}. The final solution gives the input "bad!", which finds the bug.

The SAGE technique does have limitations. The most obvious is that there are a very large number of paths in real-world programs. This is called the path explosion problem. Path explosion can be partially mitigated by generating inputs on a per-function basis and then tying the information together. Other factors might prevent SAGE from being able to solve the constraints in a reasonable amount of time. This happens when a difficult-to-solve condition occurs; checking cryptographic variables is a good example. Yet another problem arises because symbolic execution may be imprecise due to interactions with system calls and pointer aliasing problems. Finally, the ability of SAGE to quickly generate good inputs relies heavily on the quality of the initial input, much like mutation-based fuzzing.

Despite these limitations, SAGE claims to have an impressive history of finding real vulnerabilities in real products. For example, Microsoft states that SAGE was able to uncover the ANI format animated cursor bug [SDL 2007]. This vulnerability specifically arises when an input is used with at least two ANIH records, and the first one is of the correct size. Microsoft had previously fuzzed this code, but all of their inputs only had one ANIH record. Therefore, they did not find this particular bug (before it was released in the wild). However, given an input with only one ANIH record, SAGE generated an input with multiple ANIH records and quickly discovered this bug. The relevant code contained 341 branch constraints and SAGE required about 8 hours to uncover the bug.

Unfortunately, at the present time, SAGE is not available outside of Microsoft Research. Also, for our work we assume no access to source code information. Assuming no source is a valid assumption for blackbox testing, legacy testing (where source has been lost), and most prominently for offensive research when source code access is not permitted.

Distributed Fuzzing

The state-of-the-art in fuzzing is now focused on distribution or parallelization. Consider a fuzzing engine that generates 800,000 tests, each of which will take 10 seconds to complete. It would require roughly 92 days of runtime on one machine.
If 92 machines are used, the same tests finish in one day. An expanded platform based on these notions, created in this research work and called ClusterFuzz, is discussed later.

Microsoft and Adobe have both made reference to in-house distributed fuzzing tools [Gallagher 2009, Randolph 2009]. At Microsoft, Gallagher speaks about how the Office test team turned desktop and lab machines into a botnet for fuzzing during downtime. (Botnet is a jargon term for a collection of software agents, or robots, that run autonomously and automatically. The term is most commonly associated with malicious software, but it can also refer to a network of computers using distributed computing software.) At Adobe, Randolph mentions that the Peach fuzzer [Peach 2011] is used in a distributed fashion, though they give few details on how Peach was extended to start jobs or collect results on virtual machines (VMs). Molnar shows how he uses Amazon's Elastic Compute Cloud to create a distributed fuzzer [Molnar and Opstad 2010]. Molnar urges individuals to download and run tests on his pre-built Linux VM. Results are then sent to his website: www.metafuzz.com. Nagy spoke about a FuzzPark he created in Ruby to fuzz Microsoft Word. Nagy discovered many crashes, and shared those with Microsoft [Nagy 2009].

CHAPTER 3
ENHANCING VULNERABILITY DISCOVERY

Fuzzing is important to us because it provides a way to find new bugs and apply fault localization to closed-source, i.e. commercial, code. However, to be effective as a front-end to fault localization, fuzzing must generate and examine a large number of cases, requiring significant compute time. To mitigate this problem, a distributed fuzzer was created and combined with a backend to categorize the many potential bugs such an environment will generate. This framework is described next.

Distributed Fuzzing and Analysis Framework

The distributed fuzzing platform that was constructed to feed the fault localization process is called ClusterFuzz (CF). ClusterFuzz is a distributed computing framework that facilitates fuzzing of larger data input sets. It can be used as a stand-alone fuzzer or as a front end to a fault-localization tool. ClusterFuzz does fuzzing in parallel, making fuzzing more efficient, and does not require source code or a test set to generate tests. The speed-up is linear as resources are added to CF; e.g., a fuzzing run that would have taken 200 days on 1 computer can be done in 1 day on 200 virtual machines in CF.

Once an application has been fuzzed and bug results have been generated, ClusterFuzz proceeds to do further evaluation of the results. First, it examines the generated bugs and attempts to cluster them based on their similarity. Second, it attempts to rate the severity of the observed bugs. Figure 3 shows the high level design for ClusterFuzz. Each of the major sections from the diagram is briefly described below.

Figure 3: ClusterFuzz

Configure. The first thing the bug hunter does is choose a target to fuzz. That target, a.k.a. the system under test, must then be properly installed in a base virtual machine (VM). Windows XP was used for the base VM operating system (OS) as it is easier to configure and requires fewer resources than Windows Vista or 7. Windows in general was chosen both because the Peach fuzzing tool [Peach 2011], which is used in this work, operates best under Windows, and because many interesting targets to fuzz, such as Microsoft's Internet Explorer, reside only on Windows.
However, it is possible to use a different base OS and a different fuzzer with ClusterFuzz. Next, the test configuration must be constructed specifically for the SUT. The test configuration specifies all the details surrounding the setup of the SUT: input types, execution requirements, etc. The test configuration includes the data model (DM), which describes the protocol required for intelligent fuzzing (if desired). To aid the user, a graphical user interface (GUI) was created, which includes an option to automatically create a simple mutation DM, thus bypassing the manual work of protocol description. The base DM often yields reasonable results because the input samples are intelligently gathered (using Auto-MinSet described below). A common practice is to run the base DM initially, and then later design a custom DM for the particular application if budget allows. Run. The present setup has six VMware ESX 4.0 virtualization servers. Based on current hardware, each of the servers can handle approximately 25 virtual machines cloned from the base image. Therefore, 150 VMs are running on hardware that cost $30,000 USD when procured (October 2009). Each virtual machine has generous amounts of RAM and CPU cycles. These VMs may be used for a variety of purposes such as:  Auto-MinSet is a technique created as part of this research to find good input samples from the Internet for a given file format protocol. Auto-MinSet uses code coverage to find an optimal test set from samples, which were automatically searched for and downloaded. The method for choosing samples wisely is referred to as MinSet, since the goal is to select the minimum set of files that obtains that maximum amount of code coverage. The MinSet subroutine in Algorithm 1 provides the details. Briefly described here, MinSet: 18 1. Gathers a list of inputs and locates the input that covers the most code. 2. Next, the covered regions of each input are tested against the pool of regions covered thus far. If new regions are covered, the input is deemed worthy to be added to the minimum set of desired inputs. Otherwise, that input is discarded.  Auto-Retest, which checks the reliability of discovered bugs.  Execution Mining (described later).  Fuzzing. Collect. Once the fuzzing sessions have completed, the ordered (from !exploitable, described in the next section) bug results are collected. The CF GUI provides a convenient way for analysts to view each bug and create notes as they work through the pipeline of results. CF is described again in Algorithm 1: Algorithm 1. ClusterFuzz: Bug Hunting and Analysis System Input: A test configuration controls the fuzzing of each system under test, and is also typically used to automatically fetch good inputs. Each data sample in the collected set of Inputs is fuzzed to create a broad set of test inputs. Fuzzed permutations of an input i are denoted as i´. The most common EXCEPTION sought is a memory access violation. 
Initial Results for the Fuzzer Component

To demonstrate the robustness of ClusterFuzz, it was tested on a variety of commercial, closed-source software. The following are applications and data formats that have been fuzzed using ClusterFuzz:

- Client-side applications
  - Browsers: Internet Explorer, Chrome, Firefox, Safari, Opera
  - Office applications: Writer (Open Office), Word (Microsoft Office), Adobe Reader, Adobe Flash Player, Picture Manager (Microsoft Office)
  - Other: iTunes, QuickTime, Java, VLC Media Player, Windows Media Player, RealPlayer
- File formats
  - Images: JPG, BMP, PNG, TIFF, GIF
  - Video: AVI, MOV
  - Office: DOC, DOCX, XLS, XLSX, ODT
  - Adobe: PDF, SWF

Statistics were gathered over one month of running CF on the previously listed applications, providing many of the file formats as input. Not every combination produced a fault, but when faults were noted they were recorded as collective results, which are as follows:

- 141,780 faults total
- Faults/day: 4,726
- Faults/hour: 197
- 828 total unique fault bins
- 17 "Exploitable" bins
- 6 "Probably Exploitable" bins
- 0 "Probably Not Exploitable" bins
- 805 "Unknown" bins
- Unique fault bins/day: 28
- Unique bins "Probably Exploitable" or "Exploitable" per day: 0.9

Many observed faults are actually manifestations of the same fault, which is why the word "bin" is used above to show each grouping of bugs. Classifying faults is important to reduce downstream effort, and a tool exists to do just that: !exploitable [Exploitable 2011].

The !exploitable tool is a Windows debugging (Windbg) extension that provides automated crash analysis, security risk assessment, and guidance. The tool uses a hash to uniquely identify a crash and then assigns one of the following exploitability ratings to the crash: Exploitable, Probably Exploitable, Probably Not Exploitable, or Unknown. In an attempt to identify the uniqueness of each bug, two hash types are provided: major and minor. The difference is that the minor hash hashes more of the stack information, and therefore attempts to provide a finer amount of detail when sorting bugs.

The data shown above helps illustrate how !exploitable works, and the importance of filtering in fuzzing. While there may be many crashes, often there are many fewer unique and security-critical bugs. These high quality bugs are what developers and bug exploiters usually want to focus on. In this research, high quality (HQ) is defined as: reliable (repeatable when retested) and severe (a rating of Probably Exploitable or higher).
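The following Python sketch shows the general shape of this kind of bucketing and filtering. It is not the !exploitable implementation; the choice of hash inputs (top call-stack frames), the frame counts, and the crash records are assumptions made for illustration only.

    import hashlib
    from collections import defaultdict

    def bucket_crashes(crashes, major_frames=5, minor_frames=12):
        """Group crash reports by hashes of the top of the call stack.
        The 'major' hash uses fewer frames (coarse bins); the 'minor'
        hash uses more frames (finer sub-bins), loosely mirroring the
        major/minor split described above."""
        bins = defaultdict(list)
        for crash in crashes:
            frames = crash["stack"]
            major = hashlib.sha1("|".join(frames[:major_frames]).encode()).hexdigest()[:8]
            minor = hashlib.sha1("|".join(frames[:minor_frames]).encode()).hexdigest()[:8]
            bins[(major, minor)].append(crash)
        return bins

    def high_quality(bins, retest):
        """Keep bins that are rated severe and whose crash reproduces on retest."""
        keep = {}
        for key, crashes in bins.items():
            severe = any(c["rating"] in ("EXPLOITABLE", "PROBABLY_EXPLOITABLE") for c in crashes)
            if severe and retest(crashes[0]):
                keep[key] = crashes
        return keep

    if __name__ == "__main__":
        # Hypothetical crash records.
        crashes = [
            {"stack": ["memcpy", "SpTagToPublic", "cmm_parse"], "rating": "EXPLOITABLE"},
            {"stack": ["memcpy", "SpTagToPublic", "cmm_parse"], "rating": "UNKNOWN"},
            {"stack": ["strlen", "log_msg", "main"], "rating": "UNKNOWN"},
        ]
        bins = bucket_crashes(crashes)
        print(len(bins), "bins;", len(high_quality(bins, lambda c: True)), "high-quality")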
On top of the classification provided by !exploitable, CF also collects and stores relevant crash information, such as the registers used and nearby disassembly code, in the database. This information allows researchers to later search for particular register patterns as they become widely known. For example, when the Intel registers ECX and EIP contain the exact same value, that is a condition that may indicate the presence of an exploitable C++ structured exception handler (SEH) overwrite.

The !exploitable output provides a rough sorting mechanism according to type and severity. However, looking back at Figure 3, notice that each unique grouping of bugs must still be analyzed in the final section marked "work." Further information about the crash, such as the location of the actual erroneous code block (fault localization) and other metadata (visualizing blocks covered, etc.), can expedite the analysis process. The next section discusses discovered bugs and visualization.

New Vulnerabilities Discovered

To show that the system finds new real-world bugs (sometimes called zero-day vulnerabilities), multiple previously undisclosed flaws were identified in the latest Java version (1.6.0_25) when fuzzing the Java picture parsing routines using sample JPEG pictures. The flaws reside in the color management module (CMM.dll) of the Windows Java package, specifically in the SpTagToPublic function. Code Listing 1 shows the area of code where one of the bugs resides. Lines 11-16 contain the bug: Buf is incremented for certain character inputs. Later, in lines 18-25, Buf is used in a memory copy, and data outside the bounds of Buf may be inserted into the destination buffer. It is possible to use this type of vulnerability to leak internal program information. Leaking the location of a module is a recent approach used by attackers as part of a multi-part technique to bypass modern protections (non-executable data memory and address space randomization).

CODE LISTING 1. SpTagToPublic Function

    1.  #define IS_VALID_SIGNATURE_CHAR(c) ((0x30 <= ((unsigned char)(c)) && ((unsigned char)(c)) <= 0x39) || (0x41 <= ((unsigned char)(c)) && ((unsigned char)(c)) <= 0x5a) || (0x61 <= ((unsigned char)(c)) && ((unsigned char)(c)) <= 0x7a))

        /* DESCRIPTION -- Convert attribute value from more compact internal form. */
    2.  SpStatus_t KSPAPI SpTagToPublic ( SpTagId_t TagId, KpUInt32_t TagDataSize, SpTagValue_t FAR *Value, …)
        {
    3.      char KPHUGE *Buf;
    4.      char KPHUGE *BufSave;
    5.      KpUInt32_t Index, Limit;
    6.      SpData_t FAR *Data;
    7.      SpSig_t TypeSig;
    8.      KpUInt32_t trimCount = 0;

            /* set tag id and type in callers structure */
    9.      Value->TagId = TagId;
            ...
    10.     Buf = TagData;

            /* trim tag signature if needed */
    11.     while (!IS_VALID_SIGNATURE_CHAR(*Buf)) {
    12.         if (++trimCount > TagDataSize) {
    13.             return SpStatBadTagData;
    14.         }
    15.         Buf ++;
    16.     }
            ...
    17.     switch (Value->TagType) {
            ...
    18.     case Sp_AT_Unknown:
    19.         Value->Data.Binary.Size = TagDataSize;
    20.         BufSave = (char KPHUGE *) SpMalloc (TagDataSize);
    21.         if (NULL == BufSave)
    22.             return SpStatMemory;
    23.         KpMemCpy (BufSave, (void *)(Buf - 8), TagDataSize);
    24.         Value->Data.Binary.Values = BufSave;
    25.         return SpStatSuccess;
The grey blocks indicate basic blocks executed by good and bad traces. The white blocks were not executed. These colorations aid manual analysis for challenging bugs. Visualization is later described in detail. This illustration is provided to ignite interest in the chapters to come. 27 CHAPTER 4 FAULT LOCALIZATION BACKGROUND Fault localization (FL) techniques normally make use of runtime trace information collected from both passing and failing executions5. FL generates an ordered set of code regions that are predicted to contain a bug. The method for generating the order is an important part of the technique and is critical to the technique's success. For the FL technique to be considered helpful, the location of the actual bug should be presented within the top X suspicious code regions. If this is not the case, FL is not saving enough debugging time for the programmer inspecting the FL results. For example, Cheng et al. [2009] assume that the tester will tire if more than the top 5 locations must be examined. Based on that work, our measure of success also requires the actual bug to occur within the top 5 rated FL candidates6. Each of the following sections provides an overview of prior relevant approaches. Each description will focus on two important elements: first, the method for computing the initial set of suspicious statements, and then the method for ordering the rest of the statements. Set Union and Set Intersection Set union and set intersection use code coverage information for fault localization. This dissertation refers to FL techniques that operate based on comparing covered code regions (line, 5 As a measure of effectiveness in fault localization, the percent of the program that does not have to be examined to find the bug after localization is one common measure. 6 The top 10 suspicious locations are colored in the visualization in the event an analyst wishes to keep looking beyond the top 5. 28 basic block, edges, or procedural) among passing and failing traces as a coverage-based technique. Path-based, control-flow-based, program spectrum-based, and statistics-based are similar terms that appear in other FL documents. Agrawal et al. [1995] used a technique that computes the set difference of the statements covered by two test cases, one passing and one failing. An initial set of suspicious statements is obtained by subtracting the statements executed in the passing trace from the statements in the failing trace. They called it a dice of two slices. Pan et al. [1992] demonstrated a set of dynamic slice heuristics that use set algebra for similar purposes. In addition to using the information listed above, Agrawal et al. also use analysis information to compute dynamic slices of the program and test cases from a particular program point. A problem occurs if all statements are executed by both the passing trace and failing trace, resulting in an empty set of candidate statements. However, even if the initial set of candidate statements is empty or the bug is not in the initial set, there are techniques that can be used to identify suspect code. Renieris and Reiss [2003] show a technique that provides an ordering to the entities based on the system dependence graph, or SDG. Their fault localization technique and this SDG-ranking method are described next. Nearest Neighbor Renieris and Reiss [2003] addressed the issue of failing code occurring in passing traces using their Nearest-Neighbor FL technique. 
However, even if the initial set of candidate statements is empty or the bug is not in the initial set, there are techniques that can be used to identify suspect code. Renieris and Reiss [2003] show a technique that provides an ordering to the entities based on the system dependence graph, or SDG. Their fault localization technique and this SDG-ranking method are described next.

Nearest Neighbor

Renieris and Reiss [2003] addressed the issue of failing code occurring in passing traces using their Nearest-Neighbor FL technique. Rather than removing, by set difference, the statements executed by all passed test cases from the set of statements executed by a single failed test case, they selectively choose a single best passed test case for the set difference. The important problem is identifying the "best" case. They do so by first identifying a single failed test case and then selecting an associated passing test case that has similar code coverage to the failed case. Using these two test cases, they subtract the set of statements executed by the passed test case from the set of statements executed by the failed test case. The resulting set of statements is the initial set of statements from which the programmer should start looking for the bug. As before, this process could still result in an empty set. If the bug is not contained in the initial set, they use the SDG-ranking technique discussed next.

SDG-ranking Technique

In the work of Renieris and Reiss [2003] the code entities in the initial suspicious set from FL are called the "blamed" nodes. A breadth-first search is conducted from the blamed nodes along dependency edges in both forward and backward directions. An example in the next paragraph will clear up how this works. All nodes that are at the same distance are grouped into a rank. Given a distance d and the set of nodes S(d) at that distance, the rank number assigned to every node in S(d) is the total size of all sets of nodes at shorter distances plus the size of S(d).

For example, suppose an initial list of suspicious code blocks contained two statements. The programmer looks at each and finds that the bug is not in either one. He then inspects all forward and backward control-flow and data-flow dependencies at a distance of 1. Assume that yields an additional 5 nodes. The rank number of the initial set is 2, and the rank number of all the nodes at a distance of 1 is 7 (2+5). Using the size of the rank, plus the size of every rank at a smaller distance, as the rank number gives the maximum number of nodes that would have to be examined to find the bug following the order specified by this technique. Since this technique does not use the original runtime information that was part of the fault localization, except as a starting point, its effectiveness is limited.
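The rank-number computation can be sketched as a breadth-first search over the dependence graph, as in the following Python fragment. The adjacency mapping stands in for an SDG whose edge lists already include both forward and backward dependence directions; the small graph at the bottom is hypothetical and mirrors the worked example above (two blamed nodes, five neighbors at distance 1).

    from collections import deque

    def sdg_ranks(graph, blamed):
        """Assign each reachable node its SDG-style rank number: the count
        of all nodes at its BFS distance from the blamed set or closer."""
        dist = {n: 0 for n in blamed}
        queue = deque(blamed)
        while queue:
            node = queue.popleft()
            for neighbor in graph.get(node, ()):   # edges cover both directions
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        ranks, seen = {}, 0
        for d in sorted(set(dist.values())):
            layer = [n for n, nd in dist.items() if nd == d]
            seen += len(layer)
            for n in layer:
                ranks[n] = seen
        return ranks

    if __name__ == "__main__":
        # Two blamed nodes with five dependence neighbors at distance 1:
        # the blamed nodes get rank 2 and the neighbors get rank 7, as in the text.
        graph = {"b1": ["n1", "n2"], "b2": ["n3", "n4", "n5"]}
        print(sdg_ranks(graph, ["b1", "b2"]))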
If the bug is not contained in this initial set, they too suggest the SDG-ranking technique. They provide two improvements to the SDG-ranking that can exploit the programmer's understanding of whether particular states are "infected" by a bug or "cause" the bug to be manifested. These improvements are called "exploiting relevance" and "exploiting infections" and are fully defined in Cleve and Zeller’s 2005 paper. SOBER Liu et al. [2005] present their SOBER technique that performs statistical calculations on passing and failing traces to isolate bugs. They show their technique to be superior to Liblit et al. [2003, 31 2005] which used predicate probabilities to isolate bugs from collected crashes (failing traces only). SOBER considers a predicate to be relevant if its evaluation pattern in incorrect traces differs significantly from that of correct ones. SOBER uses profiles of branch executions within each test case to localize branches that are related to bugs. The algorithm begins by modeling evaluations of a predicate P as independent Bernoulli trials: each evaluation of P returns either True or False. Next, the estimate of the probability of P being True, which is called the "evaluation bias," is calculated from the fraction of True observations within each program execution. Let x be the random variable carrying the evaluation bias of predicate P, there exist two statistical models, f(x|Correct) and f(x|Incorrect), that govern the evaluation bias observed from correct and incorrect executions. Finally, if f(x|Incorrect) significantly differs from f(x|Correct), it indicates that the evaluation of P in incorrect traces is abnormal, and thus likely related to a bug. This quantification is used to rank all the instrumented predicates, creating a ranked list of suspicious predicates. Programmers can either examine the list from top down, or they can choose to examine only the top-ranked suspicious branches for debugging. One issue with this approach is that only predicates can be considered as suspicious locations. Failure-Inducing Chops Since debugging is usually performed by analyzing the statements of the program when it is executed using a specific input, Korel and Laski [1988] proposed the idea of dynamic program slicing. A dynamic slice identifies a subset of executed statements that is expected to contain faulty code. The effectiveness of a given slicing algorithm in fault location is determined by two 32 factors: how often is the faulty statement present in the slice, and how big is the slice, i.e. how many statements are included in the slice? There are two kinds of dynamic slices: backward dynamic slices and forward dynamic slices. The backward dynamic slice of a variable at a point in the execution trace includes all those executed statements which affect the value of the variable at that point. In contrast, the forward dynamic slice of a variable at a point in the execution trace includes all those executed statements that are affected by the value of the variable at that point. Backward dynamic slicing has been proposed to guide programmers in the process of debugging [Agrawal et al. 1993] by focusing the attention of the user on a subset of program statements which are expected to contain the faulty code. Gupta et al. [2005] combined Delta Debugging [Zeller 1999] with dynamic slicing [Agrawal and Horgan 1990] to narrow down the search for buggy code. 
Prior work had shown backward dynamic slicing to be useful for debugging, but their work was the first to show the application of forward slicing with Delta Debugging for FL. The technique works by identifying a minimal failure-inducing input, using this input to compute a forward slice, and then intersecting the statements in the forward slice with the statements of the backward slice of the erroneous output to compute a failure-inducing chop. The relevant part of the input is the minimal failure-inducing input. Gupta et al. compute the intersection of the statements in the forward dynamic slice of the failure-inducing input and the backward dynamic slice of the faulty output to locate the likely faulty code. The statements in the intersection of the forward and backward dynamic slices are known as the failure-inducing chop. The failure-inducing chops are expected to be much smaller than backward dynamic slices since they capture only those statements of the dynamic slices that are affected by the minimal failure-inducing input.
Towards Locating Omission Errors
Zhang et al. [2007] noted that execution omission errors, that is, code that should have been executed as part of the program but was not, are difficult to find dynamically, since those statements are never covered and thus cannot be operated on as part of the FL algorithm. They introduce the notion of implicit dependences, which are dependences that are normally invisible to dynamic slicing due to omission errors. They designed a dynamic method that forces the execution of the omitted code by switching outcomes of relevant predicates such that those implicit dependences are exposed and become available for dynamic slicing. Their technique consists of reexecuting the program while switching a specific predicate instance, and aligning the original and switched executions. Dynamic code coverage information can then be recollected for fault localization, which they do using dynamic slices similar to their prior work [Zhang et al. 2005], which is in turn similar to the failure-inducing chops described in the prior section.
Tarantula
Researchers at Georgia Tech have considered runtime code coverage information to develop a technique they call Tarantula [Jones and Harrold 2005; Hsu et al. 2008; Santelices et al. 2009]. Tarantula, like Renieris and Reiss and others, uses code coverage information to track each line (or other coverage element) of code as it is executed. A pass-fail oracle labels each trace as such. Like other approaches, the intuition behind Tarantula is that code regions in a program that are primarily executed by failing test cases are more likely to be the culprit than those that are primarily executed by passing test cases. Tarantula allows tolerance for lines that occasionally participated in passing test cases. This tolerance has been shown to make the technique very effective.

Table 1: Sample Coverage Data
           Trace 1   Trace 2   Trace 3   Trace 4   Suspiciousness
Line 1                  x         x                     .58
Line 2                  x         x         x           .82
Line 3        x                   x                      0
Line 4                            x         x           .58
Line 5        x         x                               .58

Table 1 provides sample code coverage data to be evaluated by Tarantula using the adapted Ochiai [Abreu et al. 2007] formula, which is described by Equation (1). Traces 2 and 4 failed. Trace 3 passed, but executes lines that failing traces also executed. The likelihood that any one line contains the bug is calculated as follows:

suspiciousness(s) = failed(s) / sqrt( totfailed * (failed(s) + passed(s)) )     (1)

In Equation (1), failed(s) is the number of failed test cases that executed statement s one or more times.
Passed(s) is the number of passed test cases that executed statement s one or more times. Totfailed is the total number of failed test cases for the entire test suite. If the denominator is zero, zero is assigned as the result. Using this calculation, Tarantula identifies Line 2 as the likely bug. The calculation for the line containing the bug from Table 1, using Equation (1), is shown below:
suspiciousness(Line2) = 2 / sqrt( 2 * (2 + 1) )
suspiciousness(Line2) = .82
Using this suspiciousness score, the covered lines are sorted. The set of lines with the highest score is considered first by the programmer attempting to fix the bug. If, after examining the first entity, the bug is not found, the programmer continues down the list from highest to lowest score. Ties must be manually resolved. For a visual cue, Tarantula colors code to further help debugging. Tarantula-based approaches exhibit three weaknesses:
1. Tarantula works well for bugs with differences in covered code, but will likely not identify data-only bugs (discussed below).
2. Coverage-based algorithms may become confused in the presence of noise.
3. Previous Tarantula implementations require source code and a test suite of passing and failing traces.
Weaknesses one and two are defined as follows (discussed later as well):
• Data-only bugs are ones in which passing and failing traces have no code coverage differences at runtime.
• Noise in this article refers to failing traces that contain so many uniquely covered code regions that the typical Tarantula-based approach will end up suggesting many suspicious locations that are not bugs. Noisy conditions push the number of code locations the analyst must consider before the actual bug is encountered prohibitively high, rendering the FL algorithm ineffective for practical use.
For the class of bugs of interest (crashing memory corruption issues on fully compiled code) all three weaknesses are addressed via this new fuzzing and fault localization system (later described by Algorithms 1 and 2).
Top-K LEAP
Cheng et al. [2009] have used a data mining technique to mine execution graphs to localize bugs. Given a set of graphs of passing and failing traces, they extract the subgraphs that discriminate the most between the program flow of the traces. Their experiments show that their technique performs better than another popular FL approach known as RAPID [Hsu et al. 2008]. Their approach is different from traditional graph mining techniques, which mine a very large set of frequent subgraphs. Since mining many subgraphs is expensive, they formulate subgraph mining as an optimization problem and directly generate the most discriminative subgraph with a graph-mining algorithm called LEAP [Yan et al. 2008]. LEAP is both more efficient and more accurate than previous graph approaches. They subsequently extend LEAP to generate a ranked list of Top-K discriminative subgraphs representing distinct locations that may contain the bugs. Equation (2), discriminative subgraph mining FL, shows how subgraphs are pruned via their algorithm:

suspiciousness(edg) = failed(edg) / passed(edg)     (2)

The operators failed(edg) and passed(edg) report the number of failing program traces that include the edge edg and the number of passing program traces that include the edge edg, respectively. Using this approach, only suspicious edges are retained (nodes with no edges are removed): an edge is kept when its ratio in Equation (2) exceeds the overall ratio of failing to passing traces in the test suite. As an example, assume that 20 of 50 traces were failing traces.
Now suppose a particular edge appears in 15 failing traces, but only 5 passing traces. The edge's ratio is 15 / 5 = 3, while the suite-wide ratio of failing to passing traces is 20 / 30 ≈ 0.666. Since 3 > 0.666, they assume this edge is suspicious.
Test Data Creation
FL algorithms require test suites with many test cases. Artzi et al. [2010] showed how one can automatically build a test suite for use with FL algorithms. Of particular value is that they can generate a sequence of test cases where each subsequent test differs from its predecessor in small, systematic ways. Such an approach works well with coverage-based fault localization, because FL works best when most of the execution trace is the same, allowing the differences to stand out. They used two similarity metrics, one focused on similar paths and the other on similar data, to expand a few test cases into a larger test suite. However, they require at least one failed test case to seed their process, and they require source code (PHP currently) to facilitate their symbolic evaluation. Finding a failed test case in closed-source, commercial code is nontrivial, and symbolic evaluation of large code bases is problematic. However, their technique is elegant and builds an efficient test suite.
Using Input Data for Fault Localization
Selective data flow tracking as part of a code coverage-based fault localization algorithm is novel; it is a contribution of this work. However, using input data to perform FL has been investigated. Researchers Jeffrey et al. [2008] localized bugs by varying inputs. They find inputs similar to failing inputs, such that the bug is no longer triggered. The statements operating on the input are presumed to be the fault, or related to it.
Database Fault Localization
We recently discovered work that was conducted in parallel to ours. Researchers at IBM [Saha et al. 2011] discovered what we did: fault localization without consideration of data is likely to fail for some bugs. They, however, did not use FL on typical compiled desktop programs. They focused strictly on database code that conducts business logic for customer transactions. Their work operates only on ABAP, an SQL-like language. They found that through a number of enhancements to a typical slice differencing technique [Agrawal et al. 1995] they could obtain better results. Their primary enhancement operates by tracking a key variable value at each statement location. Thus, if all statements are the same between passing and failing traces, they will still have information that differs. This is similar to one of our enhancements (FL-DFT from Chapter 6), which modifies a code coverage-based FL technique to consider data. Their approach differs from ours in the following important ways:
1. They do not use a high-speed Dynamic Binary Instrumentation framework (explained in Chapter 6) that can collect trace information for large x86 programs. They use a slower, custom set of single-step debugging scripts to gather data.
2. They do not taint and trace input for closed-source applications, but rather depend on the ABAP source code to track code and data.
3. They focus on data-centric code that has very little control-flow difference. They use knowledge of the code semantics to further unwind statements to conduct slicing, which includes an important variable for each statement.
4. They do not combine their FL approach with a bug hunting and analysis system.
CHAPTER 5
BUILDING IN FAULT LOCALIZATION AND VISUALIZATION
Fault localization tools typically require (1) a test suite and (2) the software source code.
A test suite will have both input and the ability to recognize output. The output of each test case is labeled, indicating whether that particular result is a “pass” or a “fail”, where failure indicates a software-coding mistake. Having identified the existence of a fault, the next step is to localize the error within the code. A typical approach relies on the basic premise that code regions (often lines or basic blocks) that occur more frequently in failing traces7 are more likely to contain the bug. These suspect regions are identified based on occurrence, and then ranked, creating a final list of suspicious locations. The more similar the passing and failing input test cases are, the better coverage-based algorithms tend to perform, because runtime differences are more easily identified. This research expands upon this general approach in two important ways: 1. A preexisting test suite is not require to identify faults. 2. Source code is not required to perform fault localization. These improvements are important because test suites may be incomplete or expensive to generate. In addition, source code may not always be available for legacy code, or for offensive research. Issue 1 is addressed by creating test suites automatically. First, quality inputs are downloaded based on code coverage. Second, the collected set is expanded by systematically 7 A trace is a list of code regions encountered during one execution of the program given some input. 41 modifying the test inputs, a dynamic testing technique known as fuzzing [Miller et al. 1990]. Issue 2 is addressed because fuzzing can operate on binaries where source is not available. Fuzzing automatically seeks memory corruption vulnerabilities, a critically important subclass of bugs that are the basis for most control-flow hijack exploits. Hackers use fuzzing to uncover bugs and create offensive tools. Commercial developers use fuzzing as a final test of their code—to keep ahead of the hackers. Once fuzzing has identified an input that generates faulty behavior, the tools in this research generate a smaller test suite (series of similar inputs) that is specific to the discovered bug. This is possible even though the location of the bug is not yet known. As before, fault localization requires that the generated test suite have inputs that generate both passing and failing outputs. A coverage-based algorithm is then applied to localize bugs to basic blocks—a unit of code that does not contain a branch. This work provides a number of contributions, above and beyond those already mentioned. This new approach enhances results by increasing the suspiciousness score of blocks not thought to be noise (explained later). In addition, accuracy is increased by using an input tainting, dataflow tracking algorithm. For data-flow tracking, key portions of the input data are tagged. The program records the basic blocks within the program which operate on tagged data. A score modifier (increased suspiciousness) is then applied to each basic block that uses tainted input data8. This chapter will: 8 Input data that is tracked throughout a programs lifespan is considered “tainted input”. Tracking means the input will taint other data and registers as it is operated on in compiled code. 42 1. Describe the novel fault localization approach, and show that the algorithm outperforms previous approaches. 2. Describe the technique that allows the framework to work with closed-source code. 3. Detail the environment (system) in which this research currently operates. 
This chapter describes the integration of our FL system into our fuzzing system. This chapter also serves as motivation for the FL enhancements we fully describe in Chapter 6. Chapter 7 discusses the results of those enhancements.
Automatic Fault Localization and Visualization
Now that a set of faults has been discovered, the process to find where the faults occur is conducted. Given that this research targets the class of bugs known as vulnerabilities, a fair question to ask is: doesn't the analyst know where the fault is because the program crashed at that point? Perhaps, for trivial bugs, that is true. But not for subtle bugs, which are, in fact, the ones most likely to have survived the sophisticated testing of commercial software. A trivial example would be to set a pointer to Null and to reference it at some later point. The program crashes at the later point (Null reference), but the actual fault lies where the pointer assignment was made, or even earlier where the Null value was derived. Finding the actual fault is the task of fault localization. Once a bug is found via fuzzing, the discovered crashing test case is expanded into passing and failing test cases. This is done because FL requires both passing and failing tests. The process for this expansion is explained next.
Execution Mining (EM) is the name for this bug-focused test set creation, fault localization, and visualization process. EM is tied into CF and can automatically localize faults. The Appendix (ClusterFuzz GUI Walkthrough) shows screen shots of an early version of the GUI. Recognition goes to the CF team at Harris Crucial Security, Inc. for their continuing work on CF. Figure 5 shows an overview of the process by which the constructed FL runs.
Figure 5: Execution Mining
The first step is to create a set of passing and failing inputs9. Following the process shown in Figure 5, the mutated byte(s) is first isolated in the failing input file that caused the application to crash. This research focuses on binary files for client-side applications. The tool changes each difference in the failing input file back to whatever it was originally, until the failing input file no longer crashes the application. It is in this way that the particular data that is responsible for the fault is identified. If the offending data is a single byte, 256 new files can now be created using all possibilities of that byte. The file-and-byte assumptions are reasonable for the type of client-side fuzzing focused on10. Now that a set of passing and failing files has been collected, each file can be run through the trace collection software. A basic block hit trace collection is created and tagged as either failed (crashed) or passed (not crashed). After all of the traces have been collected, the fault localization can run. This step is shown at the bottom of Figure 5.
9 Files are operated on in this case, but the technique works for other inputs as well. We later show that our FL algorithm works on STDIN and network inputs, but we have only implemented the automatic test set creation portion of EM for file inputs at this time.
10 In the author's experience, file fuzzing produces mutated files, which include a bug that can often be localized to one key position in the input. It is possible that multiple-byte bugs could not be isolated with this heuristic, but for the sake of automatic FL that case is ignored for file inputs. Furthermore, whether the bug is exactly isolated via the input at this time is not the point. All our system requires to perform FL is passing and failing tests, which we characterize in this case as non-crashing and crashing, respectively.
Each of the top 10 suspicious locations is then colored and visualized in IDA pro [Hex-Rays 2011], a popular reverse engineering tool. The IDA pro disassembly screen is centered (redirected) on the most suspicious location, as shown in the bottom of Figure 5. EM is later described in more detail in Algorithm 2.
Example 1: A Simple Bug with Code Coverage Differences
To illustrate EM, a piece of open-source code was modified to contain a bug. The bug occurs early in the code, but the error isn't manifested until later in the code. The open-source code is dlltest, which is part of a suite of Optical Character Recognition [Tesseract OCR 2011] tools. One of the C++ files is called imgbmp.cpp. That file includes code to parse a BMP image file. Minor changes were made, which are highlighted in Code Listing 2. Source code is shown here to explain the bug, but remember that the approach works with the resulting binary and does not require the source code. On lines 6-7 local variables were added; not the bug. Lines 11-12 are the inserted bug, which doesn't immediately result in an error or crash. On line 21, the bug is activated when the xsize data (from the BMP) is greater than 100.
CODE LISTING 2. Code with a Bug because of Path Differences
1.  inT8 read_bmp_image(     //read header
        int fd,              //file to read
        uinT8 *pixels,       //pixels of image
        inT32 xsize,         //size of image
        inT32 ysize,
        inT8 bpp,            //bits per pixel
        inT32                //bytes per line
    ) {
2.      uinT32 bpl;          //bytes per line
3.      uinT32 wpl;          //words per line
4.      uinT32 nread;        //current bits
5.      inT32 index;         //to cols
6.      unsigned char bug1 = 0;
7.      char * ptr = NULL;
8.      bpl = (xsize * bpp + 7) / 8;
9.      wpl = (bpl + 3) / 4; //bytes per line
10.     wpl *= 4;
11.     if( xsize > 100 )
12.         bug1 = 1;
13.     for (index = 0; index < ysize; index++) {
14.         nread = read (fd, pixels + bpl * (ysize - 1 - index), bpl);
15.         if (nread != bpl)
16.             return -1;
17.         if (wpl != bpl)
18.             lseek (fd, wpl - bpl, SEEK_CUR);
19.     }
20.     if(bug1)
21.         *ptr = 1;        //causes access violation
This bug can now be found via fuzzing using CF. After all three stages of EM finish, the colored IDA pro disassembly (graph view) shown in Figure 6 is obtained. Finally, visualization (described below) directs analysts to the bug.
Figure 6: An Example with Covered Code Differences
There are two sections of code that contain the bug: lines 11-12 and 21. EM successfully identified both offending locations as the top two likely suspicious basic blocks. The bug is the small red block in the middle of Figure 6 (the block with one assembly instruction, which is the compiled code for line 12). The following coloring scheme was developed to aid the analyst in bug analysis:
• Grey blocks of code indicate both passing and failing traces executed the block.
• Blue blocks indicate only passing traces executed the code block.
• Brown blocks indicate only failing traces executed the block.
• Red blocks indicate a highly suspicious location (top 2 locations).
• Yellow blocks indicate suspicious locations (next 8 locations).
• A purple block indicates the final location executed (used to identify crash locations).
• An orange block is the most common final block (used to indicate a common crash location).
• All other blocks are left white, the default color in IDA pro.
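To make the mapping from fault localization output to this coloring scheme concrete, the short sketch below assigns the colors described above, given a block's rank in the suspiciousness list and its coverage counts. It is an illustrative sketch only, not the EM implementation: the function name color_for_block, its parameters, and the priority order among overlapping rules are assumptions made for this example, and the real tool applies the chosen colors through IDA pro's scripting interface.

def color_for_block(rank, passing, failing, is_last_block, is_common_last_block):
    """Return the visualization color for one basic block.
    rank: position in the suspiciousness ordering (1 = most suspicious), or None;
    passing/failing: number of passing/failing traces that executed the block."""
    if is_common_last_block:          # most common final block: likely crash location
        return "orange"
    if is_last_block:                 # final location executed in a trace
        return "purple"
    if rank is not None and rank <= 2:
        return "red"                  # top 2 suspicious locations
    if rank is not None and rank <= 10:
        return "yellow"               # next 8 suspicious locations
    if passing and failing:
        return "grey"                 # executed by both passing and failing traces
    if failing:
        return "brown"                # executed only by failing traces
    if passing:
        return "blue"                 # executed only by passing traces
    return "white"                    # not executed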
Table 2: Top 5 Suspicious Code Blocks from Figure 6
Basic Block Rating   Passing   Failing   Virtual Address (Start, Stop)   Color
1.0000               0         155       4016c3, 4016ca                  Red
1.0000               0         155       401655, 40165a                  Red
0.7806               25        39        4016a5, 4016b8                  Yellow
0.7773               76        116       4016a2, 4016a5                  Yellow
0.7773               76        116       401694, 4016a5                  Yellow

The data shown in Table 2 appears in the IDA pro tool Output window (at the bottom of IDA pro by default). That data shows the top 5 suspicious code blocks. The two locations responsible for the bug show up in all 155 failing traces, and never in any of the passing traces. Equation (1) is easily able to operate on the passing and failing information from Table 2 and detect the bug, under the straightforward mathematical conditions of this simple example.
Example 2: A Simple Data-only Bug
Coverage-based analysis shines on Example 1, but suppose the bug had no code coverage implications, meaning different inputs can trigger a bug with no difference in the statements executed. This dissertation refers to this as a data-only bug. Algorithms that solely depend on code coverage differences cannot reliably find these bugs because control-flow provides insufficient information; data differences, not code coverage, need to be considered. To illustrate, another contrived bug was created in the open-source code from Example 1. Inspect the changes made to a similar region of code, shown below in Code Listing 3.
CODE LISTING 3. Data-only Bug
1.  inT8 read_bmp_image(     //read header
        int fd,              //file to read
        uinT8 *pixels,       //pixels of image
        inT32 xsize,         //size of image
        inT32 ysize,
        inT8 bpp,            //bits per pixel
        inT32                //bytes per line
    ) {
2.      uinT32 bpl;          //bytes per line
3.      uinT32 wpl;          //words per line
4.      uinT32 nread;        //current bits
5.      inT32 index;         //to cols
6.      unsigned char bug3 = 1;
7.      unsigned int my_size = 0;
8.      bpl = (xsize * bpp + 7) / 8;    //bytes per line
9.      wpl = (bpl + 3) / 4;
10.     wpl *= 4;
11.     bug3 = (101 - xsize);
12.     for (index = 0; index < ysize; index++) {
13.         nread = read (fd, pixels + bpl * (ysize - 1 - index), bpl);
14.         if (nread != bpl)
15.             return -1;
16.         if (wpl != bpl)
17.             lseek (fd, wpl - bpl, SEEK_CUR);
18.     }
19.     char *ptr = (char *)malloc(my_size + bug3);
20.     *(ptr + xsize) = 12;
Lines 6-7 add a couple of local variables. Line 11 is the bug. Lines 19-20 activate the bug if the xsize from the BMP picture has various unexpected values. The bug results when either no memory, or an improper amount of memory, is allocated, and the pointer is dereferenced for a region of memory of improper size. The issue illustrated is that there are no differences between passing and failing traces in terms of statements covered. All code paths through this function will execute each statement (i.e., identical code coverage); only different data inputs trigger the bug. The large grey block shown in Figure 7 (containing "mov al, 65h") is part of the bug. Neither this block, nor the other portion of the bug, is correctly colored. The failure is because there is no code flow difference in this data-only bug. Coverage-based analysis does still find a difference between passing and failing traces, and therefore presents erroneous data as shown in Table 3.
Figure 7: Data-only Example
The data shown in Table 3 indicates line 17 from Code Listing 3 as the most likely root cause. This selection is not the correct choice (line 11 is). In larger programs, happening to get this close could be sufficient, but for this simple example the bug should be directly identified as was done in Example 1.
Table 3: Top 5 Suspicious Code Blocks from Figure 7
Rating   Passing   Failing   Basic Block Virtual Address (Start, Stop)   Color
0.7806   25        39        40169d, 4016a8                              Red
0.6654   107       85        40169a, 40169d                              Red
0.6654   107       85        401688, 40169d                              Yellow
0.6029   132       124       405a94, 405aaa                              Yellow
0.6029   132       124       405a8a, 405aaa                              Yellow

Other fault localization approaches that only consider code coverage information would also fail to identify the buggy basic block, since there is no distinguishing information present. To locate this type of bug, this research includes a data-flow tracking technique to key in on important code blocks. The ability to localize this difficult data-only bug is later demonstrated.
Motivation for an Enhancement
This research proposes that by addressing the data-only issue the issue of noise can also be addressed. Noise happens when the failing traces contain too many uniquely covered code regions when compared to the passing traces, resulting in many suspicious locations being reported that are not bugs11. Noise is the opposite problem from the data-only issue illustrated in the previous section. However, data-only and noise are related: either there is not enough signal in the data, or there is too much noise, which hides the signal. Both issues could be generalized as focus issues. The focus of the test cases and the collected trace data should be as small and relevant as possible. Strategies that (a) include a more complete and focused test suite, and that (b) perform the FL only on the program modules in which the bug occurs, will do better since the captured trace data generated is more likely to be relevant. Those approaches are helpful, and are utilized when possible. For example, if program.dll crashes inside of program.exe, EM will be performed against just program.dll, rather than all of the loaded modules12. However, that particular DLL still may be very large, which is why those two techniques alone are insufficient, or not always readily available. The FL improvement in this research seeks enhancements in which:
1. FL uses dynamic information about tainted input data.
2. FL can more precisely pinpoint suspicious code blocks, in the presence of noise and/or with data-only bugs, because code regions that operate on tainted input are focused on.
11 In general, we believe that binary/assembly-based systems are more susceptible to noise than high-level language traces would be, but at sufficient scale noise would be present in either scenario.
12 Modern operating systems load many dynamic modules, and performing FL program-wide would produce poor results because of the diversity in each trace.
CHAPTER 6
ENHANCING FAULT LOCALIZATION WITH SELECT INPUT TAINTING AND TRACKING
The new approach in this research uses an Ochiai coverage-based approach, but applies a score modifier (increased suspiciousness) to each basic block that uses tainted data. In this work, taint is defined as:
• Tagging some or all input data.
• Following marked (tainted) data via a data map, which indicates which memory locations and registers are tainted.
• If a tainted element is used in an operation that modifies an untainted location or register, taint is transferred.
Tainting is important because it allows the new algorithm to tag suspicious inputs and follow them as they are operated on throughout the life of the program. Tainting will be further discussed as each of the following is described: 1. Background information on dynamic information flow tracking. 2. The tools upon which we build. 3. The sources of taint. 4.
How code blocks are emitted, as they operate on tainted input. 5. How the FL system works (Algorithm 2). Dynamic Information Flow Tracking Dynamic information flow is a technique that tracks the flow of information along dynamic dependences starting from the program inputs to the values computed by the program [Gupta et al. 2008]. Applications include software attack detection and prevention, fault localization, and 56 data validation. Depending on the exact implementation the overhead associated with this technology can be very expensive. In our work we use a dynamic binary instrumentation framework to capture only select information, making our approach scalable to real-world programs. Dynamic Binary Instrumentation Dynamic binary instrumentation (DBI) allows developers to modify the behavior of fully compiled programs as they execute. The DBI framework we use to collect code coverage is called Pin [2011]. We also use Pin to apply, track, and selectively report on tainted input data. Pin is a framework that allows testers to instrument binary applications in arbitrary ways by writing a Pintool. Similar results could be obtained by using scripts within a debugger, but the performance penalty would be prohibitive when testers desire near real-time analysis. Data Flow Tracking with Dynamic Binary Instrumentation The concept of tracking tainted data has been shown useful in security research [Piromsopa 2006]. More recently, researchers at Columbia University [libdft 2011] released a Data Flow Tracking (DFT) toolkit called Libdft that works with Pin. DFT deals with the tagging and tracking of interesting data (normally user input) as they propagate during program execution. DFT may also be referred to as Information Flow Tracking (IFT) or Dynamic Taint Analysis (DTA). DFT has been implemented by a variety of tools for numerous purposes, including protection from buffer overflow and cross-site scripting attacks, analysis of legitimate and malicious software, detection and prevention of information leaks, etc. Libdft is a dynamic DFT framework that unlike previous work, is fast, reusable, and works with commodity software and hardware. Libdft provides an API, which can be used to create and deliver DFT-enabled tools 57 that can be applied on unmodified binaries running on common operating systems and hardware, thus facilitating research and prototyping. Our sources of Tainted Data Libdft was designed to work on network sockets. Since the majority of the experiments in this research operate on files and standard input, the Libdft code base was extended to also work on files and standard input (STDIN). For some programs, if all input data is tagged as tainted, taint propagates throughout much of the code—compounding the noise problem. For file inputs, the taint impact can be decreased, because the input bytes of interest are known: EM localized the offending bytes in the test set creation phase (Figure 5). Thus, just those bytes are marked as tainted. This strategy provides a small amount of targeted data to track, and from the set creation phase this is known to be the critical data. There are situations where marking all data as tainted are required. For example, when a network program is examined later in this dissertation, all input from socket files are tracked. The decision on how to tag is part of the test configuration, which is developed by the user at the start of the test. 
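To make the tagging strategy concrete, the sketch below illustrates selective, byte-level tainting with a simple shadow map. It is a conceptual illustration only, not how Libdft or the Pintool is implemented: the class and method names (TaintMap, taint_read, propagate) and the single propagation rule are assumptions chosen for brevity.

class TaintMap:
    """Conceptual shadow map recording which memory addresses are tainted."""
    def __init__(self):
        self.tainted = set()

    def taint_read(self, buf_addr, length, file_offset, interesting_offsets):
        # Tag only the input byte offsets of interest (for file inputs, the
        # offending bytes isolated during test set creation); for network input,
        # interesting_offsets could simply cover the entire read.
        for i in range(length):
            if file_offset + i in interesting_offsets:
                self.tainted.add(buf_addr + i)

    def propagate(self, dst, srcs):
        # Transfer rule: if any source of an operation is tainted, the
        # destination becomes tainted; otherwise any stale taint is cleared.
        if any(s in self.tainted for s in srcs):
            self.tainted.add(dst)
        else:
            self.tainted.discard(dst)

shadow = TaintMap()
# A 64-byte read lands at a (hypothetical) buffer address; only file offset 23 is tagged.
shadow.taint_read(buf_addr=0x400000, length=64, file_offset=0, interesting_offsets={23})
shadow.propagate(dst=0x500000, srcs=[0x400000 + 23])   # a copy of the tainted byte
print(0x500000 in shadow.tainted)                      # True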
The internal operation of Libdft is relied upon to properly propagate taint information from memory/registers to other memory locations/registers as they are copied around throughout the life of the program.
How We Emit Tainted Basic Blocks
To minimize the code blocks that get a suspiciousness boost, we choose to record only the address of basic blocks that read from tainted memory locations. Our system works with fully compiled code. Within our Pintool, the following two functions from Pin are used to determine if there is a read for a particular assembly instruction: INS_IsMemoryRead and INS_HasMemoryRead2. As an example, if the instruction "movzx edx, [ebp+tainted_byte]" were inside a basic block, its address would be emitted as tainted, so that FL-DFT (the name of our new approach, described in the next section) can later increase the suspiciousness score of that basic block. The Intel assembly move instruction just shown transfers the data from a tainted pointer to the EDX register. We also include an option to output basic blocks (BBs) that have both of the following properties:
1. The BB contains the use of a dangerous API, such as strcpy, strncpy, memcpy, etc.
2. The source parameter of the dangerous API includes tainted user input.
The reason we output BBs that use suspicious API functions (with tainted data) is that, in general, our system is interested in finding and analyzing security-critical memory corruption bugs. The list of dangerous functions includes APIs likely to play a role in memory corruption.
Fault Localization Algorithm Details
Algorithm 2 details the process from discovered bugs to the final ordered list of suspicious basic block locations. Afterwards, the output file is loaded into IDA Pro for visualization.
Algorithm 2. Coverage-based Fault Localization with Input Data Tainting and Tracking
Input: Results from Algorithm 1. While the FL algorithm was tested on multiple input types, the end-to-end system only works on files at this time. When a bug is located with a fuzzed input, that input is automatically expanded to create a test set appropriate for the FL system. This new test set is referred to as the NarrowTestSet.
Output: Ordered list of suspicious basic block addresses and supporting data (FL-Basic, FL-Enhanced, and FL-DFT) ready to be loaded into IDA pro for visualization.
#Execution Mining (Step 1: Automatic Test Set Creation)
for each input pair i, i´ in Results do
    FileOffset = LocateDifference(i, i´);
    NarrowTestSet = CreateSimilarDataInputs(i, i´, FileOffset);
    for each input f in NarrowTestSet do
        RetVal = CreateTraceFile(f, DFT_Pintool, TraceName, TraceDir, FileOffset, ...);
        if RetVal == EXCEPTION then
            AppendLabelsFile_Failing(TraceName);
        else
            AppendLabelsFile_Passing(TraceName);
        end
    end
end
#Execution Mining (Step 2: Trace Collection)
passing, failing = dictionary{ BasicBlockAddress: CountCovered, TaintedData };
passing_data, failing_data = dictionary{ InstructionAddress: BasicBlock, set(unique data at InstructionAddress) };
FL-Basic, FL-Enhanced, FL-DFT = dictionary{ BasicBlockAddress: score };
tainted_data = dictionary{ BasicBlockAddress: LengthOfSymDiff, InstructionAddress, Data };
for each trace t in TraceDir do
    if t == EXCEPTION then
        UpdateDictionaries(failing, failing_data, t);
    else
        UpdateDictionaries(passing, passing_data, t);
    end
end
#Execution Mining (Step 3: Fault Localization Algorithms)
FL-Basic = Ochiai(failing, passing);
SetGlobalModifierValues(FL-Basic);
FL-Enhanced = FilterBasicNoise(FL-Basic);
tainted_data = SymmetricDifference(failing_data, passing_data);
FL-DFT = IncreaseSusForTainted(FL-Enhanced, tainted_data);

Subroutine SetGlobalModifierValues(FL-Basic):
    top_10_percent = 0.1 * length( FL-Basic.keys() );
    top_10_as_int = int(round(top_10_percent));
    high_score = FL-Basic[FIRST_KEY];
    try:
        lower_score = FL-Basic[top_10_as_int];
    except:
        lower_score = FL-Basic[LAST_KEY];
    score_range = high_score - lower_score;
    if score_range < .0001 then
        score_range = .1;
    end
    big_modifier = score_range * 0.85;
    medium_modifier = score_range * 0.65;
    small_modifier = score_range * 0.45;
    Return

Subroutine FilterBasicNoise(FL-Basic):
    for each basic block address k and score s in FL-Basic do
        passing_count, passing_executions_per_trace = passing[k];
        failing_count, failing_executions_per_trace = failing[k];
        if (passing_count + failing_count) < (.1 * total_blocks) then
            s -= medium_modifier;
        end
        if (passing_executions_per_trace != failing_executions_per_trace) then
            s += small_modifier;
        end
        FL-Enhanced[k] = s;
    end
    Return FL-Enhanced

Subroutine SymmetricDifference(failing_data, passing_data):
    for each address pair k in common and failing traces do
        try:
            bb_pass, pass_data = passing_data[k];
        except:
            bb_pass = 0;
            pass_data = Set(empty);
        try:
            bb_fail, fail_data = failing_data[k];
        except:
            bb_fail = 0;
            fail_data = Set(empty);
        diff_data = pass_data.SymmetricDifference(fail_data);
        if bb_pass != bb_fail then
            if length(pass_data) < length(fail_data) then
                bb_pass = bb_fail;
            end
        end
        InstructionAddress = k.split(":")[1];
        tainted_data[bb_pass] = (length(diff_data), InstructionAddress, diff_data);
    end
    Return tainted_data

Subroutine IncreaseSusForTainted(FL-Enhanced, tainted_data):
    FL-DFT = FL-Enhanced;
    for each basic block address k and count c in tainted_data do
        if c == 0 then
            FL-DFT[k] += small_modifier;
        else
            FL-DFT[k] += big_modifier;
        end
    end
    Return FL-DFT

To test our FL methodology three approaches were created, each providing a list of ordered suspicious locations: basic, enhanced, and enhanced with tainting. The lists of ordered suspicious locations are referred to as FL-Basic, FL-Enhanced, and FL-DFT, respectively. Each result set is output as a file, which can be loaded into IDA Pro to pinpoint and visualize the fault locations. Each approach is described in the following sections.
FL-Basic
FL-Basic follows Equation 1, an Ochiai coverage-based algorithm, which is implemented in EM.
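As a concrete illustration of this scoring step, the sketch below applies Equation (1) to basic block hit counts gathered from labeled traces. It is a minimal sketch under simplifying assumptions, not the EM implementation: the dictionary layout and the function name ochiai_scores are invented for the example, and the real system also carries the per-trace data used by the later subroutines of Algorithm 2.

from math import sqrt

def ochiai_scores(passing, failing, totfailed):
    """Equation (1): failed(s) / sqrt(totfailed * (failed(s) + passed(s))).
    passing/failing map a basic block address to the number of passing/failing
    traces that executed it; totfailed is the total number of failing traces."""
    scores = {}
    for addr in set(passing) | set(failing):
        f = failing.get(addr, 0)
        p = passing.get(addr, 0)
        denom = sqrt(totfailed * (f + p))
        scores[addr] = f / denom if denom else 0.0   # zero denominator scores zero
    # Highest score first; ties are left for the analyst to resolve.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# A block hit by both failing traces and one passing trace scores 2 / sqrt(2 * 3) = .82,
# mirroring the Table 1 example.
for addr, score in ochiai_scores({0x4016a5: 1}, {0x4016a5: 2, 0x4016c3: 2}, totfailed=2):
    print(hex(addr), round(score, 2))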
All of the FL techniques in this dissertation operate on basic blocks (BBs), since access to source code (and subsequent line number information) is not assumed. FL-Enhanced FL-Enhanced is the first improvement researched to increase the accuracy of the FL-Basic algorithm. As is later shown in the results, it improves performance over the FL-Basic approach, but does not yet address the data-only problem. 65 FL-Enhanced adds enhancements to the basic coverage-based algorithm to filter out noise associated with large amounts of basic block tracing. This filtering is achieved by adding score modifiers. The most effective modifier in the FilterBasicNoise subroutine of Algorithm 2 adds a penalty for basic blocks that do not appear in a significant number of runs (s -= medium_modifier). The intuition is simple: if a particular code block is recorded only a couple times out of 100 total traces, it is assumed the block is unimportant (e.g. noise). The other modifier in that subroutine increases a basic block’s suspiciousness score by a smaller amount if the number of times that block was executed per trace differs between passing and failing traces, indicating something was different about the block, probably the data it operated on. These techniques may seem simple, but prove to be effective when combined. FL-DFT FL-DFT is the novel fault localization technique in this research, which includes the tainted input tracking enhancements:  The SymmetricDifference function gathers together the locations (basic block and instruction address pairs) that read from tainted locations, or that were emitted because of tainted API parameters.  The tainted data is collected as a symmetric difference (SD) set between passing and failing traces. That is the set of elements which are in either of the sets and not in their intersection. The intuition is that, the unique tainted data, in particular unique tainted data that only shows up in failing traces is the most important. Specifically, from IncreaseSusForTainted of Algorithm 2: 66 1. The length of the SD set is maintained along with the tainted data at that basic block location. 2. If the SD set is of length (count variable c) zero a small modifier is added to the suspiciousness of that location, else a larger modifier is added. The intuition: if different values are being used (thus the non-zero SD set), the chance that this BB is related to important (changing) data is even higher. For all these enhancements, the key is to allow the FL algorithm to ignore the large amount of irrelevant statements, and focus on those that matter. 67 CHAPTER 7 TEST RESULTS The effectiveness of the proposed approaches is demonstrated over standard FL techniques on three different kinds of problems: contrived problems for clarity of presentation, on the popular Siemens test set, and finally on large real-world problems. Contrived To illustrate how FL-DFT works, a contrived example is presented, by again performing FL against Code Listing 3 which showed a data-only bug. The fault can now correctly be found because:  The xsize variable is at the offset position in the file input, which is related to the bug. That makes xsize suspicious data.  Therefore, xsize gets tainted and tracked.  Subsequently, bug3 would also be tainted because it uses xsize in its initialization.  This increases the suspiciousness of the basic block that includes line 11.  The scores of lines 19 and 20 increase as well since they use xsize or its descendants. 
A standard coverage-based approach could not differentiate among the blocks in Code Listing 3 since the code statements were executed an equal number of times in both passing and failing traces. However, taint tracking differentiates the suspiciousness of the blocks because of the flow of the data through the blocks. As a result, lines 11, 19, and 20 are the top suspicious code blocks, demonstrating the validity of this approach.
GZIP
Knowing that the Software-artifact Infrastructure Repository [SIR 2011] includes test programs used by fault localization researchers, those programs were considered to validate the FL approach in this dissertation. Grep, Flex, Space, GZIP, and the Siemens test suite were among the programs examined. However, since fault localization techniques often ignore, or cannot handle, tests that cause a crash, very few of the programs include crashing tests. A crash was found in one of the versions included with GZIP (also in 8 Siemens versions discussed later). The GZIP compression program (based on version 1.0.7) has a vulnerability based on a stack buffer overflow. The vulnerable version (gzip/versions.alt/versions.seeded/v1) has a function called get_suffix, which, when the FAULTY_F_KL_2 bug is enabled and compiled as part of the program, exhibits memory corruption because of a misused strcpy function. This is shown on lines 13-15 and 18 of Code Listing 4.
CODE LISTING 4. Get_suffix function
1.  local char *get_suffix(char *name) {
2.      int nlen, slen;
3.      char suffix[MAX_SUFFIX+3]; /* last chars of name, forced to lower case */
4.      static char *known_suffixes[] =
5.          {z_suffix, ".gz", ".z", ".taz", ".tgz", "-gz", "-z", "_z",
6.  #ifdef MAX_EXT_CHARS
7.           "z",
8.  #endif
9.           NULL};
10.     char **suf = known_suffixes;
11.     if (strequ(z_suffix, "z")) suf++; /* check long suffixes first */
12.     nlen = strlen(name);
13. #ifdef FAULTY_F_KL_2
14.     if (nlen > MAX_SUFFIX+2)
15. #else
16.     if (nlen <= MAX_SUFFIX+2)
17. #endif
        {
18.         strcpy(suffix, name);
19.     }
20.     else {
21.         strcpy(suffix, name+nlen-MAX_SUFFIX-2);
        }
22. #ifdef SUFFIX_SEP
23.     /* strip a version number from the file name */
        {
24.         char *v = strrchr(suffix, SUFFIX_SEP);
25.         if (v != NULL) *v = '\0', nlen = v - name;
        }
26. #endif
27.     strlwr(suffix);
28.     slen = strlen(suffix);
29.     do {
30.         int s = strlen(*suf);
31.         if (slen > s && suffix[slen-s-1] != PATH_SEP
32.             && strequ(suffix + slen - s, *suf)) {
33.             return name+nlen-s;
            }
34.     } while (*++suf != NULL);
The block with the vulnerable strcpy is gzip+38f9, which is colored yellow in Figure 8.
Figure 8: get_suffix Function from GZIP
Figure text not meant to be readable, for color and block reference only.
Table 4 shows the top 10 suspicious code blocks from the three EM FL algorithms13.
13 Note that in early portions of this dissertation the actual virtual address was used, whereas here an offset relative to the executable module is shown. Either a relative address within a program, as shown here, or the absolute virtual address is handled properly within EM.
Table 4: Top 10 Suspicious Code Blocks from FL-Basic, FL-Enhanced, and FL-DFT

FL-Enhanced                          FL-Basic                             FL-DFT
Basic Block   Rating   Pass   Fail   Basic Block   Rating   Pass   Fail   Basic Block   Rating   Pass   Fail
gzip+d5e      1        0      4      gzip+d5e      1        0      4      gzip+8de0     1.20     0      4
gzip+d58      1        0      4      gzip+d58      1        0      4      gzip+8db6     1.20     0      4
gzip+8de0     1        0      4      gzip+8de0     1        0      4      gzip+38f9     1.20     0      4
gzip+8db6     1        0      4      gzip+8db6     1        0      4      gzip+d5e      1        0      4
gzip+8db1     1        0      4      gzip+8db1     1        0      4      gzip+d58      1        0      4
gzip+390b     1        0      4      gzip+390b     1        0      4      gzip+8db1     1        0      4
gzip+38f9     1        0      4      gzip+38f9     1        0      4      gzip+390b     1        0      4
gzip+f81      0.755    3      4      gzip+d98      0.755    3      4      gzip+dbe      0.963    3      4
gzip+f60      0.755    3      4      gzip+bc8      0.755    3      4      gzip+8def     0.963    3      4
gzip+ed0      0.755    3      4      gzip+98da     0.755    3      4      gzip+3b0d     0.963    3      4

The interesting property of these results is that while all three of the algorithms technically found the correct basic block tied for the first answer, there were many other blocks with the same score. According to the Ochiai algorithm, ties must be manually resolved. Note that FL-DFT helps in this case by creating a small tie of only three locations, as opposed to the seven locations indicated by FL-Basic and FL-Enhanced. This improved score results from the fact that the name variable is tainted when used in the strcpy. Since strcpy is one of the APIs our Pintool can recognize and record, that block receives a boost in its score according to the FL-DFT algorithm.
Pure-FTPd Network Server
To further test our new FL algorithms, a bug was added to an open source FTP server called Pure-FTPd, version 1.0.21. This example allows us to highlight the following three points:
1. To show that our tools can operate on networked code, though the focus of this fuzzing and analysis work has been on desktop applications.
2. To illustrate the benefits of visualization during manual bug analysis after FL is complete.
3. To illustrate the potentially subjective nature of analyzing fault localization techniques.
A stack buffer overflow was added in the TYPE command of the FTP server. Since EM's automatic test set creation currently works only for file-based inputs, a passing and failing test suite was created by hand. Writing a tool to automatically create similar tests would be possible for network inputs as well, but was not addressed for this research. Tainting network input is similar to working with file and standard input data on UNIX. The primary change is that network socket file descriptors are added to the tracked list. Reads (and other related system calls such as the recv function) on tracked file descriptors set taint on those memory locations. By describing this bug in detail, the information display is shown to make fault localization more useful than a simple list of the top X suspicious locations that would result otherwise. Code Listing 5 shows code for the dotype function. The bug occurs on lines 18-19, which were inserted for this research. If the input to the TYPE command is greater than DEFAULT_MAX_TYPE (400 bytes), then up to DEFAULT_MAX_TYPE_LEN (1024 bytes) will be copied into a fixed-size local buffer, creating a stack buffer overflow. The input argument length is checked, and if it is above 400 bytes, it is copied into the local buffer (an easy coverage-based condition to identify). The fix would be to remove lines 18-19. However, when operating a coverage-based fault localization program, is the bug line 18 or 19? Furthermore, since compiled code is operated on, there could be multiple basic blocks associated with a line. When presenting the results of the FL algorithm, which basic block(s) should the algorithm present?
In this case there is one block that contains the strncpy, which we call the buggy basic block.
CODE LISTING 5. Relevant code from dotype function
1.  void dotype(const char *arg) {
2.      replycode = 200;
3.      char TYPE[DEFAULT_MAX_TYPE];
4.      if (!arg || !*arg) {
5.          addreply(501, MSG_MISSING_ARG "\n" "A(scii) I(mage) L(ocal)");
6.      } else if (tolower((unsigned char) *arg) == 'a')
7.          type = 1;
8.      else if (tolower((unsigned char) *arg) == 'i')
9.          type = 2;
10.     else if (tolower((unsigned char) *arg) == 'l') {
11.         if (arg[1] == '8') {
12.             type = 2;
13.         } else if (isdigit((unsigned char) arg[1])) {
14.             addreply_noformat(504, MSG_TYPE_8BIT_FAILURE);
15.         } else {
16.             addreply_noformat(0, MSG_MISSING_ARG);
17.             type = 2;
            }
18.     } else if (strlen(arg) > DEFAULT_MAX_TYPE) {
19.         strncpy(TYPE, arg, DEFAULT_MAX_TYPE_LEN);
        }
20.     else {
21.         addreply(504, MSG_TYPE_UNKNOWN ": %s", arg);
        }
22.     addreply(0, MSG_TYPE_SUCCESS " %s", (type > 1) ? "8-bit binary" : "ASCII");
Figure 9 shows the IDA disassembly of the buggy procedure, the dotype function.
Figure 9: dotype Function from Pure-FTPd
Figure text not meant to be readable, for color and block reference only.
The coloring is from FL-DFT. The strncpy function (line 19) is the left side of the branch trio all the way to the right in the picture, colored red by FL-DFT. A branch trio is the visualization of a condition check and both of the possible branch blocks. Table 5 shows FL-DFT did better than the other two approaches.

Table 5: Results for Pure-FTPd
                          FL-Basic   FL-Enhanced   FL-DFT
Rank in suspicious file   2          2             1

In this case, the src pointer of the strncpy(dst, src, n) API call points to tainted information. Thus, the strncpy block gets a tainted boost to its suspiciousness score (as do other unrelated blocks). That is why FL-DFT scored the buggy block higher in the list of suspicious code blocks. Finally, the merits of the visualization shown in Figure 9 are discussed. Note that the second half of the final basic block is orange. As stated earlier, orange indicates that this code region is the common final ending14. Note that the key assembly instruction in this block is a return instruction (RET). The reason for the crash is that the stored return pointer is corrupted by a stack buffer overflow. Anyone with experience with memory corruption bugs will quickly check to see if there are any local buffers, and if so, note where they are used. The local buffer gets used by the strncpy15 function. The color helps because the analyst can quickly visualize important forensic details about the program execution, such as:
• Blocks executed
• Suspicious blocks
• Crash location
• Flow through branches
• Rough top-to-bottom execution sequence
For example, the block all the way to the right in Figure 9 is blue, meaning that the block was executed only in passing traces. Conversely, note that the other branch target of that same branch trio is red, which indicates two things: (1) the red block was executed at least once by a failing trace, and (2) that block is considered suspicious. Debugging intuition, added to the provided coloring, points to this branch and the strncpy block as part of the coding mistake. Thus, the top X suspicious locations are only a portion of the helpful information that can be provided.
14 Since this dissertation is concerned with crashing and non-crashing test cases, orange areas indicate the most common location of the crash. Note that this location is generally not the bug, and may or may not be near the actual bug.
Even if the FL algorithm had failed to find the correct basic block (that is, failed to identify it within the top 5 choices), an analyst familiar with stack bugs could look at the coloring in this function and surmise the real code error. This feature is a useful side benefit.
15 The strncpy function is a safer version of the traditionally misused strcpy function, because of the addition of a maximum size parameter. As seen here, it is still possible to misuse most functions.
Siemens Benchmark
The new FL approaches were tested using the Siemens benchmarks [SIR 2011], which are commonly used to demonstrate the effectiveness of fault localization techniques. The Siemens test set originates from work done at Siemens Corporate Research on data-flow and control-flow test adequacy [Hutchins et al. 1994]. This research targets the subset of bugs known as vulnerabilities, especially memory corruption bugs. If these vulnerabilities are first found in the field (as opposed to by the code developers), those systems can be exploited. Because the focus here is on vulnerabilities, only the program versions in the Siemens test set which include a memory corruption bug were used.

Table 6: Comparing FL Types (1=Success; 0=Failure)
                      FL-Basic   FL-Enhanced   FL-DFT   Top-K LEAP
print_tokens2: v10    1          1             1        1
replace: v23          1          1             1        1
schedule: v1          0          1             1        1
schedule: v5          0          1             1        1
schedule: v6          0          1             1        1
schedule2: v8         0          0             1        0
tot_info: v18         1          1             1        1
tot_info: v20         0          1             1        1
Average               37.50%     87.50%        100%     87.50%

Table 6 compares our three test variations against one of the best fault localization techniques, Top-K LEAP [Cheng et al. 2009]. A "1" indicates that the known bug was successfully found; a "0" indicates the FL approach failed to find the bug. Observe that FL-Enhanced is able to match the effectiveness of Top-K LEAP, and that FL-DFT outperforms Top-K LEAP. Of particular interest are the results for the schedule2, version 8 bug. That bug is not properly localized by 3 of the approaches. To be properly localized, the code block must be in the top 5 suspicious locations. The rest of this section describes why FL-DFT is able to successfully localize the bug in schedule2:v8. The particular bug in question results from missing code. The routine within the schedule2 program with the bug is a function called put_end. The code is shown in Code Listing 6.
CODE LISTING 6. Relevant code from put_end function
1.  int put_end(int prio, struct process * process)   /* Put process at end of queue */
    {
2.      struct process **next;
        /*if(prio > MAXPRIO || prio < 0) return(BADPRIO); */
        /* find end of queue */
3.      for(next = &prio_queue[prio].head; *next; next = &(*next)->next);
4.      *next = process;
5.      prio_queue[prio].length++;
6.      return(OK);
The bug occurs between lines 2 and 3: a commented-out input validation check that needs to be uncommented. FL-Basic found the erroneous area as the 19th choice, FL-Enhanced as the 12th pick, and FL-DFT found the bug as the first choice. FL-DFT performs better because it has access to relevant information (usage of tainted data) that the other FL algorithms do not. In particular, prio and prio_queue are tainted, both of which increase the suspiciousness scores of the relevant basic blocks in the put_end function. Table 7 summarizes the results.
Table 7: Results for Schedule2, Version 8
                          FL-Basic   FL-Enhanced   FL-DFT
Rank in suspicious file      19           12          1

VLC Media Player

The VLC Media Player [2011] is a free and open source cross-platform multimedia player and framework that plays most multimedia files as well as DVDs, Audio CDs, VCDs, and various streaming protocols. VLC had a bug, which was documented by Klein [2011]. Version 0.9.4 had a stack buffer overflow when a particular size field in the .ty+ file format (TiVo's digital video) was larger than a local 32-byte buffer. Code Listing 7 portrays a section of this large code base.

CODE LISTING 7. Relevant code from parse_master function
1.  static void parse_master(demux_t *p_demux) {
2.      demux_sys_t *p_sys = p_demux->p_sys;
3.      uint8_t mst_buf[32];
4.      int i, i_map_size;
5.      int64_t i_save_pos = stream_Tell(p_demux->s);
6.      int64_t i_pts_secs;
7.      free(p_sys->seq_table);
        /* parse header info */
8.      stream_Read(p_demux->s, mst_buf, 32);
9.      i_map_size = U32_AT(&mst_buf[20]);   /* size of bitmask, in bytes */
10.     p_sys->i_bits_per_seq_entry = i_map_size * 8;
11.     i = U32_AT(&mst_buf[28]);            /* size of SEQ table, in bytes */
12.     p_sys->i_seq_table_size = i / (8 + i_map_size);
        /* parse all the entries */
13.     p_sys->seq_table = malloc(p_sys->i_seq_table_size * sizeof(ty_seq_table_t));
14.     for (i = 0; i < p_sys->i_seq_table_size; i++) {
15.         stream_Read(p_demux->s, mst_buf, 8 + i_map_size);
…

In the parse_master function, the first stream_Read call (line 8) inputs 32 bytes of the relevant header. Parsed out of that information is the i_map_size variable, which is later misused (line 15) on the next stream_Read, leading to a local stack buffer corruption if the size is greater than 32.

Table 8: Results for VLC
                          FL-Basic   FL-Enhanced   FL-DFT
Rank in suspicious file     456          288          3

Table 8 shows the results of our three approaches. The actual buggy block (containing the second stream_Read) was correctly identified only by FL-DFT. Recall that the buggy basic block is required to be in the top five to be considered successfully identified. The top yellow block in Figure 10 is the buggy block. All three got close to the actual basic block by reporting a nearby basic block (the one in red). However, by closely inspecting each of the top 10 blocks reported by the FL-DFT approach, one would note that all 10 are better, in the sense that all of those basic blocks are in functions that operate on actual user data, rather than basic blocks that just happen to show up more in failed runs. In other words, apart from the red block, FL-Basic and FL-Enhanced more or less report random (noisy) blocks that just happened to show up more in failing traces, but were not relevant to the bug. On the other hand, nearly all of the top ten blocks reported by FL-DFT are meaningful code used to parse input data.

Figure 10: Visualized parse_master Function in IDA Pro via FL-DFT

The CF/EM system works particularly well with this example. The bug is easily found via fuzzing. A sample input can be found at http://samples.mplayerhq.hu/TiVo/test-dtivo-junkskip.ty+. One mutation fuzzing heuristic, which moves a DWORD (0xffffffff is common) through the input file, will find this bug when the pertinent size field gets changed. EM will then isolate the changed byte at offset 0x00300017 in the file. The original byte was the size 0x02, and when changed to 0xff it will crash VLC (trigger the bug). The beauty of our taint tracking is that it does not over taint.
That is, if all user input (the whole file) is tainted as it is read in, many basic blocks will be tagged, and thus the suspiciousness of many basic blocks will increase. Tainting everything may be better than no tainting, but it is not as helpful as tagging only the blocks that deal with the byte (0x02 changed to 0xff) at offset 0x00300017. By tagging just that one byte, only the highly relevant blocks are tagged as tainted and pushed to the top of the suspiciousness list. EM does this automatically, since the byte offset position is known (from the test set creation phase of EM). Note also that the automatic test set creation works as intended. As new files are created, each identical except that this particular byte varies in value (0x00 to 0xff), some will trigger the bug (value greater than 32) and others will not (value less than 32). Note that this bug also shows off a data-only situation. The only way this basic block can be distinguished is by considering data, since both passing and failing traces flow through the block.

Firefox

To test our FL techniques against a large, real-world program, the open source Mozilla Firefox browser [Firefox 2011] was chosen. Version 3.5.8 was used because there was a public PNG (graphic format) bug, which had an in-depth description on Bugzilla detailing the manual process developers followed to determine the root of the crash [Bugzilla-Mozilla 2011]. The cause of the bug comes from code in the file nsPNGDecoder.cpp in the location ../modules/libpr0n/decoders/png/. The row_callback function (line 684, revision 1.9.1) omits an important check to see if the row_num is in bounds. Code Listing 8 shows the small portion of relevant code from the repaired revision 1.9.2, with the added code on lines 6-7.

CODE LISTING 8. Relevant code from row_callback function
1.  void row_callback(png_structp png_ptr, png_bytep new_row, png_uint_32 row_num, int pass) {
2.      nsPNGDecoder *decoder = static_cast<nsPNGDecoder*>(png_get_progressive_ptr(png_ptr));
3.      // skip this frame
4.      if (decoder->mFrameIsHidden)
5.          return;
6.      if (row_num >= decoder->mFrameRect.height)
7.          return;
8.  …

The FL-DFT technique was able to correctly localize the bug, while the FL-Basic and FL-Enhanced techniques were not. The primary reason is that in large programs there is often too much noise and not enough signal. Coverage-based fault localization presumes a reasonable ratio of signal to noise. Conversely, if code is widely variable between traces, even with similar inputs, the Ochiai algorithm may key in on unrelated, but unique, code from failing traces as the source of the bug. To avoid such problems it is necessary to focus on the interesting portion of large code bases in order to reduce the variability in analysis. One technique used is to note which module (DLL in Windows) the exception occurs in, and only perform the FL within that module. Isolation of the problem solves many variability issues. However, in this case Firefox always uses a very large module called libxul, which is a combination of many other libraries designed to decrease the start time when Firefox is launched, making the code to analyze still very large. (When compiled, libxul.so is 482 MB, which is a very large compiled object.) Thus, analyzing only one module is not a satisfying solution in this case. This problem was overcome by homing in on important code, using tainted input tracking to key in on the code operating on relevant data.
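The focusing step just described can be sketched as a simple pre-filter applied before any suspiciousness ranking. The function and field names below are hypothetical, not the actual CF/EM interfaces; the point is only the order of the two cuts (crashing module first, then tainted blocks).

    def focus_candidates(blocks, crash_module=None, tainted_blocks=None):
        """Narrow the blocks considered by fault localization to cut trace-to-trace noise.

        blocks         : iterable of (module_name, block_address) pairs from the traces
        crash_module   : module (e.g., DLL or .so) where the exception was raised, if known
        tainted_blocks : set of block addresses observed touching the tracked input bytes
        """
        candidates = list(blocks)
        if crash_module is not None:
            # First cut: keep only blocks inside the module that raised the exception.
            candidates = [(m, a) for (m, a) in candidates if m == crash_module]
        if tainted_blocks:
            # Second cut: within a huge module such as libxul, keep only blocks that
            # operated on the tracked input bytes.
            candidates = [(m, a) for (m, a) in candidates if a in tainted_blocks]
        return candidates

    # Hypothetical usage for the Firefox case: nearly everything lives in libxul, so the
    # module filter alone barely helps, but the taint filter leaves a small, relevant set.
    blocks = [("libxul.so", 0x1000), ("libxul.so", 0x2000), ("libc.so", 0x3000)]
    print(focus_candidates(blocks, crash_module="libxul.so", tainted_blocks={0x2000}))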
The FL-Basic technique found the correct code as the 100th choice in the ordered list of suspicious locations. FL-Enhanced found it as the 34th choice. FL-DFT identified the correct function with the 1st choice. That information is summarized in Table 9.

Table 9: Results for Firefox PNG
                          FL-Basic   FL-Enhanced   FL-DFT
Rank in suspicious file     100           34          1

CHAPTER 8
SIGNIFICANCE OF OUR PRIMARY FAULT LOCALIZATION ENHANCEMENT

In this chapter, we want to establish that our primary fault localization contribution is significant. We will show that the data-flow faults that we can find occur with sufficient frequency that they are a problem worth solving. We begin by first showing a need to incorporate data information as part of a code coverage-based FL algorithm, a need that has been observed by others. We finish by illustrating the effectiveness of our FL improvements shown in Chapter 7. We will also estimate the frequency of the data-flow bugs that would not be well localized without our improvement. We combine our earlier results from the Siemens test set with a count from a large software project. Because the best coverage-based FL studies showed precisions as high as 75%, we expect the remaining bugs to be less than 25%, and we expect our data-flow bugs to make up a significant portion of that remaining 25%. If we find that data-flow bugs make up greater than 5% of total bugs, we will conclude that they are significant.

Lack of Focus on Data

The seminal Siemens paper [Hutchins et al. 1994] is the original source of the benchmark programs used in Chapter 7. In that paper, Hutchins et al. sought to determine if data-flow and control-flow testing strategies would aid traditional specification-based approaches. Not surprisingly, they found that both techniques aided in the discovery of bugs when supplementing the popular informal methods of the day. Given their observations, it is interesting that so many fault localization techniques have focused primarily on control-flow and have left data information by the wayside.

Consider a quote from the Hutchins paper: "Finally, we saw an approximately equal split between faults that were detected at higher ratios by Edge coverage and by DU [data definition-usage pair] coverage, leading us to conclude that the two methods, according to the way they are measured by Tactic, are frequently complementary in their effectiveness."

To be clear, our primary improvement (FL-DFT) is an overall fault localization improvement. Because it is paired with a coverage-based technique, it works well for code and data bugs. The analysis in this chapter is not meant to indicate that our new technique works only against data-flow and noisy bugs, but to understand how significant this new improvement is. The researchers of Top-K LEAP noted in their conclusion, "In this work, we only consider the control-flow information when building graphs and extracting discriminative signatures. We try to make the signature rich enough by considering various edge types so that it has enough expressiveness to discriminate the buggy cases. We find that at times this is not sufficient as the data-flow information matters. The same control-flow could be executed by both buggy and normal behaviors, which might differ only on the data that flows in along this control-flow" [Cheng et al. 2009]. In our initial investigations, we also noticed bugs that were not well localized by existing techniques. That is why we sought to improve existing FL techniques by incorporating data information.
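To make the combination concrete, the following is a minimal Python sketch of a coverage-based score (Ochiai, which FL-Basic builds on) plus an additive boost for blocks that touched tainted input. The boost constant, function names, and data layout are illustrative assumptions, not the actual FL-DFT implementation.

    import math

    def ochiai(failed_cov, passed_cov, total_failed):
        """Base Ochiai score: failed(b) / sqrt(total_failed * (failed(b) + passed(b)))."""
        scores = {}
        for block in set(failed_cov) | set(passed_cov):
            f = failed_cov.get(block, 0)   # failing traces that executed this block
            p = passed_cov.get(block, 0)   # passing traces that executed this block
            denom = math.sqrt(total_failed * (f + p))
            scores[block] = f / denom if denom else 0.0
        return scores

    def rank_with_taint(failed_cov, passed_cov, total_failed, tainted_blocks, boost=0.5):
        """Add an assumed constant boost to blocks that operated on tainted input bytes."""
        scores = ochiai(failed_cov, passed_cov, total_failed)
        for block in tainted_blocks:
            if block in scores:
                scores[block] += boost
        # Most suspicious first; an analyst would then inspect the top five entries.
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    # Toy data: a block that parses tainted input appears in every failing trace,
    # while an unrelated "noisy" block shows up in both failing and passing traces.
    failing = {"parse_block": 3, "noisy_block": 3}
    passing = {"noisy_block": 7}
    print(rank_with_taint(failing, passing, total_failed=3, tainted_blocks={"parse_block"}))

With plain Ochiai the two toy blocks score 1.0 and roughly 0.55; the taint boost simply widens that gap, which is the effect the data-augmented approach relies on in the cases discussed next.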
Improvement on Data-only and Noise Problems

In Chapter 7, we compared our FL techniques (FL-Basic, FL-Enhanced, and FL-DFT) to Top-K LEAP. Top-K LEAP properly localized 7 out of 8 bugs in the Siemens benchmark set. Our FL-DFT, which includes data information, solved 8 out of 8. The difficult bug required data-flow information to solve well. That experiment provides some insight into the prevalence of data-flow bugs. When investigating our FL techniques on other problems, our enhancements again showed improvement. The following is a summary, based on the sections in Chapter 7:
1. FL-DFT did the best on GZIP, primarily because FL-DFT was able to better score the important code block because that block used tainted input. The GZIP code base was smaller than VLC and Firefox, so noise was not as big a problem, but, as shown, data-flow was.
2. FL-DFT did the best on the VLC Media Player code. Noise and data-flow were both problematic. FL-Enhanced helped with the noise (37% improvement), but only FL-DFT was able to correctly locate the bug (99% improvement). FL-DFT did this by keying in on tainted data used by the API function at fault.
3. FL-DFT did the best on Firefox. Noise and data-flow were both problematic. Again, FL-Enhanced significantly outperformed FL-Basic (by 66% in this case). But only FL-DFT was able to cut through the many potential buggy code blocks and identify the real fault by using data-flow information (99% improvement).
Thus, we surmise that the data-flow issue, and the noise introduced by large or variable code sets, are important considerations that have previously been left unsolved. Our FL-Basic builds on a prior FL algorithm (Ochiai) by incorporating a bug hunting front end. FL-Enhanced improves FL-Basic by filtering out unimportant code blocks. FL-DFT performs the best because it builds on the prior techniques and also includes the ability to highlight the most important code blocks by using input data tracking.

Size Estimate

From the Siemens benchmark tests, 1 out of 8 (12.5%) of the faults examined was not properly localized without data information. The Siemens benchmark set came from a commercial code base, so it is a well-accepted test set. Recall that there were more than eight tests, but there were only eight that led to memory corruption vulnerabilities, the bugs we are most interested in. Therefore, the Siemens set indicates 12% as our initial estimate of the size of the set of bugs not well localized without our enhancement.

Second Estimate

To get another data point we used the Red Hat Linux Bugzilla repository [Bugzilla-Redhat 2012]. We used this data source because it was one of the few repositories that included (1) a variety of real-world bugs, and (2) sufficient information on the flaw, the original source code, and the required fix to the source code, such that we could get a reasonably accurate picture of the characteristics of each bug. Out of thousands of bugs, 51 of the most relevant were extracted from the Red Hat project (shown in Appendix: Significance Data). These 51 were found by searching the repository for memory corruption bugs, the bugs that are most useful to an attacker. The 51 bugs are from different components and different versions of Linux, so they provide breadth in our estimates. The bugs occurred from 2007 to 2012. We examined each bug manually to determine if data-flow information would likely be required to properly localize it. We chose a manual approach for a number of reasons.
One is that manual examination allows us to better reason about the circumstances of the bug; had we simply run the examples through our code, we would have found only what our code found, and that might be different. There was also a practical reason: setting up and running 51 samples through our test framework would have taken months without yielding better results.

We will explain our methodology using two of the Linux bugs. The first example shows code that contains no control-flow logic near the bug. That is, control-flow alone cannot isolate the bug. Therefore, we classify this bug as a data-flow bug, because it would require data information to localize the buggy line of code. In the second example, we show the opposite situation: the block of code that contains the bug has an if-statement (relevant control-flow), and would likely be correctly identified by code-coverage FL techniques. We showed similar examples earlier in Chapter 5 via Example 1 and Example 2.

Methodology Examples

As an example of our methodology for determining if a bug is a control-flow or data-flow candidate, consider the Enterprise Linux 5 kvm bug number 72938 (Appendix: Significance Data). If we examine the flaw and the fix, we see no control-flow that would aid an algorithm in localizing to this location. The flaw is shown in the top of Figure 11, and the fix is shown in the bottom of Figure 11. Since all paths must flow through line 801, we mark this bug as best being localized with data information. In other words, regardless of the flow through this function, the depth variable may be in error until the fix is applied.

Figure 11: File ./fs/nfsd/vfs.c

For an example of a control-flow bug, consider the Linux Xen bug number 222467 (Appendix: Significance Data). The flaw is the top of Figure 12, and the fix is the bottom of Figure 12. The bug happens when code within the branch statement is executed (lines 445-450). Ochiai-based algorithms are proficient at localizing code errors when there is an obvious control-flow difference. The fix moved the erroneous code to a different location in the code base.

Figure 12: File ./tools/libxc/xc_hvm_build.c

We chose the above approach based on our prior observations of when code coverage-based techniques fail. In Chapter 5 we previously showed two contrived examples of when code coverage works and when it fails. The underlying difference is that distinguishable control logic should reside near the bug for code coverage-based techniques to operate well.

Second Estimate Results

Of the Red Hat bugs, 6 out of 51 (11.76%) could be best localized with the presence of data information. For a combined estimate, we total the number of control-flow and data-flow bugs with the Siemens estimate. This information is summarized in Table 10. Since the three percentages are similar, we conclude that 12% is a reasonable estimate.

Table 10: Data-flow Estimate
           Control-flow   Data-flow   Percent
Red Hat         45            6       11.76%
Siemens          7            1       12.50%
Total           52            7       11.86%

Reasonableness

We note that the precision of two prior fault localization studies was 72.5% (for RAPID) and 74.3% (for Top-K LEAP) [Cheng et al. 2009]. These two techniques are among the best available coverage-based fault localization implementations. The first thing to note about each precision is that there is room for improvement. That fact is expected, since they do not claim to be perfect approaches. Improvement might come if a class of bugs being missed is found. FL-DFT did this when it found 8 of 8 in our Siemens study.
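As a quick sanity check on the arithmetic behind Table 10, the short Python snippet below recomputes the reported percentages from the raw counts; it is only arithmetic, not part of the CF/EM tooling.

    # Counts from the text: 6 of 51 Red Hat bugs and 1 of 8 Siemens memory corruption
    # bugs were judged to need data-flow information to localize well.
    redhat_data, redhat_total = 6, 51
    siemens_data, siemens_total = 1, 8

    redhat_pct = 100.0 * redhat_data / redhat_total
    siemens_pct = 100.0 * siemens_data / siemens_total
    combined_pct = 100.0 * (redhat_data + siemens_data) / (redhat_total + siemens_total)

    print(f"Red Hat:  {redhat_pct:.2f}%")    # 11.76%
    print(f"Siemens:  {siemens_pct:.2f}%")   # 12.50%
    print(f"Combined: {combined_pct:.2f}%")  # 11.86%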
Second, consider these two statistics together: the set size estimate from Table 10 and the best precision (Top-K). By adding the two numbers (74.3% + 11.86% = 86.16%), we note that there is still room for improvement if 100% of faults are to be localized. The fact that a chance for further improvement still exists is expected. If we had argued that the size estimate was a number that, when added to the best precision, summed to a number greater than 100%, our proposed set size estimate would have seemed unrealistically high. It is reasonable to assume that even if Top-K included the ability to identify data-only bugs, there would still be imperfection in the approach, though it would likely perform better. This information is summarized in Table 11.

Table 11: Reasoning about the Accuracy of the Data-flow Estimate
FL                        Precision
RAPID                     72.5%
Top-K                     74.3%
Best FL Precision         74.3%
+ Data-flow Estimate      11.86%
Reasonably, sum < 100%    86.16%

Summary

In this chapter we showed that our primary FL enhancement was needed. We also showed that it is significant, both in its ability to better localize a subset of bugs and in the size of that subset. We support the need for our new algorithm based on prior researchers' observations. We show the ability of our approach to better localize based on our automated, empirical results. We estimated the size of the set of bugs not well localized without our enhancement in the following manner:
1. We obtained the set size based on our empirical Siemens study and on a manual study of Red Hat Linux bugs. Both show the size to be approximately 12%.
2. We presume this number to be reasonable because it is within the thresholds we originally described: it is less than 25% and more than 5%.
Therefore, we say that our estimate is both reasonable and significant. We desired this knowledge as further credence for our FL contribution.

CHAPTER 9
CONCLUSIONS

This dissertation presents enhancements to existing coverage-based fault localization (CBFL) techniques. The best approach utilizes dynamically collected taint information on important data inputs. Specifically, tainted blocks receive an addition to their base suspiciousness score. This technique puts emphasis on code regions processing untrusted input, with the basic premise that those blocks are more likely to be at fault, and more likely to contain bugs that could be translated into an important subclass of bugs known as vulnerabilities or software exploits.

Our new CBFL is also combined with a distributed, or parallel, fuzzing framework. Finding bugs in parallel scales. Our framework also sorts and prioritizes discovered bugs. Finally, a user-friendly graphical interface allows testers or security researchers to enter notes about a particular bug so that team testing is encouraged.

Merit

This dissertation research provides a number of merits for the field of software engineering and software security research. A distributed bug hunting and analysis system was created and enhanced, and it has the following key contributions:
1. Fault localization is improved by incorporating input data tainting, married with a traditional coverage-based technique. The presented empirical studies show that the approach is important and obtains superior results to prior techniques, by testing against:
   a. Contrived problems
   b. The Siemens test set
   c. Large and real-world programs.
2. New vulnerabilities were discovered in Java, showing the effectiveness of the distributed testing platform.
3. The closed source (assembly) visualization was provided to aid the analyst in understanding correctly, and even incorrectly, localized bugs.
Threats to Validity

Fault categorization has been noted to be difficult. Consider the following quote: "Different authors propose different fault categories, and relatively few arguments exist that would clearly support their particular choices … Furthermore, we conclude that the distribution of software faults depends heavily on project-specific factors, such as the maturity of the software, operating environment or programming language" [Ploski et al. 2007]. Therefore, the estimate of the set of data-only bugs would likely vary if the language and environment change significantly.

Santelices et al. [2009] note that different types of bugs might be best localized by different FL techniques. However, since the bug type is generally not known ahead of time, multiple coverage types performed better in their research. Clearly the "correct" approach to all of fault localization has yet to be discovered and, just as clearly, more research needs to be done. Moreover, fault localization needs to be integrated more closely with the development or exploitation phases to become more commonly useful. This research takes a step forward with the CF/EM system.

Bug Hunting Limitations

Our fuzzer is only capable of finding bugs typically found with dynamic testing: denials-of-service, crashes, use-after-free, race conditions, and all manner of memory corruption (off-by-one, buffer overflows, print format errors, etc.). Design flaws and other architectural mistakes are unlikely to be uncovered by fuzzing. Additionally, dynamic testing rarely covers all code, but focuses on the interfaces.

Fault Localization Limitations

Our FL scheme is focused on the typical memory corruption bugs discovered by a fuzzer. Therefore, our pass/fail oracle is no-crash/crash. Additionally, our system is tuned for compiled code on the x86 architecture. Porting our techniques to other oracles and architectures should be a straightforward endeavor.

Future Work

This dissertation leads to many possible future research projects. The following are the most likely:
1. Future work can focus on continued data augmentations. Coverage-based approaches are popular, effective, and quick, but can fail because of unfocused analysis. This research suggests that continued work supplementing coverage-based approaches with data techniques holds promise.
2. Future work can focus on new or enhanced techniques to locate bugs. Work also done at Michigan State University by the author relates to the use of genetic algorithms to enhance fuzzing programs [Takanen et al. 2008].
3. Future work that focuses on further automation of the bug hunting and analysis pipeline is related and of high interest, particularly to the security community. Symbolic execution [Avgerinos et al. 2011], automated exploit development [Heelan 2009], and automated crash analysis [Miller 2010] are related to the pipeline approach taken by our distributed bug hunting framework. Incorporating those notions into our framework would provide value in the fields of software testing and security research.

APPENDICES

ClusterFuzz GUI Walkthrough

The following is a brief walkthrough of an early version of the ClusterFuzz (CF) graphical user interface (GUI). The GUI is accessed via a web browser and is capable of controlling all of the CF system. The discussion and figures add more detail to the prior CF description.
Only major features of the tool are discussed. Harris has since renamed the tool to CyberFuzz and continues work on it. Further detail or inquiries can be directed to Gary Kay (gkay@harris.com).

Figure 13: ClusterFuzz Primary View (figure text not meant to be readable; for visual reference only)

In Figure 13, the Cluster Actions pane (upper left) controls the segmenting or cohesion of available ESX servers. New jobs are created with the Job Actions pane. The top right block shows the Jobs Running. The Jobs in Queue block shows the current jobs waiting to run. The Crash Data block shows a summary of the "high quality" results; runs that did not discover results rated probably exploitable or better are not shown. The final window, on the bottom right, shows an overview of all the results collected.

Figure 14: Clicked on Job Results for Java (figure text not meant to be readable; for visual reference only)

In Figure 14, the upper left contains actions for the created job, such as Run Job or Stop Job. Once a job is complete, the pane below that can be used to manually retest results for reliability. By default, this happens automatically, a feature called Auto-Retest. The MinSet feature can be controlled below that. To the right, note the results from the job, ordered by reliability, number in each bucket, and severity. HQ stands for "high quality" (an !exploitable rating of probably exploitable or better), and Reliable means the bug is repeatable.

Figure 15: Clicked on Specific Java Results (figure text not meant to be readable; for visual reference only)

In Figure 15, the left window shows the job name, the type and major bucket ID, and the minor hash bucket ID. Execution Mining can be manually controlled from the window below, but it runs automatically for file inputs. On the right, notes about the result can be entered, or the relevant files can be downloaded for further inspection (click on Files in the upper right). Files include a crash trace, the original input, the fuzzed input, and a Python script to manually exercise the bug.

Significance Data

Table 12 provides the details for the 51 bugs used in Chapter 8.

Table 12: Bugzilla-Redhat Details
Red Hat Enterprise Linux 5 All Linux Enterprise Linux 5 Enterprise Linux 5 Enterprise Linux 5 Enterprise Linux 5 Enterprise Linux 5 Enterprise Linux 5 Enterprise Linux 5 Enterprise Linux 5 Component Data-only Date fixed Date reported Days to repair 220592 222467 Kernel-xen xen 0 0 11/7/07 2/14/07 12/22/06 1/12/07 320 33 236121 ipsec-tools 0 6/27/07 4/11/07 77 kernel boot panic with >=64GB memory [XEN-VT]'Cannot allocate memory' error LSPP: racoon has a buffer overflow when receiving large security context from kernel 316371 Kernel-xen 0 5/21/08 11/30/07 173 32-bit PAE HV hardware limitation > 4GB memory 423521 kernel 0 1/20/09 12/13/07 404 432867 kernel 0 1/20/09 2/14/08 341 memory leak on size-8192 buckets with NFSV4 memory corruption due to portmap call succeeding after parent rpc_clnt has been freed 435271 anaconda 0 6/11/08 2/28/08 104 RuntimeError: Error running lvm: Cannot allocate memory 435291 kernel 0 1/20/09 2/28/08 327 445348 kernel 0 5/29/08 5/6/08 23 LTC41974-Pages of a memory mapped NFS file get corrupted.
Please port the mtrr fixes from 2.4.25 so that some hardware can boot with >4GiB memory 446188 kernel 0 1/20/09 5/13/08 252 BUG: Don't reserve crashkernel memory > 4 GB on ia64 Bug ID/Link 107 Issue Table 12 (cont’d) Enterprise Linux 4 Enterprise Linux 5 Enterprise Linux 5 Enterprise Linux 5 All Linux Enterprise Linux 5 Enterprise Linux 5 Enterprise Linux 5 Enterprise Linux 5 All Linux Enterprise Linux 5 Enterprise Linux 5 Enterprise Linux 5 Enterprise Linux 5 Enterprise Linux 5 Enterprise memory corruption due to portmap call succeeding after parent rpc_clnt has been freed panic at intel_i915_configure() when amount of shared video memory set to 1M 448046 kernel 0 5/18/09 5/23/08 360 452756 kernel virtmanager 0 10/7/09 6/24/08 470 0 1/16/09 6/30/08 200 0 1/20/09 11/6/08 75 470769 kernel hfs filesytem 1 12/23/10 11/10/08 773 471613 kernel 0 1/6/10 11/14/08 418 478643 kernel 0 9/2/09 1/2/09 243 CVE-2008-5025 kernel: hfs: fix namelength memory corruption Corruption on ext3/xfs with O_DIRECT and unaligned user buffers multipath test causes memory leak and eventual system deadlock 485173 kernel 0 2/20/12 2/11/09 1104 kernel/module-verify-sig.c with memory uncleaned bug 487672 491988 kernel gfs2-utils 0 0 9/2/09 3/27/09 2/27/09 3/24/09 187 3 slab corruption with dlm and clvmd on ppc64 GFS2 File System Corruption in RHEL5.3 499553 kernel-xen 0 1/13/11 5/7/09 616 Cannot generate proper stacktrace on xen-ia64 503062 ksh 0 4/8/10 5/28/09 315 ksh trashes its internal memory, can dump core 503905 kernel 1 9/2/09 6/3/09 91 507846 kernel-xen 0 1/13/11 6/24/09 568 514412 521865 autofs kernel 0 0 3/30/10 3/30/10 7/28/09 9/8/09 245 203 kernel: TPM: get_event_name stack corruption [rhel-5.4] Balloon driver gives up too easily when ballooning up under memory pressure RHEL 5.5 RFE - autofs add configuration option to increase max open files and max stack size Xen fails to boot on ia64 with > 128GB memory 453392 470267 108 virt-manager.py is taking all the memory! 
cifs: data corruption due to interleaved partial writes timing out Table 12 (cont’d) Fedora Enterprise Linux 5 All Linux Enterprise Linux 5 Enterprise Linux 5 Enterprise Linux 5 Enterprise Linux 5 Enterprise Linux 5 Enterprise Linux 5 Enterprise Linux 5 Enterprise Linux 5 Enterprise Linux 5 Enterprise Linux 6 Enterprise Linux 5 Enterprise Linux 5 Enterprise Linux 5 532302 pam 1 11/11/09 11/1/09 10 pam_console memory corruption 546133 548249 autofs perl 0 0 7/21/11 2/21/12 12/10/09 12/16/09 588 797 autofs - memory leak on reload Perl threads leak memory when detached 552211 dhcp 0 1/14/10 1/4/10 10 558809 kernel 0 3/30/10 1/26/10 63 dhcpd memory leak when failover configured e1000 & e1000e: Memory corruption/paging error when tx hang occurs 565973 kernel 0 1/13/11 2/16/10 331 568912 All Linux 0 3/9/10 2/26/10 11 580509 All Linux 0 1/13/11 4/8/10 280 [EMC 5.6 bug] security and PSF update patch for EMC CKD ioctl Occasional data corruption using openssl.i686 AES ciphers on HP dc5850 systems Give correct memory amount back to dom0 when ballooning fails 581933 kernel 0 1/13/11 4/13/10 275 pci_mmcfg_init() making some of main memory uncacheable 585979 kexec-tools 0 1/13/11 4/26/10 262 kdump fails to save vmcore on machine with 1TB memory 590073 libvirt 0 1/13/11 5/7/10 251 Memory leak in libvirtd 601800 kernel 0 1/13/11 6/8/10 219 NFS-over-GFS out-of-order GETATTR Reply causes corruption 606335 corosync 1 11/10/10 6/21/10 142 malloc(): memory corruption 606919 xen 0 1/13/11 6/22/10 205 623219 gdb 0 1/13/11 8/11/10 155 625828 kexec-tools 0 1/13/11 8/20/10 146 Memory leak in xenstore gdb aborts with a 'double free or corruption' when calling inferior functions with print or call command kdump will not start on ppc64 server with 256G of memory enabled 109 Table 12 (cont’d) Enterprise Linux 5 Enterprise Linux 5 Enterprise Linux 5 All Linux Enterprise Linux 5 Enterprise Linux 5 Enterprise Linux 5 Enterprise Linux 5 Enterprise Linux 5 660292 m2crypto 0 1/13/11 12/6/10 38 663563 kernel 0 7/21/11 12/16/10 217 memory leak in the attached script [ext4/xfstests] 011 caused filesystem corruption after running many times in a loop xen xorg-x11672812 drv-mga 0 7/21/11 12/22/10 211 backport various fixes for memory leaks 0 12/6/11 1/26/11 314 xorg-x11-drv-mga 674314 kernel 1 6/21/11 2/1/11 140 NFSD: memory corruption due to writing beyond the array 678308 kexec-tools 0 2/20/12 2/17/11 368 681440 xenpv-win 0 6/8/11 3/2/11 98 690555 kernel 0 7/21/11 3/24/11 119 729381 kvm 1 4/8/12 8/9/11 243 kexec kernel crashes due to use of reserved memory range xenpv-win drivers leak memory on (misaligned I/O && backend is busy) GFS2: resource group bitmap corruption resulting in panics and withdraws memory corruption handling the acceptable command line option, leads to segfault Avg Fix Days: 263 665011 Percent FLDFT: 11.76% 110 BIBLIOGRAPHY 111 BIBLIOGRAPHY ABREU R., ZOETEWEIJ P., and VAN GEMUND, A. J. C. 2007. On the Accuracy of Spectrum-Based Fault Localization. In Proc. TAIC PART. AGRAWAL, H. and HORGAN, J. 1990. Dynamic Program Slicing. ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 246-256. AGRAWAL, H., DEMILLO, R., and SPAFFORD, E. 1993. Debugging with dynamic slicing and backtracking. Software Practice and Experience (SP&E), Vol. 23, No. 6, pages 589-616. AGRAWAL, H., HORGAN, J.R., LONDON, S., WONG, W.E. 1995. Fault Localization using Execution Slices and Dataflow Tests. Software Reliability Engineering. Proceedings., Sixth International Symposium on, vol., no., pp.143-151. 
AMINI, P. and PORTNOY, A. 2010. Sulley Fuzzing Framework. http://www.fuzzing.org/wpcontent/SulleyManual.pdf ARTZI, S., DOLBY, J., TIP, F., and PISTOIA, M. 2010. Directed Test Generation for Effective Fault Localization. In Proceedings of the 19th international symposium on Software testing and analysis (ISSTA '10). ACM, New York, NY, USA, 49-60. AVGERINOS, T., CHA, S., K., HAO, B., L., T., and BRUMLEY, D. 2011. AEG: Automatic Exploit Generation. 18th Annual Network and Distributed System Security Symposium. San Diego, CA. BUGZILLA. 2011. https://bugzilla.mozilla.org/show_bug.cgi?id=570451 BUGZILLA. 2012. https://bugzilla.redhat.com/ CHENG, H., LO, D., ZHOU, Y., WANG, X., and YAN, X. 2009. Identifying Bug Signatures using Discriminative Graph Mining. In Proceedings of the eighteenth international symposium on Software testing and analysis (ISSTA '09). ACM, New York, NY, USA, pages 141-152. CLARKE, R.A. AND KNAKE, R.K. 2010. Cyber War: The Next Threat to National Security and What to Do. Ecco, New York. CLEVE H. and ZELLER, A. May, 2005. Locating Causes of Program Failures, In proceedings of the International Conference on Software Engineering, pages 342-351, St. Louis, Missouri. 112 DEMOTT, J., ENBODY, R., and PUNCH, W. July 2007. Revolutionizing the Field of Greybox Attack Surface Testing with Evolutionary Fuzzing. BlackHat USA. EDDINGTON, M., CACCHETTI, A., KAMINSKY, D. 2011. Showing How Security Has (And Hasn't) Improved, After Ten Years Of Trying. CanSecWest Security Conference. http://www.dejavusecurity.com/files/DMK_AC_DD_Cansec_Fuzzmarking.pptx EXPLOITABLE. 2011. http://msecdbg.codeplex.com/ FIREFOX. 2011. http://www.mozilla.org/projects/firefox/ GALLAGHER, T., Office Security Engineering. 2009. BlueHat. http://technet.microsoft.com/en-us/security/ee460903.aspx#gallagher GASSER, M. May, 1988. Building a secure computer system. Van Nostrand Reinhold. ISBN 0-442-23022-2. GODEFROID, P., LEVIN, M., MOLNAR, D. 2008. Automated Whitebox Fuzz Testing. NDSS GOEL, S. August 2011. Cyberwarfare: connecting the dots in cyber intelligence. Commun. ACM 54, 8, pgs. 132-140. GREENGARD, S. December 2010. The new face of war. Commun. ACM 53, 12, 20-22. GUPTA, N., HE, H., ZHANG, X., and R. GUPTA. November, 2005. Locating Faulty Code Using Failure-Inducing Chops. IEEE/ACM International Conference on Automated Software Engineering, pages 263-272, Long Beach, California. GUPTA, R., GUPTA, N., XIANGYU ZHANG, JEFFREY, D., NAGARAJAN, V., TALLAM, S., CHEN TIAN. April 2008 Scalable dynamic information flow tracking and its applications. Parallel and Distributed Processing. IEEE International Symposium on , vol., no., pp.1-5, 14-18. HEELAN, S. 2009. Automatic Generation of Controlflow Hijacking Exploits for Software Vulnerabilities. MSc Computer Science Dissertation, Oxford University. HEX-RAYS. 2011. http://www.hex-rays.com/idapro/ HSU, H., JONES, J. A., and ORSO, A. 2008. Rapid: Identifying Bug Signatures to Support Debugging Activities. In Proceedings of the 23rd IEEE/ACM International Conference on Automated Software Engineering (ASE '08). IEEE Computer Society, Washington, DC, USA, pages 439-442. 113 HUTCHINS, M., FOSTER, H., GORADIA, T., and OSTRAND, T. 1994. Experiments of the Effectiveness of Dataflow- and Controlflow-based Test Adequacy Criteria. In Proceedings of the 16th international conference on Software engineering (ICSE '94). IEEE Computer Society Press, Los Alamitos, CA, USA, pages 191-200. JEFFERY, T., GUPTA, N., and GUPTA, R. 2008. Fault localization using Value Replacement. 
In Proceedings of the international symposium on Software testing and analysis (ISSTA '08). ACM, New York, NY, USA, pages 167-178. JONES, J., A., and HARROLD, M., J. 2005. Empirical Evaluation of the Tarantula Automatic Fault-Localization Technique. In Proceedings of the 20th IEEE/ACM international Conference on Automated software engineering (ASE '05). ACM, New York, NY, USA, pages 273-282. JONES, J. A. April, 2008. Semi-Automatic Fault Localization. Dissertation, Georgia Institute of Technology. KLEIN, T. 2011. A Bug Hunter's Diary: A Guided Tour through the Wilds of Software Security. San Francisco: No Starch Press. KOREL, B., and LASKI, J. 1988. Dynamic program slicing. Information Processing Letters (IPL), Vol. 29, No. 3, pages 155-163. RANDOLPH, K. December, 2009. Fuzzing Reader - Lessons http://blogs.adobe.com/asset/2009/12/fuzzing_reader_-_lessons_learned.html Learned. LEWIS, W. April, 2000. Software Testing and Continuous Quality and Improvement, 2nd Edition, pp. 29-39. ISBN 0-849-39833-9. LIBDFT. 2011. http://www.cs.columbia.edu/~vpk/research/libdft/ LIBLIT, B., AIKEN, A., ZHENG, A. X., and JORDAN, M. I. June, 2003. Bug Isolation via Remote Program Sampling, In Proceedings of the Conference on Programming Language Design and Implementation, San Diego, California. LIBLIT, B., NAIK, M., ZHENG, A., AIKEN, A., and JORDAN, M. 2005. Scalable Statistical Bug Isolation, ACM. LIU, C., YAN, X., FEI, L., HAN, J. and MDKIFF, S. P. September 2005. Sober: Statistical Model-Based Bug Localization, In ESEC/FSE. LUCAS, J. 1961. Minds, Machines, and Gödel. Philosophy XXXVI pages 112-127. 114 MCMINN, P. AND HOLCOMBE, M. March 2006. Evolutionary Testing Using an Extended Chaining Approach. ACM Evolutionary Computation, Pgs 41-64, Volume 14, Issue 1 MERCEDES, K. and WINOGRAD, T. October, 2008. Enhancing The Development Life Cycle To Produce Secure Software. Data & Analysis Center for Software. MICROSOFT. 2011. http://www.microsoft.com/security/sdl/ MILLER, B.P., FREDRIKSEN, L., and SO, B. December, 1990. An Empirical Study of the Reliability of UNIX Utilities, Communications of the ACM. http://pages.cs.wisc.edu/~bart/fuzz/fuzz.html MILLER, B.P., KOSKI, D., LEE, C.P., MAGANTY, V., MURTHY, R., NATARAJAN, A., and STEIDL, J. April 1995. Fuzz Revisited: A Re-examination of the Reliability of UNIX Utilities and Services. Computer Sciences Technical Report #1268, University of WisconsinMadison. MILLER, B.P., COOKSEY, G., and MOORE, F. July 2006. An Empirical Study of the Robustness of MacOS Applications Using Random Testing. First International Workshop on Random Testing. Portland, Maine. MILLER, C. 2010. Babysitting an Army of Monkeys: An analysis of fuzzing 4 products with 5 lines of Python. CanSecWest. MILLER, C., CABALLERO, J., JOHNSON, N., M., KANG, M., G., MCCAMANT, S., POOSANKAM, P., and SONG, D. 2010. Crash Analysis with BitBlaze. Black Hat USA. MOLNAR, D. and OPSTAD, L. 2010. Effective Fuzzing Strategies. CERT Vulnerability Discovery Workshop. http://www.cert.org/vuls/discovery/downloads/CERT-presentationdmolnar-larsop.pdf MULLINER, C. and MILLER, C. 2009. Fuzzing the phone in your phone. Black Hat USA. http://www.blackhat.com/presentations/bh-usa-09/MILLER/BHUSA09-MillerFuzzingPhone-PAPER.pdf MYERS, G. J. 1979. The Art of Software Testing. John Wiley and Sons. ISBN 0-471-04328-1. NAGY, B. July, 2009. Finding Microsoft Vulnerabilities by Fuzzing Binary Files with Ruby – A New Fuzzing Framework, SyScan. http://www.syscan.org/Sg/program.html OEHLERT, P. March/April 2005. 
Violating Assumptions with Fuzzing. IEEE Security & Privacy, Pages 58-62. 115 PAN, H. and SPAFFORD, E. 1992. Heuristics for Automatic Localization of Software Faults. Technical Report SERC-TR-166-P, Purdue University. PEACH. 2011. http://peachfuzzer.com/ PIN. 2011. http://www.pintool.org/ PIROMSOPA, K. and ENBODY, R. October-December, 2006. Secure Bit: Transparent, Hardware Buffer-Overflow Protection. IEEE Transactions on Dependable and Secure Computing, Vol. 3, No. 4. PLOSKI, J., ROHR, M., SCHWENKENBERG, P., AND HASSELBRING, W. 2007. Research Issues in Software Fault Categorization. SIGSOFT Software Eng. Notes, vol. 32, no. 6, p. 6. RENIERIS, M. and REISS, S. 2003. Fault Localization with Nearest Neighbor Queries. In ASE. SANTELICES, R., JONES, J., A., YU, Y., and HARROLD, M., J. 2009. Lightweight FaultLocalization using Multiple Coverage Types. In Proceedings of the 31st International Conference on Software Engineering (ICSE '09). IEEE Computer Society, Washington, DC, USA, 56-66. SAHA, D., NANDA, M. G., DHOOLIA, P., NANDIVADA, V. K., SINHA, V., and CHANDRA, S. 2011. Fault localization for data-centric programs. In Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering (ESEC/FSE '11). ACM, New York, NY, USA, 157-167. SECURE DEVELOPMENT LIFECYCLE TEAM. April 2007. Lessons learned from the Animated Cursor Security Bug. Microsoft Blog Post. http://blogs.msdn.com/b/sdl/archive/2007/04/26/ lessons-learned-from-the-animated-cursorsecurity-bug.aspx SOFTWARE-ARTIFACT INFRASTRUCTURE and REPOSITORY (SIR). 2011. http://sir.unl.edu/portal/index.html SUTTON, M. and GREENE, A. 2005. The Art of File Format Fuzzing. BlackHat USA. SUTTON, M. 2006. Fuzzing: Brute Force Vulnerability Discovery. ReCON. www.recon.cx/en/f/msutton-fuzzing.ppt SUTTON, M., GREENE, A., and AMINI, P. 2007. Fuzzing: Brute Force Vulnerability Discovery. Addison Wesley. ISBN: 0-321-44611-9. TAKANEN, A., DEMOTT, J., and MILLER, C. 2008. Fuzzing for Software Security Testing and Quality Assurance. Artech House, Inc. Norwood, MA. ISBN: 1-596-93214-7. 116 TASSEY, G. 2002. The economic impacts of inadequate infrastructure for software testing. National Institute of Standards and Technology. Planning Report 02-3. TESSERACT OCR. 2011. http://code.google.com/p/tesseract-ocr/ VESSEY, I. 1985. Expertise in debugging computer programs. International Journal of Man-Machine Studies: A process analysis, Vol. 23, No. 5, pages 459-494. VLC MEDIA PLAYER. 2011. http://www.videolan.org/vlc/ VOAS, J.M. and MILLER, K.W. 1992. The Revealing Power of a Test Case. Software Testing, Verification and Reliability, 2(1): 25–42. WONG, W. and DEBROY, V. 2009. Software Fault Localization. IEEE Reliability Society 2009 Annual Technology Report. YAN, X., CHENG, H., HAN, J., and YU, P., S. 2008. Mining Significant Graph Patterns by Leap Search. In Proceedings of the ACM SIGMOD international conference on Management of data (SIGMOD '08). ACM, New York, NY, USA, pages 433-444. ZELLER A. September, 1999. Yesterday, my Program Worked. Today, it does not. Why? Seventh European Software Engineering Conference/ Seventh ACM SIGSOFT Symposium on Foundations of Software Engineering (ESEC/FSE), pages 253-267. ZHANG, X., HE, H., GUPTA, N., and GUPTA, R. 2005. Experimental evaluation of using dynamic slices for fault location. In AADEBUG’05: Proceedings of the International Symposium on Automated Analysis-driven Debugging, pages 33–42, Monterey, California, USA. ZHANG, X., TALLAM, S., GUPTA, N., and GUPTA, R. 
June, 2007. Towards Locating Execution Omission Errors, ACM SIGPLAN Conference on Programming Language Design and Implementation, San Diego. 117