A practical off-line taint analysis framework and its application in reverse engineering of file format

doi:10.1016/j.cose.2015.02.006

Computers & Security

Volume 51, June 2015, Pages 1-15

https://doi.org/10.1016/j.cose.2015.02.006

Highlights

• A novel off-line dynamic taint analysis framework with efficiency and precision.
• Over 60% enhancement of execution speed compared to existing tools.
• Fine-grained analysis with parallelization during simulated playback.
• Application to reverse engineering of file formats with over 85% cognition rate.

Abstract

This paper presents FlowWalker, a novel dynamic taint analysis framework that aims to extract the complete taint data flow while eliminating the bottlenecks that occur in existing tools, with applications to file-format reverse engineering. The framework proposes a multi-taint-tag assembly-level taint propagation strategy. FlowWalker separates taint tracking operations from execution with an off-line structure, utilizes memory-mapped files to enhance I/O efficiency, processes taint paths during virtual execution playback, and uses parallelization and pipelining mechanisms to achieve speedup. Based on the semantic correlations implied by the taint path information, this paper presents an algorithm for extracting the structures of unknown file formats. According to test data, the overall program runtime ranges from 92.98% to 208.01% of the length of the underlying instrumentation alone, while the speed enhancement is 60% compared to another well-featured tool in Windows. Medium-complexity file formats are correctly partitioned, and the constant fields are extracted. Due to its efficiency and scalability, FlowWalker can address the needs of further security-related research.

Introduction

Over the past decade, dynamic taint analysis (DTA) has become a popular technique in the field of software security analysis. Fundamentally, DTA entails tagging specific user input sources as original taint data and monitoring their propagation during the entire process runtime. Thus, a taint data flow path is extracted, which can be used for further analyses on program semantics and smart fuzzing, among other applications. Data flow tracking is also necessary to secure local servers and clients against privacy leaks, which is critical to cybercrime prevention and digital forensics.
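The tagging-and-propagation idea described above can be illustrated with a minimal sketch. It is not taken from FlowWalker: the three-field instruction tuples, the location names and the `propagate` function are hypothetical, and a single boolean taint state per location is assumed.

```python
# Toy taint tracker over a hypothetical (op, dst, srcs) instruction format.
def propagate(trace, tainted_sources):
    """Return the set of locations that are tainted after running `trace`.

    trace: list of (op, dst, srcs) tuples, e.g. ("mov", "eax", ["mem:0x10"]).
    tainted_sources: locations initially tagged as taint sources.
    """
    tainted = set(tainted_sources)
    for op, dst, srcs in trace:
        if op != "const" and any(s in tainted for s in srcs):
            # The destination now depends on at least one tainted source.
            tainted.add(dst)
        else:
            # A constant or an untainted value overwrites the destination.
            tainted.discard(dst)
    return tainted


trace = [
    ("mov", "eax", ["mem:0x10"]),      # eax <- tainted input byte
    ("add", "ebx", ["ebx", "eax"]),    # ebx now depends on eax
    ("const", "eax", []),              # eax is overwritten by a constant
    ("mov", "mem:0x20", ["ebx"]),      # taint reaches another memory cell
]
result = propagate(trace, {"mem:0x10"})
print(sorted(result))
```

In this sketch the list of tainted locations at the end of the trace stands in for the extracted taint data flow path; a real tool would also keep the per-instruction history.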

In recent years, DTA theory has been studied in-depth and implemented by many researchers in numerous tools. The basic taint propagation strategy was first introduced by J. Newsome and D. Song in their tool TaintCheck (Newsome and Song, 2005), which aims to perform automatic detection and analysis of exploits in commodity software. DTA algebra was later discussed systematically and theoretically in (Schwartz et al., 2010). Since then, DTA applications have increased in number. Many DTA techniques have been implemented and widely utilized in various research areas related to binary analyses and vulnerability exploitation, such as Temu (Yin and Song, 2010), Panorama (Yin et al., 2007), Minemu (Bosman et al., 2011), libdft (Kemerlis et al., 2012) and TaintScope (Wang et al., 2010). A number of high-level applications have been designed on the basis of DTA tools. Examples of these applications include taint-aided data format reverse engineering and data relevance assessment using taint analysis. REWARDS (Lin et al., 2008) is an outstanding implementation of this kind, which has achieved automatic network protocol format reverse engineering through context-aware monitored execution.

However, the practicability of DTA prototypes is subject to some limitations. The most challenging problem is the excessive overhead associated with these tools and platforms. On top of the inherent overhead of binary instrumentation, DTA can consume so much extra storage and CPU time that these tools cannot execute even normal-scale programs. In particular, I/O bottlenecks in recording the huge amount of information to a typical database or set of disk files for later analysis severely reduce execution speed. To make DTA usable in practice despite this overhead, some researchers have attempted to improve its implementation. Papers published at the NDSS (Jee et al., 2012a) and CCS (Jee et al., 2013) security conferences over the past two years argue for possible enhancements from either theoretical or technical perspectives.

In this paper, we present FlowWalker, a novel taint analysis framework. The DTA function is performed off-line by separating the taint tracking logic from the execution process. Two stand-alone modules control recording and analysis: the dynamic module works on a binary instrumentation platform to instrument and record the trace of the target process, and a static analysis module or trace-replaying virtual machine replays the process and tracks the taint propagation with each executed instruction. Additionally, a file-format reverse engineering extension is designed and implemented by analyzing the implicit taint data correlations.
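The separation of recording from taint tracking can be sketched as follows. This is only an illustration under stated assumptions: the JSON-lines trace format, the event fields and the `record` and `replay` functions are hypothetical and do not reflect FlowWalker's actual trace layout or interfaces. Each tainted location carries a set of tags (here, input-file offsets) to mimic multi-tag tracking.

```python
import io
import json


def record(events, sink):
    """Recorder side: serialize executed instructions only, no taint logic."""
    for ev in events:
        sink.write(json.dumps(ev) + "\n")


def replay(source, input_tags):
    """Replayer side: walk the recorded trace and propagate tag sets.

    input_tags maps a location to the set of input offsets it holds.
    """
    shadow = {loc: set(tags) for loc, tags in input_tags.items()}
    for line in source:
        ev = json.loads(line)
        merged = set()
        for src in ev["srcs"]:
            merged |= shadow.get(src, set())
        if merged:
            shadow[ev["dst"]] = merged      # destination inherits all source tags
        else:
            shadow.pop(ev["dst"], None)     # destination becomes clean
    return shadow


log = io.StringIO()  # stands in for the trace file written during execution
record([
    {"op": "mov", "dst": "al", "srcs": ["mem:0x100"]},
    {"op": "mov", "dst": "bl", "srcs": ["mem:0x101"]},
    {"op": "add", "dst": "al", "srcs": ["al", "bl"]},
], log)
log.seek(0)
shadow = replay(log, {"mem:0x100": {0}, "mem:0x101": {1}})
print(shadow["al"], shadow["bl"])
```

Because `replay` only needs the recorded trace, it can run after the target process has exited, which is the property the off-line design relies on.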

The original aspects and contributions of this framework are threefold:

• Enhanced execution performance. The overhead attached to running processes is maintained at an applicable level. The off-line analysis architecture removes all workload associated with maintaining and tracking taint status from real processes. A virtual machine replaying recorded traces can carry out complicated multi-tag taint tracking and parallelize the entire workload. Moreover, with the improvement of techniques such as those related to memory-mapped files, several bottlenecks are eliminated.
• Comprehensive and adoptable taint propagation logic. Multi-tag taint attributes and strategies are applied. Several sequences of specific instructions that can produce particular semantic effects are identified and monitored. Most importantly, support for MMX, SSE and SSE2 supplementary instruction sets is added for the taint analysis logic.
• Innovative application to file format cognition. Currently, format reverse engineering with the aid of taint analysis mainly targets network protocols that are relatively uncomplicated compared to the more complex formats typically encountered in file-format reverse engineering. FlowWalker extracts more semantic information from taint analysis results and makes a significant attempt to deduce file formats from taint information, yielding a promising result.
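One common way to turn taint results into a field layout is to group neighbouring input bytes that are consumed by the same instructions. The sketch below illustrates that general heuristic only; it is not the algorithm of this paper, and the `partition_fields` function, the instruction labels and the input layout are all hypothetical.

```python
def partition_fields(accesses, file_size):
    """Group consecutive input offsets that are used by the same instructions.

    accesses: list of (instruction_id, offsets) pairs taken from a taint trace.
    Returns a list of half-open (start, end) offset ranges.
    """
    users = [set() for _ in range(file_size)]
    for insn, offsets in accesses:
        for off in offsets:
            users[off].add(insn)

    fields = []
    start = 0
    for off in range(1, file_size + 1):
        # Close the current field at the end of input or when the set of
        # instructions touching this byte differs from the field's first byte.
        if off == file_size or users[off] != users[start]:
            fields.append((start, off))
            start = off
    return fields


accesses = [
    ("cmp@0x401000", [0, 1, 2, 3]),   # e.g. a 4-byte magic number check
    ("mov@0x401010", [4, 5]),         # e.g. a 2-byte length field
    ("mov@0x401020", [6, 7]),         # e.g. a 2-byte flags field
]
fields = partition_fields(accesses, 8)
print(fields)
```

Bytes compared against constants, such as the hypothetical magic number above, are natural candidates for constant fields in this kind of analysis.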

Moreover, because its architecture and techniques differ fundamentally from those of earlier tools, FlowWalker is an entirely new project rather than an improvement or modification of an existing code base. So that the framework can be verified and adopted by projects that need an efficient DTA base, we have published it on GitHub under a modified BSD license. Anyone interested in testing or adopting FlowWalker is invited to visit our project page.¹

This paper is organized as follows: Section 2 provides a summary of taint analysis and an overview of the architecture of FlowWalker. Section 3 introduces the design and implementation of the off-line taint analysis function of FlowWalker, including the detailed taint propagation logic. As a demonstration of practicability, Section 4 presents an extensive description of the application of taint analysis results to grey-box file-format reverse engineering. Finally, in Section 5, the methods used to evaluate FlowWalker and the results of that evaluation are presented.

Section snippets

Background and overview

In this section, we present the principles of the DTA technique, its general uses in the scope of security analysis, and the limitations of existing implementations. Then, we present the architecture of FlowWalker.

Design of off-line taint analysis architecture

The off-line taint analysis architecture consists of Recorder and Replayer, which provide the taint analysis functionality. In this section, we present the details of the architecture design, especially the techniques utilized to enhance the performance of this framework compared to existing DTA frameworks.
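Memory-mapped files are one of the I/O techniques the abstract names for passing trace data between the two modules. A minimal sketch of the idea, using Python's standard `mmap` module for illustration, is given below. The fixed-size record layout (thread id plus instruction address) and the `write_trace` and `read_trace` helpers are assumptions, not FlowWalker's actual trace format.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical record layout: 4-byte thread id, 8-byte instruction address.
RECORD = struct.Struct("<IQ")


def write_trace(path, records):
    """Recorder side: write fixed-size records through a memory mapping."""
    size = RECORD.size * len(records)
    with open(path, "wb") as f:
        f.truncate(size)                      # pre-size the file before mapping
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), size) as view:
        for i, rec in enumerate(records):
            RECORD.pack_into(view, i * RECORD.size, *rec)
        view.flush()


def read_trace(path):
    """Replayer side: map the file read-only and decode every record."""
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
        return [RECORD.unpack_from(view, off)
                for off in range(0, len(view), RECORD.size)]


path = os.path.join(tempfile.mkdtemp(), "trace.bin")
write_trace(path, [(1, 0x401000), (1, 0x401005), (2, 0x7FF00010)])
loaded = read_trace(path)
print(loaded)
```

The sketch assumes a non-empty trace, since an empty file cannot be mapped; the writer stores records with plain memory writes and leaves paging to the operating system instead of issuing one write call per record.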

Taint-analysis-aided file-format reverse engineering

File format vulnerabilities of systems and software result from a lack of input verification. They can result in programs behaving unexpectedly when dealing with abnormally constructed inputs. Fuzzing tests detect these vulnerabilities by generating disordered input files to make the program crash, rendering them rather inefficient. File format fuzzing improves upon this limitation by changing seed files according to their formats, such as Peach (Eddington, 2011). File format fuzzers require

Evaluation

In this section, the dynamic execution performances of different implementations of FlowWalker are compared to illustrate the effects of the improved techniques. A comparison is made between FlowWalker and another DTA tool, dft-win (dingelish, 2014). Finally, the results of general tests of file-format reverse engineering are given as a demonstration of its effectiveness as well as the accuracy of our taint analysis. The test sets, raw experimental data and embedded testing codes for counting

Related work

In the nine years since Newsome first proposed the concept of DTA (Newsome and Song, 2005), it has drawn significant attention from researchers and the industry. Though it has been increasingly adapted to various security-related applications, the problem of unsatisfactory execution speed has been a concern only over the last three years.

Researchers from Columbia University carried out remarkable explorations with positive results. At the VEE′12 conference, they presented their DTA platform

Conclusions

As a necessary extension of control flow theory, DTA has been highly valued and widely applied in the scope of binary software analysis, but the efficiency of existing implementations can hardly meet the requirements of real-world software. With the innovative off-line architecture implemented in FlowWalker, we shift major computing and storage overhead from execution of the target program to a standalone analysis module; together with many improvement techniques, FlowWalker is demonstrated to

Acknowledgment

This work was supported by National Natural Science Foundation of China (No. 61170268, 61100047, and 61272493).

Dr. Baojiang Cui received his PhD degree in Control Theory and Control Engineering at Nankai University in China. He is an Associate Professor in the School of Computer Science at Beijing University of Posts and Telecommunications. His main research areas include software security, Internet of things and big data.

References (29)

F.E. Allen. Control flow analysis.
E. Bosman et al. Minemu: the world's fastest taint tracker.
T. Boutell. PNG (Portable Network Graphics) specification version 1.0 (1997).
D.L. Bruening. Efficient, transparent, and comprehensive runtime code manipulation (2004).
J. Caballero et al. Polyglot: automatic extraction of protocol message format using dynamic binary analysis.
W. Cui et al. Tupni: automatic reverse engineering of input formats.
dingelish. libdft for win (2014).
M. Eddington. Peach fuzzing platform (2011).
E. Gessiou et al. Towards a universal data provenance framework using dynamic instrumentation.
K. Jee et al. Shadowreplica: efficient parallelization of dynamic data flow tracking.
K. Jee et al. A general approach for efficiently accelerating software-based dynamic data flow tracking on commodity hardware.
V.P. Kemerlis et al. libdft: practical dynamic data flow tracking for commodity systems.
J. Lee et al. Tie: principled reverse engineering of types in binary programs.

Fuwei Wang received his BE degree in information security at Beijing University of Posts and Telecommunications (2008–2012). Now he is a postgraduate at BUPT. His main research topics are software vulnerabilities and intelligent fuzz testing.

Tao Guo received the BE degree from the College of Automation of Huazhong University of Science and Technology in 1997, and the ME degree from the College of Mechanical Science and Engineering of Huazhong University of Science and Technology in 2000. He received the PhD degree from Huazhong University of Science and Technology in 2004. His research interests include software security and vulnerability analysis.

Guowei Dong received the BE degree from the College of Computer of Yantai University, in 2004. He got the PhD degree from Southeast University in 2009. His research interests include software security, vulnerability analysis and software testing.
