A practical off-line taint analysis framework and its application in reverse engineering of file format
Introduction
Over the past decade, dynamic taint analysis (DTA) has become a popular technique in the field of software security analysis. Fundamentally, DTA entails tagging specific user input sources as original taint data and monitoring their propagation during the entire process runtime. Thus, a taint data flow path is extracted, which can be used for further analyses on program semantics and smart fuzzing, among other applications. Data flow tracking is also necessary to secure local servers and clients against privacy leaks, which is critical to cybercrime prevention and digital forensics.
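The tagging-and-propagation idea described above can be illustrated with a minimal sketch. It is not taken from FlowWalker: the names `TaintedValue`, `taint_source`, `add` and `const` are hypothetical, and the sketch assumes the simplest propagation rule, in which a result carries the union of its operands' tags.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaintedValue:
    """A byte-sized value plus the set of input offsets it was derived from."""
    value: int
    tags: frozenset = frozenset()

def taint_source(data: bytes):
    """Mark every input byte as original taint data, tagged with its offset."""
    return [TaintedValue(b, frozenset({i})) for i, b in enumerate(data)]

def const(v: int) -> TaintedValue:
    """A program constant: carries no taint."""
    return TaintedValue(v)

def add(a: TaintedValue, b: TaintedValue) -> TaintedValue:
    """Assumed propagation rule: the result inherits the union of both tag sets."""
    return TaintedValue((a.value + b.value) & 0xFF, a.tags | b.tags)

# Hypothetical three-byte input; y depends on input bytes 0 and 2, z on none.
inp = taint_source(b"\x01\x02\x03")
x = add(inp[0], inp[2])
y = add(x, const(5))
z = const(7)
```

Inspecting `y.tags` afterwards gives the taint data flow path back to the input offsets, which is the information that later analyses build on.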
In recent years, DTA theory has been studied in depth and implemented by many researchers in numerous tools. The basic taint propagation strategy was first introduced by J. Newsome and D. Song in their tool TaintCheck (Newsome and Song, 2005), which aims to detect and analyze exploits in commodity software automatically. DTA algebra was later discussed systematically and theoretically in (Schwartz et al., 2010). Since then, DTA applications have increased in number. Many DTA tools have been implemented and widely used in research areas related to binary analysis and vulnerability exploitation, such as Temu (Yin and Song, 2010), Panorama (Yin et al., 2007), Minemu (Bosman et al., 2011), libdft (Kemerlis et al., 2012) and TaintScope (Wang et al., 2010). A number of high-level applications have been built on these tools, including taint-aided data format reverse engineering and data relevance assessment using taint analysis. REWARDS (Lin et al., 2008) is an outstanding implementation of this kind, which has achieved automatic network protocol format reverse engineering through context-aware monitored execution.
However, the practicability of DTA prototypes is subject to some limitations. The most challenging problem is the excessive overhead associated with these tools and platforms. Beyond the inherent overhead of binary instrumentation, DTA can consume so much extra storage and CPU time that these tools cannot execute even normal-scale programs. In particular, the I/O bottleneck of recording huge amounts of information to a database or a set of disk files for later analysis severely reduces execution speed. To make DTA practical despite this overhead, some researchers have attempted to improve its implementation. Papers published at the NDSS (Jee et al., 2012a) and CCS (Jee et al., 2013) security conferences over the past two years argued for possible enhancements from either theoretical or technical perspectives.
In this paper, we present FlowWalker, a novel taint analysis framework. The DTA function is performed off-line by separating the taint tracking logic from the execution process. Two stand-alone modules control recording and analysis: the dynamic module works on a binary instrumentation platform to instrument and record the trace of the target process, and a static analysis module or trace-replaying virtual machine replays the process and tracks the taint propagation with each executed instruction. Additionally, a file-format reverse engineering extension is designed and implemented by analyzing the implicit taint data correlations.
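The separation of recording from taint tracking can be sketched as follows. This is only an illustration under stated assumptions: the JSON-lines trace format, the `record` and `replay` functions, and the single-bit taint rule are all hypothetical stand-ins, not FlowWalker's actual trace format or propagation logic.

```python
import json
import os
import tempfile

def record(executed_instructions, path):
    """Recorder stand-in: log each executed instruction, with no taint logic."""
    with open(path, "w") as f:
        for ins in executed_instructions:
            f.write(json.dumps(ins) + "\n")

def replay(path, tainted_locations):
    """Replayer stand-in: walk the recorded trace and track taint off-line."""
    taint = set(tainted_locations)
    with open(path) as f:
        for line in f:
            ins = json.loads(line)
            if any(src in taint for src in ins["src"]):
                taint.add(ins["dst"])      # destination derives from tainted data
            else:
                taint.discard(ins["dst"])  # destination overwritten with clean data
    return taint

# Hypothetical four-instruction trace; mem:0x1000 holds a byte read from the input.
trace = [
    {"op": "mov", "dst": "eax", "src": ["mem:0x1000"]},
    {"op": "add", "dst": "ebx", "src": ["ebx", "eax"]},
    {"op": "mov", "dst": "eax", "src": ["imm"]},
    {"op": "mov", "dst": "mem:0x2000", "src": ["ebx"]},
]

with tempfile.TemporaryDirectory() as tmp:
    trace_path = os.path.join(tmp, "trace.jsonl")
    record(trace, trace_path)                    # on-line phase: cheap logging only
    result = replay(trace_path, {"mem:0x1000"})  # off-line phase: taint tracking
```

In this arrangement the monitored process pays only for writing the trace, while the cost of maintaining taint state is deferred to the replay step.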
The original aspects and contributions of this framework are threefold:
- Enhanced execution performance. The overhead imposed on running processes is kept at a practical level. The off-line analysis architecture removes all workload associated with maintaining and tracking taint status from real processes. A virtual machine replaying recorded traces can carry out complicated multi-tag taint tracking and parallelize the entire workload. Moreover, improved techniques such as memory-mapped files eliminate several bottlenecks.
- Comprehensive and adoptable taint propagation logic. Multi-tag taint attributes and strategies are applied. Several sequences of specific instructions that can produce particular semantic effects are identified and monitored. Most importantly, the taint analysis logic supports the MMX, SSE and SSE2 supplementary instruction sets.
- Innovative application to file format cognition. Currently, format reverse engineering with the aid of taint analysis mainly targets network protocols that are relatively uncomplicated compared to the more complex formats typically encountered in file-format reverse engineering. FlowWalker extracts more semantic information from taint analysis results and makes a significant attempt to deduce file formats from taint information, yielding a promising result.
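To give a feel for how multi-tag taint results might hint at file structure, the sketch below groups input offsets into candidate fields. It is an assumption-laden illustration, not FlowWalker's algorithm: the function `infer_fields`, the `accesses` input shape, and the heuristic that adjacent bytes consumed by the same instruction belong to one field are all hypothetical.

```python
def infer_fields(accesses, file_size):
    """Group file offsets into candidate fields as (start, length) pairs.

    accesses: iterable of sets, each holding the file offsets whose tags
    reached the operands of one executed instruction (assumed input shape).
    """
    parent = list(range(file_size))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for offsets in accesses:
        ordered = sorted(offsets)
        for a, b in zip(ordered, ordered[1:]):
            if b == a + 1:                 # join only adjacent bytes
                parent[find(b)] = find(a)

    # Because only neighbours are joined, every group is a contiguous run.
    fields, start = [], 0
    for i in range(1, file_size + 1):
        if i == file_size or find(i) != find(start):
            fields.append((start, i - start))
            start = i
    return fields

# Hypothetical 8-byte file: one 4-byte read, a 2-byte value used twice, one byte.
fields = infer_fields([{0, 1, 2, 3}, {4, 5}, {4, 5}, {6}], file_size=8)
```

For this input the sketch yields a 4-byte field, a 2-byte field and two single bytes; a real system would need further semantic information to label such fields.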
Moreover, because its architecture and techniques differ fundamentally from those of earlier tools, FlowWalker is a brand-new project rather than an improvement or modification of an existing code base. So that the framework can be verified and adopted by projects that need an efficient DTA base, we have published it on GitHub under a modified BSD license. Anyone interested in testing or adopting FlowWalker is invited to visit our project page.
This paper is organized as follows: Section 2 provides a summary of taint analysis and an overview of the architecture of FlowWalker. Section 3 introduces the design and implementation of the off-line taint analysis function of FlowWalker, including the detailed taint propagation logic. As a demonstration of practicability, Section 4 presents an extensive description of the application of taint analysis results to grey-box file-format reverse engineering. Finally, in Section 5, the methods used to evaluate FlowWalker and the results of that evaluation are presented.
Section snippets
Background and overview
In this section, we present the principles of the DTA technique, its general uses in the scope of security analysis, and the limitations of existing implementations. Then, we present the architecture of FlowWalker.
Design of off-line taint analysis architecture
The off-line taint analysis architecture consists of a Recorder and a Replayer, which together provide the taint analysis functionality. In this section, we present the details of the architecture design, especially the techniques used to improve the performance of this framework over existing DTA frameworks.
Taint-analysis-aided file-format reverse engineering
File format vulnerabilities in systems and software result from a lack of input verification. They can cause programs to behave unexpectedly when processing abnormally constructed inputs. Fuzz testing detects these vulnerabilities by generating disordered input files to make the program crash, which is rather inefficient. File format fuzzers such as Peach (Eddington, 2011) address this limitation by mutating seed files according to their formats. File format fuzzers require
Evaluation
In this section, the dynamic execution performance of different implementations of FlowWalker is compared to illustrate the effects of the improved techniques. A comparison is also made between FlowWalker and another DTA tool, dft-win (dingelish, 2014). Finally, the results of general tests of file-format reverse engineering are given as a demonstration of its effectiveness as well as the accuracy of our taint analysis. The test sets, raw experimental data and embedded testing code for counting
Related work
In the nine years since Newsome first proposed the concept of DTA (Newsome and Song, 2005), it has drawn significant attention from researchers and the industry. Though it has been increasingly adapted to various security-related applications, the problem of unsatisfactory execution speed has been a concern only over the last three years.
Researchers from Columbia University carried out remarkable explorations with positive results. At the VEE′12 conference, they presented their DTA platform
Conclusions
As a necessary extension of control flow theory, DTA has been highly valued and widely applied in the scope of binary software analysis, but the efficiency of existing implementations can hardly meet the requirements of real-world software. With the innovative off-line architecture implemented in FlowWalker, we shift major computing and storage overhead from execution of the target program to a standalone analysis module; together with many improvement techniques, FlowWalker is demonstrated to
Acknowledgment
This work was supported by the National Natural Science Foundation of China (Nos. 61170268, 61100047, and 61272493).
Dr. Baojiang Cui received his PhD degree in Control Theory and Control Engineering at Nankai University in China. He is an Associate Professor in the School of Computer Science at Beijing University of Posts and Telecommunications. His main research areas include software security, Internet of things and big data.
References (29)
- Control flow analysis
- et al. Minemu: the world's fastest taint tracker
- PNG (Portable Network Graphics) specification version 1.0 (1997)
- Efficient, transparent, and comprehensive runtime code manipulation (2004)
- et al. Polyglot: automatic extraction of protocol message format using dynamic binary analysis
- et al. Tupni: automatic reverse engineering of input formats
- libdft for win (2014)
- Peach fuzzing platform (2011)
- et al. Towards a universal data provenance framework using dynamic instrumentation
- et al. ShadowReplica: efficient parallelization of dynamic data flow tracking
- A general approach for efficiently accelerating software-based dynamic data flow tracking on commodity hardware
- A general approach for efficiently accelerating software-based dynamic data flow tracking on commodity hardware
- libdft: practical dynamic data flow tracking for commodity systems
- TIE: principled reverse engineering of types in binary programs
Fuwei Wang received his BE degree in information security at Beijing University of Posts and Telecommunications (2008–2012). Now he is a postgraduate at BUPT. His main research topics are software vulnerabilities and intelligent fuzz testing.
Tao Guo received his BE degree from the College of Automation of Huazhong University of Science and Technology in 1997, and his ME degree from the College of Mechanical Science and Engineering of Huazhong University of Science and Technology in 2000. He received his PhD degree from Huazhong University of Science and Technology in 2004. His research interests include software security and vulnerability analysis.
Guowei Dong received his BE degree from the College of Computer of Yantai University in 2004. He received his PhD degree from Southeast University in 2009. His research interests include software security, vulnerability analysis and software testing.