On the capability of static code analysis to detect security vulnerabilities

https://doi.org/10.1016/j.infsof.2015.08.002

Abstract

Context: Static analysis of source code is a scalable method for discovery of software faults and security vulnerabilities. Techniques for static code analysis have matured in the last decade and many tools have been developed to support automatic detection.

Objective: This research work is focused on empirical evaluation of the ability of static code analysis tools to detect security vulnerabilities, with the objective of better understanding their strengths and shortcomings.

Method: We conducted an experiment in which the benchmarking test suite Juliet was used to evaluate three widely used commercial tools for static code analysis. Unique characteristics of this work are the use of a design-of-experiments approach to conduct the analysis and evaluation, and the statistical testing of the results. In addition to the controlled experiment, the empirical evaluation included case studies based on three open source programs.

Results: Our experiment showed that 27% of C/C++ vulnerabilities and 11% of Java vulnerabilities were missed by all three tools. Some vulnerabilities were detected by only one tool or by a combination of two tools; 41% of C/C++ and 21% of Java vulnerabilities were detected by all three tools. More importantly, the static code analysis tools did not show a statistically significant difference in their ability to detect security vulnerabilities for either C/C++ or Java. Interestingly, for all tools the median and mean of the per-CWE recall values, as well as the overall recall across all CWEs, were close to or below 50%, which indicates performance comparable to or worse than random guessing. While for C/C++ vulnerabilities one of the tools performed better in terms of probability of false alarm than the other two tools, there was no statistically significant difference among the tools’ probability of false alarm for the Java test cases.

Conclusions: Despite recent advances in methods for static code analysis, the state-of-the-art tools are not very effective in detecting security vulnerabilities.

Introduction

Today’s economy is heavily reliant on computer systems and networks, and many sectors, including finance, e-commerce, supply chain, transportation, energy, and health care, cannot function without them. The growth of the online commercial environment and associated transactions, and the increasing volume of sensitive information accessible online, have fueled the growth of cyber attacks by organized criminal elements and other adversaries [1]. According to the 2014 report by the Ponemon Institute, the mean annualized cost of cyber crime for 257 benchmarked organizations was $7.6 million per year, with an average of 31 days to contain a cyber attack [2].

Deficiencies in software quality are among the leading reasons behind security vulnerabilities. A vulnerability is defined as a property of system security requirements, design, implementation, or operation that could be accidentally triggered or intentionally exploited and result in a security failure [3]. Basically, if a security failure has been experienced, there must have been a vulnerability. Based on estimates made by the National Institute of Standards and Technology (NIST), the US economy loses $60 billion annually in costs associated with developing and distributing patches that fix software faults and vulnerabilities, as well as costs from lost productivity due to computer malware and other problems caused by software faults [4].

Therefore, it is becoming imperative to account for security when software systems are designed and developed, and to extend verification and validation capabilities to cover information assurance and cybersecurity concerns. Anecdotal evidence [5] and prior empirical studies [6], [7] indicated the need to use a variety of vulnerability prevention and discovery techniques throughout the software development cycle. One of these vulnerability discovery techniques is static analysis of source code, which provides a scalable way to conduct security code review; it can be used early in the life cycle, does not require the system to be executable, and can be applied to parts of the overall code base. Tools for static analysis have rapidly matured in the last decade; they have evolved from simple lexical analysis to much more complex techniques. In general, however, static analysis problems are undecidable [8] (i.e., it is impossible to construct an algorithm that always leads to a correct answer in every case). Therefore, static code analysis tools do not detect all vulnerabilities in source code (i.e., they produce false negatives) and are prone to reporting findings that, upon closer examination, turn out not to be security vulnerabilities (i.e., false positives). To be of practical use, a static code analysis tool should find as many vulnerabilities as possible, ideally all of them, with a minimum number of false positives, ideally none.

This paper is focused on empirical evaluation of static code analysis tools’ ability to detect security vulnerabilities, with a goal to better understand their strengths and shortcomings. For this purpose we chose three state-of-the-art, commercial static code analysis tools, denoted throughout the paper as tools A, B, and C. The criteria used to select the tools were that they: (1) are widely used, (2) specifically identify security vulnerabilities (e.g., using the Common Weakness Enumeration (CWE) [9]) and support detection of a significant number of different types of vulnerabilities, (3) support the C, C++, and Java languages, and (4) are capable of analyzing large software applications, i.e., scale well. An additional consideration in the selection process was to choose one tool from each of the three main classes of static code analysis tools [10] (given here in no particular order): ‘program verification and property checking’, ‘bug finding’, and ‘security review’.

With respect to the vulnerabilities included in the evaluation, as in works focused on software fault detection [11], synthetic vulnerabilities can be provided in large numbers, which allows more data to be gathered than otherwise possible, but likely with less external validity. On the other hand, naturally occurring vulnerabilities typically cannot be found in large numbers, but they represent actual events. Since either approach has its own advantages and disadvantages, we decided to use both.

The first evaluation approach is based on a controlled experiment using the benchmark test suite Juliet, which was originally developed by the Center for Assured Software at the National Security Agency (NSA) [12] and is publicly available. The Juliet test suite consists of many sets of synthetically generated test cases; each set covers a single kind of flaw documented by the Common Weakness Enumeration (CWE) [9]. Specifically, we used the largest subset of the Juliet test suite claimed to be detectable by all three tools, which consisted of 22 CWEs for C/C++ and 19 CWEs for Java, with 21,265 and 7516 test cases, respectively. Testing static code analysis tools with this benchmark test suite allowed us: to cover a reasonably large number of vulnerabilities of many types; to automate the evaluation and the computation of the tools’ performance metrics, such as accuracy, recall (i.e., probability of detection), and probability of false alarm; and to run statistical tests.
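
To make the reported metrics concrete, the minimal sketch below shows how accuracy, recall, and probability of false alarm can be computed from per-CWE confusion-matrix counts. It is not the authors’ evaluation harness; the class name, helper functions, and counts are hypothetical placeholders chosen only for illustration.

```python
# Minimal sketch (hypothetical, not the paper's harness): per-CWE metrics
# computed from confusion-matrix counts of a tool's results on Juliet-style
# "bad" (flawed) and "good" (non-flawed) functions.
from dataclasses import dataclass


@dataclass
class CweCounts:
    tp: int  # flawed ("bad") functions correctly flagged by the tool
    fn: int  # flawed functions the tool missed
    fp: int  # non-flawed ("good") functions incorrectly flagged
    tn: int  # non-flawed functions correctly left unflagged


def recall(c: CweCounts) -> float:
    """Recall (probability of detection): TP / (TP + FN)."""
    return c.tp / (c.tp + c.fn)


def prob_false_alarm(c: CweCounts) -> float:
    """Probability of false alarm: FP / (FP + TN)."""
    return c.fp / (c.fp + c.tn)


def accuracy(c: CweCounts) -> float:
    """Fraction of all functions classified correctly: (TP + TN) / total."""
    total = c.tp + c.fn + c.fp + c.tn
    return (c.tp + c.tn) / total


# Example with made-up counts for one CWE:
example = CweCounts(tp=120, fn=180, fp=30, tn=270)
print(recall(example), prob_false_alarm(example), accuracy(example))
```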

In addition to the experimental approach, our empirical evaluation includes three case studies based on open source programs, two of which were implemented in C and one in Java. Each program has a known set of vulnerabilities that allows for quantitative analysis of the tools’ ability to detect security vulnerabilities. For this part of the study, because of the relatively small number of known vulnerabilities, the results were obtained by manual inspection of the static code analysis tools’ outputs. The evaluation based on case studies allowed us to gauge the ability of static code analysis to detect security vulnerabilities in more complex settings.

The main contributions of this paper are as follows:

  • The experimental evaluation was based on the Juliet test suite, a benchmark for assessing the effectiveness of static code analyzers and other software assurance tools. Previous evaluations based on Juliet either did not report quantitative results [13], [14] or used a very small sample of test cases related to vulnerabilities in C code only [15].

  • Our study reports several performance metrics – accuracy, recall, probability of false alarm, and G-score – for individual CWEs, as well as across all considered CWEs. We used formal statistical testing to compare the tools in terms of these performance metrics and to determine whether any significant differences exist (an illustrative sketch follows this list). None of the related works included statistical testing of the results.

  • In addition to the experimental approach, three widely-used open source programs were used as case studies. By combining experimentation with case studies, we were able to get sound experimental results supported by statistical tests and verify them in realistic settings.

Main empirical observations include:

  • None of the selected tools was able to detect all vulnerabilities. Specifically, out of the 22 C/C++ CWEs, six CWEs (i.e., 27%) were not detected by any of the three tools, seven CWEs (i.e., 32%) were detected by a single tool or a combination of two tools, and only nine CWEs (around 41%) were detected by all three tools. The results obtained when running the Java test cases were similar. Out of the nineteen CWEs, two CWEs (i.e., around 11%) were not detected by any tool, thirteen CWEs (i.e., 68%) were detected by a single tool or a combination of two tools, and only four CWEs (around 21%) were detected by all three tools. Note that ‘detect’ in this context does not mean detecting 100% of ‘bad’ functions for that specific CWE. Rather, it means correctly classifying at least one bad function.

  • The selected static code analysis tools did not show a statistically significant difference in their ability to detect security vulnerabilities for either C/C++ or Java. In addition, the mean, median, and overall recall values for all tools were close to or below 50%, which indicates performance comparable to or worse than random guessing.

  • For C/C++ vulnerabilities, one of the tools had better performance in terms of probability of false alarm and accuracy than the other two tools. No significant difference existed for Java vulnerabilities.

  • No statistically significant difference existed in the values of the G-score (i.e., the harmonic mean of recall and 1 − probability of false alarm) for either C/C++ or Java vulnerabilities. (The G-score combines in a single measure the tools’ effectiveness in detecting security vulnerabilities with their ability to discriminate vulnerabilities from non-flawed code constructs; a worked example follows this list.)

  • The experimental results related to tools’ ability to detect security vulnerabilities were confirmed on three open source applications used as case studies.
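
To make the G-score definition above concrete, the short sketch below computes it for one hypothetical tool; the recall and probability-of-false-alarm values are illustrative only and do not come from the paper.

```python
# Worked example of the G-score: the harmonic mean of recall and
# (1 - probability of false alarm). Input values are hypothetical.
recall = 0.45   # probability of detection
pf = 0.20       # probability of false alarm
g_score = 2 * recall * (1 - pf) / (recall + (1 - pf))
print(f"G-score = {g_score:.3f}")   # 2*0.45*0.80 / 1.25 = 0.576
```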

The rest of the paper is organized as follows. Related work is presented in Section 2, followed by the background description of the structure and organization of the Common Weakness Enumeration (CWE) and the Juliet test suite in Section 3. The design of the experimental study, its execution, and the analysis per individual CWE and across all CWEs, including the results of statistical tests, are given in Section 4. Section 5 presents the findings based on the three open source case studies. The threats to validity are presented in Section 6, followed by the discussion of the results in Section 7 and concluding remarks in Section 8.

Section snippets

Related work

Despite the widespread use of static code analysis, only a few public evaluation efforts of static code analysis tools have been undertaken, and even fewer with a focus on detection of security vulnerabilities.

We start with a description of the related works that used the Juliet benchmark as an input for evaluation of static code analysis tools [13], [14], [15], [16], [17]. These works are the closest to ours. The Software Assurance Metrics and Tool Evaluation (SAMATE) project, sponsored by the

Background on the Common Weakness Enumeration and Juliet test suite

In this section we provide background information on the Common Weakness Enumeration [9] taxonomy regarding its design and hierarchical structure. We then describe the Juliet test suite [26] and briefly cover the structure of its test cases. (More details on the Juliet test suite are given in Appendix A.)

The Common Weakness Enumeration (CWE) taxonomy aims at creating a catalog of software weaknesses and vulnerabilities. It is maintained by the MITRE Corporation with support from the Department

Experimental evaluation using the Juliet benchmark test suite

The main motivations for using an experimental approach with a benchmark test suite as input include (1) conducting the analysis and evaluation in a sound context, (2) being able to automatically evaluate a large number of well-designed test cases that cover a wide range of vulnerabilities, and (3) being able, in addition to true positives and false positives, to determine the false negatives (i.e., vulnerabilities not detected by the tools), which allowed us to evaluate the probability of vulnerability

Analysis based on real open source programs

Unlike the formal experiment presented in Section 4, in which the types of CWEs were controlled by using carefully produced bad and good test cases and there was replication (i.e., multiple test cases exist for each CWE), the case studies look at what happens in real software applications (i.e., they can be considered ‘research-in-the-typical’ [29]). In this context, we deal with actual vulnerabilities found in three real software applications and therefore there is no control and

Threats to validity

Empirical studies are subject to threats to validity, which we discuss in this section in terms of construct, internal, conclusion, and external threats.

Construct validity is concerned with ensuring that we are actually testing in practice what we meant to test. The first threat to construct validity is related to the choice of static code analysis tools. The number of static code analysis tools is steadily increasing. The tools we selected may not be representative of all other available tools

Discussion

The techniques for static code analysis have experienced rapid development in the last decade and many tools that automate the process have been developed. The implications of the empirical investigations described in this paper for software developers lie primarily in supporting better understanding of the strengths and limitations of the static code analysis and the level of assurance that it can provide. The goal of this section is to summarize the main findings, to compare and contrast them

Conclusion

This paper is focused on evaluation of the ability of static code analyzers to detect security vulnerabilities. For this purpose we used an experimental approach based on the Juliet benchmark test suite, which allowed us (1) to automatically evaluate the tools’ performance on a large number of test cases that cover a wide variety of C/C++ and Java vulnerabilities, (2) to quantitatively assess the tools’ performance, both per CWE and overall across all CWEs, and (3) to conduct the analysis and

Acknowledgments

This work was funded in part by the NASA Independent Verification and Validation Facility in Fairmont, WV, through a grant managed by TASC Inc. The authors thank Keenan Bowens, Travis Dawson, Roger Harris, Joelle Loretta, Jerry Sims, and Christopher Williams for their input and feedback. We also thank the anonymous reviewers for their comments and suggestions.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect

References (29)

  • Common Weakness Enumeration, https://cwe.mitre.org/ (accessed 21.12.14).
  • B. Chess et al., Secure Programming with Static Analysis, 2007.
  • H. Do et al., Supporting controlled experimentation with testing techniques: an infrastructure and its potential impact, Empirical Software Eng., 2005.
  • T. Boland et al., The Juliet 1.1 C/C++ and Java test suite, IEEE Comput., 2012.
This work was done while Andrei Perhinschi was affiliated with West Virginia University.