On the capability of static code analysis to detect security vulnerabilities

https://doi.org/10.1016/j.infsof.2015.08.002

Abstract

Context: Static analysis of source code is a scalable method for discovery of software faults and security vulnerabilities. Techniques for static code analysis have matured in the last decade and many tools have been developed to support automatic detection.

Objective: This research work is focused on empirical evaluation of the ability of static code analysis tools to detect security vulnerabilities, with the objective of better understanding their strengths and shortcomings.

Method: We conducted an experiment in which the benchmarking test suite Juliet was used to evaluate three widely used commercial tools for static code analysis. Unique characteristics of this work are the use of a design-of-experiments approach to conduct the analysis and evaluation, and the statistical testing of the results. In addition to the controlled experiment, the empirical evaluation included case studies based on three open source programs.

Results: Our experiment showed that 27% of C/C++ vulnerabilities and 11% of Java vulnerabilities were missed by all three tools. Some vulnerabilities were detected by only one tool or by a combination of two tools; 41% of C/C++ and 21% of Java vulnerabilities were detected by all three tools. More importantly, the static code analysis tools did not show a statistically significant difference in their ability to detect security vulnerabilities for either C/C++ or Java. Interestingly, for all tools the median and mean of the per-CWE recall values, as well as the overall recall across all CWEs, were close to or below 50%, which indicates performance comparable to or worse than random guessing. While for C/C++ vulnerabilities one of the tools performed better in terms of probability of false alarm than the other two tools, there was no statistically significant difference among the tools’ probability of false alarm for the Java test cases.

Conclusions: Despite recent advances in methods for static code analysis, the state-of-the-art tools are not very effective in detecting security vulnerabilities.

Introduction

Today’s economy is heavily reliant on computer systems and networks, and many sectors, including finance, e-commerce, supply chain, transportation, energy, and health care, cannot function without them. The growth of the online commercial environment and associated transactions, and the increasing volume of sensitive information accessible online, have fueled the growth of cyber attacks by organized criminal elements and other adversaries [1]. According to the 2014 report by the Ponemon Institute, the mean annualized cost of cyber crime for 257 benchmarked organizations was $7.6 million per year, with an average of 31 days to contain a cyber attack [2].

Deficiencies in software quality are among the leading reasons behind security vulnerabilities. A vulnerability is defined as a property of system security requirements, design, implementation, or operation that could be accidentally triggered or intentionally exploited and result in a security failure [3]. Basically, if a security failure has been experienced, there must have been a vulnerability. Based on estimates made by the National Institute of Standards and Technology (NIST), the US economy loses $60 billion annually in costs associated with developing and distributing patches that fix software faults and vulnerabilities, as well as costs from lost productivity due to computer malware and other problems caused by software faults [4].

Therefore, it is becoming imperative to account for security when software systems are designed and developed, and to extend verification and validation capabilities to cover information assurance and cybersecurity concerns. Anecdotal evidence [5] and prior empirical studies [6], [7] indicated the need to use a variety of vulnerability prevention and discovery techniques throughout the software development cycle. One of these vulnerability discovery techniques is static analysis of source code, which provides a scalable way to conduct security code review; it can be used early in the life cycle, does not require the system to be executable, and can be applied to parts of the overall code base. Tools for static analysis have rapidly matured in the last decade; they have evolved from simple lexical analysis to much more complex techniques. In general, however, static analysis problems are undecidable [8] (i.e., it is impossible to construct an algorithm that always leads to a correct answer in every case). Therefore, static code analysis tools do not detect all vulnerabilities in source code (i.e., they produce false negatives) and are prone to reporting findings that, upon closer examination, turn out not to be security vulnerabilities (i.e., false positives). To be of practical use, a static code analysis tool should find as many vulnerabilities as possible, ideally all of them, with a minimum number of false positives, ideally none.

This paper is focused on empirical evaluation of static code analysis tools’ ability to detect security vulnerabilities, with a goal to better understand their strengths and shortcomings. For this purpose we chose three state-of-the-art, commercial static code analysis tools, denoted throughout the paper as tools A, B, and C. The criteria used to select the tools were that they: (1) are widely used, (2) specifically identify security vulnerabilities (e.g., using the Common Weakness Enumeration (CWE) [9]) and support detection of a significant number of different types of vulnerabilities, (3) support the C, C++, and Java languages, and (4) are capable of analyzing large software applications, i.e., scale well. An additional consideration in the selection process was to choose one tool from each of the three main classes of static code analysis tools [10] (given here in no particular order): ‘program verification and property checking’, ‘bug finding’, and ‘security review’.

With respect to the vulnerabilities included in the evaluation, as in works focused on software fault detection [11], synthetic vulnerabilities can be provided in large numbers, which allows more data to be gathered than otherwise possible, but likely with less external validity. On the other hand, naturally occurring vulnerabilities typically cannot be found in large numbers, but they represent actual events. Since either approach has its own advantages and disadvantages, we decided to use both.

The first evaluation approach is based on a controlled experiment using the benchmark test suite Juliet, which was originally developed by the Center for Assured Software at the National Security Agency (NSA) [12] and is publicly available. The Juliet test suite consists of many sets of synthetically generated test cases; each set covers a single kind of flaw documented by the Common Weakness Enumeration (CWE) [9]. Specifically, we used the largest subset of the Juliet test suite claimed to be detectable by all three tools, which consisted of 22 CWEs for C/C++ and 19 CWEs for Java, with 21,265 and 7516 test cases, respectively. Testing static code analysis tools with this benchmark test suite allowed us: to cover a reasonably large number of vulnerabilities of many types; to automate the evaluation and the computation of the tools’ performance metrics, such as accuracy, recall (i.e., probability of detection), and probability of false alarm; and to run statistical tests.
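
To make the reported metrics concrete, the minimal sketch below shows how accuracy, recall, and probability of false alarm can be computed from per-CWE confusion-matrix counts. It is not the authors’ evaluation harness; the class name, helper functions, and counts are hypothetical placeholders chosen only for illustration.

```python
# Minimal sketch (hypothetical, not the paper's harness): per-CWE metrics
# computed from confusion-matrix counts of a tool's results on Juliet-style
# "bad" (flawed) and "good" (non-flawed) functions.
from dataclasses import dataclass


@dataclass
class CweCounts:
    tp: int  # flawed ("bad") functions correctly flagged by the tool
    fn: int  # flawed functions the tool missed
    fp: int  # non-flawed ("good") functions incorrectly flagged
    tn: int  # non-flawed functions correctly left unflagged


def recall(c: CweCounts) -> float:
    """Recall (probability of detection): TP / (TP + FN)."""
    return c.tp / (c.tp + c.fn)


def prob_false_alarm(c: CweCounts) -> float:
    """Probability of false alarm: FP / (FP + TN)."""
    return c.fp / (c.fp + c.tn)


def accuracy(c: CweCounts) -> float:
    """Fraction of all functions classified correctly: (TP + TN) / total."""
    total = c.tp + c.fn + c.fp + c.tn
    return (c.tp + c.tn) / total


# Example with made-up counts for one CWE:
example = CweCounts(tp=120, fn=180, fp=30, tn=270)
print(recall(example), prob_false_alarm(example), accuracy(example))
```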

In addition to the experimental approach, our empirical evaluation includes three case studies based on open source programs, two of which were implemented in C and one in Java. Each program has a known set of vulnerabilities that allows for quantitative analysis of the tools’ ability to detect security vulnerabilities. For this part of the study, because of the relatively small number of known vulnerabilities, the results were obtained by manual inspection of the static code analysis tools’ outputs. The evaluation based on case studies allowed us to gauge the ability of static code analysis to detect security vulnerabilities in more complex settings.

The main contributions of this paper are as follows:

  • The experimental evaluation was based on the Juliet test suite, a benchmark for assessing the effectiveness of static code analyzers and other software assurance tools. Previous evaluations based on Juliet either did not report quantitative results [13], [14] or used a very small sample of test cases related to vulnerabilities in C code only [15].

  • Our study reports several performance metrics – accuracy, recall, probability of false alarm, and G-score – for individual CWEs, as well as across all considered CWEs. We used formal statistical testing to compare the tools in terms of these performance metrics and to determine whether any significant differences exist (an illustrative sketch follows this list). None of the related works included statistical testing of the results.

  • In addition to the experimental approach, three widely-used open source programs were used as case studies. By combining experimentation with case studies, we were able to get sound experimental results supported by statistical tests and verify them in realistic settings.

Main empirical observations include:

  • None of the selected tools was able to detect all vulnerabilities. Specifically, out of the 22 C/C++ CWEs, six CWEs (i.e., 27%) were not detected by any of the three tools, seven CWEs (i.e., 32%) were detected by a single tool or a combination of two tools, and only nine CWEs (around 41%) were detected by all three tools. The results obtained when running the Java test cases were similar. Out of the nineteen CWEs, two CWEs (i.e., around 11%) were not detected by any tool, thirteen CWEs (i.e., 68%) were detected by a single tool or a combination of two tools, and only four CWEs (around 21%) were detected by all three tools. Note that ‘detect’ in this context does not mean detecting 100% of ‘bad’ functions for that specific CWE. Rather, it means correctly classifying at least one bad function.

  • The selected static code analysis tools did not show a statistically significant difference in their ability to detect security vulnerabilities for either C/C++ or Java. In addition, the mean, median, and overall recall values for all tools were close to or below 50%, which indicates performance comparable to or worse than random guessing.

  • For C/C++ vulnerabilities, one of the tools had better performance in terms of probability of false alarm and accuracy than the other two tools. No significant difference existed for Java vulnerabilities.

  • No statistically significant difference existed in the values of the G-score (i.e., the harmonic mean of recall and 1 − probability of false alarm) for either C/C++ or Java vulnerabilities. (The G-score combines in a single measure the tools’ effectiveness in detecting security vulnerabilities with their ability to discriminate vulnerabilities from non-flawed code constructs; a worked example follows this list.)

  • The experimental results related to tools’ ability to detect security vulnerabilities were confirmed on three open source applications used as case studies.
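
To make the G-score definition above concrete, the short sketch below computes it for one hypothetical tool; the recall and probability-of-false-alarm values are illustrative only and do not come from the paper.

```python
# Worked example of the G-score: the harmonic mean of recall and
# (1 - probability of false alarm). Input values are hypothetical.
recall = 0.45   # probability of detection
pf = 0.20       # probability of false alarm
g_score = 2 * recall * (1 - pf) / (recall + (1 - pf))
print(f"G-score = {g_score:.3f}")   # 2*0.45*0.80 / 1.25 = 0.576
```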

The rest of the paper is organized as follows. Related work is presented in Section 2, followed by the background description of the structure and organization of the Common Weakness Enumeration (CWE) and the Juliet test suite in Section 3. The design of the experimental study, its execution, and the analysis per individual CWE and across all CWEs, including the results of statistical tests, are given in Section 4. Section 5 presents the findings based on the three open source case studies. The threats to validity are presented in Section 6, followed by the discussion of the results in Section 7 and concluding remarks in Section 8.

Section snippets

Related work

Despite the widespread use of static code analysis, only a few public evaluation efforts of static code analysis tools have been undertaken, and even fewer with a focus on detection of security vulnerabilities.

We start with a description of the related works that used the Juliet benchmark as an input for evaluation of static code analysis tools [13], [14], [15], [16], [17]. These works are the closest to ours. The Software Assurance Metrics and Tool Evaluation (SAMATE) project, sponsored by the

Background on the Common Weakness Enumeration and Juliet test suite

In this section we provide background information on the Common Weakness Enumeration [9] taxonomy regarding its design and hierarchical structure. We then describe the Juliet test suite [26] and briefly cover the structure of its test cases. (More details on the Juliet test suite are given in Appendix A.)

The Common Weakness Enumeration (CWE) taxonomy aims at creating a catalog of software weaknesses and vulnerabilities. It is maintained by the MITRE Corporation with support from the Department

Experimental evaluation using the Juliet benchmark test suite

The main motivations for using an experimental approach with a benchmark test suite as input include (1) conducting the analysis and evaluation in a sound context, (2) being able to automatically evaluate a large number of well-designed test cases that cover a wide range of vulnerabilities, and (3) being able, in addition to true positives and false positives, to determine the false negatives (i.e., vulnerabilities not detected by the tools), which allowed us to evaluate the probability of vulnerability

Analysis based on real open source programs

Unlike the formal experiment presented in Section 4, in which the types of CWEs were controlled by using carefully produced bad and good test cases and there was replication (i.e., multiple test cases exist for each CWE), the case studies look at what happens in real software applications (i.e., they can be considered ‘research-in-the-typical’ [29]). In this context, we deal with actual vulnerabilities found in three real software applications and therefore there is no control and

Threats to validity

Empirical studies are subject to threats to validity, which we discuss in this section in terms of construct, internal, conclusion, and external threats.

Construct validity is concerned with ensuring that we are actually testing in practice what we meant to test. The first threat to construct validity is related to the choice of static code analysis tools. The number of static code analysis tools is steadily increasing. The tools we selected may not be representative of all other available tools

Discussion

The techniques for static code analysis have experienced rapid development in the last decade and many tools that automate the process have been developed. The implications of the empirical investigations described in this paper for software developers lie primarily in supporting better understanding of the strengths and limitations of the static code analysis and the level of assurance that it can provide. The goal of this section is to summarize the main findings, to compare and contrast them

Conclusion

This paper is focused on evaluation of the ability of static code analyzers to detect security vulnerabilities. For this purpose we used an experimental approach based on the Juliet benchmark test suite, which allowed us (1) to automatically evaluate the tools’ performance on a large number of test cases that cover a wide variety of C/C++ and Java vulnerabilities, (2) to quantitatively assess the tools’ performance, both per CWE and overall across all CWEs, and (3) to conduct the analysis and

Acknowledgments

This work was funded in part by the NASA Independent Verification and Validation Facility in Fairmont, WV, through a grant managed by TASC Inc. The authors thank Keenan Bowens, Travis Dawson, Roger Harris, Joelle Loretta, Jerry Sims, and Christopher Williams for their input and feedback. We also thank the anonymous reviewers for their comments and suggestions.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect

References (29)

  • Common Weakness Enumeration, https://cwe.mitre.org/ (accessed 21.12.14).
  • B. Chess et al., Secure Programming with Static Analysis, 2007.
  • H. Do et al., Supporting controlled experimentation with testing techniques: an infrastructure and its potential impact, Empirical Software Eng., 2005.
  • T. Boland et al., The Juliet 1.1 C/C++ and Java test suite, IEEE Comput., 2012.
This work was done while Andrei Perhinschi was affiliated with West Virginia University.