A Critical Comparison on Six Static Analysis Tools: Detection, Agreement, and Precision

Background. Developers use Automated Static Analysis Tools (ASATs) to control for potential quality issues in source code, including defects and technical debt. Tool vendors have devised a large number of tools, which makes it harder for practitioners to select the most suitable one for their needs. To better support developers, researchers have been conducting several studies on ASATs to improve the understanding of their actual capabilities. Aims. Despite the work done so far, there is still a lack of knowledge regarding (1) which source quality problems can actually be detected by static analysis tool warnings, (2) what is their agreement, and (3) what is the precision of their recommendations. We aim at bridging this gap by proposing a large-scale comparison of six popular static analysis tools for Java projects: Better Code Hub, CheckStyle, Coverity Scan, FindBugs, PMD, and SonarQube. Method. We analyze 47 Java projects and derive a taxonomy of warnings raised by 6 state-of-the-practice ASATs. To assess their agreement, we compared them by manually analyzing, at line level, whether they identify the same issues. Finally, we manually evaluate the precision of the tools. Results. The key results report a comprehensive taxonomy of ASAT warnings, show little to no agreement among the tools, and reveal a low degree of precision. Conclusions. We provide a taxonomy that can be useful to researchers, practitioners, and tool vendors to map the current capabilities of the tools. Furthermore, our study provides the first overview of the agreement among different tools as well as an extensive analysis of their precision.

Keywords Static analysis tools · Software Quality · Empirical Study.

Introduction
Automated Static Analysis Tools (ASATs) are instruments that analyze characteristics of the source code without executing it, so that they can discover potential source code quality issues [14]. These tools are getting more popular as they are becoming easier to use, especially in continuous integration pipelines [55], and there is a wide range to choose from [52]. However, as the number of available tools grows, it becomes harder for practitioners to choose the tool (or combination thereof) that is most suitable for their needs [48].
To help practitioners with this selection process, researchers have been conducting empirical studies to compare the capabilities of existing ASATs [29,54]. Most of these investigations have focused on (1) the features provided by the tools, e.g., which maintainability dimensions can be tracked by current ASATs, (2) comparing specific aspects considered by the tools, such as security [4,31] or concurrency defects [1], and (3) assessing the number of false positives given by the available static analysis tools [20].
Recognizing the effort spent by the research community, which led to notable advances in the way tool vendors develop ASATs, we herein notice that our knowledge on the capabilities of the existing ASATs is still limited. More specifically, in the context of our research we point out that three specific aspects are under-investigated: (1) which source quality problems can actually be detected by static analysis tools, (2) what is the agreement among different tools with respect to source code marked as potentially problematic, and (3) what is the precision with which a large variety of the available tools provide recommendations. An improved knowledge of these aspects would not only allow practitioners to make informed decisions when selecting the tool(s) to use, but would also allow researchers and tool vendors to enhance the tools and improve the level of support provided to developers.
In this paper, we aim to address this gap of knowledge by designing and conducting a large-scale empirical investigation into the detection capabilities of six popular state-of-the-practice ASATs, namely SonarQube, Better Code Hub, Coverity Scan, FindBugs, PMD, and CheckStyle. 1 Specifically, we run the considered tools against a corpus of 47 projects from the Qualitas Corpus dataset and (1) depict which source quality problems can actually be detected by the tools, (2) compute the agreement among the recommendations given by them at line level, and (3) manually compute the precision of the tools.

1 ASATs verify code compliance with a specific set of warnings that, if violated, can introduce an issue into the code. Such an issue can be regarded as a "source code quality issue": as such, in the remainder of the paper we use this term when referring to the output of the considered tools.
The key results of the study report a taxonomy of static analysis tool warnings, but also show that, among the considered tools, SonarQube is the one able to detect most of the quality issues that can be detected by the other ASATs. However, when considering the specific quality issues detected, there is little to no agreement among the tools, indicating that different tools are able to identify different forms of quality problems. Finally, the precision of the considered tools ranges between 18% and 86%, meaning that the practical usefulness of some tools is seriously threatened by the presence of false positives; this result corroborates and extends previous findings [20] on a larger scale and considering a broader set of tools.
Based on our findings, our paper finally discusses and distills a number of lessons learned, limitations, and open challenges that should be considered and/or addressed by both the research community and tool vendors.
Structure of the paper. The study setting is described in Section 2. The results are presented in Section 3 and discussed in Section 4. Section 5 identifies the threats to the validity of our study, while Section 6 presents related works on static analysis tools. Finally, in Section 7 we draw conclusions and provide an outlook on our future research agenda.

Empirical Study Design
We designed our empirical study as a case study based on the guidelines defined by Runeson and Höst [39]. The following sections describe the goals and specific research questions driving our empirical study as well as the data collection and analysis procedures.

Goal and Research Questions
The goal of our empirical study is to compare state-of-the-practice ASATs with the aim of assessing their capabilities when detecting source code quality issues with respect to (1) the types of problems they can actually identify; (2) the agreement among them, and (3) their precision. Our ultimate purpose is to enlarge the knowledge available on the identification of source code quality issues with ASATs from the perspective of both researchers and tool vendors. The former are interested in identifying areas where the state-of-the-art tools can be improved, thus setting up future research challenges, while the latter are instead concerned with assessing their current capabilities and possibly the limitations that should be addressed in the future to better support developers.
More specifically, our goal can be structured around three main research questions (RQ s ). As a first step, we aimed at understanding what kind of issues different tool warnings detect when run on source code. An improved analysis of this aspect may extend our knowledge on whether and how various types of quality issues are identified by the existing tools. Hence, we asked: RQ 1 . What source code quality issues can be detected by Automated Static Analysis Tools?
Once we had characterized the tools with respect to what they are able to identify, we proceeded with a finer-grained investigation aimed at measuring the extent to which ASATs agree with each other. Regarding this aspect, further investigation would not only benefit tool vendors who want to better understand the capabilities of their tools compared to others, but would also benefit practitioners who would like to know whether it is worth using multiple tools within their code base. Moreover, we were interested in how the issues from different tools overlap with each other. We wanted to determine the type and number of overlapping issues, but also whether the overlapping is between all tools or just a subset. RQ 2 . What is the agreement among different Automated Static Analysis Tools when detecting source code quality issues?
Finally, we focused on investigating the potential usefulness of the tools in practice. While they could output numerous warnings that alert developers of the presence of potential quality problems, it is still possible that some of these warnings might represent false positive instances, i.e., that they wrongly recommend source code entities to be refactored/investigated. Previous studies have highlighted the presence of false positives as one of the main problems of the tools currently available [20]; our study aims at corroborating and extending the available findings, as further remarked in Section 2.4. All in all, our goal was to provide an updated view on this matter and understand whether, and to what extent, this problem has been mitigated in recent years or whether there is still room for improvement. RQ 3 . What is the precision of Automated Static Analysis Tools when detecting source code quality issues?

Context of the Study
The context of our study consisted of software systems and tools. In the following, we describe our selection.

Project Selection
We selected projects from the Qualitas Corpus collection of software systems (Release 20130901), using the compiled version of the Qualitas Corpus [47].
The dataset contains 112 Java systems with 754 versions, more than 18 million LOCs, 16,000 packages, and 200,000 classes analyzed. Moreover, the dataset includes projects from different contexts such as IDEs, databases, and programming language compilers. More information is available in [47]. In our study, we considered the "r" release of each of the 112 available systems. Since two of the automated static analysis tools considered, i.e., Coverity Scan and Better Code Hub, require permissions in the GitHub project or the upload of a configuration file, we privately uploaded all 112 projects to our GitHub account in order to enable the analysis 2 .

ASATs Selection
We selected the six tools described below. The choice of focusing on those specific tools was driven by the familiarity of the authors with them: this allowed us to (1) run them correctly (e.g., without configuration errors) and (2) better analyze their results, for instance by providing qualitative insights able to explain the reasons behind the achieved results. The analysis of other tools is already part of our future research agenda.
SonarQube 3 is one of the most popular open-source static code analysis tools for measuring code quality issues. It is provided as a service by the sonarcloud.io platform or it can be downloaded and executed on a private server. SonarQube computes several metrics such as number of lines of code and code complexity, and verifies code compliance with a specific set of "coding warnings" defined for most common development languages. If the analyzed source code violates a coding warning, the tool reports an "issue". The time needed to remove these issues is called remediation effort.
SonarQube includes reliability, maintainability, and security warnings. Reliability warnings, also named Bugs, correspond to quality issues that "represent something wrong in the code" and that will soon be reflected in a bug. Code smells are considered "maintainability-related issues" in the code that decrease code readability and code modifiability. It is important to note that the term "code smells" adopted in SonarQube does not refer to the commonly known code smells defined by Fowler et al. [18], but to a different set of warnings.
Coverity Scan 4 is another common open-source static analysis tool. The code build is analyzed by submitting the build to the server through the public API. The tool detects defects and vulnerabilities that are grouped by categories such as: resource leaks, dereferences of NULL pointers, incorrect usage of APIs, use of uninitialized data, memory corruptions, buffer overruns, control flow issues, error handling issues, incorrect expressions, concurrency issues, insecure data handling, unsafe use of signed values, and use of resources that have been freed 5 . For each of these categories, there are various issue types that explain more details about the defect. In addition to issue types, issues are grouped based on impact: low, medium, and high. The static analysis applied by Coverity Scan is based on the examination of the source code by determining all possible paths the program may take. This gives a better understanding of the control and data flow of the code 6 .
Better Code Hub 7 is also a commonly used static analysis tool that assesses code quality. The analysis is done through the website's API, which analyzes the repository from GitHub. The default configuration file can be modified for customization purposes. Code quality is generally measured based on structure, organization, modifiability, and comprehensibility. This is done by assessing the code against ten guidelines: write short units of code, write simple units of code, write code once, keep unit interfaces small, separate concerns in modules, couple architecture components loosely, keep architecture components balanced, keep your code base small, automate tests, and write clean code. Out of the ten guidelines, eight are grouped by severity: medium, high, and very high. Compliance is rated on a scale from 1-10 based on the results 8 .
Better Code Hub static analysis is based on the analysis of the source code against heuristics and commonly adopted coding conventions. This gives a holistic view of the health of the code from a macroscopic perspective.
Checkstyle 9 is an open-source developer tool that evaluates Java code quality. The analysis is done either by using it as a side feature in Ant or as a command line tool. Checkstyle assesses code according to a certain coding standard, which is configured according to a set of checks. Checkstyle has two sets of style configurations for standard checks: Google Java Style 10 and Sun Java Style 11 . In addition to the standard checks provided by Checkstyle, customized configuration files are also possible according to user preference. 12 These checks are classified under 14 different categories: annotations, block checks, class design, coding, headers, imports, javadoc comments, metrics, miscellaneous, modifiers, naming conventions, regexp, size violations, and whitespace. Moreover, violations of the checks are grouped under two severity levels: error and warning 13 , with the first reporting actual problems and the second possible issues to be verified.
PMD 18 is a static analysis tool mainly used to evaluate Java and Apex, even though it can also be applied to six other programming languages. The analysis is done through the command line using the binary distributions. PMD uses a set of warnings to assess code quality according to its main focus areas: unused variables, empty catch blocks, unnecessary object creation, and more. There are a total of 33 different warning set configurations 19 for Java projects. The warning sets can also be customized according to user preference 20 . These warnings are classified under 8 different categories: best practices, code style, design, documentation, error prone, multithreading, performance, and security. Moreover, violations of the warnings are measured on a priority scale from 1-5, with 1 being the most severe and 5 being the least 21 .

Study Setup and Data Collection
This section describes the study setup used to collect the data from each tool and the data collection process. We analyzed a single snapshot of each project, considering the release available in the dataset for each of the 112 systems.
SonarQube. We first installed SonarQube LTS 6.7.7 on a private server with 128 GB RAM and 4 processors. However, because of the limitations of the open-source version of SonarQube, we were allowed to use only one core; therefore, additional cores would not have been beneficial for our purposes. We decided to adopt the LTS (Long-Term Support) version since it is the most stable and best-supported version of the tool.
We executed SonarQube on each project using SDK 8 (Oracle) and the sonar-scanner package version 4.2. Each project was analyzed using the original sources and the binaries provided in the dataset. Moreover, we configured the analysis (in the file sonar-project.properties) reporting information regarding the project key, project name, project version, source directories, test directories, binary directories, and library directories. It is important to note that the analysis performed using the original binaries reduced the issues of compilation errors and missing libraries. Moreover, it also helped to reduce issues related to false positives 22 . Once the configuration file was set up, the analysis began by running sonar-scanner in the root directory of the project: $ sonar-scanner. After all the projects had been analyzed, we extracted the data related to the projects using the "SonarQube Exporter" tool 23 , which makes it possible to extract SonarQube project measures and issues in CSV format. An extraction directory was created under the target directory. The CSV files were then extracted from the server by running the following command: $ java -jar csv-from-sonar-0.4.1-jar-with-dependencies.jar. All projects starting with "QC:" were extracted from the server by clicking "Save issues" and "Save measures". This exported CSV files into the extraction directory for each project.
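A minimal sketch of the configuration file described above is shown below; the property names are the standard SonarQube analysis properties, while the project identifiers and paths are hypothetical placeholders:

```properties
# Sketch of a sonar-project.properties file; project key/name and
# directory paths are hypothetical.
sonar.projectKey=QC:myproject
sonar.projectName=myproject
sonar.projectVersion=1.0
sonar.sources=src
sonar.tests=test
sonar.java.binaries=bin
sonar.java.libraries=lib/*.jar
```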
Coverity Scan. The projects were registered in Coverity Scan (version 2017.07) by linking the GitHub account and adding all the projects to the profile. Coverity Scan was set up by downloading the tarball file from https://scan.coverity.com/download and adding the bin directory of the installation folder to the path in the .bash_profile. Afterwards the building process began, which depended on the type of project in question. Coverity Scan requires the sources to be compiled with a special command; therefore, we had to compile them instead of using the original binaries. The following commands were used in the directory of the project where the build.xml or pom.xml file resides.

For building and deploying Java projects with Ant (build.xml):
- $ cov-build --dir cov-int ant
- $ cov-build --dir cov-int ant jar

For building projects with Maven (pom.xml), using the appropriate Maven version according to the documentation:
- $ cov-build --dir cov-int mvn install
- $ cov-build --dir cov-int mvn clean install

The projects had various Java versions, so the appropriate Java version was installed according to the documentation (if available) for building. After building the project, if over 85% of tests were successfully executed, a tar archive of the cov-int build folder was created by running: $ tar czvf myproject.tgz cov-int. Once the tar archive was created, the file was submitted through the project dashboard at https://scan.coverity.com/projects/myproject via "Submit Build". The analysis was then performed, and the results were displayed on the dashboard. The data regarding the analysis was extracted from the dashboard under "View Defects", where the detailed report of the analysis results was located. Under Outstanding Issues -> Export CSV, the CSV file containing all issues of the project was downloaded for each project.
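The submission gate described above (archive the build only if over 85% of tests passed) can be sketched as a small shell step; the test counts are hypothetical and the parsing of the build log is left out:

```shell
#!/bin/bash
# Gate the Coverity Scan submission on the test pass rate (>85%).
# The counts are hypothetical; in practice they would be parsed
# from the Ant/Maven build log.
passed=412
total=460
mkdir -p cov-int                   # stand-in for the cov-build output
rate=$(( passed * 100 / total ))   # integer percentage
if [ "$rate" -gt 85 ]; then
  tar czvf myproject.tgz cov-int   # archive for "Submit Build"
  echo "archive created (pass rate ${rate}%)"
else
  echo "skipping submission (pass rate ${rate}%)"
fi
```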
Better Code Hub. The .bettercodehub.yml files were configured by defining the component depth, languages, and exclusions. The exclusions were defined so that they would exclude all directories that were not source code, since Better Code Hub only analyzes source code. The analysis was conducted between January 2019 and February 2019, so the Better Code Hub version hosted during that time was used.
Once the configuration file had been created, it was saved in the root directory of the project. The local changes to the project were added, committed, and pushed to GitHub. Afterwards, the project analysis started from Better Code Hub's dashboard, which is connected to the GitHub account https://bettercodehub.com/repositories. The analysis started by clicking the "Analyze" button on the webpage. The results were then shown on the webpage. Once all the projects had been analyzed, the data was sent in one CSV file. All projects containing more than 100,000 lines of code were not analyzed, due to Better Code Hub's limit.
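As a sketch, the configuration file used in this step looked roughly as follows; the keys reflect the .bettercodehub.yml schema as we used it and should be double-checked against the Better Code Hub documentation, and the excluded paths are hypothetical:

```yaml
# Sketch of a .bettercodehub.yml; excluded paths are hypothetical.
component_depth: 1
languages:
  - java
exclude:
  - /doc/.*
  - /lib/.*   # exclude everything that is not source code
```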
Checkstyle. The JAR file for the Checkstyle analysis was downloaded directly from Checkstyle's website 24 in order to run the analysis from the command line. The executable JAR file used in this case was checkstyle-8.30-all.jar. In addition to the JAR executable, Checkstyle offers two different types of warning sets for the analysis 24 .
For each of the warning sets, the configuration file was downloaded directly from Checkstyle's website 25 . In order to start the analysis, the file checkstyle-8.30-all.jar and the configuration file in question were saved in the directory where all the cloned repositories from the Qualitas Corpus resided. To speed up the analysis, a bash script was written to execute the analysis for each project in one go. This can be seen in Listing 1, where RULESET represents the warning set used for the analysis, "$in" represents the project name, which is imported from projectList.txt, and "$in" CS RULESET.xml represents the export file name of the analysis results in XML format. The text file projectList.txt consists of all the project names, in order to execute the analysis for all projects in one go. An example of how the projects were analyzed with Checkstyle according to the warning set Google Java Style 26 is shown in Listing 2.
Listing 2 Example of Checkstyle bash script for Google Java Style configuration.
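A dry-run sketch of such a script is shown below; it echoes each command instead of invoking the JAR (drop the echo to actually run the analysis), and the project names are hypothetical:

```shell
#!/bin/bash
# Dry-run sketch of the Checkstyle batch script with the Google Java
# Style configuration. 'echo' prints each command instead of running
# it; the project names are hypothetical.
printf 'ant\nweka\n' > projectList.txt
CMDS=""
while read -r in; do
  cmd="java -jar checkstyle-8.30-all.jar -c google_checks.xml -f xml \
-o ${in}_CS_Google.xml ${in}"
  echo "$cmd"
  CMDS+="$cmd"$'\n'
done < projectList.txt
```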
FindBugs. FindBugs 3.0.1 was installed by running brew install findbugs in the command line. Once installed, the GUI was launched by running spotbugs. From the GUI, the analysis was executed through File → New Project. The classpath for the analysis was identified to be the location of the project directory.
Moreover, the source directories were identified to be the project JAR executables. Once the classpath and source directories were identified, the analysis was engaged by clicking Analyze in the GUI. Once the analysis finished, the results were saved through File → Save as using the XML file format.
PMD. PMD 6.23.0 was downloaded from GitHub 27 as a zip file. After unzipping, the analysis was started by identifying several parameters: project directory, export file format, warning set, and export file name. In addition to the zip file, PMD offers 32 different types of warning sets for Java projects 28 . We developed a bash script, shown in Listing 3, to run the analysis for each project in one go. The parameter HOME represents the full path where the binary resides, "$in" represents the project name, which is imported from projectList.txt, RULESET represents the warning set used for the analysis, and "$in" PMD RULESET.xml represents the export file name of the analysis results in XML format. Just like in Listing 1, projectList.txt consists of all the project names. An example of how the projects were analyzed for the warning set Clone Implementation 29 is shown in Listing 4.

Listing 4 Example of PMD bash script for Clone Implementation configuration.
#!/bin/bash
while read in; do
    $HOME/pmd-bin-6.23.0/bin/run.sh pmd -dir "$in"/ -f xml -R rulesets/java/clone.xml -reportfile "$in"_PMD_Clone.xml;
done < projectList.txt

Data Analysis
In this section, we describe the analysis methods employed to address our research questions (RQs).
Source code quality issues identified by the tools (RQ 1 ). In order to determine the tools' detection capabilities and warning overlaps, we first identified the warnings that can be detected by each tool, also considering type (if available), category, and severity. Then, we calculated how many warnings are violated in our projects.
Agreement among the tools (RQ 2 ). We expected that similar warnings should be violated in the same class, and in particular in the same position of the code. For the tools that provide information on the exact position (the lines of code where the warning is detected), we analyzed the agreement using the bottom-up approach: we examined the overlapping positions (start and end lines) of the warnings and identified whether the warnings highlight the same kind of issue, essentially pairing the warnings based on position and checking whether they have similar definitions.
In order to check whether the warning pairs identified in RQ 1 appear in the same position in the code, we scanned and checked in which classes each warning was violated. Then we counted how many times the warning pairs were violated in the same class. Only warning pairs identified across the tools were considered; warnings detected by only one tool were excluded.
For example, there was a warning pair identified between SonarQube and Better Code Hub: "SQ 1 and BCH 1". Considering the warning pair detected by SonarQube (SQ) and Better Code Hub (BCH), we expected that these two warnings would be violated at least in the same class, since they identify the same kind of issue. So if, for instance, SQ 1 is violated 100 times and BCH 1 130 times, we would expect to find the warning pair violated at least 100 times in the same class. Hence, we calculate the agreement of each warning r in the warning pair as follows:

agreement_r = #SR / #issues_r

where #SR is the number of instances in which the warning pair is detected together in the same class, and #issues_r is the overall number of issues of warning r detected in the warning pair. Agreement is calculated separately for both warnings in a warning pair. Perfect overlap is obtained if the agreement for both warnings is equal to one. This means that all the issues generated by those warnings are always detected in the same classes, and no issues are detected separately in a different class.
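Using the counts from the running example, the agreement of each warning in the pair can be computed directly; the co-location count #SR is a hypothetical value:

```shell
#!/bin/bash
# Agreement of the example pair (SQ 1, BCH 1): #SR / #issues_r.
# SR (co-located detections in the same class) is hypothetical.
SR=80
SQ_TOTAL=100      # issues raised by SQ 1 (from the running example)
BCH_TOTAL=130     # issues raised by BCH 1 (from the running example)
agr_sq=$(awk "BEGIN { printf \"%.3f\", $SR / $SQ_TOTAL }")
agr_bch=$(awk "BEGIN { printf \"%.3f\", $SR / $BCH_TOTAL }")
echo "agreement(SQ 1)  = $agr_sq"    # 0.800
echo "agreement(BCH 1) = $agr_bch"   # 0.615
```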
After analyzing using the top-down approach from the definition level to the class level, we continued the analysis using the bottom-up approach from the line level to the definition level. For each class, we considered the start and end lines of the issue and compared the degree to which the warnings overlap according to the lines affected.
As the granularity of the warnings varies between the tools, we checked what fraction of the affected lines are overlapping between the warnings, instead of requiring sufficient overlap between both warnings. This was done by selecting one issue at a time (reference) and comparing all other issues (comparison) to that. If the lines affected by the comparison warning were residing within the lines affected by the reference warning, we defined the warnings as overlapped. To quantify the degree of overlapping, we used the percentage of the lines affected by the comparison issue that overlapped with the reference issue. The results were grouped based on four percentage thresholds: 100%, 90%, 80% and 70%. The concept is visualized in Figure 1. The lines represent the issues in a code file, indicating the start and end of the affected code lines. The percentages represent the ratio of the lines affected by the warning that lie within the lines affected by the reference warning. Depending on the threshold used, different warnings are selected based on the overlapping percentage.
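The per-issue overlap ratio described above can be sketched from the start and end lines alone; all line numbers below are hypothetical:

```shell
#!/bin/bash
# Fraction of the comparison issue's lines lying inside the reference
# issue's span; all line numbers are hypothetical.
ref_start=10; ref_end=40     # reference issue span
cmp_start=35; cmp_end=44     # comparison issue span
lo=$(( cmp_start > ref_start ? cmp_start : ref_start ))
hi=$(( cmp_end < ref_end ? cmp_end : ref_end ))
inter=$(( hi >= lo ? hi - lo + 1 : 0 ))
cmp_len=$(( cmp_end - cmp_start + 1 ))
pct=$(( inter * 100 / cmp_len ))
echo "overlap: ${pct}%"      # 6 of 10 lines overlap: below the 70% threshold
```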
Unfortunately, only SonarQube, Better Code Hub, Checkstyle, PMD, and FindBugs provide information about the actual "position" of the detected warnings: they report the "start line" and "end line" for each class. Regarding Coverity Scan, this information is only available in the web interface, not in the APIs; moreover, the Coverity Scan license does not allow crawling the web interface. Since we detected 8,828 Coverity Scan warnings violated in our projects, it would not have been feasible to collect this information manually.
Precision of the tools (RQ 3 ). In our last research question, we aimed at assessing the precision of the considered tools. From a theoretical point of view, precision is defined as the ratio between the true positive source code quality issues identified by a tool and the total number of issues it detects, i.e., true positives plus false positives (TPs + FPs). Formally, for each tool we computed precision as follows:

precision = TP / (TP + FP)

It is worth remarking that our focus on precision is driven by recent findings in the field that showed that the presence of false positives is among the most critical barriers to the adoption of static analysis tools in practice [20,52]. Hence, our analysis provides the research community, practitioners, and tool vendors with indications on the actual precision of the currently available tools, and aims at highlighting limitations that can be addressed by further studies. It is also important to remark that we do not assess recall, i.e., the number of true positive items identified over the total number of quality issues present in a software project, because of the lack of a comprehensive ground truth. We plan to create such a dataset and perform this additional evaluation as part of our future research agenda.
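As a worked example with hypothetical counts, the precision of one tool's inspected sample would be computed as:

```shell
#!/bin/bash
# precision = TP / (TP + FP); the counts are hypothetical.
TP=310
FP=74
precision=$(awk "BEGIN { printf \"%.2f\", $TP / ($TP + $FP) }")
echo "precision = $precision"    # 310 / 384 = 0.81
```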
When assessing precision, a crucial detail is related to the computation of the set of true positive quality issues identified by each tool. In the context of our work, we conducted a manual analysis of the warnings highlighted by the six considered tools, marking each of them as true or false positive based on our analysis of (1) the quality issue identified and (2) the source code of the system where the issue was detected. Given the extensive amount of work required for a manual inspection, we could not consider all the warnings output by each tool, but rather focused on statistically significant samples. Specifically, we took into account a 95% statistically significant stratified sample with a 5% confidence interval of the 65,133, 8,828, 62,293, 402,409, 33,704, and 467,583 items given by Better Code Hub, Coverity Scan, SonarQube, Checkstyle, FindBugs, and PMD, respectively. This step led to the selection of a set of 375 items from Better Code Hub, 367 from Coverity Scan, 384 from SonarQube, 384 from Checkstyle, 379 from FindBugs, and 380 from PMD.
To increase the reliability of this manual analysis, two of the authors of this paper (henceforth called the inspectors) first independently analyzed the warning samples. They were provided with a spreadsheet reporting (1) the name of the static analysis tool the row refers to, i.e., Better Code Hub, Coverity Scan, SonarQube, Checkstyle, FindBugs, or PMD; (2) the full path of the warning identified by the tool that the inspectors had to verify manually; and (3) the warning type and specification, e.g., the code smell. The inspectors' task was to go over each of the warnings and add a column to the spreadsheet indicating whether the warning was a true or a false positive. After this analysis, the two inspectors had a four-hour meeting where they discussed their work and resolved any disagreements: all the items marked as true or false positive by both inspectors were considered as actual true or false positives; in the case of a disagreement, the inspectors re-analyzed the warning in order to provide a common assessment. Overall, after the first phase of inspection, the inspectors reached an agreement of 0.84, which we computed using Krippendorff's alpha Kr α [22] and which is higher than 0.80, the standard reference score for Kr α [3].
In Section 3, we report the precision values obtained for each of the considered tools and discuss some qualitative examples that emerged from the manual analysis of the sample dataset.

Replicability
In order to allow the replication of our study, we have published the raw data in a replication package.

Analysis of the Results
In this section, we report and discuss the results obtained when addressing our research questions (RQs). RQ1. What quality issues can be detected by Static Analysis Tools? Here we analyzed how many warnings are actually output by the considered static analysis tools, as well as the types of issues they are able to discover. Static Analysis Tools Detection Capability. We report the detection capability of each tool in terms of how many warnings can be detected and how the warnings are classified internally (e.g., by type and severity). Moreover, we report the diffusion of the warnings in the selected projects.
Better Code Hub detects a total of 10 warnings, of which 8 are grouped based on type and severity. Better Code Hub categorizes these 8 warnings under 3 types: RefactoringFileCandidateWithLocationList, RefactoringFileCandidate, and RefactoringFileCandidateWithCategory. Of these 8 warnings, one is of the RefactoringFileCandidateWithLocationList type, six are of the RefactoringFileCandidate type, and one is of the RefactoringFileCandidateWithCategory type. In addition to the types, Better Code Hub assigns three possible severities to the warnings: Medium, High, and Very High. Of these eight warnings, four were classified as Medium severity, four as High severity, and eight as Very High severity; the counts sum to more than eight because some warnings have more than one severity assigned to them.
Checkstyle detects a total of 173 warnings, which are grouped based on type and severity. In addition to the types, Checkstyle groups these checks under four different severity levels: Error, Ignore, Info, and Warning. The distribution of the checks with respect to the severity levels is not provided in the documentation.
Coverity Scan's full set of detectable warnings, as well as their classification, is not publicly known, since its documentation is available to clients only. However, within the scope of our results, Coverity Scan detected a total of 130 warnings. These warnings were classified under three severity levels: Low, Medium, and High. Of these 130 warnings, 48 were classified as Low severity, 87 as Medium severity, and 12 as High severity; as with Better Code Hub, some of Coverity Scan's warnings have more than one severity assigned to them.

In total, the projects were affected by 936 warnings, violated 13,554,762 times; 8 (out of 10) warnings were detected by Better Code Hub 27,888 times (Table 1 and Table 2). It is important to note that in Table 2 the detection capability is empty for Coverity Scan: as mentioned earlier, the full detection capability is only provided to clients and not through the public API. We also computed how often warnings were violated by grouping them based on type and severity; the full results of this additional analysis are reported in our replication package. Given the scope of warnings that were detected, our projects were affected by all warnings detectable by Better Code Hub and by a subset of the warnings detectable by Coverity Scan and SonarQube (Table 3). For the sake of readability, we report only the Top-10 warnings detected in our projects by the six tools; the complete list is available in the replication package.

Finding 1. Our analysis provided a mapping of the warnings that currently available tools can identify in software projects. Overall, the amount of warnings detected by the six automated static analysis tools is significant (936 warnings detected 13,554,762 times); hence, we can proceed with the analysis of the remaining RQs. Our second research question focused on the analysis of the agreement between the static analysis tools.
Agreement based on the overlapping at "class-level". In order to include Coverity Scan in this analysis, we first evaluated the detection agreement at "class-level", considering each class where the warnings detected by the other five tools overlapped at 100% and where at least one warning of Coverity Scan was violated in the same class.
To calculate the percentage of warning pairs (column "%", Table 4) that appear together, we counted the occurrences of both tools in our projects and then considered only the minimum value. For example, when calculating the percentage for the Checkstyle-PMD warning pairs in Table 4, Checkstyle detected 9,686,813 warnings and PMD detected 3,380,493; the number of co-occurring warnings can therefore be at most 3,380,493 (the minimum of the two). We calculated the percentage considering the column "# occurrences" and the column "# possible occurrences". The warning overlap at the "class-level" is always low, as reported in Table 4. This means that a piece of code flagged by a warning from one tool is almost never flagged by a warning from another tool. Even in the best case (Table 4), only 9.378% of the possible warning pairs co-occur (FindBugs-PMD). Moreover, we found no warning pairs at the "class-level" involving more than two tools (e.g., Checkstyle-FindBugs-PMD).
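The percentage computation described above can be sketched as follows. The co-occurrence count used in the example is illustrative only (chosen to reproduce the 0.144% reported for Checkstyle-PMD); the exact figure comes from Table 4.

```python
def overlap_percentage(co_occurrences, count_a, count_b):
    """Percentage of possible co-occurrences for a warning pair.

    The number of possible co-occurrences is bounded by the smaller of the
    two tools' warning counts, so that count is used as the denominator."""
    possible = min(count_a, count_b)
    return 100.0 * co_occurrences / possible

# Checkstyle vs. PMD totals from Table 4; the co-occurrence count is our
# illustrative reconstruction, not a figure taken from the paper
print(round(overlap_percentage(4868, 9686813, 3380493), 3))   # 0.144
```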
For each warning pair we computed the detection agreement at class level. For the sake of readability, we report these results in Appendix A. Specifically, three tables (Table 9, Table 10, and Table 8) overview the detection agreement of each warning pair, according to the procedure described in Section 2.4. As further explained in the appendix, for reasons of space we only show the 10 most recurrent pairs, and provide the full set of results in our replication package. In these tables, the third and fourth columns (e.g., "# BCH pairs" and "# CHS pairs", Table 9) report how many times a warning instance from one tool co-occurs with an instance from the other. The two tools have separate co-occurrence counts because their warnings differ in granularity: a coarse-grained warning may contain several instances of the other tool's warning, in which case it counts as a single co-occurrence for the coarse-grained warning, while each contained instance counts separately for the fine-grained one. This ensures that the agreement lies between 0 and 1 for both warnings. The remaining two columns report the agreement of each tool in the warning pair (e.g., "BCH Agr." and "CHS Agr.", Table 9). Results showed that for all the warning pairs the agreement at the "class-level" is very low, as none of the most recurrent warning pairs agree well. The results also highlighted the difference in the granularity of the warnings.
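A minimal sketch of this granularity-aware agreement, with hypothetical counts (the function and the figures are ours, not the study's analysis script):

```python
def pair_agreement(co_occurrences_a, total_a, co_occurrences_b, total_b):
    """Per-tool agreement for a warning pair.

    Each tool keeps its own co-occurrence count: a coarse warning containing
    several fine-grained instances counts once for itself, while each contained
    instance counts for the fine-grained warning. Dividing each count by that
    tool's total instances keeps both scores in [0, 1]."""
    return co_occurrences_a / total_a, co_occurrences_b / total_b

# Hypothetical pair: a coarse-grained BCH warning (40 instances, 12 of which
# contain a CHS instance) vs. a fine-grained CHS warning (500 instances,
# 90 of which fall inside a BCH instance)
agr_bch, agr_chs = pair_agreement(12, 40, 90, 500)
print(agr_bch, agr_chs)   # 0.3 0.18
```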
Agreement based on the overlapping at the "line-level". Since we cannot compare the warnings detected by Coverity Scan at the "line-level", we could only consider the remaining five static analysis tools. Using the bottom-up approach (Figure 1), several rule pairs were found according to the 100%, 90%, 80%, and 70% thresholds. Using the threshold of 100%, which indicates that a rule completely resides within the reference rule, we found 17,977 rule pairs, as reported in Table 5. Using the thresholds of 90%, 80%, and 70%, we found 17,985, 18,004, and 18,025 rule pairs, respectively (Table 6); these warnings resided partially within the reference rule. Similarly to what happened with the agreement at the "class-level", it is important to note that the overlap at the "line-level" is always low. Results show that, also in this case, only 9.378% of the possible rule occurrences are detected on the same line by the same two tools (FindBugs and PMD). In addition, also in this case we found no warning pairs at the "line-level" involving more than two tools (e.g., Checkstyle-FindBugs-PMD).
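The threshold-based containment check at the "line-level" can be sketched as follows, assuming each warning is represented by the set of source lines it flags (a simplification of the bottom-up approach of Figure 1):

```python
def contained_fraction(warning_lines, reference_lines):
    """Fraction of a warning's flagged lines that fall inside the reference warning."""
    warning, reference = set(warning_lines), set(reference_lines)
    return len(warning & reference) / len(warning)

def matches_threshold(warning_lines, reference_lines, threshold):
    """True if the warning resides within the reference at the given threshold.

    threshold=1.00 requires complete containment; 0.90/0.80/0.70 accept
    warnings that only partially reside within the reference."""
    return contained_fraction(warning_lines, reference_lines) >= threshold

# A warning on lines 10-14 fully inside a reference warning spanning lines 10-20
print(matches_threshold(range(10, 15), range(10, 21), 1.00))   # True
# A warning on lines 8-14 only partially inside: 5 of its 7 lines (~71%)
print(matches_threshold(range(8, 15), range(10, 21), 0.70))    # True
print(matches_threshold(range(8, 15), range(10, 21), 0.80))    # False
```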
When considering the agreement for each warning pair at the "line-level", we could not obtain any result for computational reasons. Indeed, the line-level analysis of 936 warning types that have been violated 13,554,762 times would have required a prohibitively expensive amount of time and space (according to our estimations, it would have taken up to 1.5 years); therefore, we excluded it.
Finding 2. The warning overlap among the different tools is very low. The warning pair Checkstyle-PMD has the lowest overlap (0.144%) and FindBugs-PMD the highest (9.378%). Consequently, the detection agreement is also very low.

RQ3. What is the precision of the static analysis tools?
In the context of our last research question, we focused on the precision of the static analysis tools when employed for TD detection. Table 7 reports the results of our manual analyses. As shown, the precision of most tools is quite low (e.g., SonarQube has a precision of 18%), with the only exception of Checkstyle, whose precision is equal to 86%. In general, based on our findings, we can first corroborate previous findings in the field [4,20,31] and the observations reported by Johnson et al. [20], who found through semi-structured interviews with developers that the presence of false positives represents one of the main issues developers face when using static analysis tools in practice. With respect to the qualitative insights obtained by interviewing developers [20], our work concretely quantifies the capabilities of the considered static analysis tools.
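Given the manually validated samples, each tool's precision is simply the fraction of sampled warnings confirmed as true positives. A minimal sketch with hypothetical labels (the real samples contained roughly 380 warnings per tool):

```python
def precision(labels):
    """Fraction of manually validated warnings labeled as true positives."""
    return sum(labels) / len(labels)

# Hypothetical tiny validation samples (True = confirmed true positive)
samples = {
    "SonarQube":  [True, False, False, False, False],   # mostly false alarms
    "Checkstyle": [True, True, True, True, False],      # mostly true positives
}
for tool, labels in samples.items():
    print(tool, precision(labels))   # SonarQube 0.2, Checkstyle 0.8
```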
Looking deeper into the results, we could delineate some interesting discussion points. First, we found that for Better Code Hub and Coverity Scan almost two thirds of the recommendations represented false alarms, while the lowest performance was achieved by SonarQube. The poor precision of the tools is likely due to the high sensitivity of the warnings adopted to search for potential issues in the source code, e.g., threshold values that are too low lead to the identification of false positive TD items. This is especially true in the case of SonarQube: In our dataset, it outputs an average of 47.4 violations per source code class, often detecting potential TD in the code too hastily.
A slightly different discussion is the one related to the other three static analysis tools, namely PMD, Findbugs, and Checkstyle.
As for the former, we noticed that it typically fails when raising warnings related to naming conventions. For instance, this is the case of the 'AbstractName' warning: it suggests to the developer that an abstract class should contain the term Abstract in its name. In our validation, we discovered that in several cases the recommendation was wrong because the contribution guidelines established by developers explicitly indicated alternative naming conventions. A similar problem was found when considering FindBugs. The precision of the tool is 57% and, hence, almost half of the warnings were labeled as false positives. In this case, one of the most problematic cases was related to the 'BC_UNCONFIRMED_CAST' warning: it is raised when a cast is unchecked and not all instances of the type being cast from can be cast to the target type. In most cases, these warnings were labeled as false positives because, although the casts were formally unchecked, they were still correct by design, i.e., the casts could not fail because developers had implicitly ensured that all of them were correct.
Finally, Checkstyle was the static analysis tool with the highest precision, i.e., 86%. When validating the instances output by the tool, we realized that the warnings it raises relate to fairly simple checks on the source code that cannot be considered false positives, yet hardly influence the functioning of the source code. To make the reasoning clearer, let us consider the case of the 'IndentationCheck' warning: as the name suggests, it is raised when the indentation of the code does not respect the standards of the project. In our sample, these warnings were all true positives, hence contributing to the increase of the precision value. However, implementing these recommendations would improve the readability of the source code without addressing possible defects or vulnerabilities. As such, we claim that Checkstyle is best adopted in combination with additional static analysis tools.
To broaden the scope of the discussion, the poor performance achieved by the considered tools reinforces the preliminary research efforts to devise approaches for the automatic/adaptive configuration of static analysis tools [33,13], as well as for the automatic derivation of proper thresholds to use when locating design issues in source code [2,17]. It might indeed be possible that integrating those approaches into the inner workings of the currently available static analysis tools could reduce the number of false positive items. In addition, our findings suggest that current static analysis tools should not limit themselves to the analysis of source code but should, for instance, complement it with additional resources, such as the naming conventions actually in place in the target software system.
Finding 3. Most of the considered SATs suffer from a high number of false positive warnings, and their precision ranges between 18% and 57%. The only exception is Checkstyle (precision = 86%), even though most of the warnings it raises are related to documentation issues rather than functional problems; as such, its adoption should be complemented with other static analysis tools.

Discussion and Implications
The results of our study provide a number of insights that can be used by researchers and tool vendors to improve SATs. Specifically, these are: There is no silver bullet. According to the results obtained in our study, and specifically for RQ1, different SAT warnings cover different issues and can therefore find different forms of source code quality problems: hence, we can claim that there is no silver bullet able to guarantee source code quality assessment on its own. On the one hand, this finding highlights that practitioners interested in detecting quality issues in their source code might want to combine multiple SATs to find a larger variety of problems. On the other hand, and perhaps more importantly, our results suggest that the research community should devise more advanced algorithms and techniques, e.g., ensemble methods or meta-models [10,12,11,37], that can (1) combine the results from different static analysis tools and (2) account for possible overlaps among the rules of different SATs. This would allow the presentation of more complete reports about the code quality status of software systems to their developers.
Learning to deal with false positives. One of the main findings of our study concerns the low precision of the recommendations that all static analysis tools provide to developers (RQ3). Our findings represent the first attempt to concretely quantify the capabilities of the considered SATs in the field. Moreover, our study provides two practical implications: (1) it corroborates and triangulates the qualitative observations provided by Johnson et al. [20], hence confirming that the real usefulness of static analysis tools is threatened by the presence of false positives; (2) it supports the need for more research on how to deal with false positives, and particularly on how to filter likely false alarms [16] and how to select and prioritize the warnings to be presented to developers [21,28,23]. While some preliminary research efforts on the matter have been made, we believe that more research should be devoted to these aspects. Finally, our findings suggest the need for further investigation into the effects of false positives in practice: for example, it may be worthwhile for researchers to study the maximum number of false positive instances that developers can deal with, e.g., by devising a critical mass theory for false positive ASAT warnings [36], in order to augment the design of existing tools and the way they present warnings to developers.
Complementing static analysis tools. The findings from our RQ1 and RQ2 highlight that most of the issues reported by state-of-the-art static analysis tools are related to rather simple problems, like the writing of shorter units or the automation of software tests. These specific problems could possibly be avoided if current static analysis tools were complemented with effective tools targeting (1) automated refactoring and (2) automatic test case generation. In other words, our findings support and strongly reinforce the need for a joint research effort between the communities of source code quality improvement and testing, which are called to study possible synergies as well as to devise novel approaches and tools that could help practitioners complement the outcome of static analysis tools with that of refactoring and testing tools. For instance, with effective refactoring tools, the number of violations output by SATs would be notably reduced, possibly enabling practitioners to focus on the most serious issues.

Threats to Validity
A number of factors might have influenced the results reported in our study. This section discusses the main threats to validity and how we mitigated them.
Construct Validity. Threats in this category concern the relationship between theory and observation. A first aspect is related to the dataset used. In our work, we selected 112 projects from the Qualitas Corpus [47], which is one of the most reliable data sources in software engineering research [46]. Another possible threat relates to the configuration of the SATs employed. None of the considered projects had all the static analysis tools already configured, so we had to set them up ourselves; in doing so, we relied on the default configuration of the tools, since we could not rely on configurations provided directly by the developers of the projects. Nevertheless, it is important to point out that this choice did not influence our analyses: indeed, we were interested in comparing the capabilities of existing tools independently of their practical usage in the considered systems. The problem of configuring the tools therefore does not change the answers to our research questions.
Internal Validity. As for potential confounding factors that may have influenced our findings, it is worth mentioning that some issues detected by SonarQube were duplicated: in particular, in some cases the tool reported the same issue violated in the same class multiple times. To mitigate this issue, we manually excluded those cases to avoid interpretation bias; we also went over the rules output by the other static analysis tools employed to check for the presence of duplicates, but we did not find any.
External Validity. Threats in this category are concerned with the generalization of the results. While we cannot claim that our results fully represent every Java project, we considered a large set of projects with different characteristics, domains, sizes, and architectures. This makes us confident of the validity of our results in the field, yet replications conducted in other contexts would be desirable to corroborate the reported findings.
Another discussion point is related to our decision to focus only on open-source projects. In our case, this was a requirement: we needed access to the code base of the projects in order to configure the static analysis tools. Nevertheless, open-source projects are comparable, in terms of source code quality, to closed-source or industrial applications [26]; hence, we are confident that we would have obtained similar results by analyzing different projects. Additional replications would nonetheless provide further complementary insights and are, therefore, still desirable.
Finally, we limited ourselves to the analysis of Java projects, hence we cannot generalize our results to projects in different programming languages. Therefore, further replications would be useful to corroborate our results.
Conclusion Validity. With respect to the correctness of the conclusions reached in this study, the main threats have to do with the data analysis processes used. In the context of RQ1 and RQ3, we conducted iterative manual analyses in order to build the taxonomy and study the precision of the tools, respectively. While we cannot exclude possible imprecision, we mitigated this threat by involving more than one inspector in each phase; the inspectors first conducted independent evaluations that were later merged and discussed. Perhaps more importantly, we made all data used in the study publicly available with the aim of encouraging replicability, as well as further assessment of our results.
In RQ2 we proceeded with an automatic mechanism to study the agreement among the tools. As explained in Section 2.3, different static analysis tools might possibly output the same warnings in slightly different positions of the source code, e.g., highlighting the violation of a rule at two subsequent lines of code. To account for this aspect, we defined thresholds with which we could manage those cases where the same warnings were presented in different locations. In this case, too, we cannot exclude possible imprecision; however, we extensively tested our automated data analysis script. More specifically, we manually validated a subset of rules for which the script indicated an overlap between two tools with the aim of assessing whether it was correct or not. This manual validation was conducted by one of the authors of this paper, who took into account a random sample of 300 candidate overlapping rules. In this sample, the author could not find any false positives, meaning that our script correctly identified the agreement among tools. This further analysis makes us confident of the validity of the findings reported for RQ2.

Related Work
Automated Static Analysis Tools (ASATs) are getting more popular [52,25] as they are becoming easier to use [55]. The use of static analysis tools has been studied by several researchers in recent years [53,34,56,35]. In this section, we report the relevant work on static analysis tools, focusing on their usage [42,24,27], their warnings, and the problems they detect [15,19,8].
Developers can use ASATs, such as SonarQube and CheckStyle, to evaluate software source code, finding anomalies of various kinds in the code [40,50]. Moreover, ASATs are widely adopted in many research studies to evaluate code quality [20,44,30] and identify issues in the code [42,24,27]. Some studies demonstrated that some rules detected by ASATs can be effective for identifying issues in the code [56,23,27]. However, when evaluating their performance in defect prediction, results are discordant across different tools (e.g., FindBugs and PMD) [38].
Rutar et al. [40] compared five bug-finding tools for Java (Bandera, ESC/Java2, FindBugs, JLint, and PMD) that use syntactic bug pattern detection, on five projects, including JBoss 3.2.3 and Apache Tomcat 5.0.19. They focused on the different warnings (also called rules) provided by each tool, and their results demonstrate some overlap among the types of errors detected, which may be due to the fact that each tool applies different trade-offs between false positives and false negatives. Overall, they stated that the warnings provided by the different tools are not correlated with each other. Complementing the work by Rutar et al. [40], we calculated the agreement of ASATs on TD identification. In addition, we investigated the precision with which these tools output warnings. Finally, we also investigated the types of TD items that can actually be detected by existing ASATs.
Tomas et al. [50] performed a comparative analysis by means of a systematic literature review. In total, they compared 16 Java static code analysis tools, including JDepend, FindBugs, PMD, and SonarQube. They focused on internal quality metrics of a software product and on the static code analysis tools that automate the measurement of these metrics. As a result, they reported the tools' detection strategies and what they detect. For instance, most of them automate the calculation of internal quality metrics, the most common ones being code smells, complexity, and code size [50]. However, they did not investigate the agreement between the tools' detection rules.
Avgeriou et al. [5] identified the available static analysis tools for technical debt detection. They compared the features and popularity of nine tools, also investigating the empirical evidence on their validity. Their results can help practitioners and developers select the most suitable tool according to the information it measures and how well it satisfies their needs. However, they did not evaluate the tools' agreement and precision in the detection.
Focusing on developers' perception of the usage of static analysis tools, ASATs can help find bugs [20]. However, developers are not sure about the usefulness of the rules [45,51,43]; they pay attention to different rule categories and priorities, and remove violations related to rules with high severity [51] in order to avoid the possible risk of faults [45]. Moreover, false positives and the way in which warnings are presented, among other things, are barriers to their wider adoption [20]. Some studies highlighted the need to reduce the number of detectable rules [32,9] or to summarize them based on similarities [51]. ASATs are able to detect many defects in the code; however, some tools do not capture all the defects they could potentially detect [49]. Although several studies since the beginning of the 2010s have highlighted the need to better clarify the precision of the tools, differentiating false positives from actionable rules [28,41], many studies still report the many false positives produced by different tools, such as FindBugs [49,6,7], JLint, PMD, CheckStyle, and JCSC [49].
To the best of our knowledge, our work is the first to investigate in detail which source code quality problems can actually be detected by the available tools, to compare them based on their warning descriptions, and to assess both their agreement and the precision of their recommendations.

Conclusion
In this paper, we performed a large-scale comparison of six popular Static Analysis Tools (Better Code Hub, CheckStyle, Coverity Scan, FindBugs, PMD, and SonarQube) with respect to the detection of static analysis warnings. We analyzed 47 Java projects from the Qualitas Corpus dataset and derived a taxonomy of the warnings that can be detected by the tools. We also compared their detection agreement at the "line-level" and "class-level", and manually analyzed their precision. To sum up, the contributions of this paper are: 1. A comparison of the warnings that can be detected by the tools (taxonomy), which may be useful for researchers and tool vendors to understand which warnings should be considered during refactoring; 2. An analysis of the agreement among the tools, which can inform tool vendors about the limitations of the current solutions available on the market; 3. The first quantification of the precision of six static analysis tools (Better Code Hub, CheckStyle, Coverity Scan, FindBugs, PMD, and SonarQube).
Our future work includes an extension of this study with the evaluation of the recall, and the in-vivo assessment of the tools.