An Extensive Study on Smell-Aware Bug Localization

Bug localization is an important aspect of software maintenance because it can locate modules that should be changed to fix a specific bug. Our previous study showed that the accuracy of the information retrieval (IR)-based bug localization technique improved when used in combination with code smell information. Although this technique showed promise, the study had limited generality because of the small number of: 1) projects in the dataset, 2) types of smell information, and 3) baseline bug localization techniques used for assessment. This paper presents an extension of our previous experiments on Bench4BL, the largest benchmark dataset available for bug localization. In addition, we generalized the smell-aware bug localization technique to allow different configurations of smell information, which were combined with various bug localization techniques. Our results confirmed that our technique can improve the performance of IR-based bug localization techniques at the class level even when large datasets are processed. Furthermore, with an optimized configuration of the smell information, our technique can enhance the performance of most state-of-the-art bug localization techniques.


Introduction
Bug localization is the process of identifying the locations of a given bug. Because it can be a tedious task in large-scale software development projects, many ideas have been proposed to automate this process using software development information. For instance, we can identify the locations of a bug using the description of bug reports, i.e., information retrieval (IR)-based techniques [1,2], or execution traces, i.e., dynamic analysis [3]. To improve the bug localization accuracy, many hybrid techniques that combine a base technique with additional information have been proposed. For example, BugLocator [4] combined similar bug reports that were fixed in the past with an IR-based technique. BLUiR [5] incorporated structural information in addition to using similar bug reports from the past. AmaLgam [6] combined the version history, report similarity, and structural information.
Although these techniques can significantly improve the bug localization accuracy, they can only be used when sufficient additional information is available. Moreover, most existing techniques do not consider the likelihood of each module containing a bug and treat all modules equally, which may lower the accuracy of bug localization. To this end, we previously proposed a smell-aware bug localization technique to improve the IR-based bug localization accuracy using code smell information [7]. The motivation behind our approach is that modules with code smells have been found to be more change-prone and fault-prone [8,9]. In addition, our technique does not require additional information because code smells can be detected directly from the source code.
Although our technique can significantly enhance the bug localization performance, the previous study still has limitations that need to be addressed. First, we experimented on only four open-source systems. This small number of targeted systems means that our results may be difficult to generalize. Second, we only used one set of configurations for the technique, even though there were many possible options. Thus, our previously reported result might not be optimal. Finally, we combined our technique with only one base bug localization technique. Therefore, it remains unclear whether our technique is applicable to other bug localization techniques.
The study presented in this paper is an extension of our previous study with the objective of overcoming these limitations. First, we replicated our study on Bench4BL [10], which is the largest dataset available for bug localization. Second, we generalized the smell-aware bug localization technique and conducted studies with different configurations to find the best configurations. Finally, we combined the smell-aware bug localization technique with different base techniques provided by Bench4BL to assess whether our technique could improve state-of-the-art bug localization techniques.
The main contributions of this paper can be summarized as follows: • We replicate the smell-aware bug localization technique at the class level and show that it is effective even for processing a large-scale dataset.
• We generalize the smell-aware bug localization technique to allow different configurations and present the optimal configuration of the technique.
• We combine the smell-aware bug localization technique with different base bug localization techniques and show that it can improve their performance.
The remainder of this paper is organized as follows. First, we provide preliminary information about IR-based bug localization and code smells in the next section. Next, we summarize related work pertaining to empirical studies of bug localization in Section 3. We describe our smell-aware bug localization technique in Section 4. In Section 5, we provide the details of our study and present the results. Threats to validity are discussed in Section 6. Discussions of this work are presented in Section 7. Finally, we provide our conclusions in Section 8.

IR-Based Bug Localization and Its Extensions
Bug localization is the process of identifying the location of the source code that should be modified to fix a specific bug. Bug localization is challenging, especially in large-scale software systems. Therefore, automated bug localization techniques can help developers save time during such tedious processes.
IR-based bug localization techniques accept a bug report and the source code of a specific version as inputs. The approaches then determine the similarity between the bug report and source code and generate a ranking of modules based on this similarity. Developers are expected to use these rankings to help them perform bug-fixing tasks.
In IR-based bug localization, the following steps are conducted to quantify the similarity.
1. Corpus generation. To run the approach, the text of each module needs to be processed. The text is regarded as a sequence of tokens. In addition, compound words such as isCommitable are divided into is and Commitable. The tokens are then processed using standard natural language preprocessing steps, such as stemming and stop word removal, for each module.
2. Indexing. The next step is to index the generated corpus. Specifically, weighting schemes such as term frequency-inverse document frequency (TF-IDF) are applied to calculate the importance of each word in each module. In the case of TF-IDF, the importance of a word is calculated from its frequency in the module and its rarity across all modules.
3. Query construction. To calculate the similarity between the source code and the bug report, the bug report is preprocessed in the same way as in the corpus generation step.
4. Ranking. The indexed corpus and the bug report are transformed into vectors. In the case of the vector space model (VSM), their similarity is obtained by calculating the cosine similarity of the two vectors. The calculated similarity values are then used to rank the modules. The higher the rank of a module, the more likely it is to be the location of the bug.
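The four steps above can be sketched as a minimal VSM pipeline. The tokenizer, stop word list, and corpus layout below are illustrative assumptions, and stemming is omitted for brevity; this is a sketch of the general scheme, not the implementation used by any of the cited tools.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Split camelCase compounds (e.g., isCommitable -> is, Commitable),
    # lowercase, and drop a few example stop words; stemming is omitted.
    words = re.findall(r"[A-Za-z]+", re.sub(r"([a-z])([A-Z])", r"\1 \2", text))
    stop = {"the", "a", "is", "to", "of", "and"}
    return [w.lower() for w in words if w.lower() not in stop]

def tfidf_vectors(docs):
    # docs: {name: token list}; returns {name: {term: TF-IDF weight}}.
    df = Counter(t for toks in docs.values() for t in set(toks))
    n = len(docs)
    vecs = {}
    for name, toks in docs.items():
        tf = Counter(toks)
        vecs[name] = {t: tf[t] * math.log(n / df[t]) for t in tf}
    return vecs

def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_modules(corpus, bug_report):
    # corpus: {module_name: source text}; ranks modules by similarity
    # to the bug report, highest first.
    docs = {name: tokenize(src) for name, src in corpus.items()}
    docs["__query__"] = tokenize(bug_report)
    vecs = tfidf_vectors(docs)
    query = vecs.pop("__query__")
    return sorted(docs.keys() - {"__query__"},
                  key=lambda m: cosine(query, vecs[m]), reverse=True)
```

A module sharing terms with the bug report (after compound splitting) is ranked above textually unrelated modules.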
An advantage of IR-based bug localization techniques is that only a few types of inputs are required. As most software development projects currently use an issue tracking system, bug localization techniques can easily obtain the source code and bug reports. Therefore, IR-based bug localization can be applied in most situations. In contrast, a disadvantage is low accuracy; sufficient accuracy cannot be obtained by merely considering the similarity between the bug report and the source code [4]. In addition, this approach depends on the quality of bug reports [11,12,13].
Because the accuracy of IR-based bug localization is not sufficiently high, many approaches have been proposed that combine it with other types of information. It is noteworthy that the similarity between the bug report and source code is still necessary; the other types of information serve only as a supplement. Types of information that have been applied to bug localization are as follows.
• Source code size. Zhou et al. [4] used the number of lines of source code to represent the size of the source code because bugs are more likely to reside in larger source files.
• Past bug reports. Zhou et al. [4] used bug reports similar to the current bug report to support their approach. The underlying reason is that similar bug reports tend to be fixed by modifying the same files.
• Stack trace. Wong et al. [3] used stack trace information in a bug report to capture the order of the executed modules until the program failed. Therefore, as we know the modules that were executed, we can use such information to improve bug localization.
• Change history. Wang et al. [6] used the past change history when calculating the probability of buggy modules. This information represents the likelihood that a given file will contain a bug in general. It can be calculated from the number of times each file was modified in past bug-fixing commits.
The advantages and disadvantages of the approaches that combine other information are the opposite of those of IR-based approaches. Because information such as the change history is not always available, the applicability of some approaches is limited. In contrast, as these approaches use extra information, the accuracy can be improved; for instance, combining the size of the source code with an IR-based technique can significantly improve its accuracy [4].

Code Smells
Code smells are often used as an indicator of a design flaw or problem in the source code [14]. Many studies have found that code smells are related to several aspects of software development problems [15,16]. Thus, it is recommended to remove code smells by performing related refactoring operations to improve the quality of the source code. Code smells were initially described in natural language; thus, several studies have attempted to formalize their detection.
For example, Lanza and Marinescu [17] defined a metric-based strategy for detecting God Class as follows:

GodClass(c) := (ATFD(c) > θ_ATFD) ∧ (WMC(c) ≥ θ_WMC) ∧ (TCC(c) < θ_TCC)

where ATFD(c), WMC(c), and TCC(c) are the metric values of access to foreign data (ATFD), weighted method count (WMC), and tight class cohesion (TCC) of class c, respectively. Similarly, θ_ATFD, θ_WMC, and θ_TCC are the thresholds of ATFD, WMC, and TCC, respectively. Specifically, ATFD measures the number of foreign attributes that are used by a class. Therefore, the higher the ATFD, the more likely the class is to be a God Class. Similarly, WMC represents the sum of the complexities of all the methods declared in a class. Thus, the higher the WMC, the more likely it is that the class is a God Class. In contrast, TCC represents the degree of cohesiveness of a class. As a result, a class with a lower TCC is more likely to be a God Class. These three conditions are combined in conjunctive form to determine a God Class.
To measure the strength of a code smell, Marinescu defined severity as an integer that measures the number of times the value of a chosen metric exceeds a given threshold [18]. For instance, in the case of God Class, WMC, ATFD, and TCC were used in the detection approach. Among these three metrics, ATFD is used to calculate the severity. In other words, we can calculate the severity by computing the number of times the value of ATFD exceeds its threshold. Severity values range from 1 to 10. Note that the metrics used to compute the severity vary according to the smell type because the detection strategy of each smell type uses a different set of metrics. More information, including additional examples of the severity computation, can be found in the original paper by Marinescu [18].
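The detection strategy and severity computation described above can be sketched as follows. The threshold values and the exact clamping of the severity score are assumptions made for illustration; inFusion's actual thresholds and severity mapping may differ.

```python
def is_god_class(atfd, wmc, tcc, t_atfd=4, t_wmc=47, t_tcc=1 / 3):
    # Conjunctive metric-based detection strategy in the style of [17].
    # The threshold values here are illustrative placeholders.
    return atfd > t_atfd and wmc >= t_wmc and tcc < t_tcc

def severity(metric_value, threshold):
    # Severity counts how many times the chosen metric (ATFD for God
    # Class) exceeds its threshold, clamped to [1, 10] (assumed mapping).
    return max(1, min(10, int(metric_value // threshold)))
```

For example, a class whose ATFD is ten times its threshold would receive the maximum severity of 10 under this assumed mapping.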

Related Work
IR-based bug localization is useful for locating source code files that need to be modified to fix a specific bug. Given a bug report and the source code as inputs, it outputs files or a ranking of files that need to be modified to fix the bug. IR techniques, e.g., Latent Dirichlet Allocation (LDA) [19], Latent Semantic Indexing (LSI) [20], or the Vector Space Model (VSM) [21], are used in bug localization to calculate the similarity between a given bug report and the source code. The obtained similarity scores are then utilized to identify the source files relevant to the given bug report. In addition, techniques of IR-based concept location [22,23] and impact analysis [24] follow the same approach. Among them, Rao and Kak reported VSM to be the best choice among IR techniques for bug localization [25].
An advantage of IR-based bug localization techniques is the few types of inputs that need to be prepared, because they require only bug reports and source code. However, this can also be regarded as a disadvantage because these techniques depend excessively on the quality of bug reports. In other words, IR techniques are unable to process low-quality bug reports effectively. To mitigate this problem, one study [11] proposed a technique to improve IR-based bug localization by reconstructing low-quality bug reports. Other researchers attempted to adjust the behavior of an IR-based bug localization technique according to the quality of bug reports [12,13]. In addition, Moreno et al. [26] proposed a technique named QUEST that automatically configures the parameters of the underlying IR approach to improve its accuracy. Another line of techniques involves combining additional information with an IR technique. Shi et al. [27] suggested that combining additional information with an IR-based bug localization technique can be beneficial. Zhou et al. [4] defined the revised VSM (rVSM) by considering the scale of the source code in VSM. They proposed BugLocator by combining rVSM with information about similar bug reports from the past. Wong et al. [3] proposed BRTracer by combining BugLocator with stack traces. A stack trace describes the methods that are invoked and the order in which they are invoked until the program fails; it can be obtained by dynamic analysis or found in bug reports. Similar to BRTracer, Lobster [28] also utilized stack trace information in bug reports to improve bug localization. Tantithamthavorn et al. [29] proposed a technique that uses the history of past changes together with BugLocator. In this technique, they used the co-change information in the change history to additionally identify modules that were likely to be changed when another module was changed.
Furthermore, BLUiR [5] considered the structure of the source code in addition to BugLocator. AmaLgam [6] utilizes a bug prediction technique based on the version history in addition to BLUiR. Although these hybrid IR-based bug localization techniques are more accurate than basic IR-based bug localization techniques, they are more costly to apply. For instance, AmaLgam requires the user to collect the change history, which is time-consuming. In addition, the applicability of techniques using a version history is limited to projects with sufficient history.
This study extends our previous study using Bench4BL, which is the largest available dataset for bug localization. The techniques from Bench4BL that we used in this study are BugLocator, BRTracer, BLUiR, and AmaLgam. In addition, we added VSM as a baseline bug localization technique and rVSM as a technique that considers the size of the source code. A comparison of the information used in each technique is presented in Table 1. As can be seen, most techniques improve on another technique by adding more information. In contrast, our smell-aware technique is not tied to any particular base technique: it can be applied on top of any of them to improve their performance.
As we have already noted in Section 1, the motivation behind our approach is that modules with code smells have been found to be more change- and fault-prone [8,9]. Following a similar motivation, several attempts have been made to use smell information for bug prediction. For example, Taba et al. [30] proposed a bug prediction technique that uses a historical metric computed from smelly classes as additional information. In addition, Palomba et al. [31] succeeded in effectively utilizing the severity degree of code smells in bug prediction. Their studies suggest the usefulness of smell-based information for identifying buggy portions of source code, which partly justifies the use of smell-based information in bug localization.

Bug Likelihood Index
The smell-aware bug localization technique aims to improve the accuracy of existing IR-based bug localization. A problem with IR-based bug localization is that it relies on the textual similarity between a bug report and the source code, i.e., it considers all modules to be equal and does not consider the likelihood of a module containing a bug. This shortcoming may be responsible for the low accuracy of the technique. To overcome this problem, our technique uses information about the code smells to represent the likelihood of a module containing a bug. Specifically, we used the smell severity, which indicates the strength of a code smell [18], to represent the likelihood. In addition, as smell information can be directly detected from the source code, it is possible to keep the cost of the technique close to that of IR-based bug localization. In other words, the user does not need to obtain additional information to use the technique.
Example. To fix the bug CAMEL-9059 1 , the method createEndpoint in the class JettyHttpComponent was modified 2 . The class had three smells: Refused Parent Bequest, Schizophrenic Class, and God Class. The method had two smells: Feature Envy and Blob Operation. In particular, God Class and Feature Envy were detected with the highest severity (10). This class module is highly complicated, difficult to comprehend, and difficult to change safely, which might lead to bug-proneness. Unfortunately, this module was assigned a low rank by IR-based bug localization (349th in the VSM ranking) because many other modules were textually more similar to the bug report. Smell-aware bug localization aims to increase the rankings of such modules by utilizing smell information.
To combine the code smell information with textual similarity, we proposed the bug likelihood index (BLI). The BLI for each module m can be calculated as follows:

BLI(m) = α · nSim(m) + (1 − α) · nSev(m)

where nSim(m) is the textual similarity of the bug report and module m based on VSM, and nSev(m) is the sum of the severity of the smells contained in module m. Both values are normalized to the range [0, 1]. More specifically, they are calculated from the original non-normalized similarity Sim(m) and severity sum Sev(m) as follows:

nSim(m) = Sim(m) / max_{m′} Sim(m′),  nSev(m) = Sev(m) / max_{m′} Sev(m′)

Further, α (0 ≤ α ≤ 1) is a parameter representing the weight of nSim(m) relative to nSev(m).
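The weighted combination of normalized similarity and severity described above can be sketched as follows. Max-normalization and the weight alpha = 0.5 are illustrative assumptions, not the paper's tuned settings.

```python
def bli(sim, sev, alpha=0.5):
    # sim: {module: raw VSM similarity}, sev: {module: summed smell
    # severity}. Both are max-normalized to [0, 1] (assumed scheme)
    # before the weighted combination; alpha is an example weight.
    max_sim = max(sim.values()) or 1.0
    max_sev = max(sev.values()) or 1.0
    return {m: alpha * sim[m] / max_sim
               + (1 - alpha) * sev.get(m, 0) / max_sev
            for m in sim}
```

A module with modest textual similarity but severe smells can thereby be ranked above a smell-free module with higher raw similarity.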
In our previous study, we conducted an experiment using four open-source projects: ArgoUML, JabRef, jEdit, and muCommander. In that study, TraceLab [32] and inFusion [18] were used as the VSM implementation and the smell detector, respectively. The VSM implementation of TraceLab used TF-IDF as its weighting scheme and was equipped with a standard IR preprocess. We applied the IR-based bug localization technique with VSM and our smell-aware bug localization to the targets and compared the accuracy using the mean average precision (MAP) [33]. As a result, our technique improved IR-based bug localization by 36%, 34%, 24%, and 28% in relative terms for ArgoUML, JabRef, jEdit, and muCommander, respectively.

Generalized Bug Likelihood Index
Although the result using BLI was promising, the following limitations must be noted.
• Generalization. As the goal of our previous study was to obtain a preliminary result to determine whether code smells have the potential to improve bug localization, we conducted our study on a small dataset consisting of four projects. Although this dataset is often used in related studies reported in the literature, it is difficult to generalize the applicability of our technique because of the small number of projects.
• Optimal configuration. The generalization of smell-awareness requires many parameters to be specified. For example, we could specify the granularity of code smells, e.g., the class or method level. We could also specify the aggregator to use when a module has more than one code smell, e.g., summation or taking the maximum. Finally, as many types of code smells exist, we could choose whether to include all types or only specific types of code smells. However, in our previous work, we conducted a study with only one configuration, which means that the reported results might not be optimal.
• Combination of different base techniques. Smell-aware bug localization was designed to be employed in combination with a base bug localization technique to improve its performance. However, in our previous study, VSM was the only base technique to be studied. Although VSM is a representative technique for IR-based bug localization because of its simplicity, many other high-performance bug localization techniques have been proposed [3,4,5,6]. Therefore, it remained unclear whether our technique can be used to improve other base techniques.
Therefore, to overcome the limitations mentioned above, we generalized the technique to examine different configurations of code smells and defined the generalized bug likelihood index (gBLI). The gBLI of module m can be calculated as:

gBLI_{b,s}(m) = α · nScore_b(m) + (1 − α) · nSmell_s(m)

where nScore_b(m) is the normalized output score of the base bug localization technique b, and nSmell_s(m) is a normalized value based on the code smell configuration s, which includes three parameters: granularity (g), aggregator (a), and type selector (t). The normalization process of nScore_b(m) and nSmell_s(m) is the same as that of nSim(m) and nSev(m). The functions nScore_b(m) and nSmell_s(m) are generalizations of nSim(m) and nSev(m), respectively. The details of the code smell configuration are explained in the next subsection.

Code Smell Configuration
A code smell configuration includes three parameters: granularity (g), aggregator (a), and type selector (t), which are explained in the following paragraphs.
Granularity (g). Code smells are often defined on the basis of the granularity of the modules, e.g., the class or method level. In our previous study, we used only class-level code smells because we focused on a bug localization technique that outputs class-level results. In other words, we kept the granularity of the code smells the same as that of the bug localization result. However, in addition to the class-level code smells, adding method-level smells may improve the performance because these smells add more information to the modules. In addition, we are likely to obtain a larger number of modules with code smells by considering method-level code smells. Nevertheless, this information may add noise to the technique and decrease its performance. Therefore, it is useful to clarify the effect of using different code smell granularities when applying the technique.
The granularity (g) can be set as follows:
• g1: class level,
• g2: method level, and
• g3: both class and method levels.
Aggregator (a). When detecting code smells, the possibility of more than one code smell being detected in a single module is high. Therefore, we need an aggregator to combine the information of each code smell into a single value for the module. For example, in our previous study, we used summation to combine the severity of each smell in a module. As another example, Palomba et al. [31] used the maximum smell severity value to capture the bug-proneness of a module. Thus, multiple ways are available to combine the smell information, and a comparison of different aggregators is necessary to determine which one performs best.
We considered the sum and the maximum of the severity as aggregators because they have been used in the literature. In addition, we considered smell existence, that is, 1 if a module contains at least one smell and 0 otherwise, as well as the number of smells in a module. Furthermore, the average and the median of the severity of all the smells are considered as representatives of the severity degree of the target module. Finally, considering situations in which the severity or number of smells is biased depending on the smell type, we also added nested aggregators that apply the average or median after aggregating by the maximum severity or the number of smells per smell type. These aggregators are considered to confirm whether the use of the smell severity yields improved performance. To summarize, the aggregator (a) can be set to the sum of severity (a1), the maximum severity, smell existence, the number of smells, the average severity, the median severity, or one of the nested aggregators described above.
Type Selector (t). Most detectors can detect different types of code smells; for example, inFusion can detect 16 types of code smells. Our previous study entailed the detection of all types of code smells. Nevertheless, different types of code smells may affect bug-proneness differently. Therefore, employing the technique with different types of code smells might yield different results.
Our goal here is to compare the performance when all types of code smells are used with the performance when only certain types of code smells that are more likely to be related to bug-proneness are used. We prepared five levels of smell type selection:
• t1: all smell types,
• t2: rare selected smell types,
• t3: medium-rare selected smell types,
• t4: medium selected smell types, and
• t5: well-done selected smell types,
which vary in their inclusiveness of smell types. The concrete selection for each selector is left open at this stage because it is determined empirically; see Sections 5.6 and 5.7.
Specialization to BLI. Because the previous technique for class modules used VSM as its base bug localization technique, and the sum of the severity (a1) of class-level smells (g1) of all smell types (t1) as its configuration of code smells, gBLI with this configuration expresses our original BLI:

BLI(m) = gBLI_{VSM, (g1, a1, t1)}(m)
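The generalized index can be sketched with the configuration parameters made explicit. The aggregator names, the filtering of smells by granularity and type, and the alpha value below are illustrative assumptions, not the paper's exact notation.

```python
from statistics import mean, median

# Aggregators (a) map the severities of the smells in a module to one value.
AGGREGATORS = {
    "sum": sum,
    "max": max,
    "existence": lambda sevs: 1 if sevs else 0,
    "count": len,
    "average": mean,
    "median": median,
}

def g_bli(base_scores, smells, granularity, aggregator, types, alpha=0.5):
    # base_scores: {module: score from any base technique b}.
    # smells: {module: [(granularity, smell_type, severity), ...]}.
    # granularity/types implement the g and t parameters by filtering
    # the smell list; alpha and the field layout are illustrative.
    agg = AGGREGATORS[aggregator]
    raw = {}
    for m in base_scores:
        sevs = [s for g, t, s in smells.get(m, [])
                if g in granularity and t in types]
        raw[m] = agg(sevs) if sevs or aggregator == "existence" else 0
    max_score = max(base_scores.values()) or 1.0
    max_smell = max(raw.values()) or 1.0
    return {m: alpha * base_scores[m] / max_score
               + (1 - alpha) * raw[m] / max_smell
            for m in base_scores}
```

Under this sketch, the original BLI corresponds to calling g_bli with VSM scores, class-level granularity, the "sum" aggregator, and all smell types.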

Research Questions
This study focuses on the following four research questions (RQs).
• RQ 1 : Does smell-aware bug localization improve IR-based bug localization using VSM even for a large-scale dataset?
• RQ 2 : What is the relationship between the performance improvement of the smell-aware bug localization and bug proneness?
• RQ 3 : What are the best configurations for smell-aware bug localization as an extension of VSM?
• RQ 4 : Is the performance of smell-aware bug localization superior to that of state-of-the-art bug localization techniques?
Details of the motivation for each respective RQ are provided later.

Approach Overview
An overview of this study, including the process of Bench4BL and smell detection, is shown schematically in Fig. 1. The nodes within the dotted enclosure are provided by Bench4BL. First, we executed the bug localization technique using the source code, bug reports, and additional information as inputs. Next, we detected the code smells from the source code using the aforementioned tool. Then, we calculated the score of each module from the smell information and output the score obtained by the bug localization technique. The scores are then used to generate the output ranking. Finally, we calculated the accuracy of the ranking based on the gold set included in Bench4BL.

Bench4BL
In this study, we used Bench4BL, which is the largest benchmark for bug localization [10]. The dataset contains the source code, bug reports, and lists of source files that were modified to fix the bug reports, i.e., gold sets, across 46 projects and their versions. Bench4BL also includes the implementation of state-of-the-art bug localization techniques such as BugLocator [4], BRTracer [3], BLUiR [5], AmaLgam [6], BLIA [34], and Locus [35]. We decided to use Bench4BL because it is suitable for conducting a large-scale empirical study. Furthermore, it would enable us to combine our smell-aware bug localization with the base techniques implemented in Bench4BL.
We regarded each project version in Bench4BL as a specific system. In a typical bug localization context, we see a system as a pair of 1) a set of bug reports that define bugs and 2) a source code snapshot to be used to locate the bugs. When applying this approach to Bench4BL, project versions are most suitable for systems because bug reports are associated with a project version in Bench4BL. This decision also means that we regard different versions of the same project as different systems.

Bug Localization Techniques
Of the six bug localization techniques (BugLocator, BRTracer, BLUiR, AmaLgam, BLIA, and Locus) provided by Bench4BL, we excluded BLIA and Locus and used the remaining four techniques for two reasons. First, these two implementations often behave non-deterministically, outputting different results from the same input. This behavior was not suitable for our study. Second, these two implementations had more invalid outputs than the other four implementations. Following the bug report selection and invalidity criteria shown in Section 5.3.4, we could collect 6,936 bug reports that met our criteria from the VSM results. If we used the four techniques mentioned above, the number of bug reports decreased to 6,931, which means that only five bug reports were excluded. However, if we added BLIA to these four, 199 reports were additionally excluded, and 6,732 bug reports remained. If we added Locus in addition to BLIA, 752 reports were additionally excluded, and 5,980 remained. Because our goal was to reproducibly confirm the effectiveness of the smell-aware approach on the improved bug localization techniques, not necessarily on all bug localization techniques, we excluded BLIA and Locus to include more bug reports in the experiment. A similar buggy behavior of the Locus implementation in Bench4BL was also reported by Chaparro et al. [40]. However, note that the second reason does not directly imply that these implementations are broken. Some results are valid in terms of bug localization results but inappropriate for our study; see Section 5.3.4 for the details.

Class level
Blob Class [14,17,36]: A class that is very large and complex
Data Class [14,17,37]: A class with no functionality, only data
Distorted Hierarchy [37]: A class with a very narrow and deep inheritance hierarchy
God Class [14,17,37]: A class that handles data from other classes
Refused Parent Bequest [17,37,38]: A class that rarely uses the members it inherits from its base class
Schizophrenic Class [37,38]: A class representing multiple concepts
Tradition Breaker [17,37]: A class that violates the conventions defined by its base class

Method level
Blob Operation [14,17,36]: A method that is large and complex
Data Clumps [14]: A method in which several data values appear as a group
External Duplication [14,36,39]: A method containing code duplicated from unrelated classes
Feature Envy [14,17,37]: A method that is more relevant to the data of other classes than to those of its own class
Intensive Coupling [14,17,37]: A method that is strongly coupled to many other methods
Internal Duplication [14,36,39]: A method with duplicate code in its own class
Message Chains [14,17,36]: A method that results in a chain of many method calls
Shotgun Surgery [14,17]: A method that propagates changes to many other methods when it is changed
Sibling Duplication [14,36,39]: A method containing code duplicated in its sibling classes

In addition to the four bug localization techniques provided by Bench4BL, we added two techniques: VSM and rVSM. Both are components used in BugLocator, and we reused them to ensure that the implementations were compliant with the Bench4BL framework. VSM is the basis of IR-based bug localization and computes the textual similarity between the bug report and the source code. We added this technique because it has the lowest cost among bug localization techniques. We also added rVSM, an extension of VSM, to confirm whether smell-aware bug localization is effective only because it considers the size of the source code. Although we mentioned that code smell information is effective for bug localization, the underlying reason may be the size of the source code because some smell information reflects the size of the source code. Therefore, we compare the smell-aware technique with rVSM, which considers the size of the source code, in RQ 4 .
To summarize, we selected the six bug localization techniques shown in Fig. 2. Because all the techniques are based on VSM, we can compare the effects of the additional information contributed by each technique.
• rVSM [4]: Extension of VSM that considers the size of the source code.
• BugLocator [4]: A technique that combines rVSM with previous similar bug reports.
• BLUiR [5]: A technique that combines BugLocator with structural information obtained from the source code.
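Because all the techniques above build on VSM, the baseline's core is worth making concrete. The following minimal sketch (with hypothetical tokens and file names, not the Bench4BL implementation) shows a plain VSM: TF-IDF weighting of tokens followed by cosine similarity between a bug report and each source file.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build TF-IDF vectors for a list of token lists (log-scaled TF, smoothed IDF)."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    idf = {t: math.log(n / df[t]) + 1 for t in df}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (1 + math.log(c)) * idf[t] for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy corpus: one bug report and two source files, already tokenized.
report = ["store", "flush", "fails", "on", "restart"]
files = {
    "Store.java": ["store", "flush", "cache", "restart", "region"],
    "Log.java": ["log", "append", "sync", "roll"],
}
vecs = tfidf_vectors([report] + list(files.values()))
scores = {name: cosine(vecs[0], vecs[i + 1]) for i, name in enumerate(files)}
ranking = sorted(scores, key=scores.get, reverse=True)
```

In practice the rVSM variant additionally weights each file's score by a function of its size, which is the distinction exploited in RQ 4.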

Code Smell Detection
In this study, code smells were detected from the source code of each version in Bench4BL. We used inFusion [18], a powerful commercial code smell detector. inFusion is an extended version of inCode [42], which is a successor to iPlasma [43]. We selected inFusion for several reasons:
• the detected smell instances are associated with a severity score,
• it can be run in an automated manner without requiring the collection of dependent libraries, compilation of the source code, or configuration in an IDE, which simplifies the workflow of our experiment and reduces manual effort,
• it follows well-known metric-based smell detection strategies [17], which are explained in Section 2.2, and
• it can detect 16 types of code smells at both the class and the method level, which suits our aim of comparing the effects of different types of code smells in our approach.
We regarded the first two reasons as mandatory requirements for conducting our study. Although other code smell detectors, such as cASpER [44], DECOR [45], JCodeOdor [46], and JDeodorant [47], have been proposed to date, they do not meet these requirements. The class-level and method-level smells detected by inFusion are summarized in Table 2. We used the detection results for both types of smells without manual validation.

Data Selection
We excluded data with any inconsistency from the dataset. For example, output rankings may be invalid, e.g., those including a similarity score of Not-a-Number (NaN) or those with no gold module occurrences. Although rankings with no gold module occurrences are valid as bug localization results, we regarded them as invalid because they are not useful for our study in terms of confirming the improvements of the smell-aware bug localization approach. We excluded any bug report for which the output ranking of at least one bug localization technique was invalid. In addition, we excluded bug reports of the versions for which inFusion could not detect any code smells because of our intention to confirm the effectiveness of the smell-aware technique. Finally, we excluded versions with fewer than five bug reports to ensure that each version involved a certain minimum number of bug reports. This exclusion was necessary to mitigate the threat of over-optimizing the weight values in the case of a small number of bug reports; see Sections 5.5.2 and 6.1 for details. As a result, we excluded 2,528 of the 9,459 bug reports and used the remaining 6,931 bug reports over 309 versions and 35 projects. Table 3 provides information about each of these projects. The columns in this table contain the name of the project, the number of versions, the total number of bug reports, and the average number of source files.
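The three exclusion rules above can be sketched as a filter over bug report records. The record fields below are hypothetical stand-ins, not Bench4BL's actual data format.

```python
import math

def select_bug_reports(reports, min_reports_per_version=5):
    """Filter bug reports following the paper's three exclusion rules (sketch):
    1) drop reports whose ranking from any technique is invalid, i.e., contains
       a NaN score or ranks no gold module at all,
    2) drop reports of versions in which no code smells were detected, and
    3) drop whole versions left with fewer than min_reports_per_version reports."""
    def valid(r):
        for ranking in r["rankings"].values():  # one ranking per technique
            if any(math.isnan(score) for _, score in ranking):
                return False
            if not any(name in r["gold"] for name, _ in ranking):
                return False
        return r["version_has_smells"]

    kept = [r for r in reports if valid(r)]
    by_version = {}
    for r in kept:
        by_version.setdefault(r["version"], []).append(r)
    return [r for rs in by_version.values()
            if len(rs) >= min_reports_per_version for r in rs]

def report(version, nan=False):
    """Build a toy bug report record."""
    score = float("nan") if nan else 0.5
    return {"version": version, "gold": {"A.java"},
            "rankings": {"vsm": [("A.java", score), ("B.java", 0.1)]},
            "version_has_smells": True}

# v1 keeps its five clean reports; the NaN report and the small v2 are excluded.
reports = [report("v1") for _ in range(5)] + [report("v1", nan=True)] \
    + [report("v2"), report("v2")]
kept = select_bug_reports(reports)
```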

Evaluation Metrics
To evaluate the ranking outputs from each bug localization technique, we used the following evaluation metrics:

• Top N. This metric represents the ratio of bug reports for which at least one gold file is included within the top N of the given ranking. Here, gold files denote files included in the gold set. Given a set of bug reports $B$, the metric can be calculated as follows:

$$\mathrm{Top}\,N = \frac{1}{|B|} \sum_{r \in B} \mathrm{top}_N(r),$$

where $\mathrm{top}_N(r)$ returns 1 if a gold file is contained in the top $N$ of the ranking obtained from bug report $r$, and 0 otherwise. By definition, we can calculate the actual number of bug reports that succeeded in having at least one gold module within the top $N$ of the given ranking by multiplying the Top $N$ value by the total number of bug reports. In this study, we used Top 1, Top 5, and Top 10.
• Mean Reciprocal Rank (MRR). MRR [48] is the mean of the multiplicative inverse of the rank of the first gold file in the given ranking. Given a bug report $r$, its reciprocal rank (RR) can be calculated as follows:

$$\mathrm{RR}(r) = \frac{1}{\mathrm{rank}(r)},$$

where $\mathrm{rank}(r)$ is the rank of the highest gold file in the ranking obtained from bug report $r$. Given a set of bug reports $B$, the MRR is calculated as the average of the RR of each bug report in $B$:

$$\mathrm{MRR} = \frac{1}{|B|} \sum_{r \in B} \mathrm{RR}(r).$$

• Mean Average Precision (MAP). MAP [33] considers all the gold files, whereas Top N and MRR consider only the top gold files. Assuming that the number of files in the output ranking is $M$, the average precision (AP) of a bug report $r$ can be calculated as follows:

$$\mathrm{AP}(r) = \frac{\sum_{k=1}^{M} \mathrm{precision}(k) \times \mathrm{gold}(k)}{\sum_{k=1}^{M} \mathrm{gold}(k)}.$$

Here, $k$ denotes the rank of a file, $\mathrm{precision}(k)$ denotes the ratio of gold files among the files ranked at the $k$-th position or higher, and $\mathrm{gold}(k)$ returns 1 if the file ranked at the $k$-th position is in the gold set and 0 otherwise. MAP is the average of the AP over all the bug reports in $B$:

$$\mathrm{MAP} = \frac{1}{|B|} \sum_{r \in B} \mathrm{AP}(r).$$

Moreover, we used the Wilcoxon signed-rank test [49] for statistical testing. Because the values of all the evaluation metrics are computed as the average of the values for each bug report, we used the set of values for each bug report for all the statistical tests in this study. For example, when testing the statistical significance of the difference between two techniques in terms of MAP, we compared the two sets of AP values used to compute MAP. In addition, when reporting statistical significance, we also report Cliff's delta ($\delta$) as a measure of the magnitude of the improvement. Cliff's delta is interpreted based on the thresholds of Romano et al. [50]: negligible for $|\delta| < 0.147$, small for $0.147 \le |\delta| < 0.33$, medium for $0.33 \le |\delta| < 0.474$, and large for $0.474 \le |\delta|$.
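The three metrics can be implemented directly from their definitions. The following sketch assumes that every gold file appears somewhere in each ranking, so AP averages precision over the retrieved gold positions.

```python
def top_n(rankings, golds, n):
    """Top N: fraction of bug reports with at least one gold file in the top n."""
    hits = sum(any(f in gold for f in ranking[:n])
               for ranking, gold in zip(rankings, golds))
    return hits / len(rankings)

def mrr(rankings, golds):
    """Mean Reciprocal Rank: mean of 1 / (rank of the first gold file)."""
    total = 0.0
    for ranking, gold in zip(rankings, golds):
        for i, f in enumerate(ranking, start=1):
            if f in gold:
                total += 1.0 / i
                break
    return total / len(rankings)

def average_precision(ranking, gold):
    """AP: mean of precision@k over the positions k that hold gold files."""
    hits, precisions = 0, []
    for k, f in enumerate(ranking, start=1):
        if f in gold:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(rankings, golds):
    return sum(average_precision(r, g)
               for r, g in zip(rankings, golds)) / len(rankings)

# Two toy bug reports: the files each ranking returns, and the gold sets.
rankings = [["a", "b", "c"], ["b", "a", "c"]]
golds = [{"a"}, {"a", "c"}]
```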

RQ 1 : Does smell-aware bug localization improve IR-based bug localization using VSM even for a large-scale dataset?

Motivation
Although our previous study showed that smell-aware bug localization can significantly improve the accuracy of IR-based bug localization, the study was conducted with only four projects. In addition, the number of bug reports included in the dataset was only 277, which made it difficult to generalize the results. Therefore, the goal of this RQ is to verify whether the smell-aware bug localization technique can be used to improve the performance even for a large-scale dataset. This is intended as a sanity check to confirm whether the same setting involving the use of the original smell-aware bug localization would be applicable to systems in Bench4BL prior to making new attempts.

Study Design
This study was designed to replicate and extend the original study to all projects in Bench4BL. The smell-aware bug localization technique can be employed by setting the smell granularity to either the class or the method level. We used the smell granularity at the class level because the granularity of the modules obtained in bug localization is the file level in Bench4BL. In addition, we used a Java file as a proxy for the class and excluded all inner classes. The weight was set to the value at which each evaluation metric is maximized in each system, which ensures that the study was conducted under the same conditions as the previous experiment. We produced the rankings for each system by calculating the BLIs from the bug reports in the system using all possible weight values, which range from 0 to 1 in increments of 0.01. We then evaluated the set of rankings for each weight value by employing each evaluation metric. Finally, the weight value that maximized the evaluation score was used as the parameter for the pair of the evaluation metric and the system.
For each bug report, all files in the bug localization result were sorted in descending order of BLI, and the accuracy of IR-based bug localization and the smell-aware bug localization technique were compared according to the gold set in Bench4BL.
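The weight sweep described above can be sketched as follows. The exact BLI formula is defined in Section 4 (outside this excerpt), so this sketch assumes a simple linear combination of the normalized IR score (nScore) and the normalized smell score (nSmell); the file names and score values are hypothetical.

```python
def bli(n_score, n_smell, w):
    """Bug localization index: assumed linear mix of the normalized IR score
    and the normalized smell score, with w the paper's tuned weight."""
    return (1 - w) * n_score + w * n_smell

def best_weight(files, gold, metric):
    """Sweep w from 0 to 1 in steps of 0.01 and keep the value maximizing
    `metric`. `files` maps a file name to its (n_score, n_smell) pair."""
    best = (0.0, -1.0)
    for step in range(101):
        w = step / 100
        ranking = sorted(files, key=lambda f: bli(*files[f], w), reverse=True)
        score = metric(ranking, gold)
        if score > best[1]:
            best = (w, score)
    return best

# Toy system: the gold file has a mediocre text score but a strong smell score.
files = {"Store.java": (0.4, 0.9), "Log.java": (0.6, 0.1), "Util.java": (0.5, 0.0)}

def reciprocal_rank(ranking, gold):
    """Evaluation metric for the sweep: 1 / rank of the first gold file."""
    return 1 / (ranking.index(next(f for f in ranking if f in gold)) + 1)

w, rr = best_weight(files, {"Store.java"}, reciprocal_rank)
```

With a weight of 0 the gold file is ranked last (RR = 1/3); once the weight passes roughly 0.2, its smell score lifts it to the top of the ranking.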

Results
The results are shown in Fig. 3. It is noteworthy that MAP in this study increased by a smaller amount than in the previous study. Specifically, in this study, MAP increased by 7.3% on average in relative comparison, whereas in the previous study, it increased by 30.5%. The difference may be attributed to several factors. First, the previous experiment involved all the classes, including the inner classes, whereas this experiment considered only the top-level classes and excluded the inner classes. Second, the VSM implementation we used differed from that in the previous study. The VSM implementation in this study is the VSM component of BugLocator, which was optimized for bug localization usage [4]. Third, the number of projects and bug reports used in this study is much larger than in the previous study: 4 projects and 277 bug reports previously vs. 35 projects and 6,931 bug reports in this study. Consequently, the result of our previous study might be an extreme case, whereas the results of this study reflect a more realistic distribution.
Example. The application of smell-aware bug localization to HBASE-1795 is presented in Table 4. In Table 4, the top 10 items of the result using VSM (weight 0) and that using the smell-aware bug localization technique (weight 0.31) are compared. The gold module, i.e., the class module that was modified to fix this bug, is highlighted in gray. The gold module Store, which is located at the tenth rank in VSM, has several class-level smells, such as Blob Class and God Class. Therefore, Store eventually had the top nSmell score and was ranked at the top when using smell-aware bug localization. In addition to HBASE-1795, a total of 24 bug reports tied to HBASE 0.20.5 targeted Store for fixing, and the score improvement of Store led to improved rankings for many bug reports in the system. Note that, in addition to the class-level smells, certain methods of Store have a method-level smell called Blob Operation, which could also be utilized in smell-aware bug localization. The use of such method-level smells is studied in RQ 3 .
In summary, the smell-aware bug localization technique at the class level can also improve the accuracy of IR-based bug localization using VSM even for a large dataset.

RQ 2 : What is the relationship between the performance improvement of the smell-aware bug localization and bug proneness?

Motivation
We showed that smell-aware bug localization could also improve the bug localization accuracy even for the Bench4BL dataset in answering RQ 1 . As already explained in Section 1, we consider that the results of smell-aware bug localization are improved because smells are bug prone [8,9]. To provide more convincing evidence of bug-proneness in the bug localization context in this study and investigate the difference in contributions by the smell types, we examine the extent to which each smell type affects the possibility of identifying buggy modules by bug localization.

Study Design
To answer RQ 2 , we calculate the relative risk [51] of the existence of buggy modules for each smell type. Let $M_{\mathrm{all}}$ and $B_{\mathrm{all}} \subseteq M_{\mathrm{all}}$ be the sets of all the modules and the buggy modules over all the 309 target project versions, where each element is represented as a pair of a module name and the project version to which it belongs. Here, we regarded a module as buggy if and only if the module was included in the gold set of at least one bug report in the target version in Bench4BL. We denote the set of modules in which a smell of type $s$ is detected by $M_s \subseteq M_{\mathrm{all}}$. The set of buggy modules that contain the smell of type $s$ is computed as $B_s = M_s \cap B_{\mathrm{all}}$. Then, the risk of smelly modules containing bugs to be fixed ($\mathrm{Risk}_s$), that of non-smelly modules ($\mathrm{Risk}_s^*$), and the relative risk of smelly modules ($\mathrm{RR}_s$) are, respectively, expressed as follows:

$$\mathrm{Risk}_s = \frac{|B_s|}{|M_s|}, \quad \mathrm{Risk}_s^* = \frac{|B_{\mathrm{all}} \setminus B_s|}{|M_{\mathrm{all}} \setminus M_s|}, \quad \mathrm{RR}_s = \frac{\mathrm{Risk}_s}{\mathrm{Risk}_s^*}.$$
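With module-version pairs held as sets, the relative-risk computation is a few lines. The sets below are toy stand-ins for the 309-version data.

```python
def relative_risk(m_all, b_all, m_s):
    """Relative risk of smell type s over (module, version) pairs.

    m_all: all pairs; b_all: the buggy ones (a subset of m_all);
    m_s: the pairs in which a smell of type s was detected.
    Risk_s  = |B_s| / |M_s|               (buggy rate of smelly modules)
    Risk_s* = |B_all - B_s| / |M_all - M_s| (buggy rate of non-smelly modules)
    RR_s    = Risk_s / Risk_s*
    """
    b_s = m_s & b_all
    risk = len(b_s) / len(m_s)
    risk_star = len(b_all - b_s) / len(m_all - m_s)
    return risk / risk_star

# Toy data: 10 modules in one version; 4 are smelly, 3 are buggy.
m_all = {(f"m{i}", "v1") for i in range(10)}
b_all = {("m0", "v1"), ("m1", "v1"), ("m4", "v1")}
m_s = {("m0", "v1"), ("m1", "v1"), ("m2", "v1"), ("m3", "v1")}
rr = relative_risk(m_all, b_all, m_s)
```

Here Risk_s = 2/4 and Risk_s* = 1/6, so the smelly modules are three times as likely to be buggy.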

Results
The results are presented in Table 5. The columns show the smell type, the number of detected smelly modules ($|M_s|$), the number of buggy modules among the files detected as smelly ($|B_s|$), the buggy risk of the smelly modules ($\mathrm{Risk}_s$) and non-smelly modules ($\mathrm{Risk}_s^*$), and the buggy relative risk ($\mathrm{RR}_s$) for all systems. The smell types are sorted in descending order of the relative risk of the smelly modules. The total numbers of modules ($|M_{\mathrm{all}}| = 654{,}674$) and buggy modules ($|B_{\mathrm{all}}| = 15{,}834$) are also shown in the bottom row of the table. The results in the table indicate that, when considering all smell types, 5.735% of the smelly modules are buggy modules. The relative risk shows that, in comparison with non-smelly modules, smelly modules are 2.785 times more likely to be buggy. This suggests that prioritizing smell-containing modules in bug localization can lead to improved accuracy.
The relative risk of smelly modules varied depending on the smell type. On the one hand, Blob Class, Shotgun Surgery, God Class, Blob Operation, and Intensive Coupling were the smell types with the top five relative risks. Modules with these smell types are more than four times more likely to contain bugs to be fixed than modules without them. On the other hand, the relative risks for Data Class and Distorted Hierarchy were less than 1, such that selecting these smells does not necessarily lead to the identification of modules with a high probability of containing bugs.
The three columns on the right in Table 5 present the relative risk of each smell type calculated using only the systems of a specific project group to determine the extent to which the obtained trend is universal. The numbers in parentheses in the table indicate the rank of an item in the project group to which it belongs. As we can see from the table, although several small differences exist, the ranking trend for each project group is similar to the global ranking.
In summary, in comparison with non-smelly modules, smelly modules are 2.785 times more likely to be buggy.

RQ 3 : Which configuration of the smell information yields the best performance?

Motivation
As discussed above, when the smell-aware bug localization technique was used in our previous work, we used only one set of configurations, even though many options were available. For example, we can change the granularity of code smells, the aggregator when combining multiple code smells, and the type of code smells. Therefore, when answering this RQ, our goal was to explore the configurations that would yield the best performance.

Study Design
In our previous study, the technique was limited to using the textual similarity (nSim) and the sum of the severities (nSev) when formulating BLI. To answer RQ 3 , we utilized gBLI with the three parameters of the code smell configuration defined in Section 4, i.e., a combination of three granularity levels (g), ten aggregators (a), and five type selectors (s).
We instantiated concrete selections of type selectors. Based on the results obtained for RQ 2 , to eliminate smell types that are unlikely to be related to bug-proneness, we created several sets of smell types by excluding those with a lower relative risk and retaining only those with a higher relative risk under different boundaries. Finally, we compared the performance of the technique under the following five settings.
• s1: all smell types (16): all types of smells,
• s2: rare selected smell types (14): types of smells whose relative risk is greater than 1,
• s3: medium rare selected smell types (12): types of smells whose risk is greater than that of all types of smells (5.735%),
• s4: medium selected smell types (10): types of smells whose relative risk is greater than that of all types of smells (2.785), and
• s5: well selected smell types (5): the top five types of smells regarding their relative risk; their relative risk is greater than 4.
The concrete types are specified at the left of Table 5.
To compare all configurations, as discussed earlier, we applied our technique to all 150 (= 3 granularity levels × 10 aggregators × 5 type selectors) configurations and calculated the accuracy.
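A sketch of how the five selector sets and the full 150-configuration grid could be enumerated follows. The per-type statistics are hypothetical stand-ins for Table 5, and the aggregator labels are placeholders.

```python
import itertools

def build_selectors(stats, overall_risk=5.735, overall_rr=2.785):
    """Instantiate the five type selectors s1-s5 from per-type statistics.
    `stats` maps a smell type to its (risk in %, relative risk) pair."""
    ranked = sorted(stats, key=lambda t: stats[t][1], reverse=True)
    return {
        "s1_all": set(stats),                                            # everything
        "s2_rare": {t for t, (_, rr) in stats.items() if rr > 1},        # RR > 1
        "s3_medium_rare": {t for t, (risk, _) in stats.items()
                           if risk > overall_risk},                      # risk > overall
        "s4_medium": {t for t, (_, rr) in stats.items()
                      if rr > overall_rr},                               # RR > overall
        "s5_well": set(ranked[:5]),                                      # top five by RR
    }

# Hypothetical per-type statistics (risk %, relative risk).
stats = {
    "Blob Class": (25.0, 6.0), "Shotgun Surgery": (22.0, 5.5),
    "God Class": (20.0, 5.0), "Blob Operation": (18.0, 4.5),
    "Intensive Coupling": (16.0, 4.2), "Feature Envy": (10.0, 2.0),
    "Data Class": (3.0, 0.8),
}
selectors = build_selectors(stats)

# Full grid: 3 granularities x 10 aggregators x 5 selectors = 150 configurations.
granularities = ["class", "method", "both"]              # g1-g3
aggregators = [f"a{i}" for i in range(1, 11)]            # a1-a10 (placeholders)
configs = list(itertools.product(granularities, aggregators, selectors))
```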
In addition to the five selectors, we also prepared special selectors that use only one smell type to investigate the performance of each smell type. When using these selectors together with the other perspectives, 1) the smell granularity is automatically determined by the type, 2) the nested aggregators (a7-a10) are unnecessary because the first aggregation step results in only one value instance, so these aggregators produce the same result as a2 or a4, and 3) for class-level smell types, the severity-based aggregators (a1, a2, a5, and a6) and the count-based ones (a3 and a4) respectively produce the same result because only one smell instance is assumed to be detected, so a2 and a3 suffice as their representatives. Therefore, we prepared 1 × 2 × 7 (class-level) + 1 × 6 × 9 (method-level) = 68 configurations. We compared the best configuration of each smell type to determine which smell types contributed to the performance improvements.
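The aggregators named in the text can be sketched as plain computations over a module's detected smell instances. The index-to-function mapping follows the descriptions quoted in this section (a1: sum of severity, a2: maximum severity, a3: existence, a4: count, a5/a6: average/median severity, a9/a10: average/median count per type), and the severity values are hypothetical.

```python
from statistics import mean, median

# Per-module smell data: smell type -> severity scores of its detected
# instances (a hypothetical module with three instances of two types).
smells = {"Blob Operation": [7, 3], "Feature Envy": [5]}

severities = [v for vals in smells.values() for v in vals]  # all instances
aggregators = {
    "a1_sum_severity": sum(severities),             # total severity
    "a2_max_severity": max(severities, default=0),  # worst instance
    "a3_existence": 1 if severities else 0,         # any smell present?
    "a4_count": len(severities),                    # number of instances
    "a5_avg_severity": mean(severities) if severities else 0,
    "a6_median_severity": median(severities) if severities else 0,
}

# Nested variants aggregate per smell type first, then across types,
# e.g., the average/median of the number of instances in each type.
per_type_counts = [len(v) for v in smells.values()]
a9_avg_count_per_type = mean(per_type_counts)
a10_median_count_per_type = median(per_type_counts)
```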

Results
The accuracy for the top 20 configurations (c1 to c20) is listed in Table 6, sorted in descending order of MAP. Note that two special configurations are additionally included in the table. c74 is the configuration equivalent to that used in our previous study. c0 is a pseudo-ideal configuration that allows a different configuration selection for each system, as though we knew the best configuration for each system. The annotated numbers in parentheses represent the rank of each metric value for the top three configurations. In addition, the values in the column "# systems" indicate the number of systems for which the smell-aware approach outperformed the baseline bug localization technique, i.e., the cases where a weight greater than 0 was used. As we can see, different configurations lead to different numbers of successful systems, mainly depending on the smell types used.
Considering Top 1, Top 5, MRR, and MAP, the configuration that yields the best performance was c1 = (g3: both class and method levels, a3: existence of smells, s5: well selected smell types). On the other hand, when considering Top 10, the combination c2 = (g3: both class and method levels, a2: maximum severity, s3: medium rare selected smell types) performed the best. In terms of overall performance, c1, c2, and c3 performed well. Based on these observations, providing the technique with an appropriate configuration enables it to significantly outperform the technique developed in our previous study. Furthermore, to address RQ 1 , we used the configuration c74 to conduct the experiment and found the difference in terms of MAP to be statistically insignificant, as shown in Fig. 3. Nevertheless, when we reran the experiment with the configuration c1, we not only observed statistically significant results in all metrics ($p < 0.01$; Cliff's delta: 0.066, 0.065, 0.056, 0.066, and 0.054, all negligible), but also an increase in the improvement of all metrics, as shown in Fig. 4. For example, Top 10 increased by 0.056 (from 0.675 to 0.731), which means that the total number of bug reports with gold modules in their top 10-ranked items increased by 391 (from 4,676 to 5,067). The figure also presents the results obtained using the ideal configuration c0, which show additional improvements compared with c1. This result indicates that there remains scope for improvement in smell-aware bug localization if we know the best smell configuration per system.

Analysis of each configuration parameter. We compared the MAP scores according to each configuration parameter to determine the contribution of each parameter to the performance. The following discussion is based on comparisons of the MAP results obtained by fixing all perspectives other than the one under discussion.
Figure 5 shows the difference in the MAP performance of specific perspectives over the other perspectives. Different combinations of smell granularity ( ) and smell selector ( ) are compared in Fig. 5a, whereas the smell aggregators are compared in Fig. 5b.
The three different colors in Fig. 5a indicate the smell granularity that was used. The results in the figure show that the configurations using g3: both class and method levels had higher scores depending on the aggregators used, whereas g2: method level produced worse results in general. In particular, the use of smells at both levels produced more accurate results when combined with a2: maximum severity or a3: existence of smells. This result suggests that adding method-level smells in addition to class-level smells may be effective in extending the range of smells to be used, but considering all of them may also have a negative effect. For instance, a1: sum of severity and a4: number of smells add up all smells as equivalent, even if class-level smells exist. This kind of smell usage was not effective because method-level smells may decrease the importance of class-level smells. In contrast, a2: maximum severity and a3: existence of smells are considered to be effective because they can consider method-level smells when class-level smells do not exist or when the method-level smells have higher severity than the class-level smells.
For the smell aggregators shown in Fig. 5b, the configurations using a2: maximum severity and a3: existence of smells produced higher values. The other types of aggregators produced worse results in general, except for the average or median of the number of smells of each type (a9 and a10) with a certain level of selection of class-level smells (g1, s2-s5). In addition, we found that configurations using the median (a6, a8, and a10) produced very similar results to those using the average (a5, a7, and a9). These results suggest that indicators such as a2: maximum severity and a3: existence of smells should be used.
For the smell selectors, as shown in Fig. 5b, the use of selection (s2-s5) tended to produce more accurate results than the configurations using s1: all smell types. In particular, when using g3: both class and method levels, the use of a2: maximum severity was the best choice at a certain level of selection, whereas the use of a3: existence of smells was more effective when used together with s5: well selected smell types. This result suggests the effectiveness of smell selection based on the likelihood of containing bugs, as indicated in Table 5. In addition, the use of the severity degree tends to be more effective when a broader range of smell types is used.
Analysis of individual smell types. The best configurations whose selectors use only one smell type are presented in Table 7. Each row in this table indicates the performance of the best configuration when a specific smell type is used as its selector. Rows are ordered by their MAP scores. Although several smell types, such as God Class, Blob Operation, and Blob Class, outperformed the other smell types, none outperformed the best configurations in Table 6. This result shows that awareness of multiple smell types improves the performance to a greater extent than awareness of only one specific smell type.
Distribution of AP improvements for each bug report. To improve our understanding of the performance improvement, we analyzed the distribution of AP improvements of the 6,931 bug reports when applying smell-aware bug localization to the VSM results. The distribution is visualized in Fig. 6. In this figure, the black line plots the AP deltas (ΔAP = AP smell-aware − AP VSM) in descending order, obtained using the best configuration c1. As shown, smell-aware bug localization improved the overall accuracy because the total improvements (top left) exceeded the total decreases (bottom right). However, the use of smell-aware bug localization did not improve the values of all bug reports; instead, certain values became less accurate. Out of the 6,931 bug reports, the AP increased for 1,809 and decreased for 2,499. However, for most of them, the delta was small: among those with an absolute delta greater than 0.01, 1,543 increased, whereas 1,138 decreased. In particular, among those with an absolute delta greater than 0.1, 803 reports improved, whereas only 265 worsened. We consider the accuracy to have improved because the number of bug reports with large improvements was considerably larger than that with large decreases. The figure also includes plots of the original configuration c74 and the second-best configuration c2. Clearly, the degree of improvement by the best configuration is much higher than that of the original configuration. Moreover, the difference from the second-best configuration is small.
Benefited and non-benefited systems. Figure 7 shows the results for the top systems with the largest improvement in MAP when using the best configuration c1. We selected the first, second, fifth, sixth, and seventh-ranked systems because they are the top systems when only one version is taken from the same project. In the figure, each graph shows the MAP value at different weight values for each system. For comparison, the results of the two best configurations as well as the original configuration, i.e., c1, c2, and c74, are plotted. The point for weight 0 refers to the accuracy when using only the baseline IR-based bug localization technique, i.e., VSM, whereas weight 1 refers to the accuracy when using only the code smell property to localize bugs. Two typical shapes can be seen in these plots. Moreover, the best configuration did not always lead to the best result; for example, in Fig. 7e, the second-best configuration led to the best result. Note that 101 of the 309 systems used a weight of 0 under the best configuration c1. The most typical explanation for such cases is simply that the buggy portions to be fixed were not smelly.
In conclusion, the configurations using both class and method levels for the granularity yielded the best result. In terms of the combinations of the aggregator and the selector, the existence of well selected types of smells or the maximum severity of medium rare selected types of smells yielded the best results.

RQ 4 : Is the performance of smell-aware bug localization superior to that of state-of-the-art bug localization techniques?

Motivation
When addressing RQ 1 and RQ 3 , we found that the smell-aware bug localization technique can improve the performance when combined with the VSM technique. However, many bug localization techniques have been proposed to improve on the VSM technique, such as rVSM, BugLocator, BRTracer, BLUiR, and AmaLgam. The goal of this study is to verify whether the smell-aware bug localization technique can also be used to improve bug localization techniques other than VSM.

Study Design
We implemented the smell-aware bug localization technique using six existing bug localization techniques as baselines. Specifically, we used the output score of each technique as nScore in Section 4. For nSmell, we applied the best configurations discussed in RQ 3 , that is, c1 and c2 in Table 6.
Finally, we compare the accuracy of the ranking produced using gBLI and each baseline technique.

Results
The results of each technique using c1 = (g3: both class and method levels, a3: existence of smells, s5: well selected smell types) as the smell configuration are shown in Fig. 8. Although we applied two configurations (c1 and c2), we reached almost the same conclusion for each configuration. To save space, we mainly use c1 to explain the details and clarify major differences where they exist. Figure 8a shows the result of the rVSM technique. The results obtained with the smell-aware bug localization technique improve on the baseline by approximately 11.0%, 5.4%, 3.7%, 6.4%, and 4.7% in relative comparison (0.043, 0.036, 0.028, 0.033, and 0.018 in absolute comparison) for Top 1, Top 5, Top 10, MRR, and MAP, respectively. All of these improvements are statistically significant (Cliff's delta: 0.043, 0.036, 0.028, 0.041, and 0.030, all negligible). This indicates that smell information is useful even with a bug localization technique that uses information about the size of the source code. This result suggests that the smell-aware bug localization technique is effective not only because of the size of the source code but also because of other factors.
In the case of BugLocator, as shown in Fig. 8b, a similar result was observed. All of the improvements except for MAP are statistically significant: 8.6%, 4.3%, 3.2%, 4.8%, and 3.1% in relative comparison (0.036, 0.030, 0.025, 0.026, and 0.013 in absolute comparison) for Top 1, Top 5, Top 10, MRR, and MAP, respectively (Cliff's delta: 0.036, 0.030, 0.025, 0.033, and 0.022, all negligible). Note that the MAP difference was statistically significant when using the configuration c2. This result suggests that our technique improves even a technique that uses information about past bug reports.
However, in the cases of BRTracer, BLUiR, and AmaLgam in Figs. 8c, 8d, and 8e, the improvements are statistically significant in all the metrics except MAP. Specifically, the improvement in MAP for BRTracer is only 1.7%, which is the lowest improvement among all the techniques.
In conclusion, optimization of the configuration of the smell-aware bug localization technique can improve state-of-the-art bug localization techniques.

Internal Validity
In this study, the weight was assigned the optimal value to maximize each accuracy metric for each version. This is intended to avoid the possibility of failing to observe the effect of smells because of the effect of the weight choice on the accuracy. However, in RQ 1 and RQ 4 , we also discussed the extent to which the smell-aware bug localization technique is superior to the baseline. Therefore, a remaining threat is that the optimal weight values used here differ from the values practical in reality. In particular, such cases are likely to occur in versions with a small number of bug reports because the weight values might be biased toward those bug reports. Therefore, we conducted our experiments excluding versions with fewer than five bug reports. This exclusion prevented the weight value from being over-optimized in versions with few bug reports.
Another threat is the accuracy of the output of Bench4BL, which we used in our study. As mentioned in Section 5.3.4, we mitigated this threat by excluding inconsistent results from the output of Bench4BL. However, the validity of the output depends on the quality of Bench4BL, even for the consistent results. Although we excluded two bug localization techniques in Bench4BL as shown in Section 5.3.4, we have not verified the correctness of the remaining four techniques (and the two derived from them). The possibility of incorrectness in implementing these bug localization techniques still exists. However, we believe that these four implementations have a certain degree of correctness. As shown in Fig. 2, there is a relationship between the techniques in terms of the use of additional information, and it is expected that the more additional information is used, the higher the accuracy becomes. The accuracy attained in this study did not contradict this relationship, suggesting that the implementations may yield the expected accuracy.
Moreover, it should be noted that Bench4BL can only run file-level bug localization. Wang et al. [52] suggested that the results of most bug localization techniques at the file level still leave developers with a large amount of code to examine. Therefore, it might be beneficial to conduct the same experiment at the class or method level on different datasets. It is noteworthy that, although our previous work was conducted on method-level modules, a large-scale method-level benchmark dataset is not yet available. We continue the discussion on method-level bug localization in Section 7.4.
Finally, the accuracy of smell detection by inFusion may be a threat. Manually validating the smells detected by inFusion to exclude false positives and negatives remains a future task. We continue the discussion on false positives in Section 7.2.

External Validity
Although we mitigated threats to external validity by using a sufficient number of bug reports, we limited our attention to Java systems. Moreover, we only used open-source systems; therefore, performing similar studies on industrial systems may be beneficial. In addition, although we used the largest available bug localization dataset in this study, the optimal configuration presented in this paper might yield different results on other datasets. Moreover, we considered 16 types of code smells in this study, yet other types of code smells and other methods of calculating smell severity [30] are also available. In addition, we only used inFusion as the smell detector, despite the existence of other possible smell detectors that were not considered in this study. These alternatives may have a different effect on bug localization. Finally, because our bug localization experiments were conducted only at the file level owing to the limitations of Bench4BL, we would need to conduct experiments at other levels, e.g., the method level.

Conclusion Validity
Although we performed statistical tests (Wilcoxon signed-rank tests) and confirmed statistical significance between bug localization techniques, the effect sizes computed via Cliff's delta were very small. This indicates the possibility that the significance came from the large sample set in our experiment and that the essential effect might be negligible. We discuss this point in Section 7.1.
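For concreteness, Cliff's delta can be computed directly from two samples of accuracy values. The sketch below is a minimal pure-Python illustration; the function name and the toy data are ours, not taken from the study:

```python
from itertools import product

def cliffs_delta(xs, ys):
    """Cliff's delta: (#pairs with x > y - #pairs with x < y) / (|xs| * |ys|).
    Ranges from -1 to 1; values near 0 indicate a negligible effect."""
    gt = sum(1 for x, y in product(xs, ys) if x > y)
    lt = sum(1 for x, y in product(xs, ys) if x < y)
    return (gt - lt) / (len(xs) * len(ys))

# Toy example: two slightly shifted accuracy distributions may yield a
# statistically significant paired test on a large sample, yet only a
# small delta.
baseline = [0.40, 0.41, 0.42, 0.43]
improved = [0.41, 0.42, 0.42, 0.44]
print(cliffs_delta(improved, baseline))  # → 0.3125
```

This is why a significant Wilcoxon test alone does not guarantee a practically large improvement: the delta measures how often one sample dominates the other, independently of the p-value.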

Accuracy of Bug Localization Techniques
Even though our results improved significantly by using code smell information in combination with the bug localization technique, one might consider the improvements to be small, i.e., relative improvements of 10.4-23.0%. Indeed, all of the effect sizes computed via Cliff's delta were small. However, we argue that these improvements are meaningful in the context of bug localization. As shown in Fig. 8, a comparison of the accuracy of each baseline technique reveals that the improvements are on the same scale. For example, even though BugLocator improved on rVSM, its MAP improved by only 6.5% in relative comparison (from 0.391 to 0.417; 0.025 in absolute comparison). This means that improving bug localization accuracy is generally difficult, even when additional information sources, such as past bug reports or history information, are utilized. Smell-aware bug localization can significantly improve the accuracy of state-of-the-art bug localization techniques by a similar amount. In addition, it is noteworthy that the smell-aware bug localization technique uses only the source code and does not require additional information. Therefore, we suggest that source code characteristics, such as code smells, should be considered when performing bug localization.

Accuracy of Detected Smells
In some cases, smell detectors might produce false positives. For example, a parser class detected as a God Class is not necessarily problematic because its scope is generally large, and refactoring it might even reduce the comprehensibility of the class [53,54]. Several existing studies have reported the accuracy of smells detected by inFusion [55,56,57], but the reported accuracy varies widely and covers only a small subset of smell types.
To mitigate this threat, the authors manually verified sampled smell instances detected by inFusion. We followed the false-positive catalog by Fontana et al. [53,54]. For the seven smell types listed in the catalog together with a false-positive detection strategy (Blob Class, Data Class, God Class, Blob Operation, Feature Envy, Message Chains, and Shotgun Surgery), we randomly selected one instance from each of the five projects with the highest total number of smell instances, resulting in five instances per type and 35 instances in total. For each instance, two of the authors independently judged whether it met the condition to be regarded as a false positive according to the false-positive detection strategy in the catalog. When the two authors' decisions conflicted, they discussed the instance to reach a consensus. As a result, four false positives out of 35 were identified, yielding a precision of 0.89. Although the number of extracted samples was very small, this result suggests that a certain percentage of the smells used in our study were correct instances. We conclude that false positives may have little effect on the results of this study.
However, because the process of identifying false positives was performed by the authors, who are not the main developers of the projects used in this study, we cannot ensure the completeness of the identified results. Furthermore, our sampling approach cannot confirm the recall of the detected smells. Both remain threats to validity.

Distribution of Weight Values
We studied the distribution of the optimal weight value for each system. We investigated 309 systems using MAP as the evaluation metric under the optimal configuration obtained from RQ3, i.e., smell granularity at both the class and method levels, aggregation by the existence of smells, and well-selected smell types. Note that in 85 of the 309 systems, multiple weight values maximized the MAP value, as in DATAMONGO 1.1.2 shown in Fig. 7. For simplicity of analysis, we excluded these systems and selected the remaining 224 systems as the target of the subsequent investigation.
The systems were broadly divided into two categories: those that could utilize smell-based scores effectively and those for which the detected smells did not work at all. On the one hand, in 49 of the 224 systems, setting the weight to 0 produced the optimal ranking. This means that any blending of smell-based scores reduced the accuracy of the resulting rankings; i.e., the use of smells did not improve the accuracy for these systems. On the other hand, in the remaining 175 systems, blending in the smell-based score improved the ranking compared with the base IR-based bug localization using VSM. Figure 9 shows a histogram of the distribution of the optimal weight values for these 175 systems. The average optimal value was 0.215, and all values were less than 0.5. Thus, for the systems to which the smell information contributed positively, weakly blending the smell-based score with the IR-based score tends to improve the ranking. Although this analysis is limited by the candidate weight values we examined, we consider this average value to be representative. Predicting an appropriate weight, for example with a machine learning technique, could contribute to further bug localization improvement. For instance, an appropriate weight for a project version might be computed from the results obtained for past versions of the project. Note that such an approach would need to deal with the size variation across versions and with versions for which smell-aware bug localization is not suitable, as studied in this section.

Application to Method-Level Bug Localization
In our previous paper [7], we studied not only class-level but also method-level bug localization. However, the study in this paper focused on the class level because Bench4BL is file based, which means it does not provide bug localization at the method level, and no bug localization benchmarking framework supports applying bug localization techniques at the method level.
As a preliminary confirmation of whether the best configurations obtained in this study are effective for method-level bug localization, we manually adapted four systems used in the previous paper to the Bench4BL framework and obtained method-level results. We obtained the method-level matching results by converting source code snapshots to method-level ones using FinerGit [58]. The results are shown in Fig. 10. As in Fig. 7, the plots show the results using the two best configurations and the original method-level configuration, with VSM and rVSM as the baseline IR techniques. As the plots show, using the best configurations improved the MAP for all systems with both VSM and rVSM. Although a detailed study of method-level bug localization is left for future work, these results suggest the applicability of the proposed smell-aware bug localization technique at the method level.

Conclusion
In this study, we replicated the work conducted with our previous smell-aware bug localization technique on a large-scale dataset and confirmed significant performance improvement. We proposed a generalized smell-aware bug localization technique to derive the optimal configurations for code smell information. We found that the optimal configuration uses a granularity that reflects both class- and method-level smells, together with either the maximum severity when aggregating carefully selected types of code smells or the mere existence of a very limited set of smell types. Finally, we combined our proposed technique with different baseline techniques and found that their performance improved significantly. These results suggest that code smells can effectively improve existing bug localization techniques without the need for additional information.
Code smell detection generally requires no inputs beyond those needed for bug localization. Although the improvement was slight, applying the smell-aware approach can improve bug localization and is applicable in many situations. Our study also revealed situations in which the effect of the smell-aware approach was negative. It is therefore desirable to develop techniques that use smell information more effectively so as not to lose the accuracy of the baseline bug localization technique. Machine learning or other data fusion techniques might be more effective than the simple linear combination used in this paper.
In the future, we also aim to determine the weight value for each version of a specific project, for example, based on the optimal values of previous versions or by using a machine learning approach.
An appendix including the experimental materials is available online [41].