Empirical Study: How Issue Classification Influences Software Defect Prediction

Software defect prediction aims to identify potentially defective software modules to better allocate limited quality assurance resources. Practitioners often do this by utilizing supervised models trained using historical data. This data is gathered by mining version control and issue tracking systems. Version control commits are linked to the issues they address. If the linked issue is classified as a bug report, the change is considered bug fixing. The problem is that issues are often incorrectly classified within issue tracking systems, which introduces noise into the gathered datasets. In this paper, we investigate the influence issue classification has on software defect prediction dataset quality and resulting model performance. To do this, we mine data from 7 popular open-source repositories and create issue classification and software defect prediction datasets for each of them. We investigate issue classification using four different methods: a simple keyword heuristic, an improved keyword heuristic, the FastText model, and the RoBERTa model. Our results show that using the RoBERTa model for issue classification produces the best software defect prediction datasets, containing on average 14.3641% mislabeled instances. SDP models trained on such datasets outperform those trained on SDP datasets created using other issue classification methods in 65 out of 84 experiments, with 55 of these differences being statistically relevant. Furthermore, in 17 out of 28 experiments we could not show a statistically relevant performance difference between SDP models trained on RoBERTa derived software defect prediction datasets and those created using manually labeled issues.


I. INTRODUCTION
Software defect prediction (SDP) is a popular topic in software engineering research. The aim of SDP is to identify potentially defective software modules. This information can be used by quality assurance (QA) teams to better allocate their limited resources and thus further improve the quality of developed software products [2]. This is an important activity since researchers estimate that software defects and poor QA have cost the US industry $2.08 trillion in 2020 alone [1].
The associate editor coordinating the review of this manuscript and approving it for publication was Porfirio Tramontana.
SDP is a highly active research field. However, some of the obtained results could be misleading due to the datasets used to conduct the studies. Researchers have pointed out that datasets created by mining software repositories might contain a substantial amount of noise. When creating an SDP dataset, researchers inspect commit messages in an effort to identify links to issues defined in the issue tracking system. Classes assigned to issues are then used to derive the labels of the created SDP datasets. Issue classes specify the type of issue, for example, a modification request, a feature request, or a bug report.
In this process, there are two situations where noise can arise. First, it can arise when commits are not successfully linked to issues. Second, it can arise when issues in the issue tracking system are incorrectly classified. The introduced noise can result in biased and unreliable results. Researchers rely on issue classification. For example, Zimmermann et al. [33] constructed the Eclipse Bug Data Repository identifying defective code by searching for bug report references and keywords such as ''fixed'' and ''bug''. A similar approach was used by Cubranic and Murphy [88], Fischer et al. [89], Sliwerski et al. [8] and Bachmann et al. [90].
Herzig et al. [9] examined a substantial number of issues and found 33.8% of all bug reports were incorrectly classified. They concluded that users are the ones performing issue classification and are the source of incorrect classifications. If the software does not meet user expectations, they tend to raise an issue, and classify it as a bug report. However, users often lack technical knowledge, and insight into project details, which results in incorrect classifications. Developers could correct the issue classification, but there is no incentive for them to do so.
Kim et al. [10], Seiffert et al. [15], Pandey and Tripathi [16], and Tantithamthavorn et al. [17] have all pointed out that noise in the resulting dataset can lead to severe degradation of model performance. Moreover, Khan et al. [18] showed that noise filters struggle to mitigate the problem once noise is present in the dataset.
Antoniol et al. [19] also found that issues marked as bugs might not actually refer to bugs. They proposed using issue descriptions to classify issues and thus reduce the amount of noise in derived defect prediction datasets. In their investigation they used traditional models such as Decision Trees, Naive Bayes, and Logistic Regression on issues coming from three Java based repositories. They showed that such models can achieve an F1 score of 0.70. Please note that this metric is not directly given in their paper, instead we calculated it from the evaluation data they provided in the paper.
Since their paper was published, advanced Natural Language Processing (NLP) models, such as BERT [83], have been developed. Researchers [11], [12], [13], [14] have investigated using NLP models for issue classification and obtained promising results. However, none of this research is focused on issue classification in the context of SDP. To the best of our knowledge, no research has been done investigating the impact of issue classification on SDP dataset quality and resulting model performance. In this paper, we investigate the benefits of using such models during SDP dataset creation. Specifically, we examine the effect issue classification quality has on the noise present in the resulting SDP dataset, as well as its influence on the final SDP model performance.
More specifically, we created new datasets by mining popular open-source repositories. We mined data from 7 repositories, collecting all commit data and all issue tracking data. For each commit in each dataset, we identify issue references. Issues referenced by at least one commit are considered issues-of-interest (IOI). Commits containing at least one issue reference are considered commits-of-interest (COI). For every file mentioned in version control, we identify all commits that modified it. If all commits modifying a file are COI, meaning all of them reference an IOI, then the file is considered a file-of-interest (FOI). Each version of a file, meaning its state after a commit that modified it, is decoded, stripped of all comments, and encoded using GraphCodeBERT. These encodings are used to construct semantic features, which, together with other process and code complexity features, are used to construct a software defect prediction instance. The instance is considered defect prone if at least one commit in its history references a bug related issue. For each repository, at least 1000 IOI are manually labeled and used to create a golden SDP dataset. We then investigate the amount of noise induced into the SDP dataset if the issues are not manually labeled, but instead automatically classified using a keyword matching heuristic (KWM), an improved keyword matching heuristic (IKWM), a FastText model and finally a RoBERTa model. For each resulting SDP dataset we train Logistic Regression, Decision Trees, Naive Bayes and K-Nearest Neighbours models and investigate the impact of noise in the SDP dataset on their performance.
In simpler terms, we constructed file level SDP datasets. The labels of these datasets are determined based on issue classification. If a source code file is edited by a commit linking to a bug related issue, then that file is considered defect prone; otherwise, it is considered not defect prone. Depending on the quality of the issue classification, we investigate the amount of noise induced in the resulting SDP datasets and the effect this has on SDP model performance. We tested four different issue classification methods: a KWM model, an IKWM model, a FastText model and a RoBERTa model. As SDP models, we trained Logistic Regression, Decision Trees, Naive Bayes and K-Nearest Neighbours models. These models predict whether a source code file is defect prone or not. We investigate how their performance is impacted by the amount of noise in the SDP dataset, where the amount of noise is a direct consequence of the issue classification quality.
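The labeling rule and the notion of mislabeled instances described above can be illustrated with a minimal sketch. It assumes a simplified file level view (the study builds one instance per file version), and all names such as `label_files`, `noise_rate` and the toy data are hypothetical, not taken from the study's code. Noise is computed here as the percentage of instances whose derived label differs from the label derived from manually classified issues.

```python
from typing import Dict, List, Set


def label_files(file_commits: Dict[str, List[str]],
                commit_issues: Dict[str, Set[int]],
                bug_issues: Set[int]) -> Dict[str, bool]:
    """Mark a file as defect prone if any commit in its history
    references an issue classified as a bug."""
    return {
        path: any(commit_issues[sha] & bug_issues for sha in shas)
        for path, shas in file_commits.items()
    }


def noise_rate(golden: Dict[str, bool], derived: Dict[str, bool]) -> float:
    """Percentage of instances whose derived label differs from the golden one."""
    mismatches = sum(golden[key] != derived[key] for key in golden)
    return 100.0 * mismatches / len(golden)


# Toy data (hypothetical): which commits touched which file,
# and which issues each commit references.
file_commits = {"a.py": ["c1", "c2"], "b.py": ["c3"],
                "c.py": ["c2", "c3"], "d.py": ["c4"]}
commit_issues = {"c1": {1}, "c2": {2}, "c3": {3}, "c4": {4}}

manual_bugs = {1, 4}      # issues labeled as bugs by hand
predicted_bugs = {1, 3}   # issues labeled as bugs by some classifier

golden = label_files(file_commits, commit_issues, manual_bugs)
derived = label_files(file_commits, commit_issues, predicted_bugs)
print(noise_rate(golden, derived))
```

In this toy case two misclassified issues are enough to flip the labels of three of the four files, which shows how issue classification errors propagate into the SDP dataset.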
Our results show that using the RoBERTa model for issue classification produced the fewest mislabeled instances in SDP datasets compared to other approaches. These datasets contain an average of 14.3641% mislabeled instances. We compared, and statistically validated, the performance of models trained on such SDP datasets and on those created using other methods. In 65 out of 84 experiments models trained on SDP datasets created using RoBERTa had superior performance and they had statistically relevant superior performance in 55 out of 84 experiments which is equal to 65.4761% of the time. Further, we compared the performance of models trained on RoBERTa derived SDP datasets to those trained on SDP datasets derived from manual issue classification. In 17 out of 28 experiments we could not show a statistically relevant performance difference.
In summary, this work makes the following contributions:
• Fully mined commit and issue data for 7 popular open-source repositories. Manually labeled issue classification datasets consisting of at least 1000 issues for each repository and derived software defect prediction datasets.
• Experiments which investigate the performance of a simple keyword matching heuristic (KWM), an improved keyword matching heuristic (IKWM), the FastText model, and the RoBERTa model on issue classification and the resulting impact on SDP dataset noise levels and model performance.
The rest of this paper is organized as follows. Sect. II gives a more detailed introduction to SDP and an overview of related work. Sect. III describes how data is collected and how data of interest is identified. Sect. IV describes how issue classification datasets are created and how issue classification models are developed. Sect. V describes how SDP datasets are created and which models are used for final SDP classification. Sect. VI presents the obtained results. Sect. VII outlines threats to the validity of this study. Finally, Sect. VIII concludes this research with the authors' final remarks.

II. BACKGROUND AND RELATED WORK
In this section we provide a general overview of software defect prediction (SDP) and an introduction to related work important for the topic of this research paper. The first subsection presents the basics of SDP. The second subsection presents often used SDP metrics and discusses prediction granularity, common datasets, and common approaches to tackling the SDP problem. The third subsection introduces related work focusing on noise in SDP datasets. The fourth subsection presents Natural Language Processing (NLP) related work and previous work on issue classification. Finally, the last subsection situates this manuscript within the existing body of work.

A. SOFTWARE DEFECT PREDICTION
The field of software defect prediction (SDP) was started in 1971 by Akiyama when he published a paper investigating the relation between code complexity and the number of software defects [28]. He used Lines of code (LOC) as a measure of code complexity and showed that there exists a positive correlation between the two.
Through the years the field has developed and branched out. Today, SDP can be divided into within-project prediction and cross-project prediction. Utilizing project data to predict defects in that same project is called within-project prediction [91]. This can be further divided into within-version-within-project prediction and cross-version-within-project prediction, depending on whether the train and test data come from the same version or different versions of that project [91]. The first studies in the field were based on within-project prediction. However, researchers pointed to potential benefits of training a model on data derived from one project and using it to detect defects in another project [50]. This would allow completely new projects to use defect-prediction models and thus improve their QA. This approach is called cross-project prediction [72]. Zimmermann et al. [50] showed that cross-project prediction is a challenging task. They showed that it is not easy to identify on which project a model should be trained in order to perform well on another project. The basic premise of machine learning states that the training and test data are sampled from the same distribution. The fact that models trained on one project data do not perform well on another project implies that data distributions between projects differ to an extent that the basic premise of machine learning is no longer satisfied. Notable approaches for improving cross-project prediction performance are: Metric compensation [58], [59], Nearest neighbor filtering [60], Meta learning [57] and Transfer component analysis [53], [54], [55], [56]. Recently, researchers have proposed methods which would allow using projects described with different metrics to achieve cross-project defect prediction, thus increasing the amount of available train data. This is called heterogeneous defect prediction [51], [52].
GraphCodeBERT [27] is of special interest to us. GraphCodeBERT is a BERT-based model for generating code representations based on code structure and data flow. It is a multi-lingual model supporting the following programming languages: Python, Java, JavaScript, PHP, Ruby and Go. In this paper we use this model to generate semantic features for created SDP instances.
Some of these metrics can be applied at various levels of granularity and some are specific to a certain level. Researchers have performed SDP on many distinct levels: component [71], file [68], class [70], method [68] and change level [69]. Change level prediction is also called Just-In-Time (JIT) defect prediction. More granular prediction facilitates faster defect localization while less granular approaches are better suited for QA resource allocation [2].

C. NOISE IN SDP DATASETS
SDP datasets are created by mining version control and issue tracking systems. Each commit in the version control system is examined and an attempt is made to link it to an issue in the issue tracking system. If a link is found and the issue is marked as a bug report the state of the code prior to the changes can be considered as defective while the code after the changes can be considered as non-defective. Alternatively, this code in all its states can be considered defect prone. Another approach is to search for an earlier commit which caused the defective code and label it as a defect inducing commit. This approach is used for creating JIT defect prediction datasets. A link is considered found if a number matching an issue number is contained in the commit message, and certainty about the link is increased if it is near important keywords such as ''bug'' or ''fix'' [8], [33], [88], [89].
This process has two critical points at which noise can be introduced into the resulting dataset. First, it can be introduced if the links between a commit and an issue cannot be established. The second point where noise can be induced into the created dataset is the incorrect classification of issues in the issue tracking system [17].
Bird et al. [5] found that the number of fixed bugs does not match the number of bug issues leading to a high false negative rate. This is a consequence of mining being based on keyword matching [6], [7], [8], [33], [88], [89], [90], while developers often do not write specific keywords.
Wu et al. [29] proposed an approach called ReLink in order to alleviate the problem of missing issue links. They manually inspected links with explicit bug IDs in change logs and observed that the links exhibit certain commonalities. Based on these, they proposed an automatic link recovery algorithm which would automatically learn criteria of features from explicit links to recover missing links.
This motivated other studies such as MLink from Nguyen et al. [30], where the authors propose a multi-layered approach that considers both textual and source code features of modified code. The approach is capable of learning relations between terms in bug reports and the entity names in source code, which allows it to establish bug-to-fix links even when there is not much textual similarity between the two.
Herzig et al. [9] manually examined more than 7,000 issue reports from the bug databases of five open-source repositories and found 33.8% of all bug reports were incorrectly classified and an average of 39% of files marked as defective never had a bug.
Kim et al. [10] found that prediction performance decreases significantly when the dataset holds 20%-35% of both false positive (FP) and false negative (FN) and proposed a noise detection and elimination algorithm. Seiffert et al. [15] did an extensive study of class imbalance and dataset noise effects. Their results correspond to those presented by Kim et al. [10].
Pandey and Tripathi [16] performed an empirical study focused on dealing with noise and class imbalance issues in software defect prediction. They show that if a dataset contains 10%-40% of incorrectly labeled instances the true positive rate (TPR) and true negative rate (TNR) are reduced by 20%-30% and receiver operating characteristic (ROC) values are reduced by 40%-50%.
Tantithamthavorn et al. [17] investigate the effects of noise caused by incorrect issue classification on the performance of SDP models. In their study they used 3931 manually labeled issues derived from Apache Jackrabbit and Lucene systems. Based on the obtained results they point out that incorrect issue classification does not occur completely randomly. They show that non-random noise does not degrade the model precision but can degrade recall by 32%-44%.
Khan et al. [18] investigate the effects of 9 different noise filters for dealing with incorrect instance labels. Instead of randomly generated noise, they use a dataset with clean labels annotated by experts and noisy labels obtained by heuristics. They observe that noise filters mostly struggle to improve the performance over noisy data.
Antoniol et al. [19] found that bug reports might refer to perfective and adaptive maintenance, refactoring, discussions, or requests for help. They carried out their experiment on three repositories: Mozilla, Eclipse, and JBoss. They showed that decision trees, naive Bayes classifiers, and logistic regression models can classify issues based on their descriptions, achieving an F1 score of 0.70.

D. NATURAL LANGUAGE PROCESSING AND ISSUE CLASSIFICATION
Earlier work in the area of Natural Language Processing (NLP) focused on word2vec models [74], convolutional models [75], [76], [77], recurrent models [78], [79] and, more recently, attention-based models [80], [81]. Substantial earlier work has shown that models pre-trained on large corpora are beneficial for text classification and other NLP tasks [82]. Using pre-trained models avoids training a model from scratch, thus speeding up the fine-tuning process and producing higher performance models than those trained only on one specific task.
VOLUME 11, 2023
FastText [73] is a word embedding method which is an extension of the word2vec model. It is considered a bag of words model. Instead of learning each word directly, it represents each word as a set of n-grams. In its original publication it was shown to be much faster than the deep models of that time, with comparable performance.
The current widely adopted language models are BERT [83] and RoBERTa [84]. Both models share the same architectural design, using the encoder part of the multi-layer bidirectional Transformer architecture [85]. BERT was pre-trained on large text corpora: the BooksCorpus (800M words) [86] and English Wikipedia (2,500M words). RoBERTa was pre-trained on these and additional corpora. The Cloze task [87] inspired the masked language model (MLM) objective used to train both models. BERT combines it with the next sentence prediction task, which RoBERTa omits. The method of using a large pre-trained model and fine-tuning it on downstream tasks has led to breakthroughs in several natural language understanding tasks [83], [84].
Wang et al. [11] used BERT to recommend GitHub labels based on issue descriptions. Since issue creators often do not label issues, it is left to repository maintainers to label them which can become very time consuming, thus accurate automatic labeling would help reduce the amount of necessary manual labor.
Herbold et al. [12] stress that reported issue types often do not match the description of the issue. They attempt to improve upon existing issue classification by incorporating manually specified knowledge about issues.
Siddiq and Santos [13] propose TICKET TAGGER which can be used to automatically assign labels to GitHub issues. Again, the main goal is to reduce the amount of necessary manual issue labeling.
Siddiq et al. [14] apply BERT to GitHub issue classification and compare its performance to FastText. They point out that developers can find it difficult to manually label issues and thus would benefit from automatic issue labeling.
To the best of our knowledge, the approach of using RoBERTa for issue classification in SDP dataset creation remains unexplored. Although it has shown promising results in issue classification, it is unknown whether its performance reduces the noise levels of resulting SDP datasets to acceptable levels.

III. DATA COLLECTION
To acquire representative data of open-source repositories we have decided to extract information from GitHub. GitHub provides Internet hosting for software development and version control. At the time of writing (November 2022) it hosts 83 mil. developers and 28 mil. public repositories.
For mining repository data, we wrote a script using the PyGithub Python package (https://pypi.org/project/PyGithub/). The package is a wrapper, enabling simple usage of the GitHub API. We queried the 10 most popular repositories which have at least 5 good first issues. Maintainers of open-source repositories mark certain issues as good first issues, thus indicating that they are simple enough to be addressed by someone without previous knowledge about the project. Such issues are considered a good first step to start contributing to a repository. The presence of such issues implies an active project with a lively community, so we included it as a constraint. We filtered out repositories which are written in a programming language not supported by GraphCodeBERT, have fewer than 50 stars or fewer than 100 issues. Again, these constraints are included hoping to obtain only active and relevant repositories.
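The filtering criteria listed above can be written as a single predicate. The sketch below is only an illustration: it works on plain dictionaries rather than on PyGithub objects, and the field names (`language`, `stars`, `issues`, `good_first_issues`) and the example repositories are hypothetical.

```python
# Languages supported by GraphCodeBERT, as listed in Sect. II.
SUPPORTED_LANGUAGES = {"Python", "Java", "JavaScript", "PHP", "Ruby", "Go"}


def keep_repository(repo: dict) -> bool:
    """Return True if a repository satisfies all selection criteria."""
    return (
        repo["language"] in SUPPORTED_LANGUAGES
        and repo["stars"] >= 50
        and repo["issues"] >= 100
        and repo["good_first_issues"] >= 5
    )


# Hypothetical candidates, not the repositories used in the study.
candidates = [
    {"name": "alpha", "language": "Go", "stars": 9000,
     "issues": 4200, "good_first_issues": 12},
    {"name": "beta", "language": "Rust", "stars": 15000,
     "issues": 8000, "good_first_issues": 30},
    {"name": "gamma", "language": "Python", "stars": 40,
     "issues": 500, "good_first_issues": 6},
]

selected = [repo["name"] for repo in candidates if keep_repository(repo)]
print(selected)
```

Here `beta` is rejected because of its main language and `gamma` because of its star count, which mirrors how a list of queried repositories can shrink after filtering.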
Information gathered for each repository is shown in Table 1. From each repository which satisfies the previously listed criteria, we pulled issue data shown in Table 2 and commit data shown in Table 3. For each file changed by a commit we pulled file data shown in Table 4. All mined data is stored in JSON format. Using the described procedure, we ended up with 7 repositories totaling 299773 issues and 153664 commits. Table 5 presents an alphabetically sorted list of all mined repositories, their main programming language, and the number of issues and commits from each repository. Figure 1 shows the distribution of programming languages of mined repositories.
We then analyzed each commit in each repository by applying the regular expression shown in Equation 1 to the commit message. In this way, we identify issue references. The regular expression searches for either a direct issue reference, e.g., #12345, or a link to the issue, e.g., /issues/12345 or /pull/12345. The reason we search for both issues and pull is that we observed that GitHub uses both links for issues, with the only difference being that issues links were started by someone raising an issue and pull links were started by someone raising a pull request.
(?:#|/issues/|/pull/)(\d+)    (1)
Issues referenced by at least one commit are considered issues-of-interest (IOI). Commits having at least one issue reference in their commit message are considered commits-of-interest (COI). Some issues referenced in commit messages were not mined because they no longer existed on GitHub. We removed such issues from the IOI and commits mentioning them from the COI. We then inspected all commits and made a list of all files, distinguishing them by name. For every file, we identify all commits modifying that file. If all commits modifying a file are COI, meaning all of them reference an issue, the file is considered a file-of-interest (FOI). We reduce the identified set of FOI by keeping only those with the file extensions .py, .php, .js, .java, .go and .rb, as those are the languages supported by GraphCodeBERT. Note that this is a wider set of programming languages than those associated with the mined repositories. The repository language is the main language used in the repository, but there may still be files written in other languages. This is why we consider a wider range of languages than just the main repository language.
GitHub provides the source of each version (state after a commit modifying the file) of each FOI in a base64 encoded format. We decoded the remaining versions of the files and removed all comments using the pyparsing library. Sometimes, consecutive versions of a single file are identical, which means that the changes made between different versions were not made to the source code. Since we only want to consider changes made to the source code, we remove duplicates by keeping only the earlier versions of the file. After performing these steps, the set of IOI is reduced to only those issues that are referenced by COI of the final FOI. Table 6 shows the final number of IOI, COI and FOI for each repository.
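The identification of IOI, COI and FOI can be sketched as follows. This is a minimal illustration under stated assumptions, not the mining script used in the study: commits are plain dictionaries with hypothetical keys `sha`, `message` and `files`, and `existing_issues` stands for the set of issue numbers that could still be mined from GitHub. Decoding, comment removal and duplicate version removal are not shown.

```python
import re
from collections import defaultdict

# Regular expression from Equation 1.
ISSUE_REF = re.compile(r"(?:#|/issues/|/pull/)(\d+)")
# File extensions of languages supported by GraphCodeBERT.
EXTENSIONS = (".py", ".php", ".js", ".java", ".go", ".rb")


def find_issue_refs(message):
    """Return the set of issue numbers referenced in a commit message."""
    return {int(number) for number in ISSUE_REF.findall(message)}


def identify(commits, existing_issues):
    # COI: commits with at least one reference, none of which is missing.
    coi = {}
    for commit in commits:
        refs = find_issue_refs(commit["message"])
        if refs and refs <= existing_issues:
            coi[commit["sha"]] = refs

    # Collect, for every file name, all commits that modified it.
    file_commits = defaultdict(list)
    for commit in commits:
        for path in commit["files"]:
            file_commits[path].append(commit["sha"])

    # FOI: supported extension and every modifying commit is a COI.
    foi = {path for path, shas in file_commits.items()
           if path.endswith(EXTENSIONS) and all(sha in coi for sha in shas)}

    # Final IOI: issues referenced by COI of the final FOI.
    ioi = set()
    for path in foi:
        for sha in file_commits[path]:
            ioi |= coi[sha]
    return ioi, coi, foi


commits = [
    {"sha": "c1", "message": "Fix crash, closes #12", "files": ["a.py", "b.js"]},
    {"sha": "c2", "message": "Refactor", "files": ["b.js"]},
    {"sha": "c3", "message": "See /pull/34 and #99", "files": ["c.go"]},
    {"sha": "c4", "message": "Docs for /issues/56", "files": ["README.md", "a.py"]},
]
ioi, coi, foi = identify(commits, existing_issues={12, 34, 56})
print(sorted(ioi), sorted(coi), sorted(foi))
```

In the toy data, `b.js` is not an FOI because one of its commits has no issue reference, and `c.go` is excluded because its only commit mentions an issue that no longer exists.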
From the data presented in Table 5 and Table 6 we can observe how some projects, such as nodejs, systematically reference issues in commit messages allowing many IOI, COI and FOI to be identified, while others, such as kubernetes, can hardly be connected.

IV. ISSUE CLASSIFICATION
In this section we describe how datasets for issue classification are created, and how issue classification models are trained based on the collected data of interest. Figure 2 shows the simplified general overview for issue classification. In the upper part of the figure, we see issues, each consisting of a title and a description. Using these titles and descriptions, in addition to manually determined issue labels, we can train an issue classification model. In this study we consider several issue classification models: a KWM model, an IKWM model, a FastText model and a RoBERTa model. Using a trained issue classifier, we can classify new issues. This is depicted with the arrows coming out of the classifier and pointing to specific issues. The classifier assigns either a non-defective class, depicted using a small green rectangle, or a defective class, depicted using a small red rectangle. Of course, in practice the issues used to train the model differ from those to which the model is applied. In the lower part of the figure, we see the classified issues linking to commits which modify a yellow, blue, and red file. This part is not relevant for issue classification itself but helps understand how issues are connected to commits and consequently to source code files from which the final SDP datasets are constructed. It is important to understand this connection as the labels of SDP instances are derived from issue classification. The first subsection describes issue classification dataset creation, and the second subsection describes issue classification model development.

A. DATASET
After identifying the IOI, we manually labeled at least 1000 IOI per repository. If a repository has less than 1000 IOI, we labeled all of them. If it has more than 1000 IOI, we selected the first 1000 IOI sorted by issue number, and then added any additional IOI that were needed for FOI influenced by the first 1000 IOI.
To manually inspect and label IOI, we developed a custom web application. The application accepts JSON data generated by the mining script and displays the repository name, issue title, issue description, labels associated with the issue, and whether the issue has a pull request. The application allows users to navigate through repositories and issues, label issues, delete issues, and download data in the same JSON format as the uploaded data. Figure 3 shows the application main view.
We allow issues to be labeled with one of four classes: Feature, Modification, Bug and Other. Feature denotes requests for new functionality. Modification denotes requests for change of existing functionality. Bug denotes reports of defective behavior of the software system. Other denotes questions, discussions, and other requests we could not place in any of the previous categories. However, in this study we are only interested in distinguishing between bug related issues and non-bug related issues, thus we treat Feature, Modification and Other as a single label meaning Non-Bug. The labeling tool was made more general as we imagine other researchers might want to further distinguish issues.
To improve the labeling quality, three people labeled the IOI and the majority vote classification was taken for each issue. The contents of the resulting datasets for each repository are shown in Table 7. Labeled issues of interest are referred to as LIOI. It might be tempting to use the labels on GitHub issues as a way to classify the issues, but this approach has some problems. One problem is that the labels are not standardized or required, so they can vary significantly from one project to another, and some projects might not have any labels at all. Additionally, the labels are often used to indicate which part of the project is affected by the issue, rather than the type of issue itself.
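The majority vote over three labelers can be sketched as below. The sketch assumes that the four classes are collapsed into Bug and Non-Bug before voting, which guarantees a majority with three votes. The text does not state whether the collapse happened before or after voting, so this ordering is an assumption, and the function names are hypothetical.

```python
from collections import Counter


def to_binary(label):
    """Collapse Feature, Modification and Other into a single Non-Bug label."""
    return "Bug" if label == "Bug" else "Non-Bug"


def majority_label(votes):
    """Return the binary label chosen by most labelers."""
    counts = Counter(to_binary(vote) for vote in votes)
    return counts.most_common(1)[0][0]


print(majority_label(["Bug", "Feature", "Bug"]))
print(majority_label(["Feature", "Modification", "Bug"]))
```

In the second call the three labelers disagree on the four-class label, yet two of them agree that the issue is not a bug, so the binary vote is still decisive.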

B. MODELS
This subsection describes the development of models used for issue classification. We first describe a simple keyword matching heuristic (KWM), then an improved keyword matching heuristic, then an application of the FastText model and finally an application of the RoBERTa model.

1) SIMPLE KEYWORD MATCHING
The base model for issue classification is a simple keyword matching heuristic. It is based on a case insensitive search for bug or fix keywords. The search for these keywords was applied to the issue title and description.
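A heuristic of this kind fits in a few lines. The sketch below assumes substring matching (so a word such as "debugging" also matches), which the description above does not specify. The function name `kwm_is_bug` is hypothetical.

```python
import re

# Case insensitive search for the two keywords.
KEYWORDS = re.compile(r"bug|fix", re.IGNORECASE)


def kwm_is_bug(title, description):
    """Classify an issue as bug related if either keyword occurs
    in its title or description."""
    return bool(KEYWORDS.search(title) or KEYWORDS.search(description or ""))


print(kwm_is_bug("Crash on startup", "Found a BUG in the loader"))
print(kwm_is_bug("Add dark mode", "It would be nice to have"))
print(kwm_is_bug("Fix typo in docs", ""))
```

The last call returns a positive match for a documentation change, which illustrates why such a heuristic is expected to introduce noise.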

2) IMPROVED KEYWORD MATCHING
The first step towards improving the KWM heuristic is to determine which keywords imply that the issue is defect related. All text is transformed into lower case, all punctuation symbols are removed and a snowball stemmer [92] is applied to each word. Issue title and description are inspected word by word, considering only words exclusively consisting of alphabetic characters and longer than 2 letters. For each word, the number of times it appears is counted (tc). Furthermore, the number of descriptions it appears in is counted (dc). For every word which occurred in a bug report, we calculate the bug importance score as log(tc/dc) * (dc/bCnt). Similarly, for every word which occurred in a non-bug report, we calculate the other importance score as log(tc/dc) * (dc/oCnt). Finally, for each word we subtract the other importance score from the bug importance score and sort all the words by the resulting score. By doing this we get bug implying words at one end of the list and non-bug implying words at the other end of the list. Figure 4 shows a word cloud visualization of top bug implying words and Figure 5 shows a word cloud visualization of top non-bug implying words. The described algorithm is shown in Algorithm 1. Expert knowledge was used to select a subset of keywords from the obtained list, which is then used for defect related issue discovery. We inspected the list top to bottom and chose 8 meaningful words. This list of words includes: ''bug'', ''error'', ''fix'', ''issue'', ''line'', ''out'', ''not'' and ''test''.
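The scoring step can be sketched as follows. This is not Algorithm 1 itself but a simplified illustration with several assumptions: bCnt and oCnt are taken to be the number of bug and non-bug reports, tc and dc are counted separately within each class, and the snowball stemmer is omitted to keep the example free of external dependencies. The names `tokenize`, `importance` and `rank_words` are hypothetical.

```python
import math
import string
from collections import Counter


def tokenize(text):
    """Lower-case, strip punctuation, keep alphabetic words longer than 2 letters."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [word for word in text.split() if word.isalpha() and len(word) > 2]


def importance(docs):
    """Compute log(tc/dc) * (dc/n) per word, where n is the number of docs."""
    tc, dc = Counter(), Counter()
    for doc in docs:
        words = tokenize(doc)
        tc.update(words)        # total occurrences
        dc.update(set(words))   # number of documents containing the word
    n = len(docs)
    return {w: math.log(tc[w] / dc[w]) * (dc[w] / n) for w in tc}


def rank_words(bug_docs, other_docs):
    """Sort words by bug importance minus other importance, descending."""
    bug, other = importance(bug_docs), importance(other_docs)
    diff = {w: bug.get(w, 0.0) - other.get(w, 0.0) for w in set(bug) | set(other)}
    return sorted(diff.items(), key=lambda item: item[1], reverse=True)


bug_docs = ["Error: error when saving, error persists", "Crash with error on save"]
other_docs = ["Please add export feature, export to csv", "Add feature request"]
ranked = rank_words(bug_docs, other_docs)
print(ranked[0][0], ranked[-1][0])
```

Note that a word occurring exactly once in every document it appears in receives a score of zero under this formula, because log(tc/dc) is then log(1).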
We narrowed down the list of words we used to a small number because using too many words would have resulted in an excessively large number of possible combinations to check. For each repository, we identified the best combination of words to use when matching the issue title and the best combination to use when matching the issue description by testing (2^8)^2 = 65536 different combinations of keywords. We then used these identified keyword combinations and the previously described text pre-processing to improve our keyword matching method.
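The size of this search space can be reproduced with a short exhaustive search. The sketch below assumes that an issue is flagged when any selected keyword occurs in its title or description, and that combinations are ranked by a user-supplied `score` function; both are assumptions, since the text does not state the selection criterion.

```python
from itertools import combinations

KEYWORDS = ["bug", "error", "fix", "issue", "line", "out", "not", "test"]


def all_subsets(items):
    """Return every subset of items (2**len(items) of them) as frozensets."""
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]


def best_combination(issues, labels, score):
    """Try every (title keywords, description keywords) pair and keep the best.

    issues: list of (title_words, description_words) sets of pre-processed words.
    labels: list of booleans, True for bug reports.
    score:  hypothetical callable score(labels, predictions) -> float.
    """
    subsets = all_subsets(KEYWORDS)
    best = None
    for title_kw in subsets:
        for desc_kw in subsets:
            preds = [bool(title_kw & title) or bool(desc_kw & desc) for title, desc in issues]
            current = score(labels, preds)
            if best is None or current > best[0]:
                best = (current, title_kw, desc_kw)
    return best
```

With 8 keywords there are 2^8 = 256 subsets for the title and 256 for the description, which gives the 65536 pairs mentioned above.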

3) FastText
We use the FastText model implementation provided by the Python fasttext library. All text is transformed into lower case and cleaned by removing HTML tags, multiple consecutive spaces, newlines and tabs, replacing all hyperlinks with the keyword [link] and finally removing all non-alphanumeric characters. For each repository we train a separate FastText model treating the LIOI of that repository as the test set, and all LIOI from all other repositories as the train set. One could say that we are performing cross-project issue classification. On the train set we perform an 80/20 split taking 80% of the set as the final train set and 20% as a validation set. The model is then given 15 minutes to perform hyperparameter optimization, on an Intel(R) Core(TM) i7-7700 @ 3.60GHz CPU, and then trained with the best identified parameters.
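The pre-processing described above could look roughly like the following sketch. The regular expressions and function names are assumptions, and the sketch deliberately keeps the [link] placeholder intact when stripping non-alphanumeric characters, which is one possible reading of the described order of operations.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")


def clean_for_fasttext(text: str) -> str:
    """Lower-case, drop HTML tags, replace links with [link], strip other symbols."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)  # remove HTML tags
    # Clean the text between hyperlinks, then re-join it with the placeholder,
    # so that the brackets of [link] survive the symbol removal (assumption).
    parts = [re.sub(r"[^a-z0-9\s]", " ", part) for part in URL_PATTERN.split(text)]
    text = " [link] ".join(parts)
    return re.sub(r"\s+", " ", text).strip()  # collapse spaces, newlines, tabs


def to_fasttext_line(is_bug: bool, text: str) -> str:
    """Format one example in the `__label__<name> <text>` form used by fastText."""
    label = "bug" if is_bug else "other"  # label names are illustrative
    return f"__label__{label} {clean_for_fasttext(text)}"
```

Lines produced this way can be written to train and validation files and passed to `fasttext.train_supervised` together with its `autotuneValidationFile` and `autotuneDuration` parameters, where a duration of 900 seconds would correspond to the 15 minute budget. Whether the study used exactly these options is an assumption.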

4) RoBERTa
Furthermore, for issue classification we use the RoBERTa model [84]. The name RoBERTa stems from Robustly Optimized BERT Pretraining Approach and BERT stands for Bidirectional Encoder Representations from Transformers. The architecture of this model is a stack of 12 transformer encoders. Each encoder consists of a multi-head attention layer and a feed forward layer with summation and normalization in between and at the encoder output. On top of this a task specific feed forward neural network and a SoftMax layer are added. Where BERT performs masking once during data pre-processing, resulting in a single static mask, RoBERTa duplicates the training data 10 times so that each sequence is masked in 10 different ways. By doing this RoBERTa avoids using the same mask for each training instance in every epoch. The resulting model is then fine-tuned for the desired task.
To prepare the issue title and description as input for the model, they are joined, all hyperlinks are replaced with the keyword [link], all HTML tags are removed, as are multiple consecutive spaces, newline spaces and tab spaces. The pre-processed texts are subsequently tokenized using the model's respective tokenizer. For each repository we train a separate RoBERTa model treating the LIOI of that repository as the test set and use LIOI from all other repositories to create a train and validation set. On LIOI from other repositories, we perform an 80/20 split taking 80% of the set as the final train set and 20% as a validation set. The model receives an input length of 512 tokens: texts with fewer tokens are padded to that length and longer sequences are truncated to 512 tokens. Mini-batches of 4 examples are used over a 6 epoch training period. For optimization, the AdamW optimizer is applied to all model parameters with a learning rate of 2 · 10^−5. The weight decay factor used in the AdamW optimizer is set to 0.001 for all model parameters except for biases, where it is set to 0. Additionally, 4 gradient accumulation steps and gradient checkpointing are used to reduce the memory footprint required for model training. Also, half precision training (FP16) is used to facilitate faster training and further reduce the required GPU memory. After each epoch, the model is evaluated on the validation set and at the end of training the best performing version of the model, based on results on the validation set, is selected as the definitive version of the model.
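The cross-project data split used for both FastText and RoBERTa can be sketched as below. The shuffling, the fixed seed and the function name are assumptions made for illustration; the text only specifies the held-out repository and the 80/20 proportion.

```python
import random


def cross_project_split(issues_by_repo, test_repo, valid_share=0.2, seed=0):
    """Hold out one repository as the test set and split the rest 80/20.

    issues_by_repo: dict mapping repository name to a list of labeled issues.
    Returns (train, validation, test) lists.
    """
    test = list(issues_by_repo[test_repo])
    rest = [
        issue
        for repo, issues in issues_by_repo.items()
        if repo != test_repo
        for issue in issues
    ]
    random.Random(seed).shuffle(rest)  # shuffling is an assumption of this sketch
    n_valid = int(len(rest) * valid_share)
    return rest[n_valid:], rest[:n_valid], test
```

Calling this function once per repository yields one train, validation and test triple for each of the separately trained models. As a side note on the training setup, mini-batches of 4 examples combined with 4 gradient accumulation steps mean that each optimizer update aggregates gradients from 16 examples.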

V. SOFTWARE DEFECT PREDICTION DATASETS
In this section we describe how software defect prediction datasets are derived based on the collected data of interest and the determined issue classification.
Issues labeled during manual inspection are considered labeled issues of interest (LIOI). For all LIOI we identify files of interest (FOI) influenced by LIOI through the commit links using commits of interest (COI). Files influenced by LIOI are referred to as labeled files of interest (LFOI). We then proceed to create an SDP dataset for each repository.
Each LFOI results in one instance in the SDP dataset, and the label for that instance is based on the issues associated with the commits that modify the LFOI. If any of the issues is labeled as a bug related issue, we consider the instance to be bug prone; if none of the issues are bug related, the instance is considered to be clean. The issue label can be determined using the manually assigned labels or those determined by the issue classification model. The features of the instance consist of three types: code complexity features, process features and semantic features. The code complexity category contains a single feature and is the simplest one. For each version of an LFOI we inspect the number of lines of code and the feature is calculated as the average number of lines of code for this file through its history. The process features used are the number of commits which modify this file, the absolute number of issues referenced by these commits and the average number of issues referenced per commit modifying this file. Finally, the semantic features are the most complex. Each version of the LFOI is embedded using GraphCodeBERT. Given that the model input is limited to 512 tokens at a time, we encode the source code in chunks of 512 tokens with a 256 token overlap between consecutive chunks. For each embedded chunk we take the embedding of the CLS token which should encapsulate the overall semantics of the processed chunk. We then calculate the mean CLS token over all chunks of a single version. The final semantic features for the file are the mean CLS token of each of the file versions and the sum of differences of consecutive version CLS tokens. The mean over all versions of the file should encapsulate the general purpose of the source code while the sum of differences should encapsulate the file's change over time. The described code complexity, process and semantic features are concatenated to create the final features of the SDP instance.
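The chunking and aggregation of the semantic features can be illustrated with plain Python lists standing in for the embedding vectors. The function names are hypothetical, and the use of signed differences between consecutive versions is an assumption, since the text does not state whether signed or absolute differences are summed.

```python
def chunk_tokens(tokens, size=512, overlap=256):
    """Split a token list into windows of `size` tokens sharing `overlap` tokens."""
    step = size - overlap
    chunks, start = [], 0
    while True:
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
        start += step
    return chunks


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def semantic_features(version_chunk_embeddings):
    """Aggregate CLS vectors into file-level semantic features.

    version_chunk_embeddings: one list of chunk CLS vectors per file version,
    ordered from the oldest to the newest version.
    """
    per_version = [mean_vector(chunks) for chunks in version_chunk_embeddings]
    mean_all = mean_vector(per_version)
    diff_sum = [0.0] * len(mean_all)
    for previous, current in zip(per_version, per_version[1:]):
        # Signed differences (assumption); their sum equals newest minus oldest.
        diff_sum = [s + (c - p) for s, c, p in zip(diff_sum, current, previous)]
    return mean_all + diff_sum
```

For a file of 1000 tokens, `chunk_tokens` produces three windows starting at tokens 0, 256 and 512, the last one being shorter than 512 tokens.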
Figure 6 depicts the described process and Table 8 shows the resulting SDP datasets for each repository. We trained Logistic Regression (LR), Decision Trees (DTC), Naive Bayes (NB), and K-Nearest Neighbours (KNN) models on the SDP datasets we created. We did not pursue more advanced SDP model development because the goal of the study was to investigate the impact of the quality of issue classification on the quality of the SDP dataset and the subsequent performance of the model, not to develop the most advanced SDP model. We chose to use these models because they are commonly used in SDP studies.

VI. EVALUATION
This section describes the results we obtained. For each approach and for each repository, we first present the Precision, Recall and F1 score that was achieved in the task of issue classification. Then we show the impact of issue classification on the resulting SDP dataset by presenting the confusion matrix of the automatically assigned labels compared to the ''golden'' labels of the SDP dataset. The confusion matrix includes four values. True Positive (TP) is the number of defective instances that were correctly identified as defective. True Negative (TN) is the number of non-defective instances that were correctly identified as non-defective. False Positive (FP) is the number of non-defective instances that were incorrectly identified as defective. False Negative (FN) is the number of defective instances that were incorrectly identified as non-defective. Finally, we train Logistic Regression, Decision Tree, Naive Bayes, and K-Nearest Neighbor models and investigate the impact of noise in the SDP dataset on their performance. Training of these models is repeated 30 times to mitigate the stochastic nature of the procedure and its influence on the obtained results. For each repository and each issue classification method the created SDP dataset is split 80/20, with 80% of the dataset being used for training and 20% for testing. This is done separately for each repetition. The training set uses the labels derived from the issue classification approach in focus, while the test set uses labels derived from manual issue labeling. For each model and each SDP dataset we present the obtained Precision, Recall and MCC score. The Matthews correlation coefficient (MCC) is a metric describing the correlation between real and predicted values. It ranges from −1 to 1 with −1 representing a completely faulty prediction, 0 representing a completely random prediction and 1 representing a perfect prediction.
The MCC score is used to validate the performance of the models and make sure that they are performing better than random guessing baselines. To summarize, F1 scores refer to issue classification results and MCC scores refer to SDP results.
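For reference, Precision, Recall and MCC can all be computed from the four confusion matrix values. The sketch below uses the standard definitions; returning 0 when a denominator is zero is a common convention and an assumption here, not something stated in the text.

```python
import math


def precision_recall_mcc(tp, tn, fp, fn):
    """Compute Precision, Recall and MCC from confusion matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return precision, recall, mcc
```

A perfect prediction yields an MCC of 1, a fully inverted prediction yields −1, and a prediction that is right and wrong equally often in both classes yields 0.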
For each method, we report the minimum, average and maximum false positive and false negative share. For each method and for each repository, the percentage of false positives is calculated by dividing the false positive count by the total instance count. Similarly, the false negative percentage is calculated by dividing the false negative count by the total instance count. The minimum false positive percentage for a method is the minimal value obtained across different repositories. The maximum false positive percentage for a method is the maximum value obtained across different repositories. The average is the sum of all obtained values divided by the number of repositories. The false negative statistics are calculated analogously.
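These summary statistics can be computed as in the following sketch. The function name and the input layout are assumptions, and any counts passed to it in examples are made up rather than taken from Table 10.

```python
def noise_shares(confusions):
    """Summarize false positive and false negative shares across repositories.

    confusions: dict mapping repository name to (tp, tn, fp, fn) counts.
    Returns ((fp_min, fp_avg, fp_max), (fn_min, fn_avg, fn_max)) in percent.
    """
    fp_pct, fn_pct = [], []
    for tp, tn, fp, fn in confusions.values():
        total = tp + tn + fp + fn
        fp_pct.append(100.0 * fp / total)
        fn_pct.append(100.0 * fn / total)

    def summarize(values):
        return min(values), sum(values) / len(values), max(values)

    return summarize(fp_pct), summarize(fn_pct)
```

Note that the average is an unweighted mean over repositories, so a small repository influences it as much as a large one.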
We trained Logistic Regression (LR), Decision Trees (DTC), Naive Bayes (NB) and K-Nearest Neighbours (KNN) models on the SDP datasets derived from manually labeled issues. Please note that the instances of these datasets are created from source code files so in this instance we are performing file level SDP.
The results of issue classification for each classification method are summarized in Table 9. The table displays the Precision, Recall and F1 score of each issue classification model on each repository. Further, the impact this has on the resulting SDP dataset is shown in Table 10 by listing the number of True Positives, True Negatives, False Positives and False Negatives induced in the resulting SDP dataset. The ground truth from which TP, TN, FP and FN are calculated comes from the manually labeled issue classes. Finally, the performances of the models trained on the derived SDP dataset are shown in Table 11. For each SDP model, on each dataset created from a specific repository using a specific issue classification method to derive the SDP instance labels, the table displays the Precision, Recall and MCC score of that model.
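The way issue-level labels propagate to file-level labels, and how the resulting label noise is counted, can be sketched as follows. The data layout and function names are assumptions made for illustration.

```python
def label_file(issue_ids, issue_is_bug):
    """A file is bug prone if any issue linked to it is labeled as a bug."""
    return any(issue_is_bug[issue_id] for issue_id in issue_ids)


def dataset_confusion(file_issues, auto_labels, manual_labels):
    """Compare automatically derived file labels with the golden file labels.

    file_issues:   dict mapping file path to the ids of its linked issues.
    auto_labels:   dict mapping issue id to the classifier's bug/non-bug label.
    manual_labels: dict mapping issue id to the manually assigned label.
    Returns (tp, tn, fp, fn) counted over files.
    """
    tp = tn = fp = fn = 0
    for issue_ids in file_issues.values():
        predicted = label_file(issue_ids, auto_labels)
        golden = label_file(issue_ids, manual_labels)
        if predicted and golden:
            tp += 1
        elif not predicted and not golden:
            tn += 1
        elif predicted:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn
```

Because a single bug label is enough to mark a file as bug prone, one misclassified issue can flip the label of every clean file it is linked to.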
For the sake of readability, the rest of this section is divided into five subsections. The first subsection presents the results obtained using KWM. The second subsection presents the results obtained using the IKWM approach. The third subsection presents the results obtained using the FastText model. The fourth subsection presents the results obtained using the RoBERTa model. Finally, the last subsection summarizes the obtained results.

A. SIMPLE KEYWORD MATCHING
Since this is an unsupervised machine learning approach, the whole issue classification dataset can be used for evaluation. We search for keywords in the issue description and in the issue title. By analyzing the results reported in Table 10 we can observe that the SDP dataset created using the KWM method consists of 13.2093% up to 58.3333% of false positives and 0.0000% up to 3.4335% of false negatives. On average there are 38.4181% of false positives and 1.7780% of false negatives. From this we see that the KWM method is prone to false positives.

B. IMPROVED KEYWORD MATCHING
For each repository we perform issue classification with the IKWM method. By analyzing the results reported in Table 10 we can observe that the SDP dataset created using the IKWM method consists of 14.9261% up to 46.0838% of false positives and 0.0000% up to 5.1979% of false negatives. On average there are 30.7992% of false positives and 1.9316% of false negatives. When comparing these results to those obtained using the KWM approach, on average, we see a drop of false positives from 38.4181% to 30.7992% and a slight increase of false negatives from 1.7780% to 1.9316% of the resulting SDP dataset.

C. FastText
For each repository we perform issue classification with the FastText model. By analyzing the results reported in Table 10 we can observe that the SDP dataset created using the FastText model consists of 0.0000% up to 19.4444% of false positives and 6.0109% up to 18.4615% of false negatives. On average there are 7.4104% of false positives and 9.7246% of false negatives. When comparing these results to those of previous methods we see a significant drop in the number of false positives at the expense of false negatives. However, the overall amount of noise in the dataset is reduced. With the IKWM method, on average 32.7308% of the SDP dataset was mislabeled, while using this approach that number has been reduced to 17.1350%.

D. RoBERTa
For each repository we perform issue classification with the RoBERTa model. By analyzing the results reported in Table 10 we can observe that the SDP dataset created using the RoBERTa model consists of 0.5464% up to 23.6111% of false positives and 0.5875% up to 11.8598% of false negatives. On average there are 9.3369% of false positives and 5.0272% of false negatives.
When comparing these results to the performance of KWM and IKWM we see a significant drop in the number of false positives and an increase in the number of false negatives. When comparing them to the performance of FastText there is a slight increase in the number of false positives, but a noticeable drop in the number of false negatives. Overall, on average, the number of mislabeled instances has been reduced to 14.3641%.
If we analyze the model performances reported in Table 11 we see that in most cases models achieve superior performance on SDP datasets created using RoBERTa when compared to other methods. Also, from the positive MCC values we can easily determine that the model is not performing random guessing.
To statistically validate the achieved results, we compare MCC score distributions achieved by models trained on RoBERTa SDP datasets and models trained on datasets created using other methods. When comparing two samples we first test if both come from normal distributions (with p = 0.05). The null hypothesis of the normality test is that the sample comes from the normal distribution. If we fail to reject the null hypothesis for both distributions, we assume that they come from normal distributions and compare them using a Student's T Test [93] (with p = 0.05). If the normality test null hypothesis is rejected for at least one distribution, we apply a Mann-Whitney U Test [94] (with p = 0.05). Student's T Test determines if two samples drawn from normal distributions have the same expected value. The null hypothesis states that the distributions from which the samples are drawn have the same expected value. If it is rejected, the sample distributions are different with statistical significance. Mann-Whitney U Test is a non-parametric statistical significance test which determines if the two samples come from different distributions. By using a non-parametric test, we are not assuming any specific distributions. The null hypothesis states that there is no difference between the distributions from which the samples are drawn. If it is rejected, the sample distributions are different with statistical significance.
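The decision procedure can be sketched with SciPy as shown below. The text does not name the normality test, so the use of the Shapiro-Wilk test here is an assumption, as are the function name and the two-sided alternative.

```python
from scipy import stats


def compare_samples(a, b, alpha=0.05):
    """Choose Student's t-test or Mann-Whitney U based on a normality check.

    Returns (test name, p-value, whether the difference is significant).
    """
    # Shapiro-Wilk normality test (assumed); index 1 of the result is the p-value.
    normal_a = stats.shapiro(a)[1] > alpha
    normal_b = stats.shapiro(b)[1] > alpha
    if normal_a and normal_b:
        name, result = "t-test", stats.ttest_ind(a, b)
    else:
        name, result = "mann-whitney", stats.mannwhitneyu(a, b, alternative="two-sided")
    p_value = float(result.pvalue)
    return name, p_value, p_value < alpha
```

In the setting described above, `a` and `b` would each hold the 30 MCC scores obtained from the repeated training runs of two configurations.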
For each repository, there are 3 additional issue classification methods and 4 SDP models are trained, meaning that for each repository there are 12 configurations we are comparing to. Given that we have 7 repositories, that results in 84 configurations in which we compare the performance of SDP models trained on SDP datasets created using RoBERTa to other methods.
Models trained on SDP datasets created using RoBERTa had superior performance compared to other methods 65 out of 84 times, and statistically significant superior performance 55 out of 84 times, which is equal to 65.4761% of the time.
We analyzed the model performances reported in Table 11 to see how models trained on datasets created using RoBERTa compare to those trained on golden datasets.
Again, we used the same statistical procedure with the same p values to see if there is a difference between the performance distributions. Out of 28 configurations, in 17 cases the test failed to reject the null hypothesis, meaning it could not distinguish between the performance distributions with statistical significance.
We used t-SNE to visualize the classification of issues by the best-performing model for each repository. The red dots represent defect related issues and the blue dots represent non-defect related issues. The visualization demonstrates how the models have learned to differentiate between defective and non-defective issues. The visualizations are shown in Figure 7.

E. RESULT SUMMARY
All obtained results are presented in Table 9, Table 10 and Table 11. Table 9 presents the issue classification results. Table 10 presents the influence of issue classification on SDP dataset quality. Finally, Table 11 presents the SDP performance of models trained on derived SDP datasets.
To briefly summarize, when applying a KWM to issue classification the resulting SDP datasets had, on average, 38.4181% of false positives and 1.7780% of false negatives, meaning a total of 40.1961% of the dataset was mislabeled. The situation improved when IKWM was applied. With IKWM the resulting datasets, on average contained 30.7992% of false positives and 1.9316% of false negatives, meaning a total of 32.7308% of the dataset was mislabeled. SDP datasets created using the FastText model, on average have 7.4104% of false positives and 9.7246% of false negatives, meaning a total of 17.1350% of the dataset is mislabeled. Finally, datasets created using the RoBERTa model, on average have 9.3369% of false positives and 5.0272% of false negatives, meaning a total of 14.3641% of the dataset is mislabeled.
Kim et al. [10] put an acceptable limit on 20% of FP and 20% of FN. They state that beyond that level there is severe degradation in model performance. Pandey and Tripathi [16] were stricter and put an acceptable limit on 10% of the dataset consisting of incorrectly labeled instances. They state that after that point there is severe performance degradation.
We see that the KWM approach results in datasets which have a high number of false positives and are beyond the limits specified by previous researchers. IKWM improves the dataset quality, but not enough to satisfy the laid-out criteria. The FastText model reduces the number of false positives and false negatives to a quantity acceptable by the criteria proposed by Kim, but not the one proposed by Pandey. RoBERTa further reduces the amount of noise, but still fails to meet the criteria proposed by Pandey.
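The comparison against both criteria can be checked with a few lines of arithmetic on the averages reported in this section. Treating the limits as inclusive upper bounds is an assumption of this sketch.

```python
# Average false positive and false negative percentages reported above.
AVERAGE_NOISE = {
    "KWM": (38.4181, 1.7780),
    "IKWM": (30.7992, 1.9316),
    "FastText": (7.4104, 9.7246),
    "RoBERTa": (9.3369, 5.0272),
}


def check_limits(fp, fn):
    """Return (total noise, meets Kim's limit, meets Pandey's limit)."""
    total = round(fp + fn, 4)
    meets_kim = fp <= 20.0 and fn <= 20.0  # at most 20% FP and 20% FN
    meets_pandey = total <= 10.0           # at most 10% mislabeled overall
    return total, meets_kim, meets_pandey


for method, (fp, fn) in AVERAGE_NOISE.items():
    print(method, check_limits(fp, fn))
```

The totals match the values given above (40.1961%, 32.7308%, 17.1350% and 14.3641%), and only FastText and RoBERTa fall within the 20% limits, while none of the methods reaches the 10% limit.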
We investigated the impact on model performance and found that models trained on RoBERTa derived SDP datasets achieve superior, statistically relevant performance in 55 out of 84 cases. Furthermore, in 17 out of 28 cases we could not show a statistically relevant performance difference between models trained on RoBERTa derived SDP datasets and those trained on golden datasets.

VII. THREATS TO VALIDITY
This is an empirical study and as such has its own threats to validity. We have identified four possible threats: Manual issue classification was performed as part of this study. To reduce the impact of this threat multiple people labeled the issues and a majority vote was then taken. However, since the authors do not possess in-depth knowledge of all the selected repositories, incorrect classification might have still occurred and influenced the results. This is one of the reasons we have made the created dataset publicly available where it can be subject to independent assessment.
Open-source software repositories were used as a data source for this study. However, they might not be representative of all software repositories and thus they might introduce a bias into the reported results. One of the ways we have tried to alleviate this problem is by sampling multiple repositories.
English issues were the only ones used in this study. This might introduce a language-based bias. Further studies could be conducted to investigate this issue.
SDP dataset construction methodology used in this study is not the only possible option. For instance, if JIT-SDP was considered then the dataset construction process would be different as it looks into the commit being bug inducing or not. Even if different features were used the effect might be different. One way we have tried to minimize this impact is by using a mixture of code complexity, process, and semantic features. However, the fact remains that a different process might obtain different results. Further studies could be conducted to investigate this effect on different dataset construction methodologies.

VIII. CONCLUSION
As part of this study, we investigated the impact of issue classification on SDP dataset quality and resulting model performance. In order to do this, we created new datasets by mining 7 popular open-source repositories. For every repository, we collected all commit data and all issue data. By analyzing commit messages, we identified issues-of-interest. These are issues referenced by at least one commit. Commits containing at least one issue reference are considered commits-of-interest. We then identified source code files edited exclusively by commits-of-interest and call them files-of-interest. For each repository, we sampled at least 1000 IOI, manually inspected and labeled them. We determined which FOI are related to the labeled issues. From them, using code complexity analysis, process analysis and GraphCodeBERT we created SDP dataset instances. The golden labels of these instances are derived from the manually labeled issues. We then investigated how using different methods for issue classification would influence the created SDP datasets and performance of standard models trained on these datasets. Issue classification was done using a keyword matching heuristic, an improved keyword matching heuristic, a FastText model and a RoBERTa model. For each resulting SDP dataset we trained Logistic Regression, Decision Trees, Naive Bayes and K-Nearest Neighbours models.
From the achieved results we see that applying KWM to issue classification produces SDP datasets with an average 38.4181% of false positives and 1.7780% of false negatives, so a total of 40.1961% of mislabeled instances. IKWM produces datasets with an average 30.7992% of false positives and 1.9316% of false negatives, so a total of 32.7308% of mislabeled instances. FastText produced datasets have 7.4104% of false positives and 9.7246% of false negatives, so a total of 17.1350% of the dataset is mislabeled. Finally, datasets created using the RoBERTa model contain an average 9.3369% of false positives and 5.0272% of false negatives, totaling 14.3641% of mislabeled instances. The obtained results clearly show that of the inspected issue classification approaches RoBERTa produces the highest quality SDP datasets.
We then investigated the impact this has on model performance and found that models trained on RoBERTa derived SDP datasets outperformed counterparts trained on differently derived SDP datasets 65 out of 84 times, 55 of which were statistically relevant. When comparing their performance to those trained on golden datasets we could not show a statistically relevant performance difference 17 out of 28 times.
Based on the presented results we advocate that the research community use advanced NLP models such as RoBERTa when creating datasets for software defect prediction if issue classes cannot be determined with certainty. In our public repository we provide pre-trained models for issue classification.
To support further scientific inquiry in this research area, and put our own work under scrutiny, we have made all our source code, labeling application, created datasets and models publicly available. They can be found on GitHub.

DAVOR VUKADIN was born in Samobor, Croatia, in 1996. He received the master's degree in computer science from the Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb, Croatia, in 2020, where he is currently pursuing the Ph.D. degree. He is currently a Researcher with the Faculty of Electrical Engineering and Computing, University of Zagreb. He was published in IEEE ACCESS. His research interests include software defect prediction, natural language processing, and AI (artificial intelligence) explainability.