A Systematic Literature Review of Issue-Based Requirement Traceability

Issue reports are software artifacts that often specify the changed requirements of software systems. As software systems evolve according to these changed requirements, issue reports have become the essential artifacts that should be covered by requirement traceability. While researchers have developed automatic approaches for establishing the traceability links of issue reports, no papers have surveyed these approaches. In this paper, we conduct a systematic literature review of issue-based requirement traceability. We searched for articles published in renowned conferences and journals in the software engineering field from 2011 to 2022. From 1,347 initial articles, we identified 40 relevant articles. We investigated four aspects of issue-based traceability: problems, artifact pairs, techniques, and evaluation targets. Our findings are as follows. First, the challenges of issue-based requirement traceability are relevant to accuracy, effort, support, information, and trustworthiness. Second, issue reports are linked to commits, source code, user reviews, and test cases. Third, the studies mainly adopted machine learning and information retrieval techniques to generate and recover trace links. Finally, the main evaluation targets were open-source projects, but open datasets were also provided.


I. INTRODUCTION
Requirement traceability refers to the ability to trace relationships from software artifacts to other artifacts [1]. Requirement traceability can effectively help developers explore software artifacts or analyze the impact of a change. Modern software projects mainly adopt Issue Tracking Systems (ITSs), such as Jira, to rapidly reflect change requests in their software products. In the ITS, users can specify their requests as issue reports. Therefore, issue reports often include change requests and are considered software artifacts. In contrast to traditional requirement traceability, we named the traceability study that categorized issue reports as primary software artifacts the Issue-based Requirement Traceability (I-RT) study.
The associate editor coordinating the review of this manuscript and approving it for publication was Porfirio Tramontana.
Traditional requirement traceability studies did not handle issue reports as software artifacts but handled requirements as primary artifacts. For example, studies [2], [3], [4] generated trace links between requirements and other artifacts, such as use cases, source code, and architecture descriptions. Studies [5], [6], [7] recovered trace links between requirements and source code. Several literature surveys have even been conducted on requirement traceability [8], [9], [10], [11], [12], [13], [14], [15]. However, the surveys do not cover the studies that handle issue reports. This situation creates a gap between traditional requirement traceability and modern development practices that use ITSs. To create traceability in modern development practices, issue reports should be covered by a requirement traceability approach. Several studies have addressed requirement traceability with issue reports as part of software artifacts [16], [17], [18]. Nevertheless, to the best of our knowledge, no comprehensive literature surveys have been conducted on I-RT.
In this paper, we conducted a Systematic Literature Review (SLR) of I-RT. First, we defined four research questions that examine four aspects of I-RT: problems, artifact pairs, techniques, and evaluation targets. Second, we searched for I-RT studies published from 2011 to 2022 and found 1,347 articles from well-known conferences and journals in the software engineering field. Third, we identified 40 studies that were relevant to I-RT. Finally, we extracted and analyzed relevant information for each research question.
The remainder of this paper is organized as follows: Section II discusses literature surveys on requirement traceability. Section III describes the procedure of the conducted SLR regarding I-RT. Section IV reports the results obtained for each research question. Section V discusses the future directions of I-RT. Section VI presents the threats to the validity of this research. Section VII concludes this paper.

II. RELATED WORK
There are two ways to conduct literature surveys: Systematic Mapping Studies (SMSs) and SLRs. An SMS categorizes and analyzes existing works and obtains a systematic map that describes a classified portfolio of the research results relevant to a particular research topic [19]. An SLR collects and analyzes existing studies and obtains supporting evidence to answer research questions [20]. The aims of SLRs and SMSs differ in that an SLR is an in-depth study of a narrow area that is completed by using specific and pointed research questions to create new knowledge through a meta-analysis of existing knowledge published in the literature, while an SMS aims to create a map of a wide research field [21].
The first group of works conducted SMSs [8], [9], [10]. Borg et al. [8] focused on Information Retrieval (IR)-based trace recovery and collected studies published from 1999 to 2011. They found that the inconsistent use of IR terminology was the main problem with the IR models used for trace recovery. Vale et al. [9] surveyed studies on traceability for software product lines and collected studies published from 2001 to 2015. They found that most studies focused on the trace links between assets at different levels of abstraction. Charalampidou et al. [10] focused on the relationships of software artifacts in traceability approaches and surveyed studies published before 2016. They found that requirements and source code are the most studied software artifacts and that the most studied quality attribute with respect to traceability is maintainability. The studies of Borg et al. [8] and Vale et al. [9] differ from our literature survey in that they focused on information retrieval techniques for traceability and software product line traceability, respectively. In contrast, we focused on the techniques used in I-RT. Charalampidou et al.'s study in [10] is partially similar to ours in that they investigated software artifacts. However, they did not include issue reports as software artifacts.
The second group conducted SLRs on requirement traceability [11], [12], [13], [14], [15]. Cleland-Huang et al. [15] conducted an SLR to review the state-of-the-art and discuss the future directions of software traceability. They collected papers published from 2003 to 2013. Mustafa and Labiche [11] conducted an SLR to model traceability among artifacts obtained from different domains of expertise. They collected papers published from 2000 to 2016. They found that few studies focused on heterogeneous artifacts, traceability tools, and precise semantics for trace links. Tufail et al. [12] focused on analyzing the models and tools used in requirement traceability studies and collected papers published from 2010 to 2017. They identified seven requirement traceability models, ten requirement traceability challenges, and fourteen requirement traceability tools. Wang et al. [13] comprehensively studied the technologies and challenges of traceability based on papers published from 2006 to 2016. They identified challenges in terms of traceability and technologies mapped to these challenges. Aung et al. [14] focused on change impact analysis and surveyed the available approaches for automatically recovering traceability links. They identified the approaches proposed from 2012 to 2019 and analyzed the research gaps between the current states of the proposed approaches and the desired states of these approaches. Tian et al. [22] surveyed studies on traceability for software maintenance and evolution and collected studies published from 2000 to 2020. They found that the two main challenges that hinder practitioners from employing the traceability practices are the quality of traceability links and the performance of traceability approaches and tools. None of the studies included issue reports as software artifacts.

III. SYSTEMATIC LITERATURE REVIEW PROCEDURE
To understand the newly emerging theme, I-RT, we conducted an SLR by following a guideline [23]. Figure 1 presents the overall procedure of our SLR. The procedure consists of three phases: planning, searching, and analysis. In the planning phase, we defined Research Questions (RQs). In the searching phase, we formed queries, identified data sources, applied inclusion and exclusion criteria to the candidate papers, checked the full texts, and performed snowballing. In the analysis phase, we assessed the quality of the selected studies and extracted information.

A. DEFINING THE RQs
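We derived four RQs from our research purpose, which is to provide comprehensive knowledge of the research trends in I-RT studies. The RQs address the problems targeted (RQ1), the artifact pairs linked (RQ2), the techniques used (RQ3), and the evaluation targets (RQ4).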

B. SEARCHING THE RELEVANT STUDIES
Our selection process consists of five steps. First, we formed queries. Second, we searched digital libraries with the queries. Third, we applied inclusion and exclusion criteria to filter out papers. Fourth, we examined the full texts of the remaining papers to identify the papers relevant to our research goal. Last, we performed forward and backward snowballing with the identified papers. We explain the details of each step as follows:

1) SEARCH TERMS
We used the PICO (Population, Intervention, Comparison, and Outcome) criteria to define the search terms based on the SLR guideline [23].
• Population: The population in this SLR is ''issue reports.'' Issue reports are often also referred to as ''bug reports.'' Therefore, the terms ''issue'' and ''bug'' should be included in our search queries.
• Intervention: The intervention is ''requirement traceability.'' A term commonly used in requirement traceability studies is ''trace link.'' We thus used the general term ''traceability'' as well as ''trace link.''
• Comparison: Since there is no alternative approach for the intervention, we did not consider the comparison part when constructing the search terms.
We formed queries with key terms of population and intervention. The population terms are ''issue'' and ''bug.'' The intervention terms are ''traceability'' and ''trace link.'' We combined these terms with OR and AND operators in queries. Table 1 shows the queries we applied to the digital libraries. We adjusted the terms according to search results.
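As an illustration of how the population and intervention terms can be combined, the following minimal Python sketch builds a Boolean query string. The function name `build_query` and the quoting of multi-word terms are assumptions made for this example; the queries actually submitted to each library are those listed in Table 1.

```python
def build_query(population, intervention):
    """Join synonyms with OR inside each group, then join the groups with AND."""
    def group(terms):
        # Quote multi-word terms so that they are searched as phrases (assumed syntax).
        quoted = [f'"{t}"' if " " in t else t for t in terms]
        return "(" + " OR ".join(quoted) + ")"
    return group(population) + " AND " + group(intervention)


query = build_query(["issue", "bug"], ["traceability", "trace link"])
print(query)
```

In practice, each digital library has its own query syntax, so a string of this kind would still need to be adapted per library.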

2) DIGITAL LIBRARIES
We selected digital libraries according to the suggestions in [63]. We also considered renowned conferences and journals in the field of software engineering and selected the digital libraries that include such publications. As a result, we identified four digital libraries: IEEE Xplore, ACM Digital Library, Springer Link, and Science Direct. To find the key I-RT papers, we searched for papers that were published from 2017 to 2022 through the four digital libraries.
To broaden the scope of data sources, we also identified two general data sources: DBLP and Google Scholar. Because DBLP indexes only computer science bibliographies, we extended the search period and searched for papers that were published from 2011 to 2022 through DBLP. Meanwhile, Google Scholar is a data source that promptly lists the latest publications. Therefore, we mainly searched for papers that were published from 2020 to 2022 through Google Scholar.
With the queries formed in Section III-B.1), we searched for papers in the six data sources. As a result, we found a total of 1,347 papers, as shown in the second column of Table 2.

3) INCLUSION AND EXCLUSION CRITERIA
To select primary papers from the 1,347 identified papers, we defined the Inclusion and Exclusion (I/E) criteria as follows:
• Inclusion criteria:
- Papers published from 2011 to 2022;
- Papers that were available in full text;
- Papers written in English.
• Exclusion criteria:
- Papers that were classified as gray literature (e.g., posters, books, and technical reports);
- Papers that were not related to software (e.g., food, medicine, and agricultural papers);
- Papers that were secondary or tertiary studies (e.g., SLRs, SMSs, and surveys);
- Papers that did not mention issue (or bug) reports.
To apply the I/E criteria, we manually checked the titles and abstracts of the papers. As a result, we selected a total of 121 papers, as shown in the third column of Table 2.

4) CHECKING FULL TEXT
We examined the full texts of the 121 papers and identified 29 papers related to our purpose, as shown in the last row of the fourth column of Table 2.

5) SNOWBALLING
We applied the forward and backward snowballing techniques to find additional papers [64]. For the forward snowballing method, we checked papers that cited each selected paper. To find the papers that cited each selected paper, we used Google Scholar. As a result, we identified 5 papers by using the forward snowballing method.
For the backward snowballing method, we checked papers that were cited by the selected paper. Since the snowballing method was used to identify papers strongly related to our research topic, we did not establish a limit regarding the publication date. As a result, we identified 8 papers by the backward snowballing method. After this snowballing step, we were able to identify 13 additional papers related to I-RT, as shown in Table 3.
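The two snowballing directions can be sketched over a citation map as follows. This is a minimal illustration under stated assumptions, not the procedure's actual tooling: the function `snowball`, the dictionary `cites`, and the paper identifiers are all hypothetical, and in the review the forward direction was looked up through Google Scholar rather than a local data structure.

```python
def snowball(seeds, cites):
    """Run one round of snowballing over a citation map.

    cites maps a paper ID to the set of paper IDs it cites (hypothetical data).
    Returns (forward, backward) candidate sets, excluding the seeds themselves.
    """
    seeds = set(seeds)
    # Backward: papers that appear in the reference lists of the seed papers.
    backward = set()
    for seed in seeds:
        backward |= cites.get(seed, set())
    # Forward: papers whose reference lists contain at least one seed paper.
    forward = {paper for paper, refs in cites.items() if refs & seeds}
    return forward - seeds, backward - seeds


cites = {
    "P1": {"A", "B"},   # selected paper P1 cites A and B
    "P2": {"B"},        # selected paper P2 cites B
    "C": {"P1"},        # C cites P1
    "D": {"P2", "A"},   # D cites P2 and A
}
forward, backward = snowball(["P1", "P2"], cites)
print(sorted(forward), sorted(backward))
```

The returned sets are only candidates; each one still has to be screened manually for relevance to I-RT.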

C. ANALYZING THE SELECTED STUDIES
We first analyzed papers with structures consisting of an Introduction, Methods, Results, Analysis, and Discussion (IMRAD). We then extracted information from each paper to answer each RQ.

1) ASSESSING THE QUALITY OF THE STUDIES
To assess the quality of the selected studies, we prepared our checklist by following the IMRAD structure [65].
• Introduction: Does the study discuss the topic of I-RT?
• Methods: Does the study propose methods or experiments for I-RT?
• Results: Does the study include novel discoveries and useful results?
• Analysis: Does the study analyze related studies?

2) ANALYZING THE EXTRACTED INFORMATION
We extracted information from the selected papers to answer our RQs. As a result, we created the data extraction form shown in Table 4. The form included the basic information related to the RQs. We completed a form for each paper and analyzed the contents to answer each RQ.

A. WHAT PROBLEMS ARE TARGETED? (RQ1)
To answer RQ1, we need a framework to analyze the problems of issue-based requirement traceability studies. Cleland-Huang et al. [15] classified traceability-related studies according to the elements of the process: planning, creating, maintaining, and using traceability. Considering the characteristics of issue-based requirement traceability studies, we identified four main problem categories as follows:
• Trace Link Generation: Studies focused on generating trace links between issues and other artifacts, given no previous trace links.
• Trace Link Recovery: Studies focused on additionally generating or repairing trace links, given previous trace links between issues and other artifacts.
• Trace Link Maintenance: Studies focused on maintaining and retaining trace links as requirements change or trace links become outdated.
• Trace Link Aid: Studies did not directly generate, recover, or maintain trace links but proposed additional techniques to assist existing traceability techniques.
Figure 2 shows the classification results of the 40 studies. What stands out is that 50% of the studies address the trace link recovery problem. This share is consistent with our observation that many studies aim to recover trace links between issues and commits. Trace link generation accounts for 30% of the studies, while trace link maintenance and trace link aid account for 12% and 8%, respectively.
We then classified the investigated studies according to the four problem categories and their specific challenges, as shown in Table 5. The challenges of the studies can be grouped into five categories: accuracy, effort, support, information, and trustworthiness.

1) ACCURACY
Studies sought to improve the low accuracy of the existing techniques for generating, recovering, or maintaining trace links [24], [35], [36], [45], [48], [49], [53], [55]. Researchers attributed the low accuracy to error-prone links, semantic gaps, and large source files. In more detail, researchers have noted that trace links are error-prone because developers manage or classify them manually [28], [30], [33], [40], [44], [49], [58]. Researchers have also pointed out that the software artifacts to be linked are semantically different [24], [37], [40], [56], [61]. For instance, the study in [56] noted that locating bugs is difficult because issue reports are written in natural language, whereas source code is written in a programming language. In addition, researchers identified the handling of large source files and stack traces as one of the obstacles to improving the accuracy of trace links [45]. To improve the accuracy of trace links, researchers have also studied removing false positives from automatically generated trace links [53].

2) EFFORT
Studies have also focused on reducing a considerable amount of developers' manual effort when creating and recovering trace links [28], [33], [44], [47], [52], [53], [57], [58]. Developers spend a substantial amount of time when manually managing traceability relationships or locating bugs [29], [38], [40], [53], [58], [59], [60]. Developers also spend their time finding test cases relevant to a given bug [46]. In this regard, researchers have addressed the insufficiency of developers' inspections of the trace links between two software artifacts [27]. In particular, the study in [32] focused on improving the accuracy of effort estimation for resolving an issue, and the study in [47] addressed that recognizing and reusing architectural knowledge from issue tracking systems are challenging.

3) INFORMATION
Various studies have addressed the absence of information needed to create, recover, and maintain trace links. Researchers have noted that even when developers manually link issues and commits, trace links are often missing [26], [28], [43], [44], [57]. Researchers have also addressed insufficient commit messages, which make it difficult to identify the trace links between bug reports and commits or lead to biased defect information [36], [37], [43]. Researchers have discussed the lack of integration between revision control systems and issue tracking systems, which affects the recovery of trace links as well as the prediction of software faults [16], [50], [51]. Researchers also noted that issues form a complex network by themselves, but no approaches existed to predict link types [39], [42].
Regarding the specific information for generating trace links, Saha et al. addressed the lack of structural information because existing IR-based techniques handle source code as flat text [29]. Mayr-Dorn et al. addressed the problem of not considering a part of source code that actually implements a specific requirement (i.e., issue) or is covered by tests [38]. Wang et al. found that existing IR methods usually ignored the existing bug fixing histories and various sources of information (e.g., the metadata or the stack traces in the bug reports) [62].
Regarding the specific information for recovering trace links, Sun et al. noted that existing approaches disregarded nonsource files and the roles of source files in commits [57]. Nguyen-Truong et al. addressed insufficient discriminative information, which arises when existing techniques mine only the data within a commit [34].
Regarding the specific information for maintaining trace links, Luders et al. noted that it is difficult to maintain an overview of dependencies among issues [31]. Çetin and Tüzün focused on identifying the contributions of developers based on traceability graphs [54].

4) SUPPORT
Studies have attempted to develop support for languages, tools, etc., for trace links. First, researchers have addressed the problems of trace links between different languages, such as natural languages [25] or intermingled languages (e.g., English, Chinese, Korean) [41]. Researchers have also addressed that existing approaches do not provide support for artifacts and traces to other repositories and tools [27].

5) TRUSTWORTHINESS
One challenge is that trace links are incomplete and untrustworthy [26]. Additionally, researchers have addressed the lack of sufficient bias control for misclassified bugs, tangled commits, and localization hints [52].
Figure 3 summarizes the challenges for each problem category. Many researchers focused on the accuracy of trace links, the human effort required to establish trace links, and the additional information needed to set up trace links. Especially for recovering trace links, researchers tried to use additional information to improve the accuracy of trace links. In addition, a few researchers studied support for trace links and the trustworthiness of trace links.

RQ1: I-RT studies can be classified into four traceability problem categories: trace link generation, recovery, maintenance, and aid. Across the four problem categories, the main challenges are low accuracy, high effort, insufficient information and support, and untrustworthiness.

B. WHICH SOFTWARE ARTIFACT PAIRS ARE LINKED? (RQ2)
We surveyed I-RT studies and found that the studies used issue or bug reports as the primary artifacts. Our analysis for RQ2 reveals that researchers focused on linking issue reports with commits, source code, user (app) reviews, test cases, model changes, user manuals, and the issue reports themselves. Table 6 shows the number of studies that handle different types of artifact pairs.

2) ISSUE REPORTS AND SOURCE CODE
Seven studies linked issue reports to source code [29], [40], [45], [51], [52], [56], [62]. Two studies sought to aid in generating trace links between issue reports and source code [40], [52]. The studies in [45], [56], [62] sought to localize the relevant buggy source files based on the given issue (bug) reports. One study took issues and source code files as inputs to localize bugs [29], and another linked issue reports to the source code in patches [51].

3) ISSUE REPORTS AND TEST CASES
Four studies linked issue reports to test cases [30], [38], [44], [46]. One study focused on establishing trace links between requirements and test cases by using issue reports as requirements [38]. Three studies recovered the trace links between issue reports and test cases [30], [44], [46]. The studies intended to link bugs and test cases.

4) ISSUE REPORTS
Four studies focused on the trace links among issue reports [31], [39], [42], [48]. The studies in [39], [42] focused on the different kinds of trace links, such as related links, duplicates, and blocks. The study in [31] refined the relationships of the issue reports (e.g., parent-child, duplicate, dependency, similarity, and work breakdown). The study in [48] noticed that the ''related'' field of an issue report contains several issue numbers, but those issues are traced for different reasons, such as duplicate or generic ones.

5) ISSUE REPORTS AND USER REVIEWS
Two studies [25], [60] sought to generate trace links between user reviews and issue reports. Among them, the study in [25] tracked the states of user reviews by narrowing the gap between user reviews and issue reports. The study in [60] aimed to identify issue reports that were related to user reviews and utilized these issue reports to identify the source locations to change.

6) ISSUE REPORTS AND MODEL CHANGES
The study in [49] sought to recover the relationship of artifacts between Jira issues (user stories or bugs) and model changes (revisions in a Model-Driven Development (MDD) context).

7) ISSUE REPORTS AND USER MANUAL
The study in [58] classified issue reports according to specific software feature descriptions in a user manual.

8) ALL RELEVANT ARTIFACTS
The study in [32] traced issues to requirements, model elements, source code, texts, copies, wireframes, and art designs.
Two studies did not specify pairs of artifacts [27], [59] because they did not focus on specific trace links. The first study developed a tool, TimeTracer, that supports arbitrary artifacts and traces by providing APIs [27]. The second study proposed a method for automatically recommending reviewers to review various artifacts, such as requirements, design diagrams, changesets, code reviews, test cases, and bugs [59].

9) SUMMARY
Many researchers studied the recovery of trace links between issue reports and commits and the improvement of the accuracy of these trace links. Researchers broadened the scope of trace links to source code, test cases, and issue reports themselves. In addition, a few researchers conducted studies to establish trace links between issue reports and other artifacts, such as user reviews, model changes, and user manuals.
RQ2: I-RT studies linked issue reports to commits, source code, test cases, issue reports themselves, user reviews, model changes, user manuals, and all other relevant artifacts.

C. WHAT TECHNIQUES ARE USED IN THE STUDIES? (RQ3)
We found that I-RT studies have utilized Information Retrieval (IR) and Machine Learning (ML) techniques or even hybridized them. Figure 4 shows the changes in the techniques over the years. In the early years, several studies applied IR techniques to address the challenges of generating and recovering traceability links, while recently, the use of ML techniques has gradually emerged. Table 7 summarizes the specific techniques by year.

1) INFORMATION RETRIEVAL-BASED APPROACHES
From 2011 to 2016, ten related studies used IR techniques [16], [29], [36], [37], [43], [45], [46], [48], [50], [51]. The studies in [16], [36], [43], [51] sought to recover the missing links between issue reports and commits automatically. For instance, the study in [36] used the Term Frequency-Inverse Document Frequency (TF-IDF) similarity and three features. The study in [51] retrieved patches from issues, extracted patches, and recovered trace links. The study in [43] trained a random forest model with 9 text features and 11 metadata features extracted from issue reports and commit links after summarizing commit messages with ChangeScribe. The study in [50] used the SZZ algorithm proposed by Śliwerski et al. [68]. The study in [48] applied stemming, stop word removal, and term weighting to the textual data in an issue tracking system to improve the effectiveness of IR approaches for building traceability between issues. Interestingly, the study in [46] recommended test cases relevant to bugs by applying two topic modeling techniques, Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA).
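To make the TF-IDF similarity mentioned above concrete, the following Python sketch ranks candidate commit messages against one issue report by cosine similarity over TF-IDF vectors. It is a minimal illustration under stated assumptions, not the method of any surveyed study: the smoothed IDF formula, the tokenizer, and the example texts are all choices made for this sketch, and it omits the additional features that studies such as [36] combine with the similarity.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF (an assumed variant) so that every term keeps a positive weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical issue report and commit messages.
issue = "NullPointerException when parsing empty config file"
commits = [
    "Fix NullPointerException in config parser for empty file",
    "Update README with build instructions",
]
vectors = tfidf_vectors([issue] + commits)
scores = [cosine(vectors[0], v) for v in vectors[1:]]
best = max(range(len(commits)), key=scores.__getitem__)
print(best, scores)
```

A recovery approach would typically accept a candidate link only when its score exceeds a threshold, which has to be tuned per project.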
The studies in [29], [37], [45] sought to improve the accuracy of trace links. The study in [29] used structured IR to locate bugs automatically by using the IR toolkit Indri. The study in [37] extracted textual features by analyzing error records and change logs and passed these features to multiple detectors (e.g., pattern-based link detectors) in different layers according to their categories. The study in [45] calculated the similarities between code segments and an issue report.
In 2017, three studies used IR techniques [38], [56], [57]. Study [56] identified buggy source code files based on their similarities to issue reports. Study [57] modeled the similarity distances among features extracted from issue reports, commits, source code, and documents. Study [38] identified the source code lines that implement issue reports and the test cases that cover these source code lines.
In 2018 and 2019, two studies used IR techniques [31], [40]. The study in [40] intended to link issue reports and commits, utilizing existing project history enriched with previously unused information to recover traceability between source code files and bug reports and to increase localization performance. The proposed approach, TraceScore, selects artifacts from project histories, calculates textual similarity, constructs a graph for traceability, and finally calculates a score for each source file. The study in [31] visualized trace links in a graph map, detected duplicated, missed, or unknown links, and checked the inconsistencies between a release plan and the issues in a map.
In 2020, four studies used IR techniques [27], [42], [53], [62]. To create an interaction-based trace link, the study in [53] extracted the issue IDs and code entities that include developers' interactions from commits. The study in [42] investigated network analysis techniques to generate and maintain the trace links among issue reports. The study in [27] used a data model generated by using Jira and Jama data to replay trace links. The study in [62] proposed a generation model, STMLocator+, that adopts two topic models. One is based on the LDA and LLDA models, and the other captures the semantic and textual similarity.
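Extracting issue IDs from commits, as mentioned for [53], is often done by pattern matching on commit messages. The sketch below is a hedged illustration only: it assumes Jira-style keys of the form `PROJECT-123`, and the pattern `ISSUE_KEY`, the function `issue_ids`, and the sample commits are hypothetical rather than taken from the surveyed study.

```python
import re

# Assumed convention: Jira-style keys such as "CRUNCH-42" (uppercase project key, dash, number).
ISSUE_KEY = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def issue_ids(commit_message):
    """Return the distinct issue keys mentioned in a commit message, in order of appearance."""
    seen = []
    for key in ISSUE_KEY.findall(commit_message):
        if key not in seen:
            seen.append(key)
    return seen


# Hypothetical commits: hash -> message.
commits = {
    "a1b2c3": "CRUNCH-42: fix off-by-one in pipeline planner",
    "d4e5f6": "Refactor tests (see CRUNCH-42 and CRUNCH-57)",
    "0789ab": "Bump version",
}
links = {sha: issue_ids(message) for sha, message in commits.items()}
print(links)
```

A pattern this simple also matches strings such as `UTF-8`, so a practical pipeline would validate each key against the issue tracker before recording a trace link.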
In 2021, one study proposed using traceability graphs for recommending code reviewers [59].

2) MACHINE LEARNING-BASED APPROACHES
In 2017 and 2018, two studies used ML techniques [26], [28]. The study in [28] distinguished positive or unlabeled trace links, extracted features from each link, and trained a random forest model to identify positive links. The study in [26] identified 18 features and experimented with several classification techniques, such as Naïve Bayes, J48 decision tree, and random forest classifiers.
In 2019, three studies used ML techniques [35], [60], [61]. The studies in [35], [61] proposed two approaches that are different but have the same name, DeepLink. The first DeepLink [35] used a Gated Recurrent Unit (GRU) with a Continuous Bag Of Words (CBOW) for issue report embedding. It built a code graph based on the Abstract Syntax Tree (AST) and used a Recurrent Neural Network (RNN) to embed committed source code. The second DeepLink [61] used Long Short-Term Memory (LSTM) with skip-grams to make a fixed vector and recovered trace links by calculating the cosine similarities among (commit log, issue title), (commit log, issue description), and (commit code, issue code) pairs. To identify the code entities for an app review, the study in [60] clustered app reviews using Hierarchical Dirichlet Processes with Natural Language Processing (NLP) and created trace links between clusters and issue reports by calculating similarity values based on skip-grams.
In 2020, three studies focused on the accuracy of bug localization based on ML techniques [33], [41], [52]. Study [52] proposed a near-optimal technique to generate a query from issue reports using a genetic algorithm. Study [41] used a Generalized Vector Space Model (GVSM) to generate trace links between issue reports and commits written in two or more languages. Study [33] recommended a list of possible issue IDs for a new commit with a random forest classifier that is trained with features extracted from issues, commits, and source code.
In 2021, three studies used ML techniques [24], [25], [39]. Study [24] utilized a pretrained BERT code model to retrieve code entities according to code descriptions and concatenate issue reports and commits. Study [25] used a context-sensitive text embedding method to convert app reviews and issue reports into the same vector space and used the cosine similarity metric to match app reviews with issue reports. Study [39] used several machine learning techniques with TF-IDF to predict the link types of issues.
In 2022, all three related studies utilized ML techniques [34], [49], [58]. The study in [49] proposed a machine learning classifier based on random forests and gradient boosted decision trees to classify the validity of trace links. The study in [58] proposed a deep learning method based on a word embedding technique using a CNN (or RNN). The study in [34] automatically identified vulnerability-fixing commits by using 3 independent classifiers (i.e., a commit message classifier, a code change classifier, and an issue classifier).

3) IR-BASED & ML-BASED APPROACHES
In 2020 and 2021, two studies used a hybrid of IR and ML techniques [30], [44]. The study in [44] combined similarity metrics with LSI, LDA, BM25, and convolutional neural networks. Additionally, the study in [30], a follow-up study to [44], also used the same techniques (i.e., LSI, LDA, BM25, etc.) and vectorized issue reports as input queries and test cases as target sources. The cosine similarities between bug reports and test cases were calculated.
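The query-to-target matching described above can be illustrated with a minimal sketch. It assumes plain TF-IDF weighting (with a smoothed IDF) in place of the LSI, LDA, BM25, and neural models that the reviewed studies combine, and all function names and sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(documents):
    """Return one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumption of this sketch, not taken from the studies).
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: c * idf[t] for t, c in Counter(tokens).items()} for tokens in tokenized]


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_test_cases(bug_report, test_cases):
    """Rank test cases (targets) by similarity to a bug report (query)."""
    vectors = tfidf_vectors([bug_report] + test_cases)
    query, targets = vectors[0], vectors[1:]
    scores = [(index, cosine(query, target)) for index, target in enumerate(targets)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical bug report and test case descriptions.
    report = "Crash when opening a new private window"
    cases = [
        "Open a new private window and verify it loads",
        "Verify bookmarks toolbar shows saved bookmarks",
        "Check download panel lists finished downloads",
    ]
    for index, score in rank_test_cases(report, cases):
        print(index, round(score, 3))
```

A candidate trace link would then be kept when its score exceeds a threshold, or when it appears among the top-ranked targets; the choice of cutoff is left open here.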

4) OTHER APPROACHES
Three studies did not use IR or ML techniques. The study in [47] constructed a dependency graph from source code, compared two consecutive versions, and determined the added or removed dependencies among classes and packages. Based on the calculation, the study attempted to link architectural issues to code changes. The study in [32] used the design science research method proposed by Wieringa [69] to propose a framework for designing traceability strategies. The study in [54] used the NetworkX package to analyze the social network of developers based on issues and commits.

We investigated evaluation methods and found that I-RT studies mainly conducted experiments to evaluate approaches. We then focused on evaluation targets that were used by the studies. Table 8 summarizes the evaluation targets by classifying them into three groups: open-source projects, open datasets, and student projects.

1) EXPERIMENTS WITH OPEN-SOURCE PROJECTS
Among the studies that used open-source projects for their evaluation, twelve used Apache projects [16], [28], [33], [35], [36], [37], [38], [40], [43], [47], [57], [59]. The study in [40] used 15 open-source projects to compare their proposed approach (TraceScore) with state-of-the-art approaches (SimiScore and CollabScore). The study in [28] used 12 projects to compare their proposed FULink approach with FRlink. The studies in [35], [43], [57] collected true links (i.e., commits that fixed issue reports) from open-source projects. The study in [33] listed 5 open-source projects but only used the Apache Crunch project to evaluate the recommendations of issue IDs for commits. The study in [59] used 4 projects to compare their method with 3 other methods: Naive-Bayes, RevFinder, and Profile. Three studies used 3 projects [36], [37], [47]. The two studies in [36], [37] evaluated the accuracy of the trace links between issue reports and commits or bug locations. The study in [47] estimated the size of architectural changes based on architectural issues and code changes. Two studies used two projects [16], [38]. The study in [38] demonstrated the capability of the proposed approach (ReTeCe) to provide requirement coverage reports. The study in [16] collected links between commits and issues and used the links to evaluate the proposed approach in terms of precision.
Next, five studies used projects belonging to Eclipse [29], [45], [51], [56], [62]. Three studies used 34 projects to evaluate the accuracy of the trace links between issue reports and commits [29], [45], [56]. The study in [62] selected 3 projects from the official Bug Tracking Website of Eclipse. The study in [51] used 2 projects to evaluate the proposed approach, BugTrace, by comparing automatically recovered trace links with manually recovered links.
Five studies selected various open-source projects [26], [48], [52], [55], [61]. The study in [52] used 803 issue reports from 15 open-source projects. The study in [61] selected 10 projects from 1,078 Java projects. The study in [26] selected 6 projects and collected trace data from project management systems, issue tracking systems, and code management systems. The study in [55] selected 5 projects to comparatively evaluate 10 issue-linking algorithms. The study in [48] selected 4 projects and extracted 100 consecutive issues per project.
Two studies used mobile app projects. The study in [60] used 10 projects where experts built the ground truth as an answer set. The study in [25] used 4 projects, where the study randomly sampled 50 app reviews for each app, manually verified them, and linked them to issue reports for evaluation.
The study in [44] and its follow-up study [30] used Mozilla Firefox open-source projects to evaluate the accuracy of the trace links generated by their proposed approaches.
In addition, the studies in [39], [42] used 66 open-source projects from Jira and extracted semantic types tagged by project contributors. The study in [27] used Dronology open-source projects to manage the development of a framework for controlling and coordinating Unmanned Aerial Vehicles (UAVs). The study in [58] compared the proposed method with TicketTagger on 3 open-source projects in the domain of source code editors.

2) EXPERIMENTS WITH OPEN DATASETS
Five studies used open datasets. The study in [54] used SEOSS 33 [70], a collection of datasets for 33 OSS projects, and selected the Apache Hadoop, Apache Hive, Apache Pig, Apache HBase, Apache Derby, and Apache Zookeeper projects from it for evaluation. The study in [24] used two datasets: CodeSearchNet and a database that consisted of 3 open-source projects. The study used a golden trace link set, where the committers manually linked commits and issues. The study in [41] used 14 Chinese and 3 non-Chinese open-source projects and automatically created golden trace links as an answer set by using regular expressions and time-based heuristics. The study in [49] collected data from 3 model-driven development industry datasets from internal Mendix Studio low-code platform projects. The study in [34] compared their method, HERMES, with Sabetta and Bezzi's approach on the manually curated Software Application Products (SAP) dataset.
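The automatic construction of golden trace links with regular expressions and time-based heuristics, as mentioned for [41], can be sketched as follows. The text does not give the exact patterns or time constraints used in that study, so the Jira-style key pattern, the 7-day slack after resolution, and all names and sample data below are assumptions made for illustration only.

```python
import re
from datetime import datetime, timedelta

# Assumed Jira-style issue key, e.g. "HIVE-12" (hypothetical pattern).
ISSUE_KEY = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def golden_links(commits, issues, slack=timedelta(days=7)):
    """Build (commit id, issue key) links from commit messages.

    commits: list of dicts with "id", "message", and "time" (datetime).
    issues:  dict mapping issue key -> (created, resolved or None).
    A link is kept only if the key names a known issue and the commit time
    lies between the issue's creation and its resolution plus `slack`.
    """
    links = set()
    for commit in commits:
        for key in ISSUE_KEY.findall(commit["message"]):
            if key not in issues:
                continue  # pattern matched something that is not an issue
            created, resolved = issues[key]
            if commit["time"] < created:
                continue  # a commit cannot fix an issue that did not exist yet
            if resolved is not None and commit["time"] > resolved + slack:
                continue  # too long after resolution to be a plausible fix
            links.add((commit["id"], key))
    return links


if __name__ == "__main__":
    # Hypothetical sample data.
    issues = {"HIVE-12": (datetime(2020, 1, 1), datetime(2020, 1, 10))}
    commits = [
        {"id": "c1", "message": "HIVE-12: fix null check", "time": datetime(2020, 1, 9)},
        {"id": "c2", "message": "Refactor UTF-8 handling", "time": datetime(2020, 1, 5)},
    ]
    print(sorted(golden_links(commits, issues)))
```

The time filter is what distinguishes this from plain pattern matching: it discards commits that mention an issue key by coincidence or long after the issue was closed.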

3) EXPERIMENTS WITH OTHER PROJECTS
Two studies used industrial projects. The study in [69] used the Jira system of a web application where customers registered, selected mortgages, provided documentation for eligibility, and booked appointments with mortgage advisers. The study in [46] used an industrial project in which developers had already manually created the trace links for test cases, including links to bug reports. One study manually created trace links from a student project that contained 395 commits, 40 Java files, and 26 XML files [53].
One study did not conduct a quantitative evaluation [31]. Figure 6 summarizes the evaluation targets in the form of a word cloud. We can observe that Apache, Commons, AspectJ, SWT, Zookeeper, Zxing, and PIG were frequently used as evaluation targets. We can also observe that researchers have used a variety of projects as evaluation targets rather than limiting themselves to specific projects.

RQ4: I-RT studies mainly used open-source projects as evaluation targets, but four studies used open datasets, and one study used student projects.

V. DISCUSSION
Through our literature review, we sought to understand the current research status of I-RT studies. Based on our findings, we identify the challenges in these studies and discuss future research directions.

VI. THREATS TO VALIDITY
Our SLR faces potential threats to validity, which can be divided into construct, internal, and external validity. The threats and mitigation strategies are described as follows:

A. CONSTRUCT VALIDITY
This validity concerns the process of identifying papers. The selection results of papers depend on the coherence of our search queries. To mitigate this threat, we carefully identified search terms and adjusted the combination of the terms according to the digital libraries. We also clarified the queries we made per digital library. Additionally, we defined exclusion and inclusion criteria to review these relevant studies for exact identification. Meanwhile, we also utilized snowballing techniques to obtain as many related studies as possible.

B. INTERNAL VALIDITY
We treated studies that used issue reports or bug reports as I-RT studies. That is, the studies could be diverse, and studies whose main concern was not traceability might have been included. For instance, bug localization papers could be included if they handled the traceability of issue reports. From our point of view, such an inclusion is not a significant concern because bug localization is one of the goals that traceability studies typically aim to achieve. Additionally, to mitigate this threat, we carefully determined whether the studies were related to traceability by checking which "trace" terms appeared in the studies. Another threat to internal validity is that the four authors individually extracted and analyzed the data from the selected papers. Different participants may have different views about the data, so individual analysis could affect the detailed analysis results of the paper. To mitigate this threat, we conducted weekly discussions.

C. EXTERNAL VALIDITY
We searched for papers with keywords in digital libraries by focusing on top-tier conferences and journals as a starting point. This perspective may be narrow, and the rankings of these venues may be inaccurate or may have changed slightly. To mitigate external threats, we conducted multiple rounds of comparative analysis when selecting the venues.

VII. CONCLUSION
We conducted an SLR to investigate the trends of I-RT studies in terms of four aspects: problems, artifact pairs, techniques, and evaluation targets. We summarize our findings as follows. First, the I-RT studies addressed the challenges of low accuracy, manual effort, insufficient support and information, and untrustworthiness of trace links. Second, the artifacts linked to issue reports are commits, source code, user reviews, test cases, etc. Third, most of the techniques used in the studies were ML and IR approaches. Finally, the primary evaluation targets used in the studies were open-source projects.
Based on the results, we also discussed the challenges related to I-RT and propose future research directions. First, we need additional information to improve the accuracy and trustworthiness of trace links. In future work, we plan to develop state-of-the-art techniques or tools to overcome the challenge of insufficient information. Second, we need to find a way to fairly evaluate I-RT approaches. Therefore, it will also be essential to build large open datasets to improve the reliability of the evaluations.