Applications of natural language processing in software traceability

A key part of software evolution and maintenance is the continuous integration of collaborative efforts, which often results in complex traceability challenges between software artifacts: features and modules remain scattered in the source code, and traceability links become harder to recover. In this paper, we perform a systematic mapping study of recent research that recovers these links through information retrieval, with a particular focus on natural language processing (NLP). Our search strategy gathered a total of 96 papers within the scope of our study, covering the period from 2013 to 2021. We conducted trend analysis on the NLP techniques and tools involved, and on traceability efforts (applying NLP) across the software development life cycle (SDLC). Based on our study, we have identified the following key issues, barriers, and setbacks: syntax convention, configuration, translation, explainability, properties representation, tacit knowledge dependency, scalability, and data availability. Based on these, we consolidated the following open challenges: representation similarity across artifacts, the effectiveness of NLP for traceability, and achieving scalable, adaptive, and explainable models. To address these challenges, we recommend a holistic framework for NLP solutions to achieve effective traceability, and efforts towards interoperability and explainability in NLP models for traceability. © 2023 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).


Introduction
Software traceability is a fundamentally important task in software engineering: for some domains, traceability is even assessed by certifying bodies (Guo et al., 2017a). Given that traceability permeates all aspects of software production, the need for automated traceability has increased too, as software projects have steadily become more complex and the number of artifacts has kept growing (Cleland-Huang et al., 2007; Duan et al., 2009; Guo et al., 2017b).
The underlying complexities of the logical relations between artifacts, at various stages in the software process, have prompted a variety of empirical studies (Maletic et al., 2003; Schwarz et al., 2010; Mäder et al., 2017) and several areas of research, particularly in the inception of semantic domain knowledge (Marcus and Maletic, 2003; Zhao et al., 2017a). During the software process life cycle, complex traceability challenges emerge due to differential evolution and the heterogeneity of artifacts, making trace link retrieval onerous. This calls for a holistic framework that requires tools and techniques to promote extensibility and automation (by having a common representation), mapping of native representation to common representation, and rules defining consistency between artifacts (Pete and Balasubramaniam, 2015).
✩ Editor: Nicole Novielli.
As we endeavour to achieve this framework, we must acknowledge the role of natural language processing (NLP) in these efforts as a viable research frontier for solving traceability problems (Arunthavanathan et al., 2016). With recent advancements in NLP, we address a critical need to consolidate and study recent research efforts in this space.
Extracting information from a corpus of text to derive meaningful output is a technique most often found in NLP. In other words, semantic information is extracted from textual data and arranged in formal grammars that specify relationships between text units (Nadkarni et al., 2011). The role of NLP in software traceability addresses limitations of conventional Information Retrieval (IR), particularly around natural language data composition (Russell-Rose and Stevenson, 2009). NLP plays a vital role in these efforts, yet very little has been done to study the existing research efforts in this space. We have devised the following general topics of research focus:
1. Extracting meaningful information from software artifacts using NLP tools;
2. Recovering traceability links through automatic or semi-automatic approaches;
3. Binding the extracted information with domain-specific concepts to decipher context or domain.
These topics form the basis and rationale for this systematic mapping study (SMS), addressing the problem of traceability recovery through solutions of information retrieval with NLP. Given the breadth of traceability in the software life cycle, an SMS is a more appropriate approach to uncover the ways in which NLP has been instrumented and deployed, and in which phase of the software life cycle. By conducting this study, we are able to consolidate diverse and scattered efforts across multiple branches, and identify key gaps in traceability solutions that need more attention.
The following research questions were outlined based on existing research and work in NLP for software traceability, and will be assessed as part of the SMS:
RQ1: What are the demographics of the published articles?
Rationale: This information gives us an overview of the publications' metadata, enabling impact and quality analysis. We will also analyse high-impact publications as part of our study.
RQ2: What is the trend analysis of NLP techniques and tools proposed and evaluated in the published articles?
Rationale: This allows us to establish the state of existing knowledge and efforts, subsequently allowing us to identify research gaps in our current understanding and anticipate future trends.
RQ3: What is the trend analysis across the phases of the SDLC?
Rationale: By using the SDLC framework, we can identify key areas of NLP application in traceability that were proposed and evaluated in publications. Given the breadth of the SDLC, an SMS appears to be a better choice than a Systematic Literature Review (SLR).
RQ4: What are the reported key issues, barriers and setbacks?
Rationale: By collating these, we are able to consolidate pain points and bottlenecks. This allows us to understand the perils and pitfalls of NLP in traceability so we can identify focus areas for future research.
RQ5: What are the open challenges?
Rationale: From the key issues, barriers and setbacks identified, we collate the themes covering these as open challenges.
This paper aims to tackle these questions by conducting a focused yet comprehensive systematic mapping study. It addresses the need to consolidate recent NLP efforts in traceability, analyse the common issues, barriers, and setbacks to effective traceability, and provide recommendations to address open challenges. Section 3 explains the methodology and data process behind the study. Section 4 covers the results, which are then discussed and analysed in Section 5. Section 7 concludes.

Contextual definition
NLP is a branch of Artificial Intelligence and Linguistics that allows the representation and analysis of human language computationally (Khurana et al., 2017). Because vast amounts of unstructured textual data are now being collected and used for machine learning, applications of NLP to solve real-world problems are gaining more attention from researchers and practitioners alike. In the context of software engineering, NLP is utilised to harness value from the natural language present in software artifacts. The use of the textual format of these artifacts is justified by the following (Yalla and Sharma, 2015):
• possibility for automation
• information that is naturally represented, thus making it recognisable and readable for humans
• easy and practical to develop and use
By leveraging the syntactic and semantic nature of software artifacts, we aim to study past and current efforts in trace-link recovery between software artifacts that used NLP techniques and tools. Our paper looks into multiple perspectives (orientation) of software traceability, and the application of NLP to achieve the goals of traceability between software artifacts, including the 'golden challenge' of ubiquitous traceability (Cleland-Huang et al., 2014), that is, instrumenting traceability to be built into the engineering process.

Related work
A mapping study of IR approaches to software traceability was completed in 2014, with a particular focus on previous evaluations and evidence strength (Borg et al., 2014). The study, however, excluded core techniques in NLP methods such as machine learning (Spanoudakis et al., 2003) and semantic networks (Lindvall et al., 2009). These were disregarded as they were too different to fit in the scope due to the complexities in development and deployment. However, the landscape in NLP research has witnessed major breakthroughs in recent years, driving a new wave of tools and applications specifically for software engineering tasks (Sawant and Devanbu, 2021). Examples of such applications include training word embeddings in the software engineering domain space (Efstathiou et al., 2018), requirements classification using deep learning (Navarro-Almanza et al., 2017), and textual classification of natural language in software engineering text mining pipelines (Mäntylä et al., 2018).
A more recent review was done, broadly focusing on adopting NLP to mine unstructured data in software repositories (Gupta and Gupta, 2019). The review was done by looking into general applications of mining repositories, with a sub-focus on traceability efforts. In terms of integrating NLP applications into the SDLC, an assessment of how NLP is employed (in the different phases) was mentioned in Yalla and Sharma (2015). This integration, deemed as multidisciplinary research, highlights the potential advantages: a more holistic approach to Computer Science and Engineering, greater possibility for automation, and a step closer to achieving universal programmability: the possibility to program in a natural language, and without the need of a formal programming language (Tichy et al., 2013). In the context of traceability in artifacts, the need for precise semantics for trace links between heterogeneous systems is critical due to inadequate available tools (Mustafa and Labiche, 2017). This review highlighted the need to define a taxonomy for trace links, as characteristics of trace data are likely to be domain-, organisation-, or even project-dependent.
Although machine learning (such as NLP) has gained an incredible amount of attention only in recent years, one of the earliest systematic literature reviews of traceability approaches (specifically for software architecture and source code) analysed efforts in automatic traceability reconstruction using machine learning classifiers to detect tactic-related classes (Javed and Zdun, 2014): classes that were instrumental in implementing the tactical design decisions. This paper studies efforts in NLP application for traceability in recent years, postdating the study done in 2014 (Borg et al., 2014). Our study aims to look into recent applications of NLP by leveraging (natural language) semantics already present in these artifacts. This is an ongoing focus area in the field of information retrieval, particularly due to recent developments in computational power and the advent of large amounts of linguistic data (Torfi et al., 2021). There is a strong need to consolidate and study these sporadic efforts across different platforms globally to analyse trends in the

Methodology
Following the updated guidelines for conducting systematic mapping studies in software engineering (Petersen et al., 2015), we define our methodology through the process of identifying, analysing, and interpreting all available evidence in a way that is unbiased and (to a degree) reproducible. The following steps were taken to address our research questions outlined in Section 1.

Mapping study planning
An overview diagram of our steps is shown in Fig. 1. We extracted the content and metadata of each piece of literature using a systematic approach and applied various tools to gather all publications necessary within our scope. As shown in Fig. 1, 3.55% of the total result entries have been included as part of our study. This planning was done to ensure comprehensiveness in the study and to address the research questions at hand. Threats to the validity of our study strategy will be discussed in Section 5.

Search string
Table 1 shows the terms relevant to our search and their synonyms. These were derived to expand the boundaries of semantic keywords that are relevant to the research topics. We have separated the terms according to the theme they belong to, and only the most relevant synonyms (to our research questions) are shown in the table.
Forming the search string is the core component of any search strategy of a systematic review or mapping study that involves searching indexed literature databases, as it enables transparency for validation and reproducibility for others. An effective search strategy is usually iterative and benefits from trial searches using various combinations of search terms derived from the research question(s) (Kitchenham and Charters, 2007). This was incorporated into our search strategy as follows.

Evaluating synonym terms
Including all the identified synonym terms would yield wide coverage, but the results would be inundated with false positives. Hence, we evaluate the candidate synonym terms to determine those that will be included in our string output. The three components (themes) of our search string are joined using the AND operator, which enforces a ''must'' rule that all these themes need to be covered. The individual terms within each theme are joined using the OR operator. This ensures that every theme is represented by at least one of its terms.

Trial of potential candidate terms
Fig. 2 shows the combination of terms that were tested. We grouped the synonyms according to the common properties they share, denoted by the ovals. Each of these groups was then evaluated for effectiveness through trials before a decision was made. The groups coloured green are those that were chosen.

Decision and final string output
• Theme 1: (top-down order) Main terms, parent term, methods, model types, subject, and artificial intelligence.
• Theme 2: (top-down order) Main terms and offshoot terms.
• Theme 3: (top-down order) Main terms and types of artifacts.
For Theme 1, NLP and the meaning of its acronym had to be included. We also found that the generic term ''information retrieval'' widened the results beyond the scope of our RQs. The methods group for Theme 1 had to be included because the majority of efforts in text processing do include natural language, although this is not explicitly mentioned in every case. The term ''linguistic'' produced similar results to ''information retrieval'', and the artificial intelligence group did not effectively return the right hits. NLP solutions that use any form of machine learning are already included when using just the main term ''NLP''.
The terms in Theme 2 were more straightforward. We found that using the term ''traceability'' was enough to return the relevant papers in scope of our RQs, as the term is commonly used in software engineering, even without including the term ''software''. We also discovered that lemmatising ''traceability'' to ''trace'' and adding ''link'' was useful to pick up cases where traceability happens without specifically mentioning that it is a traceability problem, for example, locating bugs in the source code, or linking requirements to test cases.
For Theme 3, we found that including the artifact types in our search string restricted our scope of search: this is particularly due to the nomenclature used to represent artifacts produced throughout the SDLC, which can be numerous. We decided to only use the main terms, with both spellings, ''artifact'' and ''artefact''. As a result, we specified the following search string (in order) to extract all related publications within our scope:
("NLP" OR "natural language processing" OR "text mining" OR "text extracting") AND ("traceability" OR "trace link") AND ("source code" OR "software code" OR "software artifacts" OR "software artefacts")
As control papers, we used a 10% random sample of the set of papers obtained in the query: for the updated search query, the control papers used were (Pruski et al., 2015; Lin et al., 2021; Khatiwada et al., 2017; Salih et al., 2021; Ali et al., 2018; Lam et al., 2015; Capobianco et al., 2013a; Iammarino et al., 2020; Scanniello et al., 2015). These were analysed by the second author to make sure that the search query was appropriate, or whether it needed different terms.
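The AND/OR composition described above can be sketched in a few lines of Python. This is only an illustration of how the final string follows from the three themes; the names `THEMES` and `build_search_string` are ours and are not part of the original study.

```python
# Terms per theme, copied from the final search string in the text.
THEMES = [
    ["NLP", "natural language processing", "text mining", "text extracting"],
    ["traceability", "trace link"],
    ["source code", "software code", "software artifacts", "software artefacts"],
]


def build_search_string(themes):
    """Join terms within a theme with OR, and the themes themselves with AND."""
    groups = []
    for terms in themes:
        quoted = " OR ".join(f'"{term}"' for term in terms)
        groups.append(f"({quoted})")
    return " AND ".join(groups)


print(build_search_string(THEMES))
```

Keeping the themes as data makes trial searches cheap: adding or removing a candidate synonym is a one-line change, and the exact string used can be archived for reproducibility.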

Inclusion and exclusion criteria
To ensure our results reflect recent research, we restricted inclusion to the period from 2013 to 2021. By covering these 9 years, we aim to fill the gap left by studies that predate our start year and to focus on more recent developments of NLP-based IR in software traceability. For exclusion, we have disregarded content that is unrelated to (software engineering) traceability, such as other reviews and artifacts with no natural language.
For the exclusion criteria, we used the following filters to weed out the papers that are not within our scope:
1. Duplicates: repeated entries
2. Language: non-English papers
3. Data: incomplete (missing) data
4. Reviews: other reviews, surveys, and mapping studies
5. Context: irrelevance to our defined research topics
The exclusion process (filtering) of papers was necessary due to the abundant false positive results, mostly from Google Scholar. Duplicates were identified through automated integrity checks on titles and authors. For language, we only included papers written in English. Incomplete and missing data refers to search results that do not fully reflect published material, for example, where only the publication source was mentioned with no article title. We also excluded all other secondary and tertiary studies.
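The text does not specify how the automated duplicate check was implemented. One plausible approach, sketched here under that assumption, is to normalise titles and author strings and keep the first entry for each normalised pair. The helper names and the sample entries are hypothetical.

```python
import re


def normalise(text):
    """Lower-case and collapse punctuation/whitespace so near-identical strings match."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def drop_duplicates(entries):
    """Keep the first entry for each (title, authors) key; input order is preserved."""
    seen = set()
    unique = []
    for entry in entries:
        key = (normalise(entry["title"]), normalise(entry["authors"]))
        if key not in seen:
            seen.add(key)
            unique.append(entry)
    return unique


# Hypothetical search-result entries, for illustration only.
results = [
    {"title": "Tracing Requirements to Code", "authors": "A. Author; B. Author"},
    {"title": "tracing requirements to code.", "authors": "A. Author, B. Author"},
    {"title": "Bug Localisation with Text Mining", "authors": "C. Author"},
]
unique_results = drop_duplicates(results)
```

Exact matching on normalised strings is deliberately conservative: it will miss duplicates whose titles differ by more than case or punctuation, so a manual pass may still be needed.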
The final filter was 'context'. We had to determine if the papers were relevant to our defined research topics. We start with the abstract (as it typically serves as the first point of entry). If relevance is not evident, we look into the research questions and methodology, as these describe the work done to achieve a goal and to answer the research questions. The first author was responsible for this task; the 9 control papers (as defined above) were read by the second author to make sure that the context was relevant for the papers to be included. Since the control papers (selected randomly) were all found to be relevant to the context defined by the research questions, computing Cohen's kappa was deemed unnecessary to determine the agreement between reviewers.

Data extraction and management
Table 2 shows the literature databases that were used for our first step in data extraction. The aim was to gather all relevant publications related to our study topics using the search string defined above. The extraction was done either by exporting from the web page (manual extraction using the Web UI) or through an API.
Google Scholar was further used to widen our search results: despite the abundance of false positives (noise), it has the potential to considerably extend the outreach of the systematic search (Harzing and van der Wal, 2008). The results in impact analysis (of publications) will be covered in Section 4.
After the cleaning step instrumented by the exclusion criteria, we gathered a total of 96 papers held by libraries worldwide. We have also ensured that all these were peer-reviewed publications. These were extracted, along with the metadata, and compiled into a spreadsheet consisting of all the information and content for each paper.

Results
The following are the results of our study based on our research topics. These results reflect our findings in NLP efforts in software traceability in recent years, answering our research questions at hand.

RQ1: Demographics of published articles
In terms of demographics for impact and quality analysis, we look at the following metrics:
• publication type, shown in Fig. 3
• citation count per year, shown in Fig. 4
The complete list of papers in scope can be found in Appendix A. We have also included the respective sources (e.g., conference name) of each paper. Roughly two-thirds of the accepted papers are conference and workshop contributions, and the rest appear in more established venues (books and journals). This suggests that conferences still attract quality contributions, although the standing of a conference, however relevant and well known, does not by itself define the quality of the papers it contains. Notable conference venues include the International Conference on Software Engineering (ICSE, 4 papers), the International Conference on Software Maintenance and Evolution (ICSME, 4 papers), and the International Requirements Engineering Conference (RE, 4 papers). These are also examples of A*/A rated software engineering conferences in the CORE conference rankings, which label them as flagship and excellent conference venues.
For citation count per year, we can see 7 outliers that are the top cited publications per year, corresponding to the papers (Panichella et al., 2013; Lam et al., 2015; Arora et al., 2015; Shokripour et al., 2013; Poshyvanyk et al., 2013; Wang et al., 2014; Lin et al., 2021). Although citation count is, arguably, a weak indicator of research quality for some (Aksnes et al., 2019), for the purpose of our mapping study we consider it a factor in research impact, and we will analyse these in Section 5.

RQ2: Trend analysis of NLP techniques and tools for traceability
In this study, we identify how NLP is being used to achieve traceability solutions. Not all NLP efforts are similar; hence, it is useful to categorise these efforts by amount of task complexity, so we can understand how much of NLP was involved in the traceability solutions. We categorise according to the following tiers:
• Tier 1: Only basic complexity tasks, such as processing text (stemming, pattern matching etc.) and tokenising. This category typically only deals with text syntax and no training is involved.
• Tier 2: Basic to intermediate tasks, such as training word embeddings and topic modelling. This category involves training models, pre-trained or otherwise. Semantics are involved and this is closely related to the naturalness of language.
• Tier 3: Basic to advanced tasks, such as implementing deep learning models. This category is an extension of Tier 2 where the semantics (context) of language is derived by (essentially) deep learning. This commonly involves the extended implementation of pre-trained deep learning models in the context of software traceability, such as augmenting neural networks with vector space models (VSM).
Distinction between these tiers is solely determined by task complexity: how much work (in processing natural language) has been done (not only for traceability purposes) to achieve the desired solution. For example, traceability work that uses a pre-trained deep learning model (e.g., BERT) would be classified as Tier 3 because deep learning is a relatively high-complexity task, even though the model is already pre-trained. It is important to note that these tiers are not disjoint; rather, each tier is an extension of the preceding tier. Tier 2 would include tasks in Tier 1 and Tier 3 would include tasks in Tiers 1 and 2. For example, to train a transformer like BERT (Tier 3), basic tokenisation work still needs to take place. Regardless, segregating these tiers is necessary as it allows us to understand 'up to' what level of task complexity is involved in each paper. Classification of these tiers was performed based on the following steps:
1. In each paper, we extract two sections where present: Introduction and Methodology.
2. In the order listed above, we locate the application of NLP based on the proposed solution. Most of the proposed solutions are explained in the Introduction; when it is not clear how NLP is applied, we use the Methodology section to identify the keywords that describe the task complexity involved.
3. Every solution typically involves multiple aspects, and where multiple NLP techniques and tools are applied, only the highest complexity is assigned.
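The rule that only the highest complexity is assigned amounts to taking a maximum over the tiers of the techniques found in a paper. A minimal sketch follows; the keyword-to-tier mapping is a hypothetical one loosely based on the tier descriptions above, not the authors' actual coding scheme.

```python
# Hypothetical keyword-to-tier mapping, loosely following the tier descriptions.
TECHNIQUE_TIER = {
    "tokenisation": 1,
    "stemming": 1,
    "pattern matching": 1,
    "word embedding": 2,
    "topic modelling": 2,
    "deep learning": 3,
}


def assign_tier(techniques):
    """Return the highest tier among the recognised techniques found in a paper."""
    tiers = [TECHNIQUE_TIER[t] for t in techniques if t in TECHNIQUE_TIER]
    if not tiers:
        raise ValueError("no recognised NLP technique")
    return max(tiers)


# A paper that tokenises text and also trains word embeddings is Tier 2.
example_tier = assign_tier(["tokenisation", "word embedding"])
```

Because the tiers are cumulative, the maximum is sufficient: a Tier 3 label already implies that Tier 1 and Tier 2 work may be present.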
The classification of papers into the three tiers was performed by the first author. The second author, using the subset of publications used as control papers above, followed the same three steps to determine which tier each paper belongs to. The results of the two classifications were later discussed and agreement was sought. Unsurprisingly, there was 100% agreement between the two authors on this sample of papers: this is due to how the tiers are formulated, and to how clearly each tier is distinguished from the others.
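For readers who wish to quantify inter-rater agreement on such a sample, percent agreement and Cohen's kappa, $\kappa = (p_o - p_e) / (1 - p_e)$, can be computed as sketched below. The tier labels are hypothetical and are not the authors' data; the function names are ours.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of items on which both raters chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement computed from each rater's label frequencies.
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when both raters use a single label")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical tier labels for nine control papers; not the authors' data.
first_author = [1, 1, 2, 2, 2, 3, 3, 1, 2]
second_author = [1, 1, 2, 2, 2, 3, 3, 1, 2]

agreement = percent_agreement(first_author, second_author)
kappa = cohens_kappa(first_author, second_author)
```

With identical labels spread over more than one tier, both measures equal 1. Kappa becomes undefined only in the degenerate case where every item receives the same single label from both raters.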
Based on our analysis, all the task complexity properties are transitive: Tier 1 tasks are a subset of Tier 2 tasks and Tier 2 tasks are a subset of Tier 3 tasks. For example, one does not train a word embedding without having to pre-process the text, and one does not train a deep learning model without having to embed layers of word vectorisation models. Thus, these tiers are not disjoint: higher tiers include tiers that have lower complexity levels, in other words, an ''up to'' is implied for each. Fig. 5 shows the trend of the tools and techniques involved from a tiered perspective, in terms of published paper count, and by year of publication.
Within each tier, we have multiple techniques and tools that were used as part of traceability solutions in our study. Table 3 shows NLP techniques involved in traceability solutions with the relevant papers involved. Table 4 shows external support tools and libraries that have been identified with the relevant papers involved.

RQ3: Trend analysis of NLP application for traceability across the SDLC phases
We look into traceability applications through the phases of the SDLC framework. Given that there is not one official SDLC model, we will be using the common de facto phases of the framework as our basis (Mishra and Dubey, 2013). To visualise the relationships identified effectively, we present Fig. 6: a bubble chart of the pairwise SDLC phase relationship counts over the years. The horizontal dotted line across 'REQ-CODE' shows the SDLC phase relationship that is present in all years, with 2019 showing the maximum count overall. Where there is no bubble, the count is zero. Every paper in scope is involved in one or more pairwise SDLC relationships. In the few cases where a paper involves multiple pairwise relationships, that paper is counted in every bubble where the pairwise relationship is present for that year. In other words, a paper is not exclusive to one bubble: multiple bubbles may represent one paper that has multiple pairwise relationships. The distribution count is as follows:
• No. of papers with one pairwise relationship: 84
• No. of papers with two pairwise relationships: 11
• No. of papers with three pairwise relationships: 1
• Total no. of papers involved: 96
Fig. 6 also shows the 'OTH' (others) phase, which refers to artifacts involved outside of the SDLC phases identified in Section 4.3. Some examples of artifacts identified at 'OTH' are (informal) documentation, user queries, and release notes.
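The counting behind such a bubble chart can be sketched as follows. The records are hypothetical; only the labels 'REQ-CODE' and 'OTH' come from the text, and the other phase-pair labels are made up for illustration.

```python
from collections import Counter

# Hypothetical records: (publication year, set of SDLC phase pairs traced in the paper).
# 'REQ-CODE' and 'OTH' appear in the text; 'CODE-TEST' and 'CODE-OTH' are invented labels.
papers = [
    (2019, {"REQ-CODE"}),
    (2019, {"REQ-CODE", "CODE-TEST"}),
    (2020, {"REQ-CODE"}),
    (2020, {"CODE-OTH"}),
]

# One bubble per (year, pair); a paper with several pairs contributes to several bubbles.
bubble_counts = Counter((year, pair) for year, pairs in papers for pair in pairs)

# Distribution of papers by number of pairwise relationships.
pairs_per_paper = Counter(len(pairs) for _, pairs in papers)
```

Because a paper can feed several bubbles, the bubble counts sum to more than the number of papers, whereas the per-paper distribution always sums to the paper total, as in the reported 84 + 11 + 1 = 96.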

RQ4: Key issues, barriers, and setbacks
We have identified eight key issues, barriers, and setbacks, outlined in Table 5, with relevant papers highlighting each of these. These were identified through analysing the discussion of results, which is typically found in the 'Discussion' section of each paper. We extracted all identifiable (implicit or explicit) issues, barriers, and setbacks that are direct results of using NLP in the proposed traceability solutions. Each of these is explained in this section and further discussed in Section 5.

Syntax convention
There is no unified naming convention for the various references in the artifacts, such as functions, variables, and classes. Due to this, we cannot generalise models trained on certain specifics, and this hampers effective traceability efforts.
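A common pre-processing step that mitigates differing naming conventions is to split identifiers into lower-case words before matching them against natural language text. The sketch below is illustrative only and is not taken from any of the surveyed papers; `split_identifier` is a hypothetical helper.

```python
import re


def split_identifier(name):
    """Split a snake_case, kebab-case, or camelCase identifier into lower-case words."""
    words = []
    for part in re.split(r"[_\-\s]+", name):
        # Match acronyms (e.g. "HTTP"), capitalised or lower-case words, and digit runs.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [w.lower() for w in words]


camel = split_identifier("getUserName")
mixed = split_identifier("parse_HTTPResponse")
```

Such rules cover the most frequent conventions, but project-specific abbreviations (for example, `usrNm`) still require a dictionary or a learned model, which is part of why this issue persists.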

Configuration
Finding the optimum configuration may be possible for one use case. However, in reality, artifacts evolve over time (through active development), and (optimal) configurations change as well.
Although NLP has been effective in recovering missing and broken trace links, configuration is still a pertinent issue in achieving effective traceability. In deep learning tasks (Tier 3), searching for the optimal configuration (exhaustive evaluation) poses other issues, such as computational costs, time complexities, and hardware carbon footprint (Lauriola et al., 2022).

Translation (language)
Translation of languages is a service that is integral to any traceability solution that involves unifying cross-language artifacts. Dependency on the effectiveness of this service (by the accuracy of cross-language information retrieval output) proves to be a setback to effective traceability. A comparative study done in 2015 observed that different translation services can result in considerably different retrieval behaviours for individual queries for different language pairs and applications (Hosseinzadeh Vahid et al., 2015).

Properties (representation) of artifacts
Software artifact properties change constantly, and traceability solutions that use NLP (such as similarities between vectors) do not keep up. Besides change management, this issue is also relevant for the representation of software artifacts throughout different SDLC phases. For example, in the Design phase where UML diagrams are used, some form of parser needs to be implemented to unify these representations with other artifacts from other SDLC phases.
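To make the vector-similarity idea concrete, the sketch below scores two artifacts by the cosine similarity of their bag-of-words term vectors. It is a deliberately simple stand-in for the vector space models used in the surveyed work; the artifact texts and function names are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector over lower-cased alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical artifacts, for illustration only.
requirement = "The user shall reset the password by email"
code_comment = "Send password reset email to the user"
unrelated = "Render chart axis labels"

related_score = cosine_similarity(term_vector(requirement), term_vector(code_comment))
unrelated_score = cosine_similarity(term_vector(requirement), term_vector(unrelated))
```

The scores depend entirely on the current text of each artifact, so any edit to a requirement or a code comment changes its vector. This is why similarity-based links must be recomputed as artifacts evolve.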

Explainability
The lack of explainable and interpretable models is a key barrier to effective traceability. This becomes more prominent in higher tiers of task complexity, as state-of-the-art pre-trained models, although scoring high in benchmarked NLP tasks, are typically black-box in nature and serve very little purpose in situations where traceability becomes a core component mandated by requirement standards and regulations, such as for medical device software (Regan et al., 2013).

Dependency on tacit knowledge
There is still considerable dependency on tacit knowledge that is integral to traceability solutions with NLP. This dependency hampers efforts in automated effective traceability due to the limitations of models in every domain. It is also related to the artifact properties (representation) issue, in that no one-size-fits-all policy applies to all SDLC phases.

Scalability
Scaling the solutions in traceability efforts is identified as a key barrier, particularly in large-scale systems. In object-oriented programming, encapsulation of objects helps to improve scalability due to the isolation of internal modifications of any one object (Corriveau, 1996). Despite this, traceability between software artifacts does not automatically benefit in the same way, especially when large systems involve complex trace links with an increasing number of artifacts and developers. This is also an extension of the configuration issue, where compute and time complexities severely affect effective traceability efforts.

Data availability
In supervised and semi-supervised strategies, we require vast amounts of training data specific to the software engineering domain. In an ideal world, all of this data is annotated and ontologies are well-defined; however, that is not the case in reality. Annotation of data is an expensive, time-consuming, and laborious task that does not appeal to many, and this has prompted a variety of solutions such as crowdsourcing through Amazon Mechanical Turk (Snow et al., 2008).

RQ5: Open challenges
From these key issues, barriers, and setbacks, we identify three themes that are presented as open challenges in recent applications of NLP in traceability.

Syntax and semantic similarities in representation across artifacts
Traceability between artifacts stems from identifying components that are linked to one another. To achieve this, the manifestation of concepts (through the artifacts' components) needs to be synchronised in terms of syntax and semantic similarities. This challenge is one that NLP solutions for traceability continue to face.

Effectiveness in automated software traceability
As software systems continue to evolve in scale and complexity, the call for automated traceability has never been more critical. The number of traceability links that need to be captured grows exponentially with the size and complexity of the software system (Cleland-Huang et al., 2003). Moreover, constant change throughout the SDLC poses a significant challenge to the maintenance of traceability links, with studies showing that change can be expected throughout the life cycle of every project (Boehm, 2003). In the quest for automated traceability, the effectiveness of these solutions continues to be an open challenge.

Achieving scalable, adaptive, and explainable models
Recent works (especially in deep learning and off-the-shelf solutions) have resulted in an increasing number of black-box NLP services and tools. Traceability solutions need to be transparent, especially when traceability is a factor in requirements validation and tracing of regulations. Moreover, scaling and adapting NLP solutions continues to be an open challenge for interoperability. Any trade-offs between implementing an NLP component to achieve successful traceability, and the extra resources it needs, have to be justified.

Discussion
To further elaborate our findings based on our research questions outlined in Section 1, we will discuss the results of our study.

RQ1: Demographics and quality analysis
Fig. 3 shows the percentage spread of publication type, with conference proceedings (62%) and journal articles (34%) making up most of the papers selected. All of the conferences and journals (where the papers selected were published) were peer-reviewed, and some papers were shown as outliers for having higher citations-per-year metrics compared to the rest of the dataset (Fig. 4).
In Computer Science, the citation count of conferences is no higher than in journals. Moreover, analysis has shown that Computer Science, as a discipline, values conferences as a publication venue more highly than any other academic field of study (Vrettas and Sanderson, 2015). As we look into our outliers more closely, we present a summary of the traceability solutions proposed in each and how NLP was applied, shown in Table 6.

RQ2: Trend analysis of NLP techniques and tools for traceability
We look into how the techniques and tools in NLP have evolved over recent years. Based on Fig. 5, we can see that the majority of NLP efforts are in the Tier 2 category, involving 'basic' to 'intermediate' tasks, with a prominent spike in 2019. During the early years of our scope (2013-2017), these were used mainly to process text, represent it as vectors, and use those vectors in a space model (such as the vector space model, VSM) to detect similarities. The role of NLP has evolved over recent years due to the proliferation of efforts in combining machine learning with basic text processing. This trend continues, with a focus on deep learning, such as with transformers (Vaswani et al., 2017). The spike in 2020 (for Tier 3) may be attributed to the increasing research interest in state-of-the-art deep learning tools in NLP, such as the adoption of Convolutional Neural Networks, previously more common in Computer Vision (Moreno Lopez and Kalita, 2017), BERT (Devlin et al., 2018), and Hugging Face Transformers in 2019 (Wolf et al., 2019).
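The vector-based similarity detection described for the early years of the scope can be illustrated with a minimal sketch. It assumes a plain TF-IDF weighting with a smoothed IDF and cosine similarity; the requirement text, the artifact names (`AccountLockService`, `ReportExporter`), and all function names are hypothetical and are not taken from any of the surveyed papers.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split on any non-letter character."""
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Raw term frequency times a smoothed inverse document frequency.
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_artifacts(requirement, artifacts):
    """Rank candidate artifacts by similarity to a requirement, best first."""
    names = list(artifacts)
    vectors = tfidf_vectors([requirement] + [artifacts[n] for n in names])
    query, rest = vectors[0], vectors[1:]
    scored = [(name, cosine(query, vec)) for name, vec in zip(names, rest)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical requirement and artifact descriptions.
requirement = "The system shall lock the user account after three failed login attempts"
artifacts = {
    "AccountLockService": "lock user account when failed login attempts exceed limit",
    "ReportExporter": "export monthly sales report to pdf file",
}

ranking = rank_artifacts(requirement, artifacts)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy setting the artifact sharing vocabulary with the requirement ranks first, while the unrelated one scores zero; real artifacts would additionally need stemming, stop-word handling, and identifier splitting.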
To further understand the trend beyond using the period of years as our timeline, we should consider the research impact that each tier has (Which areas are cited most? Where is the attention being drawn?). This can be done by using citation analysis for each tier; citations per year (for each tier) indicate the amount of attention (impact) the research has. Table 7 shows the average citations per year of each tier category (average citations per year = sum of all citation counts / number of papers). From the table, we can see that despite Tier 3 having the fewest papers published overall, its average citation count per year is the highest of all tiers (4.51). The aforementioned spike in 2020 for Tier 3 is still considerably lower than Tier 2's spike in 2019; however, this citation analysis may indicate that the research impact in deep learning (for NLP applications in traceability) is the largest. It is still too early to conclude how the trend of deep learning in NLP will go (in the field of traceability), but in general, we can see an upward trend in deep learning across software engineering (Ferreira et al., 2021).
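The per-tier citation analysis can be sketched as follows. This is only an illustration under stated assumptions: the paper records are invented (they are not the values behind Table 7), and normalising each paper's citations by its age relative to a hypothetical `as_of_year` is our assumption about how a per-year figure might be obtained before averaging over the papers in a tier.

```python
from collections import defaultdict

# Hypothetical records: (tier, publication_year, total_citations).
papers = [
    (1, 2014, 40),
    (2, 2016, 30),
    (2, 2019, 9),
    (3, 2020, 12),
    (3, 2019, 24),
]


def citations_per_year(pub_year, citations, as_of_year):
    """Normalise a citation count by the paper's age (at least one year)."""
    years = max(as_of_year - pub_year, 1)
    return citations / years


def average_by_tier(papers, as_of_year):
    """Average the per-paper citations per year within each tier."""
    per_tier = defaultdict(list)
    for tier, year, cites in papers:
        per_tier[tier].append(citations_per_year(year, cites, as_of_year))
    # Sum of per-paper values divided by the number of papers in the tier.
    return {tier: sum(vals) / len(vals) for tier, vals in sorted(per_tier.items())}


averages = average_by_tier(papers, as_of_year=2022)
for tier, avg in averages.items():
    print(f"Tier {tier}: {avg:.2f} citations per year")
```

Normalising by age keeps older papers from dominating purely because they have had longer to accumulate citations, which is why a tier with few, recent papers can still show the highest average.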

RQ3: Trend analysis of NLP applications for traceability across SDLC
Based on Fig. 6, we can see the SDLC phases where traceability with NLP occurs more frequently, i.e., relationships involving the REQ, CODE, and DES phases. As noted above, Requirements Engineering is the area with the most traceability activities throughout recent years, followed by Design and Bug Localisation, respectively.

Requirements traceability
Tracing requirements to source code (and vice versa) using NLP is very common throughout the years, with a considerable spike in 2019, as seen in Fig. 6. Artifacts pertaining to the REQ phase (such as functional and non-functional requirements) are generally written in natural language, and there is no observable unified structure behind the language and syntax. Bi-directional traceability (Salih et al., 2021), linking to UML diagrams (Arunthavanathan et al., 2016; Salih and Sahraoui, 2018; Kchaou et al., 2019; Salih et al., 2021; Panichella et al., 2015; Kchaou et al., 2017; Effa Bella et al., 2018), fuzzy logic (Thommazo et al., 2013), and reducing false positives (Effa Bella et al., 2019; Capobianco et al., 2013b) are some examples of how NLP was used during the REQ phase.
Tracing requirements to other artifacts, such as UML diagrams and source code, is necessary, and in some cases mandatory, to adhere to regulatory compliance. For healthcare systems, we have HIPAA (Health Insurance Portability and Accountability Act) (Florez, 2019; Velasco and Aponte Melo, 2019; Lin et al., 2017; Effa Bella et al., 2018). In airspace systems, the National Aeronautics and Space Administration (NASA) strives to ensure that FAA (Federal Aviation Administration) governance policies and standards are met (Malik et al., 2016). Templates facilitate good quality (and are inherently an effective tool for conformance) by avoiding complex structures, ambiguity, and inconsistency in requirements. However, managing this conformance is labour intensive, and a tool for automated checking of conformance to templates was developed: REquirements Template Analyzer (RETA) (Arora et al., 2015). In some cases, these regulations are explicitly written as non-functional requirements, such as those corresponding to safety and legal aspects (Mahmoud and Williams, 2016; Mahmoud, 2015).

Bug localisation
In our study, bug localisation was also a major highlight in several papers across different phases of the SDLC. NLP was used to reduce manual efforts in remedying faults outlined in bug reports by automating redundant tasks such as reading and searching in natural language artifacts, and locating areas of concern. Examples include comparing bugs to generated patches (Csuvik et al., 2020), tracing between bug reports and test cases (Gadelha et al., 2021), tracing between bug reports and source code (Khatiwada et al., 2017; Liu et al., 2019; Wang et al., 2014; Lam et al., 2015; Malhotra et al., 2018; Jiang et al., 2020; Zhou et al., 2017; Gharibi et al., 2018; Shokripour et al., 2013), cross-language bug tracing (Xia et al., 2014), and commit information (Yang and Lee, 2021).
In the current landscape of large evolving software systems, locating bugs (typically within the source code) is a challenging task. Our study looks into traceability between artifacts, and we have identified bug reports to be the focal artifact involved in bug localisation. Natural language in bug reports is a common target for NLP tasks (such as traceability, the focus of our study), so de-noising these bug reports to separate out the non-natural-language content helps the cause (Hirsch and Hofer, 2022).
One common example of bug localisation is tracing the components of a bug report to source code. Bug reports are a form of change request, which serves to change the existing program elements (e.g. source code files) to correct an undesired behaviour of the software (Dilshener et al., 2017). This allows developers to identify what needs to be rectified and modified in the source code to remove the bug, which is a core software maintenance task. Through the lens of traceability using NLP, these components may relate to terms that match between bug reports and source code. Empirical studies have shown that vocabulary used in bug reports was also present in the source code files (Moreno et al., 2013; Saha et al., 2013), be it an exact or partial match of program elements (i.e. class, method, or variable names and comments). This matching (syntax and semantic similarity) paves the way for NLP to determine bug location more effectively.
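The vocabulary overlap between bug reports and program elements can be sketched as below. This is a minimal illustration, assuming exact term matching after splitting camelCase and snake_case identifiers; the bug report, file names, identifiers, and stop-word list are all hypothetical, and none of the cited approaches is reproduced here.

```python
import re

# Small illustrative stop-word list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "is", "when", "after", "to", "of", "and", "in"}


def split_identifier(name):
    """Split a camelCase or snake_case identifier into lowercase terms."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def terms(text):
    """Extract lowercase terms from free text, splitting embedded identifiers."""
    out = set()
    for word in re.findall(r"[A-Za-z0-9_]+", text):
        out.update(split_identifier(word))
    return out


def rank_files(bug_report, files):
    """Rank files by the share of bug-report terms found in their identifiers."""
    report_terms = terms(bug_report) - STOPWORDS
    scores = {}
    for path, identifiers in files.items():
        file_terms = set()
        for ident in identifiers:
            file_terms.update(split_identifier(ident))
        shared = report_terms & file_terms
        scores[path] = len(shared) / len(report_terms) if report_terms else 0.0
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical bug report and program elements per file.
bug_report = "Session is not expired after timeout when the user stays idle"
files = {
    "SessionTimeoutHandler.java": ["SessionTimeoutHandler", "expireSession", "idleSeconds"],
    "InvoicePrinter.java": ["InvoicePrinter", "printInvoice", "pageCount"],
}

ranking = rank_files(bug_report, files)
for path, score in ranking:
    print(f"{path}: {score:.2f}")
```

Note that "expired" in the report does not match "expire" in the code under exact matching; handling such partial matches is where stemming or semantic similarity would come in.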

Continuously developed tools
We have also identified some tools that were developed continuously across the years (covered by multiple papers reflecting incremental development) across the SDLC phases, namely Software Artifacts Traceability Analyzer (SAT-Analyzer) and TiQi. NLP was first introduced in SAT-Analyzer for addressing artifact inconsistencies due to natural language representation (Arunthavanathan et al., 2016). It improves the usability of SAT-Analyzer through automated generation of XML input from requirement artifacts, which was then evaluated by a case study on a Point-of-sale (POS) system (Rubasinghe et al., 2018b). SAT-Analyzer was also covered in DevOps practices (Rubasinghe et al., 2018a, 2020) as a traceability management tool for continuous integration and multi-user collaboration. TiQi, on the other hand, focuses on trace queries that are generally complex and naturally worded, transforming them into executable SQL statements (Pruski et al., 2014). A more in-depth description of the architecture, design, and heuristic rules was then published in a later paper (Pruski et al., 2015), and a demo was made available online (Lin et al., 2017).

RQ4: Key issues, barriers, and setbacks
We dive into each of these key points to understand further how the papers have contributed to the aforementioned issues, barriers, and setbacks.

Syntax convention
Our study has found that some assumptions had to be made in the semantic representation of syntax used in artifacts. One example is the assumption that developers only use expressive, non-abbreviated variable names, such as those that are contained in BERT's dictionary (Keim et al., 2020b,a).
The lack of generally used annotation of artifacts (Kicsi et al., 2018) and imperfect naming (Csuvik et al., 2019b) typically lead to inaccurate links. The added challenge of artifacts such as non-functional requirements (Mahmoud and Williams, 2016) hinders traceability efforts due to the lack of homogeneity in syntax representation: natural language pertaining to non-functional requirements is less explicit in tracing links. Moreover, the detection of constraints in non-functional requirements becomes more difficult due to the lack of robust modelling and documentation techniques (Mahmoud, 2015).
In a case study for SAT-Analyzer, it was observed that inaccurate extraction and identification of artifact elements with NLP, caused by differing naming conventions and less meaningful names in requirement artifacts, led to a lack of accuracy (Rubasinghe et al., 2018b). Semantic ambiguities in artifacts written in natural language pose a challenge in tracing explicit links with other artifacts, based on the syntax used (Kchaou et al., 2019).
In specific critical contexts, such as healthcare regulations, the achieved level of granularity in traceability is often insufficient. The regulations related to audit control standards and session expiration in the implementation of healthcare systems were the hardest to trace to source code statements: very few lines and source code structures related to these requirements were successfully mapped (Florez, 2019).

Configuration
Although NLP has been effective in recovering missing and broken links in self-adaptive systems, it can introduce significant overhead (Hariri and Fredericks, 2018). Threshold values of semantic similarity are typically a 'moving goalpost', and high confidence values, such as 95% (Singh, 2022), were chosen arbitrarily to represent strong confidence. Selection and tuning of parameters affect the accuracy of results, and static configurations are identified as an internal threat to the validity of results (Ali et al., 2015). Automated configurations, such as for Latent Semantic Indexing (Eder et al., 2015), improve applicability, although the computation overhead can be significant.
As mentioned in the previous section, exhaustive evaluation for optimal configuration results in various complications, such as significant computational costs and time complexities. This is exacerbated by the continuously changing nature of artifacts throughout the SDLC phases, making traceability efforts even more challenging. Achieving a (near) optimal configuration for topic modelling was the goal of one of our papers, which introduced Genetic Algorithms (GA) with LDA to boost the accuracy of traceability link recovery (Panichella et al., 2013), among other tasks. This paper also highlighted the need for an efficient method to find the best configuration of parameters, as an exhaustive analysis of all possible combinations is deemed impractical.
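To make the 'moving goalpost' concrete, the sketch below sweeps a similarity threshold over a handful of candidate links and reports precision and recall at each cut-off. All scores, artifact names, and ground-truth links are invented for illustration; only the 0.95 cut-off echoes the 95% value mentioned above.

```python
def precision_recall(scores, true_links, threshold):
    """Precision and recall of the links whose score meets the threshold."""
    predicted = {pair for pair, score in scores.items() if score >= threshold}
    true_positives = len(predicted & true_links)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(true_links) if true_links else 0.0
    return precision, recall


# Hypothetical similarity scores for candidate (requirement, file) pairs.
scores = {
    ("R1", "Auth.java"): 0.97,
    ("R1", "Audit.java"): 0.62,
    ("R2", "Audit.java"): 0.88,
    ("R2", "Report.java"): 0.41,
    ("R3", "Report.java"): 0.55,
}
# Hypothetical ground-truth trace links.
true_links = {("R1", "Auth.java"), ("R2", "Audit.java"), ("R3", "Report.java")}

# Evaluate the same scores under three different cut-offs.
sweep = {t: precision_recall(scores, true_links, t) for t in (0.95, 0.80, 0.50)}
for threshold, (precision, recall) in sweep.items():
    print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f}")
```

In this toy data the strictest cut-off keeps precision perfect but misses two of the three true links, while the loosest recovers all of them at the cost of one false positive, which is why a fixed threshold is hard to justify across datasets.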
Effective traceability is crucially dependent on the performance of the models used, which is determined by their configuration settings. One key aspect of this is hyperparameter tuning, which can often make the difference between a mediocre model and a state-of-the-art one (Eggensperger et al., 2015).

Translation (language)
Reported setbacks in these efforts concern the effectiveness of translation services that are readily available (Yıldız et al., 2014; Liu et al., 2020a; Xia et al., 2014). Although these translation services are mainly black-box in nature, they are critical to the effectiveness of traceability. There is no generic dictionary (model) for all languages, as each language has its own rules of grammar (syntax) and its own semantic interpretation of words used. However, there is a recent primer on pretrained multilingual embeddings (Doddapaneni et al., 2021), which is yet to be fully utilised in software engineering.

Properties (representation) of artifacts
In a dynamic, continuously integrated, continuously developing environment (Rubasinghe et al., 2018a, 2020, 2018b), artifacts transform constantly and this hampers continuous traceability efforts. In cases where traceability is necessary for regulations (Florez, 2019; Arora et al., 2015), the natural language used in these documents is not represented similarly to other artifacts, such as functional and non-functional requirements. Adaptive standard feedback was also proposed upon the consideration that software artifacts do not share the same properties of natural language documents, on which the standard feedback relies (Panichella et al., 2015).

Explainability
Despite huge successes in large language models, their black-box nature hinders key goals of NLP, particularly in explainability (Lin et al., 2021; Keim et al., 2020b,a). In cases where traceability plays an important role (such as adherence to regulations and auditing), the black-box nature of these advanced solutions proves to be a hindrance, as validation of results becomes difficult (Velasco and Aponte Melo, 2019).

Dependency on tacit knowledge
This dependency is more prominent in traceability use cases pertaining to software architecture, where experiential knowledge is vital in recovering architectural trace links (Keim and Koziolek, 2019) and links between requirements and process models (Lapeña et al., 2019).

Scalability
Large-scale systems pose a challenge in traceability management due to the complexity of trace links, particularly in visualisation (Rubasinghe et al., 2020;Chen et al., 2018).This also relates to time and compute resource complexities, and becomes even more challenging in environments where constant change is present (Rubasinghe et al., 2018a,b).

Data availability
The amount of labelled data to train classifiers is not as abundant as we ideally need it to be, and this poses a setback for effective training in supervised models for traceability (Chen et al., 2021). The amount of annotated data in some domains is richer than in others, which is heavily dependent on the efforts of the community. This translates to varying levels of model accuracy for different domains, which affects traceability effectiveness. Models can only train on data that is available, and the performance of any model is entirely dependent on the data that it is trained on.

RQ5: Open challenges
To answer RQ5, we first need to identify the pertinent issues that arise; second, through understanding the pain points, we can derive and model the open challenges. Fig. 7 shows the mapping of open challenges from the key issues, barriers, and setbacks that were identified in Section 5.4.

Syntax and semantic similarities in representation across artifacts
The first and foremost open challenge of NLP is primarily derived from the most recurring issue reported in our study (see Section 4.4.4), and is centred around the role NLP plays in traceability: processing natural language in artifacts. The natural language present in artifacts needs to be represented uniformly in various parts of the SDLC, and achieving similarity in each of those representations is an open challenge that NLP continues to play a major part in solving.

Effectiveness in automated software traceability
Software artifacts are not entirely similar to natural language text, and NLP advancement efforts are largely based on use cases pertaining to human communication, such as developing cognitive (intelligent) skills through natural language understanding. This direction is not entirely useful for software engineering purposes, particularly relating to traceability. The open challenge is in leveraging and harnessing the value of NLP techniques, focusing NLP advancement efforts on the field of software engineering. Moreover, pure automation of traceability efforts continues to pose a common challenge despite recent successes in language models.

Achieving scalable, adaptive, and explainable models
NLP models that are involved in traceability efforts face significant challenges to scale and adapt in tandem with how software systems change and evolve throughout the SDLC. This open challenge is a derivative of identified issues pertaining to scalability, data availability, and explainability. Explainable AI is a critical component in adopting machine learning models in any decision-making process, with traceability being no different. In software engineering, the adoption of these models is hindered by the lack of explainability and understanding of how these models work (Tantithamthavorn and Jiarpakdee, 2021).

A holistic framework model for NLP solutions to achieve effective traceability
NLP techniques and tools have played a major role in processing and vectorising text, serving as some form of natural language decoder to unify representations across artifacts for traceability. We recommend efforts in developing a holistic framework model to achieve effective traceability, subsequently addressing key open challenges of NLP in traceability. A holistic framework should fulfil the following:
• Techniques and tools in NLP that are representative of the software engineering domain. Currently, efforts are sparse and scattered, focusing on very specific parts of software engineering that are isolated.
• A unified ontology across the software engineering domain space, through consolidating and integrating taxonomies across multiple domains in software engineering.
• Models that 'understand' natural language across various aspects of the SDLC phases. Natural Language Understanding (NLU) is an extension of NLP where models are able to comprehend terms that are specific to the SDLC phases, and across these phases, through classifying intents, confidence scores stability, and extracting entities (Abdellatif et al., 2021).

Towards achieving interoperability and explainability
Models have to be transparent, scalable, and accurate in recovering trace links (i.e. effective traceability). We propose that applications of NLP in traceability be made transparent and explainable. Efforts in NLP research for traceability should not only focus on having the next best model that supersedes the accuracy scores of previous models in determining trace links, but also on proving scalability and providing explainability. We need to have some form of global certification and validation process to be able to certify models as experts. Moreover, we need to incorporate efforts in explainable Artificial Intelligence (AI) and model reasoning to reduce bias and fill the gap left by dependencies on tacit (experiential) knowledge from human experts.

Threats to validity
In this section, we outline the threats to validity identified throughout our mapping study process. Based on a recent map of threats to validity in systematic mapping studies in software engineering (Zhou et al., 2016), we looked into all possible threats that emerge from conducting our study.

Construct validity
Our research questions and methodology may not entirely cover every aspect of studying how NLP is used for software traceability. However, we ensured that our research strategy was thorough and comprehensive in fulfilling the secondary study conducted to address key gaps in areas pertaining to NLP in software traceability. We adhered to the guidelines outlined in Petersen et al. (2015). It is important to stress again that a systematic literature review would be less suited to uncovering the existing methods and approaches based on NLP, and it would face a larger threat to construct validity than the mapping study presented in this work.

Internal validity
The search for relevant papers to populate our mapping study was thoroughly executed: multiple library databases were used, including Google Scholar, a search aggregator that covers a wide range of databases and libraries. Addressing internal threats to validity is critical in mapping studies: the findings need to be unbiased and the search string needs to be reflective of our study scope.

External validity
The specific techniques, tools, and trends analysed in our study may not generalise outside of our search scope. Research efforts in NLP and traceability have continued to evolve rapidly in recent years, and the choice of focus may affect the results generated. To reduce this threat, and for the sake of generalisability, we proposed a tier categorisation for NLP techniques and focused our recommendations on common key issues, barriers, and setbacks rather than specific ones.

Conclusion validity
The limited availability of published efforts in NLP and software traceability may impact the conclusions derived from our study scope, especially on empirical evidence in the industry for traceability efforts that are not published. Incorporating synonyms of terms using the Google Scholar search engine as part of our data ingestion pipeline helped us reduce this threat, despite returning abundant false positives.

Conclusion
This paper presents a systematic mapping study focusing on NLP and its applications in the context of software traceability. A total of 96 papers, covering the period from 2013 to 2021, were obtained during the selection process. We looked into the different ways NLP was leveraged to aid traceability efforts across the various phases of the SDLC. We analysed the trend of techniques and tools used, the trend of traceability activities that were involved, and identified key issues, barriers, and setbacks to these traceability efforts. From these, we identified open challenges and presented key recommendations for addressing them.
The field of research in NLP is continuously evolving, and while major use cases of these efforts are typically related to human communication (i.e. human language), there is great potential value for NLP to be further leveraged effectively in software traceability. By conducting this mapping study, we are able to consolidate recent efforts in attempting to take advantage of these techniques and tools to solve traceability problems, particularly through automating redundant tasks and solving key issues that arise from conventional IR techniques. This study serves as a checkpoint for researchers and practitioners to gain a broad view of the various efforts within the scope of our study. Based on the trend analysis done and the open challenges identified, this study has presented two key recommendations in moving forward: a holistic framework for NLP solutions and efforts in achieving interoperability and explainability in NLP models.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Fig. 1. Overview of steps in mapping study planning.

Fig. 4. Box plot of citations per year (outliers are labelled with their respective values).

Fig. 6. Bubble chart of SDLC phase relationships throughout the years. Each bubble size represents the count of papers corresponding to the SDLC phase relationship, shown in the legend for reference.

Fig. 8 presents a mapping diagram to show the relationships between the open challenges and recommendations.The following are our points of recommendation in addressing the three open challenges, as described above in Section 5.5.

Fig. 7. Mapping of key issues, barriers, and setbacks to open challenges.

Table 1
Terms table.

Table 2
Details of index databases used.

Table 3
NLP techniques identified.

Table 4
External NLP supporting tools/libraries identified.

Table 5
Papers highlighting key issues, barriers, and setbacks.

Table 6 (only those with cites per year ≥ 10 are shown). As visible in the table, the majority of the outlier papers come from the top publishing venues in software engineering (ACM/IEEE International Conference on Software Engineering and IEEE Transactions on Software Engineering), and the citations reflect a growing trend as the paper gets older.

Table 6
Top cited papers per year identified as outliers.

Table 7
Citation analysis per tier.