"Won't We Fix this Issue?"Qualitative Characterization and Automated Identification of Wontfix Issues on GitHub

Context: Addressing user requests in the form of bug reports and GitHub issues represents a crucial task of any successful software project. However, user-submitted issue reports tend to differ widely in their quality, and developers spend a considerable amount of time handling them. Objective: By collecting a dataset of around 6,000 issues from 279 GitHub projects, we observe that developers take significant time (i.e., about five months, on average) before labeling an issue as a wontfix. For this reason, in this paper, we empirically investigate the nature of wontfix issues and methods to facilitate the issue management process. Method: We first manually analyze a sample of 667 wontfix issues, extracted from heterogeneous projects, investigating the common reasons behind a "wontfix decision", the main characteristics of wontfix issues, and the potential factors that could be connected with the time to close them. Furthermore, we experiment with approaches enabling the prediction of wontfix issues by analyzing the titles and descriptions of reported issues when submitted. Results and conclusion: Our investigation sheds some light on the wontfix issues' characteristics, as well as the potential factors that may affect the time required to make a "wontfix decision". Our results also demonstrate that it is possible to perform prediction of wontfix issues with high average values of precision, recall, and F-measure (90%-93%).


Introduction
The complexity of modern software systems is growing fast and software developers need to continuously update their source code [1] to meet users' expectations and market requirements [2]. In this context, fixing bugs or addressing feature requests and enhancements, reported by users in the form of bug reports [3,4] and GitHub issues [5], represents a crucial task of any successful software project [6,7]. Indeed, during software development and maintenance, issue reports are valuable sources of information for developers interested in improving the quality of the software produced [4,8].
Nevertheless, software changes that are performed to address user-submitted reports often occur under time pressure [9,10], with negative effects on the developers' workloads [11]. Indeed, user-submitted reports tend to widely differ in their quality [12,13], and software developers have to spend a significant amount of time in handling these reports (e.g., verifying their content or relevance [14,15] and coordinating the teamwork [16]) for implementing the required changes [17,18,19].
In the last decade, research has developed automated solutions to facilitate the issue management and fixing processes, with techniques able to prioritize the requested changes [15,20], to detect potential issue misclassifications [21,22] and bug duplications [17]. Hence, most of the proposed tools and prototypes are used to answer critical and relevant questions related to reported issues, e.g., "Who should fix this bug?" [4] or "Is It a Bug or an Enhancement?" [21]. However, to the best of our knowledge, only a few works have investigated the nature of wontfix issues, known as "bugs that will never be fixed" [23]. By analyzing more than 6,000 issues from the history of 279 GitHub projects, we observe that developers require time (i.e., about five months, on average) before closing an issue with the wontfix status. This means that, in general, developers take about five months to answer the question: "Won't We Fix this Issue?". Starting from this preliminary result, we decided to study the main characteristics of this specific type of issue, thus investigating the main reasons behind a "wontfix decision". In addition, we further explored potential factors that could be related to the time to close a general wontfix issue and experimented with automated approaches to identify with high accuracy the issues that will be labeled as wontfix, by only analyzing issue titles and descriptions. To the best of our knowledge, no prior work has proposed approaches to automatically determine whether an issue will likely be marked (or labeled) as a wontfix.
Email addresses: panc@zhaw.ch (Sebastiano Panichella), canfora@unisannio.it (Gerardo Canfora), disorbo@unisannio.it (Andrea Di Sorbo)
Our goal is to support "community members" 1 during the issue management process. As pointed out by Guo et al. [24], unfixed bugs receive almost the same amount of attention as fixed bugs: e.g., in the Eclipse bug database the average numbers of comments that unfixed bugs and fixed ones receive are 4.5 and 5.8, respectively. Similarly, the average number of comments received by wontfix issues in our dataset is 5.14. Thus, approaches for the timely identification of issues that will likely not be addressed make it possible to reduce the unproductive effort (and associated costs) required for triaging and resolving such issues [25]. In particular, early identification of issue success can help (i) project managers allocate resources, (ii) developers focus their attention on the issues that will actually be addressed, and (iii) customers know early whether their requirements will be satisfied [26]. Indeed, the longer an issue that will likely not be fixed remains open, the more it could catch the attention of developers, making them spend effort gathering information in an attempt to resolve it [27]. Furthermore, being aware of the reasons why developers decide not to fix specific issues could help in understanding the software changes that developers consider less relevant. This information could be very useful for improving issue prioritization and triaging mechanisms, in order to better support developers in focusing on the issues that will actually get addressed.
Hence, this paper aims at answering the following research questions:
• RQ 1 : What are the main reasons for closing GitHub issues with the wontfix status? In this research question we qualitatively characterize wontfix issues, by manually analyzing a sample comprising 667 wontfix issues extracted from 97 different projects (developed in C#) hosted on GitHub, with the aim of understanding the main reasons behind a "wontfix decision". As an initial outcome, we design two different taxonomies. The first taxonomy encompasses the main reasons that pushed users to open issues that were later marked with the wontfix label. The second taxonomy models the main motivations given by developers when they decide to close these issues (as wontfix).
• RQ 2 : What factors relate with the resolution time of wontfix issues? This research question is a follow-up of the previous one. However, while in RQ 1 we look more at the nature of wontfix issues (e.g., investigating the reasons behind a wontfix decision), here we investigate the factors that could be related to the time to close a wontfix issue, observing also whether different wontfix issue types (i.e., having different resolution motivations) present different characteristics and, consequently, different resolution times. As the time elapsed between the opening and closing of an issue is not necessarily linked to the actual effort spent by developers on the issue itself, RQ 2 is also aimed at better characterizing the effort spent on each wontfix issue type.
• RQ 3 : Can machine learning be used to automatically identify issues that are likely to be closed as wontfix?
change requests, and identifying also potential issues containing erroneous reports, or requesting features or changes that are out of the project's purposes.
In our work we observed that developers take, on average, about five months to figure out that an issue is not worth fixing and should therefore be labeled as a wontfix. In this research question, we want to explore an automated method to anticipate this decision, thus helping developers recognize wontfix issues earlier in the issue management process. During our investigation (RQ 1 and RQ 2 ) we observed that most wontfix issues talk about specific aspects (e.g., feature enhancements/requests, non-critical bugs), and that different wontfix issue types tend to experience different resolution times. We conjecture that the topics discussed in the title and the description of an issue report are discriminant and relevant aspects to consider for the fixing of an issue. Based on this consideration, we experimented with approaches that leverage textual analysis and machine learning techniques to predict whether an issue will be marked (or labeled) as wontfix, by analyzing only the titles and descriptions of reported issues.
Results of our study provide insights into the nature of wontfix issues. In particular, we found that developers mainly tend to close (with the wontfix label) issues containing erroneous reports, or requesting features (or changes) that are not relevant or out of the project's scope. In addition, the time required to close issues that developers deliberately decide not to consider is mainly connected with (i) the issue type, and (ii) the number of participants involved in the related discussions. Finally, our evaluation shows that it is possible to predict whether developers will close an issue as a wontfix by analyzing only the titles and the descriptions of the reported issues and using machine learning and textual analysis techniques. The proposed methodology achieves average values of precision, recall, and F-measure ranging between 90% and 93%. Pragmatically speaking, the high effectiveness of the experimented ML models in automatically identifying the issues that will likely be closed with the wontfix status could improve the issue management processes, allowing developers to focus on more critical issues. Indeed, on GitHub, although the use of labels has been envisaged to mark and manage issue reports [28], no automated mechanisms have been provided to support humans during issue labeling. It is worth pointing out that our work has the sole aim of providing developers with a tool to more easily identify issues that developers will deliberately avoid fixing by labeling them as wontfix. Other issues may remain open for a long time without any fix [27], or developers could mark them as "invalid", or close them for any other reason [4]. All these situations are out of the scope of this paper.
We believe that our findings not only shed some more light on the nature of wontfix issues, but have the potential to build and/or improve future recommender systems aimed at prioritizing and supporting the issue fixing and the management processes of modern software projects.
In summary, the main contributions of this paper are:
• Two taxonomies modelling the reasons for opening and closing wontfix issues, along with a manually-labeled dataset (available for replication purposes) of 667 wontfix issues extracted from heterogeneous GitHub projects.
• Results of our study on the characteristics of the different types of wontfix issues.
• An automated approach (available for research purposes) able to accurately identify the issues that will likely not be fixed.
Paper structure. Section 2 discusses the current issue management cycle (with specific reference to GitHub), while Section 3 illustrates the related literature. Section 4 details the data extraction process and the evaluation methodology adopted to answer our research questions. Section 5 discusses the results achieved in our study, while threats to its validity are reported in Section 6. Finally, Section 7 concludes the paper and outlines directions for future work.

Background
An issue tracking system is a repository where users and team members can submit and discuss issues (e.g., bugs and feature requests), ask for advice and share opinions useful for maintenance activities or design decisions [29]. GitHub is a social coding platform hosting more than 57 million repositories 2 which provides advanced version control mechanisms and an integrated issue tracker.
Any GitHub user can create an issue in a public repository in order to report bugs, request enhancements, or make other kinds of requests. Thus, issues are the primary means through which GitHub communities collect user feedback. A typical issue on GitHub is described through a title and a description. It is worth noticing that, unlike other bug trackers (e.g., JIRA or Bugzilla), the GitHub issue tracker does not provide an explicit description field and the issue description is usually provided in the issue's first comment. Here and in the remainder of the paper, we refer to the first comment of an issue as the issue description. Moreover, one or more predefined labels are used to help in categorizing the issue. Each issue is assigned to one assignee who is responsible for working on it, but comments allow anyone to provide feedback. In order to offer high flexibility, GitHub only provides two issue states (open and closed), while any other state must be realized via labels. The GitHub issue tracker provides a set of default labels in each repository, including the wontfix label, which indicates that work will not continue on an issue 3 . The wontfix label is among the most used labels in GitHub projects [30]. Although labeling has a positive impact on the effectiveness of issue processing [31], the labeling mechanism is scarcely used on GitHub [30]. Thus, automated approaches able to predict the correct labels to assign to issues could stimulate the use of such a mechanism.
In this study we are interested in extracting issues from heterogeneous projects hosted on GitHub, having the closed status and the wontfix label assigned, in order to investigate common characteristics of this kind of document.

Issue Management Process and Practices
Fixing bugs and addressing feature requests or enhancements, reported in the form of bug reports [3,4] and GitHub issues [5], are crucial tasks for the success of any software project [6,7,32]. For this reason, researchers investigated factors characterizing or affecting the issue management process and practices.
Previous work investigated the aspects that should characterize an informative (or "good") bug report [12,13,33]. Specifically, Hooimeijer et al. [12] presented a first descriptive model of bug report quality, which is based on a statistical analysis of over 27,000 bug reports of Mozilla Firefox. The evaluation of the model showed its usefulness in reducing the overall cost of software maintenance, suggesting, at the same time, potential features that should be considered when composing bug reports. Bettenburg et al. [13] conducted a survey involving 466 developers and reported that there is a huge information mismatch between what developers need and what users provide in the reported issues. Their results suggest that future bug tracking systems should focus on engaging bug reporters, with tools handling bug duplicates. These findings have pushed, in later years, researchers to find solutions to handle the bug duplication problem [17,34].
Recent research studied socio-technical dynamics [35] concerning the management and fixing of issues [31] or the handling of pull requests [36,8]. For instance, Aranda et al. [16] investigated coordination activities around bug fixing tasks by surveying professional developers and found that, even for simple bugs, inefficient coordination among developers impacts the efficiency of the issue fixing process. These coordination problems [11] are generally caused by the wrong assignment [37] or re-assignment of bug reports [18] to developers. However, in other cases, inefficient bug resolutions are influenced by the length and complexity of issue discussions [38,39], the actual knowledge/skills of developers [40,5], and other socio-technical dynamics [41,42,43]. In this context, Breu et al. [33] pointed out that active participation of developers represents a crucial aspect for making progress on the reported bugs. For this reason, other work proposed strategies for determining the appropriate person to whom reported issues should be assigned [4,44,45].
Differently from the aforementioned work, this paper empirically investigates the main characteristics of wontfix bugs and the common reasons behind a wontfix decision.

Issue Reports Classification and Prioritization
As reported by Cosentino et al. [46], even if issues are generally sent to popular projects, the number of pending issues constantly grows [43] and, although the use of labels has a positive impact on the issue evolution [30], labels are scarcely used by GitHub developers. Thus, researchers proposed automated solutions to ease the issue management and fixing processes, with techniques that leverage well-known methods based on textual analysis [17], machine learning [21,47,48], natural language parsing (NLP) [49,50], and summarization approaches [51,52,53] to analyze bug report information. Important results in this direction are related to the definition of approaches that automatically classify or analyze the textual content of reported issues [54] to detect potential misclassifications [21,22], detect duplicated bugs [17], or predict reopened issues [55]. In recent years, tools have been designed to automatically predict the severity of bug reports [14,56,57,58], to support the prioritization of reported issues [15,20], and to estimate the issue lifetime [9,59,60,61]. Finally, to facilitate the process of fixing issues, recent strategies have been proposed to translate bug reports into test cases [62], generate auto-fixes [63], or recommend relevant classes [64] for these reports.
In the context of these related studies, this paper empirically investigates the combination of machine learning and textual analysis techniques to automatically predict whether issues will not be fixed, by analyzing (only) the titles and descriptions of reported issues. The closest works to ours are (i) the one by Cabot et al. [30], who proposed labels to classify issues in open source projects, and (ii) the one by Guo et al. [24], presenting an approach to determine the bugs that will actually be fixed. Finally, recent research [25] started investigating the reasons behind wontfix bugs. By manually analyzing a sample of 600 wontfix bugs (extracted from Bugzilla) pertaining to three open-source projects (i.e., Eclipse, Mozilla, and OpenOffice), the author identified 12 categories of reasons. Similarly to this research, we perform a manual inspection of wontfix issues. However, we identify, for each inspected issue, both the reason behind the issue opening and the motivation for issue closing, to better understand the co-occurrences between the different sorts of reasons. Besides, we (i) investigate wontfix issues pertaining to 279 heterogeneous projects (hosted on GitHub), identifying further categories of reasons that have not been considered in previous work, (ii) explore the factors that could be related to the time to close a wontfix issue, and (iii) propose an approach able to automatically identify wontfix issues with high effectiveness. To the best of our knowledge, no prior work has investigated the nature of wontfix issues on GitHub and proposed approaches to automatically determine whether an issue will be marked as wontfix.

Study Design
The goal of our study is to shed some light on the nature of wontfix issues, with the purpose of building and/or improving recommender systems aimed at prioritizing and supporting the issue fixing and the management processes of modern software projects. Hence, we qualitatively investigate the main reasons behind a "wontfix decision" and explore potential factors that could be correlated with the time to close a general wontfix issue. Finally, we experimented with potential strategies to predict whether an issue will be labeled as a wontfix. Figure 1 depicts the research approach we followed to answer our research questions.

Data Collection
The context of the study consists of 6,330 issues extracted from the history of 279 open source projects hosted on GitHub, whose characteristics are summarized in Table 1. The selection process we applied for this study is based on a "criterion sampling" [65], according to the following steps:
1. Language selection: As reported in [66], projects on GitHub developed in C# usually have a higher number of external users, and core developers tend to ignore reports from outsiders [67,68]. Besides, in a study involving about 100,000 GitHub projects, Bissyandé et al. [5] found that GitHub projects developed in C# usually have higher numbers of issues filed (see Figure 7 in [5]). Having a higher number of issues to deal with could increase the likelihood that developers overlook some of these issues due to lack of resources or low priority [69]. Thus, in this type of project (C# projects), we expect to find higher numbers of wontfix issues. For this reason, we selected projects mostly developed in the C# programming language.

2. Projects selection: Recent studies demonstrated that a higher number of issues co-occurs with (i) a higher number of stars received by a GitHub repository [70], and (ii) a faster growth of a GitHub project in terms of stars [71]. Thus, in line with recent empirical studies in software engineering [72,73,74,75], we selected projects relying on stars information. In particular, in order to consider projects with reliable amounts of issues, the 1000 most popular ones (i.e., top starred) have been selected from GitHub.

3. Metadata extraction: We collected all closed issues metadata (e.g., URL, title, description, resolution date, etc.) from the projects selected according to the aforementioned criteria.
The aforementioned steps were performed using the R scripts available in our replication package (under the folder "1 Scripts/1 Data Collection"). In particular, a first script was employed for selecting the projects according to the (1) language selection and (2) projects selection criteria. This R script addressed the following technical issues: (i) we selected the projects having the highest number of stars; (ii) we selected projects having closed issues, filtering out projects having no issue labels (not all projects had issue labels); (iii) we handled the GitHub download limits (setting in R a timeout with "Sys.sleep(40)"). As a result, the script collected the initial information about the identified projects, such as project name, project GitHub URL, project programming language, and issue labels. The other three R scripts we implemented are responsible for collecting more detailed information about the issues of the selected projects (e.g., issue URL, issue title, and issue description), double-checking that no other issues have been closed as wontfix during the extraction analysis (we added a few more wontfix issues with this check). It is worth noticing that during our investigation we observed that specific projects may use custom labels for designating issues that will not be addressed (e.g., status:wontfix, Resolution-Won't Fix, won't fix, resolved: wontfix, closed:wontfix, wont-fix, Won't Fix, not-fixing, Status-WontFix, WontFix, status: will not fix and Cannot fix). Thus, our scripts consider issues having the wontfix labels described above. Table 1 reports the number of projects and the total number of wontfix and non-wontfix closed issues mined from these projects. We also report, in the last column of Table 1, the median number of issues per project.
Replication package. We make available in our replication package 4 (i) the scripts developed to extract the data used for this research, and (ii) all raw data used to generate the data and tables reported in the paper. In the replication package, we also include the research prototype we used to answer RQ 3 .
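The handling of custom wontfix labels described above can be illustrated with a short sketch. It is written in Python rather than R, and the matching rule (lowercasing and dropping punctuation before an exact comparison) is an assumption made for illustration; the paper does not describe how the original scripts match labels. The function names are hypothetical.

```python
import re

# Label variants listed in the text; the matching rule below is an assumption.
WONTFIX_VARIANTS = [
    "wontfix", "status:wontfix", "Resolution-Won't Fix", "won't fix",
    "resolved: wontfix", "closed:wontfix", "wont-fix", "Won't Fix",
    "not-fixing", "Status-WontFix", "WontFix", "status: will not fix",
    "Cannot fix",
]


def normalize_label(label):
    """Lowercase a label and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", label.lower())


NORMALIZED_WONTFIX = {normalize_label(v) for v in WONTFIX_VARIANTS}


def is_wontfix_label(label):
    """True if the label matches one of the known wontfix variants."""
    return normalize_label(label) in NORMALIZED_WONTFIX


def is_wontfix_issue(state, labels):
    """An issue counts as wontfix if it is closed and carries a wontfix label."""
    return state == "closed" and any(is_wontfix_label(l) for l in labels)
```

Exact matching after normalization keeps the sketch conservative: it accepts spelling and punctuation variants of the listed labels but does not guess at labels that were not observed.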

Analysis Method
In the following, we discuss the research methods used to address each specific research question.
4.2.1. RQ 1 : What are the main reasons for closing GitHub issues with the wontfix status?
Answering RQ 1 required, as a first step, deriving a manually labeled golden set of wontfix issues, in order to build two taxonomies: (i) a first taxonomy, M opening , summarizing the main reasons that pushed users to open issues that were later marked with the wontfix label; and (ii) a second taxonomy, M closing , modelling the main motivations of developers to close these issues as wontfix. Therefore, with the aim of manually inspecting a representative sample (99% confidence level and a margin of error below 4%), T sample , of the collected wontfix issues (see Section 4.1), we randomly selected 667 issues from the entire set of wontfix issues. These 667 issues belong to 97 projects. Concerning the 667 wontfix issues in our T sample , we observed that 286 of them (42.88%) have been opened by end-users (not contributors to the projects), while the remaining 381 issues have been opened by users with different roles in the analyzed projects. In particular, 61 (16.01%) out of these 381 issues have been opened by repositories' owners, 225 (59.06%) by organizations' members, 78 (20.47%) by contributors who had previously committed to the repositories, and 17 (4.46%) by collaborators invited to contribute to the repositories.
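The representativeness claim can be checked with a small calculation. The sketch below assumes the standard margin-of-error formula for a proportion with a finite population correction, a worst-case proportion p = 0.5, and z = 2.576 for the 99% confidence level; the paper does not state which formula was used, so these are assumptions. The population size of 1,844 is the number of wontfix issues reported for the dataset.

```python
import math


def margin_of_error(sample_size, population_size, z=2.576, p=0.5):
    """Margin of error for a proportion, with finite population correction."""
    standard_error = math.sqrt(p * (1 - p) / sample_size)
    fpc = math.sqrt((population_size - sample_size) / (population_size - 1))
    return z * standard_error * fpc


def required_sample_size(population_size, margin, z=2.576, p=0.5):
    """Smallest sample size whose margin of error does not exceed `margin`."""
    n0 = (z ** 2) * p * (1 - p) / margin ** 2  # infinite-population size
    return math.ceil(n0 / (1 + (n0 - 1) / population_size))


# 667 sampled issues out of 1,844 wontfix issues.
moe = margin_of_error(667, 1844)
print(f"margin of error: {moe:.4f}")
```

Under these assumptions the margin of error for 667 out of 1,844 issues comes out slightly below 4%, which is consistent with the figures reported in the text.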
To derive the two taxonomies we used card sort, a technique to derive taxonomies from input data [76]. We organized card sort in three steps [77]: (i) preparation, (ii) execution, and (iii) analysis.
Preparation: In this step, we created the cards related to each wontfix issue in T sample . Each card represents a wontfix issue and includes: (i) the issue title, (ii) the issue description, (iii) all the messages exchanged in the related discussion, and (iv) all the labels (in addition to wontfix) assigned to the issue by the original developers.
Execution: Two authors of the paper analyzed the cards applying open (i.e., without predefined groups) card sort. In particular, the two authors performed an iterative content analysis [78], starting with two empty lists (one for M opening and the other for M closing ) of issue categories. Each time they found a new wontfix issue type to add to one of the two taxonomies, a new category was added to the corresponding list. The two authors used pair sorting [77] to discuss discrepancies in their thoughts for each card during the card sorting itself and avoid checking the consistency of the sorting and merging the cards in a later phase.
Analysis: To guarantee the integrity of the emerging categories and remove potential redundancies, the two authors performed a second iteration on all the analyzed cards and redefined some of the categories identified in the previous step. Through the card sorting process 22 reasons for M opening and 26 reasons for M closing emerged. During the sorting process we reflected on how they could be further clustered into higher level groups. At the end of this phase, for each taxonomy we identified five high level groups. The resulting taxonomies are described in Section 5.1, where the set of 667 issues manually validated according to the taxonomies represents our golden set.
4.2.2. RQ 2 : What factors relate with the resolution time of wontfix issues?
In order to characterize wontfix issues and answer RQ 2 , we computed the following factors concerning all issues in our golden set (i.e., All):
• descriptionLength: Issue description length (number of characters);
• maxAuthorPercentage: The proportion of messages posted by the author who posted the majority of messages in the issue discussion;
• majorAuthors: Number of unique authors who have posted more than one-third of the overall messages present in the issue discussion;
• meanCommentSize: Average length of comments (number of characters) in the issue discussion;
• minorAuthors: Number of unique authors who have posted less than one-third of the overall messages present in the issue discussion;
• nActorsT: Number of distinct authors participating in the issue discussion;
• nCommentsT: Number of total comments in the issue discussion;
• timeToCloseIssue: Time lapse (in days) between issue opening and closing (with the wontfix label);
• timeToDiscussIssue: Time lapse (in days) between issue opening and last comment posted in the issue discussion.
These factors allowed us to investigate different issue dimensions, namely (i) the level of participation of the community members in the issue discussion (nActorsT, maxAuthorPercentage, minorAuthors and majorAuthors), (ii) the discussion's size (descriptionLength, nCommentsT and meanCommentSize), as well as (iii) timing information about the issue (timeToCloseIssue and timeToDiscussIssue).
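As an illustration of how the factors listed above could be derived from a single issue, consider the following sketch. The input layout (a list of `(author, text, posted_at)` tuples) and the function name are hypothetical. It also assumes that only the comments following the description count as discussion messages, a detail the definitions above leave open.

```python
from collections import Counter
from datetime import datetime

SECONDS_PER_DAY = 86400


def discussion_factors(description, comments, opened_at, closed_at):
    """Compute the RQ 2 factors for one issue.

    comments: list of (author, text, posted_at) tuples (hypothetical layout).
    """
    n = len(comments)
    per_author = Counter(author for author, _, _ in comments)
    threshold = n / 3  # one-third of the overall messages
    last_comment_at = max((t for _, _, t in comments), default=opened_at)
    return {
        "descriptionLength": len(description),
        "nCommentsT": n,
        "nActorsT": len(per_author),
        "maxAuthorPercentage": max(per_author.values()) / n if n else 0.0,
        "majorAuthors": sum(1 for c in per_author.values() if c > threshold),
        "minorAuthors": sum(1 for c in per_author.values() if c < threshold),
        "meanCommentSize": sum(len(text) for _, text, _ in comments) / n if n else 0.0,
        "timeToCloseIssue": (closed_at - opened_at).total_seconds() / SECONDS_PER_DAY,
        "timeToDiscussIssue": (last_comment_at - opened_at).total_seconds() / SECONDS_PER_DAY,
    }


example = discussion_factors(
    description="App crashes on startup",
    comments=[
        ("alice", "aa", datetime(2020, 1, 2)),
        ("bob", "bbbb", datetime(2020, 1, 3)),
        ("alice", "cc", datetime(2020, 1, 4)),
        ("carol", "dd", datetime(2020, 1, 5)),
        ("alice", "ee", datetime(2020, 1, 6)),
    ],
    opened_at=datetime(2020, 1, 1),
    closed_at=datetime(2020, 1, 11),
)
```

In this toy discussion one author posts three of the five comments, so she is the only major author, while the two remaining participants fall below the one-third threshold and count as minor authors.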
Moreover, we studied how these factors vary when considering the different M closing categories (i.e., Bug, Feature request/enhancement, Not suitable and Change). In particular, to verify whether statistically significant differences could be observed between the different M closing categories, for each metric m ∈ {descriptionLength, maxAuthorPercentage, majorAuthors, meanCommentSize, minorAuthors, nActorsT, nCommentsT, timeToCloseIssue, timeToDiscussIssue}, we compared the value distributions obtained for m across the different M closing categories, through the Mann-Whitney U test, a widely used non-parametric test for comparing independent samples [79]. After checking that all the variables of interest (i.e., descriptionLength, maxAuthorPercentage, majorAuthors, meanCommentSize, minorAuthors, nActorsT, nCommentsT, timeToCloseIssue and timeToDiscussIssue) are not well-modeled by normal distributions (as verified through the Shapiro-Wilk test [80]), we decided to use a non-parametric test (i.e., Mann-Whitney), as the assumptions for this test are satisfied by each considered variable. To cope with multiple comparisons, the Benjamini-Hochberg correction procedure [81] has been adopted to adjust p-values.
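A minimal sketch of this testing procedure is given below. It is not the authors' analysis code. The Mann-Whitney U test is implemented with the normal approximation and tie correction, without continuity correction, which is an assumption suited to reasonably large samples; statistical packages may use exact p-values for small ones. The Benjamini-Hochberg step returns adjusted p-values.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U statistic of x, p-value).
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x, tie_term = 0.0, 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j for tied values
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        t = j - i
        tie_term += t ** 3 - t
        i = j
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:  # all values identical: no evidence of a difference
        return u_x, 1.0
    z = (u_x - n1 * n2 / 2) / math.sqrt(var_u)
    return u_x, math.erfc(abs(z) / math.sqrt(2))


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

In a setting like the one described, each metric would be compared across every pair of M closing categories with `mann_whitney_u`, and the resulting p-values would then be passed together to `benjamini_hochberg`.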
In addition, in order to investigate whether some of the considered metrics may influence the time required to close the issue, similarly to the work by Linares-Vásquez et al. [82], for each metric m we grouped the issues in our golden set into different subsets, on the basis of specific values of m (e.g., nActorsT ≤ 2, 3 ≤ nActorsT ≤ 4 and nActorsT ≥ 5) and verified (through the Mann-Whitney U test) whether statistically significant differences can be observed in the timeToCloseIssue distributions obtained for the different subsets. Again, this investigation has been carried out for (i) all the issues in our golden set (i.e., All), as well as (ii) the various M closing categories.
4.2.3. RQ 3 : Can machine learning be used to automatically identify issues that are likely to be closed as wontfix?
After investigating the nature of wontfix issues, we propose an approach to automatically predict or classify whether an issue will be labeled as a wontfix. To achieve this goal, we considered all non-wontfix (4,486) and wontfix (1,844) issues in our dataset (see Table 1), collected through the metadata analysis explained in Section 4.1. Specifically, our approach leverages machine learning (ML) techniques and consists of four main steps:
1. Preprocessing: All terms contained in the titles and descriptions of all 6,330 (4,486 non-wontfix plus 1,844 wontfix) issues in our dataset are used as an information base to build a textual corpus that is preprocessed applying stop-word removal (using the English Standard Stop-word list) and stemming (i.e., English Snowball Stemmer) to reduce the number of text features for the machine learning techniques [83]. In addition, we use the R package textclean, which contains a function called replace_html that allows the automated removal of HTML tags from the text. The output of this phase corresponds to a Term-by-Document matrix M where each column represents an issue of our dataset and each row represents a term contained in the title and/or description of the various issues. Thus, each entry M [i, j] of the matrix represents the weight (or importance) of the i−th term contained in the j−th issue.
2. Textual Feature Weighting: Words are weighted using the tf-idf score [83], as opposed to simple frequency counts, because it assigns a higher value to rare words (or groups of words) appearing in issues, and a lower value to common ones. This allows identifying the most important words in the issue titles and descriptions. The weighted matrix M represents the output of this phase.
3. Training and Test sets: For the classification step, we split the matrix related to all issues of our dataset into two parts (50% each), i.e., training and test sets. As training set we considered the sub-matrix, M training , obtained by randomly selecting from the original matrix M the columns associated with one half of the wontfix issues and one half of the non-wontfix issues. Conversely, for the test set we considered the sub-matrix, M test , obtained by taking from the original matrix M the columns associated with the remaining half of the wontfix issues and the remaining half of the non-wontfix issues.

4. Classification: We automatically classify wontfix issues in the test set by relying on the output data obtained from the previous step, consisting of the matrices M training and M test . Specifically, to increase the generalizability of our findings, we experimented (relying on the Weka tool 5 ) with different machine learning techniques, namely, the standard probabilistic Naive Bayes classifier, the sequential minimal optimization (SMO) algorithm 6 , and the J48 tree model. It is important to note that the choice of these techniques is not random, since they were successfully used for bug report [21,84] or vulnerability [85] classification, in recent work on user review analysis [86,2,87], and in several works on bug prediction [88,89].
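As an illustration, steps 1 to 3 above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the pipeline used in the study: the stop-word set, the four example issues, and the function names are hypothetical, stemming is omitted, and the weight is the plain tf × log(N/df) variant of tf-idf, which may differ from the variant applied by the tools we used.

```python
import math
import random
from collections import Counter

# Hypothetical, tiny stop-word set (the study uses the English Standard Stop-word list).
STOP_WORDS = {"the", "a", "to", "is", "of", "and", "in", "for"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words (stemming is omitted)."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]


def tfidf_matrix(documents):
    """Return (terms, M), where M[i][j] is the tf-idf weight of term i in document j."""
    tokenized = [tokenize(d) for d in documents]
    terms = sorted({t for doc in tokenized for t in doc})
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(t for doc in tokenized for t in set(doc))
    counts = [Counter(doc) for doc in tokenized]
    matrix = [
        [counts[j][term] * math.log(n_docs / doc_freq[term]) for j in range(n_docs)]
        for term in terms
    ]
    return terms, matrix


def stratified_half_split(labels, seed=0):
    """Randomly assign half of each class's column indices to training, the rest to test."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [j for j, lab in enumerate(labels) if lab == cls]
        rng.shuffle(idx)
        half = len(idx) // 2
        train.extend(idx[:half])
        test.extend(idx[half:])
    return sorted(train), sorted(test)


# Hypothetical issue texts and labels, for illustration only.
issues = [
    "Please add support for dark mode",
    "Crash when saving the file",
    "Add support for plugins",
    "Crash in the parser",
]
labels = ["wontfix", "non-wontfix", "wontfix", "non-wontfix"]

terms, M = tfidf_matrix(issues)
train_cols, test_cols = stratified_half_split(labels)
```

In this toy corpus the term "please" occurs in one issue only and therefore receives a higher weight than "add", which occurs in two, mirroring the rationale given in step 2.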
It is worth highlighting that, as stated in Section 1, for predicting whether an issue will be labeled as wontfix, the machine learning models have been trained by exclusively using information that is immediately available when the issue is opened (i.e., issue title and description, without considering the other features). This simulates a more realistic scenario, in which the automated classification can really help developers identify the issues that will likely not be fixed.
To evaluate the performance of the experimented ML techniques, we adopted well-known information retrieval metrics, namely precision, recall, and F-measure [83]. It is important to specify that, as described in steps 3 and 4 of our approach, we apply a cross-project setting to train the ML models on data coming from different projects. This choice was made to ensure that a more general classification model is trained. To complement the evaluation process and alleviate concerns related to overfitting and selection bias, we also provide the classification results of the experimented machine learning models obtained with a 10-fold cross-validation strategy.
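For reference, the three metrics can be computed from the confusion-matrix counts of a class as sketched below. The function name and the counts in the example are hypothetical and are not taken from our results; the F-measure is assumed to be the usual harmonic mean of precision and recall.

```python
def precision_recall_f(tp, fp, fn):
    """Compute precision, recall, and F-measure from confusion-matrix counts.

    tp: issues correctly predicted as the class of interest
    fp: issues wrongly predicted as that class
    fn: issues of that class that were missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical counts, for illustration only.
p, r, f = precision_recall_f(tp=80, fp=20, fn=40)
print(f"precision={p:.3f} recall={r:.3f} F-measure={f:.3f}")
```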

Results
This section discusses the results of our empirical study.

RQ 1 : Reasons for Wontfix Issues
To explore the common motivations for closing issue reports that developers deliberately decide not to consider or fix (i.e., wontfix), and answer RQ 1 , we performed a manual analysis of a sample of issues extracted from our data collection (as described in Section 4.2.1). More specifically, this golden set encompasses 667 closed issues (with the wontfix label) extracted from 97 distinct projects hosted on GitHub.
Each issue in the sample has been marked with two labels: (i) the motivation behind the issue opening, as stated by the issue reporter (i.e., the motivation for issue opening, M opening ), and (ii) the reason for its closing (with the wontfix status), as declared by developers within the issue discussion (i.e., the motivation for issue closing, M closing ).
For M opening , we found 22 different motivations (reported in Table 2 along with their frequencies within the analyzed sample), which have been grouped into five distinct categories. It is worth highlighting that 64 (9.6%) issues have been assigned to more than one M opening category, since they have been opened for multiple purposes (this explains why the sum of the percentage values in Table 2 is higher than 100%). For M closing , the motivations we found are reported in Table 3. Such motivations have been clustered into five categories. Only 23 issues (3.4%) have been marked with multiple M closing motivations; this is mainly due to the fact that community members usually tend to provide a precise indication for not fixing an issue. In most cases, the reasons for opening an issue are related to bug reporting, feature requests, or enhancements, and only in a few cases to other requests (e.g., clarification questions, performance, and testing related aspects). As expected, the majority (i.e., 648, 97.15%) of issues in our sample have been opened to signal troubles dealing with functional aspects (see Table 2). As illustrated in Table 2, many of the issues belonging to the Functional Aspects category are opened in order to (i) request improvements for specific features (i.e., Feature enhancement, 42.6%), (ii) report a bug (i.e., Reported a bug, 29.1%), or (iii) require a new feature to be implemented (i.e., Feature request, 24.3%). As anticipated, requests for fixing defects not strictly related to the software features (i.e., Problem in Table 2), requests about software documentation (i.e., Documentation in Table 2), and configuration problems (i.e., Configuration in Table 2) turned out to be rare motivations for opening issues.
Community members usually decide to ignore issues (i) containing erroneous reports, or (ii) requesting features or changes that are outside the project's purposes (e.g., requests to improve the performance or GUI associated with a feature, or to add a functionality that is already present in the system). As a matter of fact, 319 (47.8%) issues in our sample have been closed with the motivation that the requested features/enhancements were not needed or had already been implemented, while 142 (21.3%) issues erroneously reported problems and proved to be not suitable (see Table 3). Indeed, only 63 (9.4%) issues revealed actual bugs, which developers decided not to fix, as they have often been evaluated as (i) too expensive to fix (i.e., Impossible to fix the issue or too expensive change, 52.4% of issues signaling actual bugs), (ii) not critical (i.e., Not a critical bug, 27% of issues signaling actual bugs), or (iii) to be fixed in the future (i.e., It will be fixed in future, 19% of issues signaling actual bugs). In addition, issues reporting change requests (i.e., Change in Table 3) are mainly closed (i.e., 14.5% of issues) because the changes they propose are judged as not strategically relevant by community members. We argue that the results in Table 3 could be useful to implement more accurate analyses (not necessarily based on binary classification) in future work, such as multi-label issue classification [90] and issue prioritization [91].
In Figure 2 we illustrate the frequency with which issues opened with the most recurrent purposes in M opening (i.e., Feature enhancement, Reported a bug and Feature Request) are closed with one of the motivations in M closing (see Table 3). In Figure 2 the thickness of the lines is proportional to the number of issues opened with a specific reason (on the left) and closed with a specific M closing motivation (on the right). In particular, 146 (89%) issues opened for requiring a new feature (i.e., Feature Request) have been closed with the motivation Feature request/enhancement already implemented or not needed, while 173 issues (60.1%) requiring feature enhancements (i.e., Feature Enhancement) have been closed due to the same motivation. Moreover, 83 (28.8%) issues having the same purpose (i.e., Feature Enhancement) have been closed, since they proposed Not relevant changes. Finally, issues reporting bugs (i.e., Reported a bug) are mainly closed because (i) they do not signal actual bugs (i.e., Not a bug, 33.5%), (ii) the bugs reported are too expensive or impossible to fix (i.e., Impossible to fix the issue or too expensive change, 12%), or (iii) the signaled defects mainly depend on configuration/backup problems on the user side (i.e., unset backup or other configurations required on the user side for enabling the main functionalities of the project, 11.5%).

RQ 1 summary: Developers mainly tend to close issues (with the wontfix label) containing erroneous reports, or requesting features (or changes) that are not relevant or not needed.

RQ 2 : Factors Related with the Wontfix Issues Resolution Time
As reported in Table 4, wontfix issues are mainly discussed among limited numbers of major actors and such discussions encompass 4.43 comments, on average. As anticipated, wontfix issues are closed a very long time after their opening (i.e., about five months on average) and continue to be discussed even after their closing.
In order to investigate differences among the different types of wontfix issues, we study the extent to which the collected metrics vary across the specific M closing categories, and discuss the most interesting peculiarities. More specifically, to verify whether the observed differences are statistically relevant, we tested, for each metric and each pair of wontfix issue types, the null hypothesis H 0 that the values of the metric do not differ between the two types. H 0 has been tested with the Mann-Whitney U test and the significance level was fixed at 0.05. Table 5 reports the results of the Mann-Whitney U test. This investigation is aimed at understanding whether specific types of wontfix issues exhibit more unproductive (or longer) discussions before their closing. In particular, for each metric (on the rows) and each pair of wontfix issue types (on the columns), Table 5 reports the p-value obtained when testing the null hypothesis H 0 . The Benjamini-Hochberg correction procedure [81] has been adopted to adjust p-values, since multiple comparisons are performed simultaneously.
Figures 3, 4, 6 and 5 report the respective distributions of nActorsT, nCommentsT, timeToCloseIssue and descriptionLength obtained for the overall issues in our dataset (i.e., All), as well as for the various M closing categories. Change requests (i.e., Change) (i) are usually described through longer texts (see Figure 5), (ii) require discussion among a greater number of actors (as illustrated in Figure 3) and, consequently, (iii) the related issue discussions comprise a greater number of comments (see Figure 4) than other kinds of issues.
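The Benjamini-Hochberg adjustment applied to the p-values of the pairwise comparisons can be sketched as follows. This is a minimal, dependency-free sketch; the function name and the raw p-values in the example are hypothetical and do not correspond to the values in Table 5.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, so that the adjusted
    # values stay monotone and never exceed 1.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values from four pairwise comparisons.
raw = [0.01, 0.04, 0.03, 0.20]
adj = benjamini_hochberg(raw)
significant = [p < 0.05 for p in adj]
```

In this example three raw p-values fall below 0.05, but only the smallest one remains below the threshold after the adjustment, which is why the correction matters when many pairs of issue types are compared at once.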
As shown in Figure 6, not suitable reports (i.e., Not suitable) are closed much faster than the other types of issues: 50% of issues of this type are closed in less than 16 days, while 50% of the issues belonging to the other categories require more than 39 days to be closed. This is probably due to the fact that developers are more resolute in closing an issue once they have verified that the signaled defect is not suited to be addressed.
On the contrary, issues of the Feature request/enhancement type usually require more time to be closed (the median value of timeToCloseIssue obtained for issues of this type is 73.59 days), probably because developers are more uncertain about whether the required improvements are in line with the project's purposes. In general, the number of participants discussing the issues may influence, with statistical evidence (see Table 6), the time required to close a wontfix issue, while for the other collected factors we do not observe significant relationships. Specifically, as illustrated in Figure 7, when the number of actors participating in the discussions of wontfix issues of the Feature request/enhancement and Change types increases, we observe a longer timeToCloseIssue, while for wontfix issues of the Bug and Not suitable types, no statistically significant differences between the different subsets are revealed. It is worth noticing that we verified, through similar analyses, whether significant relationships exist between the other metrics and the timeToCloseIssue. However, such analyses did not produce noteworthy results.
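The grouping step behind this analysis can be sketched as follows, using the nActorsT thresholds given as an example in the study design (nActorsT ≤ 2, 3 ≤ nActorsT ≤ 4 and nActorsT ≥ 5). The function names and the (nActorsT, timeToCloseIssue) pairs are hypothetical; the sketch only builds the subsets and reports their medians, while the statistical comparison between subsets is left to the Mann-Whitney U test.

```python
from statistics import median


def actor_group(n_actors):
    """Map the number of actors in a discussion to one of three subsets."""
    if n_actors <= 2:
        return "nActorsT <= 2"
    if n_actors <= 4:
        return "3 <= nActorsT <= 4"
    return "nActorsT >= 5"


def median_time_by_group(issues):
    """issues: iterable of (nActorsT, timeToCloseIssue in days) pairs."""
    groups = {}
    for n_actors, days in issues:
        groups.setdefault(actor_group(n_actors), []).append(days)
    return {name: median(values) for name, values in groups.items()}


# Hypothetical issues, for illustration only.
sample = [(1, 5.0), (2, 9.0), (3, 30.0), (4, 50.0), (5, 120.0), (7, 200.0)]
medians = median_time_by_group(sample)
```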
In a study involving more than 4,000 GitHub projects and about 1 million issues, Kikas et al. [59] found that the median lifetime of about 70% of the investigated issues is 3.7 days. We observe that the median closing time for wontfix issues is about 11.5 times longer than the median lifetime of most issues investigated in prior work, confirming (i) our intuition that wontfix issues usually remain open for a longer time compared to other types of issues, and (ii) the need for early detection of issues that will probably remain unfixed. Our study also partially confirms some of the findings reported in prior research [59], showing that the number of different actors involved in an issue discussion is related to the time to close the issue.

RQ 2 summary: On average, about five months are required to close issues that developers will label as wontfix. This time is mainly connected with (i) the issue type (issues indicating not suitable reports are closed much faster than other kinds of issues), and (ii) the number of participants involved in the related discussions. Such discussions typically comprise fewer than 6 messages and involve a limited set of major actors.

RQ 3 : Automated Classification of Wontfix Issues
As explained in Section 4.2.3, we experimented with different ML techniques, namely (i) the probabilistic Naive Bayes classifier, (ii) the SMO algorithm, and (iii) the J48 tree model. These ML models were trained on the training data (i.e., M training ) and evaluated on the test data (i.e., M test ) illustrated in Section 4.2.3. Table 7 provides an overview of the main results obtained through the different ML algorithms. For completeness, in Table 8, Table 9 and Table 10 we also provide the corresponding confusion matrices of the three experimented ML models.
The results in Table 7 highlight that the precision, recall, and F-measure values are very positive for the J48 model, while we observe lower precision and F-measure results for the Naive Bayes and SMO models. Specifically, as reported in Table 7, the J48 algorithm achieves the best classification performance, i.e., values close to 0.90 for the precision, recall, and F-measure metrics. In the case of Naive Bayes and SMO, the values of precision, recall, and F-measure are lower than the ones achieved by the J48 model, with a degradation in classification performance of more than 10%. The ML models perform a binary prediction (i.e., wontfix vs. non-wontfix) relying on 14,720 textual features. The lower classification performance obtained by the Naive Bayes model could be due to the fact that, as reported by previous work on bug classification [21], "the naive Bayes classifier only exhibits a limited improvement when increasing the number of features", while more complex machine learning models tend to achieve better classification performance when the feature set grows.
The variability of the results can be easily explained by observing the confusion matrices of the three ML models, reported in Tables 8, 9 and 10. For the J48 model the numbers of False Negatives (FN) and False Positives (FP) are relatively low, while in the case of the Naive Bayes and SMO ML strategies, the number of misclassified instances is substantially higher. Interestingly, the achieved results demonstrate that predicting whether an issue will be fixed or not (i.e., will be marked as a wontfix) is possible with positive results, especially when considering a tree classifier. In addition, these results confirm our conjecture that terms occurring in the title and the description of issues posted on GitHub are discriminant and relevant factors to consider for determining whether an issue will be fixed or not.
Our results are very encouraging, especially if we consider that we used just 50% of our dataset to train the different ML algorithms and predicted on the remaining 50%, so that training and test sets have equal size. Indeed, using a larger number of examples in the training set (e.g., using 80% of our dataset as training set and the remaining 20% as test set, as done in traditional ML applications) is likely to result in higher performance. Thus, to check whether a larger number of points in the training set could lead to better results and to mitigate concerns related to overfitting and selection bias, we repeated the classification experiment by using a 10-fold cross-validation strategy (i.e., in each run of the 10-fold cross-validation the training set was composed of 90% of the items in the overall dataset). This analysis is also important to verify that the previously discussed results do not depend on the specific data used for training the ML models. Indeed, one of the goals of using 10-fold cross-validation is to flag problems like overfitting or selection bias [92]. The results of 10-fold cross-validation are shown in Table 11. Such results confirm that the precision, recall, and F-measure values are very positive for J48, while lower precision, recall and F-measure results are observed for the Naive Bayes and SMO models. More specifically, all the considered classifiers achieve slightly higher F-measure values (i.e., with improvements ranging from 0.5% to 4%) than the results previously obtained (see Table 7), confirming (i) the high performance of the J48 model in identifying the issues that will likely not be fixed by developers, and (ii) the inadequacy of the Naive Bayes technique when used for performing the aforementioned classification task.
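To make the 90%/10% proportion concrete, the construction of the folds can be sketched as follows. This is a plain, non-stratified sketch with a hypothetical function name and seed; it only produces index sets and is not the cross-validation routine of the tool used in our experiments, which may build its folds differently.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    rng = random.Random(seed)
    indices = list(range(n_items))
    rng.shuffle(indices)
    # Deal the shuffled indices into k folds of (almost) equal size.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test


# 6,330 is the number of issues in the dataset described in the paper.
splits = list(k_fold_indices(6330, k=10))
train_size, test_size = len(splits[0][0]), len(splits[0][1])
```

With 6,330 issues, each run trains on 5,697 issues and tests on the remaining 633, and every issue is used for testing exactly once across the ten runs.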
To achieve more in-depth insights about the positive results obtained by the J48 model, we manually inspected the actual features that are selected by the J48 model to classify wontfix issues. From the part of the complex J48 tree model shown in Figure 8 (see the full trained tree model in the RP 7 ), we can observe that the selected features concern in most cases textual features semantically linked to requests for feature enhancement/addition, such as make, change, provide, etc. Interestingly, this result is in line with the finding of RQ 1 , where we discovered that one of the major motivations for closing issues with the wontfix status (see Table 3) is the presence of undesired features (about 47% of wontfix issues concern requests for feature enhancement/addition). To quantitatively corroborate this observation, we computed the information gain [93] for all the text features leveraged by our model and ranked them according to their scores. Table 12 reports the top 15 ranked features, along with the related information gain scores. Looking at Table 12, it is easy to observe that many of the features with the highest scores are conceptually linked to requests for enhancement or feature additions (e.g., provide, need, change, support, create, and make). However, this could also represent a limitation of our model. Indeed, as shown in Table 10, while the J48 model is quite effective in recognizing issues of the non-wontfix class, a higher likelihood of false negatives is observed for wontfix issues (i.e., the recall for this class is about 75%). For this reason, we believe that further efforts and tuning could be aimed at reducing the false-negative rate for this class while keeping the false-positive rate low.
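For a single term treated as a binary feature (present or absent in an issue), the information gain with respect to the wontfix/non-wontfix label can be sketched as follows. The function names and the counts in the example are hypothetical and are not taken from Table 12; the sketch assumes the standard entropy-based definition, i.e., the class entropy minus the class entropy conditioned on the presence of the term.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given as counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def information_gain(with_term, without_term):
    """Information gain of a binary term feature.

    with_term / without_term: (n_wontfix, n_non_wontfix) counts of the issues
    that do / do not contain the term.
    """
    n_with, n_without = sum(with_term), sum(without_term)
    total = n_with + n_without
    class_totals = [with_term[0] + without_term[0], with_term[1] + without_term[1]]
    conditional = (n_with / total) * entropy(with_term) + (
        n_without / total
    ) * entropy(without_term)
    return entropy(class_totals) - conditional


# Hypothetical counts: the term appears in 30 wontfix and 10 non-wontfix issues,
# and is absent from 20 wontfix and 40 non-wontfix issues.
ig = information_gain(with_term=(30, 10), without_term=(20, 40))
```

A term that occurs equally often in both classes yields a gain of zero, whereas a term that perfectly separates two equally sized classes yields a gain of one bit; ranking terms by this score gives a list of the kind reported in Table 12.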
RQ 3 summary:
1. Relying on a tree classifier (i.e., J48) it is possible to automatically detect issues that will be labeled as wontfix, with precision, recall, and F-measure values up to 0.93.
2. Consistently with the results of RQ 1 , the experimented models select textual features semantically related to requests for feature enhancement/addition to classify wontfix issues.

Threats to Validity
Threats to construct validity. In order to carry out our study, we measure different factors that may not be sufficient to model the whole issue handling process. As pointed out by Kalliamvakou et al. [94], many active projects do not conduct all their software development activities in GitHub, and separate infrastructures (e.g., mailing lists, forums, IRC channels, etc. [95]) could be used to support decision-making processes. This only represents a minor threat in our study, since most of the events of our interest (e.g., opening and closing of the issues, label assignments, comments) are recorded in the issue tracking system, as it is primarily used by development teams to track issue data.
Threats to internal validity. Our results have been obtained by analyzing wontfix issues having the closed status assigned, and such results could be misleading if a significant percentage of such issues is reopened in the future. However, less than 9% of non-fixed bugs tend to be reopened [96], and one of the root causes for re-opening a bug report resolved as not fixed is the difficulty in reproducing the bug [97]. It is worth noticing that in our manual inspection, the Difficult to fix or to replicate and Not replicable bug M closing motivations have been assigned only to 1.6% and 0.1% of issues in our sample, respectively. In addition, 281 out of 667 (42.13%) wontfix issues encompassed in our manually analyzed sample are related to the aspnet/Mvc project. This could represent a threat to internal validity, as developers of this project could adopt similar criteria for deciding not to address a reported issue. In our study we analyzed issues labeled as wontfix. However, GitHub developers may indicate that a specific issue will not be resolved directly in the issue title (e.g., by using the [WONTFIX] prefix 8 ). To counteract this issue, we estimated the number of GitHub issues in C# projects where the WONTFIX keyword appears in the issue title 9 . Such a search returned a very limited number of issues (i.e., < 10) and most of them (i.e., 75%) also had the wontfix label assigned. To avoid any bias in the evaluation of the performance of the experimented ML techniques, we adopted well-known information retrieval metrics, namely precision, recall, and F-measure [83], and applied a cross-project setting to train the ML models on data coming from different projects. However, since the information concerning all the issues in our dataset is used for the ML model construction, we cannot exclude that characteristics related to more recent issues are leveraged to predict the resolution of previously submitted issues.
Threats to conclusion validity. In our RQ 2 , we analyze different issue clusters having different sizes in terms of number of issues, and some of the differences we observed may not be significant. To mitigate this threat, we compared the values obtained for each cluster through the Mann-Whitney U test, widely adopted for similar purposes in the software engineering community, and discussed some of the differences that turned out to be statistically significant (p-value < 0.05). Since multiple comparisons are performed simultaneously, the Benjamini-Hochberg correction procedure [81] has been adopted to adjust p-values and control the false discovery rate.
Threats to external validity relate to the generalizability of our results. To mitigate this kind of threat, our evaluation has been performed on a dataset containing 6,330 issues, extracted from 279 heterogeneous projects. Moreover, we manually analyzed 667 wontfix issues of 97 different projects. However, such a sample may not be adequately representative of all GitHub projects. All the considered wontfix issues (1,844) are related to projects developed using mainly the C# programming language, and the distribution of the number of issues per project is quite skewed. This may represent a threat to external validity, since such issues can present common characteristics that ease their identification. For these reasons, in the future, we aim at extending our investigation by evaluating wontfix issues in further projects developed in different programming languages. However, on the positive side, the classifier achieved high performance in identifying wontfix issues even when not trained on issues related to any specific project; this training strategy thus allows the classifier to be more easily used on projects different from the ones used in our experimentation, without the need for re-training it. On the other hand, the high classification performance achieved by some ML models could depend on the fact that many wontfix issues encompassed in our dataset concern requests for feature enhancement/addition. Indeed, the qualitative analysis performed in Section 5.3 highlighted that ML models leverage textual features semantically related to this kind of request to perform the classification. Thus, it is not clear whether similar results could be obtained on more balanced datasets, in which the different types of wontfix issues are equally represented.

Conclusions
Software maintenance and evolution activities represent crucial tasks of any successful software project, and issues reported by users are a valuable source of information for developers interested in improving their systems. However, developers spend significant time handling issue reports and user requests. To support developers during issue handling processes, researchers conceived effective solutions aimed at prioritizing requested changes, as well as detecting potential issue misclassifications or duplications. However, few prior studies explored the nature of wontfix issues, and none of these studies proposed approaches to automatically determine whether an issue will be marked as wontfix. We argue that a timely identification of issues that are likely not to be addressed could help (i) project managers allocate resources, (ii) developers focus their attention on the issues that will actually be addressed, and (iii) customers know early whether their requirements will be satisfied [26]. To this aim, in this paper, by collecting more than 6,000 issues extracted from the history of 279 GitHub projects, we (i) analyzed the common characteristics of wontfix issues, and (ii) proposed an approach leveraging textual analysis and machine learning techniques to predict whether an issue will be resolved as a wontfix. Results of our study show that developers mainly tend to close issues (with the wontfix label) containing erroneous reports, or requesting features (or changes) that are not relevant or outside the purposes of projects (RQ 1 ). However, developers take a significant amount of time (about five months, on average) to decide whether an issue should be labeled as a wontfix. This time is mainly connected with (i) the issue type (issues containing erroneous bug reports are closed much faster), and (ii) the number of participants involved in the related discussions (RQ 2 ).
Last but not least, the proposed approach proved to be accurate (with an F-measure up to 93%) in identifying issues that will likely be labeled as wontfix (RQ 3 ).
This study helps to better comprehend the issue management dynamics in open source communities. As future work, we aim at investigating whether different projects tend to have different wontfix characteristics (due to different issue management processes), and the extent to which the automated identification of wontfix issues may impact the results produced by issue prioritization approaches. In addition, we aim at comparing issues with and without the wontfix label, in order to investigate how the presence of wontfix issues may affect the overall issue management process. We also intend to study further wontfix factors useful to automatically predict the actual motivations behind an issue that will be closed as a wontfix, as this could be useful information for developers. Moreover, we plan to compare the results of our approach with machine learning approaches successfully used in the same context [28,98] and involving other types of labels. Finally, we plan to investigate the usage of historical analysis to provide orthogonal/complementary information that could be combined with the adopted textual features.
Unlike issue-driven development, in pull-based development developers use branches to make the desired changes independently, and then create a pull request to ask for their changes to be merged into the main repository [99]. The integrators (usually the members of the project's core team) are asked to reply to such requests, evaluating the quality of the contributions, and eventually merging or rejecting the changes [100]. Manually identifying high-quality and desirable pull requests may be challenging [101], especially for popular projects, where tens of pull requests are submitted daily [102,8]. In the future we plan to verify whether the reasons behind the rejection of specific kinds of pull requests are similar to the ones that have been identified for wontfix issues, with the purpose of better comprehending team behavior when managing external requests.