Empirical study on developer factors affecting tossing path length of bug reports

Abstract: Bug reassignment (also known as bug tossing) is a common activity in the life cycle of bug reports, and it increases the time and labour cost of fixing bugs in software projects. In large-scale projects, about 6–10% of bug reports are tossed at least three times. However, the nature of repeatedly-tossed bug reports was usually overlooked in previous works. This study focuses on developer features from four aspects, namely network centrality, developer workspace, developer expertise, and transmissibility of developers, to investigate which factors affect the tossing path length (TPL). Using statistical methods, this study finds that working theme, product, component, and degree centrality are the key impact factors affecting the change of TPL. The four key features are then reduced to three core features, namely working theme, product, and component, which contribute about 90% of the variance of TPL. Finally, the two feature groups mentioned above are applied in six machine learning algorithms to predict potential developers for bug reports from Eclipse and Mozilla, and the results validate the effectiveness of the feature groups for developer recommendation. Hence, this study provides an easy-to-use feature selection method to train quality developer recommenders for automatic bug triage in an efficient way.


Introduction
Software bug resolution is an important part of software development because it is highly related to development and maintenance costs. The process of assigning bug reports to appropriate developers is known as bug triage [1]. Ideally, each new bug report will be efficiently assigned to a suitable developer, but in practice, an assigned bug report may have to be reassigned if the developer cannot fix the bug for a variety of reasons. The reassignment repeats until a developer fixes the bug. This process of bug reassignment is called bug tossing [2]. Some empirical studies have shown that, on average, it takes about 40 days to assign a bug report to the first developer in the Eclipse project, and it then takes an additional 100 days or more to reassign the bug report to the second developer [3,4].
During the bug triaging process, improper assignments and tosses (referred to as reassignments) can obviously increase the number of repeatedly-tossed bug reports. The goal of improving the efficiency of resolving bugs lies in two aspects: recommending appropriate developers for a given bug report, and reducing the number of tosses (i.e. tossing path length (TPL)). Many approaches based on machine learning (ML) have been proposed to improve the accuracy of recommending matchable developers [1, 5-13], in which a classifier (also known as a classification model) is built by using some features of bug reports such as title, comment, and description. Besides, a few approaches that combine ML and tossing graph (TG) have also been proposed to reduce TPL as well as to improve the prediction accuracy [4,14,15]. Although the results of these approaches are promising, the reasons why a few bug reports are repeatedly tossed remain unknown.
To answer the question mentioned above, in this study, we focus on developer factors affecting the number of tosses during the bug triaging process. That is to say, the goal of our work is to identify the key impact factors that shape repeatedly-tossed bug reports, so as to facilitate the process of feature engineering for classifier construction, which is fundamental to the application of ML in bug triage. We analyse the commonly-used factors in previous studies from the perspectives of developer and bug report, and we define the factors that affect the relationship between developer and bug triage as developer features. To this end, we consider a total of 16 factors closely relevant to developer features, and they are extracted from four aspects: network centrality, workspace, expertise, and transmissibility. The ten network centralities [16], including degree, in-degree, out-degree, node betweenness (undirected and directed), closeness, in-closeness, out-closeness, and edge betweenness (undirected and directed), are defined based on graph theory. The developer workspace, including product and component, refers to the project workspace in which a developer works. The developer expertise, including fixing probability and working theme, is defined by all the bug reports that a developer has dealt with. The transmissibility includes tossing probability and the identity of a developer acting as a fixer or tosser. The empirical results on Eclipse and Mozilla and the technical contributions of this study are summarised as follows.
i. By using a multivariable regression model, four factors (i.e. degree centrality, product, component, and working theme) are identified as key impact factors among the nine factors that have high correlations with TPL. The four factors contribute more than 95% of the variance of TPL.
ii. The set of main impact factors is further reduced to a minimal subset of core factors (i.e. product, component, and working theme) by calculating the variance inflation factor (VIF). The three factors contribute about 90% of the variance of TPL.
iii. By using six common ML algorithms, the developer classifiers trained with the two feature sets mentioned above perform better than the classifiers built using the commonly-used features of bug reports and developers described in previous studies. Therefore, our work provides an easy-to-use feature selection method to train quality developer classifiers for automatic bug triage efficiently.
In the rest of this paper, Section 2 introduces the work related to our study. Section 3 presents the measures of TPL and the 16 developer factors. Section 4 introduces the experimental setup, and Section 5 outlines research questions and presents the results of the research questions as well as their implications for practice and research. In Section 6, we discuss some potential threats to the validity of our work. Finally, Section 7 concludes this paper and presents our future work.

Related work
Since the early work of Perry and Stieg [17], many researchers have investigated effective software quality assurance approaches from the perspective of bug triage. Generally speaking, there are three mainstream types of approaches, namely information retrieval (IR) based, ML based, and graph theory-based approaches [13,18,19]. In the following subsections, we will give a brief introduction to each of the three types of approaches.

IR-based approaches
Some IR approaches were proposed to recommend appropriate developers for a new bug report [20-25]. For example, Canfora and Cerulo presented an IR-based method that identified candidate developers using the textual description of a new change request as a query [20]. Due to the difficulty in mining bug repositories without sufficient bug fixing records, Shokripour et al. collected the necessary data from the version-control repositories using some information extraction methods [22], so as to obtain better results. In the work of Nagwani and Verma [21], they extracted frequent terms from the textual information of bug reports and used term similarity to identify appropriate developers for newly reported bugs. Recently, Xia et al. recommended developers by calculating the affinity score of each developer to possible bug reports [24,25].

ML-based approaches
In addition to the works based on IR, researchers have also proposed various ML-based methods with the goal of assigning bug reports automatically and accurately [1, 5-13]. Cubranic and Murphy first used the Naive Bayes (NB) classification algorithm to semi-automate the process of bug assignment [5]. Then, Anvik et al. improved the approach proposed by Cubranic et al., and they utilised the support vector machine (SVM) algorithm to recommend a set of appropriate developers to a bug manager [1]. Baysal et al. presented a theoretical SVM-based framework for automatic bug assignment, which considered developer expertise and workloads [8]. Lin et al. reported a case study of automatic bug assignment which used Chinese text and other non-text information of bug reports [7]. In [6,10,13], several comparative analyses of supervised ML algorithms for automatic bug triage were presented in detail to guide the selection of these algorithms.

TG-based approaches
Unlike the methods mentioned above, a few researchers found a different direction based on the theory of TG to improve the prediction accuracy of triaging bug reports. Jeong et al. proposed a TG model based on Markov chains [4], and the proposed model reduced tossing steps by up to 72%. Then, Bhattacharya et al. improved the prediction accuracy and reduced TPL by employing the combination of ML classifiers and a specific TG [14]. Zhang and Lee assigned bug reports automatically using concept profile and network analysis [26]. Wang et al. presented the heterogeneous developer network model DevNet, a framework for representing and analysing developer collaboration in bug repositories [27]. Park et al. developed an automatic triaging system considering both accuracy and cost and proposed a model CosTriage to characterise user-specific experiences and estimate the cost of each bug category [28]. Recently, Zhang et al. proposed a model KSAP, which used historical bug reports and heterogeneous networks of bug repositories to improve the performance of automatic bug assignment [29]. In addition, some hybrid methods combining ML and TG have attracted much attention in recent years.

Summary
The textual features of bug reports have been widely used in bug triage. Also, some previous studies investigated efficient bug assignment methods based on developer features. For example, the early work by Jeong et al. took the tossing probability between developers into consideration [4]. Besides, the interval since a developer's last activity was also considered, to filter out inactive (or retired) developers [14,15]. In addition, developer attributes, such as expertise, interest, and working partners, have also been used in [30-35].
Most of the approaches mentioned above used the features of bug reports and developers, and the selection of such features is essential for improving the prediction accuracy. In Table 1, we show an overview of the principal features of bug reports used in previous studies, where the bold lines separate the IR-, ML-, and TG-based work. The marker 'x' denotes that a feature is used in the corresponding paper. It can be seen from this table that the features in the first five columns are often used in automatic bug triage, suggesting that the vast majority of these approaches extract textual information (i.e. summary, description, and comments) and the information of source location (i.e. product and component). Similarly, in Table 2 we also present an overview of the main features of developers used in previous studies. Surprisingly, the developer features used vary from one paper to another, including fixing probability, tossing probability, degree, betweenness, and closeness centrality, implying that there is as yet no consensus on widely-recognised developer features.
Unlike those previous studies that focused on new methods for bug triage, in this work, we explore the main reasons why some bug reports are repeatedly tossed from a new perspective of developer feature. Our work acts as a feature selector for ML-based approaches. Thus, we can efficiently train ML classifiers using the core factors obtained in this work to predict suitable developers for given bug reports.

Tossing path
For the detailed descriptions of bug reports and bug life cycle, please refer to [4,40]. First of all, we present the concepts of tossing path and TPL.
A tossing path is defined as a finite sequence of tossing steps among developers in the bug tossing process [4]. In Fig. 1, the tossing path starts from the first assigned developer d_1, and d_1 tosses the bug report to d_2 if he/she does not fix the bug. The reassignment from d_1 to d_2 (d_1 → d_2) is the first tossing step, d_2 → d_3 is the second tossing step, and so forth. The path ends when developer d_l fixes the bug.
A tossing path must have exactly one fixer; bug reports without any fixer are not considered in our work. The number of tossing steps (l − 1) is defined as the tossing length of a tossing path K = d_1 → d_2 → d_3 → ⋯ → d_l. If the tossing path of a bug report has only one element, i.e. K = d_1 with d_1 = d_l, the first assigned developer who receives the bug report finally fixes it. Based on all available tossing paths, we can create a developer collaboration network, where each vertex represents a developer and each edge indicates a tossing step between two developers.
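As a concrete illustration of these definitions, the following sketch computes TPL and the collaboration edges from a handful of tossing paths; the path data and developer names are made up for illustration, and in a real setting the paths would be reconstructed from the 'History' field of each bug report.

```python
# Sketch of tossing-path handling; the record format (one ordered developer
# list per bug) is a simplifying assumption, not the paper's exact schema.
from collections import defaultdict

def tossing_path_length(path):
    """TPL = number of tossing steps = (#developers on the path) - 1."""
    return len(path) - 1

def collaboration_edges(paths):
    """Build the developer collaboration network: one directed edge per
    tossing step d_i -> d_(i+1), weighted by how often it occurs."""
    edges = defaultdict(int)
    for path in paths:
        for src, dst in zip(path, path[1:]):
            edges[(src, dst)] += 1
    return dict(edges)

paths = [
    ["d1"],              # fixed by the first assignee: TPL = 0
    ["d1", "d2", "d3"],  # two tosses: TPL = 2
    ["d2", "d3"],        # one toss:   TPL = 1
]
lengths = [tossing_path_length(p) for p in paths]
edges = collaboration_edges(paths)
```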

Measures of developer factors
In this section, we detail the definitions of the four types of developer factors used in this study, namely network centrality, workspace, expertise, and transmissibility.

Network centralities:
In the analysis of networks, three main indicators of centrality [16], namely degree, betweenness, and closeness, are used to measure the importance of vertices, while the importance of an edge is usually measured by edge betweenness centrality, which reflects the communicating ability of the edge in a network. The definitions of these measures are given in (1)-(10).
Degree centrality: The degree centrality of a vertex (i.e. developer) d_i is defined in (1) [16]:

N_D(d_i) = \frac{1}{n - 1} \sum_{j \neq i} \delta(d_i, d_j),   (1)

where δ(d_i, d_j) is 1 when a link exists between d_i and d_j and 0 otherwise, and n is the number of vertices in the network.
For directed networks, the degree is measured by in-degree and out-degree, which are defined in (2) and (3), respectively [16], where δ(d_i ← d_j) (resp. δ(d_i → d_j)) is 1 if d_j has a link to (resp. from) d_i:

N_{ID}(d_i) = \frac{1}{n - 1} \sum_{j \neq i} \delta(d_i \leftarrow d_j),   (2)

N_{OD}(d_i) = \frac{1}{n - 1} \sum_{j \neq i} \delta(d_i \rightarrow d_j).   (3)

Closeness centrality: In undirected networks, the closeness of vertex d_i is shown in (4), where s(d_i, d_j) is the total distance of the shortest path [41] between d_i and d_j:

N_C(d_i) = \frac{n - 1}{\sum_{j \neq i} s(d_i, d_j)}.   (4)

For directed networks, in-closeness and out-closeness are presented in (5) and (6), respectively, where s(d_i ← d_j) is the length of the shortest path from d_j to d_i, and vice versa for s(d_i → d_j):

N_{IC}(d_i) = \frac{n - 1}{\sum_{j \neq i} s(d_i \leftarrow d_j)},   (5)

N_{OC}(d_i) = \frac{n - 1}{\sum_{j \neq i} s(d_i \rightarrow d_j)}.   (6)

Node betweenness centrality: In undirected networks, node betweenness is defined in (7), where h_{d_j d_k} is the number of the shortest paths between vertices d_j and d_k and h_{d_j d_k}(d_i) is the number of those paths [41] that also pass through d_i:

N_B(d_i) = \sum_{j \neq k \neq i} \frac{h_{d_j d_k}(d_i)}{h_{d_j d_k}}.   (7)

The node betweenness centrality in directed networks can be written as (8), where the shortest paths are calculated based on directed paths [16]:

N_{DH}(d_i) = \sum_{j \neq k \neq i} \frac{h_{d_j \rightarrow d_k}(d_i)}{h_{d_j \rightarrow d_k}}.   (8)

Edge betweenness centrality: The definitions of the edge betweenness [16] in undirected and directed networks are formulated in (9) and (10), respectively, where h_{d_j d_k}(e) is the number of the shortest paths between vertices d_j and d_k that pass through edge e:

N_{EB}(e) = \sum_{j \neq k} \frac{h_{d_j d_k}(e)}{h_{d_j d_k}},   (9)

N_{DEB}(e) = \sum_{j \neq k} \frac{h_{d_j \rightarrow d_k}(e)}{h_{d_j \rightarrow d_k}}.   (10)
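These centrality measures are standard graph-theoretic quantities. As a minimal, dependency-free sketch, the following computes degree centrality (as in (1)) and closeness centrality (as in (4)) on a tiny undirected developer network; the adjacency data is hypothetical.

```python
# Degree and closeness centrality on an undirected network stored as an
# adjacency dict {vertex: set of neighbours}. Closeness uses a plain BFS
# since all tossing edges are treated as unit-length here.
from collections import deque

def degree_centrality(adj, v):
    """deg(v) / (n - 1): fraction of the other vertices linked to v."""
    return len(adj[v]) / (len(adj) - 1)

def closeness_centrality(adj, v):
    """(n - 1) / (sum of shortest-path distances from v to all others)."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return (len(adj) - 1) / sum(dist.values())

# A small star-like network: d1 is linked to every other developer.
adj = {"d1": {"d2", "d3", "d4"}, "d2": {"d1"}, "d3": {"d1"}, "d4": {"d1"}}
```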

Developer workspace:
In previous studies [1,12], the location of bugs, e.g. component, is a key reference for the recommendation of suitable developers, and the developers who are responsible for a given component (or usually fix bugs from the component) may provide practical information about bug fixing. In this study, developer workspace, including component and product, indicates that developers usually fix bugs from location-specific components or products.
Working component: The working component of a developer is determined by the components of all the bugs in which the developer has ever participated. The probability of developer d_i working in component c is formulated in (11):

P(d_i, c) = \frac{\sum_{b \in B_{d_i}} \delta(b, c)}{\sum_{c' \in c^*} \sum_{b \in B_{d_i}} \delta(b, c')},   (11)

where B_{d_i} is the set of bug reports in which d_i has ever participated in the bug triaging process, δ(b, c) represents whether bug report b comes from component c, and c^* denotes the set of all components in which d_i has ever participated. A developer may fix many bugs, and these bugs are probably from different components; therefore, the developer may work on multiple components.
Working product: Similarly, the working product of developer d_i is defined in (12):

P(d_i, p) = \frac{\sum_{b \in B_{d_i}} \delta(b, p)}{\sum_{p' \in p^*} \sum_{b \in B_{d_i}} \delta(b, p')},   (12)

where p^* is the set of products in which d_i has been involved and δ(b, p) represents whether bug report b is from product p.
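A minimal sketch of the working-component and working-product probabilities follows; it assumes each bug report is represented as a dict with 'component' and 'product' keys, and the bug data is made up for illustration.

```python
# Workspace probability: the fraction of a developer's bugs that come from
# each component (or product), normalised over all locations the developer
# has touched.
from collections import Counter

def workspace_probability(bugs, key):
    """Return {location: probability} for one developer's bug set."""
    counts = Counter(bug[key] for bug in bugs)
    total = sum(counts.values())
    return {loc: n / total for loc, n in counts.items()}

# Hypothetical bug set of one developer d_i.
bugs_of_di = [
    {"id": 1, "component": "UI", "product": "Platform"},
    {"id": 2, "component": "UI", "product": "Platform"},
    {"id": 3, "component": "Core", "product": "JDT"},
    {"id": 4, "component": "UI", "product": "JDT"},
]
w_component = workspace_probability(bugs_of_di, "component")
w_product = workspace_probability(bugs_of_di, "product")
```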

Developer expertise:
Developer expertise represents a developer's proficiency in processing some specific types of bugs. Two types of developer expertise are considered in this study, i.e. working theme and fixing probability.
Working theme: The working theme of a developer is the topic to which most of the bug reports fixed by the developer belong, and it is defined according to the topics of the historical bug reports in which the developer has been involved. The probability of developer d_i working on theme t is shown in (14), where θ(b, t), generated from a latent Dirichlet allocation (LDA) model [42], is the probability that bug report b belongs to topic t, and B_{d_i} is the set of bug reports in which d_i has been involved:

P(d_i, t) = \frac{\sum_{b \in B_{d_i}} \theta(b, t)}{|B_{d_i}|}.   (14)

A developer may have at least one working theme because a bug report may cover several topics. Here, t̂ is defined as the working theme of developer d_i (see (15)), which is determined by the most likely topic of d_i:

\hat{t} = \arg\max_{t \in t^*} P(d_i, t),   (15)

where t^* is the set of candidate working themes of d_i.
Fixing probability: The fixing probability of a developer denotes his/her ability to fix bug reports, defined in (16) as the ratio of the number of bugs the developer has resolved to the total number of bug reports received:

P(d_i) = \frac{f_{d_i}}{m_{d_i}},   (16)

where m_{d_i} is the number of bug reports assigned to developer d_i and f_{d_i} is the number of bugs fixed by d_i.
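The two expertise measures can be sketched as follows. The LDA topic distributions θ(b, t) below are made-up numbers; a real implementation would obtain them from a fitted topic model.

```python
# Working theme: average the per-bug topic distributions theta(b, t) over a
# developer's bugs, then take the most likely topic (argmax).
# Fixing probability: fixed bugs over received bugs.

def working_theme(topic_dists):
    """Return (theme, probability) from a list of per-bug topic dicts."""
    n = len(topic_dists)
    topics = topic_dists[0].keys()
    avg = {t: sum(d[t] for d in topic_dists) / n for t in topics}
    best = max(avg, key=avg.get)
    return best, avg[best]

def fixing_probability(fixed, assigned):
    """Ratio of bugs fixed to bugs assigned; 0 if none were assigned."""
    return fixed / assigned if assigned else 0.0

# Hypothetical theta(b, t) for three bugs over two topics (rows sum to 1).
thetas = [
    {"t1": 0.8, "t2": 0.2},
    {"t1": 0.6, "t2": 0.4},
    {"t1": 0.3, "t2": 0.7},
]
theme, prob = working_theme(thetas)
```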

Transmissibility:
In this study, transmissibility means that developers play the role as a dispatcher in the bug tossing process, and it includes two features: tossing probability and tosser identity.
Tossing probability: The tossing probability, the probability that a developer tosses out a bug report to another developer, has been used to reduce the number of tosses [4,14,15,34]. This measure is defined in (17):

P(d_i \to d_j) = \frac{k_{d_i \to d_j}}{\sum_{d_m} k_{d_i \to d_m}},   (17)

where k_{d_i → d_j} is the number of bug reports tossed from developer d_i to developer d_j.
Developer identity: Many developers have never fixed a bug, and they always toss out the bug reports they receive to other developers. In this study, such a developer is called a tosser. A developer d_i is a tosser if P(d_i) = 0, i.e. the number of bug reports fixed by the developer equals 0, which is a special case of (16). The tosser identity of d_i is defined in (18):

O_T(d_i) = \begin{cases} 1, & P(d_i) = 0, \\ 0, & \text{otherwise}. \end{cases}   (18)
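A small sketch of the two transmissibility features follows, using a hypothetical table of toss counts k_{d_i → d_j}.

```python
# Tossing probability: bugs tossed from d_i to d_j, normalised by all bugs
# tossed out by d_i. Tosser identity: a developer who received bugs but
# fixed none of them.

def tossing_probability(toss_counts, d_from, d_to):
    """P(d_i -> d_j) = k_{d_i -> d_j} / total bugs tossed out by d_i."""
    total = sum(k for (src, _), k in toss_counts.items() if src == d_from)
    return toss_counts.get((d_from, d_to), 0) / total if total else 0.0

def is_tosser(fixed, assigned):
    """True iff the developer's fixing probability is 0."""
    return assigned > 0 and fixed == 0

# Hypothetical toss counts: (source, target) -> number of bugs tossed.
toss_counts = {("d1", "d2"): 6, ("d1", "d3"): 2, ("d2", "d3"): 4}
```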

Data processing
The data processing in our work includes two steps: the first is to clean the raw data and filter out unnecessary information, and the second is to prepare the experimental data for answering our research questions. Tossing records in the 'History' field were extracted to construct tossing steps and the developer network. The 'Summary', 'Description', and 'Comment' attributes of bug reports were put together as the free-form textual content, and the filter we used for this content was similar to that of a previous study [13]. Note that we reduce the dimensionality of the training data (more than 100,000 dimensions) using LDA.
Due to the differences in the scales of numerical values for the four types of features, all the features should be normalised. Each resolved bug report has a tossing path involving several developers, and the normalisation of a given feature averages the values of the feature over all the developers on each tossing path. The average values of the 10 network centralities for K tossing paths with length l are thus normalised in (r1)-(r10), respectively, in Table 3. Here, Q_k denotes the set of developers on the kth tossing path and l + 1 is the number of developers. Any two developers on a tossing path may not work in the same workspace. The average probability (W_C(l)) that two adjacent developers work on the same component for K tossing paths with length l is defined in (r11) in Table 3, where the indicator is 1 if the components of two adjacent developers d_i and d_j are the same and 0 otherwise. Similarly, the average probability (W_P(l)) that two adjacent developers work on the same product is shown in (r12).
For the developer expertise, the average fixing probability of developers is defined in (r13), and the average probability that a developer's working theme is the same as the topic of bug report b is defined in (r14), where the indicator is 1 if d_i shares the same topic with b and 0 otherwise. Finally, for the fourth type of feature, the average tossing probability (O_P(l)) is normalised in (r15), and (r16) represents the average proportion of tossers (O_T(l)) in a tossing path with length l. The largest TPL is 13 for Eclipse and 19 for Mozilla. It is noticeable that the bar at l = 16 does not exist in Fig. 2a2. For Eclipse and Mozilla, about 60% of bug reports were fixed at l = 0, i.e. these bug reports have never been reassigned. Also, about 20% of the remaining bug reports are tossed more than twice (l ≥ 2).
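The per-path normalisation can be sketched as follows; the per-developer feature values and component labels below are hypothetical examples.

```python
# Two of the normalisation patterns from Table 3: (a) average a
# per-developer feature over the l + 1 developers on a path; (b) measure
# workspace agreement over adjacent developer pairs (cf. (r11)-(r12)).

def mean_over_path(path, feature):
    """Average a per-developer feature value over a tossing path."""
    return sum(feature[d] for d in path) / len(path)

def same_location_rate(path, location):
    """Fraction of adjacent pairs working in the same component/product."""
    pairs = list(zip(path, path[1:]))
    same = sum(1 for a, b in pairs if location[a] == location[b])
    return same / len(pairs) if pairs else 1.0

path = ["d1", "d2", "d3"]
degree = {"d1": 0.2, "d2": 0.6, "d3": 0.4}        # hypothetical centralities
component = {"d1": "UI", "d2": "UI", "d3": "Core"}  # hypothetical workspaces
```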

Results and discussion
In this section, we present the analysis and experimental results to answer the research questions. Note that, for each pair of the first three successive research questions, the answer to the latter depends on the result of the former. The last research question, RQ4, applies the developer features obtained in RQ2 and RQ3 to recommend potential developers using ML algorithms.

Answers to research questions
Previous studies have proposed many approaches to recommend potential developers, predict bug fixing time, and understand repeatedly-tossed bug reports. However, the question of why many bug reports are repeatedly tossed among developers remains unanswered quantitatively. In this study, we formulate this issue as a regression problem (i.e. exploring the prime factors that result in long tossing paths) and break it down into three research questions, RQ1-RQ3. RQ1. Which developer features are relevant to the change of TPL?
Motivation. As shown in Tables 1 and 2, the 16 features in question are commonly used in many previous studies. However, whether these commonly-used features contribute to the change of TPL remains unknown. Therefore, we first need to explore the correlation between TPL and each feature, so as to filter out irrelevant and weakly-correlated features.
Method. The Pearson correlation was used to measure the correlation of TPL with the four types of features quantitatively. The Pearson correlation coefficient (r) between two variables X and Y is defined in (19):

r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}.   (19)

Generally speaking, two variables are highly correlated when |r| ≥ 0.6, and they are moderately correlated when 0.4 ≤ |r| < 0.6. Otherwise, they are weakly correlated.
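Computed directly from its definition (covariance over the product of standard deviations), the Pearson coefficient can be sketched without any dependencies; the TPL and feature vectors below are toy data.

```python
# Pearson's r = cov(X, Y) / (sigma_X * sigma_Y), written out in full.
import math

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

tpl = [0, 1, 2, 3, 4]
feature = [0.9, 0.7, 0.5, 0.3, 0.1]  # perfectly negatively correlated toy data
```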
Result. The Pearson correlation coefficients of TPL with the 16 features are listed in Table 4, where the coefficients are marked with the symbol '*' if |r| ≥ 0.6. Table 4 shows that, in both Eclipse and Mozilla, there are significantly negative Pearson correlations between TPL (l) and four features (i.e. W_P, W_C, E_P, and E_T). However, the Pearson correlation coefficients between l and the other features vary from project to project. For example, l is highly correlated with N_D in Eclipse, but the two are only weakly correlated in Mozilla. To handle such cases, if the Pearson correlation coefficient between l and a feature in either Eclipse or Mozilla is marked with the symbol '*', the feature is regarded as highly correlated with l in this study. According to this definition, five further features (i.e. N_D, N_ID, N_OD, N_DH, and O_T) are also considered to be highly correlated with l.
Based on the results of Table 4, a few interesting findings on the nine features that increase TPL can be drawn. First, bugs are likely to be tossed to developers with high degree centrality. Second, developers who work at different tossing steps usually belong to different workspaces. Third, sometimes the topics of bug reports do not match the working themes of the assigned developers. Fourth, the proportion of tossers in the developer collaboration network increases as TPL increases.
Answer to RQ1: Nine features, N D , N ID , N OD , O T , N DH , W P , W C , E P , and E T , are highly correlated with the change of TPL. RQ2. Which developer features are crucial for the change of TPL?
Motivation. Although the Pearson correlation coefficient measures the linear association between each feature under discussion and TPL, we are still unable to determine, based only on the answer to RQ1, which main features result in the change in the length of tossing paths. Method. For this research question, we model the relationship between TPL and the nine features using a multivariable regression model and identify the key features that result in long tossing paths by analysing their relative importance. In the regression analysis, the nine features are called independent variables (or predictors), and the dependent variable (or response) refers to TPL l.
Result. The best regression model fitted for the Eclipse data is denoted by (20); it is a linear model over the five independent variables N_D, N_ID, E_T, W_P, and W_C, with positive fitted coefficients for N_D and N_ID and negative coefficients for E_T, W_P, and W_C. Note that the model was obtained by the subsets regression method and selected by meeting three criteria (i.e. an adjusted R^2 value >0.95, a low Mallows's Cp, and a minimum number of independent variables). The significance level of each independent variable, denoted by p-value, is presented above the bold line in Table 5. On the one hand, the adjusted R^2 of this model is 0.9705, which indicates that the five independent variables, N_D, N_ID, E_T, W_P, and W_C, explain 97.05% of the variance of the dependent variable. On the other hand, the p-values in Table 5 are <0.05, which suggests that the five features are statistically significant variables for the model. According to the signs of the five features in (20), l will increase with the increase of N_D and N_ID and decrease with the increase of E_T, W_P, and W_C. To further determine the relative importance of the independent variables, two statistical indicators, the standardised regression coefficient (SC) and the incremental impact on R-squared (In), were used in our experiment. The former estimates the mean change of the dependent variable for a standard deviation change in the independent variable in question while holding the other independent variables constant. The latter calculates the increase of R^2 that each independent variable produces when it is added to the regression model that already contains all of the other independent variables. In statistics, independent variables are usually more relevant when they have larger absolute standardised regression coefficients and larger increases in R^2.
As shown in Table 5, E_T explains the greatest amount of variance of TPL with regard to the two indicators. In this model (see (20)), the criteria of SC and In generate the same conclusion on the relative importance of the five independent variables, sorted in descending order by relative importance. Similarly, the best model fitted for the Mozilla data is presented in (21), and its adjusted R^2 is 0.959. As shown in Table 5, according to the criterion of SC, E_T has the standardised coefficient with the largest absolute value, followed by W_P, W_C, and N_D. The result obtained by using the criterion of In presents a similar order of these features, i.e. E_T contributes the most to the model, followed by W_P, W_C, and N_D. RQ3. Is there a minimum feature subset that determines the change of TPL?
Motivation. On the one hand, it is an accepted practice that we can obtain a satisfactory (or comparable) result with fewer features, thus leading to lower training cost. On the other hand, multicollinearity is a common phenomenon in multiple regression models. Therefore, those highly correlated features that provide redundant information can be further removed, so as to form a minimum feature subset.
Method. The VIF indicates whether the multicollinearity problem exists in a regression model. A VIF of 1 means that there is no correlation between a given predictor and the remaining predictors, whereas a VIF >10 indicates that the multicollinearity problem exists. In this study, the minimum subset of developer features is obtained by removing the features with high VIF values, and the removal process ends when the VIF values of all the remaining features are <10.
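The VIF screen can be sketched as follows. For a predictor x_j, VIF_j = 1/(1 − R_j²), where R_j² comes from regressing x_j on the remaining predictors; with a single remaining predictor, R_j² reduces to the squared Pearson correlation, which keeps this illustration dependency-free. The predictor values are toy data.

```python
# VIF for the two-predictor case: 1 / (1 - r^2), where r is the Pearson
# correlation between the two predictors. A near-collinear pair produces a
# VIF far above the usual threshold of 10.
import math

def vif_two_predictors(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    r2 = (cov / (sx * sy)) ** 2
    return 1.0 / (1.0 - r2)

x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.1, 1.9, 3.2, 3.8, 5.1]   # nearly collinear with x1 -> VIF well above 10
x3 = [2.0, 1.0, 3.0, 1.5, 2.5]   # weakly related to x1 -> VIF close to 1
```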
Result. By using the method mentioned above, we removed N_ID and N_D one by one from the model (see (20)), so as to make sure that the VIF values of the remaining independent variables are <10. Hence, the minimum subset of developer features in Eclipse includes only E_T, W_P, and W_C. We then recalculated the p-values and the adjusted R^2 after the removal process. All the p-values are <0.001, and the adjusted R^2 reaches 0.933; compared with the value of R^2 in Table 5, the loss of R^2 is 4.2%. Similarly, for Mozilla, the minimum feature subset of the model (see (21)) also includes E_T, W_P, and W_C, and the p-values of the three features are <0.002. The adjusted R^2 is 0.912 after the removal of N_D, i.e. the loss of R^2 is 4.7%.
Answer to RQ3: Three features, E T , W P , and W C , are the core features that have a significant effect on the change of TPL. RQ4. Can the feature sets obtained in RQ2 and RQ3 contribute a better performance on developer recommendation?
Motivation. In a statistical sense, both the two sets of features obtained in RQ2 and RQ3 have significant effects on reducing the length of tossing paths. However, their practical applications in automatic bug triage are yet to be tested. As a result of feature engineering for ML-based approaches to developer recommendation, we need to carefully examine the contributions of our work to building quality classifiers (or called recommenders) which can achieve better prediction performance.
Method. To compare the impacts of different groups of features on developer recommendation, we carefully designed a few controlled experiments based on these groups, including the following steps: feature group design, developer recommender training, evaluation metric selection, and statistical comparative analysis. First, groups a1 and a2 (see Table 6) contain only developer features, and groups a3, b1, and b2 (see Table 6) contain the most commonly-used developer or textual features employed in many previous studies (see Table 1). Note that group a1 includes the three core features obtained in RQ3, group a2 is composed of the four key features obtained in RQ2, and group b1 contains the corresponding bug-report features compared with group a1. Second, according to the five feature groups, we train several developer recommenders using six common classification algorithms, i.e. decision tree (DT), SVM, NB, logistic regression (LR) with stochastic gradient descent training, K-nearest neighbours (KNN), and random forest (RF). They were implemented using the open-source Python library scikit-learn [http://scikit-learn.org/stable/index.html] with default settings. Third, three frequently-used metrics, accuracy, precision, and recall, are used to measure the performance of the developer recommenders; for their definitions, please refer to [43]. Fourth, the Wilcoxon signed-rank test and an effect size (Cliff's delta) are utilised to conduct an in-depth statistical comparative analysis.
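One of the evaluation steps, accuracy at top k, can be sketched as follows: a recommendation list counts as correct if the actual fixer appears among the first k suggestions. The recommendation lists and fixers below are made-up examples, not outputs of the trained classifiers.

```python
# Top-k accuracy for developer recommendation: the fraction of bug reports
# whose actual fixer appears in the first k recommended developers.

def accuracy_at_k(recommendations, fixers, k):
    hits = sum(1 for recs, fixer in zip(recommendations, fixers)
               if fixer in recs[:k])
    return hits / len(fixers)

# Hypothetical ranked recommendation lists for three bug reports.
recs = [
    ["d1", "d2", "d3", "d4", "d5"],
    ["d9", "d2", "d7", "d1", "d6"],
    ["d4", "d5", "d8", "d2", "d3"],
]
fixers = ["d2", "d1", "d9"]  # the developer who actually fixed each bug
```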
Result. We divided the data sets of Eclipse and Mozilla, sorted in chronological order, into training and validation sets. Here, we present an example of an 80:20 split. As shown in Table 7, for both Eclipse and Mozilla, in most cases SVM outperforms the other five classification algorithms with regard to the three metrics, regardless of feature groups and the number of recommended developers. Moreover, the six ML algorithms achieve better results when recommending more candidate developers. For example, the highest accuracy, precision, and recall (at top 5) are 0.909, 0.598, and 0.661, respectively, for group a2 in Eclipse, and they are 0.774, 0.519, and 0.568, respectively, for the same group in Mozilla. Note that because the results at top 1 are the same with and without a TG, they are omitted in this study, as in [4].
Since the six classification algorithms achieve the best accuracy at top 5, we then analyse the impact of the percentage of training data on the prediction results when recommending the top five developers. In each of Figs. 3 and 4, five curves in different colours represent the corresponding five feature groups. The X-axis is the percentage of training samples, ranging from 50 to 90% in steps of 2%, and the Y-axis represents a given evaluation metric; accuracy, precision, and recall are placed in the first, second, and third columns, respectively. Generally speaking, for each of the six classification algorithms, the five curves increase slightly as the size of the training set grows. The performance of group a1 is clearly better than those of groups b1, a3, and b2, while the prediction results of groups a1 and a2 are very similar on the three evaluation metrics in most cases.
The Wilcoxon signed-rank test and Cliff's delta effect size are employed to compare the results of a1 and the other four feature groups quantitatively, and the comparisons of 21 predictions (with different percentages of training data) at top 5 are shown in Table 8. The Wilcoxon signed-rank test determines whether the results of two groups come from the same distribution: there is no significant difference between the two groups when the p-value (p v ) of the test is >0.05 (the significance threshold in our experiment). Cliff's delta (δ) measures how often the values in one distribution are larger than the values in a second distribution; the difference is interpreted as small, and can usually be ignored, when |δ| < 0.4. Note that a negative value of δ indicates that the result on the right-hand side is better than that on the left-hand side. Table 8 shows that, except for the four cases using the DT and LR algorithms, the classifiers trained using group a1 perform better than those trained using groups b1, a3, and b2, according to the small p-values (p v < 0.05) and large Cliff's delta values (δ > 0.6) for comparisons a1Vb1, a1Va3, and a1Vb2. As to comparison a1Va2, the accuracy, precision, and recall results for groups a1 and a2 are close, because more than 60% (23/36) of the p-values are >0.05, which implies that there is no significant difference between groups a1 and a2. Considering that the absolute values of δ in these 23 cases are <0.4, the difference between groups a1 and a2 can be ignored, especially for NB and LR in Eclipse and for NB, KNN, and RF in Mozilla.
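The statistical comparison described above can be reproduced in miniature as follows. This sketch uses scipy for the Wilcoxon signed-rank test and computes Cliff's delta directly from its definition; the two result vectors are invented placeholders, not the study's actual per-split accuracies.

```python
from scipy.stats import wilcoxon

def cliffs_delta(a, b):
    """Fraction of (a, b) pairs where a wins, minus the fraction where b wins."""
    gt = sum(x > y for x in a for y in b)
    lt = sum(x < y for x in a for y in b)
    return (gt - lt) / (len(a) * len(b))

# Illustrative paired results, e.g. accuracies of groups a1 and b1
# across matched training-set percentages (real study: 21 pairs)
a1 = [0.88, 0.90, 0.89, 0.91, 0.93, 0.92, 0.87]
b1 = [0.83, 0.84, 0.82, 0.83, 0.84, 0.82, 0.76]

stat, p_value = wilcoxon(a1, b1)   # paired, non-parametric test
delta = cliffs_delta(a1, b1)
print(p_value < 0.05, abs(delta) >= 0.4)  # significant? non-negligible effect?
```

With the interpretation used in the paper, a1 would be judged better than b1 only when both conditions hold: p v < 0.05 and |δ| not small.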
Answer to RQ4: The feature sets we obtained contribute to better performance on developer recommendation across six common classification algorithms; compared with the set of the four key features, the minimum feature subset has a similar impact on developer recommendation; and SVM is the best classification algorithm for both sets.

Discussion
Although the main goal of this study is to identify developer features that have a significant impact on the change of TPL, a byproduct of our work is that the feature sets we obtained can contribute to better performance on developer recommendation, which is very useful for automatic bug triage. Some interesting findings and implications for software engineering research and practice are listed below.
According to the results shown in Tables 7 and 8, the classifiers trained on developer features outperform those trained on the textual information of bug reports. This finding embodies people-centred ideas in the process of software development and maintenance. Although the textual information of bug reports has been widely used in previous studies on automatic bug triage, such textual information has two disadvantages that deserve attention. First, the processing cost for large amounts of text is much higher than that of developer features; for example, for simple classification algorithms like DT and NB, it takes about 170 times longer to train a classifier using group b2. Second, the diversity of natural language processing (NLP) techniques may lead to low repeatability of experimental results. For example, if different dimensionality reduction techniques such as singular value decomposition or word embedding are used, the prediction results for groups b1 and b2 are likely to change.
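The sensitivity to NLP choices noted above can be seen in even a toy textual pipeline. The sketch below shows one common option, TF-IDF vectorisation followed by truncated SVD; substituting a different reduction technique (e.g. word embeddings) would yield different features and hence different predictions. The bug summaries are invented examples, not data from the study.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Invented bug-report summaries standing in for the textual features of b1/b2
summaries = [
    "NullPointerException when opening editor",
    "UI freezes while indexing workspace",
    "Crash on startup after update",
    "Editor rendering glitch on scroll",
]

# TF-IDF turns each summary into a sparse term-weight vector
tfidf = TfidfVectorizer(stop_words="english").fit_transform(summaries)

# Truncated SVD compresses the sparse vectors into 2 dense dimensions
reduced = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)
print(reduced.shape)  # one low-dimensional vector per bug report
```

Because the reduced representation depends on the chosen technique and its parameters, two studies using the "same" textual features can legitimately obtain different classifier performance.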
The implications of the minimum feature subset we obtained for empirical software engineering are twofold. On the one hand, to reduce TPL and improve the accuracy of developer recommendation, the most important factor is a developer's working theme. That is, a recommender should give priority to the similarity between the topic(s) of a given bug report and the working theme of a candidate developer; a strong semantic similarity indicates a high probability that the developer will fix the bug report. On the other hand, a developer's workspace (including product and component) is also an important factor to consider: generally speaking, developers tend to fix bug reports from source locations they are familiar with. This finding is consistent with a few previous studies (see Table 1) that trained classifiers for developer recommendation using these two features. Unlike previous works based on tossing graphs [4,14,15], our results show that the tossing probability between developers is not a significant factor affecting the length of tossing paths, mainly because of a weak (positive) correlation between them.

In fact, several reassignment rules have been proposed; one of the simplest is to find a developer with high degree (or in-degree) centrality in the corresponding TG [27,29,34]. Our results, however, indicate that developers with high degree centrality are consistently present on long tossing paths, especially in Eclipse, and in a statistical sense this developer factor is more likely to increase TPL. Influential developers usually have many acquaintances, and the effect of an inappropriate reassignment would be magnified by them, leading to cascading errors [44]. Therefore, reassigning a bug report to an influential developer should be done with caution unless that developer can fix it or find the true fixer within a few hops on all possible tossing paths.
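The degree-centrality rule discussed above can be illustrated with a small sketch using networkx. The developer names and tossing edges are invented; the point is simply that the developer who collects and forwards the most tosses has the highest degree centrality in the TG.

```python
import networkx as nx

# Directed tossing graph: an edge u -> v means a bug was tossed from u to v
tg = nx.DiGraph([
    ("alice", "bob"), ("carol", "bob"), ("dave", "bob"),  # tosses into bob
    ("bob", "erin"), ("bob", "frank"),                    # tosses out of bob
])

# Degree centrality: (in-degree + out-degree) / (n - 1) for each node
centrality = nx.degree_centrality(tg)
most_central = max(centrality, key=centrality.get)
print(most_central)  # the hub developer sitting on most tossing paths
```

A triage rule that always forwards bugs to such a hub concentrates traffic on one developer, which is exactly the mechanism by which, as argued above, an inappropriate reassignment can lengthen tossing paths.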
Designing intelligent recommendation algorithms that take influential developers into account will be part of our future research.

Threats to validity
Although our work obtains some interesting and useful results, there are several threats to the validity of our work that must be explained.
Construct validity: To identify developer factors that affect TPL, we consider a total of 16 commonly-used features in this study; in fact, many other features have also been used in automatic bug triage. Our work excludes this potentially relevant feature information, possibly defining the experimental outcome too narrowly.
Internal validity: There are two main threats to the internal validity of our work. First, the choice of NLP techniques for feature-word extraction and the parameter tuning of the classification algorithms and LDA may change our results; for example, as stated earlier, the parameters of the six classification algorithms were set to the sklearn defaults. Second, unlike previous studies that employed a ten-round incremental framework for experiments [15], we partitioned each of the two data sets of Eclipse and Mozilla into training and test data with different ratios ranging from 1:1 to 9:1; therefore, our results may not be directly comparable with those of previous studies on Eclipse and Mozilla.

Statistical conclusion validity: The biggest concern with the statistical conclusion validity of our work is low statistical power. Since the sample size of our hypothesis tests for RQ4 is only 21, statistical power may be low when effect sizes are small.
External validity: The results obtained from Eclipse and Mozilla can provide useful suggestions on feature selection for recommending appropriate developers to fix bug reports and for improving bug-fixing efficiency by reducing TPL. However, the generality of our results for other open-source and closed-source software projects remains unknown. In addition, we utilised only six common classification algorithms to train developer recommenders, without additional optimisation.

Conclusion
To investigate why a few bug reports are repeatedly tossed, in this study we focus on developer factors that affect the length of tossing paths in two popular open-source software projects. We consider four types of developer factors (16 features in total), i.e. network centrality, developer workspace, developer expertise, and transmissibility of developers. According to the experimental results, working theme, product, component, and degree centrality are the four key factors affecting the change of TPL. Moreover, we identify a minimum feature subset (three core features, i.e. working theme, product, and component) that contributes largely to the change of TPL.
We then train developer recommenders using different feature groups and ML classification algorithms to predict suitable developers for a given bug report, and the empirical results show that the feature groups obtained in this study contribute to better performance on the three evaluation metrics. More specifically, the three core features have a similar effect on developer recommendation compared with the four key features.
Our future work includes two aspects: first, we will validate the generality of the feature groups obtained in this study on developer recommendation in other software bug repositories; second, we will train better developer recommenders based on a hybrid set of developer factors and textual information of bug reports using deep artificial neural networks.