Software Fault-Proneness Analysis based on Composite Developer-Module Networks

Existing software fault-proneness analysis and prediction models can be categorized into software metrics and visualized approaches. However, studies based on software metrics rely solely on quantified data, while visualized approaches fail to reflect the human aspect, which has been shown to be a main cause of many failures in various domains. In this paper, we propose a new analysis model with an improved software network called the Composite Developer-Module Network. The network is composed of links from developers to software modules and from software modules to other modules, so that it reflects the characteristics of, and interactions between, developers. After the networks of the research objects are built, several different sub-graphs are derived from them. By analyzing the structures of the sub-graphs that are more fault-prone, we can determine whether the software development is in a bad structure and thus predict fault-proneness. Our research shows not only that the different sub-structures are a factor in fault-proneness, but also that the complexity of a sub-structure can affect the production of bugs.


I. INTRODUCTION
Program failure has always been a major concern when it comes to software development [131], [134]. Moreover, as software scale and complexity continuously increase, the cost of software failure escalates dramatically. According to a study by the National Institute of Standards and Technology, the annual cost of software bugs in the U.S. reached $59.5 billion in 2002 [85]. To achieve more cost-effective management, some software revalidation and testing techniques, such as fault localization, have become an indispensable part of the software development and evolution process [12]. Although fault localization techniques are becoming more and more complete, they are still expensive to operate with a high degree of precision. Therefore, applying fault-proneness prediction beforehand can be a sensible step toward ensuring software quality [18], [30], [48], [51], [62], [68], [91], [94], [128].
In this paper, we propose a new approach to building a fault-proneness analysis model that integrates human aspects by using complex network techniques. Recent studies [34], [141] have shown that certain structural patterns in the software network are highly correlated with bugs or unexpected system performance. On top of that, our research aims to integrate the pattern of developers' activity into the software network. By doing so, we obtain a more comprehensive view of deeper and indirect developer-module dependencies, which can further help to analyze the relationship between developers' behaviors and the bugs they introduce.
The rest of the paper is structured as follows. Section II introduces the basic idea of software fault-proneness prediction using various approaches, how the developers' aspects can be included in the software fault-proneness prediction model, and the implementation of complex networks on software systems. The proposed Composite Developer-Module Networks and their attributes are presented in Section III. Our case studies and experiment results are given in Section IV. Section V discusses some threats to the validity of our study, and some related work is introduced in Section VI. Finally, the conclusions of our research and future work are listed in Section VII.

II. BACKGROUND
In this section, we provide a brief introduction to software metrics and how they are utilized in building software fault-proneness prediction models. Additionally, we introduce the idea of integrating human aspects into the prediction model and how the idea can be implemented by software network analysis.
There are two major categories of software metrics, namely, product metrics and process metrics [140]. Product metrics measure aspects of the software per se. They can be categorized into generic metrics and special metrics, and generic metrics consist of two categories: static metrics, which can be calculated from attributes of the source code, and dynamic metrics, which can only be collected at runtime. To name a few, static metrics include Lines of Code [88], McCabe Complexity [71], and C.K. metrics [24], and dynamic metrics include dynamic coupling metrics [3], runtime cohesion metrics [76], and code coverage-based metrics [23], [77], [135]. Both static and dynamic metrics can be further divided into internal metrics, which examine the internal structure of a software module, and external metrics, which focus on the interactions between separate software modules [132], [133]. Special metrics such as [2] measure attributes that are not directly related to the target software. Various commercial and free tools that collect these metrics are available, such as SciTools Understand [108]. Process metrics are used to analyze the process of software development and evolution; some examples are code churn [81], [112], social network analysis-based metrics [9], [13]-[15], [20], [122], developer-based metrics [95], [127], [137], and organizational metrics [8], [79], [83]. FIGURE 1 shows the hierarchy of the software metrics.
In this paper, although we do not use a single measure of metrics on fault-proneness prediction, we still employ the concept of static product metrics (module network), process metrics (developer network), and the idea of the external structure of software module to build our model.

B. METHODS TO BUILD FAULT-PRONENESS PREDICTION MODELS
Once we have software metrics as indicators that reflect the characteristics of software, a prediction model can be built to predict fault-proneness. For fault-proneness prediction to be effective, many factors need to be considered during model construction [59].
Regression analysis is a straightforward method that measures the dependency of variables [17]. A regression model estimates the value of a dependent variable through a regression equation that takes the values of several independent variables as inputs. Regarding software fault-proneness prediction, many forms of regression can be applied as prediction models. Take linear regression models as an example: the inputs can be a set of software metrics, and the output is the tendency of fault-proneness [6], [38], [41], [64], [121]. Alternatively, logistic regression, another commonly used regression method, lets the dependent variable take one of two values and thereby divides the software modules into fault-prone and non-fault-prone classes [4], [29], [31], [58], [78], [114], [115], [143], [144].
Machine learning is a comparatively advanced approach to software fault-proneness prediction that is based on various learning algorithms. Given a training set of input values (software metrics), the desired training output (an index of fault-proneness) is approached by iterative learning procedures, from which the fault-proneness prediction models are built. Below are some commonly used techniques with brief introductions:
• Artificial neural networks (ANNs) [7], [52], [120], [142]: use a pre-defined network topology to run the learning algorithm. Different network designs can considerably affect the training results.
• Decision tree-based modeling [54]-[56], [100], [112]: recursively partitions the data into smaller subsets. The technique can be easier to interpret on larger data sets.
• Naïve Bayesian classifier [19], [75], [123] and Bayesian Belief Network (BBN) [1], [28], [93], [96]: both are based on Bayes' theorem. The Naïve Bayesian classifier assumes that the values of the attributes are independent. If dependencies exist between attributes, a BBN can be used instead.
• Support vector machine [32], [39], [53], [138]: maps the training tuples into a higher dimension and searches for the linear optimal separating hyperplane to separate new tuples into different classes.
• Discriminant analysis [35], [50], [92], [119]: a discriminant function is built from the training dataset to assign data points to either the desired (fault-prone) or the non-desired (non-fault-prone) group.

C. DEVELOPER ASPECTS
Since software is a purely cognitive product of human developers, many of its flaws are caused by erroneous behaviors of the humans involved [44]-[47], [66], [126], [127]. Therefore, several studies have tried to clarify the association between human aspects and software bugs. One approach isolates the consistent characteristics of an individual and evaluates their behavior, which can be used to predict their future performance [11], [43], [60], [70], [95]. Another approach tries to classify specific working environments and activities during the development process that could affect the performance of developers [9], [33], [83], [113]. Since the latter approach focuses more on human activities and interaction than on individual characteristics, it can be useful when there is insufficient historical data on individual developers, and it is closer to our approach.

D. SOFTWARE NETWORK
A modern software system is composed of various elements, such as functions and variables in software modules, and commit records in development. These elements are all dependent on each other in some way. Therefore, we can construct an abstract model in a complex network if we regard the elements as nodes and relationships as edges.
Such a network is called a software network. Studies have shown that complex network theory is applicable in the software reliability domain because software networks exhibit small-world effects and scale-free properties [82], [124]. Also, the properties of high cohesion and low coupling pursued in the software development process give software networks a modular structure [117]. As an application in fault-proneness prediction, many techniques turn topological characteristics and properties, such as complexity and evolution rules, into quantitative indicators. Among them, network analysis-based metrics are derived from network features such as closeness or betweenness centrality [75], [98]. Furthermore, the dependencies between software modules can also work as a good gauge of fault-proneness in the system [87], [122], [145].
Although a complex network is a practical and relatively intuitive technique for describing multifaceted relationships, the entire structure can grow enormously when the target model is built from a large number of real-world objects, which holds researchers back from investigating the network as a whole. To tackle the issue, researchers must identify specific types of structures that carry important local properties of the network. The network motif [110] is one such structure: a recurrent pattern that is statistically significant. Network motifs are broadly used in many domains such as biochemistry, neurobiology, and engineering. Through these practices, various searching algorithms for certain types of network motifs, and their applications, have been applied in the software domain as well [67], [69], [141].

E. NETWORK WITH HUMAN ASPECTS
Although the software network can be helpful when it comes to fault-proneness prediction, the influence of human aspects is often overlooked (as stated in Section II-C). To further improve the accuracy of prediction results, several approaches try to incorporate human aspects into the network prediction model. One simple tactic is to express developer contributions with a Contribution Network (also called a Developer-Module Network in some studies) [10], [16], [22], [97]. In the contribution network, a contribution edge always refers to a commit on a module made by a developer. The weight of an edge is defined as the number of commits made by a developer to the specific module. FIGURE 2 depicts an example of the contribution network.
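The edge-weight definition above can be illustrated with a minimal sketch. The commit log, developer names, and module names below are hypothetical and serve only to show how weights are counted; they are not data from the projects studied in this paper.

```python
from collections import Counter

# Hypothetical commit log: one (developer, module) pair per commit.
commits = [
    ("alice", "parser.c"),
    ("alice", "parser.c"),
    ("bob", "parser.c"),
    ("bob", "net.c"),
]


def build_contribution_network(commit_log):
    """Return {(developer, module): weight}, where weight is the commit count."""
    return dict(Counter(commit_log))


contribution_network = build_contribution_network(commits)
for (developer, module), weight in sorted(contribution_network.items()):
    print(f"{developer} -> {module} (weight {weight})")
```

Each distinct developer-module pair becomes one contribution edge, and repeated commits only raise its weight.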

FIGURE 4. A Developer Collaboration Network with four developers
suggest that the module dependency should be used along with the contribution history of developers in fault-proneness prediction. The resulting network is called a Socio-Technical Network, which is a hybrid of a developer contribution network and a network of module dependency. There are two types of edges: the contribution edges are comparable to those in the contribution network but are now bidirectional; the module edges always have a weight of one. FIGURE 3 depicts an example of a Socio-Technical Network. Another idea is to describe developers' collaborations by assigning edges between developers that have worked on common modules. The network is called a Developer Collaboration Network and involves developers solely [66], [129]. In the prediction model, the metrics can be generated by applying Social Network Analysis since the edges in the network reflect social relations. FIGURE 4 depicts an example of a Developer Collaboration Network.
A Tri-Relation Network (TRN) is the ultimate form of network that involves the human aspect [65]. The network combines three different kinds of relations, i.e., developer contribution, module dependency, and developer collaboration. With those in hand, a more comprehensive view of activities in software development can be described. Moreover, Social Network Analysis can be applied to build a prediction model.
Although all these approaches established the foundation of human-aspect fault-proneness prediction models, they fail to illustrate developers' deeper and indirect dependencies. Take the network in FIGURE 6 as an example. The developer relationship q is evident because both developers a and b have contributions to Module A. However, since there is a function call from Module A to Module B, the dependency r between developer a and developer c, who has some contributions to Module B, should also be considered, yet it is often ignored in previous studies. Our approach aims to cover such indirect dependencies between developers and hence portray a more comprehensive picture of developer relationships and activities.
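The difference between the direct relationship q and the indirect dependency r can be sketched as follows. The data mirror the example (developers a and b contribute to Module A, developer c contributes to Module B, and Module A calls Module B); the function name `developer_relations` and the choice to treat a call in either direction as a link are assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical data mirroring the example in the text.
contributions = {"a": {"A"}, "b": {"A"}, "c": {"B"}}
calls = {("A", "B")}  # (caller, callee)


def developer_relations(contributions, calls):
    """Split developer pairs into direct (shared module) and indirect
    (their modules are linked by a call) relationships."""
    direct, indirect = set(), set()
    for d1, d2 in combinations(sorted(contributions), 2):
        modules1, modules2 = contributions[d1], contributions[d2]
        if modules1 & modules2:
            direct.add((d1, d2))
        # Assumption: a call in either direction links the two developers.
        if any((x, y) in calls or (y, x) in calls
               for x in modules1 for y in modules2):
            indirect.add((d1, d2))
    return direct, indirect


direct, indirect = developer_relations(contributions, calls)
print("direct:", sorted(direct))
print("indirect:", sorted(indirect))
```

Under these assumptions the pair (a, b) is found through the shared module, while (a, c) is found only through the call from Module A to Module B. The same reasoning also links b and c.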

III. COMPOSITE DEVELOPER-MODULE NETWORKS
In this section, we explain the elements of the Composite Developer-Module Networks used in our research. The motivation of the Composite Developer-Module Networks is to include deeper and indirect dependencies between developers and software modules, and thus to support a more comprehensive analysis.
The notations we use in the following sections are listed in TABLE 1.

TABLE 1.
Notation    Description
e(F,G)      An edge representing a function call from F to G
g(D,F)      An edge representing that a developer D has contributions on F
name(X)     Name label of object X

A. DEFINITION OF COMPOSITE DEVELOPER-MODULE NETWORK
A Composite Developer-Module Network of a specific release version R is defined as N(R) = (V(R), E(R)). V(R) is the set of all vertices, including all developers before release version R, denoted as Vd(R), and all functions in the software modules of release version R, denoted as Vf(R). This implies that V(R) = Vd(R) ∪ Vf(R). E(R) represents all edges in the network, including Ec(R), the links from developers to the functions they contribute to in a software module before release version R (FIGURE 7), and Ef(R), the links between every two functions in the software in release version R (FIGURE 8). This implies that E(R) = Ec(R) ∪ Ef(R).

Corollary 1. A Composite Developer-Module Network is a connected graph.
If there exists a vertex v ∈ V(R), then (1) if v ∈ Vd(R), v denotes a developer who has at least one contribution to a software module; (2) if v ∈ Vf(R), v denotes a function that has either at least one contributor or a link with another function. Thus, every vertex v in V(R) is connected.
For a vertex vF ∈ Vf(R) that denotes a function F in the software modules, its attributes are defined as vF = (name(F), path(M), B), where name(F) denotes the name label of the function F, path(M) denotes the path of the software module where the function is located, and B ∈ {true, false} indicates whether the function contains a bug. A function that contains a bug is drawn in red. For example, for a function "atoi" in module "stdlib.h" that contains a bug, the vertex of the function is noted as vatoi = (atoi, stdlib.h, true).
An edge e(F,G) ∈ Ef(R) is defined as a directed edge from F to G when a function F has a function call on function G. The attributes of e(F,G) are defined as e(F,G) = (name(F), name(G)), where name(F) denotes the name label of function F and name(G) denotes the name label of function G. Because the edge means "F has a function call on G," the direction is F→G. FIGURE 10 shows the example of e(main,atoi).

Corollary 2. The link between two vertices in a Module Network is weightless.
Multiple function calls may exist between two specific functions, i.e., a function F may call function G more than once in the source code. In that case, we simply define a single link without weight to eliminate confusion.
An edge g(D,F) ∈ Ec(R) is defined as a directed edge from D to F when a developer D has a contribution on function F. The attributes of g(D,F) are defined as g(D,F) = (name(D), name(F), T, I), where name(D) denotes the name label of the developer D, and name(F) denotes the name label of function F. T denotes the commit time of the contribution (the last time is selected if D has multiple commits), and I ∈ {true, false} represents whether this commit introduced a bug. Because the edge is defined as "D has a contribution on F," the direction is D→F. For example, for a commit made by a developer "a" on function "atoi", the edge is noted as g(a,atoi) = (a, atoi, T, I), where T and I take the values recorded for that commit.
With the combination of the Developer Network and the Module Network, we can build a comprehensive view of a network that includes both relationships: (1) function-function dependency and (2) developer-function dependency. FIGURE 14 shows an example of a Composite Developer-Module Network.

Corollary 6. A Composite Developer-Module Network is the union of the Developer Network and the Module Network.
The Developer Network is defined as Nd(R) = (V(R), Ec(R)), and the Module Network as Nm(R) = (Vf(R), Ef(R)). Since Vf(R) ⊂ V(R) and E(R) = Ec(R) ∪ Ef(R), it follows that Nd(R) ∪ Nm(R) = (V(R), E(R)) = N(R).
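The definitions in this subsection can be summarized in a small data-structure sketch. The class names, field names, file path, and timestamps below are hypothetical. Two further assumptions are made: commit times are ISO 8601 strings, so string order matches time order, and when a developer has several commits on the same function, the bug-introduction flag of the latest commit is the one kept.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FunctionVertex:
    """Function vertex carrying name(F), path(M), and the bug flag B."""
    name: str
    path: str
    has_bug: bool


@dataclass(frozen=True)
class Contribution:
    """Attributes T and I of a developer-to-function edge."""
    commit_time: str       # assumed ISO 8601, so string order is time order
    introduced_bug: bool


@dataclass
class CompositeNetwork:
    developers: set = field(default_factory=set)        # Vd(R)
    functions: dict = field(default_factory=dict)       # Vf(R), keyed by name
    call_edges: set = field(default_factory=set)        # Ef(R): (caller, callee)
    contributions: dict = field(default_factory=dict)   # Ec(R): (dev, func) -> Contribution

    def add_function(self, name, path, has_bug=False):
        self.functions[name] = FunctionVertex(name, path, has_bug)

    def add_call(self, caller, callee):
        # A set keeps the link weightless: repeated calls collapse into one edge.
        self.call_edges.add((caller, callee))

    def add_contribution(self, developer, function, commit_time, introduced_bug):
        self.developers.add(developer)
        key = (developer, function)
        current = self.contributions.get(key)
        # Keep only the latest commit for each developer-function pair.
        if current is None or commit_time > current.commit_time:
            self.contributions[key] = Contribution(commit_time, introduced_bug)


network = CompositeNetwork()
network.add_function("main", "src/main.c")
network.add_function("atoi", "stdlib.h", has_bug=True)
network.add_call("main", "atoi")
network.add_call("main", "atoi")  # second call site, still one edge
network.add_contribution("a", "atoi", "2020-01-05T10:00:00", False)
network.add_contribution("a", "atoi", "2020-03-01T09:30:00", True)

print(len(network.call_edges))
print(network.contributions[("a", "atoi")])
```

Storing call edges in a set is one simple way to obtain weightless links, and keying contribution edges by the developer-function pair keeps exactly one edge per pair.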

B. SUB-STRUCTURE OF COMPOSITE DEVELOPER-MODULE NETWORK
A sub-structure Nsub(R) of the Composite Developer-Module Network is a part of the network that can be sliced into an independent network containing specific vertices Vsub(R) ⊂ V(R) and the edges Esub(R) ⊂ E(R) that connect all those vertices. Each category of patterns is then examined for a relationship with software bugs. FIGURE 15 shows an example of such sub-structure retrieval.
In our research, we analyze some specific sub-structures of the network and identify the most fault-prone patterns by calculating the relationship between the sub-structures and the bug-introduction logs derived from several real-world projects.
There are many possible sub-structures that can be extracted from a Composite Developer-Module Network. To be useful in our study, a sub-structure should obey the following rules:
1) It consists of at least one developer vertex u'D ∈ Vd-sub(R) and at least one function vertex v'F ∈ Vf-sub(R).
2) Every developer vertex in Vd-sub(R) should be connected to at least one function vertex in Vf-sub(R) with an edge in E(R), and vice versa.
3) Every edge in E(R) whose endpoints are both in Vsub(R) is in Esub(R), i.e., Esub(R) ⊂ E(R) and Esub(R) = {e' | e' ∈ Vsub(R) × Vsub(R) and e' ∈ E(R)}.
4) The sub-structure must be a connected graph.
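The four rules can be checked mechanically for any candidate vertex set. The sketch below is one possible reading, not the procedure used in the study. It assumes that rule 2 refers to contribution edges, that connectivity in rule 4 ignores edge direction, and that developer and function names do not collide. All identifiers are hypothetical.

```python
def induced_edges(vertices, edges):
    """Rule 3: keep every edge whose two endpoints are both selected."""
    return {(u, v) for (u, v) in edges if u in vertices and v in vertices}


def is_valid_substructure(vertices, developers, contribution_edges, call_edges):
    sub_devs = vertices & developers
    sub_funcs = vertices - developers
    # Rule 1: at least one developer vertex and one function vertex.
    if not sub_devs or not sub_funcs:
        return False
    sub_contrib = induced_edges(vertices, contribution_edges)
    sub_calls = induced_edges(vertices, call_edges)
    # Rule 2: every developer has a function, and every function a developer.
    if {d for d, _ in sub_contrib} != sub_devs:
        return False
    if {f for _, f in sub_contrib} != sub_funcs:
        return False
    # Rule 4: the sub-structure is connected (edge direction ignored).
    adjacency = {v: set() for v in vertices}
    for u, v in sub_contrib | sub_calls:
        adjacency[u].add(v)
        adjacency[v].add(u)
    start = next(iter(vertices))
    seen, stack = {start}, [start]
    while stack:
        for neighbour in adjacency[stack.pop()] - seen:
            seen.add(neighbour)
            stack.append(neighbour)
    return seen == vertices


developers = {"a", "b"}
contribution_edges = {("a", "main"), ("b", "main"), ("b", "atoi")}
call_edges = {("main", "atoi")}

two_devs_one_function = is_valid_substructure(
    {"a", "b", "main"}, developers, contribution_edges, call_edges)
uncovered_function = is_valid_substructure(
    {"a", "main", "atoi"}, developers, contribution_edges, call_edges)
print(two_devs_one_function, uncovered_function)
```

In this toy network, two developers sharing one function form a valid sub-structure, whereas the second candidate is rejected because no selected developer contributes to atoi.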
With respect to the above rules, we can identify two simple forms of sub-structure consisting of three vertices:

A. EXPERIMENT METHODOLOGY
In this section, we discuss the research methodology in more detail. The purpose of our study is to answer the following question:
• R (Are different Sub-Structures in the Composite Developer-Module Network (CDMN) a factor of fault-proneness?): Does the structure of developers and modules in a sub-graph of a software network affect the introduction of bugs in a software project?
The answer to the question will determine whether our proposed Composite Developer-Module Network is applicable in building a software fault-proneness analysis model and thus can be used in a prediction model for software fault-proneness in the future.
We use four open-source projects written in C/C++, namely gedit [37], Nagios Core [84], NGINX [86], and redis [105], in our study. gedit is the default text editor in the GNOME desktop environment and has a graphical interface. Nagios Core is a monitoring system for networks and infrastructure that has alert features. NGINX is a web server that uses an asynchronous event-driven approach to handle requests. redis is an in-memory database system implementing distributed key-value storage.
We divided the networks into several categories of sub-structures in the study. The number of each sub-structure and the total faulty commits within the structure are collected for all the releases of the target software project. In addition, the bugs that exist in each category of sub-structure (i.e., Total Bugs per Structure) are calculated. Bug numbers corresponding to the number of developers, functions, and points (developers and functions) are also calculated to evaluate the impact of these variables.
To describe each Sub-Structure in a text format, we use a textual label such as 12-4-D2(4)-F2(1). All the categories of Sub-Structures that we used in the study, and that will be used in the future, are listed in the Appendix. For this study, we chose 13 of the categories, namely those with a sufficient amount of data.

B. RESULTS
This section will present the collected data and analysis from the four projects in our study. For the integrity of our analysis, we leave the label on the tables, though they will not be considered in the data analysis.
In the following data sets, six metrics are calculated to emphasize different perspectives on each structure:
• Total Bugs per Structure: the total number of bug-introducing commits per structure over all the releases of the whole software project.
TABLE 3 shows the data of each Sub-Structure we found in the gedit project across 314 release versions. Generally, fault-proneness grows as the complexity of the Sub-Structure increases, except for the most complex one, which may be due to an insufficient amount of data. 12-4-D2(4)-F2(1) is the most fault-prone Sub-Structure in this case. The trend can also be seen in FIGURE 24. TABLE 4 shows the data of Nagios Core. Likewise, the more complex the structure, the more fault-prone it is, especially the 4-D-F3 group, with more than one bug in each structure. Another interesting observation is that the more functions one developer, or a group of developers, works on, the higher the chance that a bug exists. However, the number of developers who work on the same set of functions does not have much effect on the existence of bugs. This may be due to personal workload, which can be considered as one of our future study topics.

C. DATA ANALYSIS
To answer our research question:
• R (Are different Sub-Structures in the CDMN a factor of fault-proneness?)
we conducted an analysis of variance (ANOVA) on the average bug-introducing commits per structure to determine whether the result depends on the Sub-Structure category and is not random, i.e., whether the difference in Sub-Structure can affect fault-proneness. The null hypothesis H0 and alternative hypothesis H1 are:
• H0: All the statistical results of the average bug introduction are identical.
• H1: Not all the statistical results of the average bug introduction are identical.
The results of the test show that, for all four projects, Fstat (the F value calculated from the data) is far greater than Fcritical (the F value determined by the degrees of freedom at significance level α = 0.05). Therefore, we reject H0 and conclude that Sub-Structures in the CDMN are a factor of fault-proneness.
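As a reminder of how Fstat is obtained in a one-way ANOVA, the following sketch computes it from scratch. The three groups are invented numbers standing in for per-category bug averages and are not taken from the tables of this study. Fcritical is not computed here and would be read from an F distribution table for the returned degrees of freedom.

```python
def one_way_anova_f(groups):
    """Return (F statistic, df between groups, df within groups)."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Illustrative values only: one list per hypothetical Sub-Structure category.
samples = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_between, df_within = one_way_anova_f(samples)
print(f_stat, df_between, df_within)
```

H0 is rejected when the computed value exceeds Fcritical for (df_between, df_within) at the chosen significance level.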

V. THREATS TO VALIDITY
In this section, we will discuss some threats that may affect the validity of the study. First, all the programs we analyzed are based on C language. Therefore, there is a chance that our results will not be as effective in projects developed in other programming languages.

VI. RELATED WORK
Bird et al. [8], [10] mention that low ownership of certain software modules, i.e., modules that have too many minor contributors, can result in a high likelihood of software defects. The study suggests that developers should communicate and work carefully with experienced contributors regarding the desired modification. Ell [33] and Simpson [113] use a Failure Index (F.I.) as an indicator of the possibility of some specific developer pairs making a mistake, which is a primary instance of applying developer activities to software fault-proneness prediction. Ohira et al. [90] apply social network analysis to developers across different projects, and the results indicate that expertise on particular subjects can affect software quality in the development process.
Valverde et al. [124], [125] introduce the concept of the network motif into system software. They find that frequent network motifs can be a consequence of network heterogeneity. On top of that, they propose a duplication-divergence model that can explain the motifs that appear in software evolution. Qian et al. [99] discovered that software networks can be split into three clusters that match renowned super-families of network types. However, since the meaning of network motifs in software networks is still unclear, more studies are needed to support the theory as well as its implementation in software fault-proneness prediction models.
Nagappan et al. [83] propose eight metrics that aim to measure the complexity of software development from an organizational point of view, e.g., the ratio of engineers who left the organization that has modified the codes in some software module. These metrics have proven to be helpful in fault-proneness prediction. In [97], a Developer-Module Network, which only includes the contribution relationship of developers and software modules in their definition, is proposed to build a fault-proneness prediction model. In [145], they further introduce the model using Social Network Analysis on the dependency graph of developers. These studies show that the centrality regarding the developers' contribution has a high correlation to the failure of the software. On top of that, the technique is further extended by operating it on socio-technical networks [9].
Cataldo et al. [21], [22] propose a socio-technical congruence framework that portrays the coordination patterns of developers. The studies suggest that the resolution time of modification requests is drastically reduced for specific coordination patterns. Thus, the impact of working congruence on productivity in the process of software development can be significant. Their studies also indicate that logical dependency is related to product dependency and can have more influence on development quality compared to data dependencies.

VII. CONCLUSION AND FUTURE WORK
In this paper, we propose Composite Developer-Module Networks that feature both software module dependency and developer-module dependency. To build the network, we first represent all functions of modules and all developers involved in the software project as vertices. Second, two different types of dependencies are defined as edges in the network: software module dependencies, which are determined by the function calls between modules, and developer-module dependencies, which are determined by the functions in the software modules that each developer worked on. After the network is built, we evaluate bug introduction against the sub-structures derived from the Composite Developer-Module Network and find the most fault-prone one, which suggests the most fault-prone pattern in the software development process. We evaluate four open-source projects, gedit, Nagios Core, NGINX, and redis, by constructing a Composite Developer-Module Network for every release version of each. Our analysis results show that the distinct Sub-Structures in the Composite Developer-Module Networks are a factor in fault-proneness, and that more complex structures generally cause more faults.
For our future work, we plan to further evaluate our research by applying the method to more software projects with a variety of characteristics. More evidence of erroneous patterns can be found through a more comprehensive data set, which would help in constructing an accurate and effective fault-proneness prediction model. Furthermore, we plan to apply machine learning techniques to our method to discover more potentially vulnerable sub-structures automatically. Finally, as the goal of our work, we plan to integrate the technique into a practical software development environment and hope to eventually benefit the software industry at large.

APPENDIX
A table of Sub-Structures analyzed in the study is attached at the end of this paper. VOLUME XX, XXXX