Characterizing Software Stability via Change Propagation Simulation

Software stability means the resistance to the amplification of changes in software. It has become one of the most important attributes that affect maintenance cost. To control the maintenance cost, many approaches have been proposed to measure software stability. However, it is still a very difficult task to evaluate software stability, especially when software becomes very large and complex. In this paper, we propose to characterize software stability via change propagation simulation. First, we propose a class coupling network (CCN) to model software structure at the class level. Then, we analyze the change propagation process in the CCN by simulation, and by doing so, we develop a novel metric, SS (software stability), to measure software stability. Our SS metric is validated theoretically using the widely accepted Weyuker's properties and empirically using a set of open source Java software systems. The theoretical results show that our SS metric satisfies most of Weyuker's properties with only two exceptions, and the empirical results show that our metric is an effective indicator for software quality improvement and class importance. Empirical results also show that our approach can be applied to large software systems.


Introduction
Software maintenance is widely regarded as the most costly and difficult phase in the life cycle of a piece of software [1]. It mainly consists of four phases, i.e., analyzing software, generating modification proposals, ripple effect analysis, and testing the modified software [2,3]. It is estimated that software maintenance costs account for more than 60% of the total life cycle cost, and the third phase (ripple effect analysis) usually accounts for about 40% of the software maintenance cost [4]. Worse still, this maintenance cost shows no sign of declining. Such a high cost makes its reduction a pressing problem for many practitioners.
Generally, there are two ways to control the maintenance cost [2]. One is to provide effective tools or techniques to help maintenance personnel ease their maintenance tasks and improve their maintenance productivity. The other is to propose meaningful metrics to assess the quality characteristics of software and help maintenance personnel (or developers) identify potential problems in software. In this paper, we resort to the second way (i.e., developing some new metrics) to control the maintenance cost.
As mentioned above, ripple effect analysis is a key activity in software maintenance. Ripple effect can be defined as the phenomenon that changes to one part of software will potentially affect other parts of the software [2,3]. It is reported that software stability, which means the resistance to the amplification of changes in software, is a primary attribute that affects ripple effect [2,3]. Thus, we can control the maintenance cost by proposing meaningful metrics to measure software stability. But it is still a difficult task to measure software stability quantitatively.
Software structure is usually defined as the organization of software elements (i.e., methods/attributes, classes/interfaces, and packages) within a piece of software [5]. It includes a set of software elements and a set of connections between every pair of software elements. With the increase of software complexity, software structure has become one of the most important factors that affect software stability. Changes in one part of software can propagate to other parts of software along the connections between software elements. Thus, we can analyze software stability from the perspective of software structure.
Software elements and the connections between them naturally form a network (or a graph) structure. Thus, software can be analyzed under the framework of complex network theory [6]. Many researchers applied complex network theory to analyze software systems and their dynamics by representing software structure as a software network, where nodes denote software elements and edges (or links) denote connections [6-9]. They found that software networks share some physics-like laws such as "scale-free" and "small-world" [6-9]. This opens up a new path to study existing research problems in software engineering from a complex network perspective. Until now, complex networks have been applied in many different domains of software engineering, such as software refactoring [10], software element ranking [11], and software testing [12]. Such novel interdisciplinary research also provides us with a promising way to study software stability.
In this paper, we aim to characterize software stability via change propagation simulation. To fulfill this task, we propose a class coupling network (CCN) to model software structure at the class level, where nodes are classes (in this work, if not mentioned, the term "class" designates both class and interface from here on), and directed links between nodes are couplings between classes. Each link in the CCN is assigned a weight to denote the probability that a change in one class (the tail end of the link) will propagate to another class (the head end of the link) connecting to it. Then, we analyze the change propagation process in the CCN by simulation, and by doing so, we develop a novel metric, SS (software stability), to measure software stability. Our metric is validated theoretically using the widely accepted Weyuker's properties [13], and empirically using a set of open source Java software systems. The theoretical results show that our metric satisfies most of Weyuker's properties with only two exceptions, and the empirical results show that SS is an effective indicator for software quality improvement and class importance.
The main contributions of this paper can be summarized as follows: (i) We propose a novel metric, SS (software stability), to measure software stability at the class level. Our metric is based on a network representation of software structure and a change propagation simulation in the network.
(ii) Our class coupling network uses weights to denote the probability that a change in one class will propagate to other classes that link to it. Such a weight allows us to consider the influence of probability on the change propagation, which has been neglected in the existing literature.
(iii) We propose a simulation approach to analyze the change propagation process in the CCN. The simulation approach can also be applied to calculate the value of SS.
(iv) The proposed SS metric is validated theoretically using widely accepted evaluation criteria, and empirically using open source Java software systems. The data set and software used in this work are published online to support further replication.
The rest of the paper is organized as follows. Section 2 reviews the related work on software change propagation and stability. Section 3 describes our approach in detail step-by-step. Section 4 contains the theoretical and empirical evaluation of our approach. We conclude the paper and discuss future work in Section 5.

Related Work
Much effort has been made to analyze software change propagation and stability. Kung et al. [14] defined a concept, "class firewall", to denote classes that may be affected by changes in a given class and used it to explore change impact analysis in the class diagram of the software. Li et al. [15] proposed a composite metric, SDI, to characterize software design instability by capturing the evolutionary change in an object-oriented (OO) design. Grosser et al. [16] proposed a case-based reasoning approach for predicting the stability of software items from relevant metric data and results about the stress. Ratiu et al. [17] proposed a metric to measure the stability of classes by characterizing the difference between two successive versions. MacCormack et al. [18] applied design structure matrices (DSMs) to map dependencies between elements of a design and defined a concept, "change cost", to quantify the average influence of an element on the other parts of software. Shaik et al. [19] conducted experiments to show the role and use of change propagation to assess the design quality of software architectures. Liu et al. [20] proposed an unweighted directed software network at the class level to represent classes and their couplings and introduced a metric named "average propagation ratio" to analyze the change propagation process in the software network. Li et al. and Zhang et al. [21,22] proposed a change propagation model and defined a set of change impact metrics. They applied a simulation technique to simplify the calculation of the metrics. Pan [23] proposed a SOS metric to measure the stability of OO software systems at the method/attribute level. Alshayeb et al. [24] identified several factors that affect class stability and then used these factors to characterize class stability. Ampatzoglou et al. [25] conducted a case study to examine the stability of classes that participate in instances/occurrences of GoF design patterns. Badri et al.
[26] presented several impact rules which are based on Java language constructs and further proposed a new change impact analysis model for Java programs. Abuhassan and Alshayeb [27] proposed a suite of software stability metrics for three UML diagrams at the model level, i.e., class, use case, and sequence. Pan and Chai [28] proposed a Node Influence Network (NIN) to represent classes and their couplings and further proposed several metrics to quantify class stability and software stability.
References [15-17, 24] measured design stability (or class stability) by characterizing the difference between two successive versions. However, other existing work quantified design stability by a worst-case analysis of the software change propagation process. They usually assumed that changes in one class definitely affect other classes linking to it directly and indirectly, and they did not take into consideration the frequency of different couplings between classes. Our approach is very different from the existing work. We consider seven kinds of couplings between classes and also take into consideration their frequency. Furthermore, we take a simulation approach to analyze the change propagation process and compute stability-related metrics. By using simulation, we consider the influence of edge weights on change propagation and software stability.

The Approach
This paper proposes to measure software stability from a network perspective (CCN) by using a simulation way. It mainly contains four parts. First, we analyze the source code of software to extract meaningful information. Second, the extracted information will be formally represented by a CCN. Third, we propose a change propagation model and some metrics to analyze the change propagation process in the CCN. Finally, we propose a change propagation algorithm to calculate these metrics. The overall framework of our approach is illustrated in Figure 1, which will be discussed in detail one-by-one in the following subsections.
3.1. Java Software. We focus on the analysis of software coded in Java. The rationale is threefold: First, Java is the most widely used programming language according to the TIOBE Index (available at: https://www.tiobe.com/tiobe-index/), and analyzing Java software can give implications to a wider range of software engineering practitioners. Second, there are a large number of open source software systems that can be used as research subjects, ensuring the replication of research work. Third, our own analysis platform, SNAP [11], can currently only analyze software coded in Java. However, the approach proposed in this paper is general enough that it can be used on software coded in other languages such as C++, C#, and VB.NET, once these systems are represented by CCNs. To represent software structure as software networks at the class level, we should extract the structure information of the software at the class level. Specifically, we perform static analysis of the source code of Java software to extract classes and the couplings between them such as the "inheritance relation", "implements relation", and "parameter relation". In the analysis process, we consider the direction and frequency of each coupling type.
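As an illustration of this extraction step, the toy sketch below (ours, not part of SNAP) pulls inheritance and implements couplings out of a Java class declaration with a regular expression. A real analysis would use a proper parser and would also collect the remaining coupling types and their frequencies; this only shows the kind of (source, target, type) triples the static analysis produces.

```python
# Toy illustration of static coupling extraction (identifiers are ours).
import re

JAVA_SRC = """
class Adoptor extends Base implements Runnable {
    private Zoom myZoom;
}
"""

def extract_couplings(src):
    """Return (source class, target class, coupling type) triples for the
    inheritance (INR) and implements (IMR) relations found in `src`."""
    couplings = []
    decl = re.compile(
        r"class\s+(\w+)(?:\s+extends\s+(\w+))?(?:\s+implements\s+([\w,\s]+))?")
    for m in decl.finditer(src):
        cls, parent, ifaces = m.groups()
        if parent:
            couplings.append((cls, parent, "INR"))  # inheritance relation
        if ifaces:
            for i in ifaces.split(","):
                couplings.append((cls, i.strip(), "IMR"))  # implements relation
    return couplings

print(extract_couplings(JAVA_SRC))
```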

CCN Definition.
As mentioned above, the extracted information will be formally represented by CCN. Thus, in this section, we first give the definition of CCN.
Definition 1 (CCN). CCN is a weighted directed graph that represents the coupling relations between classes/interfaces in a specific software system. Specifically, all classes/interfaces in the software are modeled as graph nodes while the coupling relations are modeled as weighted directed links between nodes. For example, if class u's methods call methods of class v, there exists a directed link ⟨u, v⟩ (u → v) in the graph, where u is the caller class and v is the callee class. Each link is assigned a weight to denote the probability that a change in one class (the tail end of the link) will propagate to another class (the head end of the link) connecting to it. In this work, we do not differentiate classes and interfaces, treating them as the same thing. Formally, CCN can be represented as G = (V, E), where V is the node set and E is the link set.
Java software systems usually have the following seven possible coupling relations [11,29]: inheritance relation (INR), implements relation (IMR), parameter relation (PAR), global variable relation (GVR), method call relation (MCR), local variable relation (LVR), and return type relation (RTR). Thus, the link ⟨u, v⟩ of CCN can also be defined under the above seven circumstances, i.e., if there exists at least one of the seven coupling relations from class u to class v, then there will be a link ⟨u, v⟩. Weights on all the links of CCN can be encoded as a |V| × |V| adjacency matrix Ψ, whose entry Ψ(u, v) signifies the change propagation probability from class u to class v, i.e., Ψ(u, v) = w(u, v) if there is a link ⟨u, v⟩, and Ψ(u, v) = 0 otherwise, where w(u, v) denotes the weight (change propagation probability) on the link ⟨u, v⟩.
Considering the weight of links provides us a more accurate representation of software. But how to assign weights to links is still a problem that should be resolved. As mentioned above, there are total seven types of possible couplings that might exist between classes. In other words, for a specific pair of classes, and , there may exist several different types of couplings at the same time. For example, as the source code segment shown in Figure 2, class "Adoptor" has one global variable "myZoom" with type "Zoom", one method "setZoom(Zoom newZoom)" with one parameter "newZoom" of type "Zoom", and one method "say()" with a call to another method "say()" in class "Zoom".
In Section 3.2, we obtain couplings and their frequencies between every pair of connected classes. Thus, for a specific pair of classes, u and v, the weight on the link u → v can be computed as w(u, v) = (Σ_{t∈T} f_t(u, v)) / w_max, where T = {INR, IMR, PAR, GVR, MCR, LVR, RTR}, f_t(u, v) is the frequency of the coupling of type t from class u to class v, and w_max is the maximum total coupling frequency over all the links, i.e., w_max = max_{⟨u,v⟩∈E} Σ_{t∈T} f_t(u, v). As shown in Figure 2, the total coupling frequency on the link from class "Adoptor" to class "Zoom" is 3, and the maximum total coupling frequency over all the links is also 3. Thus, w(Adoptor, Zoom) = (1 + 1 + 1)/3 = 1. Weights on other links in Figure 2 can be similarly computed.
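As a minimal sketch of this weighting scheme (function and variable names are ours), the following reproduces the Figure 2 computation: each link's total coupling frequency is divided by the maximum total frequency over all links, so weights fall in (0, 1].

```python
# Compute CCN link weights from per-type coupling frequencies.
# "Adoptor" couples to "Zoom" via a global variable (GVR), a parameter (PAR),
# and a method call (MCR), each with frequency 1, as in Figure 2.
freq = {
    ("Adoptor", "Zoom"): {"GVR": 1, "PAR": 1, "MCR": 1},
}

def link_weights(freq):
    """Weight on link u -> v = total coupling frequency of the link divided
    by the maximum total coupling frequency over all links."""
    totals = {edge: sum(by_type.values()) for edge, by_type in freq.items()}
    w_max = max(totals.values())
    return {edge: total / w_max for edge, total in totals.items()}

weights = link_weights(freq)
print(weights[("Adoptor", "Zoom")])  # -> 1.0, i.e., (1 + 1 + 1) / 3
```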

Change Propagation Model.
In this section, we will give our change propagation model in detail, focusing on the atomic changes that we consider, the change propagation process, and a simple example illustrating the change propagation process.

A Catalog of Changes.
To simulate the change propagation process in CCN, we should transform source code edits into a set of atomic changes. As we know, CCN consists of nodes and links. Thus, changes to CCN can be roughly classified into six groups, i.e., adding new nodes, adding new links, deleting old nodes, deleting old links, changing old nodes, and changing old links. Since we analyze the change propagation process based on the static structure of CCN, changes that reconstruct the topological structure of CCN will be ignored. That is, adding new nodes, adding new links, deleting old nodes, and deleting old links are not taken into consideration.
In this work, the atomic changes that we consider for CCN are summarized in Table 1. Obviously, all these atomic changes concern changing old nodes, i.e., changing the modifier or content (methods and attributes) of classes. We assume any source code edit can be decomposed into a set of atomic changes defined in Table 1. Note that AEM, DEM, AA, and DA are considered in this work simply because they do not affect the topological structure of CCN. Further, we assume the original and the changed versions of software systems are both syntactically correct and compilable.

Change Propagation Process.
Software has an intrinsic property to evolve. Fault corrections, performance improvements, and adaption to new environments all will push software to make necessary changes. When one node of CCN changes, the change will propagate to other parts of CCN along links. In this section, we will describe the change propagation process in CCN in detail. But before that, we will first give some hypotheses to simplify the analysis process.

Hypothesis 2.
When change requirements arrive, each node in CCN can be changed. Suppose the initial change probability of node i is p(i), i ∈ [1, |V|] (|V| is the number of nodes in CCN); then it should satisfy Σ_{i=1}^{|V|} p(i) = 1. Generally, the value of p(i) may vary with nodes. Because of the randomness of maintenance activities, we cannot predict when the next maintenance activity will occur [2]. That is, we cannot estimate the specific value of p(i) for each node, since we cannot estimate future change requirements. Thus, in this work, we simply suppose each node has an equal probability of being changed, i.e., p(i) = 1/|V|, i ∈ [1, |V|].

Hypothesis 3. As shown in Table 1, there are eight types of atomic changes. However, we cannot differentiate them from each other from a network perspective. Thus, in this work, we treat all the atomic changes as the same change operation, namely change, regardless of the nature of each atomic change.
Hypothesis 4. Generally, any change requirement may cause multiple nodes to change. That is, the number of initial changed nodes may be more than one. However, due to the random nature of maintenance activities, we cannot predict what each activity will consist of [2]. That is, we cannot estimate the specific number of initial changed nodes. For simplicity, we suppose only one node is changing at any one time, and any change requirement will be fulfilled by a series of such changes. That is, at one time, one of the initial changed nodes will be selected until there are no nodes left.
Hypothesis 5. If one node of CCN changes, the change propagation process begins. The change propagates from the tail node of a link to its head node with a specific probability (the weight on the link), and further propagates to other parts of CCN. For example, any change in node u will propagate to node v along the link u → v with probability w(u, v). The propagation process will not stop until no new nodes are affected by the initial changed node.
Based on the CCN and the hypotheses mentioned above, we can describe the change propagation process from a network perspective. The propagation process of changes in one node of CCN to other parts of CCN is as follows:

Step 1. Input the CCN of software, set all elements of array bChanged[] to false, and prepare a queue cQueue.

Step 2. Randomly select one node u as the initial changed node, set bChanged[u]=true, and push u into cQueue.
Step 3. If there is no element in cQueue, then go to Step 5; otherwise, go to Step 4.
Step 4. Pop the first node u from cQueue and generate a random decimal p ∈ [0, 1]; iterate over the nodes v in CCN one-by-one; if there is a link from u to v (i.e., u → v), p ≤ w(u, v) (w(u, v) is the weight on u → v), and bChanged[v]=false, then push node v into cQueue and set bChanged[v]=true. Go to Step 3.
Step 5. Stop the change propagation process and output array bChanged[].
There might be many "circular couplings" in CCN, which will affect the change propagation process. As shown in Figure 3, "V1" depends on "V2", and "V2" in turn depends on "V1". There exists a circular coupling, V1 → V2 → V1. Thus, in a particular propagation process, changes in "V1" can propagate to "V2", and the change in "V2" will go back to "V1". The propagation may get stuck in an endless loop. Our approach can effectively deal with the "circular couplings" problem in CCN by introducing the "bChanged" array. That is, if changes in "V1" propagate to "V2", the corresponding elements of "V1" and "V2" in the "bChanged" array are both true. Thus, the change in "V2" will not go back to "V1", and the circular coupling is broken.
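The steps above, including the bChanged treatment of circular couplings, can be sketched as follows (identifiers are ours; one random decimal p is drawn per popped node, as in Step 4):

```python
# Minimal sketch of the change propagation process in a CCN.
import random
from collections import deque

def propagate(weights, start, rng=random):
    """weights: dict mapping link (u, v) -> change propagation probability.
    Returns the set of nodes affected by a change in `start`."""
    nodes = {u for u, _ in weights} | {v for _, v in weights}
    b_changed = {v: False for v in nodes}
    b_changed[start] = True           # the initial changed node
    queue = deque([start])
    while queue:                      # Step 3: stop when the queue is empty
        u = queue.popleft()           # Step 4: pop the first node
        p = rng.random()              # random decimal in [0, 1]
        for v in nodes:
            if (u, v) in weights and p <= weights[(u, v)] and not b_changed[v]:
                b_changed[v] = True   # marking here breaks circular couplings
                queue.append(v)
    return {v for v, changed in b_changed.items() if changed}

# Circular coupling V1 -> V2 -> V1 with certain propagation (weight 1.0):
# the change reaches V2 once but does not loop back endlessly.
print(propagate({("V1", "V2"): 1.0, ("V2", "V1"): 1.0}, "V1"))
```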

A Simple Example.
We give a simple example to show the change propagation process of any change in one node of CCN (see Figure 4). The red, blue, and green nodes denote the "changed" nodes, blue and green links denote the propagation route, and black nodes denote the "unchanged" nodes.
The propagation process can be described as follows: (i) The left-most part of Figure 4 is the example CCN, cQueue={ }, and bChanged={false,false,false,false}.
Note that the change propagation route (V1 → V2 → V3) shown above is just one of the possible change propagation routes. The change propagation route we obtain in a specific simulation is affected by the random decimal p. For example, we may get another change propagation route, V1 → {V2, V3} → V4, if the first generated p is 0.4 and the second generated p is 0.3 (both at Step 4).

Metric Definition.
To characterize software stability, we should quantify the change propagation process. Thus, we define a set of metrics as follows.
Definition 6 (number of changed nodes (NCN)). Let S(i) be the set of nodes in CCN that are affected by changes in one randomly selected node in the i-th simulation. Then the number of changed nodes in the i-th simulation is NCN(i) = |S(i)|.
Definition 7 (ratio of changed nodes (RCN)). RCN of the i-th simulation, RCN(i), is defined as the ratio of NCN(i) to the total number of nodes in CCN. It can be calculated as RCN(i) = (NCN(i)/|V|) × 100%, where |V| is the number of nodes in CCN.
Note that, for two independent simulations i and j, their NCNs and RCNs are usually different, i.e., NCN(i) ≠ NCN(j) and RCN(i) ≠ RCN(j). The reason is that we randomly select the initial changed node and use a random decimal p in the simulation (see Section 3.4.2).
Definition 8 (average number of changed nodes (ANCN)). ANCN of n independent simulations is defined as the average number of changed nodes over the n simulations, i.e., ANCN = (1/n) Σ_{i=1}^{n} NCN(i).

Definition 9 (ANCN of a fixed node). ANCN(v) is the ANCN obtained when the randomly selected initial node is fixed to node v. Since NCN varies across independent simulations, we cannot use it to quantify the change propagation process objectively. Thus, we use its average over n independent simulations. The convergence of this average will be discussed in Section 4.2.1.
Definition 10 (software stability (SS)). Software stability means the resistance to the amplification of changes in software. Thus, SS can be formally defined as SS = 1 − (1/n) Σ_{i=1}^{n} (NCN(i)/|V|), where V is the node set of CCN and n is the number of simulations.
Obviously, SS is in fact the average proportion of nodes that are not affected by the changes in one node of CCN over the n independent simulations. Thus, the larger SS is, the more stable the software is. Indeed, SS measures the resistance of a randomly selected node to the amplification of changes. Note that SS is used to quantify software stability, while ANCN(v) can serve as an indirect metric of class stability. That is, the smaller ANCN(v) is, the more stable class v is.
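Under the definitions above, SS can be estimated as sketched below. This is a toy implementation under our reconstruction of the formulas (the names NCN, RCN, and SS stand in for the paper's symbols, and the propagation routine is the one from Section 3.4.2):

```python
# Estimate SS = 1 - (1/n) * sum_i NCN(i)/|V| by repeated simulation.
import random
from collections import deque

def propagate(weights, nodes, start, rng):
    """Set of nodes affected by a change in `start` (Section 3.4.2 steps)."""
    changed = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        p = rng.random()
        for v in nodes:
            if (u, v) in weights and p <= weights[(u, v)] and v not in changed:
                changed.add(v)
                queue.append(v)
    return changed

def software_stability(weights, nodes, n, seed=0):
    """SS: the average share of nodes NOT affected by a change in a
    uniformly chosen initial node, over n independent simulations."""
    rng = random.Random(seed)
    nodes = sorted(nodes)
    total_rcn = 0.0
    for _ in range(n):
        start = rng.choice(nodes)                 # Hypothesis 2: uniform
        ncn = len(propagate(weights, nodes, start, rng))  # NCN(i)
        total_rcn += ncn / len(nodes)                     # NCN(i)/|V|
    return 1.0 - total_rcn / n

# Isolated nodes never propagate changes, so SS = 1 - 1/|V| exactly.
print(software_stability({}, {"A", "B", "C", "D"}, n=50))  # -> 0.75
```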
3.6. Change Propagation Algorithm. Algorithm 1 shows the flow of our change propagation algorithm. It is used to compute the metric values. Note that Algorithm 1 only returns the values of NCN(i), RCN(i), and SS. To compute ANCN(v), we should replace Step 7 of Algorithm 1 so that the initial changed node is fixed to node v. The time complexity and space complexity of Algorithm 1 are O(n · (|V| + |E|)) and O(|V|), respectively, where n is the number of simulations, |V| is the number of nodes in the CCN, and |E| is the number of links in the CCN. In software systems, |V| will not be very large. Even the well-known jdk (see Table 2) contains only 2,883 classes. Thus, the time complexity and space complexity of our approach are acceptable.

Evaluation
In this section, we theoretically and empirically evaluate the effectiveness of our proposed SS metric. Our aim is to check whether SS is an effective metric to assess the stability of a piece of software. Our theoretical evaluation is performed using widely accepted criteria, while our empirical evaluation is performed using a set of Java software systems. All the experiments were carried out on a PC with a 2.6 GHz CPU and 8 GB of RAM.

Research Questions.
In the evaluation, we aimed to address the following five research questions (RQs): (i) RQ1: Does SS converge to a stable value? As a practical metric to characterize software stability, the value of SS should be relatively stable. However, we use a simulation approach to compute the value of SS, which introduces some randomness. We wish to know whether the SS we compute for a specific piece of software converges to a relatively stable value.
(ii) RQ2: Does SS satisfy Weyuker's nine properties? Weyuker's nine properties are widely used to evaluate the usefulness of any software metric. We wish to know whether SS also satisfies Weyuker's nine properties.
(iii) RQ3: Is SS a good indicator for software quality improvement? There are many functionally equivalent software systems. We wish to know whether SS has the ability to identify the higher-quality one of two functionally equivalent software systems.
(iv) RQ4: Is ANCN(v) a good indicator for class importance? Important (or key) classes have a "controlling function" in a software system, providing services for a large number of other classes.
ANCN(v) can be used as a metric to quantify the influence of a specific class on other classes in the change propagation process. Thus, we wish to know whether ANCN(v) is a good indicator for class importance, with the hope of indirectly validating the effectiveness of SS.
(v) RQ5: Can SS be applied to large software systems? As a practical metric, SS should be applicable to large software systems. We wish to know whether our approach has the ability to be applied to large software systems.

Answers to Research Questions.
We evaluate the effectiveness of the proposed metric theoretically and empirically to answer the research questions raised in Section 4.1.

RQ1: Does SS Converge to a Stable Value?
As mentioned above, we use a simulation approach to compute SS. Owing to the random selection of the initial changed node and the random decimal p, the SS we obtain for a specific piece of software may vary across independent simulations. However, as a metric used to characterize software stability, its value should not change greatly and should converge to a relatively stable value. In this section, we perform experiments to examine whether SS converges to a relatively stable value.
We also show some size-related metrics, i.e., thousand lines of code (KLOC), number of packages (#P), number of classes (#C), number of methods (#M), and number of attributes (#A). These values are computed from the Java code in the directory listed in the "Directory" column, not the whole distribution of the corresponding software.
(2) Metrics. To check whether SS is stable, we use the standard deviation of SS (SD for short) to quantify the difference between independent calculations of SS. Specifically, SD is defined as SD = sqrt((1/m) Σ_{i=1}^{m} (SS_i − SS̄)^2), where m denotes the number of independent calculations of SS, SS_i is the value of the i-th calculation, and SS̄ = (1/m) Σ_{i=1}^{m} SS_i. Generally, if SS is absolutely stable, SD = 0. However, since we use a simulation approach to compute SS, SD may not always exactly equal 0. But if SD is very close to 0, we consider SS acceptably stable.
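The SD computation is a plain standard deviation over m repeated calculations of SS; a short sketch (symbols are our reconstruction of the stripped formula):

```python
# Standard deviation of m independent SS calculations.
import math

def sd(ss_values):
    """SD = sqrt((1/m) * sum_i (SS_i - mean)^2) over the m calculations."""
    m = len(ss_values)
    mean = sum(ss_values) / m
    return math.sqrt(sum((s - mean) ** 2 for s in ss_values) / m)

# Identical runs mean SS is absolutely stable: SD = 0.
print(sd([0.5, 0.5, 0.5]))  # -> 0.0
```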
(3) Experiment Process and Results Analysis. We follow the steps shown in Figure 1 to parse the source code, extract information, and build CCNs for each subject system. Figures 5(a)-5(f) show the CCNs we build for the subject systems. Enlarging the corresponding figure reveals more detail, such as the class that each node denotes and the couplings between classes.
There is only one parameter (i.e., n) that should be set before executing Algorithm 1. To compute SD, we should set another parameter, m, in formula (3). Thus, we analyze the convergence of SS under different settings of n and m. Specifically, we set n from 1 to 10,000 at intervals of 100, and m from 1 to 100 at intervals of 10. Figures 6(a) and 6(b) show SS versus n and SD versus m, respectively. As shown in Figure 6(a), we can observe that, for all the examined systems, SS fluctuates slightly in the early period of our simulation (n < 2,000 for the examined systems). But after that, its value is relatively stable. As shown in Figure 6(b), SD < 10^{-4} for all the examined systems, signifying that SS is very stable and can be used as a feasible metric. Further, we also observe that the quality of all the examined systems is high, with their SS values being larger than 0.9955.

RQ2: Does SS Satisfy Weyuker's Nine Properties?
Weyuker proposed a set of nine properties to theoretically validate the efficiency and robustness of various software complexity measures [13]. These properties are designed to analyze whether a specific metric qualifies as an effective measure. Following the suggestion given in [30], we also evaluate our metric against Weyuker's nine properties with the goal of answering RQ2. The theoretical evaluation is paraphrased as follows, where μ denotes any software complexity metric (in this work, μ denotes SS). Property 12 (granularity). Let c be a nonnegative number; then there are only finitely many programs P with μ(P) = c.
The universe of discourse deals with a finite number of applications. Thus, there are only a finite number of programs having the same CCN and satisfying μ(P) = c. Therefore, Property 12 is satisfied by SS.
A great many programs have been developed, so there is a chance that two programs have the same CCN. Furthermore, it is easy to build two programs with the same CCN and hence the same SS value. Therefore, Property 13 is satisfied by SS. Property 14 states that even if two programs have the same functionality, they may differ in complexity. The same set of functionalities can be implemented in different ways. For example, the software systems shown in Table 3 all have two versions: one applies a design pattern, and the other does not. The two versions have equivalent functionalities, but from Table 4, we can see that they have different SS values. Therefore, Property 14 is satisfied by SS.
SS is not a size-related metric. Thus, Property 15 is not applicable to evaluating SS. Property 17 does not hold for any OO metric, since altering the order of the statements does not affect the CCN and its corresponding SS. Thus, Property 17 is not satisfied by SS.

Property 18 (no change on renaming). If Q is a renaming of P, then μ(P) = μ(Q).
As CCN is independent of program names, SS satisfies Property 18. As discussed above, SS satisfies seven out of nine of Weyuker's properties. The two exceptions are Properties 15 and 17. Property 15 is only applicable to size-related metrics, and Property 17 does not hold for any OO metric. Such exceptions have also been observed in other work [31-35]. Thus, SS is a well-structured metric that can be used for computing software stability.

RQ3: Is SS a Good Indicator for Software Quality Improvement?
It is widely accepted that design patterns are a best practice for accommodating design changes and can improve the quality of software design [36].
SS is an internal quality indicator of software. It should have the ability to identify the higher-quality one of two functionally equivalent software systems.
(1) Subject Systems. We use six pairs of software systems (see Table 3). In each pair, there are two different versions with the same set of functionalities. The only difference between the two versions is that one does not apply any design pattern (the before version), while the other applies one kind of design pattern. Thus, based on the six pairs of software systems, we can check whether the design pattern improves the quality of software. Table 3 gives some simple descriptions of the examined software systems, including the design pattern used, lines of code (LOC), number of packages (#P), number of classes (#C), number of methods (#M), and number of attributes (#A).
(2) Experiment Process and Results Analysis. We follow the steps shown in Figure 1 to parse the source code, extract information, and build CCNs for the six pairs of software systems. Figures 7(a)-7(l) show the CCNs we built for the subject systems. Enlarging the corresponding figure reveals more detail, such as the class that each node denotes, the couplings between classes, and the weight on each edge. Note that the weights shown on the edges denote coupling frequencies; they are not the weights defined in the CCN definition. To obtain the true weight of each edge in a CCN, these frequencies should be divided by the maximum frequency in that CCN. For example, in the CCN shown in Figure 7(a), the frequency on the edge between "NO NAME.StackList" and "NO NAME.BridgeDisc" is 1.0, and the maximum frequency in the CCN is 8, on the edge between "NO NAME.StackFIFO" and "NO NAME.StackArray". Thus, the true weight of the edge between "NO NAME.StackList" and "NO NAME.BridgeDisc" in the CCN is 1/8. Table 4 shows the 𝑆𝑆 values that we computed for each pair of software systems. Obviously, the 𝑆𝑆 value of the version that applies a design pattern is larger than that of the version that does not. Thus, in the examined systems, 𝑆𝑆 can identify the higher-quality system of two functionally equivalent software systems.
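The normalization described above can be sketched in a few lines (Python used for illustration; the class names and frequencies come from the Figure 7(a) example, with the other edges omitted):

```python
# Sketch: turning raw coupling frequencies into CCN edge weights, assuming
# (as stated in the text) weight = frequency / maximum frequency in the CCN.
def normalize_weights(freq):
    """freq: dict mapping (src_class, dst_class) -> coupling frequency."""
    max_freq = max(freq.values())
    return {edge: f / max_freq for edge, f in freq.items()}

# Frequencies mirroring the Figure 7(a) example (other edges omitted):
freq = {
    ("NO NAME.StackFIFO", "NO NAME.StackArray"): 8,  # maximum in this CCN
    ("NO NAME.StackList", "NO NAME.BridgeDisc"): 1,
}
weights = normalize_weights(freq)
print(weights[("NO NAME.StackList", "NO NAME.BridgeDisc")])  # 0.125, i.e., 1/8
```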

RQ4: Is 𝑆𝑆(𝑣𝑖) a Good Indicator for Class Importance? Important classes (also called key classes) have a "controlling function" in a piece of software. These important classes usually work together tightly to provide services for many other classes in the system.
𝑆𝑆(𝑣𝑖) quantifies the scope that changes in node 𝑣𝑖 will influence in the CCN. It is an indirect measurement of the influence power of node 𝑣𝑖 in the change propagation process. Thus, a node with a larger value of 𝑆𝑆(𝑣𝑖) is expected to be more important, and there should be a positive correlation between the importance of a class and its 𝑆𝑆(𝑣𝑖). In this section, we treat 𝑆𝑆(𝑣𝑖) as the importance of class node 𝑣𝑖, and use the 𝑆𝑆(𝑣𝑖) of each node to identify the important classes in software. We also compare 𝑆𝑆(𝑣𝑖) with four baseline approaches on key class identification.
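Algorithm 1 itself is not reproduced in this section, so the following Python sketch only illustrates the general idea under an assumed model: a change starting at one class propagates along each edge with probability equal to the edge weight, and the influence scope is averaged over repeated Monte Carlo runs. The function and variable names are illustrative, not the paper's.

```python
import random

def propagation_scope(adj, start, runs=1000, rng=None):
    """adj: dict node -> list of (neighbor, weight) pairs. Returns the average
    number of classes affected (excluding start) when a change starts at start,
    assuming each edge propagates the change with probability = its weight."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    total = 0
    for _ in range(runs):
        affected = {start}
        frontier = [start]
        while frontier:
            node = frontier.pop()
            for nbr, w in adj.get(node, []):
                if nbr not in affected and rng.random() < w:
                    affected.add(nbr)
                    frontier.append(nbr)
        total += len(affected) - 1
    return total / runs

# Tiny example: A always reaches B; B reaches C about half the time.
adj = {"A": [("B", 1.0)], "B": [("C", 0.5)], "C": []}
scope_a = propagation_scope(adj, "A")
print(scope_a)  # close to 1.5
```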
(1) Subject Systems. Two Java software systems have been chosen as our subject systems. One is ant, a library and     command-line tool for automating software build processes. The other is JMeter, an application for testing functional behavior of software and measuring its performance. The two systems have been widely used as benchmark systems for validating the effectiveness of approaches on identifying key classes. Table 5 gives some simple descriptions of the selected two subject systems, including their names and versions, the domain they belong to, the directory we analyze, and URLs to download these systems. We also show some size-related metrics as we have shown in Table 2. The data for ant shown in Table 5 are copied from Table 2.
(2) Baseline Approaches. The four baseline approaches are ℎ-index [37], -index [37], SM SW+HITS [38], and SM SO+HITS [38]. These approaches use different versions of software networks at the class level of granularity, and also use different metrics to quantify class importance. Specifically, they are described briefly as follows: (i) ℎ-index [37] uses a weighted undirected software network at the class level to represent classes and their couplings. Its software network is different from our CCN defined in Definition 1: it only considers four kinds of coupling between classes, and does not consider "IMR", "MCR", and "LVR". It neglects the direction of edges, and sums the frequencies of the different kinds of couplings between a pair of classes as the weight on that edge. It then computes the ℎ-index of each class in the software network as the importance of that class, ranks the classes by their ℎ-index in descending order, and treats the top-ranked classes as the key classes it identifies.
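For intuition, the ℎ-index of a class can be computed from the weights of its incident edges by direct analogy with the bibliometric ℎ-index; this is only a sketch of the idea, not the implementation of [37]:

```python
def h_index(edge_weights):
    """h-index of a class: the largest h such that the class has at least
    h incident edges, each with weight (coupling frequency) >= h."""
    ws = sorted(edge_weights, reverse=True)
    h = 0
    while h < len(ws) and ws[h] >= h + 1:
        h += 1
    return h

# A class with incident edge weights [5, 4, 3, 2, 1] has h-index 3:
# three of its edges have weight >= 3.
print(h_index([5, 4, 3, 2, 1]))  # 3
print(h_index([1, 1, 1]))        # 1
```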
(ii) -index [37] is similar to ℎ-index; the only difference is that it computes the -index of each class in the software network as the importance of classes.
(iii) SM SW+HITS [38] uses a weighted directed software network at the class level to represent classes and their couplings. Its software network is also different from our CCN defined in Definition 1: it only considers one relation between classes, i.e., "MCR". It counts the method calls between every pair of classes to assign a weight to the edge between the two classes, counting every occurrence of a call. It then applies the HITS algorithm to compute the hubiness score of each class as its importance, ranks the classes by their hubiness scores in descending order, and treats the top-ranked classes as the key classes it identifies.
(iv) SM SO+HITS [38] is similar to SM SW+HITS; the only difference is that it applies a different counting mechanism for method calls: it counts every occurrence of a call only once.
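A minimal weighted HITS iteration, written from the description above, might look as follows (a sketch under assumed update rules, not the exact setup of [38]): a class's hubiness grows with the authority of the classes it calls, and vice versa.

```python
def hits_hub_scores(edges, nodes, iters=50):
    """edges: list of (src, dst, weight). Returns L1-normalized hub scores.
    Weighted HITS: auth(q) = sum over callers p of w(p,q)*hub(p);
                   hub(p)  = sum over callees q of w(p,q)*auth(q)."""
    hub = {n: 1.0 for n in nodes}
    for _ in range(iters):
        auth = {n: sum(w * hub[s] for s, d, w in edges if d == n) for n in nodes}
        norm = sum(auth.values()) or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        hub = {n: sum(w * auth[d] for s, d, w in edges if s == n) for n in nodes}
        norm = sum(hub.values()) or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub

# A calls B twice and C once; B calls C once; C calls nothing.
edges = [("A", "B", 2.0), ("A", "C", 1.0), ("B", "C", 1.0)]
hubs = hits_hub_scores(edges, ["A", "B", "C"])
# A, the heaviest caller, ends up with the highest hub score.
```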
(3) Metrics. As mentioned above, 𝑆𝑆(𝑣𝑖) is also used as a measure of class importance. To compare the effectiveness of 𝑆𝑆(𝑣𝑖) with the baseline approaches, we apply three external metrics, i.e., Precision = TP/(TP + FP), Recall = TP/(TP + FN), and F-Measure 𝐹𝛽 = (1 + 𝛽²) × Precision × Recall / (𝛽² × Precision + Recall), where TP, FP, FN, and TN are defined as follows: (i) True Positive (TP). Classes identified by the original developers as key classes are also suggested by a specific approach as key classes.
(ii) False Positive (FP). Classes identified by a specific approach as key classes are not suggested by the original developers as key classes.
(iii) False Negative (FN). Classes identified by the original developers as key classes are not suggested by a specific approach as key classes.
(iv) True Negative (TN). Classes which are not identified by the original developers as key classes are also not suggested by a specific approach as key classes.
Generally, larger values of Precision, Recall, and F-Measure indicate that an approach is more effective.
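Given two sets of classes (identified vs. ground truth), the four counts above determine all three metrics directly; the 𝐹𝛽 form with 𝛽 = 1 and 𝛽 = 5 matches the F1 and F5 values used in this evaluation. An illustrative Python sketch:

```python
def evaluate(identified, ground_truth, beta=1.0):
    """Compute (precision, recall, F-beta) for a set of identified key
    classes against the ground-truth key classes."""
    identified, ground_truth = set(identified), set(ground_truth)
    tp = len(identified & ground_truth)   # correctly identified key classes
    fp = len(identified - ground_truth)   # identified but not in ground truth
    fn = len(ground_truth - identified)   # key classes that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    b2 = beta * beta
    f = ((1 + b2) * precision * recall / (b2 * precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

# Toy example: 2 of 3 suggestions are correct; 1 true key class is missed.
p, r, f1 = evaluate({"A", "B", "C"}, {"A", "B", "D"})
print(p, r, f1)  # each is 2/3
```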
(4) Experiment Process and Results Analysis. We follow the steps shown in Figure 1 to parse the source code of the two software systems, extract information, and build CCNs. Figure 8 shows the CCN we built for the subject system JMeter. Enlarging the figure reveals more detail, such as the class that each node denotes, the couplings between classes, and the weight on each edge. As in Figure 7, the weights on the edges denote coupling frequencies. The CCN for the subject system ant is shown in Figure 5(c).
In related work, the top 15% of classes in the ranked list are regarded as key classes [37,38]; thus, we also examine the top 15% of classes. Tables 6 and 7 show the key classes identified by 𝑆𝑆(𝑣𝑖) and the baseline approaches for the two systems, respectively. In Tables 6 and 7, the "Key Classes" column lists the key classes in the ground truth, and the other columns contain the key classes identified by each approach; "√" signifies that the corresponding class is identified by that approach. The last four rows of each table show the metric data for each approach. Note that [38] only examined, by dynamic analysis, a subset of the classes that we examined in both systems. Since we cannot access the list of classes that [38] examined, the results for SM SW+HITS and SM SO+HITS shown in Tables 6 and 7 are copied directly from [38].
It can be observed from Table 6 that, when applied to ant, 𝑆𝑆(𝑣𝑖) obtains a recall of 80%, a precision of 5.90%, an F1 of 10.99%, and an F5 of 53.94%, all better than the four baseline approaches. This indicates that 80% of the key classes in the ground truth of ant can be retrieved by 𝑆𝑆(𝑣𝑖). Similarly, we can observe from Table 7 that, when applied to JMeter, 𝑆𝑆(𝑣𝑖) obtains a recall of 28.57%, a precision of 10.33%, an F1 of 15.17%, and an F5 of 26.75%, again better than the four baseline approaches. This indicates that 28.57% of the key classes in the ground truth of JMeter can be retrieved by 𝑆𝑆(𝑣𝑖). Thus, we conclude that 𝑆𝑆(𝑣𝑖) performs best among all the approaches, signifying that 𝑆𝑆(𝑣𝑖) is indeed a good indicator for class importance.
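As a consistency check, the reported F1 and F5 values for ant follow directly from its precision and recall via the 𝐹𝛽 formula:

```python
def f_beta(p, r, beta):
    """F-beta measure: (1 + beta^2) * p * r / (beta^2 * p + r)."""
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)

# Reported ant results: precision 5.90%, recall 80%.
p, r = 0.059, 0.80
f1_pct = round(f_beta(p, r, 1) * 100, 2)
f5_pct = round(f_beta(p, r, 5) * 100, 2)
print(f1_pct, f5_pct)  # 10.99 53.94, matching Table 6
```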

RQ5: Can 𝑆𝑆 Be Applied to Large Software Systems? As a feasible metric, 𝑆𝑆 will be applied to software of different sizes. Thus, we evaluate the scalability of 𝑆𝑆 by tracking the execution time required to compute it. As mentioned in Figure 1, computing 𝑆𝑆 mainly involves the following three steps: (i) Parsing the source code of a piece of software to extract structural information at the class level of granularity. (ii) Representing the extracted structural information as a CCN.
(iii) Applying Algorithm 1 to compute the values of the related metrics. Table 8 shows the CPU time required for each step when applied to the subject software systems listed in Table 2. Obviously, there is a strong positive correlation between the KLOC of a software system and the CPU time for step (i), i.e., a larger KLOC implies a longer execution time for step (i). The CPU time for steps (ii) and (iii) appears related to the number of nodes in the CCN. We can observe that although jdk and tomcat are very large in size (see Table 2), the execution time to compute 𝑆𝑆 from scratch is less than 4.5 minutes. Thus, the answer to RQ5 is that 𝑆𝑆 has good scalability and can be applied to large software systems at an acceptable CPU time cost.
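Per-step timing of this kind reduces to wrapping each pipeline stage in a timer; a minimal sketch (the stage functions below are hypothetical stand-ins, not the paper's tooling):

```python
import time

def timed(label, fn, *args):
    """Run fn(*args), report elapsed wall-clock time, and return (result, seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f}s")
    return result, elapsed

# Hypothetical stand-ins for steps (i)-(iii): parse source, build CCN, compute SS.
classes, t1 = timed("(i) parse source", lambda: ["A", "B", "C"])
ccn, t2 = timed("(ii) build CCN", lambda cs: {c: [] for c in cs}, classes)
n_nodes, t3 = timed("(iii) compute metrics", lambda g: len(g), ccn)
```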

Threats to Validity.
In the evaluation, some factors may affect the validity of our conclusions, i.e., (i) A threat to the internal validity of our study is related to the ground truth used to answer RQ4. Key classes in the ground truth are provided by the original developers of the software systems; however, these developers may be biased towards the parts of the software they are familiar with, which will affect the values of Precision, Recall, F1, and F5. This threat has been partially mitigated by the fact that we use benchmark software systems from the literature.
(ii) A threat to the external validity of our study is related to the generalization of our conclusions to other software systems. On the one hand, our evaluation is performed on a small set of software systems compared with the whole population of software systems. On the other hand, our evaluation is performed only on software coded in Java; no software coded in a non-Java language is included. This threat has been partially mitigated by the fact that our subject software systems are chosen from different domains and have different sizes. Furthermore, our approach can be generalized to software systems coded in non-Java languages as long as they can be transformed into a CCN representation.

Conclusions and Future Work
In this paper, we proposed a novel approach to characterize software stability with the aim of controlling software maintenance costs. Our approach is based on a novel software network representation of software structure (i.e., the class coupling network) and a simulation algorithm for change propagation. A novel metric, 𝑆𝑆 (software stability), is developed to measure software stability.
We evaluated our 𝑆𝑆 metric theoretically using the widely accepted Weyuker's criteria, and empirically using a set of open source Java programs. The theoretical results show that our metric satisfies most of Weyuker's properties with only two exceptions, and the empirical results show that our metric is an effective indicator for software quality improvement and class importance. Empirical results also show that our approach can be applied to large software systems.
Our future work includes (i) collecting a large set of benchmark Java software systems to validate the effectiveness of our approach and (ii) extending the current work to analyze software coded in non-Java languages.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.