ImReMuDF: Redundant Mutants Identification Method Based on Definition and Reference of Variables

Mutation testing is an effective defect-based software testing method, but the large number of mutants makes testing expensive, which hinders the application of mutation testing in industrial engineering. To address this problem, this paper improves the method of identifying redundant mutants based on data flow analysis and introduces the inclusion relationship between redundant mutants, so that more redundant mutants are identified and removed and, in turn, the cost of mutation testing is reduced. The redundant mutants identification method based on definition and reference of variables (ImReMuDF) was validated and evaluated using 8 C programs. Across the 8 C programs tested, the minimum improvement in the redundant mutant identification rate was 34.0%, and the maximum improvement was 71.3%. The verification results show that the method is feasible and effective: it identifies more redundant mutants and reduces the execution time of mutation testing.


Introduction
Mutation testing is a fault-based software testing technique [1] that has received a great deal of attention in program analysis, defect detection, and test case generation [2]. Mutation testing is convenient and flexible, exposes defects in the software, and can measure the fault-finding ability of a test data set and evaluate the adequacy of the test [3]. Mutation testing has strong fault detection capabilities [4]. Mutation testing technology has gradually become popular in the industry [5] but has not been widely used. The reasons are that running and analyzing tests are expensive in terms of resources and manpower [6], the computational cost of mutation testing is high, the identification of equivalent mutants is difficult, effective automated mutation testing tools are still immature, etc. [7]. The concept of mutation testing was first proposed by DeMillo [8]; it refers to applying mutation operations to the original program to generate a new program (a program with defects), which is called a mutant. Subsequently, a large number of scientific research results appeared. In terms of reducing the number of mutants, Wong [9] proposed random selection of mutants. Sridharan and Namin [10] proposed the preferential selection of more informative mutation operators. Usaola et al. [11] proposed the minimization of test cases to reduce the time of mutation testing. Sun et al. [12] proposed a method for identifying redundant mutants based on data flow analysis. Chekam et al. [13,14] proposed a dynamic symbolic execution method and a mutant priority method. Shomali and Arasteh [15] proposed the firefly optimization algorithm as a heuristic algorithm for identifying the most error-prone path in the program. Hooseini et al. [16] proposed a genetic algorithm to identify the path where the program is most likely to propagate errors as the mutation location. In terms of shortening the time of mutation testing, Krauser et al. [17] proposed parallel execution of mutants. King and Offutt [18] proposed prioritization of mutation compilation.
Although the above methods optimize mutation testing from different aspects, they are not perfect in terms of adequacy assessment and still have a certain impact on adequacy. The method proposed by Sun et al. reduces the mutants caused by variables very well. It reduces the execution time of mutation testing by reducing the number of mutants, but it does not extend well to the identification of redundant mutants of multiple variables.
To increase the recognition rate of redundant mutants and reduce their number, this paper proposes a redundant mutants identification method based on definition and reference of variables (ImReMuDF). The main contributions of this paper are as follows: (1) The definition and reference of two variables are proposed to increase the recognition rate of redundant mutants in the test. (2) In the definition and reference of two variables, the inclusion relationship of the variables in the same situation is introduced. This method improves the identification of redundant mutants caused by variables, thereby reducing the number of mutants. Compared with the definition and reference of one variable, the ImReMuDF method can be extended to the definition and reference of multiple variables to identify redundant mutants.

Traditional Mutation Testing.
In traditional mutation testing, errors are embedded into the program through mutation operators [19] or manually; the embedded defects are then detected by test cases, and the adequacy of defect detection is measured by the mutation score. Given the tested program $P$ and the set of test cases $T$, mutation operators or manually embedded errors are applied to $P$ to generate a set of mutants $M$. It is very important to distinguish equivalent mutants from nonequivalent mutants in mutation testing [20]. The traditional mutation testing process is shown in Figure 1 and proceeds as follows:
(1) The set of mutants is classified into the set of equivalent mutants $I$ and the set of nonequivalent mutants $L$
(2) Run the set of test cases
The redundant mutants are shown in Figure 2. In Figure 2, the mutants $m_1$ and $m_2$ mutate the 3rd and 4th lines, respectively. Although the position of the mutation has changed, the program state below the mutation location is similar. In mutation testing, if there is a test case that can execute the above code, it must be able to kill both $m_1$ and $m_2$. In the calculation of the mutation score, since whether $m_2$ can be killed can be obtained from the test result of $m_1$, only $m_1$ needs to be executed during mutation testing, and $m_2$ is not executed.
Given the program $P$ and a mutant $m_1$ generated by a mutation operator, if the output results of $P$ and $m_1$ are not equal when running on the test case $t$, then the mutation testing is called strong mutation testing. Given the program $P$ and a mutant $m_2$ generated by a mutation operator, if the states of $P$ and $m_2$ are inconsistent when running on the test case $t$, then the mutation testing is called weak mutation testing. Compared with strong mutation testing, weak mutation testing has the advantage of shortening the test time. As can be seen from the above description, traditional mutation testing is based on strong mutation testing. This paper optimizes the execution time of mutation testing on the basis of weak mutation testing.
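The difference between the two criteria can be made concrete with a small sketch. The program `original`, the mutant `mutant`, and the helper names are hypothetical illustrations, not taken from the paper; each function returns both its final output and the internal state right after the mutated statement so that the two criteria can be compared on the same input.

```python
def original(a, b):
    # Original program P: `total` is the state right after the statement
    # that the mutant changes.
    total = a + b
    return total > 0, total  # (final output, intermediate state)


def mutant(a, b):
    # Hypothetical mutant: the operator `+` is replaced by `-`.
    total = a - b
    return total > 0, total


def strongly_killed(t):
    """Strong criterion: the final outputs of P and the mutant differ on t."""
    return original(*t)[0] != mutant(*t)[0]


def weakly_killed(t):
    """Weak criterion: the states right after the mutated statement differ on t."""
    return original(*t)[1] != mutant(*t)[1]


for t in [(3, 1), (1, 3), (2, 0)]:
    print(t, "weak:", weakly_killed(t), "strong:", strongly_killed(t))
```

For the input (3, 1) the intermediate states differ (4 versus 2) while the final output is the same, so the mutant is killed only under the weak criterion. Checking the state directly after the mutation location is what allows weak mutation testing to stop early.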

Definition 1 (inclusive relationship between mutants).
Given the program $P$ to be tested and an input $t = (t_1, t_2, \ldots, t_n)$ consisting of an $n$-tuple, all possible $t$ form the input space $T$ of the program $P$. Suppose there are two mutants $m_1$ and $m_2$ generated from $P$, and the output states of $P$ and a mutant $m$ on input $t$ are $P_t$ and $m_t$, respectively; for the mutant $m_1$ to include the mutant $m_2$, the two following conditions must be met: Then, it can be expressed as $m_1 \mapsto m_2$. Based on the above analysis, let $T(m) = \{t \in T \mid P_t \neq m_t\}$ denote the set of inputs on which the mutant $m$ differs from $P$. When $T(m_1) \subseteq T(m_2)$ holds between the mutants, and $m_1$ and $m_2$ meet Definition 1, $m_2$ is called a redundant mutant of $m_1$. Assuming that there are sets of mutants $M_1$ and $M_2$ satisfying $\forall m_i \in M_2, \exists m_j \in M_1$ such that $m_i$ is included by $m_j$ ($m_j \mapsto m_i$), then $M_2$ is called the set of redundant mutants, and $M_1$ is the set of nonredundant mutants.
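The set condition $T(m_1) \subseteq T(m_2)$ can be checked directly on a finite input space. The following is a minimal sketch under stated assumptions: the program `p`, the mutants `m1` and `m2`, and the function names are hypothetical, and the guard that $T(m_1)$ is nonempty is an added assumption that reflects the filtering of equivalent mutants before redundancy is considered.

```python
def kill_set(program, mutant, inputs):
    """T(m): the inputs on which the mutant's result differs from the program's."""
    return {t for t in inputs if program(t) != mutant(t)}


def is_redundant(program, m1, m2, inputs):
    """Return True if m2 is redundant with respect to m1, i.e. T(m1) is a subset of T(m2).

    Assumption: m1 must be killable (T(m1) nonempty), otherwise m1 would be
    an equivalent mutant on this input space.
    """
    t1 = kill_set(program, m1, inputs)
    t2 = kill_set(program, m2, inputs)
    return bool(t1) and t1 <= t2


# Hypothetical program and mutants over a small input space.
def p(x):
    return 2 * x

def m1(x):
    return 2 * x if x < 5 else 2 * x + 1   # differs from p only for x >= 5

def m2(x):
    return 2 * x + 1                       # differs from p for every x

inputs = range(10)
print(is_redundant(p, m1, m2, inputs))  # every input that kills m1 also kills m2
print(is_redundant(p, m2, m1, inputs))
```

Because every input that kills `m1` also kills `m2`, executing `m1` alone is enough to decide the outcome for both, which is the reason a redundant mutant can be skipped.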

Data Flow Analysis.
Data flow analysis is a technique used during compilation, and it plays an important role in program analysis, compilation optimization, program verification, defect detection, etc. [2].
Data flow analysis mainly focuses on the data flow or possible values on the program execution path. Its purpose is to determine the relationship between variable definitions and references within a certain range of the program [21]. Therefore, the following two definitions are given.
Definition 2 (definition set and reference set). Definition set: in the program $P$, if $x$ appears in the assignment statement $y$ and the value of $x$ changes, then $x$ is an assignment variable, which can be expressed as
$\mathrm{def}(y) = \{x \mid x \text{ is an assignment variable in the statement } y\}$. (1)
Reference set: in the program $P$, if $x$ appears in the assignment statement $y$ and the value of $x$ does not change, then $x$ is a reference variable, which can be expressed as
$\mathrm{ref}(y) = \{x \mid x \text{ is a reference variable in the statement } y\}$.
Definition 3 (reference chain). If, in the program dependence graph (PDG) [22], there is a path from $def(u, v, s_1)$ to $ref(u, v, s_2)$ and there is no other path from $s_1$ to $s_2$, then there is a definition-reference chain of the variables $u$ and $v$, denoted as $dr(u, v, s_1, s_2)$.
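As a rough illustration of Definitions 2 and 3, the sketch below computes definition sets, reference sets, and per-variable definition-reference chains for straight-line code. It is not the paper's implementation: it uses Python's standard `ast` module on Python-style assignment statements instead of a C front end, it ignores branches and loops, and it emits one chain per variable rather than the two-variable form $dr(u, v, s_1, s_2)$. The function names are hypothetical.

```python
import ast


def def_ref_sets(statement):
    """Return (def set, ref set) of variable names for one assignment statement."""
    defs, refs = set(), set()
    for node in ast.walk(ast.parse(statement)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defs.add(node.id)      # the variable's value changes here
            elif isinstance(node.ctx, ast.Load):
                refs.add(node.id)      # the variable's value is only read here
    return defs, refs


def dr_chains(statements):
    """Per-variable chains (v, s1, s2): v is defined at line s1 and referenced
    at line s2 with no redefinition in between (straight-line code only)."""
    last_def = {}
    chains = []
    for line, stmt in enumerate(statements, start=1):
        defs, refs = def_ref_sets(stmt)
        for v in sorted(refs):
            if v in last_def:
                chains.append((v, last_def[v], line))
        for v in defs:
            last_def[v] = line
    return chains


program = ["x = a + 1", "y = b * 2", "temp = x + y"]
print(def_ref_sets(program[2]))
print(dr_chains(program))
```

Pairing the chains for `x` and `y` that end at the same reference line gives the two-variable view used in the rest of the paper, where both definitions feed a single referencing statement.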

Improved Redundant Mutant Identification Steps.
When performing redundant mutant detection on the mutant set $M$, first, a series of information about each mutant is obtained; then, the redundant mutants in the mutant set are identified according to the identification rules; finally, the redundant mutants are removed from the mutant set, and a new mutant set is output. The redundant mutant identification process based on data flow analysis is as follows:
(1) Analyze the program structure of the program $P$ to be tested, and obtain the block rule file of the program
(2) Compare each mutant program with the source program to obtain the mutation location of the mutant
(3) Determine the class of the mutant according to the block rule file
(4) Analyze the data flow of the source program, and generate data flow information such as variable definition, variable reference, and definition-reference chain
(5) Identify the set of redundant mutants according to the redundant mutant identification rules, combining the block categories of the mutants and the data flow information of the program
(6) Output the set of mutants
The process of identifying redundant mutants based on data flow analysis is shown in Figure 3.
In redundant mutant identification based on data flow analysis, program structure analysis and mutation location analysis rely on mutation testing tools [23]. In this paper, the data flow analysis is implemented using Frama-C [24].
Time complexity analysis of the algorithm: suppose that the number of mutants of the program to be tested is $n$, and the number of definition-reference chains is $m$. In the algorithm, each definition-reference chain needs to traverse the entire set of mutants, so the time complexity of obtaining the sets of definition mutants and reference mutants is $O(n)$; when a definition-reference chain meets the identification rules, it is necessary to traverse all mutants in the set of reference mutants of the variable, so the time complexity is $O(m)$. Therefore, the time complexity of the algorithm is $O(n \times m)$.
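The identification loop described above can be sketched as follows. This is an illustrative simplification under stated assumptions, not the paper's tool: the `Mutant` class, the chain representation (a tuple of definition lines plus one reference line), and the `rule_applies` callback standing in for the block-based rules are all hypothetical. The outer loop runs over the $m$ chains and each iteration scans the $n$ mutants, which gives the $O(n \times m)$ behaviour.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mutant:
    ident: str
    line: int  # mutation location in the source program


def identify_redundant(mutants, chains, rule_applies):
    """Return (kept mutants, identifiers of redundant mutants).

    chains: iterable of (def_lines, ref_line), one per definition-reference chain.
    rule_applies: callback deciding whether an identification rule covers the chain.
    """
    redundant = set()
    for def_lines, ref_line in chains:              # m chains
        if not rule_applies(def_lines, ref_line):
            continue
        # Each chain scans the whole mutant set: O(n) per chain.
        def_set = [mu for mu in mutants
                   if mu.line in def_lines and mu.ident not in redundant]
        ref_set = [mu for mu in mutants
                   if mu.line == ref_line and mu.ident not in redundant]
        if not def_set:
            continue
        # The first definition mutant is kept as the source mutant;
        # the remaining mutants on the chain are marked redundant.
        for mu in def_set[1:] + ref_set:
            redundant.add(mu.ident)
    kept = [mu for mu in mutants if mu.ident not in redundant]
    return kept, redundant


# Hypothetical data: definitions on lines 4 and 5, reference on line 6.
mutants = [Mutant("m1", 4), Mutant("m2", 4), Mutant("m3", 5),
           Mutant("m4", 6), Mutant("m5", 6), Mutant("m6", 6)]
chains = [((4, 5), 6)]
kept, redundant = identify_redundant(mutants, chains, lambda d, r: True)
print([mu.ident for mu in kept], sorted(redundant))
```

Only the source mutant then has to be executed against the test cases; the outcome for the mutants marked redundant follows from it.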

Identification Rules of Redundant Mutant
When mutants undergo static analysis, the following three aspects need to be qualified: (1) the set of mutants with similar mutation locations; (2) the set of mutants with the same predecessor path conditions; (3) the set of mutants with similar program states (the state of the program after the mutation location). When identifying redundant mutants, to ensure that a reachable path from variable definition to use exists between the mutation locations of the source mutant and the redundant mutant, the analysis location is set after the mutant is used. Based on the above principles, the redundant mutant identification rules are defined in 4 dimensions (intrablock, sequential block, subblock, and intermodule) using data flow analysis.
According to the identification rule $D_1$, the source mutant $m_1$ and the redundant mutants $m_x$ ($x = 2, 3, \ldots, 6$) are required to belong to the same basic block. The mutants $m_i$ ($i = 1, 2, 3$) and $m_j$ ($j = 4, 5, 6$) are located at the variable definition appearing in the statement $s_1$, $(def, u, v, s_1)$, and at the variable reference appearing in the statement $s_2$, $(ref, u, v, s_2)$, respectively, where $s_1$ and $s_2$ belong to the same $BasicBlock_n$. From data flow analysis, if the execution of $s_1$ in $m_i$ is triggered in the mutation test, it causes the execution of $s_2$ in $m_j$, and the definition-reference chain then ensures that the state at $s_1$ is propagated to $s_2$. Based on the inclusion relationship between mutants and the definition of redundant mutant, $m_x$ ($x = 2, 3, \ldots, 6$) can be identified as redundant mutants of $m_1$. Figure 4 shows an example of the $D_1$ identification rule application.

Identification Rule 1 ($D_1$).
In the mutants $m_i$ ($i = 1, 2, 3$), the variables $x$ and $y$ are defined in the 4th and 5th lines ($def(x, y, 4, 5)$), and in the mutants $m_j$ ($j = 4, 5, 6$), the variables $x$ and $y$ are referenced in the 6th line ($ref(x, y, 6)$). In the mutants $m_i$ ($i = 1, 2, 3$), changes in the values of the variables in the definition mutant set $M(def, x, y, 4, 5)$ of the variables $x$ and $y$ cause the value of the variable temp in the 6th line of the program to change. In the mutants $m_j$ ($j = 4, 5, 6$), the reference mutant set $M(ref, x, y, 6)$ of the variables $x$ and $y$ has a direct effect on the value of the variable temp in the program. Due to the definition-reference chain $dr(x, y, 4, 5, 6)$, all 6 mutants have the same change state in the 6th line; that is, $m_x$ ($x = 2, 3, \ldots, 6$) are redundant mutants of $m_1$.

Identification Rule 2 ($D_2$).
Assuming the existence of a definition-reference chain $dr(u, v, s_1, s_2)$, let the mutants $m_i$ ($i = 1, 2, 3$) be the definition mutation set $M(def, u, v, s_1)$ of the variables $u$ and $v$, and let the mutants $m_j$ ($j = 4, 5, 6$) be the reference mutation set $M(ref, u, v, s_2)$ of the variables $u$ and $v$. If $s_1$ belongs to the program block $b_n$ ($b_n$ is one of the BasicBlock, OptiomBlock, and LoopBlock), $s_2$ belongs to $BasicBlock_i$, and $b_n$ and $BasicBlock_i$ satisfy the sequential block relationship and meet Definition 1 in $s_1$, then it can be concluded that $m_x$ ($x = 2, 3, \ldots, 6$) are redundant mutants of $m_1$.
According to the identification rule $D_2$, the mutants $m_i$ ($i = 1, 2, 3$) and the mutants $m_j$ ($j = 4, 5, 6$) are required to be in a sequential block relationship. $m_i$ ($i = 1, 2, 3$) and $m_j$ ($j = 4, 5, 6$) are located where the variables are defined in the statement $s_1$, $(def, u, v, s_1)$, and referenced in the statement $s_2$, $(ref, u, v, s_2)$, respectively, where $s_1$ belongs to the program block $b_n$ ($b_n$ is one of the BasicBlock, OptiomBlock, and LoopBlock) and $s_2$ belongs to $BasicBlock_i$. According to the definition of the sequential block, in mutation testing, if the execution condition of $s_1$ in $m_i$ is triggered, it causes the execution of $s_2$ in $m_j$. Then, the definition-reference chain of the variables ensures that the state at $s_1$ is propagated to $s_2$. Based on the inclusion relationship between mutants and the definition of redundant mutants, $m_x$ ($x = 2, 3, \ldots, 6$) can be identified as redundant mutants of $m_1$.
Figure 5 shows an example of the $D_2$ identification rule application. The source program in Figure 5 is divided into blocks. Lines 3, 4, and 5 belong to the same BasicBlock ($b_1$), lines 6, 7, and 8 belong to the same BasicBlock ($b_2$), and line 9 belongs to a BasicBlock of its own ($b_3$). The definitions of the variables $x$ and $y$ are in the 4th and 5th lines, expressed as $def(x, y, 4, 5)$, and the references to the variables $x$ and $y$ are in the 9th line, expressed as $ref(x, y, 9)$. $b_2$ is the sequential block of $b_1$, and $b_3$ is the sequential block of $b_2$, so $b_3$ is an indirect sequential block of $b_1$. The mutants $m_i$ ($i = 1, 2, 3$) belong to the definition mutation set $M(def, x, y, 4, 5)$ of the variables $x$ and $y$. If the values of the variables $x$ and $y$ change, the effect is propagated to line 9 through the reference to $x$ and $y$, thereby changing the value of the variable temp. The mutants $m_j$ ($j = 4, 5, 6$) belong to the reference mutation set $M(ref, x, y, 9)$ of the variables $x$ and $y$, and the changes in the values of $x$ and $y$ directly affect the value of the variable temp. From the definition-reference chain $dr(x, y, 4, 5, 9)$, it can be seen that the 6 mutants have the same change state in the 9th line; that is, $m_x$ ($x = 2, 3, \ldots, 6$) are redundant mutants of $m_1$.

Identification Rule 3 ($D_3$).
Assuming the existence of a definition-reference chain $dr(u, v, s_1, s_2)$, let the mutants $m_i$ ($i = 1, 2, 3$) be the definition mutation set $M(def, u, v, s_1)$ of the variables $u$ and $v$, and let the mutants $m_j$ ($j = 4, 5, 6$) be the reference mutation set $M(ref, u, v, s_2)$ of the variables $u$ and $v$. When $s_1$ belongs to the program block $b_n$ ($b_n$ is one of the BasicBlock, OptiomBlock, and LoopBlock), $s_2$ belongs to $BasicBlock_i$, and $BasicBlock_i$ is the sequential block of the upper block of the subblock $b_n$ and meets Definition 1 in $s_1$, it can be concluded that $m_x$ ($x = 2, 3, \ldots, 6$) are redundant mutants of $m_1$.
According to the identification rule $D_3$, the relationship between the mutants $m_i$ ($i = 1, 2, 3$) and the mutants $m_j$ ($j = 4, 5, 6$) is required to be a composite of the subblock relationship and the sequential block relationship. $m_i$ ($i = 1, 2, 3$) and $m_j$ ($j = 4, 5, 6$) are located where the variables are defined in the statement $s_1$, $(def, u, v, s_1)$, and referenced in the statement $s_2$, $(ref, u, v, s_2)$, respectively. $s_1$ belongs to the program block $b_n$ ($b_n$ is one of the BasicBlock, OptiomBlock, and LoopBlock) and $s_2$ belongs to $BasicBlock_i$, where $b_n$ and $BasicBlock_i$ directly or indirectly satisfy the composite of the subblock relationship and the sequential block relationship. It follows from the subblock relationship that, in mutation testing, if $b_n$ is executed, its upper block is executed, and then the sequential block $BasicBlock_i$ is executed. The definition-reference chain ensures that the error state at $s_1$ is propagated to $s_2$. According to the inclusion relationship between mutants and the definition of redundant mutants, $m_x$ ($x = 2, 3, \ldots, 6$) can be identified as redundant mutants of $m_1$.
Figure 6 shows an example of the $D_3$ identification rule application. The source program in Figure 6 is divided into blocks. Lines 3, 4, 5, and 6 belong to the same loop block ($b_1$), lines 4 and 5 belong to the same basic block ($b_2$), and line 7 belongs to a basic block of its own ($b_3$). $b_1$ is the upper block of $b_2$, and $b_3$ is the sequential block of $b_1$. The definitions of the variables $x$ and $y$ are in the 4th and 5th lines, expressed as $def(x, y, 4, 5)$, and the references to the variables $x$ and $y$ are in the 7th line, expressed as $ref(x, y, 7)$. The mutants $m_i$ ($i = 1, 2, 3$) belong to the definition mutation set $M(def, x, y, 4, 5)$ of the variables $x$ and $y$. A change of the variables $x$ and $y$ changes the value of the variable sum through $ref(x, y, 7)$ in line 7. The mutants $m_j$ ($j = 4, 5, 6$) directly affect the value of the variable sum by referencing the variables $x$ and $y$. From the definition-reference chain $dr(x, y, 4, 5, 7)$, it can be seen that the 6 mutants have the same change state in the 7th line; that is, $m_x$ ($x = 2, 3, \ldots, 6$) are redundant mutants of $m_1$.

Identification Rule 4 ($D_4$).
Assuming the existence of a definition-reference chain $dr(u, v, s_1, s_2)$, let the mutants $m_i$ ($i = 1, 2, 3$) be the definition mutation set $M(def, u, v, s_1)$ of the variables $u$ and $v$, and let the mutants $m_j$ ($j = 4, 5, 6$)
According to the identification rule $D_4$, the mutants $m_i$ ($i = 1, 2, 3$) and the mutants $m_j$ ($j = 4, 5, 6$) are required to be in a cross-module relationship. For $m_i$ ($i = 1, 2, 3$) and $m_j$ ($j = 4, 5, 6$), the block where the variable definition $(def, u, v, s_1)$ appears is $b_n$ in module $i$, the block where the variable reference $(ref, u, v, s_2)$ appears is $BasicBlock_n$ in module $j$, there is a call relationship $i \rightarrow^{*} j$ between $b_n$ and $BasicBlock_n$, and $BasicBlock_n$ has no subblock relationship in module $j$. From this, it can be deduced that, after the execution of $b_n$, $BasicBlock_n$ must be executed, and the definition-reference chain then ensures that the error state at $s_1$ is propagated to $s_2$. Based on the inclusion relationship between mutants and the definition of redundant mutants, $m_x$ ($x = 2, 3, \ldots, 6$) can be identified as redundant mutants of $m_1$.
Figure 7 shows an example of the $D_4$ identification rule application. The mul function is called by the f function in the third line, and the parameters passed are the variables $x$ and $y$. The definition of the variables $x$ and $y$ in the mutants $m_i$ ($i = 1, 2, 3$) is represented as $def(x, y, 3)$, and the references in the mutants $m_j$ ($j = 4, 5, 6$) are represented as $ref(x, y, 6)$. $m_i$ ($i = 1, 2, 3$) belong to the definition mutation set $M(def, x, y, 3)$ of the variables $x$ and $y$. A change of the variables affects the return value of the function, and the value of the variable mulp in the program changes through the function call. $m_j$ ($j = 4, 5, 6$) belong to the reference mutation set $M(ref, x, y, 6)$ of the variables $x$ and $y$, and the reference of the variables $x$ and $y$ directly affects the value of the variable mulp. From the definition-reference chain $dr(x, y, 3, 6)$, it can be seen that the 6 mutants have the same change state in the 6th line; that is, $m_x$ ($x = 2, 3, \ldots, 6$) are redundant mutants of $m_1$.
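The four rules differ only in the structural relationship between the block that contains the definition and the block that contains the reference. The sketch below shows one possible way to select a rule from that relationship. It is an interpretation under stated assumptions, not the paper's block rule file format: the `Block` class, its fields, and the `calls` set (assumed to hold the transitive call relation between modules) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Block:
    name: str
    kind: str                            # "BasicBlock", "OptiomBlock", or "LoopBlock"
    module: str                          # enclosing function
    parent: Optional["Block"] = None     # upper block, if this is a subblock
    successor: Optional["Block"] = None  # next sequential block


def follows(first, later):
    """True if `later` is a direct or indirect sequential block of `first`."""
    node = first.successor
    while node is not None:
        if node is later:
            return True
        node = node.successor
    return False


def rule_for(def_block, ref_block, calls=frozenset()):
    """Pick the identification rule for a definition-reference chain, or None."""
    if ref_block.kind != "BasicBlock":
        return None
    if def_block.module != ref_block.module:
        # D4: cross-module call, reference block is not a subblock in its module.
        if (def_block.module, ref_block.module) in calls and ref_block.parent is None:
            return "D4"
        return None
    if def_block is ref_block:
        return "D1"                      # same basic block
    if follows(def_block, ref_block):
        return "D2"                      # sequential block relationship
    upper = def_block.parent
    while upper is not None:             # subblock + sequential block composite
        if follows(upper, ref_block):
            return "D3"
        upper = upper.parent
    return None


# Block layout described for Figure 5: b1 -> b2 -> b3, all basic blocks.
b1, b2, b3 = (Block(n, "BasicBlock", "f") for n in ("b1", "b2", "b3"))
b1.successor, b2.successor = b2, b3
# Block layout described for Figure 6: loop block, its body, and the block after it.
loop = Block("b1", "LoopBlock", "f")
body = Block("b2", "BasicBlock", "f", parent=loop)
after = Block("b3", "BasicBlock", "f")
loop.successor = after
print(rule_for(b1, b3), rule_for(body, after))
```

Checking the rules from the most specific relationship (same block) to the least specific (cross-module call) keeps each chain assigned to exactly one rule, which matches how the per-rule counts are reported separately in the experiments.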

Experimental Analysis
Experimental Subjects.
8 C program sets (program source: http://sir.csc.ncsu.edu/portal/index.php, containing detailed information about experimental data) are used as experimental objects to verify the feasibility and effectiveness of the algorithm. They can be divided into two categories: (1) The Siemens program set, whose programs have the following functions: print_tokens and print_tokens2 are lexical analyzers; schedule and schedule2 are schedulers; replace performs pattern matching and replacement; tcas is an aircraft collision avoidance program; tot_info computes data statistics. (2) The space program is an interpreter for an array definition language.
First, a large number of mutants were generated for each source program using Proteum [25]; then, the data set was preprocessed to filter out a set of equivalent mutants and a set of applicable mutants. The relevant information of the program set is shown in Table 1.

Experimental Steps and Results Analysis.
First, all nonequivalent, single-mutation mutants are selected, and the data flow information of the experimental subjects is obtained using Frama-C [24]; then, the block rule file of the program to be tested is obtained using static analysis, and the mutation location information of the mutants is obtained; finally, the redundant mutants are identified according to the identification rules ($D_1$ to $D_4$). The redundant mutants identified for the different experimental subjects were counted, and the statistical results are shown in Table 2.
The redundancy rate of the mutants was used to verify the effectiveness of the algorithm in reducing the execution time of mutation testing. The redundancy rate is calculated by equation (1), where $R$ denotes the redundancy rate, NRVI denotes the number of identified redundant mutants, MN denotes the total number of mutants, and MEN denotes the number of equivalent mutants. From Table 2 and Figure 8, it can be seen that the mutant sets of different experimental subjects contain different numbers of redundant mutants. Although the redundancy rate varies greatly across experimental subjects, this is enough to verify the existence of redundant mutants. It also indirectly explains why a small number of test cases can kill a large number of mutants in mutation testing. From Table 2 and Figure 9, it can be seen that the number of identified redundant mutants varies widely among the experimental objects, and the contribution of the individual identification rules also varies considerably. Among the 8 C programs, the space program has the largest number of identified redundant mutants, 1964, whereas schedule has only 104. The validity analysis of the method is shown in Figure 9, which visualizes the proportion of redundant mutants under the different identification rules. The numbers of redundant mutants identified by $D_1$ and $D_3$ are relatively high, accounting for 37% and 39% of all identified redundant mutants, respectively; the numbers identified by $D_2$ and $D_4$ are relatively low, accounting for 5% and 19%, respectively.
Table 1: Relevant information of the program set.
print_tokens    565   343    4130  10881   583  3619
print_tokens2   510   355    4115   9369   545  3478
schedule        412   296    2650   3516   204  1441
schedule2       307   263    2710   5794   471  2045
replace         563   513    5542  21703  1556  7805
tcas            173   137    1680   6478   442  2044
tot_info        406   281    1052   4308   678  3654
space          9126  5982  136467   9380  1079  7986

Table 2: Number of redundant mutants identified by each rule.
Program          D1    D2    D3    D4  Total
print_tokens     78    32    62   180    352
print_tokens2    33     0    76   112    221
schedule         41     0    23    40    104
schedule2        18     7    27    53    105
replace         175     0   267   131    573
tcas              0     0   112    37    149
tot_info         86    16   213   186    501
space          1046   132   786     0   1964

The following two points are summarized: (1) the larger the program size is, the more efficiently the redundant mutants are identified; (2) from the identification rules, the distribution of program structures in a program can be judged, and this can be verified from the source code of the program. If there are more sequential structures in a program, rule $D_1$ identifies more redundant mutants. There are more branch structures and loop structures in the Siemens program set. The variables defined in branch and loop structures are mostly used in sequential blocks, which makes $D_3$ identify more redundant mutants. The use of variables in the Siemens program set is mostly through function calls, which makes $D_4$ identify more redundant mutants. Therefore, when there are more loop structures, the recognition rate of $D_3$ increases; when there are more function calls, rule $D_4$ plays an important role. As the program size increases, the number of mutants generated by the mutation operators increases accordingly, and the number of test cases also increases; in this case, the execution time of traditional mutation testing increases exponentially, which significantly raises the resources required for mutation testing.
The redundant mutants are identified by statically scanning the definition-use chains of the program and the set of mutants, so the cost of identifying redundant mutants is linearly related to the size of the program and is independent of the test cases. Therefore, the expensive resources for mutation testing are reduced to some extent.
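The per-rule proportions quoted for Figure 9 can be recomputed from the per-program counts in Table 2. The short sketch below does this arithmetic; the counts are copied from Table 2, while the variable names are only illustrative.

```python
# Redundant mutants identified per rule (D1, D2, D3, D4), taken from Table 2.
table2 = {
    "print_tokens":  (78, 32, 62, 180),
    "print_tokens2": (33, 0, 76, 112),
    "schedule":      (41, 0, 23, 40),
    "schedule2":     (18, 7, 27, 53),
    "replace":       (175, 0, 267, 131),
    "tcas":          (0, 0, 112, 37),
    "tot_info":      (86, 16, 213, 186),
    "space":         (1046, 132, 786, 0),
}

# Column sums: total number of redundant mutants found by each rule.
totals = [sum(column) for column in zip(*table2.values())]
grand_total = sum(totals)

# Share of each rule among all identified redundant mutants, in percent.
shares = [round(100 * t / grand_total) for t in totals]

# Row sums: total number of redundant mutants found in each program.
per_program = {name: sum(row) for name, row in table2.items()}

print(totals, grand_total, shares)
print(max(per_program, key=per_program.get), min(per_program, key=per_program.get))
```

The rounded shares come out as 37%, 5%, 39%, and 19% for $D_1$ to $D_4$, which shows that the percentages are taken relative to the 3969 identified redundant mutants rather than to the full mutant sets.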
To better show the effect of identifying redundant mutants based on two variables in data flow analysis, the results of the one-variable redundant mutant identification proposed by Sun et al. [12] are given in Table 3.
Since the improved redundant mutant identification based on data flow analysis contains the definition-reference chain of a single variable proposed by Sun et al., our proposed scheme has a higher identification rate than the scheme proposed by Sun et al. As can be seen in Figure 10, the improvement varies from one experimental subject to another: the space program has the highest improvement, with 498 additional redundant mutants identified, whereas the schedule2 program has the lowest improvement, with only 39 additional mutants.

Conclusion
This paper improves the redundant mutant identification technique based on data flow analysis with the aim of reducing the execution time of mutation testing. It combines the program block structure with data flow analysis and defines a set of redundant mutant identification rules based on weak mutation testing. The effectiveness of the proposed redundant mutant identification technique is evaluated using 8 C programs. The experiments show that a large number of redundant mutants can be identified using the method in this paper, which not only reduces the number of mutants but also shortens the execution time of mutation testing. Compared with the previous technique, our improvement substantially raises the redundant mutant identification rate and further optimizes the identification of redundant mutants. The main work in this paper is the improvement of the redundant mutant identification technique based on data flow analysis. We introduce the inclusion relation of redundant mutants by adding one variable to achieve redundant mutant identification for two variables, and the method can be extended to multiple variables. Our next work is a redundant mutant identification technique for arbitrary multiple mutations, together with better algorithms to further improve testing efficiency.
Data Availability
The data are available at http://sir.csc.ncsu.edu/portal/index.php.

Conflicts of Interest
The authors declare that they have no conflicts of interest.