Detecting Trivial Mutant Equivalences via Compiler Optimisations

Mutation testing realises the idea of fault-based testing, i.e., using artificial defects to guide the testing process. It is used to evaluate the adequacy of test suites and to guide test case generation. It is a potentially powerful form of testing, but it is well known that its effectiveness is inhibited by the presence of equivalent mutants. We recently studied Trivial Compiler Equivalence (TCE) as a simple, fast and readily applicable technique for identifying equivalent mutants for C programs. In the present work, we augment our findings with further results for the Java programming language. TCE can remove a large portion of all mutants because they are determined to be either equivalent or duplicates of other mutants. In particular, TCE equivalent mutants account for 7.4 and 5.7 percent of all C and Java mutants, respectively, while duplicated mutants account for a further 21 percent of all C mutants and 5.4 percent of all Java mutants, on average. With respect to a benchmark ground truth suite (of known equivalent mutants), approximately 30 percent (for C) and 54 percent (for Java) are TCE equivalent. It is unsurprising that results differ between languages, since mutation characteristics are language-dependent. In the case of Java, our new results suggest that TCE may be particularly effective, finding roughly half of all equivalent mutants.


INTRODUCTION
Mutation testing [1], [2] has attracted a lot of interest, because there is evidence that it is capable of simulating real faults [3], [4], [5] and subsuming other popular test adequacy criteria [6], [7], [8], [9]. It can be used as a technique for generating test data [10], [11] and for assessing test data quality, and it can also explore subtle faults [12], [13] in the presence of fault masking and failed error propagation [14].
A mutant is a syntactically altered version of the program under test. The syntactic alterations are typically small, and are designed to reflect typical faults that might reside in the original program. A mutant is said to be killed if a test case can be found that distinguishes between the mutant and the original program. The underlying idea of mutation testing is that test suites that kill many mutants will tend to be of higher quality than those that kill fewer. In this way, mutation testing can be used to assess the quality of a test suite, and can also be used to help test case generation, by guiding the construction of test cases towards those that kill mutants.
However, at the heart of mutation testing lies a problem that has been known to be undecidable for more than three decades [15]: the equivalent mutant problem. That is, mutation testing might produce a mutant that is syntactically different from the original, yet semantically identical. In general, determining whether a syntactic change yields a semantic difference is undecidable. As a result, the tester would never know whether he or she has failed to find a killing test case because the mutant is particularly hard to kill, yet remains killable (a 'stubborn' mutant [16]), or whether failure to find a killing test case derives from the fact that the mutant is equivalent.
A related, newly identified problem is that of mutant duplication. A duplicated mutant is simply a mutant that is semantically equivalent to some other mutant, although both duplicated mutants may be semantically different from the original program. Duplicated mutants are also a problem for mutation testing, because they may artificially inflate the apparent mutant killing power of a test suite; a test case that kills two or more duplicated mutants is, all else being equal, no better than another test case that kills only a single non-duplicated mutant.
Although these problems are theoretically undecidable, practical techniques may be able to significantly dent them by detecting a proportion of equivalent and duplicated mutants. Equivalent mutant detection techniques have been extensively studied since 1979. Nevertheless, no scalable, widely applicable technique has yet been found. Previous work on the detection of equivalent mutants has involved complicated program transformation techniques, which have proved difficult to scale and, thereby, have remained insufficiently practical to find implementation in current mutation testing tools and techniques. The equivalent mutant problem therefore remains the single most potent barrier to the wider uptake and exploitation of the potential power of mutation testing.
In this paper we study Trivial Compiler Equivalence (TCE) as a simple, fast and widely applicable technique for detecting equivalent mutants. The paper is an extension of our previous ICSE conference paper [26], which studied the application of TCE to the detection of equivalent and duplicated mutants in the C programming language. The present paper extends this previous study to also consider the Java programming language, allowing us to compare TCE performance on these two widely-used languages. The extended results further confirm that TCE is a highly effective and readily applicable technique, with strong evidence to suggest that it may be even more effective when applied to Java than it is already known to be when applied to C. Specifically, while TCE finds, on average, approximately one third of the equivalent mutants in C programs, it finds approximately half of the equivalent mutants in Java. These new findings for Java are based on the study of a known equivalent ground truth set, which we have augmented for this study (and make available for replication and further study). We also study the application of TCE to much larger Java programs, for which no ground truth is available, reporting results for the total number of equivalent and duplicated mutants found (using both the standard Java compiler and the SOOT analysis framework).
Overall, we believe that the findings regarding TCE are extremely encouraging. It can ameliorate the adverse effects of the equivalent and duplicated mutant problems for both C and Java programs by removing such valueless mutants (by an average of approximately 10% for Java and nearly 30% for C) and, as a consequence, reduces the overall work needed to develop mutation adequate test suites by approximately 37%, while, at the same time, improving the accuracy of the mutation score measurement by 0%-18% for Java and 0%-16% for C (depending on the ratio of the killed mutants). Furthermore, and fundamental to its success and importance, TCE is not a complicated technique; it can easily be implemented and added to any mutation testing study. It has already been included in the mutation testing tool MILU (Version 3.2), and we were easily able to incorporate TCE analysis into the results produced by the C and Java mutation testing tools PROTEUM [27] and MUJAVA [28].
The rest of the paper is organised as follows: Section 2 presents mutation testing and related approaches. Section 3 details our experiment and the studied research questions, while Sections 4 and 5 analyse our results. Our findings are discussed in Section 6. Finally, the threats to validity are presented in Section 7, while Section 8 concludes with potential directions for future work.

Mutation Testing
Mutation testing embeds artificial defects in the programs under test. These defects are called mutants and they are produced by simple syntactic rules, e.g., changing a relational operator from > to ≥. These rules are called mutant operators. Applying an operator only once, so that the defective program has only one syntactic difference from the original one, produces a first order mutant. Making several syntactic changes, i.e., applying the operators multiple times, produces a higher order mutant. In this paper we consider only first order mutants. These are generated by applying the operators at all possible locations of the program under test, as supported by version 3.2 of MILU and version 3 of MUJAVA. Additional information about the corresponding operators can be found in Section 3.4.
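To make the notion of a first order mutant concrete, the following is a minimal sketch of a relational operator replacement applied at every possible location of a single expression. The replacement table and the function name are assumptions made for illustration; they do not reproduce the operator sets of MILU or MUJAVA, and the simple pattern matching assumes expressions without shift or arrow operators.

```python
import re

# Hypothetical, simplified replacement table: each relational operator is
# swapped for exactly one alternative (real tools use several alternatives).
REPLACEMENTS = {
    ">": ">=", "<": "<=", ">=": ">", "<=": "<", "==": "!=", "!=": "==",
}

# Two-character operators are listed first so that ">=" is not read as ">".
RELATIONAL = re.compile(r"==|!=|>=|<=|>|<")


def first_order_mutants(expression):
    """Return one mutant per operator occurrence, each with a single change."""
    mutants = []
    for match in RELATIONAL.finditer(expression):
        replacement = REPLACEMENTS[match.group()]
        mutants.append(
            expression[:match.start()] + replacement + expression[match.end():]
        )
    return mutants
```

For the expression `a > b && c == d`, the sketch yields two first order mutants, one per relational operator, each differing from the original in exactly one place.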
By measuring the ability of the test cases to expose mutants, an effectiveness measure can be established. Mutants are exposed when their outputs differ from those of the original program. When a mutant is exposed, it is termed killed; otherwise, it is termed live. Of course, ideally, equivalent mutants should be removed from the test effectiveness assessment. Doing so gives the effectiveness measure called mutation score, i.e., the ratio of the exposed mutants to the number of introduced mutants, excluding the equivalent ones.
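The mutation score definition can be written down as a small helper. This is a minimal sketch; the function and parameter names are illustrative assumptions and are not taken from any of the tools discussed here.

```python
def mutation_score(killed, generated, equivalent=0):
    """Ratio of killed mutants to generated mutants, excluding equivalent ones."""
    killable = generated - equivalent
    if killable <= 0:
        raise ValueError("no killable mutants to score against")
    if not 0 <= killed <= killable:
        raise ValueError("killed must lie between 0 and the killable count")
    return killed / killable
```

With 100 generated mutants of which 60 are killed, the score is 0.6 if equivalent mutants are ignored, but 0.75 once 20 known equivalent mutants are excluded from the denominator. This illustrates why undetected equivalent mutants make the measured score an underestimate.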

Equivalent Mutants
Early research on mutation testing has demonstrated that deciding whether a mutant is equivalent is an undecidable problem [15]. Undecidability of equivalences means that it is unrealistic to expect all the equivalent mutants to be removed; the best we can hope for is effective algorithms that can remove most equivalent mutants. Currently, a large number of mutants must pass a manual equivalence inspection [16]. This constitutes a significant cost. In addition, effort is wasted when testers generate test cases, either manually or automatically, in attempting to kill equivalent mutants. Apart from the human effort, there is a computational cost: since equivalent mutants cannot be killed, they have to be exercised on the entire test suite, whereas killable mutants only require executions until they are killed.
Fortunately, partial and heuristic solutions exist [31]. However, tackling the equivalent mutant problem is hard. This is evident from the fact that very few attempts exist. According to a recent systematic literature review on the equivalent mutant problem [44], which identified 17 relevant techniques (in 22 articles), the problem is tackled in three ways. One is to address the problem directly by detecting some equivalent mutants, while the other two try to reduce their effects by avoiding their creation or by suggesting likely non-equivalent ones to help with the manual analysis process. Following the terminology of Madeyski et al. [44], we refer to them as the Detect, Avoid and Suggest approaches, respectively.
Table 1 summarises the current state-of-the-art techniques in chronological order, focussing on the most recent ones. Specifically, it records: the publication details, column "Author(s) [Reference]"; the year of the publication, column "Year"; the studied programming language, column "Language"; the size of the largest program used, column "Largest Subject"; the number of equivalent mutants studied, column "#Eq. Mutants"; the existence of an automated publicly available tool, column "Publicly Av. Tool"; the category of the approach, i.e., detection, avoidance or suggestion, column "Category"; and the main findings of the publication, column "Findings". From this table it becomes evident that very few methods and tools exist. Regarding equivalent mutant detection, only two publicly available tools exist, with the largest considered subject being composed of 319 lines of code. It is noted that all the "large" subjects, i.e., those having more than 1,000 lines of code, that were used in previous research involve a form of sampling. Mutants are sampled from the studied projects with no information about the size of the component/class in which these mutants are located. Along these lines, in Table 1 we report the size of the projects that we consider. It is noted that the purpose of this table is to summarise the related work on equivalent mutants by focussing on the most recent advances. Further details on the subject can be found in the systematic literature review of Madeyski et al. [44].
Acree [30] studied killable and equivalent mutants, and found that testers correctly identified equivalent mutants for approximately 80% of the cases. In 12% of the cases, equivalent mutants were identified as killable and in 8%, killable mutants were identified as equivalent. This indicates that detection techniques, such as the one suggested by the present paper, not only help in saving resources but also in reducing the mistakes made by humans.
The idea of using compiler optimisation techniques to detect equivalent mutants was suggested by Baldwin and Sayward [29]. The main intuition behind this technique is that code optimisation rules, such as those implemented by compilers, form transformations on equivalent programs. Thus, when the original program can be transformed by an optimisation rule to one of its mutants, then this mutant is, ipso facto, equivalent. Baldwin and Sayward proposed adapting 6 compiler optimisation transformations. These transformations were then studied by Offutt and Craft [31], who implemented them inside Mothra, a mutation testing tool for Fortran. They found that on average 45% of the equivalent mutants can be detected. Our approach is inspired by this line of research, which recruits compilers to assist in equivalent mutant detection. As already discussed and demonstrated in the prior, conference version of this work [26], it is surprisingly effective for the case of the C programming language. However, we propose a truly simple (and therefore scalable and directly exploitable) use of compilers, which remained unexplored. Instead of deliberately implementing specialised techniques, TCE simply declares equivalences only for those mutants whose compiled object code is identical to the compiled object code of the original program. As indicated by our empirical findings, in Section 6, our approach is impressively effective, practical and scalable.
Offutt and Pan [32], [33] developed an automatic technique to detect equivalent mutants based on constraint solving. This technique uses mathematical constraints to formulate the killing conditions of the mutants. If these conditions are infeasible, then the mutants are equivalent.
Nica and Wotawa [42] implemented a similar constraint-based approach to detect equivalent mutants and demonstrated that many equivalent mutants can be detected. Voas and McGraw [34] suggested that program slicing can help in detecting equivalent mutants. Later, Hierons et al. [35] showed that amorphous program slicing can be used to detect equivalent mutants as well. Although potentially powerful, these techniques suffer from the inherent limitations of the constraint-based and slicing-based techniques.
It is evident that the constraint-based approach [32], [33] was assessed on programs consisting of at most 29 lines of code, while the slicing technique remains unevaluated apart from worked examples. The scalability of these approaches is inherently constrained by the scalability of the underlying constraint handling and slicing technology. Furthermore, a new implementation is required for every programming language to be considered. By contrast, TCE applies to any language for which a compiler exists and so is as scalable as the compiler itself.
Kintis and Malevris [48], [49] used data-flow patterns and showed that a large proportion of equivalent mutants and partially equivalent mutants, i.e., mutants equivalent only under specific program paths, form data-flow anomalies. Bardin et al. [50] used static analysis techniques, such as Value Analysis and Weakest Precondition calculus, to detect mutants that are equivalent because they cannot be infected. Their results show that a significant number of those mutants can be detected. Although promising, these two methods have only been evaluated with fewer than 200 equivalent mutant instances and so their effectiveness, efficiency and practicality remain unknown.
Hierons et al. [35] suggested using program slicing to reduce the size of the program considered during the equivalence identification. Thus, testers can focus on the code relevant to the examined mutants. Harman et al. [36] also suggested using dependence-based analysis as a complementary method to assist in the detection of equivalent mutants.
Adamopoulos et al. [37] suggested the use of coevolutionary techniques to avoid the creation of equivalent mutants. In this approach, test cases and mutants are simultaneously evolved with the aim of producing both high quality test cases and mutants. However, these previous approaches have been evaluated only on case studies and synthetic data, so their effectiveness and efficiency remain unknown.
More recently, several studies sought to measure the impact of mutant execution. Instead of finding a partial but exact solution to the problem, as done by the Detect approaches, they try to classify the mutants to help identify likely killable ones and likely equivalent ones, based on their dynamic behaviour.
This idea was initially suggested by Grun et al. [38] and developed by the studies of Schuler et al. [39] and Schuler and Zeller [40], [41], who found that impact on coverage can accurately classify killable mutants. Kintis et al. [45], [46] further developed the approach, using the impact of mutants on other mutants, i.e., using higher order mutants. Papadakis et al. [47] proposed a mutation testing strategy that takes advantage of mutant classification. Finally, mutants belonging to software clones have been shown to exhibit analogous behaviour with respect to their equivalence [43]. Thus, knowledge about the (non-)equivalence of a portion of such mutants can be leveraged to analogously classify other mutants belonging to the same clones.
Apart from the technical differences between TCE and the existing approaches, as discussed above, there is also a fundamental difference: the identification of duplicated mutants. Existing approaches only aim at equivalent mutants, while TCE tackles the general problem of mutant equivalences.

Reducing the Cost of Mutation Testing
Mutant sampling has been suggested as a possible way to reduce the number of mutants. Empirical results demonstrate that even small samples [18] can be used as cost effective alternatives to perform mutation testing [17], [19]. Other approaches select mutant operators. Instead of sampling mutants at random, they select mutant operators that are empirically found to be the most effective. To this end, Offutt et al. [51] demonstrated that five mutant operators are almost as effective as the whole set of operators.
More recently, Namin et al. [52] used statistically identified optimal operator subsets. Other cost reduction methods involve mutant schemata [23], [53]. This technique works by parameterising all the mutants through instrumentation, i.e., introducing all the mutants into one parameterised program. However, apart from the inherent limitations of this technique [28] and the execution overheads that it introduces, it also makes all the equivalent mutant detection techniques inapplicable.
Other approaches identify redundant mutants that fail to contribute to the testing process. Kintis et al. [54] defined the notion of disjoint mutants, i.e., a set of mutants that is representative of all the others (killing them implies killing all the others), and found that 9% of all mutants are disjoint. Ammann et al. defined minimal mutants using the notion of subsumption [55] and demonstrated that a small set of mutants, approximately 1.2%, subsumes all the others. Based on these works, Papadakis et al. [56] demonstrated that redundancy among mutants has a very good chance (> 60%) of inflating the mutation score and leading to biased results. Along the same lines, Kurtz et al. [57] analysed the validity of selective mutation and found that selective mutants score relatively low with respect to subsuming mutants.
Kaminski et al. [58], [59] and Just et al. [60] leveraged the suggestions made by Tai [61] on fault-based predicate testing and demonstrated that it is possible to reduce the redundancy within the relational and logical operators. Higher order mutation can also reduce mutant numbers: sampling [19], [44] and searching [13], [62], [63] within the space of higher order mutants both reduce the number of mutants and also the number of equivalent mutants.

EXPERIMENTAL STUDY AND SETTINGS
This section details the settings of our experiment. First, it presents the TCE approach (Section 3.1) and the posed research questions (Section 3.2). Next, the studied C and Java programs are described (Section 3.3), along with the employed mutant operators (Section 3.4), and, finally, the execution environment (Section 3.5).

Detecting Mutant Equivalences: the TCE approach
Executable program generation involves several transformation phases that change the machine code. Different optimisation transformation techniques result in different executables. However, when there exist multiple program versions with identical machine code, there is no point in differentiating them with test data; it is safe to declare them functionally equivalent. TCE realises this idea to detect mutant equivalences. It declares equivalent any two program versions with identical machine code. TCE simply compiles each mutant and compares its machine code with that of the original program. Similarly, TCE also detects duplicated mutants, by comparing each mutant with the others residing in the same unit, i.e., function. As the reader will easily appreciate, the TCE implementation is truly trivial, hence its name: it is a compile command combined with a comparison of binaries.
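The classification step of TCE can be sketched in a few lines, given the compiled binaries of the original program and of the mutants of one unit. This is a minimal sketch under stated assumptions: the function name is hypothetical, the binaries are passed in as byte strings, and hashing is used here as a convenient stand-in for the pairwise binary comparison described above.

```python
import hashlib
from collections import defaultdict


def classify_mutants(original_binary, mutant_binaries):
    """Split the mutants of one unit into TCE-equivalent and duplicated ones.

    mutant_binaries maps a mutant name to the bytes of its compiled code.
    Returns (equivalent, duplicated), both as lists of mutant names.
    """
    original_digest = hashlib.sha256(original_binary).hexdigest()
    equivalent = []
    groups = defaultdict(list)
    for name, binary in mutant_binaries.items():
        digest = hashlib.sha256(binary).hexdigest()
        if digest == original_digest:
            # Same machine code as the original: declared equivalent.
            equivalent.append(name)
        else:
            groups[digest].append(name)
    duplicated = []
    for names in groups.values():
        # Keep the first mutant of each group as its representative;
        # the remaining ones are duplicates of it.
        duplicated.extend(names[1:])
    return equivalent, duplicated
```

Grouping by digest avoids comparing every pair of mutants explicitly, while still declaring two versions equivalent only when their compiled code is identical.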

Research Questions
The mutation testing process is affected by the distorting effects of the equivalent and duplicated mutants on the mutation score calculation. Therefore, a natural question to ask is how effective the TCE approach is in detecting equivalent and duplicated mutants. This poses our first RQ:
RQ1 (Effectiveness): How effective is the TCE approach in detecting equivalent and duplicated mutants?
We answer this question by reporting the prevalence of the equivalent and duplicated mutants detected by the TCE approach using gcc and SOOT.
To reduce the confounding effects of different compiler configurations, we apply four popular options for gcc and two for SOOT on the selected classes/packages, and report the number of the equivalent and duplicated mutants found. SOOT does not support multiple levels of optimisations; thus, we only use its intra-procedural optimisations and report the equivalent and duplicated mutants found. The answer to this question also allows the estimation of the amount of effort that can be saved by the TCE method.
The existing equivalent mutant detection techniques suffer from performance and scalability issues. As a result, the authors are unaware of any mutation testing system that includes a proposed equivalent mutant detection technique. By contrast, TCE is static, and can be applied to any program that can be handled by a compiler. This makes TCE potentially scalable, but we need additional empirical evidence to determine the degree to which it scales. Hence, in the second RQ, we seek to investigate the observed efficiency and the scalability of the TCE approach:
RQ2 (Efficiency): How efficient and scalable is the TCE approach?
To demonstrate the scalability, we use selected classes/packages from 12 large open source projects, 6 for each studied programming language, and we report the efficiency of the mutant generation, equivalent mutant detection and duplicated mutant detection processes. For the case of gcc we also explore the trade-off between the effectiveness and efficiency using different compiler settings.
To decide when it is appropriate to stop the testing process, testers need to know the mutation score. To this end, they need to identify equivalent mutants. The TCE approach improves the approximation by determining such mutants. However, to what extent? This is investigated in the next RQ:
RQ3 (Equivalent Mutants): What proportion of the equivalent mutants can be detected? What types of equivalent mutants can be detected?
To answer RQ3, we need to know the 'ground truth': how many equivalent mutants are there in the subjects studied? We therefore applied the TCE approach on two benchmark sets, one for each studied programming language, with hand-analysed, ground-truth data on equivalent mutants. The first benchmark, pertaining to the C test subjects, includes 990 manually-identified equivalent mutants over 18 small- and medium-sized subjects. The second one, for the Java programs, comprises 196 equivalent mutants selected over 6 small- and medium-sized subjects, detected with manual analysis.
We report the proportion of the equivalent mutants found by TCE. We also analyse and report the types of the detected equivalent mutants. This information is useful in the design of complementary equivalent mutant detection techniques.
Mutation testers usually employ subsets of mutant operators. Therefore, knowing about the relationship between the operators and the equivalent and duplicated mutants found by TCE is useful, in the sense that mutation testers can better understand the importance of their choices. Hence, the next RQ examines the extent of the equivalent and duplicated mutants found per mutant operator:
RQ4 (Impact on Mutant operators): What is the contribution of each operator to the proportion of equivalent and duplicated mutants found by TCE?
Among the several factors that can affect TCE is the program size. Thus, one might expect the size of a program to influence equivalent mutant identification. This is the subject of RQ5, which we answer by investigating correlations between the number and proportions of both equivalent and duplicated mutants found by TCE and the program and mutant set size.
Finally, since we have results for both C and Java, we investigate the similarities and differences between the two sets of programs.Thus we ask:

RQ6 (Differences between programming languages):
What are the similarities and differences between C and Java with respect to TCE?
To answer this question we compare the results of C and Java and try to provide insights on the differences between C and Java as viewed by mutation testing.

Subject Programs
We used two categories of subject programs for both C and Java. The first category is composed of 6 large to medium open source programs for each language. In this set, we chose 'real-world' programs that vary in size and application domain. The second category of programs was taken from the studies of Yao et al. [16] and Kintis and Malevris [49]. We chose these sets because they are accompanied by manually-identified equivalent mutants. The availability of known equivalent mutants allows us to answer RQ3, because it provides a 'ground truth' on the undecidable equivalence question for a set of subjects. The remaining RQs are answered using the larger programs.
Regarding the large programs, compiling all their mutants constitutes a time-consuming task. This is due to the increase in the number of mutants with the size of the programs. It is evident from our reported results, presented in Section 4.2, where it took more than 50 hours to compile only the mutants involved in the Vim-Eval component (under -O3). TCE may be scalable in itself, but applying it to all possible mutants of a large program is clearly infeasible.
Though there are techniques to reduce the number of mutants, e.g., sampling, we prefer not to use them, lest we unintentionally bias our sample of mutants. It is safer to sample, in a systematic way, over the code to be mutated, so that we do not pre-exclude any mutants from our investigation. Therefore, for each C program we rank its source files according to their lines of code and select the two largest components (source code files). On these two components we apply mutation to all the functions they contain. In Java we followed a similar process: we ranked all the project packages according to their size and, among the four largest packages, selected the three largest classes that could be handled without a problem by MUJAVA.
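The systematic selection of C components can be sketched as follows. This is a minimal sketch and not the script used in the study: the function names are hypothetical, and counting raw lines is an assumption, since the exact way lines of code were counted is not specified here.

```python
from pathlib import Path


def count_lines(path):
    """Count the raw lines of a file (an assumed stand-in for lines of code)."""
    with path.open("rb") as handle:
        return sum(1 for _ in handle)


def largest_components(root, suffix=".c", how_many=2):
    """Rank the source files under root by size and return the largest ones."""
    files = [p for p in Path(root).rglob("*" + suffix) if p.is_file()]
    # Sort by descending line count; ties are broken by path for determinism.
    ranked = sorted(files, key=lambda p: (-count_lines(p), str(p)))
    return ranked[:how_many]
```

Calling `largest_components("path/to/project")` returns the two largest `.c` files, to which mutation would then be applied across all the functions they contain.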
Tables 2 and 3 respectively present the information about the first category of subject programs for C and Java. Regarding Table 2 (large C subjects), Gzip and Make are GNU utility programs. The first program performs file compression and the second one automatically builds executable files from several source code files. The two largest components of Gzip are 'trees' and 'gzip'. The former implements the source representation using variable-length binary code trees and the latter implements the main command line interface for the Gzip program. The two largest components of the Make program are 'main' and 'job'. The latter implements utilities for managing individual jobs during the source building processes and the former implements the command line interface. The GSL (GNU Scientific Library) is a C/C++ numerical library, which provides a wide range of common mathematical functions. Its two largest components are 'gen' and 'blas'. The 'gen' component implements utilities that compute eigenvalues for generalised vectors and matrices. The 'blas' component implements BLAS operations for vectors and dense matrices.
The program MSMTP is an SMTP client for sending and receiving emails. The components studied are 'smtp' and 'msmtp'. The 'smtp' component implements the core utilities for exchanging information with SMTP servers and the 'msmtp' component implements the command line interface.
The program Git is a source code management system and the components selected are 'refs' and 'diff'. The 'refs' component implements the 'reference' data structure that associates history edits with SHA-1 values and the 'diff' component implements utilities for checking differences between git objects, for example commits and working trees.
Finally, the program Vim is a configurable text editor. The selected components, 'spell' and 'eval', implement utilities for spell checking and built-in expression evaluation, respectively.
The first two columns of Table 3 (large Java subjects) refer to the first category of programs and their size in terms of source code lines. The domains of the chosen subjects range from mathematics libraries (Commons-Math) to build systems (Ant). The application domains of the remaining subjects appertain to enhancements of Java's core class (Commons-Lang), bytecode manipulation (BCEL), date and time manipulation (Joda-Time) and database applications (H2). The size of the studied Java programs ranges between 16,753 and 104,479 source code lines. The next two columns of the table present the names of the utilised packages and the size of the considered classes, respectively. Finally, the last two columns of the table refer to the number of methods that belong to the examined classes and the number of the generated mutants.
The second category of subjects contains 18 C and 6 Java programs. The C programs comprise 8 programs with 10 to 42 lines of code, 7 programs with 137 to 564 lines and 3 real-world programs with 9,564 to 35,545 lines. Additional details for these programs can be found in the work of Yao et al. [16]. Details regarding the Java programs are given in Table 4. The first two columns of the table present the examined programs and the considered methods. Bisect is a simple program that calculates square roots, Commons-Lang and Joda-Time are enhancements to the Java core library and time manipulation libraries, Pamvotis is a wireless LAN simulator, Triangle is the classic triangle classification program and XStream is an XML object serialisation framework. The last two columns of the table present the number of the generated and manually-identified equivalent mutants. It is noted that for the purposes of the present study we extended the original set of programs by manually analysing approximately 400 additional mutants. Thus, in total the considered set is composed of 1,542 manually analysed mutants, out of which 196 are equivalent.

Mutant Operators
Based on previous research on mutant operator selection, we identify and use two sets of operators (one for C and one for Java). The C set of operators was based on [64] and is composed of 10 operators. A detailed description of the operators is reported in Table 5.
We detail exactly how these operators were applied, since this is an important piece of information that differs from one tool to another. The ABS and UOI operators were only applied to numerical variables. The CRCR operator was applied to integer and floating-point numeric constants. No mutant operator was applied to the variables on the left-hand side of assignment statements; we only apply them to the right-hand sides. This is an implementation choice that avoids the generation of duplicated mutants (as any variable on the left-hand side of assignment statements will be used, and mutated, later in the program). All operators are applied recursively to all sub-expressions.
With respect to the Java programming language, we used all the method-level operators of MUJAVA (version 3) [28]. This means that we excluded all the object-oriented mutation operators. Previous research [65] has shown that object-oriented mutation operators produce a small number of mutants and a rather low number of equivalent ones; thus, there is no need to investigate this case. MUJAVA supports a wide range of mutant operators built on the experience and studies of Offutt and colleagues, i.e., [28] and [51].
Table 6 describes the employed mutant operators: the first column of the table presents their names and the second one the mutation they impose. In total, 15 mutant operators were utilised, which fall into 6 general categories: arithmetic operators, relational operators, conditional operators, shift operators, logical operators, and assignment operators.
We use these operators due to their extensive use in the literature [2]. To generate the C mutants, we use MILU [66], and for the Java mutants, MUJAVA (version 3).
Further details and the implementation of the tools and their operators can be found at the webpages of MILU 8 and MUJAVA 9.

Experimental Environment
Two series of experiments were conducted: the first for programs written in C and the second for programs written in Java. All the experiments on the C programs were undertaken on the Microsoft Azure Cloud platform, using an A9 Compute Intensive Instance running the Ubuntu 14.04 operating system with the gcc 4.8 compiler. To compile the mutants we used four configuration options. We compile with no optimisation settings, denoted as None, and with the three popular ones, as realised by the gcc compiler, denoted as -O, -O2 and -O3. We use the Linux time utility to measure the CPU execution time of all the involved processes. To check whether two binaries are equivalent we use the 'diff' utility with the flag '--binary'. In short, we use 'gcc -flag' combined with 'diff'.
All the experiments regarding the Java language were performed on a physical machine running Fedora 22, equipped with an i7 processor (3.40 GHz, 4 cores) and 16GB of memory. TCE relies on compiler optimisation to detect mutant equivalences. While in programming languages such as C or C++ many optimisation options have been embedded within the language compilers, e.g., gcc, this does not hold true for the standard Java compiler, i.e., javac. Despite the fact that javac does not possess advanced optimisation capabilities at compilation time, it is able to detect some mutant equivalences.
In order to successfully apply TCE to Java, compiler optimisations are required. To this end, we used SOOT [67], a popular framework for analysing and transforming Java applications. SOOT implements various analysis and transformation procedures. We utilised the -O option of the tool, which performs intra-procedural optimisations. Such optimisations include the 'elimination of common subexpressions' and 'copy and constant propagation', among others. As in the case of the C language, we used the diff command line tool to compare the optimised classes.
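The detection step described above can be illustrated with a minimal sketch. It assumes the original program and each mutant have already been compiled (or optimised) and that their artefacts are available as byte strings; the function name `tce_classify` and the toy byte strings are hypothetical and stand in for the output of gcc or SOOT. The comparison is a plain byte-for-byte equality check, which plays the role of diff here.

```python
from typing import Dict, List, Tuple


def tce_classify(original: bytes, mutants: Dict[str, bytes]) -> Tuple[List[str], List[str]]:
    """Return (equivalent, duplicated) mutant ids from compiled artefacts.

    A mutant is reported as equivalent when its compiled bytes are identical
    to those of the original program. Among the remaining mutants, every
    mutant whose bytes match an earlier mutant is reported as a duplicate to
    discard, so one representative of each group is kept.
    """
    equivalent: List[str] = []
    duplicated: List[str] = []
    representatives: List[bytes] = []
    for mutant_id in sorted(mutants):
        binary = mutants[mutant_id]
        if binary == original:
            equivalent.append(mutant_id)
        elif any(binary == kept for kept in representatives):
            duplicated.append(mutant_id)
        else:
            representatives.append(binary)
    return equivalent, duplicated


# Toy byte strings standing in for compiler output (hypothetical data).
original_binary = b"\x01\x02\x03"
compiled_mutants = {
    "m1": b"\x01\x02\x03",  # optimised to the same code as the original
    "m2": b"\x01\x02\x04",
    "m3": b"\x01\x02\x04",  # same code as m2, so it can be discarded
    "m4": b"\x09\x09\x09",
}
equivalent, duplicated = tce_classify(original_binary, compiled_mutants)
print(equivalent, duplicated)
```

Only the discarded duplicates are returned, which mirrors the way duplicated mutants are counted in the results that follow.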

TCE VIA GCC
This section reports the results pertaining to the C programming language. Sections 4.1 and 4.2 respectively present results regarding the TCE effectiveness and efficiency. Sections 4.3 and 4.4 detail our results regarding the ground truth and the mutant operators. Finally, Section 4.5 investigates the impact of program size on TCE.

gcc: TCE Effectiveness
To assess the effectiveness of the TCE approach, answering RQ1, we measure the number of the detected equivalent and duplicated mutants. We also measure the proportions of these mutants per program, computed as the percentage of detected to introduced mutants. When mutants are mutually equivalent to each other, i.e., they are duplicated, one of them should be kept, while the other(s) should be discarded. In our results we only report the number of mutants that should be discarded.
Table 7 reports our results per program and per considered optimisation option. Overall, these results indicate that TCE can detect in total 9,551 equivalent mutants, accounting for 7.4% of all mutants. TCE also detected 27,163 duplicated mutants, which account for 21% of all mutants. Overall, TCE can thus identify and remove approximately 28% of all mutants as being useless.
Figure 1 depicts the proportions of both equivalent and duplicated mutants detected per program. The horizontal axis of the graph is ordered by the size of the components, while the vertical axis records the proportions of mutants detected. From these results, it is evident that all the subjects have a reasonably high proportion of equivalent and duplicated mutants. The proportion of equivalent mutants detected varies from program to program: in the worst case it is 2%, while in the best, 17%. We observe a small variation in the proportions of the identified equivalent and duplicated mutants. The only exceptions are the Gsl-Blas and Gsl-Gen components. In the former case, TCE detects many equivalent mutants and very few duplicated ones, while in the latter case it detects very few equivalent mutants and a ratio of duplicated mutants similar to that of the other programs. This divergence is mainly attributed to the internal structure and code characteristics of the components.
Finally, Table 7 reveals that, depending on the options used, the detected equivalences differ. For instance, the -O3 option found on average 84% and 100% of the equivalent and duplicated mutants that are detected by applying all the options. Interestingly, with respect to equivalent mutants, among the different optimisation options, i.e., -O, -O2 and -O3, there is no clear winner and their behaviour varies between programs. However, the overall differences between the options are relatively small. With respect to duplicated mutants, the results are clear and show that the best options are -O2 and -O3.

gcc: TCE Efficiency
To assess the efficiency of the TCE approach and answer RQ2, we report the CPU execution time. Table 8 summarises the execution time of TCE in total, on average and per employed component, using the four studied compiler settings. The columns 'Comp.', 'Eq.D.' and 'D.D.' record the execution time of the compilation process, the equivalent mutant detection and the duplicated mutant detection, respectively, per considered compilation option.
These results reveal that the execution time of the equivalence detection process is reasonably small compared to that of compilation. For instance, TCE requires on average 22 seconds, for all cases, to detect equivalent mutants, while the average compilation cost is 5,942 seconds in the best case.
A similar case arises when considering the costs for detecting duplicated mutants. While this is approximately an order of magnitude higher than the cost of detecting equivalent mutants, it is still reasonable: 225 seconds, and no more than 1/30 of the cheapest compilation cost. It is noted that our approach checks for equivalences only among the mutants that are located in the same function. Therefore, the reported time depends on the number of mutant combinations within each function of the project and not on the number of combinations among all project mutants.
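The effect of restricting the duplicate check to mutants of the same function can be sketched by counting worst-case pairwise comparisons. The per-function mutant counts below are hypothetical and only illustrate the shape of the saving; the function name `pairwise_comparisons` is likewise an assumption made for this example.

```python
from math import comb
from typing import Dict


def pairwise_comparisons(mutants_per_function: Dict[str, int]) -> int:
    """Worst-case number of mutant pairs compared when only mutants of the
    same function are checked against each other."""
    return sum(comb(count, 2) for count in mutants_per_function.values())


# Hypothetical distribution of 100 mutants over three functions.
counts = {"f": 40, "g": 25, "h": 35}

within_functions = pairwise_comparisons(counts)
across_project = comb(sum(counts.values()), 2)
print(within_functions, across_project)
```

For this toy distribution the per-function strategy needs at most 1,675 comparisons, against 4,950 if every mutant were compared with every other mutant of the project.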
Our results show that the compilation time of the -O3 option is almost 5 times higher than that of the None option. However, this is counterbalanced by the improved effectiveness of the option. In this case, the total time spent on compiling, detecting equivalent mutants and detecting duplicated mutants is 374,162, 260 and 2,744 seconds, respectively. Therefore, TCE analysed 129,161 mutants in 377,166 seconds. This amounts to less than 3 seconds per mutant, suggesting that its application is reasonable.
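The per-mutant cost follows directly from the figures reported for the -O3 option. The short calculation below only restates that arithmetic; the variable names are chosen for this illustration.

```python
# Figures reported for the -O3 option (seconds).
compile_seconds = 374_162
equivalent_detection_seconds = 260
duplicate_detection_seconds = 2_744
mutants_analysed = 129_161

total_seconds = (
    compile_seconds + equivalent_detection_seconds + duplicate_detection_seconds
)
seconds_per_mutant = total_seconds / mutants_analysed

print(total_seconds)                 # 377166
print(round(seconds_per_mutant, 2))  # 2.92
```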

gcc: Equivalent Mutants
To determine the ratio of detected to all existing equivalent mutants, we applied TCE to the equivalent mutants identified by Yao et al. [16], using the accompanying website data 10. This site is regularly updated, so data may differ slightly from those previously reported [16]. Additional details about these data can be found on the website.
Table 9 reports the number and the proportions of equivalent mutants detected by TCE when using the different settings. The results are surprisingly good. They reveal that, out of all the existing equivalent mutants, TCE can detect from 9% to 100% of them (30% on average). With respect to the total number of mutants (killable and equivalent ones), the TCE equivalent ones are approximately 7%. These results are achieved within a few seconds, with the potential to save considerable manual and computational resources. Together with the previously presented results, we conclude that TCE is effective and practically applicable to large real-world programs.
Regarding the types of the detected equivalent mutants, i.e., the second part of RQ3, we recall that equivalent mutants are equivalent because: a) they reside in unreachable code, b) it is impossible to affect the program state that pertains immediately after mutant execution, or c) there is no possible way to propagate the infection they introduce to the program output. Interestingly, the equivalent mutants detected by TCE fall within all of these categories. In particular, TCE detected 6%, 25% and 45% of the equivalent mutants caused by a), b), and c), respectively.
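The three causes of equivalence can be made concrete with toy functions. The examples below are written for this illustration and are not taken from the studied programs; each pair consists of a hypothetical original and a mutant that falls into category a), b) or c). Agreement over a sample of inputs is checked only as a sanity test; it is not a proof of equivalence.

```python
from typing import Callable, Iterable


# (a) Unreachable code: the mutated statement can never execute.
def original_a(x: int) -> int:
    if x != x:  # never true for integers
        return x + 1
    return x


def mutant_a(x: int) -> int:
    if x != x:
        return x - 1  # mutated, but unreachable
    return x


# (b) No infection: the mutant executes but leaves the state unchanged.
def original_b(x: int) -> int:
    square = x * x
    return square + 1


def mutant_b(x: int) -> int:
    square = abs(x * x)  # ABS-style mutation of a value that is never negative
    return square + 1


# (c) No propagation: the local state differs, but the output does not.
def original_c(x: int) -> int:
    temp = x + 1
    return x + 0 * temp


def mutant_c(x: int) -> int:
    temp = x - 1  # temp now differs from the original
    return x + 0 * temp


def agree(f: Callable[[int], int], g: Callable[[int], int], inputs: Iterable[int]) -> bool:
    """True when both functions return the same value on every sampled input."""
    return all(f(value) == g(value) for value in inputs)


sample = range(-50, 51)
results = [
    agree(original_a, mutant_a, sample),
    agree(original_b, mutant_b, sample),
    agree(original_c, mutant_c, sample),
]
print(results)
```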

gcc: Mutant Operators
To determine the influence of the mutant operators on the effectiveness of TCE, answering RQ4, we measure the number of detected equivalent and duplicated mutants per operator. We also measure the ratios of detected to introduced mutants for the studied operators. It is noted that the choice of which mutants should be discarded when computing the duplicated mutants can unfairly influence the reported numbers with respect to the mutant operators that they belong to. To avoid this, in this section we report the number and proportions of all the mutants that are duplicated and not only the discarded ones.
Table 10 reports the number and proportions of the equivalent and duplicated mutants found by TCE per program and operator. These results suggest that a similar proportion of equivalent and duplicated mutants can be detected by TCE on different programs. The only exceptions are the Gsl-Blas and Gsl-Gen components. Figure 2 depicts the proportions of equivalent and duplicated mutants detected per operator. The horizontal axis follows the presentation order of the operators from Table 10, while the vertical axis records the proportions of detected mutants. These results reveal that at least 15% of the mutants that the ABS and UOI operators introduce are equivalent. They also show that TCE detects more than 5% of equivalent mutants among those produced by the ABS, ROR, UOI and CRCR operators. Regarding the duplicated mutants, TCE detects large proportions, above 10%, for all operators except ABS, LCR and OAAA. Interestingly, the LCR operator seems to produce very few equivalent or duplicated mutants.
In conclusion, our results show that all but the LCR and OAAA operators produce a relatively high ratio of useless mutants, i.e., equivalent and duplicated. In practice this involves a huge overhead that, fortunately, can be saved by TCE.

gcc: Program Size and Mutant Equivalences
To answer RQ5, we use the Spearman rank correlation coefficient ρ. This is a non-parametric statistical test that measures whether the ranks of two variables are related, i.e., it assesses the monotonic relationship between the two variables. The Spearman correlation gives values in the range [-1, +1], with 0 indicating no relationship, +1 indicating a perfect one and -1 indicating a perfect inverse relationship. In addition to the correlation coefficient ρ, we report the obtained p-values, which represent the chance that we would observe the reported ρ value were there, in fact, no correlation.
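For readers unfamiliar with the coefficient, the following sketch computes ρ as the Pearson correlation of the ranks, using average ranks for ties. It is a plain illustration written for this purpose, not the statistical package used in the study, and it does not compute p-values. The data are hypothetical.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    spread_x = sum((a - mean_x) ** 2 for a in rx) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in ry) ** 0.5
    return covariance / (spread_x * spread_y)


# Hypothetical data: number of mutants and detected equivalent mutants.
mutants = [120, 450, 900, 1500]
equivalents = [9, 30, 61, 140]
rho_monotonic = spearman_rho(mutants, equivalents)       # perfectly monotonic
rho_partial = spearman_rho([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(round(rho_monotonic, 3), round(rho_partial, 3))
```

Because only ranks are used, any strictly increasing relationship yields ρ = 1, regardless of whether it is linear.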
We found a correlation between the number of mutants and the number of equivalent mutants detected (ρ = 0.818, p-value = 0.002). This suggests that more mutants lead to more equivalent ones. Similarly, a strong correlation between the number of mutants and the detected duplicated ones (ρ = 0.930, p-value < 2.2e-16) was also found. The correlation between the number of mutants and the proportion of TCE equivalent and duplicated mutants was found to be (ρ = −0.091, p-value = 0.783) and (ρ = 0.280, p-value = 0.379) respectively. These results suggest that we have no evidence supporting the argument that the number of mutants can have a strong influence on the proportions of the equivalences detected by TCE.
We also study the relation between the program size and the number of detected equivalences. We found a medium to small correlation in the case of equivalent mutants (ρ = 0.692, p-value = 0.016). A slightly lower correlation was found between the size of the program and the number of duplicated mutants (ρ = 0.650, p-value = 0.026). With respect to proportions, i.e., the correlation between the program size and the proportion of the detected equivalences, we found (ρ = −0.035, p-value = 0.921) and (ρ = 0.084, p-value = 0.800) for equivalent and duplicated mutants respectively, which indicates that we have no data supporting the argument that program size impacts the ratios of the detected equivalences.
Finally, we found a medium correlation between the size of the program and the total number of mutants (ρ = 0.671, p-value = 0.020), which indicates that larger programs have more mutants than smaller ones. In conclusion, we find no evidence of any correlation between the ratios of equivalent and duplicated mutants and any of the size indicators. This means there is no evidence that the proportion goes up or down as the size of the program or the number of mutants changes. However, there is evidence that the number goes up with the size, as one would expect. Taken together, and based on the studied mutant set, these can be regarded as evidence suggesting that the number of TCE equivalent and duplicated mutants is a fairly consistent proportion, unaffected by the size of the program. These results may be explained by the fact that the compiler optimisations we use only apply "locally", i.e., on the occurrences of code patterns, and not on the semantics of the entire system.

TCE VIA JAVAC AND SOOT
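This section details our results for Java. Sections 5.1 and 5.2 respectively present results regarding the TCE effectiveness and efficiency. Sections 5.3 and 5.4 detail our results regarding the ground truth and the mutant operators. Finally, Section 5.5 investigates the impact of program size on TCE.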

javac and SOOT: TCE Effectiveness
In an analogous manner to the results of Section 4.1, we present our findings that are pertinent to RQ1, i.e., the effectiveness of TCE. These results are illustrated in Table 11 and Figure 3. Table 11 presents the equivalent and duplicated mutants detected by javac and SOOT per test subject. The '#Mutants' column, which is divided into the 'Eq.' and 'Dup' sub-columns, presents the number of the detected equivalent and duplicated mutants (per tool). The '% of all Mutants' column records their corresponding proportion of the generated mutants. From the depicted results, it is clear that SOOT outperforms javac in both equivalent and duplicated mutant detection, managing to detect 3,904 equivalent mutants and 3,687 duplicated ones. Thus, the code optimisations implemented in SOOT appear to be superior to those of javac. Furthermore, it should be mentioned that the mutants detected by SOOT form a superset of the ones detected by javac. Therefore, we conclude that SOOT constitutes an appropriate tool for TCE.
It is noted that most Java-to-bytecode compilers perform few static optimizations and leave most optimization to run time. Thus, class files are optimised by the Java virtual machine as they are executed and not at compilation time. This explains why the stock Java compiler is ineffective.
Figure 3 illustrates the proportion of equivalent and duplicated mutants per test subject. The horizontal axis presents the corresponding proportions and the vertical axis presents the test subjects in ascending order, according to their size. By examining the figure, it becomes evident that TCE manages to detect a considerable number of equivalent and duplicated mutants, ranging between 1% and 18% for equivalent ones and 2% and 17% for duplicated ones.
To summarise, in the case of Java, TCE managed to detect 6% of all mutants as equivalent and 5% of them as duplicated ones.

javac and SOOT: TCE Efficiency
In this section, we detail the empirical findings pertaining to TCE's efficiency for the case of Java. To this end, Table 12 presents the CPU execution time that the detection of equivalent and duplicated mutants required per test subject and optimisation tool.
Table 12 is divided into three columns: 'Program' refers to the names of the test subjects; the 'javac' column reports the compilation time ('Comp.' sub-column), the equivalent mutant detection time ('Eq.D.' sub-column) and the duplicated mutant detection time ('D.D.' sub-column) of TCE via javac; and 'SOOT' presents the corresponding results in the case of TCE via SOOT. It should be noted that in …
The question that is raised here is whether the time required by TCE is acceptable. While this depends on many uncontrolled parameters, we would like to underline that detecting equivalent mutants is a tedious, manual task. Previous research estimated the time needed for the manual identification of a single equivalent mutant to be approximately 15 minutes [40]. Assuming this is a fair approximation, identifying the TCE equivalent mutants purely manually would require 124 × 15 minutes or 111,600 seconds in the case of javac and 3,904 × 15 minutes or 3,513,600 seconds in the case of SOOT. Thus, it can be easily concluded that the execution cost of TCE is small when compared to the estimated manual effort. In fact, the total cost of TCE (optimisation phase + detection phase) constitutes only 3% of the estimated manual effort.
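The manual-effort estimate can be reproduced with a few lines of arithmetic. The sketch below assumes that the 95,222 seconds reported in the Discussion as the total Java cost correspond to the optimisation and detection phases referred to here; under that assumption the ratio comes out at roughly 3%.

```python
MINUTES_PER_MUTANT = 15    # estimated manual analysis time per mutant [40]
SECONDS_PER_MINUTE = 60

javac_equivalents = 124    # TCE equivalent mutants found via javac
soot_equivalents = 3_904   # TCE equivalent mutants found via SOOT

javac_manual_seconds = javac_equivalents * MINUTES_PER_MUTANT * SECONDS_PER_MINUTE
soot_manual_seconds = soot_equivalents * MINUTES_PER_MUTANT * SECONDS_PER_MINUTE

# Assumption: total TCE cost for Java, as reported in the Discussion section.
tce_total_seconds = 95_222
share_of_manual_effort = 100 * tce_total_seconds / soot_manual_seconds

print(javac_manual_seconds, soot_manual_seconds)
print(round(share_of_manual_effort, 1))
```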

javac and SOOT: Equivalent Mutants
This section, which answers RQ3, provides insights into the actual proportion of equivalent mutants that can be automatically detected by TCE. We perform this evaluation based on manually identified sets of such mutants, which give us a ground truth. Table 13 describes the corresponding findings per utilised tool. As can be seen, javac failed to detect equivalent mutants on our ground truth benchmark. By contrast, SOOT detected 105 out of 196 equivalent mutants, indicating that it can automatically weed out more than 50% of the studied equivalent mutants. These results provide strong evidence regarding TCE's effectiveness. Finally, it should be stated that these automatically detected equivalent mutants correspond to 7% of all the studied ones, which is in line with the results of the large-scale experiment we report in Section 5.1.
A manual analysis of the types of equivalent mutants that are TCE equivalent reveals that all but one of the detected mutants belong to the third category, i.e., the corresponding mutant can be reached and can infect the program state locally but subsequently fails to propagate the corrupted state to the observable output. The one mutant not falling into this category is a mutant that can be reached but cannot infect the program state.

javac and SOOT: Mutant Operators
In order to answer RQ4 for Java, this section reports the contribution of each mutant operator to the detected mutants. More precisely, Table 14 presents the number of the detected equivalent mutants per operator, along with their proportion of all the mutants generated by that specific operator, and Table 15 presents the respective results for the case of the duplicated mutants. For brevity, we only record the cases in which the number of detected mutants was higher than 0 in the studied programs. By examining Table 14, it becomes clear that TCE (via SOOT) managed to detect equivalent mutants that belong to 7 out of the 15 utilised mutant operators, indicating that it can be effective across a wide range of operators. With respect to the duplicated mutants discovered, the findings of Table 15 show that these mutants belong to 8 operators, corroborating the previous statement.
Figure 4 visualises the proportions of detected mutants across the corresponding mutant operators. It can be seen that TCE manages to identify at least 4% of the equivalent mutants produced by LOR, AODS and AOIS and at least 24% of the duplicated ones generated by ROR and COI. Again, for brevity, we depict only those operators with detection rates higher than 0%.
It is noted that the TCE equivalences are a special form of redundancy, as they require mutual subsumption between mutants (mutant a subsumes mutant b and mutant b subsumes mutant a). This is different from the redundant mutants studied by Kaminski et al. [59] and Just et al. [60], which consider non-mutual subsumptions (mutant a subsumes b and mutant b does not subsume a). In view of this, it is normal that the COR operator produces redundant mutants that are not captured by TCE (results reported in Tables 14 and 15). Still, a stronger version of the COR operator may provide more chances for TCE equivalences.
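The distinction between mutual and non-mutual subsumption can be sketched with kill sets, i.e., the sets of tests that kill each mutant. This is only an approximation over a finite test suite, used here for illustration: TCE itself establishes that two mutants compile to identical code, which is stronger than agreeing on a given set of tests. The test names and kill sets are hypothetical.

```python
from typing import FrozenSet


def subsumes(kills_a: FrozenSet[str], kills_b: FrozenSet[str]) -> bool:
    """Mutant a subsumes b when a is killed by at least one test and every
    test that kills a also kills b."""
    return bool(kills_a) and kills_a <= kills_b


def mutually_subsuming(kills_a: FrozenSet[str], kills_b: FrozenSet[str]) -> bool:
    """Both directions hold, i.e., the two kill sets are identical and non-empty."""
    return subsumes(kills_a, kills_b) and subsumes(kills_b, kills_a)


# Hypothetical kill sets over a three-test suite.
kills_a = frozenset({"t1", "t2"})
kills_b = frozenset({"t1", "t2"})
kills_c = frozenset({"t1", "t2", "t3"})

mutual_ab = mutually_subsuming(kills_a, kills_b)   # candidates for duplication
one_way_ac = subsumes(kills_a, kills_c) and not subsumes(kills_c, kills_a)
print(mutual_ab, one_way_ac)
```

In this toy setting, a and b behave like duplicated mutants, whereas c is merely redundant with respect to a, which is the kind of relation TCE does not capture.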

javac and SOOT: Program Size and Mutant Equivalences
In order to answer RQ5, i.e., whether or not the number of generated mutants or the program size affects TCE, we examined the correlation between the program size and the number and proportions of the detected equivalent and duplicated mutants. All ρ values and p-values were computed using the Spearman rank correlation test.
Regarding the correlation between the number of mutants and the number of identified equivalent ones, ρ = 0.786 (p-value = 8.536e-06) was obtained, indicating a strong correlation. The correlation between the number of mutants and the proportion of the equivalent ones was found to be ρ = −0.008 (p-value = 0.972).
In the case of duplicated mutants, the correlation between the number of mutants and the number of duplicated ones was ρ = 0.657 (p-value = 0.001), while the correlation between the number of mutants and the proportion of duplicated ones was ρ = −0.271 (p-value = 0.199). Based on these data, it can be concluded that the number of equivalent and duplicated mutants detected by TCE tends to increase as the number of the generated mutants increases. However, this does not appear to be the case when considering the detected proportions.
With respect to the correlation of program size with the detected equivalent and duplicated mutants, the obtained results suggest that there is a very weak correlation in the case of the equivalent mutants (ρ = 0.230, p-value = 0.277) whereas, in the case of the duplicated ones, there is a strong one, i.e., ρ = 0.797, p-value = 3.081e-06. The correlations between program size and the proportions of detected equivalent and duplicated mutants were found to be ρ = −0.337 (p-value = 0.107) and ρ = 0.423 (p-value = 0.041), respectively. It is noted that this last case, i.e., program size and the proportion of duplicated mutants, is the only proportion for which our data show a correlation.
In conclusion, our data show that an increase in the program size is expected to increase the number of equivalent and duplicated mutants identified by TCE. However, the proportion of equivalent mutants detected is expected to be unaffected by the program size, while the proportion of duplicated ones is affected. Finally, we found a low but nontrivial correlation between the program size and the number of generated mutants (ρ = 0.417, p-value = 0.044).

DISCUSSION
This section summarises our results and answers the stated RQs. It also discusses the practical implications and constraints of applying mutation with the use of TCE.

TCE Effectiveness
Our results suggest that TCE can reduce the total number of mutants by 11% for Java and 28% for C. In the case of C, TCE equivalent mutants range from 2% to 17% depending on the studied program and account for 7.4% of all mutants on average. In the case of Java, TCE, using SOOT, revealed 5.7% equivalent mutants on average, ranging from 1% to 18%. TCE duplicated mutants range from 3% to 27% and account for 21.0% on average when considering C, while for Java they range from 2% to 17% and account for 5.4% on average.

TCE Efficiency
The time to detect equivalent and duplicated mutants, using the diff utility, varies between programs and is on average 22 and 225 seconds for C and 9 and 808 seconds for Java. This indicates that once the mutants have been compiled/optimised, the equivalence detection comes 'almost for free'. This is an important finding because it suggests that TCE can be applied to remove equivalent and duplicated mutants before the application of other time-consuming cost-reduction methods.
Our results show that the total time spent on compiling and on detecting equivalent and duplicated mutants is 374,162 and 95,222 seconds for C and Java respectively. Thus, a candidate mutant can be analysed by TCE in less than 3.0 and 1.5 seconds for C and Java respectively.

Equivalent Mutants
In an attempt to identify the prevalence of TCE equivalent mutants we estimated their ratio, with respect to all equivalent mutants, based on the studied benchmarks. We found that approximately 30% and 54% of the benchmark mutants are trivially equivalent for C and Java respectively. Here it should be noted that there is a large variation in the detected ratios among the studied programs. This is common to both C and Java subjects, indicating that program characteristics have a strong influence on the TCE equivalences. Another important finding regards the causes of mutant equivalences that are detected by TCE. Our results are surprising since they show that the majority of detected mutants are due to failed propagation, i.e., there is no possible way to propagate the mutant infection to the program output. This is true for both C and Java. In Java almost all (99%) of the detected mutants are of this category, while in C they amount to 57%. In the case of C, 41% of the detected mutants fall into the second category, i.e., it is impossible to affect the program state that pertains immediately after mutant execution, and 2% into the first one, i.e., mutants that reside in unreachable code.

Mutant Operators
To better understand the nature of TCE mutants we identified their prevalence according to the considered mutant operators. Our results suggest that in C the ABS and UOI operators introduce more than 15% of trivially equivalent mutants, ROR and CRCR more than 5% and OAAA just 3%, while LCR, OBBN, AOR, OCNG and SSDL introduce a small fraction, less than 1%. Regarding Java, most of the detected equivalent mutants are due to AOIS (15%) and AODU (11%). Also, LOR and AOIU introduce notable numbers, which respectively account for 4% and 2%. The rest of the operators introduce none or non-significant numbers.
With respect to duplicated mutants, all operators introduce a large number of such mutants in C. Most of them account for more than 7%. Only LCR introduces a smaller fraction, namely 3%. In the case of Java, the situation is a bit different. Only the COI and ROR operators have large proportions of TCE duplicated mutants: 56% and 24% for COI and ROR respectively. AOIS also produces a large number of duplicated mutants, which accounts for 4%. The rest of the operators introduce none or small numbers.

Program Size and Mutant Equivalences
We measured the correlation between the number of mutants and the size of programs. Our results reveal that in both cases there is a medium-level correlation, which is stronger for C, i.e., ρ = 0.671 for C and ρ = 0.417 for Java. Thus, programs of similar size can vary considerably in terms of number of mutants. By measuring the average number of mutants per statement we get 1.90 and 1.69 for C and Java respectively. Hence, for the programs we studied, we conclude that C programs have approximately 10% more mutants, and a stronger correlation between the number of mutants and program lines of code, than the Java ones.
With respect to equivalent mutants, our results indicate a strong correlation with the number of mutants, for both C and Java, i.e., ρ = 0.818 and ρ = 0.786. This becomes weaker when considering program size, i.e., ρ = 0.692 for C and ρ = 0.230 for Java. However, in all cases we found no evidence indicating that the ratio of the detected equivalent mutants correlates with the number of mutants. Together these two results can be regarded as evidence suggesting that the number of the detected equivalent mutants is a fairly consistent proportion, unaffected by the size indicators of the program under analysis.
With respect to duplicated mutants, our results suggest a strong correlation with the number of mutants, for both C and Java, i.e., ρ = 0.930 and ρ = 0.657. However, both in C and Java we found no evidence indicating that the ratio of the duplicated mutants correlates with the number of mutants. Program size has a medium to strong correlation with the number of TCE duplicated mutants, i.e., ρ = 0.650 and ρ = 0.797 for C and Java. In the case of C we found no evidence indicating that the ratio of the duplicated mutants correlates with program size. In contrast, a medium-level correlation was found in the case of Java, ρ = 0.423.

Differences between C and Java
Our presentation thus far has focused on our results as found by the two versions of TCE, i.e., for C and Java. Here we attempt to compare the results of the C and Java versions, answering RQ6, and highlight commonalities and differences between them.
A first observation is that TCE detects more equivalences in C than in Java. This can be attributed to the compiler optimisations implemented in gcc, which are considerably more advanced than those of Java and SOOT. We took a close look at the analysis of the detected causes of equivalence and found that almost all TCE equivalent mutants detected in Java programs are those that cannot propagate, while only 57% of the C ones are due to the same reason. This suggests that there is a 42% difference between the results of C and Java, mainly due to the lack of Java optimisations. The average detected ratios are 7.4% and 5.7% for C and Java, which reflect the mentioned differences.
Our results demonstrate that equivalent mutants are more prevalent in C than in Java. This is evident from our ground truth analysis, which revealed that in C the equivalent mutants account for 23% of all mutants, while in Java they account for 12.7%. Additionally, Java has a larger proportion of trivially equivalent mutants. This is also shown by our ground truth analysis, which revealed that 54% of all Java equivalent mutants are TCE equivalent. The same ratio for C is 30%. In interpreting this result, we should consider our first observation, i.e., that 42% of the TCE equivalent mutants cannot be detected by SOOT due to the lack of compiler optimizations, which implies that a potentially high number of Java trivially equivalent mutants exist but are not found by SOOT. Thus, we conclude that Java programs have considerably fewer equivalent mutants than the C ones and, at the same time, that Java programs contain a much larger proportion of trivially equivalent mutants.
Regarding duplicated mutants, we found that TCE duplicated mutants are more prevalent in C than in Java programs. Our results show that while in C a large proportion exists, 21.0% on average, in Java these mutants are considerably fewer and account for 5.4% on average. This difference is partly attributed to the lack of optimisations in Java and to language characteristics. Thus, characteristics like the distinction between logical and arithmetic operators in Java, the typing conventions that are stronger in Java than in C, and the use of pointers and arrays make C mutants more vulnerable to duplication.
Another interesting point is that, after removing the TCE equivalent mutants, a ratio of 5.8% of equivalent mutants remains in Java, while in C the ratio of equivalent mutants that remain is 16.1%. Considering this observation together with the one regarding the number of mutants, which are approximately 10% fewer in Java than in C, we conclude that, based on the programs we studied, mutation analysis in C is harder than in Java. The efficiency differences between C and Java in detecting duplicated mutants are believed to be due to language differences. Our results suggest that 0.028 sec are required per C mutant under analysis and 0.28 sec per Java one. C binary code tends to be smaller than Java bytecode. While the differences are not practically significant, they could be ameliorated by using some form of checksum, such as md5, to substantially improve the performance of the diff comparisons.
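The checksum idea mentioned above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation used in our experiments: compiled artefacts are represented as in-memory byte strings, the names digest and classify_mutants are hypothetical, and a byte-level comparison could still be used to confirm each match.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Tuple


def digest(binary: bytes) -> str:
    """Return the MD5 hex digest of a compiled artefact's bytes."""
    return hashlib.md5(binary).hexdigest()


def classify_mutants(original: bytes,
                     mutants: Dict[str, bytes]) -> Tuple[List[str], List[List[str]]]:
    """Split mutants into TCE-equivalent ones and groups of duplicates.

    A mutant whose compiled bytes hash like the original is reported as
    equivalent; the remaining mutants that share a digest with each other
    are reported as duplicates of one another.
    """
    original_digest = digest(original)
    by_digest = defaultdict(list)
    equivalent = []
    for name, binary in mutants.items():
        d = digest(binary)
        if d == original_digest:
            equivalent.append(name)
        else:
            by_digest[d].append(name)
    duplicates = [group for group in by_digest.values() if len(group) > 1]
    return equivalent, duplicates


# Hypothetical compiled artefacts, represented here as raw bytes.
original = b"\x55\x48\x89\xe5"
mutants = {
    "m1": b"\x55\x48\x89\xe5",  # same bytes as the original
    "m2": b"\x55\x48\x89\xe6",
    "m3": b"\x55\x48\x89\xe6",  # same bytes as m2
    "m4": b"\x55\x48\x89\xe7",
}
equivalent, duplicates = classify_mutants(original, mutants)
print(equivalent)   # ['m1']
print(duplicates)   # [['m2', 'm3']]
```

Grouping by digest means that each artefact is hashed once, instead of being compared pairwise against every other artefact.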
Other parameters, such as the tools and operator sets used, could also contribute to the differences between the C and Java results. Although we have 10 operators in C and 15 in Java, this difference is more conventional than actual. The CRCR operator corresponds to many Java operators, mainly due to language differences, i.e., in C there are only arithmetic values, while in Java logical operations are strictly of boolean type. Only two C operators, ABS and SDL, are partially implemented in Java: ABS is partially implemented by AODU and SDL by the various deletion operators, such as COD. Three Java operators, SOR, AODS and AORS, are not implemented in C.
Comparing individual operators, C-ABS produces 24% of TCE equivalent mutants, while Java-AODU produces 11%. Similarly, C-UOI produces 16%, while Java-AOIS produces 15%. Interestingly, C-ROR, C-CRCR and C-OAAA account for 6%, 5% and 3%, respectively, while their Java versions account for 0%. With respect to duplicated mutants, C-ROR produces 25% and Java-ROR 24%; C-OCNG produces 49% and Java-COI 56%; C-UOI produces 21% and Java-AOIS 4%. All other C operators introduce many duplicated mutants that are not detected for the related Java ones. A manual inspection of the detected C mutants suggests that most of these mutants are due to a failed infection, i.e., mutant execution cannot result in a corrupted program state. As shown by our results, Java optimisations are ineffective for these cases and hence we get a reduced effectiveness.

Implications for Research Studies
Our results have direct implications for research studies: the application of TCE can improve the accuracy of a study's results when no manual analysis of equivalent mutants has been performed. To better understand these implications, Figure 5 illustrates the range in which TCE can change the resulting mutation scores in the case of Java (left part of the figure) and C (right part of the figure), when assuming that our results are generalisable. Both parts present the mutation scores with no manual analysis (line "traditional") and the improved mutation scores that could be obtained by applying TCE. We report the minimum and maximum number of detected equivalent mutants (lines "TCE min" and "TCE max") to better reflect the impact of TCE. Note that the minimum and maximum values are based on the results of our large-scale experiment (see also Sections 4.1 and 5.1). By examining the figures, it can be seen that TCE can improve the accuracy of the obtained mutation scores. More precisely, in the case of Java, this improvement ranges between 0%-18% and, in the case of C, it ranges between 0%-16%. While these results are only illustrative and have to be treated with a great deal of caution, they provide evidence that research studies will benefit from the application of TCE, by automatically improving the accuracy of the results reported. Consider, for instance, a study that compares two test generation methods, say methods X and Z, which achieve a mutation score (without the analysis of equivalent mutants) of 60% and 67%, respectively; the study concludes that Z is better because it achieves a mutation score of 67%, an improvement of 7% over the previous method. TCE can be used to improve the accuracy of the study's results: by applying TCE, the mutation score of X will range between 61% and 73% and that of Z between 68% and 82%. Thus, the application of TCE will result in more accurate mutation scores and will potentially reveal a greater difference, of 9%, between X and Z, improving the empirical evidence of Z's superiority.
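The effect on a reported score can be illustrated with a minimal sketch. It assumes the usual definition of the mutation score as the number of killed mutants divided by the number of mutants not known to be equivalent; the function name and the mutant counts are hypothetical and chosen only to mirror the 60% example above.

```python
def mutation_score(killed: int, total: int, known_equivalent: int = 0) -> float:
    """Killed mutants divided by the mutants not known to be equivalent."""
    candidates = total - known_equivalent
    if candidates <= 0:
        raise ValueError("no mutants left after removing equivalent ones")
    return killed / candidates


# Hypothetical study: 1000 mutants, 600 of them killed by the test suite.
total, killed = 1000, 600
traditional = mutation_score(killed, total)
# Assume TCE flags 180 mutants (18% of all mutants) as equivalent.
with_tce = mutation_score(killed, total, known_equivalent=180)
print(f"traditional: {traditional:.1%}, with TCE: {with_tce:.1%}")
```

Removing the detected equivalent mutants shrinks the denominator, so the same test suite obtains a higher score that is closer to the true one.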

Practical Implications
Practitioners use test criteria to develop test suites and to assess the level of test thoroughness. Thus, in practice, TCE affects the effort needed (required work) to develop test suites and the ability of the criterion to accurately measure the effectiveness of the test suites. This section investigates these two practical implications of TCE by examining its impact on the work required when generating mutation adequate test suites, and by examining the improvements it makes when measuring the mutation score. To reliably investigate both the required work and the improvements of TCE, we need to know which mutants are equivalent. We also need to have multiple test suites of various levels of test thoroughness, i.e., with low and high mutation scores. The benchmark set of Yao et al., which we use to answer RQ3 (for C programs), unfortunately falls short on both of these requirements. Thus, we used the benchmark of Papadakis et al. [47], which is an extension of the well-known Siemens suite [68] and contains manually augmented (mutation adequate) test suites and analysed mutants. This benchmark was constructed using: a) the PROTEUM 11 mutation testing tool to generate mutants, b) manual analysis to characterise these mutants as killable or equivalent and c) manual analysis to augment the test suites (generate tests that kill the identified killable mutants) [47]. In the case of Java, we used the mutation adequate test suites that we generated when analysing the mutants of the ground truth set (used to answer RQ3).
A summary of the mutants produced by PROTEUM when applied to the Siemens suite, together with the results of TCE (using the -O option), is given in Table 16. From these data, it becomes evident that a non-trivial number of mutants has been detected by TCE. The TCE equivalent mutants account for 30%-48% (41% on average) of all the existing equivalent mutants. Interestingly, the results are very similar to those reported in our previous analysis (approximately 6.9% and 24% of all the PROTEUM mutants are TCE equivalent and duplicated, respectively) and thus we are confident that they are representative.

Practical Implications: Required Work
To measure the manual effort involved when performing mutation testing, we adopt the model used by the recent study of Kurtz et al. [57]. Thus, we define work as "the number of mutants that are examined by the engineer", or equally, "the sum of the number of tests written to kill all non-equivalent mutants and the number of equivalent mutants identified" [57]. This metric in essence approximates the manual effort that a tester needs to perform when doing mutation testing.
Equation 1 presents the work model. In order to compare the results across different programs, we normalise the recorded work by dividing it by the overall required work, per subject. The corresponding formula is presented in Equation 2.

$$\text{work} = |\mathit{testCases}| + |\mathit{EquivalentMuts}| \qquad (1)$$

$$\text{normalised work} = \frac{|\mathit{testCases}| + |\mathit{EquivalentMuts}|}{\mathit{OverallWorkRequired}} \qquad (2)$$

Algorithm 1 presents the procedure followed to calculate work, as suggested by Kurtz et al. [57]. First, a mutant is randomly selected from the generated set of mutants of the program under test. Next, if the mutant is equivalent, the work is increased by one and the process is repeated. If the mutant is killable, a test case that kills this mutant is randomly selected, the value of work is increased by one and the other mutants that can be killed by this test case are marked as killed. This process continues until every killable mutant of the considered mutants is selected or killed. As can be seen from the algorithm, it requires two inputs: a mutant set and a set of mutation adequate test cases. Thus, we calculate the work based on manually analysed test subjects. To avoid any bias from the selection process, we repeated the experiment 100 times.
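The procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than a transcription of Algorithm 1: the data layout (a set of equivalent mutants and a kill map from each test to the mutants it kills) and all names are hypothetical, and the test suite is assumed to be mutation adequate.

```python
import random
from typing import Dict, Set


def work_calculation(mutants: Set[str], equivalent: Set[str],
                     kills: Dict[str, Set[str]], rng: random.Random) -> int:
    """Count the mutants examined until every killable mutant is killed."""
    remaining = set(mutants)  # mutants not yet examined or killed
    killable_left = {m for m in mutants if m not in equivalent}
    work = 0
    while killable_left:
        mutant = rng.choice(sorted(remaining))  # randomly select a mutant
        remaining.discard(mutant)
        work += 1  # one unit of work per examined mutant
        if mutant in equivalent:
            continue  # identified as equivalent, no test is written
        killing_tests = sorted(t for t, killed in kills.items() if mutant in killed)
        if not killing_tests:
            raise ValueError(f"test suite is not mutation adequate for {mutant}")
        test = rng.choice(killing_tests)  # randomly select a killing test
        remaining -= kills[test]  # collaterally killed mutants are not examined
        killable_left -= kills[test]
    return work


# Hypothetical subject: five mutants, two of them equivalent, two tests.
mutants = {"m1", "m2", "m3", "m4", "m5"}
equivalent = {"m4", "m5"}
kills = {"t1": {"m1", "m2"}, "t2": {"m3"}}
rng = random.Random(0)
runs = [work_calculation(mutants, equivalent, kills, rng) for _ in range(100)]
print(sum(runs) / len(runs))  # average work over 100 repetitions
```

Removing the equivalent mutants before the call, as TCE would do for the ones it detects, leaves only the two tests as work in this toy example.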
Figures 6 and 7 illustrate the results obtained for both programming languages. These figures plot the normalised work (x-axis) against the subsuming mutation score [56] (MS*, y-axis) realised at each step of Algorithm 1, with and without the application of TCE (denoted by the "TCE" and "Traditional" lines, respectively), per test subject and programming language. Following the process of Kurtz et al. [57], we used the subsuming mutation score as the effectiveness measurement. This measurement avoids the inflation effects caused by redundant mutants [56], [57].
By examining Figure 6, it can be seen that TCE manages to substantially reduce the work required to achieve a given test effectiveness level. For instance, in the case of Joda-Time, applying TCE reduces the work required to achieve a 70% subsuming mutation score by 11% compared to the application of mutation without TCE. This reduction increases to approximately 20% when the subsuming mutation score reaches 80% and to 30% when the score reaches 90%; finally, at the 100% effectiveness level, TCE realises a 49% work reduction. This trend, i.e., the increase of the work reduction as the subsuming mutation score increases, is present in most Java subjects 12. This fact can be explained by TCE's equivalent mutant detection, which, in turn, gives practitioners a higher chance of selecting killable mutants rather than equivalent ones as the application of mutation progresses. Regarding the results for C, depicted in Figure 7, analogous conclusions can be drawn.
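To better portray TCE's implications for work, Figure 8 presents the overall work reduction when developing muta-

Practical Implications: Mutation Score Improvement
This section investigates how the use of TCE improves the accuracy of the mutation score measurement. Consider the following example: an engineer applies mutation to a test subject based on the available test cases and obtains a value x of mutation score; the question that is raised here is how much the x score differs from the true mutation score, i.e., the score computed by removing the equivalent mutants.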
We calculate the error of the measurement by comparing the true mutation score with the obtained one (with and without applying TCE). Equation 3 details the error of the computation. This metric quantifies the distance of the measured score from the true one. Our results are depicted in Figures 9 and 10 for Java and C, respectively. The y-axis of the figures refers to the aforementioned error and the x-axis to the effectiveness levels denoted by the subsuming mutation score.
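The error computation can be illustrated with a minimal sketch. It assumes that the error is the absolute difference between the true and the measured mutation score, which may differ in form from Equation 3; the function name, parameters and mutant counts are hypothetical.

```python
def score_error(killed: int, total: int, equivalent: int, removed: int = 0) -> float:
    """Absolute gap between the true and the measured mutation score.

    equivalent: number of mutants known (e.g. by manual analysis) to be equivalent.
    removed: equivalent mutants already discarded before measuring, e.g. by TCE.
    """
    true_score = killed / (total - equivalent)
    measured_score = killed / (total - removed)
    return abs(true_score - measured_score)


# Hypothetical subject: 200 mutants, 40 equivalent, 120 killed by the tests.
without_tce = score_error(killed=120, total=200, equivalent=40)
# Assume TCE detects 20 of the 40 equivalent mutants.
with_tce = score_error(killed=120, total=200, equivalent=40, removed=20)
print(f"error without TCE: {without_tce:.3f}, with TCE: {with_tce:.3f}")
```

In this toy setting, the more equivalent mutants are removed before measuring, the smaller the gap to the true score becomes.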
By examining Figures 9 and 10, it can be seen that, in most test subjects, the application of TCE results in a much lower error than calculating the mutation score without it. For instance, in the case of the wrap method of the Commons test subject, at the 75% subsuming mutation score, the error in the mutation score's calculation is 9% without the application of TCE; this error is reduced to 4% when TCE is applied, and the difference remains approximately the same until the 100% subsuming mutation score is reached. Overall, in the case of Java, TCE reduces the calculation error of the mutation score by 1%-10%. In the case of C, we find analogous results, with the calculation error reduction ranging between 0% and 4%.

Fig. 9: Mutation score improvement after the application of TCE in the case of Java.

Application constraints
The proposed technique is solely based on the use of compilers and their optimisation options, thereby avoiding several limitations of other methods and tools, e.g., applicability and scalability. It does not require any sophisticated source code analysis techniques or any expensive test executions. Thus, it can be directly applied to real-world systems and can be easily incorporated within mutation testing tools. Interestingly, the detected mutant equivalences are partly dependent on the compiler options used. Although it is rather unlikely that equivalent mutants detected by one compiler option are not equivalent according to another, we investigated this issue by exploring the main gcc and SOOT settings, covering a wide range of optimisation options, and found that all of them can be used to detect mutant equivalences (some are more effective than others, of course). We also explored the trade-off between effectiveness and efficiency using different settings. Our results suggest that the -O and -O2 options are reasonably good, because they consume less compilation time than the -O3 option. However, none of them is superior to the others in detecting equivalent mutants. It should be noted that modern compilers offer many more optimisation options, and there might exist combinations of them that can detect mutant equivalences faster or in greater numbers. Thus, our future research is directed towards identifying the options that best fit TCE. Detailed information about the performed optimisations can be found on the gcc 13 and SOOT 14 websites.
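To make the role of the compiler options concrete, the following sketch compiles an original file and a mutant under each of the three gcc options and reports the options under which the object files are byte-identical. It is a hedged illustration, not our experimental tooling: it assumes that gcc is available on the PATH and that the mutant keeps the base file name of the original (stored in a different directory), so that file names embedded in the object code do not differ. The helper names are hypothetical.

```python
import filecmp
import subprocess
from pathlib import Path
from typing import List

OPT_LEVELS = ["-O", "-O2", "-O3"]  # gcc optimisation options discussed above


def compile_command(source: Path, output: Path, opt_level: str) -> List[str]:
    """Build the gcc command line that compiles one source file to an object file."""
    return ["gcc", opt_level, "-c", str(source), "-o", str(output)]


def detecting_options(original: Path, mutant: Path, workdir: Path) -> List[str]:
    """Return the options under which both files compile to identical object code."""
    detected = []
    for opt in OPT_LEVELS:
        orig_obj = workdir / f"original{opt}.o"
        mut_obj = workdir / f"mutant{opt}.o"
        subprocess.run(compile_command(original, orig_obj, opt), check=True)
        subprocess.run(compile_command(mutant, mut_obj, opt), check=True)
        # Byte-by-byte comparison, analogous to running diff on the two files.
        if filecmp.cmp(orig_obj, mut_obj, shallow=False):
            detected.append(opt)
    return detected


print(compile_command(Path("orig/tcas.c"), Path("build/original-O2.o"), "-O2"))
```

A mutant would be reported as TCE equivalent if detecting_options returns at least one option; comparing the returned lists across mutants gives a rough view of how the options differ in effectiveness.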

THREATS TO VALIDITY
As is usual in software engineering experiments, our subjects might not be representative. It is also possible that our results might not hold for complete system analysis (as we only analysed sampled components of the large programs). To ameliorate this issue, we selected 12 real-world programs of varying size and application domain, 6 written in C and 6 written in Java, several orders of magnitude larger than those used in previous equivalent mutant detection studies. We also performed an additional evaluation using different sets of programs, composed of 31 manually-analysed benchmark subjects taken from the literature. To further cater for this issue, we draw attention to strongly observed effects and present our results as ranges of expected values.
The evaluation of our approach resulted in analogous findings in all studied sets. With reference to the C test subjects, it detected approximately 7.4% of all the mutants as equivalent in the large-scale experiment, and 7.2% and 6.9% of the mutants of the manually-analysed test subjects (for the benchmarks of Yao et al. [16] and Papadakis et al. [47], respectively), on average. In the case of Java, it identified 5.7% and 6.8%, correspondingly. Regarding the range of the results (between worst and best cases), a similar picture appears. Thus, we are confident that TCE can eliminate a considerable number of equivalences.
Additionally, our results are in line with those reported in the literature 15, providing confidence that they are realistic. We studied the mutants of the C and Java languages and TCE implemented using gcc and SOOT. Therefore, some of our results might be a realisation of independent uncontrolled variables, such as the sample size, the sample selection procedure (excluding classes not handled by MUJAVA), the programs' internal characteristics, the software platforms used and the tools' operation. Therefore, it is important to note that all our results form empirical observations that might not hold in the general case. However, our findings fit intuition and rely on the foundations set by previous studies [31]. Furthermore, we control the major factors that we believe can influence our results. Additional studies are needed to determine what influences the performance of TCE and its practical use on different languages and compiler optimisation techniques.
Other threats are due to the use of software systems. For instance, the gcc compiler or the SOOT optimisation framework may have defects. However, these systems are heavily tested and deployed. Thus, it is unlikely that the remaining defects would influence our results to a great extent. Implementation defects of MILU, MUJAVA and PROTEUM may also have an influence. To reduce this threat, we carefully checked their results. However, we consider this a minor threat, since all the tools used have been employed by several authors in recent studies, e.g., [6], [57], [70], [71], [72], independently of us. Furthermore, we utilised three equivalent mutant benchmark sets which were entirely built by hand. These results served as a 'sanity check' to reduce the threat to validity.
13. https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
14. https://ssebuild.cased.de/nightly/soot/doc/sootoptions.htm
15. Offutt and Pan [33] reported that 9% of all the mutants are equivalent. Delamaro et al. [69] found 12%, Kintis et al. [46], Schuler and Zeller [40] 7%-8%, Papadakis et al. [47] 17%, Yao et al. [16] 23% and Madeyski et al. [44] 4%-39%.
Our results might be affected by our choice of mutant operators. As shown by other studies [73], [74], it is also possible that the realisation of the mutant operators by the employed tools may particularly affect the comparison between C and Java. To mitigate this threat, we detailed exactly how the operators supported by the tools used are realised (Tables 5 and 6) and analysed the common C and Java operators. Based on these lines we draw some conclusions. Overall, we used a wide range of 75 mutant operators (realised by PROTEUM) and all the popular operators (those included in most existing mutation testing tools and those empirically found to correlate with fault detection). In all cases we found large numbers of equivalences, which have a major impact on the application of mutation testing.
The use of the equivalent mutants' benchmarks may pose another threat. This is due to the manual analysis performed: some killable mutants may have been mistakenly identified as equivalent. However, these studies were performed independently of the present one and hence it is unlikely that such mistakes coincidentally match the results of TCE. Additionally, it is equally possible that such mistakes have led to an underestimation of TCE's effectiveness.
Finally, all our subjects, tools and data are available on the accompanying website of the present paper 16. This helps reduce all the above-mentioned threats [75], since independent researchers can check, replicate and analyse our findings.

CONCLUSION AND FUTURE WORK
We have presented the results of an extensive empirical analysis of the ability of Trivial Compiler Equivalence (TCE) to detect both equivalent and duplicated mutants in the C and Java programming languages.
We have conducted an empirical study of TCE on 25 C and 6 Java benchmark systems, for which the programs under study are sufficiently small for their equivalent mutants to be determined manually. These systems provided us with the ground truth against which we can empirically assess the equivalent mutant detection power of TCE. We augmented this study with a much larger study for which no ground truth is possible. In total, we have experimented with over 1 million lines of code, consisting of the 31 smaller benchmark systems, together with 6 larger Java systems (with a total of 263,740 LoC) and 6 larger C systems (with a total of 750,157 LoC).
16. http://pages.cs.aueb.gr/~kintism/papers/tce/
Overall, we find that, for both C and Java, TCE is a useful, fast and widely-applicable technique that can detect between 17%-100% (30% on average) of C equivalent mutants and 0%-94% (54% on average) of Java equivalent mutants (for the ground truth set). Furthermore, over all mutants studied in all large real-world programs, the detection of trivially equivalent and trivially duplicated mutants was found to reduce the total number of mutants by 5%-23% for Java and 20%-37% for C, i.e., by 11% and 28% on average, respectively. These achievements imply that a practitioner who applies mutation testing with TCE will spend 0%-51% less manual effort in the case of Java and 28%-47% less in the case of C than without it. TCE also improves the accuracy of the mutation score measurement by 1%-10% for Java and 0%-4% for C. Thus, future research should integrate compiler optimisations within mutation testing tools in order to avoid generating such trivial mutants, and future research studies should consider applying TCE to reap the benefit of the technique.
Our results revealed interesting findings that suggest topics for future work on mutation-based analysis of the semantic differences between programming languages.For example, it is intriguing that a larger proportion of Java's equivalent mutants were found to be detectable using TCE than for C. Furthermore, if the proportion of equivalent mutants from the ground truth study is similar to that for mutants overall, then it would appear that the Java language suffers significantly less from the equivalent mutant problem than the C language does.
One might conjecture that this is related to the relatively small size of Java methods when compared to the size of C functions. Alternative conjectures might revolve around the differing semantic features of these two languages (and the consequent mutation operators that are applicable). Of course, since we have insufficient data to make scientifically reliable statements on these conjectures, we have refrained from making any claims in the present paper and leave them as just that: conjectures. Nevertheless, our results suggest that future work might use TCE as one approach to tackle such conjectures, potentially leading to a better understanding of the difference between programming language semantics, based on mutation analysis.

Fig. 1: The proportion of equivalent and duplicated mutants detected by TCE per studied C program.

Fig. 2: The proportion of equivalent and duplicated mutants detected by TCE per mutant operator in case of C.

Fig. 4: The proportion of equivalent and duplicated mutants detected by TCE per program studied in case of Java.

Fig. 5: Mutation score improvements by TCE, when no manual analysis of equivalent mutants has been performed, e.g. in large-scale experiments.


Fig. 6: Work required for different effectiveness levels with and without the application of TCE in the case of Java.

Fig. 8: Overall Work Reduction after the application of TCE per test subject and programming language.

TABLE 2: Details of C subjects: 'LoC' shows the lines of code of the project; 'Comp' and 'Comp-Size' show the components considered and their size; 'Func' and 'Muts' show the number of functions and mutants of the components.

TABLE 3: Java Test Subjects' details: 'LoC' shows the source code lines of the projects; 'Package' and 'Class-Size' present the packages of the considered classes and their size; the 'Methods' and 'Mutants' columns show the number of methods and the corresponding number of generated mutants.

TABLE 4: Manually-analysed Java test subjects' details: 'Program' and 'Method' columns present the examined programs and the considered methods; 'Mutants' shows the number of the generated mutants and 'Equivalent' the number of the manually-identified equivalent ones.

TABLE 7: Equivalent and duplicated mutants detected by TCE via gcc. 'None', '-O', '-O2' and '-O3' report the fraction of all identified equivalent mutants that were detected per optimisation flag. '#Mutants' reports the distinct number of detected mutants by all the options together and '% of all Mutants' reports the percentage of detected to the number of mutants.

TABLE 10: Number 'No.' and proportion '%' of equivalent and duplicated mutants detected by TCE per operator.

TABLE 11: Equivalent and duplicated mutants detected by TCE via javac and SOOT.

TABLE 12: Execution time, measured in sec., of equivalent and duplicated mutant detection per considered tool and test subject.

TABLE 13: TCE applied to Java benchmark set: Number 'No.' and proportion '%' of detected equivalent mutants.

TABLE 14: Number 'No.' and proportion '%' of equivalent mutants detected by TCE per operator.

TABLE 15: Number 'No.' and proportion '%' of duplicated mutants detected by TCE per operator.

Algorithm 1: Calculating the work metric.
Let muts represent the program's generated mutants
Let tcs represent the program's mutation test suite
1: function WORKCALCULATION(muts, tcs)

12. XStream is a clear outlier but it should be mentioned that it is the only program for which TCE did not detect any equivalent mutant.
Marinos Kintis is partly supported by the Research Centre of Athens University of Economics and Business (RC/AUEB).Mike Papadakis is supported by the National Research Fund, Luxembourg, INTER/MOBILITY/14/7562175 and by Microsoft Azure Grant 2015.Mark Harman is partly supported by the UK EPSRC projects EP/I033688 (GISMO) and EP/G060525, a 'Platform Grant' for the Centre for Research on Evolution Search and Testing (CREST) at UCL. Yue Jia is supported by the EPSRC project EP/J017515 (DAASE) and by Microsoft Azure Grant 2014.