A Memetic Algorithm for whole test suite generation

The generation of unit-level test cases for structural code coverage is a task well-suited to Genetic Algorithms. Method call sequences must be created that construct objects, put them into the right state and then execute uncovered code. However, the generation of primitive values, such as integers and doubles, characters that appear in strings, and arrays of primitive values, is not so straightforward. Often, small local changes are required to drive the value toward the one needed to execute some target structure. However, global searches like Genetic Algorithms tend to make larger changes that are not concentrated on any particular aspect of a test case. In this paper, we extend the Genetic Algorithm behind the EvoSuite test generation tool into a Memetic Algorithm, by equipping it with several local search operators. These operators are designed to efficiently optimize primitive values and other aspects of a test suite that allow the search for test cases to function more effectively. We evaluate our operators using a rigorous experimental methodology on over 12,000 Java classes, comprising open source classes of various different kinds, including numerical applications and text processors. Our study shows that increases in branch coverage of up to 53% are possible for an individual class in practice.


Introduction
Search-based testing uses optimization techniques such as Genetic Algorithms to generate test cases. Traditionally, the technique has been applied to test inputs for procedural programs, such as those written in C (McMinn, 2004). More recently, the technique has been applied to the generation of unit test cases for object-oriented software (Fraser and Arcuri, 2013b). The problem of generating such test cases is much more complicated than for procedural code. To generate tests that cover all of the branches in a class, for example, the class must be instantiated, and a method call sequence may need to be generated to put the object into a certain state. These method calls may themselves require further objects as parameters, or primitive values such as integers and doubles, or strings, or arrays of values.
The EvoSuite tool (Fraser and Arcuri, 2011) uses Genetic Algorithms to generate a whole test suite, composed of a number of test cases. Although empirical experiments have shown that it is practically usable on a wide range of programs, Genetic Algorithms are a global search technique, which tend to induce macro-changes on the test suite. In order to cover certain branches, more focused changes are required. If, for example,
Fig. 1. Example class and test case: in theory, four edits of s can lead to the target branch being covered. However, with a Genetic Algorithm where each statement of the test is mutated with a certain probability (e.g., 1/3 when there are three statements) one would have to be really lucky: if the test is part of a test suite (size 10) of a Genetic Algorithm (population 50) and we only assume a character range of 128, then even if we ignore all the complexities of Genetic Algorithms, we would still need on average at least 50 × 4 × 1/((1/10) × (1/3) × (1/128)) = 768,000 fitness evaluations before covering the target branch.
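The estimate in the Fig. 1 caption can be reproduced with exact rational arithmetic. The following is a minimal sketch; the function and parameter names are ours and purely illustrative, and the calculation deliberately ignores selection and crossover, as the caption does.

```python
from fractions import Fraction


def expected_evaluations(population, edits, suite_size, statements, char_range):
    """Rough lower bound on fitness evaluations, following the Fig. 1 estimate."""
    # Probability that a single mutation picks the right test in the suite,
    # the right statement in that test, and the right replacement character.
    p_hit = (Fraction(1, suite_size)
             * Fraction(1, statements)
             * Fraction(1, char_range))
    # Every individual of the population is evaluated, and several edits are needed.
    return population * edits / p_hit


estimate = expected_evaluations(population=50, edits=4, suite_size=10,
                                statements=3, char_range=128)
print(estimate)  # 768000
```

Using `Fraction` avoids floating point rounding in the nested reciprocals, so the result matches the figure in the text exactly.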
2. Local search for complex values: We extend the notion of local search as commonly performed on numerical inputs to string inputs, arrays, and objects.
3. Test suite improvement: We define operators on test suites that allow test suites to improve themselves during phases of Lamarckian learning.
4. Sensitivity analysis: We have implemented the approach as an extension to the EvoSuite tool (Fraser and Arcuri, 2013b), and analyze the effects of the different parameters involved in the local search, and determine the best configuration.
5. Empirical evaluation: We evaluate our approach in detail on a set of 16 open source classes as well as two large benchmarks (comprising more than 12,000 classes), and compare the results to the standard search-based approach that does not include local search.
This paper is an extension of , and it is organized as follows: Section 2 presents relevant background to search-based testing, and the different types of search that may be applied, including local search and search using Genetic and Memetic Algorithms. Section 3 discusses the global search and fitness function applied to optimize test suites for classes toward high code coverage with EvoSuite. Section 4 discusses how to extend this approach with local operators designed to optimize primitive values such as integers and floating point values, strings and arrays. Section 5 then presents our experiments and discusses our findings, showing how our local search operators, incorporated into a Memetic Algorithm, result in higher code coverage. A discussion on the threats to validity of this study follows in Section 6. Finally, Section 7 concludes the paper.

Search-based test case generation
Search-based testing applies meta-heuristic search techniques to the task of test case generation (McMinn, 2004). In this section, we briefly review local and global search approaches to testing, and the combination of the two in the form of Memetic Algorithms.

Local search algorithms
With local search algorithms (Arcuri, 2009) one only considers the neighborhood of a candidate solution. For example, a hill climbing search usually starts from a random solution, all of whose neighbors are evaluated with respect to their fitness for the search objective. The search then continues from either the first neighbor that improves the fitness or the best neighbor, and again considers its neighborhood. The search can easily get stuck in local optima, which are typically overcome by restarting the search with new random values, or with some other form of escape mechanism (e.g., by accepting a worse solution temporarily, as with simulated annealing). Different types of local search algorithms exist, including simulated annealing, tabu search, iterated local search and variable neighborhood search (see Gendreau and Potvin, 2010, for example, for further details).
A popular version of local search often used in test data generation is the Alternating Variable Method (AVM), a technique similar to hill climbing that was introduced by Korel (1990) (see also Ferguson and Korel, 1996). The AVM considers each input variable of an optimization function in isolation, and tries to optimize it locally. Initially, variables are set to random values. Then, the AVM starts with "exploratory moves" on the first variable. For example, in the case of an integer an exploratory move consists of adding a delta of +1 or −1. If the exploratory move was successful (i.e., the fitness improved), then the search accelerates movement in the direction of improvement with so-called "pattern moves". For example, in the case of an integer, the search would next try +2, then +4, etc. Once the next step of the pattern search does not improve the fitness any further, the search goes back to exploratory moves on this variable. If successful, pattern search is again applied in the direction of the exploratory move. Once no further improvement of the variable value is possible, the search moves on to the next variable. If no variable can be improved, the search restarts at another randomly chosen location to overcome local optima.
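The exploratory and pattern moves described above can be sketched for integer inputs as follows. This is an illustrative sketch rather than Korel's or EvoSuite's implementation: the function name, the evaluation limit `max_evals`, and the toy fitness function are assumptions, and the random restart step is omitted for brevity.

```python
def avm_minimize(fitness, start, max_evals=10_000):
    """Alternating Variable Method over a list of integers (minimization)."""
    x = list(start)
    best = fitness(x)
    evals = 1
    improved_any = True
    while improved_any and best > 0 and evals < max_evals:
        improved_any = False
        for i in range(len(x)):  # optimize one variable at a time
            while evals < max_evals:
                # Exploratory moves: try +1, then -1.
                direction = 0
                for d in (+1, -1):
                    cand = x.copy()
                    cand[i] += d
                    f = fitness(cand)
                    evals += 1
                    if f < best:
                        x, best, direction = cand, f, d
                        break
                if direction == 0:
                    break  # no improvement: move on to the next variable
                improved_any = True
                # Pattern moves: keep doubling the step in the same direction.
                step = 2
                while evals < max_evals:
                    cand = x.copy()
                    cand[i] += direction * step
                    f = fitness(cand)
                    evals += 1
                    if f >= best:
                        break  # back to exploratory moves
                    x, best = cand, f
                    step *= 2
    return x, best


# Toy objective: reach the point (17, -5).
result = avm_minimize(lambda v: abs(v[0] - 17) + abs(v[1] + 5), [0, 0])
print(result)  # ([17, -5], 0)
```

The doubling step lets the search cover large distances in a logarithmic number of evaluations, while the return to exploratory moves recovers from overshooting.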

Global search algorithms
In contrast to local search algorithms, global search algorithms try to overcome local optima in order to find more globally optimal solutions. Harman and McMinn (2010) recently determined that global search is more effective than local search, but less efficient, as it is more costly. With evolutionary testing, one of the most commonly applied global search algorithms is a Genetic Algorithm (GA). A GA tries to imitate the natural processes of evolution: an initial population of usually randomly produced candidate solutions is evolved using search operators that resemble natural processes. Selection of parents for reproduction is based on their fitness (survival of the fittest). Reproduction is performed using crossover and mutation with certain probabilities. With each iteration of the GA, the fitness of the population improves until either an optimal solution has been found, or some other stopping condition has been met (e.g., a maximum time limit or a certain number of fitness evaluations).
In evolutionary testing, the population would for example consist of test cases, and the fitness estimates how close a candidate solution is to satisfying a coverage goal. The initial population is usually generated randomly, i.e., a fixed number of random input values is generated. The operators used in the evolution of this initial population depend on the chosen representation. A fitness function guides the search in choosing individuals for reproduction, gradually improving the fitness values with each generation until a solution is found. For example, to generate tests for specific branches (to achieve branch coverage of a program), a common fitness function (McMinn, 2004) integrates the approach level (number of unsatisfied control dependencies) and the branch distance (estimation of how close a branching condition is to being evaluated as desired). Such search techniques have not only been applied in the context of primitive datatypes, but also to test object-oriented software using method sequences (Tonella, 2004; Fraser and Zeller, 2012).
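To make the combination of approach level and branch distance concrete, the following sketch computes the fitness for a hypothetical target nested under two conditions. The example program, the constant 1 added to the distance of the relational condition, and the normalization d/(d + 1) are assumptions chosen for illustration.

```python
def normalize(distance):
    """Map a non-negative branch distance into [0, 1)."""
    return distance / (distance + 1.0)


def fitness_for_target(a, b):
    """Fitness (to be minimized) for reaching target() in the hypothetical code:

        if a > 0:
            if b == 17:
                target()
    """
    if a <= 0:
        # Execution diverged at the outer condition, so one control
        # dependency of the target remains unsatisfied.
        approach_level = 1
        branch_distance = (0 - a) + 1  # how far a is from satisfying a > 0
    else:
        approach_level = 0
        branch_distance = abs(b - 17)  # how far b is from satisfying b == 17
    return approach_level + normalize(branch_distance)


print(fitness_for_target(-3, 10))  # diverges at the outer condition
print(fitness_for_target(1, 10))   # reaches the inner condition
print(fitness_for_target(1, 17))   # covers the target
```

Because the normalized distance is always below 1, an input that gets one control dependency closer to the target always has a better fitness than one that diverges earlier.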

Memetic Algorithms
A Memetic Algorithm (MA) hybridizes global and local search, such that the individuals of a population in a global search algorithm have the opportunity for local improvement in terms of local search. With Lamarckian-style learning local improvement achieved by individuals is encoded in the genotype and thus passed on to the next generation. In contrast, with Baldwinian learning, the improvement is only encoded in terms of the fitness value, whereas the genotype remains unchanged. The Baldwin effect describes that as a result, individuals with more potential for improvement are favored during evolution, which essentially smoothes the fitness landscape. Yao et al. (2005) report no difference between the two types of learning, whereas other experiments showed that Baldwinian learning may lead to better results but takes significantly longer (Whitley et al., 1994). The use of MAs for test generation was originally proposed by Wang and Jeng (2006) in the context of test generation for procedural code, and has since then been applied in different domains, such as combinatorial testing (Rodriguez-Tello and Torres-Jimenez, 2010). Harman and McMinn (2010) analyzed the effects of global and local search, and concluded that MAs achieve better performance than global search and local search. In the context of generating unit tests for object-oriented code, Arcuri and Yao (2007) combined a GA with hill climbing to form an MA when generating unit tests for container classes. Liaskos and Roper (2008) also confirmed that the combination of global and local search algorithms leads to improved coverage when generating test cases for classes. Baresi et al. (2010) also use a hybrid evolutionary search in their TestFul test generation tool, where at the global search level a single test case aims to maximize coverage, while at the local search level the optimization targets individual branch conditions. 
Although MAs have already been used in the past for unit test generation, their applications have mainly focused on numerical data types and on covering specific testing targets (e.g., a branch) with a single test case. In this paper, we instead provide a comprehensive approach for object-oriented software, targeting whole test suites, handling different kinds of test data like strings and arrays, and also considering adaptive parameter control. Furthermore, we provide an extensive empirical evaluation to determine how to best combine the local and global search parts of the presented MA.

Whole test suite generation
With whole test suite generation, the optimization target is not to produce a test that reaches one particular coverage goal, but it is to produce a complete test suite that maximizes coverage, while minimizing the size at the same time.

Representation
An individual of the search is a test suite, which is represented as a set $T$ of test cases $t_i$. Given $|T| = n$, we have $T = \{t_1, t_2, \ldots, t_n\}$. A test case is a sequence of statements $t = \langle s_1, s_2, \ldots, s_l \rangle$, where the length of a test case is defined as $\mathrm{length}(\langle s_1, s_2, \ldots, s_l \rangle) = l$. The length of a test suite is defined as the sum of the lengths of its test cases, i.e., $\mathrm{length}(T) = \sum_{t \in T} \mathrm{length}(t)$.
There are several different types of statements in a test case: primitive statements define primitive values, such as Booleans, integers, or strings; constructor statements invoke constructors to produce new values; method statements invoke methods on existing objects, using existing objects as parameters; field statements retrieve values from public members of existing objects; array statements define arrays; assignment statements assign values to array indexes or public member fields of existing objects. Each of these statements defines a new variable, with the exception of void method calls and assignment statements. Variables used as parameters of constructor and method calls and as source objects for field assignments or retrievals need to be already defined by the point at which they are used in the sequence.
Fig. 2. How the crossover operator works with test suites. Given two parent test suites (shown on the "before" side of the figure), two offspring test suites are produced (depicted on the "after" side of the figure) following splicing of the parent test suites at a given crossover point.
Crossover of test suites means that offspring recombine test cases from parent test suites, as Fig. 2 shows. For two selected parents $P_1$ and $P_2$, a random value $\alpha$ is chosen from $[0, 1]$. The first offspring $O_1$ contains the first $\alpha|P_1|$ test cases from the first parent, followed by the last $(1 - \alpha)|P_2|$ test cases from the second parent. The second offspring $O_2$ contains the first $\alpha|P_2|$ test cases from the second parent, followed by the last $(1 - \alpha)|P_1|$ test cases from the first parent.
Mutation of test suites means that test cases are inserted, deleted, or changed. With probability $\sigma$, a test case is added. If it is added, then a second test case is added with probability $\sigma^2$, and so on until the $i$th test case is not added (which happens with probability $1 - \sigma^i$). Each test case is changed with probability $1/|T|$. There are many different options to change a test case: one can delete or alter existing statements, or insert new statements. We perform each of these three operations with probability 1/3; on average, only one of them is applied, although with probability $(1/3)^3$ all of them are applied. When removing statements from a test it is important that this operation ensures that all dependencies are satisfied. Inserting statements into a test case means inserting method calls on existing calls, or adding new calls on the class under test. For details on the mutation operators we refer the reader to Fraser and Arcuri (2013b).
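The crossover and the test insertion step of the mutation can be sketched as follows. Test suites are modeled as plain lists of opaque test cases; rounding the split points down and the safety cap `limit` are assumptions of this sketch, not details taken from EvoSuite.

```python
import random


def crossover(p1, p2, rng):
    """Splice two parent suites at the same relative position alpha."""
    alpha = rng.random()
    k1 = int(alpha * len(p1))  # number of leading tests taken from P1
    k2 = int(alpha * len(p2))  # number of leading tests taken from P2
    o1 = p1[:k1] + p2[k2:]     # first alpha|P1| of P1, last (1 - alpha)|P2| of P2
    o2 = p2[:k2] + p1[k1:]     # first alpha|P2| of P2, last (1 - alpha)|P1| of P1
    return o1, o2


def number_of_inserted_tests(sigma, rng, limit=100):
    """The i-th new test is added with probability sigma ** i."""
    count = 0
    while count < limit and rng.random() < sigma ** (count + 1):
        count += 1
    return count


rng = random.Random(0)
parent1 = ["a1", "a2", "a3", "a4"]
parent2 = ["b1", "b2"]
child1, child2 = crossover(parent1, parent2, rng)
print(child1, child2)
```

Because both parents are cut at the same relative position, every test case ends up in exactly one offspring and no offspring is longer than the longer parent.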

Fitness function
In this paper, we consider branch coverage as the optimization target, although the approach can be applied to any coverage criterion that can be expressed with a fitness function. Typically, fitness functions for other coverage criteria are based on the branch coverage fitness function. Branch coverage requires that for every conditional statement in the code there is at least one test that makes it evaluate to true, and one that makes it evaluate to false. For this, we can use a standard metric used in search-based testing, the branch distance. For every branch, the branch distance estimates how close that branch was to evaluating to true or to false. For example, if we have the branch x == 17, and a concrete test case where x has the value 10, then the branch distance to make this branch true would be 17 − 10 = 7, while the branch distance to making this branch false is 0. To achieve branch coverage in whole test suite generation, the fitness function tries to optimize the sum of all normalized, minimal branch distances to 0: if for each branch there exists a test such that the execution leads to a branch distance of 0, then all branches have been covered. Let $d(b, T)$ be the branch distance of branch $b$ on test suite $T$:

$$d(b, T) = \begin{cases} 0 & \text{if the branch has been covered,} \\ \nu(d_{min}(b, T)) & \text{if the predicate has been executed at least twice,} \\ 1 & \text{otherwise.} \end{cases}$$

We require each branching statement to be executed twice to avoid the situation where the search oscillates between the two potential evaluations of the branch predicate. $\nu$ is a normalization function (Arcuri, 2013) with the range $[0, 1]$. Assuming the set of branches $B$ for a given class under test, this leads to the following fitness function, which is to be minimized by the search:

$$\mathrm{fitness}(T) = |M| - |M_T| + \sum_{b \in B} d(b, T)$$

Here, $M$ is the set of methods in the class under test, while $M_T$ is the set of methods executed by $T$.
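The following sketch evaluates this fitness from summarized execution data. It assumes that a branch counts as covered when its minimal distance is 0, that the method term counts the methods of the class not executed by the suite, and that ν(x) = x/(x + 1); the data layout (a dictionary from branch name to minimal distance and predicate execution count) is hypothetical.

```python
def normalize(distance):
    """Normalization into [0, 1); the concrete function is an assumption."""
    return distance / (distance + 1.0)


def branch_term(min_distance, executions):
    """d(b, T) for one branch, given its minimal distance over the suite."""
    if executions >= 1 and min_distance == 0:
        return 0.0                      # branch has been covered
    if executions >= 2:
        return normalize(min_distance)  # predicate executed at least twice
    return 1.0                          # not executed often enough


def suite_fitness(all_methods, executed_methods, branches):
    """Number of unexecuted methods plus the sum of all branch terms."""
    missing_methods = len(all_methods) - len(all_methods & executed_methods)
    return missing_methods + sum(branch_term(d, n) for d, n in branches.values())


# branch name -> (minimal branch distance, executions of its predicate)
branches = {
    "b1_true": (0, 3),
    "b1_false": (3, 3),
    "b2_true": (5, 1),
    "b2_false": (0, 1),
}
value = suite_fitness({"a", "b", "c"}, {"a", "b"}, branches)
print(value)  # 2.75
```

In this example, one method is never called (1), `b1_false` contributes 3/4, and `b2_true` contributes 1 because its predicate was executed only once, so its distance is not yet trusted.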

Search guidance on strings
The fitness function in whole test suite generation is based on branch distances. EvoSuite works directly on Java bytecode, where, except for reference comparisons, the branching instructions are all based on numerical values. Comparisons on strings first map to Boolean values, which are then used in further computations; e.g., a source code branch like if(string1.equals(string2)) consists of a method call on String.equals followed by a comparison of the Boolean return value with true. To offer guidance on string-based branches we replace calls to the String.equals method with a custom method that returns a distance measurement (Li and Fraser, 2011). The branching conditions comparing the Boolean with true thus have to be changed to check whether this distance measurement is greater than 0 or not (i.e., == true is changed to == 0, and == false is changed to > 0). The distance measurement itself depends on the search operators used; for example, if the search operators support inserts, changes, and deletions, then the Levenshtein distance measurement can be used. This transformation is an instance of testability transformation (Harman et al., 2004), which is commonly applied to improve the guidance offered by the search landscape of programs.
Search operators for string values were initially proposed by Alshraideh and Bottaci (2006). Based on our distance measurement, when a primitive statement defining a string value is mutated, each of the following is applied with probability 1/3 (i.e., with probability $(1/3)^3$ all are applied):
Deletion: Every character in the string is deleted with probability 1/n, where n is the length of the string. Thus, on average, one character is deleted.
Change: Every character in the string is changed with probability 1/n; if it is changed, then it is replaced with a random character.
Insertion: With probability $\alpha = 0.5$, a random character is inserted at a random position p within the string. If a character was inserted, then another character is inserted with probability $\alpha^2$, and so on, until no more characters are inserted.
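As an illustration of the two ingredients described above, the sketch below implements a Levenshtein distance that could stand in for the Boolean result of a string comparison, together with the three mutation operators. The function names and the choice of `string.printable` as the character pool are assumptions; this is not the instrumentation code used by EvoSuite.

```python
import random
import string


def levenshtein(a, b):
    """Edit distance counting insertions, deletions and changes."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # change ca into cb
        prev = cur
    return prev[-1]


def mutate_string(s, rng, alpha=0.5, alphabet=string.printable):
    """Apply deletion, change and insertion, each with probability 1/3."""
    if rng.random() < 1 / 3 and s:   # deletion: each character with prob. 1/n
        n = len(s)
        s = "".join(c for c in s if rng.random() >= 1 / n)
    if rng.random() < 1 / 3 and s:   # change: each character with prob. 1/n
        n = len(s)
        s = "".join(rng.choice(alphabet) if rng.random() < 1 / n else c
                    for c in s)
    if rng.random() < 1 / 3:         # insertion: k-th character with prob. alpha ** k
        k = 1
        while rng.random() < alpha ** k:
            pos = rng.randint(0, len(s))
            s = s[:pos] + rng.choice(alphabet) + s[pos:]
            k += 1
    return s


print(levenshtein("kitten", "sitting"))  # 3
print(repr(mutate_string("hello", random.Random(1))))
```

A distance of 0 corresponds to the original comparison returning true, so the transformed branch condition can be checked against 0 while still offering a gradient to the search.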

Applying Memetic Algorithms
The whole test suite generation presented in the previous section is a global optimization technique, which means that we are trying to optimize an entire candidate solution toward the global optimum (maximum coverage). Search operations in global search can lead to large jumps in the search space. In contrast, local search explores the immediate neighborhood. For example, if we have a test suite consisting of X test cases of average length L, then the probability of mutating one particular primitive value featuring in one of those test cases is (1/X) × (1/L). However, evolving a primitive value to a target value may require many steps, and so global search can easily exceed the search budget before finding a solution. This is a problem that local search can overcome.

Local search on method call sequences
The aim of the local search is to optimize the values in one particular test case of a test suite. When local search is applied to a test case, EvoSuite iterates over its sequence of statements from the last to the first, and for each statement applies a local search dependent on the type of the statement. Local search is performed for the following types of statements: primitive statements, method statements, constructor statements, field statements and array statements. Calculating the fitness value after a local search operator has been applied only requires partial fitness evaluation: EvoSuite stores the last execution trace with each test case, and from this the fitness value can be calculated. Whenever a test case is modified during the search, either by regular mutation or by local search, the cached execution trace is deleted. Thus, a fitness evaluation for local search only requires that one test out of a test suite is actually executed.

Primitive statements
Booleans and enumerations: for Boolean variables the only option is to flip the value. For enumerations, an exploratory move consists of replacing the enum value with any other value, and if the exploratory move was successful, we iterate over all enumeration values.
Integer datatypes: for integer variables (which includes all flavors such as byte, short, char, int, long) the possible exploratory moves are +1 and −1. The exploratory move decides the direction of the pattern move. If an exploratory move to +1 was successful, then with every iteration $I$ of the pattern search we add $\delta = 2^I$ to the variable. If +1 was not successful, −1 is used as exploratory move, and if successful, subsequently $\delta$ is subtracted.
Floating point datatypes: for floating point variables (float, double) we use the same approach as originally defined by Harman and McMinn (2007) for handling floating point numbers with the AVM. Exploratory moves are performed for a range of precision values $p$, where the precision ranges from 0 to 7 for float variables, and from 0 to 15 for double values. Exploratory moves are applied using $\delta = 2^I \times dir \times 10^{-p}$. Here $dir$ denotes either +1 or −1, and $I$ is the number of the iteration, which is 0 during exploratory moves. If an exploratory move was successful, then pattern moves are made by increasing $I$ when calculating $\delta$.
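The floating point variant can be sketched for a single variable as follows, iterating over the precisions from coarse to fine. The function name, the evaluation limit `max_evals`, and the toy objective are assumptions of this sketch.

```python
def local_search_float(fitness, value, max_precision=15, max_evals=10_000):
    """AVM-style search on one float using delta = 2**I * dir * 10**-p."""
    best = fitness(value)
    evals = 1
    for p in range(max_precision + 1):      # precision p = 0 .. max_precision
        improved = True
        while improved and evals < max_evals:
            improved = False
            for direction in (+1, -1):
                i = 0                        # I = 0 is the exploratory move
                while evals < max_evals:
                    delta = (2 ** i) * direction * 10.0 ** (-p)
                    candidate = value + delta
                    f = fitness(candidate)
                    evals += 1
                    if f >= best:
                        break                # step did not improve the fitness
                    value, best = candidate, f
                    improved = True
                    i += 1                   # pattern move: double the step
                if improved:
                    break                    # restart with exploratory moves
    return value, best


# Toy objective: approach 3.14159 with float-like precision (p up to 7).
found, distance = local_search_float(lambda x: abs(x - 3.14159), 0.0,
                                     max_precision=7)
print(found, distance)
```

Each precision level refines the result of the previous one, so the integer part is settled first and the decimal places afterwards.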

String statements
For string variables, exploratory moves are slightly more complicated than in the case of primitive statements: to determine if local search on a string variable is necessary, we first apply n random mutations on the string. These mutations are the same as described in Section 3.3. If any of the n probing mutations changed the fitness, then we know that modifying the string has some effect, regardless of whether the change resulted in an improvement in fitness or not. As discussed in Section 3.3, string values affect the fitness through a range of Boolean conditions that are used in branches; these conditions are transformed such that the branch distance also gives guidance on strings. If the probing on a string showed that it affects the fitness, then we apply a systematic local search on the string. The operations on the string must reflect the distance estimation applied on string comparisons:

Deletion: First, every single character is removed and the fitness value is checked. If the fitness did not improve, the character is kept in the string.
Change: Second, every single character is replaced with every possible other character; for practical reasons, we restrict the search to ASCII characters. If a replacement is successful, we move to the next character. If a character was not successfully replaced, the original character stays in place.
Insertion: Third, we insert new characters. Because the fitness evaluation requires test execution, trying to insert every possible character at every possible position would be too expensive, yet this is what would be required when using the standard Levenshtein distance (edit distance) as distance metric. Consequently, we only attempt to insert characters at the front and the back, and adapt the distance function for strings accordingly.
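The three steps above can be sketched as a systematic search over one string. The fitness function is passed in as a parameter; in the usage example a plain Levenshtein distance to a hypothetical target string is used as a stand-in, which is an assumption of this sketch and not the adapted distance function of the approach.

```python
ASCII = [chr(c) for c in range(32, 127)]  # printable ASCII characters


def local_search_string(fitness, s):
    """Deletion, change, then insertion at front and back (minimization)."""
    best = fitness(s)
    # 1. Deletion: try to remove every single character.
    i = 0
    while i < len(s):
        cand = s[:i] + s[i + 1:]
        f = fitness(cand)
        if f < best:
            s, best = cand, f   # keep the deletion; index i now holds the next char
        else:
            i += 1              # the character stays in the string
    # 2. Change: try every other character at every position.
    for i in range(len(s)):
        for c in ASCII:
            if c == s[i]:
                continue
            cand = s[:i] + c + s[i + 1:]
            f = fitness(cand)
            if f < best:
                s, best = cand, f
                break           # successful replacement: go to the next position
    # 3. Insertion: only at the front and at the back.
    improved = True
    while improved:
        improved = False
        for c in ASCII:
            for cand in (c + s, s + c):
                f = fitness(cand)
                if f < best:
                    s, best, improved = cand, f, True
                    break
            if improved:
                break
    return s, best


def levenshtein(a, b):
    """Stand-in fitness for the example: edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


repaired = local_search_string(lambda s: levenshtein(s, "hello"), "xhelzo!")
extended = local_search_string(lambda s: levenshtein(s, "hello"), "ell")
print(repaired, extended)
```

Restricting insertions to the two ends keeps the number of test executions per string linear in the alphabet size instead of growing with every possible position.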
The distance function for two strings s1 and s2 used during the search is (c.f. Kapfhammer et al., 2013):

Array statements
Local search on arrays concerns the length of an array and the values assigned to the slots of the array. To allow efficient search on the array length, the first step of the local search is to try to remove assignments to array slots. For an array of length n, we first try to remove the assignment at slot n − 1. If the fitness value remains unchanged, we try to remove the assignment at slot n − 2, and so on, until we find the highest index n′ for which an assignment positively contributes to the fitness value. Then, we apply a regular integer-based local search on the length value of the array, making sure the length does not get smaller than n′ + 1. Once the search has found the best length, we expand the test case with assignments to all slots of the array that are not already assigned in the test case (such assignments may be deleted as part of the regular search). Then, on each assignment to the array we perform a local search, depending on the component type of the array.

Reference type statements
Statements related to reference values (method statement, constructor statement, field statement) do not allow traditional local search in terms of primitive values. The neighborhood of a complex type in a sequence of calls is huge (e.g., all possible calls on an object with all possible parameter combinations, etc.), such that exhaustive search is not a viable option. Therefore, we apply randomized hill climbing on such statements. This local search consists of repeatedly applying random mutations to the statement, and it is stopped if there are R consecutive mutations that did not improve the fitness (in our experiments, R = 10). We use the following mutations for this randomized hill climbing:
• Replace the statement with a random call returning the same type.
• Replace a parameter (for method and constructor statements) or the receiving object (for field and method statements) with any other value of the same type available in the test case.
• If the call creates a non-primitive object, add a random method call on the object after the statement.
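The stopping rule of this randomized hill climbing can be sketched independently of the concrete mutations. In the sketch below the candidate, the `mutate` callback and the toy fitness are placeholders standing in for a statement and the three mutations listed above; only the limit of R = 10 consecutive failures is taken from the text.

```python
import random


def randomized_hill_climb(candidate, fitness, mutate, rng, max_failures=10):
    """Apply random mutations until R consecutive ones fail to improve."""
    best = fitness(candidate)
    failures = 0
    while failures < max_failures:
        neighbor = mutate(candidate, rng)
        f = fitness(neighbor)
        if f < best:
            candidate, best, failures = neighbor, f, 0  # accept, reset counter
        else:
            failures += 1                               # reject the mutation
    return candidate, best


def toy_mutate(values, rng):
    """Placeholder mutation: nudge one randomly chosen entry by +1 or -1."""
    mutated = list(values)
    mutated[rng.randrange(len(mutated))] += rng.choice((-1, 1))
    return mutated


def toy_fitness(values):
    return sum(abs(v) for v in values)


start = [5, -4, 9]
climbed, climbed_fitness = randomized_hill_climb(
    start, toy_fitness, toy_mutate, random.Random(3))
print(climbed, climbed_fitness)
```

Only improving mutations are kept, so the result is never worse than the starting point, and the failure counter bounds the effort spent on a statement that cannot be improved.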

Local search on test suites
While the smallest possible search steps in the neighborhood of a test suite are defined by the tests' statements as discussed in the previous section, Lamarckian evolution in principle permits individuals to improve with any local refinements, and not just local search algorithms. This section describes some local improvements that can be performed on test suites with respect to the goal of achieving high code coverage.

Primitive value expansion
The search operators creating sequences of method calls allow variables to be reused in several statements. This is beneficial for certain types of coverage problems: for example, the case of an equilateral triangle (thus requiring three equal integer inputs) in the famous triangle example becomes a trivial problem when allowing variable reuse. However, variable reuse can also inhibit local exploration. In the case of the triangle example, given a test that creates an equilateral triangle using only a single variable, it is impossible for local search on the primitive values in the test to derive any other type of triangle. Therefore, a possible local improvement of a test suite lies in making all variables uniquely used. That is, the triangle case would be converted to a test with three variables that have the same value, thus permitting local search to optimize each side independently again (Fig. 3).
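A minimal sketch of this expansion is shown below. The tuple-based test representation and the naming scheme for fresh variables are hypothetical simplifications, chosen only to show the transformation on the triangle example.

```python
def expand_primitives(test):
    """Give every use of a primitive variable its own copy.

    A test is a list of statements of the (hypothetical) forms
    ('prim', variable, value) and ('call', method, [argument variables]).
    Primitive definitions are re-emitted at each use, so unused ones vanish.
    """
    values = {stmt[1]: stmt[2] for stmt in test if stmt[0] == "prim"}
    expanded = []
    counter = 0
    for stmt in test:
        if stmt[0] == "prim":
            continue
        _, method, args = stmt
        new_args = []
        for arg in args:
            if arg in values:
                fresh = f"{arg}_{counter}"   # illustrative naming scheme
                counter += 1
                expanded.append(("prim", fresh, values[arg]))
                new_args.append(fresh)
            else:
                new_args.append(arg)         # non-primitive arguments are kept
        expanded.append(("call", method, new_args))
    return expanded


triangle_test = [("prim", "x", 5), ("call", "classify", ["x", "x", "x"])]
expanded_test = expand_primitives(triangle_test)
for statement in expanded_test:
    print(statement)
```

After the expansion, the three sides are backed by three separate variables with the same value, so a later local search can change one side without affecting the others.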

Ensuring double execution
The branch coverage fitness function (Section 3.2) requires that each branching statement is executed twice, to prevent the search from oscillating between the true/false outcomes at the branching statement. If for a given test suite a branching predicate is covered only once, then it is possible to improve the test suite simply by duplicating the test that covers the predicate.

Restoring coverage
The fitness function captures the overall coverage of a test suite, and how close it is to covering more branches. This means that the fitness value does not capture which branches are covered, and so a test suite with worse fitness than another might still cover some branches the "better" test suite does not cover. Again it is possible to apply a local improvement measure to counter this issue: we keep a global archive of tests, and whenever a new branch is covered for the first time, this test is added to the archive. If a test suite determines that it is not covering branches that have been covered in the past, it can take the corresponding test cases from that archive.
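The archive can be sketched as a mapping from branches to the first test that covered them. The class name, the method names and the `coverage_of` callback are assumptions of this sketch; tests and branches are represented by plain strings.

```python
class CoverageArchive:
    """Global archive remembering the first test seen covering each branch."""

    def __init__(self):
        self.first_test = {}  # branch id -> test

    def update(self, test, covered_branches):
        """Record the test for every branch it covers for the first time."""
        for branch in covered_branches:
            self.first_test.setdefault(branch, test)

    def restore(self, suite, coverage_of):
        """Return the suite extended with archived tests for lost branches."""
        covered = set().union(*(coverage_of(t) for t in suite))
        restored = list(suite)
        for branch, test in self.first_test.items():
            if branch not in covered:
                restored.append(test)
                covered |= coverage_of(test)  # the test may cover more branches
        return restored


coverage = {"t1": {"b1", "b2"}, "t2": {"b3"}, "t3": {"b2"}}
archive = CoverageArchive()
for name, covered_branches in coverage.items():
    archive.update(name, covered_branches)

restored_suite = archive.restore(["t3"], coverage.__getitem__)
print(restored_suite)  # ['t3', 't1', 't2']
```

Updating the covered set after each restored test avoids adding the same archived test twice when it covers several missing branches.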

A Memetic Algorithm for test suites
Given the ability to perform local search on the individuals of a global optimization there is the question of how to integrate these techniques. Considering the high costs of fitness evaluations in the test generation scenario, a generally preferred choice (El-Mihoub et al., 2006) is Lamarckian learning, i.e., the local search changes the genotype and its fitness value, rather than just the fitness value. A common implementation of MAs applies this learning immediately after reproduction (Moscato, 1989
Algorithm 1 (fragment)
     current_population ← random population
  3: iteration ← 1
  4: repeat
  5:   if iteration mod local search rate = 0
  6:      ∧ local search probability then
  7:     while budget for local search available do
  8:       x ← select next individual from Z
  9:       x ← local search on x
 10:       if local search successful then
 11:
Algorithm 1 shows how these parameters are used in the MA: except for Lines 5-17, this algorithm represents a regular GA. If the current iteration matches the rate with which local search should be applied, then with a given probability the local search is applied to one individual of the current population after the other, until the local search budget is used up. One possible strategy to select individuals for local search is to apply it to the worst individuals (El-Mihoub et al., 2006), which supports exploration. However, we expect fitness evaluations and local search in the test generation scenario to be very expensive, such that it can be applied only to few individuals of the population. Furthermore, test suite generation is a scenario where the global optimization alone may not succeed in finding a solution (e.g., consider the string example in Fig. 1). Therefore, we direct the learning toward the better individuals of the population, such that newly generated genetic material is more likely to directly contribute toward the solution.
The strategy implemented in EvoSuite is thus to start applying local search to the best individual of the population, then the second best, etc., until the available budget for local search is used up. The local search budget in EvoSuite can be defined in terms of fitness evaluations, test executions, number of executed statements, number of individuals on which local search is applied, or time. Finally, a further parameter determines the adaptation rate: if local search was successful, then the probability of applying it at every rth generation is increased, whereas an unsuccessful local search leads to a reduction of the probability. For this we use the approach that EvoSuite successfully applied to combine search-based testing and dynamic symbolic execution (Galeotti et al., 2013). The adaptation rate a updates the probability p after a successful (i.e., fitness was improved) local search as follows: whereas on unsuccessful local search it is updated to: Optionally, EvoSuite implements a strategy where local search is restricted to those statements where a mutation in the reproduction phase has led to a fitness change (Galeotti et al., 2013).
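The interplay of rate, probability, budget and adaptation can be sketched as a generic loop. Everything concrete in this sketch is an assumption: the budget is counted in individuals, the global search step is passed in as an `evolve` callback, and the probability update (multiply by the adaptation rate on success, divide by it on failure, capped at 1) is a stand-in rule chosen for illustration, not the update used by EvoSuite.

```python
import random


def memetic_algorithm(population, fitness, evolve, local_search, *,
                      generations, rate, probability, adaptation, budget, rng):
    """GA loop with periodic Lamarckian local search on the best individuals."""
    for iteration in range(1, generations + 1):
        population = list(evolve(population, rng))     # regular GA step
        if iteration % rate == 0 and rng.random() < probability:
            population.sort(key=fitness)               # best individual first
            success = False
            for idx in range(min(budget, len(population))):
                improved = local_search(population[idx])
                if fitness(improved) < fitness(population[idx]):
                    population[idx] = improved         # Lamarckian: keep genotype
                    success = True
            # Stand-in adaptation rule (assumption, see the lead-in).
            if success:
                probability = min(1.0, probability * adaptation)
            else:
                probability = probability / adaptation
    return min(population, key=fitness), probability


# Toy setup: individuals are integers, the optimum is 42.
def toy_fitness(x):
    return abs(x - 42)


def toy_local_search(x):
    return x + 1 if x < 42 else x - 1 if x > 42 else x


best_individual, final_probability = memetic_algorithm(
    [50, 40, 60], toy_fitness,
    evolve=lambda pop, rng: list(pop),  # identity stands in for the GA step
    local_search=toy_local_search,
    generations=10, rate=1, probability=1.0, adaptation=2.0, budget=1,
    rng=random.Random(0))
print(best_individual, final_probability)
```

Applying local search to the best individuals first mirrors the strategy described above, and writing the improved individual back into the population is what makes the learning Lamarckian rather than Baldwinian.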

Evaluation
The techniques presented in this paper depend on a number of different parameters, and so evaluation needs to take these into account. As the problem is too complex to perform a theoretical runtime analysis (e.g., such as that presented by Arcuri (2009)), we evaluate the techniques empirically.

Table 1 note: "Branches" is the number of branches reported by EvoSuite; "LOC" refers to the number of non-commenting source code lines reported by JavaNCSS (http://www.kclee.de/clemens/java/javancss).

Case study selection
For RQ1-RQ6 we need a small set of classes on which to perform extensive experiments with many different combinations of parameters. Therefore, we chose classes already used in previous experiments, but excluded those on which EvoSuite trivially achieves 100% coverage, as there was no scope for improvement with local search. To ensure that the set of classes for experimentation was varied, we tried to strike a balance among different kinds of classes. To this end, besides classes taken from the case study in Arcuri and Fraser (2013), we also included four benchmark classes on integer and floating point calculations from the Roops benchmark suite for object-oriented testing. This results in a total of 16 classes, of which some characteristics are given in Table 1.

For RQ7 we need large sets of classes with different properties. First, we used the SF100 corpus of classes. The SF100 corpus is a random selection of 100 Java projects from SourceForge, one of the largest repositories of open source projects. In total, the SF100 corpus consists of 11,088 Java classes. On one hand, the SF100 corpus is an ideal case study to show how a novel technique would affect software engineering practitioners. On the other hand, there are several research questions in unit test generation that are still open and may influence the degree of achievable improvement, like handling of files, network connections, databases, GUI events, etc. Therefore, we used the case study of the Carfast (Park et al., 2012) test generation tool as a second case study, as it represents a specific type of difficult classes that could be efficiently addressed with a hybrid local search algorithm. Table 2 summarizes the properties of this case study. Note that the Carfast paper mentions a second case study with about 1k LOC, which is not included in the archive on the website and is therefore not part of our experiments.

Experiment parameters
In addition to the parameters of the MA, local search is influenced by several other parameters of the search algorithm. Because local search is applied every X generations, how much local search is actually performed depends on the population size. Consequently, we also had to consider the population size when designing the experiments.

Experiment 1 (RQ1-RQ3): we considered five values each for the population size, the local search rate, and the local search budget, while the interpretation chosen for the local search budget was "seconds". We controlled the use of constant seeding by setting the probability of EvoSuite using seeded constants to either 0.0 or 0.2. We also included further configurations without local search (i.e., the default GA in EvoSuite), but still considering the different combinations of population size and seeding. In total, EvoSuite was run on $(2 \times 5^3) + (2 \times 5) = 260$ configurations. For each class we used an overall search budget of 10 min, but for RQ3 we also look at intermediate values of the search.

Experiment 2 (RQ4-RQ5): we considered all possible combinations of the local search operators defined in Section 4, i.e., search on strings, numbers, arrays, reference types, as well as ensuring double execution, expanding test cases, and restoring coverage. Together with the seeding option, this results in $2^8 = 256$ different combinations. The values chosen for population size, rate, and budget are those that gave the best result in RQ2. As we do not consider the behavior of the search over time, we use a search budget of 2 min per class, a value which our past experience has shown to be a reasonable compromise between a runtime practitioners would accept and allowing for good coverage results.

Table 2. Details of the generated case study. For each project, we report how many classes it is composed of, and the total number of bytecode branches.
Experiment 3 (RQ6): we considered the probabilities {0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 0.7, 1.0} and adaptation rates of {1.0, 1.2, 1.5, 1.7, 2.0, 5.0, 10} (i.e., a suitable coverage of values between the minimum and maximum plausible values), while the local search rate is set to 1. We also varied whether selective mode was active, as well as seeding, which led to $8 \times 7 \times 2^2 = 224$ configurations. The overall search budget is again 2 min per class.

Experiment 4 (RQ7): for the experiments on the SF100 corpus and Carfast case study we only considered two configurations: the default GA in EvoSuite and the best MA configuration from the analyses of the previous research questions. The search budget is 2 min per class also in this set of experiments.
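The configuration counts above, and the computation time they imply, can be checked with a few lines of arithmetic. The sketch below uses only figures stated in the text (16 classes, 10 repetitions per configuration and class, and the per-class search budgets); the helper name `cpu_days` is hypothetical, and Experiment 4 is left out because the number of Carfast classes is not stated here.

```python
MINUTES_PER_DAY = 60 * 24


def cpu_days(configurations, classes, runs, minutes_per_run):
    """Sequential computation time in days for one set of experiments."""
    return configurations * classes * runs * minutes_per_run / MINUTES_PER_DAY


# Experiment 1: seeding on/off x 5^3 MA settings, plus seeding on/off x 5 GA population sizes.
exp1 = 2 * 5**3 + 2 * 5
# Experiment 2: eight binary options (seven local search options plus seeding).
exp2 = 2**8
# Experiment 3: 8 probabilities x 7 adaptation rates x selective mode x seeding.
exp3 = 8 * 7 * 2**2

days = [
    cpu_days(exp1, classes=16, runs=10, minutes_per_run=10),
    cpu_days(exp2, classes=16, runs=10, minutes_per_run=2),
    cpu_days(exp3, classes=16, runs=10, minutes_per_run=2),
]
print(exp1, exp2, exp3, [round(d, 1) for d in days])
```

The first three experiments alone amount to several hundred days of sequential computation, which explains why a cluster was needed.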

Experiment procedure
On each class, for each parameter combination and algorithm, we ran EvoSuite 10 times with different random seeds to take into account the random nature of the algorithms. This means that the first set [...] (2 × 2 × 10)/(60 × 24) = 347 days. Thus, in total all the experiments together took 742 days of computational time, which required the use of a cluster of computers. We used the University of Sheffield's Linux-based high performance computing cluster, which has a total of 1544 CPUs. The nodes of the cluster have either AMD or Intel CPUs at around 2.6 GHz, and 4 GB of memory per CPU core. During all these runs, EvoSuite was configured using the optimal configuration determined in our previous experiments on tuning.

To evaluate the statistical and practical differences among the different settings, we followed the guidelines by Arcuri and Briand (2014). Statistical difference is evaluated with a two-tailed Mann-Whitney-Wilcoxon U-test (at the 0.05 significance level), whereas the magnitude of improvement is quantified with the Vargha-Delaney standardized effect size $\hat{A}_{12}$.

In some cases, it is sufficient to determine which configuration gives the best result. In other cases, it is useful to analyze trends among the different configuration parameters and their combinations. However, when there are hundreds of configuration settings based on several parameters, the issue of how to visualize them is not so straightforward. In this paper, when we do this kind of analysis, we create rank tables, in a similar way as we did in previous work. In a rank table, we compare the effectiveness of each configuration against all other configurations, one at a time. For example, if we have X = 250 configurations, we will have X × (X − 1) comparisons, which can be reduced by half due to the symmetric property of the comparisons. Initially, we assign a score of 0 to each configuration.
For each comparison in which a configuration is statistically better (using a U-test at the 0.05 level), we increase its score by one, and we reduce it by one in case it is statistically worse. Therefore, in the end each configuration has a score between −X and +X. The higher the score, the better the configuration is. After this first phase, we rank these scores, such that the highest score has the best rank, where better ranks have lower values. In case of ties, we average the ranks. For example, if we have five configurations with scores {10, 0, 0, 20, −30}, then their ranks will be {2, 3.5, 3.5, 1, 5}. We repeat this procedure for all the Z classes in the case study, and we calculate the average of these ranks for each configuration, for a total of Z × X × (X − 1)/2 statistical comparisons. For example, if we consider X = 250 configurations and Z = 16 classes, this results in 498,000 statistical comparisons. This is a very large number of comparisons, which can lead to a high probability of Type I error if we consider the hypothesis that all tests are significant at the same time. The issue of applying adjustments such as Bonferroni corrections, however, is a complex one, and there is no full consensus amongst statisticians as to their application. In this paper we have not applied such corrections, for reasons discussed by Arcuri and Briand (2014), Perneger (1998) and Nakagawa (2004), with which we agree.
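The scoring and ranking steps of a rank table can be sketched as follows. The function names are hypothetical, and the statistical test is abstracted into a callback `better(i, j)` that is assumed to return +1, −1, or 0 depending on whether configuration i is significantly better than, significantly worse than, or indistinguishable from configuration j.

```python
def scores_from_comparisons(n, better):
    """Score n configurations from pairwise comparisons.

    Each pair is compared once; the outcome is applied symmetrically.
    """
    scores = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            outcome = better(i, j)
            scores[i] += outcome
            scores[j] -= outcome
    return scores


def average_ranks(scores):
    """Rank scores so that the highest score gets rank 1; ties share the mean rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next score equals the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


print(average_ranks([10, 0, 0, 20, -30]))  # [2.0, 3.5, 3.5, 1.0, 5.0]
print(16 * 250 * 249 // 2)                 # 498000
```

Averaging the per-class ranks of each configuration over all Z classes then gives the values reported in a rank table.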

RQ1: Does local search improve the performance?
For both cases where seeding was applied and where it was not, we analyzed the $5^3 = 125$ configurations using the MA, and chose the one that resulted in the highest average coverage over the 16 classes in the case study. We did the same for the basic GA, i.e., we evaluated which configuration of the population size gave the best results. We call these four configurations (two for MA, and two for GA) "tuned". Table 3 shows the comparison between the tuned MA and tuned GA configuration based on whether seeding was used. Results in Table 3 answer RQ1 by clearly showing, with high statistical confidence, that the MA outperforms the standard GA in many, but not all, cases. For classes such as Cookie, improvements are as high as a 98 − 45 = 53% average coverage difference (when seeding is not used). When considering the case without seeding enabled, there are no classes where the MA resulted in significantly lower coverage; however, the effect size is worse for the MA for the classes IntArrayWithoutExceptionsWithArrayParameters, XMLElement, ArrayList and Bessj, although differences in coverage are no more than 1%. Some local search operators may thus lead to lower coverage, and this will be analyzed in detail as part of RQ4. With seeding enabled, the MA is still clearly better overall. Only for XMLElement are the results slightly worse, but the difference is not statistically significant.
RQ1: The MA achieved up to 53% higher branch coverage than the standard GA.

RQ2: How does the configuration of the MA influence the results?
One thing that is clearly visible in Table 3 is that seeding, as expected, leads to higher coverage. On one hand, when seeding is not used, the difference in average coverage between the MA and the GA is 89 − 80 = 9%. On the other hand, when seeding is used, the difference is 92 − 90 = 2%. At first glance, such an improvement might be considered low. But the statistics in Table 3 point out a relatively high average effect size of 0.67, with six classes having a strong statistical difference. This is not in contrast with the 2% difference in the raw values of the achieved coverage. What the data in Table 3 suggest is that, when seeding is employed, there are still some branches that are not covered with the GA, and so require the local search of the MA to be reached.
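Why a small difference in raw coverage can coincide with a notable effect size becomes clearer from the definition of the Vargha-Delaney statistic: it estimates the probability that a run of one configuration yields a higher value than a run of the other, with ties counted as half. A minimal sketch follows; the function name `a12` and the coverage values are made up for illustration and are not taken from Table 3.

```python
def a12(first, second):
    """Vargha-Delaney effect size: P(first > second) + 0.5 * P(first == second)."""
    wins = sum((x > y) + 0.5 * (x == y) for x in first for y in second)
    return wins / (len(first) * len(second))


# Hypothetical coverage values of three runs per configuration.
ma_runs = [0.92, 0.93, 0.91]
ga_runs = [0.90, 0.92, 0.89]
print(a12(ma_runs, ga_runs))
```

A value of 0.5 means the two configurations are indistinguishable, whereas values above 0.5 mean the first configuration tends to win, even if the margin per run is only one or two percentage points.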

To answer RQ2 we can look at the configuration that gave the best result on average. This configuration uses an MA with seeding, a large population size (100 individuals), a low rate of local search (every 100 generations), and a small budget of 25 s for local search. In other words, on average the best result is achieved using local search infrequently and with a small budget. This is different from the result of our initial experiment, where the best configuration used seeding, a small population size (five individuals), a low rate of local search (every 75 generations), and a small budget of five fitness evaluations for local search.

Table 4 shows the results of the rank analysis on the 250 MA configurations. For space reasons, we only show the results of 50 configurations: the top 25 configurations using seeding, and the top 25 that do not use seeding. The results in Table 4 clearly show that, on average, seeding has a strong impact on performance (all of the top 25 configurations using seeding achieve better results than the top 25 not using seeding). Among the top configurations, there is a clear trend pointing to large population sizes, and to local search applied infrequently and for a short period of time (i.e., low budget). This may seem in slight contrast to the results of our initial experiments described above. However, this difference can be attributed to (a) the variance in the results (the top configurations are all very similar in terms of achieved coverage), (b) differences in the local search operators, and (c) optimizations introduced in this paper that make it feasible to apply local search on larger populations; e.g., the original experiments did not include primitive value expansion and restoring coverage.
In general, these results suggest that, although local search does improve performance, one has to strike the right balance between the effort spent on local search and the effort spent on global search (i.e., the search operators in the GA). Considering Table 3, we see that the results change significantly between individual classes. This suggests that the benefit of local search is highly dependent on the problem at hand. For example, in a class with many string inputs, much of the budget may be devoted to local search, even if the input strings have no effect on code coverage levels. Although we do see an improvement, even on average, this clearly points out the need for parameter control, in order to adaptively change the local search configuration to the class under test and the current state of the search.

At any rate, one problem with parameter tuning is that, given a large set of experiments from which we choose the best configuration, such a configuration could be too specific for the employed case study. This is a common problem that in Machine Learning is called overfitting (Mitchell, 1997). To reduce the threat of this possible issue, we applied a k-fold cross validation on our case study (for more details, see for example Mitchell (1997)). Briefly, we divided the case study into k = 16 groups, chose the best configuration out of the 250 on k − 1 groups (training), and calculated its performance on the remaining group (validation). This process is then repeated k times, each time using a different group for the validation. Then, the average of these k performance values on the validation groups is used as an estimate of the actual performance of tuning on the entire case study (the "tuned" configuration) when applied on other new classes (i.e., does the tuning process overfit the data?).
Note that we used a 16-fold cross validation instead of a typical 10-fold cross validation as we have only 16 classes, and dividing them into 10 groups would have partitioned them into very unbalanced groups (i.e., some groups with only one element whereas others would have twice as many). The obtained estimate for the best MA configuration was 0.91, which is close to the average value 0.92 in Table 3. Therefore, the best parameter configuration we found is not overfitted to the case study examples.
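With one class per group, the cross validation described above amounts to a leave-one-out procedure, which can be sketched as follows. The function name and the data layout (`coverage[c][k]` holding the average coverage of configuration k on class c) are assumptions made for illustration.

```python
def leave_one_out_estimate(coverage):
    """Estimate how well tuning generalizes to an unseen class.

    For each held-out class, the best configuration is chosen on the
    remaining classes and then evaluated on the held-out one.
    """
    n_classes = len(coverage)
    n_configs = len(coverage[0])
    held_out_scores = []
    for held_out in range(n_classes):
        training = [row for c, row in enumerate(coverage) if c != held_out]
        best = max(range(n_configs),
                   key=lambda k: sum(row[k] for row in training))
        held_out_scores.append(coverage[held_out][best])
    return sum(held_out_scores) / n_classes


# Toy data: 3 classes, 2 configurations (values are made up).
toy = [[0.9, 0.5],
       [0.8, 0.6],
       [0.2, 1.0]]
print(leave_one_out_estimate(toy))
```

In the toy data the estimate is well below the best average coverage on all classes (about 0.43 against 0.70), which is the signature of overfitting; an estimate close to the tuned average, as reported above, indicates the opposite.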
RQ2: The MA gives the best results when local search is applied infrequently with a small search budget.

RQ3: How does the search budget influence the results?
The time spent for test data generation (i.e., the testing budget) is perhaps the only parameter that practitioners would need to set. For a successful technology transfer from academic research to industrial practice, the internal details of a tool (i.e., how often and how long to run local search inside EvoSuite) should be hidden from the users, and thus this choice should be made before the tools are released to the public. However, usually the best parameter configuration is strongly related to the testing budget. To answer RQ3, we studied the performance of the tuned MA and the tuned GA at different time intervals. In particular, during the execution of EvoSuite, for all the configurations we kept track of the best solution found so far at every minute (for both the GA and the MA). With all these data, at every minute we also calculated the "best" MA configuration (out of 250) and the "best" GA (out of 10) at that particular point in time. By definition, the performance of the "tuned" MA is equal to or lower than that of the "best" MA. Recall that "tuned" is the configuration that gives the "best" results at 10 min.
From a practical standpoint, it is important to study whether the "tuned" MA is stable compared to the "best" MA. In other words, if we tune a configuration considering a 10 min timeout, are we still going to get good results (compared to the "best" MA and GA) if the practitioner decides to stop the search beforehand? Or was 10 min just a lucky choice? Fig. 4 answers these questions by showing that, already from 2 min on, "tuned" performs very similarly to the "best" configuration. Furthermore, regardless of the time, there is always a large gap between the "tuned" MA and GA. On single classes in isolation, performance over time need not be monotonic (in contrast to Fig. 4). This is particularly evident for the class FastFourierTransformer. The reason is that, at each point (minute) in time, we are considering the configuration with the highest coverage averaged over all the 16 classes. Although on average the performance improves monotonically (Fig. 4), on single classes in isolation everything could in theory happen (Figs. 5 and 6).
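The distinction between the "tuned" and the "best" configuration can be made concrete with a small sketch. The function name and data layout are hypothetical: `coverage_over_time[k][t]` is assumed to hold the average coverage of configuration k after minute t + 1.

```python
def tuned_and_best(coverage_over_time):
    """Return the coverage curve of the tuned configuration and the per-minute best.

    "Tuned" is the single configuration that is best at the final minute;
    "best" is re-selected at every minute, so it can never be lower.
    """
    last = len(coverage_over_time[0]) - 1
    tuned = max(range(len(coverage_over_time)),
                key=lambda k: coverage_over_time[k][last])
    tuned_curve = list(coverage_over_time[tuned])
    best_curve = [max(cfg[t] for cfg in coverage_over_time)
                  for t in range(last + 1)]
    return tuned_curve, best_curve


# Two made-up configurations observed over three minutes.
tuned_curve, best_curve = tuned_and_best([[0.5, 0.6, 0.9],
                                          [0.7, 0.8, 0.85]])
print(tuned_curve, best_curve)
```

The gap between the two curves at early minutes is what a practitioner would lose by using the configuration tuned for the full budget while stopping the search early.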

RQ3: The best configuration only differs for small search budgets, and is consistent across higher budgets.

RQ4: What is the influence of each individual type of local search operator?
Table 5 shows the average coverage achieved for each individual type of local search. To study the effects individually and not conflate them with the effects of seeding, all results shown in the table are based on runs without seeding activated. If applied independently, then the techniques of ensuring double execution and expanding test cases have only a minor effect. However, they can be beneficial for all other types of local search. In the table they are activated for all types of local search. In other words, the results presented in Table 5 are based only on six configurations out of the $2^8 = 256$ we ran. In all these six configurations, seeding was off, whereas double execution and test expansion were on. In the "Base" configuration, all five local search operators were off. For each of the remaining five configurations, one local search operator was on, whereas the other four were off.
• IntArrayWithoutExceptions benefits mainly from numeric search, and the local search on arrays has no benefit. Indeed, as long as there are explicit assignments to array elements in the tests, numeric local search can improve array contents as well, whereas search on all array elements may waste resources.
• LinearWithoutOverflow is a class that consists almost exclusively of numerical constraints, thus numeric search brings the most benefits.
• FloatArithmetic represents numeric problems with floating point inputs; numeric search brings the expected improvement.
• IntArrayWithoutExceptionsWithArrayParameters repeats the pattern seen in IntArrayWithoutExceptions: search on numbers improved coverage, search on arrays made things worse.
• Cookie is a pure string problem, and string local search behaves as expected.
• DateParse is also a string problem (which becomes trivial with seeding; see the flat-lined graph in Fig. 5); string local search works as expected.
• Stemmer is a class that works with text input, yet it takes its input in terms of character arrays and integers. Consequently, string local search does not help, whereas numeric search improves coverage considerably.
• Ordered4 is a surprising case: it is a string problem, yet the only type of local search that achieves a worse result than pure global search is search on strings. The reason for this is that the string constraints in this class are based on the compareTo method, which returns −1, 0, or 1. While EvoSuite transforms all boolean string comparison operators and replaces them with functions that provide guidance, it currently does not do this for compareTo. Consequently, local search on the strings will in many cases not get beyond exploration, which nevertheless consumes search budget.
• XMLElement has string dependencies, yet there are few constraints on these strings; they mainly represent the names of tags. However, some string-related inputs are represented as character arrays (char[]), which explains why the array search is more beneficial than the string search for this example. The class has many methods, which is likely why reference search is beneficial, as is restoring coverage.
• Most methods of CommandLine have either string or character parameters, which offers potential to apply local search on strings and numbers. However, again this is a class where the actual values of these strings and characters do not matter, and so these types of local search have a negative effect.
• Attribute has several string dependencies; for example, one can set a string value for an XML attribute and then call methods to convert it to numbers or booleans. Consequently, local search on strings is beneficial.
• DoubleMetaphone has many string-related parameters, given that it implements an algorithm to encode strings. String local search has a small beneficial effect, as does search on references.
• ArrayList has methods with string and numerical inputs, yet only few branches depend on these parameters (e.g., the capacity of an ArrayList needs to be larger than 0). Consequently, the only type of local search that has an effect on this class is search on references.
• Bessj is a class with many branches on numerical dependencies; however, even with a significantly higher search budget EvoSuite is not able to achieve higher coverage than 91%. It is therefore likely that this is already the maximum possible, and none of the types of local search has a negative impact on reaching this.
• FastFourierTransformer has many array parameters, yet it seems to perform more transformative calculations on these arrays rather than depending on their content. Consequently, the array local search has a negative effect.
• DateTimeFormat has functionality to parse date formatting patterns, and consequently benefits significantly from string local search. It also has many methods, which is reflected in the improvement with reference local search.
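The Ordered4 case illustrates why guidance matters for local search on strings. The sketch below is a deliberately simplified hill climber, not EvoSuite's string local search: it only changes one character by ±1 per step, and `compare_to_sign`, `char_distance`, and `hill_climb` are hypothetical names. It contrasts a three-valued comparison with a distance-based fitness function.

```python
def compare_to_sign(a, b):
    """Three-valued comparison (-1, 0, 1), as described for compareTo above."""
    return (a > b) - (a < b)


def char_distance(a, b):
    """Hypothetical guidance: summed character distance plus a length penalty."""
    shared = sum(abs(ord(x) - ord(y)) for x, y in zip(a, b))
    return shared + 128 * abs(len(a) - len(b))


def hill_climb(start, fitness, max_steps=10_000):
    """Move to the best neighbour while fitness strictly improves (minimization).

    Neighbours differ in one character by +/-1; `start` must be non-empty ASCII.
    """
    current = start
    for _ in range(max_steps):
        neighbours = [current[:i] + chr(ord(c) + d) + current[i + 1:]
                      for i, c in enumerate(current) for d in (-1, 1)
                      if 0 <= ord(c) + d < 128]
        best = min(neighbours, key=fitness)
        if fitness(best) >= fitness(current):
            break  # no neighbour is better: local optimum or plateau
        current = best
    return current


target = "cat"
guided = hill_climb("dog", lambda s: char_distance(s, target))
unguided = hill_climb("dog", lambda s: abs(compare_to_sign(s, target)))
print(guided, unguided)
```

With the distance-based fitness every step brings the string closer to the target, whereas with the three-valued comparison all neighbours of the start string look equally bad, so the search stops on a plateau after having spent evaluations on exploration.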
Restoring coverage had a negative effect only in five out of the 16 cases, whereas it had a very strong effect in many of them.
RQ4: Numeric and string local search work well on their relevant problem instances, whereas array search can have negative impact. Reference local search is beneficial for large classes.

RQ5: Which combination of local search operators achieves the best results?
There can be subtle effects and interactions between different types of local search. Consequently, for RQ5 we looked at all possible combinations of the local search operators. Table 6 presents a rank analysis where we list the top 25 configurations with seeding enabled and the top 25 configurations without seeding. All top ranked configurations restore coverage, most of them apply numeric local search, and most of them apply primitive value expansion (Section 4.2.1). This confirms the intuition that expansion is important to make local search on primitive values effective. The table clearly shows how seeding influences the search, as all seeding configurations are ranked higher than those without seeding. All top ranked configurations without seeding apply numerical local search, whereas there exist some in the seeding ranks that do not use numerical search. The top ranked configurations without seeding use string local search, whereas fewer of the top ranked configurations with seeding use string local search. Indeed, in several of the 16 example classes the string constraints are partially trivially solved with seeding, such that string local search in conjunction with seeding seems to waste resources and has a negative effect. The top ranked configuration without seeding excludes array local search, as one would expect from the analysis of RQ4. However, surprisingly reference search is also excluded, whereas in RQ4 we saw that there were only two cases where reference local search applied individually led to a worse result, suggesting interactions with the other operators. However, the configurations with array search and reference search enabled are ranked directly below that configuration with only marginally lower coverage, suggesting that the impact is only minor. The top ranked configuration with seeding also excludes array local search as expected, but it does include reference local search.
However, the configuration with all types of local search enabled ranks third, with even a minimally higher average coverage.
RQ5: Applying all local search operators leads to good results, although string, array, and reference search can have minor negative effects.

RQ6: Does adaptive local search improve the performance?
RQ4 showed how different classes influence the effectiveness of local search. Consequently, instead of applying local search with a fixed configuration, we next consider how doing so in an adaptive way influences results. As described in Section 4.3, we use the adaptive methods introduced by Galeotti et al. (2013). Table 7 shows the top ranked configurations for the combinations of adaptiveness parameters we considered. Again the configurations using seeding are ranked higher than those without. Applying local search selectively, i.e., only on statements that led to a fitness change after the last mutation, is not included in the top configurations. The likely reason for this is that this optimization is designed for a scenario where DSE is applied to the primitive values in a test suite (cf. Galeotti et al., 2013) and will thus only select some of the cases where local search can lead to an improvement. As seen in the discussion of RQ4, local search operators that are not directly related to primitive values can still have a strong positive influence on the performance, and these would not benefit from this selective strategy. The probabilities in the top ranks confirm the results of RQ2, where in the best configuration local search was applied every 100 generations. With a probability of 0.01, on average local search will also be applied every 100 generations. The best configuration shows a high adaptation rate of 10, followed by the second best configuration with the second highest adaptation rate used in the experiment. Consequently, we can conclude that adaptation is an important factor in achieving high coverage.

Table 8. For each class, comparisons without seeding of the Base GA configuration with the best non-adaptive MA from Table 6 and with the best adaptive MA from Table 7. Effect sizes $\hat{A}_{12}$ are calculated for when non-adaptive is compared with base ($\hat{A}_{nb}$), and adaptive compared to base ($\hat{A}_{ab}$) and to non-adaptive ($\hat{A}_{an}$).
To compare the results between the adaptive configurations, the GA, and the tuned MA, Table 8 summarizes the coverage and $\hat{A}_{12}$ for each pair of configurations. To show the effects of adaptiveness more clearly without interference of other optimizations, this table shows the results without seeding. Note that in contrast to Table 3, the non-adaptive configuration is for 2 min of search time using the best configuration of local search operators as obtained in RQ5. For this particular configuration, the non-adaptive MA is significantly better than the GA in 11 out of the 16 cases. Interestingly, it is even better on XMLElement, whereas in RQ1 the MA showed a slightly worse result after 10 min using all local search operators. Comparing the adaptive MA to the GA shows significantly better results in 10 out of 16 cases, but interestingly slightly worse results in ArrayList and FastFourierTransformer, although not significant in any of the cases. The adaptive MA achieves higher average coverage than the non-adaptive tuned MA in eight cases, although none of them are statistically significant, whereas the coverage loss in XMLElement is statistically significant (however, on average it is still the same as the base GA). Thus, on average the adaptive configuration is only slightly better than the best fixed configuration (average coverage of 85.44% for adaptive vs. 85.06% for the tuned fixed configuration, and the average effect size is 0.53). However, the implementation of adaptiveness used in these experiments is of course rather simplistic, and ideally one would apply adaptiveness also to the choice of operators. With this in mind, and considering that adaptive configurations have a higher chance of generalizing to new classes, it is fair to assume that adaptiveness in local search is beneficial in the general case.

RQ7: Do results generalize to other classes?
All experiments so far were conducted on 16 classes selected under the assumption that they are representative of difficult search problems. However, there remains the question of how these findings generalize (RQ7). To answer this question, we take the overall best configuration of local search, and apply EvoSuite with that configuration to two different benchmarks. The SF100 corpus of classes is a random sample of 100 SourceForge open source projects. A particular aspect of this real-life, unbiased sample of classes is that the problems it represents are quite different to those considered as difficult search problems: for example, a large share of the classes have environmental dependencies that make high coverage with EvoSuite impossible. In contrast, the Carfast (Park et al., 2012) case study is devoid of such environmental dependencies, but still consists of a set of automatically generated software projects that are intended to be realistic. We applied EvoSuite on both benchmarks for 2 min per class with 10 repetitions to account for randomness. Table 9 summarizes the results: on the Carfast benchmark, the use of local search covers on average 44,853 more branches than pure global search. On SF100 the increase is smaller; 639 additional branches were covered by the Memetic Algorithm. This is not unexpected; SF100 consists of many trivial classes and many branches cannot be covered until the test generator can handle the environmental dependencies, so the potential for improvement is smaller in the first place.

Table 9. Comparison of results of the default GA with the best MA for both the SF100 and Carfast case studies. The average number of covered branches is reported, and the difference between the two configurations.
RQ7: The improvements with local search generalize to other classes, but in practice other technical problems may outweigh pure search problems.

Threats to validity
This paper compares the whole test suite generation approach based on a Genetic Algorithm to a hybrid version that uses a Memetic Algorithm with local search. Threats to construct validity are on how the performance of a testing technique is defined. We measured the performance in terms of branch coverage. However, in practice we might not want a much larger test suite if the achieved coverage is only slightly higher. Furthermore, this performance measure does not take into account how difficult it will be to manually evaluate the test cases and the generated assert statements (i.e., to check the correctness of the outputs). Threats to internal validity might come from how the empirical study was carried out. To reduce the probability of having faults in our testing framework, it has been carefully tested. But it is well known that testing alone cannot prove the absence of defects. Furthermore, randomized algorithms are affected by chance. To cope with this problem, we ran each experiment 10 times, and we followed rigorous statistical procedures to evaluate their results. There is also the threat to external validity regarding the generalization to other types of software, which is common for any empirical analysis. Because of the large number of experiments required (in the order of hundreds of days of computational resources), we only used 16 classes for our in depth evaluations. Those classes were manually chosen. To reduce this threat to validity, we also carried out a set of experiments with best found settings on the SF100 corpus, which is a random selection of 100 projects from SourceForge. We also carried out further experiments on a large case study (Carfast) previously used in the literature.

Conclusions
The EvoSuite tool applies Genetic Algorithms to the problem of generating unit-level test suites for Java classes with high branch coverage. However, genetic mutations on particular parts of the test cases tend to be undirected. This means that for variables of primitive types, strings, and arrays, the small adjustments needed for certain branches to be covered are unlikely to occur. This paper therefore defined a series of local search operators, extending the Genetic Algorithm used in EvoSuite to a Memetic Algorithm. Although Memetic Algorithms have already been used in the past for unit test generation, this paper is the first to provide a comprehensive approach for object-oriented software, targeting whole test suites and handling different kinds of test data such as strings and arrays. Our empirical study shows that, using these local search operators, the branch coverage of classes may be significantly improved, in some cases by up to 53%. A sound evaluation on more than 12,000 Java classes confirms that the results are of practical value for practitioners. Adding an adaptive parameter control technique showed improvements in our experiments. However, the technique we applied was simple, and there is potential for further improvements using more advanced parameter control techniques (Eiben et al., 1999).

For more information about EvoSuite please visit: http://www.evosuite.org/.

Gordon Fraser is a lecturer in Computer Science at the University of Sheffield, UK. He received a Ph.D. in computer science from Graz University of Technology, Austria, in 2007. The central theme of his research is improving software quality, and his recent research concerns the prevention, detection, and removal of defects in software. More specifically, he develops techniques to generate test cases automatically, and to guide the tester in validating the output of tests by producing test oracles and specifications.

Andrea Arcuri is a Senior Software Engineer working in the oil and gas industry. He also collaborates part time at Simula Research Laboratory, Norway, where he does research on software test automation.

Phil McMinn is a senior lecturer in computer science at the University of Sheffield, UK. He received the Ph.D. degree from the University of Sheffield, United Kingdom, in January 2005, which was funded by Daimler-Chrysler Research and Technology. He has published several papers in the field of search-based testing. His research interests cover software testing in general, program transformation, and agent-based systems and modeling. He is currently funded by the United Kingdom Engineering and Physical Sciences Research Council (EPSRC) to work on reducing the oracle costs of testing, testing techniques for agent-based systems, and the automatic reverse engineering of state machine descriptions from software.