Adaptive Random Testing with Combinatorial Input Domain

Random testing (RT) is a fundamental technique for assessing software reliability that simply selects test cases at random from the whole input domain. As an enhancement of RT, adaptive random testing (ART) has better failure-detection capability and has been widely applied in different scenarios, such as numerical programs, object-oriented programs, and mobile applications. However, little work has been done on the effectiveness of ART for programs with a combinatorial input domain (i.e., a domain of categorical data). To extend ART to the combinatorial input domain, we adopt several similarity measures that are widely used for categorical data in data mining, and we propose two further similarity measures based on interaction coverage. We then propose a new method named ART-CID, an extension of ART to the combinatorial input domain, which selects as the next test case the candidate with the lowest similarity to the already generated test cases. Experimental results show that ART-CID generally performs better than RT with respect to different evaluation metrics.


Introduction
Software testing, a major software engineering activity, is widely used to assure the quality of software under test [1]. Many testing methods have been developed to identify software failures effectively by actively selecting inputs (namely, test cases). Random testing (RT), a basic software testing method, simply chooses test cases at random from the set of all possible program inputs (namely, the input domain) [2,3]. There are many advantages of using RT in software testing. For example, in addition to simplicity and the efficiency of generating random test cases [2], RT allows statistical quantitative estimation of software reliability [4]. Due to these advantages, RT has been widely used to detect software failures in different scenarios, such as the testing of UNIX utilities [5,6], SQL database systems [7,8], Java JIT compilers [9], and embedded software systems [10]. Despite its popularity, RT is still criticized by many researchers because it uses little or no information to guide its test case generation.
Given a faulty program, two basic features are determined by the program inputs that cause the software to exhibit failure behavior (namely, failure-causing inputs): failure rate and failure pattern. Failure rate refers to the ratio of the number of failure-causing inputs to the number of all possible program inputs, while failure pattern refers to the geometry and distribution of the failure regions (i.e., the regions where failure-causing inputs reside). It has been observed that failure-causing inputs tend to cluster together [11][12][13]. Given that failure regions are contiguous, nonfailure regions should also be contiguous. More specifically, if a test case tc is not a failure-causing input, then test cases that are close to tc (tc's neighbors) are also likely to fail to reveal a failure. Therefore, it is intuitively appealing that test cases spread away from tc have a higher chance of being failure-causing than tc's neighbors.
Briefly speaking, it is very likely that a more even spread of random test cases can improve the failure-detection effectiveness of RT. Based on this intuition, Chen et al. [14] proposed a novel approach, namely, adaptive random testing (ART). Similar to RT, ART also randomly generates test cases from the whole input domain, but ART uses additional criteria to guide test case selection for the purpose of evenly spreading test cases over the input domain. Various ART algorithms have been developed based on different test case selection criteria, such as ART by distance [15], ART by exclusion [16], ART based on evolutionary search algorithms [17], and ART by perturbation [18]. Essentially, ART achieves test case diversity within the set of test cases executed at any one time [19].
As an alternative to RT, ART has been successfully applied to different programs, such as numerical programs [15][16][17][18], object-oriented programs [20,21], and mobile applications [22]. However, little work has been done on the effectiveness of ART for programs with a combinatorial input domain (i.e., categorical data: a Cartesian product of finite value domains, one for each of a finite set of parameters). With the popularity of the category-partition method [23] and the many guidelines that help construct categories and partitions [24][25][26][27], the combinatorial input domain has been applied to many testing scenarios, such as configuration-aware systems [28,29], event-driven software [30], and GUI-based applications [31]. In this paper, we propose a new testing strategy called ART-CID as an extension of ART to the combinatorial input domain. In order to extend the ART principle to the combinatorial input domain, we propose two similarity measures based on interaction coverage and also adopt several well-studied similarity measures that are popularly used for categorical data in data mining [32]. To analyze the effectiveness of ART-CID (mainly FSCS-CID, one version of ART-CID), we compare FSCS-CID with RT through simulations and an empirical study. Experimental results show that, compared with RT, FSCS-CID not only needs fewer test cases to cover all possible combinations of parameter values at a given strength, but also requires fewer test cases to identify the first failure in real-life programs.
This paper is organized as follows. Section 2 introduces some preliminaries, including combinatorial input domain, ART, similarity measures used for combinatorial input domain, and the effectiveness measures adopted in our study. Section 3 proposes two similarity measures for combinatorial test cases based on interaction coverage. Section 4 proposes a new algorithm called ART-CID to select test cases from combinatorial input domain. Section 5 reports some experimental studies, which examine the rate of covering value combinations at a given strength and failure-detection effectiveness of our new method. Finally, Section 6 summarizes some discussions and conclusions.

Preliminaries
In this section, some preliminaries are described: the combinatorial input domain, failure patterns, adaptive random testing, similarity and dissimilarity measures for the combinatorial input domain, and the effectiveness measures used in our study.
2.1. Combinatorial Input Domain. Suppose that a system under test (SUT) has a set of parameters (or categories) P = {p1, p2, ..., pn}, which may represent user inputs, configuration parameters, internal events, and so forth. Let Vi be the finite set of discrete valid values (or choices) for pi (i = 1, 2, ..., n), and let C be the set of constraints on parameter value combinations. Without loss of generality, we assume that the order of parameters is fixed; that is, P = ⟨p1, p2, ..., pn⟩. In the remainder of this paper, we will refer to a combination of parameters as a parameter interaction, and to a combination of parameter values (a parameter value combination) as a value combination.
A test profile, denoted TP(n, |V1||V2| ⋅⋅⋅ |Vn|, C), captures the information about the combinatorial input domain of the SUT: the n parameters, the |Vi| (i = 1, 2, ..., n) values for each parameter pi, and the constraints C on value combinations.
In this paper, we assume that all parameters are independent; that is, no constraints among value combinations are considered (C = ∅), unless otherwise specified. Therefore, the test profile can be abbreviated as TP(n, |V1||V2| ⋅⋅⋅ |Vn|).
To clearly describe some notions and definitions, we present an example based on part of the suboptions of the "View" option of a PDF tool, shown in Table 1. In this system, there are four configuration parameters, each of which has three values. Therefore, its test profile can be written as TP(4, 3^4).
Intuitively speaking, an element of the combinatorial input domain of TP(4, 3^4), that is, an assignment of one value to each of the four parameters, is a test case for the SUT shown in Table 1.
Definition 3. Given a TP(n, |V1||V2| ⋅⋅⋅ |Vn|), a τ-wise value combination is an n-tuple (v1, v2, ..., vn) involving τ parameters with fixed values (named fixed parameters) and (n − τ) parameters with arbitrary allowable values (named free parameters), where 0 ≤ τ ≤ n. Generally, a τ-wise value combination is also called a τ-value schema [33], and τ is called the strength. When τ = n, a τ-wise value combination becomes a test case for the SUT, as it takes on a specific value for each of its parameters. For ease of description, we define CombSetτ(tc) as the set of τ-wise value combinations covered by a test case tc. Intuitively speaking, a test case tc with n parameters covers C(n, τ) = n!/(τ!(n − τ)!) τ-wise value combinations; that is, |CombSetτ(tc)| = C(n, τ).
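As a concrete illustration (a minimal Python sketch of our own, not from the paper), CombSetτ(tc) can be enumerated with itertools.combinations; tagging each value with its parameter index keeps combinations from different parameter interactions distinct:

```python
from itertools import combinations
from math import comb

def comb_set(tc, tau):
    """Set of tau-wise value combinations covered by test case tc.

    Each combination is a tuple of (parameter_index, value) pairs, i.e. a
    schema with tau fixed parameters and the remaining parameters left free.
    """
    return {tuple((i, tc[i]) for i in idx)
            for idx in combinations(range(len(tc)), tau)}

# A test case over a TP(4, 3^4) profile such as Table 1 covers
# C(4, 2) = 6 pairwise (tau = 2) value combinations.
tc = ("v1", "v2", "v3", "v4")  # placeholder value names
assert len(comb_set(tc, 2)) == comb(4, 2) == 6
```

The (index, value) tagging matters: without it, two different parameters that happen to share a value name would collapse into one combination.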
For example, for a test case tc of the SUT in Table 1, CombSet2(tc) contains C(4, 2) = 6 pairwise value combinations. As we know, the fault model in the combinatorial input domain assumes that failures are caused by parameter interactions. For instance, if the SUT shown in Table 1 fails whenever p2 is set to "Single", p3 is set to "None", and p4 is not equal to "None", this failure is caused by the parameter interaction (p2, p3, p4). Therefore, the failure-triggering fault interaction (FTFI) number of this fault is 3.
In [28,34], Kuhn et al. investigated interaction failures by analyzing the fault reports of several software projects and concluded that failures are typically caused by interactions with low FTFI numbers.

2.2. Failure Patterns.
Given a faulty program, two basic features can be obtained from it. One feature is the failure rate, denoted by θ, which refers to the ratio of the number of failure-causing inputs to the number of all possible inputs. The other feature is the failure pattern, which refers to the geometric shapes and distributions of the failure-causing regions. Both features are fixed but unknown to testers before testing.
In [14], the patterns of failure-causing inputs have been classified into three categories: point pattern, strip pattern, and block pattern. An illustrative example of the three types of failure patterns in a two-dimensional input domain is shown in Figure 1. In this example, suppose the input domain consists of two parameters x and y, where 0 ≤ x, y ≤ 10.0. A point pattern means the tested program fails when x and y are assigned particular values, that is, at some specific points in the input domain, while a strip pattern may be of the form 0 ≤ x ≤ 10.0, 3.0 ≤ y ≤ 3.5, and a block pattern may be of the form 1.5 ≤ x, y ≤ 3.0.
In the combinatorial input domain, the failure pattern of any failure belongs, strictly speaking, to the point pattern, as all test inputs are discrete. However, from the perspective of the functionality and computation of each test input, the three failure patterns shown in Figure 1 also exist in the combinatorial input domain. For example, (1) if a failure f1 in the SUT shown in Table 1 is triggered whenever p2 takes either of two particular values and p3 takes either of two particular values, we consider the failure pattern of f1 a strip pattern, and its failure rate is (2 × 2)/(3 × 3) = 0.4444; (2) if a failure f2 in the SUT is triggered whenever each of p1, p2, p3, and p4 avoids one particular value, we consider the failure pattern of f2 a block pattern, and its failure rate is (2/3)^4 = 0.1975; and (3) if a failure f3 is caused by a single test case, we consider the failure region of f3 a point pattern, and its failure rate is 1/81 = 0.0123. According to Kuhn's investigations [28,34], however, the FTFI numbers are generally very low (i.e., smaller than the number of parameters), which means that the strip pattern is the most frequent failure pattern in the combinatorial input domain.

2.3. Adaptive Random Testing (ART).
The methodology of adaptive random testing (ART) [14,15] was proposed to enhance the failure-detection effectiveness of random testing (RT) by evenly spreading test cases across the whole input domain. In ART, test cases are not only randomly generated but also evenly spread. According to previous ART studies [15][16][17][18][19][20][21][22], ART can reduce the number of test cases required to identify the first fault by as much as 50% compared with RT.
There are many implementations of ART based on different notions. A simple algorithm is fixed-size-candidate-set ART (FSCS-ART) [15], which implements the notion of distance as follows. FSCS-ART uses two sets of test cases, namely, the executed set E and the candidate set C. E is a set of test cases that have been executed without revealing any failure, while C is a set of tests randomly selected from the input domain according to the uniform distribution.
E is initially empty; the first test case is randomly chosen from the input domain, and E is then incrementally updated with elements selected from C until a failure is revealed. From C, the element farthest away from all test cases in E is chosen as the next test case; that is, the criterion is to choose the element c_h from C as the next test case such that

min_{e ∈ E} dist(c_h, e) ≥ min_{e ∈ E} dist(c_j, e), for all c_j ∈ C,

where dist is the Euclidean distance; that is, in a d-dimensional input domain, for two test inputs tc1 = (x1, x2, ..., xd) and tc2 = (y1, y2, ..., yd),

dist(tc1, tc2) = sqrt(Σ_{i=1}^{d} (x_i − y_i)^2).

The process is repeated until the desired stopping criterion is satisfied. Figure 2 illustrates FSCS-ART in a two-dimensional input domain. In Figure 2(a), there are three previously executed test cases t1, t2, and t3, and two randomly generated candidates c1 and c2. To choose among the candidates, the distance of each candidate to each previously executed test case is calculated. Figure 2(b) shows how the closest previously executed test case is determined for each candidate. In Figure 2(c), the candidate c2 is selected as the next test case (i.e., t4 = c2), as the distance from c2 to its nearest previously executed test case is larger than that of the candidate c1.
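The max-min selection step can be sketched in a few lines of Python (an illustrative sketch under our own naming, not the authors' implementation):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def fscs_art_next(executed, candidates):
    """FSCS-ART step: return the candidate whose distance to its nearest
    previously executed test case is largest (the max-min criterion)."""
    return max(candidates,
               key=lambda c: min(euclidean(c, e) for e in executed))

# Mirroring Figure 2: the candidate far from every executed test case wins.
executed = [(1.0, 1.0), (2.0, 8.0), (8.0, 2.0)]
candidates = [(2.0, 2.0), (9.0, 9.0)]
assert fscs_art_next(executed, candidates) == (9.0, 9.0)
```

In the full algorithm, the chosen candidate is executed, appended to the executed set, and a fresh candidate set is drawn before the next round.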
In this paper, we treat the extension of FSCS-ART as the representative extension of ART to the combinatorial input domain, unless otherwise specified.

2.4. Similarity and Dissimilarity Measures for Combinatorial Input Domain. Measuring the similarity or dissimilarity (distance) between two test inputs is a core requirement for test case selection, evaluation, and generation. Generally speaking, in numerical input domains, Euclidean distance (see (3)) is the most widely used distance measure for continuous data. However, for a combinatorial input domain, whose parameters and corresponding values are finite and discrete, Euclidean distance may be neither applicable nor meaningful. Nevertheless, various distance (or dissimilarity) measures are popularly used in data mining for evaluating categorical data [32], for example in clustering (k-means), classification (k-NN, SVM), and distance-based outlier detection. In this subsection, we briefly describe the measures that will be adopted later in this paper.
(i) f_i(v) is the number of times parameter p_i takes the value v in the data set T. Note that if v ∉ V_i, f_i(v) = 0.
(ii) p̂_i(v) is the sample probability of parameter p_i taking the value v in T, given by

p̂_i(v) = f_i(v)/|T|.

(iii) p²_i(v) is another probability estimate of parameter p_i taking the value v, given by

p²_i(v) = f_i(v)(f_i(v) − 1)/(|T|(|T| − 1)).

(iv) S(X, Y) is a generalized similarity measure between two data instances X = (x1, x2, ..., xn) and Y = (y1, y2, ..., yn), where x_i, y_i ∈ V_i (i = 1, 2, ..., n). Its definition is given as follows:

S(X, Y) = Σ_{i=1}^{n} w_i S_i(x_i, y_i),

where S_i(x_i, y_i) (i = 1, 2, ..., n) is the per-parameter similarity between two values of parameter p_i and w_i denotes the weight assigned to parameter p_i. Therefore, we only need to present the definitions of S_i(x_i, y_i) and w_i for each similarity measure, unless otherwise specified.
Following [32], the measures discussed henceforth are all expressed as similarities, with dissimilarity or distance measures converted using the formula

S(X, Y) = 1/(1 + d(X, Y)),

where d(X, Y) is the dissimilarity between X and Y. Table 2 presents nine similarity measures for categorical parameter values that are widely used in data mining. In Table 2, the last column "Range" gives the range of S_i(x_i, y_i) for mismatches or matches of parameter values under each measure.
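For instance, the simplest of these measures, Overlap, sets the per-parameter similarity to 1 on a match and 0 on a mismatch, with uniform weight 1/n per parameter. A small Python sketch (our illustration; the uniform weighting is the usual convention):

```python
def overlap_similarity(x, y):
    """Overlap measure: per-parameter similarity is 1 for a match and
    0 for a mismatch, and each parameter is weighted 1/n."""
    n = len(x)
    return sum(1 for a, b in zip(x, y) if a == b) / n

# Two of three parameters match, so S(X, Y) = 2/3.
assert abs(overlap_similarity(("a", "b", "c"), ("a", "x", "c")) - 2 / 3) < 1e-9
```

The other measures in Table 2 keep the same weighted-sum skeleton and only change the per-parameter similarity S_i and the weights w_i.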

2.5. Effectiveness Measurement.
In this paper, we adopt the F-measure (i.e., the number of test cases required to detect the first failure) as the measurement of failure-detection effectiveness, since previous studies [35] have demonstrated that the F-measure is particularly suitable for adaptive testing strategies such as ART. Intuitively speaking, a smaller F-measure of ART relative to RT means that ART requires fewer test cases to detect the first failure, and hence has better failure-detection effectiveness than RT. For clarity, we will use the ART F-ratio (i.e., the ratio of ART's F-measure (F_ART) to RT's F-measure (F_RT)) to indicate the failure-detection effectiveness improvement of ART over RT.
However, it is extremely difficult to obtain ART's F-measure (F_ART) theoretically. As in other ART studies, F_ART is collected via simulations and empirical studies, whose procedure is described as follows. On the one hand, in simulation studies, a failure pattern (including its size and shape) and a failure rate are predefined to simulate a faulty program. The failure regions are then randomly placed inside the whole input domain. If a point inside one of the failure regions is picked by a testing strategy, a failure is said to be detected. On the other hand, in empirical studies, some faults are seeded into a subject program; once the subject program behaves differently from its fault-seeded version, a failure is said to be identified. The number of test cases needed to find the first failure is regarded as the F_ART of that run. This process is repeated S times until a statistically reliable estimate of F_ART has been obtained (±5% accuracy and 95% confidence in our paper). The value of S can be determined dynamically using the same method as in [15]. With respect to RT's F-measure (F_RT), since test cases are chosen with replacement according to the uniform distribution, F_RT is theoretically equal to 1/θ. Apart from the F-measure, another measurement is also used in our paper: the number of test cases required to first cover all possible value combinations at a given strength τ (denoted the Cτ-measure). This measurement is widely used in the combinatorial input domain. Unlike the F-measure, the stopping condition of the Cτ-measure is not that the first failure is detected, but that all possible τ-wise value combinations are covered. For clarity, we use C_RT(τ) to represent this measurement for RT and C_ART(τ) for ART.
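Since RT draws test cases with replacement, its F-measure is a geometric random variable with mean 1/θ; a quick seeded simulation (a sketch of the sampling procedure only, with an arbitrary θ of our choosing) illustrates this:

```python
import random

def f_measure_rt(theta, rng):
    """Count uniformly random test cases until the first failure, for a
    program with failure rate theta (a geometric random variable)."""
    count = 1
    while rng.random() >= theta:
        count += 1
    return count

rng = random.Random(0)  # fixed seed for reproducibility
theta = 0.05
mean_f = sum(f_measure_rt(theta, rng) for _ in range(5000)) / 5000
# The empirical mean is close to the theoretical F_RT = 1/theta = 20.
assert abs(mean_f - 1 / theta) < 2.0
```

The same loop, with the random pick replaced by an ART selection step, is the skeleton of the simulation procedure described above.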

Two Similarity Measures Based on Interaction Coverage
Apart from the various similarity measures described in Section 2.4, in this section we propose two further similarity measures based on interaction coverage, namely, incremental interaction coverage similarity (IICS) and multiple interaction coverage similarity (MICS), in order to exploit the characteristics of the combinatorial input domain in the selection of test cases. The similarity measures in Section 2.4 evaluate how similar two test cases are; the two similarity measures presented in this section evaluate the resemblance of a combinatorial test case to a combinatorial test suite. Before introducing them, we first describe a simple similarity measure of a test case against a test suite based on interaction coverage, named normalized covered τ-wise value combinations similarity (NCVCSτ) [36], which is widely used in the combinatorial input domain.
Definition 5. Given a combinatorial test suite T on TP(n, |V1||V2| ⋅⋅⋅ |Vn|), a combinatorial test case tc, and a strength τ, the normalized covered τ-wise value combinations similarity (NCVCSτ) of tc against T is defined as the ratio of the number of τ-wise value combinations covered by tc that have already been covered by T to C(n, τ); that is,

NCVCSτ(tc, T) = |CombSetτ(tc) ∩ CombSetτ(T)| / C(n, τ),

where CombSetτ(T) can be written as

CombSetτ(T) = ⋃_{t ∈ T} CombSetτ(t).

Obviously, NCVCSτ is a function that requires the strength τ to be set in advance, and its range is [0, 1.0]. Two properties of NCVCSτ are discussed as follows.

Theorem 6. If NCVCSτ(tc, T) = 1.0, then NCVCSλ(tc, T) = 1.0 for any 1 ≤ λ < τ ≤ n.

Proof. When NCVCSτ(tc, T) = 1.0, T covers all τ-wise value combinations covered by tc, and hence also covers all value combinations at strengths lower than τ that are covered by tc. As a consequence, NCVCSλ(tc, T) = 1.0 where 1 ≤ λ < τ ≤ n.

Theorem 7. If NCVCSτ(tc, T) = 0, then NCVCSλ(tc, T) = 0 for any τ < λ ≤ n.

Proof. When NCVCSτ(tc, T) = 0, no τ-wise value combination covered by tc is covered by T; that is, for all σ ∈ CombSetτ(tc), σ ∉ CombSetτ(T). Every λ-wise value combination covered by tc contains τ-wise value combinations covered by tc, so if some λ-wise combination of tc were covered by T, one of its τ-wise subcombinations would also be covered by T, a contradiction. Therefore, for all σ ∈ CombSetλ(tc), σ ∉ CombSetλ(T); that is, NCVCSλ(tc, T) = 0.
As we know, given a TP(n, |V1||V2| ⋅⋅⋅ |Vn|) and a strength τ, the number of all possible τ-wise value combinations is fixed. In other words, there exists a test case generation method using NCVCSτ as the criterion that can generate a certain number of combinatorial test cases, denoted T (T ⊆ D_all, where D_all is the set of all possible test cases), to cover all possible τ-wise value combinations. However, if testing with T fails to reveal any failures because T contains no failure-causing inputs, the next test case generated by this method is in fact obtained in a random manner. The main reason is that the NCVCSτ of every element of D_all is then equal to 1.0. Therefore, NCVCSτ is not particularly suitable for adaptive testing strategies such as ART. To solve this problem, we propose two similarity measures based on interaction coverage in the following subsections.
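Under the definitions above, computing NCVCSτ reduces to a small set operation (a minimal Python sketch with helper names of our own):

```python
from itertools import combinations
from math import comb

def comb_set(tc, tau):
    """Tau-wise value combinations covered by tc, as (index, value) pairs."""
    return {tuple((i, tc[i]) for i in idx)
            for idx in combinations(range(len(tc)), tau)}

def ncvcs(tc, suite, tau):
    """Fraction of tc's C(n, tau) tau-wise combinations already covered by
    the suite: 1.0 means tc adds nothing new at this strength."""
    covered = set().union(*(comb_set(t, tau) for t in suite), set())
    return len(comb_set(tc, tau) & covered) / comb(len(tc), tau)

suite = [("a", "b", "c")]
# tc shares only the pair (p1 = a, p2 = b) with the suite: 1 of 3 pairs.
assert abs(ncvcs(("a", "b", "x"), suite, 2) - 1 / 3) < 1e-9
```

The trailing empty set in the union keeps the call well-defined for an empty suite, for which NCVCSτ is 0.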
3.1. Incremental Interaction Coverage Similarity. As discussed in Theorem 6, if all possible -wise value combinations are covered by a combinatorial test suite , all possible value combinations at strengths lower than are also covered by . According to this fact, we present a new similarity measure based on interaction coverage, named incremental interaction coverage similarity (IICS).
Given a combinatorial test suite T on TP(n, |V1||V2| ⋅⋅⋅ |Vn|) and a combinatorial test case tc, the incremental interaction coverage similarity of tc against T is defined as

IICS(tc, T) = NCVCSλ(tc, T),

where λ is the smallest strength at which T does not cover all λ-wise value combinations covered by tc. It can be noted that if tc ∈ T, the IICS is equal to 1.0, as tc is the same as one of the elements of T; if tc ∉ T, the IICS of tc against T is actually equal to the NCVCSλ of tc against T where λ is gradually incremented. More specifically, if T covers all possible τ-wise value combinations but only some of the (τ + 1)-wise value combinations occurring in tc, then λ = τ + 1. Similar to NCVCSτ, the range of IICS is also [0, 1.0].
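A sketch of IICS following this definition (all helper names are ours; the NCVCS helper is restated so the sketch is self-contained):

```python
from itertools import combinations
from math import comb

def comb_set(tc, tau):
    return {tuple((i, tc[i]) for i in idx)
            for idx in combinations(range(len(tc)), tau)}

def ncvcs(tc, suite, tau):
    covered = set().union(*(comb_set(t, tau) for t in suite), set())
    return len(comb_set(tc, tau) & covered) / comb(len(tc), tau)

def iics(tc, suite):
    """NCVCS evaluated at the smallest strength lambda for which the suite
    has not yet covered everything tc covers; 1.0 for a duplicate of a
    suite element."""
    if tc in suite:
        return 1.0
    for tau in range(1, len(tc) + 1):
        s = ncvcs(tc, suite, tau)
        if s < 1.0:
            return s
    return 1.0

suite = [("a", "b", "c")]
# At strength 1 the suite misses the value x, so lambda = 1 and
# IICS = NCVCS_1 = 2/3.
assert abs(iics(("a", "b", "x"), suite) - 2 / 3) < 1e-9
```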
Here, we present an example to illustrate IICS. Suppose

3.2. Multiple Interaction Coverage Similarity. As shown in Section 3.1, the IICS measure begins at strength λ = 1 and then increments λ one step at a time. In other words, it considers different strength values when evaluating a combinatorial test case against a combinatorial test suite. However, IICS accounts for only one strength value at a time rather than considering all strength values simultaneously. As a consequence, we present another similarity measure based on interaction coverage, named multiple interaction coverage similarity (MICS).

According to Theorem 8, a test case generation method using IICS or MICS as the similarity measure becomes a random generation method once its generated test suite T covers all possible (n − 1)-wise value combinations. The main reason is that, for any candidates, no matter whether they are included in T or not, the IICS (or MICS) values of all candidates are identical.

Proof. If MICS(tc, T) = 0, then NCVCSτ(tc, T) = 0 for 1 ≤ τ ≤ n, because 0 ≤ wτ ≤ 1.0 and 0 ≤ NCVCSτ(tc, T) ≤ 1.0; that is, none of the τ-wise value combinations covered by tc are covered by T. According to (15), therefore, IICS(tc, T) = 0.
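MICS combines NCVCS values over several strengths at once. Since the exact coefficients of (16) are not restated here, the sketch below assumes a weighted sum with nonnegative weights summing to 1, uniform by default; the weighting choice and all names are our assumptions:

```python
from itertools import combinations
from math import comb

def comb_set(tc, tau):
    return {tuple((i, tc[i]) for i in idx)
            for idx in combinations(range(len(tc)), tau)}

def ncvcs(tc, suite, tau):
    covered = set().union(*(comb_set(t, tau) for t in suite), set())
    return len(comb_set(tc, tau) & covered) / comb(len(tc), tau)

def mics(tc, suite, weights=None):
    """Weighted combination of NCVCS over strengths 1..n, considering all
    interaction coverage levels simultaneously (uniform weights assumed)."""
    n = len(tc)
    weights = weights or [1.0 / n] * n
    return sum(w * ncvcs(tc, suite, tau)
               for tau, w in enumerate(weights, start=1))

suite = [("a", "b", "c")]
# NCVCS_1 = 2/3, NCVCS_2 = 1/3, NCVCS_3 = 0; uniform weights give 1/3.
assert abs(mics(("a", "b", "x"), suite) - 1 / 3) < 1e-9
```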
As discussed before, both IICS and MICS consider interaction coverage at different strengths when evaluating combinatorial test cases. However, they have some differences. Given a combinatorial test case tc, its IICS measure is actually calculated as the NCVCSλ at a single appropriate strength λ, which means that the IICS measure of tc considers only one interaction coverage level, while its MICS measure considers coverage at different strengths at the same time. In other words, the computation time of the IICS measure of tc is less than that of the MICS measure.
In summary, the two new similarity measures based on interaction coverage (IICS and MICS) fundamentally differ from NCVCSτ for the following reasons: (1) they do not require the strength value to be set in advance, and (2) they are more suitable for adaptive strategies than NCVCSτ.

Adaptive Random Testing for Combinatorial Test Inputs
In this section, we propose a new family of methods adopting ART in the combinatorial input domain, namely, ART-CID. Similar to previous ART studies, ART-CID can be implemented according to different notions. In this paper, we present one version of ART-CID by similarity (denoted FSCS-CID), which uses the strategy of FSCS-ART [15]. Since similarity measures rather than distances are used in this paper, the procedure of FSCS-CID differs somewhat from that of FSCS-ART. Detailed information is given below.

4.1. Similarity-Based Test Case Selection in FSCS-CID.
FSCS-CID uses two test sets, that is, the candidate set C of fixed size k and the executed set E, each of which has the same definition as in FSCS-ART. However, test cases in both C and E are obtained from the combinatorial input domain. For ease of description, let C = {c1, c2, ..., ck} and E = {e1, e2, ..., em}.
In order to select the next test case c_h from C, the criterion is described as follows:

max_{e ∈ E} S(c_h, e) ≤ max_{e ∈ E} S(c_j, e), for all c_j ∈ C (j = 1, 2, ..., k, j ≠ h),

where S is the similarity measure between two combinatorial test inputs; that is, the chosen candidate is the one whose highest similarity to the already executed test cases is lowest. The detailed algorithm implementing (19) is illustrated in Algorithm 1.
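One plausible reading of criterion (19), mirroring FSCS-ART's max-min distance rule in similarity space, is a min-max over similarities (an illustrative sketch; the Overlap measure and all names are our own choices):

```python
def overlap_similarity(x, y):
    return sum(1 for a, b in zip(x, y) if a == b) / len(x)

def btes(candidates, executed, sim):
    """Best test element selection: compute each candidate's similarity to
    its most similar executed test case, then pick the candidate whose
    value is lowest (the similarity analogue of max-min distance)."""
    return min(candidates,
               key=lambda c: max(sim(c, e) for e in executed))

executed = [("a", "b", "c")]
candidates = [("a", "b", "x"), ("y", "z", "w")]
# ("y", "z", "w") shares no values with the executed set, so it is chosen.
assert btes(candidates, executed, overlap_similarity) == ("y", "z", "w")
```

Any of the similarity measures of Sections 2.4 and 3 can be passed as `sim` without changing the selection loop.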

4.2. Algorithm of FSCS-CID.
As discussed before, Algorithm 1 is used to guide the selection of the best test case. In FSCS-CID, the process of Algorithm 1 runs until the stop condition is satisfied. In this paper, we consider two stop conditions: (1) the first software failure is detected (denoted StopCon1); and (2) all possible value combinations at strength are covered (denoted StopCon2). Detailed algorithm of FSCS-CID is shown in Algorithm 2.
Since the frequencies of parameter values are used in some similarity measures, such as Lin, OF, and Goodall2, a fixed-size set of test cases is required in order to count these frequencies. However, the executed set E is incrementally updated with elements selected from the candidate set C until StopCon1 (or StopCon2) is satisfied. In this paper, we take the following strategy to construct the fixed-size set of test cases when calculating the similarity between test inputs: during the process of choosing the i-th (i > 1) test input from C as the next test case (i.e., |E| = i − 1), each candidate c_h ∈ C is measured against all elements of E according to the similarity measure, and the fixed-size set of test cases for c_h is constructed as E ∪ {c_h}.

Experiment
In this section, experimental results, including simulations and experiments on real programs, are presented to analyze the effectiveness of FSCS-CID. We mainly compared our method with RT in terms of failure-detection effectiveness (the F-measure) and the rate of covering value combinations at a given strength (the Cτ-measure). For ease of description, we use the terms Goodall1, Goodall2, Goodall3, Goodall4, Lin, Lin1, Overlap, Eskin, OF, IICS, and MICS to denote FSCS-CID using the corresponding similarity measure, and the term RT to denote random testing.

5.1. Simulation.
In the following subsection, two simulations are presented to analyze the effectiveness of FSCS-CID according to the rate of covering τ-wise value combinations (i.e., the Cτ-measure). We used two test profiles, TP(10, 2^10) and TP(10, 2^5 3^5), which are commonly used in previous studies [37].

Algorithm 1: Select the best test element based on similarity (BTES). Its core step calculates the similarity S(c_h, e) between each candidate c_h and each executed test case e, and it returns the best candidate.

Algorithm 2: FSCS-CID. Input: a test profile TP(n, |V1||V2| ⋅⋅⋅ |Vn|). Output: the size of the executed set E and the first failure-causing input when meeting StopCon1 (or the size of E when meeting StopCon2).

Setup.
Since the strength τ was known before testing, in this simulation we also considered FSCS-CID using NCVCSτ [36] as the similarity measure (denoted NCVCS). Apart from MICS, none of the other methods requires a strength to be set. As for MICS, strength values from 1 to n would normally be considered when calculating the MICS measure according to (16). However, because τ is known, we focused on the strength values from 1 to τ for calculating the MICS measure; as a consequence, (16) considers only NCVCS1, NCVCS2, ..., NCVCSτ. Each method runs until StopCon2 is satisfied. Additionally, we use the Cτ-measure, where τ = 2, 3, 4, as the metric to evaluate the rate of covering value combinations at strength τ for each method. Figure 3 summarizes the number of test cases required to cover all possible τ-wise value combinations (i.e., the Cτ-measure) generated by each method for the above two test profiles. Based on the experimental data, we have the following observations.
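The Cτ-measure for RT, used as the evaluation baseline here, can be estimated directly by sampling (a seeded sketch; the profile, the seed, and all names are arbitrary choices of ours):

```python
import random
from itertools import combinations, product

def ct_measure_rt(values, tau, rng):
    """Count uniformly random test cases until every tau-wise value
    combination of the profile is covered."""
    n = len(values)
    target = {tuple(zip(idx, vals))
              for idx in combinations(range(n), tau)
              for vals in product(*(values[i] for i in idx))}
    covered, count = set(), 0
    while covered != target:
        count += 1
        tc = tuple(rng.choice(vs) for vs in values)
        covered |= {tuple((i, tc[i]) for i in idx)
                    for idx in combinations(range(n), tau)}
    return count

rng = random.Random(1)  # fixed seed for reproducibility
profile = [[0, 1]] * 5  # a small TP(5, 2^5) profile
c2 = ct_measure_rt(profile, 2, rng)
assert c2 >= 4  # at least v^tau = 4 test cases are needed in any case
```

Replacing the random pick of `tc` with a similarity-guided selection gives the corresponding C_ART(τ) estimate.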

Results.
(1) For each test profile, the Cτ-measure (τ = 2, 3, 4) values of all FSCS-CID methods using different similarity measures are smaller than those of RT. In other words, the FSCS-CID methods require fewer test cases to cover all τ-wise value combinations than RT, which means that the FSCS-CID methods cover value combinations at a higher rate than RT.

(2) Among the FSCS-CID methods, NCVCS obtains the best metric values, which are about 30%∼50% of those of RT. IICS has the second best metric values, followed by OF. For TP(10, 2^10), Goodall3 is the least effective, while for TP(10, 2^5 3^5), Lin performs worst.
(3) From the perspective of similarity categories, the FSCS-CID methods using the interaction-coverage-based similarity measures (IICS, MICS, and NCVCS) perform best, while the FSCS-CID methods using the information-theoretic similarity measures (Lin and Lin1) perform worst.

Analysis.
Here, we briefly analyze the above observations. Observation (1) is explained as follows: the FSCS-CID methods select as the next test case the candidate with the smallest similarity to the already generated test cases, while RT simply generates test cases at random from the combinatorial input domain. As a consequence, the FSCS-CID methods spread test cases more diversely than RT over the combinatorial input domain. Observations (2) and (3) are also easy to explain. On the one hand, since the Cτ metric is related to τ-wise value combinations, NCVCS performs best because it selects the next test case covering as many uncovered τ-wise value combinations as possible; in other words, it has the fastest rate of covering all τ-wise value combinations. On the other hand, the other two interaction-coverage-based methods, IICS and MICS, consider different strength values when generating test cases, and both of them treat the strength as an indispensable part: IICS evaluates a candidate at strengths incremented from 1 upward, while MICS considers strengths from 1 to τ simultaneously. Hence, it is reasonable that, compared to the other categories, the FSCS-CID methods using interaction-coverage-based similarity measures perform best according to the Cτ metric.

An Empirical Study.
In this section, an empirical study was conducted to compare the performance of FSCS-CID and RT in practical situations, using the F-measure as the effectiveness metric. To describe the data clearly, we used the ART F-ratio, which is defined as the F-measure ratio between FSCS-CID and RT, that is, F_FSCS-CID / F_RT. Intuitively speaking, a smaller ART F-ratio implies a higher improvement of FSCS-CID over RT, and (1 − F_FSCS-CID / F_RT) is the F-measure improvement of FSCS-CID over RT.
In this empirical study, we use a set of six fault-seeded C programs with 9 versions. Five subject programs, namely count, series, tokens, ntree, and nametbl, are downloaded from Chris Lott's website (http://www.maultech.com/chrislott/work/exp/); they have been widely used in combinatorial testing research, such as the comparison of defect revealing mechanisms [38], the evaluation of different combination strategies for test case selection [39], and fault diagnosis [40,41]. The remaining subject programs are a series of flex programs (the model used in this paper is unconstrained, which has some limitations: "We note that in a real test environment an unconstrained TSL would most likely be prohibitive in size and would not be used" [42]), downloaded from the Software-artifact Infrastructure Repository (SIR) [43]; they are popularly used in combinatorial test suite construction [44] and combinatorial interaction regression testing [42]. Table 3 presents detailed information about these subject programs: the third column "LOC" gives the number of lines of executable code, "#S." is the number of seeded faults in each subject program, and "#D." is the number of faults that can be detected by at least one test case derived from the accompanying test profiles (which are not guaranteed to detect all faults). However, in our study we only use a portion of the detectable faults, whose size is shown as "#U.". The main reason is that the detectable but unused faults have failure rates exceeding 0.5. If the failure rate of a fault is larger than 0.5, the F-measure of random testing is theoretically less than 1/0.5 = 2. As a consequence, the F-measure of FSCS-CID depends on the first, randomly selected, test case: if the first test case cannot detect a failure, the F-measure of FSCS-CID is larger than or equal to 2.
Therefore, for such faults the F-measure of FSCS-CID is essentially determined by random testing.
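The argument above rests on the fact that the F-measure of RT for a fault with failure rate θ is theoretically 1/θ, hence below 2 when θ > 0.5. This can be checked with a small simulation; the function name, trial count, and seed are illustrative choices:

```python
import random

def rt_f_measure(failure_rate, trials=20000, rng=random.Random(1)):
    """Estimate the F-measure of random testing (number of test cases until
    the first failure) for a fault with the given failure rate.
    Theoretically E[F] = 1 / failure_rate (a geometric distribution)."""
    total = 0
    for _ in range(trials):
        n = 1
        while rng.random() >= failure_rate:
            n += 1
        total += n
    return total / trials

est = rt_f_measure(0.6)
print(round(est, 2))  # ≈ 1.67, close to the theoretical 1/0.6 and below 2
```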
For the purpose of clear description, we order the used faults in each subject program in descending order of failure rate. The range of failure rates in each program is shown in Table 3. We used all twelve FSCS-CID versions, each with a different similarity measure, to test these fault-seeded programs. The results of the empirical study are given in Figure 4, where the x-axis represents each seeded fault in the subject program. The following observations can be made:
(1) According to the ART F-ratio, all twelve FSCS-CID versions, including Goodall1, Goodall2, Goodall3, Goodall4, Lin, Lin1, Overlap, Eskin, OF, IICS, MICS1, and MICS2, perform better than RT. In the best case, the improvement of FSCS-CID over RT is about 40% (i.e., the ART F-ratio is 60%).
(2) Table 4 shows the average ART F-ratio of each FSCS-CID version for each subject program. According to the data in Table 4, the FSCS-CID version using OF generally performs best, followed by IICS, while Lin and Lin1 generally perform worst. In addition, Eskin performs best for the program tokens, and Goodall1 has the best performance for the program flexv4.
In summary, our simulation results (Section 5.1) have shown that our FSCS-CID algorithm (irrespective of the similarity measure used) covers value combinations at different strength values at higher rates than random testing. Besides, the empirical study has shown that the FSCS-CID algorithm performs better than RT in terms of the number of test cases required to detect the first failure (i.e., the F-measure).

Threats to Validity.
The experimental results suffer from some threats to validity; in this section, we outline the major ones. In the simulation study, two widely used, but limited, test profiles were employed. In the empirical study, several real-life programs were used, which have been popularly investigated by different researchers; however, the faults seeded in each subject program have high failure rates.
To address these potential threats, additional studies using a larger number of test profiles and a larger number of subject programs with low-failure-rate faults will be conducted in the future.
In addition, although two metrics (the F-measure and the combination-coverage metric) were employed in our experiment, we recognize that there may be other metrics more pertinent to the study.

Discussion and Conclusion
Adaptive random testing (ART) [15] has been proposed to enhance the failure-detection capability of random testing (RT) by evenly spreading test cases over the input domain, and has been widely applied in various settings such as numerical programs, Java programs, and object-oriented programs. In this paper, we broaden the principle of ART to a new type of input domain that had not yet been investigated, namely the combinatorial input domain. Due to the special characteristics of this domain, the test case similarity (or dissimilarity) measures previously used in ART may not be suitable. By adopting some well-known similarity measures from data mining and proposing two new similarity measures based on interaction coverage, we developed a new approach, named ART-CID, to apply the original ART to the combinatorial input domain. We conducted experiments, including simulations and an empirical study, to analyze the effectiveness of one version of ART-CID (FSCS-CID, which is based on fixed-size-candidate-set ART). Compared with RT, FSCS-CID not only achieves higher rates of covering all possible value combinations at any given strength, but also requires fewer combinatorial test cases to detect the first failure in the seeded programs.

Combinatorial interaction testing (CIT) [33] is a black-box testing method that has been widely used in the combinatorial input domain. It aims at constructing an effective test suite to identify interaction faults caused by parameter interactions. Some greedy CIT algorithms, such as AETG [45], TCG [46], and DDA [47], have a similar mechanism to FSCS-CID. Taking AETG as an example: like AETG, FSCS-CID first constructs some candidates and then chooses the "best" element among them as the next test case according to some criterion. However, there are some fundamental differences between AETG and FSCS-CID, which are mainly summarized as follows.
(1) Different construction strategies of candidates: FSCS-CID constructs candidates in a random manner, while AETG first orders all parameters and then assigns a value to each parameter, such that all assigned parameter values can cover the largest number of value combinations at a given strength.
(2) Different test case selection criteria: AETG selects as the next test case the candidate that covers the largest number of value combinations at a given strength, while FSCS-CID chooses the next test case according to its similarity measure.
(3) Different goals: AETG aims at covering all possible value combinations of a given strength with as few test cases as possible, so its sole stopping condition is that all value combinations of the given strength are covered by the generated test cases. FSCS-CID, in contrast, is an adaptive strategy: its stopping condition is not limited to covering all value combinations of a given strength and can be, for example, detecting the first failure in the SUT.
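The contrast in selection criteria (difference (2) above) can be made concrete with a small sketch. Both functions pick from the same candidate set, but by different keys; the pairwise bookkeeping, the overlap similarity, and the tiny binary example are illustrative assumptions, not the algorithms' full implementations:

```python
from itertools import combinations, product

def pairs_of(tc):
    """All (parameter-index pair, value pair) combinations covered by tc."""
    return {(idx, (tc[idx[0]], tc[idx[1]]))
            for idx in combinations(range(len(tc)), 2)}

def aetg_style_pick(candidates, uncovered):
    """AETG-style criterion: the candidate covering the MOST uncovered pairs."""
    return max(candidates, key=lambda c: len(pairs_of(c) & uncovered))

def fscs_cid_pick(candidates, executed, similarity):
    """FSCS-CID criterion: the candidate LEAST similar to the executed set."""
    return min(candidates, key=lambda c: max(similarity(c, e) for e in executed))

overlap = lambda a, b: sum(x == y for x, y in zip(a, b)) / len(a)

# Three binary parameters; one executed test case.
executed = [(0, 0, 0)]
all_pairs = {(idx, vals) for idx in combinations(range(3), 2)
             for vals in product((0, 1), repeat=2)}
uncovered = all_pairs - pairs_of(executed[0])
candidates = [(0, 0, 1), (1, 1, 1)]
print(aetg_style_pick(candidates, uncovered))        # (1, 1, 1)
print(fscs_cid_pick(candidates, executed, overlap))  # (1, 1, 1)
```

Here the two criteria happen to agree, but they diverge as soon as the most-dissimilar candidate is not the one covering the most uncovered pairs; only the AETG-style pick stops being meaningful once `uncovered` is empty, which reflects difference (3).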
In this paper, constraints among value combinations have not been considered; however, they often exist in real-life programs. For example, as shown in Table 1, there may exist a constraint between the value "Full Size" of one parameter and the value "Single" of another: when the first parameter takes "Full Size", the second cannot take "Single" (i.e., "Full Size" and "Single" cannot occur together in a combinatorial test case). In this case, the FSCS-CID method proposed in this paper can still be applied, simply by checking whether or not each selected test case violates the constraints among value combinations. Generally speaking, this check can be implemented in two phases: (1) when constructing the candidate set and (2) when adding the latest test case to the executed set. However, how to deal with constraints among value combinations should be further studied.
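The constraint check described above, discarding any candidate that contains a forbidden value combination, can be sketched as follows; the encoding of constraints as forbidden parameter-to-value assignments is a hypothetical illustration of the Table 1 example:

```python
def violates(tc, constraints):
    """Return True if test case tc contains any forbidden value combination.
    Each constraint is a dict of parameter-index -> value pairs that must
    not all occur together in one test case."""
    return any(all(tc[p] == v for p, v in c.items()) for c in constraints)

# Hypothetical constraint from the Table 1 example:
# parameter 0 = "Full Size" excludes parameter 1 = "Single".
constraints = [{0: "Full Size", 1: "Single"}]
print(violates(["Full Size", "Single", "x"], constraints))   # True
print(violates(["Full Size", "Double", "x"], constraints))   # False
```

Applying this filter when candidates are generated, and again before a chosen test case enters the executed set, corresponds to the two phases mentioned above.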
In the future, we plan to further investigate how to improve the effectiveness of the approach by adopting other similarity measures that may be available in combinatorial input domain or by considering additional factors to guide test case generation. In addition, how to extend other original