Two Improving Approaches for Faulty Interaction Localization using Logistic Regression Analysis

Faulty Interaction Localization (FIL) is the process of identifying which combination of input parameter values induced test failures in combinatorial testing. Accurate and fast FIL provides helpful information for fixing the defects that cause the test failures. One type of conventional FIL approach, which analyzes the test results of all test cases and estimates the suspiciousness of each combination, has two main concerns: (1) its accuracy is insufficient, and (2) it sometimes requires a huge time cost. In this paper, we propose two novel approaches to address these concerns. FROGa attempts to estimate suspiciousness more accurately using logistic regression analysis. FROGb attempts to estimate failure-inducing combinations at high speed by estimating subsets of them using logistic regression analysis and exploring only their supersets. Through evaluation experiments using a large number of artificial test results based on several real software systems, we observed that FROGa has very high accuracy, and FROGb can drastically reduce the time cost for targets that have been difficult to complete with the conventional method.


Introduction
Software testing plays an essential role in guaranteeing the behavior of software systems. However, modern software systems are becoming ever larger and more complicated. Testing them is hard work because system behavior is affected by many factors, e.g., various input parameters and settings. Even worse, defects are sometimes caused by interactions between multiple factors rather than by a single factor. Testing all interactions of factors is impractical, especially for modern industrial software.
Combinatorial testing [1,2] is known as an efficient black-box testing technique for detecting test failures caused by interactions of multiple factors. Combinatorial testing uses only the minimum number of test cases covering all possible combinations of input values to satisfy a particular criterion. The cost of software testing is reduced by focusing on finding defects due to combinations of only a certain number of factors rather than all input patterns.
However, combinatorial testing raises the problem of Faulty Interaction Localization (FIL). FIL is the process of identifying which combination of input parameter values induced the detected test failures. Identifying the minimum conditions that reproduce test failures makes it easier to identify and repair the defect in the source code. However, it is not easy to identify such failure-inducing combinations from combinatorial testing results, because test cases in combinatorial testing are designed to cover many combinations rather than to test specific behaviors. Therefore, it is essential to develop practical FIL approaches for effective combinatorial testing, and many such approaches have been studied [3][4][5][6][7].
BEN [5] is a powerful existing FIL method that analyzes combinatorial testing results to estimate which combination is most likely to induce failure. For this purpose, BEN calculates the suspiciousness of each possible combination with its own algorithm. However, BEN has two concerns: (1) The accuracy of suspiciousness estimation is insufficient. BEN compensates for this by creating and running additional tests, so a more accurate estimate could reduce the cost of those additional tests. (2) Analyzing large combinations may need an unrealistic processing time due to combinatorial explosion.
This paper proposes two approaches, one for each concern in the existing FIL approach, both using logistic regression analysis. Our first approach, FROGa, focuses on concern (1), and our second approach, FROGb, focuses on concern (2). The name FROG stands for "FIL based on Regression coefficient of lOGistic regression." FROGa aims to calculate the suspiciousness of combinations inducing test failures more accurately by using logistic regression analysis. FROGa first extracts all possible failure-inducing combinations of parameter values from the test results, then encodes the inclusion relationship between the combinations and each test case. Finally, FROGa inputs the encoded data to logistic regression to obtain regression coefficients as the suspiciousness of each combination.
To evaluate the performance of FROGa, we implemented FROGa and BEN and applied them to the same experimental subjects. As the experimental subjects, we artificially generated many combinatorial testing results to ensure the validity of the evaluation. In total, 126,000 artificial testing results were generated using the input parameter models of six real software systems and Pict, a combinatorial test generation tool.
According to our experiment, FROGa can identify the input values more accurately than BEN. This experiment used the artificially generated combinatorial testing results on six real systems. In the experiments, we obtained answers to the following two research questions (RQ).
RQ1 Does FROGa improve the accuracy of suspiciousness calculation compared to BEN?
Yes. In particular, FROGa can significantly improve the accuracy of ranking the suspiciousness of combinations when targeting testing results with high coverage.
RQ2 How much is the difference in the time cost between FROGa and BEN?
There is little difference in time cost between FROGa and BEN.
Next, FROGb aims to directly estimate failure-inducing combinations by estimating the subsets of those combinations. FROGb uses logistic regression analysis to estimate suspicious input parameter values as subsets of a failure-inducing combination and then explores the supersets of those values step by step. Using FROGb, only the most suspicious combinations are extracted without dealing with all possible combinations, thus avoiding a combinatorial explosion.
According to our experiment with the same subjects as for FROGa, FROGb can significantly reduce the time cost of finding large failure-inducing combinations with high accuracy. More specifically, we obtained answers to the following two research questions.
RQ3 How much does FROGb reduce the time of extracting highly suspicious combinations compared to BEN and FROGa?
FROGb can significantly reduce the time compared to BEN and FROGa when targeting testing results with high coverage.
RQ4 How accurately can FROGb extract failure-inducing combinations?
FROGb can extract all failure-inducing combinations in 63.3% of all cases and can extract at least one of them in 95.6% of all cases.
The rest of this paper is organized as follows: Section 2 gives background and definitions. Section 3 introduces the first proposal, FROGa, and Section 4 evaluates FROGa. Section 5 introduces the second proposal, FROGb, and Section 6 evaluates FROGb. Section 7 discusses threats to validity. Section 8 introduces related work. Finally, Section 9 concludes this paper.

Combinatorial Testing
Combinatorial testing is a black-box testing technique that focuses on the combination of multiple input parameter values [1,2]. The primary purpose of combinatorial testing is to efficiently detect failures caused by the interaction of such multiple input parameter values. For this purpose, the combinatorial testing methodology designs a test suite with as few test cases as possible while covering all combinations of input parameter values up to a certain number. A test suite designed in this way is also referred to as a covering array. A survey [8] reported that the upper limit on the number of input values whose combination can induce a failure is between four and six. Therefore, combinatorial testing that focuses on only a few combinations is practical.
Testing that attempts to detect failures caused by the interaction of t or fewer input values, using a covering array that covers combinations of t or fewer input parameter values, is referred to as t-way testing. In t-way testing, all combinations of t or fewer parameter values are tested at least once with a minimum number of test cases. The value of t in t-way testing is referred to as the combinatorial coverage in this paper.
We now formally define some notions related to combinatorial testing. First, the input model of the System Under Test (SUT) for combinatorial testing is modeled in terms of parameters, their possible values, and the constraints between the parameter values.
Definition 1 (SUT) An SUT model for combinatorial testing is ⟨P, V, φ⟩, where $P = \{p_1, \dots, p_n\}$ is a set of parameters, $V = \{V_1, \dots, V_n\}$ in which $V_i$ is a set of values assigned to a parameter $p_i$, and φ is a set of constraints on combinations of parameter values.
A test case is a tuple that assigns each parameter a value without violating the SUT constraints.
A schema is a formal representation of a combination of input parameter values. This definition was introduced in a previous study [3] and is also used in a recent study [7].
Definition 3 (Schema) For the SUT, the n-tuple $(-, v_{n_1}, \dots, v_{n_k}, \dots)$ is referred to as a k-degree schema ($0 < k \le n$) when some k parameters have fixed values and the other, irrelevant parameters are represented as "−".

[Table 1: example SUT model; columns Parameter and Values, plus a Constraint row]
Definition 4 (Sub-schema and Super-schema) Let $c_l$ be an l-degree schema and $c_m$ be an m-degree schema in the SUT, with l < m. If all the fixed parameter values in $c_l$ are also in $c_m$, then $c_m$ subsumes $c_l$. In this case, we also say that $c_l$ is a sub-schema of $c_m$ and $c_m$ is a super-schema of $c_l$, which is denoted as $c_l \prec c_m$. For example, a 2-degree schema (-, 4, 4, -) is a sub-schema of a 3-degree schema (-, 4, 4, 5), that is, (-, 4, 4, -) ≺ (-, 4, 4, 5).
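The schema notation of Definitions 3 and 4 can be sketched in Python as follows. This is only an illustration: representing "-" by `None` and the helper names `degree` and `is_sub_schema` are our assumptions, not part of the original definitions.

```python
from typing import Optional, Tuple

# A schema has one slot per parameter; None plays the role of "-".
Schema = Tuple[Optional[int], ...]

def degree(schema: Schema) -> int:
    """Number of parameters with a fixed value."""
    return sum(v is not None for v in schema)

def is_sub_schema(c_l: Schema, c_m: Schema) -> bool:
    """True if c_l is a sub-schema of c_m: c_l has a lower degree
    and every fixed value of c_l is also fixed in c_m."""
    if len(c_l) != len(c_m) or degree(c_l) >= degree(c_m):
        return False
    return all(a is None or a == b for a, b in zip(c_l, c_m))

# The example from Definition 4: (-, 4, 4, -) versus (-, 4, 4, 5).
print(is_sub_schema((None, 4, 4, None), (None, 4, 4, 5)))
```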
Example 1 Table 1 shows an example SUT model. This system has five configurable parameters: CPU, Network, DBMS, OS, and Browser. The first three parameters have two possible values, and the remaining two parameters have three possible values. There is a constraint that CPU ≠ AMD must hold when OS = Mac, and OS = Win must hold when Browser = IE. In other words, the combinations of the parameter values (Mac, AMD), (Linux, IE), and (Mac, IE) are not allowed.
Table 2 shows the covering array that realizes the 3-way test of the SUT shown in Table 1 (not including the "result" column). This test suite consists of 20 test cases and includes all possible 115 3-degree schemas (Intel, Wifi, MySQL, -, -), ..., (-, -, Sybase, Mac, Chrome) at least once in the test cases while satisfying the constraints. On the other hand, there are 54 possible combinations of these parameter values under the constraints, and a total of 54 test cases are needed to cover all of them. Therefore, combinatorial testing can significantly reduce the number of test cases as a tradeoff against the coverage criteria.

Faulty Interaction Localization
Faulty Interaction Localization (FIL) is the process of identifying combinations of input parameter values that induce faulty behavior based on the results of combinatorial tests. Developers can quickly identify and fix faulty components by identifying the testing input values that are the minimum requirements to reproduce the faulty behavior.
The purpose of FIL is to identify minimal schemas that induce failures from given combinatorial testing results. We refer to such a schema as a failure-inducing schema.
Definition 5 (Failure-inducing schema) A schema s is referred to as a Minimal Failure-inducing Schema if all test cases including the schema s always cause a particular failure and none of the sub-schemas of s cause the failure. In this paper, we refer to it simply as a failure-inducing schema.
Furthermore, some FIL methods first extract all possible failure-inducing schemas from the entire combinatorial testing result and then narrow down the failure-inducing schemas from these schemas. Such a schema that may be failure-inducing is defined as a candidate schema.
Definition 6 (Candidate schema) Given a test suite T and a test oracle R : tc ∈ T → {pass, fail}, a candidate schema is a schema s such that
- there is a test case tc in T such that tc ⊇ s, and
- for every test case tc in T such that tc ⊇ s, R(tc) = fail.
Example 2 In the example system shown in Table 1, a failure occurs when OS = Mac and Browser = Firefox. Also, another failure occurs when CPU = AMD, Network = Wifi, and DBMS = Sybase. Therefore, two schemas, the 2-degree schema (1, -, -, 2, -) and the 3-degree schema (-, 2, 1, -, 1), are failure-inducing schemas.
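Definition 6 translates almost directly into code. The following Python sketch enumerates candidate schemas up to a given degree; the function names, the `None`-for-"-" encoding, and the string labels "pass" and "fail" are illustrative assumptions rather than the implementation used in this paper.

```python
from itertools import combinations

def includes(test_case, schema):
    """True if every fixed value of the schema appears in the test case."""
    return all(s is None or s == v for s, v in zip(schema, test_case))

def schemas_of(test_case, max_degree):
    """Yield every schema of degree 1..max_degree contained in a test case."""
    n = len(test_case)
    for d in range(1, max_degree + 1):
        for idx in combinations(range(n), d):
            yield tuple(test_case[i] if i in idx else None for i in range(n))

def candidate_schemas(tests, results, max_degree):
    """Schemas that occur in at least one test case and only in failed ones."""
    failed = [t for t, r in zip(tests, results) if r == "fail"]
    passed = [t for t, r in zip(tests, results) if r == "pass"]
    in_failed = {s for t in failed for s in schemas_of(t, max_degree)}
    return {s for s in in_failed if not any(includes(t, s) for t in passed)}

# Toy 2-way suite over three binary parameters; only (1, 1, 0) fails.
tests = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
results = ["pass", "pass", "pass", "fail"]
print(candidate_schemas(tests, results, max_degree=2))
```

In this toy suite every 2-degree schema of the failed test case is a candidate, which illustrates why candidate schemas still have to be ranked or narrowed down afterwards.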

Logistic Regression Analysis
Logistic regression is a well-known statistical model for the regression of a binary dependent variable and is mainly used as a supervised machine learning method [9]. In the field of software engineering, this model is often used for classifying fault-prone modules with software metrics [10][11][12].
The logistic model is represented by the following equation:
\[ Pr(Y = 1 \mid x_1, \dots, x_n) = \frac{1}{1 + \exp(-(b_0 + b_1 x_1 + \dots + b_n x_n))} \]
where $x_1, \dots, x_n$ are the independent variables, $b_1, \dots, b_n$ are their coefficients, $b_0$ is the intercept, Y is the binomial dependent variable that takes 0 or 1, and Pr is the conditional probability that Y = 1 given the values of $x_1, \dots, x_n$. In logistic regression, the regression equation is obtained by calculating the values of $b_1, \dots, b_n$, i.e., the regression coefficients, from the training data so that the error between the conditional probability and the correct answer value is the smallest. When this model is used as a classifier, the test data is substituted into the regression equation, the conditional probability that the binomial dependent variable is positive is calculated, and the classification result is presented according to a threshold.
Logistic regression is also used as a data analysis method, and this study uses it in that way. Logistic regression analysis relies on the fact that a regression coefficient can be viewed as the degree to which the probability Pr is affected by a unit increase in the corresponding independent variable [13]. A large regression coefficient indicates that a change in the value of that independent variable has a large impact on the variation of the probability Pr. Therefore, a regression coefficient can be regarded as the importance of the independent variable for the binary dependent variable being positive.
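To make the role of a coefficient concrete, the standard logistic model can be rewritten in log-odds form. The derivation below is textbook material rather than a result of this paper; it shows that a unit increase in $x_i$ adds $b_i$ to the log-odds of Y = 1, i.e., multiplies the odds by $e^{b_i}$.

```latex
\ln\frac{Pr}{1 - Pr} = b_0 + b_1 x_1 + \dots + b_n x_n
\qquad\Longrightarrow\qquad
\frac{\left.\dfrac{Pr}{1 - Pr}\right|_{x_i = 1}}{\left.\dfrac{Pr}{1 - Pr}\right|_{x_i = 0}} = e^{b_i}
```

For a binary independent variable, a positive $b_i$ therefore means that switching $x_i$ from 0 to 1 raises the probability that Y = 1, and a larger $b_i$ means a stronger effect.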
3 Improved FIL method using logistic regression

Concept of FROGa
In this section, we propose FROGa as an improved FIL method. FROGa is based on the existing FIL method, BEN. BEN calculates and then ranks the suspiciousness of candidate schemas using its original algorithm with several software metrics. In contrast, FROGa attempts to rank candidate schemas more accurately by using logistic regression analysis.
We now explain why logistic regression analysis can calculate the suspiciousness of a candidate schema. Recall that a regression coefficient can be regarded as the importance of the independent variable for the binary dependent variable being positive. Let each independent variable be binary (1 = included, 0 = not included), representing whether a candidate schema is included in a test case. The dependent variable is also binary (1 = failed, 0 = passed), representing the test result. In this case, a regression coefficient expresses how strongly changing the corresponding independent variable from 0 to 1 affects the probability that the test case fails. This change corresponds to going from a test case that does not include the schema to one that does. Therefore, a schema with a large regression coefficient is likely to be a failure-inducing schema, and the regression coefficient is expected to indicate the magnitude of suspiciousness.

Model
FROGa needs three inputs: a test suite, the test results for each test case, and k, which is the assumed maximum degree of a failure-inducing schema. The test results must be classified as pass or fail by some test oracle. Moreover, k is the degree of schema that the given combinatorial test suite covers.
First, FROGa extracts all candidate schemas whose degree is less than or equal to k from the test suite and the results. Let $S_k$ be the set of these candidate schemas.
Next, FROGa creates a data table Φ representing the inclusion relationship between every candidate schema in $S_k$ and every test case. Define a function inc(sc, st), which represents the inclusion of a candidate schema sc in a test case schema st, as follows.
\[ inc(sc, st) = \begin{cases} 1 & \text{if } sc \text{ is included in } st \\ 0 & \text{otherwise} \end{cases} \]
With the function inc(sc, st), for the candidate schemas $sc_1, \dots, sc_m$ in $S_k$ and the test cases $st_1, \dots, st_n$ of the test suite, the data table Φ is expressed as follows.
\[ \Phi = \begin{bmatrix} inc(sc_1, st_1) & \cdots & inc(sc_m, st_1) \\ \vdots & \ddots & \vdots \\ inc(sc_1, st_n) & \cdots & inc(sc_m, st_n) \end{bmatrix} \]
Moreover, FROGa creates a pseudo-Boolean vector R that represents the test results for each test case. With the function result(st) → {0, 1}, which encodes the test result of st as pass = 0 and fail = 1, R is expressed as follows.
\[ R = [result(st_1), \dots, result(st_n)]^T \]
Then, FROGa runs a logistic regression with every column vector of Φ as an independent variable and R as the dependent variable. As a result, the regression coefficient for each corresponding candidate schema is obtained.
Finally, FROGa ranks the candidate schemas by the obtained regression coefficients as the magnitude of suspiciousness. The larger the regression coefficient, the higher the possibility that a candidate schema is failure-inducing.
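The encoding and ranking steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: a small L2-penalised logistic regression fitted by plain batch gradient descent stands in for a library solver so that the block is self-contained, and the names `fit_logistic` and `froga_rank` as well as the hyperparameters `l2`, `lr`, and `epochs` are hypothetical choices, not values from this paper.

```python
import math

def includes(test_case, schema):
    """inc(sc, st): True if every fixed value of the schema is in the test case."""
    return all(s is None or s == v for s, v in zip(schema, test_case))

def fit_logistic(X, y, l2=1.0, lr=0.1, epochs=2000):
    """Return (intercept, coefficients) of an L2-penalised logistic model
    fitted by batch gradient descent (a stand-in for a library solver)."""
    m = len(X[0])
    b0, b = 0.0, [0.0] * m
    for _ in range(epochs):
        g0, g = 0.0, [0.0] * m
        for row, target in zip(X, y):
            z = b0 + sum(w * x for w, x in zip(b, row))
            err = 1.0 / (1.0 + math.exp(-z)) - target
            g0 += err
            for j, x in enumerate(row):
                g[j] += err * x
        b0 -= lr * g0
        b = [w - lr * (gj + l2 * w) for w, gj in zip(b, g)]
    return b0, b

def froga_rank(tests, results, candidates):
    """Encode Phi and R, fit the model, and rank candidates by coefficient."""
    phi = [[1 if includes(t, sc) else 0 for sc in candidates] for t in tests]
    r = [1 if res == "fail" else 0 for res in results]
    _, coefs = fit_logistic(phi, r)
    return sorted(zip(candidates, coefs), key=lambda p: p[1], reverse=True)

# Toy suite: a failure occurs whenever the first two parameters are both 1.
tests = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0), (1, 1, 1)]
results = ["pass", "pass", "pass", "fail", "fail"]
candidates = [(1, None, 0), (None, 1, 0), (1, 1, None)]
print(froga_rank(tests, results, candidates)[0][0])  # (1, 1, None)
```

In the toy data, the schema (1, 1, -) is included in both failed test cases while the other two candidates appear in only one, so it receives the largest coefficient.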

Simplification
Next, we describe a simplification of the encoding in FROGa. In summary, the row vectors in Φ and R corresponding to the passed test cases can be omitted.
According to the definition of the candidate schema, candidate schemas are not included in any passed test case. Therefore, the row vectors in Φ and R that correspond to the passed test cases are always all-zero vectors. Thus, all those row vectors are identical. In logistic regression, identical duplicate data does not affect the update of the regression equation. This characteristic allows the row vectors corresponding to the passed test cases to be deleted except for one row. Therefore, FROGa can use the simplified Φ′ and R′ instead of Φ and R as follows.
\[ \Phi' = \begin{bmatrix} inc(sc_1, ft_1) & \cdots & inc(sc_m, ft_1) \\ \vdots & \ddots & \vdots \\ inc(sc_1, ft_f) & \cdots & inc(sc_m, ft_f) \\ 0 & \cdots & 0 \end{bmatrix}, \qquad R' = [1, \dots, 1, 0]^T \]
where $F = \{ft_1, \dots, ft_f\}$ is the set of failed test cases, f = |F|, and m is the number of candidate schemas. Φ′ is a data table with f + 1 rows and m columns, and R′ is a column vector with f + 1 dimensions.
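The simplification amounts to a small filtering step. The sketch below assumes that Φ is given as a list of rows (one per test case) and R as a list of 0/1 results, and that at least one test case passed; the function name `simplify` is hypothetical.

```python
def simplify(phi, r):
    """Keep the rows of failed test cases and append one all-zero row
    that stands for all passed test cases."""
    failed_rows = [row for row, res in zip(phi, r) if res == 1]
    width = len(phi[0])
    phi_s = failed_rows + [[0] * width]
    r_s = [1] * len(failed_rows) + [0]
    return phi_s, r_s

# Two passed test cases (all-zero rows) collapse into a single row.
phi = [[0, 0], [0, 0], [1, 0], [1, 1]]
r = [0, 0, 1, 1]
print(simplify(phi, r))
```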
Table 5 Example Φ ′ and R ′ : The simplified Example Φ and R.

Application example of FROGa
For a better understanding of FROGa, we show an example application. Table 4 shows Φ and R created from the example test result (Table 2) and the example candidate schemas (Table 3). In addition, Table 5 shows their simplified forms Φ′ and R′. Table 6 shows the regression coefficients and the ranks of these candidate schemas, obtained by running logistic regression on Φ′ and R′. In this example, we used the logistic regression implemented in scikit-learn [14] with all optional parameters left at their defaults. Both underlined failure-inducing schemas have the highest regression coefficients and are at the top of the ranking. This result shows that FROGa works well on this example.
We set the following research questions to evaluate the efficacy of FROGa.
RQ1. Does FROGa improve the accuracy of suspiciousness calculation in FIL compared to BEN?
RQ2. How much is the difference in the time cost between FROGa and BEN?
In order to answer these research questions, we applied FROGa and BEN to several combinatorial testing results and compared their accuracy and processing time.

Experimental Subjects
We used artificially created combinatorial testing results for the experiments rather than actual reported combinatorial testing results, because we could not find any bug reports containing both the defects detected by combinatorial testing and the test suites used. Furthermore, we believed that using only a few experimental subjects would not yield valid results even if the subjects were real ones. Therefore, we used a large number of artificially generated test results by assuming several defects induced by several failure-inducing schemas in real software systems. We believe that the use of artificially generated test results has only a small impact on validity because real-world failure-inducing schemas are included among the theoretical failure-inducing schemas we used.
Six real SUTs were used in the experiment as combinatorial testing targets. The parameter and constraint sizes of each SUT are given in Table 7. The parameter size of an SUT is expressed as $|P|; g_1^{k_1} g_2^{k_2} \dots g_n^{k_n}$, which means that there are $k_i$ parameters with $g_i$ values and that $|P|$ is the number of parameters. The size of the constraints is expressed as a series of standard forms $l; h_1^{l_1} h_2^{l_2} \dots h_m^{l_m}$, where there are m variables and $h_j$ clauses with $l_j$ literals for each j. The four SUTs SystemMgmt, Storage3, ProcesserComm2, and Healthcare4 are specific versions of IBM product programs. SPINS and SPINV are the simulator and the verifier in SPIN, which is an open-source model checking tool. These SUT models were randomly selected from several models published in Cohen et al. [15] so as to cover a broad range of input sizes.
In addition, we used Pict [16] to design test suites. Pict is a well-known covering array generation tool provided by Microsoft. The inputs of Pict are the SUT model and the value of combinatorial coverage.
The artificial combinatorial testing results were generated through the following three steps.

(1) Test suites generation
We created t-way test suites (t = 2, 3, 4) for each SUT with Pict. As a result, a total of 18 test suites were generated from the combination of six SUTs and three combinatorial coverages. Table 8 shows the number of test cases included in each test suite.

(2) Failure-inducing schemas generation
We randomly determined the failure-inducing schema patterns assumed in each test suite. First, the number of failure-inducing schemas is randomly determined between 1 and 3. Then, the degree of each failure-inducing schema is randomly determined between 2 and t (for t-way tests). This upper limit ensures that every failure-inducing schema is tested.
Next, the parameter values assigned to the failure-inducing schemas are determined randomly. We prepared 10,000 patterns of failure-inducing schemas for each test suite by iterating the above operations without duplication.

(3) Testing results generation
The combinatorial testing results are obtained according to these failure-inducing schemas: the result of a test case that includes even one of the determined failure-inducing schemas is fail, and otherwise it is pass. Finally, 10,000 cases of combinatorial testing results were obtained for each test suite for 2-way and 3-way testing. However, only 1,000 cases were obtained for each test suite for 4-way testing. This limitation was due to the time constraint of the experiment.
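The labeling rule in step (3) can be sketched as follows. The function name `label_results` and the `None`-for-"-" schema encoding are illustrative assumptions; the random selection of schemas in step (2) is not shown.

```python
def includes(test_case, schema):
    """True if every fixed value of the schema appears in the test case."""
    return all(s is None or s == v for s, v in zip(schema, test_case))

def label_results(tests, failure_schemas):
    """A test case fails if it includes at least one failure-inducing schema."""
    return ["fail" if any(includes(t, s) for s in failure_schemas) else "pass"
            for t in tests]

tests = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0), (1, 1, 1)]
print(label_results(tests, [(1, 1, None)]))
```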

Experimental Environment
The algorithms for BEN and FROGa were implemented in Python 3.5.1 and run on a MacBook Pro 2017. To implement logistic regression in FROGa, we used scikit-learn [14], which is an open-source Python library for machine learning, and we left all optional parameters of logistic regression at their default values. The timeout criterion of the FIL process for a single combinatorial testing result is 3,600 seconds. The encoding in FROGa used Φ′ and R′ instead of Φ and R.

Evaluation Metrics
In order to answer RQ1, we used two rank-based accuracy evaluation metrics, MAP and top-k% accuracy. In order to answer RQ2, we measured processing time.
The Mean Average Precision (MAP) is a metric that evaluates the accuracy of a system's ranking ability [17]. MAP takes a value from 0 to 1; the higher the value, the more accurate the ranking. This metric is often used to evaluate an information retrieval system, which is expected to return search results arranged in order of relevance to a given query, such as search keywords. In this experiment, an input combinatorial testing result is treated as a query, and the ranked candidate schemas are treated as search results. To compute the MAP, $AveP_q$ (Average Precision) is first calculated for a single query q in the query set Q by the following formula:
\[ AveP_q = \frac{1}{|R_q|} \sum_{n \in N_q} prec@n_q \]
where $|R_q|$ is the number of search results correctly related to the query q, i.e., the number of failure-inducing schemas, $N_q$ is the set of ranks at which those correctly related results appear, and $prec@n_q$ is the fraction of search results correctly related to q in the top n results. After that, MAP is calculated as the mean of AveP over all queries q ∈ Q as follows:
\[ MAP = \frac{1}{|Q|} \sum_{q \in Q} AveP_q \]
Note that MAP is calculated for the ranking ability of a FIL method, while AveP is calculated for the accuracy of a single ranking result.
Next, top-k% accuracy is a rank-based metric that we define based on top-k accuracy. Top-k accuracy is the success rate at the cutoff rank k over multiple trials, where a trial is successful if all failure-inducing schemas are included in the top k of all ranked candidate schemas. However, we cannot simply use this metric because the number of candidate schemas differs between trials in this experiment. Therefore, we defined and used top-k% accuracy, which uses the top k% instead of the top k. This can be formulated as follows:
\[ \text{top-}k\%\ \text{accuracy} = \frac{1}{|Q|} \sum_{q \in Q} InTopKp(k, q) \]
where InTopKp(k, q) is a function that returns 1 if all relevant results are included in the top k% of the search results for a query q, and 0 otherwise.
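Both metrics are simple to compute from ranked lists. The sketch below assumes each query is a pair of a ranked list and the set of truly failure-inducing schemas; the function names are hypothetical, and rounding the k% cutoff up with `math.ceil` is our assumption, since the text does not state how fractional cutoffs are handled.

```python
import math

def average_precision(ranking, relevant):
    """AveP: mean of prec@n over the ranks n at which a relevant item appears."""
    hits, total = 0, 0.0
    for n, item in enumerate(ranking, start=1):
        if item in relevant:
            hits += 1
            total += hits / n
    return total / len(relevant)

def mean_average_precision(queries):
    """MAP: mean of AveP over all (ranking, relevant) pairs."""
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)

def in_top_k_percent(ranking, relevant, k):
    """1 if all relevant items are within the top k% of the ranking, else 0."""
    cutoff = math.ceil(len(ranking) * k / 100)  # assumed rounding rule
    return int(set(relevant) <= set(ranking[:cutoff]))

def top_k_percent_accuracy(queries, k):
    """Fraction of queries whose relevant items all lie in the top k%."""
    return sum(in_top_k_percent(r, rel, k) for r, rel in queries) / len(queries)

queries = [(["a", "b", "c", "d"], {"a", "c"}), (["x", "y"], {"x"})]
print(mean_average_precision(queries), top_k_percent_accuracy(queries, 50))
```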

Results
In this section, we show the experimental results for evaluating FROGa. For the 4-way tests of Healthcare4 and SPINV, neither BEN nor FROGa completed within the timeout criterion. Therefore, the results of the 4-way tests of Healthcare4 and SPINV are missing values (NA). Although we could not confirm this for all inputs, we confirmed that none of five randomly selected inputs for Healthcare4 and SPINV completed within the timeout criterion. Furthermore, by checking the execution logs, we also confirmed that none of them had finished extracting candidate schemas from all schemas at the time of the timeout.
There is little difference in accuracy between the SUTs, while there is a significant difference between the combinatorial coverages. BEN loses accuracy as the combinatorial coverage increases, but FROGa keeps high accuracy even as the coverage increases.

Results for RQ1
In addition, Table 10 shows a comparison of the percentage of samples that resulted in AveP = 1 out of all samples. "AveP = 1" means that all failure-inducing schemas are at the top of the ranking for an input. The result shows that FROGa is much more capable of ranking accurately than BEN. Therefore, FROGa can often produce a perfect ranking of suspiciousness. This result indicates that FROGa can reliably suggest the most suspicious candidate schemas.
For a more detailed analysis, Table 11 shows the MAPs calculated separately for each number of assumed failure-inducing schemas (#FS). We can see that FROGa is always superior to BEN. Moreover, both BEN and FROGa lose accuracy when several failure-inducing schemas exist. In particular, when there is only one failure-inducing schema, FROGa, unlike BEN, succeeds in placing the failure-inducing schema at the top of the ranking in almost all cases. In addition, there is a significant difference in accuracy in the most challenging situation in this experiment, the 4-way testing results with three failure-inducing schemas: the MAP for BEN is 0.07, whereas the MAP for FROGa is 0.57. As reported in the previous study [7], the quality of FIL decreases with the increase of failure-inducing schemas for the three adaptive methods, and our results confirm this trend.
Fig. 2 Top-k% accuracy of BEN and FROGa
The graph in Figure 2 shows the top-k% accuracy for each SUT. The horizontal axis represents the value of k, and the vertical axis represents the top-k% accuracy. A curve that reaches 1 quickly means that developers can quickly find all failure-inducing schemas by checking the candidate schemas in order from the top of the ranking. Note that the graphs cannot be compared between different combinatorial coverages because the populations of candidate schemas are different. We can observe that the top-k% accuracy of FROGa reaches 1 faster than that of BEN for every SUT. In comparing FROGa and BEN, the higher the combinatorial coverage, the faster the top-k% accuracy reaches 1. This trend is similar to that of MAP. From these results, the answer to RQ1 "Does FROGa improve the accuracy of ranking suspiciousness as a failure-inducing schema compared to BEN?" can be obtained as follows.

Answer to RQ1
Compared to BEN, FROGa can significantly improve the accuracy of ranking the suspiciousness of candidate schemas. In particular, when there is only one failure-inducing schema, FROGa can almost certainly place that schema at the top of the ranking. In addition, although BEN has low accuracy, FROGa keeps high accuracy when targeting results with a large SUT and high combinatorial coverage.

Results for RQ2
Table 12 shows the comparison of the average processing times for BEN and FROGa. Moreover, Fig. 3 shows the comparison of the processing time distributions as boxplots. The upper graph compares each SUT, and the lower graph compares each combinatorial coverage. Note that these graphs do not include the missing values for the 4-way testing of Healthcare4 and SPINV. As a result, we can confirm that there is little difference in time cost between BEN and FROGa. Therefore, we can conclude that FROGa is a better method than BEN overall because FROGa calculates suspiciousness more accurately at almost the same time cost.
From the results, the answer to RQ2 "How much is the difference in the time cost between FROGa and BEN?" can be obtained as follows.

Answer to RQ2
There is little difference in time cost between BEN and FROGa.

5 Further improvement for cost effectiveness

Concept of FROGb
The approaches that treat all candidate schemas, i.e., BEN and FROGa, are thorough and straightforward. However, the step of extracting all candidate schemas from all possible schemas may cause a combinatorial explosion and require a substantial computational cost when targeting extensive testing results. For example, in the experiment in Section 4, neither BEN nor FROGa finished extracting all candidate schemas within 3,600 seconds when targeting the test results of the 4-way tests of Healthcare4 and SPINV. This unrealistic time cost would undermine the efficiency of bug repair, which is the primary purpose of FIL. We believe that combinatorial explosion can be avoided by directly extracting only the most suspicious candidate schemas without dealing with all schemas, and that logistic regression analysis can be used for this purpose.
It is natural to expect that test cases that include sub-schemas of a failure-inducing schema are more likely to include the failure-inducing schema than test cases that do not. Consequently, test cases including sub-schemas of failure-inducing schemas are more likely to fail than test cases that do not. Therefore, the regression coefficients of the sub-schemas of failure-inducing schemas can be high. Conversely, if the regression coefficient of a schema $s_x$ is high, $s_x$ is expected to be a failure-inducing schema or one of its sub-schemas. Since $s_x$ must be a candidate schema to be a failure-inducing schema, if $s_x$ is not a candidate schema but has a high regression coefficient, we can expect $s_x$ to be a sub-schema of a failure-inducing schema.
Therefore, we built the following hypothesis:
Hypothesis
All sub-schemas of failure-inducing schemas have logistic regression coefficients higher than 0.
The logistic regression coefficients in this hypothesis are obtained in the same way as in FROGa. We set the threshold to 0 to obtain the widest sensitivity. A positive regression coefficient means that a unit increase in that independent variable positively impacts the probability of a positive value for the dependent variable. If this hypothesis is correct, only a small number of super-schemas can be built from this limited set of sub-schemas. Therefore, we believe that failure-inducing schemas can be reached efficiently, with a significantly reduced search space, by successively deriving super-schemas that satisfy the inclusion relation, starting from the 1-degree schemas obtained as sub-schemas. Based on this idea, we propose FROGb. FROGb obtains a small number of candidate schemas that are highly suspicious as failure-inducing schemas (hereafter referred to as high-suspicious candidate schemas) without extracting all candidate schemas from all possible schemas.

Model
FROGb needs the same three inputs as FROGa: a test suite, the test result of each test case, and k, the assumed maximum degree of failure-inducing schemas. FROGb runs the following steps for the t-degree schemas, for 1 ≤ t ≤ k in increasing order.

Initial Status
Let t = 1. There are three empty sets: S, C, and SubS. S is the set of schemas that are currently in focus. C is the set of high-suspicious candidate schemas. SubS is the set of schemas that are considered to be sub-schemas of failure-inducing schemas.

Step 1 (t = 1)
Extract all 1-degree schemas in the failed test cases and add them to S.

Step 2
For each schema in S, check whether it is a candidate schema according to the definition. If it is a candidate schema, delete it from S and add it to C.
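Step 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: a schema is represented as a frozenset of (parameter index, value) pairs, and the candidate test assumes the usual definition (the schema occurs in at least one failed test case and in no passed test case). The function names are hypothetical.

```python
def includes(test_case, schema):
    """Return True if the test case contains every (parameter, value) pair of the schema."""
    return all(test_case[p] == v for p, v in schema)


def is_candidate(schema, test_cases, results):
    """Assumed definition: the schema occurs in at least one failed test case
    and in no passed test case (results: 1 = fail, 0 = pass)."""
    in_failed = False
    for test_case, failed in zip(test_cases, results):
        if includes(test_case, schema):
            if not failed:
                return False  # seen in a passing test case, so not a candidate
            in_failed = True
    return in_failed


def split_candidates(S, test_cases, results):
    """Step 2: return (schemas left in S, schemas to be added to C)."""
    found = {s for s in S if is_candidate(s, test_cases, results)}
    return S - found, found
```

Each call to `is_candidate` corresponds to one checking operation, the quantity that the evaluation later counts.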

Step 3
For each schema left in S, calculate the logistic regression coefficient by the same procedure as in FROGa, and add the schemas with positive regression coefficients to SubS.
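Step 3 can be sketched as below. The experiments in this paper use scikit-learn; to keep the example dependency-free, this sketch swaps in a small gradient-descent fit of an L2-penalised logistic model. The function names, the learning rate, and the number of epochs are assumptions chosen for illustration, so the coefficient values will differ from those of a library solver even though the selection rule (coefficient > 0) is the same.

```python
import math


def includes(test_case, schema):
    """Return True if the test case contains every (parameter, value) pair of the schema."""
    return all(test_case[p] == v for p, v in schema)


def fit_logistic(X, y, l2=1.0, lr=0.1, epochs=2000):
    """Batch gradient descent on the L2-penalised logistic loss.
    Returns (intercept, weights); the intercept is not penalised."""
    n, m = len(X), len(X[0])
    b, w = 0.0, [0.0] * m
    for _ in range(epochs):
        grad_b, grad_w = 0.0, [0.0] * m
        for row, target in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, row))
            err = 1.0 / (1.0 + math.exp(-z)) - target  # predicted probability minus label
            grad_b += err
            for j, xj in enumerate(row):
                grad_w[j] += err * xj
        b -= lr * grad_b / n
        w = [wj - lr * (grad_w[j] + l2 * wj) / n for j, wj in enumerate(w)]
    return b, w


def positive_sub_schemas(S, test_cases, results):
    """Step 3: return the schemas in S whose coefficient is greater than 0."""
    schemas = list(S)
    # One row per test case, one 0/1 inclusion column per schema in S.
    X = [[1 if includes(t, s) else 0 for s in schemas] for t in test_cases]
    _, w = fit_logistic(X, results)
    return {s for s, wj in zip(schemas, w) if wj > 0}
```

For instance, with the four test cases (1, 1), (1, 2), (2, 1), (2, 2) where only (1, 1) fails, both 1-degree schemas of the failed test case occur in a passing test case as well, so neither is a candidate, but both receive positive coefficients and are kept in SubS.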

Step 4
Increment t by one and initialize S to be empty. Then, go to Step 1.

Step 1 (t ≥ 2)
Find all t-degree schemas all of whose (t − 1)-degree sub-schemas are included in SubS, and add them to S. Then, initialize SubS to be empty and go to Step 2.
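The growth step can be sketched as follows, again with schemas as frozensets of (parameter index, value) pairs. Following the application example, the sketch also requires each new schema to be included in some failed test case. The function name `grow` and the pruning via `covered` are assumptions for illustration, not a description of the authors' code.

```python
from itertools import combinations


def grow(sub_s, t, failed_tests):
    """Step 1 (t >= 2): return the t-degree schemas that occur in some failed
    test case and whose (t - 1)-degree sub-schemas are all in sub_s."""
    # Only (parameter, value) pairs that appear in some member of SubS can be used.
    covered = set().union(*sub_s)
    result = set()
    for test_case in failed_tests:
        pairs = [pv for pv in enumerate(test_case) if pv in covered]
        for combo in combinations(pairs, t):
            subs = combinations(combo, t - 1)
            if all(frozenset(sub) in sub_s for sub in subs):
                result.add(frozenset(combo))
    return result
```

Because only pairs already present in SubS are combined, the number of schemas examined depends on the size of SubS rather than on the number of all possible t-degree schemas.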
There are two termination conditions for this algorithm:
1. Terminate when the variable t exceeds the given maximum schema degree k, i.e., t > k. To be more precise, there is no need to check whether a k-degree schema is a sub-schema of (k + 1)-degree high-suspicious candidate schemas, so the algorithm terminates immediately after Step 2 (t = k).
2. Terminate when S is empty at the end of Step 1, or SubS is empty at the end of Step 3, despite t ≤ k, because there is no schema to pass to the next step.
FROGb outputs C at the end of the algorithm as the set of high-suspicious candidate schemas of degree k or less.
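The control flow of the steps and the two termination conditions can be sketched as a single loop. The four callables passed in are hypothetical stand-ins for Step 1 (t = 1), Step 2, Step 3, and Step 1 (t ≥ 2); only the order of operations is meant to follow the description above.

```python
def frogb_loop(k, first_schemas, is_candidate, positive_subs, grow):
    """Return C, the set of high-suspicious candidate schemas of degree k or less.

    first_schemas : 1-degree schemas found in failed test cases (Step 1, t = 1)
    is_candidate  : schema -> bool                         (Step 2)
    positive_subs : set of schemas -> SubS                 (Step 3)
    grow          : (SubS, t) -> set of t-degree schemas   (Step 1, t >= 2)
    """
    C = set()
    S = set(first_schemas)
    t = 1
    while S:  # condition 2: stop when there is nothing left to examine
        found = {s for s in S if is_candidate(s)}  # Step 2
        C |= found
        S -= found
        if t == k:  # condition 1: stop right after Step 2 at t = k
            break
        sub_s = positive_subs(S)  # Step 3
        if not sub_s:  # condition 2: nothing to extend
            break
        t += 1  # Step 4
        S = grow(sub_s, t)  # Step 1 (t >= 2)
    return C
```

Passing the steps in as functions keeps the loop independent of how inclusion is checked or how the regression is fitted, which is convenient when comparing different regression settings.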

Application Example of FROGb
For a better understanding of FROGb, we show an application example using the example test results shown in Table 2. Table 13 shows the schemas handled in each iteration, their logistic regression coefficients, and several judgment results in this application example. In this table, "s ∈ S" indicates each schema included in S, "to C" indicates whether the schema was judged to be a candidate schema, and "to SubS" indicates whether it was judged to be a sub-schema.
As input, we give the example test suite and the result shown in Table 2, and the maximum degree of schema, k = 3.

t = 2
FROGb extracts all the 2-degree schemas that are included in some failed test case and whose 1-degree sub-schemas are all included in SubS, and adds them to the empty S. As a result, the nine 2-degree schemas shown in the t = 2 rows of Table 13 are added to S. For example, the 2-degree schema (1, 2, -, -, -) is included in (1, 2, 1, 1, 1), which is the failed test case #8 in Table 2, and both of its 1-degree sub-schemas, (1, -, -, -, -) and (-, 2, -, -, -), are included in SubS. Therefore, it is added to S. SubS is then emptied. Next, FROGb checks whether each s ∈ S is a candidate schema. Only (1, -, -, 2, -) is a candidate schema, so it is removed from S and added to C. FROGb then computes the regression coefficients for the schemas left in S, and six 2-degree schemas are added to SubS. Finally, S is emptied.

t = 3
FROGb extracts all the 3-degree schemas that are included in some failed test case and whose 2-degree sub-schemas are all included in SubS, and adds them to the empty S. There are only two such schemas, (-, 2, 1, 2, -) and (-, 2, 1, -, 1). Only (-, 2, 1, -, 1) is judged to be a candidate schema, so it is removed from S and added to C. Here, FROGb finishes because the first termination condition is satisfied.
Evaluation of FROGb

Research Questions
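In order to evaluate FROGb, we set the following research questions.
RQ3. How much does FROGb reduce the cost of extracting high-suspicious candidate schemas compared to BEN and FROGa?
RQ4. How accurately does FROGb extract failure-inducing schemas as high-suspicious candidate schemas?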

Setting
We reused the results in Section 4.5 to compare FROGb with BEN and FROGa. Therefore, the experimental subjects are the same as those described in Section 4.2, and the experimental environment is the same as that described in Section 4.3. We newly implemented and executed FROGb, measured several metrics, and compared them with the results already obtained for BEN and FROGa.
To answer RQ3, we measured the processing time until FROGb finishes. The purpose of FROGb, extracting high-suspicious candidate schemas, can also be achieved by picking some candidate schemas from the top of the suspiciousness ranking created by BEN or FROGa. Therefore, the measured processing time of FROGb can be compared directly with the processing times of BEN and FROGa. In addition, we counted the number of high-suspicious candidate schemas extracted by FROGb in order to check whether FROGb extracts only a few high-suspicious candidate schemas.
To answer RQ4, we investigated the number of the artificially generated failure-inducing schemas included in the high-suspicious candidate schemas extracted by FROGb.

Results for RQ3
Table 14 shows the average processing time of BEN, FROGa, and FROGb, and two indexes for comparison. The index %red represents the average reduction rate of the processing time achieved by FROGb, and the index %short represents the percentage of cases in which FROGb saved time. Both indexes compare FROGb with the shorter of the processing times of BEN and FROGa. In addition, Figure 4 shows a boxplot comparing the distributions of processing time. The results show that the reduction in processing time depends on the combinatorial coverage. First, FROGb achieved only a slight time reduction when targeting 2-way tests. In particular, for the 2-way tests of Healthcare4 and SPINV, the average time reduction rate was less than zero, and the processing time increased in more than half of the cases. The likely reason is that the processing overhead of FROGb exceeded the time it saved. Next, for the 3-way tests, the time reduction rate was about 50%, and the processing time was reduced in about 96% of all cases. For the 4-way tests, the time reduction was even more pronounced. In particular, for the 4-way tests of Healthcare4 and SPINV, FROGb completed extracting high-suspicious candidate schemas in a few seconds. This is a remarkable improvement considering that BEN and FROGa could not complete the process within 3,600 seconds.
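The two indexes can be sketched as below. The exact formulas are not spelled out in the text, so this sketch assumes that %red is the mean of the per-case reduction rates and that the baseline of each case is the shorter of the BEN and FROGa times; the function and argument names are hypothetical.

```python
def reduction_indexes(ben_times, froga_times, frogb_times):
    """Return (%red, %short) for paired per-case processing times in seconds."""
    rates = []
    shorter = 0
    for ben, froga, frogb in zip(ben_times, froga_times, frogb_times):
        baseline = min(ben, froga)  # the faster all-extraction method
        rates.append((baseline - frogb) / baseline * 100.0)
        if frogb < baseline:
            shorter += 1
    n = len(rates)
    return sum(rates) / n, shorter / n * 100.0
```

With the times [10, 4] for BEN, [8, 5] for FROGa, and [4, 6] for FROGb, the per-case rates are 50% and -50%, so %red is 0 while %short is 50%. This shows how an average reduction rate at or below zero can coexist with individual cases that did save time.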
For simplicity, we refer to BEN and FROGa collectively as the all-extraction method (All-ex) because, unlike FROGb, these methods first extract all candidate schemas. The left side of Table 15 shows the average number of candidate schemas extracted by the all-extraction method, the average number of high-suspicious candidate schemas extracted by FROGb, and the reduction rates. The right side of Table 15 (#Check row) shows the average number of operations that check whether a schema is a candidate schema, for the all-extraction method and FROGb, and the reduction rates. Furthermore, the upper part of Figure 6.3.1 compares the distributions of the number of extracted candidate schemas, and the lower part compares the distributions of the number of checking operations. The high-suspicious candidate schemas extracted by FROGb are fewer than the candidate schemas extracted in the first step of the all-extraction method. For the t-way tests, the reduction rates are 50.6% (t = 2), 91.5% (t = 3), and 99.0% (t = 4). The higher the combinatorial coverage, the fewer high-suspicious candidate schemas were output by FROGb. On the other hand, there was no notable difference among SUTs.
The number of checking operations in FROGb decreased as the combinatorial coverage increased, which is the opposite of the trend for the all-extraction method. The reason is that the higher the maximum number of iterations k of the FROGb algorithm, the tighter the constraints on the sub-schemas that an output candidate schema must satisfy. For most SUTs with k = 4, the average number of high-suspicious candidate schemas was around five. This result shows that FROGb can output only high-suspicious candidate schemas with few checking operations.
From the results, the answer to RQ3 "How much does FROGb reduce the cost of extracting high-suspicious candidate schemas compared to BEN and FROGa?" can be obtained as follows.
Answer to RQ3: FROGb can extract only a tiny number of high-suspicious candidate schemas and significantly reduce the processing time compared to BEN and FROGa, especially when targeting testing results with high combinatorial coverage.

Results for RQ4
Table 16 shows whether the failure-inducing schemas were included in the high-suspicious candidate schemas extracted by FROGb. The results are aggregated into three categories: All, Partly, and No. They show that FROGb did not always extract all failure-inducing schemas. On average, all failure-inducing schemas were extracted in 63.3% of the cases, only some of them were extracted in 32.2% of the cases, and none were extracted in only 4.4% of the cases. There was little difference depending on the combinatorial coverage. On the other hand, the larger the SUT, the higher the percentage of cases in which all failure-inducing schemas were extracted as high-suspicious candidate schemas. Next, Table 17 shows the percentage of cases in which all failure-inducing schemas were included in the high-suspicious candidate schemas extracted by FROGb, aggregated by the number of failure-inducing schemas. The fewer the failure-inducing schemas, the higher the percentage. For example, for the 2-way testing results of SPINV with only one failure-inducing schema, FROGb extracted all failure-inducing schemas with very high accuracy, 99.9%. However, for the results with three failure-inducing schemas, the accuracy drops sharply to 68.7%. This suggests that multiple failure-inducing schemas interfere with the extraction of each other. In addition, this difference in accuracy is amplified by the size of the SUT. For example, the accuracies of extracting all failure-inducing schemas are 50-70% for HC4 and SPINV, which have many input parameters. On the other hand, the accuracies are only 9-27% for SM, which has the fewest input parameters.
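The three categories can be sketched as a small classification over the output of one run. The function names are hypothetical, and schemas may be any hashable objects.

```python
from collections import Counter


def classify(extracted, failure_inducing):
    """Return "All", "Partly", or "No" depending on how many of the injected
    failure-inducing schemas appear among the extracted schemas."""
    truth = set(failure_inducing)
    hits = len(truth & set(extracted))
    if hits == len(truth):
        return "All"
    return "Partly" if hits > 0 else "No"


def category_rates(cases):
    """cases: list of (extracted, failure_inducing) pairs; returns percentages."""
    counts = Counter(classify(ext, truth) for ext, truth in cases)
    return {name: 100.0 * counts[name] / len(cases) for name in ("All", "Partly", "No")}
```

The share of runs in which at least one failure-inducing schema was extracted is then the sum of the "All" and "Partly" rates.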
From the results, the answer to RQ4, "How accurately does FROGb extract failure-inducing schemas as high-suspicious candidate schemas?", can be obtained as follows.

Answer to RQ4: FROGb can extract all failure-inducing schemas in 63.3% of all cases and at least one of them in 95.6% of all cases. This accuracy tends to increase as the size of the SUT increases and the number of failure-inducing schemas decreases.

Overall Evaluation
Contrary to our hypothesis, the answer to RQ4 reveals that the sub-schemas of failure-inducing schemas do not always have positive logistic regression coefficients. Therefore, FROGb does not necessarily produce accurate results. However, in our experiments, all failure-inducing schemas were extracted as high-suspicious candidate schemas in about 63.3% of the cases. This accuracy is not low; rather, it is a reasonable tradeoff for reducing the processing time for targets on which BEN and FROGa take an unrealistic processing time.
In addition, identifying even a single failure-inducing schema can be valuable. For example, if multiple failure-inducing schemas stem from the same defect, identifying one of the schemas will help to locate that defect.
Moreover, consider the case where some defects can be repaired by identifying some of the failure-inducing schemas. It may then be possible to identify all defects step by step, by rerunning combinatorial testing and obtaining different test results. From this viewpoint, FROGb obtained useful localization results in 95.6% of the cases in our experiments.
Based on these considerations, we conclude that FROGb is a fast and efficient, but approximate, FIL approach.

Causes of Failure of FROGb
We identified two causes of failure by checking several processing logs where FROGb failed to extract all the failure-inducing schemas.
The first cause is the collision of multiple failure-inducing schemas. This collision refers to the situation where there are multiple failure-inducing schemas, and the 1-degree schemas corresponding to all possible values of a parameter are included in different failure-inducing schemas. FROGb expects the regression coefficients of all the 1-degree sub-schemas of a failure-inducing schema to be positive. However, because regression coefficients are relative to each other, only some of these 1-degree schemas can have positive values. Therefore, FROGb cannot extract all the correct 1-degree schemas as sub-schemas of the high-suspicious candidate schemas, and can successfully obtain only some of the failure-inducing schemas. For example, consider p_n, the n-th parameter of the SUT, which can take two different values, p_n1 and p_n2. In this case, the regression coefficient of either p_n1 or p_n2 will always be greater than zero, and the other will be less than zero.
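One way to see why the two coefficients take opposite signs is the following sketch. It rests on assumptions that go beyond the text: both 1-degree schemas of p_n are regressors in the same model, and the fit uses an L2 penalty on the coefficients but not on the intercept. The symbols are notation introduced here for illustration.

```latex
% x_{j1}, x_{j2}: indicators that test case j contains p_{n1}, p_{n2};
% every test case assigns exactly one value to p_n, so x_{j1} + x_{j2} = 1
\[
  (\beta_0 - c) + (\beta_1 + c)\,x_{j1} + (\beta_2 + c)\,x_{j2}
    = \beta_0 + \beta_1 x_{j1} + \beta_2 x_{j2} \quad \text{for every } c,
\]
% so the likelihood does not depend on c, and only the penalty does:
\[
  \frac{d}{dc}\bigl[(\beta_1 + c)^2 + (\beta_2 + c)^2\bigr] = 0
  \;\Longrightarrow\; c = -\frac{\beta_1 + \beta_2}{2}
  \;\Longrightarrow\; \hat\beta_1 + \hat\beta_2 = 0 .
\]
```

Under these assumptions the fitted coefficients of p_n1 and p_n2 sum to zero, so unless both are exactly zero, one is positive and the other is negative, regardless of whether both values belong to failure-inducing schemas.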
In our post-hoc analysis, the collision occurred in 12,319 of the 44,368 cases where FROGb could not extract all failure-inducing schemas. In addition, all cases involving multiple failure-inducing schemas with the collision failed. This cause of failure explains why, in our experiments, the accuracy of extracting all failure-inducing schemas was lower for smaller SUTs and for larger numbers of failure-inducing schemas. Since the assignment of failure-inducing schemas is determined randomly in our experiment, it is natural that the possibility of the collision increases as the number of failure-inducing schemas increases. In addition, when the SUT has many input parameters that can take various values, the possibility of the collision decreases.
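A check for the collision can be sketched as follows. It encodes one reading of the definition above: some parameter with at least two values has every one of its values fixed by some failure-inducing schema. The argument `domains`, a mapping from parameter index to the set of its possible values, and the function name are hypothetical.

```python
def has_collision(failure_inducing, domains):
    """Return True if, for some parameter, every possible value is fixed by
    at least one failure-inducing schema.

    failure_inducing : iterable of frozensets of (parameter index, value) pairs
    domains          : dict mapping parameter index -> set of possible values
    """
    schemas = list(failure_inducing)
    for param, values in domains.items():
        fixed = {v for schema in schemas for p, v in schema if p == param}
        # A schema fixes at most one value per parameter, so covering two or
        # more values necessarily involves different failure-inducing schemas.
        if len(values) > 1 and fixed == set(values):
            return True
    return False
```

A parameter with few values is covered more easily, which is consistent with the observation that collisions become less likely for SUTs whose parameters can take various values.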
The second cause is mere chance, i.e., which input values happened to be assigned to each test case of the designed test suite. We have confirmed the following two cases of accidental identification failures. Both cases assume that there are multiple failure-inducing schemas, that p_n, the n-th parameter of the SUT, can take two different values, p_n1 and p_n2, and that one failure-inducing schema, f_1, contains p_n1 as a sub-schema.
1. All the test cases that fail due to failure-inducing schemas other than f_1 happen to contain p_n2. In this case, the value of the parameter p_n appears to be irrelevant to the test failure. Therefore, p_n1 is not estimated as a sub-schema of f_1, and the wrong result is output.
2. Test cases containing p_n2 happen to fail, due to other failure-inducing schemas, more often than test cases containing p_n1. In this case, the regression coefficient for p_n1 is lower than zero. Therefore, p_n1 is not estimated as a sub-schema of f_1, and the wrong result is output.
Since we could not find any other causes of failure, all 32,049 failures that the first cause cannot explain must be due to the second cause.

Limitation of FROGb
FROGb should not be applied to combinatorial testing results in which only one test case failed. In this case, FROGb's algorithm always predicts all the 1-degree schemas in the failed test case to be sub-schemas of the failure-inducing schema. Therefore, all the candidate schemas are extracted as high-suspicious ones, and the result no longer provides helpful information. In such cases, we recommend using an alternative FIL method based on modifying and re-running a single failed test case, such as OFOT [3].
As we saw, FROGb does not obtain all candidate schemas. Therefore, we cannot establish that an obtained high-suspicious candidate schema is indeed a failure-inducing schema by running a test case that contains no other candidate schema and checking that it fails. However, if necessary, it is possible to increase confidence by running several random test cases that include the high-suspicious candidate schemas, as in the checking mechanism used in ICT [7].

Threats to validity
We used Python to implement BEN, FROGa, and FROGb. Python is slower than compiled languages, and our implementation and environment may not be optimized. Therefore, the processing time of each method measured in our experiments may be longer than the ideal time. However, since all methods are affected in the same way, we consider the comparison results to be valid.
We used only one machine learning package, scikit-learn, with its default parameter values to implement the logistic regression, because we had no significant problems with the behavior and results of FROGa and FROGb in this setting. On the other hand, we have not yet tested the behavior with other packages and parameters. A more detailed investigation is required to see how changing and optimizing the package and parameters would change the experimental results and improve the accuracy.
In its original usage, BEN updates the suspiciousness of candidate schemas by designing and running additional test cases. In contrast, our experiments did not include such an update process and compared only the suspiciousness ranking obtained first. The reason is to simplify the experiment and to emphasize that FROGa exhibits high accuracy without additional testing. Therefore, BEN may achieve higher ranking accuracy with some additional testing. Nevertheless, FROGa has the advantage of achieving high accuracy without additional testing.

Related work
Several FIL approaches have been proposed. These approaches are categorized into adaptive and non-adaptive ones according to whether they involve the adaptive generation and execution of additional test cases. A recent survey of FIL is given by Jayaram et al. [18].

Adaptive FIL approaches
Adaptive approaches mainly identify failure-inducing schemas by designing and executing new test cases based on the execution results of earlier test cases [3,6,19-21]. Some methods take a single failed test case as input, generate and execute additional test cases by changing the value of one of the parameters in the test case, and identify the failure-inducing schema in the original test case based on the changes in the execution results. Many of these methods are known to work only when there is exactly one failure-inducing schema. For example, OFOT [3] and the methods using Delta Debugging [6,19,20] do not give correct results in many cases unless there is exactly one failure-inducing schema. To address this problem, Zhang et al. [21] proposed FIC, which partially supports the localization of multiple failure-inducing schemas. In addition, Niu et al. [7] proposed ICT, which performs localization more effectively by interleaving the design and execution of single test cases with input value localization. ICT further checks whether the identified failure-inducing schemas lead to test failures by using a checking mechanism that extends the OFOT method, preventing the false identifications that occur when there are multiple failure-inducing schemas. These methods assume a cycle that (1) interrupts the sequential execution of tests when a failure occurs, (2) identifies the failure-inducing schemas, (3) corrects the defective components using the identification result, and (4) tests again.
On the other hand, there are also methods that execute all the test cases of a normal covering array and then perform input value localization based on all the execution results. Owing to the recent development of automated testing techniques such as continuous integration, obtaining the test results of the entire covering array has become realistic enough for this to be considered a promising approach. As the most primitive method, Yilmaz et al. [22] proposed estimating the failure-inducing schema using classification trees. While this estimation does not require additional test cases, it is not guaranteed to be accurate. The effectiveness of this approach is also known to depend highly on the characteristics of the covering array; for example, it does not work well if the majority of a covering array consisting of a small number of test cases fails. Fouche and Shakya et al. [23,24] proposed extensions of the work of Yilmaz et al. with some improvements. Yilmaz et al. [25] also devised a framework to feed estimation results back to test case generation.
Other approaches first extract all candidate schemas that may be failure-inducing and then generate additional test cases to verify whether they are indeed failure-inducing schemas [4,5,19,26-30]. AIFL by Shi et al. [26] attempts to find a single failure-inducing schema, and InterAIFL, extended by Wang et al. [19], can find multiple failure-inducing schemas. ComFIL, proposed by Zheng et al. [27], can find multiple failure-inducing schemas by elimination and generates test cases from all candidate schemas so as to eliminate the most candidates in a single test. In addition, Niu et al. [28] attempt to optimize the design of additional test cases by constructing a tuple relationship tree that describes the relationships among the candidate schemas. BEN, proposed by Ghandehari et al. [30][31], also performs efficient input value localization by ranking the candidate schemas based on the calculated suspiciousness values [5,29,30]. Several studies regard BEN as a strong candidate among FIL methods [31,32].
The proposed FROGa is an extension of the existing adaptive method BEN and thus belongs to the adaptive methods. On the other hand, FROGb is a predictive method, so this classification is not necessarily applicable to it. The idea of FROGb is based on our earlier work [33]. In this paper, however, we give a stricter and more detailed model definition, use a new variety of experimental subjects in the evaluation experiments, and provide new discussion.

Non-Adaptive FIL approaches
In contrast to the adaptive approach, the non-adaptive approach does not require the creation and execution of additional test cases. Non-adaptive methods mainly generate a Locating Array, a covering array designed to have a particular input value localization capability. For example, Colbourn and McClary [34] proposed the (d, t)-Locating Array. A (d, t)-Locating Array can uniquely identify d failure-inducing t-degree schemas or less by using a covering array covering all (d + t)-degree schemas. As a similar mathematical object, Martínez et al. [35] proposed the Error Locating Array, which is based on assuming the structure of the input parameters related to the failure-inducing schemas. In addition, several developments of the Locating Array have been proposed. Hagar et al. [36] proposed a partial covering array method that uses the known safe input values of the target software in the design. Nagamoto et al. [37] focused on pairwise testing and proposed a method to effectively generate a (1, 2)-Locating Array from a given test suite. Jin et al. [38] proposed the Constrained Locating Array, an extension of the Locating Array that deals with constrained input models. A recent survey of Locating Array research and its applications is given by Colbourn et al. [39].
The advantage of non-adaptive methods is that the design and the execution of the test suite can be wholly separated. On the other hand, they have the following disadvantages. The developers need to know the number and the degree of failure-inducing schemas to design a Locating Array. Otherwise, the developers must assume these numbers, and localization succeeds only if the assumptions are correct. In addition, the number of test cases in a Locating Array is much larger than in a plain covering array, and thus the execution cost is much higher. These limitations make the methods unattractive in practical use, and they have not been fundamentally resolved so far.

Conclusion
This paper proposed two novel faulty interaction localization approaches using logistic regression analysis: FROGa and FROGb. FROGa improves the accuracy of computing the suspiciousness of combinations of parameter values by using logistic regression. FROGb avoids a combinatorial explosion by estimating the subsets of failure-inducing combinations and exploring only their supersets. The estimation of subsets also uses the logistic regression coefficients. Through evaluation experiments using a large number of artificial test results based on several real systems, we observed that FROGa has very high accuracy and that FROGb can drastically reduce the computing cost for targets on which the conventional method has had difficulty completing the identification.
Our research leaves several challenges. One of them is to improve the accuracy by using various logistic regression implementation methods. Another is to quantitatively evaluate the accuracy improvement and cost increase by using

Fig. 1 Comparison of AveP distribution of BEN and FROGa for each SUT (top) and combinatorial coverage (bottom).

Fig. 3 Comparison of processing time distribution of BEN and FROGa for each SUT (top) and combinatorial coverage (bottom).

Fig. 4 Comparison of processing time distribution of BEN, FROGa and FROGb

Table 1
Example SUT: An example SUT model.

Table 2
Example Test Result: An example of 3-way test suite and its testing result of Example SUT.

Table 3
Example Candidate Schemas: The candidate schemas extracted from the Example Test Result.
$$\Phi = \begin{pmatrix} \mathrm{inc}(sc_1, st_1) & \cdots & \mathrm{inc}(sc_m, st_1) \\ \vdots & \ddots & \vdots \\ \mathrm{inc}(sc_1, st_n) & \cdots & \mathrm{inc}(sc_m, st_n) \end{pmatrix}$$
where m is the number of extracted candidate schemas, i.e., |S_k|, and n is the number of test cases in the test suite. The columns of Φ correspond to the candidate schemas and the rows correspond to the test cases, so Φ encodes the inclusion relations between them.

Table 4
Example Φ and R: Φ represents the inclusion relationship of each Example Candidate Schema (s_1 to s_11) in each test case (1 = included, 0 = not included), and R represents the test result of each test case (1 = fail, 0 = pass), both created from the Example Test Result.

Table 8
The number of test cases in each t-way test suite generated by PICT

Table 9
MAP of BEN and FROGa

Table 9 shows the MAP of BEN and FROGa for each test suite. The mean value of MAP for FROGa over the 16 test suites, excluding missing values, is 0.76, and the mean value of MAP for BEN is 0.31. Thus, the MAP of FROGa is well above that of BEN. Moreover, Fig. 1 shows a comparison of the distribution of the AveP as a box plot. The upper graph compares each SUT, and the lower graph compares each combinatorial coverage. Note that these graphs do not include missing values; the 4-way tests of Healthcare4 and

Table 12
The average of processing time (s)

Table 13
The handled schemas and their various judgment results in each iteration in FROGb with Example Test Result

Table 14
The average of processing time, the reduction rate (%red), and the shortened cases rate (%short)

Table 15
The average number of extracted candidate schemas (#Candidate) and checking operations to extract candidate schemas (#Check) for each of All-ex (i.e., BEN and FROGa) and FROGb, and the reduction rates of FROGb against All-ex (%red)

Table 16
The classification rates of how many failure-inducing schemas were included in the high-suspicious candidate schemas extracted by FROGb.

Table 17
The percentages of cases in which all n failure-inducing schemas were included in the high-suspicious candidate schemas extracted by FROGb, for each n.