1 Introduction

Model-based testing (MBT) (Utting and Legeard 2007) is an emerging field that has attracted interest from both academia and industry in recent years. It provides the benefit of automatic test case generation from abstract models that capture, for instance, software requirements. Despite the fact that automation is critical to the practice of MBT, test case generation for industrial-size applications can often produce large test suites that may not be cost-effective. The reason is that, most of the time, automatic generation algorithms are based on a structural and systematic search for test cases constrained by test criteria. With the goal of improving the effectiveness of the suite by achieving coverage, algorithms may generate several similar test cases, depending on the model structure. In order to handle this problem, the testing team can perform additional test selection before test execution. However, test selection may profoundly impact the success of the testing process as a whole: important test cases such as the ones that uncover faults may not be selected (Pezzè and Young 2007). Therefore, there has been extensive research and practical interest in the automated test case selection problem for MBT (Anand et al. 2013).

Test suite reduction aims to produce a representative subset of the original test suite that satisfies a set of test requirements with the same coverage as the original test suite (Harrold et al. 1993; Chen and Lau 1998a). The idea is to have in the subset the most representative test cases, chosen according to the capability of either covering more test requirements or uniquely covering one or more requirements. For instance, four well-known heuristics for code-based test suite reduction follow these ideas: Greedy (Chvátal 1979; Cormen et al. 2001), GE (Chen and Lau 1998b), GRE (Chen and Lau 1998a), and HGS (Harrold et al. 1993). Empirical studies have shown that requirements-based reduction may be effective at reducing the size of the suite, but it may also reduce the capability of fault detection (Fraser and Wotawa 2007; Yoo and Harman 2012). To address this problem, other approaches in the literature classify test cases according to a degree of similarity measured by a distance function (da Silva Simao et al. 2006; Kovács et al. 2009; Bertolino et al. 2010; Coutinho et al. 2013). Empirical studies on test case selection based on similarity have shown that test case diversity may improve the rate of fault detection (Chen et al. 2010; Hemmati et al. 2013; Cartaxo et al. 2011).

Intuitively, the choice of a distance function may directly influence the performance of test reduction strategies. For instance, the function can tune a technique to the extent that it becomes capable of revealing differences that speed up the achievement of coverage while, at the same time, diversifying the choice of test cases to improve fault coverage (FC). Another important issue is that, since reduction strategies often face draws and handle them by random selection, distance functions may also influence the stability of the technique, that is, how much the selected test cases and the achieved fault coverage vary across subsequent runs of the technique.

Applications of distance functions spread across different contexts such as medicine (Felipe et al. 2003), speech recognition (Thakur and Sahayam 2013), and image recognition (Felipe et al. 2006). Moreover, there are many distance functions proposed in the literature, usually applied to specific applications or contexts where they are recognized as more effective (Akleman and Chen 1999). For instance, the use of distance functions and equivalence relations is the basis of several fault localization strategies (Renieres and Reiss 2003; Xie et al. 2013).

More specifically, in the context of software testing, efforts have already been made to compare distance functions for both test case selection (Hemmati et al. 2013) and prioritization (Ledru et al. 2009). On the one hand, empirical studies have already shown that the choice of the function may influence fault detection capability for the general test selection and test case prioritization problems (Yoo and Harman 2012; Hemmati et al. 2013). Particularly, Hemmati et al. (2013) present a study on test selection strategies based on similarity where they consider the choice of different distance functions combined with other parameters to decide on the best strategy for test case selection. Among results on 320 variants applied to two industrial case studies, top candidates emerge, even though the differences found are minor. Generally, studies point to the need for more investigation. On the other hand, to the best of our knowledge, there are no studies comparing the effectiveness of distance functions applied to test suite reduction strategies for MBT. Different from test selection strategies, where the tester may decide on the number of test cases to select, test suite reduction strategies rely on requirements coverage. In this sense, the choice of a distance function may influence the size of the reduced suite, as it may or may not optimize coverage.

The goal of this work was to investigate the effectiveness of distance functions for test suite reduction in the context of MBT. For this, we apply a similarity strategy for test suite reduction proposed by Coutinho et al. (2013) by considering six distance functions: Similarity function, Levenshtein distance, Sellers algorithm, Jaccard index, Jaro distance, and Jaro–Winkler distance. We evaluate effectiveness by comparing the rates of test suite size reduction (SSR) and fault coverage (FC). Moreover, we observe the stability of the technique when considering the different functions according to the different subsets of test cases and faults. We focus on system-level testing and specifications modelled as labelled transition systems (LTS). LTS are largely considered by research and practice of MBT, including fundamental background, techniques, and tools (Anand et al. 2013). This paper presents three empirical studies. The first two are controlled experiments focusing on two real-world applications with real faults and 10 synthetic specification models automatically generated from the configuration of each application, such as the number of forks, transitions of forks, transitions of joins, joins, paths with loop, and depth. We generate synthetic models based on the strategy presented by Oliveira et al. (2013) and randomly define sets of faults for each generated model according to the percentage of faults obtained from the corresponding real-world specification. Test cases are sequences of transitions generated from each abstract specification by using a depth-search-based algorithm with all-one-loop-paths coverage as stop criterion, a common criterion applied in MBT (Utting and Legeard 2007; Cartaxo et al. 2008; Sapna and Mohanty 2009). As test requirement for the reduction strategy, we choose the all-transition-pairs criterion (Utting and Legeard 2007). This criterion is satisfied if all pairs of adjacent transitions in the specification are traversed at least once (Utting and Legeard 2007). By using test cases selected according to this criterion, all interactions between adjacent transitions can be tested, even if the reduction strategy discards a number of test cases from the original generated suite. Furthermore, in the third study, we apply the reduction strategy to two versions of a real-world industrial application with real faults collected from manual execution of test cases. Although we apply the same procedure of the first two studies for generating the application model and test cases as well as the same reduction strategy, this is a low-control study with no synthetic models, whose goal is further investigation in the context of an industrial application of MBT. Results show that the choice of the distance function has little influence on the size of the reduced test suite. However, the choice can significantly affect FC and stability.

In summary, the main contributions of this paper are as follows: (1) we investigate the impact of the choice of a distance function in the scope of a similarity-based reduction strategy for MBT; (2) we consider SSR, FC, and stability in controlled experiments with statistical analysis; and (3) we observe these measurements in the scope of a real application under development.

This paper is structured as follows. Section 2 presents basic concepts and Sect. 3 presents the distance functions considered in this work. Section 4 presents the reduction strategy applied to investigate the distance functions. Section 5 presents a description of the goals and planning of the first empirical studies. Section 6 presents the results and analysis through the metrics collected and their statistical validity, particularly explaining them in terms of observations of the test suites and the faults revealed. Section 7 presents a case study that investigates the effectiveness of the functions in the context of a real-world application with two versions. Section 8 discusses related work. Finally, Sect. 9 presents some conclusions and pointers for further research.

2 Background

2.1 Labelled transition system (LTS)

MBT is a black box testing approach based on the automatic generation of test cases from behavioral specifications (Utting and Legeard 2007). In this work, we focus on Labelled Transition Systems (LTSs)—a common formalism considered by both fundamental and practical research on MBT that is also usually adopted as the semantics formalism of specification notations (Tretmans 2008; Anand et al. 2013).

According to Tretmans (2008), an LTS can be formally defined as a 4-tuple \(\langle S, L, T, s_0 \rangle\), where

  • \(S\) is a finite, nonempty set of states;

  • \(L\) is a finite, nonempty set of labels of transitions;

  • \(T\) is a subset of \(S \times L \times S\) (set of triples), called the transition relation;

  • \(s_0\) is the initial state, where \(s_0 \in S\).
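As an illustration of how such a model can be represented for test generation purposes, the following Java sketch encodes an LTS as the 4-tuple above. The class and field names are ours, chosen only for illustration; they do not correspond to the LTS-BT implementation.

```java
import java.util.*;

/** Minimal sketch of an LTS <S, L, T, s0>; names are ours, not from LTS-BT. */
public class Lts {
    record Transition(int source, String label, int target) { }  // element of T ⊆ S × L × S

    final Set<Integer> states = new HashSet<>();            // S: finite, nonempty set of states
    final Set<String> labels = new HashSet<>();              // L: finite, nonempty set of labels
    final List<Transition> transitions = new ArrayList<>();  // T: transition relation
    final int initialState;                                  // s0 ∈ S

    Lts(int initialState) {
        this.initialState = initialState;
        states.add(initialState);
    }

    void addTransition(int source, String label, int target) {
        states.add(source);
        states.add(target);
        labels.add(label);
        transitions.add(new Transition(source, label, target));
    }
}
```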

Figure 1a presents an example of an LTS that combines basic, alternate, and exception flows of a use case. The use case defines the behavior of a user account editing operation where (1) we can change user name and password and (2) we can delete a user account. As a usual convention, labels ending with “?” denote actor input actions, whereas labels ending with “!” denote system output actions. We consider this example to illustrate concepts throughout the paper. However, for the sake of simplicity, we replace transition labels by letters (Fig. 1b). Figure 1c shows a test suite generated from the LTS based on the depth search test case generation algorithm, proposed by Araújo et al. (2012), with all-one-loop-paths as stop criterion.

Fig. 1

An example of an LTS specification and a test suite generated from it

In an LTS, a path is a finite or infinite sequence of transitions from the initial state. In this work, a test case is defined as a path. Paths can be classified as: (1) simple path, path without repeated states or transitions (\(\langle d, e, c \rangle\) from Fig. 1); (2) path with loop, path in which one or more states or transitions may be repeated, producing cycles (for example, \(\langle a, b, f, g, h, i \rangle\), \(\langle a, b, f, g, e \rangle\), \(\langle d, e, f, g \rangle\) and \(\langle d, h, i, g \rangle\) from Fig. 1). The depth of an LTS is calculated by considering the longest simple path. In the example presented in Fig. 1, the depth of the LTS is 5 defined by the path \(\langle a, b, f, g, h \rangle\).

Two kinds of special states can be identified in an LTS: (1) join is a state with more than one incoming transition (the example in Fig. 1 contains three joins: states 2, 3, and 5); (2) fork is a state with more than one outgoing transition (the example in Fig. 1 contains three forks: states 0, 2, and 3). Finally, we can define the transitions of joins and transitions of forks measures as the total number of incoming transitions of joins and outgoing transitions of forks of an LTS, respectively. The LTS from Fig. 1 has six transitions of joins and six transitions of forks.

It is important to remark that, in this paper, we consider models as abstraction of software requirements devoted for test case generation. We do not require models to be executable. The tester can choose between automated or manual execution of test cases.

2.2 Test suite reduction

According to Harrold et al. (1993), the test suite reduction problem can be defined as follows:

Given A test suite \(TS\), a set \(Req=\{Req_1, Req_2,\ldots , Req_n\}\) of test requirements to be covered, and subsets of \(TS\): \(TS_1, TS_2, \ldots , TS_n\), where each test case of \(TS_i\) can be used to test \(Req_i\);

Problem Find a minimal subset—the reduced set—\(RS \subseteq TS\) that satisfies all of the Req’s, that is, \(RS\) must have at least one test case for each \(Req_i\).

In general, finding \(RS\) is an NP-complete problem: the minimum set-covering problem, which is NP-complete, can be reduced to it (Cormen et al. 2001). Therefore, heuristics and approximations are often applied to compute \(RS\), such as the ones presented by Chen and Lau (1998b).

In order to apply a reduction strategy, it is necessary to define a satisfiability relationship between \(TS\) and \(Req\), relating each \(Req_i\) to the set of test cases \(TS_i\) that cover it.

For the LTS specification and the test cases in Fig. 1, by considering that the test criterion is all-transition-pairs coverage, the satisfiability relation is presented in Table 1. The most covered requirement is the pair (f, g)—(New changes saved!, Select another user?)—whereas the least covered requirement is the pair (b, c)—(Change password?, Limit of daily changes exceeded!)—with only one test case. Particularly, (b, c) is part of an exception flow that is often associated with critical failures in practice. In this case, \(t_1\) is an essential test case (one that uniquely covers a given requirement), so any test suite reduction strategy will keep it. However, it is important to remark that if we consider a weaker test criterion such as all-transitions, we would require at least one test case covering b and c, but not necessarily both in the same test case. In this case, we cannot guarantee that the reduction strategy will select \(t_1\). Therefore, the choice of the test criteria that define the requirements to be covered is critical to maximize the fault detection capability of the reduced suite.
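To make the satisfiability relation concrete, the following Java sketch builds, for the all-transition-pairs criterion, a map from each pair of adjacent transitions to the indexes of the test cases that cover it. The class and method names are ours, introduced only for illustration.

```java
import java.util.*;

/** Minimal sketch: satisfiability relation for all-transition-pairs coverage. */
public class SatisfiabilityRelation {

    /** Maps each covered pair of adjacent transitions to the test cases covering it. */
    static Map<List<String>, Set<Integer>> build(List<List<String>> suite) {
        Map<List<String>, Set<Integer>> relation = new HashMap<>();
        for (int tc = 0; tc < suite.size(); tc++) {
            List<String> path = suite.get(tc);
            for (int k = 0; k + 1 < path.size(); k++) {
                List<String> pair = List.of(path.get(k), path.get(k + 1));
                relation.computeIfAbsent(pair, x -> new HashSet<>()).add(tc);
            }
        }
        return relation;
    }

    public static void main(String[] args) {
        // only two of the test cases from Fig. 1c are reproduced here, as an illustration
        List<List<String>> suite = List.of(
            List.of("d", "e", "f", "g", "h", "i"),   // t11
            List.of("d", "e", "f", "g", "e", "f"));  // t13
        System.out.println(build(suite));
    }
}
```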

Furthermore, the choice of a distance function plays an important role in test suite reduction. It defines which test case(s) are part of the reduced suite, by selecting the most different ones that cover a given requirement. As we discuss in Sect. 3, different functions present different measures; therefore, we may get a different reduced set for each function.

Table 1 Satisfiability relation

3 Distance functions

In this section, we present the six distance functions applied in this work to calculate the similarity degree between pairs of test cases. These functions are good candidates for detecting sequencing, matching, and/or repetition of transitions.

While other works have already applied these functions to similarity-based selection strategies (Heß 2006; Vinson et al. 2007; Cartaxo et al. 2011; Hemmati et al. 2013; Fang et al. 2013), in this paper, we apply these functions in the context of test suite reduction for MBT. It is important to remark that some of them needed to be slightly adapted to consider transition labels as unit of comparison instead of characters. Moreover, despite the fact that there are many other distance functions presented in the literature, our goal was to investigate the effect of distance functions, in general, on test suite reduction. For the sake of simplicity, we opt to choose a small set with the ones that are included in other studies in the general area of test case selection.

As running example, we consider test cases \(t_{11}\) and \(t_{13}\) from Fig. 1c, which cover three test requirements in common: \((d, e), (f, g), (e, f)\) (Table 1). These test cases start with editing the user name, but differ in the subsequent operation, which is either another user name editing or the removal of a user.

3.1 Similarity function

Cartaxo et al. (2011) define a redundancy measure that calculates the similarity degree between two test cases defined as paths. The degree is measured as the number of identical transitions divided by the average of paths length. Note that this function does not consider the number of repetitions of a transition, for instance, if a loop is traversed more than once.

To address this limitation, we present here an extension of this redundancy measure that ensures that the degree of similarity between test cases without repeated transitions is identical to the value calculated by the original function. The key idea is to relate the number of identical transitions of a path, and their corresponding occurrences in both test cases (pairs), to the average path length and the average number of distinct transitions. Thus, to calculate the similarity degree between two test cases \(i\) and \(j\), considering repetition of transitions, we propose the following function

$$\begin{aligned} SF(i,j) = \frac{\frac{nip(i, j) + |sit(i,j)|}{2}}{\frac{\frac{|i| + |j|}{2} + \frac{|sdt(i)| + |sdt(j)|}{2}}{2}} = \frac{nip(i, j) + |sit(i,j)|}{\frac{|i| + |j| + |sdt(i)| + |sdt(j)|}{2}} \end{aligned}$$

where

  • \(nip(i,j)\) is the number of identical transition pairs between the two test cases;

  • \(sdt(i)\) is the set of distinct transitions in the \(i\) test case.

  • \(sit(i,j)\) is the set of identical transitions between two test cases, i.e., the intersection between \(sdt(i)\) and \(sdt(j)\);

  • \(O(|i| + |j|)\) is the time complexity.

For example, the similarity degree between \(t_{11} = \langle d, e, f, g, h, i\rangle\) and \(t_{13} = \langle d, e, f, g, e, f\rangle\) is calculated as follows:

  • Set of distinct transitions:

    • \(|sdt(t_{11})| = |\{d, e, f, g, h, i\}| = 6\);

    • \(|sdt(t_{13})| = |\{d, e, f, g\}| = 4\);

  • Set of identical transitions:

    • \(|sit(t_{11},t_{13})| = |sdt(t_{11}) \cap sdt(t_{13})| = |\{d, e, f, g\}| = 4\);

  • Number of identical transition pairs:

    • \(nip(t_{11}, t_{13}) = 4\), as presented in Table 2;

  • Paths length: \(|t_{11}| = 6\) and \(|t_{13}| = 6\).

Table 2 Identical transition pairs

Then,

$$\begin{aligned} SF(t_{11},t_{13}) = \frac{nip(t_{11}, t_{13}) + |sit(t_{11},t_{13})|}{\frac{|t_{11}| + |t_{13}| + |sdt(t_{11})| + |sdt(t_{13})|}{2}} = \frac{4 + 4}{\frac{6 + 6 + 6 + 4}{2}} = \frac{8}{11}= 0.727 \end{aligned}$$

Hence, the similarity degree of the test cases \(t_{11}\) and \(t_{13}\) is 72.7 %.
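For concreteness, the following Java sketch computes SF as defined above. The counting of identical transition pairs (nip) is our reading of the worked example: each occurrence of a transition in one test case is matched with at most one occurrence in the other. Class and method names are ours.

```java
import java.util.*;

/** Minimal sketch of the extended Similarity Function (SF). */
public class SimilarityFunction {

    /** nip(i, j): sum over distinct transitions of min(occurrences in i, occurrences in j). */
    static int nip(List<String> i, List<String> j) {
        Map<String, Integer> ci = counts(i), cj = counts(j);
        int pairs = 0;
        for (Map.Entry<String, Integer> e : ci.entrySet()) {
            pairs += Math.min(e.getValue(), cj.getOrDefault(e.getKey(), 0));
        }
        return pairs;
    }

    static Map<String, Integer> counts(List<String> testCase) {
        Map<String, Integer> c = new HashMap<>();
        for (String t : testCase) c.merge(t, 1, Integer::sum);
        return c;
    }

    /** SF(i, j) = (nip + |sit|) / ((|i| + |j| + |sdt(i)| + |sdt(j)|) / 2). */
    static double sf(List<String> i, List<String> j) {
        Set<String> sdtI = new HashSet<>(i), sdtJ = new HashSet<>(j);
        Set<String> sit = new HashSet<>(sdtI);
        sit.retainAll(sdtJ);                                  // identical transitions
        double numerator = nip(i, j) + sit.size();
        double denominator = (i.size() + j.size() + sdtI.size() + sdtJ.size()) / 2.0;
        return numerator / denominator;
    }

    public static void main(String[] args) {
        List<String> t11 = List.of("d", "e", "f", "g", "h", "i");
        List<String> t13 = List.of("d", "e", "f", "g", "e", "f");
        System.out.printf("SF(t11, t13) = %.3f%n", sf(t11, t13));   // expected 0.727
    }
}
```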

3.2 Levenshtein distance

Levenshtein (1966) proposes the edit distance function, called editDistance. This function compares two strings and determines the minimum number of edit operations (deletion, insertion, and substitution) necessary to transform one string into the other.

Consider two strings, \(A\) and \(B\), where \(i\) and \(j\) are, respectively, their lengths. Firstly, a matrix \(M\) with \((i + 1) \times (j +1)\) values is built, where the first row and the first column are initialized with values from 0 (incremented by 1) up to the size of each test case. The idea is to calculate the distances among all the prefixes of the first string \(A\) and all the prefixes of the second string \(B\) in a dynamic programming fashion. As the matrix is built, each cell depends only on the previous and the current row; the value at position \((p, q)\) is the minimum of the three possible ways to do the transformation:

  • deletion: \(M[(p-1,q)] + 1\);

  • insertion: \(M[(p,q-1)] + 1\);

  • substitution: \( M[(p-1,q-1)] + cost\), where \(cost = 0\) if \(A[p] = B[q]\), otherwise \(cost = 1\).

The value of \(M [i + 1, j + 1]\) reflects the minimum number of operations necessary to convert one test case into the other, i.e., the cost of the best sequence of edit operations. The degree of similarity can be calculated in the interval \([0, 1]\) by the following function:

$$\begin{aligned} Lev(A, B) = \frac{ \hbox {max} (i,j) - M[i + 1, j + 1]}{ \hbox {max} (i,j)} = 1 - \frac{M[i + 1, j + 1]}{ \hbox {max} (i,j)} \end{aligned}$$

where the time complexity is \(O(|A| * |B|)\).

For example, consider test cases \(t_{11}\) and \(t_{13}\). From Matrix 1, the similarity value between \(t_{11}\) and \(t_{13}\) calculated by the Levenshtein distance is 66.7 %, where \(i = |t_{11}| = 6\), \(j = |t_{13}| = 6\) and \(M[6 + 1, 6 + 1] = 2\) (the boxed value), obtained by the calculation

$$\begin{aligned} Lev(t_{11}, t_{13}) = 1 - \frac{M[6 + 1, 6 + 1]}{ \hbox {max} (6,6)} = 1 - \frac{2}{6} = \frac{4}{6} = 0.667 \end{aligned}$$
Matrix 1 (edit-distance matrix for \(t_{11}\) and \(t_{13}\), Levenshtein distance)
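A minimal Java sketch of this calculation, using transition labels instead of characters as the unit of comparison, is shown below; the class and method names are ours.

```java
import java.util.List;

/** Minimal sketch of Levenshtein-based similarity over transition labels. */
public class LevenshteinSimilarity {

    static int editDistance(List<String> a, List<String> b) {
        int[][] m = new int[a.size() + 1][b.size() + 1];
        for (int p = 0; p <= a.size(); p++) m[p][0] = p;          // first column: 0..|a|
        for (int q = 0; q <= b.size(); q++) m[0][q] = q;          // first row: 0..|b|
        for (int p = 1; p <= a.size(); p++) {
            for (int q = 1; q <= b.size(); q++) {
                int cost = a.get(p - 1).equals(b.get(q - 1)) ? 0 : 1;
                m[p][q] = Math.min(Math.min(m[p - 1][q] + 1,      // deletion
                                            m[p][q - 1] + 1),     // insertion
                                   m[p - 1][q - 1] + cost);       // substitution
            }
        }
        return m[a.size()][b.size()];
    }

    /** Lev(A, B) = 1 - editDistance(A, B) / max(|A|, |B|). */
    static double similarity(List<String> a, List<String> b) {
        return 1.0 - (double) editDistance(a, b) / Math.max(a.size(), b.size());
    }

    public static void main(String[] args) {
        List<String> t11 = List.of("d", "e", "f", "g", "h", "i");
        List<String> t13 = List.of("d", "e", "f", "g", "e", "f");
        System.out.printf("Lev(t11, t13) = %.3f%n", similarity(t11, t13));   // expected 0.667
    }
}
```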

3.3 Sellers algorithm

The algorithm proposed by Sellers (1980) is a variation of the editDistance algorithm (Levenshtein 1966) (presented in Sect. 3.2) that modifies the way the matrix is created. The idea is to search for a string (sub-chain) within another string, allowing a difference of at most \(k\) operations. Unlike the editDistance algorithm, the first row of the matrix is initialized with \(0\). This changes the calculation of the minimum number of operations needed to transform string \(A\) into string \(B\), by allowing any prefix of string \(B\) to be ignored. The degree of similarity is calculated by the same formula presented in Sect. 3.2.

$$\begin{aligned} Sel(A, B) = \frac{ \hbox{max} (i,j) - M[i + 1, j + 1]}{ \hbox{max} (i,j)} = 1 - \frac{M[i + 1, j + 1]}{ \hbox{max} (i,j)} \end{aligned}$$

For example, considering test cases \(t_{11}\) and \(t_{13}\), the Sellers algorithm creates Matrix 2, where \(i = |t_{11}| = 6\), \(j = |t_{13}| = 6\) and \(M[6 + 1, 6 + 1] = 2\) (the boxed value). So, \(t_{11}\) and \(t_{13}\) are 66.7 % redundant—the same value obtained by the Levenshtein distance. Note, however, that the underlying matrices are different.

Matrix 2 (edit-distance matrix for \(t_{11}\) and \(t_{13}\), Sellers algorithm)
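A corresponding sketch differs from the Levenshtein one only in the initialization of the first row; again, the names are ours.

```java
import java.util.List;

/** Minimal sketch of the Sellers variation of editDistance over transition labels. */
public class SellersSimilarity {

    static int sellersDistance(List<String> a, List<String> b) {
        int[][] m = new int[a.size() + 1][b.size() + 1];
        for (int p = 0; p <= a.size(); p++) m[p][0] = p;
        // the only change w.r.t. editDistance: the first row is initialized with 0,
        // so any prefix of the second test case can be skipped at no cost
        for (int q = 0; q <= b.size(); q++) m[0][q] = 0;
        for (int p = 1; p <= a.size(); p++) {
            for (int q = 1; q <= b.size(); q++) {
                int cost = a.get(p - 1).equals(b.get(q - 1)) ? 0 : 1;
                m[p][q] = Math.min(Math.min(m[p - 1][q] + 1, m[p][q - 1] + 1),
                                   m[p - 1][q - 1] + cost);
            }
        }
        return m[a.size()][b.size()];
    }

    static double similarity(List<String> a, List<String> b) {
        return 1.0 - (double) sellersDistance(a, b) / Math.max(a.size(), b.size());
    }

    public static void main(String[] args) {
        System.out.printf("Sel(t11, t13) = %.3f%n",
            similarity(List.of("d", "e", "f", "g", "h", "i"),
                       List.of("d", "e", "f", "g", "e", "f")));   // expected 0.667
    }
}
```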

3.4 Jaccard index

Jaccard’s index, proposed by Jaccard (1901), is a similarity measure between sample sets. Let \(A\) and \(B\) be two sets of labels. The measure can be defined by the following function

$$\begin{aligned} Jac(A, B) = \frac{|A \cap B|}{|A \cup B|} \end{aligned}$$

where time complexity is \(O(|A| + |B|)\).

In order to illustrate the Jaccard index, consider again test cases \(t_{11}=\langle d, e, f, g, h, i\rangle\) and \(t_{13}=\langle d, e, f, g, e, f\rangle\). Then, the calculation of Jaccard’s index for test cases \(t_{11}\) and \(t_{13}\) is the following:

$$\begin{aligned} Jac(t_{11}, t_{13}) = \frac{|t_{11} \cap t_{13}|}{|t_{11} \cup t_{13}|} = \frac{|\{d, e, f, g\}|}{|\{d, e, f, g, h, i\}|} = \frac{4}{6} = 0.6666 \end{aligned}$$

Thus, the similarity degree between \(t_{11}\) and \(t_{13}\) calculated by using the Jaccard index is 66.66 %.
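A minimal Java sketch of the Jaccard index over the sets of transition labels is shown below; the class and method names are ours.

```java
import java.util.*;

/** Minimal sketch of the Jaccard index over sets of transition labels. */
public class JaccardIndex {

    static double jaccard(List<String> a, List<String> b) {
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(new HashSet<>(b));     // A ∩ B
        Set<String> union = new HashSet<>(a);
        union.addAll(b);                              // A ∪ B
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        System.out.printf("Jac(t11, t13) = %.4f%n",
            jaccard(List.of("d", "e", "f", "g", "h", "i"),
                    List.of("d", "e", "f", "g", "e", "f")));      // expected 0.6667
    }
}
```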

3.5 Jaro distance

The Jaro distance, presented in Jaro (1989), is a measure of similarity between two strings. The idea of this measure is to calculate the similarity degree between two strings from the number of characters that exchange positions (transpositions) and the number of matching characters. Thus, given two strings \(s_1 = a_1 \ldots a_k\) and \(s_2 = b_1 \ldots b_l,\) the Jaro distance is defined as:

$$\begin{aligned} Jaro(s_1,s_2) = \left\{ \begin{array}{lll} 0 &{} \quad \hbox {if}\quad m = 0&{}\\ \frac{1}{3} \left( \frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right) &{} \quad \hbox {otherwise} &{} \end{array}\right. \end{aligned}$$

where

  • \(m\) is the number of matching characters;

  • \(t\) is half the number of transpositions;

  • \(O(|s_1| + |s_2|)\) is the time complexity.

For instance, the number of matchings between test cases \(t_{11} = \langle d, e, f, g, h, i\rangle\) and \(t_{13}=\langle d, e, f, g, e, f \rangle\) is \(m = 4\) and half the number of transpositions is \(t = 0\), then

$$\begin{aligned} Jaro(t_{11},t_{13}) = \frac{1}{3} \left( \frac{4}{6} + \frac{4}{6} + \frac{4 - 0}{4}\right) = \frac{2.333}{3} = 0.778 \end{aligned}$$

Thus, the similarity degree between \(t_{11}\) and \(t_{13}\) is 77.8 %.
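The following Java sketch implements the Jaro distance over transition labels. The matching-window rule (a symbol matches only within \(\lfloor \max(|s_1|,|s_2|)/2 \rfloor - 1\) positions) follows the standard definition of the measure, which the description above does not detail; the names are ours.

```java
import java.util.List;

/** Minimal sketch of the Jaro distance with transition labels as symbols. */
public class JaroDistance {

    static double jaro(List<String> s1, List<String> s2) {
        int window = Math.max(s1.size(), s2.size()) / 2 - 1;      // matching window
        boolean[] matched1 = new boolean[s1.size()];
        boolean[] matched2 = new boolean[s2.size()];
        int m = 0;                                                // matching symbols
        for (int i = 0; i < s1.size(); i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(s2.size() - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!matched2[j] && s1.get(i).equals(s2.get(j))) {
                    matched1[i] = matched2[j] = true;
                    m++;
                    break;
                }
            }
        }
        if (m == 0) return 0.0;
        // count out-of-order matched symbols; half of that count gives t
        int transpositions = 0, k = 0;
        for (int i = 0; i < s1.size(); i++) {
            if (!matched1[i]) continue;
            while (!matched2[k]) k++;
            if (!s1.get(i).equals(s2.get(k))) transpositions++;
            k++;
        }
        double t = transpositions / 2.0;
        return ((double) m / s1.size() + (double) m / s2.size() + (m - t) / m) / 3.0;
    }

    public static void main(String[] args) {
        System.out.printf("Jaro(t11, t13) = %.3f%n",
            jaro(List.of("d", "e", "f", "g", "h", "i"),
                 List.of("d", "e", "f", "g", "e", "f")));         // expected 0.778
    }
}
```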

3.6 Jaro–Winkler distance

Jaro–Winkler distance (Winkler 1999), denoted \(JW\), is a variant of the Jaro distance presented in Sect. 3.5, with the addition of the weighted prefix. Given two strings \(s_1\) and \(s_2\), the function is defined as:

$$\begin{aligned} JW(s_1,s_2) = Jaro(s_1,s_2) + \ell p(1-Jaro(s_1,s_2)) \end{aligned}$$

where

  • \(\ell\) is the length of common prefix shared by the two strings with a maximum of four characters;

  • \(p\) is a constant scaling factor for how much the score is adjusted upwards for having common prefixes. \(p\) should not exceed \(0.25\), otherwise the distance can become larger than \(1\). The standard value for this constant in Winkler’s work is \(p = 0.1\);

  • \(O(|s_1| + |s_2|)\) is the time complexity.

The difference between Jaro and Jaro–Winkler is that Jaro–Winkler adds more weight to strings starting with exactly matching characters. However, the maximum size of the common prefix is four, i.e., all matching characters past the first four have the same weight. The length of the common prefix is multiplied by a constant, the standard value being \(0.1\) for the Jaro–Winkler distance.

For example, considering \(p = 0.1\) and the test cases \(t_{11} = \langle d, e, f, g, h, i\rangle\) and \(t_{13} = \langle d, e, f, g, e, f\rangle\), we have \(\ell = 4\) and \(Jaro(t_{11},t_{13}) = 0.778\). The Jaro–Winkler distance is then

$$\begin{aligned} JW(t_{11},t_{13}) = 0.778 + 4 \cdot 0.1 \cdot (1 - 0.778) = 0.867 \end{aligned}$$

Thus, the similarity degree between \(t_{11}\) and \(t_{13}\) for Jaro–Winkler distance is 86.7 %.
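In code, the adjustment is a small extension of the Jaro sketch above; the sketch below reuses that jaro(...) method and, again, the names are ours.

```java
import java.util.List;

/** Minimal sketch of Jaro–Winkler, reusing JaroDistance.jaro(...) from the previous sketch. */
public class JaroWinklerDistance {

    static double jaroWinkler(List<String> s1, List<String> s2, double p) {
        double jaro = JaroDistance.jaro(s1, s2);
        int prefix = 0;                              // common prefix length, capped at 4
        int limit = Math.min(4, Math.min(s1.size(), s2.size()));
        while (prefix < limit && s1.get(prefix).equals(s2.get(prefix))) prefix++;
        return jaro + prefix * p * (1.0 - jaro);     // JW = Jaro + l * p * (1 - Jaro)
    }

    public static void main(String[] args) {
        System.out.printf("JW(t11, t13) = %.3f%n",
            jaroWinkler(List.of("d", "e", "f", "g", "h", "i"),
                        List.of("d", "e", "f", "g", "e", "f"), 0.1));   // expected 0.867
    }
}
```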

4 Similarity-based test suite reduction strategy

Coutinho et al. (2013) introduce a similarity-based test suite reduction strategy inspired by the test selection strategy proposed by Cartaxo et al. (2011). The goal of the original selection strategy was to select a percentage of the test cases that are the most different ones based on the degree of similarity among them without the need to preserve test requirements coverage of the original suite. On the other hand, the reduction strategy aims to produce a subset from the original test suite that satisfies the same set of test requirements, by removing from the suite the most similar test cases while the reduced suite covers the requirements.

In order to apply the reduction strategy, the following inputs are necessary

  • Test suite the set of test cases to be reduced;

  • Test requirements the set of requirements that should be covered, defined by a satisfiability relation;

  • Similarity matrix the matrix that presents the similarity degree for all the pairs of test cases. The degree is measured by a distance function.

The similarity degrees of the test suite, computed by a distance function (Sect. 3), are arranged in a similarity matrix as proposed by Cartaxo (2011). In this section, we describe the similarity matrix (Sect. 4.1) and the reduction algorithm (Sect. 4.2).

4.1 Similarity matrix

The similarity matrix is assembled by applying the distance function to each pair of test cases in the test suite. In summary, the matrix is defined as:

  • Square matrix (\(n \times n\)) where \(n\) is the number of test cases and each column and line represents a test case;

  • Each element of the matrix \(a_{ij}\) is the similarity degree between two test cases \(i\) and \(j\) defined by the calculation of the distance function;

  • Symmetric matrix since \(a_{ij} = a_{ji}\);

For the example presented in Fig. 1, we obtained the Similarity Matrix 3 by using the Similarity Function as the distance function.

Matrix 3 (similarity matrix for the test suite in Fig. 1c, computed with the Similarity Function)
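The assembly of the matrix can be sketched in Java as follows, with the distance function passed as a parameter (for example, the SF sketch from Sect. 3.1); the names are ours.

```java
import java.util.List;
import java.util.function.BiFunction;

/** Minimal sketch: assembling the similarity matrix for a test suite. */
public class SimilarityMatrix {

    static double[][] build(List<List<String>> suite,
                            BiFunction<List<String>, List<String>, Double> distance) {
        int n = suite.size();
        double[][] matrix = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {        // symmetric: compute the upper triangle only
                double degree = distance.apply(suite.get(i), suite.get(j));
                matrix[i][j] = degree;
                matrix[j][i] = degree;
            }
        }
        return matrix;
    }
    // usage (assuming the SF sketch): SimilarityMatrix.build(suite, SimilarityFunction::sf)
}
```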

4.2 Similarity-based strategy

Algorithm 1 performs the selection of test cases to remove from the original suite. Basically, the algorithm analyzes the values of the similarity matrix, starting from the highest one, and verifies whether, after the removal of test cases, the suite still keeps 100 % of the test requirements coverage. The allValuesMatrixAnalyzed method (line 1) returns true after all values of the matrix have been analyzed. Inside the while loop, the first step (lines 2–5) is to find the maximum value in the matrix. This value is associated with the two most similar test cases. When a tie among maximum values is found in the similarity matrix, one value is randomly chosen. In the second step (lines 6–16), the order of analysis of these two test cases is defined according to their path lengths: the test case with the lower number of transitions is the first to be analyzed. If the test cases have the same length, one of them is chosen randomly.

In the next step (lines 17–26), it is necessary to verify whether the reduced test suite still satisfies all the requirements after the removal of the first test case chosen. If all requirements are still satisfied, then the first chosen test case is removed from the similarity matrix. Otherwise, the first test case is added back to the test suite, and the other one (the second test case chosen) is removed from the test suite in the same way. Until all pairs of the similarity matrix have been analyzed, new pairs of test cases continue to be selected, removed, and tested against the similarity matrix. Finally, the reduced test suite is returned (line 28).

Algorithm 1 Similarity-based test suite reduction
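A minimal Java sketch consistent with this description is given below. It reuses the satisfiability relation of Sect. 2.2 and the similarity matrix of Sect. 4.1; all method names are ours and do not correspond to the LTS-BT implementation.

```java
import java.util.*;

/** Minimal sketch of the similarity-based reduction strategy (Algorithm 1). */
public class SimilarityReduction {

    static List<List<String>> reduce(List<List<String>> suite,
                                     double[][] matrix,
                                     Map<List<String>, Set<Integer>> satisfiability) {
        Set<Integer> kept = new TreeSet<>();
        for (int i = 0; i < suite.size(); i++) kept.add(i);
        boolean[][] analyzed = new boolean[suite.size()][suite.size()];
        Random random = new Random();

        while (true) {
            // find the not-yet-analyzed pairs of kept test cases with the highest similarity
            List<int[]> maxPairs = new ArrayList<>();
            double max = -1;
            for (int i = 0; i < suite.size(); i++)
                for (int j = i + 1; j < suite.size(); j++)
                    if (!analyzed[i][j] && kept.contains(i) && kept.contains(j)) {
                        if (matrix[i][j] > max) { max = matrix[i][j]; maxPairs.clear(); }
                        if (matrix[i][j] == max) maxPairs.add(new int[]{i, j});
                    }
            if (maxPairs.isEmpty()) break;                                 // all values analyzed

            int[] pair = maxPairs.get(random.nextInt(maxPairs.size()));    // random tie-break
            analyzed[pair[0]][pair[1]] = true;

            // the shorter test case is analyzed first (random choice on equal lengths)
            int first = pair[0], second = pair[1];
            if (suite.get(second).size() < suite.get(first).size()
                    || (suite.get(first).size() == suite.get(second).size()
                        && random.nextBoolean())) {
                first = pair[1];
                second = pair[0];
            }

            if (!tryRemove(first, kept, satisfiability)) {
                tryRemove(second, kept, satisfiability);
            }
        }

        List<List<String>> reduced = new ArrayList<>();
        for (int index : kept) reduced.add(suite.get(index));
        return reduced;
    }

    /** Removes the candidate if the remaining suite still covers every requirement. */
    static boolean tryRemove(int candidate, Set<Integer> kept,
                             Map<List<String>, Set<Integer>> satisfiability) {
        kept.remove(candidate);
        for (Set<Integer> covering : satisfiability.values()) {
            if (Collections.disjoint(covering, kept)) {    // a requirement became uncovered
                kept.add(candidate);                       // put the test case back
                return false;
            }
        }
        return true;
    }
}
```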

In Algorithm 1, we can observe that the loop in line 1 is executed, in the worst case, \(\frac{n^2 - n}{2}\) times, where \(n\) is the number of test cases in the test suite. Furthermore, within each iteration, the method getAllMaxValue in line 2 (\(O(n^2)\)) is used to search the matrix for the highest similarity values. Thus, the worst-case time complexity of Algorithm 1 is \(O(\frac{n^2 - n}{2} \times n^2)\). Based on the time complexity of each distance function and of the similarity-based reduction strategy (polynomial time), we observe that the distance functions considered in this paper do not influence the efficiency of the reduction strategy.

For the test suite described in Fig. 1c, and considering Matrix 3 and all-transition-pairs as test requirements, Algorithm 1 can return the following reduced test suite: \(RS = \{ t_1, t_5, t_7, t_{10}\}\), by removing \(t_9\), \(t_3\), \(t_2\), \(t_6\), \(t_{12}\), \(t_8\), \(t_{11}\), \(t_4\), \(t_{13}\), in this order. As expected, the reduced suite contains the essential test case \(t_1\) and three other test cases, minimally covering the test requirements. With the reduced suite, it is possible to test some of the situations where editing a user name or password is followed or not by another editing or a remove operation. We also exercise the exception flow in two cases. Other similarity functions would lead to other choices of nonessential test cases, even if covering the same requirements. The choice may influence the size of the reduced suite and its fault detection capability, as we discuss in the following sections.

5 Experiment definition

In this section, we present the definition of two empirical studies to assess the effectiveness of different distance functions applied in the scope of the similarity-based strategy for test suite reduction presented in Sect. 4. Both studies consider a real-world application model and real faults experienced during test execution. Based on the structure of each application, 10 synthetic specification models are automatically generated along with a similar percentage of random faults. The idea is to consider two different real settings of application model and fault detection percentage in order to investigate the functions in a controlled way.

The first empirical study focuses on a version of the PDFSam tool. This application has few essential test cases and, consequently, a great potential for reduction. On the other hand, the second empirical study focuses on a version of the TaRGeT tool (Nogueira et al. 2007; Ferreira et al. 2010) composed mostly of essential test cases, making the reduction task harder.

For these investigations, we follow the process for experimental studies in software engineering proposed by Wohlin et al. (2000). The next sections describe the activities performed to define and execute the studies.

5.1 Definition

As mentioned before, the goal of these empirical studies was to assess the effectiveness of distance functions that measure the similarity between two test cases when applied in a similarity-based test suite reduction strategy. For this, we observe, for the reduced suite, its size and fault coverage. Based on this goal, our general hypothesis is that “test suite reduction strategies based on similarity show a different performance regarding size and FC of the reduced suite depending on the distance function used.” Furthermore, we analyze the results from the point of view of the tester (responsible for the testing process) in the context of MBT.

5.2 Planning

In the phase of planning, we define context selection, variables, hypothesis, instrumentation, design, and threats to validity as follows.

5.2.1 Context selection

Following the dimensions proposed by Wohlin, the studies are off-line, i.e., we perform them in laboratory, which is not a real industrial environment. For more general results, an experiment should be performed in real settings (online).

Each empirical study has as inputs to the reduction strategy (with the different distance functions) one real-world application (real problem) and 10 synthetic automatically generated specifications. These specifications are randomly generated by considering the same configuration of the respective real-world application, such as depth, number of forks, number of transitions of forks, number of joins, number of transitions of joins, and number of paths with loop. Since these empirical studies focus only on two sets of different configurations, the studies can be characterized as specific.

5.2.2 Variables selection

Variables are one of the main elements of an experiment. They comprise the elements that are observed (dependent variables) and those that are modified and controlled (independent variables) during the experimental study. The variables that compose our studies are defined as follows:

  • Independent variables

    • Test requirements all-transition-pair coverage;

    • Test suite reduction strategy similarity-based test suite reduction strategy (Sim);

    • Distance functions functions to measure the similarity degree between two test cases applied in the reduction strategy. In this work, we analyze the functions presented in Sect. 3:

      • Jac: Jaccard index;

      • Jaro: Jaro distance;

      • JW: Jaro–Winkler distance;

      • Lev: Levenshtein distance;

      • Sel: Sellers algorithm;

      • SF: Similarity function.

    • Faults the faults revealed by the test suite. For the synthetic models, faults are automatically defined considering the same pattern of the real models: a test case fails due to one fault (one-to-one relationship) and the test cases that fail are distinct;

  • Dependent variables

    • Suite size reduction (SSR) percentage of the number of test cases removed from the original suite.

      $$\begin{aligned} SSR = \frac{|TS| - |RS|}{|TS|} \times 100\,\% \end{aligned}$$

      where \(|TS|\) is the number of test cases in the original test suite and \(|RS|\) is the number of test cases in the reduced test suite;

    • Fault coverage (FC) percentage of the total number of faults uncovered by the reduced test suite:

      $$\begin{aligned} FC = \frac{|F_{RS}|}{|F_{TS}|} \times 100\,\% \end{aligned}$$

      where \(|F_{TS}|\) is the number of faults revealed by the original test suite and \(|F_{RS}|\) is the number of faults revealed by the reduced test suite.

5.2.3 Hypothesis formulation

The experiment definition is formalized into hypotheses that are tested during the analysis of the experiment. Based on the goal of the empirical studies, for each dependent variable (SSR and FC), we define two hypotheses as follows.

  1. SSR A null hypothesis (\(H^0_{1}\)): all distance functions have the same behavior regarding SSR; an alternative hypothesis (\(H^1_{1}\)): at least two distance functions behave differently regarding SSR.

    $$\begin{aligned}&H^0_{1}: SSR_{Jac} = SSR_{Jaro} = SSR_{JW} = SSR_{Lev} = SSR_{Sel} = SSR_{SF} \\&H^1_{1}: SSR_{x} \ne SSR_{y} \hbox { for at least one pair of distance functions } x, y \end{aligned}$$
  2. FC A null hypothesis (\(H^0_{2}\)): all distance functions have the same behavior regarding the rate of FC; an alternative hypothesis (\(H^1_{2}\)): at least two distance functions behave differently regarding the rate of FC.

    $$\begin{aligned}&H^0_{2}: FC_{Jac} = FC_{Jaro} = FC_{JW} = FC_{Lev} = FC_{Sel} = FC_{SF} \\&H^1_{2}: FC_{x} \ne FC_{y} \hbox { for at least one pair of distance functions } x, y \end{aligned}$$

5.2.4 Instrumentation

The instruments of the experiments are defined as follows:

  1. Objects 1 real-world and 10 synthetic automatically generated LTS specifications for each empirical study (22 specification models in total);

  2. Guidelines since the strategy does not require people (subjects) to configure it, no guideline is used;

  3. Measurements the LTS-BT tool (Cartaxo et al. 2008) is used to support the experiments execution and data collection.

The two real-world specifications selected for each empirical study are briefly described as follows:

  • PDFSam an open-source tool used to split and merge pdf documents;

  • TaRGeT an application that generates test cases from use case documents in an MBT process.

In these studies, we consider a specific version of each of the real-world applications in which faults can be observed. For these versions, in order to generate the specification models, we consider a specification of software requirements written as use cases, by experienced testers, using the use case template of the TaRGeT tool. As output, the TaRGeT tool returns an LTS model that represents the execution flows of the use cases (Fig. 2). It is important to remark that the version of TaRGeT we consider as an object of the study is different from the one we use for generating the models; the latter is a stable and deployed version. Furthermore, we collect the faults considered in the studies by manually executing the version under testing and manually identifying faults from failures.

Fig. 2

Generation process of models of the real-world applications

Table 3 presents the configuration of the real specification models, defined as: (1) (structural) measures (based on the concepts presented in Sect. 2.1); (2) the number of test cases generated by the LTS-BT tool considering the all-one-loop-paths coverage criterion; (3) the number of essential test cases; (4) the number of faults detected. Notice that the two real-world specifications have a different number of faults. This is due to the fact that we consider only and exactly the real faults detected in order to make the results resemble practice. Moreover, it is important to remark that for each real-world specification, each fault is revealed by a distinct failure (test case). In MBT, the test cases are usually abstract (Utting and Legeard 2007), particularly for system testing. Therefore, although faults are more precisely described at the code level, for the sake of simplicity, we present the faults considered in our studies in Table 4 through the description of the failures that we can observe.

Table 3 Basic configuration of the two real-world specifications
Table 4 Description of faults, abstracted by the corresponding failure, of the real-world applications

From the configurations of each real-world model, we generate 10 synthetic LTS models based on the strategy presented by Oliveira et al. (2013), as illustrated in Fig. 3. The LTS generator receives as input the depth, the number of transitions of joins, joins, transitions of forks, forks, and paths of loops for each real-world specification. Then, it generates a number of different models (10 in this study) for each configuration.

Fig. 3

Generation process of the synthetic models

Table 5 presents the number of test cases generated, essential test cases, and faults generated for each synthetic model. Notice that they resemble the corresponding real ones.

Table 5 Comparing test case and fault metrics of the synthetic LTS specifications to the corresponding real specification ones

For the synthetic models, we randomly selected a number of test cases that fail and associated each failure with a fault to follow the same pattern of the real models. Moreover, the number of failures/faults approximates the percentage of faults of the real applications w.r.t. the number of test cases (PDFSam configuration: 3.65 % and TaRGeT configuration: 15.85 %). Likewise, the percentage of essential test cases is also an approximation, but it is slightly less precise because the distribution of essential test cases depends on the model and we did not control it directly. However, variation is low: the percentage of essential test cases ranges from 0 to 7 % for the PDFSam configuration and from 44.66 to 89.87 % for the TaRGeT configuration.

5.2.5 Experimental design

The experimental design is defined from the characteristics of the experiment, such as the number of objects, subjects, factors, and levels (Jain 1991; Wohlin et al. 2000). In our case, there is one experimental study with one factor (the distance function applied in the reduction strategy) and more than two treatments (the six distance functions investigated) for each specification. Thus, there are 11 experimental studies for each empirical study (10 synthetic specifications and one real specification). These experimental studies are structured in two experimental designs, i.e., one experimental design for each metric observed (SSR and FC) (Fig. 4).

Fig. 4

Schema of the experimental study for each input specification

As suggested in the literature for experimental studies, we choose a confidence level of 95 %. Then, we use \(\alpha = 0.05\) whenever referring to statistical significance. Moreover, in order to obtain conclusions with statistical significance, the minimum sample size must be calculated. Thus, we execute the six distance functions 40 times to calculate the number of replications required (\(n\)), according to Jain (1991), for each metric in each of the experimental studies as follows:

$$\begin{aligned} n = \left( \frac{100 \cdot z \cdot s}{r \cdot \overline{x}}\right) ^2 \end{aligned}$$

where

  • \(z\) is a standard value from the normal distribution table, for a 95 % confidence level \(z = 1.96\);

  • \(s\) is the standard deviation from the sample;

  • \(r\) is the desired accuracy (\(\alpha\) \(=\) 0.05, then \(r = 5\));

  • \(\overline{x}\) is the mean of the sample.
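As a minimal sketch, this calculation can be written as follows; the method name, the rounding-up choice, and the example values in the main method are ours, used only to exercise the formula.

```java
/** Minimal sketch of the sample size calculation n = (100 * z * s / (r * mean))^2. */
public class SampleSize {

    static int replications(double z, double stdDev, double accuracy, double mean) {
        double n = Math.pow((100.0 * z * stdDev) / (accuracy * mean), 2);
        return (int) Math.ceil(n);    // round up to a whole number of replications
    }

    public static void main(String[] args) {
        // 95 % confidence level (z = 1.96) and 5 % desired accuracy (r = 5);
        // the standard deviation and mean below are hypothetical values
        System.out.println(replications(1.96, 12.0, 5.0, 80.0));
    }
}
```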

For each empirical study, we consider that the number of necessary replications is the highest value obtained between the metrics SSR and FC over all specifications, as summarized in Table 6. Note that only the highest value for each metric in each empirical study is presented. For the configuration of the PDFSam application, the number of replications required is defined by the JW (Jaro–Winkler distance) function for Specification 2, observing the FC metric. In this case, we consider 62,000 replications of each distance function for each specification. In the configuration of TaRGeT, the highest value among the metrics (SSR and FC) of all specifications is defined by SF (Similarity Function) for Specification 2, observing the FC metric. Therefore, for the configuration of TaRGeT, we consider approximately \(40\) replications of each distance function for each specification.

Table 6 Mean, standard deviation, and the highest number of necessary replications for each metric and each application

5.3 Operation

To execute these empirical studies, we implemented an LTS generator as proposed by Oliveira et al. (2013) to automatically generate the different specifications according to specific configurations. Furthermore, it was necessary to implement the distance functions and the code to collect the data during the execution of the experiment. We implemented them in the Java programming language. Following this, we use the LTS-BT tool to generate test cases. Furthermore, we perform each step of the experiment the maximum number of times defined among the metrics, using a machine with an Intel Core(TM) i5 3.10 GHz processor and 8 GB of RAM running GNU Linux.

5.4 Threats to validity

An important question concerning the results of the empirical studies is the potential threats to validity that may negatively influence the results. Cook and Campbell (1979) suggest that threats can be identified according to the type of validation of results and define a list of four types: conclusion, internal, construct, and external validity.

The statistical tests used represent the main threat to conclusion validity. To deal with this threat, the number of executions of the experiments for each specification is equal to or higher than the amount defined by the sample size calculation. In order to maintain the statistical significance of the data, all analyses consider a confidence level of 95 %, according to the suggestions for conducting experiments in the statistical literature (Jain 1991). This ensures a good conclusion validity.

A threat to internal validity is related to the control of the experiment. To make the execution of the reduction strategy automatic for each distance function, during the implementation and execution of the algorithms, we added controls so that the execution environment would not be influenced by other processes, programs, or the machine on which the experiment was running. In these empirical studies, there are no people involved, and the same inputs (LTS specifications) are applied to all the distance functions. Thus, this threat to internal validity is not considered critical.

For construct validity, the experimental setting is the main threat. To maintain construct validity, the experimenter cannot influence the measures. To handle this, the synthetic specifications are automatically generated from the configuration of real-world applications. Furthermore, our results rely on input specifications with a given set of randomly generated faults. The number of faults for each configuration is defined according to the percentage observed for the real specification previously executed. In this real specification, the set of real faults is identified after each test case is manually executed by experienced software engineers.

Another threat to construct validity would be inadequate measurement of the metrics (SSR and FC). To handle this, these metrics are implemented according to the concepts proposed in the literature. Moreover, the implementation of the distance functions is another threat to validity. To deal with this, the distance functions are implemented according to the algorithms described in Sect. 3. In order to maintain the validity of the data, it is necessary to adapt the distance functions to calculate the degree of similarity between two test cases for these empirical studies.

The objects used in these experiments are the main threat to external validity, particularly the synthetic LTS specifications, which are automatically generated and may not represent real behavior, even though they are randomly generated considering the same configuration of real applications. However, automation makes it possible to consider a number of specifications in a controlled way.

6 Experiment analysis

The first step is to check whether the data collected have a normal distribution for all specifications, considering the SSR and FC metrics. For this, we apply the Anderson–Darling normality test, using the R tool, considering a confidence level of 95 % (significance level \(\alpha = 0.05\)) (Jain 1991). For the two empirical studies and all specifications, the \(\rho\) values are smaller than the significance level (\(\alpha = 0.05\)). Thus, we need to apply nonparametric tests. Since each experimental design has a unique factor with more than two treatments, we apply the nonparametric Kruskal–Wallis test to check the null hypotheses. This test is used to determine whether there are significant differences among the population medians. In the next subsections, we present and discuss these results, considering each empirical study. Detailed data collected in the experiment can be found on the studies' Web site.

6.1 First empirical study—PDFSam configuration

For Specifications 4 and 8, we obtain \(\rho\) values of 1.000 by executing the Kruskal–Wallis test for both the SSR and FC metrics. In other words, for these specifications and metrics, the distance functions have the same behavior with a 95 % confidence level. For the other specifications, we obtain \(\rho\) values of 0.0001 by executing the Kruskal–Wallis test. These values are smaller than the significance level (\(\alpha = 0.05\)) for all data. Thus, the null hypotheses can be rejected (\(H^0_{1}\) and \(H^0_{2}\)), that is, for SSR and FC, the distance functions do not present the same behavior.

We use boxplots to display the distribution of data based on the five-number summary: minimum, first quartile, median, third quartile, and maximum. Graphically, the minimum and maximum are represented by whiskers below and above the box. The central rectangle spans the first quartile to the third quartile, and the thicker line shows the median. The outliers (unfilled dots) represent the individual values beyond the whiskers. Boxplots are also useful for comparing two or more variables and for a visual interpretation of the data. When the intervals overlap, this suggests no statistical difference between them. When there is no overlap, we can state that there are statistical differences.

Figure 5 presents the boxplots for SSR and FC considering the general average in PDFSam configuration.

Fig. 5

Boxplots for SSR and FC considering the general average for PDFSam configuration

As there are overlaps in the boxplots, we apply the Mann–Whitney test (Wilcoxon–Mann–Whitney test in R) between each pair of distance functions. If the \(\rho\) value \( < \alpha\) for a Mann–Whitney test, then the null hypothesis can be rejected in favor of the alternative hypothesis. In this case, the response variable tends to be either greater or smaller for one group than for the other group. Otherwise (\(\rho\) value \( \ge \alpha\)), the null hypothesis cannot be rejected, and we conclude that the distance functions have similar behavior.

However, the Mann–Whitney test shows only whether there is a statistically significant difference between two treatments. In order to clarify the magnitude of the treatment effect, we use the \(\hat{A}_{12}\) effect size measure proposed by Vargha and Delaney (2000). Considering two treatments \(X\) and \(Y\), \(\hat{A}_{12} = 0.5\) indicates that there is no difference between the treatments \(X\) and \(Y\), whereas \(\hat{A}_{12} > 0.5\) indicates that \(X\) is superior to \(Y\), and \(\hat{A}_{12} < 0.5\) indicates that \(Y\) is superior to \(X\). Note that \(\hat{A}_{12}\) is between 0 and 1 and that the larger the effect size, the further the value is from \(0.5\). We follow the categories used by Rogstad et al. (2013), who categorize the effect into \(Small < 0.10\), \(0.10 < Medium < 0.17\), and \(Large > 0.17\), the value being the distance from 0.5. Table 7 shows the Mann–Whitney U tests and \(\hat{A}_{12}\) effect sizes for each comparison considering the general average in the PDFSam configuration.

Table 7 Mann–Whitney and \(\hat{A}_{12}\) effect size measurements for general average in PDFSam configuration

In most of the cases for SSR in Table 7, the effect size between the distance functions is classified as small. The results indicate that there is only a small difference when applying different distance functions combined with the similarity-based reduction strategy, considering SSR. In terms of FC, the results show that when Jac is compared to the others, its behavior is clearly better, with an effect size mostly from medium to large.

From the boxplots, Mann–Whitney tests and \(\hat{A}_{12}\) effect size measurement, we calculate the average position of each distance function regarding effectiveness, as presented in Table 8. This table presents the performance order of the distance functions for the SSR and FC metrics.

Table 8 Ordering of effectiveness for SSR and FC in PDFSam configuration

Additionally, Table 9 shows the minimum, maximum, median, and average considering the average over all executions of each specification. Note that, on average, the percentage of reduction is similar (though not equal, as shown in Table 8). Moreover, the rate of reduction varies from 75.546 to 88.940 %. This can be considered a high reduction rate, justified by the characteristics of the specifications, particularly the presence of loops. With loops, test cases with a certain degree of redundancy are generated, and therefore, the reduction strategy (whatever the similarity distance applied) tends to reduce further. However, by observing Table 9 (FC), on average, Jac presents the best behavior, as seen in Table 8, uncovering on average about 20 % of the faults with lower variance.

Table 9 Minimum, maximum, median, and average for PDFSam configuration

Finally, by analyzing the data obtained in the 62,000 executions of the technique when considering each function, we can also observe the stability of the reduction technique with respect to two measures: (1) the number of different sets of faults produced by the selected suites; (2) the number of different sets of test cases selected (different suites). Ideally, the technique should be as stable as possible by presenting a low number of different sets in each case, making its performance more predictable.

Fig. 6

Number of subsets of test cases and faults for the PDFSam configuration

Figure 6 presents the boxplots obtained for each function. For the sets of test cases, Jaro and JW present the best stability because the distances they assign to different pairs of test cases are generally not equal, so random tie-breaking is rarely needed. On the other hand, note that for the different sets of faults, SF is the most stable one, whereas Lev and Sel are the least stable. The reason is that SF is more precise in this context due to the presence of loops: it can more effectively detect when one test case is (or contains) a subset of another.

6.2 Second empirical study—TaRGeT configuration

Considering the SSR metric and the specification of the real application, Specification 3, and Specification 10, we obtain \(\rho \) values greater than \(0.05\) by executing the Kruskal–Wallis test. For FC and Specification 3, we also obtain \(\rho \) values greater than \(0.05\). Thus, not all null hypotheses can be rejected. In other words, for these specifications and metrics, the distance functions have the same behavior with a 95 % confidence level. For the other cases, the \(\rho \) values obtained are smaller than the significance level (\(\alpha = 0.05\)). Thus, the null hypotheses can be rejected (\(H^0_{1}\) and \(H^0_{2}\)). So, with a 95 % confidence level, the distance functions can be considered different for SSR and FC.

Figure 7 shows the boxplots of the SSR and FC metrics considering the general average in the TaRGeT configuration. By observing the boxplots, we can see that the behavior is only slightly different, generally making it impossible to rank the performance of the functions.

Fig. 7

Boxplots for SSR and FC considering the general average in TaRGeT configuration

To uncover differences that might exist, we evaluate the pairs of distance functions by applying the Mann–Whitney tests and \(\hat{A}_{12}\) effect size measurements (as defined in Sect. 6.1). Table 10 shows the Mann–Whitney U tests and \(\hat{A}_{12}\) effect size for each comparison considering the general average in the TaRGeT configuration.

Table 10 Mann–Whitney and \(\hat{A}_{12}\) effect size measurements for general average in TaRGeT configuration

As can be seen, for both metrics—SSR and FC—the effect size between the pairs of distance functions is considered small. This means that, even when one function behaves better than another, the difference is small. Moreover, Jac is again prevalent for FC.

From the boxplots, Mann–Whitney tests, and \(\hat{A}_{12}\) effect size measurements, we derive the ordering of effectiveness presented in Table 11. In most cases, the performance of the functions can be considered similar. However, we can also note that, for both SSR and FC, Lev and Sel are the most closely related, since they either present the same behavior or appear at adjacent levels of the ordering, except when the average is considered.

Table 11 Ordering of effectiveness for SSR and FC in TaRGeT configuration

The minimum, maximum, median, and average considering the average over all executions of each specification are presented in Table 12. These values also show that the performance of the functions is comparable for both SSR and FC, even though a few significant differences can be observed.

Table 12 Minimum, maximum, median, and average for TaRGeT configuration

Finally, as in the first experiment, by analyzing the data obtained in the 40 executions of the technique when considering each function, we can also observe the stability of the reduction strategy using the same measures defined in Sect. 6.1. Figure 8 presents the boxplots obtained for each function. Note that SF is the most stable one for both the different sets of test cases and the different sets of faults. For the sets of faults, Jac, Lev, and Sel are the least stable, even though the differences here are less pronounced. The reason is that the TaRGeT configuration presents less redundancy.

Fig. 8 Number of subsets of test cases and faults for the TaRGeT configuration

6.3 General remarks

In the presented experiments, we exercise and analyze distance functions in the context of a test suite reduction strategy. We consider two different scenarios by grouping specifications with a comparable configuration: (1) in the PDFSam configuration group, reduction is more likely due to the presence of structures that may lead to a higher degree of similarity between test cases; (2) in the TaRGeT configuration group, reduction is harder due to the prevalence of structures that do not directly lead to a higher degree of similarity, making the occurrence of essential test cases more likely.

It can be noticed that the configurations of the applications are different and that these differences may impact the results directly. For example, the number of paths with loops is a significant difference, since it has a direct impact on the number of generated test cases and on the degree of redundancy among them. As the PDFSam configuration has five paths with loops, the generated test cases may contain a high degree of redundancy; accordingly, we observe that the strategy presents a high rate of reduction. On the other hand, for the TaRGeT configuration, with no paths with loops and a large number of essential test cases, the reduction rate is low.

Results show that the PDFSam configuration presents more significant performance differences between the functions, since their influence on the overall result of the reduction technique is higher: the choice of the test case to be included depends on the function. However, we can conclude, for the investigated context, that the influence is mostly related to the FC metric rather than the SSR metric. The reduction percentage is quite similar in all cases, whereas FC is more or less successful for different functions. Jac is on average the best function, particularly for the PDFSam configuration. This confirms a similar result presented by Hemmati et al. (2013) in the context of test selection, where Jac and two of its variants are the distance functions with the best performance for FC.

Regarding stability, the results indicate that average stability in the number of different sets of faults is usually associated with better FC; the results obtained by the Jac function are an example. This may indicate that less precision can make a function more effective at covering different faults. Moreover, note that, in the PDFSam configuration, there are cases where the SF function, the most stable one, detected \(0\) faults. Furthermore, there is a limit to how much instability helps: the least stable functions, Sel and Lev, cannot surpass Jac in general.

7 Case study

The goal of this case study was to provide further investigation into the performance of the distance functions in a context different from the two experiments discussed so far. The study is based on an industrial application developed in the context of a cooperation between our research laboratory and Ingenico. The application is software for collecting and processing biometric data. From use cases, LTS specification models are automatically generated for two subsequent versions of the application, where one is a baseline version—CB\(_{v_{1}}\)—and the other is a delta version—CB\(_{v_{2}}\)—obtained from CB\(_{v_{1}}\) by two progressive modifications. From the models, we generate two test suites and execute them manually. From the executions, we collect the faults and failures. Table 13 describes the configurations of the two specification models.

Table 13 Configurations of the real-world specifications

Note that the fault rates of the specifications, relative to the size of the generated test suites, are 14.49 % for CB\(_{v_{1}}\) and 9.09 % for CB\(_{v_{2}}\). The numbers of essential test cases that fail for CB\(_{v_{1}}\) and CB\(_{v_{2}}\) are 2 and 3, respectively. For all essential test cases that fail, each failure is caused by a distinct fault. We expect them to always be included in the reduced suite, since, by definition, each uniquely covers a requirement.

For each specification, we execute 1,000 replications for each distance function. In order to draw observations based on these data, we apply a statistical analysis similar to that used in the other empirical studies.

Figure 9 presents the boxplots considering SSR and FC. Note that there are many overlaps; hence, it is necessary to perform the Mann–Whitney tests.

Fig. 9 Boxplots for SSR and FC considering the general average for CB\(_{v_{1}}\) and CB\(_{v_{2}}\)

In order to clarify the magnitude of the difference between the distance functions, we compute the \(\hat{A}_{12}\) effect size. The results of the Mann–Whitney U tests and the \(\hat{A}_{12}\) effect size measurement for each distance function comparison are reported in Table 14 for CB\(_{v_{1}}\) and CB\(_{v_{2}}\). Considering SSR for CB\(_{v_{1}}\) and CB\(_{v_{2}}\), we can see that Jac and SF present the best behavior, with no difference between them, whereas for FC the difference between them is large and SF is better.

Table 14 Mann–Whitney and \(\hat{A}_{12}\) effect size measurements for SSR and FC across the distance functions for CB\(_{v_{1}}\) and CB\(_{v_{2}}\)

From the boxplots, Mann–Whitney tests, and \(\hat{A}_{12}\) effect size measurements, we obtain the ordering of effectiveness for SSR and FC presented in Table 15.

Table 15 Ordering of effectiveness for SSR and FC in CB\(_{v_{1}}\) and CB\(_{v_{2}}\)

For these specifications, FC varied between 50 and 80 % for CB\(_{v_{1}}\), and between 66.667 and 83.333 % for CB\(_{v_{2}}\) (Table 16).

Table 16 Minimum, maximum, median, and average for CB\(_{v_{1}}\) and CB\(_{v_{2}}\)

The fact that the choice of the distance function may influence FC is consistent, to a certain extent, with the results obtained in the previous experiments. However, Jac did not perform as well as in the experiments regarding FC. By closely analyzing the reduced suites, we can see that the distances computed by Jac caused some failing test cases to be discarded, because each of them was considered similar to another test case that was selected.

As mentioned before, the distance function may influence the order in which pairs of test cases are considered. In particular, for the CB application, SF is more successful when comparing the total number of distinct faults and the frequency with which those faults are detected. However, Jac presented the most stable behavior, that is, the least variance in the reduced suites over the 1,000 executions when considering the subset of test cases that fail, followed by SF (Table 17). This confirms the results obtained in the experiments: the function with the best stability may not be the one with the best performance for FC. In the presence of essential test cases, SF becomes less stable than when applied to the PDFSam configuration.

Table 17 Number of different sets of test cases selected, number of distinct test cases, average frequency of inclusion of a test case in the reduced suite, number of different sets of faults detected, number of distinct faults, and average frequency of inclusion of a fault detected by a reduced suite
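A small sketch of how the frequency-based measures in Table 17 can be derived from the replication data is given below (hypothetical data and names; not the study's tooling):

```python
# Illustrative sketch: frequency with which each item (test case or fault)
# appears across the reduced suites produced by the replications.

from collections import Counter

def inclusion_frequencies(replications):
    """replications: list of sets, one per run (e.g., selected test case IDs
    or detected fault IDs). Returns the fraction of runs containing each item."""
    counts = Counter(item for run in replications for item in run)
    n_runs = len(replications)
    return {item: c / n_runs for item, c in counts.items()}

runs = [{"tc1", "tc4"}, {"tc1", "tc3"}, {"tc1", "tc4"}]
freqs = inclusion_frequencies(runs)
print(freqs)                               # tc1 appears in every run, tc3 in one
print(sum(freqs.values()) / len(freqs))    # average inclusion frequency
```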

For both specifications, we observed that Lev and Sel have a larger variation in the sets of test cases that make up the reduced suite when compared to Jac and SF. Moreover, in general, the number of faults they detect at least once is greater than for the other functions, that is, they may occasionally achieve a much better FC. However, on average, the number of faults covered by each reduced test suite is small, making them less reliable (Table 16). The variation is due to the large number of ties among similarity degrees in the matrix, which makes it possible for failing test cases not selected by the SF and Jac reductions to be selected as a result of a random choice. As in the experiments, Lev and Sel present comparable behavior.

8 Related work

8.1 Test suite reduction and prioritization by dependency analysis and clustering in the context of MBT

While it is out of the scope of this paper to compare existing and potential approaches to test suite reduction in the context of MBT, it is worth mentioning their differences. Korel et al. (2002) present an approach to regression test suite reduction based on dependence analysis of EFSM models. The approach identifies the difference between the original and modified models as a set of elementary modifications. For each elementary modification and for each test in the regression test suite, it also identifies interaction patterns based on the dependence analysis. Then, it uses the interaction patterns to reduce the regression test suite: if more than one test case covers the same interaction patterns, only one is chosen. Additionally, Chen et al. (2007) extend this idea by revising and identifying more interaction patterns.

Besides dependence analysis, in the scope of test case prioritization, Yoo et al. (2009) propose clustering test cases to achieve effective prioritization using expert knowledge. They cluster test cases based on their dynamic runtime behavior with the goal of reducing the number of pairwise comparisons required by the Analytic Hierarchy Process algorithm. Moreover, Arafeen and Hyunsook (2013) propose test case prioritization using requirements-based clustering: clusters of requirements are created, and the requirements–test cases traceability matrix is used to associate the test cases with each requirement cluster. Furthermore, Leon and Podgurski (2003) suggest a simple combination of distribution-based and coverage-based techniques for filtering and prioritizing test cases, achieving higher defect detection efficiency by selecting the most different test cases.

On the other hand, in this paper, we focus on the use of distance functions to measure the similarity of test cases as a basis for choosing the most different test cases to compose the reduced test suite. Even though distance functions can measure the similarity of test cases for different purposes, we apply them in the scope of a similarity strategy for test suite reduction proposed by Coutinho et al. (2013) to study their effect on the reduction problem.

8.2 Selection, reduction, and prioritization based on similarity in the context of MBT

A number of studies have been conducted to use the similarity between test cases for addressing the test suite size problem. Generally, the studies presented in the literature focus on test case selection and prioritization. Moreover, most of them focus on code-level coverage of the test cases.

Hemmati and Briand (2010) present a preliminary investigation comparing the effect of similarity measures for test case selection, using six distance functions in a similarity-based selection technique in the MBT context. The conclusion is that the distance function has a very significant effect on fault detection capability. In another study, Hemmati et al. (2013) present an extension of the previous research. In this work, a total of 320 different techniques were evaluated, obtained from combinations of eight distance functions, four encodings of abstract test cases, and 10 minimization algorithms. In order to compare the best similarity-based selection technique with other common selection techniques in the literature, two case studies were performed. The results confirm that the fault detection capability can be influenced by the choice of the distance function. However, the goal of their work was to evaluate the configurations of parameters when applied together rather than to investigate distance functions in particular. Also, they do not focus on the test suite reduction problem.

A comparative study of distance functions for test case prioritization is presented by Ledru et al. (2009, 2012). The work proposes comparing the text of test cases by using string distances, that is, treating each pair of test cases as two strings and computing a distance between them. For this, four classical string distances are compared using a simple greedy algorithm in an experiment that compares the resulting orderings with a random ordering of test cases. The results suggest that test suites prioritized using string distances are more efficient than randomly ordered test suites.

Researchers have also investigated the use of different distance functions in several well-known techniques based on Adaptive Random Testing (ART). The idea of ART was initially introduced by Chen et al. (2005) to replace random testing for test case generation, based on the distance between two test cases. These investigations focus on selection and prioritization techniques aiming to improve FC effectiveness, particularly at code level, as presented in (Ciupa et al. 2008; Jiang et al. 2009; Zhou 2010).

On the other hand, few similarity-based test suite reduction strategies have been proposed in the MBT context. Furthermore, to the best of our knowledge, there are no studies comparing the effectiveness of distance functions applied to test suite reduction strategies for MBT.

9 Conclusions

This paper presents the results of empirical studies with the goal of comparing distance functions when applied to a similarity-based test suite reduction strategy in the context of MBT. The idea is to provide evidence of the impact that the choice of a function can have on the performance of the strategy regarding suite size reduction, FC, and stability. Results show that the choice has little influence on SSR, but it can more significantly influence FC and stability. The reason is that each function leads to the selection of a different suite, and it is possible to have significant variations in this selection. To provide further evidence and deeper observation, we conduct a case study in the scope of a real-world application under development that has a configuration different from the ones previously considered in the experiments. The results from this study are comparable to the ones obtained in the experiments regarding the effect produced by the functions on SSR, fault coverage, and stability, as well as the pattern of related behavior of some functions (Lev and Sel). Additionally, in the case study, we can also observe the stability of the reduction strategy, when considering different functions, in terms of the number of different sets of faults and the fault detection frequency.

Even though no definite conclusions can be reached, as the context of the experiments and case study is specific, for the model configurations investigated, the SF function promotes the best stability, followed by Jac, Jaro, and JW. On the other hand, Lev and Sel present a relatively lower stability. Moreover, Jac often presents the best performance by optimizing the relationship between stability and fault coverage.

It is important to highlight that the number of paths with loops and the number of essential test cases in the specification configuration also have an impact on the results of the reduction technique. When the number of paths with loops is high, the degree of redundancy in the test suite is likely to be high. Therefore, the reduction strategy can be more effective w.r.t. size and consequently less effective w.r.t. FC. When the number of essential test cases is high, the observations are the opposite. Nevertheless, this is the behavior expected from a similarity-based reduction strategy, as the average changes in rate are relatively similar when considering all functions.

Another interesting issue is the difference in the similarity degree assigned to a given pair of test cases by the different distance functions. These differences have a direct influence on the order in which the strategy evaluates pairs of test cases and, consequently, on the set of test cases that makes up the reduced test suite. This might explain why a given test case is never part of the reduced suites produced with one distance function, but always appears with another.

From the results obtained in this work, we can have an overview of the behavior of the distance functions and their effect on similarity-based test suite reduction, even though no definite conclusions can be reached yet. Besides, the results can motivate further investigation in the area, for instance regarding improvements to distance functions to better suit the test suite reduction problem. The choice of a distance function can clearly influence FC and the stability of a similarity-based strategy. On the other hand, fault coverage seems to be related to SSR independently of the choice of the function. This motivates further investigation of how to improve the reduction strategy as well. Furthermore, executing more case studies and experiments using other configurations as input to the LTS generator is part of our future work, as well as evaluating the distance functions by using other metrics.