Feature causality



Introduction
Most of today's computing systems are configurable, offering a wide variety of configuration options and parameters through which users can control the functionalities of the system. A common view on configuration options is by features that encapsulate optional or incremental units of functionality (Zave, 2001) and pave the way for the well-established concept of feature-oriented software engineering (Apel et al., 2013). The features of a configurable system can influence critical functional properties such as safety, and non-functional properties such as low probability of failure or performance. Often, the configuration spaces are exponential in the number of features, rendering the detection, prediction, and explanation of defects and inadvertent behavior challenging tasks. To address these challenges, we introduce feature causality, which identifies those partial configurations, called feature causes, that are specifically the causes for effect events. As with HP causality, feature causality assumes full information about the situations where the effect emerges.
To enable causal reasoning also on incompletely specified sets of effect configurations or sets of valid system configurations, we extend this view and present a generalization of feature causes by means of feature precauses. Incomplete specifications of sets of configurations arise, e.g., when variability spaces are not known (Thüm, 2020), when the analysis of the effect involves noise, e.g., when using approximative methods for determining non-functional effect properties, or when an exhaustive analysis is impossible or infeasible, e.g., when relying on real-world bug reports or testing.
Relevant analysis and reasoning tasks that we then address include determining the set of feature causes for a bug to emerge, the degree to which some configurations are responsible for bad system performance, how robust feature causes are w.r.t. uncertainty in emerging effects, and which features necessarily have to interact for inadvertent behavior.
We develop techniques to perform such tasks in a generic way, i.e., independent of language-, architecture-, or environment-specific properties. In fact, our methods are applicable in combination with any effective method to analyze or test variability-aware properties that describe the effect for which reasons are of interest. To this end, causal reasoning on both variability-aware white-box and black-box analyses is supported, complementing existing causal reasoning techniques for the detection of root causes: Approaches such as delta debugging (Zeller, 2002; Cleve and Zeller, 2005), causal testing (Johnson et al., 2020), or causal trace analysis (Beer et al., 2012) require a white-box analysis that operates at the level of code and are not variability-aware. Hence, they usually would have to be applied on a multitude of system configurations for a variability-aware causal analysis, suffering from the combinatorial blowup well known to arise in configurable systems.
Explications from feature causality. Since features correspond to system functionalities specified by software engineers, they often have a dedicated meaning in the target application domain (Apel et al., 2013). To this end, defects (and other behaviors of interest) detected at the level of features can provide important insights for the resolution of variability bugs (Garvin and Cohen, 2011; Rhein et al., 2018; Abal et al., 2018) and configuration-dependent behavior (Siegmund et al., 2012; Siegmund et al., 2015; Guo et al., 2018; Nair et al., 2020). They hence are certainly more informative and actionable than low-level program traces that do not include variability information. We introduce and discuss several means to explicate reasons for properties that can be obtained from feature causes and precauses: concise logic formulas by a new method called distributive law simplification (DLS), cause-effect covers, feature interactions, and causal measures by means of responsibility and blame (Chockler and Halpern, 2004). With explicated feature causality at hand, developers may choose to focus on those feature implementations identified as root causes of bugs or simply disallow or coordinate the activation of certain features when defects are related to them. We envision applications of feature causality in those development phases where analysis methods are used, for instance, in software product line engineering. Also in production-level deployments, our techniques shall be useful to optimize software through causally relevant configurations.
Evaluation. We present algorithms to compute feature causes, feature precauses, and causal explications for them. Our prototypical implementation relies on binary decision diagrams (BDDs) (Bryant, 1986) and the computation of prime implicants using the de-facto standard two-level logic minimizer Espresso (McGeer et al., 1993). By means of an analysis of several configurable systems, including community benchmarks and real-world systems, we investigate feature causes and their properties. We demonstrate that our notion of feature causes and methods to represent them help to pinpoint features relevant for the configurable system's properties and illustrate how feature interactions can be detected and quantified. In particular, our evaluation is driven by the following research questions: (RQ1) Can feature causes be effectively computed in real-world settings and support the detection of reasons for different effects of interest? (RQ2) Do DLS representation, cause-effect covers, and degrees of responsibility and blame provide concise causal explications? (RQ3) Is feature causality beneficial for guiding the configuration of systems under variability-aware constraints? (RQ4) Can feature interactions and configuration-dependent anomalies be detected and isolated based on feature causality? (RQ5) To what extent can feature precauses for underspecified effect sets already provide insights on causal relationships?
Contributions. In summary, our contributions are: (1) We introduce the notion of feature causality inspired by the well-established counterfactual definition of actual causality by Halpern and Pearl (2001a) and Halpern (2015).
(2) We show that feature causes for effects given as sets of configurations coincide with certain prime implicants that cover the effect configurations, leading to an algorithm to effectively compute all feature causes (Strzemecki, 1992).
(3) We extend feature causality and algorithms to support incomplete specifications and uncertainties of effects and configuration space, leading to feature precauses. (4) We provide methods to interpret and represent feature causes and precauses by propositional formulas, cause-effect covers, responsibility and blame, and potential feature interactions. (5) We offer a BDD-based prototype to compute and represent feature causes, feature precauses, and feature interactions. (6) We conduct several experiments illustrating how to determine and reason about feature causes in different realistic settings.
About this article. This article is an extended version of the conference publication titled ''Causality in Configurable Software Systems'' (Dubslaff et al., 2022). The main additional material not presented in the conference version comprises the definition, computation, and evaluation of feature precauses (see above (3) and parts of (4), (5), and (6)).
Besides this, the article provides full proofs, additional technical details and examples, a formal comparison to the binary case of HP causality, as well as discussions on cause-effect prime covers and their relation to minimal sums of products from circuit optimization. The source code of our implementation and raw data to reproduce our experiments are publicly available (Dubslaff et al., 2023a,b).

Background
In this section, we revisit basic concepts and notions from logics and configurable systems used throughout the paper.
Interpretations. A partial interpretation over a set X is a partial mapping ∂ : X ⇀ {⊤, ⊥}. We denote by supp(∂) the support of ∂, i.e., the set of all elements x ∈ X where ∂(x) is defined. We say that ∂ is a total interpretation if supp(∂) = X and denote by P(X) and I(X) the set of partial and total interpretations over X, respectively. Given a partial interpretation ∂ ∈ P(X), we define its semantics [[∂]] ⊆ I(X) as the set of all total interpretations ι ∈ I(X) where for all x ∈ supp(∂) we have ι(x) = ∂(x). For a set of partial interpretations D ⊆ P(X), we define [[D]] as the union of [[∂]] over all ∂ ∈ D. The x-expansion of a partial interpretation ∂ ∈ P(X) is the partial interpretation ∂↑x ∈ P(X) where supp(∂↑x) = supp(∂) ⧵ {x} and where ∂↑x(y) = ∂(y) for all y ∈ supp(∂) ⧵ {x}.
We formalize switching of polarities in interpretations, i.e., mapping of ⊤ to ⊥ assignments and vice versa, by a function swap : ℘(X) × P(X) → P(X) where for any Y ⊆ X and ∂ ∈ P(X), we have swap(Y, ∂)(x) = ∂(x) if x ∉ Y and swap(Y, ∂)(x) = ¬∂(x) otherwise.

Covers and prime implicants. Let C, D0, D1 ⊆ P(X) be three sets of partial interpretations. We say that D1 is covered by C (or alternatively: C is a cover of D1) if [[D1]] ⊆ [[C]]. Further, C is a *-cover of D1 relative to D0 if [[D1]] ⊆ [[C]] and [[C]] ∩ [[D0]] = ∅. Here, [[D1]] and [[D0]] intuitively stand for the sets of interpretations for which a Boolean function is known to be ⊤ and ⊥, respectively. The set of all other interpretations I(X) ⧵ [[D0 ∪ D1]] is the *-set, constituting the set of unknown or ''don't care'' interpretations. Note that C is a cover of D1 iff C is a *-cover of D1 relative to ∅ and hence, *-covers can be seen as a more general form of covers. A *-cover C of D1 relative to D0 is minimal if no proper subset of C is a *-cover of D1 relative to D0. Note that minimal *-covers are not uniquely defined and there can be multiple minimal *-covers C for a given D0 and D1. A partial interpretation ∂ is a prime implicant relative to D0 if [[∂]] ∩ [[D0]] = ∅ and [[∂↑x]] ∩ [[D0]] ≠ ∅ for all x ∈ supp(∂); a *-cover is prime if it contains only prime implicants. Our notions also extend to C, D0, or D1 being singletons (or sets of total interpretations), e.g., a partial interpretation ∂ is a cover of D1 if {∂} is. We denote by P(D1, D0) the set of prime *-covers of D1 relative to D0 and by Pmin(D1, D0) the minimal prime *-covers of D1 relative to D0.

Configurable systems. A widely adopted concept to model configurable systems is by means of features (Apel et al., 2013). Features encapsulate optional or incremental units of functionality (Zave, 2001) and describe commonalities and variabilities of whole families of systems. At an abstract level, we identify Boolean configuration options with features of the system and fix a set of features F. We call a total interpretation c ∈ I(F) over F a configuration, which we usually describe by listing the selected features, i.e., the features f ∈ F where c(f) = ⊤. The set of valid configurations V ⊆ I(F) comprises those configurations for which there exists a corresponding system implementation. A partial interpretation over F is called a partial configuration.
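To make these definitions concrete, the following Python sketch (our own illustration, not part of the paper's BDD-based implementation; all identifiers are ours) represents partial interpretations as dictionaries and checks the cover relation by explicit enumeration:

```python
from itertools import product

def total_interpretations(X):
    """I(X): all total interpretations over the variable set X, as dicts."""
    xs = sorted(X)
    return [dict(zip(xs, vals)) for vals in product([False, True], repeat=len(xs))]

def semantics(partial, X):
    """[[partial]]: total interpretations over X agreeing with partial on its support."""
    return [t for t in total_interpretations(X)
            if all(t[x] == v for x, v in partial.items())]

def covers(C, D1, X):
    """C covers D1 iff [[D1]] is a subset of [[C]]."""
    key = lambda t: tuple(sorted(t.items()))
    sem = lambda S: {key(t) for p in S for t in semantics(p, X)}
    return sem(D1) <= sem(C)

X = {"x", "y"}
# the partial interpretation x=true covers the total interpretation x=true, y=false
print(covers([{"x": True}], [{"x": True, "y": False}], X))   # True
print(covers([{"x": True, "y": True}], [{"x": True}], X))    # False
```

Enumeration is exponential in |X| and serves illustration only; symbolic representations such as BDDs avoid this blowup in practice.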

Example 1. As the running example, consider a simple email system over features F = {m, s, e, c, a, r}, formalizing the base email system functionality (m), optional features for signing (s) and encryption (e), and the encryption methods Caesar (c), AES (a), and RSA (r). Valid configurations are usually specified through feature diagrams (Kang et al., 1990). These are hierarchical structures over features describing the constraints imposed on configurations to render them valid. An example of such a diagram for our running example is provided in Fig. 1. In essence, the feature at the root of the diagram always has to be included in any valid configuration, here ''m''. If some child is active in a configuration, then also its parent has to be included. The other way around, if no • is drawn at the top of a node (indicating ''optional features''), then the activation of a parent feature also imposes activation of the child. Non-connected branches stand for logical conjunctions and connected branches for exclusive disjunctions over the connected children. Thus, for the encryption features, we assume that exactly one can be selected. The described variability constraints for the email system specified in the feature diagram of Fig. 1 lead to the valid configurations V = {m, ms, mec, mea, mer, msec, msea, mser}. ⋄
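The valid configurations of this example can be enumerated by brute force. The following Python sketch encodes our reading of the feature-diagram constraints of Fig. 1 (root m mandatory; exactly one encryption method iff e is selected) in a hypothetical `valid` predicate:

```python
from itertools import product

FEATURES = ["m", "s", "e", "c", "a", "r"]  # mail, sign, encrypt, Caesar, AES, RSA

def valid(c):
    """Feature-diagram constraints: root m is mandatory, and exactly one of
    {c, a, r} must be selected iff the encryption feature e is selected."""
    if not c["m"]:
        return False
    return sum([c["c"], c["a"], c["r"]]) == (1 if c["e"] else 0)

configs = [dict(zip(FEATURES, v)) for v in product([False, True], repeat=6)]
V = [c for c in configs if valid(c)]
names = sorted("".join(f for f in FEATURES if c[f]) for c in V)
print(names)  # the 8 valid configurations: m, ms, mea, mec, mer, msea, msec, mser
```

Out of the 2^6 = 64 total interpretations, exactly the eight configurations listed in Example 1 survive the constraints.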

Feature causality
The notion of causality has been extensively studied in philosophy, social sciences, and artificial intelligence (Good, 1959; Eells, 1991; Pearl, 2009; Williamson, 2009). We focus here on actual causality, describing binary causal relationships between cause events and effect events. Halpern and Pearl formalized actual causality based on the concept of counterfactual dependencies (Lewis, 1973) using a structural-equation approach (Halpern, 2015; Halpern and Pearl, 2001a,b). The idea of counterfactual reasoning (Wachter et al., 2017) relies on the assumption that the effect would not have happened if the cause had not happened before, which corresponds to the ''but-for'' test used in law (Spellman and Kincannon, 2001; Wachter et al., 2017).
In this section, we take inspiration from the definition by Halpern and Pearl (2001a) and Halpern (2015) to establish a notion of causality at the level of features. Here, we interpret the selection of features as events considered for actual causality. The basic reasoning task we address then amounts to determining those feature selections that cause a given effect property. Examples of effect properties are ''the execution time is longer than five minutes'' or ''the system crashes''.
We assume the effect property to be described as an effect set E ⊆ V of valid configurations for which the effect property can be observed. Elements of E are called effect instances. All other valid configurations in V ⧵ E are assumed not to exhibit the effect (called non-effect instances). Feature selections are naturally specified by partial configurations. Clearly, a partial configuration ∂ can only be a cause of the effect if ∂ ensures the effect to emerge, i.e., all valid configurations that are covered by ∂ are effect instances. Furthermore, following counterfactual reasoning, we require for ∂ being a cause that, if we selected features of ∂ differently, there might be a configuration for which the effect does not emerge. These two intuitive conditions on causality are reflected in our formal definition of causes of E w.r.t. V:

Definition 1. A feature cause of an effect E w.r.t. valid configurations V is a partial configuration ∂ ∈ P(F) where
(FC1) ∅ ≠ [[∂]] ∩ V ⊆ E, and
(FC2) [[∂↑f]] ∩ (V ⧵ E) ≠ ∅ for all f ∈ supp(∂).
Causes(E, V) denotes the set of all causes for E w.r.t. V.
In case (FC1) holds for a partial configuration ∂, we say that ∂ is sufficient for E w.r.t. V (Garvin and Cohen, 2011). This sufficiency is considered in the scope of [[∂]], since for configurations not contained in [[∂]] it usually cannot be decided in practice whether an effect emerges or not. The counterfactual nature of (FC2) ensures that for every feature cause ∂ and f ∈ supp(∂) there is a counterfactual witness n ∈ [[∂↑f]] ∩ (V ⧵ E). That is, a valid feature configuration where the effect does not emerge but where changing one feature selection may yield an effect instance. Note that (FC2) ensures minimality of the feature cause w.r.t. its support, i.e., dropping conditions on interpretations of features necessarily leads to a partial configuration that is not sufficient for the effect anymore. In the formal definition of Halpern and Pearl causality (Halpern, 2015), counterfactuality and minimality are stated in two distinct conditions (see also Section 3.2). We usually denote configurations in E by e, counterfactual witnesses in V ⧵ E by n, and feature causes by ∂. Fig. 2 depicts the relation between valid configurations, effects, causes, and counterfactual witnesses.
Example 2. Let us continue our running example of the configurable email system introduced in Example 1. We consider an effect property reflecting ''long decipher time'', e.g., that it takes on average more than three months for an attacker to decrypt an email. Assume that this effect property can be observed in configurations in which AES or RSA are selected, i.e., E = {mea, mer, msea, mser}. Conversely, in all valid configurations in which AES and RSA are not selected, the effect does not emerge. In this setting, the encryption features AES and RSA are both causes since all valid configurations with either feature show the effect. Considered in isolation, AES and RSA are not necessary for the effect, as one can choose the other encryption feature (RSA or AES, respectively) to ensure the effect. The sign feature does not trigger the effect and is not a cause.
Interestingly, a further cause is given by selecting the encryption feature and explicitly deselecting the Caesar feature, illustrating that also explicitly not selecting features might be a cause of some effect.This hints at the fact that causes can be represented in different ways, addressed later in the paper.
Formalizing this intuition, we check conditions (FC1) and (FC2) for the three partial configurations ∂a, ∂r, and ∂ec̄ given by supp(∂a) = {a} with ∂a(a) = ⊤, supp(∂r) = {r} with ∂r(r) = ⊤, and supp(∂ec̄) = {e, c} with ∂ec̄(e) = ⊤ and ∂ec̄(c) = ⊥. Hence, m is a counterfactual witness for ∂a, ∂r, and ∂ec̄ w.r.t. a, r, and e, respectively, while mec can serve as such for ∂ec̄ and c. It is easy to check that there are no further feature causes since for all other partial configurations sufficient for E w.r.t. V (see (FC1)) there are expansions towards ∂a, ∂r, or ∂ec̄ (thus, violating (FC2)). ⋄
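Both conditions can be checked by brute force over the configuration space. The following Python sketch (our own illustration, not the paper's algorithm) encodes (FC1) as "the partial configuration covers at least one valid configuration and only effect instances" and (FC2) as "dropping any single feature condition admits a non-effect witness", recovering exactly the three causes of this example:

```python
from itertools import product

FEATURES = ["m", "s", "e", "c", "a", "r"]

def valid(c):
    """Feature-diagram constraints of the running example (our reading)."""
    return c["m"] and sum([c["c"], c["a"], c["r"]]) == (1 if c["e"] else 0)

configs = [dict(zip(FEATURES, v)) for v in product([False, True], repeat=6)]
V = [c for c in configs if valid(c)]
E = [c for c in V if c["a"] or c["r"]]      # ''long decipher time'' effect

def sem(partial, universe):
    """Configurations in the universe covered by the partial configuration."""
    return [c for c in universe if all(c[f] == v for f, v in partial.items())]

def is_cause(d, E, V):
    covered = sem(d, V)
    # (FC1): d covers a valid configuration, and only effect instances
    if not covered or not all(c in E for c in covered):
        return False
    # (FC2): dropping any feature condition admits a non-effect witness
    for f in d:
        relaxed = {g: v for g, v in d.items() if g != f}
        if not any(c not in E for c in sem(relaxed, V)):
            return False
    return True

partials = [{f: v for f, v in zip(FEATURES, vals) if v is not None}
            for vals in product([None, False, True], repeat=6)]
causes = [d for d in partials if is_cause(d, E, V)]
print(len(causes))   # 3: AES, RSA, and encryption-without-Caesar
```

Enumerating all 3^6 partial configurations is only feasible for tiny feature sets; the prime-implicant-based algorithm presented later avoids this naive search.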

Effect properties and effect sets
Our definition of feature causality relies on a given effect set, which is assumed to comprise all those valid configurations where the effect property holds. We now elaborate more on how to obtain effect sets from analyzing configurable systems. In fact, our generic definition supports a multitude of effect properties for which the only assumption is that there is an effective method to determine all configurations in which the effect property holds. Such methods include variability-aware white-box analyses (Weber et al., 2021; Velez et al., 2021) or formal analysis through model checking (Cordy et al., 2013a; Chrszon et al., 2018) where the source code or operational behavior of system variants is accessible. Further, black-box analyses are also eligible to specify effect sets, where only the behavior of the system can be observed without knowledge about the inner workings, relying on testing or sampling (Guo et al., 2018; Kaltenecker et al., 2019). In the following paragraphs, we exemplify how to obtain effect sets from analysis results. The discussed effect properties reflect the instances of the experimental evaluation section (see Section 5) and do not claim to be exhaustive.
Functional properties. To reason about causality w.r.t. functional properties, the effect set can be determined by variability-aware static analysis (Beek et al., 2019; Rhein et al., 2018) or model checking (Plath and Ryan, 2001; Classen et al., 2013; Apel et al., 2013). In the latter case, effect properties can be formalized, e.g., in a temporal logic such as LTL (Pnueli, 1977) or CTL (Clarke et al., 1986). Model checking configurable systems against LTL and CTL properties has broad tool support (e.g., Classen et al., 2012; Cordy et al., 2013a). Given a formula φ that specifies the effect property, these tools return all the valid configurations c ∈ V whose corresponding system variants satisfy φ, i.e., the effect set E = {c ∈ V : the variant for c satisfies φ}. Since model checking is based on an exhaustive analysis, an analysis also exposes those valid configurations for which the effect property does not hold. The same is possible for variability-aware static analysis (Rhein et al., 2018; Bodden et al., 2013).
Non-functional properties. Besides functional properties, also non-functional properties of configurable systems can serve as effect property and give rise to an effect set. Let q : V → ℝ be a function that results from a quantitative analysis of the configurable system in question, providing a quantitative measure for all valid configurations. Values q(c) for a valid configuration c ∈ V may stand for the performance achieved, the probability of failure, or the energy consumed in the system variant that corresponds to c. To obtain q for real-world systems, Siegmund et al. (2015, 2012) presented a black-box method to generate linear-equation models for performance measures by multivariable linear regression on sampled configurations. Other black-box approaches rely on regression trees (Guo et al., 2018), Fourier learning (Zhang et al., 2015), or probabilistic programming (Dorn et al., 2020). Related white-box approaches use insights of local measurements and taint analysis information (Velez et al., 2021) or profiling information (Weber et al., 2021).
An orthogonal formal white-box analysis on operational models with quantitative information (such as probabilities, costs, etc.) is provided through variability-aware probabilistic model checking (Dubslaff et al., 2015; ter Beek et al., 2016). Effect properties for such approaches are specified in quantitative variants of temporal logic such as probabilistic CTL (Hansson and Jonsson, 1994). These approaches have been implemented in tools like ProFeat (Chrszon et al., 2018) and QFLan (Vandin et al., 2018) and have shown practical applicability in various experimental studies.
Given q resulting from one of the analysis approaches mentioned above, an effect set can be specified by imposing a threshold t ∈ ℝ combined with a comparison relation ∼, yielding the threshold effect set E∼t = {c ∈ V : q(c) ∼ t}.

Example 3. In Example 2, we informally specified the effect of a ''long decipher time'' as taking more than three months to decrypt an email without having the encryption key available. By a variability-aware quantitative analysis of the email system, we may obtain a function q that, for a configuration c, returns the minimal time in years to decipher an email sent with the system variant corresponding to c. Analysis results could be, e.g., q(c) = 0 with no encryption, q(c) = 10⁻⁷ with Caesar, q(c) = 1 with AES, and q(c) = 2 with RSA selected in c, respectively. Then, E>0.25 provides the effect set E of Example 2. ⋄

On computing effect sets. The effect set and the set of valid configurations can be of exponential size in the number of features. An efficient computation of these sets depends on the analysis techniques used and is independent of our causal framework. However, specifically tailored variability-aware analysis techniques can tackle the exponential blowup, e.g., through symbolic representation of family models (Thüm et al., 2014; Dubslaff, 2019).
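Deriving a threshold effect set from such a quantitative analysis is then a simple filter. The following Python sketch uses the hypothetical decipher times of Example 3, with configurations named by their selected features:

```python
import operator

# hypothetical analysis results q(c): minimal decipher time in years
q = {"m": 0.0, "ms": 0.0,
     "mec": 1e-7, "msec": 1e-7,   # Caesar
     "mea": 1.0, "msea": 1.0,     # AES
     "mer": 2.0, "mser": 2.0}     # RSA

def threshold_effect_set(q, rel, t):
    """E_{~t}: all valid configurations c with q(c) ~ t."""
    return {c for c, v in q.items() if rel(v, t)}

E = threshold_effect_set(q, operator.gt, 0.25)   # q(c) > 0.25 years
print(sorted(E))   # ['mea', 'mer', 'msea', 'mser']
```

Here, q(c) > 0.25 (three months) recovers exactly the effect set of Example 2; note that noisy estimates of q close to the threshold would make this classification uncertain, which motivates the precauses introduced later.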

Relation to HP causality
The original definition of actual causality by Halpern (2015) relies on a structural-equation approach and comprises three conditions: effectiveness, counterfactuality, and minimality. Compared to our definition of feature causes presented in Definition 1, their definition supports non-Boolean evaluation of variables and the distinction between endogenous and exogenous variables, specifying variables inherently contained in the system and those that can be subject to external influences, respectively. Directly transferring their notion of actual causality to the setting of configurable systems leads to the definition of HP feature causality given in Definition 2. In the following, we draw the connection between HP feature causes and our definition of feature causes provided in Definition 1. First, observe that for the counterfactuality condition (FCb) of HP feature causes, a single counterfactual witness suffices:

Lemma 1. V ⧵ E ≠ ∅ iff (FCb) holds for some ∂ ∈ P(F).

Proof. (⇒):
Proof. First, let V ⧵ E ≠ ∅. (FCa) coincides with (FC1). Due to Lemma 1 we also have (FCb). The claim then directly follows in combination with Lemma 2. If V ⧵ E = ∅, then (FCb) is violated (see Lemma 1) and hence, there is no HP feature cause for E w.r.t. V. Further, (FC2) can only be satisfied for a ∂ ∈ P(F) if supp(∂) = ∅, and (FC1) can only be satisfied if E ≠ ∅. Thus, we have Causes(E, V) = {∂} with supp(∂) = ∅. □

It is merely a philosophical discussion whether one allows for empty causes, i.e., feature causes with empty support that cover all valid feature configurations. We defined feature causes allowing for an empty support to distinguish the case where no causal dependencies arise, which is the case of an empty effect set, from the case where all valid configurations show the effect. Otherwise, one could not distinguish whether there are no causal dependencies because all valid configurations show the effect or because none of them does.
From effect configurations to effect sets. Our definition of HP feature causes can be embedded into HP causality on binary domains, allowing for structural equations on Boolean variables for feature selection (Halpern, 2015). Given an effect property as a propositional formula, actual causes are then considered w.r.t. a contingency that plays a similar role to a single effect configuration in our setting. Causes w.r.t. a contingency can then easily be extended to causes w.r.t. an effect set.

Computation of feature causes
For a given effect set E and a set of valid configurations V along with a partial configuration ∂, Definition 1 directly provides a polynomial-time algorithm to decide whether ∂ is a cause of E w.r.t. V by checking (FC1) and (FC2). From this, we obtain a simple approach to compute the set Causes(E, V) by successively checking expansions for sets of features applied on elements in E as candidates for causes. Since there might be exponentially many such expansions, this approach quickly becomes infeasible already for a small number of features.
We now present a practical algorithm to compute the set of causes, which relies on a connection to the notion of prime implicants (see Section 2). In the setting of features, a prime implicant of a set of partial configurations D ⊆ P(F) is a partial configuration ∂ ∈ P(F) where D covers ∂ and [[∂↑f]] ⊈ [[D]] for all f ∈ supp(∂). Here, observe the similarity to (FC2) of Definition 1. Notably, this connection is not immediately visible when considering HP causality (see Definition 2). Towards establishing the feature cause computation algorithm, we first observe that every feature cause is an implicant of (I(F) ⧵ V) ∪ E due to (FC1) and even a prime implicant due to (FC2). Conversely, every prime implicant ∂ of (I(F) ⧵ V) ∪ E for which [[∂]] ∩ E ≠ ∅ is a cause due to (FC1). This directly suggests an algorithm to compute causes via prime implicants: Algorithm 1 first generates prime implicants as cause candidates and then removes those candidates that are not sufficient for E w.r.t. V. Fig. 2 reflects this situation with two prime implicants, one being a cause and one not: at least one effect instance is covered by the former, while this is not the case for the latter, which hence would be removed by Algorithm 1.
Algorithm 1: Computation of feature causes. Input: E, V ⊆ I(F) with E ⊆ V. Output: Causes(E, V).

Let us now turn to the complexity of Algorithm 1. It is well known that the computation of prime implicants can be done in polynomial time (Strzemecki, 1992) in the number of interpretations of the input set. Hence, the candidate set computed in Line 1 is of size polynomial in |(I(F) ⧵ V) ∪ E|, which is smaller than |I(F)|. Furthermore, since the emptiness check in Line 2 can be done by simply iterating over all elements of E, the whole algorithm runs in polynomial time in |I(F)|. □ Note that the sets of valid and effect configurations can both be exponential in the number of features and there might be exponentially many prime implicants (Chandra and Markowsky, 1978) in the worst case. Hence, Algorithm 1 is exponential in the number of features.
Example 4. Let us illustrate the computation of the feature causes of Example 2 by Algorithm 1. First notice that (I(F) ⧵ V) ∪ E comprises 60 feature configurations. The prime implicants for this set are computed in Line 1, which yields the candidates ∂m̄, ∂a, ∂r, ∂ec̄, and ∂ēc. Here, we used notations as in Example 2, e.g., supp(∂ēc) = {e, c}, ∂ēc(e) = ⊥, and ∂ēc(c) = ⊤. Clearly, all configurations covered by ∂m̄ or ∂ēc are not valid and hence also no effect instances. Thus, they are removed in Line 2, leading to Causes(E, V) = {∂a, ∂r, ∂ec̄}. ⋄
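A brute-force rendering of this prime-implicant approach in Python (our sketch; the paper's implementation instead uses BDDs and Espresso) computes the prime implicants of (I(F) ⧵ V) ∪ E by enumeration and then filters out candidates that cover no effect instance:

```python
from itertools import product

FEATURES = ["m", "s", "e", "c", "a", "r"]
key = lambda c: tuple(sorted(c.items()))

def valid(c):
    """Feature-diagram constraints of the running example (our reading)."""
    return c["m"] and sum([c["c"], c["a"], c["r"]]) == (1 if c["e"] else 0)

universe = [dict(zip(FEATURES, v)) for v in product([False, True], repeat=6)]
V = [c for c in universe if valid(c)]
E = [c for c in V if c["a"] or c["r"]]

def sem(partial):
    return [c for c in universe if all(c[f] == v for f, v in partial.items())]

def prime_implicants(S):
    """Partial configurations covered by S whose every one-feature
    relaxation is no longer covered by S (brute force)."""
    S_keys = {key(c) for c in S}
    inside = lambda d: all(key(c) in S_keys for c in sem(d))
    primes = []
    for vals in product([None, False, True], repeat=len(FEATURES)):
        d = {f: v for f, v in zip(FEATURES, vals) if v is not None}
        if inside(d) and all(
                not inside({g: w for g, w in d.items() if g != f}) for f in d):
            primes.append(d)
    return primes

# Line 1: candidates = prime implicants of (I(F) \ V) ∪ E
V_keys, E_keys = {key(c) for c in V}, {key(c) for c in E}
S = [c for c in universe if key(c) not in V_keys or key(c) in E_keys]
candidates = prime_implicants(S)
# Line 2: keep only candidates covering at least one effect instance
causes = [d for d in candidates if any(key(c) in E_keys for c in sem(d))]
print(len(candidates), len(causes))   # 5 3
```

On the running example, the five candidates include the two spurious prime implicants covering only invalid configurations (m deselected, and Caesar without encryption), which the filter in Line 2 removes.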

Effect uncertainty and feature precauses
For our notion of feature causality introduced at the beginning of this section (see Definition 1), we assumed full information about the set of valid configurations V and its partition into configurations E that show an effect and those V ⧵ E that do not. While this assumption is reasonable when applied in the context of variability-aware exhaustive analysis (see Section 3.1), there are situations where effect and non-effect configurations cannot be ultimately separated: (1) The set of valid configurations is unknown, irrelevant, or cannot be explicitly constructed.
(2) There is incomplete information about the set of effect configurations due to non-exhaustive analysis methods, e.g., variability-aware testing or limits on analysis resources such as runtime. (3) The analysis of the effect involves noise, e.g., when using approximative methods for determining non-functional effect properties.
The first case arises, e.g., when the configuration spaces are of such sizes that formal reasoning about valid configurations is infeasible (Thüm, 2020). The second case boils down to not having information about all effect configurations at hand. This also covers practically relevant situations where the effect is only partly described, e.g., through bug reports where users report the same bug within different configurations (see Section 3.1). The third case is in particular relevant for non-functional effect properties, e.g., when variability-aware probabilistic model checking or approximative regression methods are chosen as analysis methods to investigate threshold effect sets (see Section 3.1). Then, an exact decision whether the given threshold is met cannot be made due to noise: quantities close to the threshold cannot be decided upon within the precision of the analysis method.
While feature causality cannot directly be used to pinpoint those configuration options that are the reasons for such effects with uncertainty, we can still follow the very same counterfactual reasoning to establish candidates for causes given the limited access to information, which we call feature precauses.

Definition 3. A feature precause for effect configurations E* ⊆ P(F) w.r.t. non-effect configurations N* ⊆ P(F) where [[E*]] ∩ [[N*]] = ∅ is a partial configuration ∂ ∈ P(F) where
(FPC1) [[∂]] ∩ [[E*]] ≠ ∅ and [[∂]] ∩ [[N*]] = ∅, and
(FPC2) [[∂↑f]] ∩ [[N*]] ≠ ∅ for all f ∈ supp(∂).
We denote by PCauses(E*, N*) the set of all precauses for effect configurations E* w.r.t. non-effect configurations N*.
Note the close relation of the effectiveness condition in HP feature causes (see (FCa) in Definition 2) and (FPC1). Stated in words, precauses provide minimal conditions on feature selection or deselection (see (FPC2)) such that it is possible for the effect to show, while not covering any configuration where the effect is known not to emerge (see (FPC1)). Our definition of precauses also implements counterfactual reasoning by (FPC2), since relaxing one of the feature selection conditions directly allows for a possible non-effect configuration, which serves as counterfactual witness.
Feature precauses indeed constitute an extension of feature causes, i.e., in case the set of valid configurations can be partitioned into effect and non-effect configurations, precauses coincide with causes. Even more, if the set of valid configurations is known to be V, every precause for an effect E* w.r.t. the non-effect configurations V ⧵ [[E*]] that covers a valid configuration is a feature cause of [[E*]] ∩ V w.r.t. V.
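Under incomplete information, the same brute-force style of checking applies. The following Python sketch is our illustration on a hypothetical observation for the running example (effects mea and mser reported, non-effects m and msec observed); it checks the conditions as stated in words above, namely that a precause must cover an observed effect, must not cover an observed non-effect, and that relaxing any of its feature conditions must admit an observed non-effect configuration:

```python
from itertools import product

FEATURES = ["m", "s", "e", "c", "a", "r"]

def config(selected):
    """Total configuration from the string of selected (single-letter) features."""
    return {f: f in selected for f in FEATURES}

E_star = [config("mea"), config("mser")]   # observed effect configurations
N_star = [config("m"), config("msec")]     # observed non-effect configurations

def covers_any(d, S):
    return any(all(c[f] == v for f, v in d.items()) for c in S)

def is_precause(d, E_star, N_star):
    if not covers_any(d, E_star) or covers_any(d, N_star):       # (FPC1)
        return False
    return all(covers_any({g: v for g, v in d.items() if g != f}, N_star)
               for f in d)                                       # (FPC2)

partials = [{f: v for f, v in zip(FEATURES, vals) if v is not None}
            for vals in product([None, False, True], repeat=6)]
precauses = [d for d in partials if is_precause(d, E_star, N_star)]
print(len(precauses))   # 5
```

With only these four observations, candidates such as s ∧ ¬c and e ∧ ¬s appear alongside the three feature causes of Example 2, illustrating that less information yields more (weaker) causal candidates.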

Causal explications
Since the number of feature causes (and precauses) can be exponential in the number of features, a mere listing of all causes is neither feasible nor expedient for real-world software systems. This holds both for humans that have to evaluate causal relationships in configurable systems, e.g., during software development, and for machines that might use feature causes for further processing and reasoning.
In this section, we present and discuss several methods to compute causal explications, i.e., mathematical or computational constructs that arise from processing feature causes to provide useful causal representations and measures (Baier et al., 2021).Explications are closely related to explanations, by which we mean human-understandable objects employed within an integrated system, e.g., in feature-oriented software development or in production-level deployments.
Our methods for computing explications rely on techniques from propositional logic and circuit optimization (Paul, 1975; McGeer et al., 1993), responsibility and blame (Chockler and Halpern, 2004), and feature interactions (Garvin and Cohen, 2011). They all take a global perspective on sets of feature causes rather than only considering single feature causes in isolation. In the following, we fix a set of valid configurations V ⊆ I(F) and an effect set E ⊆ V.

Propositional logic formulas
A rather natural explication for a set of causes K ⊆ Causes(E, V) is to represent K as a propositional logic formula, e.g., as the characteristic formula χ(K) in disjunctive normal form (DNF), where each cause ∂ ∈ K contributes one conjunctive clause comprising the literal f for each f ∈ supp(∂) with ∂(f) = ⊤ and the literal ¬f for each f ∈ supp(∂) with ∂(f) = ⊥.
Clearly, () has the same size as  and its representation does not exhibit any advantage compared to . Methods to minimize propositional logic formulas (McCluskey, 1956;McGeer et al., 1993;Hemaspaandra and Schnoor, 2011) could be used to yield small formulas  covering the same configurations as , i.e., where While beneficial for related problems in configurable systems analysis, e.g., for presence condition simplification (von Rhein et al., 2015), such methods vanish causal information, i.e., the set of causes  cannot be reconstructed from the reduced formula .To provide a small formula that maintains the causal information of , we use a simple yet effective reduction method, which we call distributive law simplification (DLS).The basic idea is to factorize common feature selections in a DNF formula step by step, exploiting the -ary distributive law Each factorization leads to a length reduction of (−1)⋅||, where || is the length of the propositional formula factored out.Obviously, these transformations are reversible, such that the original DNF () and hence the set of causes  can be reconstructed.The final formula length depends on the formulas factored out, the subformulas, and the factorization order.Determining a formula through DLS that has minimal size is close to global optimization problems for propositional logic formulas and thus computationally hard.For practical applications, we hence employ a heuristics that reduces a given formula  in DNF by stepwise factoring out literals that have maximal number of occurrences in DNF subformulas.We denote the reduced formula obtained by this heuristics by (), where (⋅) is provided by Algorithm 3.
Note that Line 4 involves a non-deterministic choice of a literal with maximal occurrence. This leaves some degree of freedom when implementing the algorithm, enabling heuristics depending on practical needs, e.g., prioritizing important or user-specific features for explication.
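The greedy DLS heuristic can be sketched in a few lines of Python. This is an illustrative reimplementation, not the FeatCause code: causes are given as a DNF, i.e., a list of frozensets of literal strings, and ties among literals with maximal occurrence are broken by insertion order.

```python
from collections import Counter

def dls(terms):
    """Greedy distributive-law simplification of a DNF (sketch).

    `terms` is a DNF given as a list of frozensets of literal strings,
    e.g. {"Enc", "~Caesar"} stands for the term Enc & ~Caesar.
    Returns a string formula; only shared literals are factored out,
    so the original DNF can be recovered by expanding the result.
    """
    terms = [frozenset(t) for t in terms]
    if not terms:
        return "false"
    if len(terms) == 1:
        return " & ".join(sorted(terms[0])) or "true"
    counts = Counter(lit for t in terms for lit in t)
    lit, occ = counts.most_common(1)[0]
    if occ <= 1:  # no literal shared by two terms: keep the plain DNF
        return " | ".join("(" + (" & ".join(sorted(t)) or "true") + ")"
                          for t in terms)
    inside = [t - {lit} for t in terms if lit in t]   # factor lit out
    outside = [t for t in terms if lit not in t]
    factored = "(%s & (%s))" % (lit, dls(inside))
    if not outside:
        return factored
    return factored + " | " + dls(outside)
```

For instance, factoring the DNF (a ∧ b) ∨ (a ∧ c) yields a formula in which the shared literal a occurs only once, mirroring the length reduction described above.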
Extended Boolean connectives. Besides the standard Boolean connectives ∧ and ∨ apparent in our DLS, common patterns of propositional logic formulas such as ⊕ (XOR), ↔ (equivalence), ≠ (non-equivalence), and → (implication) could be used for explication, providing a short (yet reversible) representation.
Example 5. In the feature diagram of Fig. 1, the encryption features are connected through an exclusive disjunction (XOR), and the set of valid feature configurations could then be described by

⋄
Such extended Boolean connectives could well be included in our DLS scheme, possibly also exploiting the anti-distributivity of →.
Numeric features. Another way to provide reversible and concise representations is to exploit properties of multi-features and attributes, i.e., features that are not only active or inactive but are configurable by a numerical value (Classen et al., 2011; Cordy et al., 2013b). Formally, a numeric feature configuration is a function over a finite domain D ⊆ ℕ. It is well known that feature attributes can be modeled by Boolean features by extending the feature space, e.g., introducing features n₀, …, n₉ in case a feature attribute n can take numeric values in D = {0, …, 9}. Hence, our Boolean framework for feature causality also covers reasoning about configurable systems with feature attributes. Towards a meaningful explication, we might however revert this encoding after the feature causality analysis and replace parts of a propositional logic formula for the feature causes by corresponding arithmetic expressions. For instance, n₀ ∨ n₁ ∨ n₂ could be replaced by the expression n ≤ 2.
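The one-hot encoding of numeric attributes and the reverse fold into threshold expressions can be sketched as follows. The helper names and the string-based feature naming scheme (`n_0`, `n_1`, …) are illustrative assumptions, not part of the paper's formalism.

```python
def encode_numeric(name, domain):
    """One-hot Boolean encoding of a numeric feature attribute:
    feature `name` over domain {0,...,9} becomes n_0, ..., n_9."""
    return ["%s_%d" % (name, v) for v in sorted(domain)]

def fold_interval(name, literals, domain):
    """Try to revert a disjunction of value features back to an
    arithmetic threshold expression (hypothetical helper).

    `literals` are positive value features such as ["n_0","n_1","n_2"];
    contiguous runs of values are rendered as <=, >=, or a range."""
    vals = sorted(int(l.rsplit("_", 1)[1]) for l in literals)
    lo, hi = vals[0], vals[-1]
    if vals == list(range(lo, hi + 1)):          # contiguous values
        if lo == min(domain):
            return "%s <= %d" % (name, hi)
        if hi == max(domain):
            return "%s >= %d" % (name, lo)
        return "%d <= %s <= %d" % (lo, name, hi)
    return " | ".join(literals)                  # not an interval: keep DNF
```

Under this encoding, the disjunction n₀ ∨ n₁ ∨ n₂ over D = {0, …, 9} folds back into `n <= 2`, exactly as in the example above.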

Cause-effect covers
The complete set of causes may contain several candidates describing reasons for the effect emerging in a single system variant. If we are not interested in all causes but in a set of causes that covers all effects (i.e., that contains at least one cause for every affected system variant), we might ask for a preferably small set of causes covering all effect configurations. Formally, a cause-effect cover of the effect set is a set of causes that covers all effect configurations.

Proof. Let us assume the opposite, i.e., there is an effect configuration that is covered by no cause. Since (FC1) is fulfilled for this configuration, (FC2) cannot be true for it, since otherwise the configuration itself would be a cause and could serve as a counterexample to our assumption. Hence, there is a feature such that the corresponding expansion still satisfies (FC1). By an inductive argument, there is a cause covering the configuration, a contradiction.

We say that a cause-effect cover is minimal if there is no cause-effect cover of smaller cardinality. Note that this notion of minimality is similar to the standard definition for sets of partial interpretations, but ranges not over all sets of partial configurations but over causes only.
Example 6. For the email system in Example 2, we directly see that there are exactly two cause-effect covers of the effect: a singleton cover and a two-element cover. The singleton cover is thus a minimal cause-effect cover. ⋄

It is well known that computing minimal covers is expensive, as the decision problem whether there is a cover of a given set of configurations with at most k ∈ ℕ elements is NP-complete (e.g., Umans et al., 2006). The same holds for computing minimal prime*-covers and hence minimal cause-effect covers (Paul, 1975). Thus, for practical applicability, heuristics that lead to nearly minimal cause-effect covers are of interest to concisely explicate causal candidates covering all effect configurations. In the following, we establish such a heuristic by a greedy scheme involving most general causes.
Definition 4. The binary relation ⊴ on partial configurations, where c ⊴ c′ states that c′ is at least as general as c w.r.t. the set of valid configurations, is defined accordingly. The set of most general causes comprises those causes for the effect that are ⊴-maximal among all causes.
In the next paragraphs, we formally prove relationships between the set of causes and the set of most general causes on the one hand and prime*-covers and minimal prime*-covers on the other. From the above definition, we directly obtain an algorithm to compute the most general causes by computing ⊴ in quadratic time and selecting the ⊴-maximal elements. In particular, we will see that both the set of causes and the set of most general causes provide cause-effect covers. Note that ⊴ is not antisymmetric and hence there might be different most general causes that cover the same set of effect instances. Towards nearly minimal cause-effect covers, we thus might pick only one of those candidates (e.g., one with minimal support) to obtain even more concise representatives for feature causality.
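The selection of ⊴-maximal causes and the greedy cover scheme can be sketched as follows. This is an assumption-laden illustration: we read c ⊴ c′ as "the effect configurations covered by c are a subset of those covered by c′" (which matches the non-antisymmetry noted above), and causes are represented abstractly by the sets of effect configurations they cover.

```python
def most_general(causes, cover_of):
    """Return the ⊴-maximal causes (sketch).

    `causes` is a list of cause identifiers; `cover_of[c]` is the set
    of effect configurations covered by cause c.  A cause is most
    general if no other cause covers a strict superset."""
    return [c for c in causes
            if not any(cover_of[c] < cover_of[d] for d in causes)]

def greedy_cover(effects, causes, cover_of):
    """Greedy, nearly minimal cause-effect cover built from the most
    general causes, preferring causes with larger coverage."""
    uncovered, chosen = set(effects), []
    for c in sorted(most_general(causes, cover_of),
                    key=lambda c: -len(cover_of[c])):
        if uncovered & cover_of[c]:      # contributes new coverage
            chosen.append(c)
            uncovered -= cover_of[c]
        if not uncovered:
            break
    return chosen
```

The quadratic ⊴ comparison is explicit in `most_general`; the greedy selection only approximates a minimal cover, in line with the NP-completeness noted above.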
Feature causality and prime covers. Lemma 5 and the section above showed the close connection of feature causes to prime covers. However, it is still open whether the set of causes is actually a prime*-cover of the effect set relative to the non-effect configurations. In the corner case of an empty effect set, the set of causes is empty due to (FC1), and the statement holds since the empty set is the trivial *-cover of the empty set relative to any set. Further observe that if all valid configurations are effect configurations, the statement is also clear, since the uniquely defined prime*-cover is the singleton set containing the empty partial configuration: (FC2) is trivially fulfilled and, since the empty partial configuration covers all configurations, (FC1) holds as well.
For the following proposition, recall that for sets of partial configurations, P denotes the set of prime*-covers and Pmin the set of minimal prime*-covers of one set relative to another. Ad (P2): First consider the case where the cover is empty. Then it is the only minimal prime*-cover and the effect set is empty, so the set of causes is empty as well and trivially contains the cover. This also leads to (FC2) being fulfilled. Further, since the cover is minimal, removing any element yields a set that is no longer a cover of the effect set.
Effect uncertainty and precause-effect covers. Following Definition 3 and Proposition 2, counterfactual reasoning via prime implicant computations can also be used to provide causal candidates for underspecified effect sets. However, the number of precauses is usually much higher than the number of actual causes, due to the uncertainty about which configurations may serve as effect or non-effect instances. To this end, precause-effect covers are important to reduce the number of precauses eligible to describe the (sure) effect instances while respecting counterfactual reasoning. It seems reasonable to aim for precause-effect covers that minimize both the number of precauses and the number of covered underspecified valid configurations (or covered configurations in case the set of valid configurations is unknown). We propose to first compute most general precauses following Definition 4 and then stepwise select, towards a precause-effect cover, those that have the highest ratio between covered sure effect instances and additionally covered underspecified valid configurations. For the latter, we treat configurations covered by previously selected most general precauses as no longer underspecified (with complete information, they would turn into effect instances for feature causes following (FC1) in Definition 1).

Responsibility and blame
To measure the influence of causes on effects, Chockler and Halpern (2004) introduced degrees of responsibility and blame, ranging from zero to one for ''no'' to ''full'' responsibility and blame, respectively. Responsibility measures how relevant a single cause is for an effect in a specific context. Blame denotes the expected overall responsibility according to a given probability distribution over all contexts where the effect emerges. We take inspiration from these measures and present corresponding notions for feature causality. In short, the degree of responsibility is the maximal share a feature contributes to the effect, determined by the number of features that would have to be reconfigured to provide a counterfactual witness.

Example 8.
We rephrase the majority example by Chockler and Halpern (2004) in our setting. Consider 11 features whose configurations are all valid. We are interested in responsibilities for the effect that the majority of features is active. If all eleven features are active, each feature has a responsibility of 1∕6, since six features share the responsibility for the effect: besides the feature of interest, five further features have to be reconfigured towards a majority of inactive features. In a configuration where six features are active and five are not, each of the six active ones is fully responsible for the effect: if any of them were reconfigured, more features would be inactive than active. We then assign a responsibility of one to each of the six active features. ⋄

In what follows, we formalize degrees of responsibility and blame for single features as in the example above. An extension of this notion to partial configurations to explicate feature interactions is provided in Section 4.4.
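The arithmetic of Example 8 can be made concrete in a short sketch. The function below is specialized to the 11-feature majority example and assumes, as in the example, that only active features carry responsibility for the majority effect; it is not the paper's general definition.

```python
def majority_responsibility(config, i):
    """Responsibility of feature i in the 11-feature majority example.

    `config` is a tuple of 11 booleans; the effect holds iff more than
    5 features are active.  Responsibility is 1/k, where k is the
    minimal number of features (including i) that must be switched to
    reach a non-effect configuration; 0 if the effect is absent or
    feature i is inactive and hence not part of a covering cause."""
    n_active = sum(config)
    if n_active <= 5 or not config[i]:
        return 0.0
    # switching i plus (n_active - 6) further active features leaves
    # exactly 5 active features, i.e. a counterfactual witness
    return 1.0 / (n_active - 5)
```

With all eleven features active this yields 1∕6, and with exactly six active features each active feature gets full responsibility, matching the example.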
Feature responsibility. Intuitively, the degree of responsibility of a single feature is defined as its maximal share in causing the effect in a given effect instance. If the feature does not appear in the support of any cause covering the instance, it does not contribute to causing the effect there and thus has no responsibility. Otherwise, the feature shares its responsibility with at least a minimal number of other features whose switch of interpretation in the instance would lead to a counterfactual witness, i.e., a valid non-effect configuration. In the following, we consider the set of features that have to be switched in a configuration (including the feature whose responsibility is determined) such that a counterfactual witness is reached. Note that, due to (FC2), there exists at least one counterfactual witness in case the feature appears in a cause covering the instance; hence, the denominator of the above fraction is finite and greater than zero.
Example 9. Continuing the email example (see Example 2), the mail and sign features do not have any responsibility for a long decipher time in any configuration, as they do not appear in any of the causes. The AES feature likewise has no responsibility in those configurations whose only covering causes do not contain it in their support; the analogous case holds for the RSA feature. The other degrees of responsibility are 1∕2: switching a feature usually requires one further feature to switch to reach a valid non-effect configuration. For example, selecting the Caesar feature also requires deselecting the AES feature, leading to a counterfactual witness. The table below shows the degrees of responsibility for the features of interest. ⋄

On the choice of blame distributions. The distribution models the frequency with which valid configurations occur, and several scenarios lead to a reasonable definition. One natural distribution may model the frequency of users choosing a configuration of the configurable system. The frequency of effect configurations is also useful, e.g., to model how often a certain bug is reported by users when the effect corresponds to a malfunction. In case such statistics are not at hand, or one is interested in the degree of blame from a developer's perspective, uniform distributions over valid configurations or effects are canonical candidates.
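The blame notion, expected responsibility under a distribution over effect configurations, is a one-liner once responsibilities are known. The following sketch uses illustrative names; the uniform distribution is the developer-perspective default discussed above.

```python
def uniform(configs):
    """Uniform distribution over a finite set of configurations."""
    p = 1.0 / len(configs)
    return {cfg: p for cfg in configs}

def blame(responsibility, dist):
    """Degree of blame as expected responsibility (sketch).

    `responsibility[cfg]` is the feature's degree of responsibility in
    effect configuration cfg; `dist` is a probability distribution
    over effect configurations, e.g. bug-report frequencies or, in
    their absence, the uniform distribution."""
    return sum(p * responsibility.get(cfg, 0.0)
               for cfg, p in dist.items())
```

For example, a feature with responsibility 1 in one effect configuration and 1∕2 in the only other one has blame 3∕4 under the uniform distribution.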

Feature interactions
A notorious problem in configurable systems is the presence of (inadvertent) feature interactions (Calder et al., 2003; Apel et al., 2014), which describe system behaviors that emerge from a combination of multiple features and are not easily deducible from the features' individual behaviors. The detection, isolation, and resolution of feature interactions play a central role in the development of configurable systems and beyond (Zave, 2001; Apel et al., 2013). On the one hand, feature interactions may be desired to integrate and coordinate multiple independently developed features. On the other hand, due to the exponential number of possibilities of how features may interact, undesired feature interactions can easily be overlooked by developers. The problem of detecting unintended feature interactions rises in severity with the number of features. This led to a crisis in the area of telecommunication systems already in the early 1980s, when an increasing number of undesired behaviors between features of complex telecommunication systems was observed (Calder et al., 2003). A multitude of approaches have been proposed and are still being developed to address the problem of detecting (undesired) feature interactions, for example, to discover feature-interaction faults (Garvin and Cohen, 2011). We now show how our black-box causal analysis at the level of features (see Section 3.1) can be used for the detection and isolation of feature interactions. These can provide the basis for fine-grained, white-box feature-interaction resolution as also proposed by Garvin and Cohen (2011).
Detection. The first problem we address is to detect the necessity of feature interactions for an effect to emerge. Garvin and Cohen presented a formal definition of feature-interaction faults to capture faults in configurable systems that necessarily arise from the interplay between multiple features (Garvin and Cohen, 2011). Notably, their characterization is also at the abstraction level of features and relies on black-box testing of faults, similar to our perspective on effects. We transfer their definition to our setting, covering arbitrary effects instead of faults only. Recall the notion of a partial configuration being sufficient for the effect w.r.t. the set of valid configurations. To this end, Algorithm 1, in combination with a projection to feature causes with minimal support, can be used to decide whether the effect emerges necessarily from feature interactions: a necessary feature interaction takes place in case these minimal feature causes all have supports that involve at least two features.
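The detection criterion just stated reduces to a support-size check once the feature causes are computed. A minimal sketch, representing each cause by its support set of feature names:

```python
def necessary_interaction(cause_supports):
    """Decide whether an effect necessarily stems from a feature
    interaction (sketch of the criterion above): this is the case iff
    all support-minimal feature causes involve at least two features.

    `cause_supports` is a list of sets of feature names, one per cause."""
    if not cause_supports:
        return False
    # if the smallest support has >= 2 features, every support-minimal
    # cause witnesses an interaction of at least two features
    return min(len(s) for s in cause_supports) >= 2
```

On the email example discussed below (two single-feature causes and one two-feature cause), the criterion correctly reports that no feature interaction is necessary.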
Example 12. Returning to Example 2, there are exactly two causes that are 1-way interaction witnesses. The remaining cause does not witness a necessary 2-way feature interaction: although it has a support of size two, the other two causes have supports of size one (cf. Theorem 2(2)). Hence, the effect describing a long decipher time is not necessarily related to a feature interaction. ⋄

Isolation. The second problem we address is to pinpoint the features responsible for feature interactions. For this, observe that Definition 7 is similar to Definition 1, but with a different notion of minimality: while (FI2) ensures global minimality over all partial configurations, (FC2) ensures local minimality through expansions, taking the individual selection of features into account. To pinpoint those features that actually interact towards the effect, feature causes can be interpreted as interaction witnesses at the local level instead of the global perspective taken for k-way interaction witnesses: switching some feature of a cause no longer ensures the effect to emerge; hence, the switched feature is necessary for the effect. For instance, in Example 12, the interaction between encryption and Caesar being disabled is witnessed by the corresponding feature cause. Feature causes thus also provide a criterion for feature interactions at the operational level and can be used to guide a more in-depth white-box feature-interaction analysis, possibly reducing the naive exponentially-sized feature-interaction search space.
Feature interaction responsibility and blame. Any subset of the support of a feature cause that contains at least two features provides a candidate for an actual interaction between those features. Since both the number of feature causes and the number of their expansions can be exponential in the number of features, feature interactions isolated via a causal analysis might still be difficult to interpret for developers. Based on the degree of responsibility for single features, we now provide a variant that measures responsibility and blame of feature interactions, where high values indicate strong relevance of the interaction and low values weak relevance. Both measures are defined as the share of features to be switched, including at least one feature from the interaction support, to obtain a counterfactual witness. This definition covers the single-feature case (Definition 5), where the latter feature coincides with the feature of interest.
Definition 8. The degree of responsibility of a partial configuration in a context from the effect set is defined whenever there is a cause that covers the context, whose support contains the domain of the partial configuration, and that agrees with the partial configuration on all features of its domain; otherwise, the degree of responsibility is zero.

Note that, restricted to partial configurations over a single feature that agree with the context, Definition 8 agrees with Definition 5. The degree of responsibility is non-zero in the single-feature case if the feature appears in some cause, whereas the degree of feature-interaction responsibility is non-zero if some cause is an expansion of the potential feature interaction. Feature-interaction blame is extended similarly from the single-feature case (Definition 6) by replacing single features by partial configurations that stand for the potential feature interaction of interest:

Definition 9. The degree of blame of a partial configuration w.r.t. a distribution over the valid configurations is defined as the expected feature-interaction responsibility.

Example 13. We formalize the majority voting example of Example 8 by features f₁, …, f₁₁ and a function counting the active features of a valid configuration, such that the effect set comprises all configurations with more than five active features. Consider the partial configuration over {f₁, f₂, f₃} in the context where all eleven features are active. The joint responsibility of this partial configuration in that context is 1∕4, since besides the features of the partial configuration, three further features have to be switched to yield a non-effect configuration. ⋄

Experiment setup
To evaluate our causal analysis and explication methods, we conducted a number of experiments comprising many analyses on community benchmarks and real-world examples from the area of configurable software systems.
Our evaluation is driven by the five research questions stated in the introduction that address the key issue of whether and how the notion of feature causality facilitates identifying root causes, estimating the effects of features, and detecting feature interactions in controlled and practical settings.

Implementation
We implemented our algorithms to compute feature causes and explications in the prototypical tool FeatCause. Written in Python, our tool relies on the engines for logical expressions and binary decision diagrams (BDDs) of PyEDA, a library for electronic design automation (Drake, 2015). The tool takes the sets of valid feature configurations and effects as input. FeatCause supports different input formats for these sets, e.g., Boolean expressions in DNF or CNF. We implemented Algorithm 1, which uses prime implicants to efficiently determine feature causes (cf. Section 3). Internally, sets of (partial) feature configurations are represented as reduced ordered BDDs (Bryant, 1986). In addition to their compact and hence space-efficient representation, we chose BDDs because they provide an efficient method to check satisfiability (required, e.g., for Line 2 of Algorithm 1). Note that, even when using BDDs, a naive algorithm that directly checks the conditions (FC1) and (FC2) for all partial configurations is not feasible, since it would need to construct and operate on exponentially many BDDs, one for each possible partial configuration. Instead, in our evaluation we determine feature causes using prime implicant computations (cf. Section 3). To compute prime implicants, we used the tool Espresso (McGeer et al., 1993), well known from circuit optimization, through an interface that mediates between our BDD representations and the DNF representations in Espresso's PLA format. This interface is also used to provide minimal and nearly minimal cause-effect covers through Espresso-Signature and Espresso, respectively, which can then be compared with our heuristic cause-effect cover by most general causes (see Section 4.2). While it is well known that the length of a DNF can be exponential in the size of the BDD representing the same Boolean function, generating DNFs from our BDD representations did not face any significant blowup and required negligible time in all our
experiments. For resolving the non-deterministic choice in Line 4 of Algorithm 3 (DLS), our implementation picks the first element of the list of literals ordered according to their occurrences, i.e., the resolution of the non-determinism is left to the sorting algorithm implemented in Python. We also modified the algorithm towards global minimization by exhaustively iterating over all literals and returning the factorization with minimal length. Our experiments with this global resolution of the non-determinism unsurprisingly led to massive performance drawbacks, while the impact on the resulting formula sizes turned out to be negligible. The reduction by the algorithm is reversible and hence the set of causes can easily be reconstructed from the reduced formula. Besides the core tool, we implemented several conversion scripts to generate valid feature configuration sets from TVL (Classen et al., 2011) and effect sets from analysis results returned by variability-aware analysis tools such as ProVeLines (Cordy et al., 2013a) and ProFeat (Chrszon et al., 2018) (see also Section 3.1), as well as from the data sets of Siegmund et al. (2015) and Kaltenecker et al. (2019).

Subject systems
We selected a diverse set of subject systems to approach our research questions, ranging from popular community benchmarks to more involved systems with non-functional properties and from real-world settings.
From Cordy et al. (2013a), we use the CFDP, Elevator, and Minepump systems and analyzed them against the accompanying LTL properties using the variability-aware model checker ProVeLines. Furthermore, we took the Email and Elevator systems from von Rhein et al. (2015) and analyzed multiple defects, provided as propositional logic formulas generated by SPLVerifier (Apel et al., 2013).
For quantitative properties, we generated effect sets from configurable-system analysis results as illustrated in Section 3.1 for three classes of systems. For each type of effect property, we list the considered subject systems, the numbers of experiments (#), valid configurations, and features, as well as the overall time in seconds to compute feature causes and most general causes. The second part of the table lists the average sizes of the effect sets, feature causes, and cause-effect covers by most general causes, along with the sizes of the DLS formulas relative to the characteristic formula of the causes (in percent).
First, we constructed effect sets for several systems from studies on performance prediction of non-functional properties (Siegmund et al., 2012, 2013a,b), such as Apache, Linux, SQLite, and WGet. For these, we used the black-box approach by Siegmund et al. (2012), which uses multivariable linear regression to generate variability-aware performance models. Our thresholds for constructing effect sets are imposed on the prediction accuracy of the three non-functional properties of binary size, memory footprint, and runtime, respectively.
Second, we considered configurable systems modeled for the variability-aware probabilistic model checker ProFeat (Chrszon et al., 2018), comprising a body sensor network (BSN) model (Rodrigues et al., 2015) and a velocity control loop (VCL) model of an aircraft (Dubslaff et al., 2019a, 2020b). In both systems, the reliability of the system is analyzed in terms of the probability of failure of sensors and control components, respectively. Effect sets are generated by imposing a threshold on the analysis results.
Third, we generated effect sets from performance measurements of real-world configurable software systems that have been used to evaluate performance modeling techniques (Siegmund et al., 2015; Kaltenecker et al., 2019). In particular, we selected five systems from different domains: a compiler framework (LLVM), a database system (BerkeleyDB), a compression tool (Lrzip), a video encoder (x264), and a toolbox for solving partial differential equations (DUNE). For these systems, we chose thresholds on the runtime of the system executions.
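Constructing such a threshold-based effect set from raw measurements is straightforward; a minimal sketch with illustrative data follows (configuration and feature names are hypothetical).

```python
def effect_set(measurements, threshold):
    """Derive an effect set from black-box performance measurements:
    a configuration is an effect configuration iff its measured
    runtime exceeds the chosen threshold.

    `measurements` maps configurations (frozensets of selected
    features) to runtimes in seconds."""
    return {cfg for cfg, runtime in measurements.items()
            if runtime > threshold}
```

The resulting set, together with the set of all measured valid configurations, forms the input to the feature causality analysis described in Section 3.1.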

Operationalization
To address (RQ1), we compute both the feature causes of the effect property and their dual, i.e., the feature causes of the negated effect. Based on these feature causes, we further compute the following causal explications, e.g., to answer (RQ2):

• most general cause-effect covers and their dual sets,
• nearly minimal cause-effect covers by *-covers provided by Espresso (and their dual case),
• distributive law simplification (DLS) on causes and on cause-effect covers (and their dual case),
• feature precauses and precause-effect covers for samples of increasing size drawn from the effect and non-effect configurations, respectively,
• responsibility and blame values for single features and causes, and
• feature-interaction blames for pairs of features.
For blame computations, we assume a uniform distribution over all effects, due to the absence of further statistical information and taking a developer's perspective (cf. Section 4.3). For (RQ3), we compute single-feature blames based on a uniform effect distribution to measure the influence of individual features on the effect. We compute feature-interaction blames on pairs of features to address (RQ4), again assuming a uniform effect distribution. Table 1 provides key statistics about our experiments, focusing on model characteristics and the time to compute feature causes and most general feature causes. All experiments were conducted on an AMD Ryzen 7 3800X 8-core system with 32 GB of RAM running Debian 10 and Python 3.7.3.

Results
We discuss our results w.r.t. the different kinds of causal explications of Section 4. First, we present statistics on our experiments, quantitatively analyzing the potential of causal explications by most general causes and DLS-reduced formulas. Then, we address our research questions in more depth by means of three representative subject systems. Here, we focus on properties not detectable by classical causal white-box analysis methods (Zeller, 2002; Johnson et al., 2020). The computability of feature causes is of major interest for our evaluation. Table 1 provides an overview of the subject systems for which we generated effect sets and applied our feature causality analysis. We see that our algorithms compute feature causes in reasonable time, within a few seconds for most subject systems. The effect sets have been precomputed by various variability-aware analysis methods for effect properties as described in Section 3.1: LTL properties; thresholds on the accuracy of prediction models for binary size, memory footprint, and runtime; and reliability and runtime thresholds, respectively. This variety of properties already illustrates the wide range of applications and potential uses of feature causality. Note that the time to compute the effect sets is not included in our experiments, since they were partly taken from existing benchmark sets. The sizes of the valid configuration and effect sets crucially influence the time for computing feature causes, which is expected, since the complexity of Algorithm 1 is dominated by the computation of prime implicants. Since our implementation relies on BDDs for the representation of valid configuration and effect sets, computation times may nevertheless differ significantly even for similarly sized sets. This is mainly due to the fact that BDD sizes highly depend on the specific nature of the represented Boolean functions and the variable order chosen (Bryant,
1986). For instance, while the experiments on DUNE and BerkeleyDB (see Table 1) have similarly sized valid configuration and effect sets, their runtimes differ by two orders of magnitude. We see that the number of most general feature causes is often far smaller than the overall number of feature causes, which renders the creation of cause-effect covers by most general causes sensible to support concise explications. In the same vein, the application of DLS leads to great reductions of the logical representations of feature causes, e.g., on average by almost 3/4 for the Elevator 1 subject system (see Table 1).
For (RQ1) and (RQ2), we conclude that feature causes are computable in reasonable time. A substantial reduction of the sets of feature causes and cause-effect covers can be achieved with DLS formulas and most general feature causes, respectively.

Feature cause explications (RQ2)
We discuss the explications of feature causes that we generated by the example of the Minepump system (Classen et al., 2013), which is frequently used in the configurable systems analysis community. This system models a water pump of a mine with 11 features, on which requirements expressed in LTL are imposed (see Section 3.1). One of these requirements addresses system stabilization, formalized by the LTL formula ◊□¬pumpOn ∨ ◊□pumpOn, i.e., from some point on the pump stays on or off forever. An analysis of the stabilization property using ProVeLines returned 28 configurations where the property holds and 100 configurations where it does not hold.
The assessment of which features are important for this effect property is not obvious, possibly requiring a careful investigation of all 128 configurations, evaluating both effect and non-effect configurations. Our causal analysis returned seven feature causes in an automated way, listed with their degrees of blame in Table 2. They already provide hints as to which features are responsible for the property. Among the feature causes, three are most general, highlighted in Table 2. They have the highest degrees of partial-configuration blame, while the lengthiest cause has the smallest degree. Our DLS heuristic on the most general causes yields a concise representation of feature-cause candidates explicating the effect property: selecting the features High and Start together with one of the three features Stop, Low, or MethaneAlarm covers all causally relevant configurations for the Minepump system to stabilize. On other subject systems, explications are also effective, but with less drastic reductions than for the Minepump example.
Answering (RQ2), feature causes are of reasonable size compared to the complete analysis results, i.e., most general feature causes and DLS provide concise explications for feature causes. Responsibility and blame reflect the impact of feature causes.

Causality-guided configuration (RQ3)
Feature blame provides a quantitative measure of the causal impact of feature selections w.r.t. a set of configurations. This measure can be used to support configuration decisions, e.g., by prioritizing features with high blame values in case the effect property is desirable. We investigate such a causality-guided configuration on the velocity control loop (VCL) subject system (Dubslaff et al., 2020a,b). The VCL models an aircraft velocity controller in Simulink for which the reliability in terms of probability of failure is of interest. A common principle to increase the reliability of a system is triple modular redundancy (TMR), where system components are triplicated and their outputs are combined via a majority vote. Dubslaff et al. (2019a) suggested to model and analyze systems with such protection mechanisms using family-based methods from configurable-systems analysis. To each component they assign a protection feature that specifies whether the component is triplicated or not. Comprising 21 components eligible for protection, the VCL model has |C| = 2^21 = 2,097,152 valid feature configurations. Clearly, the highest reliability is achieved by protecting all components. However, each protection comes at a cost in terms of execution time, energy, and packaging size. While it is known how to determine protection configurations with an optimal reliability-cost tradeoff (Dubslaff et al., 2019a), the reasons why a protection configuration is optimal or why a component was selected for protection are typically unclear. We address this issue by exploiting our causal analysis methods. Using the variability-aware probabilistic model checking tool ProFeat (Chrszon et al., 2018), we generated effect sets E_{<θ} w.r.t. the probability of failure of the VCL within two control-loop executions and reliability thresholds θ between 0.019 and 0.064. Table 3 shows the degree of feature blame for 18 protection features of the 21 components of the VCL (cf. Section 4.3). The three components not shown in the table are input components; they have zero degree of blame and hence do not contribute to the system's reliability. With tight reliability constraints, one should protect the ''Acceleration'' component, followed by the ''Integrator'' component, as their blames are significantly higher than for other components (cf. upper rows of Table 3). When higher failure rates are acceptable, one should prefer to protect the components ''Sum2'' and ''Sum3'' instead of the ''Integrator'' component due to their higher impact on reliability (cf. lower rows of Table 3).
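The effect of TMR on a single component can be sketched as follows, assuming independent copy failures and an ideal majority voter (the numbers are illustrative and not taken from the VCL model):

```python
def tmr_failure(p):
    """Failure probability of a TMR-protected component with an ideal
    majority voter, assuming independent copy failures with probability p:
    the vote fails iff at least two of the three copies fail."""
    return 3 * p**2 * (1 - p) + p**3

# e.g., a per-copy failure probability of 1e-2 drops to about 2.98e-4
# under TMR -- but at the price of triplicated cost.
```

Note that TMR only helps for p < 1/2; at p = 1/2 the voted failure probability equals the unprotected one, which is why blame-guided selection of *which* components to protect matters.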
For (RQ3), we conclude that feature causes and degrees of blame reveal and quantify the impact of features on the desired effect and, this way, are able to guide the feature configuration process.

Feature interactions (RQ4)
Our theoretical results from Section 4.4 provide a new connection between causal reasoning and feature interactions in configurable systems. In particular, all methods to explicate feature causes and precauses, as well as cause-effect covers with causes of minimal support, can also be used to explicate feature interactions. To illustrate how to detect and isolate feature interactions through causal reasoning, we perform a causal analysis of the Lrzip subject system, which models a compression system for which runtime characteristics of the compression algorithms are of interest. Since the number of feature causes and their expansions can both be exponential in the number of features, a direct evaluation of the runtimes and causal analysis results is difficult. We hence investigate feature interactions through their degrees of blame as described at the end of Section 4.3. The subject effect sets E_{>θ} depend on the runtime function t: C → ℝ, in seconds, for a configuration compressing a file, obtained as by Siegmund et al. (2015) and Kaltenecker et al. (2019), and a runtime threshold θ (see end of Section 3.1).
We then focus on 2-way interactions by investigating potential feature interactions between the compression algorithm and the compression level that are responsible for high runtimes. For this, we compute degrees of feature interaction blame for partial configurations that fix one compression algorithm and one compression level from {1, …, 9}. In the columns of Table 4, we show the degrees of feature interaction blame for thresholds θ ranging from θ = 200s to θ = 2,300s. Empty cells correspond to combinations of compression algorithm and level that do not appear in any cause and thus have zero blame. In these cases, we can conclude that no feature interaction takes place. Higher blame values indicate that the combined responsibility of the compression algorithm and level has a greater causal impact on runtime. Notably, we observe that, with an increasing threshold, the compression level is increasingly responsible for longer runtime. Certain compression algorithms always have runtimes above the threshold, independently of the compression level. This leads to a configuration blame of zero at any compression level, e.g., for thresholds θ ≤ 600s for the Zpaq algorithm shown in the upper right of Table 4. Note that, in these cases, Zpaq serves as a 1-way interaction witness according to Definition 7. All greater thresholds for Lrzip and Zpaq do not have 1-way but 2-way interaction witnesses. That is, being above the runtime threshold is a result of a feature interaction between the compression algorithm and those compression levels not showing zero blame. Notice that the sums of the given feature interaction blames for θ ≥ 700s that contain the algorithms Lrzip or Zpaq add up to 1/2. That is, no other features are to be blamed for exceeding the runtime threshold.
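One plausible way to obtain normalized blames of this kind — a simplified sharing scheme for illustration only, not the exact definition from Section 4.3 — is to split each effect configuration's unit weight equally among the causes that cover it and average over all effect configurations:

```python
def blame(causes, effects, features):
    """Simplified blame: each effect configuration distributes weight
    1/|effects| equally among the causes covering it, so that blames of a
    cause-effect cover sum to at most 1.  (Illustrative sharing scheme,
    not the paper's precise responsibility/blame definition.)"""
    def covers(cause, cfg):
        return all(cfg[features.index(f)] == v for f, v in cause.items())
    scores = [0.0] * len(causes)
    for cfg in effects:
        covering = [i for i, c in enumerate(causes) if covers(c, cfg)]
        for i in covering:
            scores[i] += 1.0 / (len(covering) * len(effects))
    return scores
```

With such a normalization, a group of causes jointly covering half of the weighted effects would receive blames summing to 1/2, matching the kind of aggregate reading applied to Table 4 above.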
Features have a dedicated meaning, and one would hence expect higher runtimes for higher compression levels. To this end, it seems odd that the feature interaction of Lrzip and compression level 9 is less to blame for higher runtimes than those for levels 7 and 8. This indicates an anomaly of the feature interaction between Lrzip and the compression levels 7, 8, and 9. Further investigations of the analysis results and feature causes support these findings: averaged over all measurements of Lrzip configurations, we observe runtimes of 1,064.9s at compression level 7 (standard deviation 4.1s), 1,181.7s at level 8 (standard deviation 3.2s), and only 830.5s at level 9 (standard deviation 2.6s). Hence, Lrzip at level 9 is not causally relevant for exceeding the execution time threshold of 900s, as the compression-level-9 feature is not contained in any cause together with Lrzip. However, this insight is difficult to obtain relying purely on the performance influence model, as this would require a handcrafted analysis of all 432 analysis results (see Table 1).
For (RQ4), we conclude that feature causes can provide hints about feature interactions and anomalies arising from them. Blame measures are promising for quantifying the influence of feature interactions that contribute to certain effects.

Minimal *-covers (RQ2)
Minimal prime *-covers have no direct correspondence to most general causes, due to their different purpose of minimizing the number of prime implicants in the cover. But as detailed in Section 4.2, they share certain commonalities. While determining minimal *-covers involves solving an NP-complete problem, many heuristics exist to compute nearly minimal *-covers (McGeer et al., 1993). We discovered that, in all our experiments, the nearly minimal *-covers returned by Espresso are contained in the set of *-covers of most general feature causes that have minimal support size. This has three implications: First, the highly optimized heuristics of two-level logic minimizers can provide a first impression of feature causality. Second, as feature interactions correspond to feature causes with minimal supports, these methods also provide first insights into feature interactions. Third, once feature causes are computed, our heuristics towards cause-effect covers with most general causes provide an efficient method for nearly minimal *-covers.
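A nearly minimal cover can be sketched with the standard greedy set-cover heuristic over prime implicant expansions — illustrative code, not the Espresso algorithm itself, which applies far more sophisticated expand/reduce steps:

```python
def greedy_cover(primes, effects):
    """Greedy set-cover heuristic for nearly minimal covers: repeatedly pick
    the prime implicant (given as a (name, expansion) pair) whose expansion
    covers the most yet-uncovered effect configurations.  Exact cover
    minimization is NP-complete; greedy gives a logarithmic approximation."""
    uncovered = set(effects)
    cover = []
    while uncovered:
        name, expansion = max(primes, key=lambda pe: len(uncovered & set(pe[1])))
        gained = uncovered & set(expansion)
        if not gained:
            raise ValueError("effects not coverable by the given implicants")
        cover.append(name)
        uncovered -= gained
    return cover
```

For example, with effects {1, 2, 3, 4} and implicant expansions p1 → {1, 2, 3}, p2 → {3, 4}, p3 → {1}, the heuristic selects p1 and then p2, a minimal cover here.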
Concerning (RQ2), we can conclude that most general feature causes in combination with distributive law reduction provide smaller representations for *-covers of the effect set relative to the non-effect set, which could also lead to meaningful and concise presence conditions (von Rhein et al., 2015).

Effect uncertainty and precause-effect covers (RQ5)
To investigate the impact of uncertainty in effect sets, we conducted two experiments: (1) addressing effect-set underspecification and (2) accounting for approximative analysis methods and threshold effect sets.
For (1), we chose the Minepump example and stepwise uniformly sampled effect sets E*_i and non-effect sets N*_i for i = 1, …, 128 from the 128 valid configurations in E and C⧵E, respectively, such that E*_128 = E and N*_128 = C⧵E. During sampling, we closely maintained the ratio between effects and non-effects. To evaluate the quality of the precause-effect covers of E*_i with respect to the (true) effects E, we computed the f-score in each step i. The results are shown for the 20 functional effect properties of Minepump with non-empty sets of causes in Fig. 3. Here, we averaged the score in each step over 500 sample runs. One can see that already with around 30% of knowledge about each configuration being either effect or non-effect, an f-score of more than 0.8 in all cases shows the high quality of the causal description of effects from precauses. Hence, our notion of feature causality is also meaningful for effect sets that are underspecified.
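The f-score used here is the harmonic mean of precision and recall over configurations; a small self-contained sketch:

```python
def f_score(predicted, actual):
    """Harmonic mean of precision and recall between the configurations
    covered by a (pre)cause-effect cover and the true effect set."""
    predicted, actual = set(predicted), set(actual)
    tp = len(predicted & actual)  # correctly covered effect configurations
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(actual)
    return 2 * precision * recall / (precision + recall)
```

An f-score of 1.0 means the cover describes exactly the true effect set; values above 0.8, as observed here, indicate that only few configurations are misclassified.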
To account for the approximative nature of performance models obtained by regression methods, we investigated Lrzip with the same thresholds as in Section 6.4, but introducing a 5% uncertainty band around the thresholds θ. This introduced noise led to only slight changes in the feature interaction blames of Table 4. Specifically, the uncertainty leads to not being sure about the interaction between Gzip and compression level 8 at the threshold θ = 200s, i.e., showing a blame of 0 in the leftmost upper cell of Table 4. Starting at 10% uncertainty, the determined causal influences of the other compression methods also change, but the overall picture is not affected. Hence, we can conclude that our causal reasoning is also robust against small deviations and noise in effect sets.
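The threshold-uncertainty experiment can be sketched by leaving configurations close to the threshold unclassified (function name and band handling are illustrative assumptions, not our exact setup):

```python
def split_with_uncertainty(runtimes, theta, rel_band=0.05):
    """Partition configurations by a runtime threshold theta, leaving
    configurations within a relative band around theta unclassified,
    to model approximation noise in the performance model."""
    effects, non_effects, uncertain = [], [], []
    for cfg, t in runtimes.items():
        if abs(t - theta) <= rel_band * theta:
            uncertain.append(cfg)   # too close to the threshold to decide
        elif t > theta:
            effects.append(cfg)     # clearly above threshold
        else:
            non_effects.append(cfg)
    return effects, non_effects, uncertain
```

Precauses computed from the classified configurations can then be compared against the causes of the noise-free effect sets, as done for Table 4.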
Concerning (RQ5), we can conclude that precause-effect covers already provide good causal explications, also when considering reasonably underspecified effect sets.

Discussion
In this section, we discuss potential threats to the validity of our experiments and relate our findings to existing work from the literature.

Threats to validity
A threat to internal validity arises from the correctness of the analysis results from which we generated the effect sets. While for functional properties this threat is not crucial, due to the exact model-checking techniques used in our experiments, for non-functional properties the results have been partly established using machine learning. To mitigate this threat, we carefully chose effect-set thresholds such that the effect sets remain stable also within small threshold variations. Note that the choice of the effect set has no influence on the applicability of our causality definitions, but only on the extent to which causality can serve as an explication. For blame computations in our experiments, we assumed a uniform distribution over all effects, taking a developer's perspective where frequencies of how often an effect occurs in a real-world setting are not yet accessible. While other distributions could change our quantitative results, it is unlikely that they would alter our conclusions about the causal influences of features and feature interactions. To increase the internal validity of our prototype, we implemented and evaluated several methods to compute causes. These include a naive brute-force approach and two additional methods to generate prime implicants, independent of the tool Espresso.
Naturally, the choice of subject systems threatens external validity, which includes the kinds of effect sets on which we evaluate causality. To alleviate this threat, we included a wide variety of systems with multiple properties from different areas in our evaluation. They comprise several real-world software systems often used to evaluate sampling strategies and performance-modeling approaches. We further added several community benchmarks from the feature-oriented model-checking community as well as a large-scale redundancy system from reliability engineering.

Related work
Various techniques for software defect detection have been proposed in the literature, ranging from testing (Myers, 2004) and static code analysis (Nielson et al., 2010) to model checking (Baier and Katoen, 2008). These techniques have also been extended for analyzing configurable systems to tackle huge configuration spaces (Thüm et al., 2014). While such methods are able to identify defects and their locations, the challenge of finding root causes for defects remains. A methodology to identify causes of defects during software development is provided through root cause analysis (Risk and Division, 1999; Rooney and Heuvel, 2004), which can be supported by a multitude of techniques for causal reasoning (Pearl, 2009; Peters et al., 2017). To the best of our knowledge, the foundations for a combination of configurable-systems analysis and causal reasoning as presented in this paper have not yet been addressed in the literature. In the following, we discuss related work in the fields of configurable-systems analysis and causal reasoning.
There is a substantial corpus of work on determining those features in a configurable system that are responsible for emerging effects (e.g., Kuhn et al., 2004; Yilmaz et al., 2006; Qu et al., 2008). The focus has been mainly on detecting feature interactions (Calder et al., 2003; Calder and Miller, 2006; Apel et al., 2014). Siegmund et al. (2012) and Kolesnikov et al. (2019) describe non-functional feature interactions as interactions where the composed non-functional property diverges from the aggregation of the individual contributions of the single features. Garvin and Cohen (2011) provided a formal definition of feature interaction faults based on black-box analysis to guide white-box isolation of interaction faults.
An incremental software configuration approach to optimize non-functional properties has been presented by Nair et al. (2020), complementing our causality-guided configuration exemplified in Section 6.3.
To reduce the size of propositional logic formulas in configurable systems, von Rhein et al. (2015) proposed to exclude information about valid configurations and use two-level logic minimization, e.g., by the Espresso heuristics (McCluskey, 1956; McGeer et al., 1993). Our DLS method differs from this approach by prioritizing causal information over reduction.
Causal reasoning. Algorithmic reasoning about actual causes following the approach by Halpern and Pearl (2001a) and Halpern (2015) on structural equations is computationally hard in the general case (Eiter and Lukasiewicz, 2002; Aleksandrowicz et al., 2017). However, tractable instances such as the Boolean case have been identified by Eiter and Lukasiewicz (2006). For deciding whether a partial interpretation is an actual cause in the Boolean case, Ibrahim and Pretschner (2020) presented an approach based on SAT solving. To compute all causes, their implementation relies on checking causality for all possible partial interpretations, suffering from an additional exponential blowup in the number of variables, which we avoid in our approach using prime implicant computations.
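The counterfactual core of such cause checks can be sketched in the Boolean case by a brute-force but-for test — a simplification of the Halpern-Pearl definition, which additionally considers contingency sets; names and signatures are illustrative:

```python
from itertools import product

def is_counterfactual_cause(partial, config, effect):
    """But-for test in the Boolean case: `config` (a tuple of Booleans)
    exhibits the effect and agrees with `partial` (dict index -> bool),
    and some alternative assignment to exactly the variables in `partial`
    removes the effect.  (Brute-force sketch, not the full HP definition.)"""
    if not effect(config) or any(config[i] != v for i, v in partial.items()):
        return False
    idxs = sorted(partial)
    for alt in product([True, False], repeat=len(idxs)):
        candidate = list(config)
        for i, v in zip(idxs, alt):  # intervene on the cause variables only
            candidate[i] = v
        if not effect(tuple(candidate)):
            return True  # a counterfactual setting removes the effect
    return False
```

Checking all partial assignments this way exhibits exactly the exponential blowup mentioned above, which prime implicant computations avoid.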
Using test generation methods relying on program trace information, program locations that are the origin of a defect can be identified (Johnson et al., 2020; Rößler et al., 2012). By analyzing differences between the program states of sampled failing and passing executions, delta debugging identifies code positions relevant for an emerging failure (Cleve and Zeller, 2005). Similarly, causes for defects can be determined by analyzing counterexample traces (Groce and Visser, 2003; Beer et al., 2012). Faults can also be located by causal inference on graphs constructed from statement and test coverage data (Baah et al., 2010).
Iqbal et al. (2021) present a static technique to generate causal models of a given configurable system using causal inference and statistical counterfactual reasoning. This model is used to detect performance bugs and provide hints for their resolution. While we focus on actual causality and rigorous analysis, they are interested in type causality to answer more generic questions.

Concluding remarks
We introduced a formal definition of causes in configurable systems and algorithms to identify them, relying on counterfactual reasoning and connections to classical problems of propositional logic and circuit optimization. We demonstrated their potential by analyzing several subject systems, including real-world software systems and popular community benchmarks. To prepare for explanations of causes and their impact on effects, we proposed explication techniques to concisely represent causes and quantify the causal impact of features. We showed that our explications are meaningful and can support the development of configurable software systems by causality-guided configuration.

Fig. 1. Feature diagram for the email system example.

Table 1. Statistics of feature causality experiments.

Table 3. Degree of feature blame for the VCL redundancy system.

Table 4. Degree of feature interaction blame for Lrzip.