Approximating Counterfactual Bounds while Fusing Observational, Biased and Randomised Data Sources

We address the problem of integrating data from multiple, possibly biased, observational and interventional studies, to eventually compute counterfactuals in structural causal models. We start from the case of a single observational dataset affected by a selection bias. We show that the likelihood of the available data has no local maxima. This enables us to use the causal expectation-maximisation scheme to approximate the bounds for partially identifiable counterfactual queries, which are the focus of this paper. We then show how the same approach can address the general case of multiple datasets, no matter whether interventional or observational, biased or unbiased, by remapping it into the former one via graphical transformations. Systematic numerical experiments and a case study on palliative care show the effectiveness of our approach, while hinting at the benefits of fusing heterogeneous data sources to get informative outcomes in case of partial identifiability.


Introduction
Consider the data in Table 1. They represent the results of an artificial case study from Mueller and Pearl (2023), where patients affected by a deadly disease test a drug that might help them. The table reports two independent studies carried out on different groups of patients: a randomised control trial (first eight rows) and an observational study (last eight rows). From those data, Mueller and Pearl (2023) compute a typical counterfactual such as the probability of necessity and sufficiency (PNS), which is the proportion of people that would recover if treated and die otherwise. They do this analytically by conditioning on gender. For males they obtain that PNS ∈ [0.28, 0.49] if only the randomised trial is considered, whereas the sharp value PNS = 0.49 is obtained when observational data are also taken into account. Analogous results are obtained for females. This example illustrates the main traits of the problem we are going to address in this paper: first and foremost, that the computation of counterfactuals is often only partially identifiable, in the sense that in general we can at best bound them, no matter the amount of data; but also that joining observational and randomised studies (under the guidance of a so-called structural causal model) can strengthen the results, i.e., narrow the bounds.

What the original example instead does not take into account is the possible presence of a selection bias, which is the systematic exclusion of a certain subpopulation from the sample. To account for this, we assume here that half of the data (the ones about treated males and untreated females, denoted in grey in the table) from the observational study is missing because of a problem with the communication protocol. The challenge is to produce correct counterfactual bounds for the overall population given the partial view induced by the biased data (in what follows, we call these data under selection bias just 'biased' for brevity).
In this paper we shall give general methods to solve problems as above: that is, the computation of counterfactuals with structural causal models where the available data are a mix of observational and interventional (e.g., based on randomised studies) datasets, possibly subject to selection bias. We refer to these data as heterogeneous.
Such a setting essentially coincides with the general problem of information fusion as defined by Bareinboim and Pearl (2016). That work, and nearly all others in the literature, has however focused on the special case of identifiable queries, namely, those that can be reduced to probabilistic expressions. The only apparent exception is the recent work by Zhang et al. (2022), which showcases the application of MCMC techniques to solve a problem of information fusion without selection bias. Selection bias in itself has long been studied in the causal literature (e.g., Cooper, 1995; Pearl, 2012; Bareinboim and Pearl, 2012; and Bareinboim and Tian, 2015), but the unidentifiable case is basically unexplored.
In contrast with the trend in the literature, the present work will entirely focus on the general case of partial identifiability. In particular, after reviewing the necessary background material about structural causal models and Bayesian networks in Section 2, we shall quickly review our EM-based scheme (we call it EMCC) for observational datasets in Section 3. The EMCC is an algorithm that takes a structural causal model in input, along with observational data, and allows us to approximately compute counterfactual inference in the form of a possible range of values included in the actual bounds. The EMCC will be extended to the case of biased datasets in Section 4 and to multiple studies in Section 5. We shall present in Section 6 a numerical validation on a synthetic benchmark and a case study on palliative care involving a biased observational dataset. Conclusions and outlooks will be reported in Section 7.¹

Background on Bayesian Networks and Structural Causal Models
A generic variable X is assumed to take values from a set Ω_X. Note that we only consider variables taking values from finite sets. A probability mass function (PMF) over X is denoted as P(X). Given variables Y and X, a conditional probability table (CPT) P(Y|X) is a collection of PMFs over Y indexed by the values of X, i.e., {P(Y|x)}_{x∈Ω_X}. If all the PMFs in a CPT are degenerate, i.e., {0, 1}-valued, we say that also the CPT is degenerate.
A structural equation (SE) f associated with variable Y and based on the input variable(s) X is a surjective function f: Ω_X → Ω_Y that determines the value of Y from that of X. Such an SE induces the degenerate CPT P(Y|X) via P(y|x) := ⟦f(x) = y⟧ for each x ∈ Ω_X and y ∈ Ω_Y, where ⟦·⟧ denotes the Iverson brackets, which take value one if the statement inside the brackets is true and zero otherwise (it is just another way to denote events, similarly to what one does via indicator functions).
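As a small illustration (the function and variable names below are ours, not from the paper), the degenerate CPT induced by an SE can be computed mechanically from the Iverson-bracket definition:

```python
def se_to_cpt(f, omega_x, omega_y):
    """Degenerate CPT induced by an SE via Iverson brackets:
    P(y|x) = 1 if f(x) = y and 0 otherwise."""
    return {x: {y: int(f(x) == y) for y in omega_y} for x in omega_x}

# A Boolean SE where Y is the negation of X.
cpt = se_to_cpt(lambda x: 1 - x, omega_x=[0, 1], omega_y=[0, 1])
# cpt[0] = {0: 0, 1: 1}: observing X = 0 deterministically yields Y = 1
```

Each row of such a CPT is a degenerate PMF, i.e., it puts all its mass on the single value selected by the SE.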
Consider a joint variable X := (X_1, ..., X_n) and a directed acyclic graph G whose nodes are in a one-to-one correspondence with the variables in X (we use a node in G and its corresponding variable interchangeably). A Bayesian network (BN) is a pair composed of G and a collection of CPTs {P(X_i|Pa_{X_i})}_{i=1}^n, where Pa_{X_i} denotes the parents of X_i, i.e., the direct predecessors of X_i according to G. A BN induces a joint PMF P(X) that factorises as P(x) = ∏_{i=1}^n P(x_i|pa_{X_i}), for each x ∈ Ω_X, where (x_i, pa_{X_i}) ∼ x, i.e., x_i and pa_{X_i} are the values of X_i and Pa_{X_i} consistent with x, for each i = 1, ..., n.

Now consider two disjoint sets of variables U and V, which we respectively refer to as exogenous and endogenous. A collection of SEs {f_V}_{V∈V} such that, for each V ∈ V, the input variables of f_V are in (U, V), with at least one of them in U, is called a partially specified structural causal model (PSCM). It corresponds to the notion of a 'functional causal model' in Pearl (2009, Chapter 1.4). A PSCM M induces the specification of a directed graph G with nodes in a one-to-one correspondence with the variables in (U, V) and such that there is an arc between two variables if and only if the first variable is an input variable for the SE of the second. The exogenous variables are therefore root nodes of G. We focus on semi-Markovian PSCMs, i.e., those PSCMs that lead to acyclic graphs (Pearl, 2009). Moreover, if each exogenous variable has exactly one endogenous child, we call the PSCM Markovian.²

SCMs assume SEs to be given. However, in the case where endogenous variables take on finitely many values, recent results allow one to specify a PSCM by only giving a causal graph; SEs are automatically defined, without loss of generality (yet at the cost of adding a likely excess of caution to the model), via a canonical specification. Let us introduce it in the simple case of Markovian models: we say that SE f_V is canonical if the states of the single (because of Markovianity) exogenous parent U of V index all the deterministic relations between the endogenous parents, denoted here as W_V, and V. This requires:

|Ω_U| = |Ω_V|^{|Ω_{W_V}|}.    (1)

As an example, in a Markovian model, if V is Boolean and has a single endogenous parent W, which is also Boolean, because of Equation (1) we need its exogenous parent U to have cardinality four in order to enumerate all the possible deterministic relations between the two Boolean variables W and V. E.g., we might let the four states of U index the four maps from Ω_W to Ω_V. A Markovian PSCM whose SEs are all canonical is also called canonical. We refer the reader to the works of Duarte et al. (2023) and Zhang et al. (2022) for a generalisation of such a concept to non-Markovian models.

¹ This paper extends a preliminary version presented at the International Conference on Probabilistic Graphical Models (Zaffalon et al., 2022). Section 4 includes a revision of the technical material originally presented in the conference version. The procedure to cope with hybrid data, discussed in Section 5, as well as the experiments and the case study (Section 6), are instead original material presented here for the first time.
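The canonical specification can be enumerated mechanically; the following sketch (ours, not from the paper) lists all the deterministic maps a state of U may index, whose count matches the required exogenous cardinality:

```python
from itertools import product

def canonical_functions(omega_w, omega_v):
    """Enumerate all deterministic maps f: Omega_W -> Omega_V.
    Each state u of the exogenous parent U indexes one such map,
    so a canonical U needs |Omega_V| ** |Omega_W| states."""
    return [dict(zip(omega_w, values))
            for values in product(omega_v, repeat=len(omega_w))]

fs = canonical_functions([0, 1], [0, 1])
# 2**2 = 4 maps between two Boolean variables, hence |Omega_U| = 4
```

For non-Boolean variables the count grows as |Ω_V|^{|Ω_W|}, which is the 'excess of caution' mentioned above: the canonical exogenous variable may be much larger than needed.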
In a PSCM M we obtain a joint state of V from a (joint) state of U by applying the SEs of M consistently with a topological order for G. A fully specified structural causal model (FSCM, see Pearl, 2009, top of p. 69) is just a PSCM M paired with a collection of marginal PMFs, one for each exogenous variable. As SEs induce (degenerate) CPTs, an FSCM defines a BN based on G whose joint PMF factorises as:

P(u, v) = ∏_{V∈V} P(v|pa_V) ∏_{U∈U} θ_u,    (2)

for each u ∈ Ω_U and v ∈ Ω_V, with (u, v, pa_V) ∼ (u, v), and where Pa_V are the parents of V according to G (i.e., the inputs of SE f_V), while θ_u denotes the true but unknown chance of U = u, to be considered for each u ∈ Ω_U and U ∈ U. Notation θ_U := (θ_u)_{u∈Ω_U} ∈ Δ_U is similarly used for the whole PMF, with Δ_U denoting the corresponding probability simplex. We extend this notation to the joint variable U. We shall instead use P(u) for an estimate of θ_u, and likewise for PMFs.

² The original definition of a functional causal model (Pearl, 2009, Section 1.4.1) requires that there is a one-to-one correspondence between endogenous and exogenous variables; that is, each exogenous variable has exactly one endogenous child and vice versa. In such a case, assuming independence of the exogenous variables makes the model Markovian. In this paper, as is common in much of the literature, we use an equivalent definition that keeps on requiring that endogenous variables have at least one exogenous parent, but allows exogenous variables to have more than one endogenous child; at the same time it assumes that the exogenous variables are mutually independent. To have Markovianity in this setting, we need to require in addition that exogenous variables have exactly one endogenous child, thus going back to the one-to-one correspondence in the original definition of a functional causal model.
Given the graph G of a PSCM (or FSCM) M, obtain G′ by removing from G any arc connecting pairs of endogenous variables. Let {G_c}_{c∈C} denote the connected components of G′. The c-components of M are the elements of the partition {V^(c)}_{c∈C} of V, where V^(c) denotes the endogenous nodes in G_c, for each c ∈ C (Tian, 2002). This procedure also induces a partition of U, similarly denoted as {U^(c)}_{c∈C}. Moreover, for each c ∈ C, let W^(c) denote the union of the endogenous parents of the nodes in V^(c) and V^(c) itself. Finally, for each V ∈ V^(c), obtain W_V by removing from W^(c) the nodes topologically following V and V itself (note that in the notation we dropped the index c, as this can be implicitly retrieved from V). As an example, Figure 2.a describes the identification of the two c-components of the model in Figure 1: in this case we have V^(1) = {V_1, V_3} and V^(2) = {V_2}. Tian (2002) also shows that the joint PMF P(V), obtained by marginalising the exogenous variables from the joint PMF in Equation (2), is a BN, to be called here endogenous, that factorises as follows:

P(v) = ∏_{V∈V} P(v|w_V),    (3)

for each v ∈ Ω_V, with (v, w_V) ∼ v. In FSCMs, the CPTs on the right-hand side of Equation (3) can be computed through standard BN inference algorithms by simply regarding the FSCM as a BN. With PSCMs, assuming the availability of a dataset D of endogenous observations, we might also define an endogenous BN with the same factorisation and whose CPTs are directly assessed from D.
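The c-component construction can be sketched directly with a small union-find over the graph (our own code; the arc list mirrors the model of Figure 1, and the endogenous arcs are immaterial to the result since they are dropped first):

```python
def c_components(arcs, exogenous, endogenous):
    """Partition the endogenous nodes into c-components: drop arcs
    between endogenous variables, then take the connected components
    of what remains (Tian, 2002)."""
    # keep only arcs leaving an exogenous variable
    kept = [(a, b) for (a, b) in arcs if a in exogenous]
    # union-find over all nodes
    parent = {n: n for n in exogenous | endogenous}
    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n
    for a, b in kept:
        parent[find(a)] = find(b)
    comps = {}
    for v in endogenous:
        comps.setdefault(find(v), set()).add(v)
    return list(comps.values())

# Graph of Figure 1: U1 -> {V1, V3}, U2 -> {V2}, plus endogenous arcs.
arcs = [("U1", "V1"), ("U1", "V3"), ("U2", "V2"),
        ("V1", "V3"), ("V2", "V3")]
cs = c_components(arcs, {"U1", "U2"}, {"V1", "V2", "V3"})
# -> the two c-components {V1, V3} and {V2}
```

The confounder U_1 glues V_1 and V_3 into a single c-component, while V_2, with its private exogenous parent, forms a c-component of its own.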
Observational queries in FSCMs can be addressed in the endogenous BN, and the same can be done for PSCMs assuming the availability of the dataset D of endogenous observations. To perform causal inference, interventions denoted as do(·) should be considered instead. In an FSCM or PSCM M, given V ∈ V and v ∈ Ω_V, do(V = v) simulates a physical action on M forcing V to take the value v. The original SE f_V should be consequently replaced by a constant map V = v. Notation M_v is used for such a modified model, whose graph is obtained by removing from G the arcs entering V, and for which evidence V = v is considered. This operation is often called surgery. In an FSCM M, given V, W ∈ V and v ∈ Ω_V, P(w|do(v)) denotes the conditional probability of W = w in the post-intervention model, i.e., P′(w|v), where P′ is the joint PMF induced by M_v.
A more general setup is provided by counterfactual queries, where the same variable may be observed as well as subject to intervention, albeit in distinct worlds. In mathematical parlance, if W are the queried variables, V′ the observed ones and V′′ the intervened ones, we write the query as P(W_{v′′}|v′), with possibly V′ ∩ V′′ ≠ ∅. A popular example of a counterfactual query involving two endogenous Boolean variables X and Y of an FSCM is the probability of necessity (PN), i.e., the probability that event Y would not have occurred by disabling X, given that X and Y did in fact occur. This corresponds to P(Y_{X=0} = 0|X = 1, Y = 1). Similarly, the probability of sufficiency (PS) P(Y_{X=1} = 1|X = 0, Y = 0) and of necessity and sufficiency (PNS) P(Y_{X=1} = 1, Y_{X=0} = 0) are often considered.
Computing counterfactual queries in an FSCM may be achieved via an auxiliary structure called a twin network (Balke and Pearl, 1994). This is simply an FSCM where the original endogenous nodes (and their SEs) have been duplicated, while remaining affected by the same exogenous variables (see, e.g., Figure 2.b). More general and compact structures (e.g., involving more than two copies of the same endogenous node) can also be considered (Shpitser and Pearl, 2007). Computing a counterfactual in the twin network of an FSCM is analogous to what is done with interventional queries, provided that interventions and observations are associated with distinct copies of the same variable. If the query involves multiple variables, like in the case of the PNS, an auxiliary Boolean child of these variables, which is true if and only if its parent variables are in the queried state, can be added to the model. BN inference eventually allows one to compute the counterfactual query in such an augmented model.
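For intuition, in a toy FSCM with a single exogenous variable confounding X and Y, the PNS reduces to summing the chances of the exogenous states under which the treatment is necessary and sufficient. The following sketch (a hypothetical model of our own, with a canonical SE for Y) enumerates Ω_U instead of building the twin network explicitly:

```python
def pns(theta_u, f_y):
    """PNS = P(Y_{X=1}=1, Y_{X=0}=0): sum the exogenous chances of
    the states u under which treating helps and not treating harms."""
    return sum(p for u, p in theta_u.items()
               if f_y(1, u) == 1 and f_y(0, u) == 0)

# Canonical f_Y: u in {0,...,3} indexes the four maps X -> Y,
# written as (f(0), f(1)).
f_y = lambda x, u: [(0, 0), (0, 1), (1, 0), (1, 1)][u][x]
pns({0: 0.1, 1: 0.6, 2: 0.2, 3: 0.1}, f_y)  # only u = 1 qualifies -> 0.6
```

In larger models one cannot enumerate Ω_U directly, which is why the twin network and standard BN inference are used instead.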

Partial Identifiability (with a Single Unbiased Observational Dataset)
In this section we formalise the fundamental notion of partially identifiable query and review the basics of the EM-based scheme proposed by Zaffalon et al. (2021) to address queries of this kind.³ Such an EM procedure considers the case of a single, purely observational and unbiased, dataset. An extension to the case of (single and observational) biased datasets is in Section 4, while the fully general case of multiple, observational and interventional, unbiased or biased, datasets is addressed in Section 5.

Figure 1: The graph of a structural causal model over three endogenous variables (black nodes) with two exogenous variables (grey nodes). The model is used to describe the data in Table 1, with V_1 corresponding to Treatment, V_2 to Gender and V_3 to Survival. The exogenous variable U_1 acts as a confounder for V_1 and V_3, this being sufficient to have results consistent with those of Mueller and Pearl (2023), which are obtained by conditioning on V_2.
As already noticed in the previous section, the general problem of computing counterfactuals can be reduced to standard BN inference in FSCMs. Yet exogenous variables are typically latent; their marginal PMFs being unavailable leaves us with PSCMs.
Say that we have a PSCM M over variables (U, V) together with a dataset D of endogenous observations. This permits quantifying the endogenous BN from which any observational query can be computed. The do-calculus (Pearl, 2009) instead permits reducing identifiable interventional queries on PSCMs to observational ones, which are possibly then computed in the endogenous BN. The remaining queries are called here partially identifiable.
To start approaching partially identifiable queries in the PSCM M from D, we first compute the endogenous BN inducing the endogenous joint PMF P(V). To be consistent with the endogenous PMF, the exogenous PMF θ_U of an FSCM should satisfy:

∑_{u∈Ω_U} ∏_{V∈V} P(v|pa_V) ∏_{U∈U} θ_u = P(v),    (4)

for each v ∈ Ω_V, with (u, v, pa_V) ∼ (u, v) and θ_u ∼ θ_U. If such a θ_U exists, we say that the endogenous BN is M-compatible with the original PSCM (Zaffalon et al., 2021). In general there are multiple FSCMs satisfying Equation (4), that is, there is a set Θ_U ⊆ Δ_U of elements θ_U satisfying Equation (4). To compute a partially identifiable causal query in PSCMs, we should therefore consider all θ_U ∈ Θ_U and compute the query for each of them. The result of the partially identifiable query is summarised by the lower and upper bounds obtained considering the outcomes of all the (M-)compatible FSCMs.
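For intuition, checking whether a candidate θ_U satisfies the compatibility condition of Equation (4) is straightforward in a toy one-variable model (model and names are ours): a single Boolean V whose exogenous parent U has four states, two of which map to each value of V, so that infinitely many θ_U are compatible with the same endogenous PMF.

```python
def induced_pv(theta_u, f_v, omega_u):
    """Endogenous PMF induced by the exogenous chances theta_u
    through a single SE f_V (the left-hand side of Equation (4)
    for a one-variable toy model)."""
    pv = {}
    for u in omega_u:
        pv[f_v(u)] = pv.get(f_v(u), 0.0) + theta_u[u]
    return pv

# Toy SE: u in {0, 1} -> v = 0, u in {2, 3} -> v = 1.
f_v = lambda u: 0 if u < 2 else 1
target = {0: 0.75, 1: 0.25}   # endogenous PMF estimated from data
# This theta_U is M-compatible; so is any other with the same sums.
ok = induced_pv({0: 0.5, 1: 0.25, 2: 0.125, 3: 0.125},
                f_v, range(4)) == target
```

Only the sums of the chances inside each preimage of f_V are pinned down by the data; the split inside each preimage is free, which is precisely what makes the query partially identifiable.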
To practically pursue this goal, it is useful to consider the log-likelihood of D from a generic FSCM based on M, i.e.,

LL(θ_U) := ∑_{v∈D} log ∑_{u∈Ω_U} ∏_{V∈V} P(v|pa_V) ∏_{U∈U} θ_u,    (5)

with (u, v, pa_V) ∼ (u, v) and θ_u ∼ θ_U. Interestingly, this function has no local maxima, and achieving its global maximum is an equivalent condition to identify the (M-)compatible FSCMs, as stated by the following result.
Theorem 1 (Zaffalon et al., 2021, Theorem 1). If Θ_U ≠ ∅, the log-likelihood in Equation (5) has no local maxima, and achieving its global maximum at θ_U is an equivalent condition for θ_U ∈ Θ_U. If Θ_U = ∅, instead, the log-likelihood takes only values strictly smaller than the global maximum.
The proof of the above result is in Zaffalon et al. (2021), and also in Zaffalon et al. (2023).For the sake of a self-contained presentation, we also report a proof in Appendix A.
The lack of local maxima allows us to employ an expectation-maximisation (EM) scheme, iterated with different random initialisations, to sample points from Θ_U. Given a PSCM M and a dataset of (unbiased) endogenous observations D, Algorithm 1 yields an FSCM compatible with the PSCM (provided that there is one): such an FSCM is obtained by adding PMFs to the exogenous variables of the PSCM in such a way that the resulting model can generate the distribution of the available data.
In practice, given an initialisation {P_0(U)}_{U∈U}, the EM algorithm consists in regarding the posterior probability P_0(u|v) as a pseudo-count for (u, v), for each v ∈ D, u ∈ Ω_U and U ∈ U (E-step). A new estimate P_1(u) := (1/|D|) ∑_{v∈D} P_0(u|v) is consequently obtained (M-step). The scheme is iterated until convergence.
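To make the E- and M-steps concrete, here is a minimal self-contained sketch (a toy model of our own, not the released implementation): a single Boolean endogenous variable with a four-state exogenous parent, where repeated runs from random initialisations land on different M-compatible exogenous PMFs.

```python
import random

def emcc_run(omega_u, f_v, data, iters=50, rng=random):
    """One EMCC run: random initialisation of P(U), then EM over the
    endogenous observations in `data` (E-step: posterior pseudo-counts;
    M-step: renormalise)."""
    p = [rng.random() for _ in omega_u]            # random initialisation
    z = sum(p)
    p = [x / z for x in p]
    for _ in range(iters):
        counts = [0.0] * len(omega_u)
        for v in data:
            # E-step: P(u|v) by Bayes rule with the degenerate P(v|u)
            post = [p[u] * (f_v(u) == v) for u in omega_u]
            z = sum(post)
            for u in omega_u:
                counts[u] += post[u] / z
        # M-step: new estimate of P(u) from the pseudo-counts
        p = [c / len(data) for c in counts]
    return p

# Toy model: u in {0, 1} maps to v = 0 and u in {2, 3} to v = 1;
# the split inside each pair is not identifiable from the data.
f_v = lambda u: 0 if u < 2 else 1
data = [0, 0, 0, 1]
runs = [emcc_run(range(4), f_v, data) for _ in range(20)]
# every run satisfies p[0] + p[1] = 0.75 and p[2] + p[3] = 0.25,
# but the split inside each pair varies with the initialisation
```

Collecting the query value across such runs is exactly how the inner approximation (the 'range' defined below) is obtained.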
Subroutine initialisation provides a random initialisation of the exogenous PMFs (line 1). The iterative procedure stops when the log-likelihood achieves a maximum (line 3). Note that the numerical value of the global maximum of the log-likelihood can be proved to coincide with the log-likelihood of the dataset D from the endogenous BN whose parameters are quantified with the frequencies in D itself (Zaffalon et al., 2021). Such a value can be used to detect the two special cases in which the EMCC does not converge to the global maximum. The first is convergence to a saddle point: this is an unstable point, and a global maximum replacing that value can be easily obtained by a small perturbation of the initialisation. On the other hand, if there is no convergence to the global maximum irrespectively of the initialisation, then the endogenous BN is not M-compatible. As discussed by Zaffalon et al. (2021), this is due to wrong modelling or to insufficient data, and it makes inferences unreliable irrespectively of the procedure adopted to compute them.
Computing a causal query in the FSCM returned by a single EMCC run gives the exact value of the query in the identifiable case, which then corresponds to a point inside the bounds that characterise the partially identifiable case represented by the PSCM. An approximation of those bounds is eventually achieved by iterating the EM scheme above with variable initialisation values. This corresponds to sampling the space of FSCMs compatible with the given PSCM (the overall number of EM runs is denoted by r), and collecting the wanted counterfactual values across all these models. We call 'range' the approximation to the actual bounds that is delimited by the lower and upper values of the r outcomes above. We say that the range is an 'inner' approximation to stress that it lies between the actual lower and upper bounds. Approximations that instead encompass the actual bounds are called 'outer'.
A demonstrative example of our EM procedure is reported here below. Here and in the other examples we consider canonical SEs (see Section 2). Nevertheless, all the methods we present in this paper can be applied to models with SEs of any kind, no matter whether canonical or not.

Example 1. Consider the observational study associated with the last eight rows in Table 1 and a PSCM M based on the causal graph in Figure 1 with a canonical specification of the SEs. This requires |Ω_{U_1}| = 64 and |Ω_{U_2}| = 2. We are interested in a counterfactual query corresponding to the PNS for V_1 (i.e., Treatment) as subject of the intervention and V_3 (i.e., Survival) as queried variable. Each run of Algorithm 1 corresponds to an FSCM. The range of the corresponding PNS after r = 300 runs and a rounding to the second digit is given by [0.00, 0.43]. This is the same range we obtain by applying the mapping to credal networks of Zaffalon et al. (2020), and then running an exact credal network inference algorithm.⁴ In the other examples discussed in the rest of the paper we consider, with different setups, the same counterfactual query computed with the same number of runs and rounding strategy. The script to reproduce all these simulations is available together with the material released to reproduce the experiments (see Section 6).
Other approaches have been proposed to approximate partially identifiable bounds (Zaffalon et al., 2020; Duarte et al., 2023; Zhang et al., 2022). Yet, these are focused on the case of unbiased data, and no viable technique to compute general counterfactual inference under selection bias seems to be available.⁵ This is the focus of the next section, which shows how the EMCC of Zaffalon et al. (2021) has a natural extension to problems of selection bias.

Partial Identifiability under Selection Bias
To model the effect of selection bias, we define a selector S, i.e., a Boolean variable that is true for selected states of V and false otherwise. We consider deterministic selectors, i.e., S := g(V) with g: Ω_V → {0, 1}. E.g., for the observational data in Table 1, if we regard drug and female as the first states of the variables Treatment and Gender, the selector is true if and only if the two variables are in the same state. S can be embedded in a PSCM M as a common child of the endogenous variables, with g being the SE of S, thus acting as an additional endogenous variable with endogenous parents only. Notation M_S is used to denote such an augmented PSCM obtained from M.
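For the running example, the selector g can be sketched as follows (the record encoding is ours): with drug and female encoded as state 0 of Treatment and Gender, a record is kept (S = 1) exactly when the two variables are in the same state, which excludes treated males and untreated females as in the grey rows of Table 1.

```python
def g(record):
    """Deterministic selector S := g(V): keep the record iff
    Treatment and Gender are in the same (index) state."""
    return int(record["treatment"] == record["gender"])

records = [{"treatment": 0, "gender": 0},   # treated female: kept
           {"treatment": 0, "gender": 1},   # treated male: dropped
           {"treatment": 1, "gender": 0},   # untreated female: dropped
           {"treatment": 1, "gender": 1}]   # untreated male: kept
selected = [r for r in records if g(r) == 1]
```

Embedded in M_S, this same function is the (degenerate) SE of the selector node S.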
Selector S induces a partition among the records of D. The endogenous values are assumed to become missing on the records such that S = 0, while remaining available if S = 1. Let us denote the corresponding datasets as D_{S=0} and D_{S=1}. In terms of cardinalities, if N_{S=0} := |D_{S=0}| and N_{S=1} := |D_{S=1}|, then N_{S=0} + N_{S=1} = N =: |D|. Note that D_{S=0} contains N_{S=0} identical records. Figure 3 depicts an example of the partition for the biased observational data of Table 1.

⁴ As discussed in Section 2, here and in the following, we compute PNS values in the twin network associated with the FSCM returned by each EMCC run.

⁵ This is also the case for the formulae proposed by Tian and Pearl (2000) to bound PNS (but also PN and PS) queries by observational and interventional ones. Yet, at least in general, only outer approximations are returned by this approach. As a simple example, consider a PSCM over Treatment (V_1) and Survival (V_3), with the first endogenous variable being a parent of the second one and U_1 their common exogenous parent. Setting |Ω_{U_1}| = 8 allows one to implement a canonical specification of the SEs of both V_1 and V_3. With the observational data in Table 1, such a canonical PSCM induces PNS bounds equal to [0.00, 0.43]. These values coincide with the bounds of Tian and Pearl (2000), which are therefore tight. Removing states from Ω_{U_1} yields a non-canonical PSCM to which the formulae of Tian and Pearl (2000) apply unchanged. Yet, the actual bounds of such a non-canonical PSCM become [0.00, 0.09]. The code to reproduce this simple result can be found in the companion repository of the paper (see Section 6).
Figure 3: Selecting the data of the observational study in Table 1.
To address partially identifiable queries under selection bias, let us first reformulate the notion of M-compatibility for M_S. On the selected states the compatibility condition is like in the unbiased case, and we consequently require Equation (4) to be satisfied for each v such that g(v) = 1, where the joint endogenous probability on the right-hand side is computed by an endogenous BN whose CPTs are obtained from the frequencies in D_{S=1}. For the unselected states, as the only endogenous observation is S = 0, we require instead:

∑_{v∈Ω_V : g(v)=0} ∑_{u∈Ω_U} ∏_{V∈V} P(v|pa_V) ∏_{U∈U} θ_u = P(S = 0),    (6)

where, like for the endogenous joint probabilities, also the marginal probabilities of the selector are obtained from the (cardinalities of the) data, e.g., P(S = 0) ≃ N_{S=0}/N. A discussion on how to proceed if these cardinalities or probabilities are not available is provided at the end of this section.

The task of detecting the set Θ_U of elements θ_U satisfying the above M-compatibility conditions can be reduced to a log-likelihood maximisation as in the unbiased case. The log-likelihood of D_S := D_{S=0} ∪ D_{S=1} from an FSCM based on M_S can be written as:

LL(θ_U) = LL_{S=1}(θ_U) + LL_{S=0}(θ_U),    (7)

with:

LL_{S=1}(θ_U) := ∑_{v∈D_{S=1}} log ∑_{u∈Ω_U} ∏_{V∈V} P(v|pa_V) ∏_{U∈U} θ_u,    (8)

LL_{S=0}(θ_U) := N_{S=0} log ∑_{v∈Ω_V : g(v)=0} ∑_{u∈Ω_U} ∏_{V∈V} P(v|pa_V) ∏_{U∈U} θ_u.    (9)

The following result, whose proof is in Appendix A, provides a generalisation of Theorem 1 to the case of biased data.
Theorem 2. For Θ_U ≠ ∅, the log-likelihood in Equation (7) has no local maxima, and achieving its global maximum at θ_U is an equivalent condition for θ_U ∈ Θ_U. For Θ_U = ∅, instead, the log-likelihood takes only values strictly smaller than the global maximum.
As a consequence of Theorem 2, we can apply Algorithm 1 to PSCM M_S and dataset D_S to obtain FSCMs whose log-likelihood takes its global maximum and which hence satisfy the compatibility constraints. Causal queries computed with those FSCMs are consequently inside the bounds for the partially identifiable query. The more points, the better the approximation.
Note that the data in D_{S=0} are N_{S=0} instances of the same observation S = 0. Thus, when coping with biased data, line 5 of Algorithm 1 rewrites as:

P_1(u) := (1/N) [ ∑_{v∈D_{S=1}} P_0(u|v) + N_{S=0} · P_0(u|S = 0) ].    (10)

Equation (10) shows the divergent effect of the two datasets D_{S=0} and D_{S=1} on the exogenous chances. While the selected states push the chances towards their unbiased values, the unselected ones act as a noise term only forcing the chances to be zero on the exogenous states generating the selected endogenous states. Overall, compared to the unbiased case, we expect this weighted average to induce wider bounds on the counterfactual queries. An example is reported here below.

Example 2. Consider the same setup as in Example 1. Here we compute the PNS for the observational data in Table 1 with a selector preventing the data associated with the grey rows from being available. These are half of the data, which means P(S = 1) = 0.5. With the two datasets induced by such a selector (see Figure 3), the modified version of Algorithm 1 for biased data returns the range [0.00, 0.73]. Unlike the unbiased case considered in Example 1, a comparison with the exact method cannot be performed in this case.

Figure 4 depicts the ranges obtained for different values of P(S = 1). Note that in the limit P(S = 1) → 0, we are basically ignoring the data in D_{S=1}. If this is the case, the log-likelihood is proportional to N_{S=0}, and this implies that the EMCC gives the same results irrespectively of the size of the (completely missing) dataset. In those cases our EMCC trivially returns the exogenous PMFs sampled for initialisation. This explains, for instance, the vacuous result obtained in this limit in the left side of Figure 4.
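The modified M-step of Equation (10) amounts to a weighted average of posteriors; a minimal sketch with hypothetical posterior values (names and numbers are ours, for illustration only):

```python
def m_step_biased(post_sel, post_s0, n_s0):
    """Biased M-step of Equation (10): average the posteriors P(u|v)
    over the selected records together with N_{S=0} copies of the
    posterior P(u|S=0) for the fully missing ones."""
    n = len(post_sel) + n_s0          # total sample size N
    k = len(post_s0)                  # |Omega_U|
    return [(sum(p[u] for p in post_sel) + n_s0 * post_s0[u]) / n
            for u in range(k)]

# Two selected records with degenerate posteriors, plus three
# unselected records sharing the posterior P(u|S=0) = (0.5, 0.5).
p = m_step_biased([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5], n_s0=3)
# -> [0.5, 0.5]
```

The larger N_{S=0} is relative to N_{S=1}, the more the update is dominated by the uninformative P(u|S = 0) term, consistently with the wider ranges observed for small P(S = 1).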
Let us also remark, with reference to Equation (10), that the execution of the EMCC under selection bias requires the cardinality N_{S=0} of D_{S=0}. This corresponds to the common case of an observational process making unavailable the output of the measurement, but not the fact that the measurement (attempt) was performed. We expect therefore that an investigator will in many cases have an idea of the size of the neglected part of the population; or that they will be able to estimate N_{S=0} from P(S = 0) as N_{S=0}/N_{S=1} ≃ P(S = 0)/P(S = 1); or they could estimate an upper bound for P(S = 0). These estimates can follow even simply from first estimating the sample size before the selection; time and resource constraints or domain knowledge are usually enough to bound such a size. In the worst possible case, one could use the (very) conservative approach of taking the limit P(S = 0) → 1. This is not recommended because it corresponds to a very extreme measure (e.g., like in Figure 4).
Finally notice that, as in principle S might be a common child of all the variables in V, implementing SE g as a CPT might lead to an exponential blow-up, thus making the inference of P(U|S = 0) required by the EMCC intractable. In practice, as in Bareinboim and Pearl (2012) or in our example, the selector might only depend on a subset of V of bounded cardinality. Circuital approaches like the one recently proposed by Darwiche (2022) might be considered to bypass such a limit.

Coping with Multiple Datasets
In the previous sections we addressed the problem of partial identifiability in causal queries when coping with a purely observational dataset, no matter whether unbiased (Section 3) or biased (Section 4). The challenge of this section is to extend our approach to the case of multiple datasets in a fully general setup mixing observational and interventional data. We initially assume the data unbiased just for the sake of presentation. The extension to the biased case is discussed in the last part of the section.
Say that our domain of interest is described by a PSCM M over (U, V). Endogenous observations from d independent, observational or interventional, studies are available. Denote as D_k the dataset of size N_k := |D_k| with the results of the k-th study, for each k ∈ {1, ..., d}. The endogenous variables V_I^k ⊆ V are those subject to interventions in D_k, while for observational studies we simply set V_I^k := ∅. Without any loss of generality, we assume that the intervened variables are always the same across the records of a same dataset. If this is not the case, we just split the dataset into smaller groups homogeneous w.r.t. the set of intervened variables. The states induced by the interventions inside each dataset might instead be different (e.g., a clinical study where we intervene by giving or not giving the drug to a patient). For D_k, we denote the set of these intervened states as Ω̃_{V_I^k}, while Ω̃_{V_I^k} := ∅ for an observational D_k. We further clarify such a notation by means of the following example.
Example 3. Let D_1 denote the interventional dataset in Table 1 (first eight rows) and D_2 the observational one (last eight rows). In D_1 the intervened variable is Treatment, i.e., V_I^1 = {V_1}, with both of its states induced by the interventions, while V_I^2 = ∅ as D_2 is observational.

To merge the d datasets, let D := ∪_{k=1}^d D_k denote the dataset with the N := ∑_{k=1}^d N_k records of all the studies, and introduce an auxiliary index variable W whose states identify the different interventions, i.e., Ω_W := (∪_{k=1}^d Ω̃_{V_I^k}) ∪ {w̄}. We consequently augment D with a column of observations of W. The values of W for records originally belonging to the interventional datasets are those corresponding to the intervened states, while the state w̄ is assigned to the records from the observational datasets. We eventually denote as D′ the dataset of N (complete) observations of (V, W) obtained in this way. Let us demonstrate the above merging procedure by means of an example.
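The merging procedure can be sketched as follows (the data layout and all names are ours, for illustration only): each record gets a W column holding either its intervened states or a dummy observational state.

```python
def merge_datasets(datasets):
    """Merge observational/interventional datasets into one purely
    observational dataset over (V, W): records from an interventional
    dataset get W set to their intervened (variable, state) pairs,
    observational records get the dummy state 'w*'."""
    merged = []
    for intervened_vars, records in datasets:
        for r in records:
            w = (tuple((v, r[v]) for v in intervened_vars)
                 if intervened_vars else "w*")
            merged.append({**r, "W": w})
    return merged

d1 = (["V1"], [{"V1": 0, "V3": 1}, {"V1": 1, "V3": 0}])  # randomised trial
d2 = ([], [{"V1": 0, "V3": 0}])                           # observational
dp = merge_datasets([d1, d2])   # D': three records over (V, W)
```

Since W is observed on every record, D′ can be treated as a purely observational dataset, which is what the auxiliary model below exploits.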

Example 4. Consider the setup in Example 3 with
Given M and D′, let us build a, so-called auxiliary, PSCM M′ as follows. Add variable W to M as an endogenous auxiliary variable corresponding to an additional parent of all the endogenous variables in V. The SE f′_V associated with V in M′ is defined as:

f′_V(pa_V, u_V, w) := v_w, if w ≠ w_∅ and V ∈ V_I^{k(w)}, and f′_V(pa_V, u_V, w) := f_V(pa_V, u_V) otherwise, (11)

for each pa_V ∈ Ω_{Pa_V}, w ∈ Ω_W, and V ∈ V, where f_V is the SE of V in M, k(w) is the index of the interventional dataset associated with w, for each w ≠ w_∅, and v_w ∈ Ω_V is the state of V appearing in w. In practice, Equation (11) implements the surgery on V required by an intervention if W takes a value corresponding to an intervention involving also V, while leaving the original SE of M untouched otherwise. For a proper PSCM specification, as our definition requires each endogenous variable to have at least one exogenous parent, we finally add to M′ an exogenous variable U_W defined as the unique parent of W, with Ω_{U_W} := Ω_W and the identity map as SE for W. As an example, Figure 5 depicts the graph of the auxiliary PSCM M′ obtained from the PSCM M discussed in Example 1 and the merged dataset in Table 2.
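The surgery above can be sketched as code; this is a minimal illustration of the mechanism of Equation (11), with an encoding of the index states and a toy mechanism that are entirely ours:

```python
W_NONE = "w0"  # index state for 'no intervention' (w_empty in the paper)

def make_auxiliary_se(f_V, V, intervened_vars_of, state_of):
    """Build f'_V as in Equation (11). f_V: original SE, called as
    f_V(pa_V, u_V); intervened_vars_of[w]: endogenous variables intervened
    under index state w; state_of[(w, V)]: the state v_w forced on V."""
    def f_V_prime(pa_V, u_V, w):
        if w != W_NONE and V in intervened_vars_of[w]:
            return state_of[(w, V)]  # surgery: parents and u_V are ignored
        return f_V(pa_V, u_V)        # otherwise keep the SE of M
    return f_V_prime

# toy SE: Treatment copies its parent unless the exogenous state flips it
f_T = lambda pa, u: pa if u == 0 else 1 - pa
f_T_prime = make_auxiliary_se(
    f_T, "T",
    {"drug": {"T"}, "no drug": {"T"}},
    {("drug", "T"): 1, ("no drug", "T"): 0})
```

When W takes an interventional state the forced value is returned regardless of the parents, and under w_∅ the original mechanism is recovered.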
The auxiliary PSCM M′ embeds all the possible interventions included in the d datasets. Thanks to the index variable W, which keeps track of the different interventions, we can regard D′ as a purely observational dataset for the endogenous variables of M′. By Theorem 1, we can therefore apply the EMCC scheme to address non-identifiable queries as in Section 3 even when coping with the generalised setup discussed in this section.
In particular, the global maximum of the log-likelihood achieved by the EM scheme should correspond to the log-likelihood assigned to the dataset by an endogenous BN whose parameters have been trained on the dataset itself. As W is a common parent of all the endogenous variables, the endogenous graph of M′ is just the endogenous graph of M augmented by W acting as a common parent of all the other endogenous variables. As a first example of a partially identifiable query based on heterogeneous data, let us consider the following one, where the data integration induces more informative ranges.

Example 5. The PNS query for the (unbiased) observational data in Table 1 yields the range [0.00, 0.43]. By also considering the interventional data, the EMCC applied to the model in Figure 5 and the dataset in Table 2 returns instead the range [0.32, 0.42].
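The range delivered by the EMCC is simply the interval spanned by the query values of the individual runs, i.e., an inner approximation of the actual bounds; a minimal sketch (the run values below are illustrative, not actual outputs):

```python
def emcc_range(run_values):
    """Inner approximation of the query bounds: the interval spanned by
    the query values computed on the FSCMs returned by the EMCC runs."""
    vals = list(run_values)
    return min(vals), max(vals)

# illustrative PNS values from r = 5 hypothetical runs
lo, hi = emcc_range([0.33, 0.40, 0.32, 0.42, 0.37])
```

More runs can only widen the delivered interval towards the true bounds, which is why the number of runs r trades accuracy for computation time.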
So far we only considered the case of unbiased datasets. In the presence of a selection bias, even if only on some datasets, we should first add the selector variable S to all the d datasets, with its value being constantly true on the unbiased ones. The construction of the auxiliary PSCM M′ from M is analogous, provided that the auxiliary variable W also becomes a parent of S, with the SE of S in M′ implementing the different selection functions depending on the particular value of W. This requires S to be a common child of all the endogenous variables. A more compact approach consists in considering only the union of the endogenous variables involved in the different biases for the different datasets. Other approaches involving auxiliary variables might also be considered. Note also that we might need different states of W to model the same intervention (or lack of intervention) in different datasets. We do not explicitly formalise this point just for the sake of light notation. Once this is done, Algorithm 1 can be executed on M′ with D′ as in the unbiased case. An example is reported here below.

Example 6. In the same setup as Example 5, consider also the selection bias preventing the grey rows of the observational study in Table 1 from being available. To compute the PNS in this case we add the Boolean selector S to all the records and obtain a merged dataset D′ as in Table 2 with an additional column associated with S, which is always one apart from the four grey rows where it is zero. Figure 6 depicts the corresponding auxiliary model M′. Given D′ and M′, the EMCC returns the range [0.27, 0.53] for the PNS (with the observational data only we had [0.00, 0.73]).
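Adding the selector column can be sketched as follows, assuming the same record-per-dict representation as before; the selection function shown is purely illustrative:

```python
def add_selector(records, selection=None):
    """Append the selector column S: constantly 1 on an unbiased dataset
    (selection=None), else 1 exactly on the records the bias retains."""
    out = []
    for rec in records:
        rec = dict(rec)  # do not mutate the caller's records
        rec["S"] = 1 if selection is None else int(selection(rec))
        out.append(rec)
    return out

obs = [{"Gender": "male", "Outcome": "recovery"},
       {"Gender": "female", "Outcome": "death"}]
# purely illustrative bias: only records of male patients are retained
biased = add_selector(obs, selection=lambda r: r["Gender"] == "male")
```

On an unbiased dataset the call with `selection=None` reproduces the constantly-true selector prescribed above.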
As a final remark, notice that so far we implicitly assumed the different datasets to be generated by the same exogenous chances. Yet, this might not always be the case: consider for instance two clinical studies in different countries where the body mass index has very different distributions over the population. Different chances for the different datasets can simply be introduced to cope with such an extended setup. To model this in the auxiliary PSCM M′, we should set the index variable W to be also a parent (or the parent of an auxiliary parent) of the exogenous variables having different chances. Although this does not respect the standard definition of PSCMs (cf. Section 2), W is always observed in D′ and therefore acts as a selector for the exogenous PMF to be updated by the EMCC. Such an index should also be specified in the query of interest. The above setup is demonstrated by means of the following example.
Example 7. In the same setup as Example 6, assume that the two studies in Table 1 refer to two different exogenous chances associated with the endogenous variable V_2 (i.e., Gender). Let us modify the auxiliary PSCM M′ in Figure 6 in order to take this difference into account. An arc directly connecting W and U_2 would not work, as the (three) states of W distinguish between the two interventional states in D_1, which in our assumptions refer to the same exogenous chance. This can be solved by a coarsening variable W′ := i(W), specified as a child of W and returning the index of the dataset we refer to, i.e., i(W = w_∅) = 1 and i(W = drug) = i(W = no drug) = 2. Variable W′ is eventually specified as a parent of U_2, with P(U_2 | W′ = 1) and P(U_2 | W′ = 2) modelling the expectations of the exogenous chances in the two datasets. The PNS range we obtain in this case is [0.20, 0.54] for the chances associated with the observational dataset, this corresponding to a looser range than [0.27, 0.53], which we obtained in Example 6 by forcing the chances to be equal.
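The coarsening i(·) of Example 7 is just a deterministic map from the states of W to dataset indices, collapsing the two interventional states of the same study; a sketch with the states of the running example (the string encodings are ours):

```python
def i(w):
    """Coarsening W' := i(W): return the index of the dataset the state w
    refers to, so that both interventional states of the same study map to
    the same exogenous chance."""
    return 1 if w == "w0" else 2  # "w0": the no-intervention state
```

W′ can then be wired as a parent of U_2, with one conditional exogenous PMF per dataset index.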

Empirical Validation
To evaluate the potential of our approach to the computation of partially identifiable queries with heterogeneous data sources, we report and discuss the results of extensive tests based on a benchmark of synthetic data and causal models (Section 6.1), and a real-world application of counterfactual analysis with biased data (Section 6.2).
Algorithm 1 was already implemented within the CREDICI library, a Java tool for causal inference (Cabañas et al., 2020). We enhance the library with an implementation of the procedures presented in Sections 4 and 5, thus allowing the computation of counterfactuals with heterogeneous data. To the best of our knowledge, this is the first tool for causal inference in such a general setting. The code to reproduce the experiments is available in a dedicated repository.

Synthetic Data
We use the Erdős-Rényi scheme to sample directed acyclic graphs. If the sampled graph is such that there are non-root nodes without root parents, we add to the graph a new root parent node for each one of these nodes. Parentless nodes are then regarded as exogenous variables, and the other ones as Boolean endogenous variables. SEs and exogenous cardinalities are obtained by sampling (without replacement) the deterministic relations between each endogenous variable and its endogenous parents, letting the states of the exogenous parents index the relations, with |Ω_U| ≤ 64 for each U ∈ U. Note that this typically induces a non-canonical specification of the corresponding SEs.
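A possible reading of this sampling scheme as code, under our own encoding of graphs as edge sets (the edge probability p and the helper names are assumptions, not part of the paper's setup):

```python
import random

def sample_dag(n, p, rng=None):
    """Sample an Erdos-Renyi DAG on n nodes: edge i -> j (i < j) kept with
    probability p; then every non-root node lacking a root parent gets a
    fresh root parent (to be read as its exogenous variable)."""
    rng = rng or random.Random(0)
    edges = {(i, j) for i in range(n) for j in range(i + 1, n)
             if rng.random() < p}
    parents = {j: {i for i, k in edges if k == j} for j in range(n)}
    roots = {j for j in range(n) if not parents[j]}
    nxt = n
    for j in range(n):
        if parents[j] and not (parents[j] & roots):
            edges.add((nxt, j))  # add a new root parent node
            roots.add(nxt)
            nxt += 1
    return edges, roots
```

Orienting the sampled edges from lower to higher index guarantees acyclicity, and the added roots never receive incoming arcs, so the augmented graph stays a DAG.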
From such a PSCM M we sample the ground-truth chances, thus obtaining the FSCM M*. Overall we generate 220 models with |V| ranging from 5 to 17.
For each model we select three endogenous variables to be called, respectively, input, target, and covariate. Following a topological order, we set as input the first variable with an exogenous confounder. The target is a leaf such that there is a directed (endogenous) path connecting the input to the target, and the covariate is a variable belonging to that path. We sample from M* three datasets of endogenous observations, namely: (i) an observational dataset D_O; (ii) an interventional dataset D_I, with interventions on the input and an equal number of positive and negative values for the intervened variable; (iii) another interventional dataset D_IB, with interventions on the covariate and a selector based on the values of input, covariate and target. To avoid very strong or very weak biases, the selectors are designed so as to have 0.25 ≤ P(S = 1) ≤ 0.75. For each model, the three datasets have the same size, with values ranging from 1,000 to 5,500 over the different models in order to ensure M-compatibility. Note that for the biased dataset, the size is intended to be the one before the selection process.
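The admissibility constraint on the selectors can be enforced by discarding candidates whose empirical selection probability falls outside [0.25, 0.75]; a sketch with hypothetical candidate functions (all names ours):

```python
def pick_selector(records, candidates, lo=0.25, hi=0.75):
    """Return the first candidate selector whose empirical selection
    probability P(S=1) on the records lies in [lo, hi], together with
    that probability; raise if none qualifies."""
    for sel in candidates:
        p = sum(bool(sel(r)) for r in records) / len(records)
        if lo <= p <= hi:
            return sel, p
    raise ValueError("no admissible selector found")

records = [{"x": i % 4} for i in range(8)]
# first candidate keeps everything (p = 1.0, rejected); second keeps half
sel, p = pick_selector(records, [lambda r: True, lambda r: r["x"] < 2])
```

This mirrors the requirement that the bias be neither so weak as to be negligible nor so strong as to leave too few records.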
We evaluate the PNS for the input and the target from the PSCM M. This is done by adding the different datasets following two different orderings. In the first we start from D_IB, then we add D_O, and finally also D_I. In the second we start from D_I, then we add D_O, and finally also D_IB. In both cases the EMCC is executed with r = 300 runs. Such a sequential approach is considered here only to evaluate the impact of the different datasets. In a real scenario all the available data would be used from the start.
Before commenting on the results, we use the credibility intervals derived by Zaffalon et al. (2022) to evaluate the quality of the inner approximation given by the EMCC-delivered range with the selected number of runs. We compute the probability of a relative error (at each extreme of the range) smaller than ε = 0.1 in our simulations and obtain a median confidence equal to 0.999 and a lower quartile equal to 0.921. Those high values support the accuracy of our inference scheme, and allow us to reasonably approximate the ground-truth intervals with the EMCC values.
Concerning the results with the heterogeneous data, we notice that adding a new dataset typically shrinks the interval. To summarise the results of our experiments we therefore compute the relative shrink, defined as one minus the ratio between the size of the PNS range with the additional dataset(s) and the size of the PNS range with the initial dataset. Figure 7 depicts the corresponding boxplots. On the left we consider the models with the same chances shared over the different datasets (called global), while for the local models on the right we let the chances associated with the covariate be different for each dataset, and we run the PNS query with the chances based on D_IB for the first ordering and on D_I for the second. When starting with biased data (plots on the top of Figure 7) we observe noticeable shrinks w.r.t. the initial PNS ranges, these being above 60% for both global and local models. This advocates the benefits of developing tools for counterfactual inference with heterogeneous data. Such a shrink is achieved by adding a single dataset to the initial one. The effect of adding the second dataset is considerably weaker (about 5% for both classes of models), but still noticeable. Low shrinks are also obtained when we start with the unbiased interventional data (plots on the bottom). This supports the intuition that unbiased interventional data are the most informative for causal analysis. Yet, we still observe a non-negligible advantage in adding also observational or biased data.
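The relative shrink descriptor can be stated as code in one line; the example values echo the ranges of Example 5 but are otherwise illustrative:

```python
def relative_shrink(initial, updated):
    """One minus the ratio between the updated and the initial PNS range
    widths: 0.0 means no improvement, 1.0 a point-valued query."""
    (a, b), (c, d) = initial, updated
    return 1.0 - (d - c) / (b - a)

# e.g. the observational range of Example 5 against the fused one
s = relative_shrink((0.00, 0.43), (0.32, 0.42))
```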
Finally, let us also report a dedicated analysis of the effect of the missingness induced by a selection bias with a single observational dataset as in Section 4. We consider different selectors with no restrictions on the values of P(S = 1). For each dataset we compare the PNS ranges with the narrower ones obtained by removing the bias. The quality of the EMCC ranges as based on the credible intervals is analogous to the one observed with the heterogeneous data. As a descriptor we use the difference between the two lower values, and we normalise it by the difference obtained when the biased range is [0, 1]. We call this descriptor the normalised bias effect. We proceed similarly for the upper values. The aggregated results are grouped w.r.t. a discretisation of P(S = 1) and the corresponding boxplots are displayed in Figure 8. As expected, with fewer missing records the ranges become more informative and closer to their unbiased values.
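For concreteness, the normalised bias effect on the lower bounds can be sketched as follows (the symmetric descriptor for the upper values is analogous; the function name is ours):

```python
def normalised_bias_effect_lower(biased_lower, unbiased_lower):
    """Gap between the unbiased and biased lower PNS values, normalised by
    the gap obtained when the biased range degenerates to [0, 1]."""
    worst_case_gap = unbiased_lower - 0.0  # biased lower would be 0
    if worst_case_gap == 0:
        return 0.0
    return (unbiased_lower - biased_lower) / worst_case_gap
```

A value near zero means the bias barely moved the lower bound; a value near one means the biased range was close to vacuous.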

A Counterfactual Analysis in Palliative Care with Biased Data
Figure 9 represents the causal graph used for a study on the preferences of terminally ill cancer patients regarding the place where they want to spend their final moments: home or hospital. Although a majority of patients prefer to pass away at home, most of them end up dying in institutional settings. The study focuses on exploring the interventions that healthcare professionals can take to increase the chance of patients dying at home. All the variables in the graph are intended to be endogenous, and the graph corresponds to the BN proposed by Kern et al. (2020) reduced to the subset of variables for which data were available; variables have been binarised too. For ethical reasons, the original data of the study cannot be used, and we sampled instead a dataset D of 1,000 observations. One can intervene on three variables in the network (light grey nodes in Figure 9): the patient's and the family's awareness of death (which involves communication with the doctors), and home assistance (provided by the Triangolo association). The model is taken without exogenous confounders as a consequence of the fact that all the potential confounders have been explicitly represented in the causal graph. In practice we consider a Markovian model where each endogenous variable has a separate and unique exogenous parent, not reported in Figure 9 just for the sake of readability. As for SEs, we stick to the canonical representation (see Section 2), since we want to be least committal w.r.t. the true underlying mechanisms. Such a conservative modelling assumption can induce high cardinalities for the exogenous variables. In particular, because of Equation (1), the exogenous parent U of variable Death has cardinality |Ω_U| = 2^16. Yet, this does not prevent the EMCC from running and, in order to evaluate the relative importance of those three variables, we compute the corresponding PNS ranges having those variables as input and Death as target node. With the unbiased dataset we obtain: Triangolo → [0.27, 0.35], Patient Awareness → [0.03, 0.11], Family Awareness → [0.04, 0.11]. The lower value associated with Triangolo being greater than the upper values of the two other options shows that home assistance dominates them and hence is the key factor determining the place of death. The above result has been presented by Zaffalon et al. (2023).

Here we investigate whether or not such evidence in favour of home assistance holds even in the presence of a selection bias. In particular, we want to model the fact that studies of this kind are typically biased towards patients progressed to severe conditions. We consequently consider selectors based on Karnofsky (an index used to track disease progression) and Symptoms. For both these binary variables we consider the bad-condition states, i.e., low Karnofsky index and serious symptoms. We consequently consider a weak bias, removing only the records of patients such that both variables are in the bad-condition state, and a strong bias, such that even a single variable in a bad-condition state makes the datum unavailable. The results are: In practice, even in the presence of a selection bias, no matter whether weak or strong, it is possible to draw informative conclusions from the data and keep advocating the importance of home assistance.
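To make the growth induced by the canonical specification concrete, the cardinality of an exogenous parent equals the number of deterministic functions from the joint states of the endogenous parents to the states of the child. The sketch below (function name ours) reproduces a cardinality of 2^16 under the assumption, consistent with the figure reported for the Death variable, of a binary child with four binary endogenous parents:

```python
def canonical_cardinality(child_states, parent_state_counts):
    """Canonical SE specification: one exogenous state per deterministic
    map from the endogenous parents' joint states to the child's states,
    i.e. |Omega_U| = |Omega_V| ** (product of the parents' cardinalities)."""
    n_parent_configs = 1
    for c in parent_state_counts:
        n_parent_configs *= c  # joint states of the endogenous parents
    return child_states ** n_parent_configs

# a binary child with four binary endogenous parents: 2**(2*2*2*2) = 2**16
n = canonical_cardinality(2, [2, 2, 2, 2])
```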
Let us finally note that the lower value of the PNS range for Triangolo under a weak bias is slightly higher than the corresponding lower value in the unbiased case. This reflects the fact that, with fewer complete data, the inner approximation provided by the EMCC with the current number of runs might turn out tighter than the one obtained with the complete data.

Conclusions and Outlook
We have presented an EMCC algorithm that learns the parameters of a given partially specified structural causal model from a mix of observational and interventional, as well as possibly biased, data. This setting is essentially that of information fusion put forward by Bareinboim and Pearl (2016). Within it, we appear to be the first to deliver an algorithm with this type of generality and to make the code public.
A few remarks may be useful to clarify some main aspects of our approach:

• Our algorithm aims at reconstructing the uncertainty about the latent variables. It does this in an approximate way, by delivering a set of fully specified SCMs contained in the set of all those compatible with the partially specified one. Once this is done, one can compute any counterfactual by iterating its computation over the FSCMs above and aggregating their results, so as to yield a range of values for unidentifiable queries that approximates the actual bounds from inside.
• Empirical results support the quality of the approximation and confirm the increased informativeness of ranges obtained by merging multiple studies. Note that more efficient, or accurate, computation would be possible in principle if the aim were to compute a specific counterfactual from the beginning, because searching the space of compatible FSCMs could be tailored to such a goal; but this is not the purpose of our work, which aims at remaining agnostic w.r.t. the follow-up computations, and hence general.
• The PSCM is an input for our algorithm. We stress that the requirement that SEs are given (they are part of the PSCM) is not as stringent as it might seem, given that we can produce the needed SEs by a preprocessing step if the actual ones are not available: this is possible thanks to recent work (Duarte et al., 2023; Zhang et al., 2022) that has introduced a canonical specification of the SEs. It can be understood as a least-committal specification that, loosely speaking, can be used without loss of generality (the implication, however, is that the output will tend to be less informative than in the case where the actual SEs are given). In this sense, our work is as general as the works that do not assume the SEs to be given.
As for future work, it would be interesting to make a quantitative comparison of our EMCC with the MCMC approach recently put forward by Zhang et al. (2022) (a qualitative comparison is available in Zaffalon et al. 2023, Appendix B). Also, it would be important to take advantage of the recent circuital compilation of the EMCC algorithms presented in Huber et al. (2023) to handle bigger models; and it would definitely be useful to extend the basic EMCC to continuous data.
to D′ (remember that * is a proper state of T); while the observations of V in D_{S=1} are directly added to D′ and regarded as observations of T instead of V.
The log-likelihood of D′ from M′ can be computed as in Equation (5). Let us show that Equation (A.7) coincides with Equation (7). Because of Equation (A.5), the elements of D′ such that T = * are those corresponding to S = 0, their number being therefore N_{S=0}. Their contribution to the log-likelihood is therefore the term in Equation (8). For the other elements of D′, the function h in Equation (A.5) acts as the identity map and hence l(u) = f(u). The corresponding contribution to the log-likelihood is the one in Equation (9). This proves that the log-likelihood of a biased dataset in Equation (5) can be expressed as the log-likelihood of an unbiased dataset (namely D′) from a proper PSCM (namely M′). This allows us to apply Theorem 1 and hence to prove the thesis.

Figure 2: C-components (a) and twin network (b) of the model in Figure 1.

Figure 4: PNS ranges (grey) for different selection levels (x-axis) induced by r = 300 EMCC runs (black points denote the result of each run).


Figure 5: The auxiliary model M′ obtained from the PSCM M in Example 1 and the merged dataset in Table 2.

Figure 6: The auxiliary model M′ obtained from the PSCM M in Example 1 and the merged dataset in Table 2 in the presence of a selection bias.

Figure 7: Relative shrinks induced by integration of additional datasets.

Figure 8: Difference between the ranges of biased and unbiased data w.r.t. P(S = 1).

Figure 9: The graph of the model used to study preferences about the place of death in cancer patients.

Table 1: Data from interventional and observational studies on the potential effects of a drug on patients affected by a deadly disease.

Table 2: A merged version of the two datasets in Table 1 with the index variable W.