High-Dimensional Joint Estimation of Multiple Directed Gaussian Graphical Models

We consider the problem of jointly estimating multiple related directed acyclic graph (DAG) models based on high-dimensional data from each graph. This problem is motivated by the task of learning gene regulatory networks based on gene expression data from different tissues, developmental stages or disease states. We prove that under certain regularity conditions, the proposed $\ell_0$-penalized maximum likelihood estimator converges in Frobenius norm to the adjacency matrices consistent with the data-generating distributions and has the correct sparsity. In particular, we show that this joint estimation procedure leads to a faster convergence rate than estimating each DAG model separately. As a corollary we also obtain high-dimensional consistency results for causal inference from a mix of observational and interventional data. For practical purposes, we propose jointGES consisting of Greedy Equivalence Search (GES) to estimate the union of all DAG models followed by variable selection using lasso to obtain the different DAGs, and we analyze its consistency guarantees. The proposed method is illustrated through an analysis of simulated data as well as epithelial ovarian cancer gene expression data.


Introduction
Directed acyclic graph (DAG) models, also known as Bayesian networks, are widely used to model causal relationships in complex systems across various fields such as computational biology, epidemiology, sociology, and environmental management [1,11,30,35,39]. In these applications we often encounter high-dimensional datasets where the number of variables or nodes greatly exceeds the number of observations. While the problem of structure identification for undirected graphical models in the high-dimensional setting is quite well understood [34,25,10,4,44], such results are just starting to become available for directed graphical models. The difficulty in identifying DAG models can be attributed to the fact that searching over the space of DAGs is NP-complete in general [5].
Methods for structure identification in directed graphical models can be divided into two categories and hybrids of these categories. Constraint-based methods, such as the prominent PC algorithm, first learn an undirected graph from conditional independence relations and in a second step orient some of the edges [14,39]. Score-based methods, on the other hand, posit a scoring criterion for each DAG model, usually a penalized likelihood score, and then search for the network with the highest score given the observations. An example is the celebrated Greedy Equivalence Search (GES) algorithm, which can be used to greedily optimize an ℓ₀-penalized likelihood score such as the Bayesian Information Criterion (BIC) [6]. High-dimensional consistency guarantees were recently obtained for the PC algorithm [19] and for score-based methods [23,28,43].
Existing methods have focused on estimating a single directed graphical model. However, in many applications we have access to data from related classes, such as gene expression data from different tissues, cell types or states [24,37], different developmental stages [2], different disease states [41], or from different perturbations such as knock-out experiments [8]. In all these applications one would expect that the underlying regulatory networks are similar to each other, since they stem from the same species, individual or cell type, but also have important differences that drive differentiation, development or a certain disease. This raises an important statistical question, namely how to jointly estimate related directed graphical models in order to effectively make use of the available data.
Various methods have been proposed for jointly estimating undirected Gaussian graphical models. To preserve the common structure, Guo et al. [15] suggested to use a hierarchical penalty and Danaher et al. [7] suggested the use of a generalized fused lasso or group lasso penalty. While both approaches achieve the same convergence rate as the individual estimators, Cai et al. [3] were able to improve the asymptotic convergence rate of joint estimation using a weighted constrained ℓ∞/ℓ₁ minimization approach. Bayesian methods have been proposed for this problem [32] as well. Related works also include [27], where it is assumed that the networks differ only locally in a few nodes, and [21,38], where the assumption is that the networks are ordered and related by continuously changing edge weights.
In this paper, we propose a framework based on ℓ₀-penalized maximum likelihood estimation for jointly estimating related directed Gaussian graphical models. We show that the joint ℓ₀-penalized maximum likelihood estimator (MLE) achieves a faster asymptotic convergence rate as compared to the individual estimators. In addition, by viewing interventional data as data coming from a related network, we show that the interventional BIC scoring function proposed in [16] can be obtained as a special case of the joint ℓ₀-penalized maximum likelihood approach presented here. Our theoretical consistency guarantees also explain the empirical findings of [16], namely that estimating a DAG model from interventional data usually leads to better recovery rates as compared to estimating a DAG model from the same amount of purely observational data. These theoretical results are based on the global optimum of ℓ₀-penalized maximum likelihood estimation. To overcome the computational bottleneck of this optimization problem we propose a greedy approach (jointGES) for solving this problem by extending GES [6] to the joint estimation setting. We analyze its properties from a theoretical point of view and test its performance on synthetic data and gene expression data from epithelial ovarian cancer.
The remainder of this paper is structured as follows. In Section 2, we review some relevant background related to DAG models and introduce notation for the joint DAG estimation problem studied in this paper. In Section 3, we present the joint ℓ₀-penalized maximum likelihood estimator and jointGES, an adaptation of GES for solving this optimization problem. Section 4 establishes results regarding the statistical consistency of the ℓ₀-penalized MLE and jointGES, whereas Section 5 presents the implications for learning DAG models from a mix of observational and interventional data. In Section 6, we illustrate the performance of our proposal in a simulation study and an application to the analysis of gene expression data. We conclude with a short discussion in Section 7. The proofs of supporting results are contained in the Appendix.

Preliminaries
In Section 2.1 we introduce DAGs, their relation to linear structural equation models, and discuss statistical features enjoyed by random vectors following these models. In Section 2.2 we briefly review existing approaches for estimating a single directed graphical model from observational data. Finally, Section 2.3 describes a setting where multiple related directed graphical models exist.

Directed acyclic graphs and linear structural equation models
Let G = (V, E) denote a DAG with vertices V = [p] = {1, …, p} and directed edges E ⊆ V × V, where |G| denotes the cardinality of E. Let A ∈ R^{p×p} be the adjacency matrix specifying the edge weights of the underlying DAG G, i.e., A_ij ≠ 0 if and only if (i, j) ∈ E. Also, let ε ∼ N(0, Ω) denote a p-dimensional multivariate Gaussian random variable with zero mean and diagonal covariance matrix Ω. In this work, we assume that the observed random vector X = (X_1, …, X_p) ∈ R^p is generated according to the following linear structural equation model (SEM):

X_j = Σ_{i=1}^p A_ij X_i + ε_j for all j ∈ [p], or, compactly, X = A^⊤ X + ε. (1)
Hence X follows a multivariate Gaussian distribution with zero mean and covariance matrix Σ, where the inverse covariance (or precision) matrix Θ = Σ^{-1} is given by

Θ = (I − A) Ω^{-1} (I − A)^⊤. (2)

Let Pa_j(G) denote the parents of node j in G; then it follows from (1) that the distribution of X can be factorized as P(X) = ∏_{j=1}^p P(X_j | X_{Pa_j(G)}).
In addition, such a factorization of P according to G is equivalent to the Markov assumption with respect to G [22, Theorem 3.27]. Formally, given i, j ∈ V and an arbitrary subset of nodes S ⊂ V \ {i, j},

i and j are d-separated by S in G ⟹ X_i ⊥ X_j | X_S. (3)

If the implication (3) holds bidirectionally, then P is said to be faithful [39] with respect to G. Note that there exist DAGs G_1 and G_2 that encode the same d-separations and hence the same conditional independence relations. Such DAGs are said to belong to the same Markov equivalence class. A consequence of the acyclicity of G is that there exists at least one permutation π of [p] such that A_ij = 0 for all π(i) ≥ π(j). Putting it differently, if the rows and columns of A are reordered according to π, then the resulting matrix is strictly upper triangular. Hence, if such a permutation π is known a priori, one can obtain the SEM parameters (A, Ω) from Θ according to the following steps [cf. (2)]. First, we reorder Θ according to π. Then, we perform on the reordered Θ an upper-triangular-plus-diagonal Cholesky decomposition to obtain (A′, Ω′). Finally, we revert the ordering by permuting the rows and columns of A′ and Ω′ according to π^{-1} and obtain the sought (A, Ω). For an arbitrary permutation π and a given Θ, we denote by (A_π, Ω_π) the Cholesky decomposition parameters obtained from the procedure just described. Alternatively, one can obtain (A_π, Ω_π) by solving p linear regressions [cf. (1)]. More precisely, we can obtain each column of A_π by regressing X_j only on those X_i such that π(i) < π(j), for all j. Once A_π is obtained, one can estimate the variance of ε in (1) to get Ω_π. In the rest of the paper, we denote by (A_0, Ω_0) and Θ_0 the true parameters of the data-generating SEM and the associated precision matrix, respectively. Moreover, we denote by (A_{0π}, Ω_{0π}) the SEM parameters obtained from the described procedure when the true precision matrix Θ_0 is used.
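The regression route to (A_π, Ω_π) described above can be sketched numerically as follows. This is a minimal illustration in Python with numpy; the toy SEM and the specific edge weights are our own hypothetical choices, not taken from the paper.

```python
import numpy as np

p = 4
I_p = np.eye(p)

# Toy SEM (hypothetical example): A0 is strictly upper triangular, so the
# identity permutation is consistent with the true DAG; A0_ij != 0 <=> i -> j.
A0 = np.array([[0., 0.8, 0. , 0.5],
               [0., 0. , 1.2, 0. ],
               [0., 0. , 0. , 0.7],
               [0., 0. , 0. , 0. ]])
Omega0 = np.diag([1.0, 0.5, 1.5, 0.8])

# Precision matrix of X = A^T X + eps  [cf. (2)]
Theta0 = (I_p - A0) @ np.linalg.inv(Omega0) @ (I_p - A0).T
Sigma0 = np.linalg.inv(Theta0)

def sem_from_covariance(Sigma, pi):
    """Recover (A_pi, Omega_pi) via p population regressions [cf. (1)]:
    column j of A_pi regresses X_j on those X_i with pi(i) < pi(j)."""
    p = Sigma.shape[0]
    A, omega = np.zeros((p, p)), np.zeros(p)
    for j in range(p):
        S = [i for i in range(p) if pi[i] < pi[j]]   # allowed predecessors of j
        if S:
            coef = np.linalg.solve(Sigma[np.ix_(S, S)], Sigma[S, j])
            A[S, j] = coef
            omega[j] = Sigma[j, j] - Sigma[S, j] @ coef   # residual variance
        else:
            omega[j] = Sigma[j, j]
    return A, np.diag(omega)

# A permutation consistent with the true DAG recovers (A0, Omega0) exactly.
A_id, Omega_id = sem_from_covariance(Sigma0, pi=[0, 1, 2, 3])
assert np.allclose(A_id, A0) and np.allclose(Omega_id, Omega0)

# Any other permutation yields a (typically denser) minimal I-MAP that still
# reproduces the same precision matrix Theta0.
A_rev, Omega_rev = sem_from_covariance(Sigma0, pi=[3, 2, 1, 0])
Theta_rev = (I_p - A_rev) @ np.linalg.inv(Omega_rev) @ (I_p - A_rev).T
assert np.allclose(Theta_rev, Theta0)
```

The final assertion illustrates that every permutation induces SEM parameters encoding the same Gaussian distribution, even though only permutations consistent with the true DAG recover (A_0, Ω_0) itself.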
Notice that (A_0, Ω_0) = (A_{0π}, Ω_{0π}) if π is any permutation consistent with the true underlying DAG G_0. The DAG G_π corresponding to the non-zero entries of A_π is known as the minimal I-MAP (independence map) with respect to π. The minimal I-MAP with the fewest number of edges is called the minimal-edge I-MAP [43]. If P is faithful with respect to a DAG G, then G is a minimal-edge I-MAP of P [33,43]. Furthermore, it has been shown in [31] that all DAGs in a Markov equivalence class share the same skeleton (i.e., the set of edges when directions are ignored) and v-structures. A v-structure is a triplet (j, k, ℓ) of nodes such that (j, k), (ℓ, k) ∈ E but j and ℓ are not connected in either direction. This motivates the representation of a Markov equivalence class as a completed partially directed acyclic graph (CPDAG), which is a graph containing both directed and undirected edges. A directed edge means that all DAGs in the Markov equivalence class share the same direction for this edge, whereas an undirected edge means that both directions for that specific edge are present within the class. In the same way, one can represent a subset of a Markov equivalence class via a partially directed acyclic graph (PDAG), where the directions of the edges are only determined by the graphs within the subset. In particular, some undirected edges in a CPDAG would become directed edges in a PDAG representing a subset of the class. Notice that both DAGs and CPDAGs are special cases of PDAGs, where the former represents a single graph and the latter represents the whole equivalence class.
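As a small illustration of these notions, the v-structures of a DAG can be enumerated directly from its edge set. The snippet below is a sketch with toy edge lists of our own choosing; it checks the two classic three-node cases (collider versus chain).

```python
from itertools import combinations

def v_structures(edges, p):
    """List the v-structures (j, k, l) of a DAG on nodes 0..p-1:
    j -> k <- l with j and l not connected in either direction."""
    E = set(edges)
    adj = {(a, b) for a, b in edges} | {(b, a) for a, b in edges}
    out = []
    for k in range(p):
        parents = [i for i in range(p) if (i, k) in E]
        for j, l in combinations(parents, 2):
            if (j, l) not in adj:          # j and l non-adjacent => collider at k
                out.append((j, k, l))
    return out

# The collider 0 -> 2 <- 1 is a v-structure; the chain 0 -> 1 -> 2 has none,
# which is why the chain shares its Markov equivalence class with 0 <- 1 <- 2.
assert v_structures([(0, 2), (1, 2)], 3) == [(0, 2, 1)]
assert v_structures([(0, 1), (1, 2)], 3) == []
```

Since all DAGs in an equivalence class share skeleton and v-structures, a routine like this is the basic building block for turning a DAG into its CPDAG.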

ℓ₀-penalized maximum likelihood estimation for a single DAG model
We denote by X̄ ∈ R^{n×p} our observed data, where each row of X̄ represents a realization of the random vector X. We say that we are in the low-dimensional setting if asymptotically p remains a constant as n → ∞. By contrast, whenever p → ∞ as n → ∞, we say that we are in the high-dimensional setting. Assuming faithfulness, Chickering [6] shows that GES achieves a consistent estimator in the low-dimensional setting by optimizing the following objective, also known as the Bayesian information criterion (BIC):

(Â, Ω̂) ∈ argmin_{A ∈ A, Ω ∈ D₊} −ℓ_n(A, Ω; X̄) + λ₂ ‖A‖₀, (4)

where λ₂ = log n / (2n), A denotes the set of all valid adjacency matrices associated with DAGs, D₊ is the set of non-negative diagonal matrices, and ℓ_n is the (per-sample, up to an additive constant) Gaussian log-likelihood function

ℓ_n(A, Ω; X̄) = (1/2) log det Θ − (1/2) tr(Θ Σ̂), with Θ = (I − A) Ω^{-1} (I − A)^⊤ and Σ̂ = X̄^⊤X̄ / n. (5)

In the high-dimensional setting, van de Geer and Bühlmann [43] give consistency guarantees for the global optimum of (4) when the collection A is further constrained to contain only adjacency matrices with at most d incoming edges for each node, where d = O(n / log p). More precisely, they show that there exists some parameter λ₂ ≍ log p / n such that the optimum (Â, Ω̂) in (4) converges in Frobenius norm to (A_{0π̂}, Ω_{0π̂}) for increasing n and p, where π̂ is a permutation consistent with Â, i.e.,

‖Â − A_{0π̂}‖²_F + ‖Ω̂ − Ω_{0π̂}‖²_F = O_P(λ₂ |G_0|). (6)

Notice, however, that (6) does not guarantee statistical consistency since π̂ need not be a permutation consistent with the true underlying DAG. Moreover, (6) does not hold for every permutation π̂ consistent with Â, but [43] shows the existence of at least one such permutation. In addition, it is shown in [43] that the numbers of non-zero elements in Â, A_{0π̂}, and A_0 are all of the same order of magnitude, i.e., |Ĝ| ≍ |G_{0π̂}| ≍ |G_0|.
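To make the scoring concrete, the sketch below evaluates an ℓ₀-penalized Gaussian score of this flavor for a fixed DAG, using its profiled (node-wise regression) form: for each node, regress on its parents and score the log residual variance, plus a per-edge penalty. The data, the candidate graphs, and the exact penalty scaling are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def bic_score(X, parents, lam2):
    """l0-penalized Gaussian score for a fixed DAG (profiled form, a sketch).

    parents: dict mapping node j -> list of its parent nodes.
    Returns (n/2) * sum_j log(sigma_hat_j^2) + lam2 * n * (#edges);
    lower is better.
    """
    n, p = X.shape
    score = 0.0
    for j in range(p):
        Pa = parents.get(j, [])
        if Pa:
            coef, *_ = np.linalg.lstsq(X[:, Pa], X[:, j], rcond=None)
            resid = X[:, j] - X[:, Pa] @ coef
        else:
            resid = X[:, j]
        score += 0.5 * n * np.log(resid @ resid / n)   # MLE residual variance
    n_edges = sum(len(v) for v in parents.values())
    return score + lam2 * n * n_edges

# Toy data from the chain 0 -> 1: the true parent set scores better (lower)
# than the empty graph when the signal is strong.
rng = np.random.default_rng(1)
n = 2000
x0 = rng.normal(size=n)
x1 = 0.9 * x0 + 0.5 * rng.normal(size=n)
X = np.column_stack([x0, x1])
lam2 = np.log(n) / (2 * n)     # BIC-type penalty parameter [cf. (4)]
assert bic_score(X, {1: [0]}, lam2) < bic_score(X, {}, lam2)
```

Note that the chain 0 → 1 and its reversal 1 → 0 are Markov equivalent, so this score assigns them identical values, which is exactly why score-based search operates over equivalence classes.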

Collection of DAGs
Consider the setting where not all the observed data comes from the same DAG, but rather from a collection of DAGs {G^(k) = (V, E^(k))}_{k=1}^K that share the same node set V = [p]. In addition, we assume that all DAGs in a collection are consistent with some permutation π. This precludes a scenario where (i, j) ∈ E^(k) and (j, i) ∈ E^(k′) for some k ≠ k′. This is a reasonable assumption in, e.g., the analysis of gene expression data, where regulatory links may appear or disappear, but they in general do not change direction.

Denote by {(A^(k), Ω^(k))}_{k=1}^K a set of SEMs on the K DAGs {G^(k)}_{k=1}^K and by {X̄^(k)}_{k=1}^K the data generated from each SEM, where we observe n_k realizations for each DAG G^(k). In this way, each row of the data matrix X̄^(k) ∈ R^{n_k×p} corresponds to a realization of the random vector X^(k) defined as [cf. (1)] X^(k) = (A^(k))^⊤ X^(k) + ε^(k), with ε^(k) ∼ N(0, Ω^(k)).

Collections of DAGs arise naturally, for example, when considering data from perfect (also known as hard) interventions [9]. Consider a non-intervened DAG G with SEM parameters (A, Ω) [cf. (1)]. Then a perfect intervention on a subset of nodes I_k ⊂ V gives rise to the interventional distribution [16,17] induced by the SEM parameters (A^{I_k}, Ω^{I_k}), where [A^{I_k}]_{ij} = 0 for all j ∈ I_k (all edges into the intervened nodes are removed) and Ω^{I_k} coincides with Ω except possibly on the diagonal entries indexed by I_k. We denote the DAG given by the non-zero entries of A^{I_k} by G^{I_k}.

In accordance with the notation introduced in Section 2.1, we denote by G^(k)_0 and (A^(k)_0, Ω^(k)_0) the true data-generating DAG and SEM parameters for class k, respectively, and by π_0 a permutation consistent with all A^(k)_0. For interventional data, we write (A^{I_k}_0, Ω^{I_k}_0) to denote the true SEM parameters after intervening on targets I_k.
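A hard intervention thus acts on the SEM parameters purely mechanically, which the following sketch implements; the variances assigned to intervened nodes are our own arbitrary choice.

```python
import numpy as np

def intervene(A, Omega, targets, new_vars=None):
    """Perfect (hard) intervention on the nodes in `targets`: all edges into
    the intervened nodes are removed and their noise variances are replaced."""
    A_I = A.copy()
    A_I[:, list(targets)] = 0.0           # [A^{I_k}]_{ij} = 0 for all j in I_k
    Omega_I = Omega.copy()
    new_vars = new_vars if new_vars is not None else [1.0] * len(targets)
    for t, v in zip(targets, new_vars):
        Omega_I[t, t] = v                 # intervened node gets its own variance
    return A_I, Omega_I

# Chain 0 -> 1 -> 2: intervening on node 1 cuts the edge 0 -> 1 while the
# downstream edge 1 -> 2 survives, so G^{I_k} is a subDAG of G.
A = np.array([[0., 0.7, 0.],
              [0., 0., 0.9],
              [0., 0., 0.]])
Omega = np.eye(3)
A_I, Omega_I = intervene(A, Omega, targets=[1], new_vars=[2.0])
assert A_I[0, 1] == 0.0 and A_I[1, 2] == 0.9 and Omega_I[1, 1] == 2.0
```

In particular, A^{I_k} is consistent with every permutation that A is consistent with, which is the observation exploited in Section 5.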

Joint estimation of multiple DAGs
We first present a penalized maximum likelihood estimator that is the natural extension of (4) to the case where a collection of DAGs is being estimated. Since this involves minimizing an ℓ₀ penalty, we then discuss a greedy approach that alleviates the computational complexity of this estimator.

Joint ℓ₀-penalized maximum likelihood estimator
With d denoting a pre-specified sparsity level and w_k = n_k/n indicating the proportion of observed data from DAG k, we propose the following estimator:

min over π, {A^(k)}_{k=1}^K, {Ω^(k)}_{k=1}^K of Σ_{k=1}^K w_k [−ℓ_{n_k}(A^(k), Ω^(k); X̄^(k))] + λ₂ ‖Σ_{k=1}^K |A^(k)|‖₀,
subject to A^(k) ∈ A_π, ‖A^(k)‖_{∞,0} ≤ d, and Ω^(k) ∈ D₊ for all k ∈ [K], (7)

where A_π is the set of all adjacency matrices consistent with permutation π and the matrix norm ‖·‖_{∞,0} computes the maximum ℓ₀-norm across the columns of the argument matrix, i.e., the maximum in-degree of the associated graph. The optimization problem in (7) seeks to maximize a weighted log-likelihood of the observations (where more weight is given to SEMs with more realizations) penalized by the support of the union of all estimated DAGs. To see why this is true, notice that ‖Σ_{k=1}^K |A^(k)|‖₀ counts the number of entries (i, j) for which A^(k)_{ij} ≠ 0 for at least one k.

Regarding the constraints in (7), the first constraint imposes that all estimated DAGs are consistent with the same permutation π, which is itself an optimization variable. This constraint is in accordance with our assumption in Section 2.3 and drastically reduces the search space of DAGs. The second constraint ensures that the maximum in-degree in all graphs is at most d, and the last constraint imposes the natural requirement that all noise covariances are diagonal and non-negative. Notice that (7) is a natural extension of (4). Indeed, for the case K = 1 the objective in (7) immediately boils down to that in (4). Moreover, when there is only one graph and π can be selected freely, the constraint A^(1) ∈ A_π is effectively identical to A^(1) ∈ A, i.e., the constraint in (4). Finally, observe that in (7) we have included the additional maximum in-degree constraint required in the high-dimensional setting [cf. discussion after (5)].

Algorithm 1 jointGES
Input: Collection of observations X̄^(1) ∈ R^{n_1×p}, …, X̄^(K) ∈ R^{n_K×p}; sparsity constraint d; penalization parameters λ₁ and λ₂.
Output: Collection of weighted adjacency matrices Â^(1), …, Â^(K).
1: Apply GES to find Ĝ_union, an approximate solution to optimization problem (8).
2: Estimate the weighted adjacency matrices {Â^(k)}_{k=1}^K consistent with Ĝ_union by solving Kp sparse regressions of the form â^(k)_j ∈ argmin_a (1/(2n_k)) ‖X̄^(k)_{·j} − X̄^(k)_{·,Pa_j(Ĝ_union)} a‖²₂ + λ₁ ‖a‖₁ for j ∈ [p] and k ∈ [K].

JointGES: Joint greedy equivalence search
The ℓ₀ penalty as well as the optimization over all permutations π render problem (7) non-convex and thus hard to solve efficiently. In this section, we present a greedy approach to find a computationally tractable approximation to a solution of (7). The algorithm, which we term jointGES, is succinctly presented in Algorithm 1 and consists of two steps.
In the first step of Algorithm 1 we recover Ĝ_union, our estimate of the union of all the DAGs to be inferred. We do this by finding, via the implementation of GES [6], an approximate solution to

Ĝ_union ∈ argmin_G Σ_{k=1}^K w_k Σ_{j=1}^p (1/2) log σ̂²_{j,k}(Pa_j(G)) + λ₂ Σ_{j=1}^p |Pa_j(G)|, (8)

where σ̂²_{j,k}(Pa_j(G)) denotes the residual variance achieved when regressing the jth column of X̄^(k) on the columns indexed by Pa_j(G). The objective (scoring function) in (8) consists of two terms. The first term is given by the sum, over nodes j and DAGs k, of the log residual variances of these regressions. In [43], van de Geer and Bühlmann further show that if we keep the underlying DAG G fixed, the maximum likelihood estimator proposed in (4) is equivalent to optimizing exactly this sum of log residual variances. Thus, the first term in (8) corresponds to the first term in the objective of (7). The second term penalizes the size of the parent set of each node in the graph to be recovered, effectively penalizing the number of edges in the graph. In this way, the scoring function in (8) promotes a sparse G in the same way that the objective of (7) promotes the union of all K recovered graphs to have a sparse support. Additionally, it is immediate to see that the scoring function in (8) is decomposable [6], a key feature that enables the implementation of GES to find an approximate solution. Once we have obtained the union of all sought DAGs Ĝ_union from step 1, in the second step of our algorithm we estimate the DAGs Ĝ^(1), …, Ĝ^(K) by searching over the subDAGs of Ĝ_union. More precisely, for each node j we estimate its parents in Ĝ^(k) by regressing X̄^(k)_{·j} on X̄^(k)_{·,Pa_j(Ĝ_union)} using lasso, where the support of the fitted coefficient vector â^(k)_j corresponds to the set of parents of j in Ĝ^(k).
To summarize, Algorithm 1 recovers K DAGs by first estimating the union of all these DAGs, Ĝ_union, using standard GES, and then inferring the individual weighted adjacency matrices Â^(k) via lasso regressions while ensuring consistency with the previously estimated Ĝ_union.
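Step 2 of Algorithm 1 can be sketched as follows. The lasso solver below is a bare-bones coordinate descent written only for self-containment, and the union parent sets, synthetic data, and value of λ₁ are illustrative assumptions rather than the paper's tuned choices.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, d = X.shape
    b = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for i in range(d):
            r = y - X @ b + X[:, i] * b[i]        # partial residual without i
            rho = X[:, i] @ r / n
            b[i] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[i]
    return b

def joint_ges_step2(X_list, union_parents, lam1):
    """For each class k and node j, lasso-regress column j of X^(k) on the
    parents of j in G_union; the support of the fit gives Pa_j(G^(k))."""
    A_hats = []
    for X in X_list:
        p = X.shape[1]
        A = np.zeros((p, p))
        for j in range(p):
            Pa = union_parents.get(j, [])
            if Pa:
                A[Pa, j] = lasso_cd(X[:, Pa], X[:, j], lam1)
        A_hats.append(A)
    return A_hats

# Two classes sharing the edge 0 -> 2; only class 1 also has 1 -> 2, so the
# union parent set of node 2 is {0, 1} and the lasso should prune 1 -> 2 in class 2.
rng = np.random.default_rng(0)
n = 1000
x0, x1 = rng.normal(size=n), rng.normal(size=n)
X1 = np.column_stack([x0, x1, 0.8 * x0 + 0.8 * x1 + 0.3 * rng.normal(size=n)])
X2 = np.column_stack([x0, x1, 0.8 * x0 + 0.3 * rng.normal(size=n)])
A1, A2 = joint_ges_step2([X1, X2], union_parents={2: [0, 1]}, lam1=0.15)
assert A1[0, 2] > 0.4 and A1[1, 2] > 0.4   # both edges kept in class 1
assert A2[0, 2] > 0.4 and A2[1, 2] == 0.0  # spurious edge pruned in class 2
```

The key design point is that each class-specific regression is restricted to the parents allowed by Ĝ_union, so the Kp regressions are low-dimensional even when p is large.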

Consistency guarantees
The main goal of this section is to provide theoretical guarantees on the consistency of the solution to Problem (7) in the high-dimensional setting. Our main result is presented in Theorem 4.9. In Section 4.3 we present a weaker consistency statement that holds under milder conditions.

Statistical consistency of the joint 0 -penalized MLE
A series of conditions must hold for our main result to be valid. We begin by stating these conditions followed by the formal consistency result in Theorem 4.9. The rationale behind these conditions and their implications are discussed after the theorem in Section 4.2.
Condition 4.6. The number of DAGs K satisfies K = o(log p) and the amount of data associated with each DAG is comparable in the sense that n_1 ≍ n_2 ≍ ⋯ ≍ n_K.
With the above conditions in place, the following result can be shown.

Theorem 4.9. Assume Conditions 4.1–4.8 hold and choose λ₂ ≍ log p / n. Then any global minimizer (π̂, {Â^(k), Ω̂^(k)}_{k=1}^K) of (7) satisfies

Σ_{k=1}^K w_k ( ‖Â^(k) − A^(k)_{0π̂}‖²_F + ‖Ω̂^(k) − Ω^(k)_{0π̂}‖²_F ) = O_P(λ₂ |G^union_0|). (10)

Furthermore, denoting by Ĝ the union of the graphs Ĝ^(k) associated with the K recovered adjacency matrices Â^(k), we have that

|Ĝ| ≍ |G^union_{0π̂}| ≍ |G^union_0|. (11)

The proof of Theorem 4.9 is given in Appendix A.2. To intuitively grasp the result in the above theorem, assume that the number of edges in G^union_0 is proportional to the number of nodes p, so that λ₂|G^union_0| → 0 for increasing n as long as n ≫ p log p. Hence, under these conditions, (10) guarantees that for the recovered permutation π̂, the estimated adjacency matrix Â^(k) converges to A^(k)_{0π̂} in Frobenius norm for all k. This not only implies that both adjacency matrices have similar structure, but also that the edge weights are similar. Moreover, from (11) it follows that the number of edges in the estimated graph Ĝ, i.e., |Ĝ|, is similar to the number of edges in the union of all minimal I-MAPs with permutation π̂, i.e., |G^union_{0π̂}|. More importantly, |Ĝ| is also similar to the number of edges in the true union graph, |G^union_0|. Despite these guarantees, it should be noted that the permutation π̂ need not coincide with the permutation π_0 of the true graphs to be recovered.
We now assess the benefits of performing joint estimation of the K DAGs as opposed to estimating them separately. To do so, we compare the guarantees in Theorem 4.9 to those developed in [43] for separate estimation. More precisely, in the setting where all K DAGs are highly overlapping (cf. Condition 4.7), applying the consistency bound reviewed in (6) to the separate estimation of the K DAGs, with penalty parameters chosen so that λ₂ ≍ log p / n_k ≍ K log p / n, yields

Σ_{k=1}^K w_k ‖Â^(k) − A^(k)_{0π̂^(k)}‖²_F = O_P( (K log p / n) |G^union_0| ), (12)

where it should be noted that in the separate estimation the recovered permutation π̂^(k) can vary with k. A direct comparison of (10) and (12) reveals that performing joint estimation improves the accuracy by a factor of K, from Ω(K log p / n) to Ω(log p / n). Hence, for joint estimation the accuracy scales with the total number of samples n, showing that our procedure yields maximal gain from each observation, even if the data is generated from K different DAGs. Moreover, the result in (10) holds under slightly milder conditions than those needed for (12), since Condition 4.8 is a relaxed version of the beta-min condition in [43]. A more detailed discussion of the conditions of Theorem 4.9 is given next.

Conditions for Theorem 4.9
It has been shown in [33] that if a data-generating distribution is faithful with respect to G, then G must be a minimal-edge I-MAP. By enforcing the latter for every true graph, Condition 4.1 imposes a milder requirement than the well-established faithfulness assumption [39]. Conditions 4.2–4.4 ensure that we avoid overfitting and provide bounds for the noise variances. These are direct adaptations of Conditions 3.1–3.3 in [43]. Condition 4.5 is required to bound the difference between the sample variances of our observations and the true variances, and is related to Condition 3.4 in [43] but adapted to our joint inference setting. Notice that Condition 4.5 is trivially satisfied when p = O(n^{1/3} / (K^{7/3} log n)). Condition 4.6 follows from the bounds for sample variances shown in [3]. Intuitively, we are imposing the natural restriction that the number of DAGs is small compared to the number of nodes p in each DAG and the total number of observations n. Moreover, given that our objective is to draw estimation power from the joint inference of multiple graphs, we require that each DAG is associated with a non-vanishing fraction of the total observations. Condition 4.7 enforces that, for every permutation π, the number of edges in the union of all recovered graphs is proportional to the weighted sum of the edges in every graph as K → ∞. In particular, this requires the individual graphs G^(k)_{0π} to be highly overlapping. To see why this is the case, notice that Σ_{k=1}^K w_k |G^(k)_{0π}| is upper bounded by the maximum number of edges across the graphs G^(k)_{0π}. Consequently, Condition 4.7 enforces the number of edges in the union of graphs to be proportional to the number of edges in the single graph with most edges, thus requiring a high level of overlap. Imposing high overlap for all permutations π might seem too restrictive in some settings.
Nonetheless, Condition 4.7 can sometimes be derived from apparently less restrictive conditions as the following example illustrates.
Consider the more relaxed bound |G^union_0| ≤ c_t Σ_{k=1}^K w_k |G^(k)_0|, which is equivalent to requiring Condition 4.7 to hold but only for permutations consistent with the true union graph G^union_0. In the following example, we show that this might be sufficient for Condition 4.7 to hold. Suppose that G^union_0 consists of two connected components G^union_{0,1} and G^union_{0,2}, respectively defined on the subsets of nodes V_1 and V_2. Moreover, assume that the subDAGs of the graphs G^(k)_0 restricted to V_1 coincide for all k. Putting it differently, the differences between the DAGs G^(k)_0 are limited to the second connected component. In addition, assume that for all possible permutations π_2 of the nodes in V_2 we have that |G^union_{0,2,π_2}| ≤ |G^union_{0,1}|. Then, for any permutation π, where we denote by π_1 (respectively π_2) the restriction of π to the node set V_1 (respectively V_2), we have that

|G^union_{0π}| = |G^union_{0,1,π_1}| + |G^union_{0,2,π_2}| ≤ 2 |G^union_{0,1,π_1}| ≤ 2 Σ_{k=1}^K w_k |G^(k)_{0π}|,

from where Condition 4.7 is satisfied for c_t = 2. This example shows that learning the structure of large components that are common across the different DAGs is not affected by the changes in the smaller components of these DAGs. Despite the previous example, Condition 4.7 might still be too restrictive in some settings. In this respect, a more relaxed requirement is stated in Section 4.3 as Condition 4.7'. Condition 4.8 requires that, for every permutation π and every graph k, the value of at least a fixed proportion (1 − η_1) of the edges in G^(k)_{0π} is above the 'noise level', i.e., the lower bound within the indicator function in (9). Intuitively, if the true weight of many edges is close to zero, then correct inference of the graphs would be impossible since the true edges would be mistaken for spurious ones. Thus, the weights of a sufficiently large fraction of the edges have to be sufficiently large, and Condition 4.8 is the right formalization of this intuition.
Moreover, notice that a straightforward replication of the beta-min condition introduced in [43] would have required the 'noise level' to scale with √(log p / n_k), instead of the smaller scaling of √(log p / n) required in (9). In this sense, Condition 4.8 is a relaxed version of the extension of the beta-min condition to the setting of joint graph estimation.

Consistency under milder conditions
As previously discussed, in some settings Condition 4.7 might be too restrictive. Hence, in this section we present a consistency statement akin to Theorem 4.9 that holds for a milder version of Condition 4.7. More precisely, consider the alternative condition stated next.
Condition 4.7'. Let c_t(π) be some function of π that scales as a constant for permutations consistent with G^union_0 and as o(K) for all other permutations, such that |G^union_{0π}| ≤ c_t(π) Σ_{k=1}^K w_k |G^(k)_{0π}| for all π.

Observe that for permutations π consistent with the true union graph G^union_0, Condition 4.7' boils down to the previously discussed Condition 4.7. However, for all other permutations, c_t(π) need not be a constant and is allowed to grow with K. Intuitively, for all permutations not consistent with G^union_0 we are not requiring a high level of overlap among the graphs G^(k)_{0π}; we only require their union to be 'sparser' than in the extreme case in which all graphs G^(k)_{0π} are edge-disjoint. In order to account for the fact that c_t depends on the permutation π in Condition 4.7', we have to modify Condition 4.8 accordingly, resulting in the following alternative statement.
The following consistency result holds under the alternative set of conditions.

Theorem 4.10. Assume Conditions 4.1–4.6, 4.7', and 4.8' hold, let C_max denote the largest value of c_t(π), and choose λ₂ ≍ C_max log p / n. Then, for at least one k ∈ [K], any global minimizer of (7) satisfies

‖Â^(k) − A^(k)_{0π̂}‖²_F + ‖Ω̂^(k) − Ω^(k)_{0π̂}‖²_F = O_P( C_max (log p / n) |G^union_0| ). (13)

Furthermore, denoting by Ĝ^(k) the graph associated with Â^(k) for the k ∈ [K] satisfying (13), we have that

|Ĝ^(k)| ≍ |G^(k)_{0π̂}|. (14)

The proof is given in Appendix A.3. Condition 4.7' is milder than Condition 4.7, and this relaxation entails a corresponding loss in the guarantees of recovery: comparing (13) and (14) with (10) and (11) immediately reveals that what could be guaranteed for the ensemble of graphs in Theorem 4.9 can only be guaranteed for a single graph in Theorem 4.10, thereby explaining the trade-off in relaxing the conditions.
On the other hand, the result in Theorem 4.10 still draws inference power from the joint estimation of multiple graphs, since neither (13) nor (14) can be shown using existing results for separate estimation. To be more precise, as discussed in Section 4.2, when performing separate estimation, theoretical guarantees are based on the assumption that at least a fixed proportion of the edge weights are above the 'noise level', which scales as √(log p / n_k). However, Condition 4.8' requires the noise level to scale with √(C_max log p / n) which, given that C_max = o(K), is not large enough to achieve the guarantee needed for separate estimation. In addition, the convergence rate of Ω(C_max log p / n) in (13) is still faster than the corresponding rate of Ω(K log p / n) associated with separate estimation [cf. discussion after (12)]. We close Section 4 with the following remark on the consistency guarantees of jointGES.

Remark. In the low-dimensional setting, GES consistently recovers the Markov equivalence class M(G^union_0) of the true union graph in step 1 of Algorithm 1. Consistency of jointGES can then be obtained by using every DAG in M(G^union_0) as input to the second step of Algorithm 1, and then selecting the DAG G_union ∈ M(G^union_0) whose output from step 2 is the sparsest. For the high-dimensional setting, as even the global optimum of (7) is not guaranteed to recover the true G^union_0 (cf. Theorems 4.9 and 4.10), jointGES is not consistent in general. Recently, Maathuis et al. [28] showed consistency of GES for single DAG estimation under more restrictive assumptions than the ones considered here. Although of potential interest, further strengthening the presented conditions to guarantee consistency of jointGES in the high-dimensional setting is not pursued in the current paper.

Extension to interventions
We show how our proposed method for joint estimation can be extended to learn DAGs from interventional data. It is natural to consider learning from interventional data as a special case of joint estimation since the DAGs associated with interventions are different but closely related. In this section we mimic some of the developments of Sections 3 and 4, but specialized to the case of interventional data. More precisely, we first propose an optimization problem akin to (7) and then state high-dimensional consistency guarantees for the associated optimal solution.
Recall from Section 2.3 that the true adjacency matrix A^{I_k}_0 of the SEM associated with an intervention on the nodes I_k is identical to the true adjacency matrix A_0 of the non-intervened model except that [A^{I_k}_0]_{ij} = 0 for all j ∈ I_k. In this way, our assumption that there exists a common permutation π consistent with all DAGs under consideration (cf. Section 2.3) is automatically satisfied for interventional data. Additionally, assuming that we observe samples X̄^{I_k} from K different models corresponding to the respective interventions on the node sets {I_k}_{k=1}^K, the knowledge of the intervened nodes can be incorporated into our optimization problem as follows [cf. (7)].
From the solution of (15) we obtain an estimate for the non-intervened SEM (Â, Ω̂) as well as K estimates for the corresponding intervened models (Â^{I_k}, Ω̂^{I_k}). The objective in (15a) is equivalent to that in (7), where we leverage the fact that the union of all intervened graphs results in the non-intervened one under the implicit assumption that no single node has been intervened in every experiment. Alternatively, if some nodes were intervened in all experiments, objective (15a) would still be valid since enforcing zeros in the unobservable portions of A does not affect the recovery of the intervened adjacency matrices A^{I_k}. The constraints in (15b) impose that A has to be consistent with permutation π and have bounded in-degree, and that Ω has to be a valid covariance matrix for uncorrelated noise. Putting it differently, (15b) enforces for the non-intervened SEM what we impose separately for all SEMs in (7). The constraints in (15c) impose the known relations between the intervened and the non-intervened adjacency matrices. Finally, (15d) constrains the matrices Ω^{I_k} to be consistent with the base model on the non-intervened nodes while still being a valid covariance on the intervened ones. Even though it might seem that in (15) we are estimating K + 1 SEMs (the base case plus the K intervened ones), from the previous reasoning it follows that the effective number of optimization variables is significantly smaller. To be more specific, for a given π, once A is fixed then all the adjacency matrices A^{I_k} are completely determined. Moreover, for a fixed Ω, the only freedom in Ω^{I_k} corresponds to the diagonal entries associated with the intervened nodes in I_k. In this way, it is expected that for a given number of samples, the joint estimation of K SEMs obtained from interventional data [cf. (15)] outperforms the corresponding estimation from purely observational data [cf. (7)].
Recalling that we denote by $(A_{0\pi}^{I_k}, \Omega_{0\pi}^{I_k})$ the parameters recovered from the Cholesky decomposition of the true precision matrix $\Theta_0^{I_k}$ under the assumption of consistency with permutation $\pi$, the following result holds.
Furthermore, denoting by $\hat G$ the graph associated with the recovered adjacency matrix $\hat A$ for the non-intervened model, we have that the corresponding sparsity guarantee [cf. (17)] holds. The proof is given in Appendix A.4. A quick comparison of Theorem 4.9 and Corollary 5.1 seems to indicate that the consistency guarantees for observational and interventional data are very similar. Nonetheless, recovery from interventional data is strictly better, as we argue next. As discussed after Theorem 4.9, the results presented do not guarantee that the recovered permutation $\hat\pi$ coincides with the true permutation of the nodes. In principle, one could recover a spurious permutation $\hat\pi$ (different from the true permutation $\pi$) that correctly explains the observed data [cf. (10) and (16)] and leads to sparse graphs [cf. (11) and (17)]. However, the more interventions we have, the smaller the set of spurious permutations $\hat\pi$ that can be recovered, as we illustrate in the following example. Figure 1 portrays the existence of a spurious permutation that could be recovered from observational data but cannot be recovered from interventional data. More precisely, Figure 1(a) presents the two true DAGs we aim to recover, where the second one is obtained by intervening on node 2. By contrast, Figure 1(b) shows the DAGs that are obtained when performing Cholesky decompositions on the true precision matrices under the spurious permutation $\pi_1$. Notice that the sparsity levels of the DAGs in both figures are the same. In general, one could recover $\pi_1$ instead of $\pi_0$ from observational data, but one would never recover $\pi_1$ from interventional data. To see this, simply notice from the figure that $[A_{0\pi_1}^{I_2}]_{32} \neq 0$, whereas the interventional estimate must satisfy $[\hat A^{I_2}]_{32} = 0$ [cf. (15c)]; thus, the error terms in (16) cannot vanish for $\hat\pi = \pi_1$. This example also indicates that it is preferable to intervene on multiple targets in the same experiment instead of doing interventions one at a time.
This observation is in accordance with new genetic perturbation techniques, such as Perturb-seq [8].
From a practical perspective, the objective in (15) corresponds to the same scoring function as GIES [16]. Therefore, GIES can be used to obtain an approximate solution to (15). A simulation study using GIES was performed in [16, Section 5.2] showing that, in line with the theoretical results obtained in this section, not only does identifiability increase, but the estimates obtained using interventional data are also better than those obtained with the same amount of purely observational data.

Figure 1: Interventional data can avoid the recovery of spurious permutations. (a) True DAGs to be recovered. (b) DAGs obtained from the Cholesky decomposition consistent with $\pi_1$. The spurious permutation $\pi_1$ does not satisfy (16) for cases where node 2 is intervened.

Experiments
In this section, we present numerical experiments on both synthetic (Section 6.1) and real (Section 6.2) data that support our theoretical findings.

Performance evaluation of joint causal inference
We analyze the performance of the joint recovery of $K$ different DAGs, where we vary $K \in \{3, 5\}$. For all experiments, we set the number of nodes $p = 100$ and the total number of observations $n = 600$. In addition, we selected the number of samples from each DAG to be the same, i.e., $n_1 = \ldots = n_K = n/K$. For each experiment, the true DAGs were constructed in two steps. First, we generated a core graph that is shared among the $K$ DAGs under consideration. We did this by generating a random graph from an Erdős-Rényi model with 100 edges in expectation, and then orienting the edges according to a random permutation of the nodes. In the second step, we sampled 30 additional edges specific to each DAG uniformly at random. This procedure results in a collection of true underlying DAGs $G^{(1)}, \ldots, G^{(K)}$. The edge weights were then sampled uniformly at random from a range chosen to ensure that they are bounded away from zero. Notice that we did not put any constraints on the edge weights in the shared core structure across the different DAGs: the same edge can change its weight in different DAGs, or even flip sign.
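The two-step construction above can be sketched as follows. This is an illustrative reimplementation: the exact edge-weight distribution used in the experiments is not reproduced here, only the requirement that weights be bounded away from zero (the range `[0.3, 1.0]` is an assumption):

```python
import numpy as np

def generate_dag_collection(p=100, n_core=100, n_extra=30, K=3, seed=0):
    rng = np.random.default_rng(seed)
    order = rng.permutation(p)                  # random causal order of the nodes
    pos = np.empty(p, dtype=int)
    pos[order] = np.arange(p)                   # pos[i] = position of node i
    # candidate edges i -> j for every pair with i preceding j in the order
    candidates = [(i, j) for i in range(p) for j in range(p) if pos[i] < pos[j]]
    # shared Erdos-Renyi core with n_core edges in expectation
    core = [e for e in candidates if rng.random() < n_core / len(candidates)]
    core_set = set(core)
    rest = [e for e in candidates if e not in core_set]
    dags = []
    for _ in range(K):
        extra_idx = rng.choice(len(rest), size=n_extra, replace=False)
        A = np.zeros((p, p))
        for (i, j) in core + [rest[t] for t in extra_idx]:
            # weights bounded away from zero (illustrative range)
            A[i, j] = rng.uniform(0.3, 1.0) * rng.choice([-1.0, 1.0])
        dags.append(A)
    return dags, pos

dags, pos = generate_dag_collection(p=30, n_core=40, n_extra=10, K=3)
```

Since every edge respects the shared order `pos`, all $K$ adjacency matrices are acyclic and consistent with the same permutation, as required by the theory.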
For each setting, we randomly generated 100 collections of DAGs and data associated with them. We then estimated the DAGs from the data via two different methods: a joint estimation procedure using the jointGES algorithm presented in Section 3.2 and a separate estimation procedure using the well-established GES method [6].
To assess the performance of the two algorithms, we considered two standard measures, namely the structural Hamming distance (SHD) [42] and the receiver operating characteristic (ROC) curve. SHD is a commonly used metric based on the number of operations needed to transform the estimated DAG into the true one [19,42]. Hence, a smaller SHD value indicates better performance. The ROC curve plots the true positive rate against the false positive rate for different choices of tuning parameters.
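A minimal version of the SHD computation (one of several common variants; here a reversed edge counts as a single operation) could look as follows:

```python
import numpy as np

def shd(A_est, A_true):
    # Compare edge supports of two DAG adjacency matrices.
    E = (A_est != 0).astype(int)
    T = (A_true != 0).astype(int)
    diff = np.abs(E - T)                          # 1 wherever directed edges disagree
    # a pair (i->j vs j->i) is a single reversal, not two separate errors
    reversals = int(np.sum((diff * diff.T) == 1)) // 2
    return int(diff.sum()) - reversals

# toy check: true chain 0 -> 1 -> 2, estimate reverses the first edge
A_true = np.zeros((3, 3)); A_true[0, 1] = 1.0; A_true[1, 2] = 1.0
A_est = np.zeros((3, 3)); A_est[1, 0] = 1.0; A_est[1, 2] = 1.0
print(shd(A_est, A_true))   # one reversed edge -> SHD of 1
```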
In Figure 2(a), we selected the $\ell_0$-penalization parameter $\lambda_1^2 = c\,\frac{\log p}{n}$ with scaling constant $c \in \{1, 2, 3, 4, 5\}$ in both joint and separate estimation, and then plotted the SHD as a function of the scaling constant $c$, averaged over the $K$ DAGs to be recovered and the 100 realizations. The penalization parameter $\lambda_2$ in the second step of the joint estimation procedure was chosen based on 10-fold cross-validation. In Figure 2(b) we plotted the average ROC curve, where for each choice of tuning parameter we computed the true positive and false positive rates by averaging over the $K$ DAGs to be recovered and the 100 realizations. It is clear from the two figures that joint inference generally achieves better performance, which matches our theoretical results in Section 4. However, Figure 2(a) also shows that jointGES performs worse than separate estimation for small scaling constants ($c = 1$). This is in line with our theoretical findings in Theorem 4.10, which imply that whenever Condition 4.7 (which is sometimes a restrictive assumption) is violated, we need to choose a larger penalization parameter.

Gene regulatory networks of ovarian cancer subtypes
To assess the performance of the proposed joint $\ell_0$-penalized maximum likelihood method on real data, we analyzed gene expression microarray data of patients with ovarian cancer [41]. Patients were divided into six subtypes of ovarian cancer, labeled C1-C6, where C1 is characterized by significant differential expression of genes associated with stromal and immune cell types and by a lower survival rate compared to the other five subtypes. We hence grouped the subtypes C2-C6 together, and our goal was to infer the differences in gene regulatory networks in ovarian cancer that could explain the different survival rates. The gene expression data in [41] include the expression profiles of n = 83 patients with the C1 subtype and n = 168 patients with the other subtypes. We implemented our jointGES algorithm (Algorithm 1) to jointly learn two gene regulatory networks: one corresponding to the C1 subtype, $G_{C1}$, and another corresponding to the other five subtypes together, $G_{C2-6}$. In addition, as in [3], we focused on a particular pathway, namely the apoptosis pathway. Using the KEGG database [20,29] we selected the genes in this pathway that were associated with at most two microarray probes, resulting in a total of p = 76 genes.

             |G_C1|   |G_C2-6|   shared
  jointGES     50        73        48
  GES          68       101        32
  PC           14        30         9

Table 1: Number of edges in the DAGs estimated by different methods for the gene regulatory network of subtype C1 ($|G_{C1}|$) and all other subtypes ($|G_{C2-6}|$). The last column shows the number of edges shared between both inferred graphs.

Table 1 lists the number of edges discovered by jointGES as well as by two separate estimation methods, namely the GES [6] and PC [39] algorithms. All three methods were combined with stability selection [26] in order to increase the robustness of the output and provide a fair comparison.
As expected, the two graphs inferred using jointGES share a significant proportion of edges, whereas the overlap is markedly smaller for the two separate estimation methods. Interestingly, under all estimation methods the network for the C1 subtype contains fewer edges than the network of the other subtypes, thereby suggesting that $G_{C1}$ could lack some important links that are associated with patient survival.
To obtain more insight into the relevance of the obtained networks, we analyzed the inferred hub nodes in the three networks. For our analysis we defined hub nodes as those nodes whose sum of in- and out-degree was larger than 5 in the union of the two DAGs. For jointGES, this union is given by $\hat G^{\mathrm{union}}$, the graph identified in the first step of Algorithm 1. The hub nodes identified by jointGES are CAPN1, CTSD, LMNB1, CSF2RB, and BIRC3. Among these, CAPN1 [12], CTSD [40], LMNB1 [36], and BIRC3 [18] have been reported as being central to ovarian cancer in the existing literature. In addition, CSF2RB was also discovered by joint estimation of undirected graphical models on this data set [3]. The hub nodes discovered by GES are ATF4, BIRC2, CSF2RB, TUBA1C, and MAPK3, while PC only discovered the hub node CSF2RB. While we were not able to validate the relevance of any of these genes for ovarian cancer in the literature, CSF2RB has interestingly been identified as a hub node by all methods, thereby suggesting this gene as a candidate for future genetic intervention experiments.
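The hub-node criterion used above (total degree larger than 5 in the union DAG) is straightforward to compute. A sketch with hypothetical gene labels:

```python
import numpy as np

def hub_nodes(A_union, labels, threshold=5):
    # total degree = in-degree + out-degree in the (binary) union DAG
    B = (A_union != 0).astype(int)
    degree = B.sum(axis=0) + B.sum(axis=1)
    return [lab for lab, d in zip(labels, degree) if d > threshold]

# toy example: node "g0" has 6 incident edges, every other node fewer
A = np.zeros((8, 8))
A[0, 1:4] = 1.0      # g0 -> g1, g2, g3
A[4:7, 0] = 1.0      # g4, g5, g6 -> g0
A[7, 1] = 1.0
labels = [f"g{i}" for i in range(8)]
print(hub_nodes(A, labels))   # ['g0']
```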

Discussion
In this paper we presented jointGES, an algorithm for the joint estimation of multiple related DAG models from independent realizations. Joint estimation is of particular interest in applications where data is collected not from a single DAG but rather from multiple related DAGs, such as gene expression data from different tissues, cell types, or interventional experiments. JointGES first estimates the union of DAGs $\hat G^{\mathrm{union}}$ by applying a greedy search to approximate the joint $\ell_0$-penalized maximum likelihood estimator, and then uses variable selection to discover each DAG as a subDAG of $\hat G^{\mathrm{union}}$. From an algorithmic perspective, jointGES is, to the best of our knowledge, the first method to jointly estimate related DAG models in the high-dimensional setting. From a theoretical perspective, we presented consistency guarantees for the joint $\ell_0$-penalized maximum likelihood estimator and showed that the accuracy bound scales with the total number of samples rather than the number of samples from each DAG. As a corollary to this result, we obtained consistency guarantees for $\ell_0$-penalized maximum likelihood estimation of a causal graph from a mix of observational and interventional data. Finally, we validated our results via numerical experiments on simulated and real-world data, showing that the proposed jointGES algorithm for joint inference outperforms separate-inference approaches based on well-established algorithms such as PC and GES.
The present work serves as a platform for the potential development of multiple future directions. These directions include: i) relaxing the assumption that all DAGs must be consistent with the same underlying permutation; ii) extending jointGES to the setting where the samples come from K related DAGs but it is unknown a priori which particular DAG each sample comes from; this is for example of interest in the analysis of gene expression data from tumors or tissues that consist of a mix of cell types; iii) extending jointGES to allow for latent confounders.

A Appendix: Theoretical analysis of statistical consistency results
Throughout the appendix we develop the proofs of Theorems 4.9 and 4.10. To facilitate understanding, we first provide a high-level explanation of the rationale behind the proof. If we have data generated from a single DAG and we are given a priori a permutation $\pi$ consistent with the true DAG, then we can estimate $(\hat A, \hat\Omega)$ by performing $p$ regressions, as explained in Section 2.1. By contrast, when the permutation is unknown and we need to consider all possible permutations, the total number of regressions to run increases to $p \cdot p!$. However, these regressions are not independent and, intuitively, by bounding the noise level of a subset of these regressions we can derive bounds on the noise of the other ones. We characterize the 'noise level' of these regressions by analyzing the asymptotic properties of three random events. More precisely, whenever these events hold (and we show that they hold with high probability), the noise is small enough so that the error of the $\ell_0$-penalized maximum likelihood estimator can also be bounded. Finally, we use this upper bound on the error to show that the recovered graph converges to a minimal I-MAP that is as sparse as the true DAG. The remainder of this appendix is organized as follows. In Section A.1 we define the three random events previously mentioned and show that each of them holds with high probability. Section A.2 then leverages the definition of these events to prove Theorem 4.9, our main result. In Section A.3 we prove Theorem 4.10, which is based on relaxed conditions but similar proof techniques. Finally, Section A.4 fleshes out the proof of Corollary 5.1, our result for the interventional-data setting.
Throughout the appendix, we use the following notation. Let $\hat a_j$ denote the $j$-th column of $\hat A$ and $\hat\omega_j$ denote the $j$-th diagonal entry of $\hat\Omega$. Also, denote by $a_{0j\pi}$ and $\omega_{0j\pi}$ the $j$-th column of $A_{0\pi}$ and the $j$-th diagonal entry of $\Omega_{0\pi}$, respectively.
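To make the preceding discussion concrete, the following sketch estimates $(\hat A, \hat\Omega)$ for a single DAG when a causal order is known, by running the $p$ regressions mentioned above. This is an unpenalized illustration; the paper's estimator additionally imposes an $\ell_0$ penalty on the supports:

```python
import numpy as np

def fit_sem_given_order(X, order):
    # Regress each X_j on the variables preceding j in `order`;
    # the residual variance estimates the noise variance omega_j.
    n, p = X.shape
    pos = {node: r for r, node in enumerate(order)}
    A = np.zeros((p, p))
    omega = np.zeros(p)
    for j in range(p):
        S = [i for i in range(p) if pos[i] < pos[j]]
        if S:
            coef, *_ = np.linalg.lstsq(X[:, S], X[:, j], rcond=None)
            A[S, j] = coef
            resid = X[:, j] - X[:, S] @ coef
        else:
            resid = X[:, j]
        omega[j] = resid @ resid / n
    return A, omega

# sanity check on a 3-node chain 0 -> 1 -> 2 with unit noise variances
rng = np.random.default_rng(1)
n, p = 5000, 3
A0 = np.zeros((p, p)); A0[0, 1] = 0.8; A0[1, 2] = -0.5
eps = rng.normal(size=(n, p))
X = eps @ np.linalg.inv(np.eye(p) - A0)
A_hat, omega_hat = fit_sem_given_order(X, order=[0, 1, 2])
```

With the true order, the least-squares coefficients converge to the true edge weights; with an unknown order, one would have to search over permutations, which is exactly the combinatorial difficulty discussed above.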

A.1 Random events
Our proofs of Theorems 4.9 and 4.10 are inspired by the main proof in [43] and, thus, also based on a set of random events. However, the events considered here differ from those in [43] since, as explained in Section 4.1, a naive application of the guarantees in [43] to the joint estimation scenario does not achieve optimal learning rates [cf. discussion after (12)]. Intuitively, the rate gain achieved here comes from the assumption that all the DAGs share the same permutation, allowing us to effectively reduce the size of the search space.
In our proofs we consider three random events $E_1$, $E_2$, and $E_3$ that are stated, along with proofs showing that they hold with high probability, in Sections A.1.1, A.1.2, and A.1.3, respectively.
Let $\epsilon^{(k)}_{j\pi} \in \mathbb{R}^{n_k}$ denote the residual when regressing $X^{(k)}_j$ on $X^{(k)}_S$ with $S = \{i \mid \pi(i) < \pi(j)\}$, and let $\hat\epsilon^{(k)}_{j\pi} \in \mathbb{R}^{n_k}$ denote the corresponding regression residual evaluated on the sampled data, i.e., $\hat\epsilon^{(k)}_{j\pi} = \mathbf{X}^{(k)}_j - \mathbf{X}^{(k)} a^{(k)}_{0j\pi}$. Consider a generic set $\{A^{(k)}\}_{k=1}^K$ of adjacency matrices consistent with a given permutation $\pi$, where the columns of $A^{(k)}$ are denoted by $A^{(k)} = (a^{(k)}_1, \ldots, a^{(k)}_p)$, and let $G^{\mathrm{union}}$ denote the union of the supports of $A^{(1)}, \ldots, A^{(K)}$. Then, event $E_1$ is defined as in (18) for some constant $\delta_1 > 0$ and some $\lambda_1 \asymp \sqrt{(\log p)/n}$. On random event $E_1$, a uniform inequality holds across all $K$ DAGs for the sample correlation between the regression residual $\epsilon^{(k)}_{j\pi}$ and any random variable spanned by the random vector $X^{(k)}_S$, i.e., $X^{(k)\top} v$ for any $v \in \mathbb{R}^p$ with $v_i = 0$ for all $i \notin S$. Notice that, for convenience in further steps of the analysis, this generic vector $v$ is written as $a_j - a_{0j\pi}$ in (18). Furthermore, for simplicity, in the rest of this appendix we refer to the space spanned by $X^{(k)}_S$ as the projection space of $\epsilon^{(k)}_{j\pi}$. Intuitively, one could foresee $E_1$ holding since the expected correlation between the regression residual $\epsilon^{(k)}_{j\pi}$ and $X_S$ is equal to zero, and therefore the sample correlation can be upper bounded by a sum of terms that converge to zero as $n \to \infty$, as in (18).
We now show that random event $E_1$ holds with high probability, a result stated in Theorem A.2. Essential towards proving this result is the observation that, since the random variable $X^{(k)\top}(a_j - a_{0j\pi})$ lies within the projection space of $\epsilon^{(k)}_{j\pi}$, these two random variables are independent. We can therefore deal with the randomness in $\hat\epsilon^{(k)}_{j\pi}$ through the following lemma, which bounds the correlation between a noise vector $e$ with noise level $\sigma_e$ and any vector in the column space of a matrix $Z \in \mathbb{R}^{n \times m}$ independent of $e$:
$$\mathbb{P}\left(\sup_{a}\,\frac{|e^\top Z a|/n}{\|Z a\|_2/\sqrt{n}} \geq \sigma_e\left(\sqrt{2m/n} + \sqrt{2t/n}\right)\right) \leq \exp(-t).$$
Based on the above lemma and recalling that $\mathcal{A}_\pi$ denotes the set of adjacency matrices consistent with a given permutation $\pi$, we can show the following result.
Theorem A.2. Assume that Conditions 4.2 and 4.6 hold. Then, for all $t > 0$ and all $\delta_1 > 0$, event $E_1$ holds with probability at least $1 - \exp(-t)$.

Proof. Let $\hat\epsilon_{j\pi}$ and $a_{0j\pi}$ be the concatenated vectors $\hat\epsilon_{j\pi} := (\hat\epsilon^{(1)}_{j\pi}, \ldots, \hat\epsilon^{(K)}_{j\pi})$ and $a_{0j\pi} := (a^{(1)}_{0j\pi}, \ldots, a^{(K)}_{0j\pi})$, and let $\mathbf{X} := \mathrm{diag}(\mathbf{X}^{(1)}, \ldots, \mathbf{X}^{(K)})$. We denote by $\mathcal{A}_{j\pi} \subset \mathbb{R}^{pK}$ the set containing all vectors that can be formed by vertically concatenating the $j$-th columns $a^{(k)}_j$ for all $k$, where each $a^{(k)}_j$ is consistent with $\pi$ for all $k$.
Based on this, and recalling that $\mathrm{Pa}_j(\cdot)$ denotes the set of parent nodes of $j$ in the argument graph, we define the random event $B_{j\pi}$ as in (19). Combining the facts that: i) $a_j - a_{0j\pi}$ may have at most $K(|\mathrm{Pa}_j(G^{\mathrm{union}})| + |\mathrm{Pa}_j(G^{\mathrm{union}}_{0\pi})|)$ non-zero entries, and ii) the variance of each element of $\hat\epsilon_{j\pi}$ is upper bounded by $\sigma_0^2$ (cf. Condition 4.2), we may apply Lemma A.1 to bound $\mathbb{P}(B_{j\pi})$. As can be seen from (19), event $B_{j\pi}$ depends exclusively on the set of parents of node $j$ in $G^{\mathrm{union}}_{0\pi}$. Put differently, if for two permutations $\pi_1$ and $\pi_2$ node $j$ has the same set of parents in $G^{\mathrm{union}}_{0\pi_1}$ and $G^{\mathrm{union}}_{0\pi_2}$, then the random events $B_{j\pi_1}$ and $B_{j\pi_2}$ coincide, since $\mathcal{A}_{j\pi}$, $a_{0j\pi}$ and $\hat\epsilon_{j\pi}$ would all be the same for $\pi \in \{\pi_1, \pi_2\}$. If we denote by $\Pi_j(m)$ the set of permutations where node $j$ has exactly $m$ parents in $G^{\mathrm{union}}_{0\pi}$, then there are at most $p^m$ unique events $B_{j\pi}$ for all $\pi \in \Pi_j(m)$. Applying a union bound on the events $B_{j\pi}$ across all nodes $j$ and permutations $\pi$ then yields (21). Combining (21) and (19), it follows that with probability at least $1 - \exp(-t)$, for all $j$, $\pi$ and all $a_j \in \mathcal{A}_{j\pi}$, we have that
$$\frac{|\hat\epsilon^\top_{j\pi}\mathbf{X}(a_j - a_{0j\pi})|}{\|\mathbf{X}(a_j - a_{0j\pi})\|_2} \leq \sigma_0\left(\sqrt{2K\big(|\mathrm{Pa}_j(G^{\mathrm{union}})| + |\mathrm{Pa}_j(G^{\mathrm{union}}_{0\pi})|\big)} + \sqrt{2\big(t + |\mathrm{Pa}_j(G^{\mathrm{union}}_{0\pi})|\log p + 2\log p\big)}\right).$$
Based on the collection of adjacency matrices $\{A^{(k)}\}_{k=1}^K \in \mathcal{A}_\pi$, we define another collection where each column $a^{(k)}_j$ is modified so that the positions of its non-zero entries coincide with those of $a_{0j\pi}$. Using this fact, we have that for all $j$ and $\pi$, with probability at least $1 - \exp(-t)$, it holds that
$$\frac{|\hat\epsilon^\top_{j\pi}\mathbf{X}(a_j - a_{0j\pi})|}{\|\mathbf{X}(a_j - a_{0j\pi})\|_2} \leq \sigma_0\left(\sqrt{2K\big(|\mathrm{Pa}_j(G^{\mathrm{union}})| + |\mathrm{Pa}_j(G^{\mathrm{union}}_{0\pi})|\big)} + \sqrt{2\big(t + |\mathrm{Pa}_j(G^{\mathrm{union}}_{0\pi})|\log p + 2\log p\big)}\right).$$
In the above expression we may use that, for all $\delta_1, a, b > 0$, it holds that $ab \leq \delta_1 a^2 + b^2/\delta_1$. By combining this with the fact that $(a+b)^2 \leq 2(a^2+b^2)$ for all $a, b > 0$ and the fact that $K = o(\log p)$ from Condition 4.6, we obtain (24). By replacing (24) back into (23), we obtain a bound on each individual correlation term. Rewriting the absolute value on the left-hand side as the sum of the corresponding $K$ absolute values and adding the resulting inequalities for $j = 1, \ldots, p$, we get that the uniform bound holds with probability at least $1 - \exp(-t)$. Finally, by noticing that $\|\mathbf{X}(a_j - a_{0j\pi})\|_2^2 = \sum_{k=1}^K \|\mathbf{X}^{(k)}(a^{(k)}_j - a^{(k)}_{0j\pi})\|_2^2$, we recover the statement of the theorem, concluding the proof.

A.1.2 Random event E 2
Let $\omega^{(k)}_{0j\pi}$ denote the $j$-th diagonal entry of $\Omega^{(k)}_{0\pi}$. Event $E_2$ is then defined as in (25), for some $\lambda_2 \asymp \sqrt{(\log p)/n}$. Mimicking the development for event $E_1$, we now show that $E_2$ also holds with high probability. This result is stated in Theorem A.4. As in the proof of Theorem A.2, we first consider the asymptotic properties for a particular node $j$ and permutation $\pi$, and then leverage this to get a uniform bound across all permutations and nodes. For this proof, we use the following lemma stated in [3], which also follows from [45]. After the lemma, we formally state our result.
Lemma A.3. [3, Lemma 2] Suppose $X_1, \ldots, X_n$ are $K$-dimensional random vectors satisfying $\mathbb{E}X_i = 0$ and $\|X_i\|_2 \leq M$ for $1 \leq i \leq n$. Then, for any $\beta > 0$ and $x > \beta$, a Gaussian-approximation tail bound holds in which $\lambda_{\max}$ denotes the largest eigenvalue of $\mathrm{Cov}(\sum_{i=1}^n X_i)$, $N$ is a $K$-dimensional standard normal random vector, and $c_1, c_2$ are positive constants.
Theorem A.4. Assume that Conditions 4.5 and 4.6 hold. Then there exist constants $c_1, c_2, c_3 > 0$ such that event $E_2$ holds with high probability, where $c_s$ is the constant defined in Condition 4.5.
Proof. We begin by analyzing the asymptotic properties of $\hat\epsilon^{(k)}_{j\pi}$ for all $k$, given a fixed permutation $\pi$ and node $j$. More specifically, consider the random event $C_{j\pi}$ defined in (26), for some positive constant $c_1$. Following the proof of Theorem 1 in [3], we define $u^{(k)}_t$ and the rate $\lambda_n$ as in (27). A straightforward substitution in (26) allows us to rewrite the event $C_{j\pi}$ as in (28). Notice that in order to apply Lemma A.3 to bound the probability of occurrence of $C_{j\pi}$, we would need $\|u_t\|_2$ to be smaller than a constant $M$, which is not true in general. Hence, we split $C_{j\pi}$ into two subevents, enabling the use of Lemma A.3. More precisely, whenever the random event $F_a$ holds, the $\ell_2$ norm of $u_t$ is smaller than $\sqrt{c_2/n}$ for some constant $c_2 > 0$. This also implies that $\lambda_{\max}\{\mathrm{Cov}(\sum_{t=1}^n \tilde u_t)\} \leq c_2/n$. Applying Lemma A.3, where we select $x = (1-\delta)c_1\lambda_n$, $\beta = \delta x$ for some arbitrary constant $0 < \delta < 1$, and $M = \lambda_n^{-1}K^{1-a}/n$, it follows that an exponential bound holds for some constants $c_3, c_4 > 0$ that increase if the constant $c_1$ is increased. In addition, by choosing $a = 7/2$, there must exist a large enough $c_1$ such that $c_3, c_4 > 1$, and therefore (31) holds. Replacing (31) into (29) gives us the sought exponential bound for the first summand in (28).
We are now left with the task of finding a bound for $\mathbb{P}(\neg F_a)$. Relying on the fact that $n^{(1)} \asymp \cdots \asymp n^{(K)}$ (cf. Condition 4.6), we get that $K \max_k n_k \leq c_5\, n$ for some constant $c_5$. Following this, in order to bound the probability that $\neg F_a$ holds, we further rely on the tail bound of the chi-squared random variable $\|u_t\|_2^2$. Recalling the definition of $\lambda_n$ from (27) and that we have fixed $a = 7/2$, it follows that there exists a constant $\eta$ such that the required bound holds, where we have used the second inequality in (33). Furthermore, by leveraging the first inequality in (33), we obtain an exponential bound for the second summand in (28).
Having found exponential bounds for both summands in (28), it follows that for $c_1 > 0$ sufficiently large and $\alpha$ sufficiently small we have that $\mathbb{P}(C_{j\pi}) \leq (1 + c_5)\exp(-|\mathrm{Pa}_j(G^{\mathrm{union}})|\log p - c_s\log p)$. Following an argument based on union bounds similar to the one presented in the proof of Theorem A.2, we obtain a uniform bound across all nodes $j$ and permutations $\pi$ for some constant $c_1 > 0$. The previous expression immediately implies a bound on the probability that there exists a permutation $\pi$ violating (25), thus recovering the statement of the theorem (since $c_s > 2$ by Condition 4.5) after renaming the constants on the right-hand side accordingly.

A.1.3 Random event E 3
Event $E_3$ is defined as the intersection of $2K$ events that we denote by $\{E^{(k)}_{3a}\}_{k=1}^K$ and $\{E^{(k)}_{3b}\}_{k=1}^K$, where $E^{(k)}_{3a}$ and $E^{(k)}_{3b}$ are specific to the $k$th SEM. More specifically, event $E^{(k)}_{3a}$ ensures that all the estimated noise variances $\hat\omega^{(k)}_j$ associated with the $k$th SEM are finite and bounded away from zero; formally, we define the corresponding events in (35) for $k = 1, \ldots, K$. Event $E^{(k)}_{3b}$ imposes a universal lower bound on the norm achievable by any linear combination of the data associated with the $k$th DAG; mathematically, we consider the events in (36) for $k = 1, \ldots, K$, for some $\delta_3 > 0$ and $\lambda^{(k)}_3 \asymp \sqrt{(\log p)/n_k}$. Based on (35) and (36) we define the events $E^{(k)}_3 := E^{(k)}_{3a} \cap E^{(k)}_{3b}$ and $E_3 := \bigcap_{k=1}^K E^{(k)}_3$. Leveraging the fact that Condition 4.4 enforces the maximum in-degree of each $G^{(k)}_{0\pi}$ to be at most $\alpha n_k/\log p$ for some positive constant $\alpha$, we can generalize Lemmas 7.5 and 7.7 from [43] into the following lemma.
The statement of Lemma A.5 holds for some $t > 0$. Based on this, define $\lambda^{(k)}_3 := 3\sigma_0\sqrt{\frac{\log p}{n_k}}$ and $\delta_3$ accordingly. Lemma A.5 shows that under certain conditions the events $E^{(k)}_3$ hold with high probability, thus playing a role analogous to that of Theorem A.2 for event $E_1$ and Theorem A.4 for event $E_2$.
With the events E 1 , E 2 , and E 3 defined and having shown under which conditions these hold with high probability, in the next section we leverage these events to prove Theorem 4.9.
A.2 Proof of Theorem 4.9
Lemma A.6. Assume we are on $E_1 \cap E_2 \cap E_3$ and that Condition 4.2 holds. Consider a regularizer in (7) satisfying $\lambda^2 > \lambda_1^2/\delta_1 + \lambda_2^2/\delta_2$ with $0 < \delta_1 < 1/\beta^2$ and $0 < \delta_2 < 1/(2\beta^2\sigma_0^2)$. Then $(\hat\pi, \{(\hat A^{(k)}, \hat\Omega^{(k)})\}_{k=1}^K)$, the global optimum of (7), satisfies (39).

Proof. By definition, the global optimum must satisfy (40). Let $\hat\pi$ denote any permutation consistent with all $\hat A^{(k)}$. Since the likelihoods $\ell_{n_k}(\mathbf{X}^{(k)}; A^{(k)}_{0\hat\pi}, \Omega^{(k)}_{0\hat\pi})$ achieve the same value for any such $\hat\pi$, we may replace the former by the latter in (40) and expand the definition of the likelihood function in (5) to obtain (41). Basic manipulations transform this expression into the inequality (42). Finally, using the fact that $\|\mathbf{X}^{(k)}_j - \mathbf{X}^{(k)} a^{(k)}_j\|_2^2/n_k = \|\mathbf{X}^{(k)}(a^{(k)}_j - a^{(k)}_{0j\pi})\|_2^2/n_k - 2\,\hat\epsilon^{(k)\top}_{j\pi}\mathbf{X}^{(k)}(a^{(k)}_j - a^{(k)}_{0j\pi})/n_k + \|\hat\epsilon^{(k)}_{j\pi}\|_2^2/n_k$ and replacing this into (42), we get (43). In order to further bound the expression in (43), notice that the first summand on the right-hand side of the inequality corresponds to the sum of all empirical correlation coefficients. Leveraging the assumption that $E_1$ holds [cf. (18)], we obtain (44). In order to bound the second and third terms, we first restate their difference as in (45). Next, by using the Cauchy-Schwarz inequality, i.e., $|\sum_{i=1}^n u_i v_i|^2 \leq \sum_{j=1}^n |u_j|^2 \sum_{k=1}^n |v_k|^2$, we further bound (45) as in (46). From the fact that event $E_2$ holds [cf. (25)], we can upper bound the first of the two factors on the right-hand side of (46) by $2\lambda_2^2(p + |G^{\mathrm{union}}_0(\pi)|)$. Further, relying on the inequality $2ab \leq a^2/\delta_2 + \delta_2 b^2$ for any $\delta_2 > 0$, it follows that (47) holds. By replacing (44) and (47) into (43), we recover (39), as we wanted to show.
From Lemma A.6 it follows that the global optimum of (7) corresponds to a minimal I-MAP, but no claim is made about the sparsity level of this I-MAP. In order to show that the solution is indeed sparse, we must rely on Conditions 4.7 and 4.8. Consequently, in Lemma A.7 we show that $|G^{\mathrm{union}}_{0\pi}|$ cannot be much larger than $|\hat G|$. Then, in Theorem A.8 we further show how to cancel out $|G^{\mathrm{union}}_{0\pi}|$ with $|\hat G|$ in (39) to obtain our main result.
Lemma A.7. Assume Condition 4.7 holds and let $\lambda > 0$ satisfy (48). Consider also constants $\eta_1$ and $\eta_2$ with $0 \leq \eta_1 < 1$ and $\eta_2 > 0$.

Proof. Let $N^{(k)}$ and $M^{(k)}$ be the sets of entries satisfying (49). Leveraging inequality (48) and Condition 4.7, we further obtain (50). Notice that for all $(i,j)$-th entries in the set $N^{(k)} \cap (M^{(k)})^C$ it must be that $|[\hat A^{(k)}]_{ij}| > 0$. Hence, $N^{(k)} \cap (M^{(k)})^C$ corresponds to a subset of the non-zero entries of $\hat A^{(k)}$, which in turn corresponds to a subset of the edges in $\hat G$. From this we can infer the stated bound, where the last inequality follows by combining (50) with the definition of $\eta_1$ in the statement of the lemma. The proof concludes by replacing Condition 4.7 in the above inequality.

A.2.2 Proof of Theorem 4.9
It follows from Theorem A.2, Theorem A.4, and Lemma A.5 that there exist constants $\lambda_1, \lambda_2$ with $\lambda_1^2 \asymp \lambda_2^2 \asymp \frac{\log p}{n}$, as well as $\lambda^{(k)}_3$ with $(\lambda^{(k)}_3)^2 \asymp \frac{\log p}{n_k}$ for all $k$, such that the random event $E_1 \cap E_2 \cap E_3$ occurs with probability $1 - \exp(-c\log p)$ for some constant $c > 0$.

A.3 Proof of Theorem 4.10
We first introduce a lemma that will be instrumental in proving Theorem 4.10 and that can be obtained by generalizing Lemmas 7.2 and 7.3 in [43].
In order to show Theorem 4.10, we begin the proof just like that of Theorem 4.9 in Section A.2.2, up to expression (56). It then follows from Condition 4.7' and $|\hat G| \geq \sum_{k=1}^K w_k|\hat G^{(k)}|$ that there exists some $\lambda_0^2 \asymp C_{\max}\frac{\log p}{n}\,(p/|G^{\mathrm{union}}_0| \vee 1)$, where $C_{\max}$ is defined in Condition 4.8, such that the corresponding bound holds for any $\lambda > 0$. Let $\tilde\lambda^2 := \lambda^2 \cdot c_t(\pi_0)$ and $\tilde\delta_s := \delta_s/c_t(\pi_0)$; it then follows that there must exist at least one $k$ such that the bound holds for any $\lambda > 0$. Since, according to Condition 4.7', $c_t(\pi)$ scales as a constant for permutations consistent with $G^{\mathrm{union}}_0$, we have that $\tilde\delta_s$ is still a constant and $\tilde\lambda \asymp \lambda$. In this case, it follows from Lemma A.9 and Condition 4.8' that there exist constants $0 < \delta_s < 1$ and $\delta_s' > 0$ such that the claim holds by choosing $\lambda$ with $\lambda^2 \asymp C_{\max}\frac{\log p}{n}(p/|G^{\mathrm{union}}_0| \vee 1)$ and $\lambda^2\delta_s > \lambda_0^2\delta_s'$. It also follows from Lemma A.9 that $|\hat G^{(k)}| \geq c_g|G^{(k)}_{0\pi}| \geq c_g|G^{(k)}_0|$ for some positive constant $c_g$. Mimicking the arguments employed in the proof of Theorem 4.9 from (57) until the end of the proof, one can show that expressions (13) and (14) in the statement of Theorem 4.10 hold true.

A.4 Proof of Corollary 5.1
The following lemma is instrumental in showing the corollary; in its statement, $n_{-j}$ denotes the total number of samples from the interventions in which node $j$ is not intervened on, i.e., $n_{-j} = \sum_{k: j \notin I_k} n_k$.
Recall that in the interventional setting $G^{\mathrm{union}}_0$ is given by the true graph $G_0$ of the non-intervened model, and that the $K$ models to be inferred are $(A^{I_k}_0, \Omega^{I_k}_0)$ for $k = 1, \ldots, K$. Then, applying the inequality $\log\big(\sum_{k=1}^K w_k a_k\big) \geq \sum_{k=1}^K w_k \log a_k$, which holds for any choices of $a_1, \ldots, a_K > 0$ and weights $w_1, \ldots, w_K > 0$ with $\sum_{k=1}^K w_k = 1$ by concavity of the logarithm, we obtain the required bound. Corollary 5.1 then immediately follows from the proof of Theorem 4.9.