Reconstruction of a directed acyclic graph with intervention.

Identification of causal relations among variables is central to many scientific investigations, as in regulatory network analysis of gene interactions and brain network analysis of effective connectivity between regions of interest. Statistically, causal relations are often modeled by a directed acyclic graph (DAG), and hence reconstruction of a DAG's structure leads to the discovery of causal relations. Yet reconstruction of a DAG's structure from observational data alone is generally impossible, because a Gaussian DAG model is usually not identifiable when the error variances are unequal. In this article, we reconstruct a DAG's structure with the help of interventional data. In particular, we construct a constrained likelihood that regularizes the intervention matrix in addition to the adjacency matrix, subject to an error variance constraint that further reinforces model identifiability. Theoretically, we show that the proposed constrained likelihood leads to an identifiable model, and thus to correct reconstruction of a DAG's structure through parameter estimation, even with unequal error variances. Computationally, we design efficient algorithms for the proposed method. In simulations, we show that the proposed method achieves higher reconstruction accuracy with the help of interventional observations.


Introduction
Directed acyclic graph (DAG) models are useful for describing pairwise causal relations between random variables, defined by a certain Markov property [5], with each node representing one variable and each directed edge representing the corresponding pairwise causal relation. DAG models have been widely used in gene and social networks [7,19]. To identify causal relations, interventional observations are usually collected in addition to observational attributes [16]. The central topic of this article is the reconstruction of a DAG model based on interventional data, together with pertinent issues concerning the effect of intervention on the reconstruction of a DAG's structure.
In the literature, it is generally believed that interventions may help the reconstruction of a DAG's structure, particularly when a DAG model is not identifiable from data, that is, when DAGs in a Markov equivalence class are not distinguishable based on observational data alone [16]. In biological experiments, for instance, intervention occurs in the form of randomized treatments in a clinical trial or gene knockdown or knockout experiments in systems biology. In such a situation, some or all system variables are controlled, permitting direction estimation of ambiguous edges connecting to these controlled variables. Yet exactly how intervention impacts reconstruction of a DAG's structure remains unknown. Consequently, it is practically important to design a reconstruction method for interventional data that permits identification of a DAG's structure. Most existing methods for intervention [6,10,8] assume known intervention, that is, the variables affected by the intervention are known a priori before data collection; see [9] and references therein.
However, assuming known intervention is impractical, as in systems biology, where the effect of various chemicals intervening a system cannot be precisely known. To our knowledge, one exception is the Bayesian method of [4], which is designed for low-dimensional problems without theoretical guarantees, due to the super-exponential complexity in the number of nodes.
In this article, we propose a novel variance constraint on interventions to fully identify the DAG structure in the framework of unknown intervention. Theoretically, we show in Theorem 1 that the DAG structure is fully identifiable under the constraint, which is otherwise possible only when all the error variances are the same [18]. Moreover, we propose a constrained maximum likelihood that seeks the most efficient interventions by sparsity pursuit, leading to identifiable reconstruction of a DAG's structure. Computationally, we develop an efficient algorithm to solve the nonconvex minimization subject to the quadratic variance constraint, based on the alternating direction method of multipliers (ADMM) [2]. In simulations, we investigate the impact of intervention on reconstruction and compare the proposed method with its counterpart without intervention. Overall, the proposed method performs well.
This article is organized into seven sections. Section 2 introduces the proposed method and discusses the issue of identifiability under intervention, followed by the computational development in Section 3. Section 4 establishes the consistency theorems of the proposed method. Section 5 presents simulations studying the intervention effect and analyzes two real datasets. Section 6 summarizes the results. Finally, Appendix A contains technical details and proofs.

Method
Consider a causal model consisting of p random variables Y = (Y_1, …, Y_p)⊤ described by a DAG, with each node representing one variable and directed edges encoding causal relations between pairs of variables, where ⊤ denotes the transpose. This model factorizes the joint distribution P(Y) into a product of conditional distributions of each variable given its parents: P(Y) = ∏_{j=1}^p P(Y_j | pa_j), where pa_j denotes the parent set of Y_j and is defined to be empty if Y_j has no parents. This factorization is known as the local Markov property [5].
A DAG is modeled by structural equations as

Y = AY + ϵ, ϵ ~ N(0, D), (1)

where ϵ = (ϵ_1, ϵ_2, …, ϵ_p)⊤ represents the latent or unexplained error, D = Diag(σ_1², …, σ_p²) is the error covariance matrix, and A = (A_ij)_{p×p} is an adjacency matrix that uniquely determines a DAG. Here A_ij ≠ 0 encodes an edge from node j to node i. Under (1), the inverse covariance matrix of Y is Ω = (I − A)⊤ D^{−1} (I − A), where I is the identity matrix.
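As a quick numerical illustration of (1), the following sketch (in Python with NumPy; the paper's implementation is in R/C, so this is ours, with illustrative parameter values) simulates a three-node chain and checks that the empirical precision matrix of Y matches Ω = (I − A)⊤D^{−1}(I − A):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 3

# Toy adjacency matrix for the chain 1 -> 2 -> 3; A[i, j] != 0 encodes an
# edge from node j to node i, as in the text, and A is strictly lower
# triangular here so the graph is acyclic.
A = np.array([[0.0, 0.0, 0.0],
              [0.8, 0.0, 0.0],
              [0.0, 0.5, 0.0]])
D = np.diag([1.5, 1.2, 1.0])        # unequal error variances
I = np.eye(p)

# Y = A Y + eps  =>  Y = (I - A)^{-1} eps, so the precision matrix of Y is
# Omega = (I - A)^T D^{-1} (I - A).
Omega = (I - A).T @ np.linalg.inv(D) @ (I - A)

eps = rng.multivariate_normal(np.zeros(p), D, size=200_000)
Y = np.linalg.solve(I - A, eps.T).T  # one draw of Y per row

emp_precision = np.linalg.inv(np.cov(Y, rowvar=False))
print(np.round(emp_precision - Omega, 2))  # entries close to zero
```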

Intervention or covariate models and variance constraint
To treat non-identifiability in an observational study, consider a model with W intervention variables X = (X_1, X_2, …, X_W)⊤, where the outcome Y is observed together with the intervention variables X, some of which may be non-informative.
Incorporating the intervention variables into (1), we obtain

Y = AY + BX + ϵ, ϵ ~ N(0, D), (2)

where A, ϵ and D are defined as in (1), and B = (B_jw)_{p×W} is an unknown intervention coefficient matrix whose (j, w)th entry B_jw indicates the directional strength of the intervention of X_w on Y_j. When B_jw = 0 for all j = 1, …, p, there is no intervention of X_w on any Y_j, and thus X_w is non-informative. Note that (2) is a causal model with covariates X.
In the situation of unequal error variances, with the help of intervention, we may impose constraints to achieve model identifiability, which is otherwise impossible [18]. Assume that X ~ N(0, Σ_X) is independent of ϵ. Without loss of generality, assume Σ_X = I subsequently, because we can always reparametrize X as Σ_X^{−1/2} X. Under (2), the precision matrix of Y is Ω = (I − A)⊤ (BB⊤ + D)^{−1} (I − A). In (2), we impose the variance constraint:

Var(BX + ϵ) = θI, or BB⊤ + D = θI, (3)

where θ > 0 is a parameter to be estimated.
More details are deferred to Sections 3 and 4.
Theorem 1 (Identifiability). Assume that X ~ N(0, Σ_X) is independent of ϵ in (2), and that A_jk ≠ 0 for every k that is a parent of j, j = 1, 2, …, p. Then, under (3), A is identifiable from the distribution of (Y, X).
As suggested by Theorem 1, A in (2) becomes identifiable once (3) is imposed on B, which is otherwise impossible. Note, however, that the interventions B leading to an identifiable A may not be unique. In what follows, we impose a sparsity constraint to identify a most sparse B in terms of its number of nonzero elements.

Constrained maximum likelihood
In this section, we estimate A subject to the DAG requirement while seeking a most sparse B subject to (3). Consequently, the smallest set of informative intervention variables can be identified through B.
Under (2), two data matrices (y_ij)_{n×p} and (x_iw)_{n×W} are observed, with n the sample size. The negative log-likelihood is then, up to an additive constant,

ℓ(A, B, D) = (1/2) ∑_{j=1}^p [ n log σ_j² + σ_j^{−2} ∑_{i=1}^n (y_ij − ∑_{k≠j} A_jk y_ik − ∑_{w=1}^W B_jw x_iw)² ]. (4)

To identify the nonzero entries of A and B, we impose sparsity constraints for regularization:

∑_{j≠k} I(A_jk ≠ 0) ≤ K_1, ∑_{j,w} I(B_jw ≠ 0) ≤ K_2, (5)

where K_1 and K_2 are nonnegative integer-valued tuning parameters. Note that the constraint on B removes zero entries, and thus zero columns of B, which can be regarded as selection of intervention variables. To reinforce the DAG requirement, we further impose the acyclicity constraints of [27], ensuring that no directed loops occur; these constraints, denoted (7) and (8), are formulated through a dual variable matrix λ = {λ_jl}_{p×p}. Minimizing (4) subject to (5), (7), (8) and the variance constraint (3) defines the constrained maximum likelihood estimate (CMLE), written compactly as problem (9).
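For concreteness, the sketch below evaluates the Gaussian negative log-likelihood of model (2) up to an additive constant; the function name and the toy parameter values are our own illustrative choices, not the paper's implementation:

```python
import numpy as np

def neg_log_lik(A, B, sigma2, Y, X):
    """Gaussian negative log-likelihood of model (2), up to an additive
    constant: node-wise residuals are y_j - sum_k A[j,k] y_k - sum_w B[j,w] x_w,
    with error variance sigma2[j] for node j."""
    n = Y.shape[0]
    R = Y - Y @ A.T - X @ B.T        # n x p matrix of residuals
    return 0.5 * np.sum(n * np.log(sigma2) + (R ** 2).sum(axis=0) / sigma2)

# Sanity check on simulated data: the generating parameters should attain a
# smaller loss than a mis-specified adjacency matrix.
rng = np.random.default_rng(0)
p, n = 3, 2000
A = np.array([[0.0, 0.0, 0.0],
              [0.8, 0.0, 0.0],
              [0.0, 0.5, 0.0]])
B = 0.5 * np.eye(p)
sigma2 = np.array([1.5, 1.2, 1.0])
X = rng.normal(size=(n, p))
eps = rng.normal(size=(n, p)) * np.sqrt(sigma2)
Y = np.linalg.solve(np.eye(p) - A, (X @ B.T + eps).T).T
assert neg_log_lik(A, B, sigma2, Y, X) < neg_log_lik(np.zeros((p, p)), B, sigma2, Y, X)
```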

Computation
This section develops a computational strategy to solve (9). First, θ is estimated by θ̂ through an estimate Â of A obtained from the method of [27], where the entries Â_jk, j, k = 1, …, p, are obtained by a constrained maximum likelihood estimate with the sparsity and acyclicity constraints, based on the structural equation model (1) with the observational data Y alone. Then we solve (9), with θ replaced by θ̂, using blockwise coordinate descent that alternates between the two blocks (A, B) and D until convergence. In particular, the (A, B)-block is solved via difference-of-convex (DC) programming followed by the alternating direction method of multipliers (ADMM), while the D-block is updated by gradient descent. More details are given subsequently.

Optimization subject to the variance constraint
To deal with the variance constraint in (9), we first consider a general constrained minimization subject to a variance constraint:

min_B f(B), subject to BB⊤ = Λ, (11)

where f(B) is a cost function and Λ is a diagonal matrix.
For (11), we work with an equivalent form, introducing a dual matrix C to decouple B from the constraint: min_{(B, C)} f(B), subject to CC⊤ = Λ and C − B = 0. We then apply the alternating direction method of multipliers (ADMM) [2], with augmented Lagrangian

L_ρ(B, C, y) = f(B) + y⊤ vec(C − B) + (ρ/2) ‖C − B‖_F²,

where y ∈ ℝ^{pW} is the Lagrange multiplier for the constraint C − B = 0 and ρ > 0 is the augmented Lagrangian parameter. This is further simplified by introducing a scaled dual variable matrix V = {V_jl}_{p×W}, with vec(V) = y/ρ, to absorb y⊤ vec(C − B) into the quadratic form:

L_ρ(B, C, V) = f(B) + (ρ/2) ‖C − B + V‖_F² − (ρ/2) ‖V‖_F². (12)

ADMM then iterates through three steps to solve (12) until convergence. In the kth iteration,

C^{(k+1)} = argmin_{C: CC⊤ = Λ} ‖C − B^{(k)} + V^{(k)}‖_F², (13)
B^{(k+1)} = argmin_B f(B) + (ρ/2) ‖C^{(k+1)} − B + V^{(k)}‖_F², (14)
V^{(k+1)} = V^{(k)} + C^{(k+1)} − B^{(k+1)}. (15)

In (13)-(15), the variance constraint CC⊤ = Λ enters only through (13). Next we provide a closed-form solution of (13) in Lemma 1.
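The three ADMM steps can be sketched as follows for the toy cost f(B) = ½‖B − M‖²_F standing in for the likelihood terms. The constraint step replaces the singular values of its argument by √θ, in the spirit of Lemma 1 (whose exact statement is not reproduced here); the matrix M, the choice of ρ, and all names are our own illustrative assumptions.

```python
import numpy as np

def project_rows_orthogonal(T, theta):
    """Closed-form minimizer of ||C - T||_F over {C : C C^T = theta I},
    for T of size p x W with p <= W: keep the singular vectors of T and
    replace its singular values by sqrt(theta) (a polar-factor argument)."""
    U, _, Vt = np.linalg.svd(T, full_matrices=False)
    return np.sqrt(theta) * U @ Vt

def admm_variance_constraint(M, theta, rho=1.0, n_iter=200):
    """ADMM sketch for min_B 0.5 * ||B - M||_F^2  s.t.  B B^T = theta I.
    This illustrates the splitting (13)-(15), not the full estimator (9)."""
    p, W = M.shape
    B = M.copy()
    V = np.zeros((p, W))                            # scaled dual variable
    C = project_rows_orthogonal(M, theta)
    for _ in range(n_iter):
        C = project_rows_orthogonal(B - V, theta)   # constraint step
        B = (M + rho * (C + V)) / (1 + rho)         # quadratic cost step
        V = V + C - B                               # dual update
    return B, C

rng = np.random.default_rng(1)
M = rng.normal(size=(4, 10))
B, C = admm_variance_constraint(M, theta=2.0)
print(np.round(C @ C.T, 3))  # equals 2 * I up to floating point
```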

Algorithm for solving (9)
After plugging θ̂ into (9), we begin with the update of the (A, B)-block by fixing D at an initial value D_0 and optimizing (9) with respect to (A, B). A good initial estimate D_0 can be obtained by solving (9) without the variance constraint; details are given in the Appendix.
To solve (9) with a fixed D, we follow [27] and convert (9) to its equivalent dual form in two steps. First, we apply DC programming: each nonconvex constraint function in (7) and (8) is decomposed into a difference of two convex functions, based on which a sequence of convex approximations of the nonconvex constraint sets is constructed iteratively; the details are given in the Appendix. At the mth iteration, we then solve a relaxed convex subproblem (16) over (A, B), with slack variables ξ_jls ≥ 0 for j, l, s = 1, …, p, j ≠ l, where ξ = {ξ_jls}_{p×p×p} is a slack variable tensor and w_jl are the DC weights from the previous iterate; the iteration continues until a termination criterion is met. This subproblem is solved by the ADMM scheme of Section 3.1 subject to CC⊤ + D = θ̂I. For the D-block, σ_j² is updated by a gradient descent step with step size α > 0.
The computational strategy is summarized in Algorithm 1.
Step 1. Estimate θ̂ from the observational data, using the estimate Â obtained by the method of [27].
Step 2. Fix D at an initial value D_0. Set a pre-specified error tolerance ϵ_0 and the maximum number of iterations M_0 for the blockwise coordinate descent.
Step 3. ((A, B)-block) Initialize A and B. Set a pre-specified error tolerance ϵ_1 and the maximum number of DC iterations M_1.
Step 3.1. At the mth DC iteration, compute A^(m) and B^(m) by cycling through the ADMM updating steps until convergence.

Theory
This section develops a theory quantifying the reconstruction error of the CMLE defined in (9). In particular, we first show that the estimate θ̂ achieves the estimation accuracy of the oracle estimator θ̂_O, defined as the maximum likelihood estimate of [27] given the true nonzero pattern of the DAG. Then we establish reconstruction consistency for the true DAG structure defined by the adjacency matrix A in the presence of the intervention variables X in (2).
Let A, B, and D denote the model parameters under (2). Let E = {(i, j) : A_ij ≠ 0} be the set of nonzero edges of the graph G, and let |E| denote its size.
In what follows, we assume the variance constraint (3); by Theorem 1, A is then identifiable from the distribution of (Y, X), and thus the structure of G is identifiable. Let G_0, A_0, B_0, R_0, E_0, Ω_0, θ_0 denote the truth, and let Â_O and θ̂_O denote the oracle estimator, namely the maximum likelihood estimate given the true set E_0 of nonzero edges. For the observational model, let A_obs denote the model parameter and Ω_obs the corresponding precision matrix. The degree of reconstructability is defined through the minimal Hellinger distance h²(Ω, Ω_0) between Ω and Ω_0 over candidate models whose edge sets differ from E_0, under (2) and the variance constraint (3), where E_1 \ E_2 denotes the set difference between E_1 and E_2. The degree of reconstructability measures the difficulty of reconstructing the graph, and we require it to exceed a certain level in order for our proposed method to be consistent in reconstructing the graph structure; for a detailed discussion of the degree of reconstructability of a graph, cf. [27]. Assumptions A.1 and A.2 concern the smallest eigenvalues and the maximum diagonal element of Ω; under Assumption A.1, the likelihood function is bounded. Assumption B is the key condition for consistency of the reconstructed graph structure: it requires the degree of reconstructability to be no less than a lower bound related to the size of p or |E_0|. Assumption C, involving positive constants d_1, d_2 and d_3, requires the Hellinger distance to be smooth, so that the L_0 function can be approximated by its computational surrogate, the TLP function [21], to the desired level by tuning τ.
First, we show that θ is uniquely defined regardless of the value of B.

Lemma 2. Under (2), θ uniquely satisfies (3).

Then, in Theorem 2, we show that θ̂ achieves the optimal parameter estimation attainable from the observational data.
Theorem 2 (Optimal parameter estimation). Under Assumptions A.1 and C, if K_1 = |E_0| and τ ≤ C_min(Ω_0)M_1/(4p), then there exists a constant c_2 > 0, say c_2 = (4/27)·(1/1926), such that for any (n, |E_0|, p), θ̂ achieves the estimation accuracy of the oracle estimator θ̂_O with high probability. The next theorem gives a reconstruction error bound, under which we obtain reconstruction consistency of the CMLE as well as its optimal parameter estimation, with the same constant c_2 and for any (n, |E_0|, p).

Simulations
This section examines the performance of the proposed method and demonstrates how intervention, as well as the variance constraint, improves the reconstructability of a DAG's structure. Seven methods are compared: the proposed method with intervention, the same method without the variance constraint (3), the observational method without intervention [27], the constraint-based PC algorithm [22], the score-and-search method GES [3], and two hybrid methods, Max-Min Hill-Climbing (MMHC) [24] and ARGES [14]. For the PC algorithm and GES, we use the R package pcalg, while for MMHC we use the R package bnlearn. For ARGES, we use the ARGES-CIG version [14], first conducting neighborhood selection using the R package huge and then applying the greedy search using pcalg. The other three methods are implemented in R with the main algorithm written in C, available in the R package intdag [17] (https://cran.r-project.org/web/packages/intdag/index.html).
Several performance metrics are used to measure the accuracy of reconstruction of a graph's skeleton as well as its directionality. With respect to the skeleton of a graph, we use the false discovery rate (FDR) and false negative rate (FNR), defined as FDR = FP/(TP + FP) and FNR = FN/(TP + FN), where TP, FP, TN, and FN denote the true positives, false positives, true negatives and false negatives, respectively. These two metrics together measure the ability to control false discoveries as well as false negatives. With regard to directionality, we employ the structural Hamming distance (SHD), defined as the minimal number of operations (edge insertions, deletions, or flips) required to transform one DAG into the other, cf. [24]. A smaller SHD indicates that two DAGs are closer to each other. To compute the SHD, we use the R package pcalg. All metrics are applied to the estimate of the adjacency matrix A, since we focus on the reconstructability of the DAG.
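As a simplified illustration (ours, not the paper's pcalg-based computation), the snippet below computes FDR and FNR directly on directed adjacency matrices; the paper's skeleton versions would symmetrize the matrices first, and the SHD is computed with the R package pcalg rather than re-implemented here.

```python
import numpy as np

def edge_metrics(A_hat, A_true):
    """FDR and FNR for the estimated edge set {(i, j): A_hat[i, j] != 0},
    compared against the true adjacency matrix, treating edges as directed."""
    est = (A_hat != 0)
    tru = (A_true != 0)
    tp = np.sum(est & tru)
    fp = np.sum(est & ~tru)
    fn = np.sum(~est & tru)
    fdr = fp / max(tp + fp, 1)   # guard against an empty estimate
    fnr = fn / max(tp + fn, 1)
    return fdr, fnr

A_true = np.array([[0, 0, 0],
                   [1, 0, 0],
                   [0, 1, 0]], dtype=float)
A_hat = np.array([[0, 1, 0],     # one reversed edge, one correct edge
                  [0, 0, 0],
                  [0, 1, 0]], dtype=float)
print(edge_metrics(A_hat, A_true))  # (0.5, 0.5)
```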
For tuning, the PC algorithm and MMHC require one tuning parameter α controlling the significance level of the independence tests, yet there is no practical way to tune it via a separate tuning set. In this simulation, the significance level is fixed at 0.05. This choice of α seems sensible, as the number of estimated edges is roughly the same as the number of edges in the true graph, as shown in the simulation. For ARGES, the first stage of neighborhood selection requires one tuning parameter corresponding to the lasso penalization; we use the function huge.path() in the R package huge to select the tuning range based on the data, and the function huge.fit() to select the tuning parameter. For the observational DAG method [27], τ is chosen from the set {0.1, 0.05, 0.01}, and the sparsity regularization parameter μ_1 is chosen so that the number of estimated edges roughly ranges from 0 to 100. For our methods, τ and μ_1 are selected similarly, and μ_2 = r × μ_1 with the ratio r selected from {1, 2, 4, 8}. For each method, the optimal tuning parameters are obtained by maximizing the predicted log-likelihood (4) on an independent tuning set of size 1000 over a set of pre-specified grid points.
In simulations, we examine a sparse neighborhood graph and a sparse graph with non-sparse neighborhoods in Examples 1 and 2, respectively. A sparse neighborhood requires each node to have sparse links, but a sparse graph does not necessarily have sparse neighborhoods. To further investigate the operating characteristics of the methods, we consider two additional situations in Examples 3 and 4, in which the true graphs satisfy the variance constraint while other settings resemble Examples 1 and 2, respectively. Finally, to compare the performances of the methods in non-sparse situations, Examples 5 and 6 are added by increasing the sampling probabilities when generating the edges. More details are given in the example settings.
Example 1 (Sparse neighborhood). A DAG with 50 nodes is generated by the random mechanism described in [11]. First, a partial ordering of the nodes is randomly generated. Second, we sample edges according to a binomial distribution with probability 0.02, where each sampled edge is assigned a weight of 0.5. The intervention matrix B is diagonal with diagonal values 0.5, describing a situation in which each node of the DAG is intervened by exactly one intervention covariate. Third, we set the error variances (σ_1², …, σ_p²)⊤ to be an equally spaced decreasing sequence from 1.5 to 1. Finally, we sample X from a p-dimensional normal distribution N(0, I), and Y is generated according to (2).
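The generation mechanism of Example 1 can be sketched as follows; the seed and the vectorization details are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 50

# Random partial ordering, then edges with probability 0.02 and weight 0.5,
# always pointing from earlier to later nodes so the graph is acyclic.
order = rng.permutation(p)
A = np.zeros((p, p))
for a in range(p):
    for b in range(a):
        if rng.random() < 0.02:
            A[order[a], order[b]] = 0.5  # edge order[b] -> order[a]

B = 0.5 * np.eye(p)                      # one intervention per node
sigma2 = np.linspace(1.5, 1.0, p)        # unequal error variances

n = 500
X = rng.normal(size=(n, p))              # X ~ N(0, I)
eps = rng.normal(size=(n, p)) * np.sqrt(sigma2)
Y = np.linalg.solve(np.eye(p) - A, (X @ B.T + eps).T).T  # model (2)
```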
Example 2 (Non-sparse neighborhood). This example modifies the previous one to generate a DAG of 50 nodes with a special "one-control-all" structure, in which all directed edges point from the first node to the other nodes with a connection strength of 0.5. Evidently, the neighborhood of the first node is not sparse, although the overall graph remains sparse. The intervention matrix B and the error variances are the same as in Example 1.
Example 3 (Sparse neighborhood with a causal model satisfying (3)). This example modifies Example 1: the intervention matrix is diagonal, and its squared diagonal entries form an equally spaced increasing sequence from 0.5 to 1, so that the variance constraint is satisfied, with true θ = 2 in (3). Other settings remain the same as in Example 1.
Example 4 (Non-sparse neighborhood with a causal model satisfying (3)). This example is modified from Example 2, in which the intervention matrix and θ = 2 are set in the same fashion as in Example 3.
Example 5 (Non-sparse case). This example is modified from Example 1, in which the sampling probability of the binomial distribution is increased to 0.1 when generating the edges; the rest is in the same fashion as in Example 1.
Example 6 (Non-sparse case with a causal model satisfying (3)). This example is modified from Example 3, in which the sampling probability of the binomial distribution is increased to 0.1 when generating the edges; the rest is in the same fashion as in Example 3.

Tables 1-4 summarize the results for Examples 1-4. In Examples 2 and 4, with a non-sparse neighborhood structure, the estimation becomes more challenging. In such situations, we added simulations with n = 1000; as shown in Tables 5-6, the proposed method with the variance constraint (3) outperforms all the competitors, with SHD very close to 0 in Example 2 and exactly 0 in Example 4. In comparison, PC and MMHC fail to return results within 24 hours, while GES and ARGES perform even worse than in the cases with n = 500. For large sample sizes, GES and ARGES tend to produce more false positives, with the estimated graph structure failing to determine the directions of the edges between the first node and the rest.

In conclusion, as demonstrated in Tables 1-6, intervention clearly has a positive impact on the reconstructability of a DAG through interventional covariates. With the variance constraint, the accuracy of reconstruction can be further enhanced.

Analysis of Alzheimer's disease dataset
This section applies the proposed method to the Alzheimer's disease dataset [25], in which 8560 gene expressions were collected for 176 Alzheimer's disease patients and 187 healthy participants. Our primary goal is to reconstruct a causal network of Alzheimer's disease-related genes with the help of intervention and to compare the DAG structures of the patient and control groups, in order to identify regulatory gene-gene interactions that differentiate the two groups.
Biologically, transcription factors (TFs) are proteins controlling the transcription process. By regulating their target genes, TFs ensure that these genes are expressed at appropriate levels in a cell and throughout the development of an organism. A TF binds to its target DNA sequence and can thus be mapped to specific genes. For our purpose, we use the database of [23] to extract a list of human DNA-binding TFs and map them to genes in the dataset, resulting in 1031 mapped TF genes. We then treat the TF genes as intervention covariates to facilitate reconstruction of the causal relations encoded by the gene networks.
To identify TF gene expressions associated with Alzheimer's disease in our dataset, we examine the KEGG database [12]. The Alzheimer's disease pathway there contains 168 genes, of which 99 are mapped to expressions in our dataset. Among the mapped genes, we perform two-sample t-tests between the patient and control groups, resulting in 43 genes selected at the 0.05 significance level.
After pre-processing, we apply the proposed method to both groups to reconstruct two DAG networks for the disease and control groups with 176 and 187 subjects, involving p = 43 genes as causal variables and W = 1031 TFs as intervention variables, where the tuning parameters of the method are selected by five-fold cross-validation. As shown in Figure 1, there are 29 and 33 estimated directed connections in the patient and control groups, respectively, with 11 directed connections in common. Moreover, the two networks share some common sub-structures; in particular, both contain directed connections from NDUFAB1 to ATP5A1, ATP5G3, and COX7A2. However, different sub-structures are also revealed. In the patient group, several genes, including SDHD, NDUFA1, NDUFAB1, ATP5G3, ATP5A1 and ATP5C1, have directed connections to SDHB, whereas in the control group only two of them are connected to SDHB. This suggests that the differences between the two DAG networks reflect the disparity in gene regulatory relations between an Alzheimer's disease subject and a healthy subject.
Biologically, our estimated directed connections match the known biological pathway of Alzheimer's disease in the KEGG database. For example, the estimated directed connections between NDU-genes, SDH-genes, ATP-genes and COX-genes match the pathway of the electron transport chain in mitochondria. Also, the estimated connection from CALM3 to PPP3CB matches the biological pathway from calcium-modulated protein (CaM) to protein phosphatase 3 catalytic subunit alpha (PP3CA).

Analysis of cytometry data
This section applies the proposed method to the flow cytometry data of [19] to reconstruct causal relations between phosphorylated proteins and phospholipids in human primary naive CD4+ T cells of the immune system. These cells were perturbed with molecular interventions in order to determine the ordering of connections in an intracellular signaling network. The original flow cytometry data contain n = 7466 cell measurements, each consisting of the amounts of p = 11 proteins and phospholipids. The data were collected from nine different experiments in which different components of the network are intervened, either by stimulatory cues or inhibitory interventions. For our analysis, we use the continuous version of the original data [19]. Our objective is to reconstruct this network, with the consensus network of [19] serving as a benchmark for discovery.
For the interventional data, we pre-process by selecting nine of the eleven components, four of which form a DAG with three directed links while the remaining five serve as intervention nodes; see Figure 2 for a display of the enlarged network with the nine nodes, known as the consensus network, where links between the five intervention nodes are purposely removed for the intervention purpose. Note that the removal of these links does not affect our analysis, because each node is independent of its non-descendants given its direct parents, by the local Markov property. Among the five intervention nodes, only three are informative, with the other two, Plcg and PIP2, having no effect on the four nodes to be intervened.
We fit the proposed method with and without interventions. For tuning, we use one-tenth of the samples for training and the rest for tuning. As displayed in Figures 3 and 4, the proposed method with intervention correctly identifies the directed link from Erk to Akt, while the observational method reverses its direction. Both methods reconstruct the directed link from Praf to Pmek, whereas both miss the link from Pmek to Erk. Consequently, the proposed method with intervention identifies more correct directed links than its counterpart without intervention. With respect to the intervention effects, our method identifies three intervention links in the network, ruling out the truly non-informative intervention nodes Plcg and PIP2. PIP3 is also non-informative in our estimation; a possible reason is that Akt is already intervened by PKA, so that the intervention effect of PIP3 on Akt is unnecessary. Similarly, the intervention from PKC to Praf is identified by the proposed method, while both PKA and PKC intervene on Praf in the consensus network. Finally, the proposed method yields a sparse intervention pattern.
One plausible explanation for missing the link from Pmek to Erk is that the linear causal model fails to capture nonlinear functional relations among the cytometry measurements of the proteins Praf, Pmek, Erk, and Akt, as evidenced by the nonlinear patterns in the residual plots of the structural equation model in Figure 5. Another possibility is that the five nodes at the bottom left of Figure 3 were not originally designed as intervention nodes in the experiments.

Discussion
In this paper, a constrained maximum likelihood method is proposed to reconstruct the structure of a DAG from interventional data, and efficient ADMM algorithms are developed to solve the resulting optimization problems. In particular, a novel variance constraint is introduced that leverages the information in the interventional data to improve identifiability. Theoretically, we show that with this variance constraint the graphical structure is fully identifiable under mild assumptions. The theoretical results are corroborated by our simulations, in which the proposed method performs well against competitors and reconstructs the DAG more accurately than the observational method.
For convex relaxation of the nonconvex constraints (7) and (8) in (9), we employ difference-of-convex (DC) programming, similar to [27]. In particular, we decompose J_τ into a difference of two convex functions, J_τ(z) = S_1(z) − S_2(z), with S_1(z) = |z|/τ and S_2(z) = max(|z|/τ − 1, 0), from which the DC weights w_jl at each iterate are obtained. In the corresponding quadratic subproblem, let H_{j,j−} denote the jth row of H with H_jj excluded, Z_{j−} the matrix Z with its jth column removed, and z_j the jth column of Z. The minimizer solves the linear system

(Z_{j−}⊤ Z_{j−} + ρI) H_{j,j−}⊤ = Z_{j−}⊤ z_j + ρ (F_{j,j−}^{(s)} − U_{j,j−}^{(s)})⊤,

where the factorization of Z_{j−}⊤ Z_{j−} + ρI can be cached to speed up subsequent updates.

A.2.3. F-step
The F-step updates two parts of the matrix: the adjacency part and the intervention part.
Part I: adjacency matrix. For i = 1, …, p and j = 1, …, p, we solve a weighted lasso-type subproblem for each entry; when w_ij^{(m − 1)} = 1, the update is given by the soft-thresholding operator applied elementwise.
Part II: intervention matrix. For i = 1, …, p and j = p + 1, …, p + W, we solve the analogous subproblem, and F_ij is again updated elementwise.
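For reference, a minimal implementation of the soft-thresholding operator used in these elementwise updates (our sketch; the specific thresholds and weights of the actual F-step are omitted):

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0),
    the elementwise proximal map of the l1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

print(soft_threshold(np.array([-2.0, -0.3, 0.0, 0.4, 1.5]), 0.5))
# -> [-1.5  0.   0.   0.   1. ]
```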

A.3.1. R-step
Let T = (Φ, Ψ) denote the concatenation of Φ and Ψ, and let Z = (Y, X) be the concatenated data matrix. For each j, we solve the minimization over r_j of an objective of the form −n log r_j plus a quadratic term in the residuals.
Define Ω_{E_τ} = (I − A_{E_τ})⊤ (I − A_{E_τ})/θ_{E_τ} for any E_τ ⊂ {(i, j) : 1 ≤ i ≠ j ≤ p}. We can then partition the probability of interest into two terms, where P* is the outer measure.
For term I, we apply Theorem 1 of [26] to bound each term in the sum, verifying the entropy condition (3.1) there for the bracketing entropy over B_kj. Let Π denote the covariance matrix of the joint distribution of (Y, X) and Γ = Π^{−1}; then Ω is the upper p×p diagonal block of Γ.
Let ℱ_kj = {f^{1/2}(Γ, ·, ·) : Ω ∈ B_kj} be the class of square-root densities. Define Δ = Γ̃ − Γ, let λ_1, …, λ_{p+W} be the eigenvalues of ΠΔΠ, and write z = (y⊤, x⊤)⊤. Then max_{1≤i≤p+W} λ_i ≤ λ_max(Δ) × λ_max(Π) ≤ c(p + W)‖Π‖_max, following Prob. III.6.14 of [1]. By Lemma 6.5 of [28], the desired entropy bound can be shown, where λ_max denotes the largest eigenvalue and D(z) = λ_max(Γ) tr(zz⊤) ≤ M_4 × tr(zz⊤).

Figure 2. Consensus intracellular signaling network as benchmark. Four nodes in brown are intervened by five intervention nodes marked in blue, with directed links indicating their intervention directions by solid red lines. Note that no links are present between intervention nodes.

Figure 4. Intracellular signaling network estimated by the model without intervention. One directed connection is correctly identified (solid black line), another connection is reversed (dotted black line), and the directed connection from Pmek to Erk is missed (dashed black line).

Figure 5. Residual plots from our intervention model. Each subplot corresponds to one node in the DAG network.

Table 1. Averaged false discovery rate (FDR), false negative rate (FNR), and structural Hamming distance (SHD), as well as their standard errors, for the competing methods, based on 10 simulation replications in Example 1. Here "Our", "Int", "Obs", "PC", "GES", "MMHC", "ARGES" denote the proposed method subject to the variance constraint, the proposed method without the variance constraint, the proposed method without intervention, the PC method, the GES method, the MMHC method, and the ARGES method, respectively. The best performers are in bold.

Table 2. Averaged FDR, FNR, and SHD, as well as their standard errors, for the competing methods, based on 10 simulation replications in Example 2.
Notation as in Table 1; 'N/A' means the method did not return a result within 24 hours. The best performers are in bold.

Table 3. Averaged FDR, FNR, and SHD, as well as their standard errors, for the competing methods, based on 10 simulation replications in Example 3.

(n,p) Method FDR(A) FNR(A) SHD
Notation as in Table 1; 'NaN' in the FDR column means that the ARGES method made no discoveries in any replication. The best performers are in bold.

Table 4. Averaged FDR, FNR, and SHD, as well as their standard errors, for the competing methods, based on 10 simulation replications in Example 4. Notation as in Table 1; 'N/A' means the method did not return a result within 24 hours, and 'NaN' in the FDR column means that the ARGES method made no discoveries in any replication. The best performers are in bold.

Table 5. Averaged FDR, FNR, and SHD, as well as their standard errors, for the competing methods, based on 10 simulation replications in Example 2 with n = 1000.
Notation as in Table 1; 'N/A' means the method did not return a result within 24 hours. The best performers are in bold.