Learning Bayesian Networks from Incomplete Data with the Node-Average Likelihood

Bayesian network (BN) structure learning from complete data has been extensively studied in the literature. However, fewer theoretical results are available for incomplete data, and most are related to the Expectation-Maximisation (EM) algorithm. Balov (2013) proposed an alternative approach called Node-Average Likelihood (NAL) that is competitive with EM but computationally more efficient; and he proved its consistency and model identifiability for discrete BNs. In this paper, we give general sufficient conditions for the consistency of NAL; and we prove consistency and identifiability for conditional Gaussian BNs, which include discrete and Gaussian BNs as special cases. Furthermore, we confirm our results and the results in Balov (2013) with an independent simulation study. Hence we show that NAL has a much wider applicability than originally implied in Balov (2013), and that it is competitive with EM for conditional Gaussian BNs as well.


Background and Notation
Bayesian Networks [BNs; 2, 3] are a class of graphical models in which the nodes of a directed acyclic graph (DAG) G represent a set X = {X_1, . . . , X_N} of random variables describing some quantities of interest. The arcs connecting these nodes express direct dependence relationships, with graphical separation in G (called d-separation) implying conditional independence in probability. Formally, a BN is then defined as an independence map that maps graphical separation to conditional independence. As a result, G induces the factorisation

P(X | G, Θ) = ∏_{i=1}^{N} P(X_i | Pa(X_i), Θ_{X_i}),    (1)

in which the joint distribution of X decomposes into one local distribution for each X_i (with parameters Θ_{X_i}, ∪_{X_i ∈ X} Θ_{X_i} = Θ) conditional on its parents Pa(X_i). Thus BNs provide a compact and modular representation of both low- and high-dimensional problems. Multiple DAGs can represent the same set of independence relationships, and can be grouped into equivalence classes [4] whose elements are probabilistically indistinguishable without additional information. All DAGs in the same equivalence class share the same underlying undirected graph and v-structures (patterns of arcs like X_i → X_j ← X_k, with no arcs between X_i and X_k); it is easy to see that the two other possible patterns of three nodes and two arcs result in equivalent factorisations. Hence each equivalence class is represented by the completed partially-directed acyclic graph (CPDAG) in which only the arcs belonging to v-structures, and those whose reversal would introduce additional v-structures or cycles, are directed.
As for the probability distribution of X, the literature has mostly focused on three sets of assumptions for analytical and computational reasons. Discrete BNs [5] assume that X and the X_i are multinomial random variables. Local distributions take the form

X_i | Pa(X_i) ∼ Mul(π_{ik|j}),   π_{ik|j} = P(X_i = k | Pa(X_i) = j);    (2)

their parameters π_{ik|j} are the conditional probabilities of X_i given each configuration of the values of its parents, usually represented as a conditional probability table for each X_i. Gaussian BNs [GBNs; 6] model X with a multivariate normal random variable and assume that the X_i are univariate normals linked by linear dependencies. The parameters of the local distributions can be equivalently written [7] as the partial correlations ρ_{X_i, X_j | Pa(X_i) \ X_j} between X_i and each parent X_j given the other parents; or as the coefficients β_{X_i} of the linear regression model

X_i | Pa(X_i) ∼ N(µ_{X_i} + Pa(X_i) β_{X_i}, σ²_{X_i}).    (3)

Finally, conditional linear Gaussian BNs [CGBNs; 8] combine discrete and continuous random variables in a mixture-of-Gaussians model as follows:
• Discrete X_i are only allowed to have discrete parents (denoted ∆_{X_i}), and are assumed to follow a multinomial distribution as in (2).
• Continuous X_i are allowed to have both discrete and continuous parents (denoted Γ_{X_i}, ∆_{X_i} ∪ Γ_{X_i} = Pa(X_i)), and their local distributions are

X_i | ∆_{X_i} = j, Γ_{X_i} ∼ N(µ_{ij} + Γ_{X_i} β_{ij}, σ²_{ij}),    (4)

which defines a mixture of linear regressions against the continuous parents Γ_{X_i} with one component for each configuration j of the discrete parents ∆_{X_i}. Hence Θ_{X_i} = {µ_{ij}, β_{ij}, σ²_{ij}}. If X_i has no discrete parents (∆_{X_i} = ∅), the mixture in (4) simplifies to a single linear regression like that in (3).
We denote discrete nodes with ∆ and Gaussian nodes with Γ, with ∆ ∪ Γ = X. If ∆ = ∅ or Γ = ∅, a CGBN reduces to a GBN or to a discrete BN, respectively. For this reason, we will derive our results only for CGBNs, but they will hold for both discrete BNs and GBNs as particular cases.
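As an illustration of (2)-(4), a conditional Gaussian local distribution can be evaluated and sampled as below. This is a minimal sketch: the single discrete parent, the single continuous parent and all parameter values are invented for the example.

```python
import math
import random

# Hypothetical CGBN fragment: discrete parent D (values 0/1), continuous
# parent G, continuous child X. For each configuration j of D, X follows
# a linear regression on G as in (4):
#     X | D = j, G = g  ~  N(mu_j + beta_j * g, sigma2_j).
params = {
    0: {"mu": 1.0, "beta": 0.5, "sigma2": 1.0},
    1: {"mu": -2.0, "beta": 1.5, "sigma2": 0.25},
}

def log_density(x, d, g):
    """Log-density of the conditional Gaussian local distribution."""
    p = params[d]
    mean = p["mu"] + p["beta"] * g
    return -0.5 * (math.log(2 * math.pi * p["sigma2"])
                   + (x - mean) ** 2 / p["sigma2"])

def sample(d, g, rng=random):
    """Draw X given its discrete and continuous parent values."""
    p = params[d]
    return rng.gauss(p["mu"] + p["beta"] * g, math.sqrt(p["sigma2"]))

# The conditional density peaks at the regression mean mu_j + beta_j * g.
print(log_density(1.5, 0, 1.0))
```

The mixture structure is what distinguishes a CGBN node from a plain GBN node: with no discrete parent, `params` would contain a single component, recovering (3).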
Other distributional assumptions have seen less widespread adoption for various reasons: for instance, copulas [9] and truncated exponentials [10] lack exact conditional inference and simple closed-form estimators.
The task of learning a BN from a data set D is performed in two steps. Structure learning consists in finding the DAG G that encodes the dependence structure of the data in the space G of all possible DAGs, thus maximising P(G | D) or some alternative goodness-of-fit measure. Parameter learning consists in estimating the parameters Θ given the G obtained from structure learning, that is, argmax_Θ P(Θ | G, D). If we assume that parameters in different local distributions are independent and that the data contain no missing values, the Θ_{X_i} can be learned separately for each node because (1) implies

P(Θ | G, D) = ∏_{i=1}^{N} P(Θ_{X_i} | Pa(X_i), D).

Assuming that G is sparse, each X_i will have a small number of parents and therefore P(Θ_{X_i} | Pa(X_i), D) will be a low-dimensional distribution that can be estimated efficiently from the data. On the other hand, structure learning is NP-hard [11]. Many algorithms have been proposed for this problem, following one of three possible approaches: constraint-based, score-based and hybrid. Constraint-based algorithms are based on the seminal work of Pearl on causal graphical models [12]: they use conditional independence tests to assess which conditional independence relationships are supported by the data, and they assume that G is a perfect map of the distribution of X in order to identify which arcs to include in G from the results of those tests. The most popular algorithm in this class is arguably the PC algorithm [13], which was originally introduced in [14] in the context of causal discovery. Score-based algorithms represent the application of general-purpose optimisation techniques to BN structure learning: each candidate DAG is assigned a network score reflecting its goodness of fit, which the algorithm then attempts to maximise. Some examples are heuristics such as greedy search, simulated annealing [15] and genetic algorithms [16]; they work well in practical applications but do not come with any formal guarantees about the optimality of the networks they learn.
Other algorithms [17,18], such as K2, seek to speed up structure learning by first obtaining a topological ordering of the nodes and then learning the optimal DAG conditional on it. Exact maximisation algorithms have also become feasible in recent years thanks to increasingly efficient pruning of the space of DAGs and tight bounds on the scores [19,20]. Finally, hybrid algorithms combine the previous two approaches, using conditional independence tests to reduce the number of candidate DAGs to score while searching for the DAG maximising some network score. The best-known member of this family is the Max-Min Hill Climbing algorithm by [21]. A comprehensive review and comparison of these three approaches for complete data can be found in [22]; in the following, we will only consider score-based structure learning.
Both parameter and structure learning are mathematically and computationally harder problems when learning BNs from incomplete data, because the decomposition in (1) no longer holds. A common approach is to use the Expectation-Maximisation algorithm [EM; 23] for both parameter learning and structure learning, so as to reuse methods originally proposed for complete data. However, this choice comes at a significant computational cost. This is particularly the case for structure learning, where complete-data algorithms are embedded within EM in the Structural EM [24,25] algorithm. To address this issue, Balov [1] proposed an alternative approach based on the Node-Average Likelihood (NAL) that is competitive with EM-based approaches in terms of structural accuracy but has a much lower computational cost. He then proved both the identifiability and the consistency of NAL for discrete BNs. Both approaches, as well as others, are reviewed in [26], which provides a broad overview of parameter and structure learning from incomplete data.
In this paper we establish general conditions for both properties of NAL and we show that they hold for CGBNs. In Section 2 we briefly review score-based learning from complete data, moving to incomplete data in Section 3. Our novel results and the required regularity conditions are introduced in Section 4 for both identifiability (Section 4.1) and consistency (Section 4.2). Proofs for all but the main consistency result are included in Appendix A. Finally, in Section 5 we confirm the results on discrete BNs obtained by Balov [1] with a simulation study, and that similar results hold for CGBNs.

Structure Learning from Complete Data
Score-based BN structure learning consists of two components: a score function S(G | D) and an algorithm that determines how we explore the space of the DAGs. Each candidate DAG is assigned a score S(G | D) reflecting its goodness of fit, which the algorithm then attempts to maximise to obtain a (possibly local) optimum DAG as argmax_{G ∈ G} S(G | D). Heuristic algorithms such as tabu search [27] are more common in practical applications, but exact algorithms have been proposed as well [19,20], although their use is limited to small and medium problems by their computational complexity. As for the score function, using Bayes' theorem we have that

P(G | D) ∝ P(G) P(D | G),

where P(G) is the prior distribution over the space of the DAGs spanning the variables in X and P(D | G) is the marginal likelihood of the data given G, averaged over all possible parameter sets Θ. The most common choice in the literature for P(G) is the uniform distribution. The space of the DAGs grows super-exponentially in N [28], which makes it cumbersome to specify granular priors that assign different prior probabilities to individual DAGs. However, informative priors that assign probabilities to groups of DAGs based on their structural properties have been proposed in the literature, including the sparsity-inducing priors presented in [29] and [30] and the informative priors from [31] and [32].
From (1), the marginal likelihood decomposes into one term for each node,

P(D | G) = ∏_{i=1}^{N} ∫ P(X_i | Pa(X_i), Θ_{X_i}) P(Θ_{X_i}) dΘ_{X_i},    (6)

which makes it possible to use it as a network score (denoted S_ML(G | D)) in a computationally efficient way: the fact that S_ML(G | D) is decomposable speeds up structure learning considerably, both because (6) is available in closed form for discrete BNs, GBNs and CGBNs [8] and because we only recompute differing portions of P(D | G) as we score and compare DAGs. It is also consistent for complete data. Due to the difficulty of choosing a prior over Θ and the resulting performance implications [33], a common alternative is the Bayesian Information Criterion

S_BIC(G | D) = ∑_{i=1}^{N} [ ℓ̄(X_i | Pa(X_i), Θ̂_{X_i} | D) − (log n)/(2n) |Θ_{X_i}| ],

where

ℓ̄(X_i | Pa(X_i), Θ̂_{X_i} | D) = (1/n) ∑_{d ∈ D} log P(X_i = d_i | Pa(X_i) = d_{Pa(X_i)}, Θ̂_{X_i})    (7)

is the average log-likelihood of node X_i, n = |D| is the sample size and the Θ̂_{X_i} are the maximum likelihood estimates (MLEs) of the Θ_{X_i} for D. In contrast, we will denote by ℓ̄(G, Θ) and ℓ̄(X_i | Pa(X_i), Θ_{X_i}) the population log-likelihoods. S_BIC(G | D) is a particular case of the penalised log-likelihood

S_PL(G | D) = ℓ̄(G, Θ̂ | D) − λ_n h(G),

where λ_n is a penalisation coefficient and h : G → N measures model complexity. It is related to the ℓ0-penalised maximum likelihood studied in [35] for GBN structure learning, which has the same general form but does not make any distributional assumption. S_BIC(G | D) is decomposable, since h(G) = |Θ| = ∑_{X_i ∈ X} |Θ_{X_i}|, the total number of parameters of the local distributions; and it does not depend on any hyperparameter. Furthermore, it is equivalent to the minimum description length [36] of G and it is asymptotically equivalent to S_ML(G | D); hence, it is consistent for complete data. Setting λ_n = 1/n instead of λ_n = log(n)/(2n) gives the Akaike Information Criterion [AIC; 37] which, on the other hand, is not consistent for complete data [38]. An important property shared by AIC and BIC is score equivalence [4]: they take the same value for all DAGs in the same equivalence class. The same is true for S_ML(G | D) for particular choices of the prior over Θ, as shown in [8].
This acknowledges that those DAGs are probabilistically indistinguishable from the data, and thus cannot be ranked by the score function. We can, however, still express a preference for some DAGs over others in the same equivalence class by means of informative priors P(G) over the space of DAGs.
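The effect of the two penalisation coefficients, λ_n = log(n)/(2n) for BIC and λ_n = 1/n for AIC, can be sketched numerically on the average log-likelihood of a single Gaussian node. The data and the two-parameter node below are purely illustrative.

```python
import math

def avg_loglik_gaussian(xs):
    """Average log-likelihood of a univariate Gaussian node at its MLEs:
    at (mu_hat, sigma2_hat) this equals -0.5 * (log(2*pi*sigma2_hat) + 1)."""
    n = len(xs)
    mu = sum(xs) / n
    sigma2 = sum((x - mu) ** 2 for x in xs) / n
    return -0.5 * (math.log(2 * math.pi * sigma2) + 1.0)

def penalised_score(xs, n_params, lam):
    # S_PL = average log-likelihood - lambda_n * h(G), with h(G) = |Theta|.
    return avg_loglik_gaussian(xs) - lam * n_params

xs = [0.3, -1.2, 0.8, 1.9, -0.4, 0.1, 2.2, -0.9]  # invented sample
n = len(xs)
bic = penalised_score(xs, 2, math.log(n) / (2 * n))  # lambda_n = log(n)/2n
aic = penalised_score(xs, 2, 1 / n)                  # lambda_n = 1/n
print(bic, aic)
```

For n ≥ 8 the BIC penalty is the larger of the two, which is why BIC favours sparser DAGs than AIC as the sample size grows.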

Structure Learning from Incomplete Data
In the context of BNs, incomplete data are modelled using auxiliary nodes Z_i ∈ Z that encode whether the corresponding X_i is observed for each observation (say, Z_i = 1 if X_i is observed and Z_i = 0 if not). Rubin and Little [39,40] formalised three possible patterns (or mechanisms) of missingness:
• Missing completely at random (MCAR): complete samples are indistinguishable from incomplete ones. In other words, the probability that a value will be missing is independent of both observed and missing values.
• Missing at random (MAR): cases with incomplete data differ from cases with complete data, but the pattern of missingness is predictable from other observed variables. In other words, the probability that a value will be missing is a function of the observed values.
• Missing not at random (MNAR): the pattern of missingness is not random or it is not predictable from other observed variables; the probability that an entry will be missing depends on both observed and missing values. Common examples are variables that are missing systematically or for which the patterns of missingness depend on the missing values themselves.
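For intuition, the first two mechanisms can be simulated as below. This is a sketch under invented assumptions: the missingness probability `beta` and the MAR rule (missingness of Y depends on the observed X) are arbitrary choices, not taken from the paper.

```python
import random

def make_missing(rows, mechanism, beta=0.2, rng=random):
    """Introduce missing values (None) for Y in a list of {"X": .., "Y": ..} rows.

    MCAR: Y is missing with fixed probability beta, independently of the data.
    MAR:  Y is missing with a probability that depends only on the observed X.
    """
    out = []
    for r in rows:
        r = dict(r)
        if mechanism == "MCAR":
            p = beta
        elif mechanism == "MAR":
            p = 0.4 if r["X"] > 0 else 0.05  # predictable from observed X
        else:
            raise ValueError(mechanism)
        if rng.random() < p:
            r["Y"] = None
        out.append(r)
    return out

rng = random.Random(42)
data = [{"X": rng.gauss(0, 1), "Y": rng.gauss(0, 1)} for _ in range(1000)]
mcar = make_missing(data, "MCAR", rng=rng)
print(sum(r["Y"] is None for r in mcar) / len(mcar))  # roughly beta
```

An MNAR mechanism would instead make `p` depend on the (unobserved) value of Y itself, which is why it cannot be reproduced from the observed data alone.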
These patterns of missingness can be modelled graphically: MCAR implies Z ⊥⊥ X; MAR implies that Z is independent of the incomplete observations conditional on the complete observations; and MNAR does not imply any independence constraint. Note, however, that in the following we will still consider G to be spanning just X, because we will restrict ourselves to the MCAR case in which the dependencies between the X and the Z are completely determined. When the data are incomplete, scoring G requires the missing values to be integrated out. We can make this apparent by rewriting P(D | G) as a function of D_O (the observed data) and D_M (the missing data):

P(D | G) = ∫ P(D_O, D_M | G, Θ) P(Θ | G) dΘ.

Averaging over all the possible configurations of D_M as well leads to

P(D_O | G) = ∫ ∫ P(D_O, D_M | G, Θ) P(Θ | G) dΘ dD_M,

which contains one extra dimension for each missing value (in addition to one dimension for each parameter in Θ), making it infeasible to compute in practical applications. An additional problem is that, while P(D | G, Θ) decomposes as in (6), P(D_O | G, Θ) does not. The standard workaround is EM. In the E-step, we compute the expected sufficient statistics conditional on the observed data using belief propagation [2,41,42]. In the M-step, complete-data learning methods can be applied using the expected sufficient statistics instead of the (unobservable) empirical ones.
There are two ways to apply EM to structure learning. Firstly, we can apply EM separately to each candidate DAG to be scored, as in the variational Bayes EM [43]. However, structure learning often involves many evaluations of the score function, thus making this approach computationally infeasible beyond small-scale problems. Secondly, we can embed structure learning in the M-step as follows: • in the E-step, we complete the data by computing the expected sufficient statistics using the current network structure; • in the M-step, we find the structure that maximises the expected score function for the completed data.
This approach is called Structural EM, and has been implemented using both S_BIC(G | D) [24] and S_ML(G | D) [25]. While faster than variational Bayes EM, Structural EM is still computationally expensive. [24] notes: "Most of the running time during the execution of our procedure is spent in the computations of expected statistics. This is where our procedure differs from parametric EM. In parametric EM, we know in advance which expected statistics are required. [. . . ] In our procedure, we cannot determine in advance which statistics will be required. Thus, we have to handle each query separately. Moreover, when we consider different structures, we are bound to evaluate a larger number of queries than parametric EM."

Balov [1] proposed a more scalable approach for discrete BNs under MCAR called Node-Average Likelihood (NAL). While Balov [1] defined NAL relying on the specific form of the multinomial log-likelihood, we will present it here using a more general notation that allows its extension to CGBNs. Starting from (1), he proposed to compute each term using the locally-complete data D^(i) ⊆ D, that is, the observations in which X_i and all of Pa(X_i) are observed:

ℓ̄(X_i | Pa(X_i), Θ_{X_i} | D) = (1/|D^(i)|) ∑_{d ∈ D^(i)} log P(X_i = d_i | Pa(X_i) = d_{Pa(X_i)}, Θ_{X_i}),

which is an empirical estimate of the average node log-likelihood E[ℓ(X_i | Pa(X_i))]. Replacing (7) with the above gives

ℓ̄(G, Θ | D) = ∑_{i=1}^{N} ℓ̄(X_i | Pa(X_i), Θ_{X_i} | D),

which Balov [1] used to redefine the penalised log-likelihood score function as

S_PL(G | D) = ℓ̄(G, Θ̂ | D) − λ_n h(G)    (11)

and structure learning as Ĝ = argmax_{G ∈ G} S_PL(G | D). NAL makes more efficient use of incomplete data than discarding all incomplete samples; and it does not incur the computational costs of EM approaches, since it can be plugged into complete-data structure learning approaches. However, when comparing two DAGs, NAL is evaluated on potentially different subsets of D for each X_i in different DAGs; hence the usual results on MLEs and nested models do not apply. Balov [1] proved both identifiability and consistency of score-based structure learning when using S_PL(G | D) for discrete BNs. We will now prove that both properties hold more generally, and in particular that they hold for CGBNs.

Properties of Node-Average Likelihood
We study two properties of NAL: whether the true DAG G_0 is identifiable, and under which conditions Ĝ is a consistent estimator of G_0. For each of our results, we reference the corresponding theorems that Balov [1] derived for discrete BNs. Our contribution is the generalisation of these results to CGBNs under mild regularity conditions, which requires completely different proofs since Balov [1] relied materially on the form of the multinomial log-likelihood and on not having both continuous and discrete parents in each Pa(X_i).

Identifiability
In this section we show under what conditions NAL can be used to identify the true G 0 of the BN from data in the limit of the sample size. Firstly, we prove that NAL is non-decreasing in the size of the parent sets and thus overfits like the log-likelihood does for complete data. This motivates the use of S PL (G | D) to avoid favouring overly complex models.
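The monotonicity just described can be checked numerically: for a Gaussian node, refitting with an extra (even irrelevant) parent never decreases the maximised average log-likelihood, because the larger model nests the smaller one. A self-contained sketch using ordinary least squares; the data-generating model and the sample size are arbitrary choices.

```python
import math
import random

rng = random.Random(1)
n = 500
x = [rng.gauss(0, 1) for _ in range(n)]       # true parent of Y
z = [rng.gauss(0, 1) for _ in range(n)]       # irrelevant candidate parent
y = [2.0 * xi + rng.gauss(0, 1) for xi in x]  # child: Y = 2X + noise

def resid_var_ols(y, cols):
    """Residual variance of the least-squares fit of y on the given columns
    (plus an intercept), via the normal equations and Gauss-Jordan."""
    cols = [[1.0] * len(y)] + cols
    k = len(cols)
    a = [[sum(cols[i][t] * cols[j][t] for t in range(len(y))) for j in range(k)]
         + [sum(cols[i][t] * y[t] for t in range(len(y)))] for i in range(k)]
    for i in range(k):
        piv = a[i][i]
        a[i] = [v / piv for v in a[i]]
        for j in range(k):
            if j != i:
                a[j] = [vj - a[j][i] * vi for vi, vj in zip(a[i], a[j])]
    beta = [row[-1] for row in a]
    fitted = [sum(b * c[t] for b, c in zip(beta, cols)) for t in range(len(y))]
    return sum((yt - ft) ** 2 for yt, ft in zip(y, fitted)) / len(y)

def gaussian_avg_loglik(resid_var):
    # Maximised average log-likelihood of a Gaussian linear regression.
    return -0.5 * (math.log(2 * math.pi * resid_var) + 1.0)

ll_small = gaussian_avg_loglik(resid_var_ols(y, [x]))
ll_big = gaussian_avg_loglik(resid_var_ols(y, [x, z]))
print(ll_big >= ll_small)
```

The fit with the spurious parent `z` is never worse, even though `z` carries no information about Y; without the penalty term, the score would therefore always favour the denser DAG.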
Lemma 1 allows us to state that if G_0 is identifiable, we can learn it by finding the simplest DAG that maximises NAL. We characterise this fact below, in a form more appropriate for BN structure learning, following Balov [1].
We next establish under which conditions G_0 is guaranteed to be identifiable: under MCAR, the population NAL ℓ̄(G, Θ) attains its maximum at all G ⊇ G_0, and all these DAGs share the same maximal value, as stated in Proposition 1 (P3.1 in [1]). The identifiability of G_0 up to its equivalence class [G_0] follows from the above and is formally stated below.

Consistency
In this section we show that the candidate Ĝ chosen by maximising S_PL(G | D) is a consistent estimator of the true G_0 under MCAR, and state the mild regularity conditions this requires.

Regularity Conditions
For all G ∈ G and X_i ∈ X:
(R1) the MLE Θ̂_{X_i} must exist, converge in probability to the population Θ_{X_i} with speed O(n^{-1/2}), and maximise ℓ̄(X_i | Pa(X_i), · | D).
A more general approach would be to model the X_i ∈ Γ with a mixture of generalised linear models [45]. When using canonical link functions, (R1) and (R2) simplify due to the results in [46]. However, verifying (R1), (R2) and (R3) is non-trivial, hence we leave this as a potentially tractable case for future work.

Consistency Results
Starting from the consistency of ℓ̄(G, Θ̂ | D), we will now establish for which sequences of λ_n the S_PL(G | D) in (11) is consistent for complete data and under MCAR, that is, lim_{n→∞} P(Ĝ = G_0) = 1. We note the sufficient conditions (C1), (C2) and (C3) for the consistency of Ĝ from Balov [1, P3.2, C4.1], which involve the convergence of the sample NAL to its population counterpart, the rate at which λ_n → 0, and the probability of observing each variable, P(Z_i = 1) ∈ (0, 1). These conditions can be trivially extended to equivalence classes as in Balov [1, C3.3].
To prove consistency, we first establish some intermediate results. Unlike Balov [1], we need a rigorous treatment of scoring DAGs that may represent misspecified models [47], that is, models that are not representable in terms of G_0; an example would be an X_i ∈ Γ that is a mixture of regressions in G but a single linear regression in G_0. Lemma 2 shows that if the difference in NAL between two DAGs is O(n^{-α}), then the less complex DAG is chosen asymptotically if λ_n → 0 more slowly than n^{-α}.
Lemma 3 establishes that the difference between the NAL at the MLE Θ̂_{X_i} and the NAL at the population Θ_{X_i} is O(n^{-1}), which is relevant due to Lemma 2. Working with the NAL at the true parameters allows us to exploit conditional independencies and turns the NAL into a sum of i.i.d. random variables, allowing the Lindeberg-Lévy and the Lindeberg-Feller [44, p. 369] CLTs to be applied.
Lemma 3. If (R1) and (R2) hold, then for all G and X_i, ℓ̄(X_i | Pa(X_i), Θ̂_{X_i} | D) = ℓ̄(X_i | Pa(X_i), Θ_{X_i} | D) + O(n^{-1}).

Lemma 4 establishes that if G ⊇ G_0 the local distributions for G reduce to those of G_0, and that population and sample distributions coincide. These results are crucial in linking ℓ̄(G, Θ̂ | D) and ℓ̄(G, Θ) in Lemma 5.

Lemma 4. If G ⊇ G_0, then for each X_i ∈ X:
(i) P(X_i | Pa_G(X_i), Θ_{X_i}) = P(X_i | Pa_{G_0}(X_i), Θ_{X_i});
(ii) P(X_i | Pa_G(X_i), Θ̂_{X_i}, D) = P(X_i | Pa_{G_0}(X_i), Θ̂_{X_i}, D);
(iii) P(X_i | Pa_G(X_i), Θ̂_{X_i}, D) = P(X_i | Pa_G(X_i), Θ̂_{X_i}, Z^(i)).
Lemma 5 links the sample NAL ℓ̄(G, Θ̂ | D) and the population NAL ℓ̄(G, Θ) and establishes the convergence rate of the former. It will be used to show (C1) in Theorem 1 by way of Proposition 1. Theorem 1 is the key result of this section, showing that BIC is consistent for complete data but not under MCAR; AIC is not consistent in either case.
We show that under the conditions in parts (i) and (ii), S_PL(G_0 | D) asymptotically dominates S_PL(G | D) for all G ≠ G_0; this means that (C2) is satisfied and Ĝ is consistent. We then show that under the assumptions in part (iii), there exists a G ⊃ G_0 such that lim_{n→∞} P(S_PL(G | D) > S_PL(G_0 | D)) > 0, which implies the inconsistency of Ĝ. Note that if P(Z = 1) = 1 as in part (i), then X ⊥⊥ Z; therefore, results derived under MCAR hold for all three parts. Let G ⊇ G_0. In the sample NAL, we first apply Lemma 3 to replace the MLEs with their population values, and then use Lemma 4(ii) to eliminate the redundant parents, obtaining ℓ̄(X_i | Pa_{G_0}(X_i), Θ_{X_i} | D) + O(n^{-1}). The difference between this expression and the population NAL ℓ̄(X_i | Pa_{G_0}(X_i), Θ_{X_i}) depends only on D^(i). We denote it as d(D^(i)) + O(n^{-1}) in the following; the contrast between part (i) and parts (ii), (iii) follows from its behaviour for complete and MCAR data.

Hence (C2) is satisfied and part (i) follows.
Part (ii): Consider G ⊃ G_0, so that Pa_G(X_i) ⊃ Pa_{G_0}(X_i) for at least one X_i. Applying Lemma 2 as in part (i), we obtain that if lim_{n→∞} √n λ_n = ∞, then lim sup_{n→∞} P(S_PL(G_1 | D) > S_PL(G_2 | D)) = 1.
Hence (C2) is satisfied and part (ii) follows. Part (iii): By Condition (C3), there exist a G ∈ G and an X_i such that Pa_G(X_j) = Pa_{G_0}(X_j) for all j ≠ i and Pa_G(X_i) ⊃ Pa_{G_0}(X_i). As in (13), the left-hand side converges to N(0, γν²). Furthermore, the sequence √n λ_n is bounded, so we can apply the Bolzano-Weierstrass theorem [48]: there must be a subsequence {√(n_k) λ_{n_k}} such that lim_{k→∞} √(n_k) λ_{n_k} = λ_0 < ∞. Combining this with (14) and the asymptotic normality above, the limiting probability of preferring G over G_0 is strictly positive and can be expressed through Φ, the standard normal CDF. This proves part (iii).
Finally, the following corollary justifies the use of NAL in practical BN structure learning (including for CGBNs).

Proof (Corollary 2). From Corollary 1, we have identifiability if X is MCAR; [1] argues that (C3) holds for almost all distributions of Z if the set of independence relationships implied by G_0 is non-empty. The result then follows from the proof of Theorem 1, with G_0 replaced by any G* ∈ [G_0].

Computational Experiments
Finally, we assess the structural accuracy and the computational efficiency of NAL with a simulation study: our aims are to confirm the results presented in Section 4 and to independently validate those originally presented in Balov [1]. Firstly, we confirm consistency using exact learning given a known topological ordering of the nodes and a fixed maximum number of parents in Section 5.1. Secondly, we learn DAGs using tabu search and without conditioning on a topological ordering in Section 5.2. Besides the accuracy of the learned DAGs, we explore how different choices of λ n impact the speed of structure learning. Thirdly, we investigate the claim by Balov [1] that NAL is an efficient alternative to Structural EM in Section 5.3.
For these purposes, we use three reference BNs from the Bayesian network repository [49]: one discrete BN (ALARM), one GBN (ECOLI70) and one CGBN (SANGIOVESE). Their characteristics are summarised in Table 1. Of the 14 continuous nodes of SANGIOVESE, 4 follow a conditional Gaussian distribution, while the remaining 10 nodes do not have any discrete parent and follow a Gaussian distribution.
Our experimental setup largely follows [22]. For computational reasons, we apply exact learning to pruned versions of the BNs in which each node has at most two (ALARM, ECOLI) or three (SANGIOVESE) parents. The parameters of the pruned networks are set to the MLEs computed from a data set of size 100|Θ_0| generated from the original BNs. We generate 20 data sets from each network for each combination of n (the sample size) and β (the probability of a value being missing). To facilitate comparisons across networks of different size and complexity, we choose the values of n as n = k|Θ_0| to make them proportional to their number of parameters. From each data set, we learn a DAG for each λ_n and we penalise complexity using h(G) = |Θ_G|. In the experiments for known and unknown node ordering we consider all combinations of the values below, resulting in 144 configurations:
• Relative sample size: k ∈ {10, 50, 100, 250, 500, 1000}.

(Table 1: the reference networks from [49], with the number of discrete (|∆|) and continuous (|Γ|) nodes, of arcs (|A|) and of parameters (|Θ_0|) for the original and the pruned networks.)
When comparing NAL with Structural EM we do not consider k = 1000 and β = 0 and only use α ∈ {0.10, 0.25, 0.40} for computational reasons. We compare DAGs using a scaled version of the Structural Hamming Distance (SHD) by [21]. The SHD between G 1 and G 2 is the minimum number of arc additions, deletions and reversals required to get from a DAG in [G 1 ] to a DAG in [G 2 ]. In particular, we use the scaled SHD defined as SHD/|A| in [22] to allow for meaningful comparisons between networks of different size. We prefer this measure to other structural distances such as the F1 score used by Balov [1] because it takes into account arc directions up to Markov equivalence. Furthermore, we prefer it to distances in the probability space such as the Kullback-Leibler divergence because we are mainly interested in structural accuracy.
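As an illustration, the SHD and its scaled version can be computed as below. Note that this sketch compares DAGs directly, whereas the measure used in the paper compares the equivalence classes [G_1] and [G_2] through their CPDAGs; the example graphs are invented.

```python
def shd(dag1, dag2):
    """Structural Hamming Distance between two DAGs given as sets of
    directed arcs (u, v): the number of arc additions, deletions and
    reversals needed to turn dag1 into dag2."""
    skel1 = {frozenset(a) for a in dag1}   # undirected skeletons
    skel2 = {frozenset(a) for a in dag2}
    additions_deletions = len(skel1 ^ skel2)
    reversals = sum(1 for (u, v) in dag1 if (v, u) in dag2)
    return additions_deletions + reversals

true_dag = {("A", "B"), ("B", "C"), ("A", "D")}
learned = {("A", "B"), ("C", "B"), ("D", "E")}
d = shd(learned, true_dag)
print(d, d / len(true_dag))  # raw SHD and the scaled SHD / |A|
```

Here the learned DAG misses A → D, adds a spurious D → E and reverses B → C, giving an SHD of 3; dividing by |A| = 3 yields a scaled SHD of 1, the "as many errors as true arcs" level referred to later in the text.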
We compare NAL and Structural EM on the basis of their running time and of the number of calls to the scoring function. To make their implementations comparable, we use the bnlearn R package [50] to impute missing data by likelihood-weighted sampling (500 particles). For the M-step, we use BIC as the network score. All the code used in the simulation study is available from https://github.com/Tbodewes/Dissertation.
In the following, we discuss results directly in terms of α for brevity, instead of referring to the corresponding λ_n. Figure 1 supports our theoretical results and confirms Balov's [1]: penalties going to zero more slowly than n^{-1/2} often lead to increased structural accuracy. In the complete data case, the results are as expected: BIC is consistent, and penalties other than AIC perform well. However, α = 0.10, 0.25 do not always improve accuracy as the sample size increases, leading instead to mild underfitting (that is, too few arcs are included in the DAGs). AIC performs poorly for the opposite reason: it does not penalise network complexity enough, leading to overfitting (that is, the learned DAGs contain too many arcs, including a number of false positives).

Performance for Known Node Ordering
As for incomplete data, we can see that all α < 0.60 outperform α = 0.60, AIC and BIC for the ALARM network. As the proportion of missing data increases, the accuracy produced by α = 0.40 gradually deteriorates for all the sample sizes under consideration. We observe similar results for ECOLI: only α = 0.10 does not degrade in performance as β increases. Furthermore, α = 0.25 does well when β = 0.05 but it is not markedly better than the inconsistent scores when β = 0.2. Closer inspection suggests that this may be due to overfitting: α = 0.25, 0.40 do not penalise complexity enough. Note that Gaussian parents are penalised less harshly than discrete parents in a CGBN because they introduce fewer parameters in the BN, which may explain why overfitting occurs for the ECOLI network. Based on the theoretical results, we expect penalties below α = 0.50 to give consistent estimates; however, convergence appears to be very slow for penalties that converge to zero too quickly. For SANGIOVESE, the only penalty that allows one to learn the true DAG correctly for k up to 1000 is α = 0.25, with all the αs slowly increasing the scaled SHD as β increases. α = 0.10 seems to suffer from mild underfitting, while α = 0.40, 0.60, AIC and BIC lead to mild overfitting.

Figure 2 suggests that, for unknown node ordering, α = 0.10, 0.25 tend to give the best results. In the complete data case, BIC no longer outperforms α = 0.10, 0.25, but the overall behaviour of the other penalties is similar to that shown in Figure 1. All penalties except AIC give a similar scaled SHD for each BN. Penalties that allowed one to learn the true DAG when the topological ordering of the nodes was known now have a scaled SHD strictly greater than zero due to the heuristic nature of tabu search: SHD/|A| ≈ 1 for all of them. This is in line with the findings in [22].

Performance for Unknown Node Ordering
As for incomplete data, we see a clean separation between consistent and inconsistent scores for the ALARM network. As β increases, the performance of AIC, BIC and α = 0.60 gradually degrades, while that of the smaller αs remains more stable. (In some cases the DAG learned by the inconsistent scores includes 10 times as many arcs as the corresponding true DAG.) The scaled SHD of α = 0.25 is close to that of α = 0.10 for β = 0.05, but increases rapidly as β grows. This is due to the overfitting also observed for known node ordering, but in this case it is more pronounced because we did not prune the parent sets. For SANGIOVESE, SHD/|A| ≈ 1 for all penalties; nevertheless, performance is better when λ_n goes to zero at a rate slower than n^{-1/2}.

Figure 3 shows that stronger penalties generally lead to tabu search converging faster. For ALARM, α = 0.10 leads to a 50-70% reduction in the number of calls to the scoring function relative to α = 0.40, with even greater reductions for the other BNs. The computational cost follows a pattern that is similar to the SHD pattern in Figure 2: inconsistent scores have over twice the computational cost of the consistent scores, because weaker penalties lead to more arcs being included in the learned DAGs.

Turning to the comparison with Structural EM, NAL with the best-performing penalties is broadly comparable to Structural EM in terms of SHD. This is in line with the performance gap observed between α = 0.10 and all other penalties in Figure 2, and suggests that Structural EM does not suffer as strongly as NAL from the overfitting observed in Figure 2. For SANGIOVESE, the two methods perform similarly again, with NAL outperforming Structural EM for larger sample sizes. Figure 5 (top panel) shows that, for appropriate penalisation, NAL requires 50-75% fewer calls to the scoring function than Structural EM. For ALARM and SANGIOVESE this holds for all considered penalties and sample sizes. For ECOLI, it is only true for larger sample sizes or for α = 0.10. This agrees with the evidence from Figure 3 that NAL with α = 0.25, 0.40 converges slowly for ECOLI.
Figure 5 (bottom panel) shows that NAL is four to ten times faster than Structural EM. The difference is larger than the difference in the number of queries due to the computational cost of the E-step in Structural EM.

Conclusions
Common approaches to BN structure learning from incomplete data, such as the Structural EM, embed EM within score-based algorithms, thus incurring a significant computational cost. NAL is a competitive but faster alternative; in this paper we proved its consistency and model identifiability for CGBNs, showing that NAL has a much wider applicability than its original formulation. We also established general sufficient conditions for consistency that can be readily checked for other classes of BNs. To confirm these theoretical results and Balov's [1], as well as Balov's assertion that NAL is faster than Structural EM, we performed a simulation study including a discrete BN, a GBN and a CGBN. We observed that indeed:
• BIC is not consistent with incomplete data, and AIC is not consistent even with complete data.
• NAL with appropriate penalties has similar structural accuracy to Structural EM.
• NAL with appropriate penalties can require 50-75% fewer calls to the scoring function than Structural EM.
• NAL with appropriate penalties can converge four to ten times faster than Structural EM.
These observations suggest that NAL is competitive with Structural EM in terms of structural accuracy, while being computationally more efficient, in the broader class of CGBNs rather than just for discrete BNs.

(Figure: comparison of NAL and Structural EM on computation time.)
Proof (Proposition 2). Condition (R1): Let I_j be the subset of observations for which ∆_{X_i} = j. By assumption, P(∆_{X_i} = j) > 0, thus this set is non-empty as n → ∞. If X_i ∈ ∆, Θ̂_{X_i} satisfies ∇_{Θ_{X_i}} ℓ̄(X_i | Pa(X_i), Θ_{X_i}) = 0 and is given by the conditional relative frequencies π̂_{ik|j}. The Lindeberg-Lévy CLT gives π̂_{ik|j} = π_{ik|j} + O(n^{-1/2}), as desired.
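The closed-form MLE used in this part of the proof, conditional relative frequencies, can be illustrated numerically; the parent configuration and the true probability below are invented for the example.

```python
from collections import Counter
import random

def fit_cpt(samples):
    """MLE of pi_{ik|j} as conditional relative frequencies: counts of
    each (parent config j, child value k) pair divided by the counts of j."""
    joint = Counter(samples)               # (j, k) counts
    marg = Counter(j for j, _ in samples)  # j counts
    return {jk: c / marg[jk[0]] for jk, c in joint.items()}

rng = random.Random(0)
true_pi = 0.7  # assumed P(X = 1 | Pa = 0)
samples = [(0, 1 if rng.random() < true_pi else 0) for _ in range(20000)]
pi_hat = fit_cpt(samples)[(0, 1)]
print(abs(pi_hat - true_pi))  # O(n^{-1/2}) estimation error
```

With n = 20000 the standard error is about sqrt(0.21/20000) ≈ 0.003, so the estimate falls well within a few hundredths of the true value, consistent with the O(n^{-1/2}) rate invoked in the proof.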
Condition (R2): For X_i ∈ ∆, the diagonal elements of the Hessian are −1/π²_{ik|j}, while the off-diagonal elements are zero. Hence the Hessian is finite for π_{ik|j} > 0.
For X_i ∈ Γ, the Hessian of (β_{ij}, σ²_{ij}) for each ∆_{X_i} = j has blocks −WW^T/σ²_{ij}, −We/σ⁴_{ij} and 1/(2σ⁴_{ij}) − e²/σ⁶_{ij}, where e = X_i − W^T β_{ij}. All elements of this matrix are constants, Gaussian random variables or squares and cross-products thereof. As σ²_{ij} > 0 by assumption, the expectation of all elements of the Hessian is finite and (R2) is satisfied.
The first term is O(1), while the second diverges as n → ∞. The result follows.
Proof (Lemma 4). Under MCAR, we can drop Z^(i). If G = G_0, (i) and (ii) are trivial, while (iii) holds by definition. If G ⊃ G_0, there exists an X_i such that Pa_{G_0}(X_i) ⊂ Pa_G(X_i) and for which X_i ⊥⊥ Pa_G(X_i) \ Pa_{G_0}(X_i) | Pa_{G_0}(X_i); (i) then follows directly. For (ii), note that P(X_i | Pa_G(X_i)) = P(X_i | Pa_{G_0}(X_i)). By definition, estimating Θ̂_{X_i} minimises the KL divergence between (G, Θ̂) and (G_0, Θ_0) for the given D, which happens when the corresponding distributions are equal almost surely, leading to (ii). We obtain (iii) from (ii) as

P(X_i | Pa_G(X_i), Θ̂_{X_i}, D) = P(X_i | Pa_{G_0}(X_i), Θ̂_{X_i}, D) = P(X_i | Pa_{G_0}(X_i), Θ̂_{X_i}) = P(X_i | Pa_{G_0}(X_i), Θ̂_{X_i}, Z^(i)).
To prove the second assertion, assume that G_0 ⊆ G. By Lemma 4(iii), we have P(X_i | Pa(X_i), Θ̂_{X_i}, Z^(i)) = P(X_i | Pa(X_i), Θ̂_{X_i}, D) almost surely. Hence the KL divergence above is zero and ℓ̄(X_i | Pa(X_i), Θ̂_{X_i} | D) →_p ℓ̄(X_i | Pa(X_i), Θ_{X_i}) with a rate of convergence of O(n^{-1/2}) by the Lindeberg-Lévy CLT, dominating the O(n^{-1}) in (A.1).