Large-scale empirical validation of Bayesian Network structure learning algorithms with noisy data

Numerous Bayesian Network (BN) structure learning algorithms have been proposed in the literature over the past few decades. Each publication makes an empirical or theoretical case for its proposed algorithm, and results across studies are often inconsistent in their claims about which algorithm is 'best'. This is partly because there is no agreed evaluation approach to determine their effectiveness. Moreover, each algorithm is based on a set of assumptions, such as complete data and causal sufficiency, and tends to be evaluated with data that conform to these assumptions, however unrealistic these assumptions may be in the real world. As a result, it is widely accepted that synthetic performance overestimates real performance, although to what degree this may happen remains unknown. This paper investigates the performance of 15 structure learning algorithms. We propose a methodology that applies the algorithms to data that incorporates synthetic noise, in an effort to better understand the performance of structure learning algorithms when applied to real data. Each algorithm is tested over multiple case studies, sample sizes, and types of noise, and assessed with multiple evaluation criteria. This work involved learning more than 10,000 graphs with a total structure learning runtime of seven months. It provides the first large-scale empirical validation of BN structure learning algorithms under different assumptions of data noise. The results suggest that traditional synthetic performance may overestimate real-world performance by anywhere between 10% and more than 50%. They also show that while score-based learning is generally superior to constraint-based learning, a higher fitting score does not necessarily imply a more accurate causal graph. To facilitate comparisons with future studies, we have made all data, graphs and BN models freely available online.


INTRODUCTION
A Bayesian Network (BN) graph has two different interpretations. If we assume that the edges between variables represent causation, the BN is viewed as a unique Directed Acyclic Graph (DAG), also referred to as a Causal Bayesian Network (CBN). If, however, we assume that the edges between variables can represent some dependency that is not necessarily causal, then the BN is viewed as a dependence graph that can be represented by a Completed Partially Directed Acyclic Graph (CPDAG), where undirected edges indicate relationships that produce identical posterior distributions irrespective of the direction of the edge. Specifically, a CPDAG represents a set of Markov equivalent DAGs.
One of the reasons CBNs have become popular in real-world applications is because they enable decision makers to reason with causal assumptions under uncertainty, which in turn enables them to simulate the effect of interventions and extend these to counterfactual reasoning [1] [2]. As a result, this paper focuses on assessing the various structure learning algorithms in terms of reconstructing the ground truth DAG, rather than in terms of reconstructing a Markov equivalence class that contains the true graph.
In the literature, there are two main classes of algorithms for BN structure learning, known as constraint-based and score-based learning. The constraint-based algorithms use conditional independence tests to construct a graph, whereas the score-based algorithms view the BN structure learning process as a classic machine learning problem where algorithms search the space of possible graphs and return the graph that maximises some score. The constraint-based algorithms are often referred to as causal discovery algorithms [3], although with some attendant controversy [4] [5] [6] [7] [8] [9], and return a CPDAG where directed edges represent causation and undirected edges represent direct dependency in which the direction of causation cannot be determined by observational data. In addition to these two main classes of algorithms, there exists a class of hybrid learning which adopts strategies from both constraint-based and score-based learning, and such algorithms may occasionally return a CPDAG, although they will generally return a unique DAG. This paper investigates algorithms in all three classes.
While structure learning algorithms demonstrate good performance in synthetic experiments, it is widely acknowledged that this level of performance tends not to extend to real applications. However, the level of difference in performance between synthetic and real experiments remains uncertain. One of the reasons it is difficult to measure the 'real' performance of algorithms is because in the real world we normally have no knowledge of the ground truth causal graph on which the real-world dataset is based, and which is required to validate these algorithms in terms of their ability to reconstruct the true graph.
A recent relevant study by Scutari et al [10] examined the difference in accuracy and speed of the various structure learning algorithms implemented in the bnlearn R package [11]. In this paper, we test a larger set of algorithms available in several structure learning packages, where the focus is to better approximate the real-world performance of these algorithms by incorporating different types and levels of noise in the data. We investigate the performance of each algorithm with and without data noise. The paper is structured as follows: Section 2 discusses the case studies, Section 3 describes the process of generating synthetic and noisy data, Section 4 discusses the selected algorithms, Section 5 describes the evaluation process, Section 6 presents the results, and we provide our concluding remarks in Section 7.

CASE STUDIES
Six discrete BN case studies are used to generate data. The first three represent well-established examples from the BN structure learning literature, whereas the other three are new cases based on recent real-world BN applications. Specifically:
i. Asia: a small toy network for diagnosing patients at a clinic [12].
ii. Alarm: a medium-sized network based on an alarm message system for patient monitoring [13].
iii. Pathfinder: a very large network designed to assist surgical pathologists with the diagnosis of lymph-node diseases [14].
iv. Sports: a small BN that combines football team ratings with various team performance statistics to predict a series of match outcomes [15].
v. ForMed: a large BN that captures the risk of violent reoffending of mentally ill prisoners, along with multiple interventions for managing this risk [16].
vi. Property: a medium-sized BN that assesses investment decisions in the UK property market [17].
The case studies are restricted to problems of up to 100s of variables due to time constraints. While 100s of variables are more than sufficient to model causal relationships in most real-world areas, the results presented in this paper may not apply to bioinformatics, where biology datasets often include 1,000s of variables. The properties of the six case studies are detailed in Table 1. Still, the selected case studies offer a good range of old and new, as well as simple and complex, BNs that come from different application domains. Moreover, the three traditional networks of Asia, Alarm and Pathfinder are based on knowledge-based priors, whereas the parameters of the Sports and ForMed networks are determined by real data, and the Property network is based on parameters determined by clearly defined rules and regulating protocols. It is also worth mentioning that the Sports network is the only case study in which all data variables are ordinal.
The case studies provide a good range of model dimensionality that results from varied maximum in-degree, number of nodes and arcs. In BNs, a good measure of dimensionality is the number of free parameters (also known as independent parameters) that each network incorporates. In this paper, V represents the set of the variables in graph G, and |V| is the size of set V. The number of free parameters p is:

p = Σ_{i∈V} (r_i − 1) · q_i

where r_i is the number of states of variable i, Pa(i) is the parent set of variable i, |Pa(i)| is the size of set Pa(i), and q_i = Π_{j∈Pa(i)} r_j is the number of state configurations of the parent set Pa(i). For example, while both Asia and Sports are small BNs that incorporate eight and nine nodes respectively, they differ considerably in model dimensionality since there are 18 free parameters in Asia while there are 1,049 free parameters in Sports. A similar comparison applies to ForMed and Pathfinder, which incorporate 88 and 109 nodes respectively, yet there are just 912 free parameters in ForMed and 71,890 free parameters in Pathfinder.
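The free-parameter count can be sketched as follows; the toy chain network here is illustrative, not one of the case studies.

```python
def free_parameters(states, parents):
    """states: dict node -> number of states; parents: dict node -> list of parents."""
    total = 0
    for node, r in states.items():
        q = 1
        for p in parents.get(node, []):
            q *= states[p]  # number of state configurations of the parent set
        total += (r - 1) * q  # r - 1 free probabilities per parent configuration
    return total

# Toy chain A -> B -> C, all Boolean:
states = {"A": 2, "B": 2, "C": 2}
parents = {"A": [], "B": ["A"], "C": ["B"]}
print(free_parameters(states, parents))  # 1 + 2 + 2 = 5
```

Applying the same count to the real case-study structures reproduces the figures quoted above (e.g., 18 for Asia, 71,890 for Pathfinder).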
All case studies represent discrete BNs and incorporate different numbers of states. For example, while in Asia all variables are Boolean, the number of states per variable in Sports ranges from five to eight, which also explains the large difference in model dimensionality between the two. Moreover, Pathfinder represents an interesting challenge since, in addition to incorporating the highest number of nodes and arcs, it includes a variable that has 63 states, which also partly explains the high dimensionality of this network.

METHODOLOGY
Each of the BN case studies presented in Section 2 is used to generate synthetic data under various assumptions, so that multiple datasets are produced for each case study. The methodology involves producing multiple datasets that fall under different categories of synthetic data, with and without noise. We briefly describe each of the synthetic data categories below; further details on how each of these datasets is generated or modified with noise are provided in Appendix A. The categories are: i. No noise (N): this represents the traditional approach used throughout the literature, where data are generated as determined by the CPTs of each BN.
Each dataset is then manipulated to incorporate noise; the remaining categories (ii onwards) correspond to the different types of data noise, which are described in Appendix A. These categories led to 16 different experiments, per case study per sample size, shown in Table 2, where the experiment code consists of a) letters that correspond to the type of noise and b) integers that correspond to the rate of noise. Experiment codes that start with the letter 'c' represent category 'Combo', followed by the letters that indicate the types of noise used in that particular experiment. Table A1 shows which types of noise were not performed, for reasons we clarify in Appendix A. For example, experiment S could not be applied to Boolean BNs, such as the Asia BN, because the states of binary variables cannot be reduced further. Moreover, the Asia and Sports BNs incorporate eight and nine nodes respectively, which means experiments S and L, which involve manipulation of variables rather than data values, could be performed only at the 10% level of randomisation; i.e., manipulating just one variable corresponds to a rate of manipulation just above 10% in both cases. For further details, see Appendix A.
Each experiment enumerated in Table 2 is tested with five different sample sizes. These are 0.1k, 1k, 10k, 100k, and 1000k samples. The different sample sizes are subsets of the first rows of the largest dataset. This makes the potential number of possible experiments per case study per algorithm 80 (i.e., 16 experiments over five sample sizes), and the potential number of possible experiments per algorithm 480 (i.e., 80 experiments over six case studies). This information is displayed in Table 3. Appendix A provides additional supplementary information.

ALGORITHMS
The selected algorithms represent state-of-the-art and/or well-established implementations, including some recent algorithms. We aimed for a varied selection across all three types of learning: constraint-based, score-based, and hybrid learning algorithms. A total of 15 algorithms have been evaluated. These are: i. PC-Stable: a modified version of the classic constraint-based algorithm called PC [18] [19]. PC-Stable addresses the order-dependence of the variables in the data by changing the order of edge deletion, and combines this process with the CPC algorithm [20]. The PC-Stable algorithm is commonly used when benchmarking constraint-based learning algorithms.
ii. FGES: an efficient version of the score-based algorithm developed by Chickering [21], which was based on the GES algorithm proposed by Meek [22]. Instead of searching the DAG space, which grows super-exponentially with the number of variables, FGES searches the space of equivalence classes of DAGs, allowing the algorithm to never exceed polynomial time with respect to the number of variables. FGES is commonly used as a state-of-the-art algorithm when benchmarking score-based learning algorithms.
iii. FCI: Fast Causal Inference (FCI) is a constraint-based algorithm that can be viewed as an extended version of the PC algorithm that accounts for the possibility of latent variables in the data [23]. It performs a two-phase learning process that produces the skeleton of the graph (i.e., an undirected graph) followed by conditional independence tests that orientate some of those edges. FCI is commonly used when benchmarking algorithms under assumptions of causal insufficiency.
iv. GFCI: The Greedy FCI (GFCI) is a hybrid algorithm that combines the score-based FGES and the constraint-based FCI algorithms and hence, also accounts for the possibility of latent variables in the data [24]. Specifically, it obtains the skeleton of the graph using the first phase of FGES, although it does not preserve all of the edges of FGES, and then uses FCI-based orientation rules to identify the direction of some of the edges in the skeleton graph.
v. RFCI-BSC: another FCI-based hybrid learning version of the constraint-based algorithm called RFCI [25], which is in turn a faster version of FCI. RFCI-BSC is a model-averaging approach that produces multiple probable models and selects the graph with the highest probability as the preferred graph [26]. Its process involves bootstrap sampling (re-sampling with replacement) to generate multiple datasets, performing RFCI-based conditional independence tests on each of those datasets. This approach makes the algorithm non-deterministic; because of this, we ran RFCI-BSC 10 times on each experiment and recorded the average of the results, meaning this algorithm was run 10 times more often than the other algorithms used in this paper.
vi. Inter-IAMB: uses the concept of the Markov Blanket to reduce the number of conditional independence tests. The Markov Blanket of a particular variable is defined as the conditioning set of variables which ensures that the particular node is conditionally independent of all other variables in the graph. Specifically, in BNs the Markov Blanket of a node consists of its parents, its children, and the parents of its children. Inter-IAMB is an improved version of IAMB that both minimises and improves the accuracy of the Markov Blanket candidate set for each node [27]; a smaller Markov Blanket set yields more precise conditional independence tests.
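The Markov Blanket definition above (parents, children, and children's other parents) can be illustrated with a short sketch over a hypothetical toy DAG:

```python
def markov_blanket(node, parents):
    """parents: dict mapping each node of a DAG to its set of parent nodes."""
    children = {c for c, ps in parents.items() if node in ps}
    spouses = set()
    for c in children:
        spouses |= parents[c] - {node}  # the other parents of each child
    return set(parents.get(node, set())) | children | spouses

# Hypothetical DAG: A -> C <- B, C -> D
dag = {"A": set(), "B": set(), "C": {"A", "B"}, "D": {"C"}}
print(sorted(markov_blanket("A", dag)))  # ['B', 'C'] (child C, spouse B)
print(sorted(markov_blanket("C", dag)))  # ['A', 'B', 'D']
```

Conditioning on this set is what allows algorithms such as Inter-IAMB and GS to discard all other variables when testing a node's conditional independencies.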
vii. MMHC: The Max-Min Hill-Climbing (MMHC) algorithm first constructs the skeleton of the BN using a constraint-based algorithm known as Max-Min Parents and Children (MMPC), and then performs a greedy hill-climbing score-based search to orientate the edges [28]. It is one of the most popular BN structure learning algorithms applicable to high-dimensional problems, and is often used as a state-of-the-art algorithm when benchmarking structure learning algorithms.
viii. GS: Grow-Shrink (GS) was the first constraint-based algorithm to use the concept of Markov Blanket to reduce the number of conditional independence tests [29]. Specifically, GS first finds the Markov-Blanket of each variable and then identifies direct neighbours in the Markov Blanket, thus determining the graph's undirected skeleton, before orientating as many edges as possible to produce a PDAG.
ix. HC: Hill Climbing (HC) is a score-based greedy algorithm which searches the space of DAGs [30]. The algorithm starts with an empty graph and iteratively performs the local move (arc addition, deletion or reversal) that most improves the graph's score. It terminates when the score cannot be improved further, which generally means at a local maximum.
x. TABU: is an adaptation of the HC algorithm described above which attempts to escape from local maxima by allowing some lower-scoring local moves. TABU also avoids revisiting DAGs encountered before, encouraging the algorithm to explore new regions of the DAG space [30].
xi. H2PC: a hybrid algorithm that combines HPC and HC [31]. HPC is an ensemble constraint-based algorithm which comprises three weak parents and children set learners (PC learner) in an attempt to produce a stronger PC learner.
xii. SaiyanH: a hybrid learning algorithm that incorporates restrictions in the search space of DAGs that force the algorithm to return a graph that enables full propagation of evidence, under the controversial assumption that the data variables are dependent [32]; i.e., every variable has a direct dependency with at least one other variable in the data. It involves three learning phases: a) determining the skeleton of the initial graph using local learning, b) orientating the edges of the skeleton graph using constraint-based learning, and c) further modifying the graph using score-based learning.
xiii. ILP: an integer linear programming score-based algorithm that separates structure learning into two phases. The first phase computes the scores of candidate parent sets, whereas the second phase involves the optimal assignment of parents to each node [33]. ILP is guaranteed to return the graph that maximises a scoring function (i.e., it is an exact learning approach). However, this guarantee assumes unlimited runtime and no restriction on the maximum in-degree; in practice, exact learning can only be achieved for graphs that consist of up to approximately 30 nodes.
xiv. WINASOBS: similar to ILP in the sense that it relies on the same two learning phases; i.e., identification of a list of candidate parent sets for each node followed by an optimal assignment of the parent set of each node [34]. It differs from ILP in that it uses a simplified Bayesian Information Criterion (BIC) score that is easier to compute, referred to as BIC*, and applies more aggressive pruning to the search space of possible DAGs, which does not guarantee returning the graph that maximises BIC*. As a result, WINASOBS is an approximate, rather than exact, learning algorithm, but one which is scalable to thousands of nodes.
xv. NOTEARS: a score-based algorithm that converts the traditional score-based combinatorial optimisation problem (i.e., searching over possible graphs) into an equality-constrained continuous optimisation problem expressed over the weighted adjacency matrix of the graph [35]. This algorithm was originally designed for continuous data but has since been applied to ordinal discrete data.
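As an aside on the greedy search that HC and TABU (items ix and x above) build on, a minimal hill-climbing sketch is shown below. The scoring function here is a stand-in (it rewards arcs from a hypothetical target set, with a sparsity penalty) purely to make the loop runnable; a real implementation would score candidate DAGs against data with BIC or BDeu.

```python
import itertools

def is_acyclic(nodes, edges):
    """Kahn's algorithm over a set of directed arcs (tail, head)."""
    indeg = {n: 0 for n in nodes}
    for _, head in edges:
        indeg[head] += 1
    queue = [n for n in nodes if indeg[n] == 0]
    visited = 0
    while queue:
        n = queue.pop()
        visited += 1
        for tail, head in edges:
            if tail == n:
                indeg[head] -= 1
                if indeg[head] == 0:
                    queue.append(head)
    return visited == len(nodes)

def hill_climb(nodes, score):
    """Greedy search over DAGs via arc additions, deletions and reversals."""
    edges, best = set(), score(set())
    while True:
        neighbours = []
        for a, b in itertools.permutations(nodes, 2):
            if (a, b) in edges:
                neighbours.append(edges - {(a, b)})             # deletion
                neighbours.append(edges - {(a, b)} | {(b, a)})  # reversal
            else:
                neighbours.append(edges | {(a, b)})             # addition
        improved = False
        for cand in neighbours:
            if is_acyclic(nodes, cand) and score(cand) > best:
                edges, best, improved = cand, score(cand), True
        if not improved:
            return edges  # local maximum reached

# Stand-in score: reward arcs from a hypothetical target set, penalise size.
target = {("A", "B"), ("B", "C")}
score = lambda es: len(es & target) - 0.1 * len(es)
print(sorted(hill_climb(["A", "B", "C"], score)))  # [('A', 'B'), ('B', 'C')]
```

TABU extends this loop by occasionally accepting lower-scoring moves and by keeping a list of recently visited graphs that the search is forbidden from revisiting.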
The R package r-causal v1.1.1, which makes use of the Tetrad freeware implementation [36], was used to test algorithms i to v (PC-Stable to RFCI-BSC). The bnlearn R package version 4.5 [11] was used to test algorithms vi to xi (Inter-IAMB to H2PC). Lastly, SaiyanH was tested using the Bayesys v1.4 software [37], ILP was tested using the GOBNILP software [33], WINASOBS was tested using BLIP [38], and NOTEARS was tested using its original Python source code [35]. Structure learning algorithms enable users to modify some of their hyperparameters, such as the maximum node in-degree, the p-value threshold used in conditional independence tests, and the scoring metric used in search, typically either the BIC [39] or the Bayesian Dirichlet equivalent uniform (BDeu) [40]. Modifying these hyperparameters may decrease or increase the learning performance of an algorithm in a particular experiment. However, it is impractical to test each of these algorithms over different learning hyperparameters in this paper. As a result, we have tested all algorithms with their default hyperparameter settings as implemented in each software, on the basis that this is how most users would use them, and to keep the comparison between algorithms as fair as possible. These hyperparameter defaults are listed in Table 4.

EVALUATION
The algorithms are evaluated in terms of how well their learned graphs predict the true graphs, rather than how well the fitted distributions agree with the empirical distributions. This means that the evaluation is fully orientated towards graphical discovery rather than inference, which is also the preferred method in the structure learning literature when the aim is to assess the capability of an algorithm in terms of discovering causal structure. The scoring metrics used for this purpose are covered in subsection 5.1, whereas subsection 5.2 describes the approach we followed to assess the accuracy of learned graphs when based on data that incorporate latent variables; also known as structure learning under causal insufficiency. Lastly, subsection 5.3 describes the approach we followed to assess the time complexity of the algorithms.

The scoring metrics
Three different scoring metrics are considered to assess the accuracy of a learned graph with respect to the true graph. These are: i. the F1 score, which represents the harmonic mean of Recall and Precision, which are themselves the most used metrics in this field of research. F1 is defined as:

F1 = 2 · (P · R) / (P + R)

where R is Recall and P is Precision. The F1 score ranges from 0 to 1, where 1 represents the highest score (with perfect Precision and Recall) and 0 the lowest.
ii. the Structural Hamming Distance (SHD) metric, which penalises each change required to transform the learned graph into the ground truth graph [28]. The SHD score has become well-established in this field of research, partly due to its simplicity. However, it is important to highlight that the SHD score represents classic classification accuracy, which is often considered misleading. For example, consider a true graph with 1% direct dependencies (e.g., 100 edges) and 99% direct independencies (e.g., 9,900 direct independencies). An algorithm that returns an empty graph would be judged as 99% accurate by SHD (i.e., an error of 100 out of a possible 10,000 errors), despite the empty graph being useless in practice. Tsamardinos et al. [28], who proposed the SHD, acknowledged that it is biased towards the sensitivity of identifying edges at the expense of specificity.
iii. the Balanced Scoring Function (BSF), which is a recent metric that addresses the bias in the SHD score. It eliminates this bias by taking into consideration all of the confusion matrix parameters (i.e., TP, TN, FP, and FN) to balance the score between direct dependencies and direct independencies [41]. Specifically,

BSF = 0.5 · (TP/a + TN/i − FP/i − FN/a)

where a is the number of edges and i is the number of direct independencies in the true graph, and i can be calculated as i = |V|(|V| − 1)/2 − a. The BSF score ranges from −1 to 1, where −1 corresponds to the worst possible graph, 1 to the graph that matches the true graph, and a score of 0 corresponds to an empty or a fully connected graph. This means that, in the example used above to highlight the bias in SHD, the empty graph would have returned a BSF score of 0, which BSF assumes as the baseline.
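A minimal sketch of the F1 and BSF computations from the graph confusion matrix, where `a` and `i` denote the numbers of edges and direct independencies in the true graph (the example figures are those from the empty-graph illustration above):

```python
def f1_score(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def bsf(tp, tn, fp, fn, a, i):
    return 0.5 * (tp / a + tn / i - fp / i - fn / a)

# Empty-graph example: 100 true edges among 10,000 candidate pairs.
# The empty graph scores BSF = 0 (the baseline), even though an
# SHD-style accuracy would judge it 99% accurate.
print(bsf(tp=0, tn=9900, fp=0, fn=100, a=100, i=9900))  # 0.0

# A learned graph with 8 correct arcs, 2 spurious and 4 missing:
print(round(f1_score(tp=8, fp=2, fn=4), 3))  # 0.727
```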
The above three metrics are interesting in their own way, and their differences are reflected in the results presented in Section 6, although the F1 and BSF scores are largely in agreement. Table 5 presents the penalty weights used to compute all three scoring metrics. Note that rule #2 implies a minor change to the original definition of SHD. For example, in [28] a score of 10 indicates that the learned graph requires 10 changes (deletions, additions, or reversals) before it matches the true graph. In this paper, however, reversing an arc carries half the penalty of removing or adding an arc, under the assumption that the dependency has been correctly discovered, although with the wrong orientation; this assumption is consistent with other SHD variants [42]. Table 5. The penalty weights used by the three scoring metrics, where o-o and o→ represent edges produced by structure learning algorithms designed for causally insufficient systems (see Section 5.2) and which indicate that the orientation is uncertain.

Rule | True graph | Learned graph | Penalty
Incorrect dependency discovered | A ⊥ B | Any edge/arc | 1
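As an illustration of the SHD variant described above (reversals costing half the penalty of additions or deletions), here is a minimal sketch over graphs represented as sets of directed arcs; it ignores the partially oriented edge types (o-o, o→) of Table 5 and uses a hypothetical pair of graphs.

```python
def shd(true_edges, learned_edges):
    """SHD variant: additions and deletions cost 1, reversals cost 0.5."""
    true_adj = {frozenset(e) for e in true_edges}
    learned_adj = {frozenset(e) for e in learned_edges}
    penalty = float(len(true_adj ^ learned_adj))  # missing + extra adjacencies
    for adj in true_adj & learned_adj:
        t = next(e for e in true_edges if frozenset(e) == adj)
        l = next(e for e in learned_edges if frozenset(e) == adj)
        if t != l:
            penalty += 0.5  # adjacency recovered but arc reversed
    return penalty

true_g = {("A", "B"), ("B", "C")}
learned = {("B", "A"), ("B", "C"), ("C", "D")}
print(shd(true_g, learned))  # 1 reversal + 1 extra arc = 1.5
```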

Predicting the true causal graph, under causal sufficiency and causal insufficiency
What the structure learning algorithms can and cannot discover is heavily debated in the literature [4] [5] [6] [7]. Regardless, this paper investigates the usefulness of these algorithms in real-world settings where we tend to require CBNs. Because of this, the algorithms are evaluated in terms of their ability to discover the true causal DAG, which entails more information than a CPDAG. When it comes to causally insufficient experiments (i.e., those which incorporate latent variables), the learned graphs are assessed with respect to the ground truth Maximal Ancestral Graph (MAG). A MAG is an extension of a DAG that is restricted to a set of observed variables, where variables not part of that set are assumed to be latent. A MAG includes both directed and bi-directed (i.e., ↔) edges, where a bi-directed edge indicates a latent confounder between the connected nodes, whereas a directed edge represents a direct or ancestral relationship. While a bi-directed arc entails more information than a directed arc, we assume that if A ↔ B is in the true MAG then both A → B and B → A are acceptable as arcs in the CBN (refer to rule #4 in Table 5).
Moreover, a Partial Ancestral Graph (PAG) represents a set of Markov equivalent MAGs under causal insufficiency, in the same way a CPDAG represents a set of Markov equivalent DAGs under causal sufficiency. Some latent-variable algorithms produce a PAG, in the same way that some algorithms that do not account for latent variables produce a CPDAG. The scoring system specified in Table 5 allows us to compare the learned graphs (DAG, CPDAG, MAG, or PAG) with either the true DAG or the true MAG. Moreover, Table 6 indicates which experiments are based on the true DAGs and which experiments are based on specific MAGs. Appendix B provides an example of a DAG being converted into a MAG.

Measuring time complexity
Time complexity is measured by means of elapsed time (runtime). Due to the scale of the experiments, this study involved different members of the research team running the structure learning algorithms on different machines over multiple months. Since some machines are faster than others, we had to adjust structure learning runtimes for CPU speed.
We have selected the CPU with the highest market share as the baseline CPU to which all runtimes are adjusted. As of 31 January 2020, the CPU with the highest market share (3.4%) was the AMD Ryzen 5 3600 [43]. Table 7 lists all the machines used by the research team along with the runtime adjustment against the baseline CPU, where the adjustment is determined by the difference in single-core performance scores published by UserBenchmark [44]. Our tests showed that the algorithms were using only one of the available CPU cores, which is why runtime is adjusted relative to the single-core performance difference between CPUs. Therefore, our runtime results approximate elapsed time for structure learning on the AMD Ryzen 5 3600 desktop CPU.
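The runtime adjustment can be sketched as follows; the benchmark scores below are illustrative placeholders, not the published UserBenchmark figures.

```python
def adjust_runtime(seconds, machine_score, baseline_score):
    """Scale an elapsed runtime to the baseline CPU using the ratio of
    single-core benchmark scores (higher score = faster machine)."""
    return seconds * (machine_score / baseline_score)

# A one-hour run on a machine 20% faster than the baseline corresponds
# to roughly 1.2 hours on the baseline CPU:
print(adjust_runtime(3600, machine_score=120, baseline_score=100))  # 4320.0
```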

RESULTS AND DISCUSSION
Because the results are based on more than 10,000 individual runs, we had to restrict the learning time per run to six hours. Algorithms that fail to return a result within the 6-hour limit are assigned the lowest rank for that graph. For example, if five out of the 15 algorithms fail to return a graph in a test, then all five of those algorithms receive rank 11 for that particular test. The ILP algorithm represents a 'relaxed' exception to this rule, because it provides the option for the user to stop the learning process at any point in time and retrieve the 'best' graph discovered up to that point. We have taken advantage of this feature in ILP, which is not offered by the other algorithms, and this can be argued to be an unfair advantage for ILP. Furthermore, algorithms that fail to generate a graph due to a known or unknown error are also assigned the lowest rank for that graph. Detailed information on structure learning failures is provided in Appendix C for all algorithms and over all the experiments. Lastly, the results for NOTEARS are restricted to the Sports case study, because NOTEARS works only with binomial data (ordinal scale distributions), which only the Sports case study conforms to in this paper. Note that the baseline CPU used for runtime adjustment (Section 5.3) was not itself used in our experiments.
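The failure-ranking rule described above can be sketched as follows, with illustrative algorithm names and scores; failed runs (recorded as `None`) all share the rank one below the last successful algorithm.

```python
def assign_ranks(scores):
    """scores: dict algorithm -> score, or None for a failed run.
    Higher scores rank better; rank 1 is best."""
    ok = sorted((a for a, s in scores.items() if s is not None),
                key=lambda a: -scores[a])
    fail_rank = len(ok) + 1  # e.g., 5 failures out of 15 -> shared rank 11
    ranks = {a: r for r, a in enumerate(ok, start=1)}
    ranks.update({a: fail_rank for a, s in scores.items() if s is None})
    return ranks

scores = {"alg1": 0.9, "alg2": 0.7, "alg3": None, "alg4": None}
print(assign_ranks(scores))  # {'alg1': 1, 'alg2': 2, 'alg3': 3, 'alg4': 3}
```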
The results are separated into those with no synthetic noise (i.e., experiment N) which represent the typical approach to synthetic data used in this field of research, and those with different types of synthetic noise (i.e., all other experiments) which represent the new approach used in this study in an effort to better approximate real-world performance. The subsections that follow cover these two approaches in turn.

Results without data noise
We start by presenting the overall performance of the algorithms across all case studies and sample sizes associated with experiment N (i.e., no noise). Table 8 reports the average rank achieved by each algorithm, as determined by each of the three scoring metrics, averaged over all case studies and sample sizes in N. The reason we report the overall ranks rather than the overall scores is because not all algorithms produce a result in every single experiment. Averaging across scores that include missing scores would have biased the difference between scores, whereas assigning ranks (as indicated at the start of Section 6) avoids this bias.
Still, actual scores are needed to understand the difference in performance between two ranks and hence, we provide all scores generated by each algorithm for each experiment in Tables D1, D2, and D3, according to metrics F1, SHD and BSF respectively. These tables illustrate that the F1 and BSF metrics show similar patterns of results across all experiments, whereas SHD shows a much more homogeneous pattern within each case study. Because SHD captures the error between two graphs, it unsurprisingly shows relatively good performance for all algorithms across all sample sizes for small graphs (e.g., Asia and Sports) and relatively poor performance for the largest graphs (e.g., ForMed and Pathfinder). F1 and BSF show more dependence on sample size within each case study, it being generally the case that performance improves with sample size. Across all three metrics, all algorithms and case studies, it is usually the case that best performance is found with either 100k or 1000k samples. It is also clear that all metrics and all algorithms show considerably worse performance on the Pathfinder case study than on the other case studies. This can be partly explained by a variable in Pathfinder which consists of 63 states and serves as the parent for multiple other nodes, which disproportionately increases the dimensionality of the model. The results in Table 8 show that metrics F1 and BSF are generally in agreement in ranking the algorithms in terms of overall performance, since the discrepancy in ranking between these two metrics is never greater than one. In contrast, the ranking produced by SHD is often different. For example, SHD ranks MMHC 6th while F1 and BSF rank it 10th and 11th respectively. Another major discrepancy between metrics involves the SaiyanH algorithm, which SHD ranks 11th whereas both F1 and BSF rank it 4th.
Fig 1 demonstrates how rankings fluctuate when the experiments are ordered on the samples-to-parameters ratio, i.e., the number of samples divided by the number of free parameters, which provides an indication of the risk of model overfitting. For example, the experiment with the highest risk of overfitting is Pathfinder with sample size 0.1k (the leftmost point in each chart in Fig 1), whereas the experiment with the lowest risk of overfitting is Asia with sample size 1,000k (the rightmost point in each chart in Fig 1). In the former, the ratio is effectively 0 since there are 100 samples divided by 71,890 free parameters, whereas in the latter the ratio is 55,556 since there are 1,000,000 samples divided by 18 free parameters.
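As a quick check on the figures quoted above, the two extreme ratios can be reproduced directly:

```python
# Ratio of sample size to free parameters for the two extreme experiments:
pathfinder = 100 / 71_890    # 0.1k samples, 71,890 free parameters
asia = 1_000_000 / 18        # 1000k samples, 18 free parameters
print(round(pathfinder, 4))  # 0.0014 (effectively zero)
print(round(asia))           # 55556
```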
The oscillatory behaviour observed in the graphs of Fig 1 suggests that the risk of overfitting of an experiment is not a strong predictor of the relative performance between algorithms. Fig D1 reorders the results of Fig 1 by the number of nodes, followed by sample size. Some weak patterns emerge under this ordering, such as a) the relative performance of MMHC increases with the number of nodes, b) the relative performance of H2PC improves with sample size within each case study, and c) the relative performance of ILP decreases with the number of nodes, since a larger number of nodes makes exact learning progressively harder and forces the search to terminate at the 6-hour limit. However, Fig D1 continues to suggest that it is difficult to identify, with confidence, a pattern that explains the changes in relative performance between algorithms across the different experiments. Note that even TABU, which is ranked 1st overall by all three metrics, often shows significant drops in relative performance in some experiments. Furthermore, the graphs in Fig 1 continue to support the notion that F1 and BSF are largely in agreement, whereas SHD often deviates from the F1 and BSF rankings. Noticeable examples include GS (and to some extent MMHC), whose results are ranked highly by SHD (left side of the graph) in strong contradiction with the F1 and BSF rankings, and vice versa for SaiyanH and ILP. Fig 2, on the other hand, plots the number of edges learned with respect to the true number of edges, per algorithm, case study, and sample size. It is noticeable how each of the charts in Fig 2 shows a broadly upward trend in the number of edges as the samples-to-parameters ratio increases, which is a reasonable outcome given that a higher ratio decreases the risk of overfitting and enables the algorithms to draw edges with higher confidence. Conversely, a very low ratio increases the risk of underfitting for some algorithms.
The contradictions between metrics previously observed in Fig 1 can be partly explained by the results in Fig 2. For example, in the case of GS, the number of edges generated is extremely low with respect to the true number of edges, especially at low samples-to-parameters ratios, something which SHD is known to favour since it tends to be biased towards empty or underfitted graphs, but which F1 and BSF rank poorly. In contrast, SaiyanH and ILP are the only algorithms that produce a relatively high number of edges at the low ratios and, although they do much better than the other algorithms in approximating the true number of edges, SHD penalises them heavily, which, once more, strongly contradicts the F1 and BSF rankings.

[Fig 1 caption fragment: relative rankings of the algorithms (as in Table 8) ordered on the scale of samples over free parameters. These results are based on experiment N (i.e., no data noise). For NOTEARS, the F1 rankings are identical to the BSF rankings.]

Moreover, Fig 2 shows that RFCI-BSC failed to generate a graph in all of the high-ratio experiments, and this is likely due to the larger sample sizes of those datasets. The results also suggest that GS and MMHC, and to a lesser extent Inter-IAMB and H2PC, greatly underestimate the number of true edges and, while the underestimation improves at higher ratios, these algorithms always underfit the learned graph. In contrast, TABU, HC, ILP, SaiyanH, and WINASOBS, which are the top five algorithms according to the F1 and BSF metrics, are the only algorithms that produce slightly more edges than the number of true edges. Fig 3 presents the cumulative runtime of each algorithm over all runs in the N experiments. Note that the information presented in this figure has already been adjusted for the different processing powers of the computers used, as indicated in Table 7. Moreover, each experiment that did not complete within the 6-hour limit, including failures to produce a graph for other reasons (refer to Appendix C), is assigned the 6-hour limit as its runtime.
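The SHD bias towards underfitted graphs can be illustrated with a toy example. The sketch below uses simplified, skeleton-only versions of SHD and F1 over undirected edge sets (the metrics used in the paper also account for edge orientation, and BSF is omitted); the graphs are hypothetical:

```python
def shd(true_edges, learned_edges):
    """Simplified structural Hamming distance over edge sets:
    missing edges + extra edges. Lower is better."""
    true_edges, learned_edges = set(true_edges), set(learned_edges)
    return len(true_edges - learned_edges) + len(learned_edges - true_edges)

def f1(true_edges, learned_edges):
    """Simplified F1 over edge sets. Higher is better."""
    true_edges, learned_edges = set(true_edges), set(learned_edges)
    tp = len(true_edges & learned_edges)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(learned_edges), tp / len(true_edges)
    return 2 * precision * recall / (precision + recall)

true_g  = {(a, b) for a, b in zip("ABCDEFGH", "BCDEFGHA")}       # 8 true edges
empty_g = set()                                                  # severe underfit
dense_g = set(list(true_g)[:6]) | {(x, "Z") for x in "ABCDEFG"}  # 6 hits, 7 extras

print(shd(true_g, empty_g), f1(true_g, empty_g))            # 8 0.0
print(shd(true_g, dense_g), round(f1(true_g, dense_g), 2))  # 9 0.57
```

The empty graph achieves the better (lower) SHD, yet recovers none of the true structure; the denser graph recovers six true edges and scores far better on F1 despite the worse SHD.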
This means that Fig 3 (as well as Fig 4, shown later) underestimates runtime when an algorithm does not produce a graph within the 6-hour limit, and either underestimates or overestimates runtime when an algorithm returns an out-of-memory or an unknown error during structure learning. Still, we consider these results to be reasonably accurate. One estimate we would like to highlight as highly uncertain, however, is that of RFCI-BSC (and to a lesser extent FCI after adding noise to the data), because this algorithm produced a very high number of failures (refer to Appendix C), each of which is penalised with the 6-hour runtime limit. Lastly, it is worth highlighting the efficiency of the algorithms implemented in the bnlearn R package [11]. Specifically, GS, HC, Inter-IAMB, MMHC and TABU, all of which were tested using bnlearn, proved to be considerably faster than the other algorithms and never failed to produce a graph, although this observation can also be partly explained by the algorithms themselves.

[Fig 3 caption fragment: runtimes are adjusted for processing power as per Table 7. Failed attempts by the algorithms to generate a graph (refer to Appendix C) are assigned the 6-hour runtime limit. The results exclude NOTEARS, since it was applied only to the Sports case study, on which its runtime ranked 13th out of the 15 algorithms.]

Table 9 repeats the analysis of Table 8 on the 15 noisy experiments, which are directly compared to the results shown in Table 8 (i.e., the N experiment). Specifically, the analysis highlights how the relative performance of the algorithms changed after incorporating noise in the data, as determined by each of the metrics.

Results with data noise
The results reveal numerous interesting observations. TABU, which previously topped all three rankings in overall performance, has lost significant ground against the other algorithms and now ranks 2nd overall by the F1 and BSF metrics, and 4th overall by SHD. In contrast, the HC algorithm, which all metrics previously ranked 2nd, is now ranked 1st by F1 and BSF, and 2nd by SHD. This is an interesting observation because TABU is an improved search version of HC that escapes some of the suboptimal search regions in which HC tends to get stuck. This result suggests that data noise has misled TABU into performing escapes from local optima into regions that deviate further from the true graph. However, this observation does not extend to SaiyanH, whose score-based learning phase also makes use of tabu search, although the tabu search in SaiyanH plays a less significant role than it does in TABU.
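The mechanical difference between the two search strategies can be sketched on a toy maximisation problem. This is a generic illustration, not the bnlearn implementation: the landscape, scoring function and neighbourhood below are invented for the example, but the contrast is the one described above, i.e., a tabu list permits moves to worse states and thereby escapes local optima that trap greedy ascent:

```python
def hill_climb(score, start, neighbours, steps=100):
    """Greedy ascent: move only to strictly better neighbours; stop at a local optimum."""
    current = start
    for _ in range(steps):
        best = max(neighbours(current), key=score)
        if score(best) <= score(current):
            return current          # stuck at a local optimum
        current = best
    return current

def tabu_search(score, start, neighbours, steps=100, tabu_size=5):
    """Tabu variant: always move to the best non-tabu neighbour, even if worse,
    while remembering the best state seen; this can escape local optima."""
    current, best_seen = start, start
    tabu = [start]
    for _ in range(steps):
        candidates = [n for n in neighbours(current) if n not in tabu]
        if not candidates:
            break
        current = max(candidates, key=score)
        tabu = (tabu + [current])[-tabu_size:]
        if score(current) > score(best_seen):
            best_seen = current
    return best_seen

# Toy landscape with a local peak at x=2 and the global peak at x=8.
landscape = [0, 3, 5, 2, 1, 2, 4, 7, 9, 6]
score = lambda x: landscape[x]
neighbours = lambda x: [n for n in (x - 1, x + 1) if 0 <= n < len(landscape)]

print(hill_climb(score, 0, neighbours))   # 2  (trapped on the local peak)
print(tabu_search(score, 0, neighbours))  # 8  (escapes and finds the global peak)
```

Under noisy scores, however, such escapes can lead towards states that fit the noise rather than the true structure, which is consistent with the TABU result above.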
The ILP algorithm, which is the only exact learning algorithm tested in this study, has lost some ground in relative performance, but not enough to alter its ranking. This result is consistent with that of TABU, on the basis that data noise appears to distort model fitting, which in turn has a negative effect on algorithms that push hardest on maximising the fitting score. This observation provides empirical evidence that while exact learning is expected to work best in theory, which assumes no data noise, it does not work best in practice.
Interestingly, MMHC is the only algorithm with significant gains in performance across all three metrics. Specifically, it now ranks 6th, 1st, and 9th by the F1, SHD, and BSF metrics respectively, up from 10th, 6th, and 11th under no data noise. While MMHC claimed the top spot in terms of SHD under data noise, this result can be largely explained by the low number of edges (i.e., underfitting) the algorithm tends to produce with respect to the number of true edges (refer to Fig 2), which also explains the contradiction with the F1 and BSF rankings. On the other hand, FCI is the algorithm with the highest loss in relative performance. Another interesting contradiction between metrics can be observed in the results of PC-Stable, where F1 and BSF suggest that PC-Stable improved its performance under data noise relative to the other algorithms, whereas the SHD metric suggests otherwise; this is a good example of how classic classification accuracy (i.e., SHD) and balanced accuracy (i.e., BSF and, partly, F1) can lead to very different conclusions. Fig 5 presents the increase or decrease in relative ranking between algorithms, for each of the 15 noisy experiments and with respect to experiment N. These results reveal further inconsistencies between metrics, some of which are highly contradictory. For example, the F1 and BSF results on GFCI, HC, and PC-Stable are in direct disagreement with SHD. Interestingly, the algorithms designed to account for latent variables during structure learning, such as FCI, GFCI and RFCI-BSC, did not improve their performance relative to the other algorithms under the experiments that involve reconstruction of the true MAG (i.e., experiments that incorporate code 'L'). Compared with Fig 3, the results in Fig 4 suggest that some algorithms can be considerably slower under data noise, such as FCI and H2PC, while others can be faster, such as ILP.

Table 9. Average and overall ranked performance of each algorithm over all case studies and sample sizes, and over all 15 noisy experiments, as determined by each of the three metrics, where Δ is the relative difference in performance with respect to the N experiments (i.e., without noise). Green and red text indicate increased and decreased relative ranked performance respectively. Detailed performance for each of the 15 noisy experiments is provided in Appendix E.

[Fig 4 caption fragment: runtimes are adjusted for processing power as per Table 7. Failed attempts by the algorithms to generate a graph (refer to Appendix C) are assigned the 6-hour runtime limit. The results exclude NOTEARS, since it was applied only to the Sports case study, on which its runtime ranked 9th out of the 15 algorithms.]

[Fig 5 caption fragment: the difference in the overall rank achieved by each algorithm for each noisy experiment and over each of the scoring metrics, relative to experiment N (i.e., before adding noise to the data). The superimposed red line represents zero difference in rankings between experiment N (without noise) and all other experiments (with noise), whereas markers above and below the line indicate increased and decreased performance after adding noise to the data.]

Lastly, Fig 6 presents the overall decrease in accuracy, across all algorithms, for each noisy experiment with respect to experiment N (i.e., no data noise). The inconsistency in the conclusions between the F1 and BSF metrics on the one hand, and SHD on the other, is also present in Fig 6, where the imbalance in the SHD score leads to the counterintuitive conclusion that experiments I5, I10, MI, and IL decreased structure learning performance more than experiment MISL, which incorporates all types of noise and a higher total rate of data noise. In contrast, the F1 and BSF metrics identify MISL as having the largest negative impact on structure learning performance, as might be expected. Note that another counterintuitive conclusion, and one which applies to all metrics, is that I5 decreases overall performance more than IS does. However, this conclusion is uncertain and subject to bias, because the minor difference between these two results could be explained by the IS experiments not including the Asia case study, unlike the I5 experiments (refer to Table A1).
According to the F1 and BSF metrics, the overall results suggest that data noise of types S and L has had a relatively minor impact on structure learning performance. However, it should be noted that the results from experiments incorporating L are based on the reconstruction of true MAGs, which contain fewer variables than the true DAGs used in the experiments without L, and the difference in the number of variables in the data can influence the result generated by some metrics. Specifically, the SHD error increases with the number of variables, and this biases comparisons between SHD scores that are based on true graphs with different numbers of variables. On the other hand, data noise of types M and I has had a much stronger impact in decreasing the performance of the algorithms, although this could be because noise of types M and I influences data values across all variables, whereas noise of types S and L influences only some (up to 10%) of the variables while at the same time reducing model dimensionality. Combining all types of noise in a single dataset (i.e., experiment MISL), which is reasonable to assume better approximates real data, has led to a decrease in structure learning accuracy in the range of 30% to 37%, or an overestimation of real performance in the range of 43% to 59%. (Footnote 8: If data noise reduces accuracy by roughly one third, such as from 75% to 50%, then data with no noise increases accuracy by 50%, i.e., from 50% to 75%.)
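The relationship between a noise-induced reduction in accuracy and the implied overestimation of noise-free results follows from simple arithmetic: if noise reduces accuracy by a fraction r, noise-free accuracy is higher by a factor of 1/(1−r), i.e., an overestimation of r/(1−r). A minimal check of the figures quoted above:

```python
def overestimation(reduction: float) -> float:
    """Given the fractional reduction in accuracy caused by data noise, return
    the fractional overestimation implied by noise-free results.
    E.g., a drop from 75% to 50% is a reduction of 1/3, and 75/50 - 1 = 50%."""
    return reduction / (1 - reduction)

# The MISL experiments reduced accuracy by 30%-37%...
print(round(overestimation(0.30) * 100))   # 43
print(round(overestimation(0.37) * 100))   # 59
# ...and the footnote's example: a one-third reduction implies 50% overestimation.
print(round(overestimation(1 / 3) * 100))  # 50
```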

DISCUSSION AND CONCLUDING REMARKS
This paper presents a new methodology that models the difference in structure learning performance between traditional synthetic data and noisy data. The methodology involves applying the algorithms to synthetic data that incorporate different types of noise, and it can be used to better estimate the real-world performance of structure learning algorithms under the assumption that similar types and rates of noise exist in real data. The methodology was applied to 15 BN structure learning algorithms, which are either state-of-the-art, well-established, or recent promising implementations. The performance of the algorithms was assessed in terms of reconstructing the true causal DAG (CBN), or the causal MAG, under 16 different hypotheses of data noise, investigated over six diverse case studies with five different sample sizes per case study. The concluding remarks are derived from more than 10,000 graphs that required approximately seven months of total structure learning runtime to produce.
The results are based on three scoring criteria: the F1, SHD, and BSF metrics. Our results show that the F1 and BSF metrics are largely in agreement, whereas SHD often deviates considerably from them. When this happens, SHD tends to lead to conflicting and counterintuitive conclusions, an observation consistent with what has been reported in [41]. This is because the SHD score represents pure classification accuracy (i.e., the summation of false positives and false negatives), whereas the F1 and BSF metrics represent partly and fully balanced scores respectively. While classification accuracy is widely considered to be misleading in other machine learning fields, the SHD metric remains popular in the field of BN structure learning. The results below are discussed across all three metrics, but emphasis is given to the F1 and BSF metrics in deriving conclusions.
Overall, the results suggest that data noise can have a considerable impact on the accuracy of the learned graph. Specifically, latent variables (experiments L) and the merging of states (experiments S) have had a relatively minor impact on structure learning accuracy. This is not necessarily a surprising result, since both of these manipulations reduce the dimensionality of the input data without subjecting the residual data values to noise. On the other hand, missing and incorrect data values (experiments M and I respectively) have had a major impact on the accuracy of the learned graphs. Specifically, the reduction in structure learning accuracy due to 5% or 10% missing data values ranged between 13% and 18% (i.e., accuracy increases by 15% to 21% without data noise), whereas the reduction in accuracy due to 5% or 10% incorrect data values ranged between 18% and 28% (i.e., accuracy increases by 23% to 39% without data noise). When both of these types of noise are combined in a single dataset, the decrease in accuracy ranges between 26% and 30% (i.e., accuracy increases by 34% to 44% without data noise). Incorporating all four types of noise in a single dataset decreases accuracy in the range of 30% to 37% (i.e., accuracy increases by 43% to 59% without data noise). These results have major implications, since they suggest that the BN structure learning accuracy presented in the literature, on the basis of traditional synthetic data, overestimates real-world performance to a higher degree than perhaps previously assumed. All the datasets used in this study, along with the graphs and BN models, are freely available online [45].
Focusing on the performance of each algorithm, the results reveal that score-based algorithms are superior to constraint-based algorithms, an observation in agreement with [10]. Specifically, the score-based HC and TABU algorithms claimed the top two spots for overall performance, both with and without data noise (although TABU ranked 4th by SHD under data noise). Interestingly, while TABU topped the traditional synthetic tests that did not incorporate data noise, HC topped the noisy experiments. HC and TABU are based on the same implementation and differ only in search strategy, where TABU represents a theoretical improvement over HC by escaping some of the suboptimal graphs that HC cannot. However, the empirical evidence suggests that data noise has misled TABU into escaping towards less accurate regions despite producing a better fit than HC (at least in terms of BIC score). Equally interesting is the observation that HC, which is the simplest and fastest structure learning algorithm, outperformed all other, and indeed more sophisticated, algorithms under data noise. These results provide empirical evidence that a higher fitting score does not imply a more accurate causal graph, and support Pearl's view that curve fitting alone is insufficient for causal discovery [2][9]. ILP follows in 3rd position (although ranked 5th to 6th by SHD), both with and without data noise. It is important to reiterate that ILP is the only exact learning algorithm investigated in this study and, relative to the approximate algorithms, it requires extensive time to complete its search. As stated in Section 6, while structure learning time was restricted to six hours per run, we took advantage of ILP's option to retrieve the best graph discovered by the end of the 6-hour time limit.
This is not an option offered by other algorithms and hence, an argument can be made that this decision provided an unfair advantage to ILP. However, since this paper focuses on assessing the usefulness of these algorithms in real-world settings, we considered ILP's feature to be useful in the real world and have taken advantage of it in the experiments. Regardless, it is worth highlighting that even in cases where ILP completes exact learning before the time limit is reached, and specifically on networks with max in-degree 3 (its default hyperparameter) such as Asia, Sports, and Property, the highest scoring graph is only occasionally the most accurate graph across the algorithms tested. This result further supports the notion that a higher fitting score does not necessarily imply a more accurate causal graph.
SaiyanH follows in 4th position, both with and without data noise (although ranked 11th by SHD), and claims the top hybrid learning spot according to the F1 and BSF metrics. H2PC and WINASOBS follow, with overall rankings ranging between 5th to 6th and 5th to 7th respectively, with and without data noise (although ranked 3rd to 4th and 4th to 5th respectively by SHD). FGES and MMHC, both of which are often assumed to be among the top structure learning algorithms in this field of research, ranked 8th and 6th to 11th respectively, with and without data noise (although 8th to 10th and 1st to 6th by SHD). While MMHC gained impressive ground against all other algorithms under the noisy experiments, its performance proved to be the most unstable of all the algorithms. Overall, our results on MMHC are not in agreement with the results in [28] and support the results in [10].
GFCI and FCI, which are often considered the best algorithms for causally insufficient systems, ranked 6th to 7th and 11th respectively without data noise, and 6th to 7th and 9th respectively with data noise. While these algorithms have been specifically designed for causal structure learning with latent variables, our results provide no evidence that GFCI and FCI gain an advantage over other algorithms under these conditions. On the other hand, FCI is the top constraint-based algorithm in this study, with PC-Stable close behind. Lastly, Inter-IAMB, GS, RFCI-BSC and NOTEARS produced disappointing results.
A limitation of this study is that we did not explore the option of incorporating knowledge-based constraints into the structure learning process of the algorithms. While we initially considered exploring such constraints, not all algorithms support knowledge, nor do all those that support knowledge support the same types of constraint, and this caused difficulties in setting up the experiments in a fair manner. On the other hand, disregarding knowledge-based constraints enabled us to increase the number of experiments under data noise and to provide a broader picture of its impact on the accuracy of the learned graphs. Still, knowledge-based constraints represent a desirable feature when applying these algorithms to real-world problems, and they are an area for future investigation.

APPENDIX A: SYNTHETIC DATA DETAILS

i. Table A1 indicates which experiments could not be performed. Specifically, in the case of Asia, experiments S (i.e., merging states) could not be performed because all variables in Asia are Boolean. In the cases of Asia and Sports, experiment L5 (i.e., 5% latent variables) could not be performed because both Asia and Sports consist of fewer than 10 nodes; i.e., a single latent variable corresponds to a rate of approximately 10% (i.e., L10) and hence, L5 becomes redundant.

ii. Lower sample size datasets are sub-datasets of the 1,000k dataset. For example, dataset N with 100k samples corresponds to the first 100k samples of dataset N with 1,000k samples.
iii. In experiments M (i.e., missing values) and I (i.e., incorrect states), each data value has a chance (5% or 10%) of being randomised. This means that some variability is expected between datasets in the total rate of noise, and the variability increases with lower sample sizes and fewer variables. The most extreme example involves experiment M10 in Asia, where the rates of missing values are 6.63%, 9.54%, 9.91%, 9.98%, and 10.02% for sample sizes 0.1k, 1k, 10k, 100k, and 1000k respectively.
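The variability described in item iii can be reproduced by simulating the randomisation process. A minimal sketch, assuming each cell is independently selected with probability 0.10 (the actual corruption procedure and datasets are those described above; the function and its parameters are illustrative):

```python
import random

def corruption_rate(n_rows: int, n_cols: int, p: float = 0.10, seed: int = 1) -> float:
    """Select each cell independently with probability p and return the
    realised rate of selected cells, which fluctuates around p."""
    rng = random.Random(seed)
    selected = sum(rng.random() < p for _ in range(n_rows * n_cols))
    return selected / (n_rows * n_cols)

# Asia has 8 variables; the realised noise rate converges to 10% as samples grow.
for rows in (100, 1_000, 100_000):
    print(rows, round(corruption_rate(rows, 8), 4))
```

With only 100 rows the realised rate can deviate noticeably from 10%, which mirrors the 6.63% observed for M10 in Asia at 0.1k samples.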
iv. Experiments S (i.e., merged states) and L (i.e., latent variables) involve the manipulation of variables rather than data values. In these experiments, approximately 5% and 10% of the variables are manipulated. Since experiment S cannot be performed on Boolean variables, the manipulation is restricted to multinomial variables. Table A2 presents the number of variables manipulated under each of the S and L experiments.

v. Datasets that involve more than one type of noise have had each type of noise simulated in the following order: L → S → I → M. For example, in experiment cMS, dataset N is first manipulated with S and then with M.
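The fixed ordering in item v can be sketched as a simple pipeline. The operators below are placeholders that merely record which manipulation ran (the real operators would hide variables, merge states, randomise values, or blank values); only the ordering logic is the point:

```python
def apply_noise(dataset, codes, operators):
    """Apply the requested noise operators in the fixed order L -> S -> I -> M,
    regardless of the order in which the codes are listed (e.g., for 'MS',
    S is applied before M)."""
    for code in "LSIM":               # fixed application order
        if code in codes:
            dataset = operators[code](dataset)
    return dataset

# Placeholder operators that record which manipulation ran.
trace = []
ops = {c: (lambda d, c=c: trace.append(c) or d) for c in "LSIM"}

apply_noise([["yes", "no"]], "MS", ops)
print(trace)  # ['S', 'M'] -- S runs before M, per the fixed ordering
```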
vi. Experiment cMISL incorporates all types of noise at their default rate of 5%. However, for the Asia case study, cMISL does not include experiment S (refer to Table A1). Moreover, for the Asia and Sports case studies, cMISL includes experiment L10 instead of L5 (refer to Table A1).

[Table C1 caption fragment: failure information for the first eight experiments (from experiment N to L5), where F1 represents fail event "Algorithm does not complete within 6 hours", F2 represents fail event "Algorithm runtime error for unknown reason", and F3 represents fail event "Algorithm runtime error: Out of memory". Higher failure counts are shown with a darker red background.]

[Table D2 caption fragment: SHD scores of the algorithms over all experiments N, where F represents a failed attempt by the algorithm to produce a graph (refer to Appendix C). The results are presented per case study per sample size, where a sample size of 0.1 corresponds to 0.1k data samples and so forth. A darker green background corresponds to higher accuracy, whereas a darker red background corresponds to lower accuracy.]

[Fig D1 caption fragment: relative rankings of the algorithms (as in Table 8) ordered by case study (separated by a red dotted line). Each case study is ordered by the number of nodes in the true graph, and each experiment within a case study is ordered by sample size. A lower rank indicates better performance. These results are based on experiment N (i.e., no noise). For NOTEARS, the F1 rankings are identical to the BSF rankings.]

Table E1. Average (A) and overall (O) ranked performance of each algorithm (highlighted with a yellow background), over all case studies and sample sizes in the noisy experiments M5, M10, I5, I10, and S5, as determined by each of the three metrics.