BANSHEE–A MATLAB toolbox for Non-Parametric Bayesian Networks

Bayesian Networks (BNs) are probabilistic, graphical models for representing complex dependency structures. They have many applications in science and engineering. Their particularly powerful variant – Non-Parametric BNs – are for the first time implemented as an open-access scriptable code, in the form of a MATLAB toolbox ‘‘BANSHEE’’. 1 The software allows for quantifying the BN, validating the underlying assumptions of the model, visualizing the network and its corresponding rank correlation matrix, and finally making inference with a BN based on existing or new evidence. We also include in the toolbox, and discuss in the paper, some applied BN models published in most recent scientific literature.


Motivation and significance
Bayesian Networks (BNs) are graphical, probabilistic models for representing high-dimensional and complex dependency structures [1][2][3][4]. A BN consists of a directed acyclic graph (DAG), in which nodes (representing random variables) are connected * Corresponding author. 1 BANSHEE stands for 'Bayesian Networks in Scholarly Endeavours'. However, a banshee is also, in Irish folklore, a female spirit whose appearance is a warning about impending death. Bayesian Networks have been extensively used in risk analysis in fields ranging from aviation safety through natural hazards to building fire safety, hence they also warn against possible dangers, somewhat similarly to a banshee.
with arcs representing direct dependency between nodes. The direct predecessors of a node are called parents, and the direct successors are known as children. Each node with no parents has a marginal distribution specified, while each child node is associated with a conditional distribution. The strength of the dependency between nodes is informed by the conditional distributions in the BN [5,6]. BNs have been gaining popularity for several reasons. They are flexible and are able to present the dependence structure even for very large models. Different variants of BNs handle various data types, while the quantitative information needed to build a BN can be obtained both from data or through expert judgement [5,[7][8][9][10][11]. Here, we present a MATLAB toolbox ''BAN-SHEE'' for a particular variant of BNs -non-parametric Bayesian Networks (NPBNs). This type of BNs was introduced by Kurowica and Cooke [12] and has a major advantage of using empirical (non-parametric) marginal distributions of continuous variables. No discretization assumptions or a particular parametric distribution is therefore required for defining the continuous variables (though parametric distributions may still be used), eliminating a significant source of inaccuracies in BN models. For modelling the dependency structure, copulas [13] are used in NPBNs.
The applications of NPBNs are numerous and diverse. One of the first applications was in engineering, with often very large models applied to improve earth dam safety [9,10], aviation safety [14] and infrastructure reliability [15]. Later, NPBNs were introduced in geosciences, in diverse subfields such as hydrology [16], geomorphology [17], seismology [18] and volcanology [19]. In social sciences, they were used in such remote applications as public health [20] and climate-change mitigation policies [21]. A comprehensive review of NPBN applications was provided by Hanea et al. [5] and an up-to-date list of many papers using the method is available online [22].
However, an important limiting factor in making new analyses with NPBNs is software availability. At present, there is only one dedicated software solution for NPBNs, in the form of a closed-source, proprietary package Uninet by LightTwist Software [22]. Our toolbox BANSHEE is the first implementation as a standalone, open-access code, which could allow researchers to provide transparent, reproducible results with NPBNs. BANSHEE itself originates from MATLAB scripts [23,24] that were also used by the authors to implement NPBNs in some recent publications [16,17,25]. While one of the main advantages of Uninet is its Graphical User interface (GUI) that makes it easier to construct and experiment with the model, BANSHEE allows quickly embedding an NPBN model into MATLAB scripts through a set of easy-to-use functions, including diagnostic tools for analysing the NPBN's underlying assumptions.

Software description
BANSHEE consists of a set of MATLAB functions. The software allows for quantifying the NPBN, analysing the underlying assumptions of the model, visualizing the network and its corresponding rank correlation matrix, and finally making inference with a NPBN based on existing or new evidence. Examples are provided as standalone scripts. Real-world example applications from literature are included to better explain the method and show the power of the toolbox.

Software architecture
The most important component of the toolbox is the code (bn_rankcorr) implementing the NPBN method described by Hanea et al. [5,26] and Kurowicka and Cooke [4]. As noted in the introduction, NPBNs make no assumptions on the marginal distributions (i.e. distributions of the nodes). The arcs are associated with one-parameter, conditional copulas [13]. Loosely, a bivariate copula, or simply a copula for the purposes of this paper, is a joint distribution with uniform margins in [0, 1]. Multivariate joint distributions can be written in terms of the univariate marginal distribution functions and a copula. For the bi-variate case one can write H (x, y) = C (F X (x), G Y (y)), where H (x, y) is a joint distribution with marginal distributions F X and G Y . Therefore, C is a copula taking values from I 2 = ([0, 1] × [0, 1]). For copulas with only one parameter there is a one-to-one relation between the parameter and Spearman's rank correlation coefficient (Eq. (3)). Copulas allow the investigation of probabilistic dependence separately from the effect of the one-dimensional margins and hence their importance in the NPBN framework.
The (conditional) copulas are assigned to arcs according to the (non-unique) ordering of the parent nodes. For a particular choice of copulas, dependency structure and a set of onedimensional marginal distributions, the joint distribution of the NPBN is uniquely determined [26]. Any copula realizing all correlations in [−1, 1] can be used, while the (conditional) independent copula realizing all (conditional) independence relationships encoded by the graph of the NPBN may be used. Here we implement the NPBNs with the Gaussian (normal) copula, which does not present tail dependence (or other asymmetries) between variables. The joint density of a BN with n variables is factorized as follows [5]: where Pa(i) is the set of parent nodes of X i , with i = 1, . . . , n. In case there are no parents, The BN's structure (directed acyclic graph -DAG) is typically expert knowledge-driven, therefore we do not include any automated way of deriving the DAG. The user defines the nodes and arcs, together with the ordering of the parent nodes as a cell array. An example DAG, as defined in BANSHEE, with three nodes and two arcs is as follows: P{1} = []; % first node, no parents P{2} = 3; % second node (third node is the parent node) P{3} = 1; % third node (first node is the parent node) where P{i} is the ith node of the DAG. Empirical distributions from the user's data are assigned to the nodes. The usual estimator of the cumulative probability distribution is applied here: where (x i , . . . , x n ) are the samples of a random variable, 1 {x i ≤x} = 1 if {x i ≤ x} and zero otherwise. Once the DAG is defined and the user's data are provided, the NPBN rank correlation matrix quantifying the dependency structure is estimated. As noted above, a Gaussian copula parametrized by Spearman's rank correlation is applied here to define the correlation between two connected nodes. Spearman's correlation is Pearson's product moment correlation coefficient computed with the ranks of the random variables. In such a specific case, the rank correlation of two random variables (nodes) X i and X j is as follows: where u, v are the margins of one-parameter bivariate copula C θ .
The conditional Spearman's rank correlation of X i and X j given the random vector Z = z is the Spearman's rank correlation calculated in the conditional distribution of X i and X j given the with the rank correlation: where the index j is in the non-unique sampling order. It is worth noticing that the m parent nodes of X i in Eq. (4) can be permuted such that anyone can be the first parent, the second, and so on until the mth index. The resulting correlation matrices parametrizing the BN will in general differ according to the selected sampling order unless the BN is given by a complete (saturated) graph. Given a directed acyclic graph with n nodes described by invertible distribution functions and the conditional independence relationship between them modelled via the NPBN, the specification in Eq. (4) guarantees that the joint distribution of the n variables (nodes) is uniquely determined. For details, see [5,26]. The BN rank correlation matrix is computed in BANSHEE using bn_rankcorr function, requiring a defined DAG (as a cell array) and an adequate number of variables (as a matrix). The function utilizes as default a matrix of data that forms a set of actual records, but includes an option to compute BN rank correlation without such data, which will be discussed in Section 3. The assumptions underlying the BN quantification can be tested by measuring (1) the degree of agreement of the Gaussian copula with the data (Cramer-von Mises statistic M) and (2) the degree to which the chosen conditional independence statements implied by the BN agree with data (d-calibration score). In the first case, we use the sum of squared difference between the empirical and parametric copulas [27]. The Cramer-von Mises statistic M for a sample of length n is computed as follows: is the empirical copula and Cθ n (u) is a parametric copula with parameterθ n estimated from the sample. Notice that M is the sum of squared difference between the empirical copula and a particular parametric estimator. Hence a low value of M is desired over a high value. This observation may be used as a rule of thumb for assessing goodness of fit for copulas. Further details are given in [27].
The function cvm_statistic computes M for four copulas (Gaussian, Gumbel, Clayton, Frank) 2 and allows to visualize the results. The second test, called d-calibration score [11], consists in comparing the empirical correlation matrix (the data) with both the BN rank correlation matrix and the empirical normal rank correlation matrix (the model).Testing the distance between empirical and normal matrices informs whether the joint distribution of the variables defined by the user can be assumed to be Gaussian. This means that the Gaussian copula is a fair choice for modelling the bivariate dependence structure between variables. The distance between BN and normal matrices informs the user if the assumption of a joint normal distribution is valid for a particular non-saturated configuration of the NPBN. The distance between empirical and BN matrices provides information on whether the BN chosen is a fair model for the data. This is similar to the tests described by Hanea et al. [5]. The distance between matrices can be computed in different ways, and the user can choose between four methods [11,28] in the gaussian_distance function. The d-calibration score ranges between 0 (the two matrices are different) and 1 (the two matrices are similar).
Finally, inference can be made with the quantified BN using the function inference. The function computes the uncertainty distribution of nodes other than those that the user conditionalized in, i.e. provided specific values (posterior evidence). Algorithmically, the inference is done through sampling the NPBN; the process is described in more detail by Hanea et al. [26].

Software functionalities
The toolbox contains five MATLAB functions, two standalone MATLAB scripts with examples and three MATLAB functions implementing published NPBN models ( Table 1). The functions are 2 Those one-parameter copulas are implemented in standard MATLAB functions. The user might expand the code with the two-parameter t copula that is also available in MATLAB, while other copulas require custom code (e.g. Plackett and Joe-Clayton copula functions from Andrew Patton's copula toolbox for MATLAB).
interconnected through a common input data structure. The user can build a NPBN following a few steps: (1) upload a dataset of interest (DATA); (2) define a DAG in PARENTCELL based on a prior knowledge of the dependence between the variables; (3) run the function bn_rankcorr to calculate the BN rank correlation matrix R; (4) run the function inference, given R and DATA as input, to obtain the joint distribution function of the variables and the conditional distribution of a variable given the remaining. For inference, the nodes to be conditionalized are specified by the user together with the values of those nodes. The user can also modify certain aspects of the calculation such as sampling size, interpolation methods and type of output. The BN dependency structure can be visualized with a separate bn_visualize function, which allows naming the variables e.g. for use in research publications. The function bn_rankcorr and the two diagnostic tools, gaussian_distance and cvm_statistic, all have an option to generate a plot.
All functions are collected in the example script, which apart from running the functions and generating all possible plots (Fig. 1) includes detailed descriptions of each step of the procedure. A second example, example_udrm implements a particular example of a model where the (conditional) correlations between variables are taken from an external source such as expert judgement (Section 3). The real-life example models are discussed in Section 4. All scripts are described in detail in the quick start guide included in the toolbox.

Example 1
The first script, example, is constructed to predict the level of personal safety (as indicated by variable ''Safety'') employing a default MATLAB dataset cities. It contains data on nine qualityof-life indicators in 329 cities in the United States. It should be noted that the data are transformed to ranks in the procedure, so there is no need for adjusting the input data e.g. through normalization or logarithmic transformation. The variables in the DAG (Fig. 1b) were identified by firstly creating an expert-knowledge derived DAG, as is common in Bayesian Network models. Then, bn_rankcorr function was applied to compute a BN rank correlation matrix for the defined DAG (Fig. 1b). The model was then iteratively modified to remove the least-correlated arcs between nodes, until only significant and theoretically explainable variable pairs remained. The final example model includes five nodes: four explanatory variables -''Climate'', ''Economics'', ''Recreation'' and ''Arts'', and one variable of interest -''Safety''.
The validity of the Gaussian copula assumption is then tested. The Cramer-von Mises statistic shows that the Gaussian copula achieves best fit for the majority of variable pairs according to cvm_statistic function (Fig. 1c). The d-calibration score (gaussian_distance function) gives a mixed picture: the dcalibration score of the empirical rank correlation matrix and the empirical normal matrix (vertical red line in Fig. 1d -left panel) falls outside the 90% confidence interval of the d-calibration score of the normal rank correlation matrix (red circles, Fig. 1d -left panel) estimated via bootstrapping. This means that the determinant of the empirical rank correlation matrix is different than the determinant of the normal empirical rank correlation matrix, and so the Gaussian copula might not be the best choice for modelling the bivariate dependences. However, the d-calibration score of the BN's rank correlation matrix and the empirical normal matrix is well within the 90% confidence interval of the d-calibration score of the empirical normal matrix (Fig. 1d -right panel). This second d-calibration score is more important, as it shows that the joint normal copula is valid for the particular (non-saturated) BN structure. It should be noted that the d-calibration test is rather severe for large datasets [5]: the higher the number of variables the lower is the determinant, which makes the comparison between determinants numerically more difficult. Once the NPBN has been built, predictions of the level of safety in cities (or any other variable) given a combination of the remaining variables can be tested against observations, using the inference function.

Example 2
In a second example, example_udrm, a BN model without an actual dataset -User-Defined Random Model (UDRM) -is built. This example was created to allow implementation of BNs where the (conditional) correlations are obtained through structured expert judgement elicitation [8,11,29,30]. Indeed, BN models in engineering often require such a method of quantification due to lack of data [5,11]. The user should define the DAG, the marginal distribution at each node, and the (conditional) rank correlations on each arc. Based on these information, the BN rank correlation matrix is calculated via the bn_rankcorr function. The ex-ample_udrm script generates possible axle loads configurations from the defined BN model to estimate the probability of the vehicle weight accounting for (1) the type of vehicle (2-, 3-, 4-, 5-axle) and (2) the dependences existing between the axel loads of a single vehicle. Running the example will allow to visualize the correlation matrix and BN structure (Fig. S1). Moreover, the code contains a section in which the dependence between the axle loads is ignored (independent case). Fig S2 compares the probability of the vehicle weight estimated based on observations (blue dots), simulations from a dependent model (BN modelred dots), and simulations from an independent model (green dots). This plot highlights that considering the strong dependency between the load on different axels of vehicles is important in investigating maximum traffic load on bridges or other stretches of roads, as opposed to considering them independent. It should be noted that this example uses real-life data on vehicle axle loads from Weigh-In-Motion traffic monitoring system in the Netherlands [15,31]. It is an simplified version of the full, 705node model, which has been used to advise the Dutch Ministry of Infrastructure and Environment.

Impact
The toolbox includes three real-life example models from the authors' recent geoscientific applications (Fig. S3). Each model consists of a function script and a dataset containing the definition of the DAG, marginal distributions on the nodes, and a BN rank correlation matrix. The user only needs to specify which nodes are to be conditionalized and provide their data on the corresponding variables. The first model, predict_river_ discharge, estimates the value of maximum annual discharge in European rivers [16]. This enables generating an annual time series of extreme discharge in locations where river gauge measurements are not available. Hydrological modelling of discharge on a European scale is complex and very time-consuming, but it has been shown that the NPBN model is an accurate and efficient substitute [16]. The model was applied for the whole of European river network using our MATLAB toolbox and served as input for pan-European flood hazard modelling [32]. The user can apply the model to any location and year providing e.g. the value of catchment size, maximum daily precipitation during the year or percentage of the catchment covered by lakes. Usage of the function is highlighted in the wrapper script exam-ple_hydro_simulation.
Another model, predict_floor_space, was conceived as a tool to substitute for missing information on the size of residential buildings [25]. Knowing how big is a house, especially its height and useful floor space area, is fundamental for estimating its exposure and vulnerability to natural hazards. As 3D models of cities are still scarce [33], this NPBN model was constructed based on OpenStreetMap building footprints combined with several pan-European raster datasets. The model was applied using this MATLAB toolbox to estimate exposure of residential buildings in several case studies of past floods [34]. As with the model for extreme discharges, only openly-available datasets are used as explanatory variables, hence the user can easily collect data to apply the model within Europe (they were not validated for other countries so far).
The final example is a model of storm-induced coastal erosion in Poland (predict_coast_erosion), which was created based on field observations of cliff retreat in Poland and Germany [17]. The model reproduces the complex dependency structure of the processes involved, as not only meteorological and hydrological factors impact the cliff and the beach below it, but erosion in one part of the profile triggers erosion in another part and so on. As with the other models, MATLAB was used extensively in data preand postprocessing, hence the implementation of NPBN code in the same language enabled a quick embedding of the BN model into the workflow.

Conclusions
BANSHEE is the first openly available tool for Non-Parametric Bayesian Networks. As the examples contained in the toolbox have highlighted, the actual and potential applications are numerous. We hope that our toolbox (1) will increase the popularity of NPBNs as a powerful statistical method, (2) will support researchers committed to sharing their code and data openly and (3) will enhance implementation of NPBN models beyond science and into engineering and administrative practice.
BANSHEE is intended to be developed further, especially with improvements to analytical and visualization tools. New applicable BN models will be added as part of ongoing and future paper submissions, with novel BN flood damage models for the residential and commercial sectors expected to be added first [34,35]. Linking our toolbox with a MATLAB toolbox for structured expert judgement ANDURYL [36,37] is also envisioned.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.