Exploiting the full potential of Bayesian networks in predictive ecology

Although ecological models used to make predictions from underlying covariates have a record of success, they also suffer from limitations. They are typically unable to make predictions when the value of one or more covariates is missing at testing time. Missing values can be estimated, but the methods are often unreliable and can result in poor accuracy. Similarly, missing values in the training data can hinder parameter estimation for many ecological models. Bayesian networks can handle these and other limiting issues, such as highly correlated covariates. However, they are rarely used to their full potential. Indeed, Bayesian networks are commonly used to evaluate the knowledge of experts by constructing the network manually and often (incorrectly) interpreting the resulting network causally. We provide an approach to learn a Bayesian network fully from observed data, without relying on experts, and show how to appropriately interpret the resulting network, both to identify how the variables (covariates and target) are interrelated and to answer probabilistic queries. We apply this method to the case study of a mountain pine beetle infestation and find that the trained Bayesian network has a predictive accuracy of 0.88 AUC. We classify the covariates as primary and secondary in terms of their contribution to the prediction, and show that the predictive accuracy does not deteriorate when the secondary covariates are missing and degrades to only 0.76 when one of the primary covariates is missing. As a complement to previous work on constructing Bayesian networks by hand, we show that if, instead, both the structure and the parameters are learned only from data, we can achieve more accurate predictions as well as generate new insights about the underlying processes.

context of climate change, and has grown exponentially since the 1990s, given the quality and quantity of available ecological data (Mouquet et al., 2015; Purves et al., 2013). Simple and advanced statistical and machine-learning approaches have been used to this end, and some have reported great success. Commonly applied models include mechanistic equations, individual-based models, GLMs (Aukema et al., 2008; Preisler et al., 2012), generalized additive models, MaxEnt (Merow et al., 2013), decision trees, support vector machines and artificial neural networks (Marmion et al., 2009; Youssef et al., 2016).
These standard models, however, lack some practical features, which calls their use as predictors into question. They are unable to make predictions when the value of a covariate is missing, a typical issue because some covariates are expensive or logistically impossible to collect. Imputing the missing values can be unreliable, as modelling assumptions are needed to 'guess' them. These assumptions may even conflict with those posed by the original model using the imputed values. Another approach is to produce a model that does not involve any covariate that is ever missing. This can be problematic as well, because (a) those covariates are not fixed in the area of interest: the value of a covariate may be missing at location A, but present at location B, and the opposite may hold for another covariate; and (b) even if a covariate is only measured in the laboratory and never in the field, incorporating it in the model can still reveal its effect on the response variable. Most models also cannot reveal the co-effect of more than one covariate on the response variable, and some do not allow for statistical inference. Moreover, those that are used for statistical inference cannot handle correlated covariates.
Bayesian networks (BNs) can deal with these issues. They are directed acyclic graphs whose nodes are the response variable and covariates, and whose links show how these nodes are related to each other. Both links from covariate to response and from covariate to covariate are allowed in the network. BNs are graphical, and hence often simpler to understand than complex systems of equations (e.g. Bode et al., 2017; Eklöf et al., 2013; Rish et al., 2009; Troyanskaya et al., 2003), deepening our understanding of natural phenomena as well as allowing for accurate predictions.
However, there are two main issues with how BNs are typically applied in practice: (a) they are rarely used to their full potential and (b) they are misinterpreted as causal networks. The common practice of applying BNs is to manually construct the structure (network), based on the knowledge of experts, then either set the parameters manually or learn them from data, and finally, read the links as causal relationships in the resulting BN. Although useful in assessing the qualitative descriptions of an ecological process, this approach relies heavily on our prior understanding of the process, and hence, is only as good as our understanding. If, instead, both the structure and parameters of the BN are learned only from the data, there will be room for more accurate predictions as well as new insights about the process. Moreover, BNs are not causal networks, but essentially a set of conditional (in)dependencies that factorize the joint probability distribution of all of the variables. Causal deductions, hence, may not be made, although some hypotheses may be tested.
We complement previous studies on BNs that used the knowledge of experts (Chen & Pollino, 2012; Marcot et al., 2006) by focusing on learning the structure, and on proper model interpretation in the form of conditional probabilistic inferences rather than causal deductions. The goals of this paper are (a) to discuss the advantages of different ecological modelling approaches and highlight what BNs can offer in this context; (b) to provide a systematic approach for training a BN completely from data, without incorporating the prior knowledge of experts, and then evaluating and interpreting the resulting BN; and (c) to apply this method to the case study of a mountain pine beetle (MPB) outbreak.

| ADVANTAGES OF BAYESIAN NETWORKS
In what follows, we first briefly introduce BNs and then compare them with other modelling approaches in predictive ecology (Table 1). Here, we focus on the 'typical' situation with each model; for example, the prediction accuracy of a properly trained BN being typically high does not imply that it is always higher than, or even as high as, that of other highly accurate models.

| Introduction to Bayesian networks
Given a set of n random variables 𝒵 = {Z_i}_{i=1}^n (consisting of the response variable and n − 1 covariates), a BN factorizes the joint probability P(𝒵) according to a specified directed acyclic graph whose nodes are the variables 𝒵, following the equation

P(𝒵) = ∏_{i=1}^{n} P(Z_i | Pa(Z_i)),

where Pa(Z_i) denotes the parents of Z_i in the graph, that is, those nodes that have an outgoing edge that leads to Z_i (Figure S1). The individual factors P(Z_i | Pa(Z_i)) are known as conditional probability distributions (CPDs; Koller & Friedman, 2009). A BN encodes the claim that given the Markov blanket MB(Z_i) of a node Z_i, which is the set of its parents, children and the other parents of its children, the node becomes independent from the rest of the nodes, written (Z_i ⊥ 𝒵 \ ({Z_i} ∪ MB(Z_i)) | MB(Z_i)).

As an example, consider two species, with densities denoted by X_1 and X_2, each corresponding to a node in a BN. We could link these two distributions using either a directed edge from X_1 to X_2, decomposing the joint density distribution of the two species as P(X_1, X_2) = P(X_1)P(X_2 | X_1), or a directed edge from X_2 to X_1, resulting in P(X_1, X_2) = P(X_2)P(X_1 | X_2). The first relies on the distribution of X_1 and the conditional distribution of X_2 given X_1, and the reverse holds for the second. Both of these models can be used to make acceptable predictions if one can estimate the parameters P(X_2 | X_1) and P(X_1 | X_2) effectively. However, neither of the models is causal: neither X_1 nor X_2 causes the other. The edge simply means probabilistic dependence and dictates the factorization of the joint distribution. Now, assume that the species distributions are each partly 'caused' by a third variable, vegetation, denoted by V. Should we construct a BN based on this 'causal understanding', we would add the node V and link it to both X_1 and X_2 without connecting the two.
This results in the following joint probability distribution of the two species and vegetation:

P(X_1, X_2, V) = P(V)P(X_1 | V)P(X_2 | V).
However, this is not the only way to model the joint distribution.
Depending on the training data, one may obtain a more accurate model in terms of data fitting by also linking X_1 to X_2 (or vice versa), resulting in

P(X_1, X_2, V) = P(V)P(X_1 | V)P(X_2 | V, X_1).

This might be because vegetation is not the only cause of the two, and another factor, say temperature, also plays a role; it is not included in our variable list but is highly correlated with X_1, and hence, linking the two provides a better estimation of X_2. One may yet use a different model, where X_1 and X_2 are not linked to each other but both are linked to V, resulting in

P(X_1, X_2, V) = P(X_1)P(X_2)P(V | X_1, X_2).

This is particularly useful if we know the distributions of the species densities, that is, P(X_1) and P(X_2), but not that of vegetation, P(V), and we know how vegetation can be estimated based on the distribution of the two species, that is, P(V | X_1, X_2). None of the links in this model are causal.
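To make the factorizations above concrete, the following minimal sketch (in Python, with hypothetical CPD values for binary variables) encodes the vegetation model P(X_1, X_2, V) = P(V)P(X_1 | V)P(X_2 | V) as plain probability tables and verifies that the implied joint distribution is properly normalized.

```python
from itertools import product

# Hypothetical CPDs for binary variables V, X1, X2 (0 = low, 1 = high).
P_V = {0: 0.4, 1: 0.6}
P_X1_given_V = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}  # P_X1_given_V[v][x1]
P_X2_given_V = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.1, 1: 0.9}}  # P_X2_given_V[v][x2]

def joint(x1, x2, v):
    """P(X1, X2, V) = P(V) * P(X1 | V) * P(X2 | V)."""
    return P_V[v] * P_X1_given_V[v][x1] * P_X2_given_V[v][x2]

# The factorized joint must sum to 1 over all assignments.
total = sum(joint(x1, x2, v) for x1, x2, v in product((0, 1), repeat=3))
print(round(total, 10))  # -> 1.0
```

Swapping in the other two factorizations only changes the `joint` function; the bookkeeping stays the same.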

| Generative versus discriminative learning
Consider the response variable Y and the set of covariates (features) 𝒳 that are used to estimate Y. One may pursue either of two learning tasks with respect to these variables: generative, that is, to learn the joint probability distribution P(Y, 𝒳), or discriminative, that is, to learn the conditional probability P(Y | 𝒳). On the one hand, the joint probability distribution P(Y, 𝒳) represents the probability of any given assignment to all of the variables Y and 𝒳 in the data, or loosely speaking, how all the variables are related to each other.
On the other hand, the conditional probability P(Y | 𝒳) represents the probability of Y happening given 𝒳, or in other words, in which cases Y happens. So discriminative learning focuses only on the probability of the response variable, whereas generative learning also reveals the probability of the covariates. For example, on the one hand, an ecologist may be interested in two species' co-occurrence, which is a generative question given by the distribution P(X_1, X_2), where X_1 and X_2 are the densities of the species. On the other hand, the same ecologist may be interested in whether the density of species X_1 (as a response variable) can be estimated using that of species X_2 (as the covariate), which is a discriminative question, given by P(X_1 | X_2).
Note that knowing the 'true' joint distribution P(Y, 𝒳) allows knowing the conditional distribution P(Y | 𝒳). However, because small errors in estimating P(Y, 𝒳), which typically happen in practice, might lead to large errors in the associated values of P(Y | 𝒳) (Ng & Jordan, 2002), each learning task deserves its own treatment.
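The relationship between the two tasks can be made explicit: any conditional P(Y | 𝒳) follows from the joint P(Y, 𝒳) by normalization. A minimal sketch in Python, with a hypothetical joint table over binary Y and a single covariate X:

```python
# Hypothetical joint distribution P(Y, X) over binary Y and X.
P_joint = {(0, 0): 0.30, (0, 1): 0.10, (1, 0): 0.20, (1, 1): 0.40}

def conditional(y, x):
    """Discriminative quantity P(Y = y | X = x), derived from the joint table."""
    p_x = sum(p for (yy, xx), p in P_joint.items() if xx == x)  # marginal P(X = x)
    return P_joint[(y, x)] / p_x

print(round(conditional(1, 1), 2))  # -> 0.8
```

The converse direction does not hold: a model trained only for P(Y | X) carries no information about the marginal distribution of X.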
Although potentially capable of modelling the joint probability distribution, mechanistic models are not commonly used for this purpose as it would require a great deal of prior knowledge of the process.
Roughly speaking, none of the models in Table 1, except for BNs, are effective at generative learning.

| Missing data

A common approach to handling missing values in the training dataset is casewise deletion, that is, to remove the entire instance (observation) from the dataset if the value of one or more variables is missing (Harrell, 2015). Casewise deletions can lead to bias in the estimated parameters if the degree to which the variable's value is likely to be missing is correlated with the actual range of values, for example, when a temperature sensor fails to record values below −10°C. Casewise deletions also result in losing the information provided by the remaining variables in the instance with missing values. Therefore, imputation is often used to estimate the missing values, which can be as simple as using the variable's mean or the variable's value from a similar instance, or can be more complex, such as using the chained equation method (Harrell, 2015). However, in essence, imputation presumes a model for the variables with missing values, which may conflict with the actual model that is going to be trained on the imputed dataset, resulting in a poor predictor. As for BNs, methods such as expectation maximization (EM) and structural EM can be used to learn the parameters and structure, without imputation or casewise deletion (Koller & Friedman, 2009).
Should the testing dataset contain missing values, almost all models fail to make predictions, as each covariate has to take some value, that is, it cannot be left as 'NA' (not available). Imputation comes with the above-mentioned shortcomings. Another alternative is to use expert knowledge to obtain probable limits for the covariates with missing values, and run the model on those limits to get a probable range for the prediction. For example, in climate change models, the exact concentration pathway of a covariate such as greenhouse gas emissions that will be followed in the future is unknown. Therefore, models use a series of scenarios, ranging from the best to the worst case, in order to predict changes in CO2 emissions and temperatures (Pachauri et al., 2015). There is, however, no need for these rough approximations when applying BNs. By marginalizing over the unobserved covariates, BNs can predict the target variable based on any observed subset of the covariates.
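As a hedged illustration of prediction under missing covariates, the sketch below defines a small hypothetical discrete BN (structure X_2 → X_1 → Y, with made-up CPD values) and computes P(Y = 1 | observed covariates), marginalizing over whichever covariates are unobserved:

```python
from itertools import product

# Hypothetical discrete BN over binary Y, X1, X2 with structure X2 -> X1 -> Y.
P_X2 = {0: 0.5, 1: 0.5}
P_X1_given_X2 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}  # [x2][x1]
P_Y_given_X1 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.1, 1: 0.9}}   # [x1][y]

def joint(y, x1, x2):
    return P_X2[x2] * P_X1_given_X2[x2][x1] * P_Y_given_X1[x1][y]

def predict_y(x1=None, x2=None):
    """P(Y = 1 | observed covariates), marginalizing over any covariate left as None."""
    x1_vals = (0, 1) if x1 is None else (x1,)
    x2_vals = (0, 1) if x2 is None else (x2,)
    num = sum(joint(1, a, b) for a, b in product(x1_vals, x2_vals))
    den = sum(joint(y, a, b) for y in (0, 1) for a, b in product(x1_vals, x2_vals))
    return num / den

print(round(predict_y(x1=1), 2))  # -> 0.9  (X2 missing)
print(round(predict_y(x2=1), 2))  # -> 0.78 (X1 missing)
```

The same call pattern serves both the fully observed and the partially observed cases; no imputation step is needed.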

| Non-linearity of the relationship between the covariates and response variable
In many real-world situations, the response variable may be related to the covariates in a highly nonlinear manner. Simple models such as linear regressions, however, assume a linear relationship.
To capture some levels of non-linearity, GLMs extend the regressions by applying functions such as log( ⋅ ) and logit( ⋅ ) to the covariates. Other extensions, such as generalized additive models, fit a smooth curve to the data for each covariate, thereby allowing complex nonlinear relationships (Guisan et al., 2002). Another extension is the machine-learning method MaxEnt (Phillips et al., 2006), which is able to fit highly nonlinear response curves and estimate the probability distribution of the response variable using maximum entropy. Likewise, support vector machines classify the covariate space using hyper-planes, and hence are linear, yet can allow for some non-linearity by first transforming the space using nonlinear kernels (Scholkopf & Smola, 2001). Process-based models can also build in highly complex nonlinear relationships. In all of these cases, the relationships between the response variable and covariates must be entirely described, based on an a priori model, a constraint that is relaxed in some other machine-learning models. For example, classification trees can represent any function over the set of discrete covariates, but the function does not need to be defined beforehand. Note that this may require a very deep classification tree. Moreover, the fact that a classification tree can represent a complex function does not mean that it can be learned effectively.
Likewise, BNs are flexible in dealing with nonlinear relationships.
Over a set of discrete variables, BNs can represent an arbitrary joint probability distribution P(Y, 𝒳), and hence any arbitrary conditional distribution P(Y | 𝒳).

| Hypothesis testing, statistical inference and model selection
The objective of hypothesis testing is to make inference through deduction. It consists of devising one or more working hypotheses and challenging them with data for corroboration (Hilborn & Mangel, 1997; Stephens et al., 2005). The hypothesis to test is translated into a mathematical equation and is verified using methods such as least squares and maximum likelihood. So to test a hypothesis, one needs (a) a mathematical equation representing a biological hypothesis and (b) a test statistic with a distribution that can be determined, representing the model accuracy when confronted with data. The complexity of machine-learning models usually prevents us from obtaining a simple equation representing the hypothesis, but this is not the case for BNs.
For example, consider a process with the response variable Y and covariates X_1 and X_2. One may hypothesize that the response variable Y depends on both X_1 and X_2 but becomes independent of X_2 given X_1. Namely, the response variable depends directly only on covariate X_1, and X_1 itself depends only on X_2. This can be modelled by a BN with three nodes for the variables and two links: one from X_1 to Y and another from X_2 to X_1. The BN assigns the following likelihood to each observation of the above process:

P(Y, X_1, X_2) = P(X_2)P(X_1 | X_2)P(Y | X_1).

The null hypothesis in this case is that there is no dependence among the variables: they are mutually independent. This results in a BN without any links between the nodes, yielding the following likelihood:

P(Y, X_1, X_2) = P(Y)P(X_1)P(X_2).

Given an observation, each of the probabilistic terms on the right-hand side of the above equations is simply a parameter, provided that the BNs are discrete. Hence, the likelihood of a specified dataset for each of the BNs will be a polynomial in the parameters, the maximum of which is straightforward to derive. This allows for classical hypothesis testing, for example, by employing the likelihood ratio test to reject the null hypothesis. Alternatively, among all BNs with the nodes Y, X_1 and X_2, one may find 'the best' using multiple working hypotheses, based on the Akaike information criterion (AIC; Akaike, 1974) or the Bayesian information criterion (BIC; Schwarz, 1978).
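A hedged sketch of such a likelihood ratio test, using synthetic data rather than any real dataset: both BNs are fitted by maximum likelihood (empirical frequencies), and 2 log Λ is compared to a chi-squared quantile with degrees of freedom equal to the difference in parameter counts (5 − 3 = 2 here).

```python
import math, random
from collections import Counter

random.seed(0)
# Synthetic binary data generated from the chain X2 -> X1 -> Y (hypothetical process).
data = []
for _ in range(2000):
    x2 = random.random() < 0.5
    x1 = random.random() < (0.8 if x2 else 0.2)
    y = random.random() < (0.9 if x1 else 0.1)
    data.append((int(y), int(x1), int(x2)))

n = len(data)
c_y_x1 = Counter((y, x1) for y, x1, _ in data)
c_x1_x2 = Counter((x1, x2) for _, x1, x2 in data)
c_y = Counter(y for y, _, _ in data)
c_x1 = Counter(x1 for _, x1, _ in data)
c_x2 = Counter(x2 for _, _, x2 in data)

def ll_chain():
    """Log-likelihood of the BN X2 -> X1 -> Y at its maximum-likelihood parameters."""
    return sum(math.log(c_x2[x2] / n) + math.log(c_x1_x2[(x1, x2)] / c_x2[x2])
               + math.log(c_y_x1[(y, x1)] / c_x1[x1]) for y, x1, x2 in data)

def ll_null():
    """Log-likelihood of the edgeless BN (mutual independence)."""
    return sum(math.log(c_y[y] / n) + math.log(c_x1[x1] / n) + math.log(c_x2[x2] / n)
               for y, x1, x2 in data)

lr = 2 * (ll_chain() - ll_null())  # compare to chi-squared with 2 df
print(lr > 9.21)  # 9.21 = 0.99 quantile of chi-squared(2) -> dependence detected
```

With the strong dependence built into the synthetic data, the statistic far exceeds the critical value, so the null of mutual independence is rejected.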

| Prior knowledge of the processes
Although they can be trained autonomously, BNs allow experts to incorporate their knowledge into the network by forcing or preventing links between the nodes, and additionally by adding latent variables, that is, unobservable and often abstract variables such as habitat quality. Indeed, the spectrum of autonomy in learning BNs ranges from having neither the structure nor the parameters learned from data (both specified by experts) to having both learned from data.

| Correlations
Often two or more of the covariates in a process are highly correlated.
This hinders statistical inference, as the effects of the correlated covariates on the response variable are difficult to separate (Dormann et al., 2013; Stewart, 1987). This would happen if we were building a model, say a logistic regression, with two covariates that are both relevant to the response variable and also highly correlated with each other. Thus, typically one of the variables is eliminated beforehand, either randomly, based on ecological relevance, measurement feasibility and proximity to the mechanisms (Dormann et al., 2013; Harrell, 2015), or by using some autonomous technique such as minimum-redundancy maximum-relevance (Peng et al., 2005). However, this prevents understanding the impact of both of the correlated covariates together on the response variable. Process-based models do not suffer from correlation (except for parameter estimability), yet they require the mechanisms to be known a priori (Dormann et al., 2013). Nevertheless, a BN whose structure is learned from data does not require any prior knowledge, and reveals the differences between the correlated covariates in terms of their probabilistic dependence on the other covariates as well as on the response variable.

| Predictive accuracy
Despite the complexity of ecological systems (Anand et al., 2010; Levin, 1992), some machine-learning models are reported to make accurate predictions. In contrast, process-based and traditional statistical models are rarely able to reach the same level of accuracy (e.g. Elith et al., 2006). In particular, process-based models are known for their inability to make good predictions, although this has been challenged by, for example, Håkanson (2004), who presented an accurate mechanistic model for aquatic systems. Among machine-learning models, neural networks are acknowledged for accurate performance in highly complex tasks such as image recognition (Egmont-Petersen et al., 2002).
However, this does not mean that neural networks necessarily outperform simpler models in practice. Firstly, finding the optimal number of layers and nodes is not always practical due to limited computational resources. Secondly, proper estimation of the many parameters of a neural network often requires massive data. Hence, while asymptotically effective, neural networks may not be as successful as simple models when the available data are insufficient. Finally, if the system in question is actually simple, then neural networks, especially deep ones, can easily overfit the training data. Simpler models may be a better choice also in this case.

| Marginalizability
The notion of marginalizability addresses the possibility of separately studying how a particular covariate or subset of the covariates informs us about the response variable. We call a model marginalizable if it allows us to compute the probability of the response variable Y given any subset 𝒮 of the covariates 𝒳; that is, P(Y | 𝒮). Most predictive models allow us to compute P(Y | 𝒳), that is, the likelihood of the response given all of the covariates. However, only those that perform a generative task, that is, learning P(Y, 𝒳), allow us to marginalize the likelihood over the variables 𝒳 \ 𝒮 to obtain the likelihood conditioned on only those variables that we are interested in: P(Y | 𝒮). Therefore, only BNs and those mechanistic models developed to formulate the joint probability P(Y, 𝒳) are marginalizable.

| LEARNING BAYESIAN NETWORKS FROM DATA
We explain, step by step, how to learn and then use a BN to make predictions and acquire biological insights. Most steps are general enough to be applied by any statistical/machine-learning method in the context of model selection or prediction making.

| Setup
Ecological processes are typically modelled by a response variable Y and a set of covariates 𝒳. If the process is spatial and temporal, then each instance (observation) of the process has a unique pair of identities: (a) the time t of the instance, the unit of which indicates the frequency of the observations, for example, a year, month or day, and (b) a general index g, roughly to distinguish the instances location-wise. For example, if the process of interest is Cyanobacteria bloom in lakes, then g indicates the label of the lakes. If the interest is in the spread of an infestation over a given area, then we may divide the area into r × r squares for, say, r = 1 km, and label them by g = 1, 2, …. We may exclude time when modelling a stationary quantity, for example, the joint distribution of several species in a specific area.
Similarly, we may exclude the index g, if all instances are taken from the same location, for example, from the same lake. Also, note that time and especially the index g are not necessarily two covariates of the process. Indeed, time must be excluded from the set of covariates if the goal is to obtain a model that can be applied to times different from those in the available data, for example, to predict the future (see Supporting Information). Similarly, the index g may be excluded; however, one must acknowledge the possible performance loss when applying the model to areas far-away from the training area, with dramatically different geographic features.
For illustration purposes, in what follows, we consider a spatial and temporal process. For each index g and time t, let 𝒳_{g,t} denote the set of covariates and Y_{g,t} denote the response variable (Table 2).

Although the response variable can be continuous or integer-valued, in order to use acknowledged performance measures such as AUC (Section 3.5), we restrict it to be binary. For example, given an index g and time t, the response variable Y_{g,t} may represent the presence, Y_{g,t} = 1, or absence, Y_{g,t} = 0, of infestation or of a species of interest. The covariates can be correlated with each other and may include variables that are not known a priori to contribute to the response variable. Our goal is to estimate (learn) the joint probability distribution P(Y_{g,t}, 𝒳_{g,t}) using available data.
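The bookkeeping for the spatial index g can be sketched as follows; the cell size and grid dimensions below are hypothetical, not those of any particular study area.

```python
def grid_label(x_km, y_km, r=1.0, n_cols=50):
    """Map spatial coordinates (in km) to a grid-cell index g for an
    r x r km grid with n_cols columns (hypothetical area layout)."""
    col = int(x_km // r)
    row = int(y_km // r)
    return row * n_cols + col + 1  # labels g = 1, 2, ...

print(grid_label(3.7, 2.2))  # -> 104
```

Every observation then carries the pair (g, t) alongside its covariate values and response.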

| Step 1: Data discretization
The random variables in a BN can be either continuous or categorical. However, if they are continuous, we must predetermine their distributional forms, for example, a Gaussian distribution. To avoid making such assumptions, we use discrete BNs, where every variable is categorical. We discretize all continuous variables by considering various numbers of intervals or discretization levels (say 2, 3, …, 10) and using data to determine which number leads to a higher performance score. If a continuous variable's range does not have evident thresholds in terms of the biological context, we use Hartemink's information-preserving algorithm (Hartemink, 2001) to quantize the values in a way that maximizes the mutual information shared by the variables (Cover & Thomas, 2012).
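As a simple stand-in for Hartemink's algorithm, the sketch below discretizes a continuous variable into k roughly equal-frequency levels; in the workflow above, this would be repeated for each candidate k = 2, 3, …, 10, with data deciding which k performs best.

```python
def equal_frequency_bins(values, k):
    """Discretize a continuous variable into k roughly equal-frequency levels.

    Returns a list of integer levels 0..k-1, one per input value. This is a
    simple alternative to information-preserving discretization."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    levels = [0] * len(values)
    for rank, i in enumerate(order):
        levels[i] = min(k - 1, rank * k // len(values))
    return levels

temps = [-3.1, 0.4, 2.2, 5.9, 7.0, 7.4, 12.8, 15.0, 21.3]  # hypothetical readings
print(equal_frequency_bins(temps, 3))  # -> [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

When biologically meaningful thresholds exist (e.g. a frost threshold at 0°C), fixed cut points should be preferred over data-driven ones.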

| Step 2: Partitioning the dataset into train and test
The typical machine-learning approach to learning, then evaluating, a model is to randomly partition the dataset into two subsets, train and test, where the greater portion (train) is used to estimate the model, and the remaining portion (test) to evaluate the trained model. However, evaluation concerns are raised if the instances of the original data are randomly partitioned into train and test. Indeed, using this method, the train and test datasets are extremely similar (see Supporting Information): each instance of the test dataset is highly likely to have a matching instance in the train dataset, due to correlations in time and space. The purpose of a test dataset is to simulate how the model performs when applied in practice to a new dataset. If the goal is to make predictions in the future, say next month, we set the test dataset to be the data from the final observations (instances) and let the train dataset be the remaining instances. Namely, we make the train and test datasets time-wise disjoint.
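A minimal sketch of such a time-wise split (the field names and test fraction are hypothetical):

```python
def time_wise_split(instances, test_fraction=0.2):
    """Split instances (each a dict with a 't' key) so the most recent
    observations form the test set, keeping train and test time-disjoint."""
    ordered = sorted(instances, key=lambda row: row["t"])
    cut = int(len(ordered) * (1 - test_fraction))
    # Move the cut back so no time step is shared between train and test.
    while 0 < cut < len(ordered) and ordered[cut - 1]["t"] == ordered[cut]["t"]:
        cut -= 1
    return ordered[:cut], ordered[cut:]

rows = [{"t": t, "g": g} for t in (2006, 2007, 2008, 2009, 2010) for g in (1, 2)]
train, test = time_wise_split(rows, test_fraction=0.2)
print(sorted({r["t"] for r in train}), sorted({r["t"] for r in test}))
# -> [2006, 2007, 2008, 2009] [2010]
```

The adjustment of the cut point guarantees that instances from one time step never appear on both sides of the split.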

| Step 3.1: Learning the BN structure
For each of the k-level quantized training datasets, we find the structure that results in the lowest BIC or the lowest AIC. Although this can be done by performing an exhaustive search on all possible BN structures, that is, directed acyclic graphs with the response variable and covariates as the node-set, we instead use efficient algorithms, for example, that of Silander and Myllymäki (2012), which is implemented in the r package bnstruct (Franzin et al., 2017). Both the BIC and AIC criteria penalize having more parameters, which reduces the chance of overfitting to the training dataset. The choice of BIC or AIC depends on the main goal of the study, the model complexity and the number of instances relative to the number of parameters (Aho et al., 2014).
Note that this approach is computationally infeasible if there are too many variables, for example, more than 25, or too many discretization levels. In that case, one may, instead, either use a fixed (a priori known) BN structure, for example, naive Bayes, or learn a 'close-to-optimal' (a priori unknown) BN on the training dataset using acknowledged search algorithms (Table 3, Figure 1). We learn the structure of the a priori unknown networks with the bnlearn package in r (Scutari, 2009). The input to each algorithm is the variables and the corresponding training dataset, and the output is a BN structure whose nodes are the variables. In case the learned structure contains undirected links, we randomly assign directions as long as directed cycles and v-structures do not appear. This is because BNs must not contain cycles by definition, and the introduction of v-structures can change the performance of the resulting BN (Koller & Friedman, 2009). So for each discretization level k, we obtain a BN structure according to one of the algorithms or fixed structures in Table 3.
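To illustrate the score being optimized, the hedged sketch below computes the decomposable BIC score of a candidate structure over discrete data, node by node, as the maximized log-likelihood minus (log N)/2 times the number of free parameters; a structure search would evaluate this over candidate parent sets. Variable names and counts are hypothetical.

```python
import math
from collections import Counter

def bic_node(data, node, parents, arity):
    """BIC contribution of one node given its parent set, for discrete data.

    data: list of dicts mapping variable name -> level; arity: levels per variable."""
    n = len(data)
    joint = Counter(tuple(row[v] for v in [node] + parents) for row in data)
    par = Counter(tuple(row[v] for v in parents) for row in data)
    ll = sum(cnt * math.log(cnt / par[key[1:]]) for key, cnt in joint.items())
    n_params = (arity[node] - 1) * math.prod(arity[p] for p in parents)
    return ll - 0.5 * math.log(n) * n_params

# Hypothetical binary dataset where Y clearly depends on X.
data = [{"Y": y, "X": x} for y, x, reps in
        [(0, 0, 40), (1, 0, 10), (0, 1, 10), (1, 1, 40)] for _ in range(reps)]
arity = {"Y": 2, "X": 2}
with_edge = bic_node(data, "Y", ["X"], arity) + bic_node(data, "X", [], arity)
no_edge = bic_node(data, "Y", [], arity) + bic_node(data, "X", [], arity)
print(with_edge > no_edge)  # -> True: the edge X -> Y is worth its extra parameter
```

Because the score decomposes over nodes, a search algorithm can improve one node's parent set at a time, which is what makes hill climbing and exact dynamic-programming approaches practical.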

| Step 3.2: Learning the BN parameters
After finding the highest-scoring BN structure for each of the k-level quantized training datasets, we learn the associated CPD parameters on the same training dataset and denote the resulting BN by ℬ*_k. We use the Bayesian parameter estimation approach (Koller & Friedman, 2009), implemented in bnlearn. To this end, for each quantization level k, we obtain a BN ℬ*_k that best fits the training data in terms of BIC, AIC or the other constraints listed in Table 3.

| Step 4: Evaluation
How do we choose among the different ℬ*_k s from the previous step? Namely, what number of discretization levels results in 'the best' BN? We cannot compare them directly using a performance measure that involves the likelihood of the data, for example, log-likelihood, AIC and BIC, because the ℬ*_k s do not use the same data but different discretized versions of it. Each network allows us to compute P(Y_{g,t} | 𝒳_{g,t}), that is, the chance of the observed response variable given the covariates, for every instance in the dataset. Correspondingly, we compare the area under the receiver operating characteristic curve (AUROC, or simply AUC; Bradley, 1997; Metz, 1978) scores of the BNs on the test dataset (see Supporting Information). The choice of AUC is to make our results comparable with the huge body of literature using this performance score as the final performance of a classifier. For each discretization level k, we calculate the AUC score of ℬ*_k and pick the highest-scoring one as our final BN. If there is a tie between the top BNs, we break it by looking at the area under the precision-recall curve (AUPR; Raghavan et al., 1989; Saito & Rehmsmeier, 2015) scores; that is, among the top BNs with a deficit of at most, say, 0.01 from the top AUC, we pick the one with the highest AUPR.

TABLE 2 Variable notation
The AUPR score better handles unbalanced data by looking at precision rather than the false positive rate (Davis & Goadrich, 2006;Saito & Rehmsmeier, 2015).
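AUC itself has a simple probabilistic reading: it is the probability that a randomly chosen positive instance receives a higher predicted score than a randomly chosen negative one (ties counted as one half). A minimal sketch with hypothetical predictions:

```python
def auc(labels, scores):
    """Area under the ROC curve: P(score of a positive > score of a negative),
    with ties counted as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 1, 0, 0, 0]
p_hat = [0.9, 0.8, 0.4, 0.7, 0.3, 0.1]  # hypothetical P(Y = 1 | covariates)
print(round(auc(y_true, p_hat), 3))  # -> 0.889
```

This pairwise formulation makes clear why AUC depends only on the ranking of the predicted probabilities, not on their calibrated values.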
Given the temporal nature of our task, we evaluate the final model on a single test dataset, as explained in Section 3.3. If, instead, one divides the original dataset into several yearly separated folds and uses cross-validation to obtain the AUC and AUPR values for each fold, then one can also provide confidence intervals for the reported AUC and AUPR values.

| Step 6 (optional): Sensitivity analysis
We examine the prediction accuracy (AUC and AUPR) of the best model when a primary covariate becomes unobservable. This roughly shows the contribution of each covariate to the prediction, although it is, indeed, the co-effect of all the covariates that leads to accurate predictions.

| Step 7 (optional): Comparison with simple Bayesian networks
To further assess the prediction performance of the final BN, we may compare its AUC (or AUPR) with that of simple BNs consisting of a single or two covariates linked to the response variable. These BNs might be considered as the 'null model'.
Recall that our final BN is designed to perform a generative task (that is, to reveal the relationships between the variables), not a discriminative task (that is, to predict the response variable). However, if the BN performs well on the first, it is likely to also do well on the second; yet the opposite does not hold (Ng & Jordan, 2002). So even if one of these simple BNs predicts the response variable better than our final BN, it does not question the capability of our BN in explaining the probabilistic relationships between the variables. The same may hold in the previous optional step: the AUC score of the BN may increase after removing some of the covariates. This can also be explained by the fact that our final BN is the best fit to the data under the selection score that we used, which is BIC (or AIC), not AUC.
Nevertheless, in such cases, we may train a BN with a different set of covariates for prediction purposes. For example, we may find that subset of the covariates that results in a BN scoring the highest AUC on the training dataset.

| THE MOUNTAIN PINE BEETLE CASE STUDY
We illustrate the learning and interpretation of BNs via the data on the MPB infestation in the Cypress Hills park, an interprovincial park located in Alberta and Saskatchewan (Figure S2). Endemic-level populations of MPB have existed in Cypress Hills since the 1980s.
However, an MPB outbreak started in 2006 and propagated through the park, where it continues to this day.

| Biology and management
Mountain pine beetle presents two main population phases: an endemic phase with small population size where beetles attack weak and stressed pines with the help of other bark beetles, and an epidemic phase where the number of individuals is large enough to overcome the defences of large and healthy pines (Safranyik & Carroll, 2006).
In summer, beetles emerge from a tree, mate and attack new pines to lay eggs in galleries under the bark. New MPB infestations are reported to frequently appear on south- and west-facing slopes (Safranyik, 2004). During the tree growing season, water stress negatively impacts the pine's ability to build its defence against bark beetles (Lusebrink et al., 2016; Safranyik, 1978). Indeed, pines use water to make a toxic resin that is exuded during a beetle attack to prevent beetles from attracting conspecifics and to inhibit the formation of galleries and oviposition (Erbilgin et al., 2017; Raffa & Berryman, 1983). MPB emergence and flights are reduced by high temperatures during the dispersal season (Safranyik & Carroll, 2006). MPB can disperse short distances within a stand or, more rarely, fly above the canopy to use the wind to travel long distances of the order of tens to thousands of kilometres (Robertson et al., 2007; Safranyik & Carroll, 2006). Once the eggs are laid, the adults die. Over the fall, winter and spring, eggs become larvae, then pupae, before finishing their transition to adult and emerging in the summer. Individuals need a minimum of 833 degree days to complete their transition to adult (Safranyik et al., 1975, 2010). The Forest Service Branch of the Saskatchewan Ministry of Environment follows a strict direct control approach. At the start of every fall, the park is surveyed aerially to collect geo-referenced data on red-top trees, that is, trees that are dead or dying from an MPB infestation at the previous year. Then, on the ground, managers survey 50-m radius circular plots around each red-top tree to find trees infested during the summer. The newly found infestations are later controlled in late fall/winter using a fell-and-burn method.
Our goal is to provide a set of covariates that potentially impact the MPB infestation in the Cypress Hills area, to understand how they are related to each other and to the infestation, and to find which covariates are sufficient for an accurate prediction. We are also interested in testing some claims in the literature, for example, that lower humidity increases the chances of infestation (Lusebrink et al., 2016), and in finding which values of the highly correlated covariates degree days and maximum temperature, which are typically not included together in a model, make infestation most likely. These objectives are well suited to BNs.

| Methods
We divide the study area into 100 m × 100 m squares and label them by g = 1, 2, …. We choose 1 year as our time unit and define the response variable I_g,t as the presence or absence of infestation in pixel g in the fall of year t. We use the covariates listed in Table 4 and discretize them into k = 1, 2, …, 7 levels. Our data include the values of the covariates and I_g,t (Figure 2). Among the covariates (Table 4):

Managed last year infestation, I^Managed_g,t−1: defined to be 1 if pixel g includes at least one tree that was infested and managed (controlled) in year t − 1, and 0 otherwise (Figure 2).

Missed last year infestation, I^Missed_g,t−1: defined to be 1 if pixel g includes at least one tree that was infested and missed (not controlled) in year t − 1, and 0 otherwise.

Missed neighbours' last year infestation: captures MPB's ability to disperse at short distances within a stand.

The simple BN with the missed-infestation covariates, I^Missed_g,t−1 and its neighbourhood counterpart, linked directly to the target I_g,t is considered as the null model (Figure 1).

| Resulting Bayesian network
We find the BN with the best BIC score on the training dataset with six discrete levels, that is, ℬ*_6 (Figure 3), as our 'best model' to explain the MPB infestation, with AUC = 0.88 and AUPR = 0.28. The OI model scores 0.75 for AUC and 0.19 for AUPR, both lower than our selected model.
According to the structure of ℬ*_6, the infestation I_g,t in location g at year t is directly connected to I^Missed_g,t−1, I^Managed_g,t−1, …

| Sensitivity to missing covariates
The prediction accuracy of ℬ*_6 does not deteriorate when the values of any of the secondary covariates are missing. When values of the primary covariates are missing, the model can still predict infestation fairly accurately, as it can fall back on some of the secondary covariates (Table 5).
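The mechanism behind this robustness is exact marginalization: a discrete BN sums any unobserved covariate out of the joint distribution instead of requiring an imputed value. A toy sketch with invented CPDs, for the structure X1 → Y ← X2:

```python
# Sketch of BN prediction with a missing covariate via marginalization.
# Structure: X1 -> Y <- X2; all CPD numbers are invented.
p_x1 = {0: 0.7, 1: 0.3}
p_x2 = {0: 0.6, 1: 0.4}
p_y = {  # p_y[(x1, x2)] = P(Y = 1 | x1, x2)
    (0, 0): 0.02, (0, 1): 0.10,
    (1, 0): 0.15, (1, 1): 0.55,
}

def predict(x1=None, x2=None):
    """P(Y = 1 | evidence); any covariate left as None is summed out."""
    num = den = 0.0
    for a in (0, 1):
        for b in (0, 1):
            if x1 is not None and a != x1:
                continue
            if x2 is not None and b != x2:
                continue
            w = p_x1[a] * p_x2[b]    # weight of this joint configuration
            num += w * p_y[(a, b)]
            den += w
    return num / den

print(predict(x1=1, x2=1))   # both covariates observed
print(predict(x1=1))         # x2 missing: summed out over its prior
```

In a real BN the sum runs over all configurations of the unobserved variables, weighted by the learned CPDs, but the principle is exactly this.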

| Discussion
Given this BN, we can provide a wide range of ceteris paribus claims revealing the co-effects of the covariates on the presence of infestation (see Supporting Information). For example, if we know the maximum daily temperature is high (above 31.2°C), the interval of relative humidity that results in the highest infestation risk shifts sharply from medium to low. This is in line with the claim in Lusebrink et al. (2016) and Safranyik (1978) that lower humidity increases the infestation probability. However, for maximum daily temperatures lower than 31.2°C, the infestation likelihood is high for both low and high relative humidity. This apparent inconsistency can be resolved by looking at maximum temperature and relative humidity together.
We find that humid areas require low maximum daily temperature, while dry areas require high maximum daily temperature for a considerable risk of infestation (above 20%).
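Such ceteris paribus readings amount to scanning one parent's levels in a learned CPD while holding the others fixed. An illustrative sketch with an invented risk table (not the fitted CPD from the case study):

```python
# Toy CPD illustrating a ceteris-paribus query: which humidity level
# maximizes infestation risk at each temperature regime?
# (All probabilities are invented for illustration.)
risk = {  # risk[(temp, humidity)] = P(infestation = 1 | temp, humidity)
    ("high", "low"): 0.30, ("high", "med"): 0.12, ("high", "high"): 0.05,
    ("low",  "low"): 0.22, ("low",  "med"): 0.08, ("low",  "high"): 0.21,
}
for temp in ("high", "low"):
    best = max(("low", "med", "high"), key=lambda h: risk[(temp, h)])
    print(temp, "->", best)
```

Note how the invented table mimics the pattern in the text: at low temperature, both humidity extremes carry comparable risk, so the argmax alone understates the interaction; reading the full row is more informative.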
As another example, an MPB needs 833 degree days to complete its transition to adult, and the minimum number of degree days in the data is 1,054 (Safranyik et al., 1975, 2010). Therefore, degree day never prevents infestation in our data and just reflects the neg- […] (Pearl, 2009; Pearl & Mackenzie, 2018). Moreover, the absence of an edge between, for example, degree day and infestation does not necessarily mean that the two are independent. They may be dependent but become conditionally independent once some other covariates are known.
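The last point can be checked numerically. In a common-cause structure A ← C → B (with invented probabilities), A and B are marginally dependent yet become independent once C is known, which is exactly how an "absent edge" can coexist with marginal dependence:

```python
# Numeric check: marginal dependence vs. conditional independence in a
# common-cause structure A <- C -> B (all probabilities invented).
from itertools import product

p_c = {0: 0.5, 1: 0.5}
p_a = {0: 0.2, 1: 0.8}   # p_a[c] = P(A = 1 | C = c)
p_b = {0: 0.3, 1: 0.9}   # p_b[c] = P(B = 1 | C = c)

def joint(a, b, c):
    pa = p_a[c] if a else 1 - p_a[c]
    pb = p_b[c] if b else 1 - p_b[c]
    return p_c[c] * pa * pb

p_ab = sum(joint(1, 1, c) for c in (0, 1))
p_a1 = sum(joint(1, b, c) for b, c in product((0, 1), repeat=2))
p_b1 = sum(joint(a, 1, c) for a, c in product((0, 1), repeat=2))
print(round(p_ab, 3), round(p_a1 * p_b1, 3))   # unequal: dependent
# Conditioned on C = 1, the joint factorizes exactly:
print(joint(1, 1, 1) / p_c[1], p_a[1] * p_b[1])
```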
In summary, the learned BN contributes to the prediction and understanding of MPB infestations by (1) […] where the ranges of the covariates are very different from those in the training dataset. We refer the reader to Robinson et al. (2010), Zhou et al. (2008) and Zhu and Wang (2015) for relaxing the stationarity assumption.

| DISCUSSION
Although traditional models used to make ecological predictions from underlying covariates have a record of success, they also suffer from limitations. They cannot make predictions when one or more covariates are missing, unless the missing values are imputed using other methods, which can be unreliable and result in low prediction accuracy. They also do not allow for statistical inference when some of the covariates are highly correlated. BNs can handle these issues.
Specifically, they provide a primary and secondary ordering of the covariates in terms of their contribution to the prediction. However, BNs are not used to their full potential in the literature, as their structure is typically constructed based on the knowledge of experts. Moreover, the obtained BN is often read causally, a questionable practice, as BNs are different from causal networks.
We have complemented previous work by providing a systematic approach to obtain a BN fully from data. We have demonstrated the approach via a MPB case study, where no knowledge of experts was involved in finding either the structure or CPDs. The resulting BN predicts infestations fairly accurately, even in the absence of any of the selected covariates that are involved in the model.
The resulting networks have often been used as predictors and sometimes reported to be fairly successful on a test dataset. This is an acceptable approach to assess the a priori knowledge of the experts, or when there are no data available to learn the BN structure. However, by means of the results of our MPB case study, we challenge claims that put forward this approach as 'the (only) right one' for constructing a BN. Examples include: that synthesizing existing knowledge into the model is necessary and structural learning is only for modelling poorly understood systems or those difficult to characterize (Chen & Pollino, 2012); that modellers must demonstrate causal relations; that models based on theories about causal relations are generally better (Uusitalo, 2007); and that network structure is a matter of judgement and should reflect expert knowledge and stakeholder needs (Gutierrez et al., 2011). Some researchers have looked into fixed (naive Bayes) and partially learnable (tree-augmented naive Bayes) structures (Aguilera et al., 2010), yet this differs from learning fully from data.
In general, for modelling the joint probability distribution of the variables involved in an ecological process, that is, a generative task, BNs seem to be the first, and often best, candidate, especially if the governing dynamics are not yet understood well enough to be modelled mechanistically. However, if the sole purpose is to predict the response variable, that is, a discriminative task, other models may show a higher prediction accuracy, although, unlike BNs, they typically cannot deal with missing values in the covariates. We are currently exploring ways to use BNs, as well as other models, to predict infestation many years into the future (Ramazi et al., accepted).