Ensembles of ecosystem service models can improve accuracy and indicate uncertainty

Many ecosystem service (ES) models exist to support sustainable development decisions. However, most ES studies use only a single modelling framework and, because of a lack of validation data, rarely assess model accuracy for the study area. In line with other research themes that have high model uncertainty, such as climate change, ensembles of ES models may better serve decision-makers by providing more robust and accurate estimates, as well as indications of uncertainty when validation data are not available. To illustrate the benefits of an ensemble approach, we highlight the variation between alternative models, demonstrating that there are large geographic regions where decisions based on individual models are not robust. We test whether ensembles are more accurate by comparing the ensemble accuracy of multiple models for six ES against validation data across sub-Saharan Africa with the accuracy of the individual models. We find that ensembles are better predictors of ES, being 5.0-6.1% more accurate than individual models. We also find that the uncertainty (i.e. variation among constituent models) of the model ensemble is negatively correlated with accuracy, and so can be used as a proxy for accuracy when validation is not possible (e.g. in data-deficient areas or when developing scenarios). Since ensembles are more robust, more accurate and convey uncertainty, we recommend that ensemble modelling be more widely implemented within ES science to better support policy choices and implementation.


Box 1 - Key definitions

Whilst relatively rare in the ES literature, frameworks for understanding model uncertainty can be found elsewhere in the literature (e.g. see Araújo and New (2007), Refsgaard et al. (2007), and Walker et al. (2003)). Key concepts are defined below:

• Uncertainty - Any deviation from the unachievable ideal of completely deterministic knowledge of the relevant system (Walker et al., 2003).
• Inaccuracy - The deviation from the 'true' value (i.e. how close a modelled value is to the measured value, the latter considered 'true') (Walker et al., 2003).
• Robustness - The level of confidence in the overall patterns/conclusions derived from the model (which may be high even though quantified estimates in individual pixels are inaccurate) (Refsgaard et al., 2007).
• Model ensemble - A collection of modelled outputs produced by running simulations for more than one set of models, initial conditions, model classes, model parameters and/or boundary conditions (Araújo and New, 2007).
• Committee averaging - A method of combining models that gives each an equal weight (e.g. calculating the mean) (Araújo and New, 2007).
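Committee averaging, as defined above, can be sketched in a few lines. The three model arrays below are hypothetical illustrations, not outputs from this study:

```python
import numpy as np

# Hypothetical outputs from three ES models for the same 2x2 grid of
# cells (units arbitrary). Each array is one constituent model.
model_a = np.array([[10.0, 12.0], [8.0, 15.0]])
model_b = np.array([[11.0, 14.0], [7.0, 13.0]])
model_c = np.array([[ 9.0, 13.0], [9.0, 14.0]])

models = np.stack([model_a, model_b, model_c])

# Committee averaging: every model receives equal weight, so the
# ensemble estimate for each cell is simply the per-cell mean.
ensemble_mean = models.mean(axis=0)

print(ensemble_mean.tolist())  # [[10.0, 13.0], [8.0, 14.0]]
```

Weighted alternatives exist (e.g. weighting by prior validation performance), but committee averaging requires no validation data, which is precisely its appeal in data-deficient regions.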
…of the ensemble mean (Puschendorf et al., 2009). Thus, ensembles may also provide an indication of uncertainty when faced with data scarcity, a potential benefit that is perhaps most pronounced in many developing countries, where data collection and model assessment efforts are least advanced (Suich et al., 2015) but reliance on ES for wellbeing is arguably the highest (Daw et al., 2011; Shackleton and Shackleton, 2012; Suich et al., 2015).

In this paper, we demonstrate that decision-making based on single ES models is not robust for large regions within sub-Saharan Africa, as high variation between model estimates means that using a different model, or incorporating an additional model into the decision-making process, is highly likely to result in a different decision. In addition to increased robustness, we show that ensembles of ES models can provide improved accuracy over individual models, as well as an indication of uncertainty. Finally, we discuss how ensemble modelling might become standard practice within the ES community, particularly when supporting high-level policy decisions, such as in the IPBES regional, global and thematic assessments used in policy and decision-making.

2. Methods

Recently, we validated multiple models for each of six ES in sub-Saharan Africa (stored carbon, available water, water usage, firewood, charcoal, and grazing resources; Table 1) ([…] Table SI1-2, but see Willcock et al. (2019) for further information). In that paper, we used six ES modelling frameworks: InVEST (Kareiva, 2011; McKenzie et al., 2012), Co$ting Nature (Mulligan, 2015; Mulligan et al., 2010), WaterWorld (Mulligan, 2013), benefits transfer based on the Costanza and others (2014) values, LPJ-GUESS (Smith et al., 2001, 2014), and the Scholes models (comprising two grazing models and a rainfall surplus model) (Scholes, 1998). […] performing model according to each validation dataset. We tested whether the accuracy of a first category ("A", e.g., the ensemble mean) was higher ("improved") or lower than a second category […] datasets across six ES provides too low a level of replication per ES, but normalising each ES allows comparisons across the different ES as a whole. Normalising involved dividing the accuracy of A by the accuracy of B for each validation dataset. For simplicity, we refer to the 16 resulting proportions as "improvement values", although they could indicate a loss of accuracy (values < 1). […] (Table SI1-4), as the larger variation was offset by higher degrees of freedom (78 vs 15).

We also tested the correlation between ensemble uncertainty and absolute accuracy using 1661 of the 1675 individual data-points for validation (anovan procedure in Matlab). The large sample size meant we were able to differentiate between ES in this analysis. We calculated ensembles from a minimum of three models and so discarded 14 data-points since they only matched ≤2 modelled estimates. […], where X represents each 1 km² grid-cell, and n is the number of models.
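The two quantities described above can be sketched numerically. Note that the formula for ensemble uncertainty did not survive extraction here (only the phrase "where X represents each 1 km² grid-cell, and n is the number of models" remains), so the per-cell coefficient of variation below is an assumed stand-in, and all numbers are hypothetical:

```python
import numpy as np

# Hypothetical per-cell estimates from n = 3 models for five 1 km^2 grid cells.
estimates = np.array([
    [4.0, 5.0, 6.0, 2.0, 9.0],   # model 1
    [5.0, 5.5, 4.0, 2.5, 9.5],   # model 2
    [4.5, 6.0, 5.0, 1.5, 8.0],   # model 3
])

# Per-cell ensemble uncertainty. ASSUMPTION: the paper's exact formula is
# not recoverable from this text; the coefficient of variation (SD / mean
# across the n models, per cell X) is used here as an illustrative stand-in.
uncertainty = estimates.std(axis=0) / estimates.mean(axis=0)

# "Improvement value": accuracy of category A (e.g. the ensemble mean)
# divided by accuracy of category B (e.g. an individual model) for one
# validation dataset, as described in the text. Values > 1 indicate
# improvement; values < 1 indicate a loss of accuracy.
accuracy_a, accuracy_b = 0.63, 0.60   # hypothetical accuracies
improvement = accuracy_a / accuracy_b

print(round(improvement, 3))  # 1.05 -> A is 5% more accurate than B
```

Dividing A's accuracy by B's per dataset is what allows the 16 proportions to be pooled across services despite each ES having too few validation datasets on its own.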

Variation amongst models shows strong spatial patterning
For sub-Saharan Africa, we found large areas for which the variation among models was relatively low (Figure 2). In these areas, all models provide similar normalised predictions and so a decision based on a single model may prove robust. However, there are also notable areas of disagreement, where variation among models was higher. These appear to occur in transition zones between vegetation types (Figure 2) and, for aboveground carbon storage models, in less densely forested areas (e.g. […]

In general, individual models as a group were inferior to the ensembles created from them: ensembles outperform individual modelling frameworks by 5% to 6% for both ρ and D↓ (P = 0.03 and 0.008, respectively; Figure 3; Table SI1-3). Ensembles were outperformed by the best model for each validation set by 13% (mean; P = 0.04) and 12% (median; P = 0.05) using ρ, and by 6% (P = 0.002) and 7% (P < 0.001) using D↓. Unfortunately, which model performs best for each validation dataset was hard to predict, as no single model framework is consistently more accurate than others (Table SI1-[…]

We have demonstrated that there is substantial variation between ES models and the difficulty in predicting the best-fit model, as no single model was consistently better than others (Table SI1-[…]

Despite disagreement between individual models, ensemble modelling has been mostly neglected by the ES community; e.g. a Web of Science search (10 February 2020) for "model ensemble" and "ecosystem service" returned no records. This is surprising as: 1) ensembles are commonly used for model types that simulate output variables closely related to ES, but without emphasising the ES […] source platforms (e.g. InVEST) makes running multiple models increasingly straightforward. Hence, it is now possible for most studies using an ES model to shift to using multiple models. We hope this study encourages ES researchers to do so.
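The accuracy comparisons above rest on agreement statistics between modelled and observed values. The sketch below illustrates only the Spearman's rank correlation (ρ) component, via SciPy, on hypothetical site data; the D↓ statistic used in the paper is not reproduced here:

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical modelled and observed (validation) values for ten sites.
modelled = np.array([2.1, 3.4, 1.0, 5.6, 4.2, 3.9, 0.8, 6.1, 2.7, 4.8])
observed = np.array([2.0, 3.0, 1.5, 6.0, 4.0, 4.5, 1.0, 5.5, 3.1, 4.4])

# Spearman's rank correlation (rho): a rank-based accuracy measure, robust
# to monotonic (rather than strictly linear) agreement between model and
# validation data.
rho, p_value = spearmanr(modelled, observed)

print(round(rho, 2))  # 0.94
```

A rank-based measure is a natural choice when model outputs are in different units or on different scales than the validation data, as is common when comparing heterogeneous ES modelling frameworks.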
However, whilst using ensembles of ES models is indeed possible, there are several challenges that need to be overcome before it becomes standard practice within ES science. We argue that advances are necessary in two key areas: accessibility and comparability. As more independent models are developed, it might be hypothesised that the ease with which these models can be accessed might increase. Indeed, anecdotal evidence seems to support this as, for example, InVEST historically […], 2019). Similarly, despite models becoming increasingly complex, the computational capacity required to run some of these models has decreased, as many modelling frameworks now make use of cloud-computing resources, putting less stringent requirements on the end-user (Willcock et al., 2019).

Accessing multiple ES models remains a difficult undertaking. For example, whilst the software needed to run InVEST is free, it still requires substantial GIS knowledge, and many of the models within this framework are 'data-hungry', therefore requiring access to data and substantial processing power in order to run (Willcock et al., 2019). By contrast, ARIES and Co$ting Nature store the necessary data and processing power on their servers, but therefore require high-speed internet access (Willcock et al., 2019). Furthermore, to benefit from the full Co$ting Nature model outputs (i.e. disaggregated outputs of individual services), one must either enter a partnership with the model owners or pay a subscription of at least 2,000 GBP yr⁻¹ (http://www.policysupport.org/access-costs). Thus, in order to contrast or combine, for example, carbon models across these frameworks, you require access to the internet, adequate data and computational power, as well as the funds to support a model subscription fee and the extra staff time required (i.e. when compared to running a single model).
Such resources are likely out of reach of many ES researchers and practitioners and so, for them, ES ensembles are an unfeasible ideal. However, this can be somewhat negated if those with access to these resources make the ensembles they are able to create freely available (e.g. as we have done through the EIDC repository for our committee-averaged ensembles and the SEM [https://doi.org/10.5285/11689000-f791-4fdb-8e12-08a7d87ad75f]).

As well as the issues surrounding the feasibility of running ensembles of models, methodological limitations remain. For example, when validating any model (individual or ensemble), a reference of truth is required (Box 1). Validation data have their own intrinsic inaccuracies and so it may be good […] is robust (Willcock et al., 2019). Whilst we use multiple sets of validation data here (Table SI1-2), data deficiency prevented further investigation into the sources of the uncertainty we identified; e.g. running simulations to vary initial conditions (e.g. spatial scale (Hou et al., 2013)), model classes, model parameters and/or boundary conditions (Araújo and New, 2007). This is an exciting avenue for future research, which could also compare using ensembles of models to assess uncertainty with other approaches (e.g. probabilistic models (Bagstad et al., […]).

5. Conclusions

This study highlights that, in most instances, ensemble modelling may provide more robust and better estimates than using single models, as well as an indication of confidence in model predictions when validation data are unavailable. Whilst ES science is not yet ready for ensembles to become standard practice, ensemble modelling should be adopted more widely in ES modelling. In future, studies of high policy relevance (e.g. future assessments of IPBES), as well as efforts to inform decisions and track […]

[Table 1 footnotes:] * All 1x1 km in this study, unless otherwise noted. Willcock et al. (2019) investigated the impact of spatial scale on ecosystem service models and found no significant impact (unpublished results). Thus, spatial scales are unlikely to affect results here. § These services were not modelled in these model frameworks when we conducted our model […]