How sensitive are species distribution models to different background point selection strategies? A test with species at various equilibrium levels

,


Introduction
Species Distribution Models (SDMs), also known as Ecological Niche Models (ENMs), constitute a widespread method for ecological, biogeographic, and conservation studies (Franklin 2010;Peterson et al., 2011;Guisan et al. 2013;Guisan et al. 2017;Araujo et al. 2019), often being particularly useful when dealing with invasive alien species (IAS) (Gallien et al. 2012;Slodowicz et al. 2018;Vicente et al. 2019).SDMs take geographic observations of species and georeferenced environmental datasets in a study area to statistically identify the conditions in which the species can occur (i.e., a part or all of the species' realized environmental niche), and map the occurrence probability of the species based on these conditions back in geographic space (Colwell and Rangel 2009).The usefulness of SDMs depends both on model accuracy, i.e., how well a model is able to make predictions, and model stability, i.e., the degree of model sensitivity to stochasticity (Duan et al. 2014;Grimett et al. 2020;Mateo et al. 2010).Therefore, consistency of geographical predictions is a pivotal factor to research.
The vast majority of SDMs use presence-only data, likely because occurrence data in national or global databases (e.g., the Global Biodiversity Information Facility ([GBIF]); Anderson et al. 2016, Fletcher et al. 2019, Chapman et al. 2020) have become numerous and easily accessible (Chauvier et al. 2021;Valavi et al. 2021, Nolan et al. 2022).In contrast, absence data are more rarely available and hard to generate, especially in the case of animal species (Gormley et al. 2011;Lobo et al. 2010).Presence-background SDMs function by contrasting the species occurrences to points sampled (often randomly) in the study area (Valavi et al. 2022), ideally along the main environmental gradients used to define the species niches (Barve et al. 2011; but see Iturbide et al. 2018 for other possibilities).This set of points representing the environment are called background points (BP; often also pseudo-absences, but see Senay et al. 2013) and are most typically sampled considering a geographically random scheme (Liu et al. 2019;Chauvier et al. 2021;Barber et al. 2022).Therefore, presence-background SDMs make the implicit assumption that BP sampled in geographic space represents the available environment (measured as all environmental combinations within the study area) appropriately (Barbet-Massin et al. 2012;Wisz and Guisan 2009;Liu et al. 2019, Cengic et al. 2020, Valavi et al. 2022).
However, since SDMs fit response curves along environmental predictors, this random selection of BP should theoretically be conducted in environmental rather than geographical space and in a stratified way (Fragnière et al. 2022).This is also supported by the fact that, in order to produce accurate and robust SDM results, species occurrence data should optimally be sampled in the field in a random-stratified in environmental space manner (Hirzel and Guisan 2002).
Whilst there have been studies that used background points selected in environmental space, this was usually done in a variety of stratified fashions, arranging the background points along environmental gradients (e.g.Fragnière et al. 2022).It has also recently been shown that geographically constraining background point selections to areas that are geographically or environmentally close to the occurrence points can strongly influence SDM predictive accuracy (Whitford et al. 2024;Schartel and Cao 2024).However, it was rarely done fully randomly in environmental space.Here, we present the first study that also uses a fully random selection of background points in environmental space, allowing us to test the four possible types of selection based on geographic versus environmental spaces, and stratified or not, and thus whether the implicit assumption that the geographic space reflects the environmental space is acceptable or not when fitting SDMs.As Hutchinson's niche-biotope duality tells us that the geographic space does not reflect the environmental space (Colwell and Rangel 2009;Guisan et al. 2014), the question is whether it can still give a good approximation.
Another major assumption made when using SDMs is that the modelled species is at equilibrium with its environment.This is already not true for all native species (Normand et al. 2011;Svenning and Sandel 2013), but it is almost never true in the case of invasive alien species (IAS), as their spread leads them to progressively occupy new habitats and territories (Gallien et al. 2012;Petitpierre et al. 2012).Therefore, statistical species-environment relationships usually only represent a snapshot "pseudo-equilibrium" in time (Guisan & Zimmermann 2000).The question is, then, how much a disequilibrium causes problems to SDM fitting and predictions (Svenning and Sandel 2013)?Therefore, the method of sampling background points may be even more important when a species is not at equilibrium with the environment.
Here, we address this second dimension of (dis)equilibrium with the environment together with our assessment of background point selection, by using species with different putative degrees of equilibrium.More specifically, we make the simplified assumption that wellestablished native species, i.e., those for which we know that they have a dynamic that has not changed recently, are likely at equilibrium inside their home range and fill their realized niche.In this study, we want to answer the two following research questions around the two assumptions previously exposed: (1) Which method of BP sampling produces the best-performing SDMs for invasive and equilibrium species in terms of model accuracy and stability?(2) Do the optimal BP sampling strategy differ between native and invasive alien species?

Species selection and study area
In our analyses, we considered both invasive alien species (IAS), i.e., species not at equilibrium with the environment, and species which we assume to be at the equilibrium (see Table 1).Both invasive and equilibrium species included 5 insect species, 5 plant species and 5 vertebrate species, making a total of 30 target species.
All equilibrium species are mountain species, which are naturally distributed along steep elevational and environmental gradients, thus making their distribution constrained in geographic space (i.e., where mountains are).Also, this limited our study area to mountain biomes, which was advantageous for computational considerations.A final reason to select non-generalist mountain species for equilibrium studies is because whilst many of these species might physiologically be able to occur at lower altitudes, as demonstrated by growing them in botanical gardens (Vetaas 2002), competition with other species drives them away from there.For instance, the Alpine aster (Aster alpinus) usually occurs at elevation of over 2000 m because it gets outcompeted at lower altitudes (Lyu and Alexander, 2022).High-altitude species therefore are likely to occupy their entire realized niche, and therefore they are at (pseudo-) equilibrium with their realized environment in mountains.
We selected 15 IAS from the EU list of Union Concern (https://ec.europa.eu/environment/nature/invasivealien/list/index_en.htm) and the factsheets on invasive alien species in Switzerland provided by the

Table 1
Study species and matching rationale to insert them into the studies and number of occurrences (in the study area, after removal of non-native range occurrences for equilibrium species and thinning).Note that the invasive alien species all at least occur in mountain areas in their home range.S5 for the list along with the rationale to include the species).
We defined our study area using the biomes from Olson et al. (2001) where our species occur (Fig. 1): temperate broadleaf and mixed forests, temperate coniferous forests, temperate grasslands, savannahs and shrublands, montane grasslands and shrublands, boreal forests/taiga and tundra.We excluded biomes where the presence of our species was marginal (see Appendix 3, Table S31 for occurrence points omitted per species).For an additional analysis on the potential effect of the omitted occurrence points on the model outcomes, showing that the environmental values of the occurrence points off the biomes are represented into the environmental values of the background points and of the occurrence points included in the study, see Appendix 3 (Figs.S32 and  S33).

Algorithm choice and settings
MaxEnt (which stands for 'maximum entropy'), a type of Poisson point process regression model (Renner and Warton 2013), is among the most widespread algorithms used to fit SDMs (Elith et al. 2011;Gomes et al. 2018).It also draws upon presence-only (and background) data, which is much more widely available than data on presence-absence of species (Fletcher et al. 2019;Guillera-Arroita et al. 2015).To avoid model overfitting, the ENMeval package version 2.0.3 (Kass et al. 2021) was used to find the most optimal model settings.MaxEnt uses several mathematical function types (linear, quadratic, hinge, product, threshold) and the Regularization Multiplier (RM), a controller of model complexity, to identify the relationship between data occurrence and the underlying environmental variables (Phillips et al. 2009).ENMeval calibrates a MaxEnt model by pointing out the optimal settings of the function types and RM.Since ecological interactions are never accurately described by only one type of function, e.g., only linear functions, we only chose three combinations of functions: 1) linear, quadratic and hinge (LQH), 2) linear, quadratic and product (LQP), and 3) linear, quadratic product and hinge (LQPH).The RM was varied from 1 to 5. Threshold functions were never used because hinge functions have been shown to be a better replacement (Elith et al. 2011).

Environmental variables
Bioclimatic variables were downloaded from CHELSA, version 1.2 (Karger et al., 2017), at a resolution of 0.0083 • .This version was chosen because the variables are composed of the average values from the time series of 1979 -2013 and species responses to climate change often lag behind the actual climate at a given time (Menendez et al. 2006).Though this may vary among species groups, we make this assumption in this study, for the sake of comparability.An average for a time period is therefore more informative than choosing the values at one specific time point.The Ecospat R package (Di Cola et al. 2017;Broennimann et al. 2023) was used to investigate correlation among the variables (i.e., collinearity) and to select the best set of (partly) uncorrelated variables to insert in the models.A correlation limit of 0.70 between variables was used (Dormann et al. 2013).In the end, seven variables were used: mean annual temperature, diurnal temperature range, total temperature range, annual precipitation, precipitation seasonality, precipitation of driest quarter, precipitation of coldest quarter.

Species occurrence data
We downloaded data from the Global Biodiversity Information Facility (GBIF; Anderson et al. 2016) on 30 species (GBIF.org, 2022a,b,c, GBIF.org, 2023).Only occurrence records from the year 2000 or later were selected, in order to match the CHELSA version 1.2 timeline.The delay from 1979-2013 to 2000-2022 is assumed to make up for the delayed species response (see Section 2.3).To properly focus on our question of equilibrium state, only the native range of species is modeled.Conversely, IAS are expected to be at disequilibrium with the environment (Gallien et al. 2012).For these species, all the known range (in the study area) was modeled in this study.This is firstly to research the effect of this disequilibrium on model stability, and secondly because more is often known about those species in their invaded range than in their native range (Poland et al. 2021).
Many species that are not listed as invasive are still naturalized alien in some countries, which might mean that they are not at equilibrium in part of their range.Therefore, the DASCO R package (Hseebens 2022) was used to identify the native distribution range of the species at equilibrium and all occurrence records from outside that native range were removed.Then, occurrence records that were outside the study area were removed.For an additional analysis on the potential effect of this on niche truncation and model's ability to accurate characterize the niche, see supplementary material, Appendix 3, Figs.S32 and S33.The data were subsequently 'cleaned'.This implied removal of unlikely coordinates, occurrences with coordinate uncertainty over 50 m, observations with zero abundance and/or an occurrence status of "absent" (in many languages) and visual inspection of the cleaned data set.This was done using the scrubr R package (scrubr source : R/scrubr-package.R; Downloading and cleaning GBIF data with R | R-bloggers).For species with more than 20′000 occurrence records, 20′000 occurrence points were selected at random, for reasons of computing efficiency.This number of occurrence points was considered big enough to represent the species' distribution range effectively, thus we lost no information in this step.Finally, the occurrence data for each species were thinned (Steen et al. 2021, Inman et al. 2021) as follows.First, a Minimum Convex Polygon (MCP) was drawn around the (remaining) points.This is a technique to draw the borders of the home range of the species.Within this MCP, as many points as there are occurrence data were randomly drawn, following Chiocchio et al. (2021).This process was repeated 50 times and each time, the average nearest neighbor distance between the randomly drawn points was recorded.The average of these averages was calculated, and the species occurrence data was thinned by this distance using the spThin R package (Aiello-Lammens et al., 2015).This method matches the clustering of the occurrence records to clustering expected for a random distribution, as it returns the number of points expected in the species' home range and therefore limits sampling bias of occurrence points.The downloaded occurrence data were then thinned to 1 point per grid cell of our environmental rasters (Steen et al. 2021) because our version of MaxEnt only accepts one occurrence point per grid cell.To limit the risk of sample size influencing our results, we kept only species that had at least 50 occurrence records remaining for further analyses, to be well above the limit of 30 occurrences under which no algorithm predicts consistently (Wisz et al. 2008).
Originally, many more species were considered to be included in the study, however, this method of thinning often removed too many records.

Background points sampling
We sampled 10,000 background points for each sampling strategy.It is generally acknowledged that this is a sufficient background point sample size for SDMs built using MaxEnt (Barbet-Massin et al. 2012).Each of the 4 methods of background point samplingfully random in environmental space (randenv), stratified random in environmental space (stratenv), fully random in geographic space (randgeo) and Note that the grid size of the checkerboard blocks have been enlarged to 100 km in order to show the checkered division of the BP data across the bins; Fully random in environmental space (randenv); Stratified-random in environmental space (stratenv).Note we did not display the stratified random in geographic space method here because the checkerboard models also use the random in geographic space background points (in B).Please print in color.
B. Steen et al. stratified random in geographic space (presented by checkerboard 1 (cb1) and checkerboard 2 (cb2)) -is described below.The background points were sampled from the study area described under Section 2.1.See Fig. 2 for a schematic explanation.All the background sample sizes were downweighted to the number of occurrences.

Fully random in environmental space (randenv)
The simplest approach would be to draw all points randomly in environmental space at once, but when the available environment of the study area is very restricted (as here with mountains), it becomes difficult to make such random points match actual conditions in the field.For this reason, we had to develop a specific procedure.First, a PCA analysis of all the environmental variables was performed using the "PCA" function of the R package "FactoMineR" (Husson et al. 2020).The first two PCA axes explain 74.4% of the variation.Then, a two-dimensional kernel density estimate was performed on the two axes using the "kde2D" function of the R package "MASS" (Venables et al. 2002).Based on this, we divided the environmental dataset into 10 quantiles, each of which contained roughly 10% of all the values in the PCA axes, see Appendix 1, supplementary Fig. S3.Then, the values of the axes were rounded to 1 decimal.Finally, 1000 background points were sampled randomly across all 10 quantiles, resulting in a final dataset of 10,000 BP, as follows: First, random numbers ranging between the minimum and the maximum value of the first PCA axis were generated.The same was done for the second PCA axis.Then, to perform the random sampling, both random numbers were matched to the values of both axes and when there was a match, the corresponding longitude and latitude were recorded into the randenv background point dataset.When performing the random sampling, two conditions had to be met: 1) the randomly sampled values on both the PC1 axis and the PC2 axis had to match the values in the matching cell and 2) no duplicate values of either PC1 or PC2 were allowed.The latter condition was used in order to prevent oversampling of more common values.Thus, we made sure that the final dataset of background points was one that at least approached a representative, fully random sample of all the environmental conditions available.For additional details, see Appendix 1, supplementary Fig. S3.

Stratified random in environmental space (stratenv)
The 7 CHELSA raster layers were each divided into 3 equal environmental strata as follows.First, the difference between the minimum and maximum value in a layer was calculated.Then, all the cells with values that fell within a one-third quantile of the group were selected, thus making three strata per layer.Each of these strata was assigned a unique numeric code which was not included in any of the strata of other layers.Therefore, each matching pixel across raster layers has seven different numeric codes.Then, a new raster layer was created in which all the numeric codes were summed, creating a unique new code for each combination of strata.Finally, the sampleStratified function of the raster package was used to sample across this new layer and create the stratenv background points.This function makes an effort to sample across all the cells in the summed layer that have the same code.Therefore, all strata were sampled (more or less) equally.Since, in environmental space, the number of the strata (and thus their resolution) should not affect the results, we used only one here (i.e. 3 strata per climatic layer).

Fully random in geographic space
This option is the most commonly used and therefore many options exist to conduct it.Here, the randomPoints function of the dismo (Hijmans et al. 2019) package was used to generate randgeo background points.This function samples fully randomly across the rows and columns of the geographic raster variables (omitting cells with no data), with weights on the rows, so not too many cells can be selected near the poles.Also, no cell can be selected more than once.

Stratified random in geographic space
The checkerboard method of the ENMeval package (Kass et al. 2021) was used to divide the randgeo background data (as well as the occurrence data) according to square "checkerboard" blocks (Steen et al. 2019).As, in geographic space (unlike in environmental space), the grain of the checkerboard could influence the result, we tested two different block sizes.The size of the checkerboard blocks was set at 5 km x 5 km to create the first set of bins (checkerboard 1 (cb1)) in one dataset and at 10 km x 10 km to create the second set (checkerboard 2 (cb2)).All the analyses described under Section 2.7 were done for all species, with both sets.

Background points comparison
The datasets of 10′000 background points used for each treatment are shown in Fig. 1.In order to investigate if the background point (BP) datasets are, as one would expect, spatially different from each other, we calculated the Moran's I for each dataset.See Fig. 2 for a finer illustration of the comparison between the classic fully random sampling in geographic space and the stratified random sampling in environmental space.

Optimal model selection and general modeling framework
The thinned occurrence datasets for the target invasive (non-equilibrium) and equilibrium species were randomly divided into two bins: one that contained 80% of their occurrence data and one that contained 20%, a common choice of partition sizes for SDMs (Cosentino et al. 2023).Only the 80% bin was used for the selection of the optimal Fig. 2. Geographic representation of the classic fully random in geographic space (randgeo) background sampling, the stratified random in geographic space (i.e., checkerboard; cb1/cb2) approach, the fully random in environmental space (randenv) approach, where the number of BP is proportional to the size of the strata and the stratified random in environmental space (stratenv) background sampling, where the same number of points are sampled in any combination of environmental variables.The dots represent sampling points, plain lines represent isolines of some gradually varying environmental variable (e.g., temperature changing along elevation) and the dashed line polygon represents a patch of some landscape feature (e.g. a forest).Modified from Hirzel & Guisan (2002).
B. Steen et al. models.For each of these 80% bins, all pertaining to a different target species, 50 maxent models were run using the ENMeval package, with the MaxEnt.jarmodel method selected and using the aforementioned settings (see Section 2.2).Before running each maxent model, the 80% bin of data was randomly split 50-50 across a training bin and a validation bin.We diverted here from the more typically used 80-20 or 70-30 division because the bins used in the checkerboard analyses would split the data roughly evenly across training and validation datasets.The best model settings were selected based on their average value of 3 predictive performance metrics: the continuous Boyce index (CBI; Hirzel et al. 2006), the Area Under the ROC Curve (AUC, Jimenez et al. 2020) and the sum of both (see Fig. 4).The last one was calculated in order to not neglect models that had both (relatively) high Boyce (i.e., good calibration in predictions) and (relatively) high AUC (i.e., good discrimination in predictions).This is necessary, because it is possible for a model to have a high AUC and a low Boyce index or vice versa (i.e.balancing discrimination versus calibration in the evaluation; Guisan et al. 2017), and in both cases that would be a bad model from at least one perspective.Before the sum was calculated, the AUC was recalculated to the gini AUC (2*AUC -1) ranging between -1 and +1 in order to give it the same "weight" as the Boyce index.This therefore yielded 3 optimal model settings (Boyce index, AUC and the sum of both) for each species.Subsequently, the full 100% of the occurrence data for each species was used to fit a final 'full' model using the identified optimal settings (James et al. 2013;Guisan et al. 2017) and again splitting the occurrence data 50-50 across the training and the validation bin in order to ensure that the checkerboard analyses are comparable to the other three treatments.This full model was finally projected in geographic space to obtain a habitat suitability map.This procedure was repeated for each species and all 4 methods of background selection.Note that only one dataset of background points for each method was used for the creation of SDMs in this entire study.

Model similarity analysis: a measure for model stability
To assess model stability, we used a form of virtual simulations based on the habitat suitability maps of the species generated from SDMs at the previous step.We refer to these SDMs and the matching distribution maps as being of generation 1.For this, we artificially considered the generation 1 prediction maps as "true", meaning a perfect presentation of habitat suitability for the species (as in Thibaud et al. 2014).New occurrence points were then generated from each of these true habitat suitability maps, using the values of the map as probability of occurrence, i.e., the higher the habitat suitability was, the higher the likelihood that an occurrence point got generated there.For each species, 7 new occurrence datasets were created: ones of 10, 30, 50, 100, 300, 500 and 1000 occurrence points.For each of these datasets, the model fitting and selection procedures described in Section 2.6 were performed again, minus the splitting up of data into bins of 80% and 20%, i.e., the generation 2 data were divided 50-50, used to train 50 model replicates to identify optimal settings, then re-run with 100% of the data using those optimal settings.We underline that the virtual occurrence datasets were never resampled, i.e., the same occurrence record dataset was used for the 50 MaxEnt runs.These are the models of generation 2. The resulting habitat suitability maps were then compared to the original habitat suitability maps for each species as follows: first, the raster datasets were converted to vectors, and then their similarity was determined using the cor() function in the R language.Our measure of model stability is consequently the measure of spatial similarity between the original map and the map created using the virtual occurrences, i.e., through describing spatial autocorrelation.An overview of the study design is given in Fig. 3. See Appendix 1, Fig. S4 for an idealized presentation of maximum model stability and Appendix 1, Fig. S1 and Table S2 for the effect of the number of generated virtual points on stability.
Many models, especially generation 2 models with low numbers of generated occurrences (10, 30, sometimes 50) yielded no performance Fig. 3. Workflow and design of the study.Please print in color.
B. Steen et al. metrics (see Appendix 1, supplementary Table S4), which is logical, given the very low sample size and 50-50 random split of data.In such a case, the model was not included in the similarity assessments, because no optimal settings could be selected.As a result, the amount of similarity values differed across background point creation strategies: there were 1032 for stratenv, 887 for randenv, 931 for the classic randgeo, 885 for checkerboard 1 (cb1) and 920 for checkerboard 2 (cb2).We compared all the values of the strategies with Welch t-tests.Since we tested no additional explanatory variables, the t-tests were not corrected for multiple comparisons.
In order to address the different sample sizes, we also did an additional analysis in which we only used Welch t-tests after random elimination of values until the two compared strategies had the same number of values.This method was repeated 100 times over, with a new random sample of eliminated values each time.The Welch t-test was chosen because no equal variance could be assumed among both samples.

Results
The moran's I was -3.967e-05 for the randgeo dataset, 3.573e-05 for the randenv background points and 0.001 for stratenv background points.Note that the same BP dataset was used for randgeo, cb1 and cb2; it was merely split into different bins for the latter two.

Model performance metrics comparison
Models calibrated for species at the equilibrium yielded significantly higher AUC and AUC+CBI in generation 2 for the stratenv and CB1 treatments than invasive species models.In generation 1, no significant differences were found when comparing invasive species models to equilibrium species models.
When comparing performance metrics between background point selection strategies (regardless of level of species equilibrium with the environment), stratenv yielded higher AUC values than any other strategy in generation 1 models.No additional significant effects were found in this generation.In generation 2, more significant differences were found.Stratenv had higher AUC than any other strategy, and randenv had a higher AUC than all the geographically sampled BP datasets.The same pattern was observed for AUC+CBI.Stratenv also had higher Boyce index than all other treatments, except for randenv, with which there was no significant difference.
In generation 1, when comparing AUC, CBI and AUC+CBI between IAS and EQ species, within each background point selection strategy, no significant differences in performance metrics were found.When comparing results for equilibrium species and invasive alien species within each BP sampling strategy, significant differences in average AUC and AUC+CBI measures were found in generation 2 for stratenv background points, with a slightly higher average for equilibrium species.The same pattern was found in cb1 treatments.See Fig. 4 for a graphical presentation of these results and Appendix 1, Table S1 for the same results in tabular format.

Effects of background points on model stability
When comparing all of the model stability values (i.e., the measure of similarity between a generation 2 model and the matching generation 1 model) between two treatments (i.e., not using the random elimination method), the results of the Welch t-tests indicate that the randenv strategy yields far higher model similarity (i.e., has far higher model stability) than any other strategy, for both invasive species and equilibrium species.Stratenv yields higher similarity values than the classic randgeo (Tables 2 and 3) and both checkerboard treatments.When randomly eliminating models, the pattern changes slightly.Randenv is still by far the highest in terms of model similarity and stratenv still yields significantly higher results than randgeo, but not higher than either checkerboard treatment anymore (see Table 4).

Effects of equilibrium species vs invasive alien species on model stability
The results were analyzed by both Welch t-tests of all the values and again by random elimination, just as described under Section 2.7.Only the stratenv resulted in higher average model stability for equilibrium species than for invasive alien species, but this result was not found to be statistically significant.The other four background treatments (i.e.including two options for the stratified in geographic space: cb1 and cb2) demonstrated the opposite effect, but this time with significant results, except for cb2 (Fig. 5).The random elimination method however yielded significantly higher similarity values for equilibrium species in the stratenv treatment in 100% of the cases.The strategies randenv, randgeo and cb1 also yielded 100% significant results, but with a higher average for invasive species.For cb2, no difference was found in similarity values for invasive species and equilibrium species.See Appendix 1, Table S3 for the same results in tabular format.

Model stability vs performance metrics
The stratified random in environmental space background points models yielded the highest performance metrics, particularly AUC, for both generation 1 and generation 2 models, and the fully random in environmental space treatment yielded the highest model similarity (see Fig. 6).See Appendix 1, Fig. S2 for the same analyses on the CBI.

Discussion
This research focuses on two questions: (1) Which method of background points sampling produces the best-performing SDMs for invasive and equilibrium species in terms of model accuracy and stability?(2) Do these background point sampling strategies differ between native and invasive alien species?

What drives model stability?
Our results indicate that the classic randgeo sampling of background points produces less stable models (i.e., models with lower similarity values) than both strategies of sampling in environmental space.This is logical and intuitive because the environmental space is used to model the environmental niche (Guisan et al. 2017).In addition, it is also optimal to sample species observations in the field based on environmental space (Hirzel and Guisan 2002).The finding likewise concurs with previous findings that randgeo sampling can cause prediction biases (Botella et al. 2020;Freeman et al. 2022).However, sampling background points in geographic space is still much more common than sampling in environmental space (examples of the latter are e.g.Fragnière et al. 2022, Bazzichetto et al. 2023).
However, the randenv method of background point selection produces by far the most stable (measured by similarity) models for both invasive alien species and species at equilibrium, though the model stability is significantly higher for invasive alien species.This is consistent with the results of Bazzichetto et al. 2023, who used virtual species to show that sampling background points uniformly in environmental space (which is intended to also be fully random in environmental space) produces models that most closely approximate reality for species with wide environmental tolerances, which, in the real world, would include invasive alien species.Therefore, the superior model similarity we found for invasive alien species using this background sampling strategy could be explained by the fact that the stratenv data points are very good at capturing the extremes of the environment, whilst the randenv background points are more continuously spread along the core parts of the environmental gradients.The latter probably informs the models better, particularly the models on invasive alien species, because these are often not present in the   extremes of the environment (e.g.there are very few IAS at high elevations (Fuentes-Lillo et al. 2021;Barros et al. 2022)).Therefore, the distribution of invasive alien species and the randenv BP are probably similar.This could be further reinforced by the fact that citizen science datasets like GBIF are subject to human sampling bias (Chauvier et al. 2021) and therefore contain less observations in high mountains, even for alpine species.

What drives model accuracy?
On the side of performance metrics, the stratenv background method yields higher AUC and AUC+CBI (our combined index) for equilibrium species than any other background treatment, particularly in generation 2 models.This is almost entirely due to higher AUC and the effect is noticeable in all equilibrium species models.Still, the ability of the stratenv models to yield higher AUC indicates that they are better at discerning suitable area from unsuitable area.Namely, a possible explanation that they yield higher AUC but not higher Boyce is that the stratenv models include fewer false positives than the other methods, likely due to their ability to classify the absences of the species.
Our results therefore show that although randenv background treatment yields by far the most stable models, it is less good at modeling realistic species distributions than the stratenv treatment.This is likely because the randenv treatment distributes its background points more heavily in environmental values that are common, rather than in the extremes.Both presence points and randenv BP find themselves often in the most common environmental values.Hence, it is often easier for randenv models to identify relationships between the presence points and the environment and therefore create stable models, but they are not necessarily the right relationships.The latter are more likely to be identified by the stratenv background model treatments.

Do optimal BP selection strategies differ between equilibrium and non-equilibrium species?
When answering question 2 (if different BP selection strategies produce the most accurate and stable models for EQ species and IAS), we would expect that EQ species would yield higher performance metrics and more stable models, since SDMs assume that the species is at equilibrium with the environment and that the species occupies its whole realized niche.In this study, it is assumed that the latter condition is satisfied for EQ species, but not for IAS.Stratified sampling in environmental space was the only method where SDMs trained on occurrence data of equilibrium species yielded both higher performance metrics and higher stability values than for non-equilibrium species.
This might be because the stratenv background points are agglomerated into areas where environmental gradients are steep, such as in mountains.Previous studies indicate that model performance is better when background point sampling biases match the bias of the occurrence records (Botella et al. 2020;Freeman et al. 2022;Schartel & Cao (2024)).This may explain why the stratenv background point sampling method, unlike all other methods, yielded higher model similarity for equilibrium species than for invasive alien species: the distribution pattern of these background points matches the distribution pattern of the species.This agrees with the observation that modeling species not at equilibrium with the environment can lead to inaccurate predictions (Freeman et al. 2022).It is therefore possible that the stratenv sampling works better for equilibrium species than for alien species because the stratenv background points are much more abundant around mountains.Hence, the models may be more stable not because the species are at equilibrium, but because they are primarily mountain species.In addition, many of the equilibrium species chosen in this study have relatively narrow distributions, meaning that stratenv background points, which are agglomerated where the environment is heterogeneous, can provide the model with a more complete view of the habitat preferences of narrow-ranged species.This may indicate that the methods have great conservation applications, as narrow-ranged species are both vulnerable

Table 2
Comparison of model similarity (i.e., as a measure of model stability) per background sampling method across all species for method 1 (in which all models were kept.).Stratenv = stratified random in environmental space, randenv = fully random in environmental space and randgeo = fully random in geographic space, cb1 = checkerboard 5 km, cb2 = checkerboard 10 km.    ).However, environmental strata are very strongly correlated to topography and elevation, so likely, the random stratified sampling in environmental space informs SDMs better on the margins of the environmental tolerance of each species.Still, we cannot refute the possibility that stratenv sampling is better suited to equilibrium species, regardless of whether these occur primarily in mountains or not.
An additional criticism might be that the spatial autocorrelation of the stratenv BP dataset would lead to errors in model training and predictions.Since this has been proven to be true for occurrence datasets (Veloz 2009;Miller et al. 2012), it might also be true for BP datasets.However, model training is about statistical correlations between occurrences and environmental values, it takes place in environmental space, not in geographic space.Therefore, we do not see how model errors resulting from spatial autocorrelation of the stratenv BP models could artificially inflate the stability values.In addition, our sampling of the stratenv BP data should be (most) uniform along environmental gradients, not biased.
The better stability and performance results for invasive alien species than for equilibrium species yielded by all sampling strategies except stratenv can be explained by their distributions being spread relatively evenly over the study area, just like the distributions of the selected species.The species used for this study are widespread, at an advanced stage of invasion.Sampling strategies that cover the more common environmental values therefore inform the model better.It is also possible that the stratenv background points produce models that are less sensitive to sample size, as the invasive alien species have on average more than twice as many occurrence points than the equilibrium species (see Table 1).A complementary argument is that since the chosen invasive alien species are at an advanced stage of invasion, they could be closer to equilibrium with the environment than initially assumed.This may also explain why the result of stratenv background points yielding higher model similarity results for equilibrium species is significant, but only barely: the invasive alien species are relatively close to equilibrium, as well.
We found that no performance metrics could be generated from many models of generation 2 (i.e., the models built using occurrences generated from the generation 1 habitat suitability maps) that were trained with only 10 or 30 occurrences.This is an additional finding that models trained with small sample size may still yield performance metrics (but see Collart and Guisan 2023) but fail our additional stability test.Our results therefore suggest that it is advisable to not run SDMs with fewer than 50 occurrence records (which has already been shown to be problematic (Adde et al. 2023)), and that testing with performance metrics may not be sufficient to evaluate SDMs.Null models are a possible measure of reliability that may alleviate this concern, as proposed in Collart and Guisan (2023).

Possible expansions
Further investigations expanding on our methods might yield Fig. 6.Average model similarity vs average AUC for real species models of generation 1 (the models that were run using real occurrence data) (A) and generation 2 (the models that were run using virtual occurrence data generated from the output suitability maps of generation 1 models) (B).Please print in color.
interesting results.For instance, different methods of assessing overlap of species' predictions, e.g., using Schoener's D (e.g. Warren et al. 2008) or Kulczynski's coefficient (e.g.Randin et al. 2006) instead of our correlation method used to calculate model stability could be used in future studies.This might help put our findings in the larger context of other studies, e.g. on calculating niche and distribution overlaps (Warren et al. 2008), or to compare model predictions in transferability assessments (Randin et al. 2006).Another potentially promising method to assess the effect of spatial autocorrelation on models would be to use Dutilleul's modified t-test, which is especially designed to account for the degree of spatial autocorrelation in statistical inference (Dutilleul et al. 1993).This test would thus be mostly ideal when one is interested in comparing two samples in the presence of spatial autocorrelation.
Secondly, limiting the SDM study area to the home range of the respective species might provide valuable insights, as large home ranges may artificially inflate AUC (Lobo, 2008).In this study, we have not done so, as the comparability of the models, particularly of the widespread IAS and the narrow-ranged EQ species, would be compromised.
Finally, alternate methods of thinning occurrence data might strongly influence performance metrics.We have already presented an alternate thinning method and the effect on model stability and performance metrics for two species in the supplementary materials (appendix 3), however environmental thinning of occurrence data has been shown to improve model performance metrics (Varela et al. 2014).
In conclusion, from our findings, we suggest sampling background points stratified randomly in environmental space when dealing with 1) low sample sizes, 2) species at equilibrium with the environment and 3) species with narrow ranges.Conversely, it is better to sample the background points randomly in environmental space when species are widespread and have a broad environmental niche.In addition, our results show that stratified random in environmental space background sampling yields both more accurate and more stable models than the classic fully random in geographic space sampling, whilst fully random in environmental space background sampling yields by far the most stable, but not necessarily the most correct models.These findings should however be confirmed by further studies.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Fig. 1 .
Fig. 1.Study area and comparison of methods of background creation.The background point datasets are shown for Europe only, in order to facilitate visual comparisons: Study area (plotted in dark gray); Fully random in geographic space (randgeo); Checkerboard treatment.The different bins of the data are shown in red and blue.Note that the grid size of the checkerboard blocks have been enlarged to 100 km in order to show the checkered division of the BP data across the bins; Fully random in environmental space (randenv); Stratified-random in environmental space (stratenv).Note we did not display the stratified random in geographic space method here because the checkerboard models also use the random in geographic space background points (in B).Please print in color.

B
.Steen et al.

Fig. 4 .
Fig. 4. Average performance metrics with error bars for invasive species (IAS) and equilibrium species (EQ) for models of generation 1 (A) and generation 2 (B).Please print in color.
B.Steen et al.

Fig. 5 .
Fig. 5. Comparison of model similarity for equilibrium species and invasive alien species per background sampling method.Please print in color.
The Swiss lists were chosen because of the high abundance of high-altitude species and extensive work on the effect of non-native flora and fauna on biodiversity.It was ensured that these species at least occur in mountains in their home range and that most of their occurrence data (see Section 2.4) were inside the study area (see Section 2.1).The list of species is given in Table1(see Appendix 1, table

Table 3 P
-values of t-tests comparing similarity score yielded for models per background point treatments for method 1.

Table 4
results of background point sampling strategy comparison when randomly eliminating values, to ensure equal sample size (i.e., method 2).