Importance of data selection and filtering in species distribution models: A case study on the Cantabrian brown

Species distribution models (SDMs) are powerful tools in ecology and conservation. Choosing the right environmental drivers and filtering species ’ occurrences taking their biases into account are key factors to


INTRODUCTION
Exploring how species respond to different environmental characteristics is key to creating successful management and recovery plans (Jetz et al., 2012). Species distribution models (SDMs) are currently popular tools for this purpose, especially considering the increase in the occurrence data availability (Guisan et al., 2013;Rodríguez et al., 2007). As an example of this trend in animal and plant modeling, we can name the increasingly used MaxEnt. This method based on the maximum-entropy approach (Merow et al., 2013;Phillips & Dudík, 2008) has become popular over the last decade because it can work with presence-only records of species' occurrences that are increasingly available from global databases (Jayathilake & Costello, 2020;Kuralt & Kos, 2018;Zhang et al., 2021). There are, however, key factors that need to be considered by researchers before modeling, such as choosing the right environmental drivers (Gardner et al., 2019) or filtering available species' occurrences and coping with their biases, for example, sampling bias or positional error (Fernandes et al., 2018;Kramer-Schadt et al., 2013;Moudrý &Šímov a, 2012). If not, SDMs may provide misleading results and conclusions, which can be particularly problematic when applied to population management and conservation.
SDMs relate species' occurrences to the environmental characteristics of their habitat and assess potential landscape-species relationships and geographical range suitability (Elith et al., 2011). Thus, the correct selection of environmental characteristics that may play a role in the distribution patterns of target species is one of the crucial modeling steps. Not much attention has been given to the selection of variables and there is little consistency among studies regarding the used variables (Austin & Van Niel, 2011;Fourcade et al., 2018;G abor et al., 2022). Some studies even rely only on climate-derived variables; this is especially true for studies predicting species shifts due to climate change, although the climate may not be enough to predict major shifts in species' distribution (Araújo & Guisan, 2006). Especially in vertebrates, such as large mammals, climatic variables might influence some species only indirectly by affecting their food and shelter as their habitat selection would be more likely driven by landscape features and structure and/or human impact (Berteaux & Stenseth, 2006;Greenberg et al., 2014). Thus, many studies fail at choosing important variables shown to be physiologically relevant to the species (Gardner et al., 2019).
Furthermore, species occurrences mostly come from non-systematic observations from museum collections, administrations, or citizen science, usually accessible from global public databases, such as Global Biodiversity Information Facility (GBIF, www.gbif.org) (Moudrý & Devillers, 2020;Newbold, 2010). Information from these sources can be affected by several (generally more than one) errors related to the observer, and the tools and/or collection methods employed (Marcer et al., 2022;Moudrý & Devillers, 2020). For example, data might suffer from positional uncertainty/inaccuracy, sampling effort bias (i.e., sampling toward places where it is easy to locate individuals or that are highly visited, but not from areas with reduced visibility or more remote and inaccessible), or different ecological meaning (i.e., observations recorded where a specific behavior takes place, such as locations of damages to human property or of water sources) (Anderson, 2012;Fourcade et al., 2014;Guillera-Arroita, 2017;Kramer-Schadt et al., 2013). Thus, the proper selection of species' input records is a necessary step toward construction of accurate habitat models (Beck et al., 2014;Boria et al., 2014). Common approaches to reduce the impact of bias in occurrence data include, for example, spatial or environmental filtering (G abor, Inman et al., 2021;Varela et al., 2014;Veloz, 2009), subsampling (Beck et al., 2014), or the use of systematic sampling or a bias file to account for sampling bias and to weight occurrences accordingly (El-Gabbas & Dormann, 2018;Fourcade et al., 2014;Tessarolo et al., 2014). Still, performing a careful revision of the available data to evaluate and understand various errors and biases in view of the data sources or collection methods before modeling is uncommon. Ignoring the intrinsic inaccuracies and ecological differences of the data can, however, produce misleading results and conclusions (Guillera-Arroita, 2017;Isaac & Pocock, 2015;Leitão et al., 2011). Still, the sources of possible bias are broader than just "how and by whom the data are collected." Long-term monitoring information available for some species has the potential to provide quality data for studying species distributions . However, different periods might reflect different population statuses associated with density-dependent use of space and, consequently, different habitat requirements (Seoane & Carrascal, 2008). Besides, data collected more recently are likely to be more accurate in their geolocation due to the wide availability and ongoing improvement of the GPS technology (Tomaštík & Varga, 2021). Population status should also be considered when filtering species data as, for example, dispersing individuals in expanding populations can appear in unfavorable areas while searching for new landscapes or moving between suitable habitat patches; their habitat selection may considerably differ with their breeding status as well (Moudrý et al., 2017;Rayment et al., 2015;Stamps, 2001). Additionally, different subpopulations or clusters might differ in the (1) growth/reproductive rate, (2) habitat or resources availability, (3) behavior, (4) impact of human infrastructure and density, (5) size/number of individuals (which might affect density-dependent patterns), or (6) number of occurrences or sampling effort. All of these can influence the models' behavior. Models may, therefore, be improved by splitting the species into subunits with different ecological characteristics or with different sampling efforts (Fourcade et al., 2014;Stockwell & Townsend Peterson, 2002).
Faced with the many problems of ecological modeling, a question arises: Is it better to use as much data as possible, disregarding their error or ecological meaning, or should we try to screen the data and minimize the bias? This is a key question that should be acknowledged before modeling species distributions and considered when building models, interpreting the results, and drawing conclusions (Bloom et al., 2018;Guillera-Arroita et al., 2015;Kramer-Schadt et al., 2013). In this study, we intended to test the importance of selecting and filtering proper input data (i.e., variables and occurrence data) to build SDMs in a case study for a generalist species, as well as to make these decisions based on expert knowledge of the analyzed species and populations. For this, we applied MaxEnt modeling to the distribution data on the endangered Cantabrian brown bear (Ursus arctos) population located in northwestern Spain from 1985 to 2019. The small number of bears in this population (Pérez et al., 2014), their complete isolation, and the limited connectivity between subpopulations make this population an interesting model in terms of conservation and management. We hypothesized that (1) climatic variables would not correctly define bear habitat selection as this species is dependent rather on landscape features and inhabits a wide range of climatic regions all around the world (Krofel et al., 2021; and (2) that thorough filtering of species' occurrences will produce better models relying on both model validation and expert opinion.

Study area and species
In our study, we focus on the brown bear population inhabiting the Cantabrian Mountains, located in the northwest of the Iberian Peninsula (Figure 1), a typical human-modified landscape of southern Europe . This population is divided into the western subpopulation, with 203 bears (95% CI = 168-260) in 2014 (Pérez et al., 2014) occupying an area of more than 7000 km 2 , and the eastern subpopulation, with 19 bears (95% CI = 12-40) in 2014 (Pérez et al., 2014), occupying around 4000 km 2 (Lamamy et al., 2019;. The long-term monitoring of the Cantabrian brown bear population, which started in the 1980s, is essentially based on yearly direct sightings, indirect signs (traces) of presence (footprints, hairs, and scats), images from camera traps randomly located in the bear distribution range Zarzo-Arias et al., 2019), and records of damage caused to livestock, beehives, crops, human activities, and infrastructures (Zarzo-Arias et al., 2020). Data have been gathered by the personnel of the Principado de Asturias and Junta de Castilla y Le on, primarily the Patrulla Oso, as well as all the other guards of both regional governments, by Fondo para la Protecci on de los Animales Salvajes (FAPAS), Fundaci on Oso de Asturias (FOA), Fundaci on Oso Pardo (FOP), and by the Cantabrian Brown Bear Research Group (http://www. cantabrianbrownbear.org/es/). Data were divided into groups considering first the type of data: (1) direct and indirect (camera traps) bear observations; (2) traces (footprints, hairs, and scats); and (3) damages, as each one has different intrinsic ecological and sampling considerations. Then, we divided the data into (1) the whole time frame (1985-2019), (2) the older , and (3) the recent period (2006)(2007)(2008)(2009)(2010)(2011)(2012)(2013)(2014)(2015)(2016)(2017)(2018)(2019)(2020). This temporal division corresponds to the main trend change observed in this bear population, which shifted from slow recovery to population increase and colonization of new areas that occurred around 2005 (Mateo-S anchez et al., 2015;Zarzo-Arias et al., 2019). We also grouped the data according to the western and eastern subpopulations, reflecting the highway dividing the area and acting as the main dispersal barrier ( Figure 1).

Environmental data
We considered ecologically relevant environmental variables related to the terrain (5), climate (19), habitat (7), and human activity (5). Terrain, habitat, and human activity data are available for download from the Spanish National Center of Geographic Information (https://centrodedescargas.cnig.es/). The habitat variables were derived from the national Corine Land Cover database for the year 2018. For each analysis grain, we calculated the area of the following land uses to get the percentage per cell: urban areas, forests, crops, shrublands, pastures, rocks, and conifers. In addition, we calculated the distance to highways, roads, trails, and rivers. The terrain variables (Lecours et al., 2017) were derived from a high-accuracy digital terrain model at a 25-m resolution (Moudrý et al., 2018). The terrain attributes were derived using Horn's algorithm at that resolution and subsequently aggregated to the analysis grain using the mean value from each cell (Moudrý et al., 2019). The human population density obtained from Gallego (2010) was also included in the analysis. Nineteen standard bioclimatic variables were downloaded from the CHELSA climatic portal (https://chelsa-climate.org/). We extracted the information from all variables in grid cells of 5 km, 1 km, and 500 m (i.e., analysis grain) in our study area using QGIS version 3.8 (Quantum GIS Development Team, 2015). We examined pairwise correlations of variables as was done in many other studies (Brun et al., 2020;Lecours et al., 2020;Parra et al., 2004); from those showing high mutual correlations (Pearson's coefficient >0.7), we always kept the one most relevant to the species ecology. The correlated variables included mainly climatic variables (detailed description of variables can be found in Appendix S1); the four climatic variables most relevant to the species ecology were kept (isothermality, temperature seasonality, annual precipitation, and precipitation seasonality). Subsequent variance inflation factor (VIF) analysis did not find any potential multicollinearity at any analysis grain (VIF < 5; Zuur et al., 2010). Finally, we removed variables with ≤0.1% mean contribution to a preliminary set of models at a 1-km grain. A total of 16 variables were retained for modeling (see Appendix S1).

Evaluated input data characteristics
It has been hypothesized that organisms respond to their environment more strongly at some grains than at others. F I G U R E 1 Area of study (the Cantabrian Mountains) and brown bear observations, traces, and damages (brown points) used in this study. The dotted red line indicates the highway dividing the two bear subpopulations.
In general, it is assumed that climate defines the distribution of species at broad spatial grains. At successively finer grains and at regional extents, the importance of topography variables in controlling species occurrence increases; at even finer grains, the presence of individual land cover categories can play a role (Field et al., 2009;Pearson & Dawson, 2003). As a consequence, the choice of the analysis grain can strongly affect the ability to detect and measure species' response to the environment (Huston, 2002). As it has been shown that different variables may affect bears' distribution at different spatial grains (Mateo-S anchez et al., 2014), we tested three different spatial analysis grains (square grids of 5 km, 1 km, and 500 m) to account for this.
To choose the optimal way to build the Cantabrian bear habitat suitability models, we first built a set of 1-km models with and without climatic variables as climate variables, together with land cover ones, which were reported to be the most important predictors of this population at finer analysis grains (see Zarzo-Arias et al., 2019), while topography does at coarser resolutions. We used the model performance and expert knowledge of the species and its distribution within the study area to evaluate whether climatic variables were driving the modeled species distribution. Then, we compared models with input data (1) of different types (observations, traces, damages, and their combinations); (2) from different time frames (1985-2005 and 2006-2019); (3) considering grids with just a single occurrence (i.e., cells used only for dispersion that do not show habitat preferences but the casual movement around the landscape looking for new suitable places) as non-presences and as presences; and (4) from the whole population and from each subpopulation independently. Subsequently, we selected the best combination of input data. Finally, to evaluate the impact of data filtering on the results, we compared this model to the one built using non-filtered data (i.e., the model containing all types of data for the entire time frame  using all grids with at least one occurrence for the whole population).

Model fitting and selection
For building the models, we used the software MaxEnt (Phillips et al., 2004(Phillips et al., , 2006 version 3.3.3k called from the R environment version 3.5.1 (R Core Team, 2020) using the packages dismo version 1.3.3 (Hijmans et al., 2017) and ENMeval version 0.3.1 (Muscarella et al., 2014) following the methodology applied by Zarzo-Arias et al. (2019) at three different spatial analysis grains (square grids of 5 km, 1 km, and 500 m). Input data consisted of the centroid coordinates of the cells where at least one bear occurrence falls in. It is recommended to evaluate the best combination of MaxEnt parameter settings to select the most appropriate models (Morales et al., 2017). Therefore, to identify an optimal model structure, we evaluated candidate models with all types of feature combinations representing different types of parameterizations (linear, product, quadratic, hinge, threshold, and categorical), each run over a set of regularization multipliers ranging from 0 to 19 to control for overfitting by penalizing variables that add little to the model (see Elith et al., 2011, for more details). We used 500 iterations (maximum number of steps to reach convergence), and a convergence threshold (probability of predicting no occurrence where there is one) of 10 À5 , using the method checkerboard1 for data partitioning. This method aggregates the original environmental input grids to partition occurrence localities into two bins for training and testing (Muscarella et al., 2014). The MaxEnt default number of background points (10,000) has been criticized as insufficient to capture the total range of environmental conditions of the targeted landscape (Renner & Warton, 2013). For this reason, we used values from all grids in the study area (specifically 1019 background cells for the 5-km 2 resolution, 25,475 for 1 km 2 , and 101,900 for 500 m 2 ) to build all the models (as in, e.g., Ericksson & Dalerum, 2018;Swanepoel et al., 2013).
We identified the most robust and parsimonious combination of feature types and regularization multipliers using Akaike information criterion corrected for small sample sizes (AIC c ) following Warren and Seifert (2011). We considered models within two AIC c units to have equivalent empirical support (Burnham & Anderson, 2002) and chose the simplest model (lowest number of parameters, and if equivalent, lowest number of feature types) as the best model. The model structure and the total number of grids considered for each group of bear data in each analysis grain are available in Appendix S2.

Model evaluation
It is recommended to assess model performance using multiple metrics. Therefore, we used three validation metrics in this study (Grimmett et al., 2020). We assessed the model discrimination ability (i.e., how well the model differentiates between the presence and non-presence sites) using the area under the receiver-operator curve (AUC; Fielding & Bell, 1997). AUC ranges from 0.5 for models with no discrimination ability to 1 for models with perfect discrimination. In addition, we adopted also the mean absolute error (MAE) approach recently recommended by Konowalik and Nosol (2021) to be combined with AUC for selecting the best performing models, as the latter might be affected by large numbers of background cells (Lobo et al., 2008). The MAE of a model is the mean of the absolute values of the individual prediction errors (i.e., the difference between the true value and the predicted value) over all instances in the evaluation set. Finally, we calculated the Boyce index, which is recommended for presence-only models and shows the accuracy of the predicted probability of presence (Hirzel et al., 2006). To calculate the index, the predicted suitability is partitioned into suitability classes and, subsequently, the ratio between the predicted and expected frequency of evaluation points is plotted against the corresponding suitability class. If the model correctly distinguishes between species-suitable and unsuitable areas, the ratio between predicted and expected (random) frequencies of presences increases with the suitability (i.e., the low-suitability classes should contain fewer evaluation presences, while high-suitability classes should contain more evaluation presences than expected by chance). This increase is measured using the Spearman rank correlation coefficient, and hence the Boyce index ranges from À1 to 1. Negative values indicate an incorrect model that predicts low suitability in areas with more frequent presences, while positive values indicate a model whose suitability predictions match the distribution of presences in the evaluation dataset. We adopted the continuous version of Boyce index, which overcomes the sensitivity of the index to the number of suitability classes (Hirzel et al., 2006;Jiménez & Sober on, 2020).
After evaluating the models' predictive performance, we compared them according to the relative contribution of each environmental variable, which was calculated by a jackknife procedure and a heuristic method provided by MaxEnt. Additionally, with the raster output of the predictions, for which we used the complementary log-log format, we performed a niche comparison analysis based on the Schoener's I statistic (Warren et al., 2008), which ranges from 0 (no overlap) to 1 (perfect overlap). Finally, we compared the predicted suitability at presence locations and map outputs graphically.

Climatic variables
Models including climatic variables showed higher AUC but worse performance according to MAE (Figure 2a), and slightly higher Boyce index values than models without climatic variables (Appendix S2: Table S1). Climatic variables also presented a high mean percentage contribution (54.5%), especially isothermality (24%, Bio 3) and temperature seasonality (29%, Bio 4; Figure 2b). These variables, together with the distance to highways (29%), represent the main drivers of bear habitat suitability in these models. Once climatic variables were removed, the distance to highways increased in importance (+19/48.2% contribution), as well as slope (+20/22.8%) and altitude (+7/9.3%), which, together with forests (9.9%), became the most important explanatory variables. Numerical evaluation of the map predictions of both sets of models showed high similarity (mean Schoener's I = 0.933; Appendix S3: Table S1).
On the other hand, models including climatic variables show more restricted habitat suitability and high importance of climatic factors for bear distribution. Taking a closer look at the suitability prediction, the results of climate variables containing models are misleading; this is especially true for the eastern area, where models indicate some of the areas of Cantabria ( Figure 2c) to be unsuitable for the bear, while in reality, bears have inhabited these areas for decades (Clevenger et al., 1987). Furthermore, the global brown bear distribution implies good adaptability to a wide variety of climatic conditions (Kaczensky et al., 2013). Therefore, we decided to remove climatic variables from further analysis as they were not appropriate for modeling this bear population's distribution.

Analysis grain and type of occurrence
Models performed better at finer analysis grains, especially in terms of AUC and MAE (0.11 AUC and 0.014 MAE mean difference between the coarsest and finest analysis grain) (Figure 3a), although only the 5-km analysis grain slightly underperformed the other analysis grains according to the Boyce index (Appendix S2: Table S2). At this analysis grain, models constructed using different sets of input variables were mutually more similar than at the other grains (mean Schoener's I = 0.994; Appendix S3: Table S2). On the other hand, the importance of individual variables varied more among models at finer grains ( Figure 3b). Distance to highways was the most important variable at all analysis grains, while crops and altitude stand out at the 5-km grain, and slope and forest at 1-km and 500-m grains.
According to validation values, we can conclude that all models performed properly at all analysis grains for the different types of data and their combinations (AUC > 0.7, Figure 3a; Boyce index ≈ 1, Appendix S2: Table S2). Considering both the highest AUC and lowest MAE mean values of all combinations of data (Appendix S2: Table S1), models including only observations appear to perform best (Figure 3a), although the difference in the performance values is very low (0.06 AUC and 0.015 MAE maximum difference). All types of data also showed very similar results in terms of map predictions (mean Schoener's I = 0.985). Models present slight changes in contributions of the most important variables, particularly between analysis grains (Figure 3b). Models based on observations only and F I G U R E 2 Comparison of the preliminary 1-km models with versus without climatic (clim.) variables according to (a) area under the receiver-operator curve (AUC) and mean absolute error (MAE) values, (b) variable percentage contribution, and (c) suitability map predictions difference; higher predicted habitat suitability is shown in brown for the model not including climatic variables, and in blue for the model containing them, while gray areas show very similar suitability predicted by both for the mean of all types of brown bear data (observations, traces, damages, and their combinations). Alt., altitude; Bio 3, isothermality; Bio 4, temperature seasonality; Bio 12, annual precipitation; Bio 15, precipitation seasonality; D. hway, distance to highway; D. river, distance to river; D. road, distance to road; D. trail, distance to trail; H. Dens., human population density; Shrub., shrubland; Sun rad., sun radiation. damages only yielded results more different from the others (regarding variable contribution and Schoener's I) (Figure 3b; Appendix S3: Table S2). The combination of observations and traces represents the best presence data for SDMs construction, based on expert knowledge, and gives better predictions (Figure 3c; Appendix S4) than the models performing best in numerical analyses. The latter, including only observations, presented more restricted habitat suitability and underpredicted the eastern subpopulation habitat.

Time frame
Regarding the time frame, all periods produce well-performing models when including both observations and traces at all analysis grains (AUC > 0.74, Figure 4a; Boyce index ≈ 1, Appendix S2: Table S3). Models with data from the first period  stand out, but validation statistics between the entire time frame and the most recent period presented very similar values (0.005 AUC and 0.003 MAE maximum difference, Figure 4a; Appendix S2: Table S3). Both periods also show very similar results in terms of map predictions (Schoener's I > 0.99), especially at the larger analysis grains (Appendix S3: Table S3). In the first period, the contribution of individual variables was slightly different from the second period and from the entire time span, especially for the coarsest analysis grain (Figure 4b). By comparing map outputs between periods at the finest analysis grain (Figure 4c), predictions do not show evident differences in any specific area, but models from the first period appear more restrictive in terms of habitat suitability (Appendix S5).

Dispersal
When testing the inclusion or exclusion of grids with dispersing bear occurrences, both approaches produce well-performing models at all analysis grains (AUC > 0.74, Figure 5a; Boyce index ≈ 1, Appendix S2: Table S4). However, models considering all grids with more than one occurrence performed better in terms of AUC, although they had a higher MAE. Still, as above, the difference in validation statistics is very low (0.05 and 0.08 units of max AUC and MAE difference between models, respectively; Figure 5a; Appendix S2: Table S3). Map predictions from both models were very similar (Schoener's I > 0.98): the coarser the grain, the greater the similarity (Appendix S3: Table S4). In terms of the contribution of individual variables to the models, both models present very similar values, except for the largest analysis grain, where the importance of slope increases at the expense of crops when excluding dispersal grids (Figure 5b). By comparing map outputs from both models at the 500-m analysis grain (which differed most from the other grains), no under-or overprediction in any specific area can be seen. Nevertheless, the model excluding dispersal grids provides a more restricted habitat suitability, avoiding the overestimation of habitat quality in areas used to travel between best landscape patches ( Figure 5c; Appendix S6).

(c)
F I G U R E 3 (Continued) , and (c) suitability map predictions for the differences between the older and the recent periods at the finest analysis grain (500 m); higher predicted habitat suitability is shown in brown for the model built on the data from the older period, and in blue for the recent period, while gray areas show very similar suitability predicted by both. Alt., altitude; D. hway, distance to highway; D. river, distance to river; D. road, distance to road; D. trail, distance to trail; H. dens., human population density; Shrub., shrubland; Sun rad., sun radiation.

Subpopulations
According to model validation, models including data for the whole population and each subpopulation independently performed well at all analysis grains (AUC > 0.78, Figure 6a; Boyce index ≈ 1, Appendix S2: Table S5). Difference in MAE was very low between models (0.015 unit of maximum MAE difference), but in F I G U R E 5 Comparison of models built with all grids with at least one occurrence (i.e., including dispersal grids) versus grids with more than one occurrence using data on observations (obs) and traces from the period 2006-2019 as input data according to (a) area under the receiver-operator curve (AUC) and mean absolute error (MAE) values, (b) variable percentage contribution (decreasing from green to blue), and (c) suitability map predictions for both periods at the finest analysis grain (500 m); higher predicted habitat suitability is shown in brown for the model including dispersal grids, and in blue for the model excluding them, while gray areas show very similar suitability predicted by both. Alt., altitude; D. hway, distance to highway; D. river, distance to river; D. road, distance to road; D. trail, distance to trail; H. dens., human population density; Shrub., shrubland; Sun rad., sun radiation. F I G U R E 6 Comparison of models constructed using observations and traces from the period of 2006-2019 in all grids with more than one occurrence for the whole population versus the ensemble of models for the western and the eastern subpopulations (Subpop.) at each analysis grain as input data according to (a) area under the receiver-operator curve (AUC) and mean absolute error (MAE) values, (b) variable percentage contribution (decreasing from green to blue), and (c) suitability map predictions for both periods at the finest analysis grain (500 m); higher predicted habitat suitability is shown in brown colors for the model with the entire population, and in blues for the ensemble of the two subpopulations, while gray areas show very similar suitability predicted by both. Alt., altitude; D. hway, distance to highway; D. river, distance to river; D. road, distance to road; D. trail, distance to trail; H. dens., human population density; Shrub., shrubland; Sun rad., sun radiation.
terms of AUC modeling, each subpopulation independently produced better-performing models, especially at the largest analysis grain (Figure 6a). In terms of map predictions, models using the whole population and the combination of the individual subpopulations yielded similar results (mean Schoener's I = 0.986), especially at larger analysis grains. They were, however, similar to the models produced when using only the western subpopulation occurrences (Appendix S3: Table S5). Distance to the highway was a highly important variable in all models, especially at the finer analysis grains. The contribution of individual variables, however, differed between subpopulations, with altitude being more important in the eastern subpopulation and slope in the western one, bringing to light that the habitat availability differs between these areas. The contribution of individual variables to the model was similar for the model built on the whole population and the one built for the western population, especially at smaller analysis grains (Figure 6b). This can also be inferred from the map predictions, where the model for the whole population underpredicts the eastern core's habitat suitability (Figure 6c; Appendix S7).

Final comparison
Both the unfiltered model (built once the appropriate variables were selected, i.e., not including climatic variables) and the best model (using observations and traces in 2006-2019 in all grids with more than only one occurrence for the ensemble of independent models of the western and eastern subpopulations) performed well at all analysis grains (AUC > 0.7, Figure 6a; Boyce index ≥0.98, Appendix S2: Tables S1 and S5). Still, the non-filtered model showed lower AUC values (0.134 units of maximum AUC difference in the largest analysis grain) and slightly lower MAE (0.074 units of maximum MAE difference in the smallest analysis grain) than the final chosen model. The two models also showed similar results in terms of map predictions (mean Schoener's I = 0.977), especially at the smallest analysis grains (Appendix S3: Table S6). On the other hand, the importance of individual variables varied more. The non-filtered model overestimated the effect of crops and sun radiation at the largest analysis grain, while slope at the largest and altitude at the smallest grains had a low contribution (Figure 7b). Regarding map outputs (Figure 7c), the filtered model gives a more restricted habitat suitability and better reflects the suitable areas in the core of the eastern subpopulation.

DISCUSSION
SDMs are a useful tool, but some considerations regarding the selection of input data must be addressed before modeling. Several studies have highlighted that before building SDMs, a good ecological and expert background knowledge of the species, its ecology, and distribution are highly recommended (Beck et al., 2014;Harris et al., 2013;Merow et al., 2017). Thus, carefully revising and filtering available data so that they fit the ecological question we are trying to address is necessary to obtain the results in the way most appropriate for their use in developing conservation and management recommendations Moudrý &Šímov a, 2012). On an example of a generalist species, our results highlight the importance of this, especially when the results are intended for making decisions at the local level.

Climatic variables
Selecting appropriate environmental variables is the key to building proper models. If such models are not ecologically relevant to the species and cannot properly represent conditions that drive species' distribution, results and conclusions can be severely prejudiced (Austin & Van Niel, 2011;Booth, 2021;Gardner et al., 2019). Our results confirm in a real-world case that we cannot trust the models to choose the most important variables if they are not previously "groomed" by expert opinion and knowledge of the species. As demonstrated before (e.g., Harris et al., 2013;Porfirio et al., 2014), the results for the same species can vary greatly depending on the combinations of variables, from predicting almost extinction to very little range contraction. Even though methods for preselecting variables have not yet been explored deeply so far (Williams et al., 2012), some authors have tried to assess the best variable combinations through several methods without much success in finding any universal procedure (Dormann et al., 2008;Maggini et al., 2006;Meynard & Quinn, 2007). It should also be noted that the adequacy of selected variables may also depend on the characteristics of the species, representing a major problem for highly mobile animals (Porfirio et al., 2014). Thus, expert opinion or information about the species should be considered when selecting the environmental drivers. Furthermore, climatic variables are usually included in these types of studies, but, as evidenced by our models, they may not always yield a good fit. Bears are more likely influenced by climate indirectly, as shown by the fact that they can be found in areas with very different climatic conditions, from Alaska or Russia to Mediterranean F I G U R E 7 Comparison of the non-filtered model (all types of data, i.e., observations, traces, and damages, for the entire time frame  including all grids with at least one occurrence for the whole population) versus the final filtered model (observations and traces in 2006-2019 including all grids with more than only one occurrence for the ensemble of independent models of the western and eastern subpopulations) at each analysis grain according to (a) area under the receiver-operator curve (AUC) and mean absolute error (MAE) values, (b) variable percentage contribution (decreasing from green to blue), and (c) suitability map predictions for both periods at the finest analysis grain (500 m); higher predicted habitat suitability is shown in brown colors for the non-filtered model, and in blues for the final filtered model, while gray areas show very similar suitability predicted by both. *AUC, MAE, and variable percentage contribution values for the filtered model are the means calculated from the independent models of the western and eastern subpopulations. Alt., altitude; D. hway, distance to highway; D. river, distance to river; D. road, distance to road; D. trail, distance to trail; H. dens., human population density; Shrub., shrubland; Sun rad., sun radiation. countries (and historically even the north of Africa), same as wolves (Canis lupus) (Hovardas, 2018;Kaczensky et al., 2013;Ripple et al., 2014). Failure to consider the entire bear distribution (actual but also historical) and, consequently, all climatic conditions this species can endure can lead to misleading results (Chevalier et al., 2021;Faurby & Araújo, 2018). This was also shown in our models that indicated the climatic variables to be the most important ones, but such models led to underprediction of the habitat suitability in areas known to be inhabited by the species. Moreover, as shown by Albrecht et al. (2017), climate (increasing winter temperatures in particular) contributed substantially to the Holocene decline of brown bears mainly indirectly by facilitating human land use. Furthermore, as omnivorous species, bears can adapt to various conditions and food resources depending on natural availability, making these variables even less important Penteriani & Melletti, 2021;Zarzo-Arias et al., 2020). This emphasizes the need for a basic understanding of the species' ecology for the appropriate selection of variables used for modeling, despite the fact that many studies still overlook this matter. Our findings also imply the urgent need to revise studies on the impact of climate change on these types of species, as the climate itself may fail to predict real shifts in their home range but rather act indirectly by affecting resources (Araújo & Guisan, 2006;Guisan & Thuiller, 2005).

Analysis grain and type of occurrence data
Given that the occurrence data are generally unlikely to be error-free (although the error magnitude can be minimized by, e.g., balanced survey design, even sampling effort, and maintaining low positional uncertainty), we need to focus on how to obtain the right answers by asking the right questions and knowing which data fit our question better (Guillera-Arroita et al., 2015). In our dataset, direct bear observations are usually recorded from the distance in the mountains, either in the clear-cuts in forested parts or above the tree line, as well as from camera traps prevalently located in forested areas where direct observations are more difficult Zarzo-Arias et al., 2019). Traces include footprints or scats, usually located in already existing trails or mountain paths, and hairs are commonly found in rub trees or fences in the fields . On the other hand, damages are linked to human activities, and they are mainly driven by the accessibility of easy resources such as apiaries, fruit trees, or livestock (Zarzo-Arias et al., 2020). Also, they are commonly georeferenced not from the location at which the damage occurred, but from where the damaged property is registered (i.e., the home of the owner). None of the models with different data combinations stood out in terms of performance, but the importance of individual variables and especially spatial predictions differed, mainly at the largest analysis grain (5 Â 5 km). This is of particular concern as bears generally react to the environment at coarser spatial analysis grains (Mateo-S anchez et al., 2014). Although observations are the species input data type that performed best (Figure 3), map outputs appear misleading as they underestimate the core area of the eastern subpopulation, most likely due to the differences in sampling effort, population size, and habitat characteristics and availability (Lamamy et al., 2019). Thus, ecologically speaking, if looking for habitat suitability, the sensible choice would be to combine observations and traces, as those are the occurrences showing the natural behavior of the species. Damages, on the other hand, would not make good input data as animals approach human activities predominantly due to the presence of easily accessible resources, thus not reflecting the natural behavior (Zarzo-Arias et al., 2020).

Time frame
The dynamics of some carnivore populations in Europe have changed in recent years, going from relict and almost extinct to expanding or even recolonizing parts of their historical ranges, from which they were driven (mainly) due to human activities. This is especially supported by conservation and management efforts increasingly focusing on mitigating human impact on populations with poor conservation status (Chapron et al., 2014;Ripple et al., 2014). Indeed, the brown bears' population in the Cantabrian Mountains has also improved from slow recovery to population increase and colonization of new areas (Mateo-S anchez et al., 2015;Zarzo-Arias et al., 2019). Occurrences from the first years of monitoring show only very specific and limited areas to be suitable. At that time, bears occupied only the most suitable areas because they did not have to compete with other individuals for space due to the low population size and density (Matthiopoulos et al., 2015). However, using data from the most recent period would be the sensible choice for making decisions based on the current status of the population. As it is currently expanding, the bears might, providing that their conservation remains a priority and human-induced mortality is minimum, soon colonize even landscapes that are more human-disturbed or less abundant in terms of resources (Ciucci & Boitani, 2008; Morales-Gonz alez et al., 2020). Furthermore, the technology used in the early years of monitoring was less accurate than in the 21st century with novel technologies (Rodríguez et al., 2007), and sampling effort in the eastern subpopulation was lower in the first period (Fern andez-Gil et al., 2010), making these data even poorer in quality.

Dispersal
Considering all cells with at least one record as presences can shift habitat selection at the population level, while reviewing and screening species data usually produce better-performing models . As mentioned before, individuals in increasing and saturated populations might have to compete for the most suitable habitats or try to find new favorable territories further from their actual core/home range. Among bears, young males are the typical dispersal individuals: they are commonly relegated to the less suitable habitats when fighting for the landscape in a density-saturated population (Zedrosser et al., 2007). They can travel throughout unsuitable landscapes when moving toward new favorable habitat patches, which may place them closer to humans even if bears generally avoid them (Martin et al., 2010;Mateo-S anchez et al., 2014;Stamps, 2001).
In fact, at the coarsest grain of our models, the variable "crops" appears as decisive in explaining the bear's habitat suitability (24% contribution) when including dispersal grids (cells with at least one occurrence). In turn, excluding dispersal grids produces models with more restricted habitat suitability, showing avoidance of more human-modified areas ( Figure 5c) and demonstrates increased importance of the variable "slope" as bears' home range is usually linked to steeper territories (Martin et al., 2012).

Subpopulations
When using SDMs, we intend to obtain information about the species' habitat suitability, but specific conditions of the landscape inhabited by a single subpopulation might show only a part of the habitat characteristics that may drive our species' habitat selection, mainly due to the landscape available in their specific areas (Arthur et al., 1996;Mysterud & Ims, 1998). Carnivores, for example, have been restricted to less human-disturbed habitats as they tend to avoid human activities, and the habitat availability may vary throughout the landscape (L opez-Bao et al., 2017). Indeed, models performed better when focusing on individual subpopulations (Figure 6a). The importance of individual variables differed between the subpopulation models, showing favorable habitat features according to intrinsic habitat availability (Lamamy et al., 2019). This has been also reported in other bear populations (e.g., black bear, Benson & Chamberlain, 2007;or polar bears, Ferguson et al., 2000) as well as in other species (e.g., bison, Kuemmerle et al., 2018;or mice, Wright & Frey, 2015), where different subpopulations or clusters show different habitat preferences. Besides, our models show that when modeling the whole population, the size of the population or the sampling effort in each area can shift the results toward the subpopulation with more available information (western subpopulation, Figure 6b,c). Ensembling these models allows the combining of different subpopulations, species, or even model approaches in order to obtain merged predictions, which could be a more appropriate way to model species populations (Araújo & New, 2007).

Conservation applications
In terms of applied conservation, the fact that there are still favorable and unoccupied habitats in the eastern area of the Cantabrian Mountains (Figure 8) represents the most important result obtained in this study. This, again, brings our attention to the fact that the subpopulation inhabiting this area has not experienced the same expansion and recovery as the western population (Penteriani et al., 2018). "Eastern" bears have been even shown to move to the western areas (Greg orio et al., 2020) and have slightly (but not significantly) lower reproductive rate and success (Penteriani et al., 2018), which cannot be explained by differences in habitat composition and structure (Lamamy et al., 2019); this part of the region has even more protected areas. As stated before (Lamamy et al., 2019;Zarzo-Arias et al., 2019), the factors driving this situation in the eastern brown bear Cantabrian subpopulation require urgent study to maintain the quality and conservation status of this population and to enhance the connectivity in the future to increase the gene flow.

CONCLUSIONS
On the example of a generalist species, this study offers a detailed insight into methods for tackling specific types of bias that can be found in data available for building SDMs. In particular, we conclude that for a species such as the brown bear, selection of relevant variables is more important than filtering species' occurrences. Since there is practically no background information on how to select appropriate variables for modeling purposes (Dormann et al., 2008;Maggini et al., 2006;Meynard & Quinn, 2007), we show that we cannot expect models themselves to "decide" if the selected variables are indeed a good fit. Furthermore, we show that we cannot just rely on model performance and/or validation metrics to tell us if our model predicts the species' habitat requirements correctly. We first need to choose both the variables (Austin & Van Niel, 2011;Booth, 2021;Gardner et al., 2019) and the occurrence data that are relevant to our research question (Guillera-Arroita et al., 2015) based on the basic knowledge of our species' ecology, in order to obtain appropriate results on which to base applied management and conservation decisions. This is especially important if their application is required at a local level and for species that react to the environment at large spatial scales. Additionally, models can help answer different questions using species input data with different filtering in order to, for example, explore habitat selection by individuals of different age (Milanesi et al., 2016) or sex (Kwon et al., 2019), different life cycle stages (Mateo-S anchez et al., 2015), or focus only on damages to get conflict hotspots (Behdarvand et al., 2014). In short, we highlight several widely overlooked issues preceding the construction of SDMs that can affect important predictions on species' future and bring the issues to the attention of ecologists building SDMs. The application of these ideas can improve the selection of appropriate data for building SDMs that address the posed ecological questions and highlight the need for expert knowledge on the target species.

ACKNOWLEDGMENTS
During this research, Alejandra Zarzo-Arias was financially supported by a Margarita Salas contract financed by the European Union-NextGenerationEU, Ministerio de Universidades y Plan de Recuperaci on, Transformaci on y Resiliencia, through the call of the Universidad de Oviedo (Asturias). Vincenzo Penteriani was financially supported by the I + D + I Project PID2020-114181GB-I00 financed by the Spanish Ministry of Science and Innovation, the Agencia Estatal de Investigaci on (AEI), and the Fondo Europeo de Desarrollo Regional (FEDER, EU). Luk aš G abor was supported by the Faculty of Environmental Sciences, Czech University of Life Sciences Prague (Internal Grant Agency project no. 2022B0012). The local administrations of the Junta de Castilla y Le on and the Principado de Asturias provided us with the long-term database on brown bear observations. In particular, we would like to thank David Cubero Bausela, Mercedes García Dominguez, and María Angeles Osorio Polo of the Junta de Castilla y Léon, and Teresa S anchez Corominas, F I G U R E 8 Final map prediction without climatic variables, using bear observations and traces data for the most recent period (2006)(2007)(2008)(2009)(2010)(2011)(2012)(2013)(2014)(2015)(2016)(2017)(2018)(2019), including all grids with more than only one occurrence and combining models for each subpopulation independently at the finest analysis grain (500 m). Bear occurrences are shown as black points, and the dotted red line indicates the highway dividing the two bear subpopulations.