Introduction

The relationship between species occurrence and their environment is fundamental to ecology (Cadotte et al. 2011; Wisz et al. 2013; Schmeller et al. 2018; Young et al. 2023) and its importance is growing in the face of the ongoing global change (Butchart et al. 2010; Barnosky et al. 2011; Carlson et al. 2022). Understanding species’ responses to the environment is intrinsically linked to the possibility of predicting species distribution patterns and, hence, is useful for their conservation and management. Species distribution models (SDMs) are widely used for such purposes (e.g. Václavík et al. 2012; Cohen et al. 2016; Ellis‐Soto et al. 2021; Lindegren et al. 2022; Mohammadi et al. 2022; Cogliati et al. 2023).

Even though SDMs are now commonly adopted, ecologists still face challenges. These are in particular related to the quality of the input data, which can significantly impact the fitted models (Araújo et al. 2019; Gábor et al. 2020; Bazzichetto et al. 2023; Smith et al. 2023; Wang and Jackson 2023). Such challenges include, among other issues, the selection of the appropriate scale/grain (Miguet et al. 2016; Wunderlich et al. 2022; Zarzo‐Arias et al. 2022) and environmental variables (Williams et al. 2012; Moudrý et al. 2019; Smith and Santos 2020).

The selection of environmental variables is a critical step in SDMs, nowadays compounded by the increasing availability of environmental data (e.g., Cord et al. 2013, 2014; Šímová et al. 2019; Howard et al. 2020; Moudrý et al. 2023a). This is especially true for land-cover data, the availability and quality of which increases with the number of remote sensing data (e.g. Prošek et al. 2020; Karra et al. 2021; Zhang et al. 2021; Hopkins et al. 2022). Land cover types are among the most common predictors that enter the SDMs, especially in the context of land-cover changes (Coppée et al. 2022; Peng et al. 2022). Land cover type, which typically describes the habitat availability within a spatial unit, is commonly included as a continuous variable—for example, as the area or proportion of a specific land-cover type within the study area (e.g., Moudrý and Šímová 2013; Rose et al. 2020; Koma et al. 2022). The underlying rationale is based on an assumption that the probability of occurrence of a species increases continuously with the increase in land cover (or habitat) area within a given spatial unit. This could be attributed to the fact that larger habitat areas sustain larger populations due to their higher carrying capacity and shelter and food availability, so that populations are less susceptible to stochastic extinctions, competition, predation and inbreeding depression (Hanski 1999; Lande et al. 2003). In addition, larger habitat areas are bigger targets for colonizing individuals from the surrounding habitat matrix (Buckley and Knedlhans 1986), thereby increasing the probability of rescue effects following extinction events (Brown and Kodric-Brown 1977). However, in some cases, the mere presence or absence of a habitat may be more important than the total habitat area. In a recent study, Gábor et al. (2022a) demonstrated that for species specializing in rare habitats, such as water bodies in Central Europe, the amount of habitat within a spatial unit is sometimes less important than its simple presence or absence. This is related to the concept of critical habitat area (Andren 1994; Fahrig 2001; Melo et al. 2018), which hypothesizes that there is a threshold in habitat amount below which a species cannot survive; for example, loons (order Gaviiformes) are physiologically constrained from foraging on land and require water habitat for survival (see Fig. 1). While there are limited examples of such an approach in SDMs (e.g., Pearson et al. 2004; Romero et al. 2013; Zhang and Vincent 2018), in a recent study, Gábor et al. (2022a) demonstrated that for species specializing in rare habitats, such as water bodies in Central Europe, the amount of habitat within a spatial unit is sometimes less important than its simple presence or absence. This could be particularly advantageous when collecting habitat data in the field as simply recording the presence of a habitat is significantly less time-consuming than recording its total area.

Fig. 1
figure 1

Graphical rationale for the hypothesis on binary variables. We modelled species occurrence probability as a step function (generative model, black dashed line) to generate 100 presences/absences of a species that needs at least 20% of land-cover within a spatial unit, drawn from a Bernoulli distribution with species occurrence probability as a parameter. Then, we fitted species distribution models (GLM) with either proportional (incorrect model, red line) and binary coverage (correct model, blue line), as a variable

It is well-recognized that SDMs are grain-dependent (Elith and Leathwick 2009) and empirical evidence has shown that species exhibit stronger responses to their environment at certain grains than others (see reviews by Miguet et al. (2016), Moudrý et al. (2023b)). Therefore, the choice of the grain constitutes an important part of the modeling process as it can affect the ability to detect the response (Mertes and Jetz 2018; Luebert et al. 2022; Wunderlich et al. 2022; Lu and Jetz 2023). In addition, the choice of the grain is often determined by data availability rather than study goals. This results in large variability of grains adopted in existing studies, from a few meters (e.g., Bazzichetto et al. 2018; Lecours et al. 2020; Casanelles-Abella et al. 2022; Stark and Fridley 2022) to many kilometers (e.g., Kleisner et al. 2017; Norberg et al. 2019; Zarzo-Arias et al. 2022). The grain size is also essential when selecting environmental variables (Pearson and Dawson 2003; Moudrý et al. 2023b), and it is likely to be critical when deciding whether to use land-cover as a binary or continuous variable when modelling species occurrence. However, this has never been thoroughly tested.

We hypothesize that at finer grains, binary variables are sufficient to fit a useful model, as a single habitat type is more likely to be dominant or completely absent. However, at coarse grains, the presence of habitat alone cannot be enough to sustain a viable population that requires a certain percentage of habitat in a given area, and the variables representing the habitat proportion may play a greater role. Further, habitat specialists may only require the presence of a given habitat type during winter or migration but, require that habitat to be the dominant type on the landscape during breeding season (Zuckerberg et al. 2016), emphasizing the importance of understanding seasonal moderators on the selection of environmental variables.

Therefore, in this study, we examined the interaction between the grain size and the representation of a land-cover variable (binary or continuous) when modeling species distribution. We used citizen science data downloaded from eBird to fit species distribution models for 57 water bird species in North America. The eBird dataset consists of extensive and often spatially detailed occurrence data (an average number of records per species in this study is 880,270 records), making it ideal to model species distributions down to fine spatial grains and across large spatial extents. For each species, we fit models with water cover represented as a proportional or binary variable and using multiple grains (1 × 1 km, 5 × 5 km, 10 × 10 km, and 50 × 50 km). In addition, we fit models for both breeding and non-breeding seasons because many of these species occupy highly distinct ranges between these seasons, with specialization to landcover type often greatest during breeding season (Zuckerberg et al. 2016). Specifically, we address the following questions: (1) Do models built with binary land-cover variables perform equally well or even better than those built with proportional variables? (2) Is the model performance affected by the adopted gain? (3) Does breeding and non-breeding season affect model performance? (4) Do grain and season affect the usability of the binary water cover variable?

Material and methods

Modeling region and species selection

We modeled species distributions across the North American continent, in a box between 179.99° W, 42.68° W, 10.53° N, and 87.11° N that included Canada, the United States, and the majority of Central America. Although we only modeled species native to the United States and Canada and only present the biodiversity estimates for this region, modeling each species’ entire continental range was required to ensure accurate predictions.

Based on the American Birding Association birding codes (2008) updated to the Clements bird taxonomy as of 2021, we compiled a list of 197 water-associated native bird species that breed or overwinter in the United States and Canada each year (Clements 2007). These codes are widely used to distinguish regularly occurring species (code 1 or 2 species, which we use) from vagrants occurring irregularly (code 3+). We excluded species with almost entirely marine ranges or with insufficient data points and those for which records are only available for part of the season (i.e., breeding, non-breeding). We also avoided modeling Hawaiian endemics because these species were restricted to islands that were smaller than some of our spatial grain sizes.

Species data acquisition and filtering

We gathered data from eBird, a global citizen science initiative in which users submit checklists containing bird observations, widely used for understanding species distributions (Sullivan et al. 2014). eBird provides vast amounts of avian biodiversity data and the number of records used in this study is literally unique, ranging from tens of thousands to units of millions, with an average of 880,270 records per species (see Table A1 in Supplementary material). Users can indicate whether or not all observed species were recorded in the checklist (“complete” checklists), allowing for absence inference and presence-absence modeling. Users also indicate the level of effort involved in each observation by providing the distance traveled, time spent birding, and the number of observers (hereafter, effort indicators).

Table 1 Environmental covariates included in species distribution models

We compiled all eBird data for all species separately during the breeding season (June–August; June–July for shorebirds or Charadriiformes, which migrate early) and non-breeding season (December–February). We removed subspecies information from all checklists and summarized all data at the species level. Following established eBird data modeling protocols, we initially applied several filters to the data to reduce bias and improve data quality (Johnston et al. 2019; Kelling et al. 2018). First, we eliminated checklists with extremely long durations (> 3 h), large numbers of observers (> 5), or protocols other than “stationary” or “traveling,” as these are incomparable with the majority of eBird’s data. Second, to reduce observational positional error, we eliminated checklists covering more than 1 km because they are likely to result in greater spatial uncertainty (Gábor et al. 2022b, 2023). Third, data prior to 2004 (< 0.1% of points) were removed because there is insufficient data from earlier years to adequately control for long-term temporal trends in regional bird abundances, as recommended by eBird (Fink et al. 2010, 2020). Finally, data from users with fewer than five contributions were removed to reduce bias (e.g., false absence) caused by inexperienced birders.

Before modeling, the data was further filtered at the species level. Checklists were limited to those falling within a 200 km buffer of the species’ seasonal expert range boundary to limit overprediction outside of the species’ range extent (via Cornell’s spatial boundaries—https://ebird.org/science/status-and-trends/download-data, accessed July 2021). When modeling shorebirds (order Charadriiformes), we excluded August data because many species are already migrating long distances by this time. We limited checklists to one per 5 km grid cell per week to reduce site selection and temporal bias in data collection. To reduce the imbalance between presence and absence points, we utilized “case–control sampling” repeating this filtering for populations of checklists where the species is present and absent as recommended by eBird’s best practices (Fink et al. 2018). We ended up modeling 57 species (see Table A1 in Supplementary material). R 4.1.0 was used to complete all data compilation, analyses, and visualizations (R Core Team 2021). Coordinates, polygons, and grids used in the study operated under a conical equal area projection. For spatial geoprocessing, the raster (Hijmans et al. 2015), rgdal (Bivand et al. 2015), and sf (Pebesma 2018) packages were used.

Environmental data

In total, we considered 12 land-cover variables, 4 topographic/habitat variables, and 4 climatic variables (Table 1). We obtained land-cover data from the European Space Agency (Climate Change Initiative; “ESA. Land Cover CCI Product User Guide Version 2. Technical Report” 2017) at a 300 m resolution. This product provides proportional land-cover variables (i.e. they provide the proportion of the area covered by a particular land-cover type) and includes the following land-cover types: mixed forest, mosaic, shrubland, grassland, lichens/mosses, sparse, flooded/freshwater, flooded/saltwater, flooded/shrub, urban, barren, and ice. We coarsened land-cover variables to a grain of 1 km by mean-aggregating the percentages of individual land-cover types.

To examine how water cover (flooded/freshwater category) influences model performance, we summarized this variable in five ways: as a proportional representation and as a binary variable classified using thresholds of 1%, 10%, 20%, or 50%. For example, when using the 1% threshold, any cell with water cover of 1% or more is considered to contain water in the binary representation and any cell with less than 1% is not.

While our primary focus in this study was on the influence of water cover on model performance and output, each SDM included several classes of covariates to account for numerous other factors that influence species distributions. Our topographic/habitat suite of variables included the mean elevation (from EarthEnv; Robinson et al. 2014), mean enhanced vegetation index (EVI; MODIS; https://lpdaac.usgs.gov/products/mod11a1v006/), topographic wetness index (TWI; hydroSHEDS; Marthews et al. 2015), and terrain roughness index (TRI; EarthEnv). Climatic covariates included the mean annual temperature (bio1), mean annual precipitation (bio12), precipitation seasonality (bio15; all bio variables from CHELSA v2.1; Karger et al. 2021), and intra-annual variation in cloud cover (EarthEnv). We selected environmental variables to reduce collinearity (Fig. A1), although collinearity is less of a concern within random forest (Farrell et al. 2019). All variables (topographic, climatic, and land-cover) that were available at finer resolution were spatially aggregated to 1 km using the mean value of all cells falling within the 1 × 1 km cell. We also used temporal covariates in our models, such as year, date, and time of day, to account for temporal variability in bird activity and long-term population trends. Finally, as covariates, we included all effort indicators (distance travelled, duration of checklist, and number of observers).

Coarsening the spatial grain

To test the role of representing the water cover variable as binary or proportional in different spatial grains, we coarsened the spatial grain of our analyses by aggregating all land-cover, topographic, and climatic variables. We aggregated from 1 × 1 km to 5 × 5 km, 10 × 10 km, and 50 × 50 km (hereafter ‘test grains’) using the means of all 1 km cells within the larger cell. The test grains were chosen based on their frequent use in published SDM studies (see review by Moudrý et al. 2023b). The species observation data were subsequently associated to the each coarser resolution grids. This approach leads to the fact that multiple species records may occur in any cell of the coarser grid. Following Guisan et al. (2007), we decided to keep the sample size constant. If any coarse cell contained more than one species record, all these records were retained rather than reducing these to single record per cell. This allowed us to keep the sample size (i.e. presence: absence ratio) consistent between all grains and avoid mistaking the effects of change in resolution for those caused by the change in sample size and prevalence (Leroy et al. 2018). Temporal and effort covariates were not scaled because they were assigned at the checklist (point) level.

Model fitting and evaluation

We used Random Forest to model the breeding and non-breeding distributions of each species separately. Random Forest is a machine learning method designed to analyse large datasets with many covariates and is frequently found to produce the most accurate SDMs (e.g., Mi et al. 2017; Valavi et al. 2023). Random forests are adaptable, automatically adjusting to complex, nonlinear relationships, and consider high-order interactions between environmental variables (Evans et al. 2011). Random Forest models were conducted in R version 4.1.0 using the ranger package (Wright et al. 2018).

We randomly divided the data into training and testing. The testing set was used for model validation. Models were parameterized to 100 trees and 7 threads. We compiled metrics of predictive model performance including the area under the ROC curve (AUC), true skill statistic (TSS), sensitivity, and specificity. We masked portions of the predictions that fell outside of the buffered range extent when estimating the predictive performance. To diagnose overfitting, we examined test-set calibration plots for each model. To compare the effect of binary and proportional water cover variables, we assessed their relative importance. Overall, we fit models for each of the 57 analyzed species during each season (breeding vs. non-breeding) at four test grains (1 × 1 km to 5 × 5 km, 10 × 10 km, and 50 × 50 km) and five types of water cover representation (proportional, and binary using four thresholds), resulting in 2280 separate models. The computational time for our models was approximately a month using high-performance computing tools, which vividly demonstrates the volume of scenarios we ran to test our hypothesis.

Results

All performance metrics followed similar patterns, hence we focused on TSS for simplicity (see Figs. A2, A3 in Supplementary material for trends in AUC, sensitivity, specificity). In general, SDMs fitted at the finest grain (1 × 1 km) performed very well (breeding TSS = 0.8; non-breeding TSS = 0.79; Fig. 2). The decrease in model performance was observed between original models at 1 × 1 km grain and models at 5 × 5 km grain. Further coarsening of grain resulted in minimal additional decrease in models performance (Fig. 2). The comparison of model performance (TSS) between models fitted with water cover as a proportional variable and as a binary variable revealed minimal differences in model performance at all grains (Fig. 3). At 1 × 1 km grain, the performance decreased with the increasing threshold used to derive the binary water cover variable. At coarser grains, the threshold had no effect on model performance (Fig. A3). The results showed that the choice of water cover representation mattered most in species with fewer observations, which are generally less common (Fig. 3B). The drop in model performance was more pronounced for non-breeding season models (Fig. 3B), while TSS dropped on average by 0.12 from 0.82 (1 × 1 km) to 0.7 (50 × 50 kms) for breeding season models; the drop for non-breeding season models was about 0.2. In the models fitted using binary water cover representation, the importance of the water cover variable significantly decreased compared to the models using its proportional representation (Fig. 4). The importance of the water cover variable decreased with coarsening the grain and increasing the threshold. For example, when models were fitted using a 1 × 1 km grain and a 1% threshold, the mean drop in water cover importance was approx. 20% compared to the proportional representation, whereas for models fitted using the 50 × 50 km grain and 50% threshold, the drop was over 80% (Fig. 4).

Fig. 2
figure 2

Predictive performance represented by TSS. Columns show results for different grain sizes, sorted by the thresholds (%) used to generate a binary land-cover variable. Rows show results for different seasons. Boxplot central lines represent median for each scenario, points represent individual species

Fig. 3
figure 3

Variation of predictive performance across summarization method, sample size, and spatial grain. A Difference in TSS between models fitted with binary land-cover variable compared to a reference model using the proportional land-cover. Boxplot central lines represent the median for each scenario, points represent individual species. B Variation in TSS across different sample sizes (green and purple colour represent 1% and 50% thresholds, respectively). Columns show results for different grain sizes. Rows show results for different seasons. Positive values indicate that models with binary habitat predictors performed better than those with proportional predictors and vice versa

Fig. 4
figure 4

Differences in the importance of the water cover variable in models fitted binary land-cover (with four thresholds) compared to the importance resulting from the corresponding model using the proportional representation of this variable. Boxplot central lines represent the median for each scenario, points represent individual species. Negative values indicate that models with binary water cover predictors had lower water variable importance than models with proportional predictors and vice versa

Discussion

To fully utilize the potential of the fact that the availability of data suitable for SDM keeps increasing, we must understand how to process and use these data effectively. This includes determining the optimal grain size and appropriate selection of environmental variables. In this study, we evaluated the usefulness of binary land-cover variables for SDMs at multiple grains. We computed species distribution models for 57 water bird species in North America across several grain sizes (i.e., from 1 km2 up to 2500 km2) for breeding and non-breeding seasons. We used the proportional and binary representations of land-cover variables derived with various thresholds (1%, 10%, 20%, 50%).

Our results show that the use of binary land-cover variables should be approached with caution. Models created with binary variables performed equally well as those with proportional variables (Fig. 2), but there was a significant decrease in the importance of the water cover variable (Fig. 4). The use of binary water cover variable often led to the situation when this variable was not identified as the most important land-cover variable for water bird species which is, to say the least, unexpected. In addition, we observed that with coarser grain and greater thresholds used to derive the binary water cover variable, the water cover became less important, with a drop of up to 90% at the coarsest grain used compared to the model using the proportional water cover variable (Fig. 4). This contrasts with the study by Gábor et al. (2022a) who showed that for water bird species, models at approximately 10 × 10 km grain using binary predictors performed better than models with proportional predictors. The contrasting results may be attributed to the fact that in our study, the modeling algorithms were allowed to choose the best combination of environmental variables, i.e., the lower predictive power of the binary form of the water cover variable is compensated for by another variable. For example, flooded tree or shrubland habitats often occur near water cover and may become predictive of a species occurrence. In contrast, Gábor et al. (2022a) used only water cover variables for model fitting and, hence, their models did not have the option to select variables that would provide a closer fit.

Our results show that as the grain size becomes coarser, the model performance decreases, which is consistent with prior studies (Guisan et al. 2007; Seo et al. 2009; Chauvier et al. 2022; Zarzo-Arias et al. 2022) and supports the common recommendation to first try the most detailed grain that the data allows (Moudrý et al. 2023b). However, it’s important to note that a finer grain may not always be better, as organisms respond to their environment more strongly at some grains than at others. These grains have been referred to as ‘response grains’ or ‘ecological scales’ and depend on the species ecology (e.g., its home-range; Mertes and Jetz 2018). Consequently, the choice of grain in models can strongly influence our ability to detect and measure species’ response to the environment. The grain at which a species is expected to respond to the environment should be always considered (Manzoor et al. 2018; Wunderlich et al. 2022; Moudrý et al. 2023b).

Importantly, our results suggest that the grain size can impact the applicability of binary land-cover variables. The large decline in the importance of the binary water cover variable relative to its proportional representation observed at 25 km2 resolution and coarser suggests that at grains coarser than 1 km2, the binary representation of water cover is not applicable. Therefore, the proportion or total perimeter of water bodies (e.g., Virkkala et al. 2005; Moudrý and Šímová 2013) should be preferred at coarser grains. However, at a 1 km2 grain, the decrease in the importance of the water cover variable was relatively small and models using binary predictors (presence/absence of the habitat) might be useful.

The threshold for deriving the binary water cover variable is another important aspect to consider. For example, a 1% threshold used in this study requires a water area equivalent to 2 American football fields at a 1 × 1 km resolution, 185 at a 10 × 10 km resolution, and 4638 at a 50 × 50 km resolution. For comparison, the area of the lake of the Great Salt Lake (Utah, USA) is 456 football fields. Thus, it is apparent that especially at coarser grains, the use of an inappropriate threshold can lead to a substantial drop in the number of cells identified as containing water cover, even though many of them provide enough carrying capacity for smaller water bird species populations (Hanski 1999; Melo et al. 2018). Indeed, our results show that using higher than 1% threshold leads to a decline in the importance of the binary water cover variable compared to its proportional representation, making it unusable.

Despite our expectations, we did not find any significant difference in model performance or in importance of the water cover variable between breeding and non-breeding seasons. We expected that models would perform better during the breeding season as water is more critical to the biology of many birds during this season. This expectation was based on the fact that birds have offspring that often cannot leave the water or fly, and many species remain stationary at a specific nesting site.

Conclusions

The appropriate selection of environmental variables and grain size in species distribution modeling is of considerable importance. In this study, we demonstrated that (1) the performance of the models was not significantly affected by the type of the adopted water cover variable (proportional or binary) but a significant decrease was observed in the importance of the water cover variable when used in a binary form, (2) the performance of the models was affected by the adopted grain, with models at a 1 km2 grain performing considerably better than models at coarser grains, (3) models for the breeding and non-breading season performed equally well, and (4) the importance of the binary water cover variable declined relative to its proportional representation with coarser grains and during the breeding season. We highlight that the use of binary land-cover variables should be approached with caution. Binary representation of water cover might be useful at finer grain sizes (i.e., 1 km2), which is a promising result as at more detailed grains, the simple presence or absence of a certain land-cover type can be a realistic descriptor and can save time in fieldwork. However, binary representation of water cover is unsuitable for models at grains coarser than 1 km2. For such grains, the use of proportional land-cover variables produces more reliable models.