Using centroids of spatial units in ecological niche modelling: Effects on model performance in the context of environmental data grain size

This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited. © 2021 The Authors. Global Ecology and Biogeography published by John Wiley & Sons Ltd 1Department of Biogeography, University of Bayreuth, Bayreuth, Germany 2BayCEER, Bayreuth Center for Ecology and Environmental Research, Bayreuth, Germany


| INTRODUC TI ON
Ecological niche models (ENMs), based on niche theory, are widely used in many fields such as invasion and conservation ecology, biogeography, as well as epidemiology (Elith & Leathwick, 2009;Escobar & Craft, 2016;Liu et al., 2018;Peterson, 2014). They are often employed to estimate the spatial distribution of certain species (Elith et al., 2006(Elith et al., , 2011Elith & Leathwick, 2009) or diseases (Tjaden et al., 2018), and thus also known as 'species distribution models'. ENMs typically use geographical occurrence locations of the target species as input data. These locations are then related to a series of explanatory variables (spatial raster data describing the environmental and/or socio-economic conditions in the study area), forming a correlative model of the species' environmental niche. This model can be projected onto regions where the presence-absence state of the species is unknown, resulting in a map showing probability of presence or environmental suitability.
By default, ENMs assume that the whole study area was sampled consistently with precisely recorded geographical locations of occurrence, and that the selected explanatory variables can represent the species well (Yackulic et al., 2013). In practice, however, sampling bias is inevitable, especially when the input data have to be assembled from different sources that are based on different sampling methods Lobo & Tognelli, 2011;Qiao et al., 2017;Stolar & Nielsen, 2015). While the sampling bias caused by uneven sampling can be reduced by rarifying or filtering the occurrence records (Castellanos et al., 2019;Gábor et al., 2020;Kramer-Schadt et al., 2013), imprecisely recorded occurrence locations are very difficult (although not entirely impossible) to correct (Hefley et al., 2017). In certain cases, for example, when using citizen-science databases or local monitoring systems, the occurrence locations of the species may be of a coarse precision or only available at municipal or county level (i.e., related to geographical surfaces of differing sizes). For epidemiological data, such missing spatial precision ensures information privacy.

Internet databases like the Global Biodiversity Information
Facility (GBIF; http://www.gbif.org) gather and compile species occurrence data from different sources (scientific, governmental, citizen-science) across a tremendous geographical extent and across national boundaries. However, the precision of the occurrence locations in this kind of database is not always sufficient , and the record precision differs considerably depending on how the occurrence records were collected and processed (in certain cases, the precision might even be unknown) (Collins et al., 2017). In other databases, only very coarse administrational level information is available, that is, instead of geographical locations, the occurrence records are assigned to counties or postal regions. For instance, the European Centre for Disease Prevention and Control (ECDC)   with the shifted values, will further lead to a biased prediction for the species' distribution, and will probably lead to over-prediction.
While finding a substitute for a geographically unknown occurrence location, drawing the geographical centroid of the ASU minimizes the largest possible spatial distance between the substitute location and the unknown true location ( Figure 1b1). However, this does not necessarily minimize the difference in environmental conditions (i.e., values of explanatory variables) at the two locations. In fact, it is entirely possible that among all possible locations within an ASU, the environmental conditions at the ASU centroid are the worst possible substitute for the conditions at the true location-especially in areas where spatial heterogeneity is high. Approaching the substitute from another angle (Figure 1b2), using a central tendency value (median, mean) of each explanatory variable across the entire ASU has been presented as a better option (Park & Davis, 2017). Instead of minimizing the largest possible spatial distance, central tendency values minimize the largest value mismatch directly. The boxplot in Figure 1b2 illustrates this on the basis of the median: when using the median as the substitute, very likely (97% in this example) the largest potential value mismatch is half of the range between the two bars. Possibly (50%) the value of the occurrence location falls in the box. In this case, the largest potential value mismatch is even smaller. It is obvious that using central tendency values reduces

| MATERIAL AND ME THODS
To investigate how much ASU size and grain size affect ENM performance, a two-factorial design with three replicates was applied F I G U R E 1 (a) Value frequency mismatch of an explanatory variable resulting from the use of administrational spatial unit (ASU) centroids.
(a1) is a group of ASUs (here: counties) with true occurrence locations and respective centroid locations. Note that for each ASU only one centroid location will be kept, as there exists only one. (a2) is the value frequency curve mismatch between these two groups of locations, concerning an explanatory variable. (a1) and (a2) illustrate our hypotheses on how geographical distance between occurrence locations and ASU centroids leads to value frequency mismatch. (b) Zooming in to each ASU, for a single pair of occurrence location and centroid location.
(b1) shows that using ASU centroids minimizes the largest potential spatial distance (thick black line) between the centroid location (black dot) and any possible unknown occurrence location in the given ASU. However, this does not mean that the difference in values between those two points is minimized. (b2) shows the variation of values of an explanatory variable across all the grid cells within the ASU. More than 97% of values fall into the range between the whiskers, and 50% of the values fall into the rectangular box (a1) (a2) (b1) (b2) (Figure 2). For this, a virtual species was generated based on three explanatory variables across Europe (see below for details). According to the presence-absence map of this virtual species, three squared regions (sized 5° × 5°) were selected within which the virtual species occupancy was about 50%. For each squared region, pseudo-ASUs of three different sizes were constructed by dividing it evenly into 25 (large), 100 (medium) and 400 (small) squares ( Figure 2). The size range of these pseudo-ASUs corresponds to those of low-level administrational units across Europe. There, the Nomenclature of Territorial Units for Statistics (NUTS, https://ec.europa.eu/euros tat/ web/nuts/backg round) consists of three levels (NUTS 1-3). NUTS 3 is for small regions with a population size threshold of 150,000-800,000. The average area of the NUTS 3 units is about 7,000 km 2 .
The large pseudo-ASU size applied in this study is about 8,000 km 2 , which can be treated as an equivalent of the average NUTS 3 administrative units across Europe. The medium pseudo-ASU size is about 2,000 km 2 , and the small pseudo-ASU size is about 500 km 2 .
The hypotheses were first tested with these artificial pseudo-ASUs from the three rectangular regions (with three pseudo-ASU sizes; Figure 2). Afterwards, data from real countries (Germany and France) with irregular ASU size and shape were used to confirm the previous results in a real-life environment (Supporting Information Appendix S1, Figure S1.1). The varying pseudo-ASU sizes for the regions were applied to detect the general trend of centroid-arisen bias. France and Germany were chosen as test cases as they are of very different NUTS 3 ASU sizes. For France, the average area of NUTS 3 ASUs is about 6,000 km 2 . For Germany, it is 1,200 km 2 . As the NUTS 3 ASU size of Germany is much smaller than that of France, we expected the ENM models based on Germany's NUTS 3 centroids to outperform the ones based on French NUTS 3 centroids.
For the rectangular regions as well as France and Germany, a series of commonly used grain sizes was taken into consideration (four grain sizes 0.5, 2.5, 5, 10 arc-min, roughly equivalent to 0.5, 10, 40 and 200 km 2 , respectively), because the grain size (raster resolution) of explanatory variables in ENMs also affects model performance (Connor et al., 2018;Guisan et al., 2007;Lauzeral et al., 2013;Manzoor et al., 2018). It is necessary to view both ASU size and grain size as factors that affect models on similar special scales.

F I G U R E 2
The two-factorial study design with three replicates. Three rectangular regions [with varying grain size and pseudoadministrational spatial unit (pseudo-ASU) size] were used to assess the general trend of bias resulting from the use of ASU centroids together with varying grain size. For each region, 200 random locations were drawn to keep the sampling effort even, and only locations where the virtual species occurs were kept (black crosses). The whole setup was repeated with the three bioclimatic variables that the virtual species was generated with (see Material and methods)

| Explanatory variables
To keep the virtual species simple, only bioclimatic variables were taken into account. For this, the standard set of 19 bioclimatic variables was acquired from https://www.world clim.org (Fick & Hijmans, 2017), with grain sizes of 0.5, 2.5, 5 and 10 arc-min.

| Value frequency mismatch in explanatory variables due to the use of ASU centroids
To visualize the value frequency mismatch, the three explanatory variables' values were extracted at the centroid locations of the three different pseudo-ASU sizes separately per region. This was repeated for the four different grain sizes (Figure 2). The value frequencies of the explanatory variables were then described through a kernel density curve (which can be understood as the response curve of the virtual species to the variable) using the R package 'caTools' version 1.17.1.2 (Tuszynski, 2014). Relative overlap of the curve for the centroid locations with the corresponding curve for the original occurrence records was calculated using 'caTools'. Mismatch between these curves was then calculated as 1 − overlap, so that a mismatch of 0 means identical curves and 1 means no overlap at all.
The spatial heterogeneity of each explanatory variable was assessed by calculating its standard deviation within the respective pseudo-ASU and for the respective grain size, using the 'raster' package version 2.6.7 (Hijmans, 2019) in R. The spatial heterogeneity was expected to increase with the size of pseudo-ASUs. For each explanatory variable, a linear regression was applied to describe the correlation between frequency curve mismatch and spatial heterogeneity.

| Ecological niche model
Maxent, an ENM algorithm widely used and known for its good performance with small occurrence location datasets (Baldwin, 2009;Elith et al., 2006;Hernandez et al., 2006), was chosen in this study.
To make the models comparable, the settings were kept the same for all runs, for the three rectangular patches, Germany and France.
Default model settings were applied (with 10,000 background locations), with 10 replicates. Instead of commonly used methods such as the true skill statistic (TSS) or the area under the curve (AUC) of the receiver operating characteristic, model performance was assessed using Spearman's rank correlation coefficient (Spearman's rho). Spearman's rho is obviously a better choice than AUC, and compared to TSS, Spearman's rho has the advantage of being thresholdindependent. As the true spatial distribution probability map for the virtual species is available, Spearman's rho for the correlation between the environmental suitability predicted by the model and the true probability of presence can easily be achieved [via R package 'pspearman' (Savicky, 2014)]. In this case, Spearman's rho can range from 0 (no correlation) to 1 (perfect linear positive correlation, the compared models are identical).
To calculate Spearman's rho, the larger grain-sized (2.5, 5 and 10 arc-min) outputs from Maxent models were resampled to 0.5 arcmin resolution using the 'nearest neighbour' method [R package 'raster' (Hijmans, 2019)]. Essentially, this means cutting the large raster cells into smaller ones, while keeping the original values without interpolation or loss of information. The model results were then compared with the true (distribution) probability map of the virtual species (the virtual species was generated with 0.5 arc-min resolution; for more details see above and Supporting Information Appendix S2).

| Calculation of over-prediction ratio
The

| The larger the pseudo-ASU size, the larger the value frequency mismatch of the explanatory variables
Using ASU centroids resulted in a mismatch of value frequency curves of the explanatory variables. This mismatch increases with the spatial heterogeneity within the respective ASU. A statistically significant positive relationship between variable mismatch and spatial heterogeneity was revealed through linear regression analysis for all three bioclimatic variables (Figure 3, top; Bio1: p < .01, R 2 = .16; Bio12: p < .05, R 2 = .11; Bio15: p << .001, R 2 = .56). The overall spatial heterogeneity of an ASU increases with its size (Figure 3, bottom).

| For ENM performance, ASU size matters more than the grain size
Ecological niche models built with the original occurrence locations showed strong correlations of predicted environmental suitability F I G U R E 3 Top: the spatial heterogeneity of each explanatory variable (Bio 1: annual mean temperature, Bio 12: annual precipitation, Bio 15: precipitation seasonality) increases with increasing centroid region size. Bottom: value frequency curve mismatch for explanatory variables (Bio 1, Bio 12, Bio 15) at occurrence locations versus centroid locations increases with elevated spatial heterogeneity. Each black dot represents one pair of comparisons of the recorded variable values between centroid locations and real locations. For each variable, the spatial heterogeneity is measured by the standard deviation of that variable within the individual centroid regions. ASU = administrational spatial unit.
with the true distribution of the virtual species, suggesting good model performance (Figure 4a). For these ENMs, the model performance decreased with increasing grain size (see Supporting Information Appendix S4, Figure S4.

| Over-prediction of species spatial distribution due to the use of centroid data
Almost all ENM runs in this study, including those performed with true occurrence locations, over-predicted the virtual species' occurrence. Based on the MaxSSS threshold, over-prediction tends to be stronger with increasing ASU size ( Figure 5). However, this increase is statistically significant only for the large ASUs (p < .01 based on ANOVA followed by a Tukey honest significant difference post-hoc test). This is consistent with the results obtained from the eqSS and 10 percentile thresholds (Supporting Information Appendix S4, Figure S4.6).  (Table 1). This is in accordance with the pattern shown in Figure 4: ASU size matters more than grain size.

| D ISCUSS I ON
In this study, we looked into the mechanism of how the use of centroids affects ENM performance. Though there have been studies focusing solely on centroid size (Collins et al., 2017;Park & Davis, 2017) or grain size (Connor et al., 2018;Lauzeral et al., 2013;Manzoor et al., 2018), we investigated and compared how much ASU size and grain size affect ENM performance. Our results confirm that, in general, larger ASUs have higher spatial heterogeneity, and higher spatial heterogeneity is associated with higher value frequency mismatch of F I G U R E 4 Administrational spatial unit (ASU) size affects model performance more than grain size. (a) Grey: performance of the ecological niche model (ENM) at different ASU sizes. White, for reference: ENM performance when using true occurrence locations. Lower case letters above the boxes indicate differences between groups as indicated by a Kruskal-Wallis rank sum test with multiple comparison post-hoc test. (b) Performance of the ENM at different grain sizes (i.e., spatial resolution of environmental data). Lower case letters above the boxes indicate that an ANOVA followed by a Tukey honest significant difference post-hoc test revealed no statistically significant differences between any of the groups. Grey boxes in (a) and (b) refer to the same set of models. Model performance was assessed through the correlation coefficient (Spearman's rho) between the environmental suitability predicted by the ecological niche model and the true species distribution defined for the virtual species. Model performance ranges from 0 to 1. The larger the value, the better the model performance.

(a) (b)
explanatory variables between the true locations and centroids. When using ASU centroids, larger ASUs also lead to a larger decrease of ENM performance. Compared with ASU size, grain size does not affect ENM performance as much. The use of ASU centroids leads to over-prediction of the modelled species' distribution range, and the over-prediction ratio shows a tendency to increase with ASU size.
Using centroids of ASUs as a substitute for true occurrence records in ENMs introduces errors. Spatial distance between the centroids and the true, unknown occurrence locations leads to a mismatch of explanatory variables' values at these locations. These mismatched values at the centroid locations lead to a mismatch of explanatory variables' value frequency curves, which further results in a mismatch between the projected niche and the true niche. Our results show that the absolute size of ASUs affects the value frequency mismatch between true locations and centroids. How strong this effect is, ultimately depends on the explanatory variables' spatial heterogeneity (here: standard deviation) within the ASUs. Park and Davis (2017) found that spatial heterogeneity in climatic variables was mainly governed by the heterogeneity in topography in the US. Their findings that ASU (county) size only had minimal effects on spatial heterogeneity does not contradict our results, due to their different, non-nested study design. The absolute ASU size or grain size alone cannot determine how much the explanatory variables' values mismatch with the values at the true occurrence locations, but in general, larger ASU size leads to larger value mismatch (Figure 3).
Similarly, coarser grain size leads to a deterioration in model performance (Supporting Information Appendix S4, Figure S4.5). This is in line with previous findings (Connor et al., 2018;Guisan et al., 2007;Manzoor et al., 2018). However, although the grain size does affect model performance, its effect was found to be small compared that of the ASU size ( Figure 4 and Supporting Information Appendix S4, Table S4.2). When centroids are drawn from ASUs with large extent, the ENM's performance can hardly be improved by using a fine grain size (Figure 4). However, when using centroids drawn from a small extent or using the true occurrence locations, a fine grain size is preferable (Table 1). Note that the 'small' pseudo-ASU size used in this study is 0.25° × 0.25° (c. 400 km 2 ). This is not much larger than the coarsest grain size (c. 170 km 2 ) in this study, which corresponds to the resolution of some commonly used environmental datasets [e.g., E-OBS (Cornes et al., 2018)]. For ASUs of this size, the value mismatch between the centroids and true locations is very small, and the effect of ASU size on the value mismatch cannot be distinguished from that introduced by a large grain size.
The use of a virtual species in this study means that the environmental suitability and the presence-absence status of the species across the study region are known, and the species' occurrence records are precise. This makes the comparison between models based on true F I G U R E 5 Relative over-prediction ratio [ecological niche model (ENM) results versus 'true' virtual species occurrence] for models built on point locations (white) as well as centroids of differently sized regions (grey). Lower case letters above the boxes indicate differences between groups as indicated by an ANOVA followed by a Tukey honest significant difference post-hoc test. ASU = administrational spatial unit relative over-prediction rho for the correlation of the predicted probability of presence with the known true probability of presence of the virtual species. Bold: best-performing models for centroid-and true location-based models for France and Germany locations and models based on ASU centroids feasible and reliable. The difference between models based on observed point locations from field data versus centroid-based locations has previously been quantified (Collins et al., 2017). However, using observed locations, additional effects resulting from sampling bias or uncertainty cannot be excluded.
Generating a virtual species and drawing precise geographical locations ensures that the observed differences between model results are due to the use of centroid locations itself and that they can be quantified. As the true environmental suitability of the virtual species is known, it can be used as a benchmark for the performance of the models by directly calculating Spearman's rho. This obviates the use of threshold-based performance measures such as TSS or the commonly used but controversial AUC (Allouche et al., 2006;Tjaden et al., 2018). noted that the value frequency mismatch ranges from 0 to 100%, that is, though the value frequency mismatch increases with spatial heterogeneity within the ASU, this increase is not always linear. Here, a linear model was applied to show the general trend, but it should not be interpreted as an indication for how much the mismatch will be when larger spatial heterogeneity occurs.
As the virtual species was generated with the grain size of 0.5 arc-min, it is to be expected that the output from the combination of this grain size and original locations has the best performance. While generating the virtual species, we assumed that the grain size employed is the smallest unit in which the virtual species can survive, as an individual (similar to e.g., a deer, a bird) or a population (e.g., ants, bees). However, modelling with real species, it should be questioned which grain size should be utilized. It has been suggested to use a grain size smaller than 1 km when possible, especially for habitat specialists (Manzoor et al., 2018). Nevertheless, the question of choosing optimal grain size needs to be further investigated.
It is not surprising that centroids of ASUs are chosen as a substitute when no precise occurrence locations are available. After all, the changes in the modelling workflow required by this approach are minimal compared with the calculation of central tendency measures across ASUs. Using the geographical centroid of an ASU minimizes the largest possible spatial distance between the centroid and any (unknown) occurrence location within the ASU (Figure 1b1).
Considering spatial autocorrelation, the value mismatch between the unknown true location and the chosen substitute (centroid) has a chance of being limited as well. However, central tendency values (e.g., mean or median) of the variable value within the ASU are a better alternative for minimizing the value mismatch, as they limit the potential value distance directly rather than indirectly through spatial distance (Figure 1b2) When using a limited number of ASU centroids in addition to precise locations, there are two critical aspects that need to be considered. First, these coarse centroids are typically assigned the same weight as the precise location by the ENMs, which may result in a distorted model of the spatial distribution of the species. This could be ameliorated by down-weighting centroid locations, provided that the chosen modelling algorithm allows for that. Second, for each ASU, only one centroid record will be kept by the ENMs. As a consequence, regions with only one record are treated the same as regions with several records, and the abundance information (if available at all) is neglected.
It has been shown that a mixture of precise and centroid-based data would be more robust against the issues demonstrated in this work (Collins et al., 2017), but to what degree and how the ratio of the two data types affect the robustness needs to be clarified in future studies.
It should be noted that the insights gained here only apply to models built with continuous variables. Categorical predictors like land use classes typically show sharp edges on the map, so that even small differences in the spatial location of occurrence records can lead to dramatically different values being assigned to them. Thus, it seems reasonable to assume that models built with categorical variables would be affected more strongly by ASU size, but further investigations are needed to confirm this. Similarly, all our analyses were conducted at an intermediate (sub-continental) spatial scale, as this is the scale where ASU centroids are most likely to be used.
Further research is needed to verify whether the conclusions drawn from this can be transferred to coarser or finer scales.

| CON CLUS IONS
Whether ASU centroids can be a viable surrogate for precise occurrence locations depends on the ASUs' sizes and how heterogeneous they are in terms of environmental explanatory variables. For instance, in the northern German flatlands, where ASUs are small and the environment comparably homogenous, the use of centroid locations is much less of a problem than in the alpine regions of France, where ASUs are large and environmental gradients steep. If possible, central tendency values should be considered as a more robust alternative. As our results suggest that effects of using ASU centroids outweigh effects of grain size, it is important for modellers to recognize this source of error. In order to enable researchers to assess whether the use of centroid locations is appropriate for a specific project, new methods and guidelines need to be developed.

CO N FLI C T O F I NTE R E S T
The authors have declared no conflicts of interest for this article.

DATA AVA I L A B I L I T Y S TAT E M E N T
The data that support the findings of this study are available from the WorldClim dataset at https://www.world clim.org/. Details about how to recreate the simulated data are provided in the Supporting Information Appendices; R source code is available upon request from the authors.