Non‐stationary drivers on fish sampling efforts in Brazilian freshwaters

Closing knowledge gaps in species taxonomy and distribution (i.e. Linnean and Wallacean shortfalls, respectively) requires spatially distributed high‐quality data. However, studies on terrestrial taxa have shown that occurrence data are biased owing to higher sampling efforts towards areas with greater accessibility, research infrastructure and attractiveness. Here, we tested whether these biasing factors are also important drivers of freshwater fish species research by assessing fishes’ sampling efforts in Brazil, a continental‐sized and megadiverse country. We hypothesized that the influence of biasing factors is scale‐dependent, with differential effects across regions; that is, that their effects are non‐stationary.


| INTRODUCTION
Information regarding species occurrence is essential for ecological and evolutionary research that seeks to understand the processes affecting species' geographical distributions (Meyer et al., 2015; Troia & McManamay, 2017). Additionally, high-quality species distribution data are required to define conservation strategies and to prioritize resource allocation (Hortal et al., 2007; Ladle & Whittaker, 2011; Tessarolo et al., 2017). Increased concern about data availability has stimulated initiatives for the online aggregation of distributional data, which are now publicly available (Meyer et al., 2015). Initiatives such as GBIF (Global Biodiversity Information Facility) and SpeciesLink manage billions of species records from museums, herbaria and research centres, which have been used extensively in ecological and conservation studies (e.g. Canhos et al., 2015). Although these online databases have improved the quality and accuracy of their data over time (Gaiji et al., 2013), high-quality data regarding species occurrences are missing for most known species and regions, a gap known as the Wallacean shortfall (Bini et al., 2006; Hortal et al., 2015; Lomolino, 2004; Whittaker et al., 2005).
The knowledge shortfall related to species distribution is a result of the complex dynamics of species distribution itself (Tessarolo et al., 2017), but also of the uneven collection of data (Hortal et al., 2015; Meyer et al., 2015). Ideally, sampling efforts should be distributed across geographical space in a way that captures all real natural variation (Brooks et al., 2004). However, even though concern about the intensification and spatial coverage of sampling has increased (Girardello et al., 2018; Hortal et al., 2007; Oliveira et al., 2016; Reddy & Dávalos, 2003; Romo et al., 2006), spatial biases in biodiversity records are still considerable (Brooks et al., 2004).
Sampling bias results partially from the small spatial scale of the studies developed, which generates a spatial concentration of records and restricts knowledge and inferences about distribution patterns at larger scales (Hecnar & M'Closkey, 1997), consequently affecting conservation actions (Brooks et al., 2004).
Unequal distribution of species sampling efforts may be driven by several factors, such as ease of access (Oliveira et al., 2016; Romo et al., 2006; Sánchez-Fernández et al., 2008), availability of financial resources (Anderson, 2012; Yang et al., 2013), attractiveness of specific areas or species (Reddy & Dávalos, 2003) and security (Amano & Sutherland, 2013). In fact, researchers tend to concentrate their sampling on sites close to their workplace, generally where roads are present near large cities, to facilitate travel (Sánchez-Fernández et al., 2008; Sastre & Lobo, 2009) and to minimize the logistics and financial expenses generated by fieldwork (Botts et al., 2011; Meyer et al., 2015). In addition, some sites are particularly attractive to researchers, including protected areas, which represent more conserved sites where there is a greater probability of finding high species richness or even rare species (Oliveira et al., 2016; Sánchez-Fernández et al., 2008).
Sampling bias studies often examine relationships between sampling bias and the factors causing it (e.g. presence of roads) in a single global model applied to the entire studied region, without considering the spatial variability of the sources of bias (Oliveira et al., 2016). However, the relationship between spatial sampling bias and the factors causing it can vary between regions, indicating the occurrence of non-stationary geographical patterns. For example, the known positive relationship between terrestrial access routes and sampling efforts (Oliveira et al., 2016) may be weaker or negative in regions where navigable rivers are the main access routes, such as in the Amazon basin. Therefore, if spatial non-stationarity is present, other statistical approaches such as GWR (geographically weighted regression; Fotheringham et al., 2003) could be used to detect the spatial variability in the effects of the predictors and to improve predictions relative to global models (see examples using GWR in broad-scale studies: Caetano et al., 2018; Eme et al., 2015; Foody, 2004; Terribile & Diniz-Filho, 2009).
Freshwater systems encompass a variety of habitats, harbouring around 10% of all known biodiversity, with high levels of endemism (Collen et al., 2013; Strayer & Dudgeon, 2010). Additionally, freshwater systems play an important role in providing resources for humans, such as water filtration, food provision and air quality regulation (Collen et al., 2013; Vigerstol & Aukema, 2011). Although the number of studies focusing on the identification of spatial biases in biodiversity data has been increasing (Boakes et al., 2010; Hortal et al., 2007; Oliveira et al., 2016; Reddy & Dávalos, 2003; Romo et al., 2006; Sánchez-Fernández et al., 2008; Stropp et al., 2016; Yang et al., 2013), the majority encompass only terrestrial vertebrate species. Freshwater biodiversity has generally been overlooked (Lévêque et al., 2005; Sánchez-Fernández et al., 2008), with few relevant studies conducted in this field (Pelayo-Villamil et al., 2018; Sánchez-Fernández et al., 2008; Troia & McManamay, 2017). Species in freshwater systems face severe anthropogenic threats, suffering a greater extinction risk than terrestrial species and significant biodiversity loss (Collen et al., 2013; Gallardo et al., 2018). In this context, filling the knowledge gaps in freshwater species distribution will be useful for informing freshwater conservation management (Howard et al., 2015). Thus, it has become imperative to understand the factors causing biases in the distribution of freshwater fish data and to assess the limitations of these data (Boyero et al., 2012; Troia & McManamay, 2017). As the influence of these biasing factors may vary spatially, predictions of biodiversity data need to go beyond stationary models (Oliveira et al., 2016) and to consider context-dependent processes more explicitly.
KEYWORDS
aquatic biodiversity, collection, Linnean shortfall, Neotropical, stationarity, Wallacean shortfall

In this study, we evaluate the spatial distribution of the sampling efforts of Brazilian freshwater fish and determine the influence of commonly assessed bias factors on such data. Brazil is a continental-sized country with the most diverse freshwater ichthyofauna in the world (Reis et al., 2003; Tisseuil et al., 2013), which has attracted the attention of researchers and resulted in significant cumulative sampling efforts over decades (Rapp Py-Daniel et al., 2015). It encompasses different biomes, exhibiting a broad range of environmental, infrastructural and socio-economic contexts. This combination of high biological diversity and spatial variability in potential biasing factors makes it a suitable model to analyse drivers of sampling efforts (Oliveira et al., 2016). We tested the hypothesis that the sampling efforts directed towards freshwater fish become biased owing to accessibility, research infrastructure and the attractiveness of protected areas. Furthermore, considering that these relationships are expected to be scale-dependent, with differential effects of biasing factors across regions, we predicted that non-stationary models would provide better predictions than global models.

| Data collection
We listed 3,358 species of freshwater fish for Brazil based on the FishBase database (www.fishbase.org) by applying the following search filters: "Brazil" and "Freshwater." Marine species that were erroneously included within the database and non-native species (sensu Azevedo-Santos et al., 2015; Latini et al., 2016) were removed so that the list contained exclusively native freshwater species. We obtained all occurrence records available per species from four online databases (the search was made one species at a time in each database): Global Biodiversity Information Facility (GBIF; www.gbif.org), SpeciesLink (www.splink.org.br), Portal da Biodiversidade (portaldabiodiversidade.icmbio.gov.br/portal/) and Sistema de Informação sobre a Biodiversidade Brasileira (SiBBr; www.sibbr.gov.br/).
To ensure that unreliable and non-relevant records were removed, we carried out a data filtering process. Records lacking geographical coordinates, year of sampling or complete species names, and those falling outside Brazilian borders, were removed. Additionally, duplicated records (same species name, geographical coordinates and sampling date) were removed. The final database was aggregated into a grid of 0.5° × 0.5° resolution.
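The filtering and gridding steps described above can be sketched as follows. This is a minimal illustration, not the scripts used in the study: the record fields (`species`, `lon`, `lat`, `year`, `date`), the function names and the use of a rectangular bounding box in place of the actual Brazilian border polygon are all assumptions made for the example.

```python
import math

def clean_records(records, bbox):
    """Drop incomplete, out-of-area and duplicated occurrence records.

    bbox = (lon_min, lat_min, lon_max, lat_max) is a simplifying stand-in
    for a point-in-polygon test against the national border (assumption).
    """
    lon_min, lat_min, lon_max, lat_max = bbox
    seen = set()
    kept = []
    for rec in records:
        species = rec.get("species")
        lon, lat, year = rec.get("lon"), rec.get("lat"), rec.get("year")
        # Records without coordinates or year of sampling are discarded.
        if lon is None or lat is None or year is None:
            continue
        # A complete species name is taken to be at least genus + epithet.
        if not species or len(species.split()) < 2:
            continue
        if not (lon_min <= lon <= lon_max and lat_min <= lat <= lat_max):
            continue
        # Duplicates share species name, coordinates and sampling date;
        # the year is used when no finer date is available.
        key = (species, lon, lat, rec.get("date", year))
        if key in seen:
            continue
        seen.add(key)
        kept.append(rec)
    return kept

def grid_cell(lon, lat, size=0.5):
    """Return the (column, row) index of the size x size degree cell."""
    return (math.floor(lon / size), math.floor(lat / size))
```

Using `math.floor` rather than integer truncation keeps cell indices consistent for the negative longitudes and latitudes that cover most of Brazil.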

| Sampling efforts
We used the number of sampling events as a surrogate for sampling efforts (Menegotto & Rangel, 2018; Righetti et al., 2020). We regarded as sampling events the unique combinations of longitude, latitude and year of sampling based on species records (Menegotto & Rangel, 2018). We used the year because it is commonly the only information available about the sampling date. Therefore, records of different species with the same coordinates in a given year were considered a single sampling event (e.g. a group of species collected in a stream reach), while records with equal coordinates but collected in different years were considered different sampling events (e.g. temporal sampling in which the same stream is sampled repeatedly over time). By quantifying sampling events rather than species occurrence records, our effort measure is not directly influenced by differences in geographical patterns of species richness (Vieira et al., 2018).
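As an illustration of this definition, the sketch below counts sampling events per grid cell as the number of unique longitude, latitude and year combinations falling in each cell. The record fields and the function name are hypothetical; the 0.5° cell size follows the grid described above.

```python
import math
from collections import Counter

def sampling_events_per_cell(records, size=0.5):
    """Count unique (lon, lat, year) combinations in each grid cell."""
    # Species identity is ignored on purpose: several species recorded at
    # the same coordinates in the same year form one sampling event.
    events = {(r["lon"], r["lat"], r["year"]) for r in records}
    counts = Counter()
    for lon, lat, _year in events:
        cell = (math.floor(lon / size), math.floor(lat / size))
        counts[cell] += 1
    return counts
```

Because the count is taken over events rather than records, a species-rich site sampled once contributes the same effort as a species-poor site sampled once.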

| Drivers of spatial biases in sampling efforts
To assess the spatial variation in sampling efforts, we selected factors related to accessibility (population density, distance from terrestrial and fluvial access routes and density of terrestrial and fluvial access routes), availability of research infrastructure (distance from research centres) and the attractiveness of protected areas (distance from protected areas).

| Terrestrial access routes
Maps of the access routes (highways, streets, and paved and unpaved roads) were obtained from different platforms: Ministry of Environment (mapas.mma.gov.br/), Sistema Estadual de Geoinformação (www.sieg.go.gov.br) and Instituto Brasileiro de Geografia e Estatística (www.ibge.gov.br). The influence of access routes on sampling efforts was evaluated with two measures: the terrestrial access route density in each cell and the distance from each cell to the closest terrestrial access route.

| Fluvial access routes
We considered main rivers of sixth or greater order as fluvial access routes (sensu Strahler, 1957). Usually, rivers of these orders are large enough to permit navigation. Maps of the rivers classified according to Strahler's hierarchy were obtained from the HydroRIVERS database (data available at www.hydrosheds.org; Lehner & Grill, 2013).
We used two variables to describe local availability of fluvial routes: the fluvial access route density in each cell and the minimum distance from each cell to the closest fluvial access route.

| Research centres
Brazilian research centres with the potential to bias freshwater fish sampling efforts were defined as those with research linked to the study of freshwater fish biodiversity. To this end, we surveyed all …

Subsequently, we calculated the percentage of the protected area.
Only the cells with more than 50% of their area covered by a protected area were considered in this analysis. We then calculated the minimum distance from the centroid of each cell to the centroid of the closest cell containing a protected area.
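A centroid-to-centroid minimum distance of this kind can be sketched as below. The text does not state which distance metric was used, so the great-circle (haversine) distance, the Earth radius of 6,371 km and the function names are assumptions made for illustration only.

```python
import math

def haversine_km(lon1, lat1, lon2, lat2, radius_km=6371.0):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

def distance_to_nearest(centroids, targets):
    """Minimum distance from each cell centroid to any target centroid.

    centroids and targets are lists of (lon, lat) tuples; a cell that is
    itself a target (e.g. a protected cell) gets a distance of zero.
    """
    return [
        min(haversine_km(lon, lat, t_lon, t_lat) for t_lon, t_lat in targets)
        for lon, lat in centroids
    ]
```

The same helper could serve for the distance to research centres by passing their coordinates as `targets`.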

| Data analysis
The geographical pattern of sampling efforts in Brazilian freshwater studies was assessed using the spatial correlogram of Moran's I (Legendre & Legendre, 2012), estimated at 21 equal distance classes.
Moran's I usually takes values in the interval −1 to +1, and in the correlogram this coefficient is estimated for each distance class. Therefore, visual inspection of the correlogram (i.e. the values of Moran's I for each distance class) allows the spatial structure of sampling efforts to be evaluated: positive and significant values in the first classes, decreasing towards the last classes (with zero or negative values), indicate a spatial gradient in sampling efforts. The significance of Moran's I was tested using a Monte Carlo procedure with 999 permutations, and the significance of each class was established using Bonferroni's criterion (see Legendre & Legendre, 2012).
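For readers unfamiliar with the statistic, the sketch below computes Moran's I for a single distance class and a permutation p-value. It is a simplified illustration rather than the SAM implementation: it uses binary weights (1 for pairs whose distance falls in the class, 0 otherwise), planar Euclidean distances and a two-sided test, all of which are assumptions of the example.

```python
import math
import random

def morans_i(values, coords, d_min, d_max):
    """Moran's I using pairs whose distance lies in (d_min, d_max]."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    num = 0.0
    w_sum = 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if d_min < math.dist(coords[i], coords[j]) <= d_max:
                num += dev[i] * dev[j]
                w_sum += 1
    if w_sum == 0 or denom == 0:
        return float("nan")
    return (n / w_sum) * num / denom

def permutation_p_value(values, coords, d_min, d_max, n_perm=999, seed=0):
    """Two-sided Monte Carlo p-value obtained by shuffling values over sites."""
    rng = random.Random(seed)
    observed = morans_i(values, coords, d_min, d_max)
    shuffled = list(values)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(morans_i(shuffled, coords, d_min, d_max)) >= abs(observed):
            extreme += 1
    # The observed value counts as one of the permutations.
    return observed, (extreme + 1) / (n_perm + 1)
```

Under Bonferroni's criterion with 21 distance classes, a class would be judged significant at an overall level of .05 only if its p-value were below .05/21.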
The effect of predictor variables on sampling efforts of fish data was investigated using two approaches: ordinary least squares (OLS) and geographically weighted regression (GWR) (Fotheringham et al., 2003). The OLS estimates a single global slope (angular coefficient) for each predictor, assumed constant across space, while the GWR takes into account the non-stationarity of the effects of predictor variables and generates slopes (and other parameters) for each cell in the grid. The GWR and OLS results were compared using the Akaike information criterion (AIC).
Before running the two statistical analyses, all variables were log-transformed (log(X + 1)) and the predictor variables were standardized using z-scores. The z-score standardization allows the coefficients of the predictor variables to be compared. The collinearity of predictor variables was evaluated using the variance inflation factor (VIF), and only variables with a VIF < 2.5 were retained in the models. Thus, two variables (distance to terrestrial routes and distance to fluvial routes) were excluded from the models by the VIF criterion.
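The transformation and collinearity check can be illustrated with the sketch below. It is not the "usdm" routine used in the study: here the VIF of each predictor is taken from the diagonal of the inverse of the predictors' correlation matrix, which equals 1/(1 − R²) from regressing that predictor on the others. The function names, the natural logarithm and the use of the sample standard deviation are assumptions of the example.

```python
import math

def log1p_zscore(column):
    """Apply log(x + 1), then standardize to mean 0 and sample SD 1."""
    logged = [math.log(x + 1.0) for x in column]
    n = len(logged)
    mean = sum(logged) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in logged) / (n - 1))
    return [(v - mean) / sd for v in logged]

def pearson(a, b):
    """Pearson correlation between two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting (small matrices only)."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vif(columns):
    """Variance inflation factor of each predictor column."""
    k = len(columns)
    corr = [[pearson(columns[i], columns[j]) for j in range(k)]
            for i in range(k)]
    inv = invert(corr)
    return [inv[i][i] for i in range(k)]
```

With these helpers, predictors whose VIF is 2.5 or higher would be candidates for exclusion under the threshold stated above.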
In the GWR analysis, the non-stationarity of the predictor variables was checked using a Monte Carlo procedure, and all variables presented non-stationarity (p < .05). The influence of neighbouring areas in the GWR was defined by an adaptive bi-square kernel function, with bandwidths determined interactively using AIC (see details in Fotheringham et al., 2003). The number of nearest neighbours selected was 101. The local slopes obtained in the GWR were mapped onto the grid cells.
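To make the local fitting concrete, the sketch below estimates a local intercept and slope at every site with an adaptive bi-square kernel. It is a deliberately reduced illustration, not the "GWmodel" implementation: it handles a single predictor, uses planar Euclidean distances, counts the focal site among its own k nearest neighbours, and the function names are hypothetical.

```python
import math

def bisquare_weights(distances, k):
    """Adaptive bi-square weights: bandwidth = distance to the k-th nearest site."""
    bandwidth = sorted(distances)[k - 1]
    return [(1 - (d / bandwidth) ** 2) ** 2 if d < bandwidth else 0.0
            for d in distances]

def local_coefficients(coords, x, y, k):
    """Weighted least-squares (intercept, slope) fitted around each site."""
    results = []
    for focal in coords:
        dist = [math.dist(focal, c) for c in coords]
        w = bisquare_weights(dist, k)
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx
        results.append((my - slope * mx, slope))
    return results
```

Because the bandwidth is defined by a neighbour count rather than a fixed distance, the kernel widens where sites are sparse and narrows where they are dense. Mapping the second element of each tuple gives the kind of local-coefficient surface described above.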
The Moran's I correlogram was computed using the SAM program (Rangel et al., 2010), and the OLS and GWR were performed in R. The GWR was calculated using the package "GWmodel" (Gollini et al., 2015), with the functions bw.gwr to estimate the bandwidth, montecarlo.gwr to investigate the non-stationarity of predictor variables and gwr.basic to fit the GWR.
The VIF was estimated using the function vifcor in the package "usdm" (Naimi et al., 2014). Data and R scripts used in our analyses are available in an open-access repository (https://zenodo.org/record/3934832#.YDUXpGhKiiM).

| Spatial patterns of fish sampling efforts
Our search of the online databases resulted in 339,246 records, of which 243,901 remained after the data filtering process. The GBIF and SpeciesLink databases contributed the largest number of records (Table S1)-from 1900 to 1964, respectively. From these data, we identified a total of 43,538 sampling events distributed across Brazil (Figure 1a), with a higher number of records in southern and south-eastern Brazil, even though all of the regions had sites with high sampling efforts. The sampling efforts presented a clear spatial structure, with geographically near sites having a similar number of sampling events. The spatial correlogram indicated that sites separated by <215 km (first distance class in the correlogram) presented the highest similarity in the number of sampling events (Figure S1). The distribution of protected areas, research centres, population density, and terrestrial and fluvial access routes is also spatially uneven (Figure 1b-f).

| Modelling fish sampling efforts
The GWR performed better than OLS according to the R² and AIC statistics (Table 1), showing the occurrence of non-stationarity in the effect of predictor variables in explaining the number of sampling events. Considering the median values of the GWR results, population density and terrestrial access route density were the variables with the strongest effect on the number of sampling events (Table 1). Regions with a higher density of terrestrial access routes and higher population density attracted a higher number of sampling events. Moreover, the higher the density of fluvial access routes and the closer a cell is to protected areas and to research centres, the higher its number of sampling events (Table 1).
Local GWR results showed different predictive powers of the model across sites (Figure 2). The map of R² revealed that some regions of Brazil presented the highest values (R² > .75), indicating, for example, that in these regions the predictors explained 75% (or more) of the variability of sampling efforts (Figure 2a). It is worth mentioning that the importance of each predictor needs to be evaluated considering the importance of the model in the same region (local R²). Although median values reveal the dominant direction of the driver-sampling efforts relationship, local GWR coefficients showed strong explanatory influence of drivers in some …

FIGURE 1 Distribution of the sampling efforts (measured as the sum of sampling events) of freshwater fish in Brazil (a) and explanatory variables: protected area (b), research centres (c), population density (person/km²) per municipality (d), terrestrial access routes (e) and fluvial access routes (f)

| DISCUSSION
Sampling efforts have increased over time for almost all biological groups and regions, resulting in a substantial cumulative number of species records (Girardello et al., 2018; Oliveira et al., 2016). This is also true for Brazilian freshwater fishes, as shown by the large number of species recorded in databases and the sampling efforts employed. However, our results show that freshwater fish sampling is spatially biased, with a tendency for major efforts to be concentrated in sites with a higher density of fluvial and terrestrial access routes, larger populations and proximity to research centres and protected areas. These results suggest that the well-known factors causing sampling bias for terrestrial organisms (Daru et al., 2017; Oliveira et al., 2016; Romo et al., 2006) are also influential for the sampling of freshwater organisms. However, the importance of these drivers varied across sites, revealing the spatial non-stationarity of these effects. Drivers were strongly predictive of fish sampling events both in regions with high sampling efforts (southern and south-eastern Brazil) and in areas with low sampling efforts (e.g. part of the north of Brazil).
In line with our expectations, the GWR model performed better than OLS in predicting fish sampling efforts. The former assumes the differential effect of explanatory variables on sampling events …

The importance of terrestrial access routes in predicting sampling effort when non-stationarity is taken into account was remarkable. In fact, this variable had the greatest range of local coefficients in the GWR model, indicating contrasting relationships between this variable and sampling effort across the studied area. The positive association between terrestrial access route density and sampling effort occurred especially in regions with low or moderate density of access routes and where there is a broader gradient of this variable.
In those regions, terrestrial access routes represent a key driver influencing sampling. In regions where most areas have a high density of access routes (e.g. south-eastern Brazil), accessibility seems to be a constraint that has already been overcome. Carrying out sampling close to access routes generates spatial bias and leads to poor representation of environmental and biological diversity (Kadmon et al., 2004; Oliveira et al., 2016). Furthermore, access routes have been associated with degradation of the riparian zone, loss of physical habitat integrity, an increase in fishing and the introduction of exotic species (Trombulak & Frissell, 2000). These modifications favour the occurrence of less sensitive species that may not represent regional biodiversity.

TABLE 1 Statistical summary of OLS and GWR, presenting the standardized coefficient (Std coef) and p-value for each predictor variable, the adjusted R² and AIC

Fluvial access routes may favour fish sampling especially in regions with low terrestrial accessibility, as seems to be the case in … sampling efforts. Furthermore, urbanization has aroused researchers' interest in concentrating their studies on these sites since such urbanized ecosystems can present high biodiversity (Luck, 2007) and habitat heterogeneity (Araújo, 2003; Pautasso & McKinney, 2007). However, the population density effect is also relevant in less populated regions, as shown by our non-stationary model in some regions (e.g. Amazon), where minor differences in population density are also relevant to predict sampling effort.
Exceptions to the overall pattern (i.e. negative association between population density and sampling effort) may be observed when researchers are attracted to less populated regions, usually when other drivers exert a stronger effect. This occurs, for example, when research centres are present in less populated areas or when large projects fund biodiversity collection in remote regions.
Our results, and those found by Sánchez-Fernández et al. (2008) for aquatic beetles, reinforce the tendency to concentrate sampling close to research centres. Searching for sampling sites close to the workplace is a convenience that makes research cheaper and logistics easier for researchers (Moerman & Estabrook, 2006). … (Freitas et al., 2021; Zuanon et al., 2016). These results highlight the importance of research centres as vectors for biodiversity sampling and scientific development (e.g. Dias et al., 2016; Nabout et al., 2015). It is worth mentioning that the strength of the research centre-sampling relationship may still have been underestimated, since our criterion for the definition of a research centre was based on the presence of a graduate programme. Thus, sampling efforts carried out by researchers working in universities or research agencies that are not part of graduate programmes may not have been credited to the research centre effect, which may be the case for part of the sites with positive coefficients.
The association of protected areas with sampling effort found here reinforces the attractiveness effect of such protected areas (Sánchez-Fernández et al., 2008). Predictions of both models (OLS and GWR) were similar, reflecting the high consistency of this effect. The attractiveness of protected areas may be explained by the tendency of researchers to seek less disturbed sites, areas with high species richness and endemism, features usually associated with protected areas (Reddy & Dávalos, 2003; Sánchez-Fernández et al., 2008). In regions with higher degrees of human influence, such as the south and south-east, just a few protected areas with native vegetation remain. These areas may represent the last refuge for many species, including threatened species (Abell et al., 2007), attracting researchers' efforts to conduct sampling in these areas.
Additionally, as a legal requirement of Brazilian federal legislation, the planning, creation, expansion and management of conservation units demand studies and inventories of biodiversity, usually involving fish sampling.
Fish sampling efforts were weakly predicted by our models in some regions, principally in some sites with low sampling efforts. While this unpredictability may be the result of evenness in the sampling efforts, other factors may be considered: (i) in some cases, as in north-eastern Brazil, there is a spatial decoupling between the greater portion of the hydrographic network that drains the interior of the continent and the coastal region where major cities with most of the research centres, infrastructure and human resources are located; (ii) the occurrence of sampling efforts not recorded in online databases and hence not considered by us. While the online availability of data is a potential issue (Blair et al., 2020), the availability of high-quality biodiver- … (proc. 465610/2014-5).

PEER REVIEW
The peer review history for this article is available at https://publons.com/publon/10.1111/ddi.13269.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available in Zenodo at https://zenodo.org/record/3934832#.YDUXpGhKiiM.
These data were derived from the following resources avail-