Validating distribution models for twelve endemic bird species of tropical dry forest in western Mexico

Abstract Considering the high biodiversity and conservation concerns of the tropical dry forest, this study aim is to predict and evaluate the potential and current distributions of twelve species of endemic birds which distribute along the western slope of Mexico. The main goal is to evaluate altogether different methods for predicting actual species distribution models (ADMs) of the twelve species including the identification of key environmental potential limiting factors. ADMs for twelve endemic Mexican birds were generated and validated by means of applying: (1) three widely used species niche modeling approaches (ENFA, Garp, and Maxent); (2) two thresholding methods, based on ROC curves and Kappa Index, for transforming continuous models to presence/absence (binary) models; (3) documented habitat–species associations for reducing species potential distribution models (PDMs); and (4) field occurrence data for validating final ADMs. Binary PDMs' predicted areas seemed overestimated, while ADMs looked drastically reduced and fragmented because of the approach taken for eliminating those predicted areas which were documented as unsuitable habitat types for individual species. Results indicated that both thresholding methods generated similar threshold values for species modeled by each of the three species distribution modeling algorithms (SDMAs). A Wilcoxon signed‐rank test, however, showed that Kappa values were generally higher than ROC curve for species modeled by ENFA and Maxent, while for Garp models there were no significant differences. Prediction success (e.g., true presences percentage) obtained from field occurrence data revealed a range of 50%–82% among the 12 species. The three modeling approaches applied enabled to test the application of two thresholding methods for transforming continuous to binary (presence/absence) models. The use of documented habitat preferences resulted in drastic reductions and fragmentation of PDMs. However, ADMs predictive success rate, tested using field species occurrence data, varied between 50 and 82%.


| Statement of the problem
It has become a common practice to generate species distribution models (SDMs) based on the use of occurrence records obtained from electronic museum databases. However, because of the historical nature of the species' presence data, there is a lack of correspondence between the wide span of time in which species occurrence data were collected and the date and conditions of key spatial (mapped) environmental variables, such as the land cover type. This disparity is particularly problematic in places with high rates of land use/land cover changes. This study's main challenge is to evaluate how realistic are SDMs in replicating the likelihood for recording species on the ground (e.g., Austin, 2007), which entails the achievement of two interrelated objectives: (1) modeling the actual distribution of 12 Mexican endemic species of birds and (2) validating models' performance using field occurrence data. The applied approach consisted in evaluating the performance of a group of three of the most used algorithms to generate SDMs (Garp, Maxent, and ENFA), with less emphasis on comparing between algorithms; a main assumption is that no single modeling algorithms conveys all answers for modeling specie distributions (e.g., Qiao, Soberón, & Peterson, 2015).
This study undertakes main species distribution modeling challenges by proposing concrete and practical solutions; a) transforming continuous to binary (presence/absence) SDMs, b) identifying potential distributional limiting factors, c) obtaining SDMs which represent the most likely areas actually occupied by the species, and d) validating these later SDMs using species occurrence data sampled on the field.

| Study Species: Endemic species of the dry forest of western México
Considering the high levels of biodiversity and endemism found in Mexico's western slope (Escalante-Pliego, Navarro-Sigüenza, & Peterson, 1998;Peterson & Navarro, 1999 and the region's conservation concerns (e.g., Portillo-Quintero & Sánchez-Azofeifa, 2010), a main objective of this study is to predict and evaluate the potential and current distributions of twelve species of endemic birds which distribute along the tropical dry forest on the western slope of Mexico. These bird species are considered important community components, and they tend to show specialized habitat usage, also showing proportionally fastest population declines than other species with wider distribution patterns (Stotz, Fitzpatrick, Parker, & Moskovits, 1996).
Among the Mexican endemic bird species that we have recorded on the field across the country's western Pacific slope, twelve species were selected (Table 1)  Notwithstanding that these species' general geographic pattern follows the tropical dry forest along the Pacific coast, from southern Sonora state to Chiapas (see Figure 1

| Species distribution models
The modeling of species distribution has been also referred as modeling the "ecological niche," "habitat suitability," and "potential distribution" approaches aimed to identify species distribution areas and with this the limiting factors determining species distribution patterns (e.g., Elith & Graham, 2009;Soberon & Nakamura, 2009). These methods are conceptually related, and they have the same correlative nature: Field observations of species occurrences are related to environmental factors by means of a wide number of mathematical and statistical algorithms (Qiao et al., 2015). The application of these methods for T A B L E 1 Number of species presence records with different geographic location (museum species records) and species occurrence records sampled on the field generating models describing those places suitable for the distribution of species is generally referred as species distribution models (SDMs) (Elith & Graham, 2009;Elith et al., 2006;Franklin, 2010aFranklin, , 2010bGuisan & Thuiller, 2005;Loiselle et al., 2003). Some authors, however, make the distinction between niche and distribution or habitat models: While the former convey information about a species' environmental preferences, the latter just means its geographic projection without any other spatial or time projection (Owens et al., 2013). Habitat suitability models can be considered as operational applications of the ecological niche modeling which is based on a set of environmental variables predicting a species presence/absence (Hirzel & Le Lay, 2008).

| Potential vs actual SDMs
This study applies the term "species distribution model" because it describes both the modeling approach and the resulting outcome (Franklin, 2010a(Franklin, , 2010b
SDMAs differentiate among each other because of the way model's distribution response is obtained, how prediction variables are selected and relevant variables are identified and weighted, how the fitted functions are defined, the degree of interaction between variables and how prediction is performed onto geographic space (Elith et al., 2006). Correlative SDMs can also be classified whether they use species presence/absence or only-presence data (Tsoar, Allouche, Steinitz, Rotem, & Kadmon, 2007).
Considering the variety of SDMAs and the expectations for applying their analyses and results to a wide range of fields (Araújo & Peterson, 2012), model evaluation is considered one of the most challenging and important steps in applying the ecological niche modeling or species distribution modeling (SDM) (e.g., Rodríguez-Rey, Jiménez-Valverde, & Acevedo, 2013). Model evaluation is a necessary modeling step because of the high differences found among the outcomes obtained from applying different SDMAs (Marmion, Parviainen, Luoto, Heikkinen, & Thuiller, 2009).

| Study area
The study area corresponds to Mexico's western slope. Its delineation consisted in adding the 37 physiographic provinces (Cervantes-Zamora et al., 1990) that include the original distribution of the tropical dry forest as mapped by Rzedowski (1990) (Figure 1).
Mexico's richness of endemic bird species is remarkable; approximately 10% of the 1,050 bird species reported for the country are considered endemic, with the tropical dry forest of western Mexico being one of their most important habitats (e.g., Peterson & Navarro, 2000 (Trejo & Dirzo, 2000).
The tropical dry forest distributed on the Mexican western slope is considered among the most extensive and conserved areas across the Mesoamerican region and its regional and global biological importance is recognized (Ceballos et al., 2010).

| Species occurrence data
Twelve endemic bird species were selected for this study (see Table 1) among 47 endemic species reported (Gordon & Ornelas, 2000;Vega Rivera et al., 2010), based on the availability of data in the form of scientific collections and our knowledge and field experience sampling birds within western Mexico's tropical dry forest.
The approach taken consisted in using museum records for generating (training and evaluation) the species' potential distribution models (PDMs). On the other hand, species occurrence data sampled on the field (FOD) were used for validating species actual distribution models (ADMs). The ADMs consisted of the reduced version of PDMs after excluding predicted distribution areas located on unsuitable habitat types. The occurrence data for generating the PDMs were obtained from The Atlas of Mexican Birds (Navarro, Peterson, & Gordillo-Martinez, 2002). A total of 1,189 occurrence records for 12 species across the study area ( Figure 1, inset b) were used to generate the PDMs; the number of records for each species with different location is shown in Table 1.

| Prediction variables
The set of environmental prediction variables consisted of 19 bioclimatic variables obtained from the project WorldClim (Hijmans, Cameron, Parra, Jones, & Jarvis, 2005) and four topographic variables (Verdin & Greenlee, 1998) (Table 2). These bioclimatic variables, generated from monthly records of temperature and precipitation, are in raster format with a 0.01 × 0.01 degrees (aprox. 1 × 1 km) spatial resolution, and they represent annual trends, seasonality, and extremes values of climatic conditions (Hijmans et al., 2005).
Topography variables included elevation above sea level (meters), slope (degrees), and aspect (degrees), and the topographic index,

Bioclimatic variables
Annual mean temperature bc1 Mean diurnal range (Mean of monthly (max temp-min temp)) bc2 Isothermality (P2/P7) (* 100) bc3 Temperature seasonality (standard deviation *100) bc4 Max temperature of warmest month bc5 Min temperature of coldest month bc6 Temperature annual range (P5-P6) bc7 Mean temperature of wettest quarter bc8 Mean temperature of driest quarter bc9 Mean temperature of warmest quarter bc10 Mean temperature of coldest quarter bc11 Annual precipitation bc12 Precipitation of wettest month bc13 Precipitation of driest month bc14 Precipitation seasonality (coefficient of variation) bc15 Precipitation of wettest quarter bc16 Precipitation of driest quarter bc17 Precipitation of warmest quarter bc18 Precipitation of coldest quarter bc19 Topography Aspect Elevation Slope Topographic Index also known as a wetness index which is a function of the upstream contributing area and the terrain's slope (Moore, Grayson, & Ladson, 1991).

| Species distribution modeling algorithms (SDMAs)
Three widely used and evaluated SDMAs were here applied and represent incomplete information about the target distribution (Phillips et al., 2006).
The application of Garp and Maxent included splitting each species' records in 75% for training and 25% for calculating model accuracy. Main modeling criteria and parameters for each SDMA are included in Appendix S1. The complete set of bioclimatic and topography variables were used for generating potential distribution models of species distributions, as determined by physical environment. One reason for not selecting a reduced set of variables was the nature of the ENFA approach which uses data redundancy for generating factors (Hirzel et al., 2001). Therefore, the three SDMAs were applied consistently with the same number of prediction variables. Another reason was the fact that most of species had enough number of records which would compensate the possibility of model overfitting (Guisan & Thuiller, 2005).
Boundary vector data (shape files) about the species' geographic distribution (BirdLife International & NatureServe, 2015) were used as reference data to visualize the general distribution patterns shown by the generated SDMs.

| Prediction thresholding
The test data, randomly separated by Maxent to perform internal accuracy assessment, were identified and independently used, along pseudoabsence data, for calculating both ROC and Kappa analyses for the three SDMAs. The lack of species absence data constrained this study to apply three presence-only SDMAs. Therefore, a procedure was applied to generate the pseudoabsence data needed for calculating accuracy metrics: Areas predicted as species absence by the ENFA method were used to generate random points as pseudoabsences for testing models generated by Garp and Maxent. The pseudoabsences used to test ENFA models were generated in similar fashion but using absence areas predicted by Garp. Pseudoabsences for each species were generated by using the ArcView extension Random Point Generator (Jenness, 2005).
In order to obtain binary presence-absence models, threshold identification was achieved by two means: (1) (2004, 2005, and 2006) in the months from June to August. These months are suitable for recording occurrences of bird species due that such period corresponds to the end of the dry season and part of the rainy season, the time when many species reproduce and they can be observed.

| Species actual distribution models (ADMs)
Once prediction thresholds were determined so that presence/absence models (potential distribution) were obtained, and models were spatially reduced based on the documented information about each species' habitat preferences.  Banks et al., 2005;Blake, 1977;Howell & Webb, 1995;Stotz et al., 1996). Then, an updated land use/land cover map (Velázquez et al., 2001)

| Key environmental prediction variables
Both ENFA and Maxent include tools for measuring the degree in which environmental variables contribute to generate the species' potential distribution models (PDMs). For ENFA, the scores associated with the factors suggest the level of importance that each environmental variable had in defining the model's marginality or tolerance, which are parameters that measure of how distinct and narrow are conditions where species occur, in relation to average conditions across the regions (Hirzel et al., 2002). Maxent on the other hand, runs Jackknife tests for identifying those variables which appears to have the most of useful information by themselves (Phillips et al., 2006).  . Chlorostibon auriceps seemed to be the least specialist species because of its highest tolerance value (1.6). To illustrate and compare the meaning of the Marginality and Tolerence indexes, Figure 2 shows ENFA's prediction models for four species with contrasting index values.
Variables that individually were important for explaining the distribution models' marginality included elevation, annual mean temperature (bc1), minimum temperature of coldest month (bc6)

| PDM's accuracy assessment
Internal accuracy assessment, based on using a 25% partition of the museum occurrence data, consisted in running the ROC test   Species for which the two thresholds were more distant included:  Figure 4).

| Prediction thresholding
The common geographic distribution pattern of species resembles the distribution of the main vegetation types of which this group of species is associated, the tropical dry forest (Appendix S6 Chiapas, with a significant inclusion into the continent at the Balsas depression (see Figure 1). A further comparison of PDMs' geographic patterns will be provided in the discussion.

| Species' actual distribution models (ADMs)
Actual distribution models (ADMs) ( Chlorostilbon auriceps, Deltarhynchus flammulatus, and Granatellus venustus with <30 (see Table 3). The distribution of the number of records per species in the FOD showed similar proportionality that the species F I G U R E 5 Binary (presence/absence) species actual distribution models (ADMs) and corresponding species potential distribution models (PDMs) models, for 12 endemic birds in Mexico. ADMs (left) and PDMs (right) are shown in pairs for each species, with black areas representing species predicted presence, and white lines representing known geographic distribution ranges (BirdLife International and NatureServe, 2015) museum datasets (R = .84; p-value = .001), which is an indication that species occurrences were recorded on the field close to as expected by the historical species sampling across the region. However, the spe- ADMs were successful when predicting species' presence in an 66% average (std. deviation = 8.1) of the field occurrence sites (1 × 1 km cell) for all species, with the highest for Deltarhynchus flammulatus (84%) and the lowest for Polioptila nigriceps (50%) ( Table 3).
By generalizing the calculation of the spatial correspondence between ADMs' presence/absence values and the FOD's 1-km radius buffer areas, the prediction success among species were similar to the 1x1 km site evaluation (see Table 3 Tropical dry forest ecoregions (Balsas, Jalisco, Sinaloan, and Southern Pacific) included 60% of low (1-4) and 90% of the medium (5-8 spp.) and high (9-11 spp.) species co-occurring areas (see Figure   7). A number of temperate (e.g., pine and pine-oak) ecoregions accounted for just an average of 7.5% (Std. dev. = 2.8) of the species co-occurrence's total area. Among the four tropical dry forest ecoregions, the Balsas' tropical dry forest included the highest proportion of highest species co-occurrence (9-12 spp.), followed by the Southern Pacific's tropical dry forest ecoregion. However, high species co-occurrence uniformly distributes across the Pacific slope as a rather narrow strip. The Balsas' tropical dry forest ecoregion appears as the wider region where there is highest species co-occurrence (see Figure 7).

| DISCUSSION
Distribution models for the group of 12 endemic bird species made evident that species are not constrained to be distributed within the tropical dry forest, which is the vegetation type documented as these species' main habitat (Vega Rivera et al., 2010). The physiographic provinces, used to define the whole study area, adequately included the different habitat types where this group of species is distributed. Besides temperate forests, other habitat types where species can be distributed include the tropical moist montane/lowland and even human-transformed environments (e.g., The IUCN Red List of Threatened Species, 2016). For instance, altitudinal limits for some species can reach the 2,600 m (e.g., Chlorostilbon auriceps), which would suggest their presence in other habitat types (e.g., temperate) besides the tropical habitats.  Where: Total (1x1 km) presences= Total number of 1x1 km cells, recorded as species presence on the field; True presence= model's predicted presence and recorded presence on the field; False absence= model's predicted absence and recorded presence on the field; Total (buffer) presence cells= pixels within 1-km radius buffer areas around sites where species were found present on the field; True presence cells= pixels within 1-km radius buffer areas around sites where species were found present on the field and model's predicted presence; False absence cells= pixels within 1-km radius buffer areas around sites where species were found present on the field and model predicted absence. distribution across the region (T∼ 0), which can be visualized by looking at the more accentuated differentiation among the range of prediction values, as compared to C. auriceps with the highest Tolerance, and even when compared to C. melanicterus with a medium Tolerance value. Both the histogram and maps in Figure 2 show such differences; for instance, the C. auriceps model tend to have a more uniform prediction values distribution which is reflected on a less contrasting range of gray shades in the map.

| Key prediction variables
Even though there was high variability of variables identified as important among the 12 species models, ENFA and Maxent coincided identifying minimum temperature of coldest month (bc6) and mean temperature of coldest quarter (bc11) as two of the most important prediction variables for the largest number of species, followed by annual mean temperature (bc1) and elevation. Unfortunately, this information is not conclusive because of the difficulties in finding and applying a threshold from which differentiate the most relevant variables identified as important in building the SDMs. There is also the fact that variables may show high levels of correlation among themselves.

| Species potential distribution models (PDMs)
Species' potential distribution models (PDMs) were generated under the assumption that species geographic ranges can initially be identified by significant associations between a relatively unchanged physical environment (i.e., bioclimatic and topography variables) and the historical species' occurrences. The species occurrence dataset used (Navarro et al., 2002) is a part of one of the most comprehensive and scientifically recognized datasets documenting the distribution of Mexican birds . The PDMs generated by the three SDMAs reflect the different levels in which habitats are suitable for each of the 12 species (Hirzel & Le Lay, 2008) These observations make evident the need for investigating objective and robust methods to identify prediction thresholds that would provide models with the most possible certainty.

| PDMs accuracy and prediction thresholds
It is important to point out again that this study's main objective is not to evaluate which of the three applied SDMAs is the most accurate. The main concern is to test the application of SDMAs for generating species actual distribution models (ADMs), based on applying objective criteria for selecting prediction thresholds and using documented habitat-species associations for reducing the PDMs to approximate versions of ADMs.
As expected, the three SDMAs generated accurate models, as re-

| Species' actual distribution models (ADMs)
It is rather apparent the contrast in the extent and size of predicted areas when PDMs are compared with ADMs ( Figure 5). PDMs are drastically reduced as a result of eliminating areas with land use/land cover unsuitable for each species. In this modeling exercise, the selection of habitat types suitable for individual species was implemented based on considering no-human-transformed landscapes only. Within our group of 12 bird species, there are some which in fact are known for using at some extent transformed environments. However, because the significance of those transformed environments is unknown, this study's main concern was the actual availability of original habitats on which these endemic species depend.
After assessing the spatial correspondence between the ADMs' predicted presence and the species occurrence data sampled on the field (FOD), it was apparent that differences in the spatial resolution between these two datasets suggested an explanation for some of the mismatches obtained. For instance, the 1-km raster bioclimatic and topographic data fail to capture areas sampled along the coast line (see Figure 6). Another interesting situation occurred when a FOD (i.e., species occurrence) corresponded to a model's predicted species absence, F I G U R E 6 Influence of spatial resolution and spatial context on the correspondence between species actual distribution models (ADMs) and species occurrence sampled on the field. True presences (black cross) represent species presence recorded on the field and modeled as species presence. False absences (white squares) represent species presence recorded on the field and modeled absence. IA are examples of species presences recorded on the field but predicted as absence, in a spatial context dominated by predicted presence. EA are examples of species presences recorded on the field but predicted as absence, where there exist an edge effect related to issues of data spatial resolution but at the same time such a location also corresponded to areas highly dominated by the model's predicted presence (see Figure 6).
Because of these conditions concerning to the spatial resolution and spatial context, a conservative "area-based" approach (1-km radius buffers) was undertaken to test potential variations in the spatial correspondence between ADMs and FOD. The application of the "areabased" approach basically meant the addition of two or three more pixels next to each pixel corresponding to the 1 × 1 km FOD' sites.
The "area-based" approach incorporated 2-4 times more (contextual) pixels than the "site" approach (see Table 3, columns "Total (1x1 km) presences" and "Total (buffer) presence cells"); both correspondence analyses were carried out on a pixel to pixel basis. The slight increase in prediction success when "area-based" approach is applied seemed to capture, at some degree, the issues derived from managing data with different spatial resolutions combined with the pixel configuration associated with the SDMs, as described above.
Another important factor determining the SDMs' prediction success would be the potential changes in the land use/land cover configuration; the digital land use/land cover map, used to reduce PDMs to ASDMs, was elaborated using 1999-2000 data and the field work collecting species occurrence data was carried out 4-5 years later. The location of FOD's points for some species, on small areas predicted as absence but within a matrix of predicted presence areas, supports the hypothesis of changing landscapes ( Figure 6).
The provided species co-occurrence model (Figure 7) may be used for prioritizing areas with high co-occurrence of endemic species of Mexican birds. Even though highest species co-occurrence areas distribute along Mexico's western slope as relatively narrow strips, significant portions of these seem continuous (middle Sinaloan tropical dry forest, Jalisco tropical dry forest coast, and southeastern Southern Pacific tropical dry forest). There is only a small biosphere reserve (~130 km 2 ) for protecting these areas. On the interior land (western half of the Balsas tropical dry forest ecoregion), highest species co-occurrence areas are more extensive; here, there exists the Zicuirán Infiernillo Biosphere Reserve, which is a much larger area (~2,650 km 2 ). Further studies are necessary for assessing the potential of small groups of species like this study's, as indicators F I G U R E 7 Species co-occurrence model obtained by adding actual binary distribution models for eleven species of endemic bird species. Three co-occurrence intervals are shown in green, orange, and red, while ecoregions are delineated by four color lines. The bar graph shows the percentage of each co-occurrence interval located within four main ecoregions