Digital Accessible Knowledge and well-inventoried sites for birds in Mexico: baseline sites for measuring faunistic change

Background Faunal change is a basic and fundamental element in ecology, biogeography, and conservation biology, yet vanishingly few detailed studies have documented such changes rigorously over decadal time scales. This study responds to that gap in knowledge, providing a detailed analysis of Digital Accessible Knowledge of the birds of Mexico, designed to marshal DAK to identify sites that were sampled and inventoried rigorously prior to the beginning of major global climate change (1980). Methods We accumulated DAK records for Mexican birds from all relevant online biodiversity data portals. After extensive cleaning steps, we calculated completeness indices for each 0.05° pixel across the country; we also detected ‘hotspots’ of sampling, and calculated completeness indices for these broader areas as well. Sites were designated as well-sampled if they had completeness indices above 80% and >200 associated DAK records. Results We identified 100 individual pixels and 20 broader ‘hotspots’ of sampling that were demonstrably well-inventoried prior to 1980. These sites are catalogued and documented to promote and enable resurvey efforts that can document events of avifaunal change (and non-change) across the country on decadal time scales. Conclusions Development of repeated surveys for many sites across Mexico, and particularly for sites for which historical surveys document their avifaunas prior to major climate change processes, would pay rich rewards in information about distributional dynamics of Mexican birds.


INTRODUCTION
The temporal dynamics of geographic distributions of species and composition of local biotic communities are central to much of biogeography, macroecology, and conservation biology (Lavergne et al., 2010). That is, how species' distributions evolve through time and how community composition changes as a result are key in determining essentially all results in these areas of ecology and evolutionary biology. Although biogeographers invest fundamentally in retracing the geography of evolving lineages over long periods of time, strangely little information exists on short-term dynamics of species' distributions and community composition (e.g., Nunes et al., 2007;Tingley & Beissinger, 2009).
An important opportunity to understand these shorter-term dynamics of distributions and communities by means of longitudinal comparisons of inventories of local faunas and floras. That is, when a baseline of solid, complete, and well-documented knowledge exists about a site (Colwell & Coddington, 1994;Peterson & Slade, 1998;Soberón & Llorente, 1993), re-surveys over years, decades, and centuries can offer a fascinating view into the natural dynamics of species' ranges and the effects of human presence and activities. Drivers of these changes may act both on local scales (e.g., effects of land use change) and on global scales (e.g., effects of climate change), and potentially may interact as well; studies integrating and comparing effects of different such drivers are particularly rare (e.g., Peterson et al., 2015;Rubidge et al., 2011). Of course, detailed documentation of species identifications and of the completeness of inventory efforts are necessary for both the baseline and the resurvey, but the general paradigm has considerable potential.
Previous baseline/re-survey efforts have yielded fascinating information about faunas and floras. For example, Nilsson, Franzén & Jönsson (2008) documented 40% extirpation of butterfly species over a 90+ year span on a plot in southern Sweden, and found that species disappearing from the fauna tended to be those with a short flight length period, narrow habitat breadth, and small distributional area in Europe; however, only flight length period was significant in multivariate analyses. Grixti & Packer (2006) studied bee communities at a site in southern Ontario, and documented community changes and diversity increases over a 40+ year span, likely in response to successional changes in the surrounding landscapes. Scholes & Biggs (2005) proposed a biodiversity intactness index, and showed widespread population declines, particularly among mammals, and ecosystem declines concentrated in grasslands, across a large region of southern Africa. In California, important efforts have been carried out to resurvey sites studied by Joseph Grinnell a century ago, detecting fascinating distributional (Tingley & Beissinger, 2009), phenotypic (Leache, Helmer & Moritz, 2010, and genetic (Rubidge et al., 2011) changes in vertebrate species and communities (Moritz et al., 2008).
Within Mexico, such before-and-after studies have been particularly scarce. Peterson & Navarro-Sigüenza (2005) used nineteenth-century sources to reflect on changes in the avifauna of the Valley of Mexico over the twentieth century. In a particularly interesting example, Olvera-Vital (2012) re-surveyed the avifauna of Misantla, Veracruz-of great interest is that the avifauna has been quite stable over the past 50-100 years, which apparently reflects the very early mass-disturbance to the natural habitats of that region (Sánchez, 1998), such that the baseline inventory was itself already post-disturbance. On broader and coarser spatial scales, Peterson et al. (2015) assessed countrywide changes in Mexican endemic bird species' distributions between the middle of twentieth century and early twenty-first century, and found dominant effects of changes in temperature (and not of human impact on landscapes or changes in precipitation) in driving avifaunal change.
Curiously, however, to our knowledge at least, this short list includes all sites that have seen baseline and re-survey inventories of birds in the country, such that the nature and pattern of avifaunal change across Mexico remains very poorly characterized.
The purpose of this paper is to stimulate and enable a next generation of such resurvey efforts for birds across Mexico by means of cataloguing sites for which solid documentation exists for the original baseline inventory, and for which the baseline inventory is demonstrably complete. We reviewed all existing Digital Accessible Knowledge (DAK; Sousa-Baena, Garcia & Peterson, 2013) regarding the birds of Mexico: we probed four major online data portals for relevant data (i.e., bird records from Mexico prior to 1980), and evaluated inventory completeness at two spatial extents (0.05 • grid squares, and coarser hotspots of sampling). We present a simple list and catalog of sites that have seen relatively complete (i.e., ≥80% of avifauna known and documented) inventories, as a challenge and stimulus to the ornithological community in Mexico. Resurvey of these localities would provide rich and informative rewards in understanding the dynamics of bird populations and distributions across the country.

MATERIALS & METHODS
This analysis is based on a suite of assumptions about data quality and appropriate resolutions inherent in biodiversity data. For instance, we used single days as the unit of sampling effort throughout our analyses, as this resolution has proven an effective balance between too much and too little resolution; temporal information finer than the level of day is rarely available with older biodiversity data, whereas coarse temporal resolution can underappreciate the efficacy of short-term, intensive inventory efforts. Hence, a basic working data set was the combination of species identification, day (i.e., unique combinations of year, month, and day), and place. This latter we defined in various ways, described below, but we note that we used both geographic coordinates and textual locality descriptions to avoid loss of information owing to lack of georeferencing, at least to every extent possible.

Data sources and quantities
We downloaded data from four basic sources for this study: the Global Biodiversity Information Facility (GBIF; http://www.gbif.org), VertNet (http://www.vertnet.org/), the Red Mundial de Información de la Biodiversidad (REMIB; http://www.conabio.gob. mx/remib/doctos/remib_esp.html), and UNIBIO (http://unibio.unam.mx/). Each of these biodiversity information networks has its respective strengths and weaknesses, and considerable overlaps exist among them in coverage of biodiversity information sources. We downloaded data on bird occurrences in Mexico from all four, and trusted that duplication would be removed in our data cleaning steps; for this reason, we are unable to develop comparisons among the different data resources. For GBIF and VertNet, automated download was possible; however, for REMIB and UNIBIO, we requested and were provided with data 'dumps,' as some of the data were restricted from full public access (REMIB) or not yet fully available to the public (UNIBIO). The full set of sources contributing data to this analysis is provided in the Appendix.  Table 1 Summary of initial data downloaded from each of four biodiversity data portals for Mexican vertebrate classes, and the relative redundancy of records in each, at the level of species × time (year, month, day) × place (geographic coordinates, textual descriptions). Note that subsequent data cleaning steps changed these initial tallies of redundancy, as they sought synonymies for taxon and place that may not have been visible in this initial step.

Data reduction and cleaning
Initial data downloads totaled tens to hundreds of thousands of records from Mexico from each of the data portals, reaching millions from GBIF (although the GBIF numbers are largely from eBird and aVerAves, which come in greatest part from post-1980; Fig. 1, Table  1). We expected considerable redundancy between data sources, in the form of situations in which the same species was collected or recorded at the same place on the same day, so we embarked on a lengthy process of data reduction and cleaning. Without a doubt, some mistakes were made, and some information was lost, but this effort aimed to detect and highlight the major features of Mexican bird DAK, rather than all of the details. That is, we focused on sites that had the most information available, and explicitly excluded information for less-well-known sites. A first step was to concatenate all four datasets into a single, larger dataset, and to reduce the set of fields to the essential three suites of fields mentioned above: species (we retained order, family, genus, and species, to permit identification of the most difficult names), date collected (year, month, and day), and place (latitude and longitude, when available, and state, municipality, and specific locality). We filtered these records to remove all those records from years after 1980. The initial total of 2,598,478 records distilled to 845,658 (a 67.5% reduction) that both (1) had dates and (2) were not from after 1980.
From this set, we extracted 10,762 unique combinations of order, family, genus, and species, which included diverse name combinations, and required considerable work to distill to a consistent suite of names corresponding to the birds of Mexico. We avoided the temptation to attempt to take the taxonomic treatment to a newer authority list (e.g., Gill & Donsker, 2014;Navarro-Sigüenza & Peterson, 2004;Peterson & Navarro-Sigüenza, 2006) because taxonomic splitting, which has dominated recent years of taxonomic work, would cause considerable confusion of names applied to older records. Hence, we reduced the initial, highly redundant set of names to 1,027 names that coincided with the taxonomy of the American Ornithologists' Union (AOU, 1998), except that we merged Empidonax traillii and E. alnorum, in light of very frequent confusion in identification (Heller et al., 2016). This step was achieved based on long years of experience with Mexican bird taxonomy, plus occasional consultation of the literature (Monroe & Sibley, 1993;Peters, 1931Peters, -1987 and online (http://avibase.bsc-eoc.org/) resources.
A second step was to summarize date information as a concatenation of year, month, and day (e.g., 1964_7_16). Records with dates for which all three time elements were missing were removed from analysis, but records with partial dates (e.g., year and month, but not day) were treated as a unique time event. All data management was carried out in OpenRefine (http://openrefine.org), which permitted many important initial steps of combining similar names, and in Microsoft Access, which permitted development of customized queries for further refinement.

Data analysis
Perhaps the most difficult-to-manage of the fields was that of 'place.' Here, we used a three-step process that aimed to retain a maximum of locality information, yet avoid the massive and prohibitive task of full georeferencing of all records for which no geographic coordinates were available. Hence, (1) we used the 502,935 records that had latitudelongitude data in a first-pass analysis that aimed to identify single sites that were well inventoried, and to identify somewhat broader 'hotspots' of sampling (details provided below). Next (2), we used locality names associated with the records falling in the hotspots to probe the data lacking georeferences, and thereby rescued >54,000 records. Finally (3), we inspected the remaining data records-those lacking georeferences-to identify additional sites that merited analysis (five such additional, un-georeferenced sites indeed proved to be relatively well-inventoried, such that this step was important; see below). In this way, the number of localities that needed to be georeferenced was minimized, and yet we managed to include the great bulk of Mexican DAK in our analyses.
Step 1: Here, we aimed to develop first analyses of the 502,935 records from the original data set that carried latitude-longitude coordinates, both to identify individual pixels (0.05 • resolution) that were well-inventoried, and to identify concentrations (''hotspots'') of sampling that may or may not prove to be well-inventoried. We first filtered this data set to retain only the records that were within 0.05 • (5-6 km) of the administrative outline of Mexico-this step left 499,794 (99.4%) records for initial analysis. This initial data set was further reduced to 481,409 (95.7%) that came from 1980 or before.
We then created a shapefile with square elements of 0.05 • for Mexico, and eliminated pixels that were >0.05 • from the administrative boundaries of Mexico. We then used a spatial join to count numbers of records in each polygon; numbers of records per cell ranged from nil to as high as 15,464 (a pixel centered on Chilpancingo, Guerrero). We used optimized hot spot analysis (implemented in ArcGIS, version 10.2) to identify concentrations of sampling effort across the country-we used the Getis-Ord Gi* statistic (Hot Spot Analysis tool), and focused only on hotspots (0.05 • grid squares) significant at the highest (99%) confidence level. We isolated these well-sampled suites of pixels as a separate shapefile, and then merged pixels in contiguous sets as preliminary hypotheses of broader hotspots of sampling, which we enriched with more records in step 2 (see below).
We also used the initial data set to develop an identification of single 0.05 • pixels that were well-inventoried, as follows. The 481,409 pre-1980 records remaining after clipping to the country boundaries reduced to 277,249 (57.6%) unique combinations of species × date × pixel. We calculated S obs as the number of species that have been recorded from each pixel, N as the total number of unique combinations of species × date from each pixel (and thus we set a base spatial resolution as that of the 0.05 • grid, such that co-occurrences of species within these pixels are assumed to be sympatric), and a and b as the numbers of species recorded exactly once and exactly twice, respectively, from each pixel. We then used these data to calculate, for each pixel, the Chao2 estimator of expected species richness, with its associated adjustment for small sample sizes (Colwell, 1994-present), as Completeness was then calculated as C = S obs / S exp . To avoid including occasional localities with small sample sizes and artificially 'complete' inventories, we further removed all pixels for which N < 200. We summarized these calculations in a table, which we then imported into ArcGIS and joined to the grid shapefile for visualization.
Step 2: This step aimed to assess the fairly large portion of the data that lacked georeferences (195,371 records), and 'rescue' relevant data for further enrichment of the hotspot analysis. That is, in Step 1, we used georeferenced occurrence data to identify hotspots of sampling, but many more occurrences were documented in records lacking such data. Hence, here, first, we used the merged hotspot shapefile to associate the original occurrence data (i.e., records with georeferences) to hotspots, and identified key locality names among those occurrence data. (e.g., the Comitán hotspot included localities such as ''La Trinitaria'', ''El Triunfo,'' ''Lagunas de Montebello,'' ''Las Margaritas,'' and ''Santa Rosa''). Finally, we used those names to probe the non-georeferenced data records (checking, of course, that offset distances were not >5 km), and thereby rescued 54,369 records that were added to the hotspot-based analyses. Inventory statistics and completeness values were then recalculated for each hotspot just as they were calculated in Step 1 for each pixel.
Step 3: This final step involved review of the remaining non-georeferenced records, even after the rescue of Step 2. That is, Step 2 focused on non-georeferenced records that corresponded to already-identified pixels and hotspots, and that could be used to enrich the existing (georeferenced) data from those sites. This third step, however, involved a look at the remaining data to see if additional well-known sites could be identified.
We assembled the raw locality descriptors, and tallied numbers of records associated with each; we did some minor cleaning and synonymizing of minor variants on locality descriptors to maximize numbers of records for each locality. Finally, we developed completeness indices (as described above) for each of the non-georeferenced localities that had a raw sample size (i.e., unique combinations of locality descriptor, year/month/day, and species) of ≥200 records. Because the focus of these exercises is on detecting the few well-inventoried sites, rather than characterizing the overall sampling landscape, no biases are introduced by this rescue step. All localities meeting an arbitrary criterion of completeness, C ≥ 0.8, were then georeferenced and included among well-known sites.

RESULTS
Processing data available for birds of Mexico from raw DAK downloads into usable records of species at localities on particular days involved considerable reduction in numbers of records (Table 1). Simply progressing from raw to unique combinations of species × locality × geographic coordinates × year × month × day involved a reduction of 20.1-26.2% in numbers of records (note that this reduction step was done prior to removing records post-1980). Subsequent data cleaning and reduction steps (see Methods, above) reduced redundancy both among data sources (GBIF, VertNet, REMIB, UNIBIO) and among nearby localities, leaving a final number of 481,409 records for analysis.
A first analysis focused on 277,249 unique combinations of species × date collected × 0.05 • pixel. These records fell in pixels in numbers ranging from nil to 15,464 records (Chilpancingo, Guerrero). Processing records in each pixel into estimates of expected numbers of species and estimated inventory completeness (Table 2; Fig. 2), we found that 24 pixels were complete at the level of C ≥ 0.9, and a further 71 pixels were complete at the level of C ≥ 0.8. The well-inventoried sites were well-distributed across the country, from Baja California Sur and Sonora to Quintana Roo and Chiapas (Fig. 2). A further five    Figure 2 Distribution of single 0.05 • grid squares for which >200 records were available and completeness (C) was 0.8 ≤ C < 0.9 (pink) or C ≥ 0.9 (brown). Sites detected in Step 3 (i.e., single sites that are relatively complete, but that were not georeferenced prior to this study) are shown in blue (all had 0.8 ≤ C < 0.9).
localities were rescued from among the pool of data lacking geographic coordinates, but that were inventoried completely to the level of C ≥ 0.8 (Table 3). Seeking 'hotspots' of sampling (i.e., sets of contiguous 0.05 • pixels), we identified an initial large number of such hotspots, again well-distributed across the country. Our inspection of the data lacking geographic coordinates 'rescued' 54,369 records, augmenting the initial data set considerably. Of the initial sampling hotspots, only three made the C ≥ 0.9 completeness criterion (Xalapa, Veracruz; El Triunfo, Chiapas; Isla Cozumel, Quintana Roo); another 17 fell at the lower C ≥ 0.8 criterion (Table 4). These hotspots covered areas ranging from 60.5-4325.8 km 2 , considerably larger than the ∼30 km 2 of the 0.05 • pixels. The hotspots, once again, ranged across the country, from southern Sonora to Chiapas and Quintana Roo (Fig. 3).

DISCUSSION
This contribution is an exploration of the utility of the existing Digital Accessible Knowledge (Sousa-Baena, Garcia & Peterson, 2013) for Mexican birds in identifying well-inventoried sites across the country. We chose our 1980 cut-off to coincide roughly with the transition between initial (subtle) climate changes and the present phase of rapid, large-scale climate change worldwide (IPCC, 2013). In this way, the sites that we have identified provide baseline points of reference for species composition, for comparison with species composition later, decades into the processes of global climate change (Tingley et al., 2009). One could map the species diversity that has been documented or that we have estimated for 0.05 • grid squares and/or hotspots across the country, to obtain a picture of species diversity countrywide. We have avoided this temptation, however, in view of the highly non-random and scattered distribution of the well-inventoried sites across the country. The sites do not cover all regions of the country at all consistently, so any the results would be incomplete and potentially misleading. That step was, quite simply, not among the objectives of this study.
Rather, our aim was to compile a catalog of sites across Mexico that have been inventoried historically in detail, to the point that the inventory is more or less complete. This catalog, in and of itself, is not of great interest scientifically; however, to the extent that new inventories can be developed for comparison with the old ones, the interest in the comparisons grows considerably. Echoing our earlier contribution (Peterson et al., 2015), we are fascinated by the long-term processes of population biology and biogeography that are leading to turnover of species at sites.
We have improved on our earlier contribution (Peterson et al., 2015), however, which had to aggregate occurrence data to coarse resolutions (1 • , or ∼110 km) until inventories were sufficiently complete. Here, in contrast, we have sought single sites (0.05 • grid squares) that are completely inventoried-this difference has the major advantage of not including as much beta diversity in single-site inventories as did our previous study. Our Table 4 Hotspots of sampling that are relatively completely inventoried (i.e., C ≥ 0.8). Hotspot names correspond to the shapefile dataset summarizing the geographic distribution of these sites.

State
Hotspot  consideration of hotspots, to some degree, began to coarsen the spatial extent of the sites once again, but offered a somewhat more extensive list of sites that have seen thorough inventories, yet still across extents much smaller than in our previous work.
One criticism that can be leveled at this work is that some important data may have been left out of the analysis. That is, the entire concept of Digital Accessible Knowledge is that the data are (1) in digital format, (2) accessible readily via the Internet, and (3) integrated with the remainder of DAK via common portals (i.e., making the transition from individual data points to integrated ''knowledge''), as was emphasized in the original publication presenting the idea of DAK (Sousa-Baena, Garcia & Peterson, 2013). (We note that a subsequent publication (Meyer et al., 2015) used ''digital accessible information,'' we believe unfortunately, as they provided no justification for or even notice of the change of terminology.) In the case of Mexican birds, for example, the Natural History Museum (UK) has very few data in digital format, and none has been made accessible, such that important collections from Mexico, like those assembled by Godman & Salvin (1879-1915, have not been analyzed in this contribution. That is the blessing and the curse of DAK: data that are digital and shared on global portals are used broadly, whereas data that do not meet the DAK criteria are frequently not used at all. The purpose of this paper is to enable a broad suite of repeat avifaunal survey efforts across Mexico. In effect, with the maps and tables of this paper, we challenge the ornithological community interested in Mexican birds to focus attention on these sites (we provide a complete compendium of the well-inventoried sites and hotspots documented in this paper, as well as associated shapefiles, in a dataset made available permanently at http://hdl.handle.net/1808/20673). Not only does work at these sites provide information about the current community composition there, but also about the change in those communities through time.
We suggest and advocate that resurvey efforts take the form of two inventories at or near the site: one in as exactly the original site as is possible to determine, and the other in the closest and most comparable site that still retains the vegetation type that was represented at the site at the time of the original inventory efforts; the former reflects effects of local-scale processes (e.g., habitat destruction, aridification), whereas the latter reflects more global processes (e.g., climate change), and comparisons of the two resurvey inventories will yield a rich understanding of the relative magnitude of effects of the local and global processes in changing avifaunas across the country. Once several such sites have been re-surveyed, a rich picture of the dynamics of Mexican bird distributions will emerge, in much greater detail than the picture presently available (Peterson et al., 2015).
We are working to assemble what historical photographs exist for each of these sites, in association with the original inventory efforts, to further enrich comparisons of 'before' and 'after.' For instance, the Nelson and Goldman expeditions across Mexico in the late nineteenth and early twentieth centuries produced large quantities of images that are well-documented in two summary volumes (Goldman, 1951;Nelson, 1921)-two examples are shown in Fig. 4. A major complement to the repeat inventories that are facilitated by the analyses in this study would be repeat photography to allow a clear view of what sorts of landscape change have occurred at key sites (see, e.g., Spond, Grissino-Mayer & Harley, 2014).

CONCLUSIONS
Valuable insights can be gained from longitudal comparisons to detect and characterize patterns of change in biodiversity, yet such changes have been opaque to study for lack of paired temporal samples at sites or in areas. This study explores a novel approach to enabling such studies: we mine the existing DAK for Mexican birds to detect and document well-inventoried sites, and provide a catalog of those well-known sites for others to use. Developing repeated inventories at a series of sites across Mexico would yield a detailed, controlled set of comparisons that would allow a view of avifaunal dynamics across the country. Adding the dimension of repeated landscape photography to the repeated inventories would greatly facilitate pairing of sites for future inventories at sites for which historical information exists.