The database of the PREDICTS (Projecting Responses of Ecological Diversity In Changing Terrestrial Systems) project

Abstract The PREDICTS project—Projecting Responses of Ecological Diversity In Changing Terrestrial Systems (www.predicts.org.uk)—has collated from published studies a large, reasonably representative database of comparable samples of biodiversity from multiple sites that differ in the nature or intensity of human impacts relating to land use. We have used this evidence base to develop global and regional statistical models of how local biodiversity responds to these measures. We describe and make freely available this 2016 release of the database, containing more than 3.2 million records sampled at over 26,000 locations and representing over 47,000 species. We outline how the database can help in answering a range of questions in ecology and conservation biology. To our knowledge, this is the largest and most geographically and taxonomically representative database of spatial comparisons of biodiversity that has been collated to date; it will be useful to researchers and international efforts wishing to model and understand the global status of biodiversity.

KEYWORDS
data sharing, global biodiversity modeling, global change, habitat destruction, land use

| INTRODUCTION
Many indicators are available for tracking the state of biodiversity through time, for example to assess progress toward goals such as the Convention on Biological Diversity's 2010 target or the newer Aichi Biodiversity Targets (Pereira et al., 2013; Tittensor et al., 2014). Most of the available indicators are taxonomically or ecologically narrow in scope, and many are based on the global status of species (e.g., Butchart et al., 2010; Tittensor et al., 2014), because of the finality of extinction. However, using a more representative set of taxa and considering local biodiversity offers several advantages. First, average responses of species to human impacts typically vary among higher taxa and ecological guilds (Lawton et al., 1998; McKinney, 1997; Newbold et al., 2014; WWF International, 2014), meaning that indicators need to be broadly based and as representative as possible, if they are to be used as proxies for biodiversity as a whole. Second, the taxa for which most data on trends are available (typically, charismatic groups such as birds or butterflies) are not always the most important for the continued functioning of ecosystems and delivery of ecosystem services (Norris, 2012). Third, although many of the ultimate drivers behind biodiversity loss are global, the most important pressure mechanisms usually act much more locally (Brook, Ellis, Perring, Mackay, & Blomqvist, 2013). Fourth, most ecosystem services and their underpinning processes are mediated by local rather than global biodiversity (Cardinale et al., 2012; Grime, 1998): it is local rather than global functional diversity, for example, that determines how ecosystems function in a given set of conditions (Steffen et al., 2015). Finally, presence/absence and especially abundance of species at a site respond more rapidly to disturbance than extent of geographic distribution or global/national extinction risk (Balmford, Green, & Jenkins, 2003; Collen et al., 2009; Hull, Darroch, & Erwin, 2015), so local changes are likely to be detected before large global changes or extinction.
For these reasons, there is a need to model the response of local biodiversity to human pressures and, thus, to estimate biodiversity changes at local scales, but across a wide spatial domain (ideally globally) and for a wide range of taxa. We therefore need comparable high-quality data on local biodiversity at different levels of human pressure, from many different taxa and regions. At present, spatial comparisons of how biodiversity responds to variation in pressures provide the only feasible way to collate a large, globally representative evidence base and to model responses to human impacts. Although large temporal datasets are available (e.g., Butchart et al., 2004; Collen et al., 2009; Dornelas et al., 2014; Vellend et al., 2013), they may not be sufficiently representative of anthropogenic pressures for the trends they show to be taken at face value (Gonzalez et al., 2016). Furthermore, in the absence of contemporaneous site-specific information about pressures, it is not straightforward to use these data to model how biodiversity responds to pressures or to project changes into the future (but see Visconti et al., 2015). Spatially extensive field data of suitable quality and resolution are time-consuming and expensive to collect.
The most convenient and readily available source of suitable biodiversity data is the published literature: Thousands of published papers are based on datasets that would be of value to global modeling efforts.
However, it has been rare for such papers to publish data in full, even as supporting information, meaning that many potentially valuable datasets are "dark data" (Hampton et al., 2013), effectively at risk of being lost to science if they have not been lost already.
Since 2012, the PREDICTS project has been collating data on local biodiversity at different levels of human pressure from published papers, where necessary contacting those papers' corresponding authors to request the underlying biodiversity data, species' identities, and precise sampling locations. We have enhanced the collated data by scoring site characteristics relating to human pressures such as the predominant land use and how intensively the land is used by humans. We also used the geographical coordinates of the sites to match them to a number of published spatially explicit datasets.
The database has already been used to conduct global (e.g., Newbold et al., 2015; Newbold, Hudson, Arnell, et al., 2016), regional and national (Echeverría-Londoño et al., 2016) analyses of the responses of local biodiversity to land use and related human pressures. The database was first described by Hudson et al. (2014), who published an interim version (March 2014) of the site-level metadata along with a detailed description of how the database has been collated and validated. Since that time, the database has nearly doubled in size. Here, we describe the status of the database and make available the full species-level data themselves (not just the site metadata previously released) to facilitate other research, especially into human impacts on ecological assemblages. We also include suggestions for how the database can be used.

| METHODS
We sought datasets describing the abundance or occurrence of species, or the diversity of ecological assemblages of species at multiple sites in different land uses or at different levels of other human pressures (e.g., differing levels of land-use intensity). Data were primarily collated through subprojects on particular regions, land uses, or taxa.
We also made general requests for data at conferences and through published articles (Hudson, Newbold, et al., 2013; Hudson et al., 2014; Newbold et al., 2012). Through the course of the project, searches were increasingly targeted toward under- or unrepresented regions, biomes, or taxa, in order to mitigate biased coverage in the literature.
To be included in the database, data were required to meet the following criteria: (1) the dataset was part of a published work, or the sampling methods were published; (2) the same sampling procedure was carried out at each site within each study (sampling effort was permitted to vary so long as it was recorded for each site); and (3) we could acquire the geographical coordinates of each sampled site. Where the author of the original publication was unable to supply the geographical coordinates, sites were georeferenced from maps in the publication. Each Site's land use was classified as primary vegetation, secondary vegetation (divided according to stage of recovery into mature, intermediate and young; or indeterminate where information on stage was unavailable), plantation forest, cropland, pasture or urban; within each land-use class, use intensity was classified as minimal, light or intense. Both classifications were based on the description given in the source publication or information subsequently provided by data contributors (see Hudson et al., 2014 for full details). These land-use categories were chosen to be as compatible as possible with those used in the harmonized land-use scenarios for 1500-2100 (Hurtt et al., 2011) in order to facilitate spatial and temporal projections of modeled land-use effects on biodiversity (e.g., Newbold et al., 2015). For some sites, land use and/or use intensity could not be established, so these fields were given missing values.
The data were arranged in a hierarchical structure. The data from an individual published work, typically a published paper, constituted a "DataSource." Where different sampling methods were used within a DataSource, for example, because different taxonomic groups were collected, and the data were made available separately, the data were divided into separate "Studies." Data from a given DataSource were also split into multiple Studies if they covered large geographic areas (e.g., several countries), to reduce the effect of biogeographic differences within Studies. Each Study contained a set of sampled "Sites" and "Taxa"; at each Site a set of "Measurements" (typically the abundance or occurrence of a set of taxa) were taken. The provided database extracts contain, for each Site, the raw measurement values, the sampling efforts and, where relevant, the effort-corrected abundance values (corrected across Sites within a Study by dividing the abundance measurement by sampling effort, assuming that sampled abundances increase linearly with sampling effort, after first rescaling effort values within each Study to a maximum value of one). The measurements were not corrected for different detectability (Hayward et al., 2015; MacKenzie et al., 2002).
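The effort correction described above can be illustrated with a minimal sketch. It assumes strictly positive effort values and uses hypothetical record keys (`study`, `site`, `abundance`, `effort`) that are not the column names of the released extracts; it is an illustration of the stated rule, not the project's own code.

```python
from collections import defaultdict


def effort_corrected(records):
    """Return copies of `records` with rescaled effort and corrected abundance.

    Each record is a dict with hypothetical keys 'study', 'site',
    'abundance' and 'effort' (effort is assumed to be > 0).
    """
    # Find the largest sampling effort recorded within each Study.
    max_effort = defaultdict(float)
    for r in records:
        max_effort[r["study"]] = max(max_effort[r["study"]], r["effort"])

    corrected = []
    for r in records:
        # Rescale effort so that the maximum within each Study is one.
        rescaled = r["effort"] / max_effort[r["study"]]
        # Assume abundance rises linearly with effort: divide by rescaled effort.
        corrected.append({
            **r,
            "rescaled_effort": rescaled,
            "corrected_abundance": r["abundance"] / rescaled,
        })
    return corrected


# Toy data: two Sites in Study S1 with unequal effort, one Site in Study S2.
records = [
    {"study": "S1", "site": "a", "abundance": 10.0, "effort": 2.0},
    {"study": "S1", "site": "b", "abundance": 12.0, "effort": 4.0},
    {"study": "S2", "site": "c", "abundance": 3.0, "effort": 5.0},
]
corrected = effort_corrected(records)
```

In this toy example, Site "a" received half the maximum effort in its Study, so its abundance is doubled, while Sites sampled with the maximum effort are left unchanged.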
It is important to note that the data in the database are often not exactly the same as those used in the source papers. Numbers of sites may differ because datasets provided may have been partial or included extra sites, or because we have aggregated or disaggregated data differently. Likewise, numbers of taxa may differ because of curation or because more data were provided than had been used in the source paper. Because our focus was to make these data as useful as possible for PREDICTS analyses, rather than to act as a repository for datasets from previous publications, it will often not be possible to use these data to replicate the analyses presented in the source papers.
We were limited by the rate at which we could process new data because so many datasets were contributed. This led to the development of a backlog, which we had to clear by the end of the first phase of funding for PREDICTS. During this stage of the project, in order to process all the datasets in hand within the time available, we focused our efforts on the fields shown to be most important in our models to that point (De Palma et al., 2015; Newbold et al., 2014, 2015). As a result, DataSources processed since early 2015 often lack data for some fields, including coordinate precision and maximum linear extent; details of the potentially affected fields are listed in Supporting Information.
Team members were trained in how to score datasets received, using written definitions and descriptions of fields and terms, as well as practice datasets. All data underwent basic validation checks to ensure values entered in each field were appropriate.
Geographical coordinates were visually inspected on a map after entry into the database, and our software automatically detected coordinates falling outside of the expected country (e.g., because latitude and longitude values were accidentally swapped). For the calculation of biodiversity metrics such as species richness, we accepted the identifications of species provided by the authors of the source publications; these were determined at the time of the original research, and so will not reflect subsequent taxonomic changes or re-identifications.
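The source does not describe how the automatic coordinate check was implemented. As a hedged illustration only, the sketch below flags a coordinate pair that falls outside an expected region and tests whether swapping latitude and longitude would bring it back inside. The bounding box is a made-up stand-in for a country outline (a real check would more plausibly use country polygons), and all function names are hypothetical.

```python
def in_box(lat, lon, box):
    """True if (lat, lon) lies inside box = (south, north, west, east)."""
    south, north, west, east = box
    return south <= lat <= north and west <= lon <= east


def check_coordinates(lat, lon, box):
    """Classify a coordinate pair against the expected bounding box."""
    if in_box(lat, lon, box):
        return "ok"
    # A longitude can only be a mistyped latitude if it lies within [-90, 90].
    if -90 <= lon <= 90 and in_box(lon, lat, box):
        return "possibly swapped"
    return "outside expected country"


# Hypothetical bounding box (degrees); it does not cross the antimeridian.
EXAMPLE_BOX = (-5.0, 5.0, 30.0, 42.0)

status_ok = check_coordinates(1.0, 36.0, EXAMPLE_BOX)
status_swapped = check_coordinates(36.0, 1.0, EXAMPLE_BOX)
status_outside = check_coordinates(50.0, 100.0, EXAMPLE_BOX)
```

A flagged Site would still need manual inspection, since a point outside the box may reflect a wrong country label rather than swapped values.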
We also matched taxonomic names to the Catalogue of Life 2013 checklist (COL; Roskov et al., 2013), allowing us to validate many of the names, assess taxonomic coverage and relate measurements to species-level datasets such as those describing ecological traits. We make available both the original species classifications and those from COL (field names are given in Supporting Information). We reviewed and corrected a number of potential error cases, such as names without a matching COL record, and names for which the higher taxonomic rank of the matching COL record was unexpected (e.g., a COL record for a true fly within a Study that examined birds). Many more validation checks were applied; a complete description is in Hudson et al. (2014).

FIGURE 1 Sampling locations. Map colors indicate biomes, taken from the Terrestrial Ecoregions of the World dataset (The Nature Conservancy, 2009), shown in a geographic (WGS84) projection. Circle radii are proportional to log10 of the number of samples at that Site. All circles have the same degree of partial transparency. Sites added to the database since Hudson et al. (2014) are shown in pink.

| Taxonomic coverage
Records in the PREDICTS database represent 47,044 species (see Hudson et al., 2014 for how species numbers are estimated in the face of imprecise taxon names), which is over 2% of the number thought to have been formally described (Chapman, 2009

| Temporal coverage
We focused primarily on data sampled since 2000 because most global layers describing human pressure are collected after this year and, in particular, to facilitate use of contemporaneous Moderate-resolution Imaging Spectroradiometer (MODIS) remotely sensed data (Justice et al., 1998; Tuck et al., 2014) in modeling. However, in filling certain taxonomic and geographic gaps, we also collated some data that were sampled before 2000 (Figure 6). Data are sparse after 2012 because of the natural time lags between data collection in the field, publication and then assimilation into the PREDICTS database (Figure 6).

| Data access and structure
This 2016 release of the database (the complete dataset and also site-level summaries) is available on the data portal of the Natural History Museum, London (doi: 10.5519/0066354) as comma-separated values (CSV) files and as RDS files, the latter for use with the R statistical modeling language (R Core Team, 2015; RDS files were generated using R 3.3.1). A complete description of the columns in the extracts, along with a visualization of the database schema, is given in Supporting Information. This paper makes all the data in this version of the database freely available to anyone wishing to use

FIGURE 3 Numbers of Sites against the areas of biodiversity hotspots and of high biodiversity wilderness areas (HBWAs). Hotspots are shown by circles and HBWAs by squares; symbols are colored by the predominant biogeographic realm in which they fall.

| DISCUSSION
The PREDICTS database is designed to be able to address a range of questions about how land use and related pressures have influenced the occurrence and abundance of species and the diversity of ecological assemblages. The highly structured nature of the data, with comparable surveys having been carried out at each Site within a Study, was chosen to facilitate such modeling (Table 1).
The largest open compilation of biodiversity data is the Global Biodiversity Information Facility (GBIF; www.gbif.org), which aggregates mostly unstructured species occurrence data. The unstructured nature of most GBIF data limits the range of questions to which they can easily be put, although they are increasingly used in modeling species distributions (e.g., Pineda & Lobo, 2008) and habitat suitability (e.g., Ficetola, Rondinini, Bonardi, Baisero, & Padoa-Schioppa, 2015). As of April 2016, GBIF holds over 560 million georeferenced occurrence records of around 1.5 million species, although coverage is taxonomically uneven (e.g., most records are of birds) and patchy even among the best-recorded groups (Meyer, Kreft, Guralnick, & Jetz, 2015).
Although our targeting of data from underrepresented biomes and taxa reduces the effects of geographic and taxonomic biases in available data, the PREDICTS database nonetheless has many limitations, of which four are particularly important to note. First, our individual datasets seldom take a whole-ecosystem perspective, being instead taxonomically or ecologically restricted; consequently, our data shed little light on how trophic webs or other interactions are affected by human pressures. Second, even within the groups sampled, our data do not provide complete inventories of the species that would be found with comprehensive sampling; thus, failure to record a species from a Site does not provide strong evidence of absence.
Third, Latin binomials were not available for a sizeable fraction of the species in our DataSources, limiting the prospects for linking the observations of occurrence and abundance to other information about the species (e.g., functional traits; Kattge et al., 2011). Last, because our database was designed to test hypotheses about local-scale variation in biodiversity, it is not particularly informative about large-scale biodiversity patterns such as the latitudinal gradient in species richness or how pressures with a coarse spatial grain (e.g., atmospheric nitrogen deposition; Simkin et al., 2016) influence Site-level diversity.
When using the PREDICTS database, or indeed any database, to model biodiversity responses, it is important to be aware of potential mismatches in scale between Site-level data and pressure data such as MODIS remotely sensed data (Justice et al., 1998) and the harmonized land-use scenarios (Hurtt et al., 2011) [...] measurements to estimate changes in gamma diversity over broader areas (e.g., Azaele et al., 2015); both approaches offer potential solutions to mismatches in scale.

References: Austin, Nicholls, and Margules (1990)
Approach: Filter to remove species not of interest. Merge PREDICTS data with data on any additional site-level characteristics of interest. One possible analytical approach is to model effects of site characteristics on presence-absence and log(abundance when present) separately, the first with binomial errors and the second with Gaussian errors, while accounting for among-Study differences (e.g., using mixed-effects models).

Q 2. Do changes in land use facilitate success of invasive species?
References: Dukes and Mooney (1999), Theoharides and Dukes (2007)
Approach: Obtain lists of invasive species for the regions of interest and model presence-absence and/or abundance of invasives as above.

Q 3. Which ecological attributes of species make them more or less sensitive to human pressures?
References: McKinney (1997), Davies, Margules, and Lawrence (2000), Cardillo et al. (2005)
Approach: Merge PREDICTS data with species-level data on traits of interest. Model how site and species characteristics affect presence-absence and log(abundance when present) separately as above, accounting for Study-level and taxon-level differences (e.g., using mixed-effects models).

References: [...] (2000), Gibson et al. (2011)
Approach: Add taxonomic group into models above as a fixed effect interacting with other fixed effects.

Q 5. Are phylogenetically distinct species particularly sensitive?
References: Gaston and Blackburn (1997), Purvis, Agapow, Gittleman, and Mace (2000)
Approach: Analyze phylogenetic distinctiveness or unique evolutionary history in the same way as ecological attributes.

Q 6. What are the relationships between geographic range size or occupancy and abundance?
References: Brown (1984)
Approach: Merge PREDICTS data with species-level data on range sizes or occupancy. Filter to the land uses of interest (e.g., primary vegetation if the focus is on natural systems), and examine within-Study relationship between abundance and relative range size or occupancy.

References: Balmford (1996), Travis (2003)
Approach: Merge Site-level diversity data with Site-level data on characteristics to be tested and assess the interaction of these variables with land use.
Example: Gray et al. (2016)

Q 13. How accurate are global land-use data?
References: Giri, Zhu, and Reed (2005)
Approach: Use Site-level land-use data to calculate the receiver operating characteristic curve (i.e., sensitivity versus false-positive rate), using the area under the curve to quantify agreement. An extension of this could be to use the PREDICTS Site-level land-use data as input into land use/land cover classification procedures, for example, by the remote sensing community, or at least to cross-check and validate land use and land cover maps against independent PREDICTS data.
Example: Hoskins et al.

Questions above the site level

Q 14. Is beta diversity lower in human-dominated than more natural land uses?
References: Tylianakis et al. (2005)
Approach: Estimate desired measures of similarity among Sites within Studies. Model how biotic similarity among Sites depends on similarity of other attributes (including characteristics from remote sensing or Dynamic Global Ecosystem Models if required), accounting for among-Study differences (e.g., using mixed-effects models).
Example: Newbold, Hudson, Hill, et al. (2016)

Q 15. Are land-sparing or land-sharing strategies optimal for local biodiversity?
References: Green, Cornell, Scharlemann, and Balmford (2005)
Approach: Analyze species by Sites and by Study and relate back to Q 1. The overarching question about sparing versus sharing can be addressed by looking at the individual responses of species to land-use intensity, as measured by yield, as suggested by Green et al. (2005); this requires data on agricultural yields at relevant Sites in the PREDICTS database.
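The two-part analysis suggested in Table 1 (presence-absence and log abundance when present, modeled separately) starts from two derived response variables. The sketch below shows only that preparation step, on toy records with hypothetical keys (`study`, `site`, `taxon`, `abundance`); it does not fit any model, and the mixed-effects fitting mentioned in the table would be done afterwards in a statistics package of the analyst's choice.

```python
import math


def two_part_responses(measurements):
    """Split abundance records into the two responses of a two-part model.

    Returns (presence, log_abundance):
      presence      - one row per record with a 0/1 'present' flag,
                      suitable for a model with binomial errors;
      log_abundance - rows for present taxa only, with the natural log of
                      abundance, suitable for a model with Gaussian errors.
    """
    presence, log_abundance = [], []
    for m in measurements:
        keys = {k: m[k] for k in ("study", "site", "taxon")}
        present = 1 if m["abundance"] > 0 else 0
        presence.append({**keys, "present": present})
        if present:
            # Zero abundances are excluded here, so the log is always defined.
            log_abundance.append(
                {**keys, "log_abundance": math.log(m["abundance"])}
            )
    return presence, log_abundance


# Toy measurements for one taxon at three Sites of one Study.
measurements = [
    {"study": "S1", "site": "a", "taxon": "sp1", "abundance": 4.0},
    {"study": "S1", "site": "b", "taxon": "sp1", "abundance": 0.0},
    {"study": "S1", "site": "c", "taxon": "sp1", "abundance": 1.0},
]
presence, log_abundance = two_part_responses(measurements)
```

Keeping the Study identifier on every row is what later allows among-Study differences to be handled as a grouping (random) effect.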
The PREDICTS database continues to increase in size and currently contains a further 22 Studies with embargo dates that prevent their inclusion in this release. We intend to publish occasional updates to make these data freely available. We have also received a number of further offers of datasets that we hope to incorporate into the database and include in future releases. There are three priority categories of data that we are still seeking actively: bees from outside Western Europe; soil invertebrates and fungi; and geographic islands. The current database focuses entirely on spatial "control-impact" comparisons. A follow-on project that has recently begun focuses instead on temporal comparisons, collating data from "before-after" and (especially) "before-after-control-impact" studies of the effects of land-use change on terrestrial assemblages. We are therefore seeking datasets, linked to peer-reviewed publications, of comparable species-level surveys conducted at each sampling location, with temporal changes in land use and/or land-use intensity. If corresponding authors of such papers wish to offer their data, please complete our online form, available at www.predicts.org.uk/pages/contribute.html. As with PREDICTS, the new project will seek to make its data freely available.