The PREDICTS database: a global database of how local terrestrial biodiversity responds to human impacts

Biodiversity continues to decline in the face of increasing anthropogenic pressures such as habitat destruction, exploitation, pollution and introduction of alien species. Existing global databases of species’ threat status or population time series are dominated by charismatic species. The collation of datasets with broad taxonomic and biogeographic extents, and that support computation of a range of biodiversity indicators, is necessary to enable better understanding of historical declines and to project – and avert – future declines. We describe and assess a new database of more than 1.6 million samples from 78 countries representing over 28,000 species, collated from existing spatial comparisons of local-scale biodiversity exposed to different intensities and types of anthropogenic pressures, from terrestrial sites around the world. The database contains measurements taken in 208 (of 814) ecoregions, 13 (of 14) biomes, 25 (of 35) biodiversity hotspots and 16 (of 17) megadiverse countries. The database contains more than 1% of the total number of all species described, and more than 1% of the described species within many taxonomic groups – including flowering plants, gymnosperms, birds, mammals, reptiles, amphibians, beetles, lepidopterans and hymenopterans. The dataset, which is still being added to, is therefore already considerably larger and more representative than those used by previous quantitative models of biodiversity trends and responses. The database is being assembled as part of the PREDICTS project (Projecting Responses of Ecological Diversity In Changing Terrestrial Systems – http://www.predicts.org.uk). We make site-level summary data available alongside this article. The full database will be publicly available in 2015.


Funding Information
The PREDICTS project was supported by the U.K. Natural Environment Research Council (Grant Number NE/J011193/1) and is a contribution from the Imperial College Grand Challenges in Ecosystems and the Environment initiative. Adriana De Palma was supported by the U.K. Biotechnology and

Introduction
Despite the commitment made by the Parties to the Convention on Biological Diversity (CBD) to reduce the rate of biodiversity loss by 2010, global biodiversity indicators show continued decline at steady or accelerating rates, while the pressures behind the decline are steady or intensifying (Mace et al. 2010). Evaluations of progress toward the CBD's 2010 target highlighted the need for datasets with broader taxonomic and geographic coverage than existing ones (Walpole et al. 2009; Jones et al. 2011). Taxonomic breadth is needed because species' ability to tolerate human impacts (destruction, degradation and fragmentation of habitats, the reduction of individual survival and fecundity through exploitation, pollution and introduction of alien species) varies among major taxonomic groups (Vié et al. 2009). For instance, the proportion of species listed as threatened in the IUCN Red List is much higher in amphibians than in birds (International Union for Conservation of Nature 2013). Geographic breadth is needed because human impacts show strong spatial variation: most of Western Europe has long been dominated by human land use, for example, whereas much of the Amazon basin is still close to a natural state (Ellis et al. 2010). Thus, in the absence of broad coverage, any pattern seen in a dataset is prone to reflect the choice of taxa and region as much as true global patterns and trends.
The most direct way to capture the effects of human activities on biodiversity is by analysis of time-series data from ecological communities, assemblages or populations, relating changes in biodiversity to changes in human activity (Vačkář 2012). However, long-term data suitable for such modeling have limited geographic and taxonomic coverage, and often record only the presence or absence of species (e.g., Dornelas et al. 2013). Time-series data are also seldom linked to site-level information on drivers of change, making it hard to use such data to model biodiversity responses or to project responses into the future. Ecologists have therefore more often analyzed spatial comparisons among sites that differ in the human impacts they face. Although the underlying assumption that biotic differences among sites are caused by human impacts has been criticized (e.g., Johnson and Miyanishi 2008; Pfeifer et al. 2014), it is more likely to be reasonable when the sites being compared are surveyed in the same way, when they are well matched in terms of other potentially important variables (e.g., Blois et al. 2013; Pfeifer et al. 2014), when analyses focus on community-level summaries rather than individual species (e.g., Algar et al. 2009), and when the spatial and temporal variations being considered are similar in magnitude (Blois et al. 2013). Collations of well-matched site surveys therefore offer the possibility of analyzing how biodiversity is responding to human impacts without losing taxonomic and geographic breadth.
Openness of data is a further important consideration. The reproducibility and transparency that open data can confer offer benefits to all areas of scientific research, and are particularly important to research that is potentially relevant to policy (Reichman et al. 2011). Transparency has already been highlighted as crucial to the credibility of biodiversity indicators and models (e.g., UNEP-WCMC 2009; Feld et al. 2010; Heink and Kowarik 2010), but the datasets underpinning previous policy-relevant analyses have not always been made publicly available.
We present a new database that collates published, in-press and other quality-assured spatial comparisons of community composition and site-level biodiversity from terrestrial sites around the world. The underlying data are made up of abundance, presence/absence and species-richness measures of a wide range of taxa that face many different anthropogenic pressures. As of March 2014, the dataset contains more than 1.6 million samples from 78 countries representing over 28,000 species. The dataset, which is still being added to, is being assembled as part of the PREDICTS project (Projecting Responses of Ecological Diversity In Changing Terrestrial Systems; http://www.predicts.org.uk), the primary purpose of which is to model and project how biodiversity in terrestrial communities responds to human activity. The dataset is already considerably larger and more representative than those used in existing quantitative models of biodiversity trends such as the Living Planet Index (WWF International 2012) and GLOBIO3 (Alkemade et al. 2009).
In this paper we introduce the database, describe in detail how it was collated, validated and curated, and assess its taxonomic, geographic and temporal coverage. We make available a summary dataset that contains, for each sampling location, the predominant land use, land-use intensity, type of habitat fragmentation, geographic coordinates, sampling dates, country, biogeographic realm, ecoregion, biome, biodiversity hotspot, taxonomic group studied and the number of measurements taken. The full dataset constitutes a large evidence base for the analysis of:
• The responses of biodiversity to human impacts for different countries, biomes and major taxonomic groups;
• The differing responses within and outside protected areas;
• How traits such as body size, range size and ecological specialism mediate responses; and
• How human impacts alter community composition.
The summary dataset permits analysis of geographic and taxonomic variation in study size and design. The complete database, which will be made freely available at the end of the current phase of the project in 2015, will be of use to all researchers interested in producing models of how biodiversity responds to human pressures.

Criteria for inclusion
We considered only data that met all of the following criteria:
• Data are published, in press or were collected using a published methodology;
• The paper or report presents data about the effect of one or more human activities on one or more named taxa, and the degree of human activity differed among sampling locations and/or times;
• Some measure of overall biodiversity, or of the abundance or occurrence of the named taxa, was made at two or more sampling locations and/or times;
• Measurements within each data source were taken using the same sampling procedure, possibly with variation in sampling effort, at each site and time;
• The paper reported, or authors subsequently provided, geographical coordinates for the sites sampled.
One of the modeling approaches used by PREDICTS is to relate diversity measurements to remotely sensed data, specifically those gathered by NASA's Moderate Resolution Imaging Spectroradiometer (MODIS) instruments (Justice et al. 1998). MODIS data are available from early 2000 onwards so, after a short initial data collation stage, we additionally required that diversity sampling had been completed after the beginning of 2000.
Where possible, we also obtained the following (see Site characteristics, below, for more details):
• The identities of the taxa sampled, ideally resolved to species level;
• The date(s) on which each measurement was taken;
• The area of the habitat patch that encompassed each site;
• The maximum linear extent sampled at the site;
• An indication of the land use at each site, e.g. primary, secondary, cropland, pasture;
• Indications of how intensively each site was used by people;
• Descriptions of any transects used in sampling (start point, end point, direction, etc.);
• Other information about each site that might be relevant to modeling responses of biodiversity to human activity, such as any pressures known to be acting on the site, descriptions of agriculture taking place and, for spatially blocked designs, which block each site was in.

Searches
We collated data by running sub-projects that investigated different regions, taxonomic groups or overlapping anthropogenic pressures: some focused on particular taxa (e.g., bees), threatening processes (e.g., habitat fragmentation, urbanization), land-cover classes (e.g., comparing primary, secondary and plantation tropical forests), or regions (e.g., Colombia). We introduced the project and requested data at conferences and in journals (Newbold et al. 2012; Hudson et al. 2013). After the first six months of broad searching, we increasingly targeted efforts toward under-represented taxa, habitat types, biomes and regions. In addition to articles written in English, we also considered those written in Mandarin, Spanish and Portuguese, languages in which one or more of our data compilers were proficient.

Data collection
To maximize consistency in how incoming data were treated, we developed customized metadata and data capture tools (a PDF form and a structured Excel file) together with detailed definitions and instructions on their usage. The PDF form was used to capture bibliographic information, corresponding author contact details and metadata such as the country or countries in which data were collected, the number of taxa sampled, the number of sampling locations and the approximate geographical center(s) of the study area(s). The Excel file was used to capture details of each sampling site and the diversity measurements themselves. The PDF form and Excel file are available in Supplementary Information. We wrote software that comprehensively validates pairs of PDF and Excel files for consistency; details are in the "Database" section.
Most papers that we considered did not publish all the information that we required; in particular, site coordinates and species names were frequently not published. We contacted authors for these data and to request permission to include their contributed data in the PREDICTS database. We used the Insightly customer relationship management application (https://www.insightly.com/) to manage contact with authors.

Structure of data
We structured data into Data Sources, Studies, and Sites. The highest level of organization is the Data Source. A Data Source typically represents data from a single published paper, although in some cases the data were taken from more than one paper, from a non-governmental organization report or from a PhD or MSc thesis. A Data Source contains one or more Studies. A Study contains two or more Sites, a list of taxa that were sampled and a site-by-species matrix of observations (e.g., presence/absence or abundance). All diversity measurements within a Study must have been collected using the same sampling method. For example, a paper might present, for the same set of Sites, data from pitfall traps and from Malaise traps. We would structure these data into a single Data Source containing two Studies, one for each trapping technique. It is therefore reasonable to directly compare observations within a Study but not, because of methodological differences, among Studies. Sometimes, the data presented in a paper were aggregates of data from multiple sampling methods. In these cases, provided that the same set of sampling methods was applied at each Site, we placed the data in a single Study.
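The hierarchy described above can be sketched as a small set of Python classes. This is only an illustration of the Data Source / Study / Site nesting and its two cardinality rules; the class and field names are assumptions made for the example and are not the PREDICTS database schema (which is shown in Figure S3).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Site:
    name: str


@dataclass
class Study:
    sampling_method: str   # one sampling method (or fixed set of methods) per Study
    sites: List[Site]
    taxa: List[str]
    # observations[site name][taxon] -> measurement (hypothetical layout)
    observations: Dict[str, Dict[str, float]] = field(default_factory=dict)

    def __post_init__(self):
        if len(self.sites) < 2:
            raise ValueError("a Study must contain two or more Sites")


@dataclass
class DataSource:
    reference: str
    studies: List[Study]

    def __post_init__(self):
        if not self.studies:
            raise ValueError("a Data Source must contain one or more Studies")


# The pitfall/Malaise example: one Data Source, two Studies over the same Sites.
sites = [Site("Site 1"), Site("Site 2")]
source = DataSource(
    reference="hypothetical paper",
    studies=[
        Study("pitfall traps", sites, ["Taxon A"]),
        Study("Malaise traps", sites, ["Taxon B"]),
    ],
)
```

Keeping the sampling method on the Study makes the comparability rule explicit: observations are compared within a Study, never across Studies.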
We classified the diversity observations as abundance, occurrence or species richness. Some of the site-by-species matrices that we received contained empty cells, which we interpreted as follows: (1) where the filled-in values in the matrix were all non-zero, we interpreted blanks as zeros or (2) where some of the values in the matrix were zero, we took empty cells as an indication that the taxa concerned were not looked for at those Sites, and interpreted empty cells as missing values.
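The two rules for empty cells can be written as a short function. This is a minimal sketch, assuming the site-by-species matrix is held as a list of rows with `None` marking an empty cell; the function name and representation are illustrative.

```python
from typing import List, Optional

Matrix = List[List[Optional[float]]]


def interpret_blanks(matrix: Matrix) -> Matrix:
    """Apply the empty-cell rules; None marks an empty cell."""
    filled = [value for row in matrix for value in row if value is not None]
    if any(value == 0 for value in filled):
        # Rule 2: explicit zeros exist, so blanks stay as missing values.
        return [list(row) for row in matrix]
    # Rule 1: every filled-in value is non-zero, so blanks are read as zeros.
    return [[0 if value is None else value for value in row] for row in matrix]
```

For example, `interpret_blanks([[3, None], [None, 2]])` fills both blanks with zeros, whereas `interpret_blanks([[3, None], [0, 2]])` leaves the blank as a missing value.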
Where possible, we recorded the sampling effort expended at each Site and allowed the units of sampling effort to vary among Studies. For example, if transects had been used, the (Study-level) sampling effort units might be meters or kilometers and the (Site-level) sampling efforts might be the length of the transects. If pitfall traps had been used, the (Study-level) sampling effort units might be "number of trap nights" and the (Site-level) sampling efforts might be the number of traps used multiplied by the number of nights that sampling took place. Where possible, we also recorded an estimate of the maximum linear extent encompassed by the sampling at each Site: the distance covered by a transect, the distance between two pitfall traps or the greatest linear extent of a more complex sampling design (see Figure S1 in Supplementary Information for details).

Site characteristics
We recorded each Site's coordinates as latitude and longitude (WGS84 datum), converting where necessary from local grid-based coordinate systems. Where precise coordinates for Sites were not available, we georeferenced them from maps or schemes available from the published sources or provided by authors. We converted each map to a semi-transparent image that was georeferenced using either ArcGIS (Environmental Systems Research Institute (ESRI) 2011) or Google Earth (http://www.google.co.uk/intl/en_uk/earth/), by positioning and resizing the image on the top of ArcGIS Online World Imagery or Google Maps until we achieved the best possible match of mapped geographical features with the base map. We then obtained geographic coordinates using geographic information systems (GIS) for each Site center or point location. We also recorded authors' descriptions of the habitat at each Site and of any transects walked.
For each Site we recorded the dates during which sampling took place. Not all authors presented precise sampling dates; some gave them to the nearest month or year. We therefore recorded the earliest possible start date, the latest possible end date and the resolution of the dates that were given to us. Where dates were given to the nearest month or year, we recorded the start and end dates as the earliest and latest possible day, respectively. For example, if the authors reported that sampling took place between June and August of 2007, we recorded the date resolution as "month," the start of sampling as June 1, 2007 and end of sampling as August 31, 2007. This scheme meant that we could store sampling dates using regular database structures (which require that the year, month, and day are all present), while retaining information about the precision of sampling dates that were given to us.
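The date scheme can be illustrated with a few lines of Python. This is a sketch under the assumption that imprecise dates arrive as a year with an optional month and day; the function names and the record layout are hypothetical.

```python
import calendar
from datetime import date


def earliest_day(year, month=None, day=None):
    """Earliest calendar day consistent with a possibly imprecise date."""
    return date(year, month or 1, day or 1)


def latest_day(year, month=None, day=None):
    """Latest calendar day consistent with a possibly imprecise date."""
    if month is None:
        return date(year, 12, 31)
    if day is None:
        # monthrange returns (weekday of first day, number of days in month)
        return date(year, month, calendar.monthrange(year, month)[1])
    return date(year, month, day)


# "Sampling took place between June and August of 2007"
record = {
    "date_resolution": "month",
    "start": earliest_day(2007, 6),   # 2007-06-01
    "end": latest_day(2007, 8),       # 2007-08-31
}
```

Storing the resolution alongside the two complete dates keeps the precision of the original report recoverable.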
We assigned classifications of predominant land use and land-use intensity to each Site. Because of PREDICTS' aim of making projections about the future of biodiversity under alternative scenarios, our land-use classification was based on five classes defined in the Representative Concentration Pathways harmonized land-use estimates (Hurtt et al. 2011): primary vegetation, secondary vegetation, cropland, pasture and urban, with the addition of plantation forest to account for the likely differences in the biodiversity of natural forest and plantation forest (e.g., Gibson et al. 2011) and a "Cannot decide" category for when insufficient information was available. Previous work has suggested that both the biodiversity and community composition differ strongly between sites in secondary vegetation of different maturity (Barlow et al. 2007); therefore, we subdivided secondary vegetation by stage (young, intermediate, mature and, when information was lacking, indeterminate) by considering vegetation structure (not diversity). We used authors' descriptions of Sites, when provided, to classify land-use intensity as minimal, light or intense, depending on the land use in question, again with "Cannot decide" as an option for when information was lacking. A detailed description of how classifications are assigned is in the Supplementary section "Notes on assigning predominant land use and use intensity" and Tables S1 and S2.
Given the likely importance of these classifications as explanatory variables in modeling responses of biodiversity to human impacts, we conducted a blind repeatability study in which one person (the last author, who had not originally scored any Sites) rescored both predominant land use and use intensity for 100 Sites chosen at random. Exact matches of predominant land use were achieved for 71 Sites; 15 of the remaining 29 were "near misses" specified in advance (i.e., primary vegetation versus mature secondary; adjacent stages of secondary vegetation; indeterminate secondary versus any other secondary stage; and cannot decide versus any other class). Cohen's kappa provides a measure of inter-rater agreement, ranging from 0 (agreement no better than random) to 1 (perfect agreement). For predominant land use, Cohen's kappa = 0.662 (if only exact agreement gets credit) or 0.721 (if near misses are scored as 0.5); values in the range 0.6-0.8 indicate "substantial agreement" (Landis and Koch 1977), indicating that our categories, criteria and training are sufficiently clear for users to score Sites reliably. Moving to use intensity, we found exact agreement for 57 of 100 Sites, with 39 of the remaining 43 being "near misses" (adjacent intensity classes, or cannot decide versus any other class), giving Cohen's kappa values of 0.363 (exact agreement only) or 0.385 (near misses scored as 0.5), representing "fair agreement" (Landis and Koch 1977); agreement is slightly higher among the 71 Sites for which predominant land use was matched (exact agreement in 44 of 71 Sites, kappa = 0.428, indicating "moderate agreement": Landis and Koch 1977).
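For readers unfamiliar with the statistic, the sketch below computes Cohen's kappa as (observed agreement minus chance agreement) divided by (1 minus chance agreement). How near misses entered the calculation is not spelled out here, so the partial-credit version is an assumption: it uses the standard weighted-kappa form, in which the same credit function is applied to both the observed and the chance-expected agreement. The toy scores are invented for illustration and are not the repeatability data.

```python
from collections import Counter


def exact_credit(a, b):
    return 1.0 if a == b else 0.0


def cohens_kappa(rater_a, rater_b, credit=exact_credit):
    """Cohen's kappa for two raters; credit(a, b) in [0, 1] scores a pair of labels."""
    n = len(rater_a)
    observed = sum(credit(a, b) for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected if the two raters scored independently.
    expected = sum(
        freq_a[a] * freq_b[b] * credit(a, b) for a in freq_a for b in freq_b
    ) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


def near_miss_credit(a, b):
    """Assumed scoring: 1 for a match, 0.5 for one pre-specified near miss."""
    if a == b:
        return 1.0
    if {a, b} == {"primary", "mature secondary"}:
        return 0.5
    return 0.0


first = ["primary", "mature secondary", "cropland", "cropland"]
second = ["mature secondary", "mature secondary", "cropland", "pasture"]
kappa_exact = cohens_kappa(first, second)                   # 1/3
kappa_near = cohens_kappa(first, second, near_miss_credit)  # 5/11
```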
Where known, we recorded the number of years since conversion to the present predominant land use. If the Site's previous land use was primary habitat, we recorded the number of years since it was converted to the current land use. If the habitat was converted to secondary forest (clear-felled forest or abandoned agricultural land), we recorded the number of years since it was converted/clear-felled/abandoned. Where ranges were reported, we used mid-range values; if papers reported times as "greater than N years" or "at least N years," we recorded a value of N × 1.25. Based on previous work (Wilcove et al. 1986; Dickman 1987), we assigned one of five habitat fragmentation classes: (1) well within unfragmented habitat, (2) within unfragmented habitat but at or near its edge, (3) within a remnant patch (perhaps at its edge) that is surrounded by other habitats, (4) representative part of a fragmented landscape and (5) part of the matrix surrounding remnant patches. These are described and illustrated in Table S3 and Figure S2. We also recorded the area of the patch of predominant habitat within which the Site was located, where this information was available. We recorded a value of −1 if the patch area was unknown but large, extending far beyond the sampled Site.
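The rules for years since conversion amount to a small piece of arithmetic, sketched below. The function name and keyword arguments are illustrative assumptions about how a reported value might be passed in.

```python
def years_since_conversion(low=None, high=None, at_least=None):
    """Return the value to record for a reported time since conversion.

    low/high : bounds of a reported range (or low alone for a single value)
    at_least : N from "greater than N years" or "at least N years"
    """
    if at_least is not None:
        return at_least * 1.25
    if low is not None and high is not None:
        return (low + high) / 2   # mid-range value
    return low                    # single reported value, or None if unknown


years_since_conversion(low=10, high=20)   # mid-range: 15.0
years_since_conversion(at_least=20)       # 20 x 1.25 = 25.0
```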

Database
Completed PDF and Excel files were uploaded to a PostgreSQL 9.1 database (PostgreSQL Global Development Group, http://www.postgresql.org/) with the PostGIS 2.0.1 spatial extension (Refractions Research Inc, http://www.postgis.net/). The database schema is shown in Figure S3.
We wrote software in the Python programming language (http://www.python.org/) to perform comprehensive data validation; files were fully validated before their data were added to the database. Examples of lower level invalid data included missing values for mandatory fields, a negative time since conversion, a latitude given as 1°61', a date given as 32nd January, duplicated Site names and duplicated taxon names. Commonly encountered higher level problems included mistakes in coordinates, such as latitude and longitude swapped, decimal latitude and longitude incorrectly assembled from DD/MM/SS components, and direction (north/south, east/west) swapped round. These mistakes typically resulted in coordinates that plotted in countries not matching those given in the metadata and/or out to sea. The former was detected automatically by validation software, which required that the GIS-matched country for each Site (see "Biogeographical coverage" below) matched the country name entered in the PDF file for the Study; where a Study spanned several countries, we set the country name to "Multiple countries." We visually inspected all Site locations on a map and compared them to maps presented in the source article or given to us by the authors, catching coordinates that were mistakenly out to sea and providing a check of accuracy. Our database links each Data Source to the relevant record in our Insightly contact management database. This allows us to trace each datum back to the email that granted permission for us to include it in our database.
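The lower level checks listed above are straightforward to express in Python. The following is a hedged sketch of a few of them, not the PREDICTS validation software: the function names, argument layout and error messages are all assumptions made for the example.

```python
from collections import Counter
from datetime import date


def dms_to_decimal(degrees, minutes=0.0, seconds=0.0, direction="N"):
    """Convert degrees/minutes/seconds to decimal degrees, rejecting impossible parts."""
    if direction not in ("N", "S", "E", "W"):
        raise ValueError(f"unknown direction: {direction!r}")
    if degrees < 0 or not (0 <= minutes < 60) or not (0 <= seconds < 60):
        raise ValueError("degrees must be >= 0; minutes and seconds must be in [0, 60)")
    value = degrees + minutes / 60 + seconds / 3600
    limit = 90 if direction in ("N", "S") else 180
    if value > limit:
        raise ValueError(f"coordinate {value} exceeds {limit} degrees")
    return -value if direction in ("S", "W") else value


def duplicated(names):
    """Return the names that occur more than once, in sorted order."""
    return sorted(name for name, count in Counter(names).items() if count > 1)


def low_level_problems(site_names, years_since_conversion, sampling_start):
    """Collect problems; sampling_start is a (year, month, day) tuple."""
    problems = []
    for name in duplicated(site_names):
        problems.append(f"duplicated Site name: {name}")
    if years_since_conversion is not None and years_since_conversion < 0:
        problems.append("negative time since conversion")
    try:
        date(*sampling_start)   # raises ValueError for e.g. 32nd January
    except ValueError:
        problems.append(f"impossible date: {sampling_start}")
    return problems
```

With these definitions, `dms_to_decimal(1, 61)` raises a `ValueError` because 61 minutes is not a valid component, and `low_level_problems(["A", "B", "A"], -2, (2014, 1, 32))` reports all three problems.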

Biogeographical coverage
In order to assess the data's geographical and biogeographical coverage, we matched each Site's coordinates to GIS datasets that were loaded into our database. Global GIS layers appear coarse at local scales and we anticipated that Sites on coasts or on islands could fall slightly outside the relevant polygons. Our software therefore matched Sites to the nearest ecoregion and nearest country polygons, and recorded the distance in meters to that polygon, with a value of zero for Sites that fell within a polygon; we reviewed Sites with non-zero distances. The software precisely matched Sites to hotspot polygons. The relative coarseness of GIS polygons might result in small errors in our assessments of coverage (i.e., at borders between biomes, ecoregions and countries, and at the edges of hotspots); we expect that these errors should be small in number and unbiased.
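The nearest-polygon rule can be illustrated with plain planar geometry. This sketch is a pure-Python stand-in, not the spatial-database matching used for the real data: it assumes simple polygons given as lists of (x, y) vertices in a projected coordinate system, so distances come out in the units of those coordinates. All names are hypothetical.

```python
import math


def point_in_polygon(x, y, polygon):
    """Ray-casting test for a simple polygon given as [(x, y), ...]."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def distance_to_segment(x, y, x1, y1, x2, y2):
    dx, dy = x2 - x1, y2 - y1
    if dx == 0 and dy == 0:
        return math.hypot(x - x1, y - y1)
    # Project the point onto the segment and clamp to its end points.
    t = ((x - x1) * dx + (y - y1) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(x - (x1 + t * dx), y - (y1 + t * dy))


def distance_to_polygon(x, y, polygon):
    """Zero for points inside the polygon, else distance to the nearest edge."""
    if point_in_polygon(x, y, polygon):
        return 0.0
    n = len(polygon)
    return min(
        distance_to_segment(x, y, *polygon[i], *polygon[(i + 1) % n])
        for i in range(n)
    )


def nearest_polygon(x, y, polygons):
    """Return (name, distance) of the closest polygon in a {name: vertices} dict."""
    return min(
        ((name, distance_to_polygon(x, y, poly)) for name, poly in polygons.items()),
        key=lambda pair: pair[1],
    )


regions = {
    "A": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "B": [(5, 0), (7, 0), (7, 2), (5, 2)],
}
inside_a = nearest_polygon(1, 1, regions)     # ("A", 0.0)
off_coast_b = nearest_polygon(4, 1, regions)  # ("B", 1.0): flagged for review
```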
We also estimated the yearly value of total net primary production (TNPP) for biomes and five-degree latitudinal belts, using 2010 spatial (0.1-degree resolution) monthly datasets "NPP - Net Primary Productivity 1 month - Terra/MODIS" compiled and distributed by NASA Earth Observations (http://neo.sci.gsfc.nasa.gov/view.php?datasetId=MOD17A2_M_PSN&year=2010). We used the NPP values (monthly average assimilation, measured in grams of carbon per square meter per day) to estimate monthly and annual NPP. We then derived TNPP values by multiplying NPP values by the total terrestrial area for that ecoregion/latitudinal belt. We assessed the representativeness of land use and land-use intensity combinations by comparing the proportion of Sites in each combination to a corresponding estimate of the proportion of total terrestrial area for 2005, computed using land-use data from the HYDE historical reconstruction (Hurtt et al. 2011) and intensity data from the Global Land Systems dataset (van Asselen and Verburg 2012).
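The arithmetic behind the TNPP estimate can be sketched as follows. The conversion from monthly mean daily rates to annual totals is an assumption (each monthly mean is multiplied by the number of days in that month and the products are summed); the function names and the example numbers are illustrative only.

```python
import calendar


def annual_npp(mean_daily_npp_by_month, year=2010):
    """Annual NPP (g C per square meter per year) from twelve monthly means
    expressed in g C per square meter per day."""
    return sum(
        npp * calendar.monthrange(year, month)[1]   # days in that month
        for month, npp in enumerate(mean_daily_npp_by_month, start=1)
    )


def total_npp(annual_npp_per_m2, terrestrial_area_m2):
    """TNPP (g C per year) for a region of the given terrestrial area."""
    return annual_npp_per_m2 * terrestrial_area_m2


# A constant 1 g C per square meter per day over 2010 gives 365 g C per square meter per year.
flat_year = annual_npp([1.0] * 12)
```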

Taxonomic names and classification
We wanted to identify taxa in our database as precisely as possible and to place them in higher level groups, which required relating the taxonomic names presented in our datasets to a stable and authoritative resource for nomenclature. We used the Catalogue of Life (http://www.catalogueoflife.org/) for three main reasons. First, it provides broad taxonomic coverage. Second, Catalogue of Life publishes Annual Checklists. Third, Catalogue of Life provides a single accepted taxonomic classification for each species that is represented. Not all databases provide this guarantee; for example, Encyclopedia of Life (http://www.eol.org/) provides zero, one or more taxonomic classifications for each represented species. We therefore matched taxonomic names to the Catalogue of Life 2013 Annual Checklist (Roskov et al. 2013, henceforth COL).
There was large variation in the form of the taxonomic names presented in the source datasets, for example:
• A Latin binomial, with and without authority, year and other information;
• A generic name, possibly with a number to distinguish morphospecies from congenerics in the same Study (e.g., "Bracon sp. 1");
• The name of a higher taxonomic rank such as family, order, class;
• A common name (usually for birds), sometimes not in English;
• A textual description, code, letter or number with no further information except an indication of some aspect of higher taxonomy.
Most names were Latin binomials, generic names or morphospecies names. Few binomials were associated with an authority; even when they were, time constraints meant that it would not have been practical to make use of this information. Many names contained typographical errors.
We represented each taxon by three different names: "Name entered," "Parsed name," and "COL query name." "Name entered" was the name assigned to the taxon in the dataset provided to us by the investigators who collected the data. We used the Global Names Architecture's biodiversity package (https://github.com/GlobalNamesArchitecture/biodiversity) to parse "Name entered" and extract a putative Latin binomial, which we assigned to both "Parsed name" and "COL query name." For example, the result of parsing the name "Ancistrocerus trifasciatus Müll." was "Ancistrocerus trifasciatus." The parser treated all names as if they were scientific taxonomic names, so the result of parsing common names was not sensible: e.g. "Black and White Casqued Hornbill" was parsed as "Black and." We expected that common names would be rare; where they did arise, they were detected and corrected as part of our curation process, which is described below. Other examples of the parser's behavior are shown in Table S4.
We queried COL with each "COL query name" and stored the matching COL ID, taxonomic name, rank and classification (kingdom, phylum, class, order, family, genus, species and infraspecies). We assumed that the original authors gave the most authoritative identification of species. Therefore, when a COL search returned more than one result, and the results were made up of one accepted name together with one or more synonyms and/or ambiguous synonyms and/or common names and/or misapplied names, our software recorded the accepted name. For example, COL returns three results for the salticid spider Euophrys frontalis: one accepted name and two synonyms.
When a COL search returned more than one result, and the results included zero or two or more accepted names, we used the lowest level of classification common to all results. For example, COL lists Notiophilus as an accepted genus in two beetle families: Carabidae and Erirhinidae. This is a violation of the rules of nomenclature, but taxonomic databases are imperfect and such violations are to be expected. In this case, the lowest rank common to both families is the order Coleoptera.
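The two rules for handling multiple search results can be sketched as a single function. The record layout below (a `status` string and a `classification` dictionary keyed by rank) is a hypothetical simplification and is not the format returned by COL.

```python
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def resolve(results):
    """Choose a classification from several search results.

    One accepted name among the results: use it.
    Zero or several accepted names: keep only the ranks shared by all results.
    """
    accepted = [r for r in results if r["status"] == "accepted name"]
    if len(accepted) == 1:
        return accepted[0]["classification"]
    common = {}
    for rank in RANKS:
        values = {r["classification"].get(rank) for r in results}
        if len(values) != 1 or None in values:
            break   # results diverge (or lack this rank) from here downward
        common[rank] = values.pop()
    return common


def beetle(family):
    return {
        "status": "accepted name",
        "classification": {
            "kingdom": "Animalia", "phylum": "Arthropoda", "class": "Insecta",
            "order": "Coleoptera", "family": family, "genus": "Notiophilus",
        },
    }


# Notiophilus is listed as an accepted genus in two families.
notiophilus = resolve([beetle("Carabidae"), beetle("Erirhinidae")])
lowest_rank = list(notiophilus)[-1]   # "order"
```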

Curating names
We reviewed:
• Taxa that had no matching COL record;
• Taxa that had a result at a rank higher than species and a "Name entered" that was either a Latin binomial or a common name;
• Cases where the same "Parsed name" in different Studies linked to different COL records;
• Studies for which the lowest common taxonomic rank did not seem appropriate; for example, a Study of birds should have a lowest common taxonomic rank of class Aves or lower rank within Aves.
Where a change was required, we altered "COL query name", recording the reason why the change was made, and reran the COL query. Sometimes, this curation step had to be repeated multiple times. In all cases, we retained the names given to us by the authors, in the "Name entered" and "Parsed name" columns.
Typographical errors were the most common cause for failed COL searches; for example, the hymenopteran Diphaglossa gayi was given as Diphaglosa gayi. Such errors were detected by visual inspection and by performing manual searches on services that perform fuzzy matching and suggest alternatives, such as Google and Encyclopedia of Life. In cases where "Parsed name" was a binomial without typographical errors but that was not recognized by COL, we searched web sites such as Encyclopedia of Life and The Plant List (http://www.theplantlist.org/) for synonyms and alternative spellings and queried COL with the results. Where there were no synonyms or where COL did not recognize the synonyms, we searched COL for just the genus. If the genus was not recognized by COL, we used the same web services to obtain higher level ranks, until we found a rank that COL recognized.
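Fuzzy matching of the kind provided by those web services can be approximated locally. The sketch below is an illustration only and was not part of the curation workflow described above, which relied on manual searches: it uses `difflib.get_close_matches` from the Python standard library against a small, hypothetical list of known names.

```python
import difflib

# Hypothetical reference list; in practice this would be a much larger checklist.
known_names = ["Diphaglossa gayi", "Euophrys frontalis", "Falco sparverius"]


def suggest(name, candidates, cutoff=0.8):
    """Return up to three candidate spellings that closely resemble the name."""
    return difflib.get_close_matches(name, candidates, n=3, cutoff=cutoff)


suggestions = suggest("Diphaglosa gayi", known_names)
```

A suggestion produced this way would still need a human check before being used as the "COL query name", since a close spelling is not necessarily the same taxon.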
Some names matched COL records in two different kingdoms. For example, Bellardia, Dracaena and Ficus are all genera of plants and of animals. In such cases, we instructed our software to consider only COL records from the expected kingdom. We also constrained results when a name matched COL records in two different branches within the same kingdom; for example, considering the Notiophilus example given above: if the Study was of carabid beetles, we would instruct our software to consider only results within family Carabidae.
COL allows searches for common names. Where "Name entered" was a common name that was not recognized by COL, we searched web sites as described above and set "COL query name" to the appropriate Latin binomial.
Some studies of birds presented additional complications. Some authors presented taxon names as four-letter codes that are contractions of common names (e.g., AMKE was used by Chapman and Reich (2007) to indicate Falco sparverius, American kestrel) or of Latin binomials (e.g., ACBA was used by Shahabuddin and Kumar (2007) to indicate Accipiter badius). Some of these codes are valid taxonomic names in their own right. For example, Shahabuddin and Kumar (2007) used the code TEPA to indicate the passerine Terpsiphone paradisi. However, Tepa is also a genus of Hemiptera. Left uncurated, COL recognized TEPA as the hemipteran genus and the Study consequently had a lowest common taxonomic rank of kingdom Animalia, not of class Aves or a lower rank within Aves, as we would expect. Some codes did not appear on published lists (e.g., http://www.birdpop.org/alphacodes.htm, http://www.pwrc.usgs.gov/bbl/manual/speclist.cfm, http://www.carolinabirdclub.org/bandcodes.html and http://infohost.nmt.edu/~shipman/z/nom/bbs.html) or in the files provided by the authors, because of typographical errors, omissions or incomplete coverage. Fortunately, codes are constructed by following a simple set of rules: the first two letters of the genus and species of binomials, and a slightly more complex method for common names of North American birds (http://infohost.nmt.edu/~shipman/z/nom/bblrules.html). We cautiously reverse-engineered unrecognized codes by following the appropriate rules and then searched lists of birds of the country concerned for possible matches. For example, we deduced from the Wikipedia list of birds of India (http://en.wikipedia.org/wiki/List_of_birds_of_India) that KEZE, used in a study of birds in Rajasthan, northwestern India (Shahabuddin and Kumar 2007), most likely indicates Ketupa zeylonensis. Another problem is that collisions occur: the same code can apply to more than one taxon. For example, PEPT is the accepted code for Atalotriccus pilaris (pale-eyed pygmy tyrant; http://www.birdpop.org/alphacodes.htm), a species that occurs in the Neotropics. The same code was used by the Indian study of Shahabuddin and Kumar (2007) to indicate Pernis ptilorhynchus (crested honey buzzard). We therefore reverse-engineered bird codes on a case-by-case basis. Where a code could represent more than one species, we set "COL query name" as the lowest taxonomic rank common to all matching species.
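The binomial rule and the handling of collisions can be sketched as follows. This is an illustration under stated assumptions, not the procedure's actual implementation (codes were resolved case by case): the function names, the four-name checklist and the simplified lineages (intermediate ranks omitted) are all hypothetical, and the sketch covers only codes derived from Latin binomials, not those derived from common names.

```python
def binomial_code(binomial):
    """Build a four-letter code from the first two letters of genus and species."""
    genus, species = binomial.split()[:2]
    return (genus[:2] + species[:2]).upper()


def candidate_species(code, checklist):
    """Return every binomial in the checklist whose derived code matches."""
    return [name for name in checklist if binomial_code(name) == code.upper()]


def lowest_common_rank(lineages):
    """Return the deepest taxon shared by all lineages (each ordered from kingdom down)."""
    common = None
    for taxa in zip(*lineages):
        if len(set(taxa)) != 1:
            break
        common = taxa[0]
    return common


# Hypothetical extract of a national checklist; the names are those mentioned in the text.
INDIA_CHECKLIST = [
    "Accipiter badius",
    "Ketupa zeylonensis",
    "Pernis ptilorhynchus",
    "Terpsiphone paradisi",
]
print(candidate_species("KEZE", INDIA_CHECKLIST))  # ['Ketupa zeylonensis']

# Simplified lineages for the two readings of PEPT discussed in the text.
PEPT_LINEAGES = [
    ["Animalia", "Chordata", "Aves", "Atalotriccus", "Atalotriccus pilaris"],
    ["Animalia", "Chordata", "Aves", "Pernis", "Pernis ptilorhynchus"],
]
print(lowest_common_rank(PEPT_LINEAGES))  # Aves
```

When a code matches several species, the value returned by lowest_common_rank corresponds to what would be entered as "COL query name".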

Counting the number of species
It was not possible to precisely count the number of species represented in our database because of ambiguity inherent in the taxon names provided with the data. We estimated the number of species as follows. Names with a COL result at either species or infraspecies level were counted once per name. Names with a COL result resolved to higher taxonomic ranks were counted once per Study. To illustrate how this scheme can inflate the estimate, consider the bat genus Eonycteris, which contains three species. Suppose that Study A sampled all three species and that the investigators could distinguish individuals as belonging to three separate species but could not assign them to named species, reporting them as Eonycteris sp. 1, Eonycteris sp. 2 and Eonycteris sp. 3. Study B also sampled all three species of Eonycteris and again reported Eonycteris sp. 1, Eonycteris sp. 2 and Eonycteris sp. 3. We would erroneously consider these taxa to be six different species. We did not attempt to determine how often, if at all, such inflation occurred.
In order to assess the taxonomic coverage of our data, we computed a higher taxonomic grouping for each taxon according to three rules that assign it to an order (rule 1), a class (rule 2) or a phylum (rule 3). For example, the higher taxonomic group of a bee is order Hymenoptera (following rule 1), the higher taxonomic group of a wolf is class Mammalia (rule 2), and the higher taxonomic group of a snail is phylum Mollusca (rule 3). For each higher taxonomic group, we compared the numbers of species in our database to the estimated number of described species presented by Chapman (2009).
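The species-counting scheme described above can be expressed as a short sketch. The record layout (Study identifier, name, COL rank) and the function name are assumptions made for this example; the Eonycteris records reproduce the hypothetical Studies A and B from the text, with one species-level name added to show that it is counted once across Studies.

```python
def estimate_species_count(records):
    """Estimate species from (study_id, name, col_rank) records.

    Species and infraspecies names count once per name; names resolved
    only to a higher rank count once per Study.
    """
    counted = set()
    for study_id, name, col_rank in records:
        if col_rank in ("Species", "Infraspecies"):
            counted.add(("name", name))
        else:
            counted.add(("study", study_id, name))
    return len(counted)


records = [
    ("A", "Eonycteris sp. 1", "Genus"),
    ("A", "Eonycteris sp. 2", "Genus"),
    ("A", "Eonycteris sp. 3", "Genus"),
    ("B", "Eonycteris sp. 1", "Genus"),
    ("B", "Eonycteris sp. 2", "Genus"),
    ("B", "Eonycteris sp. 3", "Genus"),
    ("A", "Eonycteris spelaea", "Species"),
    ("B", "Eonycteris spelaea", "Species"),
]
print(estimate_species_count(records))  # 7: six genus-level names plus one species
```

The six genus-level entries show the inflation discussed above: the same three species sampled in two Studies contribute six to the estimate.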
Some of the higher taxonomic groups that we computed did not directly relate to the groups presented by Chapman (2009) so, in order to compare counts, we computed Magnoliophyta as the sum of Magnoliopsida and Liliopsida; Gymnosperms as the sum of Pinopsida and Gnetopsida; Ferns and allies as …
[Figure: sampling years (1992 to 2012) by biome and by latitude]
For some of our analyses, we related taxonomic names to databases of species' traits. To do this, we synthesized, for each taxon, a "Best guess binomial":
• The COL taxon name if the COL rank was Species;
• The first two words of the COL taxon if the rank was Infraspecies;
• The first two words of "Parsed name" if the rank was neither Species nor Infraspecies and "Parsed name" contained two or more words;
• Empty in other cases.
This scheme meant that even though COL did not recognize all of the Latin binomials that were given to us, we could maximize matches between names in our database and names in the species' trait databases.
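The four "Best guess binomial" rules translate directly into a small function. The sketch below is a literal reading of the rules; the function and argument names are hypothetical and any additional filtering applied in practice is not shown.

```python
def best_guess_binomial(col_name, col_rank, parsed_name):
    """Apply the four 'Best guess binomial' rules in the order listed in the text."""
    if col_rank == "Species":
        return col_name
    if col_rank == "Infraspecies":
        return " ".join(col_name.split()[:2])
    words = parsed_name.split()
    if len(words) >= 2:
        return " ".join(words[:2])
    return ""


# A binomial that COL matched only at genus level keeps the authors' spelling.
print(best_guess_binomial("Diphaglossa", "Genus", "Diphaglosa gayi"))  # Diphaglosa gayi
```

Under the third rule the authors' own binomial is retained even when COL matched only a higher rank, which is what allows matches against trait databases that COL itself would not provide.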

Results
Between March 2012 and March 2014, we collated data from 284 Data Sources, 407 Studies and 13,337 Sites in 78 countries and 208 (of 814) ecoregions (Fig. 1; Table 1). Hotspots together account for just 16% of the world's terrestrial surface, yet 47.67% of our measurements were taken in hotspots. The vast majority of measurements in hotspots were taken in the Sundaland hotspot (Southeast Asia) and the latitudinal band with the most samples is 0° to 5°N (Fig. 2); many of these data come from two studies of higher plants from Indonesia that between them contribute just 284 Sites but over 320,000 samples (Sheil et al. 2002).
Figure 5. Representativeness of predominant land use and land-use intensity classes. Numbers are the percentage of Sites assigned to each combination of land use and intensity. Numbers in brackets and colors are the differences between these and the proportional estimated total terrestrial area of each combination of land use and land-use intensity for 2005, computed from the HYDE (Hurtt et al. 2011) and Global Land Systems datasets (van Asselen and Verburg 2012); no difference is shown for "Urban"/"Light use" because these datasets did not allow us to compute an estimate for this combination. The 12.15% of Sites that could not be assigned a classification for predominant land use and/or land-use intensity are not shown.
… (Kattge et al. 2011). Best guess binomials: the number of unique "Best guess binomials" in the PREDICTS database within that taxonomic group. Attribute database names: the number of unique binomials and trinomials for that attribute in the attribute database. Species matches: the number of "Best guess binomials" that exactly match a record in the attribute database. Genus matches: the number of generic names in the PREDICTS database with a matching record in the attribute database (only for binomials for which there was not a species match). Total matches: sum of species matches and genus matches. We did not match generic names for GBIF range size, IUCN category or CITES appendix because we did not expect these traits to be highly conserved within genera.
The best-represented biomes are "Temperate Broadleaf and Mixed Forests" and "Tropical and Subtropical Moist Broadleaf Forests" (Figs 3, 4). "Flooded Grasslands and Savannas" is the only biome that is unrepresented in our database (Figs 3, 4); although this biome is responsible for only 0.7% of global terrestrial net primary productivity, it is nevertheless ecologically important and will be a priority for future collation efforts. Two biomes, "Tundra" and "Deserts and Xeric Shrublands", are underrepresented relative to their areas. Of the world's 17 megadiverse countries identified by Mittermeier et al. (1997), only Democratic Republic of Congo is not represented (Figure S4). The vast majority of sampling took place after the year 2000 (Fig. 3), reflecting our desire to collate diversity data that can be related to MODIS data, which are available from early 2000 onwards. The database's coverage of realms, biomes, countries, regions and subregions is shown in Supplementary Tables S5-S11. The distribution of Site-level predominant land use and use intensity is different from the distribution of the estimated total terrestrial area in each land-use/land-use intensity combination for 2005 (χ² = 28,243.21, df = 16, P < 2.2 × 10⁻¹⁶; we excluded "Urban"/"Light use" from this test because the HYDE and Global Land Systems datasets did not allow us to compute an estimate for this combination). The main discrepancies are that the database has far fewer than expected Sites that are classified as "Primary habitat"/"Minimal use", "Secondary vegetation"/"Light use" and "Pasture"/"Light use" (Fig. 5). We were unable to assign a classification of predominant land use to 3.34% of Sites and of use intensity to 12.09% of Sites. The most common fragmentation layout was "Representative part of a fragmented landscape" (27.95% of Sites; Table S12), a classification that indicates either that a Site is large enough to encompass multiple habitat types or that the Site is of a particular habitat type that is inherently fragmented and dominates the landscape (e.g., the Site is in an agricultural field and the landscape is composed of many fields). We were unable to assign a fragmentation layout to 15.47% of Sites. We were able to determine the maximum linear extent of sampling for 60.09% of Sites; values range from 0.2 m to 39.15 km, with a median of 120 m (Figure S5).
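The comparison of Site classifications with global land-use areas is a chi-squared goodness-of-fit test: observed Site counts per land-use/intensity combination are compared with the counts expected if Sites were distributed in proportion to terrestrial area. The sketch below shows the calculation of the statistic and degrees of freedom only; the function name is hypothetical and the numbers in the example are invented for illustration, not taken from the database. A test with df = 16, as reported above, corresponds to 17 categories.

```python
import math


def chi_squared_gof(observed_counts, expected_proportions):
    """Return (statistic, degrees of freedom) for a goodness-of-fit test."""
    if len(observed_counts) != len(expected_proportions):
        raise ValueError("observed and expected must have the same length")
    if not math.isclose(sum(expected_proportions), 1.0, rel_tol=1e-9):
        raise ValueError("expected proportions must sum to 1")
    total = sum(observed_counts)
    statistic = 0.0
    for observed, proportion in zip(observed_counts, expected_proportions):
        expected = total * proportion
        statistic += (observed - expected) ** 2 / expected
    return statistic, len(observed_counts) - 1


# Invented example: 100 Sites in two classes expected to be equally common.
statistic, df = chi_squared_gof([60, 40], [0.5, 0.5])
print(statistic, df)  # 4.0 1
```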
The precise sampling days are known for 45.44% of Sites; 42.19% are known to the nearest month and 12.37% to the nearest year. The median sampling duration was 91 days; sampling lasted for 1 day or less at 9.90% of Sites (Figure S6). The area of habitat containing the Site is known for 25.49% of Sites; values are approximately log-normally distributed (median 40,000 square meters; Figure S7). We reviewed all cases of Sites falling outside the GIS polygons for countries (0.82% of Sites; Figure S8) and ecoregions (0.52% of Sites; Figure S9). These Sites were on coasts and/or on islands too small to be included in the GIS dataset in question.
The database contains measurements of approximately 28,735 species (see "Counting the number of species" in Methods): 17,733 animals, 10,201 plants, 800 fungi and 1 protozoan. We were unable to place 97 taxa in a higher taxonomic group because they were not sufficiently well resolved. The database contains more than 1% as many species as have been described within 20 higher taxonomic groups (Fig. 6). Birds are particularly well represented, reflecting the sampling bias in favor of this charismatic group. Our database contains measurements of 2,479 species of birds, 24.81% of those described (Chapman 2009), and 2,368 of these are resolved to either species or infraspecies levels. A total of 228,644 samples, more than 14% of the entire database, are of birds. In contrast, just 397 species of mammals are represented, but even this constitutes 7.24% of described species. Chiroptera (bats) are the best-represented mammalian order with 188 species. Of the 115,000 estimated described species of Hymenoptera, 3,556 (3.09%) are represented in the database, the best representation of an invertebrate group. The hymenopteran family with the most species in the database is Formicidae with 2,060 species. The database contains data for 4,056 species of Coleoptera, 1.07% of described beetles. Carabidae is the best-represented beetle family with 2,060 species. Some higher taxonomic groups have well below 1% representation and, as might be expected, the database has poor coverage of groups for which the majority of species are marine: nematodes, crustaceans and molluscs.

Discussion
The coverage of the PREDICTS dataset illustrates the large number of published articles that are based on local-scale empirical data of the responses of diversity either to a difference in land-use type or along a gradient of land-use intensity or other human pressure. Such data can be used to model spatial responses of local communities to anthropogenic pressures and thus changes over time. This is essential for understanding the impact of biodiversity loss on ecosystem function and ecosystem services, which operate at the local level (Fontaine et al. 2006; Isbell et al. 2011; Cardinale et al. 2012; Hooper et al. 2012). Regardless of scale, no single Study is or could ever be representative, but the sheer number and diversity of Studies means that a collation of these data can provide relatively representative coverage of biodiversity. The majority of Data Sources (271 of 284) come from peer-reviewed publications and all data have used peer-reviewed sampling procedures. There are doubtless very many more published data than we have so far acquired and been given permission to use. For the majority of Data Sources (225), it was necessary to contact the author(s) in order to get more information such as the Site coordinates or the names of the taxa studied: even now that supplementary data are commonplace and often extensive, we usually had to request more detail than had been published.
The database currently lacks Sites in ten biodiversity hotspots and one megadiverse country (Democratic Republic of the Congo). It also has no data from many large tropical or partially tropical countries such as Angola, Tanzania and Zambia. Many countries are underrepresented given their area and/or the distinctiveness of their biota, e.g., Australia, China, Madagascar, New Zealand, Russia and South Africa. We have few data from islands and just 57 Sites from the biogeographic realm of Oceania (Fig. 3 and Table S8): we have not yet directly targeted Oceania or island biota more generally. The database contains no studies of microbial diversity and few of parasites; these are major shortcomings that also apply to other large biodiversity databases such as the Living Planet Index (WWF International 2012), the IUCN Red List (International Union for Conservation of Nature 2013) and BIOFRAG (Pfeifer et al. 2014). Fewer than 50% of the taxa in our database are matched to a Catalogue of Life record with a rank of species or infraspecies (Fig. 6). The quality and coverage of taxonomic databases continues to improve and we hope to improve our database's coverage by making use of new Catalogue of Life checklists as they become available. Improved software would permit the use of fuzzy searches to reduce the current manual work required to curate taxonomic names.
Intersecting our data with datasets of species attributes (Table 2) indicates much greater overlap among large-scale data resources than might be expected simply based on overall numbers of species. This suggests that the same species are being studied for different purposes, because of either ubiquity, abundance, interest or location. In one sense this is useful, allowing a thorough treatment of certain groups of species, for example by incorporating trait data in analyses. On the other hand, it highlights the fact that many species are poorly studied in terms of distribution, traits and responses to environmental change. Indeed, many taxonomic groups that matter greatly for ecosystem functions (e.g., earthworms, fungi) are routinely underrepresented in data compilations (Cardoso et al. 2011; Norris 2012), including, despite our efforts toward representativeness, ours. The PREDICTS database is a work in progress, but already represents the most comprehensive database of its kind of which we are aware. Associated with this article is a site-level extract of the data: columns are described in Table S13. The complete database will be made publicly available in 2015, before which we will attempt to improve all aspects of its coverage by targeting underrepresented hotspots, realms, biomes, countries and taxonomic groups. In addition to taking data from published articles, we will integrate measurements from existing large published datasets, where possible. We welcome and greatly value all contributions of suitable data; please contact us at enquiries@predicts.org.uk.

Supporting Information
Additional Supporting Information may be found in the online version of this article:
Figure S1. Maximum linear extents of sampling.
Figure S2. Graphical representations of fragmentation layouts.
Figure S3. Database schema.
Figure S4. Countries represented by area.
Figure S5. Histogram of Site maximum linear extents of sampling.
Figure S6. Histogram of Site sampling durations.
Figure S7. Histogram of the area of habitat surrounding each Site.
Figure S8. Histogram of the distance from each Site to the nearest country GIS polygon.
Figure S9. Histogram of the distance from each Site to the nearest ecoregion GIS polygon.
Table S1. Classification of land-use intensity for primary and secondary vegetation based on combinations of impact level and spatial extent of impact.
Table S2. Combinations of predominant land use and use intensity.
Table S3. Habitat fragmentation classifications.
Table S4. Examples of parsing different styles of taxonomic name with the Global Names Architecture's biodiversity package (https://github.com/GlobalNamesArchitecture/biodiversity).
Table S5. Coverage of countries.
Table S6. Coverage of regions.
Table S7. Coverage of subregions.
Table S8. Coverage of realms.
Table S9. Coverage of biomes.
Table S10. Distribution of samples by biome and kingdom.
Table S11. Distribution of samples by subregion and kingdom.
Table S12. Coverage of fragmentation layouts.
Table S13. Data extract columns.
Data S1. Data extract.