The World Spider Trait database: a centralized global open repository for curated data on spider traits

Abstract
Spiders are a highly diversified group of arthropods and play an important role in terrestrial ecosystems as ubiquitous predators, which makes them a suitable group to test a variety of eco-evolutionary hypotheses. For this purpose, knowledge of a diverse range of species traits is required. Until now, data on spider traits have been scattered across thousands of publications produced for over two centuries and written in diverse languages. To facilitate access to such data, we developed an online database for archiving and accessing spider traits at a global scale. The database has been designed to accommodate a great variety of traits (e.g. ecological, behavioural and morphological) measured at individual, species or higher taxonomic levels. Records are accompanied by extensive metadata (e.g. location and method). The database is curated by an expert team, regularly updated and open to any user. A future goal of the growing database is to include all published and unpublished data on spider traits provided by experts worldwide and to facilitate broad cross-taxon assays in functional ecology and comparative biology.
Database URL: https://spidertraits.sci.muni.cz/


Introduction
With almost 50 000 species described to date (1), spiders are among the most diverse orders of terrestrial arthropods (2). Spiders rank among the most dominant arthropod predators in a huge variety of ecosystems and therefore provide important ecosystem services, such as biological control (3,4) and bio-indication (5). They are also potentially an important source of molecules to be used in new biotechnologies and human medicine (6,7). In addition to these uses, spiders provide suitable models to test the breadth of ecological and evolutionary hypotheses (8–10).
Successful use of spiders for research and environmental assessments is based on knowledge of traits (morphological, ecological, physiological or behavioural characteristics), which characterize responses to environmental conditions and both change and define the effects of spiders on ecosystem functioning (10). Assembling trait values for species in a community is laborious because, for some traits and species, this information either does not exist or is not easily available as it is hidden in old publications (often not in English), unpublished records, technical reports or even field notes. Although difficult to access, the data available are extensive as research on spiders has covered a huge diversity of topics for over 200 years (11). Data on spider traits continue to be generated on a daily basis, most of them being used in individual publications or retained in unpublished datasets. Trait data are stored in different places and forms, and most data that originated before the use of personal computers are only available from printed publications. More recently, collected data have often been stored in digital form in different repositories (from personal computers to data archive servers), but it is often difficult to compile and standardize datasets with different formats and completeness of metadata, which are necessary for leveraging data for common purposes as pointed out in the concept of Essential Biodiversity Variables (12,13).
Trait databases already exist for a number of taxonomic groups, such as plants (14), corals (15), reptiles (16), copepods (17) and ground beetles (18), with a similar aim to accumulate and organize available data in a single repository. The success of such databases can be seen in their frequent use by many scholars (19). A general database of spider traits has not yet been developed. However, a range of spider traits can currently be found in several online resources, for example, the body size of European species (20), cytogenetic data (21), protein toxins of spiders (22), habitat and phenology of British (http://srs.britishspiders.org.uk/) and Czech spiders (http://arachnobaze.cz/) and various traits of ground-dwelling spiders (https://portail.betsi.cnrs.fr).
A trait-based approach has the advantage that some investigations (e.g. bio-indication) can be performed even when the taxonomic identity is missing or inaccurate (using morphospecies, for example) (30). Using traits, instead of taxonomic information, also allows for a comparison of community patterns and responses across regions with different species pools (31). For these purposes, it is important that trait data are available in appropriate quality and quantity and have broad taxon and regional coverage. Overcoming these barriers will foster collaboration among arachnologists and other researchers that aim for multi-taxa analyses (24,32,33).
Recently, Lowe et al. (10) initiated the establishment of a centralized database that aims to cover all spider traits and store data in a single place under FAIR (findable, accessible, interoperable and reusable) principles (34). Lowe et al. (10) built the foundation of such a database by detailed coverage of the trait definition, their standardization, input data types, database governance, geographical coverage, accessibility, quality control and sustainability. Furthermore, Lowe et al. (10) recognized that the unification of the trait records can only be accomplished by careful examination of the data during the validation procedure.
Following the initiative (10), here, we present a curated global database that follows the FAIR principles and hosts a variety of traits recorded for spiders (Figure 1). The database has the potential to grow indefinitely, and we have already collected data for more than 7000 spider taxa. The database has two main goals: (i) to collect and curate trait data on spiders from different sources, either (un)published or to be published in the future, and (ii) to provide public access to these data under a CC BY licence, facilitating their widespread use by researchers.

Definitions
We adopted a broad definition of traits for inclusion in our database: any measurable phenotypic (i.e. morphological, ecological, physiological and behavioural) characteristic of an individual or taxon. This may also include 'pure' (heritable) traits (35), as well as the response to environmental conditions or a treatment (36,37). Traits can be either quantitative (continuous, integers and proportions) or categorical (qualitative, binary and ordinal). Trait values can represent individual-level measurements (single observation) to higher taxonomic (species-, genus-and family-) level measurements (aggregates), often recorded as a statistic (mean, median, minimum and maximum). We do not consider descriptive molecular data (such as DNA or protein sequences) or faunistic records to be traits, unless these contain reference to some trait (e.g. habitat type), as these have already established repositories, such as GenBank® or the Global Biodiversity Information Facility.
The definition of specific traits (including units for numerical traits or eligible values for categorical traits) was adopted from widely used definitions in a variety of published papers on spiders. To achieve semantic interoperability, each trait is described by standardized terms (Table S1). Two types of ontologies, describing the process of data collection and the traits themselves, were implemented during the development of the database structure, as suggested by Kissling et al. (12). The process of measurement, that is, details of data collection, is provided as metadata, and the trait measured is given in the main table (see below).
To increase the interoperability of this database with other databases, the next step in the update of the database will be setting up an expert team to develop ontologies, detailed vocabularies and a hierarchical structure for all traits. Some traits thus might be redefined. This will not affect the current content but will prepare space for a harmonized collection of future data.

Database structure
We developed an online application and architecture called the World Spider Trait database, currently in version 1.0 (https://spidertraits.sci.muni.cz/), to store and retrieve trait data on spider species (Figure 2). The database is able to accommodate traits measured at any taxonomic level. As many trait values show variation (phenotypic plasticity) as a response to varying conditions, each trait record can be accompanied by extensive metadata, describing the conditions under which it was measured (such as treatment, sampling method, geographic location, habitat and date).
The database was built to meet the FAIR principles: it is publicly available online under an open-access licence and in a machine-readable format. This is enhanced by comprehensive online search options and export capabilities.
The database has a multi-layered structure. It is composed of a main table (Figure 1), including five mandatory variables, namely (i) Original species name (taxon name as reported in the original source), (ii) Trait abbreviation (unique abbreviation of each trait), (iii) Trait value (measured value of a trait), (iv) Method abbreviation (unique abbreviation of each method used to measure a trait) and (v) Reference abbreviation (unique abbreviation of each source). Several other variables are optional, namely WSC LSID (unique taxon identifier taken from the World Spider Catalog), Trait category (see below), Trait name, Trait description, Trait data type, Trait unit, Measure (type of the measured value), Life stage (ontogenetic stage), Sex, Frequency (relative frequency of occurrence), Sample size (total number of observations per record), Treatment (treatment conditions), Method name (see below), Method description, Location abbreviation (unique identifier of a location), Latitude, Longitude, Altitude, Locality (the name or description of the place), Country, Habitat (habitat type according to a local classification), Microhabitat, Date, Note (any note related to a record), Row link (unique identifier of related measurements) and Reference (full reference). For a detailed description of each variable and examples, see Table 1.
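The split between mandatory and optional variables can be illustrated with a small check. This is only a sketch: the column keys (`original_name`, `trait`, `value`, `method`, `reference`) and the example abbreviations are hypothetical stand-ins for the five mandatory variables, not the actual template headers.

```python
# Hypothetical keys standing in for the five mandatory variables.
MANDATORY = ("original_name", "trait", "value", "method", "reference")


def missing_mandatory(record):
    """Return the mandatory keys that are absent or blank in a record."""
    missing = []
    for key in MANDATORY:
        value = record.get(key)
        if value is None or str(value).strip() == "":
            missing.append(key)
    return missing


record = {
    "original_name": "Taxon A",
    "trait": "bodl",        # hypothetical trait abbreviation
    "value": "5.6",
    "method": "",           # left blank on purpose
    "reference": "ref01",   # hypothetical reference abbreviation
    "sex": "female",        # optional metadata
}

print(missing_mandatory(record))  # ['method']
```

Optional variables such as Sex are simply ignored by this check. A numeric value of 0 is treated as present, because only empty or missing entries count as blank.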
In the backend of the application, there are five additional metadata tables (extensions) that provide auxiliary information: (i) Taxa, (ii) Locations, (iii) Traits, (iv) Methods and (v) References. The Taxa table includes valid taxon names, updated on a weekly basis from the spider nomenclature information available in the World Spider Catalog (1), which contains valid Latin names and synonyms. Morpho-species do not have valid species names, thus higher level categories (e.g. genus) are used, optionally accompanied by additional information provided by the uploader in the Note field. The Locations table includes country code, country name, locality name, coordinates and its abbreviation. The Traits table contains trait name, category, description, data type, unit and its abbreviation. The Methods table includes method name, description and its abbreviation. The References table includes full reference and its abbreviation. For more details see Table 1.
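The link between the main table and its extensions can be pictured as a lookup by abbreviation. The sketch below assumes small in-memory tables with hypothetical keys and placeholder contents; the actual backend schema is not described here and may differ.

```python
# Hypothetical extension tables keyed by abbreviation (placeholder values).
TRAITS = {"bodl": {"trait_name": "Body length", "trait_unit": "mm"}}
METHODS = {"meas": {"method_name": "Measurement"}}
REFERENCES = {"ref01": {"full_reference": "Placeholder full reference"}}


def expand_record(record):
    """Attach trait, method and reference details to a main-table record.

    Raises KeyError when an abbreviation is not defined in its extension.
    """
    expanded = dict(record)  # copy, so the original record stays untouched
    expanded.update(TRAITS[record["trait"]])
    expanded.update(METHODS[record["method"]])
    expanded.update(REFERENCES[record["reference"]])
    return expanded


main_record = {
    "original_name": "Taxon A",
    "trait": "bodl",
    "value": 5.6,
    "method": "meas",
    "reference": "ref01",
}

expanded = expand_record(main_record)
print(expanded["trait_name"], expanded["value"], expanded["trait_unit"])
```

Keeping the descriptions in separate tables means that a trait or method is defined once and every record refers to it by its abbreviation.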
We defined 138 traits that are currently grouped into 12 categories according to the discipline (Anatomy; Biomechanics; Communication; Cytology; Defence; Ecology; Life-History; Morphology; Morphometry; Physiology; Predation and Reproduction) (Table S1). Information on the way a trait was measured is described in the Methods table. The provision of this metadata is mandatory during upload to ensure comparability of data. The Methods list includes field collection techniques, as well as laboratory methodologies. Currently, there are 37 methods defined (Table S2). The included pre-defined traits, categories and methods are meant to cover the majority of traits and methodologies in spider research. However, the architecture of the database is flexible enough that further traits, categories and methods can be added in the future to accommodate new trait types and novel methodologies.
This database is hosted, developed and maintained at the Department of Botany and Zoology of Masaryk University in collaboration with the University IT centre. It is connected to the World Spider Catalog (1), and administered and curated by the core team members (Figure 2).

Data upload procedure
Upon collection, the data must be harmonized. Before a dataset can be submitted to the database, the data must be in a valid format (for a detailed description, see https://github.com/oookoook/spider-trait-database/blob/master/docs/template.md). For this purpose, we developed an MS Excel spreadsheet (Template) that should fit the great majority of trait types with predefined columns. The spreadsheet was designed to enable easy data manipulation by standard statistical software, such as R (38). The template can be downloaded from the World Spider Trait database webpage (https://spidertraits.sci.muni.cz/contribute). It contains 31 columns, some of which are mandatory, so they must be filled with appropriate numerical or character values. Eligible values for all columns can be found in the header of each variable in the List of Traits (Table S1) and List of Methods (Table S2). If the input trait or method is not already defined, the contributor should provide all of the following information to create a new trait or method: trait category, trait name, trait description, trait data type and trait unit in the case of missing traits or method name and method description in the case of missing methods. Similarly, for references, the contributor either provides an abbreviation of a reference if it is in the List of References or a full reference. Unpublished data are referenced as personal observations.
The data in the template then need to be saved either as an .xls(x) or a comma-delimited .csv file, and the file should be encoded as UTF-8 to ensure compatibility with special (regional) characters. Once the template is uploaded, the contributor must approve it using the tools within the web application.
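For contributors who prepare data programmatically rather than in a spreadsheet, a comma-delimited UTF-8 file can be written with a few lines of Python. The sketch below uses hypothetical column names and a placeholder record; the real template defines its own headers.

```python
import csv
import os
import tempfile


def write_utf8_csv(rows, path):
    """Write a list of dicts as a comma-delimited, UTF-8 encoded file."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


def read_utf8_csv(path):
    """Read the file back to confirm that special characters survive."""
    with open(path, encoding="utf-8", newline="") as handle:
        return [dict(row) for row in csv.DictReader(handle)]


# Placeholder record; the locality contains regional characters on purpose.
rows = [
    {"original_name": "Taxon A", "value": "5.6", "locality": "Žabovřesky"},
]

path = os.path.join(tempfile.mkdtemp(), "dataset.csv")
write_utf8_csv(rows, path)
round_trip = read_utf8_csv(path)
print(round_trip[0]["locality"])
```

Stating the encoding explicitly avoids relying on the platform default, which is a common source of corrupted regional characters.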

Software used
The code of the web application is stored on GitHub (https://github.com/oookoook/spider-trait-database) and is available under the GNU GPL v 3.0. The phylogenetic tree was produced using functions from the ape package (39) in R (38).

Data records
Integration of data from different sources was based on standardization and harmonization. This involved the conversion of trait values to comparable units for each trait, use of a controlled vocabulary in the definition of traits, standardization of eligible character values and use of a single spreadsheet format. Each record was accompanied by licence information and the original source.
Currently, both published (from more than 1000 publications) and unpublished data from diverse study designs (both descriptive and experimental) are included in the database, with the citation of the original source. So far, 70 datasets have been contributed, with a total number of more than 221 000 records belonging to more than 7500 taxa. Of these, 40 datasets (34.1% of records) are unlocked (i.e. freely accessible without user registration). The remainder (i.e. embargoed datasets) are previously unpublished data compilations and can be viewed and downloaded by registered users only to ensure applicable authorship credits (see 'Usage Notes'). Registration and data usage are free under a CC BY licence. Embargoed data compilations may eventually become unlocked (e.g. once these have been used in published studies).
Geographical coverage of the database is global, but, currently, there are more records from Europe and South America than from other continents (Figure 3), a typical bias in biodiversity research (40). Data on taxa from North America, Africa and Asia are represented by very few records. The great majority of records available now come from Europe. Specifically, 20 datasets (66.1% of records) concern European species. These include data on body size (2024 species), light and moisture preferences (1949 species), guild classification (1017 species) and conservation status (1557 species). In terms of traits, anatomical, behavioural and physiological data are largely missing.
As for the taxonomic coverage, of 129 known spider families (1), only 2 (Euctenizidae and Penestomidae) have no records in the database so far (Figure 4). Several families (e.g. Gnaphosidae, Lycosidae, Salticidae, Sicariidae and Theridiidae) have data for more than 40% of the 138 traits, but 38 families still have fewer than 5% of all traits covered. As for the number of records per family, most records come from the most speciose families, namely Linyphiidae, followed by Lycosidae, Theridiidae and Salticidae (Figure 5A). Because not every trait has been measured for every taxon, the taxon × trait matrix is highly incomplete (2.82% completeness; Figure 5B). This is to be expected for a highly diverse and severely understudied taxonomic order. With respect to sex/stage, 33.6% of records are for adult males, 55.8% for adult females and 8.6% for juveniles.
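One plausible way to read the completeness figure is as the share of taxon × trait cells that hold at least one record. The sketch below works under that assumption, which is ours and not a stated method, and uses a toy set of placeholder taxa and traits.

```python
def matrix_completeness(pairs, n_taxa, n_traits):
    """Share of taxon x trait cells that hold at least one record."""
    filled = set(pairs)  # repeated records for the same cell count once
    return len(filled) / (n_taxa * n_traits)


# Toy data: 3 taxa and 4 traits give 12 possible cells.
pairs = [
    ("Taxon A", "body length"),
    ("Taxon A", "body length"),  # second record for the same cell
    ("Taxon A", "habitat"),
    ("Taxon B", "body length"),
    ("Taxon C", "guild"),
]

completeness = matrix_completeness(pairs, n_taxa=3, n_traits=4)
print(f"{completeness:.1%}")  # 33.3%
```

Four distinct cells out of twelve are filled, so the toy matrix is one third complete. The same calculation over all taxa and traits would give a much lower value, as most cells are empty.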
The content of the database reflects real historical differences among geographic areas and disciplines. The database thus can be used to identify gaps and help to prioritize future areas for investigation to achieve more complete sets of records. To fill these gaps, we plan to encourage contributions from specific areas, traits and trait categories in the future. This can include the collection of data from other repositories, extraction of data from publications and archiving currently produced data. We will also ask curators of specialized spider trait databases to provide their data to be centrally stored here. Since many funders and journals now require data to be made publicly available, the database can be used as a permanent data archive option (an alternative to, e.g. Dryad or Figshare), provided that each contributed dataset meets the standards of the database format, which allows efficient reuse and synthesis. Each dataset obtains a unique URL and, in the near future, it will be associated with a DOI provided by DataCite. In the future, we expect to mainly gather data on new traits and new taxa and would like to encourage colleagues to contribute their datasets of both published and unpublished data. A coordinated effort is needed to achieve this goal.
To promote the process of data collection, we invite arachnologists to download the template and use it for data storage on their personal computers. At the same time, we ask arachnologists to get used to the vocabulary of the database, adopt the definition of the traits that are used here (or suggest alternatives) and develop protocols that follow the same standards. This will markedly enhance the integration of their datasets into the database.

Data validation
Validation is performed at several steps during submission in order to retain only high-quality records.
First, a contributor is advised to search through the current database content in order to ensure that such (exact) data are not already included for the taxon/taxa under investigation. It is also useful at this point to check whether the proposed trait(s) and method(s) are already defined. Contributors become eligible to upload their dataset after requesting registration from the administrator.
To upload a new dataset, a contributor must specify the name of the dataset, their full name and email address. In addition, a contributor can specify the authors of the dataset and author emails, mark whether the data can be immediately accessed or are under an embargo, and add any note. Then, the dataset sheet is created and the contributor is able to upload the data. The data are then imported to the temporary cache. During the upload process, the web application checks the presence of eligible values in the variables (Original name, Trait abbreviation, Value, Measure, Sex, Life stage, Frequency, Sample size, Method abbreviation, Latitude, Longitude, Altitude, Country, Date and Reference) and identifies duplicate records. Invalid records are highlighted to facilitate corrections. The taxonomy check verifies that the name exists and matches a currently valid name according to the World Spider Catalog (1).
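The kind of automated checks described above can be sketched in a few lines. This is an illustration only, not the web application's code: the field names, the vocabulary for Sex and the choice of checks are assumptions made for the example.

```python
ELIGIBLE_SEX = {"male", "female", "both"}  # hypothetical controlled vocabulary
RANGES = {"latitude": (-90.0, 90.0), "longitude": (-180.0, 180.0)}


def invalid_fields(row):
    """Return the names of fields whose values are not eligible."""
    errors = []
    sex = row.get("sex")
    if sex not in (None, "") and sex not in ELIGIBLE_SEX:
        errors.append("sex")
    for field, (low, high) in RANGES.items():
        raw = row.get(field)
        if raw in (None, ""):
            continue  # optional field left empty
        try:
            number = float(raw)
        except (TypeError, ValueError):
            errors.append(field)
            continue
        if not low <= number <= high:
            errors.append(field)
    return errors


def find_duplicates(rows):
    """Return indices of rows that exactly repeat an earlier row."""
    seen = set()
    duplicates = []
    for index, row in enumerate(rows):
        key = tuple(sorted(row.items()))
        if key in seen:
            duplicates.append(index)
        else:
            seen.add(key)
    return duplicates


rows = [
    {"original_name": "Taxon A", "value": "5.6", "sex": "female",
     "latitude": "49.2", "longitude": "16.6"},
    {"original_name": "Taxon B", "value": "4.1", "sex": "f",
     "latitude": "94.2", "longitude": "16.6"},
    {"original_name": "Taxon A", "value": "5.6", "sex": "female",
     "latitude": "49.2", "longitude": "16.6"},
]

print([invalid_fields(row) for row in rows])  # [[], ['sex', 'latitude'], []]
print(find_duplicates(rows))                  # [2]
```

Returning the names of the offending fields, rather than a simple pass or fail, mirrors the idea of highlighting invalid cells so that the contributor can correct them.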
At this stage, the contributor can view the dataset and must edit invalid cells in order to comply with the database requirements. Editing is done using the web application tools. When the contributor completes all changes and the dataset is valid, it can be sent to the administrator or editor for review. The contributor can include a message to the editor when submitting the dataset for review, in which the contributor can explain any problems they had encountered while editing the dataset.
The administrator or editor is informed of a new dataset submission by an email. The dataset enters a second validation phase, which can only be done by the administrator or editor. The administrator or editor must add new trait(s) and method(s) to the database and check for additional errors, such as extreme (unlikely) values of traits (e.g. resulting from typos and a wrong digit separator), imprecise definition of new traits and methods or an incorrect format of references. Once the dataset is validated by the administrator or editor, it is published in the database. This means that all the data are transferred from the temporary import cache to the main database and become available to the general public, unless embargoed. If the administrator or editor observes any problems, the dataset is rejected and sent back to the contributor with an email containing a description of the problem(s) to be fixed. Any dataset can be post hoc corrected/altered by the administrator or editor without contributors' consent.
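A simple screen for extreme values can support this manual review. The rule below, which flags values more than a fixed factor away from the median of the same trait, is an assumption chosen for illustration and not the procedure used by the editors; the factor of 5 is likewise arbitrary.

```python
from statistics import median


def flag_extremes(values, factor=5.0):
    """Return indices of values far from the median of positive values.

    A misplaced digit separator typically shifts a value by a factor of
    10, which this rule catches when most other values are correct.
    """
    positives = [v for v in values if v > 0]
    if not positives:
        return list(range(len(values)))  # nothing usable: flag everything
    centre = median(positives)
    flagged = []
    for index, value in enumerate(values):
        if value <= 0 or value > centre * factor or value < centre / factor:
            flagged.append(index)
    return flagged


# Placeholder body lengths in mm; 56.0 and 0.58 mimic separator errors.
body_lengths = [5.2, 5.6, 6.1, 56.0, 4.9, 0.58]
print(flag_extremes(body_lengths))  # [3, 5]
```

Flagged values are only candidates for review. A genuinely large or small measurement should be kept once it has been checked against the original source.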

Data usage
A user can view the whole content of the database using the Data Explorer within the online application. In the Data Explorer, the user can apply filters (Family, Genus, Species, Original name, Trait category, Trait, Method, Location, Country, Dataset, References and Row links) to display selected content. The result can be displayed as a spreadsheet or as a bar chart. Selected data can then be downloaded in a .csv or .xlsx format. If the selected data contain data from datasets under embargo, the user is given a warning. In order to download embargoed data, the user has to send a request to the administrator or editor, who will then contact the dataset authors. Embargoed data can be downloaded only after the user has received login credentials.
In addition, the database provides an Application Programming Interface (API) to allow access to data via web platforms or software. An R package, named ARAKNO (41), with a few easy-to-use functions that allow downloading and preprocessing data from the database, is now available. Resulting data frames can then be analysed with a variety of tools available in R (38). Access to embargoed data via the API requires login as well.
As the trait value data can be a mixture of various statistics, it is important that the user checks the 'Measure' variable of each record and adopts appropriate procedures prior to analysis. Furthermore, due to inherent variation in most trait values, the user must consider the conditions (such as habitat, altitude and treatment) under which each value was measured. Not all conditions (e.g. hunger state and mating status) are recorded in the auxiliary variables; thus, the user is strongly advised to study the original publication.
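The effect of ignoring the 'Measure' variable can be shown with a toy example. The records, field names and Measure labels below are hypothetical; the point is only that pooling minima and maxima with means distorts a summary.

```python
from collections import defaultdict
from statistics import mean

# Placeholder records for one taxon and one trait (body length, mm).
records = [
    {"taxon": "Taxon A", "measure": "mean", "value": 5.4},
    {"taxon": "Taxon A", "measure": "mean", "value": 5.8},
    {"taxon": "Taxon A", "measure": "min", "value": 2.0},
    {"taxon": "Taxon A", "measure": "max", "value": 12.0},
]


def values_by_measure(records):
    """Group trait values by the statistic they represent."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["measure"]].append(record["value"])
    return dict(grouped)


grouped = values_by_measure(records)
naive = mean(record["value"] for record in records)  # mixes all statistics
central = mean(grouped["mean"])                      # uses reported means only

print(round(naive, 2), round(central, 2))  # 6.3 5.6
```

Averaging the reported means without weights is itself a simplification. Where sample sizes are recorded, a weighted summary may be more appropriate.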
A number of traits included in this database are candidates for Essential Biodiversity Variables proposed by the Group on Earth Observations Biodiversity Observation Network (12,13). The traits are recorded with extensive metadata and thus allow quantification of intra-specific variation with respect to environmental conditions, space and time. These traits can be of societal relevance, as they can be used to study the spread of invasive species or biodiversity change.
Although the use of data is free, users are strongly encouraged to contribute their data, particularly if they have not contributed yet, following the simple 'first give, then take' principle. Only by these means will the database grow in quantity and frequency of use.
The data are publicly available under a Creative Commons Attribution licence (CC BY 4.0), so anyone can use them on the condition that this publication is appropriately cited. In the case of datasets that have not been published and are under embargo, the user must agree with the dataset contributor on the conditions of use. Typically, this should include citation (URL or DOI) of the specific dataset in addition to the database citation.

Supplementary data
Supplementary data are available at Database Online.