DNA barcode data accurately assign higher spider taxa

The use of unique DNA sequences as a method for taxonomic identification is no longer fundamentally controversial, even though debate continues on the best markers, methods, and technology to use. Although existing databanks such as GenBank and BOLD, as well as reference taxonomies, are imperfect, in best-case scenarios “barcodes” (whether single or multiple, organelle or nuclear, loci) clearly are an increasingly fast and inexpensive method of identification, especially as compared to manual identification of unknowns by increasingly rare expert taxonomists. Because most species on Earth are undescribed, a complete reference database at the species level is impractical in the near term. The question therefore arises whether unidentified species can, using DNA barcodes, be accurately assigned to more inclusive groups such as genera and families, taxonomic ranks of putatively monophyletic groups for which the global inventory is more complete and stable. We used a carefully chosen test library of CO1 sequences from 49 families, 313 genera, and 816 species of spiders to assess the accuracy of genus and family-level assignment. We used BLAST to query each sequence against the entire library and retained the top ten hits. The percent sequence identity (PIdent, range 75–100%) was recorded for each of these hits. Accurate assignment of higher taxa (PIdent above which errors totaled less than 5%) occurred for genera at PIdent values >95 and families at PIdent values ≥ 91, suggesting these as heuristic thresholds for accurate generic and familial identifications in spiders. Accuracy of identification increases with numbers of species/genus and genera/family in the library; above five genera per family and fifteen species per genus all higher taxon assignments were correct. We propose that using percent sequence identity between conventional barcode sequences may be a feasible and reasonably accurate method to identify animals to family/genus.
However, the quality of the underlying database impacts accuracy of results; many outliers in our dataset could be attributed to taxonomic and/or sequencing errors in BOLD and GenBank. It seems that an accurate and complete reference library of families and genera of life could provide accurate higher level taxonomic identifications cheaply and accessibly, within years rather than decades.


INTRODUCTION
The difficulty of accurately identifying biological specimens has always limited the application of biological data to important societal problems. Obstacles are well known and difficult: the vast majority of species are undescribed scientifically (Erwin, 1982; May, 1992; Mora et al., 2011); some unknown but large fraction of higher taxa are not monophyletic (Goloboff et al., 2009; Pyron & Wiens, 2011); many species can only be identified if certain life stages are available, e.g., adults (Coddington & Levi, 1991); classical data sources such as morphology imperfectly track species identity; the discipline of taxonomy continues to dwindle (Agnarsson & Kuntner, 2007); and the classical process of taxonomic identification is mostly manual and cannot scale to provide the amounts of data required for real-time decisions on environmental monitoring, invasive species, climate change, etc.
While most species remain undescribed, the situation is not so dire for larger monophyletic groups such as clades accorded the Linnaean ranks of genus or family. In assessing the state of knowledge about biodiversity, it is important to distinguish between the first scientific discovery of an exemplar of a lineage, and phylogenetic understanding of that lineage. Phylogenetic understanding, meaning both tree topology and the consequent taxonomic changes, is a research program with no clear end in sight. Linnaean rank is partially arbitrary, and one expects that the number of higher taxa will probably increase over time as understanding improves. Discovery, however, can have an objective definition: the year of the earliest formal taxonomic description of a member of the lineage or taxonomic group in which it is currently included. By this definition the earliest possible discovery of an animal lineage is 1758 (Linnaeus, 1758), or in the case of spiders, 1757 (Clerck, 1757).
More illuminating are the latest discoveries of lineages with the rank of family within larger clades, because the data tell us something about progress towards broad scale knowledge of biodiversity. The species representing the most recent discovery of a family of birds, for example, is the Broad-billed Sapayoa, Sapayoa aenigma Hunt, 1903 (Sapayoaidae). The species representing the most recently discovered mammal family is Kitti's hog-nosed bat, Craseonycteris thonglongyai Hill, 1974 (Craseonycteridae). For flowering plants, it is Gomortega keule (Molina) Baill, 1972 (Gomortegaceae). For bees, it is Stenotritus elegans Smith, 1853 (Stenotritidae). For spiders, a megadiverse and poorly known group, it is Trogloraptor marchingtoni Griswold, Audisio & Ledford, 2012 (Trogloraptoridae), but the second most recent discovery of an unambiguously new spider family was in 1955, Gradungulidae (Forster, 1955). Figure 1 illustrates the tempo of first discovery of families for these five well-known clades. At the family level, these curves are essentially asymptotic, implying that science is close to completing the inventory of clades ranked as families for these large lineages. On the other hand, for Bacteria and Archaea (Fig. 1), as one would expect, the curve is not asymptotic at all but sharply increasing; prokaryote discovery and understanding is obviously just beginning.
In fact, although many new eukaryote families are named every year, the vast majority of these new names result from advances in phylogenetic understanding, not biological discovery of major new forms of life. The last ten years of Zoological Record suggests that roughly 5-10 truly new families are discovered per year.
In the context of the above question (approximate taxonomic assignment of organisms using DNA sequences), these data suggest that our knowledge of major clades of life is approaching completion. The Global Genome Initiative (GGI; http://ggi.si.edu/) of the Smithsonian Institution via the GGI Knowledge Portal (http://ggi.eol.org/) has tabulated a complete list of families of life, which total 9,650, on the whole a surprisingly small number. Assembling 10,000 barcodes, more or less, seems like a feasible goal. If we were able to assemble a complete database of DNA sequences at the family level, would it suffice to identify any eukaryote on Earth to the family level?
While the literature on species identification success of DNA barcodes comprises thousands of studies, only a few have tested their effectiveness at the level of higher taxonomic units. In the seminal paper on DNA barcodes, Hebert et al. (2003) established that animal CO1 sequences can roughly assign taxa to phyla (96% success) or orders (100% success). However, their test was based on a neighbor joining tree-building approach, and it remained unknown if sequence data itself, i.e., percent identity among taxa, can be used in this way. Similarly, Nagy et al. (2012) showed that DNA barcoding in reptiles usually correctly assigned barcodes to species, genus and family. Their approach was phylogenetic: they tested whether including a sequence in tree building rendered the higher group non-monophyletic, which would imply failure. Finally, Wilson et al. (2011) provided a similar tree based test in sphingid moths, and established reliabilities of correct generic and subfamily taxonomic assignments between 74 and 90% using a liberal, and only 66-84% using a strict, tree-based criterion. These authors argued that tree-based methods perform better than sequence comparison methods, but that reliability, of course, depends on the library completeness.
Our project not only contributes original DNA barcode data for Central European spiders, but also works in synergy with the GGI towards a permanent preservation of genomic biodiversity: the formation of a collection of deeply frozen spider tissues and their DNA. We provide: (1) cryo-preserved tissues of reliably identified species of Central European spiders, and their vouchers photographed and deposited in public museums; (2) permanently frozen genomic DNA of these species; (3) publicly accessible DNA barcodes for these species (genetic sequence of cytochrome oxidase I, CO1) as a public identification tool (Hebert et al., 2003) to facilitate organism identification, taxonomy, ecology and conservation.
In addition, this project addresses to what extent higher level taxonomic units can be reliably identified using barcodes of unknown spiders, and specifically asks what percent sequence identity in BLAST results is necessary to correctly identify unknown taxa to the Linnaean genus and/or family. Other methods for classification of higher-level taxonomies such as RDP (Wang et al., 2007), UTAX (Edgar, 2010) and MEGAN (Huson et al., 2007) have primarily been developed for studies of microorganisms, using genetic markers for these groups, but less is known about using the CO1 barcoding gene in metazoans. We examine empirical Araneae barcode data to ask two questions: what is the percent sequence identity value above which 5% or less of higher level (genus/family) taxonomic identifications are incorrect, and to what extent does the frequency of correct identifications correlate with the number of taxa in this dataset, as would be expected given the dependence of BLAST on the reference database?

Specimen processing and imaging
We used automated and manual sampling methods for collecting spiders in the field in numerous localities in Slovenia and Switzerland. Faunistic and sampling details are published elsewhere (Čandek et al., 2013; see also 2015 corrigendum). Collected spiders were fixed in absolute ethanol immediately after being caught and the ethanol was replaced on the following day. Spiders were frozen at −80 °C on the same day, or as soon as possible. In the laboratory they were identified, labeled, photographed and processed for DNA extraction and sequencing (Čandek et al., 2013; see also 2015 corrigendum). Voucher specimens (voucher codes starting with 0078) are deposited at National Museum of Natural History, Smithsonian Institution (Washington D.C., USA), with duplicates (voucher codes starting with ARA) at Naturhistorisches Museum der Burgergemeinde Bern (Switzerland) and EZ LAB, ZRC SAZU (Ljubljana, Slovenia).
Voucher images are published along with their barcodes (see Table 1) at http://ezlab.zrcsazu.si/dna. All original sequences generated by this project have been submitted to BOLD systems, and those that BOLD accepted were also submitted to GenBank (Table 1).

Tissues
After specimen identification and processing, up to four legs (or in the case of very small individuals the whole prosoma) of a spider were removed and stored in fresh absolute ethanol in cryovials. Part of the tissue was used for DNA isolation while the other part remains permanently frozen at −80 °C at GGI facilities. The maintenance and use of these materials abides by the international legal standards and conventions of the biological genetic heritage (The Access and Benefit Sharing agreement as part of the 2010 Nagoya protocol).

Molecular procedures
At Laboratories of Analytical Biology (National Museum of Natural History, Smithsonian Institution, hereafter LAB), DNA was extracted using the AutoGenPrep phenol-chloroform automated extractor (AutoGen). Samples were digested overnight in buffer containing proteinase K before extraction. At EZ Lab, DNA was extracted using the MagMAX™ Express magnetic particle processor Type 700 with DNA Multisample kit (Applied Biosystems, Foster City, CA, USA) following the manufacturer's protocols with modifications (Vidergar, Toplak & Kuntner, 2014).

Barcode library
While we targeted 649 bp long DNA barcodes we also submitted (Table 1) 18 shorter fragments (>570 bp) that still satisfy the requirements of the Barcode of Life Data System (BOLD; Ratnasingham & Hebert, 2007). We combined the 297 species barcodes from this study with publicly available Araneae sequences from BOLD retrieved 4 December 2013, for a total of 816 species sequences, which formed the test library for this study. Sequences from BOLD were initially included if the sequence length was at least 600 bases and identification was to species. We further filtered and curated the data to exclude sequences that were identified anonymously or by non-arachnologists, that diverged dramatically from all other spider sequences, or that for other reasons were not deemed reliable. After having discarded the above, we did not assess the accuracy of every remaining sequence, as it is well known that both BOLD and GenBank contain errors of various kinds, and we wanted our test library to reflect real world conditions. A single sequence was chosen per species from BOLD using these criteria and added to the original sequences from this project, resulting in 816 species representing 313 genera and 49 families (Table 1 and Table S2). Eighteen sequences were singletons at the family level; the maximum number of species per family was 224. 157 sequences were singletons at the genus level; the maximum number of species per genus was 34. The standalone BLAST+ suite 2.2.28 (Altschul et al., 1990; Zhang et al., 2000) was used to create a custom BLAST database from these sequences. Each sequence was then queried against the full set using blastn (MegaBLAST task, e-value cutoff of 1e-10, maximum of top ten hits other than the hit of the query to itself). For each hit the percent of identical nucleotides in the aligned region (PIdent) was calculated by BLAST. An advantage of using BLAST is the local nature of the alignment hits returned. This accounts for differences in sequence lengths in the dataset, which may otherwise affect pairwise identity calculations of complete alignments. A possible drawback of local alignment is that BLAST may return short aligned regions that have high similarity but omit much of the queried sequence. To investigate this, we compared lengths of aligned regions with query sequence lengths to determine how prevalent such short matches are in this dataset. Custom Python scripts (GitHub https://github.com/mkweskin/spider-blast) were used to parse the results, removing the match of the query to itself, and to score whether hits matched the genus and family of the query sequence or not. Obviously, if the generic identification matched, the family identification also matched; families therefore always match more often than genera.
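The parsing and scoring steps described above can be illustrated with a minimal sketch. This is not the authors' script from the repository cited above; the function name `score_hits`, the five-column tabular layout, the file names in the comment, and the toy taxonomy are all assumptions made for illustration. It also assumes one row per subject sequence, ordered best hit first within each query.

```python
import csv
import io

# Hypothetical BLAST+ invocation that would produce the assumed layout
# (file and database names are placeholders, not taken from the paper):
#   makeblastdb -in library.fasta -dbtype nucl -out spiders
#   blastn -task megablast -query library.fasta -db spiders -evalue 1e-10 \
#          -max_target_seqs 11 -outfmt "6 qseqid sseqid pident length qlen"


def score_hits(blast_tsv, taxonomy, max_hits=10):
    """Yield one scored record per non-self hit, at most max_hits per query.

    blast_tsv: tab-separated text with columns qseqid, sseqid, pident,
               length (aligned length), qlen (query length).
    taxonomy:  dict mapping sequence id -> (family, genus).
    """
    kept = {}  # number of hits already kept for each query
    for row in csv.reader(io.StringIO(blast_tsv), delimiter="\t"):
        if not row:
            continue  # skip blank lines
        qseqid, sseqid, pident, length, qlen = row
        if qseqid == sseqid:
            continue  # remove the match of the query to itself
        rank = kept.get(qseqid, 0) + 1
        if rank > max_hits:
            continue
        kept[qseqid] = rank
        q_family, q_genus = taxonomy[qseqid]
        s_family, s_genus = taxonomy[sseqid]
        yield {
            "query": qseqid,
            "subject": sseqid,
            "rank": rank,
            "pident": float(pident),
            # fraction of the query covered by the local alignment
            "coverage": int(length) / int(qlen),
            # a genus match implies a family match
            "genus_match": (q_family, q_genus) == (s_family, s_genus),
            "family_match": q_family == s_family,
        }


# Toy data, invented for illustration only.
demo_taxonomy = {
    "s1": ("Theridiidae", "Steatoda"),
    "s2": ("Theridiidae", "Steatoda"),
    "s3": ("Theridiidae", "Theridion"),
    "s4": ("Araneidae", "Araneus"),
}
demo_tsv = "\n".join([
    "s1\ts1\t100.0\t649\t649",
    "s1\ts2\t96.5\t649\t649",
    "s1\ts3\t90.2\t600\t649",
    "s1\ts4\t84.0\t300\t649",
])
demo_hits = list(score_hits(demo_tsv, demo_taxonomy))
```

Recording `coverage` alongside PIdent makes it possible to flag hits whose aligned region spans only a small part of the query, which is the check on short local alignments mentioned above.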
On the other hand, singleton generic sequences cannot match correctly at the genus level, and, likewise, singleton family sequences cannot match correctly at the family level; this situation is common for spiders and other poorly known diverse groups. We included singletons as targets in order to model more realistically BLAST searches against the BOLD database (many sequences in BOLD are higher level singletons), and also to test more strongly the ability of sequences with two or more species per either genus or family to match correctly. Including 18 singleton family sequences and 157 singleton genus sequences, therefore, increases the probability of misidentification at either rank and more strongly tests the usefulness of barcodes as supraspecific identification tools.
However, because the 18 unique family sequences must fail at both the family and genus levels, and the 157 unique genus level sequences must fail at the genus level, these necessary failures were not included in the overall assessments of the ability of barcode sequences to provide accurate identifications at supraspecific levels.

Figure 2 Results from the barcode matching test. Frequency distributions of correct and incorrect identifications by percent sequence identity (PIdent) for the top ten and/or best hits at the genus and family level. Shaded areas include hits where no more than 5% of identifications were incorrect.
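The treatment of singletons can be sketched as follows: singleton sequences stay in the library as possible targets, but queries that cannot possibly match are left out of the accuracy assessment at the corresponding rank. The function name `singleton_flags` and the toy taxonomy are hypothetical, and the sketch assumes one sequence per species, as in the test library.

```python
from collections import Counter


def singleton_flags(taxonomy):
    """Flag sequences that are the only library member of their family or genus.

    taxonomy: dict mapping sequence id -> (family, genus), one sequence per species.
    """
    per_family = Counter(family for family, _ in taxonomy.values())
    per_genus = Counter(taxonomy.values())  # genus counted within its family
    return {
        seq_id: {
            "family_singleton": per_family[family] == 1,
            "genus_singleton": per_genus[(family, genus)] == 1,
        }
        for seq_id, (family, genus) in taxonomy.items()
    }


# Toy library, invented for illustration only.
demo_library = {
    "a": ("Theridiidae", "Steatoda"),
    "b": ("Theridiidae", "Steatoda"),
    "c": ("Theridiidae", "Theridion"),
    "d": ("Trogloraptoridae", "Trogloraptor"),
}
demo_flags = singleton_flags(demo_library)

# Queries that can be assessed at each rank; all four remain valid targets.
genus_scorable = sorted(s for s, f in demo_flags.items() if not f["genus_singleton"])
family_scorable = sorted(s for s, f in demo_flags.items() if not f["family_singleton"])
```

In this toy library, a family singleton is necessarily also a genus singleton, which mirrors the statement that unique family sequences must fail at both ranks.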

RESULTS
The 816 query sequences returned 8,159 total hits with one query only returning nine hits and all others ten (Table S1). PIdent scores ranged from 75% to 100%. We also examined the length of the sequence matched compared to the entire sequence length. 8,114 hits (>99%) matched to 90% or more of the query sequence length indicating that these results represent matches to large portions of the query validating the use of Percent Sequence Identity in the BLAST hits rather than computing the value for a global alignment between sequences. Figure 2 shows the frequency distributions of PIdent values of correct and incorrect identifications at the genus and family rank.
1. 95% of incorrect genus identifications were below PIdent = 95 when all hits for all queries are included, which suggests the latter value as a heuristic threshold to delimit incorrect from correct identifications (for these data). For only the highest rank hits whose PIdent ≥ 95, 98% of genus identifications were correct.
2. 95% of incorrect family identifications were below PIdent = 91 when all hits for all queries are included, which suggests the latter value as a heuristic threshold to delimit incorrect from correct identifications (for these data). For only the highest rank hits whose PIdent ≥ 91, 97% of family identifications were correct.
3. Library accuracy is crucial, but sequencing, labelling, and identification errors are difficult to detect a priori. The highest ranked incorrect family identification was Meta menardi (Tetragnathidae) to Steatoda grossa (Theridiidae), at PIdent = 96. Further study of the M. menardi sequence shows that the BOLD record is probably a mislabeled Steatoda. The first true incorrect family identification occurs at a PIdent value of 88; the best hit for Octonoba (Uloboridae) is Amaurobius (Amaurobiidae).
4. For the 136 genera with at least two species in the library, 76% (n = 103) best matched congeners. Thirty-three failed, perhaps because sequences were incorrectly identified taxonomically, or the sequence itself may be erroneous, or perhaps due to nonmonophyly of genera.
5. The distributions of PIdents for correct family and genus identifications differ significantly from the distributions of incorrect identifications (Fig. 2).
6. Plotted against increasing numbers of species/genus, and genera/family, the proportion of top ten PIdent values that exceed the above suggested threshold values increases. Roughly speaking, 15 species per genus, and 5 genera per family, are sufficient to ensure that best hits represent correct identifications (Fig. 3).
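One way to read the heuristic thresholds in items 1 and 2 is as the smallest whole-number PIdent below which at least 95% of incorrect hits fall. The sketch below implements that reading, together with the accuracy of best hits at or above the threshold. The exact procedure used in the study is not spelled out in the text, so the function names, the record keys (`pident`, `rank`, `genus_match`) and the toy numbers are assumptions made for illustration.

```python
import math


def heuristic_threshold(hits, key, fraction=0.95):
    """Smallest integer PIdent with at least `fraction` of incorrect hits below it.

    hits: iterable of dicts with a float "pident" and a boolean under `key`
          (for example "genus_match" or "family_match").
    Returns None if there are no incorrect hits or no value up to 100 qualifies.
    """
    wrong = [h["pident"] for h in hits if not h[key]]
    if not wrong:
        return None
    for t in range(math.floor(min(wrong)), 101):
        below = sum(1 for p in wrong if p < t)
        if below / len(wrong) >= fraction:
            return t
    return None


def best_hit_accuracy(hits, key, threshold):
    """Fraction of rank-1 hits with PIdent >= threshold that are correct."""
    best = [h for h in hits if h["rank"] == 1 and h["pident"] >= threshold]
    if not best:
        return None
    return sum(1 for h in best if h[key]) / len(best)


# Toy hits, invented for illustration only.
toy_hits = (
    [{"pident": 78.0, "rank": 2, "genus_match": False} for _ in range(39)]
    + [{"pident": 97.0, "rank": 1, "genus_match": False}]
    + [{"pident": 98.0, "rank": 1, "genus_match": True} for _ in range(3)]
)
toy_threshold = heuristic_threshold(toy_hits, "genus_match")
toy_accuracy = best_hit_accuracy(toy_hits, "genus_match", toy_threshold)
```

The threshold is computed from all hits, whereas the accuracy is computed from best hits only, following the distinction drawn in items 1 and 2 between "all hits for all queries" and "only the highest rank hits".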

DISCUSSION
We show that standard DNA barcodes can accurately assign unknown specimens to genus and family given sufficient sequence identity and sufficient taxonomic representation in the database. Accurate identification (PIdent above which less than 5% of identifications were incorrect) occurred for genera at PIdent values > 95 and families at PIdent values ≥ 91, suggesting these as heuristic thresholds for generic and familial identifications in spiders (shaded in Fig. 2). Accuracy of identification increases with numbers of species/genus and genera/family; above five genera per family and 15 species per genus all identifications were correct (Fig. 3). The accurate identification of specimens remains a critical challenge for megadiverse groups such as arthropods, most other invertebrates, plants, fungi, protists, etc. Morphological identification to species, or even more inclusive taxonomic ranks like genera and families, in many cases requires extensive training, and for most groups taxonomic expertise is limited and dwindling, the so-called 'taxonomic impediment' (Rodman & Cody, 2003; Agnarsson & Kuntner, 2007). DNA barcodes have been proposed as convenient tools to overcome this impediment by making identification a purely technical procedure available to any interested researcher or even 'citizen scientists.' However, the accuracy of such a tool strongly depends on the scope and quality of the barcode library (Smit, Reijnen & Stokvis, 2013). Currently available data on databanks like BOLD and GenBank are extensive for some groups, yet the vast majority of species on earth have not yet been barcoded, much less discovered and described taxonomically; each of these tasks is enormous.
Even for existing barcoding data, numerous sequences lack accurate taxonomic identification (Collins & Cruickshank, 2013), limiting their utility (e.g., only 58% of Araneae in BOLD are identified to species, and of those many are not correctly identified, as shown in our results; see also Shen, Chen & Murphy, 2013;Blagoev et al., 2016). Therefore, the identification of unknown specimens through blasting against BOLD or GenBank will be inaccurate if the databases lack close hits or contain errors. While the ideal database would allow species-level identification by containing barcodes from expertly identified and vouchered specimens of all species, we hypothesized that rapid surveys of well-known biotas can help quickly to build valuable tools allowing identification of larger clades such as genera and families.
Although we were careful to screen available barcode sequences from BOLD to produce a test library with as few errors as possible, it is certainly possible that errors remained, either due to mistakes in the lab or taxonomic identifications of vouchers. For example, Meta menardi (Tetragnathidae) blasted to Steatoda grossa (Theridiidae) at PIdent = 96, and