Assembling a DNA barcode reference library for the spiders (Arachnida: Araneae) of Pakistan

Morphological study of 1,795 spiders from sites across Pakistan placed these specimens in 27 families and 202 putative species. COI sequences >400 bp recovered from 1,782 specimens were analyzed using neighbor-joining trees, Bayesian inference, barcode gap, and Barcode Index Numbers (BINs). Specimens of 109 morphological species were assigned to 123 BINs with ten species showing BIN splits, while 93 interim species included representatives of 98 BINs. Maximum conspecific divergences ranged from 0–5.3% while congeneric distances varied from 2.8–23.2%. Excepting one species pair (Oxyopes azhari–Oxyopes oryzae), the maximum intraspecific distance was always less than the nearest-neighbor (NN) distance. Intraspecific divergence values were not significantly correlated with geographic distance. Most (75%) BINs detected in this study were new to science, while those shared with other nations mainly derived from India. The discovery of many new, potentially endemic species and the low level of BIN overlap with other nations highlight the importance of constructing regional DNA barcode reference libraries.


Introduction
With nearly 48,000 known species in 117 families [1], spiders are a major component of terrestrial ecosystems with important practical applications as biocontrol agents [2] and as bio-indicators [3,4]. Prior studies have documented 4,300 spider species in Europe [5] and a similar number (3,800) in the Nearctic [6]. By contrast, just 2,300 species have been reported from South Asia [7], suggesting that many species await detection in this region. Although studies on the spider fauna of Pakistan began nearly a century ago [8] and work has recently intensified, most of these studies have produced regional checklists (S1 Table). These publications often employ invalid or incorrect species names or only identify specimens to family [9], compromising their value [10][11][12]. It is likely that many species reported as new discoveries from Pakistan [13] await description. For example, in her dissertation research on spiders of Punjab, Parveen [13] reported the discovery of 33 new species but only one has been formally described [9]. Examination of prior taxonomic work (S1 Table) indicates that just 400 species of spiders have been documented from Pakistan. Considering the country's diverse ecosystems [14], this count must seriously underestimate the true diversity of its fauna given the much higher numbers reported for India (1686) [15] and Iran (528) [16]. The limited knowledge of the spider fauna of Pakistan is a particular example of the barrier to our general understanding of spider biodiversity in a global context, a factor compromising both scientific progress and conservation efforts [17]. The poor documentation of spider diversity of Pakistan reflects, in part, the paucity of taxonomic specialists working on the group [18]. Moreover, spiders pose a challenge for morphological approaches because cryptic species are common [19], and sexual dimorphism is often striking [20]. DNA barcoding [21] provides an alternate approach to identifications.
It employs sequence diversity in a standard gene region (COI-5′) to discriminate both morphologically cryptic species and all life stages, even for species with sexual dimorphism [22,23]. Although concerns about the use of a single marker [24,25] or discordance between the barcode and other gene regions [26] have been voiced [27], the advantages of employing a single standard gene region for DNA barcoding are now very well established [28]. Fifteen years after its introduction, this approach has demonstrated its effectiveness in discriminating species in diverse groups, including spiders [29][30][31][32][33][34].
The use of DNA barcoding for specimen identification and species discovery is greatly facilitated by BOLD, the Barcode of Life Data System (http://www.boldsystems.org). This informatics platform assembles specimen metadata and sequences and provides tools to facilitate data analysis and publication [35]. It also enables species discrimination by assigning each COI sequence cluster to a Barcode Index Number (BIN) [33]. By assigning sequences from unidentified specimens to a species proxy [44], the BIN system has greatly augmented the application of barcode data in groups where taxonomic knowledge is poor. These barcode libraries are, in effect, forming the foundation for a global "DNA library of life" [55].
At present, BOLD holds 6.8 million records derived from specimens representing 587,000 BINs (accessed 13 April 2019). This total includes 117,000 records from spiders that have been assigned to more than 10,000 BINs. Past work on spiders has had varied motivations [39][56][57][58][59][60], but just two prior studies have aimed to construct a comprehensive DNA barcode library for a national fauna: Canada [61] and Germany [62]. The need for similar work in other regions is evident, particularly in South Asia. For example, barcode records are only available for 73 species of spiders from India [35,63] and for 41 species from Pakistan [64][65][66]. The current study aimed to develop a barcode library for the spider fauna of Pakistan and investigate the spider diversity overlap with other regions using BINs. The study addresses the gap for reference data in the country by expanding DNA barcode coverage for Pakistan to 202 species.

Ethics statement
No specific permissions were required for this study. The study did not involve endangered or protected species.

Spider collection
From 2010 to 2016, 1,795 spiders were collected at 225 sites in Pakistan (Fig 1). Each spider was provisionally identified by collectors in Pakistan before it was sequenced for the barcode region of the mitochondrial COI gene [21]. GB subsequently validated and refined identifications by examining representative specimens from each barcode cluster, including genitalic dissections. Generic and species assignments generally followed taxonomic publications on Asian spiders (S1 Table), but nomenclature was updated as required to follow the World Spider Catalog [1]. Collection data, a photograph, and a taxonomic assignment for each specimen are available in the public dataset, "DS-MASPD DNA barcoding spiders of Pakistan" (dx.doi.org/10.5883/DS-MASPD) on BOLD. The 1,795 specimens are held in three repositories: Centre for Biodiversity Genomics, University of Guelph, Guelph, Canada (585); National Institute for Biotechnology and Genetic Engineering, Faisalabad, Pakistan (1,126); University of Sargodha, Sargodha, Pakistan (84). The location of any particular specimen is reported in the dataset.

Data analysis
All sequences were submitted to BOLD (DS-MASPD) where those meeting required quality criteria (>507 bp, <1% Ns, no stop codon or contamination flag) were assigned to a BIN [36]. An accumulation curve, BIN discordance, genetic distance analysis, barcode gap analysis (BGA), and geo-distance correlation were determined using analytical tools on BOLD. The Accumulation Curve plots the rise in the number of BINs with increased sampling effort, making it possible to ascertain if asymptotic diversity has been reached. The BGA determines if the maximum sequence divergence within members of a species or BIN is less than the distance to its Nearest-Neighbor (NN) species or BIN, a condition required for unambiguous identification [71,72]. The geo-distance correlation ascertains the correlation between geographic distance and genetic distance in each species or BIN employing two methods. The Mantel Test [73] examines the relationship between the geographic distance (km) and genetic divergence (K2P) matrices. The second approach compares the spread of the minimum spanning tree of collection sites and maximum intra-specific divergence [61]. The relationship between geographic and intraspecific distances was analyzed for each species with at least one individual from three or more sites. The analysis included all the conspecific records public on BOLD.
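The barcode gap criterion described above can be illustrated with a minimal Python sketch. It is not the BOLD implementation, whose internals are not described here; the function name `barcode_gap`, the input layout, and the toy distances are all assumptions made for illustration. For each species (or BIN) label it reports the maximum intraspecific distance, the nearest-neighbor distance (the smallest distance to any specimen carrying a different label), and whether the former is smaller than the latter.

```python
from itertools import combinations


def barcode_gap(species_of, dist):
    """Compare maximum intraspecific distance with nearest-neighbor distance.

    species_of: dict mapping specimen ID -> species (or BIN) label.
    dist: dict mapping frozenset({id_a, id_b}) -> pairwise distance (%).
    Returns {label: (max_intra, nn_dist, has_gap)}. max_intra is None for a
    label with a single specimen; nn_dist is None if no other label exists.
    """
    max_intra, nearest = {}, {}
    for a, b in combinations(sorted(species_of), 2):
        d = dist[frozenset((a, b))]
        label_a, label_b = species_of[a], species_of[b]
        if label_a == label_b:
            # Same label: track the largest within-label distance.
            max_intra[label_a] = max(max_intra.get(label_a, 0.0), d)
        else:
            # Different labels: track the smallest distance to another label.
            for label in (label_a, label_b):
                nearest[label] = min(nearest.get(label, float("inf")), d)

    report = {}
    for label in sorted(set(species_of.values())):
        intra, nn = max_intra.get(label), nearest.get(label)
        has_gap = intra is not None and nn is not None and intra < nn
        report[label] = (intra, nn, has_gap)
    return report


# Hypothetical toy data (distances in %), not values from this study.
species_of = {"s1": "sp_A", "s2": "sp_A", "s3": "sp_B", "s4": "sp_B"}
dist = {
    frozenset(("s1", "s2")): 0.5,
    frozenset(("s1", "s3")): 6.0,
    frozenset(("s1", "s4")): 5.0,
    frozenset(("s2", "s3")): 6.5,
    frozenset(("s2", "s4")): 5.5,
    frozenset(("s3", "s4")): 5.2,
}
for label, row in barcode_gap(species_of, dist).items():
    print(label, row)
```

In this toy input, sp_A shows a barcode gap while sp_B does not, because its maximum intraspecific distance exceeds the distance to its nearest neighbor. Labels with a single specimen are reported without a gap because no intraspecific distance can be computed for them.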
A neighbor-joining (NJ) tree was generated in MEGA5 using the Kimura-2-Parameter (K2P) [74] distance model along with pairwise deletion of missing sites. Nodal support on the NJ tree was estimated by 1000 bootstrap replicates. Bayesian inference (BI) was performed in MrBayes v3.2.0 [75] using representative sequences of the 221 BINs and employing Phalangium opilio (Arachnida: Opiliones) and Galeodes sp. (Arachnida: Solifugae) as outgroups. The data were partitioned in two ways: (i) a single partition with parameters estimated across all codon positions, and (ii) a codon partition in which each codon position was allowed different parameter estimates. Sequence evolution was modelled by the GTR+Γ model independently for the two partitions using the "unlink" command in MrBayes. Analyses were run for 10 million generations using four chains with sampling every 1000 generations and the BI trees were obtained using the Markov Chain Monte Carlo (MCMC) technique. Posterior probabilities were calculated from the sample points once the MCMC algorithm converged. Convergence was determined when the standard deviation of split frequencies was less than 0.022 and the PSRF (potential scale reduction factor) approached 1, and both runs converged to a stationary distribution after the burn-in stage (the first 25% of samples were discarded by default). The resultant trees were visualized in FigTree v1.4.0. The NJ and Bayesian analyses were employed to assess support for the BINs detected in this study, not to reconstruct the phylogeny of Araneae.
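For readers unfamiliar with the K2P model, the following Python sketch shows how a single pairwise distance can be computed from two aligned sequences. It is an independent illustration rather than the MEGA5 code used in this study; the function name `k2p_distance` is an assumption, and pairwise deletion is approximated here by skipping any site where either sequence has a character other than A, C, G, or T. With P and Q the proportions of transitional and transversional differences, the distance is d = -0.5 ln[(1 - 2P - Q) sqrt(1 - 2Q)].

```python
import math

PURINES = {"A", "G"}
UNAMBIGUOUS = {"A", "C", "G", "T"}


def k2p_distance(seq_a, seq_b):
    """Kimura 2-parameter distance between two aligned sequences.

    Sites where either sequence has a gap or ambiguity code are skipped
    (pairwise deletion). Raises ValueError if no site can be compared or if
    the sequences are too divergent for the logarithm to be defined.
    """
    transitions = transversions = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in UNAMBIGUOUS or b not in UNAMBIGUOUS:
            continue  # pairwise deletion of missing or ambiguous sites
        compared += 1
        if a == b:
            continue
        if (a in PURINES) == (b in PURINES):
            transitions += 1    # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1  # purine<->pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


# Hypothetical 10-bp fragments differing by one transition (A<->G).
print(round(k2p_distance("AAAAAAAAAA", "GAAAAAAAAA"), 4))
```

Multiplying the result by 100 gives the percentage values in which barcode divergences are usually reported.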

Results
Coupling of the DNA sequence results with detailed morphological analysis made it possible to assign 1,574 of the 1,795 barcoded specimens to one of 109 species, but the other 221 specimens could only be placed into one of 93 interim species. Collectively, these specimens included representatives of 27 families, 113 genera, and 202 species (Table 1). Most species were only represented by a single sex, usually females. Two-thirds (1,256) of the specimens were immatures that lacked the diagnostic characters required for species assignment. However, their DNA barcodes allowed them to be linked to adults whose identification was established through morphology. Four families (Amaurobiidae, Atypidae, Ctenidae, Segestriidae), 43 genera, and 74 species identified here represent first records for Pakistan (Tables 1 and S1). As adults from 12 of the 93 interim species possessed clear morphological differences from any known species in their genus, they are likely new to science (Table 1).
As the accumulation curve failed to approach an asymptote (Fig 2), it is certain that more species await detection. Although one species (Artema transcaspica) failed to qualify for a BIN assignment because its only sequence was too short, the other 108 morphological species were assigned to 123 BINs with 10 species showing a split to two or more BINs (Table 1 and Fig 3). The 93 interim species were allocated to 98 BINs with three showing BIN splits (Table 1), making the total BIN count 221, with 94 of them singletons. NJ clustering (Fig 3) and Bayesian inference (Fig 4) supported the monophyly of all 221 BINs. Barcode distances (K2P) varied for differing taxonomic ranks with conspecific values ranging from 0.0-5.3% (mean = 0.8%), congenerics from 2.8-23.2% (mean = 8.8%), and confamilials from 4.3-26.7% (mean = 15.1%) (Table 2). Excepting 14 species, maximum intraspecific divergences did not exceed 2% in the 90 species that were represented by two or more specimens (Table 1). The barcode gap analysis showed that maximum intraspecific distance for all but one of the 90 species with two or more records was less than its NN distance (Oxyopes azhari was the exception, overlapping with Oxyopes oryzae) (Fig 5). The Mantel test was non-significant (P>0.01) for 60 of the 69 species and the regression line for all species showed a weak positive relationship (R² = 0.08; y = 0.0003x + 2.62) (Fig 6). The similarity between the spider fauna in Pakistan and that of other nations was calculated by examining BIN overlap. Less than a quarter (52/221) of the BINs from Pakistan were represented among the 10,229 spider BINs reported in prior studies. As expected, the highest overlap (23%) was with India, but the proportion of shared BINs was far lower for the other 43 countries (Fig 7).
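The logic of a Mantel test, as applied to geographic and genetic distance matrices, can be sketched in Python as follows. This is a generic one-sided permutation version written for illustration; the settings used by the BOLD tool (number of permutations, correlation statistic, sidedness) are not stated here, so the defaults below, the function names, and the toy coordinates are assumptions. Rows and columns of one matrix are permuted together, and the P-value is the fraction of permutations whose correlation is at least as large as the observed one.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (non-constant input)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def upper_triangle(matrix, order):
    """Upper-triangle entries after reordering rows and columns by `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def mantel(geo, gen, permutations=999, seed=0):
    """One-sided Mantel test; returns (observed r, permutation P-value)."""
    identity = list(range(len(geo)))
    gen_vec = upper_triangle(gen, identity)
    observed = pearson(upper_triangle(geo, identity), gen_vec)
    rng = random.Random(seed)
    hits = 1  # the observed arrangement counts as one permutation
    for _ in range(permutations):
        perm = identity[:]
        rng.shuffle(perm)
        if pearson(upper_triangle(geo, perm), gen_vec) >= observed:
            hits += 1
    return observed, hits / (permutations + 1)


# Hypothetical example: five sites on a line (km), with genetic distance
# strictly proportional to geographic distance.
coords = [0, 1, 3, 6, 10]
geo = [[abs(a - b) for b in coords] for a in coords]
gen = [[0.01 * d for d in row] for row in geo]
r, p_value = mantel(geo, gen)
print(round(r, 3), p_value)
```

Because whole rows and columns are permuted rather than individual cells, the test respects the non-independence of pairwise distances that share a specimen or site.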

Discussion
Most prior work on the spider fauna of Pakistan has had a regional focus and only employed morphological approaches. For example, 157 species were reported from the province of Punjab [9], 56 from the district of Sargodha [76], 23 from Peshawar [11], and 13 from Buner [77]. A recent checklist for the spiders of Pakistan [10] included records for 239 species, but the present study has substantially increased this total by adding first records for 84 described species and another 93 that could not be assigned to a known taxon. Most importantly, this study generated a DNA barcode reference library for 202 species, facilitating their future identification.
Because the spider fauna of Pakistan has seen such limited study, the discovery of new species was not unexpected, and follows a pattern seen for spiders in other regions. For example, the analysis of 80 species of Salticidae from Papua New Guinea revealed 34 species and five genera new to the country [78]. Likewise, 6% of the 136 spider species recovered from the Northern Cape Province, South Africa were new [79]. This study employed a mix of methods for spider collection, including beating, sweeping, and pitfalls. The choice of sampling method impacts species detection [80] and extensive sampling is critical to generate comprehensive species coverage [81]. Although the present study involved collections at 225 sites, the resultant species accumulation curve did not reach an asymptote, indicating that many more species await detection. The present study revealed a close correspondence (93%) between BINs and morphospecies as 188 of the 202 species were assigned to a unique BIN, reinforcing a pattern seen in other groups [37,38,40]. For example, the concordance between BINs and species was 78% in a study that examined 30,000 Canadian spiders representing 1,018 species [61] with most discordances reflecting BIN splits suggestive of overlooked species. Strong species-BIN correspondence has also been reported in several insect groups: 96% for Erebidae (Lepidoptera) from the Iberian Peninsula [38], 94% for tiger moths from Brazil [82] and 92% for beetles from central Europe [40]. However, some arthropod groups have shown a relatively low level of species-BIN concordance; for example, orthopterans in Central Europe (76%) [83], water striders in Germany (82%) [84] and katydids in China (75%) [85]. Thirteen (6%) species in this study were assigned to two or more BINs (BIN splits), and one species (Plexippus paykulli) was assigned to five. BIN splits often indicate the presence of a species complex [43].
For example, 13% of 1,018 species of Canadian spiders [61], 13% of 1,541 Canadian Noctuoidea [86], 5.7% of 1,872 Finnish beetles [87], and 20% of 62 global mealybugs [88] possessed BIN splits. Although in most cases the subsequent morphological investigation has revealed overlooked species [89], other factors can cause BIN splits/mergers, such as hybridization [90], incomplete lineage sorting [83], or rapid speciation [91].
K2P divergences >2% were found in 14 of the 202 spider species from Pakistan with a maximum value of 5.3%. There was, however, no significant relationship between intraspecific divergence and the number of specimens analyzed. For example, 12 specimens of Crossopriza maculipes showed 5.3% divergence and were assigned to three BINs while 160 specimens of Neoscona theisi possessed a maximum divergence of 2.5%. High COI divergence is not uncommon in spiders. For example, the maximum intraspecific divergence in 561 spider species from Germany was 10.1%, but it was below 2.5% in 95% of the cases with an arithmetic mean of 0.7% [62]. The divergence could depend on several factors such as the number of specimens analyzed, the number of localities, the geographic distance between them and the dispersal capabilities of the particular species [92,93]. With the exception of a single species (Oxyopes azhari), high conspecific distances did not impede the capacity of DNA barcodes to discriminate the species encountered in our study. However, species with BIN splits and high divergences are likely to represent cryptic species complexes. Preliminary morphological analyses, including genitalic dissections of specimens from taxa with BIN splits in this study, reinforced this conclusion.
Correlation analysis revealed only a weak relationship between the geographic range of the species examined in this study and their intraspecific divergence value. The Mantel test was significant for a few (13%) species, but species identification was not impeded as maximum intraspecific distances were nearly always less than NN distances. Similar results have been reported for Lepidoptera from Europe [94], Pakistan [32] and Central Asia [95]. Although a study that examined a single tribe, Agabini, of aquatic beetles in Europe [96] argued that regional divergences were so great as to obscure species assignments, this result is clearly not the rule [72].
Because BINs are generally an effective species proxy [41], we used them to assess faunal overlap. This work revealed that most (76%) BINs detected in this study were first records. Just 52 BINs have records from other nations and 13 of these were shared only with India. The BIN overlap with other nations was considerably lower for the spiders (24%) of Pakistan than for its Lepidoptera (42%) [42], but this difference almost certainly reflects the intensive barcode studies on the latter group. Although DNA barcoding has been used to assess regional biodiversity [41,47] and to ascertain species connections [42], the limited data availability complicates interpretation. Although further sampling will add new BINs, it is also likely to raise BIN overlap with other regions, improving our understanding of faunal overlap. Such efforts to better document local biodiversity are also certain to reveal new species as evidenced by the discovery of 93 taxa in this study that could not be assigned to a known species.
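The overlap calculation itself reduces to set intersection, as the short Python sketch below illustrates. The function name `bin_overlap`, the input layout, and the BIN labels are hypothetical placeholders, not records from BOLD. Applied to the counts reported here, 52 shared BINs out of 221 corresponds to an overall overlap of about 23.5%.

```python
def bin_overlap(local_bins, bins_by_country):
    """Share of local BINs that also occur elsewhere.

    local_bins: iterable of BIN identifiers found in the focal country.
    bins_by_country: dict mapping country name -> iterable of BIN identifiers.
    Returns (per_country, overall): the fraction of local BINs shared with
    each country, and the fraction shared with at least one country.
    """
    local = set(local_bins)
    if not local:
        raise ValueError("local_bins is empty")
    shared_any = set()
    per_country = {}
    for country, bins in bins_by_country.items():
        shared = local & set(bins)
        per_country[country] = len(shared) / len(local)
        shared_any |= shared  # a BIN shared with several countries counts once
    return per_country, len(shared_any) / len(local)


# Hypothetical placeholder identifiers, not real BINs.
local = {"BIN_1", "BIN_2", "BIN_3", "BIN_4"}
elsewhere = {
    "Country X": {"BIN_1", "BIN_9"},
    "Country Y": {"BIN_1", "BIN_2"},
}
per_country, overall = bin_overlap(local, elsewhere)
print(per_country, overall)
```

Counting each shared BIN once in the overall figure matters because the per-country fractions can sum to more than the overall overlap when a BIN occurs in several countries.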