Delimiting species of Protaphorura (Collembola: Onychiuridae): integrative evidence based on morphology, DNA sequences and geography

Species delimitation remains a significant challenge when the diagnostic morphological characters are limited. Integrative taxonomy was applied to the genus Protaphorura (Collembola: Onychiuridae), which is one of most difficult soil animals to distinguish taxonomically. Three delimitation approaches (morphology, molecular markers and geography) were applied providing rigorous species validation criteria with an acceptably low error rate. Multiple molecular approaches, including distance- and evolutionary model-based methods, were used to determine species boundaries based on 144 standard barcode sequences. Twenty-two molecular putative species were consistently recovered across molecular and geographical analyses. Geographic criteria were was proved to be an efficient delimitation method for onychiurids. Further morphological examination, based on the combination of the number of pseudocelli, parapseudocelli and ventral mesothoracic chaetae, confirmed 18 taxa of 22 molecular units, with six of them described as new species. These characters were found to be of high taxonomical value. This study highlights the potential benefits of integrative taxonomy, particularly simultaneous use of molecular/geographical tools, as a powerful way of ascertaining the true diversity of the Onychiuridae. Our study also highlights that discovering new morphological characters remains central to achieving a full understanding of collembolan taxonomy.

or supplementary pseudocelli developed in one side of the body (Supplementary Tables S3 and S6). Results of examination of other minor additional characters were listed in Supplementary Table S6. ABGD generated 22 initial/primary partitions and 22-41 recursive/secondary partitions while prior intraspecific divergences (P) varied from 0.001 to 0.1079 (Fig. 1b). Rough "barcoding gaps" were observed at K2P distance of around 0.03, 0.045, 0.08 (Fig. 1a), corresponding to plateaus of 27, 23, 22 MOTUs (molecular operational taxonomical units) (Fig. 1b). Primary and secondary partitions were identical at 22 MOTUs (P > 0.064). Because primary partitions are stable on a wider range of prior values and are usually close to the hypothetical species number described by taxonomists 49 , the value 22 was finally selected as the result of ABGD (Fig. 3).
For the single phylogenetic tree, bPTP.py estimated 26-40 (mean 30.98) putative species with a best supported partition of 30 species. With multiple bootstrap trees as input, 25-36 (mean 29.15) species were estimated by PTP.py, 27 in highest bootstrap supported partition. All clusters under both PTP.py and bPTP.py models formed monophyletic clades with high node bootstrap supports (>0.75, Fig. 3).  Under the primary hypothesis of 10 morphospecies, 27,26,26,25,20,16,16,13 putative species were delimited using the distance threshold values 10, 20, 50, 100, 200, 300, 400, 500 kilometers respectively (Supplementary  Tables S4 and S5). The estimates for number of species with different geographical threshold values showed an initial plateau around 26 and a second at 20 with a steep decline when distances between populations more than 100 kilometers. Because of the long-term persistence within restricted geographical limits for Collembola 60 , the plateau implied a sign of dispersal limits for species. Considering the possibility of no apparent gene flow across . Summary from all species delimitation analyses on a Bayesian consensus tree. Node values represent likelihood bootstrap and posterior probabilities, respectively, with a-indicating nodes not compatible between the analyses. Colored bars represent hypothesized species groupings based on corresponding delimitation analyses. The geographical delimitation indicates the result of threshold of 50 kilometers. Colored clades on phylogeny are taxa morphologically unresolved. Number formulae above and below branches represent dorsal pseudocelli and ventral parapseudocelli, respectively. distances of tens of kilometers 60 , 26 putative species under a distance threshold of 50 kilometers were accepted for geographical criterion (Fig. 3).
When additional morphological characters were examined in combination with the traditional features, more morphological lineages were recognized, and most of them matched the 22-MOTUs partition. Among them, four twins cannot be distinguished on morphology: Gp 1 and Gp 5, Gp 10 and Gp 22, Gp 13 and Gp 17, Gp 19 and Gp 21 (Fig. 3). Finally, fourteen candidates were congruent across all three species validation criteria, including 2 known (Gp 9 and Gp 12), 6 new to science (Gp 4, Gp 6, Gp 14, Gp 15,Gp 16 and Gp 20), and the remaining which could not be designated a valid species name due to a lack of sufficient diagnostic information (e.g. Gp 11 and Gp 18 could not be further studied due to the lack of additional specimens; Gp 2, Gp 3, Gp 7 and Gp 8 have unclear taxonomical status due to the lack of detailed information available from similar type specimens from other collections) (Supplementary Tables S1-3). The morphological description of the six new species are given in the Supplementary files.

Discussion
In the present study, we combined three types of delimitation data (morphology, molecular markers and geography) to resolve the taxonomical problems in the genus Protaphorura. The genus is most abundant and widely distributed group of Onychiuridae in northeast China 33 , and this gives us a good chance to collect a great many species and analyze both morphological and genetic information from fresh materials. Molecular markers, as well as geography, proved effective for delimitating. The results from the combination of morphological and molecular analyses, or morphological and geographical analyses were mostly congruent for species delimitation. In particular, morphology with ABGD provided the highest accuracy and could be used as a standard operational workflow in future taxonomy of the Onychiuridae.
Furthermore, the geographical distances may also help to discriminate taxa among closely related species, when the molecular information is unavailable. Geography criterion works well for species delimitation because of the low dispersal capabilities and long-term persistence 60,61 , which result in high endemism and non-overlapping distribution. Geographical delimitation was mostly consistent with molecular approaches except delimitation for the widespread P. uniseta sp. nov. in the present study. Many collembolan species, which could be widely distributed in regions of large geographical scale, have been demonstrated to have cryptic diversity 62,63 . Therefore, geographic criterion should be available for most onychiurids. We don't encourage the single use of geographical distances to delimit species. Endemism extent and range of distributing area have not been estimated for most collembolan groups as well as most insects. As strategies implemented in this study, geographic criteria on the basis of preliminary morphological delimitation using simple morphological characters would be robust and accurate.
The inferences made here based on congruent results across several analyses are considered to be credible for species delimitation 37 . Several studies have combined the evidence from morphology and molecular markers in Collembola 55-58, 63, 64 . In our study, two new species delimitation methods based on molecular data (ABGD and PTP) were performed, and the results were mostly congruent. Coupled with the geographical evidence and using the Generalized Lineage Concept of species delimitation, the species of Protaphorura have been well discriminated here. Therefore, integrative taxonomy is essential for the accurate species identification of Onychiuridae, and it should be encouraged in the future taxonomical work.
Burkhardt's work 59 based on the results of molecular and morphological analyses supported Pomorski's conclusion 27 , that the position of pseudocelli is more important than their number for taxonomy, especially for the 'deformed' pseudocelli which are reduced or supplementary pseudocelli present on only one side of the body. In a limited number of 'deformed' pseudocelli cases in our study (5 in 108 individuals), the diagnostic value of the pseudocelli formulae was validated in a number of species with the application of integrative taxonomy. Although frequent variations in the number and position of pseudocelli on the dorsal body have been reported in several species 26,27 , we discovered here that the characters were relatively stable intraspecifically in the present 22 MOTUs which were both supported by the molecular and geographical evidence. Overall the results confirmed the importance of pseudocelli formulae for discriminating species in Onychiuridae 17,18 in the majority of cases.
Parapseudocelli, a long neglected structure in Onychiuridae, was first found by Rusek 28 and never reported in other groups of Collembola. It is usually located on the ventral side of body, subcoxae 1 of legs and anal valves, while on the dorsal side of the body it occurs in some genera of the tribe Hymenaphorurini Pomorski, 1996. The parapseudocelli formula was first used as diagnostic character by Weiner 65 and Pomorski 27,66 . However, parapseudocelli are sometimes difficult to see and stated by several authors as "not seen" or "invisible" rather than "absent" in the descriptions. Therefore, the practicality of using this character in species diagnosis was questionable and it has not been widely used. In the present study, parapseudocelli formulae were shown to be stable and had no variations at the intraspecific level of Protaphorura. Variation in the number of the parapseudocelli is not common in other groups of Onychiuridae 12, 29, 67-69 and for this reason they were proposed to be of great taxonomic values in species discrimination 70 . The instances of intraspecific variability reported for this character system before may correspond to inter-specific differences, because the variation reported referred to inter-populational rather than intra-populational comparisons 15,32 . Compared to ten morphospecies recognized by traditional diagnostic characters, seven more have been recognized by the addition "parapseudocelli formulae" in our study, thus confirming views of Sun & Li 70 on the diagnostic value of this character system. While useful, several MOTUs, which were well separated by evidence from DNA sequences and geography, still could not been separated by morphological characters. Thus, other minor characters, which may have differences at the species level, were also explored. The number of chaetae on Th. sternum II, was often recorded in previous studies on Onychiuridae and can be relatively stable intraspecifically 13,15,29,71,72 , although it has not yet been treated as an important diagnostic character. When closely related taxa in this study are compared, the number of chaetae on Th. sternum II enabled G4 to be differentiated from Gp 20 (although this feature was variable in a small percentage of the examined specimens) and here we propose that it could be used as a main diagnostic character. The axial chaetae on abodominal terga IV-VI and the number of chaetae on subcoxa 1 of legs I-III, which are relatively stable at the intraspecific level in many genera of Onychiuridae 12-14, 29, 68, 72 , were also checked. However, they had a large ratio of intraspecific variations and could not be used as diagnostic characters in Protaphorura.
Despite the findings of this study, inadequacies in the morphological taxonomy of the Protaphorura still exist as evidenced by the presence of four pairs of morphologically cryptic species (e.g. Gp 1 and Gp 5, Gp 10 and Gp 22, Gp 13 and Gp 17, Gp 19 and Gp 21). This deficiency is caused by a lack of sufficient morphological characters or the existence of cryptic species, and requires further investigation.

Material and Methods
Taxon sampling. Samples were collected from areas of high biodiversity in northeast China, including the Changbai Mountain Range, the Greater and Lesser Khingan Range and the Sanjiang Plain (Fig. 2). Specimens were collected by Berlese extraction, and stored in 99% ethanol at −20 °C. One hundred and forty-four individuals from 17 populations were selected for molecular analysis. In this study, a population was defined as individuals collected from sites within a 5 kilometer radius of each other. The samples were stored pending analysis in the Key Laboratory of Wetland Ecology and Environment, Northeast Institute of Geography and Agroecology, Chinese Academy of Sciences, Changchun (NEIGA), China and the Department of Entomology, College of Plant Protection, Nanjing Agricultural University, Nanjing (NJAU), China.
DNA extraction and sequencing. DNA was extracted using an Ezup Column Animal Genomic DNA Purification Kit (Sangon Biotech, Shanghai, China) following the manufacturer's standard protocols. Specimens were kept in 75% ethanol for further morphological examination after the 1h-lysis buffer was transferred to the pipette containing silica column. Extractions were performed non-destructively (small tissue biopsy) allowing further morphological examination of the specimens. Primers were LCO1490/HCO2198 commonly used for metazoa 73 . Amplification volume and procedure were listed in Zhang et al. 63 . All PCR products were checked on a 1% agarose gel. Successful products were purified and sequenced in both directions by Majorbio (Shanghai, China) on the ABI 3730XL DNA Analyser (Applied Biosystems). Sequences were assembled in Sequencher 4.5 (Gene Codes Corporation, Ann Arbor, USA), and deposited in GenBank (Supplementary Table S3). Sequences were blasted in GenBank and checked for possible errors, then were preliminarily aligned using MAFFT v7.149 by the Q-INS-I strategy 74 . Alignments were checked and corrected manually, with a final 658 bp for COI.

Phylogenetic analyses.
Neighbour-joining (NJ) trees and genetic distances were calculated in MEGA 5.0 75 , with the Kimura-2 parameter model (K2P, Kimura 1980) and pairwise deletion for gaps. Node supports of NJ tree were evaluated through 1000 bootstrap replications.
Phylogenetic trees were reconstructed using maximum likelihood (ML) and Bayesian inference (BI) on online CIPRES services 76 . Protaphorura armata (Tullberg, 1869) from Europe was selected as the outgroup. To avoid the incorrect likelihood calculation, identical sequences were removed from the analyses. Best-fitting substitution models were assessed under the AIC and BIC criteria in jModelTest 2.1.7 77 , and the GTR + I + Γ model selected for subsequently analyses. ML trees were reconstructed in RAxML v8.2.4 78 with the GTRGAMMAI model and 1000 bootstrap replicates. BI-analyses were conducted in MrBayes 3.2.6 79 . To avoid the problem of branch-length overestimation, the compound Dirichlet priors "brlenspr = unconstrained: gammadir (1, 1, 1, 1)" for branches lengths were incorporated 80 . The number of generations for the total analysis was set to 50 million, with the chain sampled every 5000 generations. The burn-in value was 25% and other parameters were set as default options. Evaluating effective sample size (ESS) values and state convergence were checked in Tracer 1.5 81 .
Species delimitation. Species delimitation was performed using three data, morphology, mitochondrial marker COI, and geography. Using multiple approaches provided an acceptable error rate for species delimitation of 0.027 35 .
Morphology. The specimens for morphological examination were cleared in lactic acid, mounted in Marc André II solution and then studied using a Nikon Eclipse 80i microscope. Primary discrimination for morphospecies relied on the most traditionally important diagnostic characters among Protaphorura species including the formula and the positions of the pseudocelli on the dorsal body and subcoxa 1 of the legs, the presence or absence of additional microchaetae on abdominal terga I-III and V, and the relative position of the prespinal microchaetae on abdominal tergum VI. Other minor additional characters were carefully examined for assessing their potential values including the number and arrangement of parapseudocelli, the number of ventral mesothoracic chaetae, the axial chaetae on abdominal terga IV-VI, and the number of chaetae on subcoxa 1 of legs I-III. Specimens which displayed bilateral asymmetry in key diagnostic features were considered deformed and not included in our analysis. Molecular approaches. Both distance-based (ABGD) and evolutionary model-based (PTP) methods were employed for the single mitochondrial marker COI without a priori species hypothesis. Prior intraspecific divergence varied from 0.001 (Pmin, a single nucleotide difference) to 0.14 (Pmax, threshold value applied in Collembola of Churchill 64 ). Relative gap width was set to 1, with 20 recursive steps, 40 bids for graphic histogram of distances, K2P model for distance calculation and other parameters as default. An unrooted ML tree was generated in RAxML v8.2.4 with the GTRGAMMA model and 1000 bootstrap replicates. Identical sequences were removed and Bayesian PTP (bPTP) delimitation was performed on single unrooted tree in python script bPTP.py v0.51 50 . A total of 5 × 10 5 generations were run with first 20% as burn-in. Maximum likelihood delimitation was also calculated in PTP.py 2.2 given 1000 trees derived from RAxML bootstrap analysis.
Geographical criterion. Because of the very low dispersal capabilities of springtails (except for some species which are in maritime islands and nunataks), vicariance and long-term persistence can occur within restricted geographical limits, even across distances of tens of kilometers 60,61,82 . Putative geographical species were delimited on the basis of primary hypotheses of morphospecies. For one morphospecies occurring in several geographical sites, populations, separated by distances greater than the hypothetical values (10-500 kilometers), were recognized as distinct putative species.
Species validation criteria. We used Generalized Lineage Concept (GLC) to determine the species boundary from our data. In this study, the operational criteria for final species recognition are conservative with trusted delimitations congruent across all analyses 37 : (i) monophyletic lineage (GLC) confirmed by phylogenetic inferences (it is also the a priori species hypothesis hidden in PTP); (ii) congruent across geographical and molecular delimitations; (iii) reliable morphological diagnosis with characters stably differing from other lineages. Lineages which met all of the above criteria were given a formal biological scientific name. Different lineages consistent across molecular and geographical analyses but of morphological stasis were considered cryptic species.