Improving eDNAbased protist diversity assessments using networks of amplicon sequence variants

Effective and precise grouping of highly similar sequences remains a major bottleneck in the evaluation of high-throughput sequencing (HTS) datasets. Amplicon sequence variants (ASVs) offer a promising alternative that may supersede the widely used operational taxonomic units (OTUs) in environmental sequencing studies. We compared the performance of a recently developed pipeline based on the algorithm DADA2 for obtaining ASVs against a pipeline based on the algorithm SWARM for obtaining OTUs. Illumina-sequencing of 29 individual ciliate species resulted in up to 11 ASVs per species, while SWARM produced up to 19 OTUs per species. To improve the congruency between species diversity and molecular diversity, we applied sequence similarity networks (SSNs) for second-level sequence grouping into network sequence clusters (NSCs). At 100% sequence similarity in SWARM-SSNs, NSC numbers decreased from 7.9-fold overestimation without abundance filter, to 4.5-fold overestimation when an abundance filter was applied. For the DADA2-SSN approach, NSC numbers decreased from 3.5-fold to 3-fold overestimation. Rand index cluster analyses predicted best binning results between 97% and 94% sequence similarity for both DADA2-SSNs and SWARM-SSNs. Depending on the ecological questions addressed in an environmental sequencing study with protists we recommend ASVs as replacement for OTUs, best in combination with SSNs. This article is protected by copyright. All rights reserved.


Introduction 35
Ever since high-throughput-sequencing (HTS) has been introduced in molecular ecology, 36 researchers have been looking for effective tools to evaluate the resulting sequence data in the 37 context of diversity measures. The standard way of addressing this issue is to group sequencing reads 38 obtained for example from environmental samples, into operational taxonomic units (OTUs; e.g. de 39 Vargas et al. 2015; Stoeck et al. 2010). This grouping can be achieved either by relying on global 40 clustering scores (Schloss et al., 2009;Edgar, 2010;Fu et al., 2012;Rognes et al., 2016), or on local 41 clustering scores . While traditional clustering relies on fixed global clustering 42 thresholds expressed e.g. as sequence similarity between aligned sequences, local clustering allows 43 for a more fine-tuned evaluation by comparing all local pairs of nucleotides between the sequences 44 and iteratively grouping them into OTUs. It is well known, though, that every kind of OTU is at best a 45 bioinformatical approximation of a species Tikhonov et al., 2015) and 46 that we are far away from the one OTU -one species ideal. Currently available sequence grouping 47 methods tend to allocate sequencing reads of the same species into multiple OTUs. Even though 48 sequencing errors (Kunin et al., 2010) and intraspecific genetic heterogeneity may also contribute to 49 diversity inflation in environmental HTS datasets, imprecise sequence grouping is the main cause of 50 severe biodiversity overestimations (Flynn et al., 2015;Clare et al., 2016). Likewise, imprecise 51 sequence grouping may also lead to biodiversity underestimations, when sequences of different 52 species are grouped into the same OTU (Bachy et al., 2013;Grattepanche et al., 2014). This 53 emphasizes the need for a more accurate grouping of sequences for providing more realistic 54 estimates of species richness and diversity within a sample. 55 DADA2-derived ASV and Swarm-derived OTU results to further improve the degree to which the 139 expected species diversity can be obtained. 140 141

Results 142
First-level sequence grouping 143 Sequence grouping of the 29 species-specific ciliate datasets with the DADA2 pipeline 144 resulted in 101 different ASVs (Table 1). This represented a 3.5-fold overestimation of the species 145 diversity in the samples. For the species Dexiotricha sp., Metanophrys sp., Trithigmostoma cucullulus 146 and Vorticella sp., DADA2 produced one ASV per species. For all other species, sequence grouping 147 with DADA2 resulted in two or more ASVs, eight of which yielded 5 or more ASVs. The most ASVs 148 were obtained for Stentor coeruleus (11). There was a moderate linear relationship between the 149 abundance of reads and the resulting number of ASVs in each species-specific dataset (R 2 =0.51, p-150 value < 0.001). 151 Sequence grouping of the 29 species-specific ciliate datasets with the SWARM pipeline 152 resulted in 229 different OTUs (Table 1), overestimating the species diversity of the samples by a 153 factor of 7.9. Trithigmostoma cucullulus was the only species for which SWARM produced one OTU 154 per species. Two OTUs were obtained for the species Tetrahymena sp. and Tokophrya infusionum. 22 155 of the species produced 5 or more OTUs, with most obtained for Folliculina sp. (18) and Spathidium 156 ascendens (19). There was a weak linear relationship between the abundance of reads and the 157 resulting number of OTUs in each species-specific dataset (R 2 =0.11, p-value=0.045). 158 In direct comparison, DADA2 produced 2.3-times fewer ASVs than SWARM produced OTUs 159 (Table 1). We observed a weak linear relationship between the overestimation of DADA2 and the 160 overestimation of SWARM across all species (R 2 =0.31, p-value=0.001). However, Folliculina sp. and 161 Spathidium ascendens , the two species with the largest overestimation in SWARM, resulted in only 7 162 and 4 ASVs in DADA2, respectively. By contrast, Stentor coeruleus, the species with the largest overestimation in DADA2, also resulted in 12 SWARM-OTUs. For three species the number of ASVs 164 and OTUs was identical (Tetrahymena sp., Trithigmostoma cucullulus, Tokophrya infusionum). For the 165 remaining 26 species fewer ASVs were produced than OTUs. There was not a single species for which 166 more OTUs than ASVs were produced. 167 168 Quantitative evaluation of second-level sequence grouping 169 Sequence clusters originating from second-level sequence grouping are generally designated 170 as network sequence clusters (NSCs; Figure 1). When resulting from DADA2-ASVs, these NSCs are 171 further described as ASV-NSCs, while NSCs resulting from SWARM-OTUs are further described as 172 OTU-NSCs. The sequence similarity network approach successfully reduced the amount of sequence 173 clusters from first-level sequence grouping in both DADA2 and SWARM. The reduction was higher at 174 lower sequence similarity binning levels, since more pairs of sequences could be connected in the 175 network, thus producing larger NSCs (i.e. comprising more first-level ASVs or OTUs, respectively; see 176 Figure 2 and Table S1). At a binning level of 99% similarity, SSNs could reduce the amount from 101 177 ASVs to 65 ASV-NSCs (35.6% reduction). While no reduction was possible for SWARM-derived OTUs 178 at the 99% binning level, the 229 SWARM-derived OTUs were reduced by more than half (53.3% 179 reduction) to 107 OTU-NSCs at the next-lower binning level of 98% similarity. DADA2-derived ASVs 180 were also reduced by more than half (51.5% reduction) at 98% similarity. At the lowest tested 181 binning level of 90%, SSNs could reduce the amount from 101 DADA2-derived ASVs to 16 ASV-NSCs 182 (84.2% reduction) and from 229 SWARM-derived OTUs to 15 OTU-NSCs (93.4% reduction). 183 Overall, the results of the DADA2-SSN combination were notably closer to the ideal of one 184 NSC per species than the results of the SWARM-SSN combination ( Figure 2, Table S1). But for binning 185 levels lower than 95% similarity, the combination of SSNs with either DADA2 or SWARM produced 186 similar amounts of NSCs. Except for the SWARM-SSN results at 95% similarity, which matched the 187 ideally expected diversity of 29 species, all of these analyses underestimated the expected amount of 188 diversity. For most lower binning levels, SWARM-SSNs resulted in fewer NSCs than DADA2-SSNs, thus 189 diverging stronger from the expected amount of species than DADA2-SSNs (exceptions were 91% and 190 95% similarity). At all binning levels higher than 95% similarity, the SWARM-SSN combination 191 produced distinctively more NSCs than the DADA2-SSN combination. At 96% similarity, the DADA2-192 SSN results (29 ASV-NSCs) matched the ideally expected diversity of 29 species. Up to similarity 193 binning levels of 98% similarity, the amounts of ASV-NSCs moderately increased (up to 49 NSCs, 194 meaning a 1.7-fold overestimation of the expected diversity). 99% similarity is the first level at which 195 more than twice as much ASV-NSCs were found as species were analyzed (65 ASV-NSCs, 2.2-fold 196 overestimation). The results at the 100% binning level were equivalent to the first-level sequence 197 grouping results of DADA2 without SSNs. By contrast, the amount of OTU-NSCs in the SWARM-SSN 198 approach increased much stronger at higher binning levels. When the binning level was set to 97% 199 similarity, which is a commonly used threshold for sequence grouping of ciliate data, the SWARM-200 SSN approach produced 54 OTU-NSCs, which nearly doubles the expected diversity of the dataset 201 (1.9-fold overestimation). At 98% similarity, 107 OTU-NSCs were produced, which exceeded the 202 number of ASVs produced by DADA2 without a second-level grouping in SSNs (Figure 2). 203 The application of an abundance filter could further decrease the overestimation of diversity 204 by the second-level sequence grouping ( Figure S1, Table S2). For each species-specific dataset, we 205 discarded every OTU that accounted for less than 0.01% of the total read abundance in that dataset. 206 Thereby, there was only little effect on the SSN results at binning levels lower than 95% similarity. At 207 higher binning levels, the effect was more apparent on the outcome of the SWARM-SSN approach. 208 The diversity overestimation at the 100% similarity binning level decreased from a 7.9-fold 209 overestimation without abundance filter, to a 4.5-fold overestimation when the abundance filter was 210 applied. For the DADA2-SSN approach, the diversity overestimation decreased at the same binning 211 level from 3.5-fold without abundance filter to 3-fold with abundance filter. 212 213 Qualitative evaluation of second-level sequence grouping Species-specific datasets were used for first-level sequence grouping to preclude DADA2-215 derived ASVs or SWARM-derived OTUs that contained sequences from more than one dataset ( Figure  216 1). The preclusion did not apply for SSN analyses since representative sequences of either ASVs or 217 OTUs from all species were pooled for all-versus-all pairwise sequence analyses during second-level 218 sequence grouping. Therefore, we distinguished between three different types of NSCs ( Figure 1 should not be confounded with the terminology used for defining reference databases in some OTU 226 clustering methods (for more information on the latter see e.g. Bik et al., 2012). Figures 3A and B  227 illustrate the results at each binning level from 90 to 100% sequence similarity (see also Table S3). 228 Independent of the first-level sequence grouping algorithm, most closed NSCs were produced at 229 intermediate binning levels. Except for 91% similarity, the DADA2-SSN approach produced at each 230 binning level more closed NSCs than the SWARM-SSN approach. The maxima of 24 closed ASV-NSCs 231 and 20 closed OTU-NSCs were detected at a binning level of 94% similarity. The least number of 232 closed NSCs was in both cases detected at the 100% similarity level (4 closed ASV-NSCs, 1 closed 233 OTU-NSCs). Most hybrid NSCs were produced at low similarity binning levels. At 90% similarity we 234 detected maxima of 4 hybrid ASV-NSCs and OTU-NSCs. In general, both approaches produced similar 235 numbers of hybrid NSCs at each binning level, with decreasing numbers of hybrid NSCs at increasing 236 similarity binning levels. For the DADA2-SSN approach one hybrid NSC was consistently observed 237 even at binning levels up to 99% similarity; for the SWARM-SSN approach no more hybrid NSCs were 238 observed at binning levels higher than 97% similarity. By contrast, increasing similarity binning levels 239 starts at 93% similarity (RI = 0.9871) and reaches until 97% similarity (RI = 0.9907). For SWARM-SSNs, the plateau phase is shorter and reaches from 95% similarity (RI = 0.9873) to 97% similarity (RI = 268 0.985). At binning levels higher than 97%, RI values slowly start to decrease, while ARI values 269 drastically decrease. For both approaches, RI values were lowest at the 90% similarity and ARI values 270 were lowest at the 100% similarity binning level. 271 Applying an abundance-filter affected the quantitative output of the second-level sequence 272 grouping stronger than the qualitative output. The removal of low abundant DADA2-derived ASVs or 273 SWARM-derived OTUs resulted in first place in a general decrease of open NSCs, which was coupled 274 to a slight increase of closed NSCs at similarity binning levels of 97% and higher ( Figures S2A and B). 275 But this did not lead to different trends comparing RI values calculated from abundance-filtered and 276 non-abundance-filtered data ( Figure S3). There was, however, a trend towards higher ARI values for 277 the DADA-SSN approach and lower ARI values for the SWARM-SSN approach compared between 278 abundance-filtered and non-abundance-filtered data. 279 We also tested the complete sequence grouping workflow for DADA2 and SWARM 280 when merging all species-specific datasets into one ciliate dataset before first-level sequence 281 grouping. However, the results were not substantially different from those obtained when species-282 specific datasets were used for first-level sequence grouping (Table S5, Figures S4 and S5). A stronger 283 effect could be observed when all species-specific datasets were merged before first-level sequence 284 grouping and, in addition, an abundance-filter was applied before second-level sequence grouping in 285 SSNs (Table S5). Since the initial idea of this study was to analyze distinct ciliate species in the same 286 way and each species-specific dataset indeed represented a sample of a distinct ciliate species, we 287 decided to focus on the results obtained from species-specific datasets. 288 289

Discussion 290
First-level sequence grouping results overestimate species richness A major goal of most environmental HTS studies is to provide realistic estimates of 292 biodiversity within a sample. However, it is not a trivial task to place the tremendous amount of 293 resulting data into an ecologically meaningful context. One of the main difficulties is the delineation 294 of species based on HTS datasets. Ciliates have been used as model organisms for addressing this 295 problem in the past, but so far all comparisons of morphospecies richness and OTU richness within intraspecific polymorphism with a mean sequence divergence of 1.6% for the SSU rRNA gene within 316 the ciliate genus Gastrostyla. Given that their results were based on sequences of three individuals, 317 this may only represent a fraction of the complete intraspecific polymorphism for these organisms. In 318 our analyses, we detected only two ASVs for Gastrostyla steinii while we detected considerably more 319 ASVs of ciliates (e.g. 11 ASVs for Stentor coeruleus) for which no high variation rates of intraspecific 320 polymorphism have been reported (Kusch, 1998;Zhang et al., 2012). Thus, it is possible that some of 321 the more divergent true organismal variants of G.steinii may have been error-corrected by DADA2. 322 In contrast to the DADA2-pipeline, the SWARM algorithm does not perform any denoising 323 but is purely designed for grouping sequences. As such, SWARM will use any sequence provided in 324 the input dataset without deciding whether or not the sequence may be a true organismal variant or 325 a sequencing artifact. Denoising is a very sensible and important step that has to be performed by 326 other bioinformatic tools, preferentially on results obtained after sequence grouping in SWARM 327 . SWARM's focus is on finding smallest differences between sequences and 328 enabling a very fine-scaled resolution of genetic diversity in a sample. As reflected by our results, this 329 is counterproductive when looking at alpha diversity or species richness within a sample. The 330 Spathidium ascendens dataset of 19 SWARM-derived OTUs was the most severe example of over-331 splitting in our data. Since the dataset of S. ascendens was subjected to the same bioinformatic 332 treatment as the other datasets with standard SWARM parameters, it is unlikely that this high 333 SWARM-OTU number is an artifact of the SWARM algorithm. The inflation of OTUs may be caused by 334 intraspecific sequence polymorphism in Spathidium ascendens, but this feature has not been studied 335 for this species before. For other species which yielded high numbers of OTUs, intra-specific and 336 intra-individual sequence polymorphism is documented. Among the order Heterotrichida, 337 polymorphic sites were found within the V9 region of the 18S rDNA (Wang et al., 2017), which could 338 explain our finding of 18 SWARM-OTUs for Folliculina sp.. Sequence polymorphism is also a 339 widespread feature in the genus Paramecium (Coleman, 2005). At least some part of the SWARM-340 OTUs from Paramecium bursaria (10 OTUs) and Paramecium tetraurelia (9 OTUs) may reflect this 341 true intraspecific genetic diversity. Using dataset replicates, as proposed by Prosser (2010), might 342 help for deciding which sequence is artificial and which is a true organismal variant. But even this 343 strategy cannot be the ultima ratio for species-specific datasets from single cell sequencing, since 344 intra-individual genetic diversity may not contain all variants of intraspecific genetic diversity. 345 346 Second-level sequence grouping with sequence similarity networks improves diversity estimates 347 The shortcomings for species diversity estimations of both first-level sequence groupings 348 (DADA2-ASVs and SWARM-OTUs) were distinctively alleviated when sequence similarity networks 349 were used as second-level sequence grouping. Even though species richness was still overestimated, at their suggested level of 97% sequence similarity. At the same level of similarity, the DADA2-SSN 399 combination overestimated the diversity by one fifth (6 ASV-NSCs more than species expected). In 400 addition, our approach led to more accurate diversity estimates and an underestimation of diversity 401 by only three ASV-NSCs at the binning level with the highest RI and ARI scores (94% similarity). The 402 marginal differences between RI and ARI scores at intermediate similarity binning levels (e.g. RI 403 difference of 0.0002 and ARI difference of 0.0026 between 94% and 95% similarity), indicate that 404 there is not one universally applicable level for optimal sequence grouping. Instead, there exists a 405 range of sequence similarities, at which qualitatively highly similar and equally precise sequence 406 grouping can be achieved. To identify the best binning level, we advise to create SSNs on such a 407 range of similarities and evaluate the outcome for best sequence grouping results. Our data suggests 408 that the use of more conservative and lower similarity binning levels has a positive effect towards 409 more accurate diversity estimates of ciliates while also allowing for more precise species delineation. 410 Similar conclusions can be drawn from the SWARM-SSN approach, which yielded the best results at 411 96% similarity and thus, below the 97% similarity threshold widely used for clustering ciliate 412 sequence data. While our two-level sequence grouping strategy can easily be adapted to datasets of 413 other taxonomic groups or other barcode gene regions, it is important to note that the similarity 414 binning level, which produced the best output for our ciliate V9 18S rDNA dataset, is not 415 generalizable to other datasets without prior tests. The extent of genetic diversity varies among 416 different taxonomic groups (e.g. Brown et al., 2015) and even within ciliates, different sequence 417 similarity thresholds have been shown to be more effective for delineating species when working 418 with datasets of different hypervariable regions of the 18S rDNA (Dunthorn et al., 2012). 419 All-versus-all pairwise sequence alignments in hierarchical sequence grouping are a 420 prerequisite for attaining the advantages of sequence similarity networks (Forster et al., 2015(Forster et al., , 2016 grouping is that an over-splitting of diversity is avoided Flynn et al., 2015). This 423 benefit displays in the avoidance of diversity over-splitting in our study as well. By contrast, an over-424 splitting was observed in the study of  Table S2). The effect was further enhanced when species-specific datasets were merged before 450 first-level sequence grouping and an abundance-filter was applied before second-level sequence 451 grouping in SSNs. But even without merging the datasets, the ASV-NSCs and OTU-NSCs reflected the 452 real species diversity quantitatively closer after the application of an abundance filter. The consequences and relevance of notably different outputs using DADA2 and SWARM in 463 ecological studies remains to be tested in real case scenarios using field samples. While some specific 464 ecological questions (such as results of beta diversity analyses) may not suffer, others (such as results 465 of alpha diversity analyses or the identification of ecologically relevant key species) may be highly 466 compromised by different information obtained from DADA2 and SWARM. Beyond that, second-level 467 sequence grouping with sequence similarity networks clearly improved diversity estimates compared 468 to first-level sequence grouping. We expect that future studies will benefit from implementing this 469 strategy, especially by relying on the combination of DADA2 and sequence similarity networks. 470 471

Ciliate specimen collection
Single ciliate cells were hand-picked from either pure cell cultures or from environmental 474 culture samples, leading to a collection of 29 different ciliate species (Table 1, see Table S4 for 475 material sources). To allow for testing species delineation at higher taxonomic levels, six of the 476 known ciliate classes were covered, including Heterotrichea, Litostomatea, Oligohymenophorea, 477 Phyllopharyngea, Prostomatea and Spirotrichea. To test the species delineation on lower taxonomic 478 levels, the collection also included 2 species from each of the genera Paramecium and Spathidium. 479 Each of the 29 species was treated independently in our workflow to create one species-specific 480 high-throughput sequencing dataset per ciliate for downstream analyses (Figure 1 EukB as reverse primer. Each primer pair was tagged with one of ten different barcodes, so that the 501 resulting species-specific barcoded sequences could easily be retraced in downstream analyses. The 502 V9-EukB protocol consisted of an initial denaturation step of 30 s at 98°C, followed by 25-35 cycles 503 (adjusted for each species, see Table S4) of 10 s at 98°C, 20 s at 64°C and 25 s at 72°C; the final 504 extension lasted for 300 s at 72°C. Once the barcoded V9 PCR products were successfully amplified, 505 samples were pooled into libraries that contained up to ten different barcodes (one for each ciliate 506 species). Libraries were paired-end sequenced on either an Illumina MiSeq or NextSeq platform (see 507   Table S4 for details), each generating 250-base pair reads. High-throughput sequencing was 508 conducted by SeqIT (Kaiserslautern, Germany). 509 After sequencing, libraries were split into single species datasets using CUTADAPT v1.18 510 (Martin, 2011) according to the barcodes initially applied during the semi-nested PCR. Filtering of 511 reads with an expected barcode and removal of those was followed by matching the primer 512 sequences at both 5' and 3'-ends in CUTADAPT, as well. Reads were kept if they exactly matched the 513 forward or reverse V9 primers at their 5'-end and, at the same time, if they exactly matched the 514 reverse or forward V9 primers near to the 3'-end ("linked adapter" approach in CUTADAPT). Primer 515 sequences and overhang at 5'-end were removed during this filtering process to keep only the 516 targeted V9 region. Reads were oriented in the same direction using VSEARCH v2.8.0 (Rognes et al., 517 2016). Thus, the first paired libraries only contained reads oriented in the forward V9 primer 518 direction and the second paired libraries only contained reads oriented in the reverse V9 primer 519 direction. The barcode-and primer-filtering produced 29 single species paired libraries (for the 520 species Epistylis plicatilis two libraries existed, but the data were merged for subsequent steps) 521 which were used as input for the downstream first-level sequence grouping approaches. 522 The sequence data were deposited at the Sequence Read Archive of the National Center for 523 Biotechnology Information (NCBI) and are available under accession number PRJNA548847. All bioinformatic procedures and commands for statistical analyses are provided in HTML format as 525 supplemental material (File S1, Table S6). 526 527 Quality filtering and first-level sequence grouping with DADA2 528 Reads were truncated to their first 100 bp and filtered with a maximum expected error of 0. Demultiplexed libraries, from which barcodes and primers were previously removed, were 540 used for fist-level sequence grouping with SWARM. For each paired end, species-specific library, we 541 precisely followed the publicly available instructions at https://github.com/frederic-542 mahe/swarm/wiki/Fred's-metabarcoding-pipeline. In short, paired-end reads were merged and 543 subsequently dereplicated into amplicons using VSEARCH. First-level sequence grouping in SWARM 544 v2.0.5 (Mahé et al., 2015) was performed on the amplicons with -d 1 and the fastidious option -f. The 545 resulting SWARM-OTUs were subjected to a de novo chimera detection in UCHIME (Edgar et al., 546 2011) and singletons were removed from the output. From each non-chimeric SWARM-OTU we 547 extracted the seed amplicon as representative sequence for downstream taxonomic assignment. 548

Taxonomic assignment of first-level sequence grouping results 550
ASVs and representative seed sequences (i.e., the most abundant amplicon) of SWARM OTUs 551 were annotated using VSEARCH (Altschul et al., 1990) (Csardi and Nepusz, 2006). To determine the sequence similarity which gave the maximal 572 congruence between the number of NSCs and the number of ciliate species, similarity binning levels 573 for SSN construction from 90%-100% were tested in single percentage steps, except between 96% and 98% for which we tested 0.5 percentage steps. This was because previous studies suggested the 575 best cutoff level for species delineation at this range of sequence similarities (e.g. Worden, 2006;576 Caron et al., 2009). Every node in a SSN represented one DADA2-derived ASV or one SWARM-derived 577 OTU representative sequence. Every edge in a SSN represented a sequence similarity between two 578 ASVs or two representative sequences that exceeded the applied binning level (Figure 1, see also 579 Forster et al., 2015). Thus, NSCs either represented connected components, i.e. a cluster of ASVs or 580 OTU representative sequences that could be further grouped based on the applied binning level; or a 581 single node, i.e. a single ASV or OTU representative sequence which sequence similarity to any other 582 ASV or OTU representative sequence in the dataset was lower than the applied binning level. For 583 instance, a binning level of 100% similarity reproduced the results from the first-level sequence 584 grouping, meaning that the SSNs consisted exclusively of single nodes that represented DADA2-585 derived ASVs or SWARM-derived OTUs. 586 Previous studies indicated that the implementation of an abundance filter can be beneficial 587 for increasing the accuracy of sequence grouping (e.g. Quince et al., 2009Quince et al., , 2011Reeder and Knight, 588 2010; Bokulich et al., 2013;Auer et al., 2017). The rationale behind this is that sequences which 589 emerge from organisms that are actually occurring in a sample should be much more abundant in a 590 HTS dataset, than sequences which represent sequencing artifacts or elusive contaminations. 591 Following this idea, we tested in a separate approach if an abundance filter could improve the 592 outcome of our two-level sequence grouping strategy. For this test, we removed before SSN 593 construction from each species' dataset all DADA2-derived ASVs with an abundance of less than 594 0.01% with regard to all sequences of that dataset. For abundance filtering in SWARM we first set 595 SWARM's -b option to the species-specific 0.01% abundance threshold of each dataset during first-596 level sequence grouping, then removed before SSN construction all SWARM-OTUs with an 597 abundance of less than 0.01% with regard to all sequences of that dataset. All other second-level 598 sequence grouping steps in this test were performed as outlined above. 599 All statistical tests were run in R v3.5.1. The focus of the statistical evaluation for second-602 level sequence grouping was to identify the similarity binning level, which maximized the number of 603 closed NSCs and minimized the number of hybrid NSCs. To express this level mathematically, we 604 listed the NSC membership of each representative sequence and applied both the rand index (RI; 605 Rand, 1971) and Hubert's and Arabie's adjusted rand index (ARI; Hubert and Arabie, 1985). In short, 606 the RI compares the congruence between two cluster distributions, whereas the ARI is the corrected-607 by-chance version of the RI. The values of the indices range from 0 to 1, with 0 describing completely 608 different cluster distributions and 1 describing perfectly matching cluster distributions between two 609 sets of data. In this study, we compared the sequence grouping of representative sequences within 610 NSCs for each binning level to the optimal distribution of representative sequences, in which case 611 The authors declare that there is no conflict of interest. NSCs. Raw values of the bars can also be found in Table S1.  Table S1: NSC numbers at different sequence similarity binning levels. The values document the 849 gradual increase of ASV-NSCs and OTU-NSCs with increasing similarity binning level, which is also 850 displayed in Figure 2. 851 852 Table S2: Difference between non-abundance filtered and abundance-filtered data before second-853 level sequence grouping. For testing the effect of an abundance filter, all ASVs or SWARM-OTUs 854 which amounted to less than 0.01% of the sequences of a species were filtered from the dataset. 855 That is, the numbers shown here are also equivalent to the number of NSCs at the 100% binning level 856 for second-level sequence grouping with and without abundance filter. 857 858 taxonomy, the table also shows the source of the sample, to which GenBank accession number the 863 best hit refers and with which Illumina platform it was sequenced. 864 865 Table S5: NSC numbers at different sequence similarity binning levels when species-specific 866 datasets were merged before first-level sequence grouping. The values document the gradual 867 increase of ASV-NSCs and OTU-NSCs with increasing similarity binning level. NSC numbers at the 868 similarity binning level of 100% are equivalent to the outcome of first-level sequence grouping after 869 merging species-specific ciliate datasets. The third and fourth columns show NSC numbers when 870 species-specific datasets were merged before first-level sequence grouping and an abundance-filter 871 was applied before second-level sequence grouping. 872 873