Open-Source Sequence Clustering Methods Improve the State Of the Art

Massive collections of next-generation sequencing data call for fast, accurate, and easily accessible bioinformatics algorithms to perform sequence clustering. A comprehensive benchmark is presented, including open-source tools and the popular USEARCH suite. Simulated, mock, and environmental communities were used to analyze sensitivity, selectivity, species diversity (alpha and beta), and taxonomic composition. The results demonstrate that recent clustering algorithms can significantly improve accuracy and preserve estimated diversity without the application of aggressive filtering. Moreover, these tools are all open source, apply multiple levels of multithreading, and scale to the demands of modern next-generation sequencing data, which is essential for the analysis of massive multidisciplinary studies such as the Earth Microbiome Project (EMP) (J. A. Gilbert, J. K. Jansson, and R. Knight, BMC Biol 12:69, 2014, http://dx.doi.org/10.1186/s12915-014-0069-1).

database, and an E value threshold is applied to evaluate the quality of resulting alignments. In SortMeRNA 2.0, the reference sequence achieving the lowest E value when aligned with a query sequence is chosen as the OTU centroid for that query. In addition to passing the E value threshold, the query must also have sufficient percent identity and coverage (both set to 97% by default). Contrary to UCLUST, the run time of SortMeRNA is not affected by reducing these thresholds (e.g., clustering at 60% identity).
UCLUST and USEARCH (versions 5.2 and 6.1) are supported in QIIME (v1.8.0). Both tools can perform de novo, closed-reference, and open-reference (except for USEARCH 5.2) clustering. In QIIME's implementation, USEARCH 5.2 is executed via a pipeline closely shadowing otupipe (6,21) to cluster OTUs, and USEARCH 6.1 performs chimera checking in an external script. UPARSE (16) is the latest de novo amplicon analysis pipeline from USEARCH; it applies stringent quality filtering and length trimming to remove erroneous reads and implements a novel greedy algorithm that performs OTU clustering and chimera removal concurrently.
A variety of datasets were chosen to evaluate the performance of these open-source OTU clustering approaches relative to QIIME's UCLUST/USEARCH-based OTU clustering approaches as well as UPARSE (see Table 1 for details). Two 16S rRNA gene simulated datasets were generated as FASTQ files. The first one (sim_even) represents an even distribution of 1,076 species, randomly subsampled from the Greengenes 97% database and computationally amplified at the same depth (100 reads/amplicon) and length (150 bp) using PrimerProspector (23) for extracting the V4 region and the ART (24) simulator for amplification and sequencing simulation. The second data set (sim_ staggered) represents the same 1,076 species as the sim_even data set but amplified at different (random) species abundance levels. We used four different previously published mock community data sets: three 16S rRNA gene mock community data sets (Bokulich_2, Bokulich_3, and Bokulich_6) from Bokulich et al. (25) and an 18S gene (mock_nematodes) data set from Porazinska et al. (26). Finally, we also used three previously published natural data sets: a 16S rRNA gene soil data set (canadian_soil) from Neufeld et al. (27), a 16S rRNA gene human data set (body_sites) from Costello et al. (28), and an 18S rRNA gene soil data set (global_soil) from Ramirez et al. (29).
Performance. All tools were run with default parameters. Input FASTA files for Swarm, SUMACLUST, and SortMeRNA were generated using QIIME's demultiplexing and quality filtering workflow. Input FASTA files for OTUCLUST, mothur, and UPARSE were demultiplexed using QIIME and quality filtered using each tool's recommended Performance was evaluated using a variety of metrics, including the accuracy of OTU and taxonomic assignments, alpha diversity (within-sample diversity), beta diversity (between-sample diversity), and taxonomic correlation. All tools showed increased precision after the removal of singleton OTUs (OTUs consisting of only one sequence), so all results presented here have had singleton OTUs removed. Table 2 summarizes basic performance results for all software.
Expected community composition: sensitivity and specificity. (i) Simulated data. For de novo clustering, most tools report F measures (or F1 score, a metric that assesses the accuracy of taxonomic composition and observed OTUs, with a range from 0 to 1, where 1 is the best score) of 0.82 to 0.84 (sim_even) and 0.81 to 0.83 (sim_staggered) at the genus level ( Table 2). Variation in results was emphasized in sim_staggered, where UPARSE_q16 reported the lowest F measure (0.78) as a result of stringent read filtering that removed nearly 95% of the reads prior to clustering. UPARSE_q3 removed roughly 4% of the reads and reported improved results on a par with those of Swarm, SUMACLUST, UCLUST, and USEARCH61. The highest F measure (0.83 to 0.84) and number of OTUs closest to the expected one (1,076) were reported for software using input files from QIIME's method of sequence filtering (Swarm, SUMACLUST, UCLUST, and USEARCH61). All tools except UPARSE_q16 reported comparable alpha diversity phylogenetic diversity (PD) whole-tree (35) (a measure of diversity which considers the phylogenetic differences between species) values (mean of 109.21 with a standard deviation of 2.16 [ Table 2]).
Among the closed-reference methods, SortMeRNA yields the fewest OTUs while achieving comparable or higher F measures for assigned taxonomy (F1 tax in Table 2) and OTUs (F1 OTUs in Table 2) and reported a phylogenetic diversity (121.89) closest to  39), and USEARCH 6.1 (127.50). This is a result of SortMeRNA's more exhaustive search for better alignments, which can increase run time but becomes imperative when short reads are aligned against a highly conservative set of sequences, such as the rRNA gene. The complete Greengenes database contains over a million rRNAs, and almost 73% of all full-length V4 regions (~250 nucleotides) are not unique. This emphasizes the highly conservative nature of rRNA, even in this hypervariable region, and suggests the need for thorough searches to ensure higher-quality alignments (especially for read lengths that do not cover the entire region). At the genus level of taxonomy, all tools report F measures of 0.80 to 0.83 (sim_even) and 0.78 to 0.84 (sim_staggered), which can be attributed to the many-to-one relationship between OTUs and taxonomy strings for the Greengenes 97% database.
For open-reference clustering, QIIME's subsampling pipeline combining SortMeRNA and SUMACLUST reports the fewest OTUs in comparison to UCLUST and USEARCH 6.1. The F measure is 0.82 to 0.83 (sim_even) and 0.81 to 0.83 (sim_staggered) for all tools, which is in agreement with de novo and closed-reference results. These results are expected given the nature of open-reference clustering, which combines the closedreference approach with the de novo approach.
(ii) Mock communities. Results for Bokulich_2 (and Bokulich_3 for UPARSE) are unavailable for UPARSE, OTUCLUST, and mothur due to significant memory, run time, and disk space requirements, respectively. All other methods were compared against the expected taxonomic composition for each data set. In addition, Pearson's correlation coefficient was computed to measure the relatedness of taxonomic assignment between all pairs of tools (see column rho in Table 2 for all tools versus UCLUST). Values can range between Ϫ1 and 1, with Ϫ1 indicating a negative correlation, 0 indicating no correlation, and 1 indicating a positive correlation (strong relationship).
For de novo clustering, USEARCH 5.2, UPARSE_(q3, q16), OTUCLUST_(q3, q20), and mothur_nearest frequently reported the lowest number of OTUs, the lowest number of observed taxa, and the highest F measure ( Table 2). Since the F measure is computed using true-positive taxonomies based on the expected composition, possible contamination species (false positives) are unaccounted for. However, false-positive taxonomies can also arise from OTUs formed by chimeric sequences (sequences from two organisms that bind together during PCR and are subsequently sequenced as a single read) or incorrect assignment by taxonomy assignment tools. To investigate the origins of false-positive taxonomies reported by the tools, we checked all OTUs for chimeras using UCHIME (20) and mapped the nonchimeric OTUs against BLAST's NT database using MEGABLAST. Most of the false-positive taxa were not wholly comprised of chimeric OTUs (meaning that the collection of OTUs mapping to the same taxa was composed of chimeric and genuine sequences), and the majority of such nonchimeric taxa consisted of OTUs mapping with an E value of Ͻ1eϪ50 to BLAST's NT database (e.g., in Table 3 there are 57 false-positive taxa reported by SUMACLUST, but only 4 of those taxa are fully comprised of chimeric OTUs [FP-chimeric], and 99% of OTUs representing the remaining 53 nonchimeric taxa mapped with high similarity to BLAST's NT database). Not surprisingly, all false-positive taxa whose OTUs mapped with Ͻ97% similarity (FP-other) are less abundant than the taxa whose OTUs map with Ն97% similarity (FP-known) and significantly less abundant than true-positive taxa (see Fig. S1 to S3 in the supplemental material). In fact, false-positive taxonomies (especially FP-other and FP-chimeric) comprise few and low-abundance OTUs, which can be analyzed and filtered out if necessary after clustering. Since UPARSE filters out a large fraction of presumably erroneous reads (even prior to chimera checking), it can detect the most abundant species (as can other tools) but also potentially overlook lowabundance species. For the Bokulich_2 and Bokulich_3 data sets, the top 20 most abundant genera follow a similar relative abundance distribution for all de novo tools, which is a direct reflection of hundreds of thousands of reads representing each expected genus in these data sets. However, for the much smaller data set Bokulich_6, UPARSE_q16 reported only half of the expected genera relative to all other tools, and the relative abundance of some of the genera significantly decreased ( Fig. 1; PD values in Table 2). OTUCLUST, mothur_nearest, mothur_average, Swarm, and SUMACLUST reported significantly fewer OTUs than UCLUST and USEARCH 6.1, as well as a lower alpha diversity (Tables 2 and 3). Thus, tools with lower false-positive rates accomplish this by more stringent quality control, but they are less suitable for finding lowerabundance genera.
For closed-reference clustering, SortMeRNA reported up to 60% fewer OTUs and a PD of about half that of UCLUST and USEARCH 6.1 ( Table 2). On the genus taxonomy level, USEARCH 5.2 reported a high F measure (due to a lower number of false-positive genera), but unlike all other tools, the majority of false-positive genera are composed of reads mapping with Յ97% identity and coverage to BLAST's NT database (FP-other). In fact, all other tools filtered out a large portion of these false-positive reads due to insufficient identity matches to the reference database. The difference appears to be caused by USEARCH 5.2's identity definition (which does not consider insertions or deletions), which scores alignments higher than other tools. SortMeRNA generates taxonomic profiles similar to those obtained with other tools, with a Pearson's correlation coefficient of Ͼ0.93 (Fig. 1).
As expected for open-reference clustering, SortMeRNA combined with SUMACLUST reported fewer OTUs than UCLUST and USEARCH 6.1 while preserving high accuracy and lower alpha diversity for both the number of observed OTUs and the phylogenetic diversity. Specific details can be found in the supplemental material.
The Pearson coefficient for comparisons between all tools and methods remained relatively stable, from~0.99 (Bokulich_2) to 0.97 to 1 (Bokulich_3) to 0.92 to 0.99 (Bokulich_6), showing a strong relationship between all algorithms. The coefficient was lower in the cases where the taxonomy could not be assigned (e.g., 0.0273 for SortMeRNA versus UCLUST for data set Bokulich_3) or significant filtering of sequences (e.g., 0.3719 for UCLUST versus UPARSE_q16 for data set Bokulich_6).

Natural community composition.
Results for UPARSE_q4 and UPARSE_q16 are unavailable for the global_soil data set due to memory limitations in the 32-bit version of UPARSE and for OTUCLUST due to significant run time (limited to one thread). In contrast to mock communities, the Pearson correlation for natural communities was much more variable (0.70 to 0.94 for the canadian_soil data set, 0.28 to 0.99 for the body_sites data set, and 0.19 to 0.98 for the global_soil data set) ( Table 2), highlighting differences between all clustering algorithms in a complex environment that are not immediately visible in either simulated or mock communities. These ranges do not take into account outliers that were caused by an inconsistency with RDP assignments for the most abundant taxa (Bokulich_3) and stringent filtering of reads by UPARSE_q16 (Bokulich_6) (Fig. 1). QIIME, UPARSE, OTUCLUST, and mothur include different sequence filtering methods, which could be the major reason behind inconsistent taxonomic compositions (Pearson's correlation in Table 2; Fig. 2). As illustrated in Fig. 2, running mothur with sequences that were quality-filtered by mothur and QIIME produced significantly different taxonomic compositions. As expected, the highest correlation exists for studies with the longest reads and the largest number of reads per sample, showing that clustering results converge to the same conclusions with longer, higherquality reads and deep sequencing (Fig. 3).
As with the mock-community results, all tools frequently reported fewer OTUs and lower alpha diversities than UCLUST and USEARCH 6.1 ( Table 2 and Fig. 4; also, see Fig. S4 and S5 in the supplemental material). Procrustes analysis (36) was used to compare unweighted UniFrac (37, 38) principal coordinates analysis (PCoA) (22) generated by all methods versus UCLUST (the current default OTU picker in QIIME). The Procrustes M 2 metric for body_sites and canadian_soil was Ͻ0.3 for most software ( Table 2), indicating that beta diversity patterns are similar irrespective of the OTU clustering method used. Neither recommended nor relaxed quality filtering parameters for UPARSE worked well for the body_sites data set, where 98.5% and 99.2% of reads were filtered out for UPARSE_q3 and UPARSE_q16 (with a trim length of 250 bp), respectively, resulting in very few remaining samples and high M 2 values (Table 2; also, see Fig. S4 in the supplemental material). Although read quality filtering is an important preprocessing step, more work is required to regulate these parameters (perhaps by an automated estimation of optimal truncation length and quality), as they can be very sensitive to different types of data.

DISCUSSION
We evaluated the performance of four recently published open-source sequence clustering tools against the widely used mothur, UCLUST, and USEARCH tools using

The bars do not reach 1, since only a fraction (top 50) of taxonomies was illustrated. mothur was run using recommended filtering (trim.seqs function) for 454 SOP and with QIIME's split_libraries-_fastq.py to highlight the effect of different filtering methods.
simulated data, mock communities, and natural microbial communities. We found that Swarm, SUMACLUST, UCLUST, and UPARSE (with relaxed parameters) performed equally well on simulated datasets where the ground truth was well established, with mothur_average and OTUCLUST closely behind. Despite this controlled chimera-free environment, UPARSE with recommended parameters reported the lowest accuracy for the sim_staggered data set, implying that stringent quality filtering can cause a significant underestimation of species abundance and diversity and lead to incorrect biological results. For the mock communities, most tools were able to correctly detect the expected number and identity of genera, but only UPARSE reported significantly fewer false-positive taxa (followed by OTUCLUST and USEARCH). For UPARSE, this was expected, as a large proportion of reads was filtered out prior to clustering, leaving evidence of only the most abundant taxa (OTUs comprised of hundreds of thousands of reads). The majority of false-positive taxa reported by other tools were low-abundance OTUs that could be mapped to BLAST's NT database with very high similarity (E value, Ͻ1eϪ50). If the user's primary goal is to focus on the most abundant microbial profiles, low-abundance OTUs may be filtered out postclustering, but care should be taken, as such low-abundance OTUs can be important members of communities (39).
In terms of accurately predicted taxonomic composition for de novo tools, Swarm performed well across all simulated and mock datasets, followed closely by SUMACLUST and UCLUST. However, both Swarm and SUMACLUST reported significantly fewer OTUs and lower alpha diversities than UCLUST. The performance of other de novo methods, such as mothur and OTUCLUST, showed more variation across datasets; however, these results were largely influenced by the preliminary sequence filtering step, where both tools removed more data than QIIME's method. We found that QIIME's filtering approach worked well across all datasets, rendering the most data for clustering tools to work with. For closed-reference tools, SortMeRNA significantly outperformed UCLUST and USEARCH for predicting OTUs and performed as well or better in terms of predicted taxonomic composition. Several studies could not be processed with mothur, OTUCLUST, or the free academic distribution of UPARSE due to their large sizes, either because of an unreasonable disk space requirement in the case of mothur, unreasonable run time in the case of OTUCLUST (no multithreading support), or a relatively small memory limit in the case of UPARSE. Regarding UPARSE, the small memory limit makes it necessary to purchase the 64-bit license in order to process large projects (e.g., see Yatsunenko et al.'s work [40], which contains 500 GB of raw sequence data generated on 17 HiSeq lanes) or use open-source alternatives. QIIME's current open-source, open-reference pipeline (based on SortMeRNA and SUMACLUST) was able to process this quantity of data within 24 h using 64 threads on Intel Xeon CPU E5-4620 v2 at 2.60GHz or within 3 days using 64 threads on AMD Opteron Processor 6276.
Although most open-source tools report an increased run time in comparison to UCLUST and USEARCH (Fig. 5), they provide the benefit of finding significantly fewer OTUs. In the case of SortMeRNA, longer reads (~150 bp) are quicker to align than the same number of shorter reads (~100 bp) due to many fewer high-scoring candidate reference sequences to analyze. Moreover, all of these tools support multilevel multi- threading and can easily scale to modern big-data processing demands. An alternative to reducing run time is to filter out a substantial number of reads, as done by UPARSE; unfortunately, the filtering parameters are sensitive to different data, and choosing them manually by trial and error can be a time-consuming task with unpredictable outcomes in diversity.
The three open-source software products, Swarm, SUMACLUST, and SortMeRNA, are now accessible through the widely used QIIME software package (release 1.9.0). Swarm 2 was released in reference 14 and reported to be faster and more memory efficient than Swarm 1; however, as of this writing, only Swarm 1 has been integrated into QIIME. Ongoing work to improve the QIIME OTU clustering workflows that use these tools includes adding a targeted gene prefilter for de novo clustering to remove (prior to clustering) any sequences not matching a specific gene model (e.g., 16S rRNA) and a refined reference database for targeted hypervariable regions (e.g., V4 at 97% identity) to improve alignment quality (41). Furthermore, research is in progress to introduce an open-source implementation of chimera detection directly within QIIME. Both of these improvements will further reduce the number of unrelated or erroneous reads recruited into OTUs, a known problem with both the UCLUST-and USEARCH-based OTU clustering illustrated here, without underestimating diversity.

MATERIALS AND METHODS
All steps taken to generate the analyses presented in this article are documented and implemented as shell or python scripts, available at https://github.com/ekopylova/OTU-clustering.
Performance benchmarks. Open-source with multilevel parallelization tools tested in this paper-Swarm, SUMACLUST, and SortMeRNA-have been integrated into QIIME 1.9.0. For these tools, all benchmarks were launched through QIIME. For UPARSE, the recommended workflow (http:// www.drive5.com/usearch/manual/uparse_cmds.html) was run. For OTUCLUST, the script micca-preproc was used for sequence filtering and the command otuclust for clustering. For mothur, the MiSeq SOP (42) (website accessed 27 October 2015) and 454 SOP (43) (website accessed 27 October 2015) were run. The shell scripts commands_16S.sh and commands_18S.sh were used to launch all tools, and the opensource project (https://github.com/josenavas/QIIME-Scaling) was used for measuring their run time performance. All run time performance tests were performed using 1 to 32 threads on Intel Xeon CPU E5-2640 v3 at 2.60 GHz.
Precision and recall. For simulated and mock datasets, false-positive (FP; taxonomy/OTU string exists in observed but not expected), false-negative (FN; taxonomy/OTU string exists in expected but not observed), and true-positive (TP; taxonomy/OTU string exists in both observed and expected) measures were computed between the pickers' results (observed) and the ground truth or expected taxonomic composition (expected). The following definitions were used: precision ϭ TP/(TP ϩ FP); recall ϭ TP/(TP ϩ FN); F measure ϭ 2 ϫ precision ϫ recall/(precision ϩ recall).
The python script run_compute_precision_recall.py was used to compute TP, FP, FN, precision, recall, F measure, the number of false-positive taxa whose complete set of OTUs are identified as chimeric (FP-chimeric) by UCHIME, the number of false-positive taxa whose complete set of OTUs map with Ն97% identity and coverage to BLAST's NT database (FP-known), and the number of false-positive taxa whose complete set of OTUs map with Ͻ97% identity and coverage to BLAST's NT database (FP-other). The script plot_tp_fp_distribution.py was used to generate Fig. S1, S2, and S3 in the supplemental material.
Simulating reads (even and staggered). All of the following steps can be executed using the shell script simulate_reads.sh.
Reads were simulated using PrimerProspector (23) and the ART simulator (24). For the even data set, the following steps were taken. (i) Use PrimerProspector to extract V4 regions from the Greengenes 97% representative database (version 13.8); (ii) subsample 0.011% of the sequences from the resulting V4 region database; and (iii) simulate even abundance reads with ART simulator using the subsampled V4 sequences.
Amplicon sequencing simulation in ART (version VanillaIceCream-03-11-2014) could generate only evenly distributed communities. To simulate the staggered data set, a staggered distribution of template sequences was passed (for example, 3 duplicates of OTU1, 10 duplicates of OTU2, etc.). To simulate the staggered data set, the following steps were taken. (i) Generate a random staggered distribution FASTA file of template V4 sequences using the list of OTU identifications from the even data set and the V4 subsampled sequences and (ii) simulate staggered abundance reads with ART using the staggered subsampled V4 sequences.
For both even and staggered reads, QIIME's split_libraries_fastq.py script was run to filter simulated reads based on quality scores and format FASTA labels to be compatible with QIIME (reads for UPARSE, mothur, and OTUCLUST were not filtered; only FASTA labels were reformatted).
Building ground-truth BIOM tables. Ground-truth OTU maps and BIOM tables were constructed using the simulate_reads.sh script that was used for simulating reads. OTU maps were generated using the reads' origin information stored in the FASTA labels of ART-simulated reads. BIOM tables were generated using QIIME's make_otu_table.py script together with Greengenes 97% taxonomy strings.
Calculating alpha diversity, beta diversity, and taxonomic correlation. Customs scripts iterating over all benchmarking results were used to launch QIIME's alpha and beta diversity analyses. The script run_single_rarefaction_and_plot.py was used to compute and plot alpha diversity as shown in Fig. 4 and in Fig. S4 and S5 in the supplemental material. The script run_beta_diversity_and_procrustes.py was used to compute beta diversity and run Procrustes analysis.