Subsampled open-reference clustering creates consistent, comprehensive OTU definitions and scales to billions of sequences

We present a performance-optimized algorithm, subsampled open-reference OTU picking, for assigning marker gene (e.g., 16S rRNA) sequences generated on next-generation sequencing platforms to operational taxonomic units (OTUs) for microbial community analysis. This algorithm provides benefits over de novo OTU picking (clustering can be performed largely in parallel, reducing runtime) and closed-reference OTU picking (all reads are clustered, not only those that match a reference database sequence with high similarity). Because more of our algorithm can be run in parallel relative to “classic” open-reference OTU picking, it makes open-reference OTU picking tractable on massive amplicon sequence data sets (though on smaller data sets, “classic” open-reference OTU clustering is often faster). We illustrate that here by applying it to the first 15,000 samples sequenced for the Earth Microbiome Project (1.3 billion V4 16S rRNA amplicons). To the best of our knowledge, this is the largest OTU picking run ever performed, and we estimate that our new algorithm runs in less than 1/5 the time than would be required of “classic” open reference OTU picking. We show that subsampled open-reference OTU picking yields results that are highly correlated with those generated by “classic” open-reference OTU picking through comparisons on three well-studied datasets. An implementation of this algorithm is provided in the popular QIIME software package, which uses uclust for read clustering. All analyses were performed using QIIME’s uclust wrappers, though we provide details (aided by the open-source code in our GitHub repository) that will allow implementation of subsampled open-reference OTU picking independently of QIIME (e.g., in a compiled programming language, where runtimes should be further reduced). Our analyses should generalize to other implementations of these OTU picking algorithms. Finally, we present a comparison of parameter settings in QIIME’s OTU picking workflows and make recommendations on settings for these free parameters to optimize runtime without reducing the quality of the results. These optimized parameters can vastly decrease the runtime of uclust-based OTU picking in QIIME.

are highly correlated with those generated by "classic" open-reference OTU picking through comparisons on three well-studied datasets. An implementation of this algorithm is provided in the popular QIIME software package, which uses uclust for read clustering. All analyses were performed using QIIME's uclust wrappers, though we provide details (aided by the open-source code in our GitHub repository) that will allow implementation of subsampled open-reference OTU picking independently of QIIME (e.g., in a compiled programming language, where runtimes should be further reduced). Our analyses should generalize to other implementations of these OTU picking algorithms. Finally, we present a comparison of parameter settings in QIIME's OTU picking workflows and make recommendations on settings for these free parameters to optimize runtime without reducing the quality of the results. These optimized parameters can vastly decrease the runtime of uclust-based OTU picking in QIIME.

INTRODUCTION
Three high-level strategies for defining Operational Taxonomic Unit (OTU) cluster centroids have been widely applied for centroid-based greedy clustering (Li & Godzik, 2006;Edgar, 2010) of marker gene (e.g., 16S rRNA) sequences generated on nextgeneration sequencing platforms to facilitate microbial community analysis. These are canonically described as de novo, closed-reference, and open-reference OTU picking (Navas-Molina et al., 2013). In each of these approaches, respectively, centroids are defined internally based only on the sequences being clustered, based only on an external, predefined database of cluster centroids, or based on a combination of the two. Each of these methods has benefits and drawbacks.
In de novo OTU picking, input sequences are aligned against one another, and sequences that align with greater than a user-specified percent identity are defined as belonging to the same OTU. There are many variations and free parameters in this process, such as how many alignments are performed before a sequence is assigned to an OTU or used to define a new OTU, but the common feature of these methods is that no external reference database is required. This is also the primary advantage of this method: it is not necessary to have accumulated a collection of reference sequences before working with a new marker gene. However, de novo OTU picking is difficult to parallelize because all processes must be able to use new OTUs that are defined by other processes. Consequently, this approach cannot scale to modern-sized data sets.
In closed-reference OTU picking, input sequences are aligned to pre-defined cluster centroids in a reference database. If the input sequence does not match any reference sequence at a user-defined percent identity threshold, that sequence is excluded. The primary advantage of closed-reference OTU picking is that it is easily parallelizable. that we specifically focus on centroid-based greedy clustering approaches in this study (e.g., as in uclust and cd-hit Li & Godzik, 2006;Edgar, 2010), not approaches that require alignment of all pairs of unique sequences (i.e., the hierarchical methods described in Schloss & Westcott, 2011), as the former scale better to larger data sets. However, because our full evaluation framework (metrics and data sets) and the EMP raw sequence data are all freely accessible, it is straightforward for other groups to reproduce these evaluations on alternative methods.
All analyses presented here are performed using the QIIME and pandas python packages. As far as we know, QIIME contains the only existing implementation of the subsampled open-reference OTU picking algorithm, but the algorithm is not QIIME-specific. Thus while our comparison is based on specific QIIME/uclust-based implementations of de novo, closed reference, classic open reference, and subsampled open reference OTU picking, our findings should be general to other implementations of these algorithms.

Subsampled open-reference OTU picking algorithm
Open-reference OTU picking is preferable to the other methods presented here because it combines the advantages of closed-reference and de novo clustering. However, the de novo step of open-reference OTU picking can only be run serially, and therefore can be time-consuming for large datasets if many sequences fail to hit the reference database. To improve the runtime of open-reference OTU picking, we developed subsampled open-reference OTU picking, which incrementally increases the size of the reference database by de novo clustering a subset of the sequences that fail to match the reference database. The remainder of the sequences that fail to hit the reference database can then be clustered against these new cluster centroids in a parallel closed-reference OTU picking process. This allows for partial parallelization of the de novo clustering step and can significantly decrease runtime on large datasets, allowing open-reference OTU picking to scale to billions of input sequences (e.g., as generated in multiple Illumina HiSeq 2000 runs). It can additionally be run iteratively, so that representative sequences for the new (i.e., non-reference) OTUs can be combined with the reference database for future OTU picking runs. It is important to note that runtime is not always reduced with subsampled open-reference OTU picking. Data set and algorithm parameters have a large effect on runtime (discussed further in Runtime differences). This approach is similar to the Buckshot algorithm (Cutting et al., 1992;Jensen et al., 2002), initially described for semantic clustering of documents in a corpus, though we do not use the parallel hierarchical clustering approach described by Jensen et al. (2002) for initial clustering definition.
A detailed description of this workflow is illustrated in Fig. 1. It is implemented using uclust v1.2.22q (Edgar, 2010) for clustering in QIIME 1.6.0 (Caporaso et al., 2010) and later, though any sequence clustering software that provides support for de novo and closed-reference clustering could be substituted for uclust in an alternate implementation. The inputs provided to this method are demultiplexed, quality-filtered sequences, and a reference sequence collection (for example, the Greengenes 13 8 97% OTU representative sequences DeSantis et al., 2006;McDonald et al., 2012b). First, sequences are clustered in parallel using a closed-reference OTU picking workflow, where sequences are queried against the reference database at percent identity s (default 97%). If a read matches a reference sequence at greater than or equal to s% identity, it is assigned to the OTU defined by that reference sequence. These are referred to as the reference OTUs. Next, a random subsample of n% (n should be small, the default value in QIIME 1.8.0-dev and earlier is 0.1%) of the sequences that failed to match the reference sequence collection are clustered de novo, and the cluster centroids for all resulting OTUs are used to define a new reference sequence collection. Those OTUs are referred to as the new reference OTUs. The sequences that were not included in the random subsample that was clustered de novo then go through an additional round of parallel closed-reference OTU picking, this time where they are clustered against the new reference OTUs based on matching a sequence in the new reference sequence collection at greater than or equal to s% identity. This creation of a "new reference database" allows us to harness the parallelization of our closed-reference OTU picking pipeline, greatly decreasing the time it takes for sequences that fail to hit the initial reference database to be clustered into OTUs. In the final clustering step, sequences that fail to hit a reference sequence during this final closed-reference OTU picking step are clustered de novo. These are referred to as the clean-up OTUs. Finally, the reference OTUs, new reference OTUs, and clean-up OTUs are combined into a single OTU table (i.e., table of counts of OTUs on a per-sample basis, as described in McDonald et al. (2012a)), and this table, as well as a filtered table excluding OTUs with counts less than or equal to a user-defined threshold c, are provided to the user. By default, c = 2, so each OTU is observed at least twice (i.e., singleton OTUs are excluded). Because many more of the sequences can be clustered using closed-reference OTU picking in this workflow, it can run in far less time than classic open-reference OTU picking (see Runtime differences section below).

Evaluation of subsampled open-reference OTU picking
We validated the subsampled open-reference OTU picking workflow by comparing it to de novo, closed-reference, and classic (i.e., non subsampled) open-reference clustering methods on three different datasets: the Lauber "88 Soils" study (Lauber et al., 2009) (referred to as 88-soils here), the Caporaso "Moving Pictures" study (Caporaso et al., 2011) (referred to as moving-pictures here), and the Costello "Whole Body" study (Costello et al., 2009) (referred to as whole-body here) using three metrics. Table 1 provides a description of the OTU picking methods being compared. First, we tested the correlation between sample alpha diversities (OTU counts, i.e., QIIME's observed species metric, and Phylogenetic Diversity (PD) (Faith, 1992)) based on subsampled open-reference OTU picking and the other OTU picking protocols. Next, we tested whether beta diversity patterns (as determined by weighted and unweighted UniFrac (Lozupone & Knight, 2005) distances between samples) were consistent across OTU picking protocols, based on Mantel tests (Mantel, 1967) with 1,000 Monte Carlo iterations. Finally, we tested whether the same taxonomic profiles were obtained on a per-sample basis using each of the OTU picking Table 1 Method definitions. Definitions of the OTU picking methods being compared here, based on the abbreviations used throughout the paper. From here, we refer to each method by its abbreviation for simplicity. We note that the both de novo (uc) and classic openreference OTU picking (ucr) are accessed through QIIME's pick de novo otus.py command. ucr is applied when pick otus:otu picking method uclust ref is specified in the parameters file, and uc is applied when that option is absent. The exact command/parameter combinations used for each OTU picking run are provided in the study's GitHub repository (see Data Availability).   methods. It is important to note that we are not trying to assess whether one method is better than another using these metrics. Instead, we are testing whether the methods give highly correlated results.

Data availability
The raw sequence data analyzed in this study is available in the QIIME Database under study numbers 103 (88-soils), 449 (whole-body), and 550 (moving-pictures). All analyses were run with QIIME 1.8.0-dev. All commands, as well as all processed data and IPython Notebooks that illustrate how to work with that data, are available in this project's GitHub repository at https://github.com/gregcaporaso/cloaked-octo-ninja.

Subsampled versus "classic" open-reference OTU picking
Alpha diversity (Table 2; whole-body PD Pearson r = 0.989; 88-soils PD Pearson r = 0.930; moving-pictures PD Pearson r = 0.996), beta diversity (Table 3; whole-body unweighted UniFrac Mantel r = 0.948; 88-soils unweighted UniFrac Mantel r = 0.939; moving-pictures unweighted UniFrac Mantel r = 0.991) and taxonomic summaries (Table 4; whole-body: r = 0.999 at phylum level, 0.999 at species level; 88-soils r = 0.999 at phylum level, r = 0.999 at species level; moving-pictures r = 0.999 at phylum level, r = 0.999 at species level) were highly correlated between classic and subsampled open-reference OTU picking. Minor differences likely arise from the non-deterministic step of rarefying all samples to even sampling depth before comparing samples. These results suggest that subsampled open-reference picking yields the same results as classic open-reference OTU picking, including identical numbers of sequences failing to hit the reference database, and therefore is a suitable replacement.

Application to the Earth Microbiome Project dataset
In order to evaluate the effectiveness of the subsampled open-reference OTU picking method on an extremely large data set, the first 15,000 samples (1.3 billion V4 16S rRNA amplicons) from the Earth Microbiome Project (EMP, Gilbert et al., 2010) were processed on the Amazon Web Services (AWS) EC2 platform. These samples were split across more than 60 studies, which were clustered iteratively. To the best of our knowledge, this is the largest OTU picking run ever completed. We created a StarCluster-based (http://star.mit. edu/cluster/) virtual cluster on AWS using between 8 and 18 M2.4xlarge spot instances (the number of instances was varied at different stages of the run). Each instance (or virtual cluster node) had 69 GB RAM and 8 cores. A total of 11,242 CPU hours were consumed to complete subsampled open-reference OTU picking (at 97% nucleotide identity), and the combined input and output files consumed 1.2 TB of disk space. (This runtime includes the pre-filtering step. The process would have completed much faster if this were disabled.) The resulting OTU table contained 5.6 million non-singleton OTUs. This is the largest number of OTUs identified, and the most comprehensive survey of microbial diversity across environment types to date, so it likely suggests the magnitude of the lower-bound on the microbial diversity of the Earth (although the accuracy is limited because some of these OTUs may be artifacts of PCR or sequencing: such artifacts, e.g., chimeras, need to be identified after the OTU picking step).
We were next interested in how long the de novo clustering step of classic open-reference OTU picking would take on the EMP data set, but as we'll illustrate this is an intractable problem in practice with current computer hardware. We began by applying de novo clustering using the "fast" uclust parameter settings to the representative sequences from the 5.6 million non-singleton OTUs from the run described above. These representative sequences represent the full alpha diversity of the EMP data set (a property known to be important to runtime of de novo and open reference OTU clustering) but the data set contains only 5.6 m sequences, so is feasible to cluster de novo. We then subsampled this to contain between 10% and 80% of those sequences, in steps of 10% with 10 iterations at each step, and compiled the runtime for each clustering run. Figure 2 illustrates the relationship between runtime and input sequence count, along with the results of a regression analysis presenting median runtime as a function of sequence count (r 2 = 0.98,p = 8e-6).
In the subsampled open-reference OTU picking run on the EMP dataset, 660 million sequences failed to hit the reference database, and therefore need to be clustered de novo

Run-time differences
The speed improvements of subsampled open-reference OTU picking arise from the fact that a larger portion of the clustering process can be parallelized. When not run in parallel, or run in parallel over only a few (e.g., 3) CPUs, classic open-reference OTU picking is likely to be faster. Similarly, for smaller data sets (e.g., less than a few million sequences), especially if most sequences have a match in the reference database (e.g., with human gut microbiome data), classic open-reference OTU picking will achieve similar runtimes to   Moving-picture  Whole-body   uc  1220  27748  1095  ucr  1358  46576  1082  ucrC  226  28572  388  ucrss  1493  47207  1212  ucrss wfilter  1885  76061  2088  uc fast  914  23510  489  ucr fast  1052  19371  621  ucrC fast  44  2428  68  ucrss fast  1021  23710  707  ucrss fast wfilter  1525  52811  1661 subsampled open-reference clustering (Table 5). However, in these cases, the results are still highly correlated, so if in doubt of which method will be faster, subsampled open-reference OTU picking is a reasonable choice as the summary statistics of interest (often alpha diversity, beta diversity and taxonomic profiles) are very unlikely to be different between the two methods. When more sequences fail to hit the reference database, subsampled open-reference OTU picking becomes faster than classic open-reference OTU picking (Table 6). To illustrate this, we clustered the moving-pictures sequences against the 82% and 97% Greengenes reference OTUs at 97% identity using subsampled and classic open-reference OTU picking on 29 processors. When clustering against the 82% OTUs, 52.1 million failed to hit the reference, while when clustering against the 97% OTUs 3.4 million sequences failed to hit the reference. Subsampled open-reference OTU picking ran in 4000 s less wall time than classic open-reference clustering (in a single run of each on a system dedicated for this run time comparison) against the 82% OTUs, and in 72 s less time against the 97% OTUs, illustrating that as more sequences fail to hit the reference, subsampled open-reference OTU picking offers more of an advantage. This runtime difference would be even larger if the job were split over more processors. Another parameter that can affect runtime of subsampled open-reference OTU picking is the size of the random subsample that is selected. The optimal setting for this parameter is affected by the size of the dataset being clustered and the diversity of the sequences that fail to match the reference database. On small datasets, or datasets with a lot of novel diversity, a large fraction (e.g., 1%) is better than a small fraction (e.g., 0.001%), but as the data set increases in size a large fraction can result in far more time spent performing de novo clustering of the sequences that initially fail to hit the reference database. We recommend using the default (0.1% in QIIME 1.8.0-dev and earlier), which was chosen to reduce runtime on larger datasets where optimized runtime is more important. As this parameter setting approaches zero, subsampled open-reference OTU picking becomes more like classic open-reference OTU picking, in that more of the reads that fail to hit the reference database are clustered de novo serially, and at the limit of 0% of sequences subsampled, the subsampled open reference OTU picking becomes classic open-reference OTU picking. The summary statistics investigated here are highly correlated between classic and subsampled open-reference OTU picking, suggesting that this parameter setting will not affect those statistics, but can affect runtime.

Pre-filtering
QIIME's open-reference OTU picking workflow optionally includes a pre-filtering step, where sequences are searched against the reference database with low percent identity (the default in QIIME 1.8.0 and earlier is 60%), and sequences that fail to match are discarded from the analysis. The goal of this process is to discard sequences that are likely not representatives of the marker gene, such as host genomic sequences or products of non-specific amplification. This process is functionally similar to closed-reference OTU picking (sequence reads are searched against a pre-defined reference database), and therefore is easily run in parallel.
We show that alpha diversity (Table 2; whole-body PD Pearson r = 0.991; 88-soils PD Pearson r = 0.930; moving-pictures PD Pearson r = 0.996), beta diversity (Table 3; whole-body unweighted UniFrac Mantel r = 0.953; 88-soils unweighted UniFrac Mantel r = 0.940; moving-pictures unweighted UniFrac Mantel r = 0.990) and taxonomic summaries (Table 4; whole-body: r = 1.000 at phylum level, r = 1.000 at species level; 88-soils r = 1.000 at phylum level, r = 1.000 at species level; moving-pictures r = 1.000 at phylum level, r = 0.999 at species level) are highly correlated between the pre-filtered and non-pre-filtered results, when pre-filtering is performed at percent identity of 60%. Despite nearly identical results, the pre-filtering process results in vastly increased runtimes. Consequently, we no longer recommend pre-filtering of sequences prior to open-reference OTU picking. Rather, contaminant sequences should be discarded after OTU picking. This feature is now disabled by default starting with QIIME 1.8.0-dev.
One case where pre-filtering may prove useful is in the preparation of sequence data where there is a large amount of contamination of non-marker-gene sequence, for example host genomic contamination. In this case, pre-filtering can be useful to remove those sequences prior to clustering. Note that if you suspect that your sample may contain human genomic contaminant sequences, it is important to filter them out before analysis or data deposition due to Institutional Review Board or other ethical concerns related to release of human DNA sequences.

Clustering parameters
We also investigated the effect of clustering parameters on the same summary statistics, as these can have a considerable effect on runtime. We compared uclust's default settings (referred to in QIIME as "fast mode") with the default settings in QIIME 1.8.0 and earlier ("slow mode"). We again compared the methods based on the degree to which they resulted in correlated alpha diversity (Table 2), beta diversity (Table 3), and taxonomic results (Table 4), and found that all results were highly correlated between fast and slow modes. This suggests that while fast mode will occasionally make suboptimal OTU assignments, the effects are subtle enough to be unnoticeable in downstream ecological analyses. We therefore recommend using the "fast" settings for decreased runtime, and these are now the default in QIIME 1.8.0-dev.
We do recommend using the "slow" settings if clustering sequences to build reference OTUs (for example, as is performed when building the Greengenes reference OTU collection McDonald et al., 2012b) because suboptimal OTU assignments can have further reaching consequences. For example, "splitting" an OTU (i.e., defining two sequences that are within s% identity of each other as the centroids of two different s% OTUs), which is always a possibility in greedy clustering algorithms, is more common with the "fast" settings than with the "slow" settings. If this occurs in a single study, the downstream effects are limited to that study and are likely only to be problematic if the split OTU is of key significance to the system being investigated. However, a split OTU when defining reference OTUs is more problematic, because those definitions will be used in many studies, increasing the chance that the split OTU will be problematic for someone. For this application, the processing step is typically only run once per database release (which is relatively infrequent). Therefore, the longer runtime is preferable to less accurate OTU definitions in this particular application. If splitting and lumping of OTUs is of concern on your dataset, you may want to experiment with the "slow" parameter settings, which are still accessible in QIIME and we also recommend exploring the use of Oligotyping (Eren et al., 2013).

Consistent OTU definitions across runs: iterative open-reference OTU picking
Subsampled open-reference clustering, as implemented in QIIME, provides new identifiers for sequences that fail to match the reference database, allowing OTUs to be directly compared across clustering runs (although sequences clustered against this expanded reference sequence collection do need to be from the same gene fragment as the sequences used to expand the reference sequence collection). These OTUs can also be used in iterative OTU picking, which is useful in studies where sequence data is continuously accumulating, for example in routine monitoring of microbial communities in human subjects (e.g., patients monitored over time), the built-environment, or during environmental clean-up.

CONCLUSIONS
Taken together, the reduced runtime of subsampled open-reference OTU picking relative to classic open-reference OTU picking on large datasets, and the benefits that open-reference OTU picking offers over full de novo OTU picking (vastly decreased runtime) and closed-reference OTU picking (all sequences are clustered, not only those that match the reference collection), we recommend subsampled open-reference OTU picking when a reference collection is available.
Because the metrics provided here show that the same summary statistics are derived from the four OTU picking protocols, an interesting question is whether de novo or open-reference OTU picking offers any benefit over closed-reference OTU picking. The primary motivation for using methods that incorporate previously unknown OTUs (i.e., those that are not represented in the reference database) such as de novo and open-reference OTU picking is that OTUs not represented in the reference database might best illustrate a biological pattern of interest. For example, in the 88-soils data analyzed here, 1 of the top 10 OTUs identified as significantly different across sample pH is an OTU that is not represented in the reference database (Table 8) (this OTU was classified as in the Actinomycetales order by QIIME's uclust-based taxonomy classifier). Similarly, for the whole-body data set, 2 of the top 10 OTUs identified as significantly different across body sites were not represented in the reference database (these were classified as Prevotella melaninogenica and Veillonella parvula by QIIME's uclust-based taxonomy classifier). On the other hand, in the moving-pictures data analyzed here, all of the top 10 OTUs identified as significantly different across body site were OTUs represented in the reference database. Table 7 illustrates the fraction of OTUs not represented in the reference database by environment based on the Earth Microbiome Project dataset. We expect that using OTU picking methods that incorporate new OTUs is more important in samples where this fraction is higher.  In conclusion, this paper presents the performance-optimized subsampled openreference OTU picking algorithm, now available in QIIME. This method can be applied iteratively to define stable OTUs across sequencing runs, and achieves nearly identical results to "classic" open-reference OTU picking (i.e., not including the subsampling step). It enables massive sequencing projects such as the Earth Microbiome Project to use open-reference OTU picking in far less time than is possible with classic open-reference OTU picking, which will facilitate our exploration of microbial diversity. Further, the iterative nature of the process (which is also possible with classic open-reference OTU picking) enables progressively expanding datasets, as might be generated in clinical laboratories as microbiome-based medical treatment becomes a reality, to cluster OTUs using OTU definitions from previous clustering runs as reference sequences. This