Deblur Rapidly Resolves Single-Nucleotide Community Sequence Patterns

Deblur provides a rapid and sensitive means to assess ecological patterns driven by differentiation of closely related taxa. This algorithm provides a solution to the problem of identifying real ecological differences between taxa whose amplicons differ by a single base pair, is applicable in an automated fashion to large-scale sequencing data sets, and can integrate sequencing runs collected over time.

KEYWORDS DNA sequencing, microbiome
An important goal of microbiome research is identifying taxa present in a given sample. Next-generation sequencing of the 16S rRNA gene on Illumina instruments is commonly used for this task but suffers from an error rate of 0.1% per nucleotide (1). In a typical study spanning millions of sequences, many sequences contain at least one error, obscuring the underlying biology through inaccurate taxon identification and inflated diversity statistics. These errors seldom affect statistical tests for differences between two communities, but higher precision is becoming increasingly important as the field moves toward applications with clinical or regulatory significance.
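As a rough illustration of why so many reads are affected, the following sketch computes the chance that a single read contains at least one error. It assumes that errors are independent and uniform across positions and uses a 150-nt read length; both are simplifying assumptions made for this example, not values taken from reference 1.

```python
def prob_read_has_error(per_base_error: float, read_length: int) -> float:
    """Probability that a read has at least one error, assuming
    independent and identically distributed per-base errors."""
    return 1.0 - (1.0 - per_base_error) ** read_length


# 0.1% per-nucleotide error rate (from the text); 150 nt is an assumed read length.
p_error = prob_read_has_error(0.001, 150)

# Expected number of error-containing reads per million reads sequenced.
expected_bad_reads = p_error * 1_000_000
print(f"fraction of reads with an error: {p_error:.3f}")
```

Under these assumptions roughly 14% of reads carry at least one error, which in a study of millions of reads corresponds to well over a hundred thousand erroneous sequences per million.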
The classic approach to overcoming these errors is to cluster amplicon sequences into operational taxonomic units (OTUs) (2, 3) based on an arbitrary sequence identity threshold. This approach reduces problems caused by erroneous sequences but also reduces phylogenetic resolution because sequences below the identity threshold cannot be differentiated. Furthermore, OTUs picked within individual data sets cannot be merged when different centroid sequences are chosen in each data set, limiting the ability to combine OTU results. This problem of merging OTUs is exemplified by Fig. 1, for which de novo OTUs were assessed independently on separate rounds (i.e., "run_center") of sequencing with a subset of American Gut Project data. Although approaches such as closed-reference and open-reference OTU picking (4) reduce this problem, integrating large data sets into a single OTU space remains a challenge. Here we describe Deblur, a novel sub-OTU (sOTU) method for fast and accurate identification of exact sequences in amplicon studies, and show how it can be used to integrate large data sets.
FIG 1 [...] (3) was run independently per round via QIIME. The resulting OTU tables were merged, normalizing sequencing identifiers (IDs) such that if the same sequence was observed in multiple rounds it would receive the same ID. Observations with fewer than 10 counts were dropped. The data were rarefied to 5,000 sequences per sample. The plot shown is based on unweighted UniFrac distances, and the samples are colored by the sequencing center. An interactive visualization can be viewed at https://nbviewer.jupyter.org/github/knightlab-analyses/deblur-manuscript/blob/master/embedded_figure_1.ipynb; the coloring used in the static image can be reproduced by selecting "run_center" as the scatter field. CU, University of Colorado Boulder; ANL, Argonne National Laboratory; UCSD, University of California San Diego.
Similar in concept to AmpliconNoise (5), a denoising method for pyrosequencing, Deblur, like DADA2 (6) and UNOISE2 (7), aims to obtain single-nucleotide resolution from Illumina data by using statistical methods to infer the putative true sequences within a sample that give rise to the observed distribution of error-prone sequences. Unlike DADA2 and UNOISE2, Deblur operates on each sample independently. It compares sequence-to-sequence Hamming distances within a sample to an upper-bound error profile (see Table S1 in the supplemental material; mathematical derivation in Text S1 in the supplemental material) combined with a greedy algorithm to obtain single-nucleotide resolution. The Deblur algorithm is implemented as follows (see Fig. S1 in the supplemental material). First, sequences are sorted by abundance. Second, from the most to least abundant sequence, the number of predicted error-derived reads is subtracted from neighboring reads based on their Hamming distance, using an upper bound on the error probability. A parameterized maximal probability for indels (defaulting to 0.01) and a parameterized mean read error rate for normalization (defaulting to 0.5%) are included. Finally, any sequence whose abundance drops to 0 after a subtraction is removed from the list of valid sequences. Sequences not considered to be valid (i.e., noise) are removed. After applying Deblur, only reads likely to have been presented to the sequencer are retained. However, these reads may still contain chimeras originating from PCR. Reads are therefore filtered for de novo chimeras using UCHIME (8) as implemented by VSEARCH (9) with modified parameters (Text S1).
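The sort-and-subtract procedure described for Deblur can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the released implementation: the function names are hypothetical, the error-bound values are placeholders rather than the values in Table S1, all sequences are assumed to have equal length, and indel handling, normalization by the mean read error rate, and chimera filtering are omitted.

```python
from typing import Dict, Sequence


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def deblur_sketch(counts: Dict[str, float],
                  error_bound: Sequence[float]) -> Dict[str, float]:
    """Greedy subtraction of predicted error-derived reads within one sample.

    error_bound[d] is an assumed upper bound on the fraction of a sequence's
    reads that show up at Hamming distance d because of sequencing error.
    """
    remaining = {seq: float(n) for seq, n in counts.items()}
    # Step 1: sort sequences from most to least abundant.
    order = sorted(counts, key=lambda s: counts[s], reverse=True)
    # Step 2: subtract predicted error reads from each sequence's neighbors.
    for seq in order:
        n = remaining[seq]
        if n <= 0:
            continue  # already explained as noise; it predicts no errors
        for other in order:
            if other == seq:
                continue
            d = hamming(seq, other)
            if d < len(error_bound):
                remaining[other] -= n * error_bound[d]
    # Step 3: keep only sequences whose abundance stayed above zero.
    return {seq: n for seq, n in remaining.items() if n > 0}


# Placeholder error profile (index = Hamming distance); NOT the Table S1 values.
ILLUSTRATIVE_ERROR_BOUND = [1.0, 0.06, 0.02, 0.02]

sample = {"ACGTACGT": 1000, "ACGTACGC": 400, "ACGTACGA": 5}
denoised = deblur_sketch(sample, ILLUSTRATIVE_ERROR_BOUND)
```

In this toy sample the 5-read variant is removed, because up to 60 error-derived reads are predicted at distance 1 from the 1,000-read sequence, whereas the 400-read variant, which also differs by a single base, is retained.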
Using simulated, mock, and real data sets, we compared the performance of Deblur with the performance of DADA2 and UNOISE2. Most analyses using real data sets did not complete using the free version of UNOISE2. We omitted classic OTU methods and MED (10), given the benchmarks described in reference 6. Finally, we applied Deblur to multiple sequencing rounds of the American Gut Project samples to demonstrate integration across sequencing runs from multiple instruments.
We first compared methods using simulated communities based on bacterial taxa and frequencies obtained from Sanger sequencing of one stool sample (11). Reads were simulated from this real community using ART (12) to produce Illumina-like sequence data. All three methods identified sOTUs with single-nucleotide differences (Fig. 2A). We then simulated sequences over increasing levels of similarity between the real sequences, measuring unweighted UniFrac (13) distances to ground truth, and observed OTUs (Fig. 2B and C). Deblur, DADA2, and UNOISE2 were all close to the ground truth except at high similarity levels, but all three suffered from limitations in distinguishing the true reads. Using the MiSeq data generated from a 22-member community ("mock-3") (14) trimmed to the first 150 nucleotides (nt) due to poor 3′ quality, we observed that all methods produced results that were close to the ground truth (Fig. 2D).
FIG 2 [...] (11) using the 52 most abundant bacterial species identified in this study. Reads were then simulated using an ART Illumina (12) read simulator. OTU picking was performed on these simulated reads using UNOISE2, DADA2, and Deblur. The relative abundances predicted by each of these tools and the ground truth (GT) are shown in the heat map. The dendrogram was built using hierarchical clustering based on the Hamming distance between the sequences, with numbers indicating sequence similarity (log scale). (B) Simulated communities with various levels of sequence-sequence similarity. Unweighted UniFrac distances of the predicted OTUs from UNOISE2, DADA2, and Deblur were compared to those of the original composition of the simulated communities. The x axis denotes the similarity radius for each community. The shaded area denotes the standard error of the mean distance estimation (based on 10 random repeats per community). (C) Similar to panel B but with the ratio of observed OTUs (predicted by UNOISE2, DADA2, and Deblur) to actual OTUs in each simulation indicated. (D) Performance of Deblur, UNOISE2, and DADA2 on the even1 community from mock-3 (14). GT data denote the expected ground truth relative frequency for each sOTU as informed by the design of the mock community. Dendrograms and colors are the same as described for panel A.
Stability (i.e., obtaining the same sOTU across different samples) is becoming critical as more study designs exploit existing samples from resources like the Earth Microbiome Project (15) or require integration of sequence data collected over time such as the American Gut Project (http://americangut.org). We compared the levels of stability of Deblur and DADA2 using technical replicates from a data set consisting of 40 individuals, each with one fecal sample sequenced twice on two separate MiSeq runs (16). sOTUs for each run were assessed separately, and we computed the fraction of sOTUs from one run that were also present in the second run, as a function of the minimal sOTU frequency. Deblur showed greater stability than DADA2 at a higher frequency cutoff (Fig. 3A), indicating that a larger fraction of sOTUs from the first run were also identified in the second run. To further test the stability of Deblur, we sourced previously sequenced fecal samples from the American Gut Project and selected fecal samples which spanned five distinct sequencing runs over multiple instruments from three geographically distinct locations (Table S2). As can be seen in the Emperor (17) plot of a principal-coordinate analysis of unweighted UniFrac distances, samples from different sequencing rounds (denoted by "center_project_name") are integrated in the ordination, demonstrating that the sequencing rounds do not separate as is typical with OTU-based methods (Fig. 4). In contrast, performing the same experiment with UNOISE2 (running it per round and merging results with respect to commonly identified sequences) produced an observable effect mediated by the sequencing round and run center (Fig. 5), although the magnitude of the effect was notably less than that observed with de novo OTUs (Fig. 1). We note that this is in contradiction to the recommended way of running UNOISE2 (i.e., we did not run it on the full data set); however, we are unaware of a mode of operation (free or full version) capable of operating on modern large-scale data sets.
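The run-to-run stability comparison can be illustrated with a short sketch. The function name and the toy counts are hypothetical, and two interpretations are assumed here: the minimal sOTU frequency is treated as a read-count cutoff applied to the first run, and an sOTU counts as present in the second run if it has any reads there.

```python
from typing import Dict


def shared_fraction(run1: Dict[str, int], run2: Dict[str, int],
                    min_freq: int) -> float:
    """Fraction of run-1 sOTUs with at least `min_freq` reads that are
    also observed (count > 0) in run 2."""
    kept = [seq for seq, count in run1.items() if count >= min_freq]
    if not kept:
        return float("nan")  # no sOTUs pass the cutoff
    present = sum(1 for seq in kept if run2.get(seq, 0) > 0)
    return present / len(kept)


# Toy technical replicates of one sample: {sOTU sequence: read count}.
run_a = {"AAAA": 100, "AAAT": 20, "AATT": 2}
run_b = {"AAAA": 90, "AAAT": 15, "TTTT": 3}

# Stability as a function of the minimal sOTU frequency.
stability_curve = {cutoff: shared_fraction(run_a, run_b, cutoff)
                   for cutoff in (1, 10, 50)}
```

With these toy counts the shared fraction rises from two of three sOTUs at a cutoff of 1 to all sOTUs at a cutoff of 10, the kind of curve summarized in a stability plot.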
FIG 4 A principal-coordinate analysis plot of UniFrac distances from Deblur as visualized by Emperor. A subset of American Gut Project samples spanning sequencing centers and rounds was selected. Each sample was processed separately by Deblur. Observations with fewer than 10 counts were dropped. The data were rarefied to 5,000 sequences per sample. The plot shown is based on unweighted UniFrac distances and is colored according to the round of sequencing in the American Gut Project (AG). An interactive visualization can be viewed at https://nbviewer.jupyter.org/github/knightlab-analyses/deblur-manuscript/blob/master/embedded_figure_4.ipynb; the coloring used in the static image can be reproduced by selecting "center_project_name" as the scatter field.
FIG 5 A principal-coordinate analysis plot of UniFrac distances from UNOISE2 as visualized by Emperor. A subset of American Gut Project samples spanning sequencing centers and rounds was selected. UNOISE2 was run independently per round. The resulting sOTU tables were merged, normalizing sequencing IDs such that if the same sequence was observed in multiple rounds it would receive the same ID. Observations with fewer than 10 counts were dropped. The data were rarefied to 5,000 sequences per sample. The plot shown is based on unweighted UniFrac distances and is colored according to the round of sequencing in the American Gut Project. An interactive visualization can be viewed at https://nbviewer.jupyter.org/github/knightlab-analyses/deblur-manuscript/blob/master/embedded_figure_5.ipynb; the coloring used in the static image can be reproduced by selecting "center_project_name" as the scatter field. The static image is oriented to show PC1 versus PC2; the separation is more pronounced when the projection is oriented to show PC2 versus PC3.
Next, we compared DADA2 and Deblur using a complex natural community, a previously published data set of fecal samples from two species of howler monkeys (18). Deblur and DADA2 detected 1,938 and 1,636 sOTUs, respectively, after removal of sOTUs with fewer than 10 total reads from each method (Fig. S2A). Following filtering, about 70% of the sOTUs were identical between the methods (Fig. S2B). As expected, both methods identified differential sOTUs (permutation-based rank mean test; false-discovery rate controlled at 0.1 by the Benjamini-Hochberg method [FDR-BH]), with 61% of Deblur sOTUs differentiating between primate species (1,193/1,938), compared to 55% of DADA2 sOTUs (891/1,636). To assess whether the sOTUs unique to either method reflected increased numbers of artifacts, we used BLAST (19) to compare each unique sequence against nt/nr and plotted the fraction of sOTUs with zero, one, or two mismatches. We observed that sOTUs unique to Deblur showed fewer mismatches than those unique to DADA2 (Fig. 3B). The distribution of sOTUs over the monkey samples suggests that the sOTUs unique to Deblur are more plausible because they show a pattern similar to those identified by both methods, whereas the sOTUs unique to DADA2 have markedly different patterns of clusters of unique sOTUs within single samples (Fig. 3C).
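The false-discovery rate control used when calling differential sOTUs follows the Benjamini-Hochberg step-up procedure, which can be sketched as below. The function name and the p values are hypothetical, and the permutation-based rank mean test that would produce the p values is not reproduced here; only the multiple-testing correction at a level of 0.1 is shown.

```python
from typing import List


def benjamini_hochberg(pvalues: List[float], q: float = 0.1) -> List[bool]:
    """Return, for each p value, whether its hypothesis is rejected
    while controlling the false-discovery rate at level q."""
    m = len(pvalues)
    # Indices of the p values, from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k such that p_(k) <= (k / m) * q.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * q:
            max_rank = rank
    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for idx in order[:max_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical per-sOTU p values from a differential-abundance test.
example_pvalues = [0.5, 0.001, 0.03, 0.9, 0.02]
is_differential = benjamini_hochberg(example_pvalues, q=0.1)
```

Here the three smallest p values fall under their rank-scaled thresholds (0.02, 0.04, and 0.06), so those three sOTUs would be reported as differential.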
Finally, to explore performance characteristics, we used a MiSeq run from the stability analysis in order to assess computational space and time demands of DADA2, Deblur, and UNOISE2 (where possible) over an increasing number of samples. UNOISE2 was an order of magnitude faster than Deblur, while Deblur was an order of magnitude faster than DADA2 (Fig. 3D). Deblur maintained a fairly flat memory profile (Fig. S3) in contrast to the growth observed with DADA2 and UNOISE2.
Like DADA2 and UNOISE2, which approach the same concept with different algorithms, Deblur produces stable sOTUs which can achieve single-nucleotide resolution and which can be used in place of OTUs. However, unlike DADA2 and UNOISE2, Deblur does not require operation on the full study and can thus be easily parallelized for very large projects. Finally, Deblur is released under the Berkeley Software Distribution (BSD) open-source license, allowing easy commercial adoption and peer scrutiny. Consequently, because of its stability, ability to integrate, performance, and open-source license, Deblur is positioned to operate on present and future large data sets and to support continued discovery through reuse of existing rich data sets.
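Because each sample is denoised on its own and every sOTU is an exact sequence, results from separate sequencing rounds can be combined by using the sequence itself as the shared identifier. The sketch below illustrates that idea; the table layout and function name are hypothetical, and the 10-count filter is interpreted here as a minimum total across all samples, which is an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable

# Hypothetical layout for one sequencing round: sample ID -> {sequence: count}.
SampleTable = Dict[str, Dict[str, int]]


def merge_by_sequence(tables: Iterable[SampleTable],
                      min_total: int = 10) -> Dict[str, Dict[str, int]]:
    """Merge per-round tables into one sOTU table keyed by exact sequence,
    dropping sOTUs with fewer than `min_total` reads overall."""
    merged = defaultdict(lambda: defaultdict(int))
    for table in tables:
        for sample_id, seq_counts in table.items():
            for seq, count in seq_counts.items():
                # The same sequence always maps to the same row,
                # regardless of which round it was observed in.
                merged[seq][sample_id] += count
    return {seq: dict(per_sample)
            for seq, per_sample in merged.items()
            if sum(per_sample.values()) >= min_total}


# Toy tables from two independently processed rounds.
round_a = {"s1": {"ACGT": 8, "ACGA": 3}}
round_b = {"s2": {"ACGT": 5, "TTTT": 12}}
combined = merge_by_sequence([round_a, round_b], min_total=10)
```

No centroid selection or shared reference is needed for the merge, which is what allows rounds processed at different times or on different machines to end up in one table.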