Denoising the Denoisers: an independent evaluation of microbiome sequence error-correction approaches

High-depth sequencing of universal marker genes such as the 16S rRNA gene is a common strategy to profile microbial communities. Traditionally, sequence reads are clustered into operational taxonomic units (OTUs) at a defined identity threshold to avoid sequencing errors generating spurious taxonomic units. However, there have been numerous bioinformatic packages recently released that attempt to correct sequencing errors to determine real biological sequences at single nucleotide resolution by generating amplicon sequence variants (ASVs). As more researchers begin to use high resolution ASVs, there is a need for an in-depth and unbiased comparison of these novel “denoising” pipelines. In this study, we conduct a thorough comparison of three of the most widely-used denoising packages (DADA2, UNOISE3, and Deblur) as well as an open-reference 97% OTU clustering pipeline on mock, soil, and host-associated communities. We found from the mock community analyses that although they produced similar microbial compositions based on relative abundance, the approaches identified vastly different numbers of ASVs that significantly impact alpha diversity metrics. Our analysis on real datasets using recommended settings for each denoising pipeline also showed that the three packages were consistent in their per-sample compositions, resulting in only minor differences based on weighted UniFrac and Bray–Curtis dissimilarity. DADA2 tended to find more ASVs than the other two denoising pipelines when analyzing both the real soil data and two other host-associated datasets, suggesting that it could be better at finding rare organisms, but at the expense of possible false positives. The open-reference OTU clustering approach identified considerably more OTUs in comparison to the number of ASVs from the denoising pipelines in all datasets tested. The three denoising approaches were significantly different in their run times, with UNOISE3 running greater than 1,200 and 15 times faster than DADA2 and Deblur, respectively. Our findings indicate that, although all pipelines result in similar general community structure, the number of ASVs/OTUs and resulting alpha-diversity metrics varies considerably and should be considered when attempting to identify rare organisms from possible background noise.

230 database. Distances were then generated using the summarize_taxa.py and beta_diversity.py 231 commands in QIIME1. Ordination was generated using the metaMDS function in the vegan R 232 package (Oksanen et al., 2018). 233 Mantel correlations between each distance matrix for each pipeline and each beta-234 diversity metric (weighted/unweighted UniFrac and Bray-Curtis) were generated using the vegan 235 R package. The number of observed OTUs/ASVs per sample for each real dataset were 236 generated using the rarified combined OTU/ASV tables and the alpha_diversity.py command in 237 QIIME1 using observed OTUs as the metric option.  242 There are important aspects other than accuracy that need to be considered when determining 243 which denoising package a researcher should use for their project. Both DADA2 and UNOISE3 244 are suggested to be run in a pooled-sample workflow, where all sequences are pooled together 245 during the denoising process (Table 1). This allows them to better account for batch errors 246 across multi-run experiments. Deblur, on the other hand, runs its denoising process sample-by-247 sample. This approach helps lower Deblur's computational requirements, but at the cost of 248 reducing its ability to correct multi-run batch effects. Both DADA2 and Deblur are open source 249 projects, whereas UNOISE3 is a closed-source project which has a free 32-bit academic version 250 with a 4 Gb memory cap and a full 64-bit version that costs between $885-1485 USD ( Table 1). 251 Another major difference is that the built-in Deblur function in QIIME2 has a positive filtering 252 process. This default setting causes Deblur to discard reads that do not reach a length-scaled bit Manuscript to be reviewed 253 score and an e-value threshold to any sequence in the 88% representative sequences Greengenes 254 database. Note the default database can be changed using the "other" version of the Deblur 255 plugin in QIIME2, an important feature when working with fungal or eukaryotic data. It is also 256 important to note that the stand-alone version of Deblur outputs both positively filtered and non-257 filtered results by default, unlike the QIIME2 plugin which, only outputs the positively filtered 258 results. Currently, the functionality of both DADA2 and Deblur can be accessed through a 259 graphical user interface as plugins in QIIME2 Studio, whereas UNOISE3 does not support a 260 graphical user interface ( Manuscript to be reviewed 276 with 97% or greater identity matches to expected sequences in the Extreme dataset (read depth of 277 11.6 million reads) than the other two denoising pipelines (Supp Table 4). Six of the nine taxa 278 that DADA2 called and the other pipelines did not call, had expected relative abundances of only 279 0.000427% (Supp Table 4). Of the other three taxa not found by Deblur or UNOISE3, one was 280 at an expected abundance of 0.00427% and two had an expected abundance of 0.0427% (Supp 281 Table 4). Interestingly, open reference OTU clustering called OTUs corresponding to these nine 282 expected taxa along with two more expected taxa that DADA2 missed, both of which were in the 283 lowest possible expected abundance range (0.000427%). One organism was missed by all 284 pipelines in the Extreme community (C. methylpentusum), which was also in the lowest possible 285 expected abundance range. Sequence identity histograms were constructed for the HMP, 286 Zymomock and Extreme communities (Supp Fig 1-3). In the HMP community, no ASV was 287 found to have below 80% identity by the denoising pipelines, but open reference OTU clustering 288 found 21 OTUs below 80% identity. Furthermore, the OTU pipeline called 324 OTUs below 289 99% identity whereas all denoising approaches called less than 15 ASVs (Supp Fig 1). In the 290 Extreme community, most ASVs were found in the 99-100% identity range whereas the majority 291 of OTUs found by open-reference OTU clustering were in the 95-97% identity range (Supp Fig   292 2). 293 Due to the expected sequences for the fungal community being incomplete, we included 294 any UNITE database hits as expected sequences if the matching sequence was from a species 295 that was included in the fungal community. This resulted in almost all the fungi present in the 296 community to be found by all four pipelines except for Penicillium allii and P. commune. Open-297 reference OTU clustering did find P. allii, although at an Extremely low abundance of 0.004% 298 compared to its expected abundance of 6.25%. Given that some of the above potential spurious ASVs would be removed by sequence 300 bleed-through (Illumina, 2017) or low-abundance filters in typical workflows, we applied an 301 abundance cutoff filter of 0.1% to each mock community except for the Extreme community, 302 where an abundance filter of 0.0004% was applied (Supp Fig 4). This lower abundance filter 303 was chosen due to some expected sequences being in abundance of 0.000427%. Application of 304 the abundance filter to the HMP community resulted in all 10 unmatched ASVs (those that did 305 not match either the expected or SILVA by 97% or greater) called by DADA2 to be discarded, 306 but none of the four unmatched reads in UNOISE3 to be discarded (Supp Fig 4). A similar 307 phenomenon was seen in the Zymomock community with all 12 of Deblur's unmatched reads 308 being discarded (along with one database hit) and UNOISE3 only discarding one of 19 309 unmatched reads it called. The largest effect that the application of the filters had was the 310 removal of a significant amount of OTUs found in each mock community by the open-reference 311 OTU pipeline. The most drastic change was seen in the Extreme community that started with 312 8891 OTUs and was reduced to 1248 OTUs (Supp Fig 4). In the fungal community the number   The relative abundances determined for each study on the mock communities were 510 similar to each other irrespective of which pipeline processed the data (Fig 2). This finding 511 suggests that biological conclusions based on microbial relative abundance data should be 512 unaffected by the choice of denoising pipeline. One trend that was noticed in the relative 513 abundance data was that UNOISE3 tended to call higher abundances of non-reference 514 ASVs/OTUs in each mock community except for the Extreme community where open-reference 515 OTU clustering found the highest abundance of non-reference hits. Interestingly, the lowest 516 identity match for any of these ASVs called by UNOISE3 in both the Zymomock and HMP 517 mock communities was still found to be above 90% identity to the expected sequences (Supp 518 Fig 1,3) and was classified as Gammaproteobacteria by the RDP classifier using a 70% 519 confidence threshold, suggesting it is a 16S rRNA sequence that may have been introduced by 520 contamination, sequencing bleed-through or acquired an error early on in PCR amplification.

521
The relative abundances determined within the Zymomock and fungal communities were 522 highly similar between pipelines, but markedly differed from the expected result. This finding 523 suggests that either the expected abundances of sequences from these communities may be 524 incorrect or all four pipelines are similarly biased. This non-agreement could also be due to steps 525 during the sequencing processes such as PCR amplification, which may be causing primer bias 526 (Aird et al., 2011) or the inclusion of contaminant organisms. In the case of the fungal 527 community, it is possible that none of these approaches work well with ITS1 data which are 528 more variable than 16S data. Additional fungal mock communities should be analyzed in the 529 future to better explore this issue. Manuscript to be reviewed 530 Benchmarking relative abundance profiles from different pipelines with mock 531 communities can be useful, however, they tend to lack the diversity that is found in many real 532 sample datasets. To address this issue, we compared resulting microbial compositions from each 533 pipeline across three real datasets (exercise, human-associated, and soil). Weighted UniFrac, 534 unweighted UniFrac and Bray-Curtis dissimilarity distances between the same biological 535 samples for each approach were examined. In both cases the weighted UniFrac and Bray-Curtis 536 intra-sample distances between all pipelines for all three datasets were small (less than a median 537 of 0.21) (Fig 3, Supp Figure 5). This complemented our previous results, showing that each 538 pipeline had comparable microbial compositions for the mock communities. Furthermore, 539 plotting the samples on a PCoA or NMDS resulted in the same biological samples from each 540 pipeline grouping together (Fig 3, Supp Figure 5). This indicated that a similar plot would be 541 observed whether the researcher was using the Deblur, UNOISE3, DADA2 or open-reference 542 OTU clustering. Interestingly, Deblur did not agree with the DADA2 or UNOISE3 as much as 543 they agreed with each other on multiple occasions, based on Bray Curtis dissimilarity at the 544 genus level (Supp Fig 5A,5B). This result is interesting, as one of the main differences between 545 Deblur and the other two denoising pipelines is its positive filtering feature, and so we expected 546 this feature to be driving this difference. However, when we compared the other three 547 approaches to Deblur and Deblur without positive filtering, we found no difference (Sup Fig 7). 548 Due to the similar weighted UniFrac results (Fig 3) between the denoising pipelines, we believe 549 that this difference is most likely due to highly similar sequences being classified into slightly 550 different genera. Manuscript to be reviewed 553 Fig 5). In most cases, OTUs had similar intra-sample distances between pipelines in both 554 weighted UniFrac and Bray Curtis dissimilarity as denoising pipelines had amongst each other. 555 The relatively small differences seen in the abundance-based metrics indicates that choice of 556 pipeline would have minimal impact when looking at weighted beta-diversity metrics. Although, 557 due to denoising pipelines capability of single nucleotide resolution, they may provide more 558 strain information than OTU clustering at 97% would.

559
Unweighted UniFrac beta-diversity metrics were highly variable between pipelines 560 indicating different bias between pipelines when determining low abundance sequence (Supp 561 Fig6A,6B,6C). When pipelines were plotted together on a PCoA plot we found that the samples 562 separated by approach, rather than sample, indicating that interpretation from unweighted 563 UniFrac data would most likely be impacted based on the pipeline chosen to process any set of 564 given data. We wish to highlight, however, that recent evidence has emerged that the unweighted 565 version of UniFrac analysis can give misleading results (Wong et al. 2016), therefore these 566 patterns should be interpreted with caution.

567
To follow up on the vastly different unweighted UniFrac profiles of samples, we looked 568 at the alpha-diversity between the same samples run by each pipeline and the total number of 569 ASVs/OTUs called by each pipeline in the real datasets. We found that DADA2 called the most 570 ASVs among all denoising approaches, but overall open-reference OTU clustering found the 571 most (Fig 4). This was not surprising, as it agreed with the analysis of the mock communities. 572 Interestingly, despite DADA2 finding the most ASVs overall for all three denoising pipelines, it 573 found the least amount of ASVs per sample (Supp Fig 8-10). We suspect this is due to Manuscript to be reviewed 576 pipelines. This indicated that the number of different organisms found within a sample is directly 577 impacted by the choice of processing pipeline, emphasizing the difficulty in determining the true 578 number of different organisms a sample contains. It should be noted that all of the denoising 579 pipelines provide parameters that could be altered to increase or decrease sensitivity in 580 identifying rare/spurious ASVs depending on a user's targeted application. One alarming result 581 we found was the poor correlation between the number of observed OTUs/ASVs found by 582 DADA2 and both open-reference OTU clustering and UNOISE3 (Fig 4). This is concerning as 583 major differences in biological signal would be seen depending on the approach that was chosen 584 to process the data (i.e. a sample could have relatively low numbers of ASVs based on DADA2 585 analysis, but have relatively high numbers of ASVs/OTUs based on UNOISE3 or an OTU 586 analysis). This issue highlights that the approaches are all very good at identifying highly 587 abundant sequences but vary when identifying low-abundance sequences which will impact 588 metrics that do not take into account the abundance of OTUs/ASVs.

589
A major difference between the three denoising pipelines was their computational run 590 time. UNOISE3 was magnitudes faster than both DADA2 and Deblur. This is most likely due to 591 both the programming language that UNOISE3 is implemented in (C++), as well as its simple 592 one-pass denoising pipeline. DADA2 was the slowest pipeline and, although computation time 593 could be inconvenient for those with limited computational power, it did not reach times that 594 were impractical even when running almost 2 million total reads. Memory usage for each 595 program also did not reach impractical amounts when running close to 2 million reads, with 596 DADA2 using a maximum amount of 1024 Mb of memory which is a reasonable amount for 597 modern computers. Memory usage by UNOISE3 did not come close to reaching the 4 Gb 598 memory cap on the 32-bit version, even after running 103 samples at a total read depth of 2 PeerJ reviewing PDF | (2018:02:25564:2:0:CHECK 9 Jul 2018) Manuscript to be reviewed 1 5 6 * When all sequences from all samples are denoised at the same time (in contrast to running each 7 sample separately).
8 ** Compares resulting ASVs to a database (Greengenes for Deblur) and discards reads if they do 9 not match a certain identity threshold (88% for Deblur).