Accuracy of microbial community diversity estimated by closed- and open-reference OTUs

Next-generation sequencing of 16S ribosomal RNA is widely used to survey microbial communities. Sequences are typically assigned to Operational Taxonomic Units (OTUs). Closed- and open-reference OTU assignment matches reads to a reference database at 97% identity (closed), then clusters unmatched reads using a de novo method (open). Implementations of these methods in the QIIME package were tested on several mock community datasets with 20 strains using different sequencing technologies and primers. Richness (number of reported OTUs) was often greatly exaggerated, with hundreds or thousands of OTUs generated on Illumina datasets. Between-sample diversity was also found to be highly exaggerated in many cases, with weighted Jaccard distances between identical mock samples often close to one, indicating very low similarity. Non-overlapping hyper-variable regions in 70% of species were assigned to different OTUs. On mock communities with Illumina V4 reads, 56% to 88% of predicted genus names were false positives. Biological inferences obtained using these methods are therefore not reliable.

chosen pair of template sequences has high identity. The mock communities considered here contain from 11 (Mock1) to 19 genera (Mock2) and from 7 (Mock1) to 19 families (Mock2), as shown in Table 1 in the main text. A random pair of mock community templates will therefore usually belong to different families, and an in vivo community with few dominant genera could have comparable or higher average sequence similarity. For example, the human vaginal microbiome is often dominated by members of the Lactobacillus genus (Ravel et al., 2011), so a vaginal sample is likely to have a higher chimera formation rate than a mock sample because a random pair of amplicons will often be derived from the same genus. Even if the chimera frequency is low, other sources of error can result in high diversity of erroneous sequences. This is illustrated by the Mock1 reads, which have few or no inter-strain chimeras because the strains were amplified separately but nevertheless gave large numbers of spurious OTUs with both Qclosed and QIIME* (Table 2 in main text).
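The point about same-genus amplicon pairs can be made concrete with a small calculation. The sketch below assumes amplicons are drawn independently in proportion to genus abundance, so the probability that a random pair shares a genus is the sum of squared relative abundances. Both abundance profiles are hypothetical illustrations, not measurements from the mock communities or from Ravel et al. (2011).

```python
def same_genus_pair_probability(abundances):
    """Probability that two independently drawn amplicons share a genus.

    `abundances` holds one non-negative weight per genus; weights are
    normalized here, so they need not sum to one.
    """
    total = sum(abundances)
    return sum((a / total) ** 2 for a in abundances)


# Hypothetical even community with 19 genera (the genus count of Mock2;
# the even abundances are an assumption made for illustration).
mock_even = [1.0] * 19

# Hypothetical sample in which one genus makes up 90% of amplicons and
# ten other genera make up 1% each.
dominated = [0.90] + [0.01] * 10

p_mock = same_genus_pair_probability(mock_even)
p_dominated = same_genus_pair_probability(dominated)
print(f"even 19-genus community: {p_mock:.3f}")
print(f"single-genus dominated:  {p_dominated:.3f}")
```

Under these assumed profiles the same-genus probability is 1/19 (about 0.053) for the even community and 0.811 for the dominated one, which is the direction of the effect described above.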

Substitution, insertion and deletion (SID) errors due to PCR and sequencing.
Call a read harmful if it has >3% SID errors. With a 97% threshold, harmful reads cause spurious OTUs. Reads with <3% errors can also cause spurious OTUs; I will neglect this scenario for simplicity because similar arguments apply. I will also neglect the possibility that a read with >3% errors might be the only read for a given strain, in which case its OTU might be considered valid for some purposes, e.g. calculating diversity. Such reads can be neglected to a reasonable approximation because they are rare even in samples with high diversity, as shown by the following reasoning. Let h be the frequency of harmful reads. Most reads are correct or have <3% errors, so h is small. Let K be the number of singleton strains, i.e. strains having exactly one read. The number of singleton strains with a harmful read will then be approximately hK, i.e. a small fraction of K. Thus, even if singleton strains are common, those with harmful reads will nevertheless be rare.
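The hK estimate can be checked with a short simulation. This is a minimal sketch assuming that each singleton strain's single read is harmful independently with probability h. The values of h and K are hypothetical and chosen only for illustration.

```python
import random


def count_harmful_singletons(num_singletons, harmful_rate, seed=0):
    """Count singleton strains whose only read is harmful.

    Each singleton strain contributes one read, which is harmful
    independently with probability `harmful_rate` (an assumption).
    """
    rng = random.Random(seed)
    return sum(1 for _ in range(num_singletons) if rng.random() < harmful_rate)


K = 10_000   # hypothetical number of singleton strains
h = 0.02     # hypothetical frequency of harmful reads

expected = h * K
observed = count_harmful_singletons(K, h)
print(f"expected about {expected:.0f}, simulated {observed}")
```

With these assumed values roughly 2% of the singleton strains are affected, so the large majority of singleton strains are still represented by a read within the 97% threshold.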
With the simplifications and caveats described above, a read falls into a spurious OTU if, and only if, it is harmful, i.e. has >3% SID errors. Therefore a new harmful read will necessarily cause a new spurious OTU unless it falls into an existing spurious OTU, i.e. is sufficiently similar to a previously-generated harmful read.
Frequencies for SID errors, and hence the rate of harmful reads due to SID errors, depend on several known factors including the sequencing platform (e.g., 454 or Illumina), the sequence of the template (e.g., the lengths of its homopolymers), and the PCR protocol (e.g., the chosen polymerase and number of cycles). I would therefore expect the frequency of harmful reads from a given template to be primarily determined by its sequence, with biases that depend on the experimental protocol (PCR and sequencing platform). The experimental protocol and number of reads are assumed to be the same for all samples, and while the fraction of reads with >3% errors no doubt varies somewhat between samples, the variation is probably not very large because biases will tend to average out, and there is no reason to believe mock samples have unusual biases. Therefore I would expect that:
To a reasonable approximation, (a) the fraction of harmful reads (i.e., with >3% SID errors) is independent of the sample composition, and (b) mock samples have rates of harmful reads comparable to rates in samples encountered in practice. (1)
Two harmful reads that fall into the same spurious OTU must be generated from the same template sequence (or two very similar template sequences), and have similar errors. In a sample with high diversity, each new harmful read is therefore likely to be a novel error, i.e. one that is not close enough to a previous harmful read to fall into the same spurious OTU.
A novel harmful read creates a new spurious singleton OTU, and by (1) it would then follow that the number of spurious OTUs will be comparable for mock samples and samples encountered in practice. If the diversity is lower, especially if there are a few highly abundant templates, then there are more opportunities for errors to be reproduced, which will reduce the total number of spurious OTUs caused by a given number of harmful reads.
Thus, the number of spurious OTUs due to SID errors may in fact tend to be lower in samples with low diversity, such as a mock community. This conclusion assumes that singleton OTUs are retained, as with the Qclosed method.
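The effect of diversity on the number of spurious OTUs when singletons are retained can be illustrated with a toy simulation. This is a sketch under strong simplifying assumptions that are not part of the original analysis: every template is equally abundant, each template has a fixed number of equally likely error classes, and two harmful reads share a spurious OTU only if they come from the same template and the same error class. All parameter values are hypothetical.

```python
import random


def count_spurious_otus(template_weights, n_harmful, errors_per_template, seed=0):
    """Count distinct spurious OTUs produced by `n_harmful` harmful reads.

    Toy model: a harmful read picks a template in proportion to its weight
    and then one of `errors_per_template` error classes uniformly at random.
    Each distinct (template, error class) pair is one spurious OTU, and
    singleton OTUs are retained.
    """
    rng = random.Random(seed)
    templates = rng.choices(
        range(len(template_weights)), weights=template_weights, k=n_harmful
    )
    otus = {(t, rng.randrange(errors_per_template)) for t in templates}
    return len(otus)


n_harmful = 1000          # same number of harmful reads in both samples
errors_per_template = 50  # hypothetical number of error classes per template

low_diversity = count_spurious_otus([1.0] * 20, n_harmful, errors_per_template)
high_diversity = count_spurious_otus([1.0] * 2000, n_harmful, errors_per_template)
print(f"20 templates:   {low_diversity} spurious OTUs")
print(f"2000 templates: {high_diversity} spurious OTUs")
```

In this toy model the low-diversity sample yields fewer spurious OTUs because errors are more often reproduced, while in the high-diversity sample almost every harmful read founds its own OTU.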
If singleton OTUs are discarded, as with QIIME*, then forming a spurious OTU requires that an error is reproduced well enough that two harmful reads fall into the same OTU, and the rate of forming spurious OTUs will be more dependent on biases. Now imagine dividing a high-diversity sample into mock-like subsets of, say, 20 strains. Suppose the number of spurious OTUs for a mock sample increases approximately linearly with the number of reads, as seen for QIIME* in Fig. 1, so that a mock sample with N reads produces B_mock = r N spurious OTUs for some rate r. If subset j with 20 strains and n_j reads produces the same number of spurious OTUs as a mock sample with n_j reads, then each subset will produce b_j = r n_j spurious OTUs and the total number of spurious OTUs for the high-diversity sample is B_H = Σ_j b_j = Σ_j r n_j = r Σ_j n_j. By assumption, the samples have the same number of reads, so Σ_j n_j = N and hence B_H = r N = B_mock.
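A small numerical check of this additivity argument follows. It assumes the linear relation holds exactly with a single rate r shared by all subsets. The rate and the read counts per subset are hypothetical values chosen for illustration.

```python
import math

r = 0.001                                  # hypothetical spurious OTUs per read
subset_reads = [5000, 12000, 8000, 25000]  # hypothetical n_j for each subset

N = sum(subset_reads)                      # total reads; same for the mock sample
b = [r * n_j for n_j in subset_reads]      # b_j = r * n_j for each subset
B_H = sum(b)                               # high-diversity total, sum of b_j
B_mock = r * N                             # mock sample with N reads

print(f"B_H = {B_H:.1f}, B_mock = {B_mock:.1f}")
print("equal within rounding:", math.isclose(B_H, B_mock))
```

The total depends only on r and N, not on how the reads are partitioned among subsets, which is why the argument does not need the subsets to be of equal size.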