Taxonomy annotation errors in 16S rRNA and fungal ITS sequence databases

Sequencing of the 16S ribosomal RNA (rRNA) gene and the fungal Internal Transcribed Spacer (ITS) region is widely used to survey microbial communities. Specialized ribosomal sequence databases have been developed to support this approach including Greengenes, SILVA and RDP. Most taxonomy annotations in these databases are predictions from sequence rather than authoritative assignments based on studies of type strains or isolates. Here, I investigate the error rates of taxonomy annotations in these databases. I found 253,485 sequences with conflicting annotations in SILVA v128 and Greengenes v13.5 at ranks up to phylum (9,644 conflicts), indicating that the annotation error rate in these databases is ~15%. I found that 34% of non-singleton genera have overlapping subtrees in the Greengenes tree from 2001 according to the RDP taxonomy, most of which are probably due to branching order errors in the Greengenes tree, which is therefore an unreliable guide to phylogeny. Using a blinded test, I estimated that the annotation error rate of the RDP database is ~10%.


Abstract
Sequencing of the 16S ribosomal RNA (rRNA) gene and the fungal Internal Transcribed Spacer (ITS) region is widely used to survey microbial communities. Specialized ribosomal sequence databases have been developed to support this approach including Greengenes, SILVA and RDP. Most taxonomy annotations in these databases are predictions from sequence rather than authoritative assignments based on studies of type strains or isolates.
Here, I investigate the error rates of taxonomy annotations in these databases. I found 253,485 sequences with conflicting annotations in SILVA v128 and Greengenes v13.5 at ranks up to phylum (9,644 conflicts), indicating that the annotation error rate in these databases is ~15%. I found that 34% of non-singleton genera have overlapping subtrees in the Greengenes tree from 2001 according to the RDP taxonomy, most of which are probably due to branching order errors in the Greengenes tree, which is therefore an unreliable guide to phylogeny. Using a blinded test, I estimated that the annotation error rate of the RDP database is ~10%.

Background
Next-generation sequencing the 16S ribosomal RNA (rRNA) gene has revolutionized the study of microbial communities in environments ranging from the human body (Cho and Blaser, 2012;Pflughoeft and Versalovic, 2012) to oceans (Moran, 2015) and soils (Hartmann et al., 2014). Specialized 16S rRNA sequence databases providing taxonomy annotations include and Greengenes (DeSantis et al., 2006), SILVA (Pruesse et al., 2007) and RDP (Maidak et al., 2001). Most taxonomies in the RDP database were predicted by the Naive Bayesian Classifier (Wang et al., 2017), while most taxonomies in Greengenes and SILVA were annotated by a combination of database-specific computational prediction methods and manual curation based on predicted phylogenetic trees (McDonald et al., 2012;Yilmaz et al., 2014).

Microbial taxonomy and phylogeny
The goal of taxonomy has been described as classifying living organisms into a hierarchy of categories (taxa) that are useful (based on biologically informative traits) and monophyletic (consistent with the true phylogenetic tree) (Hennig, 1965), though this point of view is not universally accepted (e.g., (Benton, 2000)). With microbial markers such as 16S rRNA, most organisms are known only from environmental sequencing, and in these cases taxonomy must necessarily be predicted from sequence alone. If taxonomy is based on traits, then a prediction of taxonomy is a prediction that characteristic traits are present, or, more precisely, whether a taxonomist would decide that sufficiently many characteristic traits are present, noting that these may be subject to revision when new species are studied. For organisms known only from sequence, the best evidence for phylogeny is usually a tree inferred from the sequences. It is therefore useful to consider the problem of taxonomy prediction given a sequence-based tree. This is the approach used by SILVA and Greengenes, while RDP annotations for environmental sequences are predictions made by an algorithm, the Naive Bayesian Classifier, which does not explicitly incorporate phylogeny.

Taxonomy prediction from a sequence-based tree
Consider a binary phylogenetic tree with authoritative taxonomies for a small subset of its leaves. The goal is to predict taxonomies for the remaining leaves which represent organisms known only from sequence. The authoritative tree is the implied binary tree for the leaves having authoritative taxonomies, i.e. classifications based on direct observations of characteristic traits. Note that its annotations are considered to be authoritative, but not necessarily its branching order. The lowest common ancestor (LCA) of a taxon is the lowest node above all its leaves. If the branching order is correct and the taxonomy is consistent with the tree, then the subtree under the LCA for a taxon is pure, i.e. contains no other taxa at the same rank. The lowest pure taxon (LPT) for a node is the taxon at lowest rank for which all the leaves in its subtree belong to the same taxon. For example, the LPT for node n is family F if, and only if, the leaves in the subtree under n belong to two or more genera in F. If F contains exactly one genus G, then F cannot be the LPT of any node because if F is pure then G will also be pure and G is lower. A pure edge connects two nodes having the same LPT. For example, consider the tree shown in Fig. 1. Sequences s, g1 and g2 are type strains with authoritative taxonomy; w, x, y and z are sequences obtained from environmental samples. The authoritative tree (black edges) connects s, g1 and g2. Genus is the lowest rank; s belongs to a singleton genus S (i.e., it is the only authoritatively named sequence for that genus); g1 and g2 belong to the same genus G. S and G belong to the same family, F. The LCA for S is s (a singleton leaf is always its own LCA), and the LCA for G is g.
The LPTs of g1, g2 and g are G, the LPT of s is S, and the LPT of f is F. Pure edges are shown by continuous lines; impure edges are dashed lines. If the tree is consistent with the taxonomy, then all impure edges join an LCA node for some taxon t to a node which has an LPT at a higher rank than t, and conversely every LCA node has one impure edge which connects it to a higher node. Purity is closely related to monophylicity; however, monophylicity is generally understood to be with respect to the true but unknown phylogeny, while purity is defined with respect to a given predicted phylogeny which may have an incorrect branching order.

The clade extension problem
Adding a new sequence (Q) of unknown taxonomy introduces a new node into the tree, its join node, indicated by an arrowhead in Fig. 1. Assume the tree is correct and the taxonomy is consistent with the tree. Then, if the join node is below the LCA for taxon t, Q certainly belongs to t. In Fig. 1, y certainly belongs to G because its join node is below g, the LCA of G.
Also, w certainly belongs to a novel genus because it joins a family edge (which is above two or more genera by definition), and a new genus must therefore be introduced because it is impossible for the join node or any node above it to be the LCA for any known genus.
The environmental sequence x joins the tree at the impure edge joining s and f, so it is above the LCA for S, but the subtree under the join node contains only S. The tree therefore cannot resolve whether x belongs to S or to a novel genus because both possibilities are consistent with the data. If x belongs to S, its join node will be the new LCA for S, expanding the clade by moving its LCA higher in the tree. Otherwise, x belongs to a novel singleton genus, call it V, and x will be the LCA of V if it is added to the tree. Thus, s is the only sequence which unambiguously belongs to S-if a new sequence Q differs from s, then the join node is above the LCA for S and Q could belong to a different genus. This ambiguity could be resolved by a sequence identity threshold. For example, if x is 97% identical to s, then it might be reasonable to assign x to S, but identity thresholds are only approximate guides and this prediction is uncertain even if the tree is assumed to be correct. A similar problem arises in non-singleton taxa. The join node for z is above the LCA for G, but the subtree below the join node contains only G and z could therefore belong to G (which would expand the clade by moving the LCA for G up to the join node for z), or to a novel genus, in which case z would be a new singleton LCA. In general, if a sequence Q joins the tree at an impure edge, then the tree is consistent with assigning Q to a new taxon or extending an existing clade to include Q.

The subtree impurity problem
Trees inferred from large sequence sets are unreliable guides to phylogeny due to incorrect branching orders (Huelsenbeck et al., 2000;Philippe et al., 2011). Naive assumptions that the tree is correct and the taxonomy is consistent with the tree are violated by branching order errors and also by taxa which are found to overlap by sequence such as Escherichia and Shigella (Escobar-Páramo et al., 2003), which may be regarded as errors in the taxonomy. A conflict between the tree and the taxonomy occurs when taxa at the same rank overlap (Fig. 2). An overlapped leaf is below LCAs for two or more taxa at the same rank.
Conversely, a taxon t is consistent with the tree if, and only if, the subtree under the LCA for t is pure. In Fig. 2, i is the LCA for G and j is the LCA for H; neither is pure because i is above h1 and j is above all members of both G and H. If an environmental sequence Q joins the tree at an edge where taxa overlap, i.e. the edge is above leaves belonging to two or more taxa at a given rank but below the LCAs for those taxa, the prediction at that rank is ambiguous.
For example, consider cases where the join node for Q is in p-i, q-i, h1-q or g3-q. Here, Q could plausibly belong to G, H or a novel genus, depending on the underlying explanation of the tree conflict. If Q joins the tree below a node for which taxon t is pure, then it is plausible that Q belongs to t, though this purity heuristic is not reliable because the branching order may be incorrect. For example, suppose h1 in Fig. 2 is an environmental sequence rather than a known strain, then the subtree under i is pure at genus rank and h1 would be incorrectly assigned to G by the purity heuristic.

High-identity outliers
The lowest common rank (LCR) (Edgar, 2018) is the lowest rank shared by a pair of sequences. Note that an LCR is the name of a rank, not a taxon name at that rank. For example, if two sequences belong to different genera in the same family, then their LCR is genus. If a pair of sequences has an anomalously high identity considering its LCR, this indicates a possible annotation error, though it could also be explained by unusually high sequence conservation or by lateral transfer. For example, if two 16S rRNA sequences are 99% identical and are annotated as belonging to different orders in the same class, this is likely to be a problem in the annotations. This approach can be applied most effectively to a reference used to make annotations in a larger database because an anomaly in the reference may propagate to multiple predictions, obscuring the underlying issue. I tested this method using RDP 16S rRNA training set v16, the BLAST 16S rRNA reference database (BLAST16S) (Sayers et al., 2012) downloaded 17th January 2017, and the Warcup fungal ITS reference v2 (Deshpande et al., 2015).

Nomenclature conflicts
At least three prokaryotic taxonomy nomenclatures are widely used. The Greengenes nomenclature is based on the NCBI Taxonomy database (Sayers et al., 2012), RDP on Bergey's Manual (Bergey, 2001), and SILVA on LSPN (Parte, 2014). These nomenclatures differ mainly in revisions to resolve conflicts with sequence and add new candidate groups identified by sequence. For example, Escherichia and Shigella overlap (see Background), which Greengenes resolves by leaving their sequences unclassified at ranks below Enterobacteriaceae. SILVA deals with this issue differently by retaining well-established species names such as Escherichia coli and introducing a combined genus named Escherichia-Shigella. An example of a candidate group is JS1 (Webster et al., 2004), which is found in the SILVA nomenclature but not in the tested versions of RDP or Greengenes. Most well-established taxa are found in all nomenclatures, and in these cases a biologist would presumably expect a taxon name to be placed within the same group at a higher rank. To investigate this for a representative pair of large databases, I identified the common nomenclature as the set of names which appear in both SILVA and Greengenes. If a taxon name and its parent name were both found in the common nomenclature, I tested whether the databases agreed on the parent name.

Conflicting annotations in Greengenes and SILVA
I identified sequences found in both SILVA and Greengenes that are annotated by names in their common nomenclature in at least one of the databases, and listed cases where the annotations are different. A disagreement must be due a difference in the nomenclature hierarchy, a different definition of the taxon (e.g., different characteristic traits), or errors in one or both annotations.

Blinded test of RDP predictions
If a sequence that was previously known only from environmental samples is found in a newly-sequenced isolate strain, it can provide a blinded test of taxonomy prediction analogous to CASP (Moult, 2005), where protein structures predicted from sequence are compared to newly-solved structures determined by sequence-independent methods such as X-ray crystallography. In principle, predictions in an older version of Greengenes or SILVA could be assessed by comparison with isolates sequenced after annotations were generated, but this approach is stymied by a lack of data specifying which sequences were used as references and the deposition dates of new isolate sequences. With RDP, it is straightforward because some of the older training sets for the Naive Bayesian Classifier are deposited in public archives, and new sequences are readily identifiable as those not found in an older training set. If a training set is considered to be an authoritative reference, this enables predictions for new, authoritatively classified sequences to be assessed using an old training set as a reference, which is effectively equivalent to a blinded test. I implemented this test using versions 9 and 16 of the RDP reference data, which were the oldest and newest, respectively, available for download at the time this work was performed. Version 9 contains 10,049 sequences; version 16 has 13,212 and therefore contains more than three thousand new sequences. I assessed predictions by the RDP Classifier on these new sequences using v9 as a reference and the v16 annotations as the truth standard.

Prediction performance metrics for the blinded test
I used the following metrics (Edgar, 2014(Edgar, , 2018 to measure prediction accuracy using a reference database (A) with a query set (S) containing sequences with known taxonomy. Metrics are calculated separately for each rank.

Conflicts between the RDP taxonomy and Greengenes tree
Greengenes and SILVA use sequence-based trees to guide predictions of taxonomy for environmental sequences. Conflicts between a phylogenetic tree and a taxonomic hierarchy arise when taxa at the same rank overlap in the tree (see Background). Such conflicts could readily be detected given the tree and the authoritative annotations used as the basis for prediction, but this data is not provided by Greengenes or SILVA and I therefore chose to assess the Greengenes tree from 2011 (http://greengenes.lbl.gov/Download/Trees/16S_all_gg_2011_1.tree.gz) using taxonomy annotations from RDP training set v16 as a reference. This assesses an old tree by comparison with a new authoritative reference, following the rationale of the previous section. Annotations were propagated from the RDP reference by identifying exactlymatching sequences in Greengenes. I extracted the binary tree implied by the Greengenes tree for leaves annotated by the RDP reference. For each taxon in the RDP hierarchy, I identified its LCA (i.e., the lowest node above all members of the taxon), and then the overlapped taxa and leaves at each rank. If taxonomy is required to be consistent with the true phylogenetic tree, then an overlap must be due to an error in the branching order of the tree and/or an error in the taxonomy.

Annotation errors in the full RDP database
Annotations in the full RDP database are generated by the Naive Bayesian Classifier (NBC).
Performance metrics for NBC predictions on new sequences in RDP training set v16 using v9 as a reference are shown in Table 1. The top-hit identity distribution (Edgar, 2018) of the blinded test is shown in Fig. 3, together with the distribution of the full RDP database against training set v16 for comparison. The full distribution is strongly peaked at 99% identity, while the blinded distribution has a higher frequency of low-identity sequences.
The blinded test is therefore more challenging than annotating the full database, and the average genus accuracy of the RDP annotations is probably higher than the ~85% observed in the blinded test. Given these results, a rough estimate is that ~10% sequences in RDP have incorrect genus annotations.

Conflicts between the RDP taxonomy and Greengenes tree
I transferred taxonomies from RDP training set v16 to the Greengenes tree from 2011 by finding identical sequences (see Methods), giving a binary tree with 20,952 leaves where the branching order is given by the Greengenes tree and taxonomy annotations at the leaves are given by the RDP training set. Results are shown in Table 2. Conflicts between the taxonomy and tree are identified as overlaps. At all ranks, a large majority of the leaves are overlapped; i.e., are found in the subtrees of two or more LCA nodes at the same rank. A taxon is defined to be overlapped if its LCA contains leaves belonging to other taxa at its rank. There are 1,454 genera in the authoritative tree; 586 of these are singletons which are necessarily pure. Of the 868 non-singleton genera, 294 (34%) are overlapped, and 4/28 phyla (14%) are overlapped. While some conflicts could be due to errors in the taxonomy, these results strongly suggest pervasive branching order errors in the Greengenes tree.

Nomenclature hierarchy disagreements between Greengenes and SILVA
I found that 46 taxon names were placed in different parent taxa by Greengenes v13.5 and SILVA v128, considering only names present in both databases. For example, family Brevibacteriaceae is in order Actinomycetales according to Greengenes (e.g. accession 1109545), but this family is in order Micrococcales according to SILVA (JAJB01000072.80. (825086), so there is an unambiguous disagreement between the nomenclature hierarchies. A complete list is given in Supp. Table 1.

Annotation conflicts between Greengenes and SILVA
I found 253,485 sequences present in both Greengenes v13.5 and SILVA v128 with conflicting annotations. A rank was considered to disagree if the names at that rank were different and one or both were in the common nomenclature, i.e. was a name found in both databases. 61,744 (24%) of these conflicts were due to a rank which was blank (unspecified) in one database, which implies a false positive by one database or a false negative by the other. An example of a conflict where both names are in the common nomenclature is Greengenes 4366627 which has the same sequence as SILVA LOSM01000005.1106908.1108461. Greengenes assigns this sequence to family Pseudoalteromonadaceae while SILVA assigns it to family Vibrionaceae.
Pseudoalteromonadaceae is found in SILVA (FJ889568.1.1501) and Greengenes contains Vibrionaceae (3059893). A small fraction of the conflicts (440, or 0.2%) reflected nomenclature hierarchy disagreements (see previous section); the rest were conflicts in taxon names for which both databases agreed on the parent group. Conflicts were found at all ranks: 9,644 at phylum, 31,851 class, 93,534 order, 48,945 family and 69,511 genus, counting only the highest conflicted rank. An example of a phylum conflict in a sequence classified to genus level by both databases is Greengenes 851746 which is assigned to genus Streptococcus in phylum Firmicutes and has the same sequence as SILVA JZIA01000003.3571786.3573313, which is assigned to genus Xanthomonas in phylum Proteobacteria. If a given named taxon is required to have the same definition (e.g., the same characteristic traits) in all nomenclatures where it appears, then all conflicts are necessarily due to annotation errors in one or both databases. A conservative estimate of the total number of such errors is then the lower bound where all conflicts are due to an error in one database but not the other, i.e. a total of ~250k. Absent other evidence, it is reasonable to assume that the error rate is similar in both databases, giving an estimate of half the total, i.e. ~100k errors per database after rounding down. 596,648 (~600k) matching sequences had annotations from the common nomenclature in one or both databases. This is the number of cases where conflicts are identifiable, giving a conservative estimate of ~100k/~600k (~15%) for the annotation error rates of SILVA and Greengenes.

High-identity outliers
Representative outliers for BLAST16S are shown in Table 3, for Warcup in Table 4 and for the RDP 16S training set in Table 5. The ten pairs with highest identity at each rank for all tested databases are given in Supp. Table 2. As an example, sequences gi|645320505 and gi|265678369 in BLAST16S are 100% identical and have LCR=phylum, which probably indicates an annotation error. In Warcup, several cases are identified where the genus of the species binomial name is different from the genus annotation per se; for example, Sporisorium destruens is assigned to genus Sphacelotheca in accession AY344979. At least some of these cases reflect nomenclature revisions that were only partially implemented in Warcup. For example, Serpula incrassata is an older name for Meruliporia incrassata (Skrede et al., 2011).

Discussion
The results presented here suggest that the annotation error rates of Greengenes and SILVA are ~15% and RDP is ~10%, though the uncertainties in these estimates are too great to support a conclusion that annotations in one of these databases are better or worse than another. However, while keeping this caveat in mind, it is striking that a pessimistic estimate for the RDP database of 15% on the blinded test is comparable to a conservative estimate for SILVA and Greengenes of a 15% error rate from disagreements between annotations. This suggests that the accuracy of automated annotations generated by the Naive Bayesian Classifier, which does not explicitly consider phylogeny, is comparable to or perhaps better than the accuracy of annotations in SILVA and Greengenes, which were generated by a combination of automated and manual methods guided by alignments and trees. Predictions generated by the RDP Classifier have several advantages: they are based on a documented reference, can easily be reproduced, and report a bootstrap confidence value. This allows a biologist to critically review the evidence underlying any annotation of interest. By contrast, annotations in SILVA and Greengenes do not report confidence and are not reproducible because there is a manual curation step and references are not documented.

Figure 1. Example tree showing join nodes and impure edges.
Leaves s, g1 and g2 are sequences with authoritative taxonomies. Sequences w, x, y and z are environmental sequences. Sequence s belongs to genus S, which is a singleton (I.e., has only one authoritative sequence); sequences g1 and g2 belong to genus G. The LCA for G is g, the LCA for S is s. The inferred positions of w, x, y and z in the tree are indicated by arrows. An arrowhead is a join node, i.e. the new node created by adding a new sequence to the tree. Black edges are the authoritative tree, i.e. the binary tree for leaves with authoritative taxonomies. Dashed lines are impure edges; continuous lines are pure edges (see main text).

Figure 2. Example tree showing conflicts with taxonomy.
Leaves g1, g2 and g3 belong to genus G, h1, h2 and h3 belong to a different genus H. The tree is not consistent with the taxonomy because G and H overlap. The LCA for G is i, which is above h1, and the LCA for H is j, which is above all members of both G and H. Overlaps are common in practice, where most are probably explained by errors in the branching order of the predicted phylogenetic tree (see main text).

Figure 3. Top-hit identity distribution for the blinded RDP prediction test.
Panel (a) shows the top-hit identity distribution (THID (Edgar, 2018)) for new sequences in RDP trainset v16 compared to v9 as a reference. Panel (b) shows the THID for the full RDP database against trainset v16 as a reference. Histogram bars are colored according to the most probable lowest common rank (Edgar, 2018  A pure taxon has no other taxa at the same rank under its LCA, otherwise it is overlapped. A pure leaf is under exactly one LCA at the given rank, if there are two or more it is overlapped.