Generation of Comprehensive Ecosystem-Specific Reference Databases with Species-Level Resolution by High-Throughput Full-Length 16S rRNA Gene Sequencing and Automated Taxonomy Assignment (AutoTax)

High-throughput 16S rRNA gene amplicon sequencing is an essential method for studying the diversity and dynamics of microbial communities. However, this method is presently hampered by the lack of high-identity reference sequences for many environmental microbes in the public 16S rRNA gene reference databases and by the absence of a systematic and comprehensive taxonomy for the uncultured majority. Here, we demonstrate how high-throughput synthetic long-read sequencing can be applied to create ecosystem-specific full-length 16S rRNA gene amplicon sequence variant (FL-ASV) resolved reference databases that include high-identity references (>98.

ASVs are often preferred over OTUs because they provide single-nucleotide resolution and can be applied as consistent labels for microbial identification independently of a 16S rRNA gene reference database (8). This approach is used with short-read ASVs in several large-scale projects, including the Earth Microbiome Project (EMP) (9) and the American Gut project (10), to provide detailed insight into the factors that shape the overall microbial community diversity and dynamics. However, ASVs are not ideal as references for linking microbial identity with the physiology and ecology of key community members, which is crucial if we want to use the microbial community structure to predict ecosystem functions or process performance in engineered systems. First, without taxonomic assignment, it is not possible to compare results across studies that have used primers targeting different regions of the 16S rRNA gene. Second, short-read ASVs alone do not contain enough evolutionary information to resolve their phylogeny confidently (11,12). This limitation makes it impossible to report and infer how microbial traits are conserved at different phylogenetic scales. As such, functional properties for uncultured lineages predicted from the annotation of metagenome-assembled genomes (MAGs) with complete rRNA genes (high-quality minimum information about a metagenome-assembled genome [MIMAG] standard) (11), or determined through in situ studies, cannot be confidently linked to the shorter ASV sequences (12,13). A robust taxonomic assignment is therefore crucial for crossstudy comparisons and the accumulation and dissemination of knowledge relating to uncultured lineages.
Taxonomic assignment to ASVs relies on a classifier, e.g., the SINTAX (7) or the RDP classifier (6), which uses statistical algorithms to compare each ASV to a full-length 16S rRNA gene reference database to propose the best estimate of their taxonomy. Confident classification at the lowest taxonomic ranks (genus and species) requires high-identity reference sequences (Ն98.7% identity) and a complete seven-rank taxonomy for all references (13). Neither of these criteria is met with the most commonly applied universal reference databases, e.g., Greengenes (14), SILVA (15), and RDP (16), as they lack sequences or taxonomic assignment for many uncultivated environmental taxa. Given the vast diversity predicted for microbial life on Earth (17), it will be some time before reference sequences for all species are generated, and the manual curation of their taxonomy will not be feasible.
A potential solution to the problems mentioned above is to create ecosystemspecific reference databases. They can either be ecosystem-curated versions of universal databases or smaller independent databases that only include sequences from the specific ecosystem. The MiDAS 2.0 database for microbes in biological wastewater treatment systems (18) and the Dictyopteran gut microbiota reference Database (DictDb) (19) are examples of ecosystem-curated versions of the SILVA databases, where the taxonomies for the abundant and process-critical microorganisms are manually curated and maintained. However, the mostly manual taxonomic curation of central reference databases is time-consuming and subjective. The universal reference databases are also clustered at 99% identity or below to keep the unique sequences to a manageable number, which reduces the taxonomic resolution.
Examples of independent ecosystem-specific databases include the human intestinal tract 16S taxonomic database (HITdb) (20), the human oral microbiome database (HOMD) (21), the freshwater-specific FreshTrain database (22,23), the honey bee gut microbiota database (24), and the rumen and intestinal methanogen database (25). While such databases have been shown to improve the rate of classifications for amplicons, they generally contain a relatively limited number of sequences and are therefore associated with an inherent risk of over-or misclassification if the sequence being classified is not represented in the database. To address this issue, Rohwer et al. introduced the TaxAss algorithm that classifies amplicons using two reference databases, namely, a universal database and a small ecosystem-specific database (23).
Amplicons are first mapped to the ecosystem-specific database to determine the percent identity with the best hit, and those above a user-defined threshold are classified using the ecosystem-specific database. The remaining sequences are classified using the more extensive universal database. While higher rates of classification were achieved, an issue with the approach is the potential for closely related sequences that fall on either side of the user-defined threshold to receive very different taxonomies, especially if the ecosystem-specific database is not updated to reflect the evolving taxonomy of the universal reference database. As such, while current strategies to create ecosystem-specific databases have shown promise, there are critical issues that need to be resolved before their potential can be realized.
The recent development of methods for high-throughput full-length 16S rRNA gene sequencing, e.g., synthetic long-read sequencing on the Illumina platform (26,27), along with PacBio (28) or Nanopore (29) consensus sequencing, now allows for the generation of millions of high-quality reference sequences within days. Importantly, these technologies now allow for the high-throughput generation of high-identity reference databases with broad coverage of the true diversity. However, improving sequence coverage alone will not solve the problem of poor taxonomic assignments for many uncultured taxa. We have therefore developed the AutoTax pipeline, which provides a simple and efficient strategy for the creation of comprehensive ecosystemspecific taxonomies that cover all seven taxonomic ranks. AutoTax uses the SILVA taxonomy as a backbone and provides stable placeholder names for unclassified taxa, based on de novo clustering of sequences according to statistically supported identity thresholds for each taxonomic rank (12). Importantly, AutoTax databases are easily updated with subsequent releases of the SILVA taxonomy-avoiding the divergence of generated ecosystem-specific taxonomies with the universal reference database. The strict computational nature of the taxonomy assignment means that it is objective and reproducible. The simplicity of the applied de novo clustering also ensures that the placeholder names are maintained even though the database is expanded with additional reference sequences.
We demonstrate the potential of the AutoTax method by sequencing almost a million full-length 16S rRNA gene sequences from Danish biological wastewater treatment and bioenergy systems. The sequences were denoised to resolve full-length gene amplicon sequence variants (FL-ASVs) with single-nucleotide resolution. Taxonomy was assigned to the FL-ASVs using the AutoTax pipeline to create an ecosystem-specific reference database. As evidence supporting the value of our approach, mapping of short-read amplicon data revealed that a substantially higher proportion of sequences were matched to high-identity references and received species-and genus-level classification when the FL-ASV database was used than those of the much larger public universal reference databases.

RESULTS AND DISCUSSION
Sampling and high-throughput sequencing of full-length 16S rRNA sequences. To obtain 16S rRNA gene reference sequences for Danish wastewater treatment plants (WWTPs) and anaerobic digesters (ADs), we sampled biomass from 22 typical WWTPs and 16 ADs treating waste activated sludge located at Danish wastewater treatment facilities (see Table S2 in the supplemental material). These facilities represent an important engineered ecosystem containing complex microbial communities of both bacteria and archaea, with the vast majority of microbes being uncultured and poorly characterized (30).
DNA and RNA were extracted and used for synthetic long-read 16S rRNA gene sequencing using both a primer-based and primer-free approach (26) (Fig. 1a). A total of 926,507 full-length 16S rRNA gene sequences were obtained after quality filtering (see Table S3 in the supplemental material). They were denoised with UNOISE3 to generate a comprehensive reference database of 9,521 FL-ASVs with an error rate below the detection limit according to theoretical calculations and an analysis of 7,816 sequences previously obtained from the eight-strain ZymoBiomics microbial community DNA standard (26) (see Text S1 in the supplemental material).
To estimate the number of FL-ASVs belonging to novel taxa, FL-ASVs were mapped to the SILVA 138 SSURef NR99 database (15) using global mapping with USEARCH, and the identity of their closest relatives was compared with the thresholds for taxonomic ranks proposed by Yarza et al. (12) (Table 1). The majority of the FL-ASVs (94%) had references in the SILVA database above the genus-level threshold (Ͼ94.5% identity), WWTPs and ADs, and DNA and RNA were extracted. Purified DNA or RNA was used for the preparation of primer-based and "primer-free" full-length 16S rRNA libraries, respectively. They were sequenced and processed bioinformatically to produce the FL-ASVs and FL-OTUs. (b) Mapping of V1-V3 amplicon data to the FL-ASV reference database, the FL-OTU expanded database, and commonly applied universal reference databases. ASVs were obtained from activated sludge and anaerobic digester samples and filtered based on their relative abundance before the analyses to uncover how well the databases cover the rare biosphere. The fraction of the microbial community represented by the remaining ASVs after the filtering (coverage) is shown as the mean Ϯ standard deviation across plants. (c) Mapping of V1-V3 ASVs from Dutch WWTPs based on raw data from Gonzalez-Martinez et al. (36). For details, see  a FL-ASVs were mapped to SILVA 138 SSURef NR99 to find the identity with the closest relative in the database. The novelty was determined based on the identity for each FL-ASV using the threshold for each taxonomic rank proposed by Yarza et al. (12).
but 26% lacked references above the species-level threshold (98.7% identity), which are crucial for confident taxonomy assignment to ASVs.

FL-ASVs have better ecosystem coverage than universal reference databases.
To evaluate if the FL-ASV database contained high-identity references for all bacteria in the ecosystem, we mapped V1-V3 ASVs obtained from the following two sources: (i) the same samples used to create the FL-ASVs and (ii) samples from unrelated Danish WWTP and ADs. The ecosystem-specific FL-ASV database (9,521 seq.) included more highidentity references (Ͼ98.7% identity) for the abundant ASVs (relative abundance cutoff at 0.01%) in all samples analyzed than that of the much larger universal databases MiDAS 2.1 (548,447 seq.) (18), SILVA 138 SSURef NR99 (510,984 seq.) (15), SILVA 138 SSURef (2,225,272 seq.) (15), Greengenes 16S v.13.5 (1,262,986 seq.) (14), and the full RDP v.11.5 (3,356,808 seq.) (16) (Fig. 1b and Fig. S1 in the supplemental material). ASVs were also mapped to the 16S rRNA gene database derived from the Genome Taxonomy Database (GTDB) release 89 (17,460 seq.) (31). However, this database lacked highidentity references for almost all ASVs. The poor coverage likely relates to the fact that 16S rRNA genes often fail to assemble in MAGs produced by short-read sequencing data (32). This problem will likely disappear in the future with the introduction of more high-quality MAGs with complete rRNA genes into the GTDB as a result of long-read sequencing technologies, such as Nanopore and PacBio (33,34).
When the rare biosphere was included in the analysis (relative abundance cutoff at 0.001%), a decrease in the percentage of ASVs with high-identity reference sequences was observed (Fig. 1b). This may be a problem in ecosystems with high diversity, such as soil and sediments (9), where low-abundant microbes constitute a considerable fraction of the community, but also in engineered systems where transient or lowabundant bacteria, such as pathogens or bacteria degrading micropollutants, may be important. To get a better representation of the rare biosphere, we created an additional reference database, which besides the FL-ASVs, contained chimera-filtered, full-length 16S rRNA gene sequences clustered at 99% identity (FL-OTUs). This database greatly increased the coverage for the rare biosphere because it includes FL-OTUs for sequences that were only observed once. However, the improved coverage is achieved at the expense of taxonomic resolution (see later).
Since only Danish WWTPs and ADs were used to establish the comprehensive high-identity FL-ASV reference database, published amplicon data from non-Danish WWTPs (35,36) were also evaluated ( Fig. 1c and d and Fig. S2 in the supplemental material). Compared with the analyzed universal reference databases, the Danish reference FL-ASVs performed better or as well for most of the investigated non-Danish WWTPs, which indicates that even with less than 10,000 sequences, the database covers many of the microbes that are common to WWTPs across the world, particularly for systems with nutrient removal. We anticipate that our ongoing sampling of more than 1,000 WWTPs and AD systems across all 7 continents and different processconfigurations (MiDAS Global, https://www.midasfieldguide.org/global) for highthroughput full-length 16S rRNA gene sequencing will provide references for the region-specific taxa in the near future, providing a comprehensive database of reference sequences for this ecosystem.
A new comprehensive taxonomic framework. A major limitation in the classification of amplicon data from environmental samples is the lack of genus and species names for many uncultivated bacteria in the universal reference databases. To address this limitation, we developed a simple taxonomic framework (AutoTax), which provides a consistent taxonomic classification of all reference sequences to all seven taxonomic ranks using identity thresholds (Fig. 2).
In the AutoTax pipeline, the FL-ASVs (or FL-OTUs) are first mapped to the SILVA 138 SSURef NR99 database, which provides the taxonomy and percent identity of the closest relative in the database. This taxonomy is assigned to the FL-ASV down to the taxonomic rank supported by the sequence identity thresholds proposed by Yarza et al. Table 1). Because of the stringent mapping and the taxonomy trimming based on percent identity, we obtained an overall better taxonomy assignment than that of commonly used classifiers, as revealed by a leave-one-out classification test (see Fig. S3 in the supplemental material).
Since species-level classification is desired wherever possible and the official SILVA taxonomy for bacteria and archaea is not curated at the species level (37), FL-ASVs were also mapped to the 16S rRNA gene sequences from type strains extracted from the SILVA 138 SSURef NR99 database, as they carry official species names. Species-level classifications were assigned to the FL-ASVs if they shared more than 98.7% identity with only one species. If the FL-ASVs matched more than one, they were not classified at the species level due to the high risk of misclassification.
Because the SILVA taxonomy does not provide a complete seven-rank taxonomy for all sequences, the missing classifications are covered with a de novo placeholder taxonomy. This taxonomy was created based on the clustering of the FL-ASVs at identity thresholds corresponding to each taxonomic rank (12). The clusters were labeled according to the format denovo_x_y, where x is a one-letter abbreviation for the taxonomic rank (k, p, c, o, f, g, and s), and y represents the number of the FL-ASV, which is the cluster centroid of the particular taxon. Because the applied clustering algorithm processes the sequences sequentially in the order they appear in the input file, sequences are always clustered in the same way, even if additional FL-ASVs are later added to the database. This strategy may not always yield the most optimal clusters, but the reproducibility is critical if the clusters are to be used as a robust placeholder taxonomy.
Merging of the SILVA-and the de novo-based taxonomies resulted in a few conflicts, e.g., where different FL-ASVs from the same species associate with more than one genus. In such cases, the genus-level classification for the centroid FL-ASV is adapted (2) Taxonomy is adopted from this sequence after trimming based on percent identity and the taxonomic thresholds proposed by Yarza et al. (12). (3) To gain species-level information, FL-ASVs are also mapped to sequences from type strains extracted from the SILVA database, and (4) species names are adopted if the identity is Ͼ98.7% and only a single species is within the threshold. (5) FL-ASVs are also clustered at different percent identities, corresponding to the thresholds proposed by Yarza et al. (12). The clustering is used to generate a stable de novo taxonomy. (6) Finally, a comprehensive taxonomy is obtained by filling gaps in the SILVA-based taxonomy with the de novo taxonomy. Colored squares represent sources of taxonomic classifications of FL-ASVs, as follows: orange, SILVA SSURef NR99; yellow, SILVA type strains; blue, de novo names; and gray, names rejected during the AutoTax workflow.
for all FL-ASVs within that species. However, these types of conflicts only applied to lower rank taxa (species), which were located close to the taxonomic threshold of the higher rank taxa (genus), and it only affected the classification of a low number of FL-ASVs (approximately 1%).
As AutoTax is based on the SILVA taxonomy, the taxonomy generated will change if another version of the SILVA SSURef NR99 reference database is used. Accordingly, users must specify which version of the SILVA database has been used when they publish databases created with AutoTax. We recommend that AutoTax-generated databases are updated when there is a new version of the SILVA database. This ensures that the taxonomy is in agreement with the current central taxonomy.
Taxonomy assignment to FL-ASVs with AutoTax. AutoTax provided placeholder names for many previously undescribed taxa in our FL-ASV database ( Table 2, see Fig. S4 in the supplemental material). A total of 95% of all species, 73% of all genera, 47% of all families, and 24% of all orders obtained placeholder names from the de novo taxonomy and would otherwise have remained unclassified. The novel taxa were affiliated with several phyla, especially the Proteobacteria, Planctomycetota, Patescibacteria, Firmicutes, Chloroflexi, Bacteroidota, Actinobacteriota, and Acidobacteriota (Fig. S4). A prominent example is the Chloroflexi, where 5 out of 14 orders, 26 out of 34 families, and 142 out of 152 genera observed were assigned a de novo placeholder taxonomy. We believe that this method will have important implications for future studies and the accumulation and transfer of knowledge about these taxa in WWTPs, given their high diversity and abundance, and their association with serious operational problems related to the settling of activated sludge (bulking) and foaming (38,39). It should be noted that the placeholder taxonomy does not provide the same degree of support as traditional phylogenetic analyses and should only be used until an official taxonomy is established.
We have here shown that increased coverage of the rare biosphere can be achieved by adding FL-OTUs clustered at 99% to the FL-ASV database. The FL-OTU-expanded database increased the classification rate at the genus level for the abundant bacteria (Ͼ0.01% relative abundance) (94.2% Ϯ 1.4% versus 89.9% Ϯ 4.3%) but reduced classification at the species level (67.7% Ϯ 2.6% versus 78.5% Ϯ 4.0%) (Fig. 3a). When ASVs from the rare biosphere (Ͼ0.001% relative abundance) were included in the analysis, the advantage of including FL-OTUs for genus-level classification was even more pronounced (93.0% Ϯ 0.6% versus 80.7% Ϯ 1.7%), and an improved classification rate  Fig. S5 in the supplemental material). We hypothesize that the reduced classification rate at the species level for the abundant ASVs is caused by sequencing errors as well as low-divergence chimera, which cannot be detected by chimera filtering (40). It should be noted that FL-OTUs are problematic for databases that are to be maintained and updated with additional references in the future. This is because the placeholder taxonomy will change if FL-OTUs are removed, and sequencing errors and chimeras are propagated if they are kept. We therefore recommend that FL-OTUs are added only for exploratory purposes.
To investigate the influence of AutoTax on taxonomic assignment, independent of our ecosystem-specific database, we applied the pipeline directly to high-quality, full-length sequences from the SILVA 138 SSURef NR99 database. This increased the percentage of ASVs classified with SILVA at the genus level from 30.4% Ϯ 3.5% to 63.7% Ϯ 4.8% and at the species-level from 0% to 28.2% Ϯ 2.4%, suggesting that the large universal databases would also benefit from the use of AutoTax. An advantage of the AutoTax-processed SILVA database, which we have made publicly available, is that the placeholder taxonomy is universally applicable, providing a unique opportunity for studying the ecology of unclassified taxa across ecosystems.
Taxonomic resolution of ASVs in combination with a comprehensive reference database. Classification of amplicon sequences can be challenging due to the limited taxonomic information in short-read sequences (12,13). However, this challenge may change with the access to reference databases with perfect references for the majority of all ASVs and a complete seven-rank taxonomy for all reference sequences. To determine the confidence of the amplicon classification in this scenario, we extracted ASVs in silico from the bacterial FL-ASVs corresponding to commonly amplified 16S rRNA regions, including full-length amplicons. These ASVs were classified against the FL-ASV database. We then calculated the fraction of amplicons correctly classified to the same genus and species as their corresponding FL-ASVs (Fig. 3b and Fig. S6a in the supplemental material). Nearly all ASVs (95% to 99%) were assigned to the correct  genus and most (72% to 91%) to the right species, depending on the taxonomic conservation of 16S rRNA region covered by the in silico amplicons. The primers targeting the V1-V3 variable region performed exceptionally well for species-level identification (90.7% correct classifications), which is the same as for the full-length 16S rRNA gene amplicons. The commonly used primers targeting the V4 variable region were the worst (72.5% correct classifications). Very few of the sequences that did not receive the same taxonomic classification as their source reference sequence where misclassified (Ͻ0.2% at genus level and Ͻ0.8% at species level), with the majority not receiving any classification at the specific taxonomic rank (Fig. S6a).
Sequencing costs on the Illumina platforms can be reduced considerably if single reads are used instead of merged reads. To evaluate the effect of reduced amplicon length, we compared the classification of 200-bp forward reads and reverse reads to those of full-length amplicons (Fig. 3b). The decrease in the percentage of correct classifications was highly dependent on the 16S rRNA region targeted and from which direction the single reads were obtained. For the ecosystem studied here, the V1-V3 and V4-V5 forward read provided almost the same specificity as the full-length amplicons, whereas the reverse reads performed much worse for species-level classification. For the V3-V4, V5-V7, and V5-V8, the reverse reads performed better than the forward reads, revealing the importance of choosing the right direction for single-read amplicons.
To evaluate the effect of sequencing errors and low-divergence chimeras in the reference database, we also classified the in silico ASVs against the reference databases expanded with the FL-OTUs (Fig. S6b). The result confirmed our prior observations that the inclusion of error-prone references had a negative impact on our ability to classify short-read amplicons correctly; however, the effect was marginal at the genus level. Full-length amplicons were less affected (Fig. S6b), highlighting a clear advantage of using longer amplicons in combination with universal databases, which are likely to contain sequencing errors and low-divergence chimeras despite chimera filtering (40).
Overall, the analysis demonstrated that confident classification of short-read ASV sequences at the genus to species level is possible. However, it requires a reference database with a complete seven-rank taxonomy and perfect references for the majority of all ASVs. The scripts made available with this study can be used to confirm whether this is the case for samples from other studies.
Ecosystem-specific evaluation of primer bias. When choosing primers for amplicon sequence analyses, it is essential also to take primer bias into account, as some primer sets may result in ecosystem-relevant species being severely underestimated or absent from the analyses (41). The ecosystem-specific FL-ASV databases provide a near-perfect reference to determine the theoretical coverage of different primer sets for the given ecosystem so that an informed selection can be made (42,43). It should be noted that primers used to generate the full-length 16S rRNA sequences for FL-ASV databases may introduce a bias, and here, we have included primer-free (RNA-based) libraries to account for this. Evaluation of different primer sets using our FL-ASV database revealed a clear taxonomic bias associated with several primer sets (Fig. 3c). The primers targeting the V4-V5 and V5-V8 regions had the best coverage of the FL-ASVs. The V5-V7 primers demonstrated very poor coverage. Because the FL-ASVs were trimmed after the forward priming site of the V1-V3 primers, we were not able to evaluate the coverage of this primer pair here. However, it has previously been shown that the primers have a good overall agreement with metagenomic data for wastewater treatment systems and capture most of the process-critical organisms (44).
Perspectives. High-throughput full-length 16S rRNA gene sequencing and automated taxonomy assignment (AutoTax) now allow individual research groups to develop their own FL-ASV ecosystem-specific reference databases for community profiling analyses. In addition, such databases can be used to evaluate the ecosystem-specific coverage and specificity of amplicon primers and fluorescence in situ hybridization (FISH) probes. The high quality of the FL-ASVs furthermore allows for the design of new primers and probes with improved confidence of the coverage and taxonomic resolution when applied within the target ecosystem. Collectively, the approach importantly allows for the identification and subsequent characterization of novel numerically important taxa for the specific environment, which would have otherwise been overlooked.
We acknowledge that the ability to quickly generate new 16S rRNA gene reference databases poses a risk for the development of several competing divergent taxonomies. We, therefore, recommend that the custom databases are only used for exploratory purposes and are combined with traditional phylogenetic analyses of key taxa. Ecosystem-specific databases that are broadly applied, such as the MiDAS database, should be created as open-source community efforts, and universal reference databases processed with AutoTax should be published and maintained in agreement with current developers of such databases. An important benefit of this approach is that the placeholder taxonomy can be used as a common language within the field, or in the case of universal reference databases, across all fields. This has major implications, e.g., in wastewater treatment systems, as it allows for the identification of unclassified taxa that are processcritical and decisive for process performance. If the current universal reference databases are used, the majority of ASVs will not get a genus-level classification, making it impossible to compare their prevalence across studies. Given that hundreds of amplicon-based studies are carried out every year worldwide, a considerable amount of useful information is lost when the data generated across studies are not comparable.
We have chosen to use the SINTAX-classifier for our analyses because it applies a simple algorithm that does not require training, and we expect the results to be less biased. However, some classifiers which use Bayesian inference, e.g., q2-featureclassifier (5), may yield better classifications. Kaehler et al. (45) recently demonstrated that environment-specific taxonomic abundance information could be used as weights for such classifiers to improve the accuracy of the taxonomy assignment. This approach is interesting because the frequency of individual raw full-length 16S rRNA gene sequences used to generate the FL-ASV database may be used as phylogenetically informative weights for the specific ecosystem.
We used SILVA SSURef NR99 as the backbone taxonomy for AutoTax in this study, as it is currently the most comprehensive database. However, we anticipate the GTDB may replace SILVA in a future release of AutoTax when more high-quality genomes and MAGs with 16S rRNA genes (MIMAG standard [11]) are added to the database as the result of advances in long-read sequencing technology (Nanopore and PacBio). This will importantly link the 16S rRNA gene taxonomy with that derived from the more robust phylogenomicbased analyses, creating a unified language across the field of microbiology.

MATERIALS AND METHODS
General molecular methods. The concentration and quality of nucleic acids were determined using a Qubit 3.0 fluorometer (Thermo Fisher Scientific) and a 2200 Tapestation (Agilent Technologies), respectively. Agencourt RNAClean XP and AMPure XP beads were used as described by the manufacturer, except for the washing steps, where 80% ethanol was used. RiboLock RNase inhibitor (Thermo Fisher Scientific) was added to the purified RNA to minimize degradation. All commercial kits were used according to the protocols provided by the manufacturer unless otherwise stated. Oligonucleotides used in this study can be found in Table S1 in the supplemental material.
Samples and nucleic acid purification. Activated sludge and anaerobic digester biomass were obtained as frozen aliquots (-80°C) from the MiDAS collection (18). Sample metadata are provided in Table S2. Total nucleic acids were purified from 500 l of sample thawed on ice using the PowerMicrobiome RNA isolation kit (Mo Bio Laboratories) with the optional phenol-based lysis or with the RiboPure RNA purification kit for bacteria (Thermo Fisher Scientific). Purification was carried out according to the manufacturer recommendations, except that cell lysis was performed in a FastPrep-24 instrument for 4 ϫ 40 s at 6.0 m/s to increase the yield of nucleic acids from bacteria with sturdy cell walls (41). The samples were incubated on ice for 2 min between each bead beating to minimize heating due to friction. DNA-free total RNA was obtained by treating 41 l of the purified nucleic acid with the DNase Max kit (Mo Bio Laboratories), followed by clean up using 1.0ϫ RNAClean XP beads with elution into 25 l nuclease-free water.
Primer-free full-length 16S rRNA library preparation and sequencing. Purified RNA obtained from biomass samples was pooled separately for each sample source type (activated sludge or anaerobic digester) to give equimolar amounts of 16S rRNA determined based on peak area in the TapeStation analysis software A.02.02 (SR1). Full-length small subunit (SSU) sequencing libraries were then prepared, as previously described (26). The SSU_rRNA_RT2 (activated sludge) and SSU_rRNA_RT3 (anaerobic digester) reverse transcription primers and the SSU_rRNA_l adaptor were used for the molecular tagging (Table S1), and approximately 1,000,000 tagged molecules from each pooled sample were used to create the clonal library. The final library was sequenced on a HiSeq 2500 instrument using on-board clustering and rapid run mode with a HiSeq paired-end (PE) rapid cluster kit v2 (Illumina) and HiSeq rapid SBS kit v2, 265 cycles (Illumina), as previously described (26). Raw sequence reads were binned based on unique molecular tags, de novo assembled into synthetic long-read sequences, and trimmed equivalent to Escherichia coli positions 8 and 1507 using the fSSU-pipeline-RNA_v1.2.sh script (https://github.com/ KasperSkytte/AutoTax) (26).
Primer-based full-length 16S rRNA gene library preparation and sequencing. The purified nucleic acids obtained from the biomass samples were pooled separately for each sample source type (activated sludge or anaerobic digester) with an equal weight of DNA from each sample. Full-length 16S rRNA sequencing libraries were then prepared, as previously described (26). f16S_rDNA_pcr1_fw1 (activated sludge) or f16S_rDNA_pcr1_fw2 (anaerobic digester) and f16S_rDNA_pcr1_rv (Table S1) were used for the molecular tagging, and approximately 1,000,000 tagged molecules from each pooled sample were used to create the clonal library. The final library was sequenced on a HiSeq 2500 instrument using on-board clustering and rapid run mode with a HiSeq PE rapid cluster kit v2 (Illumina) and HiSeq rapid SBS kit v2, 265 cycles (Illumina), as previously described (26). Raw sequence reads were binned based on unique molecular tags, de novo assembled into synthetic long-read sequences, and trimmed equivalent to E. coli positions 28 and 1491 using the fSSU-pipeline-DNA_v1.2.sh script (https:// github.com/KasperSkytte/AutoTax) (26).
Extraction of high-quality full-length 16S rRNA gene sequences from SILVA. High-quality bacterial and archaeal 16S rRNA gene sequences in the SILVA 138 SSURef NR99 ARB-database were selected using the query pintail_slv ϭ 100 and tax_slv ϭ Bacteria* or tax_slv ϭ Archaea*. Bacterial and archaeal sequences were exported separately in the "fastawide" format after terminal trimming. Bacterial sequences were trimmed between the 27F and 1391R (44) primer binding sites equivalent to positions 1,044 and 41,788 in the global SILVA alignment. Archaeal sequences were trimmed between the 20F (46) and the SSU1000ArR (47) primer binding sites equivalent to positions 1041 and 32818 in the global SILVA alignment. A list of names for full-length sequences spanning the positions above was created using the Extract_full-length_16S_rRNA_names_from_SILVA.sh script, which takes advantage of the fact that ARB uses the period to specify terminal gaps and therefore indicates truncated sequences in the exported FASTA files. The names were used to select and export the full-length bacterial or archaeal sequences without trimming from the SILVA ARB database.
Generation of reference databases using AutoTax. AutoTax was created as a modular multistep Linux BASH script that (i) generates FL-ASV and full-length 16S rRNA gene operational taxonomic units clustered at 99% identity (FL-OTU) reference sequences from high-quality, full-length 16S rRNA sequences; (ii) assigns a comprehensive seven-rank taxonomy to all reference sequences based on the SILVA taxonomy with the addition of placeholder names for unclassified taxa defined by de novo clustering of sequences using specific identity thresholds for each taxonomic rank (12); and (iii) produces formatted reference databases, which can be directly used for classification using SINTAX or classifiers in the QIIME 2 framework.
AutoTax combines several software tools (GNU parallel v.20161222 [48], USEARCH v.11.0.667 [49], SINA v.1.6.0 [50], and R v.3.5.0 with the following packages: biostrings [51], doParallel [52], stringr [53], data.table [54], tidyr [55], and dplyr [56]) into a single BASH script that otherwise requires only a single FASTA file with the user-provided full-length 16S rRNA gene sequences and the FASTA-formatted SILVA_138_SSURef_NR99_tax_silva reference database as input. The script, as well as a docker container image with all required software (except USEARCH, as the required 64-bit version is not free and must be purchased online) is available on the GitHub repository online at https://github.com/KasperSkytte/ AutoTax. The script is composed of separate, individual BASH functions to both allow for customization of the script as well as unit testing using the BASH automated testing system (https://github.com/bats -core/bats-core) where possible. Core functions of AutoTax are briefly described below. Expanded descriptions can be found in the supplementary information (Text S1).
Resolving full-length 16S rRNA amplicon sequence variants. Input sequences are oriented based on the SILVA 138 SSURef NR99 database using the usearch -orient command, dereplicated using usearch -fastx_uniques with the -sizeout, -strand plus, and -threads 1 options, and finally denoised to produce the FL-ASVs using the usearch -unoise3 command with the -minsize 2 option.
Preparation of chimera-filtered full-length 16S rRNA OTUs. Dereplicated sequences (before denoising) from above are clustered at 99% sequence identity using the usearch -cluster_smallmem command with the -id 0.99, -maxrejects 0, -centroids, and -sortedby size options. Potential chimeras are identified and extracted using the usearch -uchime2_ref command with the -strand plus, -mode sensitive, and -chimeras options with the FL-ASVs from above as the reference database. The chimeras are finally removed to create the final FL-OTUs using the usearch -search_exact command with the -strand plus and -dbnotmatched options.
Taxonomy assignment. A complete taxonomy from kingdom to species is automatically assigned to each FL-ASV. In brief, the AutoTax script identifies the closest relative of each FL-ASV in the SILVA database using the usearch -usearch_global command, obtains the taxonomy for this sequence, and discards information at taxonomic ranks not supported by the sequence identity and the thresholds for taxonomic ranks proposed by Yarza et al. (12). The identity thresholds used for each of the taxonomic ranks are 75.0%, phylum; 78.5%, class; 82.0%, order; 86.5%, family; 94.5%, genus; and 98.7%, species. For the species-level classification, the script identifies all type strains within the species-level threshold in the SILVA database and assigns a species-level classification to the FL-ASV if only a single species fits within the threshold. In addition, FL-ASVs are de novo clustered using the usearch -cluster_smallmem command using the thresholds for each taxonomic rank. The de novo clusters are labeled according to the format denovo_x_y, where x is a one-letter abbreviation for the taxonomic rank (k, p, c, o, f, g, and s), and y represents the FL-ASV number of the cluster centroid for each taxon. These labels act as a placeholder taxonomy, where the SILVA taxonomy does not provide any taxonomy information.
Construction of phylogenetic trees and primer evaluation. FL-ASVs aligned to the SILVA 138 NR99 ARB database were obtained from the AutoTax output (temp/FL-ASVs_SILVA_aligned.fa) and loaded into the SILVA 138 NR99 ARB database. All bacterial FL-ASVs were selected and exported as a FASTA file using the ssuref:bacteria positional variability by parsimony filter. A rough tree was created from the alignment using FastTree v.2.1.10 (59) with the -nt -gtr -gamma options and loaded into ARB. The specificity of commonly used amplicon primers was determined for each FL-ASV using the analyz-e_primers.py script from PrimerProspector v.1.0.1 (42). The specificity of primer sets was defined based on the overall weighted scores (OWSs) for the primer with the highest score as follows: perfect hit (OWS, 0), partial hit (OWS, Ͼ0 and Յ1), and poor hit (OWS, Ͼ1). A comma-separated table with the specificity of each primer set for each FL-ASV was made in R, loaded into ARB, and used to color the tree.

SUPPLEMENTAL MATERIAL
Supplemental material is available online only. TEXT S1, DOCX file, 0.04 MB.