Comprehensive ecosystem-specific 16S rRNA gene databases with 1 automated taxonomy assignment (AutoTax) provide species-level 2 resolution in microbial ecology

21 High-throughput 16S rRNA gene amplicon sequencing is an indispensable method for studying the 22 diversity and dynamics of microbial communities. However, this method is presently hampered by 23 the lack of high-identity reference sequences for many environmental microbes in the public 16S 24 rRNA gene reference databases, and by the lack of a systematic and comprehensive taxonomic 25 classification for most environmental bacteria. Here we combine high-quality and high-throughput 26 full-length 16S rRNA gene sequencing with a novel sequence identity-based approach for 27 automated taxonomy assignment (AutoTax) to create robust, near-complete 16S rRNA gene 28 databases for complex environmental ecosystems. To demonstrate the benefit of the approach, we 29 created an ecosystem-specific database for wastewater treatment systems and anaerobic digesters. 30 The novel approach allows consistent species-level classification of 16S rRNA amplicons sequence 31 variants (ASVs) and the design of highly specific oligonucleotide probes for fluorescence in situ 32 hybridization, which can reveal in situ properties of microbes at unprecedented taxonomic 33 resolution. 34

washing steps, where 80% ethanol was used. RiboLock RNase inhibitor (Thermo Fisher Scientific) 132 was added to the purified total RNA to minimise RNA degradation. All commercial kits were used 133 according to the protocols provided by the manufacturer, unless otherwise stated. Oligonucleotides 134 used in this study can be found in Table S1. beating to minimise heating due to friction. DNA-free total RNA was obtained by treating a 146 subsample of the purified nucleic acid with the DNase Max kit (MO BIO Laboratories), followed 147 by clean up using 1.0x RNAClean XP beads with elution into 25 µL nuclease-free water. 148 . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted September 27, 2019. The purified nucleic acids obtained from the biomass samples were pooled for each sample source 162 type (activated sludge or anaerobic digester) with equal weight of DNA from each sample. Full-163 length SSU sequencing libraries were then prepared, as previously described (Karst et al., 2018). 164 The f16S_rDNA_pcr1_fw1 (activated sludge) or f16S_rDNA_pcr1_fw2 (anaerobic digester 165 biomass) and the f16S_rDNA_pcr1_rv were used for the molecular tagging, and approximately 166 1,000,000 tagged molecules from each pooled sample were used to create the clonal library. The 167 final library was sequenced on a HiSeq2500 using on-board clustering and rapid run mode with a 168 HiSeq PE Rapid Cluster Kit v2 (Illumina) and HiSeq Rapid SBS Kit v2, 265 cycles (Illumina) as 169 previously described (Karst et al., 2018). 170 . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted September 27, 2019. ; https://doi.org/10.1101/672873 doi: bioRxiv preprint were filtered for phiX sequences using -filter_phix, trimmed to 250 bp using -fastx_truncate -198 trunclen 250, and quality filtered using -fastq_filter with -fastq_maxee 1.0. The sequences were 199 dereplicated using -fastx_uniques with -sizeout -relabel Uniq. Exact amplicon sequence variants 200 (ASVs) were generated using -unoise3 (Edgar, 2016b). ASV-tables were created by mapping the 201 raw reads to the ASVs using -otutab with the -zotus and -strand both options. Taxonomy was 202 assigned to ASVs using -sintax with -strand both and -sintax_cutoff 0.8 (Edgar, 2018). 203 204

Data analysis and visualization 205
Usearch v.11.0.667 was used for mapping sequences to references with -usearch_global -id 0 -206 maxrejects 0 -maxaccepts 0 -top_hit_only -strand plus, unless otherwise stated. Data was imported 207 into R (R Core Team, 2016) using RStudio IDE (RStudio Team, 2015), analysed, and aggregated 208 using Tidyverse v.1.2.1 (https://www.tidyverse.org/), and visualised using ggplot2 (Wickham, 209 2009)  Details about the optimal formamide concentration used for each probe are given in Table S4. The Lactobacillus reuteri (DSM20016), and Janibacter melonis (DSM16063) were used to assess the 246 need for the specific unlabelled competitor probes Tetra67_C1, Actino221_C3, and Tetra732_C1, 247 respectively. If appropriate pure cultures were not available, probes were optimized using activated 248 sludge biomass with a high abundance of the target organism predicted by amplicon sequencing. 249 . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted September 27, 2019. ; https://doi.org/10.1101/672873 doi: bioRxiv preprint (Fernando et al., 2019). The approach was used to identify phenotypic differences between probe-252 defined Tetrasphaera phylotypes. Briefly, FISH was conducted on optically polished CaF2 Raman 253 windows (Crystran, UK), which give a single-sharp Raman marker at 321 cm -1 that serves as an 254 internal reference point in every spectrum. Tetrasphaera species-specific (Cy3) probes (Table S4) 255 were used to locate the target cells for Raman analysis. After bleaching the Cy3 fluorophore with 256 the Raman laser, spectra from single cells were obtained using a Horiba LabRam HR 800 Evolution 257 (Jobin Yvon -France) equipped with a Torus MPC 3000 (UK) 532 nm 341 mW solid-state 258 semiconductor laser. The Raman spectrometer was calibrated prior to obtaining all measurements to 259 the first-order Raman signal of Silicon, occurring at 520.7 cm -1 . The incident laser power density on 260 the sample was attenuated down to 2.1 mW/μm 2 using a set of neutral density (ND) filters. The 261 Raman system is equipped with an in-built Olympus (model BX-41) fluorescence microscope. A 262 50X, 0.75 numerical aperture dry objective (Olympus M Plan Achromat-Japan), with a working 263 distance of 0.38 mm, was used throughout the work. A diffraction grating of 600 mm/groove was 264 used, and the Raman spectra collected spanned the wavenumber region of 200 cm -1 to 1800 cm -1 . 265 The slit width of the Raman spectrometer and the confocal pinhole diameter were set to 100 μm and 266 72 μm, respectively. Raman spectrometer operation and subsequent processing of spectra were 267 conducted using LabSpec version 6.4 software (Horiba Scientific, France). All spectra were 268 baseline corrected using a 6 th order polynomial fit. 269 . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted September 27, 2019.  (Figure 1a). This resulted in a total of 926,507 fSSU sequences after quality filtering. The 280 raw sequences were dereplicated and denoised with Unoise3 to generate a comprehensive reference 281 database of 9,521 ESVs. As each fSSU is independently amplified due to the unique molecular 282 identifiers (UMIs) added before the PCR amplification steps, the risk of having multiple ESVs with 283 identical errors is extremely low if we assume random distribution of errors (see supplementary 284 results). The ESVs are therefore considered to be essentially error-free. 285 To determine the influence of library preparation method, we compared ESVs created based 286 on fSSU obtained from the four individual libraries. The DNA-based approach yielded approx. 12 287 times more unique ESVs than the RNA-based approach for the same sequencing cost (Table S3). 288 The reduced number of unique ESVs from the RNA-based libraries was expected, as only 13.3% of 289 the assembled sequences represented full-length 16S rRNA gene sequences ( Table S3). As the 290 Archaea are not targeted by the primers used, we compared the bacterial ESVs from the four 291 libraries to assess the influence of primer bias (Figure 1b). This revealed that 28% and 31% of the 292 unique ESVs identified in the shallow RNA-based libraries were not present in corresponding 293 . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made

Evaluation of the full-length ESV database using amplicon data 306
In order to evaluate if the full-length ESV database contained high-identity references for all 307 prokaryotes in the ecosystem, we mapped V1-3 amplicon sequencing data obtained from two 308 sources: the same samples used to create the ESV database and samples from unrelated Danish 309 WWTP and ADs. To ensure highest resolution, amplicon data was processed into ASVs. The 310 ecosystem-specific ESV database (9,521 seq.) included more high-identity references (>98.7% 311 identity) for all analyzed samples, compared to the 58-353-fold larger universal databases, such as 312  (Figure 1d-e, and S3-S4). Compared 324 to all universal reference databases, the Danish reference ESVs performed better or as well for most 325 of the investigated non-Danish WWTPs, which indicates that the database covers many of the 326 microbes that are common in WWTP across the world. We anticipate that high-throughput 16S 327 rRNA gene sequencing of non-Danish WWTP and ADs will provide references for the region-328 specific taxa in the future. 329

A new comprehensive taxonomic framework 331
A major limitation in the classification of amplicon data from environmental samples is the lack of 332 lower rank taxonomic information (family, genus, and species names) for many uncultivated 333 bacteria in the universal reference databases. To address this, we developed a robust taxonomic 334 framework (AutoTax), which provides consistent taxonomic classification of all sequences to all 335 seven taxonomic ranks by using a reproducible computational approach, based on identity 336 thresholds (Figure 2). 337 The full-length ESVs were first mapped to the SILVA_SSURef_Nr99 database, which 338 provides the taxonomy of the closest relative in the database as well as the percent identity between 339 the ESV and this reference. The taxonomy was assigned to the ESV down to the taxonomic rank 340 that is supported by the sequence identity thresholds proposed by Yarza et al. (2014) ( Table 1). As 341 . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made genomes with intragenomic copies that are less than 98.7% conserved, these are exceptions rather multiple ESVs from the same species associate with different genera). When this is the case, the 358 taxonomy for the ESV, which first appears in the reference database, is adapted for all ESVs within 359 that species. The pipeline produces formatted reference databases, which can be directly used for 360 classification using sintax or classifiers in the qiime2 framework. 361 AutoTax provided placeholder names for many previously undescribed taxa ( Table 2, 362 Figure S5). Essentially all species, more than 72% of all genera, 50% of all families, and 30% of all 363 orders, obtained their names from the de novo taxonomy and would otherwise have remained 364 unclassified. The novel taxa were affiliated with several phyla, especially the Proteobacteria, 365 . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made To benchmark the full-length ESV database, we classified amplicon data obtained from activated 373 sludge and anaerobic digester samples using this database and compared the results to 374 classifications obtained from the universal reference databases (Figure 3a). The ESV database was 375 able to classify many more of the ASVs to the genus level (~90%), compared to 376 SILVA_132_SSURef_Nr99 (~45%), GreenGenes_16s_13.5 (~25-30%), GTDB_r89 (~20-25%), 377 and the RDP_16S_v16 training set (~25%) but also compared to gold standard within wastewater 378 treatment systems MiDAS 2.1 (~65%), which is a manually ecosystem-specific curated version of 379 the SILVA_123_SSURef_Nr99 database. Importantly, many of the top 50 most abundant ASVs 380 only received classification with the AutoTax processed ESV database (Figure 4 and S6). 381 The use of the ecosystem-specific full-length ESV database thus significantly improved the 382 classification at all taxonomic levels and, importantly, provided species-level classifications for the 383 majority of the ASVs (~85%). To investigate the effect of the taxonomy assignment by AutoTax 384 alone, we processed the SILVA_132_SSURef_Nr99 database using AutoTax. This increased the 385 percentage of ASVs classified at the genus-level from ~45% to ~75%, and at the species-level from 386 0% to ~45%, demonstrating that the large universal databases would also benefit from the use of 387 AutoTax. However, as better classifications are obtained using smaller ecosystem-specific 388 . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted September 27, 2019. ; https://doi.org/10.1101/672873 doi: bioRxiv preprint databases, we anticipate that such databases should be used whenever a defined ecosystem is 389 studied. These ASVs were classified using sintax and the full-length ESV database. We then calculated the 395 fraction of amplicons, which was correctly classified to the same genus and species as their source 396 ESV (Figure 3b). More than 95% of the ASVs were assigned to the correct genus and 72-90% to 397 the correct species, depending on primer set used. The primers targeting the V1-3 variable region 398 performed especially well for species-level identification (90% correct classifications), while the 399 commonly used primers targeting the V4 variable region were among the worst (72% correct 400 classifications). Further analysis of the ASV that did not receive a correct classification revealed 401 that the majority did not get any classification, and very few obtained wrong classifications (<0.2% 402 at genus-level and <0.8% at species-level). The new species-specific probes designed to target the remaining abundant species, which 450 can create up to 2-3% of the biomass in some plants (Figure 5a), revealed different morphologies 451 (rod-shaped cells, tetrads, filaments, Figure 5c). Having probes for these different species most 452 importantly allows in situ single cell analyses for each. Using these FISH probes in combination 453 with Raman microspectroscopy, it was confirmed that all the FISH-defined Tetrasphaera species 454 were likely PAOs, based on the presence of a large peak for poly-P (1170 cm -1 , Figure 5d). No 455 Raman peaks were found for other intracellular storage compounds such as glycogen, PHA, or 456 trehalose -consistent with current models for the physiology of the genus in these systems. 457 Additionally, the new reference database was used to design a probe set (Tetra183 + Tetra617) for 458 genus-level screenings of all abundant species of Tetrasphaera in Danish plants for (Figure 5a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted September 27, 2019. ; https://doi.org/10.1101/672873 doi: bioRxiv preprint

632
. CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made   Figure S4. 650 . CC-BY 4.0 International license available under a was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted September 27, 2019. ; https://doi.org/10.1101/672873 doi: bioRxiv preprint SILVA_132_SSURef_Nr99 database to identify the closest relative and the shared percent identity. 653 (2) Taxonomy was adopted from this sequence after trimming, based on percent identity and the 654 taxonomic thresholds proposed by Yarza et al. (Yarza et al., 2014). To gain species information, 655 ESVs were also mapped to sequences from type strains extracted from the SILVA database, and 656 species names were adopted if the identity was >98.7% and the type strain genus matched that of 657 the closest relative in the complete database. was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted September 27, 2019. ; https://doi.org/10.1101/672873 doi: bioRxiv preprint was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted September 27, 2019. ; https://doi.org/10.1101/672873 doi: bioRxiv preprint was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint (which this version posted September 27, 2019. ; https://doi.org/10.1101/672873 doi: bioRxiv preprint