High-resolution single-molecule long-fragment rRNA gene amplicon sequencing of bacterial and eukaryotic microbial communities

second-generation sequencing. We developed a method to obtain nearly full-length rDNA by assembling single DNA molecules combining DNA co-barcoding with single-tube long fragment read technology and second-generation sequencing. Benchmarking was performed using mock bacterial and fungal communities as well as two forest soil samples. All mock species rDNA were successfully recovered with identities above 99.5% compared to the reference sequences. From the soil samples we obtained good coverage with identiﬁcation of more than 20,000 unknown species, as well as high abundance correlation between replicates. This approach provides a cost-effective method for obtaining extensive and accurate information on complex environmental microbial communities.


INTRODUCTION
Currently, there are two main approaches for sequencing rRNA genes in complex microbial environmental samples: secondgeneration technologies such as sequencing by synthesis and DNA nanoball sequencing (DNBseq) and third-generation technologies such as nanopore and single molecule real-time sequencing.Second-generation sequencing suffers from relatively short read lengths (less than 300 bases) making it difficult to sequence the entire rRNA gene.Recently, co-barcoding technologies combined with second-generation sequencing and de novo assembly have enabled recovery of nearly full-length MOTIVATION Current metagenomics studies are often limited by a lack of accurate taxonomic identification related to the species and strain level.This situation is more severe for complex environment samples.Moreover, obtaining full-length rDNA for taxonomic identification is still a challenge using cost-efficient shotgun sequencing and expensive using single-molecule sequencing.Co-barcoding of shotgun sequencing reads provides additional information to assemble full-length rDNA at the single molecule level and is also high-throughput and cost-efficient.
rRNA genes, but the quantification reproducibility using complex samples is unknown. 1,24][5][6][7][8] As such, there is still a great demand for high-throughput and cost-effective approaches to support the deciphering of complex microbial communities.In this study, we developed a strategy whereby a combination of rolling-circle replication (RCR) and DNA co-barcoding 9 techniques allows sequence information of long amplicons to be obtained by a high-throughput, single-molecule level de novo assembly approach to achieve species resolution and highly accurate profiles in an efficient and cost-effective manner using a short read sequencing platform.To demonstrate the applicability of this method, we performed benchmarking using two mock communities of various bacterial and fungal species.We also applied the protocol to field soil samples as a demonstration for the discovery of unknown species and were able to recover rRNA sequences even longer than those found in current reference databases.

RESULTS
The basic idea behind the present protocol is to assemble a single DNA molecule by using DNA co-barcoding technology enabled by the single-tube long fragment read (stLFR) method.Using stLFR, it is possible to label DNA sub-fragments from a single long DNA molecule with the same barcode.Importantly, in a single stLFR library there are approximately 3.6 billion different barcodes, allowing each long DNA molecule in a sample to be labeled by a unique barcode. 10One shortcoming of the current stLFR method is that sequence coverage of each long molecule is less than 1X.As a result, using the De Bruijn graph (DBG) algorithm, it is usually impossible to achieve full assembly of DNA fragments from a single barcode.To overcome this lack of co-barcoded sequence coverage, we applied a modified version of the stLFR technology, named single tube completecoverage long fragment read (stcLFR).This approach involves an initial process of RCR, which performs a linear amplification whereby tandem copies of a single DNA molecule are joined head to tail in one long single-stranded concatemer.This method also avoids cumulative replication errors because the same DNA fragment is used as the template for replication.After conversion to double-stranded DNA, a transposon containing a capture sequence is inserted at regular intervals of approximately 500 base pairs in the RCR amplified products.After insertion of the transposon, the DNA is captured by barcodes containing oligonucleotides anchored to the surface of a micron-sized magnetic bead, whereupon the DNA is fragmented and ligated to these oligonucleotides (Figure 1A).After PCR on the beads, the DNA fragments with unique barcodes, corresponding to multiple copies of each single molecule, are subjected to DNBseq.By decoding barcoded sequences from the sequencing dataset, copy depth can be estimated in each barcoded binning group.Bins with depth >5X are then used for parallel de novo assembly (Fig- ure 1B).Successfully assembled contigs are termed pre-binning assemblies (PBAs) and used for subsequent analysis.
The ZymoBIOMICS mock community, which contains eight bacterial species (see STAR Methods), was used to test the ability of stcLFR to properly assemble the rDNA amplicons.A region covering 4.5 kb from the ribosomal small subunit (SSU) gene (515Fng forward primer) to the ribosomal large subunit (LSU) gene (TW13 reverse primer) was amplified for sequencing.From 187,942,995 raw reads generated from 3 replicates, 186,580,349 passed quality control as high-quality reads.Among these, 158,291,516 harbored a barcode, comprising in total 26,512,829 unique barcodes.Among those with valid barcoded read bins, 148,814 bins with a coverage >5X were used for de novo assembly.With a relatively small number of reads and simple genomic content, the assembly proceeded faster and was more complete compared to general metagenomic assembly.We finally retrieved 146,509 assembled PBAs with a centroid length distribution close to 800 bp (Figure S1A).Most of them covered either the SSU (34.15%) or the LSU (38.58%).Only 25.23% of PBAs covered both the SSU and the LSU (Figure S1B).The assembly quality was high, with 9.72% of the aligned PBAs exhibiting 100% identity to the reference genomes, and more than 60% of the PBAs achieved 99.5% or higher identity with the reference genomes.By contrast, only 1.78% of the PBAs could not be aligned to the reference genomes (Figure 2A).Details of numbers obtained in each step are summarized in Table S1.
As the ZYMO mock community is widely used as a reference sample for benchmarking, we also collected public datasets on sequencing of 16S rRNA and 23S rRNA genes using Illumina HiSeq2500, HiSeq4000, PacBio, and Nanopore platforms.Substitutions and insertions error rates in the 16S rRNA and 23S rRNA genes were computed by using clean reads mapping to the reference sequences (Figure S2).For substitutions, we observed that the stcLFR reads exhibited slightly higher error rates than those observed for HiSeq4000 reads but lower error rates than what was observed for the other sequencing platforms.For deletions and insertions, the error rate was similar to that of the HiSeq2500 reads but higher than those observed for HiSeq4000, PacBio, and Nanopore reads.For the stcLFR PBAs, we observed that the error rates for deletions and insertions were comparable to those observed for the HiSeq2500 platform but higher that those determined using the HiSeq4000, PacBio, and Nanopore platforms, For substitutions, the stcLFR PBAs exhibited slightly higher error rates than those observed for the HiSeq4000 platform but lower error rates than those observed using the HiSeq2500, PacBio, and Nanopore platforms (Figure S2).
For tests on fungi, a mock community comprising seven common fungal species was used to assess the classification resolution at the species level (see STAR Methods).Primers spanning a region $2.5 kb covered ITS1 and ITS2, and parts of the flanking SSU and LSU rRNA genes were used.Following the DNA library preparation and sequencing protocols used for bacterial species, we generated 190,208,124 high-quality reads and decoded 21,215,624 barcoded bins, with 270,794 having a coverage >5X.In total, 263,416 PBAs were generated with a size centered on 2.3 kb, which is close to the value expected from the design (Figure S1C).Exceeding the results obtained for the bacterial samples, 12.54% of the fungal PBAs exhibited a 100% alignment to the reference genomes, and more than 72.4% of the PBAs achieved 99.5% or higher identity to the reference genomes.Only 0.43% of the PBAs could not be aligned to the reference genomes (Figure 2B).76.8% of the PBAs covered the entire region from the SSU to the LSU, and the ITS region was included in 83.7% of the PBAs (Figure S1D).
For assessing the accuracy of quantification, we mapped all decoded reads to the assembled complete rRNA gene sequences.Mapped reads sharing the same barcode represent a single DNA molecule and as such were counted once.We determined the difference between the observed and the calculated relative abundance of each species to estimate the error.For analyses of the mock bacterial community, we Report compared the relative abundance of each taxonomy to the theoretical expectation, revealing that the differences ranged from À9.3% to 15.6%, with a standard deviation of 7.9% (Figure 2C).For analyses of the fungal mock community, we estimated the error to range from À9.7% to 8.2%, with an error standard deviation of 5.6% (Figure 2D).We estimated that the variation in the relative abundance of each species, as defined by the variance of error, was as low as 0.32% for bacterial species and 0.15% for fungi (Figure 2C).Though we only obtained a low percentage of amplicons covering the entire bacterial rRNA gene region as designed, we successfully recovered long fragments of bacterial 16S and 23S rRNA genes, as well as fungal fragments covering the entire ITS region and parts of the flanking 18S and 28S rRNA genes.
From both bacterial and fungal mock samples, the identity between PBAs and the corresponding reference sequences exceeded 99% in most cases, a percentage higher than the 97% that often is used for amplicon-based species identification.In addition, the observed low variation in the relative abundance of each species also demonstrated the consistency of the protocol, an important requirement for comparative analyses using high-resolution profiles.
We had limited success in retrieving the entire $4 kb sequence of bacterial rDNA regions and observed a gap in the coverage of the bacterial rDNA genes, but we still successfully recovered long fragments of bacterial 16S and 23S rRNA genes.3][24] In eukaryotes, tRNA genes are generally located outside the SSU and LSU regions, not in between, which may at least in part explain why the eukaryotic amplicons survived but the bacterial amplicons were split.
From the reference alignment benchmarking, we observed that when a PBA exceeded its designed size, part of the sequences could be aligned to reference sequences belonging to different species.We consider such PBAs as chimeric PBAs, which were subsequently filtered out (Figure S3B, red lines).Since we successfully enriched for the designed size of fungal rDNA sequences, this trend was prominent when the length of PBAs exceeded 2.3 kb for fungal samples.In addition, for bacterial PBAs with length <500 bp, we also observed a higher probability for achieving the exact same identity and bit score by multiple reference sequences representing different species, leading to an ambiguous annotation (Figure S3B, solid blue line).This trend became more severe in fungal PBAs with sizes <2000 bp (Figure S3B, solid blue line).These findings might point to one limitation of using short sequences to distinguish taxonomies at the species level, especially for fungi.However, coverage of multiple regions greatly eliminated the ambiguous annotation (Figure S3C).
To further explore the abundance and reproducibility of taxonomies in real environmental samples, we sequenced three replicates of two natural soil samples to identify bacterial rRNA genes and fungal ITS regions using the same primer sets and protocols used for the mock samples.We obtained 632,574,076 bacterial reads and 1,006,826,680 fungal reads with valid barcodes.For bacterial amplicons, we successfully generated 544,433 PBAs from 724,038 candidate bins (Figure 2E).As we observed using mock communities, only a small fraction (5.86%) of the bacterial PBAs covered both the SSU and the LSU regions.Most bacterial PBAs only covered the SSU (49.79%) or the LSU (30.97), whereas 13.38% did not match any currently known rDNA region (Figure 2F).For fungal amplicons, 1,807,779 PBAs from 1,845,429 high-coverage bins were generated with a size distribution that peaked at 2.3kb and 1.8kb (Figure 2G).A PBA size of 2.3kb was consistent with the primer design, showing full coverage of the flanking region of the SSU region, the ITSs, and the flanking region of the LSU region.By contrast, half of the PBAs with sizes that peaked around 1.8kb lacked the LSU region.We detected ITS regions from 65.05% of the PBAs (Figure 2H).This percentage increased when the size exceeded 1 kb.
We also tried to profile the bacterial and fungal community at different rank levels.Because soil samples contain very complex microbial communities, we chose Kraken2 to handle the profiling task.To build a Kraken2 database, we first used PBAs to generate cluster trees for bacteria and eukaryotes (mostly fungi).Since we retrieved individual rDNA subunit assemblies for bacteria, we selected PBAs harboring SSU sequences for clustering.For eukaryotes, and especially fungi, the ITS region provided the highest level of discrimination, and accordingly we selected PBAs covering ITS1 and ITS2 for clustering.The operational taxonomic unit (OTU) clusters were constructed using a set of identity thresholds enabling classification of taxa at multiple taxonomy levels, from domain to species.These thresholds were determined by pairwise global alignments of ribosomal gene subunit sequences including SSU and LSU from the SILVA database and ITS from the UNITE database (Figure S4), setting 97% identity as the threshold for genus association and 99% for species association for bacteria, and 95% and 97% identity as the thresholds for genus and species association, respectively, for fungi.Additional thresholds were the same for bacteria and fungi (see STAR Methods).The trees generated using PBAs were then merged with taxonomy trees based on the SILVA and the UNITE databases (Figure 3A).In this process, clades were merged with taxonomies when the representing PBAs achieved sequence identities greater than the rank's threshold.The combined tree contained three categories of branches.One type was represented by the public SILVA and UNITE sequences (PUBs, green in Figure 3, defined as exclusively PUBs), one represented by sequences only present in PBAs (blue in Figure 3, exclusively PBAs), and one represented by both the PBA and the PUB sequences (red in Figure 3, the shared PBAs and PUBs).The merged taxonomy tree and combined sequences were then used to build the database.
By this approach, we mapped all barcoded reads to the combined database, retrieving 60,942 bacterial OTUs at the species level of which 1,078 OTUs were supported by both the PBA sequences and PUB sequences, 18,403 could not be annotated, representing unknown species, and the remaining 41,461 were annotated by publicly available sequences.At the genus level, 6,765 OTUs represented PBAs with no annotation.Using 72% identity as the threshold for association with a phylum, we identified 241 OTUs of which 19 were unknown.For the eukaryotic communities, we retrieved 37,355 fungal OTUs at the species level of which 1,592 OTUs were supported by both the PBA sequences and PUB sequences, 1,817 could not be annotated, representing unknown species, and the remaining 33,946 OTUs were annotated by PUB sequences.At the genus level, 1,266 OTUs represented PBAs with no annotation.Using 72% identity as the threshold for association with a phylum, we identified 244 OTUs, all of which were annotated (data of sample S are shown, Figure 3B, left panel).
We found that species-level OTUs supported by both PBA and PUB sequences represented only 1.8% of all OTUs; the relative abundance of these OTUs combined corresponded to about 12% of the relative abundances of all OTUs.At the genus level, the relative abundance of OTUs supported by both the PBA and PUB sequences was even higher (62%), and at the phylum level the relative abundance of the OTUs supported by both PBA and PUB sequences reached close to 100%.This trend was more obvious for fungal OTUs, where 4.2% of OTUs supported by both the PBAs and PUBs contributed 79% of relative abundance at the species level (Figure 3B, right panel).This indicated that OTUs supported by PBAs (whether annotated or unknown) were abundant and dominated the microbial community.Compared with the fungal community, more dominant taxonomies may represent novel species.
Further analysis showed that the three categories of branches in the combined tree exhibited distinct differences Cell Reports Methods 3, 100437, March 27, 2023 5 Report in terms of reproducibility between replicates.OTUs supported by both PBA and PUB sequences exhibited high correlation between replicates (r = 0.93 for bacterial communities and r = 0.88 for eukaryotic communities; Spearman's correlation, Figure 3C).Higher correlations were observed at the genus and phylum levels.By contrast, the correlation between replicates of OTUs supported exclusively by PBA was lower, and the correlation decreased significantly when the abundance became lower than 10 À4 , both for bacteria and eukaryotes.Some of the unannotated bacterial phyla taxa still varied, indicating that they might not represent an independent phylum.For the taxa supported by public references, though well organized by taxonomic information, the species taxa performed worse than those supported by unannotated PBAs.However, the performance was improved dramatically at the phylum levels.These dominant annotated OTUs supported both by PBAs and PUBs and highly abundant (>10 À4 ) unknown OTUs exhibited high reproducibility for quantification comparison tasks at the species and genus levels.
The rarefaction curves also showed a better performance in terms of the number of reads needed for sufficient coverage of OTUs from the shared PBAs and PUBs sequences.To retrieve the majority of taxa (here defined as 95% of maximum OTUs), the publicly available database required the highest number of read-pairs, 410 million read-pairs for bacteria and 425 million read-pairs for fungi.To assess exclusively PBA OTUs, this requirement was reduced to 110 million and 150 million for bacteria and eukaryotes, respectively.The shared PBAs and PUBs representing OTUs required the lowest number of reads, only 40 million bacterial read-pairs and 55 million eukaryotic readpairs, respectively.
We finally analyzed the evolutionary phylogeny of bacterial and eukaryotic fully assembled OTUs.We failed to obtain both SSU and LSU of bacterial rDNA, and only 5,574 nearly fully covered SSU PBAs were aligned for maximum likelihood (ML) model fitting (Figure 4A).Most of the clades were clustered well at the phylum rank, but more than half of the clades at the genus rank and nearly all clades at the species rank lack annotation.These clades may represent novel taxonomies, but these findings may also reflect that current identification mostly is based on 97% similarity as a threshold, which seems to be insufficient for classification of long amplicons.For eukaryotes, 3,031 PBAs covering both SSU and LSU were aligned for ML model fitting (Figure 4B).Compared with the bacterial tree, major groups of the eukaryotic tree are mixed with ambiguous phyla.In addition, unknown phyla were found in groups II, IV, and VIII.Thus, for both bacterial and fungal long OTUs, nearly half of the OTUs at the genus level and almost all at the species level are unknown.This in-dicates that current references still might be insufficient for species level identification, warranting more studies especially on the eukaryotic microbiome.

DISCUSSION
Here we describe a cost-efficient method to sequence and assemble nearly full-length rDNA sequences by combining DNA co-barcoding with stLFR technology and second-generation sequencing.Recently, Karst and co-workers 3 published an alternative approach for long-read amplicon sequencing.In Report that work, redesigned UMIs and single-molecular sequencing were used for obtaining long-read high-accuracy amplicon sequences using Nanopore or PacBio sequencing.In their work, they applied the PacBio UMI method to generate 253,089 high-quality, full-length bacterial rRNA operon sequences from 70 human fecal samples.According to their estimation, the cost was about US$396 per sample.For comparison, to obtain 50 million short reads per sample using the stcLFR method, each sample could 19,522 bacterial PBAs and close to 30,000 OTUs, or 2,547 fungal PBAs and about 20,000 OTUs, for a cost of US$70 per sample.While their work demonstrated an elegant approach, our approach relying on second-generation sequencing enables a more cost-effective high-throughput sequencing coupled with robust and reproducible quantification.
The generation of long amplicons in a single study enables high taxonomic resolution of even very complex microbial communities as those existing in the rumen of ruminants 15 and the soil. 16,17In addition, we observed an increase in the occurrence of chimeras and assembly gaps between the bacterial 16S and 23S sequences, where tRNA tandem genes are located.Based on this observation, regions with complex structures, like tRNA cloverleaf structures, should be avoided when designing target amplicons, as such structures may negatively impact co-barcoding continuity and randomness.Long amplicon sequences are also of great value for refining current reference databases.In summary, our approach provided a 99% identity at the species level, pointing to a high-throughput strategy to expand current rRNA gene databases by including long marker sequences and potential novel taxonomies, as well as a comparable accurate quantification profiling strategy.This approach provides a cost-effective method for obtaining extensive and accurate information on environmental complex microbial communities.

BFigure 1 .
Figure 1.Sequencing framework (A)In vitro processes of rolling-circle replication (RCR) and stcLFR.DNA from bacteria and fungi was extracted and amplified using designed rDNA primers.Amplicons were turned into long concatemers with circularization and RCR.The products of RCR were labeled with unique barcodes by integrating transposons and hybridizing onto 30 million different clonal barcoded beads in a single tube.After PCR, these sheared short sub-fragments were subjected to DNA sequencing.(B) In silico processes of assembly, classification, and quantification.Reads with attached barcodes were decoded during the quality-control process.Prebinning was done by grouping reads sharing the same barcode.Subsequently, de novo assembly was performed for each bin independently and in parallel.The pre-binning assemblies (PBAs) were then selected by target size range, and rDNA subunits or internal transcribed spacer (ITS) regions were detected by Barnnap and ITSx software.The open reference databases SILVA and UNITE were used for taxonomic classification.Kraken2 was used to generate a database for profiling by mapping all barcode-detected reads.

Figure 2 .
Figure 2. Long fragments restore performance (A) Identity of alignments of mock bacterial rDNA PBAs to reference sequences.(B) Identity of alignments of mock fungal rDNA PBAs to reference sequences.(C) Difference between the observed and the calculated abundances of bacterial species in the mock community.(D) Difference between the observed and the calculated abundances of fungal species in the mock community.(E) rDNA sequences assembled from soil bacterial DNA.Subunits (16S/SSU and 23S/LSU) were detected by alignment to the SILVA and UNITE database or predicted by barrnap.The length distribution of each sample is plotted on the negative part of the y axis.(F) Pie plot of percentages of subunits detected.(G) rDNA sequences assembled from soil fungal DNA.Percentage of covered 18S/SSU, 28S/LSU, and ITS is indicated by different colors.(H) Pie plot of percentages of subunits and ITS regions detected.

Figure 3 .
Figure 3. Identification and performance of a combined Kraken2 profiling strategy(A) The pre binning assembly (PBA) cluster tree was generated by a series of clustering of PBAs at default identity cutoff for each level of taxonomy.The identity cutoffs were estimated by pairwise alignments of annotated SSU, LSU, and ITS sequences from public databases.The public taxonomy tree was mainly based on the SILVA SSU taxonomy tree and supplemented with new taxonomies from the SILVA LSU and UNITE databases.A combination of PBAs and public reference sequences (PUBs) was established based on the identities between the PBA and PUB sequences.The combined taxonomy tree contains three branch categories of sequences: 1) sequences shared by PBAs and the PUBs (red), 2) sequences unique to PBAs with no taxonomy association (blue), and 3) reference sequences from public taxonomy tree not shared by PBA (green).(B)The performance of the three branches of the taxonomic trees.Left: the number of OTUs shown at the species, genus, and phylum levels, visualized as bar plots.Right, relative abundances.The colors indicate OTUs belonging to PUBs (green) or PBAs with (red) or without (blue) annotation.(C) Correlation of relative abundance between duplicates.At each level of taxonomy, the relative abundance observed in each duplicate is colored light red for bacteria and light green for fungi.The Spearman correlation values for bacteria and fungi are indicated.(D) Rarefaction curves of observed bacterial and fungal OTUs at the species level using the different databases.One bacterial sample (solid line) and one fungal sample (dash line) with three replicates and sequenced to more than 500 million read pairs were analyzed.On each curve, a cross marks the number of reads needed to cover 95% of the total counts.

Figure 4 .
Figure 4. Phylogenetic tree based on bacterial SSU rDNA and eukaryotic SSU-LSU rDNA long sequences (A) After filtering of too-short assemblies and potential chimeras, 5,574 bacterial PBAs covering nearly the full region of SSU rDNA were aligned and fitted using a GTR model with 100 bootstraps.Branches representing 15 clades are colored according to phyla.Based on current public databases, the fractions of known taxonomic rank annotation at the levels of (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, and (S)pecies are shown in the bar plot at the left-bottom corner.(B) After filtering of too-short assemblies and potential chimeras, 3,031 fully assembled eukaryotic PBAs were aligned.SSU and LSU regions were directly joint for GTR model fitting with 100 bootstraps.According to the annotations, we clustered nine clades even though many of the clusters presented a mixture of different phyla.The fractions of known taxonomic rank annotation at the levels of (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, and (S)pecies are shown in the bar plot in the right-bottom corner.