Research Article
Revised

Concatenated 16S rRNA sequence analysis improves bacterial taxonomy

[version 3; peer review: 2 approved]
PUBLISHED 01 Sep 2023

This article is included in the Manipal Academy of Higher Education gateway.

This article is included in the Cell & Molecular Biology gateway.

Abstract

Background: Microscopic, biochemical, molecular, and computer-based approaches are extensively used to identify and classify bacterial populations. Advances in DNA sequencing and bioinformatics workflows have facilitated sophisticated genome-based methods for microbial taxonomy although sequencing of the 16S rRNA gene is widely employed to identify and classify bacterial communities as a cost-effective and single-gene approach. However, the 16S rRNA sequence-based species identification accuracy is limited because of the occurrence of multiple copies of the 16S rRNA gene and higher sequence identity between closely related species. The availability of the genomes of several bacterial species provided an opportunity to develop comprehensive species-specific 16S rRNA reference libraries.
Methods: Sequences of the 16S rRNA genes were retrieved from the whole genomes available in the Genome databases. With defined criteria, four 16S rRNA gene copy variants were concatenated to develop a species-specific reference library. The sequence similarity search was performed with a web-based BLAST program, and MEGA software was used to construct the phylogenetic tree.
Results: Using this approach, species-specific 16S rRNA gene libraries were developed for four closely related Streptococcus species (S. gordonii, S. mitis, S. oralis, and S. pneumoniae). Sequence similarity and phylogenetic analysis using concatenated 16S rRNA copies yielded better resolution than single gene copy approaches.
Conclusions: The approach is very effective in classifying genetically closely related bacterial species and may reduce misclassification of bacterial species and genome assemblies.

Keywords

bacterial nomenclature, bacterial taxonomy, concatenated phylogeny, species-specific barcode reference library

Revised Amendments from Version 2

Added limitations of this approach in the conclusion section.

See the author's detailed response to the review by Siddaramappa Shivakumara
See the author's detailed response to the review by Wellyzar Sjamsuridzal

Introduction

The genomic region encoding the 16S ribosomal RNA (16S rRNA) is extensively studied, and used to identify and classify bacterial species. The 16S rRNA is a conserved component of the small subunit (30S) of the prokaryotic ribosome. The gene encoding the 16S rRNA is ~1500 base pair (bp) long, and it consists of nine variable regions (Reller et al. 2007; Chakravorty et al. 2007; Sabat et al. 2017). The sequence of the 16S rRNA gene has been extensively used as a molecular marker in culture-independent methods to identify and classify diverse bacterial communities (Clarridge 2004; Johnson et al. 2019). Bacterial 16S rRNA sequences are currently being used to study the evolution, phylogenetic relationships, and environmental abundance of various taxa (Vetrovsky and Baldrian 2013; Srinivasan et al. 2015; Peker et al. 2019).

Although 16S rRNA sequence analyses are the mainstay of taxonomic studies of bacteria, they have some limitations. For example, the 16S rRNA gene has poor discriminatory power at the species level (Winand et al. 2020), and the copy number per genome can vary from 1 to 15 or even more (Vetrovsky and Baldrian 2013; Winand et al. 2020). Variation in the number of copies of this gene per genome can bias the data obtained for a species. Therefore, gene copy normalization (GCN) may be necessary prior to sequence analysis. However, GCN may not improve the 16S rRNA sequence analyses in all scenarios, and comprehensive, species-specific catalogues of 16S rRNA gene copies may be necessary (Starke et al. 2021). Furthermore, intra-species variations in the 16S rRNA gene copies were observed in several bacterial genome assemblies (Paul et al. 2019). Only a few bacterial species contain identical 16S rRNA gene copies, and sequence diversity increases with increasing copy numbers of 16S rRNA genes (Vetrovsky and Baldrian 2013). The high levels of similarity of the 16S rRNA gene across some bacterial species pose a major challenge for taxonomic studies using bioinformatics methods (Deurenberg et al. 2017; Peker et al. 2019).

Factors such as the purity of bacterial cultures, the quality of the purified DNA samples, and potential DNA chimeras should be carefully considered during sequencing and analysis of 16S rRNA genes (Janda and Abbott 2007; Church et al. 2020). Sequencing errors can lead to misidentification of bacteria and phylogenetic anomalies (Alachiotis et al. 2013). Other concerns include sequence ambiguities, gaps generated during DNA sequencing and sequence comparisons, and choosing the appropriate algorithm (local or global) for sequence alignment. Since the local alignment algorithm is extensively used for sequence similarity-based comparisons, it is important to carefully consider whether a single variable region or a combination of variable regions of the 16S rRNA gene would be ideal for bacterial classification (Janda and Abbott 2007; Johnson et al. 2019; Winand et al. 2020). Using erroneous 16S rRNA sequences as references and improper bioinformatics workflows can lead to bacterial misidentification. Further, the growth of bioinformatics and genetic data has led to the current genome-based microbial classification. However, the success of these approaches is highly dependent on the analyst's skill in next generation sequencing technologies, computational tools, and the operation of high performance computing systems. Analyses by researchers without sufficient experience or skill in such technologies may also introduce errors into bacterial taxonomy (Baltrus 2016).

Other methods for bacterial identification include the sequencing and analysis of the polymerase chain reaction (PCR) amplified ∼4.5 kb 16S–23S rRNA regions (Benitez-Paez and Sanz 2017; Sabat et al. 2017; Kerkhof et al. 2017). However, the 16S–23S rRNA sequence-based method has limited practical application due to the lack of appropriate reference sequence databases and reliable tools/methods for sequence analysis (Sabat et al. 2017). Recent advances in bioinformatics workflows (Winand et al. 2020; Schloss 2020) and reference databases such as SILVA and EzBioCloud (Quast et al. 2013; Yoon et al. 2017) have further improved 16S rRNA-based bacterial taxonomy. However, these approaches are not completely reliable due to misclassification of some bacterial species and erroneous genome assemblies (Steven et al. 2017; Martínez-Romero et al. 2018; Mateo-Estrada et al. 2019; Bagheri et al. 2020).

The entire 16S rRNA gene (~1500 bp) can be amplified and sequenced using conventional or high throughput sequencing methods. However, many 16S rRNA sequence-based bacterial identification studies do not seem to include all nine variable regions (Stackebrandt et al. 2021). Due to the large volume of whole-genome data being produced by high throughput sequencing technologies, there is an urgent need to translate the genomic data into convenient microbiome analyses that clinical practitioners can readily understand and quickly implement (Church et al. 2020). This study aimed to develop a workflow for accurate identification of bacteria using concatenated, species-specific 16S rRNA sequences. It was hoped that the species-specific libraries would yield much better resolution in sequence similarity- and phylogeny-based bacterial classification.

Methods

Estimation of variations in intra-genomic 16S rRNA gene copies

It has been reported that sequence alignment of 16S rRNA gene copies at the intra-genomic level shows a higher degree of variability in species belonging to the Firmicutes and Proteobacteria (Vetrovsky and Baldrian 2013; Ibal et al. 2019). Therefore, this study used eight 16S rRNA gene copies (Underlying data: Supplementary data 1 (Paul 2022)) retrieved from the complete genome of Enterobacter asburiae strain ATCC 35953 (NZ_CP011863.1). To estimate intra-genomic variability between these 16S rRNA gene copies, the BLAST+ 2.13.0 (RRID:SCR_004870; Altschul et al. 1990) and Clustal Omega 1.2.4 (RRID:SCR_001591; Sievers et al. 2011) sequence alignment algorithms were used. Previous studies suggested the unweighted pair group method with arithmetic averages (UPGMA) algorithm for the phylogenetic analysis of 16S rRNA genes (Clarridge 2004; Caporaso et al. 2011). Hence, phylogenetic analysis of these 16S rRNA gene copies was performed using the UPGMA method (Maximum Composite Likelihood; 500 bootstrap replicates) provided in the MEGA software (version 11; RRID: SCR_000667; Kumar et al. 2018).
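The UPGMA clustering step mentioned above can be sketched in a few lines of plain Python. This is only an illustration of the algorithm, not the MEGA implementation: the function name `upgma`, the nested-tuple tree format, and the input distance matrix are assumptions, and the sketch does no distance estimation (such as Maximum Composite Likelihood) or bootstrapping.

```python
from itertools import combinations


def upgma(labels, dist):
    """Cluster labelled items by UPGMA and return a nested-tuple tree.

    labels: list of names; dist: symmetric matrix (list of lists) of
    pairwise distances in the same order as labels.
    """
    # Each active cluster maps an id to (subtree, number_of_leaves).
    clusters = {i: (labels[i], 1) for i in range(len(labels))}
    # Distances between active clusters, keyed by a sorted id pair.
    d = {(i, j): dist[i][j] for i, j in combinations(range(len(labels)), 2)}
    next_id = len(labels)

    while len(clusters) > 1:
        # Pick the closest pair of active clusters.
        a, b = min(d, key=d.get)
        (tree_a, n_a), (tree_b, n_b) = clusters.pop(a), clusters.pop(b)
        # Size-weighted average distance from the merged cluster to the rest.
        for k in clusters:
            d_ak = d.pop((min(a, k), max(a, k)))
            d_bk = d.pop((min(b, k), max(b, k)))
            d[(k, next_id)] = (n_a * d_ak + n_b * d_bk) / (n_a + n_b)
        del d[(a, b)]
        clusters[next_id] = ((tree_a, tree_b), n_a + n_b)
        next_id += 1

    return next(iter(clusters.values()))[0]
```

With a toy matrix in which A and B are closest, `upgma(["A", "B", "C"], [[0, 1, 4], [1, 0, 4], [4, 4, 0]])` groups A with B before joining C.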

Construction of species-specific concatenated 16S rRNA reference libraries

Previous studies have reported that the genes encoding 16S rRNA from several bacterial species share >99% sequence identity (Deurenberg et al. 2017; Peker et al. 2019). As a result, 16S rRNA-based methods have failed to correctly identify bacterial species that are genetically closely related (Deurenberg et al. 2017; Devanga-Ragupathi et al. 2018). It has been reported that 16S rRNA-based methods cannot distinguish between Streptococcus mitis and Streptococcus pneumoniae due to the high sequence similarity (Reller et al. 2007; Lal et al. 2011). Hence, this study selected 16S rRNA gene copies from four closely related species of Streptococcus.

More than 552,575 whole-genome sequences are currently (Aug 2023) available for bacterial species in the Genome database (RRID:SCR_002474; https://www.ncbi.nlm.nih.gov/genome). Many of these genomes were sequenced using high throughput sequencing technologies such as Illumina/Ion-Torrent (short read sequencing) and PacBio/Nanopore (long read sequencing). Furthermore, most of these whole-genome sequences were obtained after a hybrid assembly of short and long read sequence data. This extensive, high throughput data can be effectively used to develop advanced genome-based methods for microbial systematics. Although the genomic data is available in four levels (contig, scaffold, chromosome, and complete), this study used only the complete genomes to retrieve 16S rRNA genes.

To develop species-specific barcode reference libraries, this study retrieved full-length 16S rRNA genes from 16 complete genome sequences belonging to four Streptococcus species (S. gordonii, S. mitis, S. oralis, and S. pneumoniae). Details of the dataset used to develop species-specific concatenated reference libraries are provided in Table 1, and the sequences are provided in the underlying data (Supplementary data 2 (Paul 2022)). Sequences were trimmed beyond the universal primer pair (fD1-5′-GAG TTT GAT CCT GGC TCA-3′ and rP2-5′-ACG GCT AAC TTG TTA CGA CT-3′, which are used for full-length 16S rDNA amplification, Weisburg et al. 1991) to maintain uniform length. To perform multiple sequence alignment and identify the intra-species parsimony informative (Parsim-info) variable sites, the MEGA 11 software was used. A species-specific barcode reference library that covers the entire Parsim-info variable sites was constructed by concatenating four 16S rRNA gene copies from four different strains of a species. The rationale for the selection of four copies for constructing a species-specific barcode reference library was: (i) a maximum of four variations can be found at a single site, and (ii) earlier studies have shown that the mean 16S rRNA copies per genome is four (Vetrovsky and Baldrian 2013).
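The idea of a parsimony-informative site can be made concrete with a short sketch. It uses the standard definition (at least two different bases, each present in at least two sequences) and is not the MEGA code; the function name and the choice to ignore gaps and `N` are assumptions made for illustration.

```python
from collections import Counter


def parsimony_informative_sites(aligned, ignore="-N"):
    """Return 0-based alignment columns that are parsimony-informative.

    A column qualifies when at least two different bases each occur
    in at least two sequences. Characters in `ignore` are skipped.
    """
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("sequences must be aligned to equal length")
    sites = []
    for col in range(len(aligned[0])):
        counts = Counter(s[col].upper() for s in aligned)
        frequent = [b for b, n in counts.items() if n >= 2 and b not in ignore]
        if len(frequent) >= 2:
            sites.append(col)
    return sites
```

For example, in the toy alignment `["ACGT", "ACGA", "ATGA", "ATGT"]` columns 1 and 3 are informative, whereas a column in which only one sequence differs is not.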

Table 1. Details of whole genome assemblies used for the development of concatenated 16S rRNA reference libraries.

One copy of 16S rRNA gene from each strain is used for the concatenation.

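| Species | Strain | Genome accession number | No. of 16S rRNA gene copies | Sequencing platform | Species-specific library name | Library length (bp) | No. of Parsim-info sites |
|---|---|---|---|---|---|---|---|
| S. gordonii | FDAARGOS 1454 | CP077224.1 | 4 | PacBio; Illumina | S.gordonii-Ref-I | 6076 | 7 |
| | NCTC7868 | LR134291.1 | 4 | PacBio | | | |
| | KCOM 1506 | CP012648.1 | 5 | Illumina | | | |
| | NCTC9124 | LR594041.1 | 4 | PacBio | | | |
| S. mitis | B6 | NC_013853.1 | 4 | NA | S.mitis-Ref-I | 6033 | 10 |
| | KCOM 1350 | CP012646.1 | 3 | Illumina | | | |
| | SVGS 061 | CP014326.1 | 4 | PacBio; Illumina | | | |
| | NCTC 12261 | CP028414.1 | 4 | PacBio | | | |
| S. oralis | NCTC 11427 | LR134336.1 | 4 | PacBio | S.oralis-Ref-I | 6038 | 24 |
| | 34 | CP079724.1 | 4 | Illumina; Nanopore | | | |
| | FDAARGOS 886 | CP065706.1 | 4 | PacBio; Illumina | | | |
| | F0392 | CP034442.1 | 4 | PacBio | | | |
| S. pneumoniae | 475 | CP046355.1 | 4 | PacBio | S.pneumoniae-Ref-I | 6032 | 6 |
| | NU83127 | AP018936.1 | 4 | Nanopore; Illumina | | | |
| | NCTC7465 | LN831051.1 | 4 | PacBio | | | |
| | 6A-10 | CP053210.1 | 4 | PacBio | | | |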

Demonstration of concatenated 16S rRNA in sequence similarity and phylogeny

This study analyzed a few cases to demonstrate (i) classical sequence similarity searches and (ii) phylogenetic analysis using concatenated species-specific 16S rRNA reference libraries. Nine 16S rRNA gene sequences (sequenced using the Sanger method) that show high sequence similarity to the 16S rRNA genes of multiple Streptococcus species were retrieved from the GenBank database (RRID:SCR_002760). The web-based BLAST2 (version 2.13.0) program for aligning two or more sequences was used to estimate the maximum score, total alignment score, and sequence identity of these nine sequences. For the sequence similarity search, a single copy of the 16S rRNA gene (sequenced using the Sanger method or retrieved from a whole-genome assembly) is entered as the ‘Query sequence’, and the concatenated species-specific reference libraries are provided in the text area for ‘Subject sequence’. For phylogenetic analysis, however, the target sequence (length = n bp) must be concatenated four times (length = 4 × n bp). Phylogenetic analysis was performed for both the single gene copy and concatenated approaches using the UPGMA method as indicated above.
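Repeating the target sequence four times, as required for the phylogenetic step, is straightforward to script. The helper below is a hypothetical illustration (its name and the whitespace clean-up are assumptions, not part of the published workflow).

```python
def concatenate_query(sequence, copies=4):
    """Repeat a cleaned query sequence so that its length matches a
    library built from `copies` gene copies (length = copies * n)."""
    # Remove line breaks and spaces left over from FASTA-style text.
    cleaned = "".join(sequence.split()).upper()
    if not cleaned:
        raise ValueError("empty sequence")
    return cleaned * copies
```

A trimmed query of 1509 bp, for instance, becomes a 6036 bp sequence, which is comparable in length to the four-copy reference libraries listed in Table 1.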

Results

Intra-genomic 16S rRNA variations in E. asburiae

Historically, sequences of the 16S rRNA genes have been used to identify known and new bacterial species. However, the efficiency of PCR-based amplification, poor discrimination at the species level, multiple polymorphic 16S rRNA gene copies, and improper bioinformatics workflows for the data analysis can affect the identification. The genome of E. asburiae contains eight copies of the 16S rRNA gene. Analysis using Clustal Omega (global alignment) and BLAST (local alignment) showed that the sequences of these eight alleles had average identities of 99.29% and 99.00%, respectively (Table 2). Therefore, choosing the appropriate algorithm/tool is critical for the estimation of sequence identities and sequence-based species delineation. For analyzing sequence pairs that are highly identical, global sequence alignment algorithms seem to be more appropriate because they consider all the nucleotides for the estimation of sequence identity. Clustal Omega based multiple sequence alignment of the eight alleles of the 16S rRNA gene in the genome of E. asburiae showed 22 variable sites (Figure 1). These results show that computational analysis based on a single gene copy can give different results for species harbouring variable copies of this gene.
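A mean pairwise identity of the kind reported here can be computed from an existing multiple sequence alignment as sketched below. The sketch assumes one common definition of identity (matching columns divided by all columns in which at least one sequence has a base); Clustal Omega and BLAST may treat gaps and alignment ends differently, so the numbers need not match those tools exactly. The function names are illustrative.

```python
from itertools import combinations


def percent_identity(a, b):
    """Percent identity over all columns of two aligned sequences,
    skipping columns where both sequences have a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(a.upper(), b.upper()) if (x, y) != ("-", "-")]
    if not columns:
        raise ValueError("no comparable columns")
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)


def mean_pairwise_identity(aligned):
    """Mean percent identity over every pair in a dict of
    name -> aligned sequence."""
    pairs = list(combinations(aligned.values(), 2))
    return sum(percent_identity(a, b) for a, b in pairs) / len(pairs)
```

Because every column of the alignment is counted, a mismatch near either end of the gene lowers the identity just as much as one in the middle.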

Table 2. Percent identity of eight intra genomic 16S rRNA regions from Enterobacter asburiae strain ATCC 35953 (NZ_CP011863.1).

Percent identities below the diagonal were calculated with Clustal Omega (mean identity: 99.29%) and those above the diagonal with the BLASTN program (mean identity: 99.00%). Genome coordinates of 16S rRNA copies: R1: 2686082–2687660 (1579 bp); R2: 3148265–3149814 (1550 bp); R3: 3313470–3315019 (1550 bp); R4: 3583942–3585481 (1540 bp); R5: 3684745–3686294 (1550 bp); R6: 3771751–3773300 (1550 bp); R7: 3968538–3970087 (1550 bp); R8: 4647650–4649199 (1550 bp).


Figure 1. Clustal Omega based multiple sequence alignment of eight intra genomic 16S rRNA gene copies from Enterobacter asburiae strain ATCC 35953 (NZ_CP011863.1) showing 22 variable sites.

According to Chakravorty et al. (2007), the nine variable regions of 16S rRNA gene spanned nucleotides 69-99, 137-242, 433-497, 576-682, 822-879, 986-1043, 1117-1173, 1243-1294, and 1435-1465 for V1 to V9 respectively.
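To check whether a given alignment position falls in a variable or a conserved stretch, the coordinates listed above can be stored in a small lookup table. This sketch assumes the position is already expressed in the same coordinate system as the published ranges; mapping positions from another alignment onto that numbering is not shown, and the names used are illustrative.

```python
# Variable-region coordinates (1-based, inclusive) as listed above,
# after Chakravorty et al. (2007).
VARIABLE_REGIONS = {
    "V1": (69, 99), "V2": (137, 242), "V3": (433, 497),
    "V4": (576, 682), "V5": (822, 879), "V6": (986, 1043),
    "V7": (1117, 1173), "V8": (1243, 1294), "V9": (1435, 1465),
}


def region_of(position):
    """Return the variable region containing a 1-based position, or None
    if the position falls in a conserved stretch."""
    for name, (start, end) in VARIABLE_REGIONS.items():
        if start <= position <= end:
            return name
    return None
```

For example, position 600 lies in V4, while position 100 lies between V1 and V2 and is reported as conserved.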

The evolutionary relationship between species is usually represented using a phylogenetic tree based on the analysis of a single gene, multiple genes, or whole genomes. However, bacterial identification and classification is mainly based on the phylogenetic analysis of single copies of 16S rRNA genes. A phylogenetic tree was constructed to understand how variations in the sequences of the eight alleles of the 16S rRNA gene in the genome of E. asburiae influence species delineation (Figure 2). These results indicate that the intra-genomic variations in 16S rRNA copies may mislead the bacterial taxonomy in single gene copy approaches.


Figure 2. Phylogenetic tree of eight intra genomic 16S rRNA gene copies from Enterobacter asburiae strain ATCC 35953 (NZ_CP011863.1).

The node label denotes the coordinate of 16S rRNA regions in the genome.

Species-specific concatenated 16S rRNA libraries

This study selected four species of Streptococcus (S. gordonii, S. mitis, S. oralis, and S. pneumoniae) to construct species-specific concatenated reference libraries based on 16S rRNA gene sequences obtained from complete genomes. Four variable copies of the 16S rRNA gene from a species are required to construct a species-specific concatenated reference library. The details of species-specific libraries are listed in Table 1 and the sequences are provided in the underlying data (Supplementary data 3 (Paul 2022)). Analysis using the sequences of 16S rRNA genes showed 24, 10, 7, and 6 Parsim-info variable sites for S. oralis, S. mitis, S. gordonii, and S. pneumoniae, respectively. The intra-species Parsim-info variable sites were located in both the conserved and variable regions of the 16S rRNA gene (Supplementary data 4 (Paul 2022)).
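Assembling a library from four gene copies amounts to joining the trimmed sequences end to end. The sketch below is an assumption-laden illustration: the function name, the dictionary input, and the checks for exactly four distinct copies are choices made here, not a description of the author's scripts.

```python
def build_reference_library(copies, expected=4):
    """Concatenate one trimmed 16S rRNA gene copy per strain.

    copies: dict mapping strain name -> ungapped gene sequence.
    Returns (library_sequence, strain_order).
    """
    if len(copies) != expected:
        raise ValueError(f"expected {expected} gene copies, got {len(copies)}")
    order = list(copies)
    seqs = [copies[name].upper() for name in order]
    # The library is meant to capture variation, so identical copies add nothing.
    if len(set(seqs)) != len(seqs):
        raise ValueError("gene copies should be distinct variants")
    return "".join(seqs), order
```

The length of the resulting library is the sum of the four copy lengths, which is why the libraries in Table 1 are each about 6 kb long.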

The study used full-length 16S rRNA gene copies from four different strains to highlight the variations at the species level. However, a large number of partial 16S rRNA gene sequences are available in the public genetic databases. Further, many researchers amplify only a few variable regions of the 16S rRNA gene. In such cases, a species-specific concatenated reference library can be constructed using partial sequences. Intra-species variations in the sequences of 16S rRNA gene copies influence sequence-based bacterial identification. Therefore, concatenation of 16S rRNA gene sequences provides much better resolution than analysis using the sequence of a single copy of the 16S rRNA gene.

Demonstration of concatenated 16S rRNA based species identification

This study compared sequences of nine 16S rRNA genes from different species of Streptococcus (Table 3) against the species-specific concatenated reference libraries constructed. The analysis showed that the concatenated sequences provide much better resolution in sequence similarity search and phylogenetic analysis. The sequences with accession numbers GU470907.1 and KF933785.1, both classified as S. mitis, showed higher maximum and total alignment scores with the concatenated 16S rRNA library of S. oralis than with that of S. mitis (Table 3). Two sequences (OM368574.1, classified as S. mitis, and OM368578.1, classified as S. pneumoniae) showed identical scores against each of the four reference libraries. Based on the highest total alignment score, both sequences belong to S. pneumoniae, yet they are classified as two separate species. Interestingly, the sequence GU470907.1, classified as S. mitis, showed 100% identity with the S. oralis reference library, with a total alignment score of 10936.

Table 3. Similarity of selected sequences against the concatenated species-specific 16S rRNA reference libraries.

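| GenBank Accession Number | Species | S. gordonii-Ref-I Max Score | S. gordonii-Ref-I Total Score | S. gordonii-Ref-I Identity (%) | S. mitis-Ref-I Max Score | S. mitis-Ref-I Total Score | S. mitis-Ref-I Identity (%) | S. oralis-Ref-I Max Score | S. oralis-Ref-I Total Score | S. oralis-Ref-I Identity (%) | S. pneumoniae-Ref-I Max Score | S. pneumoniae-Ref-I Total Score | S. pneumoniae-Ref-I Identity (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AJ295848.1 | S. mitis | 2495 | 9967 | 96.45 | 2769 | 11027 | 99.80 | 2758 | 10851 | 99.67 | 2752 | 10982 | 99.60 |
| AM157428.1 | S. mitis | 2462 | 9845 | 96.05 | 2724 | 10866 | 99.27 | 2702 | 10685 | 99.01 | 2708 | 10805 | 99.07 |
| NR_028664.1 | S. mitis | 2499 | 9991 | 96.45 | 2776 | 10979 | 99.87 | 2750 | 10864 | 99.54 | 2724 | 10888 | 99.27 |
| GU470907.1 | S. mitis | 2536 | 10096 | 96.91 | 2715 | 10796 | 99.14 | 2787 | 10936 | 100 | 2091 | 10716 | 98.87 |
| KF933785.1 | S. mitis | 2466 | 9832 | 96.06 | 2667 | 10593 | 98.54 | 2673 | 10650 | 98.61 | 2632 | 10502 | 98.15 |
| OM368574.1 | S. mitis | 2475 | 9896 | 96.24 | 2754 | 10968 | 99.67 | 2732 | 10814 | 99.40 | 2760 | 10990 | 99.73 |
| OM368578.1 | S. pneumoniae | 2475 | 9896 | 96.24 | 2754 | 10968 | 99.67 | 2732 | 10814 | 99.40 | 2760 | 10990 | 99.73 |
| AM157442.1 | S. pneumoniae | 2470 | 9863 | 96.12 | 2702 | 10779 | 99.01 | 2715 | 10726 | 99.14 | 2702 | 10777 | 99.01 |
| NR_117719.1 | S. oralis | 2531 | 10074 | 96.84 | 2710 | 10774 | 99.07 | 2787 | 10925 | 100 | 2697 | 10739 | 98.94 |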
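The comparison in Table 3 can be read as a simple ranking: for each query, choose the reference library with the highest total alignment score. The sketch below applies that rule to two rows copied from the table. The function name and the tie-break on percent identity are assumptions for illustration, and the ranking is not a substitute for the phylogenetic analysis.

```python
# (max score, total score, identity %) per reference library,
# copied from Table 3 for two of the nine query sequences.
TABLE3 = {
    "GU470907.1": {
        "S. gordonii": (2536, 10096, 96.91),
        "S. mitis": (2715, 10796, 99.14),
        "S. oralis": (2787, 10936, 100.0),
        "S. pneumoniae": (2091, 10716, 98.87),
    },
    "OM368574.1": {
        "S. gordonii": (2475, 9896, 96.24),
        "S. mitis": (2754, 10968, 99.67),
        "S. oralis": (2732, 10814, 99.40),
        "S. pneumoniae": (2760, 10990, 99.73),
    },
}


def best_library(hits):
    """Return the library with the highest total alignment score,
    breaking ties by percent identity."""
    return max(hits, key=lambda name: (hits[name][1], hits[name][2]))
```

Applied to these rows, the rule assigns GU470907.1 to the S. oralis library and OM368574.1 to the S. pneumoniae library, even though both are deposited as S. mitis.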

The study plotted two phylogenetic trees to highlight the difference between the single gene copy approach and the concatenated approach. Figure 3 represents the single gene copy approach and shows the phylogenetic tree of the nine selected 16S rRNA gene sequences along with the gene copies used for the construction of the four concatenated species-specific reference libraries. The inclusion of misclassified sequences and intra-species variations in 16S rRNA copies may mislead the phylogenetic tree inference. Figure 4 shows the phylogenetic relationship of the nine selected sequences with the four concatenated species-specific reference libraries. The concatenated GU470907.1 sequence showed a phylogenetic relationship with S. oralis, and sequence OM368574.1 was genetically related to S. pneumoniae. Phylogenetic analysis showed that three sequences, AM157428 (S. mitis), KF933785 (S. mitis), and AM157442 (S. pneumoniae), remained separate and might belong to species other than the four tested. Furthermore, two sequences classified as S. mitis, AJ295848 and NR_028664, showed significant similarity with the concatenated 16S rRNA reference library of S. mitis. Similarly, sequence NR_117719 (S. oralis) showed a phylogenetic relationship with the reference library of S. oralis, and OM368578 (S. pneumoniae) with the S. pneumoniae reference library. These results further confirm that species-specific concatenated 16S rRNA reference libraries provide much better taxonomic resolution. Therefore, this study recommends concatenated sequences of 16S rRNA genes for sequence similarity- and phylogeny-based species identification.


Figure 3. Phylogenetic analysis of nine randomly selected 16S rRNA sequences classified as Streptococcus species and the sequences used for the species-specific reference libraries.

The phylogenetic tree was plotted using the single gene copy approach. Node names marked with shapes (shown as symbols in the original figure) represent the sequences used to construct the four concatenated species-specific reference libraries.


Figure 4. Phylogenetic tree constructed using concatenated 16S rRNA approach.

The nine randomly selected 16S rRNA sequences classified as Streptococcus species were compared with the four species-specific reference libraries. Node names marked with shapes (shown as symbols in the original figure) represent the four species-specific reference libraries.

Discussion

Sequencing and analysis of the 16S rRNA encoding region is a conventional and robust method for identifying and classifying bacterial species. The barcode gene is widely used in sequence similarity, phylogeny, and metagenome-based species identification. However, the accuracy of bacterial taxonomy based on 16S rRNA barcode regions is limited by the intra-genomic heterogeneity of multiple 16S rRNA gene copies and significant sequence identity of this gene among closely related taxa. Furthermore, identification of closely related species using sequences of the 16S rRNA gene is a challenge, and it may lead to species misidentification (Boudewijns et al. 2006; Church et al. 2020). About 15% of the bacterial genomes have only a single copy of the 16S rRNA gene, and only a minority of bacterial genomes contain identical 16S rRNA gene copies (Vetrovsky and Baldrian 2013). The 16S rRNA gene copies can vary from 1 to 15 in a genome, and the copy number is taxon specific (Vetrovsky and Baldrian 2013). Sequence diversity increases with the increasing 16S rRNA copy numbers. The 16S rRNA sequence variation can even be found at intra-genomic level or in different strains of a species. Amplification of a limited number of variable regions cannot achieve the same taxonomic resolution as that of the entire gene (Johnson et al. 2019). Usage of misclassified 16S rRNA sequences as a reference and inappropriate bioinformatics workflows can also mislead the taxonomic assignment. To overcome these challenges, it is important to translate high throughput microbial genomic data into meaningful, actionable information that clinicians can readily understand and quickly implement for bacterial identification. Hence, the study intended to develop a species-specific catalogue of concatenated 16S rRNA gene copies that can yield better inference in sequence similarity and phylogenetic analysis.

Several bioinformatics resources are extensively used for 16S rRNA sequence analysis and bacterial identification. However, many researchers report sequence similarity derived through a local alignment algorithm. Earlier reports have suggested that species belonging to the Gammaproteobacteria show higher intra-species variability (Vetrovsky and Baldrian 2013). Hence, the study estimated the percent identity of intra-genomic 16S rRNA gene copies of E. asburiae using local and global alignment algorithms. The reference genome of E. asburiae has eight 16S rRNA gene copies. The BLAST and Clustal sequence alignment algorithms yielded marginally varying results for the intra-genomic 16S rRNA gene copies. Local alignment algorithms may not consider base mismatches at the ends of sequences when calculating percent identity, while global alignment algorithms consider entire sequences. Therefore, global sequence alignment is best for estimating intra- and inter-species identity for single gene copies. However, BLAST can calculate the total alignment score with multiple paralogue regions. Hence, web-based BLAST2 is suggested for estimating the sequence similarity using concatenated barcode reference libraries.

The GenBank (Leray et al. 2019) and NCBI 16S RefSeq databases for bacteria (Winand et al. 2020) are reliable for species-level identification and classification. However, a few earlier studies have highlighted the misclassification of species and genome assemblies in public genetic databases (Parks et al. 2018; Varghese et al. 2015). For example, the 16S rRNA sequence with accession number (Acc. No.) LT707617.1 lists the organism as Streptococcus mitis. A conventional BLAST-based sequence similarity search shows the highest identity of 99.60% with an S. mitis 16S rRNA sequence (Acc. No. AB002520.1). However, the 16S rRNA sequence (Acc. No. LT707617.1) did not show significant similarity with other 16S rRNA reference sequences available for S. mitis. Furthermore, the sequence also shows 99.44% identity with reference 16S rRNA sequences of S. gordonii. Hence, the study aligned the sequence (Acc. No. LT707617.1) against the species-specific concatenated 16S rRNA reference libraries for S. gordonii (S.gordonii-Ref-I) and S. mitis (S.mitis-Ref-I). The alignment showed a higher identity with S.gordonii-Ref-I (99.44% identity; 2279 maximum and 9041 total alignment score) than with S.mitis-Ref-I (97.13% identity; 2119 maximum and 8449 total alignment score). Single copy BLAST results may show only a small difference in percent identity and in maximum or total alignment score between closely related species. In contrast, sequence similarity estimation using species-specific concatenated reference libraries shows a larger difference in total alignment score, because the query is aligned against four copies. Hence, 16S rRNA analysis with a species-specific concatenated barcode reference library will give better accuracy for bacterial classification than approaches using a single copy.
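The effect of summing alignment scores over four copies can be checked with the numbers quoted above for LT707617.1. The short sketch below only restates that arithmetic; the dictionary layout and the helper name are illustrative.

```python
# Scores reported above for LT707617.1 against two concatenated libraries.
scores = {
    "S.gordonii-Ref-I": {"max": 2279, "total": 9041, "identity": 99.44},
    "S.mitis-Ref-I": {"max": 2119, "total": 8449, "identity": 97.13},
}


def score_gap(scores, first, second, key):
    """Difference in one score type between two reference libraries."""
    return scores[first][key] - scores[second][key]


max_gap = score_gap(scores, "S.gordonii-Ref-I", "S.mitis-Ref-I", "max")      # 160
total_gap = score_gap(scores, "S.gordonii-Ref-I", "S.mitis-Ref-I", "total")  # 592
```

The gap in total score (592) is several times the gap in maximum score (160), which is what makes the total score a more sensitive criterion when the percent identities are close.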

Several 16S rRNA sequences show 100% identity with multiple species, which is the major challenge in sequence-based species identification. For example, the 16S rRNA sequence from S. mitis (Acc. No. GU470907.1; 1522 bp) shares 100% identity with the 16S rRNA gene from the S. oralis strain ATCC 35037 genome (Acc. No. CP034442.1). Hence, the sequence (GU470907.1) was aligned against the species-specific concatenated reference libraries for S. oralis (S.oralis-Ref-I) and S. mitis (S.mitis-Ref-I). The result showed 100% identity with S. oralis (2787 maximum and 10936 total alignment score) and 99.14% identity with S. mitis (2715 maximum and 10796 total alignment score). Further, a phylogenetic tree of GU470907.1 (1509 × 4 = 6036 bp) with the reference libraries S.mitis-Ref-I and S.oralis-Ref-I was plotted. The UPGMA-based phylogenetic tree showed that the S. mitis (GU470907.1) sequence is more closely related to S. oralis than to S. mitis (Figure 4). Concatenated 16S rRNA-based estimation of sequence similarity and phylogenetic inference provides better resolution than single-gene approaches. These results show that the concatenated 16S rRNA approach is very effective in discriminating genetically closely related bacterial species. Furthermore, other studies have also highlighted that phylogenetic trees inferred from concatenated, vertically inherited protein sequences provided higher resolution than those obtained from a single copy (Ciccarelli et al. 2006; Thiergart et al. 2014).

Recent phylogenetic studies using concatenated multi-gene sequence data highlighted the importance of incorporating variations in gene histories, which will improve traditional phylogenetic inferences (Devulder et al. 2005; Johnston et al. 2019). Furthermore, a single type of analysis should not be relied upon; instead, integrated bioinformatics approaches can, to a certain extent, avoid misclassification. As a cost-effective approach, the study combined substantial variations in 16S rRNA gene copies from a species to examine the performance of the single gene concatenation approach. Analyses using a concatenated 16S rRNA gene approach have the following advantages: (i) the gene is present in all the bacterial species, (ii) the gene is weakly affected by horizontal gene transfer and mutation, (iii) the approach is very cost-effective, (iv) there is a large volume of reference genomic data available for several bacterial species, (v) it is effective in discriminating closely related bacterial species, (vi) the analyses can be performed on a computer with a minimum configuration, and (vii) the analyses can be carried out with available tools for sequence similarity and molecular phylogeny.

Conclusions

The concatenated 16S rRNA analyses showed that:

  • Full-length 16S rRNA gene amplification provides better accuracy than inference based on partial gene sequences with a limited number of variable regions.

  • Full-length 16S rRNA gene copies from whole-genome assemblies (in 'complete' stage) should be used rather than partial sequences available from the public genetic databases to construct species-specific concatenated 16S rRNA libraries and further downstream analysis.

  • To avoid mismatches in the sequence alignment, trim the bases beyond the primer ends and correct the base-call errors prior to the analysis.

  • Estimating the mean 16S rRNA identity at the intra-species level helps to classify species with a higher degree of intra-genomic 16S rRNA heterogeneity.

  • Four distinct 16S rRNA gene copies cover all the parsimony-informative (Parsim-Info) variable sites, and these copies can be used to construct a concatenated species-specific reference library.

  • The total alignment score can be used to discriminate between candidates if the query sequence shows nearly the same percent identity with multiple species.

  • It is not prudent to rely only on sequence similarity; the final decision must be based on the phylogenetic inference.

  • Species-specific concatenated 16S rRNA gene libraries are recommended for sequence similarity and phylogenetic analysis.

  • The limitation of the approach is that developing a species-specific reference library requires 16S rRNA copies from at least four whole genome assemblies.
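The similarity-based part of these recommendations can be sketched as a simple ranking rule: order candidate species by percent identity and, when the identities are nearly the same, fall back on the total alignment score. The sketch below uses the scores reported above for GU470907.1. The `Hit` type, the `rank_hits` function, and the 1.0 percentage-point tolerance are assumptions made for illustration and are not values from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One alignment result against a concatenated reference library."""
    species: str
    percent_identity: float
    max_score: int
    total_score: int


def rank_hits(hits, tolerance=1.0):
    """Rank hits; near-tied identities are ordered by total alignment score.

    `tolerance` (percentage points) is a hypothetical threshold for treating
    two percent identities as "nearly the same".
    """
    if not hits:
        return []
    best = max(h.percent_identity for h in hits)
    near_best = [h for h in hits if best - h.percent_identity <= tolerance]
    rest = [h for h in hits if best - h.percent_identity > tolerance]
    # Within the near-tied group, the total alignment score decides the order.
    near_best.sort(key=lambda h: h.total_score, reverse=True)
    # Remaining hits are clearly weaker and are ordered by identity.
    rest.sort(key=lambda h: h.percent_identity, reverse=True)
    return near_best + rest


# Scores reported in the text for query GU470907.1.
hits = [
    Hit("S. mitis (S.mitis-Ref-I)", 99.14, 2715, 10796),
    Hit("S. oralis (S.oralis-Ref-I)", 100.0, 2787, 10936),
]
ranked = rank_hits(hits)
top_candidate = ranked[0].species
```

A ranking of this kind only proposes a candidate; as stated in the conclusions, the final decision should rest on the phylogenetic inference.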

How to cite this article
Paul B. Concatenated 16S rRNA sequence analysis improves bacterial taxonomy [version 3; peer review: 2 approved] F1000Research 2023, 11:1530 (https://doi.org/10.12688/f1000research.128320.3)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
Open Peer Review

Version 3 (published 01 Sep 2023)

  • Reviewer report, 08 Sep 2023: Siddaramappa Shivakumara, Institute of Bioinformatics and Applied Biotechnology, Bengaluru, Karnataka, India. Status: Approved (https://doi.org/10.5256/f1000research.155249.r203269)

  • Reviewer report, 08 Sep 2023: Wellyzar Sjamsuridzal, Department of Biology, Faculty of Mathematics and Natural Sciences, Universitas Indonesia, Depok, West Java, Indonesia. Status: Approved (https://doi.org/10.5256/f1000research.155249.r203270)

Version 2 (published 03 Apr 2023)

  • Reviewer report, 21 Aug 2023: Wellyzar Sjamsuridzal, Department of Biology, Faculty of Mathematics and Natural Sciences, Universitas Indonesia, Depok, West Java, Indonesia. Status: Approved with reservations (https://doi.org/10.5256/f1000research.144651.r189677). Author response by Bobby Paul, 01 Sep 2023.

  • Reviewer report, 11 Apr 2023: Siddaramappa Shivakumara, Institute of Bioinformatics and Applied Biotechnology, Bengaluru, Karnataka, India. Status: Approved (https://doi.org/10.5256/f1000research.144651.r168593)

Version 1 (published 19 Dec 2022)

  • Reviewer report, 10 Feb 2023: Siddaramappa Shivakumara, Institute of Bioinformatics and Applied Biotechnology, Bengaluru, Karnataka, India. Status: Approved with reservations (https://doi.org/10.5256/f1000research.140896.r158444). Author response by Bobby Paul, 03 Apr 2023.