GSR-DB: a manually curated and optimized taxonomical database for 16S rRNA amplicon analysis

ABSTRACT Amplicon-based 16S ribosomal RNA sequencing remains a widely used method to profile microbial communities, especially in low biomass samples, due to its cost-effectiveness and low-complexity approach. Reference databases are a mainstay for taxonomic assignments, which typically rely on popular databases such as SILVA, Greengenes, Genome Taxonomy Database (GTDB), or Ribosomal Database Project (RDP). However, inconsistent nomenclature across databases and shortcomings in their annotation limit the resolution of the analysis. To overcome these limitations, we created the GSR database (Greengenes, SILVA, and RDP database), an integrated and manually curated database for bacterial and archaeal 16S amplicon taxonomy analysis. Unlike previous integration approaches, this database creation pipeline includes a taxonomy unification step to ensure consistency in taxonomical annotations. The database was validated with three mock communities, two real data sets, and a 10-fold cross-validation method and compared with existing 16S databases such as Greengenes, Greengenes 2, GTDB, ITGDB, SILVA, RDP, and MetaSquare. Results showed that the GSR database enhances taxonomical annotations of 16S sequences, outperforming current 16S databases at the species level, based on the evaluation of the mock communities. This was confirmed by the 10-fold cross-validation, except for Greengenes 2. The GSR database is available for full-length 16S sequences and the most commonly used hypervariable regions: V4, V1–V3, V3–V4, and V3–V5.

IMPORTANCE Taxonomic assignments of microorganisms have long been hindered by inconsistent nomenclature and annotation issues in existing databases like SILVA, Greengenes, Greengenes2, Genome Taxonomy Database, or Ribosomal Database Project. To overcome these issues, we created the Greengenes-SILVA-RDP database (GSR-DB), which provides accurate and comprehensive taxonomic annotations of 16S amplicon data.
Unlike previous approaches, our innovative pipeline includes a unique taxonomy unification step, ensuring consistent and reliable annotations. Our evaluation analyses showed that GSR-DB outperforms existing databases in providing species-level resolution, especially based on mock-community analysis, making it a game-changer for microbiome studies. Moreover, GSR-DB is designed to be accessible to researchers with limited computational resources, making it a powerful tool for scientists across the board. Available for full-length 16S sequences and commonly used hypervariable regions, including V4, V1–V3, V3–V4, and V3–V5, GSR-DB is a go-to database for robust and accurate microbial taxonomy analysis.

approach has allowed us to associate altered microbial profiles with diseases, including gut-associated conditions, inflammatory bowel disease, metabolic diseases, and colorectal cancer (1,2).
16S analysis involves various upstream steps, including quality control, read trimming, and taxonomic classification. Previous studies have reported the impact of bioinformatic pipelines in the microbial profiling of biological samples, highlighting the importance of reference databases for taxonomic prediction (3). Currently, the most widely used databases are Greengenes (4), Genome Taxonomy Database (GTDB) (5), SILVA (6), and Ribosomal Database Project (RDP) (7). However, discrepancies between these databases have been acknowledged. Robeson et al. (8) found that Greengenes, SILVA, and GTDB presented sequence similarities but were taxonomically different, leading to a low proportion of taxonomic labeling shared among databases at all ranks below the domain level. Moreover, outlier sequences were found in the length distribution across databases, probably corresponding to partial or untrimmed 16S sequences, which are recommended to be discarded to avoid biases in the analysis. Additionally, SILVA and Greengenes exhibited an immense amount of unannotated or unknown labeled sequences at genus and species level (∼80%), which might introduce taxonomic noise during assignment (8).
To overcome these limitations and enhance classification performance, we created the GSR database for bacterial and archaeal 16S-based taxonomic profiling by integrating and manually curating the Greengenes, SILVA, and RDP databases. The taxonomic nomenclature of the GSR database has been unified to guarantee the coherence of annotations. Its performance has been compared with Greengenes, Greengenes2 (9), GTDB, SILVA, and RDP databases and other existing integrated databases, including ITGDB (10) and MetaSquare (11). The GSR database is available for full-length 16S sequences and the most commonly used hypervariable regions: V4, V1-V3, V3-V4, and V3-V5. It can be downloaded from the link https://manichanh.vhir.org/gsrdb/.

Creation of the GSR full-16S database
A full-length 16S database, the GSR-DB (Greengenes-SILVA-RDP database), was created by merging three already existing databases: Greengenes (version 13_8, 99%) (4), SILVA (version 138, 99%) (6), and RDP (train set no. 18) (7). A data set with vaginal-related species was also included to ensure species detection for vaginal samples. The number of original entries of the Greengenes, SILVA, and RDP was 203,452, 436,681, and 21,194, respectively (more information regarding the construction of these databases can be found in the supplemental material). Before the integration, taxonomy filtering and formatting were performed on each original database. Only Bacteria and Archaea kingdoms were retained from the databases, excluding Eukaryota and Virus kingdoms in the SILVA database. Additionally, a manual curation process was applied to ensure the removal of potential redundancies for the subsequent merging of the original databases. After this process, the percentage of retained entries was 10.05%, 17.08%, and 95.08% for Greengenes, SILVA, and RDP, respectively. The vaginal data set was created with the 16S NCBI sequences proposed by Fettweis et al. (12) to create a vaginal reference database. Sequences and nomenclature corresponding to GenBank IDs provided in the study were retrieved from the NCBI. Sequences without exact species names or those corresponding to non-16S sequences were excluded. Once all original databases had been preprocessed, they were merged using the algorithm described below.

Manual curation of the GSR-DB
The curation process for the GSR-DB included several steps to ensure the quality and accuracy of the data. These steps involved manual identification and removal of patterns associated with unknown species (entries unannotated or with unknown labels, such as "uncultured," "unidentified," and "candidate"). Additionally, sequences that only provide information at the kingdom and species levels were discarded, particularly if they refer to rare bacteria from non-characterized environments (e.g., k_Bacteria,…, s_bacterium_Te63R). Lastly, taxonomic nomenclature was carefully reviewed during the integration of databases, using the Python module ETE toolkit (version 3.0) (13) to retrieve synonyms from the NCBI database. The NCBI taxonomy database (14) was chosen as the reference for taxonomic annotation as it enables the identification of synonyms for all the taxonomic annotations in the databases and provides a standardized nomenclature. This procedure ensures consistency and helps identify misannotated organisms. One example is the identification of misannotated entries from the SILVA database, where certain entries labeled as bacteria are actually eukaryotic species, such as the annotation d_Bacteria; p_Proteobacteria; c_Gammaproteobacteria; o_Burkholderiales; f_Comamonadaceae; g_Paucibacter; s_Cenchrus_americanus, which is a plant species. These steps were taken to ensure the accuracy and reliability of taxonomic information in the GSR-DB.
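The pattern-based removal of unknown labels described above can be illustrated with a minimal Python sketch. It is not the curation code used for the GSR-DB, which was performed manually. The pattern list contains only the three labels quoted in the text, and the semicolon-separated taxonomy string with a rank prefix such as "s_" is an assumed input layout. The function names `is_unknown` and `filter_entries` are hypothetical.

```python
import re

# Labels quoted in the text as markers of unknown species. The full pattern
# list used during manual curation is not given, so this set is an assumption.
UNKNOWN_PATTERNS = re.compile(r"uncultured|unidentified|candidate", re.IGNORECASE)


def is_unknown(taxonomy: str) -> bool:
    """Return True when the species field is empty or carries an unknown label."""
    ranks = [rank.strip() for rank in taxonomy.split(";")]
    species = ranks[-1]
    # Strip a rank prefix such as "s_" or "s__" before testing for emptiness.
    name = re.sub(r"^[a-z]_+", "", species)
    return name == "" or bool(UNKNOWN_PATTERNS.search(name))


def filter_entries(entries):
    """Keep only (sequence_id, taxonomy) pairs with a known species label."""
    return [(seq_id, tax) for seq_id, tax in entries if not is_unknown(tax)]
```

A filter of this kind only flags candidates. Cases such as the plant species annotated as a bacterium still require comparison against the NCBI taxonomy.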

Merging algorithm
The algorithm used to merge the Greengenes, SILVA, RDP, and vaginal processed databases was based on the integration algorithms proposed by Hsieh et al. (10). The algorithm took two databases as inputs and integrated them as follows (Fig. 1A). First, one database was assigned as the reference database (R) and the other as the candidate database (C). Then, for each entry in the candidate database, it checked whether the candidate taxon (T_C) was already present in the reference database. If T_C was not present, the candidate entry (sequence and taxon) was added to the data set. If T_C was present, the algorithm compared the candidate sequence (S_C) to all the sequences in the reference data set (S_R) with the same nomenclature as T_C. No integration was performed if S_C was identical or present as a substring in any of the S_R. On the other hand, if S_C was not found in the reference data set, the candidate entry (taxon and sequence) was added to the data set. The RDP data set was chosen as the first reference data set for its taxonomic consistency, then the remaining data sets were added in the following order: SILVA, Greengenes, and vaginal (Fig. 1B). The resulting data set is the GSR full-16S database.
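The merging rule described above can be sketched in a few lines of Python. This is an illustrative reading of the text, not the published implementation. The representation of a database as a list of (taxon, sequence) tuples and the function name `merge_databases` are assumptions.

```python
from collections import defaultdict


def merge_databases(reference, candidate):
    """Merge candidate entries into the reference following the rule in the text.

    Both arguments are lists of (taxon, sequence) tuples; this flat layout is
    an assumption made for the sketch.
    """
    merged = list(reference)
    # Index sequences by taxon so each candidate is only compared with
    # sequences that share its nomenclature.
    by_taxon = defaultdict(list)
    for taxon, seq in reference:
        by_taxon[taxon].append(seq)

    for taxon, seq in candidate:
        same_taxon = by_taxon.get(taxon, [])
        # Skip the candidate when it is identical to, or a substring of,
        # a sequence already stored under the same taxon.
        if any(seq in ref_seq for ref_seq in same_taxon):
            continue
        # New taxon, or new sequence for a known taxon: add the entry.
        merged.append((taxon, seq))
        # Assumption: accepted candidates join the data set, so later
        # candidates are also compared against them.
        by_taxon[taxon].append(seq)
    return merged
```

Under these assumptions, the integration order given in the text corresponds to calling the function repeatedly, starting with RDP as the reference and then passing SILVA, Greengenes, and the vaginal data set as successive candidates.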

Variable region extraction
The full-length GSR 16S database was used to create region-specific databases containing the most commonly used hypervariable regions in 16S analysis (V4, V3-V4, V1-V3, and V3-V5). The sequences for each region were extracted from the full-length GSR 16S database using the extract-reads function implemented in the QIIME 2 feature-classifier plugin (15) and the corresponding primers (3). QIIME 2 RESCRIPt plugin (8) was also used to dereplicate the resulting databases to remove redundant entries. These steps were also performed on the databases (Greengenes, Greengenes 2 [version 2022.10], GTDB [version 207.0], ITGDB, SILVA, and RDP) used in the validation analysis step.

Clustering
Upon extracting the variable region sequences from the full-length 16S database, we encountered numerous identical sequences that do not correspond to the same species. The region-specific GSR databases underwent clustering via CD-HIT (16,17) at a 100% identity threshold to retain only unique sequences and improve species detection. Taxonomic designations of these clustered sequences were merged into a unified taxonomic name, as shown in the subsequent example.
Nomenclatures to be clustered:

Phylogeny construction
To build the phylogenetic tree, we used the pre-trained model for WoL marker genes and ASV data of DEPP software. This framework allows the positioning of the GSR-DB sequences onto the WoL species backbone tree via a convolutional neural network. The resulting tree is made available in both Newick and QIIME2 formats (Fig. S1).

In silico mock community data sets
To assess the performance of the newly built database, three different mock communities (mockrobiota, vagimock, and gutmock) were constructed in silico. The vagimock and gutmock data sets simulate the relative abundance and species of biological samples from two different body sites. They were built from our GSR database with species commonly found in the human vagina and gut. The mockrobiota data sets were constructed using sequences obtained from the mockrobiota repository, a public resource for microbiome bioinformatics benchmarking. From this repository, we recovered full-length 16S sequences provided only by data sets 3, 4, 5, and 12-23 (18).
Each in silico mock community data set contained five samples, with given microbial abundance profiles, and taxonomic and sequence information. The taxonomic information of the sequences was unified using the ncbi_taxonomy Python module from the ETE toolkit (version 3.0) (13). Each mock community has a different level of complexity, which is crucial to reveal possible database issues (3). The composition of the mock communities at the species level can be found in Table S1.

Classifier training and taxonomy assignment
It is known that some classifiers are strongly affected by parameter configurations. Therefore, different parameters for classifier training and taxonomy assignment steps were tested to find the optimal configuration. The sequences and taxonomy of each database were used to train the multinomial naive Bayes classifier implemented in the q2-feature-classifier QIIME2 module (15). During this training, the n-gram-range parameter was tested with the values [6,6] and [7,7] (default), as its developers have already reported these ranges as optimal. Then, these classifiers were used to perform the taxonomy assignment of rep-seqs for each region. During the taxonomy assignment, the confidence threshold for limiting taxonomic depth was tested with the values "disable," 0.5, 0.7 (default), 0.9, and 0.98. Evaluating two n-gram-range values and five confidence thresholds generated 10 different taxonomic profiles for each database.

Parameter comparison
To compare the performance of the 10 possible parameter configurations for each database, we calculated the average F1 scores across mock communities for each taxonomic level. The configuration with the highest scores was retained for subsequent benchmarking of the databases.
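As a rough illustration of this selection step, the sketch below enumerates the 10 configurations (two n-gram ranges times five confidence thresholds) and picks the one with the highest mean F1 score. The function names and the dictionary layout mapping each configuration to a list of F1 scores are assumptions for the example. No scores from the study are reproduced here.

```python
from itertools import product
from statistics import mean

# Parameter values tested in the text.
NGRAM_RANGES = [(6, 6), (7, 7)]
CONFIDENCES = ["disable", 0.5, 0.7, 0.9, 0.98]


def parameter_grid():
    """Return every (n-gram range, confidence threshold) combination."""
    return list(product(NGRAM_RANGES, CONFIDENCES))


def best_configuration(f1_by_config):
    """Pick the configuration with the highest mean F1 score.

    f1_by_config maps a configuration tuple to a list of F1 scores, one per
    mock community (hypothetical layout).
    """
    return max(f1_by_config, key=lambda cfg: mean(f1_by_config[cfg]))
```

In practice the comparison was made per database and per taxonomic level, so a selection of this kind would be repeated for each of those combinations.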

Database benchmarking
The performance of the GSR database was compared with widely used databases, such as Greengenes, Greengenes 2, GTDB, SILVA, and RDP, but also with other available databases, such as ITGDB. Two independent approaches were used to assess the performances: the multi-class confusion matrix and the Bray-Curtis distances. The multi-class confusion matrix was used to evaluate the performance of a machine learning classification (e.g., naive Bayes classifier) by comparing the expected sequence taxonomy versus the classified (Table 1). This confusion matrix allowed us to obtain validation metrics such as accuracy, precision, recall, and F1 score by using the following equations:

$$
\begin{aligned}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN}, &\quad \text{Precision} &= \frac{TP}{TP + FP},\\
\text{Recall} &= \frac{TP}{TP + FN}, &\quad \text{F1 score} &= \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
\end{aligned}
\tag{1}
$$

where TP is true positive, FP is false positive, TN is true negative, and FN is false negative.
The four metrics were measured at each taxonomic level as follows: a match was called when two taxonomic names (ID) were identical between the expected (E) and the assigned (A) name or, in the case of assignments with clustered databases, a match was called when one name was included in the other one (for instance, A = Amylolactobacillus amylophilus-Lactobacillus iners; E = Lactobacillus iners). For each taxonomic ID (T_i), (i) TP was defined when T_i matched both A (assigned taxonomic ID) and E (expected taxonomic ID). (ii) FP was defined when T_i matched A but not E. (iii) FN was defined when T_i matched E but not A. (iv) TN was defined when neither A nor E matched T_i.
Finally, validation metrics for all taxonomic IDs were integrated using a weighted mean, taking the corresponding expected abundance as weight, using the following equation:

$$
s = \frac{\sum_{i=1}^{n} s_i \, a_i}{\sum_{i=1}^{n} a_i}
$$

where s = weighted mean score of the validation metric (precision, recall, F1 score, or accuracy) for all taxonomies of a mock community, s_i = score of the validation metric for taxonomic ID T_i, a_i = expected relative abundance of taxonomic ID T_i (weight), and n = total number of taxonomic IDs included in the mock community.
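The matching rule, the per-taxon confusion counts, and the abundance-weighted mean can be put together in a small Python sketch. It is a simplified reading of the description above and not the evaluation code used in the study. All function names are hypothetical, and the inputs are assumed to be parallel lists of expected and assigned names, one pair per sequence.

```python
def names_match(a, b):
    """Exact match, or one name contained in the other (clustered labels)."""
    if not a or not b:
        return False
    return a == b or a in b or b in a


def confusion_counts(taxon, expected, assigned):
    """Count TP, FP, FN, and TN for one taxonomic ID over paired labels."""
    tp = fp = fn = tn = 0
    for exp_name, asg_name in zip(expected, assigned):
        in_expected = names_match(taxon, exp_name)
        in_assigned = names_match(taxon, asg_name)
        if in_expected and in_assigned:
            tp += 1
        elif in_assigned:
            fp += 1
        elif in_expected:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def f1_score(tp, fp, fn):
    """F1 written as 2TP / (2TP + FP + FN), equivalent to the harmonic mean."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def weighted_mean(scores, abundances):
    """Mean of per-taxon scores weighted by expected relative abundance."""
    total = sum(abundances)
    return sum(s * a for s, a in zip(scores, abundances)) / total
```

Plain substring matching is a simplification: a genus name would also match any species of that genus, so a stricter implementation might split clustered labels on their separator before comparing.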
Bray-Curtis distances were calculated between the expected and assigned composition for each sample in R (version 4.2.1) using the vegan package (version 2.6-4).
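For readers who want to see the quantity being computed, the Bray-Curtis dissimilarity between an expected and an assigned profile can be written in plain Python. This is an illustrative equivalent, not the R/vegan call used in the study. The dictionary input mapping taxon names to abundances is an assumption.

```python
def bray_curtis(expected, assigned):
    """Bray-Curtis dissimilarity between two abundance profiles.

    Each profile is a dict mapping taxon name to abundance; taxa missing
    from one profile count as zero.
    """
    taxa = set(expected) | set(assigned)
    numerator = sum(abs(expected.get(t, 0.0) - assigned.get(t, 0.0)) for t in taxa)
    denominator = sum(expected.get(t, 0.0) + assigned.get(t, 0.0) for t in taxa)
    return numerator / denominator if denominator else 0.0
```

The value ranges from 0 for identical profiles to 1 for profiles that share no taxa, which is why shorter distances indicate a better assignment.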
To discover significant differences in performance metrics, F1 scores and Bray-Curtis distances were compared among the GSR, Greengenes 2, ITGDB, and SILVA databases using the Wilcoxon test (P-values adjusted by the Benjamini-Hochberg method).
Additionally, since different databases might use different taxonomic nomenclature, in order to consider synonyms of scientific names as correct matches, taxonomy unification (ETE toolkit v.3.0) was applied to each taxonomic classification before comparisons.

Tenfold cross-validation
We conducted a 10-fold cross-validation to validate the results obtained with the mock communities. Train and test data sets for each database were built using the scikit-learn Python module (v0.24.1). These training data sets were used to train a naive Bayes classifier in QIIME2, setting the n-gram-range parameter to [6,6], as it yielded globally better classification results (Fig. 2A). These classifiers were used to assign the taxonomic nomenclature to the test data sets, with QIIME2 using the default parameters. Accuracy was assessed following the methodology outlined by Edgar (21).
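The idea behind the train/test partitioning can be shown with a standard-library sketch. The study used scikit-learn for this step; the version below is a stand-in written without that dependency. The function name, the shuffling seed, and the interleaved fold assignment are assumptions.

```python
import random


def kfold_indices(n_items, n_splits=10, seed=0):
    """Yield (train, test) index lists for a shuffled k-fold partition."""
    indices = list(range(n_items))
    # Shuffle once with a fixed seed so the partition is reproducible.
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_splits folds of near-equal size.
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Each database entry appears in exactly one test fold, so every sequence is classified once by a classifier that was not trained on it.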

Gut and vaginal microbial data sets
To further validate our database performance, we performed a case study using actual biological data from human gut samples (22) and human vaginal samples (23). These data sets contained V4 amplicon sequences. Taxonomic assignments were performed in both data sets using the QIIME2 naive Bayes feature classifier, with the following V4 databases as reference: Greengenes, Greengenes 2, GSR, GTDB, ITGDB, MetaSquare, SILVA, and RDP. The n-gram-range parameter was set to [7,7] and the confidence threshold to "disable," as these were the parameters found to perform best in the validation step.

Computational benchmarking
Furthermore, we also tested the computational cost of obtaining a taxonomic profile with the QIIME2 naive Bayes classifier with each of the V4 reference databases employed in this case study. We measured the time and memory consumption of the classifier training and the taxonomic assignment processes. Time was measured with the Python built-in time module, and memory consumption was tracked using the memory_profiler module. These analyses were run on a computer with an Intel Xeon Gold 6238 processor with 44 CPUs and 187 GB of RAM, and Ubuntu 18.04.4. Classifier training was run with default settings. Taxonomy assignment was performed by setting the confidence threshold to "disable" and using 10 threads.
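A minimal sketch of this kind of measurement is given below. It uses the standard-library `tracemalloc` module in place of memory_profiler so that the example has no third-party dependency. This is a substitution for illustration, and `tracemalloc` only sees allocations made by the Python interpreter, so its figures would not be comparable with those in the study. The helper name `measure` is hypothetical.

```python
import time
import tracemalloc


def measure(func, *args, **kwargs):
    """Run func and return (result, elapsed seconds, peak traced bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
    finally:
        # Record wall-clock time and peak memory even if func raises.
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, elapsed, peak
```

Wrapping the training call and the assignment call separately with a helper like this would give one time and one memory figure per step and per database, matching the layout of the two benchmarking tables.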

GSR database
To optimize the prokaryotic taxonomic assignment, we created the GSR database by integrating and manually curating Greengenes (v13_8, 99%), SILVA (v138, 99%), and RDP (train set no. 18) data sets (Fig. 1A and B). The integrated full-length 16S GSR database has a total size of 90,408 sequences, with the following source composition: 22.29% RDP, 58.15% SILVA, 19.41% Greengenes, and 0.15% NCBI (vaginal-specific sequences). The source composition and total size of the variable region databases are shown in Table 2. The V1-V3 and V4 databases are those with fewer available sequences. The sequence length distributions of the GSR databases are presented in Fig. 1C.

QIIME2 parameters impact taxonomic assignment performance
The 16S rRNA analysis pipeline of QIIME2 (24) includes training a naive Bayes classifier with a reference database and a subsequent taxonomic assignment of the rep-seqs (15). In these two steps, we tested different values of n-gram-range and confidence threshold parameters for each database (full-length 16S and specific 16S regions) as they are known to affect the classifier's performance. Fig. 3A and B summarize the performance of the aforementioned parameters across tested databases and regions. Two n-gram-range values were tested: [6,6] and [7,7]. The Wilcoxon test showed that [7,7] performed better than [6,6] (P < 0.0001) (Fig. 3A). Confidence threshold values show significant differences in F1 score (Fig. 3B, P < 0.0001 for all comparisons in a pairwise manner), precision, and recall at both genus and species levels (Tables S2 to S4). Setting the confidence threshold to "disable" provided the best classification results at the species level, suggesting that setting a confidence threshold for the QIIME2 classifier notably restricts the predictions at the species level without improving the predictions at higher levels. Therefore, the n-gram-range of [7,7] and "disable" confidence threshold were further used to benchmark the GSR database with other already existing databases.

GSR outperforms most existing databases across all tested regions
To assess the performance of the newly created database, we benchmarked the GSR database with the other existing databases (Greengenes, Greengenes2, SILVA, RDP, ITGDB, and MetaSquare), using two different approaches: validation metrics (F1 score shown in Fig. 3C and D; Fig. S2, precision and recall shown in Tables S5 to S7) and Bray-Curtis distances (Fig. S3; Table S8). In order to increase the robustness of the results, we defined the combination of the F1 score and the Bray-Curtis distance as the validation scores. The database with the best validation scores will achieve the highest F1 score and the shortest Bray-Curtis distance.
At the family level, the Greengenes2, GSR, ITGDB, and SILVA databases reached, in most cases, the best validation scores across regions in all the validation data sets (Fig. S2 and S3A; Tables S5 and S8). At genus level (Fig. 3C; Fig. S3B; Tables S6 and S8), GSR achieved significantly better validation scores across almost all regions in the gutmock data set, followed by Greengenes2, ITGDB, and SILVA databases. In the mockrobiota data set, ITGDB and SILVA achieved the best validation scores across regions, sharing similar values with GSR for V1-V3 and V3-V4. Finally and most importantly, at the species level (Fig. 3D; Fig. S3C; Tables S7 and S8), except for the full-16S region where ITGDB had the best validation score, the GSR database presented the best scores for almost all the regions and validation data sets. Overall, these results indicate that whereas the database performance remains relatively stable up to the family level, substantial differences were observed at the genus and species level, with GSR showing the best performance results among the tested databases.
The Greengenes database performed worst in all tested environments, except for the vagimock data set at the genus level, for which it performed similarly to the other databases. On the other hand, the RDP, GTDB, and SILVA databases yielded better results across all environments and regions. Previous studies have already pointed out the increased accuracy of SILVA and RDP databases in comparison to Greengenes (3), mainly due to the fact that, in the last few years, SILVA and RDP have been updated more frequently than Greengenes. The better performance of Greengenes in identifying genus-level classifications within the vagimock data set could be attributed to the low complexity of this mock community. It has been observed that database limitations may not be as apparent when analyzing mock communities with limited characteristics (3).
Results from the 10-fold cross-validation showed that Greengenes2, GSR, and ITGDB databases presented significantly better performance than the other databases in almost all levels and regions, which is consistent with the results obtained in the mock community validation (Fig. 2). At the species level, Greengenes 2 outperformed GSR, with the exception of the region V1-V3.

Case study: vaginal and gut data sets
In methodological benchmarking studies, it is crucial to contextualize the benchmarking outcomes using actual biological data. Therefore, 10 vaginal and 10 gut microbiome samples were analyzed from Vargas et al. (23) and Yáñez et al. (22) data sets, containing 2,089 and 31,885 V4 representative sequences, respectively. These data sets allowed us to assess the consistency of taxonomic nomenclature among our newly built database and other existing databases, including Greengenes, Greengenes2, GTDB, ITGDB, RDP, and SILVA. Additionally, the analysis of real data sets allowed us to compare the computational cost of taxonomy profiling among the aforementioned reference databases.

GSR annotation enhances taxonomic nomenclature consistency
Each database uses different synonym terms for the same NCBI taxonomy ID, as shown in Fig. 4. For instance, in Fig. 4A, Greengenes, GSR, and RDP use exclusively the term Bacteroidetes for NCBI:txid976, SILVA, Greengenes2, and GTDB use the synonym Bacteroidota, and ITGDB uses both of them. Similarly, for NCBI:txid201174, Greengenes, GSR, and RDP use the term Actinobacteria, SILVA, Greengenes2, and GTDB use the synonym Actinobacteriota, and ITGDB uses both of them. In addition, GTDB and Greengenes2 split the phylum Firmicutes into several clusters, namely Firmicutes, Firmicutes_A, Firmicutes_B, Firmicutes_C, and Firmicutes_D. At the order level, another example can be found in Fig. 4B. For NCBI:txid186802, Greengenes and RDP use the term Clostridiales, and GSR uses the synonym Eubacteriales. SILVA, Greengenes2, GTDB, and ITGDB use several non-NCBI terms such as Clostridia, Lachnospirales, and Oscillospirales. Moreover, ITGDB also uses the accepted term Clostridiales. Finally, other taxonomy inconsistencies can be found at the family level (Fig. S4). For NCBI:txid216572, whereas Greengenes uses the term Ruminococcaceae and GSR uses its synonym Oscillospiraceae, SILVA, GTDB, and ITGDB use both aforementioned terms, and SILVA and ITGDB also use the synonym Hungateiclostridiaceae. Taken together, these results indicate that Greengenes, RDP, and GSR databases have robust taxonomic nomenclatures, using exclusive terms for one NCBI taxonomy ID. In contrast, SILVA, Greengenes2, GTDB, and ITGDB databases use several terms to refer to the same taxon, some of which are non-NCBI terms.

Computational benchmarking
The benchmarking was performed in two steps in which the databases were involved: classifier training and taxonomic assignment. A naive Bayes classifier was trained in QIIME2 using the V4 region of each reference database. Training time and memory usage for each classifier are shown in Table 3. The most computationally efficient classifier training was obtained using the RDP, GSR, or Greengenes databases. These three classifiers were trained within 3 minutes and required less than 7 GB of RAM. ITGDB, Greengenes2, and GTDB classifiers show an increased computational cost, doubling the time and memory usage of the aforementioned ones. The SILVA classifier required a significantly higher amount of computational resources, taking up to 40 minutes and 25 GB of RAM to be trained. The MetaSquare (version 1.0.2) classifier was the most computationally expensive to train, being time-consuming and memory-intensive. The trained classifiers were then used to perform a taxonomy profiling of the intestinal and vaginal data sets, with the confidence threshold set to "disable" and multithreading used with 10 threads. The benchmarking results for this step are presented in Table 4. Resource consumption resembled the pattern seen in the classifier training step. Greengenes classifier was the most computationally inexpensive, followed by GSR and RDP, which are still affordable. ITGDB and GTDB almost double the required resources, and SILVA and Greengenes2 were the most resource consuming. MetaSquare classifier was also tested, but its taxonomy assignment could not be completed due to a lack of computational resources.

DISCUSSION
In this study, we generated a new 16S database for bacterial and archaeal organisms: the GSR database. The performance of the GSR database was assessed in conjunction with six other existing 16S databases: Greengenes, Greengenes2, GTDB, ITGDB, SILVA, and RDP. Our attempt to evaluate the MetaSquare database was hampered by its extremely high demand for computational resources compared with other existing databases (Table 3). We believe these requirements are unreasonable and impractical. Therefore, we discarded MetaSquare for subsequent analysis and cannot rationally advise its use.
Before database comparisons, we first explored the best parameter configuration for each of the five databases, as previous studies have pointed out the impact of n-gram-range and confidence threshold parameters on classifier performances (15). The n-gram-range value of [7,7] performed better than [6,6], whereas the confidence threshold value of "disable" significantly outperformed the other values at the species level. These results suggest that the confidence threshold value plays an essential role in the taxonomic resolution and should be consistently reported in studies. Based on our results, we recommend using the n-gram-range of [7,7] and confidence threshold "disable" values in microbial profiling studies that utilize the GSR database.
Regarding database performance, GSR outperformed GTDB, SILVA, RDP, and Greengenes databases in almost all tested environments and regions. ITGDB database presented a comparable performance to the GSR database, performing better in the mockrobiota data set. However, the ITGDB database has some significant shortcomings, not detected in the GSR database, such as taxonomical discrepancies and lower computational efficiency. Based on the most unbiased experimental evaluation, GSR was only outperformed by Greengenes 2, with the exception of the region V1-V3.
The case study performed with gut and vaginal sample data sets exposed the consequences of not unifying the taxonomy when merging databases with different taxonomic annotations. ITGDB database presented multiple cases of taxonomical inconsistencies (Fig. 4), where several synonym terms were used to refer to the same taxonomic clade. A similar behavior is also noticeable in SILVA but to a lesser extent. The lack of a consistent taxonomy might severely interfere with microbial taxonomy analyses, impacting diversity metrics or differential composition analyses. In this regard, it is worth noting that the GSR database does not suffer from taxonomic consistency issues and can provide more reliable and robust results. Furthermore, this case study revealed that the computational resources used by QIIME2 differ depending on which reference database is employed. ITGDB database made QIIME2 consume twice as many computational resources as GSR (Tables 3 and 4), making GSR a more suitable alternative for obtaining high-resolution taxonomy profiles at lower computing costs.
Despite the described results, several limitations need to be considered. First, the lack of testing on non-human samples, such as soil and water samples, raises concerns about the generalizability of the database to different contexts. Without this information, we cannot fully understand how well our database will perform in these environments. Second, the GSR database only contains sequences from known species, excluding unclassified organisms or organisms labeled as uncultured. While this may be detrimental for the analysis of environments containing a large amount of unknown or uncultured species (8), we demonstrated that it improves species detection in well-described environments, such as human body sites. Additionally, the utilization of a single classification software (QIIME2 naive Bayes) precludes the ability to extrapolate the performance of our database to alternative classification methods, as the use of different software may yield different results. Finally, another limitation is the restricted testing conducted in human-like environments. Although gut and vagina samples have been examined in this study, the database usefulness could be more comprehensively evaluated by extending the analysis to other human environments, such as skin and saliva. Overall, the GSR database demonstrates potential, but it is crucial to acknowledge and address the aforementioned factors to obtain a thorough understanding of its applications and potential drawbacks.
While 16S amplicon-based sequencing has limitations, its low cost and simplified methodology still make it a valuable tool for analyzing the microbiome composition, especially for low-biomass samples. The vast amount of data generated during the last decade can not only help to answer pressing questions about microbiome-disease relationships in larger epidemiological studies but also can be used along with shotgun metagenomic sequencing data to explore new clinical applications (25). Therefore, the GSR database offers several advantages for microbial taxonomic classification using 16S sequencing. It integrates three of the main reference databases, ensuring a comprehensive and accurate taxonomic annotation. The taxonomy consistency allows for reliable analysis, which is crucial for the robustness of microbiome studies. GSR database also demonstrates an improved performance with microbial communities containing mainly known species, enhancing its utility in various applications. Finally, its usage is not computationally expensive, making it accessible to researchers with limited computational resources. Overall, these features make the GSR database a valuable resource for the scientific community to further investigate microbial communities.
Upcoming versions of GSR-DB will prioritize keeping integrated databases current with the latest versions and consistently updating the taxonomy to align with the most recent NCBI taxonomy release. Additionally, we intend to expand the web server's functionalities, allowing users to navigate through the database.

FIG 2 Database evaluation with 10-fold cross-validation. (A) Average n-gram parameter performance in 10-fold cross-validation. N-gram-range with value [6,6] shows a better performance in the 10-fold cross-validation data sets. (B) Tenfold cross-validation results. Average accuracy of classification at family, genus, and species level. Error bars are the standard deviations. Wilcoxon test was conducted between accuracy scores of GSR, Greengenes2, ITGDB, and SILVA databases. ns, not significant; *P adj < 0.05; **P adj < 0.01; and ***P adj < 0.001.

FIG 3 Database evaluation with mock communities. (A and B) Benchmarking of n-gram-range (A) and confidence threshold (B). The median F1 score is shown at each taxonomic level across all tested databases, regions, and validation data sets. Error bars represent the interquartile range (N = 900). Full data for the family, genus, and species level are available in Tables S2 to S4. (C and D) Database benchmarking at genus (C) and species (D) levels using validation metrics. The mean F1 score across the five metagenomic samples is shown for each evaluated region and data set. Error bars are the standard deviation. Database benchmarking results at the family level are available in Fig. S2. Precision and recall metrics are available in Tables S5 to S7 for family, genus, and species levels, respectively. Wilcoxon test was conducted between F1 scores of Greengenes2, GSR, ITGDB, and SILVA databases. ns, not significant; *P adj < 0.05; **P adj < 0.01; and ***P adj < 0.001.

FIG 4 Relative abundance of gut and vaginal samples at phylum (A) and order (B) levels. Only relevant taxa are displayed. The remaining taxa are included in the label "Others." Family level is available in Fig. S4.

TABLE 1 Example of a multi-class confusion matrix for T_i = Lactobacillus iners^a
^a TPs were all the Lactobacillus iners classified as Lactobacillus iners. FPs were all taxa classified as Lactobacillus iners that were not actual Lactobacillus iners. FNs were actual Lactobacillus iners not classified as Lactobacillus iners. TNs are other taxonomies different from Lactobacillus iners correctly classified as non-Lactobacillus iners.

TABLE 2 Source composition of GSR databases^a
^a Number of entries in each GSR database that were recovered from each source database.

TABLE 3 Benchmarking results for the classifier training step^a
^a A QIIME2 naive Bayes classifier was trained with each one of the reference databases using the default parameters.

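TABLE 4 Benchmarking results for the taxonomy assignment step^a
^a Gut and vaginal data sets were taxonomically profiled using previously trained classifiers of each reference database. MetaSquare classifier was also tested, but no results were obtained due to a lack of computational resources (>187 GB RAM).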