Impact of 16S rRNA Gene Redundancy and Primer Pair Selection on the Quantification and Classification of Oral Microbiota in Next-Generation Sequencing

ABSTRACT This study aimed to evaluate the number of 16S rRNA genes in the complete genomes of the bacterial and archaeal species inhabiting the human mouth and to assess how the use of different primer pairs would affect the detection and classification of redundant amplicons and matching amplicons (MAs) from different taxa. A total of 518 oral-bacterial and 191 oral-archaeal complete genomes were downloaded from the NCBI database, and their complete 16S rRNA genes were extracted. The numbers of genes and variants per genome were calculated. Next, 39 primer pairs were used to search for matches in the genomes and obtain amplicons. For each primer, we calculated the number of gene amplicons, variants, genomes, and species detected and the percentage of coverage at the species level with no MAs (SC-NMA). The results showed that 94.09% of oral bacteria and 52.59% of oral archaea had more than one intragenomic 16S rRNA gene. From 1.29% to 46.70% of bacterial species and from 4.65% to 38.89% of archaea detected by the primers had MAs. The best primers were the following (SC-NMA; region; position for Escherichia coli [GenBank version no. J01859.1]): KP_F048-OP_R030 for bacteria (93.55%; V3 to V7; 342 to 1079), KP_F018-KP_R063 for archaea (89.63%; V3 to V9; undefined to 1506), and OP_F114-OP_R121 for both domains (92.52%; V3 to V9; 340 to 1405). In addition to 16S rRNA gene redundancy, the presence of MAs must be controlled to ensure an accurate interpretation of microbial diversity data. The SC-NMA is a more useful parameter than the conventional coverage percentage for selecting the best primer pairs. The pairs used the most in the oral microbiome literature were not among the best performers. IMPORTANCE Hundreds of publications have studied the oral microbiome through 16S rRNA gene sequencing. However, none have assessed the number of 16S rRNA genes in the genomes of oral microbes, or how the use of primer pairs targeting different regions affects the detection of MAs from different taxa. Here, we found that almost all oral bacteria and more than half of oral archaea have more than one intragenomic 16S rRNA gene. The performance of the primer pairs in not detecting MAs increases as the length of the amplicon augments. As none of those most employed in the oral literature were among the best performers, we selected a series of primers to detect bacteria and/or archaea based on their percentage of species detected without MAs. The intragenomic 16S rRNA gene redundancy and the presence of MAs between distinct taxa need to be considered to ensure an accurate interpretation of microbial diversity data.

T he 16S rRNA gene has been widely used to estimate bacterial diversity in different environments (1) ever since its promotion as an "evolutionary clock" some three decades ago (2). This gene, which has an average length of approximately 1,500 base pairs (bps), has several characteristics that have led to its identification as a reliable phylogenetic marker. These are the ubiquitous presence of the 16S ribosomal RNA (rRNA) gene in bacteria and archaea, its relative stability in combining conserved and hypervariable (V) regions, and the existence of complete and easily accessible databases (3).
However, the use of the 16S rRNA gene does not come without limitations, and various investigations have demonstrated the existence of up to 15 gene copies per genome in bacteria (4)(5)(6)(7)(8) and up to four in archaea (4,6,9). It is well known that this intragenomic gene redundancy affects estimates of microbial abundance that are based on gene counts (4,7). Overall, there is a tendency for the taxa with a low number of 16S rRNA genes to be underestimated, while those with high numbers are overestimated (7). In addition, the different gene regions do not have the same levels of sequence heterogeneity (6,10). Meanwhile, V1 to V2 (11), V2 to V3 (12), and V2 and V3 separately (13) have been reported to contain the greatest nucleotide heterogeneity and the greatest discriminatory power between species; other regions, such as V3 to V4 (12) and V4, V5, V7, and V8 (13), have been considered less useful for the same purpose because of their higher degree of sequence conservation. Thus, the primer pair employed in the amplification stage may influence both the detection of redundant amplicons (different 16S rRNA gene amplicons obtained from the same genome) as well as matching amplicons (16S rRNA gene amplicons with a 100% sequence similarity value, MAs) from different taxa.
A recent study has reported the existence of a maximum of four different genes per genome (here, genes/genome) in 32 species isolated from periodontal abscesses (14). However, this limited approach does not reflect the complexity of the oral microbiota, where around 700 species have been identified (15,16). On the other hand, the identification, at least at the species level, is highly desirable in 16S rRNA sequencing-based studies of the oral microbiome (17). This is because it has been demonstrated how different species from the same genus are associated with different oral conditions (18)(19)(20). Our results revealed that Porphyromonas catoniae is a core species linked to dental and periodontal health, while Porphyromonas endodontalis is associated with dental and periodontal pathology. About the differential abundance data, while Fusobacterium periodonticum is present in significantly higher numbers in the dentally healthy, Fusobacterium nucleatum subsp. vicentii is present in significantly higher numbers in individuals with high degrees of dental pathology (18). However, the taxonomic resolution at the species level could be affected by the presence of MAs.
To the best of our knowledge, there has not yet been an exhaustive in silico evaluation of the number of 16S rRNA genes present in the complete genomes of the bacteria and archaea inhabiting the human mouth. Moreover, we have been unable to identify any study in the oral microbiology field that has examined the impact of which primer pair is selected for use to detect and classify redundant amplicons and MAs from different taxa. Consequently, the aims of this investigation were to (i) evaluate the number of 16S rRNA genes in the complete genomes of all the bacterial and archaeal species ever detected in the human oral cavity, using for the first time in the oral microbiome literature the search_16S_py tool (https://github.com/slyalina/search_16S_py), which is based on an algorithm with an estimated sensitivity of .99% for identifying known 16S rRNA genes (21), and (ii) assess how the use of different primer pairs would affect the detection of redundant amplicons and MAs from different taxa.

RESULTS
Number of intragenomic 16S rRNA genes in oral-bacterial and oral-archaeal genomes. Genomes with complete sequencing status in the expanded Human Oral Microbiome Database (eHOMD) (22), along with oral-archaeal complete genomes obtained from data collected in previous research (23), were chosen for inclusion in the present work. A total of 518 oral-bacterial genomes belonging to 186 species, and 191 oral-archaeal genomes from 135 species, were evaluated. This discrepancy between the number of genomes and species analyzed is because different strains of the same species have different genomes, and some strains have more than one genomic identifier which could belong to chromosomal DNA or plasmids. Table 1 details the mean number of intragenomic 16S rRNA genes in the bacterial and archaeal phyla through seven taxonomic ranks. The 518 oral-bacterial genomes examined had a mean size of 2,933,660.68 bp and an average number of 4.55 intragenomic 16S rRNA genes, which in turn had a mean size of 1,501.32 bp and an average of 2.60 variants (sequences differing by at least one nucleotide from the referencethe first obtained-sequence). Of the 186 bacterial species, 11 (11/ Tables S1 and S2 in the supplemental material contain the sizes of the bacterial and archaeal genomes and genes, the number of genes/genome, and the number of gene variants/genome across eight taxonomic ranks. Evaluation of the primer pairs taken from our previous research and those used most in oral microbiome studies. A total of 33 primer pairs with the best in silico coverage values in a previous investigation by our group were selected for evaluation (23). In addition, through a literature review, a list of primer pairs used in oral microbiome studies was compiled in this previous research (23), and the six most frequently repeated (i.e., most commonly used) primers were also selected for evaluation here. Tables 2 and 3 detail the size and number of 16S rRNA gene amplicons detected by the primer pairs in the oral-bacterial and oral-archaeal genomes. The mean number of 16S rRNA gene amplicons varied from 4.39 to 4.84 for bacteria (mean amplicon variants/ genome, 1.09 to 2.69) and 1.58 to 2.43 for archaea (mean amplicon variants/genome, 1.08 to 1.34). All the primer combinations identified the maximum mean numbers of intragenomic genes for the bacterial and archaeal species examined (10.83 and 4.00, respectively). However, although most of the primer pairs were able to detect the highest mean value of the gene variants/genome for the archaeal species (i.e., 3.00), only one primer pair detected this maximum value for bacterial species (i.e., 9.00). Tables 4 and 5 show the percentages of detected taxa with and without MAs and overabundance estimators obtained by the primer pairs tested on the oral-bacterial and oral-archaeal genomes. increased as the mean length of the amplicons obtained by the primer pair increased. If we contrast the percentages of species detected with their respective SC-NMA, all the primer pairs with short mean amplicon lengths (S; 100 to 300 bp) analyzed showed the largest differences between both parameters (average difference, 21.34% for bacteria and 23.70% for archaea), followed by those of medium mean amplicon lengths (M; 301 to 600 bp; 7.30% and 13.75%, respectively). The primer pairs with long mean amplicon lengths (L; .600 bp) presented the smallest differences between the coverage and SC-NMA values (4.30% and 5.82%, respectively). According to the SC-NMA values, the best three bacteria-specific primer pairs were KP_F048-OP_R030 and KP_F048-KP_R060 (L; regions V3 to V7 and V3 to V9, respectively; SC-NMA, 93.55%; six MAs; overestimation factor caused by the presence of MAs [OF-MA], 1.06 for both) and OP_F053-KP_R020 (M, V1 to V3, 93.01%, six, 1.06). In contrast, the worst primer pair was OP_F066-KP_R040 (S, V5 to V6, 47.31%, 77, 2.78). The most commonly used bacterium-specific primer pairs in the literature were not among those with the best SC-NMA values in their respective amplicon length categories. Specifically, KP_F031-KP_R021 (M, V1 to V4, 73.12%, two, 1.02) and OP_F009-OP_R029 (M, V5 to V8, 75.27%, 24, 1.24) had the worst SC-NMA among the primers in the a The amplicon length category and gene regions were determined in a previous investigation according to the mean size of the amplicons generated by a given primer and to the mode first position of the forward primer and the mode last position of the reverse primer, respectively (23). The most commonly used primer pairs in the literature were detected in the before-mentioned study (23). In the column "gene region", the different regions covered by each primer pair are defined and delimited by a dash. b A, archaea; ALC, amplicon length category; B, bacteria; bp(s), base pair(s); F, forward; g/G, number of 16S rRNA gene amplicons per genome; gv/G, number of 16S rRNA gene variant amplicons per genome; KP, Klindworth primer; L, long mean amplicon length category (.600 bp); M, medium mean amplicon length category (301 to 600 bp); OP, oral primer; R, reverse; S, short mean amplicon length category, (100 to 300 bp); V, hypervariable region. medium amplicon length category, and KP_F034-KP_R065 (L, V1 to V9, 82.26%, two, 1.02) had the worst value among those in the long category.
Considering the three categories of amplicon lengths, the SC-NMA values for the archaeon-specific primer pairs were 89.63% for KP_F018-KP_R063 (L; V3 to V9; six MAs; OF-MA, 1.11), 85.93% for KP_F022-KP_R063 (M, V5 to V9, eight, 1.14), and 69.63% for the OP_F066-KP_R013 (S, 35, 1.99). Interestingly, the primer pair with a long mean amplicon length, KP_F014-KP_R011 (V3 to V6), which is the one used most in the literature to detect oral archaea, was only able to identify 30.37% of the species tested in this study, resulting in the lowest SC-NMA value (26.67%; five MAs; OF-MA, 1.14).
Tables S3 to S6 contain more detailed information on the results of MA-and MAfree species coverage and the overabundance parameters (overestimation factor [OF] and OF-MA values) obtained by the primer pairs tested against the oral-bacterial and oral-archaeal genomes. Tables S3 and S6 also include the results obtained from the bacterial and archaeal primer pairs for both domains.
Depending on the primer pair tested, from 1.29% (2/155 detected species) to 46.70% (85/182) of the bacterial species and from 4.65% (6/129) to 38.89% (49/126) of the archaeal species had MAs (Tables 4 and 5). Fig. 1 shows the bacterial and archaeal a The amplicon length category and gene regions were determined in a previous investigation according to the mean size of the amplicons generated by a given primer and to the mode first position of the forward primer and the mode last position of the reverse primer, respectively (23). The most commonly used primer pairs in the literature were detected in the before-mentioned study (23). In the column "gene region", the different regions covered by each primer pair are defined and delimited by a dash. species with MAs obtained with at least 10 primer pairs. In the bacteria, these species belonged to the following genera: Actinomyces, Cronobacter, Fusobacterium, Klebsiella, Lacticaseibacillus, Lactobacillus, Staphylococcus, and Streptococcus. In the archaea, these genera were Haloarcula, Halomicrobium, Methanosarcina, Nitrososphaera, Pyrococcus, and Thermococcus. Tables S7 to S9 define in detail which species from the same or different genera shared MAs, depending on the primer pair evaluated.

DISCUSSION
Number of intragenomic 16S rRNA genes in oral-bacterial and oral-archaeal genomes. The intragenomic redundancy of the 16S rRNA gene has been evaluated previously using genomes from diverse sources such as GenBank (5,7,24,25), the National Center for Biotechnology Information (NCBI) microbial genome database (4,6,8,26), and the rRNA Operon Copy Number Database (rrnDB) (9,27,28). These in silico investigations extracted the gene sequences from the complete genomes through tools such as Kodon 2.0 (24; https://www.bionumerics.com/download/software/kodon -version-204) and RNAmmer (6,29) or by using a primer pair targeting the regions V4 to V6 (7). However, none of these studies focused on the genomes of microorganisms living in a specific environment. As gene redundancy has been proven to affect abundance estimates based on gene counts (4, 7), variations in the number of genes/genome of the microbes inhabiting the ecosystem of interest must be examined to ensure proper descriptions of the microbial community. To the best of our knowledge, this study is the first to investigate the number of intragenomic 16S rRNA genes in the microbiota of the oral environment. Through chromatograms derived from direct sequencing or cloning, recent research identified a maximum of four different 16S rRNA genes/genome in 138 clinical isolates taken from periodontal abscesses (14). However, the low number of species evaluated (n = 32) and the focus on a specific niche and health condition within the mouth limit the applicability of the findings to the oral microbiota more generally. In contrast, the present study evaluated all of the complete bacterial genomes described in an oral-specific database (22) and a series of genomes taken from archaeal species previously identified in the human mouth (23); all these bacterial and archaeal genomes were downloaded from the NCBI website (26). Moreover, for the first time in the oral microbiome literature, we extracted the 16S rRNA genes using search_16S_py (https://github.com/slyalina/search _16S_py), a special and easily accessible tool based on Edgar's algorithm, which has an estimated sensitivity of .99% for identifying known 16S rRNA genes (21). In our opinion, this algorithm represents a significant improvement in the detection of the 16S rRNA genes over previously used methods (7,24) since it constitutes a specialized tool for this purpose.
Still, it has to be said that the above-referenced sequencing-based research (30-35) did not aim to account for or correct for the variation in the 16S rRNA gene copy numbers between taxa, nor did they point out that this was a study limitation. Indeed, as far as we know, no investigation of the oral microbiome has used the methods available to correct for the variation in the numbers of 16S rRNA genes, such as CopyRighter (36), PAPRICA (37), or PICRUSt (38). Given the results obtained in the present study, we encourage researchers to incorporate these tools in their clinical investigations as one of the necessary steps to obtain a picture of the oral microbiome composition as close to reality as possible.
Evaluation of the primer pairs taken from our previous research and those used most in oral microbiome studies. There is a lack of literature on how the 16S rRNA gene primer pair influences the detection of redundant amplicons and MAs from different taxa. Recognizing the importance of conducting 16S rRNA gene-based research using habitat-specific databases (17), the present study is the first to evaluate the above-mentioned topic focusing on the oral microbiota.
Through this analysis, we discovered that all the primer combinations amplified the maximum mean number of genes/genome in both the bacterial and archaeal species (10.83 and 4.00, respectively). However, the great majority of them could not detect the maximum mean number of variants/genome in bacteria (i.e., 9.00), which was not the case for the archaea.
The presence of amplicons with matching 16S rRNA gene sequences in different species is a challenge for researchers, as they could be inappropriately misclassified, thus artificially increasing the number of counts in operational taxonomic units (OTUs) or amplicon sequence variants (ASVs) (7,39), depending on the bioinformatics pipeline used. As amplicons derived from distinct regions have different degrees of heterogeneity (6,10), with V1 to V2 (11), V2 to V3 (12), V2, and V3 (13) having more nucleotide variability than V3 to V4 (12), V4, V5, V7, and V8 (13), the primer pair employed may affect both estimates of diversity and taxonomic identifications. Despite the lack of literature on the subject, it is important from a clinical applicability point of view to conduct studies that take into account the quality of the primer pairs, in our case those specific to the oral microbiota.
As described in a previous investigation conducted by our research team (23), the primer pairs that identified .90% of the species in a data set were evenly distributed across the different amplicon length categories. However, these findings do not reflect the influence of MAs. To ensure that this factor was taken into account in the present study, we considered, for the first time in this type of research, the values of the percentage of coverage at the species level with no matching amplicons (SC-NMA), the overestimation factor (OF), which combines the copy number of the 16S rRNA gene amplicons and the number of MAs, and the OF caused by the presence of MAs (OF-MA). The lack of studies employing these parameters makes it impossible to conduct a relevant comparative analysis.
The estimation tools that we newly describe have allowed us to demonstrate in general terms that the primer pairs that obtain amplicons with long mean lengths (.600 bp), followed by those with medium mean lengths (301 to 600 bp), showed the greatest ability to detect bacterial and archaeal species with no MAs in contrast to primer pairs that obtain amplicons with short mean lengths (100 to 300 bp), being the mean differences between coverage and SC-NMA percentages of 22.52%, 10.52%, and 5.06%, respectively. These discrepancies between the two coverage parameters were confirmed by considering the coverage results of a previous study in which we analyzed a larger number of oral taxa based on 16S rRNA gene sequences instead of complete genomes (23). Assessing the impact that MAs could have on species abundance, primer pairs with long mean amplicon lengths had OF-MA values as low as 1.08 (e.g., with OP_F114-OP_R121, V3 to V9), meaning that the small number of MAs detected did not influence abundance. In contrast, primer pairs with short mean amplicon lengths had a maximum value of 3.45 (e.g., with OP_F066-OP_R073, V5 to V6), meaning that abundance was tripled by the presence of MAs.
The above findings reveal that the SC-NMA parameter is more useful than the conventional coverage percentage in selecting the best primer pairs because this last value does not discriminate the presence of MAs for different taxa. If there are several primer pairs with similar SC-NMA values, the OF-MA values would be the appropriate parameter to use to choose between them.
Specifically, the best primer pairs presented mean amplicon lengths of .600 bp and were KP_F048-OP_R030 (for bacteria [B]; V3 to V7; SC-NMA, 93.55%; OF-MA, 1.06), KP_F018-KP_R063 (for archaea [A], V3 to V9, 89.63%, 1.11); and OP_F114-OP_R121 (for bacteria and archaea jointly [B1A], V3 to V9, 92.52%, 1.08). In consequence, we were thus able to demonstrate that sequencing longer fragments enables the identification of lower taxonomy levels (40), reducing the probability of overestimation and classification bias related to MAs. This finding connects with the idea that sequencing the full-length 16S rRNA gene has been regarded as the better solution to the limitations of taxonomic classification (11,40) since it would at least avoid taxonomic classification biases related to MAs in oral bacteria and archaea. However, this assumption is only applicable when using sequence classification methods based on single-nucleotide resolution (100% sequence similarity) that generate ASVs. Considering that the taxonomic threshold of sequence similarity to define the species level has been set from 97% to 98.7% (11,41), we recently performed an in silico study on oral bacteria and archaea and found that the primer pairs with long mean amplicon lengths exhibit species coverage with no amplicon sequence similarity $97% (SC-NASI97) values much lower than those observed here for SC-NMA (up to 51.08% for bacteria, 49.63% for archaea, and 48.29% for both domains) (42). Therefore, full-length gene sequencing will not avoid taxonomic classification biases if we use sequence clustering methods based on 97% sequence similarity thresholds, as is commonly done for OTU construction. On the other hand, we must also take into account that the technologies which currently allow sequencing of the full-length 16S rRNA gene, such as Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), still have high error rates (10% to 15%) (43).
None of the pairs used most in the oral microbiome sequencing-based studies reported in the literature were able to detect the highest possible numbers of total genomes and species. We might assume a priori that the lower the number of taxa detected, the fewer the opportunities to misclassify them, but the best SC-NMA estimates were also not obtained with these primer pairs. The pair employed most in the literature is KP_F078-OP_R010 (B1A, V4 to V5) (23), as described by Caporaso et al. (44). This showed an SC-NMA score of 66.67% and an OF-MA of 1.68 and was the second-worst primer pair at detecting both the bacterial and archaeal superkingdoms. It was even outperformed by other primers from the same region, such as OP_F098-OP_R119 (B; V4 to V5; SC-NMA, 80.11%; OF-MA, 1.73) if only bacteria were to be detected and KP_F020-KP_R032 (B1A, V4 to V5, 78.51%, 1.79) if bacteria and archaea were to be detected. The next most-used primer pair (23) is the one recommended by Illumina (45): KP_F047-KP_R035 (B; V3 to V5; SC-NMA, 90.32%; OF-MA, 1.12). Although it produced an SC-NMA value of .90%, this was poorer than another primer targeting the same gene region: KP_F048-KP_R031 (B, V3 to V5, 91.94%, 1.12). The other primer pairs used most in the literature, albeit to a lesser extent (23), are KP_F014-KP_R011 (A; V3 to V6; SC-NMA, 26.67%; OF-MA, 1.14), KP_F034-KP_R065 (B, V1 to V9, 82.26%, 1.02), KP_F031-KP_R021 (B, V1 to V4, 73.12%, 1.02), and OP_F009-OP_R029 (B, V5 to V8, 75.27%, 1.24). Of them, the first has produced the lowest SC-NMA value reported here, which is considerably lower than the 80.00% achieved by the primer pair from the same region, KP_F020-KP_R013 (A; V3 to V6; OF-MA, 1.35). For their part, the latter three had the worst SC-NMA values in their amplicon length categories (L, M, and M, respectively), and other alternatives within them, albeit targeting different regions, performed better in silico: KP_F048-OP_R030 (B, V3 to V7, 93.55%, 1.06) in the long category and OP_F053-KP_R020 (B, V1 to V3, 93.01%, 1.06) in the medium category. Consequently, the data derived from the primer combinations employed most in the literature could be improved upon, in some cases significantly, by using the alternative primer pairs presented in this study. Moreover, these results highlight that primer pairs targeting the same gene region do not distinguish equally between taxa with MAs.
Therefore, comparisons of data from studies assessing the same region may be biased, and abundance may be inaccurate but close. In the case of comparing amplicons from different regions, the results could be vastly different and may even lead to opposite biological conclusions. Consequently, the comparison of oral microbiome studies using the same primer pairs would be the most recommended methodological approach.
Our research proves that the detection of MAs with 100% sequence similarity between different taxa is not a one-off issue, as from 1.29% to 46.70% of the oral-prokaryotic species detected by the primer pairs have them. Indeed, this number may be an underestimate, given that we were only able to examine less than a third of the genomic sequences contained in the eHOMD (22), as the remaining ones were not fully sequenced. Despite this, relevant genera present in the oral environment were identified, including Actinomyces, Fusobacterium, Lactobacillus, Methanosarcina, Staphylococcus, and Streptococcus (16,46), that had MAs in at least 10 primer pairs; 3.00%, 2.00%, 4.00%, 19.30%, 9.00%, and 15.00% of all different bacterial or archaeal species with MAs detected by all the primer pairs belonged, respectively, to such genera. Among them, there were health-associated species such as S. mitis (33) and S. oralis (34), disease-associated taxa such as F. nucleatum (30) and S. mutans (31), and those abundant in both states, such as Methanosarcina vacuolata (35), which, as shown above, had problems related to the presence of more than one 16S rRNA gene/genome. Other relevant species from distinct genera such as Capnocytophaga ochracea, T. forsythia, and T. denticola (30) also presented both intragenomic gene redundancy and MAs.
The main limitation of the present research is that only ;25% of the oral microorganism genomes listed on the eHOMD website were evaluated. This lack of complete genomes reduced the number of species evaluated to ;24% of those listed on eHOMD. Although the analysis could have been performed with a fasta file containing 16S rRNA gene sequences from oral microbes, we preferred to use complete genomes for the reasons stated previously (42), thus ensuring the high quality of the gene sequences reviewed. In our opinion, these results are only the tip of the iceberg, and the problematic issue of MAs is likely to affect more taxa as the number of genomes examined increases.
In conclusion, nearly all oral bacteria and about half of the oral archaea have more than one 16S rRNA gene in their respective genomes. Depending on the primer pair used, up to almost half of the species present MAs, affecting relevant genera present in the oral environment, such as Actinomyces, Fusobacterium, Lactobacillus, Methanosarcina, Staphylococcus, and Streptococcus. The performance of the primer pairs to detect non-MA species increases as the average length of the amplicons increases, none of these being the most widely used primer pairs in the oral microbiome literature. The best primer pairs were KP_F048-OP_R030 (for bacteria; region V3 to V7; primer pair position for Escherichia coli [GenBank version no. J01859.1], 342 to 1079), KP_F018-KP_R063 (for archaea; V3 to V9; undefined to 1506), and OP_F114-OP_R121 (for both bacteria and archaea; V3 to V9; 340 to 1405). In addition to the 16S rRNA gene redundancy, the considerable presence of MAs must be controlled to ensure an accurate interpretation of microbial diversity data. The SC-NMA is a more useful parameter than the conventional coverage percentage for selecting the best primer pairs. The choice of primer pair significantly affects diversity estimates and taxonomic classification, conditioning the comparability of oral microbiome studies using different primer pairs.

MATERIALS AND METHODS
Obtaining complete oral-bacterial and oral-archaeal genomes. All the information available on the bacterial taxa present in the oral cavity was obtained from the eHOMD website (22). All genomes with the complete sequencing status indicated by eHOMD were chosen. A total of 528 complete genomes, consisting of one or more NCBI identifiers for each complete genome, were identified among 2,074 on the eHOMD website.
The complete genomes indicated in the eHOMD have one or more GenBank (25) identifiers, which were used to access the complete sequences stored in the NCBI database (26). In general, these complete genomes consisted of one or two identifiers corresponding to their chromosomal DNA; in many cases, however, the genomes had plasmid identifiers as well, which were also investigated.
An initial list of 177 different oral archaea and their corresponding GenBank (25) identifiers, obtained as part of a previous investigation (23), enabled us to access their complete sequences in the NCBI database (26). A total of 185 complete genomes, consisting of one or more NCBI identifiers for each complete genome, were identified.
Integrating the Entrez Programming Utilities (E-utilities) tool (47) in the Python (version 3.9.0) (http:// www.python.org/) script allowed us to acquire the URLs needed to retrieve the information of interest from the various NCBI databases, including Taxonomy (48), RefSeq (49), and GenBank (25). The oral-bacterial and oral-archaeal genomes were then downloaded, and finally, the taxonomy of each of them was obtained.
Detection and extraction of 16S rRNA genes. There were several International Union of Pure and Applied Chemistry (IUPAC) ambiguous characters or nonspecific nucleotides distributed along some of the genomes. None of these characters or nucleotides represent a unique specification for the four nitrogenous bases of the DNA (A, adenine; G, guanine; C, cytosine; T, thymine) and allow for ambiguity among two, three, or four possible nucleic acid states (50). Consequently, we developed a Python (http://www.python.org/) script to detect and then randomly replace them with one of the specific equivalent nucleotides (e.g., R was replaced by either A or G). This substitution was made to avoid alignment problems with the ambiguous nucleotides, and it was done on a random basis because we cannot know with certainty which of the options should have been assigned, as all the equivalents are correct. Other genomes were excluded because they had more than 10 consecutive ambiguous IUPAC nucleotides, mainly of "N" bases. After applying these criteria, we were left with 507 oral-bacterial complete genomes, of which 497 had one chromosome, nine had two chromosomes, and one had three chromosomes. Each individual chromosome was evaluated as a complete genome, so a final number of 518 oral-bacterial complete genomes were considered for analysis. Regarding the oral archaea, 177 complete genomes, of which 166 had one chromosome, 10 had two chromosomes, and one had five chromosomes, were left. Thus, a final number of 191 oral-archaeal complete genomes were considered for analysis.
presence of the gene and provides consistent and homologous endpoints (21). Applying this algorithm, the 16S rRNA gene sequences were detected and extracted from the complete downloaded genomes, while the variants were stored in a fasta file. All the 16S rRNA gene variants identified were designated taxonomically at the strain level or at the species level if no designated strain name existed. This left us with the following for inclusion in subsequent analyses: 518 oral-bacterial genomes, corresponding to 186 species, and 191 oral-archaeal genomes, corresponding to 135 species. Their taxonomy and NCBI identifiers are included in Table S10. For each genome evaluated, we calculated its size, the sizes of the 16S rRNA genes detected, the total number of 16S rRNA genes, the number of different variants, and the number of 16S rRNA genes in each strand. The averages of the data obtained were subsequently determined using Python's NumPy (51) and pandas modules (52) for hierarchical levels above the strain level.
Evaluation of a selection of primer pairs for the detection of 16S rRNA genes. In the present study, primer pairs identified in a previous investigation as having the best in silico coverage values for detecting oral bacteria, oral archaea, or both (23) were selected for evaluation. Moreover, in that previous research, the authors found through a literature review that a total of 206 distinct primer pairs have been used to study the oral microbiome by massive sequencing (23). Those primers that were repeated more times (i.e., which appeared in more articles) were defined as those most frequently used in the oral microbiome literature and were also selected for evaluation here. This left us with 33 and 6 primer pairs, respectively, for this stage of the study, which were classified according to the average length of the amplicons into short (S; 100 to 300 bp), medium (M; 301 to 600 bp), and long (L; .600 bp) primer pairs (23) ( Table 6).
The direct and reverse sequences of each primer pair were used in combination with Python's regex module (53) to obtain, in silico, the amplicons of the 16S rRNA genes identified in all of the chosen genomes. Both sequences were checked to ensure that they matched at some position in the genome evaluated and, if they did, all nucleotides from the first position of the direct primer to the last position of the reverse primer were selected to obtain the in silico amplicons. For each primer pair, we determined the mean size and number of the 16S rRNA gene amplicons, the number of gene variants, the number of genomes and species detected, and the SC-NMA. This coverage value was calculated as The overestimation of abundance at the species level (the overestimation factor [OF]) was also calculated. This represented for each species the combination of the number of copies of the 16S rRNA gene amplicons and the number of MAs. To remove the overestimation derived from the intragenomic gene redundancy, the OF of each species was divided by the number of gene copies, resulting in the OF caused by the presence of MAs (OF-MA). Species with values equal to 1.00 did not have amplicons that matched other species for the corresponding primer pair, while those with estimates greater than 1.00 did. For each primer pair, both parameters were expressed cumulatively and as an average. The best primer pairs selected first were those with the highest SC-NMA value and, of these, those with the lowest OF-MA value. The worst primer pairs were those with the lowest SC-NMA and the highest OF-MA.
Data availability. The principal data generated or analyzed during this study are included in this published article. The Python script is publicly available at https://github.com/laravg/variant_analysis.

ACKNOWLEDGMENTS
This study was funded by Instituto de Salud Carlos III (ISCIII) through project PI21/ 00588 and cofunded by the European Union, the Consellería de Cultura, Educación e Ordenación Universitaria de la Xunta de Galicia (accreditation 2019-2022 ED431G-2019/ 04, group with growth potential ED431B 2020-2022 GPC2020/27; A. Regueira-Iglesias support ED481A-2017/233), and the ERDF, which acknowledges the CiTIUS-Research Center in Intelligent Technologies of the Santiago de Compostela University as a Research Center of the Galician University System.