Species-Level Resolution of Female Bladder Microbiota from 16S rRNA Amplicon Sequencing

ABSTRACT The human bladder contains bacteria, even in the absence of infection. Interest in studying these bacteria and their association with bladder conditions is increasing. However, the chosen experimental method can limit the resolution of the taxonomy that can be assigned to the bacteria found in the bladder. 16S rRNA amplicon sequencing is commonly used to identify bacteria in urinary specimens, but it is typically restricted to genus-level identification. Our primary aim here was to determine if accurate species-level identification of bladder bacteria is possible using 16S rRNA amplicon sequencing. We evaluated the ability of different classification schemes, each consisting of combinations of a reference database, a 16S rRNA gene variable region, and a taxonomic classification algorithm to correctly classify bladder bacteria. We show that species-level identification is possible and that the reference database chosen is the most important component, followed by the 16S variable region sequenced. IMPORTANCE Accurate species-level identification from culture-independent techniques is of importance for microbial niches that are less well characterized, such as that of the bladder. 16S rRNA amplicon sequencing, a common culture-independent way to identify bacteria, is often critiqued for lacking species-level resolution. Here, we extensively evaluate classification schemes for species-level bacterial annotation of 16S amplicon data from bladder bacteria. Our results show that the proper choice of taxonomic database and variable region of the 16S rRNA gene sequence makes species level identification possible. We also show that this improvement can be achieved through the more careful application of existing methods and resources. Species-level information may deepen our understanding of associations between bacteria in the bladder and bladder conditions such as lower urinary tract symptoms and urinary tract infections.

To study the relationships between the bacteria found in the human bladder and the health of the host, it is necessary to accurately identify bacteria in a rapid and large-scale manner. Reliable methods of determining the bacterial identity of an unknown bacterium include matrix-assisted laser desorption ionization-time of flight (MALDI-TOF) analysis or whole-genome sequencing (WGS) of purified colonies; both techniques permit species-level identification of bacteria (16). However, culturing specific bacterial species is time-consuming and laborious. This limitation has been circumvented by adopting culture-independent methods of sequencing DNA directly from an environmental sample, such as shotgun metagenomic sequencing and targeted amplicon sequencing, the latter most commonly involving the 16S rRNA gene (17,18). These culture-independent sequencing methods are an attractive strategy because they can more accurately reveal microbiota diversity by identifying bacteria that are difficult to grow in culture (19).
Targeted amplicon sequencing of the 16S rRNA gene is a common method for identifying bladder bacteria in a large-scale manner (20,21). This is particularly true in urine, where the low microbial biomass (9,22) often results in smaller bacterial DNA quantities than what is required for shotgun metagenomic sequencing (23). When performing targeted amplicon sequencing, DNA is extracted from all cells in a sample. Next, PCR is used to amplify a small segment of the bacterial genome, typically spanning one or more variable regions of the 16S rRNA gene. This amplified DNA segment is subject to high-throughput sequencing, such as Illumina MiSeq or HiSeq short-read technology. Bioinformatics are used to process the resulting sequences and identify the taxonomy of the bacteria (24).
When identifying bacteria using targeted amplicon sequencing, there are three important components (Fig. 1). These components are (i) a database of DNA sequences annotated with taxonomic information; (ii) the identifier, or DNA sequence, of the unknown bacterium; and (iii) a classifier, which is the algorithm that compares the unknown sequence to those in a database until the closest match is found. These components work together as a classification scheme (25). One common classification scheme that can be implemented through the use of the mothur (26) or QIIME2 (27,28) bioinformatic pipelines uses the Silva database (29), the V4 region from the 16S rRNA gene sequence as the identifier (30), and the naive Bayes algorithm (28) as the classifier. A limitation of this and of many other commonly used classification schemes is that the taxonomic resolution is usually constrained to the genus level. Recently, several new approaches to sequence processing and taxonomy assignment have become available (e.g., amplicon sequence variant algorithms such as DADA2 [31], and taxonomic classifiers such as Bayesian lowest common ancestor [BLCA] [32]), which may improve resolution to the species level.
Our primary aim was to determine if species-level identification of bladder bacteria is possible using 16S rRNA amplicon sequencing studies. To achieve this aim, we used known DNA sequences from a representative sample of bacteria found in the human female bladder, published by Thomas-White and colleagues (33). This data set includes bacterial species that have been isolated from the bladder, cultured, and thoroughly characterized using WGS. We computationally tested multiple classification schemes to see which schemes identified these known DNA sequences at the species level. We evaluated several reference databases, variable regions (i.e., potential identifiers), and taxonomic classification algorithms for their ability to accurately identify bladder bacteria at the species level.

RESULTS
Representation of bladder bacteria in 16S rRNA gene sequence databases. The Thomas-White genome sequencing data set consists of 149 bladder bacterial isolates, representing 78 bacterial species from 36 genera (33). There are several databases available for bacterial identification using the 16S rRNA gene (34,35). Of these, the Greengenes (v.13_5) (36), Silva (v. 132) (37), and NCBI 16S Microbial (v. August 2019) databases were evaluated due to their widespread use in amplicon sequencing studies and the availability of species-level annotation. At the genus level, all but one genus from the Thomas-White data set (Globicatella) were present in the Greengenes database, and all genera were present in the Silva and NCBI 16S Microbial databases. At the species level, all 78 bladder bacterial species were present in the Silva and NCBI 16S Microbial databases, whereas only 21 species were present in the Greengenes database (see Fig. S1 and Table S1 in the supplemental material).
Information contained in variable regions differs for bladder bacterial species. Sequencing studies frequently focus on a small segment of the 16S rRNA gene that can be rapidly sequenced in a high-throughput manner using short-read sequencing technology, such as the Illumina HiSeq or MiSeq platforms. To evaluate the performance of different variable regions as identifiers, sequences of the 16S rRNA gene from each species in the Thomas-White data set were downloaded from Silva (see Table S1), and amplicons were computationally generated for the V1-V3, V2-V3, V3-V4, V4-V6, V3, V4, and V6 variable regions using published primers (see Materials and Methods and Fig. 2). These computational amplicons were used to determine how well the currently available classification schemes can distinguish bladder bacterial species. To assess different classification schemes, we tested multiple permutations of the variable regions listed above with different databases (i.e., Greengenes, Silva, or NCBI 16S Microbial) and different classifiers (i.e., naive Bayes or BLCA; see Fig. 1).
To quantify the amount of information contained across variable regions of the 16S rRNA gene among commonly identified bladder bacteria, we performed a sliding- Model of the components that make up a classification scheme to assign taxonomy to unknown sequences. A classification scheme consists of a database, an identifier, and a classifier. The databases used in this study are the Greengenes, Silva, and NCBI 16S databases. The identifiers used in this study are subsequences of the 16S rRNA gene, computationally generated using published primers as coordinates on the gene sequence. These targeted amplicons are the V3, V4, and V6 variable regions of the gene, or span the V1-V3, V2-V3, V3-V4, and V4-V6 variable regions. The classifiers used in this study are the naive Bayes and Bayesian lowest common ancestor (BLCA) algorithms. One example of a classification scheme is the Silva database, V4 region identifier, and naive Bayes classifier. Another example of a classification scheme is the Greengenes database, V1-V2 region identifier, and BLCA classifier. These two examples are distinct from each other and can have different outcomes when assigning taxonomy. window analysis on a multiple-sequence alignment (MSA) of all 16S rRNA gene sequences from the Thomas-White data set. We calculated entropy as a measure of information content along the MSA (Fig. 2B). As expected, the defined variable regions contained regions of high entropy, suggesting variability across species, whereas variable regions were flanked by conserved regions with low-entropy-containing sequences that are similar among species. The V1 and V2 regions contained the highest entropy, while V7 and V8 contained the lowest.
Evaluation of classification scheme performance. To evaluate the ability of currently available resources to identify bladder species, we calculated the recall, precision, and F-measure for each classification scheme implemented (see Materials and Methods). For these calculations, we compared results from computational classification schemes to known bacterial species information from prior WGS studies (33). Each taxonomic classification was evaluated as a true match, true nonmatch, false match, or false nonmatch based on whether the taxonomic classification was correctly assigned or not. Recall refers to the proportion of matches that the classification scheme correctly identified out of all possible matches. Precision refers to the proportion of matches that the classification scheme called correctly out of all classified matches. The F-measure is the equally weighted harmonic mean of recall and precision (38).
In general, classification schemes that use the NCBI 16S Microbial database Higher entropy indicates that the region has more variability across species and therefore more information to identify a bacterial species. Lower entropy indicates that the region has little variability (i.e., is conserved) across species and therefore less information to identify a bacterial species.
perform the best because they have high recall and precision and consequently the highest F-measures (median, 0.79; range, 0.60 to 0.91), regardless of which classifier is used (Fig. 3). Classification schemes using the Silva database show reduced recall and precision and therefore lower F-measures (median, 0.48; range, 0.23 to 0.69).
Because the Greengenes database is missing many of the bacterial species found in the bladder, it is less precise. As such, classification schemes using the Greengenes database can have good recall but low precision, resulting in a low F-measure (median, 0.30; range, 0.22 to 0.36). While the choice of database appears to have the largest effect on the outcome of a classification scheme, the combinations of identifiers and classifiers with specific databases show differences as well. For example, when using the Silva database, the classifier has an impact on the classification scheme outcome. With the naive Bayes classifier, the identifiers yielding the highest F-measures are the V3-V4 (0.71) and V4-V6 (0.70) targeted amplicons. Using the BLCA classifier, the V1-V3 targeted amplicon has the highest F-measure (0.55).
Confidence scores affect classification. The BLCA and naive Bayes classifiers used in this study will classify an unknown sequence even when the posterior probability for that taxon is very low. To account for this situation, a confidence score is calculated. This confidence score measures how much the classification changes through random permutation (bootstrapping), and the value reflects the "goodness of fit" of that classification (39). When lacking any knowledge of how to choose the best confidence score that minimizes the number of errors of a given classification scheme, using a predefined confidence score threshold is an option (40,41). We evaluated the performance of classification schemes against our test set when confidence score thresholds were used. For this analysis, matches that returned with confidence scores less than the designated confidence threshold were considered nonmatches. Figure 4 shows the effect of increasing the confidence score on the number of true matches returned by each classification scheme.
Almost all classification schemes had a decrease in recall when using a default confidence score of 80% ( Fig. S2A and Fig. 4). This effect is especially marked for the classification schemes that use the Silva database, which shows an average 79.3% reduction in recall with higher confidence scores (Fig. 4A). Classification schemes that use the NCBI 16S database are unequally affected by increasing confidence scores; for example, the V1-V3 identifier shows a slight reduction in recall (7.1% on average), while the V6 identifier shows the largest (43.3% on average). Classification schemes that use the Greengenes database are only slightly affected, with different confidence scores resulting in smaller-magnitude differences in recall.
Changes in the precision of classification schemes are mostly affected by the database used (Fig. S2), although confidence scores can further affect precision. For the classification schemes that use the NCBI 16S database, precision is generally improved with increasing confidence score, but at unequal amounts. For example, using the 80% threshold, the V1-V3 identifier shows a slight average increase of 3.5%, while the V6 identifier shows a large average increase of 39.7% (Fig. S2A). Classification schemes that use the Greengenes database show only minimal changes in precision. In contrast, classification schemes using the Silva database are differentially affected, with both reduction and gains in precision depending on the identifier and classifier combinations that are chosen. In general, for the Silva database, precision is reduced when using any 16S rRNA identifier with the BLCA classifier. However, a dramatic increase in precision is shown by the classification scheme composed of the Silva database, V4 identifier, and the naive Bayes classifier. When ignoring a confidence score, this classification scheme has a precision of 52.6%, but this increases to 85.7% when a confidence score threshold of 80% is used (Fig. S2).
The overall changes in how these classification schemes perform when using a 50% or 80% confidence score can be summarized by comparing the F-measure values shown in Fig. 4B. In almost every classification scheme, the F-measure value decreases when a threshold is used. The classification schemes that use the Silva database clearly demonstrate this effect and show a 66.9% reduction in F-measure values, on average, when higher confidence scores, such as 80%, are used. The classification schemes that use the NCBI 16S database show slight decreases in the F-measure values, with the exceptions of those that use the V3, V4, and V6 regions as identifiers. Those classification schemes show a large 27.2% reduction in F-measure values, on average. Finally, the classification schemes that use the Greengenes database have slight changes (0.78 to 6.78%) in their F-measure values, regardless of whether a threshold is used or not.
Amplicons spanning more than one variable region identify a larger number of bladder bacterial species. Amplicons spanning more than one variable region identified more unique bladder bacteria at the species level than amplicons spanning a single variable region. For example, with the commonly used V4 variable region and naive Bayes classifier, 21.8% of bladder bacteria are correctly identified with the Greengenes database, whereas 52.6% are identified with the Silva database and 73.1% with the NCBI 16S database (Fig. 5A). In contrast, using amplicons spanning more than one variable region, such as the V1-V3 region, 91.0% of bacteria are correctly identified at the species level when using the NCBI 16S database.
Species identified depends on choice of database and variable region. While the results thus far have focused on summarizing the overall performance of classification schemes for identifying bladder bacteria at the species level, we also sought to determine which classification schemes could be used to identify specific bacteria ( Fig. 5B and Fig. S3). Although the NCBI 16S database contains the greatest representation of bladder species, some species were not identified using certain variable regions, if at all. For example, Lactobacillus species, which are commonly found in the female bladder and vagina (33,42,43), were overall best represented within the NCBI 16S database, with 8 out of 9 species being identified with the V1-V3 and V2-V3 variable regions (Fig. 5C). However, the other variable regions only identified between 4 and 6 Lactobacillus species when using the NCBI 16S database. Interestingly, Lactobacillus With the commonly used V4 variable region and BLCA classifier, 17% of bladder bacteria are correctly identified using the Greengenes database, compared with 35% correctly identified using the Silva database and 67% using the NCBI 16S database. A similar trend is seen with the naive Bayes classifier. Using other variable regions can lead to improved species-level resolution to a maximum of 91% correctly identified. (B) Percentage of bladder bacterial species identified with each database by variable region (BLCA). Genera with fewer than 3 species are grouped as "other." (C) Identification of Lactobacillus species depends on the variable region and database used, with the NCBI 16S database having the most coverage, regardless of the variable region chosen. The Greengenes database is not shown, since it only classified two species (Lactobacillus pontis and Lactobacillus delbrueckii).
crispatus and Lactobacillus rhamnosus were only detectable with specific combinations of variable regions and databases, and Lactobacillus iners was not correctly identified from our data set with any classification scheme.
Additionally, we found that there were important discrepancies for bacteria that are thought to play a role in bladder health and disease (44) (Fig. S3). Several bladder species, such as Gardnerella vaginalis, were only detected with the NCBI 16S and Silva databases. Staphylococcus species were poorly identified with the V4 region but were distinguishable with all other regions. Streptococcus and Corynebacterium species were best identified with the NCBI 16S database. Escherichia coli, a common uropathogen (45), is not well represented in any of the databases and was only detected with the V4 region and the NCBI 16S database.
Validation of computational findings on V4 amplicon data. To evaluate the performance of our computational findings on experimental 16S rRNA amplicon data, we acquired targeted amplicon sequencing data from 24 urine samples. These urine samples were a subset of those that were used to derive cultures in the Thomas-White data set and thus should contain the same bacteria. Sequencing data were generated as part of two other studies using Illumina sequencing of the V4 region of the 16S rRNA gene (4,46). We reprocessed the raw sequencing data (see Materials and Methods) and performed taxonomic classification to assess the performance of our computational findings. Since 16S rRNA amplicon sequencing will detect many more bacteria than those identified even with enhanced culture, we restricted the evaluation to only the bacteria that grew in culture from a given sample (see Fig. S4). We used Cohen's kappa to assess the agreement between species assignment in the V4 computational data set to the experimental V4 data set (see Materials and Methods). All classification schemes had moderate agreement (Table 1).

DISCUSSION
Our study demonstrates that it is possible to attain results at the species level with existing resources when performing targeted amplicon sequencing of urinary specimens. Although species-level taxonomic resolution is possible, it requires a carefully chosen classification scheme. Within the classification scheme, we found that the classifier chosen did not drastically affect the identification of species, but the database and identifier do. Importantly, we found that the reference database strongly influences the identification of bacteria at the species level. Overall, using our approach, the NCBI 16S database performed the best, whereas the Greengenes database performed the worst.
In our study, we compared several combinations of classification scheme components in the context of the female bladder microbiome to identify bacterial species. This builds on several prior studies comparing databases and classifiers (47)(48)(49), identifiers and databases (50)(51)(52), classifiers and identifiers (53,54), or individual components of the classification scheme for application in microbiome studies. However, to our knowledge, there have been no studies that systematically compared combinations of all the components of a classification scheme (classifiers, databases, and identifiers) for species-level assignment from 16S amplicon experiments. This is an advancement over previous studies in that we were able to comprehensively show that certain combinations of identifier, classifier, and database perform better than others, particularly on bladder bacteria. While database selection is recognized as an important decision in taxonomy assignment for amplicon sequencing studies, few studies have compared the effect of different databases on taxonomy assignment (47,49,55,56), and none have focused on bladder bacterial species. Park and Won recently compared database performance on mock microbial communities and similarly found that the Silva database outperformed the Greengenes database (55). Several studies have also demonstrated that niche-specific databases perform the best for species-level identification (57), an approach that has been taken in research on many microbiome communities, such as the vaginal (58), oral (59), and gut (60) microbiomes.
For species-level taxonomy assignments, the reference database must contain species-level information of the bacteria to be identified. We found that the Greengenes database does not currently contain many bacterial species that are found in the human bladder. In contrast, the NCBI 16S Microbial and Silva databases had representation of all species that were identified from prior studies of bladder bacteria. Thus, the latter two databases are better choices for evaluating bacterial species from the bladder. Even though both of these databases contain species of interest, the NCBI 16S database consistently outperformed the Silva database on the computational data set.
The classifiers used in this study are examples of two different strategies designed to overcome the common challenges of searching an extremely large data set in order to find matching pairs of query sequences and reference records. While these two approaches are different in concept, we did not find significant differences in their performance for species-level classification of bladder bacteria on the computational data set. BLCA is an example of sequence comparison by pairwise alignment. The strength of this method is due to the fact that the similarities between two DNA samples are directly compared. This is a highly effective approach; however, until recent advances in computer technology, it remained impractical because of the computational burden. The naive Bayes classifier is an example of a k-mer-based classification approach, and was designed to circumvent the computational challenges that are faced with use of a pairwise alignment classifier. It is a commonly used classifier, implemented in Ribosomal Database Project (RDP), QIIME2, mothur, and DADA2 bioinformatics workflows. However, there are limitations when using naive Bayes for species-level identification. The first limitation arises from the database training process. If one taxon has more training examples than another, naive Bayes generates unfavorable weights for the decision boundary, leading to preferential classification of taxa with more training examples (61). The second limitation is that all features (i.e., the k-mers generated from the DNA sequences) are assumed to be independent, an assumption that is violated with the palindromic structures found in mRNA. Naive Bayes will favor classifications into taxa containing dependent k-mers over independent k-mers (61). A third limitation was that the available naive Bayes classifiers (such as those available through the RDP classifier [62] and QIIME2 [28]) only had genus-level information for the Silva and NCBI 16S databases. To use these for species-level assignment, it was necessary to retrain the classifier to use the specified databases at the species level. This training converts the DNA sequences to the calculated frequency that each k-mer occurs in a taxon.
Affordable sequencing of large-scale data is presently done with short-read sequencing technology, such as Illumina MiSeq. This is currently limited to sequencing reads of up to 300 nucleotides in length. Until full-length 16S rRNA gene sequencing can be achieved affordably on a large scale (e.g., Oxford Nanopore and PacBio technologies), choosing the optimal region of the 16S rRNA gene for identification purposes remains a significant part of the experimental design. Thus, the variable regions that are used as identifiers require some consideration.
Our findings show that use of the V2-V3 and V1-V3 regions of the 16S rRNA gene allowed for the correct identification of the most bladder bacterial species when combined with the NCBI 16S database and either classifier. In general, amplicons that span more than one variable region perform better than those that contain single variable regions. This is likely due to the increased information available with longer reads. It is important to note that longer reads can have experimental limitations (63), such as difficulties combining partial reads if portions of those reads do not overlap; these limitations were not assessed in our study. While shorter variable regions, such as the V4 region, did not perform as well as longer amplicons, they were able to identify many bladder bacteria at the species level if the NCBI 16S database was used (52 out of 78). These shorter amplicons are widely used with Illumina sequencing and are valid, depending on the study design and level of precision desired.
This study adds to a growing body of literature informing and improving experimental design for urobiome research. Several studies have now informed on best practices (64) and how method of sample collection (2,65), sample handling and storage (66), sample chemical composition (67), and DNA extraction protocol (67,68) can influence downstream results and interpretations. By taking a computational approach to evaluating classification schemes, we were able to thoroughly assess the ability of various combinations of databases, identifiers, and classifiers to identify known bladder bacterial species. However, there are several practical limitations of amplicon sequencing that were not captured in this approach. We used published primers to create computational amplicons, but this may not reflect the efficiencies of primer binding in actual experiments. Additionally, targeted amplicon sequencing generates a large number of overlapping reads and provides the data for methods to correct for errors introduced in the lab by the polymerase enzyme and through sequencing. Our method did not account for the noise that is found in experimental data, which may decrease the ability of the classification schemes to differentiate between bacterial species with similar sequences. Furthermore, our results may not extend to bladder samples that contain novel bacterial species that are not well represented in the existing databases.
Conclusion. Species-level taxonomy assignment will greatly benefit studies focused on the urobiome and its relationship to bladder health and disease. Our results show that it is possible to reliably classify bladder bacterial species using targeted amplicon sequencing of the 16S rRNA gene variable regions with existing classification algorithms and databases. We determined that the most important component of the classification scheme is the database used, and that among those that we tested, the NCBI 16S database allows for the best identification of bladder species. Our validation with V4 amplicon data demonstrates that the predicted computational outcomes are a reasonable approximation for how a classification scheme will perform on experimental data. It can be expected that the alternate variable regions covered in this study, such as the V2-V3 region of the 16S rRNA gene, would have similar outcomes, and we recommend that experimental validation be performed.
Importantly, we found that no single variable region gives 100% coverage of all bladder bacteria species. Thus, the choice of variable region may significantly affect the results of a given study. One approach to resolve this could be to use multiple-amplicon sequencing or long-read sequencing technology. These technologies are emerging and may prove to be beneficial for the urobiome community. Furthermore, no database has 100% coverage across a variable region. This could be resolved by using more than one database for classification, although this approach is complicated by differences in databases in terms of formatting, as well as conflicting classifications (69). Both of these components are important for planning experimental and computational aspects of urobiome studies and should be considered when comparing results across studies.

MATERIALS AND METHODS
Code resources. All scripts that were written for this project can be found in the GitHub repository (https://github.com/KarstensLab/BladderBacteriaSpecies). All scripts sourced from this repository are referred to as "custom." The Thomas-White data set. The 78 species of bladder bacteria used in this study were identified by culturing 149 bacterial isolates from 77 urine samples and performing whole-genome sequencing, as described in Thomas-White et al. (33). This set of identified species served as the basis for our computational analysis and is referred to as the Thomas-White data set. For each species identified, the 16S rRNA gene sequence of the corresponding type strain was downloaded from the Silva v132 release (https://www.arb-silva.de/) on 4/27/2019 (see Table S1 in the supplemental material). A type strain is defined by the sequence of the cultured isolate that was subject to the metabolic, genotypic, and phenotypic evaluations taken to define the bacterial species (70) and is the agreed bacterial organism to which the taxonomic name refers. Sequences were searched using the "[T]" filter setting, and sequences longer than 1,450 nucleotides (nt) with alignment and pintail quality scores greater than 95% were selected. For the species that had no hits, the taxonomic synonym (see below, "Synonyms of species") was used as the search query, if available. One unidentified Corynebacterium species had no type strain available and was excluded from the analysis.
The V4 validation data set. Targeted amplicon sequences from 24 urine samples, using the V4 region of the 16S rRNA gene sequence, are referred to as the V4 validation data set. These 24 urine samples originated from a subsample of the women whose samples comprised the Thomas-White data set. Sequencing data were generated as part of two other published studies using Illumina sequencing of the V4 region of the 16S rRNA gene (4,46). The raw sequence reads were processed with DADA2 v. 1.14.1 (31) to generate amplicon sequence variants (ASVs). The ASVs were subjected to taxonomic classification with the BLCA algorithm.
Synonyms of species. Species names have changed in response to advances in bacterial systematics. All currently known species synonyms were downloaded from the Prokaryotic Nomenclature Upto-Date (PNU) (71) website on 5 January 2020. PNU includes information down to the strain level, but these entries were consolidated to the species level. For example, entries like Enterobacter cloacae and Enterobacter cloacae subsp. dissolvens are treated as synonyms of Enterobacter cloacae. Classification results were then checked for synonyms using the custom "validate_match_batch.py" script.
Presence of Thomas-White species in databases. To verify that all species from the Thomas-White data set were present in the databases used in this study, each database was first converted to a FASTA file (if needed) using the "blastdbcmd" utility included in the BLAST1 suite. The FASTA file was then searched using regular expressions and the Linux command-line program grep for a match of each species in the data set. The commands were implemented using the custom "species_in_db.bash" script. The presence or absence of each species was recorded.
Multisequence alignment. The 16S gene sequences from the Thomas-White data set were formed into a multisequence alignment using the T-coffee program (72). T-coffee v. 12.00.7fb08c2 was downloaded from http://tcoffee.org/Packages/Stable/Latest/ on 5 April 2019. Alignments were performed using the default settings.
Sliding-window analysis. Comparison of the 16S rRNA gene sequences of the species in the Thomas-White data set reveals regions of conserved sequence and regions of variability. The degree to which variable regions of species differ from each other can aid the identification of each species; therefore, quantifying the amount of variability of a region across a set of species is important.
Sliding-window analysis (SWA) is the method by which a list of subsequences is generated by taking successive groups of equal size, in the manner of a window of fixed length sliding across the full sequence. Quantifying the amount of variability along an MSA is achieved by combining SWA with calculation of the Shannon entropy contained in each column framed by the window.
The minimum Shannon entropy occurs when all nucleotides in a position (column) of the MSA are the same. The maximum occurs when all possible nucleotides in the MSA are present at that position. However, the Shannon entropy calculation treats gaps in a sequence as relevant, where in practice, gaps reflect an absence of useful information. Multisequence alignments can generate many columns of gap characters due to insertions or deletions (indels) in the respective sequences that make up the MSA. A consequence of treating gaps as relevant is that the Shannon entropy will interpret these indel regions as conserved sequence. This limitation was overcome by weighting the entropy scores against gaps (73). The locations of known variable regions of the 16S gene sequence were validated, and the relative amount of variability was quantified, using the custom "weighted_ent.py" script.
Primers. Amplicons were computationally generated from the Thomas-White genome sequencing data set for the V1-V3, V2-V3, V3-V4, V4-V6, and V4 variable regions using published primers and for the V3 and V6 regions using designed primers. The primer sequences used, listed in order of amplicon spanning the variable region(s), forward primer name and sequence, and reverse primer name and sequence are as follows: V1-V3, A17F  CCG CCT GGG GAS TAC GVH-39, v6_1410R 59-AGT CCC RYA ACG AGC GCA-39. Degenerate primer design with DegePrime (78) (https://github.com/EnvGen/DEGEPRIME.git) was employed to generate primer sets for the V3 and V6 regions of the 16S rRNA gene that would anneal to as many species in the Thomas-White data set as possible. DegePrime has the option to ignore columns of a MSA if the number of "-" characters exceeds a user-defined threshold. The MSAs were preprocessed with this threshold set to .01. The main script of DegePrime was run using a degeneracy setting of 4096 and a window length of 18.
Extracting computational amplicons. For each primer set, the DNA sequence bracketed by the forward and reverse primers was extracted from the multisequence alignment. Coordinates of the MSA were identified by searching the E. coli sequence (GenBank accession number EU014689.1) included in the MSA for a match to the forward and reverse primer sequences and then mapping those position to the MSA of the Thomas-White data set. This procedure was done using the custom "extract_16s_vr.py" script and output as a multirecord FASTA formatted file.
Taxonomic classifiers. Taxonomic classification was performed with Bayesian lowest common ancestor (BLCA) and naive Bayes classifiers. BLCA (32) was cloned from the GitHub repository https:// github.com/qunfengdong/BLCA.git. For the 16S variable regions, the BLCA classifier was run using default settings but pointing to the selected reference database, either Greengenes, Silva, or NCBI 16S. The naive Bayes classifier as implemented by QIIME2 (28) was used with the Greengenes, Silva, and NCBI 16S databases and confidence settings of 0, 50, and 80, but otherwise default settings.
Evaluating computational results. To evaluate the taxonomic classification results of each classification scheme on the computational amplicon data set, the custom "new_taxonomy_results_2020-3-14.Rmd" file was used. These scripts compare the results of each record pair (each comparison between the query sequence and the sequence held in the reference database) from the classification scheme to the known identity of the query sequence from the Thomas-White data set. All record pairs that were assigned a match by the classification schemes were evaluated according to the following definitions ( Fig. 6): True match: all record pairs assigned as a match that have identical genus and species labels. False match: all record pairs assigned as a match that did not have identical genus and species labels. False nonmatch: if a record representing a species in the Thomas-White data set was present in the database, but was not assigned as a match, the record was evaluated as a false nonmatch.
True nonmatch: all records in the reference database that were not in the Thomas-White data set. While records assigned to this category were not used in evaluating the classification schemes in this study, the definition is still included for completeness.
Performance measures. Recall, precision, and the F-measure were used to evaluate the performance of each classification scheme implemented. Recall refers to the proportion of matches that the classification scheme correctly identified (true matches) out of all possible matches (true matches plus false nonmatches). Precision refers to the proportion of matches that the classification scheme called correctly (true matches) out of all classified matches (true matches plus false matches). The F-measure is the equally weighted harmonic mean of recall and precision. For this study, we chose to maximize recall over precision because the number of true matches impacts the subsequent work on diversity measures, such as species richness and evenness (79).
Evaluating V4 validation results. To determine the expected bacterial species in each sample, the results of the whole-genome sequencing of the isolates cultured from the corresponding subject was used. The species of bacteria in the V4 sequencing data were identified using classification schemes composed of (i) the Greengenes, Silva, and NCBI 16S microbial databases, (ii) the V4 sequencing results as the identifier, and (iii) the BLCA and naive Bayes classifiers. For each classification scheme, the species identification from the computational and validation data sets was scored as a "1" if correct, and "0" otherwise. This resulted in a table for each classification scheme consisting of bacterial species as the rows and the scores from the computational or validation outcomes as columns. Cohen's kappa was then FIG 6 Example of classification evaluation used in this study. Suppose there is a classification scheme comprising a set of query sequences (rows E, F, and G) and the set of reference sequences (columns E, F, L, and M) held in a reference database. In this example, the number of reference records is greater than the query records, and the reference is missing a corresponding G record from the query set. (A) If the query and reference record letters are the same, then they are designated a match. If they are different, they are designated a nonmatch. (B) Next, the classifier is allowed to assign record pairs as matches or nonmatches for all query sequences, represented as green plus signs for matches and blank cells as nonmatches. Some results are correct and some are not. Note that despite the lack of a matching record in the reference database, the classifier still designated the (G:M) pair as a match. (C) Using the definitions for assigning the classifications to the confusion matrix, there is one true match (green square), two false matches (red squares), one false nonmatch (yellow square), and 8 true nonmatches (white squares). (D) The cell values of the confusion matrix are then filled out, and performance measurements can be calculated. For this classification scheme, the precision is 1/(1 1 2) = 0.33, recall is 1/(1 1 1) = 0.5, and the F-measure is (2 Â 0.33 Â 0.5)/(0.33 1 0. 5) = 0.40. calculated for each table using the "kappa2" function from the irr R package (https://CRAN.R-project.org/ package=irr).
Data availability. This project used previously acquired publicly available data (20). All code that was written for this project can be found in the GitHub repository at https://github.com/KarstensLab/ BladderBacteriaSpecies.

SUPPLEMENTAL MATERIAL
Supplemental material is available online only. We acknowledge our late colleague Mark Asquith, Ph.D., who passed away in June of 2019 shortly after the start of this project. Dr. Asquith was a dedicated Assistant Professor in the Department of Medicine at Oregon Health & Science University, with a passion for research pertaining to the human microbiome. Initial discussions with Dr. Asquith about species-level identification of bacteria from 16S rRNA gene sequencing studies helped to shape this project. We also thank the anonymous reviewers who provided critiques that helped strengthen the manuscript.