Characterization of the first vaginal Lactobacillus crispatus genomes isolated in Brazil

Background Lactobacillus crispatus is the dominant species in the vaginal microbiota associated with health and is considered a biomarker of homeostasis. Some strains are even used as probiotics. However, the genetic mechanisms by which L. crispatus controls the vaginal microbiome and protects against bacterial vaginosis (BV) are not entirely known. To further investigate these mechanisms, we sequenced and characterized the first four L. crispatus genomes from vaginal samples of Brazilian women and used a genome-wide association study (GWAS) and comparative analyses to identify genetic mechanisms involved in healthy or BV conditions and selective pressures acting in the vaginal microbiome.
Methods The four genomes were sequenced, assembled using ten different strategies, and automatically annotated. Functional characterization was performed with bioinformatics tools, in comparison with known probiotic strains. In addition, one representative strain (L. crispatus CRI4) was selected for in vitro detection of phages by electron microscopy. Evolutionary analyses, including phylogeny, GWAS, and positive selection, were performed using 46 public genomes of strains representing health and BV conditions.
Results Genes involved in probiotic effects, such as the production of lactic acid, hydrogen peroxide, bacteriocins, and adhesins, were identified. Three hemolysins and putrescine production were predicted, although these features are also present in other probiotic strains. The four genomes presented no plasmids, but insertion sequences from 14 known families and several prophages were detected. However, none of the mobile genetic elements contained antimicrobial resistance genes. The genomes harbor a CRISPR-Cas subtype II-A system that is probably inactivated due to fragmentation of the genes csn2 and cas9. No genomic feature was associated with a health condition, perhaps because of the multifactorial nature of that condition. Five genes were identified as under positive selection, but the selective pressure remains to be discovered.
In conclusion, the Brazilian strains investigated in this study present potential protective properties, although in vitro and in vivo studies are required to confirm their efficacy and safety before they can be considered for human use.

The genome sequence data were uploaded to the Type (Strain) Genome Server (TYGS), a free bioinformatics platform available at https://tygs.dsmz.de, for a whole genome-based taxonomic analysis [1]. The results were provided by the TYGS on 2020-05-25. In brief, the TYGS analysis was subdivided into the following steps.

Determination of closely related type strains
The closest type strain genomes were determined in two complementary ways. First, all user genomes were compared against all type strain genomes available in the TYGS database via the MASH algorithm, a fast approximation of intergenomic relatedness [2], and the ten type strains with the smallest MASH distances were chosen per user genome. Second, an additional set of ten closely related type strains was determined via the 16S rDNA gene sequences. These were extracted from the user genomes using RNAmmer [3], and each sequence was subsequently BLASTed [4] against the 16S rDNA gene sequence of each of the currently 11767 type strains available in the TYGS database. This was used as a proxy to find the best 50 matching type strains (according to the bitscore) for each user genome and to subsequently calculate precise distances using the Genome BLAST Distance Phylogeny approach (GBDP) under the algorithm 'coverage' and distance formula d5 [5]. These distances were finally used to determine the 10 closest type strain genomes for each of the user genomes.
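
The MASH-based preselection described above can be illustrated with a minimal Python sketch. It converts a Jaccard estimate between two k-mer sketches into a Mash distance, D = -(1/k) ln(2j / (1 + j)), and keeps the n closest type strains for one user genome. The k-mer size of 21, the function names, and the toy Jaccard values are assumptions made for illustration only; the parameters actually used by the TYGS are not stated in this text.

```python
import math
from heapq import nsmallest


def mash_distance(jaccard, k=21):
    """Mash distance from a Jaccard estimate (k = 21 is an assumed k-mer size)."""
    if jaccard <= 0.0:
        return 1.0  # no shared k-mers: distance is set to the maximum of 1
    if jaccard >= 1.0:
        return 0.0  # identical sketches
    return -math.log(2.0 * jaccard / (1.0 + jaccard)) / k


def closest_type_strains(jaccard_by_strain, n=10, k=21):
    """Return the n type strains with the smallest Mash distance to one user genome."""
    distances = {s: mash_distance(j, k) for s, j in jaccard_by_strain.items()}
    return nsmallest(n, distances, key=distances.get)


# Hypothetical Jaccard estimates between one user genome and four type strains.
toy = {"type_A": 0.90, "type_B": 0.10, "type_C": 0.50, "type_D": 0.0}
print(closest_type_strains(toy, n=2))  # ['type_A', 'type_C']
```

In the procedure described above, n would be 10, and the selection would be repeated for every user genome.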

Pairwise comparison of genome sequences
All pairwise comparisons among the set of genomes were conducted using GBDP, and accurate intergenomic distances were inferred under the algorithm 'trimming' and distance formula d5 [5]. For each comparison, 100 distance replicates were calculated. Digital DDH (dDDH) values and confidence intervals were calculated using the recommended settings of the GGDC 2.1 [5].

Phylogenetic inference
The resulting intergenomic distances were used to infer a balanced minimum evolution tree with branch support via FASTME 2.1.4 including SPR postprocessing [6]. Branch support was inferred from 100 pseudobootstrap replicates each. The trees were rooted at the midpoint [7] and visualized with PhyD3 [8].
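
Midpoint rooting, as mentioned above, places the root halfway along the longest leaf-to-leaf path of the tree. The following Python sketch shows the idea on a toy tree. The adjacency-dictionary representation, the function names, and the branch lengths are hypothetical and chosen for illustration; this is not the code used by the TYGS.

```python
def farthest(tree, start):
    """Return (node, distance, path) for the node farthest from start."""
    best = (start, 0.0, [start])
    stack = [(start, None, 0.0, [start])]
    while stack:
        node, parent, dist, path = stack.pop()
        if dist > best[1]:
            best = (node, dist, path)
        for neighbour, length in tree[node].items():
            if neighbour != parent:
                stack.append((neighbour, node, dist + length, path + [neighbour]))
    return best


def midpoint(tree):
    """Return (u, v, offset): the root lies on edge u-v, offset away from u."""
    first = next(iter(tree))
    a, _, _ = farthest(tree, first)        # one end of the longest path
    _, longest, path = farthest(tree, a)   # the other end, with the full path
    half = longest / 2.0
    walked = 0.0
    for u, v in zip(path, path[1:]):
        length = tree[u][v]
        if walked + length >= half:
            return u, v, half - walked
        walked += length
    raise ValueError("tree has no edges")


# Toy unrooted tree: leaves A, B, C joined at internal node X (undirected edges).
toy_tree = {
    "A": {"X": 1.0},
    "B": {"X": 3.0},
    "C": {"X": 2.0},
    "X": {"A": 1.0, "B": 3.0, "C": 2.0},
}
print(midpoint(toy_tree))  # ('B', 'X', 2.5)
```

Here the longest path runs from B to C (length 5.0), so the root falls on the branch between B and X, 2.5 units from B.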

Type-based species and subspecies clustering
The type-based species clustering using a 70% dDDH radius around each of the 11 type strains was done as previously described [1]. The resulting groups are shown in Tables 1 and 4. Subspecies clustering was done using a 79% dDDH threshold as previously introduced [9].
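
How the two thresholds above translate into an assignment can be sketched in a few lines of Python. The sketch takes the dDDH values between one query genome and a set of type strains and reports whether the query falls within the species radius (70%) and the subspecies threshold (79%) of its closest type strain. The function name, the toy values, and the use of an inclusive comparison at the boundary are assumptions; the full clustering procedure is the one described in [1] and [9] and is not reproduced here.

```python
SPECIES_THRESHOLD = 70.0     # % dDDH, species radius stated in the text
SUBSPECIES_THRESHOLD = 79.0  # % dDDH, subspecies threshold stated in the text


def assign(ddh_to_types):
    """Return (closest type strain, same_species, same_subspecies) for one query.

    Assumption: values equal to a threshold count as inside the cluster.
    """
    closest, value = max(ddh_to_types.items(), key=lambda item: item[1])
    return closest, value >= SPECIES_THRESHOLD, value >= SUBSPECIES_THRESHOLD


# Hypothetical dDDH values (%) between query genomes and two type strains.
print(assign({"type_A": 85.2, "type_B": 31.0}))  # ('type_A', True, True)
print(assign({"type_A": 72.0, "type_B": 40.0}))  # ('type_A', True, False)
print(assign({"type_A": 25.0, "type_B": 22.0}))  # ('type_A', False, False)
```

A query that falls outside the species radius of every type strain, as in the last call, would not be assigned to any type-based species cluster.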

Type-based species and subspecies clustering
The resulting species and subspecies clusters are listed in Table 4, whereas the taxonomic identification of the query strains is found in Table 1. Briefly, the clustering yielded 8 species clusters and the provided query strains were assigned to 1 of these. Moreover, user strains were located in 2 of 10 subspecies clusters.

Type-based species and subspecies clustering
The resulting species and subspecies clusters are listed in Table 4, whereas the taxonomic identification of the query strains is found in Table 1. Briefly, the clustering yielded 8 species clusters and the provided query strains were assigned to 1 of these. Moreover, user strains were located in 1 of 10 subspecies clusters.

Tree inferred with FastME 2.1.6.1 [6] from GBDP distances calculated from 16S rDNA gene sequences. The branch lengths are scaled in terms of GBDP distance formula d5. The numbers above branches are GBDP pseudo-bootstrap support values > 60 % from 100 replications, with an average branch support of 49.3 %. The tree was rooted at the midpoint [7].