Complete Genomes of Clade G6 Saccharibacteria Suggest a Divergent Ecological Niche and Lifestyle

ABSTRACT Saccharibacteria (formerly TM7) have reduced genomes and a small cell size and appear to have a parasitic lifestyle dependent on a bacterial host. Although there are at least 6 major clades of Saccharibacteria inhabiting the human oral cavity, complete genomes of oral Saccharibacteria were previously limited to the G1 clade. In this study, nanopore sequencing was used to obtain three complete genome sequences from clade G6. Phylogenetic analysis suggested the presence of at least 3 to 5 distinct species within G6, with two discrete taxa represented by the 3 complete genomes. G6 Saccharibacteria were highly divergent from the more-well-studied clade G1 and had the smallest genomes and lowest GC content of all Saccharibacteria. Pangenome analysis showed that although 97% of shared pan-Saccharibacteria core genes and 89% of G1-specific core genes had putative functions, only 50% of the 244 G6-specific core genes had putative functions, highlighting the novelty of this group. Compared to G1, G6 harbored divergent metabolic pathways. G6 genomes lacked an F1Fo ATPase, the pentose phosphate pathway, and several genes involved in nucleotide metabolism, which were all core genes for G1. G6 genomes were also unique compared to those of G1 in that they encoded D-lactate dehydrogenase, adenylate cyclase, limited glycerolipid metabolism, a homolog to a lipoarabinomannan biosynthesis enzyme, and the means to degrade starch. These differences at key metabolic steps suggest a distinct lifestyle and ecological niche for clade G6, possibly with alternative hosts and/or host dependencies, which would have significant ecological, evolutionary, and likely pathogenic implications.
IMPORTANCE Saccharibacteria are ultrasmall parasitic bacteria that are common members of the oral microbiota and have been increasingly linked to disease and inflammation. However, the lifestyle and impact on human health of Saccharibacteria remain poorly understood, especially for the clades with no complete genomes (G2 to G6) or cultured isolates (G2 and G4 to G6). Obtaining complete genomes is of particular importance for Saccharibacteria, because they lack many of the “essential” core genes used for determining draft genome completeness, and few references exist outside clade G1. In this study, complete genomes of 3 G6 strains, representing two candidate species, were obtained and analyzed. The G6 genomes were highly divergent from those of G1 and enigmatic, with 50% of the G6 core genes having no putative functions. The significant difference in encoded functional pathways is suggestive of a distinct lifestyle and ecological niche, probably with alternative hosts and/or host dependencies, which would have major implications in ecology, evolution, and pathogenesis.

… at least 6 distinct clades (G1 to G6) (7,8); however, all currently available human-associated complete genomes belong to clade G1, and only clades G1 and G3 have cultured isolates, leaving clades G2 and G4 to G6 quite poorly understood. Several recent publications have provided the first draft genomes from clades G3, G5, and G6 (4,8–11).
Obtaining complete genomes is of particular importance for Saccharibacteria because they lack many of the "essential" single-copy core genes that are typically used to estimate genome completion and because complete reference genomes outside the G1 clade are lacking. A recent, short-read-based oral microbiome study provided 21 Saccharibacteria draft genomes from clades G1, G3, and G6 (4), with several being high quality (high N50, relatively contiguous, and low predicted contamination). Therefore, nanopore sequencing of the same saliva samples that had produced the draft genomes, followed by long-read and/or hybrid assembly, was used to improve these genomes, resulting in 3 complete circular G6 genomes: JB001 (663,355 bp), JB002 (637,739 bp), and JB003 (691,584 bp). Table 1 is a summary of the genomes improved during this study, and Text S1 in the supplemental material contains a full description of the DNA extraction, sequencing, assembly, and analysis methods. These methods are a modified version of a previously reported protocol (12). Although the G1 and G3 "near-complete" improved genomes that were obtained are useful in their own right, they are still incomplete and/or may contain contamination; therefore, the 3 complete G6 genomes are the focus of this report, and the near-complete genomes are briefly discussed in Text S1.
Phylogenetic analysis using concatenated protein sequences was performed using Anvi'o (13) and included the 8 improved/completed genomes from this study, all 26 complete Saccharibacteria genomes available on NCBI (as of 1 April 2021), and 90 Saccharibacteria draft genomes from 5 recent studies (see Table S1). JB001, JB002, and JB003 were indeed members of Saccharibacteria clade G6 (Fig. 1A; see also Fig. S1) and represent the only human-associated, complete Saccharibacteria genomes outside clade G1. Notably, unlike most Saccharibacteria, which have only one copy of the rRNA gene cluster, the G6 genomes had 2 rRNA clusters. The G6 genomes also had the smallest size and the lowest GC content of all Saccharibacteria (Fig. 1A). Percent average nucleotide identity (ANI) between the G6 genomes was calculated using Anvi'o and suggested that there are at least 3 to 5 distinct species within the clade (Fig. 1B) (a cutoff of 95% ANI is frequently used to estimate the species level [14,15]). JB001, JB003, JCVI_1_bin.12, and G6_32_bin_33_unicycler appear to be the same species, with an ANI of ≥95%, despite their sources from different human subjects and independent genome assembly (Fig. 1B). JB002 and T-C-M-Bin-00022 had >98% ANI, likely representing the same distinct species, while CMJM-G6-HOT-870 and T-C-M-Bin-00011 had ~98% ANI and formed what is likely an additional G6 species (Fig. 1B). CLC Genomics Workbench was used to perform whole-genome alignment for JB001, JB002, JB003, and the G1 reference strain, TM7x (Fig. 1C). While JB001 and JB003 were syntenic except for an ~28-kbp putative mobile element present in JB003, there were more large-scale differences between JB001/JB003 and JB002. Clearly, TM7x and the G6 Saccharibacteria have undergone many genomic rearrangements and instances of gene gain/loss since their last common ancestor (Fig. 1C).
FIG 1 (partial legend) … in the supplemental material. Firmicutes was used as an outgroup. The bars in the innermost layer represent the number of singleton gene clusters (i.e., genes appearing in only that one genome) in each genome. The bars in the second layer represent the redundancy (likely contamination) within each genome. The bars in the third layer represent the %GC content of each genome. The bars in the fourth layer represent the total length in base pairs of each genome. The fifth layer displays the source/reference for each genome. The sixth layer displays the genomes that are complete. The outermost layer, and the color of the branches of the tree, illustrate which Saccharibacteria clade each genome is part of. Orange stars indicate genomes that were used in the full pangenome analysis (Fig. S2; Table S4). Yellow stars indicate genomes that were used in the pangenome analysis of complete genomes only (Fig. 2; Table S3) as well as the full pangenome analysis (Fig. S2; Table S4). A larger version of this figure, with the name of each genome labeled, is available in Fig. S1. Note that CP025011_1_Candidatus_Saccharibacteria_bacterium_YM_S32_TM7_50_20_chromosome_complete_genome and c_000000000001 (GCA_003516025.1_ASM351602v1_genomic.fa), the only two complete genomes in clades G3 and G5, are from environmental, not oral, samples. The raw data in the annotations of the tree are available in Table S1. A blue star indicates the genome isolated from a mammalian rumen, and red stars indicate genomes that were isolated from environmental sources. All other genomes are from human oral samples. (B) Average nucleotide identity (%ANI) of G6 genomes. Heat map of all-versus-all comparison of %ANI of all 11 G6 genomes. The tree on the right is a scaled-up version of the G6 portion of the phylogenetic tree in panel A. Full percentage identity, which takes alignment length into account, is available in Table S2. (C) Whole-genome alignment of TM7x versus complete G6 genomes. The tree on the left is based on the whole-genome alignment itself.
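The species-level grouping by a 95% ANI cutoff described in the text can be illustrated with a small sketch. This is not the Anvi'o workflow used in the study: the single-linkage rule, the function name `cluster_by_ani`, and the ANI values below are assumptions chosen for illustration only, and the numbers are placeholders rather than the values shown in Fig. 1B.

```python
from itertools import combinations


def cluster_by_ani(genomes, ani, threshold=95.0):
    """Group genomes by single linkage: any pair with ANI >= threshold is merged."""
    parent = {g: g for g in genomes}

    def find(g):
        # Follow parent links to the group representative (with path halving).
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for a, b in combinations(genomes, 2):
        # Pairs absent from the table are treated as unrelated (ANI 0).
        value = ani.get((a, b), ani.get((b, a), 0.0))
        if value >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for g in genomes:
        groups.setdefault(find(g), []).append(g)
    return sorted(sorted(members) for members in groups.values())


# Placeholder ANI values for illustration only; NOT the values from Fig. 1B.
genomes = ["JB001", "JB002", "JB003", "T-C-M-Bin-00022"]
ani = {
    ("JB001", "JB003"): 96.0,
    ("JB002", "T-C-M-Bin-00022"): 98.5,
    ("JB001", "JB002"): 80.0,
}

species = cluster_by_ani(genomes, ani)
print(species)
```

With these placeholder values, the two candidate species groupings described in the text fall out of the 95% cutoff. Single linkage is only one possible rule; a stricter criterion (for example, requiring every pair in a group to exceed the cutoff) could split groups differently, which is one reason the species count is given as a range.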
To examine functional and metabolic differences between the G6 clade and the more-well-understood G1 clade, pangenome analysis was performed using Anvi'o (16) on the 3 complete G6 genomes and 4 diverse G1 complete genomes (Fig. 2; Table S3). This identified 223 "pan-Saccharibacteria core genes" appearing in all genomes, as well as 94 "G1 core genes" and 244 "G6 core genes" (Fig. 2A). While 97% of the pan-Saccharibacteria core genes and 89% of the G1 core genes had known COG functions and pathways, only 50% of the G6 core genes had known COG functions and pathways (Fig. 2A), highlighting the enigmatic nature of this clade. The likely reason for the lower number of G1 core genes is the larger amount of known diversity within the G1 clade and the genomes analyzed here (8,9), leading to less conservation across the G1 pangenome. A larger pangenome analysis, examining all 11 G6 genomes and 14 diverse G1 genomes, is available in Fig. S2 and Table S4. This generated similar results, but note that this analysis contains draft genomes which are incomplete and/or may contain contamination. A complete metabolic network illustrating the known KEGG pathways identified in the three sets of core genes from Fig. 2A is shown in Fig. 2B. Both G1 and G6 genomes encode partial cell wall metabolism, glycolysis (missing phosphofructokinase), and arginine biosynthesis pathways and do not encode fatty acid metabolism, a tricarboxylic acid (TCA) cycle, or amino acid metabolism (other than arginine) (Fig. 2B). Notable enzymes and pathways present in G6 genomes but absent in G1 include maltase glucoamylase (to metabolize starch), fructose bisphosphate aldolase (a glycolytic step), adenylate cyclase, D-lactate dehydrogenase, partial lipoarabinomannan (LAM) biosynthesis, and partial glycerolipid metabolism.
Conversely, G1 genomes encode the nonoxidative phase of the pentose phosphate pathway, an F1Fo ATPase, alpha galactosidase, and several steps in nucleotide metabolism, which were not present in the G6 genomes (Fig. 2B). Between JB001 and JB002, most differences were genes with unknown functions; therefore, the differences in the KEGG pathways encoded were minor (see Fig. S3). The G6 genomes examined did not contain predicted elements of a CRISPR system. Although it is not known how Saccharibacteria obtain needed metabolites from the host, a type IV pilus-like system is generally well conserved across the group, has been proposed as a candidate mechanism (8,9), and was present in the G6 genomes here. The species-level clade that included JB001 and JB003 encoded an ~10-kbp putative prophage element, which was flanked by homologs to the PinE invertase and contained a type 4 secretion system (T4SS) VirD4 homolog and 4 hypothetical proteins, all with ~95% homology to a similar region in Streptococcus salivarius. During review of this report, the complete genome of a novel environmental G1 isolate, "Candidatus Mycosynbacter amalyticus," was published (17). "Ca. Mycosynbacter amalyticus" had a broad host range within the mycolata clade of Actinobacteria and lysed the host cells (17).
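The three core gene sets used in the pangenome comparison (present in all genomes; present in all G1 genomes and no G6 genomes; present in all G6 genomes and no G1 genomes) can be expressed as simple set logic. The sketch below is not the Anvi'o implementation. The function names, the gene cluster identifiers (`GC_1` and so on), the genome label `G1_b`, and the annotation set are all hypothetical and serve only to illustrate the definitions.

```python
def core_gene_sets(presence, g1_genomes, g6_genomes):
    """Split gene clusters into pan core, G1-specific core, and G6-specific core.

    presence maps each gene cluster to the set of genomes that contain it.
    """
    g1, g6 = set(g1_genomes), set(g6_genomes)
    pan_core, g1_core, g6_core = set(), set(), set()
    for cluster, genomes in presence.items():
        in_all_g1 = g1 <= genomes
        in_all_g6 = g6 <= genomes
        if in_all_g1 and in_all_g6:
            pan_core.add(cluster)        # found in every genome
        elif in_all_g1 and not (g6 & genomes):
            g1_core.add(cluster)         # all G1 genomes, no G6 genome
        elif in_all_g6 and not (g1 & genomes):
            g6_core.add(cluster)         # all G6 genomes, no G1 genome
    return pan_core, g1_core, g6_core


def fraction_annotated(clusters, annotated):
    """Fraction of clusters that carry a known functional annotation."""
    return len(clusters & annotated) / len(clusters) if clusters else 0.0


# Hypothetical toy data; cluster names and the genome "G1_b" are made up.
g6 = ["JB001", "JB002", "JB003"]
g1 = ["TM7x", "G1_b"]
presence = {
    "GC_1": {"JB001", "JB002", "JB003", "TM7x", "G1_b"},
    "GC_2": {"TM7x", "G1_b"},
    "GC_3": {"JB001", "JB002", "JB003"},
    "GC_4": {"JB001", "JB002", "JB003"},
    "GC_5": {"JB001", "TM7x"},  # accessory: in neither core set
}
annotated = {"GC_1", "GC_2", "GC_3"}

pan_core, g1_core, g6_core = core_gene_sets(presence, g1, g6)
print(sorted(pan_core), sorted(g1_core), sorted(g6_core))
print(fraction_annotated(g6_core, annotated))
```

Because a clade-specific core gene must be present in every genome of that clade, adding more diverse genomes to a clade can only shrink its core set, which is consistent with the explanation given in the text for the smaller number of G1 core genes.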
FIG 2 (partial legend) … all 7 genomes), the G1 core genes (found in all G1 genomes and no G6 genomes), and the G6 core genes (found in all G6 but no G1 genomes). The pie chart adjacent to each group of core genes indicates the breakdown of COG categories of the gene clusters in the group. The 7 genome layers are ordered based on the tree of the %ANI comparison, which is displayed with the red and white heat map. The layers underneath the %ANI heat map, from top to bottom, indicate the number of gene clusters, the number of singleton gene clusters, the GC content, and the total length of each genome. The Venn diagrams in the inset show the number of overlapping and nonoverlapping genes between JB001 and JB002 and between JB001 and TM7x. The number in parentheses is the number of genes with unknown functions (UF). (B) KEGG pathways encoded by G1 and G6 core genes. KEGG metabolic map overlaid with the pathways encoded by the pan-Saccharibacteria core genes (black), G1 core genes (green), and G6 core genes (red), as indicated by the Venn diagram key. Enzymes of interest are labeled with text and arrows. Pathways are indicated by labeled boxes; the cell wall metabolism pathway is labeled with the red background to distinguish it due to the odd shape and overlap with the glycolysis pathway space.
Taken together, these analyses indicate that Saccharibacteria clade G6 is highly divergent from clade G1 and may have a different lifestyle, host, and host dependencies. This is in line with the recent hypothesis that G6 reside on the tongue (G6 are referred to as T2 in reference 9) and have a long history of association with animal hosts, while G1 reside in dental plaque and were a much more recent acquisition from the environment (8,9). Interestingly, the species-level clade containing JB002 (the most reduced Saccharibacteria genome, with only 615 genes) was the only Saccharibacteria group that resided both on the tongue and in dental plaque (9).
Although all cultured isolates of Saccharibacteria were epibionts of Actinobacteria, they were all G1 or G3 strains. Residing in a different environment, G6 may have distinct host species, possibly Streptococcus, given the acquired homologous sequence. Based on the fact that G6 Saccharibacteria appear to be exclusively human associated and that human- and animal-associated Saccharibacteria have smaller genomes than their environmental relatives, a hypothesis for the smaller genome size of G6 Saccharibacteria is that they may have had a longer period of host association to undergo genome reduction compared to other clades. The lower GC content in G6 genomes may be explained in part by what appears to be horizontal gene transfer from streptococci, which have low-GC genomes. It is likely that G6 have fallen into the "unknown" taxonomic bucket in the majority of past microbiome studies; thus, the role of G6 in human health remains to be elucidated. The high percentage of genes with unknown functions further adds to the obscurity of this clade. Overall, this article highlights an urgent need for study of Saccharibacteria, since almost nothing is known about the lifestyle, host, or ecological impact of Saccharibacteria clade G6 and even less still is understood about clades G2, G4, and G5.
Data availability. The complete genome sequences of JB001, JB002, and JB003 have been deposited in GenBank under the accession numbers CP072208, CP076101, and CP076102, respectively. The BioProject accession for this project is PRJNA624185. The short reads used to generate the assemblies are available in the SRA database with the accession numbers SRX4318838, SRX4318837, and SRX4318835. The long reads used to generate the assemblies are available in the SRA database with the accession numbers SRX10387815, SRX11020560, and SRX11020561.

SUPPLEMENTAL MATERIAL
Supplemental material is available online only.
TEXT S1, DOCX file, 0.1 MB.

ACKNOWLEDGMENTS
I thank Karrie Goglin-Almeida, Jelena Jablanovic, and Kara Riggsbee for performing the library preparation and sequencing, Anna Edlund and Nini C. Tran for the saliva samples (described in reference 4), and Jeffrey S. McLean and Floyd Dewhirst for helpful discussions.
This research was supported by NIH/NIDCR K99-DE029228.