Host translation machinery is not a barrier to phages that interact with both CPR and non-CPR bacteria

ABSTRACT Within human microbiomes, Gracilibacteria, Absconditabacteria, and Saccharibacteria, members of Candidate Phyla Radiation (CPR), are increasingly correlated with human oral health and disease. We profiled the diversity of CRISPR-Cas systems in the genomes of these bacteria and sought phages that are capable of infecting them by matching their spacer inventories to large phage sequence databases. Gracilibacteria and Absconditabacteria recode the typical TGA stop codon to glycine and are putatively infected by phages that share their host’s alternate genetic code. Unexpectedly, however, other predicted phages of Gracilibacteria and Absconditabacteria do not use an alternative genetic code. Some of these phages may infect both alternatively coded CPR bacteria and standard-coded bacteria. These phages typically rely on other stop codons besides TGA and thus should be capable of producing viable gene products in either bacterial host type. By avoiding the acquisition of in-frame stop codons, these phages may have a broadened host range. Interestingly, we additionally predict that some phages of Saccharibacteria are targeted by spacers encoded in Actinobacteria, a phylum that includes known hosts for episymbiotic Saccharibacteria. IMPORTANCE Here, we profiled putative phages of Saccharibacteria, which are of particular importance as Saccharibacteria influence some human oral diseases. We additionally profiled putative phages of Gracilibacteria and Absconditabacteria, two Candidate Phyla Radiation (CPR) lineages of interest given their use of an alternative genetic code. Among the phages identified in this study, some are targeted by spacers from both CPR and non-CPR bacteria and others by both bacteria that use the standard genetic code as well as bacteria that use an alternative genetic code. These findings represent new insights into possible phage replication strategies and have relevance for phage therapies that seek to manipulate microbiomes containing CPR bacteria.

I nterest in human microbiome-associated Saccharibacteria, Gracilibacteria, and Absconditabacteria (hereafter referred to as SGA) has increased, in part due to their association with disease (1).SGA are lineages within the Candidate Phyla Radiation (CPR), a monophyletic radiation within the domain Bacteria, characterized in part by consistently reduced genomes, small cell sizes, and limited metabolic capabilities (2).CPR bacteria adhere to lifestyles dependent upon other cells, either by episymbiotic attachment-whereby CPR cells attach to and obtain nutrients from a larger host bacterium-or by deriving essential compounds such as lipids (3) from the surround ing microbial community.In most cases, the hosts of CPR bacteria are unknown, but in the case of certain oral and environmental Saccharibacteria, the hosts have been experimentally established to be species of Actinobacteria (4)(5)(6)(7)(8).The attachment by Saccharibacteria can have a profound impact on the Actinobacteria host, leading to cycles of rapid host evolution and drastic changes in host physiology (4,6,9).The Saccharibacteria-host-bacteria relationship in the human oral cavity has recently been evaluated in vivo, demonstrating that Saccharibacteria reduces the inflammatory effects of periodontitis and the pathogenicity of their host Actinobacteria (10).These studies have catalyzed a paradigm shift from the previous characterization of Saccharibacteria as a likely pathogen (11,12).
In contrast to an episymbiotic lifestyle, one Saccharibacteria species and several Gracilibacteria and Absconditabacteria species are thought to live predatory lifestyles, whereby they feed on specific non-CPR bacteria (6,8,13,14).Predatory bacteria are an emerging area of research garnering interest as an antibiotic alternative with narrow, targeted effects (15,16).The predatory Saccharibacteria Ca.M. amalyticus, for instance, has been proposed as a tool to precisely consume mycolata bacteria that are recalcitrant to antibiotic and phage treatments (8).
Another intriguing feature of Gracilibacteria and Absconditabacteria is that they employ an alternative genetic code in which the canonical stop codon, TGA, is instead recognized as glycine (genetic code 25) (17)(18)(19).While the alternative genetic code of Absconditabacteria and Gracilibacteria is well-established, very little is known about the genetic code of their phages.It has become clear that phages can adopt a genetic code that is distinct from that of their hosts (20)(21)(22)(23).For example, phages that have reassigned the TAG stop codon to be translated as glutamine infect Prevotella that use the standard bacterial code (21).These alternatively coded phages encode in-frame stop codons within late-stage phage genes to likely prevent premature production of structural and lytic proteins (21,23).To enable the production of these proteins in bacteria that use the standard code, these phage genomes must utilize "code-switching" machinery.These findings raise the possibility that standard-coded phages can replicate in bacteria with alternatively coded genomes, but this question has not been comprehensively investigated to date.Here, we explored the diversity and genomic features, including the genetic codes, of phages that are predicted to infect SGA bacteria.In addition to expanding our knowledge of fundamental biology, phages of SGA bacteria could have practical importance, as phages can be used to alter the composition of microbiomes with species or strain specificity (24,25).
When comparing the environmental origin of the genomes containing CRISPR-Cas systems, there is an apparent discrepancy in system distribution between the three SGA lineages.In Saccharibacteria, as has been previously observed (30,34,35), CRISPR-Cas systems are abundant in human and other mammal microbiomes and scarce in other environments.Only 3 of the 25 CRISPR-Cas systems in Saccharibacteria from our database are from non-animal-associated environments.In contrast, 11 of the 16 Gracilibacteria CRISPR-Cas systems belong to genomes from non-animal-associated environments (Fig. 1A).
Despite the streamlined nature of CPR genomes, we also identified five genomes that encode multiple CRISPR-Cas systems.Remarkably, a Gracilibacteria genome (ALUM ROCK_MS4_BD1-5_24_33_curated) encodes three distinct cas loci, including a CRISPR array with 80 spacers.When cas genes and CRISPR arrays are taken together, this specific genome dedicates 24,788 bp of its 2,138,004 bp genome (1.16%) to CRISPR-Cas defense systems.
We examined the architecture of our complete CRISPR-Cas systems and categorized the systems, based on previous classifications (32,36), into five distinct CRISPR-Cas subtypes: type II-A, type II-C1, type III-A, type III-B, and type V-A (Fig. 1B).The distribution of the CRISPR-Cas subtypes in relation to SGA lineage is as follows: Saccharibacteria encode type II-A, type II-C1, and type III-B systems; Gracilibacteria encode type V-A and type III-A systems; Absconditabacteria encode type V-A systems.There were two exceptions to these generalizations: (i) one Saccharibacteria genome encodes a type V-A system and (ii) one Gracilibacteria genome encodes a type II-C1 system.Four of the five subtypes have been previously identified in CPR bacteria, type II-A (37), type II-C1 (34, 37), type III-A (38), and type V-A (13).To our knowledge, subtype III-B has not been previously reported in CPR bacteria.While we expected to find primarily class 2 CRISPR-Cas systems, which are typically more compact and utilize a single effector gene (32,36), the type-III systems (class 1) we identified utilize a multisubunit effector complex.The type III-A and type III-B systems we identified, for instance, contained nine identifiable cas genes together in a single operon.The targets of these subtypes are known to vary; type II and type V-A systems are thought to target double-stranded DNA, while type III-A and type III-B systems are capable of targeting both DNA and RNA (32,36).This may indicate that some Saccharibacteria are capable of targeting both DNA and RNA phages.Interestingly, we also found that Gracilibacteria and Absconditabacteria almost exclusively rely on type V CRISPR-Cas systems despite the system's rarity among bacteria (<2% of all CRISPR-Cas systems identified in bacteria) (36).Furthermore, we compared the system architecture within each subtype based on average amino acid identity (AAI) of component proteins and noted a mostly uniform architecture within each subtype (see Fig. S1 to S3 at https://doi.org/10.5281/zenodo.8422333).Among the systems, we found 10 variants of the canonical CRISPR-Cas subtypes that contained unannotated open reading frames (ORFs) in the interior of a cas operon.These variants may represent novel subtypes within the broader system classification.One of these variants appears to be a type II-A system (see Fig. S1 at https://doi.org/10.5281/zenodo.8422333),and nine appear to be type V-A systems (see Fig. S3 at https://doi.org/10.5281/zenodo.8422333).The functions of these unannotated ORFs and whether they participate in concert with their respective CRISPR-Cas systems remain topics of future study.
To evaluate the novelty of the annotated genes within the complete CRISPR-Cas systems, we compared each Cas amino acid sequence to NCBI's nr database (see Fig. S4 at https://doi.org/10.5281/zenodo.8422333).While most Saccharibacteria and Abscondi tabacteria Cas proteins are well-represented in Genbank, there were a number of our Gracilibacteria Cas proteins with less than 50% AAI to known sequences.One such protein is a Cas9 that only displays a 34% AAI to the best match in Genbank.

Putative SGA-infecting phages
To identify candidate phages that potentially infect the SGA bacteria of our database, we extracted 1,296 non-identical spacers from our quality-controlled, high-confidence arrays from the complete genome data set (147 arrays encoded in 119 scaffolds; see Table S3 at https://doi.org/10.5281/zenodo.8422333).We also searched metagenomic reads for variant sequences that are not reflected in the consensus metagenomic assembly (see Materials and Methods).We recovered an additional 344 unique spacers from 10 SGA genomes.
We characterized our candidate SGA-infecting phages, and their hosts by generat ing a protein-sharing network in which the proteomes of phages and SGA hosts are clustered based on similarity (Fig. 2B).The proteomes of the Saccharibacteria phages and Absconditabacteria phages predicted in this study cluster with those of their predicted host bacteria and those of phages previously identified to infect the same host bacteria, strongly supporting host inference based on our spacer targeting analyses.When clustered in a separate network with non-SGA reference phages, the predicted SGA phages from this study tended to cluster apart from the non-SGA reference phages (see Fig. S5 at https://doi.org/10.5281/zenodo.8422333).This includes several phages newly predicted to infect Saccharibacteria, which form distinct clusters apart from previously identified Saccharibacteria phages or non-SGA reference phages and are thus inferred to be novel lineages.Within both networks, our putative Gracilibacteria-infecting phages did not form a singular cluster.Within the network between SGA bacteria and their predicted phages, three putative Gracilibacteria phages predicted by spacer-matching cluster with Gracilibacteria prophages or Absconditabacteria phages (Fig. 2B).Within the protein-sharing network containing non-SGA reference phages, a number of putative Gracilibacteria phages place within a sparse network that includes reference phages predicted to infect bacteria from either the Bacteroidota or the Firmicutes phylum (see Fig. S5 at https://doi.org/10.5281/zenodo.8422333).

Diverse coding strategies among predicted Absconditabacteria phages and Gracilibacteria phages
To investigate the genetic codes of our candidate phages, we predicted ORFs for each phage genome larger than 20 kb in both the alternative code 25 (the genetic code of Gracilibacteria and Absconditabacteria) and the standard code 11.Using these predictions, we calculated the coding density (a portion of the genome dedicated to protein-coding genes) in each genetic code.Differences in coding densities between code 25 and code 11 were negligible for putative Saccharibacteria phages, indicating that they share genetic code 11 with their predicted hosts (Fig. 3A; see also Table S7 at https://doi.org/10.5281/zenodo.8422333).Contrary to our expectations, 34 of the 38 putative Gracilibacteria-infecting phages larger than 20 kb displayed small changes in coding density between the two genetic codes, indicating that they are not clearly alternatively coded (Fig. 3B; see also To further assess the genetic code of the putative Gracilibacteria phages and Absconditabacteria phages, we annotated and visualized the predicted ORFs of each phage in both code 11 and code 25.Examination of 40 putative Gracilibacteria and Absconditabacteria phage genomes that were not clearly alternatively coded showed that they had very similar gene annotations and genome architectures in both genetic codes (Fig. 4A, see also Table S8 at https://doi.org/10.5281/zenodo.8422333).Further more, their genes displayed an absence of in-frame TGA codons and the presence of multiple, different stop codons in close proximity at gene termini (Fig. 4A; see also Table S8 at https://doi.org/10.5281/zenodo.8422333).These phage genomes are therefore likely compatible with both code 11 and code 25.They contrast with the clearly alternatively coded Gracilibacteria phages and Absconditabacteria phages, which contained high densities of in-frame TGA codons and displayed almost no gene annotations in code 11 (Fig. 4B).
Notably, all Gracilibacteria prophages and Absconditabacteria prophages contained ORFs with high densities of in-frame TGA codons.The classification of these genome regions as prophage rather than novel portions of bacteria genomes is supported by the identification of canonical phage genes that produce a portal protein, tail-related protein, terminase, or integrase (Fig. 4C; see also Table S10  zenodo.8422333).We conclude that the Gracilibacteria prophages and Absconditabacte ria prophages are likely exclusively compatible with code 25.
As it was surprising to find code 11-compatible phages that were targeted by the code-25 Gracilibacteria or Absconditabacteria, we sought to further verify that these phages infect their presumed alternatively coded hosts.Three putative code 11-com patible Gracilibacteria phages and Absconditabacteria phages in the protein-sharing network (Fig. 2B) cluster with clearly alternatively coded Gracilibacteria prophages or Absconditabacteria phages (see Fig. S7 at https://doi.org/10.5281/zenodo.8422333).Additionally, we predicted the taxonomic affiliation of each gene within our iden tified phages.Most putative Absconditabacteria phages contained genes with taxo nomic affiliations matching their host (see Fig. S6; Table S10 at https://doi.org/10.5281/zenodo.8422333),including one possible Absconditabacteria phage compatible with code 11.Five of the 57 putative Gracilibacteria phages contained genes predicted to originate from Gracilibacteria, including one predicted Gracilibacteria phage that was compatible with code 11 (see Fig. S6; Table S10 at https://doi.org/10.5281/zenodo.8422333).These two analyses, in tandem with the spacer-phage matching, strongly suggest that there are, indeed, some Gracilibacteria phages and Absconditabacteria phages that are compatible with code 11.
As many of these code 11-compatible phages did not cluster coherently within the protein-sharing networks (Fig. 2B; see also Fig. S5 and S7 at https://doi.org/10.5281/zenodo.8422333)and did not contain any genes predicted to originate from their predicted host bacteria, we considered the possibility that some of the Gracilibacteria and Absconditabacteria spacer-to-phage hits might be artifactual matches within the large phage databases.Such spurious matches would be most probable if the spacer length is unusually short.Thus, to constrain this probability, we examined the median spacer length that matched with predicted code 11-compatible Gracilibacteria phages and Absconditabacteria phages.We found that these spacers, at 26 bp, were only slightly smaller than those that matched Saccharibacteria phages (30 bp) and those that matched clearly alternatively coded Absconditabacteria phages (28 bp), for which predicted phages generally clustered as expected (see Tables S5 and S6 at https:// doi.org/10.5281/zenodo.8422333).Additionally, when compared to spacers extracted from across the domain Bacteria, a spacer length of 26 bp is within the range of a typical spacer length (29)

Phages that interact with SGA and non-SGA bacteria
We next explored the host range of our putative SGA-infecting phages by comp'aring them to a large spacer database from a wide diversity of bacterial genomes (see Materials and Methods).As members of Actinobacteria are known hosts of Saccharibac teria, we augmented this database with spacers from diverse Actinobacteria genomes (see Materials and Methods).In comparing these spacers to our predicted SGA phages, we identified 23 probable SGA phages also targeted by spacers from non-SGA bacteria (see Tables S11 and S12 at https://doi.org/10.5281/zenodo.8422333).We considered that these spacers may have been acquired in either the SGA bacteria or in the non-SGA bacteria by horizontal transfer.In comparing the spacer inventories of the co-targeting bacteria, we did not find evidence that they shared identical spacers, likely ruling out the possibility that these spacer matches can be attributed to horizontal transfer.These 23 co-targeted phages (i.e., phages targeted by both SGA and non-SGA bacteria) included nine predicted to infect Saccharibacteria, five predicted to infect Absconditabacteria, and nine predicted to infect Gracilibacteria.Seven of the nine putative Saccharibacteria phages were co-targeted by bacteria from the phylum Actinobacteria, including Corynebacterium sp.NML130628, Actinomyces oris, Actinomyces.sp.HMSC075C01, Actinomyces naeslundii, and Actinomyces viscosus.These species are particularly notable as a majority of cultured Saccharibacteria attach to host bacteria from the Actinomyces genus (5,42).When placed in the context of our two protein-shar ing networks, many putative Saccharibacteria phages co-targeted by Actinobacteria are situated within a dense cluster of previously identified Saccharibacteria infecting phages (representative example in Fig. 5A and 2B; and see also Fig. S5 at https://doi.org/10.5281/zenodo.8422333).
Similarly, we also examined putative Absconditabacteria phages and Gracilibacte ria phages that matched spacers from non-SGA bacteria.Three of the five putative Absconditabacteria phages matched spacers from arrays in the genomes of Firmicutes.One such phage is situated in the primary Absconditabacteria cluster within the protein-sharing network (Fig. 5B).Among the nine putative Gracilibacteria phages, five have matches to spacers from arrays within Bacteroidetes species and two matched spacers from arrays within Actinobacteria species.Notably, one predicted Gracilibacteria phage was targeted by multiple spacers from Enterococcus faecalis strains and was linked to the Absconditabacteria phage cluster in the protein-sharing network (Fig. 5C).By examining the genetic code of the nine candidate Gracilibacteria phages that matched spacers from non-SGA bacteria, all nine had similar genome architectures and gene annotations in both code 11 and code 25 (representative example in Fig. 5C; see also Table S8 at https://doi.org/10.5281/zenodo.8422333).Thus, if they indeed infect Gracilibacteria and another standard-coded bacterium, they are likely capable of producing viable gene products in both their alternatively coded Gracilibacteria host and their standard-coded non-SGA host.Two of the five putative Absconditabacteria phages that matched spacers from non-SGA bacteria, however, contained ORFs dense with in-frame TGA codons and clearly use code 25 (representative example in Fig. 5B).

DISCUSSION
Here, we examined the genetic code of predicted SGA phages and observed that some share the genetic code of their putative hosts.This analysis required us to link phages to host bacteria, which we did primarily via CRISPR-Cas spacer targeting.This has been done many times previously (26)(27)(28)(29) and is believed to be generally robust given that the spacers in a CRISPR locus of a host bacterial genome derive directly from the genomes of phages that infect them (28,43,44).A recent study, however, proposed several alternative methods by which a spacer targeting a phage may be acquired in non-viable bacterial host cells: (i) by uptake or entry of phage DNA into physically proximal bacterial cells; (ii) horizontal gene transfer of spacer arrays into non-viable bacterial cells, and (iii) the host-range of the phage is altered after spacer acquisition (45).
To further support our CRISPR spacer-based links, we performed a number of additional analyses.We observed highly similar phage and putative bacterial host genes.Phages are well known to acquire genes from their hosts, so the most likely explanation is that these phage genes are derived directly from the genome of their host bacte rium (46)(47)(48).Furthermore, in multiple protein-sharing networks, we observed strong clustering between bacteria and many of their predicted phages identified by spacermatching.While it is possible that these linkages may be the result of horizontal gene transfer, we observed that many of our predicted SGA phages cluster exclusively with other identified SGA phages when placed in a protein-sharing network with reference phages.Finally, given that only a tiny subset of microbial community members use an alternative genetic code (17,19), our linkage of alternatively coded phages to alterna tively coded host bacterial groups using spacer-phage matching, as has been shown previously (20,23), suggests that spacer-phage matches are very likely not coincidental.Thus, a variety of methods reinforce our confidence in the spacer-matching approach to identify hosts of phages.It is important to note, however, that our host-prediction methods do not fully indicate the capacity for identified phages to replicate in predic ted host cells.To confirm the predicted host range of phages identified in this study, isolation, and experimental validation are still necessary.
In addition to identifying alternatively coded phages targeted by alternatively coded Gracilibacteria and Absconditabacteria, we were surprised to identify phages without in-frame TGA usage targeted by either Gracilibacteria or Absconditabacteria.These phages appear compatible with the standard code 11.The phenomenon wherein phages utilize a genetic code that is different than that of their host bacteria is not without precedent (20,21,23).For example, some Lak phages, despite infecting standard-coded bacteria of the genus Prevotella, have alternatively coded genomes in which the canonical stop codon TAG is reassigned to glutamine (21).Alternatively coded phages that infect standard coded hosts may use an alternative code in part to prevent premature production of proteins that are important in late-stage infection (e.g., in-frame TGA codons within ORFs annotated as structural or lysis-related proteins) (23).To enable the translation code shift needed to produce these proteins, phage genomes often encode a suppressor tRNA that recognizes a canonical stop codon as a sense codon and incorporates a specific amino acid (20-23, 49, 50).Some alternatively coded phage genomes also encode tRNA synthetases that can charge suppressor tRNAs with amino acids (23) and release factors that terminate translation at only two of the three canonical stop codons (20,21,23).For example, some alternative code 4 phage genomes, in which UGA is interpreted as tryptophan, encode release factor 1 (RF1), which only recognizes UAA and UAG, a suppressor tRNA that decodes the UGA stop codon as tryptophan, and a tryptophanyl tRNA-synthetase which charges the suppressor tRNA with tryptophan (23).In combination, these previous observations underline the conclusion that phage genomes that use fewer stop codons than their host genomes require specific adaptations in the form of code shift machinery.
The situation with code 11-compatible phages, such as the predicted Gracilibacteria phages and Absconditabacteria phages we identify in this study, is different because their lack of in-frame canonical stop codons presents no issue for translation.Where TGA is used as a stop codon, it is followed by alternative stop codons in close proximity to terminate translation.In general, this backup stop codon strategy is not uncommon in bacterial genomes and likely evolved to reduce the impact of accidental stop codon read-through (51,52).Thus, phages that employ three stop codons should generally produce viable gene products even if the bacterial translation system only recognizes two stop codons.Unlike alternatively coded phages that infect standard-coded host bacteria, phages that use the standard genetic code generally do not need to alter the translation environment of their hosts.
An intriguing finding of this study is that all identified integrated phage sequences (prophages) in Gracilibacteria and Absconditabacteria genomes were clearly alterna tively coded (contained ORFs dense with in-frame TGA codons).This observation suggests that there is an advantage for a prophage to share the alternative genetic code of its host.This contrasts with the finding that some prophages adopt an alter native genetic code yet reside in bacterial genomes that use the standard bacterial code (23).One potential explanation may be that, akin to codon optimization, higher levels of the alternative code tRNAs are expressed within alternatively coded host bacteria compared to canonical tRNAs, allowing phages with dense in-frame TGAs a more efficient translation of their gene products.The codon optimization hypothesis is supported by the high usage of TGA as a glycine codon within code 25 host bacteria (17), that the use of rare codons can lead to various translation errors (53), and that competition over rare tRNAs can incur lower expression of gene products (54,55).
If code-11 compatible phages can, in fact, proliferate in Gracilibacteria and Abscondi tabacteria, there are two possible explanations for why the predicted code 11-compati ble Gracilibacteria phages and Absconditabacteria phages do not need to incorporate in-frame TGA codons.First, these standard-coded phages may have recently evolved to infect alternatively coded hosts.However, if this were true, we would expect such phages to be rare.As standard code compatibility is apparently not uncommon among Gracilibacteria and Absconditabacteria phages (Fig. 3 and 4), we infer that there is an advantage for these phages to retain their standard code.Second, use of the standard code may broaden their host range, a possibility that is supported by our finding that some standard code compatible Gracilibacteria phages and Absconditabacteria phages are targeted by spacers encoded within standard code non-SGA bacteria.Phages capable of replicating in hosts across phyla have been reported in a previous experi mental study by Malki et al. (56), but have not been fully characterized or definitively confirmed.
Some predicted Saccharibacteria phages are also targeted by Actinobacteria spacers.For cases where Actinobacteria are hosts for episymbiotic Saccharibacteria, these phages may infect both partners.CPR bacteria thus may serve as a decoy to protect their larger bacterial symbionts from phage infection, as has been suggested previously (31,57).
The phages reported here expand known phage diversity.Our results suggest that some of them may infect both standard and alternatively coded host bacteria, and we deduce that there is no fundamental barrier to this phenomenon.Given interest in the use of phages as therapeutics, this finding raises the possibility of producing phages to infect SGA bacteria in standard code bacteria, which may be substantially easier to cultivate than SGA bacteria themselves.Furthermore, this may provide a path by which SGA phages can be generated for morphological characterization.

Absconditabacteria, Saccharibacteria, and Gracilibacteria database prepara tion
We began with a database of 861 CPR genomes derived from a previous publication (30) that contained bacteria from three different lineages: Absconditabacteria, Sacchari bacteria, and Gracilibacteria.We de-replicated the database using dRep (58) at 99% ANI clustering and default alignment fraction (10%).For each genome, we predicted protein sequences using the "single" mode of Prodigal (59).For Saccharibacteria genomes, genes were predicted in genetic code 11.As Gracilibacteria and Absconditabacteria adhere to a non-standard genetic code, code 25 (17,18), Gracilibacteria and Absconditabacteria genes were predicted in genetic code 25.Gene taxonomic predictions were performed using USEARCH (60) with the UniRef100 (61) database.

CRISPR-Cas array prediction and curation
To search for CRISPR arrays in the SGA genome database, we ran CRISPRCas Finder (66) (CCF) on all genomes.We then selected scaffolds containing CRISPR arrays designated as evidence level 3 or 4-arrays deemed highly likely candidates by CCF-for further curation.
We manually curated the scaffolds containing high evidence-level CRISPR arrays to ensure they did not originate from misbinning.Our manual curation considered three complementary metrics: we considered a scaffold to be from SGA bacteria if (i) the majority of predicted proteins appeared to have the closest taxonomic hits to SGA bacteria, (ii) if individual, phylogenetically informative proteins appeared to have the closet taxonomic hits to SGA bacteria, and (iii) if scaffolds displayed high coding density in code 25 relative to code 11.

Identification of complete CRISPR-Cas systems
To identify complete CRISPR-Cas systems present in our database, among the genomes containing high-confidence, manually curated CRISPR arrays, we searched for Cas proteins using the full suite of TIGRFAM HMMs (67) (hmmsearch, model-specific noise cutoff).We additionally manually curated all scaffolds containing cas gene annotations to ensure they were from SGA bacteria using the metrics described above.We defined complete CRISPR-Cas systems based on previously published descriptions of various CRISPR-Cas systems (32,36).For each array, (i) if cas9, csn2, cas1, and cas2 genes were also encoded within the same genome, we categorized it as a type II-A system; (ii) if cas9, cas1, and cas2 genes and no csn2 genes were encoded within the same genome, we categorized it as a type II-C1 system; (iii) if cpf1, cas1, cas4, and cas2 genes were encoded in the same genome, we categorized it as a type V-A system; (iv) if cas10, cas7, cas5, and csm2 genes were encoded in the same genome, we considered it a complete type III-A system; and (v) if cas10, cas7, cas5, and cmr5 genes were encoded in the same genome, we considered it a complete type III-B system.Complete CRISPR-Cas systems were visualized using gggenes (https://github.com/wilkox/gggenes).For CRISPR-Cas systems containing all Cas proteins on the same scaffold, AAI similarities and cas operon architectures were visualized using Clinker (68) at default parameters.
To assess the novelty of identified Cas proteins, we compared the Cas proteins within each complete CRISPR-Cas system to the NCBI nr database using BLASTp (evalue ≥ 1e−3, coverage ≥ 0.75) and retained the best hit per gene product based on percent identity.

Compiling a Saccharibacteria, Absconditabacteria, and Gracilibacteria spacer database
To compile a spacer database, we extracted all spacers from high evidence-level arrays on scaffolds from our redundant SGA database (861 genomes) that passed our manual curation step.In addition, we ran a previously described spacer array expansion step to gather additional spacers from variant sequences that are not reflected in the consen sus metagenomic assembly (38).Briefly, if available, we gathered the metagenomic reads originally used to assemble each genome.We then reassembled the reads using MEGAHIT at default parameters (69), mapped reads back to assembled contigs using Bowtie2 at default parameters (70), and predicted proteins using the "meta" flag of Prodigal.We compared the newly assembled scaffolds to the original, publicly available scaffolds.If a newly assembled scaffold matched an original, manually curated scaffold above the thresholds of 95% coverage and 90% ANI, we predicted CRISPR arrays in the newly assembled scaffold using CCF, extracted spacers from high-evidence level arrays, and added extracted spacers to our spacer database.We de-replicated the spacer database at 100% ANI using USEARCH.

Identification of putative SGA phages
To search for phages putatively infecting SGA bacteria, we compared each unique spacer in our spacer database to two phage databases: IMG/VRv3 (26) and GVD Human Gut Virome (39).BLASTn parameters were set to at least 95% coverage of the spacer and one or less allowed mismatch with the specific flags: -task "blastn-short" -word_size 7 -gapopen 10 -gapextend 2 -penalty −1.Prophages within the CPR genomes were predicted using VIBRANT at default parameters (71).
To identify the likely genetic code of putative SGA-infecting phages larger than 20 kb, we used the Prodigal "single" flag to calculate the coding density of each phage in genetic codes 11 and 25.Furthermore, genome diagrams of Gracilibacteria and Absconditabacteria prophages and phages greater than 20 kb were generated using the Prodigal ORF predictions in code 11 and code 25.In-frame TGA codons were additionally located within the ORF predictions.The genome diagrams of the phages in both code 11 and code 25 were visualized using gggenes.
To annotate phage proteins, we used the Prodigal gene predictions in genetic code 11 for putative Saccharibacteria phages and the Prodigal gene predictions in genetic code 25 for putative Gracilibacteria phages and Absconditabacteria phages.We annotated predicted proteins using pVOG (72) HMM profiles with hmmsearch from HMMER3.Gene taxonomic predictions were performed using DIAMOND (73) with the UniRef100 database.
To compare the putative SGA-infecting phages to reference phages and their predicted host bacteria, we constructed two protein-sharing networks using vContact2 (74) (--rel-mode Diamond, -vcs-mode ClusterONE, and --pcs-mode MCL).One network linked the proteomes of the putative SGA phages (including prophages) identified in this study, previously identified SGA phages (23,26,40,41), and SGA bacteria.The second network linked the SGA phages identified in this study, the previously identified SGA phages, and non-SGA reference phages (--db "ProkaryoticViralRefSeq201-Merged").The resulting protein-sharing networks and their associated metadata were visualized in Cytoscape (75).

Host range of putative SGA-infecting phages
To examine the host range of putative SGA-infecting phages, we compared spacers from four comprehensive databases (41,66,76,77) composed of spacers from across the domain Bacteria to the predicted SGA-infecting phages.As before, the BLASTn parame ters were set to at least 95% coverage of the spacer and one or less allowed mismatch with the specific flags: -task "blastn-short" -word_size 7 -gapopen 10 -gapextend 2 -penalty −1.
We additionally constructed a database for Actinobacteria, some of which are known hosts of Saccharibacteria, by sampling one genome per species-level group from GTDB (release 95, August 2020).Using a similar workflow as with the SGA database, we searched these genomes for high evidence-level arrays with CCF, extracted spacers, and compared them to the putative CPR-infecting phages with the above parameters.Visualization of spacer hits was performed using DNA Features Viewer (78).
To assess if the co-targeting of phages by SGA bacteria and non-SGA bacteria occurred due to the horizontal transfer of CRISPR spacers, we compared the spacer inventories of SGA scaffolds to the comprehensive non-SGA spacer database.For this comparison, we used BLASTn at the parameters: 100% identity and 100% coverage.

FIG 2
FIG 2 SGA phage genome size and protein-sharing network analysis.(A) Size and completeness of the putative SGA phages.(B) Protein-sharing network of the putative SGA phages and SGA bacterial genomes, where each node represents a phage or bacterial genome.Nodes are clustered together based on protein similarity and a number of shared proteins.Previously identified SGA phages (23, 40, 41) were included in the network.Nodes are colored based on the predicted host of the phages or the SGA genome taxonomy.Co-targeted phages indicate those targeted by spacers from CRISPR-Cas arrays of both SGA and non-SGA bacteria.

FIG 3
FIG 3 Phage genetic code analysis.Histogram of phages displaying the change in coding density between code-25 and code-11 predictions for predicted phages of (A) Saccharibacteria, (B) Gracilibacteria, and (C) Absconditabacteria.A larger x-value indicates a higher likelihood of adhering to genetic code 25, while an x-value near zero indicates the likely usage of genetic code 11.Only phages larger than 20 kb were included in this analysis.

FIG 5
FIG 5 Representative phages targeted by both SGA and non-SGA bacteria.The leftmost circular windows display a cutout of the protein-sharing network in Fig. 2, with colors listed in the Fig. 2 legend and the co-targeted phage spotlighted in yellow.The rightmost panels display a genome diagram of the highlighted co-targeted phage.Overlaid are gene taxonomic predictions and the location of spacer matches.(A) Putative Saccharibacteria phage co-targeted by A. oris.(B) Putative Absconditabacteria phage co-targeted by Bacillus sp.H1a.(C) Putative Gracilibacteria phage co-targeted by Enterococcus faecalis.
(see Fig. S8 at https://doi.org/10.5281/zenodo.8422333).In-frame TGA codon usage among putative Gracilibacteria phages and prophages.(A) Genome diagrams in code 25 and code 11 of a code 11-compatible phage predicted to infect Gracilibacteria.In-frame TGA codons are marked by a black line.(B) Genome diagrams in code 25 and code 11 of a clearly alternatively coded (code 25) phage predicted to infect Gracilibacteria.(C) Genome diagrams in code 25 and code 11 of a Gracilibacteria prophage.Predicated prophage regions and genome regions are colored pink and green, respectively.