Identification of hidden N4-like viruses and their interactions with hosts

ABSTRACT The N4-like viruses, which were assigned to the novel viral family Schitoviridae in 2021, belong to a podoviral-like viral lineage and possess conserved genomic characteristics and a unique replication mechanism. Despite their significance, our understanding of N4-like viruses is primarily based on viral isolates. To address this knowledge gap, this study has established a comprehensive N4-like viral data set comprising 342 high-quality N4-like viruses/proviruses (144 viral isolates, 158 uncultured viruses, and 40 integrated N4-like proviruses). These viruses were classified into 97 subfamilies (89 of which are newly identified), 148 genera (100 of which are newly identified), and 253 species (177 of which are newly identified). The study reveals that N4-like viruses inhabit the polar region, oligotrophic open oceans, and the human gut, where they infect various bacterial lineages, such as Alpha/Beta/Gamma/Epsilon-proteobacteria in the Proteobacteria phylum. Although N4-like viral endogenization appears to be prevalent in Proteobacteria, it has also been observed in Firmicutes. Additionally, the phylogenetic analysis has identified evolutionary divergence within the hallmark genes of N4-like viruses, indicating a complex origin of the different conserved parts of viral genomes. Moreover, 1,101 putative auxiliary metabolic genes (AMGs) were identified in the N4-like viral pan-proteome, which mainly participate in nucleotide and cofactor/vitamin metabolism. Of these AMGs, 27 were found to be associated with virulence, suggesting their potential involvement in the spread of bacterial pathogenicity.
IMPORTANCE The findings of this study are significant, as N4-like viruses represent a unique viral lineage with a distinct replication mechanism and a conserved core genome. This work has resulted in a comprehensive global map of the entire N4-like viral lineage, including information on their distribution in different biomes, evolutionary divergence, genomic diversity, and the potential for viral-mediated host metabolic reprogramming. As such, this work significantly contributes to our understanding of the ecological function and viral-host interactions of bacteriophages.

#Summary. The authors have made many updates in this revision, including efforts to address the major comments previously raised: removing the HGT analysis, incorporating a search for prophages in bacterial genome assemblies, and adding a CRISPR-spacer-based analysis to predict hosts. The authors have also improved existing figures, added helpful new figures, and provided useful data for the reader in a public GitHub repository. These revisions have strengthened the manuscript and added value for readers, and it is great to see this diversity of N4-like phages and to learn about their host associations. However, I still have major concerns, including some carried over from the initial review; these are highlighted below. The comments are long, but I sincerely hope they will be viewed as helpful by the authors.
#Major Comments
##Prophage analyses - missing cases: Exploration of the authors' results suggests that there are N4-like phages in GenBank that have been missed in this re-analysis (e.g. Acinetobacter contigs APOR01000031.1, CP110465.1, and JAIGUQ010000009, all of which appear to contain all 7 N4-like hallmark genes based on BLAST searches, among other cases).
• All of the aforementioned examples represent bacterial contigs that are <100kb, the cut-off used to define which contigs to evaluate in searching the database of bacterial assemblies (L435-L438). Though the intent to be conservative and not introduce false positives is good, this cut-off is problematic because it precludes discovery of phages that exist as extrachromosomal elements (e.g. as plasmids), and because assembly artifacts (e.g. from coverage or repeats) can lead to integrated prophages being assembled as distinct contigs. Given that the N4-likes have an average genome size of ~70kb, the 100kb cut-off would be expected to result in such losses.
• Further, though the first 2 examples mentioned above have nearly identical vRNAP amino acid sequences, only one of the pair (APOR01000031.1) is listed in the initial Table S5 vRNAP-based HMM search against GenBank, and it is not clear why the other was not also detected. This raises the question of whether additional such cases are also missing.
• Also, one of these N4-likes (CP110465) is described in GenBank as a plasmid (pRBH2-3), and there are many additional apparently N4-like Acinetobacter baumannii phages that may be plasmids. If this is the case, then this would be an interesting additional life-history strategy for this group that may not have previously been described (?) and would thus be worth deeper exploration for other examples. Many of these missing N4-likes are also from bacterial strains from a published study focused on carbapenem-resistant Acinetobacter baumannii (CRABS, https://doi.org/10.1016/j.csbj.2021.12.038), which are a medically important group. Together with the finding of integrated versions, this suggests that the use of N4-like phages in phage therapeutics should be carefully considered.
• Searching with proteins from these examples against both IMG/VR and UniProtKB (with jackhmmer) also identifies additional cases of likely N4-likes that are not represented in Tables S1 or S5. For example, searching with the TerL from CP110465 (UZG64161.1) against UniProtKB (all sequences also in GenBank) identifies an Alteromonas contig, CP031010.1 (among others), that also contains a nearby vRNAP. This vRNAP is not in Table S5 and hits predominantly to viruses in the Schitoviridae. This suggests that additional N4-like diversity that would be detected by the 7 hallmark genes identified by the authors is being missed.
• Finally, it seems that the prophages in IMG/VR (UViG source = isolate) were not included in the analyses along with the UViGs; if the authors plan to repeat or expand any analyses, it would likely be beneficial to take advantage of the updated IMG/VR v4 with the geNomad phage calls.
• An exploratory look shows that many phage sequences identified by the authors here as N4-like currently have no family designation in IMG/VR v4; this highlights the contribution and value of the authors' work.
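The effect of the contig-length cut-off discussed in the first bullet can be illustrated with a small sketch. This is not the authors' pipeline: the contig names and lengths are made up, and the "rescue" window (90% of a typical genome size) is an assumption chosen only to show how contigs near ~70kb fall below a 100kb threshold.

```python
from typing import Dict, List

MIN_CONTIG_BP = 100_000   # cut-off described in the manuscript (L435-L438)
TYPICAL_N4_BP = 70_000    # approximate average N4-like genome size noted above


def partition_contigs(
    lengths: Dict[str, int], rescue_fraction: float = 0.9
) -> Dict[str, List[str]]:
    """Split contigs into those screened under the 100kb cut-off, those a
    relaxed (hypothetical) cut-off would additionally screen, and the rest."""
    rescue_floor = rescue_fraction * TYPICAL_N4_BP
    groups: Dict[str, List[str]] = {"screened": [], "rescued": [], "skipped": []}
    for name, length_bp in lengths.items():
        if length_bp >= MIN_CONTIG_BP:
            groups["screened"].append(name)
        elif length_bp >= rescue_floor:
            # Phage-sized contig: plausible plasmid-like or separately assembled prophage
            groups["rescued"].append(name)
        else:
            groups["skipped"].append(name)
    return groups


# Hypothetical contig lengths, for illustration only
example = {"contig_a": 250_000, "contig_b": 72_000, "contig_c": 30_000}
print(partition_contigs(example))
```

In this toy input, only `contig_b` is phage-sized, and it is exactly the one that the 100kb cut-off would never evaluate. Any contigs rescued this way would still need the hallmark-gene and quality checks applied elsewhere in the study.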
##Identification and characterization of AMGs: Numerous AMGs are identified in this work, and the authors have now grouped these as Class I or II, as suggested. However, it is not clear that these have been thoroughly curated to ensure they do not represent phage-adjacent bacterial genes, or simply phage genes involved primarily in processes directly relating to phage replication rather than shaping of host metabolism. For example, the DRAM-v manuscript highlights that, in identification of AMGs, "DRAM-v also flags users to the probability of a gene being involved in viral benefit rather than enhancing host metabolic function (e.g. certain peptidases and CAZymes are used for viral host cell entry (Figure 6B)." https://doi.org/10.1093/nar/gkaa621. Though CheckV was applied, it was not mentioned whether VirSorter2 or another phage finder was used in addition (prior to CheckV), which would likely be necessary in the case of >100kb bacterial contigs. If this is added in subsequent analyses, I suggest considering geNomad (https://github.com/apcamargo/genomad).
#Additional Comments
##Prophage analyses - likely Lactobacillales gram-positive contaminant: As mentioned in the initial review, the finding of an N4-like prophage in a gram-positive is highly unexpected (it being the only case of a potential gram-positive bacterial host). Unless additional rigorous steps are taken to rule out the potential for contamination, for example requesting and growing the strain and inducing out the predicted prophage, it is important that the authors be far more cautious in their representation and lean towards representing this as a likely contaminant. Given the high identity of the hits to E. faecalis N4-phages, this likely represents a contamination and assembly artifact.
##Host prediction: Provide a statement in the methods about the number of additional host predictions achieved using the additional CRISPR search.
##Category names: The analyses throughout refer to a category of "Provirus" and a category of "HQ-UViG" (e.g. Fig. 3C, Table 1), on the basis of the pipeline used to identify them. However, this can be confusing, as IMG/VR also includes prophages identified in bacterial genomes (from GenBank), not just metagenomes. Thus IMG/VR hits would be expected to also include sequences in the "Provirus" category and to overlap somewhat with hits identified by the direct searches in GenBank. In addition, in some places the viruses identified in the second-round GenBank search are also referred to as UViGs (in cases where the number 158 is used, e.g. L117), a term that otherwise seems to be reserved for the IMG/VR pipeline. Also, it is difficult to follow the numbers, as sometimes the sets referred to are 144, 10, 40, 148; other times 158 is referenced, other times 611, other times 154.
##Clarity: The manuscript would benefit from review of spelling, grammar, and clarity of phrasing.
The legend states (L847) that the outer ring is the ecosystem for the HQ UViG, but some leaves that are identified by red-colored UViG branches do not have corresponding ecosystem assignments. Likewise, the legend describes the second ring in from the outside (L848) as referring to experimental host lineages of N4-like viral isolates, but leaves identified as UViGs (red branches) also have colorings. The legend also states (L850) that there is a ring "V", but this is not shown. "Associated genomes" should probably be "contigs" for L851?
Figure Panel A: In the legend in the image, the line for "provirus" is missing.
Figure Panel B: A number of aspects of this figure would benefit from clarification. Please provide the total number of contigs represented by each category of 1-7 cores and provide further interpretation of the figure in the legend or body. For example, it is unclear why there is not a consistent decrease in the number of contigs with increasing number of cores in each ecosystem; in other words, one would expect that a more relaxed criterion (1 core) would yield a greater number of hits than a stricter criterion (7 cores). Also, it is not clear why the average contig size for each ecosystem (e.g. ~100kbp for Aquatic, apparently considering all numbers of cores) appears to be larger than the maximum for the 7 cores across ecosystems (~70kbp). It is not clear what datasets were used or what the interpretation of these calculations is.
L135-L140: It is also not clear here which genomes are being used. The authors refer to HQ-UViGs but then state that the marked differences in size are due to levels of completeness in genomic assemblies. If only contigs with all 7 cores are being considered, then this raises the question of whether longer assemblies reflect poor trimming of host sequences (with potential downstream consequences for AMG analyses). This point requires clarification.
##Fig. 3 5 In analyses relating to the multigene phylogeny in relation to other gene phylogenies, was gene length controlled for (to prevent longer genes from dominating the signal)? Panel B/C: The quantitative basis for defining the red boxes is not clear and I suggest omitting them. Clarify that the "Top ten subfamilies" are the top ten most abundant (presumably?); however, it is not clear why all subfamilies would not have been labeled.
##Fig. 6 Though the idea of including a figure to facilitate interpretation is a nice one, it is not clear how to interpret the cartoon. For example, the arrows from the label for portal point to multiple other points in the cartoon, but it is not clear what they are pointing to or why (e.g. are all of the circles along the tail meant to represent the major tail protein? If so, they should all be the same size and all arrows should point to one exemplar, not to different ones). The plot itself is also not well described; please describe what all the points represent, how many there are for each comparison, what their colors mean, whether they relate to the cartoon and if so how, and what the varying color intensity is meant to represent. The large and small subunit terminase appear to be shown as part of the virion; however, this is incorrect, as they are not incorporated into the virion but are involved in packaging (e.g. https://viralzone.expasy.org/3944?outline=all_by_species).
##Fig. 7 Piecharts: The slice for "Others" is colored green as a Class I AMG; however, the expanded view shows that it contains Class I and Class II AMGs. If this is the case, then a third color should be used for "other". Also, "Cell circle" should be "Cell cycle", here and throughout (e.g. Table 1, L265, L266). Cell cartoon: The intent to map AMGs to the cell to highlight processes is nice; however, it is a bit challenging to interpret some aspects of this figure, as there is a lot of information, not all the components are labeled, and for some of the labels it is not known whether the cartoon represents an example of a possible function (e.g. Nutritional limitation SF64). Streamlining some pieces here would be helpful.
##Fig. S1-S6 - Recommend combining all of these genome diagrams into a single continuous multi-page figure that includes all 342 identified viruses ordered onto a tree (perhaps the concatenated hallmark tree). This will only be readily viewable by zooming in quite a bit and scrolling up and down, but anyone interested in this level of detail will appreciate having everything on one page. Also, recommend re-coloring the genes such that all 7 core genes have a highly saturated bold color as well as the bold outline (rather than all being white and only marked by an outline), as these provide common anchor points to facilitate rapid visual qualitative assessment and comparison of the genomes. Additional genes of interest to the authors could still be colored, but perhaps in less saturated colors, for example. Also, it would be visually helpful to include the genome backbone.
##Fig. S7 - It is very helpful and nice to have this overview figure. Some additional points would benefit from clarification:
• Legend: The numbers in the legend don't match the figure, e.g. the legend refers to 158 high-quality UViGs including the 148 from IMG/VR and the 10 from GenBank; however, there is no reference to 158 in the figure, and the 10 from GenBank are not identified as HQ-UViGs in the figure, so it is not clear for the reader.
• Panel A: Suggest removing the line from "144 N4-like viral isolates" to "10 additional N4-like ...".
• Panel A: Suggest changing: "length of N4-vRNAP" to "length of N4-vRNAP (req 2500aa min)", "N4-associated UViGs" to "N4like-vRNAP-candidate UViGs" (at least 4 places); "CD-HIT: 99%..." to "remove near-identical sequences - CD-HIT...".
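As a reading aid for the set sizes questioned in the "Category names" and Fig. S7 legend comments, the arithmetic can be written out as a short sketch. The four base counts come from the abstract and the comments above; the interpretation that 154 is the 144 isolates plus the 10 additional GenBank genomes is an assumption, and the dictionary keys are hypothetical labels rather than the manuscript's terminology.

```python
from typing import Dict


def reconcile(base_sets: Dict[str, int]) -> Dict[str, int]:
    """Derive the composite totals quoted in the manuscript from the base sets."""
    return {
        # 148 IMG/VR HQ-UViGs + 10 additional GenBank genomes
        "uncultured_viruses": base_sets["imgvr_hq_uvigs"] + base_sets["genbank_additional"],
        # Assumed: 144 isolates + 10 additional GenBank genomes
        "genbank_viral_genomes": base_sets["viral_isolates"] + base_sets["genbank_additional"],
        # All four base sets together
        "total_dataset": sum(base_sets.values()),
    }


base_sets = {
    "viral_isolates": 144,
    "genbank_additional": 10,
    "imgvr_hq_uvigs": 148,
    "integrated_proviruses": 40,
}
print(reconcile(base_sets))
```

Under these assumptions the derived values are 158, 154, and 342, which matches the abstract's total. The 601 and 611 figures mentioned in the comments are intermediate candidate counts and are not derivable from these four sets.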

• Panel A: The last section leading from the 601 to the 302 is not clear. There is no process box (green box) connecting the 601 to the 148 to explain how this number was reduced. It seems the bottom left section is meant to show the process reflecting searches with HMMs from all the hallmark genes to reduce the number from 601 to 148. If this is the case, it would be clearer to remove all of those boxes, which are redundant with the top left of the figure, and simply indicate this as a green box between the 601 and 148 with a statement such as "search with 7 hallmark gene HMMs from 154 N4-like viral genomes from GenBank".
• Panel A: I suggest coloring the boxes for all the unique sets with some new color -> "144 N4-likes", "10 additional ...", "148 HQ-UViGs", and "40 N4-like", as these are the final ones being added up.
• In panel B: Suggest changing "Free-living N4 ..." to "N4-vRNAP hallmark-HMM" (i.e. removing "free-living").
• In panel B: Why is the difference between the number of hits (8991) and the number of assemblages (460) so great? Are there many copies commonly found in each of the assemblages? Why would this be?
• In panel B: Suggest changing "co-occurrence of all six hallmark genes" to "co-occurrence of all 7 hallmark genes".
##Fig. S8 This figure is not referenced in the manuscript and it is not clear what panel B is showing. For example, why are there different numbers of iterations of HMM, and how does this inform the reduction of the 601 HQ-UViGs to the 148?

#Figures
##Fig. 1 It is not clear what is meant by "repartition".
##Fig. 2 Figure legend: Nice addition and cross-referencing, very informative.
##Fig. 4 Figure legend: What are "floated genera" (L881)? If this refers to other genera that are currently pending as taxonomy proposals, please provide a reference. Panel B: What do the yellow and green colorings by the leaves indicate? The legend labeled "Branch" intermingles branch line style with individual leaf coloring. Please name the outgroup.
##Fig.
• Figure: I suggest adding headers on the top of the left and right sides & calling them panels A & B.
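The Fig. S7 panel B question about 8991 hits versus 460 assemblages amounts to roughly 19 to 20 hits per assemblage on average. One way the authors could show where these come from is to tabulate hits per assemblage, as in the sketch below. The input format (pairs of assemblage ID and hit protein ID) and the toy records are assumptions for illustration; they are not taken from the manuscript's data.

```python
from collections import Counter
from statistics import mean
from typing import Dict, Iterable, Tuple


def hits_per_assemblage(hits: Iterable[Tuple[str, str]]) -> Counter:
    """Count HMM hits per assemblage from (assemblage_id, protein_id) pairs."""
    return Counter(assemblage for assemblage, _ in hits)


def summarize(per_assemblage: Counter) -> Dict[str, float]:
    """Summarize how hits are distributed across assemblages."""
    counts = list(per_assemblage.values())
    return {
        "assemblages": len(counts),
        "hits": sum(counts),
        "mean_hits": mean(counts),
        "max_hits": max(counts),
        # Assemblages carrying more than one hit (multi-copy or fragmented genes)
        "multi_hit_assemblages": sum(1 for c in counts if c > 1),
    }


# Hypothetical hit table, for illustration only
toy_hits = [("asm1", "p1"), ("asm1", "p2"), ("asm1", "p3"), ("asm2", "p4")]
print(summarize(hits_per_assemblage(toy_hits)))

# Average implied by the numbers quoted in the panel B comment
print(round(8991 / 460, 1))
```

A skewed distribution (a few assemblages with very many hits) would point to a different explanation than a uniform one, so reporting the maximum and the number of multi-hit assemblages alongside the mean would help answer the question.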