Hecatomb: an integrated software platform for viral metagenomics

Abstract Background Modern sequencing technologies offer extraordinary opportunities for virus discovery and virome analysis. Annotation of viral sequences from metagenomic data requires a complex series of steps to ensure accurate annotation of individual reads and assembled contigs. In addition, varying study designs will require project-specific statistical analyses. Findings Here we introduce Hecatomb, a bioinformatic platform coordinating commonly used tasks required for virome analysis. Hecatomb means “a great sacrifice.” In this setting, Hecatomb is “sacrificing” false-positive viral annotations using extensive quality control and tiered-database searches. Hecatomb processes metagenomic data obtained from both short- and long-read sequencing technologies, providing annotations to individual sequences and assembled contigs. Results are provided in commonly used data formats useful for downstream analysis. Here we demonstrate the functionality of Hecatomb through the reanalysis of a primate enteric and a novel coral reef virome. Conclusion Hecatomb provides an integrated platform to manage many commonly used steps for virome characterization, including rigorous quality control, host removal, and both read- and contig-based analysis. Each step is managed using the Snakemake workflow manager with dependency management using Conda. Hecatomb outputs several tables properly formatted for immediate use within popular data analysis and visualization tools, enabling effective data interpretation for a variety of study designs. Hecatomb is hosted on GitHub (github.com/shandley/hecatomb) and is available for installation from Bioconda and PyPI.


Lais Farias Oliveira Lima
Annotation of viral sequences from metagenomic data requires a complex series of steps to ensure accurate annotation of individual reads and assembled contigs.In addition, varying study designs will require project specific statistical analyses.

Findings
Here we introduce Hecatomb, a bioinformatic platform coordinating commonly used tasks required for virome analysis.Hecatomb processes metagenomic data obtained from both short and long read sequencing technologies, providing annotations to individual sequences and assembled contigs.Results are provided in commonly used data formats useful for downstream analysis.Here we demonstrate the functionality of Hecatomb through the reanalysis of a primate enteric and a novel coral reef virome.

Conclusion
Hecatomb provides an integrated platform to manage many commonly used steps for virome characterization including rigorous quality control, host removal and both read-and contig-based analysis.
Each step is managed using the Sankemake workflow manager with dependency management using Conda.Hecatomb outputs several tables properly formatted for immediate use within popular data analysis and visualization tools enabling effective data interpretation for a variety of study designs.Hecatomb is hosted on GitHub at github.com/shandley/hecatomb and is available for installation from the Bioconda at anaconda.org/bioconda/hecatomband pypi.org/project/hecatomb/.

BACKGROUND
Viruses are also the most dominant entity on the planet with current global estimates as high as 10 31 viral particles [1], and they are omnipresent in all cellular life forms [2].As such, they exert significant influence on their surroundings.Metagenomic sequencing offers a powerful tool to study viral diversity in both hostassociated and environmental systems [3][4][5][6][7][8][9][10][11][12][13].However, there are currently many challenges associated with viral metagenomics.While viruses are the most abundant and diverse biological entity on the planet, they represent a minority of reference genomes in GenBank, largely due to difficulties associated with studying them [14].There is a vast amount of sequence information that remains taxonomically or functionally ill-defined.These sequences are regularly referred to as "viral dark matter" and pose a significant barrier to the annotation of viral sequences from metagenomic data (reviewed in [15]).Our ability to successfully annotate metagenomic data as "viral" is directly impacted by the size and diversity of the reference database and the sensitivity of our search algorithms.Larger, more diverse reference databases can improve viral sequence annotation but are less conducive to high-sensitivity search algorithms required to identify distant sequence similarity.This dichotomy can force researchers to choose between optimal databases and search algorithms.
Another challenge to interpretation of reference-based sequence annotation is that viral metagenomes are often plagued with false positive classifications [16][17][18].Viruses share regions of sequence similarity with all other domains of life, including 'stolen' genes incorporated from their hosts' genomes, and repetitive or low-complexity insertion elements or transposons.These sequences are present in many reference databases and can result in false-classifications due to shared sequence similarity across taxonomies.The presence of false-positive classifications may influence data interpretation.For instance, mis-classification of viral sequences in clinical samples could lead to incorrect hypotheses about virus-disease associations or patient diagnosis.Similarly, an increased false-positive rate in any environment could lead to overestimates of species diversity.Highly-curated databases may alleviate false-positives but they require tremendous resources and time.Likewise, they risk missing newly discovered viruses which have yet to make their way through the curation process.Thus, it is important for bioinformatic tools to provide a system to classify the quality of similarity-based annotations in light of imperfect databases.
Numerous bioinformatics tools exist for identifying viral sequences from metagenomic data .
However, many of these are lacking in features or fail to offer researchers an end-to-end solution to manage the myriad tasks required of virome analysis (e.g.quality control, host removal, assembly).For example, few tools are designed for read-based annotation of viral sequences and none are currently maintained [44,45].Individual read annotations are valuable as the detection of a small number of viral reads could signify the presence of a virus.This is true for viral sequences both closely related to and divergent from reference viral sequences.Detection of divergent viral sequences requires sensitive alignment-based tools such as BLAST, DIAMOND or MMSeqs2 [46][47][48].Alternative approaches, such as those that use k-mer distances have also been implemented [49][50][51][52].K-mer based algorithms are fast but limited in their ability to annotate viral sequences divergent from those in reference databases.Therefore sensitive alignment based approaches are preferred for detecting divergent or novel viruses.
Alignment-based approaches are computationally more expensive than k-mer based approaches, but can be effectively implemented using a tiered database query approach [44,53].In the tiered approach initial queries are made against small virus-only sequence databases.Subsequent secondary cross-checking against reference databases representing sequences from all domains of life are required to remove falsepositive viral annotations.In addition, queries against both amino acid and nucleotide databases may be of interest.Reference sequences from viral taxa may only be represented in one or the other database types and amino acid databases will not include non-coding viral sequences.Tiered alignment-based queries across multiple databases requires a number of steps and produces a series of disconnected outputs.A robust system to monitor and manage these steps as well as coordinate the outputs into a tractable framework would permit researchers to focus on making biological insights instead of on job and file management.
While read based annotation can provide sensitive and specific detection of viral sequences within a metagenome, metagenome assembled contigs can provide additional layers of information such as gene content, gene order and metabolic pathway prediction.Many of the tools used to assemble viral contigs are the same ones used for assembling bacterial contigs [54][55][56][57].More recently, metavirome specific assembly tools are beginning to emerge with promising results [58,59].Virome metagenomics is evolving beyond binning with algorithms specifically designed to resolve complete genomes from metavirome assemblies [60].Similar to the requirements of read based annotations, assembly requires multiple steps to ensure high-quality data for downstream analysis.
Once contigs are obtained they can be further analyzed using one of many established virome analysis tools [24,29,32,61].Each of these tools provides distinct information about the viral content of a metagenome.For example, VirSorter2 uses a set of customized classifiers and curated Hidden Markov Models (HMM) to estimate the "viraleness" of metagenome assembled contigs.This is useful for separating viral from non-viral contigs, but additional steps are required to assign taxonomic lineages.vContact2 can assign genus-level taxonomy using gene-sharing networks to prokaryotic but not eukaryotic viruses.In contrast, Cenote-Taker 2 provides options for both prokaryotic and eukaryotic viral contig annotation.
VIBRANT uses deep learning neural networks to classify prokaryotic viral contigs.Numerous additional tools are also available for researchers to mine taxonomic and functional information from a metagenome [62].This complex array of tools for virome interrogation provides a number of opportunities for virome researchers.However, each of them is dependent on the generation of high-quality input assembly data from a wide range of experimental systems, and they vary greatly in terms of useability and support.A common workflow that takes inputs from both short and long reads and emphasizes rigorous quality control to remove non-biological contamination and host from a variety of library types and study designs would ensure researchers provide the highest-quality of data to each of these tools regardless of their experimental system.
While several options are available for virome analysis using read or contig based approaches, integrating these results with study data is critical for making biological insights.Project specific statistical analyses are often required.For example, statistical models to test if a pathogenic virus is associated with a disease using sparse read-based results will differ wildly from a study analyzing bacteriophage ecology.Fortunately, the vast majority of these disparate statistical approaches are available as R software packages [63].In addition, principles guided by the popular tidyverse and ggplot2 packages are familiar to many researchers [64,65].These principles have been successfully applied to the analysis of bacterial microbiome data in software suites such as PhyloSeq and microViz, but have yet to be applied to the analysis of virome data [66,67].
Here we present Hecatomb, a bioinformatics platform designed to address the above issues.Hecatomb supports the analysis of both long-and short-read technologies and a variety of library types.The pipeline performs rigorous quality control followed by tiered alignment based taxonomic assignment using MMseqs2 [44].Hecatomb also performs metagenomic assembly and annotation.Each of these steps is managed using the Sankemake workflow manager with dependency management using Conda [68,69].Hecatomb outputs several tables properly formatted for immediate use within popular data analysis and visualization tools enabling effective data interpretation for a variety of study designs.

IMPLEMENTATION
Hecatomb serves as an end-to-end pipeline by processing raw sequencing reads (single-or paired-end, long-or short-reads from Illumina, MGI, PacBio or Oxford Nanopore platforms) through four key modules (Fig. 1).In this way, it was designed to address many of the useability and functionality issues that are present in other software that we summarize in Fig. S1.

Module 1: Sequence Quality Control and Host Removal
Module 1 (Preprocessing) removes non-biological contaminants (i.e.primers, adapters) as well as common laboratory contaminants cataloged in NCBI's UniVec database [70].The user can select to use fastp for contaminant removal from standard library preps, or BBTools for more complicated library preparation strategies such as the round A/B library which generates DNA/cDNA libraries for single-and doublestranded RNA and DNA viruses [71][72][73].Low-quality sequences are also trimmed or removed prior to host removal.
Module 1 has an option to remove host-sequences (e.g.mouse, human) using Minimap2 [74].This is optional as it may not be applicable to all samples (e.g.water, air, soil).Hecatomb comes packaged with several commonly used host-references genomes which have been masked of potential viral sequence.
This masking is designed to minimize the inadvertent removal of viral sequences that may have similarity to host sequence.Masked reference genomes were generated as follows: 1) all viral genomes from the National Center for Biotechnology Information (NCBI) viral assembly database (ncbi.nlm.nih.gov/assembly/?term=viruses) were downloaded and computationally "shredded" into short fragments with an average length of 85 bases sharing a 30 base overlap using shred.shfrom the BBTools suite [73].Shredded viral sequences were then mapped (minimum identity of 90% and at most 2 insertions and deletions) and masked from host-reference genomes using BBmap requiring a [73].Pre-computed masked reference genomes for the following host genomes are available in Hecatomb: human, mouse, rat, camel, Caenorhabditis elegans, dog, cow, macaque, mosquito, pig, rat and tick are available within Hecatomb.A command is provided to generate new masked genomes for additional hosts not included with Hecatomb.
Sequences free of contamination and host are clustered using the nucleotide version of Linclust packaged with MMSeqs2 [48,75].Clustering reduces the number of sequences requiring taxonomic classification to a single, representative sequence thus greatly reducing the computational requirements for read based annotation in Module 2. Sequences are clustered requiring a minimum sequence identity of 97% and 80% alignment coverage of target sequence to the representative sequence (--min-seq-id 0.97 -c 0.8 --cov-mode 1).The size of each cluster per sample is maintained in the final output table (seqtable.fasta).This information serves as a "count table" for each sequence and values are provided as both raw and counts normalized to library size.

Module 2: Read-Based Annotation
Taxonomic and functional annotation is provided to reads in seqtable.fastausing a tiered approach (Fig. 2A).All queries are carried out using MMseqs2 [48].Queries against amino acid databases are performed using MMSeqs2 6-frame translation.Each read in seqtable.fasta is first queried against all viral (taxonomy id: 10239) amino acid sequences in UniProtKB clustered at 99% identity using Linclust (Viral AA DB) [75,76].Sequences annotated as virus are subsequently queried against UniClust50 (Multi-kingdom AA DB) to remove false-positive annotations [77].UniProt functional annotations are also applied when available.
Reads not identified as viral using queries against amino acid databases are queried against nucleotide databases (Fig. 2A).Similar to the translated queries, each read is first queried against a virus-only reference sequence database (Virus NT DB).This database consists of all viral sequences in GenBank (taxonomy id: 10239) clustered at 100% identity using Linclust [48,75].Sequences annotated as virus are subsequently queried against a customized nucleotide database (Polymicrobial NT DB) containing the Virus NT DB and representative RefSeq genomes from bacteria (1 per genus, n = 14,933), archaea (n = 511), fungi (n = 423), protozoa (n = 90) and plant (n = 145) genomes.These reference genomes represent a genomic 'polymicrobial' community and cover a large amount of microbial sequence space.This allows for the removal of false-positive annotations from the first query using a relatively small reference database.
Taxonomic annotations are augmented using a modified version of the 2b lowest common ancestor (2b-LCA) algorithm described in [78].The 2b-LCA algorithm provides conservative taxonomic assignments towards lower-nodes of the tree when similarity is found across a heterogeneous collection of taxonomies.
However, the LCA algorithm fails when crossing higher taxonomic ranks.For example, sequences with similarity to both bacterial and viral taxa have a LCA of "root" in the NCBI tree, while viruses from distinct viral domains (e.g.bacteriophage and vertebrate viruses) are assigned to "virus root".Hecatomb will identify these instances and augment the annotations by reverting to the top-hit annotation.Each instance of this is flagged in the final output table so researchers can choose to include or exclude these from downstream analysis.This approach provides additional information about sequences with ambiguous taxonomic assignments instead of just leaving them as 'root' or 'virus root' and simply discarded.
Sequence annotations from queries against both amino acid and nucleotide databases are combined into one table and assigned updated taxonomies using the most recent version of NCBI's Taxonomy Database based using TaxonKit [79,80] (Fig. 2A).This taxonomy table contains full Linnaean taxonomic lineages (Kingdom, Phylum, Class, Order, Family, Genus and Species), alignment type used for annotation (translated (aa, amino acid database) or untranslated (nt, nucleotide database)) and LCA augmentation information.Due to hecatomb keeping track of individual sequence IDs throughout this process it is then possible to combine these read-based taxonomic assignments to other data generated by Hecatomb or external data resources (Fig. 2B).By default Hecatomb will combine MMSeqs2 alignment information (e.g.target/query ID, e-value, % identity, alignment length, etc.), and count table information gathered during the clustering process.As an example of combining data to external resources, Hecatomb will provide Baltimore virus type information (both Baltimore class and Group).This could easily be extended to a variety of external data resources.Together these disparate data tables are collected into Hecatombs' bigtable.tsv.
As Hecatomb tracks sample identifiers it is then possible to combine the bigtable with sample data.All of these data are formatted to make them easily importable into commonly used data analysis tools for statistical and graphical analysis.

Module 3: Assembly
By default, Hecatomb performs an assembly for individual samples using MEGAHIT for short-reads or Canu for long-reads (Fig. 3) [81,82].Individual sample assemblies are then merged into a population assembly using Flye [83].
Per sample contig abundances are calculated by mapping individual sample reads to the population assembly using BBMap [84].Read counts are reported normalized to library size and contig length using a variety of measures (reads per kilobase million (RPKM), fragments per kilobase million (FPKM) and sequences per million (SPM)).SPM is the same calculation as used for transcripts per kilobase million (TPM) except that the sequences are not assumed to be transcripts [85,86].Additional contig properties (e.g.length, GC-content, coverage %) are combined with taxonomic assignments and sample abundance estimates into a final table (contig_count_table.tsv).
Options to skip the assembly step and to perform a cross-assembly are available for the user.Crossassembly assembles all reads from all samples at once (skipping the individual sample assemblies).This can result in better quality assemblies, but is computationally expensive for larger datasets and may not be an option for many users.

Module 4: Contig-Based Annotation
Module 4 provides taxonomic annotations to contigs.Taxonomy is assigned to all contigs in the population assembly using MMseqs2 [48].Each contig is queried against the same polymicrobial nucleotide database (Polymicrobial NT DB) used for read-based annotation (contigAnnotations.tsv).Additionally, information obtained from both the read-based annotation and assembly modules (Fig. 1) are combined.Read mapping information (start, stop, mapping quality, etc.) is maintained during the sample abundance estimation (mapping with BBMap) performed as part of Module 3 [84].This mapping information is combined with read-based annotations (bigtable.tsv)to generate a new table combining read based taxonomic information across each contig (contigSeqTable.tsv).

Installation and Dependency Management
Hecatomb is hosted on GitHub at github.com/shandley/hecatomb and is available for installation from the Bioconda (anaconda.org/bioconda/hecatomb)and the Python Package Index (pypi.org/project/hecatomb/)easing installation for individual users with a single command [68,87,88].Hecatomb makes liberal use of Conda environments to ensure portability, ease of installation and proper versioning of software dependencies (Fig. S2).All required and optional software dependencies are summarized in Table S1 While Hecatomb is a Snakemake pipeline, it uses the Snaketool command line interface to make running the pipeline as simple as possible [93].Snaketool populates required file paths and configuration files, allowing Hecatomb to be configured and run with a simple command, and offers a convenient way to modify parameters and customize options.

High-Performance Computing Deployment
Hecatomb can be deployed on a high-performance computing (HPC) cluster and can utilize Snakemake profiles for cluster job schedulers (e.g.Slurm, SGE, etc.).Snakemake uses profiles to submit pipeline jobs to the job scheduler and monitor their progress.Profiles can be created manually, but Hecatomb has been designed for compatibility with the official Cookiecutter (https://github.com/cookiecutter/cookiecutter)profiles for Snakemake (https://github.com/Snakemake-Profiles/doc),and comes with a pre-installed Slurm example profile.

Customization
Hecatomb comes precompiled with many predefined settings regarding individual process options.These settings are highly-customizable through the inclusion of a Snakemake YAML file.This file provides a single-source solution to user customization.Settings such as the quality threshold used for read trimming in Module 1 or the length of contig to maintain in Module 3 can easily be adjusted per an individual user or project needs.

APPLICATION
Hecatomb accelerates profiling of viral metagenomes.We reanalyzed a previously published data set of 95 stool samples collected from SIV-infected rhesus macaques (Macaca mulatta) (NCBI BioProject accession: PRJEB9503) [5].Sequences were generated using the Illumina MiSeq 2×250 paired-end protocol using round A/B libraries (DNA and cDNA to enable detection of both RNA and DNA viruses).
These data contain sequences from a variety of RNA and DNA vertebrate viruses.The original study also identified a statistically significant difference in the abundance of several enteric viruses in SIV infected animals compared to uninfected animals.
We first assessed Hecatombs' overall ability to detect diverse viral sequences.Hecatomb was executed using the round A/B preprocessing module, and default parameters.Hecatomb classified sequences into phylogenetically diverse viral groups (Fig. 4A).Bacteriophage from the family Microviridae and the order Caudovirales (Siphoviridae, Podoviridae and Myoviridae), were highly abundant.Sequences belonging to a diverse set of viruses associated with infection of plants and protists were also detected (Fig. S3).Similar to the original study, Hecatomb identified a large number of sequences belonging to the Picornaviridae and Adenoviridiae.Sequences from these viral families were found to be more abundant in SIV-infected macaques when compared to uninfected animals in the original study (Fig. 4C).
Hecatomb collects and organizes alignment statistics (e.g.e-values, percent identity, alignment length, etc.) generated for each sequence annotation in Module 2. These data can be useful for assessing the prevalence and quality of viral annotations within a study.As an example, we examined the percent identity and alignment lengths of every read assigned taxonomy to the four families of viral enteropathogens identified in the original study (Circoviridae, Picornaviridae, Adenoviridae and Parvoviridae) (Fig. 4B).
Hecatomb annotated sequences for these four viral families using both translated queries to amino acid (aa) databases and untranslated queries to nucleotide (nt) databases.Quadrants were applied to visualize low-and high identity and short-and long-alignment lengths for every annotated sequence.Sequences in the upper two quadrants are highly similar to sequences in the reference databases over short (upper left, quartile 1 (Q1)) or long (upper right, Q2) alignment lengths, while sequences in the lower two quadrants have low similarity over short (lower left, Q3) or long (lower right, Q4) alignment lengths.For this analysis we arbitrarily selected 70% identity to represent the cut-off between low and high-identity for translated (aa database) and 90% identity for untranslated (nt database) alignments.These values are adjustable and could be customized for each study and viral family of interest.Using this framework, it is clear that a majority of sequences are high-identity (both short and long alignments) to sequences in both the aa and nt reference databases for the 4 families of enteropathogenic viruses..
In contrast, there were also a large number of sequences classified due to similarity to reference sequences from viruses of protists (Fig. 4B).Mimiviridae, that infect Acanthamoeba, and Phycodnaviridae, that infect algae, are both dsDNA viruses with large genomes [94].While it is conceivable that these viruses may exist in the stool samples of rhesus macaques via water or food, using the quadrant framework there is little evidence of high-identity alignments to any sequence in either the aa or nt databases (Fig. 4B, Fig. S4).
Hecatomb does not automatically remove sequences from these families due to their presence in environmental datasets.There is evidence for short and long low identity alignments (quadrant 4) to both Phycodnaviridae and Mimiviridae reference sequences.Thus, these sequences should be analyzed using additional metrics (i.e.E-values, abundance across samples, etc.) to determine if these represent potentially novel viral sequences.This would not have been possible using stringent E-value filtering prior to data analysis.

Reevaluation of existing environmental datasets.
We assessed Hecatomb's ability to analyze nonhuman associated viromes by processing a previously studied coral reef dataset (NCBI BioProject accession: PRJNA595374) [95,96].The dataset consists of whole genome shotgun (WGS) metagenomic sequencing of both seawater and coral mucus from inner and outer sections of a Bermuda reef system.
The original studies only considered bacterial metagenome assembled genomes which makes it an excellent candidate for generating new biological insights by characterizing the viruses of this previously published dataset.The original study identified statistically significant differences in bacterial compositions between the coral mucus and seawater microbiomes and the coral mucus microbiomes from the inner and outer reefs.All analysis was performed using output from the contig annotations provided by Hecatombs module 3 and 4.
We first tested for differences in viral species alpha diversity using both Shannon diversity and richness comparing both inner and outer reef samples and coral mucus and reef water samples (Fig. 5).Both Shannon diversity and richness were significantly higher for inner reef samples compared to outer reef samples (Fig. 5A).Shannon diversity and richness were not significantly different between coral mucus and reef water.This result is in contrast with the bacterial diversity and richness metrics being similar across all samples as reported in the original study [95].
Next, we compared the viral compositions of these coral reef samples using beta diversity.Principle coordinate analysis (PCoA) of Bray-Curtis dissimilarity of viral genera, and Permutational multivariate analysis of variance (PERMANOVA) showed non-homogeneous distributions between inner and outer reef samples (p = 0.001) as well as between coral mucus and reef water samples (p = 0.015) (Fig. 5B).There is a strong separation of inner and outer reef samples along the x-axis and a weaker separation of reef water and coral mucus samples along the y-axis; this is the same trend observed in the original study for the bacterial compositions.
Linear regression (LR) to characterize the viral taxa that are driving the differences between inner and outer reefs, and between coral mucus and reef water samples.The R package microViz will calculate LR models at all taxa and taxon levels for the various sample groups.We calculated LR models to the genus level and generated tree plots for all taxa with prevalence greater than 10% of samples, coloured by the LM estimate coefficients, and weighted by prevalence between the inner and outer reef samples (Fig. 5C) and between the coral mucus and reef water samples (Fig. 5D).These analyses indicate that inner reef samples have significantly higher relative abundances of many viral taxa, including several Caudoviricetes taxa and especially the Kyanoviridae family which consist of various Synechococcus phages and Cyanophages (Fig. 5C).Conversely, far fewer viral taxa were more abundant in outer reef samples.Reef water samples contained elevated abundances of many viral taxa compared to coral mucus samples, with the main exception of the family of giant viruses Mimiviridae (Fig. 5D).

Accelerated discovery of novel viruses.
Hecatomb retains the assembly graph as well as the assembly itself which downstream tools can utilize to resolve metagenome-assembled genomes.There was strong evidence for the presence of novel bacteriophage within the SIV-macaque dataset in the form of many high-quality but low identity alignments to known reference viruses (Fig. S5).We therefore processed the assembly graph with Phables and identified 127 probable complete phage genomes [60].Phables bins fragmented assemblies into complete genomes.These genomes were assessed with CheckV which determined that 121 of them were high-quality complete phage genomes [97].We assigned taxonomy using MMSeqs2 with the Hecatomb primary nucleotide and amino acid databases (Table S2) [48].Lastly, the genomes were annotated using Pharokka [98].Of the 121 genomes, 98 were Microviridae (Table S2).Of these Microviridae, 96 exhibited the hallmark replication initiation protein followed by a major capsid protein, and a further 55 have the hallmark minor tail or pilot tail spike protein (Fig. 6A) while the other 41 contain hypothetical proteins where the tail spike protein would be.This family was also identified as the most abundant by read count (Fig. 4A).There were 10 Caudoviricetes, two Cressdnaviricota genomes, and eight that had hits to known phages with no taxonomic information.There were 13 cases where two genomes were resolved from the same assembly graph element and likely represent quasi-species that can occur for instance through recombination events [99,100].We also processed the coral dataset using the same method.Synteny is also conserved in these larger phages, for instance, Caudoviricetes phage 1112C1 arranges its capsid and tail proteins together and exhibits the conserved layout described in [101] (Fig. 6B).
In the coral dataset we identified three complete Caudoviricetes phage genomes (Table S2).The number of samples in this study was much lower and likely impacted the number of recovered viral genomes.The recovered viral genomes across both studies are novel with only 18 aligning to a known phage with an identity higher than 90% (Table S2).This demonstrates Hecatomb's utility for data mining published environmental metagenome projects to generate novel viral genomes.

DISCUSSION
Virome sequencing is the premier approach to evaluate the viral content of both host-derived and environmental samples.It is useful for determining what types of viruses are present in individual samples and how virome compositions compare between sample groups.This information forms the foundations for answering a wide array of interesting biological questions.For example, virome composition has recently been analyzed as an indicator of the microbial impacts of climate change [102,103].Virome sequencing was also critical to the discovery and characterization of SARS CoV-2 in 2019 [104].Effective characterization of virome sequencing data requires rigorous and integrated software platforms to facilitate and accelerate virus discovery and virome compositional analysis.Given these tools researchers will be better prepared to assess how viruses are associated with some of the most important challenges to human life today.
All virome studies are dependent on effective computational tools to identify and classify viral reads or assembled contigs within a metagenome.Viral metagenomics is often dependent on identifying sequence similarity against reference sequence databases, either directly via homology-based searchers, or using machine learning techniques that have been trained on reference databases to identify features unique to viral sequences (reviewed in [105]).Homology-based searches can take a 'brute force' approach, wherein all unclassified sequences are queried against a comprehensive, multi-kingdom reference sequence database (e.g.NCBI nt or nr).This approach relies on the search algorithm (e.g.BLAST, DIAMOND [47]) to pick the best or lowest-common ancestor of a group of hits to provide a final taxonomic assignment to an unknown query sequence.This approach is slow and requires significant computational resources, which is why Hecatomb takes an alternate approach.First, by capturing all 'potentially viral' sequences initially querying a small viral sequence database.The 'potentially viral' sequences typically represent a fraction of the full metagenomic data making subsequent computation more tractable.To confirm viral taxonomic assignment, potentially viral sequences are cross-checked against a curated small transkingdom reference database containing genomic representatives from all kingdoms of life.Hecatomb completes this iterative search approach using translated searches against amino acid databases as well as untranslated searches against nucleotide databases, combining the results of each to ensure detection of viral sequences is database independent.This iterative search strategy uses databases orders of magnitude smaller than comprehensive, multi-kingdom databases (such as NCBI's nt and nr) increasing computational efficiency without limiting viral detection.
Hecatomb's design philosophy recognizes that there are no 'perfect' databases or search algorithms.Both the brute force and iterative search approaches against comprehensive or curated databases will result in different rates of true/false positives/negatives.Instead, Hecatomb relies on providing a compiled and rich set of data for search result evaluation.We used this strategy to reassess the virome composition of SIVinfected and uninfected rhesus macaques [5].The original study used an iterative approach, but relied on comprehensive, transkingdom databases (NCBI nt and nr), and identified associations between four families of animal viruses (Circoviridae, Picornaviridae, Adenoviridae and Parvoviridae) and SIV-infection.
The new Hecatomb trans-kingdom database is 6 orders of magnitude smaller than GenBank nt (5.0x10 6 versus 1.3x10 12 ) which results in a significant reduction in computational time and resources.Hecatomb identified the same four viral families and their relationship to SIV mediated disease.Similar to our analysis of these samples using Hecatomb, the original study also classified a number of sequences to the Mimiviridae and Phycodnaviridae.Statistical comparison of these sequences between groups (e.g.SIVinfected vs. uninfected) did not reveal any significant associations thus they were not discussed further.
However, new evaluation of results from Hecatomb indicates that there were likely false-positive classifications reported in the original analysis.Lastly, we identified 121 novel complete phage genomes in this dataset.The majority of these genomes were Microviridae, which was the most abundant family by read abundance in the dataset.Recovering these genomes with a provisional taxonomic classification using Hecatomb's annotations is simple, fast, and scalable.This method is not a replacement for culturing and characterizing viruses in a lab environment.However, these in silico methods are essential when considering the scale of the viral dark matter problem.This reanalysis highlights how coordinated data such as alignment statistics and taxonomy can be powerful tools for virome evaluation and novel virus discovery.
Hecatomb was able to effectively evaluate the viromes of environmental (non-host associated) viromes.
Leveraging the R packages phyloseq and microViz allowed us to quickly and easily complete the analysis in approximately 200 lines of code [66,67].This analysis was primarily designed to identify compositional changes in viromes between reef types (inner or outer), and within coral mucosa and the surrounding water from a previously published metagenomic data set [95,96].The original study identified elevated levels of Pelagibacter, Synechococcus, and unclassified Rickettsiales in inner reef samples compared to outer reef samples.Indeed we found Synechococcus phages and Cyanophages were important for distinguishing the highly fluctuating inner reef system from the thermally stable outer reef.The authors showed that the bacterial microbiomes were unique for inner and outer reefs for both coral and mucus samples.We applied the same method on the viral compositions and confirmed that the viromes for these four sample types are also all unique.
Interestingly, viral species richness and diversity were significantly elevated in inner reef samples whereas the original study found these to be similar for the bacterial composition.Viral activity is an important vehicle for nutrient cycling which was thought to be much higher in the inner reef based on the metabolic profiles of these samples [95].There were many viral taxa that were more abundant in the inner reef samples, and few that were elevated in the outer reef.Similarly, many viruses were more abundant in reef water samples compared to coral mucosa, with the main exception of the giant viruses from the family Mimiviridae.

Potential implications
Virome analysis is complex and requires efficient computational tools to generate analyst friendly results.
Hecatomb provides a comprehensive and computationally efficient solution for both read-and assemblybased viral annotation, virome analysis, and novel virus discovery.The pipeline is delivered with a convenient and easy-to-use front end, and is compatible with different sequencing technologies.
Hecatomb's comprehensive collection of data throughout the pipeline's execution, in particular the collection of alignment statistics, empowers identification and interrogation of viral taxonomic assignments.
We demonstrate Hecatomb's utility for rapid processing and analysis of viral metagenomes with a wellstudied validation gut viral metagenome dataset.We also demonstrate its utility for mining regular metagenome samples for virome analysis by analyzing an existing environmental dataset.

Methods
All commands used for analyzing the Hecatomb annotations are available at gist.github.com/beardymcjohnface/3d3245b2bf6d9544c524f412037d5065.
Reevaluation of SIV dataset.We reanalyzed a previously published data set of 95 samples obtained from stool samples collected from SIV-infected rhesus macaques (Macaca mulatta) (NCBI BioProject accession: PRJEB9503) [5].Sequence data were generated using the Illumina MiSeq 2×250 paired-end protocol on libraries of total nucleic acid (DNA and cDNA to enable detection of both RNA and DNA viruses).Hecatomb was executed using the round A/B preprocessing module, and otherwise default parameters.Data were analyzed in R with Tidyverse [64]; commands are available in the above GitHub Gist.
Hecatomb was run with fast search parameters, cross assembly, and otherwise default parameters.Data were analyzed in R with PhyloSeq [66] and MicroViz [67]; commands are available in the above GitHub Gist.
Identification of phage genomes from Hecatomb assemblies.To identify complete phage genomes, the assembly graphs created by Hecatomb were processed with Phables [60].The predicted phage genomes were assessed with CheckV [97].High-quality and complete phage genomes were assigned provisional Taxonomic annotations using MMSeqs2 with the Hecatomb viral amino acid database (easytaxonomy) and viral nucleotide database (easy-search plus TaxonKit).Lastly, the genomes were annotated with Pharokka [98].

Availability of source code and requirements
Project name: Hecatomb Project home page: github.com/shandley/hecatombProject documentation: hecatomb.readthedocs.io Operating system: Linux   (6) Sample metadata can be joined into the read annotation table using the sample ID as the primary key.
(7) The read annotation table with sample metadata can be quickly and easily analyzed.

1 ABSTRACTBackground
the experimental design and statistical methods used should be given in the Methods section, as detailed in our Minimum Standards Reporting Checklist.Information essential to interpreting the data presented should be made available in the figure legends.Have you included all the information requested in your manuscript?Yes Resources A description of all resources used, including antibodies, cell lines, animals and software tools, with enough information to allow them to be uniquely identified, should be included in the Methods section.Authors are strongly encouraged to cite Research Resource Identifiers (RRIDs) for antibodies, model organisms and tools, where possible.Have you included the information requested as detailed in our Minimum Standards Reporting Checklist?Yes Availability of data and materials All datasets and code on which the conclusions of the paper rely must be either included in your submission or deposited in publicly available repositories (where available and ethically Yes Powered by Editorial Manager® and ProduXion Manager® from Aries Systems Corporation appropriate), referencing such data using a unique identifier in the references and in the "Availability of Data and Materials" section of your manuscript.Have you have met the above requirement as detailed in our Minimum Standards Reporting Checklist?Powered by Editorial Manager® and ProduXion Manager® from Aries Systems Corporation Modern sequencing technologies offer extraordinary opportunities for virus discovery and virome analysis.
It would be interesting to elucidate whether the coral mucus is impacting viral infectivity directly, or if the phages are switching to a lysogenic rather than lytic life cycle in this environment.Corals occasionally shed their mucosa.Shedding occurs far more frequently in the inner reef due to stressors such as thermal fluctuation and sedimentation from surface runoff.The increased flux of nutrients and microbes from corals to the surrounding reef water may be contributing to increased microbial and viral activity in inner reef samples compared to the outer reef.Different concentrations of microbes might also be having an impact.The outer reef systems are subject to upwelling, resulting in greater exchange of water with the open ocean which is probably flushing microbes and viruses from the environment.Unfortunately, it is not possible to infer microbial concentrations from WGS sequencing.The reanalysis of this coral dataset has generated many new hypotheses about viral host interactions within a coral reef system.

( 4 )( 5 )
Taxonomic annotations are calculated and joined into the read annotations again using the sequence ID.ICTV viral classifications are joined into the read annotations by the Taxonomic Family annotation.

Figure S1 .
Figure S1.Features and useability of popular viral metagenomics software as of March 2023.

Figure S3 :
Figure S3: Taxonomic subsets of virus types