Benchmarking informatics approaches for virus discovery: caution is needed when combining in silico identification methods

ABSTRACT Understanding the ecological impacts of viruses on natural and engineered ecosystems relies on the accurate identification of viral sequences from community sequencing data. To maximize viral recovery from metagenomes, researchers frequently combine viral identification tools. However, the effectiveness of this strategy is unknown. Here, we benchmarked combinations of six widely used informatics tools for viral identification and analysis (VirSorter, VirSorter2, VIBRANT, DeepVirFinder, CheckV, and Kaiju), called "rulesets." Rulesets were tested against mock metagenomes composed of taxonomically diverse sequence types and against diverse aquatic metagenomes to assess the effects of the degree of viral enrichment and habitat on tool performance. We found that six rulesets achieved equivalent accuracy [Matthews correlation coefficient (MCC) = 0.77, Padj ≥ 0.05]. Each contained VirSorter2, and five used our "tuning removal" rule, which is designed to remove non-viral contamination. Although DeepVirFinder, VIBRANT, and VirSorter each appeared once in these high-accuracy rulesets, they never appeared in combination with each other, indicating that combining tools does not necessarily lead to optimal performance. Our validation suggests that the MCC plateau at 0.77 is partly caused by inaccurate labeling within reference sequence databases. In aquatic metagenomes, our highest-MCC ruleset identified more viral sequences in virus-enriched metagenomes (44%–46%) than in cellular metagenomes (7%–19%). While improved algorithms may lead to more accurate viral identification tools, such improvements should be made in tandem with careful curation of sequence databases. We recommend using the VirSorter2 ruleset and our empirically derived tuning removal rule. Our analysis provides insight into methods for in silico viral identification and will enable more robust viral identification from metagenomic data sets.

IMPORTANCE The identification of viruses from environmental metagenomes using informatics tools has offered critical insights into microbial ecology. However, it remains difficult for researchers to know which tools optimize viral recovery for their specific study. In an attempt to recover more viruses, studies increasingly combine the outputs from multiple tools without validating this approach. After benchmarking combinations of six viral identification tools against mock metagenomes and environmental samples, we found that these tools should be combined only cautiously. Combinations of two to four tools maximized viral recovery and minimized non-viral contamination compared with either single-tool rulesets or combinations of five to six tools. By providing a rigorous overview of the behavior of in silico viral identification strategies and a pipeline to replicate our process, our findings guide the use of existing viral identification tools and offer a blueprint for the feature engineering of new tools that will lead to higher-confidence viral discovery in microbiome studies.
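
For context, the Matthews correlation coefficient referenced above condenses all four confusion-matrix counts (sequences correctly or incorrectly classified as viral or non-viral) into a single value between −1 and 1. The standard definition, given here for reference rather than taken from this study, is:

```latex
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
```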


Supplemental Figures
Benchmarking informatics approaches for virus discovery: Caution is needed when combining in silico identification methods

Figure S1.
Figure S1. Variability across testing sets. The average standard deviation is consistent across the number of testing sets (p-values ≥ 0.25), and the variability of the standard deviation plateaus after 5 testing sets are included (p-values ≥ 0.65).

Figure S2.
Figure S2. (A) Precision and (B) recall boxplots by ruleset, ordered by increasing precision and recall, respectively, and colored by ruleset type.

Figure S3.
Figure S3. Proportion of sequences in common between pairs of rulesets for the four viral identification tools tested (VirSorter2, VirSorter, VIBRANT, and DeepVirFinder). The pairs are grouped along the x-axis by the number of rules in the first member of the pair; thus, a pairing of a 1-rule ruleset with a 3-rule ruleset is counted once in the "1" rule column and once in the "3" rule column.

Figure S4.
Figure S4. Difference between the performance metrics for the rulesets with and without the +0.5 subrules of the four single-tool rules. A percent difference greater than 0 means that the performance metric was higher with the +0.5 subrules.

Figure S5.
Figure S5. MCC for the 3–5 kb subset. Points are colored by the number of viral identification tools used in the ruleset and ordered by increasing MCC.

Figure S6.
Figure S6. Individual performance of each tuning rule. Number of sequences with a given viral score for the (Top) tuning addition and (Bottom) tuning removal rules, faceted by sequence type. The tuning addition rules are (1) VirSorter2 viral hallmark genes > 2, (2) called viral by Kaiju and Kaiju match ratio ≥ 0.3, (3) percent unknown ≤ 75% and sequence length ≤ 50 kb, and (4) percent viral ≥ 50% by VirSorter2 or CheckV. The tuning removal rules are (1) CheckV host genes > 50 and not called a provirus, (2) CheckV viral genes = 0 and CheckV host genes ≥ 1, (3) 3 times the number of CheckV viral genes ≤ the number of CheckV host genes and not called a provirus, (4) sequence longer than 500 kb with one or fewer VirSorter2 hallmark viral genes, and (5) DeepVirFinder score ≤ 0.7 and p-value ≤ 0.05.
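
To make the thresholds in the caption easier to scan, the following is a minimal Python sketch that restates the Figure S6 tuning rules as boolean checks on per-sequence tool outputs. It illustrates the rule logic only and is not the authors' implementation; every dictionary key (e.g., vs2_hallmark_genes, kaiju_match_ratio, checkv_host_genes, dvf_score) is a hypothetical field name.

```python
# A minimal sketch (not the authors' code): the Figure S6 tuning rules
# restated as boolean checks on per-sequence tool outputs.
# All dictionary keys are illustrative placeholders, not the field names
# used in the published pipeline.

def tuning_addition(seq: dict) -> bool:
    """Return True if any tuning addition rule flags the sequence as viral."""
    return (
        seq["vs2_hallmark_genes"] > 2                                       # (1) VirSorter2 viral hallmark genes > 2
        or (seq["kaiju_viral"] and seq["kaiju_match_ratio"] >= 0.3)         # (2) Kaiju viral call and match ratio >= 0.3
        or (seq["percent_unknown"] <= 75 and seq["length_bp"] <= 50_000)    # (3) percent unknown <= 75% and length <= 50 kb
        or seq["percent_viral_vs2"] >= 50                                   # (4) percent viral >= 50% by VirSorter2 ...
        or seq["percent_viral_checkv"] >= 50                                #     ... or by CheckV
    )


def tuning_removal(seq: dict) -> bool:
    """Return True if any tuning removal rule flags the sequence as non-viral."""
    not_provirus = not seq["checkv_provirus"]
    return (
        (seq["checkv_host_genes"] > 50 and not_provirus)                                 # (1) > 50 host genes, not a provirus
        or (seq["checkv_viral_genes"] == 0 and seq["checkv_host_genes"] >= 1)            # (2) no viral genes and >= 1 host gene
        or (3 * seq["checkv_viral_genes"] <= seq["checkv_host_genes"] and not_provirus)  # (3) host genes >= 3x viral genes, not a provirus
        or (seq["length_bp"] > 500_000 and seq["vs2_hallmark_genes"] <= 1)               # (4) > 500 kb and <= 1 hallmark gene
        or (seq["dvf_score"] <= 0.7 and seq["dvf_pvalue"] <= 0.05)                       # (5) DeepVirFinder score <= 0.7 and p-value <= 0.05
    )
```

Per the caption, these rules contribute to a per-sequence viral score, with addition rules raising it and removal rules lowering it; the exact weighting is defined in the authors' pipeline rather than in this sketch.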

Figure S7.
Figure S7. Overview of the change in MCC, precision, recall, and proportion viral for each ruleset with the inclusion of the tuning removal and tuning addition rules. The percent difference is greater than zero if the performance metric is higher with the tuning rules.

Figure S8.
Figure S8. Scatter plot of recall and precision for virus versus non-virus sequences on mock environmental metagenomic data sets. Each point represents a combination of tools run on a single testing set. Points are colored by how many rules were used.

Figure S9.
Figure S9. Comparison of the number of hallmark genes (VirSorter2) and the number of viral genes (CheckV), faceted by sequence type.

Figure S10.
Figure S10. Comparison of the number of host genes (CheckV) and the number of viral genes (CheckV), faceted by sequence type.

Figure S11.
Figure S11. Comparison of the percentage of viral genes (CheckV) and sequence length, faceted by sequence type.

Figure S12.
Figure S12. Comparison of sequence length and the number of host genes (CheckV), faceted by sequence type.

Figure S13.
Figure S13. Comparison of percent unknown (CheckV) and contig length, faceted by sequence type.

Figure S15.
Figure S15. Histogram of the number of hallmark genes (VirSorter2), faceted by sequence type and split by whether a sequence has more than 2 hallmark genes.

Figure S16.
Figure S16. Comparison of Kaiju taxonomic assignment and percent viral (CheckV). All plots are faceted by true sequence type.

Figure S17.
Figure S17. Performance of four representative rulesets across viral types. Number of sequences with a given viral score for four representative rulesets (high recall, high precision, high MCC, and all), faceted by VirSorter2 viral group (NA represents sequences not called viral by VirSorter2) and separated into (A) virus and (B) non-virus sequences. Note that the x-axes vary between panels and facets.

Figure S19.
Figure S19. Proportion of viruses predicted by each tool combination across five environmental data sets. Select tool combinations are colored to highlight their accuracy on the testing set. Rulesets with statistically equivalent MCC/precision/recall (after Bonferroni correction for multiple t-tests) are colored the same; vs2+tnv and vs2 were in both the high-MCC and high-precision groups. Tool combinations are sorted left to right by increasing MCC as shown in Figure X.