ENVirT: inference of ecological characteristics of viruses from metagenomic data

Background Estimating the parameters that describe the ecology of viruses,particularly those that are novel, can be made possible using metagenomic approaches. However, the best-performing existing methods require databases to first estimate an average genome length of a viral community before being able to estimate other parameters, such as viral richness. Although this approach has been widely used, it can adversely skew results since the majority of viruses are yet to be catalogued in databases. Results In this paper, we present ENVirT, a method for estimating the richness of novel viral mixtures, and for the first time we also show that it is possible to simultaneously estimate the average genome length without a priori information. This is shown to be a significant improvement over database-dependent methods, since we can now robustly analyze samples that may include novel viral types under-represented in current databases. We demonstrate that the viral richness estimates produced by ENVirT are several orders of magnitude higher in accuracy than the estimates produced by existing methods named PHACCS and CatchAll when benchmarked against simulated data. We repeated the analysis of 20 metavirome samples using ENVirT, which produced results in close agreement with complementary in virto analyses. Conclusions These insights were previously not captured by existing computational methods. As such, ENVirT is shown to be an essential tool for enhancing our understanding of novel viral populations. Electronic supplementary material The online version of this article (10.1186/s12859-018-2398-5) contains supplementary material, which is available to authorized users.

M = Number of genotypes (richness) L = Average genome length of each genotype (bp) f i = Relative abundance of the i th genotype (i ∈ 1, ..., M ) R = Number of reads r = Read length (bp) o = Minimum overlap distance considered in assembling reads (bp) (C 1 , C 2 , C 3 , ..., C R ) = Observed contig spectrum, where C q (q ∈ 1, 2, 3, ..., R) is the observed number of contigs each having exactly q reads. O q = q.C q = Number of reads out of the total R that contributed to observed contigs that have exactly q reads (q ∈ 1, 2, 3, ..., R).
An important assumption made in this formulation is that the f i s follow one of the four theoretical distributions: power-law, exponential, logarithmic or lognormal.
If f i s have a power-law distribution; If f i s have an exponential distribution; If f i s have a logarithmic distribution; If f i s have a lognormal distribution; where erf denotes the error function and erf −1 denotes the inverse error function.
All four functional forms of f i (i.e. equations 1, 2, 3 and 4) depends on M and a distribution specific parameter d. Let us denote the function giving the relative abundance of the i th genotype as F i (M, T, d) where T denotes the distribution function.
If the expected number of reads contributing to contigs having exactly q number of reads is E q (q ∈ {1, 2, 3, ..., R}); 1 where, Accordingly, the expected contig spectrum of a metagenome having population parameters M, L, T, d and, sequenced and assembled with parameters R, r, o is; Given We use the variance weighted squared difference between as the similarity measure between the observed and expected contig spectra. where, S(M, L, T, d) has multiple local minima and one global minimum with highly similar characteristics for given values of R, r, o and (C 1 , C 2 , C 3 , ...). Consequently, our goal now is to find the values of M, L, T and d when S(M, L, T, d) is at its global minimum.
In order to understand the effect of the presence of multiple local minima, let us consider a population where d = 0. For any case of T , F i (M, T, 0) = 1 M . In other words d = 0 corresponds to a population where all M number of genotypes are equally abundant (this is a highly unlikely scenario in a real population). Let us simplify above equations for d = 0.

Equation 6 simplifies to
Therefore, p is independent of i and depends only on the product term L.M .
Accordingly, equation 5 simplifies to Simplified E q depends only on p which is a function of L.M .
This result implies that, for a given sample, (E 1 , E 2 , E 3 , ..., E R ) will be identical for different  Figure S2 shows an example of how the cost function S(M, L, T, d) varies over the region 1000 ≤ M ≤ 50000, 5000 ≤ L ≤ 100000 and d ∈ {0.6, 0.7, 0.8} for a simulated contig spectrum with parameters M 0 = 10000, L 0 = 50000bp, T 0 = power − law, d 0 = 0.7, R = 10000, r = 100bp and o = 40bp (subscript 0 indicates the true value used to simulate the population). We observe that, when d = 0.7 (Figure 3(c)) there exist multiple local minima and a unique global minimum having the value 0. When d = 0.7 (Figures 3(a) and 3(e)), there still exist multiple local minima and a unique global minimum having values greater than 0. Hence, empirically we observe that for populations with d > 0, there exist a unique global minimum with S(M, L, T, 0) = 0 at M 0 , L 0 , T 0 and d 0 . Therefore, a unique global minimum is expected to be found when d > 0 even in the presence of multiple local minima. However, finding the unique global minimum cannot be guaranteed using a heuristic algorithm without utilizing appropriate niching strategies.