Error correction and diversity analysis of population mixtures determined by NGS

The impetus for this work was the need to analyse nucleotide diversity in a viral mix taken from honeybees. The paper has two findings. First, a method for correction of next generation sequencing error in the distribution of nucleotides at a site is developed. Second, a package of methods for assessment of nucleotide diversity is assembled. The error correction method is statistically based and works at the level of the nucleotide distribution rather than the level of individual nucleotides. The method relies on an error model and a sample of known viral genotypes that is used for model calibration. A compendium of existing and new diversity analysis tools is also presented, allowing hypotheses about diversity and mean diversity to be tested and associated confidence intervals to be calculated. The methods are illustrated using honeybee viral samples. Software in both Excel and Matlab and a guide are available at http://www2.warwick.ac.uk/fac/sci/systemsbiology/research/software/, the Warwick University Systems Biology Centre software download site.


INTRODUCTION
Next generation sequencing (NGS) is an extremely powerful tool for the analysis of mixed populations in ecology and biology, providing a means to assess community composition (metagenomics) whilst also aiding the discovery of new species (Radford et al., 2012). The relatively high error rate of NGS, however, limits the ability to analyze population diversity directly. Population diversity is a crucial ecological measure, relating particularly to selection pressure and (relative) evolutionary fitness. Diversity is critical within biological arenas as well, for instance T cell and viral diversity are key complementary measures of the immune system and infection status respectively. There is thus a pressing need to create tools to appropriately correct NGS error, to estimate diversity introduced by NGS and to estimate inherent virus diversity.
This study was motivated by the prevalence and pathogenicity of honeybee deformed wing virus (DWV) and related viruses. It is known that these may cause both asymptomatic low-level and symptomatic high-level infection in honeybees; the Varroa mite is the likely causal factor that produces a shift from the benign to the pathogenic state, correlating with a shift in the level of viral population diversity (Martin et al., 2012; Ryabov diversity and compare their values (the variance of diversity is incorrectly stated in Gregori et al. (2014), but corrected here in the Diversity analysis section). To conclude the review of literature on a broader note, in Schreiber & Brown (2002) a method for correcting systematic bias in the nucleotide distribution (the "distortion") in a genome is presented.
The novel stochastic method for NGS error correction of diversity measures introduced here is based on use of an evolutionary model (in the first instance, the Jukes-Cantor model (Jukes & Cantor, 1969)) and requires the availability of an accurately known sequenced sample. The corrected nucleotide distribution is used to calculate both the expected diversity and the diversity variance. These in turn can be used to test hypotheses about single diversities, or to test the hypothesis of equality of diversity across two samples. Hypotheses about mean diversities and comparisons of mean diversities of two samples are also discussed, taking into account correlation of diversities across the averaged sites. Finally, the ideas are used to produce a diversity threshold for a sample to be consistent with a clonal (zero nucleotide diversity) population. The methodology is illustrated with NGS datasets drawn from the honeybee study. The methods discussed extend our understanding of NGS error correction and provide a core set of tools useful in the study of diversity.

Error and diversity correction
We first correct the NGS error in the nucleotide distribution and then use this distribution to calculate the (corrected) diversity. Our starting point is the true (to be estimated) nucleotide distribution, p. Assuming independence of errors, the error introduced through NGS can be modelled by a 4 × 4 matrix M whose entries give the probability of a particular nucleotide change during sequencing, conditional on the initial value of the nucleotide, whence the theoretical distribution of nucleotides following sequencing is q = Mp. In practice we sequence finitely many nucleotides n so the empirical distribution isq with E(q) = q. The simplest error model is the Jukes-Cantor model, where a nucleotide mutates to any other with equal (low) probability α, summarised as

P(A|A) P(A|C) P(A|G) P(A|T) P(C|A) P(C|C) P(C|G) P(C|T) P(G|A) P(G|C) P(G|G) P(G|T) P(T|A) P(T|C) P(T|G) P(T|T)
Since Mp = p + α(1 − 4p), this transformation has the effect of moving all probabilities closer to 0.25, thereby increasing diversity. Richer mutation models can be incorporated, for example allowing different mutation rates between purines and pyrimidines; for this study the Jukes-Cantor model is sufficient for our purposes.

Figure 1 The stages of NGS nucleotide distribution error correction. (A)
Error is introduced to the true nucleotide distribution p, giving distributionq, the result of multinomial sampling with n (the coverage, the number of nucleotides sequenced) trials in which the true sequenced nucleotide distribution is given by q = M p . The error is corrected by formingr, the normalisation of (M −1q ) + . In practice M must be estimated (the calibration step), using a known initial distribution p. (B) The initial diversity H(p) increases to H(q) under NGS, but falls back to H(r) when corrected.
To utilize this correction method, we need the matrix M; in the case of the Jukes-Cantor model this requires an estimate of the mutation rate α. This can be achieved by estimating it on a sample of known composition and genotype, denoted here the "standard" sample. This provides a two-step method to correct an empirical nucleotide distributionq for NGS error. The first (calibration) step uses p standard andq standard to estimate α (asα, so giving an estimateM of M) while the second (correction) step usesM andq to estimate p withr, as follows:

Calibration
Step. This is the estimation of α, the error rate in NGS sequencing. We begin with a sample in which the nucleotide mix p standard = (p 1 ,p 2 ,p 3 ,p 4 ) T is known at each nucleotide position along a genome. We then sequence the mix using NGS, giving nucleotide mixq standard = (q 1 ,q 2 ,q 3 ,q 4 ) T . For each site we can solve forα usingq =Mp.
Equating ith components provides estimatesα i = (q i − p i )/(1 − 4p i ) for i = 1,...,4; these can be averaged across nucleotides to give a lower variance estimate of α, These estimates can be averaged across nucleotide positions to improve the estimate further.

Correction
Step. Given nucleotide distributionsq produced from NGS of a sample, form the corrected countsr = ( Example 1 in the Results section illustrates these calibration and correction steps with a numerical example. It is possible to estimate an upper bound for the residual error rate remaining after correction, given an NGS error rate α and coverage n. This is done by using the calibration step with p standard = (1,0,0,0) andq standard =r, denoting the estimated residual error rate byβ. Herer is found by first generating n values using q = M(α)p, then correcting using rate α. A graph ofβ against α is given in Fig. 2, for the case p standard = (1,0,0,0) and n = 1,000, 2,000, 5,000, 10,000 and 20,000. This demonstrates, for these parameters, that correction reduces the error by a factor of over 10 for a coverage of n = 1,000. That this is an upper bound is made clear in the later Discussion section.
Given a nucleotide probability mass distribution This can be thought of as the geometric mean  i p p i i of the distribution probability masses, transformed by taking the natural logarithm of the reciprocal; then no diversity, for example (1,0,0,0), has H = 0 while complete diversity, a uniform distribution of (0.25, 0.25, 0.25, 0.25), has H = 1. Figure 3 shows the graph of a component −xlog 4 x of diversity. The steep slope at the extremes, particularly at zero, shows that correction to a component at these extremes has most effect on diversity measurement. Of interest here is detection of clonality, where H = 0, so correction of very small probabilities is critical. Figure 2 The estimated residual error rateβ after correction plotted against the initial error rate α, when the true initial distribution is the degenerate p = (1,0,0,0). The larger the coverage n, the more accuratelyq estimates q and so the more dependable the correction, hence the smaller is β. Curves are shown for coverage n of 1,000 (blue), 2,000 (green), 5,000 (red), 10,000 (cyan) and 20,000 (magenta). Each point on the graph is the mean of N = 5,000 replicate trials. With the corrected nucleotide distribution available, the Shannon diversity can be estimated aŝ (2) Figure 1B schematically shows the relative levels of H(p) (true diversity), H(q) (uncorrected diversity) and H(r) (corrected diversity). We conclude this section by remarking that the data used for calibration should ideally be sequenced under similar conditions to the data which is to be corrected.

Diversity analysis
Two questions are key in an analysis of population diversity. Firstly, whether the diversity H is equal to a given value H and secondly, whether two populations have the same diversity, or alternatively whether one is more diverse than the other. Specifically, we want to test null hypotheses, firstly of the form H 0 : H = H and secondly of the form H 0 : H 1 = H 2 , for two samples labeled '1' and '2' . We will examine both of these tests for diversity measured at a single nucleotide position and for the mean diversity across segments comprising multiple (N) possibly correlated sites. In all cases we will have one or more true nucleotide distributions p, true post NGS nucleotide distributions q = Mp, estimatesq determined using a sample of n nucleotides drawn from q, soq i = n i /n for each i = 1,...,4, where n i is the number of draws resulting in the ith nucleotide and finally, corrected distributionsr. Givenr,Ĥ = H(r), as in (2), estimates the true diversity H = −  i p i log 4 p i with H evaluated at corrected valuesr of estimatesq of true values after error introduction, q. When n is large, this approximation will be close to the true value. In general, since the diversity function is concave, diversity is increased when we replace p with q and decreased when we replace q withq (sinceĤ is known to underestimate H (Hutcheson, 1970)) and thenr (sincer correctsq). This compensation is fortuitous.
The variance of Shannon diversity Var(Ĥ) at a nucleotide position with coverage n and empirical distributionr = (r 1 ,r 2 ,r 3 ,r 4 ) is approximated to first order (Hutcheson, 1970) with the following expression, This can be shown using the Δ-method and that Var( For testing and estimation we also need the variance of the mean of diversities averaged across a segment of sites. For ease of presentation (though it is readily generalised), we now assume that (3) provides the variance of estimated diversity at each nucleotide position, denoted s 2 . Empirical evidence shows that diversity is correlated across positions. We assume an exponential decay of the diversity correlation, with correlation between Table 1 Test statistics and two-sided 95% confidence intervals, in brackets, for diversity testing and estimation with a single position and one or two samples (columns). A population mean is denoted by µ. For the corresponding formulae involving means across multiple positions,H replacesĤ.

Number of samples
 nucleotide positions i and j given by ρ j−i+1 . Then the average diversity across N consecutive nucleotide positions, denotedH, has variance Var(H) given by We estimate ρ using the correlation between adjacent diversities (the one-step lagged correlation) along the segment of interest. With these variance estimates we can conduct hypothesis tests and construct confidence intervals in all cases considered. These are summarised in Table 1 and illustrated numerically in Example 2 in "Results".

Clonal threshold estimation
Suppose that a clonal sample (one with zero diversity) is sequenced by next generation methods with error rate α. This is corrected to levelβ, as shown in Fig. 2. This level of error remains in the sample, whence an upper 95% threshold forH, the average diversity across N sequential positions, with lagged correlation ρ, is where z 0.95 = 1.64 is the 95th percentile of the standard normal distribution. Note that this threshold depends on both the coverage n and the segment length N. A larger n typically decreasesH and so the threshold level, while a larger N decreases the width of the confidence interval aroundH, so also decreases the threshold level.

Sequence data
Three viral samples were sequenced by next generation sequencing (Illumina HiSeq1000). These comprised a sample that could be used for estimation of the error (standard sample), and two experimental samples, from low viral load and high viral load bees. After trimming adapters and barcodes, the first reads (the reads are paired), 101nts long, were used to calculate the NGS error rate and illustrate the methods.

Standard sample: an accurately known mix of two viral recombinants, X59
This was composed of two of the DWV-like viral recombinant RNA genomes described in Moore et al. (2011), VDV-1 DVD and VDV-1 VVD (NCBI Accession Nos. HM067437 and HM067438 respectively). The mixture was produced by in vitro RNA transcription using linearized plasmid clones with full-length cDNA inserts of VDV-1 DVD and VDV-1 VVD (Moore et al., 2011) with the T7 mMESSAGEmMACHINE kit (Ambion). The RNA transcripts were purified using RNAeasy columns (Quiagen), quantified, mixed 25% VDV-1 DVD , 75% VDV-1 VVD and sequenced using the Illumina platform protocol. The NGS reads are available in the EBI Sequence Read Archive, study accession PRJEB5249, ERS395188. The reads were aligned using Bowtie 2 (Langmead & Salzberg, 2012) to a single reference sequence (VDV-1 DVD ). SAMtools mpileup (Li et al., 2009) was used to produce the number of nucleotides covering each position in the reference. We excluded positions of mismatches between the VDV-1 DVD and VDV-1 VVD sequences, about 4% of the genome length. This produced nucleotide pileups where all diversity was a result of the methods used during NGS sequence acquisition. This dataset is used for calibration of NGS error.

A viral population mix from low DWV level (F3) and from high DWV level (E7) honeybees
Samples from honeybee nurse bees for which qRT-PCR showed either low DWV levels (EBI SRA study accession PRJEB5249, ERS395182) or high DWV levels (EBI SRA study accession PRJEB5249, ERS395180) were used. In each case, in order to assemble the required nucleotide distribution data, the following computational steps were carried out. "First reads" of length 101nt were aligned using Bowtie 2 to DWV (NC 004830) and VDV-1 (NC 006494) reference sequences. SAMtools mpileup was then used to produce the number of nucleotides covering each position in the references.

Deformed wing virus
A map of the deformed wing virus adapted from Lanzi et al. (2006) is shown in Fig. 4. In the Results section we select a single site in the helicase region of the genome (Examples 2.1 and 2.2), and average across the capsid region (Examples 2.3 and 2.4).

Example 1. Calibration and error correction (i) Calibration-estimation of the NGS error rate α.
Nucleotide pileups resulting from the X59 mix of two viral recombinants were used for calibration of the NGS error rate. Nucleotide positions where DWV and VDV-1 are identical were selected; for these the true distribution is one of p = (1,0,0,0), (0,1,0,0), (0,0,1,0) or (0,0,0,1). For example, the upper half of Table 2 shows the pileup counts for the first three capsid (see Fig. 4) nucleotide sites. Since the p here place all weight on one nucleotide, Eq. (1) reduces, when p = (1,0,0,0), for example, to (q 2 +q 3 +q 4 )/3 or the average error proportion. Thus, for the first position in Table 2 we would estimate α as (0.0010 + 0.0016 + 0.0008)/3 or 0.0011; for the second and third positions the corresponding averages are 0.0014 and  (ii) Correction of the distributionsq. This was carried out in three ways.
(1) Within the capsid region of X59, usingα = 0.001949 calculated using the entire capsid region. Correction of the distributionq in the first three capsid positions, shown in the upper half of Table 2, yielded ther distributions shown; each is corrected to the exact distribution. (The Excel software available was used to also test the calibration-correction routine across a central region in the helicase and a non-structural region downstream of the helicase; in both cases the degenerate distributions were found with excellent accuracy.) (2) Within the capsid region of E7, again usingα = 0.001949 calculated using the entire capsid region of the independent standard sample X59. The accurate results are shown in the lower half of Table 2.

Single site, two populations
We test whether the diversities are equal at nucleotide position 5201 for F3 (first sample) and E7 (second sample), using the data in Example 2.1. The test statistic z described in Table 1 is 1.6814 with p-value = 0.0463. Hence we reject the null hypothesis of equal true diversities for the two populations at this site, concluding that F3 has higher diversity. A 95% confidence interval for the true diversity difference is [−0.0184,0.2404].

Multiple sites, single population
We test whether the mean diversity across the capsid region (see Fig. 4) of the viral genome is zero, for both E7 and F3 (N = 2,843). For E7,H = 0.0037, the serial correlation of diversities is estimated asρ = 0.1728 and mean site diversity variance is s 2 = 7.714 × 10 −7 whence Var(H) = 3.847 × 10 −10 using (5), giving z = 187.5 so that the p-value is negligible. Here the sample size is so large and hence the power so great that we conclude that the true diversity is non-zero. Of greater interest is a 95% confidence interval for the true diversity, namely [0.0036,0.0037]. Corresponding calculations for the more varied F3 sample give values ofH = 0.0410,ρ = 0.1195, s 2 = 0.0019, Var(H) = 8.702 × 10 −7 , z = 43.98 and again a negligible p-value. A 95% confidence interval for the true mean diversity value is [0.0392,0.0429].

Multiple sites, two populations
We test the null hypothesis that the true mean diversities, over the capsid region, of F3 (first sample) and E7 (second sample) are equal. Using values already given in Example 2.3, we find that z = 40.03, again giving a negligible p-value. Hence we conclude that the mean diversity of F3 is greater than that of E7. A 95% confidence interval for the true mean diversity difference is [0.0355,0.0392].


Var(H(β)) = 0.0008940 using (3). The 95% clonal diversity threshold for a single site is therefore 0.002748 + 1.64 × 0.0008940 or approximately 0.0042. For mean diversity, with N = 2,843 andρ = 0.1728 (appropriate for E7 on the capsid) the 95% clonal mean diversity threshold is H(β) + 1.64(s/ Figure 5 illustrates the confidence intervals found in Example 2.3, labelled "After correction", together with corresponding confidence intervals for the uncorrected data, labelled "Before correction". The corrected clonal mean diversity threshold of Example 3 is also shown. The highly significant difference between the Varroa-free and Varroa-infested nucleotide diversities are evident. Correction of the low diversity sample has a far greater effect on diversity than correction of the high diversity sample, due to the steep slope of the diversity component near zero (Fig. 3). The corrected low diversity sample lies just above the clonal threshold.

DISCUSSION
Here we have proposed a simple method for correction of nucleotide distributions containing errors arising from NGS. The correction method assumes an independent site error model. Throughout we have used the simplest error matrix M, the Jukes-Cantor matrix, with the error rate assumed to be constant across the genome and mutation parameter fitted from a standard (known genotype) sample. The method could be extended to use a richer evolutionary model, with parameters tailored to regions of the genome. For instance, the two-parameter Kimura model assumes transversion and transition errors differ, while the Generalized Kimura model adds a further parameter, allowing transversion error probabilities to differ according to transversion direction (Ewens & Grant, 2005). At an extreme, a full 12-parameter model could be used, fitted in sets of three consecutive nucleotides, since four linear equations are available per position. In principle, both the statistical nature of the error mechanism of the NGS platform and errors introduced in sample generation (e.g., PCR) should be captured in the model used.
A geometric view of the correction process is shown in Fig. 6. The tetrahedron is the locus of all probability mass functions on the four points {A,C,G,T}, with the vertices Figure 5 Confidence intervals for true mean diversity across the capsid region, for the high diversity (F3) and low diversity (E7) samples, both before and after correction for the NGS error. Correction has a larger effect when the (uncorrected) diversity is low, clearly revealing the reduction in diversity from F3 (low DWV level) to E7 (high DWV level). The clonal threshold for the mean shows that the corrected E7 data plausibly has non-zero diversity. (since lower α introduces lower error) and decreases as the coverage n increases (since as n increases,q becomes closer to q).
The diversity testing and estimation stage uses a diversity correlation structure. We have assumed serial correlation of diversities, since this was seen in the samples considered. Other correlation structures could be employed, such as a structure that decays more rapidly with separation.
The method is able to correct in multi-genome situations which appear to lie beyond the reach of other methods. All correction methods in the literature known to the authors are variants of alignment of reads to a reference genome and the taking of a consensus. If the original sample is a mix of two genomes then such methods are unable to correct at sites of variation (since the minor variant will generally be corrected to the major variant), whereas the proposed method estimates the error rate (using a calibration sample) and so is able to correct, even at sites of variation.
Finally, it may be possible to use the "clonal threshold" to detect SNPs or the more general SNVs. This is a topic for future research.

CONCLUSION
Movement of a virus to a new host can trigger changes in the level of diversity of variants of the virus. Such diversity changes can be detected using next generation sequencing of the viral mix, but error introduced in the process may mask a diversity change. We have presented a method for estimating measurement-error-corrected diversity from NGS data, both at single nucleotide sites and along sections of a genome. A simple but effective method is used to correct for NGS errors in the nucleotide distribution. With these corrected estimates we are able to compare diversity between samples, assessing whether they are consistent with equal diversity, and also assess whether a diversity estimate is consistent with a clonal population. We demonstrated our method within the context of viral diversity in bees that either have low viral load or high viral load, showing that their viral diversity is significantly different and that under high viral load the viral population is near-clonal. Correction for NGS error was particularly important when applied to the sample with low diversity, since the diversity estimate decreased by a factor of more than five. This analysis has demonstrated that the Varroa mite has a significant effect on the DWV population in honeybees, due to the emergence of a virulent strain that is preferentially replicated after transmission by Varroa. This results in the striking reduction in viral diversity observed in Varroa-exposed pupae and symptomatic worker bees. Our method is applicable to any population structure analysis using NGS data and thus adds a valuable tool to the study of selection pressure and differential fitness within populations.

ADDITIONAL INFORMATION AND DECLARATIONS Funding
This work was supported by the Biotechnology and Biological Sciences Research Council, the Department for Environment, Food and Rural Affairs, the Natural Environment Research Council, the Scottish Government and the Wellcome Trust, under the Insect