Variational inference for rare variant detection in deep, heterogeneous next-generation sequencing data

Background The detection of rare single nucleotide variants (SNVs) is important for understanding genetic heterogeneity using next-generation sequencing (NGS) data. Various computational algorithms have been proposed to detect variants at the single nucleotide level in mixed samples. Yet, the noise inherent in the biological processes involved in NGS technology necessitates the development of statistically accurate methods to identify true rare variants. Results We propose a Bayesian statistical model and a variational expectation maximization (EM) algorithm to estimate non-reference allele frequency (NRAF) and identify SNVs in heterogeneous cell populations. We demonstrate that our variational EM algorithm has comparable sensitivity and specificity compared with a Markov Chain Monte Carlo (MCMC) sampling inference algorithm, and is more computationally efficient on tests of relatively low coverage (27× and 298×) data. Furthermore, we show that our model with a variational EM inference algorithm has higher specificity than many state-of-the-art algorithms. In an analysis of a directed evolution longitudinal yeast data set, we are able to identify a time-series trend in non-reference allele frequency and detect novel variants that have not yet been reported. Our model also detects the emergence of a beneficial variant earlier than was previously shown, and a pair of concomitant variants. Conclusions We developed a variational EM algorithm for a hierarchical Bayesian model to identify rare variants in heterogeneous next-generation sequencing data. Our algorithm is able to identify variants in a broad range of read depths and non-reference allele frequencies with high sensitivity and specificity. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-1451-5) contains supplementary material, which is available to authorized users.


INTRODUCTION
Massively parallel sequencing data generated by next-generation sequencing technologies is routinely used to interrogate SNVs in research samples [Koboldt et al., 2013].For example, deep sequencing confirmed the degree of genetic heterogeneity of HIV and influenza [Flaherty et al., 2011;Ghedin et al., 2011].Many solid tumors are represented as having intra-tumor heterogeneity by DNA sequencing technology, where SNVs are rare in the population [Navin et al., 2010].Also, whole genome sequencing has revealed that many beneficial mutations of minor allele frequencies are essential to respond to dynamic environments [Kvitek and Sherlock, 2013].However, rare SNVs identification in heterogeneous cell populations is challenging, because of the intrinsic error rate of next generation sequencing platform [Shendure and Ji, 2008].Thus, there is a need for accurate and scalable statistical methods to uncover SNVs in heterogeneous samples.
A number of computational methods have been developed to detect SNVs in large scale genomic data sets.These methods can be roughly categorized as probabilistic or heuristic or some combination.Among all of the current probabilistic methods, the Bayesian probabilistic framework has been increasingly used to calculate the unobserved quantities such as variant allele frequency given the observed genomic sequencing data.GATK [McKenna et al., 2010] and SAMTools [Li et al., 2009] use a naive Bayesian decision rule to call variants.EBCall models sequencing errors based on a Beta-Binomial distribution, where the parameters and latent variables are estimated from a set of non-paired normal sequencing samples [Shiraishi et al., 2013].However, the error rate of normal sequencing samples could be unmatched with the error rate of the target samples, which may cause a problem of making false negatives calls [Wang et al., 2013].CRISP compares aligned reads across multiple pools to obtain sequencing errors, and then distinguishes true rare variants from the sequencing errors [Bansal, 2010].However, the bottleneck of CRISP is its low computational efficiency due to a calculation of a large amount of contingency tables.
Generally, independent modeling method models the genotypes of tumour and normal samples separately and then looks for the difference between them.In contrast to independently analyzing a tumour-normal pair, joint modeling method models a joint-genotype of the tumour and normal samples simultaneously.JointSNVMix introduces two Bayesian probabilistic models (JointSNVMix1 and JointSNVMix2) to jointly analyze a tumour-normal paired allelic count of NGS data [Roth et al., 2012].JointSNVMix derives an expectation maximization (EM) algorithm to calculate maximum a-posteriori (MAP) estimate of latent variables in a particular probabilistic graphical model.Furthermore, Roth et al. [2012] reveals that the joint modeling method, JointSNVMix1, observes 80-fold reduction of false positives compared with its independent analogue (SNVMix1).SomaticSniper [Larson et al., 2012] models the joint diploid genotype likelihoods for both tumour and normal samples.Additionally, Strelka [Saunders et al., 2012] models the joint probabilistic distribution of allele frequencies for both tumour and normal samples, which is demonstrated to be more accurate compared with the methods based on the estimated allele frequency tests between tumour and normal samples.
An alternative strategy to probabilistic methods is heuristic methods that use a set of criteria to select variant positions instead of modeling the data using probabilistic distributions.VarScan is an example of a heuristic method that compares tumour and normal samples by satisfying some lower bounds, such as a certain variant allele frequency and a number of allele counts [Koboldt et al., 2012].SNVer focuses on a frequentist method that is able to calculate P -values, but Wei et al. [2011] pointed out that this approach fails to model sampling bias that will reduce the power of detecting true rare variants.
In previous work, we developed a Beta-Binomial model to characterize a null hypothesis error rate distribution at each position.Using this rare variant detection (RVD) model, we call rare variants by comparing the error rate of the sample sequence data to a null distribution obtained from sequencing a known reference sample [Flaherty et al., 2011].RVD can identify mutant positions at a 0.1% fraction in mixed samples using high read depth data.
We improved upon that work, RVD2, by using hierarchical priors to tie parameters across positions [He et al., 2015].We derived a Markov Chain Monte Carlo (MCMC) sampling algorithm for posterior inference.However, the main limitation of MCMC is that it is hard to diagnose convergence and may be slow to converge [Jordan et al., 1999].An alternative method, that we explore here, is to use variational inference, which is based on a proposed variational distribution over latent variables.By optimizing variational parameters, we fit an approximate distribution that is close to the true posterior distribution in the sense of the Kullback-Liebler (KL) divergence.Variational inference can now handle nonconjugate distributions and tends to be more computationally efficient than MCMC sampling [Peterson and Hartman, 1989].
Here, we propose a variational EM algorithm for our Bayesian statistical model to detect SNVs in heterogeneous NGS data.We show that variational EM algorithm has comparable accuracy and efficiency compared with MCMC in a synthetic data set.In section 2, we define the model structure.In section 3, we derive our variational EM algorithm to approximate the posterior distribution over latent variables.In section 4, we call a variant by a posterior difference hypothesis test between the key model parameters of a pair of samples.In section 5, we compare the performance of the variational EM inference algorithm to the MCMC sampling method and the state-of-the-art methods using a synthetic data set.We also show that our variational EM algorithm is able to detect rare variants and estimate non-reference allele frequency (NRAF) in a longitudinal directed evolution experimental data set.

MODEL STRUCTURE
Our Bayesian statistical model is shown as a graphical model in Figure 1.In the model, r ji is the number of reads with a non-reference base at location j in experimental replicate i; n ji is the total number of reads at location j in experimental replicate i.The model parameters are: µ 0 : a global non-reference read rate that captures the error rate across all the positions, M 0 : a global precision that captures the variation of the error rate across positions in a sequence, M j : a local precision that captures the variation of the error rate at position j across different replicates.
The latent variables are: µ j ∼ Beta(µ 0 , M 0 ): position-specific non-reference read rate for position j, θ ji ∼ Beta(µ j , M j ): the non-reference read rate for position j in replicate i.

VARIATIONAL EXPECTATION MAXIMIZATION (EM) INFERENCE
We developed a non-conjugate variational inference algorithm to approximate the posterior distribution, where the parameters are φ {µ 0 , M 0 , M }.
3.1.Factorization.We propose the following factorized variational distribution to approximate the true posterior over latent variables µ j and θ ji .Here, q(µ j ) approximates the variational posterior distribution of µ j , which represents the local error rate distribution in position j across different replicates; and q(θ ji ) approximates the posterior distribution of θ ji , which is the error rate distribution in position j replicate i.
(3) 3.2.Evidence Lower Bound (ELBO).Given the variational distribution, q, the log-likelihood of the data is lower-bounded according to Jensen's inequality, The function L(q, φ) is the evidence of lower bound (ELBO) of the log-likelihood of the data, which is the sum of q-expected complete log-likelihood and the entropy of the variational distribution q.The goal of variational inference is to maximize the ELBO.Equivalently, q is chosen by minimizing the KL divergence between the variational distribution and the true posterior distribution.
The ELBO can be expanded as We write out each component below.The detail of the derivation of and Therefore, we need to compute the following expectations with respect to the variational distribution: .
We select the functional forms for the variational distributions q(θ) and q(µ) to facilitate these expected value computations.
3.3.Variational Distributions.Since θ and r are conjugate pairs, the posterior distribution of θ ji is a Beta distribution, Therefore, we propose a Beta distribution with parameter vector δ ji as variational distribution, The posterior distribution of µ j is given by its Markov blanket, This is not in the form of any known distribution.But, since the support of µ j is [0, 1], we propose a Beta distribution with parameter vector γ j as variational distribution, Given these variational distributions, we have where ψ is the digamma function.
Since there is no analytical representation for E q log , we must resort to numerical integration, Here q(µ j ; γ j1 , γ j2 ) is the probability density function of the Beta distribution that is calculated using the Python built-in function scipy.stats.beta.pdf,and log is calculated using the Python built-in function scipy.special.betaln.Unfortunately, this numerical integration step is computationally expensive.Finally, the entropy terms can be computed as follows, 3.4.Variational EM Algorithm.Variational EM algorithm maximizes the ELBO on the true likelihood by alternating between maximization over q (E-step) and maximization over φ = {µ 0 , M 0 , M } (M-step).

(E-step):
Updating the variational distributions.The terms in the ELBO that depend on q(θ ji |δ ji1 , δ ji2 ) are We update the variational parameters by numerically optimizing δji1 , δji2 = arg max δ ji1 ,δ ji2 L [q(θ ji )] subject to the constraints that δ ji1 ≥ 0 and δ ji2 ≥ 0 and conditioned on fixed values for the other model and variational parameters using Sequential Least SQuares Programming (SLSQP) [Kraft and others, 1988].
We update the variational distribution q(µ j ) using the partial ELBO depending on γ j from each position j (16).
Again, we update the variational parameters by numerically optimizing γj1 , γj2 = arg max γ j1 ,γ j2 L [q(µ j )] subject to the constraints that γ j1 ≥ 0 and γ j2 ≥ 0 and conditioned on fixed values for the other model and variational parameters using SLSQP [Kraft and others, 1988].The computational cost of optimizing ( 16) is high because of the quadrature of E q log in (12).

(M-step):
Updating the model parameters.We can write out the ELBO as a function of each model parameter µ 0 , M 0 , and M j as follows.
The ELBO with respect to µ 0 is The ELBO with respect to M 0 is The ELBO with respect to M j is We also use SLSQP to optimize the ELBO function with respect to each parameter, µ 0 , M 0 , and M j .It is computationally easy to optimize µ 0 (17) and M 0 (18).However, it is costly for optimizing M j (19) because the quadrature is needed to calculate E q log Γ(M j ) Γ(µ j M j )Γ(M j (1−µ j )) using ( 12).
The variational EM algorithm is summarized using pseudocode in Algorithm 1.

HYPOTHESIS TESTING
The posterior distribution over µ j | r case , r control µ j |r case − µ j |r control is the distribution over the change in the non-reference read rate at position j between a case and control sample.However, we cannot compute the posterior distribution, µ j |r case , r control , because we do not have access to the component posterior distributions µ j |r case and µ j |r case .But, by approximating the distribution of µ j |r case and µ j |r control as Gaussians with moments matched to the variational posterior distributions, the distribution of µ j |r case , r control can be approximated as a Gaussian.
Under the variational approximation, for µ j |r case and likewise for µ j |r control .We approximate the posterior for the case sample as and likewise for the control.Then, Now, we can approximate the posterior probability of interest, that is, the posterior probability that the difference in the non-reference read rate is greater than a fixed effect size τ (e.g.zero) for a one sided test.For a two sided test, we compute the approximate probability , where the probability is approximated as described.
It is possible that a position is called a variant due to a differential non-reference read count, but no particular alternative base is more frequently observed than the others.In this case, the likely case is a sequencing error that indiscriminately incorporates a non-reference base at the position.To discriminate this non-biological cause from the interesting true variants we use a χ 2 goodness-of-fit test for non-uniform base distribution [Efron, 2010;He et al., 2015].For each provisional variant, if we reject the null hypothesis that the distribution is uniform, we promote the position to a called variant.
5.1.1.Synthetic DNA Sequence Data.Two random 400bp DNA sequences were synthesized.One sample with 14 loci variants is taken as the case and the other without variants is taken as the control.Case and control samples were mixed to yield defined NRAF of 0.1%, 0.3%, 1.0%, 10.0%, and 100.0%.The details of the experimental protocol are available in Flaherty et al. [2011].The synthetic DNA data were downsampled by 10×, 100×, 1, 000×, and 10, 000× using picard (v 1.96).The final data set contains read pairs for six replicates for the control and cases at different NRAF levels.
5.1.2.Longitudinal Directed Evolution Data.The longitudinal yeast data comes from three strains of haploid S288c which were grown for 448 generations under limited-glucose (0.08%).The wild-type ancestral strain GSY1136 was sequenced as a reference.Aliquots were taken about every 70 generations and sequenced.The detail of library sequencing is described in Kvitek and Sherlock [2013], Bansal [2010], and Kao and Sherlock [2008].The Illumina sequencing data are available on the NCBI Sequence Read Archive (SRA054922) [Kvitek and Sherlock, 2013].For this study, we received the original BAM files from one of the authors.The aligned BAM files have 266 -1, 046× coverage.We used samtools (v 1.1) with -mpileup -C50 flags to convert BAM files to pileup files.Then, we generated depth chart files, which are tab-delimited text files recording the count of the number of nucleotides in columns and genomic positions in rows.We ran our variational inference algorithm on the depth chart files to identify SNVs.

5.2.1.
Comparison of Sensitivity and Specificity.We compared the performance of our variational EM algorithm and the MCMC sampling algorithm in terms of sensitivity and specificity (Table 1).We used the posterior distribution test and χ 2 test to detect variants for a broad range of median read depths and NRAFs.The variational EM algorithm shows higher sensitivity and specificity than the MCMC algorithm in the events when NRAF is 0.1%.The variational EM algorithm has a higher specificity compared with the MCMC algorithm for a median read depth of 41, 472× at 0.3% NRAF and 55, 489× at 1.0% NRAF, but the sensitivity is slightly lower due to false negatives.2 shows the approximate posterior distribution of the variational EM algorithm and samples of the MCMC algorithm.One variant position, 85, is taken as an example to show the comparison of the approximated posteriors.The variational EM and MCMC algorithms both identify all the variants when NRAF is 10.0% and 100.0%.The variational EM algorithm calls 90 false positive positions without a χ 2 test when NRAFs are 0.1% and 0.3% for low median read depth (30× and 400×).This is to be expected because it is highly unlikely to correctly identify a variant base with a population frequency of 1 in 1, 000 with less than a 1, 000× read depth.
A false positive, a non-mutated position that is called by the variational EM algorithm but not called by the MCMC algorithm, is shown in Figure 3.The variance of MCMC posterior estimate is higher than that of the variational posterior estimate.We tested 10 random initial values variational inference algorithm and found the approximate posterior distributions from the variational EM algorithm are essentially equivalent for all random initializations.It is notable that the shape of the proposed Beta variational distribution is well approximated by a Gaussian.

5.2.3.
Comparison to the State-of-the-Art Methods.We compared the performance of our variational EM algorithm with the state-of-the-art variant detection methods, SAMtools [Li et al., 2009], GATK [McKenna et al., 2010], VarScan2 [Koboldt et al., 2012], Strelka [Saunders et al., 2012], MuTect [Cibulskis et al., 2013], and RVD2 [He et al., 2015], using synthetic DNA data set (Table 2).Among all of the methods compared, our variational EM algorithm has a higher sensitivity and specificity for a broad range of read depths and NRAFs.Our variational EM algorithm shows higher specificity than all the other tested methods at a very low NRAF (0.1%) level.However, our algorithm has a slightly lower specificity than the MCMC algorithm when the median read depth is 4, 156× at 0.3% NRAF, and a slightly lower sensitivity than the  MCMC algorithm when the median read depth is 41, 472× at 0.3% NRAF and a median read depth of 55, 489× at 1.0% NRAF.The performance of other methods is stated in detail in He et al. [2015].
5.2.4.Comparison of Timing.The computational time for approximating the variational posterior distribution is increased by expanding the length of region and the median read depth (Figure 4).Our variational EM algorithm is faster than the MCMC algorithm at the low median read depths of 27× and 298×, and slower for the high median read depths of 3, 089× and 30, 590×.Table 3 shows the timing profile for each part of our variational EM algorithm when median read depth is 3, 089×.Optimizing γ in the E-step ( 16) and optimizing M j in the M-step (19) takes more than 95% of the time of one variational iteration in a test of a single processor, since the integration (12) is needed.4).Additionally, we detected 81 novel variants in 8 timepoints that the original publication did not detect.In Table 4, G7 is the baseline NRAF as the control sample when comparing with G70, G133, G266, G322, G385, and G448 in the respective hypotheses testing.The corresponding NRAFs of called variants at different time points are given by the estimate of the latent variable, μj = E q [µ j |r].
All of these variants, except the variant at position Chr04:1,014,740, decrease in NRAF following a maximum.The allele at position Chr04:1,014,740 is a beneficial variant that arises in NRAF to 99.603% at generation 448 within a constant glucose-limited environment.Moreover, we identified the first emergence of this beneficial variant as early as 0.522% in generation 133.Table 4 shows that we are able to detect many variants (NRAF < 1.0%) early in the evolutionary time course.Furthermore, we detected a variant at position Chr04:1,015,666 at generation 70 as 0.091% NRAF, which is less than a 0.1% fraction.
We identified a pair of variants, Chr04:1,014,740 in gene MTH1 and Chr12:200,286 in gene ADE16, that increase in NRAF together in time (Figure 5).We hypotheses that the variants are concomitant in the same strain.In this pair of genes, gene MTH1 is a negative regulator of the glucose-sensing signal transduction pathway, and gene ADE16 is an enzyme of de novo purine biosynthesis.Glucose sensing induces gene expression changes to help yeast receive necessary nutrients, which could be a reason for this pair of genes to mutate together [Johnston, 1999].Further experimental validation of this hypothesis would be required to definitively show that the mutations are concomitant.
The global precision parameter M0 could influence the estimate of µ j due to its regularization effect.We show the influence of different M0 on variant position Chr04:1,014,740, q(µ 1,014,740 |r) in Figure 6.We see that as we decrease the prior precision parameter M0 , μ1,014,740 increases as expected.But the effect of changing M0 over several orders of magnitude does not change µ j greatly.Here M0 = 1.752 in this dataset.

DISCUSSION
In this article, we propose a variational EM algorithm to estimate the non-reference allele frequency in the RVD2 model to identify rare nucleotide variants in heterogeneous pools.We derived a variational EM algorithm to avoid an intractable integration in our Bayesian model.Our results show that the variational A blank cell indicates that the position of that time point is not called significantly different than G7.The positions highlighted as blue were also identified by Kvitek, 2013.The other 81 positions are novel identified variants in 8 timepoints.
EM algorithm (i) is able to identify rare variants at a 0.1% NRAF level with comparable sensitivity and specificity to a MCMC sampling algorithm; (ii) has a higher specificity in comparison with many state-ofthe-art algorithms in a broad range of NRAFs; and (iii) detects SNVs early in the evolutionary time course, as well as tracks NRAF in a real longitudinal yeast data set.
We have chosen parametric forms for the variational distributions.This choice has left us with a complex integral in our variational optimization problem.In future work, we plan to explore other approximations of the variational distributions that render the integral easier to compute.One could use cubic splines to numerically approximate the function and then integrate that surrogate [McKinley and Levine, 1998].
Another strategy is to consider a Laplace approximation for the variational distribution, as we and others have done previously [Saddiki et al., 2014;Wang and Blei, 2013].
Improving the speed of the estimating algorithm enables us to interrogate whole-genome sequencing data.By doing this, we hope to reveal the dynamics of arising variants at the genome-wide scale to show the  genetic basis of clonal interference.Our method can also be extended to study drug resistance by characterizing tumor heterogeneity in targeted anti-cancer chemotherapy samples.We will be able to find the causative variants that lead to drug resistance and understand the causes of resistance at the single nucleotide level.
7. ACKNOWLEDGMENTS F.Z. and P.F. were supported by PhRMA Foundation Informatics Grant 2013080079.We would like to thank Dan Kvitek for kindly sharing the data set used for the longitudinal yeast evolution analysis.We acknowledge Chuangqi Wang and Ross Lagoy for valuable comments on the manuscript.

FIGURE 1 .
FIGURE 1. A. Graphical model representation of the model.B. Graphical model representation of the variational approximation to approximate the posterior distribution.Observed random variables are shown as shaded nodes and latent random variables are unshaded.The object of inference for the variational EM algorithm is the joint distribution p(µ, θ|r, n).

FIGURE 2 .
FIGURE 2. Approximated posterior distributions by the variational EM and MCMC algorithms on a true variant position 85 when the median read depth is 5, 584×.

FIGURE 3 .
FIGURE 3. Approximated posterior distribution by the variational EM and MCMC algorithms on a false variant position 160 when the median read depth is 410×.

FIGURE 4 .
FIGURE 4. Timing for our variational EM algorithm and MCMC sampling algorithm.Sixty processers are used to estimate the model on the synthetic data set.

FIGURE 5 .
FIGURE 5.The NRAF trend of concomitant variants in gene MTH1 and ADE16.The 95% Bayesian credible intervals are shown.

FIGURE 6 .
FIGURE 6. Influence of M 0 on the estimate of µ j .Posterior distributions of the variant at position Chr04:1,014,740, μ1,014,740 , with different M0 are shown.

TABLE 1
. Sensitivity/Specificity comparison of variational EM algorithm with the MCMC algorithm on the synthetic DNA data set.5.2.2.Comparison of Approximated Posterior Distribution.Figure

TABLE 2
. Sensitivity/Specificity comparison of our variational RVD2 with other variant detection methods on synthetic DNA data.

TABLE 3 .
Timing profile of 4 significant figures for one iteration of variational EM algorithm when median read depth is 3, 089×.Single and multiple processors are both tested to estimate timing.Time for optimizing γ in the E-step and optimizing M in the M-step is highlighted in percentage.5.3.Variant Detection on the Longitudinal Directed Evolution Data.We applied our variational EM algorithm to the MTH1 gene at Chr04:1,014,401-1,015,702 (1,302bp), which is the most frequently observed mutated gene by Kvitek and Sherlock [2013].Our algorithm detected the same variants that were found by Kvitek and Sherlock [2013] (shown as highlighted in Table

TABLE 4 .
Identified variants and corresponding NRAFs in gene MTH1 on Chromosome 4.