Estimation of evolutionary parameters using short, random and partial sequences from mixed samples of anonymous individuals

Background Over the last decade, next generation sequencing (NGS) has become widely available, and is now the sequencing technology of choice for most researchers. Nonetheless, NGS presents a challenge for the evolutionary biologists who wish to estimate evolutionary genetic parameters from a mixed sample of unlabelled or untagged individuals, especially when the reconstruction of full length haplotypes can be unreliable. We propose two novel approaches, least squares estimation (LS) and Approximate Bayesian Computation Markov chain Monte Carlo estimation (ABC-MCMC), to infer evolutionary genetic parameters from a collection of short-read sequences obtained from a mixed sample of anonymous DNA using the frequencies of nucleotides at each site only without reconstructing the full-length alignment nor the phylogeny. Results We used simulations to evaluate the performance of these algorithms, and our results demonstrate that LS performs poorly because bootstrap 95 % Confidence Intervals (CIs) tend to under- or over-estimate the true values of the parameters. In contrast, ABC-MCMC 95 % Highest Posterior Density (HPD) intervals recovered from ABC-MCMC enclosed the true parameter values with a rate approximately equivalent to that obtained using BEAST, a program that implements a Bayesian MCMC estimation of evolutionary parameters using full-length sequences. Because there is a loss of information with the use of sitewise nucleotide frequencies alone, the ABC-MCMC 95 % HPDs are larger than those obtained by BEAST. Conclusion We propose two novel algorithms to estimate evolutionary genetic parameters based on the proportion of each nucleotide. The LS method cannot be recommended as a standalone method for evolutionary parameter estimation. On the other hand, parameters recovered by ABC-MCMC are comparable to those obtained using BEAST, but with larger 95 % HPDs. One major advantage of ABC-MCMC is that computational time scales linearly with the number of short-read sequences, and is independent of the number of full-length sequences in the original data. This allows us to perform the analysis on NGS datasets with large numbers of short read fragments. The source code for ABC-MCMC is available at https://github.com/stevenhwu/SF-ABC.


Background
Over the last decade, next generation sequencing (NGS) has become widely available, and is now the sequencing technology of choice for most researchers. NGS produces sequences that are relatively short, varying between 50 bp to 400 bp depending on the specific platform [1][2][3]. Researchers use NGS in several different ways. In this manuscript, we consider the use of NGS in evolutionary studies, where short read fragments are obtained from longer, amplified target sequences in mixed samples of unlabeled or untagged (= "anonymous") individuals. These types of samples are often collected from viral or bacterial populations. The traditional sampling protocol for evolutionary studies that rely on sequences from many individuals has been to use Sanger sequencing technology to obtain the sequence(s) of one (or more) DNA fragment(s) from each individual in the sample separately. This is typically followed by a multiple sequence alignment and reconstruction of the phylogeny or genealogy, perhaps with the simultaneous inference of relevant evolutionary parameters [4,5].
With NGS, short read fragments are typically shorter than the fragment of interest. When NGS is applied to a mixed collection of DNA from several individuals, the challenge for the evolutionary biologist is the absence of an alignment of full-length sequences, each corresponding to an individual in the sample [6]. And without an alignment of full-length sequences, how does one estimate the evolutionary parameters of interest?
One approach is to attempt to reconstruct the full length haplotypes, and use an alignment of these reconstructed haplotypes in standard evolutionary analyses [7][8][9]. There are several programs that attempt to reconstruct the full-length haplotypes from short read fragments obtained from mixed, unlabeled, collections of individuals e.g., ShoRAH, ViSpA, Pre-dictHaplo and Qure [10][11][12][13]. Analyses have shown that with many data sets, reconstruction of haplotypes can be unreliable, producing either too many haplotypes and/or sequences that have relatively low identity to the original sequences [14][15][16]. Consequently, a researcher who chooses to use these reconstructed full-length haplotypes with any program that requires full-length alignments, will be implicitly integrating the errors of haplotype reconstruction into their estimation of evolutionary parameters.
We propose two alternative approaches to infer population genetic parameters from a collection of short-read sequences obtained from a mixed sample of anonymous DNA using the frequencies of nucleotides at each site only. To our knowledge, there is no existing method capable of estimating these parameters without reconstructing the full length sequence alignment nor the phylogeny. A similar approach had been proposed by Johnson and Slatkin previously [17], but their method focuses on samples of very large numbers of individuals where each read in an alignment of short-reads is assumed to come from a separate genome. In contrast, the methods we propose assumes either (1) that a relatively short fragment of the genome is the target of NGS, and/or (2) there are relatively few individual organisms in the sample. In essence, these assumptions ensure that each site for each individual is covered by multiple reads, so that the frequency of nucleotides at each site can be estimated with reasonable accuracy. In samples of viruses and bacteria, for instance, the amplified region is small, say, a few kilobases long, and the number of genomes in a single PCR reaction is often unknown ranging from the tens to the thousands. With microbial populations, typically viruses, there is also the opportunity to collect serial samples from the same fast evolving species over a period of time. For this reason, the methods we have developed estimate the population genetic parameters θ ∝ Nμ, the effective population size scaled by mutation rate, and μ, the mutation rate per site per unit time [18,19].
We describe two algorithms to estimate these parameters, Least Squares (LS) estimation [20] and Approximate Bayesian Computation Markov chain Monte Carlo (ABC-MCMC) estimation [21][22][23]. One advantage of the ABC-MCMC approach over more traditional MCMC approaches is that ABC-MCMC does not require formulation or computation of a likelihood; instead, the method relies on the use of summary statistics derived from simulated data to accept or reject a proposal in the Markov chain. The details of these two algorithms are described in the next section.
We used simulations to evaluate the performance of these algorithms, and we compared these to the results obtained with BEAST [8], a program that is commonly used to simultaneously infer phylogenies and evolutionary parameters using Bayesian MCMC inference with full-length sequences. Our simulations demonstrate that LS point estimates are unbiased, but produce bootstrap intervals that typically over-or underestimate true parameter values. In contrast, ABC-MCMC is able to estimate evolutionary parameters without reconstructing full-length haplotypes, producing 95 % Highest Posterior Density (95 % HPD) intervals that have equivalent coverage to those obtained by MCMC with full-length alignments; however, there can be up to a 10-fold difference between the lowest and highest bound of each 95 % HPD.

Methods
As noted above, both algorithms apply to samples of short-read sequences obtained from a collection of longer target sequences from mixed and unlabeled individuals in a population. The first method is based on least squares (LS) estimation. As we show below, the second method applies the LS results as a pre-processing step prior to beginning the ABC-MCMC. Both methods estimate evolutionary parameters using only the proportion of each nucleotide at each site, and do not require reconstruction of full length haplotypes nor the phylogeny/genealogy. If serial samples are available, sequences from a later timepoint share common ancestors with those from the earlier timepoint. We assume that samples collected from each timepoint are sequenced separately with NGS technology.
As we noted earlier, we assume that each short read sequenced by NGS will be shorter than the full length haplotypes, and we will obtain many more short read fragments than the original number of haplotypes. We also assume that a reference or consensus sequence for the targeted region is available, and we are able to align each short read fragment to a unique location on the reference sequence. After the short reads are aligned to the reference, we can count the frequency of each nucleotide at each site. In practice, the frequency of each nucleotide will be influenced by the sequencing error from NGS [1]. For the simplicity of the algorithms, we assume that the short-reads have been error-corrected prior to analysis.

Least squares (LS) estimation
For serial samples, we can estimate the intra-timepoint (within a single sample) and inter-timepoint (between samples from two different timepoints) average pairwise sequence diversity. Both inter-timepoint diversity (D inter ) and intra-timepoint diversity (D intra ) are calculated using the proportion of each nucleotide at each site. The intra-timepoint diversity (D intra,s,t ) for site s at time t is calculated as: where F s,j,t is the proportion of nucleotide j at site s at time t.
Similarly, the inter-timepoint diversity (D inter,s,t1,t2 ) between time t 1 and t 2 at site s is calculated as D inter;s;t1;t2 ¼ 1− Once the average pairwise diversity for each site is calculated, the mean intra-timepoint diversity for any specified timepoint is given by: and the mean inter-timepoint diversity between time t 1 and t 2 is: where n is the number of sites. If the sequences are obtained from T timepoints, there will be T × (T − 1)/2 estimates of average diversity, of which T will be intratimepoint diversities. The LS method uses both inter-and intra-timepoint diversity to estimate effective population size and mutation rate, based on the method described by [20]. Population genetics tells us that in any given sample from a constant-sized population of a set of sequences of a neutrally evolving locus, average pairwise sequence diversity (measured as the average proportional distance between any two sequences in the sample) is an estimate of θ, which is proportional to the product of mutation rate and effective population size, with the proportionality constant determined by whether the population is haploid (proportionality constant = 2, θ = 2Nμ) or diploid (proportionality constant = 4, θ = 4Nμ).
To estimate the parameters of interest, let μ be the mutation rate per unit of time, θ be the effective population size scaled by μ, and Δt the time between two sampling events. Note that both μ and Δt are scaled to the same unit of time; typically this will be chronological time but, rarely, time in generations may be available. Under a constant population size, and a constant mutation rate, there are only two parameters to be estimated, θ and μ. As noted above, θ is estimated by D intra and θ + μΔt is estimated by D inter . Once estimates of θ and μ are obtained, we can estimate kN = θ/μ, where k is the unspecified proportionality constant.
We can construct a least squares regression by letting Y be a vector of all D inter and D intra , and X be an indicator variable that identifies whether/how θ and μ contribute to the expectation of D inter or D intra . For the constant population, constant mutation rate model, the indicator value for θ is always 1 and the indicator of μ is just Δt. We are then able to fit Y = XΒ using least-squares, where the Β is the LS estimator of parameter θ and μ. For example, if there are three timepoints T A , T B , and T C , and the intervals between each adjacent pair is 200 units of time apart, we can construct a set of linear equations to express the relationship between these parameters, as follows: The linear equations can be shown as Y = XB, where Using the least squares method to solve for B = (X'X) −1 X'Y, we obtain our estimation of θ and μ.
The model can be extended to multiple population sizes and/or multiple mutation rates. If there are three timepoints T A , T B , and T C , then we can estimate the effective population size for each timepoint using the intra distance within each timepoint. Let θ A , θ B , and θ C be the scaled population sizes for timepoint T A , T B and T C . Δ AB is the time difference between timepoint T A and T B , and Δ BC is the time difference between timepoint T B and T C .
Therefore we update the matrix Alternatively, for a constant population size and multiple mutation rates, the model can be specific as: Therefore we update the matrices X and B as To obtain confidence intervals for our estimates, we used a bootstrap procedure in which we generated 1000 pseudoreplicate datasets by resampling sites with replacement along the alignment of short-read sequences. We did this separately for each timepoint in our simulated datasets. For each pseudoreplicate we recalculated the site frequencies, F s,j,t, , and we were able to estimate N and μ; 95 % Confidence Intervals were obtained by taking values corresponding to upper and lower 2.5 % percentiles of ordered bootstrap estimates.

ABC-MCMC (Approximate Bayesian Computation -Markov chain Monte Carlo)
Markov chain Monte Carlo Bayesian inference is a computationally intensive technique for recovering the posterior probability of parameters of interest, while taking account of prior knowledge (including the degree of uncertainty) about these parameters. If P(D|ϕ) (often referred to as the likelihood of ϕ), is the probability of obtaining the data given a set of parameters, ϕ, and P(ϕ) is the prior information we have about the joint distribution of these parameters, then the posterior distribution, P(ϕ|D), is proportional to the product of likelihood and prior, P(ϕ|D) ∝ P(D|ϕ)P(ϕ). Often it is very difficult to obtain the posterior distribution function analytically. The elegance of MCMC resides in its ability to derive the relative posterior distribution by using a proposal distribution to randomly generate a Markov chain of potential states (= values that the parameters of interest can take), and sampling from this chain by accepting or rejecting states in proportion to the posterior density. MCMC works because although it is not easy to calculate the distribution of P(ϕ|D), it is easy to obtain P(ϕ*|D), for a specific value of ϕ*. Consequently by repeatedly sampling ϕ* in the correct proportions, the distribution of P(ϕ|D) is approximated. Once a sample of sufficient size is obtained, it becomes possible to derive estimates for the parameters of interest.
One common implementation of MCMC uses the Metropolis-Hastings algorithm [24,25], which can be described by the following steps.
Step 1: Begin with initial parameter values ϕ i .
Step 3: Calculate the acceptance ratio, α, using the following formula: Step 4 Set i = i + 1 and repeat Step 1.
The algorithm is repeated until the Markov chain samples from the target distribution, typically the (joint) posterior distribution of the parameter(s).
Approximate Bayesian Computation (ABC) is a simulation-based algorithm for Bayesian inference [21][22][23]. ABC does not require the calculation of the likelihood P(D|ϕ); instead, it uses the agreement between summary statistics obtained from D, and those obtained from simulations of data under different values of ϕ to obtain the relative posterior probability distribution P(ϕ|D). If the summary statistic, S, is (nearly) sufficient then P(D|ϕ, S) ≅ P(D|S). ABC is used precisely because it circumvents the need to calculate a challenging or intractable likelihood. Here, we describe our ABC-MCMC procedure as follows:  S o ), is greater than a threshold ε, then reject the proposed parameter values and set ϕ i + 1 = ϕ i . If the difference is less than ε, then set ϕ i + 1 = ϕ* with probability α, calculated by the modified Metropolis-Hasting ratio: The distance, d(S*, S o ), is calculated as: 4. Set i = i + 1 and go back to step 2 for a large number of iterations.
Although ABC requires that sufficient statistics (or nearly sufficient statistics) are used, it is non-trivial obtaining appropriate sufficient statistics. For this reason, Fearnhead et al. [26] proposed a semi-automatic method to generate "nearly" sufficient statistics for ABC by using a linear combination of commonly-used summary statistics. The linear combination is obtained by regressing the summary statistics against known values of ϕ from simulated training datasets. The regression equation serves as the new summary statistic. The outline of their algorithm is as follows: Therefore, there are k values of ϕ T drawn from the training region, and for each ϕ T there are a set of n summary statistics. 4. For each parameter, ϕ T ∈ ϕ T , regress the values of ϕ T against the set of summary statistics, S T , obtained from all simulations. As noted, the LS regression equations that are obtained are a linear combination of summary statistics in S, and serves as a singlevalued sufficient statistic for each ϕ ∈ ϕ, and used as S* in the ABC-MCMC algorithm above.

Implementation of the full algorithm
For each dataset, we calculate the proportion of each nucleotide at each site of the alignment of short-read sequences to a reference sequence. Our LS estimation procedure is applied to obtain point estimates of the parameters of interest. These point estimates are used as guidelines to help us define the training region for Fearnhead et al.'s algorithm. We set the training region as a uniform distribution with mean equal to the LS estimate with upper and lower bounds set to 5 fold above and below the mean, i.e. from (0.2× to 5×). Linear combinations of the following summary statistics are used as sufficient statistics in the regression. All summary statistics are calculated using the proportion of each nucleotide at each site.
4. Divide the pattern of the proportion of each nucleotide between any two timepoints into four different categories, and record the number of sites in each category: a. Category 1: Sites are identical across both timepoints. b. Category 2: Sites in which one nucleotide is fixed in one timepoint, and the same nucleotide is in the majority in the other timepoint. c. Category 3: There are mutations in both timepoints but the same nucleotide is most frequent in both timepoints (in other words, the nucleotide with the highest frequency in one timepoint is also the nucleotide with the highest frequency in the second timepoint). d. Category 4: All others sites.
By applying the Fearnhead et al. algorithm, we obtain linear equations as sufficient statistics for N and μ, these are used in the ABC-MCMC above.
Two priors are specified for ABC-MCMC: the prior distribution for N, the effective population size, is p N ð Þ ¼ 1 N , and, the prior for mutation rate, μ, is a uniform distribution between [0,1]. The full algorithm is summarized in Fig. 1.
We have found that allowing N and μ to vary independently along the MCMC chain results in inefficient mixing, with higher autorcorrelation between samples of a given interval. For this reason, we used block updating, where both N and μ are updated at each generation of the chain [27,28].

Simulation analysis
Two sets of simulation analyses were performed. The first set tested the performance of the LS method with a range of evolutionary parameters. One hundred datasets were simulated by Bayesian Serial SimCoal [29]. The effective population size was fixed at 3000 and mutation rate fixed at 10 −5 mutations per site per generation. By exploring different combinations of other parameters, we were able to estimate the performance of this algorithm. The number of sequences per timepoint were set to 5, 10, 20 and 40. The second set of simulations compared the performance between our ABC-MCMC implementation and the Bayesian MCMC approach implemented in the software Bayesian Evolutionary Analysis Sampling Trees (BEAST) [8].
Based on the results of the first simulation analysis, one hundred datasets were simulated using Bayesian Serial SimCoal [29]. We choose the following parameters for this simulation, there were three timepoints with inter-timepoint intervals set at 400 generations, the number of sequences per timepoint was fixed at 40, the mutation rate was fixed at 10 −5 mutations per site per generation, and the effective population size was fixed at 3000.
For our LS and ABC-MCMC methods, only the relative frequency of nucleotides per site was available as data. For BEAST analyses, the simulated full-length sequence alignment was used. BEAST ran for 10 million iterations, and ABC-MCMC ran for 1 million iterations with 100,000 thousand iterations in the preprocessing stage. For both BEAST and ABC-MCMC, three independent chains were run for each dataset to check for convergence. Samples were recorded every 1000 iterations in order to reduce the autocorrelation and the first 10 % of the samples were discarded as burn-in. The trace plots were checked manually for convergence and the 95 % Highest Posterior Density region (HPD) for each parameter was calculated using Tracer [30].

Simulation result 1: least square estimation
A series of simulations with different parameter settings were tested. Table 1 reports the means of LS estimates of population size and mutation rate obtained over 100 simulations. The results demonstrate that LS estimation requires sufficient numbers of sequences to obtain reasonable estimates of the proportion of each nucleotide for each site. When the number of full-length sequences is low (n < 10), change in one nucleotide at a site has a major effect on the nucleotide frequencies at that site; therefore, it is difficult for the LS algorithm is to estimate the true parameters. In contrast, as the number of sequences in the sample increases, estimation improves markedly.
In Tables 2 and 3, the number of generations between two consecutive timepoints has only a minimal effect on   the estimation efficiency, hence all simulations with intertimepoint intervals longer than 200 generations have similar performances. The number of timepoints does not have a major effect either. The relative efficiency of LS against the other two methods (ABC-MCMC and full-length Bayesian MCMC with BEAST) can be compared with the second set of simulations, where data were generated with 3 timepoints, with 40 sequences per timepoint and with intra-timepoint intervals of 400 generations. In these simulations, comparison of bootstrap 95 % Confidence Intervals of population size (Fig. 2) and mutation rate (Fig. 3) estimates, however, revealed an unflattering picture of LS estimation performance. Only a few LS 95 % Confidence Intervals enclose the true parameter values. Although the LS estimates are unbiased (i.e., their average over all simulations equals the true value), any given estimate performs poorly.

Simulation results 2: BEAST and ABC-MCMC estimation
BEAST was used to analyze 100 datasets with a constant-sized coalescent model with the parameters described above, to obtain estimates of both population size and mutation rate. Each analysis ran for 10 million iterations initially, and samples were stored every 1000 iterations. After manually inspecting the trace plots for convergence, some dataset were reanalysed with 100 million iterations. The 95 % HPD for mutation rate contains the true value 93 times and effective population size 89 times.
ABC-MCMC analyses were performed on the same 100 datasets with 1 million iterations and samples were stored every 1000 iterations. In addition, another 100,000 iterations of preprocessing simulations were used to estimate the sufficient statistics. An example of the regression equations used as sufficient statistics is in.
Some datasets failed to converge for ABC-MCMC and were reanalyzed with 10 million iterations. Two out of 100 datasets failed to converge even with 10 million iterations, and these are excluded from the analyses. Figure 4 is an example of the trace plot for both mutation rate and effective population size. The plot indicates that the MCMC chain mixes well. Figure 5 gives an example of the posterior density of the effective population size recovered, plotted against the prior distribution. Given the difference between the prior and posterior density, it is apparent that there is sufficient signal in the data to shift the posterior distribution of effective population size away from the prior distribution.
The 95 % HPD for mutation rate included the true value 87 times out of 98 analyses, and the 95 % HPD for population size included the true value 91 times out of 98 analyses. The results of both BEAST and ABC-MCMC are summarized in Table 4

Discussion and conclusions
In this paper, we propose two new algorithms to estimate evolutionary genetic parameters by using only the frequency of nucleotides at each site as input. The leastsquares method provides a fast way to estimate effective population size and mutation rate, but our results indicate that LS estimates (and their bootstrap confidence intervals) tend to under-or over-estimate the true parameters. Consequently, we cannot recommend our LS  Trace plot from ABC-MCMC for both effective population size and mutation rate after removing the first 10 % of the generations as burn-in. This demonstrates that the MCMC chain mixes well algorithm as a standalone method for obtaining estimates of evolutionary parameters. Our LS method, nonetheless, provides a useful baseline for the training region which we need to use to derive nearly sufficient statistics for our ABC-MCMC procedure.
Parameter values recovered by ABC-MCMC are comparable to those obtained using BEAST, albeit with much wider 95 % HPDs. The performance of ABC-MCMC, relative to that of BEAST is unsurprising, because BEAST has access to the full-length alignment of all sequences in the sample. When the full-length alignment is summarized by the proportion of nucleotides at each site, there is inevitably a loss of information. Additionally, our methods do not reconstruct phylogenies of the sequences, further reducing estimation efficiency. Despite these significant constraints, it is interesting that the sitewise nucleotide frequencies are able to provide enough information to obtain meaningful estimates of the parameters of interest.
The methods we have developed assume that the reference sequence is available and short reads can be aligned to the reference accurately. Obviously the performance of this approach is dependent on the quality of the reference sequence and how well these short reads are aligned to it. As noted above, in our simulations we have obtained site nucleotide frequencies from the fulllength sequences; we expect that with real NGS data, the accuracy with which we estimate the sitewise frequencies of each nucleotide will depend on sequencing error. Preprocessing the raw reads for quality, readlength, and identity to the reference sequence is likely to remove a significant amount of this error. In this case, it is unlikely that any remaining errors will distort the site frequencies enough to have a noticeable effect on the estimates. Also, all insertions and deletions (indels) have  been ignored in all simulations and summary statistics. However, our implementation of the ABC-MCMC model allows indels at each site as a fifth "nucleotide" or state. We have not yet applied this to our analysis. One major advantage of using only the site frequencies is that it can be applied to an arbitrary number of sequences in the original sample. In our simulations, we did not simulate short-read sequences; instead, we used the full-length sequences to derive the nucleotide frequencies at each site. Nonetheless, the amount of time required to estimate the proportion of each nucleotide for each site will scale linearly to the number of shortread sequences only, regardless of the number of fulllength sequences from which these were derived, and it is likely to be a very fast calculation. In contrast, methods that rely on building phylogenies from fulllength alignments must contend with the superexponential growth in the number of possible trees as the number of sequences increases. For a realistic NGS dataset, the number of reconstructed full-length haplotypes can be large. Consequently, our methods can be used on NGS datasets with large numbers of short read fragments, obtained from a large number of full-length sequences.
In this paper, we have only developed methods to estimate effective population size and mutation rate. Population geneticists are also interested in other evolutionary parameters, including migration rates and recombination rates. We believe that ABC-MCMC, like other Bayesian MCMC methods, provides a flexible framework to construct more complex evolutionary models. But the use of sitewise nucleotide frequencies alone means that we lack the finer-grained information afforded by a genealogy or phylogeny of full-length sequences; consequently, we are not certain how much complexity can be added to the models before the sitewise nucleotide frequencies we use in our methods lose all signal. This is certainly an area that we intend to explore. The source code for ABC-MCMC is available at https://github.com/stevenhwu/SF-ABC.