Gaussian process test for high-throughput sequencing time series: application to experimental evolution

Motivation: Recent advances in high-throughput sequencing (HTS) have made it possible to monitor genomes in great detail. New experiments not only use HTS to measure genomic features at one time point but also monitor them changing over time with the aim of identifying significant changes in their abundance. In population genetics, for example, allele frequencies are monitored over time to detect significant frequency changes that indicate selection pressures. Previous attempts at analyzing data from HTS experiments have been limited as they could not simultaneously include data at intermediate time points, replicate experiments and sources of uncertainty specific to HTS such as sequencing depth. Results: We present the beta-binomial Gaussian process model for ranking features with significant non-random variation in abundance over time. The features are assumed to represent proportions, such as proportion of an alternative allele in a population. We use the beta-binomial model to capture the uncertainty arising from finite sequencing depth and combine it with a Gaussian process model over the time series. In simulations that mimic the features of experimental evolution data, the proposed method clearly outperforms classical testing in average precision of finding selected alleles. We also present simulations exploring different experimental design choices and results on real data from Drosophila experimental evolution experiment in temperature adaptation. Availability and implementation: R software implementing the test is available at https://github.com/handetopa/BBGP. Contact: hande.topa@aalto.fi, agnes.jonas@vetmeduni.ac.at, carolin.kosiol@vetmeduni.ac.at, antti.honkela@hiit.fi Supplementary information: Supplementary data are available at Bioinformatics online.


Introduction
Most biological processes are dynamic and analysis of time series data is necessary to understand them.Recent advances in high-throughput sequencing (HTS) technologies have provided new experimental approaches to collect genome-wide time series.For example, experimental evolution now uses a new evolve and re-sequencing (ER) approach to understand which genes are targeted by selection and how (Burke andLong, 2012, Kawecki et al., 2012).Such experiments enable phenotypic divergence to be forced in response to changes in only few environmental conditions in the laboratory while other conditions are kept constant.The evolved populations are then subjected to HTS.
Experimental evolution in microorganisms has focused on the fate new mutations.For example, in Escherichia coli bacteria (Barrick et al., 2009) and Saccharomyces cerevisae (Lang et al., 2013) new mutations were studied and their effect in large populations was modelled (e.g., Illingworth et al., 2012).In contrast, ER experiments with sexually reproducing multicellular organisms address selection on standing variation and allele frequency changes (AFCs) in small populations where drift plays an important role.For example, for Drosophila melanogaster (Dmel ), several phenotypic traits, such as accelerated development (Burke et al., 2010), body size variation (Turner et al., 2011), hypoxia-tolerance (Zhou et al., 2011) and temperature adaptation (Orozco-TerWengel et al., 2012) have been investigated.Motivated by these experimental studies, we believe that experimental evolution combined with HTS supplies a good basis for studying AFC through time series molecular data.
To perform allele frequency comparisons, pairwise statistical tests between base and evolved populations were typically carried out.Burke et al. (2010) combined Fisher's exact tests with a sliding-window approach to identify genomic regions that show allele frequency differences between populations selected for accelerated development and controls without direct selection.Turner et al. (2011) developed a pairwise summary statistic, called "diff-Stat" to estimate the observed distribution of allele frequency differences and compared this to the expected distribution without selection.Orozco-TerWengel et al. (2012) identify SNPs with a consistent AFC among replicates by performing a Cochran-Mantel-Haenszel test (CMH) (Agresti, 2002).The latter is an extension of the Fisher's exact test to multiple replicates.All above-mentioned statistical methods are based on pairwise comparisons between the base and evolved populations and do not take full advantage of the time series data now available.Here, we propose an alternative Gaussian process (GP) based approach to study AFCs over the entire time series experiment.
GP is a non-parametric statistical model that is extremely well-suited for modelling HTS time series data which usually have relatively few time points that may be irregularly sampled.Recently, there have been some works applying GP models with parameters describing the process of evolution (e.g., Jones and Moriarty, 2013 account for phylogenetic relationships, Palacios and Minin, 2013 for effective population size).GPs have also recently been applied to gene expression time series by a number of authors (Yuan, 2006;Gao et al., 2008;Kirk and Stumpf, 2009;Liu et al., 2010;Honkela et al., 2010;Stegle et al., 2010;Cooke et al., 2011;Kalaitzis and Lawrence, 2011;Titsias et al., 2012;Liu and Niranjan, 2012;Äijö et al., 2013;Hensman et al., 2013).In differential analysis, GPs have been applied to detect differences in gene expression time series in a two-sample setting by Stegle et al. (2010) and for detecting significant changes by Kalaitzis and Lawrence (2011).While these methods provide a very sensible basis for detecting the changing alleles, they fail to properly take into account all aspects of the available HTS data, such as differences in sequencing depth between different alleles and time points.These differences can have a huge impact in the reliability of different measured allele frequencies and taking them into account is vital for achieving good accuracy with the available short time series.

Methods
To identify the candidate alleles which evolve under selection, we model the allele frequencies by Gaussian Process (GP) regression.We fit time-dependent and time-independent GP models and rank the alleles according to their corresponding Bayes factors, i.e. the ratio of the marginal likelihoods under the different models.
GPs provide a convenient approach for modelling short time series.However, when applying them to a large number of short parallel time series as in many genomic applications, naive application leads to overfitting or underfitting in some examples.While these problems are rare, the bad examples can easily dominate the ranking.We overcome these challenges by excluding nonsensical parameter values, for example using a good variance model that can be incorporated into the GP models.

Data and Preprocessing
In the following, we use the term SNP for the markers and alleles under study, but the methods can be applied to any features whose abundance can be quantified in a similar manner.We assume that the SNPs are bi-allelic for a specific position of the genome in a population.In other words, only two of the alleles from (A, T, C, G) can be observed at each SNP position.We first determine the abundances of these two specific alleles and we aim to model the time dependency of the rising allele's frequency over several generations.We will refer to generations as time points for simplicity.
We denote the replicate index of each observation by r j and the time point by t j , j = 1, ..., J, with J denoting the total number of observations.For each of these points, we assume HTS reads have been aligned to a reference genome with y ij reads with a specific allele at SNP position i.We use n ij to denote the total sequencing depth at the position.

Mean and Variance Inference: Beta-Binomial Model
We model y ij as a draw from a binomial distribution with parameters n ij and p ij : where p ij denotes frequency of the specific allele in the population.We set a uniform Beta(1,1) prior on p ij : where α = 1, β = 1.
Since beta prior is conjugate to the binomial likelihood, the posterior distribution will also be a beta distribution: where Then, the posterior mean and variance of p ij can be calculated as: The inferred posterior means and posterior variances are used to fit the GP models as described in the following sections.As the results will show, this step is very important for incorporating the available uncertainty information into the GP models by taking into account different sequencing depths.For example, beta-binomial model assigns larger variances to the alleles with lower sequencing depths (Figure 1).Moreover, the Beta(1,1) prior on p ij leads to a symmetry in the posterior mean and variance.The result of our method is therefore unaffected whichever allele is chosen.

Gaussian Process Regression
A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution (Rasmussen and Williams, 2006).We write to denote that f (t) follows a Gaussian process with mean function m(t) = E[f (t)] and covariance function ].We let y = (y i ) N i=1 be a vector of the noisy observations with where is Gaussian observation noise with zero mean and a diagonal covariance matrix Σ .To simplify the algebra we assume the mean function m(t) = 0 and subtract the mean of y.
Gaussian processes allow marginalising the latent function to obtain a marginal likelihood.The covariance function K and the noise covariance Σ depend on hyperparameters and parameters θ that can be estimated by maximising the log marginal likelihood: It is also possible to compute the posterior mean and covariance at non-sampled time points t * , given the noisy observations y at sampled time points t.This is often useful for visualisation purposes.We obtain (Rasmussen and Williams, 2006): where In our GP models we use the squared exponential covariance matrix to model the underlying smooth function.The squared exponential covariance (10) has two parameters: the length scale, l, and the signal variance, σ 2 f .Length scale specifies the distance beyond which any two inputs become uncorrelated.A small length scale means that the function fluctuates very quickly, whereas a large length scale means that the function behaves like a constant function (Kalaitzis and Lawrence, 2011).Three example realisations generated with squared exponential covariance matrix can be seen in Figure 2 (a).
In the standard GP model the observation noise is assumed to be white: the noise at different time points is independent and identically distributed.The corresponding covariance matrix is an identity matrix multiplied by the noise variance parameter, σ 2 n .Three example realisations generated with white noise covariance matrix can be seen in Figure 2

BBGP: Beta-Binomial Gaussian Process
The Beta-Binomial Gaussian Process (BBGP) method combines beta-binomial model with the GP model in the sense that the posterior means and posterior variances of the frequencies, which are inferred by beta-binomial model, are used to fit the GP model using an additional noise covariance matrix which we call fixed beta-binomial (FBB) covariance matrix.
Returning Sec.2.2, let us denote the posterior mean and variance of p ij by m ij and s 2 ij , respectively.That is, To fit the BBPG model, we assume where f (t) ∼ GP(0, K SE (t, t )) and ∼ N (0, Σ W + Σ F BB ).The mean µ m i is eliminated by subtracting the mean from m ij .The additional covariance is a diagonal fixed beta-binomial (FBB) covariance matrix which is used to include known variance information for each observation in the GP model.The elements of Σ F BB are determined by the frequency variance vector which is inferred from beta-binomial model in Sec.2.2.Three example realisations generated with fixed beta-binomial covariance matrix can be seen in Figure 2 (c), where larger variance values were inferred for the later time points.

Time
Frequencies q q q q q q q q q q q q q q q q q q q q Log−likelihood: 11.71 0 15 23 27 37 0.0 0.2 0.4 0.6 0.8 1.0 Time Frequencies q q q q q q q q q q q q q q q q q q q q Log−likelihood:

BBGP-based test
We fit the "time-dependent" BBGP model of Eq. ( 14) and a "time-independent" model without the GP term f (t j ) for each SNP.Thereby the parameters of the squared exponential covariance (K SE , Eq. 10) in the time-dependent model and the white noise covariance (Σ W , Eq. 11) in both models are fitted by maximising the marginal likelihood.The fixed beta-binomial covariance (Σ F BB , Eq. 15) does not contain any free hyperparameters.If the model is actually timeindependent, the length scale in the squared exponential covariance is estimated to be very large, which makes the maximum likelihood of the time-dependent model equivalent to that of time-independent model.Figure 3 shows an example of the time-dependent (left) and timeindependent (right) BBGP models.
We maximise the log marginal likelihood functions for the models by scaled conjugate gradient method using the "gptk" package by Kalaitzis and Lawrence (2011).We use a grid search over the parameter space and initialise the parameters to the grid value with highest likelihood.We set a lower bound equal to the shortest spacing between observations for the length scale parameter to avoid overfitting.
We compute the Bayes factor (BF) for SNP i as (Stegle et al., 2010;Kalaitzis and Lawrence, 2011): where θ 1 and θ 2 contain the maximum likelihood estimates of the hyperparameters in the corresponding BBGP models.Bayes factors indicate the degree of the models to be "timedependent" rather than "time-independent".

Cochran-Mantel-Haenszel Test
We compare BBPG against the Cochran-Mantel-Haenszel test (CMH), which was used by Orozco-TerWengel et al. (2012) to identify alleles with consistent allele frequency change across replicates.The CMH test has been proven to be the best-performing test statistic applied on HTS evolutionary data so far (Kofler and Schlötterer, 2014).Therefore, we take it as the basis of comparison with BBGP.CMH allows to test whether the joint odds ratio of replicated (r = 1, . . ., R) allele counts in a 2 × 2 × R contingency table (Table 1) is significantly different from one.Significant deviation from one implies dependence of allele counts between two time points that is consistent among replicates.The CMH tests pairwise observations of the two alternative allele counts y (1) ij and y (2) ij .In our bi-allelic case y (1) ij = y ij and y (2) In order to compare the counts for all replicates r = 1, . . ., R at the base (B) and the end (E) time points for each SNP position i, we denote B r = {j|t j = B, r j = r} and E r = {j|t j = E, r j = r}.The CMH test statistic (see Agresti (2002) and Appendix A.1) compares the cell counts in Table 1 to their null expected value and follows a chi-squared distribution with one degree of freedom χ 2 (df =1) .We performed CMH tests on the simulated and real data for each SNP position independently using the implementation of the software PoPoolation2 (Kofler et al., 2011).

Simulations
To evaluate the performance of the GP-based method we simulated data that mimics the dynamics of evolving Dmel populations at the genomic level.We carried out forward Wright-Fisher simulations of allele frequency trajectories of populations using the MimicrEE simulation tool (Kofler and Schlötterer, 2014).The initial haplotypes were taken from Kofler and Schlötterer (2014) and capture the natural variation of Dmel population.By sampling from the initial set we established r = 5 replicated base populations (with effective population size of about 200) and let each of them evolve for 60 generations at a constant size of 1000.The recombination rate was set at a predefined level, calculated by D. A. Petrov (Singh et al., 2005).Low recombining regions were excluded from the simulations.We followed the evolution of the total number of 1,939,941 autosomal SNPs among which 100 were selected with rather strong selection coefficient of s = 0.1 and semi-dominance (h = 0.5).Furthermore, we required the selected SNPs to have a starting frequency in the range [0.12, 0.8], not to lose the minor allele in the course of time due to drift.We recorded the nucleotide counts for every second generation and performed Poisson sampling with λ = 45 (overall mean coverage in Orozco-TerWengel et al., 2012) on the count data to produce coverage information (see Appendix A.2). We repeated the whole simulation experiment three times each time using a different set of selected SNPs.

Evaluation Metrics
The methods were evaluated based on precision, recall and average precision (Manning et al., 2008).Precision and recall are commonly used metrics for comparing the ranking-based methods which aim to get most of the relevant items in the very top of the ranked list of the items, in our case selected SNPs among top ranked SNPs.Precision is defined as the fraction of SNPs that are selected, and recall is defined as the fraction of selected SNPs that are returned.That is, The curve obtained by plotting the precision at every position in the ranked sequence of items as a function of recall is called the precision-recall curve.The area under the curve can be summarised using the average precision (Manning et al., 2008), which is defined as the average of pre(k) after every returned selected SNP: where N is the total number of SNPs and 3 Results

Simulated Data
We applied the BBGP and CMH methods to the simulated data with different numbers of time points (i.e., generations) and replicates.We started with nine time points {0, 6, 14, 22, 28, 38, 44, 50, 60} and then removed the midpoint of the shortest interval until the desired number of time points was achieved.In the case of a tie, we kept the time point which is closest to the real sequenced time points (Orozco-TerWengel et al. (2012), see also Appendix A.3 for the complete list of time point sets used).We performed BBGP separately for each of the sampling schemes while CMH can only use two time points (first and last).All simulated SNPs were scored using Bayes factors for the BBGP, and p-values for the CMH test (see Figure 8 in the appendix for an example of a graphical visualisation of the scores).
To investigate the effect of the number of replicates, we chose r = 2, 3, 4, 5 replicates at each sampled time point.We first performed CMH tests with all possible r-replicate combinations.
We then applied BBGP only to the best-performing replicate combinations of each size according to average precision in the CMH evaluations.This strategy ensures a fair comparison between the methods as BBGP is always evaluated against the best CMH results.We also compared BBGP to the standard GP of Kalaitzis and Lawrence (2011) that does not use the beta-binomial model variances using the same replicate combinations as BBGP but only 6 time points.
As shown in Figure 4 (see also Figure 9 in the appendix), BBGP achieves a higher average precision than the standard GP or the CMH.Somewhat surprisingly, CMH seems to benefit very little from more replicates while the performance of the GP methods improves significantly with more replicates.We observe that the CMH is very sensitive to the specific replicates included, as including the fifth replicate in the optimal sequence actually leads to worse performance than four replicates (Figure 9(d) in the appendix).We did not observe similar behaviour with the GP methods.On average over all possible r-replicate combinations adding more replicates helps the CMH as well.The performance of the standard GP approaches that of BBGP as the number of replicates increases, which is consistent with the view that the stronger prior information from sequencing depth is most important when the data are otherwise scarce, as is often the case in real experiments.The full precision recall curves corresponding to the results are shown in Figure 5 (see also Figure 9 in the appendix).While adding more replicates improves BBGP performance significantly, adding more time points helps at most very little (Figure 6).The running time needed to analyse 1000 SNPs in (4 replicates, 6 time points) setting is approximately 30 minutes on a desktop running Ubuntu 12.04 with Intel(R) Xeon(R) CPU E3-1230 V2 at 3.30GHz.The performance can also vary noticeably between different experiments depending on their difficulty (see Figure 10 in the appendix).For example, there is a 10-fold difference in AP between Experiment 1 and Experiment 3 for both methods (Figure 10 in the appendix, see also Kofler and Schlötterer (2014) for the CMH), but the BBGP-based test consistently outperforms the CMH test.
We also investigated if the two methods identify different types of selected SNPs.We calculated allele frequency change (AFC) for each SNP based on the average difference between the base an end populations across replicates.The CMH is very sensitive to large AFCs, while the candidates detected by the BBGP have a much more uniform distribution of AFCs (Figure 11 in the appendix).In general, we would expect a uniform distribution of AFC, as very large AFCs are only possible for SNPs with low starting frequency giving them the potential to rapidly increase.BBGP is much more accurate than CMH in all AFC classes as can be demonstrated by the performance breakdown (Figure 12 in the appendix).tions adapting to elevated temperature regime to identify evolutionary trajectories of selectively favoured alleles.They established replicated base populations from isofemale lines collected in Portugal.The populations were propagated at a constant size of 1000 for 37 generations under fluctuating temperature regime (12h at 18 • C and 12h at 28 • C).DNA pool of 500 females (Pool-Seq) was extracted and sequenced in multiple replicates at the following time points: three replicates at the base generation 0 (B); two replicates at generation 15, an additional replicate at generation 23 and at generation 27; three replicates at the end generation 37 (E).

Real Data Application
CMH test was performed on a SNP-wise basis to identify significant allele frequency changes between the B and E populations (see Orozco-TerWengel et al. (2012) and Appendix A.4).We applied the BBGP method on 1,547,765 SNPs from the experiment and compared the results B-E comparison of the CMH test.The overlap between the top 2000 candidate SNPs of the CMH and the BBGP was rather small (524 SNPs).However, the peaks of both methods covered the same regions (Figure 7).Although Dmel generally has rather small levels of linkage, linkage disequilibrium might have built up during the course of the experiment.In fact, linkage  disequilibrium had a major effect on the number of candidate SNPs identified by the CMH as well as the BBGP based test.As the flanking SNPs showed signs of hitchhiking, the observed AFC of the flanking SNPs were also significant (see also Manhattan plot for the simulated SNPs, Figure 8 in the appendix) and made it difficult to narrow down functionally important regions for thermoadaptation.

Gene Ontology Enrichment
We used Gowinda (Kofler and Schlötterer, 2012) to test for the enrichment of functional categories according to the Gene Ontology (GO) database (Ashburner et al., 2000).Gowinda tests significance of overrepresentation of candidate SNPs in each GO category.It uses permutation tests to eliminate potential sources of bias caused by difference of gene length and genes that overlap.We tested the top 2000 candidate SNPs for both the CMH and the BBGP methods, respectively.FDR correction was applied on the inferred p-values to account for multiple testing.Using Gowinda we did only find one significant category (p < 0.05) for the BBGP and no significant categories for the CMH test (see Table 3 and Table 4 in the appendix).
In addition to taking an arbitrary threshold of the top 2000 SNPs, we also considered the full distributions of p-values for the CMH and the distribution of Bayes factors for the BBGP based tests.For each GO category we compared distribution of all SNP-values (p-values for the CMH and Bayes factors for the GP) in that GO gene set to the distribution outside that gene set using a one-tailed Mann-Whitney U test (MWU) as applied by Segrè et al. (2010).Similar to Gowinda, we used permutations to account for biases such as gene length and other confounding effects (see Appendix A.5).We also conserve the gene order during the randomisation as functionally similar genes are often clustered nearby on a chromosome.Using the MWU tests, we found significant GO category enrichments for both methods ( Moreover, the top ranked candidate categories were similar in both cases (see Table 5 and Table 6 in the appendix).There is a significant overlap between the discovered categories, but there are also clear differences.

Discussion
Our results in detecting SNPs that are evolving under selection using a GP model presented in this paper clearly demonstrate the importance of careful modelling of the measurement uncertainty through a good noise model, in our case using the beta-binomial model of sequencing data.Especially when data are scarce, the BBGP approach leads to much higher accuracy than standard maximum likelihood estimation of noise variances.Incorporating the non-Gaussian likelihood directly to the GP would also be possible, but it would lead to computationally more demanding inference.
In terms of experimental design, the most effective way to improve performance is to collect more replicate measurements.Compared to the CMH test, the BBGP is clearly superior in utilising more replicates.We suspect this is because CMH assumes all replicates should have similar odds ratios between the two time points and this is not sufficiently satisfied by the noisy data.On the other hand, the benefit of adding more intermediate data points seems marginal.This may be because the shape of selected trajectories is a simple sigmoid and adding more points provides limited help in estimating them.
The presented GP-based test is sensitive to SNPs with a consistent time-varying profile.A statistically more accurate model could be derived by assuming each replicate to follow an independent GP, but this would lead to a less sensible test as it would also pick up SNPs with strong but inconsistent drift.Exploring hierarchical GP models to capture the correct dependence structure in a sensible test is an interesting avenue of future research.
In a whole-genome experiment, linkage disequilibrium between nearby markers is an important confounder in identifying the selected marker.None of the methods tested takes this into account, which introduces the problem of hitchhikers, i.e. markers that behave like they would be selected because they are strongly linked with a truly selected marker.It is possible to incorporate such information into the GP model at the cost of higher computational complexity, and this is clearly an important avenue of future research.This is potentially a further advantage of the GP, because it is not possible to incorporate this into frequentist tests.

Conclusion
In this paper, we developed a new test that is based on combining GP models with a betabinomial model of sequencing data, and compared it with the CMH test that allows the pairwise comparison of base and evolved populations across several replicates.
Our results demonstrate that GP models are well-suited for analysing quantitative genomic time series data because they can effectively utilise the available data, making good use of additional time points and replicates unhindered by uneven sampling and consistently show performance superior to the CMH test.
The GP framework is very flexible which enables extensions utilising for example linkage disequilibrium over nearby alleles.As GP models can easily incorporate additional information on the data, we envisage that further promising combinations of the GP approach with evolutionary models will emerge.
homozygote individuals and the set of selected SNPs we followed the simulation protocol outlined at https://code.google.com/p/mimicree/wiki/ManualMimicrEESummaryfor 5 replicates independently.As described in Kofler and Schlötterer (2014), we aimed to reproduce the sampling properties of Pool-Seq using Poisson sampling with λ = 45 (using the script poisson-3fold-sample.py available at http://mimicree.googlecode.com).Briefly, we considered coverage differences between samples, coverage fluctuations due to GC-bias and stochastic sampling heterogeneity.
We carried out 3 independent runs of simulations with different sets of selected SNPs but keeping the parameters unchanged (Figure 10).Finally, we compared with a performance break down according to Allele Frequency Change (AFC) the BBGP to CMH test in different AFC classes (Figure 11 and Figure 12).

A.4 Real Data Application
We applied the BBGP on HTS data of experimentally evolved D. melanogaster populations (Orozco-TerWengel et al., 2012).We compared our proposed method to the CMH results coming from the B-E comparison, downloaded from Dryad database (http://datadryad.org)under the accession: doi: 10.5061/dryad.60k68.We used the synchronized pileup files (BF37.sync)which contains a total number of 1,547,837 SNPs.The CMH test was only performed on SNPs that met certain quality criteria regarding the minor allele count and the maximum coverage (for more information on SNP calling please consult Orozco-TerWengel et al., 2012).

A.5 Gene Set Enrichment
For a biologically meaningful comparison of the CMH test and the BBGP test we performed a gene set enrichment analysis.Below we explain how we used the tool Gowinda (Kofler and Schlötterer, 2012) for Gene Ontology enrichment and how we extended it.
functionally rather similar.Figure 13 shows the overlap between highly enriched categories for different empirical p-value cutoffs.The categories are listed in Table 5 and Table 6.
B Supplementary Tables and Figures      However, there are still some truly selected SNPs that do not show clear pattern of frequency increase.A possible explanation for that can be that the time course, i.e. 60 generations, is not enough for them to rise in frequency.They can also interfere with each other and non-selected SNPs.Modeling spacial structure would be required to capture such a pattern.

Figure 2 :
Figure 2: Example realisations from GPs and noise processes with different covariance structures.
model: Time-independent model:

Figure 3 :
Figure 3: BBGP fits for the time-dependent and time-independent models for an example SNP taken from the real data set (Orozco-TerWengel et al., 2012).Confidence regions are shown for ± 2 standard deviation.Similarly, error bars indicate ± 2 standard deviation (from FBB) interval.Replicates at the same time points are shifted by 0.5 for better visualisation.

Figure 4 :
Figure4: Average precisions (AP) for CMH, BBGP, and standard GP with different number of replicates.The used replicates have been selected as the best-performing r-replicate combinations in the CMH test, except for the CMH mean AP which has been computed by taking the mean of the average precisions over all r-replicate combinations for r = 2, 3, 4, 5.

Figure 6 :
Figure 6: Average precisions (AP) for CMH and BBGP with different number of replicates and time points.More replicates are indicated by thicker lines.APs for the CMH are given as constant lines in order to facilitate the comparison (CMH is performed using only the first and last time points).
(a) Genome-wide distribution of CMH -log(p-values).(b) Genome-wide distribution of GP Bayes factors.

Figure 7 :
Figure 7: Manhattan plots of genome-wide SNP-values.(a) -log(p-values) for the CMH test B-E comparison.Only SNPs with p-values up to -log10(1e-30) are shown; 3 SNPs with pvalues above are set to -log(1e-30) on the plot.(b) Bayes factors for the BBGP.Only those SNPs are indicated for which we calculated both the p-values and the Bayes factors (we did not infer Bayes factors for fixed SNPs).A 1 Mb region was excluded from the analysis on 3R as a low frequency haplotype spreads during the experiment due to an inversion.Previously, the chorion gene cluster on 3L was also excluded as this region has extremely high coverage (Orozco-TerWengel et al., 2012).Regions that were excluded from the analysis are shown in green.The red horizontal line indicates the top 2000 candidate cutoff.The common candidates among the top 2000 are highlighted in magenta.Figure (b) shows well how the beta-binomial variance control can handle high coverage problem of the excluded region on 3L.

Figure 8 :
Figure 8: Manhattan plots of genome-wide test statistic values on simulated data with 5 replicates.(a) -log(p-values) for the CMH test B-E comparison.(b) Bayes factors for the BBGP using 6 time points.Only autosomal regions were simulated and low recombining regions (< 1cM/M b) were excluded.The 100 truly selected SNPs (s=0.1) are indicated in red.As the consequence of linkage structure we observe extended peaks in the vicinity of selected SNPs.However, there are still some truly selected SNPs that do not show clear pattern of frequency increase.A possible explanation for that can be that the time course, i.e. 60 generations, is not enough for them to rise in frequency.They can also interfere with each other and non-selected SNPs.Modeling spacial structure would be required to capture such a pattern.

Figure 12 :Figure 13 :
Figure 12: Precision recall curves for different AFC classes.The performance in terms of precision and recall is shown for the CMH and the BBGP in classes of SNPs with different allele frequency change.The AFC is measured between the base and end generations (60) and averaged over 5 replicates.For the BBGP 6 time points were used.Panel (a) shows the overall performance.For Panel (b)-(d), the AFC classes contain the following number of selected SNPs: 36 in class [0-0.3],30 in class (0.3-0.5], 34 in class (0.5-1.0].

Table 1 :
2 × 2 contingency table of allele counts for the r-th replicate for the CMH test.
Orozco-TerWengel et al. (2012)applied the evolve and re-sequencing with HTS on Dmel popula-

Table 2 :
Table 2 and Figure 13 in the appendix).The number of enriched GO categories for the MWU test.Empirical p-values (Emp.p-val.)are based on 1000 permutations.Overlap between CMH and BBGP tests are shown for different significance levels: p ≤ 0.1, p ≤ 0.05 and p ≤ 0.01.With FDR correction there were no significantly enriched categories.

Table 3 :
Top ranked GO enrichment results with Gowinda on the CMH candidates.Only the top 4 categories are shown.

Table 4 :
Top ranked GO enrichment results with Gowinda on the BBGP candidates.Only the top 4 categories are shown.

Table 5 :
Results of the GO enrichment with MWU on the CMH candidates.Only the categories are shown for which the empirical p-value ≤ 0.05 calculated for 1000 permutations.

Table 6 :
Results of the GO enrichment with MWU on the BBGP candidates.Only the categories are shown for which the empirical p-value ≤ 0.05 calculated for 1000 permutations.