Coalescent: an open-science framework for importance sampling in coalescent theory

Background. In coalescent theory, computer programs often use importance sampling to calculate likelihoods and other statistical quantities. An importance sampling scheme can exploit human intuition to improve the statistical efficiency of computations, but in the absence of general computational frameworks for importance sampling, researchers often struggle to implement new sampling schemes or to benchmark them against existing schemes in a manner that is reliable and maintainable. Moreover, most studies use computer programs lacking a convenient user interface or the flexibility to meet the current demands of open science. In particular, current computer frameworks can only evaluate the efficiency of a single importance sampling scheme, or compare the efficiencies of different schemes in an ad hoc manner.

Results. We have designed a general framework (http://coalescent.sourceforge.net; language: Java; license: GPLv3) for importance sampling that computes likelihoods under the standard neutral coalescent model of a single, well-mixed population of constant size over time, following the infinite-sites model of mutation. The framework models the necessary core concepts, comes integrated with several data sets of varying size, implements the standard competing proposals, and integrates tightly with our previous framework for calculating exact probabilities. For a given dataset, it computes the likelihood and provides the maximum likelihood estimate of the mutation parameter. Well-known benchmarks in the coalescent literature validate the accuracy of the framework. The framework provides an intuitive user interface with minimal clutter. For performance, the framework automatically takes advantage of modern multicore hardware, if available. It runs on three major platforms (Windows, Mac and Linux). Extensive tests and coverage make the framework reliable and maintainable.

Conclusions. In coalescent theory, many studies of computational efficiency consider only effective sample size. Here, we evaluate proposals in the coalescent literature and discover that the order of efficiency among the three importance sampling schemes changes when one considers running time as well as effective sample size. We also describe a computational technique called "just-in-time delegation", available to improve the trade-off between running time and precision by constructing improved importance sampling schemes from existing ones. Thus, our systems approach is a potential solution to the "28 programs problem" highlighted by Felsenstein, because it provides the flexibility to include or exclude various features of similar coalescent models or importance sampling schemes.
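The interplay between effective sample size and the estimate itself can be illustrated with a toy importance sampler. The sketch below is generic and unrelated to the framework's own proposals: it estimates E[X²] under a target Exponential(1) distribution by drawing from an Exponential(0.5) proposal, weighting each draw by the density ratio, and reporting the standard effective-sample-size diagnostic (Σw)²/Σw².

```java
import java.util.Random;

public class ImportanceSamplingDemo {
    // Returns {estimate, effectiveSampleSize} for E_p[X^2], where the
    // target p is Exponential(rate 1) and the proposal q is Exponential(rate 0.5).
    public static double[] run(int n, Random rng) {
        double sumW = 0, sumW2 = 0, sumWF = 0;
        for (int i = 0; i < n; i++) {
            // Draw from q by inversion: y = -ln(u)/0.5
            double y = -2.0 * Math.log(1.0 - rng.nextDouble());
            // Importance weight w = p(y)/q(y)
            double w = Math.exp(-y) / (0.5 * Math.exp(-0.5 * y));
            sumW += w;
            sumW2 += w * w;
            sumWF += w * y * y; // f(y) = y^2; the true value E_p[X^2] is 2
        }
        double estimate = sumWF / n;                 // unnormalized IS estimate
        double ess = (sumW * sumW) / sumW2;          // effective sample size
        return new double[] { estimate, ess };
    }

    public static void main(String[] args) {
        double[] r = run(1_000_000, new Random(42));
        System.out.printf("estimate=%.3f, ESS=%.0f of 1000000 draws%n", r[0], r[1]);
    }
}
```

Because the weights vary, the effective sample size falls below the raw number of draws; a proposal closer to the target raises the ESS but, as the conclusions note, may cost more per draw, so running time must be weighed alongside ESS.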

Jobs require different amounts of memory depending on various factors, but 4 GB of memory dedicated to the underlying Java Virtual Machine (JVM) process is sufficient for running all the jobs. The memory dedicated to the JVM is allocated at program launch and is typically set to one quarter of the system's free memory. If your system does not have enough free memory, the program may become unresponsive during computationally demanding jobs. If the system has enough free memory (say, 8 GB) but the default would allocate only around 2 GB, you can allocate the required memory manually: open INSTALL-DIRECTORY/etc/coalescent.conf and change the line default_options="--branding coalescent -J-Xms24m" to default_options="--branding coalescent -J-Xms24m -J-Xmx4000m"
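To confirm how much heap the JVM actually received after editing the configuration, the standard `Runtime` API can be queried. The snippet below is a standalone check (not part of the framework) that reports the `-Xmx` ceiling of the JVM it runs in; the 4000 MB threshold simply mirrors the recommendation above.

```java
public class HeapCheck {
    public static void main(String[] args) {
        // Runtime.maxMemory() reports the maximum heap (the -Xmx ceiling) in bytes.
        long maxMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("Max heap available to this JVM: " + maxMb + " MB");
        if (maxMb < 4000) {
            System.out.println("Note: below the ~4 GB recommended for the largest jobs.");
        }
    }
}
```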

Quick Tour of coalescent via Screenshots
Below is a screenshot of the whole interface under Windows 7. On Mac OS X and Linux the look differs slightly, but the content and structure are the same. This guide uses Windows 7 for illustration, but the instructions apply to all three platforms (Windows, Mac OS X and Linux). Only the jobs under Importance Sampling and PeerReview:Importance Sampling are under scrutiny for this manuscript. These jobs can be used to verify all the claims in the manuscript. We describe these jobs in detail below and show how to reproduce the results.

Jobs
The job Data Likelihood computes the likelihood of data under the infinite sites model. The job MLE-Infinite-Sites (K69) computes the MLE of the mutation parameter for data under the infinite sites model. Note that these two jobs are similar, except that the latter requires a range of mutation parameters.
The job Figure 9 computes Figure 9 in the manuscript, corresponding to the benchmark of Figure 6 in Hobolth, A., Uyenoyama, M. K., & Wiuf, C. (2008). The job 'Figure 9: WarmUp' is identical to the job Figure 9, except that the latter uses 100 times as many realizations of the importance sampling; this keeps the simulation study comparable to the benchmark. However, the full job takes a long time, so to give the user a quick overview we created the job 'Figure 9: WarmUp'. It finishes in minutes with a clear view of the progress in each cell.

Claim 1: Table 3 - Computing MLE Using Multiple Proposals
The job MLE-Infinite-Sites (K69) can be used to verify this table. Select 'K69 Data Set' to be gt94_k69_data. Set 'Min theta value = 1.0', 'Max theta value = 10.0', and 'Theta increment value = 0.1'. Make sure all the samplers (gt-EGT, gt-SD and gt-HUW) are checked. Under Parameters, choose the iteration strategy, i.e., how the number of realizations of the importance sampling is counted. There are three options: by-OrderSize-Unit, by-Time and by-SampleSize. Choose by-SampleSize and set 'IS Run Duration = 100000'. Run the job and inspect the textual and graphical output to verify the data in Table 3 of the manuscript.

Claim 2: Table 4 - Estimating Likelihood at MLE by Multiple Proposals
The job Data Likelihood can be used to verify this table. Select 'K69 Data Set' to be gt94_k69_data. Set 'theta = 4.8' and 'Exact probability = 8.71E-20'. Make sure all the samplers (gt-EGT, gt-SD and gt-HUW) are checked. Under Parameters, choose the iteration strategy, i.e., how the number of realizations of the importance sampling is counted. There are three options: by-OrderSize-Unit, by-Time and by-SampleSize. Choose by-SampleSize and set 'IS Run Duration = 100000'. Run the job and inspect the textual and graphical output to verify the data in Table 4 of the manuscript.

Proposal Efficiency
The job Figure 9 can be used to verify this figure. To display the authors' results, check 'Show author's results?' and run the job; Figure 9 is displayed immediately with its associated data. To compute the results afresh, check 'Overwrite?' and run the job. Note that the authors' data are not lost by this and can be displayed again. The label 'Overwrite' refers to overwriting any previous user computation; if it is unchecked, the computation resumes where it left off, whether the job was cancelled or the application exited (the application persists the state of the computation because this is a long-running job). If the job had already finished, running it again immediately displays the results. The property 'Async job count' lets you run multiple cells in parallel. This property appears only if the underlying system has enough cores to make parallel execution beneficial.
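As an illustration of the kind of parallelism 'Async job count' controls, the sketch below (hypothetical; not the framework's actual code) runs independent cells on a fixed-size thread pool, with the pool size derived from the number of available cores. The method `runCell` is a placeholder standing in for one independent importance-sampling cell.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class AsyncCells {
    // Placeholder for one independent cell of the figure (hypothetical).
    static double runCell(long seed) {
        return new java.util.Random(seed).nextDouble();
    }

    // Run 'cells' independent cells, at most 'asyncJobCount' at a time.
    public static List<Double> runAll(int cells, int asyncJobCount) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(asyncJobCount);
        try {
            List<Future<Double>> futures = new ArrayList<>();
            for (int c = 0; c < cells; c++) {
                final long seed = c;
                futures.add(pool.submit(() -> runCell(seed)));
            }
            List<Double> results = new ArrayList<>();
            for (Future<Double> f : futures) {
                results.add(f.get()); // collect each cell's result in order
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        int cores = Runtime.getRuntime().availableProcessors();
        int asyncJobCount = Math.max(1, cores - 1); // leave one core for the UI
        System.out.println("Finished cells: " + runAll(8, asyncJobCount).size());
    }
}
```

Because the cells are statistically independent, this kind of parallelism changes only the wall-clock time, not the results; that is why the property is offered only when the hardware can actually benefit.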