Building an adaptive trait simulator package to infer parametric diffusion model along phylogenetic tree

The development of an adaptive trait simulator package for inferring trait evolution along a phylogenetic tree is shown. Stochastic processes of the continuous type are broadly applied to modeling trait evolution when the evolutionary relationship among species and traits of study interest are present. By including several popular stochastic processes, evolutionary information embedded in a dataset can be revealed. The highlights of the method include: 1. The implementation of the popular Cox-Ingersol-Ross process for modeling rate evolution within the package to prevent rates from becoming negative and thus is potentially a useful extension to study adaptive trait evolution in randomly evolved environment.2. The established trait simulator approach along with approximate Bayesian computation procedure provides a feasible statistical inference without model likelihood.3. The procedure proposed for trait simulator along phylogenetic tree can be applied to all established models of trait evolution in literature, thus providing users an alternative option to analyze their data.


Introduction
The paper focuses on the implementation of methods for adaptive trait evolution in Jhwueng and Maroulas [10] , Jhwueng and Maroulas [11] and Jhwueng [12] , building from Hansen et al. [9] . Adaptive traits play an important role in survival and reproduction of the organism during the evolutionary process. Any biological trait that helps the organism to survive in its specific environment is an adaptive trait. For example, a snow monkey with thicker (positive) fur would survive better than another monkey with thinner (negative) fur in a very cold environment while the coconut crab evolved lungs to live and breed on land. This motivated the invention of this package that enables feasible inference of trait under the models developed for answering questions from adaptive trait evolution in randomly evolving environment.
Prior work in modeling adaptive trait evolution has been focused on using Gaussian processes such as Brownian motion or Ornstein-Uhlenbeck process where the joint distribution of trait variables in the model follows multivariate normal distribution with analytical mean and variance-covariance structure. This work endeavors to proceed statistical inference on models built by non-Gaussian type process for adaptive trait evolution. Due to the complexity of non-Gaussian processes, the built-in models are often with intractable model likelihood. Inferences under normal approximation to the non-Gaussian type adaptive trait model maybe feasible and efficient for analysis but inferences are remains best and most adequate under the given model.

Procedure
ouxy, the adaptive trait simulator package created for our purposes, implements models of adaptive trait evolution with an approximate Bayesian computation (ABC) procedure. It was implemented as a library in R [17] . The manual of ouxy can be accessed at R cran with link https://cran.r-project.org/ web/packages/ouxy/ouxy.pdf .
While the main function for ouxy is to perform trait simulation along the rooted ultrametric phylogenetic tree under the model of adaptive trait evolution (the OUBMBM model, the OUOUBM model, the OUBMCIR model and the OUOUCIR model, see Additional information -Theory section), it allows for several statistical inferences such as parameter estimation and model selection. The ouxy package contains functions that are separated into three main sections (1) set up the reasonable value for parameter estimation and compute summary statistics from raw input ; (2) set up appropriate  range of prior parameters, perform traits simulation under each model and compute their summary  statistics, and (3) conduct statistical inference for parameter estimation and model selection. Summarized below is the step-by-step process of performing analysis.
1. Prepare your own trait data and a rooted ultrametric phylogenetic tree (an ape tree object). 2. Install the ouxy package in your R environment. 3. Set up the tolerance rate and number of replicates for ABC procedure. 4. Perform the analysis using the main function ouxy. 5. Save the output (posterior parameter estimates and model selection) as desired.
The ABC algorithm presented below adopts the Algorithm 1 in Jhwueng [12] . Note that the distance function in ABC algorithm serves a very important role in accepting the proposed values. Bokma et al. [4] used the Euclidean distance between observed and expected variances as the distance function.
Slater et al. [19] used a partial least squares (PLS) regression transformation of the summary statistics to generate a new, lower dimensional set of summaries prior to computing the distance. The distance function used in Bartoszek et al. ABC algorithm treated the phylogeny and trait data separately. ouxy uses 12 summary statistics (the raw mean, raw variance, raw median, raw skewness, raw kurtosis, the contrast mean, contrast variance, contrast median, contrast skewness, contrast kurtosis [7] , Bloomberg's K [3] and Pagel's λ [15] ) and applies the abc rejection method [6] where the Euclidean distance between the summary statistics of raw data and the summary statistics of simulated data is computed, then a specified tolerance rate is used to get the required proportion of points accepted nearest the target values.
Three empirical datasets of bats [1] , corals [18] and lizards [14] are included in ouxy. The bat dataset is used as an example.
# First install package 'ouxy' # install.packages("ouxy") library(ouxy) library(phytools) # use coral data data(bat) # call for tree tree < -bat$tree # call for traits two covariates: Head height(mm), Head length(mm), # response variable: Body mass(g). traitset < -bat$traitset rownames(traitset) < -tree$tip.label  All functions developed in package ouxy help performing regular statistical inference and hence are connected to its main function to its functionality and utility. The four built-in models (OUBMBM, OUOUBM, OUBMCIR, OUOUCIR) within this package include procedure that utilize three stochastic processes of continuous type (as the name of the model: BM stands for Brownian motion, OU for Ornstein-Uhlenbeck process and CIR for Cox-Ingersoll-Ross process) and are directly applied to perform comprehensive analysis. The required input are the tree: an ape tree object [16] and trait data set: a data frame where the species name matches the tree tip labels of study interest. For standard procedure of executing statistical inference, the package starts by searching maximum likelihood estimates using OU process (by R function OUprior), then a reasonable range for the prior parameters were generated (by R function HyperParam). Next, the trait simulator R functions (oubmbmTrait for the OUBMBM model, ououbmTrait for the OUOUBM model, oubmcirTrait for the OUBMCIR model and ououcirTrait for the OUOUCIR model) generate trait dataset under each model. Then a set of summary statistics are computed (by R function sum.stat).
Finally, posterior samples and model selection under ABC procedure are reported (by R function ouxy). Some procedures such as calculating the summary statistics (by R function sum.stat) or drawing samples for a prior (by one of the R functions oubmbmprior, ououbmprior, oubmcirprior, ououcirprior), can perform independently given the required argument. Hence they can be accessed in the package without calling any other functions.
Among the functions developed in package ouxy, functions oubmbmTrait, ououbmTrait, oubmcirTrait, and ououcirTrait simulate traits along the tree, a key element of the procedure. Conventional methods assume a known statistical distribution for trait variables. Then traits are simulated under the specified distribution. However, the four models (OUBMBM, OUOUBM, OUBMCIR and OUOUCIR) lack an explicit model likelihood, thus the idea is to enable model inference based on approximate Bayesian computation (ABC) technique.
The boxplots for 100 simulated traits using a 3 species phylogenetic tree under each model is shown in Fig. 2 .
A particular note is that the representation of trait variable y t requires solving a system of stochastic integrals that includes the integration of Brownian motion with respect to a Brownian (the OUBMBM model and the OUOUBM model) or the integration of CIR process with respect to Fig. 2. Boxplots for tips simulated under the models of adaptive trait evolution using trait simulator in ouxy. The R script to generate the plots for this simulation can be accessed at online supplement compTraitsim.R a Brownian motion (the OUBMCIR model and the OUOUCIR model). Currently the trait variable y t is expressed at its most explicit form for each model and is implemented into R functions oubmbmmodel, ououbmmodel, oubmcirmodel and ououcirmodel, repectively.
Once traits are generated under model of interest, samples are drawn from the prior distribution and an ABC procedure is performed for inference. Posterior samples are chosen based on the rejection method using function ouxy where R package abc [6] function abc is used for performing the ABC procedure. Different models are assumed a priori equally likely with same number of simulations.
Bayes factor is defined as a ratio of the posterior model probability of two different models M i and M j (i.e. B F i j = Pr ( M i | D ) / Pr ( M j | D ) is used for comparing the fit of models. This is done by using function postpr in R package abc [6] where the posterior model probabilities are estimated using the rejection method. Then the Bayes factor for model M i over model M j is computed by the ratio of frequencies of samples from each of these models that are below the threshold.
The following R script performs the analysis.

Posterior mean of parameters
To generate the It is known that statistical inference accounts for the uncertainty. Similar to the simple OU model [2] , ABC estimation for the models in Jhwueng [12] results in wide histograms of the estimator (see Fig. 3 ).

Additional information
Theory ouxy makes use the expression for trait variable y t that solves the corresponding system of stochastic differential equation for model of adaptive trait evolution. Given a rooted phylogenetic tree The model starts with an assumption that the trait variable y t solves the following stochastic differential equation (SDE) where the deterministic term α y t ( θ y t − y t ) refers to a coefficient that measures the quantity of change in an infinitesimal time dt and α y t is the force that pulls y t back to the optima θ y t ; τ y t is the diffusion coefficient that amplifies/reduces the trait change according to the random changing environment measured by dW The optimum θ y s is assumed in a linear relationship with the covariates x k,s , k = 1 , 2 , · · · , p with the representation in Eq. (3) where the stochastic variable x k,s , k = 1 , 2 , · · · , p is assumed following either a BM (i.e d x k,s = σ x dW x k s ) or an OU process (i.e. d x k,s = α x ( θ x − x k,s ) ds + σ x dW x k s ) [8] .
where θ τ > 0 is the optimum of τ y s , α τ > 0 is a constant that pulls τ y s back to θ τ , σ τ > 0 is the rate of change for τ y s , and W τ s is a Wiener process.
Details of the derivation for the solution y t for each model under various assumptions of the corresponding diffusion processes, its optimum parameter θ y s and rate parameter τ y s can be accessed in Jhwueng [12] .

Problems encountered
Proper approximation to the posterior Attractive as this model is, it is important to understand the limits of this approach. Below are a few difficulties encountered when developing this package.
Currently, due to the computational requirement for proper approximation to the posterior probabilities of models, it is time consuming to finish the simulation when a sufficiently large of sample is used. In particular, for the model that requires the numerical approximation of the integral of Brownian motion with respect to another Brownian motion or to numerically evaluate the integral of a CIR process random variable with respect to a Brownian motion, drawing sufficiently large trait samples from larger tree (e.g. tree more than 100 taxa) often takes a long running time to finish the analysis.
Moreover, when modeling the rate parameter with CIR process (the OUBMCIR model and the OUOUCIR model), drawing a sample for y t requires computing a double stochastic integral in Eq. (5) Conventional theory of numerical analysis suggests finer grids on the time domain should be used to attain more accuracy when computing integrals. However, the stochastic integrals in the left hand side of Eq. (5) has large variation and often produce widely spread trait values when more grids are used. Hence larger uncertainty on the trait value is encountered when more grids are used where more random samples are drawn on each grid. Currently, one grid ( m 1 = m 2 = 1 ) is used to draw a sample from the integral in Eq. (5) implemented in the OUBMCIR and the OUOUCIR models.
Since the models of adaptive trait evolution implement more than one process (OU, BM, CIR) to track the dynamics of a trait along the tree, statistical inference may heavily depend on the simulated data due to the complexity of models. It is also possible that traits simulated from the model lie in a particular region of trait space, in this case one research direction to specify the region in trait space may start by using the trait simulator developed here.

Extension to non-linear optimal regression
In evolutionary biology, regression analysis is broadly applied to model the relationship between dependent variables and their covariates. However, across the diversity of life these functional relationships will vary. A possible extension is to implement nonlinear optimal regresssion connecting the functional relationship between response θ y t and covariates x t [13] . For instance, given a Brownian motion (or an Ornstein Uhlenbeck process) covariate x t , the exponential relationship θ y t = b 0 + b 1 exp ( b 2 x t ) converts θ y t to a geometric Brownian motion (or a geometric Ornstein Uhlenbeck process) stochastic variable; while θ y Eq. (2) is analytically intractable with unknown finite dimensional distribution. Moreover, in computing the covariance between two tips that involves the geometric Ornstein Uhlenbeck process, it remains computational challenge to get a reliable estimate for the term in Eq.
Currently a better numerical scheme is required to approximate the quantity in Eq. (6) .
Note that a popular package pcmabc [2] available on CRAN offers the same functionality as ouxy except for a wider class of models, including a trait dependent speciation one. While pcmabc allows arbitrary class of Markov process and requires users to specify the drift coefficient and diffusion coefficient in the diffusion model, ouxy focus on expanding the Ornstein-Uhlenbeck based processes with non-Gaussian process (CIR) on the rate of evolution and functional optimal regression where the optimal of the dependent variable depends on another stochastic covariates. pcmabc. ouxy offers a convenient and an alternative option for users to choose suitable models best to describe their empirical data.

Funding
This research was financially supported by grant of the Taiwan Ministry of Science and Technology 108-2118-M-035-001 and was assisted by attendance as a Short-term Visitor at the National Institute for Mathematical and Biological Synthesis, an Institute supported by the National Science Foundation through NSF Award # DBI-1300426 , with additional support from The University of Tennessee, Knoxville.

Declaration of Competing Interest
None declared.