Simulation data for the estimation of numerical constants for approximating pairwise evolutionary distances between amino acid sequences

Estimating the number of substitution events per site that have occurred during the evolution of a pair of amino acid sequences is a common task in phylogenetics and comparative genomics that often requires quite slow maximum-likelihood procedures when taking into account explicit evolutionary models. Data presented in this article are large sets of numbers of substitution events and associated numbers of observed differences between pairs of aligned amino acid sequences that have been generated through a simulation procedure of sequence evolution under a broad range of evolutionary models. These data are available at https://zenodo.org/record/2653704 (doi:10.5281/zenodo.2653704). They are accompanied in this paper by figures showing the strong relationship between the corresponding evolutionary and uncorrected distances, as well as estimated numerical constants that determine non-linear functions that fit the simulated data. These numerical constants can be useful to quickly estimate pairwise evolutionary distances directly from uncorrected distances between aligned amino acid sequences.


Data
Given a pair of homologous amino acid sequences, there exists a strong positive monotonic relationship between the number d of substitution events per site that have occured during their evolution and the proportion p of observed differences (often called uncorrected distance or p-distance) between the two aligned sequences [1e9]. For estimating the (unknown) evolutionary distance d from the observed value p, analytical formulae of the following form (often called gamma distance) have been proposed: where a and b are two positive numerical parameters depending on the heterogeneity of the replacement rate among amino acid pairs and sites, and on the equilibrium frequencies of amino acid residues, respectively [2,10e15]. In line with previous attempts [2,4,5,14,16], data presented here are estimations of a and b as obtained through computer simulations for 27 empirical models of amino acid substitution [1,17e37] (see names and associated references in Tables 1 and 2

Experimental design, materials, and methods
To simulate the evolution of amino acid sequences along reliable real-case phylogenetic trees, the 1,903,844 available ones on the ftp repository of PhylomeDB v4 (ftp://phylomedb.org/phylomedb) were considered, as they have been inferred by a workflow including homologous sequence clustering and alignment from a broad range of genes and phyla (eukaryota, bacteria and archaea; see details at http://phylomedb.org) followed by maximum likelihood phylogenetic inference [38]. A reduced subset of these trees was built to obtain a wide array of induced patristic distances that are quite evenly distributed over [0, 20] (see y-axis ranges in Figs. 1e6): for m growing from 0.0001 to 20 (step ¼ 0.001), one tree (at least 25 taxa) was picked out such that its diameter (i.e. maximum patristic distance) was as close as possible to m. Following this procedure, 20,000 real-case phylogenetic trees representative of a comprehensive range of evolutionary events and distances were selected. For each considered evolutionary model (see Tables 1 and 2), the evolution of a Scatter plots (gray dots) representing the relationship between the uncorrected distance p (x-axis) and the evolutionary distance d (y-axis) for the four general amino acid substitution models BLOSUM62 [17], VT [22], PMB [25] and LG [29]. Estimated PC and EI gamma distance functions are drawn in red and blue, respectively. Fig. 3. Scatter plots (gray dots) representing the relationship between the uncorrected distance p (x-axis) and the evolutionary distance d (y-axis) for the eight mitochondrial amino acid substitution models mtREV [19], mtMam [20], MtArt [28], MtZoa [30], stmtREV [34], mtInv, mtMet and mtVer [36]. Estimated PC and EI gamma distance functions are drawn in red and blue, respectively. Fig. 4. Scatter plots (gray dots) representing the relationship between the uncorrected distance p (x-axis) and the evolutionary distance d (y-axis) for the three amino acid substitution models cpREV [21], cpREV64 [31] and gcpREV [33] dedicated to plastidencoded protein sequences. Estimated PC and EI gamma distance functions are drawn in red and blue, respectively.  5. Scatter plots (gray dots) representing the relationship between the uncorrected distance p (x-axis) and the evolutionary distance d (y-axis) for five amino acid substitution models dedicated to retrovirus (rtREV [24]), HIV (HIVb, HIWw [27]), influenza (FLU [32]), and dengue (DEN [37]) protein sequences. Estimated PC and EI gamma distance functions are drawn in red and blue, respectively. Fig. 6. Scatter plot (gray dots) representing the relationship between the uncorrected distance p (x-axis) and the evolutionary distance d (y-axis) for the antibody-specific model of amino acid substitution AB [35]. Estimated PC and EI gamma distance functions are drawn in red and blue, respectively. sequence of 50,000 amino acid residues was simulated using INDELible v1.03 [39] along each of the 20,000 selected phylogenetic trees, and the matrix of observed p-distances was computed from each of the simulated multiple sequence alignments using FastME v2.1.5 [40]. Next, for each evolutionary model, a subset of simulated data (i.e. phylogenetic tree, simulated multiple sequence alignment, and corresponding p-distance matrix) was selected to obtain at least 500,000 values p that approximately follow a uniform distribution over [0, 0.9]. For each of those selected multiple sequence alignments, the branch lengths of the associated phylogenetic tree were refitted using RAxML-NG v0.8.1 BETA [41] with the corresponding evolutionary model, and the matrix of patristic distances d was computed using gotree v0.2.10 (https://github.com/evolbioinfo/gotree). Of note, for each of the 27 considered evolutionary models, INDELible and RAxML-NG were both used with the corresponding empirical replacement matrix file gathered from http://giphy.pasteur.fr/empiricalmodels-of-amino-acid-substitution. Finally, as each pair of distance matrices (i.e. uncorrected and evolutionary distances p and d) represents numbers of observed differences and occurred substitutions per site, respectively, each entry was multiplied by the total number of sites (i.e. 50,000) and rounded to the closest integer. This scaling and rounding step allows observing the same integer values than the ones obtained with alternative programs for branch length refitting (e.g. PhyML [42], IQ-TREE [43]) while each program leads to slightly different evolutionary distances d because of rounding errors or implementation choices (not shown).
Two versions of the nonlinear functional relationship between the evolutionary distance d and the uncorrected distance p were fitted separately to each simulated data. The first, called the Poisson correction (PC) gamma distance, is determined by fixing b ¼ 1 in formula (1) [2,5,44]. The second, called the equal-input (EI) gamma distance, is determined with b ¼ 1 À P r p r 2 in formula (1), where p r is the equilibrium frequency of the amino acid residue r [12,13,15]. For each of the 27 considered evolutionary models, empirical values of p r from the corresponding amino acid replacement matrix were used for computing b (Table 2). For each evolutionary model and each of the two PC and EI gamma distances, the numerical constant a was estimated by weighted nonlinear regression from the pairs of integer versions of uncorrected and evolutionary distances p and d gathered from the corresponding simulation file (see above) and divided by the number of simulated sites (i.e. 50,000). Each least-square estimation of the parameter a was performed using R v3.5.3 [45] with the function nls. Default Gauss-Newton algorithm was used with relative weighting (i.e. each d was weighted with d À2 ) and starting value a ¼ 2.
All simulation datasets are available at https://zenodo.org/record/2653704 (doi:10.5281/zenodo. 2653704). The 20,000 phylogenetic trees selected for simulating sequence evolution are available as a text file together with descriptive statistics summarizing the corresponding patristic distances. For each of the 27 evolutionary models (see Tables 1 and 2), blocks of simulation data (i.e. PhylomeDB identifiers, random seeds, trees with refitted branch lengths, integer values of p and d) are available as text files. Estimated values of a are given for the PC and EI gamma distances in Tables 1 and 2, respectively, together with the associated 95% confidence intervals and mean squared errors. Figs. 1e6 represent the 27 scatter plots of simulated d against p, as well as the regression curves for the two PC and EI gamma distance functions. Each scatter plot is also available with and without the regression curves at https://zenodo.org/record/2653704.

Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.