Short Communication
MtPAN3: Site-class specific amino acid replacement matrices for mitochondrial proteins of Pancrustacea and Collembola
Introduction
Modern methods of phylogenetic analysis rely on a so-called model of molecular evolution, i.e. a mathematical model describing how sequences evolve through time (Liò and Goldman, 1998). At the core of an evolutionary model is the substitution rate matrix, which combines a symmetric matrix of exchange rates, describing the instantaneous rate of change between each pair of amino acids, with a vector of equilibrium frequencies. The model can also include additional components, such as among-site rate variation (ASRV), tree shape and demographic processes, which are not directly considered here. The actual values in the matrix can be either optimized alongside the phylogenetic tree during the analysis (i.e. mechanistic models: Yang et al., 1998) or estimated a priori from large reference datasets and subsequently applied to the specific dataset under consideration (i.e. empirical models). While nucleotide analyses almost invariably use mechanistic models, amino acid analyses generally rely on empirical models, as the large number (190) of substitution parameters cannot generally be estimated from small or medium-sized datasets. A number of matrices have been determined for both general interest proteins (e.g. PAM: Dayhoff et al., 1972; WAG: Whelan and Goldman, 2001; LG: Le and Gascuel, 2008) and specific genetic systems (e.g. MtREV: Adachi and Hasegawa, 1996 for mitochondrial proteins). Following the observation that differences exist between taxonomic groups, matrices specific to a given lineage have also been developed, such as, among others, MtPAN for pancrustacean mitochondrial genomes (Carapelli et al., 2007), MtART for Arthropoda (Abascal et al., 2007) and MtZOA for the entire Metazoa (Rota-Stabelli et al., 2009).
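As an illustration of how the two components combine, the following minimal sketch builds a rate matrix Q from a symmetric exchangeability matrix and a vector of equilibrium frequencies, using the usual convention q_ij = s_ij * pi_j for i ≠ j and rescaling to one expected substitution per unit time. The function names and the random toy values are assumptions made for this example; they are not taken from MtPAN or from any other published matrix.

```python
import numpy as np

def build_rate_matrix(exchange, freqs):
    """Combine symmetric exchangeabilities and equilibrium frequencies into Q."""
    exchange = np.asarray(exchange, dtype=float)
    freqs = np.asarray(freqs, dtype=float)
    q = exchange * freqs[np.newaxis, :]   # q_ij = s_ij * pi_j for i != j
    np.fill_diagonal(q, 0.0)
    np.fill_diagonal(q, -q.sum(axis=1))   # each row sums to zero
    scale = -np.dot(freqs, np.diag(q))    # expected substitutions per unit time
    return q / scale

def n_exchange_parameters(n_states):
    """Number of free entries in a symmetric matrix with an empty diagonal."""
    return n_states * (n_states - 1) // 2

# Toy values (hypothetical, for illustration only)
rng = np.random.default_rng(0)
n = 20
upper = np.triu(rng.uniform(0.1, 1.0, size=(n, n)), k=1)
S = upper + upper.T                       # symmetric exchangeabilities
pi = rng.dirichlet(np.ones(n))            # equilibrium frequencies, sum to 1
Q = build_rate_matrix(S, pi)

print(n_exchange_parameters(n))           # 20 * 19 / 2 exchangeabilities
```

With 20 amino acids the symmetric matrix has 20 × 19 / 2 = 190 off-diagonal entries, which is the parameter count quoted above.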
Early methods of phylogenetic analysis generally assumed that each and every alignment site evolved under the same model (rates, exchange probabilities, equilibrium frequencies), with more sophisticated implementations going in the direction of modeling such aspects of sequence evolution explicitly. While ASRV is generally accounted for (Yang, 1996; Stamatakis, 2006a), variations in the substitution process (i.e. equilibrium frequencies and exchange probability matrix) are frequently overlooked. One possible approach has been to assign sites to a number of categories defined a priori based on protein secondary structure or solvent accessibility (Goldman et al., 1998; Liò and Goldman, 1999) and to estimate and apply a different rate matrix to each site category. More recently, Lartillot and Philippe (2004) introduced the CAT model, a mixture model where the exchange probability matrix is common to all sites (in fact, uniform mutation probabilities are assumed), while stationary frequencies are allowed to vary between groups of sites and sites are automatically assigned to a group based on their fit. The use of site-specific or group-specific exchange probability matrices has in turn proved more problematic. Le et al. (2008) have described new models where multiple matrices are used, either specific to a given subset of the alignment known a priori (e.g. exposed/buried or extended/helix/other in supervised approaches) or to site partitions that are defined during the analysis and hence cannot be associated a priori with specific protein features (a so-called unsupervised approach). Although mixture models are probably the most theoretically favorable context where such procedures can be implemented, and in fact the models described in Le et al. (2008) have been included in a specific version of the software PhyML (PhyML-mixtures), a generalized application of the procedure in an unsupervised context has proven problematic, leading to the use of semi-supervised models in Le et al. (2008).
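The idea behind such mixture models can be sketched numerically: the likelihood of a site is a weighted sum of its likelihoods under each class, and the posterior class probabilities indicate which class fits the site best. The snippet below is a minimal sketch under stated assumptions. It takes the per-class site log-likelihoods as given (in practice a phylogenetic program computes them on a tree), and the function name and toy numbers are hypothetical rather than part of CAT or PhyML-mixtures.

```python
import numpy as np

def mixture_site_loglik(class_loglik, weights):
    """Mixture log-likelihood and posterior class probabilities per site.

    class_loglik: (n_sites, n_classes) array of log L(site | class).
    weights: class weights, assumed positive and summing to one.
    """
    class_loglik = np.asarray(class_loglik, dtype=float)
    log_w = np.log(np.asarray(weights, dtype=float))
    joint = class_loglik + log_w                      # log(w_c * L_c)
    peak = joint.max(axis=1, keepdims=True)           # guards against underflow
    site_loglik = peak[:, 0] + np.log(np.exp(joint - peak).sum(axis=1))
    posterior = np.exp(joint - site_loglik[:, None])  # P(class | site)
    return site_loglik, posterior

# Two sites, two classes, equal weights (toy numbers)
class_loglik = np.log([[0.2, 0.4], [0.05, 0.01]])
site_ll, post = mixture_site_loglik(class_loglik, [0.5, 0.5])
best_class = post.argmax(axis=1)                      # class with highest posterior
```

Assigning each site to the class with the highest posterior is one simple way to turn a mixture into a partition; it is shown here only to make the contrast with a priori (supervised) categories concrete.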
In this study we assess the applicability of an unsupervised procedure of site clustering (Dunn et al., 2013) coupled with the estimation of empirical matrices for each site cluster (Dang et al., 2011). A new set of three matrices specific for Pancrustacea is provided, as well as a script that automates the procedure. The performance of the method is evaluated using maximum likelihood (ML) and Bayesian inference (BI), both on the reference Pancrustacea dataset and on a single order of interest.
Section snippets
Data and matrix transformation
The data analyzed here derive from the manually curated dataset of complete mitochondrial genomes described in Carapelli et al. (2007) and van der Wath et al. (2008), the basis of the MtPAN matrix, which was further processed to eliminate gapped (322) and ambiguous (31) characters. The entire dataset (81 Pancrustacea and 5 outgroups, 2653 aligned positions: Supplementary data 1) was used for clustering and matrix estimation. The entire dataset, as well as sequences from one single hexapod order of special
Clustering
The matrix of per-site physicochemical properties is provided as Supplementary text 2. When clustering in 1–10 groups is compared, values of Gap(k) increase sharply with k going from 1 to 3, plateau, and then increase again, with a marked elbow at k = 3 (Fig. 1). While a strict application of the one-standard-deviation rule of Tibshirani et al. (2001) would suggest the use of problematically high values of k, in fact above 10, the observation that (a) the difference between k = 3 and k = 4 is a mere 2
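For readers unfamiliar with the gap statistic, the following minimal sketch shows how Gap(k) and the one-standard-deviation rule of Tibshirani et al. (2001) can be computed. It is not the Gap_script used in this study: it assumes a plain k-means clustering, a uniform reference distribution over the bounding box of the data, and synthetic two-dimensional data, and all function names are hypothetical.

```python
import numpy as np

def kmeans_dispersion(data, k, rng, n_init=10, n_iter=50):
    """Smallest within-cluster sum of squares over n_init k-means runs."""
    best = np.inf
    for _ in range(n_init):
        centers = data[rng.choice(len(data), size=k, replace=False)]
        for _ in range(n_iter):
            dist = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dist.argmin(axis=1)
            new_centers = np.array([
                data[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        dist = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        best = min(best, dist.min(axis=1).sum())
    return best

def gap_statistic(data, k_max, n_ref=10, seed=0):
    """Gap(k) and its error term s(k) for k = 1..k_max."""
    rng = np.random.default_rng(seed)
    lo, hi = data.min(axis=0), data.max(axis=0)
    gaps, errs = [], []
    for k in range(1, k_max + 1):
        log_w = np.log(kmeans_dispersion(data, k, rng))
        # Reference datasets: uniform draws over the bounding box of the data
        ref = np.array([
            np.log(kmeans_dispersion(rng.uniform(lo, hi, size=data.shape), k, rng))
            for _ in range(n_ref)
        ])
        gaps.append(ref.mean() - log_w)
        errs.append(ref.std() * np.sqrt(1.0 + 1.0 / n_ref))
    return np.array(gaps), np.array(errs)

def one_se_rule(gaps, errs):
    """Smallest k with Gap(k) >= Gap(k+1) - s(k+1); None if no k qualifies."""
    for i in range(len(gaps) - 1):
        if gaps[i] >= gaps[i + 1] - errs[i + 1]:
            return i + 1
    return None

# Synthetic example: three well-separated groups of points
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 9.0]])
data = np.vstack([c + rng.normal(scale=0.5, size=(40, 2)) for c in centers])
gaps, errs = gap_statistic(data, k_max=5)
print(one_se_rule(gaps, errs))
```

When Gap(k) keeps increasing over the whole range examined, `one_se_rule` returns `None`, which corresponds to the situation described above where a strict application of the rule points to values of k beyond the range tested.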
Acknowledgments
We wish to thank K. Dunn (Dalhousie U.) for sharing her original script and contributing the function for the reference distribution in Gap_script. We also wish to thank E. Trentin (U. Siena) for stimulating discussions on clustering and J. Pons (IMEDEA) for suggestions on the stepping stone procedure. Special thanks to Susana Bueno Minguez (CINECA) for use of R on EURORA and Mark Miller (CIPRES) for modifying the RAxML interface to allow the use of multiple user matrices.
This work was
References (30)
- et al. MtZoa: a general mitochondrial amino acid substitutions model for animal evolutionary studies. Mol. Phylogenet. Evol. (2009)
- et al. A comparative analysis of complete mitochondrial genomes among Hexapoda. Mol. Phyl. Evol. (2013)
- Among-site rate variation and its impact on phylogenetic analyses. TREE (1996)
- et al. MtArt: a new model of amino acid replacement for Arthropoda. Mol. Biol. Evol. (2007)
- et al. MOLPHY version 2.3: programs for molecular phylogenetics based on maximum likelihood. Comput. Sci. Monogr. (1996)
- et al. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach (2002)
- et al. Phylogenetic analysis of mitochondrial protein coding genes confirms the reciprocal paraphyly of Hexapoda and Crustacea. BMC Evol. Biol. (2007)
- Carapelli, A., Convey, P., Nardi, F., Frati, F., 2013. The mitochondrial genome of the Antarctic springtail Folsomotoma...
- et al. ReplacementMatrix: a web server for maximum-likelihood estimation of amino acid replacement matrices. Bioinformatics (2011)
- et al. A model of evolutionary change in proteins
- Improving evolutionary models for mitochondrial protein data with site-class specific amino acid exchangeability matrices. PLoS ONE
- Assessing the impact of secondary structure and solvent accessibility on protein evolution. Genetics
- Bayes factors. J. Am. Stat. Assoc.
- Statistical analysis of the physical properties of the 20 naturally occurring amino acids. J. Protein Chem.
- A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol. Biol. Evol.