Short Communication
MtPAN3: Site-class specific amino acid replacement matrices for mitochondrial proteins of Pancrustacea and Collembola

https://doi.org/10.1016/j.ympev.2014.02.001

Highlights

  • A new three-matrix model is developed for Pancrustacea mitochondrial genomes.

  • Alignment sites are clustered according to physiochemical values.

  • Multiple substitution matrices are estimated, one from each site group.

  • The new multiple matrix model MtPAN3 outperforms available single matrix models.

  • An R/BioPerl script is provided to perform matrix transformation and clustering.

Abstract

Phylogenetic analyses of Pancrustacea have generally relied on empirical models of amino acid substitution estimated from large reference datasets and applied to the entire alignment. More recently, following the observation that different sites, or groups of sites, may evolve under different evolutionary constraints, methods have been developed to deal with site or site-class specific models.

A set of three matrices has been developed here, based on an alignment of complete mitochondrial pancrustacean genomes partitioned using an unsupervised clustering procedure acting on per-site physiochemical properties. The performance of the proposed matrix set – named MtPAN3 – was compared to relevant single matrix models (MtZOA, MtART, MtPAN) under ML and BI. While the application of the new model does not solve some of the topological problems frequently encountered with pancrustacean mitogenomic phylogenetic analyses, MtPAN3 largely outperforms its competitors based on AIC and Bayes factors, indicating a significantly improved fit to the empirical data. The applicability of the new model, as well as of multiple matrix models in general, is discussed and an R/BioPerl script that implements the procedure is provided.

Introduction

Modern methods of phylogenetic analysis rely on a so-called model of molecular evolution, i.e. a mathematical model describing how sequences evolve through time (Liò and Goldman, 1998). At the core of an evolutionary model is the substitution rate matrix, which combines a symmetrical matrix of exchange rates, describing the instantaneous rate of change between each amino acid pair, with a vector of equilibrium frequencies. The model can also include additional components, such as among-site rate variation (ASRV), tree shape and demographic processes, which are not directly considered here. The actual values in the matrix can be either optimized alongside the phylogenetic tree during the analysis (i.e. mechanistic models: Yang et al., 1998) or estimated a priori from large reference datasets and subsequently applied to the specific dataset under consideration (i.e. empirical models). While nucleotide analyses almost invariably use mechanistic models, amino acid analyses generally rely on empirical models, as the large number (190) of substitution parameters cannot generally be estimated from small/medium datasets. A number of matrices have been determined for both general interest proteins (e.g. PAM: Dayhoff et al., 1972; WAG: Whelan and Goldman, 2001; LG: Le and Gascuel, 2008) and specific genetic systems (e.g. MtREV: Adachi and Hasegawa, 1996 for mitochondrial proteins). Following the observation that differences exist between taxonomic groups, matrices specific to a given lineage have also been developed such as, among others, MtPAN for pancrustacean mitochondrial genomes (Carapelli et al., 2007), MtART for Arthropoda (Abascal et al., 2007) and MtZOA for the entire Metazoa (Rota-Stabelli et al., 2009).
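As an illustration of how exchange rates and equilibrium frequencies are combined, the following minimal sketch builds a rate matrix with off-diagonal entries q_ij = s_ij · π_j, rows summing to zero, and a scaling of one expected substitution per unit time. The function name `build_rate_matrix` and the random toy inputs are assumptions made for the example; they are not values from MtPAN3 or any published matrix.

```python
import numpy as np

def build_rate_matrix(exchange, freqs):
    """Combine symmetric exchangeabilities and equilibrium frequencies into Q."""
    exchange = np.asarray(exchange, dtype=float)
    freqs = np.asarray(freqs, dtype=float)
    freqs = freqs / freqs.sum()            # frequencies must sum to 1
    q = exchange * freqs[np.newaxis, :]    # q_ij = s_ij * pi_j for i != j
    np.fill_diagonal(q, 0.0)
    np.fill_diagonal(q, -q.sum(axis=1))    # each row sums to zero
    scale = -np.dot(freqs, np.diag(q))     # expected substitutions per unit time
    return q / scale                       # normalise to one substitution per unit time

n_states = 20
# Number of amino acid pairs, i.e. entries of the symmetric exchange matrix.
n_exchangeabilities = n_states * (n_states - 1) // 2

# Toy inputs (hypothetical, random): a symmetric exchange matrix and a frequency vector.
rng = np.random.default_rng(0)
upper = np.triu(rng.uniform(0.1, 1.0, size=(n_states, n_states)), k=1)
toy_exchange = upper + upper.T
toy_freqs = rng.dirichlet(np.ones(n_states))

Q = build_rate_matrix(toy_exchange, toy_freqs)
```

The count of amino acid pairs, 20 × 19 / 2, gives the 190 substitution parameters mentioned above, which is why empirical matrices are usually estimated once from large reference datasets.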

Early methods of phylogenetic analysis generally assumed that each and every alignment site evolved under the same model (rates, exchange probabilities, equilibrium frequencies), with more sophisticated implementations going in the direction of modeling such aspects of sequence evolution explicitly. While ASRV is generally accounted for (Yang, 1996; Stamatakis, 2006a), variations in the substitution process (i.e. equilibrium frequencies and exchange probability matrix) are frequently overlooked. One possible approach has been to assign sites to a number of categories defined a priori based on protein secondary structure/solvent accessibility (Goldman et al., 1998; Liò and Goldman, 1999) and to estimate/apply a different rate matrix to each site category. More recently Lartillot and Philippe (2004) used a mixture model where the exchange probability matrix is common to all sites – in fact uniform mutation probabilities are assumed – while stationary frequencies are allowed to vary between groups of sites and sites are automatically assigned to a group based on their fit – the CAT model. The use of site specific – or group specific – exchange probability matrices has in turn proved more problematic. Le et al. (2008) have described new models where multiple matrices are used, either specific to a given subset of the alignment known a priori (e.g. exposed/buried or extended/helix/other in supervised approaches) or to site partitions that are defined during the analysis and hence cannot be associated a priori with specific protein features (a so-called unsupervised approach). Although mixture models are probably the most theoretically favorable context where such procedures can be implemented, and in fact the models described in Le et al. (2008) have been included in a specific version of the software PhyML (PhyML-mixtures), a generalized application of the procedure in an unsupervised context has proven problematic, leading to the use of semi-supervised models in Le et al. (2008).
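The difference between assigning each site to one matrix and mixing over matrices can be written compactly. The notation below is an assumption introduced for illustration and is not taken from the cited works: D_i is the data at site i, T the tree with its branch lengths, Q_k the rate matrix of class k, c(i) the class to which site i is assigned, and w_k the mixture weight of class k.

```latex
% Partitioned (site-class) model: each site i uses the matrix of its assigned class c(i)
\ln L_{\text{partition}} = \sum_{i=1}^{N} \ln L\left(D_i \mid T, Q_{c(i)}\right)

% Mixture model: each site is averaged over all K matrices with weights w_k
\ln L_{\text{mixture}} = \sum_{i=1}^{N} \ln \sum_{k=1}^{K} w_k \, L\left(D_i \mid T, Q_k\right),
\qquad \sum_{k=1}^{K} w_k = 1
```

In the first form the assignment c(i) must be supplied, either from known protein features or from a clustering step; in the second, the class of each site is integrated out.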

In this study we assess the applicability of an unsupervised procedure of site clustering (Dunn et al., 2013) coupled with the estimation of empirical matrices for each site cluster (Dang et al., 2011). A new set of three matrices specific for Pancrustacea is provided, as well as a script that automates the procedure. The performance of the method is evaluated based on ML and BI both on the reference Pancrustacea dataset and on a single order of interest.

Section snippets

Data and matrix transformation

The data analyzed here derive from the manually curated dataset of complete mitochondrial genomes described in Carapelli et al. (2007) and van der Wath et al. (2008), the basis of the MtPAN matrix, which was further processed to eliminate gapped (322) and ambiguous (31) characters. The entire dataset (81 Pancrustacea and 5 outgroups, 2653 aligned positions: Supplementary data 1) was used for clustering and matrix estimation. The entire dataset, as well as sequences from one single hexapod order of special
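The removal of gapped and ambiguous characters can be sketched as a column filter over the alignment. This is a minimal illustration that assumes whole alignment columns are dropped, that gaps are coded as "-", and that any symbol outside the 20 standard amino acids counts as ambiguous; the function name `drop_gapped_and_ambiguous` and the toy sequences are hypothetical and do not reproduce the authors' script.

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

def drop_gapped_and_ambiguous(alignment):
    """Drop columns with gaps or ambiguous residues; return the filtered alignment and counts."""
    names = list(alignment)
    seqs = [alignment[name].upper() for name in names]
    if len({len(seq) for seq in seqs}) != 1:
        raise ValueError("all sequences must have the same aligned length")

    kept, n_gapped, n_ambiguous = [], 0, 0
    for column in zip(*seqs):
        if "-" in column:
            n_gapped += 1          # a column with a gap is counted as gapped only
        elif any(residue not in AMINO_ACIDS for residue in column):
            n_ambiguous += 1
        else:
            kept.append(column)

    filtered = {
        name: "".join(column[i] for column in kept)
        for i, name in enumerate(names)
    }
    return filtered, n_gapped, n_ambiguous

# Toy alignment (hypothetical): one gapped column and one ambiguous column.
toy = {
    "taxon_a": "MKT-LLXA",
    "taxon_b": "MKTALLVA",
    "taxon_c": "MRTALLVA",
}
filtered, n_gapped, n_ambiguous = drop_gapped_and_ambiguous(toy)
```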

Clustering

The matrix of per-site physiochemical properties is provided as Supplementary text 2. When clustering in 1–10 groups is compared, values of Gapk increase sharply with k going from 1 to 3, plateau, and then increase again, with a marked elbow at k = 3 (Fig. 1). While a strict application of the one-standard-deviation rule of Tibshirani et al. (2001) would suggest the use of problematically high values of k, in fact above 10, the observation that (a) the difference between k = 3 and k = 4 is a mere 2
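For readers unfamiliar with the gap statistic, the sketch below shows one way Gapk and the selection rule of Tibshirani et al. (2001) can be computed. It is not the Gap_script used in the study: the plain k-means clustering, the uniform reference distribution drawn over the bounding box of the data, the function names and the toy data are all assumptions made for illustration.

```python
import numpy as np

def kmeans_wss(data, k, rng, n_init=10, n_iter=50):
    """Within-cluster sum of squares from a plain Lloyd k-means (best of n_init starts)."""
    best = np.inf
    for _ in range(n_init):
        centers = data[rng.choice(len(data), size=k, replace=False)]
        for _ in range(n_iter):
            dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dists.argmin(axis=1)
            # Keep the old center if a cluster becomes empty.
            new_centers = np.array([
                data[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        best = min(best, dists.min(axis=1).sum())
    return best

def gap_statistic(data, k_max, n_ref=10, seed=0):
    """Return Gap(k) and s_k for k = 1..k_max using uniform reference datasets."""
    rng = np.random.default_rng(seed)
    low, high = data.min(axis=0), data.max(axis=0)
    gaps, errs = [], []
    for k in range(1, k_max + 1):
        log_w = np.log(kmeans_wss(data, k, rng))
        ref_log_w = np.array([
            np.log(kmeans_wss(rng.uniform(low, high, size=data.shape), k, rng))
            for _ in range(n_ref)
        ])
        gaps.append(ref_log_w.mean() - log_w)                     # Gap(k)
        errs.append(ref_log_w.std() * np.sqrt(1.0 + 1.0 / n_ref))  # s_k
    return np.array(gaps), np.array(errs)

def smallest_k_by_rule(gaps, errs):
    """Smallest k with Gap(k) >= Gap(k+1) - s_(k+1); falls back to the largest k tested."""
    for k in range(1, len(gaps)):
        if gaps[k - 1] >= gaps[k] - errs[k]:
            return k
    return len(gaps)

# Toy data (hypothetical): three well-separated groups of points in two dimensions.
rng = np.random.default_rng(1)
toy = np.vstack([
    rng.normal(loc=center, scale=0.5, size=(40, 2))
    for center in ([0.0, 0.0], [10.0, 0.0], [0.0, 10.0])
])
gaps, errs = gap_statistic(toy, k_max=5)
```

When Gapk keeps rising across the whole range tested, the rule never triggers and the function returns the largest k examined, which mirrors the situation described above where a strict application of the rule points to impractically high values of k.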

Acknowledgments

We wish to thank K. Dunn (Dalhousie U.) for sharing her original script and contributing the function for the reference distribution in Gap_script. We also wish to thank E. Trentin (U. Siena) for stimulating discussions on clustering and J. Pons (IMEDEA) for suggestions on the stepping stone procedure. Special thanks to Susana Bueno Minguez (CINECA) for use of R on EURORA and Mark Miller (CIPRES) for modifying the RAxML interface to allow the use of multiple user matrices.

This work was

References (30)

  • K.A. Dunn et al. Improving evolutionary models for mitochondrial protein data with site-class specific amino acid exchangeability matrices. PLoS ONE (2013)
  • N. Goldman et al. Assessing the impact of secondary structure and solvent accessibility on protein evolution. Genetics (1998)
  • R.E. Kass et al. Bayes factors. J. Am. Stat. Assoc. (1995)
  • A. Kidera et al. Statistical analysis of the physical properties of the 20 naturally occurring amino acids. J. Protein Chem. (1985)
  • N. Lartillot et al. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol. Biol. Evol. (2004)