Non-Markovian effects on protein sequence evolution due to site dependent substitution rates

Background Many models of protein sequence evolution, in particular those based on Point Accepted Mutation (PAM) matrices, assume that its dynamics is Markovian. Nevertheless, it has been observed that evolution seems to proceed differently at different time scales, questioning this assumption. In 2011 Kosiol and Goldman proved that, if evolution is Markovian at the codon level, it can not be Markovian at the amino acid level. However, it remains unclear up to which point the Markov assumption is verified at the codon level. Results Here we show how also the among-site variability of substitution rates makes the process of full protein sequence evolution effectively not Markovian even at the codon level. This may be the theoretical explanation behind the well known systematic underestimation of evolutionary distances observed when omitting rate variability. If the substitution rate variability is neglected the average amino acid and codon replacement probabilities are affected by systematic errors and those with the largest mismatches are the substitutions involving more than one nucleotide at a time. On the other hand, the instantaneous substitution matrices estimated from alignments with the Markov assumption tend to overestimate double and triple substitutions, even when learned from alignments at high sequence identity. Conclusions These results discourage the use of simple Markov models to describe full protein sequence evolution and encourage to employ, whenever possible, models that account for rate variability by construction (such as hidden Markov models or mixture models) or substitution models of the type of Le and Gascuel (2008) that account for it explicitly. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-1135-1) contains supplementary material, which is available to authorized users.


Background
Since the publication of the work by Dayhoff and Eck [1] introducing for the first time the concept of PAM matrices, protein sequence evolution has been typically modeled as a time-homogeneous Markov process and each protein site is assumed to be ruled by the same dynamic laws and to evolve independently from the others and from its own past history. This concept is a milestone in the modeling of protein evolution and is, for example, at the basis of several successful approaches for structure prediction. After Dayhoff 's first paper, PAM matrices have *Correspondence: laio@sissa.it International School for Advanced Studies (SISSA), Via Bonomea 265, 34136 Trieste, Italy been further developed and specialized by using larger datasets [2,3] and different methods to infer the instantaneous substitution rate matrix [4][5][6]. However, in more recent years, the availability of larger and larger substitution datasets has started challenging this theoretical framework. For example, Benner et al. [7] and Mitchison and Durbin [8] observed qualitative differences in protein evolution at different sequence divergence, raising concerns on treating the substitution process as Markovian. Even more recently, Kosiol and Goldman [9] proved that, if the substitution process is Markovian at the codon level, it is not Markovian at the amino acid level. With that paper it became evident that substitution matrices on codons [10][11][12] should be preferred to those on amino acids, but it is still unclear up to which point evolution at the codon level can be considered Markovian. In particular, substitution matrices both on amino acids and on codons tend to present high rates for double and triple instantaneous substitution rates, i.e. substitutions between codons differing by more than one nucleotide or between amino acids whose codons differ all for more the one nucleotide. This phenomenon, according to biochemical wisdom, does not seem realistic and may hint to further violation of the Markov assumption not kept into account even when describing the evolution at the codon level.
Another important result in the description of protein sequence evolution was obtained in 1993 by Yang [13,14], who proved that the estimations of evolutionary distances and evolutionary trees improve if the variability of substitution rates over sites is accounted for. This rate variability, which is typically modeled by a gamma distribution [13,14], is due to many effects, including different structural and functional constraints [15] and coevolution inducing a coupling between substitutions at close-by sites [16,17].
The importance of taking rate variability into account is widely recognized in phylogenetics and many methods have been developed to include it when dealing with large multiple sequence alignments [18][19][20][21]. However, these findings are generally neglected when building substitution matrices or applying them to alignments where no further information on the rate distribution is available. One noteworthy exception is due to Le and Gascuel [22], who improved the amino acid replacement matrix by Whelan and Goldman [5] by incorporating the variability of evolutionary rates across sites, but still proposing a model on amino acids rather than on codons.
We here present a model based on a Markovian evolution of the single protein site and describe how the amongsite variability of substitution rates, by allowing each site to evolve at a different speed, makes the evolution of full protein sequences effectively non-Markovian. The observed non-Markovian behavior at the full-sequence scale can be seen as the consequence of a reduction in the state space: the full state space, consisting in the twenty amino acids on sites characterized by different rates, is implicitly reduced to the simple set of the amino acids, independently of the specific rate of that site and this gives birth to ensemble average transition probabilities on the reduced state space which are not Markovian. The main consequence is that simple Markov models of protein evolution that neglect rate variability (PAM and PAM-like matrices), no matter if they are empirical or mechanistic and if they are developed at the codon or at the amino acid level, are affected by systematic errors that, for example, may lead to underestimating the evolutionary times. We will also show that one of the effects of treating protein evolution as a Markov process is a general overestimation of instantaneous double and triple substitutions, which might explain the corresponding high values found in the most common instantaneous substitution matrices such as the Jones-Taylor-Thornton (JTT) [3], the Whelan and Goldman (WAG) [5] and the Empirical Codon Model (ECM) [10].

Markov models of protein sequence evolution
We first model protein sequence evolution as a homogeneous continuous-time Markov process defined by an N × N instantaneous substitution matrix, Q, where N is the number of possible states [23]. When protein evolution is modeled on the amino acids, the possible states are the 20 amino acids and N AA = 20, while when the framework of codons is chosen, the possible states are the 61 codons coding for amino acids and N c = 61. From now on the superscript c (resp. AA ) will be reserved to codon related (resp. amino acid related) quantities.
Each off-diagonal entry of Q, Q i,j =i , represents the instantaneous substitution rate from state i to state j and is assumed to be constant in time and over sites. The diagonal entries are defined as minus the sum of all the other entries in that row, Q ii = − j =i Q ij and account for the instantaneous probability of escaping from each state. Q is normalized so that i j =i π i Q ij = 1, where π i is the equilibrium probability of state i, defined by the set of conditions i π i Q ij = 0. Because of this normalization, the time is measured in units of expected substitutions per site. For example, t = 0.01 corresponds to a typical rate of substitution of 1 %, constant along the protein chain.
To analyze the dynamics in the framework of codons we consider the M0 model introduced by Yang [24]: where π c j is the equilibrium probability for codon j, κ is the transition/transversion rate ratio and ω is the nonsynonymous/synonymous rate ratio. The parameters are set to their typical values for protein-coding DNA: ω = 0.2, κ = 2.5 and the codon distribution π c i is chosen as in Kosiol and Goldman [9]. For the sake of completeness, the substitution dynamics is here modeled also in the amino acid framework, using the WAG matrix [5] as Q AA .
The transition probability from state i to state j in a time interval of t is given by: The matrix P(t), defined from now on as the transition probability matrix at time t, describes by construction a Markovian dynamics.

Ensemble average transition probabilities
We now consider the effect on protein sequence evolution of a site-dependent substitution rate. Consistently with what proposed by Yang [13], we assume that the rate of substitution, r, is distributed according to a γ -shaped probability density function with average value r = 1: where α is the shape parameter and k is a normalization factor such that ∞ 0 ρ(r)dr = 1. In the rest of the paper, both in the codon and in the amino acid model, we are going to use α = 0.286. This value is consistent with those observed in several protein families (see Table 1 of ref. [25]), but the specific choice of α is here irrelevant, being the scope of this work to provide a demonstration of the consequences of a plausible rate distribution on protein sequence evolution. The results for other values of α in the range [0.2 : 1] are reported in the Additional files.
The transition probability from state i to state j in a time interval of t for a site characterized by rate r is given by: (4) When no information is available on the specific rate of each site, which is the typical premise when using substitution matrices, we can score an alignment only by comparing it to average transition probabilities. So we are interested in estimating the average probability for a site being in state i at time zero to be in state j at time t, considering that the rate distribution is given by Eq. 3 and that each site evolves according to the Markovian dynamics described by Eq. 4: We will call P(t) the ensemble average transition probability matrix at time t.
Here the term ensemble average should be intended as an average over the ensemble of sites subject to the distribution of the substitution rate described by Eq. 3. We want to highlight that the definition of Eq. 5 implicitly entail that each site is characterized by a substitution rate that remains constant over time. This is, of course, an approximation, because during evolution the propensity for a site to accept mutations may change [26], but, for short evolutionary times and in the range of sequence identity considered in this study (∼ 80 %), this approximation should hold. In fact, this is the same approximation implicitly used in the vast majority of phylogenetic algorithms for tree reconstruction, where each protein site is assumed to maintain the same rate along the branches of the full tree.

Non-Markovian behavior of ensemble average transition probabilities
According to Eq. 5, the ensemble average transition probability matrix is a combination of many Markovian transition probabilities and, in general, combinations of non-identical Markov processes are not Markovian. In other words, even if here the single-site dynamics is assumed to be Markovian, when the full protein sequence evolution is approximated by neglecting site specificity as done in general substitution matrices, the state space is implicitly reduced and only some special reductions, with respect to which that process is "lumpable" [27], still give rise to Markovian dynamics. With this in mind, the non-Markovian behavior of the full protein sequence dynamics can be simply proved either by checking that In the next paragraph, we are going to quantify the entity of the violation of the Markov assumption.

Results
In order to understand qualitatively the effects of the variation of the rate over sites, let us first consider a simplified world with only three codons, A, B and C, characterized all by the same frequency π A = π B = π C = 1 3 . We assume that the instantaneous substitution matrix for this model is: If the rate of substitution is constant over sites, the transition probability matrix at time t is P(t) = e tQ (6) and describes a Markovian dynamics. On the other hand, we can imagine a sequence where, still keeping the same average rate, half of the sites has a reduced substitution rate of 0.5 and the other half has a faster substitution rate of 1.5. For this second system the ensemble average transition probability matrix at time t will be It may be of interest to monitor the value of these two sets of transition probabilities as functions of time and compare them. Being the three codons equiprobable, Q, P and P are symmetric and so we can limit ourselves to check only 3 different transition probabilities: P AB = P BA , P AC = P CA and P BC = P CB . In Fig. 1 we compare the time evolution of these three quantities (respectively in black, blue and red) in the two systems (in solid line for Eq. 6 and points for Eq. 7): clearly P(t) = P(t). So, it is evident that the variation of the rate induces a change in the average transition probabilities even if both Q and the average substitution rate do not change. It is also easy to verify that the dynamics described by Eq. 7 is not Markovian (see Methods).

Non-Markovian behavior in the framework of codons
A similar effect can be observed in the models describing protein sequence evolution both at the codon level and at the amino acid level and it may be interesting to quantify the violation of the Markov assumption in presence of a plausible rate variation.
Concerning codons, we model the instantaneous substitution matrix as in Eq. 1 and the rate distribution by Eq. 3 (see Methods). Since now the rate distribution is continuous, the sum in Eq. 7 is replaced by an integral and gives Eq. 5. To quantify the variation of the ensemble average transition probability matrix with respect to the Markovian transition probability matrix, we compare P c (t) to P c (t) at time t = 0.235, which corresponds, for P c , to the 80 % of sequence identity. In Fig. 2a we show the Fig. 1 Transition probabilities in a simplified world. Comparison between the transition probabilities in a sequence with constant substitution rate over sites and in a sequence with two equiprobable classes of rates for a simplified system described in Results. Black: P AB (t) (solid line) andP AB (t) (points); Blue: P AC (t) (solid line) andP AC (t) (points); Red: P BC (t) (solid line) andP BC (t) (points) entry-by-entry comparison between them in log-log scale: each point corresponds to a pair i, j of codons and its xvalue is given by the Markovian evolution P c ij (t), while its y-value is its non-Markovian counterpart P c ij (t). If the two dynamics gave the same results, the points would lie on the line y = x, but this is not the case. In particular, one can see four separate subsets: the black squares (zoomed in the inset) are the entries corresponding to j = i (the diagonal terms in the matrix), while the red, green and blue points correspond to j = i (the off-diagonal ones), where i and j differ respectively by one, two or three nucleotides. It is evident that, with respect to the Markovian dynamics, P c (t) gives rise to higher entries for j = i enhances double and triple substitutions and discourages single ones.
A first consequence is that the expected sequence identity between two sequences separated by an evolutionary time t is much lower for the Markovian dynamics than for the non-Markovian one. This happens because, even if the average rate of substitutions is the same, in the non-Markovian case it is much more likely that substitutions cumulate on the few sites with rate larger than 1. In this way, a much larger fraction of substitutions takes place on a site that has already mutated, without further modifying the global sequence identity. The Markovian assumption produces therefore a systematic underestimation of evolutionary times. This result may be considered the theoretical explanation of the observation by Yang et al. [14] that, when taking substitution rate variability into account, one gets larger estimates of branch lengths in phylogenetic trees. The difference of sequence identity between two sequences separated by a given evolutionary time in the two processes can be found in Additional file 3 ( Figure S2 (a)). In particular, at the time t = 0.235 the non-Markovian dynamics presents the 85.7 % of sequence identity, while the Markovian one only the 80 %.
It is, then, more appropriate to compare the two processes at fixed sequence identity: in Additional file 3 ( Figure S2 (b)), one can find the same comparison of Fig. 2a with the timet of the Non-Markovian process chosen to produce a sequence identity of the 80 %, which givest = 0.4. Even if this choice balances the entries corresponding to i = j, the non-Markovian dynamics still enhances double and triple substitutions with respect to its Markovian analogue. For example, at sequence identity of 80 %, the estimated probability of finding a substitution from codon ATC to codon TGG (3 different nucleotide, so one of the blue points in Fig. 2a-c) is 5.21 · 10 −8 when the Markovian approximation is adopted, while is more than one hundred times bigger if the rate is γdistributed. In Table 1 one can find some other examples of how transition probabilities change in the two frameworks. In the Additional files we provide the ensemble average transition probability matrices for codons estimated by Eq. 5 (Additional file 1) and the Markovian counterpart described by Eq. 2 (Additional file 2) at sequence identities ranging from 95 to 50 %.

Non-Markovian behavior in the framework of amino acids
The same calculations can be also performed in the framework of amino acids: Fig. 2d shows the entry-by-entry comparison between the Markovian dynamics described by P AA and its non-Markovian analogue P AA (t) in loglog scale, where t = 0.23, corresponding to the 80 % of sequence identity in the Markovian dynamics. Here red, green and blue points identify the transition probabilities between pairs of amino acids whose most similar pair of codons (i.e. the pair of codons with maximal number of identical letters) differ respectively by one, two or three nucleotides. Not surprisingly, the subsets are less separated then in the framework of codons. In fact The first three rows involve substitutions in the framework of codons, while the last three are in the framework of amino acids many amino acid substitutions are a combination of single, double and triple nucleotide substitutions. Anyway it can be observed that the substitutions between amino acids where at least two nucleotides must change are more frequent in the non-Markovian dynamics than in the Markovian one, as already observed for codons. The comparison of the sequence identity generated by P AA and P AA and the entry-by-entry comparison of P AA (t) and P AA t at the 80 % of sequence identity can be found in Additional file 3 ( Figure S2 (c) and (d)). For example, at sequence identity of 80 %, the estimated probability to find a substitution from glycine to isoleucine (where at least two substitutions are needed) is approximately one third for the Markovian model with respect to the non-Markovian one. In Table 1 further examples of differences in the transition probabilities between the Markovian and the non-Markovian dynamics can be found.

Impact on the estimation of Q of the Non-Markovian behavior due to the rate variability
We now show that treating full protein sequence evolution as Markovian, neglecting substitution rate variability, determines also a wrong estimation of Q, the instantaneous substitution matrix. In particular, we will see that, when learning Q c from pairwise alignments, substitution rates between codons differing by more than one nucleotide are systematically magnified. This is somehow intuitive: rate variability allows substitutions to accumulate on the few sites with high substitution rate and so, when learning substitution frequencies from alignments, we find a larger number of double and triple substitutions than expected if the rates were constant. Then, when inferring Q c from these data without taking rate variability into account, the only way to encompass the extra number of double substitutions is to enhance instantaneous double and triple transition probabilities. For simplicity we are going to show this for a particular case, where Q c is estimated from alignments all at the same sequence identity, but the reasoning can be generalized for alignments at various sequence identity and for multiple sequence alignments.
To evaluate the order of magnitude of this overestimation of instantaneous double and triple substitutions, we recover a measure of Q c , Q c (t), from the ensemble average transition probability matrix at timet = 0.4, P c t = 0.4 . If, when estimating Q c t , we are considering the process as Markovian, for a sequence identity of 80 % we would infer the evolutionary time being nott = 0.4 but rather t = 0.235 (see previous calculations and Fig. 2b). So we can calculate Q c t by inverting Eq. 2: witht = 0.4 and t = 0.235. Figure 3 shows the entry-by-entry comparison between the original Q c and Q c t . The two matrices clearly do not correspond, as the points do not lie on the line y = x, so the estimate of Q c from alignments when neglecting rate variability is affected by systematic errors. In particular, we can calculate the fraction of instantaneous double substitutions in the original Q c , f true 2 , and in the estimated Q c , f est 2 , by: where π i is the equilibrium probability of codon i and the sum at the numerator is the restricted sum over the entries involving a pair of codons i, j differing by two nucleotides. The fractions of triple substitutions for the original Q c , f 3 , and for the estimated Q c , f est 3 , are computed in a similar way.
In the original instantaneous rate matrix Q c (Eq. 1) double and triple substitutions are not allowed, so f true  are defined as the substitutions between amino acids whose most similar pair of codons differ by two nucleotides). Considering that mutations take place by chance, one would rarely expect double or triple substitutions to happen in an infinitesimal time on the same codon, which is also the underlying hypothesis in the definition of the mechanistic Q c of Eq. 1. For example if single mutations take place with probability p in a small time dt and the sites evolve independently, two neighbor mutations in the same time interval should happen at the much smaller probability of p 2 .
A possible explanation of the high value of double and triple substitution rates in standard substitution matrices is that the Markov assumption may have induced a fictitious increase for double and triple substitutions. A full proof of this idea would require recalculating Q from the same alignments used to build each matrix by including the rate variability. However, this explanation is consistent with two previous results: when the WAG matrix was re-examined by Le and Gascuel by including the γ correction, they found smaller values for the triple substitutions [22] and De Maio et al. [28] observed that accounting for rate variability by hidden Markov Models reduces the estimate of instantaneous multiple substitutions in the ECM matrix.

Discussion and conclusions
We have discussed the effects of the among-site variability of substitution rates in the process of protein sequence evolution. The relative difference of the rates mixes Markov processes with different speed, which makes the process of full sequence evolution effectively not Markovian. The first consequence of the violation of the Markov assumption is a systematic underestimation of evolutionary distances. We have quantified the violation of the Markov assumption for two realistic models (respectively on codons and amino acids), demonstrating that neglecting the rate variability may cause two orders of magnitude of difference in the relative probability for triple substitutions and one order of magnitude for double substitutions. We have also shown that this approach modifies in a radical way the estimate of Q itself by especially magnifying double and triple substitutions, which might explain the correspondent high transition probabilities in the main instantaneous substitution matrices (e.g. JTT, WAG, ECM).
Statistical inference of phylogenies under Markov models including γ -distributed rate variation [13,14] as well as CAT models [20] can effectively deal with this problem and mixture models [18,19,29], that allow not only site-dependent substitution rates but also site-dependent substitution matrices, can go even beyond. However, the substitution matrices for codons and amino acids are most of the times derived without taking into account the among-site rate variability and these matrices enter necessarily even in the construction of the seed multiple sequence alignment at the basis of any hidden Markov Model. According to our findings, Markovian models for protein evolution based on most of the available substitution matrices are affected by errors that get worse when inferring information far from the learning set. This is valid both for models at the codon level and at the amino acid level, for which Kosiol and Goldman [9] have already showed that a further source of memory is present.
The results shown here are robust with respect to the specific choice of the rate matrix and rate distribution: as can be guessed by the first simple example in Results, any non-trivial rate distribution combined in Eq. 5 with whatever Q gives rise to ensemble average transition probabilities P which differ from the simple P = e tQ . The results presented in Fig. 2 should then be intended as a "proof-of-principle" that variable substitution rates cause a non-Markovian full protein sequence evolution and as a plausible estimate of the entity of the systematic errors arising when using standard substitution models in a naive way.
Even if further and more specific analysis would be necessary to quantify the impact of the effect described in this work on specific applications, the present results seem to discourage the use of simple Markovian models that neglect among-site rate variability for both amino acid and codon sequence alignments, especially when the substitution matrices are learned on alignments in a range of sequence identity very different from the test set. On the other hand, they encourage the use of models that account for among-site rate variability, for example mechanistic codon models with the γ correction, Hidden Markov Models [28,30,31], CAT models [20] or other mixture models [18,19,29] that allow it by construction, or substitution models of the type of Le and Gascuel (LG) [22] that account for it explicitly. In particular, we highlight the necessity of developing a codon analogue of the LG matrix, in order to get rid at the same time of both the identified factors leading to a non-Markovian behavior of full protein sequence evolution: the degeneration of the genetic code and the rate variability.

Appendix A: Numerical proof of the violation of the Markov assumption
In this section we provide a numerical example that proves the violation of the Markov assumption in the evolution ruled by P. For all Markov processes described by a transition probability P the following property holds for any pair of times t 0 and t: This equation, generally known as the Chapman-Kolmogorov equation [32], can then be employed as a test of the Markov assumption. In Additional file 3 ( Figure S1) we compare entry by entry in log-log scale P c (t) with P c (t 0 ) t/t 0 for t = 0.235 and t 0 = 0.01. Coordinates, colors and lines have the same meaning as in Fig. 2a in section Results. It is evident that the points do not lie on the diagonal so P c (t) = P c (t 0 ) t/t 0 . Moreover one can notice that P c (t 0 ) t/t 0 is much more similar to P c (t) than to P c (t).
Indeed P c (t 0 ) t/t 0 = e t·[log( P c (t 0 ))/t 0] and, being t 0 very small, log P c (t 0 ) /t 0 Q c . So, for this choice of t and t 0 , P c (t 0 ) t/t 0 mimics a Markovian dynamics for a Q c slightly different from Eq. 1.

Appendix B: Proof of the violation of the Markov assumption by the properties of lumpable processes
Given a set of states s = {s 1 , s 2 , . . . s N } and a partition on it A = {A 1 , A 2 , . . . A r }, a necessary and sufficient condition for a Markov chain on s to be lumpable with respect to A is that, for every pair of sets A i and A j , the sum s l ∈Aj p s k ,s l of the transition probabilities from state s k to states s l ∈ A j has the same value for every s k ∈ A i [27]. We exploit this property to prove the non-Markovian behavior of sequence evolution in presence of rate heterogeneity. Here the full state space is given by all the possible pairs {r, c} with r a real number in [0 : ∞] corresponding to a rate value and c one of the 64 codons. We consider the following transition probability from state s 1 to state s 2 : p {r1,c1},{r2,c2} = δ(r 1 , r 2 ) · e r1· t·Q c1,c2 . where t is an arbitrarily small time. We partition the state space into a 64-dimensional reduced space given by the set of