The Action of Key Factors in Protein Evolution at High Temporal Resolution

Background Protein evolution is particularly shaped by the conservation of the amino acids' physico-chemical properties and the structure of the genetic code. While conservation is the result of negative selection against proteins with reduced functionality, the codon sequences determine the stochastic aspect of amino acid exchanges. Thus far, it is known that the genetic code is the dominant factor if little time has elapsed since the divergence of one gene into two, but physico-chemical forces gain importance at greater evolutionary distances. Further details, however, on how the influence of these factors varies with time are unknown to date. Methodology/Principal Findings Here, we derive each 10,000 divergence specific substitution matrices for orthologues and paralogues from the Pfam collection of multiple protein alignments and quantify the action of three physico-chemical forces and of the structure of the genetic code at high resolution using correlation analysis. For closely related proteins, the codon sequence similarity is the most influential factor controlling protein evolution, but its influence decreases rapidly as divergence grows. From a protein sequence divergence of about 20 percent on the maintenance of the hydrophobic character of an amino acid is the most influential factor. All factors lose importance from about 40 percent divergence on. This suggests that the original protein structure often does no longer represent a constraint to the protein sequence. The proteins then become free to adopt new functions. We furthermore show that the constraints exerted by both physico-chemical forces and by the genetic code are quite comparable for orthologues and paralogues, however somewhat weaker for paralogues than for orthologues in weakly or moderately diverged proteins. Conclusion/Significance Our analysis substantiates earlier findings that protein evolution is mainly governed by the structure of the genetic code in the early phase after divergence and by the conservation of physico-chemical properties at the later phase. We determine the level of sequence divergence from which on the conservation of the hydrophobic character is gaining importance over the genetic code to be 20 percent. The evolution of orthologues and paralogues is shaped by evolutionary forces in quite comparable ways.


Introduction
The evolution of proteins can be seen as a succession of replacements of amino acids by other amino acids. In order to quantify the rates by which amino acids are replaced by other amino acids so-called substitution matrices (or exchange matrices) are built from multiple sequence alignments of homologous proteins [1]. Substitution matrices are of particular importance for sequence data base searches with protein or DNA sequences of unknown function. Many attempts were made to refine them [2,3,4]. Such a substitution matrix is, strictly speaking, specific for the protein it is derived from because not all positions in a protein are of equal importance. Furthermore, substitution matrices describing the amino acid exchanges between strongly diverged proteins differ from those that describe amino acid exchanges between weakly diverged proteins [5]. We can thus speak of a time dependence of substitution matrices under the assumption that the metaphor of the molecular clock is essentially valid [6].
Several attempts were made to understand the changeability between amino acids and, hence, the elements of substitution matrices. It has become clear that essentially two factors have to be considered: the structure of the genetic code and the conservation of an amino acid's physico-chemical character [7,8]. The genetic code assigns each amino acid one or more codons with specific three-nucleotide-sequences. Amino acids can be coded for by as many as six different codons (L, R, and S) or by a unique codon (M and W). Exchanges between amino acids should therefore be facilitated if their codons are mostly similar in sequence, i. e. if they primarily differ by just one nucleotide [9]. Such amino acid exchanges should be relatively frequent. Conversely, exchanges between amino acids which are essentially coded for by dissimilar codons should be relatively rare.
Apart from the genetic code, an amino acid's physico-chemical properties have to be taken into account. Since substitution matrices are built from fully functional proteins it can be assumed that during an exchange one amino acid is preferably replaced by one with similar physico-chemical properties. This would more likely guarantee the functionality of the protein as a whole than replacement by a dissimilar amino acid. The tendency to remain conserved can thus be interpreted as a result of physico-chemical forces which act in such a way that replacements by dissimilar amino acids are avoided and replacements by similar amino acids are favored. Above all, the three properties hydrophobicity, polarity and volume determine an amino acid's physico-chemical character.
The relative importance of these two major evolutionary players, genetic code and conservation of physico-chemical properties, was quantified previously in protein variants from one and the same species and for closely related proteins [10]. As predicted in an earlier work [11], the genetic code is the major factor controlling evolution within species and physico-chemical forces gain importance in the protein evolution between species. In this work we are interested in studying the influence that the physiochemical factors and the structure of the genetic code exert on the amino acid exchanges as a function of the time that has elapsed since the divergence between two proteins.
To this end we first construct substitution matrices from pairwise protein alignments with a degree of sequence identity varying from 99 down to ten percent (corresponding to degrees of divergence between one and 90 percent) in steps of one percent and correlate them with the matrix of differences in volume, hydrophobicity, polarity and average codon sequence difference.
If the genetic code were the sole factor governing amino acid exchanges then those exchanges should prevail whose codon sequence distance is low. There should therefore be a strong anticorrelation between the codon distance matrix and the empirical substitution matrix. Likewise, if the physico-chemical factors were the sole factors, then those exchanges should predominate whose differences in the physico-chemical properties differ little. The correlation coefficient between the four distance matrices and the substitution matrices for the various degrees of protein sequence divergence allow therefore to estimate the relative strength of the corresponding factors for each degree of divergence.
We are also interested to correlate the changes that are involved in amino acid substitutions on the physico-chemical level and on the genetic code level. Are the physico-chemical changes  Table 1. Exchange frequencies (sum normalized to 10,000) derived from protein alignments which are divergent at 5 percent of their sites.  correlated with each other and with codon sequence changes? A strong correlation between physico-chemical properties and genetic code properties would mean that the genetic code has protection built in against strong changes in the physico-chemical properties. It could be shown in previous work [12,13,14] that changes in the codons that can occur easily by chance, e. g. through a single transition, are mostly involving small changes in the physico-chemical properties. This means that the function of a protein is unlikely disrupted e. g. by occasionally occurring replication errors. In other words, some protection is already imprinted in the genetic code, and it could be shown that this protection is especially marked for hydrophobicity [15]. Here, we are interested to quantify the degree of protection that is intrinsic in the genetic code.

Amino acid substitution matrices
We downloaded 10,340 multiple alignments via ftp from http://pfam.sanger.ac.uk/ (file Pfam-A.full, Pfam Version 23.0, as of July, 2008 [16]) Theses protein families were constructed from 6,145,588 individual protein sequences. From each alignment we picked each two orthologous and two paralogous human sequences randomly. We determined the degree of identity between these pairs of sequences by counting the number of sites with identical amino acids and subsequent division by the total number of aligned amino acid pairs. The degree of divergence of an alignment was calculated as 100 minus the degree of identity in percent rounded to the closest integer. Insertions and deletions were not considered. The counts for each alignment were sorted into a 20620 matrix. Matrices belonging to one and the same degree of divergence between one and 90 percent were then added such that 90 divergence specific substitution matrices resulted. We confined ourselves to one sequence pair each from a multiple alignment as a basis for the substitution matrices (instead of considering all possible pairs) in order to avoid bias towards those proteins whose multiple alignments were constructed from many sequences [2]. The raw counts in the substitution matrices were multiplied with the relative frequencies of the amino acids in the protein alignments of the corresponding degree of divergence so that the values were corrected for unequal amino acid frequency.

Differences in three physico-chemical properties
We calculated the absolute differences in physico-chemical properties for each of the 190 possible amino acid pairs for three fundamental properties: amino acid volume, polarity and hydrophobicity based upon entry numbers GRAR740103, GRAR740102 [17], and SWER830101 [18] of the database AAindex [19]. Since we calculated absolute differences, the differences associated with the exchange of a given amino acid by another and with the reverse exchange are identical. By construction, the resulting matrices are symmetric.

Codon sequence distance
To describe amino acid exchange frequencies in terms of the genetic code, we determine for each exchange the average distance between their codon sequences. As an example, how this measure is calculated we consider glutamic acid and glycine which can be coded for by two and four codons, respectively. An exchange between these two amino acids can thus be assigned eight codon pairs. In two cases, GAA « GGA and GAG « GGG, a one nucleotide swap suffices, in six cases, GAA « GGC, GAA « GGG, GAA « GGT, GAG « GGA, GAG « GGC, and GAG « GGT, two nucleotides have to be changed to attain the amino acid exchange. On average, the codon sequence distance between glutamic acid and glycine is 1.75 nucleotides. As with the physico-chemical changes we arranged the codon sequence distances into 20620 matrices. The assignment of amino acids to codons was taken from the R-package seqinR [20].

Correlation analysis
We used the function mantel (with the method kendall and with the default of 1000 permutations) from the R-package vegan [21] to calculate the correlation coefficient between the property difference matrices and the amino acid substitution matrices.

Correlation between evolutionary factors
First, we were interested if the four factors that influence amino acid exchanges are largely independent or if there is correlation between them. Figure 1 shows all pairwise scatter plots for the changes in volume, hydrophobicity, polarity, and mean codon distance that the 190 possible amino acid substitutions entail. The only strong correlation was found between differences in hydrophobicity and polarity (r = 0.512). Volume is nearly uncorrelated with hydrophobicity and polarity. There is also weak, however statistically highly significant, correlation between differences in the three physico-chemical properties on the one hand and the mean codon distance on the other hand. This means that the amino acids' physico-chemical conservation is partially intrinsic to the genetic code, but it cannot be explained through the genetic code alone. A marked feature of the plots hydrophobicity and polarity vs. average codon distance is that amino acid exchanges that can exclusively be realized by one nucleotide swaps entail almost constant hydrophobicity and polarity. In contrast, no marked tendency can be seen for average codon distances greater than 1. Amino acid substitution matrices in the course of time We determined amino acid substitution matrices from 10,135 and 10,057 orthologous and paralogous pairwise protein sequence alignments, respectively. The number of orthologous alignments contributing to the divergence specific substitution matrices ranged between 74 and for a divergence of one percent and 27 for a divergence of 90 percent. The corresponding numbers were 462 and 21 for paralogous alignments. Degrees of divergence greater than 90 percent were not considered because too few alignments contributed to these substitution matrices. First, we studied some of these divergence specific matrices by eye. An obvious feature is that many of the exchanges are not observed at all when divergence is low. Table 1 shows the substitution matrix derived from orthologous protein alignments which are divergent at five percent of their sites. The sum of all counts was set to 10,000 and then counts were rounded. Almost half of the possible exchanges (176/400) are not observed in this case. Off-diagonal elements are typically two orders of magnitude smaller than diagonal elements, i. e. the number of amino acid sites that are identical in an alignment. As divergence increases, more and more amino acid exchange pairs are observed. At a divergence of 50 percent, all but six possible exchanges are realized (Table 2). Now, off-diagonal elements are typically only one order of magnitude smaller than diagonal elements. Figure 2 shows that the proportion of realized amino acid exchanges out of all possible exchanges rises fast from about 30 percent for weakly divergent proteins to in general 100 percent for orthologous proteins diverged by 50 percent or more of all sites. In order to visualize how the occupation of the substitution matrices varies with divergence we plotted the logarithm of their elements (increased by 1 to avoid negative values) in a gray scale (Figure 3). The 400 possible amino acid pairs are represented by little squares. Bright squares reflect high numbers, dark squares small numbers. As can be seen clearly, the diagonal elements vanish more and more as divergence grows, but a weak trace remains even at a divergence level of 85 percent. Figure 4 shows how the correlation strength between property difference matrices and the substitution matrix for orthologues varies with sequence divergence. The correlation coefficients are negative, but for clarity's sake we present their absolute values. Essentially, the influence of all four factors is decreasing with the exception of plateaux or even slight increases for volume between In order to assess better the relative strengths of the individual factors we present their smoothed curves together in Figure 5. The dominating factor for weekly divergent proteins (up to 20 percent divergence) is the genetic code (dot dashes). For sequences diverged by more than 20 percent, the conservation of hydrophobicity (long dashes) is the most dominant factor. The conservation of volume (solid line) is the weakest force up to a divergence of 60 percent, when its curve merges with that of the genetic code. The conservation of polarity (dotted curve) plays an intermediate role. The correlation for all factors falls until 0.1 for sequences which are 90 percent divergent. We also determined the average correlation coefficient for shuffled alignments of orthologous proteins to assess if this value of 0.1 is beyond noise level. Values between 0.009 for polarity and 0.025 for the mean codon distance suggest that this is indeed the case.

Evolutionary factors in the course of time
We performed the same correlation analysis for paralogues and show the results in Figure 6. Essentially, we obtained identical curve shapes for paralogues (dashed lines) as for orthologues (solid lines). An important difference, however, is that the correlation is somewhat weaker for paralogues for weakly to moderately divergent sequences. When the divergence has exceeded 50 percent, this difference vanishes practically. The difference seems to be more marked for polarity and codon distance than for volume and hydrophobicity (0.03 vs 0.01 for weakly divergent sequences).

Discussion
After a speciation event or a gene duplication event the two copies of a gene evolve separately and their nucleotide sequences start to diverge. This divergence is in general reflected in their protein sequences. On a molecular level, the sequence divergence is caused by replacements of amino acids by other amino acids. These exchanges are not random, but are essentially constrained by the conservation of an amino acid's physico-chemical character to guarantee its functioning. At the same time, the structure of the genetic code facilitates some amino acid exchanges because they can be attained by the swap of just one nucleotide and makes others more difficult because two or three nucleotides have to be exchanged. show how the influence of the amino acids' volume, hydrophobicity, polarity and average codon sequence differences varies with the protein sequence divergence. Each dot represents the correlation coefficient (y-axis) for a substitution matrix with a property difference matrix for a certain degree of protein sequence divergence (x-axis). The correlation coefficients are negative, but are presented as absolute values for clarity's sake. The solid lines are obtained by smoothing the data; they serve as guide to the eye. doi:10.1371/journal.pone.0004821.g004 It has been known that protection against great amino acid alterations is a feature of the genetic code, i. e. amino acid exchanges that can be mediated by just one nucleotide swap entail mostly slight changes in an amino acid's physico-chemical character and those that are mediated by three nucleotide swaps entail mostly greater changes. Though being statistically significant, we could show that this correlation is only moderate; there are many exceptions to the rule. For example, the interchange between the hydrophilic amino acide serine (S) and the hydrophobic amino acid phenylalanine (F) can be mediated through the one-transition exchanges. TCC « TTC and TCT « TTT. It must also be taken into consideration that further protection could be provided by uneven codon usage, i.e. in case of multiple codons for one amino acid those codons are preferred that are dissimilar to codons that code for dissimilar amino acids. Furthermore, protection could also be provided by different exchange rates for transitions (nucleotide interchanges between purines or between pyrimidines) and transversions (nucleotide interchanges between purines and pyrimidines).
The main objective of this study was to analyze which amino acids are preferably replaced by which other amino acid depending on the degree of divergence between the protein sequences and to identify and quantify four factors controlling these exchanges.
To this end we constructed a series of each 90 substitution matrices from pairwise alignments of orthologous and of paralogous protein sequences which we excerpted from the Pfam collection of protein multiple sequence alignments. These matrices are worthwhile studying themselves. Their most marked feature is that they are poorly populated if divergence is weak. Only about half of the possible exchanges are observed, and their rate is very low. From a sequence divergence rate of 50 percent on all possible exchanges are in general observed. To quantify the strength of the influence that is exerted by the conservation of the three physicochemical amino acid features polarity, volume and hydrophobicity, we constructed matrices that contain the differences of these quantities for each amino acid pair. To describe the differences between amino acids in terms of the genetic code we determined the codon sequence difference for each codon pair that can be associated with an amino acid exchange and averaged over all sequence distances to yield the mean codon distance.
To determine both the variation with sequence divergence and the relative strength of the four factors we correlated the four distance matrices with the 90 substitution matrices. Interpreting these correlation strengths as influence that is exerted by the four factors it can be concluded that immediately after the divergence of two proteins the structure of the genetic code is the predominant factor controlling amino acid substitutions. It is in this regime above all the ability to realize an amino acid exchange with the exchange of just one nucleotide that is the decisive factor to explain the empirical exchange rates. The conservation of an amino acid's hydrophobicity plays the leading role when sequence divergence has grown beyond 20 percent. Beyond 50 percent sequence divergence, the strengths of polarity, volume and mean codon distance are comparable.
It has been known since Zuckerkandl's pioneering work that new functions can be performed once proteins have diverged [22], but it has remained obscure if this is the rule or the exception. In [23] it is suggested that the adoption of new functions is the rule for paralogues, but it is the exception for orthologues, at least if there is no gene duplication after speciation (1 : 1 homology). Despite possibly new functions for the derived proteins that can be quite different from the original function similar constraints are exerted onto paralogues and orthologues. Here, we have shown that the conservation of an amino acid's physico-chemical character exerts less and less constraint as divergence proceeds. In analogy to the physico-chemical forces, the genetic code is the underlying structure that facilitates exchanges between amino acids whose codon sequences are similar and makes those exchanges harder whose codon sequences are dissimilar. However, this ''guiding force'' is also losing influence as protein sequence diverges.
Thus, the proteins become less and less bound to their original sequence permitting them to adopt other conformations and to fulfill additional or new functions. The adoption of new functions seems to be a gradual process. We could furthermore show that, at least up to a sequence divergence of 50 percent, paralogues are less subjected to conservatory forces than orthologues. It must, however, be taken into account that the paralogues for our study were exclusively from human, whereas the orthologues were from a large variety of species.
The analysis that we have performed relies upon the concept of the molecular clock [6], i .e. the more time has elapsed the more amino acid exchanges have accumulated in a protein sequence such that the number of accumulated exchanges can be used to measure the time that has elapsed since two proteins have diverged. Although it has turned out that there is no universal molecular clock [24], but rather that various molecular clocks run at different paces across species, proteins and times, we think that the concept is applicable for our purpose since we constructed the substitution matrices for each degree of divergence from a wide variety of different proteins and species. We did not try to calculate precise time periods that have elapsed since two proteins have diverged, but we do claim that more divergent protein pairs are likely to have diverged earlier than less divergent protein pairs. Sequence divergence is presented on the x-axis, the correlation coefficient on the y-axis. Solid line: volume; dotted line: polarity; dashed line: hydrophobicity; dot dashed line: mean codon distance. The mean codon sequence distance represents the strongest influences on amino acid exchanges when protein sequence divergence is low; the conservation of an amino acid's hydrophobicity is the prevailing factor when sequence divergence is greater than 20 percent. doi:10.1371/journal.pone.0004821.g005 Our analysis relies furthermore upon the correctness of the pairwise protein alignments that serve as our data basis. The Pfam-A collection of multiple sequence alignments are produced by first establishing a hand-curated seed alignment of a couple of protein sequences from which a profile Hidden Markov Model is built which again is used to search for homologous sequences in primary sequence data bases. Finally the alignments are checked again manually (P Coggill, personal communication). In such a way the risk to sample non-homologous sequences which could happen by applying an inappropriate substitution matrix is minimized. Since we sampled our pairwise alignments from multiple alignments which consist typically of dozens of sequences (median ,80) and which can therefore be considered as stable, we are quite confident that the overwhelming majority of the alignments we used are of good quality.
When we spoke about the genetic code in this work we tacitly assumed the universality of the genetic code. There are, however, organisms using slight variants of the standard genetic code. To date, about a dozen of these exceptions are known. We have not checked explicitly if the proteins we used were all translated using the standard genetic code but a few exceptions would certainly not affect our results. Even for non-standard codes the vast majority of assignments between codons and amino acids would be the same as in the standard genetic code.
A limitation of our study is that we did not distinguish between the two directions of an amino acid substitution whose rates are priori not identical. The reason for this was that the construction of such (in general asymmetric) substitution matrices requires the inclusion of an outgroup sequence in the protein sequence alignment. Whereas this was possible for almost all orthologous alignments it was only possible for a fraction of the paralogous alignments. We repeated the correlation analysis with the asymmetric substitution matrices from orthologues and obtained essentially the same results.
It was our endeavor to explain the observed amino acid exchange frequencies for the whole range of protein sequence divergence by means of fundamental structures or forces, like the structure of the genetic code or the physico-chemical forces that maintain an amino acid's character. We therefore had to keep our model somewhat simplistic. It is, for example, known that transversions (interchanges between a purine and a pyrimidine) are much rarer than transitions (interchanges between purines or between pyrimidines). We defined the weighted codon distance by attributing the score 2 to transversions and the score 1 to transitions, such that codon pairs involving transversions are assigned greater distances. This modification, however, did not alter the correlation coefficients between the substitution matrices and the codon distance shown in Figs 4-6. More refined models for codon distance could, for example, also incorporate the codon usage. This will, however, be a much more complex analysis because of the dependence of codon usage on the species and remains as a future challenge.