Too packed to change: side-chain packing and site-specific substitution rates in protein evolution

In protein evolution, due to functional and biophysical constraints, the rates of amino acid substitution differ from site to site. Among the best predictors of site-specific rates are solvent accessibility and packing density. The packing density measure that best correlates with rates is the weighted contact number (WCN), the sum of inverse square distances between a site’s Cα and the Cα of the other sites. According to a mechanistic stress model proposed recently, rates are determined by packing because mutating packed sites stresses and destabilizes the protein’s active conformation. While WCN is a measure of Cα packing, mutations replace side chains. Here, we consider whether a site’s evolutionary divergence is constrained by main-chain packing or side-chain packing. To address this issue, we extended the stress theory to model side chains explicitly. The theory predicts that rates should depend solely on side-chain contact density. We tested this prediction on a data set of structurally and functionally diverse monomeric enzymes. We compared side-chain contact density with main-chain contact density measures and with relative solvent accessibility (RSA). We found that side-chain contact density is the best predictor of rate variation among sites (it explains 39.2% of the variation). Moreover, the independent contribution of main-chain contact density measures and RSA are negligible. Thus, as predicted by the stress theory, site-specific evolutionary rates are determined by side-chain packing.


INTRODUCTION
Why do some protein sites evolve more slowly than others? Protein evolution is driven by random mutations and shaped by natural selection (Liberles et al., 2012;Sikosek & Chan, 2014). Mutations are selected depending on their impact on functional properties, such as the chemical nature of catalytic residues, active site conformation, and the protein's ability to fold rapidly and stably. Since changes of these properties depend on the mutated site, amino acid substitution rates vary from site to site.
We can reformulate the question opening the previous paragraph: What specific properties account for site-dependent rates of evolution? The most studied predictors are structural site-specific properties (Franzosa & Xia, 2009). For years, the main structural predictor was believed to be solvent accessibility, as quantified by the Relative Solvent Accessibility (RSA) (Bustamante, Townsend & Hartl, 2000;Conant & Stadler, 2009;Franzosa & Xia, 2009;Ramsey et al., 2011;Shahmoradi et al., 2014). However, local packing density, quantified by the Weighted Contact Number (WCN), predicts evolutionary rates at least as well as RSA (Shih & Hwang, 2012;Yeh et al., 2014a;Yeh et al., 2014b).
The relationship between WCN and substitution rates can be understood in terms of a mechanistic stress model of protein evolution (Huang et al., 2014). Given an ancestral wild-type protein, the model assumes that its native conformation is the active conformation. Mutating a site perturbs (stresses) its interactions with other sites, destabilizing the active conformation. Such a destabilization determines the probability of the mutation being accepted or rejected, and therefore the rate of amino acid substitution. Using the energy function of the parameter-free Anisotropic Network Model (Yang, Song & Jernigan, 2009), the expected destabilization was found to be proportional to WCN, and site-specific substitution rates were predicted to decrease linearly with increasing WCN, in agreement with observations. A site's WCN is the sum of inverse square distances from its C α to the C α of other sites: it is a measure of C α packing density. Therefore, previous substitution rate vs. WCN studies were based on main chain (C α ) packing (Shih & Hwang, 2012;Yeh et al., 2014a;Huang et al., 2014). However, mutations replace side chains. Consider a protein residue, e.g., Thr93 of Human Carbonic Anhidrase II (pdb code 1CA2) (Fig. 1). The environment of the main chain ( Fig. 1A) differs from that of the side chain (Fig. 1B). When Thr93 is mutated, what environment would determine whether the mutation is accepted or rejected? More specifically: Do site-specific substitution rates depend on main-chain packing or on side-chain packing?
To address this issue, we extended the stress model to consider main and side chains explicitly and we theoretically derived that substitution rates depend only on side-chain of packing. We tested the theory on a data set of monomeric enzymes. In agreement with predictions, site-specific substitution rates correlate better with side-chain packing than with main-chain packing measures and RSA. Moroever, partialing out the effect of side-chain packing, the independent contributions of main-chain packing and RSA are negligible.

Theory
In this section, we show that the mechanistic stress model of protein evolution predicts that the substitution rate of a protein site is determined by the packing density of its side chain. This prediction and its empirical assessment are the point of this paper.
The stress model was proposed by Huang et al. (2014) to explain the observed correlation between site-specific substitution rates and packing density. The model is based on the idea that a mutant is viable to the extent that it spends time in the active conformation. In turn, this time will depend on mutational changes of the stability of the active conformation. The fixation probability of a mutant is modeled as where wt stands for wild-type, mut for mutant, C F is the concentration of folded protein and ρ(r active ) its probability of adopting the active conformation. Assuming that C mut /C wt is equal to the ratio of partition functions, from basic statistical physics it follows that: where β represents the selection pressure and is the energy difference between mutant and wild-type in the active conformation. Assuming that βδV * ≪ 1 (weak selection), from (2) we find: i.e., the rate of substitution of site i, K i , is proportional to (minus) the change in stability of the active conformation averaged over mutations at i, ⟨δV * ⟩ i . This is the basic equation of the stress theory.
In Huang et al. (2014), mutational stability changes were calculated using an elastic network model in which each residue is represented by a single node. Within such a one-node-per-residue representation, there is no differentiation between main chain and side chain. Therefore, we cannot predict whether evolutionary rates will be determined by main chain packing or side chain packing. To address this issue, here we represent each residue using two nodes: a main-chain node α, placed at the residue's C α , and a side-chain node ρ, placed at the geometric center of the residue's side chain (Gly's are represented using only one node at C α ). The energy function is: where r n i n j is the distance between nodes n i and n j (n is α or ρ), k n i n j is the force constant of the spring connecting these nodes, and d n i n j the equilibrium spring length. A mutation at site i will replace ρ i , affecting only the parameters of the energy function related to this node. We emphasize: while the mutation may well induce global structural changes involving the backbone and other side chains, the only parameters that will change are those of the mutated side chain. Following Echave (2008) and Echave & Fernández (2010), we model a mutation at i by adding random perturbations to the lengths of the springs connected to ρ i : (3) and (5): Assuming that perturbations are drawn independently from the same distribution, averaging (6) over mutations at i we find: To finish, we assume, as in the parameter-free Anisotropic Network Model (pfANM) of Yang, Song & Jernigan (2009), that k n i n j = 1 d 2 n i n j . Then, from (4) and (7) we obtain: where WCN αρ ρ , derived here, is the side-chain weighted contact number. It depends on contacts between node ρ of the site considered (subscript) and nodes α and ρ of the other sites (superscript). Therefore, the stress model, combined with a two-nodes-per-site pfANM energy function, predicts that site-specific rates will depend on the contact density of the side chain WCN αρ ρ . By analogy with (9) we can calculate the main-chain weighted contact number: We expect WCN αρ α to correlate with WCN αρ ρ , which may result in indirect correlations with substitution rates. However, if the stress model is correct, rates will be determined only by WCN αρ ρ and there should not be any independent effect of WCN αρ α .

Other structural predictors
To assess the prediction of the previous section, we also consider the following structural properties. First, the Weighted Contact Number WCN, which was introduced by Lin et al. (2008) and found to be among the best structural predictors of site-dependent evolutionary rates (Yeh et al., 2014a;Yeh et al., 2014b). It is defined as: where d α i α j is the distance between the the alpha carbons of sites i and j. For the sake of clarity, wherever it is convenient we will use the notation WCN α α to make explicit that the distances between the C α of a site (subscript) and the C α of the other sites (superscript) are considered. Therefore, WCN α α can be considered a measure of main-chain packing density (based only on C α -C α interactions).
Second, by analogy with (11) we can use side-chain centers of mass ρ rather than C α to define: WCN ρ ρ quantifies the packing density of the side chain including only ρ-ρ interactions. Finally, we also consider the Relative Solvent Accessibility, RSA, which is the most studied structural determinant of evolutionary rates. The RSA of a residue is obtained by dividing its area accessible to the solvent (SA) by the maximum SA for the given amino acid type (Tien et al., 2013).

Dataset and empirical substitution rates
To test our theory, we used the data set of Echave, Jackson & Wilke (2015). The set consists of 209 monomeric enzymes of known structure covering diverse structural and functional classes. Each structure is accompanied by up to 300 homologous sequences.
We used the empirical site-specific rates of evolution of Echave, Jackson & Wilke (2015). They were calculated as follows. First, the homologous sequences for each structure were aligned using MAFFT (Multiple Alignment using Fast Fourier Transform) (Katoh et al., 2005;Katoh & Standley, 2013). Second, using the resulting alignments as input, Maximum Likelihood phylogenetic trees were inferred with RAxML (Randomized Axelerated Maximum Likelihood), using the LG substitution matrix (named after Le and Gascuel) and the CAT model of rate heterogeneity (Stamatakis, 2014). Third, the alignment and phylogenetic tree for each structure was used as input of Rate4Site to obtain the site-specific rates of substitution using the empirical Bayesian method and the amino-acid Jukes-Cantor mutational model (aaJC) (Mayrose et al., 2004). Finally, site-specific relative rates were obtained by dividing site-specific rates by their average over all sites of the protein. We denote the empirical rates by K R4S .

Comparison of empirical rates with structural properties
For each protein of the dataset, we used the pdb structure to calculate the five site-dependent structural properties defined above: WCN αρ ρ , WCN αρ α , WCN ρ ρ , WCN α α (= WCN), and RSA. For a given predictor x, we quantified its predictive power using the squared Pearson correlation coefficient R 2 (K R4S ,x). According to the theoretical predictions, WCN αρ ρ should be the sole determinant of site-specific rates. We quantified the independent contribution of each of the other structural descriptors by partialing out the effect of WCN αρ ρ using semipartial correlations. The squared semipartial correlation ρ 2 (K R4S ,x|WCN αρ ρ ) represents the unique contribution of predictor x. Also, it is the amount by which the explained variation of K R4S (R 2 ) would increase when going from the single-variable linear fit K ∼ WCN αρ ρ to the two-variable fit K ∼ WCN αρ ρ + x. Expected values, standard deviations, and p-values were obtained by averaging protein correlations and semipartial correlations for 10,000 bootstrapped replicas of the dataset of 209 proteins.
For statistical analysis we used R (R Core Team, 2014). Correlation coefficients and their p-values were calculated using cor.test(). Semipartial correlation coefficients and p-values were calculated using spcor.test(). For bootstrapping with used boot() with default options.

RESULTS AND DISCUSSION
We theoretically derived a new measure of contact density, the side-chain weighted contact number WCN αρ ρ which, according to the stress model, should be the sole structural determinant of site-specific evolutionary rates. We tested this prediction on a dataset of 209 functionally and structurally diverse monomeric enzymes. Empirical site-specific evolutionary rates K R4S were obtained from multiple sequence alignments using Rate4Site. We compared K R4S with WCN αρ ρ (side-chain weighted contact number), WCN αρ α (main-chain weighted contact number), WCN ρ ρ (side-chain ρ − ρ weighted contact number), WCN α α = WCN (main-chain α − α weighted contact number), and RSA (relative solvent accessibility). For each protein, we calculated correlation coefficients between K R4S and each structural property and semipartial correlations to measure independent contributions. Protein-by-protein results (Table S1) were averaged over all proteins to obtain expected values, using bootstrapping to estimate standard deviations and p-values (see Methods).

Side-chain contact density (WCN αρ ρ ) vs. main-chain contact density (WCN αρ α )
According to the stress model, site-specific substitution rates depend only on side-chain packing, so that main-chain packing should not be directly related to substitution rates.
To test this prediction, we compared empirical substitution rates K R4S with WCN   Consider, for example, Human Carbonic Anhidrase II (pdb code 1CA2). As we mentioned in the Introduction, main chain environments and side-chain environments are different (Fig. 1). Accordingly, WCN αρ ρ and WCN αρ α result in different predicted rates (Fig. 2). The two site-dependent profiles of predicted rates are similar to the empirical K R4S profile. However, WCN αρ ρ -based predictions look somewhat better (Fig. 2) and are better (Fig. 3)  We repeated the previous assessment for each of the 209 enzymes of the data set (Fig. 4). Empirical rates correlate with WCN αρ ρ better than with WCN αρ α for 204 of the 209 proteins studied. Aggregating the data over all proteins, we obtained expected R 2 values of 0.392 ± 0.008 and 0.319 ± 0.008 for WCN αρ ρ and WCN αρ α , respectively. The difference ΔR 2 = 0.073 ± 0.003 is significantly positive (p < 10 −3 , bootstrapping). Therefore, WCN αρ ρ outperforms WCN αρ α as predictor of site-specific substitution rates. The stress model predicts WCN αρ ρ to be the sole predictor of substitution rates. Any correlation between rates and WCN αρ α should be indirect. We measured the direct association between empirical rates and WCN αρ α using the squared semipartial correlation ρ 2 (K R4S ,WCN  et al., 2008), is one of the two main structural predictors of site-dependent evolutionary rates (Yeh et al., 2014a;Yeh et al., 2014b). It is worthwhile to consider whether the new measure presented here, WCN αρ ρ provides an improvement over WCN. We found that WCN αρ ρ outperforms WCN for 206 out of the 209 proteins studied. The expected correlations are R 2 (K R4S ,WCN αρ ρ ) = 0.392 ± 0.008 and R 2 (K R4S ,WCN) = 0.314 ± 0.007. The difference is ΔR 2 = 0.078 ± 0.003, which is statistically significant (p ≪ 10 −3 , bootstrapping). Thus, not only does WCN αρ ρ outperform WCN for almost all proteins, but by a rather large amount: while WCN explains 31.4 % of the variation of evolutionary rates, WCN αρ ρ explains 39.2%, an increase by a factor of 1.25. Moreover, ρ 2 (K R4S ,WCN|WCN αρ ρ ) = 0.0024 ± 0.005 (p < 10 −3 , bootstrapping). Despite statistical significance, the unique contribution of WCN is just 0.2%, which is negligible. Thus, WCN αρ ρ is a better predictor and the independent contribution of WCN is negligible.

CONCLUSION
We used the the mechanistic stress model to predict theoretically that site-specific rates of evolution depend solely on the side-chain contact density WCN theory, WCN αρ ρ is proportional to the mutational destabilization of the protein's active conformation, which is why it correlates with rates: mutations are accepted or rejected according to the degree of destabilization. We tested this prediction on a large dataset of monomeric enzymes. We found that WCN αρ ρ outperforms WCN ρ ρ , WCN and RSA and that the independent contributions of the latter are negligible, which supports the theoretical prediction.
To finish, we note that the structural properties studied do not explain all of the variation of substitution rates among sites. The best predictor, WCN αρ ρ explains on average ∼ 39% of the variation, leaving 61% unexplained. Further research is needed to gain a full understanding of the variation of substitution rates among protein sites.

ADDITIONAL INFORMATION AND DECLARATIONS Funding
This work has been supported by CONICET (National Scientific and Technical Research Council) and UNSAM (National University of General San Martín). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.