Evolutionary couplings detect side-chain interactions

Patterns of amino acid covariation in large protein sequence alignments can inform the prediction of de novo protein structures, binding interfaces, and mutational effects. While algorithms that detect these so-called evolutionary couplings between residues have proven useful for practical applications, less is known about how and why these methods perform so well, and what insights into biological processes can be gained from their application. Evolutionary coupling algorithms are commonly benchmarked by comparison to true structural contacts derived from solved protein structures. However, the methods used to determine true structural contacts are not standardized and different definitions of structural contacts may have important consequences for interpreting the results from evolutionary coupling analyses and understanding their overall utility. Here, we show that evolutionary coupling analyses are significantly more likely to identify structural contacts between side-chain atoms than between backbone atoms. We use both simulations and empirical analyses to highlight that purely backbone-based definitions of true residue–residue contacts (i.e., based on the distance between Cα atoms) may underestimate the accuracy of evolutionary coupling algorithms by as much as 40% and that a commonly used reference point (Cβ atoms) underestimates the accuracy by 10–15%. These findings show that co-evolutionary outcomes differ according to which atoms participate in residue–residue interactions and suggest that accounting for different interaction types may lead to further improvements to contact-prediction methods.


26
A long-standing problem in physical biology is to predict the structure of a protein based solely on its 27 amino acid sequence (Anfinsen, 1973  Different reference points can be used when determining the distance between any pair of amino acid 65 residues, and prior research has shown that the choice of reference point can matter in some contexts. 66 For example, in the context of structural determinants of sequence conservation, it has been found that 67 sequence conservation is more strongly correlated to the number of residue-residue contacts identified via 68 side-chain centers than identified via Cα atoms (Lin et al., 2008;Marcos and Echave, 2015;Shahmoradi 69 and Wilke, 2016). However, it is unknown whether the choice of reference point is important for evaluating 70 and interpreting modern evolutionary coupling approaches. 71 Additionally, some studies eschew fixed reference points entirely. Residue-residue contacts can be 72 defined according to whether the minimum distance between all possible pairs of heavy atoms between 73 two residues falls below a set distance cutoff (Ovchinnikov et al., 2017). However, it is unclear whether 74 all atoms should be considered in this calculation or whether backbone and side-chain atoms should be 75 treated differently. Further, there are a number of ever more complex approaches that can be used to define  Currently, there are no accepted standards in the field for how to determine a set of residue-residue 81 contacts for a given protein structure. Further, there has yet to be a systematic analysis of whether 82 co-evolutionary signatures are more-or less-closely related to different types of intramolecular contacts 83 that may exist within a given structure. Here, we systematically test the accuracy of several evolutionary 84 coupling algorithms against true positive contacts defined via Cα, Cβ , or side-chain geometric centers, 85 as well as via minimum distances between all atoms or atoms residing in side chains only. We find that 86 residue-residue contacts defined according to side-chain centers and side-chain atoms are much more 87 accurately predicted by evolutionary couplings. These results imply that the dominant epistatic effects 88 resulting in co-evolutionary signatures arise from interactions between side-chain atoms. Our findings 89 highlight the importance of the choice of contact definition and provide insight into the constraints 90 governing the evolution of protein structures.

92
Dataset compilation and processing 93 We downloaded the so-called PSICOV dataset of 150 proteins that have been extensively studied (Jones 94 et al., 2012, 2015; Jones and Kandathil, 2018). We processed each starting PDB file to select a single 95 chain, ensure a consistent numbering of residues (1...n), test for unknown or non-standard residues, select 96 the most likely state for all disordered sequence atoms, and remove all extraneous information (including 97 "HETATM" lines). Next, to ensure that all residues were represented in full and repair those that were 98 not, we used PYROSETTA to read in the ".PDB" files using the 'pose from pdb' function and wrote the

Determining structural contacts and contact-types
From each cleaned ".PDB" file, we calculated residue-residue distance matrices using custom python 102 scripts (the euclidean distance from 3-dimensional atomic coordinates). All residues contain a Cα atom 103 so this calculation was straightforward. For Cβ calculations, we used the Cβ atom of all residues except 104 glycine, for which we continued to use the Cα atom. For side-chain center calculations, we calculated 105 the geometric center of each residue based on the coordinates of all non-backbone heavy atoms. This 106 calculation included Cβ atoms but excluded Cα atoms for all amino acids except glycine, for which we 107 again continued to use Cα as the reference point.

108
To calculate minimum atomic distances between two residues, we calculated all pairwise euclidean 109 distances between heavy atoms and selected the minimum distance. In extending this analysis to only 110 consider side-chain atoms, we continued to consider Cβ atoms as part of the side chain but not Cα. Again, 111 we relaxed this restriction for glycine and included Cα as a side-chain atom to permit calculations for this 112 amino acid.

113
For all methods, contacts were assessed by first removing all residue-residue pairs where the two 114 residues were shorter than 12 amino acids apart in chain distance. Contacts were determined throughout 115 this manuscript for each structure according to an 8Å cutoff between Cα atoms. Since accuracy values Cα-based method. Thus, the actual distance cutoff used to classify contacts for these other metrics varies 121 slightly from protein to protein.

122
To classify residue-residue pairs (a and b) via their side-chain orientations, we chose a reference 123 residue (a) and drew two vectors: i) from the Cα atom coordinate to the side-chain center for that residue 124 and ii) from the Cα atom coordinate to the Cα atom coordinate for the other residue in question (b). If 125 the angle between these two vectors was less than π/2 radians, the side chain of residue a was said to 126 point towards residue b. To determine the residue's classification we next repeated the calculation using 127 residue b as the reference and finally classified the residue-residue pair accordingly.

128
Evolutionary coupling algorithms 129 For each of the 150 proteins in our dataset, we followed a principled method to retrieve homologous 130 sequences. We first extracted the amino acid sequence from the ".PDB" file. We next used PHMMER to 131 search through progressively larger databases in order to retrieve up to 10,000 homologous sequences (Pot-132 ter et al., 2018). To do so, we downloaded local versions of the rp15, rp35, rp55, and rp75 databases (Chen 133 et al., 2011). We first searched the smallest, least redundant, database for each sequence using an E-value 134 threshold of 0.0001. For any sequence with greater than 10,000 hits we stopped and selected the top 135 scoring 10,000 hits for further analysis. For sequences with fewer than 10,000 hits we moved to the next 136 largest database and repeated the process. Finally for the small number of sequences for which we did 137 not accumulate at least 1,000 sequences in the largest database (rp75), we used the online version of 138 PHMMER to search the UniprotKB database and downloaded the maximum results.

139
For each protein, we next aligned the hits along-side the reference sequence using MAFFT (v7.380,  Using these alignments, we next calculated evolutionary couplings between residue-residue pairs. All   We re-packed the structure within a 12Å radius and determined whether or not to accept the mutation 162 based off of the resulting change in structural stability. Mutations which either did not alter or increased 163 stability (i.e. resulted in a decreased ∆G) were accepted with a probability of 1. Mutations that decreased 164 stability were accepted with a probability proportional to their ∆∆G as in Teufel and Wilke (Teufel and 165 Wilke, 2017). At the end of the evolutionary process, the resulting amino acid sequence was stored for 166 future analysis. 167 We performed thousands of independent replicates of this expedited evolutionary process where we 168 altered the number of accepted mutations that we accumulated before halting the simulation, the number 169 of replicate evolutionary experiments that we performed, and the fraction of the initial wild-type stability 170 value that we used for our selection criteria. Collections of the resulting sequences were analyzed via 171 evolutionary coupling algorithms in the same manner as empirical sequences, with no need for sequence 172 alignment.

174
Structural contact definitions 175 Putatively true interactions between amino acid residues within a given protein family are frequently 176 derived from the distance between residues in a representative protein structure. Manuscript to be reviewed distance cutoff to define contacts for a given protein. To facilitate a direct comparison between methods, we use the same absolute number of contacts to determine a comparable distance cutoff (specific to each 186 protein) to use for side-chain center distances such that an equal number of true contacts were identified 187 regardless of the reference point used to measure distances. For instance, in Fig. 1 we found that the 188 8Å, Cα-based distance cutoff resulted in 295 contacts. We thus chose the distance cutoff for side-chain 189 centers such that 295 contacts were identified, corresponding to a distance cutoff of 7.33Å for this example Manuscript to be reviewed and side-chain centers (Fig. 2C), but a median overlap of just 63% between contacts defined via Cα and 207 side-chain centers (Fig. 2D). The distribution of protein-specific distance cutoffs that we used to identify 208 equal numbers of contacts across methodologies was approximately normal and in the range of 6-8Ås (SI 209 Fig. S1). 210 We focused on comparing the use of Cα atoms and side-chain centers as reference points to determine 211 contacts, since these two features represent extreme ends of the spectrum. In practice, Cβ atoms occupy an identified by these two methods (SI Fig. S2). Together, these findings highlight that true contacts vary 217 substantially according to the reference point used to measure residue-residue distances.

249
We additionally explored how the number of mutations accumulated per sequence affected the ability 250 of evolutionary coupling algorithms to recover intramolecular contacts. We fixed the number of replicate 251 sequences at 3000, and found that PPV values showed minimal variation according to the number 252 of accepted mutations that had accumulated per sequence over the course of our in silico evolution 253 (Fig. 3C). As before, however, prediction accuracies were substantially higher when we defined true 254 contacts according to side-chain center distances (Fig. 3D). These simulation results highlight that-  Fig. S4). Despite the variability in prediction accuracy between 269 proteins, we observed systematic differences in the PPV according to which metric was used to identify 270 true positive contacts from the PDB structure files (Fig. 4A). When compared to Cα based distances, proteins the median percent increase in PPV between Cα and side-chain center contact identification 275 methods was 43% (Fig. 4B,C). Even between the more similar Cβ and side-chain center methods, the 276 median percent increase in accuracy was 13% (Fig. 4D,E). Both comparisons were highly significant and 277 persisted across the entire range of PPVs represented within our dataset (Fig. 4B,D). Additionally, these 278 results were highly consistent across different evolutionary coupling algorithms (SI Figs. S5 & S6). 279 We also considered an alternative method for computing contacts: determining structural contacts 280 based on the minimum distance between any two atoms belonging to different two different residues Manuscript to be reviewed (Ovchinnikov et al., 2017). We implemented two versions of this algorithm, determining the minimum 282 distance between: i) all heavy atoms within residues and ii) side-chain heavy atoms only. In each case, 283 and to continue to facilitate a direct comparison between methods, we selected the n shortest distances as 284 contacts where n is the number of contacts identified for each protein via the 8Å distance cutoff using Cα.

285
The resulting PPVs were significantly higher when contacts were defined only according to side-chain 286 atoms as opposed to the complete set of backbone and side-chain atoms (Fig. 4A). We further note that 287 PPVs calculated via side-chain center distances were statistically indistinguishable from PPVs derived 288 from the minimum distance between all heavy atoms within side chains.

289
All analyses thus far have been performed using a variable distance cutoff to identify true contacts, 290 such that an equal number of true contacts were identified for each structure regardless of the method. However, it would nevertheless be useful to have a uniform distance cutoff to apply to all structures and 296 such a uniform cutoff is more physically realistic and practically useful. We thus repeated our analysis 297 after fixing the Cβ and side-chain center distance cutoffs at 7.6 and 7.5Ås, respectively (values chosen as 298 being close to the median for the distributions displayed in SI Fig. S1).

299
For Cα, Cβ , and side-chain center methods, there was no significant difference in the number of 300 contacts identified within each structure (SI Fig. S7A) and the conclusions of Fig. 4 with regard to PPV 301 remain the same (SI Fig. S8). However, for comparisons involving the minimum distance between atoms, 302 we found that considering side-chain atoms only resulted in substantially fewer true contacts (using 303 a distance cutoff of 4.5Å between any two heavy atoms to define contacts, SI Fig. S7A) and the raw 304 PPVs between these two methods cannot be meaningfully compared. By contrast, the average precision The preceding analyses have shown that for the exact same evolutionary couplings, PPVs are substantially 318 higher when using side-chain based distances to identify true positive intramolecular contacts than when 319 using either Cα or Cβ -based distances. To look more specifically at why these differences were so residue while that residue's side chain points away, or iii) both residues' side chains may point away from 325 one another with energetic interactions occurring between the respective amino acid backbones (Fig. 5A).

326
As we expected, the Cα-based definition yields fractions close to the random expectation for 1AOE: in 327 ∼25% of the cases side chains point towards each other, in another ∼25% of the cases they point away 328 from each other, and in the remaining ∼50% of the cases one side chain points towards the other while the 329 other points away (Fig. 5B). By contrast, side-chain (and to a lesser degree Cβ ) based contact definitions 330 enrich for cases where both side chains point towards one another (Fig. 5C,D). Manuscript to be reviewed as true contacts for each protein according to Cα, Cβ , and side-chain centers-illustrating that the trend 336 observed in (Fig. 5B-D) applies broadly. If instead we consider residue-residue pairs predicted by the top 337 ranked evolutionary couplings, we observe that a large fraction of these couplings are between residues 338 whose side chains point towards one another in the reference protein structure. Additionally, this fraction 339 is highest for the most highly ranked evolutionary couplings (Top 0.25L, where L is the length of the 340 protein) and is substantially higher than the proportion identified by Cα-based distances (Fig. 5E).

341
This analysis of side-chain orientations highlights that strong evolutionary couplings frequently occur 342 between residues whose side chains point towards one another. Compared to the side-chain center based 343 method, Cα-(and to a lesser extent Cβ -) based contact definitions classify a smaller number of contacts 344 in this orientation and this is likely the cause of decreased predictive accuracies that we observe when 345 using these latter methods to define contacts (Fig. 4).  approach would be preferable. In practice, we have found that a distance cutoff of 7.5Å for side-chain 377 centers works well, and we thus recommend this approach moving forward (SI Fig. S8), although the 378 physical justification for such a cutoff is unclear. The method of evaluating the minimum distance 379 between side-chain heavy atoms is particularly attractive in this regard, since it permits using a physically 380 realistic and uniform distance cutoff between atoms regardless of amino acid identity or the protein 381 under consideration. While it is possible that analysis of still larger datasets might uncover differences in 382 accuracies between these conceptually similar approaches, our study shows that these methods produce 383 nearly identical results and the choice between them should be based on the tradeoff between simplicity 384 of calculation (side-chain centers) and physical reality (minimum atomic distances between side-chain 385 atoms with a uniform cutoff).

386
Leveraging homologous sequence information to predict intramolecular protein contacts has long 387 been a goal of structural biologists, but progress towards this goal has been made possible only in recent