Inferring amino acid interactions underlying protein function

Protein function arises from a poorly defined pattern of cooperative energetic interactions between amino acid residues. Strategies for deducing this pattern have been proposed, but lack of benchmark data has limited experimental verification. Here, we extend deep-mutation technologies to enable measurement of many thousands of pairwise amino acid couplings in members of a protein family. The data show that despite great evolutionary divergence, homologous proteins conserve a sparse, spatially distributed network of cooperative interactions between amino acids that underlies function. This pattern is quantitatively captured in the coevolution of amino acid positions, especially as indicated by the statistical coupling analysis (SCA), providing experimental confirmation of the key tenets of this method. This work establishes a clear link between physical constraints on protein function and sequence analysis, enabling a general practical approach for understanding the structural basis for protein function.

The basic biological properties of proteins --structure, function, and evolvability --arise from the pattern of energetic interactions between amino acid residues (1-5). This pattern represents the foundation for defining how proteins work, for engineering new activities, and for understanding their origin through the process of evolution. However, the problem of deducing this pattern is extraordinarily difficult. Amino acids act heterogeneously and cooperatively in contributing to protein fitness, properties that are not simple, intuitive functions of the positions of atoms in atomic structures (6). Indeed, the marginal stability of proteins and the subtlety of the fundamental forces make it so that many degenerate patterns of energetic interactions could be consistent with observed protein structures. The lack of knowledge of this pattern has precluded effective mechanistic models for the relationship between protein structure and function.
In principle, an experimental approach for deducing the pattern of interactions between amino acid residues is the thermodynamic double mutant cycle (7-9) (TDMC, Fig. 1A). In this method, the energetic coupling between two residues in a protein is probed by studying the effect of mutations at those positions, both singly and in combination. The idea is that if mutations and at positions and , respectively, act independently, the effect of the double mutation (∆ '( )* ) must be the sum of the effects of each single mutant (∆ ' ) + ∆ ( * ). Thus, one can compute a coupling free energy between the two mutations (∆∆ '( )* ) as: the difference between the effect predicted by the independent effects of the underlying single mutations and that of the actual double mutant. ∆∆ '( )* is typically proposed as an estimate for the degree of cooperativity between positions and .
However, there are serious conceptual and technical issues with the usage of the TDMC formalism for deducing the energetic architecture of proteins. First, ∆∆ '( )* is not the coupling between the amino acids present in the wild-type protein (the "native interaction"). It is instead the energetic coupling due to mutation, a value that depends in complex and unknown ways on the specific choice of mutations made (10). Second, global application of the TDMC method requires a scale of work matched to the combinatorial complexity of all potential interactions between amino acid positions under study.
For even a small protein interaction module such as the PDZ domain (~100 residues, Fig. 1B) (11), a complete pairwise analysis comprising all possible amino acid substitutions at each position involves making and quantitatively measuring the equilibrium energetic effect of nearly two million mutations.
Finally, even if these two technical issues were resolved, it is unclear how to go beyond the idiosyncrasies of one particular model system to the general, system-independent constraints that underlie protein structure, function, and evolvability.

Recent technical advances in massive-scale mutagenesis of proteins open up new strategies to
address all these issues. In the PDZ domain, a bacterial two-hybrid (BTH) assay for ligand-binding coupled to next-generation sequencing enables high-throughput, robust, quantitative measurement of many thousands of mutations in a single experiment -a "deep mutational scan" (12)(13)(14). Parameters of the BTH assay are tuned such that the binding free energy between each PDZ variant and cognate ) is quantitatively reported by its enrichment relative to wild-type before and after selection Fig. S1 and Supplementary Methods). This relationship enables extension of single mutational scanning to very large-scale double mutant cycle analyses -a "deep coupling scan" (DCS) study (15).
Indeed, the throughput of DCS is so high that it enables the study of double mutant cycles in several homologs of a protein family in a single experiment. Thus, DCS provides a first opportunity to deeply map the pattern and evolutionary conservation of interactions between amino acid residues in proteins, a strategy to reveal the fundamental constraints contributing to protein function.
We focused on a region of the binding pocket of the PDZ domain, a protein-interaction module that has served as a powerful model system for studying protein energetics (13,16). PDZ domains are mixed αβ folds that typically recognize C-terminal peptide ligands in a binding groove formed between the α2 and β2 structural elements (Fig. 1B). We created a library of all possible single and double mutations in the nine-residue α2 helix of five sequence-diverged PDZ homologs (PSD95 pdz3 , PSD95 pdz2 ,  (Table   S1). Thus, we can (1) analyze the distributions of double mutant cycle coupling energies for nearly all pairs of mutations in the α2 helix and (2) study the divergence and conservation of these couplings over the five homologs.
We first addressed the problem of how to estimate native coupling energies from mutant cycle data. In general, the effect of a mutation at any site in a protein is a complex perturbation of the elementary forces acting between atoms, with a net effect that depends on the residue eliminated, the residue introduced, and on any associated propagated structural effects. Thus, the distribution of thermodynamic couplings at any pair of positions over many mutation pairs could in principle be arbitrary and difficult to interpret. However, we find surprising simplicity in the histograms of coupling energies.
In general, the data follow a double-Gaussian distribution, with either both mean values centered at zero or with one of the means different from zero (  (15)), suggesting that the binary character of couplings may be universal in proteins. A simple mechanistic model is that the observed free energy of ligand binding arises from a cooperative internal equilibrium between two functionally distinct conformational states, with mutations at some sites capable of dramatically perturbing this equilibrium ( Fig. S9). Indeed, such a two-state internal equilibrium has been observed in PDZ domains, and is part of an allosteric regulatory mechanism controlling ligand binding (17). Thus, the population-weighted mean of the distribution of coupling energies for each position pair (Fig. 2, dashed lines) provides an empirical estimate of the native interaction between amino acids through mutagenesis.
Two technical points are worth noting. First, the spread of the distributions is large, generally exceeding the estimated magnitude of the native interactions (Fig. 2). This means (1) that traditional mutant cycle studies carried out with specific choices of mutations are more likely to just reflect the choice of mutations rather than the native interaction, and (2) that the only way to obtain good estimates of the native interaction between residues is to average over the effect of many double mutant cycles per position pair. Second, we find that the BTH/sequencing approach displays such good reproducibility that it is possible to detect coupling energies with an accuracy that is on par with the best biochemical assays.
For example, the average standard deviation in mean coupling energies for position pairs over four independent experimental replicates in PSD95 pdz3 is ~0.06 kcal/mol. Thus, we can map native amino acid interactions with high-throughput without sacrificing quality.
What do the data tell us about the pattern of amino acid interactions? Figure 3 shows idiosyncratically. The conserved couplings form a chain of physically contiguous residues in the tertiary structure that both contact (1, 5, 8) and do not contact (4) the ligand (Fig. 4B). Interestingly, position 4 is part of a distributed allosteric mechanism in some PDZ domains for regulating ligand binding (17), providing a biological role for its energetic connectivity with binding pocket residues. Overall, the pattern of couplings does not just recapitulate all tertiary contacts between residues (compare Fig. 4A with 4E, white and black circles) or the pattern of internal backbone hydrogen bonds that define this secondary structure element. Instead, conserved amino acid interactions in the PDZ α2 helix are organized into a spatially inhomogeneous, cooperative network that underlies ligand binding and allosteric coupling.
This result begins to expose the complex energetic couplings underlying protein function, but also highlights the massive scale of experiments required to deduce this information for even a few amino acid positions. How can we generalize this analysis to deduce all amino acid interactions in a protein, and for many different proteins? There are potential strategies for pushing deep mutational coupling to larger scale, but quantitative assays such as the BTH are difficult to develop, mutation libraries grow exponentially with protein size, and the averaging over homologs will always be laborious, expensive, and incomplete.
A different approach is suggested by understanding the rules learned in this experimental study for discovering relevant energetic interactions within proteins. The bottom line is the need to apply two kinds of averaging. Averaging over many mutations provides an estimate of native interaction energies between positions, and averaging the mutational effects over an ensemble of homologs separates the idiosyncrasies of individual proteins from that which is conserved in the protein family. Interestingly, these same rules also comprise the philosophical basis for a class of methods for estimating amino acid couplings through statistical analysis of protein sequences. The central premise is that the relevant energetic coupling of two residues in a protein should be reflected in the correlated evolution (coevolution) of those positions in sequences comprising a protein family (16,(18)(19)(20). Statistical coevolution also represents a kind of combined averaging over mutations and homologs, and if experimentally verified, would (unlike deep mutational studies) represent a scalable and general approach for learning the architecture of amino acid interactions underlying function in a protein. The data collected here provides the first benchmark data to deeply test the predictive power of coevolution-based methods.
One approach for coevolution is the statistical coupling analysis (SCA), a method based on measuring the conservation-weighted correlation of positions in a multiple sequence alignment, with the idea that these represent the relevant couplings (16,21). In the PDZ domain family (~1600 sequences, pySCA6.0 (22)), SCA reveals a sparse internal organization in which most positions evolve in a nearly independent manner and a few (~20%) are engaged in a pattern of mutual coevolution (16,21,22). In this case, the coevolving positions are simply defined by the top eigenmode (or principal component) of the SCA coevolution matrix, and represent a biologically important allosteric mechanism connecting the β2-β3 loop with the α1-β4 surface through the binding pocket and the buried α1 helix (Fig. 1B, and (23)).
Extracting the coevolution pattern in the top eigenmode for just the α2 helix (Fig. 4C), we find that coevolution as defined by SCA in fact nearly quantitatively recapitulates the homolog-averaged experimental couplings collected here ( 4 = 0.82, = 10 ;<= by F-test, Fig. 4D). The predictions also hold for individual homologs (Fig. S10A-E), consistent with the premise that the essential physical constraints underlying function are deeply conserved. This relationship is robust to alignment size and method of construction ( Fig. S11A-C), but depends on both of the basic tenets that underlie the SCA method -conservation-weighting ( Fig. S11D-E) and correlation (Fig. S11 F-G) (22).
Another approach for amino acid coevolution is direct contact analysis (DCA, (24,25)), a method developed for the prediction of tertiary contacts in protein structures. DCA uses classical methods in statistical physics to deduce a matrix of minimal pairwise couplings between positions ( '( , Fig. 4E) that can account for the observed correlations between amino acids in a protein alignment, with the hypothesis that the strong couplings in '( will be direct contacts in the tertiary structure. Indeed, studies convincingly demonstrate that the top /2 (where L is the length of the protein) couplings are highly enriched in direct structural contacts (26). Consistent with this, this method successfully identifies direct contacts in the PDZ α2 helix (Fig. 4E, compare heat map to white and black circles) to an extent that agrees with the reported work. However, DCA-based predictions of functional energetic couplings between mutations are weakly (though significantly) related to the homolog-averaged experimental data ( 4 = 0.33, = 10 ;B by F-test, Fig. 4F-G). These results are similar or poorer for prediction of couplings in individual domains ( Fig. S10G-K). Interestingly, a DCA model in which only the top pairwise couplings in '( that define tertiary contacts are retained and the weaker non-contacting couplings are randomly scrambled shows predictions that are unrelated to the experimental data ( 4 = 0.05, = 0.09 by F-test, Fig. 4H). Thus, non-contact couplings in the DCA model, which represent noise from the point of view of structure prediction, contribute significantly to prediction of function.
These findings clarify the current state of sequence-based inference of protein structure and function (27,28). DCA successfully predicts contacts in protein structures in the top couplings, but in its current form, does not appear to capture the cooperative constraints that underlie protein function well. In contrast, SCA does not predict direct structural contacts well, but instead seems to more accurately capture the energetic couplings that contribute to protein function(s). As explained previously, these two approaches sample different parts of the information contained in a sequence alignment (29,30), and therefore are not mutually incompatible. These results highlight the need to unify the mathematical principles of contact prediction and SCA-based energetic predictions towards a more complete model of information content in protein sequences.
In summary, the collection of functional data for some 56,000 mutations in a sampling of PDZ homologs demonstrates an evolutionarily conserved pattern of amino acid cooperativity underlying function. This pattern is well-estimated by statistical coevolution based methods, suggesting a powerful and (given the scale of experiments necessary) uniquely practical approach for mapping the architecture of couplings between amino acids. Indeed, the remarkable conclusion is that with a sufficient ensemble of sequences comprising the evolutionary history of a protein family, the pattern of relevant amino acid interactions can be inferred without any experiments. and Syntrophin pdz (1Z86, blue)), emphasizing the conserved αβ-fold architecture of these sequencediverse proteins (33% average identity, Table S1). Structural elements discussed in this work are indicated. (C), The nine-amino acid α2-helix, which forms one wall of the ligand-binding site. (D-E), The distribution of experimentally determined binding free energies, ∆ .'/0 , for all single mutations (D, 855/855) and nearly all double mutations (E, 56,694/64,980) in the α2-helix for the 5 PDZ homologs, with the affinity of wild-type PSD95 pdz3 indicated (wt). The red lines indicate the independently validated range of the assay ( Fig S1B); essentially all measurements fall within this range. These data comprise the basis for a deep analysis of conserved thermodynamic coupling in the PDZ family.