Free Energy Perturbation Calculations of Mutation Effects on SARS-CoV-2 RBD::ACE2 Binding Affinity

Graphical abstract


Introduction
The ability to accurately predict binding affinity changes upon mutations of interfacial residues is a problem of significant importance, ranging from the general problem of understanding of interaction specificity and the design of therapeutics such as potent monoclonal antibodies that target antigens to revealing the mechanism of action of cancer driver mutations. Multiple approaches to the problem have been developed including machine learning, 1-5 statistical potentials 6 and various force field related scoring functions [7][8][9][10][11] embedded in programs such as FoldX 11 and Rosetta. 7 Each approach is associated with its own set of issues, such as conformational changes upon mutation, that are nicely discussed in reviews of Bonvin and co-workers. 12 Moreover, some methods succeed on some test sets and fail on others, suggesting either over-training or simply that some protein-protein interfaces have different properties than others. Issues of experimental validation can also arise 13 ; not all experimental methods are equally accurate and, as discussed below, the nuances of the experimental system can have significant effects on the outcome.
Detailed atomic-level simulations have not been extensively applied to the prediction of mutation effects, in part due to the computational requirements involved. Free-energy perturbation (FEP) methods have the potential to impact the field as physics-based force fields are, in principle, agnostic to the system being studied. Most current applications have involved the optimization of ligand-protein interaction in the context of small molecule drug design (reviewed in 14) but recent publications have begun to explore the use of FEP methods to the study of protein-protein interactions (PPIs); specifically, to the effects of interfacial mutations on protein-protein binding free energies. 8,9,[15][16][17][18][19] This is an inherently complex problem since, as opposed to relatively rigid ligand binding pockets, protein-protein interfaces are often quite large and less constrained so that they can more easily undergo conformational changes as a result of a mutation. Moreover, FEP calculations involve a complex computational infrastructure and are extremely time consuming. However, fast graphical processing units (GPUs) make such calculations feasible and a number of recent publications, involving different software packages, suggest that the methodology has reached the point that good correlation with experiment is to be expected. 8,9,[15][16][17][18][19] Clearly, if FEP methods are capable of providing meaningful results, then in many applications, the computational cost will be worthwhile.
Here we explore the ability of FEP calculations to reproduce the effects of mutations on the binding of the receptor binding domain (RBD) of the SARS-CoV-2 spike protein with the human angiotensin converting enzyme 2 (ACE2) using the FEP+ implementation (see Methods). Given that the pathogen entry into the host cell is mediated by RBD::ACE2 binding, the problem has attracted considerable interest and multiple experimental  and computations studies 33,[47][48][49][50][51][52][53][54][55][56][57][58][59][60][61][62][63] have been reported. We chose to study a set of 23 frequently observed RBD mutations (Table S1) located in the RBD::ACE2 interface ( Figure 1) of Alpha, Beta, Gamma, Delta, or Omicron SARS-CoV-2 variants (Table S2). Surface Plasmon Resonance (SPR) experiments were carried out for each and compared to previous experimental work.  A number of computational methods were applied to predict binding free energy changes upon mutations, DDG. We found FEP to be the best performer and, moreover, FEP trajectory analysis facilitates the characterization of the biophysical effects that underlie mutational effects. We show that FEP successfully recapitulates the stabilizing epistatic effect of the Q498R N501Y mutant present in every Omicron variant. The ability to anticipate non-additive effects of multiple mutations is likely to be an important element of future efforts in protein interface design.
Computational prediction of the effect of mutations on binding affinity  Table 2 are given in Figure S2. The overall performance of FEP is in line with previous work 8,9,14,[66][67][68] ; a PCC of 0.6 and an RMSE of 0.8 kcal/mol ( Table 2). The FEP calculations clearly predict that all four stabilizing mutations, N501Y, N501T, S477N, and Y453F have DDG values < 0 although the prediction for S477N is a weak one. Of note, previous FEP calculations on N501Y 53,55,56 have yielded results very similar to ours attesting to the robustness of the method. The most significant failure of FEP is its prediction that A475V is stabilizing when the SPR results indicate that it is weakly destabilizing. A likely explanation for this result is that A475 is located close to the N-terminal residue of ACE2 (Q18) for which no coordinates were assigned in the crystal structure. 69 Another problematic result, a largely overestimated DDG for Q498R, will be discussed below.
Experimental binding affinity difference values were calculated based on binding affinity (K D ) measurements for wild-type (WT) and single mutant (MT) proteins using the following formula: DDG = RT ln (K D (MT)/K D (WT)), in units of kcal/mol.
The FEP results in Table 2 were obtained from 100 ns simulations. Extensive prior work 8,9 has demonstrated that shorter simulations are often inadequate for some mutations, typically those involving residues with a low degree of solvent exposure (i.e. partially or fully buried at the protein-protein interface). The results of 10 ns FEP simulations are shown in the Supplementary Material (Table S3) for comparison. Of note, the N501Y mutation, which is predicted to be destabilizing at 10 ns is correctly predicted to be stabilizing at 100 ns. Evaluation of theoretical methods on the ACE2:: RBD dataset (Table 2) shows that machine learning (ML) algorithms (Mutabind2, 5 mCSM-PPI2 3 and SAAMBE-3D 1 ) uniformly have a weak correlation with experiment (PCC < 0.4). As we have pointed out previously 60 ML methods tend to overpredict destabilizing mutations presumably due to the preponderance of destabilizing mutations in training sets. A related factor likely accounts for relatively low RMSEs of ML methods since most mutations in training sets have only small effects on binding affinities. BeAtMuSiC evaluates mutation effects using a statistical potential 6 while FoldX uses an empirical physics-based force field 10,11 . Both methods assume a rigid backbone although FoldX allows for side chain rearrangement upon mutation. Neither method produces a meaningful correlation with experiment. Of note, poor performance of FoldX can be attributed to outliers (Figure S2) whereby mutations require backbone rearrangement. BeAtMuSiC identifies no stabilizing mutations while FoldX correctly identifies N501T and Y453F. Other than FEP, Rosetta flex ddG 7 is the only method that allows for backbone flexibility but its PCC is still significantly smaller than that of FEP and, as can be seen in Table 2, Rosetta flex ddG predicts two of the four stabilizing mutations (although the prediction for Y453F is a weak one). Nevertheless, even its partial success highlights the need to account for the ability of proteins to relax in response to interfacial mutations. Finally, we also tested Molecular Mechanics/Poisson-Boltzmann Surface Area (MM/PB-SA) and Molecular Mechanics/Generalized Born Surface Area (MM/GB-SA) methods, which utilize molecular mechanics energies along with a continuum representation of the solvent. As can be seen in Table 2, these methods performed poorly (see Methods for the protocol details).
The ACE2::RBD dataset has limitations for statistical analysis given the narrow affinity range of the experimental values (from À0.8 to +0.6 kcal/mol, Table 2) and the preponderance of neutral (15 out of 23) over stabilizing and destabilizing mutations (4 each). For a more robust assessment, we have compared Table 2 Calculated ACE2/RBD binding affinity changes for RBD mutants. Pearson correlation coefficient (PCC), and root mean square error (RMSE) are calculated for every method tested based on comparison to SPR results in Table 1. Stabilizing mutations with DDG À0.4 kcal/mol in green, destabilizing mutations with DDG ! 0.4 kcal/mol in red. Experimental binding affinity difference values were calculated based on SPR binding affinity (K D ) measurements for wild-type (WT) and single mutant (MT) proteins using the following formula: DDG = RT ln (K D (MT)/K D (WT)), in units of kcal/mol. Correlation plots for all theoretical methods are provided in Figure S2. Calculations were performed on a crystal structure of ACE2/RBD (PDBID 6M0J). Protein specific residue numbering of all the mutants as in Uniprot ID P0DTC2.
performance of various theoretical methods on a larger and more balanced dataset. Specifically, we have expanded the ACE2::RBD dataset with mutations from another system (DIP::Dpr). DIPs and Dprs are families of neuronal adhesion proteins whose members bind to one another with distinct subgroup-specific affinities. 70 We have previously studied a number of DIP::Dpr mutants using SPR [71][72][73] to establish specificity deteminants in this family. Here, we chose 18 single point mutations in Dpr6::DIP-a and Dpr10::DIPa complexes to combine with 23 ACE2::RBD mutations for a total of 41 data points. The combined ACE2::RBD DIP:: Dpr dataset has a 2.5 times wider range of DDG values (À1.3 to +2.2 kcal/mol) than just ACE2::RBD alone and a more balanced distribution of stabilizing, neutral, and destabilizing mutations (Table S3). Table S3 contains the results of different computational methods applied to the combined data set. We use multiple statistical tests to measure performance. Based on the same measures used in Table 2 (PCC and RMSE), all methods exhibit similar performance but with somewhat improved PCCs (except MM/GB-SA). FEP gives the best results with a PCC of 0.8 and RMSE of 1.0 kcal/mol. The next best performing method was Mutabind2 (PCC of 0.6 and RMSE of 0.8). Rosetta flex ddG and FoldX have a PCC of 0.5 and 0.4, respectively, with FoldX having a large RMSE error (1.8 kcal/mol) mostly due to high energetic penalties for mutations that introduce clashes at the interface which are difficult to correct when the backbone fixed. The continuum methods displayed poor performance (PCC < 0.3) and large errors (RMSE $ 6-7). Table S3 also reports PCC values for the 18 observations in the DIP::Dpr data set alone. As is evident from the Table, FEP is by far the best performer with a PCC of 0.9, well above that of other methods. Of note, all methods perform better on the DIP::Dpr than on the ACE2::RBD data set. We attribute this observation to the fact that the DIP:Dpr interface is not prone to backbone rearrangement upon mutation.
We categorized mutations as stabilizing (DDG À0.4 kcal/mol, green), neutral (À0.4 < DD G < 0.4 kcal/mol), or destabilizing (DDG ! 0.4 kcal/ mol, red). The cutoff of ±0.4 kcal/mol was chosen based on expected reproducibility error in SPR measurements (see Methods). We evaluated the ability of various theoretical methods to correctly classify mutations into each of these three categories -essentially yes or no regarding placement in a particular category as well as their ability to correctly classify the effects of each mutant into one of three categories (using Matthews correlation coefficient, MCC). As shown in Table S3, FoldX is best at classification into three categories (MCC (triple class) = 0.7; 32 out of 41 mutations correctly classified) and destabilizing mutations (MCC (dest) = 0.9). FEP is the best method for prediction of stabilizing mutations (MCC (stab) = 0.5) and second best for three category classification (MCC (triple class) = 0.6; 31 out of 41 mutations correctly classified). The probability of achieving this result by random chance is approximately 10 À8 .
To assess the statistical significance of our FEP predictions, we calculated a p-value for the classification of stabilizing mutations, which was approximately 3.0 Â 10 À11 (Table S3) under the assumption that the reproducibility error of the calculations was close to zero. To test this assumption, we performed ten independent repeats of the 100 ns FEP simulations for a select set of mutants. The standard deviation of DDG values and the standard error of the mean (SEM) are reported in Table S4). On average, the DDG values obtained starting with different random velocities deviate by approximately 0.2 kcal/mol (SEM $ 0.06). These small fluctuations would only affect the classification of mutations whose DDG values are very close to the cutoff. In fact, categorizing stabilizing mutations is not affected by using a mean value of multiple independent runs (Table S4).

Physical insights from trajectory analysis
The N501Y mutation is responsible for a high infectivity and transmissibility of the Alpha variant of SARS-CoV-2 74 and has the largest stabilizing effect ( Table 1). The N501T mutation found in the SARS-CoV-2 variants transmitted from mink to humans 75,76 occurs at the same position. Analysis of FEP trajectories reveals that the stabilization effect associated with N501Y and N501T mutations is due to substitution of the asparagine with less polar side chains of tyrosine and threonine. N501 has only one of its polar groups satisfied in the wild-type (WT) structure while, throughout the course of the relevant trajectories, the hydroxyl groups of both tyrosine and threonine participate in hydrogen bonds (see dashed lines, Figure 2 (A)). In addition to enhanced stability due to the absence of unsatisfied hydrogen bonds, both mutants undergo stabilizing interactions with Y41 of ACE2; the aromatic ring of Y501 participates in p-p stacking interactions (see purple lines in Figure 2(A)) while the methyl group of T501 forms a hydrophobic contact with the aromatic ring of Y41 (see gray shading in Figure 2(A)). Of note, previous studies have identified the role of p-p stacking as a source of stabilization of the N501Y mutant 47,53,56,58 but the role of the unsatisfied hydrogen bond in the WT protein has not been emphasized.
Analysis of trajectories associated with the Y453F mutation shows that in the WT protein, a hydrogen bond between the hydroxyl group of Y453 and the Ne atom of H34 is present in $25% of the WT trajectory (Figure 2(B)) while, in most cases, these two residues form hydrogen bonds with trapped solvent molecules. In the Phe mutant, there is no need to satisfy the buried hydroxyl of the tyrosine while the Ne of H34 is satisfied by structured waters or backbone atoms (see dashed lines showing hydrogen bonds in Figure 2(B)). Thus, the enhanced stability of the mutant is likely due to the greater hydrophobicity of a Phe relative to a Tyr. Of note, our trajectory analysis is in agreement with a previous study comparing crystallographic structures of the WT and Y453F mutant. 32 The simulations correctly predict that the S477N mutation is stabilizing but only weakly so. This residue faces the N-terminal of ACE2, and, specifically, the Q18 residue for which coordinates were missing in the crystal structure and, hence, modelled as an acetyl group cap in our simulations (Figure 2(C)). Thus, the calculations may suffer from conformational uncertainties in this region. The trajectories reveal that the longer Asn side chain makes more contacts with the ACE2 N-terminal than does the WT Ser. We show only one snapshot from wild-type and mutant trajectories in Figure 2(C), but in reality, due to the flexibility of the ACE2 N-terminal, no specific contact is retained throughout the simulations.
The largest error in the FEP calculations is for the Q498R mutant whose destabilizing effect is overpredicted by about 2.5 kcal/mol. Analysis of the 100 ns Q498R trajectory reveals that the Arg side chain samples different conformations with the most prevalent state ($53%) involving an unfavorable polar-hydrophobic contact between N501 and the aliphatic chain of the Arg, leaving both polar groups of the Asn unsatisfied ( Figure 3 (A)). This is likely a contributor to the destabilizing DDG value. As we have noted previously, 9 a computationally unfavorable outlier of this magnitude is typically attributable to the failure of the molecular dynamics trajectories (100 ns in this study) to achieve a converged reorganization of the protein structure. To test this possibility, we carried out a 300 ns simulation which appears to reach convergence at $200 ns ( Figure S3) where the destabilizing effect of Q498R is calculated to be 1.9 kcal/mol; reduced from 3.6 kcal/mol at 10 ns and 2.7 kcal/mol at 100 ns. It may well be the case that the system has converged to a metastable state at 300 ns and that there are lower energy states not sampled in the course of the simulation.

Epistatic effect of the Q498R N501Y double mutant in the omicron variant
Zahradnik et al. 64 demonstrated that the Q498R N501Y double mutant is more stabilizing than the additive effect of two single point mutations as estimated by their yeast display assay (a "cooperativity energy" of $À1.7 kcal/mol). Our SPR results on the single Q498R and N501Y mutants predict, if their effects were additive, that the double mutant would be stabilizing by À0.6 kcal/mol (À0.8 + 0.2 kcal/mol for the single mutations, respectively) while the experimental value for the double mutant is À1.2 kcal/mol yielding a cooperativity energy of À0.6 kcal/mol (see Figure 3(B) for the experimental DDG values and Figure S1 for corresponding fitted data and dissociation constants). While the experimental SPR and yeast display DDG values differ, both methods indicate that a substantial epistatic effect is playing a role, perhaps contributing to the greater infectivity of the Omicron variants where these mutations are present.
FEP calculations on the double mutant predict a DDG of À1.4 kcal/mol whereas the predicted additive effect of the two single mutants ( Table 2) is +1.5 kcal/mol -corresponding to a cooperativity energy of À2.9 kcal/mol (Figure 3(B)). Thus, the 100 ns calculations successfully predict the existence of cooperativity. A detailed examination of the structure of the double mutant provides a compelling physical interpretation as to why it behaves so differently from the single Q498R mutation. In the double mutant, Arg 498 forms a favorable pairing with the side chain of Tyr 501. The aliphatic portion of the Arg side chain packs against the Tyr aromatic ring creating an enhanced hydrophobic contact in the double mutant compared to the N501Y single mutant (grey shading, Figure 3(A)), while one hydrogen of the Arg guanidinium head group forms a hydrogen bond with the oxygen of the Tyr hydroxyl group (dashed lines, Figure 3(A)). The geometries of the two residues are such that this pairing can be carried out without introducing any conformational strain, either in the backbone or in the side chains themselves. Thus, in the presence of N501Y, the mutation from Gln to Arg actually enhances binding rather than diminishing it.
As a control, we considered a second double mutant (L452R T478K) where no cooperativity is observed experimentally (Figure 3(C)). The FEP calculations at 100 ns predict values of both single and double mutant within 0.4 kcal/mol from the experiment and a cooperativity energy of À0.1 kcal/mol (same as SPR) (Figure 3(C)). Thus, FEP accurately predicts the absence of cooperativity in the double mutant belonging to the Delta SARS-CoV-2 variant (Table S2).

Discussion
We have carried out experimental and computational studies of a series of RBD mutants located in the RBD::ACE2 interface. The choice was based in part on their frequency in infective SARS-CoV-2 variants and in part because they have been extensively studied.  A central goal has been to assess the ability of the free energy perturbation methodology to predict the effect of mutations at the protein-protein interface on binding affinities but this in turn required an assessment of experimental accuracy. To this end we carried out SPR experiments on each mutant and compared our findings to those previously reported. We also constructed a balanced dataset of mutations with sufficient energy spread and compared performance of easily accessible computational methods to those obtained from FEP.
As shown in Table 1 and discussed above, there is excellent agreement between our SPR results and previous yeast display results from the Bloom group. 20 Moreover, both sets of results are in the range reported in other studies. In particular, all methods agree on the identification of stabilizing mutations although the experimental values vary by as much as 1 kcal/mol, likely due to some of the issues discussed in Methods. As pointed out above, FEP yields the best correlation with our SPR results and, crucially, is the most effective in identifying stabilizing mutations ( Table 2, Table S3).
The mutations we used in our study are not a part of common ML training databases such as SKEMPI 77 ; hence, our dataset for methods evaluation can be considered as "blind". A modest level of performance for ML methods ( Table 2, Table S3) is in line with a study by Bonvin and co-workers where various ML methods were tested on a blind dataset of 487 mutations in 56 complexes. 78 FoldX and Rosetta flex ddG categorize neutral and destabilizing mutations well and are at least partially successful in identifying stabilizing mutations (Table S3), though their overall performance (PCC of 0.4-0.5) is inferior to FEP (PCC of 0.8).
In addition to the methods we were able to test on our data set, many other studies on RBD mutants have been reported. Table S5 lists published results on RBD::ACE2 on different sets of mutants than studied here but it is of interest to compare cases where there is overlap with previous work. Previous FEP studies 53,55,56 also carried out long simulations and all identify N501Y and N501T as stabilizing. TopNetTree, 49 a topology-based ML method, does not perform well on our data set. The MM/GBSA study from the Bahar group 58 successfully predicts that N501Y is stabilizing but was not run on other mutants in our data set. We could not reproduce this result in our MM/GB-SA calculations likely due to protocol differences and use of different crystal structures as input. Finally, Maranas and co-workers 47 used physical interactions extracted from MM/GBSA simulations to train a neural network to predict DDG values (NN_MM- GBSA method). Since many of the mutants in our set were used in the NN_MM-GBSA training (Table S5) it is difficult to make comparisons with our own results. Despite the success of the FEP approach revealed in this work, significant challenges remain. In previous work on a variety of systems, we have demonstrated that the FEP results correlate well with experiment (PCC values on the order of 0.6-0.8) and display RMS errors in the range expected for the OPLS4 79 and related molecular mechanics force fields (0.5-1 kcal/-mol). 8,9,14,[66][67][68] Precision beyond the above cited statistics is very difficult to obtain, indeed, experimental reproducibility errors are typically on the order of 0.4 kcal/mol, even for the high quality SPR results that we report here (see Methods).
As highlighted in this work, the major challenge in using FEP to predict mutation effects on proteinprotein binding affinities, as opposed to small molecule binding, is the possibility of significant conformational changes induced by a mutation, for example if a buried charge is created by mutating a buried hydrophobic residue at the interface to one with a net charge. The key issue is whether the conformational changes required to make accurate predictions are accessible on the timescale of the FEP simulation. When the conformational changes consist primarily of side chain rearrangements and relatively minor backbone motion, FEP will typically deliver reliable results (as in N501Y). When there are significant backbone conformational changes, the accuracy of the FEP results will depend upon whether the barrier to conformational change can be surmounted on the time scale of the simulation. In terms of overall accuracy, the results obtained here are consistent, in terms of both RMSD errors and correlation coefficients, with those we have reported previously in studying HIV derived gp120-antibody binding and also more diverse sets of protein-protein complexes via FEP simulations 8,9 . The most significant errors may result from structural uncertainties (as in A475V) or from the prediction of overly unfavorable free energy changes (as in Q498R) which often result from electrostatic or steric clashes or from the burial of a charge in a hydrophobic pocket. 9 However, mutations that are predicted to be highly destabilizing would be of little or no consequence in the context of binding optimization project since such mutations would be rejected in an initial screen even if converged results were obtained.
The accuracy of binding free energy differences determined by FEP or other structure-based methods is crucially dependent on the quality and completeness of the complex structures used in the simulations. While FEP performs well with experimentally determined structures (as shown in Table S3), achieving comparable success with modeled complexes obtained by docking or homology models of poor quality is much harder. Our study reveals that using experimental structures helps with general convergence, as multiple independent FEP runs result in SEM < 0. 1 kcal/mol (as shown in Table S4). However, when FEP simulations were performed on modeled protein-peptide complexes, a much larger deviation (SEM $ 1.5 kcal/mol) was observed. 79 Although FEP is a successful method for predicting the effect of point mutations on binding free energy differences, it is not suitable for direct evaluation of deletions or insertions effects using alchemical transformations. A combination of MD and advanced sampling techniques to guarantee the adequate exploration of each relevant degree of freedom can be applied to obtain binding free energies within experimental accuracy in the process of progressive separation of proteins via the potential of mean force (PMF). 80,81 The PMF approaches are not limited to single point mutations, but their success requires sufficiently accurate approximation of the native wild-type and mutant states.
Our results suggest FEP might be used in a strategy to optimize the binding of two proteins, for example in the optimization of antibodies. FEP calculations appear to have sufficient rank ordering capability to enable prioritization of specific single residue mutations. The initial step in a design strategy would be an exhaustive screening of single mutations at the various positions across the protein-protein interface (perhaps a few hundred to a few thousand calculations, which would require relatively modest expenditure, given the steadily decreasing cost of GPU-based computation). A key advantage of FEP is that analysis of trajectories can reveal insights as to the response of the interface to different perturbations, enhanced when needed by loop modeling procedures that more efficiently sample conformational space. These insights could then be translated into the investigation of a selected subset of double and triple mutations, with the goal of achieving favorable nonadditive effects. The understanding we gained about the Q498R N501Y double mutant could not have been accomplished by any of the other methods tested here.
It is important to clarify that, despite identifying cooperativity of the N501Y Q498R double mutant, the FEP results have not reached the point where the calculated values have experimental accuracy. It is then interesting to consider how a researcher interested in designing a stabilizing mutation would respond to calculated single mutant values of À1.2 kcal/mol for N501Y, +2.7 kcal/mol for Q498R and À1.4 kcal/mol for the double mutant. We would argue that the large calculated cooperativity energy of À2.9 kcal/mol, and the physical basis of this effect revealed by the simulations, would provide a strong hint that the double mutant is worth testing experimentally. Moreover, the experience we gained in this study would make that decision more likely. In conclusion, a careful exploration of a particular system of biological importance as enabled by FEP simulations would then appear to offer a way forward in many practical applications.

ACE2::RBD dataset
We focused on missense RBD mutations at the interface with ACE2 that occurred most frequently in the US at the beginning of 2021 or were a part of known variants of concern. Among single point mutations with a frequency above 100 as of Jan 4, 2021, only seven mutations (S477N, N439K, N501Y, L452R, Y453F, S477R and S477I) were both missense and interfacial. To expand the dataset, we added missense interfacial mutations with a lower frequency (in the range of 10-100) if mutations were stabilizing or destabilizing in the study of Bloom and co-workers 20 while nearly neutral positions were ignored. Stabilizing mutations were of interest as potentially increasing infectivity of the virus, while destabilizing mutations were a necessary addition to create a balanced dataset for a proper testing of DDG predictors. The K417T and Q498R mutations were added due to occurrence in the SARS-CoV-2 variants of concern and not based on frequency counts. The K417T mutation was added due to its emergence in Brazil (as a part of the Gamma variant) at the time. The Q498R mutant was studied in the context of double mutant effects alongside N501Y due to the cooccurrence of these two mutations in Omicron variant (Table S2) and a recent study on in vitro evolution suggesting cooperativity between the two mutations. 64 Although the cumulative frequency (i.e. a number of times a given mutation has been found in the sequenced SARS-CoV-2 genomes) has changed during 2021, the majority of the most frequently observed mutations in January 2021 were still among the most recurrent mutations at the end of the year, with our data set including 16 out of 20 interfacial missense RBD mutations that had the highest frequency (>1000) on December 31, 2021 (Table S2).
Standalone version (https://compbio.clemson. edu/saambe_webserver/standaloneCode.zip) was used for binding affinity calculations of SAAMBE-3D. FoldX calculations were performed as described in Sergeeva et al. 72 Rosetta flex ddG calculations were run using a standalone version (https://github.com/Kortemme-Lab/flex_ddG_tutorial) with the following parameters (considered to give optimized performance of this method 7 : nstruct = 35, max_minimization_iter = 5000, abs_s core_convergence_thresh = 1.0, number_back rub_trials = 35000, backrub_trajectory_stride = 35000. The MM/GB-SA and MM/PB-SA methods rely on the end-point approximation of the wild-type (WT) and mutant (MT) states to calculate binding free energies using the following formula: DDG = DG binding (MT) À DG binding (WT), where DG binding (P 1 :: P 2 ) = G(P 1 ::P 2 ) À G(P 1 ) À G(P 2 ). To generate the mutated proteins, we used the "Mutate Residue" function followed by a local minimization of the mutated residue in Maestro, using the crystal structure of the WT as input. We employed the AMBER software package and pmemd.cuda 84,85 for minimization. The P 1 ::P 2 complex is minimized in explicit OPC water, followed by removal of water molecules and energy calculations in implicit solvent. The isolated P 1 and P 2 proteins are assumed to have the same conformation as in the complex. The energy calculations were carried out using MMPBSA.py. 86 The mbondi2 radii set was used in both GB and PB calculations, and the OBC2 model was used for GB calculations.
We used Schrö dinger software (2021-2 release) and default FEP+ protocols (implementing guidelines for protein-protein interactions published earlier 8,9 for our predictions of binding affinity changes in the ACE2/RBD complex upon RBD mutations. The release incorporates a recently developed OPLS4 force field, 79 replica exchange with solute tempering (REST) enhanced sampling methodology for mutated residues, and an improved grand canonical Monte Carlo (GCMC) protocol for sampling solvent molecules around mutated residues. 87 FEP+ is a fully physics-based model that uses explicitly represented water. During FEP+ simulations, an alchemical transformation of a wild type amino acid residue into a mutant residue is conducted, which is implemented by running a series of separate molecular dynamics simulations ("lambda windows") with varied energy weighting. The differences between adjacent lambda windows are first calculated using a perturbative expansion and then summed up to estimate the total free energy change between a wild-type and mutant states. To enhance the convergence of the free energy calculations, our default protocols use 12 lambda windows for charge conserving mutations and 24 lambda windows for the charge changing mutations. All mutations were run for 10 ns and 100 ns (see Table S3). Calculations of double mutant effects were performed by simultaneous alchemical transformation of the two mutated residues. This procedure minimizes errors associated with a more common FEP protocol where mutations are introduced sequentially. For example, the DDG (Q498R N501Y) is predicted to be À1.4 kcal/mol using the simultaneous protocol and À0.6 kcal/mol using the sequential protocol. Of note, the results from the simultaneous protocol are in better agreement with the experimental value of À1.2 kcal/mol. The FEP+ methodology automates generation of the mutated end-point, while the user provides a prepared wild-type complex as input. This process involves sampling and ranking the rotamers of the mutated residue, followed by minimization of the mutated residue. Additionally, the protein-protein complex is equilibrated before the FEP/REST2 production run. This protocol is typically effective in accurately identifying the mutated end-point throughout the FEP/REST2 trajectory, except in cases where a significant conformational change is required upon mutation. FEP+ requires GPU computing and the time required per single point mutation depends on the length of simulation (in nanoseconds), system size (in atoms), and type of mutation (with chargechanging mutations taking longer to compute compared to charge-neutral mutations as the number of lambda windows is twice as large). In the ACE2/RBD system ($100,000 atoms including explicit waters in the solvent box), the shortest 10 ns charge-neutral mutations take less than a day, while the longest 100 ns charge-changing simulations take $2 weeks when complex and solvent legs are run in parallel with each simulation leg using 4 GPUs.

Protein expression and purification
The SARS-CoV-2 RBD wild-type and its mutants (residues 331-528), were cloned into the pVRC-8400 mammalian expression plasmid, with a Cterminal 6XHis-tag and an intervening HRV-3C protease cleavage site. Plasmid constructs were transfected into HEK293 cells using polyethyleneimine (Polysciences). Cell growths were harvested four days after transfection, and the secreted proteins were purified from supernatant by nickel affinity chromatography using Ni-NTA IMAC Sepharose 6 Fast Flow resin (Cytiva) followed by size exclusion chromatography on a Superdex 200 column (Cytiva) in 10 mM Tris, 150 mM NaCl, pH 7.4.
Surface plasmon resonance SPR binding assays for monomeric ACE2 binding to RBDs were performed using a Biacore T200 biosensor, equipped with a Series S CM5 chip, at 25°C, in a running buffer of 10 mM HEPES pH 7.4, 150 mM NaCl, 0.1 mg/mL BSA and 0.01% (v/ v) Tween-20 at 25°C. Each RBD was captured through its C-terminal his-tag over an anti-his antibody surface, generated using the His-capture kit (Cytiva, MA) according to the instructions of the manufacturer.
During a binding cycle, each RBD was captured over individual flow cells at approximately 250 RU. An anti-his antibody surface was used as a reference flow cell to remove bulk shift changes from the binding signals. Monomeric ACE2 was prepared at six concentrations in running buffer using a three-fold dilution series, ranging from 1.1 to 270 nM. Samples were tested in order of increasing protein concentration, with each series tested in triplicate. Blank buffer cycles were performed by injecting running buffer instead of the analyte, after two ACE2 injections to remove systematic noise from the binding signal. The association and dissociation rates were each monitored for 180 s and 300 s respectively, at 50 lL/min. Bound RBD/Fab complexes were removed using a 10 s pulse of 15 mM H 3 PO 4 at 100 lL/min, thus regenerating the anti-his surface for a new cycle of recapturing of each RBD, followed by a 60 s buffer wash at 100 lL/min. The data was processed and fit to 1:1 interaction model using the Scrubber 2.0 (BioLogic Software). The number in brackets for each kinetic parameter represents the error of the fit.
We have developed an SPR assay to measure the binding kinetics and affinities of interactions between ACE2 and wild-type or mutant RBD with the RBD captured to the chip to avoid compromised binding activity resulting from chemical immobilization or repeated surface harsh regeneration steps during the experiments. The SARS-CoV-2 RBD is a basic molecule with a pI $ 9, so capture to the chip surface will also minimize artifacts such as non-specific interactions between the positively charged RBDs and the negatively charged dextran layer of the sensor chips at physiological pH. Monomeric ACE2 was flowed over as analyte to avoid avidity effects resulting from using dimeric ACE2. Studies that have performed such experiments in both orientations (RBD tethered to the surface vs in solution as an analyte) showed that using RBD as an analyte yielded affinities that were approximately three-fold stronger for the wild type RBD/ACE2 interaction, suggesting the presence of such nonspecific interactions. 24 In our SPR experiments we have determined the K D for wild type RBD binding to monomeric ACE2 is 162.9 nM (Figure S1), consistent with similar K D s reported from other groups that have used similar methodologies to perform their biosensor-based measurements. 21,89,90 Figure S1(A) shows the binding kinetics for interactions of mutant RBD proteins with ACE2 and Figure S1 (B) shows the kinetic parameters along with affinities calculated for each binding interaction, while Table 1 and Figure 3(B) list experimental changes in binding affinities (DDG = RTln(K D(MT) /K D(WT) )) when RBD is mutated. Experimental reproducibility errors in our SPR data (association and dissociation rate constants) is expected to be $15-20% according to previous estimates based on multiple independent measurements 72 ; a 15-20% error corresponds to $0.4 kcal/mol experimental error in the DDG(SPR) values reported in this study. Of 23 single point RBD mutations probed, four mutations were identified as stabilizing: N501Y, Y453F, S477N and N501T (Table 1).

Differences in experimental protocol affecting binding affinity changes
Previously reported experimental binding affinity changes upon RBD mutation of the ACE2/RBD complex span different choice of protein constructs and orientation of molecules used in the binding assays.
The differences in the constructs lie in the choice of monomeric vs. multimeric forms of interacting proteins and selection of protein domain boundaries. Some studies relied on monomeric ACE2 and RBD 21,22,24,25,[27][28][29][30] , whereas others used at least one of the molecule in a multimeric form (trimeric spike, dimeric ACE2 or monomeric ACE2 fused to a dimeric Fc tag). 20,25,26,31,32,64 RBD domain starts at residue 333 and ends at residue 526. It is common that constructs used in studies flank the RBD construct with a few residues before and after the domain boundary (e.g. 331-528 (this study), 331-531, 20 328-531 21,22 ) though some studies use constructs where such flanking regions are too long so they could result in non-specific binding (especially when containing unpaired cysteine residues, e.g. RBD 319-591. 25 Poor selection of protein domain boundaries can affect protein folding/integrity when a construct has incomplete domain sequence (e.g. RBD 343-532 construct is missing a b-strand). 23 Orientation of molecules in the binding experiments (which molecule is tethered to the chip in SPR) can affect both absolute and relative binding affinities (discussed in SPR methods). Studies using a highly positively charged RBD molecule as analyte and ACE2 immobilized on a chip [27][28][29][30][31][32] can be affected by non-specific binding of RBD to the chip. For example, using RBD as analyte in experiments measuring binding affinity of Alpha, Beta, Gamma, and Delta variants 91 results in stronger binding (by 0.4-0.6 kcal/mol) compared to a setup minimizing non-specific binding by immobilizing RBD on a chip. 92 We used the latter as a reference to assess performance of DDG on predicting double mutant effects of the Delta variant (Figure 3 (C)).

DATA AVAILABILITY
Data will be made available on request.

DECLARATION OF COMPETING INTEREST
The authors declare the following financial interests/ personal relationships which may be considered as potential competing interests: 'RAF is a founder of Schö dinger and a consultant; JMS is an employee and BH is a consultant. The FEP calculations are based on Schrö dinger-distributed software.'. FEP; SPR; protein-protein interactions; ddG prediction; receptor binding domain and angiotensin converting enzyme 2