An Electrostatically-steered Conformational Selection Mechanism Promotes SARS-CoV-2 Spike Protein Variation

Graphical abstract


Introduction
The COVID-19 pandemic triggered a worldwide health crisis, claiming over 3.3 million lives within 16 months and causing substantial damage to global economy (https://www.imf.org/en/ Publications/WEO/Issues/2021/03/23/worldeconomic-outlook-april-2021). Although preventive methods such as social distancing, face covering, testing, and tracing can mitigate spread of the virus, only the achievement of herd immunity through vaccination can put an end to the pandemic in a timely way. 1,2 Development of COVID-19 vaccines started as early as January 2020, enabled by rapid breakthroughs in genome sequencing and structural biology, leading to the first mass vaccination programs starting in early December 2020 (https://www.who.int/news-room/ questions-and-answers/item/coronavirus-disease-(covid-19)-vaccines). The majority of the currently available vaccines target the Spike protein (S protein) (https://www.nytimes.com/interactive/ 2020/science/coronavirus-vaccine-tracker.html), which mediates entry into the host cell 3 through diverse mechanisms. 4 Although SARS-CoV-2 sequence diversity is very low, 5 natural selection can lead to the appearance of favorable S protein mutations that confer viral fitness advantages and pose a threat to vaccine effectiveness. 6 The S protein is a highly glycosylated homotrimeric transmembrane protein 7 that enables binding and fusion of the virus with the host cell. 4 Each S protein monomer consists of nearly 1300 residues divided in two furin-cleavable subunits, S1 and S2. 8 Additionally, SARS-CoV-2 S possesses a second cleavage site -S2 0 -for the TMPRSS2 protease, which is needed to mediate membrane fusion. 9 The S1 subunit is responsible for binding to the host cell receptor, the angiotensin converting enzyme 2 (ACE2), 10-12 a zincdependent peptidyl dipeptide hydrolase. 13 The recognition happens via the receptor binding motif (RBM), 14 which is part of the receptor binding domain (RBD). 15 In its metastable prefusion state, the S protein exists predominantly in two distinct conformations, "closed" and "open", that have been structurally resolved through cryo-EM. 16 In the "closed" conformation, the S protein cannot bind to the ACE2 receptor. Thus, to enable the binding, S1 must undergo conformational changes that expose the RBD and bring the protein into the "open" conformation. The S2 is responsible for the fusion of the viral and host cell membranes, a mechanism underlined by a complex series of events. 17 In summary, the S protein evolved to compartmentalize functional aspects across the annotated structural domains, including the receptor binding domain, cleavage sites and fusion-related regions ( Figure 1(a-e)). 18 .
Slight changes in the ACE2 sequence occurring within humans 19 or across species, 20 can have influence not only on disease severity, but also on viral infectivity: large-scale structural modeling efforts targeting ACE2 variants aim to illuminate slight changes in ACE2-S protein interface and comprehend critical interface residues and biophysical properties involved in the recognition. [21][22][23][24] Obviously, not only variations of the ACE2 protein, the cell receptor, but also variations across SARS-CoV-2 proteins, especially the S protein, [25][26][27][28] can have a vast effect on viral infectivity 6,25,27 and severity. 29,30 Although ongoing efforts are connecting structure-function relationships of S protein variants as compared to the Wuhan strain, 26,[31][32][33][34][35] a systematic, large-scale statistical, biophysical, and structural analysis of S protein variants is missing. This is further complicated by the different conformations that S protein acquires before, during and after recognition by the ACE2 receptor.
Here, we systematically analyzed all S protein substitutions at the amino acid residue level. S protein variants were extracted from SARS-CoV-2 sequencing projects deposited in the public database GISAID 36 (Global Initiative on Sharing All Influenza Data) until the 2nd of January 2021. The end date was chosen to select substitutions that occurred before ongoing vaccination campaigns that started in December 2020. Our results show "forbidden" regions for residue substitutions and a common biophysical denominator for their selection, underlined by a "steered" conformational selection mechanism for subsequent biomolecular recognition. 37,38 In this type of recognition, unbound structural ensembles shift their energy distributions towards less favorable energies upon an external trigger to achieve and facilitate protein-protein recognition, lowering energy barriers. 38 Our results have a direct impact on understanding S protein variation as well as on current and future efforts for deriving vaccines and antiviral therapeutics.

Results
Three distinct sequence regions within S2 are strictly conserved.
We downloaded all 311,255 SARS-CoV-2 S protein sequences deposited in the public database GISAID before the 2nd of January 2021 and considered those with more than 95% sequence coverage (N = 276,712). These sequences account for a total of 505,403 amino acid changes, of which nearly half (N = 257,552) corresponded to the D614G "mutation", a substitution that became dominant after July 2020. 6 For each substitution present in the dataset, we calculated its percentage of occurrence (p.occ.) considering all possible substitutions (N = 24,187) ( Figure S1). This analysis revealed that almost every residue across the S protein has been substituted at least once ( Figure S1), except for 44 specific residues (Figure 1 . Residues C738, S746, C749, G757, F759, and F800 do not exhibit variation and are all localized within the UH, which shields the postfusion helical trimer. 39 Interestingly, the role of F800 is unclear and is not resolved in the postfusion structure. 39 HR1 and Heptad Repeat 2 (HR2) mediate membrane fusion 39 and are known to be conserved within SARS-CoV-2 lineages and other CoVs. 18,40 However, we clearly observe that HR2 is undergoing selection pressure, indicated by the absence of fully conserved residues (Figure 1(e)). This contrasts with the HR1 domain, where various residues were not observed to be substituted (Figure 1(e), Figure S2(a)). The adjacent CH domain has 4 residues that were never observed to be substituted, i.e., E988, R995, A1025, K1028. These residues are embedded in the interface of CH and UH, in very close proximity to the UH conserved residues and co-localize at a specific lateral position across the postfusion conformation ( Figure S2(b)). Lastly, CD also includes residues that remain conserved, which are again found near UH and CH, pointing to a critical stabilizing interaction located at the inter-chain interface ( Figure S2(b)). Interestingly, C1082 and C1126, both present in the CD, form a cysteine bridge and were never observed to be substituted ( Figure S2(b-d), Figure S3(a)). Rare to none substitution events occur for the rest of the cysteine bridges ( Figure S3(a)). Glycosylation sites are also very conserved, with the exception of T323 exhibiting slightly higher frequency of substitution ( Figure S3(b)).
Functionally, the localization of the UH, HR1/CH and CD regions also correlates with crucial conformation changes connected to (a) the opening of the RDB; and (b) the prefusion to postfusion extensive reorganization of the S2. For (a), molecular dynamics simulations of the closed and open S protein conformations showed that in the prefusion state appears a cavity which is formed by HR1/CH and CD regions. 41 This cavity formed by domains of the S2 subunit was observed (e) S protein topology diagram with positioning of residues substituted with frequency ! 0.5% (red) and highly conserved residues which have never been mutated (black) on Spike protein domains. SP: signal peptide; NTD: N-terminal domain; RBD: receptor-binding domain; RBM: receptor-binding motif; SD1: subdomain 1; SD2: subdomain 2; UH: upstream helix; FP: fusion-peptide; FPPR: fusion-peptide proximal region; HR1: heptad repeat 1; CH: central helix region; CD: connector domain; HR2: heptad repeat 2; TM: transmembrane region; CT: Cterminal domain. Furin cleavage site (cleavage between S1 and S2 subunits) and S 0 cleavage site are indicated. Numbers correspond to amino acid residues in the protein sequence.
to be more rigid than regions of the S1 subunit 41 and is suggested to allosterically modulate the opening of the RBD domain through a "bouncing spring" mechanism. 41 For (b), conserved residues are initially in proximity to each other in a condensed coiled coil formed by CH-HR1, but after fusion, the coiledcoil is extended ( Figure S2(b)). This is observed also in other coronaviruses, 42 which implies active role of the conserved residues in the conformational rearrangement. These findings are now further corroborated by our conservation analysis, particularly since mutation rates of RNA viruses are up to a million times higher than that of their hosts. 43 Since the virus can escape neutralizing antibodies by just a single amino acid substitution, 44,45 identifying protein regions with extensive, localized conservation is extremely important for designing drugs against SARS-CoV-2 and for vaccine development.

Identification and characterization of frequent S protein variants and residing substitutions
Although these highly conserved sites are critical for drug design, frequent substitutions are also equally important since they enable viral immune and drug escape 46,47 or increase viral fitness. Viral genetic diversity has to be steadily monitored to swiftly adapt vaccine and therapeutic drug development to emerging SARS-CoV-2 variants which appear through accumulation of nucleotide variations. 48 Our frequency analysis shows that most of the substitutions are not persisting, having very low frequency of residue substitution, most being below 0.5% (Figure 2(a)), probably partly due to CoVs proofreading function, 49 and partly because many of the substitutions do not confer substantial fitness advantages and therefore do not persist.   Only 23 residues are substituted with frequency ! 0.5% and are annotated in Figure 2(a, b). Interestingly, 18 of these residues are located on the S1 subunit, formed by residues M1-R685 with 4 residues (N439, Y453, S477, N501) located in the receptor binding motif (RBM) of the RBD (Figure 1 (e), Figure 2(b)). We also observe residue clustering (N = 9) in the NTD (Figure 1(e), Figure 2(b)). One of the mutations of interest, which is placed at the NTD, is A222V. This mutation first appeared in July 2020 50 and was present in over half of the sequences by November 2020 (Figure 2(c)). Despite becoming widespread, it remains mysterious for its structural or functional impact, although 2 reports suggested that it does not affect antigenicity 51 and has minimal impact on viral entry. 50,51 The high frequency substitutions present in S1 should affect the unbound and bound states of the S protein and not the postfusion state. Residues with frequency of substitution ! 0.5% (N = 23) are not only participating in characterized substitutions across lineages, but also in variants not extensively described to date, which comprise a large portion of the retrieved data ( Figure 2(c-e), Figure S3(c), Figure S4). Multiple substitutions per residue are also observed ( Figure S3(c), Table S1). Many identified substitutions influence monoclonal antibody (mAb) escape or alter binding affinity for ACE2, especially those located on the RBD and include: (a) S477N. S477 is located within a flexible loop (residues A475-G485) 52 and exhibits the highest local flexibility. 52 Most often S477 substitutes to asparagine (S477N) 50 and its relative population peaked in July 2020 (p.occ. $27.5%), remaining still high (p.occ. $5.2%) by the end of 2020 (Figure 2 (c)). This substitution was shown to strengthen binding of S protein with the human ACE2 receptor 52,53 and to be resistant against multiple mAbs. 54 Additionally, S477 was observed to mutate to arginine (S477R) and isoleucine (S477I) (Figure 2(e)). As such, S477 substitutions may promote viral transmission 55 and possibly increase affinity to ACE2, 56,57 although the biophysical basis of these effects is unknown and warrants further structural characterization.
(b) Y453F and (c) N439K. Y453F 58 is known as "cluster five", which arose among farmed minks in Denmark and had its frequency peak in November 2020 (p.occ. $1.4%, Figure 2(d)). Y453F is located in the interaction interface with ACE2 and increases binding affinity by 4-fold but does not alter inhibition potency in convalescent sera. 53,58 Both Y453F and N439K may potentially be escape substitutions. 59 (d) N501Y. The most prominent substitution of N501 is N501Y (Figure 2(c)), appearing in the B.1.1.7, B.1.351 and P.1 lineages. This substitution strongly increases binding affinity to ACE2, 53 e.g., for B.1.1.7, a 7-fold affinity increase was measured. 33 This effect can be explained via a destabilization effect caused by N501Y, increasing S protein open state population. 60 Another fre-quent substitution is N501T (Figure 2(e)) which was identified as one of the dominating mutations in minks in the USA. 61 Hydrophobicity as a driving force for S protein variation at the primary structure level We hypothesized that variation in amino acid substitutions across the S protein sequence may be driven by certain physical-chemical properties, since certain physical-chemical rules exist for rationalizing binding affinity of protein-protein interactions. 62,63 To discover possible global effects, we first compared hydrophobicity between the Wuhan strain and 148 amino acid substitutions with a spectrum of measured frequencies. These substitutions were selected based on their Jan-Dec 2020 time series profiles as shown in Figure 2 (c-e) and Figure S4. The overall distribution of hydrophobicity across the S protein shows 3 prominent sets, at low, near-neutral and high hydrophobicity values ( Figure S5(a)). These 3 distinct sets are recapitulated for a subset of the residues from the Wuhan strain that corresponds to the wild-type residues for the selected 148 substitutions (Figure 3 (a)). In short, an obvious shift towards hydrophobicity is detectable upon mutation (Figure 3(a-c)). The shift is underlined by a disappearance of set 1, decrease of set 2, prominent increase of set 3 (Figure 3(a-c)) and is substantial when either all 148 selected residues are considered (Figure 3(a)) or relatively low frequency substitutions (p. occ. < 0.5%) (Figure 3(b)). For the group of high frequency substitutions (p.occ. ! 0.5%) a similar trend is observed, with the prominent appearance of set 3, but shift was not substantial (Figure 3(c)). Certain percentages are calculated for the S protein sequence of the Wuhan strain when considering the type of amino acid ( Figure S5(b)). Percentages are also derived in a similar range for the Wuhan strain residues of the 148 selected residues and their substitututions considered in the different groups ( Figure 3(d-f)). Their mutated equivalents exhibit a substantial increase of non-polar amino acids, reducing the proportion of charged and polar residues. This result holds true for both low-and high-frequency groups of the substitutions (Figure 3 (e-f)). Overall, there is a clear shift toward hydrophobicity for all considered groups of substitutions that points to a possible denominating biophysical effect of specific origin.
Structure-based analysis of S protein variation reveals suboptimal scoring across conformational states for the Wuhan strain Protein variation and selection is underlined by a plethora of factors, 64 some of which can be derived from structural data. 65 Therefore, considering the key observation of the global increase in hydrophobicity across the selected substitutions, we asked what these mutations could bring on a structural level. We constructed 475 structural models of S protein corresponding to all 148 substitutions in 4 known conformational states which are involved in the recognition, namely unbound ("closed" 66,67 (N = 148), "open" 66,67 (N = 148)), bound 68 (N = 148) and postfusion 39 (N = 31) (Figure 1(ad)). As a control, the Wuhan strain S protein was included. We refined all structural models via short molecular dynamics simulations using the HAD-DOCK server 69 and HADDOCK scores, along with its components, were retrieved for all 4 known Results show that distribution of scoring components calculated for the residing interfaces are relatively narrow, and most mutations only change score values by few relative units ( Figure 4, Figure 5, Figure S6). This result shows that an introduced variation in the S protein sequence is not expected to majorly impact the S protein structure, but rather finely regulate interface non-covalent interactions related to HADDOCK scoring components. Interestingly, calculated scores of Wuhan strain S protein in four conformational states most of the times falls within the average values of the calculated distributions ( Figure 4), but in two cases it considerably shifts towards lower or higher values (Figure 4(b, c)). One might argue that slight score shifts are close to noise, but, within these unit shifts, HADDOCK scoring can recapitulate binding affinities of protein-protein interactions within experimental inaccuracies. 62,63,70 This observation for the Wuhan strain S protein indicates that its initial sequence was suboptimal in terms of either stability or conformation if scores are to be interpreted as stability or affinity proxies; this means that additional mutations show better scores, overall stabilizing the unbound states, binding to the human ACE2 host receptor and also its fusion.

Scoring of S protein unbound states with HADDOCK and Rosetta uncovers electrostatics as a robust scoring component
Our observations rely on the HADDOCK score and its components that are not absolute energetic values, but rather indicative of effects of different scoring components that each corresponds to a physical (e.g., van der Waals, electrostatics, desolvation energy) or structural (e.g., buried surface area: BSA ( A 2 )) contribution. To corroborate our analysis, we implemented similar approach using Rosetta. 71 Energy scores for each HADDOCK refined model with single substitution were re-calculated with Rosetta and its own scoring function. As the scores from these two methods are in different scales, rather than comparing the absolute values 1-to-1 we can only compare if they both follow a similar trend (positive, negative or zero slope). A Bayesian-Ridge linear regression model is fitted for every pair of similar type of scores from the two methods. Correlation coefficient R 2 and p-value (for null hypothesis that the slope is zero, using Wald Test with tdistribution) are calculated and quoted in respective subplots ( Figure S7). The results show the strongest correlation between the scores from the two methods for buried surface area (BSA, A 2 ), followed by the electrostatics score for closed and open conformational states ( Figure S7(a, b)). However, for ACE2-bound and postfusion states, the strongest correlation is seen for BSA and desolvation ( Figure S7(c, d)). The results from this assignment indicate that both methods robustly capture the electrostatics, desolvation, and BSA scores in closed and open states corroborating that performed HADDOCK scoring is not biased towards its own scoring.

The electrostatics scoring component of HADDOCK as a common denominator for S protein variation selection
To disentangle possible scoring components that may be important for the studied substitutions, structures were modelled by introducing corresponding substitutions, followed by a refinement procedure as detailed in the Materials and Methods. All models were ranked corresponding to their increasing probability of occurrence, and split into two groups, low-and high-frequency, using a variable percentage of occurrence (p.occ.) value. The p.occ. value was discretely changed between a range of 0.04% p.occ. 2.0% in incremental step size of 0.02%. At every step, the two groups thus formed, were compared with each other, and statistically treated by applying (a) a t-test (Figures S8, S9) and (b) Bayesian methods ( Figure S10). For the Bayesian analysis the groups were formed at slightly broader step size of 0.1% in the range of 0.1% p.occ. Table S7. Interestingly, the HADDOCK score, which is a combination of reported non-covalent forces (van der Waals, desolvation, electrostatics) shows substantial decreased values with increased p.occ. thresholds towards lower p-values, specifically for unbound closed and open states ( Figure S8). This behavior is derived from the electrostatics scoring component. Analysis via the t-test shows that electrostatics scores calculated for the closed and open states (Figure 6(a), Figure S8) exhibit substantial decrease and not those calculated for the bound ( Figure S8) and postfusion states ( Figure S9, Figure S11(a)). In addition, the difference in average electrostatics scoring components between consecutive groups is negative. This translates to higher electrostatics scores in the high-frequency groups, pointing to, possibly, a global, destabilizing effect underlying selection of the substitutions (Figure 6(b)). This effect is not observed when the bound state or the postfusion conformation are taken into account, although increasing destabilizing effects are observed for the bound state ( Figure S11(b)).

2.0%. Number of mutations per step is reported in
Electrostatics scoring components may capture long-range charge interactions in a protein complex. 62,72 Therefore, we calculated all distances between Ca atoms of residues in the models and grouped the substitutions according to their interchain proximity. Results show that residue variation is not localized in the interface but is well-distributed across the S protein, namely inside interfaces and rim regions (d Ca-Ca 15.0 A) and non-interacting surfaces (d Ca-Ca > 15.0 A), see number (N) of substitutions calculated per state (Figure S12-S13). Statistical analysis for each of the classes (N interface , N non-interface ) as function of increased frequency (as performed before, e.g., for Figure S10-S11) shows that substitutions proximal to the interfaces are involved (N interface ), (Figure 6(c), Figure S12), and not the residues localizing further ( Figure S13). It is of note that, again, electrostatics is the single reliable scoring component for the group of residues proximal to the inter-trimeric S protein interface with increasing p.occ. thresholds (Figure 6(c)). To critically assess our observation from the above pvalue trends, we applied a thorough Bayesian p. occ. 2.0% with 0.02% step). Asterisk in (a-c) panels indicate position starting from which the calculated p-values are indicative of a consistent trend (p-value < 0.1). (d-f) Calculated effect size (and 95% high probability density interval, HDI) using Bayesian parameter estimation for closed, open and bound states. The variants were separated into low-and high-frequency groups as function of threshold (0.1% p.occ. 2.0% with 0.1% step). The farther away from 0 the effect size (and the 95% HDI) is, the higher is the effect. The effect sizes (and 95% HDI) for the electrostatics scores for only the closed conformation show significant differences. parameter estimation method to calculate the effect size/s as described in Materials and Methods. Based on the Bayesian analysis, we see that in the closed state the electrostatics score predominantly shows a significant effect size irrespective of the p.occ value. In other words, electrostatics is a significant contributor but only in the closed conformational state of the S protein ( Figure 6(d-f)) and is calculated not to dependent on interface proximity ( Figure S10).
Given the fact that in the high-frequency group (p. occ. ! 0.5%) only around a quarter of the residues (27%) (Figure 3(f)) are charged and the rest are either polar (27%) or non-polar (46%), the global result is striking. Electrostatics is the only consistent contributor that underlies the selection of higher frequency substitutions in a global manner. The difference in electrostatics scores between high and low frequency substitutions, although mild ( Figure S14 and S15), shows a trend for higher electrostatics scores underlying high-frequency substitutions -especially D614G and D1118H. Higher electrostatics scores translate into a global destabilizing effect.
To compare the calculated scores of the models of S protein wild type (WT) and variants (Tables S8-S10) with the scores of experimentally resolved structures of unbound S protein in closed and open conformations and ACE2-bound conformational state we performed HADDOCK refinements of all available S protein structures of the relevant variants (see Materials and Methods section "Refinement of experimental structures"). To assess correlation, the scoring components for the models and experimental structures were plotted against each other ( Figure S16(a-e)). Due to extremely small number of experimentally resolved structures of S protein variants, the number of datapoints is very limited but the highest correlation scores are observed for electrostatics score for ACE2-bound structures ( Figure S16(d, e)) pointing out, once again, the importance of the electrostatics scoring component for this system.

Lineage-related variation exhibits higher electrostatics contributions with destabilizing effects across S-protein conformational states
To further look into the scoring data, substitutions were categorized into two distinct classes, namely those with low (p.occ. < 1.0%) and high (p. occ. ! 1.0%) p.occ. The categorization was performed for 148 considered substitutions ( Figure S14, S15, and S17) and lineages (B.

Examples of decreased electrostatic interactions within S protein inter-trimeric interfaces
To understand the structural effects of electrostatics in closed and open unbound states, substitutions were visualized after refinement and compared to the Wuhan strain S protein trimer (Figure 7(a-d)). As expected, our calculations directly show the loss of an intertrimeric interface salt bridge when D614 is mutated to a G (Figure 7  (a)). In addition, K854 forms a salt bridge with D614 of the neighboring chain. 39 This salt bridge is abolished through the most common D614G mutation which is known to increase infectivity of SARS-CoV-2. 6 K854, in the fusion peptide proximal region (FPPR), supports the closed conformation, 39 further indicating importance of the FPPR in regulating the opening and closing of the RBD.
Cryo-EM data have shown that such an effect leads to a destabilization of the closed state 73 as also shown by the presented calculations ( Figure 6  (a, b, d)). In addition, calculations show a drastic effect on the electrostatics scoring component, being one of the most destabilizing substitutions together with D1118 ( Figure 5). Interestingly, D1118H removes a negative charge from the side chain, therefore removing a formed salt bridge with R1091, and replaces it with a protonatable group (Figure 7(b)). The newly introduced H1118 still interacts with R1091, but the polar interaction formed is weaker due to possible protonation effects of His and, therefore, introduction of similar charge that would lead to severe repulsion.
The A222V variant remains however, structurally unclear and subtle localized electrostatic effects are influenced. This is because when closed or open states of S protein are investigated, the localized interaction network of A222 does not considerably change. However, due to the marginally longer side-chain of V and its higher hydrophobicity, hydrophobic atoms across the localized region tend to come closer as indicated by the presented atom-atom distances of side chains which upon mutation, overall, decrease (Figure 7(c)). This effect may well simulate a localized atom-atom hydrophobic attraction that is translated into decreased polarity, and therefore, electrostatics score in the corresponding calculations ( Figure 5). An interesting outlier is the A570D variant, which provides contradictory results as compared to all other for both open and closed state ( Figure 5). Introduction of D570 is highly stabilizing, owing to a large conformational change of the Asp side chain, where D570 forms a salt bridge with N856, but in the open state forms another salt bridge with K964 (Figure 7(d)). K946 and N856 are 10.5 A apart, indicating that the introduced mutation can alter interaction partners and contribute to localized flexibility and plasticity and, consequently conformational selection of either closed or open state.
To better understand these intriguing results concerning D570 pairing, we performed additional analysis of Ca-Ca distance distributions of both its interactions, namely D570-N856 and D570-K964 in closed and open conformational states, considering that the A570D substitution occurs in all three S protein chains (A, B, C). Analysis of the 20 final structures after the HADDOCK water refinement step shows that the Ca-Ca distance distribution for D570-N856 does not overlap in closed and open conformations (Figure 8(a)). For closed conformation it fluctuates between 7.5 and 9.0 A, whereas in open conformation it spreads between 10.0 and 12.0 A (Figure 8(a)). In the case of D570-K964, the distance distribution in closed (5.0-6.0 A) and open (5.3-6.3 A) conformations majorly overlap (Figure 8(b)). These observations indicate that the FPPR domain which contains the N856 residue is very flexible and is displaced upon opening of the RBD. Interestingly, each distribution for the open conformation, where RBD of chain A is turned upwards, is relatively narrow and has a clear maximum ( Figure  8(a-b)), whereas all distributions of the closed state have many peaks. Therefore, we can assume, based on the calculations and derived distributions that there is a preferred, most populated state in the open conformation but not in the closed one.
Next, we performed molecular dynamics simulations of the A570D S protein variant in closed and open conformational states in triplicates (Figure 9(a, b), Figure S18(a, b)). These MD simulations confirmed the observation that N856 is placed in a very flexible region, since the Ca-Ca distance between D570-N856 fluctuates between 6 and 14 A. On the other hand, the distance between D570-K964 is stable in all simulations over 20 ns and fluctuates in a range between 4 and 7 A. Despite the increased electrostatic component score that we have shown ( Figure 5), the interaction network of D570 changes as function of S protein state, indicating an additional contributor for supporting or inhibiting opening of the RBD.

Discussion
In this work we analyzed all SARS-CoV-2 genomes deposited in GISAID 36 up to January 2021. This date is critical to investigate only substitutions representing vaccine-unhindered virus adaptation. The S protein sequence was specifically investigated and its naturally occurring sequence variation. We have first observed that there are "hot" and "cold" spots for variation and showed that 44 positions across the S protein sequence have never undergone selection pressure during this timeline. Such "cold" spots, which are independent of viral adaptation, are mainly clustered in 3 regions (UH, HR1/CH and CD) of the S2 subunit and are correlated to important structurebased regions for biomolecular folding, function, and recognition. On the contrary, the most frequently substituted residues, the "hot" spots, are mainly located in the S1 subunit and especially in the NTD region. Therefore, the "cold" regions are of uttermost importance for development of therapeutics which might cover a broader spectrum of SARS-CoV-2 variants, whereas monitoring of "hot" spots and their structural characterization is of uttermost importance for vaccine adaptation.
Overall, our primary structure-based analysis showed that the hydrophobicity is a driving force for a selection of low-frequency substitutions. The same trend for selection of more hydrophobic Thus, we observed that all highfrequency substitutions of S protein, which are also included in multiple SARS-CoV-2 lineages, destabilize the electrostatics of the unbound S protein, both in the closed, but also in the open state. We have concluded that a mechanistic model can explain this type of adaptation ( Figure 10). In this model, the ground state of the closed S protein conformation is destabilized upon introduction of substitutions that would be subsequently selected; those that are not selected have variable effects. This destabilization majorly affects S protein interfaces in the open conformation. We hypothesize that such destabilization may raise the free energy of the unbound state, and therefore, a conformational ensemble that could recognize the ACE2 receptor is easier sampled. This mechanism is a reminiscent of a "steered" conformational selection mechanism for biomolecular recognition. 37,38 It is rather surprising that the "steered" conformational selection is majorly and globally underlined by electrostatics and is not strongly influenced by any other type of non-covalent interactions.
Globally, when looking into the structural data in PDB, just few structurally resolved S protein variants are reported; most of the variants included in our study are not experimentally structurally determined (compare N = 476 vs few experimental data in Figure S14). Moreover, according to number of structurally resolved variants in closed and open conformational states (Tables S8-S10 Furthermore, the S protein has not yet been fully resolved at high resolution due to the sheer flexibility and complexity of its structure. Most structures in PDB do not contain the C-ter domains as well as flexible loops of various protein domains. In addition, most experimentally resolved structures contain additional mutations which are stabilizing S protein prefusion conformation. A lot of these limitations are minimized in computational studies of the S protein variants, which makes computational studies where dozens of variants are studied in large scale very valuable for the scientific community. Our conformational selection model for SARS-CoV-2 S protein adaptation ( Figure 10) now rationalizes multiple biochemical and structurebased observations from x-ray and cryo-EM structures. Conformational changes have been observed in the unbound states of the S protein when a variant is introduced (e.g., for D614G, 31 28 ). Although it is arguable that the RBD up/down conformations in the structures that have been solved may be an artifact of the introduced mutations to study the Spike with struc- tural methods, cryo-electron tomography data showed that the different Spike conformations that we studied here, resolved by single-particle cryo-EM, [66][67][68] are relevant in the context of intact SARS-CoV-2 viruses. 75,76 All above-mentioned structural data collectively suggest that introduced mutations may increase the representation of open S protein states, in agreement with the mechanistic model that we present ( Figure 10). Structural data for variants for which their bound states with ACE2 are available indicate intricate and opposing effects on interface stabilization, without pointing to a common denominator. 27,34,77 These differential contributors are also captured in our study (e.g., shown in Figure 5, Figure S6), but our calculations suggest that S protein adaptation mainly occurs in the unbound and not the ACE2-bound ensembles, and, if the size effect is considered, only in the closed state. Conclusively, our mechanistic model on "steered" conformational selection for high-frequency substitutions provides an explanation of a global underlying denominator for SARS-CoV-2 adaptation and might be relevant to describe adaptations of other protein complexes with implications to biotechnology, structure-based design, and deeper understanding of molecular-level biophysical adaptation.

Models of closed, open, ACE2 bound states and postfusion conformation
Initial structures for prefusion S protein conformation in closed and open states were downloaded from the CHARMM-GUI 78 COVID-19 Archive. 79 Out of eight existing structural models for each of the states, which were built as described by Woo et al., 80 the "1_1_1" models were chosen. As an initial structure for the S protein postfusion conformation, the partially resolved cryo-EM structure (PDB ID: 6XRA) 39  Postfusion conformation. The postfusion conformation was optimized using a cryo-EM structure of SARS-CoV-2 S2 in the postfusion conformation solved at 3.00 A average resolution (PDB ID: 6XRA). 39 The partly resolved segments of each chain (residues A1174-E1202) were modelled using the x-ray structure solved at 2.90 A resolution (PDB ID: 6LXT) 15 as a template. The final structure contains residues N703-I770 and T912-E1202.  (Tables S8-S10), prepared in accordance with HADDOCK submission requirements using pdb-tools 82 and subjected to HADDOCK refinement. Thereafter, if there were multiple structures of the same variant (Table S11), the average score was calculated per variant. For some conformational states variants are listed twice in the table (e.g., WT) because some of the expressed proteins are containing S protein prefusion conformation stabilizing mutations (K986P, V987P) 83 or (F817P, A892P, A899P, A942P, K968P, V969P) 84 (Tables S8-S10).
Docking, scoring and MD software strengths and limitations. One should be aware of molecular docking algorithms as well as molecular dynamics simulations strengths as well as weaknesses. Molecular docking is a combination of a conformational sampling algorithm and a scoring function. 85 Conformational sampling algorithms are developed and trained on limited number of samples in a test set, thus depending on the system of study and may vary in performance. 85 There are many different categories of sampling algorithms based on molecular dynamics (MD) simulations or stochastic methods such as Monte Carlo and genetic algorithms. 86 MD simulations are simulating atomic motions using simple approximations based on Newtonian physics. 87 The forces which arise from bonded and non-bonded interactions and are described in the force fields, which are limited by two principal challenges: force fields which require further refinements 88 as well as limitations in com-putational power. It is known that, post-docking refinements provide higher hit rates and higher correlation with experimental data. 89 For example, for HADDDOCK it was shown that solvated docking/refinement do not always improve the docking results but improve the scoring, 90-92 which is highly important for our study. Water refinement procedure of models via HADDOCK is described in the following section.
HADDOCK refinement. HADDOCK (High Ambiguity Driven protein-protein DOCKing) is a semiflexible docking method for biomolecular research which is based on bioinformatical predictions and experimental knowledge of biochemical/biophysical interactions. The HADDOCK approach is using python scripts derived from ARIA. 93 For structure calculations it uses CNS 94 (Crystallography and NMR system). The method can be used not only for docking of biomolecules but also for structure-refinement purposes. 69 HADDOCK uses nonbonded electrostatic and van der Waals energy terms of the OPLS force field with the cutoff distance of 8.5 A from a modified version of the parallhdg5.22.pro parameter file 95 for the evaluation of inter-and intramolecular energies. The usual docking protocol consists of three following stages: a rigid-body energy minimization, three steps of a semi-flexible simulated annealing refinements in torsion angle space and a final refinement in Cartesian space with explicit solvent or DMSO (in our case water refinement was applied). The HAD-DOCK refinement interface, which was utilized in our study, includes only the last stage. 69 Here the protein is solvated in an 8 A shell of TIP3P water molecules. At this stage all MD simulations are performed with 2 fs time step for the integration of the equation of motion. First, the system is heated to 300 K performing 100 steps of MD simulation at 100, 200 and 300 K keeping position restraints (k pos = 5 kcal mol À1 A À2 ) on all atoms. Next, 1250 steps of an MD are performed at 300 K with position restraints (k pos = 1 kcal mol À1 A -2 ) only on heavy atoms. During the final cooling step, only the backbone atoms are restrained. At this step an MD simulation is performed at 300, 200 and 100 K with 500 MD steps at each temperature. 96 The user can decide how many times the initial structure will be refined, in this work we run 20 refinements per model. After each of the refinement a HADDOCK score (HS water ) is derived as a weighted sum of van der Waals (E vdW ), electrostatics (E elec ) and desolvation (E desolv ) scoring components. The resulting value for each scoring component is calculated as an average for the top four best-scoring models with the smallest weighted sum value 69 (https://alcazar.science.uu. nl/services/HADDOCK2.2/). Rosetta calculations. For energetic calculations with Rosetta, the latest weekly release (Rosetta 2021.38) 71 was used. The best ranked HADDOCK refined model was used as input, taking only models of variants with single substitutions. Zinc entries in the.pdb files of ACE2-bound were modified according to Rosetta atom names. The altered. pdb file was scored according to the Rosetta tutorial "Analyzing Interface Quality" (https://www.rosettacommons.org/demos/latest/public/analyzing_inter-face_quality/README, accessed 10. March 2022).
To assess any correlation between the resulting scoring values from Rosetta and HADDOCK, scoring components of HADDOCK output (Electrostatics, vdW, Desolvation, BSA) and Rosetta scores (fa_elec, fa_atr, fa_sol, dSASA_int) are paired up (x,y) ( Figure S7). For each pair (e.g., an attribute from Rosetta vs that corresponding from HADDOCK) a Bayesian ridge linear regression model is fitted using Scikit-Learn library (https://scikit-learn.org/ stable/modules/generated/sklearn.linear_model. BayesianRidge.html#). The R 2 score is calculated per pair (as quoted in the subplots) and is defined as (1-u/v), where 'u' is the residual sum of squares P ((y_true -y_pred) 2 ) and 'v' is the total sum of squares P ((y_true -mean(y_true)) 2 , where, y_true is the actual data and y_pred is the linear regression model value. The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). Moreover, the pvalue (for null hypothesis that the slope is zero, using Wald Test with t-distribution) was calculated and quoted in the subplots ( Figure S7).

Analysis of SARS-CoV-2 S protein sequence variation and substitutions selection
For this work we queried the public database GISAID 36 for SARS-CoV-2 S protein sequences uploaded over period of time starting from July 2020 with the last query on the 2nd of January 2021. From the total number of 311,255 S protein sequences, sequences with more than 95% sequence coverage (276,712) were considered for the subsequent analysis. These sequences were aligned against the reference SARS-CoV-2 spike protein sequence (UniProt 97 P0DTC2) using MAFFT 98 v7.471. Frequency of mutations for each aligned site was calculated ignoring undefined residues (X) and gaps (-).

Computational analysis of scoring components contributions
Scores (HD (HADDOCK) score, electrostatics, van der Waals, desolvation, BSA (buried surface area)) derived from structure-refinement calculations for each of the considered substitution  Table S7). To identify statistically significant contribution of the calculated scoring components, p-value for each step was calculated via unpaired t-test ( Figure S8, S9, Figure S12, S13).
To observe destabilization profiles of the electrostatics scoring component, the only significant contributor (Figure 6(a, c), Figure S8, S9), a mean scoring value for each of the groups (0.04% p.occ. 2.0% with 0.02% step) was calculated, and difference in the electrostatics scoring component (DElectrostatics) between the group of substitutions with higher and lower p.occ. was derived for each step (Figure 6(b), Figure S11 (b)). Since p-values provide evidence for rejecting the underlying H 0 , we consider groups of (a) strong (p-value < 0.01, e.g., in the case of hydrophobicity), (b) moderate (p-value < 0.05), (c) weak evidence or trend (p-value < 0.1) and (d) no evidence (p-value ! 0.1).

Bayesian parameter estimation
We use PyMC3, 106 a probabilistic programming package in Python, that fits Bayesian models using notably Markov chain Monte Carlo (MCMC) methods. To quantitatively assess how different any given two groups of data are from the other, we perform a rigorous Bayesian parameter estimation, using the module -Bayesian Estimation Supersedes the T-test (BEST) under PyMC3 based on Kruschke, 2013. 107 Driven by Bayesian probability, this is a comprehensive and more solid approach than the testing approaches that involve expressing a null hypothesis. Moreover, we estimate the uncertainty associated with the estimated parameter that accounts for our lack of knowledge of the model parameters. For a given (group 1, group 2) data, we calculate two parameters, namely, (a) the effect size, and (b) a high-density probability interval around the effect size. Farther away from 0 the effect size (and the 95% HDI) is, the better. Technical details of the procedure: A student t-distribution is used to describe the attributes of each group. A t-distribution is a robust choice for our data, since it is less sensitive to outliers compared to a normal distribution. A tdistribution is represented by three parameters, mean (l), standard deviation (r), and degrees-ofnormality parameter (m) (which is assumed to be same for both groups). The prior distribution corresponding to each model parameter is set to be a very broad Uniform distribution, of width equal to 10 times the standard derivation. The prior distribution for parameter m is assumed to be a very wide exponential distribution, of mean of 30. These wide prior distributions make the estimates of output posterior distributions less sensitive to the input prior distributions.
The posterior distributions of all parameters are estimated by the process of MCMC sampling. The MCMC process generates a large (up to 100,000) representative sample of credible parameter values that better represents the underlying posterior distribution. Note, the MCMC process generates sample of parameter values and not that of the actual data. For each credible parameter estimate (l1, l2, r1, r2) the effect size if computed as ðl1 À l2Þ= ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi . A distribution of effect size (also of 100,000 samples) is computed along with a 95% credible interval, a High-Density probability Interval (HDI). If the means of two groups are not significantly different, then the effect size would tend towards 0. Therefore, a higher effect size indicates a significant difference in two groups. Unlike a single-point value of p 0.05 in standard t-test, the interpretation of Bayesian estimation is not black-and-white, it uses an entire distribution of parameters for calculating the effect size. The conclusions are probabilistic in nature, and therefore, we observe if the estimated 95% HDI of the distribution of effect size does or does not include 0. Moreover, a Region of Practical Equivalence (ROPE) of À0.1 to 0.1 around the null value (0) is considered as 0, such that, the effect size indicates a significant difference in two groups only if the ROPE is completely outside the 95% HDI.
For every attribute (e.g., HADDOCK score, HADDOCK components (Electrostatics score, vdW score, desolvation, BSA etc.)), the mutations are divided in to two groups (p.occ., %, below and above a threshold), while changing the threshold value incrementally from 0.1% to 2%. For each pair of groups, we calculate a distribution of effect size and plot the mean effect size along with the 95% HDI as a function of p.occ. threshold (%). The process is repeated for closed, open, and bound structures.

Molecular dynamics simulations
The MD simulations of the S protein in "closed" and "open" conformational states were performed using NVIDIA CUDA version 9010 acceleration modules of NAMD 2.13 108 for Linux-x86_64multicore-CUDA employing CHARMM36 109 force field. The long-range electrostatics were treated using the particle-mesh Ewald approach. 110 The S protein included M1-L1141 residues of each chain of HADDOCK refined model of A570D variant. The system was solvated in a water box of TIP3P water (with a water layer of 10 A) and neutralized with 150 mM NaCl, eventually comprising 523,769 atoms in "closed" and 552,368 atoms in "open" conformation. System preparation as well as MD analysis were performed using VMD 1.9.3. 111 Note that lipids and sugars were not included in the simulation because the MD simulations were performed using "hard-restrained" approach where only atoms of protein 15 A around protein residues D570, N856 and K964 in all three chains and water-ion environment were allowed to move without harmonical restrains. On all other atoms of the system was applied harmonic force constant of 5 kcal mol À1 A À2 . Each approach was repeated three times for 10 ns per replicate after observing system equilibration. Prior to each of the MD simulations, two all-atom minimization-relaxation cycles were performed. At the beginning of each cycle, the system was minimized for 10,000 steps. During the first cycle, the water-ion environment and hydrogens were allowed to relax, keeping protein harmonically restrained. Subsequently, the temperature was incrementally changed from 0 to 310 K, relaxing the system for 1 ps per increment of 10 K with a final relaxation for 0.1 ns at 310 K. In the second cycle, side chains within 15 A around (and including) D570, N856, K964 residues in all three S protein chains, as well all hydrogens, water molecules and ions were set free while all other atoms were kept restrained. The system was incrementally heated as described for the first cycle with a final relaxation for 0.1 ns at 310 K. Finally, the system with unrestricted atoms 15 A around (and including) D570, N856, K964 residues in all three S protein chains was incrementally heated as previously described and the MD simulation was performed for 10 ns at 310 K, 1.01325 bar. During MD simulation an integration time step was set to 2 fs applying SHAKE algorithm on all bonds involving hydrogen atoms.

Data availability section
Structure files and associated data of S protein variants in closed, open, ACE2-bound and postfusion conformational states generated in this work have been deposited in SBGrid. 112 Four directories are shared: "S_closed", "S_open", "S_ACE2" and "S_postfus". In each folder an initial .pdb file for each of the conformational states and two subdirectories ("param_index" and "structures") are placed. The "param_index" directory includes two files for each of the considered variants: the results file after the HADDOCK2.2 refinement 69 (.html file) and the parameter file that was used for the structure calculation (.web).
The "structures" directory contains the top scoring refined structure file (.pdb) for each of the variants. The nomenclature of each file in subdirectories "param_index" and "structures" corresponds to DIRNAME_R1NUMR2. DIRNAME stands for the name of the main directory ("S_closed", "S_open", "S_ACE2" and "S_postfus"), R1 stands for the one-letter residue code of the "wild type" residue of the S protein, NUM stands for the residue number according to the Uniprot sequence of SARS-CoV-2 Spike protein and R2 stands for the one-letter residue code of the variant to which the "wild type" residue R1 was changed. Results of the HADDOCK score and its components calculations performed with HADDOCK2.2 for each generated variant are summarized in Tables  S2-S6.

Code availability
All unpublished code and scripts used in this study are available upon request.