Calculation of Protein Folding Thermodynamics Using Molecular Dynamics Simulations

Despite advances in artificial intelligence methods, protein folding remains in many ways an enigma to be solved. Accurate computation of protein folding energetics could help drive fields such as protein and drug design and genetic interpretation. However, the challenge of calculating the state functions governing protein folding from first-principles remains unaddressed. We present here a simple approach that allows us to accurately calculate the energetics of protein folding. It is based on computing the energy of the folded and unfolded states at different temperatures using molecular dynamics simulations. From this, two essential quantities (ΔH and ΔCp) are obtained and used to calculate the conformational stability of the protein (ΔG). With this approach, we have successfully calculated the energetics of two- and three-state proteins, representatives of the major structural classes, as well as small stability differences (ΔΔG) due to changes in solution conditions or variations in an amino acid residue.

the different proteins are shown in Table S2.Only the protonation states for histidine, aspartic acid, glutamic acid, N-ter amine and C-ter carboxyl residues are included.The pKa values used to set the protonation states are only indicated in non-trivial cases.Both the indicated and non-shown pKa values were obtained from the PDB2PQR server 1,2 or, in the case of WT CI2, 3 from NMR measurements.Given the pH values simulated, all basic residues (protonation and pKa values not shown in SI Table 2) were set to 'protonated'.Except for pseudo-lysozyme and nuclease (see footnote f in Table S2), salt concentration was set in the simulated systems to mimic the reported experimental ionic strength (see Table 1 of the main text and Table S2).Those systems for which data reported under more than one experimental ionic strength were available, have been simulated either using one representative ionic strength value or the average.Ionic strengths have been modelled by adding the needed number of Na + and Cl − ions after setting the equivalent salt concentration through the Gromacs 4 tool gmx genion.SI Table 2 also summarizes the protein charge, the number of water molecules and ions, and the box dimensions of the simulated systems.Importantly, solvating conditions for the unfolded ensembles (not shown in Table S2) are also adjusted to match the number of water molecules of their folded partner systems.Except for a small difference in the box dimensions, the solvating conditions for these unfolded ensembles are the same as those used for their folded counterparts (Table S2).
Calculation of thermal stability (∆G unf ) of a three-state holoprotein (derivation of a Gibbs-Helmholtz-like equation).According to the following reaction scheme: the unfolding equilibrium constant, K, of a holoprotein that binds a cofactor (or ligand) at a single binding site and whose apoprotein unfolds through a three-state mechanism (as it is the case of Anabaena holoFld) is: where K 1 and K 2 are the equilibrium constants of the first (native to intermediate) and second (intermediate to unfolded) unfolding steps, respectively; K b is the binding constant of the complex formed between the native (N) protein and the cofactor or ligand ( , and the terms between brackets ) are the concentrations of free species in the equilibria.
Hence, the Gibbs free-energy change of the unfolding reaction can be written as: where in the case of Anabaena Fld is the overall stability of the three-state apoprotein, ∆G 0 apo(unf) ()   and and are the Gibbs free-energies of the partial unfolding steps, which ∆G 0 (Fol→Int) () ∆G 0 (Int→Unf) () can be described through the Gibbs-Helmholtz equation (eq. 1 of the main text).

Wherever
SI eq. 1 can be approximated by: Binding constants are known to change with temperature 5 and the van't Hoff Equation 6 can be used to estimate their values from that at a reference temperature (e.g.298.15 K): (3).    () Combining SI eq. 2 and SI eq.3: (4), which, at 1M of free ligand concentration becomes: SI eq.S5 provides the temperature dependence of the standard Gibbs free-energy change of the holoprotein.It should be indicated that the van't Hoff equation cannot capture changes in the value of K b due to changes in the conformation of the native protein that may take place as the temperature increases.Therefore, we will assume here that SI eq.S5 may not hold above the first mid denaturation temperature of the apoprotein.Total: 3.0 or 3.1 ns c a MD setup done with Gromacs package (v2020) 4 .b MD step designed to re-accommodate the water molecules inside the solvating box after having removed a few of them so that they match in number those in the reference system(s) (see footnote h in Table S2).

Supporting Information Tables
c Total simulated time along all the steps.3.1 ns were simulated when the 'volume-adjustment' step was needed.3.0 ns were simulated otherwise.1) are lower than those obtained after neutralizing the system prior to simulation so no extra ions have been added after neutralization.For nuclease, the unfolding thermodynamics measurements have been described to be independent of IS 8 and the experimental ISs used in the reference experimental works vary in a large range, so no extra ions have been added after neutralization.Individual and averaged ISs used in the experimental measurements along with the papers of reference are given in Table 1 of the main text.g Number of ions added by Gromacs gmx genion tool after setting the required salt concentration.h Number of water molecules finally simulated in the systems.(*): systems where the number of water molecules was initially higher (when setting the box diameter) but it was reduced to match the number of waters in the corresponding reference system(s).Example: to calculate the unfolding energetics of holoFld, two systems were involved: 1) boxes with folded holoFld and 2) boxes with unfolded Fld (apo) plus FMN unbound (see Methods and Figure 4a in the manuscript).In this case, the number of water molecules in the systems with unfolded apoFld plus FMN is initially set bigger than the numbers of this molecule previously settled in the folded holoFld system (the reference here).Then, as required in this protocol (based on calculation by difference of thermodynamics of unfolding states) a few water molecules have to be removed to equate their number in the reference system.In these cases, a 'volume-adjustment' step (indicated as 'vol.adj.' in the 'Box volume' column) is introduced in the preparation phase before the heating step (Table S1).
i Truncated rhombic dodecahedron boxes.Dimensions defined in each case by the unfolded structure with maximum diameter among those in the filtered unfolded ensemble: a minimum distance of 1 nm was left between the farthest atom of the unfolded protein from the center and the box edges.(*): Initial diameter settled before removing a few water molecules and adjusting the box volume.f Average radius of gyration of the proteins (for all the simulated replicas per system) over time in the 2ns-production trajectories run, namely: t 0/4 at the beginning of the trajectory (t=0 ns), t 1/4 at the end of the first quartile of the trajectory (t=0.5 ns), t 2/4 at the middle of the trajectory (t=1.0 ns), t 3/4 at the end of the third quartile of the trajectory (t=1.5 ns), and t 4/4 at the end of the trajectory (t=2.0 ns).g P-value of an ANOVA statistical test performed to compare the protein Rg means at the checkpoints indicated.Analysis done for assessing the Rg evolution over time in relation with the known issue of structure overcompaction reported 11,12 for force fields like Charmm22-CMAP 13 .h Significant P-values (for a 95 % of confidence) indicated.These values only appear in some systems of the apoFld intermediate state.In this state, the apoFld structure simulated (PDB 2KQU 10 ) presents disordered regions -appearing quite stretched in the NMR models− that slightly shrink at the beginning of the simulation towards the central structured nucleus albeit without forming new secondary structures nor compacting the already folded structure of the protein.The protein species fraction (χ i ) vs. T plot is included as an inset in f.The colour-coding is indicated in the legends of the panels.

Supporting Information References
( j 83-231 (renumbered 1-149) C-ter fragment of nuclease as in PDB ID 2SNS (see Methods in the main text).k Truncated CI2 structures lacking the first 19 amino acid residues.l Lysozyme variant wherein cysteine residues at positions 54 and 97 are replaced by a threonine and an alanine, respectively.

Figure S1 .
Figure S1.Simplified MD-based scheme and comparison with experimental results for nuclease.a) Protein models and folding states, number of structures (unfolded) and replicas (folded) simulated, diameter cutoff used to filter too-elongated unfolded structures obtained from ProtSA 25 (left, see also Figure S5), and temperatures selected for the MD-based calculation (Charmm22-CMAP) of nuclease thermodynamics at pH 7.0, 5.0 and 4.1.b-d) Global stability curves (∆G unf (T)), thermograms (Excess Cp + χ unf .×∆Cp vs. T), and protein molar fractions (χ i ) vs. T (computed vs. experimental), respectively, obtained for nuclease simulated at pH 7.0.e and f) Global stability curves and thermograms (computed vs. experimental), respectively, obtained for nuclease simulated at pH 5.0.g and h) Global stability curves and thermograms (computed vs. experimental), respectively, obtained for nuclease simulated at pH 4.1.The ∆H unf vs. T linear plot for pH ~5.0 (and ~4.1) appears as an inset in e (and g).The protein species fraction (χ i ) vs. T plot for pH 5.0 (and 4.1) is included as an inset in f (and h).The colour-coding is indicated in the legends of the panels.

Figure S2 .
Figure S2.Simplified MD-based scheme and comparison with experimental results for CI2.a)Protein models and folding states, number of structures (unfolded) and replicas (folded) simulated, diameter cutoff used to filter too-elongated unfolded structures obtained from ProtSA 25 (left, see also Figure S5), and temperatures selected for the MD-based calculation (Charmm22-CMAP) of CI2 thermodynamics.b-d) Global stability curves (∆G unf (T)), thermograms (Excess Cp + χ unf .×∆Cp vs.T), and protein molar fractions (χ i ) vs. T, respectively (in silico vs. experimental), obtained for WT CI2 at pH 3.0.Inset in b depicts the ∆H unf vs. T linear plot with the fitted equation (the slope being ∆Cp unf ) obtained from the MD simulations.e and f) Global stability curves and thermograms (computed vs. experimental), respectively, obtained for WT CI2 at pH 6.3.The ∆H unf vs. T linear plot appears as an inset in e and the protein species fraction (χ i ) vs. T plot is included as an inset in f. g and h) Global stability curves and thermograms (computed vs. experimental), respectively, obtained for Ile76Ala at pH 3.0.Insets in g and h similar to those in e and f, respectively.The colour-coding is indicated in the legends of the panels.

Figure S3 .
Figure S3.Simplified MD-based scheme and comparison with experimental results for lysozyme.a) The protein models and folding states, the number of structures (unfolded) and replicas (folded) simulated, the diameter cutoff used to filter too-elongated unfolded structures obtained from ProtSA 25 (left, see also Figure S5), and temperatures selected for the MD-based calculation (Charmm22-CMAP) of thermodynamics of WT lysozyme and the variant Ile3Glu at pH 2.4.b-d) Global stability curves (∆G unf (T)), thermograms (Excess Cp + χ unf .×∆Cp vs. T), and protein molar fractions (χ i ) vs. T (in silico vs. experimental), respectively, obtained for WT lysozyme at pH 2.4.Inset in b depicts the ∆H unf vs. T linear plot with the fitted equation (the slope being ∆Cp unf ) obtained from the MD simulations.e and f) Global stability curves (∆G unf (T)) and thermograms (in silico vs. experimental), respectively, obtained for Ile3Glu lysozyme at pH 2.4.Inset in e similar to that in b.

Figure S4 .
Figure S4.Simplified MD-based scheme and comparison with experimental results for pseudolysozyme.a) protein models and folding states, number of structures (unfolded) and replicas (folded) simulated, diameter cutoff used to filter too-elongated unfolded structures obtained from ProtSA 25 (left, see also Figure S5), and temperatures selected for the MD-based calculation (Charmm22-CMAP) of pseudo-WT lysozyme thermodynamics.b-d) Global stability curves (∆G unf (T)), thermograms (Excess Cp + χ unf .×∆Cp vs. T), and protein molar fractions (χ i ) vs. T (computed vs. experimental), respectively, obtained for pseudo-WT lysozyme at pH 3.0.Inset in b depicts the ∆H unf vs. T linear plot with the fitted equation (the slope being ∆Cp unf ) obtained from the MD simulations.e and f) Global stability curves (∆G unf (T)) and thermograms (computed vs. experimental), respectively, obtained for pseudo-WT lysozyme at pH 3.7.Inset in e similar to that in b.The protein species fraction (χ i ) vs. T plot is included as an inset in f.The colour-coding is indicated in the legends of the panels.

Figure S5 .
Figure S5.Diameter distributions and cutoffs applied to filter out too elongated structures in the unfolded ensembles generated by ProtSA.Scatter and density plots obtained before (left) and after (right) diameter filtering for a) Barnase, b) Nuclease c) WT CI2, d) Ile76Ala CI2) e) WT lysozyme, f) Ile3Glu lysozyme, g) pseudo-WT lysozyme and h) apoFld.Maximum diameter (applied cutoff) kept for the unfolded ensembles are the maximum values in the depicted x-axis scale in plots at the righthand side (indicated in the top right-hand corner of the figure).

Figure S6 .
Figure S6.∆H unf vs. temperature plot obtained for barnase (pH 4.1, see Table S2) from six temperatures simulated over a range of 100 ºC.Linear and non-linear (2 nd order polynomial) fittings are depicted.The corresponding equations and the squared Pearson correlation coefficients (R 2 ) are shown.The standard error of computed ∆H unf at every temperature are indicated as error bars.

Table S2. Continuation… System b pH pKa(s) c Protonation State d Protein Charge e Ioinic Strength f (mM) No. Ions g No. Waters h Box Diameter i (nm)
Ionic strength (IS) established in the systems (by adding Na + and Cl − ions) based on the values used in the corresponding experimental determination of the thermal unfolding parameters.For pseudo-lysozyme, the experimental IS (see Table *:3.1±0.0 / Asp65*:3.6±0.1 / Asp72*:2.5±0.1 / Glu79:4.8 / N-ter:7.9 / C-ter:1.4 a Solvating conditions for systems simulated using Charmm22-CMAP and Tip3p.Same conditions are used to simulate folded and unfolded systems, and for systems simulated with Amber99SB-ILDN and Tip3p.Note: in order to have the same charge and, therefore, the same number of ions and water molecules in folded and unfolded systems, identical pKa values are assumed for the folded and the unfolded state even if it is not strictly the case.bProteins or cofactors (FMN) simulated in this work.(†):systems which have been additionally simulated with Amber99SB-ILDN.cpKavalues for the native protein (folded) either obtained from the PDB2PQR (PROPKA 3.0 7 ) server 1,2 (no errors given) or reported by Tan et al.3(indicated with an asterisk (*) next to the residue name).d Protonation states resulted from the simulated pH and pKa values (equal for folded and unfolded states, see the Note in footnote a). e Protein charge after setting protonation.f

Table S3 . Radius of gyration (Rg) at selected times over the simulated 2-ns trajectories, and statistical comparison
9rotein or variant simulated.When a protein has been simulated at several pH values, the pH of the simulation is indicated between parentheses (seeTable S2).b Force field used in the simulation.In simulations of holoFld the FMN parameterization used is indicated (see section Methods of the main text).cFoldingstate (folded, intermediate or unfolded) of the simulated systems.dTemperatureat which the trajectories where the Rg is analyzed were run.eRadius of gyration of the crystal structure (see PDB IDs in the main text), average Rg of the initial structures selected from the unfolded ensembles generated by ProtSA9, or average Rg of the NMR ensemble (PDB 2KQU 10 ) representing the apoFld intermediate state. a 1) Dolinsky, T. J.; Nielsen, J. E.; McCammon, J. A.; Baker, N. A. PDB2PQR: An Automated Pipeline for the Setup of Poisson-Boltzmann Electrostatics Calculations.Nucleic Acids Res.