How fast can fast-folding proteins autonomously fold in silico?

the experimental values

3 with the corresponding experimental values 12,13 . Those simulated τs were derived by using cutoffs for Cα RMSD (CαRMSD) of 2.5-3.0 Å 13 and 3.622 Å 12 from the experimentally determined native conformations to identify conformations that constitute the native structural ensembles. If CαRMSD cutoffs of <2.0 Å were used, the τs estimated from those simulations would be considerably longer than the experimental values according to the reported sensitivities of the simulated τs to CαRMSD cutoffs 12,13 . For another example, advanced microcanonical MD simulations predicted τs of fast-folding proteins CLN025 1 and Trp-cage 2 to be 600 ns at 343 K and 14 µs at 335 K, respectively 3 . These τs were derived from the microcanonical simulations with the most populated conformations of CLN025 and Trp-cage that have respective stringent CαRMSDs of 1.0 Å and 1.4 Å from the experimentally determined native conformations 3 . Based on the predicted τs at 335 K and 343 K, the simulated τs of CLN025 and Trp-cage at 300 K are conceivably more than 4-10 times longer than the experimental τs (137 ns for CLN025 and 1.4 µs for Trp-cage) at 300 K because the experimental τs reportedly increase as temperature decreases 4,5 . Therefore, how fast can fast-folding proteins fold in silico equates to how accurate protein folding simulations can be. The reported τs to date suggest that fast-folding proteins cannot autonomously fold in silico as fast as they do in experiments and there is an accuracy gap between simulation and experiment for protein folding speed.
To narrow the gap I developed a new protein simulation method that uses uniformly scaled atomic masses to compress or expand MD simulation time for improving configurational sampling efficiency or temporal resolution, respectively 6,14,15 . As explained in Ref. 6 , uniformly reducing all atomic masses of a simulation system by tenfold can compress the simulation time by a factor of 10 and hence improve the configurational sampling efficiency of the low-mass 4 simulations at temperatures of ≤340 K. This method consequently enables miniprotein folding simulations to be performed on personal computers such as Apple Mac Pros under isobaricisothermal conditions under which most experimental folding studies are performed. Also the kinetics of the low-mass simulation system can be converted to the kinetics of the standard-mass simulation system by simply scaling the low-mass time with a factor of 10 . Aided by the lowmass simulation method, I subsequently developed a revised AMBER forcefield-an empirical potential energy function with a set of revised parameters-that has shown improvements in (i) autonomously folding fast-folding proteins, (ii) simulating genuine localized disorders of folded globular proteins, and (iii) refining comparative models of monomeric globular proteins 7,16,17 .
Hereafter the combination of the revised AMBER forcefield with the low-mass simulation method is termed FF12MC 7 .
I also reported the use of the open-source R survival package, which has been widely used in preclinical and clinical studies 18 , to predict τs of fast-folding proteins from their sequences 6,7 . In performing zebrafish toxicology experiments, I observed that even though all fish with nearly the same body weights received an intraperitoneal injection of the same dose of the same toxin, the times-to-death of 20 toxin-treated fish varied widely in each experiment. In some experiments a few fish did not even succumb to the toxin. Yet, the mean times-to-death and their 95% confidence intervals (95%CIs) obtained from the R survival package varied slightly among different experiments. The resemblance of the live and dead states of the zebrafish to the unfolded and folded states of a protein inspired me to use the R survival package to predict τs from sequences as follows 7 : Perform 1) 20-40 distinct and independent MD simulations of a fastfolding protein sequence using FF12MC to obtain 20-40 sets of instantaneous conformations in time, wherein the sequence adopts a fully extended backbone conformation, 2) a cluster analysis of all instantaneous conformations from the 20-40 sets to obtain the average conformation of the largest conformation cluster as the native conformation of the protein, and 3) a survival 5 analysis using the 20-40 sets of instantaneous conformations in time and the average conformation to determine τ and its 95%CI. One advantage of this method is the rigorous estimation of the mean and 95%CI of τ from a set of simulations-a few of which did not capture a folding event. Another advantage is that the τ prediction does not require the assumption that the protein must follow a two-state folding mechanism. By examining a plot of the natural logarithm of the nonnative state population of the protein versus time-to-folding, one can determine the hazard function for the nonnative state population of the protein. If the plot reveals an exponential decay of the nonnative state population over simulation time, then the protein follows a two-state folding mechanism in the simulations.
To determine how fast CLN025 autonomously folds in silico using the methods outlined above, 40 distinct, independent, unrestricted, unbiased, isobaric-isothermal, and 3.16-µs (of the standard-mass time) classical MD simulations of CLN025 were performed at 300 K using FF12MC. A fully extended backbone conformation of CLN025 was used as the initial conformation of the 40 simulations. All simulations described hereafter are 40 distinct, independent, unrestricted, unbiased, isobaric-isothermal, and classical MD simulations with FF12MC using a fully extended backbone conformation as the initial conformation of the protein for the 40 simulations. Also all simulation times described hereafter are the standardmass simulation times. A cluster analysis of the simulations revealed that the average conformation in the largest cluster adopted a β-hairpin conformation (Fig. 1B). This average conformation had CαRMSD of 0.87 Å and Cα and Cβ RMSD (CαβRMSD) of 0.94 Å relative to the average conformation (Fig. 1A) of 20 NMR-determined conformations of CLN025 (PDB ID: 2RVD) 1 . It is worth noting that CαβRMSD includes main-chain and side-chain structural information and is hence more stringent to measure structural similarity than CαRMSD. Using the average conformation of the largest cluster as the predicted native conformation of CLN025, 6 the first time-instant at which CαβRMSD of the full-length CLN025 sequence reached ≤0.98 Å was recorded as an individual folding time for each of the 40 simulations (Table S1A). Using the 40 individual folding times as times-to-folding, a survival analysis predicted the τ of CLN025 to be 198 ns (95%CI = 146-270 ns; n = 40) at 300 K (Table 1). Plotting the natural logarithm of the nonnative state population of CLN025 versus time-to-folding revealed a linear relationship with r 2 of 0.97 (Fig. 2), which indicates that CLN025 follows the two-state folding mechanism. These results agree with the experimental studies showing that the folding of CLN025 follows a twostate folding mechanism with a τ of 137 ns at 300 K, which was obtained from Fig. 6 of Ref. 4 .
To substantiate the agreement between the experimental and computational τs of CLN025 at 300 K, the 40 CLN025 simulations were repeated at 293 K. Using the same CαβRMSD cutoff and the same predicted native conformation, a survival analysis showed that CLN025 followed the two-state folding mechanism (r 2 = 0.94; Fig. 2) with a τ of 279 ns (95%CI = 204-380 ns; n = 40) at 293 K (Table 1) (Table 1). This is consistent with the τ of 2.4 µs at 280 K that was obtained from Fig. 4 of NMR ln Kf in Ref. 5 . Repeating the 40 simulations of TC10b at 300 K, using the same simulation conditions and the same criteria to define the native structural ensemble, revealed 7 the two-state folding mechanism (r 2 = 0.96; Fig. 2) and led to a shortened τ of 0.8 µs (95% CI = 0.6-1.0 µs; n = 40; Table 1), which is also consistent with the experimental τ of 1.4 µs at 300 K 5 .
In the above studies, the average conformation of the largest cluster was used as the predicted native conformation, and a full-length CαβRMSD cutoff of 0.98 Å from the average conformation was used to identify conformations that constitute the native structural ensemble.
The use of the average rather than the representative conformation of the largest cluster may unrealistically shorten the simulated τ, whereas the use of an "overly" stringent CαβRMSD cutoff of 0.98 Å may unrealistically lengthen the simulated τ. To address these concerns, all τs in Table 1 were re-estimated from the same simulation data using both the average and representative conformations with RMSD cutoffs varying from 0.98 Å to 1.40 Å. As apparent from Table S2, the τs of CLN025 and Trp-cage are insensitive to the change from the average to the representative conformation, and these τs are also insensitive to the variation of the CαβRMSD cutoff within 0.98-1.40 Å. Therefore, the concerns about the definition of the native structural ensemble are unwarranted.
In this context, it is evident from the present data that an agreement between the experimental and computational τs was achieved within a factor of 0.69-1.75 (Table 1)

MD simulations. A fast-folding protein in a fully extended backbone conformation was solvated
with the TIP3P water 19 with surrounding counter ions and/or NaCls and then energy-minimized for 100 cycles of steepest-descent minimization followed by 900 cycles of conjugate-gradient minimization to remove close van der Waals contacts using SANDER of AMBER 11 (University of California, San Francisco). The resulting system was heated from 0 to a temperature of 280-300 K at a rate of 10 K/ps under constant temperature and constant volume, then equilibrated for 10 6 timesteps under constant temperature and constant pressure of 1 atm employing isotropic molecule-based scaling, and finally simulated in 40 distinct, independent, unrestricted, unbiased, isobaric-isothermal, and classical MD simulations using PMEMD of AMBER 11 with a periodic boundary condition at 280-300 K and 1 atm. The fully extended backbone conformations (viz., anti-parallel β-strand conformations) were generated by MacPyMOL Version 1.5.0 (Schrödinger LLC, Portland, OR). The numbers of TIP3P waters and surrounding ions, initial solvation box size, and ionizable residues are provided in Table S3. The 40 unique seed numbers for initial velocities of Simulations 1-40 are listed in Table S4  Cluster analysis and data processing. The conformational cluster analyses of CLN025 and TC10b were performed using CPPTRAJ of AmberTools 16 with the average-linkage algorithm 25 , epsilon of 2.0 Å, and root mean square coordinate deviation on all Cα and Cβ atoms (see Table   S5). No energy minimization was performed on the average conformation of any cluster. The linear regression analysis was performed using the PRISM 5 program.   Table   S1A.