The role of electrostatics in enzymes: do biomolecular force fields reflect protein electric fields?

Preorganization of large, directionally oriented, electric fields inside protein active sites has been proposed as a crucial contributor to catalytic mechanism in many enzymes, and may be efficiently investigated at the atomistic level with molecular dynamics simulations. Here we evaluate the ability of the AMOEBA polarizable force field, as well as the additive Amber ff14SB and Charmm C36m models, to describe the electric fields present inside the active site of the peptidyl-prolyl isomerase cyclophilin A. We compare the molecular mechanical electric fields to those calculated with a fully first principles quantum mechanical (QM) representation of the protein, solvent, and ions, and find that AMOEBA consistently shows far greater correlation with the QM electric fields than either of the additive force fields tested. Catalytically-relevant fields calculated with AMOEBA were typically smaller than those observed with additive potentials, but were generally consistent with an electrostatically-driven mechanism for catalysis. Our results highlight the accuracy and the potential advantages of using polarizable force fields in systems where accurate electrostatics may be crucial for providing mechanistic insights.


Introduction
Over the past thirty years, molecular simulations have become an integral and routine part of understanding biological processes at the molecular level. Advances in both CPU and GPU hardware, and more reliable and sustainable software packages, have led to both wider uptake and greater numbers of applications across multiple fields. 1 Microsecond-length simulations are now routine, allowing exploration of biomolecular dynamics on timescales relevant for a variety of biological processes, whether through conventional simulation, enhanced sampling, or kinetic modeling of the underlying process.
Simultaneously, the major families of biomolecular force fields have undergone multiple rounds of development and improvement. [2][3][4][5][6][7] For any force field, this reparametrization process requires significant time and resource investment to identify errors, develop corrections, and validate the accuracy of the new parameters. The functional form of each force field, which provides the framework for parameter optimization, has therefore remained relatively unchanged and similar between families for decades. 8,9 With small variations between force field families, each typically features a simple harmonic representation of bonds and angles, a Fourier series for proper dihedrals, a Lennard-Jones model of van der Waals (vdW) interactions, and most importantly, an additive, fixed-point-charge representation of electrostatic interactions. [10][11][12] The philosophy for parametrization of these fixed-point-charges varies between families and has seen substantial modification between force field iterations, but generally entails a significant focus on fitting to either molecular electrostatic potentials (for example of sidechain fragments), interaction energies, or derived thermodynamic properties. [10][11][12][13][14] For a given biomolecule, these electrostatic and vdW parameters are then relied upon to accurately recreate interatomic interactions in all possible environments.
Given the diversity of biomolecular environments and the relative simplicity of common potential functions it is perhaps surprising that biomolecular force fields have seen such broad and successful applications. 1,[15][16][17] However, the latest additive force fields feature a delicate balance of force field terms built up over decades, that continue to evolve to be usefully applied in new contexts. Where force fields have failed, parameter improvements have often focused on tweaks to backbone and sidechain torsional potentials, 6,18 but have also included changes to the Lennard-Jones potential, 19 pair-specific corrections to nonbonded interactions in defined environments, 20 adaptations of parametrization protocols to develop implicitly polarized partial charges, 21,22 and even application-specific water models. 23 Nevertheless, fixed, atom-centered partial charges suffer from known deficiencies in their representation of intermolecular interactions. 24 First, molecular electrostatic potentials, often the main target for fitting electrostatic parameters, are more accurately recreated if higher order, atom-centered multipoles are included in the fitted potential function. 25,26 Second, additive models do not include an explicit representation of induction, and may therefore represent electrostatics poorly for molecular conformations, functional groups, or environments not well-validated in the parametrization process. [27][28][29][30] This lack of transferability may pose challenges in the use of force fields for new, untested, applications.
Alternative, more advanced, models of electrostatics have been implemented in new families of biomolecular force fields, 31,32 of which the Charmm Drude 33 and AMOEBA force fields 34 are amongst the most extensively developed. Both include explicit representation of mutual polarization, and AMOEBA also incorporates a classical multipole representation of fixed electrostatics. These potential functions have been repeatedly shown to improve representation of electrostatic properties around biologically relevant moieties, and can improve the reliability and convergence properties of hybrid quantum mechanical/molecular mechanical (QM/MM) studies. [35][36][37][38][39][40][41][42][43][44] Nevertheless, the accuracy advantages of polarizable force fields in the context of real protein environments are still poorly understood, and should be explored in systems where quantitatively accurate intramolecular electrostatics may be crucial to provide mechanistic insight. Biological catalysis is a particular focus in this regard as, in both naturallyoccurring and designed systems, enzyme active sites may pre-orient substrates and stabilize reaction transition states with focused electric fields or other electrostatic motifs. 45,46 Although studies of the enzyme-catalyzed chemical reactions require QM/MM techniques, the reactant and product states can be studied efficiently with purely MM models, provided the underlying electrostatic model is sufficiently accurate to provide useful information on the active site environment.
Given this interest, here we test the ability of biomolecular force fields to reproduce the electric fields inside a whole protein and its associated solvent and ionic environment, using the peptidyl-prolyl isomerase cyclophilin A (CypA) as a model system. We choose to evaluate the accuracy of the AMOEBA polarizable force field, as it has recently been promisingly used in proof-of-concept work for enzyme design. 47,48 Although ultimately successful in redesigning active site interactions, the additional computational cost of AMOEBA has prevented lengthy studies of active site dynamics, and there has been little assessment so far of the physical accuracy of AMOEBA electric fields within the folded protein environment over typical simulation timescales.
Assessing this physical accuracy by comparison to experiment has substantial difficulties. Biophysical techniques for measuring intramolecular electric fields are technically challenging and may require the introduction of unnatural amino acids to act as vibrational probes. 49 Instead, we evaluate the accuracy of AMOEBA electrostatics by comparison to a fully quantum mechanical representation, calculated using the ONETEP linear-scaling density functional theory (DFT) software. 50 ONETEP is based on a reformulation of DFT which takes advantage of the locality of electronic structure to allow DFT calculations with computational effort that increases linearly with the number of atoms, as compared to conventional DFT approaches where the computational effort increases with the third power. As a result, ONETEP can perform calculations on many thousands of atoms, such as entire proteins in solvent, as we do here. A unique characteristic of ONETEP is that it retains the full near-complete basis set accuracy of conventional DFT by in situ optimisation of Non-orthogonal Generalised Wannier Functions 51 (NGWFs) which are expressed in terms of a periodic sinc (psinc) basis set which is equivalent to plane waves. The ONETEP program has been developed to run on parallel computers, using hybrid MPI-OpenMP parallelism. 52 To test the ability of biomolecular force fields to reproduce the electric fields derived from DFT we first perform ns-length AMOEBA molecular dynamics (MD) simulations of CypA to generate structural ensembles for wild-type (WT) and mutant systems. We then calculate electric fields at the reaction site using AMOEBA, 53 and the widely used Amber ff14SB 2 and Charmm C36m 3 additive protein force fields.
Each set of electric fields is then compared with those calculated at the DFT level with ONETEP. We find that AMOEBA electric fields show far greater correlation to DFT than those of additive potentials, and with smaller systematic error. Our results highlight the utility of polarizable potentials in biomolecular environments where the accurate reproduction of electrostatics is crucial. If mechanistic insights are desired, recent high-performance software implementations obviate the timescale limitations of standard polarizable simulations to some extent, 54-56 but cross-validation with experimental data would remain highly desirable to complement the computational predictions.

CypA system preparation
The initial structures for simulated systems were taken from crystal structures of wild-type human CypA bound to HIV capsid protein, PDB entries 1M9C and 1M9Y. 57 In both cases, chain A was taken as the structure of the CypA receptor, and the substrate was trimmed to a six-residue model peptide, of sequence HAGPIA from PDB 1M9C (with the G-P peptide bond in the trans configuration), or AAAPIA from PDB 1M9Y (with the A-P peptide bond in the cis configuration). Substrate N and C termini were not capped and peptides were modeled in their zwitterionic states. Basic and acidic residues were modeled in their ionized states and histidine residues modeled in their neutral, Nd-protonated form. Finally, R55A mutant structures with equivalent cis-and trans-proline substrates were created by manually truncating the Arg55 residue in the crystal structures to the Cb atom.

CypA MD simulations
Simulations of both WT and R55A mutant CypA, bound to cis-and trans-proline substrates, were performed in triplicate using the AMOEBA polarizable force field. A control simulation of WT CypA bound to the cis-proline substrate was also performed with the Amber ff14SB additive force field. See Table S1 for a list of all simulations performed. Initial protein crystal structures were protonated and solvated using the tleap module of the Amber14 package. 58 CypA-peptide complexes were neutralized with Clanions, solvated in an approximately 63 x 65 x 62 Å periodic box of water such that no solute atom was less than 8.0 Å from the box edge, and Na + /Clions added to create a 150 mM NaCl ionic atmosphere. The simulation of the WT CypA cis-proline system performed with the Amber additive force field used the same starting configuration as the first AMOEBA replicate, except ion positions in bulk solvent were randomized. In total, system sizes varied from 17727 atoms (R55A trans-proline system), to 19030 atoms (R55A cis-proline system).
AMOEBA simulations used the AMOEBA 2013 force field for protein and ions, and the AMOEBA 2003 water model. 53,59 Simulations were performed in triplicate, with unique random seeds for the thermostat and barostat. Systems were energy minimized for 2500 steps with a steepest descent algorithm, then heated to 300 K over 50 ps in the NVT ensemble, followed by equilibration to 1.0 bar over 100 ps in the NPT ensemble. Production simulations were then performed for a total of 25 ns each, with trajectory frames saved at 10 ps intervals for a total of 2500 frames per trajectory. An Andersen thermostat and Monte Carlo barostat were used to maintain temperature at 300 K and pressure at 1 bar throughout. A velocity Verlet integrator with a 1 fs timestep was used in all simulations. Long-range electrostatic interactions were treated with a Particle Mesh Ewald summation with a 7.0 Å real-space cutoff. Van der Waals interactions were subject to a 9 Å cutoff with an analytical long-range correction. Induced dipole convergence was set to 10 -5 D/atom. All MD simulations were performed with the Tinker-OpenMM plugin, using Tinker version 7.1, and OpenMM version 6.3. 55,60 The WT cis-proline Amber simulation used the ff14SB force field for protein, TIP3P model for water and Joung and Cheatham parameters for ions. 2,61,62 The system was energy minimized for 5000 steps with a steepest descent algorithm, followed by 5000 steps with a conjugate gradient algorithm. Next, the system was heated to 300 K over 50 ps in the NVT ensemble, equilibrated to 1.0 bar over 100 ps in the NPT ensemble, and simulated for a further 25 ns under NPT conditions, with trajectory frames saved every 10 ps for a total of 2500 per trajectory. A Langevin thermostat and Berendsen barostat were used to maintain temperature and pressure. A 2 fs timestep was used throughout, and the SHAKE algorithm used to constrain all bonds involving hydrogen. 63 Long-range electrostatic interactions were treated with a Particle Mesh Ewald summation with an 8.0 Å real-space cutoff, while van der Waals interactions were calculated with an 8 Å cutoff and long-range analytical correction. MD simulations were performed with the pmemd.cuda module of the Amber14 software. 58

MM field calculations
Using the 2500 frames extracted from each trajectory, electric fields were calculated using the AMOEBA 2013, Amber ff14SB and Charmm C36m force fields. The electric fields experienced by the peptide substrate were evaluated by calculating the environmental field, "⃗ $%& , following the approach of Fried & Wang: 64 "⃗ $%& = "⃗ ()*+,$-− "⃗ ,/01%2 (1) Here "⃗ ()*+,$-and "⃗ ,/01%2 refer to the electric field at a given site in the fully solvated protein-ligand complex, and in a ligand-only system stripped of receptor and solvent, respectively. Fields were first calculated at the two atoms of the isomerized peptide bond -the proline N atom and the carbonyl C atom of the preceding residue ( Fig. 1). Atom-centered fields were then linearly interpolated to provide the field experienced at the bond midpoint. After AMOEBA simulations, the electric fields "⃗ ()*+,$-at the C and N atoms were calculated directly for each trajectory frame using the instantaneous induced atomic dipole at each atomic site, ⃗ /%2 , and the relevant atomic polarizability taken from the AMOEBA force field, : For additive force fields the electric field was calculated by dividing the electrostatic force, ⃗ $,$ , exerted upon the desired atom, by its partial charge, : For all force fields, post-processing of trajectory frames to remove the CypA receptor, solvent and ion coordinates allowed the equivalent calculation of electric fields in the ligand-only system, "⃗ ,/01%2 . Finally, the electric field arising from the protein/solvent/ionic atmosphere of the system and acting on the rotatable amide bond, "⃗ $%& , was calculated via Eqn. 1.
For each trajectory structure, x, y, and z field components at the substrate proline N and preceding C atom were evaluated directly in the Cartesian coordinate frame. The field at each atom was also projected along a vector perpendicular to the plane of the substrate proline ring, calculated individually for each frame as the unit-length cross product of the N-Ca and N-Cd bond vectors. This vector approximates the proposed orientation of the peptidyl carbonyl group at the cis-trans transition state. 65 The projection of the substrate N and C field vectors along the same direction therefore estimates the magnitude of the environment electric field aligned with the peptidyl carbonyl group during the peptide bond rotation.
Finally, N and C atom projections were averaged to interpolate the field magnitude at the bond midpoint.
For the additive models, electric fields were calculated using an in-house python script making use of the Amber sander python API. For the AMOEBA model, fields were calculated using a separate in-house python script, linking to the analyze module of Tinker 7.1.
Finally, ten frames were extracted at 2.5 ns intervals from the first AMOEBA simulation performed for each of the four CypA-substrate systems, for comparison with ONETEP DFT fields.

ONETEP field calculations
All DFT calculations in this work were performed with the ONETEP linear-scaling DFT package in the normconserving pseudopotential approximation. The PBE exchange-correlation functional was used. A minimal in situ optimized NGWF basis was employed, with an 8.0 a0 (≈4.23 Å) localization radius. Density kernel truncation was not applied.
For consistency with MM calculations, periodic boundary conditions were employed throughout. The MM frames were not truncated in any way. A plane-wave kinetic energy (k. e.) cutoff of 800 eV was assumed.
Given that the dimensions of the simulation cell varied slightly between MM frames (due to the use of NPT conditions), and that the cell edges need to be divisible into an integer number of grid points, the actual k. e. cutoffs were between 798 and 832 eV. The resultant subtle difference in basis set quality between frames was neglected.
To calculate electric fields, we first calculated electrostatic potentials on a Cartesian grid with a spacing of about 0.13 Å (a so-called double grid, with a spacing sufficiently fine to represent densities). Two separate components to the potentials were considered -one due to the valence-electronic pseudodensity (Hartree potential), and one due to the local pseudopotential. Both components were obtained in reciprocal space using Fast Fourier Transforms (FFTs) using standard ONETEP methodology. Exchange and correlation potentials, representing non-classical effects, were excluded from analysis. Electric fields were obtained from the potential in reciprocal space, where the gradient operator is simply ⃗: and subsequently transformed back to real space via an inverse FFT. Final electric fields at arbitrary points in space were obtained via trilinear interpolation from the nearest grid points.
Quantities that underwent FFT-processing on a double grid suffer from a small amount of ringing artifacts with a period of the original (single) grid spacing. While the magnitude of the ringing is small, we are investigating sums of two components to the electric field (Hartree and pseudopotential) that have opposite signs and similar magnitudes, thus cancelling out to a large degree. This magnifies relative errors in the total electric field. To improve the accuracy of the calculated fields we smoothed the field values on the grid using a 27-point 3D stencil, where the stencil weights were given by B C 2 EF G , where B is the taxicab distance of a stencil point from the middle of the stencil. This procedure was found to efficiently attenuate the ringing artifacts.
The use of Cartesian grids in the DFT calculations with localised orbitals and in subsequent processing is expected to subtly break rotational invariance (by distinguishing certain spatial directions corresponding to grid edges) and translational invariance (so-called eggbox effect). These effects are minimised in ONETEP where the local orbitals (NGWFs) are optimised in situ, but they are not completely eliminated. 66 To alleviate any concerns over whether these have measurable effect on the obtained electric fields, we performed six additional calculations for one of the MM frames, where the systems were variously rotated and translated relative to the grid. Typical error magnitudes were found to be in the order of 0.1 MV/cm, with a maximum below 0.4 MV/cm, making them negligible in our analysis.

Data availability
All underlying data used for this study is made freely available (DOI: 10.5281/zenodo.3678278), including the simulation trajectories, calculated electric fields, and analysis code for calculating fields from AMOEBA and fixed-charge simulations. The code and underlying data used to create figures is also available in this repository.

CypA as an exemplar for electrostatically-driven enzyme mechanisms
The peptidyl-prolyl isomerase CypA catalyzes the cis/trans isomerization of the amide preceding proline residues in proteins, inducing a structural change in peptide chains that would otherwise be extremely slow under physiological conditions. This fundamental catalyst for protein conformational change is ubiquitously expressed and plays a role in a wide variety of biomolecular mechanisms, including protein folding, trafficking, and signalling and regulation. Equally, however, CypA has been implicated in a wide variety of disease processes, particularly the facilitation of viral infection and replication. 67 Owing to this broad spectrum of activity and therapeutic interest, CypA structure, dynamics and catalytic mechanism have been well-studied for decades using both experimental and computational structural biology approaches. 65,[68][69][70][71][72][73][74] Many of these studies have identified residue R55, conserved across the cyclophilin family, as mechanistically crucial -either a R55A or R55K mutation result in a similar reduction of catalytic efficiency. 71,75 In the case of the lysine mutation this change in activity occurs in spite of little structural perturbation of the active site. 71,72 A subtle combination of electrostatic, structural and/or dynamical effects has therefore long been thought to underlie the contribution of R55 to catalysis.
Intriguingly, in a combined NMR and molecular dynamics study, Camilloni and co-workers proposed that R55 provides a stabilizing electric field that aligns with the electric dipole of the peptidyl carbonyl group as the pseudo-peptide bond rotates through the cis/trans transition state. 65 This 'electrostatic handle' was suggested to drive the rotation to occur only via positive peptide w-angles by reducing the activation energy when the carbonyl dipole is aligned with the R55 electric field, and increasing it when nonaligned.
From simulations of WT CypA, Camilloni et al. calculated the mean electric field component aligned perpendicular to the plane of the proline ring (chosen to be roughly normal to the peptide bond in either of the cis/trans ground states) to be stabilizing, but relatively small at -40 to -50 MV cm -1 . 65 Solvatochromism experiments suggest this magnitude lies between the electric field strengths exerted on probe molecules by polar, hydrogen-bonding environments such as alcohols or water, and those exerted by apolar solvents such as hexane. 64 The CypA active site is therefore likely to be representative of typical intramolecular field strengths encountered in protein environments with both polar and non-polar moieties. Potential energy surfaces calculated for a model proline dipeptide by Camilloni et al. also suggested that a field of -50 MV cm -1 aligned with the cis/trans transition state would lower the isomerization activation energy by approximately 30 kJ mol -1 , qualitatively consistent with experimentally observed reaction speed-ups over the uncatalyzed reaction. 65 Nevertheless, the electrostatic handle hypothesis was proposed using CypA field strengths determined with ONIOM DFT calculations, the relatively high computational cost of which precludes the use of this approach to study electric fields in proteins more generally. The hypothesis does suggest, however, that CypA provides a well-suited test case to evaluate the ability of the AMOEBA force field to recreate electric fields in protein active sites, which would provide a much faster route to evaluating ground-state electrostatics, energetics and dynamics.
AMOEBA field magnitudes projected perpendicular to the substrate proline ring are smaller than those of additive force fields To evaluate the ability of AMOEBA to model typical intramolecular fields in CypA, we first calculated electric fields present in structural ensembles of the WT, cis-proline, CypA complex (PDB 1M9Y, 57 prepared and simulated as per the Methods). For each simulation frame from the triplicate simulations performed with AMOEBA, the Cartesian x, y, and z components of environmental electric field were calculated at the C and N atoms of the substrate peptide bond. Each atom-centered field was projected along the vector perpendicular to the plane of the peptide proline ring, which reflects the magnitude of the field aligned with the carbonyl group of the peptide bond when rotated during the substrate isomerization. Projections were then linearly averaged to estimate the field magnitude at the peptide bond midpoint (Fig. 1). After projection, intramolecular electric field magnitudes ranged from -40 to +40 MV cm -1 ( Fig. 2A), even in these short simulations. However, the distributions overlapped substantially and the mean field strengths from each independent simulation spanned a much smaller range, from -6.9 to -16.8 MV cm -1 .  The magnitude of field fluctuations is affected by the sensitivity of the force field electrostatic model to small structural changes. The AMOEBA field distributions were therefore compared with those of the Amber ff14SB and Charmm C36m additive models, with a TIP3P water model in both cases. For each of the additive force fields, the electric fields at the substrate peptide C and N atoms were recalculated for each trajectory frame of the AMOEBA structural ensemble from the first replicate.
Fields calculated with the additive models were systematically larger (more negative) than those of AMOEBA and showed a broader distribution (Fig. 2B). Distributions with Amber and Charmm force fields were remarkably similar, although slightly larger fields were calculated with the Charmm partial charges ( ̅ = −41.0 MV cm -1 ) compared to Amber ( ̅ = −37.0 MV cm -1 ). To investigate whether the force field used to create the structural ensemble biased the calculated fields, we generated an alternative structural ensemble of the same system with the Amber ff14SB force field. This ensemble showed analogous results, with a very broad distribution of Amber field strengths centered at -40.9 MV cm -1 , and a comparatively tight AMOEBA distribution centered at -18.2 MV cm -1 (Fig. S1). Therefore, the differences observed between the fields calculated with additive and polarizable models were not simply caused by the structural ensemble generated with the AMOEBA force field. Instead, the broader distribution of field strengths observed with Amber & Charmm supports the notion that an additive Coulombic electrostatics model is more sensitive to the instantaneous environmental conformations than the AMOEBA polarizable model.

DFT fields at the CypA active site are consistent with AMOEBA
In our simulations of the WT cis-proline complex, the AMOEBA force field consistently showed an offset of ca. +15 to +25 MV cm -1 to Amber or Charmm electric field strengths. Although this trend was internally consistent across simulations and between force fields (Fig. S2, Table S2), the AMOEBA electric fields appeared to differ from those calculated by Camilloni & coworkers in CypA, which were estimated to be at least -40 MV cm -1 at the active site in both the cis-proline and trans-proline ground states. 65 However, our system and methodology also differed substantially from this previous study. Simulations here were performed with a different peptide substrate, applied no NMR restraints to the protein structural ensemble, and calculated fields at both peptide C and N atoms during the projection of the field vector along the peptide normal, rather than only at the N atom. 65 As such, we tested the accuracy of our MM-calculated field strengths by comparison to field strengths calculated with ONETEP DFT on the entire simulation system, which consisted of 19018 atoms. A subset of ten frames extracted at 2.5 ns intervals from the first replicate AMOEBA WT cis-proline simulation was used for comparison. Field vectors were compared directly using their Cartesian x, y, and z components in the simulation frame, rather than any projection along a specific vector. Charmm force fields. D-F) Equivalent comparison of the Cartesian x, y, and z field components at the bond midpoint, taken as the mean of the N and C atom components. In all cases, uncertainty is estimated from resampling the data with replacement (n = 2000) and is provided as the 95% confidence intervals in R 2 Fields calculated as the mean of those at the C and N peptide atoms (see Fig. 3D-F for the regression lines). Uncertainty estimated as the 95% confidence interval in regression statistics calculated by resampling the data with replacement (n = 2000) Correlations between the ONETEP DFT and AMOEBA fields were relatively high, with R 2 of 0.75 and 0.92 for fields calculated at the substrate peptide N and C atoms respectively (Fig. 3A). Correlations between Amber or Charmm fields and ONETEP were far smaller, with R 2 no higher than 0.60 (Fig 3B-C). Moreover, regression lines for the additive force fields appeared to have a large positive intercept for fields calculated at the peptide N and C atoms individually. An identical trend was observed when comparing QM and MM results after averaging the N and C atom field components for each frame (Fig. 3D-F). Again, AMOEBA fields gave close agreement to those calculated with ONETEP (R 2 = 0.90), with slope close to unity and without the substantial positive y-intercept and broad confidence intervals exhibited by Amber and Charmm fields ( Table 1).
The systematically more positive field components observed with the additive models appear counterintuitive given that Amber and Charmm field strengths were substantially more negative than those of AMOEBA when projected along the vector perpendicular to the proline ring (Fig. 2B). However, the projection process may conflate differences in field magnitude and direction between the force fields, while the comparison of individual Cartesian components with DFT more fairly characterizes both properties. Additionally, the relatively limited data used in each comparison (30 points, one for each Cartesian component of the field in ten separate trajectory frames) could potentially result in dataset bias, and we have not attempted to ensure structural diversity between the analyzed frames, preferring instead to use the whole trajectory length to extract time-separated frames at regularly spaced intervals. The confidence intervals in the R 2 values (Fig. 3, estimated by resampling with replacement), almost never overlap for the ONETEP/AMOEBA and ONETEP/additive models, suggesting that the improved correlation observed with AMOEBA is likely to be statistically significant if the whole population were compared. To broaden the structural diversity of the dataset while restricting the overall size to a computationally-tractable number of full-DFT electric field calculations, we similarly analyzed alternative CypA-peptide substrate complexes.

AMOEBA reproduction of DFT fields is insensitive to substrate isomer and CypA mutation
Having established that AMOEBA fields were consistent with those of DFT in the WT cis-proline system, we performed an identical analysis of the WT CypA bound to a trans-proline substrate (PDB 1M9C, 57 prepared as per the Methods). After simulation with the AMOEBA force field, the AMOEBA, Amber and Charmm MM fields at the bond midpoint, projected perpendicular to the proline ring, were compared across all trajectory snapshots (Fig. S2). Taking the first replicate as representative, the ONETEP and MM x, y, and z field components were then compared for a subset of 10 equally-spaced frames extracted from the full 25 ns simulation (Fig 4A-B).
Again, AMOEBA showed a narrow distribution of projected field strengths approximately 15-20 MV cm -1 more positive than Amber or Charmm fields calculated from the same structures (Fig. 4A). In terms of the absolute x, y, and z field components, AMOEBA also exhibited strong correlation with the DFT fields, achieving an R 2 of 0.95 across the ten frames extracted from the WT trans-proline trajectory. In contrast, Amber and Charmm fields for the same structural ensemble showed far lower correlations (R 2 = 0.82 and 0.73 respectively), and again, 95% confidence intervals in R 2 for the AMOEBA and additive model comparisons did not overlap, despite the limited dataset size (Fig. 4B). Next, we performed an identical comparison of electric fields in two mutant CypA systems to add further structural diversity to the dataset and evaluate the transferability of the observed trends to the functionally relevant R55A mutation. We evaluated R55A systems with either cis-proline or trans-proline substrates in order to be equivalent to the WT systems, and fields from three 25 ns AMOEBA simulation of each mutant system were calculated and compared across the AMOEBA, Amber ff14SB and Charmm C36m force fields (Fig. S2). For each simulated system the first trajectory was taken to be representative, and a subset of ten equally spaced frames were again used to correlate Cartesian field components with ONETEP.  Fields calculated as the mean of those at the C and N peptide atoms (see Fig. 4B,D,F for the regression lines). Uncertainty estimated as the 95% confidence interval in regression statistics calculated by resampling the data with replacement (n = 2000) On mutation of Arg55 in the CypA protein to alanine there appeared to be a large reduction in the projected field magnitude experienced at the substrate proline bond (Fig. 4C, E). The absolute magnitude of this reduction differed in the AMOEBA force field compared to the two additive force fields, but was similar in relative terms -fields were reduced by approximately 40-50 % upon CypA mutation with the cis-proline substrate, and by approximately 80-90 % upon mutation with the trans-proline substrate.
Notably, the fields calculated with the additive models always showed a broader distribution with higher variance (Table S2), again suggesting that the additive potentials were more sensitive to small structural changes in instantaneous conformation than the polarizable model. This led to greater populations of frames with comparatively high and low field strengths when calculated with the additive models.
Comparison of the MM field components to their DFT equivalents (Fig 4D, F) highlighted identical trends to those observed for the WT systems -AMOEBA consistently showed greater correlation and lower systematic errors to DFT fields. Across the four CypA systems individually, no clear trend in terms of overor underestimation was observed with particular force fields. Slopes, intercepts and R 2 varied (Table 1-2), but AMOEBA consistently demonstrated high correlation to DFT fields (R 2 = 0.86 -0.95). Combining all four CypA systems together, however, highlighted a slight tendency for the additive force fields to underestimate the magnitude of the x, y, and z electric field components compared to DFT in both the positive and negative directions (Fig 5A), that is, negative field strengths were calculated to be too positive, positive were calculated to be too negative. Additionally, the broad distribution of the Amber and Charmm fields remained evident in the comparison of the individual Cartesian field components with DFT, consistent with earlier observations of the field strengths projected perpendicular to the proline ring.  Fields calculated as the mean of those at the C and N peptide atoms (see Fig. 5A-B for the regression lines). Uncertainty estimated as the 95% confidence interval in regression statistics calculated by resampling the data with replacement (n = 2000) The observation that additive force fields slightly underestimated the magnitude of the electric field components compared to DFT again appeared to contradict the fact that, when fields were projected perpendicular to the proline ring, field strengths were almost always larger with Amber and Charmm than with AMOEBA (Fig. 2B, Fig. 4A, C, E, Fig. S2). When the appropriate projections were compared to DFT fields, however, the additive force fields demonstrated a clear systematic error, with y-intercepts of approximately -15 MV cm -1 , and slopes much greater than 1 (Fig. 5B, Table 3). Hence the arbitrary projection of the total electric field along a given vector may not assess the general accuracy of an MM force field; instead, our conclusions are better drawn taking into account both magnitude and directionality of the total field, by comparing individual field components. With this in mind, Cartesian field components calculated with AMOEBA correlated well with those of DFT (R 2 = 0.92) and with slope close to unity and intercept close to zero ( Table 3).

Electric fields in polarizable and additive models
The development of electrostatic parameters for additive biomolecular force fields is problematic owing to the fundamentally different environments in which the atomic partial charges are first derived (generally by gas-phase QM calculations upon individual amino acids or fragments) and then applied (condensed-phase simulations of large macromolecular complexes, including water, ions and other solutes). The Amber and Charmm force field families, like many others, have compensated for the effects of transferring parameters to the condensed phase by fitting partial charges that are intrinsically overpolarized. 10,11,76,77 This approximation is ultimately inadequate, and recognition of this fact has contributed to the drive to develop accurate explicitly polarizable models. 8,24,31,33 Nevertheless, parametrization processes for additive models are generally intended to recreate intermolecular energetics (determined by the electrostatic potential at a given site) and forces (determined by the electric field) as accurately as possible. However our results for CypA suggest that the intrinsic overpolarization of atomic partial charges is insufficient to recreate the effects of induction upon intramolecular electric fields, and that this issue may be generalizable to multiple force field families. The polarization model in AMOEBA is both a more accurate electrostatic model overall, with reduced random error and improved correlation to DFT fields, and does not show a sizable systematic error in electric fields, with a near unity slope in comparison to DFT (Fig 5).
The implications of poorly estimating intramolecular electric fields are likely to vary by application.
Properties reliant on a balance of intermolecular interactions and conformational propensities, such as solvation or binding free energies, exploration of structural and dynamical ensembles, or kinetics of transitions, are likely to be recreated successfully with well-parametrized additive models through compensatory effects of the remaining potential function terms. Both the Amber ff14SB and Charmm C36m force fields have been well validated for their ability to accurately recreate biomolecular structure and dynamics. 2,3 AMOEBA, or other equally accurate polarizable models, may provide significant improvement in accuracy when the electric field is the crucial property of interest. 47,48,64 Notably, our results suggest this is true even for electric fields encountered in 'normal' protein environments -the maximum electric field magnitudes encountered in CypA (approximately ±75 MV cm -1 from Fig. 5) are well below those measured in more 'extreme' environments, such as the active site of ketosteroid isomerase. 78 Beyond electrostatics calculations, intramolecular electric fields, and how proteins conformationally respond to changes in electric fields, are also increasingly implicated in biophysical processes, from enzyme reaction mechanisms to voltage sensing in ion channels. [78][79][80][81][82] Clearly, the excellent agreement of AMOEBA with DFT observed here does not negate the potential for inaccuracy in other systems or even alternative states along the CypA reaction pathway, but it does highlight the potential for polarizable MM models to achieve quantitative accuracy in biological environments where the electric field is crucial for structure or function.

Implications for CypA mechanism
The 'electrostatic handle' mechanism of CypA proposes that catalytic activity is driven by the alignment of a large electric field (provided by the R55 residue in WT CypA) with the carbonyl dipole of the residue preceding the substrate proline in the transition state. Our results with both AMOEBA and additive force fields are consistent with a significant contribution of R55 to the overall electric field in both CypA ground states. However, the magnitude of this contribution (taken as the difference between mean fields in the WT and R55A simulations) appears to be smaller than that proposed by Camilloni and coworkers, with a maximum of -8.7 MV cm -1 for the cis-proline or -23.5 MV cm -1 for the trans-proline system across all AMOEBA simulations (Table S2), rather than the ca. -30 MV cm -1 estimated previously. 65 The comparisons of QM and MM electric fields performed here were not designed specifically to probe the CypA mechanism, and there may be multiple reasons for the observed discrepancies in field strengths.
First, the 25 ns MD simulations performed with AMOEBA, without replica-averaged NMR restraints, are unlikely to fully explore the equilibrium ground state dynamics of CypA. Field distributions varied between repeat simulations of all four WT and R55A CypA states (Fig. S2, Table S2), suggesting a lack of convergence of the absolute field strengths. However, in all but one simulation AMOEBA fields, projected perpendicular to the proline ring, were relatively smaller than those of either Amber or Charmm. On average, AMOEBA fields also showed a drop in field strength in the R55A mutant simulations, generally supporting the conclusion that R55 makes significant contributions to the active site field in the ground states. Additionally, the control calculation of AMOEBA field strengths taken from a structural ensemble generated with Amber suggests that the use of AMOEBA for MD does not fundamentally change the distribution of configurations explored, in the context of electric fields.
It is also possible that the contribution of R55 to electric fields in CypA was previously overestimated. First, field strengths were previously calculated at the proline N atom only, rather than averaged across both peptide bond atoms, and we consistently see a larger electric field at the proline N than the preceding C atom in all force fields tested in our calculations (Table S2). Second, although fields were previously calculated using DFT ONIOM calculations at the B3LYP/6-31G** level, only a subset of the CypA protein (250 atoms) was included in the QM region, and without long-range electrostatic effects of protein and solvent estimated by electrostatic embedding. Moreover, electric fields were calculated taking into account the contribution of only the CypA receptor residues, not using the environment fields approach of Fried & Wang that includes back-polarization of the receptor by the substrate. 64 It is difficult to predict the modulation of electric field caused by back-polarization at a specific site along a specific vector, but the effect should strictly be included as far as practicable. Finally, the previous WT CypA simulations were performed incorporating experimentally-derived restraints, and should therefore be a reliable representation of the WT structural ensembles. However, trajectory frames were geometry optimized using the Amber ff99SB-ILDN force field prior to the DFT calculations, and structures of the R55A mutant were generated directly from the WT structural ensemble rather than a separate set of simulations. The previously reported effects of the R55A mutation are therefore purely electrostatic, and do not consider the potential for any structural reconfiguration upon mutation. Both approximations are conceptually likely to overstabilize and hence overestimate the electrostatic effects within the CypA active site.
Our results are therefore broadly consistent with the electrostatic mechanism proposed by Camilloni and coworkers, but may not have captured all catalytically-relevant CypA motions too, as the magnitude of the overall field contribution is small. The limitations of the short simulations performed here, notwithstanding the fact that the AMOEBA 2013 protein force field has not been as widely tested and validated as either of the two additive models investigated, mean that we cannot draw unambiguous conclusions regarding the CypA mechanism. Recent detailed investigations of CypA dynamics have required both innovative biophysical techniques, and lengthy advanced sampling methodologies for simulations, and the use of polarizable force fields does not negate this necessity for adequate sampling. 74,[83][84][85] However, the accuracy of AMOEBA fields compared to DFT suggests that polarizable simulations provide a promising avenue to better explore the electrostatic contributions to mechanism across members of the cyclophilin family, in a computationally efficient way.

Conclusion
AMOEBA's use of higher order multipoles and an induced dipole representation of induction has been developed to accurately reproduce the electrostatic potential around small molecules and amino acid fragments. The results presented here demonstrate that this accurate representation of electrostatics transfers directly to a typical condensed phase biomolecular environment. The ability of MM force fields to reproduce intramolecular electric fields is an experimentally relevant, and increasingly experimentally measurable, property of interest. The use of linear-scaling ONETEP DFT calculations allowed us to evaluate electric fields inside fully atomistic, fully solvated representations of CypA under periodic boundary conditions and in the presence of explicit ions, without simplified or truncated structural models. In the absence of experimental measurements of electric fields in CypA, we compared electric fields calculated with AMOEBA, Amber and Charmm biomolecular force fields with these extensive QM calculations.
AMOEBA showed significantly better agreement with DFT-calculated fields across the full range of systems, structures and field magnitudes tested, with non-overlapping 95 % confidence intervals in R 2 in the majority of cases, despite relatively small sample sizes.
The high accuracy of AMOEBA electrostatics in condensed phase, intramolecular environments, suggests that it (and potentially other polarizable force fields) may be ideal for applications where electrostatic accuracy is key. Electric field strengths in the CypA simulations are well within the range of those typically found in biomolecular environments, which therefore suggests a wide range of potential applications.
Promisingly, however, studies of enzyme activity and mechanism with polarizable force fields have already begun to allow quantitative insights into electrostatically-driven biological mechanisms that additive force fields appear unable to provide. 47,48 The results we present from simulations of CypA cannot unambiguously determine a reaction mechanism, nor were they intended to. They are supportive of, but not limited to, an electrostatic role for R55, consistent with what is known for CypA. However the fact that AMOEBA can determine electric fields in CypA with almost DFT-level accuracy indicates that, with suitably equilibrated structural ensembles, provided for example by high-performance software implementations, 54,55 AMOEBA could be used to accurately probe the ground-state dynamics and electric fields of the wider cyclophilin family within single simulations, and at substantially lower computational cost than is currently required for QM/MM or QM post-processing approaches.