CHARMM-DYES: Parameterization of Fluorescent Dyes for Use with the CHARMM Forcefield

We present CHARMM-compatible forcefield parameters for a series of fluorescent dyes from the Alexa, Atto and Cy families, commonly used in F¨orster resonance energy transfer (FRET) experiments. These dyes are routinely used in experiments to resolve the dynamics of proteins and nucleic acids at the nanoscale. However, little is known about the accuracy of the theoretical approximations used in determining the dynamics from the spectroscopic data. Molecular dynamics simulations can provide valuable insights into these dynamics at an atomistic level, but this requires accurate parameters for the dyes. The complex structure of the dyes, and the importance of this in determining their spectroscopic properties, means that parameters generated by analogy to existing parameters do not give meaningful results. Through validation relative to quantum chemical calculation and experiment, the new parameters are shown to significantly outperform those that can be generated automatically, giving better agreement in both the charge distributions and structural properties. These improvements, in particular with regards to orientation of the dipole moments on the dyes, are vital for accurate simulation of FRET processes. Abstract We present CHARMM-compatible forceﬁeld parameters for a series of ﬂuorescent dyes from the Alexa, Atto and Cy families, commonly used in F¨orster resonance energy transfer (FRET) experiments. These dyes are routinely used in experiments to resolve the dynamics of proteins and nucleic acids at the nanoscale. However, little is known about the accuracy of the theoretical approximations used in determining the dynamics from the spectroscopic data. Molecular dynamics simulations can provide valuable insights into these dynamics at an atomistic level, but this requires accurate parameters for the dyes. The complex structure of the dyes, and the importance of this in determining their spectroscopic properties, means that parameters generated by analogy to existing parameters do not give meaningful results. Through validation relative to quantum chemical calculation and experiment, the new parameters are shown to signiﬁcantly outperform those that can be generated automatically, giving better agreement in both the charge distributions and structural properties. These improvements, in particular with regards to orientation of the dipole moments on the dyes, are vital for accurate simulation of FRET processes.

by analogy to existing parameters do not give meaningful results. Through validation relative to quantum chemical calculation and experiment, the new parameters are shown to significantly outperform those that can be generated automatically, giving better agreement in both the charge distributions and structural properties. These improvements, in particular with regards to orientation of the dipole moments on the dyes, are vital for accurate simulation of FRET processes.

Introduction
The study of the structure and dynamics of biochemical systems, such as DNA and proteins, is a complex but important field of research. There are numerous methods for studying such systems, which encompass both experimental and computational approaches. One popular and successful method is based on the photophysical process, Förster resonance energy transfer (FRET). 1,2 FRET occurs between a donor and acceptor fluorophore at sufficiently close separation, typically 1 − 10 nm. 3 The energy transfer also requires that the two fluorophores' absorption and emission spectra overlap and their electric dipoles are non-orthogonal. Typically an electronically excited donor (D) will then transfer energy non-radiatively to an acceptor (A), which will relax to its ground state by radiating energy, and this process has been used to determine when two molecules are in close proximity. 4 Some studies have also used nonemissive acceptors as quenchers in FRET experiments. Donor and acceptor fluorophores can be covalently bound to sites in DNA or proteins and the FRET efficiency determined from the intensity of the fluorescence emissions. The experimental data can be used to determine the fluorophore separation 5 and thus both structural and dynamical information, with the following equation: In this, E is the FRET efficiency -the quantum yield of the energy transfer -while r is the dye-pair separation. The value R 0 is the Förster radius, the separation of the dyes at which there is exactly 50% FRET efficiency. Since R 0 is kept fixed for a specific donor-acceptor pair in a given solvent, the accurate determination of this value is essential in producing distance information. The expression for R 0 is given by: 3 where φ D is the fluorescence quantum yield of the donor without the acceptor present, κ describes the relative orientation of D and A (see below), n is the refractive index of the intervening medium and J(λ) is the overlap integral of the D emission spectrum with the A absorption spectrum.
κ 2 is often termed the dipole orientation factor, as κ is defined as: where µ i is the normalized transition dipole moment of a fluorophore, and R DA is the normalized displacement between the centres of D and A. The majority of FRET experiments make the so-called "κ 2 approximation", which assumes that the average value of κ 2 is 2 3 , corresponding to the value expected if the dipoles orient themselves isotropically over a sphere. 6 There is also an inherent assumption that the rate of rotational motion of the dyes is fast compared to the excited state lifetime. 7,8 Nonetheless, the FRET technique has been shown to give accurate distance measurements within a 3Å precision for separations between 30 − 100Å, 9 which has allowed the use of such measurements to determine biomolecular structure and dynamics. 8,10-12 However, the reliability of FRET below 40Å is unknown. This is thought to be partially due to the validity of assuming the isotropic limit of κ 2 in this regime. 13,14 Computational methods, such as molecular dynamics (MD), have also been used to study the dynamics of FRET dyes. 15,16 These MD simulations have helped elucidate the dynamics between the fluorophores and the biochemical system of interest. A study by Shoura et al., investigating the behaviour of two fluorophore dyes, Atto 594 and Atto 647N, found that κ 2 was not 2 3 when the dyes were separated by 2 nm, but in fact 0.33. 15 The authors of said paper note, however, that forcefield parameterization will strongly affect the observed behaviour of the dyes, hence accurate parameterization is key to obtaining reliable simulation data. In particular, the charge parameters will directly change the predicted dipole moments and thus the dipole orientations of Eqn. 1, while the relative rigidity of the dyes -reflected in bonded parameters -will affect their rotational freedom. More careful parameterization of these dyes is thus essential.
There has previously been work by Graen et al. 17 giving parameters for some of the most popular dyes, but for the AMBER forcefield. 18 These built upon previous work on determining charges for specific dyes for both the AMBER and CHARMM27 forcefields. [19][20][21][22] However, these are either not directly transferable to other forcefields, or with the exception of the AMBER-DYES forcefield, done by analogy. As will be discussed later, the CHARMM parameterization protocol is perhaps more well-suited than the general AMBER forcefield (GAFF) procedure to the physics of the dye interactions. In the present work, extensive force field parameterization of a series of fluorescent dyes commonly used in FRET experiments has been carried out, for use with the CHARMM protein and nucleic acid forcefields.
FRET dyes are large, presenting a difficult problem to parameterize, and the accurate description of their charge distributions in particular is vitally important for meaningfully studying their dynamics. We demonstrate that, due to the high anisotropy of the electron density, automatic generation of parameters by analogy using standard tools does not provide sufficient accuracy, and present new parameters that show much better agreement with quantum-mechanical results.

Parameterization
The CHARMM forcefield, 23 as with most of the standard point-charge-based classical forcefields, divides the parameters into two classes: bonded and non-bonded. The former comprise harmonic bond-stretching, torsional, and improper dihedral terms, mediated by force constant parameters k, and equilibrium values. Added to this are dihedral terms described by a cosine-based Fourier series, controlled by force constants, frequencies n, and phases δ. The non-bonded terms comprise van der Waals forces in the form of Lennard-Jones potentials, with the usual parameters of well-depth and equilibrium bond distance, and the Coulombic interaction between electrostatic point charges, q.
In this work, we follow the well-documented CHARMM general force-field (CGenFF) scheme for optimizing these parameters, 23,24 with some adaptations which will be described in this section. As a starting point, we have taken the parameters automatically generated by the CGenFF program; all results will be compared to these. It should be noted that the heuristic measure of confidence in the generated values given by the program indicated a strong need for reparameterization of almost all of the bonded parameters. Following convention, 23 we do not reoptimize the Lennard-Jones parameters, but take those by analogy from the existing CGenFF parameters.

Dyes and atom types
The fluorescent dyes used experimentally are varied and many, depending on the absorption and emission frequencies needed for the particular experiment. In addition, these will usually be attached to a protein or nucleic acid via some kind of organic linker. Providing parameters for every possible combination of dye, linker, and attachment point would require considerable effort, and would not make use of the commonalities between the different systems. To this end, we restrict our attentions here to a representative subset, which can easily be extended by analogy to other, similar dye systems.  Figure 1: The 2D structures of two of the dyes parameterized herein: Atto 647N and Cy5. Non-hydrogen atoms are numbered by their assigned atom type. There are also 5 hydrogen types, depending on their attachment: aromatic; aliphatic on a primary, secondary or tertiary carbon; on an alkene. The other dyes, and a table of which atom types the numbers correspond to, can be found in the Supporting Information.
In broad terms, the commercially available dyes can be divided into three families: the Alexa, Atto, and Cy dyes. Within these, we present parameters for: Alexa Fluor 647, Atto 550, Atto 647N, Cy3, Cy3B, Cy5, and Cy7. Two of these structures are given in Figure 1 for reference, along with the classification of their atoms into types. These types were based on those generated by the CGenFF program, then adapted to fully exploit the similarities in structure between the different fluorophores. Similar diagrams for the dyes not shown in the Figure are given in the Supporting Information.
The linkers used vary greatly in length and method of attachment. However, they are all structurally simple in comparison, involving only standard functional groups which are already well-parameterized by CGenFF. As such, we only give one such linker with our parameters, using a common attachment to the dyes, and other variants can be generated as needed from this. Similarly, while the linker may be attached to any given amino acid or nucleobase, the point of attachment with the linker is common, so that we give only one example of each attachment. The focus of this work is on the dyes themselves.
As can be seen in Figure 1, each dye has regions of planarity and extended conjugation.
This causes a rigidity in the structure that is not well described by the default parameterizations, hence the need for refitting of the bonded parameters. Moreover, the electronic delocalization caused by these structural features means that the electron density is highly anisotropic, with regions of concentrated charge, which are not described well by the generic point charges. In this respect, the use of CHARMM over, for example, AMBER is prudent, as the former models the point charges based on interactions with water molecules, which allows for a more accurate description of this anisotropy. In contrast, a similar study parameterizing other dyes for the AMBER forcefield needed to adapt the charge optimization process to account for this. 17

Charges
The accuracy of the point charges is particularly important as the intended application is in the study of the interactions between the dyes, and modelling spectroscopic events based on these dynamics. Overly isotropic parameters will lead to a poor description of the interactions between dye pairs that mediate energy transfer, resulting in erroneous dynamics. On the other hand, unbalanced, overly strong charges would lead to the dye pairs being held tightly together indefinitely, whereas experimentally it is expected that there should remain a fairly high degree of rotational and translation freedom in the diffusion of the dyes. 25 Standard procedure in CHARMM is to optimize the structures of the new residues at the Møller-Plesset second-order perturbation theory (MP2) with the 6-31G* split-valence basis set. However, as will be discussed later, the need to generate a Hessian makes this prohibitively expensive, and the Hessian needs to be generated at the same level of theory as the optimization to be valid. As such, we optimized the dye structures using density-fitted (DF) MP2 26 with the 6-31G* basis, 27,28 and the associated auxiliary MP2 fitting sets, 29 in the molpro suite of programs. 30,31 Density fitting reduces the scaling of the calculation by an order of magnitude, such that even a numerical Hessian takes considerably less resource to calculate. To the authors' knowledge, it has not been used before in a force-field parameterization, but previous benchmark studies have demonstrated that structural parameters and force constants from DF-MP2 only deviate from full MP2 on the order of less than a single percent. 32,33 The charge optimizations then proceeded by generating potential energy curves at the Hartree-Fock (HF)/6-31G* level of theory for a water molecule interacting with each nonhydrogen point of contact, using Gaussian '09. 34 These points of contact were generated automatically by the forcefield toolkit (ffTK) program, 35 and then pruned so as to remove interactions where the water molecule would not have access to the site. The forcefield charges were then optimized by a constrained least-squares minimization from the molecularmechanics (MM) calculated potential energies compared to the quantum mechanical (QM) potential energies. An L 2 regularization term was added to avoid overfitting the charges, avoiding potential problems with overly localized charges. This process was then repeated iteratively until the overall objective function changed by less than 0.1%. To maintain compatibility with the CHARMM forcefield, the aliphatic and aromatic hydrogen charges were constrained to be +0.09 and +0.15 e, respectively.

Bonded parameters
For the harmonic terms, we followed the analytical partial Hessian fitting approach of Wang and coworkers, using their ParmHess code. 36 We adapted the code to read the outputs from the DF-MP2 numerical Hessian as calculated in molpro. The dyes are fairly large molecules for MP2 (at around 90 atoms each), and so it would be considerably cheaper to use a density functional instead. However, the standard CHARMM procedure is to use MP2, and it is vitally important that any new parameters are compatible with the rest of the forcefield.
While density fitting introduces an error, this is relatively small both in terms of geometries and frequencies. It is therefore likely that the DF-MP2 results will be much closer to the full MP2 results than if we used a different method, and as a result, will fit better with the rest of the forcefield.
For each bond-stretch, angle torsion, and improper dihedral, an MM Hessian for just the two terminating atoms in the term is calculated, and the corresponding parameters fitted to the equivalent portion of the QM Hessian. This is repeated iteratively through the list of terms until the parameters change by less than one percent. The procedure is highly dependent on the choice of starting guess: we found that in particular for the bonds involved in the conjugated portions of the dye, the guesses generated by CGenFF are too large, and lead to unphysical rigidity (i.e. very large force constants). This problem could be avoided by using a starting guess of 0.6 to 0.8 times the original guess, as determined by careful reoptimization at each stage.
Finally, the dihedral terms represent a particular challenge, due to the simultaneous optimization of multiple parameters in a variable length Fourier series. Potential energy scans of the proper dihedrals were performed at the DF-MP2/6-31G* level. We then took the multiplicity of each Fourier series by analogy with similar dihedrals already in CGenFF, usually as generated automatically. As with the charge optimization, MM scans were then performed with the new parameters, which are then iteratively fitted in a least-squares optimization against the QM results. This was performed by direct calculation of the dihedral term in Python, minimized with the standard SciPy BFGS algorithm. 37 After each iteration, we inspected the total error in each dihedral term, and increased the multiplicity of any terms that showed large errors. These were then refitted until all parameters changed by less than one percent.

Results
All MD simulations were carried out in GROMACS 2016.4, 38 with the TIP3P model for water. 39 The parameters developed here are designed to be used with the CHARMM36 forcefield, although none of the MD simulations herein required additional parameters beyond the water and ion parameters. Temperature was controlled with a Langevin thermostat to a target temperature of 300 K, while a pressure of 1 bar was maintained with a Parinello-Rahman barostat with a time constant of 2 ps. A timestep of 2 fs was used throughout.
A short-range van der Waals cutoff of 1.4 nm was employed, and the particle-mesh Ewald method used to calculate long-range electrostatic interactions.
In each MD simulation, we minimized the energy before equilibrating the water and ions for 500 ps to 300 K, before equilibrating the dyes for a further 500 ps. The production runs were all 50 ns long. Diffusion coefficients have been calculated by calculating the meansquare deviation of the bare dye in a cubic box of water expanded 1 nm on all sides around the centred dye. A straight line was then fitted to the mean-square deviation results from 10 ns onwards, to ensure equilibration had been achieved. IR spectra were calculated using VMD's spectral analysis tool 40 on the last 40 ns of the simulation, and compared to the DF-MP2 spectra of the gas-phase dye to determine an empirical shift for the calculated frequencies.

Charges and electric dipole moments
There is a paucity of experimental data on FRET dyes with which we can meaningfully compare the results from the MD simulations. The standard approach would be to compare with known crystal structures in proteins, or to compute bulk quantities, neither of which is possible; to the authors' knowledge, no such crystal structures are available, and FRET dyes are far too costly to study in the bulk. As such, we will instead focus on comparing to quantum-mechanical results, in particular of the charge distributions and structural properties, while also demonstrating the performance compared to the CGenFF-generated parameters.  optimized parameters within 1.2 to 1.4 times the QM magnitude and possessing the correct orientation. The implications for simulations of FRET dynamics are significant -inaccurate orientation of the electric dipole moments will lead to erroneous rotational behaviour of the dyes, greatly effecting the value of κ 2 . As discussed in the introduction, this will in turn strongly affect computed FRET efficiencies. Figure 3: Electric dipole moments, originating from the center of mass, as calculated quantum mechanically using DF-MP2/6-31G*, or from MD simulations using the optimized parameters or the initial CGenFF guess. (a) The Cy3B dipole moments scaled by a factor of 0.3. (b) The Atto 647N dipole moments scaled by a factor of 0.7. In all cases the DF-MP2/6-31G* optimized geometry is shown.
A second, less clear, validation of the charges can be found through computation of the dyes' infrared spectra. Theoretically, such spectra depend strongly on both the bond force constants, and the local dipoles (thus the charges) in those bonds. However, no experimental IR spectra are available in the literature; while we could instead calculate UV-visible spectra by performing quantum-mechanical calculations on snapshots from the MD, this would be an even less direct validation. As such, we chose two dyes (Atto 647N and Cy3B) to determine IR spectra for, and compare to the results from the MD simulations. The comparison is clouded further by the dyes used in the experiment containing N -hydroxysuccinimide (NHS) groups that are required for the reaction to conjugate them to biomolecules. The experimental and computed spectra are compared in Figure S9 of the Supporting Information, alongside details of the experimental procedure, and it can be seen that computation qualitatively reproduces the key peaks around the 3000 and 1000-2000 wavenumber region. It should be noted that the reactive NHS groups leave during the reaction with biomolecules, hence they are not included in the forcefield parameters or any of the simulations reported.  determining the FRET efficiency, such that good agreement in diffusion constants is a key requirement on our simulations. Table 1 gives these along with experimentally-determined values from the manufacturers, where available. As each dye family has broadly similar structures, their diffusion constants are also similar, such that we only consider one from each family. This is in part also due to limited availability of experimental data. It should also be noted that the experimental data was determined at a slightly different temperature (298.15 K compared to 300 K in the simulations), but this is unlikely to make a large difference. However, while the calculated results do not quite agree quantitatively with the experimental ones, the new parameters do perform better than the CGenFF ones, which consistently overestimate. This is likely again due to the differences in charge distributions, with the inaccurate interactions with water from Figure 2 affecting the ability of the dyes to diffuse through water.

Conclusions
New molecular mechanics parameter sets, compatible with the CHARMM forcefield, have been designed, developed and tested for a group of fluorescent dyes commonly used in FRET experiments. Specifically, this includes dyes from the Alexa (Alexa Fluor 647), Atto (Atto 550 and Atto 647N) and Cy (Cy3, Cy3B, Cy5 and Cy7) families, along with an organic linker of the type used to attach a dye to a protein or nucleic acid. Both the dyes and the linker are presented as a representative subset, the commonality between systems means that parameters for similar dyes and linkers can be straightforwardly generated by analogy. Validation of the new parameter sets with quantum mechanical benchmark data demonstrates they produce significantly more accurate results than parameters automatically generated using CGenFF. The accurate description of both the charge distribution and conformational flexibility of the dyes is vital to give realistic dynamics, as these dyes have highly delocalized electronic structures and are more rigid than the typical biomolecules studied using CHARMM.
The accurate charges presented a particular difficulty, as the optimisation of these dyes at a high-level of theory is very computationally expensive. To this end, we used density-fitted MP2 rather than the HF level used previously in optimization of parameters for use with the AMBER forcefield. 17 The resulting optimized charges much better reproduced the dipole orientations and magnitudes than the analogy-assigned (automatically generated) parameters, with the latter giving electric dipole moments orientated 60 -90 degrees away from the quantum mechanical result. As the FRET efficiency directly depends on the orientation factor, κ 2 , this has important implications for future simulations of FRET dyes; both the charge distribution and the relative orientation of the dyes will be affected by the forcefield parameters chosen. Additionally, the new bonded parameters give diffusion constants in better agreement with experiment, and better reproduce the DF-MP2 geometry. This should facilitate probing whether the assumptions used in FRET studies -specifically that the rotational diffusion is much faster than the excitation event -are realistic.
There are still assumptions implicit in our model, however. Firstly, we have had to assume that the excited state electronic structure is not significantly different from that of the ground state. Previous studies have suggested this assumption is reasonable, though only for specific dyes. 19 Secondly, we are constrained by the fixed point-charge approximation used in the CHARMM forcefield. This may not necessarily be appropriate, as the delocalized electronic structure of the dyes is likely highly polarizable. One possibility to address this would be to add polarizable centres to the model; this then presents the question of where to put these centres, and how many are needed. Both this and characterization of the excited state structures would be interesting but challenging avenues of future research. The parameters presented here, in the meantime, will improve our ability to describe the dynamics of dyes attached to biomolecules, and how those dynamics may affect FRET events.  The residues and parameters that make up the CHARMM-DYES force field can be downloaded in machine readable format from https://dx.doi.org/10.15131/shef.data.12593681.
These are in GROMACS format, and structured as follows. The GROMACS-compatible force field is in the directory charmm36-dyes.ff, which should be copied either to your working directory or directly into your GMXLIB folder before use. Within this directory you can find: • new atom types in the atomtypes.atp file; • new residues in merged.rtp, for each dye by itself, with a linker, and attached to thymine; • new bonded parameters in ffdyesbonded.itp; • new-by-analogy non-bonded parameters in ffdyesnonbonded.itp.
In the parent directory, there is also a dyes.rtp file containing just the dye residues, for reference, and a residuetypes.dat file which needs to be added to the global GMXDATA/top directory. In addition, we have provided a pdbs folder containing correctly-labelled PDB files for each of the dyes by themselves, and with the linker attached.
If you wish to create new residues, perhaps by attaching the dyes to different nucleobases or amino acids, you will need to update each of the above files. Guidance on how to do this can be found in the GROMACS documentation. S-2

Atom types
Below is a table of the new atom types, and figures showing how these are assigned to each dye, and to the linker.
S-3  30 29 In Section 3.1 of the main manuscript the performance of the optimized parameters are validated in terms of interaction energies and electric dipole moments. Properties calculated using the CHARMM forcefield with the new, optimized charges were substantially improved when compared to those using the CGenFF initial guess, using the output of QM calculations as a benchmark. A second, less clear, validation of the charges can be found through computation of the dyes' infrared spectra. We chose two dyes (Atto 647N and Cy3B) to determine IR spectra for, and compare to the results from the MD simulations. FTIR measurements were taken using a PerkinElmer Spectrum Two with a UATR module. 100 nanomoles of dye were dissolved in ethanol and transferred to the ATR crystal. The ethanol was allowed to evaporate before the pressure arm was applied, and scans were taken. Spectra shown are an average of 4 scans from 4000 to 600 cm −1 , with a subtracted background acquired from a repeat procedure with ethanol alone. Figure S9 shows the resulting spectra, with overlays of both the experimental and computed spectra. Key peaks around the 3000 and 1000-2000 wavenumber regions appear to be qualitatively reproduced by computation. It should be noted that the computed intensities, which rely strongly on the magnitude of dipole moments, are known to be unreliable. The dyes used in the experimental IR spectra contain N -hydroxysuccinimide (NHS) groups that are required for the reaction to conjugate them to biomolecules. There are additional peaks in the experimental spectra due to these groups; they can be clearly seen in both dyes, particularly in the 2000-3000 wavenumber region. The reactive NHS groups leave during the reaction with biomolecules, hence they are not included in the forcefield parameters or any of the simulations reported. S-9