Using Quantum Mechanical Approaches to Study Biological Systems

Conspectus Quantum mechanics (QM) has revolutionized our understanding of the structure and reactivity of small molecular systems. Given the tremendous impact of QM in this research area, it is attractive to believe that this could also be brought into the biological realm where systems of a few thousand atoms and beyond are routine. Applying QM methods to biological problems brings an improved representation to these systems by the direct inclusion of inherently QM effects such as polarization and charge transfer. Because of the improved representation, novel insights can be gleaned from the application of QM tools to biomacromolecules in aqueous solution. To achieve this goal, the computational bottlenecks of QM methods had to be addressed. In semiempirical theory, matrix diagonalization is rate limiting, while in density functional theory or Hartree–Fock theory electron repulsion integral computation is rate-limiting. In this Account, we primarily focus on semiempirical models where the divide and conquer (D&C) approach linearizes the matrix diagonalization step with respect to the system size. Through the D&C approach, a number of applications to biological problems became tractable. Herein, we provide examples of QM studies on biological systems that focus on protein solvation as viewed by QM, QM enabled structure-based drug design, and NMR and X-ray biological structure refinement using QM derived restraints. Through the examples chosen, we show the power of QM to provide novel insights into biological systems, while also impacting practical applications such as structure refinement. While these methods can be more expensive than classical approaches, they make up for this deficiency by the more realistic modeling of the electronic nature of biological systems and in their ability to be broadly applied. Of the tools and applications discussed in this Account, X-ray structure refinement using QM models is now generally available to the community in the refinement package Phenix. While the power of this approach is manifest, challenges still remain. In particular, QM models are generally applied to static structures, so ways in which to include sampling is an ongoing challenge. Car–Parrinello or Born–Oppenheimer molecular dynamics approaches address the short time scale sampling issue, but how to effectively use QM to study phenomenon covering longer time scales will be the focus of future research. Finally, how to accurately and efficiently include electron correlation effects to facilitate the modeling of, for example, dispersive interactions, is also a major hurdle that a broad range of groups are addressing The use of QM models in biology is in its infancy, leading to the expectation that the most significant use of these tools to address biological problems will be seen in the coming years. It is hoped that while this Account summarizes where we have been, it will also help set the stage for future research directions at the interface of quantum mechanics and biology.


■ INTRODUCTION
Quantum mechanics (QM) has revolutionized our understanding of the structure and reactivity of small molecular systems. For example, QM based methods provide structural information in excellent agreement with experiment, match experimental barrier heights for chemical reactions, and provide chemically accurate interaction energies for hydrogen-bonded or dispersive systems. 1,2 Given the tremendous impact QM has had for systems of <100 atoms, it is attractive to believe that this impact could also be brought into the biological realm where systems of a few thousand atoms and beyond are standard. To achieve this goal, the bottlenecks of QM methods have to be addressed. Depending on the method employed, various steps in a QM calculation can be rate determining. In semiempirical methods, matrix diagonalization is rate limiting, while in density functional theory (DFT) or Hartree−Fock (HF) theory electron repulsion integral computation is rate-limiting. 3 These theories neglect the correlation energy, which is important to account for to obtain highly accurate results including barrier heights and interaction energies (especially dispersion dominated ones). In this case, the correlation treatment of choice is rate-limiting be it Møller−Plesset (MP) theory or coupled-cluster (CC) methods. 3 State-of-the-art linear-scaling algorithms, which linearize the computational cost with the system size, have attracted much attention. 4−6 Significant effort has been devoted to the development of linear-scaling methods to compute the total energy of large molecular systems at the HF or DFT level. 7−14 One of the challenges is to speed up the calculation of long-range Coulomb interactions when assembling the Fock matrix elements. Fast multipole based approaches have successfully reduced the scaling in system size to linear 11,12,15−17 and made HF and DFT calculations affordable for larger systems when small to moderate sized basis sets are utilized. There are also fragment-based methods for QM calculation of protein systems including the divide and conquer (D&C) method of Yang, 8 Yang and Lee, 9 Dixon and Merz,18,19 Gogonea et al., 20 Shaw and St-Amant, 21 and Nakai and co-workers, 22−25 the adjustable density matrix assembler (ADMA) approach method of Exner and Mezey, 13,26−28 the fragment molecular orbital (FMO) method of Kitaura and co-workers, 29−31 the X-pol model of Gao and Xie, 32 and the molecular fractionation with conjugate caps (MFCC) approach developed by Zhang and co-workers. 33,34 Most applications of these methods to protein systems have been largely limited to semiempirical, HF, and DFT calculations. Among these approaches, FMO has been applied to higher level ab initio calculations such as second-order Møller−Plesset perturbation theory (MP2) 35 and coupled cluster theory (CC). 36 Nakai and co-workers have described D&C-MP2 22,25,37 and D&C-CCSD 38 approaches and applied them to linear or near-linear systems.
QM excels at the static modeling of a molecular system. However, in chemistry and biology, fluctuations are important to describe a broad range of effects. Thus, the ensemble picture of a molecular system is more appropriate, and while molecular dynamics (MD) methods are beginning to address dynamical issues using classical potentials, how to address this "sampling" issue when using more expensive QM based approaches will need to be addressed.
Herein we briefly review work carried out in my laboratory over the past decade in exploiting fully QM methods in the study of biological systems (see Scheme 1). In particular, we will discuss the use of D&C techniques to address the matrix diagonalization step in semiempirical methods and in HF and DT approaches. After summarizing the technical aspects of how to compute the QM energy of many thousands of atoms, we will review several fully QM application studies carried out for the first time in my laboratory on biological systems. Finally, I will discuss the future outlook of using QM to solve biological problems and the ongoing challenges of using a fully QM model on large systems.

HARTREE−FOCK BASED CALCULATIONS
In biological systems, the D&C approach 8 takes advantage of chemical locality. Details of the implementations of this approach at the semiempirical, 9,18,19 HF and post-HF 21−25,39 levels of theory are given in the literature, so below we only give a brief overview. In this approximation, it is assumed that the atoms that are far away from the region of interest only weakly influence local regions of a protein. Hence, the entire system is divided into fragments called core regions (Core α ) with an associated buffer region (Buffer α ) assigned to each core region to account for the local environmental effects. The combination of every core region and its buffer region constitutes a subsystem (R α ) (see Scheme 2). Local MOs of each subsystem are solved by the Roothaan−Hall equation where F α and S α are local Fock overlap matrices, respectively.
After the local MO coefficient matrices C α are obtained, the total density matrix system is given by where D μν α is the partition matrix, and p μν α is the local density matrix defined by where n i α is a smooth approximation to the Heaviside step function: After the density matrix is converged, the total HF energy is given as where H μν α is the local one-electron core Hamiltonian matrix defined similarly to the local Fock matrix (see eq 2).
For HF calculations, the construction of the Coulomb matrix and exchange matrix along with the diagonalization of the Fock matrix are the three most time-consuming steps, while the diagonalization of the Hamiltonian matrix scales as O(N 3 ). In the D&C scheme, the diagonalization calculation is performed for each subsystem, linearizing the diagonalization step as a function of the number of subsystems. However, it is important to realize that the D&C algorithm does not help to reduce the scale of computation of the Coulomb matrix and exchange matrix. Other approaches 18 provide ways to linearize these steps. However, for semiempirical methods where matrix diagonalization is rate-limiting, the D&C approach greatly enhances the capabilities of these methods. 9,18,19 ■ APPLICATIONS OF FRAGMENT QM TO

BIOLOGICAL PROBLEMS
We have used our fragment QM methods (largely at the semiempirical level) to study biological problems with a focus on protein structure and solvation, 40−44 structure-based drug design 45,46 and biological structure refinement by NMR and Xray spectroscopy. 47−64 Below we will briefly highlight work done in our laboratory using these methods to address specific biological problems.
The interface between a biological molecule and water affects its stability and function. 65 The classical picture of this interface does not allow for the commingling of the electron density (the so-called charge transfer (CT) effect) between the protein and the surrounding aqueous environment. We examined the QM nature of a small protein (E. coli cold-shock protein A; CspA) in aqueous solution through the use of semiempirical D&C calculations over 100 snapshots. The surprising result was the observation that two units of charge were transferred from the protein surface to that of the water molecules. Thus, water molecules, at the protein interface, are perturbed not only by the rugged protein interface but also electronically. Via integration over the entire protein surface, the net results of numerous small net transfer of charge yield the observed value of 2 units of charge. Not surprisingly, the main participants in this charge transfer effect are charged amino acids at the protein−water interface (see Figure 1). The magnitude of the CT transfer effect on the energetics of hydrogen bond interactions has been debated, 44,66−68 but we reported 44 a value of ∼30% for a series of dimers using a Kitaura-Morokuma analysis 69 (this latter paper also reports a similar CT effect for the water dimer), a value later supported by more sophisticated calculations. 67 We carried out extensive validation studies of the nature of CT 44 because our pioneering studies employed semiempirical models. Through these studies, we concluded that in terms of the magnitude of the CT effect the semiempirical models employed were reliable. Subsequent work using ab initio FMO 70 and ab initio MD calculations 71 reconfirmed our initial predictions.
In related work, charge transfer in receptor ligand interactions in the context of SBDD has been examined by my laboratory. 45 In a study of 165 noncovalent protein−ligand complexes, we found that 11% of the complexes had more than 0.1e of charge transferred from the protein to ligand. Not surprisingly, in 49 metalloenzyme complexes, on average, there is 0.6e transferred between the protein and the ligand. The direction of the CT effect depended on the nature of the protein−ligand complex. For example, in matrix metalloproteases (MMP), charge was transferred from the protein to the ligand, while for human carbonic anhydrase (HCA) and carboxypeptidases (CPA) charge was transferred n the opposite direction.
The view that the interface between a protein, water, or a ligand is electronically altered via polarization and CT effects is now generally accepted. Indeed, the observation of CT at the protein−water interface highlights the need to include this effect in modern force field methods, and work is being pursued along these lines. 72,73 However, without the aid of QM based methods, this effect would not have been discovered and its importance might have gone unappreciated for some time.

■ QM IN STRUCTURE-BASED DRUG DESIGN (SBDD)
The docking and subsequent scoring of small molecules bound to receptors has been a mainstay of modern SBDD tools. 74,75 However, the reliability of these approaches has been debated. 75−77 Most tools in this class utilize approximate potentials to study the interaction between the protein and its ligand. We reasoned that the use of more advanced QM based methods might improve the outcome. 45,46 We reported the first application of linear-scaling methods to SBDD where we calculated the binding affinity of ligands bound to the HCA and CPA with reasonable accuracy. 46 The free energy of binding in solution was calculated using the following set of equations: Figure 1. E. coli cold-shock protein A (CspA) surrounded by a water layer. The green water molecules are near Lys residues and experience a slight loss of electron density (∼0.06e in aggregate), while the orange water molecules are near Glu/Asp residues gain electron density (∼0.2e in aggregate). The magenta water molecules are not within 5 Å of the charged residues.

Accounts of Chemical Research
where the free energy of binding in solution was calculated as the sum of the gas-phase interaction energy (ΔG b g ) and a solvation correction (ΔG solv terms). The gas-phase interaction energy consisted of enthalpic (semiempirical computed ΔH f 's) and entropic components. The electrostatic part of the enthalpic component was calculated using the DivCon program, employing semiempirical Hamiltonians. 46 A dispersion correction was lso utilized (the (1/R 6 )LJ term). The solvation correction was calculated as a difference of the solvation free energies of the protein−ligand complex (PL), the protein (P), and the ligand (L). The solvation free energy was calculated using a Poisson−Boltzmann (PB) based selfconsistent reaction field (PB/SCRF) method. 78 A major advantage of PB is the internal dielectric of the protein need not be preset as in classical treatments. 78 Finally, an empirical entropy term depending on the change in solvent accessible area (ΔSA) and a ligand rotatable bond count.
In further studies, we carried out a larger scale validation of this QM based scoring function for predicting binding affinity. We calculated interaction energies for a diverse range of protein−ligand complexes comprising of 165 noncovalent complexes and 49 metalloenzyme complexes. 45 For the 165 noncovalent complexes, the interaction energies, without any fitting, agreed with experimental binding affinity within 2.5 kcal/mol. When different parts of the scoring function were fit to experimental free energy of binding using regression methods, the agreement was within 2.0 kcal/mol. For metalloenzymes, the agreement with experiments without fitting was within 1.7 kcal/mol, and with fitting was within 1.4 kcal/mol. Overall, the correlation with experiment was satisfactory with R 2 values of 0.69 46 and 0.55, 45 respectively. However, while the method performed reasonably it was not a "quantum" improvement over traditional methods Much further work has been done to explore the use of QM and QM/MM approaches to predict protein−ligand binding affinities 79,80 with the general conclusion that,while QM does help, it is not a panacea, at least in the current incarnation. Beyond simply having a better level of theory, many issues remain to be resolved to produce robust computed binding affinities including the inclusion of explicit water molecules 81 and incorporation of receptor flexibility. These effects plague both simpler and more sophisticated model Hamiltonians and is the subject of much on ongoing research. 82 ■ QM BASED STRUCTURE REFINEMENT QM NMR Refinement of Protein/Ligand Complexes NMR spectroscopy has proven itself to be a powerful and versatile tool for the study of protein−ligand interactions. The three-dimensional structures of protein−ligand complexes are determined by combining interproton distance restraints derived from nuclear Overhauser effect (NOE) with other restraints from J coupling constants, hydrogen bonds, and/or residual dipolar couplings. Since Fesik and co-workers introduced SAR (structure−activity relationship) by NMR, 83 many NMR-based screening methods have been developed to identify potential drug molecules in pharmaceutical research. 84−88 All these techniques take advantage of the fact that, upon ligand binding, significant perturbations can be observed in NMR parameters of either the receptor or the ligand. These perturbations can be utilized qualitatively to detect complex formation or quantitatively to measure the binding affinity.
Among these NMR parameters, chemical shifts are exquisitely sensitive on the chemical environments of compounds. Therefore, theoretical calculation of chemical shift perturbations (CSP) upon ligand binding provides insights into protein−ligand interactions at the molecular level.
There are two categories of computational approaches to calculate NMR chemical shifts: classical/empirical models and QM. The classical/empirical models 89−95 are parametrized to experimental data or DFT results. These approaches are fast so that they can be easily applied to proteins, but are less general being largely focused on protein chemical shifts. In this case, the generality of a QM based approach offers a significant advantage.
We developed a fast and accurate approach 96 to calculate NMR chemical shifts using the D&C method at the semiempirical level of theory. Along with this approach, we have reported the automatic fragmentation QM/MM (AF-QM/MM) method that computes chemical shifts for large systems using ab initio methods. 54 This approach was first applied to the FKBP-GPI complex (see Figures 2 and 3). 48 By  comparing calculated proton chemical shifts of the ligand to experimental data, it was possible to determine the best binding site pose. To further validate this approach, we generated several hundred poses of GPI using different docking programs and then scored them by calculating CSPs and then comparing them to experiment. 97 We have found that the deviation of the computed CSPs from experiment better differentiate decoy poses from native poses than scoring functions used in docking studies. This demonstrates that CSP based approaches can provide an accurate way in which to predict protein/ligand complex structure using in silico NMR approaches.

QM X-ray Refinement
We have also explored the use of QM based methods in X-ray refinement. We utilized D&C QM calculations to refine an entire protein, 98 and we have carried out several studies looking at the QM/MM refinement of small molecules. 59−62 One of the strengths of QM is the ability of these methods to provide high quality structural models of both the protein and the small molecule. Modern X-ray refinement of protein structures typically uses a pseudoenergy formalism where a chemical model (E Chem ) is combined with the X-ray signal (E X-ray ) for which the gradients are obtained and the energy minimized. 99 In this formulation, the quality of the results depends on both the chemical model term and the quality of the experimental data. For proteins, highly parametrized expressions are available for the E Chem term, but the quality of this term for small molecules is variable. However, QM models can treat both the protein and ligand equally well with no parametrization.
As an example of how a QM (in this case a QM/MM refinement) can improve structure quality, I briefly describe our work on benzamidine ligands bound to a number of receptors. 55 In the case of benzamidines, the amidine group (see Figure 4) prefers to be out of the plane of the benzene ring by ∼ ±40°; however, many of the reported protein X-ray structures model this group as planar. Using QM/MM X-ray refinement approaches, we demonstrated that this moiety was indeed nonplanar and that the models employed in the earlier refinements did not accurately represent this functional group. Figure 5 shows an example of this for PDBID 1Y3X, where we show the result before and after QM/MM X-ray refinement. Moreover, the standard R, R free , and real-space R standard structure quality metrics all improved in the QM/MM X-ray refined structures. 55 The Future of QM in Biology QM has already had a major impact on the study of biological systems primarily through its use to build sophisticated classical force field representations of biological macromolecules. Via QM/MM methods, 100 enzymatic catalysis can be routinely studied and this development was awarded the Nobel Prize in Chemistry in 2013. However, it is important to keep in mind that the computational study of biological systems at the molecular-level faces two daunting challenges: We must both accurately 101−106 calculate the energies and forces involved, but we must also sample all relevant states of a system. QM largely addresses the former, but how to extensively sample biological systems at the QM level of theory remains a challenging issue. Beyond the creation of faster and more accurate QM models, strategies to address the sampling issue will also have to be devised. This will likely be addressed via a combination of classical and QM models 82

■ ACKNOWLEDGMENTS
This research was generously supported by the NSF and the NIH. I would like to thank the many group members who contributed to the work described herein, whose names and specific contributions can be found in the references cited.