A simple and consistent quantum-chemical fragmentation scheme for proteins that includes two-body contributions

The Molecular Fractionation with Conjugate Caps (MFCC) method is a popular fragmentation method for the quantum-chemical treatment of proteins. However, it does not account for interactions between the amino acid fragments, such as intramolecular hydrogen bonding. Here, we present a combination of the MFCC fragmentation scheme with a second-order many-body expansion (MBE) that consistently accounts for all fragment – fragment, fragment – cap, and cap – cap interactions, while retaining the overall simplicity of the MFCC scheme with its chemically meaningful fragments. We show that with the resulting MFCC-MBE(2) scheme, the errors in the total energies of selected polypeptides and proteins can be reduced by up to one order of magnitude and relative energies of different protein conformers can be predicted accurately.

as electron densities and other properties of proteins. [18][19][20][21][22] In the simplest case, the protein is partitioned into single amino acids by cutting the peptide bonds. The dangling bonds are then capped with acetyl groups (ACE) and N-methylamide groups (NME) to preserve the local chemical environment (see Figure 1). Analogously, disulfide bridges between amino acids are cut at the S S bond and capped with methyl sulfide groups. 16 The total energy of a protein consisting of N amino acids can then be approximated as, that is, as the sum of all energies E frag i of the capped fragments, from which the energies E cap k,kþ1 ½ of the cap molecules, formed by combining the corresponding ACE and NME groups, are subtracted. In the first sum, the index i runs over all amino acid fragments, whereas in the second sum the subscript k,kþ1 ½ denotes the cap molecule that results from the cut between amino acids k and k þ 1.
Particular advantages of the MFCC scheme include the fact that it uses chemically intuitive fragments (i.e., the amino acid residues), its straight-forward implementation in combination with numerous quantum-chemical methods, and the possibility to execute the fragment calculations in an embarrassingly parallel fashion. The specific choice of conjugate caps ensures that all resulting fragment and cap molecules are chemically sensible and mimic the corresponding parts of the protein. However, the MFCC scheme is not exact and introduces errors compared to a supermolecular calculation for the full protein. In particular, the MFCC scheme only relies on calculations for the (capped) single amino acid fragments (i.e., only one-body terms) and interactions between the individual amino acids, such as those due to hydrogen bonding within the protein's secondary structure elements, are neglected.
Several approaches have been proposed to alleviate these shortcomings of the MFCC method and to improve its accuracy. Some additional interactions can be accounted for by using larger fragments consisting of more than one amino acid 19,23 or by using additional caps to model the severed hydrogen bonds. 24 In a similar spirit, numerous energy-based fragmentation methods using overlapping fragments have been proposed, 10,12,[25][26][27][28][29][30] which approach the accuracy of the full supermolecular calculation as the size of the fragments is enlarged. Alternatively, interactions between fragments can be partly included in the MFCC method by using a classical electrostatic [31][32][33] or density-based [34][35][36][37][38] embedding scheme.
A systematic and rigorous way of including interactions between fragments in quantum-chemical fragmentation methods is provided by the many-body expansion (MBE). [39][40][41][42][43][44][45] When considering molecular clusters instead of proteins, the total energy can be expanded using the MBE as, where E i is the total energy of the i-th molecular fragment, and is the interaction energy of the dimer consisting of the molecular fragments i and j, with the total energy of this dimer E ij . Three-body and higher-order contributions can be defined analogously. 46 However, combining the MBE with the MFCC fragmentation approach is not straightforward because of the ACE-NME caps that need to be introduced. To approximately describe two-body interactions in MFCC, a pairwise energy correction based on specific parts of the fragment density matrices has been proposed. 47 Similarly, for fragmentation methods with overlapping fragments, such as the molecular tailoring approach by Gadre and coworkers, specific missing two-body and three-body contributions can be added. 48 Very general schemes for systematically including many-body effects have also been developed. conventional MFCC method, the N-terminals of the fragments are then capped by acetyl groups (ACE) and the C-terminals by Nmethylamide groups (NME) (see Figure 1), and the total energy of the protein is approximated according to Equation (1).
Before discussing the combination of MFCC with the MBE, we first consider a simplified variant of the MFCC method, in which hydrogen atoms are used as caps. 23 In this case, the cap molecule is a dihydrogen molecule (see Figure 2 where E HÀMFCC tot is calculated according to Equation (1) (using hydrogen caps) and the two-body fragment-fragment interaction energy correction is given by where the interaction energy between fragments i and j is calculated as Here, E frag ij is the energy of the capped fragment consisting of amino acids i and j. If these are not neighbors along the protein chain (i.e., j ≠ i þ 1), this dimer will in total contain four caps (two for each amino acid). Otherwise (i.e., j ¼ i þ 1), it will only contain two caps, and it becomes necessary to subtract the energy of the cap molecule E cap i,j ½ that is lost when joining the two neighboring fragments.
We will refer to the two-body expansion described above as H-MFCC-MBE(2)-ff. To decrease the computational cost, it is possible to introduce a cutoff λ and to neglect fragment-fragment interactions ΔE ff ij if the distance between two fragments (defined as the shortest distance between any two atoms) is above this threshold. Introducing such a threshold ensures that the scaling of the computational effort with the number of fragments remains linear.

| MFCC with ACE-NME caps
Returning to the conventional MFCC method with ACE-NME caps, additional challenges appear when considering a two-body correction.
The first problem is illustrated in Figure 3 for a small part of a β-sheet structure consisting of amino acids A, B, and C in one strand and D, E, and F in the other strand (see top left part of Figure 3). The peptide groups in the two strands form hydrogen bonds, one of which is highlighted in Figure 3. It connects the N H group of the peptide unit between amino acids A and B (highlighted in blue) and the C O group of the peptide unit between amino acids D and E (highlighted in red).
The lower part of Figure 3 shows the fragments and cap molecules that are generated by the MFCC fragmentation, that is, amino acid fragments A to F and cap molecules AB and BC for the upper strand as well as cap molecules DE and EF for the lower strand. When calculating the MFCC total energy E MFCC tot [see Equation (1)], the blue and red groups are each added twice and subtracted once, that is, they are correctly included only once in the total energy.
The trouble starts when applying the fragment-fragment twobody contribution as given by Equation (5). Here, the interaction between the blue and the red peptide group is included four times instead of only once. It appears in the calculations for the fragment- This can be alleviated by considering not only the fragmentfragment interactions, but also the fragment-cap and cap-cap interaction energies, that is, where the fragment-cap interaction energies enter with a negative sign, F I G U R E 2 Illustration of the simplified H-MFCC scheme for an alanine dipeptide. The protein is cut at the peptide bonds, and the dangling bonds are capped with hydrogen atoms (red and blue), which form a dihydrogen cap molecule.
whereas the cap-cap interaction energies enter with a positive sign, The signs of these terms can be rationalized using the inclusionexclusion principle (see, e.g., Refs. [30,45,54,55] (8) and (9) will be discussed below.
The fragment-cap interaction energies in Equation (8) can be calculated as and similarly the cap-cap interaction energies in Equation (9) are given by where E fragÀcap i, k,kþ1 ½ and E capÀcap i, k,kþ1 ½ are the total energies calculated for a dimer combining fragment i with cap molecule k, k þ 1 ½ and cap molecule k, k þ 1 ½ with cap molecule l, l þ 1 ½ , respectively. We will refer to the two-body expansion described here (with the additional modifications discussed in the following section) as MFCC-MBE(2). Again, a cutoff λ for neglecting distant pairs of monomers (fragments and/or caps) can be introduced to decrease the required computational effort.
Revisiting the example in Figure 3, we now note that while the interaction between the peptides groups highlighted in red and blue appears four times with a positive sign, it now also appears four times

| Correcting for close contacts in MFCC-MBE(2)
For some interactions, close contacts between the ACE-NME caps are problematic and preclude a direct calculation of some terms in the expansion given in the previous section. First, we consider the fragment-fragment interaction energies ΔE fragÀfrag ij (see Equation (6)).
Here, calculating the 1-3 interaction between amino acid fragment i and i þ 2 is problematic because the caps of the two amino acid fragments become so close that they induce a spurious interaction (see right-hand side in Figure 4). This problem has been previously discussed in the context of fragmentation methods. 52,54 However, as these 1-3 interactions cannot be expected to be negligible, we decided to calculate these terms indirectly 23 as illustrated in Figure 4. Note that this requires calculations for trimers of F I G U R E 3 Illustration of the fragment-fragment, fragment-cap, and cap-cap interactions that appear in the MFCC-MBE(2) method for a short fragment of a β-sheet. The hydrogen bonding interaction between the peptide groups highlighted in blue and red appears in four fragment-fragment interaction terms, in four fragment-cap interaction terms, and in one cap-cap interaction term. See text for details.
directly connected amino acids. However, we still consider MFCC-MBE(2) a two-body expansion, as these trimer calculations are actually used to calculate specific (two-body) fragment-fragment interaction energies indirectly. Moreover, the number of considered trimers scales linearly with the number of fragments and thus does not increase the scaling of the computational effort with system size.
Altogether, the fragment-fragment interaction energies are thus calculated as, where E fragÀfragÀfrag refers to the energy of the trimer consisting of amino acid fragments i, i þ 1, and i þ 2.
Similarly, spurious close contacts appear for fragment-cap and cap-cap interactions if these are not at least one amino acid apart.
However, as it is generally not possible that these problematic  (8) and (9)].

| Implementation and computational details
The MFCC-MBE(2) method has been implemented in the PyADF scripting framework, 56   This computational effort can be reduced by introducing a cutoff λ for neglecting the interaction between monomers (fragments and/or caps) based on their distance. Here, we skip those terms in Equations (5), (8), and (9) for which the two monomers (fragments and/or caps) are further apart than λ, where the distance is measured as the shortest distance between any two of their atoms. Such a simple distance-based screening is commonly employed in the context of MBEs. 44,[76][77][78] It can significantly reduce the number of required single-point calculations (see Table 2) and restores a linear scaling of the computational effort with the system size.
To assess the accuracy of such a distance-based screening of the two-body interactions, Figure 5 plots the errors in the total energy calculated with MFCC-MBE(2) using hydrogen caps and ACE-NME caps as a function of the cutoff λ. Note that the errors shown for λ ¼ ∞ correspond to those given in Table 1

| Total energies of polypeptides and proteins
To test the accuracy of the MFCC-MBE(2) method for a broader test set of proteins, we considered the SEM5 SH3 domain (PDB-code 3SEM, 81 85 The small proteins human insulin (PDB code 3I40, 86 21 amino acids, 784 atoms) and the C-terminal domain of the ribosomal protein L7/L12 from escherichia coli (PDB code 1CTF, 87 68 amino acids, 1005 atoms) were also included in our test set. For comparison, the already discussed ubiquitin (PDB-code 1UBQ, 75 76 amino acids, 1231 atoms) is also included.
For these proteins, Figure 6  T A B L E 2 Number of monomer calculations (fragments and caps) and dimer calculations (fragment-fragment, fragment-cap, and cap-cap interactions) necessary with the MFCC-MBE(2) method for ubiquitin with and without using distance cutoffs. increases from left to right in the figure. In all cases, the inclusion of two-body interactions in MFCC-MBE(2) reduces the error in the total energy significantly. In the cases of 3SEM, 1UBQ, 1FKF, and 4PA1, we were able to reduce the error by a whole order of magnitude. In the case of 3I40, the error is reduced by a factor of six, and in the case

| Relative energies of 2LYW conformers
While MFCC-MBE(2) drastically reduces the error in calculated total energies compared to the conventional MFCC method, the remaining errors in the total energies are still rather high in several cases. However, errors in total energies are not very meaningful for many applications and a much better accuracy can be expected for errors in relative energies due to systematic error cancelation.
To assess the accuracy that can be reached for relative energies,

| CONCLUSIONS AND OUTLOOK
We have devised a combination of the popular quantum-chemical fragmentation method MFCC using ACE-NME caps with the MBE. Calculations for a test set of seven proteins as well as for alanine polypeptides confirm that the consistent inclusion of two-body contributions in MFCC-MBE(2) reduces errors in the total energies by up to one order of magnitude compared to the conventional MFCC scheme.
The remaining errors depend on the considered proteins, and tend to F I G U R E 8 Comparison of the relative energies (Orca/PBEh-3c with C-PCM water solvation model) of 10 conformers of 2LYW calculated with MFCC (orange) and MFCC-MBE(2) using a distance cutoff of 4 Å (green) as well as in supermolecular calculations (blue). All energies are given relative to conformer 5 and the conformers on the horizontal axis are ordered according to their energy in the supermolecular single-point calculations.
F I G U R E 7 Comparison of the error in the total energy (Turbomole/MP2/ def2-SVP) for the conventional MFCC method with ACE-NME caps (blue bars) to those for the MFCC-MBE(2) method using a distance cutoff of 4 Å (orange bars) for an (Ala) 10 polypeptide in α-helix, 3 10 -helix, and β-sheet structures, and for an (Ala) 11 β-turn. As reference, singlepoint calculations for the full polypeptides have been performed.
increase with their size. While the errors in total energies remain rather large with MFCC-MBE(2) (up to 832 kJ/mol for 2AK3 with 226 amino acids), they are below 5 kJ/mol per amino acid in all considered test cases. Moreover, in a comparison of relative energies for 10 conformers of 2LYW the MFCC-MBE(2) is able to predict their relative energies with good accuracy, while the conventional MFCC method fails to deliver the correct energetic ordering.
Our implementation of MFCC and MFCC-MBE(2) in the PyADF scripting framework is flexible and allows for the use of various quantum-chemical methods and different quantum-chemical program packages. Here, we employed PBEh-3c, DFT/BP86, and MP2. We did not observe a strong dependence of the accuracy on the quantumchemical method.
To further increase the accuracy of quantum-chemical fragmentation methods for proteins, we will consider the inclusion of higher-order many-body corrections as well as the use of suitable embedding schemes in the fragment calculations 34 49,52 is that it retains the simplicity of the MFCC scheme. In particular, it uses the amino acids as fragments, and it only requires calculations of these fragments and combinations thereof (i.e., fragmentfragment dimers, fragment-cap dimers, and cap-cap dimers), which makes its implementation straightforward. This simplicity will also make the inclusion of higher-order many-body contributions on top of MFCC easily possible.
The work presented here opens the door to efficient and accurate quantum-chemical calculations of protein-ligand interaction energies, which we will explore in our future work. In this case, it will become possible to simplify the expansions significantly by deleting terms that