Estimating Absolute Configurational Entropies of Macromolecules: The Minimally Coupled Subspace Approach

We develop a general minimally coupled subspace approach (MCSA) to compute absolute entropies of macromolecules, such as proteins, from computer generated canonical ensembles. Our approach overcomes limitations of current estimates such as the quasi-harmonic approximation which neglects non-linear and higher-order correlations as well as multi-minima characteristics of protein energy landscapes. Here, Full Correlation Analysis, adaptive kernel density estimation, and mutual information expansions are combined and high accuracy is demonstrated for a number of test systems ranging from alkanes to a 14 residue peptide. We further computed the configurational entropy for the full 67-residue cofactor of the TATA box binding protein illustrating that MCSA yields improved results also for large macromolecular systems.


Introduction
Entropies are key quantities in physics, chemistry, and biology. While free energy changes govern the direction of all chemical processes including reaction equilibria, entropy changes are the underlying driving forces of ligand binding, protein folding and other phenomena driven by hydrophobic effect. Traditionally calculating entropies from atomistic ensembles x 1 , . . . ,x n f gof n configurations x i [R 3N of a macromolecule of N atoms remains notoriously difficult.
We here propose and apply a method for calculating configurational entropies where r(x) denotes the configurational probability density r(x)ẽ xp ({bV (x))=Z c in the 3N dimensional configurational space governed by the potential energy V x ð Þ of the system. The fact that N is usually on the order of several hundreds or thousands renders the evaluation of this integral quite challenging despite a number of successful attempts. [1][2][3][4] These broadly fall into three classes, (i) special-purpose perturbation type approaches, also known as thermodynamic integration [5], (ii) step-by-step reconstruction methods, in particular the scanning procedures introduced by Meirovitch [6,7], (iii) direct approaches which analyse information readily available in standard equilibrium simulation trajectories [8][9][10].
While perturbation approaches provide relatively accurate free energy differences also for larger systems, accurate entropies are obtained only for smaller molecules. The main obstacle, which aggravates with system size, is the sampling problem, which severely limits the accuracy, in particular for explicit solvent models [2,5].
The most widely used direct method is the quasi-harmonic approximation [8] (QH), which provides an upper limit to the configurational entropy in terms of 3N independent classical or quantum mechanical harmonic oscillators [9,10], which is equivalent to approximating the configurational density r(x) by a multi-variate Gaussian function, with A {1~C derived from the covariance matrix [9,10] C~S x{SxT ð Þx{SxT ð Þ T T. However, for macromolecules undergoing large conformational motions the entropy is likely to be considerably smaller than this QH upper limit due to coupling and anharmonicities and, in particular, due to the existence of multiple conformational states [11][12][13][14]. Indeed, for smaller systems such as di-saccharides [15] or lipids [16], or small subsets of larger proteins [17] significantly lower entropies than with QH were obtained by inclusion of anharmonicities [11][12][13]18,19] and pairwise correlation of QH modes [20].

The MCSA Scheme
Here we develop a direct method consisting of three building blocks. Results for small test systems will be presented during this introduction of the methodology to illustrate the effect of each building block. Figure 1 shows that indeed for various small test systems (alkanes, dialanine and a complete 14-residue b-turn) the quasi-harmonic approximation severely overestimates the reference entropy. The reference values were obtained by thermodynamic integration (TI) gradually perturbing the systems towards an analytically tractable reference state consisting of noninteracting particles in harmonic wells, as described in methods and Refs. [21,22]. Entropy estimates obtained for all test systems are also summarized in Table 1.

Non-Parametric Density Estimation
As the first of the three building blocks of the methodology we recently introduced a non-parametric density estimation resting on adaptive anisotropic ellipsoidal kernels [21] that captures the configurational density in sufficient detail. Briefly, the configurational part of the entropy in a d-dimensional space is estimated from n configurations according to where Þ T x denotes the ensemble average of an adaptive anisotropic kernel function K, whose anisotropy and scaling r i,k depends on the local density at point x i , and whose L 1 -measure is denoted by Z d x i ð Þ. This formula simplifies to the well-known k-nearest neighbour entropy (k-NN) by fixing the kernel function to an (isotropic) sphere whose radius r i,k is chosen such that exactly k configurations are within the sphere centered at configuration x i . In this limiting case, Z d is the volume of the d-dimensional unit sphere. NN estimators in general are entirely non-parametric and, at a finite sample size n, have minimal bias [23] in any given number of dimensions d. A major drawback, however, is the fact that due to the so-called 'curse of dimensionality' [24] simple k-NN estimators are applicable for up to ten dimensional configurational spaces only [25]. In contrast, as can be seen in Fig. 1 (left, ''dir''-bar), adaptive anisotropic kernels yield accurate results even for the 45-dimensional configurational space of dialanine. For the more than 500-dimensional configurational space of the 14-residue b-turn, however, the 'curse of dimensionality' [24] renders it impossible to improve on the quasiharmonic approximation with direct density estimation alone ( Fig. 1 right). Convergence properties and full technical details of this first MCSA module are discussed in Ref. [21]. , and the C-terminal b{turn of Protein G (right, please note that here the units are kJ/(mol K)). Thermodynamic integration (TI), density estimates over the whole configurational space (dir), full correlation analyis with subsequent clustering and kernel density estimation (FCA), quasi-harmonic (QH) and mutual information expansion estimates of 2nd (MIE2) and 3rd (MIE3) order were obtained as described in the text. doi:10.1371/journal.pone.0009179.g001

Generation of Minimally Coupled Subspaces
As the second building block of our method, we apply an entropy invariant transformation T such that the usually highly coupled degrees of freedom separate into optimally uncoupled subspaces, each of which being sufficiently low-dimensional to render non-parametric density estimation applicable. As the most straightforward class of entropy invariant transformations, we consider here linear orthonormal transformations of the form y~T x{SxT ð Þ , with det T~1. More general transformations are currently explored [26]. We apply Full Correlation Analysis (FCA) [27] which minimizes mutual information by considering where y i denote the components of y and r (1) i y i ð Þ' 3N{1 Ð r(y)dy j=i the 1-dimensional marginal density along y i . This procedure minimizes non-linear correlations of second and higher order [27] and therefore generalizes the principal component analysis (PCA) which only considers linear correlations of second order. For complex macromolecules, however, even for the optimal linear FCA transformation T, considerable non-linear correlations between several degrees of freedom will remain and cannot be neglected. To address this issue, the FCA modes are subsequently clustered according to the generalized correlation coefficient [25,28] with the mutual information between components y i and y j . This is achieved by assigning mode indices j to m clusters C s such that all modes with correlation coefficients larger than a certain threshold h are assigned to the same cluster. This disjoint clustering defines an approximate factorization r y ð Þ& P m s~1 r (ds) s denotes the generalized d s -dimensional marginal density along 6 j[Cs y j . This factorization is approximate in the sense that for the entropy the residual entropy S res C s f g s~1,...,m h i is small.
Such approximate factorization, of course, neglects all intercluster correlations. These can be pairwise correlations, and thus are small vh ð Þ by construction, or higher-order correlations. For the latter we have to assume that they are also effectively eliminated by our threshold criterion. This assumption is supported by the observation that for the alkanes and for dialanine, with h~0:025, S dir &S FCA (cf. Fig. 1). Thus, our factorization yields accurate entropies and S res is indeed small.

Mutual Information Expansions for Oversized Clusters
However, for the larger molecules considered here, the necessarily small threshold typically results in at least one cluster being too large for a sufficiently accurate density estimate (e.g., for the b-turn d 1~1 08). Accordingly, while our factorization still improves the entropy estimate (cf. Fig. 1), S res cannot be neglected anymore. The third building block of our method addresses this issue by subdividing each oversized cluster into h s disjoint subclusters D (s) i of sizes d s 1 , . . . ,d s hs v15, C s~S hs i~1 D (s) i , irrespective of the necessarily remaining strong correlations between these. The residual entropy contributions to the configurational entropy ..,m h i will be drastically increased due to non-neglegible intra-cluster where r a :r (da) 6 . Expanding the mutual information terms up to second or third order, respectively, with the right-hand sum running over all possible permutations fi 1 , . . . ,i a g[f1, . . . ,kg, has proven sufficiently accurate in liquid state theory [29] and information theory [30,31]. Indeed, for the b-turn, inclusion of the remaining correlations via this expansion improved the entropy estimate (Fig. 1). For the other test systems S dir &S FCA &S MIE3 . In contrast, for some of the test systems S MIE2 vS TI , such that from our observations, 3rd order MIE provides a better estimate and an upper bound to the true entropy.
Applications of MIE to macro-molecular systems can be hampered by the curse of dimensionality and combinatorial explosion of the number of terms [32,33]. In this work, the problem is circumvented by clustering into sufficiently highdimensional (*15) subspaces which minimizes residual inter-D a correlations and delays the onset of the combinatorial explosion. At the same time the subspaces are sufficiently small that even for the 3rd-order MIE no direct density estimates beyond the critical dimensionality of d s~4 5 are required.

TATA Box Binding Protein: Protein Test Case and Error Estimate
Together, these three building blocks enable one to calculate configurational entropies even for larger biomolecules. We considered the 67-residue TATA box binding protein (TBP, pdb code 1TBA) inhibitor in two different configurations; complexed ( Fig. 2 top left) and free (Fig. 2 top right). To estimate the statistical error of MCSA and QH configurational entropy estimates, for both states five independent molecular dynamics (MD) simulations were carried out using the OPLS force-field [34] and the TIP4P explicit solvent model [35] (see methods section for full simulation details). Fig. 2 shows the results obtained by the five entropy estimation methods for both complexed (left) and free (right) inhibitor. All methods estimate the free cofactor's entropy to be significantly higher than that of the bound cofactor. As can be seen, for both complexed and free cofactor, QH yields the largest estimate. The first two MCSA modules combined (kernel density estimation on little correlated configurational subspaces obtained from FCA) already yield remarkably smaller estimates, irrespective of whether a high or a low clustering threshold h was chosen (hi thresh and low thresh in Fig. 2), i.e., chosing small but higher correlated subspaces or larger but lowly correlated subspaces provides similar estimates. Finally, employing all the three MCSA modules including MIE of 2nd (MIE2) and 3rd (MIE3) lowered the estimate again with, as before, the 2nd-order estimate being lower than the 3rd-order estimate.
The fact that the QH estimate is the largest in all cases corroborates the observations for the small test cases, and generally shows that MCSA yields improved estimates also for large macromolecules. Already the first two MCSA modules provide lower entropy estimates, even though relatively large configurational subspaces (d s~3 5 . . . 88, see Table 1) were obtained from FCA, which illustrates that indeed our kernel density estimator works accurately also for the complex highdimensional configurational spaces spanned by proteins. Further, the fact that the clustering threshold did not affect the final estimate very much naturally reflects the fact that clustering with a high threshold yields small subspaces C s which are correlated, such that S res C s f g s~1,...,m h i in Eq. 3 is large, increasing our estimate S r y ð Þ ½ . On the other hand, clustering with a small threshold gives rise to a small S res C s f g s~1,...,m h i but sparse sampling due to large d s then entails higher S r (ds) is also increased in this case. As expected, the third MCSA module, MIE, circumvents this problem and lowers the MCSA estimate further by 404 or 397 J=(molK) for the free and the complexed cofactor, respectively. The 2nd-order estimate is lower than the 3rd-order estimate in all cases, which shows that also for proteins the pair correlations are generally overestimated, and inclusion of 3rd-order correlations is indeed crucial.
The statistical errors are relatively small in all cases, but generally twice as large for the free than for the complexed cofactor. We attribute this observation to the larger inherent flexibility of the free state, and hence to insufficient molecular dynamics sampling. Consequently, the MIE error for the free cofactor is over three times larger than that of the the complex. Interestingly, the MIE estimate is slightly more affected with the error for the free cofactor being three-to fourfold as high as for the complex. Due to the high number of terms to be evaluated for the MIEs (Eq

Discussion
We have developed a minimally coupled subspace approach (MCSA) to estimate absolute macromolecular configurational entropies from structure ensembles which takes anharmonicities and higher-order correlations into account. The approach combines three building blocks which together allow one to calculate absolute entropies even for the highly complex configurational densities generated by the dynamics of biological macromolecules such as proteins. MCSA shares the versatility of the quasi-harmonic approach as it can be applied to unperturbed equilibrium trajectories while achieving the accuracy of specialpurpose perturbation type methods. The effective dimension reduction provided by the Full Correlation Analysis allows for the application of mutual information expansions to large macromolecules. Further, the adaptive kernel non-parametric density estimation method developed for MCSA requires much weaker a-priori assumptions about the properties of the configurational densities than (quasi-)harmonic approaches. The method is applicable also to large macromolecules such as proteins. In this study, we showed that MCSA applied to the TATA box binding protein yielded significantly smaller and thus improved entropy estimates.
We note that here we focus at configurational entropies of the solute only, thus missing both the solvent as well as the solvent/ solute parts. Using permutation reduction techniques [36], our method should be capable of capturing also these important contributions, which however lies outside the scope of the present work.

Thermodynamic Integration Reference Entropy
Absolute free energies for the test systems butane to decane, dialanine, and the ProteinG b-turn were calculated by thermodynamic integration (TI). Simulation parameters cf. below. The TI scheme we have chosen to obtain the Helmholtz free energy A of the fully interacting particles consists of two phases. Harmonic position restraints with a force constant k~25000 kJ=(mol nm 2 ) were slowly switched on for each atom in the first phase, and in the second phase all force-field components were gradually switched off. Within the second phase, the charges were switched off prior to the rest of the force field. After the second phase, the system consisted of non-interacting dummy particles with mass m oscillating in their respective harmonic position restraint potentials, i.e., The free energy of this harmonic system can be obtained analytically, where k j~k k j =m j denotes the mass-weighted force constant. Hence, the thermodynamic integration yields the absolute free energy  used, and the intermediate values of l i~0 , 1e-6, 5e-6, 1e-5, 5e-4, 1e-4, 1e-3, 1e-2, 2e-2, 3e-2, 5e-2, 7e-2, 9e-2, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1 were distributed unevenly to obtain approximately balanced DA i values. For each value of l a trajectory of 12:5 ns (alkanes and dialanine) or 125 ns (b-turn), respectively, was generated. The error estimates of the TI reference entropies detailed in Table 1 were obtained via two ways for the alkane test systems and dialanine. First, by averaging over five independent simulations and, second, by performing blockwise averaging as derived in Ref. [37] over each of the 23 V i (l) of each of these five trajectories. We found that the error estimates obtained by these two methods agree very well. Accordingly, for the b-turn only the block averaging method was applied and the resulting error estimates are also given in Table 1.

Molecular/Stochastic Dynamics Simulations
The test systems that were compared with a thermodynamic integration reference (butane to decane, dialanine, and the ProteinG b-turn) were set up as follows. Force-field parameterizations were obtained from the Dundee Prodrug server [38] based on the GROMOS united-atom force field [39]. Stochastic Dynamics simulations were performed using the molecular simulations package GROMACS [40] in vacuo at 400 K with friction constant c set to 10, dielectric constant e~1, integration step size of 0:0005 ps and no bond constraints. Positional restraints were applied to three adjacent terminal heavy atoms. To obtain MCSA error estimates, each of the simulations was carried out five times using different starting velocities. MCSA and QH entropy estimates were obtained from trajectories of lengths 12:5 ns (alkanes and dialanine) or 125 ns (b-turn), respectively, i.e. the TI entropy references required 23 times as much computing time as MCSA and QH estimates.
The TATA box binding protein (TBP) complex (protein database entry 1TBA) was simulated using the OPLS all atom force field [34] in explicit TIP4P solvent [35] and periodic boundary conditions. NpT ensembles were simulated, with the protein and solvent coupled separately to a 300-K heat bath (t~0:1 ps). [41] The systems were isotropically coupled to a pressure bath at 1 bar (t~1:0 ps) [41]. Application of the Lincs [42] and Settle [43] algorithms allowed for an integration time step of 2 fs. Short-range electrostatics and Lennard-Jones interactions were calculated within a cut-off of 1:0 nm, and the neighbour list was updated every 10 steps. The particle mesh Ewald (PME) method was used for the long-range electrostatic interactions [44], with a grid spacing of 0:12 nm. The free cofactor was simulated using the same parameters as above. The starting structure was obtained by removing the TBP from the X-ray structure of the complex and equilibrating for 2 ns. Entropy estimates and corresponding errors for both complexed and free cofactor were obtained from five trajectories of 200 ns length each.

Mutual Information Expansions Implementation Details
Fill modes. Due to the moderate regularization assumptions, our adaptive kernel density estimator is sensitive to the sparse sampling problem whose effect is highly dependent on the dimensionality. To guarantee the same accuracy of all density estimates required for the computation of the correlation terms I n of Eq. 5 despite different dimensionality it is, thus, necessary to ensure the same local densities around points y i in different terms. This is normally not provided. The mutual information between two modes y i and y j , contains differently well sampled terms in denominator and numerator, because the number of sampling points available to estimate r(y i ,y j ) is only half the number of sampling points available for estimating the marginal densities r(y i ) and r(y j ) (see Fig. 3). The accuracy for the estimation of the marginal densities is, consequently, possibly higher than the joint estimate yielding an inaccurate correlation estimate. To overcome this problem, we devised the concept of fill modes. Accordingly, artificially decorrelated modes y i 0 : fy' i,1 , . . . ,y' i,3N g~permfy i,1 , . . . ,y i,3N g are created by permuting its components y i,j , with 1ƒjƒ3N. The marginal densities r(y i 0 )~r(y i ) and r(y i 0 ,y j )~r(y i )r(y j ), yielding a new expression for Eq. 6, where the product of the marginal densities r(y i ) and r(y j ) is now computed from the synthetically decorrelated joint distribution r(y i 0 ,y j ), such that the same accuracy for the joint estimate is guaranteed as for the marginal estimates. Conducting this scheme on the 3rd order correlation function of three modes y i , y j and y k , I 3~ð ivjvk r(y i ,y j ,y k ) ln r(y i ,y j ,y k ) r(y i ,y j )r(y i ,y k )r(y j ,y k ) r(y i )r(y j )r(y k ) , yields I 3~ð ivjvk r(y i ,y j ,y k ) ln r(y i ,y j ,y k ) r(y i ,y j ,y k 0 )r(y i ,y k ,y j 0 )r(y j ,y k ,y i 0 ) where the pairwise joint distributions have been 'filled up' with permuted 'fill modes', as described above, e.g. r(y i ,y j )r (y i ,y j ,y k 0 )=r(y k 0 ).

Consistent dimensions.
The sensitivity of the nearestneigbour estimates, Eq. 2, towards the sparse sampling problem also affects the different terms of Eq. 5, which inevitably suffer from different sparse sampling problems if computed separately. Furthermore, a huge number of probability density distributions r(y i ),r(y i ,y j ), . . . ,r(y i ,y j , . . . ,y k ) is computed more than once for the many instances of identical correlation terms appearing in that equation. Expanding over entropy terms rather than correlation terms, in contrast, yields S r y 1 , . . . ,y n ð Þ ½ X t k~1 X m 1 v:::vm k q k,t S r y 1 , . . . ,y k ð Þ ½ , ð9Þ where the first summation runs over different orders k~1, . . . ,t until truncation order tƒn. q k,t~P t i~k ({1) izk n{k i{k designates how many times a certain order appears and whether it needs to be added or subtracted, and the second sum over all n k possible combinations fm 1 , . . . ,m k g[f1, . . . ,ng. To guarantee the same estimation accuracy for all r y 1 , . . . ,y k ð Þof Eq. 9, each term is filled up to truncation order t yielding

Author Contributions
Conceived and designed the experiments: UH OFL HG. Performed the experiments: UH OFL. Analyzed the data: UH OFL. Contributed reagents/materials/analysis tools: UH OFL. Wrote the paper: UH OFL HG. Figure 3. Principle of fill modes. a) Two arbitrarily correlated modes y i and y j marginally distributed on the axes. Correlation is clearly visible from the y j -distributed y i . The joint distribution r(y i ,y j ) is more sparsely sampled than both marginal distributions. b) The y j 0 -distributed y i is decorrelated and has exactly as many sample points as the joint distribution in a), allowing precise computation of I 2 (y i ,y j ). doi:10.1371/journal.pone.0009179.g003