Encoding prior knowledge in ensemble refinement

The proper balancing of information from experiment and theory is a long-standing problem in the analysis of noisy and incomplete data. Viewed as a Pareto optimization problem, improved agreement with the experimental data comes at the expense of growing inconsistencies with the theoretical reference model. Here, we propose how to set the exchange rate a priori to properly balance this trade-off. We focus on gentle ensemble refinement, where the difference between the potential energy surfaces of the reference and refined models is small on a thermal scale. By relating the variance of this energy difference to the Kullback-Leibler divergence between the respective Boltzmann distributions, one can encode prior knowledge about energy uncertainties.


I. INTRODUCTION
In the natural sciences, we frequently encounter the challenge of analyzing noisy and incomplete experimental data using theoretical models. The number of parameters in the models often exceeds the number of available data points. Image reconstruction is the textbook example of this class of ill-defined inverse problems.1 Another important class of problems is ensemble refinement, as performed in integrative structural biology and hybrid modeling.3,4 In such cases, we have to take into account statistical and systematic errors in the model and in the data and balance the information provided by experiment and theory. While maximum entropy (MaxEnt) and Bayesian methods address these challenges, the balancing of information has proven to be a long-standing and difficult problem.5

The balancing of information from experiment and theory can be viewed as a Pareto optimization problem (Fig. 1). In a multiobjective optimization, we seek to minimize both the mean-squared deviation from the experimental data, χ², with respect to the statistical weights in the refined model, and the deviation of these weights from a given reference model, as quantified by the Kullback-Leibler (KL) divergence6 S_KL. In the plane spanned by S_KL and χ², the set of weights where one is optimal for a fixed value of the other defines a Pareto front. On this Pareto front, we then seek a particular solution for which we consider the trade-off between increases in S_KL and decreases in χ² to be fair. To encode the relative value given to S_KL and χ², we introduce a parameter θ that sets the so-called marginal rate of substitution, dχ²/dS_KL = −2θ. Minimization of a loss function χ²/2 + θS_KL with respect to the model weights then gives us a specific set of refined weights on the Pareto front (or L-curve7) with an optimal trade-off.

FIG. 1. The curve of optimal solutions in the plane spanned by the deviations from experiment, χ², and the deviations from theory, S_KL, determines the Pareto front or L-curve (red line). The sub-optimal solutions (orange shaded area) lie above this curve. We encode our prior knowledge by choosing a value of the so-called marginal rate of substitution, given by −2θ here. We use this rate to trade off the two deviations according to our prior experience, dχ² = −2θ dS_KL, which gives a unique solution (blue disk). At this point, the slope of the Pareto front is exactly the marginal rate, as indicated by the blue line.
The proper choice of the θ value5 has been tackled by different workarounds. As a result, various methods for image reconstruction, ensemble refinement, and related problems can be distinguished by their choice of a solution parameterized by a multiplicative parameter θ of the KL divergence: solutions are chosen by L-curve analysis;7-10 by using algorithms to determine an elbow of the L-curve;11,12 so that the resulting error is the most likely according to some statistics;1 by putting a prior on θ and integrating it out;13,14 by selecting a perfect fit, ignoring errors completely, as in classical MaxEnt methods;15,16 by not including θ as a parameter at all, which corresponds to setting θ = 1;17 or by cross-validation.18 While these workarounds are useful in practice, they generally do not balance information from experiment and theory properly in the sense of using a priori knowledge. As a result, the solutions obtained tend to be overfitted or underfitted. Moreover, a solution is not always guaranteed to exist for some of these workarounds.
Here, we propose how to choose the θ parameter a priori for gentle ensemble refinement. Often, ensemble refinement is quite crude, such that the optimal ensemble deviates drastically from the reference ensemble. In gentle refinement, we take care that this is not the case. In this regime, the expected KL divergence $\bar{S}_{\mathrm{KL}}$ can be approximated in terms of the mean-squared error of the potential energy surface U defining the force field used to create the reference ensemble, $\bar{S}_{\mathrm{KL}} \approx \mathrm{var}(\beta\Delta U)/2$. Here, energies are measured in units of the thermal energy, β = 1/(k_B T), in the equilibrium Boltzmann distributions of the reference and refined ensembles, with k_B being the Boltzmann constant and T being the absolute temperature. We propose to use this physically meaningful relation to encode prior knowledge about the expected force-field error by fixing the Pareto exchange rate at −2θ with

$$\theta = \frac{2}{\mathrm{var}(\beta\Delta U)}. \tag{1}$$

The mean-squared energy error depends on the type and number of observables used for refinement and on the thermodynamic state.

This article is organized as follows. In Sec. II A, we first introduce ensemble refinement with a focus on the Bayesian inference of ensembles (BioEn) method.8,9 We relate the KL divergence for Boltzmann distributions to the mean of the reduced energy difference in Sec. II B.
We derive an approximation for the KL divergence as half the variance of the reduced energy difference in Subsection II C. Moreover, we show that the KL divergence is approximately symmetric with respect to its arguments. In Subsection II D, we show that these approximations are exact in the Gaussian case. In Subsection II F, we discuss how to use this information to encode our prior knowledge in the entropic prior used in BioEn and related methods. In Subsection II G, we describe how to estimate the relevant energy uncertainty in the space of observables. We present three example systems in Sec. III. For these examples, we quantify the validity of the approximation of the KL divergence by the energy variance and illustrate the benefits and limits of gentle ensemble refinement in Sec. IV. In Sec. V, we discuss practical aspects of gentle ensemble refinement. We end in Sec. VI with concluding remarks and the implications of our results for ill-defined inverse problems in general.

A. Background
The BioEn posterior9 for a sampled ensemble with N conformations and normalized reference weights $w^{(0)} = (w_1^{(0)}, \ldots, w_N^{(0)})$ is given by

$$p(w \mid Y) \propto p(w)\, L(Y \mid w),$$

where $w = (w_1, \ldots, w_N)$ is the vector of the normalized weights we want to find by refinement. The so-called entropic prior19 is given by

$$p(w) \propto e^{-\theta S_{\mathrm{KL}}(w \| w^{(0)})}, \tag{3}$$

where θ is the confidence parameter and

$$S_{\mathrm{KL}}(w \| w^{(0)}) = \sum_{\alpha=1}^{N} w_\alpha \ln \frac{w_\alpha}{w_\alpha^{(0)}} \tag{4}$$

is the KL divergence or relative entropy; θ encodes how much we trust our original ensemble. For independent Gaussian errors, for example, the likelihood of the measured data is given by

$$L(Y \mid w) \propto e^{-\chi^2/2},$$

where

$$\chi^2 = \sum_{i=1}^{M} \frac{\left(\langle y_i \rangle - Y_i\right)^2}{\sigma_i^2}.$$

Here, $\langle y_i \rangle = \sum_{\alpha=1}^{N} w_\alpha y_i^\alpha$ are the averages of the calculated observables $y_i^\alpha$ for conformation α, with indices i = 1, …, M for the measured averages $Y_i$ with associated errors $\sigma_i$, both theoretical and experimental. The theoretical errors primarily reflect uncertainties in calculating the observables y(x) for given conformations x using simplified models of the experiment. By contrast, we account for errors in the weights in the entropic prior, Eq. (3), through the confidence parameter θ. Note that the structural ensembles do not have to come from molecular simulations20-22 and that the data do not necessarily have to be from experiment.
To find the optimal solutions, we maximize the BioEn posterior or, equivalently, minimize the negative log-posterior given by

$$-\ln p(w \mid Y) = \chi^2/2 + \theta S_{\mathrm{KL}}(w \| w^{(0)}) + \mathrm{const},$$

which is the loss function mentioned in the Introduction. This formulation was originally developed for the EROS method8 based on a method for image reconstruction.1 Since then, it has entered into the BME method,23 for example. Note that any method that is equivalent to refinement by replica simulations24,25 is also equivalent to the EROS/BioEn method with a properly chosen coupling constant and in the limit of an infinite number of replicas.9 As discussed in Ref. 9, the dependence on the number of replicas can be removed with a subsequent BioEn refinement.
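The minimization of this loss function can be sketched numerically. The following is a minimal illustration, not the forces method implemented in the BioEn software: it parameterizes the weights by a softmax over log-weights g (an assumption made here to enforce positivity and normalization) and minimizes χ²/2 + θS_KL with scipy.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import xlogy

def neg_log_posterior(g, y, Y, sigma, w0, theta):
    """BioEn-style loss chi^2/2 + theta*S_KL for weights w ~ w0*exp(g)."""
    w = w0 * np.exp(g - g.max())
    w /= w.sum()
    chi2 = np.sum(((w @ y) - Y) ** 2 / sigma ** 2)  # y: (N, M) observables
    s_kl = np.sum(xlogy(w, w / w0))                 # KL divergence to w0
    return 0.5 * chi2 + theta * s_kl

def refine(y, Y, sigma, w0, theta):
    """Return refined normalized weights for confidence parameter theta."""
    res = minimize(neg_log_posterior, np.zeros(len(w0)),
                   args=(y, Y, sigma, w0, theta))
    w = w0 * np.exp(res.x - res.x.max())
    return w / w.sum()
```

Decreasing θ moves the solution along the Pareto front toward the data; for very large θ, the refined weights stay close to the reference weights.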
We next explain how we calculate and approximate the KL divergence for Boltzmann distributions. In the following, we focus on the continuous distributions q(x) and p(x) underlying the reference weights $w^{(0)}$ and the optimal weights w, respectively.

B. Kullback-Leibler divergence for Boltzmann distributions
For continuous probability densities p(x) and q(x), the KL divergence6 is defined as

$$S_{\mathrm{KL}}(p \| q) = \int \mathrm{d}x\, p(x) \ln \frac{p(x)}{q(x)} = \left\langle \ln \frac{p(x)}{q(x)} \right\rangle_p. \tag{8}$$

In the refinement of an isothermal ensemble, x represents 3N-dimensional conformations, q(x) is the reference distribution underlying the simulations, and p(x) is the refined distribution. The angular brackets with subscript "p" indicate the expectation value with respect to p(x).
For the sake of generality, we point out here that the KL divergence in Eq. (8) is the expectation, with respect to p(x), of the information difference

$$\ln \frac{p(x)}{m(x)} - \ln \frac{q(x)}{m(x)}. \tag{9}$$

Jaynes's measure26 m(x) guarantees the invariance of information under variable transformation. Although it cancels in the expression above, m(x) is needed to define a proper difference between the information content of the two ensembles.27

Let us assume that the probability distributions are given by Boltzmann distributions. For a potential energy surface $U_q(x)$ defining the reference ensemble, we then have

$$q(x) = \frac{e^{-\beta U_q(x)}}{Q_q},$$

where the normalization constant $Q_q$ is the partition function,

$$Q_q = \int \mathrm{d}x\, e^{-\beta U_q(x)} \equiv e^{-\beta F_q}, \tag{11}$$

with $F_q = -k_B T \ln Q_q$ being the free energy. We analogously define p(x), $Q_p$, and $F_p$ for the potential energy surface $U_p(x)$ of the refined ensemble. In an isothermal ensemble at inverse temperature β = 1/(k_B T), the energy of conformation x is given by $U_q(x) = E_q(x)$, where $E_q(x)$ is its potential energy. In an isobaric-isothermal ensemble, $U_q(x) = E_q(x) + pV_q(x)$, where p is the pressure and V(x) is the box volume for conformation x.
To evaluate the KL divergence, Eq. (8), for Boltzmann distributions, we use that

$$\ln \frac{p(x)}{q(x)} = -\beta \Delta U(x) + \beta \Delta F,$$

where $\Delta U(x) \equiv U_p(x) - U_q(x)$ is the energy difference and $\Delta F \equiv F_p - F_q$ is the free-energy difference. We thus obtain

$$p(x) = q(x)\, e^{-\Delta u(x)}, \tag{13}$$

where we introduced the reduced energy difference

$$\Delta u(x) \equiv \beta \Delta U(x) - \beta \Delta F \tag{14}$$

between the force fields of the two Boltzmann distributions p(x) and q(x). Importantly, these energy differences are uniquely determined because additive constants in the energies cancel in the Boltzmann distributions due to normalization. Consequently, for Boltzmann distributions, the KL divergence can be written as an average of the reduced energy difference,

$$S_{\mathrm{KL}}(p \| q) = -\langle \Delta u \rangle_p. \tag{15}$$

Conversely, the reference distribution can be expressed as

$$q(x) = p(x)\, e^{\Delta u(x)}. \tag{16}$$

The reduced energy differences tell us how we have to change the energies of the reference Boltzmann distribution q(x) to sample the optimal ensemble according to Eq. (13), i.e., they provide a physical interpretation of the information differences introduced in Eq. (9).
To estimate the reference and refined weights, and thus the reduced energy differences, we do not need to calculate any partition functions. On the contrary, we can actually numerically estimate the free-energy difference and thus the log-ratio of the partition functions. For two different force fields, we can estimate the free-energy difference from the KL divergence using Eqs. (14) and (15),

$$\beta \Delta F = S_{\mathrm{KL}}(p \| q) + \langle \beta \Delta U \rangle_p. \tag{17}$$

This expression is akin to the definitions of the Helmholtz free energy for the isothermal ensemble and of the Gibbs free energy for the isothermal-isobaric ensemble, where $k_B S_{\mathrm{KL}}(p \| q)$ corresponds to the negative entropy. We obtain another familiar-looking equation if we perform the average in the q-ensemble,

$$\beta \Delta F = -\ln \langle e^{-\beta \Delta U} \rangle_q. \tag{18}$$

The Journal of Chemical Physics
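Both free-energy expressions can be checked numerically. Below is a small quadrature sketch, assuming a toy pair of Gaussian Boltzmann distributions (β = 1, chosen here only for illustration), verifying that βΔF = S_KL(p‖q) + ⟨βΔU⟩_p and βΔF = −ln⟨e^(−βΔU)⟩_q agree with the exact log-ratio of the partition functions.

```python
import numpy as np

# Reduced potentials (beta = 1): q ~ N(0, 1), p ~ N(0, s^2)
s = 0.8
x = np.linspace(-10.0, 10.0, 200001)
dx = x[1] - x[0]
Uq = 0.5 * x**2
Up = 0.5 * x**2 / s**2

Qq = np.sum(np.exp(-Uq)) * dx
Qp = np.sum(np.exp(-Up)) * dx
dF_exact = -np.log(Qp / Qq)                   # beta*DeltaF = -ln(Qp/Qq)

q = np.exp(-Uq) / Qq
p = np.exp(-Up) / Qp
dU = Up - Uq                                  # beta*DeltaU

s_kl = np.sum(p * np.log(p / q)) * dx         # S_KL(p||q)
dF_p = s_kl + np.sum(p * dU) * dx             # average in the p-ensemble
dF_q = -np.log(np.sum(q * np.exp(-dU)) * dx)  # average in the q-ensemble
```

For this pair of Gaussians, all three estimates equal −ln s.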

ARTICLE pubs.aip.org/aip/jcp
For finite ensembles, we estimate the KL divergence, Eq. (8), numerically using Eq. (4). In Appendix A, we show how to properly interpret the discrete reference weights $w_\alpha^{(0)}$ and refined weights $w_\alpha$ for ensembles sampled from arbitrary distributions. Thus, all results derived here for continuous distributions also apply to discrete ensembles. We next use that the KL divergence is given by the mean reduced energy difference and relate it to the variance in the regime of gentle ensemble refinement.

C. KL divergence approximations
In the following, we rewrite averages of exponential functions as cumulant expansions, which lead to simple expressions for Gaussian distributions.28,29 We calculate the mean energy change determining the KL divergence, Eq. (15), by introducing the refined distribution function p(Δu) of the energy differences,

$$p(\Delta u) = \int \mathrm{d}x\, p(x)\, \delta[\Delta u - \Delta u(x)],$$

where δ[⋅] is the Dirac delta function. Analogously, we define q(Δu) for the reference distribution. Equation (16) becomes

$$q(\Delta u) = p(\Delta u)\, e^{\Delta u}.$$

We integrate both sides of this equation over Δu and use that q(Δu) is normalized, such that

$$\langle e^{\Delta u} \rangle_p = 1. \tag{20}$$

Introducing the cumulant generating function of the p-ensemble,

$$G(t) = \ln \langle e^{t\,\Delta u} \rangle_p = \sum_{n=1}^{\infty} \kappa_n \frac{t^n}{n!},$$

where $\kappa_n$ is the nth cumulant of the p-ensemble, we have $\ln \langle e^{\Delta u} \rangle_p = G(1) = 0$. By solving G(1) = 0 for $\kappa_1$, we obtain

$$\kappa_1 = -\frac{\kappa_2}{2} - \sum_{n=3}^{\infty} \frac{\kappa_n}{n!}. \tag{23}$$

We have used that the first two cumulants are the mean, $\kappa_1 = \langle \Delta u \rangle_p = \mu$, and the variance, $\kappa_2 = \mathrm{var}_p(\Delta u)$. Consequently, the KL divergence, Eq. (15), is given by a sum of cumulants,

$$S_{\mathrm{KL}}(p \| q) = -\kappa_1 = \frac{\kappa_2}{2} + \sum_{n=3}^{\infty} \frac{\kappa_n}{n!}. \tag{24}$$

The leading term is half the variance. For small errors in the energy, |Δu| ≪ 1, the cumulants of order n = 3 and higher can be ignored, allowing us to approximate the KL divergence in Eq. (15) by

$$S_{\mathrm{KL}}(p \| q) \approx \frac{\mathrm{var}(\Delta u)}{2}. \tag{25}$$

We have dropped the subscript "p" for the average and variance. That is, these quantities without a subscript always refer to the refined ensemble p(x).
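The leading-order relation S_KL(p‖q) ≈ var(Δu)/2 can be illustrated directly with a reweighted sample. The sketch below uses a hypothetical linear force-field perturbation, βΔU = −εx of a standard normal reference (our own toy choice, not an example from this work), and estimates the KL divergence and the variance from the sample weights.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.3                              # small perturbation, |du| << 1
x = rng.normal(0.0, 1.0, 500000)       # samples from the reference q
dU = -eps * x                          # beta*DeltaU of the tilted potential
dF = -np.log(np.mean(np.exp(-dU)))     # beta*DeltaF estimated in q
du = dU - dF                           # reduced energy difference

w = np.exp(-du)                        # relative weights p(x)/q(x)
w /= w.sum()

s_kl = -np.sum(w * du)                 # S_KL(p||q) = -<du>_p
var_du = np.sum(w * du**2) - np.sum(w * du) ** 2
```

For this Gaussian toy case, the exact KL divergence is ε²/2 = 0.045, and half the weighted variance reproduces it.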
The KL divergence is equally well approximated by the variance of the configurational energy difference βΔU. The free-energy difference ΔF cancels in the variance, Eq. (14), such that var(Δu) = var(βΔU) and

$$S_{\mathrm{KL}}(p \| q) \approx \frac{\mathrm{var}(\beta \Delta U)}{2}. \tag{26}$$

We next show that for small errors |Δu| in the force field, where higher cumulants can be ignored, the KL divergence is approximately symmetric with respect to its arguments, $S_{\mathrm{KL}}(p \| q) \approx S_{\mathrm{KL}}(q \| p)$, where

$$S_{\mathrm{KL}}(q \| p) = \int \mathrm{d}x\, q(x) \ln \frac{q(x)}{p(x)} = \langle \Delta u \rangle_q. \tag{27}$$

We can express the average of Δu in the q-ensemble using the cumulant expansion as before. Using that p(Δu) is normalized, we obtain $\ln \langle e^{-\Delta u} \rangle_q = H(-1) = 0$, where we introduced the cumulant generating function for the q-ensemble,

$$H(t) = \ln \langle e^{t\,\Delta u} \rangle_q = \sum_{n=1}^{\infty} \lambda_n \frac{t^n}{n!},$$

with $\lambda_n$ being the nth cumulant of the q-ensemble. Evaluating H(−1) = 0, we obtain

$$\lambda_1 = \frac{\lambda_2}{2} + \sum_{n=3}^{\infty} (-1)^n \frac{\lambda_n}{n!},$$

where we used that $\int \mathrm{d}\Delta u\, q(\Delta u)\, e^{-\Delta u} = \langle e^{-\Delta u} \rangle_q = 1$, Eq. (20). For |Δu| ≪ 1, the higher-order cumulants can again be ignored, such that

$$S_{\mathrm{KL}}(q \| p) = \lambda_1 \approx \frac{\lambda_2}{2} = \frac{\mathrm{var}_q(\Delta u)}{2}. \tag{30}$$

The cumulants of the p-ensemble can be expressed by the cumulants of the q-ensemble and vice versa.28 From Eq. (20), it follows that G(t) = H(t − 1). Inserting the corresponding cumulant expansions on both sides of this equation and collecting equal powers of t by applying the binomial theorem to (t − 1)^n, we obtain

$$\kappa_n = \sum_{m=n}^{\infty} \frac{(-1)^{m-n}}{(m-n)!} \lambda_m. \tag{32}$$

Analogously, we obtain from H(t) = G(t + 1)

$$\lambda_n = \sum_{m=n}^{\infty} \frac{\kappa_m}{(m-n)!}. \tag{33}$$

From Eq. (32) and for |Δu| ≪ 1, we obtain $\kappa_1 = \lambda_1 - \lambda_2 + \cdots \approx -\lambda_1$, where we have used $\lambda_2 \approx 2\lambda_1$, Eq. (30). We thus find that the two definitions of the KL divergence are approximately equivalent,

$$S_{\mathrm{KL}}(p \| q) = -\kappa_1 \approx \lambda_1 = S_{\mathrm{KL}}(q \| p). \tag{34}$$

We obtain the same result using Eqs. (24) and (33).

D. Exact results for Gaussian energy distributions
If the refined distribution p(Δu) of the energy error is Gaussian, then the above approximations are exact.28,29 This case not only serves as a reference but is also of practical interest, as we shall see below for one of the examples, Ala5.
For Gaussian distributions, all cumulants beyond the variance are zero. Equation (23) becomes

$$\kappa_1 = -\frac{\kappa_2}{2}, \quad \text{i.e.,} \quad \mu = -\frac{\mathrm{var}(\Delta u)}{2}. \tag{35}$$

This constraint on the Gaussian distribution derives from Eq. (20) for normalized distributions p(Δu) and q(Δu).
In this Gaussian case, the approximation of the KL divergence by half the variance, Eqs. (25) and (26), is exact, i.e.,

$$S_{\mathrm{KL}}(p \| q) = -\mu = \frac{\mathrm{var}(\Delta u)}{2}.$$

Moreover, in the Gaussian case, the KL divergence is exactly symmetric with respect to its arguments, i.e., $S_{\mathrm{KL}}(p \| q) = S_{\mathrm{KL}}(q \| p)$.30 Using Eq. (20), we can rewrite

$$S_{\mathrm{KL}}(q \| p) = \langle \Delta u \rangle_q = \langle \Delta u\, e^{\Delta u} \rangle_p = \mu + \mathrm{var}(\Delta u) = \frac{\mathrm{var}(\Delta u)}{2},$$

where we used Eq. (35).
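A concrete Gaussian check can be done by quadrature. In the assumed toy case below (refined ensemble p = N(m, 1), reference q = N(0, 1), for which Δu = −mx + m²/2 and βΔF = 0, our own illustration), the mean obeys μ = −var(Δu)/2, and both directions of the KL divergence equal var(Δu)/2 exactly.

```python
import numpy as np

m = 0.8                          # shift of the refined Gaussian p = N(m, 1)
x = np.linspace(-12.0, 12.0, 100001)
dx = x[1] - x[0]
p = np.exp(-0.5 * (x - m) ** 2); p /= p.sum() * dx
q = np.exp(-0.5 * x**2);         q /= q.sum() * dx

du = -m * x + 0.5 * m**2         # beta*DeltaU - beta*DeltaF (DeltaF = 0 here)

mu = np.sum(p * du) * dx                   # <du>_p
var = np.sum(p * du**2) * dx - mu**2       # var_p(du)
skl_pq = np.sum(p * np.log(p / q)) * dx    # S_KL(p||q)
skl_qp = np.sum(q * np.log(q / p)) * dx    # S_KL(q||p)
```

Here μ = −0.32, var(Δu) = 0.64, and both KL divergences equal 0.32.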

E. Relation to free energy perturbation theory
We can now use the different expressions for the KL divergence to approximate the free-energy difference ΔF as in free energy perturbation theory.31 Inserting Eq. (26) into Eq. (17), we obtain

$$\beta \Delta F \approx \langle \beta \Delta U \rangle_p + \frac{\mathrm{var}(\beta \Delta U)}{2}.$$

Usually, ΔF is calculated in the q-ensemble, Eq. (18), for which the corresponding second-order approximation reads

$$\beta \Delta F \approx \langle \beta \Delta U \rangle_q - \frac{\mathrm{var}_q(\beta \Delta U)}{2}.$$

These approximations become exact for Gaussian distributions of the potential energy difference ΔU,30 where $S_{\mathrm{KL}}(p \| q) = S_{\mathrm{KL}}(q \| p) = \mathrm{var}(\beta \Delta U)/2 = \mathrm{var}_q(\beta \Delta U)/2$.
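The quality of the second-order estimate can be probed with a deliberately non-Gaussian perturbation. In the sketch below (an assumed quadratic perturbation βΔU = εx² of a standard normal q-ensemble, our own toy case), the exact result βΔF = ln(1 + 2ε)/2 differs from the cumulant estimate only at third order in ε.

```python
import numpy as np

eps = 0.1
x = np.linspace(-10.0, 10.0, 200001)
dx = x[1] - x[0]
q = np.exp(-0.5 * x**2); q /= q.sum() * dx
dU = eps * x**2                                    # beta*DeltaU

mean_q = np.sum(q * dU) * dx                       # <beta*dU>_q = eps
var_q = np.sum(q * dU**2) * dx - mean_q**2         # var_q = 2*eps^2
dF_approx = mean_q - 0.5 * var_q                   # second-order estimate
dF_exact = -np.log(np.sum(q * np.exp(-dU)) * dx)   # = ln(1 + 2*eps)/2
```

With ε = 0.1, the estimate is 0.09 and the exact value is ln(1.2)/2 ≈ 0.0912; the small residual is dominated by the third cumulant.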

F. Encoding prior knowledge in ensemble refinement
The relation of the KL divergence to the reduced energy differences allows us to connect the more abstract expectation of the information difference, Eq. (8), to the more physical quantity of the variance of the reduced energy difference, Eq. (26). We use the latter to choose the confidence parameter θ, which defines the rate at which the entropic prior exp(−θS_KL) in Eq. (3) decreases with increasing KL divergence.
The choice of the prior as an exponential function of the KL divergence can be motivated by the maximum entropy principle.15 If we only know the expectation value of the KL divergence, $\bar{S}_{\mathrm{KL}}$, then the maximum entropy distribution of the KL divergence is proportional to $\exp(-S_{\mathrm{KL}}/\bar{S}_{\mathrm{KL}})$. In this case,

$$\theta = \frac{1}{\bar{S}_{\mathrm{KL}}}. \tag{41}$$

In gentle ensemble refinement, we demand that $S_{\mathrm{KL}} \lesssim 1$, $S_{\mathrm{KL}}(p \| q) \approx S_{\mathrm{KL}}(q \| p)$, and $S_{\mathrm{KL}} \approx \mathrm{var}(\beta \Delta U)/2$ according to Eq. (26). For a given force-field error, we then expect $\bar{S}_{\mathrm{KL}} \approx \mathrm{var}(\beta \Delta U)/2$. According to Eq. (41), we set θ ≅ 2/var(βΔU) a priori, as in Eq. (1).
Beyond gentle ensemble refinement, where the expectation of the KL divergence is no longer determined by var(βΔU), we have to set $\bar{S}_{\mathrm{KL}}$ directly to encode our prior knowledge.
The choice of θ in the prior can be validated by checking its consistency with the optimal value of the KL divergence a posteriori. If we evaluate the prior for the given θ and the corresponding optimal S_KL value, then we expect a reasonably high value of the prior; that is, θS_KL should be of the order of one after refinement. If it is much larger, then the information from the experiment dominates the ensemble: we might have underestimated the quality of our ensemble or the size of the errors in the data. If it is much smaller, then the reference distribution dominates the ensemble: we might have overestimated the quality of our reference ensemble or the size of the errors in the data.
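In code, the a priori choice of θ reduces to one line, and the a posteriori consistency check can accompany it. The helper names below are our own, not part of the BioEn API.

```python
def theta_from_energy_error(var_beta_dU):
    """A priori confidence parameter, Eq. (1): theta = 2 / var(beta*DeltaU)."""
    if var_beta_dU <= 0.0:
        raise ValueError("energy variance must be positive")
    return 2.0 / var_beta_dU

def posterior_consistency(theta, s_kl_opt):
    """theta * S_KL should be of order one after refinement."""
    return theta * s_kl_opt
```

For the double-well example of Sec. III, var(βΔU) ≈ 0.51 gives θ ≈ 3.9, consistent with the balanced fit at θ ≈ 4 reported in Sec. IV.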

G. How to determine the energy uncertainty
In gentle ensemble refinement, we have to choose the energy uncertainty var(βΔU) a priori to properly set the confidence parameter θ according to Eq. (1). To do so, we go from conformation space to the space of observables. As introduced in the Appendix of Ref. 9, the probability density of the reference ensemble in the observable space is given by

$$q(y) = \int \mathrm{d}x\, q(x) \prod_{i=1}^{M} \delta[y_i - y_i(x)], \tag{42}$$

where $y_i$ is the ith component of the observable vector y. The probability density p(y) of the refined ensemble in the observable space is defined analogously.
Refining in the space of observables is equivalent to refining in the space of conformations. By design, the average observables entering the likelihood are equal in both spaces. As we show in Appendix C, the KL divergence in conformation space is also equal to the KL divergence in observable space for optimal solutions.
The observable space is a coarse-grained representation of the conformation space.32 We introduce a coarse-grained potential energy $V_q(y)$ via

$$q(y) = \frac{e^{-\beta V_q(y)}}{Q_q}.$$

Analogously, we introduce the coarse-grained energy $V_p(y)$ in the p-ensemble. Note that $Q_q$ and $Q_p$, and thus ΔF, are the same as introduced above in the conformation space. The coarse-grained reduced energy difference becomes

$$\Delta v(y) = \beta V_p(y) - \beta V_q(y) - \beta \Delta F. \tag{45}$$

The reduced energy difference in the observable space is a free-energy difference. Expressing q(x) in Eq. (42) by Eq. (16), we obtain, for the free-energy difference in Eq. (45),

$$\Delta v(y) = \ln \langle e^{\Delta u} \rangle_{p|y} = -\ln \langle e^{-\Delta u} \rangle_{q|y}.$$

The subscripts p|y and q|y of the angular brackets indicate the sub-ensembles of the p- and q-ensembles with fixed values of y. Consequently, Δv(y) corresponds to the free-energy difference between the constrained p-ensemble and the constrained q-ensemble for a given value of y.
For the optimal weights obtained by BioEn, the energy uncertainties calculated in the space of conformations x and in the space of observables y are equal, $\langle \Delta u \rangle_p = \langle \Delta v \rangle_{p(y)}$ (see Appendix C), and $\mathrm{var}_p(\Delta u) \approx \mathrm{var}_{p(y)}(\Delta v)$ in gentle ensemble refinement. The underlying reason is that in the BioEn optimal solution, the factor scaling the relative weight of a conformation x depends only on y(x), not on x directly. Conformations with the same value of y are thus treated equally.
The uncertainty in the free-energy difference Δv(y) depends on the type and number of observables. For different types of observables probing different aspects of molecular conformations, we will have different expectations for the energy error. For a polymer, the expected energy uncertainties for sub-ensembles with a fixed end-to-end distance will differ from those for sub-ensembles with fixed carbon-hydrogen bond lengths. The sub-ensembles for these two observables are quite different, and so is the energy uncertainty.
If we combine independent observables and refine against all of them at once, then the uncertainty will be larger than for each individual observable. If the observables are uncorrelated, then the probability distribution of the observable vector q(y) factorizes into probability distributions for the individual components, $q(y) = \prod_{i=1}^{M} q(y_i)$. The same is true for p(y). The total KL divergence then becomes a sum over the KL divergences for the individual components,

$$S_{\mathrm{KL}}(p(y) \| q(y)) = \sum_{i=1}^{M} S_{\mathrm{KL}}(p(y_i) \| q(y_i)).$$

In this case, we add up the energy variances for the individual components,

$$\mathrm{var}(\Delta v) = \sum_{i=1}^{M} \mathrm{var}(\Delta v_i),$$

to define θ. The energy uncertainty thus depends both on the type and number of observables and on the thermodynamic state. The sensitivity of the distribution p(y|c) of the observables to a particular force-field parameter c determines its impact on the respective energy uncertainty, an issue examined in detail in the Bayesian inference of force fields (BioFF) method.33 p(y|c) is defined analogously to Eq. (42), with p(x) now parameterized by c, i.e., p(x|c) ∝ exp[−βU(x|c)] through the potential energy U(x|c). The reference ensemble is defined by c = c₀. To lowest order, the KL divergence S_KL(c) grows quadratically with small changes δc = c − c₀ in the force-field parameter c, with a proportionality coefficient given by the expectation value of the squared mean force with respect to the parameter c. This relation follows from the definition of S_KL and the normalization condition of p(y|c). Note that errors in the force field might be due not only to inaccurate parameters but also to simplified functional forms.
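Additivity can be verified directly for a product distribution. The sketch below assumes two independent Gaussian observables with mean shifts m₁ and m₂ (our own toy case, for which each component's KL divergence is mᵢ²/2), checks that the joint KL divergence is the sum over components, and sets θ from the summed variance.

```python
import numpy as np

m = np.array([0.5, 0.3])                   # per-observable mean shifts
grid = np.linspace(-10.0, 10.0, 20001)
dx = grid[1] - grid[0]

def kl_1d(shift):
    """KL divergence between N(shift, 1) and N(0, 1) by quadrature."""
    p = np.exp(-0.5 * (grid - shift) ** 2); p /= p.sum() * dx
    q = np.exp(-0.5 * grid**2);             q /= q.sum() * dx
    return np.sum(p * np.log(p / q)) * dx

kl_components = np.array([kl_1d(mi) for mi in m])
kl_total = kl_components.sum()             # S_KL of the product distribution
var_total = (m**2).sum()                   # variances add up
theta = 2.0 / var_total                    # Eq. (1) with the summed variance
```

Here the components contribute 0.125 and 0.045, so the total KL divergence is 0.17 = var_total/2, giving θ ≈ 5.9.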
In summary, we have to estimate the energy uncertainty in the observable space to determine θ according to Eq. ( 1).We can use this θ-value to directly refine in the conformation space.If we refine in the observable space instead, then we use the optimal generalized forces as derived in Ref. 9 to obtain the refined ensemble in the conformation space.

III. METHODS
We explore the concept of gentle ensemble refinement and the proposed encoding of prior knowledge using three example systems of increasing complexity. We refine a continuous version of the discrete double-well model presented in Ref. 10 and a simple polymer model based on the von Mises probability distribution presented in Ref. 33 using synthetic data. We also refine fully atomistic simulations of the pentapeptide Ala5 in explicit solvent10 using data from nuclear magnetic resonance (NMR) experiments.34

As a simple model system, we define the reference ensemble in terms of a continuous double-well potential given by

$$\beta U_q(x) = a\,(x^2 - x_0^2)^2,$$

where x is a scalar that also serves as an observable, i.e., y(x) = x. This energy function is symmetric with respect to x = 0. It has two minima at ±x₀ with values $U_q(\pm x_0) = 0$. These minima are separated by a barrier at x = 0 of height $a x_0^4$. The corresponding Boltzmann distribution is given by

$$q(x) = \frac{e^{-a (x^2 - x_0^2)^2}}{Q_q}, \tag{51}$$

with the normalization constant (partition function) $Q_q$ expressible in terms of the modified Bessel functions of the first kind $I_n(c)$ of order n, with c = a x₀⁴/2. Note that ⟨x⟩_q = 0 due to symmetry.
In the following, we use x₀ = 1 and a = 3 for the reference distribution q(x), such that the barrier height is a x₀⁴ = 3 (Fig. 2). The experimentally measured expectation value of the observable x is set to Y = 0.8, with error σ = 0.2. We use rejection sampling to generate an ensemble of N = 10 000 independent random points distributed according to q(x) for ensemble refinement.
For concreteness, we also work out a specific case. The reference potential $U_q(x)$ deviates from the assumed true potential by a linear term, $\beta U_p(x) = \beta U_q(x) - b x$, with $\beta U_q(x) = 3(x^2 - 1)^2$ as above. The slope b = 1.217 165 was chosen so that ⟨x⟩_p = 0.8 = Y exactly. The corresponding force-field error is $\mathrm{var}_p(\Delta u) \approx 0.51$. For this error, we expect a value of θ = 2/var_p(Δu) ≈ 4 to give a balanced fit, as will be tested.
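This concrete case can be reproduced by direct quadrature. The sketch below evaluates ⟨x⟩_p and var_p(Δu) for the tilted double well on a fine grid; it recovers ⟨x⟩_p ≈ 0.8 and var_p(Δu) ≈ 0.51, and hence θ ≈ 4 from Eq. (1).

```python
import numpy as np

a, x0, b = 3.0, 1.0, 1.217165
x = np.linspace(-3.0, 3.0, 60001)
dx = x[1] - x[0]

Uq = a * (x**2 - x0**2) ** 2        # reference double well (in units of kT)
Up = Uq - b * x                     # assumed "true" tilted potential

p = np.exp(-Up); p /= p.sum() * dx  # Boltzmann distribution of the true model

mean_x = np.sum(p * x) * dx         # target observable <x>_p
dU = -b * x                         # beta*DeltaU = beta*(Up - Uq)
mean_dU = np.sum(p * dU) * dx
var_du = np.sum(p * dU**2) * dx - mean_dU**2   # var_p(du) = var_p(beta*dU)
theta = 2.0 / var_du                # a priori confidence parameter
```

The truncation of the grid at |x| = 3 is harmless because the Boltzmann weight there is vanishingly small.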
As a more realistic example, we refine a two-dimensional polymer model33 using synthetic one-dimensional data (Fig. 3). The Boltzmann distribution is given by a product of von Mises distributions acting on the angle differences between neighboring bonds. In this phantom-chain model, the beads do not interact. We set the mean values of the angle differences to zero. All bonds have length one, and we apply the same stiffness parameter κ = 10 throughout. We randomly generate conformations of polymers with 100 beads and sample N = 10 000 independent conformations. As an observable, we use the end-to-end distance. We set Y = 75 for the experimental value, with an error of σ = 5.
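Our reading of this model can be sketched as follows. The sampling routine below is our own reconstruction from the description above, not the code of Ref. 33: successive bond directions differ by von Mises-distributed increments with κ = 10, all bonds have unit length, and the observable is the end-to-end distance.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_chain(n_beads=100, kappa=10.0):
    """One 2-D phantom chain; returns the bead positions."""
    # 99 bonds; the 98 angle differences between neighboring bonds
    # are drawn from a von Mises distribution with zero mean
    dtheta = rng.vonmises(0.0, kappa, size=n_beads - 2)
    theta = np.concatenate(([0.0], np.cumsum(dtheta)))   # bond directions
    steps = np.column_stack([np.cos(theta), np.sin(theta)])
    return np.vstack([np.zeros((1, 2)), np.cumsum(steps, axis=0)])

def end_to_end(conf):
    return float(np.linalg.norm(conf[-1] - conf[0]))

r = np.array([end_to_end(sample_chain()) for _ in range(2000)])
```

Since the contour length is 99, a synthetic target of Y = 75 with σ = 5 then shifts weight toward conformations more extended than typical reference chains.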
As an example of an actual application, we refine the previously published simulation data10 of Ala5 against experimental data in the form of NMR J-couplings.34 The J-coupling calculation of Ref. 10 applied the Karplus equation35 using the so-called DFT2 parameters from Ref. 36. We use N = 50 000 conformations as in the original publication.

FIG. 2. Ensemble refinement of the double-well system using synthetic data. The black solid vertical line indicates the experimentally measured observable of Y = 0.8. The gray shaded area indicates Y ± σ with σ = 0.2. The reference probability distribution function of the observable for parameters a = 3 and x₀ = 1 is shown as a black dotted line. The histogram of N = 10 000 samples drawn from the reference distribution, Eq. (51), is shown as a gray solid line. The histograms of x obtained with BioEn optimal weights for different θ values are shown in color. For all distribution functions, we show the average values of the observables as dashed vertical lines in the corresponding color. The calculated average values approach the experimental value for decreasing θ.
To find optimal solutions for the weights given a value of θ, we use the forces method 10 as implemented in the open-source BioEn software available at https://github.com/biophys/BioEn.An open-source Julia 37 implementation using the package Optim.jl 38can be downloaded from https://github.com/biophys/BioEn.jlfree of charge.To generate synthetic data, we use https://github.com/bio-phys/RefinementModels.jl for the doublewell model and https://github.com/bio-phys/BioFFfor the von Mises polymer model.We use uniform reference weights for all systems.
We illustrate the effects of refinement on the weights using cumulative ranked weights.10 For the three systems, we compare the numerical results for the cumulative ranked weights to the analytical results for Gaussian energy distributions (see Appendix B).
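Cumulative ranked weights are simple to compute: sort the refined weights in decreasing order and accumulate. The helper below is a minimal sketch (the analytical Gaussian reference curves of Appendix B are not reproduced here).

```python
import numpy as np

def cumulative_ranked_weights(w):
    """Cumulative sum of weights sorted from largest to smallest."""
    w = np.asarray(w, dtype=float)
    return np.cumsum(np.sort(w)[::-1])

# Example: a mildly reweighted four-member ensemble
w = np.array([0.2, 0.4, 0.1, 0.3])
crw = cumulative_ranked_weights(w)   # -> [0.4, 0.7, 0.9, 1.0]
```

For uniform weights, the curve is a straight line; the more strongly the curve bows upward, the further the refined ensemble has moved from the reference.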

IV. RESULTS
We first establish the range of validity of gentle ensemble refinement for the three example systems. The range of validity of the approximation of the KL divergence by the energy uncertainty, Eq. (26), depends on the system under consideration (Fig. 4). We find good agreement for decreasing θ values down to (θ, S_KL) ≈ (10, 0.17) for the double-well system, (θ, S_KL) ≈ (10, 0.26) for the polymer model, and (θ, S_KL) ≈ (1, 1.43) for Ala5. In gentle ensemble refinement, the two definitions of the KL divergence are approximately equivalent, S_KL(p‖q) ≈ S_KL(q‖p) [Eq. (34) and Fig. 4]. The relative residuals of S_KL(q‖p) and of the approximation by half the energy variance with respect to S_KL(p‖q) are of similar shape and magnitude but opposite sign (Fig. 4, bottom panels). For all three systems, we have S_KL ≲ 1 for θ ≳ 10, which overall delimits the regime of gentle refinement.

FIG. 4. Comparison of the KL divergence S_KL(p‖q) = −⟨Δu⟩_p, Eq. (15), used in BioEn (black solid line) to its approximation given by half the variance of the reduced energy change βΔU (red dashed line), Eq. (26), and to its alternative definition S_KL(q‖p) (cyan dotted-dashed line), Eq. (27). We show the results for the double-well system (top), the polymer model (middle), and Ala5 (bottom). The bottom panels show the relative difference of var_p(Δu)/2 (red dashed line) and of S_KL(q‖p) (cyan dotted-dashed line) with respect to S_KL(p‖q) = −⟨Δu⟩_p. The approximations roughly start deviating from the exact expression at (θ, S_KL) ≈ (10, 0.17) for the double-well system, (θ, S_KL) ≈ (10, 0.26) for the polymer model, and (θ, S_KL) ≈ (1, 1.43) for Ala5. For smaller θ values, we leave the regime of gentle refinement. The vertical lines indicate θ = 100 (blue), 10 (orange), and 1 (green).

Having established the limits of validity of gentle ensemble refinement, we now show that even gentle refinement substantially improves the agreement between simulation and experiment. This improvement is illustrated by the L-curves defining the Pareto fronts for Bayesian ensemble refinement (Fig. 5). The L-curve consists of the optimal χ² values, divided by the number of data points M, plotted against the optimal KL divergence values S_KL for different θ values. The three chosen θ values (100, 10, and 1) cover the elbow regions of the respective L-curves. Note that these elbow regions are not sharply defined. For gentle refinement at θ = 10, χ² has decreased drastically, while S_KL remains relatively small.

FIG. 5. L-curve plots of the optimal χ²/M vs the optimal KL divergence S_KL for the double-well system (top), the polymer model (middle), and Ala5 (bottom). M is the number of data points. Annotated disks indicate the optimal values for θ = 100 (blue), 10 (orange), and 1 (green).
Overfitting and underfitting are illustrated by the double-well system (see Fig. 2). We have underfitting for θ = 100, as the calculated average is multiple standard deviations away from the experimental value. We have overfitting for θ = 1, as the calculated average agrees nearly perfectly with the experimental average despite the error. The character of the distribution of the observable can change quite substantially even when the optimal entropy values appear relatively small, as we discuss in more detail below. For the optimal ensembles, the statistical weights of the left state [with observable values y(x) = x < 0] are 42%, 23%, and 12% for (θ, S_KL) = (100, 0.01), (10, 0.17), and (1, 0.36), respectively, compared to 50% for the reference state.
For the concrete example of a linear perturbation (Sec. III), we obtain a balanced solution between underfitting and overfitting for θ ≈ 4, estimated using Eq. (1). In particular, the calculated mean after refinement of ⟨x⟩ ≈ 0.66 matches the target Y = 0.8 ± 0.2 within the error. Moreover, θS_KL ≈ 1.05 is close to one, as expected. In this sense, the θ estimate from Eq. (1) is indeed consistent.
The approximation of the KL divergence by the variance, Eq. (26), holds for arbitrarily shaped distribution functions of the reduced energy differences Δu (Fig. 6). The distribution functions are bimodal for the double-well system, monomodal for the polymer model, and well approximated by Gaussian distributions for Ala5. The approximation of the KL divergence by the variance works especially well for the latter, as it is exact for a Gaussian Δu distribution.
Refined weights with small entropy values can already differ substantially from the reference weights (Fig. 7). To illustrate this point, we focus on results for θ = 10, located in the elbow region of the L-curve (see Fig. 5) and at the border of the regime of gentle refinement. We show the cumulative ranked weights in Fig. 7. For θ = 10, the S_KL values are small, ranging from 0.17 for the double-well system and 0.26 for the polymer model to 0.37 for Ala5. In all three cases, the cumulative weights show that the top half of the conformations already carries a cumulative probability of about 80%, which is a substantial change. The relative weight of the top-ranked conformations increases further with increasing KL divergence.
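Cumulative ranked weights are straightforward to compute from any refined ensemble. The sketch below uses hypothetical weights generated from a Gaussian Δu with var(Δu) = 2S_KL ≈ 0.5, an assumed value in the range quoted above, ranks them, and reports the cumulative probability of the top half:

```python
import math, random

random.seed(1)
n = 10000
# Hypothetical refined ensemble: per-conformation energy shifts du drawn
# from a Gaussian with var(du) = 2*S_KL ~ 0.5, an assumed value in the
# range of the S_KL values quoted above.
du = [random.gauss(0.0, math.sqrt(0.5)) for _ in range(n)]
w = [math.exp(-d) for d in du]
z = sum(w)
w = [wi / z for wi in w]

w.sort(reverse=True)                 # rank conformations by refined weight
cum, cumulative = 0.0, []
for wi in w:
    cum += wi
    cumulative.append(cum)

top_half = cumulative[n // 2 - 1]
print(f"cumulative weight of the top 50% of conformations: {top_half:.2f}")
```

Consistent with Fig. 7, an S_KL of a few tenths already concentrates roughly three quarters or more of the probability in the top-ranked half of the conformations.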
In Fig. 7, we compare the numerical results for the ranked-weight distributions to the analytical results obtained by assuming Gaussian distributions of Δu, parameterized by ⟨Δu⟩ and var(Δu) calculated from the weights (see Appendix B). For the largest value of θ, and thus the smallest value of S_KL, the agreement is excellent. For smaller values of θ, the agreement deteriorates somewhat. For the double-well system, the analytical approximation captures the trends qualitatively but fails quantitatively because the underlying distribution of Δu is clearly non-Gaussian. For the polymer model, the distribution of Δu is skewed but unimodal for all θ values considered (see Fig. 6), resulting in near-quantitative agreement between the Gaussian-based approximations and the actual weight distributions after refinement. For the most realistic case presented here, Ala5, the Δu distribution is well approximated by a Gaussian. Consequently, the approximations of the weight distributions are quite accurate, even for the smallest value of θ = 1 with S_KL ≈ 1.43.

V. DISCUSSION
When refining sufficiently gently, the KL divergence is well approximated by the expected energy uncertainty, Eq. (26). We can quantify and visualize the extent of the resulting weight changes by S_KL and a plot of the cumulative ranked weights, respectively. KL divergences S_KL ≪ 1 and a narrow gap between the respective cumulative ranked weights indicate a good overlap between the reference and refined ensembles. A poor overlap, as indicated by large weight changes, can be improved by enhanced sampling methods that enrich the sample before refinement to better match the experimental observables, e.g., by using replica simulations.24,25 We then reweight the sampled ensemble using MBAR or binless WHAM to produce a reference ensemble for subsequent BioEn ensemble refinement, as we have proposed in Ref. 9. By doing so, we can remove any biases, e.g., due to a finite number of replicas. One can also directly bias the degrees of freedom determining the values of the observables, e.g., by empirical force-field refinement, as described, for example, in Ref. 33. In this case, too, we can generate a reference ensemble by reweighting the sampled ensembles for subsequent BioEn ensemble refinement.
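As a minimal illustration of the reweighting step, consider the single-state special case of MBAR/binless WHAM: with an assumed known bias (here a linear bias of a stand-in Gaussian model, not replica data), self-normalized importance weights exp(b) recover reference-ensemble averages from biased samples:

```python
import math, random

random.seed(3)
# Stand-in for a biased run: samples from N(1, 1), i.e. an assumed known
# bias b(x) = 0.5 - x (in kT) relative to the target reference ensemble
# N(0, 1). Reweighting by exp(b) recovers reference averages.
xs = [random.gauss(1.0, 1.0) for _ in range(50000)]
logw = [0.5 - x for x in xs]              # log w_i = b(x_i) = u_biased - u_ref
m = max(logw)
w = [math.exp(l - m) for l in logw]       # shift before normalizing (stability)
z = sum(w)
mean_biased = sum(xs) / len(xs)
mean_ref = sum(wi * x for wi, x in zip(w, xs)) / z
print(f"biased mean: {mean_biased:.3f}  reweighted (reference) mean: {mean_ref:.3f}")
```

The biased sample mean sits near 1, whereas the reweighted estimate recovers the reference mean near 0; the full MBAR/binless-WHAM estimators generalize this to samples pooled from many biased states.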
Even for gentle refinement, the changes to the reference weights can be noticeable. For θ = 10, we found the top 50% of the conformations to carry ∼80% of the weight in our three examples, consistent with the Gaussian approximation (Fig. 7). Despite the small entropy values of S_KL ≈ 0.2 to 0.4, these changes to the weights are sufficient to substantially reduce the deviations from experiment (see the L-curves in Fig. 5).
We suggest using our prior knowledge about the force-field accuracy in the space of observables to set the confidence parameter as θ ≅ 2/var(βΔU), Eq. (1). One can justifiably deviate from this proposal to set θ according to the available prior knowledge. For example, if larger values of the variance are less probable than implied by the exponential form of the prior with θ ≅ 2/var(βΔU), then one should increase θ accordingly. In any case, knowledge of the expected variance var(βΔU) can be used to set and interpret the scaling parameter θ. Conversely, a particular choice of θ by other reasoning is also an expression of the expected errors in the state populations of the reference ensemble. Whereas functions other than an exponential could be used to define the prior, the simple relation to the force-field error would most likely be lost.
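In practice, Eq. (1) turns an expected root-mean-squared energy error directly into a confidence parameter. A minimal sketch, assuming a hypothetical error estimate of 0.5 k_B T:

```python
# Hypothetical prior: we expect RMS force-field energy errors of ~0.5 kT
# for the states relevant to the observables (an assumed value).
expected_rms_error_kT = 0.5
var_beta_dU = expected_rms_error_kT ** 2      # var(beta*dU) = 0.25
theta = 2.0 / var_beta_dU                     # Eq. (1): theta = 2/var(beta*dU)
print(f"theta = {theta:.1f}")                 # -> theta = 8.0
```

A more accurate force field (smaller expected error) yields a larger θ, i.e., more confidence in the reference ensemble, as one would expect.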
We build up experience about force-field uncertainties quite naturally. Experienced researchers performing simulations will often be able to state expectations about the energy uncertainties of their favorite force fields for a class of observables. However, the mean-squared force-field error can also be learned from repeated ensemble refinements against different experimental data across a variety of systems.

VI. CONCLUDING REMARKS
The results presented here can be generalized to non-Boltzmann distributions and are thus valid for the general class of Bayesian/MaxEnt approaches to ill-defined inverse problems. In such cases, we take advantage of the fact that the energy entering the Boltzmann distribution corresponds to Shannon's information content h(x) = −ln p(x) for arbitrary probability distributions p(x).39 Consequently, a difference in energy corresponds to a difference in information, Δh(x) = ln[p(x)/q(x)], Eq. (9). As in gentle ensemble refinement, the KL divergence can be approximated by the variance of this information difference, i.e., S_KL ≈ var(Δh)/2, if these differences are small. In general, we express our confidence in the "natural unit of information," or "nat," which is numerically equal to k_B T for Boltzmann distributions.
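This generalization is easy to verify for arbitrary discrete distributions. The sketch below compares the exact S_KL, in nats, to var(Δh)/2 for two generic five-outcome distributions; the perturbation values are assumed for illustration:

```python
import math

# Two nearby discrete distributions p and q (arbitrary, non-Boltzmann).
# The eps values are assumed small log-perturbations of the five outcomes.
q = [0.2] * 5
eps = [0.05, -0.03, 0.02, -0.04, 0.0]
raw = [qi * math.exp(e) for qi, e in zip(q, eps)]
z = sum(raw)
p = [r / z for r in raw]

dh = [math.log(pi / qi) for pi, qi in zip(p, q)]   # information differences
s_kl = sum(pi * d for pi, d in zip(p, dh))         # exact KL divergence (nats)
mean = s_kl                                        # <dh>_p equals S_KL
var = sum(pi * (d - mean) ** 2 for pi, d in zip(p, dh))
print(f"S_KL = {s_kl:.6f}  var(dh)/2 = {var / 2:.6f}")
```

For small information differences the two quantities agree closely, mirroring the gentle-refinement result for Boltzmann distributions.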
Gentle ensemble refinement is a powerful tool for molecular simulation and modeling. Empirical force fields rely on approximations in their functional form to trade off efficiency and accuracy. Therefore, not all errors in the force field can be resolved by reparameterization alone.21,22 The inaccuracies introduced by such approximations can be alleviated by using gentle ensemble refinement to integrate system-specific information, most often in the form of experimental data.

FIG. 1. Bayesian/MaxEnt ensemble refinement as a Pareto optimization problem. The curve of optimal solutions in the plane spanned by the deviations from experiment, χ², and the deviations from theory, S_KL, determines the Pareto front or L-curve (red line). The sub-optimal solutions (orange shaded area) lie above this curve. We encode our prior knowledge by choosing a value of the so-called marginal rate of substitution, given here by −2θ. We use this rate to trade off the two deviations according to our prior experience, dχ² = −2θ dS_KL, which gives a unique solution (blue disk). At this point, the slope of the Pareto front is exactly the marginal rate, as indicated by the blue line.

FIG. 3. Ensemble refinement of the polymer model using synthetic data. The black solid vertical line indicates the experimentally measured observable of Y = 75. The gray shaded area indicates Y ± σ with σ = 5. The histogram of N = 10 000 samples drawn from the Boltzmann distribution is shown as a gray solid line. The histograms of the end-to-end distance obtained with the BioEn optimal weights for different θ values are shown in color. For all distribution functions, the average values of the observables are shown as dashed vertical lines in the corresponding color. The calculated average values approach the experimental value with decreasing θ.

FIG. 6. Histograms (solid lines) of the energy changes Δu for θ = 100 (blue), 10 (orange), and 1 (green) for the double-well system (left), the polymer model (center), and Ala5 (right). Gaussian distributions with the same mean and variance are shown as dashed lines in matching colors.