Bayesian-Maximum-Entropy Reweighting of IDP Ensembles Based on NMR Chemical Shifts

Bayesian and Maximum Entropy approaches allow for a statistically sound and systematic fitting of experimental and computational data. Unfortunately, assessing the relative confidence in these two types of data remains difficult as several steps add unknown error. Here we propose the use of a validation-set method to determine the balance, and thus the amount of fitting. We apply the method to synthetic NMR chemical shift data of an intrinsically disordered protein. We show that the method gives consistent results even when other methods to assess the amount of fitting cannot be applied. Finally, we also describe how the errors in the chemical shift predictor can lead to an incorrect fitting and how using secondary chemical shifts could alleviate this problem.


Introduction
Intrinsically Disordered Proteins (IDPs) do not fold into stable conformations, because their free energy landscapes possess many shallow local minima. Some of these may correspond to conformations rich in secondary structure and α-helices appear to be the most common of these regular structures. Their ability to fold into different conformations allows IDPs to interact with different binding partners and tuning their populations by post-translational modifications can allow for the regulation of important cellular functions. An accurate description of the secondary structures adopted by IDPs, as well as of their populations, can therefore help us understand the cellular behaviour of this important class of proteins.
Several methods exist to generate ensembles of configurations representing the conformational heterogeneity of IDPs, including all-atom molecular dynamics simulations [1], implicit-solvent Monte Carlo simulations [2] and other sampling approaches [3][4][5]. As the time evolution of the conformations is not relevant in the types of ensemble reweighting we here consider, from now on we loosely refer to molecular dynamics as any simulation method that can produce an ensemble of conformations. In practice, the simulated ensembles differ from the true ensembles for three main reasons. First, because the simulation conditions may not be exactly the same as the experimental ones-for example, different ionic strength, salts or pH may be used. Second, because equilibrium sampling of the IDPs conformational space is challenging to achieve, mainly due to its enormous size. And third, because all energy functions used in molecular simulations describe particle interactions with limited accuracies. This is particularly so for water-protein interactions, that are prevalent in Methods to test approaches to combine experiments and simulations are often hampered by the fact that we generally do not know the true target ensemble, and thus it is then not clear what an accurate result would be. Thus, to gain full control of the reweighting procedure and to know the desired outcome, we here use synthetic data. This is an approach that has proved powerful for discovering the impact of experimental uncertainties in structure determination that we and others have previously used [36][37][38][39][40][41]. In particular, we consider the a99SBdisp as a reference ensemble for generating synthetic data (i.e., we treat calculated CS from this ensemble as the experimental data). To clarify that these structures and their predicted CS are not actually experimental, we call this ensemble target ensemble. Then we will fit the predicted CS of other ensembles to the CS of the target ensemble. Namely, we will fit the a03ws and the C36m trajectories, which correspond to different conformational ensembles. We will also fit the a99SBdisp derived CS into the target ensemble, i.e., into itself, to evaluate the effect of the errors of the CS predictor (see the Methods section).
The use of CS derived from a target ensemble which is known will allow us to compare the reweighted distribution of CS to the target-ensemble distribution, which would not be accessible from an actual NMR experiment. The CS from an NMR experiment depend on the time-scale of exchanging conformations. If, as it is for IDPs, this time-scale is fast, the measured CS is a time and ensemble average of all conformations. Thus we only get the first moment (the mean) of the distribution of CS in the conformational ensemble. In cases when the experiment reports distributions or higher distribution moments, several approaches can extend the Maximum Entropy method to those cases [42][43][44].
Even if the experiment only produces the mean of the distribution, the goal of reweighting is to obtain an ensemble that better reproduces the experimental ensemble of structures, not only its (measured) mean. By using synthetic data, we can compare how the distribution of CS during the reweighting approaches the target distribution. After all, the BME procedure guarantees that we can systematically approach the mean value, but at some point, overfitting could lead to a distribution that deviates from the target ensemble distribution instead of approaching it. Methods to test approaches to combine experiments and simulations are often hampered by the fact that we generally do not know the true target ensemble, and thus it is then not clear what an accurate result would be. Thus, to gain full control of the reweighting procedure and to know the desired outcome, we here use synthetic data. This is an approach that has proved powerful for discovering the impact of experimental uncertainties in structure determination that we and others have previously used [36][37][38][39][40][41]. In particular, we consider the a99SBdisp as a reference ensemble for generating synthetic data (i.e., we treat calculated CS from this ensemble as the experimental data). To clarify that these structures and their predicted CS are not actually experimental, we call this ensemble target ensemble. Then we will fit the predicted CS of other ensembles to the CS of the target ensemble. Namely, we will fit the a03ws and the C36m trajectories, which correspond to different conformational ensembles. We will also fit the a99SBdisp derived CS into the target ensemble, i.e., into itself, to evaluate the effect of the errors of the CS predictor (see the Methods section).
The use of CS derived from a target ensemble which is known will allow us to compare the reweighted distribution of CS to the target-ensemble distribution, which would not be accessible from an actual NMR experiment. The CS from an NMR experiment depend on the time-scale of exchanging conformations. If, as it is for IDPs, this time-scale is fast, the measured CS is a time and ensemble average of all conformations. Thus we only get the first moment (the mean) of the distribution of CS in the conformational ensemble. In cases when the experiment reports distributions or higher distribution moments, several approaches can extend the Maximum Entropy method to those cases [42][43][44].
Even if the experiment only produces the mean of the distribution, the goal of reweighting is to obtain an ensemble that better reproduces the experimental ensemble of structures, not only its (measured) mean. By using synthetic data, we can compare how the distribution of CS during the reweighting approaches the target distribution. After all, the BME procedure guarantees that we can systematically approach the mean value, but at some point, overfitting could lead to a distribution that deviates from the target ensemble distribution instead of approaching it. In this work we will explore how similar the target and reweighted ensembles are during a reweighting procedure using the distance between the distributions (see Methods section). We will also compare the behaviour of the average value of these distributions. From that, we will propose some rules to assess the amount of reweighting (determining θ) in real situations where the target distribution is unknown because only the experimental average value is available.

Methods
The MD trajectories with different force fields were generated as part of a recent study aimed at developing and benchmarking force fields [35] and were kindly shared by the authors. In this work we use the trajectories generated with the Amber ff03ws force field used with the TIP4P/2005 water model and with scaled protein-water interactions (a03ws) [6], the CHARMM36m (C36m) force field with its native TIP3P water model [45], and Amber ff99SB-disp with its native TIP4P-D-based water model (a99SBdisp) [35]. All the simulations are 30 µs long and contain N = 29777 frames. In this work we take the a99SBdisp as the target distribution and we will fit the a03ws and the C36m to this one, using CS. In the main text we focus on the a03ws results, but all the analysis is reproduced for the C36m ensemble as supplementary figures.
CS can be used to reweight IDP ensembles because they are sensitive to local secondary structure conformations. Indeed, experimentally derived CS indices allow for the quantification of secondary structure based on the CS values [46,47]. Although one might use the predicted secondary structures from these algorithms to reweight simulated ensembles this would introduce unnecessary model bias, and we instead computed the CSs from the ensembles and calculate the primary data directly. For that, we used existing algorithms and software to predict CS from structures.
Methods to calculate CS can introduce both systematic and random errors in their predicted values, but because we do not have experimental CS for our target ensemble, but predicted ones, we need a procedure to model these errors, that would be present in an actual experiment. The use of a predictor with perfect precision and accuracy can give some insight into the information content of CS, as we also explore below, but it does not reflect an experimental setting. Although we could have arbitrarily introduced random and systematic error to the predicted CS of the target ensemble, we preferred an alternative approach where the errors came from using two different predictors. We believe that that magnitude of the introduced errors will be more realistic than chosen arbitrarily.
We used Sparta+ [48] and PPM [49] to simulate the target a99SBdisp CS and only PPM [49] to simulate the reweighted distributions. When not stated otherwise, the target CS come from Sparta+ and we therefore do not expect that reweighting will lead to perfect agreement with the CSs predicted from the ensembles using PPM. PPM was specifically designed to calculate CS for individual frames in a conformational ensemble, i.e., it does not include in its training the thermal fluctuations around a given structure that others methods implicitly absorb into the predicted values. We have used as the error for the PPM predictor the reported errors of the validation set ( Table 3 in [49]), namely 1.06 for Cα, 1.23 for Cβ and 1.32 for the carbonyl C. This choice is based on the fact that the experimental error in the CS is negligible compared to the error of the forward model, and thus that the main uncertainty when determining whether a predicted average CS is "close" to the experimental measurement comes from the error of the forward model. We note also that the estimated error of the CS predictors come from analyses of a much more heterogeneous set of structures and might be an overestimate compared to the relatively narrow distribution of CS in IDPs. Our set of CS contained 69 Cα chemical shifts, 64 Cβ chemical shifts and 69 C chemical shifts, totalling m = 202 chemical shift data.
The application of the BME approach is equivalent to the minimization of the following objective function: where w i are the reweighted weights, N is the number of structures in the ensemble and m the number of experimental observables. χ 2 red measures the degree of fitting: where CS i j is the chemical shift of atom i in structure j and CS i EXP is the target value. And the relative entropy, defined as S REL = − N j=1 w j log w j w 0 j measures the deviation from the initial weights, w 0 j , in our case taken as 1/N. S REL is closely related to the effective sample size, N eff , a useful measure of how much reweighting has taken place (see Equation (3)).
In the calculation of χ 2 red we divide by the number of CS data, assuming completely independent data. Each σ i is equated to the error of the PPM predictor reported above, as errors in the convergence of the simulations and the accuracy of the force fields are difficult to quantify. Because of these assumptions, the absolute value of χ 2 red is of limited utility and it should be merely interpreted as a normalized measure of the residuals.
The effective sample size varies from 1 when all structures have the same weight, i.e., before the reweighting; to 0, in case a single structure gets 100% of the weight. There are different definitions on the effective sample size, one commonly used was put forward by Kish [50], but here we use the definition based on relative entropy, as it fits better within the BME paradigm: The optimization of Equation (1) can be performed with a set of auxiliary Lagrange parameters λ, which allow writing the optimal weights as [24,51,52]: where Z is a normalization constant. This expression highlights two important aspects of the BME approach. First, the N weights are determined by optimizing the values of m λ parameters. As N, the number of structures in the ensemble, is usually much larger than the number of experimental data m, this reduces overfitting. Second, the fitting consists of adding a linear term to the free energy associated to each observable. This linear term can be seen as the minimal perturbation needed to tilt the distribution of the computed ensemble so that its average shifts towards the experimental value. Indeed, a purely Bayesian approach introducing a linear term to the energy [53] leads to results that are equivalent to a Maximum Entropy approach including an error term [54]. The strength of the reweighting is directly determined by the values of λ, and not by θ. In other words, a given θ may lead to different λ, and thus different reweightings for two different ensembles fitted to the same target values. (see Figure S7) After each of the fittings, we compared the average values with Equation (2) and the distribution of values between the reweighted and the target distribution with the Wasserstein distance, as implemented in the Python package SciPy [55]. As each pair of reweighted and target CS distributions produces a Wasserstein distance value, thus we used the mean of all these m values as a measure of the similarity between two sets of distributions.
Secondary CSs were calculated for the target and reweighted ensemble in the same way. The secondary structure of each conformation was determined with the DSSP [56] implementation in MDTraj [57]. Then, the averaged CS for residues in coiled conformations represents the random coil reference. The secondary CS is the result of subtracting the random coil reference to the CS.

Results
The first step in a reweighting procedure is to ensure that the ensemble can be reweighted. Ideally, the ensemble should contain values around the target value. Figure S1 shows that this is the case for the two fitted ensembles discussed in this work. If the target value lies outside the distribution of the calculated ensemble values, the convergence of the reweighting is compromised. In that case one should check if there are systematic errors in the predictor, the experiment, or the simulation. Alternatively increasing the sampling could populate the less probable regions of the distribution. Sampling of these regions can also be achieved by running simulations with restraints [33,52,[58][59][60][61][62].
The improvement of the fit resulting from the reweighting can be measured with the χ 2 red explained in the Methods section. Lower values of χ 2 red indicate a better fitting. As reweighting takes place, some conformations gain weight while others lose it. This leads to a reduced entropy compared to the original ensemble, which can be converted to an effective sample size, N eff , ranging from 1 to 0. As the amount of reweighting increases to fit the data more accurately, N eff , decreases. In some systems, the shape of the χ 2 red vs. N eff curve allows for the determination of a critical value from which no further reweighting is necessary [26,63]. However, in fitting CS of ACTR we find a relatively homogeneous decrease of χ 2 red , which makes it difficult to decide when to stop fitting(see Figure 2). A second common procedure is the Wald-Wolfowitz run test [64] for the residuals, where one checks that they are mutually independent. Long stretches of residues with the same sign indicate that more reweighting is possible. However, when using BME, decreasing θ decreases the size of the residuals, but not their signs (see Figure S2), i.e., each reweighted CS approaches its target CS from the same side as θ is decreased. This renders the Wald-Wolfowitz run test procedure inadequate.
We note here that Figure 2 shows a low χ 2 red even at the start of the fitting procedure. A χ 2 red below one may suggest that the fit is good enough, but can also reflect problems in estimating errors in both experiments and forward models, and in the statistical independence of the data. The σ i used in the calculation corresponds to the PPM error for a heterogeneous set of structures and their experimental CS. When compared to Sparta+ ACTR chemical shifts, the errors are much smaller (See Figure S3). Because of this error underestimation, the absolute value of χ 2 red is of limited use. If we use the PPM CS for the target ensemble, we are effectively using a predictor without error. In that case, χ 2 red is ill-defined, but we get a curve very similar to Figure 2 (see Figure S4). This suggests that the CS difference reflects more the different structural composition of the ensemble (Figure 1) than the errors of the predictor.
The improvement of the fit resulting from the reweighting can be measured with the explained in the Methods section. Lower values of indicate a better fitting. As reweighting takes place, some conformations gain weight while others lose it. This leads to a reduced entropy compared to the original ensemble, which can be converted to an effective sample size, Neff, ranging from 1 to 0. As the amount of reweighting increases to fit the data more accurately, Neff, decreases. In some systems, the shape of the vs. Neff curve allows for the determination of a critical value from which no further reweighting is necessary [26,63]. However, in fitting CS of ACTR we find a relatively homogeneous decrease of , which makes it difficult to decide when to stop fitting(see Figure 2). A second common procedure is the Wald-Wolfowitz run test [64] for the residuals, where one checks that they are mutually independent. Long stretches of residues with the same sign indicate that more reweighting is possible. However, when using BME, decreasing θ decreases the size of the residuals, but not their signs (see Figure S2), i.e., each reweighted CS approaches its target CS from the same side as θ is decreased. This renders the Wald-Wolfowitz run test procedure inadequate.
We note here that Figure 2 shows a low even at the start of the fitting procedure. A below one may suggest that the fit is good enough, but can also reflect problems in estimating errors in both experiments and forward models, and in the statistical independence of the data. The used in the calculation corresponds to the PPM error for a heterogeneous set of structures and their experimental CS. When compared to Sparta+ ACTR chemical shifts, the errors are much smaller (See Figure S3). Because of this error underestimation, the absolute value of is of limited use. If we use the PPM CS for the target ensemble, we are effectively using a predictor without error. In that case, is ill-defined, but we get a curve very similar to Figure 2 (see Figure S4). This suggests that the CS difference reflects more the different structural composition of the ensemble (Figure 1) than the errors of the predictor.
Indeed, the lack of a clear target value for is one key reason why we need to determine the parameter θ. Independently of the origin of the small the ensembles described by the three force fields are different ( Figure 1) and should be distinguishable by their differences in CS. For many residues the CS differences are small because they present a random coil structure, but the region where the target and simulated ensembles show different helical content, the CS show larger differences (Figures 3 and S5). This trend is clearer for the a03ws force field than for the C36m ( Figure S5). This suggests that reweighting would improve the simulated ensembles by rendering them closer to the target ensemble. But traditional methods do not always provide a clear answer to how to choose θ to avoid over-fitting but have enough reweighting.  Indeed, the lack of a clear target value for χ 2 red is one key reason why we need to determine the parameter θ. Independently of the origin of the small χ 2 red the ensembles described by the three force fields are different ( Figure 1) and should be distinguishable by their differences in CS. For many residues the CS differences are small because they present a random coil structure, but the region where the target and simulated ensembles show different helical content, the CS show larger differences ( Figure 3 and Figure S5). This trend is clearer for the a03ws force field than for the C36m ( Figure S5). This suggests that reweighting would improve the simulated ensembles by rendering them closer to the target ensemble. But traditional methods do not always provide a clear answer to how to choose θ to avoid over-fitting but have enough reweighting.
In this work we suggest using a validation set to determine the relative weights of the experimental data and the simulation prior when reweighting, i.e., determine an optimal value of θ. A similar idea was used by Cesari et al. to find the optimal balance between the distribution arising from the force field and the experimental data [65]. Here, we put forward the following procedure: (1) Split the frames of the trajectory into a training set (t) and a validation set (v) of the same size.
We used odd and even frames for the training and validation set respectively. We use sets of the same size because we aim to compare average values and distributions, not individual conformations of the validation set. A validation set as large as the test set is therefore needed to minimize the standard error of the mean and discretization errors of the distribution. We here chose to use an interleaved training and validation set after confirming that the two sets have highly uncorrelated CSs as frames are sampled only once every 1 ns. (2) Fit the BME for a range of θ values to the training set and apply the optimized Lagrange parameters λ to reweight the validation set. χ 2 red (Equation (2)) of the training (χ 2 red (t)) and the validation set (χ 2 red (v)). b. Average distance between the training and the validation sets (D(t,v)). c.
Average distance between the training set and the target (goal) distribution (D(t,g)), and average distance between the validation set and the target (goal) distribution (D(v,g)).
Note that step 3c is only possible in the case of synthetic data and is used as a benchmark for a selection procedure. Ideally we would like to reweight as long as the validation set distribution approaches the target distribution, i.e., we want to minimize D(v,g). In a real case scenario, as the target distribution is not known, we cannot measure this distance. Therefore we need to see if any of the accessible measures can give us information about D(v,g). In this work we suggest using a validation set to determine the relative weights of the experimental data and the simulation prior when reweighting, i.e., determine an optimal value of θ. A similar idea was used by Cesari et al. to find the optimal balance between the distribution arising from the force field and the experimental data [65]. Here, we put forward the following procedure: (1) Split the frames of the trajectory into a training set (t) and a validation set (v) of the same size.
We used odd and even frames for the training and validation set respectively. We use sets of the same size because we aim to compare average values and distributions, not individual conformations of the validation set. A validation set as large as the test set is therefore needed to minimize the standard error of the mean and discretization errors of the distribution. We here chose to use an interleaved training and validation set after confirming that the two sets have highly uncorrelated CSs as frames are sampled only once every 1 ns. (2) Fit the BME for a range of θ values to the training set and apply the optimized Lagrange parameters λ to reweight the validation set. (Equation (2)) of the training ( ( )) and the validation set ( ( )).
b. Average distance between the training and the validation sets (D(t,v)). c. Average distance between the training set and the target (goal) distribution (D(t,g)), and average distance between the validation set and the target (goal) distribution (D(v,g)).
Note that step 3c is only possible in the case of synthetic data and is used as a benchmark for a selection procedure. Ideally we would like to reweight as long as the validation set distribution approaches the target distribution, i.e., we want to minimize D(v,g). In a real case scenario, as the target distribution is not known, we cannot measure this distance. Therefore we need to see if any of the accessible measures can give us information about D(v,g). Figure 3. CS difference between the target CS and the a03ws CS for the three atoms used (C, CA, and CB) compared to the difference in helicity for these two ensembles (in red). Although for many residues the difference lies below the error of the predictor, for some regions it does not. Besides, the helicity is correlated with the CA and the CB CS difference. Figure 4 plots all the values described in the previous procedure. As expected, the BME produces an average CS that approaches the target CS, so that it systematically reduces ( ). At θ≈3 the validation set fits best the target CS and then ( ) starts increasing. This is a sign of overfitting. Interestingly the behaviour of these averaged values is followed very similarly by the distributions, as represented by the shape of D(t,g) and D(v,g). The fact that the distributions, which Figure 3. CS difference between the target CS and the a03ws CS for the three atoms used (C, CA, and CB) compared to the difference in helicity for these two ensembles (in red). Although for many residues the difference lies below the error of the predictor, for some regions it does not. Besides, the helicity is correlated with the CA and the CB CS difference.  Figure 4 plots all the values described in the previous procedure. As expected, the BME produces an average CS that approaches the target CS, so that it systematically reduces χ 2 red (t). At θ ≈ 3 the validation set fits best the target CS and then χ 2 red (v) starts increasing. This is a sign of overfitting. Interestingly the behaviour of these averaged values is followed very similarly by the distributions, as represented by the shape of D(t,g) and D(v,g). The fact that the distributions, which are what we aim to fit, and the average values have parallel behaviours is very positive as this tells us that we can use the average value as a proxy of the distribution. At a value close to θ ≈ 3, D(t,v) has a high curvature and starts growing. The training and the validation distributions, which had been very similar for values θ > 3, start diverging. This confirms that at this point, overfitting occurs. The fit to the C36m ensemble shows a very similar trend ( Figure S6). A different measure also suggests the start of the overfitting regime. Figure S7 shows the root mean square of the Lagrange λ terms that the BME introduces for the fit (see Equation (4)). At θ < 2 the λ parameters start growing at a much faster speed, showing that from this point the reweighting becomes very strong and very sensitive to the θ value. been very similar for values θ > 3, start diverging. This confirms that at this point, overfitting occurs. The fit to the C36m ensemble shows a very similar trend ( Figure S6). A different measure also suggests the start of the overfitting regime. Figure S7 shows the root mean square of the Lagrange λ terms that the BME introduces for the fit (see Equation (4)). At θ < 2 the λ parameters start growing at a much faster speed, showing that from this point the reweighting becomes very strong and very sensitive to the θ value. The quantities defined as thin lines would not be measurable in a real case scenario, but their behaviour can be inferred from the quantities in thick lines.
To evaluate the importance of the error introduced by the predictor and whether this was leading to overfitting, we reproduced the same analysis with target CS coming from the PPM predictor, i.e., excluding the predictor error. As Figure S8 depicts, the behaviour of the different quantities is equivalent to Figure 4, with the only difference than the difference in distributions D(t,g) and D(v,g) and shifted to lower values, as expected from the use of the same predictor. The two distributions are more similar, because the predictor error is missing, and they become diverging at a slightly smaller value of θ, as also does. This shows that the stochastic error of the predictor averages out and that BME is robust to it. Systematic errors in the predictor could be important, as will be shown below, but in this case their influence is much smaller than the differences in CS arising from conformational differences.
Once an optimal value for θ has been determined, other properties of the ensemble can be calculated. As an example, here we calculate the α-helical content and plot it in Figures 5 and S9. We stress that the reweighting improves the fit, but does not lead to a perfect agreement. There are several sources for the disagreement. The major source of the disagreement is the fact that the predictor (PPM software) has an error in the prediction with respect to the target value (Sparta+ software). Even for the same ensemble of configurations predicted and target CS differ, thus leading to a disagreement between secondary structure content, as is discussed below. Further, although the two are tightly related, there is not a one-to-one relationship between the secondary structure and the CS values. Also, the reweighted ensemble may not contain structures with the optimal α-helical content. For example, the a03ws ensemble contains a negligible amount of α-helical structures around residue number 60, therefore, even after the reweighting, the helicity of this region cannot increase and remains too low. Finally, as long as the experimental data does not uniquely specify the The quantities defined as thin lines would not be measurable in a real case scenario, but their behaviour can be inferred from the quantities in thick lines.
To evaluate the importance of the error introduced by the predictor and whether this was leading to overfitting, we reproduced the same analysis with target CS coming from the PPM predictor, i.e., excluding the predictor error. As Figure S8 depicts, the behaviour of the different quantities is equivalent to Figure 4, with the only difference than the difference in distributions D(t,g) and D(v,g) and shifted to lower values, as expected from the use of the same predictor. The two distributions are more similar, because the predictor error is missing, and they become diverging at a slightly smaller value of θ, as χ 2 red also does. This shows that the stochastic error of the predictor averages out and that BME is robust to it. Systematic errors in the predictor could be important, as will be shown below, but in this case their influence is much smaller than the differences in CS arising from conformational differences.
Once an optimal value for θ has been determined, other properties of the ensemble can be calculated. As an example, here we calculate the α-helical content and plot it in Figure 5 and Figure S9. We stress that the reweighting improves the fit, but does not lead to a perfect agreement. There are several sources for the disagreement. The major source of the disagreement is the fact that the predictor (PPM software) has an error in the prediction with respect to the target value (Sparta+ software). Even for the same ensemble of configurations predicted and target CS differ, thus leading to a disagreement between secondary structure content, as is discussed below. Further, although the two are tightly related, there is not a one-to-one relationship between the secondary structure and the CS values. Also, the reweighted ensemble may not contain structures with the optimal α-helical content. For example, the a03ws ensemble contains a negligible amount of α-helical structures around residue number 60, therefore, even after the reweighting, the helicity of this region cannot increase and remains too low. Finally, as long as the experimental data does not uniquely specify the target distribution, procedures such as this will always be affected by the choice of a reference (prior) distribution and will thus not give the exact target distribution. When we examine the α-helical content of the validation ensemble after reweighting we find that it is further away from the initial ensemble than the training ensemble. This makes sense as we are using a Maximum Entropy algorithm, which ensures minimal perturbation of the training ensemble from the original one. The minimal perturbation, expressed as minimization of the Kullback-Leibler divergence, is true only for the training ensemble. Therefore, the parameters of the training ensembles applied to the validation ensemble lead to a further divergence from the original ensemble. Consequently, the described procedure should be used to determine the optimal θ, but once it is known, one should re-fit the complete ensemble with that optimal θ. Figures 5 and S9 show that the reweighting changed the helicity the most in the region between residues 30 and 45. This is because CSs are not only sensitive to local residue conformations, but also to neighbouring residue conformations. This results in the calculated data in long helical stretches differing most from a random coil than short helices. Consequently, the reweighting is able better to 'see' long helices and reweight those regions accordingly.
The reweighting will only improve properties that are sensitive to the experimental data used. In this case, the CS of different atoms are sensitive to the helical conformations to a different degree. Thus, reweighting leads to an ensemble with improved helical content. Other properties, such as the radius of gyration (Rg), are sensitive to the global mass distribution of the conformations. For expanded ensembles, an increase of helicity leads to more collapsed ensembles, as helical structures are compact [66,67]. For the a03ws, reweighting leads to an overall decrease of helicity ( Figure S10), and therefore to a larger Rg. For the C36m, the slight increase in helicity does not correlate with the Rg, presumably because the initial ensemble is already rather compact. As also emphasized by Best as co-workers, both local (such as secondary structure) and global (such as Rg) properties should be used to characterize IDPs ensembles, and one should not expect the reweighting based on one set of these properties to improve the other [1]. Using these other properties as cross-validation, as it is sometimes done, may lead to an incorrect perception of the amount of reweighting needed. As an example, the C36m fitting shows a minimum that could suggest that the evolution of Rg could be used as a cross-validating property, but the a36m shows a monotonic increase, for the reasons previously mentioned. If several experimental data are known, they should be included in the reweighting to obtain more realistic ensembles with improved local and global properties.
A good way to estimate the effect of the predictor error is to fit the target ensemble to itself. In When we examine the α-helical content of the validation ensemble after reweighting we find that it is further away from the initial ensemble than the training ensemble. This makes sense as we are using a Maximum Entropy algorithm, which ensures minimal perturbation of the training ensemble from the original one. The minimal perturbation, expressed as minimization of the Kullback-Leibler divergence, is true only for the training ensemble. Therefore, the parameters of the training ensembles applied to the validation ensemble lead to a further divergence from the original ensemble. Consequently, the described procedure should be used to determine the optimal θ, but once it is known, one should re-fit the complete ensemble with that optimal θ. Figure 5 and Figure S9 show that the reweighting changed the helicity the most in the region between residues 30 and 45. This is because CSs are not only sensitive to local residue conformations, but also to neighbouring residue conformations. This results in the calculated data in long helical stretches differing most from a random coil than short helices. Consequently, the reweighting is able better to 'see' long helices and reweight those regions accordingly.
The reweighting will only improve properties that are sensitive to the experimental data used. In this case, the CS of different atoms are sensitive to the helical conformations to a different degree. Thus, reweighting leads to an ensemble with improved helical content. Other properties, such as the radius of gyration (R g ), are sensitive to the global mass distribution of the conformations. For expanded ensembles, an increase of helicity leads to more collapsed ensembles, as helical structures are compact [66,67]. For the a03ws, reweighting leads to an overall decrease of helicity ( Figure S10), and therefore to a larger R g . For the C36m, the slight increase in helicity does not correlate with the R g , presumably because the initial ensemble is already rather compact. As also emphasized by Best as co-workers, both local (such as secondary structure) and global (such as R g ) properties should be used to characterize IDPs ensembles, and one should not expect the reweighting based on one set of these properties to improve the other [1]. Using these other properties as cross-validation, as it is sometimes done, may lead to an incorrect perception of the amount of reweighting needed. As an example, the C36m fitting shows a minimum that could suggest that the evolution of R g could be used as a cross-validating property, but the a36m shows a monotonic increase, for the reasons previously mentioned. If several experimental data are known, they should be included in the reweighting to obtain more realistic ensembles with improved local and global properties.
A good way to estimate the effect of the predictor error is to fit the target ensemble to itself. In that case, the disagreement between predicted and target CS comes exclusively from the use of different predictors. We will refer to predictor error as the difference between the predictor used for the reweighting ensemble (PPM) and the predictor used as a target (Sparta+) without any assumption of which one gives CS closer to the true experimental ones.
If the predictor error was a Gaussian noise with zero mean, it would average to zero for an ensemble of tens of thousands of structures. However, the predictor has systematic deviations that depend on the residue type and its conformation ( Figure S11). The systematic deviations also correspond to what higher helical content would give: negative C and CA and positive CB (see Figure 3). This suggests that the reweighting will tend to decrease the helical content.
By following the same procedure as before, we obtain an optimal value of θ ≈ 2 ( Figure S12). It may seem surprising that when the reweighted ensemble matches exactly the target ensemble the θ value is lower than when fitting the a03ws and C36m ensembles. Two things need to be considered. First, the amount of disagreement between the CSs is not much lower than when fitting the a03ws or the C36m ensembles. This shows that the predictor is a major source of errors, but even with considerable unknown systematic errors in the predictor, the reweighting of the a03ws and C36m ensembles gave consistently improved ensembles. Second, θ values between ensembles are not comparable because θ does not determine the strength of the reweighting. The strength of the reweighting is determined from the optimized values of the weights, which themselves arise from the optimized values of the Lagrange (λ) parameters (see Equation (4)). For a given θ, the reweighting is weaker for the current a99SBdisp ensemble than for the a03ws and C36m (see Figure S7). As Figure 6 shows, even after the reweighting with θ = 2, the helicity has changed, by at most~0.05, from 0.31 to 0.26 in the region of residue 37. For all the other regions the changes in helicity are smaller. After all, the reweighting procedure 'sees' CSs that would correspond to lower helicity, and therefore results in an ensemble with lower helicity. If the predictor error was a Gaussian noise with zero mean, it would average to zero for an ensemble of tens of thousands of structures. However, the predictor has systematic deviations that depend on the residue type and its conformation ( Figure S11). The systematic deviations also correspond to what higher helical content would give: negative C and CA and positive CB (see Figure 3). This suggests that the reweighting will tend to decrease the helical content.
By following the same procedure as before, we obtain an optimal value of θ ≈ 2 ( Figure S12). It may seem surprising that when the reweighted ensemble matches exactly the target ensemble the θ value is lower than when fitting the a03ws and C36m ensembles. Two things need to be considered. First, the amount of disagreement between the CSs is not much lower than when fitting the a03ws or the C36m ensembles. This shows that the predictor is a major source of errors, but even with considerable unknown systematic errors in the predictor, the reweighting of the a03ws and C36m ensembles gave consistently improved ensembles. Second, θ values between ensembles are not comparable because θ does not determine the strength of the reweighting. The strength of the reweighting is determined from the optimized values of the weights, which themselves arise from the optimized values of the Lagrange (λ) parameters (see Equation (4)). For a given θ, the reweighting is weaker for the current a99SBdisp ensemble than for the a03ws and C36m (see Figure  S7). As Figure 6 shows, even after the reweighting with θ = 2, the helicity has changed, by at most ~ 0.05, from 0.31 to 0.26 in the region of residue 37. For all the other regions the changes in helicity are smaller. After all, the reweighting procedure 'sees' CSs that would correspond to lower helicity, and therefore results in an ensemble with lower helicity. Although the reweighting from the target a99SBdisp ensemble was not large, deviating from the initially correct ensemble is not satisfactory. We therefore looked for alternatives to minimize the reweighting. One alternative is to improve the matching between calculated and target CS. One would be to use the best possible predictor as its accuracy when comparing with actual experimental CS should not be overlooked.
Considering that a source of error in the predictor may come from the baseline CS for each Although the reweighting from the target a99SBdisp ensemble was not large, deviating from the initially correct ensemble is not satisfactory. We therefore looked for alternatives to minimize the reweighting. One alternative is to improve the matching between calculated and target CS. One would be to use the best possible predictor as its accuracy when comparing with actual experimental CS should not be overlooked.
Considering that a source of error in the predictor may come from the baseline CS for each residue, we hypothesized that the use of secondary CSs could lead to a cancellation of errors. Secondary CS are defined as the difference between measured CSs and CSs from random coil conformations [20]. They measure the change of CS when a given residue adopts a secondary structure conformation. If predictors are good at describing this change then secondary CS could lead to more sensitive reweightings. As secondary structures can be assigned at a residue level for each conformation of a computational ensemble, the generation of secondary CS from computed ensembles is fast and simple (see Methods section).
The use of secondary CS leads to essentially no-reweighting. We repeated the optimization procedure described above with calculated and target secondary CSs. The optimization of θ leads to flat curves ( Figure S13), that at first sight could suggest that the optimal value of θ is difficult to determine. To a certain extent this is true, the reason being that the reweighting is mostly insensitive to θ. What is more interesting is that the reweighting is negligible even for extremely small values of θ, as Figure S13 shows, so that the reweighted ensemble remains essentially unchanged and thus, close to target ensemble as it was desirable.
As expected, the negligible reweighting does not lead to overfitting. Figure S15 shows that in the fitting of secondary CS, the effective sample size stabilizes to a value of 0.42 for θ < 10 −3 . A value of N eff far from 0 shows that the method is not overfitting. We remind the reader that the amount of reweighting is not determined by θ itself but by the Lagrange λ parameters that result from the optimization procedure. The result of this procedure tells us that, whatever our relative confidence in the computed ensemble and the target CS, the reweighting will be essentially zero.
The secondary CS before any reweighting already show an excellent agreement with the target ones, so that reweighting is not necessary. The χ 2 red before reweighting in one order of magnitude smaller for secondary CS than for CS. However, we wanted to test our method in an extreme case. We conclude that even when it could easily lead to overfitting, the nature of BME prevents that, leading to stable results. But it also shows that when there are alternative methods to reduce the errors of the predictors, as the use of secondary CSs, the reweighting becomes simpler (in the sense that one does not need to tune θ) and more accurate.

Discussion
The Maximum Entropy approach is a simple, yet powerful method to fit computed ensembles to experimental data [68]. Its extension into a Bayesian approach allows the treatment of uncertainties arising from both experiment and computation [17,25,26]. It is often difficult to determine the optimal balance between the prior information, encoded in the force field, and the experimental data. These difficulties arise in particular because computational errors are difficult to assess. They arise from the predictor (the forward model), the convergence of the simulation, the accuracy of the force field used and the computational reproducibility of the experimental conditions. Also, in many cases it is difficult to determine accurate estimates of experimental errors. Together, these effects result in difficulties in determining a unique value of θ, or that a wide range of possible θ values appear comparably good. Because of uncertainties in the experimental and computational errors it is difficult to choose the level of reweighting simply by the magnitude of χ 2 red , thus making it important to find objective and robust criteria to determine the values of θ.
Here we suggest using a validation set that arises from the same distribution, and therefore belongs to the same model as the training set. As we show in the case of comparing results from CS and radius of gyration, the use of other physical observables for cross-validation can sometimes lead to a wrong choice of θ. If the observables are strongly correlated to the ones being fitted, it will be hard to detect any overfitting, as the overfitted ensemble, even if having a small effective size, will also reproduce the correlated observables. On the contrary, if the physical observable is not correlated, then there is no reason why the reweighted ensemble should improve its fitting, as we showed for the radius of gyration.
The split of the data between training and validation sets can be done in different ways. The two sets would be maximally uncorrelated if the sets arise from the first and the second half of the trajectory. But unless each of the halves is fully equilibrated-which is not the case in our trajectory-the two sets would correspond to different distributions, and one should not expect that the same Lagrange λ parameters should apply to the two sets. For instance, if the first half samples more helical conformations than the second, the reweighting could lead to a reduction of helicity for the first half and an increase for the second. On the contrary, by using interleaved frames to create training and validation sets we ensure they represent samples of the same distribution, whether fully equilibrated or not. Therefore, the same Lagrange λ parameters should improve both training and test until the moment we start overfitting the training set. Therefore, there is a range of θ values before the overfitting that are dependent on the distribution to be fitted but independent of the particular structures chosen from that distribution. This is the basis of our selection of the optimal θ value. However, if the validation set is strongly correlated to the training set, the overfitting could not be detected in the validation set. We therefore recommend checking the correlation between neighbouring frames. Alternatively, the reweighting and validation suggested procedure can be repeated for a subsample of the trajectory and it should lead to a similar optimal θ value.
Even if we do not incur in overfitting, a large amount of reweighting indicates either that the prior is considerably different from the target distribution or that the predictor has important systematic errors. In both cases the reweighted ensemble will still be far from the actual ensemble, and one should be especially cautious in not over-interpreting it, especially in extracting structural information from it that is not sensitive to chemical shifts, as we have discussed for the radius of gyration.
We also showed that the errors of the predictor should not be underestimated. In the specific case of NMR CS, it appears that even when the predictors were designed and parameterized to remove systematic errors, they may still arise in applications to a particular system. These unknown systematic errors contribute to the reweighting and are a major source of overfitting that our method avoids to a certain extent.
The use of secondary CS seems a promising approach to reduce the errors of the predictors. Its calculation from a computational ensemble is simple, but the experimental calculation is not trivial. One approach is to use tabulated CS for protein random coils [20], including corrections depending on neighbouring residues as well as pH and temperature [69][70][71][72][73]. A more precise but more time consuming approach is to completely denature an IDP and measure the resulting CS [74,75]. Both approaches introduce some uncertainties that may render the secondary CS less accurate. In future work we will explore whether they still remain a better alternative when reweighing computational ensembles of IDPs.
We would like to end this discussion by reminding the reader that the BME and related approaches are particularly suited to avoid overfitting. Maximization of the entropy implies that the reweighted distribution is minimally perturbed, unlike in other reweighting approaches [45][46][47][48][49][50][51][52][53]. One may think that still, small values of θ, which means a strong confidence on the experimental data, would lead to an overfit of the finite number of computed structures. However, the fitting procedure fits a small number of parameters (equal to the number of CSs), in our case m = 202, to a much larger number of structures, N = 29977. If each weight was fitted individually, the procedure would be highly prone to overfitting, with N = 29777 parameters. Instead, BME determines weights from a much smaller set of parameters, the number of observables (m = 202) plus θ. The effect of this is that the possibility of overfitting is substantially reduced.

Conclusions
In this work we have shown how chemical shifts can be used to improve the configurations arising from molecular dynamics simulations of intrinsically disordered proteins. We have introduced a systematic method to assess the amount of reweighting (θ) needed to fit the experimental chemical shifts based on cross-validation. This approach allows to circumvent the difficulty of knowing the errors associated to the simulations and the chemical shift predictors. We have also shown how the predictor errors lead to an incorrect reweighting as they include a systematic bias. This error seems to be greatly reduced by using secondary chemical shifts, something that we will further explore in future work.
Supplementary Materials: The following are available online at http://www.mdpi.com/1099-4300/21/9/898/s1. Figure S1: Distribution of the difference between the ensemble CS and the average target CS for the 2 ensembles discussed in this work. The x-values correspond to a concatenation of C, CA and CB CS. To be able to reweight the values should be distributed at both sides of y = 0, as is the case. Figure S2: Evolution of the CS difference (residuals) for the reweighting procedure for a range of θ values from 100 (purple values) to 0.1 (yellow values). As the residuals have the same sign for most of the reweighting procedure, one cannot use the Wald-Wolfowitz run test to define the amount of reweighting. Figure S3: Difference in the chemical shifts predicted with Sparta+ and PPM for the a99SBdisp ensemble. The left plots show the value for each residue and the right plots show their distribution for a better visualization of their spread and shift. Atom types are CA (top), C (middle) and CB) bottom. Figure S4: Evolution of χ 2 red with the effective sample size (N eff ), showing the lack of an L-shaped curve. The blue and orange lines are the same as in Figure 1. The NE suffix stands for "no-error". It corresponds to the values where the target CS are calculated with PPM. For the sake of clarity χ 2 red uses the same error as in the case with errors, even though χ 2 red is ill-defined in this case and should be regarded as a scaled root-mean-square error. Figure S5: The CS difference between the target CS and the computed CS for the C36m ensemble compared to the difference in helicity for these two ensembles. Figure S6: Behaviour of different quantities during the reweighting procedure of the C36m ensemble. Figure S7: Root mean square of the Lagrange parameters λ during the reweighing procedure for all three ensembles. See Equation (4). Figure S8: Behaviour of different quantities during the reweighting procedure of the a03ws ensemble with no predictive errors (NE suffix). As was done in figure S3, χ 2 red uses the same error as in the case with errors, even though χ 2 red is ill-defined in this case and should be regarded as a scaled root-mean-square error. Figure S9: α-Helical content for the reweighted and target ensembles for the C36m force field. For the reweighted ensembles the train (C36m-t) and validation (C36m-v) sets are shown. The original ensembles before the reweighting is also shown. Figure S10: Evolution of the radius of gyration for different reweightings. Figure S11: Error in the predictor (PPM) with respect to the target chemical shift (Sparta+) for different secondary structure elements of the a99SBdisp ensemble. For atoms CA and C there is a systematic underestimation of the chemical shift, whereas for CB, there is a systematic overestimation. The errors also depend on the type of secondary structure. The codes are the following: 'C': Loops and irregular elements, 'B': Residue in isolated beta-bridge, 'E': Extended strand, participates in beta ladder, 'G': 3-helix (3/10 helix), 'H': Alpha helix, 'I': 5 helix (pi helix), 'S': bend, and 'T': hydrogen bonded turn, as determined by the DSSP algorithm implemented in MDtraj. Figure S12: Behaviour of different quantities during the reweighting procedure of the a99SBdisp ensemble. The quantities defined as thin lines would not be measurable in a real case scenario, but their behaviour can be inferred from the quantities in thick lines. Figure S13: Behaviour of different quantities during the reweighting procedure of the a99SBdisp ensemble using secondary chemical shifts. The quantities defined as thin lines would not be measurable in a real case scenario, but their behaviour can be inferred from the quantities in thick lines. Remark that the χ 2 red values are very small for all θ values. D(t,g) and D(t,v) have been scaled by 1/20 so that the shape of the curves could be seen. Figure S14: α-Helical content for the reweighted and target ensembles using secondary chemical shifts. For the reweighted ensembles the train (a99SBdisp-t) and validation (a99SBdisp-v) sets are shown. The original ensemble before the reweighting is also shown and, as expected, it corresponds exactly to the target ensemble as they are the same. Figure S12: Evolution of the effective sample size (N eff ) with θ for the secondary chemical shift fitting of a99SBdisp ensemble to the a99SBdisp target ensemble. Figure S15: Evolution of the effective sample size (N eff ) with θ for the secondary chemical shift fitting of a99SBdisp ensemble to the a99SBdisp target ensemble.