Metadynamic metainference: Enhanced sampling of the metainference ensemble using metadynamics

Accurate and precise structural ensembles of proteins and macromolecular complexes can be obtained with metainference, a recently proposed Bayesian inference method that integrates experimental information with prior knowledge and deals with all sources of error in the data as well as with sample heterogeneity. The study of complex macromolecular systems, however, requires extensive conformational sampling, which represents a separate challenge. To address this challenge and to generate structural ensembles exhaustively and efficiently, we combine metainference with metadynamics and illustrate the application of the combined approach to the calculation of the free energy landscape of the alanine dipeptide.


Details of the metainference equations
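Properties of conditionally independent variables. In the derivation of the metainference equations, we make extensive use of the properties of conditionally independent variables. Here we review some basic relations. The variables $A$ and $B$ are conditionally independent given $C$ if

$$p(A, B \mid C) = p(A \mid C)\, p(B \mid C) \tag{S1}$$

or equivalently:

$$p(A \mid B, C) = p(A \mid C) \tag{S2}$$

Here we demonstrate that in this situation the following relation holds:

$$p(C \mid A, B) = \frac{p(C \mid A)\, p(B \mid C)}{p(B \mid A)} \tag{S3}$$

We start by applying Bayes theorem to $p(C \mid A, B)$:

$$p(C \mid A, B) = \frac{p(A, B \mid C)\, p(C)}{p(A, B)} \tag{S4}$$

Now we use the conditional independence of $A$ and $B$ given $C$:

$$p(C \mid A, B) = \frac{p(A \mid C)\, p(B \mid C)\, p(C)}{p(A, B)} \tag{S5}$$

and the definition of conditional probability:

$$p(A, B) = p(B \mid A)\, p(A) \tag{S6}$$

to write:

$$p(C \mid A, B) = \frac{p(A \mid C)\, p(B \mid C)\, p(C)}{p(B \mid A)\, p(A)} \tag{S7}$$

If now we apply Bayes theorem to $p(A \mid C)$, we obtain

$$p(C \mid A, B) = \frac{p(C \mid A)\, p(A)}{p(C)} \cdot \frac{p(B \mid C)\, p(C)}{p(B \mid A)\, p(A)} \tag{S8}$$

which leads to Eq. S3:

$$p(C \mid A, B) = \frac{p(C \mid A)\, p(B \mid C)}{p(B \mid A)} \tag{S9}$$

We can further simplify this expression (Eq. S10) in the case of Gaussian noise. In this case, the product of the two Gaussian probability density functions (PDFs) is a scaled Gaussian PDF, and the scaling factor is itself a Gaussian PDF. Since typically we are not interested in determining $\tilde{f}_r$, we can marginalize it from Eq. S10 after inserting Eq. S13. The resulting expression can be simplified in the case of Gaussian noise on all data points.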
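The Gaussian-noise simplification above rests on a standard identity: the product of two Gaussian PDFs in the same variable is a Gaussian PDF multiplied by a scaling factor, and that scaling factor is itself a Gaussian PDF whose variance is the sum of the two variances. The following is a minimal numerical sketch of that identity, not the authors' code. The names `d`, `f`, `sigma_b` and `sigma_sem` are illustrative assumptions for a data point, a forward-model average, a data uncertainty and a standard error of the mean; they are not taken from the equations of this section.

```python
import math


def gauss(x, mu, sigma):
    """Normalized Gaussian PDF N(x; mu, sigma)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def gaussian_product(a, s1, b, s2):
    """Return (scale, mu, sigma) such that
    N(x; a, s1) * N(x; b, s2) == scale * N(x; mu, sigma) for every x."""
    var_sum = s1 ** 2 + s2 ** 2
    mu = (a * s2 ** 2 + b * s1 ** 2) / var_sum
    sigma = s1 * s2 / math.sqrt(var_sum)
    # The scaling factor is a Gaussian in (a - b) with variance s1^2 + s2^2.
    scale = gauss(a, b, math.sqrt(var_sum))
    return scale, mu, sigma


def marginal_likelihood(d, f, sigma_b, sigma_sem, n=20001):
    """Integrate N(d; x, sigma_b) * N(x; f, sigma_sem) over x (trapezoid rule).

    x plays the role of the quantity that is marginalized out (hypothetical naming).
    """
    half_width = 12.0 * max(sigma_b, sigma_sem)
    lo, hi = min(d, f) - half_width, max(d, f) + half_width
    h = (hi - lo) / (n - 1)
    total = 0.0
    for k in range(n):
        x = lo + k * h
        w = 0.5 if k in (0, n - 1) else 1.0
        total += w * gauss(d, x, sigma_b) * gauss(x, f, sigma_sem)
    return total * h
```

Because the scaled Gaussian integrates to the scaling factor alone, `marginal_likelihood(d, f, sigma_b, sigma_sem)` should agree with `gauss(d, f, math.sqrt(sigma_b**2 + sigma_sem**2))` up to integration error, which is why marginalization leaves a single Gaussian with a combined uncertainty.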

The prior
$\sigma_{r,i}$ can be modeled using a unimodal distribution peaked around a typical dataset effective uncertainty $\sigma_{r,0}$ and with a long tail to tolerate outlier data points. Here $\sigma_{r,0} = \sqrt{(\sigma^{SEM})^2 + (\sigma^{B}_{r,0})^2}$, where $\sigma^{SEM}$ is the standard error of the mean for all data points in the dataset and $\sigma^{B}_{r,0}$ is the typical data uncertainty of the dataset. We can thus marginalize $\sigma_{r,i}$ by integrating over all its possible values, given that all the data uncertainties $\sigma^{B}_{r,i}$ range from 0 to infinity, which leads to the metainference energy function in Eq. 20 of the main text.

Computational details
2) We ran a 200 ns-long molecular dynamics simulation using the AMBER99SB-ILDN force field and the static WTMetaD bias potential obtained at the end of step 1. Under the effect of the bias, the system explored all the relevant regions of the Ramachandran plot [3]. In this simulation, configurations were saved every 0.2 ps, for a total of $10^6$ frames.
3) Using the driver utility of PLUMED [4], we back-calculated all 36 distances from the trajectory obtained at step 2. We used these distances to calculate the averages in the ensemble defined by Eq. S25. To do so, we had to reweight each frame to eliminate the effect of the static WTMetaD bias potential, which ensured ergodicity, and to add the offset in the free energy of the $C_{ax}$ local minimum. Therefore, in the calculation of the average distances from the trajectory generated at step 2, each frame was assigned a weight [5] expressed in terms of $k_B T$, where $k_B$ is the Boltzmann constant and $T$ is the temperature of the system. The final averages are reported in Tab. S1 (third column) along with the average distances calculated separately in the regions of $C_{7eq}$ ($\phi < 0$, fourth column) and $C_{ax}$ ($\phi > 0$, fifth column). The average distances in the AMBER99SB-ILDN ensemble are reported in Tab. S1, sixth column.

4) To introduce systematic errors in the pure (synthetic) data calculated at step 3, we added a random offset in the range from 0.2 to 0.3 nm to 20% of the final averages reported in Tab. S1.

For the bias exchange metadynamics [12] (BEM) part, we used the 4 dihedrals defined above as CVs and the same Gaussian parameters as in PBMetaD. Exchanges were attempted every 1000 MD steps. Each metainference replica used one dihedral as CV. Since we used a total of 8 replicas and 4 CVs, each CV was biased by 2 different replicas.
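The reweighting of step 3 and the corruption of step 4 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' protocol: the weight used here is the usual static-bias form exp(V/k_BT), the free-energy offset of the $C_{ax}$ region is applied as an extra energy on frames with $\phi > 0$ (the sign convention is an assumption), and 20% of 36 averages is rounded to 7. All function and parameter names (`frame_weights`, `cax_offset`, `add_systematic_errors`, `seed`) are hypothetical.

```python
import math
import random

KB = 0.0083144626  # Boltzmann constant in kJ/(mol K)


def frame_weights(bias, phi, temperature, cax_offset):
    """Normalized per-frame weights that remove a static bias.

    bias: bias energy of each frame (kJ/mol); phi: dihedral of each frame (rad);
    cax_offset: energy (kJ/mol) added to frames with phi > 0 (assumed convention).
    """
    kbt = KB * temperature
    exponents = [(v - (cax_offset if p > 0 else 0.0)) / kbt for v, p in zip(bias, phi)]
    shift = max(exponents)  # subtract the maximum to avoid overflow in exp
    raw = [math.exp(e - shift) for e in exponents]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_average(values, weights):
    """Ensemble average of one observable using normalized weights."""
    return sum(v * w for v, w in zip(values, weights))


def add_systematic_errors(averages, fraction=0.2, low=0.2, high=0.3, seed=0):
    """Add a random offset in [low, high] nm to a random subset of the averages.

    Returns the corrupted list and the sorted indices of the corrupted entries.
    """
    rng = random.Random(seed)
    n_bad = round(fraction * len(averages))
    corrupted = rng.sample(range(len(averages)), n_bad)
    noisy = list(averages)
    for i in corrupted:
        noisy[i] += rng.uniform(low, high)
    return noisy, sorted(corrupted)
```

In use, one would compute `frame_weights` once per trajectory and call `weighted_average` for each of the 36 distances before passing the resulting list to `add_systematic_errors`. Fixing `seed` keeps the choice of corrupted data points reproducible.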
Metainference simulation details. All the simulations were performed using the same parameters as the M&M runs described above.

Table S1. Average distances (in nm) between all pairs of (non-bonded) heavy atoms of alanine dipeptide, calculated in the reference ensemble (third column), only in the regions of the $C_{7eq}$ (fourth column) and $C_{ax}$ (fifth column) local minima, and in the ensemble generated by the AMBER99SB-ILDN prior (sixth column). The atom IDs in the first and second columns correspond to the atom numbers in the PDB file reported below. The metainference score is defined as a function of the (average) atom distances defined here.