Guide for determination of protein structural ensembles by combining cryo‐EM data with metadynamics

Metadynamics electron microscopy metaInference (MEMMI) is an integrative structural biology method that enables a rapid and accurate characterization of protein structural dynamics at the atomic level and the error in the cryo‐EM experimental data, even in cases where conformations are separated by high energy barriers. It achieves this by incorporating (a) cryo‐electron microscopy electron density maps with (b) metadynamic‐enhanced‐sampling molecular dynamics. Here, I showcase the setup and analysis protocol of MEMMI, used to discover the atomistic structural ensemble and error in the cryo‐EM electron density map of the fuzzy coat of IAPP, a fibril implicated in type II diabetes.

Metadynamics electron microscopy metaInference (MEMMI) is an integrative structural biology method that enables a rapid and accurate characterization of protein structural dynamics at the atomic level and the error in the cryo-EM experimental data, even in cases where conformations are separated by high energy barriers. It achieves this by incorporating (a) cryoelectron microscopy electron density maps with (b) metadynamicenhanced-sampling molecular dynamics. Here, I showcase the setup and analysis protocol of MEMMI, used to discover the atomistic structural ensemble and error in the cryo-EM electron density map of the fuzzy coat of IAPP, a fibril implicated in type II diabetes.
Cryo-electron microscopy (cryo-EM) has brought a revolution in structural biology, by giving access to highresolution structures of macromolecules in near-native environments. With more than 20 000 single-particle electron density maps, the electron microscopy data bank (EMDB) contains rich information about biomacromolecular structures. However, conformational heterogeneity-henceforth referred as structural ensemble-often leads to cryo-EM density maps representing averaged out structural information [1], leading in turn to low-resolution regions of the cryo-EM maps. This effect impedes the reconstruction of an atomistic structure in these regions and leads to structurally missing regions in the final atomistic PDB model.
To address this challenge, inspired by the framework of free energy landscapes and statistical mechanics, the electron microscopy metaInference (EMMI) [2] method models accurately a structural ensemble, by combining noisy (i.e., subject to experimental errors) and heterogeneous (i.e., embedding a structural ensemble) cryo-EM density maps, with prior knowledge of the system given by molecular dynamics (MD) force fields. Moreover, EMMI can not only characterize the atomistic structural ensemble compatible with the cryo-EM density map but also the error in the map. So far, EMMI has been employed in various complex biological problems ranging from antibiotic resistancerelated CLP-protease to microtubule-tau complexes and SARS-CoV-2 membrane receptor proteins and nanobody complexes [3][4][5][6][7][8]. However, the accuracy of EMMI depends on the sampling speed.
The recently developed metadynamics EMMI (MEMMI) has enabled to accelerate the generation of atomistic structural ensembles with slowly interconverting states [9]. MEMMI accelerate the structural ensemble sampling by adding a history-dependent bias to the system as a function of microscopic degrees of freedom of the system known as collective variables (CVs). Albeit MEMMI allows to use many collective variables, thereby reducing the error in made if the choice of CVs is poor, a dedicated review on how to optimize collective variables can be found in Ref. [10]. With this bias, the system is able to escape deep-free energy minima and transition between structural states. In this protocol, we illustrate a step-by-step guide into the preparation and analysis of an MEMMI simulation of our previous study to determine the structural ensemble and cryo-EM density map error of the full-length (residue 1-37) islet amyloid polypeptide fibril (IAPP), a notoriously difficult system to characterize atomistically due to the structural heterogeneity and associated errors in the measurement. The formation of IAPP fibrils are associated with the demise of pancreatic β-cells in type-2 diabetes (T2D). The low-density regions of the 12-residue long N-terminal tails also known as fuzzy coat are thought to play a key role in amyloid function such as in the binding of RNA, phase separation, and mediation of molecular chaperone binding [11,12]. Moreover, the fuzzy coat was demonstrated to be involved in membrane binding, either causing catalysis of aggregation directly or by capturing amyloid precursors, consequently facilitating secondary nucleation pathways [11,12].

Metadynamics electron microscopy MetaInference
While the cryo-EM density map data consist of voxels on a grid, MEMMI uses a Gaussian mixture model ϕ D (x) (GMM) conversion of the voxel map, consisting of N D Gaussian components where x is a Cartesian vector, ω D,i is the scaling factor of the ith component of the data GMM, and G is a Gaussian around x D,i with covariance matrix Σ D,i . The overlap function per Gaussian component ov MD,i measures the agreement of MD generated models and the data GMM, where the model GMM ϕ M (x) is a forward model that converts MD models to GMM. Molecular heterogeneity is dealt in MEMMI by simulating many replicas r of the system and hence the overlap in Eqn (2) is estimated over the ensemble of replicas ov MD,i . In the discrepancy measure, ov MD,i is the forward model that is compared to the data-GMM self-overlap ov DD 1B). MEMMI is designed to handle systematic errors, e.g., biases in the force field or forward model, random errors (e.g., due to error in cryo-EM density map), and errors due to the finite size of the ensemble [22]. Models are generated according to the following MEMMI energy function: where X represents the atomic coordinates of the structural ensemble, comprising individual replicas X r , where r = [1, N R ]. E MD (X) is the energy of the MD forcefield, sampled by multi-replica MD simulations. The second term quantifies the deviation of the structural ensemble with the data GMM map, while properly considering the error associated to the limited number of replicas in the ensemble σ SEM and the random and systematic errors σ B in the prior, forward model, and experiment. Note σ SEM is estimated per data point (σ SEM i ) and can either be set to a constant or calculated on the fly by window averaging [23]. σ B is estimated per datapoint i and replica r as σ B i,r and can be sampled by Monte Carlo at each time step, or calculated a posteriori. The third term, E σ , corresponds to the energy associated with the error σ = (σ B , σ SEM ).
The fourth term represents the metadynamics bias to accelerate the sampling of the structural ensemble. Parallel-bias metadynamics (PBMetaD) [24] and multiple-walkers scheme [25] are used. V PB is a time-dependent biasing potential acting on a set of N CV collective variables s(X), which are functions of the coordinates given as The ensemble average ov MD,i X ð Þ is properly unbiased by the biasing potential as in standard umbrella-sampling [26] The relative error σ i /ov DD,i in the experimental data can be calculated a posteriori by reweighting as ensemble average System preparation All the below operations are carried out in a LINUX or MacOS environment except otherwise specified. The simulation input files can be found in the following github page [27].
1 Download and employ PDB 6Y1A as a template in Robetta server and use the full length IAPP sequence as input and the RosettaCM option. This generates a full length atomistic structure models for IAPP (Fig. 1A). 2 Load the atomistic input of step 1. and the EM map EMD-10669 (cryo_EM_10669.mrc) into CHIMERA (models #0 and #1), respectively, and fit the IAPP structure with the map by executing "fitmap #0 #1". (Fig. 2B). Write a pdb file of this initial IAPP model as iapp_initial.pdb ( Fig. 2B third column). 3 Segment the map (model #1) within 6A around the initial model using the command line command "vop #1 zone #0 6" followed by saving the segmented map relative to the model #0 to cryo_EM_10669_6A.mrc. This is now considered the experimental voxel cryo-EM map shown in Fig. 2B first column. 4 Use GMCONVERT to convert the experimental voxel map generated in the previous step to the data GMM ϕ D by using a divide-andconquer approach (see Fig. 2B second column). We used the code in Ref. [28,29] and respective commands in the command-line are: python3 generate_gmm.py--input_mapcryo_EM_10669_6A.mrc --n_proc 1 cd ITER_1 cat */*gmm > 1.gmm gmconvert VcmpG -igmm 1.gmm -imap ../$1 -zth 0.0 -omap iter_1.mrc > iter_1.log 5 Where the last step is a conversion of the data GMM to a readable version for PLUMED by using "./con-vert_GMM2PLUMED.sh 1.gmm iapp.dat", where the file convert_GMM2PLUMED.sh and generate_gmm.py can

MD equilibration
In a PYTHON JUPYTER notebook, one can use the following commands.
1 Save a no hydrogen structure of the complex.
gmx pdb2gmx -f complex_noh. gmx make_ndx -f em_solv_ion.gro <<EOF r GLU ¦ r ASP a OE1 ¦ a OE2 ¦ a OD1 ¦ a OD2  the initial structures need to be prepared for each replica and the associated replica folders "r$i" where each replica will run. Below, we perform MEMMI using eight replicas.

Analysis
This subsection is dedicated to the analysis of the outcomes. In particular, 1-6 focus, respectively, on how to (1) reconstruct a continuous trajectory of the structural ensemble, (2) plot free energies as a function of collective variables, (3)(4)(5)  To plot the free energy surface of a collective variable, e.g., disulf3 of the 3rd tail of the biased and unbiased side [9], the following commands are executed in the notebook ( Fig. 2A Tips & tricks 1 In MEMMI, we first expressed the experimental voxel map data as a data GMM containing 10 000 Gaussians in total showing 0.975 correlation to the original voxel experimental map (Fig. 2B second column). 2 LINCS is used for bonds constraints [32], the Lenard Jones interactions are switched off with a cutoff at 1 nm and the long-range interactions are treated using PME (Fourier spacing of 0.12 and a 1 nm cutoff for the short-range electrostatic interactions). Pair lists are evolved every 10 fs with 1 nm cutoff every 2 fs. Leap frog and velocity rescale are used to integrate Newton's equations and temperature coupling [33]. The Parrinello-Rahman barostat [34] is used in the NPT, with a coupling time constant of 1.0 ps. 3 Cα are position restrained in the 500 ps NPT equilibration with a 200 kJÁmol −1 nm -2 force constant, while the temperature and pressure are set to 300 K and 1 atm, respectively. 4 In the 2 ns, 300 K, NVT simulation no position restraints are used. 5 Configurations were saved every 10 ps. The cryo-EM restraint is calculated every two MD steps, employing neighbor lists for comparing overlaps between model and data GMMs, with cutoff equal to 0.01 and update frequency of 100 steps. 6 An interesting future direction is to utilize the analysis protocol of Analysis section to compare for a particular protein, the structural ensemble generated by MEMMI and other complementary methods such as MANIFOLDEM [35], CRYOFOLD [36], and MDFF [37].