Structural ensembles of disordered proteins from hierarchical chain growth and simulation

Disordered proteins and nucleic acids play key roles in cellular function and disease. Here we review recent advances in the computational exploration of the conformational dynamics of flexible biomolecules. We focus on hierarchical chain growth (HCG) from fragment libraries built with atomistic molecular dynamics simulations. HCG combines chain fragments in a statistically reproducible manner into ensembles of full-length atomically detailed biomolecular structures. The input fragment structures are typically collected from molecular dynamics simulations, but could also come from structural databases. Experimental data can be integrated during and after chain assembly. Applications to the neurodegeneration-linked proteins $\alpha$-synuclein, tau, and TDP-43, including in condensate form, illustrate the use of HCG. We conclude by highlighting the emerging connections to AI-based structural modeling.

• HCG builds on successful data-driven statistical coil models in structural biology and polymer theory, and complements molecular dynamics (MD) simulations.

Introduction
A significant fraction of the proteome in higher organisms consists of intrinsically disordered proteins (IDPs) that do not fold into well-defined structures and of proteins with intrinsically disordered regions (IDRs) [1]. Disordered segments are also present in nucleic acids. In particular, single-stranded RNAs (ssRNAs) such as messenger RNA (mRNA) feature regions that do not form double helices or other folded structures [2,3]. IDPs and IDRs are unfolded in solution and can transiently adopt secondary structure [4]. Binding to other biomolecules can induce IDRs to fold [5], though disorder can persist also in the bound state [6]. IDPs and IDRs have distinct functions, e.g., in the nuclear pore complex [7], are a major component of biomolecular condensates [8], and are closely linked to neurodegenerative diseases [9] with their interactions (dys)regulated by mutations and posttranslational modifications [10,11].
The structural heterogeneity of IDPs is best represented by a broad structural ensemble [12]. Non-local interactions in IDPs are necessarily transient, unlike in folded proteins. As a consequence, the conformation space of IDPs is inherently hierarchical in the sense that, at any scale, the local conformational preference will be minimally impacted by regions distant in sequence. Building on this principle, we recently introduced hierarchical chain growth (HCG) [13 •• ] to explore the structural heterogeneity of IDPs.
Here, we review the concepts and applications of HCG by computational fragment assembly as an extension, alternative, and complement to molecular dynamics (MD) simulations for IDPs. By preserving the local structure across scales where possible, chain growth is appealing not only because of high computational speed and flexibility, but also by the possibility to produce accurate representations of the structural ensembles even of large IDPs. Chain growth can be used to create a broad ensemble of structures that can, if needed, be refined by integrative modeling using experimental data and/or MD simulations.

Chain growth
Modeling of the global structure of polymers has long been approached by chain growth algorithms. For a biomolecule with internal structure, we imagine dividing its sequence into fragments (Figure 1). For each of these fragments, we generate a pool of structures, as illustrated schematically with the four urns in Figure 1. This pool may be filled with local structures taken from databases of experimental structures or from molecular dynamics simulations of chain fragments. The task is then to assemble these fragments by a chain-growth algorithm. Naively one might consider that one simply needs to grow polymer chains sequentially, say from N to C terminus (Figure 1A). However, so as not to introduce a bias, one would have to stop the growth of a chain as soon as a clash is encountered and start to grow an entirely new chain instead of simply redrawing a new fragment (Figure 1B). Otherwise, the outcome will depend on arbitrary choices such as the direction of chain growth, N-to-C versus C-to-N. Rosenbluth and Rosenbluth recognized this problem of detailed balance in chain growth early in the history of computer simulations, and addressed it by a careful reweighting of self-avoiding random walks (SAWs) on a lattice [14].

Figure 1: Schematic of naive, iterative, and hierarchical chain growth. The structures of a linear biopolymer are assembled from four fragments (colored chains) picked from their respective pools (ovals). (A) Growing chains by a naive algorithm. On encountering a clash, the current chain is rejected and a fresh attempt is launched. While correct, this algorithm is extremely inefficient for long chains. (B) Iterative algorithm. Instead of re-growing the entire chain when a clash is detected, many chain-growth approaches simply repeat the step until a conformation without clash is obtained. Such algorithms are incorrect unless the bias resulting from repeated drawings is properly accounted for, as in Rosenbluth sampling. (C) Hierarchical chain growth (HCG) is a correct and efficient algorithm. Different fragments are recursively combined until the full-length chain is obtained. Absent steric clashes, monomer fragments are combined to dimers, dimers to tetramers and so on. For chains with $N = 2^M$ fragments, the algorithm has only $M = \log_2 N$ assembly levels.
In combination with importance sampling, chain growth has become a powerful tool to create large ensembles for polymers, including biopolymers [15,16]. To grow a chain, one assembles short fragments that can be sampled very efficiently at high quality. For IDPs, the flexible-meccano model by Bernadó et al. [12] is widely used, also for proteins under physiological conditions [17]. It builds on the observation that the local structure in IDPs is captured well by coil models [18-23]. In flexible-meccano, chains are grown based on the backbone-dihedral statistics in the Protein Data Bank (PDB).
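The difference between naive growth, iterative redrawing, and Rosenbluth reweighting can be made concrete with self-avoiding walks on a square lattice. The following is a minimal sketch under simplifying assumptions: the lattice model, the function names (`grow_naive`, `grow_rosenbluth`, `compare`), and the choice of the squared end-to-end distance as observable are ours for illustration and are not taken from the cited implementations.

```python
import random

MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]  # steps on the square lattice


def end_to_end_sq(walk):
    """Squared distance between the first bead (origin) and the last bead."""
    x, y = walk[-1]
    return x * x + y * y


def grow_naive(n_steps, rng):
    """Naive growth: on the first clash the whole chain is discarded (None)."""
    pos = (0, 0)
    walk, occupied = [pos], {pos}
    for _ in range(n_steps):
        dx, dy = rng.choice(MOVES)
        pos = (pos[0] + dx, pos[1] + dy)
        if pos in occupied:
            return None
        walk.append(pos)
        occupied.add(pos)
    return walk


def grow_rosenbluth(n_steps, rng):
    """Grow by choosing only among free neighbours; return (walk, weight).

    The weight is the product over steps of (free neighbours / 4), which
    compensates for the bias introduced by never stepping onto occupied sites.
    """
    pos = (0, 0)
    walk, occupied, weight = [pos], {pos}, 1.0
    for _ in range(n_steps):
        free = [(pos[0] + dx, pos[1] + dy) for dx, dy in MOVES
                if (pos[0] + dx, pos[1] + dy) not in occupied]
        if not free:
            return walk, 0.0  # dead end: the chain carries zero weight
        weight *= len(free) / len(MOVES)
        pos = rng.choice(free)
        walk.append(pos)
        occupied.add(pos)
    return walk, weight


def compare(n_steps=4, n_attempts=20000, seed=1):
    """Estimate the mean squared end-to-end distance in three ways."""
    rng = random.Random(seed)
    attempts = (grow_naive(n_steps, rng) for _ in range(n_attempts))
    naive = [w for w in attempts if w is not None]
    grown = [grow_rosenbluth(n_steps, rng) for _ in range(n_attempts)]
    complete = [(c, w) for c, w in grown if w > 0.0]
    w_sum = sum(w for _, w in complete)
    return {
        "naive_r2": sum(map(end_to_end_sq, naive)) / len(naive),
        # unweighted average over redraw-style chains: biased
        "iterative_r2": sum(end_to_end_sq(c) for c, _ in complete) / len(complete),
        # weighted average: bias removed by the Rosenbluth weights
        "rosenbluth_r2": sum(w * end_to_end_sq(c) for c, w in complete) / w_sum,
        "acceptance": len(naive) / n_attempts,
        "mean_weight": w_sum / n_attempts,
    }


if __name__ == "__main__":
    print(compare())
```

For four steps, exact enumeration gives 100 self-avoiding walks with a mean squared end-to-end distance of 7.04, so the naive and the reweighted estimates should both approach this value, whereas the unweighted average over the redraw-style chains comes out too small because compact chains are generated more often than extended ones.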

Hierarchical chain growth
In disordered proteins, local structure is determined primarily by the local amino acid sequence, lacking the cooperative interactions of folded proteins between regions distant in sequence. HCG [13 •• ] exploits this hierarchical nature. A protein chain is divided into overlapping sequence fragments. Fragment structures are sampled with replica-exchange molecular dynamics (REMD) simulations. From the resulting pools, the fragments are then chosen at random. Adjacent fragments are combined with a rigid body superimposition of the heavy atoms of their overlapping regions. If the corresponding root-mean-square distance (RMSD) is below a given cut-off and if there are no steric clashes, the fragment pair is entered into the respective pool at the next assembly level. This assembly process is continued hierarchically all the way up to the level of full-length chains (Figure 1C). At each level of the assembly process, the size of the chain fragments effectively doubles. The hierarchical assembly manifestly preserves detailed balance, which guarantees that arbitrary choices such as the order of the assembly do not affect the final ensemble. Hence, HCG grows ensembles of chains with a well-defined distribution. By construction, the members of the HCG ensemble are statistically independent. As a result, HCG produces broad ensembles of IDPs with highly diverse conformations in a computationally efficient manner, sampling a significantly broader conformational space than, say, a single 2 µs-long MD simulation in the case of α-synuclein (aS) [13 •• ].
If needed, HCG can be complemented by MD simulations of solvated full-length chains. As shown for aS in Figure 2, the radius of gyration $R_G$ calculated for an HCG ensemble with 20,000 chains [13 •• ] is already in good agreement with the measured value from SEC-SAXS [24]. For three different combinations of protein force field and water models, we found that aS tended to collapse below the size seen in the SEC-SAXS measurements [24]. These findings highlight, first, that care must be taken to assess the collapse tendency. Second, as shown in Figure 2, even for the loosely packed aS with 140 amino acids, it takes many hundreds of nanoseconds of MD just to relax the chain size. Third, without any further simulations, HCG appears to be at least on par with the three MD simulation models. HCG thus provides an excellent starting point for further inquiry.
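The pairwise, level-by-level assembly of HCG can be illustrated with a toy model. The sketch below is not the published implementation: it assumes lattice fragments represented as tuples of steps, it replaces the rigid-body superposition and RMSD cut-off of overlapping residues by simple concatenation, and it keeps only the steric clash test. All names (`has_clash`, `merge_pools`, `hierarchical_chain_growth`) are hypothetical.

```python
import random

MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]  # steps on the square lattice


def has_clash(steps):
    """True if the walk defined by the step sequence revisits a lattice site."""
    pos = (0, 0)
    seen = {pos}
    for dx, dy in steps:
        pos = (pos[0] + dx, pos[1] + dy)
        if pos in seen:
            return True
        seen.add(pos)
    return False


def merge_pools(left, right, pool_size, rng, max_tries=1_000_000):
    """Draw random (left, right) pairs; keep the clash-free combinations.

    A rejected pair is discarded as a whole, so neither partner is redrawn
    conditionally on the other.
    """
    merged = []
    for _ in range(max_tries):
        if len(merged) >= pool_size:
            break
        candidate = rng.choice(left) + rng.choice(right)
        if not has_clash(candidate):
            merged.append(candidate)
    return merged


def hierarchical_chain_growth(fragment_pools, pool_size, rng):
    """Combine 2**M fragment pools pairwise over M levels; return (pool, M)."""
    n = len(fragment_pools)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch expects 2**M fragment pools")
    pools, levels = list(fragment_pools), 0
    while len(pools) > 1:
        pools = [merge_pools(pools[i], pools[i + 1], pool_size, rng)
                 for i in range(0, len(pools), 2)]
        levels += 1
    return pools[0], levels


def mean_r2(ensemble):
    """Mean squared end-to-end distance of an ensemble of step sequences."""
    total = 0
    for steps in ensemble:
        x = sum(dx for dx, _ in steps)
        y = sum(dy for _, dy in steps)
        total += x * x + y * y
    return total / len(ensemble)


rng = random.Random(7)
monomer_pool = [(move,) for move in MOVES]  # level-0 pool: single steps
ensemble, levels = hierarchical_chain_growth([monomer_pool] * 4, 5000, rng)
print(levels, len(ensemble), mean_r2(ensemble))
```

With four single-step fragments, two assembly levels suffice, and the mean squared end-to-end distance of the final pool should be close to the exact value of 7.04 for four-step self-avoiding walks. In an atomistic setting, the concatenation step would be replaced by the superposition of overlapping residues with an RMSD criterion as described above.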
Applications of HCG extend beyond the sampling of IDP ensembles. For instance, HCG has shed light on the early stages of autophagy. Sawa-Makarska et al. used an implementation of HCG to model the disordered N and C termini of the protein Atg9 in the Atg9-containing vesicles seeding yeast autophagosomes [28]. The extensive coverage of the vesicle surface by the Atg9 tails explained the relatively low rate of Atg8 lipidation, which requires unhindered access to the surface. Interestingly, some of the principles used in chain growth also find their application in other approaches to model important biological systems such as glycoproteins. Gecht and coworkers implemented a tool, GlycoSHIELD [29], that helped, e.g., to prepare a proper model of the SARS-CoV-2 spike protein by attaching glycan conformers onto the protein of interest. In another variant, Turoňová et al. [30] resampled the hip and knee joints of the SARS-CoV-2 spike stalk to probe the full extent of its mobility.
Interactions between distant parts of the chain other than steric exclusion can be taken into account [16,31 • ], including electrostatics, at least at the level of implicit solvent descriptions. Including electrostatic forces in HCG may be important for growing structures of highly charged biomolecules [6]. A pragmatic way forward can be to use larger chain fragments for HCG sampled in MD simulations using explicit ionic solvents.

Integration of experimental data
An ensemble representation establishes a sound foundation for the interpretation of experimental data in case of structural disorder in a molecular system [3,32-35]. As a first line of attack to improve the consistency between measured and calculated observables, one can reweight the members of the unbiased ensemble rather than adjust their structure [36-39]. In a Bayesian view, the initial ensemble can be considered a sample of the prior distribution. By imposing restraints derived from experiment already in the creation of the ensemble [38,40,41], this sample can be enriched. Combinations with enhanced sampling techniques such as metadynamics [42] or replica exchange [38] further improve the sampling efficiency. Uncertainties in measurements and their modeling are readily taken care of in a Bayesian framework [38]. However, the integration of data is no panacea: for comparably poor force fields, the overlap with the "true" ensemble may not be sufficient for reweighting according to a single or a few experimental observables to establish meaningful ensembles [43]. In other words, the quality of the Bayesian prior matters, which may not surprise considering the vast conformational space to be sampled.
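To illustrate what reweighting of an ensemble against a measurement involves, the following minimal sketch treats a single ensemble-averaged observable with a Gaussian error. It assumes one common formulation, in which the weights minimize $\theta \sum_i w_i \ln(w_i / w_i^0) + (\langle y \rangle_w - Y)^2 / (2\sigma^2)$; the optimal weights then have the form $w_i \propto w_i^0 \exp(-\lambda y_i)$ with $\lambda = (\langle y \rangle_w - Y)/(\theta \sigma^2)$. The function name `reweight`, the confidence parameter `theta`, and the numbers in the example are illustrative assumptions and not taken from the cited works.

```python
import math


def reweight(values, target, sigma, theta, prior=None):
    """Reweight an ensemble against one averaged observable.

    values : observable computed for each ensemble member
    target : measured ensemble average, with Gaussian error sigma
    theta  : confidence in the prior (large theta keeps weights near prior)
    prior  : initial positive weights (uniform if None)
    """
    n = len(values)
    prior = prior if prior is not None else [1.0 / n] * n

    def weights(lam):
        # w_i proportional to prior_i * exp(-lam * y_i), computed stably
        logs = [math.log(p) - lam * y for p, y in zip(prior, values)]
        top = max(logs)
        unnorm = [math.exp(v - top) for v in logs]
        total = sum(unnorm)
        return [u / total for u in unnorm]

    def residual(lam):
        # Stationarity condition; strictly increasing in lam
        avg = sum(w * y for w, y in zip(weights(lam), values))
        return theta * sigma ** 2 * lam - (avg - target)

    lo, hi = -1.0, 1.0
    while residual(lo) > 0:  # expand the bracket until it contains the root
        lo *= 2
    while residual(hi) < 0:
        hi *= 2
    for _ in range(200):  # bisection
        mid = 0.5 * (lo + hi)
        if residual(mid) > 0:
            hi = mid
        else:
            lo = mid
    return weights(0.5 * (lo + hi))


# Toy example: five conformers with hypothetical radii of gyration (in Å)
rg = [20.0, 25.0, 30.0, 35.0, 40.0]
w = reweight(rg, target=27.0, sigma=1.0, theta=0.01)
print(sum(wi * yi for wi, yi in zip(w, rg)))
```

A small `theta` pulls the reweighted average close to the measured value, whereas a large `theta` leaves the weights near the prior. The sketch also makes the caveat above tangible: if no ensemble member has an observable near the target, the weights concentrate on a few outliers.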
In chain growth, experimental data can be integrated already during the ensemble generation in a form of integrative modeling [22]. The flexible-meccano approach and its extension ASTEROIDS have been successfully used to account for different types of NMR data and single-molecule FRET and SAXS data [44 • ]. Biased fragment choice, with fragment weights derived from a Bayesian formulation, has been shown to be powerful in early applications of chain growth [45] or in the refinement of MD ensembles of flexible proteins by fragment replacement [46].
Reweighted hierarchical chain growth (RHCG) is an extension of HCG to integrate experimental data by assigning weights to the fragment conformations [31 • ]. RHCG is designed to counteract the problem of systematic biases in the fragment pool. Consider, for instance, a systematic force-field error in the energetic balance between locally extended and helical peptide conformers. As the size of the molecules increases, it becomes less likely that all parts of a chain are drawn from the relevant subspace. Consequently, after global reweighting only a few chains may end up dominating the final ensemble. RHCG counteracts this tendency by using suitable fragment weights, which can be assigned, for instance, by Bayesian inference [33,38,39]. In a global reweighting of the ensemble after chain assembly, the fragment weights are fully accounted for [31 • ]. In this way, RHCG generates a well-defined and diverse output ensemble that has high overlap with the true ensemble.
With RHCG we were able to account for solution experiments on tau K18 as diverse as NMR, single-molecule FRET, and small-angle X-ray scattering [31 • ]. We also captured structural features seen in tau fibrils and provided important atomic-resolution insight to complement current ideas of how tau mutations shift the balance between the tau conformational ensembles in health and disease (Figure 3B). P301L, P301T, and P301S mutations shift our tau ensembles away from turn-like conformations that would be able to bind to microtubules. Instead, this region populates aggregation-prone extended conformations. RHCG has thus revealed how subtle shifts in local structural propensities could give rise to pathogenesis [31 • ].
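Why global reweighting alone degrades with chain length, and how fragment-level weights help, can be seen in a deliberately simple two-state toy model. The sketch assumes independent fragments that are either helical or extended, a hypothetical fragment pool with helix probability `p_draw`, and a hypothetical target helix probability `p_target`; the Kish effective sample size serves as a measure of how many chains effectively contribute. None of these names or numbers come from the RHCG publication.

```python
import random


def kish_fraction(weights):
    """Effective sample size divided by the number of chains (between 0 and 1)."""
    s1 = sum(weights)
    s2 = sum(w * w for w in weights)
    return s1 * s1 / (len(weights) * s2)


def chain_weights(n_fragments, p_draw, p_target, n_chains, rng):
    """Importance weights of chains assembled from two-state fragments.

    Each fragment is helical with probability p_draw when drawn; the chain
    weight is the product of the per-fragment ratios target / draw, i.e. the
    correction applied in a global reweighting after assembly.
    """
    weights = []
    for _ in range(n_chains):
        w = 1.0
        for _ in range(n_fragments):
            helical = rng.random() < p_draw
            if helical:
                w *= p_target / p_draw
            else:
                w *= (1.0 - p_target) / (1.0 - p_draw)
        weights.append(w)
    return weights


rng = random.Random(11)
for n in (1, 4, 16):
    # Fragments drawn from the biased pool, corrected only globally
    global_only = chain_weights(n, 0.5, 0.2, 5000, rng)
    # Fragments drawn with fragment weights that already match the target
    fragment_weighted = chain_weights(n, 0.2, 0.2, 5000, rng)
    print(n, kish_fraction(global_only), kish_fraction(fragment_weighted))
```

In this idealized setting, the effective fraction of chains decays roughly exponentially with the number of fragments when only global reweighting is used, whereas drawing fragments according to their weights keeps all chains equally weighted. In practice, fragment weights will be imperfect, so a residual global reweighting remains necessary.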

Teixeira et al. [47 • ] recently published a software suite that samples IDP ensembles following the principles of data-driven coil models and contains tools for further analysis and ensemble refinement. Interestingly, their approach also captured shifts in local structural propensities in response to the neurodegeneration-linked P301L mutation in accordance with the RHCG ensemble [31 • ].

Condensates
IDPs are often associated with protein condensates. An exciting perspective is to build molecularly detailed models of such crowded solutions of (disordered) biomolecules. One possibility is to harness the power of HCG to directly model such dense systems. Individual conformations are drawn from an ensemble of single chains grown with HCG and assembled in a simulation box, which can serve as starting point for MD simulations. For the low-complexity domain (LCD) of the neurodegeneration-linked RNA-binding protein TDP-43, we generated models of condensates with atomic detail (Figure 4) using a variant of HCG and then ran MD simulations from this initial system [48 • ]. In the simulations, phosphomimicking mutations led to a loss of protein-protein interactions and an increase in protein-solvent interactions in the C terminus of the TDP-43 LCD that destabilized the condensates, complementing coarse-grained simulations of the phase behaviour of phosphomimicking mutants and phosphorylated TDP-43. The experiments by Dormann and colleagues [48 • ] have suggested that disease-linked phosphorylation, rather than driving the progression of neurodegenerative diseases, is a potential cell-protective mechanism; by hyperphosphorylation the cell may try to hinder the condensation and aggregation of TDP-43.
Combining high-resolution experiments, theory, atomistic and coarse-grained modeling has already started to yield insights into the drivers of liquid-liquid phase separation [49]. This is a particularly exciting prospect as coarse-grained simulation models parameterized using large sets of high-resolution experimental data can capture trends in the global arrangements of disordered proteins as well as their propensities to phase separate [50 •• ]. Another interesting direction is the simulation of dense solutions of disordered proteins or their fragments at sub-critical concentrations [51 • ], which can provide critical insights into molecular driving forces for condensation. However, how to optimally combine models from chain-growth and atomistic simulations with coarse-grained models is an open question.
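The idea of filling a simulation box with pre-grown chains can be sketched as a random insertion with a clash test under periodic boundary conditions. This is a generic illustration, not the HCG variant used for the TDP-43 condensates: chains are bead models, the stand-in ensemble consists of random walks rather than HCG structures, only translations are sampled, and the minimum-image test assumes that each chain is smaller than the box. The names `random_chain`, `min_image_dist2`, and `pack_chains` are hypothetical.

```python
import math
import random


def random_chain(n_beads, bond, rng):
    """Stand-in for a pre-grown chain: a random walk with fixed bond length."""
    chain = [(0.0, 0.0, 0.0)]
    for _ in range(n_beads - 1):
        v = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(c * c for c in v))
        chain.append(tuple(p + bond * c / norm for p, c in zip(chain[-1], v)))
    return chain


def min_image_dist2(a, b, box):
    """Squared distance between two points in a cubic periodic box."""
    d2 = 0.0
    for x, y in zip(a, b):
        d = x - y
        d -= box * round(d / box)  # nearest periodic image
        d2 += d * d
    return d2


def pack_chains(ensemble, n_chains, box, min_dist, rng, max_tries=10000):
    """Insert randomly chosen, randomly translated chains without clashes."""
    placed = []
    cutoff2 = min_dist * min_dist
    for _ in range(max_tries):
        if len(placed) >= n_chains:
            break
        chain = rng.choice(ensemble)
        shift = [rng.uniform(0.0, box) for _ in range(3)]
        moved = [tuple((c + s) % box for c, s in zip(bead, shift))
                 for bead in chain]
        # Accept only if every new bead keeps its distance to all placed beads
        clash = any(min_image_dist2(b, q, box) < cutoff2
                    for other in placed for q in other for b in moved)
        if not clash:
            placed.append(moved)
    return placed


rng = random.Random(3)
ensemble = [random_chain(10, 3.8, rng) for _ in range(20)]
system = pack_chains(ensemble, n_chains=8, box=60.0, min_dist=4.0, rng=rng)
print(len(system))
```

A rejected insertion discards the whole trial chain rather than nudging it, in the same spirit as the clash handling in chain growth. For dense systems, random insertion becomes inefficient, and a production setup would additionally need random rotations, all-atom clash criteria, and subsequent solvation and equilibration.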

Outlook
On the methods side, the emerging connections of chain growth to machine learning and artificial intelligence (AI) deserve special attention. Historically, coil models have been an attempt to collect statistical information about protein structure and use it to infer the local and global structures of proteins. As such, coil models and HCG have a natural connection to machine learning and AI.
AlphaFold2 [54] showcases the power of AI to predict three-dimensional protein structure. The resulting acceleration in structural studies of complex assemblies [55] raises the intriguing question as to what can be learned about disordered regions from AlphaFold predictions. Currently AlphaFold2 does not capture disordered regions as a properly weighted ensemble. Hence, an exciting prospect is the combination of AlphaFold2 models of the folded protein and conformations from IDP/IDR ensembles using, e.g., HCG, molecular dynamics or knowledge-based approaches. Interestingly, segments in IDRs often appear structured in AlphaFold2 predictions, possibly in reflection of their binding to distinct partner proteins [56 •• ], which has been used effectively to map and model the interactions of short linear motifs (SLiMs) with structured nucleoporins in the scaffold of the nuclear pore complex [55]. One potential problem is that AlphaFold2 may capture, in the same model, structures an IDR may adopt in different complexes, as has been shown for conditionally folded proteins by comparison to experimental structures. Thus, a critical assessment of the thousands of local structures predicted for IDPs/IDRs may be advisable even for proteins where AlphaFold2 produces high-quality models of the folded domains.
Growing efforts have also been made to harness the power of AI to characterize structural ensembles of IDPs. Gupta and colleagues recently developed an AI-based approach that learns IDP conformational space from short MD simulations to then generate broad IDP ensembles [57]. It is interesting to speculate to what extent this approach can be combined with ensembles sampled with HCG. Zhang et al. [58 • ] are developing a neural network that learns structural ensembles of disordered proteins from experimental information. In fact, the neural network generates and learns torsion-angle probability distributions for interdependent neighboring residues, while also biasing the probability distribution towards experimental data, using a Bayesian formalism. Even more ambitiously, a recent preprint shows how a coarse-grained representation of an atomistic ensemble can be learned by a neural network, which reproduces the equilibrium densities of the input ensemble [59 • ]. HCG ensembles usually extend beyond the conformations sampled by direct MD simulations, at least for long chains, and should thus provide a valuable reference in these endeavors.
In recent years, we have witnessed considerable progress in sampling structural ensembles of flexible (bio)polymers. However, efficient sampling of the vast conformational diversity remains challenging. Approaches that model conformational ensembles based on local structure statistics, i.e., coil models, have been shown to be promising. Hierarchical chain growth (HCG) builds on the basic ideas of coil models. Using HCG one can sample ensembles with highly diverse conformations in a computationally efficient manner. In the cases studied, the ensemble properties agreed well with available experimental data. The quality of the generated ensemble could be further improved by integrating experimental information, producing richly detailed structural ensembles consistent with experiment across scales.
full-length chains of disordered proteins from libraries of fragment structures. The algorithm fulfills detailed balance and thus generates ensembles with a well-defined distribution.
• Hierarchical chain growth is extended by assigning weights to the fragment conformations within a Bayesian framework. Reweighted hierarchical chain growth (RHCG) resolves local structural propensities in tau and highlights how mutations can subtly but decisively shift the balance from functional to aggregation-prone conformations linked to neurodegeneration.
formational space of disordered protein states. J Phys Chem A 2022, 126:5985-6003.
• A Python-based coil modeling platform IDPConformerGenerator is presented. The generated ensembles can then be refined by a Bayesian approach. Interestingly, the authors demonstrate how the P301L mutation in tau can shift local structural propensities in accordance with results from hierarchical chain growth.
• Hierarchical chain growth provided atomically-resolved models of phase-separated TDP-43 condensates. Atomistic molecular dynamics simulations of WT and 12D mutant condensates highlighted how phosphomimicking mutations enhance the solvation of the disordered domain of TDP-43. Hierarchical chain growth complemented coarse-grained simulations of phase behavior and led to a microscopic understanding of the experimental observations.
modeling, which can provide important clues for researchers but also needs to be interpreted carefully. Cautionary examples highlight that for disordered proteins AlphaFold2 may predict a mixture of the two different structures a protein adopts with two different binding partners.
• A neural network is used to generate coil models. Experimental data are integrated in a form of reinforcement learning.
• The authors develop a neural network representation to learn a coarse-grained force field from a fine-grained ensemble.

Figure 2: HCG of α-synuclein extended by atomistic MD simulations with different force fields. (A-C) The box-and-whiskers plots show the distribution of the radius of gyration $R_G$ calculated over windows of 50 ns across 20 independent runs initiated from 20 randomly chosen structures of the HCG ensemble (mean: black; median: red; box: interquartile range; bars: extrema). Results are for (A) the amber99StarILDN-q force field and TIP4P-D water model [25], (B) the a99SB-disp force field and TIP4P-D water model [26], and (C) the a03ws force field and TIP4P/2005 water model [27]. The dashed green line indicates the root-mean-square $R_G$ for an HCG ensemble of 20,000 aS chains. The solid gray line is the $R_G$ value measured via SEC-SAXS by Araki et al. [24] with the standard error indicated by shading. (D-G) Snapshots of aS as grown with HCG before MD (D), and after 500 ns MD (E-G) with the force fields of panels A-C.

Figure 4: Snapshot of a TDP-43 LCD condensate. The all-atom system was built for MD simulations by combining TDP-43 LCD chains preassembled by HCG [48 • ]. The surface of the chains is shown in color and atoms from a single TDP-43 LCD chain are shown with atomistic detail. Solvent molecules are omitted for clarity except for a small region, where water is shown as sticks and ions as spheres (sodium in cyan, chloride in blue). Blue lines indicate the periodic simulation box.