On the potential of machine learning to examine the relationship between sequence, structure, dynamics and function of intrinsically disordered proteins

Intrinsically disordered proteins (IDPs) constitute a broad set of proteins with few uniting and many diverging properties. IDPs, and intrinsically disordered regions (IDRs) interspersed between folded domains, are generally characterized as having no persistent tertiary structure; instead they interconvert between a large number of different and often expanded structures. IDPs and IDRs are involved in an enormously wide range of biological functions and reveal novel mechanisms of interactions, and while they defy the common structure-function paradigm of folded proteins, their structural preferences and dynamics are important for their function. We here discuss open questions in the field of IDPs and IDRs, focusing on areas where machine learning and other computational methods play a role. We discuss computational methods that aim to predict transiently formed local and long-range structure, including methods for integrative structural biology. We discuss the many different ways in which IDPs and IDRs can bind to other molecules, both via short linear motifs and in the formation of larger dynamic complexes such as biomolecular condensates. We discuss how experiments are providing insight into such complexes and may enable more accurate predictions. Finally, we discuss the role of IDPs in disease and how new methods are needed to interpret the mechanistic effects of genomic variants in IDPs.


Introduction
Intrinsically disordered proteins (IDPs) constitute a broad and relatively heterogeneous class of proteins that have in common that they do not adopt a well-defined three-dimensional structure, at least in the absence of binding partners. This in itself is not a very strict definition because also natively folded proteins are dynamic. Experimentally, disordered proteins are often characterized using a range of biophysical measurements that typically reveal the presence of transiently formed secondary structure elements and occasionally weak, transient longer-range interactions. Analogously, many proteins have regions of intrinsic disorder interspersed within or between folded domains, and in many ways these intrinsically disordered regions (IDRs) behave similarly to IDPs, and in general we will refer to both as IDPs.
The flexibility and dynamics combined with an extended surface area endow IDPs with an ability to adapt, a trait that is often key to their biological function, either because it enables them to bind to multiple different proteins or because the intrinsic dynamics may affect both binding kinetics and thermodynamics. These dynamics, however, also make it difficult to characterize IDPs both experimentally and computationally.
It was recognized early that the amino acid composition and sequences of IDPs differ in several ways from those of folded proteins. Thus, aided by databases containing experimentally validated IDPs (Hatos et al., 2020), a large number of prediction methods have been developed to predict protein disorder from sequence alone (Necci et al., 2021). While overall very successful, such prediction methods inherently need to deal with the heterogeneity in what is considered a disordered protein, including large differences in biological context (complexes, post-translational modifications, etc).
A complementary approach to study IDPs is to characterize the conformational ensembles that they populate. In certain favourable cases, computational methods can on their own predict some conformational properties. Often, however, a detailed and accurate characterization requires integrating one or more types of biophysical experiments with computational methods to collectively derive a collection of structures that represent the conformational heterogeneity (Mittag and Forman-Kay, 2007;Jensen et al., 2013) or dynamics (Salvi et al., 2016) of the protein. A number of such approaches exist and some of the resulting ensembles are collected in the Protein Ensemble Database (PED; Lazar et al. (2021)), by analogy to the Protein Data Bank (PDB; wwPDB consortium (2019)), which, however, mostly contains more well-defined protein structures or IDPs in complexes.
For folded proteins, Anfinsen's observations (Anfinsen, 1973;Eisenberg, 2018) suggested that it should be possible to predict the three-dimensional structure of a folded protein based only on its primary structure and its interaction with the environment. Over the years, this has led to the field of protein structure prediction, and a plethora of innovative approaches to predict structure from sequence. The accuracy of such methods is evaluated during the biennial critical assessment of structure prediction (CASP) experiment. While there have been continued improvements in the ability to predict structures over the years, the last two installments of CASP (CASP13 (2018) and CASP14 (2020)) have witnessed some substantial and impressive advances in accuracy, in particular in the so-called template-free modelling (Kryshtafovych et al., 2019;AlQuraishi, 2021). While a number of developments have contributed to this, we here highlight three. First, in the last decade a number of methods have been developed to extract structural information from multiple sequence alignments (MSAs), e.g. through the analysis of correlated substitutions during evolution (Lapedes et al., 2012;Weigt et al., 2009;Marks et al., 2011;Morcos et al., 2011;Balakrishnan et al., 2011;Xu, 2019). Second, there has been an explosion in the number of sequences available, making such sequence-based approaches useful and applicable to a wider range of proteins. Finally, various deep-learning approaches have been used to 'learn' the complex relationship between the amino acid sequence (or MSA) and the three-dimensional structure.
Most visible has been the development of the AlphaFold approach (Senior et al., 2020) in CASP13 and AlphaFold 2 in CASP14, although many other groups have also contributed to these developments including among others Xu (
Motivated by our own research, this review begins by examining whether such methods can be used to predict information about the (highly conformationally heterogeneous) three-dimensional 'structures' and ensembles of IDPs using only the primary structure as input. We also discuss how machine learning methods may aid in integrative modelling of the conformational ensembles of IDPs. We discuss the unique properties of IDPs in complexes, both those formed via short linear motifs and in larger assemblies and biomolecular condensates, and how new sources of data may be useful to develop better prediction methods. Finally, we discuss the role of IDPs in human diseases and how an improved understanding of the relationship between sequence, structural properties, formation of complexes and function may help in this area (Fig. 1). Overall, we highlight a number of challenges that are particularly relevant for IDPs, and some of the questions that might be addressed by combining machine learning methods with experiments and other computational approaches.

Towards improved conformational ensembles
From sequence to structure
Before discussing potential applications to IDPs, we first describe very briefly some of the key steps that have led to improved structure prediction of folded proteins, but note that this description is far from comprehensive. One key ingredient has been the ability to extract structural information, for example in the form of contacts (Lapedes et al., 2012;Marks et al., 2011), distance distributions (Senior et al., 2020) or distributions over distances and orientations (Yang et al., 2020a), from the analysis of MSAs. This work builds on earlier ideas that correlated mutations observed through evolution contain information about the proximity of amino acids in the three-dimensional structure (Taylor and Hatrick, 1994;Neher, 1994;Göbel et al., 1994), but required more advanced analysis methods, including global analysis methods (Lapedes et al., 2012;Weigt et al., 2009;Marks et al., 2011;Morcos et al., 2011;Balakrishnan et al., 2011;Xu, 2019), as well as an increased number of sequences to reveal their full potential. The structural information obtained from the MSAs can then be used to guide structure determination using a range of methods to obtain three-dimensional models. Indeed, while many methods have been shown to help improve the accuracy of contact prediction, it is not clear that improved contact prediction always leads to substantially improved models of three-dimensional structures (Kassem et al., 2018). More recent improvements in the AlphaFold 2 method appear to involve using a so-called end-to-end model (AlQuraishi, 2019;Laine et al., 2021;AlQuraishi, 2021), which has been trained to predict structures directly from sequence using an iterative approach (Jumper et al., 2020). It appears that a machine learning model has been optimized to predict the three-dimensional structure directly from sequence and trained using large sequence and structure databases.
Figure 1. An overview of the relationship between sequence, structure, dynamics and function of IDPs, and how the inherent disorder also in the complexes affects the use of machine learning approaches. Some of the challenges to understand these relationships include improving predictions of disorder, describing ensemble properties, and finding ways to include complex heterogeneity and context effects. Finally, it is still not clear how many disease variants in IDPs lead to disease and by which mechanisms.
We return now to the question of how such methods might be applied to IDPs, and note several obstacles that need to be overcome. First, the goal should not generally be to predict a single structure from the sequence, but rather an ensemble of dynamic structures. We here note in passing that some predictions provide multiple structures, but these generally represent the uncertainty of the prediction rather than the intrinsic heterogeneity and dynamics of the structure. Second, it is often difficult to generate high-quality and deep MSAs of IDPs, in particular for those of low sequence complexity, or when conserved folded domains used to anchor alignments are lacking. Third, we do not have available a large number of structural ensembles that can be used to benchmark, let alone train, prediction methods. Thus, in contrast to the case for folded proteins where the PDB contains approximately 175,000 entries with protein structures, the PED contains ca. 200 ensembles.
Recent work has begun to develop approaches to understand the relationship between sequence and conformational ensembles of IDPs. In one such study, the concept of amino acid co-evolution was applied to predict contacts in IDPs from MSAs (Toth-Petroczy et al., 2016). In several of the proteins analysed, the predicted contacts could be shown to coincide with key contacts observed within an IDP when bound in a complex to a folded protein (Toth-Petroczy et al., 2016), thus demonstrating that the same principles that have been used so successfully for folded proteins have the potential to provide insight into IDPs, at least when they form complexes. In another study, we used similar sequence analyses of the disordered protein CsgA (Tian et al., 2015). Here, we found a strong pattern of predicted contacts that corresponded to a folded amyloid-like state that CsgA forms. In this case, these contacts are preserved by evolution because CsgA forms a functional amyloid that is beneficial to the bacteria. While these and related studies suggest that sequence analysis might contain information that can be extracted to learn about the structures of IDPs, they have so far mostly revealed information about folded states that the IDPs might adopt or, perhaps, local secondary structure in the disordered states (Toth-Petroczy et al., 2016). Here we note that there has been a substantial amount of work on predicting local structure in flexible peptides and proteins, but that our focus here is more generally on both local and global structures.
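The idea of reading contacts from correlated substitutions can be illustrated with a deliberately simple statistic. The sketch below scores column pairs of an MSA by mutual information with an average product correction (APC). This is an assumption made for illustration only: it is a local statistic, not the global models used in the studies cited above, and the function names and the toy alignment in the usage note are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(msa, i, j):
    """Mutual information (in nats) between alignment columns i and j."""
    n = len(msa)
    count_i = Counter(seq[i] for seq in msa)
    count_j = Counter(seq[j] for seq in msa)
    count_ij = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (a, b), n_ab in count_ij.items():
        p_ab = n_ab / n
        mi += p_ab * math.log(p_ab / ((count_i[a] / n) * (count_j[b] / n)))
    return mi


def apc_corrected_scores(msa):
    """Return {(i, j): MI minus the average product correction} for all column pairs."""
    length = len(msa[0])
    raw = {(i, j): mutual_information(msa, i, j)
           for i, j in combinations(range(length), 2)}
    # Mean MI of each column with all other columns.
    column_mean = [0.0] * length
    for (i, j), value in raw.items():
        column_mean[i] += value
        column_mean[j] += value
    column_mean = [m / (length - 1) for m in column_mean]
    overall_mean = sum(raw.values()) / len(raw)
    if overall_mean == 0.0:
        return dict(raw)  # no covariation signal at all
    return {(i, j): value - column_mean[i] * column_mean[j] / overall_mean
            for (i, j), value in raw.items()}
```

For a toy alignment such as `["AKDE", "AKDG", "CKEE", "CKEG", "AKDE", "CKEG"]`, the pair of columns that covary perfectly (0 and 2) receives the highest corrected score. Real alignments additionally require sequence reweighting, gap handling and pseudocounts, none of which are included here.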
One of the limiting factors is also that we do not have a well-developed framework to discuss and quantify the relationship between sequence and ensemble properties. Indeed, as discussed above, protein disorder covers a continuum ranging from almost folded, but flexible and compact globules to chains that appear as statistical random coils. Because these proteins are best described in statistical terms, one approach is to bypass the three-dimensional structure altogether and predict key structural parameters directly from sequence. A number of such studies have focused on discovering the rules that govern the relationship between amino acid composition and patterning and, in particular, developed a conceptual framework to examine such sequence-ensemble relationships more generally. Presumably, such approaches as well as more advanced computational methods and expanded sets of experimental data will be needed to predict structural properties beyond predicting compaction and local structure in IDPs.
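As a minimal sketch of what computing sequence parameters directly from sequence can look like, the example below calculates the fraction of charged residues, the net charge per residue, and a simplified charge-segregation score. The segregation score (the mean squared deviation of the window net charge from the overall net charge) is an assumption chosen for illustration and is not one of the published patterning parameters; the window length and the treatment of histidine as neutral are likewise assumptions.

```python
POSITIVE = set("KR")  # histidine is treated as neutral (assumption)
NEGATIVE = set("DE")


def charge_features(sequence, window=5):
    """Return (fraction charged, net charge per residue, segregation score)."""
    if len(sequence) < window:
        raise ValueError("sequence is shorter than the window length")
    charges = [1 if aa in POSITIVE else -1 if aa in NEGATIVE else 0
               for aa in sequence]
    n = len(charges)
    fraction_charged = sum(abs(c) for c in charges) / n
    net_charge_per_residue = sum(charges) / n
    # Slide a window along the chain and compare its net charge to the overall value.
    deviations = []
    for start in range(n - window + 1):
        window_charge = sum(charges[start:start + window]) / window
        deviations.append((window_charge - net_charge_per_residue) ** 2)
    segregation = sum(deviations) / len(deviations)
    return fraction_charged, net_charge_per_residue, segregation
```

Two sequences with identical composition, for example `"EKEKEKEKEK"` and `"EEEEEKKKKK"`, give the same first two values but different segregation scores, which is the distinction between composition and patterning drawn in the text.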

Beyond sequence alignments
As discussed above, one of the main sources of information in current protein structure prediction comes from MSAs. While a detailed discussion of methods used to generate MSAs is beyond the scope of this paper, we note here that they are generally constructed based on the assumption of positional homology (Bawono et al., 2017), i.e. that a specific position in one sequence corresponds to a specific position in a homologous sequence. While this in turn is often the case for folded proteins, the situation in IDPs appears more complicated. Moreover, some IDPs diverge by gene duplication (Lee et al., 2008) or are only found in some species (Rozen et al., 2015), and indels (insertions and deletions) are found to be frequent in IDPs (Light et al., 2013), obscuring alignments further.
Recently, several ideas and methods from natural language processing have been applied to modelling, interpreting and predicting properties from protein sequences (Alley et al., 2019; Rao et al., 2019;Heinzinger et al., 2019;Ofer et al., 2021). While such methods are still difficult to interpret and expensive to train, they might help study proteins for which it is difficult to construct good alignments. Nevertheless, it still appears that MSAs contain substantial information that is not easily extracted from such language models (Rao et al., 2019), and indeed combining the two can be advantageous (Rao et al., 2021).
Initial applications of language models to IDPs suggest that such models could be very useful in cases where one cannot construct accurate MSAs (Heinzinger et al., 2019). It will be interesting to explore whether these methods can be used to discover new rules that govern IDP sequences, and extract structural information from them. For example, as discussed above, properties such as the level of compaction and local structural motifs can to some extent already be predicted from sequence. It will, however, be interesting to explore how such methods can be improved (also for other properties such as long-range interactions or post-translational modifications) by developing new methods to represent sequences that are not based on the positional-conservation dogma that implicitly underlies many structure-prediction methods for folded proteins (Pritišanac et al., 2019;Huihui and Ghosh, 2021).

Forward models for interpreting experimental data
Many methods for predicting the structures of folded proteins are explicitly or implicitly based on the availability of thousands of labelled sequence-structure pairs in the PDB, used either for parameterization, training or validation. As discussed above, we have far fewer experimentally derived conformational ensembles available for IDPs, and we here discuss how machine learning methods can provide potential advances in modelling IDPs.
While a plethora of methods exist for modelling conformational ensembles of IDPs, they are typically based on either biasing molecular simulations using experimental data or on selecting structures from a pre-generated ensemble to improve agreement with experiments (Mittag and Forman-Kay, 2007;Jensen et al., 2013;Bonomi et al., 2017;Orioli et al., 2020). In these methods it is very important that the underlying dynamics and conformational averaging are treated correctly. For folded and rigid proteins one often transforms the experimental measurements into geometric restraints that are then applied during or after simulations. While this is possible for IDPs, a more general approach involves calculating experimental quantities from conformations and ensembles and comparing these to experiments. This calculation relies on so-called 'forward models', i.e. algorithms to calculate experimentally accessible quantities from conformational ensembles.
To give a concrete example, small-angle X-ray scattering (SAXS) experiments are often used to probe the compaction of an IDP, often quantified by the radius of gyration ($R_g$). One approach might therefore be to extract the $R_g$ from experimental SAXS data and generate a conformational ensemble with the same (average) $R_g$. Such an approach would, however, ignore solvent contributions to the experimental measurements (Henriques et al., 2018) as well as information from a wider range of scattering angles (Riback et al., 2017;Fuertes et al., 2017;Zheng and Best, 2018). Also, when multiple sources of experimental data are used it is very important to treat errors and ensemble-averaging correctly (Ahmed et al., 2020), and that becomes more difficult when working with quantities that are transformed values of the experimental measurements. Instead of using $R_g$, the more common approach is to use a forward model to calculate SAXS data from each conformation in an ensemble, and then compare the calculated average with experiments (Bernadó et al., 2007). A number of such forward models exist for SAXS experiments that differ in how they treat solvent effects, as well as in accuracy and computational efficiency (Hub, 2018). Importantly, different forward models may give different views of a conformational ensemble (Cordeiro et al., 2017;Henriques et al., 2018), because, depending on the relationship between structure and measurement, different ensembles will be needed to agree with the experiments (Pesce and Lindorff-Larsen, 2021).
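The difference between matching $R_g$ and averaging a forward model over an ensemble can be made concrete with a small sketch. The example below computes $R_g$ for one conformation and a scattering curve from the Debye equation, and then averages the curves (not the structures) over an ensemble. It assumes one bead per residue with unit form factors and no solvent contribution, so it is far simpler than the SAXS forward models cited above; all function names are illustrative.

```python
import math


def _distance(a, b):
    """Euclidean distance between two (x, y, z) points."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def radius_of_gyration(coords):
    """Radius of gyration of equally weighted beads."""
    n = len(coords)
    center = tuple(sum(c[k] for c in coords) / n for k in range(3))
    return math.sqrt(sum(_distance(c, center) ** 2 for c in coords) / n)


def debye_intensity(coords, q_values):
    """I(q) = sum_ij sin(q r_ij) / (q r_ij), with unit form factors and no solvent."""
    n = len(coords)
    pair_distances = [_distance(coords[i], coords[j])
                      for i in range(n) for j in range(i + 1, n)]
    curve = []
    for q in q_values:
        total = float(n)  # the i == j terms each contribute 1
        for r in pair_distances:
            x = q * r
            total += 2.0 * (math.sin(x) / x if x > 1e-12 else 1.0)
        curve.append(total)
    return curve


def ensemble_average_intensity(ensemble, q_values, weights=None):
    """Weighted average of per-conformation scattering curves."""
    if weights is None:
        weights = [1.0 / len(ensemble)] * len(ensemble)
    curves = [debye_intensity(coords, q_values) for coords in ensemble]
    return [sum(w * curve[k] for w, curve in zip(weights, curves))
            for k in range(len(q_values))]
```

The averaged curve can then be compared with the measured intensities over the full range of scattering angles, rather than through a single derived number.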
There are at least two different approaches to develop forward models, and these approaches can be combined. The first uses basic physical principles underlying the experiment to link structure and observable. Again using SAXS as an example, one of the most commonly used methods to calculate SAXS data from structures (Crysol; Svergun et al. (1995)) calculates SAXS intensities from the scattering amplitude of the protein in vacuum as well as a model for the solvent contribution. The former is in turn based on empirically derived form factors whereas the solvent contribution is parameterized using two parameters capturing the average solvent displaced by surface atoms and the excess density of the solvation layer. For folded and globular proteins, these two parameters are often fitted based either on a known structure or a model for the structure. Thus, the calculation of SAXS data from structural models may involve combining such a physical model with fitting one or a few empirical parameters against the experimental data.
Other forward models, such as methods that are used to calculate protein chemical shifts from protein structures, are also often based on physical principles, but have a large number of parameters that need to be fit to experiments (Xu and Case, 2001;Shen and Bax, 2007;Kohlhoff et al., 2009;Han et al., 2011). This in turn is often based on data for folded proteins for which both high-resolution structures and assigned chemical shifts are available. The mathematical function that connects structure and chemical shift is highly complex and has its roots in quantum mechanics. Thus, an alternative approach to express this relationship is to use neural networks (Meiler, 2003; Shen and Bax, 2010; Li et al., 2020a; Yang et al., 2020b).
One assumption underlying most of these approaches is that the experimental chemical shifts (which are time and ensemble averaged quantities) can be predicted accurately from a single structure. While that may be sufficient to study the rigid regions of folded proteins, other approaches may be needed to deal with more flexible parts (
Semi-empirical forward models such as those described above can be extremely difficult to develop for IDPs. This is because we rarely have sets of proteins for which we accurately know the conformational distribution derived independently from the set of measurements that one aims to develop a forward model for. Thus, most structures and ensembles determined for IDPs are implicitly based on forward models trained and validated on folded proteins. In the case of SAXS data this means, for example, that we often make the assumption that the solvation of a disordered protein is similar to that of a natively folded protein, and that the solvation properties are independent of the structure. While this may be true, this is very difficult to validate. One approach towards this goal may be to use more refined forward models to derive the ensembles (Hub, 2018; Hermann and Hub, 2019), to reparameterize simplified models using such more refined methods (Henriques et al., 2018), or to refine ensembles and forward models in a self-consistent manner (Rieping et al., 2005; Brookes and Head-Gordon, 2016; Pesce and Lindorff-Larsen, 2021).
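Selecting or reweighting structures from a pre-generated ensemble, one of the integrative strategies mentioned earlier in this section, can be sketched for the simplest possible case. The example below finds maximum-entropy weights that make a single ensemble-averaged observable match a target value exactly. It is a minimal sketch under strong assumptions: one observable, no experimental or forward-model error, and a target that lies within the range spanned by the ensemble. The function name and the bisection bracket are illustrative choices.

```python
import math


def maxent_reweight(values, target, prior=None, lam_low=-50.0, lam_high=50.0):
    """Return (weights, lam) with weights proportional to prior * exp(-lam * value)."""
    n = len(values)
    if prior is None:
        prior = [1.0 / n] * n
    if not min(values) < target < max(values):
        raise ValueError("target must lie strictly inside the range of values")

    def weights_for(lam):
        # Subtract the smallest exponent argument so that exp() never overflows.
        shift = min(lam * v for v in values)
        raw = [p * math.exp(-(lam * v - shift)) for p, v in zip(prior, values)]
        total = sum(raw)
        return [w / total for w in raw]

    def average(lam):
        return sum(w * v for w, v in zip(weights_for(lam), values))

    # The weighted average decreases monotonically with lam, so bisection suffices.
    low, high = lam_low, lam_high
    for _ in range(200):
        mid = 0.5 * (low + high)
        if average(mid) > target:
            low = mid
        else:
            high = mid
    lam = 0.5 * (low + high)
    return weights_for(lam), lam
```

Here `values` would hold the output of a forward model evaluated on each conformation. Methods used in practice treat many observables at once and balance the fit against experimental error and the prior weights, which this sketch does not attempt.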
How can machine learning methods aid in the further development of forward models, and thus in our ability to derive conformational ensembles from experimental data? As described above, neural networks have already been used extensively to parameterize a function used to calculate chemical shifts, and we expect such methods will become refined and extended to a wider set of experiments. Machine learning methods have also been developed to extract shape information from SAXS experiments (Franke et al., 2018) though, to our knowledge, not as forward models. Similarly, an approach based on deep neural networks has been developed to process and extract structural information from electron paramagnetic resonance (EPR) experiments (Worswick et al., 2018). Finally, a neural network was recently trained using quantum calculations to predict data from infrared absorption spectroscopy. Circular dichroism (CD) spectroscopy is widely used to study IDPs (Chemes et al., 2012), yet calculating CD spectra from conformational ensembles of IDPs is difficult and generally based on 'basic spectra' derived from folded proteins, often via secondary structure classification (Nagy et al., 2019). We envisage that machine learning methods can aid in generalizing such approaches towards IDPs. In addition to the improved accuracy potentially afforded by such machine-learning-based forward models, they may also have other advantages such as rapid evaluation and differentiability, both of which can be important when determining conformational ensembles from experimental data.

Improving energy functions for simulating IDPs
Returning to the problem of predicting conformational properties and ensembles of IDPs from sequence, we now explore how experiments and machine learning methods may be combined to improve conformational modelling. Ensembles generated either directly from molecular simulations or from integrative modelling using experiments are dependent on the quality of the physical models used in simulations (Orioli et al., 2020). Thus improved force fields and energy functions not only enable more accurate predictions of conformational properties from sequence, but also make integrative methods more robust (
In recent years there have been substantial improvements in explicit solvent, all-atom force fields used to study the structure and dynamics of IDPs (Best, 2017; Huang and MacKerell Jr, 2018; Robustelli et al., 2018;Mu et al., 2021), and these improvements have been derived both by better quantum-level calculations and empirical fitting to experimental data. Conformational sampling of IDPs, in particular long IDPs or their complexes, remains a substantial challenge, and therefore implicit solvent or coarse-grained methods are sometimes used. These can in turn be parameterized using either bottom-up (based on more accurate models) or top-down (from experiments) approaches, or indeed a combination of the two (Noid, 2013).
Some time ago we developed an automated approach to parameterize force fields based on experimental data and applied it to develop a coarse-grained model for IDPs (Norgaard et al., 2008). The basic idea, which had also been explored earlier for force field development (Njo et al., 1995;Norrby and Liljefors, 1998;Groth et al., 2001;Bathe and Rutledge, 2003), is to sample force field parameter space and to optimize the parameters by comparing simulation results against experiments. Using a Bayesian framework it is possible to combine the experiments with other sources of information, and one may use reweighting techniques to speed up parameterization (Norgaard et al., 2008). In some sense, this approach can be considered a machine learning approach for learning force field parameters from experimental data. Later, similar ideas have been developed and applied to the problem of optimizing all-atom force fields against experimental data ( , 2021), and we expect such approaches could have a substantial impact on our ability to simulate IDPs at various resolutions.
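The use of reweighting to speed up parameterization can be illustrated on a toy system. The sketch below is not the coarse-grained model or the Bayesian procedure referred to above; it assumes a one-dimensional harmonic potential with a single force constant, samples it once at a reference value, and then predicts an observable at other trial values by reweighting those samples instead of running new simulations. All names, the observable and the numerical values are hypothetical.

```python
import math
import random

KT = 1.0  # thermal energy in arbitrary units (assumption)


def energy(x, k):
    """Toy 'force field': harmonic potential with force constant k."""
    return 0.5 * k * x * x


def sample_reference(k_ref, n_samples, seed=0):
    """Draw Boltzmann-distributed samples for the reference force constant."""
    rng = random.Random(seed)
    sigma = math.sqrt(KT / k_ref)
    return [rng.gauss(0.0, sigma) for _ in range(n_samples)]


def reweighted_mean_square(samples, k_ref, k_trial):
    """Estimate <x^2> at k_trial from samples generated at k_ref."""
    log_w = [-(energy(x, k_trial) - energy(x, k_ref)) / KT for x in samples]
    shift = max(log_w)  # avoid overflow when exponentiating
    w = [math.exp(lw - shift) for lw in log_w]
    return sum(wi * x * x for wi, x in zip(w, samples)) / sum(w)


def fit_force_constant(samples, k_ref, target, candidates):
    """Pick the candidate whose reweighted prediction is closest to the target."""
    return min(candidates,
               key=lambda k: (reweighted_mean_square(samples, k_ref, k) - target) ** 2)
```

With samples generated at `k_ref = 1.0` and a target mean square of 0.5, a scan over candidate force constants should favour a value near 2.0, since the exact result for this potential is KT / k. Reweighting becomes unreliable when the trial parameters are far from the reference, in which case new sampling is needed.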
The methods described above suggest that machine learning methods may be used both to improve our ability to calculate and interpret experimental observables and to parameterize computational models for IDPs directly against experiments. Common to both problems is the focus on interpreting and using the experimental measurements. This is key because the procedure when going from experimental measurement to conformational ensemble involves approximations and loss of information. In the context of folded proteins, this is generally thought to be less of a concern, and the three-dimensional coordinates are often a relatively good representation of the system and of the data. This in turn means that structure prediction methods can be trained or benchmarked on the protein structures (coordinates) rather than the experimental measurements used to derive them. We expect that this will not be the case for IDPs, and instead we suggest that machine learning methods for structure prediction should be benchmarked or trained directly on experimental data, similarly to the force fields described above. Relatedly, it is still an open question to what extent the complicated models used to predict protein structures from sequence internally represent the physics of proteins, and thus training models for structure prediction from experiments may end up being comparable to training molecular force fields.

Towards predicting interactions and complexes
Identifying short linear motifs
As noted above, the primary structures of disordered proteins are generally not very well conserved. Nevertheless, their sequences do carry important information about their function, clues to which can be derived from direct sequence analysis and alignments. Although complicated to perform, and often assisted by manual refinements and adjustments, it is still possible to construct multiple sequence alignments of disordered proteins and from these alignments identify conservation hotspots in otherwise poorly conserved regions. In such cases, few positions (as few as two to five) are highly conserved across species and found to be distributed across a confined stretch of approximately a dozen residues. These conserved sequence stretches represent so-called Short Linear Motifs (SLiMs) (Neduva et al., 2005;Van Roey et al., 2014;Jespersen and Barbar, 2020). SLiMs are recurrent, and the same SLiM can be identified in different, seemingly unrelated proteins conferring binding to specific partner proteins or other biomolecules. They constitute interaction sites, and the conserved residues are essential contact points that form part of the complex interface, and are thus essential to IDPs and their interactome. Today, more than 2000 SLiMs have been identified and annotated, and more candidate SLiMs reported with many assembled in the Eukaryotic Linear Motif database ( 2020)). It is, however, difficult to identify new SLiMs, define SLiM properties and specificity, and to annotate their functions. Below we discuss some areas where new experimental approaches and machine learning methods may be integrated to shed further light on these problems.
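Because a SLiM is defined by a few conserved positions within a short stretch, candidate instances are often searched for with pattern matching. The sketch below scans sequences with a regular expression and reports all matches, including overlapping ones. The pattern used in the usage note is purely hypothetical and does not correspond to any annotated motif; the function names are likewise illustrative.

```python
import re


def find_motif(sequence, pattern):
    """Return (start, matched text) for every match, including overlapping ones."""
    # A capturing group inside a lookahead lets matches overlap.
    lookahead = re.compile(f"(?=({pattern}))")
    return [(m.start(1), m.group(1)) for m in lookahead.finditer(sequence)]


def scan_sequences(sequences, pattern):
    """Apply find_motif to a {name: sequence} mapping, keeping only proteins with hits."""
    hits = {}
    for name, sequence in sequences.items():
        found = find_motif(sequence, pattern)
        if found:
            hits[name] = found
    return hits
```

For example, `find_motif("AAKELAFGG", "[RK].L.[FILV]")` locates the single candidate starting at position 2 (zero-based). A sequence match of this kind is only a candidate: it says nothing about accessibility, disorder or binding, which is why experimental verification remains necessary.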
One problem when applying machine learning methods to predict new instances from known SLiMs is that, typically, only a small number of experimentally verified cases are reported for each individual SLiM. This is mainly because methods for SLiM identification have been low-throughput and have relied mostly on bioinformatics approaches with subsequent biochemical and biological testing (O'Shea et al., 2017), or through integrating computation and medium-throughput experiments (Zeke et al., 2015). More recently, however, new high-throughput approaches have been used to define, expand and refine SLiMs. Examples include combining structure-based shape complementarity analysis and proteome-wide affinity purification mass spectrometry (Brauer et al., 2019) and proteomic peptide phage display (ProP-PD), a method for simultaneous proteome-scale identification of SLiM-mediated interactions and foot-printing of the binding region with amino acid resolution (Ivarsson et al., 2014; Sundell et al., 2018). Recent work addressed ≈1,000,000 overlapping peptides covering the entire human disorderome in a single binding assay (Benz et al., 2021).
The generation of these large data sets provides new possibilities to train various types of prediction methods. Thus, a model has been trained to discriminate experimentally determined 14-3-3-binding SLiMs from non-binding phosphopeptides (Madeira et al., 2015) and a Random Forest model was trained on a high-throughput phage display data set collected for low-specificity SLiM binding to S100A5, identifying recognition rules based on features of hydrophobicity and shape complementarity as primary determinants (Wheeler et al., 2020). Likewise, prediction of binding regions in longer IDPs has been aided by the use of a trained bidirectional recurrent neural network, combining sequence, predicted secondary structures, Vina docking score and predicted disorder to improve the prediction (Khan et al., 2013). Thus, machine learning approaches may help identify features that define SLiM binding and specificity, and are often used together with 3D structures, as done e.g. for PDZ binding peptides (Kundu and Backofen, 2014); a case where also more confident negatives could be included. Similar improvements in the number of reliable true negatives were achieved in a reevaluation of high-throughput binding data of SH2-pTyr interactions (Ronan et al., 2020). Currently such efforts are limited by a relatively small number of large data sets, and further by the fact that larger-scale experiments often address already known SLiMs. Another problem when developing prediction methods is the relatively low number of negative examples in many data sets, which has an impact on the number of false positives provided by the resulting models. Thus, ways to improve this issue are clearly needed. Once addressed, however, machine learning approaches could substantially further our understanding of SLiM-based interactions by enabling extraction of features of interaction that expand our view on sequence properties that determine SLiMs.
Such features, which may also relate to conformational features, may help move beyond the expectation and limitation provided by a defined SLiM-sequence space. Indeed, SH2 domains, which are known to bind phospho-tyrosine ligands, have been shown to be able to also accommodate glutamates (Wallweber et al., 2014), which would not be expected solely from the SLiM definition, and therefore not typically included in fragment based database designs for machine learning purposes (Plewczyński et al., 2005). Finally, results from machine learning approaches may have the further benefit of contributing to the development of new vocabulary to describe SLiM-based interactions and uncover novel rules for interactions by IDPs.

Annotating function to short linear motifs
Although many SLiMs have been classified, it has been estimated that the human proteome contains more than 100,000 SLiMs, leaving most SLiMs unidentified (Tompa et al., 2014). Needless to say, each newly discovered potential SLiM in a disordered protein needs experimental verification as well as annotation, a task that remains a huge effort and is experimentally highly challenging. So, although identifying the presence of SLiMs may be relatively accessible, and even aided by machine learning approaches, functional annotation of SLiMs remains an obstacle. Current high-throughput approaches for functional annotation have used in vivo SLiM-dependent proximity labeling and in silico modeling of motif determinants to uncover new interactors (Wigington et al.). There are, however, a number of complications that may make it difficult to apply machine learning methods to aid in annotating the function of newly discovered SLiMs. The same SLiM may in one protein be embedded in a sequence that folds into an α-helix when bound, whereas in another protein, the same SLiM may form an extended structure or a β-strand when bound. One example is provided by a set of plant transcription factors that all bind to the αα-hub domain RST from RCD1 through the RST-binding SLiM. Here, the transcription factors individually form helical, extended or disordered SLiM structures in the complexes (O'Shea et al., 2017; Bugge et al., 2018). Thus, inherent to SLiMs is a certain plasticity in the position of the key conserved residues that form the critical contacts with the binding partner. Furthermore, the same sequence stretch within a disordered protein can contain overlapping SLiMs and form biologically relevant complexes with different partners. There are several examples of this, for example the transcriptional activation domains of the tumor suppressor protein p53, which each have many different partners binding to the same overlapping region (Oldfield et al., 2008; Teilum et al., 2021).
Once the target protein is known, additional complications can arise. One is SLiM 'reversibility', in which two proteins with the same SLiM bind in opposite directions to the same partner, as shown for Sap25 and REST binding to Sin3-PAH1 (Swanson et al., 2004) and peptide binding to MHC class II molecules (Günther et al., 2010). This directly points to the SLiM context as carrying additional functional relevance (Stein and Aloy, 2008). Indeed, it has been shown that the context may have both positive and negative effects on binding through charge attraction and repulsion (Palopoli et al., 2018; Prestel et al., 2019), and it may contribute to allosteric regulation (Garcia-Pino et al., 2010; Li et al., 2017; O'Shea et al., 2017). Thus, the influence of context on SLiM-based interactions is emerging as functionally important and as having large potential relevance for drug targeting (Bugge et al., 2020). However, these flanking sequences and regions are often not conserved and are not resolved in experimental structures of the protein complexes, or are not even included in the experiments. These regions, with their potential structural ensembles and conformational preferences, therefore cannot be extracted from the PDB and currently constitute a data gap for training purposes.
As the sequence properties of SLiMs are known only for a small fraction of the predicted SLiMome, there is a strong need for procedures that may enable the identification and annotation of SLiMs without extensive experimental efforts. Given the variability in the number of residues separating the key conserved sites within a SLiM, the ability to predict distance distributions of SLiM-based interactions, in which the possible contact points and spatial requirements could be mapped, would potentially be an important asset that may help facilitate functional annotation and even pinpoint relevant binding partners to address. While machine learning seems promising here, the elasticity of the SLiM sequence and the low conservation of the SLiM context would make a purely sequence-based approach difficult. Further information might be obtained from an MSA of the IDP and the binding partner (Skerker et al., 2008; Burger and Van Nimwegen, 2008; Weigt et al., 2009), although the signal for contacts might be relatively weak and difficult to extract. Another problem that emerges is how to learn from sets of SLiMs that have been characterized in depth, and apply this knowledge to other sequences that have been probed much less.
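To make the point about variable spacing concrete, the following is a minimal sketch of how a motif with a flexible spacer between two key positions might be searched for in a sequence. The pattern, the example sequence and the function name `find_motif_instances` are hypothetical and chosen only for illustration; they are not curated SLiM definitions from any database.

```python
import re

# Hypothetical motif (an assumption, not a curated SLiM definition):
# [ST] followed by P, then 1-3 arbitrary residues, then K or R.
HYPOTHETICAL_MOTIF = r"[ST]P.{1,3}[KR]"


def find_motif_instances(sequence, pattern=HYPOTHETICAL_MOTIF):
    """Return (start, end, matched_text) for every start position that matches.

    A lookahead is used so that overlapping instances are also reported.
    Positions are 0-based and `end` is exclusive. Because the spacer
    quantifier is greedy, the longest admissible spacer is reported for
    each start position.
    """
    lookahead = re.compile(f"(?=({pattern}))")
    return [(m.start(1), m.end(1), m.group(1)) for m in lookahead.finditer(sequence)]


if __name__ == "__main__":
    # Toy sequence, not a real protein.
    print(find_motif_instances("AASPAKRGG"))
```

A match of this kind only indicates that the sequence is compatible with the pattern. It says nothing about whether the region is disordered, accessible, or bound in a cell, which is why the flanking context and experimental verification discussed above remain essential.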
One recently described approach to learn the rules for protein-peptide interactions is a bespoke machine-learning approach, termed hierarchical statistical mechanical modelling, which can be trained on families with abundant experimental data (structures and sequences) (Cunningham et al., 2020). The approach learns a pseudo-energy function for interactions relevant for binding, which can be transferred also to proteins for which less information is available. In this way, the approach provides an elegant example of how machine learning methods can be used to learn general rules of biophysics that enable transfer and predictions on a wider class of problems and systems.
Looking ahead, although many structures have been determined of complexes between folded domains and peptides representing SLiMs from disordered proteins, these structures have in most cases been solved in the absence of the flanking regions. As these regions can be highly relevant for binding specificity and affinity, it is important to develop approaches that take these sequences into account. At the moment, however, the functional and structural properties of flanking regions are poorly understood and rarely studied, making it difficult to develop prediction methods. Initially, it might be fruitful to compare the surface properties of the protein that binds the IDP (e.g. charge patterning and hydrophobicity) to the overall physico-chemical properties of the flanking regions. One approach towards such endeavours uses a sequence-based model of charge patterning to relate sequence to function (Huihui and Ghosh, 2021). Eventually, and aided by the generation of data sets that include longer peptides or full-length proteins, it may be possible to develop prediction methods that combine local and long-range interactions, perhaps using methods similar to those used to predict the effects of enhancers in gene regulation (Shlyueva et al., 2014; Avsec et al., 2021).
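As a rough illustration of what simple sequence-based descriptors of a flanking region could look like, the sketch below computes the net charge per residue, the fraction of charged residues, and one commonly used charge patterning metric, the sequence charge decoration, SCD = (1/N) Σ_{m>n} q_m q_n |m − n|^{1/2}. This is not necessarily the model used in the work cited above. The charge assignment (K and R as +1, D and E as −1, histidine and the termini treated as neutral) and all function names are assumptions made for this example.

```python
import math

# Assumed residue charges near neutral pH: His and the termini are treated
# as neutral, which is a simplification made for this sketch.
CHARGE = {"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0}


def charges(seq):
    """Map a sequence to a list of per-residue charges."""
    if not seq:
        raise ValueError("sequence must not be empty")
    return [CHARGE.get(aa, 0.0) for aa in seq.upper()]


def net_charge_per_residue(seq):
    q = charges(seq)
    return sum(q) / len(q)


def fraction_charged(seq):
    q = charges(seq)
    return sum(1 for x in q if x != 0.0) / len(q)


def sequence_charge_decoration(seq):
    """SCD = (1/N) * sum over pairs m > n of q_m * q_n * sqrt(m - n)."""
    q = charges(seq)
    total = 0.0
    for m in range(1, len(q)):
        if q[m] == 0.0:
            continue  # neutral residues contribute nothing
        for n in range(m):
            total += q[m] * q[n] * math.sqrt(m - n)
    return total / len(q)


if __name__ == "__main__":
    # Same composition, different patterning (toy sequences).
    for s in ("EKEK", "EEKK"):
        print(s, net_charge_per_residue(s), sequence_charge_decoration(s))
```

The two toy sequences have identical composition and net charge, yet the one with segregated charges has the more negative SCD. Descriptors of this kind could in principle be compared with the charge distribution on the surface of the folded binding partner, although how best to make that comparison is an open question.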

Complexes beyond SLiMs
As most IDPs have large exposed surface areas with high conformational flexibility, they also have high potential for binding other proteins (Berlow et al., 2015; Gao et al., 2018). IDPs have thus shown remarkable structural and functional diversity in their complexes, ranging from complex formation through folding-upon-binding, with interfaces similar in composition and properties to those formed between folded proteins (Rogers et al., 2014; Sugase et al., 2007; Iešmantavičius et al., 2014; Robustelli et al., 2020), to complexes where the disordered partner remains dynamic to different extents (Brzovic et al., 2011; Tillu et al., 2021). At the extreme end of the scale, highly dynamic complexes, which entirely lack stable secondary or tertiary structure, can form, for example between two highly and oppositely charged IDPs (Borgia et al., 2018; Schuler et al., 2020). The dynamics retained in these complexes serve functional roles through very different mechanisms. Their dynamic properties lead to several mechanistic advantages such as complex partner exchange (Berlow et al., 2017, 2019), facilitated dissociation via competitive substitution through formation of trimers (Sottini et al., 2020), ensemble redistributions (Henley et al., 2020), and allosteric regulation (Milles et al., 2018; Hendus-Altenburger et al., 2019). Their malleability also confers other functional advantages on IDPs, one of which is the ability to act as hubs that bind multiple partners, some at overlapping sites in competition, and some distributed along the chain, leading to scaffolding and e.g. the formation of signalling complexes or transcriptional factories. How might machine learning methods aid in decomposing the role of disorder in the functions of IDPs, and what are the problems associated with this task?
One of the first discoveries from studying disordered protein complexes was that they can fold upon binding, either to an already folded partner through one of two highly discussed mechanisms (Dogan et al., 2014; Iešmantavičius et al., 2014; Arai et al., 2015) or through the occasional mutual folding of two disordered proteins (Demarest et al., 2002; Dogan et al., 2012). Whereas folding-upon-binding of disordered regions may at first seem highly analogous to the process of protein folding, and hence should in principle be amenable to machine learning approaches for predicting the structures of the complexes, there are a number of obvious caveats to its direct use. First, even though the binding region may be known, it is not easy to predict from sequence alone which part of the disordered protein will fold. Further, a continuum of disorder can exist both in the IDP alone and in a complex, and highly disordered complexes, by some termed fuzzy (Fuxreiter and Tompa, 2012; Olsen et al., 2017), may result in weak and near-stochastic interactions.
One such example is the activation domains of transcription factors (Erkine, 2018), whose properties were originally characterized as 'acid blobs and negative noodles' (Sigler, 1988). Recently, a number of multiplexed assays have been used to expand this view and study the functional requirements of the sequence properties of transcriptional activation domains (Staller et al., 2018; Ravarani et al., 2018; Erijman et al., 2020; Tycko et al., 2020; Sanborn et al., 2021; Staller et al., 2021). These results confirm the original observations of a requirement for hydrophobic and negatively charged residues and provide additional information about the role of patterning. Further, the data can be used to train various sequence-based machine learning models for activity (Ravarani et al., 2018; Erijman et al., 2020; Sanborn et al., 2021; Griffith and Holehouse, 2021). The results suggest that most functional variation can be explained solely by amino acid composition, but that there is additional signal from higher-order properties of the amino acid sequence (Erijman et al., 2020), thus highlighting the importance of generating sequence libraries with such properties in mind (Staller et al., 2018).
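The distinction between composition and higher-order sequence properties can be illustrated with a small sketch. The function below turns a sequence into a 20-dimensional amino acid composition vector of the kind that could serve as input to a simple sequence-based model. The function name and toy sequences are assumptions made for this example, and the sketch does not reproduce the feature sets used in the studies cited above.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, alphabetical


def composition_features(seq):
    """Return the fraction of each standard amino acid in `seq`.

    Non-standard characters are counted in the sequence length but have no
    feature of their own (a simplification for this sketch).
    """
    if not seq:
        raise ValueError("sequence must not be empty")
    seq = seq.upper()
    counts = Counter(seq)
    return [counts.get(aa, 0) / len(seq) for aa in AMINO_ACIDS]


if __name__ == "__main__":
    # Two toy sequences with the same composition but different patterning.
    clustered = "DDDDFFFF"
    alternating = "DFDFDFDF"
    print(composition_features(clustered) == composition_features(alternating))
```

Because composition vectors are invariant to residue order, the two toy sequences are indistinguishable to any model that sees only these features. Capturing the additional signal from patterning therefore requires order-aware features or models, and sequence libraries that vary patterning at fixed composition.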
In addition to specific favourable interactions in a complex, binding by disordered proteins may also be driven entropically through other mechanisms (Pritišanac et al., 2019; Flock et al., 2014) such as counter-ion release (Borgia et al., 2018), increased conformational flexibility in the complex or expansion of the surrounding disordered context (Heller et al., 2015). Prediction methods should ideally be able to quantify the remaining disorder after binding. Indeed, there are several examples of IDPs for which the bound state involves differently structured sub-populations of the complex, which all contribute to the specificity and selectivity in binding (Brzovic et al., 2011; Henley et al., 2020), and there are complexes in which several contacts are made between the IDP and the folded partner, but where these dynamically and independently interchange (shuffle), just as when holding a hot potato (Perham, 1975; Hendus-Altenburger et al., 2016). While such examples provide difficult targets for prediction methods, they are also difficult to characterize by experiments, and thus there are limited data to train and benchmark on.

Biomolecular condensates
Many IDPs can form multivalent interactions that are key to the formation of so-called biomolecular condensates, either alone, with another IDP or in complex with folded domains or RNA. We refer the reader to recent reviews on the topic (Banani et al., 2017; Peran and Mittag, 2020; Dignon et al., 2020; Choi et al., 2020), and focus here mostly on the role of IDPs in forming such condensates and a set of problems where machine learning methods might help.
Biomolecular condensates often form via the process of liquid-liquid phase separation (LLPS), and a central requirement for a molecule to form these structures is the ability to form multivalent interactions. In the context of IDPs, this can for example be a protein carrying multiple SLiMs that can bind to folded multidomain proteins (Li et al., 2012; Bouchard et al., 2018) or a set of amino acid residues within an IDP that can form sufficiently strong interactions between them (Wang et al., 2018; Martin et al., 2020). Key areas for biophysical research include identifying the sequences and interactions that drive phase separation, identifying determinants of specificity in condensate formation, and elucidating the structural and dynamical features in biomolecular condensates. Before examining these questions, we stress that not all IDPs readily undergo LLPS, and that not all condensates involve IDPs.
Given the importance of IDP-IDP and SLiM-target interactions, the methods discussed above for characterizing IDPs and SLiMs are also important for studying condensates. One key insight is that, because intramolecular interactions within IDPs resemble intermolecular interactions between IDPs, there is a correspondence between the propensity of an IDP to sample more compact structures and its propensity to undergo phase separation (Panagiotopoulos et al., 1998; Lin and Chan, 2017; Dignon et al., 2018a, 2020; Choi et al., 2020; Martin et al., 2020). Thus, methods to predict compaction of IDPs or to parameterize simulation methods for isolated IDPs will also aid in studying phase separation of IDPs. Similarly, methods to predict SLiMs and their binding partners (and possibly the affinity of pairwise interactions) from sequence or from MSAs will aid in mapping the interactions that drive phase separation in these systems, and help to derive rules and features for their formation.
A number of databases have recently been created to collect information about proteins that undergo phase separation (Li et al., 2020b,c; Mészáros et al., 2020; You et al., 2020; Ning et al., 2020). Such databases are now being used to develop prediction methods for phase separation (Vernon et al., 2018; Hardenberg et al., 2020; van Mierlo et al., 2021; Raimondi et al., 2021; Saar et al., 2021), also with the aim of providing insight into the sequences and properties that are important for phase separation. In the same way as prediction methods for protein disorder have played a central role in understanding the role of disorder at the proteome level, such methods have the potential to do the same for biomolecular condensates (Vernon et al., 2018; Hardenberg et al., 2020).
Moving ahead, it will be important to extend such databases and prediction methods with additional quantitative information on the propensity to phase separate, and to annotate more broadly what components or features are involved in the formation of condensates. In the same way as many proteins and peptides have been shown to form amyloid structures under some conditions, many proteins will likely undergo LLPS. Thus, in the same way as methods for predicting aggregation propensities have been trained on quantitative measurements of aggregation (Chiti et al., 2003; Fernandez-Escamilla et al., 2004; Pawar et al., 2005), improvements in our ability to predict the propensity to undergo LLPS will likely involve fitting to or benchmarking against quantitative measurements of phase separation. Such analyses are already being performed with various coarse-grained simulation methods discussed above (Dignon et al., 2018b; Martin et al., 2020; Dignon et al., 2020; Choi et al., 2020; Bremer et al., 2021), but it may be difficult to scale these methods to proteome-wide applications or to scan large numbers of components in heterotypic condensates. The relationship between intra- and inter-molecular interactions and the driving force for phase separation suggests that it might be possible to train sequence-based prediction models on single-chain properties and use these to predict the ability to undergo LLPS. Such methods have already provided a number of general rules about valency and patterning that appear promising for our ability to predict the propensity of proteins to undergo LLPS from their amino acid sequence (Martin et al., 2020; Statt et al., 2020; Hazra and Levy, 2020; Amin et al., 2020; Bremer et al., 2021). Including the context, such as concentration, crowding and additional partners in heterotypic condensate formation, in these models would be an important extension.
The conformational landscape of IDPs also depends on a rich variety of post-translational modifications such as phosphorylation, methylation, sulfation, and lipidation; for example, phosphorylation and arginine methylation have been shown to affect the formation of condensates (Nott et al., 2015; Monahan et al., 2017; Lu et al., 2018; Hofweber et al., 2018; Hofweber and Dormann, 2019). Thus, predicting post-translational modifications and their effects on condensates would help provide additional insight into how condensates are regulated.

Intrinsic disorder and human diseases
Given their wide range of biological functions, it is not surprising that IDPs are involved in a number of human diseases (Uversky et al., 2009), including neurodegeneration (Uversky, 2015) and in particular cancer (Iakoucheva et al., 2002; Deiana et al., 2019; Mészáros et al., 2021). How may machine learning methods help understand the role of IDPs in disease?
While it appears a simple question to ask whether IDPs are enriched in a particular disease, answering this question requires accurate and unbiased predictions of protein disorder (Deiana et al., 2019). Thus, we need continuous development of databases and quantitative measures of protein disorder and assessment of prediction accuracy, as well as development of new prediction methods (Nielsen and Mulder, 2019; Dass et al., 2020; Hatos et al., 2020; Necci et al., 2021).
It is important to gain a better understanding of the molecular mechanisms underlying diseases involving IDPs. The expression of IDPs is tightly regulated, and misregulation may lead to disease (Babu et al., 2011). For folded proteins, it is well established that genetic missense variants may cause disease via a wide range of mechanisms including affecting both protein stability and interactions (Stefl et al., 2013; Sahni et al., 2015; Stein et al., 2019). A substantial number of disease-causing variants are, however, located in regions of predicted disorder and are predicted to affect for example SLiMs (Vacic et al., 2012). Thus, it is becoming clear that missense variants in IDPs can also lead to disease via perturbed interactions that either cause loss or gain of function (Meyer et al., 2018; Li et al., 2019; Wong et al., 2020), including promoting the formation of fibrils and toxic oligomeric species.
Loss of protein stability arising from missense variants, and the resulting protein degradation, is established as a key mechanism underlying loss of function for many folded proteins (Casadio et al., 2011; Stein et al., 2019), and indeed measurements or predictions of protein stability and abundance are useful for predicting loss of function (Matreyek et al., 2018; Cagiada et al., 2021). While intrinsic thermodynamic stability of a folded state is not a meaningful quantity for IDPs, missense variants may still affect their cellular abundance. This may for example happen through mutations that impair interactions and degradation, as exemplified by a missense variant in the IDR of the growth hormone receptor, a mutation leading to severe lung cancer (Chhabra et al., 2018). Similarly, missense variants may lead to new interactions by SLiM appearance (Davey et al., 2015; Meyer et al., 2018), lack of degradation by interference with degrons, disorder-to-order formation (Vacic et al., 2012), or changes in long-range interactions (Grazioli et al., 2019). In the latter example, machine learning techniques helped uncover differences in conformational dynamics from molecular dynamics trajectories of amyloid beta and the E22G disease variant, implicating their fibrillation into different morphologies. Thus, we need a better understanding both of how IDPs are targeted for degradation and the sequence signals that determine cellular abundance (van der Lee et al., 2014), and of how contact remodeling along the chain impacts the ensemble. Disordered regions may act as degradation signals (degrons) (Uversky, 2013), and new large-scale experiments are enabling a better understanding of the sequence and structural properties of degrons (Geffen et al., 2016; Koren et al., 2018). We expect that such experiments will ultimately enable better predictions of the degradation and abundance of IDPs, and the effects of mutations on these properties.
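A first-pass computational check for SLiM appearance or disappearance caused by a missense variant could look like the sketch below, which compares motif matches before and after a substitution. The motif pattern, the sequences and all function names are hypothetical and used for illustration only. A regular-expression match is at most a hint and does not establish that a functional interaction is gained or lost.

```python
import re


def apply_missense(seq, position, ref, alt):
    """Return `seq` with residue `ref` at 1-based `position` replaced by `alt`."""
    if seq[position - 1] != ref:
        raise ValueError(f"expected {ref} at position {position}, found {seq[position - 1]}")
    return seq[: position - 1] + alt + seq[position:]


def motif_starts(seq, pattern):
    """0-based start positions of all (possibly overlapping) pattern matches."""
    return {m.start() for m in re.finditer(f"(?=(?:{pattern}))", seq)}


def motif_changes(seq, position, ref, alt, pattern):
    """Report motif start positions gained or lost upon the substitution."""
    mutant = apply_missense(seq, position, ref, alt)
    before = motif_starts(seq, pattern)
    after = motif_starts(mutant, pattern)
    return {"gained": sorted(after - before), "lost": sorted(before - after)}


if __name__ == "__main__":
    # Toy example with an illustrative proline-spacing pattern.
    print(motif_changes("AAPAAAAA", 6, "A", "P", r"P..P"))
    print(motif_changes("AAPAAPAA", 3, "P", "A", r"P..P"))
```

In practice such a scan would need to be combined with disorder predictions, conservation where available, and experimental data on binding or abundance before any mechanistic conclusion is drawn.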
One particularly important role of IDPs in disease may be in those that are involved in the formation of biomolecular condensates. A number of diseases have been associated with misregulation or formation of such condensates (as recently reviewed by others; Aguzzi and Altmeyer (2016) (2021)), and thus a better understanding of the sequence properties that drive the formation of condensates will be important for predicting their role in disease (Tsang et al., 2020) as well as for targeting them pharmaceutically (Biesaga et al., 2021).
More generally, in order to better predict how variants in IDPs may cause disease, we need a clearer overview of the relationship between sequence, structural and dynamical properties, binding preferences and function. For folded proteins, analyses of conservation via MSAs are very powerful for predicting whether a variant may cause disease (Riesselman et al., 2018; Livesey and Marsh, 2020), but as discussed above, constructing and analysing MSAs poses unique challenges for IDPs. Thus, we need new methods to leverage the ever-growing sequence databases to predict the effects of sequence variation in IDPs (Zarin et al., 2019; Zhou et al., 2020; Zarin et al., 2021), ultimately enabling targeting and drug development for combating diseases related to misregulation and dysfunction of disordered proteins. From an experimental point of view, multiplexed assays of variant effects (also sometimes called deep mutational scans) can provide key insights into both fundamental aspects of protein science (Fowler and Fields, 2014) and genotype-phenotype relationships and disease (Starita et al., 2017). Such experiments are now also beginning to provide a more comprehensive view of the effects of amino acid substitutions in IDPs such as the experiments on activation domains of transcription factors discussed above (Staller et

Outlook
IDPs are an enormously broad class of molecules and together with IDRs they are involved in a wide range of biological functions. A key defining feature of IDPs and IDRs is something they do not have, namely a persistent three-dimensional structure. Thus, in many ways they are defined by being different from the globular and membrane proteins that for more than a century have been the central focus of much protein science. Indeed, when the first CASP experiment was performed in 1994 (Moult et al., 1995) only a few proteins were recognized as being intrinsically disordered, and rarely was the conformational disorder linked to biological function.
In some ways, IDPs and IDRs are simpler than folded proteins because their linear (primary) structure already provides much insight into their chemistry and ability to interact with other molecules. Thus, a number of computational methods have been developed to predict disorder from sequence and to identify local segments of the sequence that can bind to other molecules.
This apparent simplicity, however, can be deceiving. For folded proteins, the necessity to fold into a specific three-dimensional structure puts substantial restraints on the sequence and thus on evolution. Thus, MSAs of folded proteins often provide clues about specific residues and regions that are key to structure and function. Together with a large number of high-resolution structures, this has led to our ability to predict with increasing accuracy the structure of folded proteins. In contrast, while the sequences of many IDPs are conserved for function, this relationship is complex and different from that governing folded proteins, largely because their function is also coupled to their dynamics. It is interesting to speculate what protein science would have looked like if we had first discovered IDPs, and then later found sequences that fold into specific three-dimensional structures.
Over the last 25 years, we have begun to understand the rules that govern the structural properties of IDPs, their interactions and their biological functions. As for folded proteins, much insight has come from studying one system at a time, and computational methods are used to consolidate this into rules and predictions. In this review we have outlined a number of current problems in studies of IDPs, including our ability to characterize their structural preferences and interactions and our limitations in describing them. We have highlighted areas where machine learning and other computational methods have already had important impact, and new areas for further exploration (Fig. 2). Common to all is the tight interplay between experiment and computation. Particularly important is perhaps the realization that the two need to be developed together, with experiments being designed to inform computational methods, and computational algorithms developed, trained, and benchmarked using experiments. We look forward to seeing where these approaches will take the field.
Figure 2. Outlining connections between sequence, structure, dynamics and function of IDPs where implementing machine learning approaches could have potential (indicated with black connectors). From sequences and sequence alignments, machine learning approaches may help extract conformational properties from poorly defined sequence alignments of IDPs. Machine learning may be used to improve methods for combining biophysical experiments (here illustrated by SAXS, NMR, smFRET, EPR, IR and CD) and computation, for example by deriving better forward models and helping to parameterize force fields for better coarse-grained (CG) models of IDPs. Machine learning may also enable extraction of new SLiMs and annotation of their biological functions, and provide insight into, and the ability to predict, how context and flanking regions (flanks) contribute to IDP function. Machine learning may also help predict and understand properties important for the formation of biomolecular condensates, and how context plays a role in their formation and dissolution. Finally, but not illustrated here, the combination of these approaches may help assign pathogenicity to genetic variants of IDPs. United, machine learning in combination with bioinformatics, simulation, theory and experiments can provide new rules for understanding IDP ensembles and IDP function. Jointly, such rules are necessary to enable the important decomposition of how mutations in IDPs may lead to disease states.