Predictions of Backbone Dynamics in Intrinsically Disordered Proteins Using De Novo Fragment-Based Protein Structure Predictions

Intrinsically disordered proteins (IDPs) are prevalent: over 30% of human proteins are estimated to have long disordered regions. Computational methods are widely used to study IDPs; however, nearly all treat disorder in a binary fashion and do not account for the structural heterogeneity present in disordered regions. Here, we present a new de novo method, FRAGFOLD-IDP, which addresses this problem. Using 200 protein structural ensembles derived from NMR, we show that FRAGFOLD-IDP achieves superior results compared to methods that can predict related data (NMR order parameter or crystallographic B-factor). FRAGFOLD-IDP produces very good predictions for 33.5% of cases and gives better insight into the dynamics of the disordered ensembles. The results also show that it is not necessary to predict the correct fold of the protein to reliably predict per-residue fluctuations. This implies that disorder is a local property that does not depend on the fold. Our results are orthogonal to DynaMine, the only other method significantly better than the naïve prediction. We therefore combine the two methods using a neural network. FRAGFOLD-IDP enables better insight into backbone dynamics in IDPs and opens exciting possibilities for the design of disordered ensembles, disorder-to-order transitions, or design for protein dynamics.

Supplementary Text S1. Examples of different quality FRAGFOLD-IDP predictions.
To give a better intuition for the quality of FRAGFOLD-IDP predictions and the meaning of R_S values, we discuss here in more detail examples of poor, medium and excellent FRAGFOLD-IDP predictions. The disorder profiles presented in Figure 1 of the main manuscript are reproduced here for convenience and clarity.
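As a rough illustration of how a single R_S value could be obtained from two disorder profiles, the sketch below computes a rank correlation between a predicted and an NMR-derived per-residue profile. It assumes that R_S denotes Spearman's rank correlation coefficient, which this supplementary text does not state explicitly. The function names and the five-residue profiles are hypothetical and are not taken from the dataset.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-zero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-residue RMSD profiles (toy values, not from the paper).
predicted = [0.5, 1.2, 3.0, 2.1, 0.4]
nmr = [0.6, 1.0, 2.5, 2.8, 0.3]
r_s = spearman(predicted, nmr)  # approximately 0.9 for these toy values
```

Because only the ranks enter the calculation, a profile that orders the residues correctly scores highly even if the absolute RMSD values differ from the NMR ensemble.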
An example of a poor prediction is 1SIY, lipid transfer protein 1 (Figure 1). The prediction achieves an R_S value of 0.21, and the disorder profile is indeed not informative. Although the disordered region between residues 50 and 62 is correctly identified, it is lost in the noise of false positives: FRAGFOLD-IDP predicts 4 other highly disordered regions. The short disordered region around residue 20 is also completely missed in the prediction.
An example of a medium quality prediction is 1P94, ParG protein (Figure 2). The prediction achieves R_S = 0.54, which is close to the median value of the predictions on the entire dataset. Here, FRAGFOLD-IDP correctly identifies the first 15 residues as highly disordered, but underestimates the breadth of this region, which spans 35 residues. Finally, the region from around residue 48 to 76 is correctly identified as ordered, and the disorder profile shows low per-residue RMSD values.
An example of an excellent prediction is 2KJV, ribosomal protein S6 (Figure 3). It achieves an R_S value of 0.82. FRAGFOLD-IDP captures all of the features of the NMR disorder profile remarkably well. The large disordered region between residues 40 and 60 is well reproduced, although FRAGFOLD-IDP slightly overestimates it, extending the region to around residue 35. The C-terminal region (residues 82-101) is also slightly overestimated, starting around residue 79 in the FRAGFOLD-IDP profile. Finally, a small region of medium disorder around residue 10 is captured by FRAGFOLD-IDP, but it spans residues 1 to 15 instead of residues 7 to 12. The increase in per-residue RMSD signal could be partially attributed to the way sliding window superposition (window size = 10) works: for residues 1 to 9 there are fewer averaging steps, because residue 1 is superposed only once, residue 2 twice, and so on.
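The terminal effect of the sliding window described above can be made concrete by counting how many windows contain each residue. The sketch below assumes that the window advances one residue at a time and that only windows lying fully inside the chain are used; neither detail is stated here. The function name `window_coverage` is hypothetical, and the chain length of 101 is taken from the C-terminal residue range quoted for 2KJV.

```python
def window_coverage(n_residues, window=10):
    """Count the sliding windows (step of 1) that contain each residue.

    Returns a list indexed from residue 1 (position 0) to residue
    n_residues (last position). Assumes every window lies fully inside
    the chain.
    """
    if n_residues < window:
        raise ValueError("chain is shorter than the window")
    counts = []
    for residue in range(1, n_residues + 1):
        # Window start positions that still include this residue.
        first_start = max(1, residue - window + 1)
        last_start = min(residue, n_residues - window + 1)
        counts.append(last_start - first_start + 1)
    return counts


coverage = window_coverage(101, window=10)
n_terminal = coverage[:10]   # residues 1 to 10
c_terminal = coverage[-10:]  # residues 92 to 101
```

Under these assumptions residue 1 is covered by a single window and residue 10 is the first to be covered by ten, so the per-residue averages at both termini rest on fewer superpositions than those in the interior of the chain.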

Supplementary Text S2. Outliers.
The results for the outliers are gathered in Table 1. The outlier proteins are shorter than average for the dataset (75 residues in the outliers versus 105 residues in the dataset) but have a typical disorder content (29% in the outliers versus 33% in the dataset). FRAGFOLD-IDP R_S is the output of the FRAGFOLD-IDP method. Best cluster R_S is the highest R_S obtained on the same set of models as FRAGFOLD-IDP R_S, selected from among the clusters generated by PFClust. Top and median R_S values come from 1,000 random ensembles generated from the same raw ensemble, as previously. Naïve R_S values are the results of the naïve approach, which uses only secondary structure prediction and does not require any simulations.
bridges that constrain the structure making it more ordered (Figure 4).

Supplementary Figure S1. Relationship between the disorder content in NMR ensembles and per-CATH class quality of FRAGFOLD-IDP predictions. The top-level CATH classification (class) was assigned to each protein. A protein not classified in CATH was assigned to the "none" category. For each CATH class, a linear regression fit was also computed. Prediction performance for the alpha class (N = 58) is not correlated with disorder content (Pearson's r = -0.04, p = 0.78). Performance for the beta class (N = 30) is also not correlated with disorder content (r = -0.02, p = 0.90). Performance for the alpha/beta class (N = 60) is negatively correlated with disorder content (r = -0.33, p = 0.01). Performance for the few secondary structures class (N = 7) is negatively correlated with disorder content (r = -0.80, p = 0.03), but this class is under-represented, especially for disorder > 40% (2 cases). Performance for the none class (N = 45) is positively correlated with disorder content, but the correlation is not statistically significant (r = 0.14, p = 0.37).
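The per-class analysis in the caption above amounts to grouping proteins by CATH class and, within each group, fitting a least-squares line and computing Pearson's r between disorder content and prediction quality. A minimal sketch follows; the function names and the toy records are hypothetical (invented for illustration, not values from the dataset), and the p-values quoted in the caption are not computed here.

```python
from collections import defaultdict
from math import sqrt


def linear_fit_and_r(x, y):
    """Ordinary least-squares line y = slope * x + intercept, plus Pearson's r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


def per_class_fits(records):
    """Group (cath_class, disorder_percent, r_s) records and fit each class."""
    groups = defaultdict(lambda: ([], []))
    for cath_class, disorder, r_s in records:
        xs, ys = groups[cath_class]
        xs.append(disorder)
        ys.append(r_s)
    return {c: linear_fit_and_r(xs, ys) for c, (xs, ys) in groups.items()}


# Toy records: (CATH class, disorder content in %, prediction R_S).
toy_records = [
    ("alpha", 10, 0.9), ("alpha", 20, 0.8), ("alpha", 30, 0.7),
    ("none", 10, 0.2), ("none", 30, 0.4), ("none", 50, 0.6),
]
fits = per_class_fits(toy_records)
```

With only a handful of points per class, as in the few secondary structures class (N = 7), both the slope and r are sensitive to individual proteins, which is why the caption flags that class as under-represented.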
Supplementary Figure S2. Optimisation of the consensus predictor. (A and B) Optimisation of the window size using half the number of features as the number of hidden units. (C and D) Optimisation of the window size using the geometric mean of the numbers of input and output units as the number of hidden units. Outliers are shown as red dots.
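The two hidden-layer sizing rules named in this caption can be written out as simple arithmetic. The sketch below is illustrative only: it assumes that the number of input features equals the window size times a fixed number of features per residue, and that the network has a single output unit. Both are assumptions, as are the function names and the example values.

```python
from math import sqrt


def hidden_units_half(n_features):
    """Rule for panels A and B: half the number of input features."""
    return max(1, n_features // 2)


def hidden_units_geometric(n_inputs, n_outputs):
    """Rule for panels C and D: geometric mean of input and output unit counts."""
    return max(1, round(sqrt(n_inputs * n_outputs)))


def candidate_sizes(window_sizes, features_per_residue, n_outputs=1):
    """Tabulate both rules for each window size (hypothetical feature layout)."""
    table = {}
    for window in window_sizes:
        n_inputs = window * features_per_residue
        table[window] = (
            hidden_units_half(n_inputs),
            hidden_units_geometric(n_inputs, n_outputs),
        )
    return table


# Example: two features per residue (an assumed value) and two window sizes.
sizes = candidate_sizes([5, 15], features_per_residue=2)
```

With a single output unit, the geometric-mean rule grows with the square root of the input size, so it yields much smaller hidden layers than the half-features rule as the window widens.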