Deep learning in prediction of intrinsic disorder in proteins

Graphical abstract

Experimentally characterized IDPs and IDRs can be collected from several databases, such as DisProt [28], PDB [29], IDEAL [30], DIBS [31], and MFIB [32]. However, these resources cover only a small fraction of IDPs, with the largest databases, DisProt and PDB, currently including about 2 thousand and 25 thousand IDPs, respectively [28,33]. Compared to the over 225 million protein sequences that are available in the newest 2021_04 release of UniProt [34], we have a long way to go to comprehensively identify and annotate IDPs and IDRs. Computational methods that accurately predict intrinsic disorder can be used to facilitate efforts to close this huge and growing knowledge gap. Computational predictors have already made a large impact on the intrinsic disorder field by powering a rapid acceleration of research on IDPs and IDRs [35]. They are also used across many areas, including rational drug design [23][24][25][26], structural genomics [36][37][38], and medicine [39,40].
Development of computational predictors of disorder is a longstanding research problem. A recent survey identified 103 disorder predictors developed over the last four decades [41]. Current surveys point to the long history of the disorder prediction area, providing invaluable insights concerning the architectures of these methods, their availability, trends in their development, and approaches to comparatively evaluate their predictive performance [40][41][42][43][44][45][46][47][48]. Moreover, users and developers benefit from empirical studies that comparatively assess the predictive quality of disorder predictors [33,[49][50][51][52][53][54][55][56][57][58][59]. These comparative studies include several community assessments, such as the Critical Assessment of Structure Prediction (CASP) from CASP5 to CASP10 [53][54][55][56][57][58] and the Critical Assessment of Intrinsic Protein Disorder (CAID) [52]. The community assessments involve the evaluation of predictors on blind test datasets (i.e., datasets that were not available to the authors of the predictors) by independent assessors who do not take part in the competition, using tests and metrics that are widely accepted by the community.
Results of CASP10, the most recent CASP community assessment that covers disorder prediction (i.e., subsequent CASP experiments do not include disorder predictions), reveal that the top three predictors belong to the machine learning (PrDOS and DISOPRED) and meta-predictor (MFDp) categories [58]. However, a recent survey notes a rapid influx of a new subfamily of machine learning methods that rely on deep neural networks (DNNs), starting after the first DNN-based method was released in 2013 [41]. DNNs differ from shallow neural networks, which were commonly used to implement disorder predictors in the early 2000s [36,[72][73][74][75][76], by the use of multiple hidden layers and more sophisticated types of neurons and connections. The shift to deep network models is motivated by their favorable levels of predictive performance when compared with the other types of disorder predictors. In particular, we observe that the best-performing methods from the just-completed CAID experiment [85], which include flDPnn [86], SPOT-Disorder2 [87], RawMSA [88] and AUCpreD [89], rely on DNNs. Motivated by their growing numbers and success, we provide the first review of the DNN-based disorder predictors. We identify and summarize 13 DNN-based disorder predictors that were developed since 2013. We analyze trends in the development of these predictors and empirically compare the predictive quality produced by the deep learners against the other types of disorder predictors based on results produced on the blind test dataset from the CAID experiment. We also comment on future prospects in the development of the DNN-based disorder predictors.

Prediction of intrinsic disorder using deep learning
Nowadays, deep learning is widely used to develop methods that predict protein structure and function. Perhaps the most obvious example is protein structure prediction, where deep learning models, such as AlphaFold, have deservedly dominated over other types of methods [90][91][92][93]. Moreover, deep learning is utilized to predict other structural aspects of proteins, such as contacts [94], secondary structure [95] and torsional angles [96]. DNNs are also successfully applied to predict protein function [97][98][99], protein-drug interactions [100,101], and functional sites [102][103][104].
The intrinsic disorder prediction field was not immune to the infusion of deep learning-based approaches. The first DNN-based disorder predictor, DNdisorder [105], was published in 2013. Table 1 summarizes a comprehensive list of 36 disorder predictors that were published since that time. This list contextualizes the efforts to develop deep learning predictors in the broader setting of the entire disorder prediction field. We identify the 36 predictors using a wide-ranging list of sources, including databases of disorder predictions: MobiDB [122], D2P2 [123] and DescribePROT [124]; community assessments and surveys that were published in or after 2013 [33,[41][42][43]46,47,49,50,52,58,59]; and a manual search of relevant articles from PubMed that we collect using the "(disorder[Title]) AND (prediction[Title]) AND protein" query. Table 1 reveals that 13 of the 36 recent disorder predictors use deep learning models. We find that it took two more years for the second DNN-based predictor, DeepCNF-D, to be published in 2015 [112]. The following three years include similarly low numbers of new deep learning tools, with two methods published in 2016, one in 2017, and one more in 2018. Year 2019 marks a turning point in the efforts to develop DNN-based disorder predictors, with two tools published in 2019, two in 2020, and four in 2021. Fig. 1 conveniently summarizes the corresponding trends. It highlights the gradual shift toward predictors that rely on deep networks and the fact that these methods constitute the majority (58%) of the predictors published over the last three years (green line in Fig. 1). We also note the consistent rate of release of new methods, which ranges between 11 and 13 per three-year interval. Table 1 provides a few additional insights.
We manually check the websites of the corresponding methods and find that 23 of the 36 predictors (over 60%) are available to end users as standalone software (5 methods), a webserver (10 methods), or in both modalities (10 methods). Interestingly, all DNN-based predictors that were published after 2016, except for flDPlr, are among the publicly available tools. This rate of availability is substantially better than in related areas, such as the prediction of protein-binding and RNA-binding residues, where availability is at around 40% [103,125]. Webservers are a convenient option for less programming-savvy end users, such as some biochemists or structural biologists. In this case, predictions are performed on the webserver side and users are not required to install and run the software on their own hardware. However, the main drawbacks of webservers are that they depend on uninterrupted Internet availability, limit the size of individual jobs (i.e., the number of proteins that can be predicted), and may delay results when their workload is heavy. On the other hand, the standalone software option is best suited for skilled programmers and bioinformaticians. The software must be installed and executed locally. This facilitates running larger jobs and allows embedding a given disorder predictor into other bioinformatics pipelines. For instance, putative disorder generated by the popular IUPred [61,62,120] was used to predict DNA-binding residues [126], B-cell epitopes [127], and the quality of protein structures [128].

Table 1. Summary of intrinsic disorder predictors that were developed since 2013, when the first deep learning-based method was released. The predictors are sorted chronologically by year of publication. "*" denotes predictors that are used in Fig. 3. (1) "No" means that a given predictor was not published in a peer-reviewed journal but was included based on its participation in the CASP and/or CAID assessments. (2) Availability: released as "SP" (standalone program) and/or "WS" (web server); "No" means not released as either SP or WS; "N/A" (not available) means that the SP and/or WS were released at the time of publication (i.e., a URL was provided in the original article) but were not available as of February 2022, when access was tested.

Table 2 details the 13 deep learning-based disorder predictors. We summarize the inputs, topologies, predictive performance, and runtime of these methods. The inputs cover a broad range of relevant information, including the input sequence itself and several sequence-derived characteristics, such as evolutionary information (e.g., position-specific scoring matrix (PSSM) and residue-level conservation), putative structural features (e.g., secondary structure and solvent accessibility), and physicochemical characteristics that are typically quantified at the amino acid level (e.g., polarizability, hydrophobicity, and isoelectric point).
We define topologies based on two key aspects: the type of the deep network and its size/depth. The network types include classical deep feed-forward neural networks (FFNNs) and more sophisticated restricted Boltzmann machines (RBMs), convolutional neural networks (CNNs) and bidirectional recurrent neural networks (BRNNs). We grade the network sizes by the number of hidden layers into three categories: moderately deep, with between 2 and 3 hidden layers; deep, with 4 to 5 hidden layers; and very deep, with over 5 hidden layers. We observe a few interesting patterns. First, the majority of the predictors rely on multiple input types, with the two most popular options being evolutionary and putative structural data. These methods take advantage of the deep neural network's ability to combine diverse types of inputs, including numeric data, such as conservation and relative solvent accessibility, nominal data, such as secondary structure, and binary data, such as one-hot encoding of amino acid types, to produce a high-quality latent feature space.
Second, these disorder predictors rely on a diverse collection of network types, including hybrid designs that combine convolutional and bidirectional recurrent topologies. Third, they utilize designs with widely varying network sizes including nine moderately deep, one deep and three very deep networks. Altogether, this analysis reveals that the current designs broadly explore the input and network topology spaces.
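As an illustration of how such heterogeneous inputs can be combined, the sketch below encodes a single residue into one feature vector. The specific features (secondary structure state, conservation, relative solvent accessibility) are generic examples of the numeric, nominal, and binary input types discussed above, not the encoding of any particular predictor.

```python
# Illustrative sketch of per-residue feature encoding for a disorder
# predictor; the chosen features are hypothetical examples of the
# binary, nominal, and numeric input types discussed in the text.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # binary: one-hot of residue type
SS_STATES = "HEC"                     # nominal: helix / strand / coil

def one_hot(symbol, alphabet):
    """Binary indicator vector for a symbol over a fixed alphabet."""
    return [1.0 if a == symbol else 0.0 for a in alphabet]

def encode_residue(aa, sec_struct, conservation, rel_solvent_acc):
    """Concatenate binary, nominal, and numeric inputs into one vector."""
    return (one_hot(aa, AMINO_ACIDS)           # 20 binary features
            + one_hot(sec_struct, SS_STATES)   # 3 nominal features (one-hot)
            + [conservation, rel_solvent_acc]) # 2 numeric features

# Encode a toy residue: alanine in a helix, conservation 0.8, RSA 0.25
vec = encode_residue("A", "H", 0.8, 0.25)
print(len(vec))  # 25 features per residue
```

A full predictor would stack one such vector per residue (often augmented with PSSM columns and physicochemical scales) before feeding the sequence to the network.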
The recently completed CAID experiment reveals that some of the DNN-based solutions provide favorable predictive performance when compared to other types of disorder predictors [52]. This conclusion is perhaps best captured by the following quote: "The SPOT-Disorder2 and flDPnn, followed by RawMSA and AUCpreD, are consistently good. However, flDPnn is at least an order of magnitude faster than its competitors, and it succeeded on all sequences, whereas SPOT-Disorder2 skipped 5% of sequences as a result of a length limitation." [85]. While these four best predictors rely on deep learning, they implement the underlying predictive models using very different designs. More specifically, flDPnn relies on a moderately deep FFNN architecture [86], SPOT-Disorder2 and RawMSA are very deep hybrids of CNN and BRNN [87,88], while AUCpreD utilizes a moderately deep CNN topology [89]. This observation suggests that accurate disorder prediction can be accomplished using different types of deep learners.
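To make the convolutional ingredient of these designs concrete, here is a minimal, framework-free sketch of a 1D convolution sliding over per-residue features. It illustrates the general windowed aggregation used by CNN-based topologies and is not the implementation of AUCpreD or any other named tool.

```python
# Minimal sketch of the sliding-window aggregation at the core of
# CNN-style disorder predictors: each residue's output mixes the
# features of its sequence neighbors within the filter window.
# Illustrative only; real tools stack many such filters and layers.

def conv1d(features, kernel, bias=0.0):
    """1D convolution over a per-residue feature sequence.

    features: list of per-residue feature vectors (length L, dim D)
    kernel:   list of weight vectors, one per window offset (width W, dim D)
    Returns a length-L list of scalars; the sequence is zero-padded so
    every residue, including the termini, gets an output.
    """
    width, dim = len(kernel), len(kernel[0])
    half = width // 2
    pad = [[0.0] * dim] * half
    padded = pad + features + pad
    out = []
    for i in range(len(features)):
        window = padded[i:i + width]
        out.append(sum(w * x
                       for wvec, xvec in zip(kernel, window)
                       for w, x in zip(wvec, xvec)) + bias)
    return out

# Toy example: 5 residues with 2 features each, and a width-3 filter
# that simply averages all features in the window
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [1.0, 0.0]]
avg_kernel = [[1 / 6, 1 / 6]] * 3
scores = conv1d(feats, avg_kernel)
```

A BRNN component would complement this local view with long-range context by scanning the sequence in both directions, which is why several of the tools above combine the two.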
We provide a wider comparison of the predictive performance of deep learners. We cover 11 DNN-based methods, excluding only the two oldest methods, DNdisorder and DeepCNF-D. DNdisorder is not available to the end users (Table 1), while the standalone version of DeepCNF-D requires a specific feature encoding of the sequence that we could not reproduce. We compare the predictive performance of the remaining 11 deep learners using the annotated CAID dataset from https://idpcentral.org/caid/data/1/ and https://idpcentral.org/caid/data/1/reference/disprot-disorder.txt. This dataset includes 652 protein sequences and 337,908 amino acids, with 838 disordered regions and 54,820 disordered residues. For 8 of the 11 predictors that were evaluated in CAID (i.e., AUCpreD [89], AUCpreD-np [89], DisoMine [129], flDPnn [86], RawMSA [88], SPOT-Disorder [113], SPOT-Disorder-Single [115] and SPOT-Disorder2 [87]), we parse their CAID predictions from https://idpcentral.org/caid/data/1/predictions/. We collect results for the other three methods (IDP-Seq2Seq [119], RFPR-IDP [120], and Metapredict [121]) using the webservers and standalone programs provided by the authors.

Table 2. Summary of intrinsic disorder predictors that use deep neural network models. The predictors are sorted chronologically by year of publication. X marks inputs that are used by a given predictor. "*" denotes predictors that are used in Fig. 3.

Table 2 shows that the predictive quality of the deep learners, measured with the area under the ROC curve (AUC), ranges between 0.722 for RFPR-IDP and 0.814 for flDPnn. We further evaluate whether differences in the AUCs of the 11 predictors are robust across different datasets by comparing results across 20 randomly selected disjoint sets of 5% of proteins from the CAID dataset. We assess the significance of differences in AUCs between the best-performing flDPnn and the other methods. We use the t-test if the underlying data are normal; otherwise, we use the Wilcoxon signed-rank test; we test normality with the Anderson-Darling test at the 0.05 significance level. We find that flDPnn and RawMSA are not statistically different (p-value ≥ 0.05) but flDPnn is statistically better than the other 9 methods (p-value < 0.05). We similarly quantify the significance of differences between RFPR-IDP, which has the lowest AUC, and the other 10 predictors. This analysis reveals that SPOT-Disorder, Metapredict, AUCpreD-np and IDP-Seq2Seq produce predictions that are not statistically better than RFPR-IDP (p-value ≥ 0.05). The remaining 4 predictors, AUCpreD, SPOT-Disorder-Single, SPOT-Disorder2, and DisoMine, are significantly better than RFPR-IDP (p-value < 0.05) and significantly worse than flDPnn (p-value < 0.05).
Correspondingly, we identify 3 groups of the DNN-based predictors: 1) flDPnn and RawMSA, which secure the best results (AUC > 0.78); 2) AUCpreD, SPOT-Disorder-Single, SPOT-Disorder2, and DisoMine, which obtain the second-best performance (0.755 < AUC < 0.78); and 3) RFPR-IDP, SPOT-Disorder, Metapredict, AUCpreD-np and IDP-Seq2Seq, which provide more modest levels of predictive quality (0.720 < AUC < 0.755).
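For readers who wish to reproduce this kind of comparison, the sketch below implements a simplified paired Wilcoxon signed-rank test (normal approximation, no tie-variance correction) over hypothetical per-subset AUCs. In practice one would use a library routine such as scipy.stats.wilcoxon, and the normality check (Anderson-Darling) with its t-test branch, omitted here, would precede this step.

```python
import math

def wilcoxon_signed_rank(a, b):
    """Simplified paired Wilcoxon signed-rank test for two predictors'
    scores over the same subsets (e.g., per-subset AUCs).

    Uses the normal approximation without tie-variance correction;
    a sketch for illustration, not a replacement for scipy.stats.wilcoxon.
    Returns (W+ statistic, two-sided p-value)."""
    diffs = [x - y for x, y in zip(a, b) if x != y]  # drop zero differences
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    ranked = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:  # assign average ranks to ties in |difference|
        j = i
        while j + 1 < n and abs(diffs[ranked[j + 1]]) == abs(diffs[ranked[i]]):
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[ranked[k]] = avg
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return w_plus, p

# Hypothetical per-subset AUCs for two predictors over 20 subsets,
# where predictor A is consistently 0.02 better than predictor B
auc_b = [0.70 + 0.01 * i for i in range(20)]
auc_a = [x + 0.02 for x in auc_b]
w, p = wilcoxon_signed_rank(auc_a, auc_b)
```

Because predictor A wins on every subset in this toy example, W+ takes its maximum value and the p-value falls well below 0.05.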
We also analyze the average per-protein runtime of the predictors from Table 2. As with the analysis of predictive performance, we could not perform this analysis for DNdisorder and DeepCNF-D, which do not provide working implementations. We extract the runtime data from the CAID results for the eight methods that participated in this experiment [52], and we estimate it for the other three methods (IDP-Seq2Seq, RFPR-IDP and Metapredict) based on the implementations provided by the authors. We find that the runtime of the 11 predictors varies widely (Table 2), with the fastest predictors producing results in several seconds and the slowest requiring over 10 min for the same task.
Using the above analysis, Fig. 2 compares the 11 available predictors based on three key characteristics: predictive performance quantified with AUC, speed measured with runtime, and mode of availability. We score each characteristic in the 0 to 2 range, where a higher number is associated with a darker shade and indicates better quality, i.e., higher AUC, lower runtime and more ways to access a given predictor. The most well-rounded predictors include flDPnn (total score of 6), SPOT-Disorder-Single (score of 5), DisoMine (score of 4) and Metapredict (score of 4). When analyzing individual dimensions, the fastest methods (i.e., per-protein runtime < 1 min) include AUCpreD-np, SPOT-Disorder-Single, DisoMine, flDPnn, RFPR-IDP and Metapredict. The most accurate methods are flDPnn and RawMSA, and the methods that are available in both modes (webserver and standalone) include SPOT-Disorder, SPOT-Disorder-Single, SPOT-Disorder2, flDPnn and Metapredict.
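The three-characteristic scoring scheme can be sketched as follows. Note that the AUC component takes the 0/1/2 group label produced by the statistical grouping described in the text (it is not derived from fixed AUC cutoffs), and the example values are those reported above for flDPnn.

```python
# Sketch of the 0-2 scoring of the three characteristics used in the
# comparison: AUC group, per-protein runtime, and availability modes.

def runtime_score(minutes):
    """2 if < 1 min, 1 if between 1 and 10 min, 0 if >= 10 min."""
    return 2 if minutes < 1 else (1 if minutes < 10 else 0)

def availability_score(standalone, webserver):
    """Count the available access modes (0, 1 or 2)."""
    return int(standalone) + int(webserver)

def total_score(auc_group, minutes, standalone, webserver):
    """auc_group: 0/1/2 label from the statistical grouping of AUCs."""
    return (auc_group + runtime_score(minutes)
            + availability_score(standalone, webserver))

# Example with the values reported in the text for flDPnn: best AUC
# group (2), per-protein runtime under a minute, both SP and WS available
print(total_score(2, 0.5, True, True))  # -> 6
```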

Deep learning methods outperform other predictors of intrinsic disorder
Motivated by the finding that the top-performing predictors in CAID are deep learners [52,85], we investigate whether this result extends more broadly to other DNN-based methods. More specifically, we compare the results for the 11 available deep learning-based disorder predictors from Table 2 against the results of other types of methods that we collect using the same CAID data. This analysis covers a comprehensive set of 29 disorder predictors, including the 11 deep learners that are annotated with * in Table 2 and 18 methods that use other types of models. The latter group includes 12 machine learning predictors (DisEMBL-465 [36], DisEMBL-HL [36], DISOPRED3 [75], DisPredict2 [66], Espritz-D [130], Espritz-N [130], Espritz-X [130], flDPlr [86], PONDR VSL2B [131], PreDisorder [74], RONN [132], and s2D-2 [110]); 5 sequence scoring function-based methods (FoldUnfold [133], IsUnstruct [134], IUpred2A-long [114], IUpred2A-short [114], and pyHCA [135]); and one meta-predictor (MobiDB-lite [78]). We mark these methods with * in Table 1, except for DisEMBL-465, DisEMBL-HL, JRONN, FoldUnfold, PONDR VSL2B, PreDisorder, IsUnstruct, Espritz-D, Espritz-N, and Espritz-X, which were published before 2013. We quantify the predictive performance using four popular metrics that are consistent with the measures used in the most recent community assessments [52,58]: AUC, area under the precision-recall curve (AUPR), F1, and the Matthews correlation coefficient (MCC). Finally, we quantify the statistical significance of differences in the predictive performance between the results of the 11 deep learners and the 18 other methods. We test normality of the measured scores with the Anderson-Darling test and apply Student's t-test for normal data and the Wilcoxon test otherwise.

Fig. 2 legend: The AUC values are categorized into three groups using a statistical test that measures the robustness of differences between predictors over different protein sets; details are described in the text. Methods with AUCs that are not statistically different (p-value ≥ 0.05) from the best-performing flDPnn (worst-performing RFPR-IDP) are labeled with 2 (0), while the remaining predictors are labeled with 1. The runtime is divided into three ranges: < 1 min (score of 2); between 1 and 10 min (score of 1); and ≥ 10 min (score of 0). The availability score counts the number of access modes, where 2 means that both the SP (standalone program) and the WS (web server) are available and 1 means that either the SP or the WS is available.
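For concreteness, the sketch below computes three of the four metrics named above (AUC, F1, and MCC; AUPR is omitted for brevity) at the residue level from scratch. In practice these are computed with standard library routines such as scikit-learn's roc_auc_score, f1_score, and matthews_corrcoef.

```python
import math

# Pure-Python sketch of residue-level evaluation metrics for disorder
# prediction: labels are 1 (disordered) / 0 (ordered), scores are the
# predictor's per-residue propensities, and predicted are binarized calls.

def roc_auc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney) formulation, with ties in
    the scores handled by average ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    rank_sum = sum(r for y, r in zip(labels, ranks) if y == 1)
    return (rank_sum - pos * (pos + 1) / 2) / (pos * neg)

def f1_and_mcc(labels, predicted):
    """F1 and Matthews correlation from binarized predictions."""
    tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, predicted) if y == 0 and p == 0)
    f1 = 2 * tp / (2 * tp + fp + fn)
    mcc = ((tp * tn - fp * fn)
           / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return f1, mcc
```

AUC is threshold-free, whereas F1 and MCC require binarized predictions; MCC is generally preferred on disorder data because it accounts for all four confusion-matrix cells under the strong class imbalance typical of these datasets.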

Summary and outlook
Disorder prediction is an active and well-established research area with over 40 years of history. The first DNN-based disorder predictor was published in 2013 and 12 more deep learners have been published since. We find that the majority of the disorder predictors developed in the last three years utilize deep neural networks. The popularity of this design is motivated by several factors. First, these models can be molded into many different architectures that flexibly accommodate diverse types of inputs. Our analysis of the 13 DNN-based disorder predictors reveals that they rely on very diverse designs that explore different inputs, topologies and sizes. Second, our empirical results reveal that the DNN-based predictors are in general statistically better when directly compared against a representative collection of the other types of predictive models. This conclusion is in line with the results of the recent CAID experiment, where the top four predictors are deep learners [52,85]. Third, our multifaceted comparison of the deep learners provides useful clues for end users by identifying methods that are accurate, fast and widely available. We identify several well-rounded predictors: flDPnn (very accurate, very fast, and available in multiple ways), SPOT-Disorder-Single (accurate, very fast, and available in multiple ways), DisoMine (accurate and very fast) and Metapredict (very fast and available in multiple ways). These results support the conclusion of a recent article that "deep-learning-based methods will likely continue to show the greatest potential for future improvement" [85].
Our analysis finds that the architectures of the current deep learners are considerably diverse. This suggests that the optimal architecture is yet to be identified. We reason that this should be a hybrid design that accommodates the underlying variety of different types/flavors of disorder [136][137][138]. For instance, IDRs cover a wide spectrum of sizes, from short regions that are frequently localized at the sequence termini to very long regions that span the entire protein sequence [139,140]. IDRs also vary in their conformational space, which is signified by their classification into native coils, native pre-molten globules and native molten globules [4,141]. Moreover, IDRs carry out many different functions, and some of them are multifunctional (moonlighting) [142,143], which results in many different biases in their sequences [4,137]. Interestingly, the design of the recently published and well-rounded flDPnn suggests that predictive quality can be improved by innovating the inputs that are fed into the deep networks [86]. The authors point to multiple options, including the development of extended sequence profiles that cover relevant sequence-derived protein characteristics beyond the commonly used inputs listed in Table 2, and the construction of aggregate features that quantify sequence bias at the region or whole-sequence level. These two future directions go hand in hand, given that hybrid deep learners are inherently capable of handling diverse and large inputs.
While most of the recently released predictors of intrinsic disorder utilize DNNs, this is not necessarily the case for the methods that predict binding IDRs. There are close to 20 predictors of disordered protein-binding regions [144] and several methods that predict IDRs that interact with nucleic acids and lipids [42,145]. Examples of recently published tools include FLIPPER [146], SPOT-MoRF [147], OPAL+ [148], DisoLipPred [149] and DeepDISObind [150]. The CAID experiment evaluated close to a dozen of these predictors and concluded that "disordered binding regions remain hard to predict" [52], motivating further efforts in this area. One potential reason for the low predictive performance of these tools is the relatively low utilization of deep learning architectures. We identify only a handful of DNN-based predictors of binding IDRs, including SPOT-MoRF [147], MoRFPred_en [151], en_DCNNMoRF [152], DeepDISObind [150], and DisoLipPred [149]. A similar situation holds for the prediction of disordered linker regions, where neither of the two currently available methods, DFLpred [153] and APOD [154], applies deep learning, and their predictive performance is relatively limited. Given the success of DNNs in disorder prediction, we believe that this technology could be successfully applied to strengthen the quality of predictors of binding IDRs and disordered linkers.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Fig. 3. Comparison of predictive performance between disorder predictors that utilize deep neural networks (in red) and the other disorder predictors (in blue). The predictive performance is quantified with AUC, AUPR, F1 and MCC. Results of individual predictors are denoted by dots. Distributions of these values are summarized with box plots. *** means that the predictive performance of the deep learners is significantly higher than the performance of the other methods (p-value < 0.05).