Data set of intrinsically disordered proteins analysed at a local protein conformation level

Intrinsically Disordered Proteins (IDPs) have become a hot topic since their characterisation in the 1990s. The data presented in this article are related to our research entitled “A structural entropy index to analyse local conformations in Intrinsically Disordered Proteins”, published in the Journal of Structural Biology [1]. In this study, we quantified, for the first time, the continuum from rigidity to flexibility and finally to disorder. Non-disordered regions were also highlighted in the ensembles of disordered proteins. This work was done using the Protein Ensemble Database (PED), which collects series of protein structures considered as IDPs. The data set consists of a collection of cleaned protein files in classical PDB format that can readily be used as input for most automatic analysis software. The accompanying data include the encoding of all structural information in terms of a structural alphabet, namely Protein Blocks (PBs). An entropy index derived from PBs, which captures the continuum from protein rigidity to flexibility to disorder, is included, together with information from secondary structure assignment, solvent accessibility and prediction of disorder from the sequences. The data may be used for further structural bioinformatics studies of IDPs. They can also be used as a benchmark for evaluating disorder prediction methods.

© 2020 The Author(s)

Value of the Data
• Atomic coordinate files in pdb format are processed in a manner suitable for most analysis programs.
• The PB assignment and entropy calculation allow the rigidity-flexibility-disorder state to be defined as done in [1] and are easy to use for further research.
• The secondary structure assignment and solvent accessibility are provided, as they represent the basis for structural analyses.
• Two types of disorder prediction methodologies are provided; all these data can be used as a benchmark for evaluating disorder prediction methods.
• These data were used extensively in our Journal of Structural Biology article [1] and can be useful for researchers interested in the analysis of IDPs and IDRs, as well as for the development of novel prediction approaches. The secondary structure assignment, solvent accessibility and two different disorder predictions provided with the data will further support such work.

Data description
Intrinsically Disordered Proteins (IDPs) and Intrinsically Disordered Regions (IDRs) make up a non-negligible part of protein structures. IDPs are not ordered and are likely to be unfolded in solution under native functional conditions [2][3][4]. They do not have a well-defined 3-D structure but adopt an ensemble of conformations. In our recent research [1], we analysed the Protein Ensemble Database (PED3) [5] in the light of a structural alphabet [6]. PED3 collects series of protein structures associated with IDPs. It stores 25,473 protein structures from 60 ensembles in 24 entries.
We provide the entire dataset in four separate folders. The data collected in these folders represent the core of our previous research published in the Journal of Structural Biology [1].
The first folder (1_DATA) consists of the raw data, i.e. the 24 entries with their accompanying ensembles in PDB format. They can be downloaded directly from the PED website, but we have cleaned a few of them for better parsing. Each subdirectory is named PEDxAAy-pdb, where x is a number ranging from 1 to 9 and y is a letter ranging from A to D, e.g. PED1AAD-pdb (β-synuclein).
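The naming convention of the entry subdirectories can be checked programmatically. The following is a minimal sketch, assuming the pattern is exactly as described above (a digit from 1 to 9, then "AA", then a letter from A to D, then the "-pdb" suffix); the function name `parse_entry_dir` is hypothetical and not part of the data set.

```python
import re

# Pattern assumed from the naming convention described in the text:
# "PED", a digit 1-9, "AA", a letter A-D, then the "-pdb" suffix.
ENTRY_DIR = re.compile(r"^PED([1-9])AA([A-D])-pdb$")


def parse_entry_dir(name):
    """Return (entry_id, number, letter) for a directory name, or None if it does not match."""
    match = ENTRY_DIR.match(name)
    if match is None:
        return None
    number, letter = match.groups()
    return f"PED{number}AA{letter}", int(number), letter


print(parse_entry_dir("PED1AAD-pdb"))  # ('PED1AAD', 1, 'D')
```

Names that do not follow the pattern return None, which makes it easy to skip unrelated files when walking the folder.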
DSSP software [10]. DSSP provides the 8-state assignment (α-helix, π-helix, 3₁₀-helix, bend, turn, β-bridge, β-sheet and coil) as well as the solvent accessibility. These two pieces of information are essential for most structural analyses. Each structure is in a file named PEDxAAy-n.dssp, with n corresponding to the model number. DSSP has been the most widely used secondary structure assignment method for over thirty years.
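The secondary structure state and solvent accessibility can be read from a DSSP file with a few lines of code. The sketch below assumes the classic fixed-column DSSP text output (residue number in columns 6-10, chain in column 12, amino acid in column 14, secondary structure in column 17 and accessibility in columns 35-38); the function name `parse_dssp` and the file name in the closing comment are hypothetical.

```python
def parse_dssp(text):
    """Extract (residue number, chain, amino acid, secondary structure, accessibility)
    for each residue of a classic fixed-column DSSP output."""
    records = []
    in_table = False
    for line in text.splitlines():
        # The per-residue table starts after this header line.
        if line.startswith("  #  RESIDUE"):
            in_table = True
            continue
        if not in_table or len(line) < 38:
            continue
        if line[13] == "!":  # chain-break marker, not a residue
            continue
        # A blank secondary structure column means coil; report it as "-".
        ss = line[16] if line[16] != " " else "-"
        records.append((int(line[5:10]), line[11], line[13], ss, int(line[34:38])))
    return records


# Hypothetical usage, with a file name following the PEDxAAy-n.dssp pattern:
# with open("PED1AAD-1.dssp") as handle:
#     residues = parse_dssp(handle.read())
```

Returning plain tuples keeps the sketch free of dependencies; the same columns are also exposed by established parsers if a library is preferred.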
The fourth folder (4_DISORDER) contains the disorder prediction outputs. Two very different methodologies were chosen, namely DisoPred 3.1 [11] and PrDOS [12]. Their results can be quite dissimilar, which underlines the importance of a better description of the disorder states. Each of the 24 entries is shown individually. The DisoPred subdirectory contains files named name.pbat that include prediction values for protein-binding residues in disordered regions as well as for disordered and ordered residues. In addition, it includes in a corresponding csv file

Raw data
The raw data were downloaded from the PED website and represent a large collection of ensembles. PED3 contains 25,473 protein structures from 60 ensembles in 24 entries. Of these entries, 6 have data from both SAXS and NMR, 7 from SAXS only, 10 from NMR only and one from molecular dynamics. Some entries have 10 or fewer models, while 8 have more than 500. The PED4AAB entry, the Sendai virus phosphoprotein ensemble, is the most populated, with 13,718 models. All the models follow the classical PDB format (without most of the remarks). Some residues are incomplete, which could be problematic for subsequent analyses.
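Both the number of models and the presence of incomplete residues can be checked directly on a multi-model PDB file. The following is a minimal sketch, assuming standard PDB fixed columns for ATOM records and taking "incomplete" to mean a residue lacking one of the backbone atoms N, CA, C or O; this criterion and the function names are assumptions made for illustration, not the cleaning procedure applied to the data set.

```python
from collections import defaultdict

BACKBONE = {"N", "CA", "C", "O"}


def count_models(pdb_text):
    """Count MODEL records; a file without any MODEL record is treated as one model."""
    n = sum(1 for line in pdb_text.splitlines() if line.startswith("MODEL"))
    return n if n > 0 else 1


def incomplete_backbone(pdb_text):
    """Return {model number: [(chain, residue number, residue name), ...]}
    for residues that lack at least one backbone atom."""
    atoms = defaultdict(set)  # (model, chain, resseq, resname) -> atom names seen
    model = 1
    for line in pdb_text.splitlines():
        if line.startswith("MODEL"):
            model = int(line.split()[1])
        elif line.startswith("ATOM"):
            key = (model, line[21], int(line[22:26]), line[17:20].strip())
            atoms[key].add(line[12:16].strip())
    missing = defaultdict(list)
    for (mdl, chain, resseq, resname), names in atoms.items():
        if not BACKBONE <= names:  # some backbone atom is absent
            missing[mdl].append((chain, resseq, resname))
    return dict(missing)
```

An empty dictionary from `incomplete_backbone` means every residue seen in ATOM records carries the four backbone atoms.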

Protein blocks
Protein Blocks (PBs) are a structural alphabet composed of 16 local prototypes [7]; they are employed to analyse local conformations. Each PB is characterized by the ϕ, ψ dihedral angles of five consecutive residues. PBs m and d can be roughly described as prototypes for the central α-helix and the central β-strand, respectively. PBs a through c primarily represent the N-cap region of β-strands, while PBs e and f correspond to the C-caps; PBs g through j are specific to coils, PBs k and l correspond to the N-cap region of α-helices, and PBs n through p to their C-caps [6,13]. PB assignment was carried out for every residue from every snapshot extracted from MD simulations using the PBxplore tool [8], available on GitHub (https://github.com/pierrepo/PBxplore). A useful measure to quantify the flexibility of each amino acid, called $N_{eq}$ (for equivalent number of PBs) [7], was used. $N_{eq}$ is a statistical measurement similar to entropy; it represents the average number of PBs a residue may adopt at a given position. $N_{eq}$ is calculated as follows [7]:

$$N_{eq} = \exp\left(-\sum_{x=1}^{16} f_x \ln f_x\right)$$

where $f_x$ is the frequency of PB $x$ at the position of interest. An $N_{eq}$ value of 1 indicates that only one type of PB is observed, while a value of 16 is equivalent to an equal probability for each of the 16 states, i.e. a random distribution. We have also computed average $N_{eq}$ values. PBs have been successfully used for the analysis of molecular dynamics simulations of, e.g., integrins, the Duffy Antigen Chemokine Receptor (DARC) protein, the KiSS1-derived peptide receptor (KISS1R), the HIV-1 capsid protein, α-1,4-glycosidic hydrolase and the NMDA receptor channel gate.
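The equivalent number of PBs can be computed per position from the PB strings of all models of an ensemble, as the exponential of the Shannon entropy of the PB frequencies observed at that position. The sketch below is an illustration under stated assumptions, not the code used in [1]: it takes one PB string per model, all of the same length, and ignores any character outside the 16 letters a to p (for instance a placeholder such as Z for termini where no PB can be assigned). The function name `neq_per_position` is hypothetical.

```python
import math
from collections import Counter

PB_LETTERS = "abcdefghijklmnop"  # the 16 Protein Blocks


def neq_per_position(pb_sequences):
    """Compute Neq at each position from equal-length PB strings (one per model)."""
    length = len(pb_sequences[0])
    if any(len(seq) != length for seq in pb_sequences):
        raise ValueError("all PB sequences must have the same length")
    result = []
    for i in range(length):
        # Count only the 16 PB letters; other characters are treated as unassigned.
        counts = Counter(seq[i] for seq in pb_sequences if seq[i] in PB_LETTERS)
        total = sum(counts.values())
        if total == 0:
            result.append(float("nan"))  # no PB assigned at this position
            continue
        entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
        result.append(math.exp(entropy))
    return result


# Four toy models of a three-residue fragment (made-up PB strings).
models = ["mmd", "mmd", "mkd", "mkf"]
print([round(v, 2) for v in neq_per_position(models)])
```

In this toy example the first position always carries the same PB, so its value is 1; the second position is split evenly between two PBs, giving a value of about 2; the third falls in between.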

Disorder prediction
Two approaches were used, namely DisoPred 3.1 [11] and PrDOS [12]. The first is one of the best-known and most widely used approaches (664 citations in January 2020 as measured by Google Scholar); the second is less well known but also has a large number of citations (463 at the same date). The two rely on very different methods and give slightly different tendencies depending on the entry, making them useful to enrich the analyses.
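How far two predictors diverge on a given entry can be summarised by the fraction of residues on which they make the same call. The following is a minimal sketch, assuming each predictor yields one disorder score per residue and that a residue is called disordered when its score reaches a cut-off of 0.5; the cut-off, the function name `disorder_agreement` and the example scores are assumptions made for illustration, not values taken from the data set.

```python
def disorder_agreement(scores_a, scores_b, threshold=0.5):
    """Fraction of residues on which two per-residue disorder score tracks
    give the same ordered/disordered call at the given threshold."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both score tracks must cover the same residues")
    if not scores_a:
        raise ValueError("score tracks must not be empty")
    calls_a = [s >= threshold for s in scores_a]
    calls_b = [s >= threshold for s in scores_b]
    agree = sum(a == b for a, b in zip(calls_a, calls_b))
    return agree / len(calls_a)


# Made-up scores for a four-residue fragment.
print(disorder_agreement([0.9, 0.2, 0.6, 0.1], [0.8, 0.4, 0.3, 0.05]))  # 0.75
```

A single agreement fraction hides where the disagreements occur, so it is best read alongside the per-residue calls.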