Explaining Conformational Diversity in Protein Families through Molecular Motions

Proteins play a central role in biological processes, and understanding their conformational variability is crucial for unraveling their functional mechanisms. Recent advances in high-throughput technologies have enhanced our knowledge of protein structures, yet predicting their multiple conformational states and motions remains challenging. This study introduces Dimensionality Analysis for protein Conformational Exploration (DANCE), a method for the systematic and comprehensive description of the conformational variability of protein families. DANCE accommodates both experimental and predicted structures, and it is suitable for analysing anything from single proteins to superfamilies. Using it, we clustered all experimentally resolved protein structures available in the Protein Data Bank into conformational collections and characterized them as sets of linear motions. The resulting resource facilitates access to, and exploitation of, the multiple states adopted by a protein and its homologs. Beyond descriptive analysis, we assessed classical dimensionality reduction techniques for sampling unseen states on a representative benchmark. This work improves our understanding of how proteins deform to perform their functions and paves the way for a standardised evaluation of methods designed to sample and generate protein conformations.


Supplemental tables and figures
Supplemental Table S1: We reconstructed each conformation using the principal components computed from the set of conformations not belonging to the same cluster.
Global properties of the ensembles and their sequence alignments. We report values computed across eight versions of the database, corresponding to eight combinations of sequence similarity and coverage thresholds. These combinations are given on the x-axis. A. Number of singletons, pairs, and ensembles with at least 3 members. B. Distribution of sequence identity, measured as a normalised sum-of-pairs score with null mismatch and gap penalties. C. Distribution of coverage, expressed as the fraction of positions with less than 80% gaps. D. Distribution of global alignment quality, computed as a normalised sum-of-pairs score with the following parameters: σ_match = 1, σ_mismatch = σ_gap = -0.5 (see Methods).
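The alignment statistics named in this caption can be illustrated with a minimal Python sketch. The exact normalisation is given in the Methods, which are not reproduced here, so the choices below are assumptions: the raw sum-of-pairs score is divided by the number of scored residue pairs, and positions where both sequences carry a gap are skipped. The function names `sum_of_pairs` and `coverage` are hypothetical.

```python
from itertools import combinations

GAP = "-"


def sum_of_pairs(alignment, match=1.0, mismatch=-0.5, gap=-0.5):
    """Normalised sum-of-pairs score of a multiple sequence alignment.

    Assumptions (not taken from the source): gap-gap pairs are ignored and
    the raw score is divided by the number of scored pairs.
    """
    total, n_pairs = 0.0, 0
    for seq_a, seq_b in combinations(alignment, 2):
        for a, b in zip(seq_a, seq_b):
            if a == GAP and b == GAP:
                continue  # uninformative position for this pair
            if a == GAP or b == GAP:
                total += gap
            elif a == b:
                total += match
            else:
                total += mismatch
            n_pairs += 1
    return total / n_pairs if n_pairs else 0.0


def coverage(alignment, max_gap_fraction=0.8):
    """Fraction of alignment columns with less than max_gap_fraction gaps."""
    n_seqs = len(alignment)
    columns = list(zip(*alignment))
    kept = sum(1 for col in columns if col.count(GAP) / n_seqs < max_gap_fraction)
    return kept / len(columns)


if __name__ == "__main__":
    toy = ["ACD-", "ACE-", "AC--"]
    print("alignment quality:", sum_of_pairs(toy))
    print("identity:", sum_of_pairs(toy, mismatch=0.0, gap=0.0))
    print("coverage:", coverage(toy))
```

Setting `mismatch` and `gap` to zero gives the identity-like score of panel B, while the default values correspond to the alignment-quality parameters of panel D.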
Influence of ensemble size on motion complexity. We report motion complexity, measured as the number of principal components or modes required to explain 80% of the positional variance, as a function of the ensemble size, i.e. the number of conformations. A-B. Scatterplots in log scale. C-D. Discretized heatmaps. We consider the most stringent setup, namely l 80 80 (A,C), and the most relaxed one, namely l 30 50 (B,D).

Expansion of three conformational ensembles upon relaxing sequence selection criteria. We compare the sets of conformations detected at two different levels of sequence similarity and coverage, namely l 80 80 (on the left) and l 30 50 (on the right). For the latter, we show separately the conformations already included in the ensemble at l 80 80 (on the left) and the new additional conformations (on the right). The number of conformations in each (sub)ensemble is given on top. The color code indicates the position in the sequence, from the N-terminus in blue to the C-terminus in red. The flavodoxin (FLAV) ensemble contains one partially unfolded conformation, highlighted with the arrows. Some properties of these three examples are reported in Figure 2.
Evolution of motion complexity upon protein family expansion. A. Number of ensembles where motion complexity increases, remains the same, or decreases between the most stringent and the most relaxed setups. We extracted the motions from either the covariance (in black) or the correlation (in grey) matrix. B. Comparison of motion complexity estimated from the correlation matrix in the most stringent setup (x-axis) versus the most relaxed one (y-axis).
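Motion complexity, as used in these captions, is the number of principal components needed to explain 80% of the positional variance. A minimal NumPy sketch follows. It assumes that the conformations are already superimposed and share the same atoms, and that the correlation-based variant amounts to standardising each coordinate before the decomposition. The function name `motion_complexity` and the convention of returning 0 for an ensemble without any variance are hypothetical choices, not the authors' implementation.

```python
import numpy as np


def motion_complexity(coords, threshold=0.8, use_correlation=False):
    """Number of principal components explaining `threshold` of the variance.

    coords: array of shape (n_conformations, n_atoms, 3), assumed to be
    superimposed beforehand (an assumption of this sketch).
    """
    n_conf = coords.shape[0]
    X = coords.reshape(n_conf, -1)
    X = X - X.mean(axis=0)  # centre each coordinate
    if use_correlation:
        std = X.std(axis=0)
        std[std == 0] = 1.0  # leave invariant coordinates untouched
        X = X / std  # PCA on standardised data = correlation matrix
    # Squared singular values are proportional to the PCA eigenvalues.
    variances = np.linalg.svd(X, compute_uv=False) ** 2
    total = variances.sum()
    if total == 0:
        return 0  # no variability at all (hypothetical convention)
    explained = np.cumsum(variances) / total
    # First index whose cumulative explained variance reaches the threshold.
    return int(np.searchsorted(explained, threshold) + 1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.normal(size=(10, 3))
    direction = rng.normal(size=(10, 3))
    ensemble = np.stack([base + t * direction for t in (-1.0, 0.0, 2.0)])
    print(motion_complexity(ensemble))
    print(motion_complexity(ensemble, use_correlation=True))
```

An ensemble whose conformations differ only along one displacement direction yields a complexity of 1 with either matrix, which matches the remark that pairs of conformations trivially exhibit single-mode motions.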
Systematic exploration of the two hyperparameters for kPCA-based conformation reconstruction. We illustrate the influence of the hyperparameters σ and α on the reconstruction error (in Å) for a randomly picked conformation (the 4th one) from the ADK protein ensemble. The red star highlights the optimal parameter values. We used the RBF kernel.

Supplemental Figure S7: Distributions of the RMSD reconstruction errors (in Å) for each ensemble in the benchmark set. We systematically reconstructed each conformation through a leave-one-cluster-out cross-validation procedure (see Methods). We set the hyperparameters of the kPCA and UMAP to the values yielding the best reconstruction, for each ensemble. The protein names on the x-axis are ordered according to motion complexity.

Reconstruction error as a function of the distance to the training set for kPCA with RBF kernel. The distance is computed between the test conformation and the convex hull defined by the training conformations in the low-dimensional representation space. It is normalised by the number of residues.

PCA feature spaces for three proteins from the benchmark. We show the projections of the conformations in the l-dimensional PCA feature space, where l is the number of principal components needed to explain 90% of the total positional variance, for ADK (A), MurD (B) and ATPase (C). The point shapes indicate the clusters to which the conformations belong, as determined by k-means clustering where k = l + 2. The colors reflect the RMSD reconstruction error (in Å).
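The leave-one-cluster-out procedure mentioned in these captions can be sketched for the linear PCA case as follows. This is an illustrative NumPy sketch under stated assumptions, not the authors' code: cluster labels are supplied by the caller (the captions state they come from k-means), conformations are assumed to be superimposed beforehand, and no re-superposition is applied after reconstruction. The function name `leave_one_cluster_out_rmsd` is hypothetical, and kPCA and UMAP are not covered.

```python
import numpy as np


def leave_one_cluster_out_rmsd(coords, labels, n_components):
    """RMSD reconstruction error of each conformation with linear PCA.

    Each conformation is reconstructed from principal components computed
    on the conformations that do not belong to its cluster.
    coords: (n_conformations, n_atoms, 3); labels: one cluster id per row.
    """
    n_conf, n_atoms, _ = coords.shape
    X = coords.reshape(n_conf, -1)
    labels = np.asarray(labels)
    errors = np.empty(n_conf)
    for cluster in np.unique(labels):
        test = labels == cluster
        mean = X[~test].mean(axis=0)
        # Rows of vt are the principal components of the training set.
        _, _, vt = np.linalg.svd(X[~test] - mean, full_matrices=False)
        pcs = vt[:n_components]
        centered = X[test] - mean
        # Project onto the retained components, then map back.
        reconstructed = centered @ pcs.T @ pcs
        diff = (centered - reconstructed).reshape(-1, n_atoms, 3)
        errors[test] = np.sqrt((diff ** 2).sum(axis=2).mean(axis=1))
    return errors


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    base = rng.normal(size=(8, 3))
    direction = rng.normal(size=(8, 3))
    ensemble = np.stack([base + t * direction for t in range(6)])
    print(leave_one_cluster_out_rmsd(ensemble, [0, 0, 1, 1, 2, 2], 1))
```

When every conformation lies along a single linear motion, one component learned from the other clusters suffices and the errors are numerically zero. A displacement orthogonal to the training motions is not recovered at all, which is the situation this benchmark is designed to probe.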

Table S2: Properties of the ensembles in the most conservative and the most relaxed setups.
a To compute motion properties, we focused on the subset of ensembles with at least three members. Indeed, the variability exhibited by a pair of conformations can be trivially explained by a single mode, so pairs are disregarded.

Supplemental Table S3: Properties

Proportion of conformations reconstructed with high accuracy.
conformational collection (see Supplementary Table S5).

Supplemental Table S5: Average