Connected Component Analysis of Dynamical Perturbation Contact Networks

Describing protein dynamical networks through amino acid contacts is a powerful way to analyze complex biomolecular systems. However, due to the size of the systems, identifying the relevant features of protein-weighted graphs can be a difficult task. To address this issue, we present the connected component analysis (CCA) approach that allows for fast, robust, and unbiased analysis of dynamical perturbation contact networks (DPCNs). We first illustrate the CCA method as applied to a prototypical allosteric enzyme, the imidazoleglycerol phosphate synthase (IGPS) enzyme from Thermotoga maritima bacteria. This approach was shown to outperform the clustering methods applied to DPCNs, which could not capture the propagation of the allosteric signal within the protein graph. On the other hand, CCA reduced the DPCN size, providing connected components that nicely describe the allosteric propagation of the signal from the effector to the active sites of the protein. By applying the CCA to the IGPS enzyme in different conditions, i.e., at high temperature and from another organism (yeast IGPS), and to a different enzyme, i.e., a protein kinase, we demonstrated how CCA of DPCNs is an effective and transferable tool that facilitates the analysis of protein-weighted networks.


IGPS allosteric pathways
The secondary structures mainly involved in the IGPS allosteric pathways, which start at the effector binding site (Figure S1), include elements at the sideR of HisF (loop1, fα1, fα2, fα3, fβ1, fβ2, and both the fα2-fβ3 and fβ8-fα8' turns) and of HisH (hα1 and hα4). These elements propagate the signal through the Ω-loop towards the active site, where the so-called oxyanion strand (i.e. the conserved 49-PGVG sequence) undergoes a significant conformational change assisted by the partial unfolding of the hα2 helix.

Birch Clustering
To cluster the weighted network, we represented the list of absolute edge weights in the DPCN as a one-dimensional vector. On this vector we applied the scikit-learn [1] implementation of Birch clustering [2], with a threshold of 0.5 and a branching factor of 50. To avoid an arbitrary choice of the number of clusters, we generally did not perform the final clustering step, except where mentioned. For the other clustering methods used for comparison with Birch, we employed the standard hyperparameters of scikit-learn version
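The Birch setup described above can be sketched as follows. This is a minimal illustration, not the authors' script: the edge weights are made-up values, and the variable names are hypothetical. It assumes that "not performing the final clustering step" corresponds to passing `n_clusters=None` to scikit-learn's `Birch`.

```python
import numpy as np
from sklearn.cluster import Birch

# Made-up signed edge weights, for illustration only (not IGPS data).
edge_weights = np.array(
    [12.4, -9.8, 9.5, -9.1, 7.2, 7.0, -6.9, 6.8, 3.1, -2.9, 2.8, 1.0, -0.9]
)

# One-dimensional vector of absolute weights, shaped (n_samples, 1)
# because scikit-learn estimators expect a 2D feature array.
X = np.abs(edge_weights).reshape(-1, 1)

# threshold and branching_factor as quoted in the text;
# n_clusters=None skips the final global clustering step (assumption).
birch = Birch(threshold=0.5, branching_factor=50, n_clusters=None)
labels = birch.fit_predict(X)

# Number of subclusters found without fixing it in advance.
n_groups = len(np.unique(labels))
```

The label values returned by `Birch` carry no intrinsic order, so the clusters still have to be ranked by weight before they can be read as groups of relevance.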

Clustering and groups of relevance
An analysis of the decay of the edge count as a function of contact weight, shown in Fig. 1C of the main text, reveals that selecting a fixed number of edges ranked by weight, such as the top fifty chosen above, is arbitrary and debatable, because it does not necessarily preserve groups of edges that have very similar weights. In other words, the edges ranked just below position 50, e.g. 51-55, might have contact weights very similar to those of the last edges in the top-fifty list and be relevant spots in the graph. On the other hand, choosing a general weight threshold is also arbitrary and, moreover, does not guarantee a human-friendly selection of edges. For instance, a threshold weight of 5.50 in the DPCN graph of IGPS would leave 75 edges in the network, increasing the number of edges by 50% (with respect to the top fifty) despite the small change in weight threshold (with respect to 6.38).
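The two selection rules discussed above can be compared with a short sketch. The weights below are made up for illustration and the helper names `top_k` and `above_cutoff` are hypothetical. The sketch shows how a fixed edge count can exclude edges whose weights are almost identical to the last one kept, and how a slightly lower cutoff can admit many more edges. For the IGPS numbers quoted in the text, the increase is (75 - 50) / 50 = 50%.

```python
def top_k(weights, k):
    """Return the k largest absolute weights in decreasing order."""
    return sorted((abs(w) for w in weights), reverse=True)[:k]


def above_cutoff(weights, cutoff):
    """Return all absolute weights >= cutoff in decreasing order."""
    return sorted((abs(w) for w in weights if abs(w) >= cutoff), reverse=True)


# Made-up weights for illustration only (not taken from the IGPS DPCN).
weights = [9.0, -7.1, 6.5, 6.4, -6.38, 6.37, 6.36, 5.6, -5.5, 2.0]

kept = top_k(weights, 5)
implicit_cutoff = kept[-1]  # weight of the last edge kept by the top-k rule

# Edges excluded by the top-k rule although they lie within 0.05 of the cutoff.
near_misses = [
    w for w in above_cutoff(weights, implicit_cutoff - 0.05)
    if w < implicit_cutoff
]

# Lowering the cutoff slightly admits several additional edges at once.
n_at_lower_cutoff = len(above_cutoff(weights, 5.5))
```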
In order to identify groups of edges with similar contact weights, named groups of relevance, we employed clustering methods, which provide a set of edge groups that can be ranked by their relevance in the weighted network, i.e. by the amount of contact perturbation that they incorporate. We tested several clustering techniques on the DPCN graph of IGPS, as shown in Figure S4. Among them, Birch clustering proved to be the most suitable technique for the DPCN, as explained below. Figure S3 shows the portion of the DPCN graph associated with the edges in the top-four groups of relevance, used as an example (with other cases reported in Figure ??). The group with the highest relevance, i.e. the largest contact perturbation weight, contains a single pair, the hE56-hR59 contact, which is involved in the unfolding of the hα2 helix of HisH. The second group of relevance involves three edges: fA224-fF227, hR116-hD159 and fR249-hY136. The first pair has been attributed to the displacement of a hydrogen bond near the effector binding site [3], marking the beginning of the allosteric pathways, while the other two interactions are related to the breathing motion either directly, with fR249 being part of the so-called molecular hinge of the HisF/HisH interface [4], or indirectly, with hR116-hD159 being a contact pair in HisH affected by the interface motion. The third group of relevance contains four edges: fD11-fK19, fE67-hR18, hS115-hD159 and fK4-fV248. The first two are recognized key elements of the allosteric mechanism [4], the third is directly related to the hR116-hD159 perturbation discussed above, while the last involves the connection between the ion gate (i.e. fK4) and the breathing motion (fV248 is adjacent to fR249 of the molecular hinge). The fourth group contains six edges (including a triad): fK4-fF214, hH120-hH141, fH228-fK19-fR27, hN12-hN15 and hY79-hS197. Among these six edges, the first is associated with the opening of the ion gate (i.e. fK4), as is the second (being adjacent to hM121) [3], while the triad, involving loop1 (fK19 and fR27) in the effector binding site, and the hN12-hN15 pair have been recognized as part of the allosteric signaling mechanism [4]. The last perturbation of this group has not been previously highlighted, since it belongs to a cluster of interactions in HisH not directly linked with other relevant perturbations.

In other words, one can cluster the edges, which are weighted by the number of contact perturbations, according to their weights, so that edges belonging to the same cluster have similar weights. These clusters can then be ordered from the one featuring the largest weights to the one with the smallest weights. Since the clusters are ranked by weights that reflect the contact perturbations, they represent a list of groups of edges ordered by their relevance in the perturbation contact network, hence the name "groups of relevance". The ordered groups of relevance are thus simply the clusters of edges present in the DPCN, sorted by relevance, and their selection and interpretation do not require any prior knowledge of the protein system under investigation.
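The ranking of clusters into groups of relevance can be sketched as below. The function name `groups_of_relevance`, the weights and the cluster labels are all hypothetical. Ranking each cluster by its largest weight is an assumption of this sketch. For non-overlapping clusters of a one-dimensional vector, this gives the same order as ranking by mean weight.

```python
from collections import defaultdict


def groups_of_relevance(weights, labels):
    """Group edge indices by cluster label and rank the groups by weight.

    Returns a list of groups. Each group is a list of (edge_index, |weight|)
    tuples sorted by decreasing weight, and the groups themselves are sorted
    by their largest weight, so the first group is the most relevant one.
    """
    clusters = defaultdict(list)
    for index, (weight, label) in enumerate(zip(weights, labels)):
        clusters[label].append((index, abs(weight)))
    groups = [
        sorted(members, key=lambda m: m[1], reverse=True)
        for members in clusters.values()
    ]
    groups.sort(key=lambda members: members[0][1], reverse=True)
    return groups


# Made-up weights and cluster labels (label values carry no intrinsic order).
weights = [3.1, -9.8, 12.4, 9.5, 2.9]
labels = [2, 0, 1, 0, 2]

ranked = groups_of_relevance(weights, labels)

# Edges belonging to the two most relevant groups, taken as whole groups.
top_two_edges = [index for group in ranked[:2] for index, _ in group]
```

Selecting whole groups in this way avoids cutting a cluster at an arbitrary rank, which is the issue raised for fixed edge counts.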

Birch clustering offers several key advantages for our analysis. First, it is a hierarchical clustering method, allowing for versatile exploration of the data. By producing a set of clusters without fixing their number in advance, it offers flexibility in the subsequent analysis and interpretation of the results, while ensuring that the edges grouped together are of similar relevance. Moreover, Birch clustering was specifically designed to handle large-scale data efficiently and is robust against noise. This scalability and noise tolerance further support its applicability to larger systems that are prone to thermal fluctuations.
In Birch clustering, the first output is a set of clusters produced without fixing their number; a re-clustering with a predefined number of clusters can subsequently be performed. Here, we conducted the analysis using the first output of Birch clustering (re-clustering with 2-3 clusters is also reported in Figure S5 for completeness). Figure S3 shows the decay of the edge count as a function of contact weight (previously shown in Fig. 1C) combined with the Birch clustering results. When a given number of edges is set as cutoff, which corresponds to selecting a given threshold weight as discussed above (i.e. 50 edges and a threshold of 6.38), one of the groups of relevance obtained by clustering is likely to be cut at an arbitrary point. In contrast, one can exploit the clustering output and analyze the clusters in decreasing order of relevance, stopping at the desired number of clusters.
Overall, the most interesting outcome of the clustering procedure is the appearance of a connection between the residues directly bound to the effector and loop1 (involving residues fA224, fF227, fH228, fK19, fD11 and fR27), which was not clearly observed in previous analyses of IGPS allostery. However, this connection is spread across three different groups of relevance. Indeed, as expected, this analysis does not provide insights into the local propagation of contact perturbations and, thus, straightforward information on the allosteric signaling mechanism. In fact, increasing the number of groups of relevance (e.g. beyond the four groups considered so far) will not necessarily provide more information on the local propagation of perturbations, while it complicates the analysis because of the rapid increase in the total number of edges to be considered (see Figure S3).

Detailed CCA analysis for bacterial IGPS
Two of the largest components in the final CC structure relate to the alteration of motion in loop1, representing the first contact perturbations upon effector binding (i.e., components 1 and 9). [4,5] Then, the contact perturbations reach the HisF/HisH interface, with two CCs (namely 4 and 8) being associated with the rearrangement of contacts between surface-exposed residues, involving known alterations of the salt bridges connecting the fα2, fα3, hα1 and hα4 helices. [3,4] These interface [...] carboxamide group, and fD176 with the hydroxyl group at the glycerol side) and reach loop1 (in particular residues fN22 and fF23) through the fS144-fF23 and fD176-fN22 contacts. These contacts increase upon effector binding, in line with the closing of loop1 once PRFAR enters its binding site. Notably, the CCA indicates that almost all residues in loop1 are interconnected within the CC with the largest size and diameter (i.e. component 1), and that these perturbations add up to those in the fβ8-fα8 turn (fA224), reaching the fα1 helix (see Figure S6A). At the HisF/HisH interface and in the HisH active site, the CCA captures quite effectively the most relevant allosteric effects, including the critical fE67-hR18 salt bridge, which involves both the fα2 and fα1 helices, and the hN15-hN12 contact for the fα3-Ω-loop interaction. Interestingly, these known alterations in the active sites propagate, as expected, from the Ω-loop to the 49-PGVG sequence (see the hP10-hG50 contact), but are also directly connected to the hE56-hR59 salt bridge in the hα2 helix, which features the largest weight (i.e. contact perturbation) in the whole DPCN graph. The weakening of the hP10-hG50 interaction is clearly associated with the recognized allosteric event of the hP10-hV51 hydrogen-bond breaking (causing the oxyanion strand flip), and the CCA indicates that it is directly related to the breaking of the hE56-hR59 interaction. This suggests that point mutations of the hE56 and/or hR59 residues might significantly alter the activity of the

Figure S1: Allosteric mechanism of IGPS from Thermotoga maritima. The substrate (glutamine) is positioned in the active site and represented in red. The effector (PRFAR) is positioned in the effector site and represented in cyan. HisF is in yellow and HisH in green. Key secondary structure elements are represented and labeled in pink.

Figure S2: Clustering techniques, with different colors indicating the cluster to which each data point belongs. For algorithms that require a number of clusters, we chose 2 clusters in each case, and for algorithms that require a number of neighbors, we chose 10 neighbors in each case. The running time of each algorithm is indicated in the upper right part of each panel.

Figure S3: A) Birch clustering of the DPCN graph of IGPS, overlaid on the plot of the decay of the number of edges as a function of contact weight. Each dot color corresponds to a different cluster, i.e. to a group of relevance (Gr). The inset zooms in on the region of the first seven groups of relevance. B) 3D representation of IGPS including the DPCN graph selection based on the top-four groups of relevance, following the same color scheme as panel A.

Figure S4: Top-5 to 7 groups of relevance using Birch clustering with no final clustering step (n clusters = 11)

Figure S5: (Top) Birch clustering with 2 clusters, displaying the top cluster on the protein. (Bottom) Birch clustering with 3 clusters, displaying the top-one and top-two clusters on the protein.

Figure S6: Detailed view of the DPCN of IGPS focusing on the CCs nearby the effector (A) and active (B) sites, highlighting the main amino acid residues (black labels) and secondary structures (in pink) involved in the allosteric signal propagation.

Figure S7: Scatter plot of the size (number of edges) and order (number of nodes) of each component against its vanishing point. Larger vanishing points tend to yield larger components, although the correlation is not complete.

Figure S8: Distribution of the diameters of the final components. 22 components have a diameter of 1 (thus consisting of a single edge), while 5 have a diameter of 2 (trivial examples of propagation). The nine major components have a diameter greater than or equal to 3.
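The component descriptors used in Figures S7 and S8 (size as number of edges, order as number of nodes, and diameter) can be computed with a short breadth-first search. The sketch below is a generic illustration on a made-up edge list, not the authors' CCA implementation. The function names are hypothetical, the graph is assumed to be simple (no self-loops), and the diameter is taken as the longest shortest path counted in edges. The vanishing point is not computed here, since it is not defined in this section.

```python
from collections import deque


def build_adjacency(edges):
    """Build an undirected adjacency map from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def bfs_distances(adj, source):
    """Return hop distances from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def component_stats(edges):
    """Return order, size and diameter of each connected component."""
    adj = build_adjacency(edges)
    seen, stats = set(), []
    for start in adj:
        if start in seen:
            continue
        component = set(bfs_distances(adj, start))
        seen |= component
        # Each edge is counted twice in the sum of degrees.
        size = sum(len(adj[node]) for node in component) // 2
        # Diameter: largest eccentricity over all nodes of the component.
        diameter = max(
            max(bfs_distances(adj, node).values()) for node in component
        )
        stats.append(
            {"order": len(component), "size": size, "diameter": diameter}
        )
    return sorted(stats, key=lambda s: s["size"], reverse=True)


# Made-up edge list with three components (a chain of 4, a chain of 3, a pair).
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("x", "y"), ("p", "q"), ("q", "r")]
stats = component_stats(edges)
```

In this toy example the single-edge component has a diameter of 1 and the three-node chain has a diameter of 2, which corresponds to the two trivial cases distinguished in Figure S8.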