Interpretable Machine Learning of Amino Acid Patterns in Proteins: A Statistical Ensemble Approach

Explainable and interpretable unsupervised machine learning helps one understand the underlying structure of data. We introduce an ensemble analysis of machine learning models to consolidate their interpretation. Its application shows that restricted Boltzmann machines consistently compress into a few bits the information stored in a sequence of five amino acids at the start or end of α-helices or β-sheets. The weights learned by the machines reveal unexpected properties of the amino acids and the secondary structure of proteins: (i) His and Thr have a negligible contribution to the amphiphilic pattern of α-helices; (ii) there is a class of α-helices particularly rich in Ala at their end; (iii) Pro most often occupies slots otherwise occupied by polar or charged amino acids, and its presence at the start of helices is relevant; (iv) Glu and especially Asp on one side, and Val, Leu, Ile, and Phe on the other, display the strongest tendency to mark amphiphilic patterns, i.e., extreme values of an effective hydrophobicity, though they are not the most powerful (non)hydrophobic amino acids.


S1 Details on the algorithms
We provide additional details on the restricted Boltzmann machines (RBMs) and on the clustering procedure.
Our RBMs are trained with well-known optimizations 1,2 and with the "centering trick". 3,4 We build the ensemble of RBMs, with a fixed number of hidden units N_h, by training R different realizations. Each realization is characterized by the RBM's random state, which determines the weight initialization and the split of the data into training (80%) and validation (20%) sets. Thus, we obtain a set of RBMs that differ in their parameter values and, slightly, in the datasets they analyze. Each weight w_ij is initially drawn from a uniform distribution on the interval [−ℓ, +ℓ] with ℓ = 2(N_h + N_v)^{−1/2}. Biases are initialized to zero.
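The initialization described above can be sketched as follows; this is a minimal numpy illustration, not the authors' training code, and the function names (init_rbm, split_data) are our own:

```python
import numpy as np


def init_rbm(n_v, n_h, rng):
    """One RBM realization: weights uniform in [-l, +l] with
    l = 2 * (n_h + n_v)**(-1/2); all biases start at zero."""
    l = 2.0 * (n_h + n_v) ** -0.5
    w = rng.uniform(-l, l, size=(n_v, n_h))  # weights w_ij
    a = np.zeros(n_v)                        # visible biases a_i
    b = np.zeros(n_h)                        # hidden biases b_j
    return w, a, b


def split_data(data, rng, frac_train=0.8):
    """Random 80%/20% train/validation split; the realization's RNG
    fixes both this split and the weight initialization."""
    idx = rng.permutation(len(data))
    n_train = int(frac_train * len(data))
    return data[idx[:n_train]], data[idx[n_train:]]
```

Seeding each realization with a different random state then yields an ensemble of R RBMs that differ in initialization and in the exact train/validation partition.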
The bipartite structure of the RBM allows the information stored in the weights and biases to be arranged in many equivalent ways, due to the invariance under permutation and sign reversal of the hidden units. To overcome this variability when comparing units, we isolate each hidden unit j in an RBM and compare its weights w_ij, and those of its mirror image −w_ij, with those of all other hidden units in the same RBM and in the other RBMs of the ensemble. Thus, for every pair of hidden units j, m, we compute the minimum Euclidean distance between them and their mirror versions,

d_jm = min( ‖w_·j − w_·m‖ , ‖w_·j + w_·m‖ ) .    (S1)

We then feed the distance matrix d_jm to a popular density-based algorithm, DBSCAN, 5,6 to perform the clustering. We focus on tuning two parameters: a radius around each data point (ϵ) and the minimum number of samples (min_s) within ϵ of a data point required to avoid labeling it as noise, i.e., to assign that point to a cluster.
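The sign-reversal-invariant distance of (S1) can be computed as in this short numpy sketch (the function name is ours; W stacks one weight vector per hidden unit, pooled over all RBMs of the ensemble):

```python
import numpy as np


def mirror_distance_matrix(W):
    """Pairwise distances between hidden units, invariant under sign
    reversal of either unit: d_jm = min(||w_j - w_m||, ||w_j + w_m||).
    W has shape (n_units, n_v)."""
    n = len(W)
    d = np.zeros((n, n))
    for j in range(n):
        for m in range(j + 1, n):
            d_plus = np.linalg.norm(W[j] - W[m])   # unit m as-is
            d_minus = np.linalg.norm(W[j] + W[m])  # unit m mirrored
            d[j, m] = d[m, j] = min(d_plus, d_minus)
    return d
```

A unit and its mirror image are thus at distance zero, so clustering on this matrix groups units that encode the same pattern up to an overall sign.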
For tuning ϵ and min_s, we introduce a cost function C(ϵ, min_s) whose minimum corresponds to the optimal parameter values,

C(ϵ, min_s) = N_noise / N_tot + ⟨Ω(g)⟩_{g∈G} / N_tot ,    (S2)

where G := G(ϵ, min_s) denotes the set of groups returned by DBSCAN, Ω(g) denotes the number of hidden units in group g ∈ G, N_noise is the number of hidden units labeled as noise, and N_tot is the total number of hidden units in the ensemble. The first term in (S2) is the fraction of hidden units cataloged as noise; hence it penalizes configurations with high noise. The second term is proportional to the average cluster size ⟨Ω(g)⟩_{g∈G} and penalizes configurations in which all the hidden units merge into a single, giant cluster. In Figure S1 we show the results and the intermediate steps of the parameter tuning at the start of α-helices. For the other cases, the results are similar. The procedure returns an optimal region of parameters: we choose the average point within this region as the optimal parameters. In Table 1 we collect the optimal hyperparameters of the DBSCAN analysis for each dataset. The average RBM is then built by averaging the weights w_ij and biases b_j within each group, after aligning all of its hidden units to minimize the distance from a reference one. For better overall visualization, since the aliphatic amino acids (I, L, V) always yield a well-defined pattern, we adopt the convention that their weights at position γ = 1 are positive. The visible bias a_i is instead independent of the grouping; hence it is averaged over all the original RBMs.
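Given the label array returned by a DBSCAN run (with −1 marking noise, as in common implementations), the cost (S2) reduces to a few lines; this is an illustrative sketch with our own function name:

```python
import numpy as np


def dbscan_cost(labels):
    """Cost C(eps, min_s) of (S2): fraction of noise points plus the
    average cluster size divided by the total number of hidden units.
    labels: one cluster label per hidden unit, -1 meaning noise."""
    labels = np.asarray(labels)
    n_tot = len(labels)
    noise_frac = np.mean(labels == -1)                 # first term
    groups = [g for g in np.unique(labels) if g != -1]
    if groups:
        mean_size = np.mean([(labels == g).sum() for g in groups])
    else:
        mean_size = 0.0
    return noise_frac + mean_size / n_tot              # second term
```

Evaluating this cost on a grid of (ϵ, min_s) pairs and keeping the low-cost region reproduces the tuning procedure illustrated in Figure S1: all-noise labelings and single-giant-cluster labelings both cost 1, while balanced partitions cost less.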
For each hidden unit j representing the average of the units in a group, we analyze the weights w_ij to extract the similarity among amino acids. For this purpose, we split the array of weights w_ij into 20 sub-vectors of length Γ (i.e., the rows of w_ij reshaped into a 20×Γ matrix), each related to a specific amino acid. We then perform a principal component analysis (PCA) 6 on these sub-vectors: the PCA ranks the most relevant, mutually independent linear combinations of coordinates in the Γ-dimensional space.
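The reshape-and-PCA step can be sketched as follows, assuming Γ = 5 positions per amino acid; this is a plain-SVD implementation of PCA with our own naming, not the authors' analysis script:

```python
import numpy as np


def amino_acid_pca(w, gamma):
    """Reshape a hidden unit's weight vector (length 20*gamma) into a
    20 x gamma matrix, one row (sub-vector) per amino acid, and run PCA
    on the 20 rows via SVD of the centered matrix."""
    X = w.reshape(20, gamma)          # row = one amino acid's weights
    Xc = X - X.mean(axis=0)           # center the 20 sub-vectors
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U * s                    # amino-acid projections on the PCs
    explained = s**2 / np.sum(s**2)   # variance fraction per component
    return scores, Vt, explained      # Vt rows = principal directions
```

Amino acids whose rows project similarly onto the leading components behave similarly within that hidden unit's pattern.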

S2 STRIDE vs DSSP
It is important to note that the location of a secondary structure's starting/ending point in a protein may depend on the algorithm chosen to detect it. In the main text, we show the results obtained with DSSP. In this section, for comparison, we show the weights of an ensemble of RBMs trained on a database of α-helices obtained with the algorithm STRIDE 7 (see Figures S2, S3). The observed patterns do not deviate significantly from those shown in the main text for DSSP (see Figures 4 and 5), indicating that the ensemble-RBM method is robust with respect to possible (small) deviations in the assignment of the extremities of secondary structures by different algorithms.

S3 Correlation matrices
Correlations between the occurrences of amino acids (say a, b) at different positions (1 ≤ i < j ≤ Γ) require the computation and parallel visualization of many matrices (one for every i-j pair). This is shown, for example, in Figure S4 for the start of α-helices. Each matrix element is C^{ab}_{ij} = ⟨I(a,i) I(b,j)⟩ − ⟨I(a,i)⟩⟨I(b,j)⟩, where I(a,i) = 1 if the amino acid at position i is a, I(a,i) = 0 otherwise, and ⟨...⟩ denotes the mean over the ensemble.
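A single element of these covariance matrices is computed as in the following sketch (function name and 0-based positions are ours):

```python
import numpy as np


def occurrence_covariance(seqs, a, b, i, j):
    """C^{ab}_{ij} = <I(a,i) I(b,j)> - <I(a,i)><I(b,j)>, averaged over an
    ensemble of equal-length sequences. seqs: iterable of strings;
    a, b: amino-acid letters; i, j: 0-based positions with i < j."""
    Ia = np.array([s[i] == a for s in seqs], dtype=float)  # indicator I(a,i)
    Ib = np.array([s[j] == b for s in seqs], dtype=float)  # indicator I(b,j)
    return (Ia * Ib).mean() - Ia.mean() * Ib.mean()
```

Looping a and b over the 20 amino acids for each of the Γ(Γ−1)/2 position pairs then produces the full set of matrices shown in Figures S4-S7.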
Of course, the correlations displayed in Figure S4 (and in Figures S5, S6, and S7 for the other portions of the secondary structures) provide useful information. However, interpreting these results a priori, without the knowledge acquired by studying the RBMs' weights, appears considerably more difficult.

Figure S1: Example of parameter tuning in DBSCAN for the start of α-helices. The "Noise" matrix represents the fraction of noise points (first term in (S2)). The "Average" matrix represents the average group size divided by the total number of hidden units in the ensemble (second term in (S2)). The "Total" matrix is the sum of the previous two and coincides with the cost function in (S2). The red points highlight the optimal region of the parameters.

Figure S2: Results for the start of α-helices found with the algorithm STRIDE. See the legend of Figure 4 in the main text for more details.

Figure S3: Results for the end of α-helices found with the algorithm STRIDE. See the legend of Figure 4 in the main text for more details. Compare this to Figure 5 in the main text.

Figure S4: Covariance matrices C^{ab}_{ij} for the empirical occurrence of amino acid a at position i and amino acid b at position j > i. Each matrix is for a given i, j pair for the start of α-helices.

Figure S5: As in Fig. S4 but for the end of α-helices.

Figure S6: As in Fig. S4 but for the start of β-sheets.

Figure S7: As in Fig. S4 but for the end of β-sheets.