Learning protein constitutive motifs from sequence data

Statistical analysis of evolutionary-related protein sequences provides information about their structure, function, and history. We show that Restricted Boltzmann Machines (RBM), designed to learn complex high-dimensional data and their statistical features, can efficiently model protein families from sequence information. We here apply RBM to 20 protein families, and present detailed results for two short protein domains (Kunitz and WW), one long chaperone protein (Hsp70), and synthetic lattice proteins for benchmarking. The features inferred by the RBM are biologically interpretable: they are related to structure (residue-residue tertiary contacts, extended secondary motifs (α-helixes and β-sheets) and intrinsically disordered regions), to function (activity and ligand specificity), or to phylogenetic identity. In addition, we use RBM to design new protein sequences with putative properties by composing and 'turning up' or 'turning down' the different modes at will. Our work therefore shows that RBM are versatile and practical tools that can be used to unveil and exploit the genotype–phenotype relationship for protein families.


Introduction
Sequencing of many organism genomes has led over the recent years to the collection of a huge number of protein sequences, gathered in databases such as UniProt or PFAM [1].Sequences with a common ancestral origin, defining a family (Fig. 1A), are likely to code for proteins with similar functions and structures, hence providing a unique window into the relationship between genotype (sequence content) and phenotype (biological features).In this context, various statistical approaches have been introduced to infer protein properties from sequence data, in particular amino-acid conservation and coevolution (correlation) [2,3].
A major objective of these approaches is to identify positions carrying amino acids with critical impact on the protein function, such as catalytic or binding sites, or specificity-determining sites controlling ligand specificity.Principal Component Analysis (PCA) of the sequence data can be used to unveil groups of coevolving sites with specific functional role [4][5][6].Other methods rely on phylogeny [7], entropy (variability in amino-acid content) [8,9], or hybrid combination of both [10].
Another objective is to extract structural information, such as the contact map of the three-dimensional fold.Considerable progress was brought by maximum-entropy methods, which rely on the computation of direct couplings between sites reproducing the pairwise coevolution statistics in the sequence data [11][12][13][14].Direct couplings provide very good estimators of contacts [15][16][17][18], and capture pairwise epistasis effects necessary to model fitness changes resulting from mutations [19][20][21].
Despite these successes, a unique framework capable of extracting the structural and functional features common to a protein family, as well as the phylogenetic variations specific to sub-families is still missing.Hereafter, we consider Restricted Boltzmann Machines (RBM) for this purpose.RBM are a powerful method coming from machine learning [22,23]; they are unsupervised (sequence data need not be annotated) and generative (able to generate new data).Informally speaking, RBM learn complex data distributions through their statistical features (Fig. 1B).
We apply RBM to three sequence datasets: the Kunitz domain, a protease inhibitor, historically important for protein structure determination [24], the WW domain, a short module binding different classes of ligands [25], and lattice-protein in silico data [26,27] to benchmark our approach on exactly solvable models [28].Our study shows that RBM are able to capture (1) structural features, either local, such as tertiary contacts, or extended, such as secondary structure motifs (α-helix and β-sheet); (2) functional features, i.e. groups of amino acids controlling specificity or activity; (3) phylogenetic features, related to sub-families sharing evolutionary determinants.Some of these features involves two residues only (as direct pairwise couplings do), others extend over large and not necessarily contiguous portions of the sequence (as in collective modes extracted with PCA).
The degree of presence of each feature in a given sequence defines a multi-dimensional representation of this sequence, highly informative about the biological properties of the corresponding protein (Fig. 1C).Focusing on representations of interest allows us to design new sequences with putative functional properties.In summary, our work shows that RBM offer an effective, versatile computational tool to characterize and exploit quantitatively the genotype-phenotype relationship specific to a protein family.

Restricted Boltzmann Machines
Definition.A Restricted Boltzmann Machine (RBM) is a joint probabilistic model for sequences and representations, see Fig. 1C.It is formally defined on a bipartite, two-layer graph (Fig. 1B).Protein sequences v = (v 1 , v 2 , ..., v N ) are displayed on the Visible layer, and representations h = (h 1 , h 2 , ..., h M ) on the Hidden layer.Each visible unit takes one out of 21 values (20 amino acids + 1 alignment gap).Hidden-layer unit values h µ are real.The joint probability distribution of v, h is up to a normalization constant.Here, the weight matrix w iµ couples the visible and the hidden layers, and g i (v i ) and U µ (h µ ) are local potentials biasing the values of, respectively, the visible and the hidden variables (Figs.1B & 2A).
The probability of a sequence, P (v), is obtained by summing (integrating) P (v, h) over all its possible representations h.
Learning.The weights w iµ and the defining parameters of the potentials g i and U µ are learned by maximizing the product of the probabilities P (v) of all the sequences v in the Multiple Sequence Alignment (MSA), see Methods.We also introduce penalty terms over the weights and parameters to avoid overfitting.Details about the learning procedure are reported in Methods and Supplementary Information, Section V.
From sequence to representation.Given a sequence v on the visible layer, the hidden unit µ receives the input This expression is analogous to the score of a sequence with a position-specific weight matrix.Large and small |I µ | correspond to, respectively, good and bad matches between the weights (attached to unit µ) and the sequence.
The input I µ determines, in turn, the conditional probability of the activity h µ of the hidden unit, up to a normalization constant.The nature of the potential U is crucial to determine how the average activity h varies with the input I, see Fig. 2B.For quadratic potential U, the conditional probability is Gaussian, and the average activity is a linear function of the input.Complex, non-linear dependencies are obtained with the so-called double Rectified Linear Units (dReLU) potentials (Fig. 2A and caption).
From representation to sequence.Given a representation (set of activities) h on the hidden layer, the residues on site i are distributed according to Hidden units with large activities h µ strongly bias this probability, and favor values of v i corresponding to large weights w iµ (v i ).
Sampling.Use of Eqn.(3) allows us to sample the representation space given a sequence, while Eqn.(4) defines the sampling of sequences given a representation, see both directions in Fig. 1C.Iterating this process generates high-probability representations, which, in turn produce very likely sequences, and so on.Protein design corresponds here to the sampling of sequences v conditioned to a set of representations h of interest.

Sequence space
Representa.on space Reverse and forward modeling of proteins.A. Example of Multiple-Sequence Alignment (MSA), here of the WW domain (PF00397).Each column i = 1, ..., N corresponds to a site on the protein, and each line to a different sequence in the family.Color code for amino acids: red = negative charge (E,D), blue = positive charge (H, K, R), purple = non-charged polar (hydrophilic) (N, T, S, Q), yellow = aromatic (F, W, Y), black = aliphatic hydrophobic (I, L, M, V), green = cysteine (C), grey = other, tiny (A, G, P).B. In a Restricted Boltzmann Machine (RBM), weights w iµ connect the visible layer (carrying protein sequences v) to the hidden layer (carrying representations h).Biases on the visible and hidden units are introduced by the local potentials g i (v i ) and U µ (h µ ).Due to the bipartite nature of the weight graph, hidden units are independent conditioned to a visible configuration, and vice versa.C. Sequences v in the MSA (dots in sequence space, left) code for proteins with different phenotypes (dot colors).RBM define a probabilistic mapping from sequences v onto the representation space h (right), indicative of the phenotype of the corresponding protein, and encoded in the conditional distribution P (h|v), Eqn.(3) (black arrow).The reverse mapping from representations to sequences is P (v|h), Eqn.(4) (black arrow).Sampling a subspace in the representation space (colored domains) defines in turn a complex subset of the sequence space (colored circles), and allows one to design new sequences with putative phenotypic properties (blue domain and arrow).

Kunitz domain
The majority of natural proteins are obtained by concatenating functional building blocks, called protein domains.The Kunitz domain, with N = 53 residues (protein family PF00014 [1]), inhibits trypsin and other proteases.Its structure was the first one ever resolved [24], and is often used to benchmark folding predictions based on simulations [29] and coevolutionary approaches [15-17, 30, 31].We use the MSA of PF00014 to train our RBM with M = 100 dReLU (Methods).Figure 3A shows the MSA sequence logo and the secondary structure motifs.
Figure 3B shows the weights w iµ (v) attached to 5 hidden units, chosen for their diversity.Each logo identifies the amino-acid motifs in the sequences v giving rise to large (positive or negative) inputs I onto the associated hidden unit, see Eqn.  25).In practice, the parameters of the hidden unit potentials are fixed through learning of the sequence data.B. Average activity of hidden unit h, calculated from Eqn. (3), as a function of the input I defined in Eqn.(2).The three curves correspond to the three choices of potentials in panel A. For the quadratic potential (Black), the average activity is a linear function of I.For dReLU1 (Green), small inputs I barely activate the hidden unit, whereas dReLU2 (Purple) essentially binarizes the inputs I.
distribution of the inputs I 1 partitions the MSA in three subfamilies (Fig. 3C, top).The two peaks in I 1 −2.5 and I 1 1.5 identify sequences where the contact is due to an electrostatic interaction with, respectively, (+, −) and (−, +) charged amino acid on sites 45 and 49; the other peak in I 1 0 identify sequences realizing the contact differently, e.g. with an aromatic amino acid on site 45.Weight 1 shows also a weaker electrostatic component on site 53 in Fig. 3B; the 4-site separation between sites 45-49-53 fits well with the average helix turn of 3.6 amino acids (Fig. 3D).
Weight 2 focuses on the contact between residues 11-35, realized in most sequences by a C-C disulfide bridge (Fig. 3B and negative I 2 peak in Fig. 3C, top).A minority of sequences in the MSA, corresponding to I 2 > 0 and mostly coming from nematode organisms (Supplementary Fig. 13), do not show the bridge.A subset of these sequences strongly and positively activate hidden unit 3 (Supplementary Fig. 13 and I  Weight 4 describes a feature mostly localized on the β 1 -β 2 strands and on the preceding loop (sites 10 to 15), see Figs. 3B&D and contact map in Supplementary Fig. 9.This loop, in particular sites 10-12-13, is involved in the binding to proteases, according to structural studies of the trypsin-trypsin inhibitor (BPT1) complex [32] and to mutagenesis experiments [33].
Weight 5 codes for a complex extended mode, negatively activated by a small subset of the MSA sequences with high similarities (Fig. 3C).These sequences correspond to the protein Bikunin (AMBP), made of two Kunitz domains and present in most mammals and some other vertebrates [34].
The remaining 95 inferred weights are shown in Supporting Information.We find a variety of structural (including pairwise contacts as in weights 1 and 2) and phylogenetic (dedicated to few, evolutionary close sequences as hidden unit 5) features; the latter include in particular stretches of gaps, mostly located at the extremities of the sequence [30].Several weights have strong components on the same sites as weight 4, showing the complex pattern of amino acids controlling binding affinity.
As shown above, co-occurrence of large weight components often correspond to nearby sites on the 3D fold.To extract structural information in a systematic way, reminiscent of direct-coupling based approaches, we use our RBM to derive effective couplings between sites from the log ratio of sequence probabilities upon single and double mutations (Methods).These effective pairwise couplings can then be used for contact estimation (Methods).Figure 3E shows that the quality of prediction of the contact map of the Kunitz domain with RBM is comparable to state-of-the-art methods based on direct couplings [15].[35] using VMD [36].White spheres denote the positions of the 3 disulfide bridges in the wild type sequence.Green spheres locate residues i with v |w iµ (v)| > S for each hidden unit µ (S = 1.5 for µ = 1, 2, 3, 1.25 for µ = 4 and 0.5 for µ = 5.E. Contact Prediction using RBM compared to direct coupling-based methods (Pseudo-Likelihood Method [18], and Boltzmann Machine learning [37]).Positive Predicted Value (PPV), i.e. fraction of true contacts (distance < 8 Å on PDB structure 5pti [38]) vs. number of predicted contacts, see Methods, Section E.

WW domain
WW is a protein-protein interaction domain found in many eukaryotes and human signalling proteins, involved in essential cellular processes such as transcription, RNA processing, protein trafficking, receptor signalling.WW is a short domain with N = 31 amino-acids (Fig. 4A, PFAM PF00397), which folds into a three-stranded antiparallel β-sheet.The domain name stems from the two conserved tryptophans (W) at positions 5-28 (Fig. 4A), which serve as anchoring sites for the ligands.WW domains bind to a variety of proline (P)-rich peptide ligands, and can be divided into four groups, based on their preferential binding affinity [39].Group I binds specifically to PPXY motif -where X is any amino acid; Group II to PPLP motifs; Group III to proline-arginine containing sequences (PR); Group IV to phosphorylated serine/threonine-proline sites [p(S/T)P].
Five weight logos of the inferred RBM are shown in Fig. 4B; the remaining 95 weights are given in Supporting Information.Weight 1 codes for a contact between sites 4-22 realized either by two amino acids with oppositive charges (I 1 < 0), or by one tiny and one negatively charged amino acid (I 1 > 0).Weight 2 shows a β-sheet-related feature, with large entries localized on the β 1 and β 2 strands (Fig. 4B); the corresponding residues are in contact on the 3D fold, see Fig. 4D.Hidden unit 3 is negatively activated by few sequences (Fig. 4C) carrying the W28X mutation, with non-aromatic X; this rare mutation can be compensated by a complex mutation pattern around the β 1 -β 2 extremities (weight 3 in Fig. 4B).
Weights 4 and 5 involve sites on the β 2 -β 3 binding pocket and on the β 1 -β 2 loop of the WW domain.The distributions of activities in Fig. 4C highlight different groups of sequences in the MSA that strongly correlate with experimental ligand-type identification, see Fig. 4E.We find that (i) Type I domains are characterised by I 4 < 0 and I 5 > 0; (ii) Type II/III domains are characterized by I 4 > 0 and I 5 > 0; (iii) There is no clear distinction between Type II and Type III domains; (iv) Type IV domains are characterised by I 4 > 0 and I 5 < 0.
Our results are in good agreement with various studies.(i/ii) Mutagenesis experiment showed the importance of sites 19,21,24,26 for binding specificity [40,41].For the YAP1 WW domain, as confirmed by various studies (see [41] Table 2), the mutations H21X and T26X reduce the binding affinity to Type I ligands, while the mutation Q24R increases it and the mutation S12X has no effect.This is in agreement with the negative components of weight 4 (Fig. 4A): I 4 increases upon mutations H21X and T26X, decreases upon Q24R and is unaffected by S12X.Moreover the mutation L19W alone, or combined with H21[D/G/K/R/S] could switch the specificity from Type I to Type II/III [40].These results are consistent with Fig. 4E: YAP1 (blue cross) is of Type I but one or two mutations move it to the right side, closer to the other cluster (orange crosses).Espanel and Sudol [40] also proposed that Type II/III specicity required the presence of an aromatic amino acid (W/F/Y) on site 19, in good agreement with weight 4. (iii) The distinction between Types II and III is unclear in the literature, because WW domains often have high affinity with both ligand types.(iv) Several studies [4,42,43] have demonstrated the importance of the β 1 -β 2 loop for achieving Type IV specificity, which requires a longer, more flexible loop, as opposed to short rigid loop for other types.The length of the loop is encoded in weight 5 through the gap symbol on site 13: short and long loops correspond to, respectively, positive and negative I 5 .Removing R13 in the loop of the Type IV hPin1 WW domain reduces its binding affinity to [p(S/T)P] ligands [4].The importance of residues R11 and R13 was shown in [42].These observations agree with weight 5, authorizing substitutions between K and R on sites 11 and 13. (v) A specificity-related sector of eight sites was identified in [4], five of which carry the top components of weight 4 (green balls in Fig. 4D).Our approach provides another specificity-related feature (weight 5) and the motifs of amino acids affecting Type I & IV specificity, in good agreement with the experimental findings of [4].[4].Lower triangle: artificial, from [4].Diamond: natural, from [45].Crosses: YAP1 (0) and variants (1 and 2 mutations from YAP1), from [40].The three clusters match the standard ligand type classification.

Sequence Design
The biological interpretation of the features inferred by the RBM guides us to sample new sequences v with putative functionalities.In practice, we sample from the conditional distribution P (v|h), Eqn.(3), where a few hidden-unit activities in the representation h are fixed to desired values, while the others are sampled from Eqn. (4).For WW domains, we condition on the activities of hidden units 4 and 5, related to binding specificity.Fixing h 4 and h 5 to levels corresponding to the peaks in the histograms of inputs in Fig. 4C allows us to generate sequences belonging specifically to each one of the three ligand-specificity clusters, see Fig. 5A.
In addition, sequences with combinations of activities that are not encountered in the natural MSA can be engineered.As an illustration, we generate by conditional sampling hybrid WW-domain sequences with strongly negative values of h 4 and h 5 , corresponding to a Type I-like β 2 -β 3 binding pocket and a long, Type IV-like β 1 -β 2 loop, see Fig. 5A&B.For Kunitz domains, the property 'no 11-35 disulfide bond' holds only for some sequences of nematode organisms, whereas the Bikunin-AMBP gene is present only in vertebrates; they are thus never observed simultaneously in natural sequences.Sampling our RBM conditioned to appropriate levels of h 2 and h 5 allows us to generate sequences with both features activated, see Figs. 5C&D.
The sequences designed by RBM are far away from all natural sequences in the MSA, but have comparable probabilities, see Figs. 5E (WW) and 5F (Kunitz).Their probabilities estimated with pairwise direct-coupling models (trained on the same data), whose ability to identify functional and artificial sequences has already been tested [14,46], are also large, see Supplementary Fig. 8.
Our RBM framework can also be modified to design sequences with very high probabilities, even larger than in the MSA, by appropriate duplication of the hidden units (Methods).This trick can be combined with conditional sampling, see Fig. 5E&F. in Fig. 3C.Red sequences combine the absence of the 11-35 disulfide bridge and a strong activation of the Bikunin-AMBP feature, though these two phenotypes are never found together in natural sequences.D. Sequence logo of the red sequences in panel 5C, with 'no disulfide bridge' and 'bikunin' features.E. Scatter plot of the number of mutations to the closest natural sequence vs log-probability, for natural (gray) and artificial (colored) WW domain sequences.Same color code as panel 5A; dark dots were generated with the high-probability trick, based on duplicated RBM (Methods).Note the existence of many high-probability artificial sequences far away from the natural ones.F. Same scatter plot as in panel 5E for natural and artificial Kunitz domain sequences.

Benchmarking on Lattice Proteins
RBM models can be benchmarked on in silico lattice proteins (LP), introduced in the 90 s to study protein folding and design [27].A 'protein' of L = 27 amino acids may fold into ∼ 10 5 distinct structures on a 3 × 3 × 3 cubic lattice, with probabilities depending on its sequence (Methods and Figs.6A&B) [26].We generate a MSA containing sequences having large probabilities (> 0.99) of folding into one structure shown in Fig. 6A [28].A RBM with M = 100 dReLU hidden units is then learned, see Supplementary Information for details about regularization and cross-validation.
Various structural LP features are encoded by the weights as in real proteins, including complex negative-design related modes, see Figs. 6C&D and the remaining weights in Supporting Information.Performances in terms of contact predictions are comparable to state-of-the art methods on LP, see Supplementary Fig. 11.
The capability of RBM to design new sequences with desired features and high values of fitness, exactly computable in LP as the probability of folding into the native structure in Fig. 6A, can be quantitatively assessed.Conditional sampling allows us to design sequences with specific hidden-unit activity levels, or combinations of features not found in the MSA (Fig. 6E).These designed sequences are diverse and have large fitnesses, comparable to the MSA sequences and even higher when generated by duplicated RBM (Fig. 6F), and well correlated with the RBM probabilities P (v) (Supplementary Fig. 7).

Discussion
In summary, we have shown that RBM are a promising, versatile, and unifying method for modeling and generating protein sequences.RBM, when trained on protein sequence data, reveal a wealth of structural, functional and evolutionary features.To our knowledge, no other method has been able to extract such detailed information in a unique framework so far.In addition, RBM can be used to design new sequences: hidden units can be seen as representation-controlling knobs, tunable at will to sample specific portions of the sequence space corresponding to desired functionalities.Above all, a major and appealing advantage of RBM is that the architecture of the model embodies the very concept of genotype-phenotype mapping.
From a machine-learning point of view, the values of RBM defining parameters, such as the number M of hidden units and regularization penalties were selected based on the likelihood (probability) of natural sequences not used for training (test set), see Supplementary Information, Section I.As expected, increasing M improves likelihood up to some level, after which overfitting starts to occur.Adding sparsifying regularization prevents overfitting, with only a mild decrease in the likelihood of test sets for not too strong penalties.Crucially, imposing sparsity facilitates the biological interpretation of weights and, thus, enhances the correspondence between representation and phenotypic spaces (Fig. 1C).It also allows us to drive the RBM operation point into the so-called compositional phase [47], in which multiple recombinations of features generate a variety of new sequences with high probabilities, such as those shown in Fig. 5.Further evidence for such compositional representations will be presented in [48].
While training RBM is challenging and requires sampling, substantial improvements were developed in the present work (Supplementary Information, Section V).Generative models alternative to RBM that do not require Markov Chain sampling exist in machine learning, such as Generative Adversarial Networks [49] and Variational Auto-encoders (VAE) [50].VAE were recently applied to protein sequence data for fitness prediction [51,52].Our work differs in several important points: our RBM is an extension of direct-based coupling approaches, requires much less hidden units (15 and 50 times less than, respectively, [51] and [52]), has a simple architecture with two layers carrying sequences and representations, infers interpretable weights with biological relevance, and can be easily tweaked to design sequences with desired statistical properties.Note that, while we present here detailed results for relatively short protein domains, our method can be applied to much longer proteins with hundreds of amino acids.
From a computational biology point of view, RBM unifies and extends previous approaches in the context of protein coevolutionary analysis.From the one hand, the features extracted by RBM identify 'collective modes' controlling the biological functionalities of the protein, in a similar way to the so-called sectors extracted by statistical coupling analysis [6].However, contrary to sectors, the collective modes are not disjoint: a site may participate to different features, depending on the residues they carry.On the other hand, RBM coincide exactly with the direct-coupling analysis [15,30] when the potential U(h) is quadratic in h (Methods).For non-quadratic potentials U, couplings to all orders between the visible units are present, all generated from the weights w iµ .Quadratic potentials, indeed, lack high-order interaction terms that significantly better describe gap modes [53] and outliers sequences (Supplementary Fig. 6).dReLU RBM therefore offer a practical way to go beyond pairwise coupling models, without an explosion in the number of interaction parameters.
The weights shown in Figs.3B and 4B could be unambiguously interpreted and related to existing literature.However, the biological significance of some of the inferred features remains unclear, and would require experimental  +) charges (I 1 > 0), or by hydrophobic residues (I 1 < 0).Weights 2 and 3 are related to, respectively, the triplets of amino acids 8-15-27 and 2-16-25, each realizing two overlapping contacts on S A (blue dashed contours).Weight 4 codes for electrostatic contacts between 3-26, 1-18 and 1-20, and imposes that the charges of amino acids 1 and 26 have same sign.The latter constraint is not due to the native fold (1 and 26 are 'far away' on S A ) but impedes folding in the 'competing' structure, S G (Fig. 6B and Methods), in which sites 1 and 26 are neighbours [28].D. Distributions of inputs I and average activities (full line, left scale).All features are activated across the entire sequence space.E. Conditional sampling with activities (h 2 , h 3 ) fixed to (h ± 2 , h ± 3 ), see red dots in panel 6D.Designed sequences occupy specific clusters in the sequence space, corresponding to different realizations of the overlapping contacts encoded by weights 2 and 3 (panel 6C).Conditioning to (h − 2 , h + 3 ) makes possible to generate sequences combining features not found together in the MSA, see bottom left corner, even with very high probabilities (Methods).F. Scatter plot of the number of mutations to the closest natural sequence vs. p nat , the ground-truth fitness function for natural (gray) and artificial (colored) sequence.Note the large diversity and the existence of sequences with higher p nat than in the training sample.
investigation.Similarly, the capability of RBM to design new functional sequences need experimental validation besides the comparison with past design experiments (Fig. 4E) and the benchmarking on in silico proteins (Fig. 6).While recombining different parts of natural proteins sequences from different organisms is a well recognized procedure for protein design [54,55], RBM innovates in a crucial aspect.Traditional approaches cut sequences into fragments at fixed positions based on secondary structure considerations, but such parts are learnt and need not be contiguous along the primary sequence in RBM models.We believe protein design with detailed computational modeling methods, such as Rosetta [55,56], could be efficiently guided by our RBM-based approach, in much the same way protein folding greatly benefited from the inclusion of long-range contacts found by direct-coupling analysis [16,57].Future projects include analyzing more protein families, and developing systematic methods for identifying functiondetermining sites.In addition, it would be very interesting to use RBM to determine evolutionary paths between two, or more, protein sequences in the same family, but with distinct phenotypes.In principle, RBM could reveal how functionalities continuously change along the paths, and provide a measure of viability of intermediary sequences.(3) Hence, with the duplicated RBM, sequences with high probabilities in the original RBM model are given a boost compared to low-probability sequences.This allows us to sample efficiently from P 2 (v).Note that more subtle biases can be introduced by duplicating some (but not all) of the hidden units in order to give more importance in the sampling to the associated statistical features.The architecture of RBM makes the production of a large variety of biased distributions over the visible configurations very easy.

D. Training algorithm for RBM
Hereafter we sketch the algorithm training.It is performed by maximizing, by stochastic gradient ascent, the difference between the data likelihood log P (v) M SA and the regularization penalties.All parameters used are given in Supplementary Information V.B.

Data and preprocessing
We use the PFAM sequence alignments released in December 2017 for both Kunitz (PF00014) and WW (PF00397) domains.All columns with insertions are discarded, after which duplicate sequences are removed.We are left with, respectively, N = 53 sites and B = 8062 unique sequences for Kunitz, and N = 31 and B = 7503 for WW.To correct for the heterogeneous sampling of the sequence space, a reweighting is applied: each sequence v with = 1, ..., B is assigned a weight w equal to the inverse of the number of sequences with more than 90% amino-acid identity (including itself).In all that follows, the average over the sequence data of a function f is defined as For Lattice Proteins, accurate sampling is achievable (see below) and no reweighting is required.

Gradients of log-likelihood
The gradients of the log-likelihood L with respect to the fields g i (a), couplings w iµ (a) and the hidden-unit potential parameters that we write generically as ξ µ , read Here, δ vi,a = 1 if v i = a and 0 otherwise, denotes the Kronecker function.The subscript of the averages • indicate whether they are computed over the sequences in the MSA (data) or over the distribution of sequences generated by the RBM (model).Evaluating the model averages exactly is untractable, and Monte Carlo (MC) methods or approximations [62,63] are required, see Methods, Section C. Hence, each gradient takes the form of a difference between a data average and a model average.Supplementary Information Section VB presents details about the stochastic gradient descent (SGD) procedure and about the parameters values.

Overparametrization of RBM model
Similarly to pairwise Potts models, the model is overparametrized and gauge choices must be made for the optimum to be well defined [14].Firstly, the transformations g i (v) → g i (v) + K i and w iµ (v) → w iµ (v) + K iµ , coupled with θ ± µ → θ ± µ − i K iµ leave the probability invariant.We choose the standard zero-sum gauge for the fields and weights, i.e. the sums over all symbols v of the fields g i (v) and of the weights w iµ (v) vanish.SGD updates preserve the zero-sum gauge for the fields but not for the weights, see Eq. 5; the weights must be modified after each update to restore the zero-sum gauge: w iµ (v) → w iµ (v) − 1 q v w iµ (v ), where q = 21 is the number of symbols of visible units.In addition, there can be redundancy between the potential parameters and the amplitude of the weights.For instance, for the quadratic potential, the model distribution is invariant under rescaling transformations The standard choice in the literature [23] is the constant gauge γ = 1, θ = 0; although simple, this choice has the strong downside that the scale of the hidden units h µ depends on the scale of the weights, w, and on the number of visible units, N .Indeed, Γ µ (I µ ) ≡ Iµ−θµ γµ = I µ , which can get as large as w N when a visible configuration is strongly overlapping with the weight vector.As the support of h is not bounded, neither are the gradients of the log-likelihood with respect to w iµ (v), and divergence can occur [23].Moreover, the distribution of h µ fluctuates a lot during training, which complicates the moment estimation and can harm convergence.This covariate shift phenomenon [64] is general to all neural network, and an interesting solution, batch normalization, has been recently proposed for feedforward networks [65].The idea is to reparameterize the network such that all intermediate activities have zero mean and unit variance.For the quadratic potential, we adapt this idea and choose γ µ and θ µ such that h µ (v) M SA = 0 , Var[h µ (v)] M SA = 1 (6) where Var denotes the variance.These implicit equations over γ µ , θ µ can be solved analytically: In practice, γ µ , θ µ must be updated after each modification of the weights and recomputing the mean and variance over the full data set is inefficient, see Supplementary Information VD.We estimate the two parameters with the mini-batch used for the next SGD update.In addition to being efficient, this procedure also ensures that the hidden layer is normalized before the gradient update, hence guaranteeing that the gradients are always of order 1.To let γ µ and θ µ converge, we perform an exponential smoothing at the end of the training, of the form θ µ .Note that, since γ µ , θ µ are functions of w, the gradients with respect to the weights must be updated accordingly as: Explicit gradient expressions are provided in Supplementary Information VD.Although Bernoulli potentials are not overparametrized, a similar reparameterization also improves performance, see Supplementary Information Section VC.The case of ReLU and dReLU potentials is more involved and treated in Supplementary Information Section VE.

Regularization
RBMs are regularized to avoid overfitting and to create interpretable sequence representations, as discussed above.We add to L a standard L 2 penalty ∝ 1 2 i,v g i (v) 2 for the potentials acting on the visible units, and a customed sparse penalty for the weights ∝ µ i,v |w iµ (v)|

2
. The latter term corresponds to an effective L 1 regularization with an adaptive strength increasing with the weights, thus promoting homogeneity among hidden units.Besides, it prevents hidden units from ending up entirely disconnected (w iµ (a) = 0 ∀i, a), and makes the choice of the penalty strength more robust, see Supplementary Information Fig. 2. Overall, the regularized log-likelihood is: E. Contact map estimation RBM can be used for contact prediction in a manner similar to direct-coupling based (Potts) models (Figure Methods 1).We first derive an effective coupling matrix J eff ij (a, b) from the RBM.To do so, let us consider a sequence v, and two sites i, j.We define the set of mutated sequences v a,b through We then define the following likelihood ratio where P is the marginal distribution over visible configurations, see (1).In terms of mutational fitness landscape, R ij (v; a, a , b, b ) measures epistatic contributions to the double mutation a → a and b → b on, respectively, sites i and j in the background defined by sequence v.
Applying the same definition to a pairwise Potts model, with marginal distribution
3 > 0 peak in Fig 3C).Positive components in weight 3 logo suggest that these proteins stabilize their structure through electrostatic interactions between sites 10 (− charge) and 33-36 (+ charges both), see Figs. 3B&D, which compensate the absence of C-C bridge on the neighbouring sites 11-35.

FIG. 3 :
FIG. 3: Modeling Kunitz Domain with RBM. A. Sequence logo and secondary structure of the Kunitz domain (PF00014), showing two α-helices and two β-strands.Note the presence of the three C-C disulfide bridges between 11-35, 2-52, 27-48.B. Weight logos for five hidden units, see text.Positive and negative weights are shown by letters located, respectively, above and below the zero axis.Values of the norms W µ 2 = i,v w iµ (v) 2 are given.Same color code for the amino acids as in Fig. 1A.C. Top: Distribution of inputs I µ (v) over the sequences v in the MSA (dark blue), and average activity vs. input function (full line, left scale); red points correspond to activity levels used for design in Fig. 5. Bottom: Histograms of Hamming distances between sequences in the MSA (grey) and between the 20 sequences (light blue) with largest (for unit 2,3,4) or smallest (1,5) I µ .D. 3D visualization of the weights, shown on PDB structure 2knt[35] using VMD[36].White spheres denote the positions of the 3 disulfide bridges in the wild type sequence.Green spheres locate residues i with v |w iµ (v)| > S for each hidden unit µ (S = 1.5 for µ = 1, 2, 3, 1.25 for µ = 4 and 0.5 for µ = 5.E. Contact Prediction using RBM compared to direct coupling-based methods (Pseudo-Likelihood Method[18], and Boltzmann Machine learning[37]).PositivePredicted Value (PPV), i.e. fraction of true contacts (distance < 8 Å on PDB structure 5pti[38]) vs. number of predicted contacts, see Methods, Section E.

FIG. 4 :
FIG. 4: Modeling WW Domain with RBM. A. Sequence logo and secondary structure of the WW domain (PF00397), with three β-strands.Note the two conserved W in positions 5 and 28.B. Weight logos for five representative hidden units, same as Fig. 3B.C. Corresponding inputs, average activities and distances between top-20 feature activating sequences, same as Fig. 3C.D. 3D visualization of the features, shown on the PDB structure 1e0m [44].White spheres locate the two W. Green spheres locate residues i with v |w iµ (v)| > 0.7 for each hidden unit µ.E. Scatter plot of inputs I 4 vs.I 5 .Gray dots represent the sequences in the MSA; they cluster into three main groups.Colored dots show artificial or natural sequences whose specificities, given in the legend, were tested experimentally.Upper triangle: natural, from[4].Lower triangle: artificial, from[4].Diamond: natural, from[45].Crosses: YAP1 (0) and variants (1 and 2 mutations from YAP1), from[40].The three clusters match the standard ligand type classification.

FIG. 5 :
FIG. 5: Sequence design with RBM. A. Conditional sampling of WW domain-modeling RBM.Sequences are drawn according to Eqn. (3), with activities (h 4 , h 5 ) fixed to (h − 4 , h + 5 ), (h + 4 , h − 5 ), (h + 4 , h + 5 ) and (3h − 4 , h − 5 ) see red points indicating h ± 4 , h ± 5 in Fig. 4C.Natural sequences in the MSA are shown with gray dots, and generated sequences with colored dots.Four clusters of sequences are obtained; the first three are putatively associated to, respectively, ligand-specific groups I, II/III and IV.The sequences in the bottom left cluster, obtained through very strong conditioning, do not resemble any of the natural sequences in the MSA; their binding specificity is unknown.B. Sequence logo of the red sequences in panel 5A, with 'long β 1 -β 2 loop' and 'type I' features.C. Conditional sampling of Kunitz domain-modeling RBM, with activities (h 2 , h 5 ) fixed to (h ± 2 , h ± 5 ), see red dots indicating h ± 2 , h ± 5

FIG. 6 :
FIG.6: Benchmarking RBM with lattice proteins.A. S A , one of the 103, 406 distinct structures that a 27-mer can adopt on the cubic lattice[26].Circled sites are related to the features shown in panel 6C.B. S G , another fold with a contact map (set of neighbouring sites) close to S A[28].C. Four weight logos for a RBM modeling S A data, see Supplementary Information for the remaining 96 weights.Weight 1 corresponds to the contact between sites 3 and 26, see black dashed contour in panel 6A; the contact can be realized by amino acids of opposite (-+) charges (I 1 > 0), or by hydrophobic residues (I 1 < 0).Weights 2 and 3 are related to, respectively, the triplets of amino acids 8-15-27 and 2-16-25, each realizing two overlapping contacts on S A (blue dashed contours).Weight 4 codes for electrostatic contacts between 3-26, 1-18 and 1-20, and imposes that the charges of amino acids 1 and 26 have same sign.The latter constraint is not due to the native fold (1 and 26 are 'far away' on S A ) but impedes folding in the 'competing' structure, S G (Fig.6Band Methods), in which sites 1 and 26 are neighbours[28].D. Distributions of inputs I and average activities (full line, left scale).All features are activated across the entire sequence space.E. Conditional sampling with activities (h 2 , h 3 ) fixed to (h ± 2 , h ± 3 ), see red dots in panel 6D.Designed sequences occupy specific clusters in the sequence space, corresponding to different realizations of the overlapping contacts encoded by weights 2 and 3 (panel 6C).Conditioning to (h − 2 , h + 3 ) makes possible to generate sequences combining features not found together in the MSA, see bottom left corner, even with very high probabilities (Methods).F. Scatter plot of the number of mutations to the closest natural sequence vs. p nat , the ground-truth fitness function for natural (gray) and artificial (colored) sequence.Note the large diversity and the existence of sequences with higher p nat than in the training sample.

Figure Methods 1 : 2 =
Figure Methods 1: The statistics of local conservation (frequencies of amino acids on each site) and pairwise correlations (joint frequencies on all pairs of sites) can be reproduced by a graphical model, called Maximum Entropy Potts model in statistical physics, carrying 21-state variables vi, where i = 1, ..., L. The pairwise effective couplings between the sites, Jij(vi, vj), and the local fields, hi(vi), define the log probability of the corresponding protein sequence {vi} [14].A RBM with quadratic hidden-unit potential U reduces to an effective Potts model.

Figure Methods 2 :
Figure Methods 2: Duplicate RBM for biasing sampling toward high-probability sequences.The visible-unit configurations v are sampled from from P2(v) ∝ P (v) 2 .

R
ij (v; a, a , b, b ) ≡ log P (v a,b ) P (v a ,b ) P (v a ,b ) P (v a,b ),