Navigating the amino acid sequence space between functional proteins using a deep learning framework

Motivation Shedding light on the relationships between protein sequences and functions is a challenging task with many implications in protein evolution, diseases understanding, and protein design. The protein sequence space mapping to specific functions is however hard to comprehend due to its complexity. Generative models help to decipher complex systems thanks to their abilities to learn and recreate data specificity. Applied to proteins, they can capture the sequence patterns associated with functions and point out important relationships between sequence positions. By learning these dependencies between sequences and functions, they can ultimately be used to generate new sequences and navigate through uncharted area of molecular evolution. Results This study presents an Adversarial Auto-Encoder (AAE) approached, an unsupervised generative model, to generate new protein sequences. AAEs are tested on three protein families known for their multiple functions the sulfatase, the HUP and the TPP families. Clustering results on the encoded sequences from the latent space computed by AAEs display high level of homogeneity regarding the protein sequence functions. The study also reports and analyzes for the first time two sampling strategies based on latent space interpolation and latent space arithmetic to generate intermediate protein sequences sharing sequential properties of original sequences linked to known functional properties issued from different families and functions. Generated sequences by interpolation between latent space data points demonstrate the ability of the AAE to generalize and produce meaningful biological sequences from an evolutionary uncharted area of the biological sequence space. Finally, 3D structure models computed by comparative modelling using generated sequences and templates of different sub-families point out to the ability of the latent space arithmetic to successfully transfer protein sequence properties linked to function between different sub-families. All in all this study confirms the ability of deep learning frameworks to model biological complexity and bring new tools to explore amino acid sequence and functional spaces.

: The Adversarial AutoEncoder architecture number 3 presented in Supplementary Table 1. The discriminator (in red) takes as input data from a prior distribution or the latent space computed by the encoder/generator. Using a sigmoid activation function, the discriminator is trained to distinguish between the two types of data. By updating the weight of the encoder/generator based on the discriminator performances, the encoder/generator learn to approximate the prior distribution and fool the discriminator. The autoencoder architecture (in blue) corresponds to a variational autoencoder. Latent space is decoded by a decoder and new sequences are generated using a softmax activation function.

Structural modeling pipeline
Using the latent space computed by the neural network described in Supplementary Figure 1, four different latent space arithmetic strategies were developed to produce sequences with mixed functions. The generated protein sequences were further modelled using MODELLER and their energy evaluated using the DOPE energy function. Supplementary Figure 2 describes the general pipeline developed.
Figure 2: Modeling pipeline used to generate sequences sharing properties of two sub-families. The hmmsearch MSA is filtered and passed to the encoder to project each sequence to the latent space. Latent space projections can be used for visualization (see Supplementary Figure 4). Different strategies (1 to 4) are tested to generate new points in the latent space and generate new sequences through the decoder. The new sequences are used in combination with structures of the sub-families to create homology-based structural models and evaluated using the DOPE energy function of MODELLER.

HUP and TPP latent spaces
Latent spaces generated by the neural network described Supplementary Figure 1 were evaluated in terms of functional and taxonomic properties for the Sulfatase, HUP and TPP families (see main text for the results regarding the Sulfatase family).
Supplementary Figure 3 presents the latent spaces of the HUP and TPP families colored according to their EC annotations. HUP points colored in yellow correspond to protein with EC 6.1.1.1 and 6.1.1.2, pink-colored points to proteins with EC 6.1.1.1 and violet colored points to proteins with EC 2.7.11.24 and 6.1.1.2. TPP sequences have more annotated functions than HUP sequences (57 different EC assignation), but a global pattern can be found in the projection corresponding to two groups of proteins (brown and violet) annotated with EC 2.2.1.1 (Transketolase), oxidoreductase proteins in orange,pink,red,and green shades), and proteins with function EC 2.2.1.9 (2-succinyl-5enolpyruvyl-6-hydroxy-3-cyclohexene-1-carboxylic-acid synthase, in yellow and gray shades).  Supplementary Tables 3 displays for each cluster computed in the 100 dimensional latent space its homogeneity in term of taxonomy and functional 4 annotation. Clusters have a higher functional homogeneity than taxonomic homogeneity. Thus, sequences are projected according to sequence signal coming from functional activity rather than evolutionary relationship. Supplementary Tables 4 and 5 show the taxonomic and functional homogeneity of clusters for the HUP and TPP families with similar findings.

Protein latent space interpolation
Supplementary Figure 5 shows the sequences generated during the interpolation between protein sequences of S1-0 and S1-4 sub-families . An abrupt transition can be observed at positions 53, G (S1-0 sub-family) to S (S1-4 sub-family), and 51 [VI] (S1-0 sub-family) to T (S1-4 sub-family), corresponding to very conserved residues in Prosite motif PS00523 (see Supplementary Figure 6). The other positions of the motif are less affected by abrupt transitions but also appear to have lower amino acid diversity than other columns. A similar behavior can only be observed for position 111, T (S1-0 sub-family) to W (S1-4 sub-family), of motif PS00149 (positions 102 to 112). The other positions of the motifs are either conserved (T105, G109, K110) or accepting fluctuations (columns 103, 104, 107).
Supplementary Table 6 displays the interpolation coefficients along Perason correlation and R 2 values of interpolations between pairs of sub-families (S1-0, S1-2, S1-3, S1-6, S1-7, S1-8 and S1-11). The computed coefficients correspond to the linear regression presented in the main document Figure 2. Figure 5: First 130 amino acids for generated sequences using interpolation between data points of the latent space. The interpolation is performed between latent spaces of protein ID 2 of the sulfatase S1-0 family and of protein ID 2196 of the sulfatase S1-4 sub-family. Amino acid color coding is based on physo-chemical properties. Large transitions between gaps to amino acids and amino acids to gaps can be observed at positions 75 to 86 and positions 116 to 122. Amino acid columns transformation can be observed at multiple positions: 21 (S to G), 51 (V/I to T/S), 53 (G to S) etc. Logo plots were computed on Prosite motifs PS00523 and PS00149 of the Sulfatase family to illustrate the amino acid content of the protein sequences generated by latent space arithmetic . These regions correspond to the most conserved regions of the Sulfatase family and have been proposed as signature patterns for all the sulfatases in the Prosite database. In Supplementary Figure 6, panels A and D correspond to the biological sequences of the S1-0 sub-family and of the S1-2 sub-family respectively. Panels B and C correspond to generated protein sequences using either the Sulfatase sub-family S1-0 as the source and S1-2 as the query (Panel B) and to which the background latent space has been subtracted (strategy 2 on Supplementary Figure 2) and reciprocally (Panel C). Figure 6: Logo plots of MSA parts from the S1-0 and S1-2 (panels A and D) sub-families, and generated sequences using S1-0 as query and S1-2 as source (panel B) and S1-2 as query and S1-0 as the source (panel C).
In the first fragment corresponding to motif PS00523, G55 and T55 are the most frequent amino acid of sub-families S1-0 and S1-2 (Panels A and D) and it can be observed a "competition" between these two amino acids for generated sequences (Panel B and C) with a slightly higher probability for the amino acid of the family used as a query (G in Panel B and T in panel C). The residue S57, involved in the active site of the sulfatases, is less frequent in the query subfamily S1-0 (panel A) than in the sub-family S1-2 (panel D). The high frequency of S at position 57 in the sub-family S1-2 compared to the sub-family S1-1 may have an impact when performing the latent space arithmetic as S is predominant in the generated sequences. This influence of a very frequent amino acid in one of the source or query sub-family on the generated sequences can also be observed at the position 70 and is less visible when multiple amino acid frequencies are more balanced. In the second fragment corresponding to motif PS00149, residue R at position 101 follows this pattern. It is highly frequent in sub-family S1-0 (panel A) and less frequent in the sub-family S1-2 (panel D). The generated protein sequences display a R at position 101 with high frequency. The inverse can be observed for Y at position 105, highly frequent in sub-family S1-2 (panel D).
Other positions are however displaying much more complex patterns and cannot be summarized as a frequency competition between source and query sub-families. For instance, G at position 71 is very frequent in the sub-family S1-2 but have a comparable frequency with R in sub-family S1-0. The generated protein sequences don't display G has the only possible residue but seem to follow the frequency of their respective query sub-families. Positions in generated sequences with multiple amino acid sharing comparable frequencies in the source and query sub-families have also mixed frequencies, such as in positions 53, 59, 67, 98, or 106.
These behaviors can be observed several times through the logo plots but are still positions specifics, meaning that the bits scores pattern observed in the source sub-families (Panels A and D) do not necessary allow to predict the amino acids bits scores in the generated sub-families (Panels B and C). For instance, W at positions 113 as a high bit score value in the MSA of the sub-family S1-2 but little influences in the amino acids of the generated sequences where T of the sub-family S1-0 is found predominantly. Moreover, these observations are performed on residues of the Prosite motifs which are by definition conserved in sulfatases.

Energy distribution example
Supplementary Figure 7 shows the DOPE energy distributions of generated and original protein sequences with different protein structure templates.

Strategy 1
First strategy consists to add the mean latent space, computed using the encoder on the sequences of the source sub-family, to the encoded sequences of the query sub-family. Supplementary Figure 8 shows the energy distribution Figure 7: Energy distributions of models computed using structures from subfamily S1-0 (reds) or sub-family S1-2 (blues) and sequences from biological proteins or inferred using latent space arithmetic between spaces encoded by the S1-0 and S1-2 sub-families. Each violin plot corresponds to a specific targeted structure and sequence couples. For example, Struct. 0 Seq. 0 indicates that the energy distribution corresponds to sequences of the S1-0 sub-family modeled on structures of the S1-0 sub-families and Struct. 2 Seq. S1-0m2 corresponds to the energy distribution of sequences inferred using the latent space of the sub-family S1-2 added to the latent space of the sub-family S1-0 and modeled on structures of the S1-2 sub-family.  Figure 7, were computed. The y axis represents the difference between the mean values computed for query sequences modeled on structures of the same sub-family and mean values computed for source sequences modeled on structures of the query sub-family (ex: differences between mean of Struct. 0 Seq. 0 and mean of Struct. 0 Seq. 2 distributions in Figure 7). The x axis corresponds to the difference between the mean values computed for query sequences modeled on structures of the same sub-family and mean values computed for query sequences to which latent space of the source sub-family sequences have been added and modeled on structures of the query sub-family (M QS /Q), or source sequences to which latent space of the query sub-family sequences have been added and modeled on structures of the source sub-family (M SQ /Q) (ex: differences between mean of Struct. 0 Seq. S1-0m2 and mean of Struct. 0 Seq. 0 distributions in Figure 7). Points in the red area correspond to mean distribution values from generated sequences whose modeled structures have a higher energy than models created using pairs of sequences/structures from different sub-families. Points in the blue area correspond to mean distribution values from generated sequences whose modeled structures have a lower energy than models created using pairs of sequences/structures from the same sub-family. 14 1.6.4 Strategy 2 Second tested strategy differs from the first one by subtracting the mean background latent space, computed from the latent space of all sub-families, from the latent space of the query sub-family.  Blue dots correspond to protein sequence similarity computed between sequences of the same protein sub-family. Orange squares to similarity computed between generated sequences and the sequences of their query sub-family (ex: S1-0m2 generated sequences and S1-0 sub-family sequences). Green x to similarity computed between generated sequences and the sequences of their target sub-family (ex: S1-0m2 generated sequences and S1-2 sub-family sequences). Red upper triangles to similarity computed between sequences of two different sub-families (ex: S1-0 sequences and S1-2 sequences). Magenta lower triangles to similarity computed between sequences of the same generated sequence group. The variance and the mean of each distribution are displayed on the horizontal and vertical axes.

Strategy 3
Third strategy differs to the second as the background strategy is computed using all sub-families except sub-families source and query.  In the fourth strategy, the subtraction is performed using a local KD-tree to only remove features shared by closest members of a given query and addition is performed randomly selecting a member of the the source family and it's closest 10 members.