Deep learning of protein sequence design of protein–protein interactions

Abstract Motivation As more data of experimentally determined protein structures are becoming available, data-driven models to describe protein sequence–structure relationships become more feasible. Within this space, the amino acid sequence design of protein–protein interactions is still a rather challenging subproblem with very low success rates—yet, it is central to most biological processes. Results We developed an attention-based deep learning model inspired by algorithms used for image-caption assignments to design peptides or protein fragment sequences. Our trained model can be applied for the redesign of natural protein interfaces or the designed protein interaction fragments. Here, we validate the potential by recapitulating naturally occurring protein–protein interactions including antibody–antigen complexes. The designed interfaces accurately capture essential native interactions and have comparable native-like binding affinities in silico. Furthermore, our model does not need a precise backbone location, making it an attractive tool for working with de novo design of protein–protein interactions. Availability and implementation The source code of the method is available at https://github.com/strauchlab/iNNterfaceDesign Supplementary information Supplementary data are available at Bioinformatics online.


Introduction
The ability to computationally engineer protein sequences has a wide range of applications ranging from therapeutics (Fosgerau and Hoffmann, 2015;Khera and Maity, 2019), to vaccines (Li and Li, 2020;Liu et al., 2020;Malonis et al., 2020;, sensors (Karimzadeh et al., 2018;Merkx et al., 2019) or protein-based materials (Capezza et al., 2019;de la Rica and Matsui, 2010). While there has been progress toward designing protein folds, much improvement is needed for the redesign or de novo design of protein-protein interfaces (PPIFs). The success rates for the de novo generation of proteinprotein interactions achieved by existing methods are very low, with only a few examples demonstrating that it is possible (Cao et al., 2022;Fleishman et al., 2011;Strauch et al., 2014). Even the most recent experimental work yielded very low success rates of designs that bind to a target protein yet still require substantial computational and laboratory resources (Cao et al., 2022), underlining that it is still highly challenging. Recent works using neural networks substantially improved accuracy in structure prediction (Baek et al., 2021;Jumper et al., 2021;Senior et al., 2020) and protein sequence design (Anand et al., 2020;Chen et al., 2020;Gao et al., 2020;O'Connell et al., 2018). The latter methods outperform traditional methods for sequence design based on energy function integrated into procedures for sampling, filtering and optimization (Adolf-Bryfogle et al., 2018;Desjarlais and Handel, 1995;Raha et al., 2000). The average sequence recovery, which is the ratio of recovered residues to all residues in the structure, achieved by the current top-performing protein-design programs [such as dTERMen ] is around 30%, while the SPROF model (Chen et al., 2020) achieved 39.8% on independent test sets. Based on these inspiring results, we developed a deep learning-based approach for the sequence design of PPIFs. The architecture of our neural network-based approach is inspired by a model for the generation of image captions with visual attention (Xu et al., 2015). For our model, protein structures are treated as a 3D object to be captured and translated into 'words'. Features from the protein complex are extracted using machine learning vision techniques and transformed into amino acid sequences instead of words. We developed two deep learning models, PepSeP1 and PepSeP6; the former has a single sequence output, and the latter produces six amino acid sequence outputs per complex. These models outperformed Rosetta's FastDesign mover (Khatib et al., 2011;Tyka et al., 2011) on an independent test set.
Furthermore, we successfully recovered PPIFs fragments containing interaction hot-spot residues. 'Hot-spots' are a characteristic feature of protein interactions as interface residues do not contribute equally to the binding energy, but rather have a few residues that contribute the majority of the binding energy. Therefore, it is crucial to be able to recapitulate these contacts. As only a few residues have highly energetically favorable interactions (Cukuroglu et al., 2014;Wells and Clackson, 1995), it is crucial to monitor their recovery. Our model is intended to be applied on de novo interface fragments, or 'motifs', which can then be grafted into scaffolding proteins. However, it can be used on any peptide fragment for redesign. Larger fragments can be designed by making subsequent designs of connected backbone fragments. Our method differs from existing deep learning models (Wu et al., 2020;Zhang et al., 2021), which are intended to help docking by defining the interaction pairs of residues on two counterparts rather than creating the counterparts themselves.

Test sets
The method is trained and tested on peptide-binding site complexes extracted from native PPIFs (Fig. 1). The peptide of the complex is a 6-residue fragment of a protein-ligand, and the binding site is a patch of a ligand binder consisting of 24-48 residues which are in immediate proximity to the backbone atoms in the 6-residue fragment. Peptides were perturbed up to 1.07 Å root-mean-square deviation (RMSD) of their native conformation to simulate a more applicable scenario in which peptides deviate from their native positions.
Complexes were selected according to the following criteria: 1. Two residues of a peptide ligand contribute to the binding with increment DDG i > 0.5 Rosetta energy unit (REU) and three nonterminal residues located within 6 Å from a binding site. Measured as the distance between the closest side chain heavy atoms in the native interface.

Resolution thresholds for structures with homo-oligomeric
PPIFs, hetero-oligomeric PPIFs and antibody-antigen complexes are 2.0 Å , 2.5 Å and 3.5 Å , respectively. 3. Non-standard residues and non-nutritional compounds should not be within 5 Å of the fragments of interest. 4. The complexes should have negative binding free energies. 5. Non-polar residues should be present in the hot-spots of a peptide ligand.
The current dataset does not include complexes connected through covalent bonds or containing non-canonical amino acids. All complexes for this study originate from multichain structures obtained from the Protein Data Bank (Berman, 2000).
The complexes were extracted from 9002 co-crystal structures. For our benchmark set, we separated 70 structures containing influenza's hemagglutinin (HA), MERS-CoV, SARS-CoV and SARS-CoV-2 proteins co-crystalized with antibodies (Supplementary Table S2), resulting in 915 complexes. All other complexes with these listed antigens were deleted from the main set. The remaining complexes were further curated to avoid duplicates before splitting randomly into training (8485 files), validation (270 files) and test sets (177 files), resulting in the total number of extracted complexes: 93 458 in the main, 2924 in the validation and 1245 in the test set. We note that all datasets contain antibodies or antibody fragments but not any of the extracted listed antigens from our benchmark set (Table 1, Supplementary Tables S4 and S5).
Our method was evaluated using test and benchmark sets, referred to as 'set T' and 'set B', respectively. Set T is divided into two subsets: T-ho and T-he, based on whether they are correspondingly derived from homo-or hetero-oligomeric PPIFs. Additionally, a subset T-ho-asymm containing complexes from subset T-ho is introduced; complexes in this subset are not extracted from symmetric patches of homo-oligomeric PPIFs. Set B was used for the evaluation of sequence recovery of interfaces of antibody-antigen protein complexes. This set is divided into subsets as well ( Table 1). The quantity of complexes in the benchmark subsets depending on the types of antigens is summarized in Supplementary Table S6. Other details related to the construction of the datasets can be found in the Methods section of Supplementary Information S1.1-S1.5.
Besides simply testing the methods on all-glycine peptide ligands extracted from native interactions, we also tested them on artificially highly perturbed (HP) peptide variations of all-glycine peptide ligands as part of our case studies. Backbones were generated by the iNNterfaceDesign method (Syrlybaeva and Strauch, 2022b, Supplementary Fig. S3). Due to substantial deviations from the native backbones, these cases provide challenging targets for amino acid sequence (AAS) design and recovery of hot-spot interactions.

Input data for deep learning models
Input data for the neural network is based on topological features of the complex and amino-acid sequence of the binding sites (Table 2). We use two types of distance maps as the main geometrical descriptors of the structures, describing either distance between residues within a single interface counterpart (distance maps 1, intramolecular) or across the whole interface (distance maps 2, intermolecular). The distance maps are based on N or O backbone atoms. We augment information regarding the system's geometry by providing secondary structure types of residues of the binding sites through input 3.
Other input data are amino acid types of residues in the binding sites, and input 5 defines the homo-or hetero-oligomeric nature of PPIFs. The ablation study of the impact of different types of inputs on the method's performance is presented in Supplementary Information S1.9.

Pepsep1 model
The model has an encoder-decoder architecture successfully utilized for sequence prediction models (Chen et al., 2020). The model's (c) Types of peptide ligands and their generation: native peptide ligand, the 6-residue fragment depicted in subfigure a (c1), was mutated into an all-glycine peptide ligand and perturbed (c2, Supplementary Information S1.3). Perturbed backbones were either reverted to their native amino acid sequence (c3, annotated as native* within the main text to highlight the backbone perturbation) or designed with PepSeP1 (c4). HP (highly perturbed) backbones (c5) are generated using the iNNterfaceDesign method (Syrlybaeva and Strauch, 2022) architecture is similar to a model from the TensorFlow tutorial (https://www.tensorflow.org/tutorials/text/image_captioning), especially the decoder part (Xu et al., 2015). The encoder of PepSeP1 utilizes two types of convolutional blocks consisting of 8 and 4 convolutional layers, respectively (Fig. 2, more details in Supplementary  Fig. S4). Both convolutional blocks extract feature vectors. They are concatenated to form the tensor F, in which the second dimension equals the number of amino acids to predict. The tensor then becomes an input for the decoder. The decoder is an attention-based recurrent neural network for which we used the Bahdanau-style (additive) attention with long-short-term memory (LSTM) units. We applied a bidirectional attention approach to provide more context for each prediction. This approach implies that feature vectors are processed by LSTM layers twice, in direct and reverse directions (Fig. 2). The attention mechanism produces context vectors by processing concatenated hidden states of the LSTM layers.

Pepsep6 model
The encoder of the PepSeP1 model was extracted and incorporated into the PepSeP6 model without changes (Supplementary Fig. S5). The weights of the encoder were set as untrainable during training. Five outputs are generated by passing feature vectors produced by the encoder into the decoder of PepSeP6 five times; each of the five iterations is accompanied by a final hidden state from the previous prediction of the sequence. The sixth sequence is the output of the PepSeP1 model. The neural network was built in TensorFlow using Keras application programming interface (more details under Supplementary Information S1.6).

Training of PepSeP1 and PepSeP6 models
Training of the PepSeP1 model was conducted 20 times in three stages: 5 epochs with a learning rate of 0.001, 5 epochs with a learning rate of 0.0001 and 2 epochs with a learning rate of 0.00002. Categorical cross-entropy loss was applied based on comparing full target and predicted sequences. The trained model with the highest sum of rates of recovery of native sequences measured on heterooligomeric PPIFs was selected (subsets T-he, B-ab/ag and B-ag/ab).
PepSeP6 model was trained ten times using the pretrained PepSeP1 using the following three stages: 4 epochs with a learning rate of 0.001, 4 epochs with a learning rate of 0.0001 and 2 epochs with a learning rate of 0.00002. The Adam optimizer was used to minimize the mean squared error during optimization. A custom loss function was implemented for the training of PepSeP6; the details can be found in Supplementary Information S1.7 and S1.8.

Refinement and calculation of binding energies of complexes of test sets
The target binding sites were relaxed without peptide ligands. Perturbed peptides were then added back either with their native or redesigned AAS (c3 and c4 in Fig. 1). Optimization of side chain conformations of all residues of the complex was done applying the FastRelax mover three times over 300 steps. The structure with the lowest score out of the three results was selected for the subsequent refinement of poses. This operation applied the FastRelax mover three times over 300 steps while applying harmonic constraints (standard deviation SD ¼ 1.0 Å with width parameter of 1.5 Å ) for peptide ligands based on perturbed native-like backbones (c2). Less restrictive constraints (SD ¼ 3.0 Å with a width parameter of 2.0 Å ) were used for HP peptides (c5) predictions. Binding free energies of the complexes were estimated using InterfaceAnalyzerMover (Stranges and Kuhlman, 2013) with repacking chains after separation. The Rosetta scoring function ref15 was used for all calculations.

Redesign of complexes
Peptide ligands designed by the PepSeP1 method underwent an additional design step after refinement using the FastDesign protocol to compare results with the original performance of PepSeP1. We set constraints on residue types according to position-specific scoring matrices based on outputs of PepSeP1 (Supplementary Information S1.7). We performed three different protocols of the redesign. We controlled how much possible amino acid types were allowed at each position. Variations are denoted as RD3, RD5 and RD20, reflecting whether the most probable three or five amino acid types were selected, or all amino acids were allowed. We also performed a redesign using the FastDesign protocol on all-glycine peptide backbones. Two relaxation scripts were utilized during the redesigns: default (MonomerRelax2019) and InterfaceDesign2019.

Performance of PepSeP1 model
To evaluate the performance of PepSeP1, we utilized sequence recovery. The overall sequence recovery accuracy (R all ) on all peptide ligand residues of set T is 40.83%. However, the results depend substantially on the types of PPIFs: native sequence recovery rates for homo-and hetero-oligomeric PPIFs are 46.73% and 27.71%, respectively (Table 3). In our dataset, heteromeric protein interfaces are in general weaker transient interactions. Furthermore, the residue composition of transient interactions is more diverse, including higher rates of polar and charged groups alongside hydrophobic amino acids (Acuner Ozbabacan et al., 2011). These factors likely drove the lower peptide recovery rates of 27.71% for heterooligomeric PPIFs (Table 3). The performance of non-symmetric homo-oligomeric complexes from subset T-ho-asymm was at 31.26%, close to the performance on subset T-he. We next applied PepSeP1 to assess transient PPIFs of antigenantibody interfaces using subsets B-ab/ag and B-ag/ab, containing complexes extracted from antibody-antigen complexes (Table 1). Achieved recovery success rates are 26.00% and 16.78%, respectively. Detailed data regarding the accuracy of the method depending on the types of antigens can be found in Supplementary Table S8. The best results for PepSeP1 are observed in the case of MERS-CoV: 30% and 25.44% on subsets B-ab/ag and B-ag/ab, correspondingly. Detailed descriptions of the performance of the model on different regions of antibodies from subset B-ab/ag [framework region, complementarity determining regions (CDR): H1, H2, H3, L1, L3] are summarized in Supplementary Table S9. The highest rates of sequence recovery are observed on CDR-H2 loops (40.91%); the average rate across all samples of CDR loops was 25.6%. These results are lower in recovery than methods that utilizing contextual information of antibodies or structural libraries, with recovery rates greater than 70%, such as RosettaAntibodyDesign (RAbD) (Adolf-Bryfogle et al., 2018) using CDR loop libraries for sampling or RosettaSurf (Scheck et al., 2021). Another example of a method designing CDR loops using contextual information, namely, the structure of the framework region, is the deep learning model RefineGNN (Jin et al., 2021), trained and tested on CDR-H3 loops; the method achieved an accuracy of 35.57%. That result is lower than the performance of PepSeP1 on CDR-H2 loops but higher than our results on CDR-H3; the performance of RAbD is 28.53% on that test (Jin et al., 2021). It should be noted that the performance of PepSeP1 was measured on challenging cases by testing its performance on the c2 peptide fragments with an average perturbation of 1.07 Å RMSD from their native positions.
As expected, the secondary structure of the peptide fragment impacts the recovery rates (Supplementary Table S10): the highest rate of 41.82% was observed for b-sheet structures; the accuracy of predictions for alpha-helices and loops are lower (40.16 and 38.55%, respectively). High rates of R all are observed on subsets Bab/ab and B-ag/ag. However, B-ag/ag consists of many complexes originating from symmetric homo-oligomeric PPIFs, and 36.13% of complexes from subset B-ab/ab are encountered in the training set (Supplementary Table S5).
The model's performance on recovery of hot-spot residues of the peptides was also considered. Energetic contributions of individual residues to the binding DDG i were obtained by alanine scanning (Kortemme et al., 2004). Contacts were treated as hot-spots if they had a binding energy contribution of at least 3 REU. Success rate R hotÀspot was calculated with respect to hot-spot residues of native peptide ligands. R hotÀspot exceeds R all by 6% approximately and equals to 45.95% on set T. Rates R hotÀspot measured on subsets Tho-asymm and T-he are similar and approximately equal to 35%. Recovery of hot-spot residues on subset B-ab/ag is 32.97% which is close to the results of other subsets. We received the lowest results on the subset B-ag/ab: low recovery rates were obtained on influenza's hemagglutinin (24.14%, Supplementary Table S8) (which constitutes 82% of complexes of this subset), and the hot-spots on surfaces of SARS-CoV-1 and SARS-CoV-2 (0%).
Assessing binding free energies DG native Ã B and DG designed B across test subsets, we see that the binding sites have a higher affinity for peptide ligands with native sequences, but the difference is small or equal to 0.5 REU on set T only (Supplementary Table S11). We estimated complexes with native* peptides (c3) instead of native ones (c1) during calculations of energetic metrics, for comparison with the designed complexes, in order to eliminate the systematic superiority of native peptides due to more favorable backbone conformations as these did not undergo perturbation. A comparison of native and designed by the PepSeP1 method complexes in more detail is discussed in Supplementary Information Sections S2.1 and S2.2. The accuracy of the predictions depends on the relative solvent accessible surface area (SASA) of a given residue and the D i G P-value of the native PPIFs (Supplementary Fig. S10). We observed some reduction of R all with increasing of both SASA, as seen before (Chen et al., 2020), and D i G P-value. The decrease is more prominent for homooligomeric PPIFs.

Performance of PepSeP6 model
As there can be different solutions to binding to the same interface (Brian and David, 2000;DeLano et al., 2000), we also integrated a variation of our software that produces six sequences, called PepSeP6. The output sequences differ within three positions on average. However, we also observed completely identical sequences, or simply 1 solution only ( Supplementary Fig. S12). Such convergence is observed when all six outputs are generated with high rates of sequence recovery (80-100%).
R all of the model across all six outputs is 38.67% (Table 3), which is lower than R all of PepSeP1. However, R all calculated only across the outputs most matching the native sequences out of six is equal to 47.48% (Supplementary Table S13). The difference in binding energies of complexes with all six sequences is only within 1 REU in average ( Supplementary Fig. S14), the values are very close to native ones. Thus, PepSeP6 sequences provide comparable binding energies and could present alternative binding solutions as seen in nature. More discussion of the performance of PepSeP6 model is presented in Supplementary Information S2.3.

Performance of Rosetta's FastDesign protocol and redesigns according to RD3, RD5 and RD20 schemes
The Rosetta FastDesign protocol is currently a commonly used protocol for redesigning AAS, demonstrating many experimentally validated designed protein interfaces (Cao et al., 2020;Huang et al., 2016;Jacobs et al., 2016;Linsky et al., 2020;Silva et al., 2019). Here, we compared the performance PepSeP1 model with FastDesign itself and with the results of combining these two methods in three versions, RD3, RD5 and RD20, described in Section 2. FastDesign results obtained using the default relax script (MonomerRelax2019) are shown here (Table 3). Overall, the redesigns of PepSeP1 sequences did not result in higher recovery rates of native residues according to R all R all values of RD3, RD5 and RD20 approaches. PepSeP1 substantially outperformed FastDesign in sequence recovery when both are applied to the same all-glycine peptides. However, FastDesign provides higher rates of R hotÀspot in the case of T-he and B-ag/ab subsets. Rates of recovery obtained using InterfaceDesign2019 relax script are reported in Supplementary  Table S15.
The average binding affinities of designs obtained by FastDesign are noticeably lower than the affinities calculated for PepSeP designs (Table 3, Supplementary Fig. S14). Analysis of amino acid distribution ( Supplementary Fig. S15) reveals that FastDesign incorporates too many proline residues on perturbed c2 backbones. Also, the designs excessively include glutamate residues.

Case study: antibody-antigen interactions
The antibody CR3022 binds to the receptor binding domain (RBD) of the Severe Acute Respiratory coronavirus 2 (SARS-CoV-2) (Fig. 3a). To evaluate sequence recovery, we extracted each loop of the heavy chain that forms contacts with the RBD in the form of 6-residues fragments. Heavy_V_Gene of antibody is IGHV5-51 (Human) (Schneider et al., 2022), sequence identity of H1-H3 loops to germline are 77%, 80% and À1, respectively, according to PyIgClassify (Adolf-Bryfogle et al., 2015). After the perturbation of these contacts and changing them to poly-glycine residues, we redesigned each fragment using PepSeP1 and FastDesign. R all is equal to 50% in the case of the first two designs by means of PepSeP1; residues at the fourth position in both native and designed amino acid sequences of the first fragment have a functional similarity. Most contacts do not contribute to much of the binding energy in CDR-H3 and accordingly, we only see R all of about 16% in the cases of both PepSeP1 and FastDesign designs. FastDesign outperformed PepSeP1 in two cases regarding DG B values.
The 5J8 antibody is a broadly neutralizing antibody that binds to residues within the receptor binding site of influenzas hemagglutinin (HA). Its main contact CDR-loop exceeds six residues, so we performed the PepSeP1 design by aligning two structures predicted for 6-residue fragments (Fig. 3b). This illustrates how re-design can be used for fragments longer than sixmers. The antibody has two residues strongly contributing to the binding with the HA1 subunit over positions 97-100E of chain H: Tyr100 DDG i (DDG i ¼ 3.7 REU) and Asp100B DDG i (DDG i ¼ 8.2 REU). The CDR-loop belongs to cluster H3-17-* according to PyIgClassify, the sequence identity of CDR to germline is À1.
One of the most important contacts of the antibody loop with the receptor binding site is the aspartate at position 100B, mimicking the carboxy group of HA's receptor sialic acid (Schmidt et al., 2015), which PepSeP1 captured. We further saw the insertion of a tryptophan at position 100, which establishes a large contact area with the aliphatic part of Lys133 of the HA molecule by PepSeP1. This is another crucial contact, as aromatic or hydrophobic contacts have been described as another canonical interaction with the receptor binding site (Ekiert et al., 2012). Additionally, we saw the recovery of Ser98. FastDesign recovered the identical residues and Pro100A. However, Tyr or another aromatic or hydrophobic residue at position 100 was not predicted.
To illustrate performance on highly perturbed (HP) peptide fragments (c5), we applied PepSeP6 to design the main contact CDRloop of the broadly neutralizing stem antibody FI6. We generated 9residue HP peptide ligands (c5) with an RMSD of about 4.0 Å relative to the CDR-loop of its original location. The FI6 antibody fragment has four hot-spots residues: Arg99 (DDG i ¼ 3.2 REU), Leu100A (DDG i ¼ 4.1 REU), Tyr100C (DDG i ¼ 3.9 REU) and Phe100D (DDG i ¼ 4.4 REU). The CDR-loop belongs to cluster H3-22-* according to PyIgClassify, the sequence identity of CDR to germline is À1. All designed peptide sequences have lower computed binding energies than the native complex. A design depicted for HP1 has recovered residues at positions 100C-100D. Residues Leu100A Table 3. Performance of different methods applied to all-glycine (c2) backbones on recovery of native residues of peptide ligands and corresponding average binding energies DG designed B . Cells are colored proportionally to their values in different shades of gray to highlight the highest and lowest values of obtained results. and Ser100 are recovered partially by Ile100 and hydroxylic Tyr99. Designs of HP2 have low binding energies and recovered interactions as well. For example, design four has recovered positions of Phe100D and Trp100F; it has Arg close to Arg99. Most designs have the crucial aromatic contact mimicking Phe100D in the native AAS, making important hydrophobic contacts with residues in the receptor pocket of HA (Fig. 3d) (Davide et al., 2011).
The results show that PepSeP1 and PepSeP6 can reproduce relevant contacts. The resulting peptide ligands show high-binding affinity and often outperform designs of FastDesign, especially in the case of redesigning highly perturbed native backbones. We can thereby illustrate its usefulness for homology models and potentially de novo designed backbones as they likely are not at the exact position they should beas either method has a high margin of error. Thus, a more knowledgebased design process can guide the backbone refinement. The PepSeP6 method provides more diverse designs, and iterations through which sequences with higher recovery rates can be revealed. As there are multiple solutions to binding at a specific epitope, it will also be useful for any affinity optimization processes.

Conclusions
The neural network designed for the recovery of peptide ligand sequences at a known protein binding site is performed in this study. To our knowledge, this is the first neural network model for the prediction of amino acid sequences for peptides involved in interchain interactions. The model was developed in two versions: PepSeP1 and PepSeP6 with correspondingly single and multiple (six) outputs. The native sequence recovery rate of PepSeP1 is 40.83% on the independent test set; the average accuracy of PepSeP6 designs is 38.67%, with recovery rates of 48.63% across the output sequences resembling the native structures the most. The models are characterized by training on non-perfect backbones of structures which make them more applicable either to work with homology models which are likely not at atomic accuracy or for the engineering of novel interaction in which the motifs are derived de novo motifs.

Funding
This work was supported by the National Institutes of Health [R01AI140245 to E.M.S.].