BIPSPI+: Mining Type-Specific Datasets of Protein Complexes to Improve Protein Binding Site Prediction

Computational approaches for predicting protein-protein interfaces are extremely useful for understanding and modelling the quaternary structure of protein assemblies. In particular, partner-specific binding site prediction methods delineate the specific residues that compose the interface of protein complexes. In recent years, new machine learning and other algorithmic approaches have been proposed to solve this problem. However, little effort has been made to find better training datasets to improve the performance of these methods. With the aim of vindicating the importance of the training set compilation procedure, in this work we present BIPSPI+, a new version of our original server trained on carefully curated datasets that outperforms our original predictor. We show how prediction performance can be improved by selecting specific datasets that better describe particular types of protein interactions and interfaces (e.g. homo/hetero). In addition, our upgraded web server offers a new set of functionalities such as the sequence-structure prediction mode, hetero- or homo-complex specialization and a guided docking tool that computes 3D quaternary structure poses using the predicted interfaces. BIPSPI+ is freely available at https://bipspi.cnb.csic.es.


Introduction
Protein-protein interactions (PPIs) play a pivotal role in most biological processes and thus, understanding how PPIs occur is an important step towards elucidating how these processes take place in cells and organisms. Studying the biochemical underpinnings behind PPIs can be better approached from a structural perspective.

Experimental techniques such as X-ray crystallography, nuclear magnetic resonance or cryo-electron microscopy are capable of solving the 3D structure of PPIs, in many cases reaching atomic resolution. However, these techniques are expensive and time-consuming, and they cannot keep pace with the amount of interactomic data generated every year. As a result, many computational approaches have been developed to complement experimental methods and provide PPI details at different levels of granularity.
In recent years, many computational methods have been designed to characterize PPIs when different levels of molecular information are available. 2][3][4][5] When protein atomic models are not available, new deep learning methods have been highly successful at predicting the tertiary structure of proteins. 6,7 However, quaternary structure prediction is more challenging and, although initial steps have been taken in that direction, they are computationally demanding and still require manual intervention. 8 As an alternative, lower granularity predictions can be computed fully automatically with lower computational requirements. 0][21][22] Contrary to conventional binding site prediction (non-partner specific), which aims to predict all the residues of a given protein that participate in any interaction, partner-specific methods seek to identify those residues that are involved in a particular PPI. Since proteins tend to interact with many distinct partners 23 and the involved interfaces can be quite different, partner-specificity is a convenient feature when studying a particular PPI.
5][26][27][28][29][30] Most of these methods aim to predict pairs of interacting residues, each belonging to a different protein partner, using machine learning algorithms trained over datasets derived from atomic models of protein complexes. Although several algorithmic approaches have been proposed, little emphasis has been placed on the dataset used for training and developing these approaches. Thus, most partner-specific predictors, especially the recently published ones, have been limited to small datasets, mainly the different versions of the Protein-Protein Docking Benchmark. 16,19,20,25 Indeed, to the best of our knowledge, only the works of Meyer et al. and Townshend et al. tried to build datasets for this particular problem, yet their impact on performance was not analysed in detail. 24,27 More importantly, only a single strategy for dataset compilation was considered.
In this work, we present BIPSPI+, a new version of our partner-specific binding site predictor that illustrates how a carefully selected training dataset can substantially improve the performance of machine learning-based methods. BIPSPI+, like the original version, 21 can be employed to predict the binding sites of two interacting proteins given either their sequences or their structures. The new version offers a novel mode that can be used in those cases in which only the structure of one of the partners is known, exhibiting better performance than the sequence-only version. Additionally, the new approach was trained independently to predict binding sites for hetero- and homo-complexes. Overall, BIPSPI+ outperforms the original version in all studied datasets irrespective of the input type, with especially notable improvements for homo-complex predictions.
In addition to offering better performance, the BIPSPI+ web server has been upgraded to include a new guided docking option that runs PatchDock 2 using BIPSPI+ predictions as restraints. As a result, BIPSPI+ can now provide both binding site predictions and atomic models for the PPIs. To our knowledge, our method and that of Ahmad and Mizuguchi are the only partner-specific predictors available through web servers, and only ours allows users to directly perform guided docking from the predictions.
The BIPSPI+ web app is publicly available at https://bipspi.cnb.csic.es and as a stand-alone tool at https://github.com/rsanchezgarc/BIPSPI.

Methods
BIPSPI is a machine learning-based partner-specific binding site predictor trained on structurally solved protein assemblies deposited in the PDB. 31,32 The training set consists of interacting and non-interacting residue pairs obtained from the 3D structure of protein complexes using a distance threshold criterion. BIPSPI+ is an upgraded version of the BIPSPI v1 web platform that implements three new major features: a new input mode (sequence & structure), complex-type stratification (homo-complex vs hetero-complex mode), and an optional step of guided Protein-Protein Docking (PP-docking). The following section briefly presents these new features, summarized in Figure 1. For a complete description of the method, we refer the reader to the Supplementary Material section 1.

Input: sequence-sequence, structure-structure and sequence-structure modes
BIPSPI v1 could be employed to predict the interacting residues of two protein structures or two sequences. BIPSPI+ has been redesigned to work also for cases in which only the structure of one of the interacting partners is known. This new input mode, which we term the "sequence-structure mode", employs sequence-only features to describe the amino acids of the partner with no atomic model, whereas residues of the other partner of the complex are described employing all features as in structure-structure mode. Consequently, the result page for this mode (Figure 1(j)) is a hybrid of the structure-structure (Figure 1(i)) and the sequence-sequence mode (Figure 1(h)) viewers, consisting of a 3D-viewer for the partner with structure and a sequence panel for the partner with unknown structure.

Guided docking
The BIPSPI+ web platform has been upgraded to perform an optional step of guided protein-protein docking using PatchDock 2. Thus, after computing binding site predictions for protein structures, the user can select, using a threshold slider and a table with checkboxes, the residues that will be used as restraints for guided docking. Then, PatchDock is executed with default parameters. After execution, a results page allows interactive visualization of the highest-scoring predicted poses as well as downloading the atomic models and raw files generated during the docking step (see Figure 1(k-l)).

Homo-complexes and hetero-complexes datasets
BIPSPI+ was trained on two different datasets, one consisting only of hetero-complexes (HEMt) and another consisting only of homo-complexes (HODt), both being more than one order of magnitude larger than the original BIPSPI v1 training dataset. Supplementary Material sections 1.1, 1.2 and 2.5 describe these and other studied datasets. Similarly, the performance of our method was assessed against two testing datasets representing the two possible types of complexes. In particular, we employed the Protein-Protein Docking Benchmark v5 (Bv5), composed of 230 hetero-complexes, and a custom evaluation benchmark, which we termed HOe (Homo-complexes evaluation), composed of 223 homodimers. Since the HOe dataset contains only complexes in bound state, performance could be overestimated when using structural features, but since the comparison against BIPSPI v1 was also carried out on this dataset, the conclusions about improvement can be considered robust. Moreover, it is important to note that the model trained on sequence-based features only is not affected by this problem and thus, its performance estimation is reliable.

BIPSPI+ usage
Figure 1. BIPSPI+ execution options. The protein complex to be predicted can be either a homo-complex (a) or a hetero-complex (b). Depending on the availability of atomic models, BIPSPI+ can be executed under different modes. The sequence-only mode is used if none of the structures is known (c and e), results being displayed in the Seq-Seq viewer (h). For the case of hetero-complexes in which only one of the structures is available (g), the sequence vs structure mode is used and the results are displayed in the Seq-Struct viewer (j). Finally, if the structure of the monomer is known for homo-complexes (d) or the structure of the two interacting partners is known for hetero-complexes (f), the full structure mode is used instead. In this case, results are displayed in the Struct-Struct viewer (i), in which the structure of the two interacting partners, or two copies of the monomer, are shown. From this viewer, it is possible to execute guided docking, selecting the subset of residues to be employed as constraints (k). Docking results are displayed in the Docking viewer (l).

BIPSPI+ can be employed to obtain partner-specific binding site predictions for protein complexes. First, the user needs to select the oligomerization state of the PPI as homo-complex or hetero-complex (Figure 1(a and b), Supplementary Figure 1). Then, the user needs to provide either a sequence or an atomic model for the monomer in the homo-complex case (Figure 1(c and d), Supplementary Figure 1) or for both interacting partners in the hetero-complex case (Figure 1(e-g), Supplementary Figure 1). The five combinations of oligomerization type and input type (homo-sequence, homo-structure, hetero-sequence-sequence, hetero-sequence-structure, and hetero-structure-structure) are processed by five different models, each trained on the same type of data as its input.
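The mapping from oligomerization state and input types to the five models can be sketched as follows. This is only an illustration of the dispatch logic: the function name `select_model` and the returned labels are hypothetical and are not taken from the BIPSPI+ code base.

```python
VALID_INPUTS = ("sequence", "structure")


def select_model(complex_type, input_a, input_b=None):
    """Return a label for one of the five models named in the text.

    complex_type is "homo" or "hetero"; inputs are "sequence" or "structure".
    All names are illustrative assumptions, not BIPSPI+ identifiers.
    """
    if complex_type == "homo":
        # A homo-complex is described by a single monomer.
        if input_a not in VALID_INPUTS:
            raise ValueError("unknown input type: %r" % (input_a,))
        return "homo-" + input_a
    if complex_type == "hetero":
        kinds = [input_a, input_b]
        if any(k not in VALID_INPUTS for k in kinds):
            raise ValueError("hetero-complexes need two valid inputs")
        # The order in which the partners are given does not matter.
        return "hetero-" + "-".join(sorted(kinds))
    raise ValueError("unknown complex type: %r" % (complex_type,))
```

For example, `select_model("hetero", "structure", "sequence")` selects the sequence-structure model regardless of which partner carries the atomic model.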
After calculations, binding site predictions are displayed in one of the three different types of viewers depending on the input type (Figure 1(h-j)). In each of the viewers, the predicted interface residues with a score greater than the selected threshold are highlighted on the input sequences or structures (Figure 1(h-j)). Thresholds can be changed using a slider that displays the expected precision for the predictions given the current value of the threshold. 21 For ease of visualization, homo-complex results are displayed using the same graphical interface, in which two exact copies of the input monomer and the predictions are displayed as independent partners.
Finally, for homo-complexes with structure or hetero-complexes with structures for both partners, it is possible to launch a guided docking job using as restraints the binding site residues predicted by BIPSPI+ at different thresholds (Figure 1(k), Supplementary Figure 3) or a custom subset of them, obtained by checking the ignore checkbox of some of the residues with scores above the selected threshold. Once the residues to be used as restraints are selected, the docking calculations are carried out, and the highest-scoring docking results are displayed in the Docking viewer, in which the user can visually inspect or download the proposed models (Figure 1(l), Supplementary Figure 4).
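The selection of docking restraints described above (threshold plus ignore checkboxes) amounts to a simple filter. The sketch below assumes that predictions are available as a mapping from residue identifiers to scores; the function name `select_restraints` and the data layout are hypothetical.

```python
def select_restraints(scores, threshold, ignored=()):
    """Pick residues to pass to the docking step as restraints.

    scores: dict mapping a residue identifier, e.g. ("A", 45), to its
            predicted binding-site score (assumed layout).
    threshold: residues must score strictly above this value.
    ignored: residues the user excluded via the ignore checkbox.
    """
    ignored = set(ignored)
    return sorted(
        res for res, score in scores.items()
        if score > threshold and res not in ignored
    )
```

Raising the threshold yields fewer but more confident restraints, which mirrors the trade-off exposed by the slider.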

Better training data enhances performance
Since the performance of machine learning methods is strongly influenced by the amount and quality of the available data, it seems reasonable to believe that partner-specific binding site prediction can also benefit from larger and better training datasets. However, obtaining PPI complexes for a training dataset is challenging. First of all, the total number of solved complexes represents only a small fraction of the interactome. For instance, in humans, less than 10% of the binary interactions have been structurally solved. 33 Second, there are very few examples for which both the bound and unbound structures are known, most of them contained in the Bv5. While the former problem cannot be directly tackled until more experimental data is obtained, the latter may be less critical for methods like BIPSPI, which integrates both structural and sequence-based features. 21 Consequently, for the second version of our method, we constructed larger training datasets that, for the majority of the complexes, do not contain the unbound version of the interacting partners. Despite this limitation, as shown by the blue and red curves in Figure 2 and described in Supplementary Material section 2.2, the inclusion of more bound complexes in the dataset significantly improved results over BIPSPI v1. Thus, for the Bv5 using structural information, we measured a mean ROC AUC for residue-residue pair interactions of 0.927 (median ROC AUC of 0.951) and a ROC AUC of 0.848 for binding site prediction. The new version increased both metrics with respect to our original method (0.905 and 0.823, respectively), achieving state-of-the-art performance (see Supplementary Table 5). For a detailed description of the evaluation approach see Supplementary Material section 1.3.
In addition to the size of the dataset, we also studied some other parameters that affect the quality of the data. For instance, we showed (see Supplementary Material section 2.3 and Supplementary Figure 8) that the inclusion of multimers, despite multiple caveats such as automatic receptor/ligand definition, enhances prediction performance for both dimers and multimers. Other studied parameters are discussed in Supplementary Material sections 2.4-2.5.
Another important challenge when increasing the size of the dataset is the fact that most of the protein complexes contained in the PDB correspond to homo-complexes while the standard testing dataset, the Bv5, only contains hetero-complexes. Although the physics behind homo-complexes is the same as in hetero-complexes, statistical analyses show that physicochemical features of hetero- and homo-complex interfaces differ in many aspects such as contact preference, composition or hydrophobicity. 11 Consequently, some difference in performance could be expected depending on the oligomerization state. However, when we first studied the impact of the oligomerization type, the observed difference in performance for BIPSPI v1 was beyond our expectations, with a difference in MCC of 0.15 (see Supplementary Table 1) and important precision drops in the high-threshold region (high precision and low recall), the most interesting one for experimental validation (see Figure 2 left vs right panel). For BIPSPI+ we included homo-complexes in the training dataset using two strategies: first, training using two different datasets, one for each complex type (HEMt and HODt) and second, combining HEMt and HODt into one single training dataset (HEHODt). Supplementary Table 1 and Supplementary Figures 5-6 show that the first strategy offers comparable or better results for all the analysed benchmarks, suggesting that the information extracted by our method from one oligomerization state is of little value, if not harmful, for the other type. Consequently, in BIPSPI+, the users are required to select the oligomerization state and the two trained models are applied accordingly.

Sequence-structure mode
Partner-specific binding site predictors require sequence information and/or structural data for both interacting partners as input. So far, existing methods only consider the symmetric cases in which either the two structures or the two sequences are present. However, it is quite common that only the structure of one of the interacting protein partners is available (e.g., modelling low-resolution regions in cryo-EM maps, synthetic designs, etc.). Given that structural data allow for better prediction performance, the common alternative of approaching those cases as if only the sequences were available is not compelling. In order to overcome this shortcoming, we have developed for BIPSPI+ the sequence-structure (seq-struct) mode.
We evaluated the performance of the seq-struct mode using Bv5 as evaluation benchmark and we studied how the new mode performed on both the input provided as sequence and the one provided as structure (see Supplementary Table 3). As expected, the quality of the predictions for the seq-struct mode, with an MCC value of 0.331, lies between the performance of the model that only employs sequences (MCC of 0.311) and the model that employs structures from the two partners (MCC of 0.403). For more details, see Supplementary Results section 2.6.
Figure 3(a) illustrates the benefits of this new execution mode on 2OZA, one of the protein complexes of the Bv5 for which we computed the predictions providing as input either the two sequences of chains A and B (X in unbound) or the sequence of chain B and the structure of chain A. Direct inspection shows that, when the structure of the studied protein partner is employed, the quality of the predictions largely improves. Thus, for chain A, the accuracy at threshold 0.5 is 0.60 when only the sequences are employed. However, when the structure of chain A is employed, accuracy rises to 0.89.

Guided docking
While binding site predictions are invaluable sources of hypotheses for multiple experimental scenarios (e.g. mutagenesis experiments), when possible, 3D atomic models of the protein complexes offer a much richer description of PPIs. BIPSPI predictions have been successfully used as guided PP-docking constraints, [34][35][36] improving the quality of 3D models. However, guided docking pipelines tend to be complicated, involving several computational steps and requiring a good understanding of the different tools. 37 With the aim of facilitating the generation of 3D models, we have included a simple guided docking pipeline based on PatchDock, a rigid body docking algorithm based on geometric hashing. Our pipeline simply requires the users to select a threshold for the binding site predictions so that the selected residues will be provided to PatchDock as binding site restraints, limiting the search space to those poses compatible with the selected restraints. We acknowledge that our pipeline is simple and, consequently, better results could be obtained using more complicated pipelines and/or algorithms. However, our intention was to develop a user-friendly solution to retrieve fast initial structural models that could help the users to understand the binding site predictions.
Despite our pipeline's simplicity, accurate models can be obtained in many cases, provided that conformational changes are not severe. Figure 3(b) illustrates an example of a 3D model for the Subtilisin Carlsberg-OMTKY3 complex (PDB code 1YU6, 38 chains A and C respectively) computed using the BIPSPI+ web application. Direct inspection of the figure shows that binding site predictions for this complex were of high quality, with an important part of the binding site accurately predicted. These accurate predictions ultimately allowed the docking algorithm to propose a high-ranked solution (3rd) of medium quality (iRMS 1.5 Å, DockQ = 0.64 39) in a totally automatic fashion.

Conclusion
Partner-specific binding site predictions have proven to be a useful resource in several contexts, especially for guiding protein-protein docking. Consequently, new approaches have been developed in recent times. However, while most of the new methods place special emphasis on algorithmic aspects, the crucial impact that datasets have on performance has not been studied in depth. With the aim of addressing this issue, we developed BIPSPI+, an improved version of our original method, trained on carefully selected datasets of complexes, that exhibits enhanced performance. While BIPSPI+ outperforms BIPSPI v1 in all the evaluated benchmarks, the performance boost is especially large for homo-complexes. In addition to enhanced performance, the BIPSPI+ web application, freely available at https://bipspi.cnb.csic.es, has been updated to easily deal with homo-complexes and also with hetero-complexes in which only one of the interacting partners is structurally solved. Finally, the BIPSPI+ web application offers an optional step of guided protein-protein docking that can provide users with complete structural models of the protein interaction.

BIPSPI+: Mining type-specific datasets of protein complexes to improve protein binding site prediction
R Sanchez-Garcia 1,2* (orcid 0000-0001-6156-3542), JR Macias 1 (orcid 0000-0003-2621-6806), COS Sorzano 1 (orcid 0000-0002-9473-283X), JM Carazo 1 (orcid 0000-0003-0788-8447), J Segura 3 (orcid 0000-0003-0788-8447)

Training datasets
The training datasets employed in this work can be broadly classified into three categories: hetero-complex-only datasets; homo-complex-only datasets; and a mixture of the previous two types. As a representative of the hetero-complex-only category, we compiled a new dataset, termed HEDt, consisting of 2,401 bound heterodimers collected from the PDB. Additionally, we extended the HEDt dataset to include 1,571 hetero-multimers, resulting in the HEMt dataset. It is important to note that, contrary to the case of dimers, when multiple chains are included in the atomic models, splitting the complex into its ligand and receptor components is not trivial and, in many cases, subjective. At the risk of making mistakes (and introducing noise during training but not in evaluation), we simply extracted each of the interacting chains together with the other chains that are in contact with each of them. In the case of a shared interacting chain, we assign it to one of the partners randomly. Similarly, for the homo-complexes dataset, we collected from the PDB 1,981 bound homodimers that constitute the HODt dataset. Finally, the HEHODt dataset is composed of the heterodimers included in HEDt and the homodimers contained in HODt, thus exhibiting a balanced proportion between homo-complexes and hetero-complexes (~8:11).
For the sequence-structure mode, we generate two training instances for each complex, one in which the sequence is available for the ligand and the structure for the receptor and another in which the roles are reversed.

Dataset compilation
All the protein complexes included in this study, except for the ones comprising Bv5 1 , were collected from the PDB database according to the following criteria: 1) resolution better than 3.5 Å; 2) number of residues structurally determined >50%; 3) sequence length of each chain >30 residues; and 4) number of interacting residue pairs >10.
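The four inclusion criteria can be expressed as a single predicate. The sketch below is only an illustration: the `ComplexEntry` record and its field names are hypothetical, and the per-complex values are assumed to have been computed beforehand.

```python
from dataclasses import dataclass


@dataclass
class ComplexEntry:
    """Hypothetical summary of one PDB complex (field names are assumptions)."""
    resolution: float          # in Å; lower is better
    fraction_resolved: float   # fraction of residues structurally determined (0-1)
    chain_lengths: tuple       # sequence length of each chain
    n_interacting_pairs: int   # number of interacting residue pairs


def passes_filters(entry):
    """Apply criteria 1-4 as stated in the text."""
    return (
        entry.resolution < 3.5                          # 1) resolution better than 3.5 Å
        and entry.fraction_resolved > 0.5               # 2) >50% of residues determined
        and all(n > 30 for n in entry.chain_lengths)    # 3) each chain >30 residues
        and entry.n_interacting_pairs > 10              # 4) >10 interacting pairs
    )
```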
Because some protein families are overrepresented in the PDB, using all available protein complexes would result in biased datasets, potentially leading to poor performance in machine learning models. For this reason, in order to preserve diversity while reducing redundancy, we used a combination of pairs of SCOPe families and sequence-based clustering as sampling criteria. In particular, we grouped protein complexes based on their SCOPe families and, within each SCOPe pair group, we further divided the complexes into groups according to sequence identity (95% threshold). Then, we selected as the representative of each group the structure with the best resolution. Protein chains lacking SCOPe family classification were grouped based only on sequence identity-based clusters using a 30% identity threshold. This threshold may be considered as the limit to observe 3D structural similarity between proteins and thus avoids including a large amount of structural redundancy from the null class 2. Proceeding in this way, we reduce the redundancy introduced by highly populated SCOPe families and non-classified proteins while preserving most of their diversity.
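The representative selection described above can be sketched as a grouping step followed by a best-resolution pick. This is a simplified illustration under several assumptions: cluster identifiers at 95% and 30% identity are taken as precomputed per complex, and the dictionary keys (`scope_pair`, `cluster95`, `cluster30`) are hypothetical names, not those of the actual pipeline.

```python
def pick_representatives(complexes):
    """Keep the best-resolution complex of each redundancy group.

    complexes: iterable of dicts with keys 'id', 'resolution' (Å),
    'scope_pair' (tuple of two SCOPe families, or None if unclassified),
    'cluster95' and 'cluster30' (precomputed sequence cluster ids).
    """
    best = {}
    for c in complexes:
        if c["scope_pair"] is not None:
            # Group by unordered SCOPe family pair, then by 95% identity cluster.
            key = ("scope", tuple(sorted(c["scope_pair"])), c["cluster95"])
        else:
            # Unclassified chains: group by 30% identity cluster only.
            key = ("unclassified", c["cluster30"])
        if key not in best or c["resolution"] < best[key]["resolution"]:
            best[key] = c
    return sorted(c["id"] for c in best.values())
```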
We employed the same definition of interacting residues as in BIPSPI, SASNet, and many other previous publications: a pair of residues is labelled as interacting if the distance between any of their heavy atoms is <6.0 Å. For symmetry reasons, when training and/or evaluating with homo-complexes, we corrected the list of interacting pairs to include as positive pairs those that are not directly in contact in the structures but that are equivalent to others that are in contact. For example, given the homodimer A-B and the interacting residue pair A:i-B:j, the pair A:j-B:i should also be considered as positive since residues A:j and B:j, and A:i and B:i, are equivalent. Such a correction could also be useful for hetero-complexes in which one of the partners is a homo-complex, but it was not considered in this work in order to simplify comparisons.
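The labelling rule and the homo-complex symmetry correction can be illustrated with a short sketch. The input layout (a dictionary from residue index to heavy-atom coordinates) and the function names are assumptions made for the example; a brute-force distance check is used for clarity rather than efficiency.

```python
import math


def interacting_pairs(chain_a, chain_b, cutoff=6.0):
    """Label residue pairs as interacting using the heavy-atom distance rule.

    chain_a, chain_b: dict mapping residue index -> list of (x, y, z)
    heavy-atom coordinates in Å (assumed layout).
    """
    pairs = set()
    for i, atoms_i in chain_a.items():
        for j, atoms_j in chain_b.items():
            # Interacting if any heavy-atom pair is closer than the cutoff.
            if any(math.dist(p, q) < cutoff for p in atoms_i for q in atoms_j):
                pairs.add((i, j))
    return pairs


def symmetrize(pairs):
    """Homodimer correction: if A:i-B:j is positive, so is A:j-B:i."""
    return pairs | {(j, i) for i, j in pairs}
```

For a homodimer, `symmetrize(interacting_pairs(copy_a, copy_b))` yields the corrected list of positive pairs.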

Evaluation
The performance for Residue-Residue Interaction (RRI) prediction was measured by computing the mean Area Under the ROC Curve (mRAUC) as in many other works. Binding site prediction performance was evaluated using several metrics, including the Area Under the ROC Curve (RAUC), Matthews Correlation Coefficient (MCC), True and False Positive Rates (TPR, FPR), and Positive Predictive Value (PPV) (see Sanchez-Garcia et al. 3 for more details). Violin plots displaying the distribution of RRI ROC-AUC and binding site AUC were also computed.
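For reference, the threshold-based metrics and the ROC AUC listed above follow their standard definitions, which can be written compactly as below. This is a generic sketch and not the evaluation code used in this work; the function names and the 0.5 default threshold are illustrative choices.

```python
import math


def binary_metrics(labels, scores, threshold=0.5):
    """Compute TPR, FPR, PPV and MCC for scores binarized at a threshold."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted = s > threshold
        if predicted and y:
            tp += 1
        elif predicted and not y:
            fp += 1
        elif y:
            fn += 1
        else:
            tn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "TPR": tp / (tp + fn) if tp + fn else 0.0,
        "FPR": fp / (fp + tn) if fp + tn else 0.0,
        "PPV": tp / (tp + fp) if tp + fp else 0.0,
        "MCC": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


def roc_auc(labels, scores):
    """ROC AUC as the probability that a positive outscores a negative."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    if not pos or not neg:
        raise ValueError("both classes are required")
    # Ties between a positive and a negative count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUC formulation is quadratic in the number of examples, which is acceptable for illustration but would be replaced by a rank-based computation on large residue-pair sets.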
Hetero-complex predictions were evaluated using the Protein-Protein Docking Benchmark v5 (Bv5), a dataset of 230 hetero-complexes for which both the bound and unbound structures are available. Since we are interested in the performance under different oligomerization states, we have also computed the same metrics for two subsets of the Bv5: BM90C 4, the subset of heterodimers contained in Bv5, and Bv5Mul, the subset of multimeric proteins contained in Bv5.
For the case of homo-complexes, we compiled an evaluation benchmark following the same principles as in Bv5 except for the fact that all the complexes employed are dimers in bound state. This evaluation benchmark, which we termed HOe, is composed of 223 homodimers. Notice that HOe comprises only bound complexes and thus, performance using structural information could be overestimated. On the contrary, the performance estimations measured using sequence-only features are not affected by this caveat.
For the sequence-structure mode, since we generated two training instances for each complex, evaluation is performed also on the two instances, reporting them independently for each partner and also as averages when considering the whole complexes.
The evaluation process consisted of a 10-fold cross-validation approach across the testing and training sets, i.e. the testing set was divided into 10 subsets of equal size and then, for each subset, a model was trained after removing from the training set any protein that shared a SCOPe domain with the particular testing subset. This approach guarantees that the training set does not contain any information on the tested data.
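The exclusion step of this protocol can be sketched as a set-intersection filter. The example assumes that each complex is annotated with the set of SCOPe identifiers of its chains; the function name and data layout are hypothetical.

```python
def training_set_for_fold(train, test_fold):
    """Drop training complexes that share any SCOPe identifier with a test fold.

    train, test_fold: dict mapping complex id -> set of SCOPe identifiers
    (assumed layout). Returns the ids that may be used for training.
    """
    # Collect every SCOPe identifier present in the held-out subset.
    banned = set().union(*test_fold.values())
    return sorted(cid for cid, fams in train.items() if not (fams & banned))
```

One model is then trained per testing subset, each on its own filtered training set.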

Algorithm
BIPSPI+ employs the same algorithm as BIPSPI v1, including the same features, model and hyperparameters. The main differences with respect to the original version are related to the different new input types. First, in version 2 we always apply two stacked models (feedback model) independently of the input type provided, whereas originally, this strategy was only employed for structural input. Second, for the sequence-structure mode, we employ sequence-only features for one of the partners and both structural and sequence-only features for the other. Last, for homo-complex prediction, the algorithm is executed with two copies of the same monomer as inputs and the final predictions for the pairs are computed by averaging the predictions obtained for equivalent pairs.
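One plausible reading of the averaging step for homo-complexes is that the scores of the pair (i, j) and of its mirror (j, i) are averaged, since both describe equivalent contacts between the two copies of the monomer. The sketch below implements that interpretation only; it is an assumption about the procedure, and the function name and data layout are hypothetical.

```python
def average_symmetric(pair_scores):
    """Average each pair score with that of its mirrored pair.

    pair_scores: dict mapping (i, j) -> score, where i indexes a residue in
    the first copy of the monomer and j a residue in the second copy.
    """
    averaged = {}
    for (i, j), score in pair_scores.items():
        # Fall back to the pair's own score if the mirror was not predicted.
        mirror = pair_scores.get((j, i), score)
        averaged[(i, j)] = (score + mirror) / 2.0
    return averaged
```

After this step the score matrix is symmetric, so both copies of the monomer receive identical predictions.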
In addition to the aforementioned modifications, the procedure to ensure the independence between the training and testing sets is also different, since now there is more than one complex with the same pair of SCOPe families. Consequently, a grouped ten-fold cross-validation strategy, using pairs of SCOPe families as groups, was used to prevent cross-contamination. Proteins with no SCOPe defined are assigned to virtual families according to sequence clustering at 30% identity.
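A grouped cross-validation split keeps all complexes of the same group in a single fold. The sketch below shows one simple way to build such folds with a greedy size-balancing heuristic; the heuristic, the function name and the input layout are assumptions for illustration and do not describe the exact splitting procedure used here.

```python
def grouped_folds(group_of, n_folds=10):
    """Split complexes into folds without splitting any group.

    group_of: dict mapping complex id -> group key, e.g. a pair of SCOPe
    families or a virtual family from sequence clustering (assumed layout).
    Returns a list of n_folds lists of complex ids.
    """
    members = {}
    for cid, group in group_of.items():
        members.setdefault(group, []).append(cid)

    folds = [[] for _ in range(n_folds)]
    # Greedy balancing: place the largest groups first, each into the
    # currently smallest fold.
    for group in sorted(members, key=lambda g: (-len(members[g]), str(g))):
        smallest = min(range(n_folds), key=lambda k: len(folds[k]))
        folds[smallest].extend(sorted(members[group]))
    return folds
```

Because whole groups are assigned to folds, no pair of SCOPe families appears in both the training and the testing partition of any split.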

Data augmentation
We generated simulated conformations, obtained from the atomic models contained in the training sets, as a novel type of data augmentation. In particular, we randomly sampled poses from trajectories generated with the "imc" program of the iMod package 5, using default parameters. This program performs a Monte Carlo simulation guided by Normal Mode Analysis on Internal Coordinates and generates plausible trajectories that begin at the atomic model provided. Other alternatives such as Molecular Dynamics or Flexible Docking were not considered due to computational limitations but could produce similar results.

Method comparison
For comparison with other methods, we employed the SASNet neural network 6 as described in the original publication and we trained it on the HEMt dataset, since the original SASNet publication reported performance using Bv5 and our best dataset for Bv5 is HEMt. Moreover, because SASNet was trained on both the Bv5 and DIPs, a custom dataset, a direct comparison between DIPs and HEMt can be conducted.

BIPSPI+ Usage
compare the performance of both versions. Additionally, in Supplementary Figure 7, BIPSPI+ exhibits far better performance than BIPSPI v1 when sequences are used as input. In that case, the improvements are so important that, for the residue-residue interaction (RRI) problem, the first and second quartiles in BIPSPI+ approximately matched the second and third quartiles in BIPSPI v1. More importantly, the improvement in RRI results also translates to an important improvement in binding site prediction, in which the BIPSPI+ distribution is shifted by ~1/4 of the interquartile range. For the case of homo-complexes, independently of whether the input is a sequence or a structure, a similar improvement is observed. The differences in performance for hetero-complexes using structural features are less striking, but still statistically significant in all computed tests (see Supplementary Table 2) and visually noticeable. Since in all cases the first quartile is the one that varies most between versions, the worst performing examples tend to be better predicted in our new version, although improvements are observed over the whole range of values.
Second, the stratification of the training data into homo- or hetero-complexes leads to better results, as can be concluded from the facts that 1) when the opposite type of oligomerisation dataset is used for training, results severely worsen, and 2) the performance measured when using HEHODt is similar to or slightly worse than when using the specific counterparts (HEMt/HEDt and HODt, see Supplementary Figure 5 and Supplementary Figure 6). Additionally, it can also be concluded that the addition of multimers to the dataset has an overall positive effect even when the inputs are two sequences and thus, the concept of multimer is not naturally modelled. In the following subsections, some particular aspects of the dataset will be studied in more detail.
Supplementary Table 1. Ten-fold cross-validation performance evaluated on Bv5 and HOe for several training datasets.

Figure 3. BIPSPI+ use cases. a) Sequence-structure mode improvement example. Sequence-only predictions (blue asterisks) and sequence-structure predictions (green asterisks) on structurally solved residues for Bv5 unbound complex 2OZA chain X (B in bound). Residues in contact with chain A are marked in red. Predictions above 0.5 score are marked with stars in blue when using only sequence information for both chain A and X and in green when the structure of chain X is employed alongside the sequence of chain A. b) PatchDock docking model for the Subtilisin Carlsberg-OMTKY3 complex (PDB code 1YU6, chains A and C respectively) obtained from the BIPSPI+ web server. The crystallographic structures are depicted in grey for chain A and green for chain C, whereas the docked model is depicted in purple.