Geometric epitope and paratope prediction

Abstract
Motivation: Identifying the binding sites of antibodies is essential for developing vaccines and synthetic antibodies. In this article, we investigate the optimal representation for predicting the binding sites in the two molecules and emphasize the importance of geometric information.
Results: Specifically, we compare different geometric deep learning methods applied to proteins' inner (I-GEP) and outer (O-GEP) structures. We incorporate 3D coordinates and spectral geometric descriptors as input features to fully leverage the geometric information. Our research suggests that different geometric representations are useful for different tasks. Surface-based models are more efficient in predicting the epitope, while graph models are better in paratope prediction, both achieving significant performance improvements. Moreover, we analyze the impact of structural changes in antibodies and antigens resulting from conformational rearrangements or reconstruction errors. Through this investigation, we showcase the robustness of geometric deep learning methods and spectral geometric descriptors to such perturbations.
Availability and Implementation: The Python code for the models, together with the data and the processing pipeline, is open-source and available at https://github.com/Marco-Peg/GEP.


Introduction
Identifying the binding sites of antibodies is essential for developing vaccines and synthetic antibodies. These binding sites, called paratopes, bind to antigens at a corresponding site known as the epitope, thus neutralizing harmful foreign molecules in the body. Experimental methods for determining the residues that belong to the paratope and epitope are time-consuming and expensive, highlighting the need for computational tools to facilitate the rapid development of therapeutics. The recent COVID-19 pandemic highlighted this need further, as mutations in the antigen were shown to impact the binding mechanism, potentially reducing the efficacy of existing treatments (Thomson et al., 2021). Predicting the binding sites of an antibody-antigen interaction requires considering the entire antigen for epitope prediction and a localized region of the antibody, known as the Complementarity-Determining Region (CDR), for paratope prediction.
* Equal contribution (order chosen randomly). 1 Sapienza, University of Rome; 2 Gatsby Computational Neuroscience Unit, University College London; 3 Google DeepMind; 4 Mila, Université de Montréal.
The 2023 ICML Workshop on Computational Biology. Baltimore, Maryland, USA, 2023. Copyright 2023 by the author(s).
The integration of geometric and structural information in protein-to-protein interaction studies has led to significant progress (Stärk et al., 2022; Dai & Bailey-Kellogg, 2021). While several methods have concentrated on the 3D graph representation, few methods (Dai & Bailey-Kellogg, 2021; Zhang et al., 2023) have investigated the 3D surface representation. We aim to assess the impact of utilizing the geometric representation of the antigen and antibody in the task of epitope-paratope prediction. Our approach, GEP (Geometric Epitope-Paratope) Prediction, proposes different geometric representations of the molecules to create accurate predictors of antibody-antigen binding sites. The use of geometrical information is further justified by the emergence of technology predicting the single-protein structure, such as AlphaFold 2 (Jumper et al., 2021), which has comparable accuracy to experimental methods. We present the following contributions in our paper:
• We analyze the significance of geometric information within the context of graph learning, using equivariant layers that enable more robust and accurate predictions.
• Additionally, we fully exploit the geometric information in molecules by representing them as surfaces and applying techniques based on spectral geometry, leading to state-of-the-art performance.
• We will release a pipeline for generating a dataset from PDB molecules that produces molecular representations in graph and surface formats, enabling cross-method comparisons.

Related work
The structure of proteins provides crucial information about the location and orientation of the binding sites. Various approaches have been taken in the literature to address the task of epitope and paratope prediction, including sequential (Liberis et al., 2018; Deac et al., 2019) and structural (Del Vecchio et al., 2021) methods. Furthermore, geometric deep learning has emerged as a powerful tool for predicting protein-protein interactions (Isert et al., 2023), with graph-based representations being one of the most common approaches (Tubiana et al., 2022; Stärk et al., 2022). These methods leverage the geometric information of the molecules to learn complex relationships between epitopes and paratopes. For instance, some approaches (Del Vecchio et al., 2021; da Silva et al., 2022) use the graph structure to compute features based on neighbouring residues, which are then aggregated to highlight the most probable region of interaction.
An alternative approach is to represent proteins as surfaces. MaSIF (Gainza et al., 2020) focuses on the more general problem of protein interaction region prediction and uses a surface representation learned through convolutions defined on the surface. PiNet (Dai & Bailey-Kellogg, 2021) represents the protein surface as a point cloud and employs PointNet (Qi et al., 2017) to classify points as interacting or not. In contrast, Zhang et al. (2023) model the surface of a molecule as a graph and apply an equivariant graph neural network (EGNN; Satorras et al., 2021) for binding site prediction.
Integrating structural and geometric information has proven to be a promising approach for improving protein interaction prediction. Still, few studies have focused on the specific case of epitope and paratope prediction (Cia et al., 2023). Our work supports this view by showing that considering the problem as a geometric one can effectively improve performance.

Motivation
The shape and structure of molecules play a crucial role in determining their interactions with other molecules, as complementary geometric shapes are required for successful binding (Fischer, 1894). To accurately predict molecular interactions, it is essential to incorporate geometric information such as 3D coordinates and spectral descriptors. Our approach integrates this geometric information into the representation of proteins as residue graphs, resulting in a richer and more accurate representation. Furthermore, we recognize the importance of the outer surface of a molecule in molecular interactions. To address this, we perform computations on the outer surface of the molecule and then map the resulting predictions to the corresponding residues. Working on the surface captures the interactions that occur there and enables the use of geometric deep-learning models to analyze them. This approach can potentially provide significant benefits over traditional methods, ultimately leading to more accurate and efficient predictions of molecular interactions.

Data
Comparing methods across different molecular representations is crucial for advancing research in molecular modelling. We developed a reusable pipeline that generates a dataset to evaluate methods using inner and outer structure representations.
We collected a dataset of 133 protein complexes from Epipred, with 103 for training and 30 for testing. The training and test sets have been selected to share no more than 90% pairwise sequence identity. The PDB files were obtained from the SAbDab database (Dunbar et al., 2014). In the test set, 7.8% of antigen residues were labelled as positive. Additionally, to validate our results, we used a separate set of 27 protein complexes from PECAN, derived from a subset of the Docking Benchmark v5 (Vreven et al., 2015).
We construct a residue graph (Figure 3a) for each protein, where a 28-dimensional physicochemical feature vector represents each residue. This vector comprises a 21-dimensional one-hot encoding of the amino acid (20 possible types and one for an unknown type), in addition to seven other features representing the physical, chemical, and structural properties of the amino acid type. These additional features can be considered a fixed embedding, as described by Meiler et al. (2001).
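The layout of this feature vector can be illustrated with a minimal sketch. The ordering of the amino acids, the function name `residue_features`, and the all-zero table `PLACEHOLDER_EXTRA` are assumptions made for illustration; in particular, the seven additional values are placeholders and not the embedding of Meiler et al. (2001).

```python
# The 20 standard amino acids (one-letter codes) plus one slot for unknown types.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
UNKNOWN_INDEX = len(AMINO_ACIDS)  # index 20


def residue_features(aa, extra_features):
    """Return a 28-dim vector: 21-dim one-hot followed by 7 per-type features.

    `extra_features` maps a one-letter code to 7 floats. The values used in
    this sketch are placeholders, not the published embedding.
    """
    one_hot = [0.0] * (len(AMINO_ACIDS) + 1)
    # Only single-letter codes are looked up; anything else counts as unknown.
    idx = AMINO_ACIDS.find(aa) if len(aa) == 1 else -1
    one_hot[idx if idx >= 0 else UNKNOWN_INDEX] = 1.0
    extra = list(extra_features.get(aa, [0.0] * 7))
    if len(extra) != 7:
        raise ValueError("expected 7 additional features per residue type")
    return one_hot + extra


# Hypothetical placeholder table (all zeros), used only to show the shapes.
PLACEHOLDER_EXTRA = {aa: [0.0] * 7 for aa in AMINO_ACIDS}

alanine = residue_features("A", PLACEHOLDER_EXTRA)
unknown = residue_features("X", PLACEHOLDER_EXTRA)
```

Because the seven extra features depend only on the amino acid type, they act as a fixed lookup table rather than a learned embedding.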
For each protein, we generated a surface mesh (Figure 3b) using the PyMOL API with a 1.4 Å water probe radius. We associated each point on the protein's surface with a residue by finding the closest atom to that point. This association was then used to transfer the feature of each residue to the points on the surface.
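The closest-atom association can be sketched as follows. This is a brute-force illustration under stated assumptions, not the pipeline's implementation: the function names and the toy coordinates are hypothetical, and mesh generation with PyMOL is not shown.

```python
import math


def assign_points_to_residues(points, atom_coords, atom_residue_ids):
    """For each surface point, return the residue id of its closest atom."""
    assignment = []
    for p in points:
        nearest = min(
            range(len(atom_coords)),
            key=lambda i: math.dist(p, atom_coords[i]),
        )
        assignment.append(atom_residue_ids[nearest])
    return assignment


def transfer_features(assignment, residue_feature_table):
    """Copy each residue's feature vector to the surface points assigned to it."""
    return [residue_feature_table[r] for r in assignment]


# Toy example: three atoms belonging to two residues, three surface points.
atoms = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (5.5, 0.0, 0.0)]
atom_residue_ids = [0, 1, 1]
surface_points = [(0.5, 0.0, 1.0), (4.8, 0.2, 0.0), (6.0, 0.0, 0.0)]

point_to_residue = assign_points_to_residues(surface_points, atoms, atom_residue_ids)
point_features = transfer_features(point_to_residue, {0: [1.0], 1: [2.0]})
```

The quadratic search is adequate for a toy example; for full-size meshes a spatial index such as a k-d tree would likely be preferable.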

Method
In our experiments, we considered two scenarios: a protein represented through its inner structure (I-GEP) and outer structure (O-GEP). In both cases, we leverage the geometric information to improve the performance of epitope and paratope prediction methods.

I-GEP
Our I-GEP model predicts epitopes and paratopes using a graph-based approach that captures the inner structure of a protein. Each residue is represented as a node in a graph, and each residue is connected by edges to its 15 closest neighbouring residues within 10 Å. The I-GEP model has two main components: a structural module that computes an embedding for each residue using the graph structure, and a graph attention network (GAT) that combines information from both the antigen and antibody residues. The network then predicts both epitope and paratope residues simultaneously using a fully connected layer, as shown in Fig. 2.
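A minimal sketch of this neighbourhood rule is given below. It assumes that each residue is summarized by a single 3D coordinate (which atom or centroid is used is not specified here) and that the 15-neighbour and 10 Å limits apply jointly; the function name `build_residue_edges` is hypothetical.

```python
import math


def build_residue_edges(coords, k=15, cutoff=10.0):
    """Return directed edges (i, j) where j is one of the k nearest residues
    of i and lies within `cutoff` angstroms of it."""
    edges = []
    for i, ci in enumerate(coords):
        # Distances from residue i to every other residue, nearest first.
        dists = sorted(
            (math.dist(ci, cj), j) for j, cj in enumerate(coords) if j != i
        )
        edges.extend((i, j) for d, j in dists[:k] if d <= cutoff)
    return edges


# Toy chain of four residues spaced 4 angstroms apart along the x axis.
chain = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (8.0, 0.0, 0.0), (12.0, 0.0, 0.0)]
edges_k2 = build_residue_edges(chain, k=2)
edges_k3 = build_residue_edges(chain, k=3)
```

With `k=3`, the first and last residues of the toy chain are 12 Å apart, so the distance cutoff removes that pair even though it falls within the neighbour budget.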
To improve the accuracy of our predictions, we integrate

O-GEP
Our O-GEP model operates on the protein's surface and includes a geometric module that uses the surface's geometry to spread information across it. This process generates features that are then combined and shared between the antibody and antigen through fully connected layers (segmentation module), resulting in an interaction probability for each point on the surface, as shown in Fig. 2.
We explore two different models for the geometric module. As a baseline, we use PointNet (Qi et al., 2017) to recreate the architecture proposed in PiNet (Dai & Bailey-Kellogg, 2021). The second model employs diffusion layers from DiffNet (Sharp et al., 2022) to propagate features on the surface. This makes our model robust against surface perturbations and suitable for handling meshes and point clouds with fewer points.
We further examine the impact of using the Heat Kernel Signature (HKS) as an extra geometric descriptor input. The HKS (Sun et al., 2009) is a concise point-wise spectral signature which summarizes local and global information about the intrinsic geometry of a shape by capturing the properties of the heat diffusion process on the surface. One of the key benefits of using HKS is that it remains stable even under minor surface perturbations, thus enabling it to withstand even conformational rearrangements of the proteins. To utilize the HKS descriptor, we concatenate it with the input features at each point on the surface and then pass the concatenated data through the geometric module.
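For reference, the standard definition of the HKS from Sun et al. (2009) is reproduced below, where $\lambda_i$ and $\phi_i$ denote the eigenvalues and eigenfunctions of the Laplace-Beltrami operator of the surface and $t$ is the diffusion time. The number of eigenpairs retained and the set of time scales sampled are implementation choices that are not specified in this section.

```latex
\mathrm{HKS}(x, t) \;=\; k_t(x, x) \;=\; \sum_{i \ge 0} e^{-\lambda_i t}\, \phi_i(x)^2
```

Small values of $t$ emphasize local geometry around $x$, while larger values capture increasingly global structure, which is consistent with the description of the HKS as summarizing both local and global information.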
To transfer the binding probabilities from the protein's surface to the residues, we average the predicted probabilities over all surface points that correspond to the same residue. This maps the binding probabilities into residue space, where the epitope and paratope predictions are evaluated.
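The averaging step can be sketched as follows, assuming each surface point carries a predicted probability and the id of the residue it was assigned to. The function name is hypothetical, and how residues without any surface point are handled is not specified here; in this sketch they simply receive no prediction.

```python
from collections import defaultdict


def residue_probabilities(point_probs, point_residue_ids):
    """Average per-point binding probabilities over the points of each residue."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for prob, residue in zip(point_probs, point_residue_ids):
        sums[residue] += prob
        counts[residue] += 1
    return {residue: sums[residue] / counts[residue] for residue in sums}


# Two points map to residue 0 and one point maps to residue 1.
per_residue = residue_probabilities([0.2, 0.4, 0.9], [0, 0, 1])
```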

Training
To handle the imbalanced binary classification task, the networks were trained using a class-weighted binary cross-entropy loss and the Adam optimizer. For parameter tuning, we performed a hyperparameter search on the validation set. We train each model with five random seeds, and for each run, we keep the model weights that performed best on the validation set. During training, we also randomly rotate instances of the dataset to increase the robustness of the models. See Appendix A for more details.
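A class-weighted binary cross-entropy can be written in a few lines. The sketch below assumes that only the positive class is up-weighted and that the weight is the ratio of negatives to positives; this is one common choice, and the exact weighting used for training is not stated in this section. Both function names are hypothetical.

```python
import math


def inverse_frequency_weight(labels):
    """Weight for the positive class: number of negatives per positive."""
    n_pos = sum(labels)
    return (len(labels) - n_pos) / max(n_pos, 1)


def weighted_bce(probs, labels, pos_weight, eps=1e-7):
    """Mean binary cross-entropy with the positive term scaled by `pos_weight`."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


# One binding residue among four: the positive term is weighted three times.
labels = [1, 0, 0, 0]
weight = inverse_frequency_weight(labels)
loss = weighted_bce([0.5, 0.5, 0.5, 0.5], labels, weight)
```

With this weighting, a missed positive contributes as much to the loss as all the negatives it is outnumbered by, which discourages the trivial all-negative predictor.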

Evaluation
Given the significant disparity in class sizes, we use the Matthews correlation coefficient (MCC) of the residue classification as our main benchmarking metric for model evaluation. We also report the area under the receiver operating characteristic curve (AUC ROC) and the area under the precision-recall curve (AUC PR), as used in (Dai & Bailey-Kellogg, 2021; Del Vecchio et al., 2021). All reported values are aggregated across five random seeds to ensure the robustness of our findings.
EPMPxyz      0.10 ± 0.01   0.63 ± 0.01   0.15 ± 0.01
E(n)-EPMP    0.14 ± 0.01   0.68 ± 0.02   0.16 ± 0.01
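For readers unfamiliar with the MCC, a minimal sketch of its computation from binary labels and binary predictions follows. The convention of returning 0 when the denominator vanishes is an assumption of this sketch, as is the function name; the threshold used to binarize predicted probabilities is not shown.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: return 0 when a row or column of the
    # confusion matrix is empty and the coefficient is undefined.
    return (tp * tn - fp * fn) / denom if denom else 0.0


perfect = mcc([1, 0, 1, 0], [1, 0, 1, 0])
inverted = mcc([1, 0, 1, 0], [0, 1, 0, 1])
all_negative = mcc([1, 0, 0, 0], [0, 0, 0, 0])
```

The last case shows why the metric suits imbalanced data: predicting every residue as non-binding scores 0 under MCC even though its accuracy would be 75%.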

Results
In this section, we report the results of our experiments and demonstrate the contribution of geometric information on the task of epitope-paratope prediction.

I-GEP results
We conducted experiments to evaluate the effectiveness of incorporating geometric information by comparing our proposed models from Section 5.1 with the EPMP model proposed by Del Vecchio et al. (2021). Our results, presented in Table 1, demonstrate that the inclusion of geometric information leads to a meaningful increase in performance. Specifically, the use of the E(n) invariant layer (E(n)-EPMP) resulted in an improvement in all metrics for both antibody and antigen.

O-GEP results
To test the performance of O-GEP models, we consider the methods proposed in Section 5.2 with different combinations of input features. The results are summarized in Table 2. Incorporating diffusion layers (DIFFNET) along with 3D coordinates and the Heat Kernel Signature as additional features consistently outperformed the baseline method PINET. These techniques led to an MCC score twice as high as that obtained by the I-GEP models. However, paratope prediction did not show the same level of improvement with O-GEP models as epitope prediction did. In this case, the best results were achieved by considering only the HKS features and diffusion layers.
Figure 3a shows the results of E(n)-EPMP on the residue graph. The epitope prediction focuses on sparse regions of the antigen, such as the spiky edges. In contrast, paratope prediction concentrates on the residues closest to the antigen. In Figure 3b, the predictions of DIFFNET pc (XYZ+HKS) are shown on both the surface and residues of the molecules. The predictions are highly localized on the region nearest to the binding molecule. It is worth noting that the 3D coordinates given as input to the models are centred and randomly rotated, providing no prior knowledge of the binding region.
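The centring and random rotation of input coordinates can be sketched as below. This is an illustration under assumptions: the rotation is sampled from a normalized Gaussian quaternion, which is one standard way to draw uniformly distributed rotations, and the sampling scheme actually used is not specified in this section. The function names are hypothetical.

```python
import math
import random


def random_rotation_matrix(rng):
    """Sample a 3x3 rotation matrix from a random unit quaternion."""
    q = [rng.gauss(0.0, 1.0) for _ in range(4)]
    norm = math.sqrt(sum(c * c for c in q))
    w, x, y, z = (c / norm for c in q)
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def centre_and_rotate(points, rng):
    """Translate points so their centroid is the origin, then rotate them."""
    n = len(points)
    centroid = [sum(p[a] for p in points) / n for a in range(3)]
    rot = random_rotation_matrix(rng)
    out = []
    for p in points:
        c = [p[a] - centroid[a] for a in range(3)]
        out.append(tuple(sum(rot[r][a] * c[a] for a in range(3)) for r in range(3)))
    return out


rng = random.Random(0)
original = [(1.0, 2.0, 3.0), (4.0, 0.0, -1.0), (-2.0, 5.0, 0.5)]
transformed = centre_and_rotate(original, rng)
```

Pairwise distances are unchanged by this transformation, so the shape of the molecule is preserved while its absolute position and orientation carry no information about the binding partner.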

Conclusions
We investigated the effectiveness of geometric deep learning techniques in predicting antibody-antigen interactions. Our results indicate that incorporating geometric information is crucial for accurately predicting epitope and paratope regions. Specifically, the use of invariant representation in I-GEP models outperformed previous models, and O-GEP models with diffusion layers and additional geometric features achieved state-of-the-art performance. Our study highlights the potential of geometric deep learning in computational biology. Future research could explore using spectral shape analysis to address the more complex problem of conformational rearrangement in antigen-antibody binding (Stanfield et al., 1994).