DeepRank-GNN: a graph neural network framework to learn patterns in protein–protein interfaces

Abstract

Motivation: Gaining structural insights into the protein–protein interactome is essential to understand biological phenomena and extract knowledge for rational drug design or protein engineering. We have previously developed DeepRank, a deep-learning framework to facilitate pattern learning from protein–protein interfaces using convolutional neural network (CNN) approaches. However, CNNs are not rotation invariant, and data augmentation is required to desensitize the network to the orientation of the input data, which dramatically impairs computational performance. Representing protein–protein complexes as atomic- or residue-scale rotation-invariant graphs instead enables the use of graph neural network (GNN) approaches, bypassing those limitations.

Results: We have developed DeepRank-GNN, a framework that converts protein–protein interfaces from PDB 3D coordinate files into graphs that are further provided to a pre-defined or user-defined GNN architecture to learn problem-specific interaction patterns. DeepRank-GNN is designed to be highly modular and easily customizable, and is wrapped into a user-friendly Python 3 package. Here, we showcase DeepRank-GNN's performance on two applications using a dedicated graph interaction neural network: (i) the scoring of docking poses and (ii) the discrimination of biological from crystal interfaces. In addition to the highly competitive performance obtained on these tasks compared to state-of-the-art methods, we show a significant improvement in speed and storage requirements with DeepRank-GNN compared to DeepRank.

Availability and implementation: DeepRank-GNN is freely available from https://github.com/DeepRank/DeepRank-GNN.

Supplementary information: Supplementary data are available at Bioinformatics online.

Generation of docking models with HADDOCK

Docking with HADDOCK follows a three-step process: 1) in the first stage (it0), the docking is performed from the separated and randomly rotated starting conformations, treating the proteins as rigid units; 2) in the second stage (it1), a semi-flexible refinement is performed, consisting of a simulated annealing in torsional space (with fixed bond lengths and angles), introducing flexibility first along the side chains at the interface (i.e. those within 5 Å of the partner protein) and then in both the backbone and side chains of the interface residues; 3) in the last stage (itw), the models are subjected to a final energy minimization and/or a short refinement in explicit solvent (water). Each HADDOCK stage has its own scoring function, and restraints can be provided to guide the docking with a priori information.
In order to ensure a suitable number of near-native models (i.e. iRMSD to the reference structure ≤ 4 Å), we combined models from 5 docking scenarios with increasing levels of a priori information: 1) docking with random surface patch restraints (10000/400/400 models for the it0/it1/water stages), 2) docking with center-of-mass restraints (10000/400/400 models for the it0/it1/water stages), 3) docking with true interface residues defined within a 5 Å distance to the partner protein (1000/400/400 models for the it0/it1/water stages), 4) docking with true interface residues defined within a 3.9 Å distance to the partner protein (1000/400/400 models for the it0/it1/water stages) and 5) refinement of the bound complex (50/50 models for the it1/water stages).
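For clarity, the near-native labelling used above reduces to a simple threshold on the iRMSD. Below is a toy Python sketch of that labelling step; the model names and iRMSD values are hypothetical stand-ins for the real docking output.

```python
# Label docking models as near-native (1) or not (0) based on their iRMSD
# to the reference structure, using the 4 A cutoff stated in the text.
IRMSD_CUTOFF = 4.0  # Angstrom

# Hypothetical iRMSD values for illustration only.
irmsd = {"model_0001": 1.8, "model_0002": 6.3, "model_0003": 3.9}

labels = {name: int(value <= IRMSD_CUTOFF) for name, value in irmsd.items()}
print(labels)  # {'model_0001': 1, 'model_0002': 0, 'model_0003': 1}
```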

Automated weight computation for the CrossEntropy loss function
The cross-entropy loss function is commonly used for classification tasks, especially when dealing with multiple classes. The cross-entropy loss function computes the probability of each class and applies a logarithmic penalty based on how far the prediction is from the ground truth value.
For a binary classification task, the cross-entropy loss function is defined as:

$$\ell = -\sum_{i} y_i \log(p_i)$$

where $i$ spans the number of classes, $y_i$ is the ground truth value (0 or 1), $p_i$ is the predicted probability of class $i$, and $p_0 = 1 - p_1$. Provided that $t$ is the target (i.e. the correct class to predict), then $y_t = 1$ and all other $y_i = 0$. We can thus simplify the equation for multi-class tasks to:

$$\ell = -\log(p_t)$$

with $p_t$ being the probability of the target class $t$, computed with the SoftMax activation function:

$$p_t = \frac{\exp(x_t)}{\sum_{c} \exp(x_c)}$$

where $x_t$ is the predicted value for the target class, $c$ spans the number of classes, and $x_c$ is the predicted value for each class $c$.
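As a quick numeric illustration (the logits are arbitrary, chosen only to make the arithmetic concrete): for a three-class problem with predicted values $x = (2.0,\ 0.5,\ 0.1)$ and target class $t = 0$,

$$p_t = \frac{e^{2.0}}{e^{2.0} + e^{0.5} + e^{0.1}} = \frac{7.389}{10.143} \approx 0.729, \qquad \ell = -\log(0.729) \approx 0.32.$$

A confident, correct prediction thus incurs a small penalty, while $p_t \to 0$ would drive the loss towards infinity.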
In PyTorch, the cross-entropy loss is computed for each entry $n$ of the input batch as follows:

$$\ell_n = -w_{t_n} \log\left(\frac{\exp(x_{n,t_n})}{\sum_{c} \exp(x_{n,c})}\right)$$

with $w_{t_n}$ being the weight, or scaling factor, assigned to the target class $t_n$. By default, $w_c$ is set to 1 for each class $c$, meaning that all classes contribute equally to the loss calculation. The batch loss can be computed as the mean (the default in PyTorch and DeepRank-GNN) or the sum of the individual losses. However, in the case of an unbalanced training dataset, it is recommended to weight the classes, with higher weights for the minority class(es), to avoid optimizing the neural network weights solely based on the majority classes. An agreed-upon technique to weight the contribution of each class in the loss function is to assign weights that are inversely proportional to the frequency of each class in the training set.
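In practice this corresponds to passing a weight tensor to PyTorch's CrossEntropyLoss; a minimal sketch follows (the weight values, logits and targets are arbitrary illustrations, not DeepRank-GNN defaults).

```python
# Minimal PyTorch sketch of the weighted cross-entropy described above.
import torch
import torch.nn as nn

# Higher weight for the (assumed) minority class 1.
class_weights = torch.tensor([0.25, 0.75])

loss_fn = nn.CrossEntropyLoss(weight=class_weights, reduction="mean")

logits = torch.tensor([[1.2, -0.3], [0.1, 0.9], [2.0, 0.5]])  # raw network outputs
targets = torch.tensor([0, 1, 0])                             # ground-truth class indices

loss = loss_fn(logits, targets)  # SoftMax + log + class weighting in one call
print(loss.item())
```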
In DeepRank-GNN, users can input their own weights or let DeepRank-GNN automatically compute them as follows:
- compute the frequency of each class;
- convert it into a percentage.

Example: we have a training dataset of 400 graphs and 4 classes (0, 1, 2, 3) split as follows:
- Class 0: 200 graphs
- Class 1: 50 graphs
- Class 2: 50 graphs
- Class 3: 100 graphs

The frequency $F_c$ of class $c$ is given by:

$$F_c = \frac{N_c}{N_{\mathrm{total}}}$$

where $N_c$ is the number of graphs of class $c$ and $N_{\mathrm{total}}$ the total number of graphs. It is then transformed into a frequency percentage, $F_c^{\%} = 100 \times F_c$, giving 50%, 12.5%, 12.5% and 25% for classes 0 to 3 in this example.
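A short Python sketch of this computation, using the worked example above; the inverse-frequency weights shown in the last step follow the recommendation stated earlier, though the exact normalization DeepRank-GNN applies internally may differ.

```python
# Class frequencies, percentages and inverse-frequency weights for the
# 400-graph example above.
from collections import Counter

labels = [0] * 200 + [1] * 50 + [2] * 50 + [3] * 100  # the 400-graph example

counts = Counter(labels)
n_total = len(labels)

freq = {c: counts[c] / n_total for c in sorted(counts)}  # F_c
freq_pct = {c: 100.0 * f for c, f in freq.items()}       # F_c in percent
weights = {c: 1.0 / f for c, f in freq.items()}          # inverse frequency (assumed scheme)

print(freq_pct)  # {0: 50.0, 1: 12.5, 2: 12.5, 3: 25.0}
print(weights)   # {0: 2.0, 1: 8.0, 2: 8.0, 3: 4.0}
```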

Graph Interaction Network (GINet)
The convolution layers in GINet are inspired by the graph attention network (GAT) described by Veličković et al., 2018 and the edge-aggregated graph attention network (EGAT) from Mahbub and Bayzid, 2020. The attention mechanism (Veličković et al., 2018) is used to weight the contribution of individual neighbours to the new state of a node. When applied to PPIs, we expect the attention mechanism to learn favorable or deleterious contacts in a local environment.
The feature representation of node $i$, $h_i \in \mathbb{R}^{F}$, is transformed into a new feature representation $h'_i \in \mathbb{R}^{F'}$ (where $F$ and $F'$ are respectively the numbers of input and output features) through the aggregation of the feature representations of the neighbouring nodes $\mathcal{N}_i$, i.e. the first-order neighbours of $i$, including node $i$ itself. We use the weighted sum of the neighbours' feature representations as an aggregator:

$$h'_i = \sum_{j \in \mathcal{N}_i} \alpha_{ij}\, \mathbf{W} h_j \qquad (1)$$

where $\mathbf{W} \in \mathbb{R}^{F' \times F}$ is the learnable parameter matrix used in the linear transformation of the node feature representations and where the weights $\alpha_{ij}$ are learned through the attention mechanism described in equations 2 and 3. Each edge of the graph is assigned an attention score $e_{ij}$ (equation 3), which is normalized by a SoftMax activation function to give a probability distribution $\alpha_{ij}$ (between 0 and 1) over the node neighbours $\mathcal{N}_i$, with $\alpha_{ij}$ reflecting how much the neighbouring node $j$ should contribute to the new representation of node $i$:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(e_{ik})} \qquad (2)$$

$$e_{ij} = \vec{a}^{\,T} \left[\mathbf{W} h_i \,\Vert\, \mathbf{W} h_j \,\Vert\, \mathbf{W}_E\, \epsilon_{ij}\right] \qquad (3)$$

where $\epsilon_{ij}$ is the edge feature, which herein corresponds to the interaction strength as explained in the Featurization section of the main manuscript, $\mathbf{W}_E \in \mathbb{R}^{F_E \times F_E}$ is the learnable parameter used in the linear transformation of the edge feature representation ($F_E$ being the number of edge features), and $\vec{a} \in \mathbb{R}^{2F' + F_E}$ is the attention mechanism. $\Vert$ denotes vector concatenation.
As described by Mahbub and Bayzid, 2020 and shown in equation 3, the calculation of the attention score considers the node features $h_i \in \mathbb{R}^{F}$ and $h_j \in \mathbb{R}^{F}$, as in the original GAT paper, plus the edge feature $\epsilon_{ij} \in \mathbb{R}^{F_E}$. The addition of the edge feature, which herein corresponds to the interaction strength, is essential for the study of PPI interfaces, as it is well established that the potential energy of a system depends on the atomic pairwise distances. In a coarse-grained system with residue-level representation, the same trend transfers to residue-residue distances.
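For readers who want to see the aggregation spelled out, below is a minimal, single-head PyTorch sketch of an edge-aware attention layer in the spirit of equations 1-3. This is a simplified re-implementation for illustration, not the GINet code shipped with DeepRank-GNN; the LeakyReLU nonlinearity (standard in GAT but not stated here) and the tensor layout are assumptions.

```python
import torch
import torch.nn as nn

class EdgeAttentionLayer(nn.Module):
    """Single-head, edge-aware attention sketch following equations 1-3."""

    def __init__(self, in_feats: int, out_feats: int, edge_feats: int):
        super().__init__()
        self.W = nn.Linear(in_feats, out_feats, bias=False)        # node transform W
        self.W_E = nn.Linear(edge_feats, edge_feats, bias=False)   # edge transform W_E
        self.a = nn.Linear(2 * out_feats + edge_feats, 1, bias=False)  # attention vector a
        self.leaky = nn.LeakyReLU(0.2)  # GAT-style nonlinearity (an assumption here)

    def forward(self, h, edge_index, edge_attr):
        # h: [N, in_feats]; edge_index: [2, E] pairs (i, j), assumed to include
        # self-loops so node i contributes to its own new representation;
        # edge_attr: [E, edge_feats].
        src, dst = edge_index
        Wh = self.W(h)             # [N, out_feats]
        We = self.W_E(edge_attr)   # [E, edge_feats]

        # Eq. 3: e_ij = a^T [W h_i || W h_j || W_E eps_ij]
        e = self.leaky(self.a(torch.cat([Wh[src], Wh[dst], We], dim=-1))).squeeze(-1)

        # Eq. 2: SoftMax of e_ij over the neighbourhood of each node i
        e = e - e.max()  # numerical stability
        num = torch.exp(e)
        denom = torch.zeros(h.size(0), device=h.device).scatter_add_(0, src, num)
        alpha = num / denom[src].clamp(min=1e-16)

        # Eq. 1: h'_i = sum_j alpha_ij * W h_j
        idx = src.unsqueeze(-1).expand(-1, Wh.size(1))
        return torch.zeros_like(Wh).scatter_add_(0, idx, alpha.unsqueeze(-1) * Wh[dst])
```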
Pooling layers are used to reduce the number of nodes in the graph, which depends on the input 3D model, and to learn higher-level features, i.e. features of groups of nodes, during the training. The pooling operation is performed here on clusters of highly interacting nodes, which can be computed using either the Markov Cluster Algorithm (MCL) (Enright et al., 2002) or the Louvain community detection algorithm (Blondel et al., 2008). Both clustering methods rely on the interaction strength and therefore cluster nodes that are close to each other in 3D space on the fly. A max pooling is then applied on these clusters: this operation aggregates the nodes of a given cluster into a single node whose feature values are given by the maximum feature values across the cluster. Interface and internal edges are then drawn to connect these new nodes, with the features of a new edge obtained by summing the features of all original edges connecting the nodes of the corresponding clusters.
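A hedged sketch of this cluster-based max pooling, using the Louvain option: networkx's louvain_communities stands in for the clustering step, and the node/edge attribute names ("feat", "strength") are placeholders; DeepRank-GNN's own implementation may organise this differently.

```python
import networkx as nx
import numpy as np

def max_pool_by_community(G, feature_key="feat", weight_key="strength"):
    # Cluster nodes by interaction strength (edge weights) with Louvain.
    communities = nx.community.louvain_communities(G, weight=weight_key, seed=0)

    pooled = nx.Graph()
    node_to_cluster = {}
    for cid, nodes in enumerate(communities):
        for n in nodes:
            node_to_cluster[n] = cid
        # One super-node per cluster: element-wise max of member node features.
        feats = np.stack([G.nodes[n][feature_key] for n in nodes])
        pooled.add_node(cid, **{feature_key: feats.max(axis=0)})

    # Re-draw edges between super-nodes; the feature of a new edge is the sum
    # of the features of all original edges connecting the two clusters.
    for u, v, data in G.edges(data=True):
        cu, cv = node_to_cluster[u], node_to_cluster[v]
        if cu == cv:
            continue  # intra-cluster edge, absorbed by the super-node
        prev = pooled.edges[cu, cv][weight_key] if pooled.has_edge(cu, cv) else 0.0
        pooled.add_edge(cu, cv, **{weight_key: prev + data[weight_key]})
    return pooled
```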

Figure S9
Comparison of the performance obtained on the CAPRI Scoreset. See the legend of Fig. S4.