Dock2D: Synthetic data for the molecular recognition problem

Predicting the physical interaction of proteins is a cornerstone problem in computational biology. New classes of learning-based algorithms are actively being developed, and are typically trained end-to-end on protein complex structures extracted from the Protein Data Bank. These training datasets tend to be large and difficult to use for prototyping and, unlike image or natural language datasets, they are not easily interpretable by non-experts. We present Dock2D-IP and Dock2D-IF, two "toy" datasets that can be used to select algorithms predicting protein-protein interactions, or any other type of molecular interactions. Using two-dimensional shapes as input, each example from Dock2D-IP ("interaction pose") describes the interaction pose of two shapes known to interact, and each example from Dock2D-IF ("interaction fact") describes whether two shapes form a stable complex or not. We propose a number of baseline solutions to the problem and show that the same underlying energy function can be learned either by solving the interaction pose task (formulated as an energy-minimization "docking" problem) or the fact-of-interaction task (formulated as a binding free energy estimation problem).


Introduction
One of the long-term goals of computational structural biology is to explain and predict the phenotype of an organism from the basic physical and chemical properties of its molecular constituents. Given the key role of proteins in living processes, a central problem in the field is to predict the structure of a protein from its amino acid sequence [1,2].
Considerable progress on that problem has recently been made by algorithms such as AlphaFold2 [3] and RoseTTAFold [4], using multiple sequence alignments (MSAs) as additional inputs, and by more recent models replacing MSAs with sequence embeddings from large language models [5,6,7,8]. The outstanding performance of AlphaFold2 at the 2020 CASP14 competition suggests that such architectures can solve a large portion of the protein structure prediction problem [9,10,11].
Because proteins typically perform their functions by coming into contact with other proteins (as well as with nucleic acids, carbohydrates, and a variety of small molecules), it is also critical to be able to predict protein-protein interactions (PPIs) and the structures of the complexes they form. While this problem can be viewed simply as a more general version of the protein structure prediction problem, involving two protein chains instead of one [12], many of the data sources used in state-of-the-art protein structure prediction algorithms are not as easy to leverage for the prediction of PPIs. There are fewer structures of protein complexes than of single proteins and, more importantly, the co-evolution signal between physically interacting proteins is weaker and harder to extract than that between interacting regions of a single protein [13,14,15]. For that reason, PPI prediction models are expected to benefit greatly from strong priors based on molecular structure and energy [16].
In light of the success of AlphaFold2, it has become clear that learning-based algorithms will be crucial in solving many of the remaining key problems in computational structural biology. From the last decade of progress in computer vision and natural language processing, we expect that the design of these new learning-based algorithms will be driven by the development of standard performance metrics and well-curated datasets. One such example from the field of computer vision is the ImageNet dataset [17], which enabled the development of the AlexNet [18] and ResNet [19] architectures, still considered the gold standard for computer vision applications. Many of these learning algorithms have risen to prominence by first showing outstanding performance on smaller, simpler datasets such as MNIST [20]. These datasets allow for fast prototyping and are generally more interpretable than the large, complex datasets ultimately used for training.
In the case of PPI prediction, learning-based algorithms are typically trained on structures extracted from the Protein Data Bank [21] and tested on datasets such as the Docking Benchmark [22] and the CASP/CAPRI targets [23,24]. However, these datasets are impractical for prototyping and are not easily interpretable by humans, which makes it harder to analyze the advantages and drawbacks of any new algorithm.
This paper presents Dock2D-IP and Dock2D-IF, two simple, small-scale datasets that can be used to select algorithms predicting protein-protein interactions. Our aim is to provide an equivalent of the MNIST dataset for the molecular recognition problem, formulated here as an energy-based shape recognition problem. While we are designing the Dock2D datasets with protein-protein interactions in mind, they are generic enough to apply to any type of receptor-ligand interactions.
Protein-protein interaction data fall into two broad categories [25,26]: 1) structural data obtained from X-ray crystallography or high-resolution electron microscopy and 2) fact-of-interaction data obtained using proteomics methods. In this work we use a single energy function to generate datasets for both categories, so that prediction models can be trained from either type of data. To make the problem more easily tractable and interpretable, we formulate it in two dimensions instead of three and use a simple shape-based interaction potential similar to the one proposed by Katchalski-Katzir et al. [27].
The paper is divided as follows: the "Datasets" section describes the generating function and the Dock2D datasets, the "Models" section describes algorithms for the pose prediction problem and for the fact-of-interaction prediction problem, and the "Results" section reports the performance of the proposed models and discusses their relative merits and their potential for transfer learning.

Datasets
Learning-based problems usually assume that the training data are sampled from an unknown probability distribution and that a solution consists in deducing this distribution from the samples and using it to infer new data. We design the Dock2D datasets following this general principle. We construct both the Dock2D-IP and Dock2D-IF datasets by first generating two pools of protein-like shapes: one for training/validation and one for testing. We generate these shapes by randomly placing points inside a circle of 40 pixels in diameter and by computing a concave hull using the Alpha Shape Toolbox [28]. The concavity parameter α and the number of points n are sampled from a list of predefined values. All shapes have approximately the same size but have a contour that is more or less intricate (depending on n) and more or less concave (depending on α).
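The point-cloud stage of this generation procedure can be sketched in a few lines of NumPy (the function name and the rejection-sampling strategy are our own illustrative choices, not the authors' exact implementation):

```python
import numpy as np

def sample_shape_points(n, diameter=40, rng=None):
    """Place n points uniformly at random inside a circle of the given
    diameter (in pixels) by rejection sampling from the bounding square."""
    rng = np.random.default_rng() if rng is None else rng
    radius = diameter / 2
    points = []
    while len(points) < n:
        p = rng.uniform(-radius, radius, size=2)
        if p @ p <= radius**2:   # keep only points that fall inside the circle
            points.append(p)
    return np.array(points)
```

A concave hull of these points could then be computed with `alphashape.alphashape(points, alpha)` from the Alpha Shape Toolbox, with α drawn from the predefined list of values.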
We generate the interactome using an energy-based Boltzmann distribution,

p(t, φ) = e^(−βE(t, φ)) / Z,  (1)

where R and L represent the shapes of the receptor and the ligand, respectively, t = (t_x, t_y) is the translation of the ligand with respect to the receptor, and φ is the angle of rotation of the ligand around its center. Parameter β stands for the inverse temperature and is set to 1 for convenience. The partition function Z is defined such that p sums up to 1 over the range of all possible translations and rotations.
The energy function is defined as in Ref. [27]:

E(t, φ) = a_00 corr(t, φ, R, L) + a_10 corr(t, φ, R_s, L) + a_01 corr(t, φ, R, L_s) + a_11 corr(t, φ, R_s, L_s),  (2)

where

corr(t, φ, A, B) = ∫ A(r) (M_φ B)(r − t) dr.  (3)

The integral represents a summation over pixelated versions of the R and L shapes, and M_φ corresponds to a rotation of the ligand by angle φ around its center. In Eq. (2), R and L represent the bulk of the shape (1 if the center of the pixel is inside, 0 if it is outside), and R_s and L_s represent its boundary, obtained by convolving the image with the vertical and horizontal Sobel-Feldman 3 × 3 edge-detection filters [29]. We set the coupling coefficients to the following values: a_00 = 100, a_10 = a_01 = −10, and a_11 = −10. The energy of a receptor-ligand conformation is low when the overlap of the R and L bulk regions is small while the overlap of the bulk of one shape with the boundary of the other is large (see Figure 2). The additional negative contribution of the boundary-boundary overlap penalizes the formation of non-specific interfaces.

We construct the Dock2D-IP dataset ("IP" for "interaction pose") by minimizing the energy function (2) and selecting the transformation (t_0, φ_0) that leads to the global energy minimum:

(t_0, φ_0) = argmin_{t,φ} E(t, φ).  (4)

The minimum energy itself is denoted E_0. This is equivalent to letting β → ∞, in which case the Boltzmann distribution would be nonzero for that docking pose only:

p(t, φ) = 1 if (t, φ) = (t_0, φ_0), and 0 otherwise.  (5)

Because it is created from minimum-energy poses, Dock2D-IP is also reminiscent of real training datasets for protein docking, which are usually composed of high-resolution X-ray structures representing a single conformation, not a Boltzmann-weighted ensemble.
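The energy computation can be sketched with NumPy and SciPy. Circular FFT correlation stands in here for the zero-padded correlation, and the function names and exact Sobel normalization are illustrative assumptions rather than the authors' implementation:

```python
import numpy as np
from scipy import ndimage

# Coupling coefficients from the text: bulk-bulk overlap is penalized,
# while bulk-boundary and boundary-boundary overlaps contribute negatively.
A00, A10, A01, A11 = 100.0, -10.0, -10.0, -10.0

def boundary(img):
    """Boundary feature: gradient magnitude from the two 3x3 Sobel filters."""
    return np.hypot(ndimage.sobel(img, axis=0), ndimage.sobel(img, axis=1))

def correlate(a, b):
    """Circular cross-correlation via FFT: entry [t] equals sum_r a(r) b(r - t)."""
    return np.real(np.fft.ifft2(np.fft.fft2(a) * np.conj(np.fft.fft2(b))))

def energy_map(R, L_rot):
    """Energy E(t) for one pre-rotated ligand, for all translations at once."""
    Rs, Ls = boundary(R), boundary(L_rot)
    return (A00 * correlate(R, L_rot) + A10 * correlate(Rs, L_rot)
            + A01 * correlate(R, Ls) + A11 * correlate(Rs, Ls))
```

Looping this over all ligand rotations yields the full E(t, φ) array used by the brute-force models below.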
The Dock2D-IP dataset contains all (R, L) pairs for which E_0 < −100. This energy cutoff ensures that all examples have at least one strong binding mode. Although proteins displaying multiple binding modes are generally selected against during natural evolution (because they often lead to aggregation), we do not explicitly discard examples having multiple interaction poses with energies below the cutoff.
The Dock2D-IP training/validation set contains 5081 examples, which are shuffled and split 4:1, resulting in 4064 training examples and 1017 validation examples. (Note that any shape found with a certain interaction partner in a training example may be found with a different partner in a validation example. For validation purposes, those two examples are considered different enough.) The test set contains 11850 examples. Each Dock2D-IP example is stored as two 50 × 50 images R and L, along with the transformation (t_0, φ_0) that brings L into its optimal pose relative to R.
Because the shapes are pixelated before they are rotated, the energy tends to be lower when φ is a multiple of 90°. For this reason, the values φ_0 = −180°, −90°, 0°, and 90° are overrepresented in the dataset (see Figure A.2 in Appendix). We do not correct for that slight imbalance.
We construct the Dock2D-IF dataset ("IF" for "interaction fact") by computing the following free energy of interaction:

F(R, L) = −ln Σ_{t,φ} e^(−E(t, φ)).  (6)

The sum over t is performed on a 100 × 100 grid, representing translations from −50 to +50 pixels in each direction, while the sum over φ is performed from −180° to 179° in 1° increments. We consider two shapes R and L to interact if F(R, L) < −100, assigning a ground-truth label "1" for shapes that interact and "0" for shapes that do not. The F < −100 condition ensures that those shapes have at least one binding mode with E_0 < −84.90, which corresponds to −100 + ln(360 × 100 × 100) (see Figures 2 and A.3). This derives from the fact that free energy F can be no higher than the minimum energy E_0 (if only one state of the system has E = E_0 and all other states have energies E → +∞) and can be no lower than E_0 − ln(n) (if all n states of the system have the same energy E_0). Interestingly, the Dock2D datasets show a high propensity for homodimers.
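In code, F can be evaluated with a numerically stable log-sum-exp over the full energy array (a sketch; the array layout is an assumption):

```python
import numpy as np
from scipy.special import logsumexp

def free_energy(E, beta=1.0):
    """F = -(1/beta) ln sum exp(-beta * E) over all poses, i.e. Eq. (6)
    with beta = 1. E is the full energy array, e.g. shape (360, 100, 100)
    for rotations x translations; logsumexp avoids numerical overflow."""
    return -logsumexp(-beta * E) / beta
```

The bounds quoted above follow directly: a uniform array of n states at energy E_0 gives F = E_0 − ln(n), and F can never exceed the minimum of E.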
In the Dock2D-IP dataset, 143 of the 5081 training/validation examples and 178 of the 11850 test examples are homodimers. This represents 2.8% and 1.5%, respectively, while we would have expected a ratio of 400/80200 = 0.5% if the probability of binding had been the same for any two shapes. The enrichment in homodimers is even more pronounced in the Dock2D-IF training/validation set, where 35.8% of the 400 homodimeric interactions are positive (143/400) while only 6.9% of the 79800 heterodimeric interactions are positive (5508/79800). In the Dock2D-IF test set, 45.0% of homodimeric interactions are positive (180/400) and 16.1% of heterodimeric interactions are positive (12845/79800). This is a known effect in proteins [30,31], which has been attributed to a number of biophysical or evolutionary reasons [32]. In the Dock2D datasets, the reason is purely geometric and follows the argument originally proposed by Lukatsky and coworkers [33,34]. Since homodimers have a twofold (C_2) symmetry, any region of a homodimer interface contributes to the stabilization energy twice: once on each side of the symmetry axis. Therefore, it is more likely that shapes generated at random will form strong interactions with copies of themselves (homodimers) than with unrelated shapes (heterodimers). Homodimer interactions are also artificially enhanced by the fact that they correspond to φ_0 = −180°, which is a multiple of 90°.

Models
As stated in the Introduction, the main goal of the Dock2D datasets is to facilitate the development of models that can learn from either "interaction pose" (IP) data or "interaction fact" (IF) data. Here we propose two types of models that set a standard of algorithmic performance on each dataset: a "brute-force" model, that generates a prediction by calculating the energy function for all translations and rotations, and an "economical" model, that learns the energy function without calculating the full partition function Z. While the "brute-force" models essentially solve both the IP and IF prediction tasks, they are impractical solutions for the three-dimensional version of the problem, for which the number of states is much larger than in the two-dimensional version. By contrast, the "economical" models aim to provide tractable solutions in higher dimensions.

Overall architecture
All models learn a two-feature representation of each shape (receptor and ligand) using a Siamese SE(2)-equivariant CNN [35] (see Figure 4) and use these features to compute an energy of the form (2), with weights a_00, a_10, a_01, and a_11 treated as learnable parameters. Spatial correlations (3) are computed using the fast Fourier transform (FFT) [36].
The Siamese SE(2)-equivariant CNN module has only two convolutional layers, each using a fixed kernel size of 5 × 5. The first layer maps the input shape onto one scalar and four vectors, and the second layer maps those back to one scalar and one vector, representing the bulk and boundary features used in Eq. (2). Each convolutional layer is followed by a ReLU applied to the norm of the features [35], to preserve the equivariance and allow the scalar and the vectors to mix. The final vector feature is reduced to a scalar by taking an element-wise norm.

Docking models
For the interaction pose (IP) prediction task, the training loss L_IP is defined as the cross-entropy between the predicted Boltzmann distribution,

P(t, φ) = e^(−E(t, φ)) / Σ_{t',φ'} e^(−E(t', φ')),  (7)

and the ground-truth probability distribution, encoded as P = 1 at (t_0, φ_0) and P = 0 everywhere else. The IP task requires learning all parameters of the CNN, as well as the four "a" parameters from Eq. (2). This algorithm is called "brute-force" because it explicitly computes the energy for all possible translations and rotations. While a brute-force approach is entirely feasible in two dimensions, it becomes somewhat prohibitive in three dimensions, requiring a six-dimensional array of energy values to be computed.
A simpler version of the IP task is also devised, in which the correct orientation φ_0 is provided and the model is trained to predict only the correct translation t_0. The training loss is defined as previously but using the predicted Boltzmann distribution over the translations only:

P(t) = e^(−E(t, φ_0)) / Σ_{t'} e^(−E(t', φ_0)).  (8)

This "simplified IP task" can in principle learn the same parameters as the full IP task.
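With a one-hot ground-truth distribution, the cross-entropy reduces to the negative log-probability of the true pose. A minimal sketch (the pose indexing convention is an assumption):

```python
import numpy as np
from scipy.special import logsumexp

def ip_loss(E, pose):
    """L_IP = -log P(pose) for P = exp(-E)/Z over all poses, which
    simplifies to E(pose) + log Z. E has shape (n_rotations, H, W) and
    pose = (phi_index, ty, tx) indexes the ground-truth pose."""
    return E[pose] + logsumexp(-E)
```

For the simplified IP task, the same expression is applied to the translation-only slice E[phi0].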

Binding free energy models
For the interaction fact (IF) prediction task, energy E(t, φ) is further transformed into a single value ∆F meant to represent the binding free energy of the two shapes. The training loss L_IF is defined as the binary cross-entropy between the predicted probability of binding,

P = 1 / (1 + e^(∆F)),  (9)

and the ground-truth fact of interaction (P = 1 or P = 0). The free energy of binding is written as

∆F = −ln Z − F_0.  (10)

In the "brute-force" version of the IF model, the partition function Z is evaluated by integrating over all translations and rotations:

Z = Σ_φ e^(−F(φ)),  (11)

with e^(−F(φ)) = Σ_t e^(−E(t, φ)). Constant F_0 is a learnable parameter that represents the free energy level below which the shapes are considered to interact (when −ln Z < F_0, ∆F < 0 and P > 0.5). The IF task requires learning all parameters from the IP task, plus the F_0 parameter. Equations (9) and (10) involve exponential functions ("sigmoid" and "logsumexp", respectively), which can have vanishing gradients if any of their arguments goes out of bounds. To avoid those situations, we add a regularization loss that prevents any value of ∆F from becoming too large:

L_reg = w (∆F)².  (12)

Weight w is chosen to be small, since the only purpose of this loss term is to preserve a nonzero gradient.
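Putting these pieces together in code (the function name and regularization weight are illustrative assumptions):

```python
import numpy as np
from scipy.special import logsumexp, expit

def if_head(E, F0, w=1e-4):
    """From the full energy array E, compute dF = -ln Z - F0, the binding
    probability P = 1/(1 + exp(dF)), and a small quadratic regularizer
    that keeps dF, and hence the sigmoid/logsumexp gradients, bounded."""
    dF = -logsumexp(-E) - F0
    P = expit(-dF)          # sigmoid; P > 0.5 exactly when dF < 0
    reg = w * dF**2
    return P, dF, reg
```

When F_0 equals −ln Z exactly, ∆F = 0 and the predicted probability of binding sits at the 0.5 decision boundary.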
Figure 4: Brute-force solution to the joint "interaction pose" (IP) and "interaction fact" (IF) problems. The receptor and ligand shapes are converted to their two-component representations, and rotated ligand representations are explicitly generated for φ = −180° to +179°, in increments of 1°. The asterisk (*) represents the operation described by Eq. (2), leading to an energy E(t, φ) and a probability P(t, φ) from which L_IP, the loss function for the IP prediction task, is directly computed. ∆F is computed using Eq. (10), leading to a probability of binding P from which L_IF is directly computed. Operations shown in green contain learnable parameters. The architecture is similar for the "simplified IP task", with a single angular value calculated (the ground-truth angle) instead of 360. For the "sampled IF task", the 360 angular values are replaced by n(B) < 360 values sampled using a Monte Carlo approach (see text).

A second version of the IF model, called "sampled", learns the same energy function as the brute-force model but using only a subset of all possible orientations. Instead of computing the partition function from Eq. (11), we compute the following estimate of it:

Ẑ = Σ_{φ∈B} e^(−F(φ)) + [360 − n(B)] e^(−F_0).  (13)

F(φ) is computed explicitly for orientations φ ∈ B but, for orientations φ ∉ B, it is assumed to have a uniform value F_0, treated as an additional learnable parameter. Estimate Ẑ is expected to provide the same training signal as Z as long as the angles φ ∈ B correspond to the dominant interactions between the shapes. Ẑ becomes equal to Z when B includes all angles φ used in the brute-force calculation of Z, that is, all integers from −180 to +179.
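The estimate Ẑ can be sketched directly from its definition (names are illustrative; F_of_phi stands for the translation-marginal free energy F(φ)):

```python
import numpy as np

def z_hat(F_of_phi, B, F0, n_angles=360):
    """Z_hat = sum over phi in B of exp(-F(phi)), plus a uniform default
    contribution exp(-F0) for each of the (n_angles - |B|) unsampled
    orientations."""
    explicit = sum(np.exp(-F_of_phi(phi)) for phi in B)
    return explicit + (n_angles - len(B)) * np.exp(-F0)
```

When B contains every angle, the default term vanishes and Ẑ equals Z exactly, as stated in the text.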
Sample B is defined separately for each training example and is generated using a Monte Carlo procedure. After assigning each example a random initial orientation φ, a new orientation φ′ is picked by performing 100 steps of random walk (changing the orientation by ±1° at each step). Following the conventional Metropolis algorithm, the new value φ′ is directly accepted if F(φ′) ≤ F(φ) and is accepted with probability e^(−[F(φ′) − F(φ)]) if F(φ′) > F(φ).
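The sampling procedure can be sketched as follows (a hedged reading of the text; the random-walk proposal and the ring wrap-around are our interpretation):

```python
import numpy as np

def metropolis_move(F, phi, rng, n_steps=100):
    """Propose a new integer angle by a 100-step +/-1 degree random walk
    on the ring [-180, 180), then accept or reject it with the Metropolis
    criterion on the free energy F(phi)."""
    phi_new = (phi + rng.choice([-1, 1], size=n_steps).sum() + 180) % 360 - 180
    dF = F(phi_new) - F(phi)
    if dF <= 0 or rng.random() < np.exp(-dF):
        return phi_new   # accepted: this value would be added to sample B
    return phi           # rejected: keep the current orientation
```

Note that a sum of 100 steps of ±1 is always even, so a chain started at an even angle can only ever visit even angles, which explains why the procedure can visit only half of the 360 possible orientations.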

Results
Table 1 shows the performance of the IP models on the Dock2D-IP validation and test sets. Although 4064 training examples are available in the Dock2D-IP dataset, the models were trained on fewer examples (1000, 100, and 10) to illustrate their data efficiency. All IP models were trained for 100 epochs (see Figure A.6 for the training curves). Models reported in Table 1 have mean ligand RMSDs below 2 pixels, indicating that they recover the true poses within the resolution limits of the input. The learned representations, shown in Figure 5, indicate that the models have also recovered the equivalent of the bulk and boundary features used by the generating function.
Interestingly, the performance of the model trained on the "simplified" IP task is comparable to that of the model trained on the full IP task (see Table 1). We also note that very few training examples are needed to achieve optimal performance: using 100 examples instead of 1000 does not significantly affect the performance. The performance is somewhat reduced when only 10 examples are used, but that degradation is due to a small fraction of the validation/test examples for which an alternate pose is predicted (RMSD > 10) (see Figure A.7).

Table 2 shows the performance of the IF models on the Dock2D-IF validation and test sets (see Figure A.8 for the training curves). Interpretation is complicated by the class imbalance of the dataset, which is about 15:1 for the training/validation set (15 non-interacting pairs for each interacting pair) and about 6:1 for the test set. To facilitate the comparison, we also report the performance using the Matthews correlation coefficient (MCC) [38]. All brute-force IF models have high accuracy and precision, whether they are trained in full ("brute-force IF") or using SE(2)-CNN weights transferred from the IP model ("brute-force IF_IP"). The lower recall values reflect the 1:15 and 1:6 imbalance between positive and negative examples. Moreover, each model performs similarly on both the validation and test sets, despite the difference in class imbalance and the greater variety of shapes in the test set.
The sampled IF model learns the energy function despite the free energy being poorly converged. After 1000 epochs of training, the Monte Carlo algorithm has typically visited 60 orientations per example (see Figure A.5), which means that Ẑ from Eq. (13) is computed using the default F_0 value for typically 120 of the 360 discrete orientations.
The energy function learned from the IF task is directly transferable to the IP task and yields comparable, if not better, performance (see the bottom two rows of Table 1). By contrast, the energy function learned from the IP task, which has an undetermined scale because it only serves to identify the lowest-energy pose, can be transferred to the IF task only after appropriate re-scaling. We perform that calibration by transferring to the IF model all parameters learned from the IP model but re-training the "a" coefficients (see Eq. (2)). The corresponding "brute-force IF_IP" models, which are pre-trained on the IP task and have only five learnable parameters (the "a" coefficients and F_0; see Eq. (10)), perform on par with the brute-force IF model (see the second half of Table 2). The features learned from the IP task are very similar whether they are trained on the "brute-force IP" task or on the "simplified IP" task, but are somewhat different from those learned from the IF task (see Figure 5). As expected, the IF task recovers the ground-truth features better, since it is directly transferable to the IP task and therefore more general.

Conclusion
We have proposed the Dock2D datasets (Dock2D-IP and Dock2D-IF) and several representation-learning "baseline" models demonstrating end-to-end learning of a simplified, 2D version of the molecular recognition problem. The representation-learning baselines effectively solved the interaction pose (IP) prediction problem in two dimensions (see Table 1), whether they were trained on the full task of predicting the optimal pose (t_0, φ_0) or on the simplified task of predicting the optimal translation t_0 given the ground-truth orientation φ_0. Training on the IP task (whether "brute-force" or "simplified") was extremely efficient, requiring only 100 epochs and completing in minutes.
The baselines also solved the fact-of-interaction (IF) prediction problem (see Table 2), whether the predicted ∆F values used in the L_IF loss were computed explicitly using Eq. (11) or sampled using Eq. (13). While training models on the IF task took significantly longer (1000 epochs, sometimes taking days to complete), training time can be drastically reduced by starting from the energy function learned from the IP task (see Figure A.8).
One of the goals of this work is to pave the way for algorithms using fact-of-interaction data to infer structural details of protein-protein complexes. Our results suggest that, provided a model implements an energy function E from which both an optimal pose and a binding free energy F can be computed, the representations learned from training that model on the IP task are compatible with those learned from the IF task, and that a two-headed model could be jointly trained on a mixture of structural data and fact-of-interaction data. In the context of the Dock2D toy data, the transferability of learned features is due to the fact that the same energy function was used to generate both the Dock2D-IP and Dock2D-IF datasets. While it is unclear to what extent the same approach will work on experimental data, since the connection between configurational energy E and binding free energy F cannot be exactly defined and since what constitutes a positive interaction depends on the experimental assay used [39], we expect many of the strategies developed in this work to be applicable.

Figure 1: Examples of training/validation shapes generated using predefined values of the α and n parameters. The histograms show how often each value of α and n was picked during shape generation. Both parameters are drawn from the probability distribution (0.25, 0.50, 0.25).

Figure 2: Docking energy of two shapes as a function of ligand orientation φ [see Eq. (2)]. Docked complexes are shown at four values of φ, illustrating the global energy minimum (at φ_0 ∼ −π/2) and three local minima, with the receptor shape in black and the ligand shape in red. The blue horizontal line represents the binding free energy F = −ln Z [see Eq. (6)], which is always less than the global energy minimum.
Figure 3 shows examples of binding modes for interacting and non-interacting examples from the dataset (see also Figure A.4 in Appendix).
Any accepted value is added to B, unless it is already in B. Two Monte Carlo moves are attempted each time the training example is presented to the network, which may add up to 2 new elements to B. Since no elements are ever removed from B during training, Ẑ could eventually become equal to Z for all examples, at which point parameter F_0 would be effectively removed from the model and would become unlearnable. However, this point is never reached in practice. First, since each Monte Carlo move consists of 100 steps of random walk on a ring, the sampling procedure can visit only half of all 360 φ values: either the odd values or the even values. Second, even after 1000 epochs of training, B samples on average 32.8% of all 180 possible orientations per training example (see Figure A.5 in Appendix).

Figure 5: Features learned from each model. The top row shows the ground-truth features (panel A) and their orthogonalized versions (panel B). The second row shows the orthogonalized features learned from the "brute-force IP" task (panel C) and the "simplified IP" task (panel D). The third row shows the orthogonalized features learned from the "brute-force IF" task (panel E) and the "sampled IF" task (panel F). Orthogonalization is based on the ground-truth "a" coefficients for the input features and on the learned "a" coefficients for the learned features.

Figure A.1: Examples of test set shapes generated using predefined values of the α and n parameters. The histograms show how often each value of α and n was picked during shape generation. The α parameter is drawn from the probability distribution (0.0625, 0.25, 0.375, 0.25, 0.0625) and the n parameter from the probability distribution (0.125, 0.375, 0.375, 0.125).

Figure A.6: Training curves of the models presented in Table 1. For the brute-force models (in blue), cross-entropy loss L_IP is calculated for distribution (7); for the simplified models (in red), it is calculated for distribution (8). All models were trained for 100 epochs.

Figure A.7: Distribution of ligand RMSD for the experiment from Table 1 yielding the poorest scores (training on the "simplified" IP task using only 10 examples). The average ligand RMSD is 0.37 for the training set (10 examples), 5.97 for the validation set (1017 examples), and 7.64 for the test set (11850 examples).

Figure A.8: Training curves of the models presented in Table 2. The brute-force models pre-trained by transferring the energy function learned from the IP task (red curves) did not require as extensive training.

Table 1: Performance (mean ligand RMSD) of the models on the validation and test sets of the interaction pose dataset (Dock2D-IP). The entire validation and test sets were used in evaluation (1017 and 11850 examples, respectively). The last two rows report how the features and energy function learned from the IF task perform on the IP task. (a) All learnable parameters transferred from the "brute-force IF" model (and frozen). (b) All learnable parameters transferred from the "sampled IF" model (and frozen).

Table 2: Performance of the models on the validation and test sets (16040 and 80200 examples) of the fact-of-interaction dataset (Dock2D-IF). All models were trained on 100 pairs (5050 training examples). MCC is the Matthews correlation coefficient.