PDBench: evaluating computational methods for protein-sequence design

Abstract

Summary: Ever-increasing amounts of protein structure data, combined with advances in machine learning, have led to the rapid proliferation of methods available for protein-sequence design. To use a design method effectively, it is important to understand the nuances of its performance and how it varies by design target. Here, we present PDBench, a set of proteins and a number of standard tests for assessing the performance of sequence-design methods. PDBench aims to maximize the structural diversity of the benchmark compared with previous benchmarking sets, in order to provide useful biological insight into the behaviour of sequence-design methods, which is essential for evaluating their performance and practical utility. We believe that these tools are useful for guiding the development of novel sequence-design algorithms and will enable users to choose the method that best suits their design target.

Availability and implementation: https://github.com/wells-wood-research/PDBench

Supplementary information: Supplementary data are available at Bioinformatics online.


Background
Proteins are the molecules that perform almost all of the biochemical work in all living things. They have a staggering array of functionality, from incredibly performant materials, like silks and wools, to some of the most efficient catalysts, capable of accelerating complex chemical reactions (Alberts et al. 2002). Beyond their roles in nature, proteins are broadly applied in industry and medicine. They are the active ingredient in many recent blockbuster drugs, such as immunotherapies for cancer (Nahta and Esteva 2006), and protein catalysts (enzymes) are increasingly used for chemical synthesis, providing greener alternatives to traditional chemical processes.

* These authors contributed equally. † To whom correspondence should be addressed.
Proteins are formed from long polymers known as polypeptides. Each polypeptide is assembled from fundamental building blocks called amino acids. These building blocks have a common element that is bonded together to make the polypeptide chain, called the backbone, and a variable region, called the side chain (see Fig. 1), for which there are 20 possible chemical groups (Alberts et al. 2002). The sequence of amino acids (primary structure) leads to the formation of local and long-range chemical interactions (secondary and tertiary structure, respectively), which induce the polypeptide to form a distinct 3D structure, and it is this structure that leads to the protein's function (Anfinsen et al. 1961). Polypeptides vary in length, from small peptides composed of only a few amino acids to huge proteins containing tens of thousands of amino acids, with a median length of around 300 (Brocchieri 2005). The complexity is further increased as multiple polypeptides can assemble to form larger structures (quaternary structure).
The 3D structure of proteins can be determined using three main experimental methods: X-ray crystallography, NMR spectroscopy and cryo-electron microscopy. Advances in these techniques have led to a rapid increase in the amount of structural data that is available, with >180,000 structures available in the Protein Data Bank (PDB) (Berman 2000) as of August 2021. The structures produced by the diligent work of thousands of structural biologists, combined with the foresight and infrastructure to openly share these data, have given us a deep insight into how proteins function. This data set is now rich enough that we can start addressing the problem of how the sequence of amino acids relates to a folded, functional 3D structure, i.e. the "Protein Folding Problem", through computational means. This is an important problem to address, as it is far easier to obtain sequence information than structural information, with genomes available for >64,000 organisms in the NCBI database (https://www.ncbi.nlm.nih.gov/genome/), each of which encodes the amino-acid sequences of 100s to 10,000s of proteins.

Figure 1: a) An experimentally determined protein structure (PDB ID: 1ubq) rendered in "cartoon" format using PyMol (Schrödinger, LLC 2015). The box shows atomic-level detail with the backbone (tubes) and side chains (lines). Nitrogen atoms are in blue, carbon atoms in grey and oxygen atoms in red. b) A cartoon of protein design, which aims to identify sequences of amino acids (stars) that are able to produce a target backbone shape (circles). This is commonly posed as an inference problem, and errors are measured using aggregate metrics such as the accuracy of the inferred sequences. Our experiments reveal that measuring the deviation (RMSD) of the designed protein, as predicted by AlphaFold2 (Jumper et al. 2021), from the target structure provides more insight than generic metrics.
There has been steady improvement in protein-structure prediction algorithms over the past two decades (Kryshtafovych et al. 2019), but recent deep-learning-based methods have utterly transformed the field, with state-of-the-art methods producing models that are within the experimental error of the structure-determination method (Baek et al. 2021; Jumper et al. 2021). The front runner among these algorithms is DeepMind's AlphaFold2 (AF2), which has now been applied to predict the structure of every protein in the human genome, as well as the genomes of 20 model organisms (Tunyasuvunakool et al. 2021), providing an invaluable resource for understanding the function of natural proteins.
However, the protein sequences that have been sampled in nature account for a tiny fraction of all possible proteins. Even for a relatively small protein with around 200 amino acids, there are around 10^260 possible sequences, which is significantly more sequences than have been sampled in every cell of every organism since proteins arose (Baker 2019). This means that it is a statistical certainty that the most performant materials, therapeutics and enzymes for any application have not been sampled. If we are to unlock the potential of this pool of unobserved proteins, known as the dark matter of protein-folding space (Taylor et al. 2009), we must solve the "Inverse Protein Folding Problem": given a desired 3D structure with a useful function, how can we design an amino-acid sequence that will reliably fold into this structure?
To address this challenge, many successful approaches for designing proteins have been developed, including minimal design and rational design, but computational protein design (CPD) has quickly become the most widely used method (Woolfson 2021). The most successful method in this area is Rosetta (Guntas, Purbeck, and Kuhlman 2010), a software suite for computational protein design that uses a physics-based design method, but deep-learning-based methods show a lot of promise for a variety of tasks in CPD, in particular fixed-backbone sequence design, where a desired backbone structure is passed as the input and an amino-acid sequence is returned as the output (Qi and Zhang 2020).
Current methods of benchmarking fixed-backbone sequence design focus on sequence recovery, where the backbones of natural proteins with known amino-acid sequences are passed as the input and the accuracy of the method is measured by the degree of identity between the predicted sequence and the true sequence (Qi and Zhang 2020; Strokach et al. 2020). However, this approach is fundamentally flawed, and accuracy values of this type do not reflect the real-world utility of the design methodology.
To take an extreme example, if a sequence design method had 100% sequence recovery, this would not necessarily be a useful feature. Very often the purpose of protein design is to diversify the properties of known proteins to make them better suited to a particular application, for example stability or expressability (Goldenzweig et al. 2016). With a sequence recovery of 100%, no diversity would be generated, and the method may not be capturing the functional redundancy that we observe in natural proteins, that is, many sequences can fold to adopt the same overall shape (fold). Ultimately, we must move beyond simplistic methods for evaluating design methodologies and provide information to users that will help them to assess whether a specific method will be appropriate for their target application.
Our Contributions Here, we have created a benchmark for evaluating the performance of CPD methods, which gives a more holistic view of performance. It combines a set of about 600 structures, carefully sampled from a broad range of protein folds, with diverse secondary-structure compositions and resolutions. We applied this benchmark to evaluate state-of-the-art methods for sequence design, both physics- and neural-network-based, and uncovered useful information about their applicability and performance. Finally, we used this benchmark to guide the development of highly performant convolutional, graph and hybrid CNN/GNN methods, and tested their performance by applying AF2 to predict the structures of our designed sequences. Our results show that about 85% of our designs are predicted to fold into their intended 3D structure (within 3 Å), despite having <42% sequence identity with the native sequence, which demonstrates their clear utility for designing novel proteins with diverse properties.

PDBench: Evaluation benchmark set
Diverse Set of Structures Our benchmark set contains 595 protein structures spanning 40 protein architectures that are clustered into 4 fold classes (see Fig. 2): mainly-α, mainly-β, α-β and special, as presented in the CATH database (Knudsen and Wiuf 2010). The 'special' category contains proteins that do not have regular secondary structure. Crystal structures with a maximum resolution of 3 Å and up to 90% sequence identity were carefully chosen to cover the structural diversity present in the PDB (see Fig. 2). This ensures that performance is evaluated on high- and low-quality inputs (see Fig. 3) and that the results are not biased towards the most common protein architectures. The benchmark structures were prepared by removing all non-backbone atoms using AMPAL (Wood et al. 2017).

Benchmarking Tool We have developed an open-source benchmarking library that is implemented in Python. The inputs to our program are a prediction matrix (in .csv format) and a dataset map (in .txt format). The prediction matrix is n × 20 for a protein with n amino acids in the polypeptide chain. Each row in the matrix encodes prediction probabilities across the 20 canonical amino-acid classes. The dataset map contains a list of protein chains to be evaluated with the benchmark. The outputs of our program are the metrics for each model in a plot, with the option to generate comparison plots between different models to compare their performance. The software uses the AMPAL library to read protein sequences and then replaces non-canonical residues with standard amino-acid residues for compatibility with protein design software (Wood et al. 2017). For example, selenomethionine is converted to methionine. Optionally, non-canonical residues can be omitted from calculations.
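As a concrete illustration, a prediction matrix of this shape could be consumed as sketched below. The column ordering (alphabetical by one-letter code) and the per-chain accuracy computation are assumptions for illustration, not the actual PDBench file specification.

```python
# Minimal sketch of consuming an n x 20 prediction matrix.
# The amino-acid column order is an assumption, not the PDBench spec.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 canonical residues, one-letter codes

def chain_accuracy(prediction_rows, true_sequence):
    """prediction_rows: one list of 20 class probabilities per residue."""
    assert all(len(row) == 20 for row in prediction_rows)
    assert len(prediction_rows) == len(true_sequence)
    correct = 0
    for row, true_aa in zip(prediction_rows, true_sequence):
        predicted_aa = AMINO_ACIDS[row.index(max(row))]  # argmax over 20 classes
        if predicted_aa == true_aa:
            correct += 1
    return correct / len(true_sequence)
```

In practice each row would be parsed from the .csv file, and the true sequence would come from the structure listed in the dataset map.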
DSSP is used to assign the secondary structure of each residue (Joosten et al. 2010). For simplicity, predictions for α-helix, π-helix and 3₁₀-helix are combined and treated as helices. Predictions for hydrogen-bonded turn, bend and isolated β-bridge residues are combined and treated as structured loops. The CATH database (Knudsen and Wiuf 2010) is used to assign protein architectures to chains.
Metrics We calculate four groups of metrics: 1) recall, precision, AUC, F1 score, Shannon's entropy and prediction bias for each amino-acid class; 2) accuracy, macro-precision, macro-recall, similarity and top-3 accuracy for each protein chain; 3) accuracy, macro-precision, macro-recall, similarity and top-3 accuracy for each secondary-structure type; 4) accuracy, macro-precision, macro-recall, similarity and top-3 accuracy for each protein architecture. All metrics except similarity and prediction bias are calculated with SciKit-Learn (Pedregosa et al. 2011). Prediction bias is a metric measuring the discrepancy between the occurrence of a residue and the number of times it is predicted.
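Prediction bias admits a straightforward computation; one plausible formulation (the exact normalisation used by PDBench is an assumption here) is the fraction of positions predicted as a class minus the fraction truly belonging to it:

```python
from collections import Counter

def prediction_bias(true_seq, pred_seq):
    """Per-class bias: fraction of positions predicted as a class minus the
    fraction observed. Positive = over-predicted, negative = under-predicted.
    This exact formula is an assumption; PDBench may normalise differently."""
    n = len(true_seq)
    true_counts, pred_counts = Counter(true_seq), Counter(pred_seq)
    classes = set(true_seq) | set(pred_seq)
    return {aa: (pred_counts[aa] - true_counts[aa]) / n for aa in sorted(classes)}
```

Under this definition, a bias near 0% for every class means the designed sequences reproduce the amino-acid composition of the targets.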
Accuracy-based metrics are useful, but there is functional redundancy between amino acids, as many side chains have similar chemistry. The similarity of amino acids can be determined from the relative frequency of substitution of one amino acid for another observed in natural structures, using substitution matrices such as BLOSUM62, which we combine into a similarity score for the sequence (Henikoff and Henikoff 1992). We also created torsion-angle comparison plots between true and predicted residues for each model. The Φ and Ψ angles for the plots are extracted using AMPAL (Wood et al. 2017); the residue count for each torsion-angle pair for each amino acid is normalized by the true number of that amino acid in the dataset.
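One way such a similarity score can be formed is to count a position as similar whenever the true-to-predicted substitution has a positive BLOSUM62 log-odds score, i.e. is observed more often than chance in related proteins. Whether PDBench uses this thresholding or combines the raw scores is an assumption; the matrix fragment below shows only a handful of the real 20 × 20 entries.

```python
# Illustrative fragment of BLOSUM62 (symmetric log-odds scores).
# Only a few entries are shown; the full 20x20 matrix is used in practice.
BLOSUM62_FRAGMENT = {
    ("L", "L"): 4, ("L", "I"): 2, ("L", "V"): 1,  # hydrophobics substitute freely
    ("D", "D"): 6, ("D", "E"): 2, ("K", "R"): 2,  # like-charged pairs score well
    ("L", "D"): -4,                               # hydrophobic <-> acidic penalised
}

def similarity(true_seq, pred_seq, matrix=BLOSUM62_FRAGMENT):
    """Fraction of positions whose substitution has a positive BLOSUM score.
    The thresholding at zero is an assumption about the PDBench score."""
    hits = 0
    for t, p in zip(true_seq, pred_seq):
        score = matrix.get((t, p), matrix.get((p, t), 0))  # matrix is symmetric
        if score > 0:
            hits += 1
    return hits / len(true_seq)
```

Under this definition, predicting isoleucine where the target has leucine counts as similar even though it lowers raw accuracy, which is exactly the functional redundancy the metric is meant to capture.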

Structure Evaluation
We computationally validate designed sequences using AF2 (Jumper et al. 2021) to determine whether the sequence adopts the intended 3D structure. The sequence predictions are used as an input, and the predicted structure is compared to the target structure by calculating a length-normalised form of the Root Mean Squared Deviation (RMSD) (Carugo and Pongor 2008). As the folding predictions are computationally demanding, we randomly chose 59 monomeric protein structures from our benchmark. We used a modified version of ColabFold (Mirdita, Ovchinnikov, and Steinegger 2021) to allow multiple simultaneous predictions on a single GPU (about 2 days for each set of 59 structures). RMSD was calculated using the align command in PyMOL (Schrödinger, LLC 2015).
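The length normalisation of Carugo and Pongor is commonly quoted as scaling the raw RMSD to the value expected for a 100-residue protein; a minimal sketch of that formula (check the original paper before relying on it) is:

```python
import math

def rmsd100(rmsd, n_residues):
    """Length-normalised RMSD (Carugo and Pongor): rescales a raw RMSD to the
    value expected for a 100-residue protein, so structures of different
    lengths can be compared. Formula as commonly quoted; treat as a sketch."""
    return rmsd / (1 + math.log(math.sqrt(n_residues / 100)))
```

For a 100-residue protein the denominator is 1, so the normalised and raw values coincide; for longer proteins the raw RMSD is scaled down, reflecting that larger structures accumulate deviation more easily.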

Existing Deep Learning-based Models
The input to CNN models is a fixed-size 3D grid, that we call a frame, which is centred on the Cα atom of the input amino acid, with a frame for each residue in the protein sequence. The network is trained to classify, predicting probabilities across a fixed set of 20 classes for each frame. The input to GNN models on the other hand is the whole protein as a graph. Each residue is represented as a node in the graph, with an edge connecting two nodes if the corresponding residues are within a predefined distance threshold. The GNN is then trained to produce a 20-dimensional embedding that represents the class probabilities for all residues at once. An illustration of the different types of data input is shown in the Supplementary Materials.
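The graph construction described above can be sketched as follows. This is a generic illustration of the distance-threshold rule, not any specific model's featurisation; the 12 Å default matches the cutoff reported for ProteinSolver, and using only Cα atoms is an assumption.

```python
import numpy as np

def build_edges(ca_coords, threshold=12.0):
    """Edge list for a residue graph: connect two residues whose Calpha atoms
    lie within `threshold` angstroms. Generic sketch of the construction
    described in the text, not any published model's exact input pipeline."""
    coords = np.asarray(ca_coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))                # pairwise distances
    i, j = np.where((dist < threshold) & (dist > 0))   # exclude self-loops
    return list(zip(i.tolist(), j.tolist()))
```

Each node would then carry per-residue features (for a GNN trained from scratch, backbone geometry; for the hybrid models below, CNN frame predictions), and the network produces a 20-dimensional output per node.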

ProDCoNN (CNN)
We replicated the architecture described in Zhang et al. (2020). We used our open-source voxelisation library to create the protein dataset used to train the model. The neural network was built using Keras (Chollet et al. 2015).

DenseCPD (CNN) Qi and Zhang (2020) proposed using a 3D DenseNet architecture for sequence design. Their model contains 3M trainable parameters; however, we were not able to replicate it with the information given. Qi and Zhang (2020) kindly shared their model architecture for us to train with the same dataset as the other models, although the architecture provided was different to that described in the paper. When trained with the same data as the other models, we were not able to match the published performance of DenseCPD, so we used their trained model. However, it must be noted that this model has been trained on some of the benchmark structures.

ProteinSolver (GNN) is a model proposed by Strokach et al. (2020). Their code and model are open-source and available online. The model uses a threshold of 12 Å to consider two residues connected by an edge.

Novel models
TIMED-Unbalanced (CNN) We used the Keras framework as a high-level interface to TensorFlow (Abadi et al. 2015), to create a deep CNN, which we refer to as TIMED (Three-dimensional Inference Method for Efficient Design). Details of the architecture are in the Supplementary Material. One of the key features of our neural network is the use of the final Global Average Pooling (GAP) layer rather than a Fully Connected (Dense) layer, to preserve spatial information (Lin, Chen, and Yan 2014). The model also uses Spatial Dropout rather than standard dropout to help enforce this relationship. The model was trained with categorical cross entropy loss.

TIMED-Balanced (CNN)
To avoid biasing our neural network towards the natural frequency of amino acids, we use a random undersampling method. In the training and validation sets, all residue classes are capped to match the count of the least abundant residue. At the beginning of every epoch, for residues with a higher count than the minimum, frames are randomly re-sampled, increasing the total number of distinct frames observed by the network over training.
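A minimal sketch of this per-epoch balancing scheme, assuming frames are grouped by amino-acid class (this is an illustration of the idea, not the TIMED training code):

```python
import random

def undersample_epoch(frames_by_class, seed=None):
    """Cap every residue class at the size of the least abundant class,
    re-drawing the majority-class subsamples each epoch so the network
    eventually sees more of their frames. Sketch of the balancing scheme
    described above, not the actual TIMED training pipeline."""
    rng = random.Random(seed)
    cap = min(len(frames) for frames in frames_by_class.values())
    epoch = []
    for aa, frames in frames_by_class.items():
        sample = frames if len(frames) == cap else rng.sample(frames, cap)
        epoch.extend((aa, frame) for frame in sample)
    rng.shuffle(epoch)
    return epoch
```

Calling this once per epoch yields a balanced batch stream while still exposing the model to different majority-class frames across epochs.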

GX (GNN + GNN-CNN Hybrid)
Our Graph eXpanded (GX) method involves a GNN architecture with several SAGE layers (Hamilton, Ying, and Leskovec 2018) and a mean aggregation function. While we experimented with several features, the model utilises the frame predictions of TIMED as its input features, along with the coordinates of the C, N, O and Cα atoms. We call this GX[PC]; it is a hybrid CNN-GNN model that aims to improve on the high CNN performance by adding longer-range information captured by the GNN. We used a weighted categorical cross-entropy loss, where the weights were obtained from the inverse natural frequency of residues (Gasteiger et al. 2005).

Figure 4 (bottom): The negated correlation coefficient (Y axis) between macro-recall and the resolution (in Å) of the input structure. The performance of DenseCPD depends on inputs with fine resolution. All p-values were significant (< 10^-8) except for ProteinSolver (0.3), which is excluded from the plot.
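Inverse-frequency class weights of the kind used for the weighted cross-entropy loss can be derived as sketched below. Normalising the weights to a mean of 1 is an assumption for illustration; the published loss may scale them differently.

```python
from collections import Counter

def inverse_frequency_weights(training_sequences):
    """Class weights for a weighted cross-entropy loss: each amino acid is
    weighted by the inverse of its frequency in the training set, normalised
    so the weights average to 1. The normalisation choice is an assumption."""
    counts = Counter(aa for seq in training_sequences for aa in seq)
    total = sum(counts.values())
    raw = {aa: total / c for aa, c in counts.items()}  # rarer class -> larger weight
    mean = sum(raw.values()) / len(raw)
    return {aa: w / mean for aa, w in raw.items()}
```

These weights would then be passed to the loss so that errors on rare residues (e.g. tryptophan, cysteine) cost more than errors on abundant ones (e.g. leucine, alanine).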

Model Performance per Class
We compared all the models with our benchmark using the accuracy, macro-recall and similarity metrics (see Fig. 4).
Since proteins can adopt a wide range of structures, per-fold metrics provide more meaningful insights into the performance of a model. We divided our benchmark set (595 structures) into four categories of protein folds, as explained in Sec. 2. See Fig. 2 for the proportion of structures in each category. We calculated metrics for each protein structure and averaged the metrics across folds. Error bars show one standard deviation from the mean within each group. Of the physics-based methods, Rosetta outperformed EvoEF2 in all our tests and all metrics by about 4–10%. All TIMED and DenseCPD models marginally outperformed the physics-based methods across the metrics, regardless of structure type, except for the "special" class. DenseCPD had the highest performance across all metrics by a considerable margin. ProteinSolver had the lowest performance across all methods (vide infra).

Dependence on Resolution
The spatial resolution of experimentally determined protein structures varies by methodology and other factors. The finer the resolution, the more precisely the positions of the atoms in the structure are known. We calculated correlations between macro-recall and input resolution (in Å) for all models across different folds (Figure 4). Except for ProteinSolver, all models had a significant correlation between resolution and macro-recall, with p-values < 10^-8. DenseCPD consistently had a higher correlation than the other models, which could be explained by the fact that it was trained on 2 Å structures, while the other deep-learning models (with the exception of ProteinSolver) were re-trained on 3 Å structures. Rosetta exhibited a lower correlation than the other models on all protein classes except 'special'.

Prediction Bias
The composition of amino acids in proteins is not uniformly distributed and can vary significantly between different protein folds. Therefore, we investigated the effect of balancing our frame dataset prior to training. When TIMED and ProDCoNN were trained without balancing, the accuracy on the benchmarking set increased from 38.5% to 41.1% and from 35.1% to 37.7%, respectively. However, this also resulted in increased prediction bias for the most common amino acids, as well as decreased macro-recall. This is particularly obvious for alanine, glutamate and leucine in α-helices (4%, 10% and 6% bias in TIMED; 2%, 15% and 10% in ProDCoNN, respectively) and, to a lesser extent, leucine and valine in β-sheets (3% and 5% in TIMED; 4% and 7% in ProDCoNN) (see Supplementary Material). When the amino-acid distribution was balanced by random undersampling, the prediction bias was closer to 0% for all amino acids, indicating that predicted sequences had the same amino-acid distribution as true sequences. Overall, the TIMED architecture appears to be more robust to amino-acid imbalance, which could be key to designing natural-looking sequences.
The benchmark also produces torsion-angle comparison plots that investigate the origins of prediction bias in more detail. The geometry of the protein backbone can be described primarily by rotation around the N–Cα and Cα–C bonds. These rotations are described by the two torsion angles Φ and Ψ. Certain backbone conformations are associated with specific combinations of Φ and Ψ, e.g. repeated regions of Φ = -60, Ψ = -60 lead to the formation of α-helical secondary structure. By comparing the torsion-angle frequencies of true and predicted amino acids, we can identify whether predicted amino acids are in energetically favourable regions and how their distribution has changed with respect to different structures within proteins. For example, the first iteration of the TIMED model had an accuracy of 26% but a 19% prediction bias for glycine. We discovered that glycine is over-predicted almost everywhere, but especially in the α-helical region (Φ = -60, Ψ = -60) and, to a lesser extent, in the region associated with β-sheets (around Φ = -140, Ψ = 140), and is frequently confused with other amino acids (Fig. 6).

Design Evaluation with AlphaFold2 (AF2)
We randomly selected 59 structures from the benchmark (about 10% of the benchmark), covering the mainly-alpha, mainly-beta and alpha-beta classes. We predicted residue sequences using each of the models in Sec. 3. We provided the predictions as input to AF2, a state-of-the-art protein-folding model, to obtain hypothetical 3D structures of the predicted sequences. We then compared the RMSD between the predicted structure and the target structure. We did this for each model and each class. We excluded structures in the "special" class as they are highly irregular.
As shown in Fig. 5, the physics-based methods tend to have a larger range of RMSD, especially EvoEF2. Rosetta has a larger RMSD range in the mainly beta fold. Most models performed better for alpha-beta folds. The performance of deep-learning models is usually equivalent or better than the physics-based methods, with DenseCPD performing best overall. It generally has the smallest range in RMSD except for the mainly-alpha fold where it has a larger range of RMSD than TIMED-unbalanced. The unbalanced models seem to outperform the balanced ones in terms of RMSD, consistently over all folds. The GX[PC] model seems to improve the prediction of TIMED-Balanced in the mainly beta fold and alpha-beta. The ProDCoNN model typically tends to perform better in mainly beta structures.
We also visualise these data using a percentage plot showing how many structures are present at each RMSD threshold (see Supplementary Material). We see that, under 6 Å, deep-learning models tend to have a similar percentage of structures at each RMSD threshold. DenseCPD is the best overall model, but it is closely matched by other deep-learning models. It is worth reiterating that some of these structures from the benchmark were present in the training set of DenseCPD. We avoided running ProteinSolver on AF2 given its poor performance in our benchmark.
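The percentage plot described above amounts to a cumulative success curve over the per-structure RMSD values; a minimal sketch (not the PDBench plotting code) is:

```python
def percent_under_threshold(rmsd_values, thresholds):
    """Cumulative success curve: percentage of designed structures whose
    (normalised) RMSD to the target falls under each threshold. Sketch of
    the percentage plot described in the text, not the PDBench code."""
    n = len(rmsd_values)
    return {t: 100.0 * sum(r < t for r in rmsd_values) / n for t in thresholds}
```

Evaluating this for each model at a shared grid of thresholds gives directly comparable curves, e.g. the fraction of designs within 3 Å of the target.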

Discussion
Combinations of Metrics Across Classes When considering accuracy metrics (Fig. 4) along with similarity to the target structures (Fig. 5), there is a marked difference in the performance of all the design algorithms across the different fold classes. It is interesting that all of the deep-learning-based methods performed well when designing "mainly β" structures, as these are traditionally very challenging targets (Woolfson et al. 2015; Huang, Boyken, and Baker 2016). Furthermore, the accuracy of sequence recovery was more strongly correlated with resolution in the β-rich classes ("mainly β" and "α-β") (Fig. 4), which suggests that the sequence preferences in β structure are closely linked to subtle details of the backbone conformation. This might indicate why β design is more challenging: it is more difficult to generate high-quality backbone models for β-rich structures than for α-rich structures, most likely due to their higher degree of conformational flexibility compared to α-helices, and so our focus should be on improving backbone-modelling techniques for β structure rather than sequence-design methods.

Glycine Trap The over-prediction of glycine by our initial model is likely related to the flexibility of this residue, which allows it to adopt a broad range of combinations of Φ and Ψ (Lovell et al. 2003). It is well documented that the identity of the side chain has a large impact on the backbone conformation (Shapovalov and Dunbrack 2011), so it is likely to be important to the quality of predictions. However, if too much attention is placed on backbone torsion angles, it is easy to imagine that a network would over-predict residues with a strong influence on backbone conformation, such as glycine and proline, which is what we see in our initial model (see Fig. 6, right). However, glycine is highly destabilizing in α-helices (López-Llano, Campos, and Sancho 2006), so it is important to be aware of these biases.
PDBench can easily identify cases where this, or other sequence biases, have occurred, which can be used to guide the development of the design methodology.
Balancing Amino Acids Using the benchmark, we explored the effect of balancing amino-acid classes. In some ways, unbalanced classes better reflect the biochemical availability of the individual amino acids, which could improve production of the proteins in living systems. However, this would mean that the biases of natural proteins would be reflected in the sequences produced by the design algorithm, when there is strong evidence that functional proteins exist in sequence spaces that are unexplored in nature (Weidmann et al. 2019). The benchmark highlights these sequence biases clearly, and armed with its results, potential users can decide whether this behaviour is desirable for their specific application.

Figure 5: Boxplot showing the ranges of RMSD between the real structure and the predicted structure for each model, separated by protein class. 59 structures were randomly selected from the benchmark, and the predicted sequence from each model was fed to AF2 to obtain a theoretical structure. The structures were then compared using RMSD100, a normalised version of RMSD for comparing protein structures of different lengths proposed by Carugo and Pongor (2008).
ProteinSolver and de novo Design We were surprised to see that the performance of ProteinSolver was lower using our benchmark compared to their published results. We hypothesised that this could be due to differences in the input structures. If side-chain atoms of the input structures were included when calculating the distance matrix, which is used as an input to their model, the distance matrix would indirectly encode sequence information present on the backbone scaffold. For example, for long and flexible residues like glutamate and lysine, those residues could be closer to many more residues than the backbone alone, which will have an obvious impact on the distance matrix. When designing de novo, the identity of the side-chains is unknown, and so we strip the side chain atoms out of the benchmark structures. We tested ProteinSolver using our benchmark structures both with and without side-chain atoms, which confirmed our hypothesis of label leakage, as the macro-recall dropped from 36.3% to 13.3%. ProteinSolver clearly has some utility for making minor changes to natural sequences, but it appears to be unsuitable for de novo design.
Significance Threshold for Metrics We have found that accuracy, macro-recall and other statistical metrics can accurately estimate a model's performance only up to a certain point. For example, the TIMED-Unbalanced and DenseCPD models differ significantly in accuracy, macro-recall and similarity. However, the differences in RMSD of the AlphaFold2 predictions are not significant, so it appears that they both produce sequences that will fold to the target structure. Perhaps, after a certain accuracy point, statistical metrics become less relevant. In the case of de novo design, for example, high accuracy might limit the utility of the design method, as the sequences produced will have lower variability. In real-world protein applications, the increased diversity of lower-accuracy models might be more desirable, especially when the experimental strategy involves high-throughput screening.
Limitations We have used structure predictions from AF2 to validate the designs produced by the sequence-design algorithms, which is computationally demanding. To provide context, the evaluation using AF2 (on about 10% of the benchmark set) took two days of computation on our servers. While the reported performance is impressive (Jumper et al. 2021), it remains to be seen whether this is a useful method for determining if designs will fold in experimental settings. As AF2 is trained on observed protein structures, the predictive power of this model may be lower for sequences outside of observed structural space. In this case, physics-based simulations such as molecular dynamics might be required to evaluate the AF2 structures further, although these simulations are even more computationally expensive, reducing the scale at which design can be performed.

Conclusion
We have presented PDBench, a dataset and software package for evaluating fixed-backbone sequence design algorithms. The structures included in PDBench have been chosen to account for the diversity and quality of observed protein structures, giving a more holistic view of performance. We find that generic metrics such as classification accuracy and recall are less informative when determining the utility of computational models for protein-sequence design, as they can obscure properties of the design method that could have a major impact on the quality of the designed sequences. Finally, structure predictions of designed sequences showed that there appears to be diminishing returns from further improving fixed-backbone design algorithms, and so, in the spirit of quantum physicists before us, perhaps it is time to "shut up and calculate" (Mermin 1989). We are currently in the process of experimentally validating designs from the TIMED models in the laboratory.