High-accuracy protein structure prediction in CASP14

The application of state-of-the-art deep-learning approaches to the protein modeling problem has expanded the "high-accuracy" category in CASP14 to encompass all targets. Building on the metrics used for high-accuracy assessment in previous CASPs, we evaluated the performance of all groups that submitted models for at least 10 targets across all difficulty classes, and judged the usefulness of the models produced by AlphaFold2 (AF2) as molecular replacement search models with AMPLE. Driven by the qualitative diversity of the targets submitted to CASP, we also introduce DipDiff as a new measure of the improvement in backbone geometry provided by a model over the available templates. Although the large leap in accuracy seen in CASP14 is due to AF2, the second-best method in CASP14 outperformed the best in CASP13, illustrating the role of community-based benchmarking in the development and evolution of the protein structure prediction field.

seen in harder targets, where only ab initio, template-free methods could be applied. By making use of remote sequence homology searches with the same approaches as the aforementioned methods, improvements in methods based on contact prediction 10 allowed an increase in model accuracy for harder targets, but never to the levels reached for targets for which at least one experimental structure of a related sequence was known. 9 Only in 2018, in CASP13, was a jump in accuracy seen, due to the inclusion of deep-learning methods for contact prediction. Here, and for the first time, very difficult targets were modeled with an average GDT_TS of 70% by DeepMind's AlphaFold (AF) method. 1,11 Now, in CASP14, we saw a further leap in model accuracy, 36 with the best model for targets at any difficulty level reaching a GDT_TS above 90%, a range at the level of experimental accuracy. With this, model accuracy now appears to be uncoupled from target difficulty, which gives a new meaning to high-accuracy modeling. For this reason, in the high-accuracy category of CASP14 we assessed the models built for all regular targets, not focusing only on those for which template-based methods could be applied. Building on the evaluation metrics applied in the assessment of high-accuracy models in previous CASPs, but also comparing the geometric quality of the models with that of the target, we analyzed models, possible templates and target structures for their overall and local accuracy, as well as for their usefulness in molecular replacement (MR). As expected from the results in other CASP14 categories, AlphaFold2 (AF2) produced the most accurate models for most targets, and AF2 models could solve a large majority of crystal structures by MR. Nevertheless, the second-best method, an updated version of the deep learning-based trRosetta 12

| Target classification and scope
Despite assessing all regular targets (i.e., those whose identifiers start with "T"), we still separated them into different difficulty categories to compare and rank the different methods. We used the target classification described in reference 13, provided by the Prediction Center (https://predictioncenter.org/). 14 There, the submitted targets are divided into evaluation units (or domains), each classified as "template-based easy" (TBM-Easy), "template-based hard" (TBM-Hard), "hybrid template-based/free modeling" (TBM/FM), or "free modeling" (FM), depending on whether good templates could be found in the PDB at the time of the experiment. Only individual evaluation units were considered for method ranking; these are referred to as "targets" for the remainder of the text. In addition, the difficulty of each individual target was computed as the average of the coverage of the best structural template and the HHsearch probability of the match between the target and that template.
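As a minimal illustration, the per-target difficulty described above can be sketched as follows. This is a hypothetical helper, not the assessment code; the function name and the assumption that both inputs are fractions in [0, 1] are ours.

```python
def target_difficulty(template_coverage: float, hhsearch_probability: float) -> float:
    """Average of best-template target coverage and HHsearch match probability.

    Both inputs are assumed to be fractions in [0, 1]; higher values
    indicate an easier target (a good, well-covering template exists).
    """
    return (template_coverage + hhsearch_probability) / 2.0

# Example: a template covering 80% of the target, matched with
# an HHsearch probability of 95%
print(target_difficulty(0.80, 0.95))
```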

| CASP13-Inclusion of torsion angle deviations
Many metrics for the evaluation of model accuracy have been developed over the years, especially in the context of CASP, and a large number are computed and made available through the Prediction Center. In CASP13, Croll et al. 15 adopted, and built on, the same overall ranking score used for TBM models in CASP12, 16 which was based on five metrics: (1) GDT_HA, the high-accuracy version of the Global Distance Test (GDT), rewarding parts of the target that could be reproduced with high precision 3 ; (2) lDDT, a local difference distance test that evaluates how well the all-atom distance map of the target is reproduced by the model 17 ; (3) CADaa, which compares residues' contact surface areas 18 ; (4) Sphere-Grinder (SG), a measure of local environment conservation 19 ; and (5) ASE, the accuracy self-estimate measure that assesses how well the coordinate error estimates provided by the predictors in the model indeed predict the real positional deviations from the target. 14 To these metrics, Croll et al. added three other terms to evaluate the geometric fitting of models: the (1) backbone and (2) side-chain deviation scores, which measure the local difference of torsion angles in the model relative to the target 15 ; and (3) the MolProbity clashscore, which assesses the number of serious clashes in the model. The overall CASP13 ranking score, S CASP13 , combined the adjusted z scores of all these metrics, with z as the adjusted z score of the underlying metric over all models for a given target. 15

| CASP14-Inclusion of geometrical improvement
In CASP14, we implemented an additional novel metric that evaluates whether the backbone of a predicted model is geometrically better than the experimental structure of the target, an evaluation for which no metric was yet available. This metric stems from the observation that experimental structures submitted to CASP for use as targets are not only biologically but also qualitatively diverse; for example, some may not correspond to a final model and may require further refinement steps that would improve the geometry of their backbone (see the example in Figure S1). Thus, assuming that the target structure corresponds to the ground truth may lead to low scores for models that actually lack problems present in the unrefined target and are, thus, likely more accurate. To account for that, we developed DipDiff, which is based on the DipScore. 20 The DipScore is a distance geometry-based, local protein backbone validation score that is used for guiding automated protein model building with ARP/wARP at medium-to-low resolution 20,21 and as a general protein backbone validation score, 22 and is also useful for the identification of geometrically strained residues of potentially functional importance. 20 Briefly, it is computed from the all-to-all interatomic distances between the backbone Cα and O atoms of the residue to be analyzed and its two flanking neighbors, and evaluates the likelihood of the observed combination of interatomic distances based on those found in high-quality, high-resolution structures from the PDB and those expected from a random sampling of atoms around a target Cα; the closer the DipScore is to one, the more likely that conformation is to be geometrically correct. It is computed using DipCheck, 20 distributed through the ARP/wARP and CCP4 23 software packages.
To compare the models to the target structure, we followed an approach similar to that of Croll et al. 15 and, for each target, computed the per-residue DipScore differences between model and target, so that a positive value corresponds to a residue with a better geometrical environment in the model than in the target structure, and a negative value to the inverse, and took the average difference as the DipDiff score. To evaluate the usefulness of this metric for assessment, we compared it to the different metrics in the S CASP13 scoring function and to those available through the Prediction Center by computing the Pearson correlation coefficient between the different metrics.
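The DipDiff computation described above can be sketched as follows, assuming that per-residue DipScores (e.g., as produced by DipCheck) are already available as residue-aligned arrays. The function name and data layout are illustrative, not those of the published scripts.

```python
import numpy as np

def dipdiff(model_scores, target_scores):
    """Average per-residue DipScore difference, model minus target.

    Positive -> the model's backbone geometry is, on average, better
    than the target's; negative -> worse; ~0 -> comparable.
    Inputs are assumed to be residue-aligned sequences of DipScores.
    """
    model = np.asarray(model_scores, dtype=float)
    target = np.asarray(target_scores, dtype=float)
    return float(np.mean(model - target))

# Toy example: the model fixes one strained residue (0.2 -> 0.9)
# without introducing new problems, giving a positive DipDiff.
print(dipdiff([0.9, 0.9, 0.9], [0.9, 0.2, 0.9]))
```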
For ranking, we updated the S CASP13 scoring function, giving DipDiff the same weight as the other backbone geometry-evaluating scores, with z denoting the adjusted z score of the underlying metric over all models for a given target, computed according to Croll et al. 15 : a set of initial Z-scores for a given metric was computed based on the mean and SD of all models under consideration; all models yielding a Z-score below −2 were then considered outliers, and the Z-scores were recomputed using the mean and SD calculated excluding them.
Finally, negative Z-scores were set to zero in order to reduce the penalty on groups that tested novel methods. S CASP14 and S CASP13 ranking scores were computed for all models and all targets, but only the first model submitted by each method (model 1) was considered for ranking. Individual side-chain quality was assessed based solely on the CASP13 geometric score.
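The adjusted z-score procedure described above can be sketched as follows. This is an illustrative reimplementation under our reading of the text; the function name and NumPy-based layout are ours, not the assessment code.

```python
import numpy as np

def adjusted_z_scores(values, outlier_cutoff=-2.0):
    """Adjusted z-scores following the procedure described in the text:
    (1) compute initial Z-scores from the mean/SD of all models;
    (2) treat models with an initial Z-score below the cutoff as outliers;
    (3) recompute Z-scores using the mean/SD of the remaining models;
    (4) floor negative Z-scores at zero to soften the penalty on
        groups testing novel methods.
    """
    v = np.asarray(values, dtype=float)
    z = (v - v.mean()) / v.std()          # initial Z-scores
    kept = v[z >= outlier_cutoff]         # drop outliers
    z = (v - kept.mean()) / kept.std()    # recompute without outliers
    return np.maximum(z, 0.0)             # floor negatives at zero
```

Note that the recomputed Z-scores are assigned to all models (including the outliers, which then end up floored at zero), mirroring the description above.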

| Template selection and scoring
For targets classified as TBM-Easy or TBM-Hard, we also compared their putative templates with the submitted structures. For that, we used the pre-computed results from running HHsearch 24 and LGA 3 to find residue correspondences, with the same LGA parameters used by the Prediction Center for target-template structural alignment.

| Method ranking
In CASP, it is common to rank methods based on the sum of the ranking scores of their models across all targets for which they submitted models. However, such an approach directly weights the number of targets for which a given method submitted a model, and favors methods that systematically underperformed but may have modeled one target particularly well. We believe it is fairer to reward methods for their consistent ability to accurately predict structures and, in CASP14, ranked methods in the high-accuracy category based on the median S CASP14 . Only methods that submitted models for at least 10 targets were considered, and they were ranked based on their first model (i.e., the model submitted as their predicted best model, model 1). Methods were also classified into different categories based on the group type provided by the Prediction Center (human or server) and the keywords made available through their abstracts, especially the "DeepL" (Deep Learning, DL) keyword, which states whether DL-based approaches were used.
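The median-based ranking described above can be sketched as follows. The function name and the input layout (a mapping from group name to the per-target model-1 scores) are illustrative assumptions, not the assessment code.

```python
from statistics import median

def rank_methods(scores_by_method, min_targets=10):
    """Rank methods by the median of their per-target S_CASP14 scores.

    Only methods with at least `min_targets` scored first models are
    considered; higher median means a better (earlier) rank.
    """
    eligible = {m: s for m, s in scores_by_method.items() if len(s) >= min_targets}
    return sorted(eligible, key=lambda m: median(eligible[m]), reverse=True)

# Toy example: "C" is excluded for submitting too few targets.
print(rank_methods({"A": [2.0] * 10, "B": [1.0] * 10, "C": [3.0] * 5}))
```

Ranking by the median rather than the sum means a single spectacular model cannot compensate for consistently mediocre submissions, matching the rationale given above.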

| Evaluation of model usefulness for MR
For targets solved via X-ray crystallography, we were able to assess the usefulness of the models as MR search models. To do this we employed AMPLE, 25 an MR pipeline designed to rationally truncate inaccurately predicted regions of ab initio models. Here, AMPLE's single-model mode 26 was modified to make use of the local RMS error estimates usually present in the B-factor column of the submitted models. These RMS error estimates were squared and multiplied by 8π²/3 so that they could be interpreted as B-factors and thereby down-weight predicted unreliable contributions of the model in MR. 15 They were also used to guide AMPLE's progressive truncation process, through which the predicted least reliable regions of the model were removed in ~5% increments. MR solutions were verified using phenix.get_cc_mtz_pdb 27 , where a global map correlation coefficient (map CC) ≥0.25 was considered a solution. Given the relatively large computational overhead of MR, this assessment with AMPLE was limited to AF2 models. Full-length models were used for all the datasets except T1032, T1073, T1080 and T1091, where it was apparent that the crystallized structure was only a fragment of the full-length target: in those cases, the part of the model corresponding to the crystallized section was used.
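The conversion of per-residue RMS error estimates into equivalent B-factors described above follows the standard relation B = (8π²/3)·⟨u²⟩; a minimal sketch (function name ours):

```python
import math

def rms_error_to_bfactor(rms_error: float) -> float:
    """Convert a per-residue RMS coordinate-error estimate (in Å) into
    an equivalent isotropic B-factor, B = (8*pi^2/3) * rms^2, used to
    down-weight unreliable model regions in molecular replacement.
    """
    return (8.0 * math.pi ** 2 / 3.0) * rms_error ** 2

# Example: a 1 Å error estimate corresponds to a B-factor of ~26.3 Å².
print(round(rms_error_to_bfactor(1.0), 1))
```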

| Code and data availability
All these steps were implemented as Python scripts, and more details, as well as dependencies (e.g., pdb-tools 28 for the handling of PDB files incompatible with some tools), can be found in the source code, which is available at https://github.com/JoanaMPereira/CASP14_high_accuracy. The computed data, as well as a script for the individual calculation of the DipDiff given two structures, are also available through the same link. The modifications made to AMPLE are now available through CCP4. 23

| RESULTS AND DISCUSSION

| Backbone geometric quality and accuracy
Before proceeding to method/group ranking based on model accuracy according to the different metrics, we first looked at the individual targets and how their backbone geometry compared to that of their models. For that, we developed and used the DipDiff metric, which corresponds to the average per-residue DipScore difference between a given model and its target structure. Subtracting the per-residue DipScores of the reference structure from those of a predicted model provides information about the regions that are geometrically better (or worse) in the model. The average of those differences tells us whether (1) the model resolves all geometric issues of the target without introducing other major ones (positive DipDiff), (2) the model has either the same geometric issues found in the target or fixed them while introducing others (DipDiff around 0), or (3) the model has several severe geometric issues not found in the target (negative DipDiff).
As an example, Figure 1A depicts the experimental structure of target T1047s2-D3, colored based on the DipScore of its individual residues (the redder, the lower the score). While most residues have a DipScore close to 1, at least 10, located mostly in loops and terminal helix regions, are colored pink or red, indicating that they have a very low DipScore and thus uncommon backbone geometries. One example is that of Asp344 (Figure S1), which has at least two stretched interatomic distances 20 not supported by the experimental electron density (Figure S1A). The observation that these two distances are shortened in the final deposited structure (PDB ID 7BGL, Figure S1B), resulting in a favorable DipScore, supports the notion that the target structure was not a finalized, refined model. The best model submitted for this target is depicted in Figure 1B,C, with all residues that have a low DipScore in the target having a score close to 1 in the model, translating into a positive DipScore difference in these regions. As the model lacks the severe geometric problems of the target without introducing others, its average DipScore difference (the DipDiff) is positive (Figure 1D). The second (average) model also adopts the same overall fold, but has a similar number of residues with an unlikely backbone geometry; these problems make the DipScore difference distribution wide while keeping the DipDiff close to 0 (Figure S2A). The third model, the worst submitted for this target, does not adopt the fold of the target structure, and most of its residues have a score close to 0, moving its DipDiff to extremely low values (Figure S2B).
The DipDiff is related to, but does not correlate strongly with, any of the scores included in S CASP13 (Figure S3A). Such a "pyramid behavior" is, however, not observed for the models in CASP14, as 80% of the models have a negative DipDiff and only 2% have a DipDiff above 0.1, especially in the 50-80 GDT_HA range.
The distribution of DipDiffs across all targets (Figure 3A) demonstrates that indeed most are modeled with a mostly negative median DipDiff (the pan-target median of individual target median DipDiff values in CASP14 is −0.06). Still, a few cases seem to have been systematically modeled with a better backbone geometry than the target itself (marked with a star in Figure 3A). These are T1047s2-D3 (median DipDiff 0.08), T1058-D1 (median DipDiff 0.05), T1085-D3 (median DipDiff 0.15) and T1100-D1 (median DipDiff 0.12, submitted by our group). By carrying out the same analysis for templates, it was interesting to observe that for most targets we found good templates with a positive DipDiff (Figure 3A), suggesting that: (1) even when a good (if only partial) template is available, methods may ignore this information when building their models, or include biases that make them deviate from the template backbone geometry, producing, in general, models with an unrealistic backbone geometry; and (2) the same seems to be true when building a protein structure from experimental data.

| Sidechain accuracy
When it comes to side-chain modeling, we used the score developed by Croll et al. 15 in CASP13, which measures the difference of side-chain torsion angles in the model relative to the target. This metric varies between 0 and 1; the closer to 0, the lower the deviation.
Contrary to the backbone, for which the median CASP13 backbone geometry deviation score was 0.1 (Figure 3B), side-chains continue to be the hardest to predict correctly, with the median side-chain geometry deviation score per model lying between 0.4 and 0.5 (Figure 3C).
As expected from the relationship between side-chain and backbone geometry, 29,30 the closer the geometry of a model's backbone to the target, the more accurate the predicted side-chain conformations (Figure 2D). However, such a correlation is not observed between side-chain geometry and backbone geometric quality (Figure 2C): models may be predicted without backbone geometry problems yet still have inaccurate side-chains. When considering templates, on the other hand, the models submitted to CASP14 seem to outperform the geometric accuracy of the templates, independently of how close they are to the targets (Figures 2C,D and 3C).

| Overall ranking
When we rank the different methods based on the median S CASP14 of their first models, there is a clear leader, AF2, followed by three close runners-up: BAKER, BAKER-EXPERIMENTAL and Baker-RosettaServer (Figure 4A). AF2 has a median S CASP14 of 2.2, meaning that the models it produces, and classifies as its best models, score in general 2.2 SDs better than the average model across all targets. In contrast, the three BAKER methods have a median S CASP14 of about 1.0 (BAKER 1.01, BAKER-EXPERIMENTAL 0.98, and Baker-RosettaServer 0.86). When compared to the ranking obtained with S CASP13 , these four methods would rank the same way; the only difference lies in the places below, with S CASP14 allowing for a better resolution in the lower places of the ranking (red arrows in Figure 4B).
Following these four methods, the accuracy of the next six methods is very similar, and here they would rank differently if the S CASP13 scoring function were used. With S CASP14 the ranking at these places is

FIGURE 3: Per-template distribution of backbone and side-chain high-accuracy scores. For each target, the distribution of (A) the DipDiff, (B) the CASP13 backbone geometry score, and (C) the CASP13 side-chain geometry score for model 1 submitted by each group for that target is depicted as a boxplot. Outlier models are depicted as gray dots. Green dots represent individual templates for that target, with the shade of green representing the average between how much the template covers the target (target coverage) and the HHsearch probability of the alignment. The darker the shade of green, the closer to 100%.

| By target difficulty
We further assessed the accuracy of the overall top 10 methods as a function of target difficulty, denoted by the four target categories defined by the Prediction Center (Figure 6). AF2 stands out as the most accurate across all target types, both at the overall level of the S CASP14 score and at the level of its individual components (see the examples for median GDT_HA, median DipDiff, and median CASP13 geometry deviation scores in Figure 6). It reaches a median GDT_HA of around 80% at any level of difficulty, accompanied by a general slight improvement of backbone geometry quality without large backbone deviations from the target. Its side-chain modeling quality depends on the target's category, but is always significantly better than that of any of the other methods (Figure 6E).
For the remaining methods, the ranking order varies depending on the target type, with this effect being more prominent for difficult, FM targets. In this category, both the BAKER and BAKER-EXPERIMENTAL methods stand out from the other seven by achieving a median GDT_HA above 40% and a median S CASP14 above 1.0, while Baker-RosettaServer reaches a median GDT_HA of about 30% and a S CASP14 below 1.0, comparable to that of the other five methods. Model backbone geometry quality also depends on target difficulty for some methods, especially those in the lower half of the ranking (Figure 6C). Side-chains, on the other hand, seem to be equally hard to predict for all target types, with most methods reaching a CASP13 side-chain deviation score of around 0.4 in every category, although a slightly higher accuracy is reached for easy targets. Unfortunately, AF did not participate in CASP14, which would have provided a baseline for progress assessment. To circumvent that and evaluate the effect of target difficulty on the accuracy of the best model per target in CASP14 when compared to CASP13 and CASP12, we computed these metrics also for the CASP12 and CASP13 targets (using the data provided through the Prediction Center) and plotted them against the difficulty of all targets in the three CASP experiments (Figure S4).

FIGURE 6: Accuracy of the models predicted by the top 10 ranking methods overall, by target category. Barplots (with associated deviation) for the median (A) S CASP14 overall score, (B) Global Distance Test high-accuracy version (GDT_HA), (C) DipDiff, and CASP13 (D) backbone and (E) sidechain geometry deviation scores, for the overall top 10 ranking methods based on their first models (model 1) submitted for targets within the four target categories: "template-based easy" (TBM-Easy), "template-based hard" (TBM-Hard), "hybrid template-based/free modeling" (TBM/FM), and "free-modeling" (FM).
In CASP14, the dependency of best-model accuracy on target difficulty is eliminated by the AF2 models, with the GDT_HA curve almost flat at around 80%, the CASP13 backbone deviation score below 0.05, and the CASP13 side-chain geometry score significantly below the CASP13 and CASP12 curves. When AF2 models are excluded, these curves approach the CASP13 behavior and values, but remain in better ranges, especially for GDT_HA and backbone geometry deviation scores. Most of these models were produced by BAKER methods, indicating that on average the BAKER methods reached accuracies higher than those of the best models in CASP13, produced by AF. This is, however, not the case for side-chain building and backbone geometry quality; for these metrics, the best models in CASP14 excluding AF2 are as accurate as the best models in CASP12 and CASP13. Indeed, only AF2 brought an increase in side-chain accuracy since CASP12. Similarly, there is a tendency for improvement in model backbone geometry quality from CASP13 to CASP14, but this is not as significant as for the other metrics.

| Model usefulness for MR
With truncation levels defined as the percentage of the original model retained after truncation, T1030 worked over a range of 18%-44%, T1070 over 19%-75%, T1085 over 21%-85% and, finally, T1100 over 20%-25% truncation levels.
The AMPLE solutions for T1030 (Figure 7A), T1070 (Figure 7B) and T1100 (Figure 7D)

Models may present a good backbone conformation quality, even better than the target, but still be wrong. Cases where this is due to the modeling of the wrong fold are easy to identify with metrics such as GDT_HA, but those where the target contains experimentally supported, strained residues not identified by the modeling procedure are harder to evaluate.
In such cases, only cross-validation against the experimental data can help, which is usually not available for all targets in CASP during assessment.
The most accurate methods in CASP14 were those using complex deep-learning approaches for the prediction of contact maps, with AF2 standing out considerably as the source of the best models for 89 of 97 targets, achieving a median GDT_HA of 78%. When looking at progress, AF2 substantially outperformed its predecessor, AF, but the median accuracy of the second-ranked models was also better than that of AF in CASP13, showing that in the year between CASP13 and the start of CASP14, several groups built on the path opened by AF to implement considerable method improvements. This is especially true when it comes to local and global details of the backbone, with AF2 models frequently achieving an accuracy comparable to that of experimentally derived models.
However, the same cannot be said for side-chains, which remain the hardest to model at a high level of accuracy, not only for hard targets. Here again, AF2 stood out as the best method, modeling sidechains closest to their target geometries and achieving an accuracy for hard targets better than that of the other methods for easy targets. Accordingly, AF2 models could be used in MR in a straightforward fashion for almost all targets that could be tested. In most cases no editing of the AF2 prediction was necessary for its successful deployment as an MR search model, but automatic editing by AMPLE 25,26 on the basis of residue error estimates was valuable in some cases.
As has become clear in our analysis, AF2 marks a solution to the structure prediction problem for single protein chains that have a folded structure. As such, it and the similar methods racing to catch up will make the structure space of proteins as accessible to biochemists as sequence search programs did for the sequence space a quarter century ago, greatly accelerating the analysis of biological processes. Furthermore, AF2 and related methods represent a key step on the path to deriving all structural properties of a protein by computation, such as dynamics, ligand interactions, folding path and folded state under different conditions. The importance of this for the life sciences is difficult to overstate.

FIGURE 8: The two targets (T1032, T1091) which remained unsolved by AMPLE. For each target, the upper panel shows the crystal structure (teal) and the lower panel shows the AF2 model with residues colored on a gradient based on the predicted RMS error provided by AF2 (white: lower predicted RMS error; red: higher predicted RMS error).