Definition and classification of evaluation units for tertiary structure prediction in CASP12 facilitated through semi‐automated metrics

For assessment purposes, CASP targets are split into evaluation units. We herein present the official definition of CASP12 evaluation units (EUs) and their classification into difficulty categories. Each target can be evaluated as one EU (the whole target) and/or several EUs (separate structural domains or groups of structural domains). The specific scenario for a target split is determined by the domain organization of available templates, the difference in server performance on separate domains versus combinations of domains, and visual inspection. In the end, 71 targets were split into 96 EUs. Classification of the EUs into difficulty categories was done semi‐automatically with the assistance of metrics provided by the Prediction Center. These metrics account for sequence and structural similarities of the EUs to potential structural templates from the Protein Data Bank, and for the baseline performance of automated server predictions. The metrics readily separate the 96 EUs into 38 EUs that should be straightforward for template‐based modeling (TBM) and 39 that are expected to be hard for homology modeling and are thus left for free modeling (FM). The remaining 19 borderline evaluation units were dubbed FM/TBM and inspected case by case. The article also reviews structural and evolutionary features of selected targets relevant to our accompanying article on the assessment of FM and FM/TBM predictions, as well as structural features of the hardest evaluation units from the FM category. We finally suggest improvements for the EU definition and classification procedures.


| INTRODUCTION
The biennial Critical Assessment of Structure Prediction (CASP) aims to provide an objective evaluation of state-of-the-art methodologies in protein structure prediction. 1 Participants submit models for targets whose structures are unknown or withheld from public release during the experiment; independent teams of assessors then evaluate the submitted models along different tracks. In this article, we present the CASP12 targets, the procedure to split them into evaluation units (EUs) against which models are assessed, and the classification of these units into one of three difficulty categories.
In this round, CASP organizers secured 71 experimental structures (Table 1) for predictors to model, thanks to contributions from multiple groups as listed in Table 2. Based on a number of metrics and protocols further described in this article, the 71 targets were split into 96 EUs.
We introduced in this CASP a way of assigning difficulty to the EUs based on objective metrics, so as to semi-automate the process. Our metrics, available for future CASPs at the Prediction Center, 2 include one that captures how well the automated servers perform, another that measures sequence-level similarity to Protein Data Bank (PDB) entries that could be used as templates, and a third one that captures structural similarity to PDB entries disregarding sequence similarity.
These metrics enabled straightforward discrimination between easy and hard targets, allowing experts to focus on targets of intermediate difficulty. Table S1 provides an overview of the EUs, depicting their CASP12 classification and their relationship to existing PDB structures according to concepts outlined in the Evolutionary Classification Of protein Domains (ECOD) database. 3

| MATERIALS AND METHODS
All GDTTS, HHpred, LGA, HHblits, Grishin plot, and Neff calculations used in this work were performed by the Prediction Center; thus they will be readily available in future CASP rounds. 2

| Difficulty metrics
In this round of CASP, we introduced a combination of metrics that facilitate classification of the evaluation units. These metrics are the HHpred score, which measures the sequence-level similarity to PDB entries, and the LGA score, which measures structural similarity to PDB entries.
The HHpred score for an EU is defined as the product of the raw HHpred probability of the top hit obtained in an HHpred 4 run of its protein sequence against the sequences of all PDB entries available during the target prediction window and the percentage of the query sequence covered by this hit. The LGA score for an EU is the LGA_S score of the highest-scoring structure in sequence-independent LGA 5 runs of the target structure against all PDB entries available before closing of the prediction window for the target. Both metrics range from 0 for no sequence or structure similarity to 100% for a perfect sequence (HHpred score) or structure (LGA score) match. The combined metric is simply the average of the two:

Combined score = (HHpred score + LGA score) / 2

Besides the difficulty metrics based on sequence- and structure-level matches to the PDB, part of the classification into difficulty categories is based on the actual performance of the top 20 server predictions. The metric used to quantify the quality of individual models was the Global Distance Test Total Score (GDTTS), 6 which averages, over distance cutoffs of 1, 2, 4, and 8 Å, the maximum number of residues that can be superimposed within each cutoff, normalized by the number of residues in the target. The GDTTS score is defined in the 0-100 range. Structurally wrong models usually score below 20 GDTTS points, while a perfect model that matches the full structure within 1 Å deviation in the Cα coordinates of all residues scores 100 GDTTS points. GDTTS is historically the main metric for global assessment of tertiary structure predictions in CASP. 7
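As an illustration of how these quantities combine, the sketch below re-expresses the definitions above in Python. It is not the Prediction Center's code: the input values (HHpred probability and coverage of the top hit, LGA_S of the best structural match, per-residue deviations of a model) are assumed to be parsed from HHpred, LGA, and superposition output beforehand, and the GDT computation is simplified to a single superposition rather than the exhaustive search performed by the real GDT program.

```python
def hhpred_score(top_hit_probability: float, query_coverage: float) -> float:
    """HHpred score: probability of the top PDB hit (0-100) scaled by the
    fraction of the query sequence covered by that hit (0-1)."""
    return top_hit_probability * query_coverage


def combined_score(hhpred: float, lga_s: float) -> float:
    """Combined difficulty metric: average of the HHpred and LGA scores (0-100)."""
    return (hhpred + lga_s) / 2.0


def gdt_ts(ca_deviations: list[float], n_target_residues: int) -> float:
    """Simplified GDTTS for one superposition: mean, over 1/2/4/8 Angstrom
    cutoffs, of the percentage of target residues whose C-alpha deviation
    falls within the cutoff. The real GDT searches many superpositions and
    keeps the maximum residue count per cutoff."""
    cutoffs = (1.0, 2.0, 4.0, 8.0)
    percentages = [
        100.0 * sum(d <= c for d in ca_deviations) / n_target_residues
        for c in cutoffs
    ]
    return sum(percentages) / len(cutoffs)
```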

| Grishin plots for objective evaluation of domain splits
An evaluation scheme to help objectively decide whether individual proposed EUs corresponding to structural domains should be kept separate or merged into a single EU was introduced in CASP9. 8 The so-called "Grishin plot" (named after the CASP9 assessor who introduced it) allows easy comparison, for each server model, of the weighted sum of GDTTS scores for the individual domains versus the GDTTS score for the combined unit. If predictions on separate domains are clearly better than on the combined unit, the dots fall above the diagonal and we split the target (example in Figure 1A). If predictions on separate domains are not better than the performance on the merged domain, the dots fall on the diagonal, in which case we do not split (example in Figure 1D). This way we keep EUs as large as possible, so that predictors who model large portions correctly are eventually rewarded upon assessment, and emphasis is placed on predicting the correct inter-domain orientation.
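A minimal sketch of the underlying comparison is given below, assuming per-domain and whole-target GDTTS values for each server model are already available; the length-weighting and the simple above-the-diagonal decision rule are illustrative assumptions, since in practice the plots were inspected visually.

```python
def weighted_domain_gdtts(domain_scores: list[float],
                          domain_lengths: list[int]) -> float:
    """Length-weighted average of per-domain GDTTS scores for one model."""
    total = sum(domain_lengths)
    return sum(s * n for s, n in zip(domain_scores, domain_lengths)) / total


def grishin_points(models):
    """models: iterable of (per_domain_scores, domain_lengths, whole_gdtts).
    Returns (x, y) points for the Grishin plot: x = whole-target GDTTS,
    y = weighted per-domain GDTTS."""
    return [(whole, weighted_domain_gdtts(scores, lengths))
            for scores, lengths, whole in models]


def suggests_split(models, margin: float = 5.0) -> bool:
    """Split when points lie systematically above the diagonal, that is,
    when separate-domain evaluation exceeds whole-target evaluation by
    more than `margin` GDTTS points on average (margin is illustrative)."""
    points = grishin_points(models)
    mean_gap = sum(y - x for x, y in points) / len(points)
    return mean_gap > margin
```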

| Evaluation of alignment depths
Given that the quality of tertiary structure predictions can depend on the depth of the sequence alignments that prediction methods can build for a target, we also evaluated alignment depth, quantified as the number of effective sequences (Neff) computed by HHblits, as a possible additional metric for classifying EUs into difficulty categories (see Results).
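To make the notion of alignment depth concrete, the sketch below computes an effective number of sequences from a multiple sequence alignment by down-weighting redundant sequences at 80% identity. This reweighting-based definition is a common convention but is not necessarily identical to the Neff reported by HHblits; it is shown only as an assumed illustration.

```python
import math


def pairwise_identity(a: str, b: str) -> float:
    """Fractional identity over columns aligned (non-gap) in both sequences."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def effective_sequences(msa: list[str], identity_cutoff: float = 0.8) -> float:
    """Neff-like depth: each sequence contributes 1/m, where m is the number
    of alignment members (itself included) within identity_cutoff of it."""
    neff = 0.0
    for seq_i in msa:
        m = sum(pairwise_identity(seq_i, seq_j) >= identity_cutoff for seq_j in msa)
        neff += 1.0 / m
    return neff


def depth_metric(msa: list[str]) -> float:
    """Alignment-depth value used in the plots discussed in the Results:
    log10(1 + Neff)."""
    return math.log10(1.0 + effective_sequences(msa))
```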

| Classification of evaluation units into difficulty categories
Splitting of each target into EUs was followed by assignment of each unit to a difficulty category. Initially, all EUs were divided into two broad difficulty categories based on the average GDTTS score of the top 20 server predictions. EUs with scores above 50 were preliminarily classified as not requiring detailed manual assessment (easier targets), and those with scores below 50 were preliminarily classified as requiring such an assessment. The split along server performance lines traditionally corresponds to the classification into template-based and free modeling targets, with commonly noted exceptions, as in CASP11. 9 Obviously, the correlation between the performance-based classification and the template-based metrics is not perfect. Observing how the average server GDTTS relates to the HHpred and LGA scores in Figure 2A,B, we reasoned that the existence of good sequence matches and of good structure matches to the PDB both provide metrics to quantify target difficulty. We therefore combined the HHpred and LGA scores as an average, obtaining a new plot (Figure 2C) with a quite smooth correlation against average server GDTTS. From this plot we set up boundaries (boxes with gray dashed borders in Figure 2C) from which we defined the easier, template-based modeling (TBM) EUs as those for which the combined HHpred-LGA score was higher than 60 and the average server GDTTS was above 50 (at which level the global topology usually begins to be visually evident; red points in Figure 2C).
These EUs were considered suitable for evaluation not requiring heavy engagement of human assessors. We next defined the more difficult, free modeling (FM) EUs as those for which the combined HHpred-LGA score was lower than 60 and the average server GDTTS was below 50 (blue points in Figure 2C). These were the EUs that definitely required expert judgment of the submitted models.
EUs outside these boundaries exhibited significant deviations from the expected performance and were classified as FM/TBM, while certain borderline cases within the TBM and FM definitions that presented sequence or structure deviations from possible templates were reclassified, after detailed inspection, from TBM (T0912D2 and T0945D1) or FM (T0874D1, T0896D2, T0909D1, and T0868D1) into FM/TBM as well (green points in Figure 2A). With this definition, the FM/TBM EUs distribute nearly orthogonally to the main trend, such that those with higher combined scores are actually predicted worse by the servers. In fact, these FM/TBM targets are harder than TBM targets because of fold changes and other effects relative to PDB entries, as detailed below and exemplified in Figure 3. These targets were considered potentially suitable for automatic-only evaluation alongside the TBM targets, while at the same time deserving the more rigorous human assessment typical of free modeling.
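The automatic part of this classification can be summarized in a few lines; the sketch below uses the thresholds quoted above and does not capture the subsequent case-by-case reclassification of borderline EUs, which remained a manual step.

```python
def classify_eu(hhpred: float, lga_s: float, avg_server_gdtts: float) -> str:
    """Preliminary difficulty category for an evaluation unit.
    All inputs are on a 0-100 scale; thresholds follow the text."""
    combined = (hhpred + lga_s) / 2.0
    if combined > 60 and avg_server_gdtts > 50:
        return "TBM"      # good templates and good server models
    if combined < 60 and avg_server_gdtts < 50:
        return "FM"       # poor templates and poor server models
    return "FM/TBM"       # off the main trend: inspect manually
```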

| Difficult to classify FM/TBM domain examples
Several FM/TBM EUs exhibit high sequence-based scores, yet automated server models display relatively low performance (green dots with high HHpred score in Figure 2A: T0868D1, T0896D1, T0896D2, T0945D1, T0874D1, T0875D1, T0876D1, T0884D1, T0887D1, …).

Figure 3. Examples of EUs that were defined as FM/TBM despite having high HHpred scores relative to their top HHpred templates, because careful inspection revealed important differences in terms of insertions in the core fold, 3D arrangement of similar secondary structures, domain swaps, and interacting domains. In panels A-E, the portions of structure described in the text are rainbow-colored from blue (N-terminus) to red (C-terminus). In panel F, the top HHpred templates for the three EUs of target T0912 (in blue, red, and magenta) are aligned on top of the full target structure (cyan).

| Final set of 96 evaluation units for the tertiary structure prediction track of CASP12
The T0869 CdiI comes from a 120-residue protein, with segment 3-106 present in the structure. Both targets were treated as single domains for evaluation, but predictors were asked to submit models in the same reference frame so that the complex could be assessed in the quaternary structure prediction track.
T0868 was classified as FM/TBM, as described in the section devoted to FM/TBM targets above.

A plot of average server GDTTS versus log10(1 + Neff) shows no strong trend other than a weak apparent lower limit for server prediction scores that increases slowly with alignment depth (Figure 7). This weak trend is clear almost exclusively for the FM models; and for the FM/TBM targets, the plot does not suggest that alignment depth can be used to improve their classification into clear FM or TBM. This may suggest that, although deeper alignments might make an FM target "easier", the extent to which it is made easier is far smaller than the contributions from sequence and structural similarities to PDB entries.
Overall, it therefore seems that alignment depth is currently not a good additional difficulty metric across all targets, although it might help to more finely grade difficulty among FM targets. In any case, alignment depth should be kept in mind for EU classification in future CASPs, when alignment-based methods might improve and sequence databases grow even further. Moreover, if alignment depths are eventually included in EU classification, they should be considered during EU definition as well.
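The check described above amounts to asking whether alignment depth co-varies with server performance, both across all EUs and within the FM subset. A hedged sketch is shown below, assuming each EU is represented by a hypothetical (category, log10(1 + Neff), average server GDTTS) tuple.

```python
def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def depth_vs_performance(eus):
    """eus: list of (category, depth, avg_server_gdtts) tuples.
    Returns correlations over all EUs and over the FM subset only."""
    all_r = pearson([d for _, d, _ in eus], [g for _, _, g in eus])
    fm = [(d, g) for cat, d, g in eus if cat == "FM"]
    fm_r = pearson([d for d, _ in fm], [g for _, g in fm])
    return {"all": all_r, "FM": fm_r}
```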

| CONCLUSION
We have presented here the CASP12 EUs and their classification into TBM, FM, and FM/TBM categories, and described salient structural and evolutionary features of selected FM and FM/TBM target units. We further make the point that, although case-by-case expert analysis is still needed for EU definition and classification, the Prediction Center now offers an array of tools that facilitate both tasks, freeing time that human experts can dedicate to the most complicated cases.