CATH 2024: CATH-AlphaFlow Doubles the Number of Structures in CATH and Reveals Nearly 200 New Folds



Introduction
The number of protein sequences has increased exponentially since the advent of next-generation sequencing and far outpaces the number of experimentally solved 3D protein structures in the Protein Data Bank (PDB). There are ~215,000 experimentally solved protein structures in the PDB [1], while UniProtKB contains ~227 million protein sequences. To cope with the massive sequence data, methods for accurate protein structure prediction (e.g. homology modelling and ab initio approaches) have been pursued for decades. The Protein Structure Prediction Center hosts a biennial contest, CASP (Critical Assessment of Structure Prediction), which rigorously assesses the performance of these methods [2]. In the latest CASP14, AlphaFold2, a sequence-based AI method developed by Google's DeepMind, significantly outperformed all 145 other methods and provided good quality models [3]. AlphaFold2 is trained on structures deposited in the PDB and on co-evolution information learned from multiple sequence alignments generated from sequence database searches. AlphaFold2's breakthrough in CASP14 has revolutionised structural biology research by narrowing the protein sequence-structure gap. Subsequently, the AlphaFold Database (AFDB) (https://alphafold.ebi.ac.uk/) was established in a collaboration between Google's DeepMind and EMBL-EBI, and the latest release contains 3D models for 214 million UniProt sequence entries, a major step in bridging the sequence-structure gap [4].
AFDB provides a rich source of information for structural classification and structure-based function prediction. Protein structures can typically reveal more distantly related homologues than sequence alone, as structure tends to be more conserved than sequence in evolution. A number of structural classifications have been established (CATH, SCOP, SCOPe, ECOD) which focus on the domain, since domains are semi-independent functional units of proteins that can evolve independently and can be combined with other domains to evolve new or modified protein functions. Both CATH and ECOD have already analysed subsets of the AFDB data [5][6][7] and shown that there is a substantial amount of good quality predicted structural data which can expand the classification resources significantly.
CATH is a Global Core Biodata Resource (GCBR) which classifies domains into a hierarchy of Class, Architecture, Topology/fold and Homologous superfamily. Class classifies domains based on the type of secondary structure elements (i.e. alpha, beta, alpha/beta, few secondary structures), whilst Architecture classifies domains based on the arrangement of secondary structure elements. Topology (fold) considers both the arrangement and connectivity of secondary structure elements. Homologous superfamily classifies domains that share sufficient structural and sequence similarity to infer homology from a common ancestor. Within the superfamily, CATH provides functionally coherent groups known as CATH-Functional Families (FunFams) [8,9]. In its latest release (version 4.3), CATH classifies 150,885 PDB structures, segregated into 495,811 domains, classified into 5841 homologous superfamilies [10]. CATH has grown steadily since it was established in the early 1990s, but the release of >200 million AlphaFold predicted models is likely to represent a 400-fold or more expansion in domain structures and requires the development of new pipelines to process this data in a timely manner. Since the first release of AlphaFold2, various groups have introduced new computational methods and pipelines for clustering and segmenting AlphaFold2 chains. For example, the Steinegger group has clustered full-chain AlphaFold2 models from AFDB (version 3) into 2.27 million structure clusters and suggested that 31% of these do not match any structure deposited in the PDB [11]. Durairaj and colleagues analysed the AFDB (version 4) dataset and used the structural data to find new families and functional relationships, providing novel functional insights into several protein families associated with the dark proteome [12]. Recently, the Grishin group developed DPAM, a novel domain predictor, to process AFDB chains and provide structural assignments for domains in 48 proteomes, including human [6,7].
Our group developed a new protocol (CATH-Assign) for processing AFDB models. This includes domain detection, evaluation of model quality and subsequent classification in CATH. CATH-Assign uses a combination of approaches involving new deep-learning methods for protein structure comparison (Foldseek [13]) and protein language model based methods for remote homology detection (CATHe, developed in-house [14]). It was used to perform a preliminary analysis of AFDB models from 21 model organisms, classifying 341,213 domain structures in CATH superfamilies and revealing 25 novel folds with at least one human representative [5]. In this study, we report the extension of CATH-Assign with a new deep-learning based method for domain detection (Chainsaw) [15]. Chainsaw is faster and more sensitive than the algorithms previously used by CATH, which helps to cope with the scale of the data. We also report the development of a novel pipeline (CATH-AlphaFlow, Figure 1) which encodes major steps of the CATH-Assign protocol in a NextFlow workflow. CATH-AlphaFlow was applied to all novel structures in the PDB not currently classified in CATH. It was also applied to the AFDB structures from the 21 model organisms to refine domain boundaries and improve classification of domains.
Application of CATH-AlphaFlow to the PDB and AFDB structures expanded CATH by 112% to 1,060,659 domain structures and brought 349 new folds into CATH (253 from PDB structures and 96 from AFDB). We used various public resources and tools to obtain functional annotations for these.
CATH-AlphaFlow and the functional annotation strategies outlined here are robust and fast and will assist in mining the vast data released by AFDB and related platforms (e.g. 3D-Beacons, https://www.ebi.ac.uk/pdbe/pdbe-kb/3dbeacons/). The novel domain assignments and fold groups enabled by CATH-Assign/AlphaFlow will be available from the CATH-beta daily snapshot (ftp://orengoftp.biochem.ucl.ac.uk/).

Materials and Methods
We updated CATH by applying CATH-AlphaFlow, which includes a new domain detection algorithm (Chainsaw [15]), to all the protein structures in the PDB not classified in CATH (v4.3). Our dataset included PDB structures deposited up to September 2023.
We also applied CATH-AlphaFlow with Chainsaw to improve domain boundary assignment and validate the classification of AFDB structures from 21 model organisms.
1. Domain detection
Domain boundaries were assigned for both PDB structures and AFDB structures using Chainsaw [15], a novel state-of-the-art deep-learning algorithm benchmarked against other widely used methods (e.g. UniDoc [16]). Low-probability assignments were subsequently validated with UniDoc [16] and manual curation. As expected, Chainsaw gave significantly better domain boundaries than our previous method for detecting domains in UniProt proteins, CATH-resolve-hits (CRH) [17], which relies purely on sequence data. Chainsaw has various advantages over CRH, including better accuracy, higher speed when GPU-accelerated, and the need for only a single PDB or mmCIF file as input. Domain boundaries detected by Chainsaw were subsequently used to extract the domain regions from PDB and AFDB files using the pdb_selres module from pdb-tools [18]. Supplementary Figure 1 shows how the agreement between CRH and Chainsaw boundaries falls as the sequence similarity between the query and the closest relative in CATH falls. Structure-based methods like Chainsaw will also be much better at detecting boundaries for domains with no homologues in CATH.

Removing low quality structures
Domains from the PDB chains or AFDB were assessed for quality using the metrics described in Bordin et al. [5]. Well-packed globular domains with an average pLDDT of 70 or more, more than 3 secondary structure elements, less than 30% of the domain residues in Long Unordered Regions, and less than 65% of residues in unordered regions were subsequently processed by CATH-AlphaFlow.
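The four numeric thresholds above can be expressed as a simple filter. The following is a minimal sketch, not the CATH-AlphaFlow implementation: the class and field names are hypothetical, the globularity/packing check is not modelled, and skipping the pLDDT test when no pLDDT is available (as for experimental structures) is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DomainQuality:
    """Per-domain quality metrics (hypothetical field names)."""
    n_sse: int                          # number of secondary structure elements
    frac_lur: float                     # fraction of residues in Long Unordered Regions
    frac_unordered: float               # fraction of residues in any unordered region
    mean_plddt: Optional[float] = None  # average pLDDT; None for experimental structures


def passes_quality_filter(d: DomainQuality) -> bool:
    """Apply the thresholds quoted in the text."""
    # Assumption: the pLDDT test only applies when a pLDDT value exists.
    if d.mean_plddt is not None and d.mean_plddt < 70.0:
        return False
    return (
        d.n_sse > 3                  # more than 3 secondary structure elements
        and d.frac_lur < 0.30        # less than 30% of residues in Long Unordered Regions
        and d.frac_unordered < 0.65  # less than 65% of residues in unordered regions
    )


# Toy values for illustration only.
compact = DomainQuality(n_sse=6, frac_lur=0.05, frac_unordered=0.30, mean_plddt=88.2)
floppy = DomainQuality(n_sse=5, frac_lur=0.35, frac_unordered=0.40, mean_plddt=91.0)
print(passes_quality_filter(compact), passes_quality_filter(floppy))
```

The second toy domain is confidently predicted but fails on the Long Unordered Region criterion, which illustrates why pLDDT alone is not used as the only filter.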

Domain Processing with CATH-AlphaFlow
Using CATH-AlphaFlow, segmented domains from AFDB models and PDB structures were assigned to CATH. CATH-AlphaFlow is a series of Python modules created to perform consistent processing of protein chains (either from models or from experimental data), many of which have been orchestrated in NextFlow (https://github.com/UCLOrengoGroup/cath-alphaflow). The steps involved are discussed as follows and highlighted in Figure 1.

CATH assignment using Foldseek
Domains were scanned against CATH superfamilies using Foldseek and tentatively assigned to CATH superfamilies using thresholds benchmarked on a set of curated CATH/SCOP assignments, as described in [5]. We considered hits valid at a 5% error rate against a library of CATH domains clustered at 95% sequence identity (S95) from 5,841 superfamilies in CATH classes 1 to 4, with a 60% overlap between query and target and a bitscore cutoff of 130 for homology (CATH superfamily-level, H) and 98 for fold detection (Topology-level, T) (Figure 1B).
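As an illustration of how the two bitscore cutoffs and the overlap requirement combine, the sketch below labels a single Foldseek hit as an H-level hit, a T-level hit, or no hit. The function name is hypothetical, and treating the cutoffs as inclusive is an assumption; parsing of Foldseek output is not shown.

```python
from typing import Optional

H_BITSCORE = 130    # homology (superfamily-level) cutoff quoted in the text
T_BITSCORE = 98     # fold (topology-level) cutoff quoted in the text
MIN_OVERLAP = 0.60  # minimum overlap between query and target


def foldseek_level(bitscore: float, overlap: float) -> Optional[str]:
    """Return 'H', 'T' or None for one query/target hit."""
    if overlap < MIN_OVERLAP:
        return None        # insufficient coverage, whatever the score
    if bitscore >= H_BITSCORE:
        return "H"         # tentative superfamily assignment
    if bitscore >= T_BITSCORE:
        return "T"         # tentative fold-group assignment only
    return None


# Toy hits for illustration only.
for bits, ovl in [(150, 0.80), (110, 0.70), (150, 0.50), (90, 0.90)]:
    print(bits, ovl, foldseek_level(bits, ovl))
```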
Validation of Foldseek structural matches by structure comparisons with SSAP.
Structural matches by Foldseek were confirmed by re-comparing each query with the S95 representative hit from the Foldseek search using SSAP [19] (Figure 1C). Pairs with an SSAP score ≥70 and a residue overlap of 60% (of the larger domain) were considered valid hits. Higher SSAP scores (≥80) suggest homology, which was subsequently validated by the sequence-based approaches described below.
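The SSAP check can be sketched as follows, assuming the overlap is computed as the number of aligned residues divided by the length of the larger domain. The function names and the verdict labels are hypothetical; running SSAP itself is not shown.

```python
def overlap_of_larger(n_aligned: int, len_query: int, len_target: int) -> float:
    """Aligned residues as a fraction of the larger of the two domains."""
    return n_aligned / max(len_query, len_target)


def ssap_verdict(ssap_score: float, n_aligned: int,
                 len_query: int, len_target: int) -> str:
    """Classify a Foldseek-matched pair after re-comparison with SSAP."""
    overlap = overlap_of_larger(n_aligned, len_query, len_target)
    if ssap_score < 70 or overlap < 0.60:
        return "rejected"
    if ssap_score >= 80:
        # Suggests homology; still needs sequence-based confirmation.
        return "candidate_homologue"
    return "fold_match"


# Toy pairs: 90 aligned residues between domains of 100 and 120 residues
# give an overlap of 0.75 relative to the larger domain.
print(ssap_verdict(85, 90, 100, 120))
print(ssap_verdict(72, 90, 100, 120))
print(ssap_verdict(85, 60, 100, 120))
```

Normalising by the larger domain penalises matches in which a small query only covers part of a much larger target.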

Validation of superfamily assignment by sequence-based methods
Domain sequences for H-hits were extracted from the PDB or AFDB files using the pdb_tofasta module from pdb-tools and further validated by scanning them against Hidden Markov Models built using HMMER3 [20] from S95 representatives of CATH (Figure 1C). We used thresholds established for CATH-Gene3D [10]: an e-value cutoff of 1e-3 for homology assignments, with a minimum bitscore cutoff of 25 and an overlap of 80%. Query domains matching the same superfamily by HMMER3 scan and by Foldseek/SSAP were assigned to that superfamily.
We also validated homology using CATHe [14], a CATH superfamily predictor based on embeddings from the ProtT5 protein language model. Structural matches which had CATHe predictions with a probability of 90% or more were considered valid superfamily assignments (Figure 1C).
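The sequence-based checks can be summarised in a short sketch that reports which lines of evidence agree with the structure-based superfamily. The record types and function name are hypothetical, the requirement that CATHe predicts the same superfamily as the structural match is an assumption, and the sketch deliberately does not decide how the two lines of evidence are combined, since the text does not specify this.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HmmHit:
    superfamily: str
    evalue: float
    bitscore: float
    overlap: float      # fraction of the query covered by the match


@dataclass
class CathePrediction:
    superfamily: str
    probability: float


def supporting_evidence(structural_sf: str,
                        hmm: Optional[HmmHit],
                        cathe: Optional[CathePrediction]) -> List[str]:
    """List the sequence-based methods that agree with the structural match."""
    evidence = []
    if (hmm is not None and hmm.superfamily == structural_sf
            and hmm.evalue <= 1e-3 and hmm.bitscore >= 25
            and hmm.overlap >= 0.80):
        evidence.append("HMMER3")
    if (cathe is not None and cathe.superfamily == structural_sf
            and cathe.probability >= 0.90):
        evidence.append("CATHe")
    return evidence


# Toy example; the superfamily ID is used purely as a label.
sf = "3.40.50.300"
print(supporting_evidence(sf, HmmHit(sf, 1e-12, 84.0, 0.93), CathePrediction(sf, 0.97)))
print(supporting_evidence(sf, HmmHit(sf, 0.5, 12.0, 0.93), CathePrediction(sf, 0.95)))
```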

Clustering domains assigned to the same CATH fold group into new CATH SuperFamilies
Query domains which matched a particular fold group in CATH were compared against each other all-versus-all using TMalign [21] with a minimum overlap of 60% and a TMscore cutoff of 0.7 normalised by the length of the largest domain (as benchmarked in [5]) (Figure 1D). The resulting score matrix was clustered with complete linkage clustering using cath-cluster, from the cath-tools suite (https://cath-tools.readthedocs.io/en/latest/). This gave a set of putative new superfamilies.
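To illustrate what complete linkage means here, the following pure-Python sketch merges clusters only while every cross-cluster pair scores at or above the cutoff. It is a toy re-implementation for explanation, not the cath-cluster algorithm; the function name, the input format, and the treatment of missing pairs as a score of 0 are assumptions.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def complete_linkage(ids: List[str],
                     scores: Dict[FrozenSet[str], float],
                     cutoff: float = 0.7) -> List[Set[str]]:
    """Greedy agglomerative clustering with complete linkage on similarities."""
    def sim(a: str, b: str) -> float:
        # Pairs absent from the matrix (e.g. filtered by overlap) score 0.
        return scores.get(frozenset((a, b)), 0.0)

    def linkage(c1: Set[str], c2: Set[str]) -> float:
        # Complete linkage: the weakest pair decides the cluster similarity.
        return min(sim(a, b) for a in c1 for b in c2)

    clusters: List[Set[str]] = [{i} for i in ids]
    while len(clusters) > 1:
        best, (i, j) = max(
            ((linkage(clusters[i], clusters[j]), (i, j))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[0],
        )
        if best < cutoff:
            break               # no remaining pair of clusters can merge
        clusters[i] |= clusters[j]
        del clusters[j]         # j > i, so index i stays valid
    return clusters


# Toy TM-scores: C resembles B (0.75) but not A (0.60), so it stays separate.
toy = {
    frozenset(("A", "B")): 0.90,
    frozenset(("B", "C")): 0.75,
    frozenset(("A", "C")): 0.60,
}
print(complete_linkage(["A", "B", "C", "D"], toy))
```

Under single linkage, C would have joined A and B through its similarity to B alone; complete linkage keeps clusters in which all members are mutually similar, which is the more conservative choice for proposing new superfamilies.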
4. Clustering domains with no match to a CATH superfamily or fold group to identify putative novel fold groups
Domains not meeting our criteria for inclusion in a CATH fold group or superfamily were clustered by performing an all-vs-all scan with Foldseek with an overlap of 60% and a bitscore cutoff of 130, and the resulting output was clustered with complete linkage using cath-cluster. Since TMalign is more sensitive than Foldseek, each cluster representative was subsequently scanned with TMalign against all CATH S95 representatives with a TMscore cutoff of 0.5 and a minimum overlap of 60% to check for fold hits missed by the initial Foldseek scans. Cluster representatives without a CATH fold assignment after this step were compared against each other using TMalign and clustered using cath-cluster with complete linkage, a TMscore cutoff of 0.5 and an overlap of 60%, to give a set of putative novel fold groups (Figure 1E).

Identification of structural relatives for novel folds/superfamilies in AFDB/UniProt50
Putative novel folds from AFDB and PDB were searched against representatives from AFDB clustered at 50% sequence identity (AFDB/UniProt50) via the Foldseek web server (https://search.foldseek.com/search) to identify all structural relatives. This information was used to determine the size of the structural superfamily and analyse the taxonomic distribution. The multiple sequence alignment generated from all the relatives (TM-score ≥70) was used to identify conserved sites using the Scorecons program [23].

Functional annotation of novel folds
For putative novel folds from AFDB domains we used fold representatives to annotate functions using various approaches described below.

Curation of available functional information
We extracted functional annotations from InterPro, Pfam and UniProt (i.e. GO terms, EC number, information on interactions, functional site information or literature evidence of function).

Assigning predicted functions using sequence and structure based methods
We predicted functional annotations using a number of sequence-based and structure-based methods.
4) Information on functional partner proteins (from STRING/IntAct).

Predicting functional sites
We predicted functional sites using DeepFRI (score >0.5) and P2RANK [26] (score >0.50). Additionally, we identified conserved sites by analysing the multiple sequence alignment of structural relatives in the AFDB/UniProt50 database using the Scorecons program (threshold score >0.70) [23].
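The three score thresholds can be applied uniformly once each method's output has been reduced to a per-residue score. The sketch below assumes such a reduction has already been done (the input format is hypothetical and does not reflect the native output of any of the tools), and it simply reports which methods support each residue; how the predictions were actually combined is not stated in the text.

```python
from typing import Dict, List

# Strict lower bounds quoted in the text.
THRESHOLDS = {"DeepFRI": 0.5, "P2RANK": 0.5, "Scorecons": 0.7}


def supported_sites(scores: Dict[str, Dict[int, float]]) -> Dict[int, List[str]]:
    """Map residue number -> methods whose score exceeds their threshold."""
    support: Dict[int, List[str]] = {}
    for method, per_residue in scores.items():
        cutoff = THRESHOLDS[method]
        for resnum, score in per_residue.items():
            if score > cutoff:
                support.setdefault(resnum, []).append(method)
    return {res: sorted(methods) for res, methods in sorted(support.items())}


# Toy scores for illustration only.
toy = {
    "DeepFRI": {10: 0.62, 42: 0.40},
    "P2RANK": {42: 0.81},
    "Scorecons": {42: 0.74, 77: 0.70},
}
print(supported_sites(toy))
```

Residue 77 is not reported in this toy case because its Scorecons value equals, rather than exceeds, the 0.70 threshold.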

Classification of protein structures in the Protein Data Bank (PDB) currently unclassified in CATH
A total of 108,130 PDB structures, unclassified in CATH, were analysed using CATH-AlphaFlow. CATH-AlphaFlow consists of multiple steps illustrated in Figure 1. Chainsaw identified 212,942 constituent domains within the protein chains. Subsequent scanning of the domain structures against the S95 representatives from the CATH v4.3 superfamilies by the Foldseek algorithm gave a total of 151,648 matches to the CATH superfamily (H-level) or fold group (T-level).
Further validation of the Foldseek matches was performed by verifying the structural similarity using TMalign and the in-house SSAP algorithm (see Methods). Assignment to CATH superfamilies was subsequently verified by scanning against CATH S95 HMMs and also by CATHe (see Methods). Overall, 137,193 domains could be assigned to CATH superfamilies and a further 14,455 to CATH fold groups, whilst the remaining 61,294 are putative novel folds in CATH (see Supplementary Figure 2).
These were compared all-against-all using Foldseek and subsequently clustered into 6,944 clusters (see Methods 4). Representatives from these clusters were scanned against CATH S95 domain structure representatives using the slower but more sensitive TMalign. This step assigned 3,238 clusters to known fold groups in CATH. Clustering of domains assigned to these fold groups (see Methods 3) gave 1,697 additional putative CATH superfamilies.
The remaining 3,706 domain representatives were subjected to quality checks, the FoldCheck protocol (see Methods 4) and manual inspection for further evaluation of domain quality and verification of new folds. We removed ~92%, which either had problematic features (e.g. poor resolution, a high proportion of long regions of residues with no secondary structure, or poor packing, i.e. lacking globularity) or were multi-domain or synthetic proteins (see Supplementary Figure 3).
A total of 253 domains were validated as non-problematic globular folds, new to CATH. A significant proportion of these (161/253, i.e. 64%) had already been classified in other resources such as ECOD and SCOPe, giving 92 folds newly identified in our analyses.
As a final check on novelty and to assign architectures, we compared the representatives of these folds with their close structural matches in CATH obtained from the TMalign search. The highest-scoring matches were superposed with the query domain using the cath-superpose tool (https://cath-tools.readthedocs.io/en/latest/tools/cath-superpose/) to confirm the domains were not extremely remote homologues and assess whether there were similarities in architecture. Matches to the core regions were examined particularly carefully, as the core (typically >40% of the structure) is usually conserved even between very remote homologues. No representatives were sufficiently structurally similar, nor had functional evidence to suggest an evolutionary relationship with, or same fold as, any CATH domain.
For novel folds where the closest matches had different architectures, new architectures were assigned. Some interesting novel architectures and topologies were identified (see Figure 2) and are discussed below.

Processing AFDB chains from 21 model organisms using CATH-AlphaFlow
Earlier pilot work analysing 369,512 high-quality AF2 domains from 21 model organisms [5] reported that 92.3% (341,213 domains) could be assigned to CATH superfamilies. Unassigned domains were clustered into 4,235 structural clusters. Preliminary manual analysis of 610 clusters containing human relatives revealed 25 globular domain structures likely to be novel folds. Many of the remainder comprised sequence-based matches to Pfam and CATH families which had problematic domain boundary assignments.
To improve the domain segmentation process, we re-processed all cluster representatives using Chainsaw. The resulting 6,266 domains were then processed using CATH-AlphaFlow modules (e.g. Foldseek, SSAP, CATH-HMM, CATHe) for assignment to CATH (as for the unclassified PDB structures above). The improved domain boundary assignments allowed us to assign 431 domains to existing CATH superfamilies and a further 496 to CATH fold groups. The remaining 5,835 domains were scanned in an all-vs-all fashion using Foldseek and clustered into 4,644 structural clusters (see Methods Section 4).
Cluster representatives were scanned against CATH using TMalign, which is slower but more sensitive than Foldseek. 1,836 representatives matched to CATH superfamilies. 2,478 representatives matched to CATH fold groups and were clustered into 2,477 putative novel superfamilies. Expanding to all domains, 347,479 AFDB domains could be assigned to CATH superfamilies.
Combining these AFDB domains with the 212,942 experimental domain structures from the PDB which we bring into CATH (see above) represented an increase in CATH domain structures by 560,421 to 1,060,659, a 112% increase. This also gave an increase in the number of CATH superfamilies (i.e. for PDB and AFDB domains with T-hits but not H-hits) from 5,481 to 9,655. We expect these superfamily numbers to reduce in the future (see Discussion).
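The headline figures can be cross-checked with simple arithmetic using only numbers quoted in the text. The sketch below does this; the baseline domain count is derived by subtraction rather than taken from the text, and attributing the superfamily increase to the two sets of putative superfamilies reported above (1,697 from PDB domains and 2,477 from AFDB domains) is an assumption that happens to reproduce the stated total.

```python
# Domain counts quoted in the text.
pdb_domains = 212_942            # experimental domains brought into CATH
afdb_pilot = 341_213             # AFDB domains assigned in the pilot work
afdb_reprocessed = 6_266         # domains recovered after re-chopping with Chainsaw
total_after = 1_060_659

afdb_domains = afdb_pilot + afdb_reprocessed     # 347,479
added = pdb_domains + afdb_domains               # 560,421
baseline = total_after - added                   # derived, not quoted
pct_increase = 100 * added / baseline

# Superfamily counts (assumed decomposition of the increase).
sf_before, sf_after = 5_481, 9_655
sf_new_pdb, sf_new_afdb = 1_697, 2_477

# Novel folds.
folds_pdb, folds_afdb = 253, 96

print(f"AFDB domains assigned: {afdb_domains:,}")
print(f"Domains added: {added:,} on a baseline of {baseline:,}")
print(f"Increase: {pct_increase:.1f}%")
print(f"Superfamilies: {sf_before + sf_new_pdb + sf_new_afdb:,} (stated {sf_after:,})")
print(f"Novel folds: {folds_pdb + folds_afdb}")
```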
We manually evaluated the remaining 330 putative novel fold clusters. As with the PDB structures, many representatives had problematic features such as residual segmentation issues, a high proportion (or long regions) of residues with no secondary structure, or poor packing (i.e. lacking globularity). We also used FoldCheck to see whether any were extremely remote homologues of a CATH superfamily. In total, 75 AFDB domains were identified as new folds in addition to the 21 identified in our pilot work, giving a total of 96 novel folds identified in the AFDB dataset of 21 model organisms.

Predicting function and mapping functional sites for AFDB domains
Where possible, we obtained functional annotations for the novel folds. Pfam annotations indicate that most are associated with membrane proteins and involved in transmembrane transport (e.g. potassium transporter), iron permease, and oxidoreductase activities (see Supplementary Table 1).
For 63 with no annotations, we predicted their function using the structure-based method DeepFRI, identified putative functional sites using P2RANK/Scorecons/DeepFRI, and sought additional annotations through literature and UniProt searches.
Their predicted functions suggest an association with important biological processes such as spermatogenesis (in mice); photosynthetic acclimation, pollen development and proteasomal degradation in flowering plants; iron permease activity in yeast; odontogenesis in human; and the female meiosis pathway in Caenorhabditis elegans (further details in Supplementary Tables 1-3). In bacteria, novel domain folds are involved in functions such as pentosyltransferase activity, 4 iron 4 sulfur cluster binding and cytochrome c oxidase assembly. Supplementary Tables 2-3 provide details of annotations from DeepFRI and PROST, and also UniProt GO terms inherited from other structural relatives in the AFDB. Figure 3 shows a selection of examples where the AFDB structures provided useful functional insights.
iii. Structural insights into mutations linked to human disease
The SSUH2 gene is associated with a pathogenic missense variant (P118Q) involved in the genetic disorder Dentin Dysplasia Type I (which causes dentin defects in humans) [27]. Foldseek scans found 74 structural relatives in AFDB, the alignment of which revealed several high-confidence conserved sites (score >0.70) (see Figure 3C). The pathogenic mutation lies close to two Scorecons predicted sites (121 and 148, score = 0.70). Analyses of the impact using DynaMut2 and mCSM (−0.85 kcal/mol) suggested that the mutation affects atomic interactions with the proximal conserved sites and is destabilising.

Discussion
We applied a novel computational workflow (CATH-AlphaFlow) to classify domains from the PDB in CATH, bringing 212,942 PDB domains into CATH and giving 253 novel CATH folds (92 of which are not observed in other classifications). We also applied CATH-AlphaFlow to the AFDB predicted structures for 21 model organisms. This confirmed the previous assignments of 341,213 domain structures to CATH superfamilies [5]. Improved domain assignments by Chainsaw enabled a further 6,266 domains to be accurately detected and assigned to CATH superfamilies. Analysis of the remaining AFDB domains revealed 96 new folds/superfamilies (including the 21 identified in earlier pilot work [5]). For those folds not annotated in Pfam and other databases, we used a variety of state-of-the-art tools to predict functions and functional site information. High-confidence DeepFRI predictions were available for 50 novel folds, some of which are associated with important biological processes in mice (fertility), flowering plants (photosynthetic acclimation, proteasomal degradation), and yeast (iron permease).
Although studies of the PDB in 2012 suggested the domain fold library was nearly complete, the identification of 349 novel folds in this study (increasing CATH folds >25% to 1739) suggests more novelty remains to be elucidated in the AFDB database.
By applying our new classification workflow (CATH-AlphaFlow) we doubled the number of domains in CATH to over one million. Of the new domains, 212,942 are experimental structures, which means that CATH is now up to date with the September 2023 version of the PDB. We are now applying CATH-AlphaFlow to all AFDB entries to expand CATH further. Assigning AFDB domains to CATH superfamilies will help bring functional annotations to UniProt sequences.

Figure 1. Outline of the CATH classification pipeline.

3. Analysis of the novel folds identified in the PDB and AFDB domain structure clusters
We examined all 253 (from PDB) and 96 (from AFDB) novel folds identified. Some (161/253) of the PDB domains had already been classified in SCOPe (version 2.08) and ECOD (version 20230309). A selection of novel folds from PDB and AFDB with highly unusual architectures/topologies are shown in Figure 2.

Figure 2. Illustration of novel folds with unusual topologies/architectures. The top panel shows PDB domains, the lower panel AFDB domains.