The CATH Hierarchy Revisited—Structural Divergence in Domain Superfamilies and the Continuity of Fold Space

Summary This paper explores the structural continuum in CATH and the extent to which superfamilies adopt distinct folds. Although most superfamilies are structurally conserved, in some of the most highly populated superfamilies (4% of all superfamilies) there is considerable structural divergence. While relatives share a similar fold in the evolutionarily conserved core, diverse elaborations to this core can result in significant differences in the global structures. Applying similar protocols to examine the extent to which structural overlaps occur between different fold groups, it appears this effect is confined to just a few architectures and is largely due to small, recurring super-secondary motifs (e.g., αβ-motifs, α-hairpins). Although 24% of superfamilies overlap with superfamilies having different folds, only 14% of nonredundant structures in CATH are involved in overlaps. Nevertheless, the existence of these overlaps suggests that, in some regions of structure space, the fold universe should be seen as more continuous.

CATH/SCOP dataset (Reid et al, 2007). Currently, at least 60% of remote homologues (<35% sequence identity) in CATH can be recognised at a reasonable error rate (1%) using the SAM-T single HMM method of Karplus et al (Karplus et al, 1998), and 80% using the PRC HMM-HMM method of Madera et al (Reid et al, 2007).
Similar in structure. CATH uses the SSAP (Orengo et al, 1996) and CATHEDRAL (Redfern et al, 2007) algorithms to perform structural alignments of protein domains. Both methods apply a logarithmic scoring scheme normalized to lie between 0 and 100, independent of protein size. Benchmarking has shown that comparing two proteins with similar folds generally results in scores of 70 or more, and homologous proteins will often give scores of 80 or more. Once the two proteins have been structurally aligned, they are superimposed via the McLachlan algorithm (McLachlan, 1982) and a normalized RMSD is calculated (Redfern et al, 2007).
Similar in function. Orthologous proteins are likely to have very similar functions. Paralogous proteins arising from gene duplication may have diverged in function; however, analyses suggest that some aspects of molecular function may remain conserved, e.g., sites on the surface used in catalysis or binding, or intermediates formed during a catalytic reaction (Todd et al, 2001; Rison et al, 2002). Information is extracted from the public databases GO (Ashburner et al, 2000), COGs (Tatusov et al, 2003), EC (Bairoch, 2000), FunCat (Ruepp et al, 2004) and KEGG (Kanehisa et al, 2000), and also from the literature. The SAWTED algorithm (MacCallum et al, 2000) is used to compare SwissProt keywords between proteins, and a method for comparing GO terms is also employed (Bairoch, 2000; Lord et al, 2003).
The SAM-T technology of Karplus et al (Karplus et al, 2005) is used to build superfamily-specific hidden Markov models (HMMs) from diverse representatives within the superfamily (Sreps). Sequences from completed genomes are scanned against the CATH HMM library to recognise domains which can be assigned to CATH structural superfamilies. Release 3.1 has HMMs for 9000 S35 families. The latest release of Gene3D (release 6) contains 527 completed genomes from all kingdoms of life. Table S1 gives the population of each CATH superfamily following the assignment of Gene3D sequences. Gene3D also contains information on the functions of sequences containing predicted CATH domains.
Automatic homology assignment is only applied for close homologues (>= 80% sequence identity and >= 60% of residues in the larger domain aligned against the smaller domain). For more remote homologues, the automatic methods are applied as described above, and the data generated by these methods, together with information obtained from the literature, are taken into account in a manual validation of the classification by the CATH curators.
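The close-homologue rule can be written as a simple predicate. The following is a minimal sketch, not the CATH implementation: the function name and arguments are hypothetical, and it assumes the 60% criterion is measured as the number of aligned residues divided by the length of the larger domain.

```python
def is_close_homologue(seq_identity_pct, n_aligned, len_a, len_b,
                       min_identity=80.0, min_overlap=60.0):
    """Return True if a pair qualifies for automatic homology assignment.

    Assumption: overlap is the percentage of residues in the larger
    domain that are aligned against the smaller domain.
    """
    larger = max(len_a, len_b)
    if larger <= 0:
        raise ValueError("domain lengths must be positive")
    overlap_pct = 100.0 * n_aligned / larger
    return seq_identity_pct >= min_identity and overlap_pct >= min_overlap
```

Pairs failing either threshold would be passed on to the remote-homologue methods and manual curation.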

What is an SSG?
A Structural Sub-Group (SSG) describes a cluster of two or more CATH domains that have been grouped together due to their close structural similarity. The measure used to describe structural similarity is the SIMAX score, which is the RMSD of the two aligned structures normalized by the ratio of the number of residues in the larger domain to the number of aligned residues (see Section 5.3). All comparisons within an SSG cluster have a SIMAX score of less than a given cutoff. The cluster type SSG5 corresponds to domains that all have a SIMAX similarity score of <5 Å.
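As a worked illustration of the definition above, SIMAX is the RMSD multiplied by (residues in the larger domain / aligned residues). A minimal sketch follows; the function name and argument names are hypothetical.

```python
def simax(rmsd, n_aligned, len_a, len_b):
    """SIMAX: RMSD scaled by (larger domain length / aligned residues).

    A pair that aligns over only part of the larger domain is penalised,
    so a low RMSD over a short alignment does not look deceptively similar.
    """
    if n_aligned <= 0:
        raise ValueError("at least one residue must be aligned")
    return rmsd * max(len_a, len_b) / n_aligned
```

For example, an RMSD of 2.0 Å over 100 aligned residues, with domains of 150 and 120 residues, gives a SIMAX of 3.0 Å, which would fall under the SSG5 cutoff.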
In order to remove redundancy, SSGs are created from sequence family (S35) representatives. A cutoff of 35% sequence identity is used for selecting representatives since, above 35% identity, relatives are likely to share similar structures. Superfamilies with only one sequence family will not have any SSG clusters, nor will SSGs be created from clusters with only one structure (called singletons).
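The requirement that every pairwise comparison within an SSG falls below the cutoff is the property guaranteed by complete-linkage clustering. The text does not name the clustering algorithm, so the sketch below is an assumption: it uses a plain agglomerative complete-linkage procedure over S35 representatives and discards singletons. The function names and the `simax` lookup (a dict keyed by `frozenset` pairs of domain identifiers) are hypothetical.

```python
from itertools import combinations


def complete_linkage_clusters(ids, simax, cutoff):
    """Merge clusters while the worst pairwise SIMAX stays below cutoff."""
    clusters = [[i] for i in ids]
    while True:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            # Complete linkage: distance is the largest pairwise SIMAX.
            worst = max(simax[frozenset((a, b))]
                        for a in clusters[i] for b in clusters[j])
            if worst < cutoff and (best is None or worst < best[0]):
                best = (worst, i, j)
        if best is None:
            return clusters
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]


def ssg_clusters(s35_reps, simax, cutoff=5.0):
    """Return SSGs (two or more members); singletons are dropped."""
    clusters = complete_linkage_clusters(s35_reps, simax, cutoff)
    return [sorted(c) for c in clusters if len(c) >= 2]
```

With a cutoff of 5.0 this corresponds to the SSG5 cluster type; a superfamily with a single S35 representative yields no SSGs.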

What is an aggregate alignment/SSGA?
The SSGs provide tightly clustered groups of protein structural domains where all members of a given group are guaranteed to have close structural similarity. However, in order to investigate more distant evolutionary relationships, it may be useful to examine structural alignments between proteins with more diverse structural relationships. The structural sub-group aggregates (SSGA) provide a means of linking the tightly clustered SSGs together to examine more distant evolutionary relationships. SSGAs are created in two steps. The first step selects representative structures from each SSG where the most representative structure for a given SSG is defined as the domain with the highest structural similarity to domains in neighbouring SSGs. The second step clusters these representative structures with a more relaxed structural similarity cutoff to allow a certain amount of structural diversity, but not so much as to make the structural alignment meaningless. The SSGA9 aggregate cluster type selects one representative structure from each SSG5 cluster, then clusters these structures into one or more groups with a SIMAX cutoff of <9Å.
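The two-step SSGA construction can be sketched as follows. This is an illustration under stated assumptions rather than the CATH procedure: "neighbouring SSGs" is taken to mean all other SSGs in the superfamily, "highest structural similarity" is taken as the lowest mean SIMAX to their domains, and the second step reuses a complete-linkage clustering that the text does not specify. All names, and the `simax` dict keyed by `frozenset` pairs, are hypothetical.

```python
from itertools import combinations


def pick_representative(ssg, other_domains, simax):
    """Step 1: domain with the lowest mean SIMAX to domains in other SSGs."""
    if not other_domains:
        return ssg[0]  # only one SSG: any member can represent it

    def mean_to_others(domain):
        total = sum(simax[frozenset((domain, o))] for o in other_domains)
        return total / len(other_domains)

    return min(ssg, key=mean_to_others)


def cluster_below_cutoff(ids, simax, cutoff):
    """Step 2 helper: complete-linkage clustering under a SIMAX cutoff."""
    clusters = [[i] for i in ids]
    while True:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            worst = max(simax[frozenset((a, b))]
                        for a in clusters[i] for b in clusters[j])
            if worst < cutoff and (best is None or worst < best[0]):
                best = (worst, i, j)
        if best is None:
            return clusters
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]


def build_ssga(ssgs, simax, cutoff=9.0):
    """Pick one representative per SSG, then cluster the representatives."""
    reps = []
    for k, ssg in enumerate(ssgs):
        others = [d for m, other in enumerate(ssgs) if m != k for d in other]
        reps.append(pick_representative(ssg, others, simax))
    return [sorted(c) for c in cluster_below_cutoff(reps, simax, cutoff)]
```

Called with SSG5 clusters and the default cutoff of 9.0, this corresponds to the SSGA9 aggregate type.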

Identifying the core residues common to all relatives in an SSG or SSGA
The CORA multiple alignment algorithm was used to create structure-based alignments for the domains in both the SSG and SSGA clusters. A consensus score was then calculated for each alignment position as the number of structures aligned at this position divided by the total number of structures in the alignment. Alignment positions that have a consensus score of 1.0 (i.e., positions where all structures could be aligned) were considered to contain "core" residues. Highlighting the corresponding "core" residues for each structure in the SSG provides a means of examining the structural embellishments observed within SSG clusters (especially within the more disparate SSGA clusters). It also allows comparison of common cores between superfamilies and the manual clustering of superfamilies sharing similar common cores into the same (T)-level (i.e., (T)opological core motif level) in CATH. Links have been provided on the web pages that allow the user to view the core residues highlighted on each domain in the SSGs through interactive molecular viewers (e.g., Jmol and Rasmol).
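The consensus score and core definition can be illustrated with a short sketch. It assumes the multiple alignment is available as equal-length strings with `-` marking unaligned positions; this input format and the function names are hypothetical and do not reflect CORA's actual output.

```python
def consensus_scores(alignment, gap="-"):
    """Fraction of structures aligned (non-gap) at each alignment column."""
    if not alignment:
        raise ValueError("alignment must contain at least one row")
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all alignment rows must have the same length")
    n = len(alignment)
    return [sum(row[i] != gap for row in alignment) / n for i in range(length)]


def core_positions(alignment, gap="-"):
    """Columns where every structure is aligned (consensus score of 1.0)."""
    scores = consensus_scores(alignment, gap)
    return [i for i, score in enumerate(scores) if score == 1.0]
```

For the three-row alignment `["AC-GT", "ACDG-", "A-DGT"]`, only columns 0 and 3 are occupied in every row, so those two positions form the core.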

Figure S15. Schematic flow chart of the CATH update protocol
The first major step is DomChop, where one or more domains are assigned for each chain. The newly created domains are then classified into CATH superfamilies (HomCheck). As with DomChop, any domains that cannot be automatically classified are manually curated. Grey boxes denote production of metadata, blue boxes workflow decisions, yellow boxes manual curation. Definitions of abbreviations and terms used are as follows: NW (Needleman-Wunsch sequence alignment algorithm); HMM (hidden Markov model); ChopClose (assigns domain boundaries for close homologues); CATHEDRAL (structure comparison program).

Figure S16. Structural drift between representatives within the same S35 level (i.e., Sreps) in the CATH database (version 3.1).

Figure S17.
Plot of normalized RMSD scores returned by pair-wise structure comparisons of homologous domains versus the average sizes of the domains being compared.