Diversity in protein domain superfamilies

Whilst ∼93% of domain superfamilies appear to be relatively structurally and functionally conserved based on the available data from the CATH-Gene3D domain classification resource, the remainder are much more diverse. In this review, we consider how domains in some of the most ubiquitous and promiscuous superfamilies have evolved, in particular the plasticity in their functional sites and surfaces, which expands the repertoire of molecules they interact with and the actions performed on them. To what extent can we identify a core function for these superfamilies which would allow us to develop a ‘domain grammar of function’ whereby a protein's biological role can be proposed from its constituent domains? Clearly the first step is to understand the extent to which these components vary and how changes in their molecular make-up modify function.


Introduction
Families of proteins arise through speciation (orthologous relatives) and through duplication of genes during evolution (paralogous relatives) and it is the paralogues that are most likely to diverge, although not necessarily [1]. By classifying families and superfamilies and collating information on their protein structures, sequences and functions, we can explore how relatives diverge and understand the molecular mechanisms underlying any functional changes [2]. Such insights are essential for inheriting properties between relatives to cope with the huge dearth of experimental annotations. For example, an inspection of the experimental annotations in the UniProtKB/Swiss-Prot sequence database (June 2015) reveals that less than 15% of human proteins have detailed functional characterisation and only 4% have known structures. Such insights are also essential for understanding whether genetic variations are likely to be tolerated or to affect function.
Many resources now exist for classifying protein families, some of which consider the entire protein (e.g., PANTHER [3], HAMAP [4], TIGRFAMs [5] and SFLD [6]) whilst others classify the domain components (e.g., Pfam [7], SMART [8], PRINTS [9], InterPro [10], CDD [11], CATH [12], SCOP [13] and ECOD [14]), which are generally considered to be evolutionarily independent modules having distinct functional properties [15]. Some resources like PhyloFacts [16] also provide classification of both full-length proteins and domains. At least two thirds of eukaryotic and more than half of prokaryotic proteins are composed of multiple domains [17] and the most highly populated domain superfamilies are universal to all kingdoms of life or major clades or branches [18]. Therefore, whilst studies have suggested that there may be approximately 100 thousand protein families [16,19], many proteins can be decomposed into common constituent domains derived from a more limited repertoire of 15,000 superfamilies [19]. Within a protein, the different domains tend to have different roles, which when combined make up the general function of that protein. Therefore, by understanding the different functional roles that domains possess we can start to build up a 'domain grammar of function' [20]. Interestingly, a few hundred of these domain superfamilies dominate nature, accounting for nearly two thirds of all known domains [21]. It is in these superfamilies that we see the most diversity (see Figure 1) and this is largely reflected in their binding properties and/or their ability to metabolise diverse substrates.
In this review we use the CATH-Gene3D domain classification, currently the most comprehensive structure-based superfamily resource, to assess the extent of divergence across protein domain 'superfamily space' and review the mechanisms of divergence revealed by detailed studies of specific families undertaken by us and other groups.

Capturing information on structural and functional diversity within superfamilies
Specialised manually curated structure-based classifications like SFLD [6], TEED [23], CYPED [23], LccED [24] and ESTHER [25] provide valuable insights into the diversity of selected enzyme superfamilies and there have been several elegant studies of large, diverse superfamilies in the Structure-Function Linkage Database (SFLD) resource [26,27]. However, relatively few superfamilies have been explored in such detail because of the limited experimental data. Since relatives sharing structural and functional properties experience similar constraints on their sequences to preserve these properties, one way to explore diversity across 'superfamily space' is to exploit the much more plentiful sequence data that are available [22,23,28].
By appropriately clustering relatives with similar sequence properties, several resources [6,16,19] classify specific 'functional families'. Approaches range from pair-wise comparisons [6] to more sophisticated profile-based analyses [22] that can also be used to detect key residue sites differing between the functional families. Whilst residues important for folding or stability tend to be conserved across the whole superfamily, positions only conserved in certain functional families (specificity determining positions or SDPs) are often under positive selection and associated with distinct functional properties [29,30] (see Figure 2(a)). SDPs can be associated with a wide variety of protein sites. For example, in addition to mutations in the ligand binding pocket, diversity in the Metabotropic Glutamate Receptors is conferred by SDPs in allosteric sites, the dimerization interface and the hinge region [31]. Similarly, the functional specificity of signalling proteins like the Ras superfamily involves mutations in the nucleotide-binding pocket and in interfaces co-ordinating the communication between the nucleotide and membrane-binding regions [32].
For exploring superfamily diversity in the CATH-Gene3D resource, we have used an approach that searches for SDPs to distinguish between different functional clusters [22]. This approach sub-classifies the 2700 CATH-Gene3D superfamilies into 110,000 functional families by optimal partitioning of hierarchical clustering trees for each superfamily, based on identifying characteristic patterns of differentially conserved positions (SDPs) and conserved positions between different functional groups, all of which have at least one relative with an experimental functional annotation in the Gene Ontology (GO) [33]. Whilst validation suggests that these functional groups are reasonably effective in transferring experimental annotations between relatives, there is still considerable room for improvement, as suggested by the results of a recent international large-scale protein function prediction assessment [34]. However, functional family classification does shed light on superfamily diversity, revealing that for only 7% (200) of these superfamilies, sequence change is associated with very significant diversity in structure, function and protein context (see Figure 1) while the remaining 93% of the superfamilies appear to have structurally and functionally conserved relatives.
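To make the distinction between superfamily-wide conserved positions and SDPs concrete, the following is a minimal sketch in Python. It is not the tree-partitioning method of [22]; it assumes that sequences are already aligned and already assigned to functional families, and it simply labels each alignment column. The function names, the toy sequences, the family labels and the conservation threshold are all hypothetical and chosen for illustration only.

```python
from collections import Counter


def consensus(residues, threshold=1.0):
    """Return the dominant non-gap residue of a column if its frequency
    (relative to all rows, gaps included) reaches the threshold, else None."""
    counts = Counter(r for r in residues if r != "-")
    if not counts:
        return None
    residue, n = counts.most_common(1)[0]
    return residue if n / len(residues) >= threshold else None


def classify_positions(families, threshold=1.0):
    """families: dict mapping a family name to a list of aligned sequences.

    Returns two lists of 0-based column indices:
    - columns conserved across the whole superfamily
    - candidate SDPs: conserved within every family, but with a
      different consensus residue in at least two families
    """
    all_seqs = [s for seqs in families.values() for s in seqs]
    length = len(all_seqs[0])
    if any(len(s) != length for s in all_seqs):
        raise ValueError("all sequences must have the same aligned length")

    conserved, sdps = [], []
    for i in range(length):
        if consensus([s[i] for s in all_seqs], threshold) is not None:
            conserved.append(i)
            continue
        per_family = [
            consensus([s[i] for s in seqs], threshold)
            for seqs in families.values()
        ]
        if all(r is not None for r in per_family) and len(set(per_family)) > 1:
            sdps.append(i)
    return conserved, sdps


# Hypothetical toy alignment: two functional families of one superfamily.
toy_families = {
    "family_A": ["GKSDL", "GKSDV", "GKSDI"],
    "family_B": ["GRSEL", "GRSEM", "GRSEV"],
}

conserved_cols, sdp_cols = classify_positions(toy_families)
print("conserved across superfamily:", conserved_cols)
print("candidate SDPs:", sdp_cols)
```

In this toy case the last column varies within each family and is therefore reported as neither conserved nor an SDP. A realistic analysis would need sequence weighting, tolerance of conservative substitutions and a statistical score rather than a hard threshold.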

Functional diversity in binding and enzyme superfamilies: 'molecular tinkering'
The 200 most diverse domain superfamilies each have 100 or more functional families and together account for 50% of all CATH-Gene3D domains. Of these, 95% are directly or indirectly associated with enzymatic activity, and many of the remainder have relatives with binding activity. Whilst detailed studies of some superfamilies have characterised considerable structural divergence modifying functional site features ([35,36], see also below), even small changes associated with residue mutations in a binding or active site can significantly alter its shape and its physicochemical and electrostatic characteristics, modifying ligand specificities in binding proteins and affecting substrate specificities, chemistries and catalytic efficiencies in enzymes. The Nuclear Receptor superfamily shows remarkable diversity in the ligand binding cavity brought about by such mutations, driven by strong divergent selection and adaptive positive selection [37]. Similarly, in the Tubulin superfamily, many of the positively selected sites are found at or adjacent to functionally important sites [38].
In enzymes, considerable sequence divergence can occur in the active sites. In nearly 55% of 101 experimentally well-annotated enzyme superfamilies (accounting for almost 50% of all enzyme sequences in CATH-Gene3D) dramatic changes in catalytic machinery occur [39]. However, in support of previous studies by Babbitt and coworkers [28], which reported that many relatives in SFLD superfamilies share a common mechanistic step, 40% of these superfamilies have one or two catalytic residues common to all functional families. In some cases catalytic residues with similar physicochemical properties are located at similar 3D locations even though they are in different positions in the sequence (see Figure 2(a)). Thus, some aspect of the chemistry is frequently conserved. Analyses based on phylogenetic trees derived from structure-based alignments of CATH-Gene3D superfamilies confirm, on a much larger scale than early studies [2], that most superfamily diversity is associated with changes in substrate specificity [40]. This suggests that it is hard to change the chemistry, presumably because of the complex sequence of mutations needed to create a new arrangement of catalytic residues with the correct spatial relationships. However, dramatic changes in chemistry can occur, for example in the Enolase superfamily [41,42] and the Aldolase Class I superfamily [43] (see Figure 3).
Enzyme superfamilies showing the greatest versatility in CATH-Gene3D frequently adopt alpha/beta structures, two thirds having TIM or Rossmann folds. As Tawfik and his colleagues have reported in a recent publication, these structures tend to have regular, well-packed structural cores, and the catalytic residues mainly locate to loops that are largely detached from these cores and are therefore perhaps better able to tolerate the destabilising effects of mutations [49,50].
Diversity in protein superfamilies can also arise from mutations in protein interfaces. Furthermore, relatives can exploit completely different surfaces in their protein interactions. Large-scale studies comparing CATH-Gene3D functional families showed that in 645 highly versatile superfamilies, cumulative binding sites from diverse relatives covered most of the protein surface and were associated with a wide range of protein partners [52]. However, sometimes the same interface is exploited but by different partners. In the two Dinucleotide Binding Domains Flavoproteins (tDBDF) superfamily, the diversity of reactions carried out by relatives is achieved by different protein partners acting as electron acceptors and interacting with the same face of the tDBDF domain [53]. Paralogous relatives are more likely to bind different protein partners [54]. This is a significant effect in the beta-propeller superfamilies, whose structures contain repeating WD40 sub-domains and whose human paralogues have multiple distinct surfaces interacting with a very wide variety of proteins, peptides or nucleic acids [55].

Structural mechanisms of superfamily divergence
Although only 10% of the CATH-Gene3D functional families have structural representatives, these data can help identify superfamilies capable of great structural plasticity, where relatives display considerable diversity due to extensive residue insertions and repetitions or inserted structural motifs [56,57]. For 160 CATH superfamilies, accounting for half of all known domains in CATH-Gene3D, at least a two-fold variation in size is observed between the most diverse domains [58]. However, analyses of selected superfamilies [35,59] and more recent large-scale studies have shown that the structural core (generally 40-50% of the domain) is highly conserved even for relatives separated by billions of years [57] (see Figure 1). Long residue inserts in diverse relatives generally adopt secondary structure features that form structural decorations to this core and can be associated with modified functions, for example, by altering active site geometry and thereby changing substrate specificity (see Figure 2(a)), or by altering surface features and thereby changing protein interaction partners [52].
In the Thiamine pyrophosphate (TPP)-dependent superfamily, different functional families have varying inserts forming small additional secondary structure features that reshape the active site for different substrates (see Figure 2(b)). In the HUP domain superfamily, quite extensive structural embellishments also extend the active site [35]. Insertions of motifs or sub-domains can also result from gene fusions, for example, in the Haloalkanoic Acid Dehalogenase (HAD) superfamily, where they provide diverse specificity determinants for a broad range of substrates [60].
Dramatic structural rearrangements can also arise from variations in repeating units. In the Vicinal Oxygen Chelate (VOC) superfamily, members share a common βαβββ subdomain that is organized into different topological (or domain-swapped) combinations in different relatives, maximizing the catalytic versatility of the metal center [61]. These and other structural changes, such as circular permutations and rearrangements in β-sheet topologies, can sometimes transform the fold [62] as well as modify the function [50].
Diversity can also emerge from changes in less structured regions, for example, repeats giving rise to low-complexity regions (LCRs), such as polyalanine or polyglutamine runs. These often evolve rapidly and can have a major influence on the transcriptional activity of the protein [63]. Similarly, variations in (Gly)n-X repeats in glycine-rich domains have been observed to alter the expression pattern, modulation and sub-cellular localization of relatives in some plant families [64].

Superfamily diversity arising from different multi-domain contexts
Gene fusions are another evolutionary mechanism conferring diversity as they can significantly alter the context of a domain (i.e., by changing the multi-domain architecture (MDA) of the protein), thereby modifying its molecular function and biological role. Domains have been frequently duplicated and shuffled within genomes during evolution, with fusions being more frequent and generally occurring at N or C termini [65]. For 92% of the 200 most diverse superfamilies in CATH-Gene3D, that is, those having the highest number of functional families, relatives occur in more than 100 different multi-domain contexts [21] (Figure 1(b)). Changes in domain partners may not necessarily alter the function of the domain but may change the context in which it operates, for example, by locating it in different protein complexes and/or pathways. Early studies, for instance, demonstrated the recruitment of domain relatives to different metabolic pathways for the chemistry they bring [66].
However, changes in domain partners can also alter specificity. For example, in the highly diverse Thiamine pyrophosphate (TPP)-dependent enzyme superfamily, changes in domain partnership alter the size and physicochemical properties of the active site pocket (see Figure 2(b)), enabling a huge range of substrates, products and stereo-selectivities [67]. Different oligomerisation states also effectively change the domain context. Again, in the TPP superfamily, various oligomerisation states have evolved in different species. Whilst some may be associated with enhanced stability, others clearly influence active site characteristics by changing the positioning of the domains providing catalytic residues (see Figure 2(b)).
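The number of multi-domain contexts in which a superfamily occurs, as discussed in this section, can be illustrated with a minimal Python sketch. It assumes that each protein has already been annotated with an ordered list of domain superfamily identifiers (N to C terminus) and that two architectures differing only in domain order count as distinct contexts; both are simplifying assumptions rather than a description of how CATH-Gene3D derives its figures. The protein and superfamily identifiers are hypothetical placeholders.

```python
from collections import defaultdict


def contexts_per_superfamily(proteins):
    """proteins: dict mapping a protein ID to an ordered sequence of
    superfamily IDs. Returns, for each superfamily, the number of
    distinct multi-domain architectures (MDAs) in which it occurs."""
    contexts = defaultdict(set)
    for domains in proteins.values():
        mda = tuple(domains)          # the ordered architecture is the context
        for superfamily in set(mda):  # count each MDA once per superfamily
            contexts[superfamily].add(mda)
    return {sf: len(mdas) for sf, mdas in contexts.items()}


# Hypothetical annotations, for illustration only.
toy_proteins = {
    "protein_1": ("sf_A", "sf_B"),
    "protein_2": ("sf_A",),
    "protein_3": ("sf_B", "sf_A"),
    "protein_4": ("sf_A", "sf_B"),  # same MDA as protein_1
    "protein_5": ("sf_C", "sf_B"),
}

mda_counts = contexts_per_superfamily(toy_proteins)
for superfamily in sorted(mda_counts):
    print(superfamily, mda_counts[superfamily])
```

Treating order as significant is a design choice: collapsing each architecture to an unordered set of superfamilies would give lower counts and would hide fusions that occur at opposite termini.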

Diversity in superfamilies due to promiscuity
Diversity within a superfamily can also be the result of individual relatives having multiple functions. For example, relatives can have multiple catalytic activities, not necessarily of equal efficiency, as in promiscuous enzymes, or moonlighting functions, whereby proteins perform functions completely different from their native activity, sometimes involving different sites [68,69]. Promiscuity can be the starting point for the evolution of a new function [49,50]. Under natural selection, promiscuous enzymes can give rise to specialist enzymes by a variety of different mechanisms: protein dynamics (e.g., changes in conformational dynamics have converted a promiscuous generalist beta-lactamase to a penicillin-specific beta-lactamase, without significant changes in the structure of the active site [70]), domain insertions (e.g., HAD superfamily [36,60]), rearrangements in the catalytic metal ions [71] and binding of alternative cofactors [72].
An increasing number of proteins are now known to moonlight and these activities can be induced by oligomerisation, cellular localization, differential expression and substrate concentration. For example, Albaflavenone monooxygenase in the Cytochrome P450 superfamily also functions as a Terpene synthase, an activity not observed in any other superfamily member. The catalytic machineries for the two enzymatic reactions are located in distinct pockets on the domain and the reactions are carried out at different pHs [73].

Conclusions
In most large diverse superfamilies, functional diversity results from a combination of different molecular mechanisms (Figure 4). For example, in the PD-(D/E)XK Phosphodiesterase superfamily there are structural embellishments to the core, domain swapping events, active site residue variations and changes in MDA [74]. The same is true of the Ribonuclease H-like (RNHL) superfamily [75] and many of the other families discussed above.
Experimental data on functional diversity grow slowly, as detailed studies are time-consuming and expensive. However, classifying the millions of sequences accumulating in public repositories like UniProt into putative functional families can reveal subtle changes in conservation patterns that suggest shifts in binding specificities or catalytic machineries. These data can guide experiments to focus on unusual relatives and to map more comprehensively the functional repertoires of the most versatile superfamilies. For example, sequence similarity networks based on protein families can help in providing a comprehensive summary of sequence, structure and function relationships in a functionally diverse superfamily. In recent studies [27,60,76], such networks, derived from curated family classifications for three functionally diverse superfamilies in SFLD, have been used to aid the selection of interesting targets for experimental characterisation. The availability of automated functional classifications of superfamilies will ultimately guide experimental validation using high-throughput approaches and aid in improving the functional annotation of genomes. This will be especially important for large diverse superfamilies.
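The idea behind a sequence similarity network can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure used in the cited studies: fractional identity between pre-aligned, equal-length toy sequences stands in for whatever pairwise similarity score a real network would use, and the sequence names, sequences and thresholds are hypothetical. Nodes are sequences, an edge joins two sequences whose similarity meets the threshold, and connected components give putative groupings.

```python
from itertools import combinations


def identity(a, b):
    """Fraction of identical positions between two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same aligned length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def similarity_network(seqs, threshold):
    """Build an undirected network as an adjacency dict: an edge joins
    two sequences whose identity is at least the threshold."""
    edges = {name: set() for name in seqs}
    for u, v in combinations(seqs, 2):
        if identity(seqs[u], seqs[v]) >= threshold:
            edges[u].add(v)
            edges[v].add(u)
    return edges


def connected_components(edges):
    """Return the connected components of the network as sorted lists."""
    seen, components = set(), []
    for start in edges:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(edges[node] - component)
        seen |= component
        components.append(sorted(component))
    return components


# Hypothetical pre-aligned toy sequences.
toy_seqs = {
    "a1": "GKSDLAA",
    "a2": "GKSDVAA",
    "b1": "GRTEMCC",
    "b2": "GRTEMCA",
}

strict = connected_components(similarity_network(toy_seqs, threshold=0.8))
relaxed = connected_components(similarity_network(toy_seqs, threshold=0.2))
print("threshold 0.8:", strict)
print("threshold 0.2:", relaxed)
```

Varying the threshold shows how clusters merge as the similarity cut-off is relaxed; in this toy example the two groups that are separate at the stricter threshold collapse into a single component at the looser one.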
Only 63% of the 25 million domain sequences in CATH-Gene3D can be assigned to an experimentally annotated functional family and less than 10% of these families have a known structure, so there may be much more diversity to discover. Certainly analyses of microbial communities hint at exciting novel chemistries [77,78].
Although the lack of data hinders our understanding, most studies of enzyme superfamilies, even those that are mechanistically very diverse, suggest that chemistry is usually preserved, or that a specific partial reaction is conserved among all relatives, and that it is substrate specificity that is much more likely to change [28]. Furthermore, the relative success of domain-based strategies for protein function prediction [22,79] suggests that a general functional role is conserved across most domain superfamilies and that diversity largely results from exploitation of that role on multiple ligands or substrates, and in multiple contexts. In other words, the structural diversity observed in promiscuous superfamilies is more frequently associated with changes that reflect different domain contexts or changes in substrate specificity than with dramatic changes in the functional role. This suggests that for many domain superfamilies a 'domain grammar of function' can be applied.

References and recommended reading
Papers of particular interest, published within the period of review, have been highlighted as: • of special interest; •• of outstanding interest