Inference of Functional Properties from Large-scale Analysis of Enzyme Superfamilies

As increasingly large amounts of data from genome and other sequencing projects become available, new approaches are needed to determine the functions of the proteins these genes encode. We show how large-scale computational analysis can help to address this challenge by linking functional information to sequence and structural similarities using protein similarity networks. Network analyses using three functionally diverse enzyme superfamilies illustrate the use of these approaches for facile updating and comparison of available structures for a large superfamily, for creation of functional hypotheses for metagenomic sequences, and to summarize the limits of our functional knowledge about even well studied superfamilies.

In the post-genomic era, access to large amounts of gene sequence and protein structure data has become the norm; by mid-2011, the number of protein sequences in the UniProt/TrEMBL Database (1) topped 16 million, whereas the Protein Data Bank (2) contained over 73,000 structures. Additional millions of sequences are becoming available from newer types of genome projects, including metagenomics projects, with one report for the human gut microbiome accounting for an additional 3.3 million microbial genes (3). Because experimental determination of protein function lags far behind the rate of sequence and structure determination, improved computational methods for function prediction are urgently needed to help bridge the gap between sequenced genes and functionally characterized protein products. In response, new methods are rapidly being developed to address these challenges, and community efforts are now under way to increase the pace of experimental and computational prediction of protein function (4,5). Another large-scale effort (http://www.nigms.nih.gov/News/Results/gluegrant_051510.htm) aims to develop a combined experimental/computational strategy for the prediction of the reaction and substrate specificity of enzymes, the protein class that is the subject of this minireview. Additionally, community challenges such as the Critical Assessment of Function Annotations (CAFA) (Automated Function Prediction 2011) have been mounted to assess and improve the current state of automated prediction of protein function. Viewing the glass as half-full, progress in sequencing and annotation over the last decade led one group to estimate that some functional features can be assigned to as much as 85% of proteins in completely sequenced genomes (6). From a more skeptical perspective, more recent assessments of annotation accuracy suggest that computational approaches are especially prone to misannotation (7,8), indicating that significant challenges for functional inference remain.
This minireview focuses on how new insights about protein structure-function relationships and functional inference can be obtained from large-scale analyses of proteins, specifically for "functionally diverse" enzyme superfamilies. We define these types of superfamilies as sets of homologous proteins that conserve structural and active site features that can be explicitly associated with a conserved partial reaction or other chemical capability. Within a superfamily and constrained by these superfamily-common features, many divergent families may have evolved that exhibit different reaction and/or substrate specificities (9). (See the Prologue for some definitions of superfamilies, families, and related terms.) These types of superfamilies provide a useful context for inference of functional properties of members of unknown function ("unknowns") because the constraints imposed by the structure-function paradigm unique to each superfamily restrict the search space for functional inference of their reaction and substrate specificities, simplifying their functional assignments. Because the number of sequences in each superfamily is still increasing rapidly, large amounts of new data are regularly available to inform these investigations. Moreover, sequence and structural similarities among all of the members of a superfamily can be associated with many types of functional information, allowing us to leverage what is known to guide inference of functional properties of unknowns that are similar. (See the minireview by Gerlt et al. (48) in this thematic series describing strategies for assigning functions in the enolase superfamily for an example.) Furthermore, as our coverage of genome space increases, new "outlier" functions in superfamilies can be identified from specialized environmental niches, extending our estimates of the natural boundaries of functional variation that a particular superfamily supports.
Below, we describe how the continuing increase in sequence and structural data can be used to better understand the evolution of new functions and to improve functional inference. We do so using a relatively new application of network-based methods, protein similarity networks, which offer an attractive approach for investigating functional properties in the context of sequence and structural similarity. Results from such large-scale studies are reviewed here using examples from three different superfamilies of enzymes: the eukaryotic protein kinase (ePK)-like superfamily, a large group of acid-sugar dehydratases from the enolase superfamily, and the glutathione transferase (GST) superfamily.

Emerging Roles for Large-scale Computational Analysis of Protein Superfamilies
As methods for managing and analyzing sequence and structural data have improved, computational studies can more effectively address broad issues in large-scale mapping of structure-function relationships and deduction of the patterns by which natural evolution has led to the divergence of many functions from an ancestral structural scaffold. For example, for protein kinases, one of the largest and most important enzyme superfamilies, the seminal Manning tree (10) provided a foundation for classification of human kinases and those from other eukaryotes. Likewise, a large-scale study of redox proteins generated a census of sequence, structural, and functional characteristics of the divergent superfamilies of the thioredoxin fold class that are represented in nature (11).
Large-scale analyses have the additional advantage of revealing patterns not easily observable when smaller data sets are examined. For example, comparison of sequence and structural features conserved in the active sites of the members of the large and functionally diverse enolase superfamily allowed the prediction of the specific partial reaction uniting the entire superfamily, the abstraction of an α-proton of a carboxylic acid, thereby restricting the functional prediction problem for the thousands of sequences now identified as superfamily members to consideration of only the overall reactions and substrates consistent with that paradigm (12). Using that structure-function mapping as a foundation, more detailed computational and experimental studies have identified differences among superfamily members that distinguish the reaction and substrate specificities of the >20 constituent families whose functions can now be assigned (see the minireview by Gerlt et al. (48) for a listing). Other notable studies linking structural and mechanistic features across large enzyme superfamilies include analyses of the amidohydrolase (13,14), enoyl-CoA hydratase (15), nudix (16), haloalkanoic acid dehalogenase (17), and two dinucleotide-binding domain flavoprotein (18) superfamilies, to name a few.
As more powerful tools and computers have been created, the ease of mounting such studies has enabled new types of analyses that provide context for interpreting functional characteristics across homologous members of superfamilies. These include sophisticated algorithms for multiple alignment and phylogenetic inference, both of which have long been used to examine evolutionary relationships among groups of sequences. Especially relevant to this minireview, phylogenomic approaches, first described over a decade ago (19), combine phylogenetic reconstruction with functional assignment of unknowns based on their placement in the tree relative to knowns. Phylogenomic approaches have now been applied extensively to improve the accuracy of homology-based annotation and to distinguish divergent families within enzyme superfamilies (see Ref. 20 for an example). Additionally, searchable online databases such as BRENDA (21) provide access to a large store of enzyme function information, whereas others provide online curation and computational tools created to link enzyme sequence and structural information with functional characteristics and mechanistic properties (22-25).

Network-based Approaches for Large-scale Analysis of Protein Superfamilies
Although large-scale analyses indeed provide a "big picture" perspective that adds much to our understanding of genomic and chemical biology, the growing size of the data sets and their associated metadata continue to raise significant challenges for analysis and dissemination. Network-based analysis represents one approach used to capture biological context, with genetic or protein interaction networks using computational and/or experimental data being among the most common. Sequence and structure similarity networks have also been used for the analysis and visualization of structure-function relationships (26-28). This technique allows users to efficiently and quickly examine similarities of much larger sets of proteins than is generally possible using traditional methods such as phylogenetic trees and multiple alignments. For example, one such study mounted a comparison of over 145,000 sequences to create a map in which proteins are positioned according to sequence relationships and gene functions (29). The recent development of software platforms such as Cytoscape (30) facilitates the use of network methods and algorithms of several types, enabling access to these types of tools by non-experts.
Although they are not a substitute for phylogenetic inference, networks generated from even such simple metrics as all-by-all pairwise comparisons of a large number of divergent sequences have been shown to track well with known relationships and with the clustering provided by trees. Furthermore, they support facile mapping of many types of orthogonal data to proteins clustered by similarity (31). Types of information such as genome/operon context, interaction networks and pathways, and organism-specific information have been shown to enhance the accuracy of functional inference (see Refs. 32 and 33 for relevant reviews). In analogy to phylogenomics, functional information of many types can be associated with nodes (e.g. protein sequences or structures) in a similarity network to improve functional inference and insight. Because protein similarity networks can be quickly generated in interactive formats, users can easily explore these associations by coloring nodes with different combinations of sequence/structural properties and functional information.
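As a rough illustration of the idea, a similarity network can be reduced to a set of nodes joined by an edge whenever a pairwise score passes a chosen cutoff, with clusters read off as connected components. The sketch below is a minimal version under stated assumptions: the sequence names, e-values, and cutoffs are hypothetical toy data, and the function name cluster_at_cutoff is our own. It is not the pipeline used to build the networks reviewed here.

```python
from collections import defaultdict


def cluster_at_cutoff(pairwise_evalues, nodes, cutoff):
    """Group nodes into connected components.

    An edge is drawn only when the pairwise e-value is at or below
    `cutoff` (a smaller cutoff is more stringent).
    """
    parent = {n: n for n in nodes}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for (a, b), evalue in pairwise_evalues.items():
        if evalue <= cutoff:
            parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for n in nodes:
        clusters[find(n)].add(n)
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])


# Hypothetical toy data: four sequences and three pairwise e-values.
nodes = ["seqA", "seqB", "seqC", "seqD"]
pairwise_evalues = {
    ("seqA", "seqB"): 1e-90,
    ("seqB", "seqC"): 1e-50,
    ("seqC", "seqD"): 1e-10,
}

permissive = cluster_at_cutoff(pairwise_evalues, nodes, 1e-44)
stringent = cluster_at_cutoff(pairwise_evalues, nodes, 1e-84)
print(permissive)
print(stringent)
```

Tightening the cutoff removes weaker edges, so one large cluster breaks into smaller and more homogeneous groups. This is the same effect that is exploited when a network is viewed at more than one threshold.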
Examples illustrating the application of large-scale analysis of structure-function relationships using protein similarity networks are described below. Interactive versions of these networks are available from the authors and can be viewed using the freely available Cytoscape software (30).

Tracking Growth of Structural Coverage: ePK-like Superfamily
The ePK-like superfamily is a large and diverse group of homologous enzymes that share a common protein kinase-like fold (34) and conserved residues associated with ATP-dependent phosphorylation of proteins and small molecules. ePK-like enzymes mediate many important cellular processes, including signal transduction (10). They make up almost 2% of eukaryotic genes and, although present as a smaller percentage of bacterial genes, may be at least as important in bacterial cellular regulation as the structurally unrelated histidine kinases (35).
The size and diversity of the ePK-like superfamily make it hard to generate a global overview of their sequence and structural relationships. As a result, only a small number of groups have attempted the time-consuming task of generating large-scale classifications of the kinases. In one of these studies, Kannan et al. (35) used a library of hidden Markov models (HMMs) to identify >45,000 ePK-like sequences from the NCBI nonredundant database (36) and the Global Ocean Sampling data set (37) and to classify them into 20 families. Examination of this diverse sequence set allowed the identification of 10 residues conserved across most families. Six of these residues were known to be involved in ATP and substrate binding and catalysis, whereas the functional role of the remaining residues had not been established. This study also showed that all but one of these well conserved residues had been lost over the course of evolution in one or more families (in some cases, substituted with changes in other regions of the protein), illustrating the plasticity of the ePK-like fold. Although profile-profile alignments and alignments of conserved motifs could be used to group some families into related clusters, the size and diversity of the superfamily have continued to challenge the construction of a more detailed evolutionary history.
Scheeff and Bourne (38) were able to surmount the problem of low sequence identity across the superfamily by combining sequence and structural information into a single phylogenetic analysis. The results suggested that the tree constructed by this method had some advantages and was more reliable than trees produced using either sequence or structural data alone.
In addition to these types of global analyses, many thousands of detailed studies have been published describing properties of smaller groups and of individual enzymes. However, the sheer number of sequences and structures in this superfamily, coupled with the rate of growth of the sequence and structure databases, makes keeping an up-to-date record of kinase relationships increasingly difficult, even without the inclusion of linked functional information. (The Pfam (39) PKinase clan currently includes nearly 85,000 sequences.) Here, we illustrate the use of similarity networks to keep track of relationships between enzymes in large superfamilies. In this example, networks generated from pairwise structural comparisons provide a current update of the structural coverage of the superfamily. Fig. 1 shows structure similarity networks for the ePK-like superfamily (Footnote 3), colored by Pfam classifications, with Fig. 1A comparing the structural coverage available when the study by Scheeff and Bourne (38) was published (October 2005) with that available in May 2011. As is clear from these summaries, the structure space has filled out significantly over this 6-year span. Most strikingly, the fructosamine kinase family defined by Pfam, Fructosamin_kin (red oval in Fig. 1A, lower panel), was not represented at all in the network from 2005. Fig. 1B shows the same network as in Fig. 1A (lower panel), but thresholded at a more stringent cutoff (achieved by increasing the score required for drawing an edge between two nodes), providing a different and somewhat more detailed view of the same structural relationships and of the growth of structural coverage between these two time points. Although these networks use a set of structures that is larger and somewhat different from that used by Scheeff and Bourne, they track reasonably well with those trees (data not shown). Some exceptions include structures for which the position was labeled as uncertain in the Scheeff and Bourne tree. Alternative versions of these networks colored by the Manning classification (10), with the addition of the atypical kinase class used in Ref. 38, are provided in Fig. 2.
As shown in this example, similarity networks can be used effectively to update relationships among proteins in a superfamily as new structures become available, if, as for the ePK-like superfamily, its structural coverage is good. Sequence networks can also be used to summarize relationships among proteins on a large scale (11), as described below. Although the scale at which networks can easily query such data is still much larger than can generally be accommodated using multiple alignments and trees, the size of networks that can be viewed and manipulated by software such as Cytoscape is limited by the number of edges they contain. In practice, for a superfamily as large as the kinases, only a small proportion of the available sequences can be represented in a single network, typically requiring the use of representative sequences to cover the divergence space. Additionally, because of the diversity of many superfamilies, including the ePK-like superfamily, it is not possible to connect the whole set of sequences at statistically significant scores.
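One simple way to reduce a large sequence set to representatives is a greedy pass that keeps a sequence only if it is not too similar to any representative already chosen. The sketch below assumes this greedy scheme for illustration; it is not necessarily how representatives were chosen for the networks discussed here. The names pick_representatives and lookup_identity, the toy identity values, and the 0.9 default are all hypothetical.

```python
def pick_representatives(sequence_ids, identity, max_identity=0.9):
    """Greedy selection of representatives.

    A sequence is kept only if its fractional identity to every
    previously kept representative is below `max_identity`.
    `identity` is any callable returning a value between 0 and 1.
    """
    representatives = []
    for seq in sequence_ids:
        if all(identity(seq, rep) < max_identity for rep in representatives):
            representatives.append(seq)
    return representatives


# Hypothetical precomputed pairwise identities for three kinase sequences.
toy_identity = {
    frozenset(("k1", "k2")): 0.95,
    frozenset(("k1", "k3")): 0.40,
    frozenset(("k2", "k3")): 0.42,
}


def lookup_identity(a, b):
    # Order-independent lookup into the toy table above.
    return toy_identity[frozenset((a, b))]


reps = pick_representatives(["k1", "k2", "k3"], lookup_identity)
print(reps)
```

The result depends on input order, because the first sequence seen in each group of near-identical sequences becomes its representative. Placing experimentally characterized sequences first is one way to make sure they are retained.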

Prediction of New Carbon Sources in Human Gut Microbiome from Comparisons with Acid-sugar Dehydratases of Enolase Superfamily
Microbes residing in the gut have a significant influence on human health. In addition to aiding in energy harvest from food and synthesizing essential vitamins, changes in the gut microbial population are associated with medical conditions such as inflammatory bowel disease and obesity (3). Variations in microbiome populations have also been observed following treatment with antibiotics (40). Thus, much interest is now focused on determining the molecular functions and biological roles of the gut metaproteome both in healthy individuals and in those suffering from disease.
One of the most comprehensive studies on the human gut microbiome to date describes a set of 3.3 million microbial genes sequenced and assembled from fecal samples of 124 individuals (3). As expected, the census of protein functions initially identified in this metagenome includes proteins in many central metabolic pathways such as those involved in carbon utilization pathways. We used the information available in the Structure-Function Linkage Database (SFLD) (25) (Footnote 4) for a large set of acid-sugar dehydratases in the enolase superfamily to probe for additional and possibly unique carbon sources in the microbiome. This was accomplished by identifying putative acid-sugar dehydratases in the gut metagenome that differ from those that had been previously identified, whether of known or unknown specificity. The substrate specificities of 10 acid-sugar dehydratases have now been biochemically established (Footnote 5), allowing functional assignment of specificity to ~40% of the ~2000 sequences currently represented in this subgroup of the superfamily in SFLD. Although the rest can be assigned with high confidence as likely acid-sugar dehydratases, their substrate specificities remain unknown. Using SFLD tools, protein sequences from the human gut microbiome predicted to be acid-sugar dehydratases were identified and clustered together with the knowns and unknowns of the subgroup already annotated in the database. The results are summarized in the network shown in Fig. 3A (Footnote 6). This network is thresholded at a relatively permissive cutoff, where most families are found in one major cluster. Other reaction families that do not show similarities to any of the nodes in this large cluster at a threshold better than the cutoff form smaller clusters arranged randomly at the bottom of Fig. 3A.
Footnote 3. For network analysis for the ePK-like superfamily, structures were chosen to include only one structure for each unique UniProt ID, with a preference for 1) structures solved October 2005 or previously and 2) wild-type, 3) ligand-bound, and 4) good resolution structures. Using the FAST algorithm (46), each structure in the set was used as a query against a database containing all structures in the set. Networks were created at various N-score cutoffs and visualized using Cytoscape.
Footnote 4. SFLD is a joint project of the Babbitt laboratory (supported by National Institutes of Health Grant GM60595 and National Science Foundation Grants DBI-0234768 and DBI-0640476) and the UCSF Resource for Biocomputing, Visualization, and Informatics (supported by National Institutes of Health Grant P41 RR001081). Additional support for the creation of networks available at SFLD is provided by the Enzyme Function Initiative (supported by National Institutes of Health Grant U54 GM093342).
Footnote 5. Of 10 acid-sugar dehydratase families of known reaction specificity in SFLD, only seven are colored in Fig. 3, as two others are not represented in this analysis. The mandelate racemase family, the namesake of the subgroup, is also colored. Although mandelate racemase is not an acid-sugar dehydratase, it is a member of this subgroup by sequence and structural similarity and is therefore included in Fig. 3.
Footnote 6. For network analysis for the gut metagenome, the sequence set consists of 1) the subgroup from SFLD containing acid-sugar dehydratases (named the mandelate racemase subgroup), filtered to 90% identity, aside from experimentally characterized members, all of which are present, and 2) all gut metagenome sequences that matched either this SFLD subgroup HMM or an SFLD family HMM from a family within the subgroup with an e-value cutoff of at least 1e-2 and that did not better match any other enolase superfamily SFLD HMMs. These sequences were filtered to 90% identity and to remove fragments under 150 amino acids. BLAST analysis (47) was performed using each sequence in the set as a query against a database containing all sequences in the set. Networks were created at two different e-value cutoffs and visualized as described in Footnote 3.
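The selection of metagenome sequences described in Footnote 6 can be sketched as a simple filter over HMM search results. This is a minimal sketch under stated assumptions: the HmmHit record, its field names, and the toy hits are hypothetical, the search results are assumed to be parsed already, and the 90% identity filtering step is left out.

```python
from dataclasses import dataclass


@dataclass
class HmmHit:
    seq_id: str
    length: int               # sequence length in amino acids
    subgroup_evalue: float    # best e-value vs. the subgroup or its family HMMs
    best_other_evalue: float  # best e-value vs. other enolase superfamily HMMs


def select_candidates(hits, evalue_cutoff=1e-2, min_length=150):
    """Keep ids of hits that pass all three criteria from Footnote 6."""
    return [
        h.seq_id
        for h in hits
        if h.subgroup_evalue <= evalue_cutoff          # significant match
        and h.subgroup_evalue <= h.best_other_evalue   # no better match elsewhere
        and h.length >= min_length                     # not a short fragment
    ]


# Hypothetical toy hits, each illustrating one outcome.
hits = [
    HmmHit("g1", 380, 1e-60, 1e-20),  # passes every criterion
    HmmHit("g2", 120, 1e-30, 1.0),    # fragment under 150 residues
    HmmHit("g3", 400, 1e-5, 1e-40),   # better match to another HMM
    HmmHit("g4", 300, 0.5, 1.0),      # fails the e-value cutoff
]

candidates = select_candidates(hits)
print(candidates)
```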
Simple examination reveals a few emerging clusters in the main cluster and also in the separated clusters (e.g. the circled group in Fig. 3A) that are populated primarily or exclusively by gut metagenomic sequences. Because these sequences are somewhat distant from those with characterized functions (designated by different colors), they may indeed represent unique acid-sugar dehydratases and, hence, new carbon sources not previously associated with the superfamily. A more detailed examination of this hypothesis can be obtained by visualization of the network at the more stringent e-value cutoff, shown in Fig. 3B. In this view, most of the characterized families within the subgroup have separated into individual clusters, suggesting that this threshold cutoff may be useful for hypothesizing the boundaries of at least some of the functionally distinct families within it. From this view, we can predict the specificity of some of the metagenomic sequences that cluster closely with known families, e.g. fuconate and galactonate dehydratases. The perspective provided in Fig. 3B also lends support to the hypothesis that the separated clusters populated only by gut metagenomic sequences and other uncharacterized sequences from the GenBank™ Data Bank may indeed represent new carbon sources not previously identified as members of the enolase superfamily. Finally, the addition of these metagenomic sequences to the networks helps to fill out the sequence space representing the acid-sugar dehydratases and illustrates more fully the breadth of their natural diversity. It is also interesting that some clusters containing members of characterized families in Fig. 3B have no representatives from the gut microbiome, suggesting that these functions may not be represented in the microorganisms that live in the gut (or those functions are supplied by enzymes from a different evolutionary background).
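The reasoning in this section, in which clusters are read by their composition, can be sketched as a small labeling step applied to each cluster once a network has been thresholded. The function summarize_cluster, the node names, and the label strings below are hypothetical, and any label is only a hypothesis that still needs experimental validation.

```python
def summarize_cluster(members, family_of, metagenomic):
    """Label a cluster by the characterized families it contains.

    `family_of` maps sequences of assigned specificity to a family name;
    `metagenomic` is the set of sequences from the gut metagenome.
    """
    families = {family_of[m] for m in members if m in family_of}
    n_meta = sum(1 for m in members if m in metagenomic)
    if not families:
        label = "no characterized family: candidate new specificity"
    elif len(families) == 1:
        label = f"candidate {next(iter(families))}"
    else:
        label = "mixed families: cutoff too permissive to assign"
    return {"families": sorted(families), "metagenomic": n_meta, "label": label}


# Hypothetical toy annotations.
family_of = {"s1": "fuconate dehydratase", "s2": "galactonate dehydratase"}
metagenomic = {"m1", "m2", "m3"}

near_known = summarize_cluster(["s1", "m1"], family_of, metagenomic)
only_meta = summarize_cluster(["m2", "m3"], family_of, metagenomic)
mixed = summarize_cluster(["s1", "s2", "m1"], family_of, metagenomic)
print(near_known["label"])
print(only_meta["label"])
print(mixed["label"])
```

A "mixed" label mirrors what is seen at the permissive cutoff, where most families fall into one major cluster. Repeating the summary at a more stringent cutoff is what allows single-family clusters to emerge.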

What We Do Not Know About Cytosolic GST Superfamily
GSTs constitute a large class of enzymes that play important biological roles in cell signaling and metabolism of endogenous compounds, drugs, and other xenobiotics. They are ubiquitous in nature (except for archaea) and may represent as much as 0.01% of the enzyme universe (Footnote 7). Based on sequence similarities, GSTs have historically been organized into major classes using the names of Greek letters (e.g. Alpha, Pi, Omega, Theta, etc.) (41). Within each major class, subclasses designate functional and other properties. Although a number of GSTs have been experimentally characterized in terms of their general substrate profiles, the physiological substrates and reaction specificities of only a small minority are known. Still, because of their importance to human biology and health, GSTs are among the best studied of enzyme superfamilies, with thousands of publications detailing their biological roles and structural and functional properties. Only a few studies have focused on the GST superfamily on a large scale, however (11,42,43). The sequence similarity network (Footnote 8) shown in Fig. 4 provides an overview of the cytosolic GST superfamily from one of these (42). It compares 622 GSTs representing >6000 sequences and shows that they can be divided into two major groups distinguished by sequence and structural similarity (and also by variations in their active site features). The majority of the enzymes in the smaller of the two groups shown in Fig. 4 (Group 1) are from eukaryotic organisms, whereas those from the larger group (Group 2) are more mixed, but with the largest number coming from bacteria.
The summary of sequence relationships and structural coverage provided in Fig. 4 is the first time that similarity relationships across the entire GST superfamily were captured in a single view. This map shows both the sequences that could be classified as members of one of the major classes (colored nodes) and those that had not even been assigned to one of these general classes (light and dark gray nodes) and had thus far only been identified as belonging to the cytosolic GST superfamily. Remarkably, despite decades of study, these results reveal that the vast majority of GSTs have never been functionally characterized at any level. Furthermore, the representation of the colored nodes in the overall topology suggests that many additional classes likely remain to be defined. The view provided in Fig. 4 thus lays a foundation for choosing new sequences for which functional and structural characterization may be especially valuable for prediction of new functional classes. Many additional GST sequences have recently been identified (Footnote 9), so the proportion of GSTs for which no functional information is available continues to increase dramatically.

Challenges for Computational Prediction of Functional Properties
The examples provided in this minireview suggest the value of large-scale analyses such as similarity networks for summarizing sequence and structural relationships in large superfamilies and for developing hypotheses about how structure- or sequence-based clustering tracks with functional boundaries. However, like any other method, similarity networks also have some significant limitations, a few of which have been addressed above and others elsewhere (31). Although it is only by experimental investigation that the in vitro and in vivo functions of unknowns can ultimately be validated, the continual growth of sequence data makes it increasingly difficult for either focused or high-throughput experimental studies to keep up. Even a reasonable fallback position requires the development of new strategies for identifying the few experiments that could be most useful for validation of large-scale computational predictions. As illustrated here and elsewhere (44,45), protein similarity networks represent one way to generate the context needed for choosing those experiments and interpreting the results.
Footnote 8. For network analysis for the GST superfamily, the sequence set was generated, and networks were calculated and visualized as described previously (42).
Footnote 9. P. C. Babbitt and D. Stryke, unpublished data.
Fig. 3. Sequence similarity networks of acid-sugar dehydratases known or predicted to belong to enolase superfamily and human gut microbiome. Networks were generated from all-by-all BLAST comparisons of 1578 sequences representing sequences of eight known acid-sugar dehydratase families and the mandelate racemase family from the mandelate racemase subgroup (see Footnote 5) as defined by SFLD and a filtered set of gut metagenome sequences that showed significant similarity to the members of the subgroup. Each of the 1578 nodes represents a sequence. Larger square nodes represent those that have been experimentally characterized, so their reaction and substrate specificities are known. Brown nodes represent sequences from the human gut metagenome, and white nodes represent SFLD sequences in the subgroup for which the reaction and substrate specificities have not been predicted. The remainder (small nodes) represent sequences for which specificity can be predicted at high confidence, colored by their SFLD family names (see Footnote 4). Nodes were arranged using the yFiles organic layout provided with Cytoscape version 2.7. A, each edge in the network represents a BLAST connection with an e-value of 1e-44 or better. At this cutoff, sequences have a median percent identity and alignment length of ~32% and 369, respectively. B, each edge in the network represents a BLAST connection with an e-value of 1e-84 or better. At this cutoff, sequences have a median percent identity and alignment length of ~44% and 384, respectively. Lengths of edges are not meaningful except that sequences in tightly clustered groups are relatively more similar to each other than sequences with few connections.