Targeting the Human Cancer Pathway Protein Interaction Network by Structural Genomics*

Structural genomics provides an important approach for characterizing and understanding systems biology. As a step toward better integrating protein three-dimensional (3D) structural information in cancer systems biology, we have constructed a Human Cancer Pathway Protein Interaction Network (HCPIN) by analysis of several classical cancer-associated signaling pathways and their physical protein-protein interactions. Many well known cancer-associated proteins play central roles as “hubs” or “bottlenecks” in the HCPIN. At least half of HCPIN proteins are either directly associated with or interact with multiple signaling pathways. Although some 45% of residues in these proteins are in sequence segments that meet criteria sufficient for approximate homology modeling (Basic Local Alignment Search Tool (BLAST) E-value <10⁻⁶), only ∼20% of residues in these proteins are structurally covered using high accuracy homology modeling criteria (i.e. BLAST E-value <10⁻⁶ and at least 80% sequence identity) or by actual experimental structures. The HCPIN Website provides a comprehensive description of this biomedically important multipathway network together with experimental and homology models of HCPIN proteins useful for cancer biology research. To complement and enrich cancer systems biology, the Northeast Structural Genomics Consortium is targeting >1000 human proteins and protein domains from the HCPIN for sample production and 3D structure determination. The long range goal of this effort is to provide a comprehensive 3D structure-function database for human cancer-associated proteins and protein complexes in the context of their interaction networks. The network-based target selection (BioNet) approach described here is an example of a general strategy for targeting co-functioning proteins by structural genomics projects.

In the past decades, many cancer-associated genes have been discovered, their mutations have been precisely identified, and the pathways through which they act have been characterized (1-3). The completion of the human genome sequence (4-6), the use of automated sequencing technology, and the development of microarray-based genomics and proteomics technologies (7,8) have had a significant impact on the field of cancer biology (9). In part based on these genome scale data, cancer is now recognized as a systems biology disease (10). Accordingly a comprehensive analysis of the molecular basis of cancer requires integration of the distinct but complementary fields of biochemistry, genomics, cell biology, proteomics, structural biology, and systems biology (8).
Recently a large number of biological pathway and network databases have been developed to capture the expanding knowledge of protein-protein interactions (e.g. Human Protein Reference Database (HPRD) (11) and Database of Interacting Proteins (12)) and of metabolic and/or signaling pathways (e.g. KEGG (13), Reactome (14), Signal Transduction Knowledge Environment (STKE), and BioCarta). A few databases are specifically focused on cancer-associated signaling pathways, such as The Cancer Cell Map and the Rel/NF-κB Signal Transduction Pathway. Pathguide (15) provides an overview of more than 200 Web-based biological pathway and network databases. It is challenging to appropriately integrate and utilize this large number of individual databases for systems biology (16). Lu et al. (17) have proposed to merge both pathway and network approaches by embedding pathways into large scale network databases. This approach integrates data on classical biochemical pathways with newly generated large scale proteomics data.
Since the era of genome sequencing, biologists have made extensive use of protein sequence information. Three-dimensional (3D) structural information is increasingly being used for understanding evolution and the mechanisms of molecular function. 3D structure provides critical information connecting protein sequence with molecular function. Although sequence alignments, which are broadly used by the molecular biology community, provide useful suggestions about which residues in homologous protein sequences are in corresponding positions, 3D structure-based alignments provide the true determination of corresponding residue positions (18-20), which may be inaccurately identified by sequence alignment information alone, especially in cases where the sequence conservation is weak. In favorable cases, protein structure can yield insights into mechanisms of enzyme activities and protein-ligand interactions. In addition, 3D structures of proteins involved in human disease can be used to discover and/or optimize new pharmaceutical agents (21,22).
A complete understanding of molecular interactions requires high resolution 3D structures as they provide key atomic details about binding interfaces and information about structural changes that accompany protein-protein interactions. Structural genomics is an international effort aimed at providing 3D structures, either directly by x-ray crystallography or NMR spectroscopy or by homology modeling, for all proteins in nature (23). Such a comprehensive structure-function database, containing experimental structures and homology models for hundreds of thousands of proteins, will accelerate research in all areas of biomedicine (24-26).
Recently Xie and Bourne (27) have discussed the structural coverage of human proteins grouped by the Enzyme Commission and the Gene Ontology classifications. This analysis provides a valuable summary of the structural information available for many human disease-related proteins and provides guidance for protein target selection by structural genomics projects.
As a component of this vision of structural genomics, we have established the Human Cancer Pathway Protein Interaction Network (HCPIN) database, a collection of human proteins that participate in cancer-associated signaling pathways, and their protein-protein interactions. HCPIN (version 1.0) includes ∼3000 proteins and ∼10,000 interactions. HCPIN integrates (embeds) pathway data with protein-protein interaction data (17) and provides protein structure-function annotations to inform cancer biology. The HCPIN Website, illustrated in Fig. 1, provides an extensive collection of experimental and homology models of proteins or domains associated with human cancers.
Here we summarize the current 3D structural coverage of HCPIN and present plans for targeting the remaining proteins in this network for structural analysis. The Northeast Structural Genomics Consortium (NESG) has selected proteins from HCPIN for cloning, expression, purification, and 3D structure determination. This network-based target selection approach not only provides a framework for completing structural coverage of a disease-associated protein interaction network but also provides specific hypotheses regarding protein interaction partners that can be tested by co-expression, co-crystallization, and 3D structure determination of the resulting protein-protein complexes (28,29). The long range goal of this effort is to provide a comprehensive 3D structure-function database for human cancer-associated proteins, the corresponding protein-protein complexes, and their interaction network.

EXPERIMENTAL PROCEDURES
Database Searches-Cell cycle progression, apoptosis, MAPK, Toll-like receptor, TGF-β, phosphoinositide 3-kinase (PI3K), and JAK-STAT signal transduction pathways were downloaded from the KEGG database (version 0.6, January 2006) (13). Protein-protein interactions and multiprotein complexes were downloaded from the Human Protein Reference Database (11) (09_13_05 release), which included ∼16,000 proteins and ∼20,000 interactions. Interactions for all pathway proteins and also additional interactions between interaction proteins are included in the HCPIN. The list of 363 genes involved in human cancer was obtained from the Cancer Gene Census (CGC) Database (1). This list is restricted to genes for which reported mutations are causally implicated in oncogenesis. We used an IPI human cross-reference file (release 3.12) (30) to cross-reference proteins from HCPIN, CGC, and Swiss-Prot (31).
HCPIN 3D structural coverage statistics were assessed by running a BLAST search against PDB sequences (February 2006) using the TargetDB search tool with standard default parameters. Disordered residues with missing coordinates for segments within otherwise well determined 3D structures are counted as "structurally covered" in our structural coverage statistics. HCPIN proteins with no cross-referenced Swiss-Prot ID are considered as not having verified gene models and are excluded from the structural coverage statistics.
Bioinformatics Programs-SignalP v3.0 (32) and TMHMM v2.0 (33) were used for predicting secreted and transmembrane proteins. The Pfam domains are identified in the SwissPfam file provided from Pfam v19.0 (34,35). The program COILS (36) was used to predict coiled coil regions. We labeled regions of low complexity by using the program SEG (37). Default options were used for all programs. An in-house Perl program was written to predict disordered regions based on mean charge and mean hydrophobicity (38).
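The in-house disorder predictor is described here only as using mean charge and mean hydrophobicity (38). As a rough illustration of that kind of calculation, the Python sketch below classifies a sequence by comparing its mean hydropathy with a boundary that depends on its mean net charge. The Kyte-Doolittle scale, the boundary coefficients (2.785 and 1.151), the treatment of histidine as neutral, and all function names are assumptions made for this example; they are not taken from the HCPIN Perl program.

```python
# Kyte-Doolittle hydropathy values (assumed scale for this sketch).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
# Formal side-chain charges; histidine is treated as neutral (assumption).
CHARGE = {"D": -1.0, "E": -1.0, "K": 1.0, "R": 1.0}


def mean_hydropathy(seq):
    """Mean hydropathy, with each residue value rescaled to the 0-1 range."""
    return sum((KD[aa] + 4.5) / 9.0 for aa in seq) / len(seq)


def mean_net_charge(seq):
    """Absolute value of the mean net charge per residue."""
    return abs(sum(CHARGE.get(aa, 0.0) for aa in seq)) / len(seq)


def looks_disordered(seq):
    """True if the sequence lies on the disordered side of the assumed boundary."""
    boundary = (mean_net_charge(seq) + 1.151) / 2.785
    return mean_hydropathy(seq) < boundary
```

In practice such a test would be applied to sliding windows along each protein rather than to whole sequences, with the window length chosen by the user.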
Topology and Statistics Analysis-Program Pajek was used for network topology analysis (39). The program R was used for statistics analysis (40).
Homology Modeling and Structure Quality Assessment-HCPIN homology models are selected from MODBASE (41) and/or built using the XPLOR homology modeling protocol of Homology Modeling Automatically (HOMA) (42). If multiple models are available from MODBASE, the model with highest sequence identity is selected by HCPIN. Structure quality reports for each of the experimental structures and models were generated using the Protein Structure Validation Software suite (43), which includes structure validation analysis with ProsaII (44), Verify3D (45), Procheck (46), MolProbity (47), and other structure quality assessment tools. Over time, the homology model database of HCPIN will be updated and expanded.
The HCPIN Web-accessed Database-Generation of Web pages (HTML) for the HCPIN server was done using Java and a relational database (MySQL). We recommend Web browsers Firefox version 2.0 or higher and Internet Explorer 7 or higher to provide full Java functionality. Ribbon diagrams were generated using PyMOL. We plan to update structure coverage annotation information weekly and update HCPIN protein information every 4 months.

Human Cancer Pathway Protein Interaction Network
The HCPIN is a collection of proteins from cancer-associated signaling pathways together with their protein-protein interactions. The HCPIN version 1.0 was constructed by combining proteins from seven KEGG (13) classical cancer-associated signaling pathways together with protein-protein interaction data from the HPRD (11). HPRD is a resource of protein-protein interaction information manually collected from the literature and curated by expert biologists to reduce errors (11). We used KEGG because of its high quality (48). Pathway interaction information from KEGG was excluded from HCPIN because of a lack of precise definitions (17).
The seven pathways in this initial version of HCPIN include (i) cell cycle progression, (ii) apoptosis, (iii) MAPK, (iv) innate immune response (Toll-like receptor), (v) TGF-β, (vi) PI3K, and (vii) JAK-STAT pathways. Many well known important cancer-associated proteins, such as p53 and NF-κB, are associated with at least one of these pathways. The current version of HCPIN includes 2977 proteins and 9784 protein-protein interactions, including 240 multiprotein complexes each comprised of at least three proteins (Table I).
FIG. 1. The HCPIN is a Web-accessible database. It is designed for use by cancer biologists interested in assessing 3D protein structural information in the context of the protein interaction network. A, HCPIN home page. B, a snapshot of Networks view, visualizing protein-protein interactions with structure annotations. The outside ring represents the percentage of structural coverage. Green ring, experimental model is available with >99% sequence identities; yellow ring, homology model is available with >80% sequence identities. The Website provides tools for interactive analysis of the HCPIN. C, a snapshot of Proteins view, listing sequence information and PDB BLAST hits, summarizing all structural information available for the human HCPIN protein and its homologues and providing links to the corresponding PDB entries and other structure-function annotation information. D, a snapshot of Icon gallery, a collection of ribbon diagrams for each of the known structures and the structural models in the HCPIN.
HCPIN proteins collected from the KEGG pathways are called pathway proteins. Other HCPIN proteins that are not included in the KEGG pathways but interact with these pathway proteins are called interaction proteins. The representation of protein complexes using a binary protein-protein interaction graph remains a challenge because without detailed structural studies it is often not possible to distinguish direct physical interactions from interactions mediated through the complex (49,50). We used triangular pseudonodes, which link proteins involved in the same complex, to represent multiprotein complexes (50). These multiprotein complexes account for ∼1000 edges of the total ∼10,580 edges in the HCPIN. Table I summarizes other statistics of the HCPIN with and without these multiprotein complexes. Of 664 pathway proteins defined by KEGG, 150 have no annotated physical interactions in the HPRD.
Some of these may be associated with the seven KEGG pathways by gene transcription or have interaction partners that are not yet identified or annotated in the HPRD.
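The pseudonode representation of multiprotein complexes can be sketched in a few lines of Python. This is a minimal illustration, assuming that each complex is represented by one extra node linked to every member protein; the function name, the "complex:" prefix, and the input layout are hypothetical and do not describe the HCPIN database schema.

```python
def build_edges(binary_interactions, complexes):
    """Return an undirected edge set with one pseudonode per complex.

    binary_interactions: iterable of (protein_a, protein_b) pairs.
    complexes: dict mapping a complex name to its member proteins.
    """
    edges = set()
    for a, b in binary_interactions:
        if a != b:  # self-interactions are skipped in this sketch
            edges.add(frozenset((a, b)))
    for name, members in complexes.items():
        pseudonode = "complex:" + name
        # Link every member to the pseudonode instead of to each other,
        # so no direct contact between members is asserted.
        for protein in members:
            edges.add(frozenset((pseudonode, protein)))
    return edges
```

With this layout a complex of n proteins contributes n edges, rather than the n(n-1)/2 edges that would be needed to connect all members pairwise.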
The interaction data included in the current version of HCPIN are a subset of the HPRD. Although including only ∼15% of HPRD proteins, HCPIN accounts for about half of the protein-protein interactions in the HPRD (09_13_05 release). Despite the fact that HCPIN represents only a portion of the signaling network of the human interactome, its degree distribution is similar to that of many other scale-free interactome networks (51-56). The clustering coefficient in the HCPIN is better approximated by C(k) ∝ k⁻¹ than by a k-independent clustering coefficient C(k), which further indicates the modularity of HCPIN (51,52,57). Future expansions and refinements of HCPIN will include cancer-related signaling pathways from other sources (15) as well as protein-protein interaction data from other manually curated sources (e.g. Database of Interacting Proteins (12); MINT, the molecular interaction database (58); or Reactome (14)). We envision HCPIN as an evolving, curated resource of structure-function information for the human cancer protein interactome.
The Cancer Gene Census Database comprises 363 protein-encoding human genes that are causally implicated in oncogenesis (1), defined here as CGC proteins. Among these 363 CGC proteins, 186 CGC proteins are included in the HCPIN, and only 52 of these are pathway proteins. This high coverage of cancer genes in the HCPIN confirms that the cancer genes are heavily associated with signaling pathways and their interactions and also demonstrates that the seven pathways that we selected for this analysis are central to cancer biology. This coverage may be increased by including additional cancer-related signaling pathways. Many HCPIN proteins that are fundamental in cancer biology, such as Grb2, Jun, Src, etc., are not included in CGC, and many CGC proteins are not included in HCPIN because they are not characterized to date in the protein-protein interaction literature covered by KEGG or HPRD.
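The degree dependence of the clustering coefficient, C(k), can be estimated by averaging the local clustering coefficient over all nodes of the same degree k. The Python sketch below shows one way to do this for a small undirected graph stored as a dictionary of neighbor sets; the function names and the convention of assigning 0 to nodes with fewer than two neighbors are assumptions of this example, not a description of the Pajek analysis used here.

```python
from collections import defaultdict


def local_clustering(adj):
    """Local clustering coefficient of each node of an undirected graph."""
    coeff = {}
    for node, neighbors in adj.items():
        k = len(neighbors)
        if k < 2:
            coeff[node] = 0.0  # convention chosen for this sketch
            continue
        nbrs = list(neighbors)
        # Count the edges that connect pairs of neighbors of this node.
        links = sum(
            1
            for i in range(k)
            for j in range(i + 1, k)
            if nbrs[j] in adj[nbrs[i]]
        )
        coeff[node] = 2.0 * links / (k * (k - 1))
    return coeff


def mean_clustering_by_degree(adj):
    """Average the local clustering coefficient over nodes of equal degree."""
    buckets = defaultdict(list)
    for node, c in local_clustering(adj).items():
        buckets[len(adj[node])].append(c)
    return {k: sum(values) / len(values) for k, values in buckets.items()}
```

Plotting the resulting values against k on logarithmic axes gives a slope near -1 when C(k) ∝ k⁻¹ and a flat line when the clustering coefficient is independent of k.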
Network Centrality Measures Versus Essentiality-The degree of a protein (node) is defined as the number of interactions in which a particular protein participates (vertex degree). The betweenness of a protein (vertex betweenness) measures the number of non-redundant shortest paths going through this protein. Proteins with a high degree or high betweenness are central proteins, which are often critical for cell survival (59-63). For many scale-free interaction networks, degree and betweenness are highly correlated (63). Similarly strong correlations are observed for the HCPIN (Kendall's τ = 0.79, p value <2.2e-16). As can be seen in Fig. 2, top central proteins of HCPIN with both high degree and high betweenness include key cancer-associated essential proteins, such as p53, Grb2, Raf1, EGF receptor, and others. Fig. 2 also shows that proteins with high betweenness but low degree are quite abundant, especially for CGC proteins (in red). This suggests that bottleneck proteins, like hub proteins, play essential biological roles; this is in agreement with previous observations (61-63).
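To make the two centrality measures and their rank correlation concrete, the following Python sketch computes vertex degree, vertex betweenness, and Kendall's τ for a small undirected graph. It is an illustration under stated assumptions, not the Pajek and R procedure used for the HCPIN: betweenness is computed with Brandes' algorithm, which splits credit among equally short paths, and the correlation is the simple tau-a statistic, which omits the tie correction that statistical packages usually apply. All function names are hypothetical.

```python
from collections import deque
from itertools import combinations


def degree(adj):
    """Vertex degree for a graph stored as {node: set(neighbors)}."""
    return {v: len(nbrs) for v, nbrs in adj.items()}


def betweenness(adj):
    """Vertex betweenness (Brandes' algorithm, unweighted, undirected)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        stack = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s] = 1
        dist[s] = 0
        queue = deque([s])
        while queue:  # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:  # accumulate dependencies in reverse BFS order
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair of endpoints was visited from both ends.
    return {v: b / 2.0 for v, b in score.items()}


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / pairs
```

A node with low degree but high betweenness, such as the single link between two otherwise separate clusters, is what the text refers to as a bottleneck.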
Cross-talk between Signaling Pathways-Signaling pathways interact with one another to form complex networks (64). The subnetwork of proteins in a specific pathway together with their interaction partners forms a pathway interaction subnet (also called embedded pathways (17)). Accordingly the seven core KEGG signaling pathways used to construct this version of HCPIN are associated with seven larger pathway interaction subnets. We have also estimated here the cross-talk of the seven signaling pathways by looking at the frequencies of specific proteins in (i) each of the seven signaling pathways and (ii) each of the seven associated pathway interaction subnets.
We first analyzed the cross-talk between pathway proteins associated with each of the seven KEGG signaling pathways. About 20% of all HCPIN pathway proteins are included in more than one KEGG signaling pathway. Fig. 3A summarizes the frequency of observing one pathway protein in multiple signaling pathways. For example, the AKT family of paralogs, the PI3K family of paralogs, and the TNFα protein are involved in four of the seven signaling pathways. The uniqueness of particular proteins to particular KEGG pathways differs for the different signaling pathways. Although some 60-70% of pathway proteins from either the innate immune response or apoptosis pathways are directly associated with at least one other signaling pathway, for the other pathways studied only ∼30% of pathway proteins are associated with more than one pathway. We next analyzed the cross-talk between the pathway interaction subnets associated with each of the seven KEGG signaling pathways by HPRD interaction data. Fig. 3B summarizes the frequency of observing one HCPIN protein in multiple pathway interaction subnets. These data show that HCPIN proteins are frequently shared between multiple pathway interaction subnets. Overall about 53% of HCPIN proteins are associated with more than one pathway interaction subnet. In other words, more than half of the HCPIN proteins are either directly associated with or interact with multiple signaling pathways. Although only ∼20% of all pathway proteins are directly associated with multiple (>1) pathways (Fig. 3A), ∼58% of pathway proteins are associated with multiple pathway interaction subnets (Fig. 3C). The percentage of pathway proteins associated with multiple pathway interaction subnets (58%) is similar to the percentage of all HCPIN proteins associated with these interaction subnets (53%); the cross-talk between pathways is mediated approximately equally by core pathway proteins and interaction proteins.
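The two cross-talk counts described above can be sketched as follows. This Python example assumes that a pathway interaction subnet consists of the pathway proteins plus their direct interaction partners, which is one reading of the definition given in the text; the function names and input layout are hypothetical.

```python
from collections import Counter


def membership_counts(groups):
    """Number of groups (pathways or subnets) that contain each protein."""
    counts = Counter()
    for members in groups.values():
        counts.update(set(members))
    return counts


def interaction_subnets(pathways, interactions):
    """Extend each pathway with the direct partners of its member proteins."""
    partners = {}
    for a, b in interactions:
        partners.setdefault(a, set()).add(b)
        partners.setdefault(b, set()).add(a)
    subnets = {}
    for name, members in pathways.items():
        subnet = set(members)
        for protein in members:
            subnet |= partners.get(protein, set())
        subnets[name] = subnet
    return subnets


def shared_fraction(counts):
    """Fraction of proteins that occur in more than one group."""
    shared = sum(1 for n in counts.values() if n > 1)
    return shared / len(counts)
```

Applying `shared_fraction` to the counts from the pathways themselves gives the quantity summarized in Fig. 3A, and applying it to the counts from the subnets gives the quantity summarized in Fig. 3B; restricting the subnet counts to pathway proteins corresponds to Fig. 3C.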
Seven pathway proteins are involved in all seven pathway interaction subnets (i.e. Raf1, a serine/threonine kinase; Stat1; Stat3; Rb; p53; CREB-binding protein; and TGFR1). Another seven interaction proteins (i.e. proteins in the interaction subnet that are not core pathway proteins) are included in all seven pathway interaction subnets (i.e. tyrosine kinase Lyn, estrogen receptor α, β-catenin, insulin receptor, casein kinase II, Hsp90-α, and Sam68). These proteins associated with all seven interaction subnets play central roles in cancer biology.
FIG. 3. Cross-talk between pathways. A, frequency of observing one protein in one or more of the seven KEGG signaling pathways. ∼20% of HCPIN pathway proteins are associated with two or more pathways. B, frequency of observing one HCPIN protein in one or more of seven pathway interaction subnets. >50% of HCPIN proteins are associated with two or more interaction subnets. C, frequency of observing one pathway protein in one or more pathway interaction subnets. The frequencies (1-7) are also labeled on the side of these pie charts.

Structural Coverage of HCPIN Proteins
The accuracy of homology models is largely determined by the percent sequence identity with the template 3D structure upon which the model is based (43,65). Models built at ∼30-50% sequence identity with the template (a medium accuracy modeling level) tend to have ∼90% of the main chain modeled within 1.5-Å root mean square deviations from the correct structure but with frequent side-chain packing, core distortion, and loop conformation errors (65,66). Homology models built with more than 50% sequence identity tend to have about 1.0-Å root mean square deviation from correct structures for the main-chain atoms, with larger deviations for side-chain packing (66). Our goal is to characterize the structural coverage of the HCPIN using high quality experimental structures or accurate models, especially for enzyme active sites, based on structural templates with BLAST E-value <10⁻⁶ and sequence identity >80% (a high accuracy modeling level). Although this cutoff is somewhat arbitrary, models generated from such templates will usually be of high reliability and accuracy. Such high quality structures or models of these human proteins are potentially useful for active site docking, studying catalytic mechanism, and designing ligands useful for drug discovery (67).
We have estimated the structural coverage of HCPIN at both medium accuracy modeling level (defined here as BLAST E-value <10⁻⁶) and high accuracy modeling level (defined here as BLAST E-value <10⁻⁶ and at least 80% sequence identity). Human protein sequence information has been annotated by different experimental and computational methods and stored in different databases with various levels of gene model accuracy (30). Alternative splice sites, translation initiation sites, and other gene modeling issues complicate the protein sequence annotation process (68). Swiss-Prot is a high quality manually annotated protein knowledgebase (69). About 78% of HCPIN protein sequences (2328 sequences) can be validated by Swiss-Prot (IPI v3.12) gene model annotations (30). The structural coverage statistics discussed here are for only these 2328 protein sequences that can be verified by Swiss-Prot data. Table II summarizes the structural coverage of HCPIN proteins at medium and high accuracy homology modeling levels. At the medium accuracy level, about 86% of Swiss-Prot-verified proteins from the seven HCPIN pathways (pathway proteins) have at least one domain with structural information available from the PDB. These proteins are defined as having single domain coverage (27), i.e. either an experimental structure or a structure template useful for medium accuracy modeling of at least part of the protein structure. These structures and models cover about 55% of residues in HCPIN (defined here as residue coverage), excluding predicted low complexity and coiled coil regions. Interestingly innate immune response and apoptosis pathways, which are heavily involved in pathway cross-talk, also have the highest residue coverage (>70%). At the high accuracy modeling level, the structural coverage of pathway proteins is much lower; only 52% have single domain coverage with 25% of residues covered.
These structural coverage statistics are upper bounds because this analysis excludes the ∼20% of proteins in HCPIN for which protein coding sequences cannot be verified by the Swiss-Prot (IPI v3.12) database. The single domain and residue coverage of the interaction proteins, which are included in the seven pathway interaction subnets but not in the seven KEGG pathways, is much lower than for pathway proteins, 76 and 42%, respectively, at medium accuracy level and 44 and 18%, respectively, at high accuracy level. These coverage statistics reflect the traditional bias of targeting core signaling pathway proteins in structural biology projects. Overall HCPIN has 78 and 46% single domain coverage (45 and 20% residue coverage) at medium and high accuracy modeling levels, respectively. This single domain coverage of HCPIN proteins is significantly higher than the estimated average single domain coverage of the human proteome (27).
We have annotated the 3D structural coverage of all HCPIN proteins in the network diagrams provided on the HCPIN Website (Fig. 1B). These Web-based graph representations provide direct interactive global views of the 3D structural coverage of these pathway protein interaction networks. The outside ring on each node represents the percentage of the residue coverage of the protein.
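The two accuracy levels and the residue coverage statistic can be expressed as a short calculation over BLAST hits against PDB sequences. The Python sketch below is illustrative only: the hit tuple layout, the function names, and the choice to remove excluded regions (predicted low complexity and coiled coil segments) from both the covered and the total residue counts are assumptions of this example rather than a description of the HCPIN pipeline.

```python
def coverage_level(evalue, identity):
    """Classify one BLAST hit as 'high', 'medium', or None (not usable)."""
    if evalue >= 1e-6:
        return None
    return "high" if identity >= 80.0 else "medium"


def residue_coverage(length, hits, level="medium", excluded=()):
    """Fraction of countable residues covered by qualifying hits.

    hits: iterable of (start, end, evalue, percent_identity), with
    1-based inclusive query coordinates.
    excluded: iterable of (start, end) regions left out of the statistic.
    """
    covered = set()
    for start, end, evalue, identity in hits:
        hit_level = coverage_level(evalue, identity)
        if hit_level is None:
            continue
        # High accuracy hits also count toward medium accuracy coverage.
        if level == "high" and hit_level != "high":
            continue
        covered.update(range(start, end + 1))
    masked = set()
    for start, end in excluded:
        masked.update(range(start, end + 1))
    countable = set(range(1, length + 1)) - masked
    if not countable:
        return 0.0
    return len(covered & countable) / len(countable)
```

Under these assumptions a protein has single domain coverage at a given level whenever at least one hit qualifies at that level, and overlapping hits are counted only once toward residue coverage.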
HCPIN Domains-Domains are the evolutionary modular building blocks of proteins. Experimental protein structure determination processes using x-ray crystallography or NMR spectroscopy are generally domain-oriented. Pfam is a manually curated database of protein domain families derived from sequenced genomes (34). There are ∼1000 PfamA domains identified in HCPIN with size ranging from ∼50 to 1000 residues. At medium level modeling accuracy, 53% of HCPIN PfamA domain families have complete fold coverage (i.e. at least one member of the domain family has essentially complete 3D structural coverage), whereas 35% of HCPIN domain families have no fold coverage at all. About 10% of Pfam domain families in HCPIN have partial fold coverage; i.e. a 3D structure is available for part of the predicted Pfam domain. This reflects inherent differences between sequence alignment-based domain boundaries used in Pfam and the actual structural domain boundaries. Some 10% of HCPIN domains occur at least 10 times in the set of HCPIN proteins. The most abundant domain in HCPIN is the collagen domain (appearing 265 times), which occurs frequently in extracellular structural proteins involved in formation of connective tissue. Other frequently occurring domain types include Pkinase, zinc finger C2H2, and WD40 domains. Table III summarizes the top 2% most abundant domain types in the HCPIN together with their 3D structural coverage statistics. All 21 of these most abundant domain families have "complete fold coverage" in that at least a medium level accuracy model or experimental structure is available for the full sequence of the domain. Modeling coverage, defined here as the percentage of domain members in HCPIN that can be modeled at high level modeling accuracy, is also summarized for these domain families in Table III. The frequently occurring domain families of intracellular proteins listed in Table III have relatively high modeling coverage.
For example, the SH2 domain, an intracellular signaling domain, has the highest modeling coverage, 58%. Experimental structures are available for fewer members of the frequently occurring secreted and membrane-associated domains listed in Table III, resulting in lower modeling coverage of these domain families. Progress in completing the HCPIN modeling coverage for these most abundant domains of HCPIN will provide a comprehensive understanding across the domain family of their structure-function relationships.
The HCPIN Structure Gallery-The HCPIN Website includes over 1000 protein or domain structure models of which two-thirds are experimental structures from the PDB (with greater than 99% sequence identity to the human HCPIN protein or protein domain) and one-third are homology models built with structural templates having BLAST E-values <10⁻⁶ and at least 80% sequence identities (Fig. 1D). To date, the NESG structural genomics project has determined 3D structures of 10 human proteins or domains targeted from the HCPIN; some of these are shown in Fig. 4.

HCPIN Target Selection for Structural Genomics
With the goal of providing high accuracy structural models of disease-associated human proteins, especially enzymes, our homology models of HCPIN proteins require a template protein of known 3D structure with pairwise BLAST E-value <10⁻⁶ and >80% sequence identity with the target protein (67). As discussed above, our structure coverage analysis shows that significant experimental efforts in x-ray crystallography and/or NMR spectroscopy are still needed to complete the structure coverage of the HCPIN at this high accuracy modeling level. Accordingly these "structurally uncovered" regions (defined at this high accuracy level) of HCPIN proteins have been selected for sample production and structure analysis efforts by the Northeast Structural Genomics Consortium.
Are HCPIN Proteins Suitable for Structural Genomics Efforts?-Because of limitations of current protein structure production technologies, it is generally more challenging to determine 3D structures of eukaryotic proteins or of secreted, integral membrane, or one-pass transmembrane proteins compared with intracellular proteins. Integral membrane proteins are particularly challenging to produce for 3D structure analysis. About 10% of HCPIN intracellular proteins have 100% residue coverage at high accuracy level, whereas only ∼2% of HCPIN proteins predicted to be secreted and/or membrane-associated have such complete coverage (Fig. 5A). However, considering the HCPIN proteins with only partial structural coverage, our analysis shows that pathway proteins predicted to be secreted and/or membrane-associated (e.g. soluble domains of one-pass transmembrane proteins) have similar single domain and residue coverage compared with intracellular proteins (Table II). These statistics suggest that structural genomics should target not only domains from intracellular proteins but also the domain families of extracellular secreted and/or extracellular domains of one-pass transmembrane human proteins of the HCPIN.
Size limitations are a concern for structural genomics efforts that require large scale protein sample production and are particularly relevant for structural NMR studies. Protein sample production is generally more successful for proteins of <600 residues. NMR studies usually require samples with <180 residues. For this reason, we also analyzed size distributions of Swiss-Prot-validated HCPIN protein chains (Fig. 5). The average full-length HCPIN protein is about 600 residues. The size distribution of predicted intracellular proteins is similar to the size distributions of predicted secreted and membrane-associated proteins (Fig. 5B). Size distributions are also similar for proteins with and without structural single domain coverage (Fig. 5B). Even very large proteins contain domains with some structural coverage that in most cases have been studied by expressing segments of the protein sequence constituting one or a few structural domains. Residue structural coverage distributions (Fig. 5C) are also similar for predicted intracellular, secreted, and/or membrane-associated HCPIN proteins with an average coverage of ∼110 residues. These size statistics are within the size limitation ranges that are currently addressed well by structural genomics efforts, supporting the feasibility of including these HCPIN proteins as targets of the Northeast Structural Genomics Consortium.
Target Selection Process-Fig. 6 shows details of our target selection process. HCPIN v1.0 consists of 664 pathway proteins identified from KEGG together with an additional 2313 interaction proteins from HPRD. 2328 of these 2977 HCPIN proteins are validated by Swiss-Prot (IPI v3.12) (30). For each amino acid sequence of those validated proteins, we filtered out regions that are not suitable for high throughput structural genomics efforts, including regions with low complexity, those predicted to be coiled coils, or those predicted to be largely disordered (38). We have identified 1160 intracellular proteins that have regions/domains suitable for such high throughput structural genomics efforts. Domains from secreted or membrane proteins have also been targeted as part of technology development projects but with lower priorities. Fig. 5D shows the size distribution of these targeted intracellular proteins. Although the size of the full-length targeted proteins varies, about 75% of targeted regions/domains have less than 300 residues. These protein targets are publicly accessible. Efforts have begun to clone, express, purify, and characterize these 1160 human proteins and protein domains. We have prioritized these targets mainly based on high throughput feasibility rather than other factors such as molecular and cellular functions. In addition, we prioritize for sample production and structure analysis hub and bottleneck proteins with high network degree and/or betweenness measures. The network annotations in the HCPIN database also provide biological and bioinformatics information that is being used on a case-by-case basis to prioritize particular protein targets.
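The region filtering step can be pictured as masking unsuitable residues and keeping the remaining contiguous stretches. The Python sketch below illustrates this under stated assumptions: the masked regions stand for any combination of low complexity, coiled coil, and disordered predictions, the minimum segment length of 50 residues is an arbitrary value chosen for the example, and the function name is hypothetical. It is not the NESG target selection software.

```python
def targetable_segments(length, masked_regions, min_length=50):
    """Return unmasked (start, end) segments, 1-based inclusive.

    masked_regions: iterable of (start, end) regions to filter out.
    min_length: shortest segment kept (assumed value for this sketch).
    """
    masked = [False] * (length + 1)  # index 0 is unused
    for start, end in masked_regions:
        for pos in range(max(start, 1), min(end, length) + 1):
            masked[pos] = True
    segments = []
    seg_start = None
    for pos in range(1, length + 1):
        if not masked[pos] and seg_start is None:
            seg_start = pos  # a new unmasked stretch begins here
        elif masked[pos] and seg_start is not None:
            segments.append((seg_start, pos - 1))
            seg_start = None
    if seg_start is not None:
        segments.append((seg_start, length))
    return [(s, e) for s, e in segments if e - s + 1 >= min_length]
```

Regions that already have high accuracy structural coverage could be passed in as additional masked regions, leaving only the structurally uncovered segments as candidate targets.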

DISCUSSION
Protein Sample Production Concerns-Protein sample production is challenging for the HCPIN proteins for several reasons. Cloning and expression of certain human proteins in Escherichia coli can be difficult or impossible. Many of these signaling proteins have multiple domains, evolved to convey biological signals from different inputs, and require reliable techniques for domain parsing. In addition, these cancer-associated signaling networks include significant numbers of proteins with extensive disordered regions, which are inherently challenging for expression, purification, and structure determination (70,71). Large macromolecular complexes not only require larger amounts of material but also a precise and coordinated assembly of the different subunits, conditions that are often not easy to reproduce in vitro (28).
FIG. 5. A, percent residue coverage distributions for HCPIN proteins. Intracellular, proteins inside the cell; s/m, proteins predicted to be secreted or having at least a segment that is integral or transmembrane. B, box plots of size distributions of HCPIN proteins and HCPIN proteins with single domain coverage. Intracellular, proteins inside the cell; s/m, as defined above; intracellular-SD, intracellular proteins with single domain coverage; s/m-SD, proteins predicted to be secreted or having at least a segment that is transmembrane with single domain coverage. C, box plots of size distributions of HCPIN proteins with residue coverage. Intracellular-residue and s/m-residue, residue coverage of intracellular proteins and predicted secreted/membrane-associated proteins, respectively. Single domain and residue coverages are shown at high accuracy level. A similar distribution is observed at medium accuracy level. D, box plots of size distributions of full-length and targeted subregions of proteins selected by the NESG structural genomics project.
Despite these challenges, there are certain technical advantages to targeting an extensive protein interaction network like the HCPIN. Many proteins that fail to express when produced alone can be expressed, purified, and crystallized by co-expressing and co-purifying them with their interacting partner proteins (29). We are taking advantage of this approach with HCPIN targets because potential partners for co-expression and co-purification are indicated by the network. Although cancer-associated signaling networks are likely to include significant numbers of proteins with extensive disordered regions (70,71), such disordered regions may become ordered upon binding to their protein partners, making the corresponding complexes suitable for high throughput structural genomics (72)(73)(74)(75).
General Strategy for Targeting Proteins from Pathway-based Interaction Networks-We propose a general strategy to select targets from pathway-based interaction networks. This target selection strategy can be applied to any biochemical pathway of interest. Previously reported target selection strategies for structural genomics have focused on protein families (76,77), whole genomes (78-80), pathways (25,67), and complexes (28,29). The target selection strategy we present here combines the selection strategies that have been proposed for structural genomics of biochemical pathways (25,67) and of protein-protein complexes (28,29).
First, lists of proteins involved in a specific biological pathway are collected. These proteins are called pathway proteins. Interaction proteins are then identified, including those that potentially interact directly with any pathway proteins or contribute to multiprotein complexes formed with pathway proteins. Interactions can be derived from the literature, curated peer-reviewed databases (11)(12)(13)(14)58), high throughput protein interaction experiments (53)(54)(55)(56), and/or integrated prediction methods (81,82). Gene models for both pathway and interaction proteins are then validated using Swiss-Prot (83). Protein sequences not verified by Swiss-Prot will require further analysis to confirm their authenticity. Regions of proteins with known 3D structure information from PDB are identified. Regions of proteins not covered with 3D structure information and also suitable for high throughput structural determination are then selected as structural genomics targets with emphasis on hub and bottleneck proteins. This BioNet target selection strategy not only provides a systematic approach for complete structure coverage for disease-associated pathways (25,67) but also provides a framework for studying protein interactions and complexes (28,29).
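The network side of this strategy can be sketched in a few lines of Python: pathway proteins are expanded with their direct interaction partners, and each protein is then scored by degree (hubs) and betweenness (bottlenecks). This is a minimal illustration under stated assumptions, not the HCPIN implementation. The protein identifiers are placeholders, only interactions that involve at least one pathway protein are kept (an assumption of this sketch), and betweenness is computed with Brandes' algorithm for unweighted graphs, a standard method that the text does not specify.

```python
from collections import deque
from typing import Dict, Iterable, Set, Tuple


def build_network(pathway: Set[str],
                  interactions: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Undirected network of pathway proteins and their direct partners."""
    graph: Dict[str, Set[str]] = {p: set() for p in pathway}
    for a, b in interactions:
        # Assumption: keep an edge only if it touches a pathway protein.
        if a != b and (a in pathway or b in pathway):
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
    return graph


def betweenness(graph: Dict[str, Set[str]]) -> Dict[str, float]:
    """Unnormalized shortest-path betweenness (Brandes' algorithm)."""
    score = dict.fromkeys(graph, 0.0)
    for source in graph:
        order = []                          # nodes in order of discovery
        preds = {v: [] for v in graph}      # shortest-path predecessors
        sigma = dict.fromkeys(graph, 0)     # number of shortest paths
        sigma[source] = 1
        dist = {source: 0}
        queue = deque([source])
        while queue:                        # breadth-first search
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(graph, 0.0)
        for w in reversed(order):           # accumulate dependencies
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each undirected path was counted once from each endpoint.
    return {v: s / 2 for v, s in score.items()}


if __name__ == "__main__":
    pathway = {"P1", "P2"}                  # placeholder pathway proteins
    interactions = [("P1", "P2"), ("P1", "I1"), ("P1", "I2"),
                    ("P2", "I3"), ("I1", "I2")]  # last edge is dropped
    net = build_network(pathway, interactions)
    bt = betweenness(net)
    ranked = sorted(net, key=lambda p: (len(net[p]), bt[p]), reverse=True)
    for p in ranked:
        print(p, len(net[p]), bt[p])
```

Ranking by degree first and betweenness second is one arbitrary way to combine the two measures; a real prioritization would also weigh structural coverage and the suitability of each region for high throughput work.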
Community Outreach-Since the advent of genome sequencing, biologists have used protein sequence information extensively. However, the general biological community uses much less structural information. The HCPIN Website (Fig. 1) is built to make structural information about cancer-related proteins easily accessible to cancer biologists. Our future plan for HCPIN includes mapping single nucleotide polymorphism/mutation information, protein-protein interactions, and various structural bioinformatics predictions onto the 3D structures; adding gene ontology and structure-based functional annotation; and incorporating microarray and protein expression data. We envision HCPIN as an evolving, curated resource of structure-function information for the human cancer protein interactome.
FIG. 6. HCPIN target selection process. SEG regions, low complexity regions predicted by the program SEG (37). SignalP region, signal peptide predicted by SignalP (32). TM region, transmembrane region predicted by TMHMM (33). C/U-region, structure covered (C) or uncovered (U) region. T-region, targeted region. Disordered regions are predicted based on mean hydrophobicity and net charge (38). E, E-value; HTP, high throughput.
Many intermediate results, such as expression constructs and biochemical reagents, generated in these ongoing structural genomics efforts are freely available to the biology community. Our structure-function database can be leveraged by many other related initiatives. For example, the National Cancer Institute's Initiative for Chemical Genetics aims to systematically identify perturbational small molecules for each cancer-related protein coded in the human genome (84). Our structural genomics efforts on HCPIN will provide biochemical and structural information as well as key reagents, organized at a single Website, beneficial for such chemical genetics studies (85).