Human Proteinpedia as a Resource for Clinical Proteomics*

Clinical proteomics is an emerging field that deals with the use of proteomic technologies for medical applications. With a major objective of identifying proteins that are involved in pathological processes and that can serve as potential biomarkers, this field is already gaining momentum. Consequently, clinical proteomics data are being generated at a rapid pace, although mechanisms for sharing such data with the biomedical community lag far behind. Most of these data are either provided as supplementary information through journal web sites or made available directly by the authors through their own web resources. Integration of these data within a single resource that displays information in the context of individual proteins is likely to enhance the use of proteomic data in biomedical research. Human Proteinpedia is one such portal that unifies human proteomic data under a single banner. The goal of this resource is ultimately to capture and integrate all proteomic data obtained from individual studies on normal and diseased tissues. We anticipate that harnessing these data will help prioritize experiments related to protein targets and also permit meta-analysis to uncover molecular signatures of disease. Finally, we encourage all biomedical investigators to maximize dissemination of their valuable proteomic data to the rest of the community by active participation in existing repositories such as Human Proteinpedia.

Advancements in proteomics and its clinical applications have led researchers to exploit them to discover protein markers for cancer diagnosis, interrogate key components of signaling pathways, capture protein-protein interactions, dissect organellar proteomes, identify post-translational modifications, and catalog protein expression and subcellular localization profiles (1,2). Clinical proteomics deals with the application of proteomic technologies to help decipher the changes that occur in cells, tissues, and organs under diseased conditions. With the increasing use of high-throughput technologies such as mass spectrometry, data generation far outstrips the pace of data storage and dissemination. Data, once generated, can always be revisited and queried in new or different ways that could even lead to breakthroughs in identifying diagnostic markers or therapeutic targets. Although proteomic data can be submitted to public repositories, this is neither popular nor mandated, even for published data. Given the high experimental and labor costs, in addition to the precious nature of the data, it is imperative that there be concerted community efforts to capture such data and make them available in formats that would be most useful to biomedical researchers.
Cancer Biomarkers and Disease Proteomics-The potential of mass spectrometry to identify proteins in samples in a high-throughput manner (3) with reduced sample requirements has made it an ideal tool to be deployed in clinical proteomics (4). Thus, the use of proteomics for identification of cancer biomarkers for diagnostic, prognostic, or therapeutic applications is of substantial interest. In this regard, quantitative analysis of protein expression in normal and cancer tissues to identify proteins overexpressed in cancers has already been reported by a number of groups (5)(6)(7)(8)(9)(10). Because it has already been demonstrated that early diagnosis of breast, colorectal, and cervical cancers through screening approaches can lead to a reduction in mortality rates (11), there is sufficient justification for aggressive pursuit of novel biomarkers for early detection of all cancers.
In addition to the search for biomarkers, it is also of interest to identify proteomic changes that occur in diseases to gain insights into their pathogenesis. Such proteomic changes could include alterations in abundance of proteins or their post-translational modifications or subcellular localization, among others (12). In the future, it may even be possible to diagnose a particular disease condition from organ-specific proteomic signatures present in serum. For this, we must first systematically obtain proteomic data from individual organs. Such data can be archived, and meta-analysis can be carried out to decipher the signatures, as was recently reported for head and neck and colon cancers (13).
Is Proteomics Synonymous with Mass Spectrometry?-The routine use of mass spectrometers to identify a multitude of proteins in a high-throughput fashion has led to a situation where the terms "proteomics" and "mass spectrometry" are sometimes used interchangeably. A number of repositories have been developed that only accept data from mass spectrometry experiments. However, proteomics includes a broad array of techniques that are still in common use, including Western blotting, immunohistochemistry, yeast two-hybrid, peptide and protein microarrays, x-ray crystallography, NMR spectroscopy, fluorescence microscopy, and flow cytometry. Among these techniques, antibody-based methods are used especially in the oncology field for diagnosis and classification of cancers (14). The HUPO Antibody Initiative (15) was initiated to accelerate the production and use of validated antibodies against human proteins (16). With the availability of a large number of antibodies, assays such as immunohistochemistry and enzyme-linked immunosorbent assay can be used for biomarker validation. Therefore, it is important to remember the clinical platforms that are relevant to oncology research when proteomic platforms are being discussed.
Genomic Versus Proteomic Data-In the case of genomic data, the International Nucleotide Sequence Database Collaboration has already established a working principle according to which any sequence data submitted to any one of its three members, GenBank (17), European Molecular Biology Laboratory (EMBL) (18), or DNA Data Bank of Japan (DDBJ) (19), will automatically be reflected in the other data bases. Further, all sequences submitted to these data bases are freely available to the public without any restrictions. This method of data sharing has been in practice for over 20 years now. Moreover, if a manuscript contains novel sequences, submission of the nucleotide sequences to any one of the three major nucleotide sequence data bases prior to publication is mandatory. In fact, manuscripts are accepted subject to the condition that a unique accession number assigned by these data bases will be provided by the authors before publication.
Unlike genomic data, however, proteomic data is diverse with a multitude of experimental platforms and data types with the result that there are no general working principles for data submission that apply to all types of proteomic data. However, for specific data types such as mass spectrometry data, specific guidelines are beginning to emerge (20) although they are not universally adopted at the current time.
Data in Centralized Repositories Versus Supplementary Information-Given the current size of most proteomic data sets, authors are often unable to accommodate them in the body of the article. Most authors end up publishing the majority of such data as supplementary information, either at the web site of the journal or on their own web site (21). However, submitting data as supplementary information instead of contributing them to centralized repositories has a number of disadvantages, as listed below. 1) Most scientific articles are not freely available, which precludes many scientists from accessing them. Even if the supplementary information is provided freely by the journals, it would be of little use without the original article, which is accessible only to a fraction of the scientific community. 2) Data added as supplementary information might not be easily accessible, as most are in PDF or Word document formats and cannot be searched readily. 3) The supplementary data provided by the authors generally do not follow a specific format. This makes it difficult to combine independent data sets for data mining or meta-analysis purposes. 4) Retrieving information on a specific gene from supplementary information is not a trivial task because the nomenclature system is often decided by the authors. 5) Supplementary information is most often limited to the web space provided by the journals, and large raw mass spectrometry data (in the gigabyte range) are mostly left out.
In contrast, data contributed to centralized repositories can be downloaded freely, are more searchable, and are often constrained so that common standard formats are used. Moreover, it is possible for information from diverse research articles to be integrated and presented to the user in the context of a protein or a biological pattern, as is done in the case of Human Proteinpedia. With the recent advancements in the semantic web (22) and data base interoperability (23), it will become even more fruitful for the scientific community to contribute their data to centralized repositories for optimal utilization of data.
Standardization and Vocabulary Issues in Proteomic Data-Gene nomenclature is regulated by the Human Genome Organisation (HUGO), whereas naming of proteins is largely left to individual investigators. This is unfortunate because literature searches are based on text and not sequences, which makes it almost impossible to retrieve the published literature on any given protein in a comprehensive fashion. Some features of proteins are beginning to be standardized using controlled vocabularies such as eVOC (24) for describing tissue expression and Gene Ontology (25) for cellular component, molecular function, and biological process, while RESID (26) and Proteomics Standard Initiative-Molecular Interaction (27) vocabularies are available for post-translational modifications and protein-protein interactions, respectively. Proteomics Standard Initiative-Mass Spectrometry (PSI-MS) vocabularies are used to standardize mass spectrometry-based experimental annotations. Nevertheless, even though these controlled vocabularies are available, they are by no means in common use because major data bases themselves do not always adhere to them (28).
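The nomenclature problem described above can be illustrated with a minimal sketch in Python. It assumes a small, hand-built synonym table (the `SYNONYMS` dictionary and both function names are hypothetical and not part of any resource discussed here) that maps author-chosen protein names to a single standard symbol, so that annotations reported under different names can be grouped under one protein.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical synonym table (an assumption for illustration only):
# lower-cased free-text names mapped to one standard symbol.
SYNONYMS: Dict[str, str] = {
    "tumor protein d52-like 2": "TPD52L2",
    "tpd52l2": "TPD52L2",
    "fibrinogen like 2": "FGL2",
    "fibrinogen-like 2": "FGL2",
    "fgl2": "FGL2",
    "suppressor of mek1": "SMEK1",
    "smek1": "SMEK1",
}


def normalize_name(name: str) -> Optional[str]:
    """Return the standard symbol for a free-text protein name, or None."""
    key = " ".join(name.lower().split())  # case-fold and collapse whitespace
    return SYNONYMS.get(key)


def group_by_symbol(names: List[str]) -> Tuple[Dict[str, List[str]], List[str]]:
    """Group reported names under their standard symbol; keep unmapped names apart."""
    grouped: Dict[str, List[str]] = {}
    unmapped: List[str] = []
    for reported in names:
        symbol = normalize_name(reported)
        if symbol is None:
            unmapped.append(reported)  # needs manual curation
        else:
            grouped.setdefault(symbol, []).append(reported)
    return grouped, unmapped
```

Names that cannot be mapped are returned separately rather than guessed at, which mirrors the point made above: without an agreed vocabulary, such entries can only be resolved by manual curation.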
FIG. 1. Almost all information for this protein is derived from community annotations through Human Proteinpedia, including subcellular localization and expression in tissues, cell lines, and diseases. The annotated data show that this molecule is expressed in B cells, brain, liver, ovary, and platelets. It is also expressed in ovarian cancer and in several cell lines (293T, HeLa, and K-562). Clicking on any of these hyperlinked terms opens a pop-up window (e.g. cytoplasm or platelet, as shown), which provides additional experimental data and details about the contributing laboratory as well as any publications. For example, the window on the left shows peptide identification data, peptide scores, precursor mass, charge state, and sequence identifiers from this unpublished study. If available, the MS/MS spectra are hyperlinked to another window, as shown in the right lower part, that allows users to manually inspect the data.
A Need for Unified Information about Proteins-Some of the most popular public repositories store information about specific aspects of proteins. For instance, the Protein Data Bank (PDB) (29) is an archive of structural data of biological macromolecules. The PRoteomics IDEntifications (PRIDE) data base (30) and PeptideAtlas (31) are some of the leading mass spectrometry-based data repositories. HPRD (32), IntAct (33), MINT (34), BioGRID (35), and the Database of Interacting Proteins (36) are some of the data bases capturing protein-protein interaction data. LifeDB (37) catalogs subcellular localization, whereas Human Protein Atlas (38) archives immunohistochemistry data. These data bases were designed to either collect or accommodate data only from specific experiment types; very few archive data from multiple platforms. Thus, it is currently impossible for a researcher to view all of the data stored in these specialized data bases in one location.
Further, there is a lack of mechanisms to automatically exchange most proteomic data types between repositories. In developing a resource for housing proteomic data, including that from clinical proteomics, two major issues should be considered. The first is that the data should be shared regardless of the size of the dataset (i.e. it is not just high-throughput data that are worth sharing; data from individual experiments are often even more valuable and should not be ignored). Second, there should be a central portal where the available data are compiled and displayed in the context of a gene/protein. The latter feature would permit users to construct complex queries such as "what are the post-translational modifications on my protein of interest, its interacting proteins, its subcellular localization, and is it overexpressed in cancers?" Such queries cannot be made in any of the existing proteomic repositories, although some provide links to other data bases for certain data types.
Human Proteinpedia as a Portal for Basic and Clinical Information about Proteins-Human Proteinpedia (39) is a community portal for sharing human proteomic data that is developed with the active participation of more than 70 laboratories around the world. It allows researchers to share their human proteomic data in a manner that is somewhat similar to that of Wikipedia. However, experimental evidence is mandatory for inclusion of data in Human Proteinpedia, and the contributions are always linked to the investigator and the laboratory. Annotations pertaining to post-translational modifications, expression in cell lines or tissues, protein-protein interactions, enzyme-substrate relationships, and subcellular localization can be submitted. Human Proteinpedia includes data from diseases such as cancers, thereby allowing the biomedical community to take a systems view of the disease proteome. Moreover, it can accommodate data from multiple experimental platforms such as yeast two-hybrid screens, peptide/protein arrays, immunohistochemistry, Western blots, mass spectrometry, co-immunoprecipitation, and fluorescence microscopy.
Thus, Human Proteinpedia represents an early attempt to unify human proteomic data under a single resource. An important feature of Human Proteinpedia is that it displays the data in the context of proteins that are annotated in HPRD, a literature-curated data base for human proteins (32). The example of tumor protein D52-like 2, which is an uncharacterized protein, illustrates how Human Proteinpedia can not only handle the complex query described above but also provide meaningful answers that otherwise might be difficult to find or derive. Fig. 1 shows the expression of tumor protein D52-like 2 in normal tissues, diseases, and cell lines along with its subcellular localization. These are all based on data submitted by the community, and the name of the contributing laboratory is clearly displayed when a user clicks on a link (the figure shows the links from the terms "cytoplasm" and "platelet"). In this case, we would not know that this protein is expressed in ovarian cancer without the data contributed by the community. Similarly, Fig. 2a shows that tumor protein D52-like 2 interacts with I-Kappa-B Kinase-Epsilon, a kinase that phosphorylates IkappaB-α, based on a large-scale protein interaction mapping experiment. Finally, Fig. 2b shows that this protein is phosphorylated on serine and threonine residues, with links to the primary data that can be explored by the users.
Likewise, Fig. 3 shows the molecule page of suppressor of mek1 (SMEK1) in HPRD. The molecule is unclassified and its site of expression in normal human tissues is also unknown in the literature. However, annotations contributed by the scientific community through Human Proteinpedia reveal the site of expression of SMEK1 in normal and disease tissue as well as cell lines (Fig. 3). These annotations reveal that SMEK1 is moderately expressed in glandular cells of normal colon tissue while being strongly expressed in tumor cells of colorectal cancer tissue. Fig. 4 shows the expression of an extracellular matrix protein, fibrinogen like 2 (FGL2), in hepatocellular carcinoma (HCC). This protein, similar to fibrinogen β and γ, was not previously reported to be involved in HCC. However, it is shown to be expressed in HCC by immunohistochemistry as well as by Western blotting (Fig. 4). Given the fact that early diagnosis will improve prognosis, it is important to pursue such overexpressed molecules, which could turn out to be potential biomarkers.
Human Proteinpedia has several advantages over other proteomic resources with respect to clinical proteomic data. Human Proteinpedia incorporates data from multiple experimental platforms, whereas most of the centralized repositories accumulate data from one or two experimental platforms. Although each proteomic platform has its own advantages, integration of the clinical data produced by all of them under a single banner has been lacking. Human Proteinpedia displays such clinical information along with the literature-curated data in the context of a protein molecule. As it gains popularity, we expect that even more diverse clinical studies will be integrated and that it will be possible to extract biologically meaningful patterns of molecules expressed in particular disease conditions. Further, such data could drive planning of new clinical studies.
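The protein-centric integration discussed in this section can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the data model of Human Proteinpedia: the `Annotation` and `ProteinIndex` names, their fields, and the placeholder protein and laboratory identifiers are all hypothetical. The sketch shows annotations from different experimental platforms being stored per protein, with a non-empty experimental platform required for every entry, and then summarized by annotation type to answer a combined query.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Annotation:
    """One community-contributed annotation (hypothetical schema)."""
    protein: str   # protein identifier, e.g. a gene symbol
    kind: str      # "ptm", "interaction", "localization", or "expression"
    value: str     # the annotated fact, e.g. "cytoplasm"
    platform: str  # experimental platform that supports the annotation
    lab: str       # contributing laboratory


class ProteinIndex:
    """Stores annotations keyed by protein so they can be viewed together."""

    def __init__(self) -> None:
        self._by_protein: Dict[str, List[Annotation]] = defaultdict(list)

    def add(self, ann: Annotation) -> None:
        # Reject entries that name no experimental platform.
        if not ann.platform.strip():
            raise ValueError("an experimental platform is required")
        self._by_protein[ann.protein].append(ann)

    def summary(self, protein: str) -> Dict[str, List[str]]:
        """Group the annotated values for one protein by annotation kind."""
        grouped: Dict[str, List[str]] = defaultdict(list)
        for ann in self._by_protein.get(protein, []):
            grouped[ann.kind].append(ann.value)
        return dict(grouped)


# Placeholder data: "P1", "Lab A", and "Lab B" are invented for illustration.
index = ProteinIndex()
index.add(Annotation("P1", "localization", "cytoplasm", "fluorescence microscopy", "Lab A"))
index.add(Annotation("P1", "expression", "platelet", "mass spectrometry", "Lab B"))
print(index.summary("P1"))
```

Keeping the platform and laboratory on each record, rather than only the annotated value, is what would allow a user to trace any displayed fact back to its supporting experiment and contributor.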
Conclusions and Outlook-To take systematic advantage of the explosion in proteomic data, the data must be captured efficiently for the explicit purpose of sharing with the community. In this regard, researchers should deposit their data in any of the public repositories. In addition, peer-reviewed journals should actively encourage authors to submit their data to such proteomic repositories, as proposed recently by Nature Biotechnology (40) and Nature Methods (41). Human Proteinpedia allows referees of submitted manuscripts to access the data anonymously if the authors have submitted the data prior to publication for this purpose.
To capture the proteomic data that have already been generated, our team at the Institute of Bioinformatics is scanning through the issues published to date in all of the major proteomic journals, including Molecular & Cellular Proteomics, Proteomics, and Journal of Proteome Research, for possible inclusion in Human Proteinpedia. The corresponding authors of the relevant articles are being contacted and requested to contribute the data. Those who volunteer work with the team so that data submission is as simple and painless as possible for the contributor. In addition, the team will regularly obtain data that are not present in Human Proteinpedia from other public proteomic repositories and integrate them with the existing information.
The Cancer Genome Anatomy Project (42) aims to catalog the gene expression profiles of normal, precancer, and cancer tissue samples. The goal of this initiative is to improve detection, diagnosis, and treatment of patients through worldwide collaboration. While this project is mainly targeted toward genomic and transcriptomic analysis, future plans that include analyses of cancer proteomes are almost certain. Genomic analysis alone cannot predict the various proteomic alterations in cancers, and a better understanding of these alterations will impact detection, diagnosis, and treatment. With additional initiatives being announced to dissect various aspects of the human proteome, including a recent one by HUPO, a portal that allows effective sharing of data among scientists is almost a prerequisite. We anticipate that Human Proteinpedia will be one such portal.
The day when biologists will have a single integrated portal to view data from genomics, transcriptomics, and proteomics might not be too far off. An initial step to unify human proteomic data has been taken with the development of Human Proteinpedia. However, this would not have been possible without the enthusiastic participation of the proteomics community. We hope that investigators will continue to share their data to maintain the momentum and anticipate that more and more laboratories will join. Future goals include the addition of protein structure information, and efforts are already under way to allow users to view proteomic information submitted to Human Proteinpedia at the genomic level by mapping the peptides onto the genome. We anticipate that the availability of such data will spur the development of additional "omics" tools and newer bioinformatics approaches for harvesting the information provided by the datasets.

* This work was supported, in whole or in part, by National Institutes of Health Grant U54 RR020839 (Roadmap Initiative for Technology Centers for Networks and Pathways). The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.