Review, Informatics
Computational proteomics: designing a comprehensive analytical strategy
Introduction
Proteomics is the global study of proteins in cell lines, tissues or organisms. Unlike other high-throughput assays, it is concerned not merely with identification and quantitation, but also with cellular and tissue localization, post-translational modifications and inter-molecular interactions. Hence, more biological data are assayable (Fig. 1). The functional applications of proteomics can be broadly divided into four categories: (i) pure diagnostics, (ii) biomarker discovery, (iii) identification of root causes and (iv) identification of biologically specific network rewirings [1]. In practice, most current applications lie in biomarker discovery and diagnostics. The functional applications [(iii) and (iv)], on the other hand, are lacking, owing to poor planning, experimental difficulties and a lack of appropriate analytical techniques. This landscape is improving rapidly as proteomic technologies and their accompanying informatics mature [2].
The experimental difficulty of proteomics lies in both sample complexity and technical limitations. On sample complexity, there are far more protein moieties (about 500k) than genes (20–30k), and their abundances span a large dynamic range (more than eight orders of magnitude). On technical limitations, coverage [3], consistency [4] and quantitation [5] are relatively unstable.
Most current proteomics is predicated on mass spectrometry (MS). The most common setup is the untargeted approach: shotgun liquid chromatography coupled with tandem mass spectrometry (LC–MS/MS). Untargeted proteomics is a discovery approach, suitable for deriving novel biological insights or discovering novel biomarkers, whereas targeted proteomics (Selected Reaction Monitoring, SRM, or Multiple Reaction Monitoring, MRM) is diagnostic and confirmatory (Fig. 2). Although this review focuses on the former, similar approaches can be adopted for the latter.
In untargeted proteomics, the early steps involve (i) sample acquisition; (ii) protein separation through LC or other separation methods; (iii) tryptic digestion followed by chromatographic and/or electrophoretic separation and MS analysis; and (iv) searching a protein-sequence collection to identify proteins based on the MS and tandem MS (MS/MS) information [6]. Although each of these steps requires extensive experimental optimizations, we focus more on informatics considerations.
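To make the digestion and database-search steps concrete, the following is a minimal sketch of in-silico tryptic digestion, the kind of step a search engine might perform on a protein-sequence collection before matching spectra. It assumes the commonly used simplification that trypsin cleaves after K or R unless the next residue is P; the function name `tryptic_digest`, the `missed_cleavages` parameter and the example sequence are illustrative only and are not taken from the article.

```python
def tryptic_digest(sequence, missed_cleavages=0):
    """Return peptides from a simplified in-silico tryptic digest.

    Assumption: cleave C-terminal to K or R, except when followed by P.
    """
    fragments, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            fragments.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # Trailing fragment that does not end in K or R.
        fragments.append(sequence[start:])

    peptides = []
    for i in range(len(fragments)):
        # Join up to `missed_cleavages` neighbouring fragments.
        last = min(i + missed_cleavages, len(fragments) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(fragments[i:j + 1]))
    return peptides


# Toy sequence, not a real protein.
print(tryptic_digest("AAKGGRCC"))
print(tryptic_digest("AAKGGRCC", missed_cleavages=1))
```

Allowing missed cleavages enlarges the candidate peptide list, which is one reason search parameters influence identification results.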
Computational proteomics is the umbrella term for a gamut of bioinformatics techniques, from peptide identification to corroboration with biological function. It has yet to be effectively integrated with proteomics for maximal analytical outcome. Bioinformatics is not a salve for poor experimental planning: the right techniques have to be used and/or developed to answer the right experimental question. Given the right set of analytical tools, proper integration is capable of promoting proteomics to a large-scale discovery platform.
This article is a roadmap of the various experimental aspects and their accompanying bioinformatics requirements. It highlights the key issues for consideration and offers some insights on how various resources (bioinformatics, systems biology, genomics and proteomics) can be combined. It also offers some inroads into new proteomic applications, namely computational epigenetics. Finally, we evaluate how integrative proteomic pipelines, combined with computational predictions, can be applied for discoveries, and their applicability towards drug discovery.
Planning ahead
LC–MS/MS begins with tryptic digestion of proteins into peptides, which are subsequently separated through LC [7]. The separated peptides are then ionized and further resolved through MS based on their mass-to-charge ratios (m/z), and are detected over a period of time, giving rise to a preliminary set of MS peaks. The peptides corresponding to these MS peaks can be further fragmented, giving rise to a secondary MS/MS spectrum. This allows sequence identification and
Identification
For protein identification, four factors (outside of the experiment itself) are critical: (i) algorithm, (ii) parameters, (iii) database and (iv) statistics. Identification is a challenging problem, and only the main issues are covered here. For more details, see Nesvizhskii [15], Granholm and Käll [16], Hoopmann [17] and Eng [18].
Quantitation
Quantitation can be achieved by various means. Broadly, these can be divided into labelled relative and unlabelled absolute approaches.
Examples of labelled relative include familiar workflows such as SILAC and iTRAQ. Here, a tag such as stable isotopes (SILAC) or a chemical marker (iTRAQ) is incorporated into the peptides derived from different samples. These samples can be combined and analysed. Corresponding peptides should result in similar peak patterns but because of the tags, would be shifted by a
Class recovery
Given sufficient proteomic coverage, a high level of consistency and stability and/or accuracy in quantitation measurements, samples can be analysed computationally using established methods commonly used in microarray analyses, for example, hierarchical clustering (HCL) and K-means, to recover the underlying sample classes. This can also serve as a standard for determining the quality of the experiment before proceeding further (if the underlying classes cannot be recovered, it is
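As an illustration of class recovery, the sketch below clusters samples by their protein abundance profiles and checks whether the clusters match the known sample classes. It is a small pure-Python agglomerative clustering with average linkage and Euclidean distance, chosen here for illustration only; the function names, the toy abundance values and the class labels are all hypothetical, and an established clustering library would normally be used on real data.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two abundance profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(cluster_a, cluster_b, profiles):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(profiles[i], profiles[j])
                for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def hierarchical_clusters(profiles, n_clusters):
    """Merge the closest pair of clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], profiles)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


def classes_recovered(clusters, labels):
    """True if every cluster contains samples from one known class only."""
    return all(len({labels[i] for i in cluster}) == 1 for cluster in clusters)


# Toy data: rows are samples, columns are protein abundances (made up).
profiles = [
    [10.1, 2.0, 5.2], [9.8, 2.2, 5.0], [10.3, 1.9, 5.1],
    [3.0, 8.1, 5.1], [2.8, 7.9, 4.9], [3.2, 8.3, 5.3],
]
labels = ["control"] * 3 + ["disease"] * 3

clusters = hierarchical_clusters(profiles, n_clusters=2)
print(clusters, classes_recovered(clusters, labels))
```

If `classes_recovered` returns False on real data, that is a prompt to revisit coverage, consistency or quantitation before attempting functional analysis.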
Functional analysis
Converting identifications into biological insight is the final and most crucial step. Even if the experiment were perfect (all proteins detected and quantified), the reference data (pathways, networks and term annotations) remain imperfect. There are three major considerations: (i) Pathway databases are highly incomplete and inconsistent. (ii) Protein interaction networks are ridden with indeterminable false positive and false negative rates. (iii) Gene Ontology (GO) is too general for specific
Pathway databases
Pathway analysis is a popular choice for making sense of data. The key advantage is that it reduces a long protein list into biologically coherent units, thereby greatly simplifying analysis. The major groups of pathway analysis methods have been reviewed recently elsewhere [3], [37].
Instead, we highlight the limitations and major problems. Firstly, pathway databases are imperfect: Soh et al. [38] demonstrated that major pathway repositories have limited overlap, even for well-established pathways.
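One simple way to quantify such overlap is the Jaccard index between the member sets that different repositories assign to the same pathway. The sketch below is only an illustration of that idea and is not necessarily the measure used by Soh et al.; the source names (`db_A`, `db_B`, `db_C`) and gene identifiers are placeholders, not real database content.

```python
from itertools import combinations


def jaccard(set_a, set_b):
    """Jaccard index: shared members divided by all members."""
    a, b = set(set_a), set(set_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_overlap(pathway_by_source):
    """Jaccard index for every pair of sources describing one pathway."""
    return {
        (s1, s2): jaccard(pathway_by_source[s1], pathway_by_source[s2])
        for s1, s2 in combinations(sorted(pathway_by_source), 2)
    }


# Hypothetical member lists for the "same" pathway in three repositories.
same_pathway = {
    "db_A": {"g1", "g2", "g3", "g4"},
    "db_B": {"g3", "g4", "g5"},
    "db_C": {"g1", "g6"},
}

for pair, score in pairwise_overlap(same_pathway).items():
    print(pair, round(score, 2))
```

Low pairwise scores for a nominally identical pathway would indicate that the choice of repository can change which pathways appear significant.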
Biological networks
Biological networks are large systems from which emergent and systemic insights can be drawn. Network analysis has been applied successfully in genomics but is harder to apply in proteomics because of the latter's lower throughput. Properly applied, networks can help to overcome the technical limitations of proteomics, thereby improving biological insight.
Localized network analysis tries to identify critical components of the network or use existing biological modules, for example, known complexes or clusters. Two
Gene Ontology (GO)
GO is widely used for identifying common functional themes between proteins. Portals like GO Term Finder [44] and GOEAST [45] are easily available and use a standard hypergeometric scoring method for identifying significantly implicated terms. GO has several issues. Firstly, it is designed for general purposes. Therefore, it cannot be used for understanding niche or novel biological functions for which there is little and/or no characterization. This can be dealt with by development of
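The hypergeometric scoring mentioned above can be sketched as a one-sided tail probability: given a background of annotated proteins, how likely is it to see at least the observed number of proteins carrying a term in the selected list by chance? This is a minimal sketch using only the standard library; the function and parameter names are illustrative, the numbers are made up, and real portals typically add multiple-testing correction, which is omitted here.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """P(X >= hits) under the hypergeometric distribution.

    population: number of proteins in the background
    annotated:  background proteins carrying the GO term
    sample:     number of proteins in the selected list
    hits:       selected proteins carrying the GO term
    """
    upper = min(annotated, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / comb(population, sample)


# Toy numbers: 10 background proteins, 4 carry the term,
# 3 proteins selected, 2 of which carry the term.
p_value = hypergeom_enrichment_p(population=10, annotated=4, sample=3, hits=2)
print(round(p_value, 4))
```

Because the score depends only on counts, a term that is too general will be annotated to much of the background and will rarely reach significance for a specific protein list.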
Bridging the genomic–proteomic divide
Biological information is stored in DNA and transcribed into mRNA before translation into protein. Between each step lies a multitude of confounding complexities. The meta-control of gene expression through histone modifications and miRNA falls under the purview of epigenomics; it is currently poorly understood and difficult to assay computationally or biologically. This is further discussed in the next section. Here, we are interested in bridging the divide between the two major assaying
Computational epigenetics and its association with proteomics
Epigenetics is the regulation of gene expression at the meta-level. This goes beyond the central dogma, and includes events such as chemical modification of DNA (methylation and acetylation), RNA silencing or alternative splicing and transcription factor regulation.
These mostly occur on the genome level. However, epigenetic regulation at, or close to the protein level is also known. An important instance is the effect of miRNAs on proteins and the resultant protein network changes.
miRNAs act by
Integrative computational proteomics and its implications for drug and/or biomarker discovery
Using untargeted and targeted proteomics in series – a combination of iTRAQ and SRM/MRM proteomics – is a powerful tool for identification and verification of candidate protein biomarkers [52]. For example, through iTRAQ, Narumi et al. identified 19 phosphopeptides for further verification using SRM of which 15 were successfully quantified [53]. However, the selection procedure could be rendered much more powerful by incorporation of novel bioinformatics. For instance, the underlying class
Concluding remarks
The proper integration of proteomics with bioinformatics is non-trivial but promises high levels of synergy from which deeper insights and hypotheses could be drawn. We have covered a broad range of key issues that must be noted to bolster analysis. Finally, synergistic combinations of discovery and targeted proteomics, in conjunction with appropriate computational methods, can have a potentially strong impact on both the scientific and pharmaceutical landscapes.
Acknowledgements
W.W.B. Goh is supported in part by a Wellcome Trust Scholarship (83701/Z/07/Z). L. Wong is supported in part by a Singapore Ministry of Education Tier-2 grant MOE2012-T2-1-061.
References (53)
The coming age of complete, accurate, and ubiquitous proteomes. Mol. Cell (2013)
Intact-protein-based high-resolution three-dimensional quantitative analysis system for proteome profiling of biological fluids. Mol. Cell. Proteomics (2005)
A survey of computational methods and error rate estimation procedures for peptide and protein identification in shotgun proteomics. J. Proteomics (2010)
Current algorithmic solutions for peptide-based proteomics data generation and identification. Curr. Opin. Biotechnol. (2013)
The Paragon Algorithm, a next generation search engine that uses sequence temperature values and feature probabilities to identify peptides from tandem mass spectra. Mol. Cell. Proteomics (2007)
Integration of prostate cancer clinical data using an ontology. J. Biomed. Inform. (2009)
Networks in proteomics analysis of cancer. Curr. Opin. Biotechnol. (2013)
How advancement in biological network analysis methods empowers proteomics. Proteomics (2012)
Proteomics signature profiling (PSP): a novel contextualization approach for cancer proteomics. J. Proteome Res. (2012)
Taming the isobaric tagging elephant in the room in quantitative proteomics. Nat. Methods (2011)