Drug Discovery Today
Volume 19, Issue 3, March 2014, Pages 266-274

Review
Informatics
Computational proteomics: designing a comprehensive analytical strategy

https://doi.org/10.1016/j.drudis.2013.07.008

Highlights

  • Proper combination of proteomics with bioinformatics is highly synergistic.

  • Appropriate experimental design is needed for proper bioinformatics analysis.

  • Enhanced resolution renders proteomics suitable for functional studies.

  • New discovery paradigms can lead to novel drug targets/biomarkers.

The proper combination of proteomics with bioinformatics is highly synergistic, capable of propelling proteomics into a truly high-throughput platform. However, appropriate experimental design and analytical considerations are needed to maximize the analytical outcome. This review highlights key issues and caveats in converting raw data into protein identifications and, subsequently, biological insight. It also offers some insights on how highly robust proteomics pipelines can be used to study novel areas such as computational epigenetics, high-performance functional studies and new discovery paradigms for drug targets and biomarkers.

Introduction

Proteomics is the global study of the proteins expressed in a biological system (cell line, tissue or organism). Unlike other high-throughput assays, it is not merely concerned with identification and quantitation, but also with cellular and tissue localization, post-translational modifications and intermolecular interactions. Hence, more biological data are assayable (Fig. 1). The functional applications of proteomics can be broadly divided into four categories: (i) pure diagnostics, (ii) biomarker discovery, (iii) identification of root causes and (iv) identification of biologically specific network rewirings [1]. In reality, most current applications lie in the areas of biomarker discovery and diagnostics. The functional aspects [(iii) and (iv)], by contrast, are lacking, owing to poor planning, experimental difficulties and a lack of appropriate analytical techniques. This landscape is improving rapidly as proteomic technologies and their accompanying informatics mature [2].

The experimental difficulty of proteomics lies in both sample complexity and technical limitations. There are far more protein moieties (~500,000) than genes (20,000–30,000), and protein abundances span a large dynamic range (more than eight orders of magnitude). On the technological side, coverage [3], consistency [4] and quantitation [5] remain relatively unstable.

Most current proteomics is predicated on mass spectrometry (MS). The most common setup is the untargeted approach: shotgun liquid chromatography coupled with tandem mass spectrometry (LC–MS/MS). Untargeted proteomics is a discovery approach, suitable for deriving novel biological insights or discovering novel biomarkers, whereas targeted proteomics (selected reaction monitoring, SRM, or multiple reaction monitoring, MRM) is diagnostic and confirmatory (Fig. 2). Although this review focuses on the former, similar approaches can be adopted for the latter.

In untargeted proteomics, the early steps involve: (i) sample acquisition; (ii) protein separation through LC or other separation methods; (iii) tryptic digestion followed by chromatographic and/or electrophoretic separation and MS analysis; and (iv) searching a protein-sequence collection to identify proteins based on the MS and tandem MS (MS/MS) information [6]. Although each of these steps requires extensive experimental optimization, we focus here on the informatics considerations.

Computational proteomics is the umbrella term for a gamut of bioinformatics techniques, from peptide identification to corroboration with biological function. It has yet to be effectively integrated with proteomics for maximal analytical outcome. Unfortunately, bioinformatics is not a salve for poor experimental planning: the right techniques have to be used and/or developed to answer the right experimental question. Given the right set of analytical tools, proper integration is capable of promoting proteomics to a large-scale discovery platform.

This article is a roadmap that considers various experimental aspects and the accompanying bioinformatics requirements. It highlights the key issues for consideration and offers some insights on how various resources (bioinformatics, systems biology, genomics and proteomics) can be combined. It also offers some inroads into new proteomic applications, namely computational epigenetics. Finally, we evaluate how integrative proteomic pipelines, combined with computational predictions, can be applied to making discoveries, and their applicability to drug discovery.

Planning ahead

LC–MS/MS begins with tryptic digestion of proteins into peptides, which are subsequently separated through LC [7]. The separated peptides are then ionized and further resolved by MS based on their different mass-to-charge ratios (m/z), and are detected over a period of time, giving rise to a preliminary set of MS peaks. The peptides corresponding to these MS peaks can be further fragmented, giving rise to secondary MS/MS spectra. This allows sequence identification and, subsequently, protein identification.
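
To make the m/z arithmetic concrete, the following minimal sketch computes the monoisotopic m/z of a tryptic peptide at several charge states from standard residue masses; the function name and the example sequence are ours, for illustration only.

# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    'G': 57.02146, 'A': 71.03711, 'S': 87.03203, 'P': 97.05276,
    'V': 99.06841, 'T': 101.04768, 'C': 103.00919, 'L': 113.08406,
    'I': 113.08406, 'N': 114.04293, 'D': 115.02694, 'Q': 128.05858,
    'K': 128.09496, 'E': 129.04259, 'M': 131.04049, 'H': 137.05891,
    'F': 147.06841, 'R': 156.10111, 'Y': 163.06333, 'W': 186.07931,
}
WATER = 18.01056   # mass of H2O added for the intact peptide
PROTON = 1.00728   # mass of a proton, added once per charge

def peptide_mz(sequence, charge):
    """Monoisotopic m/z of a peptide at a given charge state."""
    mass = sum(RESIDUE_MASS[aa] for aa in sequence) + WATER
    return (mass + charge * PROTON) / charge

# An illustrative tryptic peptide (ends in K/R) observed at charges 1-3.
for z in (1, 2, 3):
    print(z, round(peptide_mz('SAMPLEK', z), 4))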

Identification

For protein identification, four factors (outside of the experiment itself) are critical: (i) the algorithm, (ii) the parameters, (iii) the database and (iv) the statistics. It should be noted that identification is a challenging problem and only the main issues are covered here. For more detail, Nesvizhskii [15], Granholm and Kall [16], Hoopmann [17] and Eng [18] are good choices.
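
One of the statistical issues alluded to here is controlling the false discovery rate (FDR) among peptide-spectrum matches (PSMs). Below is a minimal, illustrative sketch of the common target-decoy strategy, in which hits to a reversed or shuffled decoy database estimate the error rate; the toy scores and the simple decoys/targets estimator are assumptions for illustration, not any specific tool's implementation.

# Minimal target-decoy FDR sketch: PSMs are (score, is_decoy) pairs,
# where decoys come from searching a reversed/shuffled database.
def fdr_at_threshold(psms, threshold):
    """Estimate FDR among PSMs scoring at or above threshold as #decoy/#target."""
    targets = sum(1 for s, d in psms if s >= threshold and not d)
    decoys = sum(1 for s, d in psms if s >= threshold and d)
    return decoys / targets if targets else 0.0

def score_cutoff(psms, max_fdr=0.01):
    """Lowest score threshold whose estimated FDR stays within max_fdr."""
    for score, _ in sorted(psms):  # scan candidate thresholds from low to high
        if fdr_at_threshold(psms, score) <= max_fdr:
            return score
    return None

# Toy example: high scores are mostly targets, low scores are mixed.
psms = [(9.1, False), (8.7, False), (8.2, False), (7.9, True),
        (7.5, False), (6.8, True), (6.1, True), (5.9, False)]
print(score_cutoff(psms, max_fdr=0.25))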

Quantitation

Quantitation can be achieved by various means. Broadly, these can be divided into labelled relative quantitation and unlabelled (label-free) absolute quantitation.

Examples of labelled relative quantitation include familiar workflows such as SILAC and iTRAQ. Here, a tag, such as a stable isotope (SILAC) or a chemical marker (iTRAQ), is incorporated into the peptides derived from the different samples. The samples can then be combined and analysed together. Corresponding peptides should result in similar peak patterns but, because of the tags, would be shifted by a known mass difference, from which relative abundances can be computed.
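
As an illustration of how such tagged measurements become protein-level relative quantitation, the sketch below rolls hypothetical peptide-level intensity pairs up to a per-protein median log-ratio; the data and the median-based summarization are illustrative choices, not a prescribed pipeline.

import math
from collections import defaultdict

# Hypothetical peptide-level measurements: (protein, intensity_A, intensity_B),
# e.g. reporter-ion intensities for two differently labelled samples.
peptides = [
    ('P1', 1200.0, 610.0),
    ('P1', 980.0, 500.0),
    ('P1', 1500.0, 700.0),
    ('P2', 300.0, 310.0),
    ('P2', 280.0, 295.0),
]

def protein_log_ratios(peptides):
    """Protein-level log2(A/B), taken as the median of peptide log-ratios
    for robustness against outlier peptides."""
    by_protein = defaultdict(list)
    for protein, a, b in peptides:
        by_protein[protein].append(math.log2(a / b))
    result = {}
    for protein, ratios in by_protein.items():
        ratios.sort()
        n, mid = len(ratios), len(ratios) // 2
        result[protein] = ratios[mid] if n % 2 else (ratios[mid - 1] + ratios[mid]) / 2
    return result

for protein, lr in protein_log_ratios(peptides).items():
    print(protein, round(lr, 3))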

Class recovery

Given sufficient proteomic coverage, a high level of consistency and stability, and/or accuracy in quantitation measurements, samples can be analysed computationally using established methods commonly used in microarray analysis, for example hierarchical clustering (HCL) and K-means, to recover the underlying sample classes. This can also serve as a standard for determining the quality of the experiment before proceeding further (if the underlying classes cannot be recovered, it is unlikely that more sophisticated downstream analyses will fare any better).
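
A minimal sketch of such a class-recovery check, assuming NumPy and SciPy are available: hierarchical clustering is run on a small hypothetical sample-by-protein abundance matrix and the tree is cut into two clusters, which should match the known sample classes.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical protein-abundance matrix: rows = samples, columns = proteins.
# Two disease samples (high in the first proteins) and two controls.
X = np.array([
    [9.1, 8.7, 1.2, 1.0],   # disease
    [8.9, 9.0, 1.1, 0.9],   # disease
    [1.3, 1.1, 8.8, 9.2],   # control
    [1.0, 1.2, 9.1, 8.9],   # control
])

# Average-linkage hierarchical clustering (HCL) on Euclidean distances,
# then cut the tree into two clusters.
Z = linkage(X, method='average', metric='euclidean')
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)  # samples 1-2 and 3-4 should fall into separate clusters

# If the recovered labels do not match the known classes, that is a warning
# sign about coverage/consistency before any deeper analysis.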

Functional analysis

Converting identifications into biological insight is the final and most crucial step. Even if the experiment were perfect (all proteins detected and quantified), the data references remain imperfect (pathways, networks and term annotations). There are three major considerations: (i) pathway databases are highly incomplete and inconsistent; (ii) protein interaction networks are riddled with indeterminable false-positive and false-negative rates; and (iii) Gene Ontology (GO) is too general for specific or novel biological functions.

Pathway databases

Pathway analysis is a popular choice for making sense of data. The key advantage is that it reduces a long protein list into biologically coherent units, thereby greatly simplifying analysis. The major groups of pathway analysis methods have been reviewed recently elsewhere [3,37].

Instead, we highlight the limitations and major problems here. Firstly, pathway databases are imperfect: Soh et al. [38] demonstrated that the major pathway repositories have limited overlaps, even on well-established pathways.
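
To see what limited overlap means in practice, one can compare the gene sets given for the 'same' pathway in two repositories with a simple Jaccard index, as in this sketch (the two apoptosis gene sets are invented for illustration):

def jaccard(a, b):
    """Jaccard index between two gene sets: |intersection| / |union|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical renderings of the 'same' pathway in two repositories.
apoptosis_db1 = {'TP53', 'BAX', 'CASP3', 'CASP9', 'BCL2', 'APAF1'}
apoptosis_db2 = {'TP53', 'BAX', 'CASP3', 'CASP8', 'FAS', 'FADD', 'BID'}

print(round(jaccard(apoptosis_db1, apoptosis_db2), 2))  # well below 1.0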

Biological networks

Biological networks are large systems from which emergent and systemic insights can be drawn. Network analysis has been applied successfully in genomics but is harder in proteomics, owing to the latter's lower throughput. Properly applied, networks can be very useful for working around the technical limitations of proteomics, thereby improving biological insight.

Localized network analysis tries to identify critical components of the network, or makes use of existing biological modules, for example known complexes or clusters (see the sketch below).
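
A minimal sketch of the module-overlay idea: detected proteins are intersected with known complexes, and each complex is ranked by the fraction of its members observed. The complex definitions and the detected-protein list are illustrative.

# Overlay detected proteins on known modules (e.g. protein complexes)
# and rank modules by coverage.
complexes = {
    'proteasome_core': {'PSMA1', 'PSMA2', 'PSMB1', 'PSMB2'},
    'exosome': {'EXOSC1', 'EXOSC2', 'EXOSC3'},
}
detected = {'PSMA1', 'PSMB1', 'PSMB2', 'EXOSC3', 'GAPDH'}

coverage = {
    name: len(members & detected) / len(members)
    for name, members in complexes.items()
}
for name, frac in sorted(coverage.items(), key=lambda kv: -kv[1]):
    print(name, round(frac, 2))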

Gene Ontology (GO)

GO is widely used for identifying common functional themes between proteins. Portals such as GO Term Finder [44] and GOEAST [45] are readily available and use a standard hypergeometric scoring method to identify significantly implicated terms. GO nonetheless has several issues. Firstly, it is designed for general purposes and therefore cannot be used for understanding niche or novel biological functions for which there is little or no characterization. This can be dealt with by the development of more specialized, context-specific annotations.
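
The hypergeometric score used by such portals asks how surprising it is to see k term-annotated proteins in a list of n, given K annotated proteins among N in the background. A minimal sketch, assuming SciPy and using invented counts:

from scipy.stats import hypergeom

# Hypergeometric over-representation test, as used by standard GO portals.
# N: annotated background proteins; K: background proteins with the GO term;
# n: proteins in our list; k: proteins in our list carrying the term.
N, K, n, k = 10000, 200, 150, 12

# P(X >= k): probability of seeing at least k term-annotated proteins by
# chance when drawing n proteins from the background.
p_value = hypergeom.sf(k - 1, N, K, n)
print(f'{p_value:.2e}')  # a small p suggests the term is over-represented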

Bridging the genomic–proteomic divide

Biological information is stored in DNA and transcribed into mRNA before translation into protein. Between each step lies a multitude of confounding complexities. The formal meta-control of gene expression through histone modifications and miRNAs falls under the purview of epigenomics; it is currently poorly understood and difficult to assay, computationally or biologically. This is discussed further in the next section. Here, we are interested in bridging the divide between the two major assaying platforms: genomics and proteomics.
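
A first computational step across this divide is simply to ask how well matched mRNA and protein measurements agree. A minimal sketch, assuming SciPy and using invented values, computes a rank-based (Spearman) correlation, which sidesteps the very different scales and noise models of the two platforms:

from scipy.stats import spearmanr

# Hypothetical per-gene measurements from matched samples:
# mRNA expression (e.g. RNA-seq) and protein abundance (e.g. iTRAQ).
mrna    = [5.2, 7.8, 3.1, 9.4, 6.0, 2.2, 8.1]
protein = [4.8, 7.1, 3.5, 8.9, 5.2, 2.9, 7.6]

# Rank-based correlation as a first look at mRNA-protein concordance.
rho, p = spearmanr(mrna, protein)
print(round(rho, 2), f'{p:.3f}')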

Computational epigenetics and its association with proteomics

Epigenetics is the regulation of gene expression at the meta-level. This goes beyond the central dogma and includes events such as chemical modification of DNA and histones (methylation and acetylation), RNA silencing, alternative splicing and transcription-factor regulation.

These events mostly occur at the genome level. However, epigenetic regulation at, or close to, the protein level is also known. An important instance is the effect of miRNAs on proteins and the resultant protein network changes.

miRNAs act by binding complementary sites on target mRNAs, repressing their translation or promoting their degradation, thereby lowering the abundance of the encoded proteins.

Integrative computational proteomics and its implications for drug and/or biomarker discovery

Using untargeted and targeted proteomics in series – a combination of iTRAQ and SRM/MRM proteomics – is a powerful tool for the identification and verification of candidate protein biomarkers [52]. For example, through iTRAQ, Narumi et al. identified 19 phosphopeptides for further verification using SRM, of which 15 were successfully quantified [53]. However, the selection procedure could be rendered much more powerful by the incorporation of novel bioinformatics; for instance, the underlying class structure recovered from the discovery data (see 'Class recovery' above) could inform which candidates to take forward.
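
A minimal sketch of this discovery-to-targeted hand-off: discovery-stage results are filtered by effect size and significance to produce a shortlist for SRM/MRM assay design. The records and thresholds are invented for illustration.

# Filter discovery-stage (e.g. iTRAQ) results to a shortlist for SRM/MRM.
candidates = [
    # (protein, log2 fold-change, adjusted p-value)
    ('P1', 2.1, 0.001),
    ('P2', 0.3, 0.200),
    ('P3', -1.8, 0.004),
    ('P4', 1.2, 0.060),
]

shortlist = [
    protein for protein, lfc, q in candidates
    if abs(lfc) >= 1.0 and q <= 0.05   # strong, significant changes only
]
print(shortlist)  # e.g. ['P1', 'P3'] go forward to SRM assay design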

Concluding remarks

The proper integration of proteomics with bioinformatics is non-trivial but promises high levels of synergy from which deeper insights and hypotheses can be drawn. We have covered a broad range of key issues that must be noted to bolster analysis. Finally, synergistic combinations of discovery and targeted proteomics, in conjunction with appropriate computational methods, can have a potentially strong impact on both the scientific and pharmaceutical landscapes.

Acknowledgements

W.W.B. Goh is supported in part by a Wellcome Trust Scholarship (83701/Z/07/Z). L. Wong is supported in part by a Singapore Ministry of Education Tier-2 grant MOE2012-T2-1-061.

References (53)

  • J. Eriksson et al., Improving the success rate of proteome analysis by modeling protein-abundance distributions and experimental designs, Nat. Biotechnol. (2007)
  • L. Kall et al., Computational mass spectrometry-based proteomics, PLoS Comput. Biol. (2011)
  • L.M. de Godoy, Comprehensive mass-spectrometry-based proteome quantification of haploid versus diploid yeast, Nature (2008)
  • H. Liu, A model for random sampling and estimation of relative protein abundance in shotgun proteomics, Anal. Chem. (2004)
  • S.P. Schrimpf, Comparative functional analysis of the Caenorhabditis elegans and Drosophila melanogaster proteomes, PLoS Biol. (2009)
  • F. Desiere, Integration with the human genome of peptide sequences obtained by high-throughput mass spectrometry, Genome Biol. (2005)
  • P. Picotti et al., Selected reaction monitoring-based proteomics: workflows, potential, pitfalls and future directions, Nat. Methods (2012)
  • L.C. Gillet, Targeted data extraction of the MS/MS spectra generated by data-independent acquisition: a new concept for consistent and accurate proteome analysis, Mol. Cell. Proteomics (2012)
  • V. Granholm et al., Quality assessments of peptide-spectrum matches in shotgun proteomics, Proteomics (2011)
  • J.K. Eng, A face in the crowd: recognizing peptides through database search, Mol. Cell. Proteomics (2011)
  • T. Koenig, Robust prediction of the MASCOT score for an improved quality assessment in mass spectrometric proteomics, J. Proteome Res. (2008)
  • J.K. Eng, A fast SEQUEST cross correlation algorithm, J. Proteome Res. (2008)
  • J. Cox et al., MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification, Nat. Biotechnol. (2008)
  • C. Bauer, PPINGUIN: peptide profiling guided identification of proteins improves quantitation of iTRAQ ratios, BMC Bioinformatics (2012)
  • N. Lietzen, Compid: a new software tool to integrate and compare MS/MS based protein identification results from Mascot and Paragon, J. Proteome Res. (2010)
  • T. Kwon, MSblender: a probabilistic approach for integrating peptide identifications from multiple database search engines, J. Proteome Res. (2011)