Elsevier

Journal of Theoretical Biology

Volume 362, 7 December 2014, Pages 44-52

Pathway and network analysis in proteomics

https://doi.org/10.1016/j.jtbi.2014.05.031

Abstract

Proteomics is inherently a systems science that studies not only the measured proteins and their expression in a cell, but also the interplay of proteins, protein complexes, signaling pathways, and network modules. Proteomics data have accumulated rapidly in recent years. However, Proteomics data are highly variable, with results sensitive to data preparation methods, sample conditions, instrument types, and analytical methods. To address this challenge in Proteomics data analysis, we review current tools being developed to incorporate biological function and network topological information. We categorize these tools into four types: tools with basic functional information and few topological features (e.g., GO category analysis), tools with rich functional information and few topological features (e.g., GSEA), tools with basic functional information and rich topological features (e.g., Cytoscape), and tools with rich functional information and rich topological features (e.g., PathwayExpress). We first review the potential application of these tools to Proteomics; we then review tools that can achieve automated learning of pathway modules and features, and tools that help perform integrated network visual analytics.

Introduction

Proteomics, the collective study of all measured proteins in cells under a given condition, is inherently a systems science that requires an understanding of not only the independent parts – protein constituents and their expression in a cell – but also the interplay of proteins, protein complexes, signaling pathways, and network modules that together achieve biochemical functions. Ideker et al. (2001) introduced an integrated approach to identify metabolic networks and build cellular pathway models using measurements from DNA microarrays, protein expression, and protein interaction knowledge. This work gave systems biology researchers a practical example of how biological networks could be used to perform integrative functional genomics data analysis. By providing a system-wide perspective on protein functions, Proteomics promises to reveal which subsets of proteins are essential in regulating specific biological processes. In Proteomics analysis, incorporating prior knowledge of how groups of proteins work in concert with each other, or with other genes and metabolites, has made it possible to unravel the complexity inherent in the analysis of cellular functions (MacBeath, 2002). New network biology and systems biology techniques have emerged in recent Proteomics studies (Bensimon et al., 2012, Sabidó et al., 2012), including studies of cancer (Goh and Wong, 2013).

Data have accumulated rapidly owing to advances in Proteomics technologies (MacBeath, 2002). Proteomics data are often generated from high-throughput experimental platforms, e.g., two-dimensional (2D) gels, liquid chromatography coupled with tandem mass spectrometry (LC–MS/MS), multiplexed immunoassays, and protein microarrays (Altelaar et al., 2013, Kingsmore, 2006). These platforms can assay thousands of proteins simultaneously from complex biological samples (Aebersold and Mann, 2003) to measure the relative abundance of proteins or peptides under various biological conditions. More accurate quantitative measurements of peptides can also be obtained by isotopic labelling of proteins in two different samples (Ong and Mann, 2005). As in Genomics, Proteomics studies have been widely used to extract functional and temporal signals from biological systems (Blagoev et al., 2004). Popular experimental techniques for measuring protein–protein interactions include the yeast two-hybrid (Y2H) system (Ito et al., 2001).

In contrast to the recent accelerated application of next-generation sequencing (NGS) in biology, a primary hurdle that slows the application of Proteomics is the high variability of Proteomics data, which makes it difficult to interpret analysis results biologically (Colinge and Bennett, 2007). Possible sources of data variation include biological sample heterogeneity, sample preparation variance, protein separation variance, detection limits of various proteomics techniques, and inaccuracies in pattern-matching peptide/protein identification or quantification by Proteomics data management software. The unusually high level of noise inherent in Proteomics studies, compared with that of DNA microarrays or NGS instruments, has made Proteomics experiments difficult to repeat and has rendered many statistical methods developed for Genomics applications ineffective. Many reviews cover the computational challenges (Vitek, 2009, Noble and MacCoss, 2012, Barla et al., 2008) and solutions that apply statistical machine learning approaches to the problem, e.g., support vector machines (SVM) (Elias et al., 2004), Markov clustering (Krogan et al., 2006), ant colony optimization (Ressom et al., 2007), and semi-supervised learning (Käll et al., 2007). The ultimate challenge, however, is how to extract functional and biological information from a long list of proteins identified or discovered in high-throughput Proteomics experiments, in order to provide biological insights into the underlying molecular mechanisms of different conditions (Khatri et al., 2012). Therefore, additional protein functional knowledge, e.g., the abundance of proteins, cellular locations, protein complexes, and gene/protein regulatory pathways, should be incorporated in a second phase of Proteomics analysis in order to filter out noisy protein identifications missed in the first, statistical analysis phase.

Pathway and network analysis techniques can help address the challenge of interpreting Proteomics results. Analysis of proteomic data at the pathway level has become increasingly popular (Fig. 1). By pathway analysis, we refer to data analysis that aims to identify activated pathways or pathway modules from functional proteomic data. Biological pathways include signaling pathways, gene regulatory pathways, and metabolic pathways, all of which are carefully curated in reputable scientific publications. Pathway analysis can help organize a long list of proteins into a short list of pathway knowledge maps, making it easier to interpret the molecular mechanisms underlying these altered proteins or their expression (Khatri et al., 2012). By network analysis, we refer to data analysis that builds, overlays, visualizes, and infers protein interaction networks from functional Proteomics and other systems biology data. Network analysis usually requires the use of graph theory, information theory, or Bayesian theory. Unlike pathway analysis, network analysis aims to use a comprehensive network wiring diagram, derived both from prior experimental sources and from new in silico predictions, to gain systems-level biological meaning (Wu and Chen, 2009). Many large knowledge bases on biological pathways and protein networks have been published, e.g., BioGRID (Chatr-aryamontri et al., 2013), STRING (Franceschini et al., 2013), KEGG (Kanehisa and Goto, 2000), Reactome (Matthews et al., 2009), BioCarta (Nishimura, 2001), PID (Schaefer et al., 2009), HAPPI (Chen et al., 2009), HPD (Chowbina et al., 2009), and PAGED (Huang et al., 2012).
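As a rough illustration of how pathway analysis can map a long protein list onto a short list of pathways, the sketch below ranks pathways by a hypergeometric over-representation test. This test is a commonly used choice for category-style analysis and is an assumption here, not a method prescribed by this review; the function names, pathway names, and protein identifiers are all hypothetical toy values.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_pathway, n_hits, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    Models drawing n_hits proteins without replacement from a background
    of n_background proteins, of which n_pathway belong to the pathway.
    """
    max_overlap = min(n_pathway, n_hits)
    total = comb(n_background, n_hits)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_hits - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


def rank_pathways(hits, pathways, background):
    """Return (pathway name, overlap size, p-value) sorted by p-value."""
    background = set(background)
    hits = set(hits) & background
    results = []
    for name, members in pathways.items():
        members = set(members) & background
        overlap = hits & members
        p = hypergeom_enrichment_p(
            len(background), len(members), len(hits), len(overlap)
        )
        results.append((name, len(overlap), p))
    return sorted(results, key=lambda r: r[2])


# Hypothetical toy data: 20 background proteins and two 5-protein pathways.
background = [f"P{i}" for i in range(20)]
pathways = {
    "toy_pathway_A": ["P0", "P1", "P2", "P3", "P4"],
    "toy_pathway_B": ["P10", "P11", "P12", "P13", "P14"],
}
hits = ["P0", "P1", "P2", "P3", "P15"]  # e.g., differentially expressed proteins

for name, overlap, p in rank_pathways(hits, pathways, background):
    print(f"{name}: overlap={overlap}, p={p:.4g}")
```

In practice the p-values from many pathways would also need multiple-testing correction, and the choice of background set strongly affects the result; neither is handled in this sketch.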

Compared with pathway and network analysis approaches applied in genomics, the corresponding approaches in proteomics offer the following advantages: (1) Pathway analysis of proteomic data can be directly interpreted in signaling pathways with signal proteins. (2) Network analysis of proteomic data can draw on direct evidence from protein–protein interaction data validated by in vitro experiments. (3) Both pathway analysis and network analysis of proteomic data can be visualized in a functional protein network with transcription factors labeled, all of which are measured only indirectly in genomic studies.

Section snippets

Pathway and network analysis for proteomics

Many pathway databases and pathway analysis software tools have become available in the last decade (Khatri et al., 2012, Ramanan et al., 2012), with some directly applicable to Proteomics (Goh and Wong, 2013, Goh et al., 2012). In Proteomics, statistically significant proteins identified from high-throughput Proteomics instruments often suffer from a high false discovery rate (Vitek, 2009), partly because the inherently high level of variance in Proteomics data can make it difficult to identify

Network analysis for complex protein networks

Complex protein networks are often characterized by scale-free properties (Barabási and Oltvai, 2004), i.e., their node degree distributions follow power laws. Such networks are highly robust to node communication errors, even at unrealistically high failure rates (Albert et al., 2000). This error tolerance appears not only in complex protein networks but also in many other types of scale-free networks, such as the World-Wide Web (WWW), the Internet, social networks and cell
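The error tolerance described above can be explored with a small simulation. The sketch below is a toy model under stated assumptions: it grows a network by preferential attachment (one standard way to obtain a heavy-tailed degree distribution, not a model of any real protein network), then compares the largest connected component after removing nodes at random with the same measure after removing the highest-degree hubs. All function names, network sizes, and the 5% removal fraction are hypothetical choices for illustration.

```python
import random
from collections import defaultdict


def preferential_attachment_graph(n_nodes, m_links, seed=0):
    """Grow a graph in which each new node links to m_links existing nodes
    chosen with probability proportional to their current degree."""
    rng = random.Random(seed)
    adj = defaultdict(set)
    # Start from a fully connected core of m_links + 1 nodes.
    for i in range(m_links + 1):
        for j in range(i):
            adj[i].add(j)
            adj[j].add(i)
    # Each node appears in `stubs` once per incident edge, so uniform
    # sampling from this list is degree-proportional sampling.
    stubs = [v for v in adj for _ in adj[v]]
    for new in range(m_links + 1, n_nodes):
        targets = set()
        while len(targets) < m_links:
            targets.add(rng.choice(stubs))
        for t in sorted(targets):
            adj[new].add(t)
            adj[t].add(new)
            stubs.extend((new, t))
    return adj


def largest_component_size(adj, removed):
    """Size of the largest connected component once `removed` nodes are deleted."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:  # iterative depth-first search
            v = stack.pop()
            size += 1
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        best = max(best, size)
    return best


def removal_experiment(adj, fraction, seed=0):
    """Return largest-component sizes after random failure and after hub removal."""
    rng = random.Random(seed)
    nodes = list(adj)
    k = int(fraction * len(nodes))
    random_removed = rng.sample(nodes, k)
    hubs_removed = sorted(nodes, key=lambda v: len(adj[v]), reverse=True)[:k]
    return (
        largest_component_size(adj, random_removed),
        largest_component_size(adj, hubs_removed),
    )


adj = preferential_attachment_graph(2000, 2, seed=1)
after_random, after_attack = removal_experiment(adj, 0.05, seed=1)
print("largest component after random failures:", after_random)
print("largest component after hub removal:", after_attack)
```

The comparison with hub removal is included because tolerance to random failures and vulnerability to targeted removal of hubs are usually discussed together for such networks, as in the title of Albert et al. (2000).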

Summary

Due to the data variability inherent in Proteomics measurements, statistical significance alone is insufficient for evaluating Proteomics results. We believe that both the functional information and the topological information of pathway models should be integrated to make Proteomics data interpretation relevant to biological mechanisms. With the availability of two types of information, one in protein functional categories and the other in network topological features, we can categorize pathway

Acknowledgements

This work is partly supported by Indiana Center for Systems Biology and Personalized Medicine (CSBPM) and Wenzhou Medical University.

References (82)

  • V.K. Ramanan et al.

    Pathway analysis of genomic data: concepts, methods, and prospects for future development

    Trends Genet.

    (2012)
  • E. Sabidó et al.

    Mass spectrometry-based proteomics for systems biology

    Curr. Opin. Biotechnol.

    (2012)
  • R. Aebersold et al.

    Mass spectrometry-based proteomics

    Nature

    (2003)
  • R. Albert et al.

    Error and attack tolerance of complex networks

    Nature

    (2000)
  • A.M. Altelaar et al.

    Next-generation proteomics: towards an integrative view of proteome dynamics

    Nat. Rev. Genet.

    (2013)
  • G.D. Bader et al.

    Pathguide: a pathway resource list

    Nucleic Acids Res.

    (2006)
  • Z. Bar-Joseph et al.

    Fast optimal leaf ordering for hierarchical clustering

    Bioinformatics

    (2001)
  • A.-L. Barabási

    Scale-free networks: a decade and beyond

    Science

    (2009)
  • A.-L. Barabási et al.

    Network biology: understanding the cell’s functional organization

    Nat. Rev. Genet.

    (2004)
  • A.-L. Barabási et al.

    Network medicine: a network-based approach to human disease

    Nat. Rev. Genet.

    (2011)
  • A. Barla et al.

    Machine learning methods for predictive proteomics

    Brief. Bioinf.

    (2008)
  • A. Bensimon et al.

    Mass spectrometry-based proteomics and network biology

    Annu. Rev. Biochem.

    (2012)
  • B. Blagoev et al.

    Temporal analysis of phosphotyrosine-dependent signaling networks by quantitative proteomics

    Nat. Biotechnol.

    (2004)
  • U. Chandran et al.

    Gene expression profiles of prostate cancer reveal involvement of multiple molecular pathways in the metastatic process

    BMC Cancer

    (2007)
  • A. Chatr-aryamontri et al.

    The BioGRID interaction database: 2013 update

    Nucleic Acids Res.

    (2013)
  • J. Chen et al.

    HAPPI: an online database of comprehensive human annotated and predicted protein interactions

    BMC Genomics

    (2009)
  • J.Y. Chen et al.

    A systems biology approach to the study of cisplatin drug resistance in ovarian cancers

    J. Bioinform. Comput. Biol.

    (2007)
  • S.R. Chowbina et al.

    HPD: an online integrated human pathway database enabling systems biology studies

    BMC Bioinf.

    (2009)
  • H.-Y. Chuang et al.

    Network-based classification of breast cancer metastasis

    Mol. Syst. Biol.

    (2007)
  • J. Colinge et al.

    Introduction to computational proteomics

    PLoS Comput. Biol.

    (2007)
  • A.C. Culhane et al.

    GeneSigDB—a curated database of gene expression signatures

    Nucleic Acids Res.

    (2010)
  • G. Dennis et al.

    DAVID: database for annotation, visualization, and integrated discovery

    Genome Biol.

    (2003)
  • M.T. Dittrich et al.

    Identifying functional modules in protein–protein interaction networks: an integrated exact approach

    Bioinformatics

    (2008)
  • M. Dorigo et al.

    Ant colony optimization

    Encyclopedia of Machine Learning

    (2010)
  • S. Draghici et al.

    A systems biology approach for pathway level analysis

    Genome Res.

    (2007)
  • E.J. Edelman et al.

    Modeling cancer progression via pathway dependencies

    PLoS Comput. Biol.

    (2008)
  • J.E. Elias et al.

    Intensity-based protein identification by machine learning from a library of tandem mass spectra

    Nat. Biotechnol.

    (2004)
  • A. Franceschini et al.

    STRING v9.1: protein–protein interaction networks, with increased coverage and integration

    Nucleic Acids Res.

    (2013)
  • W.W. Goh et al.

    How advancement in biological network analysis methods empowers proteomics

    Proteomics

    (2012)
  • W.W.B. Goh et al.

    Networks in proteomics analysis of cancer

    Curr. Opin. Biotechnol.

    (2013)
  • L.H. Hartwell et al.

    From molecular to modular cell biology

    Nature

    (1999)
  • J. He et al.

    Efficient and accurate greedy search methods for mining functional modules in protein interaction networks

    BMC Bioinf.

    (2012)
  • H. Huang et al.

    PAGED: a pathway and gene-set enrichment database to enable molecular phenotype discoveries

    BMC Bioinf.

    (2012)
  • T. Ideker et al.

    Integrated genomic and proteomic analyses of a systematically perturbed metabolic network

    Science

    (2001)
  • T. Ito et al.

    A comprehensive two-hybrid analysis to explore the yeast protein interactome

    Proc. Nat. Acad. Sci.

    (2001)
  • L. Käll et al.

    Semi-supervised learning for peptide identification from shotgun proteomics datasets

    Nat. Methods

    (2007)
  • M. Kanehisa et al.

    KEGG: Kyoto Encyclopedia of Genes and Genomes

    Nucleic Acids Res.

    (2000)
  • P. Khatri et al.

    Ten years of pathway analysis: current approaches and outstanding challenges

    PLoS Comput. Biol.

    (2012)
  • S.K. Kim et al.

    A gene expression map for Caenorhabditis elegans

    Science

    (2001)
  • S.F. Kingsmore

    Multiplexed protein measurement: technologies and applications of protein and antibody arrays

    Nat. Rev. Drug Discovery

    (2006)