Introduction

Since the commercialization of DNA microarray technology in the late 1990s, high-throughput (HTP) data relevant to cancer research have been accumulating at an ever-increasing rate. These data have led to crucial insights into fundamental cancer biology, including the mechanisms of tumorigenesis, metastasis, and drug resistance (Rhodes and Chinnaiyan 2005). They have also had enormous clinical impact; for example, several cancers can now be fractionated into therapeutic subsets with distinct prognostic outcomes based on their molecular phenotypes (Buyse et al. 2006; Dhanasekaran et al. 2001; Lowe et al. 2010; Pegram et al. 1998; Slamon and Press 2009; Spentzos et al. 2004; Zhu et al. 2010b). Despite these successes, many cancers still have a high mortality rate and no effective treatment. Examining 1.9 million patients from 31 countries and 5 continents, the CONCORD study found that fewer than 50% of diagnosed cancer patients survive 5 years under current treatments (Coleman et al. 2008). For many cancers, survival rates have not changed in decades: pancreatic cancer remains almost 100% lethal, and the overall survival rate for lung cancer has improved only from 13% to 16%. Most cancers still lack effective early disease biomarkers, and predictive signatures are limited to a few known mutations, such as EGFR or KRAS in lung cancer or HER2 in breast cancer. Predictive and prognostic biomarkers are often inconsistent from study to study (i.e., they show poor overlap), and frequently cannot be validated by other methods or in new cohorts of patients (Diamandis 2010; Dupuy and Simon 2007; Lau et al. 2007).

The key difficulty is that cancer is a complex and heterogeneous disease: many genes are amplified, deleted, mutated, and up- or down-regulated. Many pathways are activated or suppressed. These changes vary substantially in different cancers, in different patients with the same cancer, and even in different tumor samples from the same patient (Axelrod et al. 2009; Bachtiary et al. 2006; Blackhall et al. 2004). To get the full picture, we will need to combine information from diverse experimental platforms and other sources that offer different perspectives on the problem, e.g., gene and protein expression, protein–protein interactions (PPIs) and pathways, chromosomal aberrations, mutation events, epigenetic changes, and clinical information from drug trials and the bedside—leading to integrative computational biology.

The challenges fall into three main categories. The first is noise: HTP platforms are inherently noisy—results vary substantially from run to run and from lab to lab, and are prone to false positives and negatives. The second challenge is volume: there is a vast quantity of relevant data; new data are piling up at an increasing rate, and old data need to be constantly reinterpreted and updated in the light of new findings or reanalyzed with improved algorithms. The third challenge is analysis: simple methods—such as differential expression analyses of microarray data—often miss much of the signal in the data.

Integrative computational methods will continue to play a central role in addressing these challenges. We need new designs for databases; new software and workflows to combine and continuously update heterogeneous and distributed data; new analytical methods to identify complex signals in different data sources; and new standards for generating, maintaining, and sharing data. These methods will depend on advances in many areas such as statistics, knowledge representation and ontology, machine learning, data mining, graph theory, and visualization. Integrative analyses may ultimately lead to a better understanding of cancer, earlier diagnoses and true personalized medicine, where therapies are individually tailored based on combinations of single nucleotide polymorphisms, and gene, protein and microRNA expression levels (Auffray et al. 2009; Augen 2001; Cervigne et al. 2009; Reis et al. 2010; Zhu et al. 2010a, b).

Integrative computational biology shares many tools and goals with the closely related field of systems biology, the discipline that attempts to explain the structure and behavior of complex biological systems as a function of their simpler components (Kirschner 2005). The term systems biology may be applied to describe anything from the output of an ‘omics’ experiment to explicit mechanistic models of tumor growth (Deisboeck et al. 2009; Hornberg et al. 2006). Integrative computational biology, in contrast, specifically concerns the computational interrogation and integration of diverse HTP molecular biology data.

In this review, we discuss the major challenges of HTP cancer studies and give examples of integrative computational methods that help to meet them.

The challenges facing HTP cancer biology

A wealth of genomic and proteomic cancer data is now available from HTP screens. While these data have improved our understanding of basic cancer biology and some have even translated into improved patient diagnosis and treatment, significant challenges remain. In this section we briefly review some of the major obstacles to the progress of HTP cancer research.

Noise

Cancer is a heterogeneous disease, and HTP platforms are noisy, i.e., the resulting data sets have false positives and false negatives. Consequently, there can be large variation in results from lab to lab (Bell et al. 2009; Irizarry et al. 2005); methods are needed to control this noise and to integrate different data sources to make them more reliable. The main noise issues that plague HTP cancer research are false negatives and false positives, intra- and inter-sample heterogeneity, and platform bias.

False negatives and false positives in HTP data

HTP screens suffer from substantial noise—both false positives and false negatives—which must be resolved through complementary experiments and computational analysis (Auffray et al. 2009; Augen 2001). For example, while HTP PPI screens can identify thousands of protein interactions at once, they do so at the cost of either high false discovery rates or poor sensitivity. The false discovery rate is the proportion of detected interactions that are false, and sensitivity is the proportion of true interactions that are successfully detected. When interactions detected by two HTP studies were tested in small-scale screens, they were found to have false discovery rates of 22% (Rual et al. 2005) and 38% (Stelzl et al. 2005). An evaluation of five HTP methods found that their sensitivity rates ranged from only 21 to 36% at false discovery rates of 0–11% (Braun et al. 2009). Similarly, mass spectrometry analyses of human serum typically produce many false negatives (Gstaiger and Aebersold 2009). One problem is that human serum has a high dynamic range—protein concentrations are estimated to vary over ten orders of magnitude (Gstaiger and Aebersold 2009). Mass spectrometers have a much smaller range of detection, leading to false negatives: low abundance proteins are not detected. This challenge may be somewhat diminished by extensive sample fractionation (Kislinger et al. 2006).
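In terms of validation counts, these two error rates are straightforward to compute. The following minimal Python sketch makes the definitions concrete; the counts are invented for illustration and are not taken from the cited studies.

```python
def false_discovery_rate(detected_true: int, detected_false: int) -> float:
    """Proportion of detected interactions that are false."""
    return detected_false / (detected_true + detected_false)

def sensitivity(detected_true: int, missed_true: int) -> float:
    """Proportion of true interactions that are successfully detected."""
    return detected_true / (detected_true + missed_true)

# Hypothetical screen: 620 reported interactions, of which 140 fail
# re-testing; a reference set suggests 1,800 true interactions were missed.
print(f"FDR = {false_discovery_rate(480, 140):.2f}")         # 0.23
print(f"sensitivity = {sensitivity(480, 1800):.2f}")         # 0.21
```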

Tumors are heterogeneous

Often only a single sample from each tumor is available for analysis. But tumors are highly heterogeneous, so these samples may not be representative of the whole tumor (Axelrod et al. 2009; Bachtiary et al. 2006; Blackhall et al. 2004). Tumors comprise cells belonging to several distinct subpopulations—e.g., tumor regions can be hypoxic to different extents, or be made up of different proportions of tumor initiating cells (cancer stem cells)—and these differences have consequences for predicting drug response and prognosis (Axelrod et al. 2009; Blackhall et al. 2004; Jubb et al. 2010). Intra-tumor heterogeneity can lead to different cell populations expressing different levels of protein, or having different mutations and copy number alterations, which complicates analyses. For example, variable tumor epithelial and stromal cell content in breast tumor samples can significantly affect gene expression profiles and signature accuracy (Cleator et al. 2006; Myhre et al. 2010). Some of these difficulties can be alleviated with techniques such as laser-capture micro-dissection (Fend and Raffeld 2000), which can isolate regions of a sample that contain a more uniform population of cells, but these techniques remain costly and slow. In addition, they may not yield a sufficient amount of material for follow-up experiments. Though single samples from tumors can suffice for population-level studies (Axelrod et al. 2009), the fact that two samples from the same patient can be quite different means that this heterogeneity poses a significant challenge to personalized medicine. A further issue in cancer studies is the choice of control samples: often the available “normal” samples come from tissue directly adjacent to the tumor, and their properties, such as gene expression profiles, may be quite different from those of more distant tissue or of a healthy patient (Chandran et al. 2005).

Different experimental platforms can disagree

Experimental platforms produced by different companies can yield conflicting results (Curtis et al. 2009; Elias et al. 2005; Tan et al. 2003). For example, different microarray platforms use different probe designs and labeling, and have different dynamic ranges. A gene may appear overexpressed in cancer on one platform, yet under-expressed on another, simply because the two platforms use different DNA sequences to “probe” the same gene. On a given platform, some genes are represented by many probes and others by only one or none, and this representation differs across platforms. Genes that are expressed at low levels are particularly problematic for concordance across array platforms (Barnes et al. 2005). Another problem is that genome annotations continue to grow and change (Eggle et al. 2009). Updated probe set definitions can substantially affect the number and the identity of differentially expressed genes (Sandberg and Larsson 2007). Further, HTP technology is developing rapidly, new technologies are coming into use every year, and there will always be new differences and conflicts to resolve. The challenge is to have infrastructure set up to compare and integrate new technologies as they come along, and to systematically identify the best workflow for data processing and analysis, e.g., (Ponzielli et al. 2008).

Analysis

A primary goal of integrative computational biology analysis in individualized medicine is to identify small groups of genes/proteins/microRNAs, etc., that can be used to improve diagnosis, predict outcome or predict treatment response, i.e., to identify prognostic or predictive signatures. Their identification in HTP data is challenging because basic analysis methods fail to capture the entire signal in the data, and because good signatures do not consist simply of the most differentially expressed molecules. In this section we focus on gene expression microarray studies, but the criticisms apply equally well to similar experimental designs, such as those using HTP protein or microRNA assays.

Lists of differentially expressed genes show poor overlap across studies

The most popular way of analyzing microarray data is to detect whether individual genes are differentially expressed in one condition versus another, e.g., in non-responders versus responders to some drug treatment. Differentially expressed genes are widely used in cancer research—their applications include deriving molecular signatures of cancer subtypes, increasing our understanding of the biology of tumorigenesis, and providing new candidate markers for diagnosis, prognosis, and drug response. Unfortunately, there are major challenges with their identification and interpretation. Previous work has shown that lists of differentially expressed genes are poorly reproduced across studies (Tian et al. 2005; Zhang et al. 2008); even random subsets of samples from one experiment can yield widely divergent gene lists. These problems are caused by high dimensionality, small sample sizes, and noise (biological and technical variability), but they can be exacerbated by the analysis method. For example, analyses that quantify the differential expression of gene groups rather than individual genes show higher conservation across platforms and studies (Subramanian et al. 2005).
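This instability is easy to reproduce in simulation. The sketch below, with invented data and parameters, runs per-gene t-tests with Benjamini–Hochberg correction on two disjoint halves of the same simulated cohort and measures how little the two resulting gene lists overlap.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated expression matrix: 5,000 genes x 40 samples
# (20 in each condition); the first 250 genes carry a real shift.
n_genes, n_per_group = 5000, 20
expr = rng.normal(size=(n_genes, 2 * n_per_group))
expr[:250, n_per_group:] += 0.8

def de_genes(cols_a, cols_b, alpha=0.05):
    """Per-gene t-tests with Benjamini-Hochberg correction."""
    _, p = stats.ttest_ind(expr[:, cols_a], expr[:, cols_b], axis=1)
    order = np.argsort(p)
    m = len(p)
    passed = p[order] <= alpha * np.arange(1, m + 1) / m  # BH step-up
    k = passed.nonzero()[0].max() + 1 if passed.any() else 0
    return set(order[:k])

# Two disjoint halves of the same cohort yield noticeably different lists.
cols_a = np.arange(n_per_group)                    # condition A columns
cols_b = np.arange(n_per_group, 2 * n_per_group)   # condition B columns
list1 = de_genes(cols_a[:10], cols_b[:10])
list2 = de_genes(cols_a[10:], cols_b[10:])
jaccard = len(list1 & list2) / max(1, len(list1 | list2))
print(f"Jaccard overlap between the two gene lists: {jaccard:.2f}")
```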

The most differentially expressed genes do not yield the best signatures

Very often in microarray experiments, the most differentially expressed genes are used to construct prognostic or predictive signatures: machine learning methods are trained to use the expression levels of those genes to predict, e.g., the disease state of a patient and probability of survival. The problem with this approach is that single-gene analyses overlook multivariate effects. More sophisticated analyses are needed to identify sets of genes that complement one another, i.e., ones whose combined expression levels yield the best-performing prognostic signatures. Such analyses show that genes essential to a good prognostic signature are often not highly differentially expressed on their own (Chuang et al. 2007; Fujita et al. 2008).
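For concreteness, here is a minimal sketch of the single-gene approach criticized above: rank genes by a univariate statistic, keep the top k, and train a classifier on them (simulated data, scikit-learn). Note that selecting genes on the full data set before cross-validation, as done here for brevity, is itself a well-known source of optimistic bias.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)

# Toy cohort: 80 patients x 2,000 genes, binary outcome (e.g., relapse).
X = rng.normal(size=(80, 2000))
y = rng.integers(0, 2, size=80)
X[y == 1, :30] += 0.6  # a weak signal spread across 30 genes

# Rank genes by univariate differential expression, keep the top 20,
# and train a classifier on their expression levels.
t, _ = stats.ttest_ind(X[y == 0], X[y == 1], axis=0)
top_k = np.argsort(-np.abs(t))[:20]
clf = LogisticRegression(max_iter=1000)
acc = cross_val_score(clf, X[:, top_k], y, cv=5).mean()
print(f"Cross-validated accuracy, top-20 DE genes: {acc:.2f}")
```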

Signatures validate poorly on other data sets

One of the most important contributions of HTP biology to cancer research has been to develop prognostic and predictive signatures. Unfortunately, many signatures have failed to validate by other methods or in new cohorts of patients (Lau et al. 2007; Zhu et al. 2008). Obviously, this is a great concern for personalized medicine. Existing prognostic and predictive biomarkers for the same condition overlap only partially, and the set of biomarker genes identified depends strongly on the subset of patients used to generate it (Ein-Dor et al. 2005). One study estimated that to achieve 50% overlap in prognostic gene sets for breast cancer patients would require several thousand samples (Ein-Dor et al. 2006). Several factors contribute to these problems, including: (1) diverse patients and heterogeneous tumors, (2) different profiling platforms, (3) diverse statistical and bioinformatics approaches to biomarker identification (Shedden et al. 2008), (4) an insufficient number of samples (Ein-Dor et al. 2006), and (5) the existence of multiple equivalent signatures (Boutros et al. 2009; Ein-Dor et al. 2005).

Volume

The pace of discovery is rapid, data are piling up, and annotations are changing all the time. We need integrated workflows to organize data sets and make them easy to access, incorporate, annotate and update.

Large amounts of data need to be integrated and maintained

There are several examples of the rapid increase of knowledge and data. The Gene Expression Omnibus (GEO) currently contains over 490,000 microarray samples (Edgar et al. 2002). Roughly 100 cancer genomes have been sequenced so far—most of these just within the past year—and several major projects are underway, which should see that number quickly increase. For instance, the International Cancer Genome Consortium (ICGC) plans to sequence 500 tumors from each of 50 different cancers (Hudson et al. 2010), and The Cancer Genome Atlas (TCGA) will sequence more than 20 different tumor types in the next 5 years (Ledford 2010). Between 2008 and 2010, the number of papers published per year containing “cancer genome”, “personalized medicine”, or “whole AND genome AND sequencing” in their abstracts has roughly doubled, and the number containing “cancer AND signature” has increased by 50% (Fig. 1) [data generated using MEDSUM (Galsworthy 2009)].

Fig. 1

The rapid accumulation of personalized medicine publications. Counts of associated publications from the last 10 years for several personalized medicine search terms, retrieved with MEDSUM. Year is shown on the x-axis and publication count on the y-axis. The search term “cancer AND signature” is shown in blue; “whole AND genome AND sequencing” in turquoise; “personalized medicine” in red; and “cancer genome” in green

Many disease-related targets remain uncharacterized

Many disease-related targets remain uncharacterized; we have little or no information about their protein interaction partners, the pathways they control or are affected by, or their splice variants, functional mutations, or protein structure. For example, while there are roughly 21,000 genes in the human genome (Clamp et al. 2007), the Reactome and KEGG Pathway databases contain pathway annotations for only about 4,000 of these (Kanehisa et al. 2008; Matthews et al. 2009). And while many studies have shown that interaction networks provide information that can substantially improve predictive signatures (Chuang et al. 2007; Fortney et al. 2010; Nibbe et al. 2010), they can do so only for genes represented in the interactome.

We used the I2D interaction database (Brown and Jurisica 2005, 2007; Niu et al. 2010) to investigate how many cancer-associated genes remain uncharacterized in terms of their protein interactions, combining known interactions and interologs (interactions in other species that are transferred to humans via orthology). For instance, if we consider cancer genes with fewer than 5 interactions to be uncharacterized [this is unlikely to be their true connectivity, because cancer proteins are typically network hubs (Jonsson and Bates 2006)], then we identify 19% of genes in the Sanger Cancer Gene Census [SCGC; (Futreal et al. 2004)] and 47% of the genes in Online Mendelian Inheritance in Man [OMIM; (Hamosh et al. 2005)] as uncharacterized (Fig. 2). Looking at genes in the Cancer Data Integration Portal (CDIP) that are implicated in cancer by at least two studies, we find similar numbers: 42% of genes associated with prostate cancer, 31% of genes associated with ovarian cancer, and 24% of genes associated with lung cancer are uncharacterized. Interestingly, even though lung cancer is represented by a much smaller set of genes in CDIP—only 1,232 genes, versus 5,575 for ovarian and 9,803 for prostate cancer—it shows a roughly comparable proportion of uncharacterized genes across different interaction cut-offs (Fig. 2).

Fig. 2

Many cancer-implicated genes remain uncharacterized. Percentages of known disease genes (y-axis) with zero, fewer than 5, or fewer than 10 known protein interactions (x-axis) in I2D version 1.85 (provided in Supplementary File 1). A large percentage of disease genes from the Sanger Cancer Gene Census (blue) and Online Mendelian Inheritance in Man (turquoise) remain uncharacterized. Similar results hold for genes implicated in prostate (green), lung (orange), and ovarian (red) cancers by two or more studies, as retrieved from the Cancer Data Integration Portal

The human interactome is poorly characterized

PPI networks provide biological context and meaning for gene signatures and yield better network-based prognostic and predictive signatures. Unfortunately, with current experimental knowledge the coverage of the human interactome is only around 13%. The human interactome is estimated to comprise around 650,000 interactions (Stumpf et al. 2008), and currently there are only 82,857 unique experimentally derived interactions (and an additional 55,471 unique interologous interactions) in I2D, a database that integrates HTP PPI data sets and most online PPI databases [such as BioGRID (Stark et al. 2006), BIND (Bader et al. 2003), DIP (Xenarios et al. 2000), HPRD (Peri et al. 2004), InnateDB (Lynn et al. 2008), IntAct (Hermjakob et al. 2004) and MINT (Zanzoni et al. 2002)]. While HTP methods can help to identify many protein interactions, their overlap is usually low, and even when combined they were shown to result in a false negative rate of around 41% (Braun et al. 2009). Interaction dynamics with respect to time and localization remain largely unknown. Also, though over 92% of human genes have splice variants (Wang et al. 2008), and many genes may have hundreds or even thousands of such variants [for instance, the Drosophila gene Dscam may have over 38,000 alternative splice forms (Schmucker et al. 2000)], we still lack information about most variant-specific interactions.

How integrative computational biology can address these challenges

The field of integrative computational biology uses techniques from computer science, mathematics, physics and engineering to comprehensively analyze and interpret biological data. Through the creation of new analysis and visualization methods, software tools and databases, it can help diminish the challenges to HTP cancer biology. Here we present some successful applications of integrative computational biology to cancer research. These applications fall into four main categories: data integration, network analysis, databases, and standards.

Data integration

As we have seen, noise in HTP cancer studies can arise both from biological and technological variability. One of the most effective strategies for reducing both types of noise is data integration. The idea is simple: we can be more confident about the result of an experiment if similar experiments yielded similar results. We can integrate different experiments that measure the same biological entity, such as microarray studies measuring tumor versus normal gene expression differences on different experimental platforms. We can also integrate different data types, such as mutation, expression, and proteomic data. Clearly, data integration can increase our confidence in results that are consistent across multiple studies and experimental modalities. But data integration can also increase sensitivity, since different platforms and methods exhibit different biases—e.g., some protein interactions may be undetectable by some methods. Integrative computational approaches can also be complemented by better experimental design: e.g., in tumor profiling, analyzing multiple samples from the same patient reduces the effect of intra-tumor heterogeneity (Bachtiary et al. 2006; Blackhall et al. 2004), and executing multiple MudPIT (Multi-dimensional Protein Identification Technology) runs significantly improves sensitivity (Gortzak-Uzan et al. 2008; Kislinger and Emili 2005; Sodek et al. 2008; Wei et al. 2011).

Integrating the same type of data across multiple platforms and studies

With microarray and similar data, small sample numbers and different experimental platforms can lead to highly variable results. These problems can be addressed by combining data from different studies and platforms, which increases the effective number of samples and helps control for inter-platform heterogeneity.

Most approaches to integrating microarray data can be divided into two general classes, pooling and meta-analyses. In pooling, multiple expression data sets are merged into a single data set (van Vliet et al. 2008; Warnat et al. 2005); typically, gene measurements from each separate study are transformed before pooling to make the experiments more comparable (Fierro et al. 2008). Previous work found that pooling six breast cancer data sets (over 900 samples in total) yielded better-performing signatures (van Vliet et al. 2008). In contrast, for a meta-analysis, statistics are computed for each data set separately and then combined. Meta-analyses identify gene changes that are seen consistently across many studies (Rhodes and Chinnaiyan 2005). Several cancer-specific databases gather information from multiple studies to facilitate meta-analysis, including Oncomine (Rhodes et al. 2007), GeneSigDB (Culhane et al. 2010), and CDIP. CDIP currently covers lung, ovarian, prostate and head and neck cancers. For example, for prostate cancer, CDIP contains 119 different microarray analyses with over 850 tumor and normal samples. While 12,975 unique genes are significantly differentially regulated in at least one study, far fewer show this trend across multiple studies and we can use this information to prioritize genes, both for biological validation and for signature generation.
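As a concrete example of the meta-analysis style, the sketch below combines per-study p-values for a single gene using Fisher's method, one standard combination rule; it is a generic illustration, not the specific pipeline behind CDIP, Oncomine, or GeneSigDB.

```python
import numpy as np
from scipy import stats

def fisher_combined(pvals):
    """Fisher's method: combine independent per-study p-values."""
    pvals = np.asarray(pvals, dtype=float)
    statistic = -2.0 * np.log(pvals).sum()
    # Under the null, the statistic is chi-squared with 2k degrees of freedom.
    return stats.chi2.sf(statistic, df=2 * len(pvals))

# A gene tested in four studies: individually unimpressive p-values,
# but their consistency combines into much stronger evidence.
print(f"combined p = {fisher_combined([0.04, 0.08, 0.03, 0.11]):.4f}")  # ~0.0035
```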

Integrating different types of data

Integrating complementary data from different sources is helpful for reducing noise and prioritizing targets (Gortzak-Uzan et al. 2008; Varambally et al. 2005). For example, Gortzak-Uzan et al. (2008) combined proteins identified in ovarian cancer ascites with differentially expressed genes from CDIP and PPIs from I2D to identify putative biomarkers for early ovarian cancer detection in serum. For target prioritization, prognostic and predictive signatures can be put in their biological context by overlapping and expanding them with data from cancer-specific resources such as the Sanger Census database (Bamford et al. 2004; Stratton et al. 2009), the COSMIC catalog of somatic mutations (Bamford et al. 2004), GeneSigDB, and Oncomine. Techniques such as Gene Set Enrichment Analysis (GSEA) (Subramanian et al. 2005) can be used to combine experimental data with annotation and pathway databases such as the Gene Ontology (Ashburner et al. 2000), KEGG, and Reactome, allowing us to convert lists of differentially expressed genes into lists of differentially expressed gene groups, which are more stable across studies (Subramanian et al. 2005).
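To make the gene-group idea concrete, the sketch below applies the simple hypergeometric overrepresentation test that underlies many enrichment tools. GSEA proper uses a weighted running-sum statistic over a ranked gene list rather than this test, and the counts here are invented.

```python
from scipy.stats import hypergeom

def enrichment_p(n_universe, n_in_set, n_de, n_de_in_set):
    """Probability of seeing at least n_de_in_set pathway members
    among n_de differentially expressed genes, by chance alone."""
    return hypergeom.sf(n_de_in_set - 1, n_universe, n_in_set, n_de)

# 20,000 genes measured; a 150-gene pathway; 400 DE genes, 12 of which
# fall in the pathway (about 3 expected by chance).
print(f"p = {enrichment_p(20000, 150, 400, 12):.2e}")
```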

Integrating data to predict new PPIs

Protein networks are essential for cancer signature design and interpretation (Chuang et al. 2007; Ergun et al. 2007; Rhodes and Chinnaiyan 2005), yet current experimental networks are far from complete, contain false positives, and generally lack context, such as the condition, location and strength of individual interactions. Biological techniques can also be enriched by in silico approaches that predict new interactions and the context of existing ones. For example, we recently developed FpClass (Kotlyar and Jurisica 2006; Kotlyar et al. 2011), an association mining algorithm that greatly expands coverage of the human proteome without increasing the false discovery rate: it predicts 178,738 new high-accuracy interactions for 9,372 proteins. FpClass integrates many different data types to predict interactions, including sequence, domains, gene expression, and post-translational modifications. Importantly, FpClass reduces the number of uncharacterized disease genes (again, where a gene is considered uncharacterized if its protein product participates in fewer than 5 interactions, see Fig. 2) to 34% for OMIM and 11% for SCGC.

Integrating data to predict disease gene function

Though many cancer-related genes remain uncharacterized, HTP data and in silico methods can be applied to assign them putative functions (Hu et al. 2007). For example, in the successful MouseFunc prediction challenge (Pena-Castillo et al. 2008), teams of scientists competed to predict mouse gene function (Gene Ontology categories) on the basis of several different data sources, including expression, sequence, interactions, phenotype annotations, disease associations and phylogenetic profiles. Computational methods for predicting the function of uncharacterized cancer genes have been previously reviewed in (Hu et al. 2007).

Network analysis

Network approaches have been successful in addressing some of the analysis-based challenges of HTP cancer biology. Genes do not act in isolation; they form highly complex and interlinked molecular networks. Examining genes in the context of these networks can yield valuable clues about their function and relations, and expand our knowledge of individual cancer pathways and their cross-talk (Agarwal et al. 2009; Mills et al. 2009). Despite the noise in current protein interaction data sets, network analysis can uncover biologically relevant information, such as lethality (Hahn and Kern 2005; Jeong et al. 2001), functional organization (Gavin et al. 2002; Maslov and Sneppen 2002; Wuchty 2006), hierarchical structure (Ravasz et al. 2002; Yu et al. 2006), modularity (Han et al. 2004) and network-building motifs (Milo et al. 2002; Przulj et al. 2006; Rice et al. 2005). Three important applications of protein networks to cancer research include signature generation, signature interpretation, and disease gene prediction.

Networks for signature generation

By integrating network information with gene expression data, we can identify predictive signatures that perform better and are more conserved across studies than signatures based on gene expression data alone. There are many ways of using networks to create improved gene signatures. One class of methods that has proven very successful is score-based subnetwork biomarkers (Chuang et al. 2007; Fortney et al. 2010; Hwang et al. 2009; Ideker et al. 2002; Nacu et al. 2007). In these approaches, genes are aggregated around an initial “seed” gene in a network to generate subnetworks whose pooled activity levels can be used to predict the value of some response variable, such as disease status or survival time. Subnetworks identified using this approach were shown to be highly conserved across studies, and to perform better than individual genes or pre-defined gene groups at predicting breast cancer metastasis (Chuang et al. 2007). Importantly, many crucial genes belonging to subnetwork biomarkers were not differentially expressed on their own, demonstrating the added value of a network approach. Related approaches have been used to develop subnetwork biomarkers for colon cancer using a combination of proteome and transcriptome data (Nibbe et al. 2009, 2010).
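The core of these score-based methods is a greedy search, sketched below in the spirit of Chuang et al. (2007): starting from a seed gene, neighbors are added while they improve the discriminative score of the subnetwork's pooled expression. This is a simplified toy reading of that family of methods (random data, a plain t-statistic score), not a reimplementation of any published tool.

```python
import networkx as nx
import numpy as np
from scipy import stats

def subnetwork_score(genes, expr, labels):
    """Absolute t-statistic of the subnetwork's pooled (averaged)
    expression, compared between the two patient classes."""
    activity = expr[list(genes)].mean(axis=0)  # pooled activity per patient
    t, _ = stats.ttest_ind(activity[labels == 0], activity[labels == 1])
    return abs(t)

def grow_subnetwork(graph, seed, expr, labels, max_size=10):
    """Greedily add the neighbor that most improves the score;
    stop when no neighbor improves it."""
    sub, best = {seed}, subnetwork_score({seed}, expr, labels)
    while len(sub) < max_size:
        neighbors = set().union(*(graph.neighbors(g) for g in sub)) - sub
        scored = [(subnetwork_score(sub | {n}, expr, labels), n)
                  for n in neighbors]
        if not scored or max(scored)[0] <= best:
            break
        best, top_gene = max(scored)
        sub.add(top_gene)
    return sub, best

# Toy demo: random PPI-like graph, 200 genes x 60 patients.
rng = np.random.default_rng(2)
G = nx.gnm_random_graph(200, 600, seed=2)
expr = rng.normal(size=(200, 60))
labels = np.repeat([0, 1], 30)
expr[:5, labels == 1] += 1.0  # genes 0-4 carry the class signal
sub, score = grow_subnetwork(G, seed=0, expr=expr, labels=labels)
print(f"subnetwork {sorted(sub)} with score {score:.2f}")
```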

Networks for signature interpretation

Many genes that play a role in predictive signatures have not been previously linked to cancer, and thus can be considered as novel candidate cancer genes. Networks can be used to link these genes with known cancer mechanisms and pathways (Radulovich et al. 2010; Rhodes et al. 2005; Sodek et al. 2008; Tomasini et al. 2008). Gene signatures mapped to protein interactions can be further annotated with other profiles (including proteomic, CGH, and miRNA studies), and with network structures, such as graphlets (Przulj 2007; Przulj et al. 2006). Networks can also reveal new connections between different prognostic signatures. For example, we recently identified a 15-gene prognostic and predictive signature in lung cancer (Zhu et al. 2010a). Though our signature did not directly overlap with previously published ones, network analysis revealed that they were highly related: there were direct interactions between the protein products of genes from our signature and those of other signatures. Similar results have been shown in other studies (Zhu et al. 2009, 2010b).

Networks for identifying new disease genes

Analyses of the network connectivity of cancer genes have shown that they can be characterized by several topological properties. For example, proteins encoded by cancer genes tend to be central in interaction networks [they have high degree and betweenness centrality (Jonsson and Bates 2006; Rambaldi et al. 2008; Syed et al. 2010)], have high clustering coefficients (Li et al. 2009), and are overrepresented in network motifs (Rambaldi et al. 2008). Several methods use the topological characteristics of known cancer genes, in combination with other features (such as Gene Ontology categories, protein domains, biological pathways, and sequence features), to predict new cancer genes (Aragues et al. 2008; Rambaldi et al. 2008) or functional SNPs (Savas et al. 2009). Many algorithms identify modules in interaction networks, or groups of densely interconnected genes that can be highly functionally related (King et al. 2004; Newman 2006; Palla et al. 2005; Spirin and Mirny 2003). Module-finding algorithms can also be applied to predict new disease genes (Chuang et al. 2007; Goh et al. 2007): clusters enriched for known cancer genes may implicate novel genes in cancer.
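A minimal sketch of this strategy follows: compute a few topological features for every protein, label the "known" disease genes, and train a classifier to rank the rest. The graph, labels, and feature set here are toy stand-ins for the richer features used in the cited studies.

```python
import networkx as nx
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def topology_features(graph):
    """Degree, betweenness centrality, and clustering coefficient per node."""
    betweenness = nx.betweenness_centrality(graph)
    clustering = nx.clustering(graph)
    return np.array([[graph.degree(n), betweenness[n], clustering[n]]
                     for n in graph.nodes])

# Toy PPI-like network; the "known" cancer genes are chosen arbitrarily
# here, purely to exercise the pipeline.
G = nx.barabasi_albert_graph(300, 3, seed=3)
known_cancer = set(range(0, 300, 15))  # 20 toy positives
X = topology_features(G)
y = np.array([n in known_cancer for n in G.nodes], dtype=int)

clf = RandomForestClassifier(n_estimators=200, random_state=3).fit(X, y)
ranking = np.argsort(-clf.predict_proba(X)[:, 1])  # candidate disease genes
print("top-ranked candidates:", ranking[:10])
```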

Generating context-specific co-expression networks from microarray data

Co-expression networks are distinct from PPI networks: instead of indicating physical interactions, edges (and edge weights) between genes reflect the degree of correlation of their expression profiles. Co-expression networks have also been useful in cancer research, e.g., (Aggarwal et al. 2006; Choi et al. 2005). For example, we can use microarray data from different health and disease conditions, such as cancer and non-cancer, to create networks specific to those contexts. We can then compare the networks using a variety of measures that reflect local and global network structure, such as graphlets (Przulj et al. 2006), communities (Palla et al. 2005), etc. Weighted Gene Co-expression Network Analysis (WGCNA) (Zhang and Horvath 2005), a popular method for generating and interpreting co-expression networks, has been applied to study prostate cancer (Wang et al. 2009) and glioblastoma (Horvath et al. 2006).
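A minimal sketch of building and comparing context-specific co-expression networks from two toy expression matrices follows. Note that WGCNA itself uses soft thresholding (raising correlations to a power to obtain a weighted network); the hard correlation cut-off here is a simplification.

```python
import numpy as np

def coexpression_network(expr, threshold=0.7):
    """Boolean adjacency matrix: an edge connects two genes whose
    expression profiles correlate above the threshold."""
    corr = np.corrcoef(expr)       # genes x genes correlation matrix
    adj = np.abs(corr) >= threshold
    np.fill_diagonal(adj, False)
    return adj

rng = np.random.default_rng(4)
tumor = rng.normal(size=(100, 30))   # 100 genes x 30 tumor samples
normal = rng.normal(size=(100, 30))  # matched normal samples

adj_tumor = coexpression_network(tumor)
adj_normal = coexpression_network(normal)
# Edges present in one condition but not the other suggest rewired
# co-regulation between the two contexts.
rewired = np.logical_xor(adj_tumor, adj_normal).sum() // 2
print(f"{rewired} rewired edges between tumor and normal networks")
```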

Databases and visualization

Integrated and updated tools and resources can help to deal with the large volume of HTP biological data being generated, as well as facilitate integrative analyses of cancer profiles. Integrative databases fulfill several essential functions; for instance, they organize heterogeneous and distributed data sources and make them easy to use, and they update and reinterpret old data in the light of new findings (e.g., updated probe set mappings for a microarray platform). For example, the I2D database integrates HTP-detected PPIs with PPIs from several source databases. Tens of thousands of these I2D PPIs are unique to a single source database (Fig. 3); integrating these different sources both substantially improves coverage and increases confidence for those interactions present in multiple sources. Importantly, to support HTP data analysis, databases and portals need to support multiple identifiers and batch processing. Both databases and network visualization methods have played essential roles in interpreting HTP biological data and cancer signatures.

Fig. 3

The value of data integration: I2D, the Interologous Interaction Database. Experimentally derived human interactions from I2D version 1.85; the network comprises 13,190 proteins connected by 82,857 interactions. Interactions unique to a single source or database are rendered in red (covering 11,781 proteins and 49,440 interactions), and interactions present in two or more sources are colored blue. Node color and size are proportional to degree. Visualization and annotation of the network were performed in NAViGaTOR ver. 2.2.1. The network file with GO and PubMed annotations is available as Supplementary File 2, in NAViGaTOR 2.2 XML format

Integrating heterogeneous and distributed data sources

Different data sources can be complementary. For instance, while biological pathway databases may disagree on individual pathway definitions, by combining them we can diminish their false positives and false negatives; by integrating pathways with protein interactions, we can improve their coverage and relevance (Radulovich et al. 2010; Savas et al. 2009). Having all of these data available from a single source, or at least in a standardized format, simplifies integrative analyses and reduces errors and database incompatibilities (arising from, e.g., inconsistent gene nomenclature). One example is CDIP (Gortzak-Uzan et al. 2008), a cancer informatics portal that combines HTP data from microarrays, CGH, KEGG, I2D, annotations from GeneCards (Rebhan et al. 1997), and other molecular information specific to different cancer types. Another is mirDIP (Shirdel et al. 2011), a microRNA integration portal that combines computational predictions of microRNA–mRNA binding from multiple databases (version 1 combines the 11 largest sources).

Updating old data to reflect the latest knowledge

In part because of the noise inherent in HTP platforms, our best knowledge of what they are measuring and how their data should be interpreted changes with time. Past work has shown that re-annotating the assignment of probe sets to genes for Affymetrix microarray platforms using updated transcript data affected 20–30% of all probe sets (Gautier et al. 2004). Updated probe set definitions lead to higher precision and accuracy (Sandberg and Larsson 2007), and more consistent results across different microarray platforms (Carter et al. 2005; Elo et al. 2005). The Ensembl database (Hubbard et al. 2009) maintains and regularly updates probe set mappings for many popular microarray platforms.

Visualizing biological networks

Tools like Cytoscape (Shannon et al. 2003) and NAViGaTOR (Brown et al. 2009; McGuffin and Jurisica 2009; Viau et al. 2010) allow biologists to visualize protein interactions and perform network analyses using an intuitive graphical interface, and have been widely used for cancer signature interpretation and development. Effective network visualization tools integrate several relevant resources; e.g., NAViGaTOR users can choose to supplement and annotate network nodes and edges with additional data from several sources, including I2D PPIs, PSICQUIC (Orchard et al. 2010), STRING (Szklarczyk et al. 2011), Gene Ontology categories, biological pathways from KEGG, PhosphoSite (Hornbeck et al. 2004), Pathway Commons (Cerami et al. 2011), Reactome, and WikiPathways (Pico et al. 2008), and gene associations from i-HOP (Hoffmann and Valencia 2005). NAViGaTOR also provides tools for network analysis, including several measures of centrality, methods for identifying motifs and communities, and random network and enrichment analyses. Different visualization tools can yield different and complementary biological information; Fig. 4 shows three alternative visualizations of MAPK3 using pathway databases and PPIs in NAViGaTOR. The MAP kinase signaling cascade is implicated in tumor growth and may be a key anti-cancer drug target (Roberts and Der 2007; Sebolt-Leopold and Herrera 2004).

Fig. 4

Different databases offer different and complementary perspectives on the same data. Querying different databases for the same gene, MAPK3 (with protein product MK03), results in pathway-related data with different structure and organization. Pathway data can be supplemented with protein–protein interaction data to provide more complete networks. a From KEGG (release 57.0), the MAPK signaling pathway (PATHWAY:hsa04010). b From Reactome (version 35), the MAP Kinase Cascade (REACT_634). c Protein–protein interaction network from I2D version 1.85, displayed in NAViGaTOR 2.2.1, and overlaid with pathway data from a and b. Nodes represent proteins unique to hsa04010 (blue), proteins unique to REACT_634 (green), and proteins present in both pathways (red). Edges represent links unique to hsa04010 (dark blue), REACT_634 (dark green), I2D (gray), or present in both hsa04010 and I2D (light blue), or REACT_634 and I2D (light green). The concentric circle layout uses Reactome nodes as a center, and organizes remaining nodes based on local connectivity

Web services for frequent updates

One of the main challenges faced by integrated databases is staying current: the knowledge and data in the source databases are changing all the time. Fortunately, many source databases provide access to their content through web services. Integrated databases and analysis tools can then query these databases in response to a user request and access the latest data. For example, NAViGaTOR uses web services to access pathway data from KEGG, Reactome, Pathway Commons (Cerami et al. 2011) and WikiPathways (Pico et al. 2008), protein interactions from the IMEx consortium (Orchard et al. 2007), and protein interactions and gene associations via i-HOP (Hoffmann and Valencia 2005) or STRING. Of course, not every relevant data set can be accessed through web services; a list of currently available web services is provided by the EMBRACE Registry (Pettifer et al. 2009).
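The access pattern is simple: query the service at analysis time instead of caching its contents locally. The sketch below shows the general shape of such a call; the endpoint URL is a placeholder, since each service (e.g., individual PSICQUIC implementations) defines its own URL scheme and response format.

```python
import json
import urllib.request

# Placeholder endpoint: substitute the URL scheme of the actual service.
URL = "https://example.org/api/interactions?gene={gene}&format=json"

def fetch_interactions(gene: str):
    """Query the web service at request time, so results reflect the
    source database's latest release rather than a stale local copy."""
    with urllib.request.urlopen(URL.format(gene=gene), timeout=30) as resp:
        return json.load(resp)

# interactions = fetch_interactions("MAPK3")
```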

Text mining for automated database annotation and quality-checking

Many of the data and observations gained from biological experiments are not available in an easily accessible form—they are buried in the text of journal articles. But techniques from text mining can be used to automatically extract biological knowledge from free text; these have been extensively reviewed in, e.g., (Altman et al. 2008; Cohen and Hunter 2008; Hoffmann et al. 2005; Jensen et al. 2006; Rodriguez-Esteban 2009). For integrated databases, text mining methods have proven especially valuable for automatically annotating and quality-checking HTP data. For example, text mining has been applied to annotate the I2D PPI database (Niu et al. 2010) and to verify predicted PPIs from FpClass (Kotlyar et al. 2011). Text mining can also supplement integrated databases with new knowledge; e.g., by extracting gene-drug relationships (Kuhn et al. 2010; von Eichborn et al. 2011).
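To give a flavor of the simplest such technique, the sketch below counts gene symbols co-mentioned in the same abstract, a crude first pass toward candidate relationships; real systems use named-entity recognition and relation extraction rather than plain dictionary matching.

```python
import itertools
import re
from collections import Counter

GENES = {"EGFR", "KRAS", "HER2", "MAPK3"}  # toy dictionary of gene symbols

def cooccurring_pairs(abstracts):
    """Count gene pairs mentioned together in the same abstract."""
    counts = Counter()
    for text in abstracts:
        found = {tok for tok in re.findall(r"\w+", text) if tok in GENES}
        counts.update(itertools.combinations(sorted(found), 2))
    return counts

abstracts = [
    "EGFR and KRAS mutations were mutually exclusive in this cohort.",
    "HER2 amplification activates MAPK3 signaling.",
]
print(cooccurring_pairs(abstracts))
```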

Standards and recommendations

Progress in HTP research relies on well-designed and widely adopted standards for how data are produced, managed, and analyzed. These standards provide a consistent way of dealing with large volumes of data, help reduce noise, and ensure reproducibility at both the experimental and computational levels.

Standards for HTP data and tools

Different research groups, databases, and analyses must use consistent reporting standards so their data can be effectively shared, integrated, and evaluated. In cancer research, standards are needed at several levels of data collection and analysis, from tumor collection, storage, and sample preparation to pre-processing and further computational study of the resulting HTP data.

For many HTP data types, there are now well-developed standards for data reporting, such as Minimum Information About a Microarray Experiment [MIAME (Brazma et al. 2001)], and large public repositories for raw data, such as GEO (Barrett et al. 2005) and the PRoteomics IDEntifications database [PRIDE (Jones et al. 2006)]. Data standards development is an ongoing area of research and has been extensively reviewed (Brazma et al. 2006; Brooksbank and Quackenbush 2006; Enkemann 2010). Some outstanding issues include data identifiers—e.g., the many-to-many mappings between genes from databases like EntrezGene and Ensembl—and open access to data. Though many high-impact journals now mandate that all HTP data and code associated with a publication be made open-access, this requirement is not universal, and sometimes the same journal will require microarray data submission but not proteomic data. Also, many articles that do publish their data make them available only in an impractical form, such as hundreds of pages in supplementary PDF files. We need new initiatives to standardize (and enforce) HTP data submission formats to ensure that they are machine-readable. This will enable wider use of data and new discoveries, and may also substantially improve the quality of research by reducing errors and mistakes (Baggerly and Coombes 2009; Carey and Stodden 2010).

For computational data analysis, we also need large-scale comparisons to provide a fair evaluation of different methods. In cancer research, these efforts depend on resources like GeneSigDB (Culhane et al. 2010), a curated database containing over 2,000 cancer gene signatures, and large well-annotated data sets such as the set of gene expression profiles for over 400 non-small cell lung cancer adenocarcinoma tumors provided by the Director’s Challenge Consortium for the Molecular Classification of Lung Adenocarcinoma (Shedden et al. 2008). New initiatives such as TCGA and ICGC are expanding cancer profiles across multiple different platforms and tumor types. Making the data and related clinical information publicly available will significantly contribute to our understanding of the molecular changes in cancer, enabling new discoveries as well as more comprehensive validation of novel prognostic and predictive signatures.

Recommendations for biological experiments

The quality of any computational analysis is limited by the data that are available. Considering intra- and inter-tumor heterogeneity, we may sometimes have too few samples or too much noise to be able to develop cancer signatures with the required accuracy and reproducibility. In this case, computational analyses of the available data can at least provide guidelines as to the kinds of experiments and data that are needed. For example, one study estimated that to achieve 50% overlap in prognostic gene sets for breast cancer patients would require several thousand samples (Ein-Dor et al. 2006), and other work created a general tool to calculate the number of samples needed to train a classifier as a function of normalized fold change, class prevalence, and the number of genes on an array (Dobbin et al. 2008). Much of the human PPI network, including PPIs of known cancer genes, remains uncharacterized. Past work shows that a reliable confidence score can be associated with an interaction by feeding the output of four different complementary HTP assays into a logistic regression model (Braun et al. 2009). Also, high-confidence PPIs predicted by various computational methods can focus future experiments and reduce their false positives and false negatives (Kotlyar and Jurisica 2006; Kotlyar et al. 2011).

A directory of tools and resources for integrative computational cancer biology

Below is a brief directory of some of the major resources for integrative computational biology that appear in this review.

Biological pathway annotations and related data

Gene Ontology, KEGG, Reactome, Pathway Commons, WikiPathways, PhosphoSite, i-HOP

Experimental data

Protein–protein interactions

I2D, BioGRID, BIND, DIP, HPRD, InnateDB, IntAct, MINT, STRING, PSICQUIC, the IMEx consortium

Gene expression

GEO, Oncomine, CDIP, GeneSigDB

Mutations

COSMIC, the Sanger Cancer Gene Census, OMIM

Proteomics

PRIDE

microRNAs

mirDIP

Network visualization software

NAViGaTOR, Cytoscape

Gene function prediction

MouseFunc

Cancer genome initiatives

The International Cancer Genome Consortium (ICGC), The Cancer Genome Atlas (TCGA)