Data integration to prioritize drugs using genomics and curated data

Background Genomic alterations affecting drug target proteins occur in several tumor types and are prime candidates for patient-specific tailored treatments. Increasingly, patients likely to benefit from targeted cancer therapy are selected based on molecular alterations. The selection of a precision therapy benefiting most patients is challenging but can be enhanced with integration of multiple types of molecular data. Data integration approaches for drug prioritization have successfully integrated diverse molecular data but do not take full advantage of existing data and literature. Results We have built a knowledge-base which connects data from public databases with molecular results from over 2200 tumors, signaling pathways and drug-target databases. Moreover, we have developed a data mining algorithm to effectively utilize this heterogeneous knowledge-base. Our algorithm is designed to facilitate retargeting of existing drugs by stratifying samples and prioritizing drug targets. We analyzed 797 primary tumors from The Cancer Genome Atlas breast and ovarian cancer cohorts using our framework. FGFR, CDK and HER2 inhibitors were prioritized in breast and ovarian data sets. Estrogen receptor positive breast tumors showed potential sensitivity to targeted inhibitors of FGFR due to activation of FGFR3. Conclusions Our results suggest that computational sample stratification selects potentially sensitive samples for targeted therapies and can aid in precision medicine drug repositioning. Source code is available from http://csblcanges.fimm.fi/GOPredict/. Electronic supplementary material The online version of this article (doi:10.1186/s13040-016-0097-1) contains supplementary material, which is available to authorized users.

Use database to calculate gene K-ranks (a) Figure 1: Details of GOPredict scoring. Circles are genes, boxes are GO processes and hexagons are drugs. Bold blue borders denote entity for which a score is being calculated. Green background denotes an entity whose scores have already been calculated. (a) Calculate K-ranks for genes. For example, the fibroblast growth factor receptor 3 (FGFR3 ) has a rank in two studies which are used to calculate its K-rank. (b) Calculate essentiality scores for GO processes using gene K-ranks. For example, 1,130 genes negatively and 939 positively regulate 'cell development' (GO:0048468) and have a K-rank in the database. Four genes are depicted for clarity. (c) Calibrate gene scores based on GO process essentiality scores: neighborhood of genes is expanded and genes without drugs are removed. For example, FGFR3 unambiguously regulates 17 processes (9 positively, 8 negatively). Only two are depicted for clarity. (d) Prioritize drugs based on calibrated gene scores.

Gene cancer-essentiality (K-rank)
A byproduct of the knowledge-base design is that it allows defining hypothesis-driven selection of study sets. Here we used GOPredict to characterize the cancer-essentiality of genes in activating, inactivating and survival-associated study sets. A full list of studies in each study set is in Additional file 2.
We quantified cancer-essentiality for a gene with the K-rank separately for activating, inactivating and survival-associated study sets. Gene scores per study set are listed in Additional file 3. For the survival-associated data, we used studies containing results of univariate survival analyses. Different study sets contain different numbers of genes due to study specific inclusion criteria (Methods).
Out of 14,746 genes, 883 were significantly associated with activating changes (Krank P < 0.05). Out of 17,560 genes, 1,245 were significantly associated with inactivating changes (K-rank P < 0.05). Seventy-five genes including FGFR3 and ERBB2 appear in both sets.
Out of 3,365 genes, 123 were significantly related to cancer survival in more than two studies (K-rank P < 0.05). Thirteen of these genes also appear in the inactivation and eight in the activation list. Of note, EGFR is significant in all three study sets: activating (2nd highest, P = 7.0 × 10 −6 ), inactivating (11th, P = 0.0003) and survival-associated (7th, P = 0.003).
To summarize, the genes which GOPredict characterized to be cancer-essential include known cancer genes such as EGFR as well as genes not previously associated with cancer.

Cancer-essential GO processes are closely connected to cancer hallmarks
In addition to cancer-essentiality of genes, GOPredict quantifies the cancer-essentiality of GO processes. Out of 1,178 GO processes in our database, genes in the combined activating, inactivating and survival-associated set participated in 890 GO processes. Of the 7,474 regulatory relationships between genes and GO processes 4,343 were activating and 3,131 inhibitory.
Most genes regulated a small number of GO processes (mean 2.84, range 1-37). Interestingly, the 309 genes (12% of genes) which regulated more than five GO processes accounted for 43% of all activating interactions (Fisher's exact test P = 0.0007). Each GO process had 1 to 590 regulatory relationships (mean 10.3). Each GO process was regulated by a mean of 4.3 inhibitors and 6.0 activators which differed significantly (Mann-Whitney-U test P = 2.1 × 10 −8 ).
Cancer-essentiality of a GO process was quantified separately for activation and inactivation. Thirty GO processes were both activated and inhibited in cancer (activation P < 0.0001 and inhibition P < 0.0001) and contain several GO processes highly connected to cancer hallmarks [1] (Additional file 3). Interestingly, the fibroblast growth factor receptor signaling pathway (GO:0008543), which is known to play diverse roles in cancer in general [2] and female cancers in particular [3] was among the high scoring GO processes. In summary, the GO processes prioritized by our algorithm are known to be highly relevant for cancer as they capture known breast cancer as well as cancer-hallmark processes.

Cancer gene essentiality enables defining novel multivariate survival co-variates
GOPredict produces several by-product results when prioritizing drugs. For example, quantifying gene cancer-essentiality -an intermediary result of GOPredict -enables finding novel cancer gene candidates even when a gene is poorly characterized. As a proof of concept, we looked for genes that had received high scores through alterations in both ovarian and breast cancer study sets.
Three genes -SLC25A32, PYCR1 and OSR2 -were altered in one or more study sets of both cancers. For each gene separately, we built a multivariate Cox survival model with OVCA gene expression status (up/down/normal) of the gene, tumor grade, FIGO stage, and residual tumor size as co-variates. The overexpression of SLC25A32 (ANOVA P = 0.003) and lack of residual tumor (ANOVA P = 0.02) were significant independent predictors of poor survival in TCGA OVCA (Table 1).

GOPredict is robust to changes in study sets
We utilized GOPredict to analyze study sets which we defined to be activating, inactivating or survival-associated based on the type of studies in each set. To analyze the sensitivity of GOPredict to these choices, we added three TCGA methylation studies (BRCA, OVCA and COAD) to test the effect of changing study sets on the K-ranks and drug predictions in the BRCA ERBB2 amplified query data set. The survival-associated K-ranks show good correlation (Spearman's ρ = 0.97). Analyzing the highest ranked 1,000 genes alone, K-ranks show very good correlation (Spearman's ρ = 0.996). Drug scores are also highly correlated (Spearman's ρ = 0.86) and the best scoring 10% of drugs (31 drugs) show an overlap of 90% (27/31) between the two analyses. Furthermore, to test the effect of removing study sets on the results, we analyzed the BRCA ERBB2 amplified query data set without the two Census studies. We then compared results with and without the Census data. Unadjusted K-ranks are highly correlated both for activating (Spearman's ρ = 0.98) and inactivating study sets (Spearman's ρ = 0.99). Furthermore, drug scores show high correlation (Spearman's ρ = 0.95). Thus, our results suggest that the Census studies are beneficial to include in the analysis but are not a major source of noise.
These analyses suggest that GOPredict scoring shows robustness to changes in study sets.

Extended Methods
All data were analyzed using the Anduril framework and Moksiskaan database integration tool [4,5]. All annotations were based on the Ensembl GRCh37.

Preprocessing and analysis of in-house and curated data
First, we analyzed in-house four different cancers downloaded from The Cancer Genome Atlas (TCGA). TCGA provides a large set of tumor samples and related clinical data for various cancers [6]. We have previously analyzed gene-expression, copy-number and DNA methylation alterations in glioblastoma multiforme (GBM) [4] and serous ovarian carcinoma (OV) [7], and we also analyzed respective data in breast invasive carcinoma (BRCA) and colon adenocarcinoma (COAD). Furthermore, in all these data sets we analyzed the impact of gene expression, copy-number-alteration and DNA methylation on patient survival. Statistical significance of survival was assessed in every instance using the log-rank test.
The high-confidence results from these analyses are stored in our database and these results are then used to calculate gene K-ranks and essentiality scores.

In-house data
We downloaded from the Cancer Genome Atlas two types of gene expression data (Agilent and AffyMetrix Exon array) for all four cancers. When both types of data were available for a cancer, only exon array data were used.
Expression data We preprocessed Agilent expression data (BRCA, COAD). First, probes matching either multiple or no genes were removed. Then, data were normalized to a mean of 0. Exon arrays (OV, GBM) were normalized and gene expression values quantified with the Multiple Exon Array Preprocessing algorithm (MEAP) [13]. After normalization, both platforms were further processed identically. For each gene, the genes was considered up-or downregulated in a sample if the gene's expression was further than three standard deviations from the median of control samples. We then grouped samples according to this up/downregulation data and analyzed the survival predictive power for each gene separately.
Copy-number data Copy-number data were also available from two platforms: AffyMetrix 6.0 SNP arrays (BRCA, COAD) and Agilent CGH (GBM, OV). When both platforms were available for a cancer, we used Agilent data. Copy-number data from AffyMetrix 6.0 SNP arrays were extracted with the R package crlmm [14]. Samples with signal-tonoise ratio of less than 5 were removed. Moreover, probes with confidence limit less than 0.9 were removed. Copy-number data from Agilent comparative genomic hybridization arrays were preprocessed as previously described in [4]. Data from both array platforms were segmented using circular binary segmentation [15]. Copy-number calls were made in two ways. For SNP arrays (BRCA, COAD), CNA were called when the copy-number was further than 0.3 from the diploid state. For Agilent CGH arrays (GBM, OV), copynumbers were called as described in [7]. Copy-number calls per gene were used to group samples for survival analysis so that deleted or amplified samples were compared against the non-deleted or non-amplified group of samples.
Methylation data We downloaded level 3 methylation data (beta values) and transformed them into M-values [16]. This transformation makes the methylation value distribution normal and enables us to use the t-test for methylation change significance analysis [16]. For each gene, if the genewise methylation difference between the median methylation of control samples and a tumor sample was more than 2 than the sample was grouped into hypo-or hypermethylated sample group. As in gene expression survival analysis, these groups were then used to determine methylation induced survival differences.
Database inclusion criteria All results were deposited genewise into our database.
We filtered results to be deposited based on study-specific criteria. A gene's expression fold-change value was deposited into the database if the gene's q ≤ 0.001 (t-test, Benjamini-Yakutieli multiple hypothesis correction). Similarly, survival analysis based on gene-expression data were deposited if the log-rank p ≤ 0.01. For methylation data, methylation fold-changes were deposited if the q ≤ 0.001 and methylation based log-rank survival p ≤ 0.01. Copy-number alteration frequencies were deposited if the frequency exceeded 10%. Copy-number alteration based survival analysis results were saved if log-rank p ≤ 0.01.

Curated data
The Catalogue of Somatic Mutation in Cancer (COSMIC) database is a collection of somatic aberrations of cancer genomes [8]. We obtained gene specific mutation frequencies from COSMIC with the Biomart interface. Tumorscape collects results of copy-number analyses of human cancers [9]. We included ten cancer types (breast, colorectal, glioma, hepatocellular, lung non-small cell carcinoma, lung squamous cell carcinoma, melanoma, ovarian, prostate, and renal) to our analyses. For each gene, we recorded whether it was amplified or deleted in a cancer type.
The Cancer Gene Census database provides information about causative mutations in cancer [10]. We used this database to identify genes, which are frequently activated (amplification, translocation) or inactivated (copy-number deletion, or missense-, nonsense-or splice-site-mutation) in cancer. Similarly, we obtained a list of genes for which amplification had been causally linked with the gene's overexpression and found a set of 77 frequently amplified and overexpressed genes in human cancers [11]. Lastly, we found a set of differentially expressed genes related to brain metastases in breast cancer [12]. We have included all 26 genes to this study.
COSMIC Mutation frequencies for each gene were saved if a mutation had been observed in at least 20 samples and the mutation ratio was at least 10%. Copy-number alterations from Tumorscape were saved if the genewise alteration frequency exceeded 10% and GISTIC q ≤ 0.25.

Preprocessing breast and ovarian cancer query data sets
Gene expression data We downloaded gene expression microarrays from the Cancer Genome Atlas for 524 primary breast carcinoma tumors and 59 controls [17]. First, probes matching either multiple or no genes were removed. Then, data were normalized to a mean of 0. In addition, we downloaded exon expression microarrays for 491 ovarian serous adenocarcinoma tumors and 10 controls [18]. Exon arrays were normalized and gene expression values quantified with the Multiple Exon Array Preprocessing algorithm (MEAP) [13]. Each gene in each data set was assigned one of three states: upregulated, downregulated or normal. For both data sets, a gene was considered upregulated in a sample if that sample's expression was more than three standard deviations over the median of normal samples. Similarly, a gene was considered downregulated in a sample if that sample's expression was more than three standard deviations below the median of normal samples.
Genomic data Copy-number data from AffyMetrix 6.0 SNP arrays were extracted with the R package crlmm [14]. Samples with signal-to-noise ratio of less than 5 were removed. Moreover, probes with confidence limit less than 0.9 were removed. Copy-number data from Agilent comparative genomic hybridization (CGH) arrays were preprocessed as previously described [4]. Data from both array platforms were segmented using circular binary segmentation [15]. Copy-number calls were made in two ways. For SNP arrays (BRCA), CNA were called when the copy-number was further than 0.3 from the diploid state. For Agilent CGH arrays (OV), copy-numbers were called as previously described [7]. For each gene, copy-number was assigned to be either amplified, deleted, or unchanged. Mutation data were downloaded from TCGA and synonymous mutations were removed. All mutations were considered loss-of-function mutations. Mutation and copy-number were fused before building the activity matrix.
After preprocessing individual data levels, 497 BRCA and 390 OV samples with all three data types remained for the construction of activity matrices.

Survival analysis
To test the prognostic power of selected cancer-essential genes, we built a Cox proportional hazards model with the R package survival. The end-point was overall survival and events were defined as the patient vital status at last follow up. We developed a model for OVCA only for which there were 421 samples available with a total of 239 events.
The covariates included to the Cox model were the expression status of a gene (activated, inactivated or unchanged when compared to control), tumor grade, tumor FIGO stage, age at diagnosis (over 50yo versus at most 50yo), and the size of the residual tumor. All co-variates were categorical. Age was not included because it did not pass the proportional hazards assumption unlike the other covariates.

Database ranking and the GOPredict algorithm
The result database is an integrated database which contains relationships between bioentities such as genes, proteins, drugs, and biological processes [5]. The information has been integrated from multiple primary source databases including Ensembl, WikiPathways, Gene Ontology and PINA as well as drug target information from KEGG and DrugBank [5]. A complete and up-to-date list of databases included in the database is available from the website http://csbi.ltdk.helsinki.fi/moksiskaan/.

Overview of gene ranking in the database
The database contains analysis results in addition to relationships between bioentities. The results (studies) are stored as sets of bioentities (e.g., genes) with a study specific value attached to each of them. For the GOPredict analysis, we extensively analyzed four TCGA data sets: ovarian [7], glioblastoma [4], breast and colorectal cancers. Briefly, data for gene copy-number (SNP and Array CGH), transcriptional expression, DNA methylation and clinical data are analyzed and the high-confidence results are stored in the database. For example, ERBB2 is overexpressed in our TCGA breast cancer transcriptomic analysis and subsequently stored in the database as an overexpressed bioentity (gene) in a specific study (TCGA breast cancer transcriptomic analysis). Similar ranks have been derived for all genes from other analyses. The per-gene database ranks are used to calculate for each gene a cancer essentiality score called the K-rank (Figure 1).

Overview of GOPredict
Details of GOPredict are shown in Figure 1. Study specific gene K-ranks are calculated for three subsets of studies in the database ( Table 2). These gene specific K-ranks are then mapped to Gene Ontology processes, and recalibrated K-ranks or cancer essentiality scores for genes are calculated based on the process scores. Finally, the recalibrated Kranks are used to determine drug prioritization scores. Drugs targeting genes, which are deleted, do not influence the score calculation.

Derivation of gene ranks
Let C denote the set of all studies stored in the Moksiskaan database (2) and G denote all genes in the database (note that this set contains all primary assembly Ensembl v.70 genes without patches). Every study in the database is a set of result values (e.g., copy-number alteration frequencies, gene expression fold-change values, somatic mutation frequency according to COSMIC) for genes. Let study m ∈ C and gene g ∈ G, and let V ∈ R G×C contain the scores of all genes in all studies. Moreover, let v g,m ∈ V be the score of gene g in study m.
Since scores from different studies originate from different measurements with different dynamic ranges, the most robust way of combining them is by ranking them according to their relative score in each study. We also have to take ties into account because gene scores can contain duplicated values.
The function o + : G × C → [0, |G| − 1] provides the number of values in V which are greater than the given (x,y): The rank values W ∈ R G×C are calculated for each g ∈ G as follows: where o=(g,m)−1 2 balances the effect of duplicate scores by averaging the number of scores equal to v g,m (not including g itself, hence subtracting one).
Normalized rank values (N ∈ R G×C ) for each m are produced by scaling the sum of each column W :,m :

Calculating the K-rank
Recall that m ∈ C and g ∈ G, and let k ∈ R G×1 be a matrix of studywise importance scores of genes. Now k is derived from the normalized ranks of a particular set of studies C ⊂ C: Similarly to k, let m ∈ R G×1 , o ∈ R G×1 and r ∈ R G×1 be matrices derived from study subsets of C: the activated, inactivated and survival-associated subsets (as listed in Table 2). These score matrices represent three separate scores: up-regulation score m indicating activation of a gene in cancer, down-regulation score o representing deactivation of a gene in cancer, and an unsigned survival-associated score r.

Essentiality scores
We start by defining the permutation test function which will be used to assess statistical significance of the scores. Let perm(a, y, Γ) : R → R j×1 be the permutation test function, where j is the number of tests; a ∈ R j×1 is the vector of observed values of the test statistic; y ∈ N j×1 is a vector of sample sizes which is randomly sampled for each a j ∈ a; and Γ is the sample space from which y j ∈ y values are drawn in each permutation for each j. Moreover, let 5,000 be the number of permutations.

Calculating Gene Ontology process essentiality scores
Let u = m + r, u ∈ R G×1 be the upregulation scores for all genes g ∈ G and d = o + r, d ∈ R G×1 the downregulation scores. In addition, let B be the set of biological process in Gene Ontology (GO) and let b ∈ B be a biological process (Fig 1a). Moreover, let R + ∈ {0, 1} G×B and R − ∈ {0, 1} G×B be binary matrices such that Essentiality scores s2 + ∈ R B×1 and s2 − ∈ R B×1 for GO processes (Fig 1b) are row sums of the product of a score vector (u or d) and its regulation matrix (R + or R − ): and Note that R + :,b ∈ N B×1 is the vector of all positively regulating edge counts of GO processes b and similarly R − :,b ∈ N B×1 for negatively regulating. P-value vectors p2 + ∈ R B×1 and p2 − ∈ R B×1 for s2 + and s2 − are calculated as and Recalibrating the K-rank Let B 1 and B 2 be sets of GO processes such that Recalibrated K-ranks or gene essentiality scores (Fig 1c) s3 + ∈ R G×1 and s3 − ∈ R G×1 are defined as Thus, essentiality scores s3 + g and s3 − g for a gene g are the harmonic means of all P-values of GO processes it regulates. P-values for all s3 + and s3 − are calculated by a permutation test and then transformed to logarithmic scale. First, let q = [ u d ] and p3 + ∈ R G×1 and p3 − ∈ R G×1 . Now and

Drug prioritization and activity matrix construction
In the final step, we add drugs and sample specific measurements to the network. Measurements of biological alterations of each sample are represented as graphs of genes and their activities. Let S be the set of samples. The activity status matrix S ∈ {−2, −1, 0, 1} G×S of sample genes g ∈ G and samples s ∈ S is inferred from the matrix of transcriptional activities E ∈ {−1, 0, 1} G×S integrated with DNA mutations and copy-number alterations A ∈ {−1, 0, 1} G×S . Furthermore, let e g,s ∈ E and α g,s ∈ A: 1, e g,s = 1 ∨ (e g,s / ∈ {−1, 0, 1} ∧ α g,s = 1) 0, e g,s = 0 −1, e g,s = −1 −2, α g,s < 0 NA otherwise. (12) Let D be the set of all drugs in Moksiskaan and d ∈ D. Drugs are added if they regulate genes in G. Let D ∈ {−1, 0, 1} D×G be a matrix of effects of drugs on the genes g ∈ G such that For each gene, we know the sets of samples that are either up-or downregulated in the sample set. Accordingly, let S + g = {∀x ∈ S | S g,x > 0} ⊆ S for upregulation and S − g = {∀y ∈ S | S g,y < 0} ⊆ S for downregulation. Observe that ∀g : be the set of genes with an activating drug and G − d = {∀x ∈ G | D d,x = −1} be the set of genes with an inactivating drug. The drug scores s4 + ∈ R D×1 and s4 − ∈ R D×1 (Fig 1d) for each drug d ∈ D over the whole sample set S are and Sorted lists of s4 + and s4 − scores yield the prioritized drugs with the largest effect in the input cohort.

Additional file 2
A spreadsheet file containing the list of studies.

Additional file 3
A spreadsheet file containing the cancer-essentiality results for genes and processes.

Additional file 4
A spreadsheet file containing the drug prioritization results.

Additional file 5
A ZIP-archive containing the analysis code.