Genome-wide associations of signaling pathways in glioblastoma multiforme

eQTL analysis is a powerful method that allows the identification of causal genomic alterations, providing an explanation of expression changes of single genes. However, genes mediate their biological roles in groups rather than in isolation, prompting us to extend the concept of eQTLs to whole gene pathways. We combined matched genomic alteration and gene expression data of glioblastoma patients and determined associations between the expression of signaling pathways and genomic copy number alterations with a non-linear machine learning approach. Expectedly, over-expressed pathways were largely associated to tag-loci on chromosomes with signature alterations. Surprisingly, tag-loci that were associated to under-expressed pathways were largely placed on other chromosomes, an observation that held for composite effects between chromosomes as well. Indicating their biological relevance, identified genomic regions were highly enriched with genes having a reported driving role in gliomas. Furthermore, we found pathways that were significantly enriched with such driver genes. Driver genes and their associated pathways may represent a functional core that drive the tumor emergence and govern the signaling apparatus in GBMs. In addition, such associations may be indicative of drug combinations for the treatment of brain tumors that follow similar patterns of common and diverging alterations.


Background
Gliomas represent a heterogeneous family of primary brain tumors that are a significant cause of cancer mortality in the United States [1] with glioblastoma multiforme (GBM) as their most aggressive form. While gliomas strongly differ in their geno-and phenotype, genetic and molecular heterogeneities contribute to the biological and clinical behaviour of different glioma subtypes. The availability of high-throughput gene expression profiles [2][3][4] provided the opportunity for a quantitative characterization of individual tumors and their classification [5][6][7]. Recently, several groups have identified subnetworks and pathway-based features that are associated with certain GBM types [8][9][10][11] as well as utilized interactions to identify driver genes [12].
The genomic set-up of GBMs is increasingly well characterized [11,13,14], allowing the identification of certain signature alterations. In addition, correlations between changed expression levels of genes and their corresponding genomic alterations are currently investigated [15,16]. However, genomic profiling poses a significant challenge to uncover driving genomic alterations from the large number of deletions and amplifications present in cancer genomes.
The use of microarray technology to simultaneously measure expression of many different genes has been a driving force for the systematic mapping of eQTLs [17,18], since gene expression in many individuals is the substrate for investigating the effects of genomic changes on the expression of individual genes. While some eQTL analyses of human brain tissue have been recently reported [19], eQTL studies have also been combined with network analyses to identify transcription modules of disease-related, coexpressed genes [20][21][22][23] and to find causal pathways in glioblastomas [24].
To account for the observation that biological functions are mediated by groups of genes, we determined associations between the expression of pathways and genomic copy number alterations with a machine learning approach. While large signature alterations were driving the association patterns of over-expressed pathways, we found the opposite for under-expressed pathways, an observation that held for composite effects between chromosomal alterations as well. Confirming their biological relevance, identified regions were enriched with driver genes that play a role in gliomas. As a consequence, we observed pathways that were significantly enriched with such driver genes. We conclude that such pathways may indicate a functional core that governs the signaling machinery and tumor emergence in GBMs.

Determination of pathway associations
We used gene expression profiles of 158 Glioblastoma Multiforme (GBM) patient and 21 non-tumor control samples from epilepsy patients that were collected from the NCI-sponsored Glioma Molecular Diagnostic Initiative (GMDI) and from Henry-Ford hospital (HF) [13,25]. Accounting for the observation that genes perform their biological functions as an assembly of genes rather than in isolation, we collected 181 signaling pathways from the PID database [26]. Utilizing Gene Set Enrichment Analysis (GSEA) [27] we compared GBM to non-tumor control samples and found 119 over-expressed pathways with a positive enrichment score. Moreover, we obtained 62 under-expressed pathways with a negative enrichment score. We further determined subsets of genes in each signaling pathway that govern the pathways over/under expression in the disease cases ( Figure 1A). Such 'leading edge genes' were defined as subsets of genes that appeared in an expression ranked gene list before the enrichment score of a given pathway reached its maximum [27]. Representing each pathway by its corresponding set of leading edge genes we assigned a sample specific expression fold change score to each pathway. In particular, we defined such a score of pathway p in disease sample j as A p;j ¼ , where E i,j is the expression value of gene i in disease sample j, and E N i is the average expression of gene i in the set of control samples ( Figure 1A).
Searching for genomic loci that potentially play a role in the underlying expression phenotype, we determined associations between the expression fold change scores of pathways and copy number variations of genomic loci. Since genomic variations in neighboring regions tend to be highly correlated, we first chose a subset of 1,510 representative loci (i.e. tag-loci) in GBMs. Specifically, we represented each locus as a x-dimensional vector of copy number alterations in the corresponding x = 158 patient samples. Focusing on a potential tag-locus, we greedily accumulated all consecutive loci, so that the Pearson's correlation coefficient of any consecutive loci in the region was > 0.95 [24]. While the number of genes a tag-locus can harbor varied strongly we found an average of 6.1 genes per tag-locus, a number that is comparable to the median of 6.5 genes in pooled analyses of human cancers [14].
We searched for genome-wide associations by nonlinearly fitting pathway fold change scores as a function of tag-loci's specific copy number alterations in all GBM samples ( Figure 1B). We represented copy number alterations CNA of a tag-locus i in sample j as lg 2 CNA i,j and applied random forest algorithm to assess the impact of a tag-locus on the regression process by its normalized importance score. Reflecting the increase of the prediction error when the given locus is omitted in the regression process, we defined the normalized importance as is the average importance, and σ i (p) is the standard error of a tag-locus i for a given pathway p.
To assess the statistical significance of the normalized importance of each locus and pathway pair we randomized sample-specific pathway fold changes and copy number alterations. We applied a Z-test to null distributions thus obtained ( Figure 1C) and calculated a P-value for each tag-locus/pathway pair. Correcting all P-values by their corresponding false-discovery rate [28] we used FDR < 0.05 as a threshold to define a significant association. While we found 504 significant associations between 109 over-expressed pathways and 267 tag-loci we observed 471 associations between 56 under-expressed pathways and 209 tag-loci (Additional file 1 Table S1).

Analysis of associations
As a benchmark we show a profile of genomic alterations in glioblastomas in Figure 2A. Specifically, we determined the frequency of patients with |CNA i | > 1.5 at each taglocus i, allowing us to observe large signature areas of genomic amplifications on chromosome 7 and deletions on chromosome 10. In Figure 2B, we show the distribution of FDRs of all tag-locus/pathway associations. While associations to over-expressed pathways largely coincided with signature alterations, we observed strong associations to under-expressed pathways that mostly appeared on chromosome 4. PDGFRA, KIT, and KDR genes that are located on the amplified segment 4 q12 probably play an important role in tumor biology due to their increased expression of receptors and their ligands. Specifically, Imatinib mesylate targets PDGF receptors while KIT was indicated as a mediator of anti-tumor activity in patients with recurrent GBM [29]. Such results were confirmed in Figure 2C where we plotted the number of different pathways that were significantly associated with tag-loci (FDR < 0.05).
Determining associated genomic areas we counted the number of pathways that mutually shared tag-loci. We binned loci according to their corresponding chromosomes and pooled all pathways that were significantly associated with tag-loci on the corresponding chromosomes. We observed a pronounced cluster of chromosomes, pointing to genomic alterations that were associated to the same overlapping sets of over-expressed pathways ( Figure 3A). Specifically, we found that most pathways were shared between tag-loci on chromosomes 7 and 10. In turn, tag-loci on chromosomes 1, 2, 4, 14, 16 and 21 shared numerous under-expressed pathways, suggesting that composite effects between associated tag-loci largely follow the initial patterns of single associated loci ( Figure 3B).

GRAIL analysis
Since each tag-locus on average harbored more than 6 genes we used GRAIL algorithm [30] to investigate the relevance of such identified genomic regions based on previous knowledge about glioma specific disease regions. Utilizing co-reports of genes in PubMed abstracts, GRAIL explores genes in candidate and reference genomic regions and automatically assesses their degree of relatedness. As references we used a list of genes that are commonly altered in gliomas [2,31] (Additional file 2  Table S2), allowing us to identify potential candidate (driving) genes. As for associated over-expressed pathways, we found 87 tag-loci with genes that were significantly similar to genes in the reference regions and associated to over-expressed pathways (GRAIL P-value < 0 .05, Additional file 3 Table S3). In turn, we found 67 such tag-loci with associated under-expressed pathways (Additional file 3 Table S3). In particular, we show such loci that were associated to more than one pathway in Table 1. Generally, tag-loci that were associated to many pathways were highly enriched with genes that were previously reported to have a driving role in the biology of brain tumors. Qualitatively, genes that were associated to overexpressed pathways included prominent signaling and regulation genes that are involved in receptor tyrosine kinase (RTK) signaling (EGFR, EGF, KRAS, PTEN, FRAP1, Determining leading edge genes that govern their over/under expression when comparing disease to control cases we represented each pathway by its expression fold change score. (B) We fitted pathway fold change scores as a function of the corresponding copy number variations of tag-loci. Using random forest algorithm we obtained a normalized importance score of each locus/pathway pair. (C) To assess the statistical significance of a tag-locus' importance for fitting a pathways expression we performed permutation tests by randomizing both pathway fold change scores and copy number alterations. Focusing on such random distributions of importance scores we applied a Z-test to determine P-values, allowing us to assess the significance of an association between each tag-locus and pathway.
PIK3 subunits and NF1). In particular, the RTK pathway plays a role in the mediation of growth signals to enhance cell survival and proliferation. The most commonly affected gene in the RTK pathway is EGFR, which is amplified in as many as 45% of GBMs resulting in increased mRNA expression [2,32]. Other RTKs were also shown affected in GBMs, such as amplification of PDGFRA and cMET in 13% and 4%, respectively, and mutation of ERBB2 in 8% of cases [2].
As for driver genes that were located nearby tag-loci associated to under-expressed pathways, we show such links between associated genes and their corresponding underexpressed pathways in a heatmap in Figure 4. Ward clustering such a matrix, we observed a small cluster of genes that largely associated with membrane based pathways revolving around ephrin-A/EphA related pathways previously linked to GBMs [33]. In the cluster of genes that were largely differentially expressed (FDR < 0.05, Student's t-test) we found  prominent cancer-related genes such as EGF. Furthermore, we found CDH13, a calcium-dependent cell-cell adhesion gene that is associated with working memory performance in attention deficit disorders and a regulator of neural cell growth [34]. Also, we observed a member of the MAP kinase family, MAPK10, that plays regulatory roles in signaling pathways during neuronal apoptosis through its phosphorylation and nuclear localization [35]. PRDM2 is a tumor suppressor gene and a member of a nuclear histone/protein methyltransferase superfamily. Although the function of this protein has not been fully characterized, it may play a role in transcriptional regulation during neuronal differentiation and pathogenesis of retinoblastoma [36]. Finally, we observed ABCG2, a membrane-associated protein that is included in the superfamily of ATP-binding cassette (ABC) transporters. Specifically, this transporter has also been shown to play protective roles in blocking absorption at the blood-brain barrier [37]. The presence of many driver genes that appear in RTK signaling prompted us to determine pathways that were enriched with such driver genes. Utilizing Fisher's exact test, we found 14 pathways that were enriched with driver genes associated to over-expressed pathways (P < 0.05). Analogously, we obtained 12 pathways enriched with driver genes that were associated to under-expressed pathways (Table 2). Generally, such enriched pathways mainly revolved around ERBB1 signaling while PIK3 subunits and KRAS mostly drove their enrichment. Furthermore, Table 2 shows that EGFR appeared frequently among enriched pathways of genes that were associated to over-expressed pathways. In turn, EGF played this role when we focused on under-expressed pathways.

Discussion and conclusions
We applied a stepwise methodology to uncover genomic alterations that are informative of observed patterns of We annotated tag-loci that were significantly associated with more than one over-or under-expressed pathways in GBMs with their corresponding genes (GRAIL P < 0.05).
pathway activity changes in glioblastoma multiforme, providing a high-level picture of the cell's molecular phenotype. Usually, association studies suffer from a large number of tests, contributing to a massive multiple testing problem. In our case, we mitigated this issue by using a limited number of tag-loci. Furthermore, a low number of tested pathways contributed to lower statistical complexity as well, limiting the number of applied tests. While others have investigated the influence of copy number alterations on gene expression in GBMs before, such studies focused on single genes [38] to identify regulatory networks. Furthermore, other authors used networkbased approaches involving genes that were placed in areas of copy number alterations to identify candidate oncogenic, modular processes and driver genes [12]. Here, we investigated patterns that emerge from large-scale genomic eQTL-like associations to whole groups of genes. In particular, we represented each pathway by its corresponding leading edge genes, defined as subset of genes that govern the over/under expression of a pathway, comparing disease to control cases. Applying a non-linear eQTL approach we observed that genomic signature alterations of GBMs largely translated into elevated normalized importance scores of corresponding tag-loci and high frequencies of associated pathways. As for over-expressed pathways, significantly associated tag-loci were largely limited to chromosomes 7 and 10, an expected result since alterations on chromosomes 7 and 10 belong to signature modifications in GBMs. Surprisingly, we observed the emergence of chromosome 4 as the major contributor of associations to under-expressed pathways while associations to tag-loci on chromosomes 7 and 10 were largely absent. Such an observation was rather unexpected as chromosome 4 lacks frequent copy number alterations, while its involvement has been shown only in a subset of GBMs [39]. Furthermore, we also found that composite effects between chromosomes that are associated to under-expressed pathways also involved a variety of other genomic locations. In turn, such observations remained limited to tag-loci on chromosomes 7 and 10 that were associated to over-expressed pathways.
Since genomic regions that were found to be frequently associated to pathways referred to known alterations, we performed an analysis of the relatedness of genes based on disease regions in gliomas, allowing us to identify potential driver genes. Qualitatively, we observed that some genes were already identified as driver genes in GBMs, indicating the relevance of the determined associations. Furthermore, we identified a small set of driver genes that were associated to under-expressed pathways. While such a set included EGF as a prominent driver gene we also found a variety of genes that have important neuronal functions. While their involvement in such a cluster suggests a composite effect with EGF, their prevalence in associations to under-expressed pathways may indicate a previously unknown role in GBMs as well.
While we observed that many observed driver genes were included in prominent signaling and regulation pathways we determined pathways that were enriched with such genes. Since we considered associations to signaling pathways such driver pathways may represent a core that governs the change of the signaling apparatus in GBM. In particular, we found 14 pathways that were enriched with genes associated to over-expressed pathways. Specifically, PIK3 subunits, KRAS and EGFR were frequently involved in such pathways. In turn, we Figure 4 Driver genes of under-expressed pathways. We mapped driver genes to their corresponding, associated under-expressed pathways. Specifically, we observed a cluster of genes that was significantly associated to signaling pathways, revolving around ephrin-A/EphA related pathways. In the small cluster, we identified genes such as CDH13, EGF and MAPK10 that play important roles in neurological functions. obtained 12 pathways enriched with genes that were associated to under-expressed pathways. While PIK3 subunits and KRAS were involved in these pathways too, we frequently found EGF instead of EGFR, an interesting observation given that the interaction of EGF and EGFR triggers many important signaling and regulation processes in human cancers.
These observed patterns of common and diverging genomic regions may indicate that a rational design of drug combinations for the treatment of brain tumors follows similar patterns of common and diverging alterations, generally pointing to avenues for the design of glioma subtype specific drug cocktails. In particular, our results suggest that therapy approaches may target different pathways simultaneously. Indeed, combination therapy with EGFR inhibitors [40] and drugs targeting the PI3K/AKT/PTEN pathway [41] were considered for the design of GBM specific drug cocktails.
Currently, we only accounted for genomic alterations, omitting other potential molecular causes for the emergence of GBMs. Further analysis of associated pathways will have to include other sources of molecular genomewide data. For example, methylation data may indicate other avenues that contribute to the expression regulation of pathways. Therefore, the integration of such data as variables may allow us to identify composite effects between methylation characteristics and genomic alterations that can influence the expression change of pathways and point to novel, previously unknown regulation mechanisms.

mRNA treatment
We investigated 158 glioblastoma multiforme patient and 21 non-tumor control samples from epilepsy patients from the Rembrandt database (https://caintegrator.nci.nih.gov/ caintegrator/) that were collected from the NCI-sponsored Glioma Molecular Diagnostic Initiative (GMDI) and from Henry-Ford hospital (HF) [13,25]. Using HG-U133 Plus 2.0 arrays, normalization was performed at the PM and MM probe level with dChip [25,42]. Using the average difference model to compute expression values, model-based expression levels were calculated with normalized probe level data, and negative average differences (MM > PM) were set to 0 after log-transforming expression values [25]. Accounting for weak signal intensities, all probe sets with more than 10% of zero log-transformed expression values were removed. To represent a gene, we chose the corresponding probeset with the highest mean intensity in each tumor subtype.

Determination of copy number alterations
Matching patient genomic data were collected from the Rembrandt database (https://caintegrator.nci.nih.gov/ caintegrator/) where all samples were hybridized on Genechip Human Mapping 100 K arrays. Copy numbers were calculated using Affymetrix Copy Number Analysis Tool (CNAT 4). After probe-level normalization and Significantly pooling driver genes that were associated to over-expressed pathways, we found 14 pathways applying Fisher's exact test (P < 0.05). Analogously, we observed 12 pathways enriched with driver genes that were associated to under-expressed pathways. We annotated all pathways with their corresponding driver genes.
summarization calculated log 2 -tranformed ratios were used to estimate raw copy numbers. Using a Gaussian approach raw SNP profiles were smoothed (> 500 kb window by default) [13,43,44].