An integrative characterization of recurrent molecular aberrations in glioblastoma genomes

Glioblastoma multiforme (GBM) is the most common and malignant primary brain tumor in adults. Decades of investigations and the recent effort of the Cancer Genome Atlas (TCGA) project have mapped many molecular alterations in GBM cells. Alterations on DNAs may dysregulate gene expressions and drive malignancy of tumors. It is thus important to uncover causal and statistical dependency between ‘effector’ molecular aberrations and ‘target’ gene expressions in GBMs. A rich collection of prior studies attempted to combine copy number variation (CNV) and mRNA expression data. However, systematic methods to integrate multiple types of cancer genomic data—gene mutations, single nucleotide polymorphisms, CNVs, DNA methylations, mRNA and microRNA expressions and clinical information—are relatively scarce. We proposed an algorithm to build ‘association modules’ linking effector molecular aberrations and target gene expressions and applied the module-finding algorithm to the integrated TCGA GBM data sets. The inferred association modules were validated by six tests using external information and datasets of central nervous system tumors: (i) indication of prognostic effects among patients; (ii) coherence of target gene expressions; (iii) retention of effector–target associations in external data sets; (iv) recurrence of effector molecular aberrations in GBM; (v) functional enrichment of target genes; and (vi) co-citations between effectors and targets. Modules associated with well-known molecular aberrations of GBM—such as chromosome 7 amplifications, chromosome 10 deletions, EGFR and NF1 mutations—passed the majority of the validation tests. Furthermore, several modules associated with less well-reported molecular aberrations—such as chromosome 11 CNVs, CD40, PLXNB1 and GSTM1 methylations, and mir-21 expressions—were also validated by external information. In particular, modules constituting trans-acting effects with chromosome 11 CNVs and cis-acting effects with chromosome 10 CNVs manifested strong negative and positive associations with survival times in brain tumors. By aligning the information of association modules with the established GBM subclasses based on transcription or methylation levels, we found each subclass possessed multiple concurrent molecular aberrations. Furthermore, the joint molecular characteristics derived from 16 association modules had prognostic power not explained away by the strong biomarker of CpG island methylator phenotypes. Functional and survival analyses indicated that immune/inflammatory responses and epithelial-mesenchymal transitions were among the most important determining processes of prognosis. Finally, we demonstrated that certain molecular aberrations uniquely recurred in GBM but were relatively rare in non-GBM glioma cells. These results justify the utility of an integrative analysis on cancer genomes and provide testable characterizations of driver aberration events in GBM.

thus important to uncover causal and statistical dependency between 'effector' molecular aberrations and 'target' gene expressions in GBMs. A rich collection of prior studies attempted to combine copy number variation (CNV) and mRNA expression data. However, systematic methods to integrate multiple types of cancer genomic data-gene mutations, single nucleotide polymorphisms, CNVs, DNA methylations, mRNA and microRNA expressions and clinical information-are relatively scarce. We proposed an algorithm to build 'association modules' linking effector molecular aberrations and target gene expressions and applied the module-finding algorithm to the integrated TCGA GBM data sets. The inferred association modules were validated by six tests using external information and datasets of central nervous system tumors: (i) indication of prognostic effects among patients; (ii) coherence of target gene expressions; (iii) retention of effector-target associations in external data sets; (iv) recurrence of effector molecular aberrations in GBM; (v) functional enrichment of target genes; and (vi) co-citations between effectors and targets. Modules associated with well-known molecular aberrations of GBM-such as chromosome 7 amplifications, chromosome 10 deletions, EGFR and NF1 mutations-passed the majority of the validation tests. Furthermore, several modules associated with less well-reported molecular aberrations-such as chromosome 11 CNVs, CD40, PLXNB1 and GSTM1 methylations, and mir-21 expressions-were also validated by external information. In particular, modules constituting trans-acting effects with chromosome 11 CNVs and cis-acting effects with chromosome 10 CNVs manifested strong negative and positive associations with survival times in brain tumors. By aligning the information of association modules with the established GBM subclasses based on transcription or methylation levels, we found each subclass possessed multiple concurrent molecular aberrations. Furthermore, the joint molecular characteristics derived from 16 association modules had prognostic power not explained away by the strong biomarker of CpG island methylator phenotypes. Functional and survival analyses indicated that immune/inflammatory responses and epithelialmesenchymal transitions were among the most important determining processes of prognosis. Finally, we demonstrated that certain molecular aberrations uniquely recurred in GBM but were relatively rare in non-GBM glioma cells. These results justify the utility of an integrative analysis on cancer genomes and provide testable characterizations of driver aberration events in GBM.

INTRODUCTION
Glioblastoma multiforme (GBM) is the most common and malignant primary brain tumor in adults. Patients diagnosed with GBM typically have short survival times ($1 year) and poor prognosis. Similar to other cancers, GBM cells harbor a large number of alterations at genetic, epigenetic, transcriptional and phenotypic levels [e.g. (1)(2)(3)(4)(5)(6)(7)]. The efforts of conducting a complete and comprehensive survey of cancer genomes were culminated in the Cancer Genome Atlas (TCGA) project (8).
Second, it is of great interest to unravel causal and mechanistic relations regarding molecular aberrations in tumor cells. Among a large number of molecular alterations, only a small fraction of them may drive malignancy of cancers. The remaining alterations are likely passengers caused by chromatin instability and dysregulation of the transcriptional/translational apparatus. Separating driver from passenger aberrations and identifying the causal and mechanistic links connecting them are two key questions in cancer genomics. Many studies attempted to decipher the causal/regulatory relations of genes from expression data and external information [e.g. (15)(16)(17)(18)(19)(20)].
Central dogma imposes a strong constraint on information flows from DNAs to proteins. Accordingly, many researchers seek associations connecting putative drivers on DNAs and passengers on mRNAs or proteins. Expression quantitative trait loci (eQTL) studies treat gene expression levels as traits and build association links with sequence variations on adjacent (cis-acting) or distant (trans-acting) loci [e.g. (21,22)]. In cancer genomes, there are rich studies building associations between copy number alterations (CNA) and mRNA expressions. Most of them identify cis-acting associations between amplified/ deleted DNA segments and up/downregulated genes on the same or adjacent loci [e.g. (23)(24)(25)(26)(27)]. Some of these studies capture both cis-acting and trans-acting associations in the same modeling framework [e.g. (28)(29)(30)]. Beyond copy number variation (CNV) and mRNA data, there are several studies incorporating mutations, microRNA expression, DNA methylations and protein interaction networks in the models [e.g. (8,31,32)]. Despite the utility of these methods, they suffer from two shortcomings. First, although some of these approaches can in principle incorporate multiple types of high-throughput data in the same modeling framework [e.g. (28)(29)(30)32)], none of them explicitly unifies the data of mutation, CNV, single nucleotide polymorphism (SNP), DNA methylation, mRNA and microRNA expressions in the same model. Second, different types of associations need to be prioritized, as they carry disparate levels of mechanistic information. This issue is either irrelevant (e.g. when only CNV and mRNA data are considered) or not addressed in the prior studies.
Recently, we proposed a modeling framework to build 'association modules' from integrative cancer genomic data sets (33,34). The aim was to find statistical and causal links connecting molecular aberrations on DNAs or microRNAs (sequence mutations, CNVs, DNA methylations, SNPs, microRNA expressions) to the expressions of protein-coding genes. In this work, we apply this modeling framework to reconstruct the association modules from the integrated TCGA GBM data. Compared with our previous publications, the contribution of this work has two folds. First, it systematically invokes six validation tests with both reported knowledge and external data sets of central nervous system tumors to justify the relevance of these modules pertaining to the underlying regulatory mechanisms and prognosis of GBM. Second, the inferred association modules confirm prominent molecular aberrations of GBM and also report the influences of less well-known molecular aberrations and reveal their strong prognostic power. These results justify the utility of an integrative analysis on cancer genomes and provide testable characterizations of driver aberration events in GBM.

Data sources and processing
The following multi-modal GBM data were downloaded from the TCGA data portal website (https://tcga-data.nci. nih.gov/tcga/): (i) an Affymetrix and an Agilent mRNA expression microarray data; (ii) two Agilent Comparative Genomic Hybridization (CGH) CNV array data; (iii) sequence data of 496 genes from three sources; (iv) one Illumina DNA methylation microarray data; (v) one Affymetrix and one Illumina SNP array data; (vi) one Agilent microRNA array data; and (vii) clinical information including ages, genders, dates of diagnosis and death (if applied), histological types, treatments of patients and others. Supplementary Table S1 summarizes the centers and platforms generating each type of data. Fifty-three genes were included in the mutation data, as they were probed in at least 50 samples and mutations in at least three samples were observed. Eight hundred seventy-five genes were included in DNA methylation data, as they were probed in at least 50 samples, and the numbers of hypo and hyper-methylated samples exceeded 10. The expression profiles of 22 697 mRNAs and 817 microRNAs were reported.
Each type of GBM data consists of distinct numbers of genes and samples. To generate a compatible joint data set for integrative analysis, we chose 248 samples appeared in all seven types of data and considered the union of all genes probed in each type of data. Supplementary Table  S2 lists the TCGA IDs of the 248 selected samples and the NCBI symbols of 22 697 selected genes and 817 microRNAs.
Both mRNA and CNV data consist of two replicates generated by distinct laboratories. In both mRNA and CNV data, probes of the same genes between the replicates possess significantly higher correlation coefficients than those generated from randomly selected probe pairs (Supplementary Figures S1 and S2). Consequently, we merged multiple instantiations of mRNA expression or CNV data and generated a single data set for each type. For mRNA expressions, we rank-transformed the probe data of each data set into cumulative distribution function (CDF) values, averaged over the intra-gene CDF values on the same samples and then averaged the gene-level CDF values from the two data sets.
We formulated the associations between molecular aberrations and gene expressions with logistic regression models (see the text later in the text and Supplementary Text 1). To establish a unified format of data for the models, each type of data was transformed into probabilities of discrete states. For data types of mRNA and microRNA expressions, CNVs, DNA methylations, we proposed a probabilistic quantization approach to preserve information of continuous values (34). Rank-transformed CDF values were mapped into tristate probabilities [P(upregulation),P (nochange),P(downregulation)] by integrating over a family of monotonic quantization functions (see Supplementary Text 1). The data of mutations and SNPs were by nature discrete; hence, each entry was expressed as a binary vector with the entire probability mass on the designated state.
As amplification/deletion events on DNAs often stretch over multiple contiguous probes of CGH arrays, adequate units of CNV data are contiguous segments rather than individual probes. We partitioned normalized CNV probe data into segments using a naive Bayes model. The model hypothesizes that the CGH array probes on the same segment have consistent values, and thus their CNV states all depend on a common hidden variable. An algorithm was devised to recursively partition each chromosome into segments that maximized the likelihood function of the CNV probe data. The partitioned boundaries from individual samples of the two replicate data sets were merged to form consistent segments. The detailed procedures of the partitioning algorithm are described in (34) and Supplementary Text 1.
In all, 1353 candidate regulators were extracted from three sources: transcription factors from the TRANSFAC database (35), cancer-related genes according to the annotations from the Online Mendelian Inheritance in Man (OMIM) database (36), transcription factors and signaling proteins from the FanTom database (37).
To verify the reproducibility of association modules, we downloaded eight data sets of mRNA expression and clinical information of CNS cancers from the NCBI Gene Expression Omnibus database. Supplementary Table S3 summarizes the information of external data sets.

Associations between molecular aberrations with gene expressions
We define an association module as a tuple consisting of three components: (i) observed effector molecular aberrations on DNAs or microRNAs; (ii) downstream target genes whose expression profiles are associated with effector molecular aberrations; and (iv) regulators (transcription factors or signaling proteins) that mediate the effects from effectors to targets. We consider the following seven types of associations and illustrate them in the left panel of Figure 1: (1) Cis-acting effects with CNVs of chromosomes. The CNV of a chromosome is positively associated with the expressions of its constituent genes. The expression profile of a target gene is possibly explained by one or multiple effectors. These associations are formulated as a logistic regression model. Denote x as effectors and y a target gene expression, where f i ðxÞ specifies the effect of the i th feature on y and i is a free parameter. Building an optimal association model is intractable due to combinatorial explosion. Therefore, we used two heuristics to streamline model selection. First, only the candidate effectors with significant marginal effects on the targets were considered. We incurred totally 1:936 Â 10 10 pairwise associations between candidate effectors and targets on 23 HP DL360 G7 servers in parallel. Each server contains dual Intel(R) Xeon(R) CPUs E5520 with 2.27 GHz and 24 GB main memory. The total running time was 50 h. In all, 2 135 755 pairwise associations were selected according to predetermined thresholds of log-likelihood ratios, P-values and correlation coefficients reported in Supplementary  Table S4. Second, not all types of molecular aberrations are equally likely to drive gene expressions. Some candidate effectors provide direct explanations for gene expressions without requiring many mechanistic assumptions underlying gene regulation (e.g. cis-acting effects with CNVs). Others have massive number of features thus are likely to introduce spurious associations (e.g. SNPs). We proposed a layered modeling framework to prioritize molecular aberrations and incrementally incorporate candidate effectors to the model according to their priorities. Starting with an empty model without any effector, we incrementally selected the effectors that provided additional explanatory power relative to the preceding model. Candidate effectors were incorporated in the model with the following order of priority: (1) Level 1 associations: Mutations and DNA methylations of the target gene; segment CNVs covering or near (within 1 Mb) the target gene, SNPs on the same chromosome as the target gene. The top-left panel of Figure 2 demonstrates a simple joint association model. The mRNA expression of BHLHE40 is jointly associated with NF1 mutation (positive) and RBP1 DNA methylation (negative). The union of the association models for all target genes constitute a bipartite network with links between effectors and targets. A small association network is displayed in the top-right panel of Figure 2. In principle, each effector and its neighbor targets in the bipartite network comprise an association module. However, many modules will be highly overlapped due to similarity of effector profiles and shared targets. Consequently, we consolidated association modules with the following steps. First, closely related effectors were integrated: segment CNVs on the same chromosomes were combined, whereas DNA methylations and microRNA expressions were clustered by a graph theorybased clustering algorithm (see Supplementary Text 1). For instance, in the bottom-left panel of Figure 2, DNA methylation profiles of 204 genes form three clusters. Targets of the merged effectors were also merged. Second, in addition to resembled effectors, modules with considerable overlap of targets were also mergeable. Two modules were mergeable if their overlapped targets either exceed either one-third of the smaller module or 50 genes. A simplified example is shown in the bottom-right panel of Figure 2. The modules of chromosome 11 CNV and ZMYND10, RBP1 and FES methylation are mergeable as the intersected targets (79 genes) comprise more than half of the smaller module (155 genes).
The right panel of Figure 1 summarizes the procedures of association module construction. Detailed descriptions of data normalization, CNV segmentation, pairwise associations, layered model selection and assembly of association modules are reported in Supplementary Text 1. We also include the source codes of the module construction procedures in Supplementary Text 3.

Validation of association modules
To justify the analysis results, it is essential to know whether the discovered association modules are statistically sound, biologically meaningful and reproducible in other data sets. We invoked six validation tests for each module to address these questions: (i) assessing the prognostic power of target gene expressions on survival times; (ii) verifying reproducible coherence of target gene expressions; (iii) verifying reproducible retention of effectortarget associations; (iv) identifying recurrent effector aberrations in glioblastomas; (v) evaluating the functional enrichment of targets; and (vi) checking co-citations between effectors and targets.

Assessing prognostic power of target gene expressions
In survival analysis, Cox regression coefficients gauge the dependency of patient's hazard function on observed features (38). Positive associations with survival times exhibit negative Cox regression coefficients. Here, we treated the mRNA expression of each gene as a covariate and estimated its Cox regression coefficient in TCGA and three external data sets of brain tumors: GSE16011 (4), GSE4412 (5), GSE7696 (39). For each association module, the distribution of Cox regression coefficients in its targets was compared with the background distribution of all genes. The prognostic power of an association module was measured by the P-value of the one-sided Kolmogorov-Smirnov (KS) test between the two Cox regression coefficient distributions. The modules significantly enriched with predictive targets (P-value 10 À8 ) in at least three data sets were reported.

Verifying coherence of target gene expressions
We evaluated the coherence of target gene expressions across the nine data sets. For each association module, we computed the distribution of correlation coefficients between target gene expressions in each data set. As a comparison, we extracted genes belonging to the Gene Ontology (GO) category of ribosomes (accession number 0005840) and computed the distribution of their expression correlation coefficients. The P-values of the one-sided KS tests between the distributions of target correlation coefficients and ribosome gene correlation coefficients were reported. Ribosome genes were chosen as a reference set, as they were strongly co-expressed across many tissue types and conditions (40). Consequently, the gene sets with higher correlation coefficients than the reference set should be strongly coherent.

Verifying associations between effectors and targets
We examined the associations between effectors and targets across the nine data sets. Information about effector molecular aberrations (CNVs, mutations, DNA methylations) was not available in the external data sets. Therefore, we used the expressions of effector genes as proxies of their molecular aberrations. For chromosome CNVs, we chose the expressions of genes in the CNV cisacting modules of the same chromosomes as proxies. For mutations and DNA methylations, we simply chose the expressions of the mutated/methylated genes as proxies. Associations with microRNA expressions were not considered as there were no mRNA proxies. For each module, the distribution of correlation coefficients between effector proxy and target expressions was evaluated. This distribution was compared with a background distribution of correlation coefficients between effector proxies and all valid genes. The P-values of the one-sided KS tests between the two distributions were reported.

Identifying recurrent effector aberrations in GBM
We examined two expression data sets-GSE16011 (4) and GSE4412 (5)-containing multiple subtypes of gliomas including GBM and non-GBM samples. For each association module, we extracted the proxy expressions of their effectors and compared their distributions between GBM and non-GBM samples. The P-values of the two-sided KS tests between the two distributions were reported.

Evaluating functional enrichment of target genes
We extracted 3312 functional categories from the GO database (41) and 889 pathways from three pathway databases: Reactome (42), BioCarta (43) and the NCI Pathway Interaction Database (44). We also extracted the 35 biomarker genes pertaining to prognosis of high-grade gliomas reported in (3) and divided them into three subclasses accordingly: proneural, mesenchymal and proliferative. In addition, we extracted 13 gene sets associated with embryonic stem (ES) cell identity from (45). For each association module, we applied the Fisher's exact test to evaluate the enrichment significance of each GO category, pathway and gene set. The P-values were adjusted with Bonferroni correction by multiplying with the total number of GO categories, pathways or gene sets considered.

Checking co-citations between effectors and targets
We incurred a batch search on the PubMed database to find all the pairs of effector/regulators and targets in each module that were co-cited in the same publications. For each effector/regulator in an association module, we counted both the numbers of co-cited genes among the target members and among all the 22 697 genes. Using the background frequency ( #co-citedgenes total#genes ) as a binomial probability of randomly finding a co-cited gene pair, we calculated the P-value of enrichment with co-cited target genes in a module.
Detailed procedures and results of each validation test are described in Supplementary Text 1.

A global landscape of molecular aberrations in glioblastoma
The circos plot (46) in Figure 3 displays three types of molecular aberrations-CNAs, gene mutations and DNA methylations-on the genomes of 248 GBM samples from the TCGA database (8). The CNV datavisualized as red (high values) and green (low values) stripes-of chromosomes X and Y are largely consistent with patients' genders. Besides X and Y, chromosomes 7 and 10 undergo prevalent amplification and deletion, respectively. Other CNAs are more sporadic but tend to span the entire chromosomes rather than localize in small regions. Supplementary Figure S3 displays the auto-correlations of the CNV profiles between intra-chromosomal probe pairs segregated by their genomic distances. On most chromosomes, long-range correlations exist between distant probes. The results justify the use of one representative CNV profile for the CNV profiles of all probes on a chromosome.
Sequence mutations on protein-coding genes are categorized into two types: (i) nonsense point mutations or frame-shifting insertions/deletions that disrupt mRNA synthesis and (ii) missense point mutations or in-frame insertions/deletions that do not necessarily block mRNA synthesis. We term the first class as nonsense mutations, the second class as missense mutations and discard synonymous point mutations. Only five genes are mutated in at least 10 samples: IDH1, EGFR, PTEN, TP53 and NF1. The majority of NF1 mutations are nonsense (4 missense mutations and 14 nonsense mutations). In contrast, other genes are either dominated by missense mutations (IDH1, EGFR, TP53) or mixture of both types of mutations (PTEN).
The genotype data of 850 432 SNPs were reported. Owing to the large number of SNPs considered, we imposed a stringent threshold on filtering pairwise associations between genotypes and mRNA expressions (log likelihood ratio !7.0, P-value 10 À5 ). Only 15 cisacting SNP-mRNA pairs pass the threshold. In all the 15 pairs, the SNP loci are within 50 kb from the mRNA genes. Owing to the small number of significant SNP-mRNA pairs, the cis-and trans-acting effects with SNPs are not incorporated in association modules but are visualized in Supplementary Figure S4.

False-discovery rates of pairwise associations
The association modules were derived from pairwise associations between all putative effector molecular aberrations and target mRNA expressions. As a large number of hypotheses were tested, it is necessary to ensure that most significant pairwise associations do not arise by chance. We assessed the credibility of reported pairwise associations by false-discovery rates [FDR, (47,48)]. FDR estimates the fraction of false positives among the reported pairwise associations. We adopted the permutation tests described in (34) Table 1 shows the FDRs for each type of associations. Both types of FDRs on all pairwise associations combined are small (1:053 Â 10 À3 and 1:497 Â 10 À3 , respectively). Associations with mutations yield the highest FDRs (0.171 and 0.230). The confidence of associations with mutations is likely degraded by the small numbers of samples carrying mutations. In contrast, the FDRs of all other types of associations are negligible, ranging from 10 À7 to 10 À3 . Table 2 and Supplementary Text 2 present the summary information of 45 association modules inferred from the integrated TCGA GBM data sets. In addition, we report the members of the association modules in an annotated webpage cancermodel.stat2.sinica.edu.tw/GBM/. Overall, 6331 genes belong to at least one association module. All types of molecular aberrations (except SNPs) appear as putative effectors to modulate mRNA expressions. Each chromosome constitutes an association module of cisacting effects with CNVs (modules 5-28). Supplementary Table S5 reports the number and fraction of target genes and each type of effectors present in association modules. Each type of associations account for a relatively comparable number of target genes. In contrast, the number and fraction of each type of effectors exhibit great variations. Every chromosome constitutes a module of CNV cis-acting associations, and nearly one-third of all chromosomes (29.17%, seven chromosomes) constitute modules of CNV trans-acting associations. In contrast, only the mutations of three genes (5.66%) are associated with target gene expressions, and only the DNA methylation of eight genes (0.91%) are associated with target gene expressions.

Summary of association modules and their validation results
The majority of prior studies integrating copy number and transcriptomic data considered only cis-acting effects of the same genes. Louhimo et al. (27) compared the performance of 10 algorithms capturing cis-acting CNV-mRNA associations. We applied six of those algorithms to the TCGA GBM data and compared the inferred genes putatively deregulated by CNVs with the genes manifesting cis-acting CNV-mRNA associations in our model. For each method, we sorted genes by their CNV-mRNA integration scores and counted the fraction of genes labeled with cis-acting CNV-mRNA associations in our model. Supplementary Figure S5 displays dependencies between the fraction of cumulative cisacting gene numbers and gene ranks. The methods     Figure S6).
Beyond the directions of associations with survival times, we also evaluated the strength of prognostic predictions for each module by the Cox regression coefficient P-values of the median expression profiles over their targets. Modules 2 and 14 manifest consistent and significant associations with survival times (P < 0.1) in all four data sets. Figure 5 visualizes the target expressions in modules 2 and 14 in relation to survival times and the Kaplan-Meier curves of patients segregated by the median target expression values. In TCGA data, the effector molecular aberration levels (chromosomes 11 and 10 CNVs) and their corresponding Kaplan-Meier curves are also displayed. Module 2 effector CNV and target expression levels have strong negative associations with survival times, whereas module 14 effector CNV and target expression levels have strong positive associations with survival times. The consistent and significant associations of modules 2 and 14 with survival times are also independent of tumor types and grades in the Erasmus data (Supplementary Figure S7).
Additional evidence strongly supports the prognostic relevance of modules 2 and 14. Module 2 harbors 7 of 11 genes involved in epithelial-mesenchymal transition (3). Patients with high expressions of these genes are previously reported to suffer from poor prognosis. We examined the Cox regression coefficients of these genes in each CNS data set and found they were all positive and largely significant except in GSE7696 (Supplementary  Table S7). Furthermore, we identified 20 biomarker genes whose expression profiles yielded consistent (Cox regression coefficients have identical directions) and highly significant (Cox regression P 0.01) prognosis in all four data sets. In particular, module 2 consists of six biomarkers and module 14 consists of seven biomarkers. 'glioma-CpG island methylator phenotype' (G-CIMP). We compared the molecular characterizations derived from the association modules with the established GBM subtypes and found them considerably overlapped.
For each association module, we built a binary classifier of GBM samples based on their median target gene expression levels. Overlaps of these classification outcomes and the four transcriptional subtypes are reported in the left panel of Figure 6 and Supplementary Table S8. Twelve association modules are remarkably aligned with at least two subtypes (yellow patches in Figure 6). For instance, in module 1, classical and mesenchymal subtypes have low and high target gene expressions, respectively (60 of 63 samples and 71 of 74 samples). The alignment outcomes are generally compatible with known genomic characteristics of the transcriptional subtypes. For instance, classical subtypes typically encounter chromosome 7 amplification (high expression of module 7 targets) and chromosome 10 loss (low expression of module 1 targets), whereas mesenchymal subtypes encounter NF1 mutation (high expression of module 41 targets). Supplementary Table S9 reports the dominant association module activities occurred in the samples of each transcriptional subtype.
We also counted the overlap of the G-CIMP subgroups The presence of the joint hallmark is not perfectly aligned with G-CIMP positive status. Strikingly, unlike molecular characteristics of single modules, the prognostic power of the joint hallmark is not explained away by G-CIMP status. Supplementary Figure S9 displays the Kaplan-Meier curves of the four subclasses based on possible combinations of G-CIMP status and the joint hallmark. Although the presence/absence of the joint hallmark does not alter the already superior survival times of G-CIMP positive patients, G-CIMP negative patients possessing the joint hallmark have significantly longer survival times than those without the joint hallmark (median survival times are 633 and 372 days, respectively, logrank P 0.0356).

Associations of effectors and targets are reproducible in external data sets
To validate reproducibility of association models, we incorporated eight additional gene expression data sets of CNS tumors (Supplementary Table S3). In each data set, we intended to check whether (i) targets in each association module retained coherent expression profiles and (ii) effector aberrations and target expression profiles remained associated with the directions compatible with the modules derived from the TCGA data. Table 2 and Supplementary Table S11 report the target expression coherence of each association module in the nine data sets (including TCGA). Twelve modules retain coherent target expressions in at least five data sets (including TCGA). We were unable to directly verify associations between effector molecular aberrations and target gene expressions, as molecular aberration data were lacking in external data sets. Instead, we used the expression profiles of effectors as proxies for the molecular aberrations and examined their associations with target  gene expressions. Table 2 and Supplementary Table S12 show the associations between the effector proxies and target gene expressions for each module across the nine data sets. Among the 21 association modules (excluding the 24 CNV cis-acting modules), 11 manifest significant effector-target associations (P-value 10 À10 ) in at least six data sets.
Amplification of chromosomes 7 and 19, deletion of chromosome 10 and RBP1 hypo-methylation are recurrent and specific molecular aberrations in glioblastoma tumors Reproducible associations do not necessarily imply conserved molecular aberrations on effectors. It remains unclear whether molecular aberrations on certain effectors (i) recur on multiple GBM samples or (ii) are specific to GBM tumors. To answer these questions, we examined two expression data sets-GSE16011 (4) and GSE4412 (5)-including GBM and other glioma subtypes such as pilocytic astrocytomas, oligodendroglial tumors and mixed oligoastrocytic tumors. The goal is to identify the effector molecular aberrations that recur primarily on GBM but not on other glioma subtypes. As these molecular aberrations were not directly measured, we again used the effector expression profiles as their proxies. For chromosome CNVs, we used the expression profiles of the cis-acting targets as their proxies. Figure 7 displays the GBM-specific recurrent effector molecular aberrations. Chromosomes 10 and 7 CNVs reveal the strongest contrasts between GBM and other glioma samples. In both data sets, the cis-acting targets of chromosome 10 have significantly lower expression levels in GBM than in other glioma samples (KS test P-values < 10 À250 and 1:65 Â 10 À124 , respectively). In contrast, the cis-acting targets of chromosome 7 have significantly higher expression levels in GBM samples (KS test P-values < 10 À250 and 4:22 Â 10 À89 , respectively). These results corroborate frequently reported chromosome 7 amplification and chromosome 10 deletion on GBM and furthermore indicate the uniqueness of these CNV aberrations on GBM.
Beyond chromosomes 7 and 10 CNVs, several other effectors also manifest GBM-specific aberrations. Chromosome 19 is amplified in GBM relative to other glioma samples. Chromosome 14 has moderately lower CNVs in GBM relative to other gliomas. RBP1 expression is upregulated in GBM and downregulated in other glioma samples, suggesting that it is hypo-methylated in GBM.

Module targets are enriched with functional classes of immune/inflammatory responses, mesenchymal-epithelial transitions and ES cell transcription factor targets
We examined the enrichment of target genes of each association module on 3312 GO functional categories (41) and 889 pathways assembled from three sources (42,44). Table 2 and Supplementary Table S13 list the enriched functional categories/pathways and their hyper-geometric P-values after Bonferroni correction. In all, 25 of 45 association modules are enriched with at least one functional category/pathway. In particular, immune response genes are highly enriched in module 1 (P-value 2:60 Â 10 À26 ), and inflammatory response genes are enriched in modules 1 and 2 (P-values 4:37 Â 10 À12 and 2:56 Â 10 À4 , respectively).
Phillips et al.
(3) used a 35-gene panel to cluster highgrade gliomas into three subclasses with distinct levels of prognosis: proneural, mesenchymal and proliferative. A disjoint set of genes are upregulated in each subclass. We found that modules 2 and 41 were highly enriched with genes expressed in the mesenchymal subclass (P-values 3:24 Â 10 À9 and 2:20 Â 10 À5 , respectively). Notably, co-occurrence of NF1 mutations (the effector aberration of module 41) and upregulation of mesenchymal markers is previously reported (6).
We also applied two widely used tools of functional enrichment-DAVID (50) and MSigDB (51)-to analyze the association modules and compared the results with our own hyper-geometric enrichment analysis (Supplementary Table S14). The results reported by DAVID are highly overlapped with Table 2 and Supplementary Table S13 as both are based on documented functional classes such as GO terms. For instance, the following functional classes are enriched according to both our analysis and DAVID: immune response in module 1, inflammatory response in module 2, olfactory receptor activity in module 3, mismatched DNA repair in module 11, protein complex assembly in module 15. In contrast, the enrichment results of MSigDB are 'orthogonal' to Supplementary Table S13, as the gene sets chosen (C2: curated gene sets and C6: oncogenic signatures) contain information not covered in GO terms and pathways. In particular, several gene sets related to brain function are enriched in multiple modules. Gene sets upregulated and downregulated in brain from patients with Alzheimer's disease are enriched in 11 and 9 modules, respectively. Genes correlated with classical type of GBM are enriched in nine modules. Nevertheless, some gene sets are not obviously related to brain function but are still enriched in multiple modules, such as the differentially expressed genes in leukemia, fetal livers and breast cancers. Functional implications of these enrichment results need to be investigated.

Co-citations between effectors and targets
We incurred a batch search for all pairs of effectors/regulators and targets of association modules on the NCBI PubMed database and checked whether they were cocited in the same references. Table 2 and Supplementary  Table S15 list the effectors with enriched co-cited target genes in each module. Overall, nine association modules contain enriched co-cited effector-target pairs.
MicroRNA expressions are associated with CNVs of chromosomes 7, 10 and 11 Like protein-coding genes, microRNA expressions can also be modulated by effector molecular aberrations on DNAs. We used the module-discovery algorithm to microRNA expression data and identified 22 modules associated with cis-acting and trans-acting effects of CNVs (Supplementary Table S16). The size of each association module and the total number of microRNA expressions explained by CNVs are considerably smaller than those of mRNAs. The largest module consists of 19 target microRNAs, and only 114 microRNAs (13.95%) are targets in association modules, as compared with 6331 protein-coding genes (27.89%) included in association modules. However, the dominant chromosomes in CNV associations remain invariant between mRNA and microRNA modules: chromosomes 7, 10 and 11. The results suggest that the mechanisms of modulating RNA expressions through DNA copy number changes are likely similar between mRNAs and microRNAs.

DISCUSSION
The inferred association modules simultaneously recapitulate critical molecular aberrations in GBM and map their presence and absence among predefined molecular subtypes. Table 3 combines the results of Table 2 and Figure 6 and summarizes the molecular aberrations in the four GBM transcriptional subtypes and G-CIMP positive tumors. GBM is typically characterized by chromosome 7 amplification, chromosome 10 deletion and NF1 loss-of-function mutations (9), (8,10,11). All these molecular aberrations become effectors of association modules. Beyond these well-known molecular aberrations, several other association modules reveal the effects of less well-known effectors: chromosome 11 deletion, chromosome 19 amplification, methylations of ZNYND10, RBP1, PLXNB1, CD40, GSTM1 and expressions of mir-181a, mir-21, mir-22. RBP1 is among the most differentially hyper-methylated and downregulated genes in G-CIMP positive tumors (7). PLXNB1 encodes plexin, a receptor for the semaphorin signals guiding axonal growth (52). In melanoma, PLXNB1 blocks tumorigenesis by inhibiting the MAP kinase pathway and controlling the extracellular matrix (53). CD40 encodes a TNF receptor essential in mediating a variety of immune and inflammatory responses (54). CD40 possesses multiple functions promoting tumorigenesis and progression including inflammatory response, repression of TNFa-induced apoptosis (55) and angiogenesis (56). GSTM1 encodes glutathione S-transferase mu 1, an enzyme catalyzing the biosynthesis of glutathione. Glutathione is an antioxidant protecting cells from the damage caused by reactive oxygen species. Previously, deletion and sequence polymorphisms on GSTM1 were reported in gliomas (57,58). Mir-21 is reported to have elevated levels in glioma cells (59) and can promote axonal growth as well (60).
The target gene expressions of association modules are also aligned with documented characteristics of GBM molecular subtypes. For instance, in Table 3, classical subtypes possess pronounced chromosome 7 amplification and chromosome 10 deletion, proneural subtypes have TP53 mutations and mesenchymal subtypes harbor NF1 mutations (6). Moreover, alignment between association modules and CpG island methylator phenotypes reveals multiple concerted molecular aberrations on the G-CIMP positive samples. G-CIMP positive tumors typically lack chromosome 7 amplification, chromosome 10 deletion and NF1 mutation, but harbor methylations of RBP1, CD40 and GSTM1. Curiously, G-CIMP positive samples contain disproportionally frequent TP53 mutations (8 of 21 or 38.1%) compared with G-CIMP negative samples (35 of 224 or 15.62%).
The functional consequences of molecular aberrations successfully explain the predictive power of the G-CIMP phenotypes for prognostic outcomes. Concerted hypermethylation on CpG islands is a biomarker co-occurred with other benign molecular features (lack of chromosome 7 amplification, chromosome 10 loss, NF1 mutation, etc.). Therefore, the predictive power of each single module ( Figure 4) is explained away by the G-CIMP phenotype (Supplementary Figure S8). Tumors carrying the joint hallmark of 16 modules are highly overlapped with G-CIMP positive tumors. In addition, G-CIMP negative tumors carrying the joint hallmark yield superior prognosis than the remaining G-CIMP negative tumors (Supplementary Figure S9). Furthermore, the joint expression hallmark derived from 16 association modules can be reduced to four modules without sacrificing its prognostic power (Supplementary Figure S10). Downregulation of modules 11 and 42 and upregulation of modules 12 and 14 suffice to segregate the 10 G-CIMP negative tumors with long survival times from the remaining G-CIMP negative samples. From these observations, we propose that a joint hallmark derived from 16 (or 4) association modules can faithfully predict GBM prognostic outcomes.
Another striking finding from our analysis is pronounced activities of the immune system in inferred modules and their significant associations with survival times. GBM is typically characterized with a strong immunosuppressive microenvironment (1,62,63). Chromosomes 10 and 11 CNVs are positively associated with expressions of many immunity or inflammation related genes on other chromosomes. Furthermore, effectors of modules 42 (CD40) and 43 (GSTM1) are also involved in inflammatory responses. Conventional wisdom often links chromosome 10 loss with inactivation of PTEN (2,(8)(9)(10)(11) and views activation of the PI3K/Akt/mTOR pathway through PTEN mutations as the primary cause of immune suppression in GBM (64,65). Our study suggests that PTEN might not be the only key regulator gene on chromosome 10, as neither PTEN expression nor PTEN mutation is strongly associated with target gene expressions. Other candidate regulators on chromosome 10 such as CXCL12 and BLNK retain strong associations with chromosome 10 CNVs and target gene expressions across multiple data sets. Alterations on those genes may directly modulate immune responses in addition to the effect of the PI3K/Akt/mTOR pathway.
Observations from survival analysis, however, are paradoxical. All the association modules enriched with immune/inflammatory response genes-modules 1, 2, 41 and 42-are also negatively associated with survival times. This observation seems to contradict with the nature of an immuno-deficient microenvironment of GBM. However, immune/inflammatory responses can both promote tumorigenesis by fostering angiogenesis, cancer cell proliferation and invasiveness (66) and suppress/attack cancer cells presenting specific antigens (63). In GBM, the cancerpromoting characteristics of immune/inflammatory The last column reports the molecular aberrations absent in G-CIMP positive tumors.
responses seem to dominate the progression of tumors. These observations can be attributed to the canonical theory of the effects of inflammation on tumorigenesis, and recent findings about strong couplings of the KRAS signaling pathway and innate immune signaling (67). More refined investigation is required to determine the effects of antagonistic interactions for immune/inflammatory responses.
There are several limitations in the data and methods of the current study. First, intratumoral heterogeneity among cancer, stromal and immune cells is not considered. Interactions between multiple cell types are critical for tumor progression but cannot be unraveled with the population-averaged TCGA data. Second, prognostic prediction among the majority of G-CIMP negative patients remains poor. The wide range of survival times among this group (though generally lower than G-CIMP positive patients) cannot be further stratified by association modules or transcriptional subtypes. Third, causal and mechanistic links between molecular aberrations remain unknown, as longitudinal data of tumor evolution and intervention experimental data are unavailable.
To sum up, cross-sectional, static, observational and coarse-grained TCGA data supply rich information of molecular characteristics of cancer genomes. Yet, they are unable to tackle several central issues of contemporary cancer research, such as evolution of cancer genomes, heterogeneous interactions between tumors and microenvironment, emergence and development of cancer stem cells and acquisition of drug resistance. New experimental technologies and computational methods will enable scientists to study these questions and hopefully acquire systematic understanding and treatment strategies of cancer.