Integrated analysis of mutations, miRNA and mRNA expression in glioblastoma

Background Glioblastoma arises from complex interactions between a variety of genetic alterations and environmental perturbations. Little attention has been paid to understanding how genetic variations, altered gene expression and microRNA (miRNA) expression are integrated into networks which act together to alter regulation and finally lead to the emergence of complex phenotypes and glioblastoma. Results We identified association of somatic mutations in 14 genes with glioblastoma, of which 8 genes are newly identified, and association of loss of heterozygosity (LOH) is identified in 11 genes with glioblastoma, of which 9 genes are newly discovered. By gene coexpression network analysis, we indentified 15 genes essential to the function of the network, most of which are cancer related genes. We also constructed miRNA coexpression networks and found 19 important miRNAs of which 3 were significantly related to glioblastoma patients' survival. We identified 3,953 predicted miRNA-mRNA pairs, of which 14 were previously verified by experiments in other groups. Using pathway enrichment analysis we also found that the genes in the target network of the top 19 important miRNAs were mainly involved in cancer related signaling pathways, synaptic transmission and nervous systems processes. Finally, we developed new methods to decipher the pathway connecting mutations, expression information and glioblastoma. We indentified 4 cis-expression quantitative trait locus (eQTL): TP53, EGFR, NF1 and PIK3C2G; 262 trans eQTL and 26 trans miRNA eQTL for somatic mutation; 2 cis-eQTL: NRAP and EGFR; 409 trans- eQTL and 27 trans- miRNA eQTL for lost of heterozygosity (LOH) mutation. Conclusions Our results demonstrate that integrated analysis of multi-dimensional data has the potential to unravel the mechanism of tumor initiation and progression.

Background Glioblastoma (glioblastoma multiforme or GBM) is the most common and aggressive type of primary brain tumor in humans, involving glial cells and accounting for 52% of all parenchymal brain tumor cases and 20% of all intracranial tumors [1]. Glioblastoma is located preferentially in the cerebral hemispheres. In the past two decades, the molecular mechanisms, genetics and pathways to treat GBM have extensively been studied [2]. However, the precise mechanism of GBM is unknown and its median survival rate is very low [3]. The Cancer Genome Atlas (TCGA) generated largescale multi-dimensional datasets to catalogue cancer alterations [4]. GBM is the first cancer studied by TCGA. In the GBM pilot project, a total of 601 genes were sequenced for detection of somatic mutations in 179 tumor and matched normal tissues pairs; expressions of 12,042 genes were measured in 243 tumor tissue samples and 10 normal tissue samples and 1 cell line; and expressions of 534 microRNAs (miRNAs) were profiled in 240 tumor tissue samples and 10 normal tissue samples. The generated multi-dimensional genetic and molecular data will provide rich information which allows us to uncover mechanisms of GBM.
In the past, although enormous efforts toward generating high-throughput genetic and molecular data have been made, knowing how to integrate genetic and molecular data and unravel underlying mechanisms of cancer remains unresolved. To gain significant insights into the mechanisms of cancer, it is indispensable to reconstruct and dissect various pathways, and genetic and molecular networks that define the molecular states of systems involved in tumorigenesis. To meet this challenge, we will reconstruct genetic and molecular networks involved in tumorigenesis and study how these networks respond to the perturbation of somatic mutations and environments. Genes with DNA variation, mRNA and miRNA form the complex genetic and molecular networks that determine the cell's function and response to the perturbation of external stimuli. These networks consist of three levels. The subnetwork in the first level consists of (1) significantly associated somatic mutations and (2) LOH which were identified by a group association test statistic. The subnetworks in the third level are mRNA and miRNA coexpression networks. The subnetworks in the second level are to connect the subnetworks in the first level and subnetworks in the third level. The network pipeline analysis approach is shown in Figure 1.
It is increasingly recognized that miRNAs have emerged as an important component in the regulation of gene expression, with imperfect base pairing, to target sites in the 3' UTR of the messenger RNAs [5]. Three types of methods, sequence analysis, miRNA-mRNA regression analysis, and machine learning, have been used to identify potential target genes [6,7]. To improve the accuracy of target prediction, we will combine sequence analysis with regression analysis for target prediction, which finally leads to the formation of miRNA target networks. miRNA target networks connect gene coexpression and miRNA coexpression networks. Expression quantitative trait loci (eQTLs) are genomic loci that regulate gene transcription. eQTLs may act in cis (locally) or trans (at a distance) to a gene or a precursor miRNA [8]. We applied a group regression method which regresses the expression of a mRNA or a miRNA on the number of all mutated alleles across the region of interest to identify eQTL. The identified eQTLs connected the subnetworks in the first and third levels.
Biological functions and mechanisms are encoded in network properties. An important strategy for unraveling the mechanisms of initiation and progression of cancer is to conduct analysis of complex genetic and molecular networks and study their behaviors under genetic and environment perturbations. Robustness of a biological network, ability to retain much of its functionality in the face of perturbation [9], has emerged as a fundamental concept in the study of network topological properties [10]. The locations of the DNA variants, mRNA and miRNA in the genetic and molecular networks are likely to affect the phenotypes. We use network structural analysis as a tool to identify a set of key cancer causing genome alternations and core modules of biological networks that play an essential role in the development of cancer.
The purpose of this report is to use systems biology and network approaches to develop novel analytical strategies to unravel the mechanism of GBM by systematically integrating the multi-dimensional datasets from the TCGA project.

Test Association of Somatic Mutations and LOH Mutations with Glioblastoma
Traditional statistical methods for genetic studies of complex diseases are mainly based on the common disease/common variant hypothesis. However, it has been reported that common variants account for only a small proportion of the genetic contribution to complex diseases. Recent deep-resequencing reveals that in the genome there are a large number of rare variants (0.05 × MAF) which also play an important role in causing various complex diseases including cancer [11]. Multiple rare mutations, each with a minor marginal genetic effect, but collectively may make large contributions in the population. Most statistics for testing the association of common alleles with common diseases have mainly focused on the investigation of variants individually. However, due to their rarity, the frequencies of rare alleles may be comparable with genotyping errors. As a consequence, individual tests of association of rare variants with disease, as is often done by the traditional association tests, have limited power and may not be robust [12]. An alternative approach to the current variant-by-variant tests is group association tests in which a group of rare genetic variants are jointly tested [12][13][14]. It has been shown that the number of rare alleles in large samples is approximately distributed as a Poisson process with its intensity depending on the total mutation rate [15]. The intensity of the Poisson process within a segment of the genome can be interpreted as the mutation rate. Similar to the standard χ 2 test for association of SNPs which compares the difference in allele frequencies between cases and controls, we propose a new group test statistic T a that is to compare differences in the mutation rates between tumor and normal samples (for details, see Methods). The statistic T a was applied to GBM in the TCGA dataset. The tumor tissues of 179 GBM patients and 179 matched normal tissues were sequenced. A somatic mutation was recorded when the mutation was detected only in the tumor tissue. There are 306 genes in which at least one somatic mutation was detected. A LOH mutation was recorded when the genotype in blood or normal tissue is heterozygous, and in the tumor tissue, the reference allele loses normal function and the genotype becomes homozygous. There are 124 genes in which at least one LOH mutation was detected. We identified association of somatic mutations in 14 genes by the statistics T a (Table 1), and association of LOH in 11 genes by the statistic T G with GBM with a false discovery rate (FDR) of less than 0.05 (Table 2). Genes TP53, PTEN, EGFR, NF1, RB1 and ERBB2 were reported to be associated with GBM in the previous TCGA data analysis [4]. The remaining 8 somatic mutated genes and 10 LOH mutated genes were newly identified by the statistics T G or T a . NCBI Entrez gene database [16] reported that: CHEK2 is a cell cycle checkpoint regulator and putative tumor suppressor, and associated with GBM; GSTM5 was reported to be involved in cancer, BCL11A is a proto-oncogene, and FN1 is involved in tumor metastasis and angiogenesis. Gene PRAME was reported to be associated with melanoma and acute leukemias. Association of the other 9 genes such as NRAP, MK167, C10orf54 and C9orf66 with GBM was first reported here.

Network Analysis of Gene Expressions
Comparative studies of gene expression between normal and tumor tissues is one of the most widely used strategies for unraveling the molecular circuitry underlying cancer [17]. To uncover the mechanisms of glioblastoma, expressions of 12,042 genes were measured in 243 tumor tissue samples and 10 normal tissue samples and 1 cell line using the Affymetrix HT Human Genome U133 Array Plate Set. A total of 1,697 genes were differentially expressed between tumor and normal tissues by the Wilcoxon rank-sum test (P-value for declaring significance after Bonferroni correction is 4.15 × 10 -6 ). Of the 1,697 genes, 97 genes were cancer genes or cancer candidate genes (Additional files 1); 11 of them were oncogene including CDK4 and RAF4; and 21 of them were tumor suppressor genes including TP53 and RB1. Twenty-five of which were GBM related genes including TCF12, TP53, COL4A1, COL3A1 and COL5A2. We also observed that of the 1,697 genes, 242 genes were in signal transduction pathways, 908 genes were down regulated and 789 genes were up regulated (Additional files 2).
To investigate the functions of the genes at the system-level and uncover the mechanism of GBM, we used Lasso algorithms for a Gaussian graphical model to infer gene coexpression networks (see Methods). The largest connected coexpression network with the average shortest path 20.4 and diameter 74 had 2,115 genes and 2,276 edges ( Figure 2). Coexpression networks are usually organized into modules that perform specific biological processes and tasks. A dynamic tree cut procedure was used to identify modules and a Fisher's exact test was used to discover pathways that were overrepresented in the module (see Methods). A total of 13 modules were significantly enriched for at least one pathway ( Figure 3 and Table 3), indicating that a coexpression network was organized into functional units.  The architecture of a coexpression network is important for uncovering the genes which are involved in cancer. To identify the most important genes in the coexpression network, we used the damage value of a node as a measure to rank the importance of a node. The damage value of a node which is defined as the difference in the number of nodes of the largest connected network before and after removal of that particular node can be used to measure the effect of the removal of the node, i.e. the ability of a network to avoid malfunctioning when a gene is removed (damaged) [18]. We ranked all differentially expressed genes in the largest coexpression subnetwork according to their damage values. The top 5% of the genes in the ranked list with damage values greater than 15 were summarized in Additional files 3. These genes were essential to the function of coexpression networks. We suspect that these genes may be involved in the development of GBM.
There were three differentially expressed genes (TP53, COL3A1, and RAP1GDS1) whose damage values ranked among the top in cancer and cancer candidate genes (Additional files 1). Their removal from the network may disconnect some components in the network and hence compromise the functions of the genes in coexpression networks. TP53 and COL3A1 are GBM related genes and RAP1GDS1 is a cancer candidate gene. The p53 pathway is central to oncogenesis [19]. TP53 was up regulated in GBM tissue samples and was highly over expressed in the CEREBRUM tissue samples. Interestingly it was observed that TP53 had a damage value of 23, but was only connected with two up regulated genes (TGIF2 and EIF4A1). TGIF2 was differentially expressed (P-value < 1.43 × 10 -7 ) and had a damage value of 24. TGIF2 plays an oncogenic role through inhibition of TGFB [20]. EIF4A1 was over expressed in tumor tissue (P-value < 5.09 × 10 -6 ) and had a damage value of 22. Using Programmed Cell Death 4 (PDCD4) EIF4A1 inhibits translation initiation and acts as a tumor suppressor by forming a complex [21]. COL3A1 was over expressed in GBM tissue samples (P-value < 1.6 × 10 -6 ) and had damage value 21. It was also reported to be over expressed in ovarian cancer and breast cancer [22,23]. COL3A1 encodes a fibrillar collagen which is a major component of the extracellular matrix protein surrounding cancer cells. The presence of ECM protein prevents apoptosis of cancer cells. COL3A1 plays an important role in apoptosis, proliferation regulation and anticancer drug resistance [24]. RAP1GDS1 was under expressed in GBM tissue samples (P-value < 4.4 × 10 -7 ) and had a damage value of 18 in the coexpression network. RAP1GDS1 is a transcription factor [25]. It was reported that translocation fusion of the NUP98 and RAP1GDS1 genes was recurrent in T-cell acute lymphocytic leukemia [26].
Other genes at the top of the list including T1A1, KIAA1279 and CACYBP also have remarkable biological implications. T1A1t had the largest damage value (38), was over expressed in the GBM tissue samples, and played a role in apoptosis [27]. KIAA1279 was reported to be associated with the nervous system [28] and CACYBP was reported to participate in p53-induced beta-catenin degradation. CACYBP can suppress proliferation and tumorigenesis of renal cancer cells [29]. CACYBP is also reported to be under expressed in gastric cancer [30] and renal cancer [29].
Using DAVID Bioinformatics Resources online [31], we performed the gene set enrichment analysis and found the most enriched Gene Ontology (GO) and  Table 3. pathway groups. GO:0045202~synapse (a cellular component term), GO:0007268~synaptic transmission (a biological process term) and GO:0005509~calcium ion binding (a molecular function term) were among the most significant GO terms, with P-values 6.98 × 10-14 , 1.43 × 10 -9 and 5.06 × 10 -6 , respectively. Epithelial cell signaling in Helicobacter pylori infection (an infectious diseases pathway), long-term potentiation (a nervous system pathway) and calcium signaling pathway (a signal transduction pathway), were the most significantly enriched pathways with P-values 2.57 × 10 -6 , 4.42 × 10 -5 and 4.45 × 10 -5 , respectively. The results suggested that the differentially expressed genes were most involved in the signal and nervous system related pathways.
Network Analysis of miRNA Expressions miRNAs are short endogenous non-coding RNAs of 22 nucleotides in length that negatively regulate gene expression through base pairing with target mRNAs [32]. Deregulation of miRNA and miRNA-related genetic alternations are involved in causing cancers [33]. To unravel the pattern of differential regulation of miRNA, expressions of 534 miRNAs including 470 human miRNAs were profiled in 240 tumor tissue samples and 10 normal tissue samples by Agilent 8 × 15 K Human miRNA-specific microarrays. A total of 149 miRNA were differentially expressed between the GBM tissue and normal tissues samples which were identified by Wilcoxon rank-sum test (P-value for declaring significance after Bonferroni correction is 9.36 × 10 -5 ). Of the 149 differentially expressed miRNA, 73 miRNAs were up regulated and 76 down regulated (Additional files 4). Among them, 21, 81 and 15 miRNAs were reported to be associated with the GBM, other cancers and other diseases, respectively, in the literatures [34].
Similar to genes, miRNA are not isolated, instead they act together to perform biological functions [35,36]. To understand how miRNA regulate biological processes at a system level, we used Lasso algorithms for a Gaussian graphical model to map the simultaneous expression of miRNA into a coexpression network in which nodes represent miRNA and edges represent conditional dependence between two connected miRNAs, given all other miRNA expressions in the network (see Methods). The largest connected miRNA coexpression network with the average shortest path (10.75) and diameter (49) had 385 miRNAs and 451 edges ( Figure 4). One main feature of this miRNA coexpression network is that mir-770-5p and mir-329 divided the whole network into three subnetworks, the left, middle, and right subnetwork. The middle subnetwork was densely connected. Similar to the gene coexpression network, we use the damage value of a node as a robustness measure to rank the importance of a miRNA in the miRNA coexpression network. We ranked all differentially expressed miRNA in the largest connected miRNA coexpression network according to their damage values. The top 19 differentially expressed miRNAs with the largest damage values (> 20) which are referred to as fragile miRNAs) were listed in Additional files 5. Of the 19 fragile miRNAs, 16 miRNAs (14 miRNAs were over expressed in the GBM tissues) were in the left coexpression subnetwork and 3 under expressed miRNAs were in the right subnetwork. The middle subnetwork contains no fragile miRNAs. Figure 4 shows that miRNAs in the middle subnetwork were densely connected. The middle network with large numbers of redundant miRNAs was highly robust in response to perturbation of external forces. Mir-487a, mir-487b, mir-502 and mir-532 were the most important components in the miRNA coexpression network. Removal of one of them would cause disconnection between the left part and right part of the miRNA coexpression network and hence lead to dysfunction of the whole miRNA coexpression network. We observed that mir-487a and mi-487b were down regulated, and mir-502 and mir-532 were up regulated in the GBM tissues. SNPs within the miR-502 seed binding region in the 3'-UTR of the SET8 gene which methylates TP53 is reported to be associated with early age of onset of breast cancer [37]. Their major functions are to regulate cell cycle, DNA replication, cytokine-cytokine receptor interaction, hematopoietic cell lineage and signal transduction (see details in the next section). It is interesting to note that three of the 19 fragile miRNAs were significantly related to GBM survival time (univariant cox regression [38]). The three fragil miRNAs include hsamir-487b with P-value: 0.0063 and hazard ratio: 1.349; hsa-mir-17-5p with P-value: 0.0035 and hazard ratio: 0.743; and hsa-mir-106a with P-value: 0.0155, hazard ratio: 0.791. Therefore, we predict that these miRNAs are cancer related and play an important role in tumorigenesis and survival.
miRNA Target Networks miRNAs down regulate gene expressions by base-pairing with the 3'-noncoding region of the target mRNAs. It is estimated that up to 30% of the genes might be regulated by miRNAs [39]. It is hypothesized that miRNAs and their targets form complex networks to perform various biological functions. To reveal mechanisms of the GBM, we identified target genes of the miRNAs and constructed miRNA target networks. Procedures for discovering the target genes consisted of two steps. The first step was to conduct sequence analysis that used sequence complementarities of miRNA and its target site to predict potential miRNA target genes. Since miR-NAs repress the expression of its target gene, the second step is to test the inverse relationship between the expression profile of the miRNA and that of its potential target gene.
To achieve this, we regressed the expression of target mRNA on the expression of miRNAs and selected mRNA with significant negative regression coefficients as miRNA targets. The P-value for declaring significant evidence of the miRNA target was 1.00 × 10 -4 . We also searched the predicted potential miRNA targets in the miRGen [40], which integrated animal miRNA targets according to combinations of four widely used target prediction programs (miRanda, PicTar, TargetScan, and DIANA-microT), and experimentally supported targets from TarBase [41] and miR2Disease [34].
The above methods were applied to miRNA and mRNA expression dataset in 237 tumor tissue samples and 10 normal tissue samples with 1,697 differentially expressed mRNAs and 149 differentially expressed miR-NAs. This resulted in extremely complex miRNA target networks. We found 3,953 matched miRNA-mRNA pairs for 127 differentially expressed miRNAs and 1,089 differentially expressed genes. Of the 3,953 target pairs, 65 down regulated miRNA targets 468 over expressed genes while 62 up regulated miRNAs target 621 under expressed genes (Additional files 6).
Among them, a total of 14 previously verified targets of 8 miRNAs by experiments elsewhere were listed in Table  4. Of 8 miRNAs, 4 under expressed miRNAs (mir-124a, mir-29b, mir-29c and mir-33) function as a tumor-suppressor and 4 over expressed miRNAs (mir-155, mir-16, mir-21 and mir-210) function as an oncogene (Table 4). CTDSP1 has been reported to be a validated target gene of mir-124a [42], RTN4 and SLC25A22 were validated targets of mir-16 [43]. mir-21 was found to be over expressed in multiple cancers down regulated tumor suppressor genes (TPM1, PTEN, PDCD4 BASP1 and RTN4) in invasion and metastasis of cancer [44]. BASP1 is a transcriptional co-suppressor for the Wilms tumor suppressor protein WT1, thus it can regulate WT1 transcriptional activity [45]. Nogo-A, one protein isoforms encoded by RTN4, turned out to be a neuronal protein involved in diverse processes that goes from axonal fasciculation to apoptosis [46]. Mir-155 targeted a regulator of the apoptosis gene LDOC1 [47]. Up-regulation of miR-210 directly targeted gene EFNA3, which is crucial for endothelial cell response to hypoxia, affecting cell survival, migration, and differentiation [48]. Mir-29c was reported to be under expressed in nasopharyngeal carcinomas and up regulated genes COL4A1, COL4A2 and TDG [49]. COL4A1 and COL4A2 were genes encoding extracellular matrix proteins, as previously discussed; they play an important role in apoptosis, proliferation regulation and anticancer drug resistance [24]. COL4A2 was validated to be also targets of mir-29b in another research group [50]. TDG was involved in DNA repair, a process frequently dysregulated in many cancers [49]. In mouse and human cells, miR-33 inhibits the expression of ABCA1, thereby attenuating cholesterol efflux to apolipoprotein A1 [51]. It was also reported that the role of miR-33 controlling the hematopoietic stem cells selfrenewal through p53 may lead to the prevention and treatment of hematopoietic disorders [52].
The resulting miRNA target networks have several remarkable features. First, many important genes in the mRNA coexpression networks were targets of differentially expressed miRNAs. Through our prediction, we found that the top 17 genes with damage values greater than 19 in the gene coexpression network were negatively regulated by 34 differentially expressed miR-NAs (Additional files 7). All 17 genes and many miR-NAs were crucial components in the gene and miRNA coexpression networks. Their altered expressions played an important role in tumorigenesis. Among them, over expressed RBBP4 (P-value < 1.8 × 10 -8 ) with the largest damage value (37) in the table was negatively regulated by under expressed mir-29b (Pvalue < 2.07 × 10 -7 ) and mir-29c (P-value < 3.66 × 10 -7 ). Under expressed TRIM8 (P-value <3.77 × 10 -6 ) with a damage value of 29 was negatively regulated by over expressed mir-629 (P-value < 2.02 × 10 -7 ). Over expressed TP53 (P-value <1.40 × 10 -7 ) with a damage value of 23 was negatively regulated by underexpressed mir-485-5p. RBBP4 has been implicated in chromatin remodeling and regulation of cell proliferation. It has been reported that RBBP4 is overexpressed in different human tumors, such as lung, liver and thyroid cancer, acute myelocytic leukemia, and acute lymphoblastic leukemia [53]. TRIM8 is thought to be a new tumor suppressor gene [54].
Second, many genes targeted by critical components in the miRNA coexpression network, played important roles in the regulation of neural process and tumor genesis. To study the function of the top 19 differentially expressed miRNAs with the largest damage values (Additional files 5), we constructed mRNA target networks of these 19 miRNAs with 476 nodes and 1,128 arcs and 174 edges as shown in Figure 5(A), where 5 under expressed miRNAs negatively regulated 85 overexpressed genes and 14 overexpressed miRNA negatively regulated 372 underexpressed genes, 1,128 arcs were miRNA-mRNA pairs, 15 edges connected coexpressed miRNAs, and the remaining 159 edges linked coexpressed genes (Additional files 8). The genes with large damage value or related to cancer in the gene coexpression network (Additional files 3) were highlighted and marked in Figure 5(A). We extracted two subnetworks from Figure 5(A). The first subnetwork consisted of 8 genes (KIAA1279, CNNM2, CAMKV, TGIF2, SLC6A15, SLC17A7, NRIP3, and UROS) with damage values greater than 15 which were regulated by 11 miRNAs (mir-25, mir-106b, mir-93, mir-15a, mir-16, mir-15b, mir-329, mir-218, mir-17-5p, mir-106a, and mir-320) with damage values greater than 20. The second subnetwork consisted of 14 under expressed cancer genes (FBXW7, GABRA1, MYT1L, NEFL, NEFM, SNAP25, SYT1, VSNL1, RTN1, SH3GL2, SV2B, SYN2, KIAA0774 and RIMS2) which were regulated by 10 over expressed miRNAs (mir-15b, mir-25, mir-16, mir-92, mir-15a, mir-320, mir-106b, mir-93, mir-106a, and mir-17-5p) in the mRNA coexpression network. Two subnetworks are shown in Figure 5(B). Two subnetworks are involved in the synaptic transmission process. Neuron communication occurs at the synapse via neurotransmitters. FBXW7 serves as a negative regulator of oncoprotein and is a general tumor suppressor. It has been reported that FBXW7 is implied in various cancers including glioblastoma [55]. GABRA1 encodes a gammaaminobutyric acid (GABA) receptor and GABA is the major inhibitory neurotransmitter in the mammalian brain. MYT1L (myelin transcription factor 1-like) regulates nervous system development. Both NEFLand NEFM are neurofilaments. They play a role in intracellular transport to axons and dendrites. SNAP25 is a presynaptic plasma membrane protein and regulates neurotransmitter release. SNAP25 was reported to be implicated in neuritogenesis in human neuroblastoma [56]. Both SYT1 and VSNL1 serve as Ca(2+) sensors in synaptic transmission [16]. SYN2 is a member of the synapsin gene family, SV2B is a synaptic vesicle glycoprotein and RIMS2 regulates synaptic membrane exocytosis [16].   Table 5 summarizes the major functions of target genes of the miRNAs in Figure 5(A) where one-sided Fisher's exact test was used to test for significantly enriched GO categories or pathways. Table 5 shows that the target genes of these miRNAs were mainly involved in cancer related signaling pathways and nervous system processes including synapse, synaptic transmission, neurotransmitter transport, nervous system development and neurological system processes. It is interesting to note that coexpressed mir-15a, mir-15b, mir-16, mir-25 and mir-92 were major miRNAs that targeted genes in both cancer related signaling pathways and nervous system processes, and mir-16a was significantly associated with the GBM survival time.
Third, the contribution of miRNAs to the expression variation of some genes is very high. The proportion of expression variation of target genes explained by the linear influence of miRNA variation can be measured by the coefficient of determination (R 2 ). In Additional files 9, there were a total of 78 under expressed genes with the coefficient of determination greater than 20% which was negatively regulated by 7 over expressed miRNAs (mir-15a, mir-15b, mir-16, mir-25, mir-93, mir-106a and mir-106b). Surprisingly, up to 33% of the expression  [57]. The function of CAMK2N1 is to inhibit MEK/ERK activity and induce p27 accumulation which negatively regulates cell-cycle progression. Reducing expression of CAMK2N1 will accelerate tumor growth. 26% of the expression variation of CPEB1 that regulates synaptic plasticity and is implied in cancer development [58] and GRM1 that is involved in neurotransmitter in the central nervous system and implicated in melanoma [59] was explained by overexpressed mir-25 and mir-15b, respectively. Expressions of genes SYT1, SNAP25, GABRA1, VSNL1, MYT1L, SV2B, SYN2 and SLC12A5 which were involved in synaptic transmission, were also largely regulated by miRNAs.
To study the function of somatic mutations and LOH and connect them with disease through gene expressions and miRNAs, we studied their cis or trans regulatory effects on gene or miRNA expression traits. Since the popular methods for eQTL analysis have mainly focused on testing cis or trans regulatory effects individually, their applications to testing cis or trans regulatory effects of rare somatic mutations and LOH are inappropriate. We applied the group regression method to expression profiles of 12,043 genes produced by Affymetrix HT Human Genome U133 Array Plate Set at MIT [3] and expression profiles of 470 human micro-RNAs produced by Agilent 8 × 15KHuman miRNA-specificmicroarray at the University of North Carolina in 169 tumor tissue samples from glioblastoma patients [3], which were shared among the gene expression, miRNA expression and mutation datasets.
We first studied the cis regulatory effects of somatic mutations and LOH on mRNA or miRNA expression traits. For somatic mutation, we found four cis-eQTL (TP53, EGFR, NF1 and PIK3C2G). P-values for testing association of somatic mutations in TP53, EGFR, NF1 and PIK3C2G with their expressions were 0.033, 0.019, 0.028 and 0.006, respectively. Fold change (defined as the ratio of their average expressions of the samples with somatic mutations over the average expressions of the samples without somatic mutations) for the 4 genes were 1.10, 1.73, 0.86 and 1.23, respectively. Although regression analysis did not show significant association of somatic mutations in BCL11A, FN1 and COL3A1 with their expressions due to very low frequencies of mutations, the fold change were 1.45, 0.60 and 0.20, respectively. We still observed some regulatory effects of somatic mutations on these two genes. For LOH mutation, two cis-eQTL (NRAP and EGFR) were detected with P-values 0.030 and 0.034 and fold changes 0.90 and 2.32, respectively. Regression analysis did not show significant association of LOH mutations in the gene CYP1B1 with their expressions, but the fold change was 0.48.
Next we identify the trans-regulatory effects of somatic mutations and LOH on mRNA or miRNA expression traits. The thresholds for declaring significant association of the set of somatic mutation and LOH in the gene with mRNA and miRNA expression after Bonferroni correction for multiple tests were 1.63 × 10 -4 and 4.03 × 10 -4 , respectively. A network that connects 14 genes with somatic mutations associated with GBM, and their regulated mRNAs and miRNA expressions is shown in Figure 6(A). Somatic mutations in these 14 genes were strongly correlated with expressions of 177 significantly differentially expressed genes, 45 of which were under expressed and 132 were over expressed in tumor tissue samples, a total of 262 trans gene eQTL were found (Additional files 10). These mutations also significantly affected expressions of 23 miRNAs, 11 of which were underexpressed and 12 were overexpressed and a total of 26 trans miRNA eQTL were found (Additional files 11). Remarkably, we found 5 paths (Table 6): (1) RB1 with association of somatic mutations with GBM regulated expressions of both gene LAMP2 and mir-340, and mir-340 in turn targeted gene LAMP2; (2) DST with association of somatic mutations regulated expressions of both gene OTUB1 and mir-15b, and mir-15b in turn targeted gene OTUB1; (3) FN1 with association of somatic mutations regulated expressions of both SSR2 and mir-125a, and mir-125a in turn targeted gene SSR2; (4) FN1 with association of somatic mutations regulated expressions of both gene TDG and mir-125a, and mir-125a in turn targeted gene TDG; and (5) TP53 with association of somatic mutations regulated expressions of both gene TP53 and mir-504, and mir-504 in turn targeted TP53. Similarly, Figure 6(B) shows a network that connected 11 genes with LOH associated with GMB, their regulated mRNAs and miRNA expressions. These 11 genes with LOH as trans-eQTLs affected 323 differentially expressed or interacted genes, 190 of which were under expressed and 133 were over expressed, a total of 409 trans gene-eQTL were found (Additional files 12). The LOH also affected expressions of 19 miRNAs, 9 of which were underexpressed and 10 were overexpressed, and a total of 27 trans miRNA-eQTL were found (Additional files 13). Similar to the differentially expressed gene set enrichment analysis, we searched the enriched GO and pathway groups for the eQTL gene list of every somatic/ LOH mutations by DAVID Bioinformatics Resources [31]. The most enriched GO terms included the molecular function term GO:0015075~ion transmembrane transporter activity (for LOH mutated gene CYP1B1 eQTL), biological process term GO:0050877~neurological system process (for LOH mutated gene ABCA13 eQTL) and cellular component term GO:0005643~nuclear pore (for somatic mutated gene TP53 eQTL) with P-values 6.73 × 10 -7 , 4.31 × 10 -5 , and 0.02, respectively. The enriched pathways included the Wnt signaling pathway (for somatic mutated gene DST eQTL), colorectal cancer (for somatic mutated gene FN1 eQTL) and Gap junction (for LOH mutated gene ABCA13 eQTL) with P-values 0.018, 0.012 and 0.013, respectively. Several examples are provided here, for the detailed results, see Additional files 14.

Discussion
Genetic and molecular alternations that are likely to cause tumor formation are often organized into complex biological networks. The purpose of this report is (1) to explore the possibility of integrating altered DNA variations, mRNA and miRNA expression variations into multi-level complex genetic networks that contribute to tumorigenesis; (2) to decipher paths from somatic mutations and LOH to tumor formation through genetic and molecular networks; and (3) to identify key genetic alternations causing tumor formation using network approaches. In the first step, we reconstruct single type molecular networks whose components are of the same type of molecules, either mRNAs or miRNAs. We reconstructed mRNA coexpression networks and miRNA coexpression networks for glioblastoma. The coexpression networks attempt to uncover the regulatory relationships among mRNAs or miRNAs. The second step is to reconstruct miRNA target networks that connect mRNA and miRNA coexpression networks and indentify the eQTL network that connects mutations and mRNA, miRNA expression profiles.
We have addressed several issues for deciphering the paths from somatic mutations and LOH to tumor formation. The first issue is how to test association of somatic mutations or LOH with glioblastoma. We developed a group association test that is based on population genetics for assessing association of somatic mutations and LOH with cancer. We identified 14 genes harboring somatic mutations associated with GBM and 11 genes harboring LOH associated with glioblastoma. The second issue is how to uncover the components of mRNA or miRNA coexpression networks which respond to somatic mutations and LOH associated with glioblastoma. Traditionally, when alleles are common, these components are identified by mapping cis-eQTL or trans-eQTL. However, when alleles are rare, these components are hard to find by individually mapping cis-eQTL or trans-eQTL. The approach developed here is to extend group tests for a qualitative trait to a quantitative trait. The components that respond to perturbation of rare somatic genetic variants were identified by regressing the expression of an mRNA or a miRNA on the number of all mutated alleles across the gene. We discovered large comprehensive genetic and molecular networks that connect genes harboring associated mutations or LOH, mRNAs and miRNAs. Interestingly, we found five triangle cycles in the networks which indicated that significantly associated somatic mutations or LOH regulated both differentially expressed mRNA and miRNA and the miRNA in turn also affected expression levels of the mRNA. The approach presented here has two remarkable features. First, it offered a powerful tool for differentiating driver mutations from passenger mutations. Second, it provides functional information on how somatic mutations or LOH lead to tumorigenesis through complex genetic and molecular networks.
Our studies support the hypothesis that cancer may be the emergent properties of many genetic variants that are highly interconnected. Genetic and environmental stimuli can be viewed as random attacks to genetic and molecular networks. Topological properties of the genetic and molecular networks are closely related to the function of cells. Cancer arises from the failure of networks to respond to attacks. In other words, the attacked networks are unable to return to their normal states and remain functional in the face of random perturbations. We suspect that key components of a network contributing to the robustness of the network also play an important role in the function of the cells. We modeled over or under expression of mRNAs and miR-NAs, and genetic alternations as deletions of a node in the network which will cause dynamical changes in the network and used a damage value of the node to measure its contribution to the robustness of the network. We found that several cancers related genes such as TP53 have large damage values in the genetic and molecular networks. These key components in the network may serve as therapeutic intervention points.
Our results are preliminary. Although network analysis may have the potential to unravel the mechanism of tumor initiation and progression, the presented network structures and their properties in this report may depend on sampled tissues. Whether the structures of our reconstructed network can be replicated in other tissues or not is the key to the success of network analysis in cancer studies.

Conclusion
In this paper, we use system biology and network approaches to develop novel analytical strategies for (1) systematically integrating altered DNA variations, mRNA and miRNA expression variations into multilevel complex genetic networks that contribute to tumorigenesis, (2) deciphering paths from somatic and LOH mutations to tumor formation through genetic networks and (3) identifying key genetic alternations causing tumor formation by network analysis.

Test Association of Somatic Mutations and LOH with Glioblastoma
Cancers arise from mutations that confer growth advantage on cells [61]. The somatic mutations in cancers can be classified either as "drivers" or "passengers" [62]. In other words, mutations often have no effect on the development of a tumor. As the number of tumor tissues and normal tissues increases we can observe somatic mutations in both tumor and normal tissues. The current popular method for identifying driver mutations is to compare the difference in the mutation rates [4,63]. However, there is debate about how to assess a significant excess of mutations in tumors [64]. We need to develop formal tests to detect differences in mutation rates between tumor and normal tissues. Most traditional statistical methods that often test the association of genetic variants individually were designed for testing association of common alleles with common diseases and are inappropriate for testing the association of rare somatic mutations. A feasible approach is to record rare sequence variants at different genome positions and to collectively test the association of a set of rare variants. It has been shown that the number of rare alleles in large samples is approximately distributed as a Poisson process with its intensity depending on the total mutation rate [15]. The intensity of the Poisson process within a segment of the genome can be interpreted as the mutation rate. Similar to the standard χ 2 test for association of SNPs which compare the differences in allele frequencies between cases and controls, the proposed statistics are to compare difference in the mutation rates between tumor and normal samples. In some cases, we may have a homozygous genotype of rare mutations. To improve the power, in this case, we can count it twice. Instead of defining statistics in terms of genotype, we can similarly define the test statistics in terms of the rare alleles, which is denoted as T a . The statistic T a counts the number of mutated alleles at the locus. Thus, it will count homozygote of rare variants twice.
To examine the validity of the test statistics, we performed a series of simulation studies. We used infinitely many allele models and software (GENOME) to generate rare variants. Suppose that the mutation rate per generation per base pair is 1.00 × 10 -8 , the recombination rate between consecutive fragments is 0.0001, and the migration rate per generation per individual is 0.00025. We simulated 100 fragments of which each fragment length equals 10k base pair. A total of 100,000 individuals who were equally divided into cases and controls were generated in the general population, 500-2,000 individuals were randomly sampled from each of the cases and controls and 10,000 simulations were repeated. Table 7 summarizes the type I error rates of two statistics. Table 7 shows that the estimated type I error rates of the statistics for testing association of a set of rare variants with the disease were not appreciably different from the nominal levels α = 0.05, α = 0.01 and α = 0.001.
A LOH mutation was recorded when the genotype in blood or normal tissue is heterozygous, and in the tumor tissue, the reference allele loses normal function and the genotype becomes homozygous. The statistic T G can also be used to test association of LOH with glioblastoma.

Survival Analysis Links Gene and microRNA Signatures with GBM Survival Time
Patients (n = 358) with complete clinical information were obtained from the TCGA data portal. Survival analysis is used to deal with these time-to-event censored data. Censoring refers to the patients who may drop out or are still alive at the end of the study. In fact, leaving censored patients out would introduce bias to the remaining uncensored samples, and it is difficult to make adjustments for such bias. The approach we use, Cox proportional hazard regression, is a standard method in biostatistics for dealing with survival data [38]. For microRNA signature, we use the univariate Cox proportional hazard regression model to regress the survival time on every microRNA [65]. Hazard ratios from the Cox regression analysis were used to identify which microRNA signatures were associated with death from the recurrence of cancer or any cause. Protective signatures were defined as those with a hazard ratio for death < 1. High-risk signatures were defined as those with a hazard ratio for death > 1. All analyses were done with the SAS version 9.1 software (SAS Institute Inc). Two-tailed tests and P-values < 0.05 for significance were used.

Lasso for Coexpression Networks
A co-mRNA expression or co-miRNA expression network can be constructed by joint sparse regression for estimating the concentration matrix in which off-diagonal elements represents the covariance between the corresponding variables conditional on all other variables in the network [66]. Sparse regression for reconstruction of coexpression network is briefly introduced here. (For details, see [66]). Denote the mRNA or miRNA expression levels as variables y 1 ,...y q . A variable is represented by a node. An edge connecting two nodes indicates that the connected two variables are conditionally dependent, given all other variables. Assume that the vector of q variables Y = [y 1 ,...y q ] T follow a normal distribution N (0,Σ). Denote the partial correlations as r ij = Corr(y i , y j | y -(i,j) ) for 1 ≤ i <j ≤ q and -(i, j) ≡ {k:1 ≤ k ≠ i, j ≤ q}. If we assume the normality of the variables, then two variables y i and y j are conditionally dependent, given all other variables if and only r ij ≠ 0. Let Σ -1 = (s ij ) be the concentration matrix. Then,     ij ij ii jj = − . When sample size is much larger than the number of variables, the concentration matrix can be directly estimated from the inverse of the sampling covariance matrix. However, when the number of variables in the network is larger than the sample size, the inverse of the sampling covariance matrix may not exist. We need to develop methods for network modeling via estimating a sparse concentration matrix. One such techniques is sparse regression.
It is well known that the relationship between the partial correlation and regression exists. In other words, if where Y i = [y i1 ,...,y in ] T is the sample of the ith variable and l is a penalty parameter which controls the size of the penalty. A larger value of l leads to a sparse regression that fits the data less well and a smaller l leads to regression that fits the data well but is less sparse. Let θ Coordinate descent algorithms can be used to minimize L(θ, l) with respect to θ. Let p = q(q-1)/2 Briefly, coordinate descent algorithms are given as follows: 1. Initial step: let (x) + = xI (x > 0) , for j = 1,..., p Step 2: for j = 1,..., p, update θ 3. We repeat step 2 until converge.

Functional Module Identification and Gene Set Enrichment Analysis
Coexpression networks are usually organized into functional modules that perform specific biological tasks. Genes within coexpression modules often share conserved biological functions. A dynamic tree cut procedure was used to identify modules [67]. A coexpression network was clustered using hierarchical clustering.
Modules are defined as braches of the dendrogram. A one-sided Fisher exact test that calculates the probability of seeing the observed number of genes within a pathway or a GO category in the module by chance was used to test for overrepresentation of a pathway or a GO category in the module. We assembled 465 pathways from KEGG [68] and Biocarta http://www.biocarta. com. To test both the enriched Gene Ontology and pathways for the gene set, for example, the differentially expressed gene list and the eQTL genes of mutations, we used an online tool DAVID (The Database for Annotation, Visualization and Integrated Discovery), which provides a comprehensive set of functional annotation tools for investigators to understand the biological meaning behind large lists of genes [31], http://david. abcc.ncifcrf.gov/.

Ranking of the Nodes in the Network
Biological functions and mechanisms are encoded in network properties. An important strategy for unraveling the mechanisms of initiation and progression of cancer is to conduct analysis of complex biological networks and study their behaviors under genetic and epigenetic perturbations. Robustness of a biological network, ability to retain much of its functionality in the face of perturbation [9], has emerged as a fundamental concept in the study of network topological properties [10]. Widely used measures of network robustness include ranking the importance of the nodes in the network. One of the most efficient measures of importance in robustness analysis of the network is the damage value of a node which quantifies the effect of the removal of that particular node from the network. Formally, we define the damage value of a node as follows. Let G = (V, E) be the connected compo-