3.1 Extraction of RNA sequencing data
The RNA sequencing data, clinical information and survival information files of 155 GBM primary cancer tissues and 5 normal brain samples are all downloaded in TCGA. The TCGA includes data of 38 types of tumors, including 6 types of data (copy number, DNA methylation, gene expression RNAseq, phenotype, Somatic mutation, miRNA expression). In this study, we only took transcriptome data, including coding and non-coding RNA data.
3.2 Difference analysis of GBM samples
Violin diagram and box diagram were used to show the overall distribution of mRNA and LncRNA of all samples, and volcanic diagram and heat diagram were used to statistical and visual display the results of different mRNA and LncRNA. R package of pheatmap was used for the heatmap, and R package of ggplot2 was used for the box diagram, volcano diagram and violin diagram.
A heatmap can intuitively show the expression levels of different genes in each sample, cluster them according to the expression data of different genes, and obtain the relationship between the expression patterns of samples and genes, so as to draw a correlation conclusion with biological significance. In the heat map, each column represents one sample, including 155 GBM samples and 5 normal samples. Each line represents a gene (mRNA and LncRNA). Firstly, statistical software R was used to conduct statistical analysis of the differences between the mRNA and LncRNA of GBM samples and normal samples (FDR < 0.01 & P value < 0.01 & |log2FC| > 1). Finally, 547 differentially expressed mRNAs were selected from 19261 mRNAs, and 471 differentially expressed LncRNA s were selected from 11307 lncRNAs. Genes expressing similar expressions (mRNA and LncRNA) were clustered together with the samples. Black represents 0, indicating no change in gene expression level, red represents increased expression level, and green represents decreased expression level. The brightness of the color represents the degree to which gene expression levels rise or fall. It can be seen from the results that mRNA and LncRNA are mostly up-regulated in the 5 normal samples compared with GBM(Fig 1E,F).
The violin diagram and box diagram can be combined to show the distribution of gene expression abundance in each sample. The results showed that the abundance difference of mRNA and LncRNA in these 160 samples was less than 5%, that is to say, the abundance difference of coding and non-coding genes in normal and abnormal samples was not significant(Fig 1 A,C).
To be more specific, we also found by barplot that 430 of the 547 mRNAs were down-regulated and 117 were up-regulated, compared with the normal sample. Among the 471 differentially expressed lncRNAs, 422 were down-regulated and 49 were up-regulated (Fig 2A).
The volcano plot is used to show the significantly different expressions of the two sets of samples. The p-value and Fold change values obtained by accurate T-test statistical analysis were used to draw the volcano map between the two groups. In the volcano plot, one of the coordinates shows the negative log of P-values computed by the T-test, and the other shows the converted log of P-values compared to the two conditions. Similarly, we can also find through the volcano map that compared with the normal sample, the GBM sample down-regulated more in the differentially expressed mRNA and LncRNA (Fig 1B,D).
3.3 Enrichment analysis of GO and KEGG pathways of DEGs
We conducted GO and KEGG enrichment analysis of the differentially expressed genes respectively, and selected the GO and KEGG results of Top30. In the results of BP, we found that the main difference between gene enrichment in embryonic bone system development and formation, chemical synapses and synaptic transmission, eating behaviors, G protein coupled receptor pathway and G protein coupling the second messenger cyclic nucleotide receptor signaling pathways, neuropeptide signaling pathways, and other functions. In the results of MF, we found that the differentially expressed genes were mainly concentrated in neurotransmitter binding and receptor activity, G protein-coupled peptide receptor activity, G protein-coupled receptor activity and other functions. In the results of the KEGG pathway, we found that the differentially expressed genes were mainly concentrated in the interaction in stimulating nerve tissues, G-protein-coupled receptor ligand binding, G-protein-coupled receptor downstream signaling, and G-protein-coupled receptor pathways, among which G-protein-related pathways accounted for three of the first ten pathways. This indicates that there is a close relationship between GPCRs and GBM. More detailed GO and KEGG enrichment analysis results are shown in the Fig 3 below. This will help us further understand the role of DEGs in the genesis and development of GBM.
3.4 Univariate COX survival regression analysis
In order to better classify the samples, there were 167 GBM tumor samples with prognostic survival data in TCGA, including 155 GBM primary cancer tissues and 12 GBM metastatic cancer tissues (one of which has no survival data), including 15110 lncRNAs in total, and 14480 lncRNAs were left after the deletion of some lncRNAs that appeared only a few times. Finally, a total of 642 lncRNAs with significant prognostic correlation were calculated by univariate COX survival regression analysis (Fig 2B).
3.5 LncRNA-based prognostic biomarkers in GBM patients
In the significance analysis results of GBM samples, we obtained a total of 471 lncRNAs with significant differences, and 642 lncRNAs with significant prognostic differences in univariate COX survival regression analysis. The analysis software was used to take the intersection of the two (Venn diagram) to obtain 12 lncRNAs, which were not only differentially expressed in normal tissues and cancer tissues, but also had prognostic differences in cancer tissues(Fig 2C).
3.6 LncRNA with independent prognostic factor
Prognostic biomarkers based on LncRNA in GBM patients, namely 12 lncRNAs obtained by intersection, were used for prognostic survival curves respectively. Univariate COX survival regression curve analysis showed that among the 12 lncRNAs (Fig 4), only AC097641.1 and AC117453.1 were HR< 1, is a protective factor. While other LncRNA HR>1, is a risk factor. The prognosis of patients in the high-risk group was significantly worse than that of patients in the low-risk group. Then, stepwise COX multivariate regression was used for the 12 lncRNAs obtained, and 5 lncRNAs (P<0.05) (Fig 5).
3.7 Multiple regression model risk Score survival analysis
We constructed a multiple regression model with 5 lncRNAs that could act as independent prognostic factors. The risk score of each sample was calculated based on a multiple regression model consisting of five lncRNAs (Risk Score =1.43*AC021594.1+3.92*AC105046.1+0.27*CRNDE+0.92*LINC01574+1.11*LINC02320) Then, according to the median risk score, all cases were divided into the high-expression group and the low-expression group, and the correlation between risk score and prognosis was calculated. P < 0.0001 was significantly correlated with prognosis.
Survival analysis using scatter plot (Fig 6.A, B), heat map (Fig 6.C), and Kaplan-Meier plot (Fig 6.D), all showed that the prognosis of the high-risk group was significantly lower than that of the low-risk group. It is proved that the multiple regression model can accurately predict the clinical prognosis of GBM patients.
3.8 Coexpression network of coding gene/non-coding gene
Although more and more studies are trying to reveal the functional significance of lncRNAs, the biological role of most lncRNAs is still unknown. Biological processes and cellular regulatory networks are very complex and involve the interaction of various molecules, such as proteins, RNA and DNA. Therefore, we constructed a coexpression network of coding gene/non-coding gene based on 155 GBM samples and 5 normal brain samples in TCGA database to study the potential interaction between GPCR-related mRNAs and lncRNAs in each sample. The co-expression network consisted of 93 GPCR related mRNAs and 5 lncRNAs (Fig 7). The figure showed that 5 lncRNAs (AC021594.1, AC105046.1, CRNDE, LINC01574, LINC02320) were all related to a large number of targeted mRNAs. AC021594.1 can target 23 network nodes, AC105046.1 22 network nodes, CRNDE 20 network nodes, LINC01574 22 network nodes and LINC02320 23 network nodes. The co-expression network also shows that a coding gene can also target many network nodes. GRIN1 targets 23 network nodes , AC105046.1 targets 22 network nodes, CRNDE targets 20 network nodes, LINC01574 can target 22 network nodes, LINC02320 can target 23 network nodes, GABRA1 can target 19 network nodes, GABRG2 can target 18 network nodes, RASAL1 can target 18 network nodes, CHRM1 can target 16 network nodes, SST can target 16 network nodes. These results showed that lncRNAs and GPCR-related mRNAs in GBM patients interregulated very closely.
In addition, external data sets in CGGA were also used to verify the above results. In the CGGA database, one of the lncRNAs, namely CRNDE, could be found, and the results were consistent with the previous ones. In other words, the prognosis of the high-risk group was significantly lower than that of the low-risk group, which was significantly correlated with the prognosis(Fig 8).