In silico analysis of long non-coding RNAs in medulloblastoma and its subgroups

Background Medulloblastoma is the most common malignant pediatric brain tumor with high fatality rate. Recent large-scale studies utilizing genome-wide technologies have sub-grouped medulloblastomas into four major subgroups: wingless (WNT), sonic hedgehog (SHH), group 3, and group 4. However, there has yet to be a global analysis of long non-coding RNAs, a crucial part of the regulatory transcriptome, in medulloblastoma. Methods and findings We performed bioinformatic analysis of RNA-seq data from 175 medulloblastoma patients. Differential lncRNA expression sub-grouped medulloblastomas into the four main molecular subgroups. Some of these lncRNAs were subgroup-specific, with a random forest-based machine-learning algorithm identifying an 11-lncRNA diagnostic signature. We also validated the diagnostic signature in patient derived xenograft (PDX) models. We further identified a 17-lncRNA prognostic model using LASSO based penalized Cox’ PH model (low risk group HR= 0.207, 95% CI= 0.133-0.323, p-value= 2e-14). Conclusions Our analysis represents the first global lncRNA analysis in medulloblastoma. Our results identify putative candidate lncRNAs that could be evaluated for their functional role in medulloblastoma genesis and progression or as diagnostic and prognostic biomarkers.


Introduction
, characterized as WHO group IV, represents the most common malignant pediatric central nervous system (CNS) tumor [1][2][3][4], representing 9.2% of all pediatric brain tumor cases [1,5] and roughly 500 new cases each year in the US. MBs originate in the cerebellum and share molecular signatures with embryonic cerebellar lineages, with metastasis sites commonly include parts of the brain, spinal cord, and, rarely, to extraneural sites [6][7][8].
Commonly used treatment strategies for MB include maximal safe surgical resection, radiotherapy, and chemotherapy, which are poorly tolerated by pediatric patients who are usually under seven years of age [9]. Appropriate treatment selection depends upon the clinical subgroup, stage, extent of resection, location, and the patient's ability to withstand treatment [10]. In efforts to improve therapeutic outcomes, combined genetic and epigenetic approaches have refined MB classification into four clinically and molecularly distinct subgroups: wingless (WNT), sonic hedgehog (SHH), group 3, and group 4 [11]. Despite these significant advances, MB remains deadly for many patients, with a ~30% fatality rate. Further, even successful eradication of the tumor often results in a deteriorated overall quality of life due to side effects including organ dysfunction, neurocognitive impairment, endocrine disabilities, and secondary tumors [10][11][12][13].
Even with these advances in molecular classification, group 3 and group 4 tumors are heterogeneous groups that continue to make management challenging. There is an urgent need to identify the underlying molecular mechanisms in these subgroups to drive precision medicinebased approaches, improve quality of life, and increase our understanding of MB in general.
Long non-coding RNAs (lncRNAs) represent a major part of the transcribed genome that do not code for functional proteins. LncRNAs are more than 200 nucleotides in length and are transcribed by RNA polymerase II. While previously labelled as transcriptional "noise", it is now understood that lncRNAs are functional and play important roles in cellular physiology, development, and disease progression. In humans, there are at least three times as many lncRNAs as protein-coding genes [14]. Although the precise roles of the vast majority of identified or predicted lncRNAs remain unknown [14], they are increasingly recognized as being involved in cis or trans interactions regulating gene expression in the nucleus and protein interactions in the nucleus and cytoplasm. Some of the functionally diverse roles of lncRNAs include transcriptional silencing (e.g., XIST [15]), enhancers by regulating three-dimensional chromosomal structure to strengthen interactions between enhancers and promoters (e.g., LUNAR1 [16]), and as microRNA sponges that sequester microRNAs from their target sites (e.g., SNHG7 [17]). LncRNAs can also act as scaffolds for protein-protein and protein-nucleic acid interactions [18]. They are potential biomarkers and therapeutic targets in cancer, with several lncRNAs now studied for their oncogenic or tumor suppressor potential in several cancer types through their regulation of the cell cycle, cell death, senescence, metastasis, immunity, and cancer cell metabolism [19].
However, there has yet to be a genome-wide study of MB to identify dysregulated lncRNAs. With this aim, we analyzed the transcriptomic profiles of 175 MB patients to map lncRNA expression profiles and identify subgroup-specific lncRNAs. We show that the MB lncRNAome exhibits significant heterogeneity that corresponds to the molecular subtypes. Using a random forest-based machine-learning algorithm, we identify lncRNA signatures that could improve on present diagnostic approaches, while penalized Cox-PH regression identifies prognostic lncRNAs. Taken together, our analysis identifies candidate lncRNAs with subgroup-specific activity in MB and with diagnostic and prognostic value.

RNA-seq alignment, quantification, and differential gene expression analysis
Raw FASTQ files were quality checked for adapter contamination using FASTQC. FASTQ files containing adapter sequences were trimmed by running through trim_galore in default mode.
The reads were mapped to the GRCh38/hg38 human genome assembly p12 (release 28, www.gencodegenes.org) using HISAT2 and annotated using the corresponding release version GENCODE comprehensive annotation file and LNCipedia 5.2 high confidence set annotation file. Mapped reads were quantified using StringTie to obtain FPKM values, which were converted to read counts using the prepDE.py script (provided in StringTie manual). For variance-stabilized normalized reads and differential gene expression analysis, reads counts were processed with DESeq2 in R [24].

Consensus clustering
Variance-stabilized expression levels of the top 10,000 variant lncRNAs determined from standard deviations of read counts normalized to library size were used as input to perform 1000 permutations of k-means-based consensus clustering using ConsensusClusterPlus R package [25].

Co-expression module detection and trait correlation analysis
Variance-stabilized expression of top 5000 variant lncRNAs was used to obtain a weighted correlation network using the WGCNA R package [26]. The correlated lncRNA cohorts were associated with MB subgroups using the module-trait correlation algorithm as described [26].

Random forest model
Subgroup-specific diagnostic models were obtained by performing variable selection using expression of differentially expressed lncRNAs/ protein coding genes (PCGs) and the randomForest R package (as described in [27]). For all models, variance-stabilized expression of differentially expressed lncRNA/PCG genes were used as variables to obtained models to classify patients into known subgroups. For the lncRNA model distinguishing SHH, group 3, and group 4, patient samples were divided into a 60% training set and 40% tuning set. Only differentially expressed (|logFC| >1.5, padj <0.01) lncRNAs genes between SHH, group 3, and group 4 were used to classify patients into known subgroups. The training model was used to find important genes ranked based on the "mean decrease accuracy" parameter. Low ranking genes with high expression correlation (>0.80) to high ranked genes were discarded. Gene combinations based on the final ranked list were used in the tuning model to find the minimum number of genes resulting in the minimum or comparatively lower error rate in the tuning set. A similar approach was used to distinguish WNT from the other subgroups. For the training set, all WNT samples were combined with 60% samples from SHH, group 3, and group 4 samples. For the tuning set, all WNT samples were combined with 40% training set the remaining subgroups.
To distinguish group 3 and group 4, a similar 60%-40% training-tuning model was adapted for classification. 8 A random forest-based model for PCGs distinguishing WNT, SHH, group 3, and group 4 was obtained as described above for lncRNAs. To validate the protein coding model, expression levels of the obtained signature genes were used to classify patient samples from the MAGIC dataset using the random forest model and tSNE plots.
Receiver operating characteristics (ROC) curves and area under the ROC curve (AUC) values for one versus rest comparisons were computed based on a generalized linear model-based fit of subgroup identity with normalized gene expression levels of signature genes as the variable using the pROC R package.

Transcriptional network inference
A transcriptional inference network for putative regulation between candidate lncRNAs and transcription factors was obtained using minet R package [28]. Regulatory interaction measures were obtained based on the network obtained from CLR-, arcane-, and mrnet-based models.
Only edge connections predicted in all three approaches were analyzed further for first neighbor connections of transcription factors with each candidate lncRNA.

Survival analysis
Expression of 621 lncRNAs included in the MAGIC microarray dataset for 612 patients with survival information belonging to the four subgroups was used as input to find prognostic signatures using penalized Cox's proportional hazard (Cox-PH) model with the LASSO (α =1) penalty using glment R package [29]. Smallest mean squared error and lambda.min was selected from 100 random runs of the model fitting with 10 fold cross validation in each run. Variables (lncRNAs) with non-zero coefficients associated with the lambda.min were selected as the 9 prognostic model. Stability of the obtained model was verified by performing 1000 bootstraps on the data using BootValidation R package. A risk score was calculated for each patient by summing the value of the product of normalized expression and penalized Cox-PH coefficient of a candidate lncRNA with that of all other candidate lncRNAs. Kaplan-Meier analysis was conducted using the obtained risk score of a 17-lncRNA prognostic model for the MAGIC dataset with survival information and 74 patients belonging to four subgroups with survival information in the ICGC RNA-seq dataset.

Long non-coding RNA signatures of medulloblastoma and subgroup-specific lncRNA enrichment
To determine genome-wide expression profiles of lncRNAs in MB, we analyzed RNA-seq data from 175 patient samples obtained from the ICGC PedBrain dataset. For comprehensive lncRNA annotation, we chose a combination of the GENCODE and LNCipedia datasets [30]: GENCODE represents the largest manually curated lncRNA dataset and LNCipedia contains the maximum number of high fidelity predicted lncRNA genes [30]. The expression of 52,128 unique lncRNAs and 19,033 protein-coding genes (PCGs) were quantified (Fig 1A) in 18 WNT MBs, 45 SHH MBs, 46 group 3 MBs, and 66 group 4 MBs. To better understand the role of lncRNAs in different MB clinical and molecular subgroups, we investigated correlations between lncRNA expression and subgroup type. First, we performed consensus clustering using the top 10,000 highly variant lncRNAs, which clustered the MBs into four different groups that highly overlapped with the known molecular subgroups (Fig 1B, Fig S1), suggesting that lncRNAs could contribute to subgroup-specific traits. With the objective of identifying highly variable lncRNAs specifically enriched in each subgroup, we performed weighted correlation analysis using WGCNA to find expressional corelated lncRNAs and their subgroup specificity. Weighted co-expression analysis of the top 5000 highly variable lncRNAs identified nine distinct cohorts after merging modules below the threshold. We next determined subgroup-specific module expression by performing module-trait association to obtain each module's correlation and significance value (Fig 1C, Table S2).
Module cohorts were significantly positively correlated with WNT (467 lncRNAs, A3), SHH (452 lncRNAs, A4), group 3 (629 lncRNAs, A9), and group 4 (760 lncRNAs, A8) MBs; respectively. Gene expression in each of the identified modules also showed that these genes are highly co-expressed in their respective groups compared to other groups (Fig 1D). In addition, cohorts enriched in group 3 and group 4 were more correlated than WNT and SHH MBs, and 11 vice versa. This suggests that similar to protein-coding gene expression and DNA methylation patterns, group 3 and group 4 patients also share similarities based on lncRNAs' expression.
A candidate diagnostic lncRNA signature for medulloblastoma subgroup classification WGCNA analysis suggested a number of lncRNAs with subgroup-specific expression. We therefore proceeded to identify the minimum number of lncRNAs that could faithfully classify MB subgroups. To achieve our objective, we used random forest based machine learning approach that has been shown to be a robust method for such classification objectives [27]. As patients were not evenly distributed between the four subgroups, we adopted a two-step approach: first, we identified a signature for groups with similar patient distributions i.e., SHH, group 3, and group 4; second, we identified a signature distinguishing WNT from the other subgroups (Fig 2A). Using this two-step approach, an 11-lncRNA signature was identified with an average <7% class error rate. Using the 11-lncRNA model, the 175 samples were re-classified into the already known subgroups with few misclassifications (Fig 2B), with individual lncRNA showing highly subgroup specific up/down expression (Fig 2C). Importantly, patient ICGC_MB23, which was classified as SHH in our random forest model but labeled WNT in the obtained dataset, was originally considered as an SHH MB in Kool et al. [31] and lacked the signature mutation in β-catenin. We also validated the 11-lncRNA based patient grouping using t-SNE based clustering (Fig 2D, that did not classify ICGC_MB23 as SHH) and specificity and sensitivity of the model using ROC/AUC analysis performing one versus rest comparison (Fig   2E). In the absence of an independent dataset containing expression levels of the 11 candidate lncRNAs to validate the model, we instead validated our random forest model building process.
We performed a similar classification of 175 MB samples using protein-coding genes to produce a 14-PCG model with equivalent success to the lncRNA model in classifying patient samples into the four subgroups (Fig S2). We then validated the 14-PCG model using the independent MAGIC microarray dataset of 763 patient belonging to the four subgroups. As expected, the 14-13 PCG model performed well with an overall <4% class error rate, validating our random forest model building process (Fig S3).
Group 3 and group 4 represent the two most heterogeneous yet closely related and difficult to distinguish MB subgroups, a pattern evident from diagnostic signature-based clustering (Fig   2D). To identify lncRNAs that could distinguish group 3 and group 4 tumors, we again used our random forest model approach to select highly differentially regulated and discriminative genes (Table S3). Our analysis yielded an 8-lncRNA model that did not improve the overall efficiency of group 3 versus group 4 classification (Fig S4) compared to the 11-lncRNA model distinguishing all subgroups (Fig 2). Nevertheless, the analysis did reveal some lncRNAs with potential functional roles in group 3 or group 4 MBs (Fig S4B), some of which overlapped with the 11-lncRNA model (i.e., MIR100HG, USP2-AS1, and lnc-CFAP100-4). However, we also identified other candidate lncRNAs including ARHGEF7-AS2, lnc-HLX-1, lnc-EXPH5-2, lnc-CH25H-2 and lnc-TDRP-3 that showed group-specific differential expression in group 3 or group 4 patients (Fig S4B).
We further validated our random forest-based model in patient derived xenograft (PDX) samples derived from SHH (BT084, DMB012, RCMB32, and MED1712FH), group 3 (RCMB28, MB002, MB511H, and RCMB40), and group 4 (RCMB51, DMB006, RCMB45 and RCMB38) patients using the 9-lncRNA signature to classify SHH, group 3, and group 4 patients (Fig 2A), as WNT PDXs were not available for analysis. SHH, group 3, and group 4 samples were successfully identified using k-mean based clustering, principal component analysis (PCA) (Fig   3A) and hierarchical clustering using normalized RNA-seq expression levels (Fig 3B), with the 14 exception of RCMB28 that was found to be more related to SHH PDXs. Quantitative expression of signature genes was validated by qPCR (Fig 3C and D) and closely resembled expression in patient RNA-seq data (Fig 2C). In order to infer the physiological importance of the identified 11-lncRNA candidates, we used a combination of CLR, arcane, and mrnet transcriptional inference algorithms to identify potential interactions between the identified lncRNAs with the expressed transcription factors in MB patients. The identified lncRNAs could potentially interact with a number of transcription factors in a complex interconnected network with both highly positively and negatively correlated associations, suggesting putative biological cross-regulation (Fig S5).

Prognostic lncRNAs in medulloblastoma
To identify prognostic lncRNAs, we used the MAGIC microarray array dataset containing survival data for 612 (out of 763) patients. The microarray expression data contained expression levels of 621 lncRNAs that we used for multivariate Cox proportional hazards (Cox-PH) 15 regression analysis. For feature selection, we utilized a penalized multivariate Cox-PH model using the LASSO penalty (α = 1). We first used the entire dataset to select 17 lncRNAs as prognostic markers and their associated penalized coefficients ( Table 1). Of these 17 lncRNAs, 10 were markers of good prognosis and 7 were associated with poor prognosis. The 17-lncRNAs model was validated on 1000 bootstraps of the MAGIC dataset, that showed predictive stability of the proposed prognostic model. Using the penalized coefficients and log-normalized expression values for each lncRNA, we assigned a risk score to each patient. Kaplan-Meier analysis of 612 patients using the risk score as the input variable suggested that our risk score signature was a highly significant in prognostic value (Low risk HR= 0.207, 95% CI= 0.133-0.323, logrank p-value= 2e-14) (Fig 4A). To validate our 17-lncRNA prognostic model in an independent dataset, we used ICGC RNA-seq data of 74 patients with survival information and used the penalized coefficients obtained from the MAGIC dataset analysis and variance stabilized expression from RNA-seq data to obtain an equivalent risk score. Again, Kaplan-Meier analysis showed that prognostic significance of our 17-lncRNA model (Low risk HR= 0.135 , 95% CI= 0.017-1.08, logrank p-value= 0.026) (Fig 4B). These candidate lncRNAs could potentially be involved in MB development as pro or anti-tumorigenic factors.

Discussion
Long non-coding RNAs are increasingly recognized as important players in cancer research [19], particularly as biomarkers and/or therapeutic targets [32][33][34][35][36], including in brain tumors [20,[37][38][39][40]. However, there is a lack of knowledge of lncRNAs' involvement in MBs. Here we bridged this knowledge gap by proposing diagnostic and prognostic biomarkers candidates for further study in vitro and in vivo systems to understand their potential function in MB genesis/progression.
Our study is the first genome-wide analysis of lncRNAs' expression profile in MB and its subgroups. Overall lncRNAs' expression dynamics mirrors the well-known MB heterogeneity seen in genetic and epigenetic analyses [23,41]. MB subgroup clusters obtained using highly variable lncRNAs overlapped with existing clinical and molecular subgroups. Using variantly expressed lncRNAs and weighted correlations, we further identified subgroup-specific lncRNAs.
These upregulated lncRNAs might represent functionally relevant genes and require further validation. The obtained lncRNA signature could be curated using transcriptional inference algorithms and proximity to or co-relation with known MB relevant protein coding genes for further functional validation in vitro and in vivo studies.
Presently, very few lncRNAs have been studied for their putative roles in MB or its subtypes.
NKX2-2AS was shown in vitro to modulate SHH-potentiated MB development by acting as a miRNA sponge for miR-103 and miR-107, thereby depressing their tumor suppressive targets BTG2 and LATS1 and inhibiting proliferation and migration [42]. CDKN2B-AS1 (ANRIL) has been shown to promote proliferation in vitro studies by sponging miR-323 and activating BRI3 dependent p38-MAPK, AKT and WNT signaling [43], and in current analysis, it was found to be upregulated in group 4 patients compared to other MBs. PVT1 is prevalently found fused to MYC and NDRG1 genes in group 3 tumors, leading to oncogenic transformation of these genes [44]. lnc-IRX3-80 (CRNDE) was also reported as an oncogenic lncRNA in vitro and in vivo studies [45]. Both PVT1 and lnc-IRX3-80 were upregulated in WNT and SHH MBs in our current analysis. Lnc-FAM84B-15 (CCAT1), which was found upregulated in WNT and group 3 MBs, has also been shown to be involved in promoting tumor proliferation and metastasis by activating MAPK pathway [46]. MIR100HG (lnc-NeD125) has been shown to be overexpressed in group 4 MBs, again acting as an miRNA sponge for miR-19a-3p, miR-19b-3p and miR-106a-5p, exerting an oncogenic function by de-repressing cell cycle target genes [47]. MIR100HG is also oncogenic in gastric cancer [48], breast cancer [49], and leukemia [50]. In our analysis, only MIR100HG (lnc-NeD125) was selected in our diagnostic signature, being highly expressed in all MBs but group 3 (Fig 2). In addition, our 11-lncRNAs model could complement existing molecular and clinical-based diagnostic approaches, particularly for group 3 and group 4 MBs.
Our 17-lncRNAs prognostic model represents another set of putative functionally important lncRNAs. Of the 17 lncRNAs, seven were associated with poor prognosis, including H19 ( Table   1). None of the candidate poor prognostic marker were specifically expressed in a particular subgroup of patients, suggesting independent prognostic value, although patients with high expression of the signature tended to be group 3 (Fig S6). H19 is a well-studied oncogenic lncRNA in various cancer systems including glioblastoma, where it has been shown to be promote cellular proliferation and metastasis [56][57][58][59]. LncRNA LINC01551 has been found to upregulate cellular proliferation and migration in non-CNS cancers such as hepatocellular carcinoma by interacting with the miR122-ADAM10 axis [60]. LINC00336 promoted lung cancer progression by inhibiting regulated cell death by knocking down miR-6852 function [61].
Overall, our analysis proposes new lncRNAs candidates in MB with functional, diagnostic, and prognostic significance that warrant further investigation and validation. This is the first global analysis of lncRNAs in MB that will provide an invaluable resource for those working in the field to prioritize for further study.

SUPPLEMENTAL INFORMATION
Supplementary Tables   Table S1. List of 175 medulloblastoma patient samples and clinical information. Table S2. List of subgroup specific lncRNA cohorts identified by WGCNA analysis. Table S3. List of differentially expressed lncRNA between group 3 and group 4 patients.