Identification of HMG-box family establishes the significance of SOX6 in the malignant progression of glioblastoma.

Glioblastoma multiforme (GBM) is the most malignant neuroepithelial primary brain tumor and its mean survival time is 15 months after diagnosis. This study undertook to investigate the genome-wide and transcriptome-wide analyses of human high mobility group box (HMG-box) TF (transcript factor) families / HOX, TOX, FOX, HMG and SOX gene families, and their relationships to GBM. According to the TCGA-GBM profile analysis, differentially expressed HOX, FOX, HMG and SOX gene families (62 DEmRNA) were found in this study. We also analyzed DEmRNA (HMG-box related genes) co-expressed eight DElncRNA in GBM, and constructed a ceRNA network analysis as well. We constructed 50 DElncRNA-DEmiRNA-DEmRNA (HMG-box related genes) pairs between GBM and normal tissues. Then, risk genes SOX6 and SOX21 expression were correlated with immune infiltration levels in GBM. SOX6 also had a strong association with MAPT, GSK3B, FYN and DPYSL4, suggesting that they might be functional members in GBM.


INTRODUCTION
Glioblastoma multiforme (GBM) is the most malignant neuroepithelial primary brain tumor [1]. For GBM patients, its mean survival time is 15 months after diagnosis [2]. HMG-box (high mobility group box) domains are associated with the HMG-box proteins which influence DNA-dependent processes (transcription, replication, and DNA repair) and require changing the conformation of chromatin [3].
The HMG-box gene family is a family of TF-encoding genes which include a DNA-binding homeobox domain [4], such as HOX, FOX, SOX, HMG, and TOX genes. There were many studies on HMG-box genes in gliomas. HOX gene family was highly expressed in GBM cancer stem cells compared with parental lines, and HOX-PBX inhibition was a potential therapeutic target for GBM patients [5], and HOXD10 was targeted by hsa-miRNA-23a to inhibit glioma cell invasion [6]. Sex-determining region Y (SRY)-related high mobility group box of genes was abbreviated as SOX genes [7]. Using human glioma-initiating cell (GIC) lines (GIC1 and GIC2) created from anaplastic oligodendroglioma (AO) and GBM, both GIC1 and GIC2 expressed SOX2 and SOX3, and neither GIC line expressed SOX1 [8]. The gliogenesis of GBM was dependent on SoxD (SOX5, SOX6 and SOX7) and SoxE (SOX8, SOX9 and SOX10) [9]. SOX6 was specifically expressed by IgGs in GBM [10]. The moderate expression of SOX10 and SOX11 was linked to glioma, whereas the overexpression of them were associated with GBM [11]. SOX9 expression is connected to a poor prognosis of GBM patients and with resistance to temozolomide [12]. SOX2 / SOX21 axis could function as a tumor suppressor during glioma genesis [13]. However, SRY, SOX12, SOX15, SOX18, and SOX30 have not been reported to be associated with GBM. FOXM1 AGING overexpression promoted clonogenic growth of GBM cells. FOXG1 and SOX2 via transcriptional control of core cell cycle and epigenetic regulators to fuel unconstrained self-renewal in GBM stem cells [14]. SOX9 and FOXG1 co-regulated a subset of EGFR [15]. Hsa-miR-338-5p also played a tumor suppressor role in glioma by binding FOXD1 [16].
However, transcriptomic-and genomic-wide systematic studies of HMG-box families in GBM is lacking. In order to better solve this problem, integrated analysis of HMG-box related gene family in GBM based on data gathered from GEO and TCGA database. We expected to find the DE-HMG-box and related noncoding RNA in GBM, and discovered the potential drug or disease target for GBM. Our findings provided new insights into the molecular role and phylogeny of the HMG-box families in GBM.

Transcriptomic identification of DEGs between GBM and normal tissues
By obtaining data from TCGA database, we re-analyze the transcriptomic profiles of TCGA-GBM dataset, and 174 samples (169 GBM tissues and 5 normal tissues) were chosen to obtain DEmRNA (differentially expressed mRNA) and DElncRNA (differentially expressed lncRNA). GBM miRNAs expressed profiles were downloads from GEO database (GSE90603). There were 123 HMG-box genes that exist in TCGA-GBM. Through the analysis of the TCGA datasets, it was found that partial SOX, FOX, HOX, TOX and HMG gene families (a total of 62 genes) were significantly differentially expressed (|logFC| > 1 and qvalue < 0.05) in our study, relative expression heatmap visualization was drawn in Supplementary Figure 1. Starting from the left, the first 5 datasets were normal tissues, and the remaining 169 were GBM tissues. Differentially expressed HMG-box genes were displayed via volcano plot ( Figure 1). Only five HMGbox DEGs (FOXP1, FOXO4, SOX7, FOXP2 and AGING TOX2) were downregulated between GBM and normal tissues, the others were up-regulated.
To further explore the function of isolated DE-HMGbox related genes, these 62 DEmRNAs were entered into clusterProfiler for GO enrichment. PPI network and Reactome and KEGG pathway enrichment analyses were built by STRING database and also presents the most significant enriched pathways of DE-HMG-box related genes in GBM (Figure 2A). Moreover, HMGA2, HOX9-11 were enriched in "Transcriptional misregulation in cancer" in the KEGG pathway. From Reactome pathway results, we found these genes enriched in six pathways, such as HOXA1-4, HOXB2-4, HOXC4, SOX2, and FOXD3 were enriched in "developmental biology", HMGB1-2 were enriched in "Activation of DNA fragmentation factor" (Figure 2A).
The top ten GO enrichment analysis results (q-value < 0.05) were shown in Figure 2B, the most significantly enriched in "GO:0009952: anterior/posterior pattern specification" ( Figure 2B).

Identification of HMG-box DEGs co-expressed lncRNAs
According to the median risk score, GBM patients in TCGA were divided into high-and low-risk groups. We performed principle component analysis (PCA) graphs on the HMG-box related DEmRNA, co-expressed DElncRNA and risk DElncRNA ( Figure 3A-3C), green dots present low risk, and red dots present high risk in GBM patients. The eight HMG-box related lncRNAs heatmap employed in constructing the risk scoring model and survival information were shown in   Figure 3F). Of these eight lncRNAs, six were detected as high risk (BNC2-AS1, AC018450.1, MIR222HG, AC005005.3, AC025171.1, AGAP2-AS1, coefficient > 0), while two were supportive (SOX21-AS1, ZEB1-AS1, coefficient < 0). We also found that the overall survival time of patients in the high-risk group was lower than that in the lowrisk group (p-value <1.604e-08, Figure 3G).
A total of 147 DElncRNA (q-value < 0.05) were gained as well, of which 44 DElncRNA were up-regulated and 103 DElncRNA down-regulated. The association networks that included the DE-HMG-box gene families and their related co-expressed DElncRNA were constructed ( Figure 4). The resulting lncRNA-mRNA association network had 68 interfaces between 38 lncRNAs and 27 mRNAs. The network showed that SOX6 was proposed to be the target of nine lncRNAs, FOXO4 was targeted by seven lncRNAs, and three mRNAs (HOXD4, SOX11, and SOX6) targeted AP002360.3.

CeRNA network construction
We collected TCGA-GBM profiles (lncRNAs and mRNAs) and GEO data GSE90603 (miRNAs) in GBM through computational analysis to estimate potential relationships based on the ceRNA hypothesis to further understand their function. We found that 401 AGING DEmiRNA (differentially expressed miRNA) is between seven normal tissues and sixteen GBM tissues (282 down-regulated and 119 up-regulated DEmiRNA). Using miRCode through miRNA response elements, eleven specific DEmiRNAs (three down-regulation and eight up-regulation) were detected to bind with sixteen DElncRNAs (fifteen down-regulation and one upregulation).
In order to improve the prediction accuracy, TargetScan, SeedVicious and miRanda databases were combined to predict nine candidate DEmRNA targets for DEmiRNA. Cytoscape software was used to visualize a ceRNA network comprising sixteen lncRNAs, eleven miRNAs, and nine mRNAs based on the interactions between lncRNAs, miRNAs, and mRNAs ( Figure 5).

Risk score performance, comparison and combination of gene-signature
To confirm the performance of the risk score in determining the survival rate of GBM patients, we used a model based on the prognostic dual genes (SOX6 and SOX21) signature to score the risk for each GBM patient. Risk genes (SOX6 and SOX21) expression levels were positively correlated with the infiltration levels of dendritic cells (p-value = 7.524E-08) and macrophages (p-value = 0.012) ( Figure 6A). ROC curve analysis of five-year survival rate was used to evaluate the projection potential of two HMG-box-related genes. The area under the curve (AUC) of the prognostic model based on the properties of the two genes had a total survival time of 0.625 at 60 months ( Figure 6B). Patients were categorized as high risk (n = 76) or low risk (n = 76), with the median risk being used as the cutoff value for survival analysis. Kaplan-Meier analysis showed that the overall survival curves of the two groups were significantly different (p-value = 1.478e-03, Figure 6C). Each patient's risk score ( Figure  6D), survival status ( Figure 6E), and spread of gene expression levels of both genes ( Figure 6F) were also analyzed. In order to evaluate the performance of HMGrelated genes as markers, we obtained two gene markers (SOX21, HR: 0.970 (95% CI: 0.942-0.999)); SOX6, HR: 0.906 (95%) to predict the prognosis of GBM patients through forest distribution maps. CI: 0.827-0.993)) ( Figure 6G). Given the increasing association between immunological feature and prognosis in GBM cancer, we further explored the correlation between SOX6 and SOX21 in GBM. We explored whether SOX6 and SOX21 expression were correlated with immune infiltration levels in GBM. We measured the correlations of SOX6 and SOX21 expression with immune infiltration levels in GBM from TIMER. SOX6 expression level has significant positive correlations with infiltrating levels of purity, and significant negative correlation with dendritic cells in GBM, whereas the SOX21 expression level has significant negative correlation with neutrophil in GBM ( Figure  6H). Subsequently, we further investigated the correlation between SOX6 and SOX21 gene expression in GBM patients, and the results showed that there was a significant positive correlation between SOX6 and SOX21 expression ( Figure 6I). Regarding prognosis, Kaplan-Meier curves illustrated that GBM with SOX6high had a worse prognosis than that with SOX6-low (p = 0.023), and with SOX21-high had a worse prognosis than that with SOX21-low (p = 6.17E-06) ( Figure 6J). We also combined the clinical information to visualize the expression profiles of SOX6 and SOX21 and found that there is a significant difference in SOX6 only in terms of age composition ( Figure 6K). These findings strongly implied that SOX6 might play a specific role in immune infiltration in different subtypes of GBM.
Data from the Human Protein Atlas database showed that immunohistochemistry staining of SOX6 protein was higher in GBM cancer tissue compared with normal tissue (Figure 7).

Systematic analysis of SOX gene family and the importance of SOX6 in GBM
As a result, a total of 81 SOX members were identified in our study and divided into nine groups. Generally, AGING the SoxA group contains SRY, SoxB1 has three members (SOX1, SOX2, SOX3), SoxB2 has two members (SOX14, SOX21), SoxC has three members (SOX4, SOX11, SOX12), SoxD contains SOX5, SOX6, SOX13, SoxE contains SOX8, SOX9, SOX10, SoxF contains SOX7, SOX17, SOX18, SoxH has SOX30, SoxG contains SOX15. As shown in Supplementary Figure 2A, the phylogenetic tree was constructed based on the SOX proteins using FastTree. We generated a graph to show the SOX protein structures by GSDS. According to the result of MEME suite, we found that there was a conserved and core motif (motif 1) in all SOX proteins, which is HMGbox domain and contains 79 amino acid residues (Supplementary Figure 2B and Supplementary Figure  2C). All motifs' logos were shown in Supplementary Figure 2B and 2C. It's noteworthy that motif 10, 2, 5 only appeared in SoxD group (SOX5, SOX6, and SOX13), they might be the domain to identify SoxD group. The SOX protein secondary structures showed in Supplementary Figure 2D and Table 1). By examining the properties of SOX genes for each of the four species (Homo sapiens, Mus musculus, Coturnix japonica, and Gallus gallus), the grand average of hydropathicity (GRAVY) value for those SOX genes mainly ranged from -1.080 --0.206, which were higher than Mus musculus -1.984 --0.207 (Supplementary Figure 2E and Supplementary Table  2). We found that the length of amino acids varied among species ranging from 204 -817 nt. The distribution of molecular weight (Mol. Wt., kDa) for SOX genes ranged from 23.88 -90.72. The isoelectric point (pI) of the SOX genes was from 4.91 -9.96. According to the results of chromosome location, a total of 81 SOX gene members were mapped to the 14 chromosomes (Supplementary Figure 3). SOX6 belongs to the SoxD group, based on the high expression in GBM, we used the GEPIA2 to obtain the top 200 co-expressed genes (Spearman's correlation >= 0.68). Then, co-expressed genes network was constructed by STRING, and re-drawn by Cytoscape ( Figure 8A). To analyze the biological classification and pathway of coexpressed genes, we used Cytoscape's plugin ClueGO app for functional enrichment analyses (p-value <= 0.05). GO analysis indicated that the biological processes including tau-protein kinase activity (FYN, TTBK1,  The most important module was obtained using MCODE plugin ( Figure 8A). We found MAPT, GSK3B, FYN and DPYSL4 as co-expressed hub genes. Hierarchical clustering of the hub genes was performed using the UCSC online tool ( Figure 9A), indicating the concordant expression pattern across four genes. SOX6 compared with MAPT, GSK3B, FYN and DPYSL4 had the highest correlation coefficients (Spearman's = 0.648, 0.765, 0.693 and 0.642) in GBM compared with other tumors ( Figure 9B). This data demonstrated that SOX6 had a strong association with MAPT, GSK3B, FYN and DPYSL4, suggesting that they may be functional partners in GBM.

DISCUSSION
GBM is an aggressive primary malignant brain tumor, and has one of the worst 5-year survival rates among all AGING cancers after diagnosis [25]. It is the first time to report the role of HMG-box gene families and their related DElncRNA and DEmiRNA in GBM. Here, we presented a description of GBMs based on the integration of the genomic and transcriptomic profiles of the HMG-box gene families (such as SOX, HOX, FOX, HMG and TOX gene families).
Through significant differential expression analysis, we found that only 62 of the 123 HMG-box genes were significantly differentially expressed in GBM, and only five genes were down-regulated and 57 were upregulated.
From the expression distribution, this showed that most DE-HMG-box genes were specifically and highly expressed in GBM and had a very important role in enhancing cancer cells growth. From the PPI network and functional pathway results in this study, we found that partial HOX genes were correlated with transcriptional misregulation in cancer, activation of anterior HOX genes in hindbrain development during early embryogenesis, and developmental genetics. DE HMG-box families were closely linked to glioma-related tumors.
Mechanisms utilizing lncRNA have been shown to take part in various types of cancer. However, a comprehensive analysis of the differential expression profiles of DE HMG-box genes co-expressed lncRNA network in GBM has been lacking. Using multivariate Cox and risk score methods, we detected an eight-lncRNA signature which was able to classify GBM patients into the high-risk group and low-risk group with significantly different overall survival (p-value = 1.604e-08). Comparing our functional analyses with DE HMG-box genes, we discovered that eight-lncRNA might participate to GBM via development biology. Then, we further made a DE-lncRNA-DEmiRNA-DE-HMG-box network to expose the HMG-box related ceRNA mechanism in GBM. For example, in our ceRNA network, we found MIR200HG-hsa-miR-146a-5p-SOX2 / HOXD10 axis in GBM, down-regulated lncRNA MIR200HG could competitively bound miRNAs (up-regulated), thereby indirectly encouraging targeted SOX2 / HOXD10 upregulation. In HMDD (the Human microRNA Disease Database) v3.2 [26], we observed that the miRNAs are linked to glioblastoma or glioma, and found several connections in miRNA- AGING mRNA binding. The upregulation of hsa-miR-146a and inactivation of NF-κB signaling induced the sensitization of human glioblastoma cells to TMZinduced apoptosis by curcumin [27]. Hsa-miRNA-155 targeted FOXO3a could promote cell proliferation and invasion in glioma [28]. In this study, hsa-miRNA-155a-3p could combine OIP5-AS, and reduce the influence to SOX11, down-regulated MIAT, AC010980.2, and EPB41L4A-AS could target upregulated hsa-miR-23a-3p, and formed a ceRNA network with up-regulated HMGB2 and HMGN2. As previously reported, Circ_PTN performed as sponge of miR-122, and activated SOX6 expression in glioma cells [29].
We systematically identified the HOX, FOX, SOX, TOX and HMG genes in humans, and also took SOX gene family as an example to comprehensive analysis of the SOX gene family from the phylogeny and protein structure. Although the absence of 3D structural data and the low accuracy of secondary structure prediction, we characterized secondary structures of SOX proteins to identify possible structural consequences of aminoacid substitutions, which indicated that these evolutionary changes have altered SOX protein function in some way.
In the current study, we initially screened out two DEmRNA (SOX6 and SOX21) of the SOX gene family that were found to be related to the clinical outcome of the GBM patients TCGA database. This study observed that up-regulated SOX6 could be targeted by eight down-regulated DElncRNAs in GBM tissues. SoxD group included SOX5, SOX6 and SOX13, we detected that SOX6 and SOX13 were over-expressed in GBM patients. SOX6 was expressed most frequently (6/7 or 86%) in GBM by RT-QPCR experiment [30], and expressed in the nuclei by immunohistochemical experiment [31]. GBM patients with low SOX6 expression present higher survival rates than those with high SOX6. FOXC1 could play the essential role in brain tumor biology and patients with GBM [32]. High expression of GATA2 connected with poor prognosis in GBM patients and promoted GBM progression by EGFR pathway [33]. SOX6 and SOX13 could cointeract with FOXC1 and GATA2, which might lead to aggressive the brain tumors. Hsa-miR-335-5p and hsalet-7b-5p were significantly suppressed in GBM tissues [34], which targeted SOX13 in current study. We speculated that SOX6 might regulate GBM indirectly.
All the mechanism studies are aimed at finding drug targets, better prevention and treatment of diseases, there have been many studies such as a large number of alkylating anticancer agents and mutagens, and might be related to DNA replication [35][36][37]. As the HMG-box proteins influence DNA-dependent processes (transcription, replication, and DNA repair) [3], DE HMG-box genes and related DE-lncRNA / DE-miRNA in GBM, might serve as a potential drug target for DNA loss repair. Existing studies have shown that SOX5, SOX9 and SOX6 could be used as a drug target for INSULIN and DEXAMETHASONE, for the treatment of neurological diseases [38,39], indicating that the combination of drugs and genes can reach the bloodbrain barrier, so whether they could also be used as a drug target for solid tumor GBM, needs further research and mining by our research group.

CONCLUSIONS
In summary, our study obtained the identification of HMG-box families and established a ceRNA network in GBM by TCGA and GEO dataset, presenting them as potential therapeutic targets for the treatment of GBM. Comparing our functional analyses with DE HMG-box genes, we identified that eight-lncRNA might contribute to GBM via development biology. SOX6 and SOX21 might represent a prognostic biomarker and potential therapeutic target to improve the diagnosis and treatment of GBM. SOX6 had a strong association with MAPT, GSK3B, FYN and DPYSL4 and might be functional partners in GBM. Our study provided useful information for further exploration of GBM. Moreover, more GEO datasets should be integrated to assess and reduce the bias during the analysis process, further experiments should validate in vitro and in vivo to make the role of these key genes and pathways clear in the development of GBM.

Analysis of DElncRNA-DEmRNA co-expression network
DElncRNA-DEmRNA (both genes with fold change > 2 and q-value < 0.05) co-expression network was built to determine the relationships in GBM (the absolute value of Pearson correlation coefficient > 0.5 and pvalue < 0.001). The co-expression relationships were visually represented as the co-expression network using Cytoscape v3.7.2 [49].

Single and multivariate factor cox analysis, ROC and survival curve plot
In order to predict the DEmRNA co-expressed lncRNA connected to survival, we performed the single factor Cox analysis by R package "survival" [50], risk model was calculated as previously reported [51]. According to the best risk model obtained by multivariate Cox analysis, the survival score was performed, and the average number of risk scores of each sample of TCGA-GBM data was also calculated. Above-average patients belong to the high-risk group, and belowaverage patients belong to the low-risk group. The Kaplan-Meier method was used to draw the survival curves of the two groups.

Data mining for SOX6 and co-expressed hub genes
GEPIA2 (Gene Expression Profiling Interactive Analysis) [68] was used to analyze the gene expression correlation by TCGA-GBM data. The Spearman method was used to find the correlation coefficient. PPI network was constructed by STRING v11 [69]. Cytoscape's plugin CluGO was used for functional enrichment analyses (GO, KEGG, Reactome, and Wiki pathway) [70]. Cytoscape's plugin MCODE was applied for finding linked regions based on topology (MCODE score > 5, degree cutoff =2, node score cutoff = 0.2, max depth = 100, k-score =2). We plotted the heat map of SOX6 and co-expressed hub genes by University of California Santa Cruz (UCSC) browser [71], and the correlation of these genes was drawn by TIMER [47].

Supplementary Tables
Supplementary Table 1