A novel long non-coding RNA, AC012456.4, as a valuable and independent prognostic biomarker of survival in oral squamous cell carcinoma

Oral squamous cell carcinoma (OSCC) is a major malignant cancer of the head and neck. Long non-coding RNAs (lncRNAs) have emerged as critical regulators during the development and progression of cancers. This study aimed to identify a lncRNA-related signature with prognostic value for evaluating survival outcomes and to explore the underlying molecular mechanisms of OSCC. Associations between overall survival (OS), disease-free survival (DFS) and candidate lncRNAs were evaluated by Kaplan–Meier survival analysis and univariate and multivariate Cox proportional hazards regression analyses. The robustness of the prognostic significance was shown via the Gene Expression Omnibus (GEO) database. A total of 2,493 lncRNAs were differentially expressed between OSCC and control samples (fold change >2, p < 0.05). We used Kaplan–Meier survival analysis to identify 21 lncRNAs for which the expression levels were associated with OS and DFS of OSCC patients (p < 0.05) and found that down-expression of lncRNA AC012456.4 especially contributed to poor DFS (p = 0.00828) and OS (p = 0.00987). Furthermore, decreased expression of AC012456.4 was identified as an independent prognostic risk factor through multivariate Cox proportional hazards regression analyses (DFS: p = 0.004, hazard ratio (HR) = 0.600, 95% confidence interval(CI) [0.423–0.851]; OS: p = 0.002, HR = 0.672, 95% CI [0.523–0.863). Gene Set Enrichment Analysis (GSEA) indicated that lncRNA AC012456.4 were significantly enriched in critical biological functions and pathways and was correlated with tumorigenesis, such as regulation of cell activation, and the JAK-STAT and MAPK signal pathway. Overall, these findings were the first to evidence that AC012456.4 may be an important novel molecular target with great clinical value as a diagnostic, therapeutic and prognostic biomarker for OSCC patients.


INTRODUCTION
The five-year survival rate is approximately 50% for oral squamous cell carcinoma (OSCC), which is one of the most common malignancies of the head and neck region (Bozec et al., 2009;Ferlay et al., 2015;Kamangar, Dores & Anderson, 2006;Kim et al., 2017;Verusingam et al., 2017). The predisposition of OSCC to distant metastases and metastases in the lymph nodes, its highly invasive nature, and its tendency towards local recurrence are important factors that contribute to the poor prognosis of OSCC patients (Massano et al., 2006;Singh & Schenberg, 2013). Hence, more effective novel tumor diagnostic and prognostic biomarkers (Mehrotra & Gupta, 2011), which can improve the survival rate and can be used to assess treatment outcomes, are urgently needed.
The Cancer Genome Atlas (TCGA) (http://cancergenome.nih.gov) database, which is primarily used to collate specimens from cancer patients and adjacent normal tissue specimens, contains large data sets collected with high-throughput methods at multiple genomic and proteomic levels (Chin, Andersen & Futreal, 2011;Wang, Gerstein & Snyder, 2009). The Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo/) is the largest and most comprehensive public gene expression repository for high-throughput data at NCBI (Barrett & Edgar, 2006;Clough & Barrett, 2016). Both the GEO and TCGA collect macroscopic clinical information, such as stage and grade of tumor, survival time, age, sex, and race. Therefore, the TCGA and GEO databases can be analyzed systematically and comprehensively to explore important potential value and information.
In this study, we first sought to use the existing GEO microarrays and TCGA RNA-seq data to identify differential expression of lncRNAs between OSCC and control tissue samples. Then, the differentially expressed lncRNAs were evaluated by Kaplan-Meier survival analysis and univariate, multivariate Cox proportional hazards regression analyses and Gene Set Enrichment Analysis (GSEA). Ultimately, through systematic and objective analysis, we first discovered that lncRNA AC012456.4 is significantly associated with survival outcomes of OSCC patients based on TCGA data. Then, AC012456.4 was further successfully confirmed as a potential prognostic biomarker for the prediction of overall survival (OS) in the GEO database. We hope that the lncRNA AC012456.4 revealed in our study may serve as a novel biomarker and potential target for the diagnosis, treatment, and prognosis of OSCC.

Data source
The RNA-seq data and corresponding patient information data of head and neck cell carcinoma (HNSC) were downloaded from the TCGA database. Clinical samples from the oral cavity (buccal mucosa, tongue, lip, hard palate, alveolar ridge, floor of the mouth and oral cavity) were chosen, while some samples from other parts (hypopharynx, larynx, oropharynx and tonsil, for example) were excluded. The original microarray data between OSCC and adjacent normal tissue samples were downloaded from the NCBI GEO databases. The accession numbers were GSE36820 and GSE41613, respectively. The microarray data of GSE36820 and GSE41613 were based on GPL570 (Affymetrix Human Genome U133 Plus 2.0 Array).

Data pre-processing and differential expression analysis
The edgeR package was downloaded from the Stanford University website. The original microarray data from the GEO were converted into expression measures using the affy R package. Then, the differentially expressed lncRNAs were identified by the Limma R package (Ritchie et al., 2015;Teufel et al., 2016). The differentially expressed lncRNAs that were screened from the TCGA were analyzed by the edgeR package (Robinson, McCarthy & Smyth, 2010). To improve screen accuracy and simplify the screening process, the cut-off criteria, which was in accordance with the procedure of Benjamini & Hochberg (BH), was as follows: 1. the false discovery rate was controlled at 0.01; 2. the fold change should be more than 2. The differentially expressed lncRNAs among GSE36820, GSE41613 and the TCGA were identified by the intersect function in the R package. Tumor and normal tissue data were recorded and were statistically analyzed.

Identification of lncRNAs with prognostic value in OSCC
The differences between expressed lncRNAs (fold change >2, p < 0.05) are involved in the prognostic value for OSCC. The OSCC patients were divided into two parts, depending on the average expression level of candidate lncRNAs: a high expression group and a low expression group. Survival differences and p-values were compared between the two groups and were evaluated using a Kaplan-Meier survival analysis and a log-rank test. After this, a univariate Cox proportional hazards regression analysis (Bair & Tibshirani, 2004) was conducted to assess the correlation between candidate lncRNAs and patient overall survival (OS) and disease-free survival (DFS) (p < 0.05). Statistically significant lncRNAs and clinical candidate predictors were further evaluated by multivariate Cox proportional hazards regression analyses to identify independent prognostic lncRNAs. Candidate predictors included age, gender, grade, and stage. We then performed subgroup analyses. The hazard ratio (HR) and 95% confidence interval (CI) were also assessed.

Gene set enrichment analysis (GSEA)
GSEA 2-2.2.3 (JAVA version) was downloaded from the Gene Set Enrichment Analysis website (http://software.broadinstitute.org/gsea/index.jsp). Then, the downloaded dataset was imported using the GSEA software. Gene sets identified as related to biological signal conduction on the MSigDB (Molecular Signatures Database) (http: //software.broadinstitute.org/gsea/msigdb), which may be found on the GSEA website, served as reference gene sets. This process was repeated 1,000 times for each analysis according to the default weighted enrichment statistical method. Gene sets with a false discovery rate (FDR) <0.25 and a family-wise error rate (FWR) <0.05. The GSEA analysis includes four key statistics: Enrichment Score (ES), Normalized Enrichment Score (NES), False Discovery Rate (FDR) and P-value.

Statistical analysis
In this study, all analyses, including the t -test, heat map, and survival analyses, were performed with the R, GraphPad and SPSS software packages. p values less than 0.05 were considered significant. All statistical tests were two-sided.

Characteristics of OSCC patients according to the TCGA
In this study, the datasets of 350 OSCC patient and 44 controls were acquired and downloaded from the TCGA (http://cancergenome.nih.gov) database; these datasets contained expression data and clinical information related to 14,448 lncRNAs. The clinicopathological features of all patients are shown in Table 1. The mean ± standard deviation (STDEV) for all patient ages is 61.590 ± 12.886.

Significant differentially expressed lncRNAs in OSCC
In all, 2,493 differentially expressed lncRNAs were identified through analysis of 14,448 lncRNAs using the edgeR packages (fold change >2, p < 0.05) (Fig. 1). Moreover, 855 lncRNAs were down-regulated and 1,638 lncRNAs were up-regulated in the OSCC samples compared to normal tissue. Down-regulated and up-regulated lncRNAs account for 34.2% and 65.6% of the differentially expressed lncRNAs, respectively.
Through the above Kaplan-Meier survival analysis, the variables of age, gender, grade, tumor stage, and TNM stage were identified as statistically significant factors that are related to the above 21 lncRNAs and patient prognosis. We also applied univariate and multivariate Cox regression analyses to evaluate the ability of 21 candidate lncRNA signatures to serve as independent prognostic variables. The univariate analysis indicated that decreased AC012456.4 expression (HR = 0.706, 95% CI [0.551-0.903], p = 0.006), age, tumor stage, and TNM stage were all significantly related to worse OS in OSCC patients (Table 3) (Table 4). In addition, age and N stage were highly significantly correlated with shorter OS or DFS.  lncRNA AC012456.4 was low expressed in OSCC tissues and associated with clinicopathological parameters OSCC patients were further classified into high or low expression groups based on the median value of the relative lncRNA expression. The expression of lncRNA AC012456.4 was significantly weaker in OSCC tissue samples (1.360 ± 0.05569) relative to normal tissue samples (3.062 ± 0.2304) in the TCGA (p < 0.0001) (Fig. 3). The correlation between lncRNA AC012456.4 expression and clinicopathologic parameters of OSCC patients was also further analyzed. As shown in Table 5, lncRNA AC012456.4 expression was significantly correlated with alcohol history consumption (p = 0.033). Additionally, decreased expression of lncRNA AC012456.4 expression was nearly significantly associated with T stage (p = 0.075). However, no significant association was found between other clinicopathological factors and lncRNA AC012456.4 expression.

Evaluation of the prognostic value of lncRNA AC012456.4 via the GEO
For the purpose of evaluating the robustness of lncRNA AC012456.4 expression in the prediction of OS of OSCC patients, we acquired other independent datasets from the GEO with accession numbers of GSE36820 and GSE41613, which contained OSCC samples, but samples with incomplete clinical information were excluded. The prognostic signatures and the Kaplan-Meier analysis were calculated and performed for each OSCC sample.
In agreement with the result of the TCGA datasets, low expression levels of lncRNA AC012456.4 were associated with lower OS (Fig. 4). The lncRNA AC012456.4 was also expressed at low levels in OSCC tissues (p < 0.0001).

Relationship between lncRNA AC012456.4 and biological pathways and functions
Biological pathways and functions of lncRNA AC012456.4 were identified by GSEA. This analysis revealed that lncRNA AC012456.4 was involved in many critical pathways and correlated with tumorigenesis. A total of 150 pathways listed in the high-risk group were enriched, including KEGG MAPK SIGNALING PATHWAY, KEGG JAK-STAT SIGNALING PATHWAY, KEGG CALCIUM SIGNALING PATHWAY and KEGG PATHWAYS IN CANCER. Twenty-seven pathways in the low-risk group were also identified, including the KEGG OXIDATIVE PHOSPHORYLATION, KEGG PROTEASOME and KEGG SPLICEOSOME (Fig. 5). Similarly, 3073 GO annotations in the high-risk group and 516 GO annotations in the low-risk group were enriched ; M1, Distant metastasis; N0, No regional lymph node metastasis; N1, Metastasis in a single ipsilateral lymph node, 3 cm or less in greatest dimension; N2, Metastasis in a single ipsilateral lymph node, more than 3 cm but not more than 6 cm in greatest dimension; or in multiple ipsilateral lymph nodes, none more than 6 cm in greatest dimension; or in bilateral or contralateral lymph nodes, none more than 6 cm in greatest dimension; N3, Metastasis in a lymph node more than 6 cm in greatest dimension; T1, Tumor 2 cm or less in greatest dimension; T2, Tumor more than 2 cm but not more than 4 cm in greatest dimension; T3, Tumor more than 4 cm in greatest dimension; T4a, Moderately advanced local disease; T4b, T4b Very advanced local disease. Tumor invades masticator space, pterygoid plates, or skull base and/or encases internal carotid artery.
( Fig. 6). Relevant partial results for KEGG pathways and GO analysis are listed in Table 6 and Table 7.

DISCUSSION
OSCC is a common, highly invasive type of oral cancer prone to early recurrence and metastasis (Massano et al., 2006;Singh & Schenberg, 2013). Therefore, early diagnosis and treatment of OSCC is essential (Bozec et al., 2009). While cytology-and pathology-based methods have been applied to the clinical differential diagnosis of OSCC, limitations in the detection methods and poor prognoses have limited the five-year survival rate (Omar, 2013). Hence, more reliable, accurate and sensitive prognosis biomarkers and tools for early diagnosis are urgently needed (Mehrotra & Gupta, 2011). In recent years, many studies have revealed a close association between aberrant expression of lncRNAs and tumorigenesis (Alessandro & Irene, 2014;Batista & Chang, 2013;Espinosa, 2017;Rinn & Chang, 2012;Slaby, Laga & Sedlacek, 2017), which may aid in cancer diagnosis and prognosis. Fewer than 2% of genes in the human genome are transcribed, and up to 98% of these transcripts are non-coding RNAs (Jandura & Krause, 2017;Espinosa, 2013;Quinn & Chang, 2016). lncRNAs are a class of non-coding transcripts ≥ 200 nucleotides in length that are actively involved in many biological processes, such as epigenetic regulation, cell cycle regulation, chromatin modulation and regulation of multiple gene expression (Rinn & Chang, 2012;Wang et al., 2017). These non-coding transcripts also play key roles in the occurrence, development and progression of malignant tumors (Espinosa, 2017;Kopp & Mendell, 2018;Spizzo et al., 2012). An increasing number of studies have reported that lncRNAs can play essential roles as oncogenes or tumor suppressor genes involved in the development and progression of various cancers (Batista & Chang, 2013;Espinosa, 2017;Kopp & Mendell, 2018;Reik, 2009;Rinn & Chang, 2012;Slaby, Laga & Sedlacek, 2017;Spizzo et al., 2012), including OSCC (Fang et al., 2017Gomes et al., 2017;Guo et al., 2017;(Li et al., 2017). For example, the down-regulation of HOTAIR is associated with cancer progression in 26 human tumor types (Bhan & Mandal, 2015).
However, most early studies focused on a single gene or the results obtained from a single cohort study of lncRNAs and OSCC. Sun et al. (2017) used qRT-PCR to analyze the expression levels of lncRNA PDIA3P in 58 OSCC and paired noncancerous tissue samples. This study found that the overexpression of lncRNA PDIA3P correlated with lower survival rates for OSCC patients. One study by Wu et al. (2015) suggested that high expression of lncRNA HOTAIR in OSCC patients would contribute to the development and progression of cancer, leading to a poor prognosis. Similarly, LINC00668 expression is increased in both 50 OSCC tissues and cells, and over-expression is significantly correlated with poorer survival for OSCC patients; Therefore, this might be a negative predictive factor for the prognosis of OSCC patients (Zhang, 2017). In the era of big data, the development of TCGA and GEO technology has allowed researchers to predict and identify new biomarkers, which has enhanced the reliability and accuracy of current research. Cui et al. (2017) used TCGA and GEO data to determine that the expression levels of several lncRNAs, including RP1-228H13.5, TMCC1-AS1, LINC00205, and RP11-307C12.11, were associated with OS and recurrence-free survival of hepatocellular carcinoma patients. Three lncRNAs (LINC01140, TGFB2-OT1, and RP11-347C12.10) were significantly correlated with prognoses of hepatocellular carcinoma patients, independent of some clinical characteristics. Using the database, three lncRNAs, which may play key roles in the development, progression, and recurrence in gastric cancer, were identified (Song et al., 2017). However, the functions, roles, and molecular mechanisms of lncRNAs associated with OSCC remain unclear.
In this study, we identified lncRNAs that are dysregulated in OSCC and evaluated the relationships between the TCGA database and the clinicopathological features of these OSCC patients. Based on the above analysis, a total of 21 lncRNAs were correlated with patient prognoses, of which 13 lncRNAs (TTC39A-AS1, RP11-93B14.9, AC012456.4, RP11-87C12.5, RP11-464F9.21, LINC01549, RP11-897M7.1, AP003900.6, LINC01343, RP11-181E10.3, CTD-2545H1.2, RP11-796E2.4 and LINC01108) were significantly positively associated with OS and DFS, while the up-regulation of the latter eight lncRNAs (AC007879.2, BOK-AS1, CTB-161M19.4, CTD-2033A16.3, FAM95B1, RP11-1C8.7, RP11-285G1.14 and RP11-286E11.1) were correlated with poorer prognoses. Lan et al. (2017) have also reported that RP11-1C8.7 predicted the progression and outcome of patients with kidney renal papillary cell carcinoma and was regarded as an independent prognostication factor for kidney renal papillary cell carcinoma. Thus far in the published literature, no report has evaluated the biological function and molecular mechanisms of other lncRNAs associated with human cancers.  To our knowledge, this study is pioneering research and identified the lncRNA AC012456.4, which exhibited significantly lower expression in OSCC tissues than in adjacent normal tissues. Additionally, a Kaplan-Meier survival analysis (Gyorffy, Lánczky & Szállási, 2012) as well as univariate and multivariate Cox regression analyses revealed that lncRNA AC012456.4 was an independent prognostic factor and was significantly correlated with shorter OS and DFS. Further validation via the GEO database was consistent with the TCGA database analysis results. Moreover, we further evaluated the relationship between AC012456.4 expression and the clinicopathological features of OSCC patients. Low levels of AC012456.4 were found to be significantly associated with the history of alcohol consumption in OSCC patients. Interestingly, according to previous studies, we found that alcohol consumption can increase the probability of G:C to A:T transitions and that alcohol drinkers exhibited a significantly higher incidence of p53 mutations in OSCC (Hsieh et al., 2001), which suggested that alcohol may play a critical role in the progression of OSCC.
Since lncRNAs perform their biological function by specifically binding to target genes, we further explored the possible biological functions and molecular pathways of AC012456.4. Through GSEA, AC012456.4 was found to be significantly involved with tumor-related signaling pathways and crucial biological functions in tumorigenesis. Key pathways and functions for tumor initiation and progression were identified, such as GO biological function annotation and KEGG pathways, including the adaptive immune response, RRNA metabolic processes, CALCIUM, MAPK, and the JAK/STAT signaling pathway. Additionally, mutation, aberrant expression and modification of these GO annotations and signaling pathways have been frequently reported in OSCC and other cancers. We found that the MAPK pathway could be activated by the low expression of the tumor suppressor QKI-5, which can promote the proliferation of OSCC cells (Fu & Feng, 2015). We also revealed the strong relationships between HOXC10 and gastric cancer cell proliferation and metastasis, which occur through the MAPK pathway (Guo et al., 2017). Other pathways and biological functions have also been reported in pancreatic ductal adenocarcinoma (Huang et al., 2017a), hepatocellular carcinoma (Huang et al., 2017b;Wonganan et al., 2017), and human papillomavirus-transformed tumors (Skeate et al., 2018).
Dysregulated expression of lncRNA signatures has tremendous potential value, but this research has limitations. Above all, we have explored the correlation between AC012456.4 expression and OSCC prognosis based on the TCGA and GEO databases, which signifies that the exploration was performed using a bioinformatics approach. Then, further research, such as quantitative real-time PCR, as well as in vivo and in vitro experiments, will require collaborative efforts to explore the potential molecular functions and related mechanisms of these lncRNAs in OSCC.

CONCLUSIONS
In summary, this study was the first to discover that lncRNA AC012456.4 was poorly expressed in OSCC, with decreased survival rates for OSCC patients. This may be a potential