A three-long non-coding RNA-expression-based risk score system can better predict both overall and recurrence-free survival in patients with small hepatocellular carcinoma

Growing evidence indicates that long non-coding RNAs (lncRNAs) may be potential biomarkers and therapeutic targets for many disease conditions, including cancer. In this study, we constructed a risk score system of three lncRNAs (LOC101927051, LINC00667 and NSUN5P2) for predicting the prognosis of small hepatocellular carcinoma (sHCC) (maximum tumor diameter ≤5 cm). The prognostic value of this sHCC risk model was confirmed in TCGA HCC samples (TNM stage I and II). Stratified survival analysis revealed that the suitable patient groups of the sHCC lncRNA-signature included HBV-infected and cirrhotic patients with better physical conditions yet lower levels of albumin and higher levels of alpha-fetoprotein preoperatively. Besides, Asian patients with no family history of HCC or history of alcohol consumption can be predicted more precisely. Molecular functional analysis indicated that PYK2 pathway was significantly enriched in the high-risk patients. Pathway enrichment analysis indicated that the two lncRNAs (LINC00667 and NSUN5P2) associated with poor prognosis were closely related to cell cycle. The nomogram based on the lncRNA-signature for RFS prediction in sHCC patients exhibited good performance in recurrence risk stratification. In conclusion, we identified a novel three-lncRNA-expression-based risk model for predicting the prognosis of sHCC.


INTRODUCTION
Hepatocellular carcinoma (HCC) is one of the most fatal malignancies because of its dramatically growing incidence and related mortality worldwide [1]. One of the biggest challenges facing most clinicians is the early diagnosis and early surgical intervention of HCC to reduce the resultant public health burden [2]. The development of screening techniques and surveillance programs can at least in part curb the ongoing epidemics of HCC [3]. Despite great improvement in therapeutic AGING lncRNAs are lack of protein-coding ability [9]. Recent studies have found that lncRNAs-encoded peptides played a critical role in biological activities [10,11], and further some lncRNAs regulated a wide range of transcriptionally or post-transcriptionally biological processes [12]. Moreover, an increasing number of lncRNAs were identified to be associated with the initiation, progression and metastasis of cancer at many sites, including the liver [13][14][15]. The rapid development of RNA sequencing techniques can help unfold the exciting potential of using lncRNAs as potential biomarkers to facilitate the detection, treatment and prognosis of cancer [16,17]. Currently, only a few lncRNAs such as HOTAIR and HULC have been well characterized in hepatocarcinogenesis [18,19], and emerging studies proposed that lncRNAs might be potentially reliable predictors for HCC clinical outcomes [20][21][22][23]. To yield more information, we in this present study aimed to construct a prognostic risk score system based on lncRNAs expression data to predict the prognosis of small HCC (sHCC) through a comprehensive analysis of microarray data. The sHCC is a special type of HCC with the maximum tumor diameter ≤5 cm defined in this study and favorable long-term outcomes, and so early detection of sHCC has very important clinic value.

Construction of the prognostic risk score system of sHCC
The overall design and workflow of this study is presented in Figure 1. After an initial screening of the lncRNAs associated with OS and RFS in the discovery series (GSE14520), the significant lncRNAs (P <0.05) were subjected to the LASSO modelling. The sHCC risk score system was built as follows: risk score = (-1.179864) × (expression value of LOC101927051) + 0.3570553 × (expression value of LINC00667) + 0.1603625 × (expression value of NSUN5P2). In this prognostic formula, higher expression of LOC101927051 was associated with lower risk of death and recurrence (coefficient < 0). On the contrary, higher expression levels of LINC00667 and NSUN5P2 were related to worse OS and RFS (coefficient > 0). Based on the absolute value of coefficients, it is not hard to see that LOC101927051 had the most influence on survival prediction, yet NSUN5P2 had the least. Using this formula, each patient received a risk score in connection with personal prognosis. Then all patients were classified into high-risk and low-risk groups by the cutoff value of -1.875 based on the risk scores generated from ROC curves (Figure 2A). The OS and RFS in the discovery dataset are presented in Figure 2B and 2C, respectively. The low-risk group was identified to have significantly better clinical outcomes than the high-risk group, in terms of both OS (Log-rank P =0.0022, HR=2.402, 95% CI: 1.392 to 4.143) and RFS (Log-rank P =0.0354, HR=1.588, 95% CI: 1.031-2.444) from KM curves ( Figure 2D and 2E).

Validation and exploration of the risk score model for survival prediction in the TCGA dataset
To validate the prognostic value, we applied the three-lncRNA signature to the TCGA cohort (stage I and II). The cut-off points of risk score to divide high-risk and low-risk groups in the validation dataset was 1.33 based on ROC curves. KM curves of the validation series showed great utility in predicting OS and RFS with P values from Log-rank tests of 0.0062 (HR=2.183, 95% AGING CI: 1.212-3.932) and 0.0129 (HR=1.627, 95% CI: 1.081-2.451), respectively ( Figure 3A).
Analysis was done in TCGA cohort to further investigate the potentiality of the three-lncRNA risk score model. The cut-off value adopted was 1.33, consistent with the overall group. The number of patients classified into high-risk and low-risk groups and the results of Log-rank tests are listed in Table 1. The lncRNA prognostic signature exhibited better performance in HBV and cirrhotic patients with relatively better physical conditions (ECOG =0) ( Figure 3B, 3C and 3D). Considering preoperative laboratory indexes, patients with higher serum levels of AFP (alphafetoprotein, >20ng/ml) and relatively lower levels of ALB (albumin, <4.0 g/dl) could benefit more in prognosis by using this risk score system ( Figure 3E and 3F). As for medical background information, the lncRNA prognostic signature seemed more applicable to Asian patients with no family history of HCC or history of alcohol consumption ( Figure 3G, 3H and 3I).

Identification of relevant biological processes and pathways of the three-lncRNA signature
BioCarta pathway enrichment through GSEA was conducted in high-risk groups of both discovery and validation datasets simultaneously. Only PYK2 pathway was significantly enriched in both datasets ( Figure 4A).

AGING
This pathway was reported to play a role in tumorigenesis and tumor progression that might be partly responsible for the poor prognosis of sHCC [24,25]. The two lncRNAs, LINC00667 and NSUN5P2, which indicated a poor prognosis of sHCC, were enriched in the same module. The significantly enriched pathways of LINC00667 and NSUN5P2 were mainly associated with cell cycle ( Figure 4B and 4C).

Establishment of nomogram for recurrence-free survival prediction in sHCC
In order to integrate all independent risk factors of OS and RFS for the construction of sHCC prognostic nomogram, various clinicopathological factors including TP53 mutation and the expression level of PTK2B, the core gene of PYK2 pathway, of each TCGA sample were subjected to univariate and multivariate COX regression analyses. The risk score was the significant independent factor of RFS (P = 0.004, HR = 1.811, 95% CI: 1.329-2.466) rather than OS (data not shown). Liver cirrhosis and ECOG were also independent risk factors of RFS of sHCC (Table 2). Ultimately, RFS nomogram was formulated based on the three significantly independent factors above. Furthermore, one-and three-year predicted RFS rate was shown in the nomogram ( Figure  5A). The C-index for recurrence-free survival prediction was 0.633 (95% CI, 0.562-0.704). The calibration curves of the nomogram showed good agreement between the predicted 1-and 3-year RFS rate and the actual observation ( Figure 5B). Each patient with complete clinical information on liver cirrhosis (or not) and ECOG score would get the Nomo-score based on which, patients were classified into different risk groups by the cut-off values 4.35 and 6.25. From Kaplan-Meier analysis of the TCGA dataset, notable differences were observed between high-, intermediate-and low-risk groups (P = 0.0002) ( Figure 5C).

DISCUSSION
Tumor size is an established independent risk factor for HCC that has been applied to staging system for medical guidance. Early-stage patients (solitary tumor ≤5 cm), if receiving proper and timely personalized treatment and surveillance, can have a satisfactory clinical outcome [26]. Moreover, therapeutic approach selection and monitoring indexes are fatal to the prognosis of early-stage HCC [27]. Therefore, reliable biomarkers and genetic signatures as treatment targets and prognostic predictors are of importance for sHCC. After decades of research on genetic markers of cancerrelated events like genes and miRNAs, lncRNAs have attracted much attention recently. To the best of our knowledge, this is the first study that have constructed a lncRNA-expression-based risk model to predict the prognosis of sHCC. First and foremost, we repurposed the whole set of microarray probes of GSE14520 for lncRNAs and employed 153 sHCC samples as the discovery dataset. Of all 1254 re-annotated lncRNAs, three (LOC101927051, LINC00667 and NSUN5P2) were selected to construct the risk score system for sHCC prognosis through analysis in silico. The three-lncRNA signature was eventually validated and developed in another independent cohort from the TCGA.
Functional annotations in high-risk patients of both discovery and validation datasets revealed that PYK2 pathway was significantly enriched. Proline-rich tyrosine kinase 2 (PYK2), known as PTK2B (protein tyrosine kinase 2 beta), is one of focal adhesion kinases (FAKs) in the regulation of calcium flux of iron channels and activation of cellular signaling pathway like the Canonical (β-Catenin-dependent) Wnt signaling pathway [24,28]. There is evidence that PTK2B was involved in cell proliferation, invasion and migration of a variety of malignancies, and its alteration can result in the poor prognosis of HCC [29][30][31][32]. Therefore, to integrate all independent factors of OS or RFS by using the nomogram, the expression level of PTK2B was taken into consideration, while no association with survival was found. Enrichment analysis of the coexpressed genes of LINC00667 and NSUN5P2 revealed that the two lncRNAs might be related to cell cycle. More interestingly, the specific cell cycle genes regulated by TP53 were also found to be significantly enriched ( Figure 4B). Besides, since TP53 mutation was an acknowledged high risk factor to the tumorigenesis and progression of HCC, we evaluated the relation between the mutation status of TP53 and OS or RFS as well [33]. Our COX regression analysis indicated that TP53 alteration might not be a risk factor for sHCC.

AGING
The construction of our risk score system may help identify high-risk and low-risk patients experiencing sHCC without invasive examinations, and provide advice to surgeons to aid modifying therapeutic strategies, for example, transplantation or curative section, as there are various treatment options for sHCC patients. To achieve a better prediction ability of prognosis, both genetic and clinicopathological characteristics were incorporated in the nomogram. Ultimately, the RFS nomogram was comprised of the three-lncRNA signature, liver cirrhosis status and ECOG score, which can be obtained from liver biopsy, facilitating examination and doctors' assessment. Accord-ing to the prognostic signature or the nomogram, if can be put into clinical practice in the future, high-risk patients of cancer-related death and recurrence can be recognized before surgery, and recommended a more aggressive strategy with strictly pre-and postoperative adjuvant treatment and surveillance. However, additional studies are needed to confirm the risk model in larger groups of patients, and the molecular functions of the three separate lncRNAs in HCC also requires further exploration.
In conclusion, we identified a novel three-lncRNAexpression-based risk model for predicting the prognosis of sHCC patients. Based on survival stratified analysis, this lncRNA risk score system is more suitable for cirrhotic and HBV-infected patients with good ECOG performance, low level of preoperative albumin and high level of AFP. In addition, the lncRNAsignature can help improve our understanding of the carcinogenesis and development of HCC, as well as the clinical decision-making as potential biomarkers and therapeutic targets for sHCC patients.

Microarray datasets preparation and re-annotation
Microarray datasets including gene expression profiles and associated clinical characteristics analyzed in this study were downloaded from publicly available GEO database (https://www.ncbi.nlm.nih.gov/geo/) and The Cancer Genome Atlas (TCGA, http://cancergenome. nih.gov/). GSE14520 from GEO database was conducted by GPL571 (Affymetrix Human Genome U133A 2.0 Array) and GPL3921 (Affymetrix HT Human Genome U133A Array), including 247 HCC samples, 239 paired non-tumor tissue samples and 2 healthy liver samples. Out of them, 153 tumor samples with survival information acquired from sHCC formed the discovery dataset. The inclusion criteria were as follows: pathologically verified HCC tissue; the largest tumor no more than 5cm in diameter; complete followup data including overall survival status and time, recurrence status and date. The exclusion criteria were as follows: non-tumor or healthy tissue; lack of histological examination results or pathological results were cholangiocarcinoma, combined hepatocholangiocarcinoma or metastatic liver cancer; the size of tumor larger than 5cm in diameter; incomplete follow-up information. Then, we re-annotated array probes from the Affymetrix Human Genome U133A 2.0 Array to obtain lncRNA profiles, mainly according to the methods proposed by Zhang X et al [34]. After mapping probe set IDs to the NetAffx annotation files, we extracted non-coding protein genes and excluded microRNAs, rRNAs and other short RNAs. Eventually, 1254 lncRNA transcripts including duplicates (different probe IDs may be mapped to the same transcript) were re-annotated. Besides, 235 HCC samples in stage I and II with complete survival and recurrence information from the TCGA formed the validation dataset. Of the 235 patients, 220 had surgery, 1 received liver transplantation, 13 underwent other treatment (no specific information) and 1 was lack of therapeutic records. The TCGA HCC genome profiles contained more than 14,400 lncRNA transcripts and 22,700 mRNA transcripts. All the genomic expression data from the two datasets in this study were from tumor tissue. In addition, the mutation information of gene TP53 of associated TCGA samples was downloaded as well. The median OS and RFS of the discovery and validation sets were 53.0, 38.7 months and 34.1, 15.9 months, respectively.

Construction and confirmation of the sHCC-lncRNA risk score
Survival analysis based on univariate COX proportional hazards of each lncRNA annotated in the discovery series was done to screen out those with a significant p value less than 0.1. Then, we used the least absolute shrinkage and selection operator (LASSO) [35] to construct the risk score system based on above selected prognostic lncRNAs. LASSO statistical modelling was performed with "glmnet" package in the R software (version 3.4.0, https://www.r-project.org/), and meanwhile the coefficients of eligible lncRNAs in risk score model were generated based on expression data for each sHCC sample [36]. Absolute value of each coefficient denoted the contribution of corresponding lncRNAs to the prognostic risk score.
The corresponding risk scores for the samples from both discovery and validation datasets were calculated using the risk score system. Patients were divided into highrisk and low-risk groups in either cohort with cut-off values determined by the receiver operating characteristic (ROC) curves (time-independent). The whole group was divided into two subgroups according to the AGING outcome event of each patient (dead or alive). Then ROC curves were plotted based on the risk scores and the survival status of each sample. Risk score was selected as the cut-off value when the area under the curve (AUC) reached its maximum. Kaplan-Meier (KM) curves were plotted, and P values and hazard ratio (HR) along with 95% confidence interval (CI) from Log-rank tests and COX regression analyses were calculated to compare survival and recurrence risk between high-risk and low-risk groups. Stratified analysis was conducted to evaluate suitable patients of the sHCC prognostic model in the TCGA cohort. In each sub-group stratified by various clinical characteristics, KM curves were plotted accordingly in the overall group. All ROC and KM curves were plotted by the GraphPad Prism version 7.0 and P value less than 0.05 was considered statistically significant.

Gene set enrichment analysis and functional enrichment analysis
Functional annotations in both high-risk and low-risk samples were done through gene set enrichment analysis (GSEA), an approach in silico performed by the JAVA program (http://www.broadinstitute.org/gsea) using Molecular Signature Database (MSigDB) [37]. Pathway enrichment was carried out in the high-risk patient group based on the BioCarta pathway database [38]. The significance threshold of false discovery rate (FDR) for the significantly enriched biological processes and pathways was set at 0.05. Gene enrichment analysis of the identified lncRNAs was carried out in the Reactome pathway database using Metascape, a free online tool for gene annotation (http://metascape.org/gp/index.html#/main/step1) [39]. The correction network of the enriched terms was presented in Cytoscape [40]. The possible functional Reactome pathways were enriched based on the coexpressed genes of the lncRNAs in the same module clustered by Weighted Gene Coexpression Network Analysis (WGCNA). WGCNA was a new method for detecting the highly connected genes and conducted with "wgcna" package in R studio [41].

Statistical analyses
Statistical analyses were conducted with STATA software version 12.0 (StataCorp, TX, USA), unless otherwise indicated. P value less than 0.05 was considered statistically different. Univariate and multivariable COX proportional hazards regression analyses were performed in TCGA cohort using risk score and clinical information to find the independent predictor of the OS and the RFS of sHCC. P-value less than 0.05 was adopted as a threshold. The nomogram was built based on the significant factors by the package of "rms" in R studio. The concordance index (C-index) and the calibration curves were utilized to evaluate the performance of the nomogram and compare the predicted-and actual-probability of survival. Each patient got the total points from the nomogram (Nomoscore). KM curve analysis was carried out to measure the performance of the nomogram by dividing patients into high-, intermediate-and low-risk groups using tertiles of the Nomo-scores as the cut-off values.