Identification of a novel snoRNA expression signature associated with overall survival in patients with lung adenocarcinoma: A comprehensive analysis based on RNA sequencing dataset

Since multiple studies have reported that small nucleolar RNAs (snoRNAs) can be serve as prognostic biomarkers for cancers, however, the prognostic values of snoRNAs in lung adenocarcinoma (LUAD) remain unclear. Therefore, the main work of this study is to identify the prognostic snoRNAs of LUAD and conduct a comprehensive analysis. The Cancer Genome Atlas LUAD cohort whole-genome RNA-sequencing dataset is included in this study, prognostic analysis and multiple bioinformatics approaches are used for comprehensive analysis and identification of prognostic snoRNAs. There were seven LUAD prognostic snoRNAs were screened in current study. We also constructed a novel expression signature containing five LUAD prognostic snoRNAs (snoU109, SNORA5A, SNORA70, SNORD104 and U3). Survival analysis of this expression signature reveals that LUAD patients with high risk score was significantly related to an unfavourable overall survival (adjusted P = 0.01, adjusted hazard ratio = 1.476, 95% confidence interval = 1.096‒1.987). Functional analysis indicated that LUAD patients with different risk score phenotypes had significant differences in cell cycle, apoptosis, integrin, transforming growth factor beta, ErbB, nuclear factor kappa B, mitogen-activated protein kinase, phosphatidylinositol-3-kinase and toll like receptor signaling pathway. Immune microenvironment analysis also indicated that there were significant differences in immune microenvironment scores among LUAD patients with different risk score. In conclusion, this study identified an novel expression signature containing five LUAD prognostic snoRNAs, which may be serve as an independent prognostic indicator for LUAD patients.


Introduction
Lung adenocarcinoma (LUAD) is the most common subtype of lung cancer worldwide and is a non-small cell lung cancer. With the rapid development of medical technology and drugs, the diagnosis and treatment of LUAD has been greatly improved, but the survival rate of LUAD is still not satisfactory. Small nucleolar RNA (snoRNA) is a class of small non-coding RNAs widely found in the nucleoli of eukaryotic cells. Among them, Box C/D and Box H/ACA are the main known snoRNA types, which mainly guide the dioxymethylation and pseuduracil modification of ribosomal RNA by base pairings, respectively [1]. In addition, some studies have reported that snoRNA is involved in the post-transcriptional modification of snRNA, tRNA and mRNA [1]. There are also a considerable number of snoRNAs whose functions are unknown and are called orphan snoRNAs. In recent years, With the advancement of high-throughput sequencing, numerous dataset suggest that snoRNA is dysregulation and plays a role in multiple cancers [2,3]. Previous studies have reported that snoRNA plays a role in tumorigenesis and progression of cancers, and may become a new prognostic biomarker for cancers [4,5]. Since multiple studies have reported that snoRNAs can be serve as prognostic biomarkers for cancers, however, the prognostic values of snoRNAs in LUAD remain unclear [6,7]. In current study, the prognostic value of snoRNAs were explored based on the RNA sequencing dataset of The Cancer Genome Atlas (TCGA) LUAD cohort, and the bioinformatics approaches were further used for comprehensive analysis.

Acquisition of RNA sequencing dataset
The whole genome RNA sequencing dataset of lung adenocarcinoma were downloaded from TCGA website (https://portal.gdc.cancer.gov) [8], and the raw dataset were normalized by edgeR. SnoRNAs with an average value greater than 0.5 was included in the follow-up analysis, and the rest were excluded. Inclusion criteria for patients included in the prognostic analysis: 1) LUAD patients have both survival information and RNA sequencing dataset; 2) RNA sequencing sample is the primary tumor tissue. Exclusion criteria: 1) Recurrent tumor tissue or duplicate samples of the same patient; 2) The patient lacks survival parameters or RNA sequencing dataset. We finally won the 500 cases LUAD patients included in the subsequent survival analysis [9]. Since all the dataset for the present study comes from TCGA database, the authors did not have any experiments involving humans or animals in this study. Therefore, the approval of the ethics committee is not required.

Identification of prognostic snoRNAs and signature construction
The identification of prognostic snoRNAs are calculated by the survival package in the R platform through the multivariate Cox proportional hazard regression model. The high-and low-expression groups of snoRNAs or genes were grouped according to the median value of expression. The clinical adjustment variables in the multivariate Cox model are these variables related to LUAD overall survival (OS). We use the step function of the survival package to screen the optimal expression signature for the prognostic-related snoRNAs of LUAD in the R platform. The calculation formula of snoRNA expression signature is as follows: Risk score = expression of snoRNA1 × β1 + expression of snoRNA2 × β2 + … expression of snoRNAn × βn [9]. The surcivalROC package is used to evaluate the prediction accuracy of this snoRNA expression signature. The nomogram was executed by the rms package. Joint effect survival analysis was used to evaluate the combination of risk score and traditional clinical parameters to classify LUAD patients with different prognosis.

Functional enrichment analysis
In order to further understand the potential functional mechanism of this snoRNA expression signature. We conduct a functional enrichment analysis through three ways of snoRNA co-expression genes, differentially expressed genes (DEGs) and gene set enrichment analysis (GSEA). The acquisition of whole-genome expression profile dataset was derived from the same dataset as snoRNA, and the processing method of the original data was also the same. Co-expression genes were defined as the absolute value if Pearson correlation coefficient (r) between geneand snoRNA expression in LUAD tumor tissues greater than 0.2. DEGs were also identified by edgeR, and DEGs were defined as log2|fold change (FC)| > 1, P < 0.05 and false discovery rate (FDR) > 0.05. The GSEA analysis was performed by the GSEA desktop application and the reference gene sets were C2 (c2.all.v7.4.symbols.gmt) and C5 (c5.all.v7.4.symbols.gmt). The criteria for GSEA significance results are as follows: |Normalized Enrichment Score (NES)| > 1，FDR < 0.5 and P < 0.05 [10,11]. Functional enrichment analysis of DEGs and co-expression genes were performed by Database for Annotation, Visualization and Integrated Discovery v6.8 (DAVID v6.8, https://david.ncifcrf.gov/home.jsp), and gene ontology (GO) term and kyoto encyclopedia of genes and genome (KEGG) pathways were annotated for these genes [12,13]. Subsequently, we also investigated the tumor immune microenvironment of LUAD. The ESTIMATE package is used to calculate the parameters of the tumor immune microenvironment [14].

Statistical analysis
The co-expression interaction was evaluated using Pearson's correlation coefficient r. The calculation of FDR is performed according to the Benjamini-Hochberg program [15]. All statistical analysis adopts SPSS version 22.0 and R version 3.6.2. P < 0.05 considered significant difference.

Identification of prognostic snoRNAs
The demographics of TCGA LUAD cohort are shown in Table S1. In the patients' baseline dataset, we found that tumor stage is closely related to OS in TCGA LUAD cohort. Therefore, we included tumor stage as a correction factor in the multivariate Cox proportional hazards regression model. We obtained 940 snoRNAs from the raw RNA sequencing dataset of TCGA LUAD cohort, and finally obtained 288 snoRNAs that met the requirements for subsequent survival analysis. Seven snoRNAs that are significantly related to LUAD OS were screened out through multivariate Cox proportional hazards regression model analysis in the R platform (Table 1 and Figure 1

Comprehensive survival analysis of snoRNA expression signature
Then we get the optimal combination signature through the step function in R platform. The formula of expression signature is as follows: Risk score = expression of snoU109 × (0.1293) + expression of SNORA5A × (0.1046) + expression of SNORA70 × (-0.2012) + expression of SNORD104 × (-0.1005) + expression of U3 × (-0.1155). Through survival analysis, we found that in this snoRNA expression signature, the OS of LUAD patients with high risk was significantly reduced (adjusted P = 0.01, adjusted hazard ratio = 1.476, 95% confidence interval = 1.096-1.987, Figure 3A,B). Subsequent time-dependent receiver operator characteristic curve (ROC) analysis found that this snoRNA expression signature has a certain application value in predicting LUAD OS. The area under curve (AUC) is 0.618 in 5-year OS, and the highest is 0.666 in 9-year OS ( Figure 3C). Joint effect survival analysis suggests that the combination of risk score model and tumor stage can more accurately classify LUAD patients into subgroups with significant differences in prognosis (Table 2 and Figure 4A,B). The nomogram suggested that the contribution of risk score in the TCGA LUAD cohort was second only to the tumor stage ( Figure 4C).

Functional enrichment analysis
We also screened co-expression genes for the five snoRNAs in the risk score signature by Pearson correlation coefficient. A total of 917 SNORA5A co-expressed genes were screened ( Figure 5 and Table S3). 158 co-expressed genes for SNORA70 ( Figure 6 and Table S3). 1048 co-expressed genes for SNORD104 (Figure 7 and Table S3). 987 co-expressed genes for snoU109 ( Figure 8 and Table  S3). 303 co-expressed genes for U3 ( Figure 9 and Table S3). Through merging, we found that a total of 3031 snoRNA co-expressed genes were obtained. Functional enrichment analysis revealed that these genes were significantly involved in the following biological functions: DNA replicationdependent nucleosome assembly, ubiquitin-protein transferase activity, cell cell adhesion, DNA repair, regulation of cell cycle, integrin binding, I-kappaB kinase/NF-kappaB signaling, negative regulation of TOR signaling, hippo signaling, and regulation of ubiquitin-protein ligase activity involved in mitotic cell cycle (Table S4). KEGG analysis revealed that these snoRNA co-expressed genes may be involved in the following signaling pathways: focal adhesion, ubiquitin mediated proteolysis, proteoglycans in cancer, transcriptional misregulation in cancer, and Ras signaling pathway (Table S4). A total of 319 LUAD prognostic related genes were identified, including 169 low risk genes [hazard rate (HR) < 1] and 150 high risk genes (HR > 1) ( Figure 10A and Table S5). The top three significant genes are transmembrane protein 125 (TMEM125, Figure 10B), phosphatidylglycerophosphate synthase 1 (PGS1, Figure 10C), and myozenin 1 (MYOZ1, Figure 10D).      We further screened DEGs for LUAD patients between low-and high-risk phenotypes. A total of 772 DEGs were obtained, of which 394 DEGs were significantly down-regulated and 378 DEGs were significantly up-regulated ( Figure 11 and Table S6). Functional enrichment analysis revealed that these genes were significantly involved in the following biological functions: DNA replication-dependent nucleosome assembly, DNA replication-independent nucleosome assembly, positive regulation of cell proliferation, positive regulation of mitotic nuclear division, positive regulation of cell division, cell surface receptor signaling pathway, and fibroblast growth factor receptor binding (Table S7). KEGG analysis revealed that these DEGs may be involved in the following signaling pathways: drug metabolism-cytochrome P450, PPAR signaling pathway, transcriptional misregulation in cancer, and chemical carcinogenesis (Table S7). Through the survival analysis of these DEGs, we obtained a total of 88 LUAD prognostic-related DEGs ( Figure 12A and Table S7). The top three significant DEGs are protein arginine methyltransferase 8 (PRMT8, Figure 12B), keratin 9 (KRT9, Figure 12C) and growth factor independent 1B transcriptional repressor (GFI1B, Figure 12D). In addition, we also constructed the co-expression interaction network of these DEGs through the weighted gene co-expression network analysis (WGCNA) approach. We have obtained three DEGs modules through WGCNA ( Figure 13A-D). The WGCNA co-expression interaction network is displayed in Figure 14. In this WGCNA coexpression interaction network, there are two hub DEGs with the highest degree, which are histone cluster 1 H3b (HIST1H3B) and histone cluster 1 H1b (HIST1H1B) respectively, and both with the degree of 38 in turquoise module ( Figure 14).    We also use GSEA to explore the mechanism of this snoRNA expression signature. GSEA using the c5 reference gene set suggested that LUAD patients with high-risk score phenotypes are markedly different from low-risk score phenotypes in the following biological processes: ErbB signaling pathway, toll like receptor signaling pathway, regulation of cell cell adhesion, nuclear factor (NF)-kappaB-Inducing Kinase (NIK)/NF kappa B signaling, non canonical Wnt signaling pathway, cell cell G1/S phase transition, integrin mediated signaling pathway, extrinsic apoptotic signaling pathway, epidermal growth factor receptor signaling pathway, platelet derived growth factor receptor signaling pathway, transforming growth factor beta receptor signaling pathway, positive regulation of Wnt signaling pathway, receptor signaling pathway via signal transducers and activators of transcription (STAT), regulation of apoptotic signaling pathway, regulation of mitogen-activated protein kinase (MAPK) activity and phosphatidylinositol 3 kinase (PI3K) binding ( Figure 15A-P). GSEA using the c2 reference gene set suggested that LUAD patients with high-risk score phenotypes are markedly different from low-risk score phenotypes in the following pathways: epidermal growth factor receptor (ErbB1) receptor proximal pathway, p38/MAPK pathway, toll like receptor signaling pathway, metastasis epithelial-to-mesenchymal transition (EMT) up, PI3KCI/AKT pathway, NFκB signaling, signaling by the B cell receptor (BCR), Notch pathway, apoptosis, cell cycle pathway, p53 signaling pathway, hypoxia-inducible factor (HIF) pathway, tumorigenesis up, tumor angiogenesis up, TGFB signaling pathway and integrin pathway ( Figure 16A-P).

Investigation of tumor immune microenvironment
We also used whole-genome RNA sequencing data to score the immune microenvironment in TCGA Luad patients. Stromal score was significantly increased in LUAD patients with high risk and high snoU109 ( Figure 17A). Stromal score was markedly reduced in LUAD patients with high expression of SNORD104 ( Figure 17A). Immune score was markedly increased in LUAD patients with high risk ( Figure 17B). ESTIMATE score was markedly increased in LUAD patients with high risk and high snoU109, and was markedly reduced in LUAD patients with high expression of SNORD104 ( Figure 17C). Stromal score of risk score and its snoRNAs in LUAD tumor tissues; (B) Immune score of risk score and its snoRNAs in LUAD tumor tissues; (C) ESTIMATE score of risk score and its snoRNAs in LUAD tumor tissues.

Discussion
With the wide application of next-generation sequencing technology in tumors, more and more non-coding RNAs have been found to be abnormally expressed in tumor tissues, and their biological functions play a certain role in tumorigenesis, development and prognosis [16,17]. SNORNA belongs to a class of non-coding RNAs. Recently, increasing evidences have shown that SNORNA plays an important role in the tumorigenesis, development, diagnosis and prognosis of cancers [18]. Previous studies have mined the high-throughput sequencing data of TCGA and constructed a snoRNA-related online analysis tool to explore the prognosis and drug response for snoRNA in TCGA pan-cancer cohort [2,3,19]. Shuwen et al. reviewed snoRNAs related to hepatocellular carcinoma (HCC). Multiple snoRNAs have been reported to be closely related to EMT and Wnt/β-catenin signaling pathways [4]. Dsouza et al. also reviewed the breast-cancer-related snoRNA and gave a comprehensive overview of the role of snoRNA in breast cancer. In addition, the regulation, biological function, signaling pathway and clinical efficacy of the abnormal expression of snoRNAs in breast cancer were summarized [5]. Previous studies have also identified cancer prognostic related snoRNAs and constructed prognostic snoRNAs expression signature based on the TCGA RNA sequencing dataset. Deng et al. used the TCGA lower grade glioma (LGG) patient RNA sequencing data set to screen for prognostic-related snoRNA, and initially constructed an prognostic expression signature based on eleven prognostic snoRNAs, and initially identified the potential biological function mechanisms and targeted drugs of this expression signature in LGG [7]. Cao et al. used the TCGA bladder cancer patient RNA sequencing data set to screen prognostic-related snoRNA, and then used the least absolute shrinkage and selection operator cox regression model to construct an expression signature based on five prognostic snoRNAs. And constructed a bladder cancer prognostic nomogram based on the risk score model. SuivivalROC is used to evaluate the accuracy of this risk score model to predict the 5-year survival of bladder cancer patients, and the AUC is greater than 0.7 [20]. Liu et al. also identified 15 prognostic-related snoRNAs by analyzing TCGA sarcoma cohort dataset, and constructed a expression signature based on four prognostic snoRNAs [6]. Pan et al. built an incremental feature selection based on the TCGA pan-cancer database and using the monte carlo feature selection approach to predict the prognosis of cancer patients [21]. Zhao et al. confirmed that snoRNA is a diagnostic indicator of clear cell renal cell carcinoma through TCGA data and the serum samples of the validation cohort, and also constructed a six-snoRNA signature that can be used as an independent prognostic indicator of clear cell renal cell carcinoma [22]. In this study, a prognostic expression signature of five-snoRNA was constructed based on the multivariate Cox proportional hazard regression model by LUAD prognostic snoRNAs. In a review of previous studies, no one in LUAD has ever comprehensively analyzed prognostic related snoRNA. This study is the first to report that snoRNA can be used as a prognostic marker for LUAD.
Among the five prognostic snoRNAs in the signature of this study, only U3 is closely related to cancers in previous studies, and the other four snoRNAs have not been reported in cancers. expression had a poor prognosis [6]. In this study, we also observed that LUAD patients with low U3 expression have a poor prognosis for patients with higher expression. Some DEGs or co-expressed genes of snoRNAs screened in this study that are related to the prognosis of LUAD have also been reported to be closely related to cancers in previous studies. TMEM125 has been reported to be significantly associated with prognosis in non-small cell lung cancer [23]. Previous studies have confirmed that PRMT8 plays a certain role in the proliferation, invasion, migration and drug resistance of colon cancer, and is closely related to the formation of stem cells of colon cancer cells [24]. PRMT8 is highly expressed in breast cancer, ovarian cancer, and gastric cancer. High PRMT8 expression is associated with increased survival in breast and ovarian cancer patients, while high PRMT8 expression is associated with reduced survival in gastric cancer patients [25]. KRT9 and HSP70 interact to influence bladder cancer cells invasion and metastasis in vitro [26]. GFI1B is necessary for the development of erythrocyte and megakaryocyte cell lines, and its abnormal function often leads to cancer or hematologic malignancies [27]. Meanwhile, study have shown that GFI1B can regulate the immobility of hematopoietic stem cells and the differentiation of red blood cells and platelets, and the loss of GFI1B heterozygosity can accelerate the progression of acute myeloid leukemia (AML) [28]. Low expression of GFI1B is associated with poor prognosis in patients with myelodysplastic syndrome (MDS) and AML [29]. Spi-1 proto-oncogene (SPI1) is a regulatory target of GFI1B, and there is a negative correlation between the two genes. GFI1B can affect the development of erythroid and bone marrow mononuclear cells by regulating SPI1 [30]. GFI1B p32 mutation can reduce the number of platelets, p32 is elevated in acute and chronic leukemia cells, and cancer-related genes in humans are significantly dysregulated, confirming that GFI1B may play a role in carcinogenic regulation [31]. Significant overexpression of GFI1B in hematological malignancies can increase the proliferation of leukemia cells and make the disease progress. Targeted inhibition of GFI1B can significantly inhibit the proliferation of hematological malignancies and can be used as a potential therapeutic target for hematological malignancies [32,33]. In the functional enrichment analysis, we enriched a large number of classic tumor-related signaling pathways, which may be the potential molecular mechanism of snoRNA's function in LUAD.
The objective evaluation of this research is not perfect. First, this study is a single-center cohort study and lacks additional validation cohorts. Second, this study is based on RNA sequencing data for functional enrichment analysis, and the biological functional mechanisms selected still need to be verified by in vivo and in vitro experiments. Third, due to the limited clinical parameters and sample size provided by the TCGA official website, our results still need further verification. Despite the above shortcomings, our research still innovative. First of all, this study is the first comprehensive screening study of LUAD prognostic-related snoRNAs based on RNA sequencing dataset, and no one has ever reported on LUAD prognostic snoRNAs before. Secondly, we also use prognostic-related snoRNA to construct a prognostic signature model. Third, we also explored the molecular mechanism of this snoRNA feature based on RNA sequencing data. In summary, this study innovatively identified the prognostic related snoRNAs of LUAD for the first time, and comprehensively explored them from the perspective of prognosis and molecular mechanisms. It has certain clinical application value and provides theoretical basis for future study.

Conclusions
In summary, seven LUAD prognostic related snoRNAs were screened in this study. We also constructed an expression signature containing five LUAD prognostic related snoRNAs (snoU109, SNORA5A, SNORA70, SNORD104 and U3). Functional analysis indicated that LUAD patients with different risk score phenotypes had significant differences in cell cycle, apoptosis, integrin, TGFB, ErbB, NFκB, MAPK, PI3K and toll like receptor signaling pathway. These biological functional mechanisms may be the potential factors for the significant differences in prognosis between different risk score phenotype LUAD patients. Immune microenvironment analysis also indicated that there were significant differences in immune microenvironment scores among LUAD patients with different risk score. However, our results still need to be validated in future studies.