Identification of a circadian gene signature that predicts overall survival in lung adenocarcinoma

Background Lung adenocarcinoma (LUAD) is one of the most common subtypes of lung cancer which is the leading cause of death in cancer patients. Circadian clock disruption has been listed as a likely carcinogen. However, whether the expression of circadian genes affects overall survival (OS) in LUAD patients remains unknown. In this article, we identified a circadian gene signature to predict overall survival in LUAD. Methods RNA sequencing (HTSeq-FPKM) data and clinical characteristics were obtained for a cohort of LUAD patients from The Cancer Genome Atlas (TCGA). A multigene signature based on differentially expressed circadian clock-related genes was generated for the prediction of OS using Least Absolute Shrinkage and Selection Operator (LASSO)-penalized Cox regression analysis, and externally validated using the GSE72094 dataset from the GEO database. Results Five differentially expressed genes (DEGs) were identified to be significantly associated with OS using univariate Cox proportional regression analysis (P < 0.05). Patients classified as high risk based on these five DEGs had significantly lower OS than those classified as low risk in both the TGCA cohort and GSE72094 dataset (P < 0.001). Multivariate Cox regression analysis revealed that the five-gene-signature based risk score was an independent predictor of OS (hazard ratio > 1, P < 0.001). Receiver operating characteristic (ROC) curves confirmed its prognostic value. Gene set enrichment analysis (GSEA) showed that Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways related to cell proliferation, gene damage repair, proteasomes, and immune and autoimmune diseases were significantly enriched. Conclusion A novel circadian gene signature for OS in LUAD was found to be predictive in both the derivation and validation cohorts. Targeting circadian genes is a potential therapeutic option in LUAD.


INTRODUCTION
Lung cancer is a leading cause of death in the world (Bray et al., 2018). The estimated 5-year survival rate is only 19% (Siegel, Miller & Jemal, 2019). In 2019, there were 228,150 new diagnoses of cancers of the lung and bronchus in the United States. Primary lung cancer is divided into two main types: small-cell lung carcinoma and non-small cell lung carcinoma (NSCLC). The latter is further classified into different subtypes according to the histological origin, such as lung adenocarcinoma (LUAD), squamous cell carcinoma, or large cell carcinoma. Among these, LUAD is the most prevalent subtype, with an increasing incident in recent years (Cheng et al., 2016). The prognosis of LUAD is improving due to advances in molecular targeted treatment and immunotherapy (Hirsch et al., 2016;Peters et al., 2019). However, accurate prognosis prediction models for LUAD are still lacking.
The circadian clock is a molecular time-keeping system that is evolutionarily conserved. It is vital for the maintenance of physiologic homeostasis and normal function in all organisms. It coordinates a variety of biological processes and behaviors (Fu & Kettner, 2013;Panda et al., 2002). In the suprachiasmatic nucleus (SCN) of the hypothalamus, a central clock maintains the daily rhythms in the body by neural and humoral communication with peripheral clocks located in peripheral tissues and regulates bodily functions such as sleep/wake cycles and the secretion of many hormones. Disruption of the circadian clock has been listed as a likely carcinogen by the World Health Organization based on both population and laboratory-based findings (Lunn et al., 2017;Straif et al., 2007), which raised the interest in research on the relationship between circadian genes and tumor development. Some circadian genes have been demonstrated to control the occurrence and development of NSCLC (Qiu et al., 2019). However, the association between circadian genes and prognosis in patients with LUAD remains to be elucidated.
The present study aims to explore the prognostic role of circadian genes in patients with LUAD using The Cancer Genome Atlas (TCGA) data obtained from the NCI Genomic Data Commons, which includes the clinical characteristics and mRNA expression profiles of tumor and tumor-adjacent normal tissues. A prognostic multigene signature will be established using differentially expressed circadian clock genes and then validated with the GSE72094 dataset extracted from the Gene Expression Omnibus (GEO) database. Underlying molecular mechanisms were investigated by performing a Gene set enrichment analysis (GSEA) with Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways.

Data collection
The clinical characteristics and RNA sequencing data (HTSeq-FPKM) of 515 patients with LUAD were retrieved from the NCI Genomic Data Commons (https://portal.gdc. cancer.gov/repository). These 515 patients provided 535 samples from LUAD tumor tissue and 59 samples from adjacent normal tissue. Among the patients, 500 had complete RNA sequencing data and 469 had both complete sequencing data and complete clinical information.
The differential expression of the following 14 core genes of the circadian clock according to previous literature was analyzed: Period 1 (PER1), PER2, PER3,  (Chen et al., 2020;Cox & Takahashi, 2019;Mocellin et al., 2018;Shafi & Knudsen, 2019;Yu et al., 2019). The validation dataset was obtained from the GSE72094 dataset in the GEO database (https://www.ncbi.nlm.nih.gov/geo/) and included microarray and clinical data for 443 LUAD tumor samples (Schabath et al., 2016). The normalized count data were downloaded. The data cut-off date was September 10, 2020. Patients with no follow-up data or information on the expression of circadian genes were excluded.
The TCGA and GEO databases are public data repositories and therefore, ethical approval for this study was not required. This study followed the polices and guidelines for data access and publication specified by the TCGA and GEO databases.

Prognostic validity of the gene signature
Differentially expressed genes (DEGs) involved in the circadian clock were analyzed in the tumor and tumor-adjacent normal tissues of LUAD patients from the TCGA cohort using the ''limma'' package in R (false discovery rate (FDR) <0.05). Univariate Cox regression analysis was used to identify circadian genes related to overall survival (OS). A gene signature for the prediction of OS was constructed with the DEGs for the circadian clock using Least Absolute Shrinkage and Selection Operator (LASSO)-penalized Cox regression analysis and the ''glmnet'' package in R. DEGs served as independent variables, and OS as the response variable.
A risk score based on the expression of identified candidate genes was calculated for each patient according to the following formula: score = sum (normalized gene expression level × regression coefficient). Patients were classified as either high-or low-risk using the median score as the cut-off value. The survival analysis of different risk groups was determined with the ''survminer'' R package. In order to validate the performance of the signature, we used the principal components analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) to analyze dimensionality reduction. The ''prcomp'' function in the R ''stats'' package was used to carry out the PCA. The data distribution for high-risk and low-risk patients was also mapped using t-SNE and the ''Rtsne'' package in R. The predictive value of the gene signature was evaluated with time-dependent Receiver operating characteristics (ROC) curve analysis using the ''timeROC'' package in R. The associations between the risk score, clinical characteristics (gender, age, smoking history, and stage), and OS were assessed with univariate and multivariate Cox regression analyses.

Functional enrichment analysis
The DEGs between the high-and low-risk groups in the TCGA LUAD cohort were identified using the ''limma'' R package again. GSEA of these DEGs was carried out with KEGG pathways (|log2 fold change| ≥ 1, FDR <0.05). Both a nominal P-value <0.05 and FDR q-value <0.05 were considered statistically significant.

Statistical analysis
All statistical analyses were conducted with R software (Version 3.5.3) and SPSS software (Version 25.0). Gene expression was compared using the two-tailed Student's t -test and proportions were compared using the Chi-squared test. The Kaplan-Meier method and the log-rank test were used to assess the differences in OS between high and low-risk patients. Univariate and multivariate Cox regression analyses were used to identify independent predictors of OS. P < 0.05 (two tailed) was considered statistically significant.

Clinical and demographic characteristics
Two patient cohorts with available data on OS and the RNA expression of circadian clock genes were used to create the prognostic model. The derivation cohort consisted of 500 patients with LUAD and complete RNA sequencing data from the TCGA database while the validation cohort consisted of 398 patients with LUAD from the GSE72094 dataset. Among these patients, 469 patients from TCGA and 328 patients from GSE72094 who not only had complete RNA sequencing data, but also complete clinical data including OS, age, gender, smoking history, and tumor stage, were included in the univariate and multivariate COX analyses. The validation cohort had higher age, lower TNM stage, more smokers, and a higher median OS compared to the derivation cohort. The baseline demographic and clinical characteristics of the included patients are summarized in Table 1.

Identification of DEGs related to circadian clock in the TGCA LUAD cohort
In the TCGA LUAD cohort, 9/14 circadian genes were found to be differentially expressed between tumor and tumor-adjacent normal tissues. Five candidate genes were identified to be significantly associated with OS using univariate Cox proportional regression analysis (Figs. 1A-1B). The clustering of the 5 candidate genes are shown with a heatmap in Fig. 1C.

Generation of a prognostic signature in the TGCA LUAD cohort
The 5 identified candidate genes were incorporated into a five-gene-signature based prognostic model using LASSO Cox regression analysis. According to risk scores calculated using the expression levels of these 5 genes, half of the patients were classified as high-risk (n = 250) and the other half as low-risk (n = 250) ( Fig. 2A). The chance of survival was lower and the survival time was shorter in the high-risk group than in the low-risk group (Fig. 2B). PCA and t-SNE analysis showed discernible dimensions between high-risk and low-risk patients (Figs. 2C-2D). Kaplan-Meier survival curves confirmed that OS was significantly worse in high-risk than in low-risk patients (Fig. 2E, P < 0.001). The predictive performance of the five-gene-signature based risk score for OS was evaluated using time-dependent ROC curves. The area under the curve (AUC) values were: 1 year, 0.726; 2 years, 0.668; and 3 years, 0.657 (Fig. 2F).

Validation of the five-gene-signature based prognostic model
The stringency of the model developed using the TGCA LUAD cohort was validated in the GSE72094 dataset. Risk scores were calculated for all patients based on the expression levels of the 5 identified candidate genes and patients were classified as high-risk or low-risk accordingly (Fig. 3A). The high-risk group had a significantly higher chance of death and lower OS time (Fig. 3B). PCA and t-SNE analysis showed discernible dimensions between high-and low-risk patients (Figs. 3C-3D). Kaplan-Meier survival curves confirmed that OS was significantly worse in high-risk patients (Fig. 3E, P < 0.001). The AUC values were: 1 year, 0.621; 2 years, 0.657; and 3 years, 0.642 (Fig. 3F).

Prognostic value of the five-gene-signature based risk score
Univariate and multivariate Cox regression analyses were conducted to determine whether the five-gene-signature based risk score was an independent predictor of OS (Table 2). The derivation cohort consisted of 469 patients from the TCGA LUAD cohort; and

Enrichment analysis in the TGCA LUAD cohort
Genes that were differentially expressed in the high-respective low-risk groups were subjected to GSEA for KEGG pathways (Table 3). The results showed that tumorigenesis pathways related to pyrimidine metabolism, cell cycle, proteasome, base excision repair, homologous recombination, and DNA replication were enriched (Fig. 5).

DISCUSSION
Genes of the circadian clock are often abnormally expressed in tumor tissues and may play an important role in tumorigenesis (Kelleher, Rao & Maguire, 2014;Kettner, Katchy & Fu, 2014). The present study identified 9 DEGs between tumor and tumor-adjacent normal tissues among the 14 circadian genes. The genes PER3, CRY2, TIMELESS, NPAS2, and RORA were found to be correlated with OS. These results suggest that circadian clock genes may affect the survival outcome in LUAD and that a signature based on the expression of these genes may predict OS and may be an independent prognostic factor. The PER family is generally considered to have a tumor suppressor effect, and the mechanisms behind the tumor suppressing effects of PER1 and PER2 are clear (Gery et al., 2006;Wood et al., 2008). PER3 has been confirmed to affect the susceptibility and prognosis of lung cancer through expression changes, methylation, and single nucleotide polymorphisms (SNPs) (Chu et al., 2018;Couto et al., 2014;Liu et al., 2014). However, the exact mechanism for the PER3 inhibition of tumors is not yet clear. The study by Jun-Sub et al. showed that PER3 is required for checkpoint kinase 2 (CHK2) activation in human cells, which highlighted its potential role in cell cycle arrest and DNA damage repair (Im et al., 2010). Previous studies have linked the circadian clock gene CRY2 to the occurrence and development of many tumors (Hasakova et al., 2018;Lesicka et al., 2018;Relles et al., 2013;Tokunaga et al., 2008). As a transcriptional suppressor, CRY2 functions as an important regulator of cell cycle, proliferation, DNA damage checkpoint control, and DNA repair (Hoffman et al., 2010). CRY2 acts as a tumor suppressor gene. It can limit tumor formation by increasing c-MYC turnover (Huber et al., 2016), or increase the elimination of premalignant and malignant cells through the activation of p53-independent apoptosis pathways (Lee & Sancar, 2011). The circadian genes NPAS2 and TIMELESS, on the other hand, are both correlated with poor OS. A recent study has shown that upregulated NPAS2 promoted the survival of hepatocellular carcinoma cells through the upregulation of cell division cycle 25 A (CDC25A) and inhibition of mitochondria-dependent intrinsic apoptosis (Yuan et al., 2017). Knockout or inhibition of TIMELESS can lead to cell cycle stagnation and subsequent apoptosis, which limits the growth of liver cancer cells (Elgohary et al., 2015). The circadian gene RORA was found to be downregulated in LUAD tissue and negatively correlated with LUAD prognosis in this study. RORA is a versatile gene. Besides the circadian clock, it is also a well-known regulator of inflammation and lipid metabolism. Moreover, recent studies have suggested that RORA may also play a role in the progression and prognosis of colon cancer and breast cancer (Lee et al., 2010). The recruitment of RORA can induce the expression of the tumor suppressor genes F-box/WD repeat-containing protein 7 (FBXW7 ), Semaphorin 3F (SEMA3F ), and P21, leading to apoptosis and suppression of tumor cell proliferation (Wang et al., 2017).
Results from the enrichment analysis revealed that metabolic pathways related to the substrates of DNA synthesis (pyrimidine metabolism and pentose phosphate pathway) were enriched in the high-risk group, as well as pathways regulating cell cycle and DNA replication. Increasing evidence suggests a regulatory effect of circadian genes on cellular proliferation (Chakrabarti & Michor, 2020), and their involvement in the proliferation of a variety of tumor cells (Abreu et al., 2018;Wang et al., 2016;Yu et al., 2018). A recent study on lung cancer demonstrated that the loss of the central clock components led to increased c-MYC expression, which enhanced proliferation (Papagiannakopoulos et al., 2016). Base excision repair and homologous recombination pathways were also found to be enriched in the high-risk group, which may indicate that the circadian clock disorder affects the repair of gene damage to influence the survival of malignant tumors.  et al., 2018;Ullah et al., 2020). This cross-talk between circadian clock genes and the ubiquitin-proteasome pathway may be related to the prognosis of LUAD. Some immune and autoimmune disease pathways were enriched in the low-risk group. This shows that the disturbance of the circadian clock is accompanied by alterations in the function of the immune system (Aiello et al., 2020), which may be related to the occurrence, development, and prognosis of LUAD. Wu and his colleagues have shown that abnormal circadian genes contribute to T cell exhaustion and global upregulation of immune inhibitory molecules, such as programmed death-ligand 1 (PD-L1) and cytotoxic T-lymphocyte antigen (CTLA)-4, which promote tumor development (Wu et al., 2019).
There are several limitations to this study. Firstly, the present study is a retrospective study with data from publicly available databases. This makes the study more prone to selection bias and it is also impossible to draw conclusions regarding cause-effect. Experimental studies should be conducted to understand the mechanisms behind the role of the circadian genes. Secondly, using tumor-adjacent normal tissue as a control has the advantages of minimizing biological variation, but one cannot be sure if the seemingly ''normal'' tissue adjacent to a tumor is truly ''normal''. Thirdly, while there might be many other genes that are important in LUAD, we only focused on 14 core genes of the circadian clock. It is possible that other more important genes were excluded from the design.

CONCLUSIONS
In summary, we constructed a novel five-gene signature with genes involved in the circadian clock to predict the prognosis of LUAD. The signature could successfully separate LUAD patients with a low risk of non-survival from those with a high risk in both the derivation and validation cohorts. The underlying molecular mechanisms between circadian genes and tumor proliferation, DNA repair, ubiquitin-proteasome pathway, and immunity in LUAD remain poorly understood. and warrant further investigation. Circadian genes might be potential targets for future cancer therapy.