Identification of differentially expressed genes in non-small cell lung cancer

Lung cancer is the most common malignant tumor and the leading cause of cancer-related deaths worldwide. Because current treatments for advanced non-small cell lung cancer (NSCLC), the most prevalent lung cancer histological subtype, show limited efficacy, bioinformatics-based screening for tumor-associated biomarkers offers hope of improving early diagnosis and prognosis assessment. In our study, a Gene Expression Omnibus dataset was analyzed to identify genes with prognostic significance in NSCLC. Upon comparison with matched normal tissues, 118 differentially expressed genes (DEGs) were identified in NSCLC, and their functions were explored through bioinformatics analyses. The most significantly upregulated DEGs were TOP2A, SLC2A1, TPX2, and ASPM, all of which were significantly associated with poor overall survival (OS). Further analysis revealed that TOP2A had prognostic significance in early-stage lung cancer patients, and its expression correlated with levels of immune cell infiltration, especially dendritic cells (DCs). Our study provides a dataset of potentially prognostic NSCLC biomarkers, and highlights TOP2A as a valuable survival biomarker to improve prediction of prognosis in NSCLC.


INTRODUCTION
Lung cancer is the most common tumor worldwide, and carries the highest morbidity and mortality rates [1]. Lung cancer is classified into two major histological subtypes, small cell lung cancer (SCLC; 13% of cases) and non-small cell lung cancer (NSCLC; 83% of cases). Surgical resection is seldom an option for SCLC treatment, owing to typical advanced-stage diagnosis; thus, most SCLC patients receive chemotherapy, but its efficacy is generally limited. On the other hand, only a small number of early-stage NSCLC patients can be treated with surgery, which achieves a 5-year survival rate as high as 70% in patients with stage IA NSCLC [2]. Chemotherapy or radiotherapy are also indicated in patients with more advanced NSCLC, but are associated with a 5-year survival rate of only ~23%. While some success is being achieved with newer immunological and targeted therapies for NSCLC, there are still significant limitations precluding their use in many cases [3]. Notwithstanding, the low 5-year survival rate for patients with lung cancer is largely due to insufficient preventive efforts and generalized late diagnosis [4].
Bioinformatics analysis allows screening of tumor-associated biomarkers from large data repositories to assist early diagnosis and prognostic assessment of cancer [5,6]. For example, mining of publicly available genomic repositories (i.e. the Gene Expression Omnibus (GEO) database and The Cancer Genome Atlas (TCGA) database) led to identification of a subset of cancer-dysregulated miRNAs, which may allow early detection of pre-cancerous and cancerous oral lesions [7], and of tumor microenvironment-related genes that predict poor outcomes in glioblastoma patients [8].
In our study, a GEO dataset was selected for identification of differentially expressed genes (DEGs) in NSCLC. Gene ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), and protein-protein interaction (PPI) network analyses were used to link DEGs' genomic and functional information. In addition, data retrieved from TCGA and GTEx projects were evaluated through Gene Expression Profiling Interactive Analysis (GEPIA) to further assess the presence of relevant DEGs in NSCLC subtypes. Among the most significant DEGs, the TOP2A gene encoding human topoisomerase IIα (TOPIIα) emerged as a potential prognostic biomarker for early-stage lung cancer. Furthermore, its expression was negatively correlated with tumor infiltration of immune cells (especially dendritic cells, DCs) in NSCLC samples. While functional studies are needed to complement our findings, the biomarker dataset provided by our study may serve to improve early diagnosis of NSCLC and help advance new therapeutic strategies.

Identification of DEGs in NSCLC
The GEO dataset GSE103512 was selected for identification of DEGs in 60 human NSCLC specimens against 9 matched normal tissue samples using the GEO2R tool. Genes were defined as DEGs if they had a log2FC > 1.5 or < -1.5 and p < 0.01. A total of 118 genes were identified as DEGs; among these, 11 were upregulated and 107 were downregulated in NSCLC (Figure 1). A full DEG list is shown in Supplementary Table 1.
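The selection rule above can be sketched in a few lines of Python. This is a minimal illustration, not the GEO2R implementation: the `GeneResult` record, the `classify_degs` function, and the example values are hypothetical, and the text does not state whether the p-value threshold was applied to raw or adjusted p-values, so the sketch simply takes whichever p-value is supplied.

```python
from dataclasses import dataclass


@dataclass
class GeneResult:
    symbol: str
    log2_fc: float  # log2 fold change, tumor vs normal
    p_value: float  # raw or adjusted; the source does not specify which


def classify_degs(results, lfc_cutoff=1.5, p_cutoff=0.01):
    """Split results into up- and downregulated DEGs using the stated cutoffs."""
    up, down = [], []
    for r in results:
        if r.p_value >= p_cutoff:
            continue  # not significant at the chosen threshold
        if r.log2_fc > lfc_cutoff:
            up.append(r.symbol)
        elif r.log2_fc < -lfc_cutoff:
            down.append(r.symbol)
    return up, down


# Hypothetical values for illustration only (not taken from GSE103512).
example = [
    GeneResult("GENE_A", 2.1, 0.0004),
    GeneResult("GENE_B", -1.8, 0.002),
    GeneResult("GENE_C", 1.7, 0.03),    # fails the p-value cutoff
    GeneResult("GENE_D", 0.9, 0.0001),  # fails the fold-change cutoff
]
up, down = classify_degs(example)
print("up:", up, "down:", down)
```

Both conditions must hold at once, so a gene with a very small p-value but a modest fold change (like the hypothetical GENE_D) is not counted as a DEG.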

KEGG pathway analysis of DEGs
KEGG pathway analysis was performed using the Database for Annotation, Visualization and Integrated Discovery (DAVID) v6.8. Results indicated that the DEGs identified in NSCLC samples were mainly related to 'complement and coagulation cascades', 'p53 signaling pathway', 'ECM-receptor interaction', 'PPAR signaling pathway', and 'focal adhesion' (Figure 2D).

Validation of upregulated DEGs
The DEGs upregulated in NSCLC were selected for validation by quantitative real-time PCR (qPCR) on 17 paired NSCLC/adjacent non-tumor samples collected from surgical patients. The overall trend indicated that all the upregulated DEGs from the GEO database were also overexpressed at the mRNA level in our clinical NSCLC specimens. However, overexpression in NSCLC samples vs normal lung tissues was only significant for TOP2A (P = 0.018), SLC2A1 (P = 0.011), TPX2 (P = 0.016), and ASPM (P = 0.049) (Figure 3A-3K).

Protein-protein interaction network and correlation analysis of upregulated DEGs
We used the STRING database (https://string-db.org/) to construct protein-protein interaction (PPI) networks for 11 DEGs upregulated in NSCLC (Figure 5A). Results showed that TOP2A, TPX2, and ASPM were interconnected. GEPIA was next used to conduct correlation analysis on these three genes. The correlation coefficients for TOP2A & ASPM, TOP2A & TPX2, and TPX2 & ASPM were 0.63, 0.57, and 0.69, respectively (P < 0.001) (Figure 5B-5D). These data suggest that overexpression of TOP2A, TPX2, and ASPM may significantly impact the development or progression of NSCLC.
To confirm the relationship between TOP2A (the most significantly upregulated DEG) and patient prognosis, we analyzed TOP2A protein expression in a tissue microarray (TMA) of NSCLC samples by immunohistochemistry (IHC). TOP2A signal localized mainly in the nucleus and to a lesser extent in the cytoplasm of tumor cells. TOP2A immunoreactivity was next categorized into four levels, i.e. "-", "+", "++", and "+++", according to the intensity and density of TOP2A-positive cells in each sample (Figure 6E). Using sample-associated clinical data, Kaplan-Meier analysis confirmed that patients with high TOP2A expression had significantly worse OS (P = 0.0259) (Figure 6F). These results strongly suggest that assessment of TOP2A expression, at the protein and/or mRNA levels, could be a valuable aid for NSCLC prognosis evaluation.
To further investigate the possible impact of TOP2A expression on lung cancer, we analyzed the relationship between TOP2A expression and clinical characteristics of lung cancer patients in the Kaplan-Meier plotter databases (Table 1). High TOP2A expression was significantly associated with poor OS in both female (P = 1.3e-05) and male (P = 5.8e-06) patients. In addition, high TOP2A expression was associated with poor OS in Stage 1 (P = 9.6e-08), Stage T1 (P = 7.6e-05), Stage N0 (P = 3.6e-04), and Stage M0 (P = 3.2e-05) patients. These results indicated that TOP2A expression levels can inform prognosis in early-stage lung cancer patients. Therefore, we propose that TOP2A may serve as an efficient survival biomarker to significantly improve the prediction of NSCLC prognosis.

DISCUSSION
Among all malignant tumors, lung cancer currently carries the highest incidence (11.6%) and mortality rate (18.4%) in the world [9]. Although it is widely recognized that new cancer cases could be avoided by eliminating or reducing exposure to known lifestyle and environmental risk factors [10], the current burden of lung cancer requires urgent efforts to identify key genes with diagnostic and prognostic significance.
Our study interrogated 60 NSCLC samples and 9 matched normal lung controls in a GEO dataset to identify DEGs in NSCLC. A total of 118 genes (11 upregulated and 107 downregulated) were identified as DEGs. GO analysis showed that most of these DEGs were related to structures or functional processes affecting the extracellular space/matrix, suggestive of diverse roles in the tumor microenvironment. Association with GO 'glycosaminoglycan binding' molecular function, as well as GO 'oxygen-containing compound' and 'wound healing' biological processes indicated the DEGs' involvement in the anti-inflammatory response. These findings were supported by KEGG pathway analysis, which showed significant correlations with the complement system.
In general, the genes that were highly expressed in tumor tissues had known tumor-promoting effects, while the lowly expressed genes generally mediate tumor-suppressing effects. For validation analysis, we focused on putative tumor-promoting genes, which are more directly targetable and have therefore greater potential clinical applicability. Overall, our qPCR assays on 17 independent NSCLC samples confirmed that all the upregulated DEGs in the GEO dataset were also expressed at higher levels in our NSCLC samples, compared to matched non-tumor controls. However, mRNA overexpression in our NSCLC cohort was only significant for TOP2A, SLC2A1, TPX2, and ASPM. We speculate that increasing sampling size may still lead to validation of more DEGs. In addition, GEPIA analysis demonstrated that TOP2A, SLC2A1, TPX2, and ASPM were upregulated in clinical samples from two major subtypes of NSCLC, i.e. LUAD and LUSC, which stresses the potential relevance of these DEGs in NSCLC development and/or progression. Moreover, these findings are partially supported by a previous gene expression profiling study that identified TOP2A and TPX2 as putative biomarkers of NSCLC [11].
The TOP2A gene is located at 17q12-21 and encodes the human topoisomerase IIα (TOPIIα), which mediates DNA decatenation by rejoining DNA double-strand breaks to separate entangled sister chromatids during cell division [12,13]. TOP2A is highly expressed in dividing cells, and is considered a proliferation marker in both normal and tumor cells. TOP2A is highly expressed in esophageal, liver, gastric, breast, and colorectal cancers. In breast cancer, high expression of TOP2A is associated with low expression of estrogen receptor (ER) and high expression of Ki-67, and was proposed to be an important prognostic molecular indicator [12,14-18]. The SLC2A1 gene encodes GLUT1, a glucose transporter that mediates a rate-limiting step for glucose metabolism in cancer cells [19-21]. SLC2A1 is considered an early marker of malignant tumors, overexpressed in esophageal squamous cell carcinoma, gastric carcinoma, and colon cancer, among others, often in association with poor prognosis [22-25]. The TPX2 gene is located at 20q11.2 and encodes a microtubule-associated protein involved in spindle assembly during cell mitosis. TPX2 overexpression is common to many tumor types. In hepatocellular carcinoma, it was correlated with increased proliferation, apoptosis inhibition, and induction of EMT [26]. In breast cancer, TPX2 silencing repressed PI3K/AKT and activated p53 signaling, which inhibited proliferation and promoted apoptosis [27]. The ASPM gene, located at 1q31, encodes a 3477 amino-acid-long protein involved in mitotic spindle regulation and DNA double-strand break repair. ASPM overexpression has been associated with the development of various tumors [28,29]. In hepatocellular carcinoma, ASPM was suggested to be a novel marker for vascular invasion, early recurrence, and poor prognosis [28]. In prostate cancer, high ASPM expression correlated with tumor progression and predicted poor outcome [29].
Altogether, the above findings from diverse tumor types are consistent with our expression data and our PPI network results, suggesting that TOP2A, TPX2, and ASPM function interconnectedly to increase mitotic rate in tumor cells. In subsequent studies, TOP2A-, TPX2-, and ASPM-specific knockout cell and animal models could be used to validate the contribution of each gene to NSCLC progression and survival.
Our survival analyses on the Kaplan-Meier plotter tool indicated that upregulation of TOP2A, SLC2A1, TPX2, and ASPM independently predicted poor OS in NSCLC patients. Moreover, for TOP2A, high expression was associated with poor OS in Stage 1, Stage T1, Stage N0, and Stage M0 NSCLC patients. An association between TOP2A and poor OS was further confirmed by assessing protein expression by IHC in an NSCLC TMA. These results indicated that TOP2A expression levels may aid prognosis evaluation in early-stage lung cancer patients.
The role of TOP2A in development/progression of NSCLC is still unclear. Since our GO and KEGG enrichment analyses indicated that the identified DEGs were also involved in immune responses, we assessed molecular markers of tumor-infiltrating immune cells, which critically affect early anti-tumor responses and often sustain tumor growth through immuno-suppressive actions. TOP2A expression was slightly correlated with B cells, CD4+ T cells, CD8+ T cells, and neutrophils in LUAD, and with macrophages and DCs in LUSC. In particular, a more significant negative correlation between TOP2A expression and HLA-complex members, CD1C, NRP1, and ITGAX expression in DCs was detected in LUSC. Since DCs are crucial antigen presenting cells (APCs) that trigger T-cell mediated antitumor immunity [30], impaired function of tumor-infiltrating DCs may seriously affect the body's anti-tumor immune response. Although further proof is clearly needed to establish a causal relationship, these data suggest that TOP2A overexpression may impair DC-mediated anti-tumor immune response in NSCLC.
In summary, through bioinformatics analyses we showed that TOP2A, SLC2A1, TPX2, and ASPM are overexpressed in NSCLC and show significant association with poor OS. As cell-cycle dependent proteins with interrelated functions, TOP2A, TPX2, and ASPM play key roles in the mitotic machinery that drives tumor cell replication in NSCLC and other tumor types. Further analysis confirmed that TOP2A expression was correlated with the prognosis of early-stage lung cancer patients and was negatively correlated with immune cell infiltration in NSCLC, especially of DCs. Thus, our study provided a potential biomarker dataset for NSCLC prognosis and suggested that TOP2A, in particular, may be a valuable survival biomarker to improve prognostic efforts and possibly guide new therapeutic developments for NSCLC.

Enrichment analysis
GO enrichment and KEGG pathway analyses were performed using DAVID v6.8 (https://david.ncifcrf.gov/), an online set of functional annotation tools to infer biological activities for large gene lists [32,33]. P < 0.05 denoted statistical significance.
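The idea behind this kind of over-representation analysis can be sketched with a plain hypergeometric upper-tail test. This is an illustrative assumption rather than DAVID's own procedure: DAVID reports an EASE score, a more conservative variant of Fisher's exact test, so values from the sketch below would not match DAVID output exactly. The function name and all counts in the example are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k): chance of seeing at least k term-annotated genes in a list
    of n genes drawn from a background of N genes, K of which carry the term."""
    max_hits = min(n, K)
    total = comb(N, n)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, max_hits + 1))
    return favorable / total


# Hypothetical counts for illustration: 118 DEGs, a 20,000-gene background,
# a pathway with 80 annotated genes, 5 of which appear among the DEGs.
p = hypergeom_upper_tail(k=5, n=118, K=80, N=20000)
print(f"enrichment p-value: {p:.3g}")
print("significant at P < 0.05:", p < 0.05)
```

Exact integer arithmetic via `math.comb` avoids the rounding problems that arise when multiplying many small probabilities, at the cost of working with large integers.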

GEPIA-based analysis of RNA-sequencing expression data
GEPIA [34] is a newly developed interactive web server for analyzing RNA-Seq expression data of 9,736 tumors and 8,587 normal samples from TCGA and GTEx projects using a standard processing pipeline (http://gepia.cancer-pku.cn/index.html). It was developed by Zefang Tang, Chenwei Li, and Boxi Kang (Zhang Lab, Peking University), and provides customizable functions such as differential expression analysis, profiling according to cancer type or pathological stage, patient survival analysis, similar gene detection, and correlation and dimensionality reduction analyses. We selected NSCLC specimens and normal lung tissues for differential expression analysis, and DEGs for correlation analysis. The Spearman method was used to determine significant correlations.
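For readers unfamiliar with the Spearman method, it amounts to a Pearson correlation computed on ranks. The sketch below uses only the Python standard library; it is not GEPIA's code, the function names are hypothetical, and the expression values are made up for illustration.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical expression values for two genes across six samples.
gene_1 = [5.1, 6.3, 4.8, 7.9, 6.0, 8.4]
gene_2 = [3.2, 4.1, 3.0, 5.5, 4.4, 5.2]
print(f"Spearman rho: {spearman_rho(gene_1, gene_2):.2f}")
```

Because it works on ranks, the coefficient is insensitive to monotone transformations of the data, such as the log scaling commonly applied to expression values.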

Survival analysis
The Kaplan-Meier plotter [36,37] [...] excluded from analysis owing to incomplete patient information and/or sample absence. Therefore, IHC was performed as reported previously [4] on 70 matched specimens. Paraffin sections were dewaxed, followed by antigen retrieval with Tris-EDTA buffer (pH 9). Deparaffinized sections were treated with methanol containing 3% hydrogen peroxide for 15 min, washed with PBS, and incubated with blocking serum for 30 min. Then, sections were incubated with anti-TOP2A (66541-1-Ig, Proteintech, USA) diluted 1:100, at 4 °C overnight. Immunoperoxidase staining was conducted using a streptavidin-peroxidase kit and 3,3′-diaminobenzidine (Zhongshan Jinqiao Co., Beijing, China). Hematoxylin was used to counterstain the nuclei. Intensity and density of TOP2A-positive cells were evaluated and scored as reported before [4].

TIMER analysis
TIMER (https://cistrome.shinyapps.io/timer/) [38,39] is a comprehensive resource for systematic analysis of immune infiltrates across diverse cancer types using RNA-Seq expression profiling data. Six immune cell types (B cells, CD4+ T cells, CD8+ T cells, neutrophils, macrophages, and dendritic cells) were assessed by TIMER on NSCLC sample data, and the correlation between TOP2A expression and immune infiltration was determined. In addition, we assessed the correlations between TOP2A expression and gene markers of tumor-infiltrating immune cells [40].

Statistical analysis
Data were analyzed using GraphPad Prism 5.0. Expression levels of DEGs between NSCLC and matched normal tissues were compared by paired two-tailed t-test. OS was calculated using Kaplan-Meier analysis and log-rank test. P < 0.05 was considered significant.
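The survival statistics named above can be illustrated with a short standard-library sketch of the Kaplan-Meier product-limit estimator and the two-group log-rank test. The analyses in this study were run in GraphPad Prism and the Kaplan-Meier plotter, so the code below is only an explanatory approximation under simple assumptions; the function names and the follow-up data are hypothetical.

```python
from math import erfc, sqrt


def kaplan_meier(times, events):
    """Return [(time, survival)] at each distinct event time.
    events[i] is 1 if the patient died at times[i], 0 if censored."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed  # deaths and censored cases both leave the risk set
    return curve


def logrank_chi2(times_a, events_a, times_b, events_b):
    """Two-group log-rank chi-square statistic (1 degree of freedom)."""
    all_times = list(times_a) + list(times_b)
    all_events = list(events_a) + list(events_b)
    event_times = sorted({t for t, e in zip(all_times, all_events) if e})
    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)
        n_b = sum(1 for x in times_b if x >= t)
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        raise ValueError("log-rank statistic is undefined for these data")
    return obs_minus_exp ** 2 / variance


def chi2_1df_p_value(chi2):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return erfc(sqrt(chi2 / 2))


# Hypothetical follow-up times (months) and event flags for two groups.
high_t, high_e = [6, 9, 12, 15, 20], [1, 1, 1, 0, 1]
low_t, low_e = [14, 22, 30, 36, 40], [1, 0, 1, 0, 0]
chi2 = logrank_chi2(high_t, high_e, low_t, low_e)
print("KM curve, high group:", kaplan_meier(high_t, high_e))
print(f"log-rank chi2 = {chi2:.2f}, P = {chi2_1df_p_value(chi2):.3f}")
```

Censored patients contribute to the number at risk until their last follow-up but never count as events, which is why the estimator only steps down at observed deaths.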

ACKNOWLEDGMENTS
We thank Rui-Rui Yao and Wen Wen for providing experimental guidance.

CONFLICTS OF INTEREST
The authors declare that they have no conflicts of interest to disclose.

FUNDING
This work was supported by the National Natural Science Foundation of China (81872482).