Lymph node metastasis in lung squamous cell carcinoma and identification of metastasis‐related genes based on the Cancer Genome Atlas

Abstract Squamous cell carcinoma (SCC) is a unique clinical and histological category that accounts for about 30% of total lung cancer. To identify risk factors for lymph node metastasis and analyze the molecular features of these metastases in lung SCC, a retrospective study was performed for 170 lung SCC patients who underwent surgical treatment. The overall survival of these patients with or without lymph node metastasis (LM/NLM) was analyzed using the Kaplan‐Meier method. We also used the TCGA database to compare the differentially expressed genes (DEGs) in patients with stage T1‐2 and T3‐4 lung SCC. Data from both our retrospective study and the TCGA database demonstrated a correlation between age and stage T1‐T2 LM (P = .002). There were significant differences between the LM and NLM groups in both mean survival time and median survival time for different T‐stages (P = .031). There were 176 upregulated and 177 downregulated DEGs between the LM and NLM groups in the stage T1‐2 group and 93 upregulated and 34 downregulated DEGs in the stage T3‐T4 group. These differentially expressed genes were predicted to participate in five cellular components, five molecular functions, and five biological processes. There were 20 genes, including GCG, CASR, NPY, CGA, TAC1, ALB, APOA1, CRH, CHRH, TRH, and GHSR, located at the core of the protein‐protein interaction network in the stage T1‐2 group and 11 genes, including F2, CASR, GRM1, GNRHR, GRPR, NTSR1, PROKR2, UTS2D, PTH, ALB, and FGA, in the stage T3‐4 group. Overall, LM plays a key role in the treatment response and prognosis of SCC patients. Several risk factors, including age and stage, were identified for LM. There was a previously undiscovered enrichment of significant novel genes in lung SCC between the LM and NLM groups, which may have the potential for predicting prognosis and targeting.


| INTRODUCTION
Lung cancer remains the deadliest cancer worldwide 1 with a particularly dire prognosis in China. 2 Squamous cell carcinoma (SCC) of the lung is a unique clinical and histological category of non-small cell lung cancer (NSCLC) that accounts for about 30% of all lung cancers. 3 Many of these tumors are in stages IIIA, IIIB, or IV at the time of diagnosis. 4 Studies have shown that the efficacy of chemotherapy and concurrent chemoradiation for locally advanced NSCLC is unsatisfactory. 5,6 Although substantial advances have been made in the treatment of non-squamous NSCLC, effective therapies are still needed for squamous NSCLC. Surgical intervention is the current principal treatment for SCC of the lung, even for locally advanced disease. 7,8 However, surgical intervention for locally advanced squamous cell lung cancer is still highly controversial because many individual factors affect the surgical outcome. The most important factor is the presence of lymphatic and distant metastasis. In particular, surgical intervention provides a favorable prognosis only for local invasion of the tumor without lymphatic and distant metastasis.
Despite recent advances in the early detection and treatment of lung cancer, the prognosis remains poor partly because of the high rate of recurrence and metastasis after surgical resection. At present, special attention is being paid to lymphadenectomy or lymph node sampling in NSCLC according to China's standards for the diagnosis and treatment of primary lung cancer, 9 the US national comprehensive cancer network guidelines, 10 and the European society of thoracic surgeons guidelines. 11 However, the current situation is not satisfactory because the status of lymph node metastasis is not clear and lymph node biopsies are either not performed or performed carelessly. 12 For NSCLC that are at a clinically early stage (proven only after surgery), the question remains: "Does complete lymphadenectomy bring unnecessary risks to the patient?" It has been reported that systemic lymph node dissection prolongs the hospital stay and increases postoperative morbidity. Therefore, more and more studies have focused on identifying the risk factors for lymph node metastasis (LM) in clinical early-stage NSCLC. Identifying these risk factors will help with treatment selection (eg, surgical versus non-surgical approach; systemic lymph node dissection).

| Patients
This study was approved by the Ethical Review Committee of Tianjin Medical University General Hospital. All biological samples were obtained with the patients' written informed consent. All procedures and experimental protocols were approved by the Laboratory Animal Ethics Committee of Tianjin Medical University, and all methods were performed following the relevant guidelines. We retrospectively examined data from patients with squamous cell carcinoma of the lung (LUSCC), who received surgical treatment at the Department of Lung Cancer Surgery, Tianjin Medical University General Hospital between January 2008 and December 2011. A total of 170 patients were enrolled in the study, including 92 patients with stage T1-T2 and 78 patients with stage T3-T4 LUSCC. Patients were grouped according to the presence or absence of lymph node metastasis. In the stage T1-T2 group, there were 57 patients without metastasis and 35 patients with metastasis. In the stage T3-T4 group, 27 patients had no lymph node metastasis and 51 patients with metastases. Patients treated with surgery for histologically confirmed SCC were considered eligible if they had tissue available for analysis and had clinical follow up data available. The demographic data, complete medical history, pathology results, and follow up data were recorded and verified in real-time. Survival data were ascertained through medical record review. The TNM stage was classified according to the American Joint Committee on Cancer (AJCC) 2017 8th edition of the tumor node metastasis (TNM) classification using standard radiological guidelines. The staging was carried out using computerized tomography scans of the upper abdomen and thorax, magnetic resonance imaging of the brain, wholebody bone scintigraphy, fiberoptic bronchoscopy, and tissue histology.
All patients underwent either lobectomy or pneumonectomy as the primary surgical intervention with systematic lymph node dissection or selective lymph node dissection. Some patients required reconstruction of part of the left atrium or a large blood vessel at the time of the initial surgery.
The TCGA database contained clinical and genetic data  from 494 patients with LUSCC, including 403 patients with  stage T1-T2 and 91 patients with stage T3-T4 disease. The  stage T1-T2 patients included 270 cases without and 133 cases with lymph node metastasis. The stage T3-T4 patients included 51 cases without and 40 cases with lymph node metastasis.

| Study variables
Patient information, including survival time, demographic information (age and gender), the location of lung cancer, examination of regional nodes, lymph node metastasis, survival information, and living conditions, were obtained from the inpatient and The Cancer Genome Atlas (TCGA) databases. These characteristics were classified by categorical variables (eg, age, gender, the location of lung cancer, smoking history, lymph node metastasis, and degree of differentiation) using univariate and multivariate analysis. Age was classified into two groups (ie, ≤65 and >65 years old). Smoking history was classified into never-smokers and smokers.

| Statistical analysis
Statistical analysis was performed using SPSS version 20.0. The Kaplan-Meier method was used to assess overall survival (OS) of patients with and without lymph node metastasis (LM and NLM, respectively). The two groups were compared using the log-rank test. The Pearson's Chi-square test was used to analyze the relationships between lymph node involvement and clinicopathologic variables. A twosided P-value of .05 was considered statistically significant.

| TCGA database analysis
We obtained clinical characteristics and genetic data for LUSCC patients from the TCGA database and excluded patients with undefined T or N staging. The data were then divided into two groups, T1-T2 with lymph node metastasis group and T3-T4 without lymph node metastasis and subjected to enrichment analysis for differentially expressed genes.

| Preprocessing of RNA-Seq data
TCGA-Assembler software was used to analyze the LUSCC RNA-Seq data from the TCGA database. Data for a total of 494 LUSCC patients were analyzed, including 270 NLM and 133 LN patients in the stage T1-T2 group, and 51 NLM and 40 LN patients in the stage T3-T4 group.

| DEG screening
The edgeR package (Bioconductor) was used to identify differentially expressed genes (DEGs) between the LUSCC LM and LUSCC NLM, in T1-2 group and T3-4 group, respectively. 13 The multi-test package was used to determine the false discovery rate (FDR) and adjusted P-values. The criteria for DEG screening were FDR < 0.05 and |log2-fold change (FC)| > 1; FC = gene expression value for LUSCC with LM/gene expression value for LUSCC with NLM.

| Construction of co-expression network
The EBcoexpress package (Bioconductor) was used to obtain correlations between DEGs. A co-expression network was constructed using DEG-DEG pairs with correlation coefficients (|r|)> 0.6.

| Functional and pathway enrichment analyses of DEGs
Gene annotation information was downloaded from the Gene Ontology (GO) Consortium (http://geneo ntolo gy.org/). GO identifiers were further divided into cellular components (CC), biological processes (BP), and molecular function (MF) categories. We used the GO identifier to annotate the DEGs. Functional and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment of the DEGs in co-expression networks were performed using DAVID 14 and KOBAS 2.0 15 based on a hypergeometric algorithm. The threshold for these analyses was set at P < .05. The result was visualized using FUNRICH (http://www.funri ch.org/).

| Protein-protein interaction network (PPI network)
To further analyze the protein-protein interaction network constructed by differential genes. We entered the differential genes into the multiple proteins list on the http://stringdb.org/cgi/input.pl. We set organism to Homo sapiens, hid disconnected nodes in the network, and set the minimum required interaction score to highest confidence (0.900).

| Small-molecule drug analysis of DEGs
Based on the Connectivity Map database (https ://porta ls.broad insti tute.org/cmap/), 16 small-molecule drug analysis was performed to determine the functional relationship between bioactive small-molecule drugs and the DEGs. A correlation coefficient (|score|) > 0.3 was used as the evaluation criterion of this experiment.

| Patient characteristics
Our retrospective study included a total of 170 patients. The clinical and pathological characteristics are presented in Table  1. The median patient age was 62 years (range, 39-77 years). Most patients were male (85.3%). At the time of diagnosis, the majority of patients (40.6%) had an Eastern Cooperative Oncology Group Performance Status (ECOG-PS) score of 3; no patients had an ECOG-PS score of 4.
All 170 patients had primary LUSCC, including 92 stage T1-T2 patients and 78 stage T3-T4 patients. There were 35 patients in the stage T1-T2 group and 51 in the stage T3-T4 group who had lymph node metastasis. Presence of lymph node metastasis were positively correlated with TNM staging regardless of whether they were stage T1-T2 or T3-T4 (P-value = .002 and .013, respectively). Age was also correlated with T1-T2 stage lymph metastasis (P = .002) (

| Lymph node metastasis can seriously affect lung cancer patient prognosis
The T1-T2 stage cases were divided into groups with and without lymph node metastasis. The T3-T4 stage patients were similarly divided. We used Kaplan-Meier survival curves to analyze the OS between these sets of groups as shown in Figure 1. There were significant differences between the lymph node metastasis and no lymph node metastasis groups in both the mean and median survival times (P = .031). The mean and median survival times of the lymph node metastasis group were significantly shorter than those of the no lymph node group. The results for the univariate and multivariate Cox regression survival analyses for the OS in SCC are shown in Table 4. In the univariate analysis, lymph node metastasis and tumor location predicted a worse OS. However, there were no significant differences in the OS based on gender, age, or smoking history. In the multivariate analysis, the data were adjusted for gender, age, smoking history, lymph node metastasis, and tumor location. The results showed that lymph node metastasis was an independent predictor of OS (Table 5).

| DEGs between LUSCC with and without lymph node metastasis
We used RNA-Seq data from the TCGA database (total of 19 754 genes) to further clarify the DEGs in stage T1-T2 and T3-T4 patients with lymph node metastasis compared to those without lymph node metastasis. For the stage T1-T2 patients, a total of 353 significant DEGs (FDR < 0.05 and |log2 FC| > 1) were identified between the LM and NLM groups (176 upregulated, 177 downregulated). For the stage T3-T4 patients, we found 127 significant DEGs (FDR < 0.05 and |log2 FC| > 1) between the LM and NLM groups (93 upregulated, 34 downregulated). A list of the top ten upregulated and downregulated genes for the stage T1-T2 and T3-T4 patients are presented in Complementary Tables 1 and 2, respectively.
The differential genes were mapped into a volcano plot ( Figure 2). Among the DEGs, there were 28 genes T A B L E 1 (Continued) in the intersection (11 upregulated, four downregulated). Thirteen genes showed different trends between the lymph node metastasis and no lymph node metastasis groups ( Figure 3).

| Functional annotation of DEGs
After GO functional annotation, 353 T1-T2 stage DEGs were predicted to participate in five categories each of CC, MF, and BP ( Figure 4A,C,E). The CC categories were mainly associated with the extracellular, plasma membrane, cytoplasm, exosomes, and nucleus. The MF categories mainly consisted of transporter activity, G-protein coupled receptor activity, transcription factor activity, extracellular matrix structural constituents, and catalytic activity. The BP categories were mainly associated with signal transduction, cell communication, transport, cell growth or maintenance, and cell adhesion. The cell adhesion categories only included upregulated DEGs. For the T3-T4 stage DEGs, 127 were also predicted to participate in five categories each of CC, MF, and BP ( Figure 4B,D,F). The CC categories were mainly associated with the extracellular, plasma membrane, cytoplasm, nucleus, and integral to plasma membrane. The MF categories included transporter activity, transcription factor activity, structural molecule activity, structural constituent of cytoskeleton, and growth factor activity. The BP categories consisted of signal transduction, cell communication, transport, cell growth or maintenance, and energy pathways. The categories for growth factor activity and transporter activity only included upregulated DEGs.

| Enrichment analysis of DEGs in coexpression network
After DEGs GO functional analyses, we further analyzed the DEGs perspective in the co-expression network and did the functional enrichment. A total of 84 genes were found in T1-T2 DEGs, which constituted a gene co-expression network. T1-T2 stage DEGs in the co-expression network were significantly enriched for 12 functions, including translation, immune response, transport, cell adhesion, cell communication, signal transduction, energy pathways, metabolism, protein metabolism, biological process unknown, regulation of nucleobase, nucleoside, nucleotide and nucleic acid metabolism, cell growth, and maintenance (P < .05). A total of five DEGs genes in T3-T4 DEGs were involved in co-expression network, which were significantly enriched in six functions, including signal transduction, energy pathways, metabolism, regulation of nucleobase, nucleotide and nucleic acid metabolism, cell communication, and unknown biological processes (P < .05) ( Figure 5).

| KEGG enrichment analysis
To further study the signaling pathways for the DEGs, we performed KEGG enrichment analysis using the KEGG database. The results identified a total of 21 signal pathways associated with the T1-T2 DEGs with maturity-onset diabetes of the young ranked first (adjusted P = 1.63E-06) (Complementary Table 3). There were seven signaling pathways associated with the T3-T4 DEGs and the neuroactive ligand-receptor interaction ranked first (Complementary Table 4).

| PPI analysis
We conducted further interactive network analysis of the DEGs using STRING version 5.5. The established

C O M P L E M E N T A R Y T A B L E 1
The top ten different upregulated and downregulated genes between T1-T2 stage metastatic and non-metastatic patients in TCGA-SCC protein-protein interaction networks for the T1-T2 and T3-4 DEGs are presented in Figure 6. We found that there were a total of 20 genes located at the core of the protein-protein interaction network for the T1-

| Small-molecule drug analysis of DEGs
The Connectivity Map (Cmap) is a collection of genomewide transcriptional expression data from cultured human cells treated with bioactive small molecules and simple pattern-matching algorithms, which together enable the discovery of functional connections between drugs, genes, and diseases through the transitory feature of common gene-expression changes. We used Cmap to identify

C O M P L E M E N T A R Y T A B L E 2
The top ten different upregulated and downregulated genes between T3-T4 stage patients with metastatic and non-metastatic patients in TCGA-SCC small-molecule drugs that had a positive or negative correlation with the DEGs from in this study. The range of the correlation scores was 1 to −1. A score <0 indicated that a drug could inhibit tumor lymph node metastasis whereas a score >0 revealed that a drug could promote tumor lymph node metastasis. We found that for T1-T2 stage LUSCC, only one small-molecule drug (cefotiam) could counteract the genetic changes that were associated with lymph node metastases (Complementary Table 5). For T3-4 stage LUSCC, 13 drugs were negatively or positively correlated with lymph node metastases. Among them, seven drugs might inhibit lymph node metastases (raloxifene, iproniazid, exisulind, arachidonyltrifluoromethane, 16-phenyltetranorprostaglandin E2, econazole, and fluoxetine). In contrast, five drugs might promote lymph node metastases (AH-6809, ticarcillin, 5255229, mesoridazine, and stachydrine) (Complementary Table 10). These data suggest that these drugs should be used with caution in LUSCC patients.

| DISCUSSION
Locally advanced lung cancers invading the heart, great vessels, trachea, or vertebrae have historically been classified as unresectable. These tumors were usually treated with palliative chemotherapy or radiation, alone or in combination. However, advances in surgical techniques that go far beyond standard surgery have been challenging this dogma for the last three decades. 17 Currently, patients with local invasion but without lymphatic and distant metastasis, who undergo surgical resection within a multimodality treatment regimen, have the best chance for cure. However, does early T stage mean early N stage? There are still many early T stage patients with lymphatic metastasis, who not only need radical resection and systemic lymphadenectomy, but also chemotherapy, radiation, and targeted therapy. Therefore, it is essential to understand the risk factors associated with lymphatic spread. What are the genetic differences in patients with and without early-stage T lymph node metastasis? Why In the present study, age was significantly correlated with stage T1-T2 lymph metastasis, which was consistent with the TCGA analysis. In the early disease stages, younger patients (<65) were more likely to have lymphatic metastasis. These findings were similar to those for melanoma, in which older melanoma patients had lower rates of sentinel lymph node metastases. 18 In thyroid papillary microcarcinomas, largevolume LM was more frequently found in young (<40 years) and male patients. 19 These findings support the notion that surgery and systemic lymphadenectomy rather than selective lymph node dissection may be favored in young clinically LN-negative LUSCC patients as a primary therapeutic option.
Due to the decreasing cost of current sequencing technology, the ability to explore the genetic characteristics of tumors is becoming widely available. An important question is how to use this abundance of genetic information judiciously with a focus on genes that have been previously associated with a specific cancer type. In this study, we retrospectively examined data from both our cohort of patients and the TCGA database, and found that in addition to age, the expression of specific genes in primary tumor tissue correlated with lymph node metastasis in early T stage (T1 and T2) LUSCC and local advanced T stage (T3 and T4) LUSC without lymphatic metastasis. A total of 353 significant DEGs were identified F I G U R E 5 GO annotations for the DEGs. A and B, DEGs in cellular component categories. C and D, DEGs in molecular function categories. E and F, DEGs in biological process categories. DEGs: differentially expressed genes. Left vertical axis: percentage of DEGs involved in a specific term out of all DEGs between the lymph node metastasis and no lymph node metastasis groups for the T1-T2 stage patients, including 176 upregulated and 177 downregulated DEGs. In addition, there were 93 upregulated and 177 downregulated DEGs identified for the T3-T4 group. There were 28 genes at the intersection. We believe that these 28 genes may be the driver genes for SCC lymph node metastasis. In particular, genes with similar tendencies in both groups may promote LUSC lymph node metastasis, and genes with different tendencies in both groups may be associated with local invasion without metastasis. Among these genes, the human albumin gene (ALB) and the extracellular calcium-sensing receptor (CASR) genes are located at the core of the protein-protein interaction network. ALB (GenBank mRNA RefSeq: NM_000477.6) is considered relevant to congenital analbuminaemia, a very rare autosomal recessive disorder with an estimated prevalence of less than 1 in 1 million). 20 There have been no reports that have associated ALB with cancer lymph node metastasis. CaSR belongs to class C of the GPCR, which signals in response to Ca 2+ and other ligands, such as gadolinium, polypeptides, and certain antibiotics. 21 CaSR can promote the development of bone metastasis in both renal cell carcinoma 22 and breast cancer. 23 Based on the KEGG Enrichment analysis, a total of 21 signaling pathways are involved in the T1-T2 DEGs, of which the maturity onset diabetes of the young may be of most interest. Its ID is hsa04950, and its related genes include NEUROG3, HNF4A, SLC2A2, NEUROD1, PAX4, HNF1A, and RFX6. The discovery of this pathway is consistent with recent studies that have demonstrated that altered glucose metabolism is closely related to the occurrence and development of lung cancer, [24][25][26] and may suggest that it might also be related to lymph node metastasis. Furthermore, combined treatment with metformin and gefitinib overcomes the primary resistance to EGFR-TKIs via targeting the IGF-1R signaling pathway, 27 which is also related to glucose metabolism. However, this conclusion does not apply to advanced lung cancer. There were also seven signaling pathways involved with the T3-T4 DEGs with the neuroactive ligand-receptor interaction (ID: hsa04080) ranked first. They are potentially targeted by drugs, such as hydroxyzine. 28 The epithelial-mesenchymal transition (EMT) is an important process for tumor invasion and metastasis mediated by complex regulatory mechanisms. EMT is a key event in the transformation of tumor cells from an epithelial-like to interstitial phenotype, which can promote tumor invasion, metastasis, and drug resistance. The hallmark of EMT is the alteration of epithelioid cell markers in tumor cells, such as down-regulation of E-cadherin expression and concomitant upregulation of mesenchymal cell markers, such as N-cadherin, vimentin, and snail transcription factor family members. 29,30 Attenuation of E-cadherin reduces intercellular adhesion and increases cell motility, allowing cells to invade surrounding tissues. 31,32 Multiple signaling pathways are involved in the EMT process in tumor cells, such as Wnt, transforming growth factor-β (TGF-β), and Notch signaling pathways. 33 However, we do not know the relationship among metastasis and DGEs, EMT respectively. Therefore, exploring the molecular mechanism of EMT and identifying genes that regulate this process will have important implications. 34 The study is the first time we try to explore the ideal gene and its up-or down-regulation can further affect the early or late metastasis of tumors. So we did not have a specific selection of EMT signaling pathways. Actually, we plan to identify all DEGs may cause premature tumor metastasis first and explore their function in the EMT signaling pathway, tumor microenvironment and epigenetic changes in further research. Further analysis will focus on whether specific genes are directly related to the EMT-related pathways, and whether other pathways could directly affect invasion and metastasis by LUSCC.

C O M P L E M E N T A R Y T A B L E 3 KEGG enrichment analysis of T1-T2 DEGs
In summary, our TCGA database analysis showed that there were some significant DEGs (eg ALB and CaSR) associated with early T stage lymphatic metastasis and local invasion without lymphatic metastasis. Furthermore, signaling pathways, such as hsa04080 (neuroactive ligand-receptor interaction) and hsa04950 (maturity onset diabetes of the young), mediate the effects of these DEGs. In future research, we will collect long-term follow-up and additional genetic data to validate the functions of these genes. Further analysis of the DEGs identified 84 co-expressed genes among the T1-T2 and T3-T4 DEGs that likely play a significant role in gene regulation. Small-molecule drug analysis, initially suggested that some drugs probably target the DEGs in different stage squamous cell lung cancer, but unfortunately, these drugs are not significantly associated with these small-molecules (0.8 > |score| > 0.3). Thus, there are few targeted drugs currently available to squamous cell lung cancer patients.

| CONCLUSION
Lymph node metastasis plays a key role in the treatment response and prognosis of LUSCC patients. In the early T stages, younger patients (<65) have a stronger tendency for lymphatic metastasis. Surgery and systemic lymphadenectomy rather than selective lymph node dissection may be favored for young clinically LN-negative LUSCC patients as a primary therapeutic option. This unprecedented systems biology analysis of squamous cell lung cancer with or without lymph node metastasis showed statistically significant enrichment of LUSCC genes and identified genetic features of lymphatic metastasis and local invasion by LUSCC. Long-term follow-up and additional genetic data are needed to validate gene function in LUSCC. The identified genes may have the potential to predict prognosis and serve as therapeutic targets.

F I G U R E 6
Protein-protein interaction network of DEGs. A, PPI network of T1-T2 DEGs. B, The number of genes associated with other T1-T2 DEGs. C, PPI network of T3-T4 DEGs. D, The number of genes associated with other T1-T2 DEGs. Circles represent genes, lines represent interactions between genes, and the results inside circles represent the structure of the proteins. The color of the thread represents different types of evidence for an interaction between the proteins (red, fusion evidence; green, neighborhood evidence; blue, co-occurrence evidence; purple, experimental evidence; yellow, text mining evidence; light blue, database evidence; black, co-expression evidence) The mean score changes from 1 to −1. A score <0 means the change in the direction of the gene expression values caused by the drug are opposite to the change in the direction of the gene expression values in lymph node metastasis. Low scores represent high anti-LM. In contrast, a score >0 means the change in the direction of the gene expression values caused by the drug are the same as the change in the direction of the gene expression values in lymph node metastasis (eg, the drug might promote LM. In addition, high scores represent high lymph node metastasis-promoting effects. DEGs: differentially expressed genes.