Identification of Key Genes Related to Lung Cancer with Brain Metastasis through Weighted Gene Correlation Network Analysis


 Background: Brain metastases (BM) are the most common malignant tumors affecting the human central nervous system, more than 50% of BM are lung cancer. However, there is currently no universal and effective screening tool for lung cancer with BM. The study aimed to identify key genes of lung cancer with BM, achieving early and accurate diagnosis of lung cancer with BM and long-term monitoring of the therapeutic response.Methods: RNA-seq data and related clinical information were downloaded from the Gene Expression Omnibus (GEO) database. The weighted gene co-expression network (WGCNA) analysis was performed using WGCNA package and MEtuequois module was considered as the highest correlation. The Limma package of R was used to obtain differentially expressed genes (DEGs) between lung cancer with BM and healthy samples. The Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichments of DEGs were analysed. LASSO and SVM were used to derive key genes for lung cancer with BM, and survival analysis was used to verify the correlation between the outcome and the key genes that we found. Key genes are annotated by GSEA, and CIBERSORT was used to calculate immune infiltration from samples.Results: 5 co-expression modules were detected, and ME turquois was most significantly associated with BM in lung cancer. These genes were mainly enriched in biological processes related to cell adhesion and immunity, and the main signaling pathways are significantly related to disease and cell adhesion. 4 BM-related key genes (AQP1, GSTA1, ROS1, TGFB1I1) were identified by LASSO regression analysis and SVM-RFE screening, they were demonstrated as the potential markers for lung cancer with BM. Immune cell infiltration revealed that they were negatively correlated with T cells regulatory, T cells CD8 and Mast cells activated. Overall survival correlation analysis confirmed that the expression of AQP1, GSTA1 and TGFB1I1 was negatively correlated with the prognosis of lung cancer patients, ROS1 was positively correlated with the prognosis of patients.Conclusions: AQP1, GSTA1, TGFB1I1 and ROS1 were identified as key genes of lung cancer with BM patients, which may regulate cancer progression by regulating immunity and have potential prognostic effects.

related key genes (AQP1, GSTA1, ROS1, TGFB1I1) were identi ed by LASSO regression analysis and SVM-RFE screening, they were demonstrated as the potential markers for lung cancer with BM. Immune cell in ltration revealed that they were negatively correlated with T cells regulatory, T cells CD8 and Mast cells activated. Overall survival correlation analysis con rmed that the expression of AQP1, GSTA1 and TGFB1I1 was negatively correlated with the prognosis of lung cancer patients, ROS1 was positively correlated with the prognosis of patients.
Conclusions: AQP1, GSTA1, TGFB1I1 and ROS1 were identi ed as key genes of lung cancer with BM patients, which may regulate cancer progression by regulating immunity and have potential prognostic effects.

Background
Brain metastases (BM) are the most common malignant tumors that affect the human central nervous system and are also a terrible complication of systemic cancers throughout the body. It accounts for about 15%-30% of brain tumors, and its incidence is 10 times that of primary neuromas in the Brain [1]. It has been reported that over 50% of BMS are lung cancer BM, followed by melanoma, urogenital tumor and tumor of the digestive tract [2]. Moreover, patients with lung cancer BM have low quality of life, short survival period and poor prognosis [3]. Currently, the main treatment methods for BM include whole brain radiotherapy (WBRT), stereotactic radiosurgery (SRS), surgery and chemotherapy [4]. Sarah et al. found that the median survival of patients who received WBRT after surgery was 40 weeks, which was improved compared with the radiotherapy group (15 weeks) [5]. Minniti et al. studied 66 patients with non-small cell lung cancer (NSCLC) BM treated with WBRT combined with SRS. The results indicated a median overall survival (OS) in the combined group was 10.3 months and in the WBRT group alone was 7.2 months (P=0.005) [6]. Currently, these therapeutic strategies are not ideal for the prognosis and survival of patients with lung cancer BM.
Early diagnosis and screening may be an important measure to improve the clinical treatment of lung cancer BM. However, there is currently no universal and effective screening tool for BM in lung cancer. Some studies have reported some biomarkers related to BMS, but the clinical effectiveness of these markers still needs to be further veri ed. For example, CEA, CA153, and cytokeratin 19 fragment had signi cantly better progression-free survival than patients with abnormally elevated lung cancer BM [7][8][9]. Matthew et al. found that p-S6 is overexpressed in metastatic tumors. In primary tumors, higher p-S6 expression is associated with shorter metastatic-free survival [10]. These results suggest that the level of tumor markers can be used as a factor in evaluating prognosis. In this study, the key genes related to lung cancer brain metastasis were screened out by WGCNA analysis based on GEO database, and the relationship between BMS related genes and tumor immunity was focused.

Data sources and processing
Three sets of data, GSE14108, GSE31547 and GSE126548, were downloaded from GEO, among which GSE14108 included 9 patients with lung cancer BM, GSE31547 included 30 patients with primary lung cancer, and GSE126548 included 6 patients with non-small cell lung cancer, of whom 3 had BM and 3 had no BM. TCGA database is used to download 504 lung squamous carcinoma samples.

Weighted gene correlation network analysis
In this study, hclustfunction was applied to cluster analysis of samples in the R WGCNA database, and samples with outliers removed were clustered according to gene expression levels to reveal the correlation between samples. The soft threshold power was determined according to the law of scale-free network, the gene cluster tree was constructed based on the correlation of gene expression levels, and the co-expression module was determined by dynamic pruning method. Modules with similar expression patterns are merged according to the similarity (0.4) of the module eigenvalues. The expression pattern of module genes in each sample was represented by the module eigenvalue, and the heat map of the expression pattern was drawn by the sample eigenvalue. Heat maps were used to identify modules that were signi cantly associated with the clinical features of BM.
Functional enrichment analysis R's "SVA" plug-in was used for batch processing of GSE14108 data and GSE31547 data, and the Limma package was used to analyze differences in gene expression levels between primary lung cancer samples and BM groups (log2FC≥1, P ≤0.05). Ggplot2 is used for heat map visualization. GO functional analysis and KEGG pathway analysis were performed on the module key genes using the R-package cluster Pro ler software.

LASSO regression analysis and SVM -RFE screening
Key genes related to BM in lung cancer were further screened, R glmnet was used in the LASSO regression analysis of candidate feature genes. The radial basis function kernel SVM was independently adjusted for the 10-fold sampled data set, including the use of grid search for the SVM hyper parameters. Then, the optimal parameters are used to train the SVM over the entire data set, doing this for each fold in the external 10-fold CV, and averaging all of these 10 generalization error estimates to provide greater stability. Repeating this process, while changing the number of primary functions used as input, there is usually a "sweet spot" that minimizes the generalization error. According to this "best point", we can screen out our marker molecules to obtain key genes with diagnostic value.

Gene Set Enrichment Analysis
Gene Set Enrichment Analysis (GSEA) is used to evaluate the distribution trend of a pre-de ned gene in the gene table ranked by the phenotype correlation, so as to judge its contribution to the phenotype. Lung cancer samples were divided into high and low expression groups according to the expression of each key gene, and the differentially expressed genes between high and low expression groups were enriched by GSEA pathway analysis. CIBERSORT was used to calculate 22 immune cell in ltrations in patients with BM / primary lung cancer, and Pearson correlation was used to analyze the association of key genes with immune cells.

Prognostic signi cance of key genes
The R Project for Statistical Computing's "Survival" package was used to analyze the correlation between the expression of key genes associated with BM and OS in lung cancer patients.

Results
WGCNA analysis results of modules and genes related to lung cancer brain metastasis The Cluster tree displaying the relationship among the biological replicates of REPL samples was as shown in Fig. 1a, a total of 9 samples were clustered in BM sample, they are GSM354029, GSM354030, GSM354025, GSM354027, GSM354032, GSM354028, GSM354033, GSM354026 and GSM354031. The soft-threshold power is a key value used to power the correlation of the genes and affects the mean connectivity and the scale independence of co-expression modules. When the soft-threshold power was β = 3, the scale independence was higher than 0.8 and the mean connectivity was higher (Fig. 1b). ME Diss Thres was set to 0.4 and combined with the dynamic clipping tree algorithm to split the module. After the cluster tree was generated, eight different gene co-expression modules were identi ed in the REPL, represented by different colors (Fig. 1c). According to the correlation between modules and sample traits ( Fig. 1d), 5 modules were correlated with lung cancer BM (P 0.05), including ME Blue (Core = 0.59, P = 8E-5), ME Black (Core = -0.52, P = 7e-4), ME Brown (Core = 0.8, P = 7e-10), and ME Turquois (Core = 0.89, P = 4E-14) and ME Purple (Core = -0.4, P = 0.01) (Fig. 1e). Among them, ME turquois has the highest correlation with sample traits, showing a positive correlation, with a total of 5724 genes, of which 900 genes satisfy GS> 0.2 and MM> 0.8.

Screening, annotation and expression validation of brainmetastasis-related characteristic genes in lung cancer
After batch processing of sequencing data of GSE14108 and GSE31547, a total of 753 differentially expressed genes were obtained in BM and primary lung cancer samples. Among them, 238 genes were up-regulated in BM samples and 555 genes were down-regulated in BM tissues (Fig. 2a). Furthermore, the intersection of the differentially expressed genes of lung cancer brain metastasis samples and the module genes related to lung cancer brain metastasis resulted in 272 genes, which were used for subsequent analysis (Fig. 2b).As can be seen from the GO enrichment bubble diagram, cell adhesion and immune-related biological processes are the main pathways for the enrichment of key genes in the module, such as positive regulation of cell−cell adhesion, Humoral immune response and positive regulation of cell adhesion (Fig. 2c). The molecular function is mainly related to enzyme activity, including RNA polymerase II activating transcription factor binding, serine-type endopeptidase activity and metallopeptidase activity. As can be seen from the enrichment bar chart of KEGG pathway, the major participating signaling pathways are signi cantly related to diseases and cell adhesion, such as Pertussis, Rheumatoid arthritis, Malaria, Bladder cancer, Legionellosis, In uenza A, TNF signaling pathway, Focal adhesion, Complement and coagulation cascades and Age-RAGE signaling pathway in diabetic complications (Fig. 2d).The results of GSE126548 differentially expressed genes analysis showed that there were 773 differentially expressed genes between BM / primary lung cancer (Fig. 2e). Jvenn was used to intersect the differentially expressed genes of GSE126548 and the key genes of the module. A total of 14 intersection genes were obtained, indicating that these 14 genes may play an important role in lung cancer brain metastasis (Fig. 2f). The heat map results of further cluster analysis showed that the expressions of the 14 signature genes were signi cantly different in the transferred and primary samples (Fig. 2g).
LASSO regression analysis and SVM screening of characteristic genes in lung cancer brain metastases In this study, LASSO regression analysis was used to screen for key genes associated with BM in lung cancer. When lambda. Min was minimal, the best diagnostic model was obtained, which contained 4 characteristic genes (AQP1, GSTA1, ROS1, and TGFB1I1) (Fig. 3a). The receiver operating characteristic (ROC) curve evaluates the accuracy and sensitivity of the diagnostic model, When the area under the ROC curve (AUC) value of the diagnostic model was 1, it indicated that the diagnostic model had high diagnostic value (Fig. 3b). SVM-RFE analysis found that when the number of genes changed from 1 to 14, the error rate of the optimal point for predicting BM lung cancer samples and primary lung cancer samples was 0.00865, and the accuracy rate was 0.991, which corresponding to 9 characteristic genes (ROS1, TGFB1I1, AEBP1, GSTA1, FMO3, AQP1, COMP, OLFM1, MMP1) (Fig. 3c). The intersection of characteristic genes screened by LASSO and SVM was considered as the key genes. The results showed that AQP1, GSTA1, ROS1 and TGFB1I1 may be the diagnostic biomarkers of BMS and primary lung cancer, and are the key genes to study the difference between primary lung cancer and BM (Fig. 3d).
Functional prediction of key genes in lung cancer brain metastases According to the GSEA function prediction results of AQP1, GO biological processes such as immune response regulation and cell activator chromosome separation were signi cantly overexpressed in the high-expression group of AQP1. In addition, Cell adhesion, immunity and cytokine related pathways were signi cantly enriched in the group with high AQP1 expression (Fig. 4a). The GSEA function prediction results of GSTA1 showed that the GO biological processes such as cell division and chromosome separation were signi cantly low expressed in the high expression group of GSTA1. In addition, Cell cycle, disease-related pathways signi cantly enriched in high expression group (Fig. 4b). The GSEA function prediction results of ROS1 showed that the immune response was signi cantly high in the ROS1 high expression group. The prediction results of GSEA function of ROS1 showed that the immune response was signi cantly overexpressed in the group with high ROS1 expression, Cell division, chromosome separation and other GO biological processes were signi cantly low expressed in the high expression group of ROS1. In addition, cell cycle, disease infection, phagocytosis, immune-related pathways were signi cantly enriched in ROS1 high expression group (Fig. 4c). The GSEA function prediction results of TGFB1I1 showed that the biological processes such as immune regulation, lymphocyte activation and lymphocyte-mediated immunity were signi cantly enriched in the TGFB1I1 high expression group, while the GO biological processes such as cell division and chromosome separation were signi cantly decreased in the TGFB1I1 high expression group. In addition, pathways related to disease infection and phagocytes were signi cantly enriched in the TGFB1I1 high expression group (Fig. 4d).
Correlation between key genes of lung cancer brain metastases and immune invasion The CIBERSORT algorithm and LM22 gene marker were used to analyze the proportion of 22 kinds of immune cells in the tumor samples. According to the calculation results of CIBERSORT, we drawn a heat map of the content of immune cells in each sample with P <0.05 (Fig. 5a). At the same time, Pearson correlation is used to analyze the relationship between immune cells. Fig. 5b shows the speci c immune correlation coe cient. For example, the correlation coe cient between Plasma cells and B cells memory is -0.57 (Fig. 5c). Then the immune cell differences based on t-test were shown between BM / primary lung cancer tissues (Fig. 5d). Among them, immune cells B cells naive, Plasma cells, T cells CD8, T cells regulatory, Macrophages M1, Mast cells resting and Mast cells activated have signi cant differences. We further analyzed the correlation of 4 key genes and 7 immune cells with signi cant differences between the BM / primary lung cancer groups, and the results showed that the key genes were negatively correlated with T cells regulatory, T cells CD8 and Mast cells activated, and positively correlated with other cells (Fig. 5e).
Prognostic signi cance of key genes in brain metastases from lung cancer Analysis of the correlation between the expression of key genes and the overall survival of lung cancer patients shows that AQP1, GSTA1 and TGFB1I1 were signi cantly correlated with the prognosis of patients with lung cancer, and the higher the gene expression, the worse the prognosis of patients. ROS1 is signi cantly correlated with the prognosis of lung cancer patients, and the higher the gene expression, the better the prognosis of patients (Fig. 6).

Discussion
WGCNA was applied because of its advantages in uncovering co-expression modules and their correlation with sample traits, and its higher reliability and biological signi cance compared with other bioinformatics approaches. Based on the WGCNA, LI et al. revealed novel regulatory modules associated with recurrent early pregnancy loss and Bao et al. identi ed the hub genes related to hypoxic adaptation in yak (Bos grunniens) [11,12]. In this study, we identi ed 4 key genes associated with BMS in lung cancer based on WGCNA analysis, namely AQP1, GSTA1, TGFB1I1, and ROS1.
In this study, Aquaporin 1 (AQP1) was signi cantly associated with prognosis in patients with lung cancer. Kong et al. found that MicroRNA-133a-3p inhibits cell proliferation, migration and invasion in colorectal cancer by targeting AQP1 [13]. Angelico et al. found that Aquaporin-1 (AQP1) is highly expressed in asbestos-related malignant pleural mesothelioma (MPM) [14]. These studies all suggest that AQP1 is abnormally expressed in brain tumors and may play an important role in the invasion of tumors. In the functional prediction analysis, biological processes such as cell division and chromosome separation were signi cantly low in the GSTA1 high expression group, while cell cycle and disease infection related pathways were signi cantly enriched in the GSTA1 high expression group. Glutathione Stransferase (GSTs), as the main phase II metabolic enzyme isozymes in cells, can catalyze the binding of glutathione to many electrophilic compounds, carcinogens, and some lipophilic anti-cancer drugs, protect cells from oxidative damage, and have anti-mutagenic and anti-mutagenic properties. It is believed that GSTs can slow cell proliferation by inhibiting signaling pathways. For example, GSTs inhibits JNK activity and has an important effect on JNK-mediated cell biological responses (proliferation, differentiation, apoptosis) [15,16]. Aydin compared the percentage of positive staining between normal tissue and tumor tissue in patients with normal laryngeal cancer and patients with laryngeal cancer, and found that the expression of GSTA in normal cells was signi cantly higher than that in tumor cells (P 0.05) [17]. These studies suggest that GSTA1 has potential in tumor prognosis, which is consistent with our ndings.
The GO enrichment results showed that cell adhesion and immune-related biological processes were the main biological processes for the enrichment of module key genes, and TGFB1I1 was signi cantly correlated with the prognosis of lung cancer patients. Previous studies have shown that transforming growth factor beta 1 induced transcript 1(TGFB1I1) can coordinate Rho GTPase activity to regulate adhesion dynamics and actin cytoskeleton remodeling during cell migration, thereby promoting tumor invasion and metastasis [18][19][20][21]. Nowadays, transcription factors such as Snail1, Slug, Zeb, Twist, βcatenin and Fox can affect the EMT process through the regulation of TGF-β, and TGFβ has been con rmed as the main inducer of tumor EMT. TGF-β can also promote the death of CD8 T cells by inhibiting the expression of B cell lymphoma2 (Bcl-2) [22]. At present, TGF-β and programmedcelldeath1 (pd-1) / programmedcelldeath1 ligand 1(Pd-l1) has the potential to improve T cell in ltration and alleviate the inhibitory function of PD-1 on T cells, and this therapeutic strategy is currently being validated in a variety of solid tumors [23]. In addition, we also found that reactive oxygen species (ROS) was signi cantly correlated with the prognosis of lung cancer patients. Under normal circumstances, ROS1 is activated after receiving speci c information from the outside and transmits it to the inside of the cell to guide the cell to grow and proliferate in an orderly manner. If the ROS1 gene is mutated, it will cause cancer cells to drive up, which is the culprit of tumor occurrence. Glioblastoma [24], in ammatory myo broblastoma [25,26], cholangiocarcinoma [27], ovarian cancer [28], gastric cancer [29], colorectal cancer [30], angiosarcoma [31], arachnid melanoma [32] and non-small cell lung cancer [33] are all present with ROS1 fusion protein. Today, ROS1 is a proven therapeutic target for NSCLC.

Conclusion
In conclusion, we identi ed 4 key genes (AQP1, GSTA1, TGFB1I1 and ROS1) related with the progression of lung cancer BM by WGCNA analysis on DEGs database. We believe that these 4 genes may participate in lung cancer BM by regulating immune cell in ltration, which has potential prognostic effects. In other words, our ndings may help predict, screen and early diagnosis of lung cancer BM patients. Additionally, the results will help identify therapeutic targets for lung cancer BM cancer and monitor the response to therapy. The graph shows the scale-free t index for various soft threshold powers to identify the optimal soft threshold power (β). In the graph on the left, the horizontal axis represents the soft threshold power or β values and the vertical axis represents the scale-free network index (R2). The scale-free characteristics of the gene network are stronger when the R2 value is higher. In the right graph, the horizontal axis represents the soft threshold power or β values, the vertical axis represents the means of all the gene adjacency functions in the corresponding gene module. c Cluster dendrogram of all ltered genes enriched based on the dissimilarity measure and the cluster module colors. d module and trait heat map.
e: The scatter plot shows the correlation between module genes and traits.   GSEA function prediction of key genes. a GSEA enrichment results of AQP1, the left side represents GOBP pathway and the right side represents the KEGG pathway. b GSEA function enrichment results of GSTA1, the left side represents the GOBP pathway and the right side represents the KEGG pathway. c The GSEA function enrichment results of ROS1, the left side represents the GOBP pathway and the right side represents the KEGG pathway. d GSEA enrichment results of TGFB1I1, the left representing GOBP pathway and the right side represents the KEGG pathway different immune cells between BM / primary lung cancer, blue indicates the brain metastasis group, and yellow indicates the primary group. "*" means P <0.05, "**" means P <0.01, "***" means P <0.001, and "****" means P <0.0001. e Correlation diagram of key genes and differential immune cells, blue indicates positive correlation, yellow indicates negative correlation. Figure 6