Identification and Validation of an Individualized EMT-Related Prognostic Risk Score Formula in Gastric Adenocarcinoma Patients

Background The epithelial-mesenchymal transition (EMT) is a pivotal process for fibrotic disease, embryonic development, and wound healing. Moreover, some evidence has proven that the disorder of EMT also plays an important role in carcinogenesis, especially invasion and metastasis of various tumors (Ritchie et al., 2015). Additionally, gastric adenocarcinoma (GAC) is a common gastrointestinal malignancy which is the fourth most commonly diagnosed tumor. Our study is aimed at identifying the prognostic value of EMT-related genes in gastric adenocarcinoma. Methods Firstly, high-throughput and clinical data were downloaded from The Cancer Genome Atlas (TCGA) database. 99 differentially expressed EMT-related genes (ERGs) were obtained in these gastric adenocarcinoma data. Secondly, GO and KEGG enrichment analyses show that EMT may promote gastric carcinogenesis. Next, 10 ERGs associated with prognosis of gastric adenocarcinoma patients are screened out by univariate Cox regression, and 6 pivotal prognostic ERGs (MMP8, MMP11, TFDP3, MYB, F2, and CNTN1) are identified through multivariate Cox regression. These 6 genes are confirmed with significant prognostic value in gastric adenocarcinoma through overall survival (OS) analysis. Finally, a risk score formula is constructed and tested in another gastric adenocarcinoma cohort from GEO. Results 99 differentially expressed EMT-related genes (ERGs) and their enriched pathways are identified. 10 ERGs are strongly related to the prognosis of GAC patients. A risk score formula of 6 prognosis-related ERGs used to predict the prognosis of gastric adenocarcinoma patients is identified and tested (risk score = 0.448115∗expression value of MMP8 + 0.378892∗expression value of MMP11 − 0.3226∗expression value of MYB + 1.322812∗expression value of TFDP3 + 0.325063∗expression value of F2 + 0.334197∗expression value of CNTN1). Conclusion This study provides a potential prognostic signature for predicting prognosis of gastric adenocarcinoma patients and molecular insights of EMT in gastric adenocarcinoma, and the formula focusing on the prognosis of gastric adenocarcinoma can be effective.


Introduction
Gastric adenocarcinoma (GAC) is a common gastrointestinal malignancy which is the fourth most commonly diagnosed tumor and the second pivotal cause of cancer-related death toll worldwide [1]. The morbidity of GAC exceeds 400,000 cases, accounting for more than 40% of cancer rates worldwide, and the death toll has reached as high as 300,000 [2]. Over the past 20 years, there has been great progress in the diagnosis and treatment of GAC. However, the detection rate for GAC patients in the early stage remains poor, and most GAC patients are already in the progressive stage when diagnosed. Additionally, the prognosis of GAC patients in advanced stage remains poor, with a 5-year survival rate of less than 30% because of the high recurrence rate after surgery and chemotherapy resistance [3]. Hence, understanding the interaction mechanisms of critical molecules behind GAC occurrence and progression is significant, which will make great aid in the precaution of GAC and in the identification of novel effective therapeutic targets.
The epithelial-mesenchymal transition (EMT) is a pivotal process for fibrotic disease, embryonic development, and wound healing [4,5]. Moreover, some evidence has proven that the disorder of EMT also plays an important role in carcinogenesis, especially invasion, and metastasis of various tumors, including gastric cancer [6][7][8]. Even though the profound mechanism that EMT acts in carcinogenesis has been studied a lot, the prognostic value of EMT-related genes and the biological function of these prognosis-related genes in EMT process remain rudimentary and inconclusive. To sum up, studies on the correlations and underlying biological process between EMT and gastric tumorigenesis should be valuable to develop some new ideas for cancer diagnostic, prognostic, and treatment.

Differentially Expressed ERGs and Enrichment Analysis.
The differentially expressed EMT-related genes (ERGs) in mRNA expression data of STAD cohort were identified by the limma package in R software (FDR < 0:05, |logFC| > 1) [9]. Visualization (with volcano plots and heatmaps) was executed with the ggrepel, ggplot, and pheatmap packages in R software. Biological processes (BP), cellular components (CC), and biological pathways in which these ERGs were significantly enriched were identified by gene ontology  2.3. Establishing an Individualized Prognostic Index according to ERGs. After combining clinical data and mRNA expression data of ERGs, univariate Cox regression analyses were used to screen the differentially expressed ERGs with remarkable prognostic value. Then, multivariate Cox regression analysis was executed in these screened genes to filter out an independent prognostic predictor in ERGs. The clinical feature was identified, and the survival plot of these ERGs were drawn based on STAD-TCGA cohort using log-rank test. At last, the risk score (RS) for these prognostic ERGs was developed. The formula based on multiplied Cox regression coefficients can obtain the relative weight of genes in multiple Cox analysis. The clinical data of GAC patients were separated into high-and low-risk groups by the best RS value as the risk cutoff value. The survival curves were plotted by Kaplan-Meier (KM) method, and differences in the survival rates between high-and low-risk groups were assessed using log-rank test. Finally, these findings are tested in another   GAC cohort in GEO datasets through survival analysis and ROC curve analysis.

Statistical Analysis.
All statistical analyses and plots were implemented through R software 3.6.1 (https://www .r-project.org/). In detail, the relations between the mRNA expression of each ERGs and corresponding clinical data were analyzed through univariate Cox regression and independent prognostic factor, and their risk score (RS) was identified through multivariate Cox proportional hazards regression. The diagnostic and prognostic value of RS was calculated through receiver operating characteristic (ROC) curve in each dataset by "survival ROC" package in R software. Factors with P value < 0.05 was considered with statistical significance.

Result
3.1. Differentially Expressed ERGs. Hierarchical clustering of differentially expressed ERG expression levels. RNA-seq and clinical follow-up data from TCGA-STAD datasets was downloaded, which contains 375 cancer samples and 32 normal samples.
1015 EMT-related genes (ERGs) are acquired from Human EMT Database. Then, the transcriptional expression of these 1015 ERGs in RNA-seq data was extracted and compared between normal samples and GAC samples through limma package in R software (logFC > 1, FDR < 0:05). The results indicated that there are 78 genes significantly upregulated and 21 genes significantly downregulated in GAC (Figures 1(a) and 1(b)). The specific logFC and FDR of differentially expressed genes are showed in Table 1.

Biological Functions and Significant Pathway Analysis
Involved in the Expression of ERGs. Biological functions and significant pathways analysis were executed of these 99 differentially expressed ERGs. Gene ontology (GO) enrichment terms are shown in Figure 2. The result of GO enrichment shows that the screened ERGs positively regulate the growth of GAC.
KEGG pathway enrichment of these genes are shown in Figure 3. These results indicate that the differentially expressed ERGs can promote cancer-related pathways, including PI3k-Akt, IL-17, and JAK-STAT pathways, and may lead to transcriptional misregulation in cancer (Figure 4).

Identification of Prognostic ERGs.
The correlations between the transcriptional expression of these 99 differentially expressed ERGs and clinical data were analyzed using univariate Cox regression (P value < 0.05 is considered a significant criterion). 10 genes were screened with significant prognostic value in GAC (Figure 3).

Construction and Definition of the RS
According to the result of multivariate Cox regression analysis as Table 1, the formula of the risk score (RS) is as follows: risk score = 0.448115 * expression value of MMP8 + 0.378892 * expression value of MMP11 − 0.3226 * expression value of MYB + 1.322812 * expression value of TFDP3 + 0.325063 * expression value of F2 + 0.334197 * expression value of CNTN 1. In this formula, MYB has the only negative coefficient, indicating that high expression level of MYB is positively related with favorable prognosis while upregulated expression of MMP8, MMP11, TFDP3, F2, and CNTN1 is negatively related with the survival time of GAC patients.
Furthermore, we investigated the correlations for clinical index and risk score (RS) (Figures 7(a)-7(c)). The results of independent sample t-test analysis revealed that RS differentially expressed between male and female in GAC patients. Additionally, RS is also higher in patients with distant metastasis and pathological grades.
To validate the pivotal role of RS in predicting the prognosis of GAC, survival analysis between the low-risk group and high-risk group divided by best cutoff was executed in STAD-TCGA cohort. Heatmap shows MMP11 and MYB are significantly downregulated while others are upregulated in samples from high RS patients (Figure 8(a)). The results of survival analysis reveal that GAC patients with high-risk score have significantly worse prognosis than GAC patients with low-risk score (P < 0:01; Figures 8(b)-8(d) and 8(f)). As visualized in Figure 8(e), ROC (receiver operating characteristic) curve analysis indicates that RS has considerable diagnostic and prognostic efficacy in GAC patients (AUC = 0:74).
Finally, the formula and RS index above were tested in GSE84433 cohort from GEO datasets with 357 GAC tumor cases and corresponding follow-up information. Figure 10(a) shows the significantly differential distribution of high-score group and low-score group through survival analysis (P < 0:01). Figure 10(b) verified the prognostic and diagnostic value of RS (AUC = 0:696). The result indicated that RS can be a reliable predictor for predicting the prognosis of GAC patients.

Discussion
Gastric adenocarcinoma (GAC) is a major fatal type of cancer all around the world with a lack of reliable diagnostic and prognostic biomarker and therapeutic target [10]. It is of great importance for researchers to explore the new diagnostic and therapeutic strategy of GAC on these aspects. On the other hand, it has been widely proved that EMT is  an important biological process in cancer, including gastric cancer. Specifically, EMT is characterized by downregulation of the epithelial cell properties and activation of the mesenchymal characteristic. Epithelial and endothelial cells are able to acquire a mesenchymal phenotype through EMT. The expression of certain genes, including those for cadherins, integrins, and many transcription factors, is reprogrammed in the EMT. EMT is thought to be one of the major mechanisms that determine invasion and metastasis of cancer cell [11]. However, most of the studies focus on the function and mechanism of some EMT-related genes and the diagnostic and prognostic value of overall EMT-related genes(ERGs) has not been widely elucidated in GAC.
In this study, we collected the high-throughput data about the transcriptional expression of 375 GAC samples and 32 nontumor sample with their corresponding clinical data from TCGA database. Among these, we screened out the differentially expressed ERGs between GAC samples and nontumor gastric samples. Then, the differentially expressed ERGs are listed and analyzed, and the pivotal functional clusters and pathways of the differential expressed ERGs are identified. Interestingly, major differentially expressed ERGs in the cluster associated with negative regulation of cell aging are upregulated, indicating that EMT process promotes the cellular immortalization of GAC. Another novel finding is DNA H3-K4 methylation, in which the MAPK pathway, JAK-STAT pathway, PI3K-Akt pathway, and IL-17 pathway are involved in EMT process of GAC. The enrichment analysis verifies the affinity between EMT and these enrichment pathways, which reveals some therapeutic target of GAC in EMT process and may lead further research about the internal mechanism. In previous studies, H3K4 methylation has been found widely existing in EMT of various cancers and our finding verified it through highthroughput data [12,13]. Interestingly, there was no previous study that reported that the IL-17 pathway participates in EMT process while our study revealed that the IL-17 pathway may play an important role in EMT process. The researches about IL-17 pathways are needed in further practice. Univariate Cox regression analysis with survival data indicates that 10 ERGs were related to prognosis of GAC patients in STAD-TCGA cohort. Among these 10 genes, 6 genes, including MMP8, MMP11, MYB, TFDP3, F2, and CNTN1, are screened out as independent prognostic predictor, and a formula about risk score (RS) was raised through multivariate Cox regression analysis. MMP8 and MMP11 are two members of matrix metalloproteinases (MMPs), which are used to degrade extracellular matrix (ECM) components and deeply participated in EMT [14]. MYB is another critical transcriptional factor for EMT. Tao et al. reported that b-Myb regulates snail expression to promote EMT in breast cancer [15]. Qu et.al reported c-Myb promotes development of colorectal cancer through EMT [16]. Our study has revealed the significant role of MYB in GAC tumorigenesis, and further studies are needed to identify the related protein and protein interaction about MYB in GAC. TFDP3 is another MET-related gene which can also induce apoptosis and autophagy in breast cancer and prostate cancer  [ [17][18][19]. F2 and CNTN1 has been reported to promote EMT in gastric cancer [20]. As for our present study, we found that upregulation of MMP8, MMP11, TFDP3, F2, and CNTN1 adds the risk score (RS), and overexpression of MYB could decreases RS. Additionally, upregulation of CNTN1 is significantly related with worse pathologic grade. Overexpression of MMP8 is significantly associated with higher clinical stage, overexpression of MMP11 is significantly associated with bigger tumor size, and low expression of MYB is significantly related to worse pathologic grade and metastasis. The validation gastric cancer cohort from GEO database shows basic consistence with our findings in STAD-TCGA cohort.
To sum up, we speculate that MMP8, MMP11, TFDP3, F2, and CNTN1 can promote the progression of GAC while MYB could be a tumor suppressor in GAC.
So far, most of the cancer-related genes identified in bioinformatic methods are analyzed individually by single, which cannot be a comprehensive reflection of carcinogenesis process, and the ability of these single genes remains poor in diagnosis and prognostic prediction [21][22][23][24]. Moreover, most studies research on cancer of an organ without focusing one specific type of cancer or one specific process in carcinogenesis. In our point of view, it is more useful to identify a cluster of genes with clinical value in one specific cancerrelated process, and this can be the highlight in the present study. However, this study also has some imperfection. Firstly, the study can be more valuable if further basic experiments are done on these 6 screened genes to explore their deep mechanism in EMT. Secondly, the validation group in our study is based on GEO database. A cohort based on PCR results from samples of clinical patients and corresponding follow-up can enhance the quality of our present study. The study on these flaws can be our next goal about precision therapy research in GAC.
In conclusion, our study indicates some related genes and pathways in the EMT process of GAC, and MMP8, MMP11, TFDP3, MYB, F2, and CNTN1 can be pivotal ERGs in GAC. The risk score formula can be a sensitive diagnostic and prognostic predictor in GAC. Risk

Data Availability
The data used to support the findings of this study are included within the article. The gene expression data can be accessed on The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO).

Conflicts of Interest
No conflicts of interest, financial or otherwise, are declared by the authors.