A novel DNA methylation signature associated with lymph node metastasis status in early gastric cancer

Lymph node metastasis (LNM) is an important factor for both treatment and prognosis of early gastric cancer (EGC). Current methods are insufficient to evaluate LNM in EGC due to suboptimal accuracy. Herein, we aim to identify methylation signatures for LNM of EGC, facilitate precision diagnosis, and guide treatment modalities. For marker discovery, genome-wide methylation sequencing was performed in a first cohort (marker discovery cohort) using 47 fresh frozen (FF) tissue samples. The identified signatures were subsequently characterized for model development using formalin-fixed paraffin-embedded (FFPE) samples by qPCR assay in a second cohort (model development cohort, n = 302; training set: n = 151, test set: n = 151). The performance of the established model was further validated using FFPE samples in a third cohort (validation cohort, n = 130) and compared with image-based diagnostics, a conventional clinicopathology-based model (conventional model), and current standard workups. Fifty LNM-specific methylation signatures were identified de novo and technically validated. A derived 3-marker methylation model for LNM diagnosis was established that achieved AUCs of 0.87 and 0.88, corresponding to specificities of 80.9% and 85.7%, sensitivities of 80.6% and 78.1%, and accuracies of 80.8% and 83.8% in the test set of the model development cohort and in the validation cohort, respectively. Notably, this methylation model outperformed computed tomography (CT)-based imaging (AUC 0.88 vs. 0.57, p < 0.0001) and individual clinicopathological features in the validation cohort. The model integrated with clinicopathological features demonstrated a further enhanced AUC of 0.89 in the same cohort. The 3-marker methylation model and the integrated model reduced overtreatment by 39.4% and 41.5%, respectively, as compared to standard workups.
A novel 3-marker methylation model was established and validated that shows diagnostic potential to identify LNM in EGC patients and thus reduce unnecessary gastrectomy in EGC.

(EMR) are the mainstream approaches for treating EGC patients at low risk of LNM because they are minimally invasive and function-preserving, allow en bloc resection with limited trauma, and maintain a good quality of life [5,6]. For EGC patients at high risk of LNM, however, radical gastrectomy with lymphadenectomy is usually adopted. This procedure can lead to various post-gastrectomy complications, including anastomotic leakage, bleeding, stricture, delayed gastric emptying, reflux esophagitis, residual food, and reduced quality of life postoperatively [5,6]. Therefore, precise assessment of lymph node metastatic status in EGC plays a critical role in treatment decision making.
Currently, LNM is diagnosed mainly by imaging methods, such as endoscopic ultrasonography, computed tomography (CT), positron emission tomography with CT (PET-CT), or by evaluating clinicopathological features after endoscopic biopsy, including submucosal invasion, ulceration, undifferentiated type, and lymphovascular invasion status [7][8][9]. However, the accuracy and reliability of these methods are unsatisfactory, leading to overtreatment and unnecessary gastrectomy in a large portion of EGC patients [10][11][12]. Post-gastrectomy pathological evaluation showed that about 80% of EGC patients with negative lymph node metastasis were treated unnecessarily with radical gastrectomy [10,11]. This suggests that the current standard of care in the clinical setting for LNM diagnosis is inadequate and it is imperative to develop novel methods to accurately determine LNM status and improve the quality of life in patients with EGC.
DNA methylation is one of the most important epigenetic modifications. A growing number of studies have shown that DNA methylation plays a prominent role in tumorigenesis and progression [13,14]. Abnormal DNA methylation occurs before the clinical symptoms of the disease become apparent and often leads to gene misexpression [15]. With the development of high-throughput technologies, cancerous genome-wide methylation data have been used to study potential markers of early diagnosis, prognostic assessment, progression monitoring, and chemoradiotherapy sensitivity [16]. To accurately assess the possibility of LNM in EGC, numerous studies have reported different prediction models, which are constructed mainly based on clinicopathological features [17,18]. To our knowledge, genome-wide DNA methylation mapping and modeling prediction using methylation markers for LNM in EGC have not yet been reported.
Our previous studies have shown that a genome-wide DNA methylation approach can be applied to the diagnosis of bladder cancer and the identification of benign and malignant pulmonary nodules [19,20]. In this study, we performed a DNA methylation profiling of LNM in EGC patients and developed a methylation test for LNM diagnosis.

Study design and patient recruitment
A three-phase strategy was designed in our study (Fig. 1), which included a marker discovery cohort (n = 47, fresh frozen (FF) tissue samples), a model development cohort (n = 302, formalin-fixed paraffin-embedded (FFPE) samples), and a validation cohort (n = 130, FFPE samples). Genome-wide methylation sequencing was applied to the FF samples to identify LNM-specific methylation markers, which were subsequently validated by a qPCR assay. The identified and validated methylation markers were further characterized in the model development cohort using FFPE samples, the sample type used in a practical clinical setting. The diagnostic model developed was further validated and compared to imaging diagnostics, a clinicopathology-based model (conventional model), and current standard workups in the validation cohort. An overview of the patient recruitment workflow is described in Additional file 1: Figure S1. Patients with treatment-naïve EGC were enrolled from Nanfang Hospital (n = 436; 47 FF samples and 389 FFPE samples) and Shenzhen People's Hospital (n = 189, FFPE samples) between January 2015 and November 2020. Samples with failed experimental QCs (n = 146) were excluded from the study. The tissue samples from the EGC patients were surgical specimens collected before radiation or chemotherapy. A tumor content of over 30% in the FFPE samples was confirmed by pathologists. The pathology and LNM status of the samples were confirmed by at least two gastrointestinal pathologists. The clinicopathological characteristics of all patients, including gender, age, tumor size, tumor location, differentiation, invasion depth, ulceration, and lymphovascular invasion (LVI), are summarized in Table 1.

Discovery of differentiated methylation markers
To identify potential markers, we gathered 47 FF samples of EGC, comprising 23 LNM+ tumors and 24 LNM− tumors. A genome-wide methylation library was constructed individually from the genomic DNA of each sample using the TruSeq® Methyl Capture EPIC Library Prep Kit (Illumina, USA, Catalog No. FC-151-1002) following the manufacturer's instructions; we refer to this kit as EPIC. The detailed clinicopathological features of the patients in the EPIC genome-wide methylation libraries are shown in Additional file 2: Table S1. After the EPIC libraries were checked with the Agilent High Sensitivity DNA Kit (Agilent, USA, Catalog No. 5067-4626) for quality assurance, high-throughput sequencing was performed on Illumina's X-Ten platform. The sequencing data processing methods are detailed in Additional file 2: Methods.
Table S2) was designed and used to characterize the methylation patterns in EGC-LNM patients with the EGC-LNM detection kit (AnchorDx, China, Catalog No. EGME-002). The methylation analysis by the MethyLight approach was described earlier (details are in Additional file 2: Methods) [21]. The qPCR methylation analysis was performed on the QuantStudio 3 Real-Time PCR System (Thermo Fisher, USA). Then, the diagnostic model of LNM in EGC was established and validated based on methylation-specific qPCR data.

Methylation model development and validation
A total of 432 FFPE samples were randomly divided into a model development cohort (n = 302) and a validation cohort (n = 130) at a ratio of approximately 7:3. The cohort division was blinded to the methylation test results. The model development cohort was further randomly split into a training set (50%) and a test set (50%) with a 20-fold validation. The 50 identified markers were analyzed with the least absolute shrinkage and selection operator (LASSO) algorithm to determine the minimum number of markers required and to select the top markers. The selected top markers were then used for model construction with a logistic regression algorithm by iterative marker combination analysis in the model development cohort. The validation cohort (n = 130) was used to independently test the final model. Sensitivity, specificity, accuracy, positive predictive value, and negative predictive value were then evaluated.
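The evaluation metrics listed above all follow from a 2×2 confusion table. The sketch below is illustrative only and is not the authors' code: the function and class names are hypothetical, and the example counts are an assumption, back-calculated from the percentages reported for the 151-sample test set rather than taken from the source tables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiagnosticMetrics:
    sensitivity: float
    specificity: float
    accuracy: float
    ppv: float
    npv: float


def diagnostic_metrics(tp: int, fn: int, tn: int, fp: int) -> DiagnosticMetrics:
    """Compute standard diagnostic metrics from confusion-table counts."""
    if min(tp, fn, tn, fp) < 0:
        raise ValueError("counts must be non-negative")
    if 0 in (tp + fn, tn + fp, tp + fp, tn + fn):
        raise ValueError("every margin of the confusion table must be non-zero")
    total = tp + fn + tn + fp
    return DiagnosticMetrics(
        sensitivity=tp / (tp + fn),   # LNM+ cases correctly called positive
        specificity=tn / (tn + fp),   # LNM- cases correctly called negative
        accuracy=(tp + tn) / total,
        ppv=tp / (tp + fp),           # positive predictive value
        npv=tn / (tn + fn),           # negative predictive value
    )


# Assumed counts (not from the source): 36 LNM+ and 115 LNM- test-set samples.
example = diagnostic_metrics(tp=29, fn=7, tn=93, fp=22)
print(f"sensitivity={example.sensitivity:.1%}, "
      f"specificity={example.specificity:.1%}, "
      f"accuracy={example.accuracy:.1%}")
```

Under these assumed counts the sensitivity, specificity, and accuracy agree with the values reported for the test set; the predictive values depend on the same counts and are therefore equally tentative.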

Development and evaluation of the conventional model and integrated model
Eight clinicopathological variables were included in the univariate analysis to explore their association with LNM in the model development cohort, and variables with a p value less than 0.05 were included in the multivariate analysis for the conventional model. Forward stepwise regression analysis evaluated odds ratios (ORs) with 95% confidence intervals (CIs) to identify independent predictors. The integrated model was built from the independent predictors and the 3-marker methylation signature. Tolerance and variance inflation factors were used to evaluate the multicollinearity of the multivariate models. Based on both multivariate logistic regression models, two quantitative scoring formulas were derived and the area under the receiver operating characteristic curve (AUROC) was measured. (Details are in Additional file 2: Methods.)
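For orientation, the general form of a logistic-regression score and of the multicollinearity diagnostics named above can be written as follows. This is a generic sketch, not the formulas of this study: the coefficients are not given in this section, and the symbols (k predictors x_i with coefficients β_i, and R_j² from regressing predictor j on the remaining predictors) are notation introduced here for illustration.

```latex
% Generic logistic model: predicted probability of LNM given k predictors
P(\mathrm{LNM}^{+} \mid x_1, \dots, x_k)
  = \frac{1}{1 + \exp\!\left[-\left(\beta_0 + \sum_{i=1}^{k} \beta_i x_i\right)\right]}

% Odds ratio associated with a one-unit increase in predictor i
\mathrm{OR}_i = e^{\beta_i}

% Multicollinearity diagnostics for predictor j
\mathrm{Tolerance}_j = 1 - R_j^{2},
\qquad
\mathrm{VIF}_j = \frac{1}{1 - R_j^{2}} = \frac{1}{\mathrm{Tolerance}_j}
```

A tolerance close to 1 (equivalently, a variance inflation factor close to 1) indicates that a predictor is nearly uncorrelated with the others in the model.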

Statistical analysis
The Wilcoxon signed-rank test or Mann-Whitney U test was used to analyze epigenome methylation data. Student's t test was used to evaluate the distribution of risk scores among different test groups. The χ² test or Fisher's exact test and the two-tailed t test were used to compare categorical and continuous variables, respectively, when appropriate. Logistic regression-based model construction was conducted using the R package glmnet (version 2.0.16). Other details of the statistical analyses are described in Additional file 2: Methods.

Genome-wide screening of DNA methylation markers to detect LNM in EGC tissue samples
A schematic workflow of the study design is shown in Fig. 1. To identify DNA methylation markers that are LNM-specific in EGC, we first performed a genome-wide methylation analysis (covering more than 3.34 million CpG sites) on 23 lymph node metastasis positive (LNM+) and 24 lymph node metastasis negative (LNM−) FF tissue samples. A total of 1366 differentially methylated CpG sites were found (Additional file 1: Figure S2, FDR < 0.05 and β-value difference ≥ 0.2). Based on these methylation sites, we further identified 60 differentially methylated regions (hereafter referred to as the "markers") by using co-methylation region analysis as previously reported [20]. An unsupervised hierarchical clustering showed a clear differential pattern between the LNM+ and LNM− patients (Fig. 2a).
Fig. 2 (caption fragment): The data are shown as median with 95% confidence intervals. Statistical significance was assessed using a non-paired t test (two-tailed). *p < 0.05, **p < 0.01, and ***p < 0.001. d Methylation level of CCDC166 from genome-wide methylation sequencing was inversely correlated with ΔCt values from the qPCR-based methylation assay in 10 paired discovery samples. Pearson's test was used.
Our primary goal was to develop a simple methylation-specific qPCR assay for LNM status determination [21]. The 60 markers were further validated technically using the same FF samples by a qPCR approach. Among these markers, 50 showed consistent methylation patterns between sequencing and methylation-specific qPCR analysis and significantly distinguished LNM+ from LNM− in the same samples. The remaining 10 markers were excluded because they failed technical validation, showing inconsistent methylation patterns between the two assays (Fig. 2b-d, Additional file 1: Figures S3 and S4). These results suggested that these markers and qPCR-based assays were reliable and could be used for large-scale cohort analysis.
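The site-level filter quoted above (FDR < 0.05 and β-value difference ≥ 0.2) can be sketched as follows. This is a minimal illustration under stated assumptions: the text does not say which FDR procedure was used, so the Benjamini-Hochberg adjustment here is an assumption, as are the use of the absolute difference of group mean β-values and all function and parameter names.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values are monotone non-decreasing in the raw p-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_sites(
    p_values: Sequence[float],
    mean_beta_pos: Sequence[float],   # mean beta-value per site, LNM+ group
    mean_beta_neg: Sequence[float],   # mean beta-value per site, LNM- group
    fdr: float = 0.05,
    min_delta: float = 0.2,
) -> List[int]:
    """Indices of sites passing both the FDR and the beta-difference cutoffs."""
    if not (len(p_values) == len(mean_beta_pos) == len(mean_beta_neg)):
        raise ValueError("all inputs must have one entry per CpG site")
    q_values = benjamini_hochberg(p_values)
    return [
        i
        for i, (q, bp, bn) in enumerate(zip(q_values, mean_beta_pos, mean_beta_neg))
        if q < fdr and abs(bp - bn) >= min_delta
    ]


# Toy input (invented for illustration, not study data).
toy_hits = select_sites(
    p_values=[0.001, 0.01, 0.04, 0.5],
    mean_beta_pos=[0.8, 0.5, 0.9, 0.1],
    mean_beta_neg=[0.5, 0.45, 0.3, 0.6],
)
```

In the toy input only the first site passes both cutoffs: the second has too small a β-value difference, and the third and fourth fail the FDR threshold after adjustment.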

Development and validation of a 3-marker methylation model for LNM diagnosis
Since the EGC samples acquired in a practical clinical setting are endoscopically sectioned FFPE samples, we further characterized the 50 methylation markers identified from FF samples by the same qPCR assays in a model development cohort consisting of 302 FFPE EGCs. To improve the diagnostic efficiency of the assay and reduce marker redundancy, the least absolute shrinkage and selection operator (LASSO) algorithm was used to determine the minimum number of markers required for maintaining stable diagnostic power and to select the corresponding top markers from the 50 candidates. A marker number of five was chosen, and the resulting top five markers were carried forward for model development. Methylation models containing any one to five of these markers were iteratively constructed using a logistic regression algorithm. By comparing the performance and its consistency across 100 random splits of the dataset with a train-test ratio of 1:1, a 3-marker methylation model was derived. The 3-marker methylation model, comprising GNAS, FCGBP, and CCDC166, achieved high AUCs of 0.84 (95% CI 0.74-0.94) and 0.87 (95% CI 0.80-0.93) in the training and test sets, respectively (Fig. 3a, b, Additional file 1: Figure S5a and S5b). The model showed consistent specificities of 78.3% and 80.9%, sensitivities of 80.6% and 80.6%, and accuracies of 78.8% and 80.8% in the training and test datasets, respectively (Fig. 3c). Notably, LNM+ patients showed significantly higher LNM risk scores, calculated from the model, than LNM− patients in both training and test sets (Fig. 3d, e, p < 0.001).
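The search space described above, every panel of one to five markers evaluated over 100 random 1:1 splits, can be enumerated with a short sketch. It only sets up the panels and the splits; fitting a logistic regression to each panel is left out. The names `marker_4` and `marker_5` are placeholders, since only three of the top five markers are named in the text, and whether the original splits were stratified by LNM status is not stated, so plain random splits are assumed here.

```python
import random
from itertools import combinations
from typing import List, Sequence, Tuple

# GNAS, FCGBP and CCDC166 are named in the text; the remaining two of the
# top five markers are not, so placeholder names are used for them.
TOP_MARKERS = ["GNAS", "FCGBP", "CCDC166", "marker_4", "marker_5"]


def candidate_panels(markers: Sequence[str]) -> List[Tuple[str, ...]]:
    """All non-empty marker subsets, from single markers up to the full set."""
    return [
        panel
        for size in range(1, len(markers) + 1)
        for panel in combinations(markers, size)
    ]


def random_half_splits(
    n_samples: int, n_splits: int = 100, seed: int = 0
) -> List[Tuple[List[int], List[int]]]:
    """Repeated random 1:1 train/test splits of sample indices (unstratified)."""
    rng = random.Random(seed)
    half = n_samples // 2
    splits = []
    for _ in range(n_splits):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        splits.append((sorted(indices[:half]), sorted(indices[half:])))
    return splits


panels = candidate_panels(TOP_MARKERS)
splits = random_half_splits(n_samples=302)
```

With five candidate markers there are 31 possible panels, so comparing all of them across 100 splits amounts to 3,100 model fits, which is small enough to evaluate exhaustively.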
The model was further validated in an independent cohort consisting of 32 LNM+ and 98 LNM− patients. It achieved an AUC of 0.88 (95% CI 0.80-0.95), sensitivity of 78.1%, specificity of 85.7%, and accuracy of 83.8% (Fig. 3c, g, Table 2, Additional file 1: Figure S5c). Consistent with the results from the model development cohort, the model showed a significantly higher LNM risk score in the LNM+ patients than in the LNM− patients (Fig. 3f). We then assessed whether risk scores were associated with clinical characteristics. We found that the LNM risk scores were significantly higher in patients with ulceration, undifferentiated type, submucosal invasion, and lymphovascular invasion in the validation cohort (Fig. 3h), indicating that the LNM risk scores were associated with the previously reported LNM risk factors. On the other hand, the risk score did not vary significantly among EGC patient groups of different age, gender, tumor size, and tumor location (Additional file 1: Figure S6). Taken together, the 3-marker methylation model showed an accurate and robust performance in discriminating LNM in EGC.
Accordingly, we compared the performance of the 3-marker methylation model with that of CT imaging and these clinicopathological features for EGC LNM diagnosis. Of interest, we found that the diagnostic performance (AUC) of the 3-marker methylation model was higher than that of CT imaging and of the individual clinicopathological features (Fig. 4a, b). The 3-marker methylation model showed significantly higher accuracies (79.8% and 83.8%) than diagnostic models based on CT imaging (59.6% and 53.8%), tumor differentiation (59.6% and 52.3%), tumor invasion depth (58.3% and 56.2%), tumor lymphovascular invasion (67.9% and 70.8%), ulceration (58.3% and 53.1%), and tumor size (51.3% and 47.7%) in the two cohorts, respectively (Fig. 4c, d). The sensitivity and specificity of the 3-marker methylation model were also significantly higher than those of diagnostic models based on CT imaging or individual clinicopathological features (Additional file 1: Figure S7), with approximately twofold higher sensitivities as compared to CT-based diagnostics (80.6% vs. 41.7% and 78.1% vs. 40.6% in the model development and validation cohort, respectively) (Additional file 1: Figure S7c and S7d).
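The AUC values compared above have a simple interpretation: the probability that a randomly chosen LNM+ patient receives a higher risk score than a randomly chosen LNM− patient, with ties counted as one half. The sketch below computes it directly from that definition; it is an illustration with invented scores, not the software used in the study, and the function name is hypothetical.

```python
from typing import Sequence


def auc_from_scores(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Equivalent to the Mann-Whitney U statistic divided by the number of pairs.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("both groups must contain at least one score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0      # positive case ranked above negative case
            elif p == n:
                wins += 0.5      # ties count as half a correct ranking
    return wins / (len(pos_scores) * len(neg_scores))


# Invented risk scores for illustration only.
toy_auc = auc_from_scores(pos_scores=[0.9, 0.8, 0.4], neg_scores=[0.7, 0.3, 0.2, 0.4])
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which is why a value near 0.57 for CT imaging indicates little discriminative power.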

An integrated model combining methylation and clinicopathological features further improved the LNM diagnostic performance
To evaluate the performance of the 3-marker methylation model and the clinicopathological characteristic-based model (i.e., the conventional model [17,18]), the risk factors as identified by previous univariate analysis were used in multivariate analysis to select independent LNM predictors (Table 3 and Additional file 2:  (Fig. 5c, d). Diagrams illustrating the predicted results of both the 3-marker methylation model and the conventional model as compared to pathology for the same patients in the model development and validation cohorts are shown in Fig. 5e, f. For the same patients having LNM, the 3-marker methylation model and the conventional model showed high concordance, and the 3-marker methylation model identified additional cases. More importantly, the 3-marker methylation model helped more patients without LNM to avoid overtreatment.
To explore whether the diagnostic accuracy of the 3-marker methylation model could be enhanced by combining clinicopathological features, we built an integrated model within the model development cohort using independent predictors of LNM, which included the 3-marker methylation model (Fig. 5c-f).

Both the 3-marker methylation model and the integrated model have the potential to reduce overtreatment on LNM− EGC patients
The treatment modalities of EGC depend on the LNM status of patients. While ESD has been used as the curative procedure for EGC without LNM, surgical resection of tumors with D1/D2 lymphadenectomy is conducted in patients diagnosed with LNM. However, the identification of LNM is not sufficiently accurate under current standard workups (the Japan Gastroenterological Endoscopy Society and Japanese Gastric Cancer Association (JGCA) guidelines) [5,23]. To test whether the 3-marker methylation model can augment LNM diagnostic accuracy and treatment precision, we compared the clinical utility of the 3-marker methylation model and the integrated model with that of current standard workups in all 432 surgically resected specimens. For patients with the absolute indication of ESD in our cohorts (n = 29), the 3-marker methylation model and integrated model resulted in 79.3% and 100% diagnostic accuracy, 0.0% undertreatment, and 20.7% and 0.0% overtreatment due to false positive identification, as compared to 100.0% accuracy, 0.0% undertreatment, and 0.0% overtreatment for standard workups (Fig. 6a, b). For patients with the expanded indication of ESD in our cohorts (n = 81; 13 LNM+ and 68 LNM−), while the overtreatment rates of the 3-marker methylation model and integrated model were slightly higher than that of standard workups (16.0% and 12.3% vs. 0.0%), the undertreatment rates of our models were significantly lower (2.5% and 4.9% vs. 16.1%) and the overall accuracies were comparable to that of standard workups (81.5% and 82.7% vs. 84.0%) (Fig. 6a, c). For patients with the relative indication (n = 322; 91 LNM+ and 231 LNM−), the 3-marker methylation model and integrated model showed significantly improved accuracies as compared to standard workups (81.1% and 83.2% vs. 28.3%). Additionally, the 3-marker methylation model and integrated model showed remarkably low overtreatment rates (13.0% and 13.0% vs. 71.74%) (Fig. 6a, d). Since 74.5% of the overall EGC patients fall under the relative indication, the 3-marker methylation model and integrated model have the potential to significantly reduce the overtreatment rate by 39.4% and 41.5% (14.1% and 12.0% vs. 53.5%), respectively, while maintaining a comparable undertreatment rate (4.9% and 3.7% vs. 3.0%) (Fig. 6a, e). Based on our findings, a scheme for integrating the methylation model and integrated model into the current clinical diagnostic setting was proposed (Additional file 1: Figure S8).
Fig. 6 (caption fragment): e An overall estimation of undertreatment rate, overtreatment rate, and accuracy of the 3-marker model and the integrated model as compared to the standard workups. Statistical significance was assessed using χ² test. *p < 0.05, **p < 0.01, ***p < 0.001; NS, not statistically significant.
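The overall overtreatment figures can be related to the group-level rates reported above by weighting each indication group by its size. The sketch below is a consistency check under the assumption that the overall rates are size-weighted averages over the three indication groups; the variable and function names are hypothetical, and the inputs are the rounded percentages quoted in the text.

```python
from typing import Dict

# Patients per ESD indication group, as reported in the text (total 432).
GROUP_SIZES: Dict[str, int] = {"absolute": 29, "expanded": 81, "relative": 322}

# Overtreatment rates per group, as reported in the text (as fractions).
OVERTREATMENT: Dict[str, Dict[str, float]] = {
    "methylation": {"absolute": 0.207, "expanded": 0.160, "relative": 0.130},
    "integrated": {"absolute": 0.000, "expanded": 0.123, "relative": 0.130},
    "standard": {"absolute": 0.000, "expanded": 0.000, "relative": 0.7174},
}


def pooled_rate(rates: Dict[str, float], sizes: Dict[str, int]) -> float:
    """Size-weighted average of group-level rates."""
    total = sum(sizes.values())
    return sum(rates[group] * sizes[group] for group in sizes) / total


overall = {name: pooled_rate(rates, GROUP_SIZES) for name, rates in OVERTREATMENT.items()}

# Reduction relative to standard workups, in percentage points.
reduction_points = {
    name: (overall["standard"] - overall[name]) * 100
    for name in ("methylation", "integrated")
}

for name, rate in overall.items():
    print(f"{name}: overall overtreatment {rate:.1%}")
```

Under this assumption the pooled rates agree with the reported 14.1%, 12.0%, and 53.5% to within rounding, and the reported reductions of 39.4% and 41.5% correspond to differences in percentage points rather than relative reductions.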

Discussion
In this study, we performed a comprehensive genome-wide methylation profiling on EGC tissues and identified 60 LNM-specific methylation markers. Derived from these markers, a qPCR-based 3-marker methylation model was developed and validated with large-scale retrospective cohorts, consisting of 302 and 130 tissue samples, respectively. This model was superior to the most commonly used clinicopathology-based conventional tools in diagnosing LNM, as shown in our head-to-head comparison (AUC 0.85 vs. 0.77 in the model development cohort and AUC 0.88 vs. 0.79 in the validation cohort), while the conventional model we developed using the clinicopathological information showed similar diagnostic power as compared to previous studies (0.84 in the model development cohort and 0.82 in the validation cohort) [16]. The 3-marker methylation model also showed advantageous diagnostic potential as compared to the reported gene expression-based methods, in which a 15-gene signature was used to identify LNM in early stage (T1-T2) gastric cancer with an AUC of 0.76 in the training set and 0.74 in the validation set [24].
The results indicate the robustness of DNA methylation as diagnostic biomarker as compared to RNA expression, as DNAs were relatively stable clinical material and DNA methylation profiles may represent a relatively stable long-term programming of the genome and underlying cellular functions, whereas transcription assays only provide a snapshot of the gene expression activity at a specific time point and represent a transient signaling process [13].
To date, few studies have used a genome-wide methylation strategy to screen methylation markers for LNM diagnosis in EGC. Wu et al. developed a classifier of 14 LNM-related genes, derived from 450K methylation data of gastric cancer in The Cancer Genome Atlas (TCGA), which showed a median AUC of 0.78 [25]. Our study applied a more comprehensive approach to dissect the methylome associated with LNM in EGC, with more than 3.34 million CpG sites analyzed, which accounted for 97.3% of CpG islands in the genome. The de novo marker discovery effort identified some LNM-specific markers that are reported here for the first time in EGC, including the 3 markers (GNAS, FCGBP, and CCDC166) used in the methylation model.
Previous studies have shown that DNA methylation levels of imprinted domains of GNAS in primary breast cancer, lung cancer, and ovarian cancer are very different from those in normal tissues. It has been shown that GNAS promotes breast cancer cell proliferation and epithelial-mesenchymal transition (EMT) through the PI3K/Akt/Snail1/E-cadherin signaling pathway, which may be responsible for the malignant progression and metastasis [26,27]. The methylated region we identified lies in the first exon of GNAS and is hypomethylated in LNM+ EGC in our study, suggesting that imprinted domains in GNAS could play a role in gastric cancer metastatic development as well.
FCGBP (Fc fragment of IgG binding protein) has been identified as a metastasis-related gene in colorectal cancer; its down-regulation is an independent risk factor for overall survival and disease-free survival in patients with metastatic colorectal cancer and is significantly associated with the prognosis of those patients [28,29]. We found that the methylated region of the FCGBP gene is located in its fifth exon, which may be involved in the regulation of gene expression and affect its function in LNM in gastric cancer. CCDC166 was found to be highly mutated in signet ring cell carcinoma [30]. The mutated region does not overlap the methylated region we found, which is located in the first exon of CCDC166 and is hypomethylated in LNM+ EGC in our study. Further studies are needed to explore the biological functions and potential regulatory network of these methylation markers in promoting LNM in EGC.
In current clinical settings, endoscopic ultrasound, CT imaging, and clinicopathological features are standard workups for determining the N staging of gastric cancer. As different N staging may lead to different operative management, it is crucial to accurately assess the N staging preoperatively. However, preoperative LNM identification is limited with current technologies. Endoscopic ultrasonography was reported to have an accuracy of 43%, while CT imaging has an accuracy of 56% [31,32]. Clinicopathological features can be examined pathologically with endoscopically resected tissues (EMR or ESD) from EGC patients. Patients found to have at least one positive pathological feature, such as undifferentiated type, submucosal invasion, lymphovascular invasion, or ulceration, are usually recommended for radical surgical procedures [33].
While the incidence of LNM in EGC is about 8%-25%, approximately 69.1% of the patients with EGC undergo radical gastrectomy with lymphadenectomy according to standard workups [34], indicating that the current pathological assessment-based LNM diagnosis procedures are suboptimal and result in a high rate of overtreatment and unnecessary gastrectomies. CT-positive findings that are largely based on nodule size and/or volume are often accompanied by high false-negative rates [12].
Our 3-marker methylation model demonstrated improved performance over these current conventional methods. We found the LNM risk score calculated from our model was significantly associated with the LNM status in patients but not their age, gender, tumor size, and tumor location. The 3-marker methylation model and integrated model showed significantly improved specificity and low false positive rates, resulting in a remarkable reduction of overtreatment by 39.4% and 41.5% as compared to standard workups; this result suggested a great potential of the assay to reduce unnecessary gastrectomies. However, it is worth pointing out that our study was based on samples that were surgically resected; thus, a large-scale multi-center study with preoperative endoscopic biopsies or endoscopically resected specimens is needed to confirm the robustness and performance of the assay.

Conclusions
In summary, we have established and validated a novel 3-marker methylation model in a large retrospective cohort, with the intention of improving LNM diagnostic accuracy in EGC. With further development, we hope that it can be integrated into existing preoperative LNM diagnosis procedures and assist in guiding treatment decision making in EGC patients.