A prognostic framework for predicting lung signet ring cell carcinoma via a machine learning based Cox proportional hazard model

Purpose Signet ring cell carcinoma (SRCC) is a rare type of lung cancer. The conventional survival nomograms used for lung cancer perform poorly for SRCC, so a nomogram developed specifically for SRCC is needed.
Methods Baseline characteristics of lung signet ring cell carcinoma were obtained from the Surveillance, Epidemiology, and End Results (SEER) database. Univariate and multivariate Cox regression and random forest analysis were performed on the training group data, and the results of the two types of analysis were compared. A nomogram model was developed to predict 1-year, 3-year, and 5-year overall survival (OS), and receiver operating characteristic (ROC) curves and calibration curves were used to assess prediction accuracy. Decision curve analysis (DCA) was used to assess the clinical applicability of the proposed model. For treatment modalities, Kaplan-Meier curves were used to analyze condition-specific effects.
Results We identified 731 patients diagnosed with lung signet ring cell carcinoma (LSRCC) in the SEER database and randomized them into a training group (511) and a validation group (220) at a ratio of 7:3. Eight factors, namely age, primary site, T.Stage, N.Stage, M.Stage, surgery, chemotherapy, and radiation, were included in the nomogram analysis. Treatment methods (surgery, chemotherapy, and radiation) and T.Stage had significant prognostic effects. The ROC curves, calibration curves, and DCA in the training and validation groups demonstrated that the nomogram could accurately predict survival and prognosis in LSRCC patients. Through deep verification, we found that the constructed model had a high C-index, indicating strong predictive power. Further, we found that all surgical interventions had good effects on OS and cancer-specific survival (CSS). The survival curves showed a relatively favorable prognosis for T0 patients overall, regardless of the treatment modality.
Conclusions Our nomogram is demonstrated to be clinically beneficial for the prognosis of LSRCC patients. Surgical intervention was successful regardless of the tumor stage, and the Cox proportional hazard (CPH) model performed better than the machine learning model.
Supplementary Information The online version contains supplementary material available at 10.1007/s00432-024-05886-0.


Introduction
Lung cancer remains the leading cause of cancer death (Sung et al. 2021). Common types of lung cancer include adenocarcinoma, squamous cell carcinoma, and small cell lung cancer. Lung adenocarcinoma typically has mutations in the EGFR, KRAS, and ALK genes. Lung signet ring cell carcinoma (LSRCC) is a rare adenocarcinoma subtype with a poorer prognosis than adenocarcinoma overall. The high rate of TTF-1 expression and the CK7+/CK20- immunostaining profile in LSRCC are useful for identifying the origin of SRCC. Squamous cell carcinoma commonly has mutations in the TP53 and FGFR1 genes. Small-cell lung cancer usually exhibits mutations in the TP53 and RB1 genes (Boland et al. 2014; Campbell et al. 2016; Niu et al. 2022; Chen et al. 2020a, b). According to one study, LSRCC exhibits distinct risk factors compared to other lung cancers, with a stronger association with Helicobacter pylori and Epstein-Barr virus infections alongside smoking (Boland et al. 2014). Additionally, data from the SEER database indicate that LSRCC accounts for approximately 1-2% of all lung cancer cases (https://seer.cancer.gov/statfacts/html/lungb.html). Furthermore, LSRCC is characterized by the presence of an ALK gene rearrangement and frequent mutations in the ERBB2, TP53, and KRAS genes, contributing to its aggressive behavior and poor prognosis (Okimura et al. 2023) (https://www.cancer.gov/types/lung). LSRCC may more frequently exhibit downregulation of E-cadherin compared to other types of lung cancer. E-cadherin is an important cell adhesion molecule, and its reduction is associated with increased invasiveness and migratory capability of cancer cells (Moon et al. 2006; Ma et al. 2017).
Signet-ring cell carcinoma (SRCC) is a unique subtype of mucin-producing adenocarcinoma characterized by abundant intracellular mucin accumulation, with a poor prognosis across various organs such as the stomach and colon (Wu et al. 2018; Hayashi et al. 1999; Frost et al. 1995; Kitamura et al. 1985; Randolph et al. 1997; Yamashina 1986). While LSRCC can occur as a metastasis from other primary tumors, recent studies have shown a significant rise in primary LSRCC cases, which greatly affects patient survival (Testori et al. 2021; Livieratos et al. 2013). Early diagnosis and treatment are believed to be crucial, with the potential for improved outcomes.
LSRCC is a rare non-small cell lung cancer. It is characterized by strong invasiveness and poor prognosis (Anwar et al. 2020; Hao et al. 2015; Iwasaki et al. 2008), and it is mostly diagnosed at an advanced stage (Testori et al. 2021). Its clinical characteristics and treatment have remained unclear so far. Normally, after the initial diagnosis, a thorough examination is carried out to ensure that the tumor is not a metastatic lesion from a different primary tumor (Moran 2006). Meanwhile, the prognosis of patients with LSRCC may vary with different treatment methods. So far, there have been few relevant studies. Clinical knowledge of LSRCC is mainly limited to individual case reports or small case series, and there is no large-scale study on the clinicopathological features of LSRCC or the corresponding prognosis of patients (Cai et al. 2021). Consequently, general prognosis prediction for lung cancer is inaccurate for this population. Therefore, this study aims to explore the significant prognostic factors and establish a more accurate model to predict the survival of patients with LSRCC.
We extracted population-level data from the Surveillance, Epidemiology, and End Results (SEER) database, an authoritative cancer statistics database in the United States. Regression analysis was used to identify prognostic factors and to establish a nomogram with high accuracy. Machine learning models can combine a large number of variables of different data types in a single model, thereby maximizing predictive performance. Machine learning technology has been widely used to diagnose various types of tumors (Liu et al. 2021). For censored cancer survival data, the most typical and commonly used machine learning and conventional statistical models are the random survival forest (RSF) and the Cox proportional hazard (CPH) model, respectively. The RSF is an ensemble machine learning method built from numerous independent decision trees, each of which receives a random subset of samples and considers a randomly selected subset of variables at each split. The final prediction of an RSF model is the average of the predictions of the individual trees. The CPH model is a well-recognized statistical technique for exploring the relationship between survival time and covariates (Qiu et al. 2020).
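For reference, the CPH model mentioned above is usually written in the following standard form. The symbols (baseline hazard h_0, coefficient vector β, covariate vector x) are conventional notation introduced here for illustration and are not taken from this study.

```latex
h(t \mid x) = h_0(t)\,\exp\!\left(\beta^{\top} x\right),
\qquad
\mathrm{HR}_j = \exp(\beta_j)
```

Under this form, the hazard ratio for a one-unit change in covariate j does not depend on time, which is the proportional hazards assumption.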
In survival analysis, many different regression modeling strategies can be applied to predict the risk of future events. However, the default choice is often Cox regression because of its convenience. Extensions of the random forest approach (Breiman 2001) to survival analysis provide an alternative way to build a risk prediction model (Mogensen et al. 2012). Therefore, we also used the random forest algorithm to screen for the best-performing model. Furthermore, we verified the discrimination of the nomogram in depth using the pec package in RStudio, with cross-validation based on bootstrap resampling.

Study population
From Incidence-SEER Research Plus Data 18 Registries, Nov 2020 Sub (2000-2018) in the SEER database, we extracted clinical information on patients diagnosed with LSRCC. All information was obtained with the SEER*Stat program (v 8.3.9). The lesion location was the lung and bronchus, the years of inclusion were 2004-2015, and SRCC was identified by ICD-O-3 histology code 8490. The extracted variables included patient ID, age, race recode, sex, marital status, ICD-O-3 Hist/behave, laterality, primary site, ICD-O-3, T, N, and M.Stage, surgery, radiation recode, chemotherapy recode, survival months, vital status recode, and SEER cause-specific death classification. Inclusion criteria: the follow-up time for all data was traceable and available. Exclusion criteria: unknown race, unknown marital status, blank T, N, or M.Stage, unknown surgical status, unknown radiotherapy status, unknown survival time, and incomplete data. The inclusion and exclusion process is shown in Fig. 1.

Age stratification
The extracted patients spanned a wide age range. To analyze prognosis and appropriate treatment across ages more reliably, we used X-tile software to determine age cutoffs and converted age from a continuous to a categorical variable, dividing patients into three age groups: 22-64 (47.33%), 65-77 (37.76%), and 78-85+ (14.91%) (Supplementary Fig. 1).

Factor exploration and model establishment
Overall survival (OS) and cancer-specific survival (CSS) were estimated using the Kaplan-Meier method and compared using the log-rank test. OS was defined as the interval from the date of initial diagnosis to the date of death from any cause, and CSS as the interval from diagnosis to death from lung signet ring cell carcinoma. We randomly divided the data into training and validation groups at a ratio of 7:3. The training group was used to build the model and the validation group to verify it. In the training group, we used the Cox proportional hazard regression model and the random forest algorithm to estimate the association between clinicopathological features and OS, calculating the hazard ratio (HR) and the corresponding 95% confidence interval (CI).
Significant factors from the multivariate Cox regression analysis were combined with clinically relevant factors to construct the nomogram, which predicts 1-year, 3-year, and 5-year OS in LSRCC patients.
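The Kaplan-Meier estimate used for OS and CSS can be illustrated with a minimal sketch. This is not the authors' code (the study worked in R); the function name `kaplan_meier` and the toy follow-up data are assumptions made purely for illustration.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times  -- follow-up times (e.g. survival months)
    events -- 1 if death was observed at that time, 0 if censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    leaving = Counter(times)      # everyone who leaves the risk set at time t
    at_risk = len(times)
    surv = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d > 0:
            # Product-limit step: multiply by the fraction surviving past t.
            surv *= 1.0 - d / at_risk
            curve.append((t, surv))
        at_risk -= leaving[t]     # deaths and censored cases both leave
    return curve


# Hypothetical toy data: five patients, two of them censored.
toy_curve = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
```

Censored patients contribute to the number at risk until their last follow-up but never trigger a drop in the curve, which is why the method suits registry data with incomplete follow-up.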

Model calibration and discrimination
To assess how well the constructed model discriminates patients' survival, we evaluated it in the training group using the ROC curve. The area under the curve (AUC) quantifies the discriminative ability of the model: the larger the area, the better the discrimination (Park et al. 2004).
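As a rough illustration of what the area under the ROC curve measures, the sketch below computes it as the probability that a randomly chosen patient with the event receives a higher predicted risk than a randomly chosen patient without it. The function name `auc` and the example scores are hypothetical, and the sketch ignores censoring, which a time-dependent ROC analysis at 1, 3, or 5 years would need to handle.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores -- predicted risk for each patient (higher = worse prognosis)
    labels -- 1 if the event occurred by the horizon, 0 otherwise
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical example: three of the four positive/negative pairs are ordered correctly.
example_auc = auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
```

An AUC of 0.5 corresponds to random ordering and 1.0 to perfect separation, which is the sense in which a larger area indicates better discrimination.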
Subsequently, we assessed how close the predictions of the model were to the observed outcomes using calibration plots for the training group. A calibration curve shows the relationship between predicted and observed probabilities, and the standard curve is a line through the origin with a slope of one. The closer the calibration curve is to the standard curve, the better the predictive power of the nomogram (Coutant et al. 2009). In addition, we applied the pec package in RStudio to verify the discrimination of the model, performing cross-validation by resampling. The package provides functions for inverse probability of censoring weighted (IPCW) estimation of the time-dependent Brier score and offers a choice between ordinary cross-validation, leave-one-out bootstrap, and the .632+ bootstrap for estimating risk prediction performance. It can also compute prediction error curves with independent test data (Mogensen et al. 2012).
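For orientation, the IPCW estimate of the time-dependent Brier score referred to above is commonly written as follows. The notation is an assumption introduced here, not taken from this study: T_i is the observed time, δ_i the event indicator, Ŝ(t | x_i) the predicted survival probability, and Ĝ the Kaplan-Meier estimate of the censoring distribution.

```latex
\widehat{\mathrm{BS}}(t) = \frac{1}{n}\sum_{i=1}^{n}
\left[
\frac{\hat{S}(t \mid x_i)^{2}\,\mathbf{1}\{T_i \le t,\ \delta_i = 1\}}{\hat{G}(T_i)}
+
\frac{\bigl(1-\hat{S}(t \mid x_i)\bigr)^{2}\,\mathbf{1}\{T_i > t\}}{\hat{G}(t)}
\right]
```

Lower values indicate better predictions; patients censored before time t contribute no term of their own and are accounted for through the weights.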
Fig. 1 Flowchart of the process of data selection
[...] received chemotherapy, 35.02% received radiation, and 24.49% received chemotherapy plus radiation (Table 1).

Risk factors of OS and CSS
A total of 731 patients were included in the study and randomly assigned at a ratio of 7:3 to the training cohort (n = 511) and the validation cohort (n = 220). To identify the prognostic factors, we performed univariate and multivariate Cox regression analyses in the training cohort.

Clinical benefit analysis of the model
Clinical utility, that is, the prognostic benefit to patients, was evaluated with decision curve analysis (DCA) in the training group. The DCA curve is constructed over a range of "threshold probabilities"; the farther the model curve lies from the two extreme curves (treat all and treat none), the higher the clinical utility of the model (Zhang et al. 2020).
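The quantity plotted on a decision curve is the net benefit at each threshold probability. The sketch below uses the standard net benefit formula for a binary outcome at a fixed time horizon; the function names and example values are hypothetical, and censoring is ignored for simplicity.

```python
def net_benefit(risks, outcomes, threshold):
    """Net benefit of treating patients whose predicted risk >= threshold.

    risks     -- predicted event probabilities
    outcomes  -- 1 if the event occurred, 0 otherwise
    threshold -- threshold probability, strictly between 0 and 1
    """
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(outcomes)
    tp = sum(1 for r, y in zip(risks, outcomes) if r >= threshold and y == 1)
    fp = sum(1 for r, y in zip(risks, outcomes) if r >= threshold and y == 0)
    # False positives are penalised by the odds of the threshold probability.
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(outcomes, threshold):
    """Net benefit of the 'treat all' extreme curve ('treat none' is always 0)."""
    prevalence = sum(outcomes) / len(outcomes)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Hypothetical example with five patients and a threshold probability of 0.5.
nb_model = net_benefit([0.9, 0.8, 0.6, 0.3, 0.2], [1, 1, 0, 1, 0], 0.5)
nb_all = net_benefit_treat_all([1, 1, 0, 1, 0], 0.5)
```

A model is clinically useful at a given threshold when its net benefit exceeds both extremes, which is what the distance between the model curve and the extreme curves expresses.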

Validation of the model
We used the validation group data to construct the calibration curve, ROC curve, and DCA again, verifying the nomogram and ensuring the accuracy of the study.

Comparison of the models based on Cox and random forest algorithms
In both the training and validation sets, the C-index of the Cox model was higher than that of the random forest model, indicating that the Cox model was more accurate than the random forest algorithm in this study (Table 4).
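The C-index used for this comparison can be sketched as Harrell's concordance index: among comparable patient pairs, the fraction in which the patient who died earlier was assigned the higher risk. The function name and toy data below are hypothetical, and pairs with tied survival times are simply skipped in this simplified version.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored data (simplified sketch).

    A pair (i, j) is comparable when patient i had an observed event
    and a strictly shorter time than patient j.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue              # a censored patient cannot be the earlier failure
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: risk scores perfectly ordered against survival time.
c_perfect = concordance_index([1, 2, 3, 4], [1, 1, 0, 1], [4, 3, 2, 1])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the model with the higher C-index orders patients' risks more accurately.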

Nomogram construction
Age, chemotherapy, primary site, surgery, T.Stage, and M.Stage were identified as independent prognostic factors via multivariate Cox analysis (all p < 0.05) and were included in the nomogram. However, it is important to consider both clinical and statistical significance when choosing inclusion variables (Iasonos et al. 2008). Therefore, we also included radiation and N.Stage in the predictive model (Fig. 2). To use the nomogram, an individual patient's value is located on each variable axis, and a line is drawn upward to determine the number of points received for each variable value. The sum of these numbers is located on the total points axis, and a line is drawn downward to the survival axis to determine the likelihood of 1-year, 3-year, and 5-year survival (Zheng et al. 2019). The nomogram revealed that surgery, T.Stage, and chemotherapy had the largest impact on the patient's prognosis.
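The point-reading procedure described above can be sketched in code. One common convention, assumed here, rescales the Cox coefficients so that the variable with the widest effect spans 0 to 100 points. The coefficients below are hypothetical placeholders, not the fitted values from this study, and the function names are illustrative.

```python
def nomogram_points(coefs):
    """Convert Cox log-hazard coefficients into nomogram points.

    coefs -- {variable: {level: coefficient}}, reference level coded as 0.0
    The variable with the widest coefficient range is scaled to 0-100 points.
    """
    ranges = {v: max(lv.values()) - min(lv.values()) for v, lv in coefs.items()}
    widest = max(ranges.values())
    if widest == 0:
        raise ValueError("all coefficients are equal; nothing to scale")
    return {
        v: {level: 100.0 * (b - min(lv.values())) / widest for level, b in lv.items()}
        for v, lv in coefs.items()
    }


def total_points(points, patient):
    """Sum the points for one patient's levels across all variables."""
    return sum(points[v][patient[v]] for v in points)


# HYPOTHETICAL coefficients for illustration only (not from the paper).
hypothetical_coefs = {
    "surgery": {"yes": 0.0, "no": 2.0},
    "chemotherapy": {"yes": 0.0, "no": 0.5},
}
pts = nomogram_points(hypothetical_coefs)
score = total_points(pts, {"surgery": "no", "chemotherapy": "no"})
```

Under a Cox model, the total points are a rescaled linear predictor, so the survival axis can be derived from the baseline survival at 1, 3, or 5 years raised to the power of the exponentiated linear predictor.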

Deep verification
Model 1 was constructed using the six significant factors from the multivariate Cox analysis plus two other clinically relevant factors, and model 2 was constructed using only the six significant factors. We compared the predictive power of the two models and validated them with cross-validation based on repeated bootstrap sampling. The C-index of model 1 was slightly higher both before and after cross-validation, so its predictive power was also stronger (Supplementary Fig. 2).
Fig. 2 The nomogram of the 1-year, 3-year, and 5-year overall survival of patients in the training cohort

A population analysis of different therapy modalities
Using Kaplan-Meier curves, we compared OS and CSS in LSRCC patients (Fig. 3A and B); CSS was similar to OS.
In the same way, we used Kaplan-Meier curves to compare the effects of different therapy modalities (surgery, radiotherapy, and chemotherapy) on patients' survival (Fig. 4). Among the monotherapies, the OS results showed that patients who received surgery only had the best prognosis, followed by chemotherapy only and radiotherapy only. Patients who received no therapy had the worst prognosis (Fig. 4A). The CSS results led to a similar conclusion (Fig. 4C). Among the combined therapy modalities, for both OS (Fig. 4B) and CSS (Fig. 4D), surgery combined with chemotherapy had the best prognosis overall, followed by surgery combined with chemoradiotherapy and surgery combined with radiation. Patients who received radiation combined with chemotherapy had the worst prognosis. It is worth noting, however, that patients who received surgery combined with radiation had the greatest survival benefit for a short time after therapy. Taking the Kaplan-Meier curves of OS and CSS together, surgical intervention was associated with a better prognosis in all of the monotherapy and combined therapy modalities listed.

Age stratification for different therapy modalities
The ages of the included patients spanned a wide range, and therapy suitable for younger patients might not be appropriate for older patients. We therefore used Kaplan-Meier curves to analyze therapy in the different age groups (Fig. 5). For patients aged 22-64 (Fig. 5A), therapy with surgical intervention had, on the whole, a greater benefit on survival. In the long run, the results of surgery alone and of surgery combined with chemotherapy were favorable. Surgery combined with radiation yielded effects similar to surgery alone in the short term but was less suitable for long-term survival. The effect of surgery combined with chemoradiotherapy was worse than that of the other three modalities involving surgery. Radiation alone and chemoradiotherapy had a similar effect in the short term after therapy, and subsequently chemoradiotherapy had a slightly better effect than chemotherapy alone. Patients without therapy and those who received radiation only had the worst prognosis. For patients aged 65-77 (Fig. 5B), therapy with surgical intervention was overall more beneficial for survival, with the exception of surgery combined with radiation. In the long term, surgery combined with chemotherapy, surgery combined with chemoradiotherapy, and chemotherapy only were favorable. Surgery combined with radiation contributed to higher short-term survival, but this did not carry over to long-term survival. Among the therapies without surgery, the survival effect of chemoradiotherapy was similar to that of chemotherapy alone. Patients with chemotherapy alone had a longer survival period than those who received chemoradiotherapy. Patients who received no therapy or radiation alone had a poorer prognosis, but radiation alone was associated with a longer survival period than no therapy. For patients aged 78-85+ (Fig. 5C), unlike the previous two groups, two modalities, surgery combined with radiation and surgery combined with chemotherapy, had substantial long-term results, while the effect of surgery alone and of surgery combined with chemoradiotherapy was not ideal. The effects of the monotherapy modalities were concentrated in short survival. Patients with no therapy had the worst prognosis.
For LSRCC patients, aggressive therapeutic intervention was associated with a better prognosis, and younger patients had relatively longer survival and greater benefit.

Different monotherapy modalities of T.Stage
Clinical T.Stage is commonly used to indicate tumor size; in general, the higher the T.Stage, the larger the tumor volume. Tumor size can also affect the choice and outcome of therapy to some extent. Therefore, to investigate the effect of tumor size on the outcome of different therapy modalities, we plotted Kaplan-Meier curves by T.Stage for the three monotherapy modalities and found significant differences in prognosis under different treatment modalities (Fig. 6). Among patients who received surgery alone (Fig. 6A), there were no T0 patients. Patients with a lower T.Stage had a better prognosis overall; in some cases T3 had a slightly better prognosis than T2, and T4 a better prognosis than T3, but the differences were not significant. As shown in Fig. 6A, patients diagnosed with TX appeared to have shorter survival and the worst prognosis, which may be related to the inability to identify the primary tumor. Among patients who received radiation alone (Fig. 6B), the prognosis of T0 was significantly better than that of the other stages during the initial period of survival. Meanwhile, patients diagnosed with T3 had a better prognosis than T0, T1, and T2 during certain periods of survival. Overall, patients diagnosed with T4 had the worst prognosis, consistent with the expectation that larger tumors are more burdensome for patients. Finally, in the chemotherapy-only group (Fig. 6C), the prognosis of patients diagnosed with T0 was significantly better than that of the other T.Stages. Patients diagnosed with T1 had a greater benefit than T2 overall, but patients diagnosed with T2 and T3 also had a better prognosis than T1 in a certain period. Meanwhile, patients diagnosed with T4 initially had a better prognosis than TX, but later the prognosis of TX was slightly better than that of T4. In summary, patients diagnosed with T0 had a relatively better prognosis overall, regardless of the therapy modality. Regardless of T.Stage, patients who underwent radiation had the worst prognosis.

Calibration and validation of the nomogram
The ROC curve of the training group (Fig. 7A) showed that the AUC of our model was 82.3% (CI: 78.7-85.9%) at 1 year, 86.3% (CI: 82.3-90.3%) at 3 years, and 87.4% (CI: 82.7-92.2%) at 5 years, indicating considerable predictive power. The calibration curve of the nomogram was close to the standard curve in the training group, demonstrating that the nomogram predictions were close to the actual situation (Fig. 8A, B, C).
Validation of the nomogram in the validation group showed an AUC of 86.3% (CI: 81.5-91.1%) at 1 year, 90.8% (CI: 86.9-94.7%) at 3 years, and 92.8% (CI: 88.8-96.8%) at 5 years, which further demonstrated that the predictive power of our model was desirable (Fig. 7B). The calibration curve of the validation group further confirmed that the results predicted by the nomogram were close to the actual situation (Fig. 8D, E, F).

Clinical benefits of the nomogram
The DCA results for the nomogram are shown in Fig. 9. According to the DCA, when medical intervention is triggered at the same threshold probability, the nomogram brought greater net benefit to patients and showed excellent clinical utility (Fig. 9A). The DCA curve of the validation group further supported this conclusion (Fig. 9B).

Discussion
Due to the rarity of LSRCC and its lack of specific clinical manifestations, it is very difficult to diagnose. It is often confused with other types of lung cancer, leading to misdiagnosis and treatment delays, and conventional treatments are often ineffective (Cai et al. 2021). Owing to its unique biology, existing prognostic tools for lung cancer are unsuitable for LSRCC patients (Anwar et al. 2020). However, recent research offers a glimmer of hope: studies show promise for targeted therapies and immunotherapies (Boland et al. 2014; Ibrahim Yildiz 2021). ALK inhibitors have presented an opportunity for the individualized treatment of lung cancers and provided an alternative therapeutic approach for patients intolerant to chemotherapy and radiation therapy (Hao et al. 2015; Kwak et al. 2010). A case report shows that Lorlatinib has an antitumor effect in ALK-positive LSRCC (Ibrahim Yildiz 2021). Lorlatinib is a potent, brain-penetrating, third-generation, macrocyclic ALK/ROS1 TKI with broad-spectrum potency against most known resistance mutations that develop during treatment with existing first- and second-generation ALK TKIs (Ibrahim Yildiz 2021). Depending on the biopsy site and tumor type, the vacuoles of signet ring cells seem to contain quantities of mucin, glycogen, lipid, or immunoglobulin (Yiğit et al. 2018). Historically, research on LSRCC has been limited, with case reports and small analyses dominating the field (Cai et al. 2021). However, a recent rise in SRCC diagnoses across various primary sites has revealed significant variations in survival based on tumor location (Wu et al. 2018). Therefore, further study is urgently needed.
A nomogram is a useful and convenient tool that is widely used for individualized cancer prognosis (Iasonos et al. 2008; Liang et al. 2015; Kattan et al. 2002). We wanted to develop a nomogram to explore significant factors influencing LSRCC prognosis as well as more appropriate treatment modalities. On the nomogram, treatment mode accounts for a relatively large proportion of points and has a greater impact on prognosis. Therefore, our study further analyzed the Kaplan-Meier survival curves of OS and CSS for different treatment modalities. In the age-stratified Kaplan-Meier analysis of different treatment modalities, surgery combined with chemotherapy was overall associated with a better prognosis. With increasing age, fewer treatment approaches benefited patients and outcomes after treatment were poorer; we speculate that this may be linked to the poorer baseline physical condition of older patients. By analyzing Kaplan-Meier survival curves for age-stratified and T.Stage-stratified patients, we can provide a basis for choosing which treatments different patients should receive. Regular physical examination, early tumor detection, and active cooperation with treatment yielded better survival benefits even in older patients, consistent with the results of a previous study (Wang et al. 2020).
Random forests have had incredible success across a variety of learning disciplines and have fared well in machine-learning competitions (Deo 2015). We also constructed the prognostic model using the random forest algorithm and compared it with the Cox risk regression model. We found that the Cox proportional hazard model outperformed the random forest model in terms of predictive accuracy. This finding is consistent with previous research showing that the Cox model is a robust and reliable tool for survival prediction (Poon and Lu 2022; Chen et al. 2020a, b; Moolgavkar et al. 2018). The superior performance of the Cox model in this study may be due to several factors. First, the Cox model is less sensitive to outliers and missing data than the random forest model (Wang and Li 2017; Baralou et al. 2023; Wang et al. 2023). Second, the Cox model can also accommodate nonlinear relationships between the predictor variables and the outcome, for example through categorized covariates such as the age groups used here. Future research should focus on validating our findings in larger and more diverse patient populations. Additionally, future research should investigate the use of the Cox model in conjunction with other machine learning techniques to further improve the accuracy of patient prognosis prediction.
We applied the pec package in RStudio to the nomogram we constructed. For the random forest algorithm, because LSRCC is a rare cancer, it is difficult to obtain optimal performance during model fitting, and overfitting can occur (Liu and Dai 2022).
Of course, our study also has certain limitations. First, as a retrospective study, some bias is inevitable. Second, SRCC is a rare type of cancer, and patients' clinical characteristics are not complete in the SEER database, so a large amount of blank and missing data is inevitable. For example, the Grade variable is mostly blank, so it could not be explored. Third, the latest edition of the AJCC staging manual is the 8th, while the TNM staging of LSRCC patients is still blank even in the 7th edition, so we had to use the 6th edition. Fourth, the SEER database lacks some variables, such as comorbidities and chemotherapy regimens, which may hinder further prognostic analysis. Finally, other independent large-scale datasets are lacking to externally validate the models.

Fig. 6
Overall survival of patients with different T.Stage. (A) Surgery only. (B) Radiotherapy only. (C) Chemotherapy only

Table 1
Characteristics of LSRCC patients LSRCC Lung signet ring cell cancer, CT Chemotherapy, RT Radiotherapy

Table 2
Univariate analysis of overall survival and cancer-specific survival
CI Confidence interval, HR Hazard ratio, OS Overall survival, CSS Cancer-specific survival
[...] 0.15-0.27) and chemotherapy (HR: 0.73; CI: 0.60-0.89) had a better prognosis. According to the results of CSS, eight variables including age, chemotherapy, radiation, primary site, surgery, T.Stage, M.Stage, and N.Stage were identified as independent prognostic factors. Patients aged 65-77 (HR: 1.24; [...]

Table 3
Multivariate analysis of overall survival and cancer-specific survival