Overall survival prediction models for gynecological endometrioid adenocarcinoma with squamous differentiation (GE-ASqD) using machine-learning algorithms

The actual 5-year survival rates for gynecological endometrioid adenocarcinoma with squamous differentiation (GE-ASqD) are rarely reported. The purpose of this study was to evaluate how histological subtypes affected long-term survivors of GE-ASqD (> 5 years). We conducted a retrospective analysis of patients diagnosed with GE-ASqD from the Surveillance, Epidemiology, and End Results (SEER) database (2004–2015). For the analyses, we employed the chi-square test, univariate Cox regression, and the multivariate Cox proportional hazards model. After applying the inclusion and exclusion criteria, a total of 1131 patients with GE-ASqD diagnosed from 2004 to 2015 were included in the survival study, and the sample was randomly split into a training set and a test set at a ratio of 7:3. Five machine learning algorithms were trained on nine clinical variables to predict 5-year overall survival. The AUCs in the training group for the LR, Decision Tree, forest, Gbdt, and gbm algorithms were 0.809, 0.336, 0.841, 0.823, and 0.856, respectively. The AUCs in the testing group were 0.779, 0.738, 0.753, 0.767, and 0.734, respectively. The calibration curves confirmed good performance of the five machine learning algorithms. Finally, the five algorithms were combined to create a machine learning model that forecasts the 5-year overall survival rate of patients with GE-ASqD.


Methods
Data collection. The datasets analysed during the current study are available in the SEER database repository (SEER*Stat 8.3.6, https://seer.cancer.gov/). SEER is a public database. Written informed consent was not required for the present research because the information contained in the SEER database has been de-identified and is publicly available following authorization. Users can download the relevant data free of charge for research and publish articles based on them. Because our study is based on open-source data, it raises no ethical issues or conflicts of interest.
Patient and variable selection. The following exclusion criteria were applied: (I) age < 18 years; (II) not the primary tumor; (III) unknown information about race, stage, regional nodes examined, tumor size, or T, N, M stage; (IV) for the further training and validation of the prognostic model, a survival time of less than 60 months. The following clinicopathologic variables were selected: age at diagnosis, race, sequence number, marital status, stage, surgery status, radiation status, chemotherapy status, regional nodes examined (RN Examined), AJCC T, N, M stage, and primary site. All patients were staged according to the SEER stage: localized, regional, and distant. We employed the sixth edition of the Derived AJCC Stage Group. The X-tile software (https://medicine.yale.edu/lab/rimm/research/software/) was used to convert the continuous variable (age at diagnosis) into a categorical variable by determining the optimal cutoff points 13. We divided age at diagnosis into 18–66-year and 67–95-year categories, using 66 and 95 years as the cutoff values. The main endpoint was overall survival (OS), calculated as the period from diagnosis to death from any cause. The sample was randomly split into a training set and a test set at a ratio of 7:3. The patient selection flowchart is shown in Fig. 1.
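The age categorization and the 7:3 random split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the random seed, and the use of floor rounding are assumptions. With the 907 patients retained for modelling, floor rounding happens to reproduce the set sizes given in the Table 3 header (634 and 273).

```python
import random

def age_group(age, cutoff=66):
    """Map age at diagnosis to the two categories reported in the text."""
    if not 18 <= age <= 95:
        raise ValueError("age outside the 18-95 range covered by the study")
    return "18-66" if age <= cutoff else "67-95"

def split_7_3(patient_ids, seed=0):
    """Shuffle patient identifiers and split them 70% / 30%."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)  # seed is an assumption, for reproducibility
    n_train = int(len(ids) * 0.7)     # floor rounding; the rounding rule is an assumption
    return ids[:n_train], ids[n_train:]

# Hypothetical identifiers 0..906 standing in for the 907 modelled patients.
train_ids, test_ids = split_7_3(range(907))
print(len(train_ids), len(test_ids))
print(age_group(66), age_group(67))
```

Fixing the seed keeps the partition reproducible, which matters when several algorithms are compared on the same test set.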

Machine learning models.
In this study, we used several supervised machine learning algorithms, some of them ensemble-based, namely Logistic Regression (LR), Decision Tree (DT), Random Forest (RF), Light Gradient Boosting Machine (LGBM), and Gradient Boosting (Gbdt), separately to build classification models to stratify GE-ASqD patients, and we searched for the models with the best performance. In machine learning, a random forest (forest) is a classifier that comprises multiple decision trees. Its output category is the mode of the categories output by the individual trees.
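The majority-vote rule described for the random forest can be illustrated in a few lines. This is a sketch of the voting step only, with a hypothetical function name and made-up votes; it does not train any trees. Note that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than by counting hard votes, so the library result can differ slightly from a strict mode.

```python
from collections import Counter

def forest_predict(tree_votes):
    """Return the most common class label among the individual trees' votes."""
    if not tree_votes:
        raise ValueError("need at least one tree vote")
    return Counter(tree_votes).most_common(1)[0][0]

# Illustrative votes from five hypothetical trees for one patient.
votes = [1, 0, 1, 1, 0]
print(forest_predict(votes))
```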
The LightGBM (gbm) algorithm is a boosting machine learning algorithm. It is a fast, distributed, and high-performing gradient boosting framework based on decision tree algorithms. It can be used for ranking, classification, regression, and many other machine learning tasks.
The construction of a decision tree model has two steps: induction and pruning. Induction is the step of constructing a decision tree by setting all hierarchical decision boundaries based on the data at hand. However, a tree model trained in this way is prone to severe over-fitting, and this is when pruning is required. Pruning is the process of removing unnecessary branch structures from the decision tree, which simplifies the tree, reduces over-fitting, and makes it easier to interpret. Boosting is a machine learning technique that can be used for regression and classification problems. It produces a weak prediction model (such as a decision tree) at each step and adds it, weighted, to the total model. If the weak prediction model at each step is fitted in the direction of the negative gradient of the loss function, the method is called gradient boosting (Gbdt).
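The boosting idea above, in which each weak model is fitted to the negative gradient of the loss and added with a weight to the running total, can be sketched with one-split trees (stumps). The sketch assumes a squared-error loss, for which the negative gradient is simply the residual, and a single numeric feature with at least two distinct values. The function names, learning rate, and toy data are illustrative assumptions and do not reflect the configuration used in the study.

```python
def fit_stump(x, residuals):
    """Fit a one-split tree: choose the threshold minimising squared error."""
    best = None
    for t in sorted(set(x))[:-1]:  # every threshold leaves both sides non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def gradient_boost(x, y, n_rounds=20, learning_rate=0.5):
    """Additive model: start from the mean, then add stumps fitted to residuals."""
    base = sum(y) / len(y)
    stumps = []
    pred = [base] * len(y)
    for _ in range(n_rounds):
        # For the loss 0.5 * (y - p)^2 the negative gradient is the residual y - p.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + learning_rate * stump(xi) for xi, pi in zip(x, pred)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)

# Toy data: the outcome switches from 0 to 1 between x = 3 and x = 4.
x = [1, 2, 3, 4, 5, 6]
y = [0, 0, 0, 1, 1, 1]
model = gradient_boost(x, y)
print(round(model(1), 3), round(model(6), 3))
```

For classification, libraries typically replace the squared-error loss with a log-loss, but the fit-to-gradient loop has the same shape.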
For all machine learning analyses, the Python programming language (Python 3.7.13) was used. We used the Python libraries pandas and numpy for basic data processing and sklearn for machine learning.
The parameters of the machine learning models were trained and tested. The models were evaluated and compared by prediction accuracy and the area under the curve (AUC). The F1-measure is an evaluation indicator commonly used in information retrieval and natural language processing; it is the harmonic mean of precision and recall. Precision is the proportion of cases predicted as positive that are truly positive. Accuracy is the number of correctly classified cases divided by the total number of cases. Recall is the proportion of positive cases in the sample that were predicted correctly. MSE (mean squared error) measures the average squared difference between predicted and observed values. Missing data were estimated through multiple imputation.
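The evaluation metrics named above can be computed from the confusion counts as in the following sketch. The function name and the toy labels are illustrative, and treating label 1 as the positive class is an assumption, since the text does not state which outcome was coded as positive.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1 and MSE for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    n = len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / n,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mse": sum((t - p) ** 2 for t, p in pairs) / n,
    }

# Toy example: 8 cases, 3 truly positive.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

When predictions are hard 0/1 labels, every error contributes exactly 1 to the squared-error sum, so MSE equals 1 minus accuracy. That relation is consistent with the LR values reported in the results (accuracy 0.799, MSE 0.201), assuming MSE was computed on class labels rather than on predicted probabilities.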
Statistical analysis. All statistical analyses were conducted using R version 3.6.1 (www.r-project.org). Demographic, clinicopathological, and treatment variables were compared across the histological subtypes using the chi-square test and the Fisher exact test.
Univariate Cox regression analysis was used to identify potential prognostic factors (P values < 0.1). A multivariate Cox proportional hazards model was used to evaluate the prognostic factors associated with OS. Prognostic factors with P values < 0.1 on univariate analysis were entered into the multivariate analysis. Then, a set of machine learning models was developed based on the independent prognostic factors associated with OS for GE-ASqD patients.
According to the results of the univariate Cox regression analysis, race, age at diagnosis, sequence number, stage, surgery status, radiation status, chemotherapy status, regional nodes, and T, N, and M stage were potentially correlated with the OS of GE-ASqD (P < 0.05).
Patients who were alive but had fewer than 60 months of follow-up were excluded, leaving 907 patients for further analysis. Table 3 summarizes the baseline characteristics of the training and validation sets. All variables were similarly distributed between the two sets, with EE-ASqD (95.9% vs. 94.9%) and OE-ASqD (4.1% vs. 5.1%) in the training and validation sets. In both sets, most patients had a sequence number of one primary only (86% vs. 88%). Most of the patients in the training and validation sets were white (83.8% vs. 84.3%), aged 18–66 years (78% vs. 81%), and married (51% vs. 58%). Most malignancies were localized (71.6% vs. 75.8%); approximately 77% and 80.2% of patients in the training and validation sets, respectively, had T1 disease, 89% vs. 92.7% had N0 disease, and 95% vs. 96% had M0 disease. In both sets, almost all patients underwent surgery (96.5% vs. 96.3%), and regional nodes were examined (RN Examined) in 63% vs. 61% of patients, whereas only a few patients received chemotherapy (13%) or radiation (26% vs. 23%).
Prognostic model construction and model performance. In this study, the dataset consisted of information on 907 individual patients. We divided the whole dataset into 70% for training and 30% for testing. Accuracy, precision, recall, F1-score, AUC, and MSE were employed as evaluation metrics to test classifier performance. Figure 2A shows the associated independent risk factors based on a multiple linear regression model. Multiple linear regression models are used to quantify the relationship between predictor variables and a response variable that takes on a continuous value. Two of the most important values in a regression ... The models constructed by the machine learning algorithms in the test group are compared in Figs. 3B and 4B. LR had the highest accuracy (0.799), precision (0.559), and AUC (0.779). The recall rate and F1-score for the gbm algorithm were 0.407 and 0.44, respectively. The lowest F1-score was that of the decision tree, at 0.059. The AUC values of the five algorithms were: LR (0.779), Gbdt (0.767), forest (0.753), DecisionTree (0.738), and gbm (0.734). Among the five algorithms, LR had the lowest MSE value, at 0.201. The calibration curves confirm good performance of the five machine learning algorithms (Fig. 5A and 5B).
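The AUC values compared above can be read as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case. The following sketch computes AUC directly from that definition, counting ties as one half. The function name and the toy scores are illustrative assumptions, and label 1 is again assumed to be the positive class.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    # A correctly ordered pair scores 1, a tied pair 0.5, a reversed pair 0.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: 2 positive and 3 negative cases with predicted scores.
y_true = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.4]
auc = roc_auc(y_true, scores)
print(auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.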

Discussion
In the era of "personalized medicine," the use of prediction models to guide treatment planning has gained increasing interest among clinicians. Individualized treatment aims to minimize unnecessary exposure to therapy-related morbidity while offering proper management for high-risk patients. The combination of PARP  15,16 .
It is now evident that ovarian cancer (OC) and endometrial cancer (EC) are not single diseases but categories comprising several distinct histotypes. The medical community's understanding of OC and EC has changed significantly over the past few years 17,18. Katelyn et al. also suggest that each morphologic subtype has potential therapeutic implications 19. Additionally, the likelihood of death varied noticeably among patients with different histological subtypes 20. Less research in gynecologic oncology has focused on gynecologic endometrioid adenocarcinoma with squamous differentiation, but this subtype needs to be examined because of the unique prognostic determinants and association patterns that our findings suggest throughout the survival trajectory. Data from the SEER program provide a unique opportunity to study a rare disease given the large, nationally representative sample of cancer patients, extensive follow-up information, and availability of detailed histomorphologic data. According to the data extracted from the SEER database in our current study, GE-ASqD is more common among young white patients and is generally diagnosed at an early stage, and most patients underwent surgical treatment and intraoperative lymph node examination.

Table 3. The baseline characteristics of the training and validation sets used in the prognostic model. Columns: Characteristics; Training (N = 634), n (%); Validation (N = 273), n (%); P.
Although the demographic characteristics of the present study suggest that non-white populations account for a very small proportion of GE-ASqD, many studies on ovarian cancer suggest poorer survival for Black patients than for other ethnic groups. Accordingly, the recently established Ovarian Cancer in Women of African Ancestry (OCWAA) consortium 21 analyzed key differential factors by examining various characteristics of patients in large national cancer databases or medical databases. Studies have also noted higher stage and grade, higher histological risk, and worse survival in Black patients with endometrial cancer (EC) than in white patients 22,23. Studies have also shown that Black women are 2.5 times more likely to die from endometrial cancer 24. In the multivariate analysis of our study, age, surgery, chemoradiotherapy, lymph node examination, and the presence or absence of node-positive metastases were significantly associated with patient prognosis. Our analysis based on LR models and tree models showed that T stage and the presence of positive lymph node metastasis were significantly associated with 5-year survival. In tree-based models in particular, age and nodal status were shown to be significantly associated with survival.
According to international guidelines 25,26, the fundamental management of gynecological cancers is standard surgery or cytoreductive surgery performed by a team of trained gynecological oncologists. Most patients who present with early-stage disease are cured by surgery. In this study, we did not further  27–29 . The significance of conventional lymphadenectomy for improving outcomes in early clinical endometrial cancer is controversial, and it is strongly associated with a 15% to 20% rate of surgery-related morbidity 30. Few attempts have been made to predict the risk of LNM before surgical treatment 31–33. In recent years, gynecologic oncologists have used a dye tracer injected in the first few minutes of the operation, combined with a fluorescence microscope, to assess accurately whether there are metastases in the sentinel (first-station) lymph nodes and distant lymph nodes. When the results are positive, systematic lymph node dissection and para-aortic lymph node dissection are performed to improve the patient's prognosis 34. The implementation of this technology has also had a positive impact on reducing healthcare costs. Among the patients in this study, ovarian cancer patients receiving radiotherapy and endometrioid carcinoma patients receiving chemotherapy were a minority. It has been demonstrated that radiation therapy increases survival in individuals with high-risk endometrial cancer but not in those with intermediate-risk disease 35. It also increases costs and carries a higher risk of morbidity. It is advised that patients with mild endometrial cancer refrain from radiation therapy for the time being and instead undergo follow-up observation.
This study indicates that dividing gynecologic tumors into clinically meaningful subgroups allows a better understanding of the pathological development and pathogenesis of these tumors. In the era of individualized and precise treatment, this makes it possible to select and improve individualized treatment based on machine learning and other prognostic prediction methods. However, our study has some limitations. First, SEER lacks data on chemotherapy regimens and on the patterns and timing of recurrence. Second, outcome data for individuals receiving targeted therapies were not included in the sample, which may have made the prediction model less comprehensive. Finally, further external validation in different geographic regions and etiologies is necessary.

Conclusions
In this work, five machine learning algorithms were used to build predictive models after analyzing the clinical traits and prognosis of patients with GE-ASqD. The 5-year OS of patients with GE-ASqD could be accurately predicted using the machine learning model, which may aid clinicians in making more precise and individualized therapeutic decisions. This is especially important for improving the long-term prognosis of high-risk patients with the squamous differentiation histological subtype.

Data availability
The code to perform all presented studies is written in R or Python and is freely available on GitHub: https://github.com/users/mimimay-cpu/projects/3/views/1?pane=issue&itemId=24276328.