Abstract

Background. A more accurate prediction of liver metastasis (LM) in pancreatic cancer (PC) would help improve clinical therapeutic effects and follow-up strategies for the management of this disease. This study aimed to assess various machine learning-based prediction models for evaluating the risk of LM. Methods. We retrospectively reviewed the clinicopathological characteristics of PC patients recorded in the Surveillance, Epidemiology, and End Results database from 2010 to 2018. Logistic regression, extreme gradient boosting, support vector machine, random forest (RF), and deep neural network algorithms were used to establish models to predict the risk of LM in PC patients. Specificity, sensitivity, and receiver operating characteristic (ROC) curves were used to determine the discriminatory capacity of the prediction models. Results. A total of 47,919 PC patients were identified, of whom 15,909 (33.2%) developed LM. After iterative filtering, nine features were included to establish the machine learning risk model for LM. Among the models, the RF showed the most promising results in the prediction of LM (ROC 0.871 for the training set and 0.832 for the test set). In risk stratification analysis, the LM rate and 5-year cancer-specific survival (CSS) in the high-risk group were worse than those in the intermediate- and low-risk groups. Surgery, radiotherapy, and chemotherapy were found to significantly improve the CSS in the high- and intermediate-risk groups. Conclusion. The RF model constructed in this study could accurately predict the risk of LM in PC patients and has the potential to provide clinicians with more personalized clinical decision-making recommendations.

1. Introduction

Pancreatic cancer (PC) is the fourth leading cause of cancer-related mortality in the USA, and it causes an estimated 25,270 deaths per year worldwide, accounting for 8% of the total cancer death toll [1]. Pancreatic cancer has a 5-year survival rate of <8%, and up to 80% of patients with PC already have distant organ metastasis at the time of diagnosis, which significantly reduces survival benefits from surgical resection of the primary tumor [2]. Thus, an accurate assessment of locoregional and/or distant metastases in patients with PC is essential to determine whether these patients should undergo additional surgical resection or other combination therapies.

The liver is the most common site of metastasis, accounting for 37-41.9% of the initially diagnosed cases [3, 4]. Moreover, more than 60% of patients who undergo tumor resection relapse with distant liver recurrence within the first 24 months after surgery [5]. Magnetic resonance imaging, computed tomography, and ultrasonography are currently the most commonly used examination methods. However, their use is constrained by cost, the experience of the examining physician, and other factors, which can substantially affect clinicians' judgment. Thus, a better prognostic model for the prediction of liver metastasis (LM) in PC is critical to improve treatment and patient outcomes.

The dismal outcomes of PC partly result from its aggressive metastatic nature, but applying appropriate treatment options according to the disease course can improve patient survival. In this study, we aimed to establish a novel, highly reliable prediction model for liver metastasis based on clinical parameters and simple histopathological features, which could help to improve risk stratification of patients with early PC.

2. Materials and Methods

2.1. Data Source and Study Population

This retrospective study was carried out based on the Surveillance, Epidemiology, and End Results (SEER) database. The publicly available data was collected from 18 cancer registries between January 1, 2010, and December 31, 2018, using SEER-Stat software (ver. 8.3.5). The patients’ files from the SEER database were accessed with official permission, and patients’ records were anonymized. The study was approved by the Ethics Committee of the National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences.

2.2. Main Outcomes and Selected Variables

Patients with primary pancreatic cancer were included in this cohort study. The target outcome was hepatic metastasis of pancreatic cancer. Cancer diagnoses were classified by topography and histology according to the International Classification of Diseases for Oncology-3 (ICD-O-3)/WHO 2008 guidelines. The primary pancreatic tumor locations included C25.0 (head of the pancreas), C25.1 (body of the pancreas), C25.2 (tail of the pancreas), C25.3 (pancreatic duct), C25.4 (islets of Langerhans), C25.7 (other specified parts of the pancreas), C25.8 (overlapping lesion of the pancreas), and C25.9 (pancreas, not otherwise specified [NOS]). The exclusion criteria were as follows: (1) unknown metastasis status at diagnosis; (2) no pathohistological diagnosis; (3) age younger than 20 years; (4) benign or borderline tumors; and (5) missing information on race, histological type, or treatment strategy. The derived American Joint Committee on Cancer (AJCC) 6th edition and SEER combined stage (2016+) TNM staging systems were used in this study. Patient demographics included gender, age, year at initial diagnosis, and race. Other recorded variables included lymph node biopsy, surgery, tumor size, marital status, survival status, survival time, the presence of distant metastasis, TNM staging (tumor, lymph node metastasis, and distant metastasis), insurance status, and radiation and chemotherapy records. A flowchart of the data collection process is presented in Figure 1.

2.3. Feature Engineering and Data Transformation

These readily available clinical and demographic variables from the SEER database were processed using feature engineering techniques to build the models. Continuous variables (age and year at initial diagnosis, tumor size, and number of positive lymph node biopsies) were converted into categorical variables according to clinical cutoffs or the median. To make the prediction model more practical, we employed cross-validation (CV) and recursive feature elimination to iteratively filter variables using the random forest (RF) classifier. CV was used for internal validation as a robust method for evaluating machine learning models and improving their performance [6]. The variables were ranked by their relative importance for the receiver operating characteristic (ROC) performance of the models.
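As an illustration of the filtering step, recursive feature elimination can be sketched in Python as follows. This is not the authors' code, which was written in R: `score_fn` stands in for a cross-validated ROC estimate of the RF classifier, `importance_fn` stands in for the RF feature importance, and the toy importance values and feature names below are hypothetical.

```python
def recursive_feature_elimination(features, score_fn, importance_fn, min_features=1):
    """Backward elimination: repeatedly drop the least important feature
    and remember the subset with the best validation score."""
    current = list(features)
    best_score, best_subset = score_fn(current), list(current)
    while len(current) > min_features:
        importance = importance_fn(current)
        weakest = min(current, key=lambda name: importance[name])
        current.remove(weakest)
        score = score_fn(current)
        if score >= best_score:  # ties favor the smaller subset
            best_score, best_subset = score, list(current)
    return best_score, best_subset


# Hypothetical importances (NOT values from the study).
TOY_IMPORTANCE = {
    "surgery": 0.40,
    "tumor_size": 0.25,
    "t_stage": 0.20,
    "marital_status": 0.01,
    "insurance": 0.02,
}


def toy_importance(subset):
    """Stand-in for the importance reported by a fitted RF model."""
    return {name: TOY_IMPORTANCE[name] for name in subset}


def toy_cv_score(subset):
    """Stand-in for a cross-validated score: informative features add
    signal, uninformative features add a small penalty."""
    signal = sum(TOY_IMPORTANCE[n] for n in subset if TOY_IMPORTANCE[n] >= 0.1)
    noise = 0.03 * sum(1 for n in subset if TOY_IMPORTANCE[n] < 0.1)
    return signal - noise


best_score, selected = recursive_feature_elimination(
    TOY_IMPORTANCE, toy_cv_score, toy_importance
)
print(selected)
```

In this toy run the two uninformative features are dropped first and the three informative ones are retained, because removing any of them lowers the stand-in score.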

2.4. Risk Model Establishment and Risk Stratification

All patients included in this study were randomly divided into independent training (80%) and test (20%) sets using R [7]. The prediction models were built on the training set and then evaluated and validated on the test set. The extreme gradient boosting (XGBoost), RF, support vector machine (SVM) [8], deep neural network (DNN), and logistic regression (LR) algorithms were trained with 10-fold CV on the training set. Univariate and multivariate logistic regression analyses were employed to identify the features significantly correlated with the risk of hepatic metastasis. In addition, correlation analysis was performed on the included features to evaluate their mutual relationships. The machine learning models were established and evaluated using the caret package in R.
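The splitting scheme can be illustrated with a short Python sketch. It is only an approximation of the procedure described, which was carried out in R with caret: the random seed, the rounding of the test-set size, and the unstratified interleaved fold assignment are assumptions made here.

```python
import random


def train_test_split(n_samples, test_fraction=0.2, seed=0):
    """Randomly split sample indices into a training and a test set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold_indices(indices, k=10):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# 47,919 is the cohort size reported in the study; the seed is arbitrary.
train_idx, test_idx = train_test_split(47919)
folds = list(k_fold_indices(train_idx, k=10))
print(len(train_idx), len(test_idx), len(folds))
```

Each patient in the training set appears in exactly one validation fold, so every model is evaluated once on data it was not fitted to before the held-out test set is used.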

According to our preliminary findings, the performance of the different machine learning algorithms in predicting LM was roughly the same, but RF tended to perform better on both the training and test sets. To further evaluate the risk of LM in PC patients, we calculated a risk score for every patient based on the RF model and then ranked the patients by risk score from high to low. The patients were divided into three risk groups of equal size (high-risk, intermediate-risk, and low-risk), which can inform the selection of a suitable treatment strategy [9].
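The ranking and grouping step can be sketched in Python as follows. This is a minimal illustration under the assumption that ties in the risk score are broken by input order; the function name and the toy scores are hypothetical, and the risk scores themselves would come from the fitted RF model.

```python
def stratify_into_tertiles(risk_scores):
    """Rank patients by risk score (high to low) and assign each to one of
    three equally sized groups: high, intermediate, or low risk."""
    order = sorted(range(len(risk_scores)), key=lambda i: risk_scores[i], reverse=True)
    n = len(order)
    cut_high, cut_mid = n // 3, 2 * n // 3
    groups = [None] * n
    for rank, patient in enumerate(order):
        if rank < cut_high:
            groups[patient] = "high"
        elif rank < cut_mid:
            groups[patient] = "intermediate"
        else:
            groups[patient] = "low"
    return groups


toy_scores = [0.91, 0.10, 0.55, 0.72, 0.05, 0.33]  # toy values, not study data
toy_groups = stratify_into_tertiles(toy_scores)
print(toy_groups)
```

Because 47,919 is divisible by three, this rule would place exactly 15,973 patients in each group for a cohort of that size.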

2.5. Statistical Analysis

The chi-squared test was employed to assess the significance of differences in categorical variables between the training set and the test set, while the Mann–Whitney test was used for continuous variables. The Kaplan-Meier method and log-rank test were used to evaluate differences among subgroups in univariate survival analysis. Cancer-specific survival (CSS) and survival time were the main evaluation indices. Propensity score matching (PSM) was used to balance PC patients with and without treatment at a ratio of 1 : 1. To measure the performance of the models, the sensitivity, specificity, Gini score, and area under the ROC curve, as well as the 95% confidence intervals (CIs), were calculated based on the number of correctly classified true positive (TP) cases and the number of incorrectly classified false positive (FP) cases. The DeLong test was employed to compare the performance of the models in identifying liver metastasis. All analyses were performed using R version 3.6.1.
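For readers unfamiliar with these performance measures, the following Python sketch shows one common way to compute them. It is not the authors' implementation: the 0.5 classification threshold and the definition of the Gini score as 2 × AUC − 1 are assumptions, and the labels and scores are toy values.

```python
def sensitivity_specificity(y_true, y_score, threshold=0.5):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    pairs = list(zip(y_true, y_score))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    tn = sum(1 for y, s in pairs if y == 0 and s < threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(y_true, y_score):
    """Area under the ROC curve as the probability that a random positive
    case scores higher than a random negative case (ties count one half)."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def gini_from_auc(auc_value):
    """Gini coefficient under the common convention Gini = 2 * AUC - 1."""
    return 2 * auc_value - 1


y_true = [1, 1, 1, 0, 0, 0, 0]                 # 1 = liver metastasis (toy labels)
y_score = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1]  # toy predicted risks
sens, spec = sensitivity_specificity(y_true, y_score)
auc = roc_auc(y_true, y_score)
print(sens, spec, auc, gini_from_auc(auc))
```

Under this convention, an area under the ROC curve of 0.832 would correspond to a Gini score of about 0.664.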

3. Results

3.1. Demographic and Clinicopathological Characteristics

A total of 47,919 pancreatic cancer patients from the SEER database were analyzed in this study (Figure 1), of whom 15,909 (33.2%) developed liver metastasis. Overall, 20,046 (41.8%) PC patients were over 70 years old, 30,702 (64.1%) had T3/T4 disease, and 19,313 (40.3%) had N1/N2 disease. Among the patients who developed LM, the primary tumor was located in the head of the pancreas in 39.0%, the tail in 24.1%, and the body in 18.0%. All patients were randomly divided into a training set and an internal test set at a ratio of 8 : 2 (Figure 1). All demographic and clinicopathological variables of these patients are detailed in Table 1.

3.2. Variable Feature Importance of Liver Metastasis Prediction

To evaluate the association between these features and the risk of liver metastasis, univariate and multivariate logistic regression analyses were performed (Table 2). Age at PC diagnosis, gender, race, primary tumor site, T and N stage, tumor histology, tumor size, surgery, chemotherapy, and radiotherapy were significant predictors of liver metastasis in both the univariate and multivariate analyses. Tumors in the body (adjusted OR, 1.63; 95% CI, 1.53-1.73) and tail (adjusted OR, 3.23; 95% CI, 3.02-3.45) of the pancreas carried a higher risk of liver metastasis than tumors in the head of the pancreas. Both chemotherapy (adjusted OR, 0.17; 95% CI, 0.16-0.19) and surgery (adjusted OR, 0.10; 95% CI, 0.09-0.11) were associated with a significantly lower risk of liver metastasis, whereas radiotherapy (adjusted OR, 1.08; 95% CI, 1.03-1.13) was positively associated with the risk of liver metastasis.
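The relationship between a logistic regression coefficient and the reported odds ratio can be made concrete with a short Python sketch. It assumes the usual Wald interval, OR = exp(β) with 95% CI exp(β ± 1.96 × SE); the paper does not state which interval was used, and the function names are hypothetical. The example backs out β and SE from the odds ratio reported for tumors in the tail of the pancreas and then reconstructs the interval.

```python
import math


def odds_ratio_ci(beta, se, z=1.96):
    """Odds ratio and Wald confidence interval from a logistic coefficient."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


def coefficient_from_reported(odds_ratio, lower, upper, z=1.96):
    """Recover the coefficient and its standard error from a reported
    odds ratio and confidence interval (assumes a Wald interval)."""
    beta = math.log(odds_ratio)
    se = (math.log(upper) - math.log(lower)) / (2 * z)
    return beta, se


# Reported for tumors in the tail vs. the head of the pancreas.
beta, se = coefficient_from_reported(3.23, 3.02, 3.45)
or_value, ci_low, ci_high = odds_ratio_ci(beta, se)
print(round(beta, 3), round(se, 3), round(ci_low, 2), round(ci_high, 2))
```

The reconstructed interval agrees with the reported one to two decimal places, which is what would be expected if the interval is symmetric on the log-odds scale.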

3.3. Model Performance

To establish the prediction models, we used recursive feature elimination and 10-fold CV to iteratively select features with the RF classifier. Nine features (tumor histology, chemotherapy, N stage, age at PC diagnosis, tumor size, primary tumor site, T stage, radiotherapy, and surgery) were selected and included in model development.

Five risk models were established based on the selected features. We evaluated the importance of the selected features for predicting liver metastasis in the five models by the size of their gain values (Figure 2). Although feature importance varied slightly among the models, surgery, radiotherapy, primary tumor site, and tumor size consistently ranked at the top of the list. Tumor treatments (including surgery, radiotherapy, and chemotherapy) were closely associated with liver metastasis.

The specificity, sensitivity, ROC value, and Gini score were calculated to assess the reliability of each model (Table 3). The results showed that the RF model had the best performance in both the training and test sets (ROC values of 0.871 and 0.832, respectively), compared with XGB (0.837 in the test set), DNN (0.832 in the test set), SVM (0.839 in the test set), and LR (0.821 in the test set). The sensitivity and specificity values were consistent with these results.

3.4. Risk Stratification for Patients

We calculated a risk score for predicting liver metastasis for each pancreatic cancer patient with the RF classifier. The patients were ranked by risk score from high to low and divided evenly into three risk groups of about 15,973 (33.3%) patients each (Figure 3(a)), so that risk scores were highest in the high-risk group and lowest in the low-risk group. Liver metastasis was present in 11,905 (74.5%) patients in the high-risk group, 3898 (24.4%) patients in the middle-risk group, and 106 (0.7%) patients in the low-risk group, and the proportions differed significantly among the three groups. We then compared the 5-year CSS among the three groups (Figure 3(b)). The survival probabilities differed significantly: the 5-year CSS was 2.6% in the high-risk group, 4.8% in the middle-risk group, and 26.2% in the low-risk group. In univariate Cox regression analysis, the HR was 2.98 (95% CI, 2.91-3.07) for the middle-risk group vs. the low-risk group, 3.99 (95% CI, 3.88-4.11) for the high-risk group vs. the low-risk group, and 1.32 (95% CI, 1.28-1.35) for the high-risk group vs. the middle-risk group; pancreatic cancer patients with higher risk scores had worse survival.
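The reported proportions can be checked with a short Python sketch that also computes a Pearson chi-squared statistic for the three-group comparison. It assumes that each risk group contains exactly 15,973 patients (the text gives this figure as approximate) and that a Pearson chi-squared test was the test applied; the function name is hypothetical.

```python
def chi_squared_statistic(table):
    """Pearson chi-squared statistic and degrees of freedom for a
    contingency table given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


GROUP_SIZE = 15973  # assumed exact size of each risk group
lm_counts = {"high": 11905, "middle": 3898, "low": 106}  # reported LM cases

# Rows: risk groups; columns: [with LM, without LM].
table = [[n, GROUP_SIZE - n] for n in lm_counts.values()]
percentages = {g: round(100 * n / GROUP_SIZE, 1) for g, n in lm_counts.items()}
statistic, dof = chi_squared_statistic(table)
print(percentages, sum(lm_counts.values()), dof)
```

The three counts sum to the 15,909 patients with liver metastasis reported for the whole cohort, and the statistic lies far above 5.991, the 5% critical value for two degrees of freedom.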

3.5. The Treatment for Three Risk Groups

To evaluate the therapeutic effect of surgery, chemotherapy, and radiotherapy in the different risk groups, we used propensity score matching at a ratio of 1 : 1 to balance the demographic and clinicopathological characteristics of patients who did and did not receive each treatment. Matching was based on age at PC diagnosis, race, gender, T stage, N stage, year of PC diagnosis, tumor size, and histology. We then analyzed the 1-year and 5-year CSS of the matched patients in each risk group. In the high-risk group, patients receiving surgery (HR, 0.31; 95% CI, 0.21-0.46), chemotherapy (HR, 0.42; 95% CI, 0.40-0.44), or radiotherapy (HR, 0.81; 95% CI, 0.69-0.96) had better CSS than patients not receiving the respective treatment (Figures 4(a)–4(c)). In the middle-risk group, patients receiving surgery (HR, 0.31; 95% CI, 0.28-0.35), chemotherapy (HR, 0.53; 95% CI, 0.51-0.56), or radiotherapy (HR, 0.72; 95% CI, 0.60-0.78) also had better CSS than patients not receiving the respective treatment (Figures 4(d)–4(f)). In the low-risk group, patients receiving surgery (HR, 0.29; 95% CI, 0.27-0.32) had better survival than patients who did not undergo surgery (Figure 4(g)). However, chemotherapy and radiotherapy did not appear to improve survival or prognosis in the low-risk group (Figures 4(h) and 4(i)).
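One simple form of 1 : 1 propensity score matching is greedy nearest-neighbor matching without replacement, sketched below in Python. The paper does not report the matching algorithm or a caliper, so both are assumptions here, as are the function name and the toy propensity scores; in practice the scores would be estimated from the covariates listed above.

```python
def match_one_to_one(treated, control, caliper=0.05):
    """Greedy 1:1 nearest-neighbor matching on the propensity score.

    treated, control: dicts mapping patient id -> propensity score.
    Each control is used at most once; pairs whose scores differ by
    more than the caliper are not matched.
    """
    available = dict(control)
    pairs = []
    for t_id, t_score in sorted(treated.items(), key=lambda item: item[1]):
        if not available:
            break
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]
    return pairs


# Toy propensity scores (NOT study data).
treated = {"t1": 0.30, "t2": 0.62, "t3": 0.95}
control = {"c1": 0.28, "c2": 0.60, "c3": 0.33, "c4": 0.10}
pairs = match_one_to_one(treated, control)
print(pairs)
```

In this toy example the third treated patient has no control within the caliper and is left unmatched, which is how such a rule trades sample size for covariate balance.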

4. Discussion

In this study, we collected data from the SEER database, covering 47,919 patients with PC. Given this sample size, the trends in this dataset are likely to be broadly representative. We described the clinical characteristics of PC patients with or without LM and factors that predict the risk of LM in these patients. The univariate and multivariate logistic regression analyses showed that the age at PC diagnosis, gender, race, primary tumor site, T and N stage, tumor histology, size, surgery, chemotherapy, and radiotherapy were significantly correlated with the risk of liver metastasis in PC. This result was consistent with similar studies. Metastases are more often observed in younger patients than in elderly patients; younger patients usually have more malignant tumors with more aggressive histological features, which may lead to higher rates of liver metastasis or other forms of distant metastasis [10, 11]. Gender is also related to liver metastases, which are less frequent in female patients [12]. Tumor site, grade, size, and lymph node (LN) metastasis were all previously identified as independent predictors of liver metastasis in patients with PC [13]. Studies have shown that primary tumors located in the body and tail of the pancreas are more prone to liver metastases than primary tumors in the head of the pancreas. Compared with tumors in the head of the pancreas, tumors in the body and tail are larger or more frequently diagnosed at an advanced stage, which may increase the risk of liver metastases in these patients [14]. Since patient counseling and decisions are based on estimates derived from individual risk profiles, these risk factors may help customize liver monitoring and clinical decision-making.

Distant metastasis is a sign of advanced cancer, indicating a poor prognosis for PC patients. Approximately 60% of pancreatic cancer patients are diagnosed with metastasis, especially liver metastasis [15]. Surgery is considered to be the best potentially curative treatment for PC patients, but the indications for tumor resection remain controversial. Although a few scholars disagree [16, 17], most studies advocate that surgical resection of the primary tumor and liver metastases should be the preferred choice for patients with resectable PC with liver metastases [18–20]. Surgical removal of the primary tumor and metastases can improve the quality of life and prolong survival, especially in patients with oligometastatic PC [21–23]. Timely diagnosis of LM is therefore crucial, since it can provide evidence and recommendations for oncologists to make appropriate clinical treatment decisions. Unfortunately, conventional imaging tests for the diagnosis of liver metastases, such as Doppler ultrasound, magnetic resonance imaging, or computed tomography, have not shown high sensitivity and specificity in PC [24, 25]. Moreover, multiple imaging examinations increase the financial burden on patients. Therefore, it is important to establish a model that can accurately predict the probability of LM in PC patients. In this study, we assessed available predictive models using the SEER dataset; these models demonstrated significant discrimination and calibration and can provide a basis for formulating an optimal surgical plan. Using this approach, PC patients can be divided into different risk grades so that LM review plans can be tailored to the level of risk. Effective clinical decision-making can save patients considerable time and cost.

In spite of its promising results, this study still has several limitations. First, this is a retrospective study. Second, due to intrinsic limitations of the database, nonunified selection criteria were employed for patients and detailed information about the treatment was not recorded, such as operation details, chemotherapy plan, and radiation therapy plan, inter alia. Third, the major limitation of our study is the lack of important variables, such as time-to-treatment, type of surgery, patient status, and tumor burden at the surgical margin. Finally, further validation based on a large-scale external cohort is needed.

5. Conclusion

The RF model constructed in this study could accurately predict the risk of LM in PC patients, which may provide clinicians with more personalized clinical decision-making recommendations. The therapeutic effect of treatment is expected to be different for pancreatic cancer patients in the three risk groups based on the RF model. Machine learning technology has the potential to provide reliable individual PC treatment recommendations.

Data Availability

Corresponding authors may provide data to support the findings of this study upon reasonable request.

Conflicts of Interest

The authors declare that they have no competing interests.

Authors’ Contributions

LQG, LD, and HXB participated in the study concept and design. HXB and BL helped in coordination and helped to draft the manuscript. XJY and LXR performed the data analysis. All authors contributed to data analysis, drafting and revising the article, gave final approval of the version to be published, and agree to be accountable for all aspects of the work.

Acknowledgments

This work was supported by the Natural Science Foundation of Henan Province, China (Grant No. 212300410397).

Supplementary Materials

Supplementary Table 1: demographic and tumor characteristics of pancreatic cancer patients in high-, mid-, and low-risk groups. Supplementary Figure 1: ROC of liver metastasis prediction for the random forest, extreme gradient boosting, deep neural network, support vector machine, and logistic regression in the training set and test set. DeLong test: na, p value > 0.05; *, p value < 0.05; **, p value < 0.01; ***, p value < 0.001. Supplementary Figure 2: (A) survival comparison between PC patients who did and did not receive chemotherapy in the middle-risk group (after PSM). (B) Survival comparison between PC patients who did and did not receive chemotherapy in the low-risk group (after PSM). (C) Survival comparison between PC patients who did and did not receive radiotherapy in the low-risk group (after PSM). (Supplementary Materials)