Application of artificial intelligence in a real-world research for predicting the risk of liver metastasis in T1 colorectal cancer

The liver is the most common metastatic site of colorectal cancer (CRC), and liver metastasis (LM) determines subsequent treatment as well as the prognosis of patients, especially T1 patients. T1 CRC patients with LM are recommended to undergo surgery and systemic treatment rather than endoscopic therapy alone. Nevertheless, there is still no effective model to predict the risk of LM in T1 CRC patients. Hence, we aimed to construct an accurate predictive model and a tool that is easy to use clinically. We integrated two independent CRC cohorts, from the Surveillance, Epidemiology, and End Results database (SEER, training dataset) and Xijing hospital (testing dataset). Artificial intelligence (AI) and machine learning (ML) methods were adopted to establish the predictive model. A total of 16,785 and 326 T1 CRC patients from the SEER database and Xijing hospital, respectively, were incorporated into the study. Every single ML model demonstrated strong predictive capability, with an area under the curve (AUC) close to 0.95, and a stacking bagging model displayed the best performance (AUC = 0.9631). As expected, the stacking model exhibited favorable discriminative ability and identified all eight LM cases among the 326 T1 patients in the external validation cohort. In the subgroup analysis, the stacking model also demonstrated strong predictive ability for patients with tumor sizes ranging from 1 to 50 mm (AUC = 0.956). We established an innovative and convenient AI model for predicting LM in T1 CRC patients, which was further verified in the external dataset. Finally, we designed a novel and easy-to-use decision tree, which incorporated only four fundamental parameters and could be applied in clinical practice.


Introduction
Colorectal cancer (CRC) is one of the most prevalent gastrointestinal tract malignancies, with considerably high morbidity and mortality, and draws increasing attention every year [1][2][3]. In two-thirds of CRC patients, metastasis is recognized as both a pivotal clinical feature and a risk factor for high mortality in intractable CRC [4]. During the progression of CRC, over 50% of patients develop liver metastasis (LM), which is the predominant contributor to the unfavorable prognosis of CRC [4,5]. Synchronous LM is identified at the time of diagnosis, and 15-25% of CRC patients have synchronous LM [6,7].

Open Access
Cancer Cell International. *Correspondence: zhangchy73@mail2.sysu.edu.cn; hjb2015@stu.xjtu.edu.cn. † Tenghui Han, Jun Zhu and Xiaoping Chen contributed equally to this work. 6 Division of Digestive Surgery, Xijing Hospital of Digestive Diseases, Airforce Medical University, Xi'an, China. Full list of author information is available at the end of the article.
Endoscopic therapy is widely accepted as a valid therapeutic method for T1 CRC patients. Nonetheless, for early CRC patients with LM, conventional surgical excision, neoadjuvant chemotherapy and radiofrequency ablation are the most effective and recommended treatments, and they significantly improve the 5-year overall survival (OS) rate of CRC patients [8,9]. However, because early screening methods are inadequate, approximately 90% of CRC patients with LM fail to be diagnosed precisely in the early stage and thus undergo incomplete endoscopic resection, which ultimately gives rise to undesirable clinical outcomes [10,11]. Although abundant in-depth research has been conducted on metastasis-related signatures in vivo and in vitro, a satisfactory predictive model of LM for early-stage CRC is still lacking [12][13][14]. Consequently, we aimed to develop an easy-to-use model that predicts the risk of LM accurately and robustly for patients in the early stage of CRC.
Currently, medical science and artificial intelligence (AI) are becoming increasingly and irreversibly integrated [15][16][17], and both the depth and the breadth of this integration have grown significantly [14,15]. Researchers have employed machine learning (ML) as the entry point for the complicated problem of CRC clinical prediction and have achieved many significant breakthroughs [18][19][20]. Nevertheless, these findings shed light only on T1 CRC with lymph node metastasis, an area that remains largely unexplored with ML. Because most previous investigations relied solely on public databases despite the apparent differences among diverse populations, their limitations were inevitable. Consequently, real-world clinical data for external validation are vital for constructing a superior prediction model.
In this study, we developed a comprehensive recognition model using AI and ML algorithms, which could improve the identification of T1 CRC with LM and the prognosis of these patients in clinical practice. In addition, the predictive model was constructed from clinically common and accessible parameters and further validated in an independent CRC cohort.

Clinical sample collection
An open-access, publicly available CRC cohort was retrieved from the Surveillance, Epidemiology, and End Results (SEER) Program database of the U.S. National Cancer Institute. This CRC cohort serves as a powerful resource for investigators to understand the natural history of CRC and has significantly improved the quality of healthcare for CRC patients [21,22]. An additional external validation cohort of CRC patients who underwent surgery from 2010 to 2021 was obtained from Xijing hospital. The inclusion criteria for the CRC cohort were as follows: (1) the primary diagnosis was CRC; (2) patients were diagnosed with T1 CRC; (3) liver reexamination was completed within six months of diagnosis; (4) patients had sufficient clinical data. The exclusion criteria were as follows: (1) patients who had undergone neoadjuvant radiotherapy; (2) metachronous liver metastases (after diagnosis); (3) comorbidity with other tumors; (4) comorbidity with serious cardiopulmonary disease. Written informed consent was obtained from all participants. All aspects of the clinical cohort study were evaluated by the Institutional Ethics Committee of Xijing Hospital.

Study population
T1 CRC is defined as a tumor that invades only the submucosa, regardless of the presence or absence of lymph node metastasis (LNM). Using the SEER database, which employs the 7th edition of the American Joint Committee on Cancer TNM staging system, we analyzed the data of all patients diagnosed with T1 CRC from 2010 to 2016. Primary demographic data, tumor information and laboratory indexes were extracted using SEER disease codes and then employed for model construction. Demographic data included age at diagnosis, gender, race, and marital status. Tumor information contained primary site, size, grade, histologic category and TNM stage. Laboratory indexes involved carcinoembryonic antigen (CEA) prior to surgery, tumor deposits, and perineural invasion (PNI). Survival time and status were collected for further clinical evaluation of the predictive model. Furthermore, the information of our validation cohort was normalized according to the criteria of the SEER database (Additional file 1: Table S1). All clinical information underwent data transformation for use in model construction (Additional file 2: Table S2).

Construction of the predictive model
In our research, seven ML models were employed to predict LM in patients with T1 stage CRC. To build decision-tree models, we adopted Light Gradient Boosting Machine (LGBM), Random Forest (RF), and Classification and Regression Trees (CART).
LGBM is a gradient boosting framework that uses a tree-based learning algorithm and has been successfully applied in the construction of medical models in recent years [23,24]. RF is a widely employed ML algorithm that handles classification and regression problems with multiple decision trees [25]. CART is a classical decision tree algorithm applied in either classification or regression predictive models [26]. The K-Nearest Neighbor (KNN) algorithm was used as a basic prediction technique. KNN is an important classification algorithm in supervised ML and is extensively applied in pattern recognition, data mining and intrusion detection [27]. To construct the kernel-based model, the Support Vector Machine (SVM) was selected. SVM is a supervised ML model that employs classification algorithms for two-group categorization [28]. The Gaussian Naive Bayes (GNB) algorithm was included as the linear model, suited to features with continuous values [29]. The Multilayer Perceptron (MLP) is a feed-forward neural network and has been extensively applied in various prediction models [30]. After the Bootstrap aggregating (Bagging) algorithm was employed to optimize the performance of the established models, stacked regression was used to obtain a stacking model that integrates the seven models to output the final prediction [31,32].
To improve the performance of the model while retaining maximum authenticity of the data, we applied the Synthetic Minority Over-sampling Technique (SMOTE) strictly within the inner training dataset to address data imbalance [33]. To begin with, patients in the SEER database were randomly assigned to the training set (80%) and testing set (20%), while the proportion of the LM (+) (patients with LM) subgroup was kept approximately identical to that of the LM (−) (patients without LM) subgroup (Additional files 12 and 13). In the training set, k-fold cross validation (k = 10) was performed, and grid search was adopted to find the best combination of parameters. For each set of parameters, the model was in turn fitted and validated with 8/10 and 2/10 of data respectively. Subsequently, our T1 CRC cohort from the Chinese population was used as an additional external validation set to further examine both the applicability and the efficiency of the model (Additional file 14). The overall workflow is shown in Fig. 1.
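The over-sampling step can be illustrated with a minimal, standard-library sketch of SMOTE applied to the training split only. The study lists the `imblearn` package for this purpose; the function name `smote_oversample`, the neighbour count `k`, and the toy feature vectors below are illustrative assumptions rather than the authors' code.

```python
import random


def squared_distance(x, y):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def smote_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    randomly chosen minority sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by distance to the base sample.
        others = [m for j, m in enumerate(minority) if j != i]
        others.sort(key=lambda m: squared_distance(base, m))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical training split: [tumor size in mm, age in years] per patient.
train_lm_negative = [[12.0, 66.0], [15.0, 71.0], [9.0, 58.0], [20.0, 63.0],
                     [18.0, 75.0], [11.0, 69.0], [14.0, 60.0], [16.0, 72.0]]
train_lm_positive = [[48.0, 55.0], [55.0, 52.0], [60.0, 59.0]]

# Balance the classes in the training split only; test data stay untouched.
n_needed = len(train_lm_negative) - len(train_lm_positive)
synthetic_positive = smote_oversample(train_lm_positive, n_needed)
balanced_positive = train_lm_positive + synthetic_positive
print(len(balanced_positive), len(train_lm_negative))
```

Restricting the synthetic samples to the training split keeps the testing and external validation sets composed of real patients only, which is the point of applying over-sampling "strictly" in the inner training dataset.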

Assessment of model performance
To ensure a fair comparison of the models and assess their performance, a range of indicators was employed, including the confusion matrix, the area under the curve (AUC), sensitivity, specificity, precision, negative predictive value (NPV), false discovery rate (FDR), accuracy, and average precision (AP). The area under the receiver operating characteristic curve (AU-ROC) was used as a performance index, while the AP value was employed as the criterion for the precision-recall (PR) curve [34]. The average values of these indicators were ultimately computed on the testing set and the additional external validation set. Survival analysis was further applied to the model to evaluate whether it was capable of accurately predicting CRC patients' outcomes.
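The indicators derived from the confusion matrix follow directly from its four counts. The short sketch below shows the standard definitions; the function name `confusion_metrics` and the example counts are hypothetical and are not taken from Table 3.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Derive the reported indicators from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),   # true positive rate (recall)
        "specificity": tn / (tn + fp),   # true negative rate
        "precision": tp / (tp + fp),     # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
        "fdr": fp / (tp + fp),           # false discovery rate = 1 - precision
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical counts for a 326-patient cohort containing eight LM cases.
metrics = confusion_metrics(tp=8, fp=12, tn=306, fn=0)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With a rare outcome such as LM, accuracy and NPV can be high even when precision is modest, which is why the AP and FDR are reported alongside the AUC.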
Because tumor size is widely recognized as an effective predictor of CRC outcome, we tested for nonlinearity using 5-knot restricted cubic splines (RCS) and evaluated the potential correlation of the model with the hazard of LM [35]. To estimate the performance of the models in patients with small tumors, we stratified the testing set into four subgroups with tumor sizes of 1-10 mm, 1-20 mm, 1-50 mm and > 50 mm, respectively. Their AUC and AP values were then calculated.
Moreover, to make the clinical decision process more reliable, we used the training samples as they were before the over-sampling strategy was applied. Subsequently, to show how the model discriminates CRC patients with LM, regression tree analysis was conducted with the CART algorithm.
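The subgroup evaluation can be sketched as computing the AUC separately within each tumor-size stratum. The example below uses the rank-based (Mann-Whitney) definition of the AUC; the function name `roc_auc`, the `SUBGROUPS` table, and the toy records are illustrative assumptions, and the strata are nested as in the text (1-10 mm lies inside 1-20 mm, which lies inside 1-50 mm).

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative
    (ties count one half); equivalent to the Mann-Whitney U statistic."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined without both classes
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Tumor-size strata in millimetres, as described in the text.
SUBGROUPS = {
    "1-10 mm": lambda size: 1 <= size <= 10,
    "1-20 mm": lambda size: 1 <= size <= 20,
    "1-50 mm": lambda size: 1 <= size <= 50,
    "> 50 mm": lambda size: size > 50,
}

# Hypothetical records: (tumor size in mm, LM label, predicted LM probability).
records = [
    (5, 0, 0.02), (8, 1, 0.30), (9, 0, 0.05),
    (15, 0, 0.10), (18, 1, 0.08), (30, 0, 0.20),
    (45, 1, 0.70), (60, 1, 0.90), (70, 0, 0.88), (80, 1, 0.85),
]

subgroup_auc = {}
for name, in_group in SUBGROUPS.items():
    subset = [(y, s) for size, y, s in records if in_group(size)]
    labels = [y for y, _ in subset]
    scores = [s for _, s in subset]
    subgroup_auc[name] = roc_auc(labels, scores)
    print(name, subgroup_auc[name])
```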

Statistical analysis
SEER*Stat software (version 8.3.6) was used to retrieve the targeted CRC patients from the SEER database. Python (version 3.6.9) and R software (version 4.0.5) were used to perform statistical analyses. The Python packages were 'imblearn', 'sklearn', 'lightgbm', and 'mlxtend'. The R packages were 'tableone', 'survival', 'mice', and 'dplyr'. Demographic differences between the two subgroups were tested using either Student's t-test or the Pearson chi-square test. Results were considered statistically significant when P ≤ 0.05.

Case structures and clinical baselines
The CRC data included in our study from the SEER database ranged from 2010 to 2016. In total, 262,285 CRC patients were initially enrolled. According to the inclusion and exclusion criteria, 16,785 patients were enrolled in the inner dataset, and 326 out of 8226 CRC patients in Xijing hospital were ultimately recruited (Fig. 1). Baseline clinical characteristics of the SEER CRC cohort (training dataset) and the Xijing CRC cohort (validation dataset) are shown in detail in Table 1.
Eleven independent clinical factors were included in our model: age at diagnosis, gender, marital status at diagnosis, primary site, tumor size, tumor grade, tumor type, N stage, CEA level, tumor deposits, and PNI. Patients from the SEER database were categorized into the LM (−) subgroup (16,023 patients without LM, 95.5%) and the LM (+) subgroup (762 patients with LM, 4.5%). For age at diagnosis, the proportion of patients under 60 years of age in the LM (+) subgroup (333/762; 43.7%) significantly surpassed that in the LM (−) subgroup (6553/16,023; 40.9%; P < 0.001). Notably, the proportion of male patients was significantly higher in the LM (+) subgroup than in its counterpart (P = 0.001). Interestingly, there was no statistical difference in race between the two subgroups. As anticipated, a higher rate of LM was observed in single patients (167/2611, 6.4%) than in married patients (376/8918, 4.2%; P < 0.001). Regarding tumor sites, the rectum was the most common primary site in both subgroups, and its proportion was comparatively higher than in CRC patients of other T stages (P < 0.001). With respect to progression of CRC, the average tumor size of the LM (+) subgroup (52.1 mm) was considerably larger than that of the LM (−) subgroup (17.5 mm; P < 0.001). Similarly, the LM (+) subgroup had significantly higher proportions of both Grade II-IV (92.8% vs 68%; P < 0.001) and advanced N stage CRC than the LM (−) subgroup (P < 0.001). Furthermore, we observed higher rates of tumor deposits, PNI and CEA positivity in the LM (+) subgroup than in its counterpart (P < 0.001). As for tumor type, adenocarcinoma (adenocarcinoma, NOS; adenocarcinoma in tubulovillous adenoma; and adenocarcinoma in adenomatous polyp; 12714/16785, 75.7%) was the most common neoplastic category among T1 patients (Table 2).
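As an illustrative check of how such subgroup comparisons work, the Pearson chi-square test can be applied to the marital-status counts quoted above (167/2611 single versus 376/8918 married). This is a sketch restricted to those two categories and one degree of freedom; the authors' own test may have included further marital categories, and the function name `chi_square_2x2` is hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].
    Returns the statistic and its p-value for one degree of freedom."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For df = 1 the chi-square survival function equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Counts quoted in the text: patients with LM among single and married patients.
single_lm, single_total = 167, 2611
married_lm, married_total = 376, 8918

chi2, p_value = chi_square_2x2(
    single_lm, single_total - single_lm,
    married_lm, married_total - married_lm,
)
print(f"chi2 = {chi2:.2f}, P = {p_value:.2e}")
```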

Parameters tuning in our models
We trained the LGBM with a depth of five, a learning rate of 0.01, 240 base learners, 16 leaves, and 128 max bins. For RF and CART, we also selected 5 as the maximum depth of the base trees. For KNN, 200 neighbors performed best. For MLP, we ultimately selected a learning rate of 0.01, 300 epochs and one hidden layer, and used the Adam optimizer and the ReLU activation function. For SVM, a C value of 0.01 combined with a kernel smoothing parameter of 0.0001 was chosen. Additionally, every Bagging model, consisting of 10 base models, was trained with identical algorithms but different data. The final stacking model incorporated the seven bagging models, with their output probabilities as inputs and GNB as the meta-classifier.
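The bagging-then-stacking arrangement can be sketched with scikit-learn as follows. This is not the authors' pipeline: the study lists `mlxtend` and `lightgbm`, whereas this sketch substitutes scikit-learn's `StackingClassifier`, uses synthetic data in place of patient records, and includes only four of the seven base learners (LGBM, SVM and MLP are omitted to keep it short). The number of trees in the random forest and the helper name `bagged` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, StackingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the 11 clinical features (not patient data).
X, y = make_classification(n_samples=1500, n_features=11, n_informative=6,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)


def bagged(model):
    """Wrap a base learner in a 10-member bootstrap-aggregating ensemble."""
    return BaggingClassifier(model, n_estimators=10, random_state=0)


# Hyperparameters quoted in the text: depth 5 for RF and CART, 200 neighbours for KNN.
base_learners = [
    ("rf", bagged(RandomForestClassifier(max_depth=5, n_estimators=20, random_state=0))),
    ("cart", bagged(DecisionTreeClassifier(max_depth=5, random_state=0))),
    ("knn", bagged(KNeighborsClassifier(n_neighbors=200))),
    ("gnb", bagged(GaussianNB())),
]

# The bagged models' predicted probabilities feed a Gaussian Naive Bayes meta-classifier.
stack = StackingClassifier(estimators=base_learners, final_estimator=GaussianNB(),
                           stack_method="predict_proba", cv=5)
stack.fit(X_train, y_train)

test_proba = stack.predict_proba(X_test)[:, 1]
test_auc = roc_auc_score(y_test, test_proba)
print(f"Stacking AUC on the held-out split: {test_auc:.3f}")
```

Training the meta-classifier on cross-validated probabilities (`cv=5`) rather than on in-sample predictions is a design choice of this sketch that limits leakage from the base learners into the stacking layer.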

Evaluation of models
In internal validation, all models showed strong predictive ability (AUC values > 0.94). Moreover, by integrating the seven single models, the stacking model reached an AUC of 0.9631 (Fig. 2a). Except for the GNB model, the AP values of nearly all models reached comparatively favorable levels. Notably, the final AP of the stacking model reached 0.693 (Additional file 3: Figure S1a). As expected, performance on the external validation set was satisfactory. All models except the MLP model exhibited high predictive value, and the stacking model achieved a final AUC of 0.992 and a final AP of 0.811 (Fig. 2b and Additional file 3: Fig. S1b). Additionally, the confusion matrices of the models for both the inner testing set and the external validation set are displayed in Table 3.
LGBM produced fewer false negatives (FN) and false positives (FP) than the other models. To further assess the overall performance of the AI model, we compared it with previously published models and logistic regression models using our data. The results showed that the stack-bagging model outperformed the other models (Additional file 6: Table S5).
Furthermore, using survival status and time from the SEER database, we plotted the Kaplan-Meier (K-M) curves of the testing set. Consistent with the established view, LM was an unfavorable prognostic indicator for CRC patients (Additional file 7: Figure S2a). Likewise, we found that [...] (Figure S2b).
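The Kaplan-Meier curves mentioned above rest on the product-limit estimator, which multiplies the conditional survival probabilities at each observed event time. A minimal sketch follows; the study lists the R package 'survival' for this analysis, and the function name `kaplan_meier` and the follow-up data below are hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times: follow-up in months; events: 1 = death observed, 0 = censored.
    Returns [(time, survival probability)] at each distinct event time."""
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        leaving = sum(1 for ti in times if ti == t)  # deaths plus censored at t
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Hypothetical follow-up data for two groups of five patients each.
lm_positive = kaplan_meier([3, 6, 6, 10, 14], [1, 1, 0, 1, 0])
lm_negative = kaplan_meier([12, 30, 45, 60, 60], [0, 1, 0, 0, 0])
print("LM (+):", lm_positive)
print("LM (-):", lm_negative)
```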

Comparison of significance of each factor
In all single models, tumor size, preoperative CEA level, tumor deposits, N stage, histology, and PNI were all of fundamental importance in predicting LM in T1 CRC. Although the AI model performed well, the individual influence of each factor on the result and the underlying relationships between these factors remained largely unknown. Hence, we calculated and quantified the importance of each factor used in the AI models (Additional file 8: Figure S3 and Additional file 9: Table S6). As anticipated, tumor size, CEA level prior to surgery, tumor deposits, and N stage were the top four predictors across all models. In particular, tumor size stood out as the most critical factor in nearly all models.

Subgroup analysis
Because tumor size might play a dominant role in prediction while the other parameters contributed relatively less to model performance, we further investigated the association of tumor size with LM hazard. First, the RCS function of tumor size in the training set exhibited a non-linear profile (non-linearity P value < 0.001; Fig. 3a), indicating that this clinical feature should be encoded as a categorical factor and was inappropriate for canonical logistic regression analysis. Notably, a tumor size of 50 mm was the optimal cutoff value for subgroup analysis (Fig. 3a). Therefore, we used the AUC and AP values to further explore the model performance in different subgroups. The AUC values of the 1-50 mm and > 50 mm subgroups reached 0.956 and 0.8772 respectively (Fig. 3b).
Because patients with tumor size larger than 50 mm accounted for a lower percentage than the 1-50 mm subgroup, we further divided patients into 1-10 mm and 1-20 mm subgroups. The AUC values (Bagging Stacking model) of the 1-10 mm and 1-20 mm subgroups reached 0.8212 and 0.8608 respectively (Additional file 10: Figure S4a and b). Overall, the stacking model was verified to have a favorable prediction capacity in T1 CRC patients with small tumor sizes.

Clinical application
Although the stacking model showed both desirable and robust predictive power for LM in T1 CRC, the model is intricate in nature and cannot be easily understood by clinicians. Consequently, we developed an easy-to-use instrument (a clinical decision tree) to support the clinical decision-making process with pragmatic guidance (Fig. 4). In this decision tree, the target population was categorized into five groups according to the four most crucial factors, namely CEA level, tumor size, tumor deposits and age. The AUC of the clinical decision tree reached 0.949 (Additional file 11: Figure S5), demonstrating its discriminative and predictive ability.
Patients with CEA positive or borderline, positive tumor deposits, age ≤ 83 and tumor size > 10 mm showed a high proportion of LM (32.4%) and could be categorized into the high-risk subgroup of LM. In contrast, the remaining three types of patients uniformly showed a low occurrence of LM.
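The high-risk rule described above can be written as a small function. This sketch encodes only the high-risk branch stated in the text; the remaining branches of the decision tree in Fig. 4 are not reproduced here. The function name `is_high_risk_lm`, the string labels for CEA status, and the millimetre unit for tumor size are assumptions.

```python
def is_high_risk_lm(cea_status, tumor_deposits_positive, age_years, tumor_size_mm):
    """Return True when a patient meets all four high-risk criteria quoted in the text:
    CEA positive or borderline, positive tumor deposits, age <= 83, tumor size > 10."""
    return bool(
        cea_status in {"positive", "borderline"}
        and tumor_deposits_positive
        and age_years <= 83
        and tumor_size_mm > 10
    )


# Hypothetical patients.
print(is_high_risk_lm("positive", True, 70, 25))    # meets all four criteria
print(is_high_risk_lm("negative", True, 70, 25))    # CEA negative
print(is_high_risk_lm("borderline", True, 84, 25))  # older than 83
```

A rule of this form can be applied at the bedside without software, which is the stated motivation for deriving a decision tree from the more complex stacking model.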

Discussion
The liver is one of the most common metastatic sites for CRC, and LM is recognized as the most lethal factor for CRC patients [36,37]. Early diagnosis of LM could help clinicians intervene promptly to improve the prognosis of patients, especially T1 CRC patients [38,39]. CRC patients in the T1 stage could select either surgical or endoscopic treatment, partly depending on the status of distant metastasis. Hence, a convenient and accurate predictive model of LM is urgently needed to guide personalized therapeutic strategies. In this study, we established an innovative and convenient model to predict early LM in T1 CRC by incorporating 11 clinicopathologic parameters and using seven AI methods. We were the first to combine real-world research with large-scale public data to comprehensively construct and assess LM predictive models in T1 CRC. Given that the AUC of these models was greater than 0.94 and model accuracy approached 100%, we concluded that the established models were robust and could yield favorable clinical benefits, which might be of great assistance to clinicians in identifying CRC patients with underlying LM. More interestingly, our model showed strong competence in discriminating LM in T1 CRC patients with small tumor size (1-50 mm) from others. Finally, to develop an easy-to-use instrument for clinical practice, we plotted a decision tree to screen out the population at high risk of LM. The visualized decision tree was not only precise but also easy for clinicians to comprehend.
Our real-world research incorporated 326 cases of T1 CRC, among which LM occurred in only eight patients (8/326), a rate significantly lower than that of the SEER database (762/16785, P < 0.001). The discrepancy in the LM ratio might be attributed to low diagnostic efficacy in developing countries [40,41]. Interestingly, PNI appeared more frequently in T1 CRC patients of our hospital (169/326) than in patients with more advanced T stages (1266/8226), consistent with the results of the SEER database (11350/16785). Abundant evidence has demonstrated that the percentage of PNI across all T stages is approximately 10-15%. Moreover, PNI is an independent biomarker that indicates aggressive behavior and unfavorable prognosis of CRC [42][43][44][45]. Nonetheless, the published literature scarcely explains the underlying causes of the high ratio of PNI in T1 CRC, which deserves further investigation. In addition, serum CEA was confirmed to have a positive relationship with LM. Accumulating evidence has suggested that the expression level of CEA could serve as an independent indicator for the prognosis of CRC patients [46]. Therefore, it was not surprising that the concentration of preoperative plasma CEA was significantly higher in CRC patients with LM than in those with primary CRC [47][48][49]. Besides, among all indicators, tumor size has been regarded as one of the most significant biomarkers for predicting LM status. Tumor size has been reported to be closely associated with both lymph node and hepatic metastases of CRC [50]. Furthermore, age has been shown to play a non-negligible role in the progression and prognosis of CRC [51]. Despite the increase in young CRC patients, compelling evidence revealed that younger patients tended to have more favorable outcomes than older patients [51].
In contrast, our research indicated that CRC patients younger than 60 years of age were more likely to be at risk of LM than their counterparts, which was consistent with the findings of several other researchers [52][53][54]. A probable reason is the frequent occurrence of mismatch repair gene mutations and more aggressive neoplastic biology in younger patients [55].
To date, many investigators have constructed diverse models to predict the metastatic capability of CRC. For instance, Tang et al. [14] built a novel nomogram to forecast LM in CRC patients of all T stages using multivariable Cox regression. They also found that synchronous LM was an independent prognostic factor for CRC patients. Similarly, Li et al. [56] employed the SEER database to construct a model of all distant metastases in T1 CRC using conventional logistic regression. However, owing to the limitations of the algorithm and the data processing approach, they obtained a passable model (AUC = 0.879) with unavoidable overfitting. Recently, with the enormous technical advancement of AI, the application of ML models in neoplastic diagnosis and prognostic assessment has become increasingly prevalent [57,58]. Numerous novel ML algorithms have remedied deficiencies of canonical statistical methods, such as overfitting and unbalanced data distribution. Ji Hyun Ahn et al. [19] developed an innovative model (AUC = 0.96) to predict LNM in early-stage CRC patients using the SEER database and seven AI methods. Nevertheless, these studies were retrospective, single-center, and included small numbers of patients. Additionally, Ichimasa et al. [59] showed that AI could reduce unnecessary surgery after endoscopic resection of LNM (−) T1 CRC compared with current guidelines. Nonetheless, few models for predicting the incidence of LM in T1 CRC patients have been developed and assessed using AI methods. In the current study, we established nine models and then validated them in our own dataset. Besides, their efficacy in predicting LM in early CRC was also compared using easily available clinical and histopathological features.
Moreover, we found that our AI models could not only assist clinicians in selecting patients at high risk of LM, but also performed similarly to LM status in accurately predicting the outcomes of T1 CRC patients. Our models also exhibited a superior ability to discriminate LM in T1 CRC patients with small tumor size (1-50 mm) from others.
So far, only surgical resection has been verified as a curative therapeutic approach for CRC patients with early and resectable LM [60,61]. For patients with unresectable LM, early application of systemic chemotherapy might improve the prognosis and prolong median survival [62]. Taking the above results together, we believe that further use of T1 CRC LM models would contribute to clinical decision making and improve current therapy.
Admittedly, there remain several limitations and weaknesses in the study. First, because the SEER database is an open national program of the United States, the newly established models might not work in other countries. Second, the number of patients enrolled in our hospital was far from sufficient, and only eight patients had LM. These shortcomings might limit the strength of the validation. In the future, more in-depth and extensive studies will be needed. In addition, we intend to package the stacking model and decision tree into software or a website and validate them clinically in our next work.

Conclusions
In the present study, we established an innovative stacking bagging model that incorporates 11 clinicopathologic features to predict the incidence of LM in T1 CRC. Our findings indicated that age, gender, marital status, primary site, tumor size, CEA, tumor type, grade, N stage and PNI were crucial factors for forecasting LM, among which tumor size mattered most. As expected, the stacking bagging model, which integrated the strengths of seven single models, demonstrated the strongest predictive power in both the SEER database and our hospital's dataset. Moreover, we found that the stacking model resembled LM status in accurately predicting the outcomes of T1 CRC patients. A novel easy-to-use tool (a decision tree) was developed to guide clinicians in screening out patients at high risk of LM and offering them more aggressive therapeutic strategies.