Establishing machine learning models to predict the early risk of gastric cancer based on lifestyle factors

Background Gastric cancer is one of the leading causes of death worldwide. Screening for gastric cancer greatly relies on endoscopy and pathology biopsy, which are invasive and pose financial burdens. Thus, the prevention of the disease by modifying lifestyle-related behaviors and dietary habits or even the prevention of risk factor formation is of great importance. This study aimed to construct an inexpensive, non-invasive, fast, and high-precision diagnostic model using six machine learning (ML) algorithms to classify patients at high or low risk of developing gastric cancer by analyzing individual lifestyle factors. Methods This retrospective study used the data of 2029 individuals from the gastric cancer database of Ayatollah Taleghani Hospital in Abadan City, Iran. The data were randomly separated into training and test sets (ratio 0.7:0.3). Six ML methods, including multilayer perceptron (MLP), support vector machine (SVM) (linear kernel), SVM (RBF kernel), k-nearest neighbors (KNN) (K = 1, 3, 7, 9), random forest (RF), and eXtreme Gradient Boosting (XGBoost), were trained to construct prognostic models before and after performing the relief feature selection method. Finally, to evaluate the models’ performance, the metrics derived from the confusion matrix were calculated via a test split and cross-validation. Results This study found 11 important influence factors for the risk of gastric cancer, such as Helicobacter pylori infection, high salt intake, and chronic atrophic gastritis, among other factors. Comparisons indicated that the XGBoost had the best performance for the risk prediction of gastric cancer. Conclusions The results suggest that based on simple baseline patient data, the ML techniques have the potential to start the prescreening of gastric cancer and identify high-risk individuals who should proceed with invasive examinations. Our model could also considerably lessen the number of cases that need endoscopic surveillance. Future studies are required to validate the efficacy of the models in a larger and multicenter population.


Introduction
Gastric cancer (also known as stomach cancer) is the fourth most prevalent neoplastic disease and the second leading cause of cancer-related deaths worldwide [1]. Gastric cancer, with a yearly incidence of about 7300 individuals, is one of the five most prevalent malignancies in the Iranian population. This disease is the first cause of cancer-related deaths in both sexes in Iran because the majority of patients are diagnosed in the advanced stages of the disease. Moreover, the 5-year survival rate in Iran is estimated at less than 25% [2]. A large proportion of patients with gastric cancer typically have no specific symptoms, and some of the early signs in patients are similar to gastritis or indigestion; therefore, gastric cancer is easily disregarded by patients. By the time their symptoms are noticeable, most of the patients have developed advanced gastric cancer. As a result, cancer invades adjacent tissues, and in such cases, treatments are ineffective and challenging, and the patient dies in a short while. The 5-year chance of surviving gastric cancer in a patient diagnosed in the early stages is more than 80%, which is significantly higher than the survival rate of a patient diagnosed in the advanced stages [3][4][5], highlighting the urgent need for an early screening method to improve the detection of gastric cancer. Individuals with low-risk gastric cancer should also be monitored to minimize the likelihood of advancing to high-risk stages. Therefore, preventing risk factors that contribute to the formation and development of gastric cancer should be a priority in healthcare system programs [6].
Endoscopy along with pathology biopsy is the current gold standard in the screening and detection of gastric cancer [7]. However, some patients, especially in rural and remote areas, avoid endoscopy or surgery due to the invasive nature and cost of this procedure [8]. If an individual is predicted to have a high risk of developing gastric cancer, preventative measures can be taken in advance. On the other hand, if the prognosis indicates that the patient has a low risk of developing gastric cancer, endoscopic examinations of the upper gastrointestinal tract, which are associated with possible risks and high screening costs, can be avoided or minimized. A large-scale survey of 200,000 individuals who underwent endoscopic examinations revealed a side effect rate of 0.13% and a mortality rate of 0.004%. Therefore, endoscopic screening for gastric cancer has been suggested in several subgroups of patients at risk [9,10]. A meta-analysis study conducted in 2018 showed that population-based endoscopic screening in Asian countries significantly reduces the risk of death from gastric cancer. However, establishing a population-based endoscopic screening program in clinical practice is neither costeffective nor practical [11]. Hence, the adoption of noninvasive techniques or models for the diagnosis of gastric cancer is of great importance.
Thus far, no non-invasive measures have been taken to diagnose gastric cancer with high sensitivity and specificity. Early diagnosis of gastric cancer and the subsequent early treatment are crucial to improve survival and reduce mortality from this cancer. Therefore, due to the complexity and interlinking factors that are causally related to gastric cancer, it is increasingly urgent to adopt non-invasive and time-saving diagnostic methods with high accuracy to minimize imprecision and uncertainty in the diagnosis of gastric cancer.
The adoption of artificial intelligence (AI)-based solutions, such as machine learning (ML), can overcome the restrictions of invasive diagnostic procedures in the screening and diagnosis of gastric cancer due to their computational capacity. ML techniques are well-known tools for developing predictive and data analysis models and can implicitly extract useful information from raw datasets [12]. ML can extract hidden relationships and patterns from large and high-dimensional data in single-or multicenter datasets [13,14]. ML models are automatically created based on training data that can be used to make inferences or decisions in uncertain conditions without being explicit programming [15]. By capturing multifaceted nonlinear relations in the datasets, ML algorithms can increase the prediction accuracy more than traditional statistics techniques [16,17].
Many studies have used ML techniques to predict gastric cancer up until now. Liu et al. used ML to predict gastric cancer with an accuracy of 77% [18]. Cai et al. performed univariate and multivariate analyses for gastric cancer prediction using demographic, dietary, and medical history as input data [19]. Safdari et al. developed a system for earlier diagnosis of gastric cancer using fuzzy logic with a sensitivity of 92.1% and a specificity of 83.1% [20]. Su Y et al. detected gastric cancer using a decision tree (DT) classification of mass spectral data with an accuracy of 86.4% [21]. Brindha et al. utilized dietary and lifestyle features to predict etiological factors of early gastric cancer and trained several supervised ML algorithms, including naive Bayes, logistic regression (LR), and multilayer perceptron (MLP). Their results showed that naive Bayes has the best performance with an accuracy of 90% as compared to the other models [22]. Mortezagholi et al. employed endoscopy images as attributes for a casecontrol study using ML techniques, including the support vector machine (SVM), DT, naive Bayesian model, and k nearest neighborhood (KNN) to predict patients with gastric cancer [23]. They reported that SVM with an accuracy of 90.8% exhibits the best performance compared to the other models [8]. Taninaga et al. showed that eXtreme Gradient Boosting (XGBoost) outperformed LR in predicting gastric cancer using comprehensive longitudinal data with the highest area under the curve (AUC) value (0.899) [10]. Several studies have applied neural network methods to detect gastric cancer based on endoscopic images with high sensitivity [24,25]. In the study of gastric cancer, ML techniques are mainly used to analyze endoscopic images, which are obtained through invasive methods [24,26]. In contrast, analyzing lifestyle-related factors is non-invasive and inexpensive. Gastric cancer is largely reliant on lifestyle-related factors and can be prevented with a change of diet and habits [27]. Therefore, in this study, we aimed to predict gastric cancer based on lifestyle and historical data using ML methods. Thus far, numerous ML methods have been developed in the medical field. Each of these ML methods has a different algorithm and nature of work. If chosen appropriately, all these models will perform at their peak [28]. The selection of ML models is dependent on the data (the type and specific characteristics of each dataset, such as structured or unstructured, number of dimensions, number of samples, and other similar factors) as well as the desired performance [29]. In the current study, we chose six ML algorithms that performed well on structured and unstructured datasets and on datasets with several dimensions (SVM, KNN, random forest (RF)), an ML model that could solve complex nonlinear problems randomly (MLP), and a model that has the ability to set multiple hyperparameters to achieve high accuracy (XGBoost) [30][31][32][33].
The structure of the manuscript is as follows: First, the dataset is presented. The ML techniques used in this paper are then described in detail. After that, the results of comparing ML techniques are shown. In the next stage, the most accurate model for predicting gastric cancer based on the results of performance evaluation metrics is reported.

Study design and experiment environment
This is a retrospective, single-center study that was conducted in 2022. A dataset was collected from the Ayatollah Taleghani database affiliated with Abadan University of Medical Sciences, Abadan City, Iran. Six ML-based models were developed for the prediction of gastric cancer using lifestyle-related factors. This study was conducted based on the cross-industry standard process for data mining (CRISP-DM). The prediction models were developed using Python programming language in five main CRISP stages including data understanding, preprocessing, feature selection, model training, and evaluation, as shown in the block diagram below (Fig. 1).

Dataset description and participants
The dataset used in this study was obtained from the gastric cases referred to the internal clinic of Ayatollah Taleghani Hospital in Abadan, Iran, during 2015-2021. The patients' information was reviewed and extracted by a health information management expert. The patients who were referred to the clinic for the screening, Fig. 1 Block diagram of the proposed system for gastric cancer risk prediction diagnosis, and treatment of gastric cancer were included in this study. A total of 2029 individuals (429 patients vs. 1780 healthy controls) participated in the study. The analyzed dataset contained descriptive information about the respondents (28 features) and the outcome of the gastric cancer risk (one feature), which can be viewed as the dependent variable (Table 1).

Preprocessing the dataset
Data preprocessing is an essential step in the CRISP-DM method, and it has a significant impact on the performance of data mining techniques. The objectives of data preprocessing are to cleanse the outlier, remove the noisy data, impute missing values, and convert the data into a suitable format for more reliable and accurate data analysis. In the first phase of preprocessing steps, we employed the interquartile range rule to detect outliers and normalize the dataset by using the min-max method. To impute missing values, mean and regression-based methods were applied. We also deleted the rows with more than 70% missing values. The Z-score standardization method was used as a data distribution-based data scaling, and for data range-based scaling, the min-max method was employed. In the preprocessing phase, 240 records of the dataset were deleted, and after removing these records, the number of the cases of the dataset was reduced to 2029 records. Since group distribution in the dataset used for this study is imbalanced, one of the groups contains 1780 samples (healthy individuals) while the other has 429 samples (patients). Therefore, we created a new dataset by approximating the group with fewer samples (patient group = 429) to the group with a larger number of samples (healthy group = 1780 individuals), such that the samples of the group with fewer cases in the dataset are randomly reproduced. In this study, the synthetic minority oversampling technique (SMOTE) was applied to the resampling process. The SMOTE method is the most common and effective oversampling method, which is applied in various fields to balance the datasets [34,35].

Feature selection
One of the fundamental issues with many data mining tasks is to determine and specify relationships between attributes in the dataset and outcome. Feature selection is one of the main phases of a successful data mining process, especially in problems with a large number of dimensions or variables in the dataset. Feature selection is defined as the process of determining relevant variables and removing irrelevant ones [36]. In the present study, the relief feature selection algorithm was implemented to reduce the number of features and combinations in order to obtain the most important predictors. Relief is a method for the random selection of relevant attributes based on variables' weight. This algorithm assigns weights to all the variables. The most important attributes to the outcome have higher weight values, whereas the other attributes have lower weights [37].

Training and evaluation of ML classifiers
In order to develop an early prediction model for gastric cancer risk, a total of six ML algorithms were used: MLP, SVM (linear kernel), SVM (RBF kernel), KNN (K = 1, 3, 7, 9), RF, and XGBoost. To implement these models, we experimentally tuned the hyperparameters on the training split of the dataset based on the cross-validation (CV) method. The performance of the ML algorithms was evaluated using the holdout technique, a method for out-of-sample assessment where the dataset was split into two parts (70% training and 30% test). The ML models were then trained on the first split of the dataset and tested on the other part. In addition, the K-fold CV  method (K = 10) was applied to evaluate the performance of the best algorithms to overcome a feasible biased error estimate. Five performance evaluation metrics were selected and reported for each ML technique to compare the performance of classifiers, which is common in medical prediction studies. The performance evaluation metrics of the classifiers are listed below, along with their definitions:

Characteristics of patients
After applying the exclusion criteria and conducting a quantitative analysis of patients' records, 2029 patients were found to be eligible. Of the 2029 participants in the study, 1094 (54%) were male and 1015 (46%) were female, and the age of the participants was between 18 and 94. In total, 1780 (88%) of the study subjects were healthy controls, and 429 (22%) were patients. The descriptive statistics for the 2029 samples in this dataset are shown in Table 2.

The results of selected features using the relief feature selection algorithm
A total of 11 features were selected due to their positive correlation with gastric cancer by performing the Relief feature selection algorithm. These features are Helicobacter pylori infection, high salt intake, chronic atrophic gastritis, consumption of fruits, stomach or duodenal ulcers, weight loss, consumption of high-fat foods, educational level, smoking, stress status, and weight. The key features selected by the Relief feature selection algorithm and their scores are presented in Table 3. Based on the results of the Relief feature selection algorithm (Table 3), Helicobacter pylori infection is three times more effective than the other attributes for the early prediction of gastric cancer. High salt intake, chronic atrophic gastritis, and low consumption of fruits are significantly associated with a high risk of gastric cancer. In contrast, weight loss and stress were ranked as the 10th and 11th most important risk factors in the prediction of gastric cancer, respectively (see Fig. 2).

The results of the tuning of hyperparameters
In order to improve the performance of ML techniques, many methods can be used to tune hyperparameters. In this study, the randomized search CV method was employed for the tuning of parameters and the optimization of ML techniques. Table 4 indicates the best hyperparameter classifier for the early prediction of gastric cancer based on lifestyle factors.

The results of k-fold CV for classifiers' performance on full features and selected features
In the present study, full features and features selected by the Relief feature selection algorithm were tested on six classification algorithms using the 10-fold CV methods. In the tenfold CV method, 90% of the dataset was used for training the algorithms and only 10% was tested. The mean metrics of tenfold methods were measured. Additionally, different metrics values were passed through classification algorithms. At first, we trained and tested  Table 4 Hyperparameters selected to be fed into the classifiers for the early prediction of gastric cancer  Table 5 The results of tenfold CV for models' performance metrics on full features and selected features  Table 6 The results of tenfold CV for models' performance metrics on full features and selected features  Table 5.
According to the results presented in Table 5, the performance of classifiers on the selected features was better than on full features. When the selected factors were included in the model, the results show that the XGBoost classifier yielded an accuracy of 83.4%, a sensitivity of 83.7%, a specificity of 85.9%, an AUC of 84.9%, and an H-score of 83.2%. While the KNN classifier for K = 7 obtained the mean accuracy of 67.98%, the mean sensitivity of 66%, a specificity of 70.2%, an AUC of 66.8%, and a mean H-score of 69.14% when all features were fed into the classifiers ( Table 6). The results of the six best experiments of classifiers on the selected features are shown in Fig. 3.
According to Fig. 3, the performance of the XGBoost classifier was better than that of the five other algorithms (a mean accuracy of 83.7%, a mean specificity of 85.9%, an average sensitivity of 83.7%, an AUC of 84.9, and an H-score of 83.2). In Fig. 3, the worst performance was observed for the MLP classifier with an accuracy, specificity, sensitivity, AUC, and H-score of 70.1%, 71.8%, 70.5%, 70.8%, and 71.2%, respectively. The AUC and classification report of the XGBoost classifier, which was selected as the best classifier in the prediction of gastric cancer, are shown in Fig. 4.

Discussion
It is important to provide accurate and rapid screening for gastric cancer since this malignancy is very treatable when diagnosed in early phases. Therefore, implementing procedures for the screening and diagnosis of gastric cancer is highly beneficial. In this regard, a timely and reliable screening method enables rapid detection of the disease, and the subsequent timely interventions can help boost the likelihood of patient survival (2,10). In this study, we developed non-invasive and cost-effective predictive models using six ML algorithms for gastric cancer risk assessment to distinguish high-risk patients with gastric cancer from the general population. Due to the fact that gastric cancer has multiple features with many potentially important confounders, it is a great challenge for clinicians to consider and analyze all the engaging features and decide on the patient's condition. It also raises the likelihood of a physician error during the decisionmaking process of disease diagnosis [38]. Thus, it is desirable to use an intelligent method that has the ability to learn the problem and generalize it to other situations. In this study, six methods of MLP, SVM (linear kernel, RBF kernel), KNN, RF, and XGBoost were proposed to classify gastric cancer patients. Based on the results, the XGBoost had the best performance in predicting gastric cancer risk in comparison with the other ML techniques using the stated evaluation metrics.
Several studies have been conducted with the aim of improving the early screening accuracy of gastric cancer through the use of ML. The predictive models in previous studies were developed based on diagnostic, laboratory, pathology, and imaging data and were not related to lifestyle data and individual habits and behaviors. Therefore, the advantage that distinguishes our study was that we obtained relatively significant results by applying ML methods based on data on the lifestyle features and behavior of individuals. Moreover, our study used feature selection to select the most significant lifestyle-based variables in order to maximize the capability of the models when compared to the analysis of all variables in the dataset. Feature selection enhances the accuracy, specificity, and sensitivity of the classifiers and decreases the running time of the predictive system. In this study, we used the ML algorithms due to the fact that these methods resulted in better predictions than the conventional statistical techniques when dealing with a large number of features with complicated relationships [39,40]. Although statistical models can easily determine the relationship between dependent and independent variables, they cannot handle a large amount of variables with different types and intricate associations [41,42]. If the aim of the study is to improve the performance of predictive models and the interpretation of models is of secondary importance, researchers prefer to develop ML models to achieve satisfactory predictions [40].
The main advantage of our study was that it estimated the risk of gastric cancer based on lifestyle-related factors. Some researchers have incessantly focused on medical equipment and detection reagents to improve the screening of gastric cancer, and the results of their studies were applied to clinical gastroscopy and biopsy [43,44]. A few studies have combined genetic, proteomics, and molecular biology to detect gastric cancer [45][46][47][48]. However, owing to their limitations, such as invasiveness, intricacy, high cost, or low adaptability, the diagnostic methods have not been widely adopted in clinical practice for gastric cancer screening. Tumor markers, e.g., CEA, CA199, CA125, and CA724, are generally used for the diagnosis of gastric cancer. But, the sensitivity and accuracy of these non-invasive features are not satisfactory [49][50][51]. Contrary to the abovementioned studies, we applied ML techniques to stratify the risk of gastric cancer, which is a non-invasive approach. Patients were first examined by the optimal ML models developed, and then the high-risk cases were referred to specialized centers for further diagnostic procedures, such as endoscopy and pathology biopsy. The non-invasive gastric cancer screening approach developed in our study is highly adaptable and low-cost, which increases the coverage of gastric cancer screening in clinical practice.
Nevertheless, this study also had some limitations. First, the participants in the study included patients from a single care center, which limits the generalizability of the results to larger populations. Second, there was some subjective bias in selecting variables and examining lifestyle behavior that may affect the predictive results. Another limitation of this study was a retrospective analysis using registered data, which reduces the external validity of the results. Thus, future testing in a larger population is recommended.

Conclusions
This study utilized the routine available non-invasive features to implement six models for screening the risk of gastric cancer. The XGBoost model demonstrated better performance than the other ML models and can be Fig. 4 The classification report and ROC curve for the XGBoost classifier applied to assist clinicians in the screening of gastric cancer risk in accordance with the Iranian health referral system and hierarchical leveling of healthcare services, which will improve the early screening of gastric cancer on a large scale. The ML models may identify highrisk patients with gastric cancer early, which draws the attention of clinicians and patients, and appropriate and timely interventions will be implemented to improve the patients' survival chance and quality of life. This study can also aid clinical researchers in choosing and implementing the optimal prediction models and evaluating the main influencing features.