Automatic COVID-19 prediction using explainable machine learning techniques

The coronavirus is considered this century's most disruptive catastrophe and global concern. This disease has prompted extreme social, psychological and economic impacts affecting millions of people around the globe. COVID-19 is transmitted from one infected person's body to another through respiratory droplets. The virus spreads when people breathe air contaminated with droplets and microscopic airborne particles. This research aims to analyze automatic COVID-19 detection using machine learning techniques to build an intelligent web application. The dataset has been preprocessed by dropping null values, feature engineering, and synthetic oversampling (SMOTE) techniques. Next, we trained and evaluated different classifiers, i.e., logistic regression, random forest, decision tree, k-nearest neighbor, support vector machine (SVM), ensemble models (adaptive boosting and extreme gradient boosting) and deep learning (artificial neural network, convolutional neural network and long short-term memory) techniques. Explainable AI with the LIME framework has been applied to interpret the prediction results. The hybrid CNN-LSTM algorithm with the SMOTE approach performed better than the other models on the employed open-source dataset obtained from the Israeli Ministry of Health website, with 96.34% accuracy and a 0.98 F1 score. Finally, this model was chosen to deploy the proposed prediction system to a website, where users may acquire an instantaneous COVID-19 prognosis based on their symptoms.


Introduction
Coronaviruses are a group of diverse viruses with a wide range of variants that differ in several ways ( Bo et al., 2021 ). In 1965, scientists discovered the first strain of the coronavirus that infected humans. This strain caused common colds in the host body. After more than a decade, the researchers discovered a group of viruses found in humans and animals named after their appearance, which resembled a crown. Hence, the word 'corona' in the name of the virus derives from a Latin word that means 'crown.' So far, scientists contend that up to seven coronaviruses can infect people. One such severe acute respiratory syndrome-related virus was discovered for the first time in 2003 in southern China, and it spread rapidly to approximately 30 different countries. However, it remains unclear how this virus came to be the source of global devastation in the form of the COVID-19 pandemic. Specialists confirmed that SARS-CoV-2 originated in bats. People would visit the wet market of Wuhan to purchase fish and fresh meat, as the animals were slaughtered in the same place. It is believed that this is where the contamination spread to humans. The congested and crowded environment is prone to facilitating cross-contamination and the swapping of genes between various animals, which may have resulted in viruses undergoing significant mutations, potentially infecting humans and propagating the infection in a rapid and devastating manner ( Jin et al., 2020 ).
COVID-19 has continued to spread all over the world very rapidly since the beginning of 2021. According to the statistics, as of December 11, 2022, there were 653,374,909 coronavirus-infected cases worldwide, causing 6,657,885 deaths. The statistics in North America and Europe are even worse; there were 101,263,635 cases of coronavirus infection in the United States, with 1,109,725 deaths ( COVID LIVE UPDATE, 2022 ). Traditional laboratory diagnostic technology faces various challenges due to the unavailability of facilities and kits to test for coronavirus infection, so performing a rapid test as an alternative for COVID-19 diagnosis is currently gaining popularity ( Amin, Farman, Akgül & Alqahtani, 2022 ). Given the limits of COVID-19 testing, additional diagnostic methods are in considerable demand. The main goal of developing the web tool proposed in this article is to ensure that users can quickly determine whether or not they are affected by COVID-19 by answering a few symptom-related clinical questions. People can use forums to ask questions about their symptoms and health concerns and receive advice on possible remedies. This website gives a broad overview of any new respiratory viruses to make people aware of future developments in COVID-19 related health problems.
Significant work has been performed on the automatic detection of coronavirus disease using machine learning techniques. Some of the notable works on this topic have been briefly discussed in the following paragraphs.
Numerous machine learning techniques have been used in the majority of the articles to detect coronaviruses automatically. For instance, in Darapaneni et al. (2020) , the authors worked with data on suspected COVID-19 patients and their admission rates in various types of hospital facilities. The authors used data from the Israelita Albert Einstein Medical Center in Sao Paulo city of Brazil to train machine learning models. The data contain the clinical information of the suspected and confirmed COVID-19 patients, such as complete blood cell (CBC) counts and information related to liver, glucose and renal tests. The authors trained models on the dataset using six machine learning algorithms containing various ensemble techniques, i.e., bagging, gradient boosting, and adaptive boosting. They evaluated the machine learning models based on their training accuracy, test accuracy, ROC curve, and confidence score. The random forest approach achieved the best training and validation accuracies of 0.97 and 0.95, respectively. Detection of the onset of the COVID-19 epidemic was examined by Rohini et al. ( Rohini et al., 2021 ) using various machine learning techniques. The authors studied the genesis of the disease and conducted forecasts and time-series monitoring, which is expected to aid in the future control of this disease. KNN with synthetic oversampling accomplished the best performance with an accuracy of 98%. Sharma and his team ( D.K. Sharma et al., 2021 ) worked on the advanced identification of coronavirus using the SVM machine learning model. A hyperparameter optimization strategy was used as a modified cuckoo search technique to improve the SVM classifier's prediction accuracy. An advanced feature selection framework, mRMR (Minimum Redundancy Maximum Relevance), was used to sort COVID-19 and healthy instances. Four distinct machine learning techniques were used by Choudary et al. ( Choudary et al., 2021 ) to identify the presence of COVID-19 in individual patients. The SVM algorithm achieved the highest level of accuracy at 98.38%, but the XGBoost model was determined to be the best model since it had a higher recall value (99.26%) without significantly influencing the other evaluation criteria (accuracy level of 97.71%). Tiwari et al. ( Tiwari, Bhati, Al-Turjman & Nagpal, 2022 ) examined the coronavirus infection trend, treatment and mortality rates by employing classical machine learning techniques. In addition, the future growth of this virus was predicted using an open-source real-time dataset. The authors discovered that the naive Bayes framework forecasted the COVID-19 disease with the highest accuracy. Rai and coauthors ( Rai et al., 2022 ) predicted the death rate of COVID-19 patients employing majority rule-based ensemble techniques. Multivariate imputation, synthetic oversampling and feature selection approaches were used in this work. The XGBoost model attained the best accuracy and F1 coefficient of 86.9% and 71.6%, respectively.
Some authors have tried to predict coronavirus infection by using both machine learning and deep learning techniques. The use of the occlusion technique in COVID-19 detection was discussed in Udawat, Santani and Agrawal (2021) . A content-adaptive progressive occlusion analysis (CAPAO) algorithm was used to perform the analysis. The implementation was performed in several steps, including examining the region of interest, analyzing the spatiotemporal context, and checking the reference target and motion constraints. As a result, various models have produced mixed results, with accuracies ranging between 78.33% and 98.33%. The authors concluded that the COVID-RENet architecture, which is supplied to the SVM to execute the binary classification, accomplished better performance than the proposed CAPAO algorithm. By using a combination of a wrapper feature selection technique and machine learning and deep learning classifiers, Turabieh et al. initiated the automatic forecasting of COVID-19. CNN with the BGA approach achieved an 80% accuracy in predicting COVID-19. Cobre and researchers ( Cobre et al., 2021 ) anticipated coronavirus disease positivity and severity by applying various machine learning and neural network models. The authors employed artificial neural network, DT and KNN models to accurately categorize negative and positive instances with more than 84% accuracy. An accuracy of more than 92% was achieved when classifying patients based on their disease severity with the decision tree approach. Pourhomayoun and Shakibi ( Pourhomayoun & Shakibi, 2021 ) utilized an artificial neural network and achieved an 89.98% overall accuracy level. Shi and colleagues ( Shi et al., 2020 ) attempted to predict the infectious disease using attention-based deep learning models and chest X-ray images. A hybrid attention framework achieved better performance than other works with 94.11% accuracy. Zhang et al. employed a custom attention-based CNN framework to identify COVID-19 from 936 chest CT images. This work reported accuracy and an F1 score of 0.9632 and 0.9633, respectively. Wang and team ( Wang, Zhu & Zhang, 2021 ) applied a custom-built CNN framework with a multiple-way image augmentation technique to detect coronavirus. The authors achieved a maximum accuracy and F1 index of 96.36% and 96.35%, respectively. Kalaivani et al. ( Kalaivani & Seetharaman, 2022 ) attempted to identify coronavirus-affected patients using a combined boosted CNN technique from chest X-ray photographs with 0.994 accuracy. Pi and colleagues ( Pi & Lima, 2021 ) implemented image processing techniques and extreme learning-based neural network models to distinguish between healthy and COVID-19 chest CT images. The implemented classification model accomplished 76.98% maximum accuracy for the employed K-fold validation technique.
Few works have deployed the automatic prediction system into a website or smartphone application. Kaiser and his team ( Kaiser et al., 2021 ) created an Android application that includes training, information sharing, risk measurement, symptom display, communication tracking, and quick response during the COVID-19 outbreak. The authors collected a private dataset from a local clothing industry of Bangladesh, RBS Fashion. The authors used a fuzzy neural network algorithm in the proposed WorkSafe application to combine examination data and provide an essential metric for determining the safety of health workers. The app tracks sick employees and the measures and treatments they receive. The authors built their software with the Laravel PHP framework and employed a database on the backend to perform real-time contact tracing among industry workers. The authors accomplished the risk detection by analyzing the collected data and combining multiple machine learning methodologies and the fuzzy neural network method. Koshti and his coauthors ( Koshti et al., 2021 ) created a COVID-19 detection system using machine learning methodologies, a tracking system utilizing geofencing technologies, and an alert system using a mobile application. The authors used geofencing and methodological triangulation in the proposed application for data collection. The XGBoost model obtained better precision than the SVC model, with an accuracy of 99%. They used the 'step out' feature to access users' locations for tracking and, finally, stored user data in the Firebase database. The implemented automatic framework could successfully detect the presence of COVID-19 based on the user's symptoms and track geofences around users when they were in a specific security zone.
The literature reviews mentioned above lead us to the conclusion that extensive research has been done on automatic COVID-19 detection. In addition, numerous machine learning models have been created to forecast the potential occurrence of COVID-19. The results, however, are based solely on the implemented dataset and do not consider the temporal aspects of the progressive symptoms of coronavirus. Most of the work did not contribute to the explanation of the prediction of the machine learning techniques. The majority of these studies did not integrate the automatic detection system into a website or smartphone application for instantaneous prediction.
In this work, an automatic COVID-19 prediction system has been developed employing various machine learning techniques. An open-source dataset comprising information from 2,742,596 patients has been obtained. The dataset contains individual results of various symptoms and basic patient information such as the date of the test, gender, age, and COVID-19 test results for almost 2.75 million individuals.
This paper deploys an automatic COVID-19 prediction system where people can obtain an idea of whether they are affected by COVID-19 or not. The major contributions of this work are as follows:
• Preliminary analysis and data preprocessing, including the SMOTE technique, have been performed in order to prepare the dataset for the automatic prediction model training.
• The dataset was used to train seven machine learning models, including ensemble techniques, and three deep learning frameworks. Hyperparameter tuning has been performed to find the salient hyperparameters.
• The hybrid CNN-LSTM deep learning model has been applied to tackle the temporal attributes of the progression-based symptoms of the used COVID-19 dataset.
• Explainable AI with the LIME framework has been applied to interpret the machine learning models, and the SMOTE technique has been applied to address data imbalance.
• Finally, a website application has been developed to predict the final results. The SVM model achieved the best results, and it was eventually implemented on the website to predict COVID-19.
The novelty of this article is to implement the hybrid CNN-LSTM deep learning technique to detect COVID-19 from its various progressive symptoms.
Section II discusses the proposed automatic coronavirus identification system. The most important study findings, as well as all of the results and analysis, are presented in Section III. The article is concluded in Section IV by summarizing what we have accomplished in this paper and outlining the enhancements we intend to make in the future.

Dataset
The open-source dataset used in this work is acquired from the Department of Health, Israel ( Zoabi, Deri-Rozov & Shomron, 2021 ). The dataset contains individual results of different symptoms, basic information about the patients, and the COVID-19 test results of 2,742,596 patients. There are three types of information available in the dataset. First, it contains basic information about the patient, i.e., the test date, gender, and whether the patient's age is over 60. Second, it contains indicators that denote five symptoms of the patient, i.e., cough, fever, shortness of breath, sore throat, and headache. Third, it contains the COVID-19 test result and whether the patient recently came into contact with a COVID-19 patient.

Dataset preprocessing
Before utilizing the dataset for the automatic prediction models' induction process, initial exploration and data preprocessing have been performed in this work. The age and gender fields of the patient's personal information are frequently absent from the dataset. Replacing the missing values with average values would bias the entire dataset. As the used dataset is large, the rows with missing values for age and gender are dropped instead. After dropping the rows with missing values, the size of the dataset becomes 2,186,227. The age information does not contain the exact age of patients and rather just indicates if the patients were above 60. Additionally, because the patient age feature possesses the lowest correlation with the COVID-19 results, it is not used as a feature to train our models; therefore, we opted to remove that column. There are some COVID-19 results that are not confirmed as positive or negative. The rows with unconfirmed COVID-19 results have been dropped. The final size of the dataset is 2,151,898, where 1,943,172 and 208,726 are confirmed COVID-19 negative and positive cases, respectively. All the attributes are converted from categorical variables to numerical variables as some machine learning models cannot train on label data directly and require all the features as numerical variables. Fig. 1 shows the COVID-19 results in relation to distinct features in the dataset.
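The preprocessing steps described above can be sketched in plain Python as follows. This is only an illustrative sketch: the column names (gender, age_60_and_above, contact, result), the value codes and the helper make_row are assumptions made for the example, since the exact schema of the source dataset is not listed here.

```python
# Hypothetical schema: these column names and value codes are assumptions.
SYMPTOMS = ("cough", "fever", "sore_throat", "shortness_of_breath", "headache")


def preprocess(rows):
    """Drop incomplete or unconfirmed rows and encode the rest as integers."""
    cleaned = []
    for row in rows:
        # 1. Drop rows with a missing gender or age flag (no imputation).
        if not row.get("gender") or not row.get("age_60_and_above"):
            continue
        # 2. Drop rows whose test result is neither positive nor negative.
        if row.get("result") not in ("positive", "negative"):
            continue
        # 3. Encode the categorical values; the age flag is left out on purpose.
        record = {name: int(row[name]) for name in SYMPTOMS}
        record["gender"] = 1 if row["gender"] == "male" else 0
        record["contact"] = 1 if row["contact"] == "yes" else 0
        record["label"] = 1 if row["result"] == "positive" else 0
        cleaned.append(record)
    return cleaned


def make_row(gender, age_flag, result, contact="no", **symptoms):
    """Build one raw record for the toy example (hypothetical helper)."""
    row = {name: "0" for name in SYMPTOMS}
    row.update({key: str(value) for key, value in symptoms.items()})
    row.update(gender=gender, age_60_and_above=age_flag, result=result, contact=contact)
    return row


sample = [
    make_row("female", "No", "negative", cough=1),
    make_row("", "Yes", "positive", fever=1),        # missing gender: dropped
    make_row("male", "No", "other", fever=1),        # unconfirmed result: dropped
    make_row("male", "Yes", "positive", contact="yes", fever=1, headache=1),
]
cleaned = preprocess(sample)
```

Dropping rows rather than imputing mirrors the reasoning in the text: with more than two million records, the loss of incomplete rows is affordable, while imputed averages could bias the data.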
According to Fig. 1 , the dataset is highly imbalanced as there is a higher number of negative cases and a significantly smaller number of positive cases with a ratio of 9.3:1.0. The linear dependence of various features of the used dataset has been measured with the Pearson correlation index, shown in Fig. 2 .
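As a quick check of the figures above, the class ratio follows directly from the reported case counts, and the Pearson correlation index can be computed with a few lines of plain Python. The toy feature vectors below are invented for illustration and are not taken from the dataset.

```python
import math

NEGATIVE_CASES = 1_943_172  # confirmed negative cases reported in the text
POSITIVE_CASES = 208_726    # confirmed positive cases reported in the text


def imbalance_ratio(majority, minority):
    """Number of majority-class samples per minority-class sample."""
    return majority / minority


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


ratio = imbalance_ratio(NEGATIVE_CASES, POSITIVE_CASES)

# Toy binary columns (illustrative values only).
fever = [1, 1, 0, 0, 1, 0]
result = [1, 1, 0, 0, 0, 0]
correlation = pearson(fever, result)
```

Rounded to one decimal place, the ratio agrees with the 9.3:1.0 figure quoted above.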
To explore further and obtain a more balanced dataset for training the models, we have also used the SMOTE oversampling technique ( Suvon et al., 2022 ) to resample the training data to a 1:1 class ratio. Fig. 3 shows the training dataset after applying the SMOTE oversampling technique.
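The core idea of SMOTE, creating synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours, can be sketched as below. This is a simplified illustration and not the implementation used in this work; the function name smote, the neighbour count k and the fixed seed are assumptions for the example.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic samples by interpolating between minority points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base sample (squared Euclidean).
        neighbours = sorted(
            (point for point in minority if point is not base),
            key=lambda point: sum((a - b) ** 2 for a, b in zip(base, point)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position on the segment between base and neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy example: 10 majority samples versus 4 minority samples.
majority_count = 10
minority = [[0, 0], [1, 0], [0, 1], [1, 1]]
new_samples = smote(minority, n_new=majority_count - len(minority), k=2)
balanced_minority = minority + new_samples
```

Only the training split is resampled, so the test data keep the original class distribution. Note that interpolation between binary symptom vectors yields fractional values; how these were handled is not stated in the text.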

Machine learning models
The following machine learning techniques have been implemented in this work for automatic coronavirus prediction.
Logistic regression: Logistic regression is an efficient supervised machine learning technique and can be used for the classification of two classes. The model passes a linear combination of the inputs through the sigmoid function, $P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$, where $y$ and $x$ denote the dependent and independent variables, respectively, and $\beta_0$ and $\beta_1$ are the fitted coefficients.
Decision tree: The decision tree is a supervised machine learning model that can be used in different problem domains and is commonly used for classification tasks. The decision tree classification model resembles a tree-structured classifier with internal nodes representing the features, branches representing the decision rules, and leaves representing the outcome ( Ahmed et al., 2021 ). Fig. 4 illustrates a visual representation of the decision tree model's structure.
Random forest: The random forest model combines various classifiers to solve regression and classification problems ( Nurhachita & Negara, 2021 ). It utilizes ensemble machine learning techniques. The random forest is constructed with many decision trees and gives the best possible output based on the predictions from all the decision trees. Fig. 5 illustrates the structure of a random forest model.
K-Nearest Neighbors (KNN): KNN groups similar data by calculating their distance and predicts the outcome of new data based on that calculation ( Zhang et al., 2018 ).
Support Vector Machine (SVM): The SVM algorithm attempts to find the optimum line for the decision boundary so that additional data points can be categorized using it ( Cervantes, García-Lamont, Rodríguez & Lopez-Chau, 2020 ). This decision boundary is called a hyperplane in an n-dimensional data space. The points used to draw this line or hyperplane are found by the SVM algorithm during the training phase. Fig. 6 demonstrates the hyperplane categorizing two different classes.
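To make the distance-based prediction of the KNN model concrete, the following minimal sketch classifies a query by a majority vote among its k closest training samples. The function name knn_predict and the toy symptom vectors are assumptions for illustration only.

```python
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Predict the label of query by majority vote among its k nearest samples."""
    # Squared Euclidean distance; for binary features this equals the Hamming distance.
    ranked = sorted(
        (sum((a - b) ** 2 for a, b in zip(sample, query)), label)
        for sample, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy symptom vectors (illustrative values only).
train_x = [[1, 1, 0], [1, 0, 0], [0, 0, 0], [0, 0, 1]]
train_y = [1, 1, 0, 0]
prediction = knn_predict(train_x, train_y, query=[1, 1, 1], k=3)
```

Because every prediction scans all training samples, the cost grows with the size of the training set, which is consistent with KNN being slow on a dataset of this size.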
Gradient Boosting: Gradient boosting classifiers integrate several weak learning frameworks to establish a robust prediction model ( Bentéjac, Csörgő & Martínez-Muñoz, 2021 ). Decision trees are frequently utilized in the implementation of gradient boosting. The word "gradient" in gradient boosting refers to the gradient of the loss function with respect to the current predictions of the model. In an iterative functional gradient approach, gradient boosting diminishes a loss function by repeatedly selecting a weak hypothesis that points in the direction of the negative gradient.
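One boosting iteration can be illustrated with the logistic loss, whose negative gradient with respect to the raw score is simply the label minus the predicted probability. The sketch below assumes an idealized weak learner that reproduces these residuals exactly; in practice a shallow decision tree would be fitted to them. The function names and the learning rate of 0.5 are assumptions for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(y_true, raw_scores):
    """Mean logistic loss of raw (pre-sigmoid) scores."""
    return -sum(
        y * math.log(sigmoid(f)) + (1 - y) * math.log(1.0 - sigmoid(f))
        for y, f in zip(y_true, raw_scores)
    ) / len(y_true)


def negative_gradient(y_true, raw_scores):
    """Per-sample negative gradient of the logistic loss: y - sigmoid(F)."""
    return [y - sigmoid(f) for y, f in zip(y_true, raw_scores)]


def boosting_step(raw_scores, weak_output, learning_rate=0.5):
    """Move the ensemble scores along the weak learner's output."""
    return [f + learning_rate * h for f, h in zip(raw_scores, weak_output)]


labels = [1, 0, 1, 1]
scores = [0.0, 0.0, 0.0, 0.0]                   # initial ensemble output
residuals = negative_gradient(labels, scores)   # targets for the next weak learner
updated = boosting_step(scores, residuals)      # idealized weak learner: exact fit
```

After the step, the loss on the toy labels is lower than before, which is the behaviour the iterative procedure relies on.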
AdaBoost: The AdaBoost classifier is a meta-estimator that starts with a classifier that has been fitted to the original dataset ( Wang, 2012 ). It then fits additional copies of the classifier on the same dataset, but the weights of the incorrectly classified examples are increased so that subsequent classifiers focus more on the difficult cases.
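The reweighting step can be sketched with the classical binary AdaBoost update. Which AdaBoost variant was used in this work is not stated, so the following is an assumption-based illustration; it presumes that the weights sum to one and that the weighted error lies strictly between 0 and 1.

```python
import math


def adaboost_reweight(weights, y_true, y_pred):
    """One round of the classical AdaBoost sample-weight update."""
    # Weighted error of the current weak classifier.
    error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    # Vote strength of this classifier in the final ensemble.
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Increase the weights of misclassified samples, decrease the others.
    updated = [
        w * math.exp(alpha if t != p else -alpha)
        for w, t, p in zip(weights, y_true, y_pred)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]


weights = [0.25, 0.25, 0.25, 0.25]
y_true = [1, 1, 0, 0]
y_pred = [1, 1, 0, 1]   # the last sample is misclassified
alpha, new_weights = adaboost_reweight(weights, y_true, y_pred)
```

In this toy round the single misclassified sample ends up carrying half of the total weight, so the next weak classifier is pushed to get it right.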
XGBoost: XGBoost is well known for providing better solutions than other machine learning algorithms ( Kumari, Kumar & Mittal, 2021 ). In fact, since its inception, it has become a "sophisticated" machine learning algorithm for dealing with structured data.
Deep learning models: In this work, various deep learning models, ANN, CNN and hybrid CNN-LSTM, have been used. ANN and CNN frameworks' working strategies are inspired by the neurons we find in living animals' brains. In these techniques, many nodes are connected to each other, similar to the human brain's neuron network ( S. Sharma et al., 2021 ). Each node can send a signal or communicate with another node, similar to neurons. In ANN, when a node receives a signal, it analyzes it and then passes it to the next connected node. The corresponding nonlinear functions generate the output of each node. Initially, all the nodes start with random weights, and with the model training, the nodes best fit themselves to make an accurate prediction for the input signal or data.
This study uses CNN-LSTM, a hybrid deep learning technique, for automatic COVID-19 detection. The CNN model extracts the spatial characteristics effectively; however, its performance deteriorates in sequential  tasks. The COVID-19 symptoms involve progressive occurrences, starting with fever, headache, and sore throat and progressing to breathing problems and chest pain. The subsequent LSTM model considers the corresponding temporal features of the dataset. The proposed CNN-LSTM architecture utilized in this work comprises similar layers of the model used in Halbouni et al. (2022) , as demonstrated in Fig. 7 .
After preprocessing the dataset and implementing suitable feature selection techniques, several machine learning models were trained using the dataset. Next, the performance of various models is compared concerning various evaluation metrics. Finally, the best model is used to design the proposed automatic COVID-19 prediction website framework. Fig. 8 illustrates the working sequences of the proposed COVID-19 prediction system.

Webpage
The goal of this webpage, as demonstrated in Fig. 10 , is to provide an online COVID-19 detection to determine whether or not users are likely to be infected with COVID-19. In this work, we used HTML, CSS, and JavaScript for the front end of the proposed online application. The basic foundation of the proposed web framework was built in HTML, and the styling was done in a CSS framework. The SVM model has been chosen for prediction deployment on the proposed website since it achieved the highest accuracy among all the models. The functional flowchart of the proposed automatic COVID-19 prediction website has been illustrated in Fig. 9 . First, users can input their symptoms in binary digits, and they will receive a prognosis and recommendations regarding their symptoms and health conditions. A demonstration of the proposed automatic COVID-19 prediction website has been illustrated in Fig. 10 .
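A possible server-side counterpart of the binary symptom form is sketched below: it validates that every field is 0 or 1 and builds the feature vector passed to the deployed model. The backend of the website is not described in this paper, so the field names and the function parse_symptom_form are purely hypothetical.

```python
# Hypothetical field names; the text only states that symptoms are entered as binary digits.
SYMPTOM_FIELDS = (
    "cough",
    "fever",
    "sore_throat",
    "shortness_of_breath",
    "headache",
    "contact_with_patient",
)


def parse_symptom_form(form):
    """Validate submitted form values and return them as a list of 0/1 integers."""
    features = []
    for name in SYMPTOM_FIELDS:
        value = str(form.get(name, "")).strip()
        if value not in ("0", "1"):
            raise ValueError(f"{name} must be 0 or 1, got {value!r}")
        features.append(int(value))
    return features


submitted = {
    "cough": "1",
    "fever": "1",
    "sore_throat": "0",
    "shortness_of_breath": "0",
    "headache": "1",
    "contact_with_patient": "0",
}
feature_vector = parse_symptom_form(submitted)
```

Rejecting anything other than 0 or 1 before prediction keeps malformed input from reaching the model.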

Results and discussion
This section discusses the numerical results of the proposed automated coronavirus detection system for various machine learning frameworks. The dataset has been partitioned into 75% training and 25% test data with stratification enabled, so that both partitions preserve the class proportions.
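A stratified 75%/25% partition can be sketched as follows: the indices of each class are shuffled and split separately, so both partitions keep the original class proportions. The function name stratified_split and the seed are assumptions for the example.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=42):
    """Return (train_indices, test_indices) preserving the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels with a 10:1 imbalance (illustrative values only).
labels = [0] * 80 + [1] * 8
train_idx, test_idx = stratified_split(labels)
```

With a plain random split, a small test set could by chance contain very few positive cases; stratification removes that source of variance.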

Logistic regression model
In this work, various logistic regression models with C = [0.01, 100] and L1 and L2 penalty hyperparameters are applied. The logistic regression model achieved 80.91% training accuracy and 83% precision on the test dataset with C = 1 and L2 penalty. The model performed better in terms of predicting the positive cases. The F1 score of the model is 0.80, and recall is 0.81.
After applying the SMOTE technique to the training data, the performance of the logistic regression model improved to 92.02% accuracy and 92% precision. The F1 score of the model also improved to 0.92. Fig. 11 shows the confusion matrix for the logistic regression model after applying the SMOTE technique to the training data.
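The accuracy, precision, recall and F1 score reported throughout this section all derive from the four cells of a confusion matrix. The sketch below shows the calculation; the counts are invented for illustration and are not the values of Fig. 11.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative counts only (not taken from the paper).
metrics = classification_metrics(tp=90, fp=10, fn=10, tn=90)
```

On an imbalanced test set, accuracy alone can look high even when the minority class is poorly detected, which is why precision, recall and F1 are reported alongside it.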

Decision tree model
To find the optimum depth for the decision tree model, before training models, we tested the accuracy of different decision tree models with various depths ranging from 1 to 9 using the hyperparameter optimization technique, GridSearchCV. The decision tree model achieved 92.18% accuracy and 93% precision on the test dataset with a maximum depth of three. The F1 score of the model was 0.92, and recall was 0.92. Fig. 12 represents the accuracy of the decision tree models with different values for depth after applying SMOTE.

Random forest model
The random forest model performed similarly to the decision tree model on the test dataset and achieved 80.91% accuracy and 83% precision. After applying SMOTE to the training data, the efficiency of the random forest technique improved significantly. The model achieved 92.19% accuracy with 93% precision. The F1 score of the model was 0.92, and recall was 0.92. Fig. 13 shows the confusion matrix for the random forest model after applying the SMOTE approach to the training data.

KNN model
The KNN model with nearest neighbors numbered between 1 and 10, required a longer time than the other models to predict on the test dataset as it was computationally complex. With SMOTE and 8 nearest neighbors, it achieved 92.16% accuracy on the test dataset and 91% precision. The F1 score of the model was 0.92, and recall was 0.91.

SVM model
In this study, three kernel functions (linear, RBF and sigmoid), C = [0.1:100] and gamma = [0.10:10] have been used to train a variety of SVM models. The hyperparameters were tuned using GridSearchCV. Among the various hyperparameter combinations, the SVM with the RBF kernel, C = 10 and gamma = 1 performed the best. Without SMOTE, the SVM model achieved 83% accuracy and precision. The F1 score was 0.81, and recall was 0.82. After applying SMOTE, the performance improved to 93% accuracy and precision with an F1 score of 0.93 and a recall of 0.93.
Fig. 15. ROC curve and DET graph of various machine learning models.

AdaBoost classifier
In this work, the AdaBoost and XGBoost models are trained with various hyperparameters, including the maximum number of estimators between 10 and 100 and learning rate = [0.10:0.90]. The AdaBoost classifier achieved 92.65% accuracy and 96% precision on the test dataset with SMOTE and 0.50 learning rate. The model performed better in terms of predicting the negative cases. The F1 score of the model was 0.96, and recall was 0.96.

XGBoost model
The XGBoost model achieved 92.18% accuracy and 96% precision on the test dataset with SMOTE and 0.50 learning rate. The model performed better in terms of predicting the negative cases, for which the F1 score and recall were 0.96 and 0.95, respectively. For positive cases, the precision was 59%, with an F1 score of 0.62 and a recall of 0.62. Fig. 14 illustrates the confusion matrix for the XGBoost model.

Performance comparison
Next, we compared different models' performances on several metrics to find the better-performing model. We examined the receiver operating characteristic (ROC) and the detection error trade-off (DET) graphs to interpret various techniques' performance. For the ROC curves, curves closer to the top left corner indicate better performance. According to Fig. 15 , the ROC curves of the decision tree and random forest models are the best. The DET graph shows the detection error trade-off of the models, and it also shows that the decision tree and random forest perform better than the other models.
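The ROC and DET curves discussed above are both obtained by sweeping the decision threshold over the predicted scores. A minimal sketch, with invented scores and the assumed function names roc_points, det_points and auc, is given below.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs over all thresholds."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def det_points(points):
    """DET view of the same sweep: (false positive rate, false negative rate)."""
    return [(fpr, 1.0 - tpr) for fpr, tpr in points]


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Toy labels and scores (illustrative values only).
y_true = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
roc = roc_points(y_true, scores)
area = auc(roc)
```

A curve that hugs the top left corner of the ROC plot corresponds to an area close to 1, while the DET view shows the same trade-off in terms of the two error rates.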
We observed that the models' performances were similar across the other performance metrics, e.g., accuracy, precision, F1 scores, and recall. All models performed better in predicting the positive COVID-19 cases than the negative ones.
Training and validation accuracy and loss graphs over the training epochs of the proposed CNN-LSTM model with the SMOTE technique are illustrated in Fig. 16 . The symptoms of COVID-19 are progressive in nature and involve temporal characteristics. As anticipated, the proposed CNN-LSTM model accomplished a superior performance with 96.34% validation accuracy compared to other techniques. Table I contains all the performance metrics for the machine learning models with and without the SMOTE technique. As can be seen in Table I , the various models performed almost similarly in terms of predicting COVID-19 cases without the SMOTE technique. However, the performance of all the models improved significantly after applying the SMOTE oversampling approach to the training data. The accuracy of all the ML models improved by an average of approximately 10.95% with the SMOTE technique. Fig. 17 shows the performance comparison of the five best-performing prediction techniques. According to this radar chart, the proposed CNN-LSTM deep learning model outperformed all other techniques by a

LIME-based explainability model
Explainable AI consists of a collection of frameworks that help to visualize and better understand the predictions offered by the machine learning models ( Collini et al., 2022 ). The LIME-based explainability method has been employed in this work to explain the predictions of the machine learning classifiers. The LIME model can interpret the estimates produced by the machine learning model by performing local approximations around the estimate points ( Sahay, Omare & Shukla, 2021 ). The original LIME model has stability issues: when it is repeatedly applied under the same conditions, it can give different explanations. We use an improved version of the LIME model in this research to overcome the stability problem.
Finally, Fig. 18 illustrates an example of an interpretation of the COVID-19 prediction employing the LIME explainable AI framework. The blue bars explain the treatment's background and symptoms that have significant weight to support negative prediction, while the orange bars illustrate the positive one. The explanation indicates that when making the negative COVID-19 prediction with 95% confidence, in this case, the main symptoms of the patient and the history of the treatment that contributes the most to the projection are the test indications, fever, headache, sore throat and shortness of breath.
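The local approximation idea behind LIME can be sketched in plain Python: perturb the instance, weight each perturbed sample by its proximity to the original, and fit a weighted linear surrogate whose coefficients act as feature weights. This is a simplified illustration and not the improved LIME variant used in this work; the function names, the kernel width, the sample count and the toy black-box model are all assumptions.

```python
import math
import random


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def lime_explain(predict, instance, n_samples=500, kernel_width=0.75, seed=0):
    """Return one local linear weight per binary feature of instance."""
    rng = random.Random(seed)
    d = len(instance)
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        # Perturb the instance by flipping each binary feature with probability 0.5.
        sample = [v if rng.random() < 0.5 else 1 - v for v in instance]
        distance = math.sqrt(sum(a != b for a, b in zip(sample, instance)) / d)
        # Samples closer to the original instance receive a larger weight.
        weights.append(math.exp(-(distance ** 2) / kernel_width ** 2))
        rows.append([1.0] + [float(v) for v in sample])
        targets.append(predict(sample))
    size = d + 1
    # Weighted least squares via the normal equations, with a tiny ridge term.
    xtwx = [
        [
            sum(w * r[i] * r[j] for w, r in zip(weights, rows)) + (1e-6 if i == j else 0.0)
            for j in range(size)
        ]
        for i in range(size)
    ]
    xtwy = [sum(w * r[i] * t for w, r, t in zip(weights, rows, targets)) for i in range(size)]
    return solve(xtwx, xtwy)[1:]  # drop the intercept


def toy_model(features):
    """Hypothetical black box: fever matters most, cough not at all."""
    fever, cough, headache = features
    return 0.2 + 0.5 * fever + 0.0 * cough + 0.1 * headache


explanation = lime_explain(toy_model, instance=[1, 0, 1])
```

Fixing the random seed makes repeated explanations of the same instance identical, which is one simple way to address the instability mentioned above; whether the improved variant used here works this way is not stated.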
A feedback survey of the designed automatic coronavirus detection website has been performed in this work. Twenty-seven volunteers (15 male and 12 female) aged between 19 and 46 years participated in this evaluation process. All of them assessed distinct features of the proposed website on a Likert scale of 1 to 5. The average ratings of various salient features are shown in Fig. 19 .

Conclusions
The main goal of this work is to predict whether the user is infected with the coronavirus and create awareness about people's COVID-19 circumstances, which may assist in preventing the disease from spreading more in future. This article utilizes an open-source dataset containing records of more than 2 million people with their symptoms and essential information that includes the test date and result, gender, age, etc. Various data preprocessing techniques are applied, e.g., handling null values, conversion of categorical features, and imbalanced datasets with the SMOTE approach. Next, multiple machine learning approaches with hyperparameter tuning have been utilized. The CNN-LSTM model with the SMOTE approach accomplished the best prediction results regarding classification accuracy and F1 score. Next, an explainable AI technique with the LIME framework is employed to interpret the prediction results. Finally, the proposed machine learning model has been deployed on a website.
An obvious limitation of this study is its reliance on an open-source dataset comprising patient data from a specific region. Future studies can use a private dataset with additional biomarkers covering more features and regions. Prediction accuracy can be improved by utilizing meta-learning techniques and a combination of the machine learning models with fuzzy logic frameworks. Feature selection with the wrapper technique can be applied to enhance the performance of the system.