A comprehensive review for chronic disease prediction using machine learning algorithms

The past few years have seen growing interest in the role of machine learning (ML) in the medical field. Diseases, health emergencies, and medical disorders may now be identified with greater accuracy because of technological advancements and advances in ML. Early diagnosis is especially important for individuals with chronic diseases (CD). Our study has focused on analyzing ML's applicability to predict CD, including cardiovascular disease, diabetes, cancer, liver, and neurological disorders. This study offers a high-level summary of the previous research on ML-based approaches for predicting CD and some instances of their applications. We also compare the results obtained by various studies, together with the methodologies and tools employed by the researchers. The factors or parameters responsible for improving the accuracy of the prediction models in previous works are also identified. For identifying significant features, most of the authors employed a variety of strategies, among which least absolute shrinkage and selection operator (LASSO), minimal-redundancy-maximum-relevance (mRMR), and RELIEF are extensively used methods. A wide range of ML approaches, including support vector machine (SVM), random forest (RF), decision tree (DT), naïve Bayes (NB), etc., have been widely used. Several deep learning techniques and hybrid models have also been employed to create CD prediction models, resulting in efficient and reliable clinical decision-making models. For the benefit of the whole healthcare system, we also offer our suggestions for enhancing the prediction results of CD.


Introduction
Islam et al. Journal of Electrical Systems and Inf Technol (2024) 11:27
In the last 20 years, machine learning (ML) has advanced considerably from being a research curiosity to a useful technology with widespread commercial applications. It is a branch of artificial intelligence (AI) that employs statistical methods to fit models to data and discover relevant patterns from massive, unstructured, and complicated datasets [1]. It is a comprehensive, multidisciplinary field with roots in statistics, mathematics, computer science, and cognitive analytics, among other disciplines [2]. Models trained by ML algorithms can use past data to make accurate predictions about unseen data. The basis of the ML process is observations of data, such as examples, firsthand knowledge, or instructions. It searches for patterns in the data and then draws conclusions from the supplied instances. The main goal of ML is to make it possible for computers to learn independently, without human aid, and to adapt after retraining. To predict future outputs, a supervised ML algorithm trains a model using historical data on both inputs and outputs, whereas unsupervised ML explores intrinsic structures and hidden patterns in input data [3][4][5]. ML approaches have recently had a considerable impact on the healthcare industry (HI). The use of ML techniques in healthcare can lead to advancements such as more precise prediction models, new treatment approaches, clinical decision support systems (CDSS), medication development, and reductions in healthcare expenditures [6,7]. Recent practical uses of ML in healthcare have been enabled by the collection of daily healthcare data as well as the advancement of big data processing. Different ML techniques can be applied to those datasets, which may be in structured or unstructured form, to provide a better outcome in healthcare. Various ML algorithms, such as linear regression (LR), support vector machine (SVM), random forest (RF), decision tree (DT), K-nearest neighbor (KNN), deep learning (DL), artificial neural network (ANN), and boosting algorithms are widely used to predict diseases [8,9]. One application of ML in the HI is forecasting which treatment protocols would work best for a particular patient, based on the patient's characteristics and the state of the treatment. ML applications require a training dataset that includes an outcome variable for building various models for physicians and patients [10].
Chronic disease (CD) is a condition or illness that lasts for at least three months and can have serious long-term consequences. CD is more common in elderly people and can typically be managed but not cured [11,12]. Cancer, cardiovascular disease (CVD), diabetes, brain disease, liver disease, stroke, and arthritis are common forms of CD [13]. The World Health Organization (WHO) estimates that CD causes 41 million deaths annually, or 74% of all deaths worldwide. Each year, 17 million people under the age of 70 die from a CD; however, only 15% of these premature deaths occur in countries with high incomes [14,15]. CVD causes the largest number of CD deaths, followed by cancer and diabetes. Smoking, lack of exercise, excessive alcohol consumption, and poor nutrition contribute to an increased risk of dying from a CD [16]. In the field of healthcare informatics, CD prediction plays a significant role. CD diagnosis systems can be very effective in correctly planning and managing the care of CD patients [17,18]. Predicting diseases early, so that patients can receive proper treatment and disease severity can be limited, is essential to reducing mortality and preparing for future diseases [19]. Disease prediction models built with ML techniques such as RF, DT, KNN, ANN, NN, SVM, NLP, and many more can predict diseases reliably, accurately, and efficiently, allowing health officials and doctors to take preventative measures [20,21].
Numerous works have been conducted on individual CDs, and most researchers have discussed different aspects and outcomes of one specific disease; we have tried to bring all the CDs under the same umbrella. The aim of this systematic review is therefore to provide a comprehensive overview of previous studies on prediction models for different CDs. We place particular emphasis on comparative tabular data drawn from previous research, so that readers can easily find the description of each dataset, the findings, the outcomes, and the key factors that helped to improve the accuracy of each proposed system. Furthermore, we provide a list of datasets available for the classification of different chronic diseases.

Contribution of this study
The main contributions of this study are as follows:
• This study focuses on how ML algorithms are used to predict CDs such as liver disease, cancer, brain disease, heart disease, and diabetes.
• This article covers the authors' proposed systems and findings, objectives, data sources, technologies, algorithms, and the accuracy of their studies.
• This study also addresses the future direction for cost-effective medical care by integrating the predictive model (PM) into the healthcare system.
• This comprehensive review of different CD predictions can be helpful for future researchers.
The remaining sections of this study are arranged as follows: Sect. "Methodology" discusses the entire journey of paper selection and review from various journals. Sect. "Predictive model (PM) using ML algorithms" provides short information on how PM works. In Sect. "ML for CD prediction", previous studies are reviewed in which ML is used for predicting and diagnosing different CDs. The remaining sections cover the discussion and conclusion, respectively.

Methodology
We mainly searched for articles in the databases of high-impact publishers such as Wiley, Oxford journals, The Lancet, Springer, IEEE, Hindawi, ACM, and ScienceDirect. As shown in Fig. 1, more than 470 papers were screened for our investigation, and the search terms were "ML in healthcare," "chronic disease prediction," and "chronic disease classification." The paper collection process consisted of two steps. As this study works with prediction models of CD, chronic disease prediction or classification papers were searched in the first phase. In the second phase, the applicability of each paper to the study was thoroughly scrutinized. Most of the research articles selected for CD prediction were published between 2018 and 2024. In addition, this study only examined papers with a high number of citations (more than ten) or a relevant abstract and title for further investigation. During this process, articles were included in the collection only if all the authors deemed them appropriate; any differences were resolved by consensus. In this way, 125 of the 473 papers were found to be pertinent to our investigation.

Predictive model (PM) using ML algorithms
PM can predict future outcomes by assessing past results and existing data. PM has gradually been integrated into data mining using AI technologies and ML algorithms. It has improved the quality of decision-making processes and allowed for better foresight into potential outcomes [22,23]. PM consists of seven phases (Fig. 2), beginning with the data collection process. Data collection is followed by data preparation, which includes removing missing values, scaling, eliminating outliers, and balancing the dataset. Selecting the ML model is the third step in this process. After the ML model has been chosen, the dataset should be split. The fifth stage, which involves training the selected model on the dataset, is the most important. Many researchers use hyperparameter tuning techniques to improve accuracy [24,25]. Beyond the HI, predictive analytic techniques and tools can now reliably forecast a company's future sales and profit, because the PM now incorporates sales [26]. Another area is marketing, specifically anticipating customers' reactions and needs based on information gleaned from feedback [27]. In the social media industry, PM enables platforms to identify client behavior and predict future consequences [28]. PM thus has several uses, but one of the most important is risk assessment, which estimates risk and the degree of profit or loss that the future may hold. Predictive analytics for quality improvement considers previous comments, adjustments, and suggestions that might improve quality [29,30]. The prediction model is widely utilized in the HI, where it has become a valuable tool for making medical decisions because patients react differently to every form of treatment, particularly for chronic diseases [19]. An early diagnosis can present an opportunity to develop a practical preventative and treatment approach. ML models are used to predict disease, which helps doctors categorize high-risk patients, provide a unique diagnosis, reduce risk, and eliminate health hazards [31,32]. This research mainly concentrates on reviewing prediction models of CD.
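The phases named above can be illustrated with a minimal sketch, written here in plain Python so that it runs without external libraries. It is an illustration under stated assumptions, not the workflow of any reviewed study: the nearest-centroid classifier stands in for whichever ML model is selected, the record layout (feature values followed by a class label, with `None` marking a missing value) is hypothetical, and the final evaluation step is an assumption because the text does not name the last phases.

```python
import random

def min_max_scale(rows):
    """Scale every feature column to the range [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

def train_test_split(X, y, test_ratio=0.3, seed=0):
    """Shuffle the sample indices and split them, 7:3 by default."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = int(round(len(idx) * (1 - test_ratio)))
    train, test = idx[:cut], idx[cut:]
    return ([X[i] for i in train], [y[i] for i in train],
            [X[i] for i in test], [y[i] for i in test])

def fit_centroids(X, y):
    """Toy stand-in model: store one mean feature vector per class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc]
            for label, acc in sums.items()}

def predict(centroids, row):
    """Assign the class whose centroid is closest to the row."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(row, c))
    return min(centroids, key=lambda label: dist2(centroids[label]))

def run_pipeline(records, test_ratio=0.3, seed=0):
    # Phases 1-2: collected records are prepared by dropping rows with
    # missing values and scaling the features.
    complete = [r for r in records if None not in r]
    X = min_max_scale([list(r[:-1]) for r in complete])
    y = [r[-1] for r in complete]
    # Phases 3-4: the model is chosen (nearest centroid here) and the
    # dataset is split.
    X_tr, y_tr, X_te, y_te = train_test_split(X, y, test_ratio, seed)
    # Phase 5: training.
    model = fit_centroids(X_tr, y_tr)
    # Assumed final step: accuracy on the held-out test set.
    hits = sum(predict(model, row) == label for row, label in zip(X_te, y_te))
    return hits / len(y_te)
```

Outlier removal, class balancing, and hyperparameter tuning are omitted from this sketch; in practice they would slot into the preparation and training steps.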

Liver disease (LD)
After the skin, the liver is the largest organ of the body, and it is the largest internal organ. It is roughly the size of a football and rests just beneath the right ribs. As food travels through the digestive system, the liver sorts out the good from the bad. It also secretes bile, which helps in digestion and eliminates harmful substances from the body [33]. The liver's ability to operate correctly declines when scar tissue gradually replaces healthy liver tissue. LD can lead to liver failure and cancer if not addressed in time. LD causes approximately 2 million deaths a year: 1 million from cirrhosis and another 1 million from hepatocellular carcinoma and viral hepatitis [34,35]. LD has several potential causes, including infections, poor dietary choices, drug use, alcohol abuse, and toxic exposure. Genetic predispositions to LD exist as well. Hepatitis A, B, C, D, and E are all forms of viral hepatitis, while fatty LD results from poor dietary choices and an unhealthy lifestyle. Hepatitis B and C are the most prevalent of these five forms of the disease. Every 30 s, a hepatitis patient dies, and 11% of the world's population succumbs to the disease each year. Between 20 and 40% of the population in Western industrialized countries has nonalcoholic fatty LD (NAFLD). Along with rising rates of obesity, type 2 diabetes (T2D), and metabolic syndrome, NAFLD has been on the rise in recent years [36,37]. With the expanding use of ML in the healthcare industry, building more precise prediction models with a wide range of ML methods is becoming increasingly popular. Early prediction of the risk factors associated with LD could greatly help diagnosis, prevention, or treatment [38]. Therefore, this study analyzes previous research on LD prediction models by providing information about the dataset, research objective, algorithms used, findings, and other important aspects of each study (Table 1).
Liu et al. [39] used seven ML algorithms to predict non-alcoholic fatty liver disease using a dataset of 15,315 cases and 35 characteristics from the International Health Care Center. BMI was the most significant indicator based on the feature ranking. The dataset was partitioned at random in a 7:3 ratio. The study used the TensorFlow framework to create the multilayer perceptron (MLP), CNN, and long short-term memory (LSTM) network models, while the Python scikit-learn library was used to develop the XGBoost, SVM, stochastic gradient descent (SGD) classifier, and LR models. Model efficacy was evaluated using nine different metrics. Compared across these metrics, XGBoost offered the best accuracy.
Liu et al. [40] built a prediction model to estimate the recurrence risk of hepatocellular carcinoma patients. Additionally, they constructed a web-based personal assessment system for the patient. The proposed approach utilized six ML algorithms. The dataset sample size was 315. The dataset was split in a ratio of 7:3 for training and testing, respectively. In the pre-processing stage, the authors replaced missing values and applied the Synthetic Minority Over-sampling Technique (SMOTE). MLP obtained the highest accuracy in this research.
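The core idea of SMOTE, creating synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority-class neighbours, can be sketched as follows. This is a simplified illustration, not the implementation used in [40]; the function name `smote`, the neighbour count `k=3`, and the fixed seed are assumptions made for the example.

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """Return n_new synthetic points, each placed at a random position on
    the segment between a minority sample and one of its k nearest
    minority-class neighbours (squared Euclidean distance)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by distance to the base sample.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic
```

Because each synthetic point lies between two existing minority samples, over-sampling of this kind is normally applied to the training split only, so that no synthetic point derived from a test sample leaks into training.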
Cao et al. [41] suggested a technique to predict and evaluate NAFLD patients with several ML methods. Four distinct models were developed for making predictions, and their performance was compared to determine the most suitable model. The sample size of their study was 22,140. Of the four ML algorithms analyzed, the XGBoost model exhibited the best results, with the following metrics: accuracy (83.5%), specificity (83.4%), sensitivity (83.5%), Youden index (66.9%), recall (83.5%), precision (83.1%), F-1 score (83.3%), and AUC (91.4%).
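Most of the metrics listed above follow directly from the four cells of a binary confusion matrix. The sketch below uses the standard textbook definitions; the function name and the argument names (`tp`, `fp`, `tn`, `fn`) are chosen for this illustration, and AUC is left out because it requires ranked prediction scores rather than counts.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from the cells of a binary confusion matrix."""
    sensitivity = tp / (tp + fn)   # also called recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    youden = sensitivity + specificity - 1   # Youden index J
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "youden": youden,
    }
```

The Youden index definition is consistent with the reported figures: a sensitivity of 83.5% and a specificity of 83.4% give 0.835 + 0.834 - 1 = 0.669, that is, 66.9%. Sensitivity and recall are the same quantity, which is why both are reported as 83.5%.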
Harrison et al. [42] predicted the near-term mortality of patients with liver cirrhosis using two ML algorithms, LR and LSTM. Their study aimed to integrate the suggested model into an electronic health record system to facilitate precise and prompt forecasts of decompensation and mortality. This PM used a dataset consisting of 62 features and 340,553 records from the ICU at Virginia Health System. The effectiveness and generalizability of each model were verified by testing them on an anonymized dataset of 2017 patient stays.
Speiser et al. [43] predicted the daily status of patients with acute liver failure brought on by acetaminophen use. They assessed the effectiveness of methods for predicting outcomes in the first week of hospitalization. The Acute Liver Failure Study Group (ALFSG) database served as the source of information; the sample of 1042 included 14 characteristics. The methods included generalized linear mixed models (GLMM), Bayesian GLMM, and binary mixed model (BiMM) tree and forest. RF, SVM, KNN, ANN, and CART were also applied. The model was verified using ROC, sensitivity, and specificity. BiMM trees gave the highest accuracy for this PM.

Cancer
Cancer is a condition characterized by the proliferation of abnormal cells that have the ability to invade or spread to other regions of the body [50]. There are more than two hundred distinct varieties of cancer, and they can be categorized based on their location or starting point within the body. Most cancer-related diseases and fatalities result from tumor cells reemerging in nearby organs and tissues [51]. Malnutrition in cancer probably has more than one cause, but the location of the tumor and the symptoms that appear, such as anorexia, taste changes, dysphagia, nausea, vomiting, and diarrhea, can make nutrition and functional ability even worse [52,53]. According to Global Cancer Statistics, lung cancer was the leading cause of cancer-related mortality (1.8 million deaths, 18%), followed by colorectal cancer (CRC) (9.4%) and liver cancer (8.3%). Among all cancer types, breast cancer accounted for the highest number of newly identified cases (2.3 million), representing 11.7% of all cases, followed by lung cancer (11.4%) and CRC (10%) [54]. This study includes skin cancer, ovarian cancer, breast cancer, gastric cancer, lung cancer, and thyroid cancer. Cancer research has shifted its focus to early detection and prognosis because of the positive impact it may have on the clinical care of patients [55,56]. Several studies have been conducted to create an efficient predictive model for cancer patients. Therefore, this study analyzes previous research on cancer prediction by providing information about the dataset, research objective, algorithms used, findings, and other important aspects of each study (Table 2).
Abbasi et al. [57] predicted skin cancer by employing the Kaplan-Meier estimator and the Cox proportional hazards regression model together with eight ML classifiers on a publicly available dataset from the ICGC Data Portal, specifically targeting skin cutaneous melanoma cancers. Additionally, four different ensemble methods (stacking, bagging, boosting, and voting) were created and trained to achieve optimal results. The performance was evaluated and interpreted using accuracy, precision, recall, F1 score, confusion matrix, and ROC curves. The RF classifier achieved an outstanding accuracy of 99%.
Murugan et al. [59] suggested a methodology to predict skin cancer that uses image pre-processing to filter and eliminate excess noise in the picture. A median filter was used to determine the location of the skin region of the affected area, and the mean shift segmentation technique was then utilized to separate the afflicted area from the surrounding healthy skin. SVM, probabilistic neural networks (PNN), RF, and a combined SVM + RF classifier were employed as the methods for this study. The combined SVM + RF classifier produced better results than the other classifiers. A total of 1000 images were used in the experiment, and 10-fold cross-validation was used, so that all samples were used for both training and testing.
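The way 10-fold cross-validation lets every sample serve in both training and testing can be sketched as follows. The function name `k_fold_indices` and the fixed shuffling seed are assumptions for this illustration; the sketch produces index splits only and is not tied to any particular classifier.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    The shuffled indices are dealt into k folds; each fold serves as the
    test set exactly once while the other k - 1 folds form the training set.
    """
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx
```

With 1000 images and k = 10, each round trains on 900 images and tests on the remaining 100, and the reported score is typically the average over the ten rounds.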
Naji et al. [60] explored the use of ML algorithms to predict breast cancer and determine which algorithms were most efficient in terms of accuracy, precision, and confusion matrix. The article primarily compares the effectiveness of five classifiers: SVM, RF, LR, C4.5, and KNN. 25% of the dataset was utilized for testing, while 75% was used for training. SVM consistently outperformed the other classifiers.
Sakai et al. [61] suggested a convolutional neural network-based automated detection system for predicting early stomach cancer in endoscopic pictures. The most significant contribution of this study is the effective automated diagnosis of early stomach cancer with weak morphological traits, which might be difficult to identify even for endoscopists. About a thousand white-light imaging pictures of early gastric cancer (particularly types 0-I, 0-IIa, and 0-IIc) were employed in the study. The authors retrieved 24-bit full-color pictures with a resolution of 1000 × 870 pixels from the video sequence. They collected 172,555 cancer pictures and 176,388 normal images, all measuring 224 × 224 pixels, by applying nine different kinds of augmentation, including rotation, shear, shift, flip, and magnification (twice). The original network was trained for 50 epochs, with a learning rate of 0.0001 for the first 34 epochs and 0.00001 thereafter.
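The two-stage learning rate described for [61] amounts to a simple step schedule, sketched below. The exact epoch at which the rate changes is an assumption (here epochs 1 to 34 use 0.0001 and epochs 35 to 50 use 0.00001), and the function name is chosen for the example.

```python
def learning_rate(epoch, switch_epoch=34, early=1e-4, late=1e-5):
    """Step schedule: `early` up to and including `switch_epoch`
    (1-based), `late` for every epoch after it."""
    return early if epoch <= switch_epoch else late

# Learning rate for each of the 50 training epochs.
schedule = [learning_rate(epoch) for epoch in range(1, 51)]
```

Lowering the rate late in training is a common way to take smaller, more stable optimization steps once the loss has largely settled.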
Salmi & Rustam [62] used the NB algorithm, a simple probabilistic method with a strong independence assumption, to forecast colon cancer. The study's dataset was obtained from Al-Islam Hospital Bandung and consisted of seven columns and 209 instances. Age, carcinoembryonic antigen, hemoglobin, leukocytes, hematocrit, and thrombocytes were the features of the dataset. The dataset was divided into 80% for training and 20% for testing. The authors obtained 95.24% accuracy with the NB algorithm.

Brain disease
The brain is the most significant and complicated human organ, responsible for regulating almost every aspect of bodily function. Many neurological ailments, such as Alzheimer's disease (AD), Parkinson's disease (PD), stroke, meningitis, tumors, cerebral edema, and many more, are eventually due to aging and neuronal death [68][69][70].
According to a recent study, AD, PD, stroke, epilepsy, migraine, brain traumas, and neuro infections are just some of the many neurological illnesses that affect over a sixth of the global population and claim the lives of about 6.8 million people every year [71,72]. AD is the neurodegenerative disease that most frequently affects older people [73]. PD affects 2-4% of the population aged 65 and older, making it the second most prevalent neurodegenerative condition. Between 4.1 and 4.6 million persons were affected in 2005; experts project that the figure will more than double by 2030, reaching 8.7-9.3 million [74]. Among diseases that affect the brain, stroke is the most incapacitating long-term ailment [75]. Though men have a higher risk of having an acute stroke at some point in their lives, women have a higher mortality rate from such an event. Around 16% of all women are expected to die from a stroke, compared with 8% of all men; the discrepancy is primarily owing to the older average age at which strokes occur in women and to the longer average life expectancy of women [76,77]. Although there have been advancements in surgical and other therapeutic methods, brain disease or brain stroke continues to be one of the leading causes of death and disability. Improving patient quality of life requires accurate and early detection of those with brain diseases. This study covers previous studies of brain disease prediction to analyze the findings, methods, and important aspects of each study (Table 3).
A model for anticipating the impact of various variants on AD was suggested in [78]. The study's primary objective was to use ML to create a classification model for estimating the risk that a given variant poses for AD. The dataset contains 57,853 instances and 39 attributes. Recursive feature elimination with a cross-validation score was utilized to choose the most relevant features. To assess the system's accuracy, the authors utilized a variety of ML methods, including RF, XGBoost, AdaBoost (AB), and NN, to train and then test the model. Additionally, the authors developed a web server to find potentially harmful variants linked to AD. The algorithm assigned each input variant a score between 0 and 1. The threshold for determining deleteriousness was 0.38; below that number, a variant is considered harmless.
Lin et al. [80] proposed a framework to predict 90-day stroke outcomes using several ML algorithms. The analysis used 58,493 records and 206 features from the Taiwan Stroke Registry. To ensure accurate results, the authors added an assessment validation step to the data preparation pipeline to weed out any outliers with questionable ratings. The assessment validation procedure is divided into two steps: clinical-logic validation and a non-linear regression approach. The clinical-logic validation involved the development of a set of logical rules to verify the accuracy of the data. In the non-linear regression step, the locally weighted scatterplot smoothing technique was used to remove incoherent evaluations. The ML methods SVM, RF, ANN, and hybrid ANN were applied after the 17 most important features were chosen.
Haq et al. [81] suggested a method to predict PD from speech using the SVM algorithm. The study aims to identify changes in vowel vocalization that may be used to distinguish those with PD from those without PD. In preprocessing, missing values were eliminated and the MinMax scaler and the standard scaler were applied to the dataset. In the feature selection step, the L1-norm SVM method was employed to eliminate unnecessary features and increase the system's accuracy. The accuracy was highest with 10 significant features. The SVM with an RBF kernel and 10-fold CV on the full feature set achieved better classification performance with hyperparameter values of C = 1 and γ = 0.025 than with other hyperparameter values.
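The role of the hyperparameter γ can be seen from the RBF kernel itself, K(x, y) = exp(-γ ||x - y||²), which an SVM uses as its similarity measure between two feature vectors. The sketch below implements only this kernel function, with the reported value γ = 0.025 as the default; it is an illustration of the standard definition, not the authors' code, and C (the penalty on margin violations) does not appear because it belongs to the SVM optimization rather than to the kernel.

```python
import math

def rbf_kernel(x, y, gamma=0.025):
    """RBF (Gaussian) kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)
```

Identical vectors give a similarity of 1, and the similarity decays toward 0 as the vectors move apart. A small γ such as 0.025 makes this decay slow, so each training sample influences a wide region of the feature space and the decision boundary is comparatively smooth.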
Using ML approaches, Kostov et al. [82] proposed a framework for predicting the risk of stroke. The purpose of the study was to use ML techniques to assess the risk factors for ischemic stroke in patients with epilepsy from a massive volume of data from general practitioners in Germany. Stroke-prone subpopulations were selected using the Sub-Population Optimization and Modeling Solutions (SOMS) application. ROC was applied to evaluate model performance. Although age was not identified as a significant indicator, male gender was found to be 1.5% more important than random chance.

Heart disease or cardiovascular diseases (CVD)
The heart is one of the essential parts of the body. It is responsible for circulating blood throughout the body [89]. The circulatory system is critical because it carries blood, oxygen, and other substances to the body's cells and tissues. Severe health conditions, including death, can result if the heart is not functioning correctly [90]. According to estimates, 17.9 million people die from CVD every year, making it the world's leading cause of mortality. CVDs are a group of heart and blood vessel disorders that includes coronary heart disease, cerebrovascular disease, rheumatic heart disease, and other conditions. More than 80% of all CVD deaths result from strokes and heart attacks, and 30% of these deaths occur in those younger than 70 [91,92]. Early identification of CVD is a critical means of lowering this toll. Using various ML approaches and data mining techniques is one of the numerous ways to improve the identification and diagnosis of this ailment [93]. Early identification makes it feasible to reduce severe health conditions, costs, and CVD death rates. The purpose of this study is therefore to conduct a comprehensive analysis of previous research on prediction models for heart disease by presenting details of the dataset, research objective, algorithms employed, findings, and other significant facets of the respective studies (Table 4).
To detect cardiac disease at an early phase, Ali et al. [94] employed six different ML algorithms on a publicly accessible UCI dataset that was gathered from Kaggle. Of its 1025 instances, 51.32% were heart disease patients and 48.68% were healthy individuals. During the preprocessing step, missing values were substituted, and a filter based on the interquartile range (IQR) was then used to identify outliers and extreme values. To eliminate outliers, the dataset was divided into three parts. After preprocessing, the accuracy of the MLP, KNN, RF, DT, LR, and AdaboostM1 (ABM1) algorithms was compared. Different statistical measures were employed to assess the effectiveness of the algorithms. The KNN, DT, and RF algorithms achieved very high accuracy.
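An IQR-based outlier filter of the kind described above can be sketched with the Python standard library. The multiplier of 1.5 is the conventional rule of thumb and is an assumption here, since the text does not state the threshold used in [94]; the function names are likewise chosen for the example.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return the (low, high) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def remove_outliers(values, k=1.5):
    """Keep only the values that lie within the IQR fences."""
    low, high = iqr_bounds(values, k)
    return [v for v in values if low <= v <= high]
```

The filter is applied one feature at a time. A larger multiplier (3 is a common choice) flags only extreme values, which may correspond to the distinction the text draws between outliers and extreme values.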
Shi et al. [95] suggested an XGBoost-based prediction method to accurately detect malnutrition in children one year after congenital heart surgery. The GWC Medical Center in China provided the data, which included 536 records with 15 distinct features. The continuous variables were expressed as means and standard deviations or as medians and IQR, and were assessed using an independent-sample t-test or a Mann-Whitney U test. The categorical variables were compared using a chi-square test and are reported as numbers and percentages. Supervised ML methods including extreme gradient boosting (XGBoost), LR, SVM, ADA, and MLP were used. The Shapley Additive exPlanations (SHAP) approach was utilized to track how each feature affects the prediction for each sample. XGBoost was the most accurate of the five algorithms.
Ahmed et al. [96] aimed to forecast cardiac disease based on patients' tweets using ML and big data. The primary goal of this research is to create a real-time platform that can assess and extract knowledge about heart diseases from a user's streaming tweets in order to forecast whether the person is at risk for heart disease. The three critical parts of the proposed system's architecture are Building an Offline Model, Stream Processing Pipeline, and Online Prediction. In the preprocessing stage, the data were scaled using the MinMax scaler. To choose the most crucial feature subset from the dataset, two feature selection techniques, univariate feature selection and Relief feature selection, were applied. The model was trained using four classification algorithms: DT, SVM, RF, and LR, with RF providing the greatest accuracy.
Haq et al. [97] worked to develop an ML-based decision support system for the diagnosis of cardiac disease. The CHDD was employed for the forecasting model. MinMax
A technique for effectively identifying cardiac ailments was proposed by Ghosh et al. [98]. Five separate datasets from the UCI ML repository (Cleveland, Switzerland, Hungary, Statlog, and VA Long Beach) were integrated in this work to create a larger, more dependable dataset for improved prediction. Two feature selection techniques, LASSO and Relief, were employed to choose the most crucial features. Five distinct algorithms were used: DT, KNN, RF, AB, and GB. To improve the system's accuracy, the authors applied ensemble techniques, including bagging and boosting. Bagging was used to lower the variance of the decision tree classifiers. The Gradient Boost Boosting Method (GBBM) was used in this model to obtain the best level of accuracy.

Diabetes disease
Diabetes, characterized by repeated increases in blood sugar levels, is one of the deadliest metabolic conditions [105,106]. Diabetes mellitus is a collection of metabolic illnesses defined by hyperglycemia caused by abnormalities in insulin production, insulin action, or both [107]. As many as 422 million people worldwide have diabetes, with the majority residing in low- and middle-income nations [108]. The incidence and severity of diabetes have significantly increased over the past several decades [109]. Statistics show that approximately 38.4 million people are suffering from type 2 diabetes. Among them, 29.7 million people are diagnosed, and 8.7 million are undiagnosed. In addition, 124.8 million people have prediabetes. Some women develop gestational diabetes during pregnancy, and more than 50% of them may go on to develop type 2 diabetes [110]. Between 2000 and 2019, WHO found a 3% increase in diabetes mortality. However, vulnerability to diabetes can be decreased by following a healthy diet and lifestyle [111]. A better quality of life and a longer lifespan are just two of the many benefits that might result from a diabetes diagnosis at an early stage [109,112,113]. Many researchers have made significant progress in building a proper PM for the early detection of diabetes. Because diabetes is one of the most common chronic diseases, this research includes it in its comprehensive study of chronic disease prediction. In this study, the PMs of diabetes from previous studies were analyzed, with the primary purpose of identifying the objective, dataset information, features, model validation, algorithms used, and various other important aspects of each study (Table 5).
Hasan et al. [114] suggested a model for predicting diabetes using seven distinct ML approaches. A freely available dataset, the Pima Indian Diabetes dataset (PIDD), was utilized for this study. Missing values were replaced with mean values during the preprocessing stage. To find the optimal MLP design, eight distinct MLP models, ranging from one to eight hidden layers, were developed and evaluated, with the number of neurons serving as the hyperparameter to be tuned. The optimal architecture was found to be the MLP layout with 3 hidden layers (H1, H2, and H3)

Discussion
This comprehensive review found that ML algorithms have shown promising outcomes in the prediction of CD. A significant portion of the studies used public datasets, and several researchers did outstanding work and achieved very high accuracy. However, compared to authors who used public datasets, the authors who had access to private datasets reported greater accuracy as well as better values on other performance metrics. The biggest disadvantage of publicly available datasets is the small number of data samples. Having a sufficiently large training dataset is a fundamental prerequisite when employing classification algorithms to model a disease. To validate the estimators reasonably, an adequately sized dataset must be split into training and testing sets. Class imbalance, numerous missing values, and the presence of outliers are other factors that reduce the accuracy obtained on publicly available datasets.
Enhanced accuracy can be achieved by proper preprocessing of the dataset. Most of the authors exerted considerable effort in the preprocessing phase, which included deleting missing values, scaling the dataset, balancing the data, and removing outliers, and these steps increased accuracy. Data scaling led to improved convergence, which allowed authors to achieve an accuracy of 95% or better. SMOTE and random over-sampling were the two most prevalent strategies for balancing datasets in the papers we examined. Narrowing the features down to a manageable number is essential to the model-building process. Almost every author employed a variety of strategies to identify significant features, and a few combined several feature selection methods to obtain a smaller subset of features that was more precise and relevant to their study. The most widely used feature selection strategies in the reviewed papers are LASSO, RELIEF, and mRMR.
Almost all of the studies mentioned here conducted validation tests to evaluate the efficacy of their learning algorithms. A significant factor in the encouraging outcomes of multiple experiments was the use of various ML approaches with the intention of identifying the most effective one. A customized ensemble approach increased the accuracy for some authors. This study found that SVM and RF classifiers were two of the most popular ML algorithms for predicting cancer patient outcomes. Numerous studies have shown that SVM, DT, and NB are superior to other methods for predicting CVD. For predicting liver disease, boosting algorithms were mainly employed. SVM and RF were also widely used for predicting brain disease and diabetes.
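Two of the preprocessing steps discussed above, feature scaling and class balancing, can be sketched in a few lines. This is an illustrative sketch using only the Python standard library, not code from any reviewed study; the function names and the toy values are assumptions. Min-max scaling is shown as one common scaling choice, and random over-sampling duplicates existing minority-class rows (SMOTE, by contrast, synthesizes new interpolated samples and is not shown here).

```python
import random
from collections import Counter


def min_max_scale(values):
    """Rescale one feature column linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        # Candidate rows of this class in the original data.
        pool = [r for r, l in zip(rows, labels) if l == label]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(label)
    return out_rows, out_labels


# Toy, made-up feature column and an imbalanced label vector.
scaled = min_max_scale([50, 100, 150])
rows = [[0.1], [0.2], [0.3], [0.4], [0.9]]
labels = [0, 0, 0, 0, 1]
balanced_rows, balanced_labels = random_oversample(rows, labels)
print(scaled, Counter(balanced_labels))
```

In practice, balancing is normally applied to the training split only, after the train/test split, so that duplicated samples do not leak into the test set and inflate the reported accuracy.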

Conclusion
This article focused on recent research on ML-based techniques for predicting CD, in which certain authors have achieved remarkable results. Tables 1–5 summarize important characteristics of PMs built with ML techniques, such as the number of features, dataset information, validation technique, important features, objectives, and findings, as reported by various researchers for liver disease, cancer, brain disease, heart disease, and diabetes, respectively. From these tables, a researcher can get a precise overview of previous work on PMs for CD as well as the diagnostic outcomes reported by previous researchers. This study also presents a list of available datasets that researchers can use for further work, which should speed up that work and can bring new ideas for the betterment of the healthcare domain. This study also finds that most research proposed in the last several years has created PMs for CD using supervised ML techniques and classification algorithms. We have found from previous studies that SVM and RF classifiers were the most popular ML algorithms for predicting CD. Despite these findings, this study has some limitations. The main limitation is that the study focuses only on prediction models, which limits the depth of insight into any particular chronic disease. Furthermore, our search strategy could have prevented this study from covering the full range of previous results, so different search titles may give different results. We therefore recommend that future work address these limitations, study a broader range of areas, and conduct an in-depth analysis of studies of individual chronic diseases.
From this research, we also recommend developing the prediction models further into a proper CDSS for remotely monitoring patients with CD. Such a system would be particularly advantageous for both patients and physicians, because previous studies suggest that these patients need to be observed frequently. Seamless integration of PMs with hospitals and medical domains, providing consistent health records and data, will lead to better outcomes and more effective treatment of patients with CD. It has the potential not only to enhance the existing healthcare system but also to make medical care more accessible to everyone by lowering the cost of providing treatment.

Fig. 1 Flow diagram of the paper selection process

Table 2 Overview of different parameters from the previous works on cancer disease. BPNN, back propagation neural network; DCNN, deep convolutional neural network; DeepSurv, deep feed-forward neural network; GNB, Gaussian naïve Bayes; N/A, missing or not available; RSF, random survival forest

Table 3 Overview of different parameters from the previous works on brain disease. ADNI, Alzheimer's Disease Neuroimaging Initiative; ANFIS, adaptive network-based fuzzy inference system; CAE, convolutional autoencoders; DFA, signal fractal scaling exponent; GTE, Genotype Tissue Expression; GWAS, genome-wide association study; IVM, import vector machine; MDVP Fhi, maximum vocal fundamental frequency; MDVP Flo, minimum vocal fundamental frequency; MDVP Fo, average vocal fundamental frequency; MDVP Shimmer, several measures of variation in amplitude; N/A, missing or not available; NCVS, National Center for Voice and Speech; RELM, regularized extreme learning machine; RPDE, two nonlinear dynamical complexity measures

Table 4 Overview of different parameters from the previous works on heart disease. ABBM, AdaBoost Bagging method; ADA, adaptive boosting; BPNN, back propagation neural network; CA, number of major vessels (0-3) colored by fluoroscopy; CP, chest pain; CPT, type of chest pain; DTBM, Decision Tree Bagging method; EIA, exercise-induced angina; Exang, exercise-induced angina (1 for yes and 0 for no); FBS, fasting blood sugar; GWC, Guangzhou Women and Children's; HIS, hospital length of stay; hr_la, heartbeat number; HRFLM, Hybrid Random Forest with Linear Model; IHD, ischemic heart disease; KNNBM, K-nearest neighbor Bagging method; MHR, maximum heart rate achieved; N/A, missing or not available; OPK, ST depression induced by exercise relative to rest; PES, slope of the peak exercise ST segment; restecg, resting electrocardiographic results; RFBM, Random Forest Bagging method; Slope, the slope of the peak exercise ST segment; THA, thallium scan; Thalach, maximum heart rate achieved; Trestbps, resting blood pressure; VCA, number of vessels colored by fluoroscopy; WAZ, weight for age
standard scaler were utilized in the pre-processing stage to depict ML algorithms effectively. The Relief feature selection algorithm, mRMR, and the least absolute shrinkage and selection operator (LASSO) were the three feature selection techniques employed in this study. After choosing crucial features, seven different ML algorithms (SVM, LR, KNN, ANN, NB, DT, and RF) were used. RF has the highest accuracy of all of them.

Table 5 Overview of different parameters from the previous works on diabetes disease

Table 6 provides the list of publicly available datasets identified in this study for the classification of different chronic diseases, such as cancer, liver disease, brain disease, heart disease, and diabetes.

Table 6 Publicly available datasets for different chronic disease classifications