Introduction

The increasing prevalence of diabetes is a considerable global health concern. According to the World Health Organization (WHO), type 2 diabetes rates have risen worldwide across all income levels. Diabetic foot (DF) is a severe complication of diabetes, with a global prevalence of 6.3%, and is associated with a high amputation rate1,2,3. Individuals with diabetes have a 25% chance of developing DF during their lifetime. The mortality rate associated with DF is approximately 5% within the first 12 months and 42% within five years4,5. The annual incidence of diabetic foot ulcers (DFUs) worldwide ranges from 1.9 to 26.1 million. The prevalence of DFUs varies significantly between countries and regions, spanning 1.5% to 16.6%6,7.

DFUs can have serious consequences, including a high rate of disability, mortality, and recurrence, as well as high treatment costs and prolonged hospitalization8,9,10,11,12. Poor prognosis imposes a significant financial burden on patients, their families, and medical and health systems. To prevent these complications, diabetic patients with a “high-risk” foot must regularly see a doctor, take costly medications, and take personal responsibility for their health13,14.

Delaying specialist evaluation for DFUs can result in more severe ulcers, lower cure rates, and more hospitalizations. Clinical guidelines recommend annual foot screening of all diabetic patients to identify those at high risk for developing foot ulcers and prevent amputations15,16. High-risk individuals can be identified through a clinical examination of the feet. Identifying individuals likely to develop ulcers allows for targeted preventive treatments. Early screening and prediction of DFUs in high-risk groups is a crucial step in managing the prognosis of diabetic patients. Accurate prediction of diabetic foot ulcer risk can significantly reduce the burden of chronic wounds and amputations17,18.

Machine learning is a subfield of artificial intelligence that enables systems to automatically learn patterns from data and enhance clinical decision-making. Utilizing machine learning techniques, data-driven medical decision-making systems can provide valuable and insightful information in clinical and diagnostic areas. A well-designed predictive model can assist medical professionals and patients in preventative care strategies19.

Effective medical diagnosis relies on knowledge discovery from medical databases, which makes data mining well suited to medical studies. Data mining involves extracting information from databases and creating clear and understandable descriptions of patterns. Among unsupervised data mining methods, association rule mining is one of the most popular and effective techniques for extracting useful information and discovering relationships between elements in large amounts of data stored in databases20,21. Association rule mining has been used in various medical applications, such as identifying patterns in disease progression. Association rules specify conditions that frequently occur together in a given dataset. The extracted rules describe the presence of certain features based on other features22,23. The Apriori algorithm is a powerful tool for exploring frequent itemsets to discover association rules, and it serves as the basis for other discovery algorithms. The algorithm derives its name from the fact that it uses prior knowledge about the properties of frequent itemsets. Processing involves identifying significant rules among frequent patterns, which are extracted by setting support and confidence thresholds24. Association rules mined with the Apriori algorithm produce interpretable and intuitive results, providing information about general trends in the database.

Given the limited number of studies on extracting knowledge from DFU data using association rules, this study aimed to develop an associative classification model based on the Apriori algorithm. The model predicts the risk of DFUs from a collection of demographic, clinical, and laboratory variables.

Material and methods

Material

This study was a retrospective cohort analysis involving all patients consecutively referred to Shahid Beheshti Hospital (Hamadan Province, Iran) from April 2020 to August 2022. Data on 29 specific features were collected through a checklist from 666 patients with type 2 diabetes, of whom 279 (42%) had diabetic foot ulcers (DFUs).

All patients previously diagnosed with type 2 diabetes and registered between April 2020 and August 2022 were included in the study. Inclusion criteria were as follows: age ≥ 25 years; meeting the 2023 American Diabetes Association (ADA) diagnostic criteria for diabetes; haemoglobin A1c (HbA1c) ≥ 6.5% at any time before first hospitalisation; fasting plasma glucose (FPG) ≥ 126 mg/dL or 2-h post-challenge plasma glucose (2h PCPG) ≥ 200 mg/dL; in a patient with classic symptoms of hyperglycemia or hyperglycemic crisis, a random plasma glucose ≥ 200 mg/dL (11.1 mmol/L); taking antidiabetic medication25.

International Working Group on the Diabetic Foot (IWGDF) guidelines were used to identify patients hospitalized with DFU. A foot ulcer was defined as a full-thickness lesion below the ankle, irrespective of the presence of neuropathy and/or peripheral arterial disease. DFUs were defined as wounds, infections or destruction of deep tissues in the lower limb below the ankle26,27,28. Individuals with more than one ulcer at baseline were also included as DFUs. The Wagner classification system was used to evaluate the severity of DFUs29. All patients with type 2 diabetes mellitus who were on treatment and had at least one episode of foot ulceration during their treatment, with Wagner stage 1 or above, were considered 'DFUs or class 1'. The diagnosis was based on the classification made at the time of the patient's initial admission to the hospital.

The present study was designed as a classification task: we developed a predictive associative CBA model to identify diabetic patients at risk of DFU. The binary target variable, or consequent, is the occurrence of DFUs in patients (class 1: patients with type 2 diabetes with DFUs, and class 0: patients with type 2 diabetes without DFUs).

A logistic regression model was also used as a conventional competitor against which to compare the predictive performance of the CBA model. The independent (input) variables and the dependent (target) variable in the logistic regression model were the same as those in the CBA model, and both models used the same training and test sets so that the results would be comparable. A significance level of 0.05 was used.

Authors confirm that all experiments were performed in accordance with relevant guidelines and regulations. The research ethics committee of Hamadan University of Medical Sciences approved the study (IR.UMSHA.REC.1401.014), and informed consent was waived due to the retrospective design and use of anonymized clinical data.

Methods

Association rule mining

Association rule mining is a data mining method used to identify concealed associations, frequent patterns, and correlations within data. The technique is based on if–then statements, which establish a relationship between two variables, e.g., if A occurs, then B occurs. The antecedent represents the if-part of the statement, and the consequent represents the then-part.

Two measures, support and confidence, are used to assess the statistical significance and the strength of a rule, respectively. The support of a rule A → B, Support(A ∪ B), is the proportion of records in the dataset that contain both A and B. Confidence is the proportion of records containing A that also contain B. A third criterion, lift, compares the observed confidence with the confidence that would be expected if A and B were independent, that is, with the support of B. If the lift value is greater than 1, the rule body and rule head co-occur more frequently than would be expected by chance, which suggests that the occurrence of the rule body is positively associated with the occurrence of the rule head30.

$$ \text{Support}(I) = \frac{\text{Number of transactions containing } I}{\text{Total number of transactions}} $$
$$ \text{Confidence}(I_1 \to I_2) = \frac{\text{Number of transactions containing } I_1 \text{ and } I_2}{\text{Number of transactions containing } I_1} $$
$$ \text{Lift}(I_1 \to I_2) = \frac{\text{Confidence}(I_1 \to I_2)}{\text{Support}(I_2)} $$
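
As a small worked illustration of these three measures, the following Python sketch computes them on a handful of made-up records. The item names and the five toy records are hypothetical and are not taken from the study data; exact fractions are used so the arithmetic can be checked by hand.

```python
from fractions import Fraction

# Hypothetical toy records (NOT study data); each set lists the items
# observed for one patient.
transactions = [
    {"insulin", "duration>10y", "DFU"},
    {"insulin", "duration>10y", "DFU"},
    {"insulin"},
    {"oral_agents"},
    {"oral_agents", "duration>10y", "DFU"},
]

def support(itemset, records):
    """Proportion of records that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for record in records if itemset <= record)
    return Fraction(hits, len(records))

def confidence(lhs, rhs, records):
    """Among records containing `lhs`, the proportion that also contain `rhs`."""
    return support(set(lhs) | set(rhs), records) / support(lhs, records)

def lift(lhs, rhs, records):
    """Observed confidence divided by the support of the consequent."""
    return confidence(lhs, rhs, records) / support(rhs, records)

# Rule {insulin} -> {DFU} on the toy records.
print(support({"insulin", "DFU"}, transactions))       # 2/5
print(confidence({"insulin"}, {"DFU"}, transactions))  # 2/3
print(lift({"insulin"}, {"DFU"}, transactions))        # 10/9
```

In this toy example the lift of 10/9 is slightly above 1, so the antecedent and consequent co-occur a little more often than independence would predict.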

Apriori algorithm

The Apriori algorithm is a popular method for mining association rules from transaction data. It involves identifying frequent itemsets as a basis for creating association rules. A frequent itemset is a set of items whose support meets a minimum threshold. Association rules are typically considered interesting if they satisfy both a minimum support and a minimum confidence threshold, which can be set by users or domain experts31. The Apriori algorithm can be broken down into the following steps:

  • Step 1: Set a minimum threshold for support and confidence.

  • Step 2: Identify all itemsets whose support in the transactions meets the minimum support threshold (the frequent itemsets).

  • Step 3: Generate from the frequent itemsets all rules that meet the minimum confidence threshold.

  • Step 4: Sort the rules by decreasing lift.
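
The four steps above can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used in the study (which relied on the R "arules" package); the function names `apriori` and `generate_rules`, the toy records, and the thresholds of 0.4 and 0.7 are assumptions chosen only for the example.

```python
from itertools import combinations

def apriori(records, min_support):
    """Step 2: return {itemset: support} for every frequent itemset."""
    records = [frozenset(r) for r in records]
    n = len(records)

    def keep_frequent(candidates):
        kept = {}
        for cand in candidates:
            supp = sum(1 for r in records if cand <= r) / n
            if supp >= min_support:
                kept[cand] = supp
        return kept

    level = keep_frequent({frozenset([item]) for r in records for item in r})
    frequent = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prior knowledge: every (k-1)-subset of a frequent itemset is frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        level = keep_frequent(candidates)
        frequent.update(level)
        k += 1
    return frequent

def generate_rules(frequent, min_confidence):
    """Steps 3 and 4: build rules above the confidence threshold, sort by lift."""
    rules = []
    for itemset, supp in frequent.items():
        for size in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), size):
                lhs = frozenset(lhs)
                rhs = itemset - lhs
                conf = supp / frequent[lhs]
                if conf >= min_confidence:
                    rules.append((lhs, rhs, supp, conf, conf / frequent[rhs]))
    return sorted(rules, key=lambda rule: rule[4], reverse=True)

# Hypothetical toy records (NOT study data).
records = [
    {"male", "insulin", "DFU"},
    {"male", "insulin", "DFU"},
    {"male", "DFU"},
    {"female", "insulin"},
    {"female"},
    {"male", "insulin", "DFU"},
]
# Step 1: thresholds chosen for the toy example only.
frequent_sets = apriori(records, min_support=0.4)
rules = generate_rules(frequent_sets, min_confidence=0.7)
for lhs, rhs, supp, conf, lift in rules[:3]:
    print(sorted(lhs), "->", sorted(rhs), round(supp, 2), round(conf, 2), round(lift, 2))
```

The pruning step is what gives the algorithm its name: a candidate is only counted if all of its smaller subsets were already found to be frequent.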

Classification based on associations (CBA)

Associative classification is a type of association rule mining in which the right-hand side of every rule (the consequent) is restricted to the class attribute. CBA (Classification Based on Associations) is a method that uses association rule techniques to classify data, and it has been reported to be more accurate than traditional classification techniques. However, CBA is sensitive to the minimum support threshold, as a lower threshold can lead to a large number of rules being generated. Liu et al. proposed a CBA method that uses an Apriori approach to generate classification rules. Their method has been modified by others and has been shown to be effective in generating accurate classification rules32,33.

Association rule learning is typically applied to categorical data. Numerical variables were therefore discretized, that is, converted into categorical variables using established medical cutoffs. After the data preparation and cleaning phase, association rules were generated from the processed data. The Apriori algorithm was used to extract the supervised rules, and the rules with the highest confidence and expected accuracy were then selected. The “arules” and “arulesViz” packages in the R software were used for rule mining and visualization.
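
The discretization step can be illustrated with a short Python sketch. The specific cutoffs applied in the study are not listed here, so the example assumes the standard WHO body mass index categories and the 10-year diabetes duration threshold mentioned in the Results; the field names (`bmi`, `diabetes_years`, `sex`) and the helper `to_items` are hypothetical.

```python
def bmi_category(bmi):
    """Map BMI (kg/m^2) to the standard WHO categories (assumed cutoffs)."""
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "normal"
    if bmi < 30:
        return "overweight"
    return "obese"

def to_items(record):
    """Turn one patient record (a dict) into a set of categorical items."""
    items = set()
    for key, value in record.items():
        if key == "bmi":
            items.add(f"bmi={bmi_category(value)}")
        elif key == "diabetes_years":
            items.add("duration>10y" if value > 10 else "duration<=10y")
        else:
            # Variables that are already categorical are kept as key=value items.
            items.add(f"{key}={value}")
    return items

# Hypothetical patient record (NOT study data).
patient = {"bmi": 27.0, "diabetes_years": 12, "sex": "male"}
print(sorted(to_items(patient)))
# ['bmi=overweight', 'duration>10y', 'sex=male']
```

Each resulting set of items plays the role of one transaction in the rule mining step.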

The rule base for the Apriori algorithm was generated using the entire sample. However, building a diagnostic classification model with the CBA algorithm requires separate training and test sets. Accordingly, 80% of the data was allocated to the training set, and the remaining 20% was held out for testing and was not used to optimize the model during development. The performance of the associative classification model was evaluated using accuracy. The consequent of every rule was the binary target variable, the occurrence of DFUs (class 1: patients with type 2 diabetes with DFUs; class 0: patients with type 2 diabetes without DFUs). The minimum support threshold was set to 0.01, the minimum confidence threshold to 0.7, and the minimum lift threshold to 1.
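
A simplified sketch of this workflow in Python is given below. It is not the procedure used in the study: antecedents are enumerated by brute force up to a small length instead of with Apriori, rules are ranked by confidence, then support, then antecedent length, and the database-coverage pruning of the original CBA method is omitted. The synthetic records, the helper names, and the random seed are all assumptions; only the 80/20 split and the 0.01 and 0.7 thresholds come from the text.

```python
import random
from collections import Counter
from itertools import combinations

def split(data, test_fraction=0.2, seed=0):
    """Shuffle and split into training and test sets (80/20 by default)."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def mine_class_rules(train, min_support=0.01, min_confidence=0.7, max_len=2):
    """Mine rules whose consequent is the class label, ranked best first."""
    n = len(train)
    lhs_counts, rule_counts = Counter(), Counter()
    for items, label in train:
        for size in range(1, max_len + 1):
            for lhs in combinations(sorted(items), size):
                lhs_counts[lhs] += 1
                rule_counts[(lhs, label)] += 1
    rules = []
    for (lhs, label), hits in rule_counts.items():
        supp, conf = hits / n, hits / lhs_counts[lhs]
        if supp >= min_support and conf >= min_confidence:
            rules.append((frozenset(lhs), label, supp, conf))
    # Higher confidence first, then higher support, then shorter antecedent.
    rules.sort(key=lambda r: (-r[3], -r[2], len(r[0])))
    return rules

def classify(items, rules, default):
    """Predict with the first (highest-ranked) rule whose antecedent matches."""
    for lhs, label, _, _ in rules:
        if lhs <= items:
            return label
    return default

# Synthetic records (NOT study data): the label depends only on therapy.
data = []
for i in range(100):
    therapy = "insulin" if i % 2 == 0 else "oral"
    sex = "male" if (i // 2) % 2 == 0 else "female"
    data.append(({therapy, sex}, 1 if therapy == "insulin" else 0))

train, test = split(data)
rules = mine_class_rules(train)
default = Counter(label for _, label in train).most_common(1)[0][0]
accuracy = sum(classify(x, rules, default) == y for x, y in test) / len(test)
print(len(train), len(test), accuracy)
```

Because the synthetic label is a deterministic function of one item, the classifier is perfect on this toy set; real clinical data would not behave this way.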

Results

Table 1 displays the frequency distribution of the investigated variables in diabetic patients stratified by the presence or absence of DFUs. Continuous variables are presented as mean ± standard deviation, while categorical variables are reported as number and percentage.

Table 1 Baseline characteristics of the study patients, categorized by whether or not they developed a foot ulcer.

Univariate analysis using the chi-square test indicated a statistically significant relationship between gender and the occurrence of DFU. Among the individuals with DFUs, 185 patients (66.3%) were male, while among diabetic patients who had not yet developed DFUs, 182 patients (47%) were male.

There is a significant relationship between age and the occurrence of DFU, with the mean and standard deviation of age for patients with DFUs being 63.34 ± 13.22 years, and for those without foot ulcers being 61.34 ± 12.27 years. Additionally, body mass index (BMI) is associated with the occurrence of DFUs, with the mean and standard deviation of BMI for patients with DFUs being 26.97 ± 4.63 kg/m2, and for those without DFUs being 26.14 ± 5.44 kg/m2.

There is a significant association between the type of diabetes treatment and the occurrence of DFU (p < 0.05). Among the patients with DFUs, 123 patients (44.1%) were using anti-diabetic pills and 156 patients (55.9%) were using insulin therapy.

The duration of diabetes is significantly associated with the risk of developing DFUs (p < 0.05). Among the patients with DFUs, 157 patients (56.3%) had a history of diabetes for more than 10 years, while among the diabetic patients who had not yet developed DFUs, 110 patients (28.7%) had a history of diabetes for more than 10 years.

Although there was no significant statistical association between comorbidities that a diabetic individual may have (kidney disease, heart disease, kidney and heart disease, and other diseases) and the occurrence of DFUs (p > 0.05), out of the 279 individuals with foot ulcers, 42 patients (15.1%) had kidney disease, 52 patients (18.6%) had heart disease, 45 patients (16.1%) had both kidney and heart disease, and finally, 140 patients (50.2%) had other diseases.

There is a significant association between smoking, drug addiction, physical activity, and the occurrence of DFUs (p < 0.05). Among the individuals with DFUs, 48 patients (17.2%) were smokers, 57 patients (20.4%) were addicted to other drugs while 222 patients (79.6%) were not, and 264 patients (94.6%) did not engage in physical activity.

Family history of diabetes, family history of DFUs, regular visits to a doctor, proliferative retinopathy, non-proliferative retinopathy, diabetic neuropathy, LDL cholesterol, HDL cholesterol, total blood cholesterol, and triglycerides were significantly associated with DFUs (p < 0.05). However, systolic and diastolic blood pressure, fasting blood glucose levels, 2-h postprandial glucose levels, diabetic nephropathy, cardiovascular events, and cerebrovascular events were not significantly associated with the occurrence of DFUs (p > 0.05).

Applying the CBA algorithm to the data with a minimum support of 1% and a minimum confidence of 70% identified 146 rules, which are reported separately for the two groups of patients with and without diabetic foot ulcers. These rules are presented in Tables 2 and 3.

Table 2 Rules generated by the CBA algorithm for patients belonging to the DFUs class.
Table 3 Rules generated by the CBA algorithm for patients belonging to the diabetes class.

According to the rules identified by the CBA algorithm in Table 2, the following factors are associated with the occurrence of diabetic foot ulcer: duration of diabetes of more than 10 years, insulin therapy, male sex, older age, smoking, addiction to other drugs, family history of diabetes, higher body mass index, physical inactivity, proliferative or non-proliferative retinopathy, nephropathy, history of heart or kidney disease, levels of LDL cholesterol, HDL cholesterol, and triglycerides (TG), systolic and diastolic blood pressure, 2-h postprandial blood sugar (BS2HPP), and experience of cardiovascular events.

The overall accuracy of the CBA algorithm in classifying patients into the two groups was 96%. The accuracy for identifying patients with and without DFUs was 97% and 95%, respectively. The confusion matrix for the test data obtained with the CBA algorithm is presented in Table 4. The area under the ROC curve (AUC) for the test data was 0.962 (95% CI 0.924, 1.000).
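
For readers less familiar with these measures, the following Python sketch shows how overall and per-class accuracy are derived from a binary confusion matrix. The counts are purely hypothetical and do not reproduce Table 4; the function name `class_accuracies` is an assumption made for the example.

```python
def class_accuracies(tp, fn, tn, fp):
    """Overall and per-class accuracy from a binary confusion matrix.

    tp / fn: patients with DFUs classified correctly / incorrectly.
    tn / fp: patients without DFUs classified correctly / incorrectly.
    """
    overall = (tp + tn) / (tp + fn + tn + fp)
    with_dfu = tp / (tp + fn)      # accuracy within the DFU class (sensitivity)
    without_dfu = tn / (tn + fp)   # accuracy within the non-DFU class (specificity)
    return overall, with_dfu, without_dfu

# Hypothetical counts for illustration only (NOT the values in Table 4).
overall, with_dfu, without_dfu = class_accuracies(tp=9, fn=1, tn=8, fp=2)
print(overall, with_dfu, without_dfu)  # 0.85 0.9 0.8
```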

Table 4 Confusion matrix for the same test data by CBA algorithm and logistic regression model.

The overall accuracy and AUC of the logistic regression model for classifying patients into the two groups were 77.4%. The accuracy for identifying patients with and without DFUs was 77.78% and 76%, respectively. The CBA algorithm therefore performed better than the logistic regression model in terms of accuracy and AUC in predicting DFUs.

Based on the results of the multiple logistic regression model, duration of diabetes of more than 10 years, insulin therapy, male sex, addiction to other drugs, family history of diabetes, higher body mass index, non-proliferative retinopathy, nephropathy, history of heart or kidney disease, and higher levels of LDL cholesterol, HDL cholesterol, and triglycerides (TG) significantly increase the chance of occurrence of diabetic foot ulcer.

A graph-based visualization of the rules extracted by the CBA method for DFUs, with items and rules as vertices, is presented in Fig. 1, and a parallel coordinate plot of the rules is reported in Fig. 2. In the graph, the red circles represent the rules. Arrows entering a rule come from the items on its left-hand side (lhs), and arrows leaving a rule point to the items on its right-hand side (rhs).

Figure 1 Parallel coordinate plot extracted based on CBA algorithm for DFUs.

Figure 2 Graph-based visualization with items and rules as vertices extracted based on CBA algorithm for DFUs (this graph was created using the “arulesViz” package based on the dataset of this study).

Discussion

Foot ulceration is a widespread issue that comes with substantial healthcare expenses. As a severe complication of diabetes, DFUs have a significant impact on the well-being of patients. This study aimed to identify individuals with diabetes at risk of developing DFU by developing an associative classification model based on the Apriori algorithm. The algorithm identified several risk factors for developing DFUs, including long-term diabetes, insulin therapy, male gender, advanced age, smoking, drug addiction, family history of diabetes, higher BMI, physical inactivity, and diabetic complications. The CBA algorithm achieved a high accuracy of 96% in identifying patients with and without DFUs and performed better than the logistic regression model in terms of accuracy and AUC. The variables identified by the two methods are very similar: all the variables that are significant in logistic regression are also identified as important variables in the rules generated by the CBA model.

The study extracted interesting patterns from a real dataset using data mining, which is particularly useful in medical data due to the high volume of data and unknown relationships between factors. The patterns and models obtained can be used to generate hypotheses for subsequent studies, including clinical trials to confirm or refute them, ultimately improving evidence-based clinical studies.

Several studies have explored predictive models for DFUs. Jiang et al.34 developed a nomogram that uses 12 easily obtainable risk factors to predict the likelihood of DFUs in hospitalized patients with type 2 diabetes. The nomogram achieved an accuracy rate of 84% in predicting DFUs in validation cohorts. Identified risk factors for DFU included male gender, old age, longer duration of diabetes, history of foot disease, and various blood markers such as high white blood cell count and low hemoglobin level.

Shi et al.35 constructed a weighted risk model using the random forest (RF) algorithm to evaluate the occurrence of DFUs. The RF model, based on 17 variables, achieved an accuracy of 0.795 in predicting the risk of DFUs in external validation data sets.

Monteiro-Soares et al.36 developed a risk stratification model for DFUs using seven commonly available clinical variables: age, gender, duration of diabetes, HbA1c levels, neuropathy, peripheral arterial disease, and previous history of foot ulcers. In that study, 336 patients with diabetes were enrolled and monitored for a median duration of 2.3 years to investigate the incidence of new DFU as the primary outcome. The logistic regression model achieved an area under the curve (AUC) of 0.83.

Lv et al.37 developed a nomogram that utilized a logistic regression model to predict the risk of DFUs. According to their findings, risk factors for foot ulcers included abnormal foot skin color, callus, BMI, foot arterial pulse, and a history of ulcers. The validation of the nomogram demonstrated moderate predictive ability, as shown by an AUC value of 0.787.

Research on DFUs has often focused on identifying risk factors for their development in diabetic patients. The present study supports previous research indicating a link between male gender and the occurrence of DFUs. Several other investigations, including those by Larijani et al., Bejestani et al., Ali et al., Jiang et al., Bakri et al., Frikberg et al., Richard et al., and Finke et al., have also found a higher proportion of men developing DFUs. This may be due to the higher pressure on men’s lower limbs due to their average weight, as well as differences in lifestyle and self-care34,38,39,40,41,42,43,44. The higher prevalence of atherosclerosis in men compared to women, as noted in a study by JanMohammadi et al.45 may also contribute to the higher rate of DFUs in men.

The mean age of patients with DFUs was considerably higher than that of patients without DFUs. This result is consistent with the findings reported by Shahi et al., Zhang et al., Yunir et al., and Jiang et al., who identified an age over 50 years as a significant risk factor for the development of foot ulcers in diabetic patients34,46,47,48.

The present study established a significant correlation between the longer duration of diabetes and the risk of developing DFU. Specifically, 56% of patients with DFUs had a history of diabetes for more than 10 years, while only 28% of diabetic patients without foot ulcers had a history of diabetes for more than 10 years. These results are consistent with the findings of several other investigations, such as those conducted by Bakri et al., Syauta et al., Frikberg et al., Lipsky et al., and Naemi et al. In Bejestani et al.’s study, 50% of patients had a history of diabetes for over 13 years, in Ali et al.’s study, 58% for over 10 years, and in Chowdhury et al.'s study, 56% for over 10 years39,40,41,42,49,50,51,52.

The present study also found a statistically significant relationship between physical activity and the incidence of DFUs, which is consistent with the study by Tola et al.53.

A history of smoking or other drug addiction was associated with the development of DFUs. This finding is consistent with the results of other studies, including those by Bejestani et al., Syauta et al., Frikberg et al., Naemi et al., and Moeini et al., which identified smoking and other drugs as risk factors for DFU39,42,49,51,54.

In the studies conducted by Reardon et al.55 and Abu Obaid et al.56 regarding the factors affecting DFUs, a statistically significant relationship was found between the occurrence of foot ulcers and regular visits to the doctor. These results are in agreement with the findings of the present study.

The results of this study indicate a positive correlation between increased BMI and the occurrence of DFUs. This finding is consistent with previous research that has identified obesity and elevated BMI as potential risk factors for diabetic foot ulcers. In fact, in our study, we found that 87.1% of the patients were either overweight or obese, while only 12.9% had a normal weight. None of the patients in our study were underweight. It is generally observed that diabetic patients with elevated BMI have a higher incidence of DFUs.

To the best of our knowledge, no research has been conducted on the application of association rule mining specifically in patients with DFUs. However, several studies have used association rule mining techniques to explore the relationship between various risk factors and diabetes. Rane and Rano57, for example, applied an association rule mining algorithm to recorded information on diabetic patients to determine the frequent risk factors associated with the occurrence of diabetes. Their findings indicated that in most of the obtained rules, low levels of HDL cholesterol (less than 35 mg/dL) were the most likely factor associated with the occurrence of diabetes.

In a study conducted in 2010 by Patil et al.58, a hidden pattern discovery algorithm was applied to various variables of 625 female diabetic patients. The results indicated that blood sugar levels greater than 150 mg/dL, age between 40 and 60 years, body mass index greater than 30 kg/m2, and more than 5 pregnancies had the greatest association with the occurrence of diabetes.

This study developed predictive models using readily available routine features. As a result, primary clinics can use the constructed CBA predictive model to screen patients with diabetes mellitus for their susceptibility to developing DFUs, even in cases where physicians lack experience. However, due to the retrospective nature of the study design and the data extraction procedure, some of the laboratory data were missing or not available. Future work should therefore analyze the potential benefit of adding other variables to those routinely recorded. This study also does not allow the temporal sequence between the selected risk factors and the occurrence of DFUs to be established. Future prospective studies are needed to establish this association.

Study limitation

While the method achieved a higher overall accuracy compared to other studies, it is crucial to validate and replicate the results in other databases to ensure their generalizability to diverse populations. Future research on this topic should prioritize larger sample sizes and multi-center studies to enhance our understanding of the risk factors associated with DFUs and improve the accuracy of predictive models.