A systematic comparison of machine learning algorithms to develop and validate prediction model to predict heart failure risk in middle-aged and elderly patients with periodontitis (NHANES 2009 to 2014)

Periodontitis is increasingly associated with heart failure, and the goal of this study was to develop and validate a prediction model based on machine learning algorithms for the risk of heart failure in middle-aged and elderly participants with periodontitis. We analyzed data from a total of 2876 participants with a history of periodontitis from the National Health and Nutrition Examination Survey (NHANES) 2009 to 2014, with a training set of 1980 subjects with periodontitis from the NHANES 2009 to 2012 and an external validation set of 896 subjects from the NHANES 2013 to 2014. The independent risk factors for heart failure were identified using univariate and multivariate logistic regression analysis. Six machine learning algorithms (logistic regression, K-nearest neighbor, support vector machine, random forest, gradient boosting machine, and multilayer perceptron) were used on the training set to construct the models. The performance of the machine learning models was evaluated using 10-fold cross-validation on the training set and receiver operating characteristic (ROC) curve analysis in the validation set. Univariate and multivariate logistic regression showed that age, race, myocardial infarction, and diabetes mellitus status were independent predictors of the risk of heart failure in participants with periodontitis. The areas under the ROC curve for the 6 models (logistic regression, K-nearest neighbor, support vector machine, random forest, gradient boosting machine, and multilayer perceptron) obtained using 10-fold cross-validation were 0.848, 0.936, 0.859, 0.889, 0.927, and 0.666, respectively. The areas under the ROC curve on the external validation set were 0.854, 0.949, 0.647, 0.933, 0.855, and 0.74, respectively. The K-nearest neighbor model achieved the best prediction performance across all models.
Of the 6 machine learning models, the K-nearest neighbor model performed best. The prediction model can assist in identifying the risk of heart failure in middle-aged and elderly patients with periodontitis and can support early, individualized diagnosis and treatment plans.


This work was supported by the Fuzhou Key Specialty Project (Grant number 20191005) and the Fuzhou "14th Five-Year Plan" Clinical Specialty Training and Cultivation Construction Project (Grant number 20220103).
Study protocols for NHANES were approved by the NCHS ethics review board (Protocol #2011-17, https://www.cdc.gov/nchs/nhanes/irba98.htm). All the participants signed the informed consent before participating in the study.

The authors have no conflicts of interest to disclose.
The datasets generated during and/or analyzed during the current study are publicly available.

Introduction
Heart failure (HF) has a high probability of occurring in the end stage of various cardiovascular diseases.[1] The global prevalence of heart failure has been increasing in recent years.[2][5] Patients with heart failure often have a poor quality of life.[6] Inflammation and fibrosis are believed to play a significant role in the remodeling of the heart that is typically linked with heart failure.[7][10][11][12][13] Periodontitis is a chronic inflammatory disease of the oral cavity characterized by the formation of periodontal pockets, loss of attachment, and resorption of alveolar bone; both innate and adaptive immunity are implicated in its pathophysiology.[14] Periodontitis has grown to be a significant public health issue and a growing strain on healthcare systems as the world's population ages.[15,16] Periodontitis promotes cardiovascular inflammation and endothelial dysfunction, which is the principal mechanism through which it is linked to the development of cardiovascular disease.[17] In patients with periodontitis, oral ecological dysregulation results in endotoxemia, which is also linked to an elevated risk of cardiometabolic abnormalities.[18] For these reasons, periodontitis has been linked for many years to the development of heart failure and other cardiovascular diseases.[19,20] Some studies have shown that both periodontitis and heart failure are multifactorial. Factors such as smoking, diabetes, and advanced age are common to both diseases, and eliminating these factors can aid the treatment of both.[20,21] Therefore, we included these factors as covariates.
Machine learning is an emerging field in medicine that applies computer science and statistics to medical problems.[22][29] Machine learning can be broadly categorized into "supervised learning" and "unsupervised learning" depending on whether the model fitting is supervised or unsupervised.[30] Compared with other statistical methods, machine learning algorithms are better able to account for interactions between variables, find hidden patterns, identify potentially significant predictor variables, learn optimized mappings between outcomes of interest and potential predictor variables from patterns in the data, and evaluate clinical outcomes more accurately.[31][34][35][36][37] At present, there are few machine learning models for predicting the risk of heart failure in patients with periodontitis. By comparing machine learning algorithms, our study aims to build better predictive models that use clinical features to predict the risk of heart failure in patients with periodontitis and thereby help clinicians make treatment decisions.

Study population and data selection
Data from the 2009 to 2014 National Health and Nutrition Examination Survey (NHANES), a research program created to evaluate the health and nutritional status of adults and children in the United States, were used in our study. The NHANES employs a complex, stratified, multistage sampling design to evaluate the health of Americans. The National Center for Health Statistics Research Ethics Review Committee authorized the survey protocol, and all participants completed informed consent forms. All procedures were conducted in accordance with relevant guidelines and regulations. Of the 30,434 subjects, we excluded those who were younger than 40 years of age or had missing values for smoking and drinking status, poverty-to-income ratio (PIR), diabetes mellitus (DM), coronary heart disease (CHD), myocardial infarction, hypertension, body mass index, education, marital status, race, sleep time on workdays, waist circumference, sedentary time, and physical activity. A total of 2876 participants aged 40 years and older who were diagnosed with periodontitis constituted our final study population. The full-mouth periodontal examination was performed by a dental hygienist who assessed the periodontal status of the participants. Participants aged 30 years and older were eligible for periodontal evaluation if they had at least 1 tooth (excluding the third molar) and did not meet any of the health exclusion criteria. Demographic characteristics were classified by gender (male, female), race (Mexican American, non-Hispanic white, non-Hispanic black, Hispanic, other race), marital status (unmarried, married or living with partner, married but currently living alone [separated, divorced, or widowed]), and education level (<9th grade, 9th to 11th grade, high school graduate, partial college or AA graduate or above). To calculate the PIR, household (or individual) income was divided by the survey year and state-specific poverty thresholds. Each participant's smoking status was assessed by self-report and categorized into 3 groups: nonsmokers, ex-smokers who no longer smoke, and current smokers. Drinking status was categorized as never drinkers, former drinkers who now abstain from drinking, heavy drinkers (3 or more drinks per day for women and 4 or more for men), moderate drinkers (2 or fewer drinks per day for women and 3 or more for men), and light drinkers (those not meeting any of the above criteria). Diabetes status was categorized as no diabetes, prediabetes, or diabetes. Prediabetes, which includes impaired fasting glucose and impaired glucose tolerance, is a metabolic condition between diabetes and normoglycemia.[38,39] According to the American Diabetes Association, the current definition of prediabetes in the United States includes a fasting blood glucose of 100 to 125 mg/dL, a post-load plasma glucose of 140 to 199 mg/dL, or an HbA1c of 5.7% to 6.4%.[40] Physical activity, sleep time on workdays, sedentary time, hypertension, diabetes status, myocardial infarction status, and CHD were obtained by questionnaire. The flow chart of the study population screening is shown in Figure 1.
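The prediabetes ranges quoted above can be written as a small classifier. This is only an illustrative sketch, not the procedure used in the study (diabetes status there was obtained by questionnaire). The function name `glycemic_category` and its parameter names are hypothetical, and the diabetes cut-offs (126 mg/dL, 200 mg/dL, 6.5%) are the usual American Diabetes Association values, which are assumed here because the text states only the prediabetes ranges.

```python
from typing import Optional


def glycemic_category(fasting_mg_dl: Optional[float] = None,
                      post_load_mg_dl: Optional[float] = None,
                      hba1c_pct: Optional[float] = None) -> str:
    """Return "No", "Prediabetes", or "Diabetes" from any available measure."""
    # (value, lower bound of prediabetes range, assumed diabetes cut-off)
    measures = [
        (fasting_mg_dl, 100.0, 126.0),    # fasting glucose, mg/dL
        (post_load_mg_dl, 140.0, 200.0),  # post-load plasma glucose, mg/dL
        (hba1c_pct, 5.7, 6.5),            # HbA1c, percent
    ]
    available = [(v, lo, hi) for v, lo, hi in measures if v is not None]
    if not available:
        raise ValueError("at least one measurement is required")
    # Any single measure at or above its diabetes cut-off decides the category.
    if any(v >= hi for v, lo, hi in available):
        return "Diabetes"
    # Otherwise, any measure inside its prediabetes range is sufficient.
    if any(v >= lo for v, lo, hi in available):
        return "Prediabetes"
    return "No"
```

For example, `glycemic_category(fasting_mg_dl=110)` falls in the 100 to 125 mg/dL range and is labeled as prediabetes.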

Ethical approval
The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013). All information from the NHANES program is freely available to the public, so approval by a medical ethics committee was not necessary. All the participants signed the written informed consent before participating in the study.

Constructing and validating predictive models
We divided the participants in both the training set and the validation set into 2 groups based on the presence or absence of heart failure and compared their baseline characteristics. Our study identified independent risk factors for heart failure in subjects with periodontitis by weighted univariate and multivariate logistic regression. Statistically significant variables selected by stepwise regression were used as input variables.
Variables with P < .05 in the univariate logistic regression analysis were included in the multivariate logistic regression analysis, and the variables with P < .05 in the multivariate logistic regression analysis were then selected as the final predictors to construct the machine learning models. In this study, a total of 1980 subjects with periodontitis from the NHANES 2009 to 2012 were used as the training set, on which 10-fold cross-validation was performed, and 896 subjects with periodontitis from the NHANES 2013 to 2014 were used as the external validation set. We applied 6 machine learning (ML) algorithms, namely logistic regression, K-nearest neighbor, support vector machine (SVM), random forest (RF), gradient boosting machine (GBM), and multilayer perceptron (MLP), to build models on the training set. Receiver operating characteristic curve analysis was used in the external validation set to check the performance of each model. In both internal and external validation, the best-performing model was defined as the one with the largest area under the receiver operating characteristic curve (AUC). The variables were ranked according to their importance in the models.
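As a rough illustration of the selection rule described here, the sketch below computes an AUC from labels and predicted scores using the rank (Mann-Whitney) formulation and then picks the model with the largest AUC. The helper `roc_auc` and the toy labels and scores are hypothetical; the dictionary of AUC values reuses the external validation figures reported in the abstract, in the order the models are listed there.

```python
from typing import Dict, Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a positive case outscores a negative one."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 2 participants without and 2 with heart failure.
toy_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])

# External validation AUCs as reported in the abstract.
external_auc: Dict[str, float] = {
    "logistic regression": 0.854,
    "k-nearest neighbor": 0.949,
    "support vector machine": 0.647,
    "random forest": 0.933,
    "gradient boosting machine": 0.855,
    "multilayer perceptron": 0.74,
}
# The best model is the one with the maximum AUC.
best_model = max(external_auc, key=external_auc.get)
```

The pairwise loop is quadratic in the number of participants, which is acceptable for a sketch; a production analysis would normally rely on an established statistics package instead.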

Statistical analysis
We employed NHANES sample weights in the baseline description and logistic regression analysis to achieve nationally representative values due to the complex sampling characteristics of NHANES. In the baseline information, continuous variables are expressed as means and standard errors. Categorical variables are expressed as frequencies and percentages. The t test was used to compare continuous variables between the 2 groups, and the chi-square test or Fisher exact test was used to compare categorical variables between the 2 groups. For participants with periodontitis, we utilized univariate and multivariate logistic regression to identify potential risk factors for heart failure. Odds ratios (OR) and 95% confidence intervals (CI) were utilized as effect estimates. P < .05 was considered statistically significant. All statistical analyses were performed using R software (version 4.3.0).
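For readers less familiar with how these effect estimates relate to a fitted logistic regression, the following minimal sketch converts a coefficient and its standard error into an odds ratio with a Wald-type 95% confidence interval. The function name `odds_ratio_ci` and the example numbers are hypothetical and are not results from this study; in a survey-weighted analysis the standard error would come from the design-based variance estimate.

```python
import math
from typing import Tuple


def odds_ratio_ci(beta: float, se: float, z: float = 1.96) -> Tuple[float, float, float]:
    """Return (OR, lower, upper) for a logistic regression coefficient.

    beta is the log-odds coefficient, se its standard error, and z the
    normal quantile (1.96 for a two-sided 95% interval).
    """
    odds_ratio = math.exp(beta)
    lower = math.exp(beta - z * se)
    upper = math.exp(beta + z * se)
    return odds_ratio, lower, upper


# Hypothetical coefficient: log(2) with a standard error of 0.1.
example = odds_ratio_ci(math.log(2.0), 0.1)
```

Under the Wald approximation, a 95% interval that excludes 1 corresponds to P < .05 for that coefficient.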

Demographic characteristics
As the training set, we used a total of 1980 patients with periodontitis from the NHANES database from 2009 to 2012, and as the external validation set, we selected 896 periodontitis patients from the NHANES database from 2013 to 2014. The weighted baseline data of the training set are shown in Table 1. The mean age of the subjects was 57.24 years; 61.16% were male, 38.84% were female, 42.47% were non-Hispanic white, 23.89% were African American, 14.65% were Mexican American, 9.65% were Hispanic, and 9.34% were of other races. The 1980 participants with periodontitis were split into 2 groups based on whether or not they had heart failure. The differences between the heart failure group and the non-heart failure group were statistically significant (P < .05) in terms of age, PIR, physical activity, hypertension, CHD, myocardial infarction, and DM status. The baseline characteristics of the external validation set were similar to those of the training set. The mean age of the 896 subjects was 57.86 years. Of the 896 participants, 40.18% were non-Hispanic white, 23.77% were African American, 15.51% were Mexican American, 8.82% were Hispanic, and 11.72% were of other races. Compared with those without heart failure, those with heart failure were older, had a lower PIR, and engaged in physical activity less frequently. The differences between the heart failure and non-heart failure groups in age, CHD, and myocardial infarction were statistically significant (P < .05). The detailed weighted baseline results are shown in Table 2.

The performance of machine learning models
We built prediction models in the training set using 6 machine learning algorithms: GBM, SVM, RF, K-nearest neighbor, logistic regression, and MLP. For internal validation, we evaluated the performance of the 6 machine learning models using 10-fold cross-validation; the K-nearest neighbor model outperformed the other 5 models in predicting heart failure (AUC = 0.936), as displayed in Figure 2. The K-nearest neighbor model continued to outperform the other 5 machine learning algorithms in our external validation set (Fig. 3), with an AUC of 0.949. Consequently, we chose the K-nearest neighbor model as the final prediction model.
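The mechanics of 10-fold cross-validation with a K-nearest neighbor scorer can be sketched as follows. This is a simplified illustration under stated assumptions, not the study's implementation: the functions `k_fold_indices` and `knn_score` are hypothetical, the split is a plain shuffled (non-stratified) one, and the neighbor count `k` and the Euclidean distance are assumed because the text does not report them.

```python
import random
from typing import List, Sequence, Tuple


def k_fold_indices(n_samples: int, n_folds: int = 10,
                   seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Shuffle indices once and return (train, test) index lists per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal shuffled indices round-robin so fold sizes differ by at most 1.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for i in range(n_folds):
        test = folds[i]
        train = [j for f_id, fold in enumerate(folds) if f_id != i for j in fold]
        splits.append((train, test))
    return splits


def knn_score(train_x: Sequence[Sequence[float]], train_y: Sequence[int],
              query: Sequence[float], k: int = 5) -> float:
    """Fraction of positive labels among the k nearest training points."""
    def sq_dist(i: int) -> float:
        return sum((a - b) ** 2 for a, b in zip(train_x[i], query))
    nearest = sorted(range(len(train_x)), key=sq_dist)[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


# With the 1980 training participants, each fold holds out 198 of them.
splits = k_fold_indices(1980, n_folds=10)
```

In each fold, the scores returned by `knn_score` for the held-out participants would be compared with their observed heart failure status to obtain a fold-level AUC, and the 10 values would then be averaged.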

Relative importance of variables in machine learning algorithms
The relative importance ranking of age, race, myocardial infarction, and diabetes in the model is shown in Figure 4. Among the 4 variables, myocardial infarction status has the highest importance, while race has the lowest.

Discussion
In this research, 896 periodontitis patients from the NHANES 2013 to 2014 were chosen for external validation, while 1980 periodontitis patients from the NHANES 2009 to 2012 were used for model construction. We used 6 machine learning methods: logistic regression, K-nearest neighbor algorithm, SVM, RF, GBM, and MLP. We assessed the performance of these 6 algorithms, and the k-nearest neighbor (KNN) model outperformed the others in predicting the risk of heart failure in participants with periodontitis. Age, race, myocardial infarction, and DM were significant independent risk factors for heart failure in participants with periodontitis, according to our multivariable logistic regression. The variables in the final model were ranked in descending order of importance as myocardial infarction, age, diabetes, and race.[43][44][45][46] Our research demonstrates that myocardial infarction is a risk factor for heart failure in people with periodontitis. Heart failure is a manifestation of end-stage cardiovascular disease, including myocardial infarction.[47,48] Studies have additionally demonstrated that periodontitis, a chronic condition that causes localized damage of the periodontal ligament and inflammatory bone loss brought on by oral microbes, is a risk factor for atherosclerosis and myocardial infarction.[49,50] Further research is required to substantiate the view that myocardial infarction may increase the risk of heart failure development in people with periodontitis. According to the findings of numerous studies, age is also a risk factor for heart failure. Younger participants in the Framingham Heart Study had a lower absolute risk of developing heart failure than older participants, both with and without risk factors.[51] According to a Swedish study, the likelihood of experiencing heart failure rises with age.[52] The likely mechanism is that deterioration of heart structure and function during aging leads to an increased susceptibility to heart failure.[53] In our analysis, DM was also found to be a risk factor for the development of heart failure.[56] Research has revealed that cardiometabolic damage is directly related to how diabetes affects heart failure.[57] Halting the progression of diabetes is therefore crucial for lowering the risk of heart failure. Even though it was the least important factor in our analysis, race is still a significant predictor.[60] In our research, we found that black Americans were substantially more likely to experience heart failure than other racial groups. Consistent with our findings, a United States ARIC study of older persons without heart failure found that blacks had lower contractility than other racial groups and a higher risk of heart failure with reduced ejection fraction (HFrEF).[61][63] Earlier epidemiological research has also suggested that young black persons have a 20-fold greater prevalence of heart failure than young white persons.[64] Because the specific mechanisms underlying this racial disparity are not yet fully understood, more research is required to identify them so that prospective interventions can be targeted.[67] However, the purpose of our study was specifically to use machine learning algorithms to predict the risk of heart failure in patients with periodontitis. To our knowledge, this is the first predictive model to use machine learning to predict the risk of heart failure in patients with periodontitis. The significance of this study is that it is a risk assessment of 2876 participants that compared 6 machine learning algorithms, with the KNN algorithm performing best. Machine learning-based models can guide clinical treatment decisions by helping doctors better anticipate the risk of heart failure in patients with periodontitis and implement the necessary measures.
There are some limitations to our study. First, because this is a cross-sectional study, it could not determine the temporal order of events. Thus, further prospective studies are required to examine the causal connection between heart failure and periodontitis. Second, we used 10-fold cross-validation for model construction and evaluation, and the external validation set came from a different NHANES cycle (2013 to 2014) rather than from an independent source, so the final findings still need to be confirmed in other databases or cohorts. Third, because the NHANES database is based on the population of the United States, the predictive model still needs to be validated in other countries to see how it performs in different cultural settings.

Conclusion
To personalize the prediction of heart failure risk in people with periodontitis, our study built and evaluated models based on 6 machine learning algorithms. We concluded that the KNN algorithm had the best model performance. Machine learning-based prediction models can aid physicians in making clinical decisions by helping them determine whether periodontitis patients are at risk for heart failure.

Figure 1. The flow chart of the study population for the training set and validation set.

Table 1
Weighted baseline characteristics of training set.

Table 2
Weighted baseline characteristics of validation set.