Development and Validation of Interpretable Machine Learning for Stroke Occurrence in Older Chinese Community Dwellers

Background: Prediction of stroke based on individuals' risk factors, especially for a first stroke event, is of great significance for primary prevention in high-risk populations. Our study aimed to investigate the applicability of interpretable machine learning for predicting 2-year stroke occurrence in older adults compared with logistic regression. Methods: A total of 5960 participants consecutively surveyed from July 2011 to August 2013 in the China Health and Retirement Longitudinal Study were included for analysis. We constructed a traditional logistic regression (LR) and two machine learning methods, namely random forest (RF) and extreme gradient boosting (XGBoost), to distinguish stroke occurrence versus non-stroke occurrence using data on demographics, lifestyle, disease history, and clinical variables. Grid search and 10-fold cross-validation were used to tune the hyperparameters. Model performance was assessed by discrimination, calibration, decision curve, and predictiveness curve analysis. Results: Among the 5960 participants, 131 (2.20%) developed stroke after an average of 2 years of follow-up. Our prediction models distinguished stroke occurrence versus non-stroke occurrence with excellent performance. The AUCs of the machine learning methods (RF, 0.823 [95% CI, 0.759-0.886]; XGBoost, 0.808 [95% CI, 0.730-0.886]) were significantly higher than that of LR (0.718 [95% CI, 0.649-0.787], p<0.05). No significant difference was observed between RF and XGBoost (p>0.05). All prediction models were well calibrated, with Brier scores of 0.022 (95% CI, 0.015-0.028) for LR, 0.019 (95% CI, 0.014-0.025) for RF, and 0.020 (95% CI, 0.015-0.026) for XGBoost. XGBoost had much higher net benefits within a wider threshold range in decision curve analysis, and was more capable of recognizing high-risk individuals in predictiveness curve analysis. A total of eight predictors including gender, waist-to-height


Background
Stroke is the leading cause of death and disability worldwide [1], with substantial treatment and prognostic care costs [2]. As the disease spectrum has shifted from infectious and nutritional diseases to noninfectious chronic diseases (NCDs), together with frequent exposure to unhealthy lifestyles and environmental pollution, the global burden of stroke will continue to rise [3]. It is estimated that there will be approximately 200 million stroke patients worldwide in 2050, and thereafter 30 million new cases and 12 million deaths every year without effective prevention measures [4]. Stroke has been identified as one of the prioritized diseases by the World Health Organization and the United Nations in their work on NCDs [5]. With the rapid aging of the population, stroke has also become a huge challenge in China, with 2 million new cases every year [6]. Therefore, the development of effective stroke prediction models to guide early identification of high-risk populations is an urgent task.
Stroke-related predictions generally cover four aspects: stroke prevention or risk factor identification, stroke diagnosis, stroke treatment, and stroke prognosis [7]. So far, most studies have focused on stroke diagnosis [7][8][9][10][11], which requires complex medical data, such as physical, neurological, and brain imaging examinations (CT or MRI), to exclude other diseases (stroke mimics) and to recognize stroke type, location, and severity [12]. In practice, imaging data are usually obtained only after stroke has already occurred, and imaging examinations are expensive, making them unsuitable for early screening and cost reduction [13]. Additionally, the prognosis of stroke is often poor since there are few effective treatments. Thus, early identification of high-risk individuals followed by personalized interventions is the most cost-effective approach [14]. Fortunately, with the increase in population-based cohort studies, it is easy to obtain individuals' macro data (epidemiological data) through questionnaires as well as micro data, such as blood biomarkers. Full utilization of the comprehensive data derived from population-based cohorts would be helpful for early identification of populations at high risk of stroke. Many previous studies, such as the UKPDS calculator [15], PROCAM calculator [16], SCORE risk table [17], ASCVD calculator [18], and QRISK calculator [19], predicted the risk of cardiovascular disease using demographics, biomarkers, and clinical variables. However, only a few studies [20][21][22] have focused on stroke prediction, especially for adults aged 60 and above.
Regression methods, such as Cox regression and logistic regression, with their simplicity and interpretability, have been the most commonly used prediction methods in previous studies. Traditional regression methods mainly deal with low-order interactions, such as first-order interaction effects [20,21,23], making it difficult to analyze high-order nonlinear relationships. Especially when the number of predictors or their explanatory ability is limited, the complicated relationships between predictors and outcomes may not be captured [24]. Machine learning (ML), a set of computational methods that can discover complex nonlinear relationships between inputs and outputs, has been widely used in the field of disease prediction and health research [25][26][27]. Among ML methods, ensemble learning is a widely used approach with excellent performance [7,28], which makes predictions by integrating the results of multiple weak classifiers. Logistic regression (LR), as a representative regression method, is often used as a reference model for comparison with other ML methods.
Here, we developed two interpretable ensemble learning methods, namely random forest (RF) and extreme gradient boosting (XGBoost), to predict the 2-year stroke outcome (stroke occurrence versus non-stroke occurrence) in adults aged 60 and over, compared with logistic regression, based on demographics, lifestyle, disease history, and blood biomarker data. Specifically, predictive performance was assessed by discrimination, calibration, decision curve, and predictiveness curve analysis. In addition, interpretable machine learning techniques were used to understand the predictors of black-box ML methods with a view toward clinical practice.

Data source
This study retrospectively collected data from the China Health and Retirement Longitudinal Study (CHARLS) from July 2011 to August 2013 (http://charls.pku.edu.cn/index/zh-cn.html). Detailed information on CHARLS has been described elsewhere [29]. A series of health data collected in 2011 was used as potential predictors, and self-reported physician-diagnosed stroke status collected in the 2013 follow-up wave was used as a binary outcome (stroke occurrence versus non-stroke occurrence). Participants were included if they (1) were aged 60 years or older at baseline and (2) had no stroke at baseline.
Participants with missing values for stroke status were further excluded. Finally, 5960 participants were eligible for analysis.
Among them, 131 participants reported having a stroke after a 2-year follow-up.

Data preprocessing
Data on 16 variables were collected before constructing the prediction models, including demographics (age, gender, and waist-to-height ratio); lifestyle (smoking, drinking); disease history (hypertension, diabetes, dyslipidemia, and heart disease); and clinical variables (high-sensitivity C-reactive protein, white blood cell count, glucose, glycated hemoglobin, low-density lipoprotein cholesterol, triglycerides, and cystatin C). Waist-to-height ratio (WHtR) was calculated by dividing waist circumference (cm) by height (cm).
Smoking and drinking were converted into binary variables (1 for yes, 0 for no). Disease history was collected from self-reported physician diagnoses and was also treated as binary variables (1 for yes, 0 for no). Missing values were imputed with two strategies. In logistic regression, continuous variables were imputed with the median and categorical variables with the mode. In random forest and XGBoost, imputation was handled by the algorithm itself, where continuous predictors were imputed with the weighted average of non-missing values and categorical variables with the class with the largest average proximity. Additionally, we observed that the data were quite imbalanced: the ratio of the non-stroke to the stroke population (about 44:1) was far from 1:1. Therefore, the Synthetic Minority Oversampling Technique (SMOTE), which analyzes the minority samples and synthesizes new samples from them [30,31], was used for data balancing.
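In practice, SMOTE is available in libraries such as imbalanced-learn; the interpolation idea behind it can be sketched in a few lines of Python. This is a minimal illustration for continuous features only, not the implementation used in this study, and the `smote` helper and its toy data are hypothetical:

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Minimal SMOTE sketch: synthesize n_new samples by interpolating
    each randomly chosen minority sample toward one of its k nearest
    minority-class neighbours (continuous features only)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        neighbours = sorted(
            (x for x in minority if x is not base),
            key=lambda x: math.dist(base, x),
        )[:k]
        nb = rng.choice(neighbours)
        lam = rng.random()  # interpolation factor in [0, 1]
        synthetic.append([b + lam * (n - b) for b, n in zip(base, nb)])
    return synthetic

minority = [[1.0, 2.0], [1.2, 1.8], [0.9, 2.2], [1.1, 2.1]]
new = smote(minority, n_new=8)
```

Each synthetic point lies on the segment between a real minority sample and one of its neighbours, so the oversampled class keeps its original shape instead of duplicating rows.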
Feature selection with Boruta

The Boruta algorithm, a commonly used feature selection method, was further used to select the most relevant predictors for constructing the prediction models. Boruta is a wrapper method built on an RF classifier. It extends the idea of Stoppiglia, Dreyfus, Dubois and Oussar [32] and determines variable importance by comparing real features against shadow features. Traditional feature selection algorithms, with their minimal-optimal criterion, are prone to leaving out some relevant features in the process of minimizing errors. In contrast, Boruta finds all relevant features with an all-relevant strategy, so that even predictors weakly related to the outcome are preserved [33]. The main steps of the Boruta algorithm are as follows: (1) for each real feature R, randomly shuffle its values to construct a shadow feature S, and append the shadow features to the real features to obtain a new feature matrix N = [R, S]; (2) take the new feature matrix N as input, train an RF model on the data, and output the variable importance; (3) in each iteration, select the real features whose importance is higher than that of the shadow features (the Z score of the real feature is larger than the maximum Z score of the shadow features), and remove the unimportant real features; (4) the algorithm stops when all features have been decided or the maximum number of iterations is reached.
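A single Boruta iteration can be sketched as follows. For brevity, this toy uses the absolute Pearson correlation with the outcome as a stand-in for the RF importance Z scores that the real algorithm computes, and the feature names and data are hypothetical:

```python
import random

def pearson_abs(xs, ys):
    """Absolute Pearson correlation, used here as a toy importance score."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return abs(cov / ((vx * vy) ** 0.5 or 1.0))

def boruta_step(features, y, seed=0):
    """One Boruta iteration (sketch): build shadow features by shuffling
    each real column, score real and shadow features, and accept the real
    features whose importance beats the best shadow importance."""
    rng = random.Random(seed)
    shadows = []
    for col in features.values():
        s = col[:]
        rng.shuffle(s)  # shadow feature: same values, order destroyed
        shadows.append(s)
    max_shadow = max(pearson_abs(s, y) for s in shadows)
    return [name for name, col in features.items()
            if pearson_abs(col, y) > max_shadow]

y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
features = {
    "relevant": [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.9, 0.8, 1.0, 0.9, 0.8, 1.0],
    "noise":    [0.5, 0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6, 0.3, 0.5, 0.7],
}
accepted = boruta_step(features, y)
```

Because shuffling destroys any real association with the outcome, a shadow feature's score estimates how important a useless feature can look by chance; only real features that beat that bar are kept.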
Marginal variables on the edge of acceptance and rejection may exist when implementing the Boruta algorithm. To quantitatively evaluate the effect of marginal variables on predictions, we constructed a reference model with the accepted variables, then a new prediction model incorporating both the accepted and marginal variables, and used net reclassification improvement (NRI) [34] and integrated discrimination improvement (IDI) [35] to assess the contribution of the marginal variables. A positive value of NRI or IDI indicates an improvement in performance.
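As a sketch, the IDI and the category-free (continuous) variant of the NRI can be computed directly from the old and new models' predicted risks. This is a minimal illustration on hypothetical toy predictions, not necessarily the exact (e.g. category-based) variant used in the study:

```python
from statistics import mean

def idi(p_old, p_new, y):
    """Integrated discrimination improvement: change in mean predicted
    risk among events minus the change among non-events."""
    ev_old = [p for p, yi in zip(p_old, y) if yi == 1]
    ev_new = [p for p, yi in zip(p_new, y) if yi == 1]
    ne_old = [p for p, yi in zip(p_old, y) if yi == 0]
    ne_new = [p for p, yi in zip(p_new, y) if yi == 0]
    return (mean(ev_new) - mean(ev_old)) - (mean(ne_new) - mean(ne_old))

def continuous_nri(p_old, p_new, y):
    """Category-free NRI: net proportion of events whose risk moves up
    plus net proportion of non-events whose risk moves down."""
    ev = [(o, n) for o, n, yi in zip(p_old, p_new, y) if yi == 1]
    ne = [(o, n) for o, n, yi in zip(p_old, p_new, y) if yi == 0]
    up = lambda pairs: sum(n > o for o, n in pairs) / len(pairs)
    down = lambda pairs: sum(n < o for o, n in pairs) / len(pairs)
    return (up(ev) - down(ev)) + (down(ne) - up(ne))

# Toy example: the new model raises every event's risk and lowers
# every non-event's risk, so both measures are positive.
y = [1, 1, 0, 0]
p_old = [0.6, 0.5, 0.4, 0.3]
p_new = [0.7, 0.6, 0.3, 0.2]
```

Here `idi(...)` is 0.2 and `continuous_nri(...)` is 2.0, the maximum possible value, since every participant is reclassified in the right direction.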

Prediction models
In this study, logistic regression was used as the reference model for comparison with the ML methods. The binary LR had only two possible outcome values (stroke occurrence versus non-stroke occurrence). RF and XGBoost were selected as representative interpretable ensemble learning models. RF, one of the most commonly used bagging methods, was proposed by Leo Breiman [36]; it generates multiple decision tree classifiers in parallel and produces the final classification by majority voting (or, for numerical prediction, by averaging). XGBoost, proposed by Chen et al. [37], is another kind of ensemble learning known as boosting, in which the construction of each weak classifier depends on its predecessors: XGBoost fits the residuals of the current prediction with successive weak classifiers and finally combines all weak learners into a powerful learner.
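The bagging idea behind RF (bootstrap resampling plus majority voting) can be illustrated with one-feature threshold "stumps" standing in for full decision trees. This is a toy sketch on hypothetical data, not the forests trained in this study:

```python
import random
from collections import Counter

def train_stump(data):
    """Fit a one-feature threshold 'stump': pick the (feature, threshold,
    orientation) rule with the best training accuracy."""
    best = None
    for f in range(len(data[0][0])):
        for x, _ in data:
            thr = x[f]
            acc = sum((xi[f] >= thr) == yi for xi, yi in data) / len(data)
            # If accuracy is below 0.5, flip the rule's orientation.
            rule = (f, thr, 1) if acc >= 0.5 else (f, thr, 0)
            acc = max(acc, 1 - acc)
            if best is None or acc > best[0]:
                best = (acc, rule)
    return best[1]

def bagging_predict(data, x, n_trees=25, seed=0):
    """Bagging sketch: train each stump on a bootstrap resample of the
    data, then classify x by majority vote over all stumps."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_trees):
        boot = [rng.choice(data) for _ in data]  # sample with replacement
        f, thr, pos = train_stump(boot)
        votes.append(pos if x[f] >= thr else 1 - pos)
    return Counter(votes).most_common(1)[0][0]

data = [([0.1], 0), ([0.2], 0), ([0.3], 0), ([0.8], 1), ([0.9], 1), ([1.0], 1)]
```

Each stump sees a slightly different resample, so individual errors tend to cancel out in the vote; RF adds random feature subsetting on top of this scheme.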

Model derivation and internal validation
Data were divided with a 7:3 ratio for model derivation and internal validation, respectively. In the derivation stage, 10-fold cross-validation and grid search were used to tune hyperparameters. In the validation stage, accuracy, sensitivity, specificity, and the area under the receiver operating characteristic curve (AUC) were used to assess model discrimination. The bootstrap method was used to calculate the 95% confidence interval (CI) of the AUC, and the AUCs of all models were compared. Furthermore, the calibration of each prediction model was assessed by the Brier score. We also performed decision curve analysis (DCA) [38] and predictiveness curve analysis [39] to assess clinical usefulness, which could guide the selection of the optimal model. DCA, proposed by Andrew J. Vickers et al. in 2006, is a simple method for evaluating prediction models that considers both accuracy and clinical usefulness. The predictiveness curve describes the cumulative proportion of the population as a function of absolute risk; the lower the proportion at intermediate risk (between high risk and low risk), the better the model distinguishes the high- and low-risk populations. Additionally, all predictors were ranked by importance in each model, and we further used SHAP (SHapley Additive exPlanations) to interpret the machine learning models. SHAP is derived from the Shapley value in game theory and can show not only variable importance but also the direction of effects [40]. The whole process of derivation and validation is shown in Figure 1.
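Each of these evaluation quantities can be written in a few lines. The sketch below (on hypothetical toy predictions, not this study's models) shows the AUC via the Mann-Whitney statistic, the Brier score, and the DCA net benefit at a given risk threshold:

```python
def auc(probs, y):
    """AUC via the Mann-Whitney statistic: the probability that a randomly
    chosen event is ranked above a randomly chosen non-event."""
    pos = [p for p, yi in zip(probs, y) if yi == 1]
    neg = [p for p, yi in zip(probs, y) if yi == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def brier(probs, y):
    """Brier score: mean squared difference between predicted risk and
    the observed 0/1 outcome (lower means better calibration)."""
    return sum((p - yi) ** 2 for p, yi in zip(probs, y)) / len(y)

def net_benefit(probs, y, t):
    """Decision-curve net benefit at risk threshold t:
    (TP - FP * t/(1-t)) / N, treating everyone with risk >= t."""
    n = len(y)
    tp = sum(p >= t and yi == 1 for p, yi in zip(probs, y))
    fp = sum(p >= t and yi == 0 for p, yi in zip(probs, y))
    return (tp - fp * t / (1 - t)) / n

# Toy predictions: two events ranked above two non-events.
probs = [0.9, 0.8, 0.3, 0.1]
y = [1, 1, 0, 0]
```

With these toy values the AUC is 1.0 (perfect separation), the Brier score is 0.0375, and the net benefit at a 50% threshold is 0.5, i.e. both events treated with no false positives.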

Statistical analysis
Continuous variables were presented as mean ± standard deviation (normal distribution) or as median with interquartile range (IQR; skewed distribution). Categorical variables were presented as percentages. All analyses were performed with R 3.6.0. A two-sided p-value of <0.05 was considered statistically significant.

Baseline characteristics
The 2-year prevalence of stroke in the whole population was 2.20%, with women experiencing a relatively higher rate than men (2.38% vs 2.02%). In terms of incidence intensity, an average of 1.10 strokes per 100 person-years was observed over the 2-year follow-up, again higher in women than in men (1.19 vs 1.01 per 100 person-years).
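These figures follow directly from the counts reported above, assuming every participant contributes the full 2 years of follow-up (a simplification; the actual person-time may differ slightly):

```python
n_total, n_stroke, years = 5960, 131, 2.0

prevalence = 100 * n_stroke / n_total  # 2-year cumulative prevalence, %
# Incidence per 100 person-years, assuming each participant
# contributes the full 2 years of follow-up.
incidence = 100 * n_stroke / (n_total * years)
```

Rounded to two decimals, this reproduces the 2.20% prevalence and 1.10 per 100 person-years reported above.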
Comparisons of baseline characteristics are shown in Table 1. Briefly, the average age of participants was 66.75 years, and almost half were female. The proportions of smokers and drinkers were relatively high (41.49% and 30.49%, respectively). Among chronic diseases, hypertension was quite common in older adults (30.74%), followed by heart disease (14.88%), dyslipidemia (9.49%), and diabetes (6.51%). Stroke patients were more likely to be older men with smoking habits and chronic diseases. However, the proportion of alcohol drinkers was much lower in the stroke population. Other characteristics were balanced between the two groups.

Selection of predictors using Boruta
The results of feature selection based on Boruta are shown in Figure 2. The three blue features represent the maximum, average, and minimum Z scores of the shadow features.

Comparisons of clinical usefulness
In the DCA analysis (Figure 3-A), the horizontal dashed line is based on the assumption that all participants were free of stroke, so its net benefit is 0. Conversely, the oblique dashed line shows the net benefits at different thresholds under the assumption that all participants were stroke patients. We found that the net benefits of the machine learning methods were generally much higher than those of logistic regression; in particular, when the threshold exceeded 0.08%, the benefit of logistic regression turned negative, while the machine learning methods still had higher net benefits over a wider threshold range. Figure 3-B shows the predictiveness curves of the three prediction models. The horizontal dashed line represents the prevalence of stroke (4.94%) among adults aged 60 and above in China. If the value of 4.94% was used as a threshold for identifying high- and low-risk populations (binary classification), the proportions of high-risk groups were 6.484%, 13.583%, and 16.378% for LR, RF, and XGBoost, respectively. If a 2%-5% range was used as intermediate risk (three categories), the proportions of the intermediate-risk population were 32.700%, 20.291%, and 13.751% for LR, RF, and XGBoost, respectively.
In summary, XGBoost had relatively higher sensitivity, better calibration, and higher benefits within a larger threshold range.
HbA1c and WBC were the common predictors in the top 5 for all three models.
In terms of the consistency of predictor rankings across the three models, the proportion of complete consistency among all three was 0, and a similar result was observed between LR and RF. A consistency of 8.33% was observed between LR and XGBoost.
Surprisingly, the consistency reached 41.67% between the two machine learning methods, and when ranking differences of up to 3 positions were allowed, it reached 91.67%.
We further used SHAP to interpret XGBoost (Figure 4). The results showed that the impact of WHtR, HbA1c, CysC, LDLC, and hsCRP on stroke was quite complicated: stroke risk was positively related to these predictors within a certain range, and negatively related beyond that range. The impact of the other predictors on stroke was mainly one-way: as the exposure level increased, the risk of stroke increased.
This also demonstrates that ML methods are able to capture the complex nonlinear relationships contained in the data.

Discussion
Elderly people are vulnerable to cardiovascular diseases such as stroke, so early risk identification and effective prevention are essential for high-risk populations. Our study predicted 2-year stroke occurrence using comprehensive data obtained from a population-based cohort. The results showed that machine learning could effectively distinguish stroke occurrence versus non-stroke occurrence in elderly individuals.
Sufficient data preprocessing, such as imputation, feature selection, and data balancing, is necessary before constructing predictive models [41,42]. A recent review pointed out that many studies still have not addressed these issues well [28]. Take data balancing as an example: the ratio of non-stroke to stroke patients was about 44:1 in the original data set, indicating a quite imbalanced outcome distribution. If trained directly on the imbalanced data set, ML methods would classify most participants as non-stroke to obtain high accuracy. In fact, such models are deceptive: accuracy and specificity look good while sensitivity and AUC are much lower, and, worse, the model would perform badly when applied to different populations. Therefore, we calculated the ratio of specificity to sensitivity (spe/sen) to monitor the influence of the SMOTE algorithm on performance; a spe/sen close to 1 means the prediction model achieves a balance between sensitivity and specificity. Our results showed that the spe/sen of the LR, RF, and XGBoost models was 1.79, 1.307, and 1.487, respectively, without balancing, and 1.383, 0.968, and 1.034 after SMOTE processing. Feature selection is another example: after implementing the Boruta algorithm, a more concise and more predictive set of predictors was obtained, which is crucial for constructing powerful ML models [43].
The traditional statistical method (LR) and the two machine learning methods (RF, XGBoost) all showed good performance in this study, with AUCs of 0.718, 0.823, and 0.808, respectively. Marnie E. Rice et al. noted that AUC can be converted into effect sizes such as Cohen's d and the point-biserial correlation coefficient (r_pb) [44]. The Cohen's d values for LR, RF, and XGBoost in our study were 0.806-0.820, 1.30-1.33, and 1.22-1.24, respectively, and r_pb was 0.374-0.379, 0.545-0.554, and 0.520-0.528, respectively. By the effect-size criteria for Cohen's d, all our predictive models correspond to large effects; by the criteria for r_pb, the traditional LR represents a medium effect, while the two machine learning models still represent large effects. Furthermore, we found that the ML methods performed better than the traditional regression model, as demonstrated in many previous studies [45][46][47], although some studies have shown comparable performance between ML and regression models [48]. Possible explanations are as follows: ML excels at processing big data, so it may not find complex rules when data are limited; besides, the selection of optimal predictors is also a tough task. Margaret S. Pepe et al. pointed out that influencing factors and predictors can differ greatly even within the same study on the same data; in other words, a factor may be closely related to a disease yet contribute little to prediction [49]. This reminds us that there is no absolutely superior model, ML methods included; in practice, we should select the most suitable model for the specific scenario.
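These conversions can be reproduced under the binormal equal-variance model, where AUC = Φ(d/√2), together with the standard d-to-r_pb formula for equal group sizes (an assumption on our part; the cited ranges likely reflect different base-rate assumptions). A sketch:

```python
from math import sqrt
from statistics import NormalDist

def auc_to_d(auc):
    """Cohen's d from AUC under the binormal, equal-variance model:
    AUC = Phi(d / sqrt(2)), hence d = sqrt(2) * Phi^{-1}(AUC)."""
    return sqrt(2) * NormalDist().inv_cdf(auc)

def d_to_r_pb(d):
    """Point-biserial correlation from d, assuming equal group sizes:
    r = d / sqrt(d^2 + 4)."""
    return d / sqrt(d * d + 4)

for auc in (0.718, 0.823, 0.808):  # LR, RF, XGBoost AUCs from this study
    d = auc_to_d(auc)
    print(round(d, 2), round(d_to_r_pb(d), 2))
```

For AUC 0.718 this gives d ≈ 0.82 and r_pb ≈ 0.38, inside the 0.806-0.820 and 0.374-0.379 ranges quoted above, and likewise for the other two models.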
We found that a total of 8 variables ranked in the top 5 of the three prediction models, namely gender, WHtR, dyslipidemia, HbA1c, WBC, GLU, TG, and LDLC, were significant for predicting 2-year stroke in older adults, which has important implications for stroke prevention. The aim of stroke prevention is to reduce the risk of a first stroke event through targeted modification of single or multiple modifiable risk factors at the population or individual level. Specifically, there are two broad levels of stroke prevention: (1) primordial prevention is implemented at the population level, encouraging targeted measures such as a healthy diet, physical exercise, weight control, and a healthy lifestyle; (2) primary prevention is conducted at the individual level with more personalized strategies, such as changing specific unhealthy lifestyle habits as well as identifying and treating chronic diseases (e.g., dyslipidemia) [50]. Among the 12 predictors in our study, all except age and gender were modifiable factors, which suggests that much work on modifiable risk factors remains to be done in the future.
Evaluation of the clinical usefulness of predictive models can guide clinical practice. We performed decision curve and predictiveness curve analyses in this study, and the results showed that XGBoost was relatively superior in terms of clinical benefit and the ability to distinguish stroke patients. The practical significance of this study lies mainly in the primary screening of high-risk individuals. Specifically, the ML models could be used to assess individuals' stroke risk with epidemiological data obtained from questionnaires together with biomarkers from routine blood sampling. A healthy lifestyle and regular health checks at a primary health center are recommended for individuals at low stroke risk, while high-risk individuals are advised to visit a higher-level medical institution for examination to further determine whether a stroke has occurred.
Our study has some potential advantages. First, we constructed predictive models with comprehensive data derived from a population-based cohort, which are easy to obtain and low in cost, making the models suitable for primary screening of high-risk populations. Second, we constructed and evaluated prediction models for elderly people aged 60 years and older, for whom the stroke risk is much higher. Third, we performed relatively complete data preprocessing to ensure the quality of the predictive models, providing a solid foundation for model construction. Fourth, we made a comprehensive assessment of the prediction models, including discrimination, calibration, clinical usefulness analysis (decision curve and predictiveness curve), and SHAP for an in-depth discussion of the interpretability of the ML models. Finally, we followed the standard reporting process for prediction models as described in TRIPOD [51].
However, there are still some limitations to our study. First, limited by data availability, the number of participants included was not large; because ML is more powerful with big data, complex rules might not be discoverable from limited data. In addition, we only evaluated the generalization ability of the predictive models with internal validation, and external validation in other populations is needed in future studies.

Conclusions
Based on the epidemiological and clinical data derived from a population-based cohort, machine learning methods could effectively predict stroke occurrence in adults aged 60 years and older. With comprehensive consideration of both the ability to distinguish stroke occurrence and clinical benefit, XGBoost could be used for primary screening of high-risk individuals in the community.

Declarations
Ethics approval and consent to participate

Data used in our study were approved by the biomedical ethics committee of Peking University, and all participants provided written informed consent.

Consent for publication
Not applicable.

Availability of data and materials
Data used in this study can be obtained from the following link: http://charls.pku.edu.cn/index/zh-cn.html.

Competing interests
The authors declare that they have no competing interests.