Impact of 25-Hydroxyvitamin D on the Prognosis of Acute Ischemic Stroke: Machine Learning Approach

Background and Purpose: Low vitamin D status is a predictor of poor outcome in cardiovascular disease. We evaluated whether serum 25-hydroxyvitamin D level was associated with poor outcome in patients with acute ischemic stroke (AIS) using a machine learning approach. Materials and Methods: We studied a total of 328 patients within 7 days of AIS onset. Serum 25-hydroxyvitamin D level was measured within 24 h of hospital admission. Poor outcome was defined as a modified Rankin Scale score of 3–6. Logistic regression and the extreme gradient boosting algorithm were used to assess the association of 25-hydroxyvitamin D with poor outcome. Prediction performances were compared using the area under the ROC curve (AUROC) and the F1 score. Results: The mean age of patients was 67.6 ± 13.3 years. Of 328 patients, 59.1% were men. The median 25-hydroxyvitamin D level was 10.4 (interquartile range, 7.1–14.8) ng/mL and 47.2% of patients were 25-hydroxyvitamin D-deficient (<10 ng/mL). Serum 25-hydroxyvitamin D deficiency was a predictor of poor outcome in multivariable logistic regression analysis (odds ratio, 3.38; 95% confidence interval, 1.24–9.18, p = 0.017). Stroke severity, age, and 25-hydroxyvitamin D level were also important predictors in the extreme gradient boosting classification algorithm. The performance of the extreme gradient boosting algorithm was comparable to that of logistic regression (AUROC, 0.805 vs. 0.746, p = 0.11). Conclusions: 25-hydroxyvitamin D deficiency was highly prevalent in Korea, and low 25-hydroxyvitamin D level was associated with poor outcome in patients with AIS. The machine learning approach of extreme gradient boosting was also useful for assessing stroke prognosis alongside logistic regression analysis.


INTRODUCTION
Vitamin D is a prohormone obtained through sun exposure and dietary intake (1). Besides its role in bone integrity and calcium homeostasis (2), vitamin D status is also associated with cardiovascular morbidity and mortality (3)(4)(5)(6). Vitamin D has a protective effect on endothelial function and vascular remodeling in experimental models (7,8). One meta-analysis has shown that low vitamin D level is associated with a 2.5-fold increase in the risk of ischemic stroke (9). However, the benefit of vitamin D supplementation in improving cardiovascular outcome remains controversial (10)(11)(12).
Several studies have suggested that low vitamin D is also associated with poor outcome in patients with acute ischemic stroke (AIS) (13)(14)(15)(16)(17). In these studies, logistic regression analysis alone was used to assess the relationship between low vitamin D status and the prognosis of AIS patients. Logistic regression analysis should account for model complexity, including interactions, and should assess model appropriateness using goodness of fit (18). In previous studies, vitamin D has been associated with several cardiovascular risk factors and stroke-related characteristics, including hypertension (19), insulin resistance (20), cerebral small vessel disease burden (21), stroke severity (16,17), infarct volume (22), mood (23), and cognition (24). However, studies considering these interactions in logistic regression models are scarce.
Recently, machine learning (ML) has been found to be capable of accurately predicting prognosis in several disease categories, including cancer (25), cardiovascular disease (26), and psychiatric disorders (27). ML has the advantage of being able to handle large datasets and offers several optimization techniques that reduce overfitting of trained algorithms (28). In addition, several ML models such as random forest and extreme gradient boosting are tree-based algorithms with built-in feature selection that can identify non-linear relationships and interactions between independent variables more efficiently than a logistic regression model (29). In this study on the association between vitamin D and stroke prognosis, we hypothesized that low vitamin D is associated with poor outcome in patients with AIS and that the predictive performance of extreme gradient boosting is superior to that of binary logistic regression when vitamin D is considered together with several risk factors.

MATERIALS AND METHODS

Study Population
The present study was a single-center retrospective study that screened all patients with AIS within 7 days of symptom onset using a prospectively collected hospital registry. From August 2013 to October 2015, a total of 738 AIS patients with a positive diffusion-weighted lesion on brain MRI were selected for screening. We excluded patients with a prior history of stroke (n = 243) because dietary intake and physical activity, which are known to be associated with vitamin D intake and biosynthesis, could be affected by premorbid neurological status. Additionally, patients without 3-month outcome data (n = 85) and patients without laboratory measures, including admission serum 25-hydroxyvitamin D level (n = 82), were excluded, leaving 328 patients for analysis.
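The patient-selection arithmetic above can be checked with a short sketch. The counts are those reported in the text; the variable names and the assumption that the three exclusion groups do not overlap are ours.

```python
# Patient-selection flow, using the counts reported in the text.
# Assumption: the exclusion groups are applied sequentially and do not overlap.
screened = 738
exclusions = {
    "prior history of stroke": 243,
    "no 3-month outcome data": 85,
    "missing laboratory measures (incl. 25-hydroxyvitamin D)": 82,
}

remaining = screened
for reason, n in exclusions.items():
    remaining -= n
    print(f"after excluding {reason} (n = {n}): {remaining} patients")

final_cohort = remaining
```

Under that assumption, the final cohort size agrees with the 328 patients analyzed in the study.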

Measurement of Serum 25-Hydroxyvitamin D Level
Vitamin D status was determined by serum 25-hydroxyvitamin D concentration, the major circulating form of vitamin D, which has a long circulating half-life (∼3 weeks) (30). 25-hydroxyvitamin D level was measured within 24 h of admission using a radioimmunoassay kit (DiaSorin Liaison, Stillwater, MN, USA) with an interassay coefficient of variation of <10%. The assay was standardized against NIST Standard Reference Material 2972 (NIST, Gaithersburg, MD, USA) and certified by the Centers for Disease Control and Prevention Vitamin D Standardization Program. 25-hydroxyvitamin D deficiency was defined as a concentration of <10 ng/mL, following the criteria of a Korean population study (31).

Covariates
Definitions for vascular risk factors were based on our previous reports (32). Hypertension was defined if participants were taking antihypertensive medications or if their average sitting blood pressure was 140/90 mmHg or more. Diabetes was defined if they were receiving medical treatment for diabetes, if they had a fasting serum glucose level of 126 mg/dL or more, or if they had a non-fasting random serum glucose level of 200 mg/dL or more with corresponding symptoms of diabetes. Subjects were considered to have hyperlipidemia if they had a fasting total cholesterol level of 240 mg/dL or more or if they were being treated with a lipid-lowering agent. A current smoker was defined as a person smoking one or more cigarettes per day within the last 6 months. Stroke subtypes were classified as large artery atherosclerosis (LAA), small vessel occlusion (SVO), cardioembolism (CE), undetermined etiology (SUE), and other determined etiology (SOE) according to the Trial of Org 10172 in Acute Stroke Treatment (TOAST) criteria (33). Laboratory parameters included serum levels of fasting blood sugar, total cholesterol, triglyceride, high-density lipoprotein, low-density lipoprotein, blood urea nitrogen, and creatinine, and whole blood measures of white blood cell count, hemoglobin, and platelet count.
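The covariate definitions above can be expressed as simple rules. The following is a minimal sketch, not the code used in the study: the `Patient` record and its field names are hypothetical, and reading "140/90 mmHg or more" as systolic ≥140 or diastolic ≥90 is our assumption.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    # Hypothetical record layout; field names are illustrative only.
    on_antihypertensives: bool
    systolic_mmhg: float
    diastolic_mmhg: float
    on_diabetes_treatment: bool
    fasting_glucose_mgdl: float
    random_glucose_mgdl: float
    has_diabetes_symptoms: bool
    total_cholesterol_mgdl: float
    on_lipid_lowering: bool


def has_hypertension(p: Patient) -> bool:
    # Assumption: "140/90 mmHg or more" means systolic >= 140 OR diastolic >= 90.
    return p.on_antihypertensives or p.systolic_mmhg >= 140 or p.diastolic_mmhg >= 90


def has_diabetes(p: Patient) -> bool:
    # Treatment, fasting glucose >= 126 mg/dL, or random glucose >= 200 mg/dL with symptoms.
    return (
        p.on_diabetes_treatment
        or p.fasting_glucose_mgdl >= 126
        or (p.random_glucose_mgdl >= 200 and p.has_diabetes_symptoms)
    )


def has_hyperlipidemia(p: Patient) -> bool:
    # Fasting total cholesterol >= 240 mg/dL or lipid-lowering treatment.
    return p.total_cholesterol_mgdl >= 240 or p.on_lipid_lowering


example = Patient(
    on_antihypertensives=False, systolic_mmhg=152, diastolic_mmhg=84,
    on_diabetes_treatment=False, fasting_glucose_mgdl=110,
    random_glucose_mgdl=210, has_diabetes_symptoms=False,
    total_cholesterol_mgdl=198, on_lipid_lowering=True,
)
flags = (has_hypertension(example), has_diabetes(example), has_hyperlipidemia(example))
```

In this illustrative record, a random glucose of 210 mg/dL without symptoms does not meet the diabetes definition, whereas lipid-lowering treatment alone is sufficient for hyperlipidemia.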

Outcomes
The primary outcome measure was poor functional outcome at 3 months, defined as a modified Rankin Scale (mRS) score of 3-6. Certified neurologists and nurses assessed the mRS score. The secondary outcome was the F1 score of the prediction model (binary logistic regression vs. extreme gradient boosting) for AIS prognosis classification. The F1 score is a measure of a classifier's performance. It represents the harmonic mean of the precision (positive predictive value) and recall (sensitivity) of the classifier.
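As an illustration of how the F1 score relates to precision and recall, the sketch below computes these metrics from binary labels (1 = poor outcome, 0 = good outcome). The function name and the toy labels are ours and are not study data.

```python
def classification_metrics(y_true, y_pred):
    """Return precision, recall, accuracy, and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0  # positive predictive value
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # sensitivity
    accuracy = (tp + tn) / len(y_true)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}


# Toy example (not study data): 3 poor outcomes, 5 good outcomes.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a classifier cannot obtain a high F1 score by maximizing recall at the expense of precision, or vice versa.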

Statistical Analysis
Baseline characteristics according to tertiles of 25-hydroxyvitamin D were compared with analysis of variance or the Kruskal-Wallis test for continuous variables. For categorical variables, Pearson's χ² test or Fisher's exact test was used. We performed univariate and multivariate logistic regression analyses for poor functional outcome at 3 months (mRS 3-6) according to 25-hydroxyvitamin D status. All variables with p < 0.10 in univariable analysis, together with age and sex, were entered into multivariable models. Two-sided p < 0.05 was considered significant in multivariable analysis. Analyses were performed with R version 3.4.1 (the R Foundation for Statistical Computing).

Machine Learning Algorithm
We used the Extreme Gradient Boosting (XGBoost) R package, which performs well on both regression and classification tasks. XGBoost is an ML algorithm widely used in Kaggle contests because it can capture non-linear relationships, handle missing data, and run faster than many other ML methods (34). As previously mentioned, we labeled each patient's outcome as good (mRS: 0-2) or poor (mRS: 3-6). The dataset was randomly divided into 2 groups: training (60% of 328 subjects) and testing (40%). The proportion of poor outcomes was kept identical in the training and testing datasets using the R cart package. Input features of this model were the independent variables in the multivariate logistic regression model. The parameter tuning methods used to prevent overfitting of the XGBoost model are described in the Supplemental Material. We calculated the predicted probability from the binary logistic regression model and from the extreme gradient boosting algorithm, and classified a probability of 0.5 or more as "poor outcome" (test positive) and a probability <0.5 as "good outcome." We presented the performance of these two classifiers as precision, recall, accuracy, and F1 score. We compared model performance using the area under the ROC curve (AUROC). Additionally, we examined the performance of other machine learning algorithms such as support vector machines.
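The data-splitting and thresholding steps described above can be sketched as follows. This is an illustrative Python sketch, not the R code used in the study: the helper names are hypothetical, the labels are synthetic, and the count of 75 poor outcomes is our assumption, chosen to approximate the reported proportion of 22.9%. Model fitting itself is omitted.

```python
import random


def stratified_split(labels, train_fraction=0.6, seed=0):
    """Split indices into train/test while preserving the class proportions."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(len(idx) * train_fraction)  # per-class training size
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    return sorted(train_idx), sorted(test_idx)


def classify(probabilities, threshold=0.5):
    """Label a predicted probability >= threshold as poor outcome (1), else good (0)."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Synthetic labels: 328 subjects, 75 assumed to have a poor outcome (1).
labels = [1] * 75 + [0] * 253
train_idx, test_idx = stratified_split(labels)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)

predicted = classify([0.20, 0.50, 0.73])
```

With these synthetic labels, the split yields 197 training and 131 testing subjects with nearly equal poor-outcome rates in both sets.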

RESULTS

25-Hydroxyvitamin D Level and Poor Outcome
Of 328 participants, the proportions of AIS patients with poor 3-month outcome and 25-hydroxyvitamin D deficiency were 22.9 and 48.2%, respectively (Table 2). Patients with 25-hydroxyvitamin D deficiency tended to have higher mRS scores than those without such deficiency (p for χ² trend = 0.026, Figure 1). Poor outcome at 3 months was positively associated with age, female sex, stroke severity, blood urea nitrogen, and white blood cell count. Hemoglobin level and 25-hydroxyvitamin D level were inversely associated with poor outcome in patients with AIS. In univariate analysis, 25-hydroxyvitamin D deficiency was a predictor of poor 3-month outcome [OR (odds ratio), 1.86; CI (confidence interval), 1.10-3.14; p = 0.021]. After adjustment for age, sex, stroke severity, stroke subtype, intravenous thrombolysis, blood urea nitrogen, white blood cell count, and hemoglobin in the multivariate logistic regression model, 25-hydroxyvitamin D deficiency remained a significant predictor of poor 3-month outcome (OR, 3.21; CI, 1.22-8.48; p = 0.019, Table 3).

mRS, modified Rankin Scale; OR, odds ratio; CI, confidence interval; 25(OH)D, 25-hydroxyvitamin D. Model 1 was adjusted for age and sex. Model 2 included variables in model 1 plus NIHSS and NIHSS * 25-hydroxyvitamin D deficiency interaction term. Model 3 included variables in model 2 plus stroke subtype (TOAST classification) and intravenous thrombolysis. Model 4 included variables in model 3 plus blood urea nitrogen, white blood cell count, and hemoglobin.

Comparison of Logistic Regression and the Other Machine Learning Classifier
We used the extreme gradient boosting ML algorithm to classify and predict factors associated with poor 3-month outcome of AIS. Figure 2 shows the factors affecting this binary classification, weighted by the XGBoost algorithm. NIHSS score (feature importance, 32.8%) was the top predictive factor affecting XGBoost classification, followed by age (16.0%) and 25-hydroxyvitamin D level (12.5%) in the training dataset. Table 4 shows the classification performance of binary logistic regression and extreme gradient boosting in predicting poor 3-month outcome. Overall prediction performance, including precision, recall, accuracy, and F1 score, was higher for the extreme gradient boosting model than for the binary logistic regression model. Figure 3 shows the performance of these two classifiers by AUROC analysis of the test dataset (total 131 subjects). The performance of XGBoost was higher than that of logistic regression in classifying the prognosis of patients with AIS, but the difference did not reach statistical significance (AUROC, 0.805 vs. 0.746, p-value = 0.11). We also examined the performance of the support vector machine for the binary prediction of poor outcome; the overall procedures for the support vector machine are presented in the Supplemental Material.

DISCUSSION
Our study showed that low 25-hydroxyvitamin D status was highly prevalent in AIS patients at the time of admission and that 25-hydroxyvitamin D status was a significant predictor of poor outcome after adjusting for multiple variables in a logistic regression model. Additionally, serum 25-hydroxyvitamin D level was also associated with poor outcome in the XGBoost ML model, which considers multiple interactions. Is there a possible causal relationship between vitamin D deficiency and stroke prognosis? First, it has been shown that endothelial dysfunction has an important role in atherosclerosis and stroke development (35,36). Markers of endothelial dysfunction are associated with stroke lesion volume and clinical outcome (37). Witham et al. have reported that high-dose vitamin D supplementation for 16 weeks can increase flow-mediated dilatation in patients with stroke (38), which is a well-known predictor of cardiovascular outcome (39). These findings could explain a possible link between vitamin D status and stroke outcome. Second, low vitamin D status is closely related to "general health status," which could reflect an individual's physical activity or dietary intake. Conversely, vitamin D status can predict physical performance in aged populations (40). Krarup et al. have reported that pre-stroke physical activity is associated with severity and long-term outcome of first-ever ischemic stroke (41). In this regard, low 25-hydroxyvitamin D could be a result of poor general health or increased frailty in this stroke population. New interventional studies of 25-hydroxyvitamin D supplementation in AIS patients with low vitamin D status are needed to improve our understanding of the role of 25-hydroxyvitamin D in cardiovascular disease.
We observed that 25-hydroxyvitamin D levels were extremely low in our stroke population compared to the normal Korean population (31). How can we explain this high prevalence of 25-hydroxyvitamin D deficiency or insufficiency in stroke participants? First, the prevalence of 25-hydroxyvitamin D deficiency has been gradually increasing as more people live in modernized, indoor environments in which sunlight exposure is insufficient for vitamin D biosynthesis. Analysis of national health surveys in the US could support this hypothesis (42). Second, a relatively high prevalence of vitamin D deficiency in stroke patients has been reported in other cohorts (43,44). In these studies, the mean 25-hydroxyvitamin D concentration ranged from 13.7 to 14.2 ng/mL, comparable to that of our study. Vitamin D deficiency is a pandemic phenomenon: its prevalence has markedly increased irrespective of age, sex, ethnicity, type of concurrent comorbidity, or socioeconomic status.
Despite these associations between 25-hydroxyvitamin D deficiency and various cardiovascular outcomes, there is little evidence from randomized controlled trials that vitamin D supplementation has a positive effect on cardiovascular outcome (45). These discrepancies could be partly explained by factors such as the supplemental dosage of vitamin D, ethnic and environmental characteristics of the study participants, study design, and the outcome used in each trial. It is too early to conclude that vitamin D deficiency is a salient predictor of cardiovascular disease. Therefore, future trials are needed that consider these differences and the exact causal pathways of vitamin D metabolism.
We used logistic regression and the XGBoost ML algorithm to predict stroke outcome in this study. Logistic regression is the most popular multivariate statistical method in medical science. Using logistic regression, we can quantitatively estimate the relationship between two variables of interest through an index called the "odds ratio." However, it is impractical to consider all interactions among variables in logistic regression because there can be a wide variety of combinations of interactions between variables in the real world. Results of the XGBoost ML algorithm showed that, in addition to important predictors such as age and stroke severity, factors such as 25-hydroxyvitamin D, white blood cell count, and hemoglobin were also important in predicting the prognosis of stroke patients. We suggest that 25-hydroxyvitamin D is associated with a variety of factors affecting stroke prognosis, such as stroke severity, age, several laboratory parameters, stroke subtype, and sex, and that the XGBoost ML algorithm is useful for dealing with multiple interactions between independent predictors in binary classification of stroke prognosis. In addition, the XGBoost algorithm can analyze large datasets with high computational speed. Medical data are now becoming so vast that they are hard to analyze, and an ML approach can be an expeditious and efficient way to analyze such big data.
Our study has several limitations. First, our study was a single center retrospective case control study, although 25-hydroxyvitamin D levels and 3-month clinical outcomes were prospectively collected. Therefore, our results cannot establish a causal relationship between vitamin D and stroke severity. In addition, they may not be generalizable to all ischemic stroke patients or to other ethnicities. Second, although pre-stroke physical activity and dietary intake are potentially important confounders of the relationship between vitamin D level and stroke severity, we could not assess these variables in our model. However, some strengths of our study could reinforce the relationship between vitamin D status and stroke outcome. First, we excluded patients with a prior history of stroke to minimize the effect of physical activity and dietary intake. Second, the analysis of vitamin D and stroke outcome was adjusted for clinically important variables in a multivariate logistic model. Lastly, interactions between all independent variables were statistically considered by means of an ensemble ML method called extreme gradient boosting.
In conclusion, 25-hydroxyvitamin D level is associated with poor 3-month outcome in patients with AIS. In addition, the extreme gradient boosting ML algorithm could reveal the association of stroke prognosis with 25-hydroxyvitamin D level, which might have certain interactions with other predictors.

DATA AVAILABILITY STATEMENT
The datasets analyzed in this article are not publicly available.
Requests to access the datasets should be directed to the corresponding author (gumdol52@hallym.or.kr).

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by Hallym University Sacred Heart Hospital Institutional Review Board/Ethics Committee (IRB No. 2015-I064). The patients/participants provided their written informed consent to participate in this study.

AUTHOR CONTRIBUTIONS
CK and B-CL drafted the manuscript. J-HL performed the statistical analysis. S-HL and MO contributed to the statistical analyses. J-SL and YK provided critical suggestions for the discussion. MJ and SJ revised the first draft. K-HY designed the analyses and contributed technical comments on the results to the final manuscript.

FUNDING
CK received funding from the National Research Fund of Korea (NRF-2019R1G1A1097707) and a grant from the CJ healthcare Corp (2018-12-031). The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.