Machine Learning-Based Forecast of Hemorrhagic Stroke Healthcare Service Demand considering Air Pollution

This study aimed to forecast the pattern of the demand for hemorrhagic stroke healthcare services based on air quality and machine learning. Hemorrhagic stroke, air quality, and meteorological data for 2016-2017 were obtained from the Longquanyi District of China, and the study included 1932 cases. Six machine learning methods were used to forecast the demand for hemorrhagic stroke healthcare services considering seasonality and a lag effect, and the average area under the curve was as high as 0.7971. Our results indicate that (1) the performance of forecasting during the warm season is significantly better than that in the cold season, (2) considering air pollution would improve the performance of forecasting the demand for hemorrhagic stroke healthcare services using machine learning, (3) the association between the demand for hemorrhagic stroke healthcare services and air pollutants is linear to some extent, and (4) it is feasible to use short-term concentrations of air pollutants to forecast the demand for hemorrhagic stroke healthcare services. This practical forecast model could provide an advance warning regarding the potentially high numbers of hemorrhagic stroke admissions to medical institutions, thus allowing time to implement an appropriate response to the increase in patient volumes.


Introduction
Stroke, also known as cerebrovascular accident, cerebrovascular insult, or "brain attack," occurs when poor blood flow to the brain results in cell death. From a statistical perspective, stroke is the second most common fatal disease in the world [1], the fourth most common disease in America [2], and the most common in China [3]. Thus, it is considered to seriously affect the physical health and quality of life of patients [4]. In 2013, 6.5 million patients who suffered from a stroke died, representing over a quarter of the number of stroke survivors (25.7 million) [1]. In addition, stroke imposes a significant economic burden on patients and healthcare services [5]. The total annual cost of stroke treatments in 2008 in the United States and European Union countries was estimated at $65.5 billion and €27 billion [6], respectively. In China, the annual cost of stroke care in 2011 was approximately RMB¥40 billion [3].
Factors affecting the clinical evolution of stroke include the physical condition of patients, such as the location of stroke [7], leukocyte level [8], and complications of stroke [9,10]. In recent years, environmental health has continued to deteriorate with respect to air pollution, and smog from vehicular and industrial emissions has become a particular matter of concern for the public and for government policy. Simultaneously, the increasing prevalence of many diseases, including stroke, has increased the concern about air pollution as a serious threat to public health. Nitrogen dioxide (NO2) and particulate matter with an aerodynamic diameter of ≤10 μm (PM10) are significantly associated with cardiovascular mortality, with increasing concentrations of NO2 noted to have a greater impact on cardiovascular mortality among men and the elderly [11]. Pearce et al. [12] showed that exposure to high levels of outdoor nitrogen oxide is significantly associated with an increased risk of stroke. Wing et al. [13] revealed that higher levels of particulate matter with a grain size of 2.5 μm or less (PM2.5) and ozone (O3) are associated with a higher incidence of stroke. Several epidemiological studies [14,15] have reported a significant positive correlation between air pollution and stroke. PM2.5, NO2, PM10, carbon monoxide (CO), and O3 are the most common pollutants associated with stroke.
As a core component of health systems, healthcare service management aims to notify the related institutions of the expected demand in a timely and accurate fashion, enabling these institutions to make effective decisions on resource allocation and reinforce their healthcare systems for the anticipated demand [16]. In particular, Liu et al. [17] demonstrated that short-term exposure to PM2.5 and PM10 increased the risk of hemorrhagic stroke, which accounts for 15% of all stroke cases and 40% of deaths due to stroke. Hence, the key to optimizing healthcare resource allocation and improving the quality of health services is to forecast the possible excess demand for stroke healthcare services, especially that of hemorrhagic stroke, according to changes in external environmental factors, such as air quality.
To the best of our knowledge, few studies have focused on forecasting the demand for stroke healthcare services. However, many studies have used machine learning to forecast the effect of air quality on diseases. Soyiri et al. [16] utilized a multistage quantile regression approach to forecast the excess demand for healthcare services in the form of daily asthma admissions by using retrospective data on weather and air quality from the Hospital Episode Statistics database. Moustris et al. [18] developed three different artificial neural network models to forecast the total weekly number of childhood asthma admissions in the greater Athens area of Greece. The three models were trained to forecast childhood asthma admissions for the subgroups of 0-4-year-olds and 5-14-year-olds as well as for the entire study population. Using data regarding weather factors, air quality, and hospital asthma admissions, Soyiri et al. [16] developed two related negative binomial models to forecast admissions due to asthma in London. Zhang et al. [19] analyzed and forecasted the monthly hospital admissions and hospitalization expenses for respiratory diseases in Shanghai using the autoregressive integrated moving average model. These studies indicate that machine learning (including traditional statistical learning) can be used to forecast such issues. However, these studies used only a single method to forecast healthcare service demand and did not conduct a comparative analysis to determine the most suitable forecasting model. In addition, feature selection, which may help facilitate the forecast process, was not considered.
In addition, seasonality is an important factor. Zhang et al. [20] indicated that seasonal patterns in the health impacts of air pollution have been demonstrated in a number of previous investigations, although the findings were inconsistent, with peaks occurring in cold, hot, or transitional seasons. The China Air Pollution and Health Effects Study (CAPES) identified a two-peak (winter and summer) seasonal pattern in 17 Chinese cities for the PM10-related mortality effect. Also, the season-modified effects varied by geographic region in several Chinese single-city investigations. In addition, Xiang et al. [21] demonstrated that, in contrast to the warm season, NO2 concentrations were significantly correlated with stroke hospitalization rates during the cold season. Hence, it is more appropriate to construct different forecast models for different seasonal patterns.
This study aimed to forecast the pattern of the demand for hemorrhagic stroke healthcare services based on air quality using machine learning techniques. Due to the disparity in the association between the demand for hemorrhagic stroke healthcare services and air quality in different seasons, we constructed two different forecast models. In addition, a lag effect was considered when selecting the features for forecasting and the model with optimal performance. This practical forecast model could provide advance warning to medical institutions. Healthcare resource managers can also allocate the corresponding resources according to the expected demand, thus guaranteeing the accessibility of timely healthcare resources. Based on our research, a surveillance system to enhance early detection and interventions for hemorrhagic stroke can be implemented in advance to avoid shortages in healthcare resources due to hemorrhagic stroke.
The dataset included 7,230 stroke events; among them, there were 1932 cases of hemorrhagic stroke. Because nearly all the medical data for the region are recorded at this center, the data can be considered as representative of hemorrhagic stroke occurrence across the entire population of the Longquanyi District. Within these data, the personal information of deceased patients was recorded, including the date of hemorrhagic stroke onset and demographics.

Data and Experiment Setup
Data regarding air pollution for the period 2016-2017 were obtained from environmental monitoring stations in the Longquanyi District of Chengdu, including data regarding the concentrations of PM2.5, PM10, CO, NO2, O3, and sulfur dioxide (http://www.cnemc.cn/). All data regarding air quality were recorded in micrograms per cubic meter but converted into milligrams per cubic meter for CO and parts per million for the other pollutants. Since temperatures may affect the incidence of stroke [22], the minimum and maximum daily temperatures recorded by the Longquanyi District Meteorological Agency were also used as predictors.
This study did not involve human subjects and adhered to all current laws of China.
To identify seasonal disparities, and considering that Chengdu is located in Southwest China with a subtropical monsoon climate, we distinguished between warm and cold seasons. The period between April 1 and September 30 was regarded as the warm season, while all other months were regarded as the cold season.
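The season rule above can be sketched in a few lines of Python. This is only an illustration of the stated date rule, not the authors' code; the function names `season_of` and `split_by_season` and the `(date, value)` record layout are assumptions introduced here.

```python
from datetime import date


def season_of(day: date) -> str:
    """Return 'warm' for April 1 to September 30 (inclusive), else 'cold'."""
    return "warm" if 4 <= day.month <= 9 else "cold"


def split_by_season(records):
    """Partition (date, value) pairs into warm-season and cold-season lists.

    `records` is a hypothetical layout: an iterable of (date, value) tuples.
    """
    parts = {"warm": [], "cold": []}
    for day, value in records:
        parts[season_of(day)].append((day, value))
    return parts
```

Because the warm season covers whole calendar months, checking the month alone is enough to apply the rule.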

Experiment Setup.
Our study viewed the pattern of the demand for hemorrhagic stroke healthcare services in Longquanyi District as a complex and nonlinear system and assumed that newly occurring hemorrhagic stroke events would have no effect on the system. Data analysis was performed in two stages: a descriptive statistical process and a forecast process. In the former stage, we performed descriptive statistical analyses of air pollution data and historical data. The population stroke status took one of two values: "normal" and "excess." "Normal" referred to a scenario in which the number of stroke events on a certain day was lower than the capacity limit, while "excess" referred to a scenario in which the number of stroke events was higher than the capacity limit. In our study, the capacity limit was defined as the number of events that covered 70% of the demand for hemorrhagic stroke healthcare services.
In the forecast process, data regarding daily hemorrhagic stroke admissions, minimum and maximum daily temperature, and air quality were merged by date to form a time-series dataset. Lag effects were also considered in this study. A scheme has a lag of N when it uses the data from the preceding day back to N days prior. For each scheme, the lag varied from 1 to 14. To extract the key features, we used least absolute shrinkage and selection operator (LASSO) regression to simplify the model and determine the risk factor sets considering lag effects, because LASSO is a good way to mitigate multicollinearity among air pollutants. Ten-fold cross-validation was used to retain a reliable and stable model. MaxLag-N refers to the risk factor sets that considered the air quality variables of the most recent N days.
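As an illustration of how a MaxLag-N candidate feature set could be assembled before feature selection, the following sketch builds lagged columns from daily series. It is a minimal example under stated assumptions, not the authors' pipeline: the function name `build_maxlag_features` and the dictionary input layout are hypothetical, and the LASSO selection step itself is not shown. The `NAME_lag` column naming follows the convention the paper uses for its risk factors (for example, CO_1 for the CO concentration one day earlier).

```python
def build_maxlag_features(series, max_lag):
    """Build lagged features for every variable in `series`.

    `series` maps a variable name to a chronological list of daily values
    (all lists the same length). Returns (rows, names), where rows[i] holds
    the features used to forecast day i + max_lag, i.e. the values from
    1 to max_lag days earlier for each variable.
    """
    lags = range(1, max_lag + 1)
    names = [f"{name}_{lag}" for name in series for lag in lags]
    n_days = len(next(iter(series.values())))
    rows = []
    for t in range(max_lag, n_days):  # the first max_lag days lack full history
        rows.append([series[name][t - lag] for name in series for lag in lags])
    return rows, names


# Usage with two toy series and a maximum lag of 2 days
rows, names = build_maxlag_features({"CO": [1, 2, 3, 4], "NO2": [10, 20, 30, 40]}, 2)
print(names)  # ['CO_1', 'CO_2', 'NO2_1', 'NO2_2']
print(rows)   # [[2, 1, 20, 10], [3, 2, 30, 20]]
```

Days without a complete N-day history are dropped here; that is one possible convention, chosen for simplicity.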
The selected subsets of MaxLag-N were used to train and test machine learning models with 10-fold cross-validation. The following machine learning models were considered in our study: logistic regression (LR), random forest (RF), support-vector machines with linear kernel (SVMLinear), k-nearest neighbor algorithm (KNN), and extreme gradient boosting decision tree (XGBTree) and extreme gradient boosting linear (XGBLinear) models, which are extreme gradient boosting algorithms based on tree and linear models, respectively. The evaluation metrics included the area under the curve (AUC), sensitivity, and specificity. The larger the AUC value, the better the model discriminates between the two classes and the better its overall predictive performance. Sensitivity refers to the proportion of actual high-incidence prediction targets that are predicted to be high-incidence targets. Specificity refers to the proportion of actual low-incidence prediction targets that are predicted to be low-incidence targets.
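The three evaluation metrics can be computed as in the following pure-Python sketch. It is illustrative only and is not the implementation used in the study; the encoding of high-incidence days as 1 and low-incidence days as 0, and the function names, are assumptions made for the example. The AUC is computed through its rank interpretation: the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as one half.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded 1/0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Usage on a toy example with a 0.5 decision threshold
y_true = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
print(auc(y_true, scores))                      # 5 of 6 pairs ranked correctly
print(sensitivity_specificity(y_true, y_pred))  # (0.5, 0.666...)
```

Sensitivity and specificity depend on the chosen decision threshold, whereas the AUC summarizes ranking quality across all thresholds, which is why a model can have a reasonable AUC and a low sensitivity at the same time.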
In this study, we first partitioned the dataset into warm and cold datasets according to the date of hemorrhagic stroke onset. Then, the MaxLag-N (N ranging from 1 to 14) risk factor sets of the warm and cold datasets were determined separately by LASSO regression, and models for each N value and dataset were trained and tested using the aforementioned machine learning methods. Moreover, models that did not consider air pollution were also trained, their performance was analyzed, and it was compared with that of the models considering air pollution. Finally, statistical tests were performed to assess the disparities in performance (especially AUC) with respect to seasons, lags, and machine learning models.

Results
During the study period, the daily average number of hemorrhagic stroke events was 2.9861 (standard deviation (SD), 1.8650). During the warm season, the daily average number of hemorrhagic stroke events was 2.9780 (SD, 1.9848), and there were a total of 947 hemorrhagic stroke events. During the cold season, the daily average number of hemorrhagic stroke events was 2.9939 (SD, 1.7443), and there were a total of 985 hemorrhagic stroke events. Hence, compared to the large population (approximately 643,000 residents), the newly occurring hemorrhagic stroke events (2.9861 cases per day on average) would have no effect on the system, which indicates that the assumption in this study is reasonable. Table 1 shows the related statistics in detail.
Mean denotes the average number of daily hemorrhagic stroke events. SD denotes the standard deviation of the number of hemorrhagic stroke events. Min and Max denote the minimum and maximum number of hemorrhagic stroke events, respectively, and Sum denotes the total number of hemorrhagic stroke events. Each quartile of the daily events is shown under the respective percentage. Table 2 shows the daily level of different atmospheric pollutants, including the average daily level in the research period (2015-12-17 to 2017-12-31), the SD of the daily average concentration of each air pollutant, and the highest daily level of different atmospheric pollutants (Max). The main atmospheric pollutants were PM2.5 and O3; these were the main pollutants on up to 720 days (of a total of 989 days).
To define the population hemorrhagic stroke healthcare demand status, we assessed the total number of hemorrhagic stroke events to identify the threshold of the daily population hemorrhagic stroke status during the warm and cold seasons. Figure 1 describes the hemorrhagic stroke events of each day and presents a homogeneous degree of hemorrhagic stroke events for daily hemorrhagic stroke event counts. The x-axis denotes the daily number of hemorrhagic stroke events, and the y-axis denotes the cumulative proportion of hemorrhagic stroke events. The black solid and red dashed curves denote the daily hemorrhagic stroke event counts in the cold and warm seasons, respectively. Hence, the threshold daily numbers of hemorrhagic stroke cases in the cold and warm seasons were 4 and 5, according to the "nearest" criteria.
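One way to read the threshold rule is sketched below: accumulate the share of all events contributed by days with at most k events, and pick the k whose cumulative share is nearest to the 70% target. This is an interpretation of the description above, not the authors' code. The function names are hypothetical, and treating a day whose count equals the threshold as "normal" is an assumption, since the text only defines the cases strictly below and strictly above the limit.

```python
from collections import Counter


def capacity_threshold(daily_counts, coverage=0.70):
    """Return the daily count whose cumulative share of events is nearest `coverage`."""
    total = sum(daily_counts)
    days_per_level = Counter(daily_counts)
    best_k, best_gap, running = None, float("inf"), 0
    for k in sorted(days_per_level):
        running += k * days_per_level[k]  # events contributed by days with exactly k events
        gap = abs(running / total - coverage)
        if gap < best_gap:
            best_k, best_gap = k, gap
    return best_k


def label_days(daily_counts, threshold):
    """Label each day; counts equal to the threshold are treated as 'normal' (assumption)."""
    return ["excess" if c > threshold else "normal" for c in daily_counts]


# Usage on toy data: cumulative shares are 2/21, 6/21, 12/21, 16/21, 21/21
counts = [1, 1, 2, 2, 3, 3, 4, 5]
k = capacity_threshold(counts)
print(k)                         # 4, since 16/21 is nearest to 0.70
print(label_days([3, 4, 5], k))  # ['normal', 'normal', 'excess']
```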
According to the study design, all data were partitioned into the warm and cold datasets according to the date of hemorrhagic stroke onset. Then, the MaxLag-N (with N ranging from 1 to 14) risk factor sets of the warm and cold datasets were determined separately by LASSO regression. Models for each N value and dataset were trained and tested using the aforementioned machine learning methods. Comparative analysis between the warm and cold seasons was performed using the t-test. Table 3 shows the results of the comparative analysis and presents the P values of the t-test and the average values of the evaluation metrics. The average AUC of the models for the warm season was 0.6801, while that of the models for the cold season was 0.5721. There were significant differences in all evaluation metrics between the warm and cold seasons. In addition, the performance of the models for the cold season was not good enough (AUC: 0.5721); hence, we focused only on the models for the warm season in the subsequent analyses. The risk factor sets of the warm datasets selected by LASSO are shown in Table 4.
Table note: the suffix "_N" denotes a lag of N days; for example, CO_1 refers to the concentration of CO one day earlier. "Low" refers to the lowest temperature.
Table 5 shows the statistics on the performance of the models for the warm season according to the machine learning methods. LR was the most effective model (mean AUC, 0.7369; SD, 0.0276); the other models performed worse than LR but all had average AUC values >0.65. The models, in decreasing order of average AUC, were LR, RF, SVMLinear, KNN, XGBLinear, and XGBTree. LR also had the highest sensitivity (0.4684) and specificity (0.8708). Apart from LR, all models had average sensitivities <0.3. The models, in decreasing order of average sensitivity, were LR, XGBLinear, KNN, RF, XGBTree, and SVMLinear. Apart from SVMLinear (average specificity, 0.7483), all models had average specificities >0.80. The other models, in decreasing order of average specificity, were LR, XGBTree, KNN, XGBLinear, and RF.
Table 6 shows the P values of the t-test between different machine learning methods regarding AUC. The null hypothesis of the t-test is that there is no difference in mean AUC between two machine learning methods. The P value is the probability, under the null hypothesis, of observing a difference at least as large as the one measured. If the P value is less than 0.05, we reject the null hypothesis; otherwise, we fail to reject it. As shown in Table 6, there were significant differences between LR and all other models at the 0.001 significance level. In addition, the difference between XGBTree and RF was also significant, but at the 0.05 significance level.
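The comparison of two methods' AUC values can be illustrated with a two-sample t statistic. The paper does not state which variant of the t-test was used, so the sketch below assumes Welch's unequal-variance form; the function name `welch_t` and the AUC values in the usage lines are made up for illustration. Only the statistic and its degrees of freedom are computed, because converting them to a P value requires the t distribution, which the Python standard library does not provide.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for Welch's two-sample t-test on samples a and b."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Usage with hypothetical AUC values for two methods
auc_method_a = [0.74, 0.72, 0.76, 0.73]
auc_method_b = [0.66, 0.68, 0.65, 0.67]
t, df = welch_t(auc_method_a, auc_method_b)
print(t, df)
```

A large absolute t value relative to the t distribution with df degrees of freedom corresponds to a small P value, which is the quantity reported in Table 6.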
In addition, Table 7 presents the performance of the warm season models that did not consider air pollution, by machine learning method. In Table 7, the mean AUC values of LR, RF, SVMLinear, and XGBTree without considering air pollution are lower than those considering air pollution, whereas the opposite holds for KNN and XGBLinear. When air pollution was not taken into consideration, RF, SVMLinear, KNN, and XGBLinear performed better in terms of sensitivity, and RF, SVMLinear, and XGBLinear performed better in terms of specificity. However, the standard deviations of all three metrics for all models without considering air pollution are higher than those considering air pollution. Table 8 shows the P values of the t-test between models with and without air pollution for the different metrics. According to Table 8, the difference between the two situations is significant in only two scenarios: LR with AUC and SVMLinear with specificity. Table 9 shows the statistics of the performance of the warm season models in terms of lag effects. The best lag period was MaxLag-14 (average AUC, 0.7314), considering not only its average AUC but also the other evaluation indexes. The accuracy of prediction varied with the number of lag days considered. Table 10 shows the models with AUC >0.75. The best model in our study was LR considering a 14-day lag effect, and its AUC (0.7971) was close to 0.8. This model combined the best-performing method with the best lag. In addition, four other models had AUC >0.75: SVMLinear with MaxLag-14, LR with MaxLag-13, RF with MaxLag-14, and LR with MaxLag-9.

Discussion
This study aimed to forecast the pattern of the demand for hemorrhagic stroke healthcare services based on air quality using machine learning that considered lag effect and season disparity. A few insights in the aspects of feasibility, model selection, and season disparity are presented below.
LR achieved the best AUC both with and without considering air pollution. In addition, the difference in AUC between the two situations is significant for LR. Hence, according to the results, considering air pollution improves the forecasting of hemorrhagic stroke healthcare service demand.
It is feasible to use short-term concentrations of air pollutants to forecast the demand for hemorrhagic stroke healthcare services. In our study, we used pollution information from at most the preceding 14 days to forecast the demand for hemorrhagic stroke healthcare, and this achieved a good level of performance. For MaxLag-14 models, the average AUC was 0.7314. For LR with MaxLag-14 in particular, the average AUC was as high as 0.7971. This AUC value is close to 0.8, which suggests that the model could be useful in practical implementation.
Among all machine learning methods, the linear models achieved the best performance. In general, the average AUCs of the linear models (LR, SVMLinear, and XGBLinear) were better than those of the other models (RF, KNN, and XGBTree). LR, the most commonly used linear model, achieved the best performance in all aspects (AUC, sensitivity, and specificity). These results may indicate that the association between the demand for hemorrhagic stroke healthcare services and air pollutants is linear to some extent.
The performance of forecasting during the warm season was significantly better than that during the cold season. The average AUC, sensitivity, and specificity of the warm season were higher than those of the cold season, and the P values of the t-test were all <0.0001, which indicates that the warm season models were significantly superior to the cold season models. In a study conducted by Xiang et al. [21], NO2 concentrations were significantly correlated with stroke hospitalization rates during the cold season rather than the warm season. From that finding, one might intuitively infer that cold season models should be superior to warm season models, which is in contrast to our results. This disparity may lie in the fact that Xiang et al. [21] considered only a single air pollutant, while we considered six air pollutants and temperature simultaneously.
Our study has some limitations. Although most representative machine learning techniques were considered in this study, the number of machine learning techniques was still limited. In addition, although LASSO is well acknowledged as a useful feature selection method, other feature selection methods should also be considered. Finally, this research involved only hemorrhagic stroke events that occurred in a single region. Regional disparities may exist in terms of performance. Further comparative research will be conducted to support the findings of the present study and address potential disparities.
Table notes: M-AUC, M-Sens, and M-Spec denote the average area under the curve (AUC), sensitivity, and specificity, respectively; SD-AUC, SD-Sens, and SD-Spec denote the standard deviation of the AUC, sensitivity, and specificity, respectively. LR, logistic regression; RF, random forest; SVMLinear, support-vector machines with linear kernel; KNN, k-nearest neighbor algorithm; XGBTree, extreme gradient boosting decision tree; XGBLinear, extreme gradient boosting linear model. "+" indicates that the corresponding value without considering air pollution is higher than that considering air pollution. "-" indicates that the corresponding value considering air pollution is higher than that without considering air pollution. MaxLag-N refers to the risk factor sets that considered the air quality variables of the recent N days.

Conclusions
We developed a practical city-based forecast model using machine learning methods and the concentration of air pollutants. The results of our study indicate that (1) the performance of forecasting in the warm season is significantly better than that in the cold season, (2) considering air pollution would improve the performance of forecasting the demand for hemorrhagic stroke healthcare services using machine learning, (3) the association between the demand for hemorrhagic stroke healthcare services and air pollutants is linear to some extent, and (4) it is feasible to use short-term concentrations of air pollutants to forecast the demand for hemorrhagic stroke healthcare services. This practical forecast model could provide warnings in advance to medical institutions regarding the potentially high numbers of admissions due to hemorrhagic stroke, thus allowing time to implement an appropriate response to the increase in patient volumes.

Data Availability
The data supporting the findings of this study will not be shared because they are organizational property. Data were anonymous, and study subjects could not be identified.

Ethical Approval
This study did not involve human subjects and adhered to all current laws of China.

Conflicts of Interest
The authors declare that they have no conflicts of interest.