A predictive analytics approach for stroke prediction using machine learning and neural networks

The negative impact of stroke on society has led to concerted efforts to improve its management and diagnosis. With increased synergy between technology and medical diagnosis, caregivers create opportunities for better patient management by systematically mining and archiving patients' medical records. It is therefore vital to study the interdependency of these risk factors in patients' health records and to understand their relative contribution to stroke prediction. This paper systematically analyses the various factors in electronic health records for effective stroke prediction. Using various statistical techniques and principal component analysis, we identify the most important factors for stroke prediction. We conclude that age, heart disease, average glucose level, and hypertension are the most important factors for detecting stroke in patients. Furthermore, a perceptron neural network using these four attributes provides the highest accuracy rate and lowest miss rate compared with using all available input features and other benchmarking algorithms. As the dataset is highly imbalanced with respect to the occurrence of stroke, we report our results on a balanced dataset created via sub-sampling techniques.


Introduction
We have witnessed remarkable developments in the field of medicine with the aid of technology [1]. With the advent of annotated datasets of medical records, we can now use data mining techniques to identify trends in the data. Such analysis has helped medical practitioners make accurate prognoses of medical conditions, and has led to improved healthcare and reduced treatment costs. The use of data mining techniques on medical records has a great impact on the fields of healthcare and bio-medicine [2,3], as it assists medical practitioners in identifying the onset of disease at an earlier stage. We are particularly interested in stroke, and in identifying the key factors associated with its occurrence.
Several studies [4,5,6,7] have analysed the influence of lifestyle factors and patients' medical records on the probability of developing stroke.
Furthermore, machine learning models are now employed to predict the occurrence of stroke [8,9]. However, no study attempts to analyse all the conditions related to a patient and to identify the key factors necessary for stroke prediction. In this paper, we attempt to bridge this gap by providing a systematic analysis of the various patient records for the purpose of stroke prediction. Using a publicly available dataset of 29072 patients' records, we identify the key factors that are necessary for stroke prediction. We use principal component analysis (PCA) to transform the higher-dimensional feature space into a lower-dimensional subspace, and to understand the relative importance of each input attribute. We also benchmark several popular machine-learning-based classification algorithms on the dataset of patient records.
The main contributions of this paper are as follows: (a) we provide a detailed understanding of the various risk factors for stroke prediction, analysing the various factors present in patients' Electronic Health Records (EHRs) and identifying the most important ones for stroke prediction; (b) we use a dimensionality reduction technique to identify patterns in a low-dimensional subspace of the feature space; and (c) we benchmark popular machine learning models for stroke prediction on a publicly available dataset.
We follow the spirit of reproducible research, and therefore the source code of all simulations used in this paper is available online. 1 The structure of the paper is as follows. The remaining part of Section 1 provides an overview of the related work and describes the dataset used in our study. Section 2 covers the correlation analysis and feature importance analysis.
The results from Principal Component Analysis are explained in Section 3. The data mining algorithms used for predictive modelling, and their performance on the dataset, are detailed in Section 4. Finally, Section 5 concludes the paper and discusses future work.

Related Work
Existing works in the literature have investigated various aspects of stroke prediction. Jeena et al. provide a study of various risk factors to understand the probability of stroke [8], using a regression-based approach to identify the relation between each factor and its corresponding impact on stroke. In Hanifa and Raja [9], improved accuracy for predicting stroke risk was achieved using radial basis and polynomial functions applied in a non-linear support vector classification model. The risk factors identified in this work were divided into four groups: demographic, lifestyle, medical/clinical and functional. Similarly, Luk et al. studied 878 Chinese subjects to understand whether age has an impact on stroke rehabilitation outcomes [10]. Min et al. [11] developed an algorithm for predicting stroke from potentially modifiable risk factors. Singh and Choudhary [12] used a decision tree algorithm on the Cardiovascular Health Study (CHS) dataset for predicting stroke in patients. A deep learning model based on a feed-forward multi-layer artificial neural network was also studied in [13] to predict stroke. Similar work was explored in [14,15,16] for building intelligent systems to predict stroke from patient records.
Hung et al. in [17] compared deep learning models and machine learning models for stroke prediction from electronic medical claims database. In addition to conventional stroke prediction, Li et al. in [18] used machine learning approaches for predicting ischaemic stroke and thromboembolism in atrial fibrillation.
The results from the various techniques indicate that multiple factors can affect the outcome of any study: the way the data was collected, the selected features, the approach used in cleaning the data, the imputation of missing values, and the randomness and standardization of the data all have an impact. Therefore, it is important for researchers to identify how the different input factors in an electronic health record are related to each other, and how they impact the final stroke prediction accuracy.
Studies in related areas [19,3] demonstrate that identifying the important features impacts the final performance of a machine learning framework. It is important to identify the best combination of features, instead of using all the available features in the feature space. As indicated in [3], attributes that are redundant or irrelevant to a class should be identified and removed before a classification algorithm is applied. Therefore, it is essential for data mining practitioners in healthcare to identify how the risk factors captured in electronic health records are inter-dependent, and how each independently impacts the accuracy of stroke prediction.

Electronic Health Records Dataset
An Electronic Health Record (EHR), also known as an Electronic Medical Record (EMR), is a repository of information for a patient. It is an automated, computer-readable record of the medical status of a patient, keyed in by qualified medical practitioners. The records contain the vitals, diagnoses and medical exam results of a patient. The future of medical diagnosis looks promising with the optimal use of EHRs: their use in US hospitals increased from 12.5% to 75.5% between 2009 and 2014, as indicated by the statistics recorded in [20].
For our study, we use a dataset of electronic health records released by McKinsey & Company as a part of their healthcare hackathon challenge. 2 The dataset is available from Kaggle, 3 a public data repository. It contains the EHR records of 29072 patients, with a total of 11 input attributes and 1 output feature. The output response is a binary state indicating whether the patient has suffered a stroke. The 11 input features are: patient identifier, gender (G), age (A), a binary status indicating whether the patient suffers from hypertension (HT), a binary status indicating whether the patient suffers from heart disease (HD), marital status (M), occupation type (W), residence (urban/rural) type (RT), average glucose level (AG), body mass index (BMI), and the patient's smoking status (SS). The dataset is highly unbalanced with respect to the occurrence of stroke events; most of the records belong to patients who have not suffered a stroke. The publisher of the dataset has ensured that the ethical requirements related to this data are met to the highest standards. In the subsequent discussion, we exclude the patient identifier and consider the remaining 10 input features, together with the 1 response variable, in our study and analysis.

Analysing Electronic Health Records
In this section, we provide an analysis of electronic health records dataset.
We perform a correlation analysis of the features, using the entire dataset of EHR records. If two features have a very high correlation, one of them can be ignored in predicting the occurrence of stroke, as it does not contribute any additional knowledge to the prediction model. Moreover, we evaluate the behaviour of the features individually and in groups to gauge the importance of each individual feature in predicting the occurrence of a stroke. A systematic analysis of the input feature space is an integral part of stroke prediction: finding an optimal and minimal set of predictive features reduces the computational cost of modelling and enables efficient archival of EHR records. This paves the way for clinicians to record only those features in the EHR records that are most useful for stroke prediction.

Correlation between features
We use Pearson's correlation coefficient to generate Fig. 1, which shows the correlation between different patient attributes. This coefficient quantifies the strength of the linear relationship between any two features of the patient's electronic health data. We use a colourmap in Fig. 1 such that blue represents positive correlation and red represents negative correlation. The deeper the colour and the larger the circle, the stronger the correlation between the two patient attributes.
As is intuitive, the correlation of an attribute with itself is unity. There is a significant correlation between a patient's marital status and their age, with a correlation index of 0.5. A patient's age also has positive correlations with their type of work (correlation index 0.38), with whether they suffer from hypertension or heart disease, and with their average glucose level. This correlation of a patient's age with other attributes seems intuitive, as most ailments occur in an ageing population. The patient's type of residence is not correlated with any other attribute. A patient's type of work has a positive correlation with their marital status, with a correlation index of 0.35.
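As an illustration of this analysis, the sketch below computes pairwise Pearson correlation coefficients in plain Python. Our experiments were run in R; the column names mirror the dataset's attributes, but the toy values here are hypothetical and only serve to show the computation.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_matrix(columns):
    """Pairwise Pearson correlations; `columns` maps feature name -> values."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}

# Illustrative toy records: age (A), average glucose (AG), hypertension (HT).
cols = {
    "A":  [67, 45, 80, 52, 71, 38],
    "AG": [228.7, 95.1, 171.2, 110.9, 190.4, 85.3],
    "HT": [1, 0, 1, 0, 1, 0],
}
corr = correlation_matrix(cols)
print(round(corr[("A", "AG")], 2))  # self-correlation corr[("A","A")] is 1.0
```

In practice one would compute this matrix over all ten input features and visualize it as in Fig. 1.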
In summary, the correlation matrix shown in Fig. 1 indicates that the input features are largely uncorrelated with each other. The following subsections analyse the importance of individual features for stroke prediction. The analysis described above shows that patient's age (A) has a comparatively higher importance by itself, yet a combination of different features may improve prediction because the features are not correlated with each other. Furthermore, we also compute the CHADS2 score for the EHR records. CHADS2 is a stroke risk score for non-valvular atrial fibrillation, where C represents congestive heart failure or impairment of left ventricular function, H represents hypertension, A stands for age, D stands for diabetes, and S2 stands for a prior stroke or transient ischaemic attack (history of thromboembolism), which is weighted with two points. Fig. 3 shows the distribution of the CHADS2 score for our dataset. We observe that most of the EHR observations have a low CHADS2 score. Fig. 4 shows the proportion of cases with a stroke event for each score level; the larger the CHADS2 score, the higher the occurrence of stroke cases. Combining Fig. 3 and Fig. 4, we find that most people have a CHADS2 score of 1 or 2 and a low probability of stroke, while only a small number of people with a CHADS2 score greater than 2 have a higher probability of stroke occurrence.
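The CHADS2 computation can be sketched as follows. This is an illustrative Python version; note that the EHR dataset does not directly record diabetes or prior-stroke/TIA fields, so in practice those inputs must be derived from the available attributes or supplied separately.

```python
def chads2(chf, hypertension, age, diabetes, prior_stroke_tia):
    """CHADS2 stroke-risk score for non-valvular atrial fibrillation.

    One point each for congestive heart failure, hypertension, age >= 75,
    and diabetes; two points for a prior stroke or transient ischaemic
    attack (history of thromboembolism). The score ranges from 0 to 6.
    """
    score = int(chf) + int(hypertension) + (1 if age >= 75 else 0) + int(diabetes)
    score += 2 * int(prior_stroke_tia)
    return score

# An 80-year-old with hypertension and a prior TIA scores 1 + 1 + 2 = 4.
print(chads2(chf=False, hypertension=True, age=80, diabetes=False,
             prior_stroke_tia=True))  # prints 4
```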

Selection of Optimum Features for stroke prediction
In the previous section, we considered the features one at a time and saw that only a few features have higher importance in stroke prediction. In this section, we start with all features and then remove or add one feature at a time to further analyse the importance of an individual feature in stroke prediction. We use a perceptron neural network for this experiment, and the entire dataset of EHR records for the feature analysis. We observe that the accuracy rate is significantly affected when feature A is removed from the pool. This is in line with our earlier discussion, which showed that A has the highest score amongst all features (see Fig. 2). There are no significant changes when other features are removed from the pool.
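The leave-one-feature-out procedure can be sketched as below. This is a minimal pure-Python perceptron run on hypothetical toy data, not the R implementation used for our experiments; with the real EHR data the same loop would iterate over all ten input features.

```python
import random

def train_perceptron(X, y, epochs=50, lr=0.1, seed=0):
    """Classic perceptron training; returns weights (last entry is the bias)."""
    rng = random.Random(seed)
    w = [0.0] * (len(X[0]) + 1)
    idx = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(idx)
        for i in idx:
            xi = X[i] + [1.0]  # append the bias input
            pred = 1 if sum(a * b for a, b in zip(w, xi)) > 0 else 0
            err = y[i] - pred
            w = [wj + lr * err * xj for wj, xj in zip(w, xi)]
    return w

def accuracy(w, X, y):
    correct = 0
    for xi, yi in zip(X, y):
        pred = 1 if sum(a * b for a, b in zip(w, xi + [1.0])) > 0 else 0
        correct += (pred == yi)
    return correct / len(y)

def ablation(X, y, names):
    """Drop one feature at a time and report the resulting accuracy."""
    results = {"all": accuracy(train_perceptron(X, y), X, y)}
    for j, name in enumerate(names):
        Xj = [row[:j] + row[j + 1:] for row in X]
        results[f"-{name}"] = accuracy(train_perceptron(Xj, y), Xj, y)
    return results

# Toy data where the first feature (standing in for age A) drives the label.
X = [[0.9, 0.1], [0.8, 0.7], [0.2, 0.6], [0.1, 0.2], [0.7, 0.4], [0.3, 0.9]]
y = [1, 1, 0, 0, 1, 0]
results = ablation(X, y, ["A", "BMI"])
print(results)
```

On this toy data, removing the informative first feature degrades the accuracy noticeably, mirroring the effect we observe when A is removed from the real feature pool.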

Principal Component Analysis
In this section, we analyse the variance in the dataset using Principal Component Analysis (PCA). In this multivariate analysis, the dataset is transformed into a new coordinate system of principal components. Each input feature is represented as a vector v_j, and the v_j features are stacked together to create the matrix X̃ ∈ R^{mn×10}. We perform the PCA on the normalized matrix Ẍ, obtained by normalizing X̃ with the corresponding means v̄_j and standard deviations σ_{v_j} of the individual features: Ẍ_{ij} = (X̃_{ij} − v̄_j) / σ_{v_j}. We interpret how the results from PCA relate to the predictive variables (features) represented by patient attributes and to the individual observations represented by the medical health records. We study the relation between the first two principal components and the individual input variables, as well as the importance of the first two principal components for a given observation.
We use the guide by Abdi and Williams [21] for this study.
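The feature-wise normalization applied before PCA can be sketched as follows. This is an illustrative Python version using the population standard deviation; the toy values stand in for columns on very different scales, such as age and average glucose level.

```python
from math import sqrt

def zscore_columns(X):
    """Column-wise standardization: subtract each column mean and divide by
    its standard deviation, as done before running PCA."""
    n = len(X)
    m = len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(m)]
    stds = [sqrt(sum((row[j] - means[j]) ** 2 for row in X) / n) for j in range(m)]
    return [[(row[j] - means[j]) / stds[j] for j in range(m)] for row in X]

# Two toy columns on different scales (e.g. age vs. average glucose).
X = [[67.0, 228.7], [45.0, 95.1], [80.0, 171.2], [52.0, 110.9]]
Z = zscore_columns(X)
col0 = [row[0] for row in Z]
print(round(sum(col0), 10))  # each normalized column has zero mean
```

Without this step, the high-variance glucose column would dominate the principal components purely because of its scale.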

Variance explained by principal components
A scree plot is used to select the components which explain most of the variability in the dataset, generally 80% or more of the variance. Fig. 6(a) shows the scree plot for our dataset. We have included the scree plot for the balanced dataset as well in Fig. 6(b); this is useful for understanding the impact of the unbalanced nature of the dataset on the subspace representation. Table 1 shows the contribution of each variable towards the first two principal components. The sum of the squares of all loadings for an individual principal component equals unity. Therefore, if a variable's loading crosses the threshold value of 1/√10 ≈ 0.316, the variable has a strong contribution towards that principal component.
In Table 1, the variables that cross the threshold are in bold. We observe that the variables A, AG, HT and M have a strong contribution towards the first principal component, and the variable HD also shows a significant contribution. From earlier discussions, we saw that these features are indeed important from a stroke prediction point of view. Therefore, the first principal component (which has stronger loadings from these variables) might be useful in predicting stroke. In the following sections, we will assess the role of these principal components in stroke prediction.
Fig. 7(a) shows the biplot of the EHR records; each feature is represented as a vector, and its length indicates its importance. We observe that the average glucose level and heart disease are correlated with each other, and that age has the biggest contribution to the first two principal components. Some feature vectors point opposite to each other, indicating that they provide similar information but with diverging characteristics. We also compute the biplot for the EHR records on the balanced dataset, shown in Fig. 7(b). We observe the same correlations there, and the orientation of the different feature vectors in the two-dimensional subspace is the same as for the unbalanced dataset.
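The scree-plot criterion of retaining enough leading components to reach a target share of the total variance can be sketched as below. The eigenvalues here are hypothetical, not the values from our dataset.

```python
def components_for_variance(eigenvalues, target=0.80):
    """Smallest number of leading components whose cumulative share of the
    total variance reaches `target` (the usual scree-plot reading)."""
    total = sum(eigenvalues)
    running = 0.0
    for k, ev in enumerate(sorted(eigenvalues, reverse=True), start=1):
        running += ev / total
        if running >= target:
            return k
    return len(eigenvalues)

# Hypothetical eigenvalues for a 10-feature space (not our dataset's values).
evs = [2.0, 1.4, 1.2, 1.1, 1.0, 0.9, 0.8, 0.7, 0.5, 0.4]
print(components_for_variance(evs, 0.80))  # prints 7
```

When the eigenvalues are close to each other, as in our dataset, many components are needed, which is why dimensionality reduction yields little benefit here.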

Relation of Principal Components with individual records
Finally, we also check the relation of individual patient record observations to the first two principal components. In Fig. 8, each dot represents a patient record observation. We also compute the subspace representation of the different features for the balanced EHR dataset, shown in Fig. 8(c) and (d). We observe that the observations with stroke and without stroke are scattered throughout the two principal axes, and observations with similar labels are not clustered together.
Therefore, the features of the EHR records are important for efficient stroke prediction.

Discussion
This section discussed how principal component analysis can assist in a clear understanding of the original feature space of patient records. We showed that the first two principal components cumulatively capture only 31.4% of the total variance in the input feature space; even the first eight components explain only about 88% of the total variance. Moreover, we studied the contribution of different patient attributes to the first two principal components.
We could see that the two components do not represent the health records data perfectly. We also looked at the contribution of the first two principal components in the representation of individual health records. We could see that some health records can be represented by the two components, but some cannot. Thus, all principal components are needed to have a good representation of the variance in medical records. We cannot get a significant reduction of feature space for predictive modelling without a significant loss of variance in the data.
Hence, we use all the principal components for predictive modelling of stroke occurrence.
In the next section, we compare state-of-the-art machine learning classification techniques for predicting the occurrence of stroke from a patient's medical record. As discussed, we use ten patient attributes as input features to the models.

Stroke Prediction
In this section, we provide a detailed analysis of various benchmarking algorithms for stroke prediction. We benchmark three popular classification approaches for predicting stroke from patient attributes: neural network (NN), decision tree (DT) and random forest (RF). The decision tree model is one of the most popular binary classification algorithms. This method involves building a tree-like decision process with several condition tests, and then applying the tree to the medical record dataset. Each internal node in this tree represents a test, the branches correspond to the outcomes of the test, and the leaf nodes represent the class labels. The pruning ability of this algorithm makes it flexible and accurate, which is desirable in medical diagnosis. We also benchmark the dataset with the random forest approach. The flexibility and ease of use of the random forest algorithm, coupled with its consistency in producing good results even with minimal tuning of the hyper-parameters, makes it valuable in this application. The possibility of over-fitting is limited by the number of trees in the forest. Moreover, random forest can also provide adequate indicators of the significance it assigns to each input variable. Finally, we benchmark the performance of a 2-layer shallow neural network; artificial neural networks are popular and offer competitive results.
We implement the feed-forward multi-layer perceptron model using the nnet R package.

Benchmarking using all features
Our dataset contains a total of 29072 medical records. Of these, only 548 records belong to patients with a stroke condition, and the remaining 28524 records have no stroke condition. This highly unbalanced dataset creates problems when used directly for training machine-learning models. Therefore, we use a random downsampling technique to mitigate this class imbalance; the results are reported in Table 4. We use the same train and test split for the CNN training and testing procedure, where the ten input features are reshaped into 1 × 2 × 5 inputs.
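The random downsampling step can be sketched as follows. This is an illustrative Python version on a toy dataset; in our experiments the sub-sampling is repeated 100 times with different random draws.

```python
import random

def undersample(records, labels, seed=42):
    """Balance a binary dataset by randomly downsampling the majority class
    to the size of the minority class."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    rng = random.Random(seed)
    kept = minority + rng.sample(majority, len(minority))
    rng.shuffle(kept)
    return [records[i] for i in kept], [labels[i] for i in kept]

# Toy example: 2 stroke cases among 8 records, mimicking the class imbalance.
X = [[a] for a in range(8)]
y = [0, 0, 1, 0, 0, 0, 1, 0]
Xb, yb = undersample(X, y)
print(sum(yb), len(yb))  # prints: 2 4  (balanced: 2 positives out of 4)
```

All minority-class (stroke) records are kept, while only a random subset of the majority class survives, so every repetition of the experiment sees a different balanced sample.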
We also calculate the accuracy variance for the benchmarking methods. Table 3 shows the accuracy variance for NN, SVM, LASSO and ElasticNet.

Benchmarking using top four features
Here we present results for stroke prediction when all the features are used and when only 4 features (A, HD, AG and HT) are used, selected based on our earlier discussions. In addition to the raw features, we also show results for stroke prediction when principal components are used as the input.
Since we observed that almost 8 principal components are needed to explain a variance of greater than 80%, we present results both when only the first 2 principal components are used and when all the components are used.
Table 2: Performance evaluation of neural network, decision tree and random forest on our dataset of electronic medical records. We compare their performance for three different cases: (a) using all the original features, (b) using the PCA-transformed data of the first two principal components, and (c) using the PCA-transformed data of the first eight components. We choose 8 components, as they are necessary to cumulatively contain more than 80% of the explained variance. We report the average value of precision, recall, F-score, accuracy, miss rate and fall-out rate, based on 100 experiments.
Fig. 9 illustrates the spread of the results. We observe that most of the methods have similar performance, with their means overlapping around similar values. We also compute the variance of the accuracies of the benchmarking methods over the 100 random sub-sampling runs. The variances of the decision tree, neural network and random forest are 0.00073, 0.00049, and 0.00061 respectively, indicating that no sampling bias is involved in the benchmarking results.
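The evaluation metrics reported above can all be derived from the binary confusion matrix, as sketched below. This is an illustrative Python version with the stroke class as the positive label; division-by-zero guards are omitted for brevity.

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall, F-score, accuracy, miss rate and fall-out computed
    from a binary confusion matrix (stroke = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # sensitivity
    return {
        "precision": precision,
        "recall": recall,
        "f_score": 2 * precision * recall / (precision + recall),
        "accuracy": (tp + tn) / len(y_true),
        "miss_rate": fn / (fn + tp),   # 1 - recall: missed stroke cases
        "fall_out": fp / (fp + tn),    # false positive rate
    }

m = binary_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])
print(m["miss_rate"])  # 1 of 3 positives missed -> 0.333...
```

The miss rate is the clinically critical quantity here, since a missed stroke case is far more costly than a false alarm.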

Conclusion and Future Work
In this paper, we presented a detailed analysis of patients' attributes in electronic health records for stroke prediction. We systematically analysed the different features, performing a feature correlation analysis and a step-wise analysis for choosing an optimum set of features. We found that the different features are not well-correlated, and that a combination of only 4 features (A, HD, HT and AG) makes a good contribution towards stroke prediction. Additionally, we performed principal component analysis. The analysis showed that almost all principal components are needed to explain a high proportion of the variance. The variable loadings, however, showed that the first principal component, which has the highest variance, might explain the underlying phenomenon of stroke prediction.
Finally, three machine learning algorithms were implemented on a set of different feature and principal component configurations. We found that the neural network works best with the feature combination of A, HD, HT and AG.
The accuracy and miss rate for this combination are 78% and 19% respectively.
We have seen promising results from using just 4 features. The accuracy of the perceptron model cannot be improved further for primarily two reasons: the lack of an additional discriminatory feature set, and the lack of a larger dataset.
We observed that most of the remaining features in the EHR dataset do not add additional discriminatory information to the selected feature set. Furthermore, a larger dataset would enable us to train deep neural networks more efficiently; we plan to collect institutional data for further benchmarking of these machine learning methods in future work. The systematic analysis of the different features in electronic health records will assist clinicians in effective archival of the records: instead of recording and storing all the features, the data management team can archive only those that are essential for stroke prediction. In future, we also plan to integrate the electronic records dataset with background knowledge on different diseases and drugs using Semantic Web technologies [22,23]. Knowledge graph technologies [24,23] can be used to publish the electronic health records in an interoperable manner to the research community, and the added background knowledge from other datasets can possibly improve the accuracy of stroke prediction models as well. We also plan to perform an external validation of our proposed method as part of our upcoming work.