Feature selection and transformation by machine learning reduce variable numbers and improve prediction for heart failure readmission or death

Background The prediction of readmission or death after a hospital discharge for heart failure (HF) remains a major challenge. Modern healthcare systems, electronic health records, and machine learning (ML) techniques allow us to mine data to select the most significant variables (allowing for a reduction in the number of variables) without compromising the performance of models used to predict readmission and death. Moreover, ML methods based on the transformation of variables may potentially further improve performance. Objective To use ML techniques to select the most relevant variables, and to transform variables, for the prediction of 30-day readmission or death in HF patients. Methods We identified all Western Australian patients aged 65 years and above admitted for HF between 2003–2008 in linked administrative data. We evaluated variables associated with HF readmission or death using standard statistical and ML-based selection techniques. We also tested the new variables produced by transformation of the original variables. We developed multi-layer perceptron prediction models and compared their predictive performance using metrics such as the Area Under the receiver operating characteristic Curve (AUC), sensitivity, and specificity. Results Following hospital discharge, the proportion of 30-day readmissions or deaths was 23.7% in our cohort of 10,757 HF patients. The prediction model we developed using a smaller set of variables (n = 8) had comparable performance (AUC 0.62) to the traditional model (n = 47, AUC 0.62). Transformation of the original 47 variables further improved (p<0.001) the performance of the predictive model (AUC 0.66). Conclusions A small set of variables selected using ML matched the performance of the model that used the full set of 47 variables for predicting 30-day readmission or death in HF patients. Model performance can be further significantly improved by transforming the original variables using ML methods.


Introduction
Heart Failure (HF) is a prevalent cardiovascular disorder affecting more than 25 million people worldwide [1]. HF is also associated with a high rate of readmissions incurring significant economic costs, and driving healthcare policies to include financial penalties for hospitals that have high rates of readmissions for HF [2]. The adverse financial implications have served as a motivation to develop models that can accurately predict readmissions at the time of an index hospital discharge. These models have traditionally relied on patients' historical data to predict the probability of a HF readmission [3].
The enormous amount of clinical and administrative data generated in the healthcare sector can be used to personalise care, improve the quality of treatment and reduce treatment costs. We and others have previously applied machine-learning (ML) techniques to administrative data to predict HF readmissions [4,5]. These techniques use patient data to learn hidden patterns that significantly contribute to the outcomes. The predictive ability of the model depends on the variables selected. Since some variables may be redundant, a prediction model developed after selecting only the significant variables associated with the outcome is expected to reduce the machine-training time and improve the predictive performance [6]. Moreover, such a prediction model will allow policy makers and healthcare workers to focus only on the minimum number of variables required to predict the outcome. Lastly, ML methods allow feature extraction, a process in which the original variables are transformed to create new variables that could provide superior performance in predicting outcomes compared to a model developed on the original variables [7].
One of our previous studies investigated the ability of ML models to predict HF readmission and death [5]. That study showed that a large amount of medical data could be coupled with ML techniques to predict HF readmission and death with superior predictive performance compared with standard regression methods. However, such prediction models are complex, and simpler models that predict with similar accuracy might be produced through feature selection algorithms. The objective of the present study was therefore to use ML techniques to determine the most clinically significant variables in administrative datasets that are associated with 30-day readmission or death in HF patients, and to use them to develop simpler prediction models. We also evaluated whether variables transformed by feature extraction techniques could improve the performance of the developed ML models.

Ethics
This study was approved by the Human Research Ethics Committees of the Western Australian Department of Health (2014/11); the Australian Department of Health (XJ-16); and the University of Western Australia (RA/4/1/8065). We were granted a waiver of informed consent by each ethics committee. The linked administrative data received were fully anonymized, and the authors did not have access to any identifying data.

Cohort and data sources
We used linked administrative data from the Hospital Morbidity Data Collection (HMDC) and the death register, which are two of the core datasets of the Western Australian Data Linkage System [8]. This is a dynamic linkage system based on probabilistic matching of records from multiple datasets with clerical review and quality control. The Western Australian Department of Health maintains these databases, with data supplied by all public and private hospitals in the state under obligatory reporting agreements. These were linked to data from the Pharmaceutical Benefits Scheme (PBS) for information on drug usage, and from the Medicare Benefits Schedule (MBS) for information on out-of-hospital services such as primary care visits. Details of the datasets and study cohort have been published previously [9].
We identified Western Australian patients aged 65 years and above who were hospitalised with a principal discharge diagnosis of HF (International Statistical Classification of Diseases and Related Health Problems, 10th Revision, Australian Modification (ICD-10-AM) code I50) in 2003–2008 (index HF admission). We identified comorbidities and HF history by looking back 20 years from the date of the index HF admission, and the outcomes of HF readmission or all-cause death by looking forward from the date of hospital discharge. We also identified the use of medications in the 6 months prior to the index HF admission from the linked PBS data, and out-of-hospital services from the linked MBS data.

Collection of features (variables) for our cohort
We used all 47 available features related to patient demographics, admission characteristics, medical history, socio-economic status, medication history, out-of-hospital healthcare services, and emergency inpatient admissions (S1 Table). We extracted these features from the linked HMDC, PBS, and MBS datasets.

Feature (variable) selection
Feature selection is the process of choosing a subset of features from a set of original features based on a specific selection criterion [10]. The main advantages of feature selection are: 1) reduction in the computational time of the algorithm, 2) improvement in predictive performance, 3) identification of relevant features, 4) improved data quality, and 5) saving resources in subsequent phases of data collection.
Feature selection using the t-test. The outcome of interest was binary with two values: (i) 30-day HF readmission or death, and (ii) 30-day survival with no HF readmission. For a binary outcome, a significant difference in the values of a continuous input variable for each outcome value shows that the variable has the ability to distinguish between the two outcome values. We used the t-test to calculate the two-sided p value for the difference in means at the 5% level of significance. We retained those features with a two-sided p<0.05.
Feature selection using the chi-squared test. The chi-squared test determines whether there is a significant association between two categorical variables. We calculated the two-sided chi-squared p value to test the association between each categorical feature and the binary outcome at the 5% level of significance. We retained those features with a two-sided p<0.05.
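As a minimal sketch of the univariate filtering described above, the following applies a two-sided t-test to a continuous variable and a chi-squared test to a binary variable, retaining those with p<0.05. The data here are synthetic stand-ins (variable names such as age and HF history are illustrative, not the study dataset):

```python
# Univariate feature filtering, sketched on synthetic stand-in data.
import numpy as np
from scipy.stats import ttest_ind, chi2_contingency

rng = np.random.default_rng(0)
n = 1000
outcome = rng.integers(0, 2, n)  # 1 = 30-day readmission/death

# Continuous variable (e.g. age): two-sided t-test between outcome groups.
age = rng.normal(82, 7.6, n) + 2 * outcome  # shifted so the test can detect it
t_stat, p_t = ttest_ind(age[outcome == 1], age[outcome == 0])
keep_age = p_t < 0.05

# Binary variable (e.g. history of HF): chi-squared test on the 2x2 table.
hf_history = (rng.random(n) < 0.3 + 0.2 * outcome).astype(int)
table = np.array([[np.sum((hf_history == a) & (outcome == b))
                   for b in (0, 1)] for a in (0, 1)])
chi2, p_chi2, dof, _ = chi2_contingency(table)
keep_hf = p_chi2 < 0.05

print(f"t-test p={p_t:.4g} keep={keep_age}; chi2 p={p_chi2:.4g} keep={keep_hf}")
```

In practice this test is run once per candidate feature, and only the features passing the threshold are carried forward to model development.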
Sequential forward selection. The sequential forward selection technique adds the next best feature to the feature subset on a greedy basis. A greedy procedure always chooses the feature/variable that seems to give the best performance at that moment. A predetermined criterion, such as mutual information or Pearson's correlation coefficient, is used to select the best feature; we used mutual information in this work. The selection process starts with an empty set and stops when a predefined threshold for the number of variables is reached [11]. In our work, the threshold was set to seven variables.
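The greedy forward procedure above can be sketched as follows, using mutual information as the criterion and the seven-variable stopping threshold from the text. The dataset and feature count are synthetic stand-ins, not the study data:

```python
# Sketch of sequential forward selection with a mutual-information criterion.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=500, n_features=12, n_informative=5,
                           random_state=0)

threshold = 7          # predefined number of variables to keep
selected, remaining = [], list(range(X.shape[1]))
while len(selected) < threshold:
    # Greedy step: score every candidate still outside the subset and add
    # the one with the highest mutual information with the outcome.
    scores = mutual_info_classif(X[:, remaining], y, random_state=0)
    best = remaining[int(np.argmax(scores))]
    selected.append(best)
    remaining.remove(best)

print("selected feature indices:", selected)
```

Backward selection follows the mirror-image logic: start from the full set and repeatedly drop the feature whose removal costs the least under the same criterion.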
Sequential backward selection. The sequential backward selection technique removes the least important feature from the subset, based on a fixed criterion, in every iteration. The selection process starts with the full set of features and stops when a predefined threshold for the number of variables is reached or further removal of features does not improve the performance [12]. We used mutual information as the selection criterion and set the selection threshold to seven variables. The performance was measured with the AUC.
The minimal redundancy maximal relevance feature selection. The mRMR technique ranks the variables based on their relevance to the target variable and the redundancy between the variables themselves [13]. The relevance is characterized by the Mutual Information (MI) between a variable v and the target variable c, given as I(v; c) = Σ_v Σ_c p(v, c) log[p(v, c) / (p(v) p(c))], where p(v, c) is the joint probabilistic density, and p(v) and p(c) denote the marginal probabilistic densities. The maximum relevance or dependence on the target variable is computed as D = (1/|S|) Σ_{v∈S} I(v; c), where S denotes the set of selected variables. The variable with the largest D is termed the most relevant variable, which reflects a high dependency on the target variable. Variables with maximum relevance to the target can also be redundant. When two variables are redundant, the discriminatory power of the data does not change if we remove one of the redundant variables. The mRMR approach selects the mutually exclusive variables by minimizing the redundancy, given as R = (1/|S|²) Σ_{v_i, v_j ∈ S} I(v_i; v_j). Both relevance and redundancy are combined to generate the mRMR criterion, given as max(D − R). The input variable with the highest value of this criterion is selected at each iteration to produce a subset of the most important variables from the dataset.
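The iterative max(D − R) criterion can be sketched as below: at each step the candidate maximizing (relevance to the outcome) minus (mean redundancy with the already-selected variables) is added. Discretised synthetic data are used for the MI estimates; the data, feature count, and the choice of eight selected variables are illustrative:

```python
# Sketch of greedy mRMR selection on discretised synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import mutual_info_score

X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=1)
Xd = (X > np.median(X, axis=0)).astype(int)  # discretise for MI estimation

n_select = 8
relevance = np.array([mutual_info_score(Xd[:, j], y) for j in range(Xd.shape[1])])
selected = [int(np.argmax(relevance))]       # start with the most relevant
while len(selected) < n_select:
    best, best_score = None, -np.inf
    for j in range(Xd.shape[1]):
        if j in selected:
            continue
        # R: average mutual information with the variables already chosen.
        redundancy = np.mean([mutual_info_score(Xd[:, j], Xd[:, s])
                              for s in selected])
        score = relevance[j] - redundancy     # the max(D - R) criterion
        if score > best_score:
            best, best_score = j, score
    selected.append(best)

print("mRMR-selected feature indices:", selected)
```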

Feature (variable) extraction
This is the process of creating a new, smaller set of variables with the aim of capturing the most useful information present in the original variables to predict the outcome. The new variables are produced by applying a transformation to the original variables. The transformed variables represent projections of the original variables onto a new variable space, where the distinct outcome groups have better separation compared to the original variable space.
Principal Component Analysis. Principal Component Analysis (PCA) is a popular feature extraction method that creates a linear transformation of the input variables [14]. The new variables, called the principal components, are the projections of the original variables onto a new variable space. The dimensionality of the new variables can be reduced by sorting them based on their eigenvalues and selecting the variables with larger variance [15,16]. A common choice is to select the first m components such that the variance of the m components is 95% of the total variance present in the data [17]. For PCA, we first normalised the data to zero mean. We then computed the covariance matrix, which measures the joint variability across the variables in a dataset; the covariance matrix contains covariance scores for every variable with every other variable, including itself. The next step was to find the eigenvectors of the covariance matrix. Finally, the original data were multiplied by the eigenvectors to produce the transformed data.
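The PCA steps above (centre the data, form the covariance matrix, take its eigenvectors, project) can be sketched with NumPy on illustrative data, keeping the first m components that explain 95% of the variance:

```python
# PCA via the covariance-matrix eigendecomposition, on illustrative data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6))  # correlated data

Xc = X - X.mean(axis=0)                  # 1) normalise to zero mean
cov = np.cov(Xc, rowvar=False)           # 2) covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # 3) eigen-decomposition
order = np.argsort(eigvals)[::-1]        # sort components by variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4) keep the first m components explaining >= 95% of the total variance
explained = np.cumsum(eigvals) / eigvals.sum()
m = int(np.searchsorted(explained, 0.95)) + 1

transformed = Xc @ eigvecs[:, :m]        # project onto the new space
print(f"kept {m} of {X.shape[1]} components, shape {transformed.shape}")
```

In the study itself this transformation was performed with scikit-learn's PCA implementation, which follows the same steps internally.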

Machine learning model
We divided our cohort into three sets of data (70% for training, 15% for validation, and 15% for testing on unseen data) to build a generalized ML model for 30-day HF readmission or death. We tested the model performance using three measures: AUC, sensitivity, and specificity. We used an MLP-based model [18] to assess the performance of the feature selection and feature extraction techniques for HF readmission or death, having previously shown that the MLP model performs best with these datasets. This model contained an input layer, one hidden layer of 50 nodes, and an output layer of one node. The hyper-parameters were chosen by trial and error. We chose a rectified linear unit as the non-linear transformation in the hidden layer and a sigmoid function for the output layer. The number of nodes (n = 50) in the hidden layer was empirically chosen to yield the best performance. We tested a wide range of hyper-parameters for the feature selection and feature extraction techniques, and present the results of the configuration that gives the best performance.
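The study's model was built in Keras; as a rough, runnable sketch of the same setup, scikit-learn's MLPClassifier can stand in (one hidden layer of 50 ReLU units, logistic output) with a 70/15/15 split. The data, sample sizes, and class balance below are synthetic stand-ins, not the study cohort:

```python
# Sketch: 70/15/15 split and a 50-unit single-hidden-layer MLP, evaluated
# by AUC on the held-out test set. Synthetic stand-in data throughout.
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=47, weights=[0.763],
                           random_state=0)  # ~23.7% positive, as in the cohort

# 70% train, then split the remaining 30% evenly: 15% validation, 15% test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(50,), activation='relu',
                    max_iter=500, random_state=0)
mlp.fit(X_train, y_train)

auc = roc_auc_score(y_test, mlp.predict_proba(X_test)[:, 1])
print(f"test AUC: {auc:.3f}")
```

The validation split (X_val, y_val) would be used for hyper-parameter tuning, as described above, before the final evaluation on the test split.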

Software packages
The programming code used in this work was written in the Python programming environment v3.6. The MLP model was implemented using the Keras library v2.1.5 with a TensorFlow backend v1.8.0. We used the scikit-learn library v0.19.1 to perform the PCA. The statistical comparison of AUCs was performed in SAS software v9.4.

Results
Table 1 shows the patient characteristics of demographics, socio-economic indicators, medication history, deaths, medical history, and out-of-hospital services for the study cohort of 10,757 patients. The proportion of 30-day HF readmission or death was 23.7% (n = 2546). The mean age was 82 years (Standard Deviation (SD) 7.6), and the average length of hospital stay during the index HF admission was 11.7 days (SD 26.7). Common comorbidities included diabetes (30%), hypertension (67%), atrial fibrillation (42%), chronic kidney disease (26%), ischaemic heart disease (55%), and chronic obstructive pulmonary disease (28%). About half of the patients (46%) had at least one emergency admission during the six months prior to the index HF admission. The mean Charlson comorbidity index was 4.3 (SD 3.0).
Variables selected by the statistical and ML techniques for the prediction of 30-day HF readmission or death are listed in Tables 2 and 3, respectively.
The most discriminatory continuous variables identified by the t-test in the statistical model (p<0.05) included age, length of hospital stay, time (days) since the last HF discharge, and the Charlson comorbidity index (Table 2). The chi-squared test identified the most prominent binary variables associated with the outcome. These included a history of HF, dementia and chronic kidney disease, an emergency admission in the 6 months prior to the index HF admission, an out-of-hospital visit to a specialist, use of alimentary tract and metabolism drugs and nervous system drugs in the 6 months prior to the index HF admission, and the type of index HF admission (emergency or booked).
Variables selected by the forward and backward selection approaches in the ML model (Table 3) included age, type of index HF admission, time (days) since last HF discharge, history of HF, visits to allied health professionals, use of beta blockers or RASI/ARBs in the last 6 months, use of antineoplastic and immunomodulating drugs, use of systemic hormonal drugs, use of nervous system drugs, emergency inpatient admission in last 6 months and comorbidities of cancer, stroke, dementia, and shock. The variables which have maximum relevance to the outcome and minimum redundancy between the variables are: age, type of index HF admission, visit to allied health professional, length of hospital stay, at least two supplies of antineoplastic and immunomodulating drugs in the last 6 months, history of chronic kidney disease, depression, and HF.

Machine-learning methods
The performance of the multi-layer perceptron (MLP) based prediction model using variables selected by the different techniques is presented in Table 4. The initial prediction model with all of the available variables (n = 47) had an Area Under the receiver operating characteristic Curve (AUC) of 0.62, with a sensitivity of 48.4% and a specificity of 70.0%. In contrast, the statistical variable selection approaches produced an AUC in the range 0.58–0.60, with sensitivity ranging from 41.4% to 47.3% and specificity ranging from 62.5% to 71.4%. The prediction model based on the eight most relevant variables selected by the minimal Redundancy Maximal Relevance (mRMR) technique achieved an AUC of 0.62, matching the performance of the model that used all 47 variables.

Discussion
In this study, we investigated different variable selection techniques to predict 30-day HF readmission or death in HF patients using linked administrative health data. We achieved matching performance using only eight significant variables compared with the full set of 47 variables.
In addition, we evaluated the use of a feature extraction ML technique, PCA, which generated new combinations of the original variables/features that significantly improved the performance of the model.
The LACE score is a tool that was developed to predict 30-day unplanned readmissions from administrative data [20]. This score uses Length of hospital stay (L), Acuity of admission [A (emergency or not)], Comorbidity score (C), and the number of Emergency visits in the last 6 months (E) to predict the risk of readmission within 30 days of hospital discharge. The LACE score is not restricted to a particular disease and is used as a generic tool to predict unplanned readmissions for various health problems [21,22]. However, when applied to HF patients, the LACE score was not able to predict 30-day readmissions (p = 0.199) [23]. Another tool to determine the risk of avoidable hospital readmissions is the HOSPITAL score [24]. This tool uses laboratory values such as haemoglobin and sodium level at discharge, and other patient attributes such as length of stay in hospital, number of hospital admissions in the previous year, and admission type to determine the risk of readmission. These clinical variables are often not integrated electronically in administrative health databases, thereby limiting the use of the HOSPITAL score to predict readmissions from routinely collected administrative data.
Several other ML models have been proposed to predict 30-day readmission or death for HF patients. Koulaouzidis et al. used the naive Bayes classifier on telemonitored data to predict HF readmissions [25]. They manually tested different combinations of a limited set of variables, such as heart rate, systolic blood pressure, diastolic blood pressure, and body weight, to develop their model. However, this manual approach only works with a small number of variables, because the number of possible combinations grows exponentially when there are hundreds of input variables. Mortazavi et al. used more sophisticated ML algorithms, such as random forests and support vector machines, to predict 30-day and 180-day (all-cause and HF-specific) readmissions [26]. As the aim of their study was to compare the effectiveness of ML techniques against traditional logistic regression, they used all of the available variables to develop their prediction model, without evaluating the discriminatory power of their input variables. Futoma et al. used a neural network to predict HF readmissions, given the diagnoses and procedures of each hospital admission [27]. A portion of this work implemented a two-step variable selection process. The first step retained only those variables that passed a likelihood ratio test. The second step involved multivariate selection in a stepwise forward manner, with the variables that passed the first step considered in random order. This random ordering is a limitation, since it can potentially overlook a better ordering and thus harm the predictive performance. Our previous study showed the ability of ML techniques to predict HF readmission and death compared with the standard statistical approaches used clinically [5]. Our current study extended this work by exploring the reduction and transformation of the original variables to determine the performance of the reduced and transformed feature sets in predicting HF readmission and death.
The standard models to predict HF readmissions use variables that are carefully selected by clinical experts. In our study, we devised a way to determine which variables have the most significant effect on the prediction of 30-day HF readmission or death, using selection techniques from both the statistical and ML domains. We were able to reduce the number of variables from the original 47 to 8 without compromising predictive accuracy. The prediction model based on these 8 variables consisted of age, type of index admission, visit to an allied health professional in the last 6 months, length of hospital stay, use of antineoplastic and immunomodulating agents in the last 6 months, and history of HF, chronic kidney disease and depression. Its AUC of 0.62 was the same as that of the model based on all 47 variables.
Variable extraction techniques allow us to determine whether a combination of the original features can generate new features that are more discriminating of the outcome than the original features. We used the PCA technique to transform the original variables into a new space based on the maximum variance in the data. This further significantly improved the AUC (from 0.61 to 0.66) when the transformed variables were used for prediction. The addition of new variables improved the predictive performance because successive variables add to the variance of the data, which helps in separating the two outcome values. This improvement was statistically significant (p<0.001) compared with the model that used the set of original variables.

Conclusions
An ML model developed to predict 30-day HF readmission or death in HF patients, using a reduced number of variables chosen by feature selection techniques, matched the performance of the model that used the full set of 47 variables. We also demonstrated that new variables generated by transforming the original variables, based on the variance in the data, further significantly improved the predictive ability of our ML model.

Limitations
We limited our cohort to age 65 years and above to ensure that we captured all medication supplies for this patient group from the Pharmaceutical Benefits Scheme (PBS) dataset. However, this did not affect the capture of HF patients because the majority of HF admissions occur in patients aged 65 years and older, and they represent the more advanced forms of HF [28]. Our administrative data did not include specific clinical data such as laboratory values and blood pressure, which may have further improved the model performance. However, we have demonstrated the improvements that are possible when routinely collected administrative data are used to predict 30-day readmission and death.