Analysis of length of hospital stay using electronic health records: A statistical and data mining approach

Background The length of stay (LOS) is an important indicator of the efficiency of hospital management. Reduction in the number of inpatient days results in decreased risk of infection and medication side effects, improvement in the quality of treatment, and increased hospital profit with more efficient bed management. The purpose of this study was to determine which factors are associated with length of hospital stay, based on electronic health records, in order to manage hospital stay more efficiently. Materials and methods Research subjects were retrieved from a database of patients admitted to a tertiary general university hospital in South Korea between January and December 2013. Patients were analyzed according to the following three categories: descriptive and exploratory analysis, process pattern analysis using process mining techniques, and statistical analysis and prediction of LOS. Results Overall, 55% (25,228) of inpatients were discharged within 4 days. The department of rehabilitation medicine (RH) had the highest average LOS at 15.9 days. Of all the conditions diagnosed over 250 times, diagnoses of I63.8 (cerebral infarction, middle cerebral artery), I63.9 (infarction of middle cerebral artery territory) and I21.9 (myocardial infarction) were associated with the longest average hospital stay and high standard deviation. Patients with these conditions were also more likely to be transferred to the RH department for rehabilitation. A range of variables, such as transfer, discharge delay time, operation frequency, frequency of diagnosis, severity, bed grade, and insurance type was significantly correlated with the LOS. Conclusions Accurate understanding of the factors associating with the LOS and progressive improvements in processing and monitoring may allow more efficient management of the LOS of inpatients.


Introduction
Length of hospital stay (LOS) is an important indicator of the use of medical services that is used to assess the efficiency of hospital management, patient quality of care, and functional evaluation. Decreased LOS has been associated with decreased risks of opportunistic infections and side effects of medication, and with improvements in treatment outcome and lower mortality rates. Furthermore, shorter hospital stays reduce the burden of medical fees and increase the bed turnover rate, which in turn increases the profit margin of hospitals, while lowering the overall social costs. [1,2] Previous studies have examined effective management of LOS. Majority of these involved subjects stratified by condition or admitting unit, for example, patients admitted to specialized departments, such as psychiatric wards [3] or the intensive care unit (ICU) [4,5]; those with hip fractures [6] or undergoing coronary artery surgery [7]; or, those admitted with a specific diagnosis, such as heart failure [1,8] or pulmonary disease [9].
LOS among patients with the same disease or undergoing the same type of surgical intervention may vary owing to complex factors related to the individual or due to different process flows within different organizations or divergences in medical practice. For these reason [10], in order to understand which factors are associated with LOS, all activities within the overall patient admission process flow should be analyzed from different perspectives.
In this study, electronic health records (EHR) data and process mining technology were used to analyze all event logs entered between admission and discharge of the patient. This allowed us to scrutinize issues regarding hospital processes that affect the actual LOS, as well as other related factors. The aim of this study was to determine a methodology that could be applied to help hospitals manage the duration of inpatient stay more efficiently.

Data and preprocessing
Log data recorded between January and December 2013 were extracted from the EHR of a tertiary general hospital to analyze factors correlating with length of hospital stay. The subjects were patients admitted (and discharged) in 2013. The extracted event log is shown in Table 1.
For the full year of 2013, we have collected 53,965 subjects except for 745 and 1,029 subjects who were in hospital at the first and last day of the year, respectively. Also, there was a lack of a discharge date for two subjects, and 8,295 patients were received the day surgery which does not have to be hospitalized. They were also removed from the set of target subjects being analyzed. In a nutshell, out of a total of 53,965 subjects, 8,419 subjects were excluded due to repeat admission for unexpected events (122), lack of a discharge date (2), and day surgery (8,295). Finally, data from 45,546 subjects were analyzed. For accurate data analysis, the following data were excluded: data presumed to have been wrongly entered, such as transfer note dates recorded before the admission date or after the discharge date; transfer completion dates recorded before the admission date; and procedures performed beyond the extracted date.
The present study was approved (IRB No. B-1409/268-107) by the Institutional Review Board of the Seoul National University Bundang Hospital, which waived patients' informed consent. All EHR data provided to the researchers for this study were de-identified.

Analysis methods
The methods of analysis used in the study are shown in Table 2 and categorized as: descriptive and exploratory analysis, process pattern analysis using process mining techniques, and statistical analysis and prediction for LOS. Descriptive and exploratory analysis seeks to understand from multiple angles, the current circumstances surrounding the hospital LOS. We included three detailed analysis items in this analysis category: performance analysis for LOS, LOS analysis according to diagnosis, and analysis for long-term hospitalization.
First, performance analysis for LOS uses the basis performance analysis technique [11] in process mining and measures basic statistics regarding the days spent in hospital. This

Event type Attribute
Patient Information included the analysis of the total distribution of hospitalization days, as well as the correlation between the number of patients in each department and the number of hospitalized days. Secondly, LOS analysis according to diagnosis using Z-scores [12] to analyze differences in LOS by diagnosis. We postulated that analyzing the absolute LOS for all diagnoses was of limited value because there are considerable differences in the required LOS according to the specific diagnoses. In order to overcome this limitation, the relative LOS was measured and compared, by deriving the standard score of each diagnosis based on the mean and standard deviation of the LOS per department. This is demonstrated as follows(1) [12]: Analysis of long-term hospitalization aimed to determine the characteristics of long-term hospitalized patients. For this purpose, we used statistical analyses to observe patient clustering and applied performance analysis to each group, aiming to identify differences between the groups.
The second analysis category was process pattern analysis using process mining techniques to analyze the association between the transfer pattern and the number of hospitalization days. We used the pattern analysis technique [13] of process mining to extract the transfer pattern, and this technique offers performance information, such as the frequency of each pattern, the average time required, and the median time required.
The third analysis category is statistical analysis and LOS prediction identifying major factors correlating with LOS and predicting LOS based on the identified factors. For this purpose, statistical analysis techniques [14] included the Student's t-test and Analysis of Variance (ANOVA), as well as machine running techniques [15], such as regression analysis.
Our approach has been analyzed with the ProM framework [16] and python open source scientific tools, Scikit-learn [17]. The ProM framework is the process mining open source software that provides plenty of analyzing functionalities. We have employed the basic performance analysis and the pattern analysis technique from the ProM framework. As far as the statistical analyses were concerned, we have utilized the Scikit-learn python module that provides a wide range of machine learning techniques.

The LOS analysis framework
In combining all the items above, using the EHR log data, we performed data pre-processing and analyzed data according to LOS. We also developed a LOS analysis framework which includes the overall flow of a model that can predict patient LOS. The LOS analysis framework is visually presented in Fig 1. In the data preparation phase, we extracted the EHR log data. The data preprocessing process improved data quality in order to extract meaningful analysis results. In the data analysis phase, four types of analysis were performed: LOS performance analysis, LOS analysis of transfer patterns, LOS analysis according to diagnosis, and analysis of long-term hospitalization. The results of the analysis were meaningful as they revealed the association between LOS and other data items, and it also helped to understand the key factors correlating with duration of hospital stay at the prediction stage. At the prediction phase, the main factors correlating with the number of days of stay were identified through data analysis and log-based statistical analysis. A model was then developed to estimate the number of days of hospitalization based on the derived factors.

Performance analysis for LOS
Examining the data from 2013, the hospitalized patients were averagely discharged around 7 days, and the range of the length of hospital stay was quite extensive (i.e., interquartile range: 2.0-8.0). The details for LOS-related values are provided in Table 3. Also, as far as the distribution of the LOS was concerned, approximately 55% (25,228) of hospitalized patients were discharged within 4 days, and out of these patients, approximately 20% (8,969) were left the hospital on the second day of hospitalization.
Furthermore, a granular analysis on the length of stay was carried out by departments. Fig 2  presents the boxplot of the length of hospital stay for each department. Unexpected records, i.e., outliers, were removed in the graph for the effective comparative analysis.
The median length of hospital stay was 14 days in rehabilitation medicine; 10 days for neuropsychiatry; 9 days for geriatric center admissions; and 8 days for internal medicine, infectious diseases. Also, the IQR of hospital stay was 11.50 (i.e., 5.0-16.50) days for neuropsychiatry; and 10 (i.e., 4.0-14.0) days for internal medicine, infectious diseases.
Based on the analysis of average and interquartile range (i.e., IQR) of the LOS within each department, patients were divided into three groups. Note that IQR signifies the statistical dispersion of the distribution. Fig 3 shows the results of the analysis according to the average and IQR of LOS per department.
Based on the analysis for the average and IQR of LOS, we identified that there was a positive correlation between two measures. As it were, the average of LOS was higher, and the overall    First, Group A was the group with the relatively lower IQR and average LOS than other departments. The group included those being treated under radiology (DR), ophthalmology (OT), or obstetrics and gynecology (OG) among others. These departments within the group A were seen to be doing well in keeping their patients with the short and low dispersed stay. Therefore, it was judged to be a group with a considerably low need for improvement. Group B included relatively higher IQR and the average of LOS than other departments. Clinical neuroscience center (CNSC), internal medicine nephrology (IMN), and internal medicine allergy (IMA) exhibited this trait and were noted as the departments to be improved their inpatient management. It was identified that these departments had the average LOS close to the average of the whole departments, i.e., 6.01 days. However, some patients had remarkably higher LOS than others; thus, it resulted in the slightly high IQR value. Therefore, we concluded that there is a need to systematically manage the medical care process of the specific patients. Finally, Group C was characterized by significantly higher average and IQR of the length of hospital stay. Rehabilitation medicine (RH), neuropsychiatry (NP), internal medicine infectious disease (IMI), Geriatric Center (GC) were included in Group C, and detailed analysis of patient characteristics was required to identify issues that may cause prolonged LOS.

LOS in accordance with diagnosis
Diagnosis was a major factor correlating with the number of days of care [18]. LOS is determined by different variables and depends on specific diagnosis. The average LOS for each ICD-10 diagnosis issued by each department was converted to a Z-score, and then analyzed. An ICD-10 code consists of the first three characters for designating diagnosis category, the next three characters (characters three through six) for representing further details including the related etiology, anatomic site, or severity, and the seventh character for expansion. Among high frequent diagnoses which were registered over 250 times, diagnoses of I63.8 (cerebral infarction, middle cerebral artery) (Mean: 13.42 and IQR: 5-17) and I63.9(infarction of middle cerebral artery territory) (Mean: 13.96 and IQR: 5-19.5) were associated with a higher mean and interquartile range for LOS.
All two diagnoses above involved infarction and detailed analysis by medical care given showed that the patients were often transferred to RH for rehabilitation therapy, and there were also cases where multiple surgical interventions were performed on a specific patient. It is believed that such cases may have played a role in affecting the average LOS and also its interquartile range. Table 4 shows the length of hospital stay for top 30 disease classification when all ICD-10 diagnosis codes were merged into the 3-digit level category. The diagnosis category of F30-F39 (Mood [affective] disorders appeared to be the highest mean and interquartile range for LOS (Mean: 13.53 and IQR: 6-17.5) with moderate frequencies.

Analysis for long-term hospitalization patients
Discharge of long-term inpatients is one of the main indicators actively managed by the hospital because shorter hospital stay is directly associated with an increase in hospital income, by increasing hospital turnover rate as well increasing the daily average cost of medical care. Usually, "long term inpatients" are defined as patients who have been hospitalized for over 30 days.
Patients were divided into three groups, according to LOS: A (under 7 days), B (7 days or more and under 30 days), and C (30 days or more). Approximately 3% (1,327people) of all patients were long-term inpatients in Group C. Table 5 shows that, compared to patients with shorter LOS, long-term inpatients included a significantly higher rate of surgical patients (54.26%), transferred patients (41.97%), and patients on antibiotic treatment (92.31%) as well as a greater number of surgical interventions (2.01 cases), antibiotics (116.55 cases), and procedures (385.99 cases) per patient, and a greater number of treatments per person per day (7.96 cases).
With increased LOS, patients are exposed to a higher risk of infections and the use of broad-spectrum antibiotics increases accordingly. The use of broad spectrum antibiotics may lead to the development of resistance to drugs and other serious side effects. For this reason, a number of antibiotics are managed as Restricted Antibiotics and subject to limited prescription. In this study, the ratio of restricted antibiotics administered to 1,000 randomly selected patients (12.79%) was higher than that in group A (0.7%) or group B (2.99%) (P-value <0.001).

LOS analysis in terms of transfer patterns
Transfer was defined as a change of department required by the patient's condition and one of the factors associating with the processing of hospitalized patients.
According to the analysis of hospital days based on transfer pattern, it was found that out of all patients, 5.25% (2,392) those who have been transferred on average spent 17 more days stay   Table 6. Table 7 shows the results of the analysis between LOS and different hospitalization variables: time required for transfer, discharge delay, surgery frequency, diagnosis frequency, severity, bed grade, and insurance type. Patients requiring 2 or more days to transfer had a greater number of hospital days (Mean: 23.99 and IQR: 11.00-27.75) than patient with a LOS of under 2 days (Mean: 14.01 and IQR: 7.00-17.00). Patient discharge time showed that patients requiring 1-2 days (Mean: 7.12 and IQR: 3.00-8.00) had a higher LOS than patients requiring 1 day or less (Mean: 6.96 and IQR: 2.00-8.00) or 2 days or more (Mean: 5.54 and IQR: 3.00-5.00). Hospital stay based on incidence of surgery showed that patients undergoing 3 times or more surgical interventions had the longest LOS (Mean: 50.30 and IQR: 23.00-64.00) compared to patients undergoing no surgery (Mean: 6.17 and IQR: 2.00-7.00), 1 intervention (Mean: 6.97 and IQR: 3.00-8.00) or 2 interventions (Mean: 21.25 and IQR: 10.00-24.00). In terms of diagnosis, patients with 3 or more diagnoses had the longest hospital stay (Mean: 38.24 and IQR: 16.00-50.00) compared to patients with no diagnosis (Mean: 3.33 and IQR: 2.00-4.00), 1 diagnosis (Mean: 6.07 and IQR: 2.00-7.00), 2 diagnoses (Mean: 14.53 and IQR: 4.00-20.00).

Deriving correlated factors on LOS
Patients receiving critical care (Mean: 7.94 and IQR: 3.00-9.00) were more likely to have longer LOS than those who were not (Mean: 6.56 and IQR: 2.00-7.00), and patients on general wards (Mean: 5.84 and IQR: 2.00-7.00) were more likely to remain in hospital longer than patients on upper grade wards (Mean: 2.90 and IQR: 1.00-3.00). Analysis of hospital stay according to insurance type indicated that admissions involving industrial accidents, medical assistance, medical research, and automobiles occurred less frequently than admissions on health insurance, although the LOS was relatively higher. All variables were statistically significant. (P<0.05)

Building a predictive model of patient's LOS
This section presents a model for the prediction of the number of days in hospital based on the significant variables analyzed above. Multiple regression analysis was performed to develop the model. The following five variables were used as independent variables: frequency of surgery, frequency of diagnosis, frequency of patient transfer, severity, and insurance type. LOS was used as a dependent variable. Also, we partitioned data into the training and test dataset to measure the accuracy of the model; 80% and 20% of data became the training and test data, respectively. Table 8 provides the result of the multiple regression analysis. All five variables were statistically significant and, therefore, correlated with the prediction of the length of hospital stay. In the regression model from the training dataset, R 2 was 0.267, and duration of hospitalization was calculated as followed: LOS (days) = 2.72 + 2.70 Ã (frequency of surgery) + 2.46 Ã (frequency of diagnosis) + 11.65 Ã (number of transfer) + 1.02 Ã (severity)-0.80 Ã (insurance type). As a result of measuring the accuracy with the test dataset, we identified that the mean absolute error of the model is 4.68. Furthermore, we performed the further analysis for building a model to classify whether a specific patient is a long-term inpatient. As we said earlier, we defined long-term inpatients as who had the 30 or more length of hospital stays. As far as this analysis was concerned, random forest was employed to build a model, and data partitioning with 80% for the training and 20% for the test dataset was performed to measure the accuracy. As a result, it was found out that the accuracy of the classification model is 0.9732. That is, the model had the sufficient capacity to classify long-term hospitalized patients. Also, the relative importance of each feature, the transfer frequency was the was the highest at 41.40%, while surgery frequency, diagnosis frequency, severity, and insurance type were 28.44%, 24.13%, 4.77%, and 1.26%, respectively.

Discussion
This study aimed to analyze the association between the EHR event data and LOS in inpatients using statistical analysis and process data mining analysis technology, in order to determine which factors are correlated with the LOS.
Among transferred patients, the mean length of hospital stay was increased by 17 days compared to those who were not transferred. The average number of hospital days (and standard deviation) was the highest among patients transferred from the CNSC and the SPC to RH. Furthermore, among ICD-10 diagnoses which have been registered more than 250 times, the most frequent three diagnoses were all related to infarction. A detailed analysis of medical care showed that majority of these patients were transferred to RH for rehabilitation. A similar distribution of patients by diagnosis was observed between patients transferred from the CNSC and from the SPC to RH.
Stroke, which accounts for the majority of cases of cerebral infarction, requires different rehabilitation treatment modalities, such as physical, speech, and occupational therapies, due to the wide variety of symptoms that occurs depending on the location of the damage in the brain. Although LOS was increased, increased rehabilitation time was associated with increased functional recovery. [19,20] When ICD-10 diagnosis codes are classified into the 3-digit level, one of the mental disorder, mood [affective] disorders showed the highest mean for the LOS. There have been studied the factors affecting the LOS in psychiatric patients. [21,22] As psychiatric patients have different hospitalization factors according to personal characteristics, it seems to be needed to find effective ways of LOS management by analyzing various personal characteristics and treatment pathways among psychiatric patients.
Analysis of patient transfer and diagnostic data revealed that LOS was high among patients who were transferred to RH. It was plausible that secondary problems caused by the primary disease were more likely to correlate with the LOS. Clinical pathways (CP) [23], evidence-based and Analysis of LOS using EHR standardized practice guidelines are being developed and applied to improve the quality of medical care and to reduce the LOS, for both diseases and surgery. It may be necessary to have a standardized and optimized practice manual developed for secondary rehabilitation treatments (for function improvements) to be used in RH. It has been reported that the use of effective rehabilitation programs affects the cost reduction of hospitalization and also reduces the LOS. [24] Long-term hospitalization exceeding 30 days was associated with a higher percentage of surgical operations and transfer rates, as well as restricted antibiotic use compared to other patients. It is very important to manage antibiotic use indicators to appropriately and effectively manage the quality of medical care. [25] While antibiotics are important in treatment and prevention of diseases, long-term use can lead to the development of drug resistance with serious effects on health. This can also increase the LOS and medical costs. [26] Therefore, antimicrobial usage should also be monitored and appropriately managed.
Furthermore, this study found that the following factors were correlated with LOS: transfer time, discharge delay time, frequency of surgery, diagnosis frequency, patient severity, bed grade, and insurance type. Most studies examining factors that are associated with the length of hospitalization involved subjects affected by specific diseases or types of surgery, limiting associated factors within the patients' individual characteristics. For example, a study on pediatric asthma [27] found that variables correlating with patient LOS were age, sex, obesity, and chronic disease. In a previous study examining subjects who underwent radical cystectomy, sex, age, and complications were reported to be associated with hospital stay [28]. It is important to note that variables such as the patient characteristics cannot be improved by changes in hospital practice. This study considered all subjects admitted to hospital, rather than restricting participants according to medical condition or type of surgery. Therefore, the variables identified were those that could improve the overall processing and motoring system of the hospital.
Based on the significant findings of this study, strategies to improve the process of transfer and discharge in the entire hospital, were recommended. Moreover, it is anticipated that effective management of LOS is possible with a sustained effort to manage patients undergoing frequent surgery, with multiple diagnoses, and with severe conditions. There were some limitations to this study which should be addressed. First, the analysis of patient process correlating with the LOS was based on data from a single hospital. As there are differences between hospitals in the admission process and treatment plans, generalizability was limited and it is important to collect and analyze data from multiple hospitals. Furthermore, data analysis was largely confined to the main hospitalization events of the EHR system; the general characteristics of the individual patients and the hospital's environmental factors were not considered in the analysis. The LOS may also be related with month of the year or day of the week of admission/discharge date, for example, admissions on Friday not being discharged until Monday due to lack of senior staff on weekends.
Despite these limitations, this study analyzed the LOS based on objective EHR data that included all medical events for each inpatient rather than some specific patients. Importantly, this study is of value as it analyzed the factors correlating with LOS and identified solutions to reduce this time.
In future studies related to hospital stay, it may be necessary to collect multi-distribution data, as well as the general characteristics of individual subjects, their environmental factors and seasonal and date/time factors, which were not considered in this study.

Conclusions
In this study, we analyzed different variables correlating with LOS by using EHR admission data. We considered how to improve the management of LOS among inpatients.
Research on the duration of hospital stay is important because it helps hospitals to more effectively manage its resources and patients. Specifically, identifying factors which are associated with the LOS in order to accurately predict and manage the number of inpatient days, could be helpful in terms of managing hospital resources and may enable the development of a Clinical Pathway useful for inpatient treatment.
Based on the variables identified in this study, it may be necessary to improve the financial structure of hospitals and develop institutional approaches to reduce patient medical fees, by promoting the effective use of hospital resources and reducing the length of hospital stay via a system subject to continuous monitoring. Eliminating unnecessary hospital stays is a strategy to reduce overall national medical expenses.