Predicting Chronic Wound Healing Time Using Machine Learning

Objective: Chronic wounds have risen to epidemic proportions in the United States and can have an emotional, physical, and financial toll on patients. By leveraging data within the electronic health record (EHR), machine learning models offer the opportunity to facilitate earlier identification of wounds at risk of not healing or healing after an abnormally long time, which may improve treatment decisions and patient outcomes. Machine learning models in this study were built to predict chronic wound healing time. Approach: Machine learning models were developed using EHR data to predict patients at risk of having wounds not heal within 4, 8, and 12 weeks from the start of treatment. The models were trained on three data sets of 1,220,576 wounds, including 187 covariates describing patient demographics, comorbidities, and wound characteristics. The area under the receiver operating characteristic curve (AUC) was used to assess the accuracy of the models. Shapley Additive Explanations (SHAP) were used to analyze variable importance in predictions and enhance clinical interpretations. Results: The 4-, 8-, and 12-week gradient-boosted decision tree models achieved AUCs of 0.854, 0.855, and 0.853, respectively. Days in treatment, wound depth and location, and wound area were the most influential predictors of wounds at risk of not healing. Innovation: Machine learning models can accurately predict chronic wound healing time using EHR data. SHAP values can give insight into how patient-specific variables influenced predictions. Conclusion: Accurate models identifying patients with chronic wounds at risk of non- or slow healing are feasible and can be incorporated into routine wound care.


INTRODUCTION
Chronic wounds, defined as not healing in a predictable or ordinary amount of time, with "ordinary" commonly defined as 4 to 12 weeks, affect an estimated 8.2 million Americans per year and can have an emotional, physical, and financial toll on patients. 1,2 Chronic wounds are becoming more prevalent in the United States for multiple reasons, such as an aging population and increasing prevalence of obesity and diabetes. 3,4 Without proper or timely treatment, patients with chronic wounds may face dire outcomes, including loss of

CLINICAL PROBLEM ADDRESSED
Focus on early identification of patients at high risk of having wounds that will not heal or heal after an abnormal amount of time may enhance clinical decision making to limit complications, improve patient outcomes, and reduce costs of care. However, there are few predictive tools that allow clinicians to identify these high-risk individuals in a timely and accurate manner for any wound throughout the course of treatment while offering insight into factors that are affecting a patient's prognosis for healing.
Recent studies aimed to predict chronic wound healing time using machine learning. 3,5,6 However, most of these studies only made predictions at specific times during treatment, focused on a limited number of wound types, had lower accuracy, or did not offer patient-specific insight into variable importance in predicting wound healing in terms of weight and whether the influence tended to be positive or negative.
Cho et al. reported models that predict chronic wound healing time. 3 However, the models only provided predictions upon patients' baseline presentation and focused on the 12-week healing timeframe, yielding an area under the receiver operating characteristic curve (AUC) of 0.71. 3 Fife et al. developed a machine learning model to predict diabetic foot ulcer chronic wound healing time, achieving AUC levels of 0.67. 5 Jung et al. developed machine learning models to predict wound healing in the 15-week timeframe, achieving an AUC of 0.84. 6 While achieving higher accuracy, this study also focused on predictions at one time point and did not make predictions for subsequent visits.
Additionally, the existing models described above evaluated predictive importance in ways that do not capture local interpretations, that is, patient-specific characteristics that are unique driving factors for a patient's risk of not healing. For example, variable importance analyses for algorithms such as linear or logistic regression are typically done by analyzing coefficient values that are static and generalize for the population. In tree-based methods such as decision trees and random forests, no interpretation of directionality is typically offered, and only the global importance of each variable is presented. These types of variable importance analyses were used in the studies mentioned above.
Accordingly, this study's goal was to build highly accurate machine learning models that can be used in real time to predict the probability of chronic wounds not healing in 4, 8, and 12 weeks from the start of treatment across multiple time points. The analysis also gives insight into variable importance in terms of magnitude and directionality at the global and local levels as patient and wound characteristics change over the course of treatment. To the best of our knowledge, this has not been previously explored in wound care research.

Population derivation
Retrospective data were sourced from Net Health Systems, Inc.'s Wound Care Analysis Data Set. The study was deemed exempt from informed consent requirements after review by IRB Solutions, a private Institutional Review Board located in Yarnell, Arizona. These data represent patients with chronic wounds from 595 facilities across the United States. 3-12 Supplementary Data S1 contains the list of variables that were used in the analysis. To protect the model from being negatively impacted by outliers, we removed all records for patients who had implausible values for wound surface area, length, width, and depth. 6 These outliers were determined by manually inspecting the distribution of these values across the population. This process identified 12,029 wounds whose treatment visits were dropped from the analysis.
A total of 461,293 patients were included in the analysis, accounting for 1,220,576 wounds. Of these wounds, 542,723 (45%) healed within 4 weeks from the start of treatment, 789,775 (65%) healed within 8 weeks from the start of treatment, and 909,113 (75%) healed within 12 weeks from the start of treatment. Table 1 displays clinical summary statistics for these patients. 13 Before model training, patients were randomly split into training (60% of unique patients), validation (15%), and testing (25%) sets, which ensured that a patient was only represented in one of the three populations. This strategy was used to help prevent overfitting during model development. Wound treatment visits for these patients were then used during model creation, where the algorithms learned on the training and validation sets were tested on the held-out testing set. The training, validation, and testing populations for each model were also censored so that visits that occurred after the healing timeframe of interest were not included in model development. For example, it is common for a patient to remain in treatment for longer than 4, 8, or 12 weeks. Treatments that occurred after the target timeframe were not included in the analysis. Additionally, final treatment visits, where the days in treatment was equal to the total days in treatment, were dropped from the analysis to ensure that the model was developed on observations before an outcome of healed or not healed. Therefore, the data sets for each model had different numbers of observations. The 4-, 8-, and 12-week models were trained and tested on 3,477,501, 4,580,575, and 5,210,023 treatment visits, respectively. Tables 2-4 display clinical summary statistics for wounds in each population at patient presentation, stratified by those that did and did not heal within 4, 8, and 12 weeks.
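The patient-level split and visit censoring described above can be sketched as follows. This is a minimal illustration and not the authors' code: the field names (`patient_id`, `days_in_treatment`, `total_days_in_treatment`), the toy records, and the choice to treat the last day of the timeframe as inclusive are all assumptions made for the example.

```python
import random


def split_patients(patient_ids, fractions=(0.60, 0.15, 0.25), seed=0):
    """Assign each unique patient to exactly one of train/validation/test."""
    unique_ids = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    n_train = round(fractions[0] * len(unique_ids))
    n_val = round(fractions[1] * len(unique_ids))
    return {
        "train": set(unique_ids[:n_train]),
        "validation": set(unique_ids[n_train:n_train + n_val]),
        "test": set(unique_ids[n_train + n_val:]),
    }


def censor_visits(visits, horizon_days):
    """Keep visits inside the target timeframe and drop each wound's final visit."""
    return [
        v for v in visits
        # Assumption: a visit on the last day of the timeframe is still kept.
        if v["days_in_treatment"] <= horizon_days
        # The final visit is the one where days in treatment equals the total.
        and v["days_in_treatment"] < v["total_days_in_treatment"]
    ]


# Hypothetical visit records: 20 patients, one wound each, weekly-ish visits.
visits = [
    {"patient_id": p, "days_in_treatment": d, "total_days_in_treatment": 42}
    for p in range(20)
    for d in (0, 7, 14, 28, 35, 42)
]

splits = split_patients(v["patient_id"] for v in visits)
four_week_visits = censor_visits(visits, horizon_days=28)
train_visits = [v for v in four_week_visits if v["patient_id"] in splits["train"]]
```

Splitting on patient identifiers before selecting visits is what keeps all of a patient's wounds and visits inside a single partition, so the testing set contains no visits from patients seen during training.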
13 Wounds were defined as "healed" if they had a final wound status labeled as "Resolved," "Healed," "Graft," "Closed," or "Treatment Complete" in the electronic health record (EHR) before the timeframe of interest. Nonhealing wounds were defined as those with a final wound status label of "Not Healed," "Amputated," or "Quit Treatment," or as those that healed after the timeframe of interest. These labels were chosen with clinical guidance and review of published literature. 3,6,14

Model development
All models were developed in Python 3.8.5. Logistic regression, random forests, gradient-boosted decision trees (GBDT), and deep feedforward neural networks (DNN) were the machine learning methods tested to predict chronic wound healing probability in each timeframe. We applied different data preparation and feature engineering based on each algorithm's mechanism of learning.
For all models, we coded missing data in nominal categorical variables as a separate category, "Missing." We used one-hot encoding for the logistic regression, random forest, and DNN models. We used integer encoding for the GBDT models. 15,16 For the logistic regression and DNN models, we used the mean value by wound type to impute missing data in continuous variables. After imputation, we standardized these continuous variables using z-scores. For the random forest models, we did not normalize or scale the continuous variables since the algorithm makes split decisions instead of deriving weighted parameters and is therefore not affected by scaling or normalization. Furthermore, for the GBDT models, we did not impute missing values of continuous variables because GBDTs handle missing data without imputation and, like random forests, make split decisions instead of deriving weighted parameters.
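The preparation steps used for the logistic regression and DNN models (a "Missing" category, mean imputation by wound type, then z-scoring) can be illustrated with a small sketch. The column names, the "Pressure ulcer" wound type, the toy values, and the use of the population standard deviation are assumptions for this example; the source does not specify them.

```python
from collections import defaultdict
from statistics import mean, pstdev


def fill_missing_category(rows, key, label="Missing"):
    """Code missing nominal values as their own category."""
    return [{**row, key: label if row[key] is None else row[key]} for row in rows]


def impute_by_group_mean(rows, value_key, group_key):
    """Replace None in value_key with the mean of observed values in the same group.

    Raises KeyError if a group has no observed values at all.
    """
    observed = defaultdict(list)
    for row in rows:
        if row[value_key] is not None:
            observed[row[group_key]].append(row[value_key])
    group_means = {group: mean(values) for group, values in observed.items()}
    return [
        {**row, value_key: group_means[row[group_key]]
         if row[value_key] is None else row[value_key]}
        for row in rows
    ]


def z_score(rows, value_key):
    """Standardize a continuous variable to mean 0 and standard deviation 1."""
    values = [row[value_key] for row in rows]
    mu, sigma = mean(values), pstdev(values)
    return [{**row, value_key: (row[value_key] - mu) / sigma} for row in rows]


# Hypothetical records; None marks a missing value.
rows = [
    {"wound_type": "Pressure ulcer", "depth_mm": 2.0, "epithelialization": "51-75%"},
    {"wound_type": "Pressure ulcer", "depth_mm": None, "epithelialization": None},
    {"wound_type": "Pressure ulcer", "depth_mm": 4.0, "epithelialization": None},
    {"wound_type": "Diabetic foot ulcer", "depth_mm": 10.0, "epithelialization": "51-75%"},
    {"wound_type": "Diabetic foot ulcer", "depth_mm": None, "epithelialization": None},
]

categorical_ready = fill_missing_category(rows, "epithelialization")
imputed = impute_by_group_mean(categorical_ready, "depth_mm", "wound_type")
scaled = z_score(imputed, "depth_mm")
```

In a real pipeline the group means and the scaling statistics would be estimated on the training set only and then reused for the validation and testing sets, so that no information leaks from held-out patients.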
The logistic regression and random forest models were developed using scikit-learn. 17 The least absolute shrinkage and selection operator (LASSO) was used for model regularization and variable selection in the logistic regression models.
For random forest models, number of trees, maximum depth, minimum samples per split and per leaf, and maximum features were the hyperparameters tuned during model development. The functionalities of these hyperparameters are described in Supplementary Data S2.
LightGBM 3.0.0, a GBDT package developed by Microsoft, was used for the GBDT models. GBDTs are typically top performers for structured, tabular regression and classification problems. 18 For the GBDT models, learning rate, evaluation metric, maximum depth, early stopping rounds, column sample by tree, number of leaves, L1 regularization, L2 regularization, and maximum bins were the hyperparameters tuned during model development. The functionalities of these hyperparameters are also described in Supplementary Data S2.
Initial hyperparameters were chosen using scikit-hyperband. Hyperband iteratively fits models with different combinations of hyperparameter values and evaluates the results against the target metric, using a bandit-based, successive halving approach to choose final parameters. 19 For these models, hyperband chose parameters in the regions of expected highest AUC for each trial. After tuning with hyperband, hyperparameters were again tuned manually to avoid overfitting. Once final hyperparameters were chosen, the algorithms were given 100,000 epochs to learn during training while optimizing for binary logarithmic loss.
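The successive halving idea that Hyperband builds on can be shown with a short sketch. This is one bracket of successive halving only, not the full Hyperband procedure and not the scikit-hyperband API; the function `toy_validation_auc`, the candidate learning rates, and the reduction factor `eta=3` are hypothetical stand-ins chosen for illustration.

```python
def successive_halving(configs, evaluate, min_budget=1, eta=3):
    """Score all configs on a small budget, keep the top 1/eta, and repeat
    with eta times the budget until one config remains."""
    budget = min_budget
    survivors = list(configs)
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        survivors = ranked[:max(1, len(ranked) // eta)]
        budget *= eta
    return survivors[0]


def toy_validation_auc(config, budget):
    """Hypothetical stand-in for 'train with this budget, return validation AUC'.

    The score peaks at a learning rate of 0.1 and improves with more budget.
    """
    return 0.85 - abs(config["learning_rate"] - 0.1) - 0.05 / budget


candidates = [
    {"learning_rate": lr}
    for lr in (0.01, 0.03, 0.05, 0.1, 0.2, 0.3, 0.5, 0.7, 1.0)
]
best = successive_halving(candidates, toy_validation_auc)
```

The design choice is to spend little compute on many configurations and progressively more on the few that look promising; Hyperband repeats this with several different starting budgets to hedge against configurations that only perform well after longer training.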
The DNN models were developed in Keras using the sequential framework. 18 A suitable model architecture was chosen for the problem: several initial dense layers with a rectified linear unit activation function followed by a dropout layer to combat overfitting. The final dense layer had a sigmoid activation function to match the binary target space.
We evaluated the algorithm results with AUC on the training and testing sets for each model. Furthermore, we analyzed each model's ability to discriminate with confusion matrices on their testing sets to evaluate accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and F1-score. The confusion matrices' discrimination thresholds were chosen using Youden's J-statistic, a Pareto-optimal value to optimize sensitivity and specificity.
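The evaluation steps above can be made concrete with a small sketch that computes AUC, selects the threshold maximizing Youden's J (sensitivity + specificity - 1), and derives the confusion matrix metrics at that threshold. The labels and scores are hypothetical, and treating 1 as the positive class is an assumption of this example.

```python
def auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative
    (ties count as one half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def confusion_metrics(y_true, scores, threshold):
    """Metrics at a given cut-off; assumes every denominator is non-zero."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, predicted) if y == 0 and p == 1)
    tn = sum(1 for y, p in zip(y_true, predicted) if y == 0 and p == 0)
    fn = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / len(y_true),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


def youden_threshold(y_true, scores):
    """Return the candidate threshold with the largest Youden's J."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best_threshold, best_j = None, float("-inf")
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
        tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < threshold)
        j = tp / positives + tn / negatives - 1
        if j > best_j:
            best_threshold, best_j = threshold, j
    return best_threshold


# Hypothetical labels (1 = positive class) and predicted probabilities.
y_true = [1, 1, 1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.95, 0.9, 0.8, 0.7, 0.65, 0.6, 0.4, 0.3, 0.2, 0.1]

model_auc = auc(y_true, scores)
threshold = youden_threshold(y_true, scores)
metrics = confusion_metrics(y_true, scores, threshold)
```

AUC is threshold-free, whereas sensitivity, specificity, PPV, NPV, and F1-score all depend on the chosen cut-off, which is why a rule such as Youden's J is needed before a confusion matrix can be reported.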

Model interpretability
For clinical practice, it is important to evaluate how variables contribute to a prediction in terms of magnitude of importance and directionality. We used Shapley Additive Explanations (SHAP) to assess model interpretability. SHAP provides information on variable importance for models that have traditionally been difficult to interpret. SHAP values are derived by considering all possible permutations of independent variables for a single prediction and then assigning marginal contributions, or SHAP values, to each variable in the prediction. 20 In our models, a positive SHAP value (>0) indicates that the variable influenced a higher probability that the wound would heal in the target timeframe in the context of a single observation (e.g., treatment visit), whereas a negative SHAP value indicates the opposite. The absolute value of the SHAP value indicates which variables had more weight, or importance, in a prediction. This allows for a rich interpretation of the magnitude and direction in which a variable affected a prediction for a specific patient and wound. The ability to find unique, patient-level insights of predictions is what differentiates SHAP from other traditional variable importance analyses. SHAP values for a variable may differ from one patient to another because of their unique combination of covariates and interactions with one another. This is unlike parameterized models such as linear and logistic regression, which produce coefficients that are static and applicable on a population level, or nonparameterized models like decision trees and random forests in which historically popular measures such as Gini importance, gain, or entropy also represent global importance but provide no directionality. When SHAP values for each variable are compared across many predictions, SHAP also allows for a global interpretation of how covariates affected predictions across the population. Examples from the results of this study are provided below.
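To make the permutation-based definition concrete, the following sketch computes exact Shapley values by brute force for a toy two-variable model. It is an illustration only: the scoring function `toy_score`, its coefficients, the patient values, and the use of a single reference observation as the baseline are assumptions for this example, and practical SHAP implementations rely on much faster model-specific algorithms.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values: average each variable's marginal contribution
    over every ordering in which variables switch from baseline to x."""
    features = list(x)
    totals = {feature: 0.0 for feature in features}
    orderings = list(permutations(features))
    for order in orderings:
        current = dict(baseline)
        previous = predict(current)
        for feature in order:
            current[feature] = x[feature]
            updated = predict(current)
            totals[feature] += updated - previous
            previous = updated
    return {feature: totals[feature] / len(orderings) for feature in features}


def toy_score(v):
    """Hypothetical model output with an interaction between time and depth."""
    return (2.0
            - 0.02 * v["days_on_service"]
            - 0.5 * v["depth_cm"]
            - 0.01 * v["days_on_service"] * v["depth_cm"])


# Hypothetical reference observation and two patients at the same day of treatment.
baseline = {"days_on_service": 30, "depth_cm": 1.0}
patient_a = {"days_on_service": 77, "depth_cm": 0.5}
patient_b = {"days_on_service": 77, "depth_cm": 2.0}

phi_a = shapley_values(toy_score, patient_a, baseline)
phi_b = shapley_values(toy_score, patient_b, baseline)
```

Both toy patients share the same value for `days_on_service`, yet its contribution is more negative for the patient with the deeper wound because of the interaction term. For each patient the contributions also sum to the difference between that patient's score and the baseline score, which is the additivity property that lets SHAP values be read as a decomposition of a single prediction.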

RESULTS
The GBDT models substantially outperformed the logistic regression, random forest, and DNN models across all target timepoints. Table 5 displays testing set AUCs for each model and target timeframe. Figure 1 shows the receiver operating characteristic (ROC) curves for the GBDT models on the training and testing populations. The GBDT models predicting wound healing in 4, 8, and 12 weeks from the start of treatment achieved AUCs of 0.854, 0.855, and 0.853 on their testing sets, respectively. The training set and testing set AUCs are similar and do not suggest overfitting. Table 6 shows the sensitivity, specificity, PPV, NPV, and F1-score metrics for the confusion matrices derived on the testing sets for the 4-, 8-, and 12-week GBDT models. Figures 2-4 show SHAP value summary plots on the testing sets for the 4-, 8-, and 12-week models, respectively. Each plot shows the 20 most influential variables according to SHAP. The variables are arranged from top to bottom in order of importance, and SHAP values are shown on the x-axis at the bottom of the figure. A higher SHAP value indicates the variable influenced the model to predict a higher probability that the wound would heal in the target timeframe. A lower SHAP value indicates the opposite. In all three models, the number of days the wound had been treated at the time of the visit ("Days Wound on Service"), wound depth ("Depth"), and current wound surface area ("Current Wound Area") were the top three most influential variables.
For "Days Wound on Service," a lower parameter value, indicated in blue, generally influenced the models to predict that the wound would heal before the target timeframe. A higher parameter value of "Days Wound on Service," indicated in red, influenced the models to predict that the wound would not heal within the timeframe.
For "Depth," a lower parameter value, indicated in blue, generally influenced the models to predict that the wound would heal before the timeframe of interest. A higher value for "Depth," indicated in red, had the opposite effect and influenced the models to predict that the wound would not heal before the target timeframe.
For "Current Wound Area," a lower parameter value, indicated in blue, influenced the models to predict that the wound would heal before the timeframe of interest. A higher value for "Current Wound Area," indicated in red, influenced the models to predict that the wound would not heal before the respective timeframe.
Body part location ("Body Part") was the fourth most influential variable in the models. While this variable is influential, it is not as intuitive to understand visually since it is an integer-encoded nominal categorical variable. To understand at a granular level how body part location affects predictions, it is necessary to aggregate the mean SHAP value for "Body Part" by wound location and analyze how certain locations globally influenced predictions. Figure 5 shows the aggregated average 12-week model SHAP values for lower extremity wounds. "Toe" had the lowest average SHAP value across all wounds (-0.34), followed by "Foot" (-0.22). "Lower Leg" and "Upper Leg" had higher average SHAP values of 0.13 and 0.36, respectively. This means that, on average, the model tended to predict that wounds located on a patient's toe or foot had lower probabilities of healing within the 12-week timeframe than wounds located on a patient's lower or upper leg.
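The aggregation behind this kind of per-location summary is a grouped mean of the per-visit SHAP values of one categorical variable. A minimal sketch follows; the SHAP values below are made up for illustration and are not the study's results.

```python
from collections import defaultdict
from statistics import mean


def mean_shap_by_category(categories, shap_values):
    """Average one variable's per-observation SHAP values within each category."""
    grouped = defaultdict(list)
    for category, value in zip(categories, shap_values):
        grouped[category].append(value)
    return {category: mean(values) for category, values in grouped.items()}


# Hypothetical per-visit SHAP values for the "Body Part" variable.
body_part = ["Toe", "Toe", "Foot", "Foot", "Lower Leg", "Upper Leg"]
body_part_shap = [-0.4, -0.2, -0.3, -0.1, 0.1, 0.4]

averages = mean_shap_by_category(body_part, body_part_shap)
# Order locations from most negative to most positive average influence.
ranked_locations = sorted(averages, key=averages.get)
```

Grouping by the original category label rather than by its integer code is what makes the result readable, since the integer assigned to each body part carries no ordinal meaning.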

DISCUSSION
We developed machine learning models that can accurately predict on each visit whether a chronic wound would heal within 4, 8, and 12 weeks from the start of treatment. These models use clinical features that were curated from structured EHR data and can be deployed in real time to identify patients with chronic wounds at high risk of not healing within specified timeframes.
To the best of our knowledge, there are no other predictive tools that have demonstrated this accuracy level on every treatment visit for any wound type on a population of this size. Other models developed to predict chronic wound healing time used relatively smaller data sets, had lower accuracy, were limited to specific wound types or healing times, or only made predictions at single points in time. In contrast, the models in this study were developed on larger data sets and provide predictions for any wound type and any visit across multiple timeframes. This makes the algorithms flexible in their ability to accurately generalize across a diverse population.
[Figure caption (8-week SHAP summary plot): Parameter values are plotted with respect to their corresponding SHAP value on the x-axis, where a lower SHAP value indicates that the parameter value influenced the model to predict that the wound would not heal within 8 weeks from the start of treatment, and a higher SHAP value indicates that the parameter value influenced the model to predict that the wound would heal within 8 weeks from the start of treatment.]
Our models use days in treatment as an independent variable because the study was designed to make highly accurate predictions on every treatment visit. While days in treatment is highly predictive, the interaction of time and other variables provides an even more accurate prediction because it can capture more of the longitudinal nature of a patient's progression or regression over the course of treatment. Figure 6 represents a SHAP interaction plot between "Days Wound on Service" and "Depth." This figure demonstrates that a shallow wound depth, indicated in blue, coupled with a lower "Days Wound on Service" value influences the model to predict that the wound would heal within the 12-week timeframe. As "Days Wound on Service" increases and "Depth" remains shallow, the model is also influenced to predict that the wound will heal within the 12-week timeframe. However, as "Days Wound on Service" increases and "Depth" maintains a higher parameter value, the SHAP value decreases, indicating that the model is influenced to predict that the wound will not heal in 12 weeks.
[Figure caption (12-week SHAP summary plot): Parameter values are plotted with respect to their corresponding SHAP value on the x-axis, where a lower SHAP value indicates that the parameter value influenced the model to predict that the wound would not heal within 12 weeks from the start of treatment, and a higher SHAP value indicates that the parameter value influenced the model to predict that the wound would heal within 12 weeks from the start of treatment.]
[Figure caption (SHAP interaction plot): A positive SHAP value indicates that the observation was influenced to predict that the wound would heal within 12 weeks from the start of treatment. A lower SHAP value indicates that the observation was influenced to predict that the wound would not heal within 12 weeks from the start of treatment. As wound depth is shallower (indicated in blue), and "Days Wound on Service" increases, wounds are more likely to heal within 12 weeks. When wounds are of greater depth, indicated in red, and patients are further along in treatment, they are less likely to heal in 12 weeks from the start of treatment.]
The interaction between covariates captured by SHAP is further demonstrated in Fig. 7, which shows five variables and their corresponding 12-week SHAP values for two patients with diabetic foot ulcers. Both patients were 60 years old and had been treated for 77 days. Patient A's wound healed in 12 weeks, while Patient B's did not. The model predicted that Patient A's wound had a 33% chance of healing in the 12-week timeframe, and Patient B's wound had a 3% chance.
Although both patients had the same type of wound and had been treated for the same amount of time, their SHAP values show that other variables in the prediction changed how "Days Wound on Service" and "Wound Type" affected each patient. For example, "Days Wound on Service" for Patient A had a SHAP value of -3.05, but Patient B had a SHAP value of -3.63. While this variable impacted both patients' predictions negatively, it had more of an effect on Patient B than Patient A. Additionally, the patients had different values for "Epithelialization," where Patient A had a value of "51-75%," but Patient B had no value recorded. Patient A received a positive SHAP value of 0.2 for "Epithelialization," whereas Patient B had a negative SHAP value of -0.03. This example demonstrates how SHAP can unveil patient-specific information about how a variable influenced a prediction, and the magnitude of the influence.
The analysis has some limitations. First, due to the retrospective nature of the study, no predictions were derived and validated prospectively. Second, no clinical interventions were used as covariates in the models, limiting the prediction to consider patient characteristics, wound characteristics, and time-related variables on each visit. The authors of this article are currently working on incorporating treatments into this analysis. Third, due to the nature of the data, we are unable to capture the nature of recurring wounds. Last, while the analysis was performed on a population of patients from hundreds of facilities across the United States that used a specific EHR, the approach does not harmonize data from other facilities or EHRs. This may limit the models' ability to perform with reliable accuracy on data from outside the facilities from which the data were sourced, and the models would need further validation on data sets that are not derived from Net Health Systems, Inc.'s Wound Care Analysis Data Set.
In conclusion, this study demonstrates that machine learning algorithms can offer accurate and insightful predictions of chronic wounds at risk of not healing within specific time periods. If integrated thoughtfully within a wound care workflow, these predictions can help clinicians identify earlier the patients who are at high risk of having their wounds not heal in 4, 8, and 12 weeks from the start of treatment. This can provide clinicians with important information to facilitate data-driven decision making and may improve patient outcomes and reduce costs associated with non- or slow-healing wounds that may prolong care.

INNOVATION
This study demonstrates that machine learning models can accurately predict chronic wound healing time using historical clinical factors from real-world data curated from EHRs. These models can help clinicians identify patients at risk of non- or slow-healing wounds earlier. SHAP values for predictions can also offer clinicians insight into which variables had a positive or negative effect on a prediction, and the magnitude of the influence with respect to the rest of the covariates. When accurate predictions are used in conjunction with SHAP values, they can support data-driven decision making that may lead to improved patient care and outcomes.

ACKNOWLEDGMENTS AND FUNDING SOURCES
This study did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors. The authors performed this work as part of their regular duties.

KEY FINDINGS
The probability of chronic wounds healing within 4, 8, or 12 weeks from the start of treatment can be predicted accurately with EHR data on each treatment visit.
SHAP can be used to derive global and local importance of covariates in the predictions of chronic wound healing time.
How long the patient had been in treatment, wound surface area, wound depth, and body part location of wounds were the most influential variables in these models.