Explainable Machine Learning for Real-Time Hypoglycemia and Hyperglycemia Prediction and Personalized Control Recommendations

Background: The occurrences of acute complications arising from hypoglycemia and hyperglycemia peak as young adults with type 1 diabetes (T1D) take control of their own care. Continuous glucose monitoring (CGM) devices provide real-time glucose readings enabling users to manage their control proactively. Machine learning algorithms can use CGM data to make ahead-of-time risk predictions and provide insight into an individual’s longer-term control. Methods: We introduce explainable machine learning to make predictions of hypoglycemia (<70 mg/dL) and hyperglycemia (>270 mg/dL) up to 60 minutes ahead of time. We train our models using CGM data from 153 people living with T1D in the CITY (CGM Intervention in Teens and Young Adults With Type 1 Diabetes) study, totaling more than 28 000 days of usage, which we summarize into short-term, medium-term, and long-term glucose control features along with demographic information. We use machine learning explanations (SHAP [SHapley Additive exPlanations]) to identify which features have been most important in predicting risk per user. Results: Machine learning models (XGBoost) show excellent performance at predicting hypoglycemia (area under the receiver operating curve [AUROC]: 0.998, average precision: 0.953) and hyperglycemia (AUROC: 0.989, average precision: 0.931) in comparison with a baseline heuristic and logistic regression model. Conclusions: Maximizing model performance for glucose risk prediction and management is crucial to reduce the burden of alarm fatigue on CGM users. Machine learning enables more precise and timely predictions in comparison with baseline models. SHAP helps identify what about a CGM user’s glucose control has led to predictions of risk, which can be used to reduce their long-term risk of complications.


Introduction
People with type 1 diabetes (T1D) face a daily balance to keep their glucose levels within a safe range (ie, "in-range"). Severe complications are prevalent and arise from glycemic variability, low blood sugars (hypoglycemia), and high blood sugars (hyperglycemia). 1 For hypoglycemic incidents alone, the requirement for emergency assistance may be as high as 7.1% per year 2 and could account for 6% to 10% of deaths for those with T1D. 3,4 Long-term consequences of hypoglycemia include effects on cognition and potential links with dementia. 5-8 Effective glucose management for adolescents and young adults living with T1D is challenging, 9,10 due to the multiple transitions taking place in their lives, including puberty, relationships, the move to more independent living and diabetes self-care, and the transfer from pediatric to adult clinical care teams. 11-13 Continuous glucose monitoring (CGM) enables regular automated readings of estimated glucose levels found in interstitial fluid, providing immediate insight into glucose control. 14-16 In addition to mitigating short-term risk of severe hypoglycemia and hyperglycemia, compliance in wearing CGM devices has been shown to improve glycosylated hemoglobin (HbA1c) levels, which, if sustained, reduce long-term complication risks. 17,18 The magnitude of reduction in HbA1c from CGM usage is dependent on the user's original HbA1c value, ie, those at highest risk of complications from poorer control are likely to benefit the most. 16 Specific to young adults, Laffel et al 19 demonstrate a clear improvement in HbA1c for those utilizing CGM.
Real-time CGM devices provide alerts for users when their interstitial fluid glucose rises above or falls below a desired range. Type 1 diabetes management can be aided further by having ahead-of-time predictions so individuals can identify risk early and better plan self-care activities, such as insulin dosages. Simple threshold-based algorithms have been able to successfully predict hypoglycemia 30 minutes in advance (eg, Medtronic-640 "SmartGuard" 20). 21-28 Dave et al 23 emphasize the importance of feature extraction when generating predictions of hypoglycemia in CGM data. Generating features that are both predictive in models and insightful for understanding a user's glucose control is a difficult balance.
In this work, we make two novel contributions: algorithms tailored to young adults and explanations. First, we introduce machine learning models to predict hypoglycemia (<70 mg/dL) and hyperglycemia (>270 mg/dL) 29 with a trustworthy prediction horizon up to 60 minutes (44 minutes on average) for young adult users of CGM. While CGM risk prediction is a well-explored topic, more must be done to understand what led to increased risk for an individual so they can be proactive. Second, we introduce explainable machine learning, not only to predict risk but also to automatically identify the most important factors in an individual's CGM data that led to increased risk. Explanations have no detrimental impact on model performance. We provide a framework in which machine learning can be used to:
1. Provide real-time predictions of hypoglycemia and hyperglycemia (Results-Model Evaluation) using intuitive features (Methods-Features) generated from CGM data (Methods-Data).
2. Automatically identify the most important features that have led to predictions of risk for each CGM user over a given time period (Results-Model Explanation).
3. Provide personalized control recommendations for each CGM user to help with their T1D management (Results-User Interface).

Data
We make use of publicly available data from "A Randomized Clinical Trial to Assess the Efficacy and Safety of Continuous Glucose Monitoring in Young Adults 14-<25 with Type 1 Diabetes" (CITY). 19 By design, the study recruited adolescents and young adults with T1D (duration >12 months) exhibiting poorer glycemic control (HbA1c 7.5-<11.0%), most likely to benefit from CGM usage. 16 The study was conducted at 14 endocrinology practices in the United States, where participants were randomly assigned to either CGM (Dexcom G5) or regular finger-prick glucose meter monitoring. The randomized trial aimed to determine the effect of CGM on glycemic control primarily measured through HbA1c. The CGM users were compared with the control group using HbA1c levels after six months of usage. After six months, all study participants were provided with CGM devices and HbA1c was tracked for a further six months. The data were collected from January 2018 to November 2019, before being made publicly available in March 2020. We make use of CGM data from 153 people living with T1D in the CITY study, where users were provided CGM devices for 6 to 12 months, totaling more than 28 000 days of usage data. In Figure 1, we show the breakdown of age (both at enrollment and at original diagnosis) for the 153 individuals. In addition to CGM data, basic screening information and the most recently recorded HbA1c test result were used to generate predictions. All participants had scheduled in-clinic visits throughout the study with HbA1c being routinely collected every 13 weeks after the baseline taken at screening.

Features
To utilize CGM data for hypoglycemic and hyperglycemic predictions, we generate a total of 30 features which summarize a young adult's CGM data on different timescales. Glucose control is summarized on short-term (one hour), medium-term (one day), and long-term (one week) baselines prior to the current CGM reading. This is combined with six features that characterize basic information. A complete description of all generated features is given in Table 1.
Features are generated at the point of each unique CGM reading. Features are only used in modeling if the CGM device has been used for ≥80% of the prior week.
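As a minimal sketch of how window-based features of the kind listed in Table 1 could be computed, the following Python summarizes one window of readings. The function names, the feature-name prefixes, the use of the population SD, and the assumed 5-minute sampling interval (used to define the expected number of readings per week) are illustrative assumptions and are not taken from the study code.

```python
from statistics import mean, pstdev


def longest_run(diffs, sign):
    """Longest run of consecutive changes with the given sign (+1 or -1)."""
    best = run = 0
    for d in diffs:
        if d * sign > 0:
            run += 1
            best = max(best, run)
        else:
            run = 0
    return best


def window_features(readings, label):
    """Summarize glucose readings (mg/dL, oldest first) from one time window."""
    diffs = [b - a for a, b in zip(readings, readings[1:])]
    return {
        f"{label}_mean": mean(readings),
        f"{label}_sd": pstdev(readings),  # population SD: an assumption
        f"{label}_largest_increase": max([d for d in diffs if d > 0], default=0),
        f"{label}_largest_decrease": max([-d for d in diffs if d < 0], default=0),
        f"{label}_max_consecutive_increases": longest_run(diffs, +1),
        f"{label}_max_consecutive_decreases": longest_run(diffs, -1),
    }


def enough_usage(n_readings, expected=7 * 24 * 12, min_fraction=0.8):
    """True if >=80% of the expected readings exist for the prior week.

    `expected` assumes one reading every 5 minutes (hypothetical default).
    """
    return n_readings >= min_fraction * expected


# Hypothetical last hour of readings, oldest first
last_hour = [150, 145, 138, 130, 131, 125, 118, 110, 104, 99, 95, 92]
features = window_features(last_hour, "1h")
```

The same function could be called with one-day and one-week windows to obtain the medium-term and long-term counterparts of each feature.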

Targets
To generate targets for our model predictions, we generate two binary variables referring to hypoglycemic (<70 mg/dL) and hyperglycemic (>270 mg/dL) events. A feature set is generated for each unique CGM reading, at which point we check if the CGM user's glucose level falls within these regions (for ≥1 reading) in the following 60 minutes (ie, positive prediction). Our models therefore make predictions of hypoglycemia or hyperglycemia up to a maximum of 60 minutes ahead of time. Our average prediction horizon is 44 minutes. Glucose readings already within the hypoglycemic or hyperglycemic regions are removed from the modeling data set to avoid artificially boosting model performance metrics. Figure 2 shows a schematic of interstitial fluid glucose levels through a given day, regions of hypoglycemia and hyperglycemia, and timestamps of model predictions prior (ie, target).
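The labeling rule described above can be sketched as follows. This is an illustrative reading of the text rather than the study's implementation: the function name, the representation of timestamps as minutes, and the example trace are assumptions, and the readings are assumed to be sorted by time.

```python
HYPO_MG_DL = 70      # hypoglycemia threshold from the text
HYPER_MG_DL = 270    # hyperglycemia threshold from the text
HORIZON_MIN = 60     # maximum prediction horizon from the text


def label_targets(times, glucose):
    """Return (index, hypo, hyper) for each reading kept for modeling.

    times: minutes since an arbitrary start, sorted ascending (assumption).
    glucose: readings in mg/dL, aligned with `times`.
    """
    rows = []
    for i, (t, g) in enumerate(zip(times, glucose)):
        if g < HYPO_MG_DL or g > HYPER_MG_DL:
            continue  # reading already inside an event region: excluded
        future = [
            gf
            for tf, gf in zip(times[i + 1:], glucose[i + 1:])
            if tf - t <= HORIZON_MIN
        ]
        hypo = int(any(gf < HYPO_MG_DL for gf in future))
        hyper = int(any(gf > HYPER_MG_DL for gf in future))
        rows.append((i, hypo, hyper))
    return rows


# Hypothetical trace sampled every 15 minutes
times = list(range(0, 120, 15))
glucose = [120, 100, 90, 80, 72, 65, 75, 90]
targets = label_targets(times, glucose)
```

In this example the reading at index 5 (65 mg/dL) is dropped because it is already hypoglycemic, and the readings from 15 to 60 minutes receive a positive hypoglycemia target because the 65 mg/dL reading falls within their 60-minute horizon.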

Modeling
To determine the added value of machine learning, we evaluate a baseline heuristic model, a logistic regression model, and a gradient boosted tree-based model for both hypoglycemia and hyperglycemia predictions. Our baseline heuristic model is equivalent to a glucose threshold alert (ie, predicting hypoglycemia or hyperglycemia within the next 60 minutes if interstitial fluid glucose levels fall below 110 mg/dL or rise above 240 mg/dL, respectively).
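The baseline heuristic amounts to a pair of fixed thresholds on the current reading. A minimal sketch, with a hypothetical function name and return format, is:

```python
def baseline_alert(glucose_mg_dl, low=110, high=240):
    """Threshold alert: flag risk within the next 60 minutes.

    The 110 and 240 mg/dL thresholds are those stated in the text; the
    dictionary return format is an illustrative choice.
    """
    return {"hypo": glucose_mg_dl < low, "hyper": glucose_mg_dl > high}


alerts = [baseline_alert(g) for g in (105, 150, 250)]
```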
Our logistic regression model aims to emulate basic CGM alerts which extrapolate linear trends along with thresholds to make hypoglycemia or hyperglycemia predictions. Logistic regression, despite its name, is a linear model which aims to classify binary outcomes through probabilities estimated by the logistic function. We implement our logistic regression model using Scikit-learn, optimized for ≤300 iterations using an L2 penalty with class weights inversely proportional to class frequencies. 30 Finally, we make use of the XGBoost framework to implement a tree-based machine learning algorithm. 31 XGBoost makes use of an ensemble of weak learners (ie, small trees) that are trained stagewise through gradient boosting. This reduces overfitting while preserving or lowering variance in the prediction error, 32 which frequently leads to gradient boosted trees outperforming other tree-based methods. In addition, XGBoost naturally deals with continuous, binary/discrete, and missing data consistently, all of which are represented in our data set. Model hyperparameters for our XGBoost models were selected using fivefold cross-validation of the training set only, using a sampler (Tree-structured Parzen Estimator) implemented with the Optuna library. 33 In Supplemental Table 1, we include the complete set of tuned hyperparameters along with the corresponding search ranges.
We separate our CGM data into a hold-out test set (25%) and a training set (75%). An individual's complete set of CGM data is allocated to only one of the training or test sets so that there is no data leakage or overfitting when evaluating model performance. Regardless, we note that randomly separating data (so that an individual's CGM data can be split between both train and test sets) only marginally improves model performance. Our supervised models (ie, logistic regression and XGBoost) learn from the training set, and all models are evaluated using the same test sample. Since we only filter data based on an individual's average usage over the prior week (≥80% to be included in the feature set), there are rare occurrences where we have insufficient data to compute trends (eg, mean, standard deviation [SD], consecutive changes) and have missing feature values for these entries. This most commonly occurs when an individual replaces their sensor, meaning that there is a significant gap in the prior hour's readings. To compare model performance fairly, we remove data with missing values from the test samples. We note, however, that including entries with missing values for the XGBoost models (which can natively deal with missing values, unlike the logistic regression models) does not result in a significant performance decrease. Overall, model performance was evaluated using the area under the receiver operating curve (AUROC) and average precision, along with fixed measures of specificity and sensitivity.
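A user-level split of the kind described above can be sketched as follows. This is an assumption-laden illustration: the function name and seed are hypothetical, and the 25% is applied here to the number of users, whereas the text does not state whether the fraction refers to users or to readings.

```python
import random


def split_by_user(user_ids, test_fraction=0.25, seed=0):
    """Assign every row of a given user to exactly one of train or test."""
    users = sorted(set(user_ids))
    rng = random.Random(seed)
    rng.shuffle(users)
    n_test = max(1, round(test_fraction * len(users)))
    test_users = set(users[:n_test])
    train_idx = [i for i, u in enumerate(user_ids) if u not in test_users]
    test_idx = [i for i, u in enumerate(user_ids) if u in test_users]
    return train_idx, test_idx


# Hypothetical per-row user identifiers
user_ids = ["u1"] * 3 + ["u2"] * 2 + ["u3"] * 4 + ["u4"] * 1
train_idx, test_idx = split_by_user(user_ids)
```

Because the shuffle operates on users rather than rows, no individual can contribute readings to both sets, which is the property that guards against leakage.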

Model Explainability
Historically, machine learning algorithms have been considered "black boxes" with little understanding of how predictions have been made. However, recent advances in explainability have led to individual predictions of tree-based algorithms being readily explainable. 34-36 TreeExplainer efficiently calculates Shapley (SHAP) values, 37 which aim to attribute payout (ie, the prize) between coalitional players of a game. In the context of machine learning, SHAP values amount to the marginal contribution (ie, change to the model prediction) of a feature among all possible coalitions (ie, combinations of features). Practically, this means that for every individual prediction (negative or positive), the relative importance of every feature can be evaluated.
There is a rich history of global interpretation for machine learning models, which summarizes the average overall importance of features on predictions as a whole. 38 In a medical setting, however, tailored explanations for individuals are paramount, maximizing the ability to understand their own data and ensure every person is evaluated fairly. 39 Shapley values are locally accurate (the values, together with the model's baseline expected output, add up to the actual prediction of the model), meaning that they can explain which features were relatively most important for an individual prediction (ie, a hypoglycemic or hyperglycemic event). In addition, Shapley values are consistent (a feature's attribution never decreases when the model changes so that the feature contributes more), meaning they can be compared across predictions and used to check the global importance of a feature. Feature importance can therefore be checked periodically by averaging over a fixed time period. Practically, this means that for a CGM user over a given time period, the most important features leading to a prediction of hypoglycemia or hyperglycemia can be automatically evaluated. This gives immediate insight about an individual's glucose control, and intuition about what may be increasing their risk. Presenting reliable predictions with intuitive explanations would enable users to be proactive in their control. Insightful control recommendations could empower users to feel closer to being on "autopilot" (ie, minimizing the cognitive load burden).
We choose to implement SHAP over other local explainer algorithms (eg, LIME) 40 since SHAP offers mathematical guarantees of trustworthiness (local accuracy, missingness, and consistency) which adhere to strict medical governance guidelines, 34 and offers consistency between local explanations, meaning global importance can be computed as well.
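To make the coalition-based definition concrete, the following sketch computes exact Shapley values by brute-force enumeration for a toy three-feature risk function. It is not TreeExplainer and not the study's model: the toy function, the feature values, and the use of a single reference ("background") point to represent absent features are all illustrative assumptions. Enumeration scales exponentially with the number of features, which is why efficient tree-specific algorithms are used in practice.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, background):
    """Exact Shapley values for one prediction.

    Features outside a coalition are replaced by `background` values
    (a single reference point; a simplifying assumption).
    """
    n = len(x)

    def value(coalition):
        z = [x[j] if j in coalition else background[j] for j in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_risk(z):
    """Hypothetical hypoglycemia risk score with one interaction term."""
    glucose, recent_drop, frac_time_low = z
    score = 0.0
    if glucose < 110:
        score += 0.5
    if recent_drop > 5:
        score += 0.2
    if glucose < 110 and recent_drop > 5:
        score += 0.2  # interaction shared between the two features
    if frac_time_low > 0.1:
        score += 0.05
    return score


x = [95, 8, 0.15]            # reading being explained (hypothetical)
background = [150, 2, 0.02]  # reference point (hypothetical)
phi = shapley_values(toy_risk, x, background)
```

Here the attributions sum to the difference between the prediction for `x` and the prediction for the reference point (local accuracy), and the interaction term is shared equally between the current glucose and the recent drop.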

Model Evaluation
In Figure 3, we compare the performance of our baseline heuristic model against the machine learning classifiers (ie, logistic regression and XGBoost). Performance is evaluated by the AUROC and average precision characteristics by comparing the model predictions of hypoglycemia (left) or hyperglycemia (right) up to 60 minutes ahead of time to the actual future readings. For hypoglycemia, the baseline model achieved an AUROC of 0.811, the logistic regression 0.930 (95% confidence interval [CI]: 0.929-0.931), and the XGBoost 0.998 (95% CI: 0.998-0.998) evaluated on our hold-out test set. In terms of average precision, the baseline model achieved 0.121, the logistic regression 0.244 (95% CI: 0.240-0.247), and XGBoost 0.953 (95% CI: 0.951-0.954). All CIs are estimated from bootstrapping (sampling with replacement) for 500 resamples per model.
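The evaluation metrics and the bootstrap procedure can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: the AUROC is computed from its pairwise-ranking definition, the average precision ignores tied scores, the percentile method is used for the interval, and the labels and scores are made up for the example.

```python
import random


def auroc(y_true, scores):
    """Probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean precision at the rank of each positive (ties not handled)."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


def bootstrap_ci(metric, y_true, scores, n_resamples=500, seed=0):
    """95% percentile interval from resampling with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if 0 < sum(ys) < n:  # both classes are needed to score a resample
            stats.append(metric(ys, [scores[i] for i in idx]))
    stats.sort()
    return stats[int(0.025 * n_resamples)], stats[int(0.975 * n_resamples) - 1]


# Hypothetical labels and predicted probabilities
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.2, 0.9]
auc = auroc(y_true, scores)              # 15 of 16 pairs ranked correctly
ap = average_precision(y_true, scores)
ci_low, ci_high = bootstrap_ci(auroc, y_true, scores)
```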
We find a clear advantage in using XGBoost; however, the logistic regression model also performs reasonably. We note that despite its crudeness, our baseline heuristic model is still predictive, demonstrating the usefulness of threshold-based alerts on CGM devices in forward planning. Regardless, a more powerful predictive model means a lower false-alarm rate can be achieved, while maintaining the safety of the predictions. Reducing alarm fatigue for CGM users is an important goal, and more skillful models help enable this. In Table 2, measures of model skill are given, including AUROC, average precision, sensitivity, and specificity. Sensitivity and specificity are evaluated from dichotomizing model predictions at probability P = .5. Again, we find a clear performance increase for our XGBoost model, in keeping with the high performance of decision tree-based methods 41 and commercial hybrid loop systems. 42 High performance is also seen for hyperglycemia, with the baseline model achieving an AUROC of 0.734, the logistic regression 0.862 (95% CI: 0.861-0.862), and XGBoost 0.989 (95% CI: 0.989-0.990). Average precision, sensitivity, and specificity demonstrate similar trends with XGBoost being the most skillful. For each modeling approach, we note that the model skill is lower for hyperglycemia prediction in comparison with hypoglycemia, suggesting prediction of lower glucose events is better suited to our modeling choices.
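Dichotomizing predicted probabilities at P = .5 and computing sensitivity and specificity can be sketched as below. The function name, the example values, and the choice to treat a probability of exactly .5 as a positive prediction are assumptions.

```python
def sensitivity_specificity(y_true, probabilities, threshold=0.5):
    """Sensitivity and specificity after thresholding probabilities."""
    predicted = [int(p >= threshold) for p in probabilities]
    tp = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(y_true, predicted) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(y_true, predicted) if y == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical labels and predicted probabilities
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
probabilities = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.2, 0.9]
sensitivity, specificity = sensitivity_specificity(y_true, probabilities)
```

In this example one true event (probability 0.35) is missed and no false alarms are raised, illustrating the trade-off between missed events and alert burden that the threshold controls.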

Model Explanation
In addition to increased predictive power, the added value of machine learning models can be demonstrated through explanations. Using SHAP we can evaluate the relative importance of features for a given positive prediction of hypoglycemia or hyperglycemia. SHAP is applied post model construction and therefore has no negative implications for performance. Figure 4 shows the overall relative importance of every input feature for predicting hypoglycemic (left panel) and hyperglycemic (right panel) events. The relative importance of a feature is quantified by the absolute average SHAP value. Since SHAP values are consistent across predictions, they can be averaged for individual CGM users, across any time range, to provide immediate insight.
Here, we provide the average relative importance for all CGM users in the study, but this diagram is trivially made for individual users. Unsurprisingly, the user's current glucose reading is most important for the model to make predictions of both hypoglycemia and hyperglycemia. Time of day is also important, providing insight into the sleep, eating, physical activity, and stress habits of the CGM user and their relationship with glycemic control. Sudden drops (or increases) in glucose are important for predicting hypoglycemia (hyperglycemia) as shown by the short-term largest decrease (increase) between readings. Interestingly, the long-term fraction of time low is found to be reasonably predictive of hypoglycemic events, providing immediate insight into certain users' control habits.
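Averaging absolute SHAP values per user, as described above, can be sketched as follows. The feature names, user identifiers, and SHAP values are made up for illustration, and the helper names are hypothetical.

```python
from collections import defaultdict


def rank_features(shap_rows, feature_names):
    """Rank features by mean absolute SHAP value over a set of predictions."""
    n = len(shap_rows)
    mean_abs = [
        sum(abs(row[j]) for row in shap_rows) / n
        for j in range(len(feature_names))
    ]
    return sorted(zip(feature_names, mean_abs), key=lambda pair: -pair[1])


def rank_per_user(user_ids, shap_rows, feature_names):
    """Group SHAP rows by user, then rank features within each user."""
    grouped = defaultdict(list)
    for user, row in zip(user_ids, shap_rows):
        grouped[user].append(row)
    return {u: rank_features(rows, feature_names) for u, rows in grouped.items()}


# Hypothetical SHAP values: one row per prediction, one column per feature
feature_names = ["current_glucose", "hour_of_day", "1h_largest_decrease"]
user_ids = ["a", "a", "b"]
shap_rows = [[0.4, -0.1, 0.2], [-0.6, 0.3, 0.1], [0.1, 0.5, -0.2]]
ranking = rank_per_user(user_ids, shap_rows, feature_names)
```

Restricting the rows to a given time range before grouping would give the per-period importance used for the dashboard-style summaries.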

User Interface
Despite CGM providing a wealth of information to both users and clinicians, the sheer volume of data makes it hard to quickly draw conclusions about glycemic control. Quick summary metrics such as the fraction of time-in-range (eg, 70 mg/dL < G < 270 mg/dL) are the baseline for assessing control. By considering the most predictive model features that led to predictions of hypoglycemic or hyperglycemic events, we can draw further personalized insights into an individual's glycemic control. In Figure 5, we present a prototype dashboard which summarizes a randomly selected user's CGM data over a given month, along with potential insights derived from explainable machine learning. In addition to metrics such as time above or below range, we provide the user's average glucose through the day, along with the most likely times for our model to predict hyperglycemia (red, above green line) or hypoglycemia (blue, below green line) for the individual. We select the top features for predicting both hypoglycemia and hyperglycemia for the user and summarize this information as control recommendations in the gray box. This provides a quick glance into the specifics of the user's glycemic control, enabling the user to be better informed to avoid potential events in the future. One Artificial Intelligence (AI) insight (gray box) for this user is that they tend to go high at specific times of day. Looking at the fraction of time spent high on the dashboard through the day (red box and histogram), this peaks around 21:00; hence, the user should consider insulin dosages around their evening meal.
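The hourly summaries underlying such a dashboard can be sketched as follows. This is an illustrative aggregation only: the function name, the output layout, and the example readings are assumptions, and the thresholds are the 70 and 270 mg/dL limits used throughout the text.

```python
from collections import defaultdict


def hourly_profile(hours, glucose, low=70, high=270):
    """Per-hour mean glucose and fraction of readings low or high."""
    buckets = defaultdict(list)
    for h, g in zip(hours, glucose):
        buckets[h].append(g)
    profile = {}
    for h, values in sorted(buckets.items()):
        n = len(values)
        profile[h] = {
            "mean": sum(values) / n,
            "frac_high": sum(g > high for g in values) / n,
            "frac_low": sum(g < low for g in values) / n,
        }
    return profile


# Hypothetical readings tagged with the hour of day (0-23)
hours = [21, 21, 21, 21, 3, 3]
glucose = [280, 300, 200, 290, 65, 100]
profile = hourly_profile(hours, glucose)
peak_high_hour = max(profile, key=lambda h: profile[h]["frac_high"])
```

The hour with the largest fraction of high readings (21 in this made-up example) is the kind of pattern that a recommendation about evening insulin dosing would draw on.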

Discussion
The key contributions of our work are as follows:
1. Machine learning models with state-of-the-art performance for predicting hypoglycemia (AUROC: 0.998) and hyperglycemia (AUROC: 0.989) up to 60 minutes in advance (44 minutes average event horizon). This performance is high relative to simple algorithms 43-45 and comparable machine learning approaches. 23,46
2. With careful feature engineering, we have demonstrated how machine learning explanations (SHAP) can be utilized to understand specifics about an individual's control. SHAP also adds transparency to model predictions, aiding assurance that all individuals are evaluated fairly.
3. A prototype dashboard to help young adults with T1D and clinicians make use of CGM data and the insight from machine learning explanations.
Technological advances represent a significant opportunity to help reduce the self-care burden on an individual with T1D, and reduce the risk of health complications arising from poor glycemic control. In particular, for young adults, automated feedback from CGM may be an important tool for reducing risk at times of transition (from pediatric to adult care units), when glycemic control can be at a minimum. Ahead-of-time machine learning predictions are of personal and clinical value as they give the CGM user more time to adjust self-care and reduce risk. Our tree-based model demonstrated a significant performance increase relative to threshold-based and linear models. This performance increase is vital for reducing alert burden on the user, since more certain predictions require fewer total alerts while maintaining safety of the device.
Despite the wealth of information provided by CGM devices, part of the problem is deriving quick insight that is useful for people with T1D, their family carers, and clinicians. 47,48 Machine learning explanations can help summarize what specifics in an individual's glycemic control led to increased risk of either hypoglycemia or hyperglycemia. Used in combination with directly derived metrics (eg, time-in-range), their utility can be in providing quick-glance, specific recommendations about how to reduce risk.

Limitations
Limitations of this work include the reliance on the user to comply in using the CGM device. For our results, we only generate predictions when the user has used the device for 80% of the prior week. While predictions can still be generated with a lower usage compliance, this will inevitably decrease prediction performance, and care must be taken about when machine learning enhancement can be implemented safely. Furthermore, while current CGM devices are generally accurate, they are not infallible and considerations must be made for the safety of systems reliant on their accuracy. 49 Another limitation of this study is the lack of insulin and carbohydrate data. Including this information could enable specific recommendations about insulin and carbohydrate dosages through the day. Including information tracked by smart watches, such as physical activity and stress levels, would not only improve predictions, but provide far more powerful intuitive recommendations. Having contextual information (eg, high stress levels or even self-reported event markers such as drinking, sickness, or exercise) would be critical for empathetic recommendations and reducing burden for the user.
Another limitation is the simplicity of our selected features to represent glycemic variability. Including additional metrics such as average daily risk range, 50 %coefficient of variation, 51 level 2 hypoglycemia (% of readings or time <54 mg/dL), level 2 hyperglycemia (% of readings or time >250 mg/dL), 52 and low/high blood glucose index 53,54 along with our baseline of standard deviation is likely to improve model performance. For example, reevaluating our XGBoost hypoglycemia model with the inclusion of %coefficient of variation (SD expressed as a percentage of the mean) and % of readings <54 mg/dL (over one hour, one day, and one week), we find an increase in average precision to 0.959 (up from 0.953) with AUROC remaining the same. However, including multiple measures of glycemic variability would confound our ability to clearly evaluate relative importance between feature types, and hence limit our ability to provide basic explanations. To maximize the utility of further predictive models, it will be important to include more complex features while preserving the ability to feed back specifics about an individual's glycemic control.
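The two additional metrics used in the reevaluation above follow directly from their definitions. A minimal sketch, assuming the population SD and made-up readings, is:

```python
from statistics import mean, pstdev


def percent_cv(readings):
    """%Coefficient of variation: SD expressed as a percentage of the mean."""
    return 100 * pstdev(readings) / mean(readings)  # population SD: an assumption


def percent_below(readings, threshold=54):
    """Percentage of readings below a threshold (default: level 2 hypoglycemia)."""
    return 100 * sum(1 for g in readings if g < threshold) / len(readings)


# Hypothetical window of readings in mg/dL
readings = [50, 60, 100, 140, 150]
cv = percent_cv(readings)
pct_level2_low = percent_below(readings)
```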
In this work, we chose to train and test hypoglycemia and hyperglycemia models using data from all CGM users in our cohort. In practice, it may be more suitable to train individual models per CGM user, which may be better tailored to the individual. However, it would be more complex to make direct comparisons between relative feature importance for different CGM users, and this is hence left outside the scope of this article.

Conclusion
We introduced a framework for high-performance prediction and explanation of hypoglycemia and hyperglycemia for young adults. Careful feature selection enables both accurate short-term risk prediction and intuitive feedback about an individual's glucose control. The key benefit of adopting a machine learning framework lies in the ability to provide more accurate ahead-of-time predictions (in comparison with more simplistic derived alerts), potentially reducing burden on young adults who may be going through a transition in their care practices. Combining these models with explanations enables both users and clinicians to gain immediate insight into an individual's glycemic control, automatically highlighting what specific trends lead to increased risk.

Figure 1. Distributions of age at enrollment (left) and age at original T1D diagnosis (right) for the 153 people with CGM data used in this study. Both distributions are stacked to show the breakdown of gender (left) and prior CGM usage (right) for the individuals. Abbreviations: CGM, continuous glucose monitoring; T1D, type 1 diabetes.

Figure 2. Schematic of interstitial fluid glucose levels (black line) for a young adult with T1D tracked by CGM. The gray-shaded region shows the desired glucose range (70 mg/dL < G < 270 mg/dL). Our algorithm aims to predict (ahead of time) when a person with T1D will go below (hypoglycemia) and above (hyperglycemia) this range. Regions of low and high glucose are shaded blue and red, respectively, with the corresponding first prediction event horizon (ie, when our model first made a positive prediction of hypo/hyper) shown by the dashed line. Abbreviations: T1D, type 1 diabetes; CGM, continuous glucose monitoring.

Figure 3. ROC (left) and average precision (right) for our models of hypoglycemia (blue; top row) and hyperglycemia (red; bottom row) predictions. In each panel, an XGBoost model (solid line) and a logistic regression model (dashed line) are given, and for ROC only are compared with a baseline heuristic (dotted line). For the ROC panels, a zero-skill model is also shown (solid gray line). The total area under each curve (ie, AUROC score or average precision) is given in the brackets. Abbreviations: ROC, receiver operating curve; AUROC, area under the receiver operating curve.

Figure 4. Overall importance ranking of input features for predicting hypo (left panel) and hyper (right panel) risk. Average (absolute) SHAP value for predictive features over all study participants. A higher value corresponds to a more important feature in decision-making. Features are grouped into categories (device information, demographics, short term [one hour], medium term [one day], long term [one week]). The fractional contribution (ie, sum over all features in that category) of a given category is given in the square brackets. Abbreviations: CGM, continuous glucose monitoring; SHAP, SHapley Additive exPlanations; HbA1c, glycosylated hemoglobin or hemoglobin A1c.

Figure 5. Prototype dashboard which summarizes the prior month of a user's CGM data and provides insights from explainable machine learning. (Upper left) Hourly average of glucose levels (solid green line) through the month. Upper and lower quartiles for each hour are shown by the green-shaded region. The hourly average for the fraction of time spent with high glucose (red, above axis) and low glucose (blue, below axis) is shown. (Upper right) Summary information about the user's CGM usage and glycemic control. (Bottom) Panel summarizing the AI control recommendations for the user over the past month. This was found by considering the most important features in the user's CGM data for predicting hypoglycemia or hyperglycemia. Abbreviations: CGM, continuous glucose monitoring; AI, Artificial Intelligence.

Table 1. Summary of Input Features Used by the Models to Make Predictions. A Subset of Features Are Computed for Various Time Ranges (ie, One Hour, One Day, One Week) and Considered as Independent Features.

Feature | Description | Time ranges
 |  | One hour, one day, one week
Largest increase between readings | Largest increase in glucose level between consecutive readings within specified time period | One hour, one day, one week
Largest decrease between readings | Largest decrease in glucose level between consecutive readings within specified time period | One hour, one day, one week
Maximum number of consecutive increases | Most consecutive readings where glucose levels increase over defined time period | One hour, one day, one week
Maximum number of consecutive decreases | Most consecutive readings where glucose levels decrease over defined time period | One hour, one day, one week

Abbreviations: CGM, continuous glucose monitoring; T1D, type 1 diabetes; HbA1c, glycosylated hemoglobin or hemoglobin A1c.

Table 2. Summary of Model Performance Metrics for Both Hypoglycemia and Hyperglycemia Predictions. A Baseline Heuristic, Logistic Regression, and an XGBoost Model Are Evaluated for Each Target. Summary Statistics (AUROC and Average Precision) Are Shown With 95% CI in Square Brackets. Sensitivity and Specificity Are Evaluated From Dichotomizing Model Predictions at Probability P = .5. Abbreviations: AUROC, area under the receiver operating curve; CI, confidence interval.