Key Summary Points

Why carry out this study?

Continuous glucose monitoring is a valuable tool in type 1 diabetes management, but handling of the massive amount of data generated is a major challenge, possibly leading to hypoglycemic events being overlooked.

We have tested the hypothesis that a machine learning (ML) model can be trained to identify hypoglycemic events and the most likely underlying root cause.

What was learned from the study?

ML algorithms can be trained to identify and automatically present the most likely underlying root cause for hypoglycemic events.

Implementation of ML algorithms in clinical practice holds great potential for improving diabetes care.

Introduction

Type 1 diabetes (T1D) is a chronic disease that affects approximately 5.5 million people globally. Individuals diagnosed with T1D remain fully dependent on exogenous insulin for the rest of their lives. However, whereas healthy individuals benefit from a fine-tuned regulation of endogenous insulin secretion, exact insulin dosing in individuals with T1D is difficult and has to be based on multiple tests and decisions every day. Insufficient insulin doses cause hyperglycemia and, in the long term, an increased risk of complications such as cardiovascular disease, blindness or renal failure. Overestimated insulin doses, however, may cause hypoglycemia with risks of arrhythmia, brain damage and sudden death. Studies have shown that 6–10% of all deaths among patients with T1D are caused by hypoglycemia [1,2,3]. The long-term complications, on the other hand, may reduce life expectancy in individuals with T1D by almost 10 years [4, 5].

Traditionally, insulin dosing has been based on frequent capillary blood glucose testing. However, the technology for continuous and flash glucose monitoring (CGM/FGM) has improved dramatically over the years and has become a valuable and essential tool in diabetes care. In addition, CGM/FGM has improved the everyday life of many patients with T1D due to the ease of use and convenience of these technologies [6]. However, the vast amount of data generated is difficult to handle, both for the individual patient and for healthcare professionals and the entire healthcare system. Despite the widespread use of CGM/FGM technology in routine care, utilization and application of the data in terms of providing professional advice to patients on how to improve metabolic control are still low [7]. At present, CGM/FGM data largely have to be interpreted manually, and even consensus statements for healthcare professionals rely on manual interpretations and basic measurements [7]. Considering the challenge of interpreting data for the individual, the unmet need for the healthcare system and the costs related to the use of CGM/FGM (approx. €1500/patient/year), the development of methods to utilize CGM/FGM data more efficiently is essential.

Machine learning (ML) algorithms have been used in previous efforts to analyze glucose data to either predict or identify anomalies. Extensive efforts have also focused on prediction models based on fuzzy logic and/or ML models for application to hybrid- and closed-loop insulin pumps [8,9,10]. However, these models focus directly on forecasting glucose levels in the near future and do not identify the underlying root causes of glucose excursions. Detailed analysis of root causes would be of great benefit to healthcare professionals in terms of providing advice to the vast majority of patients, as well as for those patients who do not use hybrid- or closed-loop pumps.

In an effort to improve the potential of CGM/FGM data utilization in diabetes care, we manually labeled a large number of hypoglycemic events from adult and pediatric patients with T1D. A subset of the manually collected labels was cross-referenced using recordings of insulin doses and carbohydrate intake. Based on this dataset, we tested the hypothesis that an ML model can be trained to identify the most likely root causes of hypoglycemic events. As a “field test”, the model was benchmarked against the interpretations of a board of five clinical diabetes specialists.

Methods

The study reported here is based on retrospective CGM/FGM and clinical data which were collected at the Department of Endocrinology and Diabetology and the Pediatric Clinic of Uppsala University Hospital, Sweden. The study was conducted in accordance with the Declaration of Helsinki of 1964 and its subsequent amendments and approved by the Swedish Ethical Review Authority (Dnr 2019-03726). In accordance with the decision of this Authority, the retrospective data were collected without informed consent.

Data Collection and Preparation

Retrospective CGM/FGM data were collected from 449 patients with T1D (mean number of registration days: 208 ± 4 days; n = 390 adults, n = 59 children/adolescents). Since the total dataset contained data from different types of CGM and FGM sensors, the recordings were downsampled to one reading per 15 min (four recordings per hour) in order to harmonize the data. A total of 42,120 hypoglycemic events were identified in the total dataset, of which 5041 events were selected randomly and then stratified based on patient number and the time of day of the event in order to generate a representative dataset for manual classification. The root cause of these events was manually interpreted based only on the glucose data and time of day. The dataset was split into a training dataset (n = 4026 events, n = 304 patients) and an internal validation dataset (n = 1015 events, n = 145 patients). In addition, a separate dataset was generated from 22 patients (14 adults, 8 children/adolescents) for whom data were available on manually registered insulin dosages and carbohydrate intake. This latter dataset contained 13 ‘known’ patients, i.e. overlapping with the initial dataset, and nine ‘unknown’ patients who had not been included in either the prior training or validation dataset; it was used as a final validation of the ML models. Hypoglycemic events were defined as all episodes with a registered glucose level < 3.5 mmol/L. For each event, the duration of the event and the minimum glucose level were defined, and a time series of the preceding and subsequent 6 h from the first recorded level of < 3.5 mmol/L was generated (12 h in total). Clinical data, including age, gender, disease duration and type of treatment (i.e. pump or multiple daily injection [MDI]), were collected. Descriptive data are presented in Table 1.
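As a minimal illustration (hypothetical code, not the study's actual pipeline), the event-extraction step described above can be sketched as follows; glucose is assumed to be already resampled to 15-min intervals, and events whose 12 h window would extend past the series boundaries are simply skipped:

```python
HYPO_THRESHOLD = 3.5   # mmol/L, per the study definition
WINDOW = 6 * 4         # 6 h of 15-min samples on each side of the onset

def extract_hypo_events(glucose):
    """Return (onset_index, duration_steps, min_glucose, window) per event.

    `glucose` is a list of readings at 15-min intervals. Skipping
    boundary-clipped windows is an assumption of this sketch, not the
    study's documented handling.
    """
    events = []
    i, n = 0, len(glucose)
    while i < n:
        if glucose[i] < HYPO_THRESHOLD:
            start = i
            # advance to the end of the contiguous hypoglycemic episode
            while i < n and glucose[i] < HYPO_THRESHOLD:
                i += 1
            duration = i - start
            low = min(glucose[start:i])
            if start - WINDOW >= 0 and start + WINDOW <= n:
                window = glucose[start - WINDOW:start + WINDOW]
                events.append((start, duration, low, window))
        else:
            i += 1
    return events
```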

Table 1 Descriptive data of adult and pediatric patients with type 1 diabetes included in this study

Manual Labeling of Hypoglycemic Events

Based on the data available from CGM/FGM sensors (i.e. time of day and glucose levels), three main causes of hypoglycemia were deemed possible to interpret and later validate based on clinical recordings of insulin doses and carbohydrate intake: (1) overestimation of insulin bolus at meal; (2) overcorrection of hyperglycemia; and (3) excessive basal insulin pressure. A number of external factors can also either directly cause hypoglycemia or contribute to it, such as physical activity; however, since these factors could not be supported by the recorded data, they were not assigned as main causes for hypoglycemia.

The manual classification of the stratified 5041 hypoglycemic events was performed by two medical doctors (P-OC, a specialist in Endocrinology and Diabetology, and DE, a specialist in Internal Medicine subspecializing in Endocrinology and Diabetology). From the separate validation dataset (n = 22 patients), 597 hypoglycemic events were stratified and manually labeled by the same two evaluators who were blinded to the insulin and carbohydrate data. Of these 597 events, there were 293 events with manually registered insulin and carbohydrate intake. In addition, there were 142 events lacking insulin and carbohydrate data; these were initially labeled as caused by excessive basal insulin pressure, and since the lack of data supports the initial labeling, these events were also included in the final ‘ground-truth’ dataset. These 435 events (n = 293 + 142) were then re-labeled by the same evaluators but now with access to the insulin and carbohydrate data, resulting in a test–retest conformity of 87%. This dataset, containing a mix of known and unknown patients, was considered as the ‘ground-truth’ and was used as the final performance test for the ML model.

Clinical Expert Board

A clinical expert board consisting of five clinicians (two specialists in Pediatrics, HH and IWJ, and three specialists in Endocrinology and Diabetology, JCC, JH and NA) was assigned 298 hypoglycemic events from the separate validation dataset, which they labeled independently based on their clinical judgment. Events judged not to belong to one of the three root causes (e.g. suspected compression lows) were left unclassified. After providing their clinical interpretation, the same doctors were asked to re-evaluate the same events but with access to insulin and carbohydrate data. As for the initial evaluators, events without manually registered insulin doses and carbohydrate intake which were initially classified as caused by excessive basal insulin pressure were also included. Since the number of events interpreted as excessive basal insulin pressure varied between the evaluators, combining the answers from all five evaluators resulted in 122 events. This dataset was used as a final ‘ground-truth’ benchmark performance test of the ML model to evaluate not only its ability to learn, but also its performance in light of how a board of clinical experts interprets hypoglycemic events.

Comparison of Different Supervised ML Models

A number of different ML model architectures were applied and evaluated on the training data. Based on previously published reports on time-series data, we especially evaluated the potential of models based on convolutional neural networks (CNN). For comparison purposes, we included a linear model (logistic regression model) and tree-based models (Random Forest and XGBoost). The ML models evaluated were:

  • Logistic regression model with a root mean squared error loss function and an L2 penalty.

  • Random Forest model: a tree-based ensemble algorithm which uses the concept of weak learners and voting to increase predictive power and robustness.

  • XGBoost: a gradient-boosted tree-based algorithm that has been widely adopted.

  • Convolutional autoencoder: a two-stage model trained as follows. First, an encoder–decoder neural network is trained to reconstruct its input. The middle of the network is a dense layer with a dimension substantially lower than the length of the time series, which forces the model to encode the input data in a lower dimension while keeping important characteristics. This step does not require labels and is trained on the full unlabeled dataset, with the events chosen for the test set removed from training. Second, the trained encoder from the first step is extracted and reused in a new model for the original classification task, which is then fine-tuned on the labeled dataset.

  • Recurrent neural net: a type of neural network optimized for handling sequential data.

  • RandOm Convolutional Kernel Transform (ROCKET) [11]: a CNN-based ensemble model with fast training times, considered one of the state-of-the-art models.

  • InceptionTime [12] (modified for shorter time series): a CNN-based neural network with convolutional filters of multiple sizes. It provides a general framework for time series classification that has been shown to perform well in previous applications.

  • HypoCNN: our custom built CNN-based model.
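As an illustration of the comparison protocol, the two simpler baselines can be sketched with scikit-learn stand-ins on synthetic data (the CNN architectures, including HypoCNN, are omitted here, and this is not the study's actual code; data shapes and the injected class signal are placeholders):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 48))      # placeholder 12 h glucose windows (48 samples)
y = rng.integers(0, 3, size=600)    # three root-cause classes
X[y == 0, :24] += 1.0               # inject a weak, class-specific signal

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # unweighted one-vs.-rest macro AUC, the ranking metric used in the study
    scores[name] = roc_auc_score(y_va, model.predict_proba(X_va), multi_class="ovr")
```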

Our custom built HypoCNN model achieved the best performance (Table 2) and was further tested and optimized as described in the remainder of this article.

Table 2 Comparison of different supervised machine learning models

Impact of Data Set Size

To investigate whether the amount of manually labeled data was adequate, we examined how the dataset size affected model performance. The HypoCNN model was therefore trained on subsets of the training dataset of varying size (100–4026 events), while the validation dataset was kept in its original setup in order to evaluate the outcome.
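This kind of learning-curve experiment can be sketched as follows, here with synthetic windows and a trivial nearest-centroid classifier standing in for HypoCNN (illustrative only):

```python
import random

random.seed(0)

def make_event(cls):
    # synthetic 48-sample glucose window with a class-dependent offset
    return [random.gauss(cls * 0.5, 1.0) for _ in range(48)]

data = [(make_event(c), c) for c in (0, 1, 2) for _ in range(500)]
random.shuffle(data)
train, valid = data[:1200], data[1200:]     # fixed validation split

def centroid_accuracy(train_subset, valid_set):
    # nearest-centroid classifier: one mean vector per class
    centroids = {}
    for c in (0, 1, 2):
        rows = [x for x, label in train_subset if label == c]
        centroids[c] = [sum(col) / len(rows) for col in zip(*rows)]

    def predict(x):
        return min(centroids, key=lambda c: sum((a - b) ** 2
                                                for a, b in zip(x, centroids[c])))

    return sum(predict(x) == label for x, label in valid_set) / len(valid_set)

# growing training subsets evaluated against the fixed validation set
curve = [(n, centroid_accuracy(train[:n], valid)) for n in (50, 200, 800, 1200)]
```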

Impact of Additional Data

A number of clinical characteristics can potentially impact glucose control. By using ML models, we could incorporate a large number of parameters into our interpretation of glucose excursions. Hence, we examined how the addition of basic descriptive clinical information impacted the model performance. Included features were: hour of the day; gender (male, female); age (years); disease duration (years); treatment modality (MDI/insulin pump); and time of day for basal insulin (if MDI).

Masking the Time Series

To test the robustness of our HypoCNN model, we set up an experiment in which the window of time prior to and after the hypoglycemic events was altered. The experiment was conducted using three different time windows: (1) 6 h before the hypoglycemic event and 6 h after the hypoglycemic event (i.e. reference); (2) 6 h before the hypoglycemic event only; and (3) 3 h before the hypoglycemic event only.

The evaluation set was held constant but was prepared in the same way as the training data with respect to these three time windows.
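Assuming 48 samples per event (12 h at 15-min resolution) with the onset at index 24, the three masking variants can be sketched as follows (an illustrative implementation, with zero chosen here as the masking value):

```python
ONSET = 24  # index of the first sample below 3.5 mmol/L

def mask_window(window, variant):
    """Zero out the samples outside the variant's visible range."""
    if variant == "full":        # [-6 h, +6 h], the reference setup
        visible = range(0, 48)
    elif variant == "pre6":      # [-6 h, 0 h] only
        visible = range(0, ONSET)
    elif variant == "pre3":      # [-3 h, 0 h] only
        visible = range(ONSET - 12, ONSET)
    else:
        raise ValueError(variant)
    return [v if i in visible else 0.0 for i, v in enumerate(window)]
```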

Model Interpretation

To better understand how the HypoCNN model interprets the hypoglycemic events, we examined the model using feature attributions. Briefly, a feature attribution method provides insights on which part of the time series contributes positively (or negatively) to the model’s classification output. We estimated attribution vectors using expected gradients, as introduced by Erion et al. [13].
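As a simplified illustration of the attribution idea (occlusion-based rather than the expected-gradients method actually used), each time step's contribution to a class score can be estimated by replacing it with a baseline value and recording the change:

```python
def occlusion_attribution(model_score, window, target_class, baseline=0.0):
    """Per-time-step attribution by single-sample occlusion.

    `model_score(window, target_class)` is any callable returning the
    model's score for the given class; positive attribution means the
    sample contributed positively to that score.
    """
    base = model_score(window, target_class)
    attributions = []
    for i in range(len(window)):
        perturbed = list(window)
        perturbed[i] = baseline          # occlude one time step
        attributions.append(base - model_score(perturbed, target_class))
    return attributions
```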

Statistical Analysis

Descriptive patient data were compared between the adult and pediatric cohorts using an unpaired two-tailed Student’s t-test for parametric data (age and disease duration), and Fisher’s exact test was applied for comparing the gender distribution and the proportion of patients using MDI/pump. Values are given as means ± standard deviation (SD). P values < 0.05 were considered significant.

Each model was ranked on the average validation area under the curve (AUC) based on ten runs with different initializations. Data are presented as multiclass AUC receiver operating characteristics (ROC) curves in which performance for each specific class as well as the unweighted AUC average are plotted. True positive rate is plotted on the Y-axis and the false positive rate is plotted on the X-axis. Confusion matrices were used to plot the accuracy for each class for all data and when applying an 80% confidence threshold.
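The unweighted (macro) one-vs.-all multiclass AUC used for ranking can be written out in plain Python via the rank-sum formulation (an illustrative re-implementation, not the study's code):

```python
def binary_auc(scores, labels):
    """Binary AUC via the rank-sum (Mann-Whitney U) formulation."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    # count pairs where the positive outranks the negative (ties count 0.5)
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_ovr_auc(probas, labels, n_classes=3):
    """Unweighted average of the one-vs.-rest AUCs across classes."""
    aucs = []
    for c in range(n_classes):
        scores = [p[c] for p in probas]
        binarized = [1 if l == c else 0 for l in labels]
        aucs.append(binary_auc(scores, binarized))
    return sum(aucs) / n_classes
```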

Results

Patient Characterization

Descriptive data of the included patients are presented in Table 1. As expected, the pediatric patients were younger and had a shorter disease duration. There were slightly more male patients than female patients in the adult cohort. More patients used an insulin pump in the pediatric cohort than in the adult cohort (54% vs. 28%). Of the n = 5041 hypoglycemic events, the majority were interpreted to be caused by excessive basal insulin (44%), with the two other classes being fairly equally distributed: 27% of hypoglycemic events were estimated to be caused by overestimated bolus and 27% by overcorrection of hyperglycemia.

Hypoglycemia Interpretation—Clinical Expert Board

The overall conformity of the five evaluators based on only CGM/FGM data when compared to the original two evaluators ‘ground truth’ data (n = 196 events) ranged from 65% to 71%, with an overall conformity based on majority decision of 73%. Next, the clinical expert board was asked to label the same events again, but this time with access to insulin and carbohydrate data, resulting in a test–retest conformity ranging from 69% to 99%, with an average of 78% (median 74%). The conformity of the clinical expert board in this dataset (n = 122 events) when compared to the original two evaluators ranged from 65% to 71%, with an overall conformity based on majority decision of 73%, i.e. overall conformity was almost identical to the results without insulin and carbohydrate data.

Comparison of Different Supervised ML Models

The CNN models displayed the overall best performance metrics, with our custom HypoCNN model achieving the highest average AUC score (0.918). Surprisingly, the linear model performed well despite its linearity assumption and simple structure. The results from the model comparison are summarized in Table 2. Reported performance is average validation AUC over ten runs.

Impact of Data Set Size

The impact of dataset size was evaluated by gradually increasing the size of the training dataset while keeping the validation dataset intact. Increasing the size of the training dataset initially improved performance, whereas performance approached convergence asymptotically once the number of samples reached n = 4026, implying that a further increase in dataset size would only marginally improve performance (Fig. 1).

Fig. 1
figure 1

Impact of dataset size on performance of the HypoCNN model. The impact of dataset size was evaluated by gradually increasing the number of manually interpreted training data samples while keeping the validation dataset intact. The bar plot diagram shows median and variability for a 100-fold resampling per dataset size. The performance of our HypoCNN model seems to approach convergence asymptotically, implying that a further increase of the dataset size would only marginally improve its performance. AUC Area under the curve, ROC receiver operating characteristics

Impact of Additional Data and Masking the Time Series

The inclusion of descriptive clinical information in the models resulted in a slight increase in performance only when information for the time of day was included. Interestingly, the inclusion of other factors did not improve the model performance.

Masking the data after the onset of the hypoglycemia (keeping the [− 6 h, 0 h] portion) slightly improved performance, whereas restricting the input data to [− 3 h, 0 h] reduced the model's performance (Electronic Supplementary Material Table 1). Imposing class weights (penalizing the model differently for misclassifying different classes) resulted in minor improvements for small class weights, whereas large class weights led to reduced performance (results not shown).

Model Interpretation

The input feature attributions of the HypoCNN model (time series only) are depicted in Fig. 2. Interestingly, the time steps after the hypoglycemic event were of little or no importance to the HypoCNN model, which reflects and supports the results from the experiment in which the time series were masked.

Fig. 2
figure 2

Visualization of the model’s interpretation of hypoglycemic events. Expected gradient attributions for hypoglycemic events based on our HypoCNN model. Columns 1–3 (left to right) represent the three different root causes of hypoglycemia, and each sample is repeated across the three columns. The sample time steps are colored based on the attribution vector for the given class. Red dots depict intervals that contribute positively to the class, whereas blue dots depict a negative contribution. The size of each dot reflects the magnitude of the contribution. The correct cause is highlighted with a green frame, and for those events that were not correctly predicted by the model a red frame depicts the model’s prediction. In column 4 (rightmost column), the probabilities the model assigns to each class are depicted, with 0 = excessive basal pressure, 1 = overcompensation and 2 = bolus at meal. The columns are green if the model predicted the event correctly, and red otherwise

Model Performance

We found that masking the time series, adding time features and using class weights in combination improved the overall performance of our HypoCNN model. Incorporating these modifications into our primary model setup, we obtained an average AUC of 0.921 (95% confidence interval [CI] 0.917–0.924) in the original train/test split of the data (Fig. 3A). Next, the best model was tested on a separate validation set consisting of 597 samples. The macro-average one-vs.-all AUC achieved in this dataset was 0.919 (95% CI 0.912–0.924) (Fig. 3B), which is close to the performance in the original train/test dataset. When testing the model in the subset of data from the original two evaluators for which there were also manually recorded insulin and carbohydrate data (n = 435 events), i.e. ‘ground truth’, our HypoCNN model achieved an AUC of 0.917 (95% CI 0.913–0.920). In addition, we tested the best model performance against the ‘ground-truth’ validation dataset from the clinical expert board (n = 122 events), and in this ‘benchmark’ test our HypoCNN model achieved an overall AUC of 0.939 (Fig. 3C). When plotting the data in confusion matrices with accuracy data for each class, we found that the “basal pressure” class was the easiest to predict, whereas the “bolus at meal” class was the most difficult. The average accuracy of our HypoCNN model for all classes was 78% (n = 597 events) (Fig. 4A), whereas if an 80% confidence level threshold was applied the model’s average accuracy was 92% (n = 301 events) (Fig. 4B).
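The confidence-threshold filtering underlying the second accuracy figure can be sketched as follows (illustrative only): events whose maximum class probability falls below the threshold are excluded before accuracy is computed.

```python
def thresholded_accuracy(probas, labels, threshold=0.8):
    """Accuracy over events whose top class probability >= threshold.

    Returns (accuracy, number_of_events_kept); accuracy is None when no
    event clears the threshold.
    """
    kept = []
    for p, label in zip(probas, labels):
        confidence = max(p)
        if confidence >= threshold:
            kept.append((p.index(confidence), label))  # (prediction, truth)
    if not kept:
        return None, 0
    correct = sum(pred == label for pred, label in kept)
    return correct / len(kept), len(kept)
```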

Fig. 3
figure 3

Performance of the HypoCNN model. A Performance based on the original train/test split validation dataset (n = 1015 hypoglycemic events), which resulted in an average AUC of 0.921 (95% confidence interval [CI] 0.917–0.924) based on 10 repetitions. B Performance of the best setup HypoCNN model based on the separate validation dataset consisting of hypoglycemic events (n = 597) from patients who had never previously been seen (n = 22) resulted in an average AUC of 0.919 (95% CI 0.912–0.924). C Performance of the best setup HypoCNN model against the benchmark test of ‘ground-truth’ validation dataset from the clinical expert board (n = 122 events) resulted in an overall AUC of 0.939. Dotted lines represent micro- and macro-AUC, and colored lines represent each specific class (green = excessive basal pressure, yellow = overcompensation and blue = bolus at meal)

Fig. 4
figure 4

Confusion matrices of performance score. Confusion matrices which depict the performance score for each respective class of hypoglycemia for the best-setup of our HypoCNN model. Y-axis = the true label based on the separate ‘ground-truth’ validation dataset and the X-axis = the predicted label. A Performance for all events (n = 597), resulting in an average performance of 78%. B Performance score only for those events which could be determined with a confidence level > 80% (n = 301), resulting in an overall performance of 92%

Discussion

The introduction of CGM/FGM has been a technical revolution and an important catalyst for the ongoing digitalization of diabetes care. At the individual patient level, the use of CGM/FGM simplifies the monitoring of glycemic control, but it also increases the number of decisions that need to be made on a daily basis. At the healthcare sector level, CGM/FGM has led to a dramatic increase in data that are underused due to the lack of automated data analysis. The digital tools needed to fully utilize the potential of CGM/FGM, i.e. to turn the data into clinical knowledge and improved treatment support and regimes, are still lacking. In this study, we have investigated the potential of ML models as a tool to identify the underlying root cause of hypoglycemic events. The results reveal that many models can achieve an acceptable overall average accuracy score. However, the models differ in their accuracy for each specific root cause.

From our comparison of models, we found that convolution-based models (ROCKET, InceptionTime and our custom HypoCNN) performed best. It is noteworthy that, in addition to the good performance of the convolution-based models, the linear model also performed reasonably well, despite its assumption of linearity and simple structure. This could, however, partially be explained by overfitting on the “excessive basal insulin pressure” class. Our custom built HypoCNN model outperformed the previously established state-of-the-art models ROCKET and InceptionTime, which supports the notion that there is a need for models adapted to specific tasks. We believe that our HypoCNN model could be further improved by optimizing the neural network architecture. In addition, by applying active learning approaches, the model could be specifically trained on the events that were hardest to classify in our study and hence increase the overall performance. The results of the experiment in which the dataset size was gradually increased suggest that the amount of training data used in this study is approaching an asymptotic limit. Although increasing the labeled dataset size (and thus increasing the work needed from the clinicians) may lead to improved performance, we believe the marginal gain is negligible in this application. As expected, the clinical interpretations between different evaluators differed, and there was a lack of consensus regarding the interpretation of CGM/FGM data, which makes the task more challenging from an ML perspective. In order to tackle the challenges presented by different interpretations of data, we assigned a ‘ground-truth’ dataset from the clinical evaluator board. This dataset was based on a majority decision only for those events for which there were registered insulin doses and carbohydrate intake, or for which the event lacked insulin and carbohydrate data and was interpreted as an “excessive basal insulin” hypoglycemia.
By applying this approach we believe that, given the retrospective nature of the data, the clinical interpretations are as close to a true answer as possible. Based on these ‘ground-truth’ data, we found that the model’s performance is similar to what we observed with the initial labeling, which suggests that the initial labeling was acceptable for model development. Interestingly, the addition of insulin and carbohydrate data did not greatly impact the interpretations of the clinical expert board (average test–retest accuracy 78%). However, the study was not designed to assess the impact of additional data on the clinician’s interpretation of CGM/FGM data, and this result should therefore be interpreted with caution. The ability to train a model that is not dependent on insulin and carbohydrate data is of great importance, given that in clinical practice most patients on MDI do not register data on insulin doses and/or carbohydrates, and that there is also great uncertainty regarding individual experience in estimating carbohydrates as well as the timing of registrations of both insulin and carbohydrates [14]. However, a prospective study with consistent registration of insulin and carbohydrate data and, perhaps even more importantly, real-time interpretations of glycemic excursions by the individual users, followed by verification by clinical experts, would greatly enhance the precision of the answers. This could, in addition, also make it possible to register other factors preceding the hypoglycemic event which could be of value in terms of avoiding similar events in the future. Moreover, an extensive prospective dataset could be an important means for establishing clinical consensus guidelines for the interpretation of CGM/FGM data.

Despite hypoglycemia being of major clinical importance, a uniformly accepted definition is still lacking. We applied a cutoff level of < 3.5 mmol/L since all measurements were based on CGM/FGM data without additional information on hypoglycemia-related symptoms. All events below this level can be clinically relevant [15, 16]; a glucose alert level of < 4.0 mmol/L is included in guidelines and consensus statements to allow time for the individual to take appropriate action and prevent severe hypoglycemia. Glucose levels < 3.0 mmol/L are then defined as the second level of hypoglycemia [17, 18]. Although we did not apply this approach in the current study, all hypoglycemic events could easily be sorted based on minimum glucose levels, i.e. to single out level 2 hypoglycemic events (< 3.0 mmol/L). Also, by applying a 3.9 mmol/L cutoff level for the manual interpretation, the model could be re-trained to also classify events in the range of 3.5–3.9 mmol/L.

Interestingly, the addition of supplemental clinical information did not impact the model's performance. With the exception of time of day, factors such as gender, age, disease duration or treatment modality (MDI/pump) had only a modest impact on the model’s performance. However, it cannot be excluded that other types of input data, especially continuous health parameters such as physical activity, would enhance the model. Weighting in contributing factors to hypoglycemia from other classes, as well as other parameters such as physical activity and dietary regimes, would most likely increase the precision of the model. The ability to handle large amounts of data and contributing parameters is one of the great advantages of ML models, making them highly suitable for interpreting complex events of a multifactorial nature.

The model interpretation based on feature attributions provides another interesting perspective on the decision mechanisms of the deep neural net model we ultimately opted for (HypoCNN). Feature attributions can provide valuable information for explaining the predictions of hypoglycemic root causes, especially if combined with additional continuous health data.

The results of the initial training of the algorithm indicate that the method developed in this study generalizes well to unseen data, given that the initial validation was conducted only with ‘unknown’ patients. From a performance perspective, this strategy is a rather conservative approach, since the model is only evaluated on its ability to classify new patients correctly. In real-world use, we would have a mix of known and unknown patients, as included in our ‘benchmark’ test. However, a broader test cohort is required to ensure model robustness and reliability in other patient populations in which treatment regimens can differ.

Conclusion

Overall, the findings of our study support the notion that ML models can be used to identify root causes of glycemic excursions. In the foreseeable future, ML models could become a valuable tool for developing reliable automated analysis systems for managing CGM/FGM data, which would be of great importance for improving diabetes care.