Machine Learning Models for Blood Glucose Level Prediction in Patients With Diabetes Mellitus: Systematic Review and Network Meta-Analysis

Background: Machine learning (ML) models give patients with diabetes mellitus (DM) more options for managing blood glucose (BG) levels. However, because numerous types of ML algorithms exist, choosing an appropriate model is vitally important.

Objective: In a systematic review and network meta-analysis, this study aimed to comprehensively assess the performance of ML models in predicting BG levels. In addition, we assessed ML models used to detect and predict adverse BG (hypoglycemia) events by calculating pooled estimates of sensitivity and specificity.

Methods: PubMed, Embase, Web of Science, and Institute of Electrical and Electronics Engineers (IEEE) Xplore databases were systematically searched for studies on predicting BG levels and predicting or detecting adverse BG events using ML models, from inception to November 2022. Studies that assessed the performance of different ML models in predicting or detecting BG levels or adverse BG events in patients with DM were included. Studies with no derivation or performance metrics of ML models were excluded. The Quality Assessment of Diagnostic Accuracy Studies tool was applied to assess the quality of the included studies. Primary outcomes were the relative ranking of ML models for predicting BG levels in different prediction horizons (PHs) and pooled estimates of the sensitivity and specificity of ML models in detecting or predicting adverse BG events.

Results: In total, 46 eligible studies were included in the meta-analysis. Regarding ML models for predicting BG levels, the mean absolute root mean square errors (RMSEs) in PHs of 15, 30, 45, and 60 minutes were 18.88 (SD 19.71), 21.40 (SD 12.56), 21.27 (SD 5.17), and 30.01 (SD 7.23) mg/dL, respectively. The neural network model (NNM) showed the highest relative performance across PHs. Furthermore, the pooled estimates of the positive likelihood ratio and the negative likelihood ratio of ML models were 8.3 (95% CI 5.7-12.0) and 0.31 (95% CI 0.22-0.44), respectively, for predicting hypoglycemia and 2.4 (95% CI 1.6-3.7) and 0.37 (95% CI 0.29-0.46), respectively, for detecting hypoglycemia.

Conclusions: Statistically significant high heterogeneity was detected in all subgroups, with different sources of heterogeneity. For predicting precise BG levels, the RMSE increases as the PH lengthens, and the NNM shows the highest relative performance among all the ML models. Meanwhile, current ML models have sufficient ability to predict adverse BG events, whereas their ability to detect adverse BG events needs to be enhanced.

Trial Registration: PROSPERO CRD42022375250; https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=375250


Introduction
Diabetes mellitus (DM) has become one of the most serious health problems worldwide [1], with more than 463 million (9.3%) patients in 2019; this number is predicted to reach 700 million (10.9%) by 2045 [2], which has resulted in growing concerns about the negative impact on patients' lives and the increasing burden on the health care system [3]. Furthermore, previous studies have shown that without appropriate medical care, DM can lead to multiple long-term complications in blood vessels, eyes, kidneys, feet (ulcers), and nerves [4][5][6][7]. Adverse blood glucose (BG) events are among the most common short-term complications and include hypoglycemia (BG <70 mg/dL) and hyperglycemia (BG >180 mg/dL). Hyperglycemia in patients with DM may lead to lower limb occlusions and extremity nerve damage, which can progress to decay, necrosis, and local or whole-foot gangrene, even requiring amputation [8,9]. Hypoglycemia can cause serious symptoms, including anxiety, palpitation, and confusion in mild cases and seizures, coma, and even death in severe cases [10,11]. Thus, there is an urgent need to prevent adverse BG events.
Machine learning (ML) models use statistical techniques to give computers the ability to complete tasks by training themselves without being explicitly programmed [12]. However, using ML models to manage BG requires large amounts of BG data, a need that cannot be met by the few data points generated by traditional finger-stick glucose meters [13]. With the introduction of the continuous glucose monitoring (CGM) device, which typically produces a BG reading every 5 minutes throughout the day, BG data sets have become large enough for use in ML models [14].
Recently, there has been an immense surge in using ML technologies to predict DM complications. Regarding BG management, previous studies have developed different types of ML models, including random forest (RF) models, support vector machines (SVMs), neural network models (NNMs), and autoregression models (ARMs), using CGM data, electronic health records (EHRs), electrocardiograph (ECG) and electroencephalograph (EEG) signals, and other information (ie, biochemical indicators, insulin intake, exercise, and meals) [10,[15][16][17][18][19][20]. However, the performance of different models in these studies was inconsistent. For instance, in terms of BG level prediction, Prendin et al [21] showed that the SVM achieved a lower root mean square error (RMSE) than the ARM, while Zhu et al [22] showed a different result.
Therefore, this meta-analysis aimed to comprehensively assess the performance of ML models in BG management in patients with DM.

Search Strategy and Study Selection
The study protocol has been registered in the international prospective register of systematic reviews (PROSPERO; registration ID: CRD42022375250). Studies on BG levels or adverse BG event prediction or detection using ML models were eligible, with no restrictions on language, investigation design, or publication status. PubMed, Embase, Web of Science, and Institute of Electrical and Electronics Engineers (IEEE) Xplore databases were systematically searched from inception to November 2022. Keywords used for the repository searches were ("machine learning" OR "artificial intelligence" OR "logistic model" OR "support vector machine" OR "decision tree" OR "cluster analysis" OR "deep learning" OR "random forest") AND ("hypoglycemia" OR "hyperglycemia" OR "adverse glycemic events") AND ("prediction" OR "detection"). Details regarding the search strategies are summarized in Multimedia Appendix 1. Manual searches of the reference lists of relevant studies were also performed.

Selection Criteria
Inclusion criteria were as follows: (1) participants in the studies were diagnosed with DM; (2) study endpoints were hypoglycemia, hyperglycemia, or BG levels; (3) the studies established at least 2 types of ML models for prediction of BG levels and at least 1 type of ML model for prediction or detection of adverse BG events; (4) the studies reported the performance of ML models with statistical or clinical metrics; (5) the studies contained the development and validation of ML models; and (6) study outcomes were means (SDs) of performance metrics of test data for prediction of BG levels and sensitivity and specificity of test data for prediction or detection of adverse BG events. Exclusion criteria were as follows: (1) studies did not report on the derivation of ML models; (2) studies were based only on physiological or control-oriented ML models; (3) studies could not reproduce true positives, true negatives, false negatives, and false positives for prediction or detection of adverse BG events; (4) studies were reviews, systematic reviews, animal studies, or irretrievable and repetitive papers; and (5) studies had unavailable full text or outcome metrics.
Authors KL and LYL independently screened and selected studies based on the aforementioned criteria. Authors KL and YM extracted and recorded the data from the selected studies. Conflicts were resolved by reaching a consensus. The study strictly followed the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) statement (Multimedia Appendix 2) [23][24][25].

Data Extraction and Management
Two reviewers independently carried out data extraction and quality assessment. If a single study included more than 1 extractable test result for the same ML model, the best result was extracted. If a single study included 2 or more models, the performance metrics of each model were extracted. For studies predicting BG levels, RMSEs based on different prediction horizons (PHs) were extracted. For studies predicting or detecting adverse BG events, the sensitivity, specificity, and precision needed to reproduce the 2×2 contingency table were extracted.
Specifically, the following information was extracted:

• General characteristics: first author, publication year, country, data source, and study purpose (ie, predicting or detecting hypoglycemia)

• Experimental information: participants (type of DM, type 1 or 2), sample size (patients, data points, and hypoglycemia events), demographic information, models, study place and time, model parameters (ie, input and PHs), model performance metrics, threshold of BG levels for hypoglycemia, and reference (ie, finger-stick)
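The contingency-table reconstruction described above can be sketched as follows. This is an illustrative reconstruction of a 2×2 table from reported metrics; the function name and rounding convention are assumptions, not the authors' actual extraction code:

```python
def contingency_table(sensitivity, specificity, n_events, n_non_events):
    """Reconstruct TP, FN, TN, and FP counts from reported sensitivity,
    specificity, and the numbers of adverse BG events and non-events.
    Counts are rounded to the nearest integer, so small rounding error is
    possible when the reported metrics are themselves rounded."""
    tp = round(sensitivity * n_events)        # true positives
    fn = n_events - tp                        # false negatives
    tn = round(specificity * n_non_events)    # true negatives
    fp = n_non_events - tn                    # false positives
    return tp, fn, tn, fp

# Hypothetical example: 100 hypoglycemia events and 900 non-events,
# with sensitivity 0.71 and specificity 0.91
tp, fn, tn, fp = contingency_table(0.71, 0.91, 100, 900)  # (71, 29, 819, 81)
```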

Methodological Quality Assessment of Included Studies
The Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool was applied to assess the quality of the included studies based on patient selection (5 items), index test (3 items), reference standard (4 items), and flow and timing (4 items). All 4 domains were used to assess the risk of bias, and the first 3 domains were used to assess applicability concerns. Each domain contains 1 query relating to the risk of bias or applicability, for a total of 7 questions [26].

Data Synthesis and Statistical Analysis
The performance metrics of ML models used to predict BG levels, predict adverse BG events, and detect adverse BG events were assessed independently. The performance metrics were the RMSE of ML models in predicting BG levels and the sensitivity and specificity of ML models in predicting or detecting adverse BG events. For BG level-based studies, a network meta-analysis was conducted to assess global and local inconsistency between studies, and the surface under the cumulative ranking (SUCRA) curve of every model was plotted to calculate relative ranks. For event-based studies, pooled sensitivity, specificity, the positive likelihood ratio (PLR), and the negative likelihood ratio (NLR) with 95% CIs were calculated. Study heterogeneity was assessed by calculating I² values based on multivariate random-effects meta-regression that considered within- and between-study correlation and classifying them into quartiles (0% to <25% for low, 25% to <50% for low-to-moderate, 50% to <75% for moderate-to-high, and ≥75% for high heterogeneity) [27,28]. Furthermore, meta-regression was used to evaluate the source of heterogeneity for both BG level-based and adverse event-based studies. The summary receiver operating characteristic (SROC) curve of every model was also used to evaluate the overall sensitivity and specificity. Publication bias was assessed using the Deeks funnel plot asymmetry test.
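For intuition, the likelihood ratios used throughout this analysis follow directly from sensitivity and specificity. Below is a minimal sketch of that relationship, not the meta-analytic software used in the study (which pools across studies before computing ratios, so pooled values differ slightly from this naive calculation):

```python
def likelihood_ratios(sensitivity, specificity):
    """Positive and negative likelihood ratios of a diagnostic test:
    PLR = sens / (1 - spec); NLR = (1 - sens) / spec."""
    plr = sensitivity / (1 - specificity)
    nlr = (1 - sensitivity) / specificity
    return plr, nlr

# With the pooled sensitivity (0.71) and specificity (0.91) for predicting
# hypoglycemia, the naive ratios are close to the pooled PLR of 8.3 and
# NLR of 0.31 reported in the Results
plr, nlr = likelihood_ratios(0.71, 0.91)  # ≈ (7.89, 0.32)
```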

Quality Assessment of Included Studies
The quality assessment results using the QUADAS-2 tool showed that more than half of the included studies did not report the patient selection criteria in detail, which led to low quality in the patient selection domain (Figure 2). Furthermore, diagnosis of hypoglycemia using blood samples or the CGM device was considered high quality for the reference standard in our study.
For PH=30 minutes, the mean RMSE was 21.40 (SD 12.56) mg/dL. Statistically significant inconsistency was detected using the inconsistency test (χ²=87.11, P<.001), as shown in the forest plot in Multimedia Appendix 1. Meta-regression indicated that I² for the RMSE was 60.75%, and the source of heterogeneity analysis showed that place and validation type were statistically significant (P<.001). The maximum SUCRA value was 99.1 for the dilated recurrent neural network (DRNN) model, with a mean RMSE of 7.80 (SD 0.60) mg/dL [22], whereas the minimum SUCRA value was 0.4 for 1 symbolic model, with a mean RMSE of 71.4 (SD 21.9) mg/dL [49]. The relative ranks of the ML models are shown in Table 4, and the SUCRA curves are shown in Figure 4A. Publication bias was tested using the Egger test (P=.503), indicating no significant publication bias.
For PH=60 minutes, 4 (8.7%) studies [50,51,55] with 17 different ML models were included, and the network map is shown in Figure 3B. The mean RMSE was 30.01 (SD 7.23) mg/dL. Statistically significant inconsistency was detected using the inconsistency test (χ²=8.82, P=.012), as shown in the forest plot in Multimedia Appendix 3. Meta-regression indicated that none of sample size, reference, place, validation type, and model type was a source of heterogeneity. The maximum SUCRA value was 97.8 for the GluNet model, with a mean RMSE of 19.90 (SD 3.17) mg/dL [51], while the minimum SUCRA value was 4.5 for the decision tree (DT) model, with a mean RMSE of 32.86 (SD 8.81) mg/dL [55]. The relative ranks of the ML models are shown in Table 5, and the SUCRA curves are shown in Figure 4B. No significant publication bias was detected using the Egger test (P=.626).
For PH=15 minutes, 3 (6.5%) studies [20,49,55] with 14 different ML models were included, and the network map is shown in Figure 3C. The mean RMSE was 18.88 (SD 19.71) mg/dL. Statistically significant inconsistency was detected using the inconsistency test (χ²=28.29, P<.001), as shown in the forest plot in Multimedia Appendix 4. Meta-regression showed that I² was 41.28% and that model type and sample size were both sources of heterogeneity (P=.002 and P=.037, respectively). The maximum SUCRA value was 99.1 for the ARTiDe jump neural network (ARJNN) model, with a mean RMSE of 9.50 (SD 1.90) mg/dL [49], while the minimum SUCRA value was 0.3 for the SVM, with a mean RMSE of 13.13 (SD 17.30) mg/dL [55]. The relative ranks of the ML models are shown in Table 6, and the SUCRA curves are shown in Figure 4C. Statistically significant publication bias was detected using the Egger test (P=.003).
For PH=45 minutes, only 2 (4.3%) studies [54,55] with 11 different ML models were included, and the network map is shown in Figure 3D. The mean RMSE was 21.27 (SD 5.17) mg/dL. Statistically significant inconsistency was detected using the inconsistency test (χ²=6.92, P=.009), as shown in the forest plot in Multimedia Appendix 5. Meta-regression indicated significant heterogeneity from the model type (P=.006). The maximum SUCRA value was 99.4 for the NNM, with a mean RMSE of 10.65 (SD 3.87) mg/dL [55], while the minimum SUCRA value was 26.3 for the DT model, with a mean RMSE of 23.35 (SD 6.36) mg/dL [55]. The relative ranks of the ML models are shown in Table 7, and the SUCRA curves are shown in Figure 4D. Statistically significant publication bias was detected using the Egger test (P<.001).

Principal Findings
This meta-analysis systematically assessed the performance of different ML models in enhancing BG management in patients with DM based on 46 eligible studies. Comprehensive evidence obtained via exhaustive searching allowed us to assess the overall ability of the ML models in different scenarios, including predicting BG levels, predicting adverse BG events, and detecting adverse BG events.

Comparison to Prior Work
The RMSE of ML models for predicting BG levels increased as the PH increased from 15 to 60 minutes, indicating that the longer the PH, the larger the prediction error. Based on the results of relative ranking, among all the ML models for predicting BG levels, neural network-based models, including the DRNN, GluNet, ARJNN, and NNM, achieved the minimum RMSE and the maximum SUCRA in different PHs, indicating the highest relative performance. In contrast, the DT achieved the maximum RMSE and the minimum SUCRA in PHs of 60 and 45 minutes, indicating the lowest relative performance. Thus, for predicting BG levels, neural network-based algorithms might be an appropriate choice. We found that time domain features combined with historical BG levels as input can further improve the performance of NNM algorithms [49,55]. However, the quality of training data for NNMs needs to be high; therefore, the requirements during data collection and preprocessing of raw data are high [22,51].
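The RMSE metric underlying these rankings has a simple definition, sketched below with toy values rather than data from any included study:

```python
import math

def rmse(predicted, observed):
    """Root mean square error (mg/dL) between predicted and observed BG levels."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("inputs must be nonempty and equal in length")
    return math.sqrt(
        sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)
    )

# Toy example: three 30-minute-ahead predictions vs CGM readings (mg/dL)
error = rmse([100, 120, 140], [110, 115, 150])  # ≈ 8.66 mg/dL
```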
Regarding ML models for predicting adverse BG events, the pooled sensitivity, specificity, PLR, and NLR were 0.71 (95% CI 0.61-0.80), 0.91 (95% CI 0.87-0.94), 8.3 (95% CI 5.7-12.0), and 0.31 (95% CI 0.22-0.44), respectively. According to the Users' Guides to the Medical Literature, with regard to diagnostic tests [69], a PLR of 5-10 should moderately increase the probability of a person having or developing a disease, and an NLR of 0.1-0.2 should moderately decrease the probability of having or developing a disease after taking the index test. Hence, current ML models have relatively sufficient ability to predict the occurrence of hypoglycemia, especially RF algorithms, with a PLR of 13.9 (95% CI 10.1-18.9) and an NLR of 0.14 (95% CI 0.08-0.22). In contrast, although the PLR of NNM algorithms was 5.9 (95% CI 3.2-10.8), their sensitivity and NLR were 0.50 (95% CI 0.16-0.84) and 0.54 (95% CI 0.24-1.21), respectively, which is far from satisfactory. Although RF algorithms seem able to capture the complex, nonlinear patterns affecting hypoglycemia [56], it was still not possible to determine which algorithm shows the best performance, as the test scenarios were quite different and there was high heterogeneity between studies.
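To make these likelihood ratios concrete, they can be converted to post-test probabilities via pretest odds. The following is a generic worked example with an assumed pretest probability, not patient data from the included studies:

```python
def post_test_probability(pretest_probability, likelihood_ratio):
    """Bayes' theorem in odds form: post-test odds = pretest odds x LR."""
    pretest_odds = pretest_probability / (1 - pretest_probability)
    post_test_odds = pretest_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)

# Assume a 10% pretest probability of hypoglycemia. A positive prediction
# (pooled PLR 8.3) raises it to about 48%; a negative prediction
# (pooled NLR 0.31) lowers it to about 3%.
p_positive = post_test_probability(0.10, 8.3)   # ≈ 0.48
p_negative = post_test_probability(0.10, 0.31)  # ≈ 0.03
```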
Regarding ML models for detecting hypoglycemia, the pooled sensitivity, specificity, PLR, and NLR were 0.74 (95% CI 0.70-0.78), 0.70 (95% CI 0.56-0.81), 2.4 (95% CI 1.6-3.7), and 0.37 (95% CI 0.29-0.46), respectively, which indicates that these algorithms generate only small changes in probability [69]. Nevertheless, this does not mean that ML models combined with ECG or EEG monitoring, which we found in 13 of 17 studies, should not be investigated further. For patients with both DM and cardiovascular risk, or patients under intensive care and in a coma, combining ML models with ECG or EEG signals might help avoid deficits in physical and cognitive function and death caused by hypoglycemia [70].

Strengths and Limitations
The study has several limitations. First, although we developed a comprehensive search strategy, potentially relevant studies may still have been missed. To further increase the rate of literature retrieval, we included the main medical databases with a feasible search strategy, including PubMed, Embase, Web of Science, and IEEE Xplore, and references from relevant studies were also screened for eligibility to avoid omissions. Second, statistically significant high heterogeneity was detected in all subgroups, with different sources of heterogeneity among studies, including different types of DM, ML models, data sources, reference indexes, time and setting of data collection, and thresholds of hypoglycemia. To address this issue, hierarchical analysis and meta-regression analysis were carried out in different subgroups to explore the possible sources of heterogeneity. Furthermore, for several studies that provided no required outcome measures or had inconsistent outcome measures, relevant estimation methods were used to calculate the indicators, which might have introduced a certain amount of estimation error. However, the estimation error was small enough to be acceptable owing to the use of appropriate estimation methods, and the results of this study were thereby enriched. Future studies are nevertheless required to report all relevant outcome measures for further evaluation.

Future Directions
In the future, more accurate ML models will be used for BG management, which will certainly improve the quality of life of patients with DM and reduce the burden of adverse BG events. First, as mentioned before, current ML models have relatively sufficient ability to predict BG levels and hypoglycemia, and it must still be emphasized that an extended PH is beneficial because it increases the time available for patients and clinicians to respond [15]. Hence, future studies should focus on enhancing the performance of ML models in longer PHs (ie, 60 minutes). Second, most of the raw data from CGM devices are highly imbalanced due to the low incidence of adverse BG events, which may lead to several performance distortions. Previous studies have reported several approaches to reduce the data imbalance, including oversampling [71] and cost-based learning [15]. However, to the best of our knowledge, few studies have investigated the effectiveness of those approaches in BG management models, which needs to be studied further in the future. Furthermore, the high variability of BG levels in the human body due to several factors, such as meal intake, high-intensity exercise, and insulin dosage, creates challenges for ML models; thus, future works need to integrate these factors with existing models to further enhance their accuracy [22,51]. It is also necessary to consider computational complexity and convenience of use for patients and physicians. Moreover, several studies have implied that a combination of ML models and features extracted from CGM profiles can achieve better predictability than an ML model alone [15,56]. Recently, studies have focused on more novel deep learning models, such as transformers, which have also proved clinically useful [72]. Therefore, further studies that focus on optimizing the structure of an ensemble method are needed to explore more models with new structures. Lastly, it should be mentioned that although several studies have
achieved high performance using relatively small data sets [29,31,32,35,39,47,57], which can reduce the difficulty in model development, this also raises the concern of whether it will decrease the generalization ability of the models. Most of the models were developed and tested with a certain data set, and few have been prospectively validated in a clinical setting. Therefore, they need to be applied in clinical practice and updated, as needed, to provide real-time feedback for the automatic collection of BG levels and generate a basis for prompt medical intervention [73].
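One of the imbalance-reduction approaches mentioned above, oversampling, can be sketched in its simplest random form. This is a naive illustration under assumed data shapes; published work typically uses more refined variants (eg, SMOTE) rather than plain duplication:

```python
import random

def oversample_minority(dataset, seed=0):
    """Naive random oversampling for an imbalanced BG data set.
    dataset: list of (features, label) pairs, where label 1 marks the rare
    adverse BG event class and label 0 the non-event class. Minority
    examples are duplicated at random until the classes balance."""
    rng = random.Random(seed)
    minority = [row for row in dataset if row[1] == 1]
    majority = [row for row in dataset if row[1] == 0]
    if not minority:
        raise ValueError("no minority-class examples to oversample")
    while len(minority) < len(majority):
        minority.append(rng.choice(minority))
    return majority + minority

# Hypothetical example: 5 non-events and 2 hypoglycemia events
balanced = oversample_minority(
    [((i,), 0) for i in range(5)] + [((9,), 1), ((8,), 1)]
)
# The result has 10 rows: 5 of each class
```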

Conclusion
In summary, in predicting precise BG levels, the RMSE increases as the PH lengthens, and the NNM shows the highest relative performance among all the ML models. Meanwhile, according to the PLR and NLR, current ML models have sufficient ability to predict adverse BG (hypoglycemia) events, while their ability to detect adverse BG events needs to be enhanced. Future studies should focus on improving performance and applying ML models in clinical practice [70,73].

Figure 1. Flow diagram of identifying and including studies. IEEE: Institute of Electrical and Electronics Engineers.

Figure 2. Quality assessment of included studies. Risk of bias and applicability concerns graph (A) and risk of bias and applicability concerns summary (B).

Figure 5. Sensitivity and specificity forest plots of ML models for predicting adverse BG events. The horizontal lines indicate 95% CIs. The square markers represent the effect value of a single study, and the diamond marker represents the combined results of all studies. The vertical line shows the line of no effect. BG: blood glucose; ML: machine learning.

Figure 6. SROC curves of all ML algorithms (A), NNM algorithms (B), RF algorithms (C), SVM algorithms (D), and ensemble learning algorithms (E) for predicting adverse BG events. The hollow circles represent results of individual studies, and the red diamonds represent the summary result of all studies. AUC: area under the curve; BG: blood glucose; ML: machine learning; NNM: neural network model; RF: random forest; SROC: summary receiver operating characteristic; SVM: support vector machine.

Figure 7. Sensitivity and specificity forest plots of NNM algorithms (A), RF models (B), SVM algorithms (C), and ensemble learning algorithms (D) for predicting adverse BG events. The horizontal lines indicate 95% CIs. The square markers represent the effect value of a single study, and the diamond marker represents the combined results of all studies. The vertical line shows the line of no effect. BG: blood glucose; NNM: neural network model; RF: random forest; SVM: support vector machine.

Figure 8. Sensitivity and specificity forest plots of ML models for detecting adverse BG events. The horizontal lines indicate 95% CIs. The square markers represent the effect value of a single study, and the diamond marker represents the combined results of all studies. The vertical line shows the line of no effect. BG: blood glucose; ML: machine learning.

Figure 9. SROC curves of all ML algorithms (A), NNM algorithms (B), and SVM algorithms (C) for detecting adverse BG events. The hollow circles represent results of individual studies, and the red diamonds represent the summary result of all studies. AUC: area under the curve; BG: blood glucose; ML: machine learning; NNM: neural network model; SROC: summary receiver operating characteristic; SVM: support vector machine.

Figure 10. Sensitivity and specificity forest plots of NNM algorithms (A) and SVM algorithms (B) for detecting adverse BG events. The horizontal lines indicate 95% CIs. The square markers represent the effect value of a single study, and the diamond marker represents the combined results of all studies. The vertical line shows the line of no effect. BG: blood glucose; NNM: neural network model; SVM: support vector machine.

Table 1. Baseline characteristics of BG a level-based studies (N=10).

a BG: blood glucose. b PH: prediction horizon. c CGM: continuous glucose monitoring. d Not applicable. e T1DM: type 1 diabetes mellitus. f NNM: neural network model. g ARM: autoregression model.

Table 2. Baseline characteristics of studies predicting adverse BG a events (N=19).

Table 3. Baseline characteristics of studies detecting adverse BG a events (N=17).

Table 4. Relative ranks of ML a models for predicting BG b levels in PH c =30 minutes.

Table 5. Relative ranks of ML a models for predicting BG b levels in PH c =60 minutes.

Table 6. Relative ranks of ML a models for predicting BG b levels in PH c =15 minutes.

Table 7. Relative ranks of ML a models for predicting BG b levels in PH c =45 minutes.

a ML: machine learning. b BG: blood glucose. c PH: prediction horizon. d SUCRA: surface under the cumulative ranking. e SVM: support vector machine. f DT: decision tree. g RF: random forest. h XGBoost: Extreme Gradient Boosting. i NNM: neural network model.