Comparison of Machine Learning Methods and Conventional Logistic Regressions for Predicting Gestational Diabetes Using Routine Clinical Data: A Retrospective Cohort Study

Background: Gestational diabetes mellitus (GDM) contributes to adverse pregnancy and birth outcomes. In recent decades, extensive research has been devoted to the early prediction of GDM by various methods. Machine learning methods are flexible prediction algorithms with potential advantages over conventional regression. Objective: The purpose of this study was to use machine learning methods to predict GDM and compare their performance with that of logistic regressions. Methods: We performed a retrospective, observational study including women who attended their routine first hospital visits during early pregnancy and had Down's syndrome screening at 16-20 gestational weeks in a tertiary maternity hospital in China from January 1, 2013, to December 31, 2017. A total of 22,242 singleton pregnancies were included, and 3182 (14.31%) women developed GDM. Candidate predictors included maternal demographic characteristics and medical history (maternal factors) and laboratory values at early pregnancy. The models were derived from the first 70% of the data and validated with the remaining 30%. Eight common machine learning methods (GBDT, AdaBoost, LGB, Logistic, Vote, XGB, Decision Tree, and Random Forest) and two common regressions (stepwise logistic regression and logistic regression with restricted cubic splines, RCS) were trained on the same candidate variables to predict the occurrence of GDM. Models were compared on discrimination and calibration metrics. Results: In the validation dataset, the machine learning and logistic regression models performed moderately (AUC 0.59-0.74). Overall, the GBDT model performed best (AUC 0.74, 95% CI 0.71-0.76) among the machine learning methods, with negligible differences between them. Fasting blood glucose, HbA1c, triglycerides, and BMI contributed strongly to GDM.
A cutoff point of 0.3 for the predictive value in the GBDT model had a negative predictive value of 74.1% (95% CI 69.5%-78.2%) and a sensitivity of 90% (95% CI 88.0%-91.7%), and a cutoff point of 0.7 had a positive predictive value of 93.2% (95% CI 88.2%-96.1%) and a specificity of 99% (95% CI 98.2%-99.4%). Conclusion: In this study, we found that several machine learning methods did not outperform logistic regression in predicting GDM. We developed a model with cutoff points for risk stratification of GDM.


Background
Gestational diabetes mellitus (GDM) is a disease in which carbohydrate intolerance develops during pregnancy [1]. GDM affects approximately 14.8% of pregnant mothers in China [2], and the prevalence of GDM is increasing worldwide [3]. Women with GDM undergo metabolic disruption and placental dysfunction [4], increasing the risks for preeclampsia and cesarean delivery [5]. Hyperglycemia and placental dysfunction adversely affect fetal development, increasing the risks of birth trauma, macrosomia, preterm birth, and shoulder dystocia [6,7]. The mother with GDM and her offspring are both more likely to develop obesity, type 2 diabetes mellitus, and cardiovascular disease than those without GDM [8,9].
Early diagnosis and intervention should decrease the incidence of GDM and lower adverse pregnancy outcomes [10,11]. However, based on most guidelines, most GDM cases are diagnosed between 24 and 28 gestational weeks by an oral glucose tolerance test (OGTT) [12,13], which may miss the optimal window for intervention, as fetal and placental development have already occurred by this point [14]. Universal diagnosis by OGTT at early pregnancy has been suggested [15], but this is costly and inefficient because, in most cases, GDM manifests during mid-to-late pregnancy [16]. Overall, early prediction is needed and may be valuable. Developing a simple method that uses existing clinical data at early pregnancy to quantify a woman's risk of developing GDM would help to identify high-risk mothers in need of early diagnosis, monitoring, and therapy, and would obviate universal OGTTs for low-risk women [10].
Recent prediction models for GDM have been developed using conventional regression analyses [17][18][19]. However, machine learning, a data analysis technique that develops algorithms to predict outcomes by "learning" from data, is increasingly emphasized as a competitive alternative to regression analysis. Moreover, machine learning has the potential to outperform conventional regression, possibly owing to its ability to capture nonlinearities and complex interactions among multiple predictive variables [20]. Despite this, only four studies [21][22][23][24] have used machine learning algorithms to predict GDM, and none of them compared their performance with that of logistic regressions.
In this study, we aimed to use machine learning methods to develop a model incorporating data on maternal characteristics and biochemical tests to predict the presence of GDM and to compare their performance with that of traditional logistic regression models. We hypothesized that machine learning algorithms would outperform traditional logistic regression models in terms of discrimination and calibration.

Study Population and Data Source
2.1.1. Study Setting. A single-center, retrospective cohort study was conducted to derive and validate a model with cutoff points for the prediction of GDM. Eligible subjects were women with singleton pregnancies who had records of serum samples collected before 24 gestational weeks for prenatal biochemical examination and Down's syndrome screening and who later delivered at the Obstetrics and Gynecology Hospital of Fudan University from 2013 to 2017. All women with multiple pregnancies or previous diabetes were excluded. Informed consent was obtained from all the participants, and the study protocol was approved by the Ethics Committee of Gynecology and Obstetrics Hospital of Fudan University.

Predictive Variables.
Predictor variables included medical history, clinical assessments, ultrasonic screening data, biochemical data on liver, renal, and coagulation function at the prenatal visit, and data from Down's syndrome screening. In total, 104 variables were assessed. Briefly, medical history included a history of diabetes, previous pregnancies, and family history. Clinical assessment included maternal age, educational status, smoking, body mass index (BMI), and parity. Biochemical tests at the first prenatal visit were performed after fasting for at least 8 h. Down's syndrome screening was performed between 16 and 20 gestational weeks, and ultrasound screening was performed between 24 and 28 gestational weeks.
2.1.3. Outcomes. The primary outcome was GDM, diagnosed according to the 2010 IADPSG criteria [12]. Briefly, GDM was defined by a 75 g OGTT at 24 to 28 gestational weeks, and the diagnosis was established if any single glucose concentration met or exceeded its threshold: a fasting value of 5.1 mmol/L, a 1 h value of 10.0 mmol/L, or a 2 h value of 8.5 mmol/L.
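The "any threshold met or exceeded" rule above can be sketched as a small function (function and key names are illustrative, not from the study's code):

```python
# IADPSG 2010 thresholds (mmol/L) for the 75 g OGTT: GDM is diagnosed
# if ANY of the three glucose values meets or exceeds its threshold.
IADPSG_THRESHOLDS = {"fasting": 5.1, "1h": 10.0, "2h": 8.5}

def is_gdm(fasting: float, one_hour: float, two_hour: float) -> bool:
    """Return True if any OGTT glucose value meets or exceeds its cutoff."""
    values = {"fasting": fasting, "1h": one_hour, "2h": two_hour}
    return any(values[k] >= IADPSG_THRESHOLDS[k] for k in values)
```

A single elevated value suffices; all three values being just below threshold yields a negative diagnosis.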

Study Objectives and Strategies.
The primary objective was to compare the performance of various machine learning models and conventional logistic regression models by discrimination and calibration. The secondary objective was to identify an optimal model with one cutoff point (at or above which) to predict the presence of GDM and another cutoff point (at or below which) to predict the absence of GDM.

Model Development
2.3.1. Overview. We derived and validated a model for the prediction of GDM by means of a two-phase approach (development and validation). The GDM cases were randomly split into development (70%) and validation (30%) cohorts. The women without GDM were randomly downsampled to approximately a 1:1 ratio with the GDM group to obtain balanced data and were split in the same way. Thus, in the development phase, we used data from 4900 participants (2181 with GDM and 2719 without GDM) to derive a model and its cutoffs to predict the presence of GDM, and these cutoffs were validated using data from 2100 additional participants (1001 with GDM and 1099 without GDM) (Figure 1).
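A minimal sketch of this sampling scheme, assuming participants are represented by identifiers (the function name, identifiers, and seed are illustrative):

```python
import random

def split_and_balance(gdm_ids, non_gdm_ids, dev_frac=0.7, seed=42):
    """Split GDM cases 70/30 into development/validation cohorts, and
    randomly downsample non-GDM women toward a 1:1 ratio with the cases
    before splitting them the same way."""
    rng = random.Random(seed)
    gdm = gdm_ids[:]
    rng.shuffle(gdm)
    # Downsample controls to about the number of cases (1:1 ratio).
    controls = rng.sample(non_gdm_ids, k=min(len(gdm_ids), len(non_gdm_ids)))
    n_dev_g = int(len(gdm) * dev_frac)
    n_dev_c = int(len(controls) * dev_frac)
    development = gdm[:n_dev_g] + controls[:n_dev_c]
    validation = gdm[n_dev_g:] + controls[n_dev_c:]
    return development, validation
```

In the study itself the ratio was only approximately 1:1 (2181 vs. 2719 in development), so the exact downsampling fraction evidently differed slightly from this sketch.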

Data
Processing.
First, we extracted information for each specific index from the records and converted the descriptions to numerical variables. Some numerical variables were processed hierarchically and transformed into categorical variables; for example, patient biochemical tests were classified into categories, namely, normal and abnormal test results. Additionally, variables pertaining to patient characteristics were converted to numerical variables.
Second, missing data were quantified: a participant was excluded if more than 40% of her data were missing; otherwise, missing values were filled in with the mean of the observed values (mean filling). Third, the Pearson correlation coefficient between each pair of features was calculated to judge multicollinearity. A feature was eliminated if the absolute value of its correlation coefficient with a retained feature was higher than 0.75, as this indicated strong collinearity.
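The mean filling and the pairwise correlation filter can be sketched as follows (function names, the dict-of-columns layout, and the keep-first-of-a-pair rule are illustrative assumptions):

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den if den else 0.0

def drop_collinear(features, threshold=0.75):
    """features: dict of name -> column of values. A feature is kept only
    if its |r| with every already-kept feature is at most the threshold."""
    kept = []
    for name in features:
        if all(abs(pearson(features[name], features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in column]
```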
Data standardization is a basic step in data mining. Features often have different scales and units, which can affect the results of the analysis; standardization eliminates this dimensional effect and makes features comparable, placing each index on the same order of magnitude. In the present study, deviation (min-max) standardization was used to normalize each continuous variable to the range 0-1.
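The deviation (min-max) standardization described above maps each value by x' = (x − min) / (max − min); a minimal sketch (the handling of constant columns is an assumption):

```python
def min_max_normalize(column):
    """Rescale a continuous feature to [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(column), max(column)
    if hi == lo:                      # constant column: map every value to 0.0
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]
```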

Development Phase.
Eight machine learning algorithms and two conventional logistic regressions were used to develop predictive models. All model tuning was performed in the development dataset using ten-fold cross-validation. Multiple methods were used to derive models from the same set of data, and results for recall, precision, and F-measure were obtained.

Conventional Logistic Regression Models.
Two conventional logistic regression models were compared in this study. The first model was fit with each variable entered linearly. The second model used restricted cubic splines (RCS) with three knots to allow for nonlinearity. In both models, maternal age, BMI, and previous GDM were always included, regardless of statistical significance, because they have been reported to be associated with GDM. Other predictor variables were retained in the model if they differed significantly between the GDM and control groups (P < 0.05).
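A restricted cubic spline with three knots adds a single nonlinear basis term to the linear term, so the fitted curve can bend between the outer knots while remaining linear in the tails. A sketch of that term, assuming Harrell's standard formulation (the paper does not state which basis was used):

```python
def rcs3(x, t1, t2, t3):
    """Nonlinear basis term of a restricted cubic spline with knots
    t1 < t2 < t3 (Harrell's formulation, scaled by (t3 - t1)^2).
    The logistic model then includes both x and rcs3(x, ...) as predictors."""
    plus = lambda v: max(v, 0.0) ** 3      # truncated cubic (v)+^3
    scale = (t3 - t1) ** 2
    return (plus(x - t1)
            - plus(x - t2) * (t3 - t1) / (t3 - t2)
            + plus(x - t3) * (t2 - t1) / (t3 - t2)) / scale
```

The construction guarantees the cubic and quadratic terms cancel beyond the last knot, so the function is exactly linear outside [t1, t3].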

Machine Learning Models
(1) GBDT. GBDT is an ensemble learning model based on decision trees that adopts an additive model. In the iterative training process, the model generates a weak classifier that fits the residual of the previous iteration and achieves data classification by constantly reducing the residual generated during training.
(2) AdaBoost. AdaBoost is also an ensemble learning algorithm. In each iteration, the algorithm trains a new learner on the training set, predicts all samples, and reweights them: the harder a sample is to classify, the higher the weight it is given in the next iteration. The iterative process terminates when the error rate is small enough or a set number of iterations is reached.
(3) LGB. LGB (LightGBM) is a decision-tree-based learning algorithm. It uses a histogram algorithm to discretize continuous floating-point features into k discrete values, constructing a histogram of width k. The training data are traversed once, and the cumulative statistics of each discrete value are accumulated in the histogram. In feature selection, only the k discrete bins of the histogram need to be traversed to find the optimal split point.
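The histogram trick described above can be sketched in miniature: bin each continuous value into one of k equal-width bins, then accumulate per-bin counts so that candidate splits are searched only over bin edges (this is a simplified illustration, not LightGBM's actual implementation, which also accumulates gradient statistics):

```python
def histogram_bins(values, k=16):
    """Map each continuous value to one of k equal-width bins and return
    the per-bin counts; split points are then searched over bin edges only."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0          # guard against a constant column
    counts = [0] * k
    for v in values:
        idx = min(int((v - lo) / width), k - 1)  # clamp the max value into the last bin
        counts[idx] += 1
    return counts
```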
(4) Vote. In the present study, Vote was used to synthesize the results of the other algorithms; that is, when a sample was predicted, the other algorithms each cast a vote, and the category with the most votes was the output of the Vote algorithm.
(5) XGB. XGB builds K regression trees so that the predicted values of the tree ensemble are as close to the true values as possible while generalizing as well as possible. The objective function of XGB requires the prediction error to be as small as possible and penalizes the number of leaf nodes and the complexity of the trees.
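The majority-vote combination described in (4) can be sketched as a one-line aggregation over the other models' predicted labels (the function name is illustrative):

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: predicted labels for one sample, one per base model.
    Returns the label with the most votes (ties broken by first occurrence)."""
    return Counter(predictions).most_common(1)[0][0]
```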
(6) Random Forest. Random forest is a nonlinear tree-based ensemble learning model. It builds a forest in a random way; the forest is composed of many decision trees, with no correlation between them. Once the random forest model is obtained, each decision tree in the forest classifies every new sample; for classification problems, majority voting determines the final model output.
(7) Decision Tree. The decision tree is a basic classification method consisting of nodes and directed edges: a root node, internal nodes, and leaf nodes, where each internal node represents a feature and each leaf node represents a class. Features are first filtered according to their information gain; each node is then split into subnodes according to the feature's value. The root node contains the full sample set, and the path from the root node to each leaf node corresponds to a decision sequence.
(8) ML Logistic. Logistic regression, also known as log-odds regression, is a classification model suitable for fitting binary outcome data. The input data are linearly weighted and passed through a sigmoid function to obtain an output probability, which is then transformed into a binary output by thresholding. The model parameters are obtained by maximum likelihood estimation; here, the model was implemented within the machine learning pipeline without the clinically guided variable handling of the conventional logistic regressions described above.
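The prediction step just described (linear weighting, sigmoid, then thresholding) can be sketched directly (function names and the default 0.5 cutoff are illustrative):

```python
import math

def predict_proba(x, weights, bias):
    """Linearly weight the inputs, then squash through the sigmoid
    to obtain a probability in (0, 1)."""
    z = sum(w * v for w, v in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(x, weights, bias, cutoff=0.5):
    """Threshold the probability into a binary output."""
    return 1 if predict_proba(x, weights, bias) >= cutoff else 0
```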
2.3.6. Validation Phase. Predictive probabilities were calculated in the validation dataset from each development model and assessed in two ways. First, model discrimination was assessed by the area under the receiver operating characteristic curve (AUC), where a value of 1.0 represents perfect discrimination and 0.5 represents no discrimination. Second, model calibration was assessed by the mean squared error.
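Both metrics can be computed directly from labels and predicted probabilities; a minimal sketch (function names are illustrative):

```python
def auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney interpretation:
    the probability that a random positive scores higher than a random
    negative, with ties counted as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_squared_error(labels, scores):
    """Brier-style calibration measure: mean of (y - p)^2."""
    return sum((y - p) ** 2 for y, p in zip(labels, scores)) / len(labels)
```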
To estimate the relative contribution of the variables, the AUC-based permutation importance measure for the best performing machine learning method and the Wald χ² statistic minus the degrees of freedom (χ² − df) for the best performing logistic regression were computed. The effects of the most important predictor variables on the predicted risk across their ranges were evaluated for the most accurate model using partial dependence plots. The distribution of the predictive value versus its percentage in the whole population and the occurrence of GDM is shown, as are the distributions of predictive values in the development and validation datasets.
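Permutation importance, as used above, shuffles one feature at a time and records the resulting drop in the score; a minimal sketch, assuming features are stored as a dict of columns and `score_fn` returns a metric such as the AUC (all names are illustrative):

```python
import random

def permutation_importance(score_fn, X_cols, labels, n_repeats=5, seed=0):
    """For each feature, shuffle its column n_repeats times and average
    the drop in score_fn(X_cols, labels) relative to the unshuffled baseline.
    Larger drops indicate more important features."""
    rng = random.Random(seed)
    baseline = score_fn(X_cols, labels)
    importances = {}
    for name in X_cols:
        drops = []
        for _ in range(n_repeats):
            shuffled = dict(X_cols)        # shallow copy; replace one column
            col = X_cols[name][:]
            rng.shuffle(col)
            shuffled[name] = col
            drops.append(baseline - score_fn(shuffled, labels))
        importances[name] = sum(drops) / n_repeats
    return importances
```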
The negative predictive value, positive predictive value, sensitivity, specificity, positive likelihood ratio, and negative likelihood ratio were obtained. Cutoff points were selected from the development cohort to achieve a negative predictive value higher than 85% or a positive predictive value higher than 85% and were further analyzed in the validation cohort. All analyses were performed with R version 3.6.1 or Python 3.6.5. A one-tailed P value < 0.05 was regarded as statistically significant.
The incidence of GDM was 14.31% in the total cohort (Figure 1). The proportions of overweight, older age, history of prior GDM, and family history of diabetes among mothers who developed GDM were higher than those among mothers who did not develop GDM (Table 1). There were no statistically significant differences in the rates of nulliparity, prior macrosomia, or preterm delivery (Table 1).
Figure 2 presents the discrimination and calibration performance of the machine learning and logistic regression models. In terms of discrimination, the AUCs of the best performing machine learning model (GBDT) and the best performing logistic regression model (logistic with RCS) were similar (73.51%, 95% CI 71.36%-75.65% vs. 70.9%, 95% CI 68.68%-73.12%). The decision tree model was the least discriminative (59.96%, 95% CI 57.53%-61.4%). In terms of calibration, indexed by the mean squared error, the decision tree (0.65%, 95% CI 0.30%-1.0%) was the best calibrated model, followed by the GBDT model (0.88%, 95% CI 0.65%-1.1%). Figure 3 shows the top ten predictor variables in the GBDT (Figure 3(a)) and logistic regression models (Figure 3(b)). In both models, fasting blood glucose, HbA1c, triglycerides, and maternal BMI were among the most important predictors; however, the top four predictor variables were not identical.
Fasting blood glucose, HbA1c, triglycerides, and maternal BMI were the most important predictor variables in GBDT (Figure 3(a)), while HbA1c and high-density lipoprotein were the most important predictor variables in the logistic regression (Figure 3(b)). In both models, the risks for GDM increased with increasing levels of the predictors (Figures 3(c) and 3(d)).

Optimal Model Analysis.
Further analyses were performed on the GBDT model. The AUCs of GBDT in the development and validation cohorts are shown in Figure 4 (75%, 95% CI 73.42%-76.22% vs. 74%, 95% CI 71.36%-75.65%). Model calibration was assessed by comparing the observed and model-predicted risk of GDM in the general cohort (Figure 4(a)). At or below the cutoff point of 0.3, the observed occurrence was slightly lower than that predicted by the model, while at or above the cutoff point of 0.7, the observed occurrence was slightly higher; overall, however, the predicted risk fit the observed occurrence well (Figure 4(b)). The distribution of the predictive values in the combined cohorts, presented in Figure 5, was skewed, with its mode near a predictive value of 0.3; the median predictive value was 0.42. The occurrence of GDM increased approximately linearly with increasing predictive value (Figure 5(a)). A total of 569 participants had predictive values at or above 0.7; 89.3% of them developed GDM, accounting for 16% of all GDM cases (Figure 5(a)).
In the validation cohort, the median predictive values of the participants who developed GDM and those who did not were 0.52 and 0.37, respectively (Figure 5(b)). At the cutoff point of 0.3, the negative predictive value was 74.1% (95% CI 69.5%-78.2%) in the validation cohort, and the sensitivity was 90% (95% CI 88.0%-91.7%).
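The quantities reported at these cutoffs all derive from one confusion matrix; a minimal sketch of how they are computed at a given predictive-value cutoff (function name is illustrative):

```python
def cutoff_metrics(labels, scores, cutoff):
    """Sensitivity, specificity, PPV, and NPV at a predictive-value cutoff:
    scores at or above the cutoff are called positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= cutoff)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= cutoff)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < cutoff)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < cutoff)
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else None,
        "specificity": tn / (tn + fp) if tn + fp else None,
        "ppv": tp / (tp + fp) if tp + fp else None,
        "npv": tn / (tn + fn) if tn + fn else None,
    }
```

A low cutoff (such as 0.3) trades specificity for sensitivity and a high NPV; a high cutoff (such as 0.7) does the reverse, which is why the two cutoffs serve to rule out and rule in GDM, respectively.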

Adverse Pregnancy
Outcomes. In terms of discrimination, the AUC of GBDT for adverse pregnancy outcomes was 0.68 (95% CI 0.67-0.70) in the development cohort and 0.63 (95% CI 0.61-0.65) in the validation cohort (supplementary file, Figure S1A). At or below the cutoff point of 0.4, the predicted risk of adverse pregnancy outcomes was slightly lower than the observed value, while at or above the cutoff point of 0.6, the predicted value was slightly higher than the observed value (supplementary file, Figure S1B). The distribution of predictive values in the total cohort was near normal, and the occurrence of adverse pregnancy outcomes increased with increasing predictive value (supplementary file, Figure S2). The median predictive value for participants with adverse pregnancy outcomes appeared higher than for participants without them in both the development and validation cohorts, although the difference was not statistically significant (supplementary file, Figure S2). In the validation cohort, at the cutoff point of 0.3, the sensitivity was 99% (95% CI 98.3%-99.5%), while the corresponding negative predictive value was 52.4% (95% CI 32.4%-71.7%) (Table 3). Additionally, at the cutoff point of 0.7, the specificity was 99% (95% CI 98.1%-99.4%), while the positive predictive value was 79.2% (95% CI 66.5%-88.0%) and the positive likelihood ratio was 4 (95% CI 3.33-4.67) (Table 3).

Main Findings and Significance.
We found no evidence to support the hypothesis that GDM prediction models based on machine learning perform better than models based on logistic regression. In the GBDT model, the best performing machine learning model, we identified cutoff points of 0.3 and 0.7 for the predictive value as potential indicators of the absence and presence of GDM, respectively. Our study provides a basis for risk assessment of GDM.

Comparisons and Interpretations.
Previous studies showed inconsistent results regarding the performance of machine learning algorithms compared with regression models [26][27][28]. In the current analysis, machine learning was not superior to logistic regression. There may be several reasons for this inconsistency. First, logistic regressions are suitable for simple data with linear relationships between variables and outcomes; failure to control which variables enter the model according to prior knowledge and to address collinearity between variables results in poor performance. In the present study, both were addressed: the Pearson correlation coefficient was used to calculate the correlation between variables, and the entry of variables into the model was controlled both clinically and statistically. Second, different types of machine learning models and logistic regressions may fit and perform differently in different datasets. Eight common machine learning algorithms were analyzed and compared in the present study, and GBDT was identified as the best model, with higher discrimination and calibration than the others. Variables in the GBDT model showed nonlinear relationships with GDM, underscoring the advantage of methods that can capture nonlinearity.
In the present study, although the overall performance of the machine learning model was similar to that of the logistic regressions, the former was employed for further analysis because of its flexibility in managing nonlinear relationships between variables. Most models showed moderate discrimination, with AUCs mostly above 0.70, which is consistent with previous findings [21,22,24,29]. Previous studies have investigated models derived from electronic health records for the prediction of GDM using machine learning or conventional logistic regressions, but these studies included fewer participants than ours, did not compare the two methods, involved limited variables, or did not comprehensively estimate discrimination and calibration performance [18,21,22,24,29]. The current study extends these previous studies by comparing the performance of machine learning and traditional logistic regression models for the prediction of GDM and by validating predictive-value cutoff points of 0.3 and 0.7 derived from general maternal characteristics and biochemical data. Furthermore, our findings underscore the roles of glucose and lipid metabolism in the development of GDM [4,30]: in both the machine learning and logistic regression models, we identified blood glucose, lipids, BMI, and maternal age as important contributors to GDM.

Significance.
In predicting GDM, at or above the cutoff point of 0.7, the observed positive predictive value was 93.2% and the corresponding specificity was 99%, demonstrating its value in predicting the presence of GDM and identifying women at high risk. Following the identification of high-risk women, proper management, including early diagnosis by OGTT, lifestyle changes, and exercise, may be beneficial in controlling maternal and neonatal complications [31], although additional data are needed. A predictive value of 0.7 or higher also had value in predicting the presence of fetal adverse outcomes, including macrosomia, preterm delivery, and low Apgar scores, as well as cesarean delivery and preeclampsia, with a positive predictive value of 79.2% and a specificity of 99%. We did not further evaluate the predictive performance separately for maternal and fetal outcomes, as data on the corresponding outcomes were limited.
The sensitivity values of GDM and adverse pregnancy outcomes at the cutoff point of 0.3 were 90% and 99%, respectively, indicating that most of the GDM and adverse pregnancy outcomes occurred at or above the predictive value of 0.3. Thus, women with a predictive value below 0.3 may potentially avoid further OGTT, although this needs to be further demonstrated in prospective cohorts.
Overall, the prediction model in the present study supports a screening strategy based on risk assessment for detecting GDM. Pregnant women may benefit from a strategy that allows either the elimination of further screening in low-risk women or the initiation of early prevention and treatment measures in high-risk women.

Strengths and
Limitations. The strengths of our study are the large pregnancy cohort with data on obstetrical history, clinical assessments, and biochemical variables in early pregnancy. To our knowledge, this is the first prediction model for GDM with cutoff points based on easy-to-use clinical data, derived from both machine learning algorithms and conventional regressions.
There were some limitations in our study. First, the data come from only one institution and lack external validation, so potential bias may exist. Further prospective studies in additional populations are needed to establish whether using these cutoffs in clinical practice, alongside the current OGTT at 24-28 gestational weeks, could reduce GDM, costs, or adverse pregnancy outcomes. Second, data on socioeconomic status, diet, and physical activity, which have been reported to be risk factors for GDM, were not included in this retrospective study [32].

Conclusions
In conclusion, this study shows that the machine learning algorithms did not outperform conventional logistic regressions. A prediction model with cutoff points was developed and provides value for risk assessment in detecting GDM.

Data Availability
The dataset generated and/or analyzed in this study is not publicly available, owing to currently ongoing research studies, but the data are available from the corresponding author on reasonable request.