Identification and validation of gestational diabetes subgroups by data-driven cluster analysis

Aims/hypothesis Gestational diabetes mellitus (GDM) is a heterogeneous condition. Given such variability among patients, the ability to recognise distinct GDM subgroups using routine clinical variables may guide more personalised treatments. Our main aim was to identify distinct GDM subtypes through cluster analysis using routine clinical variables, and to analyse treatment needs and pregnancy outcomes across these subgroups.

Methods In this cohort study, we analysed datasets from a total of 2682 women with GDM treated at two central European hospitals (1865 participants from Charité University Hospital in Berlin and 817 participants from the Medical University of Vienna), collected between 2015 and 2022. We evaluated various clustering models, including k-means, k-medoids and agglomerative hierarchical clustering. Internal validation techniques were used to guide best model selection, while external validation on independent test sets was used to assess model generalisability. Clinical outcomes, such as specific treatment needs and maternal and fetal complications, were analysed across the identified clusters.

Results Our optimal model identified three clusters from routinely available variables, i.e. maternal age, pre-pregnancy BMI (BMIPG) and glucose levels at fasting and at 60 and 120 min during the diagnostic OGTT (OGTT0, OGTT60 and OGTT120, respectively). Cluster 1 was characterised by the highest OGTT values and obesity prevalence. Cluster 2 displayed intermediate BMIPG and elevated OGTT0, while cluster 3 consisted mainly of participants with normal BMIPG and high values for OGTT60 and OGTT120. Treatment modalities and clinical outcomes varied among clusters. In particular, cluster 1 participants showed a much higher need for glucose-lowering medications (39.6% of participants, compared with 12.9% and 10.0% in clusters 2 and 3, respectively, p<0.0001). Cluster 1 participants were also at higher risk of delivering large-for-gestational-age infants. Differences in the type of insulin-based treatment between cluster 2 and cluster 3 were observed in the external validation cohort.

Conclusions/interpretation Our findings confirm the heterogeneity of GDM. The identification of subgroups (clusters) has the potential to help clinicians define more tailored treatment approaches for improved maternal and neonatal outcomes.

Supplementary Information The online version of this article (10.1007/s00125-024-06184-7) contains peer-reviewed but unedited supplementary material.


Participants and experimental procedures
ESM Table 1 reports the variables present in the Berlin dataset and/or the Vienna dataset.
According to the German and Austrian practice guidelines, the diagnostic OGTT is performed at mid-gestation (late second or early third trimester) [1]. In cases at high risk of GDM, or with elevated fasting glucose concentrations in early pregnancy, the presence of GDM was verified by early OGTT testing before 24 weeks, according to local guidelines. Venous blood samples are taken at fasting and at 60 and 120 min after ingestion of an oral glucose solution (75 g in 300 ml water). Pre-gestational BMI (BMIPG) was calculated from self-reported pre-pregnancy weight. This approach is regularly used in epidemiological studies and correlates highly with technician-measured weight [2,3]. Sex- and age-adjusted percentiles of infant birth weight were calculated using international anthropometric standards. Large-for-gestational-age (LGA) and small-for-gestational-age (SGA) were defined as birth weight above the 90th or below the 10th Intergrowth percentile, respectively [4,5].
Intermediate-acting Neutral Protamine Hagedorn (NPH) insulin was started in the evening if ≥2 fasting glucose measurements were equal to or above 5.3 mmol/l (95 mg/dl) within one week, and rapid-acting insulin analogues (aspart or lispro) were started if ≥2 measurements of 1 h postprandial glucose (after breakfast, lunch or dinner) were equal to or above 7.8 mmol/l (140 mg/dl) within one week. NPH was started at 6 to 10 IU and increased by 4 to 6 IU (or more in the case of higher doses), and rapid-acting insulin was started at 2 to 4 IU and increased by 2 to 4 IU if thresholds were not achieved within three days. Long-acting insulin analogues such as glargine (U100/U300) were used as an alternative to NPH if necessary. Patients were trained to manage and titrate insulin according to their glucose levels. Metformin was used in some cases (especially in insulin-resistant obese women).
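The initiation thresholds above can be sketched as simple predicates. This is an illustrative sketch only: the function names and the list-of-readings layout are hypothetical and not taken from the study's software.

```python
# Thresholds from the treatment protocol described above.
FASTING_THRESHOLD = 5.3        # mmol/l (95 mg/dl), fasting glucose
POSTPRANDIAL_THRESHOLD = 7.8   # mmol/l (140 mg/dl), 1 h postprandial glucose

def needs_nph(fasting_week):
    """NPH insulin is indicated if >=2 fasting values within one week
    reach or exceed 5.3 mmol/l."""
    return sum(g >= FASTING_THRESHOLD for g in fasting_week) >= 2

def needs_rapid_acting(postprandial_week):
    """A rapid-acting analogue is indicated if >=2 one-hour postprandial
    values within one week reach or exceed 7.8 mmol/l."""
    return sum(g >= POSTPRANDIAL_THRESHOLD for g in postprandial_week) >= 2

needs_nph([5.0, 5.4, 5.3, 4.9])           # two values >= 5.3 -> True
needs_rapid_acting([7.5, 7.2, 7.9, 7.0])  # only one value >= 7.8 -> False
```

Note that both rules count exceedances over a week rather than reacting to a single elevated reading, which matches the "≥2 measurements in a period of one week" criterion.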
At both study sites (Berlin and Vienna), the following variables were extracted for further analyses: maternal characteristics, such as age, height, weight before pregnancy, singleton or multiple pregnancy, parity and number of deliveries; glucose concentrations at fasting and at 60 and 120 min during the diagnostic OGTT; use of glucose-lowering medications; pregnancy disorders; and fetal and maternal complications. In the Vienna cohort, we collected specific data on treatment strategies, including the use and dosage of insulin and/or metformin. We also collected details on insulin formulations (intermediate-acting insulin, rapid-acting insulin analogues or long-acting insulin analogues), considering the maximum doses per day. Moreover, neonatal biometry and outcomes (birth weight, length and head circumference, gestational age at delivery (GAD), transfer to the neonatal intensive care unit, cord blood pH, base excess, APGAR scores) and mode of delivery were extracted from the medical history. Lastly, for the Vienna cohort we collected information on previous history of GDM, use and number of anti-hypertensive medicines, and family history of diabetes.

Definition of training and test datasets, input and outcome variables
The raw datasets underwent a pre-processing phase, whose steps are summarised in ESM Fig. 1. First, we checked that subjects had at least one glucose value of the diagnostic OGTT equal to or above the threshold values for gestational diabetes (GDM) diagnosis [6]; thus, 41 subjects were excluded from the Berlin dataset. The subsequent pre-processing phase involved the identification and removal of typographical errors and extreme outliers. Typographical errors were identified via the interquartile range method [7]. Following a similar approach to the study by Ahlqvist et al. [8], extreme outliers were defined as values deviating by more than five standard deviations from the mean, and they were eliminated only in the Berlin dataset (external validation cohort). If an error or outlier was detected in an input variable, the corresponding subject was excluded; conversely, if an error or outlier was found in an outcome variable, only that variable value was deleted. This step resulted in the removal of 8 subjects with typographical errors and 6 subjects with outliers from the Berlin dataset, and 3 subjects with typographical errors from the Vienna dataset. Additionally, the Local Outlier Factor (LOF) method [9] was applied to the input variables of the Berlin dataset. This method, specifically designed to identify and eliminate density-based outliers, led to the exclusion of 11 more subjects. Of note, subjects with multiple pregnancies (such as twin pregnancies) and subjects missing the required input variables were excluded from the analyses, leading to the exclusion of 72 and 78 subjects, respectively, from the Berlin dataset, and 1 and 44 subjects, respectively, from the Vienna test set. After this, the Berlin dataset was partitioned into a training set (70%) and a test set (30%), yielding a training set of 1154 subjects and a first test set of 495 subjects; concurrently, the Vienna dataset provided a separate test set of 769 subjects. The final pre-processing step involved the normalisation of the input variables to minimise the impact of different measurement scales on the clustering results. Importantly, we performed the normalisation of the training and test sets using only the mean and variance computed from the training set, to avoid data leakage [10].
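The outlier filtering and leakage-free normalisation described above can be sketched as follows. All data here are synthetic and the exact parameter choices (e.g. `n_neighbors=20` for the LOF) are illustrative assumptions, not the study's settings.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical input matrix: rows = subjects, columns = the five input
# variables (age, BMIPG, OGTT0, OGTT60, OGTT120); values are synthetic.
X = rng.normal(loc=[32, 28, 5.4, 10.3, 8.6],
               scale=[5, 5, 0.5, 1.4, 1.2], size=(1700, 5))

# Extreme outliers: values deviating by more than five SD from the mean.
z = (X - X.mean(axis=0)) / X.std(axis=0)
sd_keep = (np.abs(z) <= 5).all(axis=1)

# Density-based outliers via the Local Outlier Factor (flagged as -1).
lof_keep = LocalOutlierFactor(n_neighbors=20).fit_predict(X) != -1
X = X[sd_keep & lof_keep]

# 70/30 split, then normalise with statistics from the training set only,
# so no information leaks from the test set into the scaling.
n_train = int(0.7 * len(X))
X_train, X_test = X[:n_train], X[n_train:]
scaler = StandardScaler().fit(X_train)
X_train_z, X_test_z = scaler.transform(X_train), scaler.transform(X_test)
```

Fitting the scaler on the training portion alone is the key step: the test set is transformed with the training mean and variance, as in the study's pipeline.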

Internal validation
In the validation of a proposed clustering solution (considering the selected input variables, clustering algorithm and number of identified clusters), several validation techniques were applied sequentially. The stability of the solution was initially assessed by confirming that the Jaccard index was greater than 0.75 for each cluster [11,12]. Subsequently, it was verified that the average silhouette of each cluster was non-negative [12]. The significance of each input variable was evaluated [12], as reported in the statistical analysis section (in the main text).
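The two stability checks above (bootstrap Jaccard index and per-cluster mean silhouette) can be sketched on toy data. The resampling scheme and the two well-separated synthetic blobs are illustrative assumptions, not the study's data or exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

def jaccard(a, b):
    """Jaccard index between two sets of subject indices."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

rng = np.random.default_rng(3)
# Two well-separated hypothetical blobs, so the clustering is stable.
X = np.vstack([rng.normal(0, 0.3, size=(100, 2)),
               rng.normal(5, 0.3, size=(100, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Per-cluster mean silhouette: the criterion requires non-negative values.
sil = silhouette_samples(X, labels)
mean_sil = [sil[labels == k].mean() for k in range(2)]

# Bootstrap stability: recluster a resample and record, for each original
# cluster, the best Jaccard overlap with any cluster of the resample.
idx = rng.choice(len(X), size=len(X), replace=True)
boot_labels = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X[idx])
stability = [max(jaccard(np.where(labels[idx] == k)[0],
                         np.where(boot_labels == j)[0]) for j in range(2))
             for k in range(2)]
```

A cluster would pass the stability criterion when its bootstrap Jaccard value exceeds 0.75, as stated above.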
A twofold cross-validation was then implemented on two random subsets of the training set [12]. These "partial models" were compared with the "original model" derived from the entire training set across three key aspects: input similarity, outcome-significance consistency and result-similarity indices. Input similarity was measured variable by variable between matching clusters: using the variable mean and standard deviation of each cluster in the original model, a z-score was computed for each subject in the partial model, and a t-test (or its non-parametric counterpart) then evaluated whether each z-score-transformed variable in each cluster was significantly different from zero. To assess consistency in outcome significance, the analyses conducted on the original model were repeated on the two partial models. Lastly, for those subjects included in both the partial and original models, the similarity of the results was evaluated using the Adjusted Rand Index (ARI), which equals 1 for an identical match between two clustering outcomes and is close to zero for random partitions, taking into account the potential randomness in subject partitions [13]. In addition, we computed classification accuracy metrics such as overall accuracy and, for each cluster, sensitivity, specificity and F1-score.
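The ARI comparison of two partitions can be computed directly with scikit-learn; the label vectors below are toy examples standing in for the cluster assignments of subjects shared by the original and a partial model.

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical labels for subjects present in both the original model
# and one partial model (toy data, not from the study).
labels_original = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
labels_partial  = [1, 1, 1, 2, 2, 3, 3, 3, 3, 3]

ari = adjusted_rand_score(labels_original, labels_partial)
```

ARI is invariant to label permutations, so it compares the groupings themselves rather than the arbitrary cluster numbering; identical partitions always yield 1.0.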
Through this comprehensive process, it was possible to select valid clustering solutions and to identify the optimal one in terms of input variables, clustering algorithm and number of clusters. This entire validation process was performed exclusively on the training set.

External validation
Following identification of the optimal model, its clustering results underwent evaluation on the two independent test sets to assess generalisability. As introduced in the main text, for a given clustering algorithm and set of input variables we defined the Training Cluster Ctraining (clustering results obtained from the training set) and the Estimated Test Cluster Ĉtest (clustering results obtained for the test set using the model determined from the training set). In Ĉtest, each subject was assigned to the cluster whose centroid was nearest. We also defined the Reference Test Cluster Ctest, which denotes the clustering results on the test set obtained by applying the same clustering algorithm directly to the test set itself. Ctest served to assess the robustness of Ĉtest, i.e. the clustering obtained from the training data (see ESM Fig. 2).
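The nearest-centroid assignment used to build Ĉtest can be sketched as follows; the 2-D toy centroids and points are illustrative, whereas the study's centroids live in the five-dimensional normalised input space.

```python
import numpy as np

def assign_to_nearest_centroid(X, centroids):
    """Assign each row of X to the index of its nearest centroid
    (Euclidean distance), as done to derive the estimated test
    clustering from the training-set model."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

# Toy 2-D example with three hypothetical centroids.
centroids = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X_test = np.array([[0.2, -0.1], [4.8, 5.1], [0.1, 4.9]])
labels = assign_to_nearest_centroid(X_test, centroids)  # -> [0, 1, 2]
```

Because the centroids are taken from the training model, this step uses no information from the test set beyond the subjects' own input values.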
The first step of the external validation was the comparison between Ctraining and Ĉtest, to evaluate possible differences in each cluster's percentage of subjects relative to the total number of subjects, and in the distribution of the input variables between corresponding clusters. For every cluster identified, the distributions of each input variable in the training and test sets were compared using the t-test or the Wilcoxon rank-sum test, depending on whether the distributions were Gaussian. A two-sided p-value of less than 0.05 was considered statistically significant.
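The normality-dependent choice between the two tests can be sketched with scipy; the use of the Shapiro-Wilk test as the normality check, and the helper name, are assumptions for illustration.

```python
import numpy as np
from scipy.stats import shapiro, ttest_ind, ranksums

def compare_distributions(x, y, alpha=0.05):
    """Two-sided t-test when both samples look Gaussian (Shapiro-Wilk),
    otherwise the Wilcoxon rank-sum test; returns the p-value."""
    if shapiro(x).pvalue > alpha and shapiro(y).pvalue > alpha:
        return ttest_ind(x, y).pvalue
    return ranksums(x, y).pvalue

rng = np.random.default_rng(4)
# Toy comparison of one input variable between matching clusters.
p = compare_distributions(rng.normal(0, 1, 80), rng.normal(0, 1, 80))
```

In practice any standard normality check could serve as the gate; the essential point is that the parametric test is only applied when both distributions are plausibly Gaussian.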
Furthermore, the comparison between Ĉtest and Ctest represents the second step of the external validation process, evaluating the stability of cluster membership. This step assesses the consistency of cluster assignments by comparing the groupings derived from applying the clustering technique to the test set (Ctest) with those obtained by directly transferring the original clustering to the test set data (Ĉtest), integrating methodological and outcome-based validation strategies [14]. To ensure a meaningful comparison of Ĉtest and Ctest, we assessed the stability of the Reference Clusters Ctest by computing the Jaccard index [11].
A confusion matrix (ESM Table 2) was obtained by taking the cluster assignments of Ctest as the reference, and key metrics were calculated for each i-th cluster. Sensitivity was determined as sensi = TPi/(TPi + FNi), specificity as speci = TNi/(TNi + FPi), and the F1-score as F1i = 2 · (preci · sensi)/(preci + sensi), where preci = TPi/(TPi + FPi) is the precision, TP are true positives, TN true negatives, FP false positives and FN false negatives. The total accuracy was computed as the sum of TP across all clusters divided by the total number of samples. In addition, the ARI was calculated to evaluate the similarity between Ĉtest and Ctest, as this index provides further insight into the effectiveness of the trained clustering model.
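The per-cluster metrics can be derived from the confusion matrix as sketched below; the 3x3 matrix is a toy example, not the one in ESM Table 2.

```python
import numpy as np

def per_cluster_metrics(conf):
    """Sensitivity, specificity and F1 for each cluster i from a confusion
    matrix (rows: reference Ctest, columns: predicted C^test)."""
    conf = np.asarray(conf, dtype=float)
    total = conf.sum()
    metrics = []
    for i in range(conf.shape[0]):
        tp = conf[i, i]
        fn = conf[i, :].sum() - tp      # cluster-i subjects assigned elsewhere
        fp = conf[:, i].sum() - tp      # other subjects assigned to cluster i
        tn = total - tp - fn - fp
        sens = tp / (tp + fn)
        spec = tn / (tn + fp)
        prec = tp / (tp + fp)
        f1 = 2 * prec * sens / (prec + sens)
        metrics.append((sens, spec, f1))
    # Overall accuracy: correctly assigned subjects (diagonal) over total.
    accuracy = np.trace(conf) / total
    return metrics, accuracy

# Toy 3-cluster confusion matrix.
m, acc = per_cluster_metrics([[50, 2, 1], [3, 60, 2], [0, 1, 40]])
```

Note that each cluster is treated one-vs-rest, so TN for cluster i counts all subjects neither belonging to nor assigned to that cluster.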
ESM Fig. 2 shows a diagram of the procedure implemented for the clustering analysis.

Selection of the optimal clustering model configuration
With regard to DBSCAN, we found unsatisfactory results. Using a k-distance graph, we plotted the distance of each sample to its k-th nearest neighbour for k ranging from 2 to 10; thereafter, we also tested k = 20, 30 and 40. However, DBSCAN mostly returned either one large cluster or many small clusters. When we tested HDBSCAN we observed similar results. Given these outcomes, we focused on other clustering algorithms.
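The k-distance-graph approach to choosing the DBSCAN eps can be sketched as follows; the synthetic data, the value of k and the quantile used to read eps off the curve are all illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))  # hypothetical standardised input matrix

# k-distance graph: sorted distance of each sample to its k-th nearest
# neighbour; the "elbow" of this curve suggests a candidate eps.
k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: first neighbour is self
dists, _ = nn.kneighbors(X)
k_dist = np.sort(dists[:, -1])

# Run DBSCAN with an eps read off the k-distance curve.
labels = DBSCAN(eps=np.quantile(k_dist, 0.9), min_samples=k).fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 marks noise
```

On a single dense blob like this synthetic one, DBSCAN typically returns one large cluster plus noise, mirroring the behaviour reported above.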

Internal validation results
With the selected k-means clustering algorithm, each cluster, whose centroids are reported in ESM Table 3, showed a Jaccard index above 0.86 (precisely, 0.88, 0.86 and 0.86 for Clusters 1, 2 and 3, respectively). All clusters had non-negative silhouette values, as required by the implemented validation criterion. A comparison of the input variables across the clusters confirmed the significance of all variables, as mirrored by significant differences in almost all pairs of clusters according to post hoc tests following ANOVA or Kruskal-Wallis tests, the only exception being age between Clusters 1 and 3.
When the twofold cross-validation was applied to the clustering process, we found remarkable similarities across the cluster characteristics. Upon applying the z-score transformation to the subjects in the two partial models, based on the mean and standard deviation derived from the original (complete) model, the t-tests (or Mann-Whitney U tests) generally revealed non-significant differences across the majority of the cluster variables. The only exceptions were in partial model 1, with OGTT120 in Cluster 1, OGTT0 in Clusters 2 and 3, and OGTT60 in Cluster 3. In addition to this comparison between each partial model and the original model, we also assessed the significance of the input variables in each partial model, evaluating possible differences between cluster pairs within a given partial model; we found strong agreement with the original model (all variables, except age between Clusters 2 and 3 and OGTT120 between Cluster 3 and the other two clusters in partial model 1). A summary of the goodness-of-fit statistics for each cluster in both partial models 1 and 2 is provided in ESM Table 4. Finally, ARI values were 0.82 and 0.92 for partial models 1 and 2, respectively, with overall accuracy reaching 93.8% and 97.2%.

Comparison between training set and test sets
For the Berlin test set, the proportion of subjects in each cluster closely parallels that of Ctraining (Fig. 1, main text): 102 subjects (20.6%) were assigned to Cluster 1, 182 subjects (36.8%) to Cluster 2 and 211 subjects (42.6%) to Cluster 3. The chi-squared test on the contingency table of cluster assignment proportions between Ctraining and Ĉtest yielded a non-significant p-value of 0.84. When comparing the variable values in each cluster between Ctraining and Ĉtest, all tests returned non-significant p-values, except for BMIPG and OGTT120 in Cluster 1 only, suggesting that the input variables in Ctraining and Ĉtest were virtually the same.
With regard to the Vienna test set, we found marginally different proportions of subjects in the clusters compared with Ctraining: 113 subjects (14.7%) were assigned to Cluster 1, 244 subjects (31.7%) to Cluster 2 and 412 subjects (53.6%) to Cluster 3 (Fig. 1, main text). Indeed, the chi-squared test on the contingency table yielded a significant p-value (p<0.0001) in this case. When comparing the variable values in each cluster between Ctraining and this test set, only a few variables showed non-significant p-values (age in Clusters 2 and 3, OGTT0 across all clusters, and OGTT120 in Cluster 2). However, the clusters' overall properties were very similar to those of Ctraining.
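A chi-squared test on a contingency table of cluster counts can be run with scipy as sketched below. For illustration only, the table contrasts the Berlin and Vienna test-set cluster sizes quoted above; the study's tests instead compared each test set against Ctraining.

```python
from scipy.stats import chi2_contingency

# Cluster-size counts reported above: Berlin test set (102/182/211)
# versus Vienna test set (113/244/412); purely an illustrative pairing.
table = [[102, 182, 211],
         [113, 244, 412]]

# Pearson chi-squared test of homogeneity of the cluster proportions.
chi2, p_value, dof, expected = chi2_contingency(table)
```

With two rows and three clusters the test has (2-1)·(3-1) = 2 degrees of freedom, and a small p-value indicates that the cluster-size distributions differ between the two sets.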

Comparison between estimated and reference clusters in test sets (Ĉtest and Ctest)
A summary of goodness-of-fit metrics obtained on the two test sets is reported in ESM Table 5.

Comparison of clinical outcomes across clusters
The significant differences identified in the training set were further explored in the two partial models obtained through twofold cross-validation (ESM Tables 6 and 7). The need for pharmacological treatment differed between Cluster 1 and Clusters 2 and 3 in both partial models (p<0.0001). Large-for-gestational-age (LGA), birth weight > 4000 g, and both the weight and length Intergrowth percentiles differed between Cluster 1 and Clusters 2 and 3 in one of the two partial models. In contrast, the low base excess and in-range base excess outcomes did not differ significantly between Cluster 1 and Cluster 3 in either partial model.
To evaluate binary outcomes, we computed odds ratios (OR) along with their respective 95% confidence intervals. The OR represents the odds that an outcome will occur given a specific cluster membership, compared with the odds of the outcome occurring without membership of that cluster. To achieve this, we conducted separate logistic regression analyses for each cluster, each treating one cluster as the focal group relative to non-membership: in the first logistic regression model, we calculated the odds of the outcome within Cluster 1 relative to the odds outside Cluster 1, and this process was repeated for Clusters 2 and 3 in their respective models. This approach allowed us to comprehensively assess the impact of cluster membership on each binary outcome. The results of these analyses are reported in ESM Tables 8-12 for the training set, the test sets and each of the twofold cross-validation partial models.
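For a single binary predictor (membership of one cluster vs non-membership), the OR and its 95% CI can be obtained directly from the 2x2 table, which is equivalent to a univariable logistic regression with cluster membership as the only covariate. The Woolf (logit) interval and the toy counts below are illustrative assumptions, not the study's data or software.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """OR and 95% CI (Woolf/logit method) from a 2x2 table:
    a = outcome+ in cluster,      b = outcome- in cluster,
    c = outcome+ outside cluster, d = outcome- outside cluster.
    Matches the OR from a univariable logistic regression on
    cluster membership."""
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log(OR)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi

# Toy example: an outcome in Cluster 1 vs outside Cluster 1
# (hypothetical counts, not from the study).
or_, lo, hi = odds_ratio_ci(30, 170, 60, 940)
```

Repeating this with each cluster in turn as the focal group reproduces the one-vs-rest scheme described above.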

Fig. 1 :
Diagram of the data pre-processing steps for the Berlin and Vienna datasets. The chart displays the sequence of operations carried out on the initial raw data, indicating the number of patients excluded at each stage. The remaining Berlin dataset was randomly split into training and test sets according to a 70/30 ratio, while the Vienna dataset was utilised as an additional test set. GDM: gestational diabetes mellitus; LOF: Local Outlier Factor.

Table 1 :
Variables from the Berlin and Vienna datasets

Table 2 :
Confusion matrix obtained by considering Ctest as the reference and Ĉtest as the prediction

Table 3 :
Coordinates of centroids and related normalised coordinates. Mean and standard deviation values used for normalisation of the input variables are also reported. BMIPG: pre-pregnancy body mass index; OGTT0, OGTT60, OGTT120: glucose from the diagnostic oral glucose tolerance test at fasting and at 60 and 120 min.

Table 4 :
Cluster goodness-of-fit metrics for the two partial models obtained through twofold cross-validation, using the clusters estimated from the original model as the reference

Table 5 :
Sensitivity, specificity, and F1-score for each cluster on the test sets, with clustering obtained from the model determined on the training set

Table 6 :
Summary statistics for the first partial model of the twofold cross-validation. Ntot is the total number of subjects (and their distribution across clusters) for which the corresponding information was available. Categorical data are expressed as n (%). Continuous data are presented as median and IQR.

Table 7 :
Summary statistics for the second partial model of the twofold cross-validation. Ntot is the total number of subjects (and their distribution across clusters) for which the corresponding information was available. Categorical data are expressed as n (%). Continuous data are presented as median and IQR. * p<0.05 between cluster 1 and cluster 2; † p<0.05 between cluster 1 and cluster 3; ‡ p<0.05 between cluster 2 and cluster 3. APGAR, appearance, pulse, grimace, activity, respiration; LGA, large for gestational age; NICU, neonatal intensive care unit; SGA, small for gestational age

Table 8 :
Odds ratios (OR) and 95% confidence intervals for binary outcomes among the three clusters in the training set

Table 9 :
Odds ratios (OR) and 95% confidence intervals for binary outcomes among the three clusters in the Berlin test set. APGAR, appearance, pulse, grimace, activity, respiration; LGA, large for gestational age; NICU, neonatal intensive care unit; SGA, small for gestational age

Table 10 :
Odds ratios (OR) and 95% confidence intervals for binary outcomes among the three clusters in the Vienna test set. APGAR, appearance, pulse, grimace, activity, respiration; LGA, large for gestational age; NICU, neonatal intensive care unit; NPH, neutral protamine Hagedorn; SGA, small for gestational age

Table 11 :
Odds ratios (OR) and 95% confidence intervals for binary outcomes among the three clusters in the first partial model of the twofold cross-validation. APGAR, appearance, pulse, grimace, activity, respiration; LGA, large for gestational age; NICU, neonatal intensive care unit; SGA, small for gestational age. a Base excess > 2; b Base excess < −2; c Gestational age at birth > 37 weeks; d Arterial blood pH < 7.2; e Anomalous if APGAR < 7; f Anomalous APGAR at 0, 5 or 10 min