Ensemble learning for fetal ultrasound and maternal–fetal data to predict mode of delivery after labor induction

Providing adequate counseling on mode of delivery after induction of labor (IOL) is of utmost importance. Various AI algorithms have been developed for this purpose, but they rely on maternal–fetal data and do not include ultrasound (US) imaging. We used retrospectively collected clinical data from 808 subjects submitted to IOL, totaling 2424 US images, to train AI models to predict vaginal delivery (VD) and cesarean section (CS) outcomes after IOL. The best overall model used only clinical data (F1-score: 0.736; positive predictive value (PPV): 0.734). The imaging models employed fetal head, abdomen and femur US images and showed limited discriminative results. The best imaging model used femur images (F1-score: 0.594; PPV: 0.580). Consequently, we constructed ensemble models to test whether US imaging could enhance the clinical data model. The best ensemble model included clinical data and US femur images (F1-score: 0.689; PPV: 0.693), presenting an interesting trade-off between false positives and false negatives. The model correctly predicted CS in 4 additional cases, despite misclassifying 20 additional VD, resulting in a 6.0% decrease in average accuracy compared to the clinical data model. Hence, integrating US imaging into the latter model can be a new development in assisting mode of delivery counseling.


Tabular data
Between January 2018 and December 2021, 808 patients with singleton vertex pregnancies were included in our longitudinal retrospective study, with 563 (69.7%) patients culminating in VD and 245 (30.3%) categorized as unplanned CS. The participants' average age was 32.2 ± 5.7 years [18-47]. Demographic features and maternal and neonatal outcomes are shown in Table 1. Comparison between the two delivery modes showed significant differences in terms of age, height, body mass index (BMI) and parity, with more parous women in the VD group (37.8% vs 32.2%; p < 0.001). Women in the CS group were older and shorter, and had a higher BMI and heavier babies at birth (p < 0.001). Other characteristics, such as gestational diabetes, 5-min Apgar scores ≤ 7, and neonatal intensive care unit admission rates, were similar between groups. Mean gestational age (GA) at the third trimester US was 30.9 ± 0.94 weeks [27-32 weeks]. The mean fetal biometry measures were, respectively: head circumference (HC) 290.7 ± 12.7 mm [245.8-326.6 mm], biparietal diameter (BPD) 81.1 ± 4.0 mm [57.0-91.1 mm], abdominal circumference (AC) 278.6 ± 15.1 mm [215.0-325.6 mm], femur length (FL) 60.0 ± 2.9 mm [47.1-68.0 mm] and estimated fetal weight (EFW) 1848.8 ± 247.9 g [903.0-2610.0 g]. The recommended Hadlock formula was used to calculate EFW 9. All these measurements were significantly different between both groups, except for FL, which showed similar values (see Table 1).
Mean GA at IOL was similar between groups (39.9 vs 40.1 weeks). Dinoprostone was more frequently used in the CS group (66.1%), while the opposite was true for misoprostol (49.6%). Also, 91.8% of pregnant women in the CS group presented significantly lower Bishop scores (≤ 3) before IOL. IOL indications did not differ between groups. Time to delivery was significantly longer in the CS group (28.9 vs 20.1 h). Almost a third of CS were due to non-reassuring fetal heart rate (30.2%), while the majority (67%) corresponded to "failed induction/labor dystocia" (see Table 1).
Figures 1 and 2 and Table 2 show the tabular data models' performance in predicting CS likelihood. These models take into consideration maternal clinical data as well as fetal information provided by the third trimester US. The best-performing model was selected for further interpretation due to its superior positive predictive value (PPV) and weighted F1-score, meaning that it has the best ratio between true positives (TP) and false positives (FP). The rationale for this choice is that a mode of delivery prediction model should detect as many CS as possible (true positives) while avoiding misclassifying a VD as a CS (false positives). All models showed good predictive performance, with F1-scores ranging from 0.59 to 0.74. The AdaBoost model presented a high predictive power (F1-score = 0.736 ± 0.024 and PPV 0.734 ± 0.024) and accurately predicted 86.7% of VD and 46.9% of CS.

Imaging data
Each of the 808 pregnant women included in the tabular data contributed 3 US images from the third trimester (fetal head, abdomen and femur), totaling 2424 images. These were analyzed using threefold cross-validation, comprising 1126 VD and 490 CS images for training and validation, and 563 VD and 245 CS images for testing the imaging-based models. Figures 1 and 2 and Table 2 present the imaging models' performance in classifying VD vs CS. True delivery outcome served as the ground truth for training and testing. Overall, the best DL model for fetal US images was Inception, based on the same rationale as previously explained for the AdaBoost model. The weighted F1-score and PPV on our test dataset were 0.594 ± 0.022 and 0.580 ± 0.027 for the femur (the best image model), and 0.590 ± 0.015 and 0.571 ± 0.025 for the abdomen, respectively. The head view's weighted F1-score (0.587 ± 0.043) and PPV (0.565 ± 0.068) were the least helpful for mode of delivery prediction.
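The threefold split described above can be sketched as follows. This is a minimal illustration, assuming the split is made per patient and stratified by delivery outcome; the text does not state either detail. The function name `stratified_kfold` and the round-robin assignment are hypothetical choices for this sketch, not the authors' implementation.

```python
import random
from collections import defaultdict

VIEWS = ("head", "abdomen", "femur")  # one image per view and patient


def stratified_kfold(labels, k=3, seed=0):
    """Return k (train, test) index pairs that preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets a similar share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    pairs = []
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j in range(k) if j != i for idx in folds[j])
        pairs.append((train, test))
    return pairs


# Patient-level outcomes as reported: 563 VD and 245 CS.
labels = ["VD"] * 563 + ["CS"] * 245
splits = stratified_kfold(labels, k=3)

for fold, (train, test) in enumerate(splits, start=1):
    n_cs = sum(1 for i in test if labels[i] == "CS")
    print(f"fold {fold}: {len(train) * len(VIEWS)} training images, "
          f"{len(test) * len(VIEWS)} test images, {n_cs} CS patients in test")
```

Splitting by patient rather than by image keeps the three views of one fetus in the same fold, which avoids leakage between training and test data when the per-view models are later combined.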

Ensemble models
Additionally, to test whether DL can improve mode of delivery prediction using multimodal imaging associated with tabular features, we implemented an ensemble of neural networks to classify mode of delivery and compared their performance measures (see Fig. 3 for further explanation). We explored this approach by applying both average voting and majority voting strategies, the latter providing the best results, as shown in Table 3. The first ensemble model gathered the best US models of fetal head, abdomen and femur (image-based ensemble model), returning weak results in distinguishing VD vs CS, with a weighted F1-score of 0.584 ± 0.032 and a PPV of 0.585 ± 0.031. Marginally better results were shown by an ensemble model combining the previous three models and the AdaBoost model, providing a weighted F1-score of 0.628 ± 0.018 and a PPV of 0.675 ± 0.021 (see Table 3 and Fig. 4). The final classification ensemble model was the best ensemble model, aggregating the best tabular model (AdaBoost) and the best US image model (Inception femur). It achieved a weighted F1-score of 0.689 ± 0.042 and a PPV of 0.693 ± 0.038 (Table 3 and Fig. 4). It accurately predicted 75.9% of VD and 51.9% of CS, corresponding to a 24.1% FP rate and a 48.1% FN rate, with an overall accuracy of 68.7% (184/268; see Fig. 5c). The confusion matrix and respective AUROC of the final classification ensemble model are displayed in Figs. 4 and 5.
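The two voting strategies mentioned above can be illustrated with a short sketch. It assumes each base model outputs a probability of CS and that 0.5 is the decision threshold. The function names, the threshold, and the `tie_break` rule for an even number of voters are assumptions made for illustration, since the text does not describe how ties are resolved.

```python
def average_vote(probabilities, threshold=0.5):
    """Average voting: mean of the models' CS probabilities, then threshold."""
    mean_p = sum(probabilities) / len(probabilities)
    return "CS" if mean_p >= threshold else "VD"


def majority_vote(probabilities, threshold=0.5, tie_break="VD"):
    """Majority voting: each model casts one vote; ties use tie_break."""
    votes_cs = sum(1 for p in probabilities if p >= threshold)
    votes_vd = len(probabilities) - votes_cs
    if votes_cs == votes_vd:
        return tie_break  # assumed rule, not stated in the source
    return "CS" if votes_cs > votes_vd else "VD"


# Hypothetical CS probabilities from three base models for one patient.
example = [0.55, 0.60, 0.10]
print(average_vote(example))   # mean is about 0.42, below the threshold
print(majority_vote(example))  # two of three models vote CS
```

The example shows how the strategies can disagree: one confident dissenting model pulls the average below the threshold, while majority voting ignores confidence and counts only votes.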
The best tabular data model (AdaBoost) provided an average accuracy improvement of 6.0% over the final classification ensemble model. However, concerning CS prediction, the final classification ensemble model correctly predicted 51.9% of CS (vs 46.9% for the AdaBoost model), with an FP rate of 24.1% (vs 13.3%). Therefore, the tabular data model missed 4 correct CS predictions (TP) relative to the final classification ensemble model, while avoiding 20 unnecessary CS (FP) (see Figs. 2a and 5c).
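The counts quoted above can be checked with a little arithmetic. The sketch below assumes a test fold of 187 VD and 81 CS cases (one third of 563 and 245, summing to the 268 reported for Fig. 5c); this fold composition is an assumption, and the counts are reconstructed from the reported rates rather than read from the figures.

```python
N_VD, N_CS = 187, 81  # assumed test-fold composition (total 268)


def counts_from_rates(vd_correct_rate, cs_correct_rate):
    """Convert per-class correct-prediction rates into confusion counts."""
    tn = round(vd_correct_rate * N_VD)  # VD correctly predicted
    tp = round(cs_correct_rate * N_CS)  # CS correctly predicted
    fp = N_VD - tn                      # VD misclassified as CS
    fn = N_CS - tp                      # CS misclassified as VD
    accuracy = (tp + tn) / (N_VD + N_CS)
    return {"tn": tn, "fp": fp, "fn": fn, "tp": tp, "accuracy": accuracy}


adaboost = counts_from_rates(0.867, 0.469)
ensemble = counts_from_rates(0.759, 0.519)

print(ensemble["tp"] - adaboost["tp"])  # 4 additional correct CS predictions
print(ensemble["fp"] - adaboost["fp"])  # 20 additional VD predicted as CS
print(ensemble["tp"] + ensemble["tn"])  # 184 correct out of 268
print(round(100 * (adaboost["accuracy"] - ensemble["accuracy"]), 1))  # 6.0
```

Under this assumption the reported rates reproduce the 4 additional TP, the 20 additional FP, the 184/268 accuracy and the 6.0% accuracy gap.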

Discussion
This study is the first to verify the feasibility of DL algorithms for the binary classification of mode of delivery after IOL using maternal and fetal electronic medical data and third trimester fetal US images. We developed ML models using tabular data and DL models for imaging data using transfer-learning methods. Our best-performing models were AdaBoost on tabular data, with a PPV of 0.734, and the DL model Inception evaluating femur US images, with a PPV of 0.580. Then, using ensemble-learning methodology, we developed various composite models, the best being based on AdaBoost and Inception US femur images, yielding a PPV of 0.693, approaching the metrics of our best tabular model.
Recent studies use electronic medical information on maternal and fetal characteristics to construct prediction models regarding mode of delivery after IOL [26][27][28]. However, very few use ML with the same goal 2,29. Also, several research studies explored third trimester US biometry planes for image segmentation 30,31, image or plane classification 8,15,16 and fetal biometry estimation 32,33. US image classification has been mainly used for automatic fetal malformation detection 18,34. However, to our knowledge, no study has yet reported the relation between third trimester US fetal plane imaging and mode of delivery outcomes after IOL.
In our study, maternal characteristics related to CS outcomes were compatible with literature findings (see Table 1). In our dataset, women submitted to unplanned CS were older and shorter, with higher BMI and lower Bishop scores, compared with the VD group 6,26,35. Fetal US characteristics such as EFW and fetal biometry measures were also significantly larger for fetuses who underwent CS, which is also compatible with the literature [36][37][38]. However, FL showed no difference between groups. This is an intriguing finding because our best US image model uses femur images (see Fig. 1b and Fig. 2b). The explanation may lie in prenatal predictors of increased fetal adipose deposition, namely on the fetal thigh, which were found to be stronger predictors of unplanned CS than traditional fetal biometry and EFW [39][40][41][42].
In fact, when analyzed individually, each DL image model underperformed, revealing the models' difficulty in ascertaining which image features could aid in mode of delivery prediction (see Fig. 2 and Table 2). This was expected, for two main reasons. The first is that DL models for object-detection and segmentation tasks are more accurate in identifying fetal standard biometry planes than classification models, because they can localize anatomical landmarks before classifying the plane, similar to human reasoning 10. The second concerns understanding AI's effectiveness in complementing clinical processes: there is no study evaluating the accuracy of human evaluation of fetal third trimester US planes and their association with CS, probably due to an empirically unlikely association. Consequently, there is no practical way of evaluating whether our metrics are low or whether they can eventually supersede human intervention.
Regarding metrics for evaluation, our prediction model aims to counsel pregnant women undergoing IOL. Therefore, the main objective is to correctly advise those at high risk of CS and try to reduce the psychological and monetary burden of IOL on these women, as well as to confidently initiate and continue an IOL in women with a high probability of VD. Hence, the aim is to correctly identify true CS (TP) and avoid performing a CS on women who would have a VD (FP). As such, the most useful metrics in our study would be PPV, or the ratio of TP predictions to the total number of predicted positive observations; accuracy, or the proportion of correct predictions made by the model out of the total number of predictions; and sensitivity, defined as the ratio of TP predictions to all observations in the positive class 13. That is why the F1-score is more suitable for our study than the AUC, which can also be influenced by the presence of imbalanced data 43. Hence, the DL models that provided the overall best PPV and F1-scores were the Inception group (see Table 2), which were consequently chosen for the construction of the ensemble models.
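The metrics defined above can be made concrete with a short sketch. It assumes that "weighted" means the support-weighted average of the per-class values, which the text does not state explicitly. The confusion counts used in the example (TN = 162, FP = 25, FN = 43, TP = 38) are illustrative values reconstructed from the reported AdaBoost rates under an assumed test fold of 187 VD and 81 CS; they are not taken from the paper's figures.

```python
def weighted_metrics(tn, fp, fn, tp):
    """Support-weighted PPV, recall and F1 over the VD and CS classes."""
    # For each class: (correctly predicted, all predicted, all actual).
    per_class = {
        "CS": (tp, tp + fp, tp + fn),
        "VD": (tn, tn + fn, tn + fp),
    }
    total = tn + fp + fn + tp
    result = {"ppv": 0.0, "recall": 0.0, "f1": 0.0}
    for correct, predicted, actual in per_class.values():
        precision = correct / predicted if predicted else 0.0
        recall = correct / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        weight = actual / total  # class support as a fraction of all cases
        result["ppv"] += weight * precision
        result["recall"] += weight * recall
        result["f1"] += weight * f1
    return result


# Assumed, reconstructed counts for the AdaBoost model (see lead-in).
metrics = weighted_metrics(tn=162, fp=25, fn=43, tp=38)
print({name: round(value, 3) for name, value in metrics.items()})
```

With these assumed counts the weighted F1-score and PPV come out close to the reported 0.736 and 0.734. The support-weighted recall is by construction equal to overall accuracy, which is worth keeping in mind when comparing "weighted" sensitivity figures with the per-class CS detection rate.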
The first attempt at an ensemble model aggregated all Inception models (US images of the fetal head, femur and abdomen). Its performance showed a worse F1-score than the best Inception image model (femur) (0.584 vs 0.594) with a slightly superior PPV value (0.585 vs 0.580), probably because it aggregated the lower scores of the head and abdomen Inception models. Therefore, the next attempted ensemble model grouped all three Inception models and the AdaBoost model. The latter probably influenced this ensemble positively, with an F1-score and PPV of 0.628 (vs 0.584) and 0.675 (vs 0.585), respectively, compared with the image-based model. Since AI models can only account for information 'seen' during training, this model improved its performance by integrating imaging and electronic health record data 18. Consequently, the last ensemble model, named final classification model, gathered the best tabular ML model and the best image model. Its performance was similar to that of the AdaBoost model, retrieving an F1-score of 0.689 (vs 0.736) and a PPV of 0.693 (vs 0.734). However, a closer look at the confusion matrix shows that the final classification model correctly predicted 51.9% of CS, more than the 46.9% rate of the AdaBoost model. On the other hand, the FP rates were more favorable for the AdaBoost model, showing a 13.3% rate (vs 24.1% for the final classification model). This trade-off between TP and FP can be explained by the difference in specificity (0.867 vs 0.758) and sensitivity (0.746 vs 0.689) of the AdaBoost model over the final classification one. Hence, we could infer that using DL femur US image models could help increase TP diagnosis at the expense of a marginal increase in FP cases 15. As such, the model could be a useful clinical screening tool to distinguish women who are clear candidates for VD from those who have an extremely high risk of CS, or those who would benefit from personalized mode of delivery planning. However, as emphasized in recent literature, AI tools should be used as an adjunct to the decision-making process, and the choices of the obstetrician and the pregnant woman should prevail when counseling on mode of delivery 19.
This study has several strengths. To our knowledge, we present a novel database, comprising 2424 images from 808 fetuses, annotated for mode of delivery classification tasks using ground-truth information. This contrasts with most databases using similar images, which focus on image segmentation and plane classification and do not provide information regarding mode of delivery 8.
The dataset accurately represents a real clinical setting, by being unbalanced and by using images collected retrospectively by various operators using various US machines. We opted not to use oversampling methods, i.e., to artificially increase the representation of minority classes and balance the dataset 13. This would have enhanced our models' performance but would have departed from a real clinical scenario. Also, since our study used routine examination images suffering from speckle noise, low contrast, and variations of machines and settings, our models worked on their heterogeneity and complexities 8,16. We argue that learning from diverse images enhances models' adaptability and applicability in real-world scenarios by identifying consistent patterns and features 8,13,44.
Data augmentation and the use of clinical data alongside imaging data enhanced the robustness and flexibility of the final models 16,17. Finally, our model was designed to be plug-and-play and user-friendly, without many restrictions, to deal with real-world clinical scenarios, allowing centers to upload deidentified images directly from workstations or hospitals to a cloud platform, with or without requiring additional patient data 17.
The study is not without limitations. It is retrospective and uses data from a single center. This, especially for class-imbalanced databases such as ours, may have affected model training and testing and subsequently influenced model metrics, with emphasis on ROC curves 10,18. Future developments may address this limitation by ensuring more CS images are available for successful binary classifier training. Also due to retrospective data collection, our model could not account for clinical or imaging intrapartum variables such as fetal occiput position and engagement. The authors recognize the significance of these assessments, as supported by current research 24.
The inclusion of numerous predictors relative to our sample size raises concerns about overfitting. The model's predictive accuracy on other data has also not been robustly assessed 14. Therefore, we emphasize the importance of external validation as our next step, to assess the constraints of generalization and the possibility of multisite deployment of our model 17. Finally, the results suggest it could be worth exploring data fusion approaches that combine both streams of information, clinical data and image data, into one model.
In summary, this study proposed an ensemble AI model using US images of the fetal femur and maternal-fetal tabular data, yielding a relatively good performance. This is the first attempt to use this type of imaging data for mode of delivery prediction after IOL. The proposed model may become part of a promising tool in assisting mode of delivery counseling in clinical practice.

Datasets
The dataset was retrospectively collected at the Obstetrics Department of University Hospital of Coimbra, a center with two sites (Obstetrics Department A and B), which are specialized maternal-fetal departments that manage thousands of births annually. Sample size was based on feasibility.
Tabular data included 2672 consecutive singleton vertex term pregnant women referred for IOL between January 2018 and December 2021. Other inclusion criteria were maternal age ≥ 18 years and a baseline Bishop score of ≤ 6. Planned CS, antepartum fetal demise, major fetal anomalies, and preterm births were excluded from analysis. Electronic medical records (EMR) were analyzed and, to ensure reliability of data, cases with no information on cervical examination at the time of admission were also excluded (n = 3). The final tabular dataset included 2434 deliveries.
The image dataset was collected based on the previous case selection, taking into consideration pregnant women attending our department for routine third trimester US evaluation. Images acquired during standard clinical practice were collected. Gestational age was computed from crown-rump length measurements on first trimester US 45. Images were taken as a part of the Portuguese screening program, which recommends that the third trimester US should be performed between 30 and 32 weeks and 6 days of gestation 46. Therefore, we decided to include a range of gestational ages from early third trimester (27 weeks) to 32 weeks and 6 days. Only third trimester US were considered for our visual computational model because first and second trimester US have specific goals that do not provide relevant information regarding mode of delivery planning. Of the 2434 subjects selected for tabular data, we excluded those who did not undergo the third trimester US in our institution. Of the ones who did, we excluded those with missing US images, including only examinations that provided at least three US images per fetus (fetal head, abdomen and femur). This resulted in a final dataset of 808 deliveries (cases) and a total of 2424 US images.
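The selection steps above can be expressed as a simple eligibility filter. This is a sketch only: the `Case` record and its field names are hypothetical, gestational age is assumed to be stored in days, and the clinical exclusions (planned CS, fetal demise, major anomalies, preterm birth) are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

REQUIRED_VIEWS = frozenset({"head", "abdomen", "femur"})
GA_MIN_DAYS = 27 * 7      # 27 weeks + 0 days
GA_MAX_DAYS = 32 * 7 + 6  # 32 weeks + 6 days


@dataclass
class Case:
    """Hypothetical record for one delivery; not the authors' schema."""
    maternal_age: int
    bishop_score: Optional[int]           # None if no cervical exam recorded
    ga_days_at_us: Optional[int] = None   # None if no in-house third trimester US
    views: Set[str] = field(default_factory=set)


def in_tabular_cohort(case: Case) -> bool:
    """Age >= 18, Bishop score documented at admission and <= 6."""
    return (case.maternal_age >= 18
            and case.bishop_score is not None
            and case.bishop_score <= 6)


def in_image_cohort(case: Case) -> bool:
    """Tabular criteria plus a complete third trimester image set."""
    return (in_tabular_cohort(case)
            and case.ga_days_at_us is not None
            and GA_MIN_DAYS <= case.ga_days_at_us <= GA_MAX_DAYS
            and REQUIRED_VIEWS <= case.views)


complete = Case(31, 3, 31 * 7, {"head", "abdomen", "femur"})
missing_femur = Case(31, 3, 31 * 7, {"head", "abdomen"})
print(in_image_cohort(complete), in_image_cohort(missing_femur))
```

Keeping the tabular and imaging criteria in separate functions mirrors the two-stage selection in the text, where the image cohort is a subset of the tabular cohort.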
Approval was obtained from the ethics committee of our center (protocol number CE-047/2022). Given the retrospective nature of the analysis, written informed consent was not required. Methods and results are reported in accordance with the TRIPOD guidelines 47.

Data collection
Regarding tabular data collection, maternal age, gravidity, parity, BMI, height, GA, Bishop score, IOL indications, mode of delivery, CS indications, intrapartum complications, neonatal birth weight and neonatal outcomes were among the features examined 2. Data were collected on admission and at the onset of the first stage of labor, after pelvic examination and assessment of both mother and fetus.
Ten different US machines provided by three different manufacturers (Voluson E8, Voluson P8, GE Healthcare, Zipf, Austria; Xario 200G, Aplio a550, Aplio i700, Aplio a, Aplio 400, Aplio 500, Xario 200, Toshiba, Canon Medical, Netherlands, Europe; H540, Samsung) were used for examinations. The percentage and absolute number of images from GE, Toshiba and Samsung ultrasound machines were 3.0% (n = 24), 96.9% (n = 783) and 0.1% (n = 1), respectively. Images were taken using a curved transducer with a frequency range from 3 to 7.5 MHz. Twelve examiners with significant experience (5-35 years) in obstetric US conducted the examinations according to ISUOG guidelines 11. All images were stored in the original Digital Imaging and Communication in Medicine (DICOM) format in our Astraia database and were retrospectively collected. Collection was restricted to images meeting minimal quality requirements, i.e., those with an improper anatomical plane (badly taken or cropped) were omitted.
Regarding IOL, the choice of cervical ripening methods varied according to WHO recommendations and Bishop scores. These included oxytocin infusion, prostaglandin analogues and extra-amniotic balloon catheters 1,48,49. Premature rupture of membranes was defined as membrane rupture at term before labor onset. Prolonged pregnancy was determined at ≥ 41 weeks 48. The definition of labor was the presence of regular uterine contractions with cervical changes 50. Our institution performs IOL according to ACOG and NICE recommendations 51,52.

Figure 1. The ROCs for prediction of mode of delivery for tabular data (a) and image-based data (b). ROC, receiver operating characteristic.

Figure 2. Confusion matrices of (a) the best tabular data model (AdaBoost) and of the image-based models of (b) the femur (the best image-based model), (c) the head and (d) the abdomen. Each confusion matrix depicts, in reading order from left to right, top to bottom: true-negative, false-negative, false-positive and true-positive rates.

Figure 3. Process involved in the establishment of the ensemble models. The three image-based models (Inception head, abdomen and femur) were associated with the best tabular data model, AdaBoost, in three different ways. Green box: image-based model, using the CNN Inception models of the femur, abdomen and head; Orange box: AdaBoost tabular data model with the Inception models of the femur, abdomen and head; Blue box: the final classification model, which consists of the AdaBoost tabular data model and the Inception model of the femur and is the ensemble model that provides the best metrics.

Figure 4. The ROC curves for prediction of mode of delivery for the ensemble models and their comparison with the ROC curves of the AdaBoost and Inception femur models. ROC, receiver operating characteristic.

Figure 5. Confusion matrices of the following ensemble models: (a) image-based model, (b) AdaBoost and Inception models of the femur, head and abdomen (majority voting), (c) the final classification model (majority voting). Each confusion matrix depicts, in reading order from left to right, top to bottom: true-negative, false-negative, false-positive and true-positive rates.

Table 1. Demographic, induction of labour and delivery outcome data for women who delivered vaginally vs. by unplanned caesarean. a There were 19 missing values in the VD group and 15 missing values in the CS group concerning body mass index. b There were 6 missing values in the VD group and 1 missing value in the CS group concerning IOL method. Significant values are in bold.

Table 2. Performance metrics of the DL models on the datasets. Results are presented as mean ± standard deviation. These values were obtained by threefold cross-validation. Columns include: AUROC: area under the receiver operating curve; PPV: positive predictive value; NPV: negative predictive value.

Table 3. Performance metrics of the best tabular model, the best image model and the ensemble models using majority voting and average voting strategies. Results are presented as mean ± standard deviation. These values were obtained by threefold cross-validation. Columns include: AUROC: area under the receiver operating curve; PPV: positive predictive value; NPV: negative predictive value.