Ensemble Learning to Improve the Prediction of Fetal Macrosomia and Large-for-Gestational Age

Background: The objective of this study was to investigate the use of ensemble methods to improve the prediction of fetal macrosomia and large for gestational age from prenatal ultrasound imaging measurements. Methods: We evaluated and compared the prediction accuracies of nonlinear and quadratic mixed-effects models coupled with 26 different empirical formulas for estimating fetal weights in predicting large fetuses at birth. The data for the investigation were taken from the Successive Small-for-Gestational-Age-Births study. Ensemble methods, a class of machine learning techniques, were used to improve the prediction accuracies by combining the individual models and empirical formulas. Results: The prediction accuracy of individual statistical models and empirical formulas varied considerably in predicting macrosomia but varied less in predicting large for gestational age. Two ensemble methods, voting and stacking, with model selection, can combine the strengths of individual models and formulas and can improve the prediction accuracy. Conclusions: Ensemble learning can improve the prediction of fetal macrosomia and large for gestational age and have the potential to assist obstetricians in clinical decisions.


Introduction
Excessive fetal growth poses risks to maternal and infant well-being [1]. The term "macrosomia" is used to describe the condition of a fetus with a birth weight of more than 4000 g, regardless of gestational age [1,2]. Macrosomia is sometimes confused with "large for gestational age" (LGA), which describes an infant with a 90th percentile or higher birthweight for gestational age [1]. It has been shown that the infants with either macrosomia or LGA pose a large risk of perinatal morbidity and mortality to their mothers [2,3]. Therefore, accurately predicting and diagnosing these conditions has been a major goal of obstetric investigators, for purposes of conducting early intervention or targeted perinatal medical care to reduce the risks.
Measurements taken from prenatal ultrasound imaging are the primary quantitative resources for predicting birth weights and diagnosing macrosomia or LGA. Zhang et al. [4] took the empirical formula given by Hadlock et al. [5] for estimating fetal weights and implemented a joint mixed-effects model to predict macrosomia and LGA. This procedure of predicting macrosomia or LGA from prenatal ultrasound measurements was a two-step supervised learning process. In the first step, an empirical formula was chosen to derive the estimated fetal weights (EFWs) from sonographic ultrasound measurements at each of the gestational time points when the ultrasound measurements were taken and recorded. In the second step, a joint mixed-effects model with latent subject-specific random effects was fitted. One component of the joint model was a quadratic mixed-effects model to derive the predicted birth weights (PBWs). Another component was a probit mixed-effects model, from which the classification of macrosomia or LGA was determined from the PBWs by comparing the PBWs with a pre-specified threshold [6]. However, previous literature has noted that there were 26 candidate empirical formulas for estimating fetal weights [7][8][9]. In addition, some literature also argued that nonlinear mixed-effects models should be more appropriate in modeling growth curves than linear or quadratic mixed-effects models [10]. Selecting a particular EFW empirical formula and a quadratic mixed-effects model, as in Zhang et al. [4], increases the uncertainties in predicting macrosomia or LGA, because these empirical formulas and the statistical models function diversely.
In this article, we investigate the use of ensemble methods to aggregate prediction results given by different EFW empirical formulas and the statistical models. The goal of this practice is to improve the prediction in macrosomia or LGA. With ensemble methods, it is not required to select any specific statistical model or empirical formulas. Instead, the prediction capability of each combination of the empirical formulas and statistical models is combined and aggregated to generate a learning procedure that gives the best prediction performance.

Data
The Successive Small-for-Gestational-Age Births study (SGA study) was funded by the National Institute of Child Health and Human Development (NICHD), in the National Institutes of Health of the USA in 1983. The study was conducted concurrently by the University of Bergen in Norway, the University of Uppsala in Sweden, and the University of Alabama, USA, from 1984 to 1985 [11]. The initial goal of the SGA study was to characterize the different types of intra-uterine growth restriction and to assess the associated risk factors.
To demonstrate the strengths of ensemble methods in predicting fetal macrosomia or LGA, we took the Scandinavian data in the SGA study that were collected in Norway and Sweden. The Scandinavia SGA data were collected from January 1st , 1986, through March 31st, 1988 from nulliparous (parity 1) and primiparous (parity 2) Caucasian pregnant women prior to the 20th gestational week who had a singleton pregnancy and spoke one of the Scandinavian languages. A total of 6354 women were recruited to the study, and 632 of them were excluded from the study, according to the exclusion criteria (n = 432) or due to absence of first prenatal visit (n = 200). The remaining 5722 patients were split into three subgroups. A random sample of 561 patients was first selected. Then, a "high-risk" group of 1384 patients was identified out of the random sample of 561 patients on the basis of five small-for-gestational-age (SGA) risk factors: giving birth to an infant with birthweight below 2750 g in the past, maternal cigarette smoking at conception, pre-pregnancy weight lower than 50 kg, a previous perinatal death, and the presence of chronic maternal disease (chronic renal disease, hypertension, or heart disease). The remaining 3777 patients were considered as the "low-risk" group. The patients from the random sample and "high-risk" group were eligible to participate in a detailed follow-up study, during which the women were examined at approximately 17th, 25th, 33rd, and 37th weeks of gestation. For each visit, ultrasound examination was performed, and demographic and medical information were collected. At the end of the Scandinavia SGA study, only 1945 were able to complete the follow-up study.
In our study, we restricted the study data to the 1115 women in the Scandinavia SGA study who had all four ultrasound examinations and complete covariate information (maternal age, pre-pregnancy body weight and height, previous diseases history, and smoking history) in the Scandinavia SGA study. In the ultrasound examination records in the Scandinavia SGA study, there were three fetal measurements: biparietal diameter (BPD), middle abdominal diameter (MAD), and femur length (FL).

An Ensemble Learning Procedure of Predicting Macrosomia and LGA
Here, we developed a four-step ensemble learning procedure to predict macrosomia or LGA from prenatal ultrasound measurements. The output of the learning procedure is the binary classification of either macrosomia or LGA. The prediction of macrosomia and LGA is conducted based upon a set of input features. The primary input features are the sonographic measurements BPDs, MADs, and FLs collected from 17th, 25th, 33rd, and 37th weeks of gestation, as well as gestational age at delivery, in the Scandinavia SGA study. Other features also include maternal age, pre-pregnancy body mass index, parity, smoking status, existing diabetes, and gestational diabetes. Figure 1 shows a diagram that delineates the four steps in the ensemble learning procedure, to predict macrosomia or LGA.
Step 1 is to take the sonographic measurements for each of the gestational weeks and each of the empirical formulas in Table A2, to obtain EFWs at each gestational time point.
Step 2 is to fit either a nonlinear mixed-effects model or a quadratic mixed-effects model to predict the corresponding PBWs at birth from the EFWs obtained in Step 1.
Step 3 is to use the PBWs to derive the classification of macrosomia or LGA by using a specified threshold.
Step 4 is the ensemble learning step, in which prediction output from various empirical formulas and the nonlinear and quadratic mixed-effects models are combined to generate the ensemble learning prediction results.

Estimated Fetal Weights with 26 Empirical Formulas
In the ultrasound examination records in the Scandinavia SGA study, there were three fetal measurements BPD, MAD, and FL. Melamed et al. [7] summarized 26 different empirical sonographic formulas (see Appendix A Table A1) that can be taken to estimate fetal weights from sonographic ultrasound measures. In our ensemble learning procedure, we considered all the 26 models in our analysis. Abdominal circumference (AC) in the empirical sonographic formulas can be calculated by 3.1416 × MAD, and head circumference (HC) can be derived by the formula introduced in [12].

Mixed-Effects Models for Predicting PBWs and Deriving the Classification of Macrosomia or LGA
We built a nonlinear three-parameter mixed-effects logistic model with latent random effects to predict PBWs and derive the classification of macrosomia or LGA [10,13]. The three-parameter mixed-effects logistic model for the ith fetus (i = 1, · · · , n indexing the study subjects) at gestational time t ij ( j = 1, 2, 3, 4 indexing the time when the gestational ultrasound measurements were recorded, j = 5 indexing the time of birth) is as follows: where y ij is the EFW obtained from one of the 26 formulas in Table A1 at gestational time t ij , j = 1, 2, 3, and 4, and y ij is the birth weight when j = 5, φ i = (φ 1i , φ 2i , φ 3i ) are model parameters in which φ 1i indicates amplitude, φ 2i indicates the smoothness, φ 3i indicates stretch, and ij are within-subject random errors. For the nonlinear mixed-effects model, each parameter φ ki (k = 1, 2, 3 indexing the parameter), can be further modeled by a linear representation: where X ij is a vector of fixed-effects the covariates of maternal age, pre-pregnancy body mass index, previous disease history, including diabetes, cardiac disease, high blood pressure, renal disorders, and other diseases, and smoking during pregnancy; and β k denotes the corresponding regression parameters.
The first element in X ij equals 1 for the interception. In our learning procedure, the covariates were included in the linear term φ 1i = X ij β 1 + b 1i , and φ 2i and φ 3i were specified as φ 2i = β 2 + b 2i and φ 3i = β 3 + b 3i . We also considered the nonlinear three-parameter mixed-effects logistic model without any covariates in φ 1i such that φ 1i = β 1 + b 1i . The random effects b i = (b 1i , b 2i , b 3i ) represent the latent individual variations that are not explained by the covariates. We assumed the random effects, b i , independently follow a multivariate normal distribution, with a vector of mean 0 and a variance-covariance matrix Σ, where Σ was assumed to be positive, definite, and unstructured. Further, we assumed the following heteroscedastic model [10] for the within-subject random errors ij :

An Ensemble Learning Procedure of Predicting Macrosomia and LGA
Here, we developed a four-step ensemble learning procedure to predict macrosomia or LGA from prenatal ultrasound measurements. The output of the learning procedure is the binary classification of either macrosomia or LGA. The prediction of macrosomia and LGA is conducted based upon a set of input features. The primary input features are the sonographic measurements BPDs, MADs, and FLs collected from 17th, 25th, 33rd, and 37th weeks of gestation, as well as gestational age at delivery, in the Scandinavia SGA study. Other features also include maternal age, pre-pregnancy body mass index, parity, smoking status, existing diabetes, and gestational diabetes. Figure 1 shows a diagram that delineates the four steps in the ensemble learning procedure, to predict macrosomia or LGA.
Step 1 is to take the sonographic measurements for each of the gestational weeks and each of the empirical formulas in Table A2, to obtain EFWs at each gestational time point. Step 2 Input: Sonographic measurements BPDs, MADs, and FLs from 17th, 25th, 33rd, and 37th weeks of gestation.
Step 1: Using a specified empirical formula to derive the estimated fetal weights (EFWs) from sonographic measurements BPDs, MADs, and FLs.
Step 2: Fitting a mixed-effects model to derive the corresponding predicted birth weights (PBWs).

Output: Classification of macrosomia or LGA
Step 3: Using the PBWs to derive the classification of macrosomia or LGA with a specified threshold.
Step 4: Using ensemble methods to aggregate prediction results from various empirical formulas and mixed-effects models. We compared the above nonlinear mixed-effects models with the following quadratic mixed-effects model implemented in Zhang et al. [4].
The configuration of X ij , β, b i , and ij is identical to those in the nonlinear mixed-effects model, and θ = (θ 1 , θ 2 ) are the parameters of time fixed-effects t ij and t 2 ij . For the quadratic mixed-effects model, we also considered the one without any covariates for predicting the birth weights and deriving the classification of macrosomia or LGA.

Ensemble Learning Methods
Here, we propose to apply ensemble methods [14], to combine the prediction results generated from 26 individual EFW empirical formulas and from nonlinear and linear mixed-effects models. One appealing property for ensemble methods is that they combine the classification strengths of individual models but do not overfit the data. Two types of ensemble algorithms, majority voting and stacking, are considered. Majority voting is one of the most fundamental ensemble methods for classification [14]. For a binary classification problem, the final classification of majority voting is the class that receives more than half of the votes from the individual learning models. Because the individual learning models can be correlated, it is necessary to select among individual learning models that are combined in majority voting [15][16][17]. Here, we implemented least absolute shrinkage and selection operator (LASSO) [18], smoothed clipped absolute deviation (SCAD) [19], and minimax penalized likelihood (MCP) [20] to select individual learning models.
The stacking method [21,22] is one of the most known meta-learning methods. It combines the prediction results from several different individual learning models, called "first-level learners", by another learning model, named "second-level learner" or "meta-learner". Van der Laan et al. [23] proposed to train the meta-learner by a unified cross-validation algorithm or super learner and proved its oracle properties. The unified cross-validation algorithm is summarized as follows. For a K-fold cross-validation procedure (K = 10 is set here), the training dataset was split into K equal-sized groups, stratified by the response variable. Then, let the k-th group be the validation data, take the remaining data to train the first-level learners, and collect the prediction values from the validation data as the covariates of the meta-leaner. By repeating the above procedure on every fold of data, along with the original response variable, a complete dataset, called the "leave-one data", is generated for training the meta-learner. Several reports [21,22,24] proposed using the logistic regression with positive constraint on the regression coefficients as the meta-learner for classification, whereas Van der Laan et al. [23] used linear regression as the meta-learner for regression models. Debray et al. [24] suggested performing model selection on the meta-learner, so we applied LASSO, SCAD, and MCP to conduct model selection on the meta-learner.

Evaluation of Prediction Performance
In our investigation, the original dataset was randomly divided into a training dataset (70%, n = 781) and a testing dataset (30%, n = 334), stratified by the presence of macrosomia or LGA. For an individual mixed-effects model with a particular EFW empirical formula, the entire training dataset, along with the ultrasound measures and demographic information in the testing dataset, was used to fit the model. The birth time, birthweight, and the true macrosomia or LGA status in the testing dataset were used to evaluate the prediction performance. For ensemble methods, the ensemble learner was trained by the training dataset and tested by the testing dataset.
The prediction accuracy of each individual mixed-effects model with a particular EFW empirical formula, as well as the ensemble learner, was assessed by the areas under the receiver operating characteristic curve (AUC), sensitivity, specificity, positive predictive value (+PV), negative predictive value (−PV), positive likelihood ratio (+LR), negative likelihood ratio (−LR), and Youden's index (sensitivity + specificity − 1) for predicting both macrosomia and LGA. Table 1 shows the baseline characteristics of our study subjects (n = 1115). The mean maternal age of the study subjects was 28 years (standard deviation or SD: 4 years), and the average height and weight before pregnancy were 166 centimeters (SD: 6 centimeters) and 59 kg (SD: 10 kg), respectively. Twenty-one percent of women had a history of SGA births, and about half of them smoked at enrollment. Only a few had high blood pressure, cardiac disease, diabetes, or renal disorder, but 15% had other types of diseases. The mean gestational age at birth was 280 days (SD: 8 days), and the mean birthweight was 3562 g (SD: 478 g). Seventeen percent of infants had macrosomia, 11% had LGA at birth, and, among the LGA infants, 9 of them (7%) were not macrosomia infants. The current study sample was used as the reference of LGA [11].

Prediction Performance of Individual Models and Empirical Formulas in Macrosomia
We predicted macrosomia or LGA with the nonlinear and quadratic mixed-effects models described above, and also considered those models with or without the covariates. Thus, four mixed-effects models combined with 26 empirical formulas for EFWs, totally 104 learning models, were fitted. The prediction performance of individual learning models in predicting macrosomia is reported in Appendix A Tables A2-A5. For the three-parameter mixed-effects logistic models, their prediction performance varied with different EFW empirical formulas, and adding covariates into the models did not improve their prediction performance (see Appendix A Tables A2 and A3). The three-parameter mixed-effects logistic models, either with or without covariates, predicted all birth weights to be under 4000 g when combined with empirical formula 3 in Table A1. The two nonlinear mixed-effects models combined with empirical formula 11 in Table A1 gave the highest value of Youden's index of 0.670, and the lowest value of Youden's index of 0.050 was obtained by empirical formula 18 in Table A1 without covariates. The best sensitivities of 0.857 and 0.839 were obtained from the two models with or without covariates, respectively, when coupled with empirical formula 7 in Table A1. The AUCs for these models ranged from 0.871 to 0.910. The prediction performance of the quadratic mixed-effects models also varied among different empirical formulas, and adding covariates did not improve their prediction performance either (see Appendix A Tables A4 and A5). The quadratic mixed-effects models combined with empirical formula 12 in Table A1 gave the highest value of Youden's index of 0.663, whereas the lowest value of Youden's index of 0.310 was obtained by empirical formula 3, in Table A1, without covariates. The best sensitivity of 0.982 was obtained from the two models with or without covariates when coupled with empirical formulas 7 and 11 in Table A1. The AUCs for these models ranged from 0.857 to 0.905.

Prediction Performance of Individual Models and Empirical Formulas in LGA
LGA is defined as a newborn with a birthweight greater than the 90th percentile for gestational age. Here, given the input of sonographic ultrasound measures, both the nonlinear and quadratic mixed-effects models can generate PBWs for each of fetuses at any time point, regardless of its actual birth time. A newborn was classified as LGA if his or her PBW was above the 90th percentile of PBWs of all other fetuses, when the fetuses were assumed to be born at the identical birth time as that newborn. The prediction performance of the nonlinear and quadratic mixed-effects models is reported in Appendix A Tables A6-A9. The prediction performance showed small variation among different EFWs empirical formulas and among these models. The Youden's indexes ranged from 0.354 to 0.526, the sensitivities ranged from 0.424 to 0.576, the specificities ranged from 0.930 to 0.950, and the AUCs ranged from 0.863 to 0.894.

Prediction Performance of Ensemble Methods
Two ensemble methods, voting and stacking methods, were applied to combine the 104 learning models for the prediction of macrosomia. The voting method was implemented by using two approaches: voting from all learning models without selection and voting from the selected learning models. To select among the learning models, a penalized logistic regression was run with either a LASSO, SCAD, or MCP penalty. When the stacking method was implemented, we fit a linear regression model as our meta-learner, either without variable selection or with three variable selection methods, LASSO, SCAD, and MCP. The prediction performance of the ensemble methods for the prediction of macrosomia is summarized in Table 2. The voting and stacking methods with the SCAD selection yielded the best Youden's indexes of 0.681 and 0.688, respectively, which were higher than the Youden's indexes generated by any other individual learning models in Tables A2-A5. The voting method with the SCAD and MCP selection and the stacking method with the LASSO, SCAD, and MCP selection outperformed most of the individual learning models listed in Tables A2-A5. The voting method with the SCAD or MCP model selection generated an AUC of 0.932 and 0.924, respectively, higher than the AUCs generated by any other individual learning models in Tables A2-A5. We also applied the voting and stacking methods for the prediction of LGA, to combine the 104 learning models. The voting method was implemented as described in the prediction of macrosomia. When the stacking method was implemented, we fit a logistic regression with positive constraints [22], without further selection on first-level learners, and a logistic regression with the LASSO, SCAD, and MCP selection on first-level learners as our meta-learner. The results are summarized in Table 3. These results showed that the best prediction results were supplied by the voting method with the MCP selection with a Youden's index of 0.537 and a sensitivity of 0.636, both higher than any of the individual learning models in Tables A6-A9.  Areas under the receiver operating characteristic curve (AUC), confident interval (CI), positive likelihood ratio (+LR), negative likely ratios (−LR), positive predictive values (+PV), negative predictive values (−PV).

Discussion and Conclusions
We proposed using ensemble methods to combine the strengths from nonlinear and quadratic mixed-effects models and 26 empirical formulas of EFWs to predict macrosomia and LGA of newborns from sonographic ultrasound measurements. The prediction performance of the ensemble methods was studied with the data from the Scandinavia SGA study. We showed that the prediction performance varied among the empirical formulas and mixed-effects models. The three-parameter mixed-effects logistic model combined with empirical formula 11 in Table A1 gave the best prediction results in predicting macrosomia. In predicting LGA, the prediction performance also varied. The best prediction results were obtained from the quadratic mixed-effects model combined with empirical formula 6 in Table A1. These results showed that it was difficult to select any individual statistical learning model combined with only one empirical formula of EFWs, to predict macrosomia or LGA from sonographic ultrasound measurements.
We subsequently proposed applying ensemble methods to aggregate the prediction results from mixed-effects models and empirical formulas of EFWs, to achieve better prediction performance. Our investigation showed that, with the aid of either the SCAD or MCP for model selection, both stacking and voting methods improved the prediction accuracies in predicting macrosomia, as opposed to those from individual statistical models and empirical formulas. The voting method with the MCP for model selection predicted LGA more accurately than the individual statistical models and empirical formulas. In this study, the ensemble prediction algorithms were created from all the prenatal ultrasound measures and birth weights. However, the algorithms can be used to make prediction on macrosomia or LGA with the ultrasound measures only from the first or second trimester, although the ultrasound measures collected in the third trimester or before birth can substantially improve the prediction accuracy. Our current study is a feasibility study that demonstrates the ensemble methods can be integrated with ultrasound examinations to assist obstetricians in clinical diagnosis on whether a pregnant woman will give birth to a large infant, and to further guide clinical interventions for the condition. However, measurable clinical benefits are unclear until they are demonstrated in prospective clinical studies.
Our current study has several limitations. The proportion of macrosomia or LGA infants in the Scandinavia SGA dataset is small (15% for macrosomia and 11% LGA of n = 1115 infants). This may largely influence the prediction accuracy given by the individual models and empirical formulas. However, using machine learning methods specifically ensemble methods, can accommodate such imbalanced data, and improve prediction accuracy [23]. In addition, the initial objective of the SGA study was to characterize the intra-uterine growth restriction and assess the associated risk factors of SGA, but the study was not designed for studying macrosomia or LGA. As a consequence, the risk factors associated with macrosomia or LGA were not thoroughly collected. This may be the reason why, in our study, we did not receive benefits from adding covariates into the mixed-effects models. Lastly, The SGA study was conducted in 1980s, with out-of-date sonographic ultrasound examination technologies, so the prediction models developed in this study may not be directly applied to predict macrosomia or LGA with the ultrasound measures from most recent state-of-the-art ultrasound technologies. However, the ensemble methods can still be applied to aggregate any available ultrasound prediction models. Also, the subjects in this study were the Caucasians from Europe that were mostly not obese (11.6% overweight rate and 2.1% obese rate in the study). New studies and data on a diverse population should be able to substantially improve the prediction of macrosomia and LGA among both the whites and minorities, as well as the overweight and obese population.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
Appendix A Table A1. Empirical formulas used for estimating fetal weight from sonographic ultrasound measurements (Melamed et al. [5]).

Formula
Reference Equation