Prediction Models for Bronchopulmonary Dysplasia in Preterm Infants: A Systematic Review and Meta-Analysis

Objective: To systematically review and assess the accuracy of prediction models for bronchopulmonary dysplasia (BPD) at 36 weeks of postmenstrual age.

Study design: Searches were conducted in MEDLINE and EMBASE. Studies published between 1990 and 2022 were included if they developed or validated a prediction model for BPD, or the combined outcome death/BPD, at 36 weeks in the first 14 days of life in infants born preterm. Data were extracted independently by 2 authors following the Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modelling Studies (ie, CHARMS) and PRISMA guidelines. Risk of bias was assessed using the Prediction model Risk Of Bias ASsessment Tool (ie, PROBAST).

Results: Sixty-five studies were reviewed, including 158 development and 108 externally validated models. A median c-statistic of 0.84 (range 0.43-1.00) was reported at model development, and 0.77 (range 0.41-0.97) at external validation. All models were rated at high risk of bias, owing to limitations in the analysis domain. Meta-analysis of the validated models revealed increased c-statistics after the first week of life for both the BPD and the death/BPD outcome.

Conclusions: Although BPD prediction models perform satisfactorily, they were all at high risk of bias. Methodologic improvement and complete reporting are needed before they can be considered for use in clinical practice. Future research should aim to validate and update existing models. (J Pediatr 2023;-:113370)


Data Extraction and Analysis
Screening and identification of the studies were independently performed by 2 authors. Titles and abstracts were screened to determine eligibility. A full-text review was performed for all potentially eligible studies to determine inclusion or exclusion. Data from the included studies were independently extracted in duplicate by 2 authors in accordance with the CHARMS checklist9 (eMethods in Appendix 1). Disagreements or differences were resolved by consensus or by a third reviewer. Authors of the included studies were not contacted to obtain additional data.
Characteristics of the included studies, development models, and externally validated models were described. For models that were both developed and validated in the same study, information on development and validation was extracted separately. Participant and predictor data were summarized with medians and ranges. The number of events per variable was calculated for each development model by dividing the number of outcome events (or the number of infants without the outcome, if that number was lower) by the number of candidate predictors. The median c-statistic and range were calculated separately for development models (the apparent c-statistic when it was the only available information, and the c-statistic corrected for optimism when reported) and for externally validated models.
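As a concrete illustration of the events-per-variable calculation described above (all numbers invented, and `events_per_variable` is our own helper, not code from the review):

```python
# Illustrative sketch (hypothetical numbers): events per variable (EPV),
# ie, the number of outcome events -- or the number of infants WITHOUT the
# outcome, whichever is smaller -- divided by the number of candidate
# predictors considered for the model.

def events_per_variable(n_analyzed: int, n_events: int, n_candidate_predictors: int) -> float:
    """EPV = min(events, non-events) / number of candidate predictors."""
    smaller_group = min(n_events, n_analyzed - n_events)
    return smaller_group / n_candidate_predictors

# Hypothetical development cohort: 400 infants analyzed, 120 with BPD,
# 15 candidate predictors considered.
epv = events_per_variable(n_analyzed=400, n_events=120, n_candidate_predictors=15)
print(epv)  # 120 / 15 = 8.0, below the conventional threshold of 10
```

An EPV below 10, as in this invented example, is one of the PROBAST signals for overfitting risk discussed later in the review.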
Externally validated prediction models were analyzed quantitatively using random-effects meta-analyses for the outcomes BPD and death/BPD separately, in order to quantify overall performance, including a prediction interval and the heterogeneity of these models over time. Only discrimination performances, ie, c-statistics, were meta-analyzed, as calibration performances were poorly reported. If researchers performed external validation of several very similar models (eg, one variable coded as continuous in the first model and dichotomized in the second) at the same time point, only the best c-statistic was included. Meta-analysis was performed using the actual c-statistic, including the SE and/or 95% CI, reported in the included study. These values were logit transformed, which resulted in a pooled c-statistic with a corresponding 95% CI.11 In addition, 95% prediction intervals (PIs) were calculated to provide boundaries on the likely performance in future model validation studies comparable with those included in the meta-analysis.12 We then stratified the meta-analysis by timing of assessment (birth to 24 hours of life, day 1 to day 7, and day 8 to day 14). The prespecified minimum number of models to perform meta-analysis was 3; this number was met for each outcome and each time point.
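A minimal sketch of this pooling approach, assuming a DerSimonian-Laird random-effects estimator and a t-based 95% PI (with k-2 degrees of freedom); the review cites its own references for the exact method, and the c-statistics and SEs below are invented for illustration:

```python
import numpy as np
from scipy import stats

def pool_c_statistics(c, se):
    """Pool c-statistics on the logit scale with a DerSimonian-Laird
    random-effects model; return pooled c and a 95% prediction interval."""
    c, se = np.asarray(c, float), np.asarray(se, float)
    theta = np.log(c / (1 - c))               # logit transform
    se_theta = se / (c * (1 - c))             # delta-method SE on logit scale
    w = 1 / se_theta**2                       # fixed-effect weights
    theta_fe = np.sum(w * theta) / np.sum(w)
    q = np.sum(w * (theta - theta_fe) ** 2)   # Cochran's Q
    k = len(c)
    tau2 = max(0.0, (q - (k - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))
    w_re = 1 / (se_theta**2 + tau2)           # random-effects weights
    theta_re = np.sum(w_re * theta) / np.sum(w_re)
    se_re = np.sqrt(1 / np.sum(w_re))
    # 95% prediction interval on the logit scale, t distribution with k-2 df
    t_crit = stats.t.ppf(0.975, k - 2)
    half_width = t_crit * np.sqrt(tau2 + se_re**2)

    def inv_logit(x):
        return 1 / (1 + np.exp(-x))           # back-transform to c-statistic

    return inv_logit(theta_re), (inv_logit(theta_re - half_width),
                                 inv_logit(theta_re + half_width))

# Hypothetical external-validation c-statistics with their standard errors:
pooled, (pi_lo, pi_hi) = pool_c_statistics([0.72, 0.78, 0.81, 0.75],
                                           [0.03, 0.04, 0.02, 0.05])
```

Working on the logit scale keeps the pooled estimate and its interval inside (0, 1); the PI is wider than the CI because it describes where a new validation study's c-statistic is likely to fall, not the mean.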

RoB and Applicability Assessment
The RoB and applicability assessment was performed independently by 2 authors using the PROBAST guidelines.8 The RoB and applicability for each domain were scored as low, high, or unclear. If any of the signaling questions in a domain was answered with "no," the RoB of that domain was scored as high and, consequently, the overall judgment was scored as high RoB.

Results
The search resulted in 11 260 references (Figure 1; available at www.jpeds.com). Duplicates (n = 3298) and reviews, editorial papers, and irrelevant studies based on their title (n = 5239) were excluded. Of the remaining 2723 abstracts, 683 full texts were retained for review. Of these, 618 references were excluded for various reasons (Figure 1). Finally, 65 studies met the inclusion criteria. One study that performed external validation and reported calibration plots for several BPD prediction models was excluded because it used data from infants born before 1990.6 Forty-nine studies described the development of at least 1 model, accounting for 158 developed models in total. Thirty-two studies externally validated at least 1 model, accounting for 108 models. Of those 32 studies, 9 externally validated a model that had been developed in the same paper.13,14,16,21,22,28,30,64,67 Finally, 12 studies both developed at least 1 model and externally validated existing models.16,18,21,50,52,57,58,60,62,63,68,71 The study characteristics of the 65 included studies are described in Table I (available at www.jpeds.com). The included studies were published between 1996 and 2022, and the number of publications increased over time. Almost one-half of the included studies were performed in North America, followed by Europe. Most of the studies included data from single centers. Participants included in the development and validation samples were similar in median gestational age (27.2 and 28.0 weeks, respectively) and median birth weight (911 and 938 g, respectively).

Development of BPD Prediction Models
Model Development. The total number of eligible infants was reported for most models (99.4%) (Table II). However, the total number of analyzed infants was reported less frequently (89.2%) and ranged from 43.8% to 100% of eligible infants. Some of the studies with high numbers of eligible infants did not report how many infants were ultimately analyzed in the models,28,30,43,46 which explains the discrepancy between the median numbers of analyzed infants and of outcome events. The proportion of analyzed infants with BPD ranged from 14.8% to 68.2% across the studies, and with death/BPD from 31.3% to 77.0%. The median number of predictors considered for inclusion in the models was 15 (range 1-50). The number of events per variable was less than 10 in 33.6% and more than 20 in 40.9% of the models. The final prediction models included a median of 6 predictors (range 2-21). The predicted outcome was either BPD (43.7%) or death/BPD (56.3%) at 36 weeks of PMA (Table I). Infants who died before 36 weeks of PMA were either excluded from the analyses (39.2%), included in the no-BPD group (0.6%),13 or included in the death/BPD group (52.5%). The most frequent timing of model assessment was day 7 (31.0%).
As shown in Table II, most models entered all candidate predictors into the multivariable model (56.3%). Logistic regression with all predictors forced into the model was the most frequently used modeling method. Five models (3.2%) used Classification and Regression Tree analysis. In 55 models (34.8%), eligible participants with missing data were excluded from the analysis, and 52 models (32.9%) used single or multiple imputation for missing data. For all other models, insufficient or no information was reported to determine how much data was missing and how it was handled. In some models, continuous predictors were converted into categories (7.6%).

RoB and Applicability Assessment
All developed models were rated at high RoB in the analysis domain (Figure 5). This was due to issues with small sample size (events per variable <10); inadequate handling of missing data by omitting participants with missing data from the analysis and subsequently using complete case analysis instead of multiple imputation; selection of predictors based on univariable analysis before multivariable modeling; the competing risk of death not being accounted for; and the absence of internal validation as part of model development. In addition, most studies reported only a c-statistic as a measure of model performance; calibration was mostly not or inadequately reported (eg, reporting the Hosmer-Lemeshow test instead of a calibration plot). These items resulted in a high RoB rating for the analysis domain for all developed models. Most of the models scored a low RoB in the participants and predictor domains. In one study, the eligible population could not be determined at the time of model assessment at day 7, because one of the inclusion criteria was ventilation during the first 4 weeks of life.48 Inclusion or exclusion criteria were unclear in 3 studies.30,34,70 Four studies considered predictors19,23,26,59 that were not available at the time of BPD risk assessment. Others did not assess predictors at the same time for all infants34,54 or gave unclear definitions for predictors.13,57 Some models also scored high (11.3%) or unclear (5.7%) RoB in the outcome domain, mainly due to BPD assessment occurring earlier than 36 weeks in infants transferred or discharged before 36 weeks of PMA (Table I). Because all models scored high RoB in one of the domains, the overall RoB was rated as high.
Most of the models raised low concern regarding overall applicability. Concerns about applicability were often due to highly selected populations, when only infants receiving mechanical ventilation or with severe respiratory disease were included. Some studies considered radiographic results to define BPD,13,16,23 which is no longer recommended. A complete overview of the RoB and applicability assessment per model is shown in Table III (available at www.jpeds.com).

Model Validation.
All validated models but one were rated at high RoB in the analysis domain (Figure 5), mostly due to small sample size (<100 outcome events). In 2 studies, external validation was performed on 9 and 28 events, respectively.14,21 Other issues were inadequate handling of missing data, the competing risk of death not being accounted for, and lack of assessment of calibration. Most of the validated models scored low RoB in the participants, predictor, and outcome domains; however, because all models but one scored high RoB in the analysis domain, the overall RoB was rated as high. Most models also raised low concerns about applicability (Figure 5, Table III).

Discussion
This systematic review identified 65 studies that developed or validated a multivariable prediction model for BPD or death/BPD at 36 weeks of PMA during the first 14 days of life. In total, 158 development and 108 validated models were reviewed. These models mainly included routinely collected data, such as clinical information and routine laboratory findings.
Although discrimination was satisfactory, with a median c-statistic of 0.84 in the development models and 0.77 in the validated models, these results must be interpreted with caution, as all models scored high for RoB or had applicability concerns. The main bias issues were found in the analysis domain of the PROBAST guidelines, with small sample sizes, inappropriate handling of missing data, absence of internal or external validation, and inappropriate reporting or absence of calibration measures. Applicability issues were mainly due to highly selected populations that were particularly at risk of developing BPD.
In 2013, a systematic review included clinical BPD prediction models published up to 2011.6 In contrast with that review, the current review only included studies that defined BPD at 36 weeks of PMA, because this time point has a stronger association with long-term respiratory and neurological outcomes than the 28-day definition.76,77 Although most studies used the Eunice Kennedy Shriver National Institute of Child Health and Human Development 2001 definition (80%), variability in the BPD definitions used throughout the studies might have contributed to some variability in the c-statistics. A second difference with the 2013 review, and a strength of the current review, is that all models were assessed with a formal RoB assessment tool, PROBAST.8 We noted that all studies had a high RoB, and most studies had a high RoB in the analysis domain. One of the main general flaws in this domain across all existing prediction models is that calibration was rarely assessed or reported. Only 4 studies reported the results of the Hosmer-Lemeshow goodness-of-fit test,18,22,26,29 which has been recognized as inadequate, as it is influenced by sample size and gives no indication of the direction or magnitude of miscalibration.78 A calibration plot is the preferred method to report the calibration performance of a prediction model; this finding was also highlighted in the recent systematic review of mortality prediction models in infants born preterm.79 Assessing the accuracy of absolute risk estimates is a key element that provides useful information for clinical decision-making and should be evaluated before considering using a prediction model in clinical practice.80

Another main flaw in this domain that caused high RoB was small sample size, with events per variable <10, which increases the risk of overfitting, for which correction with internal validation is necessary. Most models did not perform internal validation or performed an inadequate method using a random split of the data. Only a few models used cross-validation, which is, next to bootstrapping, recommended to account for overfitting and optimism in a developed prediction model. Finally, most models scored high RoB in the analysis domain due to inappropriate or unclear handling of missing data. Some studies excluded participants with missing data and performed complete case analysis, which leads to biased model performance. Instead, multiple imputation is recommended to reduce the risk of bias.
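The bootstrap correction for optimism recommended here can be sketched as follows. This is an illustration of the general technique (Harrell's optimism correction) on fully simulated data, not code from any included study:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Simulated cohort: 300 infants, 6 candidate predictors, outcome driven by
# the first two predictors only. All data are invented for illustration.
rng = np.random.default_rng(0)
n, p = 300, 6
X = rng.normal(size=(n, p))
y = (rng.random(n) < 1 / (1 + np.exp(-(X[:, 0] + 0.5 * X[:, 1])))).astype(int)

# Apparent performance: the model evaluated on the data it was fitted on.
model = LogisticRegression().fit(X, y)
apparent_auc = roc_auc_score(y, model.predict_proba(X)[:, 1])

# Bootstrap optimism: refit on each resample, compare performance on the
# resample (optimistic) with performance on the original data.
optimisms = []
for _ in range(200):
    idx = rng.integers(0, n, n)                  # bootstrap resample
    boot = LogisticRegression().fit(X[idx], y[idx])
    auc_boot = roc_auc_score(y[idx], boot.predict_proba(X[idx])[:, 1])
    auc_orig = roc_auc_score(y, boot.predict_proba(X)[:, 1])
    optimisms.append(auc_boot - auc_orig)

# Optimism-corrected c-statistic: apparent performance minus mean optimism.
corrected_auc = apparent_auc - np.mean(optimisms)
```

With small samples and many candidate predictors, the mean optimism grows, and the corrected c-statistic falls noticeably below the apparent one; reporting only the apparent c-statistic, as many included studies did, overstates performance.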
Research in BPD prediction has shifted toward the identification of early biomarkers or imaging tools. Many reports of univariable associations between biomarkers and BPD have shown promising data.81,82 Very few multivariable models using blood or tracheal biomarkers or lung ultrasound have been developed, and none have undergone external validation.52,55 Aside from the potential difficulties in assessing these variables in daily clinical practice, the highlighted pitfalls prevent conclusions about the potential usefulness of prediction models using biomarkers or lung ultrasound data in daily care. In addition, most studies used a single center, which limits the generalizability of the developed and/or validated prediction models. The 3 best-discriminating prediction models showed a c-statistic between 0.97 and 1.00, although there is a risk that these models are overfit. These models included predictors regarding lung mechanics, lung imaging, and echocardiographic data, and used logistic regression analysis for model development with the timing of model assessment in the first week of life. All models had a small sample size with low events per variable, which probably inflated the c-statistic, and 2 models reported the final model including the intercept and predictor weights.63,74 However, none of these models were internally or externally validated, and although one model presented calibration performance with a pseudo-R2, none reported a calibration plot.45 Given these flaws, these models have an increased risk of bias, despite their promising discrimination performances. The 3 best-discriminating externally validated models reported c-statistics between 0.96 and 0.97, which also indicates good discrimination.13,21,63 However, all models had insufficient sample sizes to perform adequate external validation, and no calibration performances were reported. Although these 3 studies showed promising discriminating performances, none of these models can be recommended for implementation in clinical care, owing to methodologic flaws.
A strength of this review is that our meta-analysis provided a quantification of the overall performances, including a PI and the heterogeneity of the models. Meta-analysis of the c-statistics of all validated models for the outcomes BPD and death/BPD revealed good performance (0.77 and 0.82, respectively). Subsequently, meta-analysis combining models per time point showed an increase in the pooled c-statistic of the validated models for both the BPD and death/BPD outcomes after the first week of life. This finding is in line with the pathophysiological understanding that BPD is a disease that develops during the neonatal period, and future research should investigate whether dynamic models using different predictors over time improve BPD prediction. The calculated PIs indicate that the validated models in the second week of life can perform very differently in a new set of infants, indicating that updating is necessary.83

This review has some limitations. We were unable to perform external validation of the identified prediction models due to frequently inadequate reporting of the final prediction models. In addition, we found no dataset that fulfilled all the requirements to perform validation. The ideal dataset should have a sufficient sample size, with population-based recruitment of infants born very preterm to limit the risk of selection bias, and contain all the predictors needed to apply the prediction models, including detailed information on respiratory management, biomarkers that are not routinely collected, and imaging data.
In conclusion, many different BPD prediction models have satisfactory performance. However, their actual value in clinical practice remains uncertain as a result of methodologic issues, lack of external validation, and absence of calibration assessment. Adherence to existing reporting and methodologic guidelines is needed to improve the quality of research on prediction modeling. Future research should aim to externally validate existing models in different countries, assess both discrimination and calibration performances, and conduct impact studies.

Figure 2. Frequency of variables used in the development models. Labels indicate the number of occurrences of each variable out of the 158 models. The categories indicate how often one of the predictors mentioned below was included in one of the development models.

Figure 5. RoB and applicability assessment for the 158 development and the 108 externally validated models.

Table II. Characteristics of the included models

Table II. Continued
BW, backward; CART, Classification and Regression Tree; FW, forward; LASSO, least absolute shrinkage and selection operator; NA, not available.
*The 158 developed models include 141 development-only models and 17 developed models with external validation. The 108 validated models include 87 validation-only models and 21 developed models with external validation.
†Routinely collected data, including sociodemographic characteristics, pregnancy and neonatal events, and standard laboratory findings.

Figure 3. Meta-analysis of externally validated models predicting A, BPD or B, death/BPD at 36 weeks of PMA. *Prediction model externally validated on the outcome moderate or severe BPD. **Prediction model externally validated on the outcome grade II or III BPD.

Table I. Description of included studies
