Neurodevelopmental Impairments Prediction in Premature Infants Based on Clinical Data and Machine Learning Techniques

Preterm infants are prone to NeuroDevelopmental Impairment (NDI). Some previous works have identified clinical variables that can be potential predictors of NDI. However, machine learning (ML)-based models still present low predictive capabilities when addressing this problem. This work evaluates the application of ML techniques to predict NDI using clinical data from a cohort of very preterm infants recruited at birth and assessed at 2 years of age. Six different classification models were assessed, using all features, clinician-selected features, and mutual information feature selection. The best results were obtained by ML models trained on mutual information-selected features with oversampling for cognitive and motor impairment prediction, while for language impairment prediction the best setting was clinician-selected features. The performance indicators in this local cohort are consistent with similar previous works but still rather poor. This is a clear indication that, in order to obtain better performance rates, further analysis and methods should be considered, and other types of data should be taken into account together with the clinical variables.


Introduction
Very preterm infants are at high risk of developing neurodevelopmental impairments (NDI), which include a range of language, cognitive, sensory, and motor impairments [1]. Those born at or before 32 weeks of gestational age are at a higher risk of NDI. Research studies have identified clinical conditions associated with adverse outcomes, grouped into prenatal, perinatal, and comorbidity factors. Prenatal risk factors include infections [2], hypertension [3], and malnutrition [4]. Perinatal risk factors include gestational age [5], sex [6], and Apgar score [7]. Moreover, comorbidities may also impact the long-term neurodevelopmental outcome, such as necrotizing enterocolitis [8], sepsis [9], and bronchopulmonary dysplasia [10], among others. These risk factors can be prospective predictors of NDI and provide a better understanding of the potential pathways to adverse outcomes in preterm infants.
Previous models using traditional statistical methods have been developed to predict NDI in preterm infants using antenatal and neonatal clinical data [11][12][13]. These studies have demonstrated the value of predictive models that can assist neonatologists in early diagnosis and decision-making. Despite these efforts, drawbacks have been noted, such as the lack of diversity in the variables, treating the variables as independent, and measuring a combined risk of NDI and/or death, among others. Nowadays, owing to the exponential increase in data, advantages can be obtained by applying machine learning (ML) models that can complement these statistical models. Previous studies have used clinical features as predictors of NDI in preterm infants and extremely-low-birth-weight infants using ML models. Research performed by Ambalavanan et al. [14] used antenatal and perinatal variables to predict neurodevelopmental outcomes at 18 months using neural networks. Among the limitations of this model were the small sample size and the selection of variables, as only variables with previously reported risk were used.
Furthermore, Ambalavanan et al. [15] used antenatal and postnatal data to predict NDI using classification tree models; among the drawbacks they acknowledge is that the model predicted a combined outcome of NDI or death. Moreover, some of the classification tree nodes were based on very small subsets of the original dataset (e.g., a specific treatment) and were less accurate than those based on larger subsets. Another study, performed by Juul et al. [16], aimed to predict NDI outcomes at 2 years using Bayesian Additive Regression Trees with three subsets of selected variables. This research provided meaningful conclusions, such as that a dichotomous version of the NDI outcome performed better than its original version and that total transfusion volume was the most important predictor for their cohort.
Despite these efforts, to the best of our knowledge there is still no ML-based solution that performs well on this task with small cohorts and highly imbalanced datasets using clinical data from very preterm infants. This study aims to evaluate the application of ML techniques to predict NDI in very preterm infants using clinical data from a local cohort acquired at the Hospital Puerta del Mar, in Cádiz, Spain, which includes prenatal, perinatal, and comorbidity records. The authors' main goal is to determine which set of ML techniques, together with which set of clinical variables, best predicts NDI outcomes in very preterm infants. The main contributions of this work are summarized next:

•
Develop supervised ML models to predict NDI in very preterm infants.

•
Analyze the performance of these ML models when using all available clinical features, a subset of them selected by experts in this field, and mutual information-selected features.

•
Apply a commonly used data augmentation technique to deal with data scarcity and class imbalance issues.
The rest of the paper is organized as follows. Section 2 describes the study design and data collection. Section 3 explains the feature selection methods and ML classifiers implemented in this research. Section 4 describes the strategies for the generation of synthetic data and the evaluation metrics that have been considered. Section 5 details and discusses the results obtained with the experimental design and the chosen methods. Finally, Section 6 gives an overview of the conclusions of this study and the approaches to be considered for future work.

Study Design and Participants
Data was prospectively collected in a cohort study including very preterm infants from May 2018 to January 2021 at Puerta del Mar University Hospital, Cádiz, Spain. Research and Ethics Committee approval and informed consent of participants were obtained. Inclusion criteria were very preterm infants born at ≤32 weeks of gestation and/or very-low-birth-weight infants (≤1500 g).
A total of 52 clinical features were used in this study, including prenatal, perinatal, and comorbidity features. Prenatal variables refer to the mother's health records at the time of pregnancy, such as age, hypertension, hypothyroidism, chorioamnionitis, gestational diabetes, preeclampsia, cesarean delivery, in vitro fertilization, etc. Perinatal variables refer to clinical records of the preterm infant during delivery, such as gestational age, sex, Apgar score at 1 and 5 min of life, Clinical Risk Index for Babies (CRIB index), intubation in the delivery room, head circumference at birth, and small for gestational age (birth weight below the 10th centile), etc. Comorbidities include patent ductus arteriosus, bronchopulmonary dysplasia, days of oxygen therapy, and mechanical ventilation, among others.

Neurodevelopmental Assessments at 2 Years Corrected Age
In this study, we analyzed a dataset of 180 very preterm infants, each assessed for neurodevelopmental outcomes at 2 years of corrected age. The corrected age is the chronological age reduced by the number of weeks born before 40 weeks of gestation. Assessments were conducted using the Bayley Scales of Infant Development, 3rd Edition (Bayley-III) [17]. This test evaluates three different areas of neurodevelopment: motor, cognitive, and language. The scores are independent for each area, and the evaluation is performed by a qualified clinical psychologist. The Bayley score is a quantitative variable, categorized using the following thresholds: ≥85 normal neurodevelopment, <85 mild impairment, and <70 severe impairment.
To address the imbalance in our dataset, we simplified the problem to binary classification: values ≥ 85 were considered normal neurodevelopment and values < 85 mild to severe impairment. For cognitive impairment, 155 patients had a normal neurodevelopmental outcome, while 25 had mild to severe impairment. For motor impairment, 156 patients had a normal neurodevelopmental outcome, while 24 had mild to severe impairment. For language impairment, 139 patients had a normal neurodevelopmental outcome, while 41 had mild to severe impairment.
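This binarization can be sketched as follows (the scores shown are hypothetical, not taken from the cohort):

```python
import numpy as np

# Hypothetical Bayley-III composite scores (not from the cohort)
bayley_scores = np.array([102, 84, 69, 91, 78])

# Binary outcome used in this study:
# 1 = mild-to-severe impairment (score < 85), 0 = normal (score >= 85)
impairment = (bayley_scores < 85).astype(int)
print(impairment)  # -> [0 1 1 0 1]
```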

Data Curation and Pre-Processing
Of the 52 features considered in this study, 32 features had less than 6% missing values, while 6 features had around 30% to 40% missing values. Features with missing values were imputed using two simple methods, according to the type of feature. The imputation approaches used in this study are simple and fast; however, considering the complexity of clinical features, suggestions from the clinicians were taken into account when applying them.

•
Numerical features: imputation was performed using the mean value, where missing values were replaced with the mean of all known values in each feature. Moreover, normalization was applied, scaling each feature to the range 0-1. Scaling was fitted on the training dataset, and the test set was normalized using the training-data parameters.

•
Categorical features: most-frequent-value imputation was performed, where missing values were substituted with the most frequent value in each feature. Subsequently, these features were transformed using dummy encoding, where each feature with k classes was converted into k − 1 dummy variables [18].
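As an illustration, the imputation and encoding steps above can be sketched with scikit-learn on toy data (the feature names and values here are hypothetical, not from the cohort):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: one numerical and one categorical clinical feature
train = pd.DataFrame({"birth_weight": [1200.0, np.nan, 1450.0, 900.0],
                      "sex": ["M", "F", np.nan, "F"]})
test = pd.DataFrame({"birth_weight": [np.nan, 1300.0]})

# Numerical: mean imputation, then 0-1 scaling fitted on the training set only;
# the test set is transformed with the training-set statistics
num_imputer = SimpleImputer(strategy="mean")
scaler = MinMaxScaler()
train_num = scaler.fit_transform(num_imputer.fit_transform(train[["birth_weight"]]))
test_num = scaler.transform(num_imputer.transform(test[["birth_weight"]]))

# Categorical: most-frequent-value imputation, then dummy (k - 1) encoding
cat_imputer = SimpleImputer(strategy="most_frequent")
train["sex"] = cat_imputer.fit_transform(train[["sex"]]).ravel()
train_cat = pd.get_dummies(train["sex"], drop_first=True)  # k classes -> k - 1 columns
```

Fitting the imputer and scaler on the training split alone, as described above, avoids leaking test-set statistics into the model.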

Feature Selection
Clinician-based feature selection was performed by clinicians, based on their expertise in following preterm infants over the years; this selection reduced the feature set to 22 features.
Additionally, the mutual information method was employed as an alternative feature selection approach. This method calculates the mutual information between each feature and the target variable [19]. To determine the optimal number of features, an additional variable with random values was generated and analyzed together with the rest of the variables through this feature selection method. Then, only features with a mutual information coefficient greater than that of the random variable were retained.
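A minimal sketch of this random-probe procedure, using scikit-learn's `mutual_info_classif` on synthetic data (all variables and values here are illustrative):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 200

# Two hypothetical informative features, one pure-noise feature
X_informative = rng.normal(size=(n, 2))
y = (X_informative[:, 0] + 0.5 * X_informative[:, 1] > 0).astype(int)
X_noise = rng.normal(size=(n, 1))

# Random probe appended as the last column: its MI score sets the threshold
probe = rng.normal(size=(n, 1))
X = np.hstack([X_informative, X_noise, probe])

mi = mutual_info_classif(X, y, random_state=0)
threshold = mi[-1]                            # MI of the random probe
selected = np.where(mi[:-1] > threshold)[0]   # keep features above the probe
```

Only features whose mutual information exceeds that of the probe survive, which makes the cut-off data-driven rather than an arbitrary top-k.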

Classifiers
We evaluated different classification models, which are listed and defined below.

•
Logistic Regression: a regression model used in binary classification problems. It models the log-odds of the dependent variable as a linear combination of the independent variables and employs the logistic function to transform this linear combination into a probability value between 0 and 1 [20].

•
AdaBoost: a meta-estimator that starts by fitting a classifier on the initial dataset and then fits multiple copies of the classifier on the same dataset, but adjusts the weights of incorrectly classified instances so that succeeding classifiers focus more on difficult cases [21].

•
Decision Trees: a model that predicts the target variable by learning simple decision rules based on the data features. The root node serves as the base of the model and is followed by the intermediate and leaf nodes [20].

•
Random Forest: a tree-based ensemble model that combines tree predictors, where each tree in the forest depends on the values of a random vector sampled independently and from the same distribution for all trees in the forest [22].

•
Gaussian Naive Bayes: a probabilistic model based on the Gaussian distribution. The classification process assumes that the likelihood of each feature is Gaussian and that each feature independently predicts the target feature [23].

•
K-nearest Neighbor: a model that classifies based on the similarity of the data, where k is the number of closest neighbors. In this case, the nearest neighbors are found using the Euclidean distance [20].
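Leaving hyperparameter tuning aside, the six classifiers can be instantiated directly from scikit-learn; the sketch below uses default hyperparameters and synthetic toy data, not the settings actually selected by grid search in this study:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# The six classifiers evaluated in this study (defaults shown here;
# the actual values were selected by grid search)
classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "AdaBoost": AdaBoostClassifier(),
    "Decision Trees": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gaussian Naive Bayes": GaussianNB(),
    "K-nearest Neighbor": KNeighborsClassifier(metric="euclidean"),
}

# Fit each model on synthetic toy data as a smoke test
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = (X[:, 0] > 0).astype(int)
for name, clf in classifiers.items():
    clf.fit(X, y)
```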

Experimental Design

Evaluation Strategy
The models were evaluated using leave-one-out cross-validation to ensure robustness and minimize bias. Additionally, hyperparameter tuning was conducted using grid search combined with 5-fold cross-validation, enabling the identification of optimal parameters in each iteration. The entire process, including data pre-processing and model evaluation, was implemented in Python 3.11.6, and models from the scikit-learn 1.4 package were used. The following performance metrics were computed, where TN: true negatives; TP: true positives; FN: false negatives; FP: false positives.
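This nested scheme, an inner 5-fold grid search wrapped in outer leave-one-out cross-validation, can be sketched as follows; the toy data, the choice of logistic regression, and the `C` grid are illustrative, not the actual grids used in the study:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, GridSearchCV, cross_val_predict
from sklearn.linear_model import LogisticRegression

# Synthetic toy data standing in for the clinical feature matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 4))
y = (X[:, 0] > 0).astype(int)

# Inner loop: grid search with 5-fold cross-validation over hyperparameters
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},  # illustrative grid
    cv=5,
)

# Outer loop: leave-one-out cross-validation for the performance estimate
y_pred = cross_val_predict(grid, X, y, cv=LeaveOneOut())
```

Nesting the grid search inside the outer loop means the held-out sample never influences hyperparameter choice, which keeps the leave-one-out estimate unbiased.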

•
Accuracy: this metric is calculated by dividing the total number of correct predictions by the total number of observations: accuracy = (TP + TN) / (TP + FN + FP + TN).

•
Recall: this metric is the fraction of all positive cases that were correctly classified: recall = TP / (TP + FN).

•
ROC AUC: this metric indicates how well the model can distinguish between the positive and negative samples. When AUC = 1, the model perfectly distinguishes the positive and negative classes.
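These metrics can be computed with scikit-learn; the labels and scores below are hypothetical:

```python
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

# Hypothetical true labels, hard predictions, and predicted probabilities
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0, 1]
y_score = [0.2, 0.6, 0.8, 0.4, 0.1, 0.9]

acc = accuracy_score(y_true, y_pred)   # (TP + TN) / (TP + FN + FP + TN)
rec = recall_score(y_true, y_pred)     # TP / (TP + FN)
auc = roc_auc_score(y_true, y_score)   # ranking quality of the scores
```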

Dealing with Data Imbalance Issues
The oversampling method SMOTE-NC (Synthetic Minority Over-sampling Technique for Nominal and Continuous features) [24] was used because of the class imbalance of the dataset. This method is a variation of SMOTE that creates additional synthetic data points for the minority class by interpolating between existing samples. SMOTE-NC randomly chooses an example from the minority class (the mild/severe class), identifies its k nearest neighbors, chooses one of them, and linearly interpolates between the chosen example and the neighbor to produce a new example.
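SMOTE-NC itself is available as `SMOTENC` in the imbalanced-learn package; to illustrate the interpolation step described above, the following minimal sketch implements plain SMOTE for continuous features only (SMOTE-NC additionally sets each categorical feature of a synthetic sample by majority vote among the neighbors). All data here is hypothetical:

```python
import numpy as np

def smote_interpolate(X_min, k=3, n_new=5, seed=0):
    """Minimal SMOTE sketch for continuous features: for each new sample,
    pick a random minority example, find its k nearest minority neighbors,
    pick one, and interpolate at a random point between the two."""
    rng = np.random.default_rng(seed)
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        x = X_min[i]
        dist = np.linalg.norm(X_min - x, axis=1)
        neighbors = np.argsort(dist)[1:k + 1]   # skip x itself (distance 0)
        nb = X_min[rng.choice(neighbors)]
        new.append(x + rng.random() * (nb - x))  # linear interpolation
    return np.array(new)

# Hypothetical minority-class (mild/severe) samples, two continuous features
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_interpolate(X_min, k=2, n_new=3)
```

Because each synthetic point lies on a segment between two real minority samples, the new examples stay inside the region the minority class already occupies.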

Results and Discussion
In this study, we worked with a small and class-imbalanced dataset (approximately 80% vs. 20%) and evaluated the performance of different models. The results indicate that applying an oversampling method such as SMOTE-NC yields better results for cognitive and motor impairments with all features (Tables A1 and A2), clinician-selected features (Tables A3 and A4), and mutual information-selected features (Tables A5 and A6). Given our aim to predict the positive class (mild/severe impairment), our primary focus was on models that achieved better recall, as shown in Figure 1. We are particularly interested in a model that correctly identifies as many true impairment cases as possible, rarely missing a patient who actually has an impairment. Moreover, for the best-performing model (Gaussian Naive Bayes) in motor impairment prediction, we present the ROC curve in Figure 2, which achieved an AUC value of 0.69.
Clinician-selected features performed better in terms of recall than all features for the three types of impairment prediction. However, mutual information-selected features combined with SMOTE-NC outperformed the other settings in cognitive and motor impairment prediction, whereas for language impairment prediction the best setting was clinician-selected features. In this sense, the best model according to recall was Gaussian Naive Bayes for all types of neurodevelopmental impairment prediction. Moreover, applying SMOTE-NC improved the prediction of cognitive and motor impairments across all settings compared to the no-oversampling settings, while for language impairment prediction SMOTE-NC obtained notable results only in the mutual information-selected setting. No feature selection employed in this work was consistent across the three cases, indicating that it should be re-evaluated, as recent studies have proposed more sophisticated approaches for addressing this task [16]. Furthermore, it is important to mention that even though prediction of the mild/severe class was the goal of this analysis, merging both classes due to the lack of patients may imply a loss of clinical information regarding the state and future of each patient; therefore, future analyses should consider these classes independently. Moreover, it has been previously stated by Juul et al. [16] that even with advanced methods the field is still not able to predict complex long-term outcomes such as the Bayley score. In future work, we will consider additional features that have previously been reported as predictors of NDI, such as socioeconomic features [25] and image-based features such as brain volumes [26,27]. Moreover, multimodal data integration seems a promising alternative for predicting NDI outcomes with machine learning methods; research in this direction has focused on integrating clinical data and image-based data to predict NDI in preterm infants [28,29].

Conclusions and Future Work
In this work, we assessed a small and class-imbalanced dataset to predict neurodevelopmental impairments at two years in very preterm infants, using six different classification models, employing all features, feature selection by experts, and feature selection by mutual information, and applying SMOTE-NC oversampling. Moreover, this evaluation was performed using clinical data from a local cohort acquired in a Spanish public hospital. The results indicate that using mutual information-selected features together with SMOTE-NC was the best setting for cognitive and motor impairment prediction, while for language impairment prediction the best setting was clinician-selected features. The feature selection performed by experts in this field should be reconsidered in further studies, implementing a more complex approach as recent studies have suggested.
To this end, this work could be extended by applying further methods, adding more heterogeneous features such as image-based features, and reframing the prediction as a regression problem so that the Bayley score can be predicted in its original form. Moreover, it is important to note that this is a small dataset; we expect to include more patients and to validate the models on external cohorts.

Figure 1 .
Figure 1. Model performance based on the recall metric. Mutual information-selected features with SMOTE-NC obtained the highest recall for motor and cognitive impairment prediction, followed by clinician-selected features for language impairment prediction.

Figure 2 .
Figure 2. Receiver Operating Characteristic (ROC) curve demonstrating the performance of the Gaussian Naive Bayes model for predicting motor impairment.

Table A2 .
Performances of classification models using all features and SMOTE-NC.

Table A3 .
Performances of classification models using clinician-selected features and no oversampling.

Table A4 .
Performances of classification models using clinician-selected features and SMOTE-NC.

Table A5 .
Performances of classification models using mutual information-selected features.

Table A6 .
Performances of classification models using mutual information-selected features and SMOTE-NC.