Central apnea detection in premature infants using machine learning

Background and objective: Apnea of prematurity is one of the most common diagnosis in neonatal intensive care units. Apneas can be classiﬁed as central, obstructive or mixed. According to the current international standards, minimal ﬂuctuations or absence of ﬂuctuations in the chest impedance (CI) suggest a central apnea (CA). However, automatic detection of reduced CI ﬂuctuations leads to a high number of central apnea-suspected events (CASEs), the majority being false alarms. We aim to improve automatic detection of CAs by using machine learning to optimize detection of CAs among CASEs. Methods: Using an optimized algorithm for automated detection, all CASEs were detected in a population of 10 premature infants developing late-onset sepsis and 10 age-matched control patients. CASEs were inspected by two clinical experts and annotated as CAs or rejections in two rounds of annotations. A total of 47 features were


Introduction
Apnea of prematurity is a condition that is commonly found in premature infants admitted to neonatal intensive care units (NICUs) and is related to the immaturity of their respiratory control system [1] .Patients who develop severe illnesses such as sepsis or necrotizing enterocolitis are particularly prone to recurrent apneas [2][3][4] .Furthermore, apnea of prematurity is associated with later neurodevelopmental impairment and brain damage [5 , 6] .Ac-curate and timely apnea detection is therefore clinically important to improve outcomes in premature infants.
Apnea of prematurity is most often defined as a cessation of breathing lasting more than 20 s or more than 10 s when followed by a bradycardia, desaturation, or both [7 , 8] .However, some clinical experts disagree on the second part of the definition as they consider it necessary to have at least 15 s of cessation of breathing to define an apnea, even when followed by bradycardia or desaturation [9][10][11] .
Apneas can be categorized as central apnea (CA), characterized by a cessation of respiratory drive with a lack of diaphragmatic activity, obstructive apnea, characterized by an airway obstruction which causes an absence of airflow, and mixed apnea, which includes characteristics of both central and obstructive apnea [1 , 12-14] .The heart rate (HR) and oxygen saturation (SpO 2 ) responses vary depending both on the different apnea types as well as their duration [11] .For instance, it was noticed that a fall in HR to less than 100 beats/min is typically found in the case of CAs whereas an initial acceleration in the HR is more often present during obstructive apneas [11] .At the same time, further decreases in SpO 2 were noticed in longer apneas, irrespectively of their type [11 , 15] .
Premature infants are constantly monitored and an alarm is generated by the patient monitor in the event of an apnea.In current clinical practice, patient monitors generate an apnea alarm that is based on temporary reduction in amplitude of the chest impedance (CI), since this reflects a reduction in chest motion and thus indirectly in breathing motion [1 , 13] .This approach leads to the identification of the so-called central apnea-suspected events (CASEs), which may reflect apneic events of central origin.In fact, while among CASEs there are true CAs, several CASEs are also known to be false alarms.Studies have shown the presence of a high number of false apnea associated alarms in the daily NICU practice [9 , 16-19] , indicating that the current clinical apnea detection is deficient.
In the past years, different strategies have been developed with the aim to improve apnea detection, encompassing a variety of signal processing techniques as well as machine learning and deep learning techniques [17 , 20-25] .Among the different proposed solutions, the algorithm proposed by Lee et al. focuses specifically on CA detection [17][18][19] .In a previous study, we optimized this algorithm considering annotations from two clinical experts [26] .We observed a very high recall (sensitivity or true positive rate (TPR)) for CAs, but a persistingly low precision (or positive predictive value (PPV)).This indicated that many false CAs were still detected.In order to obtain both a high precision and a high recall, the aim of the current study was to develop an optimized detection model based on machine learning to automatically detect true CAs from all CASEs extracted using the optimized algorithm for CA detection [26] .

Dataset
We used routinely obtained data from premature infants that participated in a previous matched case-control study on sepsis [27] .These were all patients admitted to the NICU of Máxima Medical Center in Veldhoven, the Netherlands, from July 2016 to December 2018.From that existing database we selected premature infants most likely to experience apneas (i.e., all very preterm infants with a gestational age ≤ 30 weeks) that developed late-onset sepsis (LOS), and their case-matched controls.
LOS was diagnosed when patients presented the clinical signs of a generalized infection according to the Vermont Oxford Criteria [28] , required the use of antibiotics after at least 72 h of life, and pathogens were isolated from their blood culture.For these patients we identified the CRASH-moment (blood Culture collection, Resuscitation, and Antibiotics Started Here), following the definition originally provided by Griffin et and Moorman [29] .We then collected all the data for all 48 h prior to the onset of sepsis, in which apneas are very likely to occur [3 , 26 , 30] .For the case-matched controls we calculated an equivalent-in-time CRASHmoment to make sure that the postmenstrual age was matched and we extracted the data from the 48 h prior to the equivalent CRASH-moment.A total of 960 h of data from 20 premature infants (10 LOS patients and 10 matched controls) were included in this study.All patients' characteristics, specified for the LOS and control groups, are presented in Table 1 .
Each patient was monitored using Philips IntelliVue MX800 patients' monitors (Philips Medical Systems, Böblingen, Germany) according to the present clinical standard and all data was stored in a data warehouse (PIIC-iX, Data Warehouse Connect; Philips Medical Systems, Andover, MA).The ECG and CI were measured using three ECG leads with a sampling frequency of 250 Hz for the ECG and 62.5 Hz for the CI.The SpO 2 was measured with a photoplethysmogram with a sampling frequency of 1 Hz.All these three signals were then extracted from the data warehouse for further processing.
As this study had a retrospective and non-invasive nature, a waiver was provided by the medical ethical committee in accordance with the Dutch law on medical research with humans (WMO).

Annotation process
As a first step in this study, we applied the optimized algorithm for CA detection described in our previous paper [26] on our dataset of 20 premature infants.This algorithm was derived from an algorithm originally described by Lee et al. [17 , 19 , 18] , which allows to filter out the cardiac artifacts found in the ECG from the CI, to compute a CA probability function from the resulting filtered respiration signal and finally to detect CASEs, which include both CAs and false CAs.We previously optimized this algorithm to improve its precision without affecting its high recall [26] .
In the current study, the optimized algorithm for CA detection was applied to the patients' signals to detect CASEs longer than 10 s, in a similar way to the previous use of this algorithm [26] .This first step is indicated in Fig. 1 , which also includes an overview of all subsequent steps performed in this study.Matlab (The MathWorks, Natick, United States) was used to detect CASEs from our dataset, as well as to perform the annotation process and the signal processing and feature extraction techniques.
A total of 7818 CASEs were extracted from the 20 premature infants.Two clinical experts were asked to provide, independently from each other, annotations for all the extracted CASEs using a user interface developed for this task.Three different options were given for the annotations: 'central apnea' (used for CASEs that clearly presented the characteristics of a CA as pre-defined by the clinical experts, i.e., a trace of flat CI following a previously regular fluctuating CI), 'rejection' (used when a CASE did not present the typical characteristics of a CA, i.e., in case of a flat CI trace mixed with large CI fluctuations), and 'error' (used in case of missing or corrupt signal).
A second (final) round of annotations was performed with the two clinical experts present at the same time to find consensus for all CASEs that presented different annotations after the first round.

Time windows for CASE detection
We observed a high number of clusters of CASEs, in which CASEs were present at a brief distance from other CASEs.In particular, 2172 CASEs were found to be preceded by another CASE Fig. 1.Overview of the steps performed in this study starting from the extraction of central apnea-suspected events (CASEs) and their annotation by clinical experts (1).Next, CASEs cluster analysis was performed and for each CASE not preceded by another CASE in the previous 30 s, four 30 s windows with 5 s overlap ([-30 s, 0 s), [-25 s, 5 s), [-20 s, 10 s) and [-15 s, 15 s)) were identified (2).Signals extracted from the patients were then processed to create five new signals (3).From all these new signals 47 features were extracted (4).Two different cross-validation methods, leave-one-patient-out cross-calidation (LOPO CV (I)) and a 10-fold cross-validation (10-fold CV (II)) and three different machine learning algorithms, logistic regression with elastic net penalty (LR), random forest (RF) and support vector machines (SVM), were defined to create detection models at step (5).Finally, five different machine learning experiments were defined and performed (6).

Table 1
Characteristics of the patients included in the current study, separated considering the Late-Onset Sepsis (LOS) and control groups and all patients together.

LOS Control All
Count of Patients 10 10 20 Gestational Age (wk) (mean ± std) 27.9 ± 0.9 27.9 ± 1.0 27.9 ± 0. in the previous 30 s (27.8% of the total count of CASEs).We therefore performed a cluster analysis to investigate whether patterns in the transitions for the annotations from the clinical experts were present.We investigated the likelihood of having the same annotation in subsequent CASEs, information that could possibly be relevant in future detection or prediction studies.Only CASEs which were not preceded by another CASE in the previous 30 s were included to perform machine learning experiments.This clinical choice was motivated by the fact that the decrease in HR or oxygen saturation due to a previous CASE might affect the vital signs in the patient monitor and the derived feature values computed for a subsequent CASE.This could have therefore possibly caused a wrong detection and influenced the learning process of the detection models based on machine learning.Previous studies indicated that the definition of 7.5 min pre-apnea (as well as post-apnea) windows might be an optimal solution to perform a prediction study for apnea of prematurity [20 , 24 , 31] but also suggested that reduction of this window length would better suit the needs of the clinical practice.Our choice yielded a total of 5646 CASES to be included in (72.2% of the initial CASEs).
After this process, all the remaining CASEs having an 'error' annotation were discarded from the study.This resulted in a total of 5460 CASEs to be considered for the upcoming steps (69.8% of the initial CASEs).
For all the included CASES, we included information starting from 30 s before the start of a CASE (-30 s) to 15 s after the start of the CASE and we used a moving step size of 5 s.This time window was chosen so that the included time was shorter than the minimum time requested by different apnea definitions to provide an annotation (15 s of cessation of breathing) [9][10][11] .This criterion led to feature extraction from four 30 s windows: [-30 s, 0 s), [-25 s, 5 s), [-20 s, 10 s) and [-15 s, 15 s), where 0 s was the moment of the start of a CASE.

Signal processing
RR-intervals were derived from the ECG by applying an algorithm for R-peaks detection and computing the time between consecutive R-peaks, using the same method described in our previous studies [27 , 32] .From the ECG, the signal instability index (SII) was then calculated with the aim of quantifying patient motion [27 , 32] .This measure was obtained by applying a band-pass filter (0.001-0.40 Hz) to 10 s-long ECG windows and subsequently by computing a kernel density estimate to return a measure of the patient motion for every second (1 Hz).Low values for this signal are associated with an absence of movement whereas higher values suggest that a movement is performed by the patient.
The CI was processed following the steps described by Redmond et al. to obtain the respiration Ribcage Respiratory Effort (RRE) [33] .This signal was computed by subtracting its mean and by removing the HF noise present above respiratory frequencies by applying a low-pass filter with a tenth-order Butterworth fil-ter with a cutoff equal to 0.8 Hz.The resulting signal obtained for each patient received an amplitude normalization to remove the contribution of the tidal volume, a parameter that is different for each patient.In order to perform the normalization, turning points were first identified and the median peak-to-trough amplitude was found considering all differences between consecutive peaks and troughs.This value was then used to normalize the whole patient signal.The cardiorespiratory coupling signal (CRC) was finally computed by resampling the respiration RRE in correspondence with R-peaks in the ECG.No additional processing was performed on the SpO 2 measured with the photoplethysmogram.

Feature extraction
The RR-intervals, SII, respiration RRE, CRC and SpO 2 were all used to extract a total of 47 features for 5460 CASEs.Detailed information regarding all features can be found in Appendix A and only features that are less commonly used are more extensively explained in this section, with our motivation to include them in our detection models.Each feature, unless stated differently, was extracted considering 30 s windows following the windows defined in paragraph 2.3.
Features extracted from the RR-intervals using time-domain, frequency-domain and non-linear measures are well-established in previous studies.These include the Mean RR, the standard deviation of the normal RR intervals (SDNN) and the root mean square of successive differences between adjacent normal RR intervals (RMSSD), which can provide an estimate of overall heart-rate variability as well as an estimate of its short-term components [34 , 35] .We also determined the Sample Asymmetry of the RR-intervals, a measure of symmetry in the histogram of RR-intervals, which in previous studies showed clear differences before different illnesses (e.g., sepsis and systemic inflammatory response syndrome) compared to other instances [36] .Sample entropy, a measure able to quantify similarity for different sequences within a time-series [37] , was also applied to the RR-intervals [38] .The Percentage of Decelerations and Standard Deviation of Decelerations were calculated since both proved to be effective in predicting the onset of sepsis [27 , 39] .
The phase-rectified signal averaging (PRSA) [40 , 41] is capable of quantifying quasi-periodic oscillations in a physiological signal by filtering out artefacts and noise.We identified anchor points considering both increases in the signal (increase events) as well as decreases (decrease events).Windows of length equal to 40 time samples were set around the anchor points (i.e., L = 20), a choice motivated by the short time windows (30 s) used in this study.From these PRSA waveforms seven different features were extracted [42] .These included the Immediate Deceleration and Acceleration Response (IDR and IAR) as well as the Average Deceleration and Acceleration Responses (ADR and AAR), with adjusted parameters considering the length of the time windows used in this study.In addition to these features, the Slope of the Decel-eration and Acceleration Response (SDR and SAR), which are able to quantify the rate of HR response, were also included.Finally, a new feature extracted from the PRSA was developed and included in this study.This was defined by dividing the difference between the count of anchor points from the PRSA with decrease events and with increase events by the count of RR-intervals and then multiplying this value by 100 (Anchor Points Decrease versus Increase).
The Fast Fourier Transform (FFT) was applied on the RRintervals after resampling this signal at 250 Hz to extract the power spectral densities, allowing to measure the development of the cardio-respiratory coupling and the presence of respiratory sinus arrhythmia (RSA), as suggested in previous studies [43 , 44] .To achieve this aim, we used the frequency ranges indicated by Indic et al. [43] and defined the Low Frequency (LF: 0.01-0.15Hz), High Frequency (HF: 0.15-0.45Hz), Super High Frequency (sHF: 0.45-0.7 Hz) and Ultra High Frequency (uHF: 0.7-1.5 Hz) ranges.The main reason for this choice compared to those provided in other studies [33 , 44] was the possibility of using the same frequency ranges also for the FFT of the respiration RRE.In addition to these frequency-domain features, we also extracted the LF/HF Ratio and the Mean Respiratory Frequency derived from the RR-intervals, defined as the frequency of maximum power in the frequency range 0.15-0.45Hz [33] .
Four features were extracted from the SII to quantify the movements of premature infants.These were the Mean SII, the Standard Deviation of the SII, Skewness of the SII and Interdecile Range of the SII.These were selected for their ability to identify motion patterns [27 , 32 , 45] .
From the respiration RRE we extracted the Mean RRE, Standard Deviation of the RRE, Skewness of the RRE and Interdecile Range of the RRE.Sample Entropy of the RRE was computed with the same parameter choices described for the RR-intervals.Other timedomain features extracted from the respiration RRE include the Envelope Power, the Breath-by-Breath Correlation, the Breath Length Variation and the Time Domain Respiratory Frequency, features that proved to be particularly useful in classifying sleep stages.A detailed explanation for these features can be found in the study from Redmond et al. [33] .The Standardized Median of the Peaks and the Standardized Median of the Troughs were then computed as additional features to quantify the amplitude of the respiratory peaks and troughs [46] .With regards to features from the frequency domain, the power spectral densities were computed from the FFT of the respiration RRE by using the same LF, HF, sHF and uHF ranges previously mentioned for the case of the RR-intervals [43] .Similarly to the case of the RR-intervals, the frequency of maximum power in the frequency range 0.15-0.45Hz was also computed considering the respiration RRE (Mean Respiratory Frequency derived from the respiration RRE) [33] .
Cardiorespiratory features, which have proven to be particularly useful to identify sleep stages as well as cardiorespiratory phase-coupling during obstructive sleep apnea in adults, were introduced to verify whether they can be helpful in the context of apnea detection in premature infants [47 , 48] .For their computation we applied the nonlinear visibility graph (VG), a method which allows for the description of a time-series based on geometric criteria [49] , to the CRC, as suggested by Long et al. [47] .
This method requires the calculation of the degree δ of all data points in the time-series, also known as nodes, based on their visibility on other data points.The Mean Degree of the nodes and the Degree Variation of the constructed VG network were then introduced as features to quantify the degree of complexity of the timeseries.The Assortativity Coefficient was instead introduced to estimate whether nodes were more likely to connect with high-degree or low-degree nodes [47] .Similarly to the study by Long et al., we used 3.5 min windows for the computation of features, in order to capture enough changes in the RR-intervals.However, we set the upper time limit for these extended windows to match exactly the end of each 30 s windows previously defined (i.e., the last 30 s of the 3.5 min windows corresponded to the 30 s windows used to extract other features).
Features extracted from the SpO 2 included its mean and standard deviation.In addition, we also estimated the slope of linear fitting of all data points within a time window.These features were selected to provide an indication about whether a desaturation was present in the current time window or expected in the upcoming ones.
Finally, Z score normalization was applied to the feature matrix which included features extracted for each CASE considering the different time windows.After the feature extraction process, CASEs presenting features with two or more consecutive 'NaN' values were discarded resulting in a total of 5333 CASEs to be considered for the detection models (68.2% of the initial CASEs).As it is reported in Table 1 , for each patient we found a median and [IQR] value of 284.5 [187.5 -367] CASEs and 26.5 [9.5 -62.25]CAs annotated by clinical experts in the 48 h of data.In 5 patients very few CAs were annotated (count of CAs ≤ 5) while a high number of CAs was found in 6 other patients (count of CAs ≥ 50).

Machine learning: detection models
Detection models based on three traditional machine learning algorithms were employed in this study to perform CA detection: logistic regression (LR), random forest (RF) and support vector machines (SVM).LR was selected for its simplicity, which guarantees an easy transfer of knowledge and interpretability of the results.RF and SVM were selected following previous studies which proved that they can be suitable for apnea prediction in premature infants [22 , 23] .
LR was implemented with the elastic net penalty, which linearly combines the L1 and L2 penalty functions of the lasso (least absolute shrinkage and selection operator) and ridge methods.Elastic net was adopted since different studies highlighted that it usually holds a better predictive power compared to the lasso regression while still performing feature selection, a characteristic that was considered relevant for the relatively high number of included features compared to the number of CASEs (records in machine learning terminology) [50] .We performed hyperparameter optimization by means of cross-validation, further discussed in the next paragraph, for the parameter C, which indicates how much LR is penalized and L1 ratio, which influences the algorithm penalty choosing among a mere L1 penalty, a mere L2 penalty or a combination of L1 and L2 penalties.
Three hyperparameters were optimized for the RF: the number of trees, the number of candidate features to consider for the best split at each node, and the maximum depth of the tree.
Finally, three hyperparameters were optimized for the case of the SVM.These included the regularization parameter, the type of kernel to be used and kernel coefficient, suitable for all the different types of kernels used in this study.A list of all the values for each hyperparameter that were investigated is presented in Table 2 .All the detection models here described were implemented using the Scikit-learn in Python (Python Software Foundation, Fredericksburg, United States).

Machine learning: validation and evaluation
Two cross-validation methods were used for each detection model to optimize the hyperparameters, as shown in Fig. 1 .The first cross-validation method was a leave-one-patient-out crossvalidation (LOPO CV), using for each iteration the CASEs from one Steps used in this study to perform hyperparameter optimization for all detection models and for both cross-correlation methods, leave-one-patient-out crossvalidation (LOPO CV (I)) and a 10-fold cross-validation (10-fold CV (II)).The optimal parameters found using these criteria were then applied to the patient/fold left as test set.
patient as a test set.LOPO CV was selected as it relates well to clinical practice, where each new patient admitted in a NICU is treated as a test set.CASEs from the remaining 19 patients were also separated using a second LOPO CV, which allowed us to obtain for each iteration 18 patients in the training set and 1 patient in the validation set in order to perform hyperparameter optimization for each detection model.All steps regarding hyperparameter optimization are shown in Fig. 2 .A grid search approach was used to determine the training and validation scores for each iteration of the second LOPO CV, computed by using the area under the receiver operat-ing characteristic curve (AUROC) as a scoring function.The mean training scores, mean validation scores and the difference between the mean training and mean validation scores were computed for each iteration of the first LOPO CV.The optimal set of parameters for the detection model to be used on a test set was then selected based on two requirements: a set of parameters leading to (1) a mean training score greater than a predefined threshold ( Th Class ), solution adopted to prevent underfitting, and to (2) the minimum difference between the mean training and mean validation scores, solution implemented to avoid overfitting.Th Class was empirically set equal to 0.9.In case less than 30% among the mean training scores computed for all sets of parameters were found to be greater than 0.9, its value was decreased each time by 0.05 and the full experiment repeated until this condition was fully satisfied for all the iterations of the first LOPO CV.As the CAs were unevenly distributed over the patients, with 5 patients having only a few CAs, we also trained the detection models using a a 10-fold cross-validation (10-fold CV), folded over CASEs rather than patients, as a second cross-validation method.This experiment allowed to distribute CAs more evenly across all the folds, resulting in each fold having around one CAs per 6 CASEs.On the other hand, an important limitation of 10-fold CV is that the inclusion of CASEs from different patients in the training and test sets can introduce overfitting, a problem that is not found with LOPO CV.Nonetheless, this cross-validation method was included in our study also to quantify the extent of the overfitting compared to LOPO CV.Furthermore, as shown in Fig. 2 , a second cross-validation consisting of 9 folds, a number selected to keep consistency with the choice applied for the second LOPO CV (where the number of patients was reduced by 1 compared to the first LOPO CV), was used to create the training and validation sets and to perform hyperparameter optimization, following the same steps as described above for LOPO CV.

Machine learning: experiments
Five different machine learning experiments were performed.These are also all indicated in Fig. 1 .All the experiments allowed to evaluate the detection of CAs among CASEs in the test set by using the mean AUROC as metric, unless stated otherwise.
As a first experiment, we evaluated the performance of the different detection models based on machine learning and two crossvalidation methods on the detection of CAs using information from all time windows.Features extracted from all time windows [-30 s, 0 s), [-25 s, 5 s), [-20 s, 10 s) and [-15 s, 15 s) were fed to all detection models, for both cross-correlation methods.This allowed us to verify which detection model worked better for CA detection and whether AUROC values were consistent for both cross-correlation methods.
As a second experiment, we investigated the relevance of including features from an increasing number of time windows, starting from window [-30 s, 0 s) only and up to windows [-30s, 0 s), [-25 s, 5 s), [-20 s, 10 s) and [-15 s, 15 s), on the detection.This experiment was also performed for all three detection models and both cross-validation methods.
Before proceeding with the following experiments, we identified the best performing detection model and investigated the feature relevance and the hyperparameters selected per each fold.However, only detection models created by means of LOPO CV were considered since they prevent overfitting (i.e., LOPO CV does not allow to use CASEs extracted from one patient both in the training and test set).In addition to this, the use of LOPO CV might be more reasonable in a clinical context, where an algorithm should work proficiently immediately after the admission of a new patient, where no training can be done with previous events extracted from this same patient.This detection model was then used to perform the remaining experiments.
The third experiment consisted of verifying whether the detection was performed differently for different groups of patients.At first, we compared AUROC values for control and LOS patients and applied the Wilcoxon signed-rank test to compare AUROC values from paired patients belonging to the two groups.We then compared AUROC values for each patient experiencing a low number of CAs (count of annotated CAs ≤ 5) and high number of CAs (count of CAs ≥ 50).Due to the low number of patients in these two groups, for this comparison we only evaluated the mean, minimum and maximum AUROC for patients in the two groups.
As a fourth experiment we extracted the confusion matrices found after the definition of different thresholds for the false positive rate (FPR) in the mean ROC curve.This allowed to evaluate how different metrics, recall (TPR), precision (PPV), miss rate (or false negative rate (FNR)) and the specificity (or true negative rate (TNR)), varied per each selected FPR.This experiment allowed us to understand the feasibility of using the selected detection model and cross-validation method for performing CA detection in clinical practice.
As a fifth and final experiment, after the identification of an optimal FPR, we investigated how the selected detection model performed in detecting different events (recall).At first, we investigated the percentages of correct detections (recall) for all CAs, CAs followed by a bradycardia (HR ≤ 80 bpm), CAs followed by a decrease in HR (80 bpm < HR ≤ 100 bpm), CAs followed by a desaturation (SpO 2 ≤ 80%), CAs followed by a decrease in SpO 2 (80% < SpO 2 ≤ 88%), and CAs followed by both a bradycardia and a desaturation (also often referred as apnea bradycardia desaturation events in literature [17 , 19] ).The definitions for bradycardia and desaturation were selected following the thresholds for alarms used in our NICU [51] .We then evaluated the percentages of correct detection of CAs with different length.For this purpose we separated CAs considering the following lengths L (i.e., 10 s ≤ L < 20 s, 20 s ≤ L < 30 s, 30 s ≤ L < 40 s, 40 s ≤ L < 50 s, 50 s ≤ L < 60 s, L ≥ 60 s) in a similar way to what was proposed by Mohr et al. for an analysis of different apnea lengths [19] .Finally, we investigated the percentages of correct detection for CASEs agreed and disagreed upon by clinical experts during the first round of annotations.

Results
Fig. 3 shows results from the cluster analysis of CASEs, in which transitions for the annotations from the clinical experts were investigated.This analysis showed consistency for annotations from clinical experts within clusters, irrespectively of the transition that was considered.Considering the first transition as an example, clusters starting from a CA had a far higher likelihood of being followed by a CA rather than a rejection for the following CASEs (16.7% vs. 3.8%).The same consideration was found for clusters starting from rejection, which were far more ofthen followed by a rejection (24.4%) rather than a CA (1.1%).All the other percentages of transitions indicated instead the end of a cluster.
Our first machine learning experiment consisted of investigating the performance of the different detection models considering features from all time windows.Fig. 4 shows the mean ROC curves for the different detection models.Fig. 4A , computed by using LOPO CV, shows that the highest AUROC was found when using the LR with elastic net penalty (0.88 ± 0.08).This result was followed closely by the AUROC found using SVM (0.87 ± 0.09) and RF (0.85 ± 0.09).Slightly higher results compared to those found with LOPO CV were found by means of 10-fold CV, as shown in Fig. 4B , where the highest AUROC was found with LR with elastic net penalty (0.90 ± 0.02), result followed closely by the AUROC found with SVM (0.89 ± 0.02) and RF (0.87 ± 0.03).
Our second experiment investigated the relevance of including features from an increasing number of time windows on CA detection.AUROC values (mean ± standard deviation) found by considering features from an increasing number of time windows and including therefore progressively more seconds of CASEs are presented in Table 3 .In this table a stepwise increase in the mean AUROC values for all detection models is observed, for both crossvalidation methods.Fig. 3. Representation of the first four stages of transitions between all central apnea-suspected events (CASEs) included in this study.Annotations returned by the clinical experts during the second round of annotation are used to create this representation.END is also added to indicate the ending of a cluster.Considering the first transition as an example, clusters from a CA were followed by a CA 16.7% of the time, were followed by a rejection 3.8% of the times and were not followed by another CASE in the following 30 s 79.4% of the times.Only the most likely transitions are represented (e.g., transition from CA to error is not represented due to extremely low likelihood of this transition).

Table 3
AUROC values (mean ± standard deviation) found by considering features an increasing number of consecutive time windows for each central apnea-suspected event (CASE).Results found for detection models based on logistic regression (LR) with elastic net penalty, random forest (RF) and support vector machines (SVM), as well as both cross-validation methods, leave-one-patient-out cross-validation (LOPO CV) and 10-fold cross-validation (10-fold CV), are shown.The following time windows including progressively more seconds of CASEs were considered to obtain these results: [-30 s, 0 s), [-25 s, 5 s), [-20  As shown in Fig. 4A , the best performing detection model by means of LOPO CV was identified in LR with elastic net penalty trained using features from all 4 time windows, since it showed the highest mean AUROC.Feature relevance was investigated by extracting the coefficients from the detection model.These are log odds ratios and while positive values indicate increased risk towards CA, negative values indicate increased risk towards rejection.Median log odds ratios computed considering all folds for the twenty most relevant features, ranked considering their absolute values are shown in Fig. 5 .In general, relevance for features extracted from the respiration RRE was found to be generally the highest among all features included in this study.Different features extracted from the RR-intervals also proved to be relevant for the detection whereas features extracted from the SII, CRC and SpO 2 showed lower median log odds ratios.The most often selected set of hyperparameters consisted of C = 0.005 and L1 ratio = 1, and was selected for 16 out of 20 folds.
In the third experiment we evaluated whether the detection was performed differently for different groups of patients while using the detection model made using LR with elastic net penalty and LOPO CV.AUROC values for all the patients in the test set are shown in Fig. 6 , represented in blue for control patients and in red for LOS patients.No statistical significance while comparing matching LOS and control patients was found (p-value > 0.05).AUROC values for the 5 patients experiencing a low number of CAs (count of annotated CAs ≤ 5) showed high values (mean AU-ROC = 0.948, min AUROC = 0.895, max AUROC = 1) while a wide variety of AUROC values was found in the six patients with high number of CAs (count of CAs ≥ 50) (mean AUROC = 0.850, min AUROC = 0.659, max AUROC = 0.951).
In our fourth experiment, three different FPR (0.1, 0.2 and 0.3) were defined to extract the confusion matrices for the same detection model based on LR with elastic net penalty and LOPO CV.Confusion matrices as well as other metrics are shown in Table 4 .A FPR = 0.3 in particular allowed to detect CAs with a recall (TPR) = 0.78, a precision = 0.42 and a miss rate (FNR) = 0.22.Since we considered as highest priority to avoid missing too many CAs even at the cost of a slight increase of false CA detection, we chose to use this FPR value for the subsequent and final experiment.
In our fifth experiment, percentages of correct detection for different events found with FPR = 0.3 were computed.Results are shown in Fig. 7 together with the total number of events considered for each group.Percentages of correct detections were found to be equal to 78.2% for all CAs and very similar values were also found for CAs followed by a desaturation (79.0%) or decrease in SpO 2 (75.7%).Significantly higher values percentages of correct detections were found for CAs followed by a bradycardia (93.4%), by a decrease in HR (93.9%) and by both a bradycardia and a desaturation (95.2%).In addition, percentages of correct detection were found to be higher for CAs with length 20 s ≤ L < 30 s (88.3%) or for CAs with length L ≥ 60 s (91.7%).Finally, percentages of correct detection were found to be higher for CASEs originally agreed upon by clinical experts (82.9%) compared to CASEs for which there was no initial agreement (60.1%).

Discussion
Correct identification of apnea of prematurity is important as a correlation has been described between apnea of prematurity and neurodevelopmental impairment as well as other pathologies, such as sepsis, that can represent a major threat to the life of premature infants [2 , 4-6] .Therefore we developed a detection model based on machine learning which shows that it is feasible to automati-cally detect true CAs and to reduce the occurrence of high load of false apnea alarms observed in the current clinical practice.
As a result of an investigation of CASEs extracted from our population of premature infants, we found a high likelihood for CASEs to appear in clusters, as 27.8% of the total count of CASEs was found to be preceded by another CASE in the previous 30 s.This result was found to be in agreement with previous literature studies which also indicated that periodic apneas can be found in premature infants [6 , 14 , 20] .In addition to this we observed that consistency in annotations can be found for CASEs appearing in a cluster.In particular, we found a far higher likelihood for CAs to be clustered together rather than with other apneic events anno-   Percentages of correct detections (recall) performed by logistic regression (LR) with elastic net penalty trained using leave-one-patient-out cross-validation (LOPO CV) on CASEs present in the test set when a FPR = 0.3 was selected.In particular, in (A) all central apneas (CAs), CAs followed by a decrease in heart rate (HR) or oxygen saturation (SpO 2 ), in (B) CAs with different lengths and in (C) central apnea-suspected events (CASEs) agreed and disagreed upon by clinical experts during the first round of annotations are represented.

Table 4
Confusion matrices found considering a False Positive Rate (FPR) equal to (a) 0.1, (b) 0.2 and (c) 0.3 in the mean ROC curve for logistic regression (LR) with elastic net penalty, estimated labels by this detection model using the corresponding threshold and the annotations from the clinical experts as true labels.
(a) FPR = 0. tated as rejections.Consistency of annotations within clusters of CASEs might also be considered for the improvement of detection models.Future studies could also provide more precise annotations for rejections (e.g., separating obstructive and mixed apneas) and investigate whether consistency characterizes other clusters as well.Furthermore, our first machine learning experiment showed very similar AUROC values for the different detections models that were implemented based on LR with elastic net penalty, RF and SVM, considering features extracted from all 4 time windows and both cross-validation methods.AUROC values found with RF and SVM were found to be higher than in the promising previous literature studies which made use of the same machine learning algorithms [22 , 23] .Our AUROC values found by means of LR with elastic net penalty are instead more difficult to compare to results achieved in the past.This is because, to the best of our knowledge, LR was only used in combination with a Gaussian mixture model by Zuzarte al. with the aim of performing apnea prediction [6] .Furthermore, this previous model allowed to perform a live apnea prediction, while in our study we considered performing apnea detection based on single CASEs, similarly to what was done in previous studies who used RF and SVM [22 , 23] .In this study mean AUROC values were also found to be consistent for both cross-validation methods.However, it is important to highlight the higher uncertainty, indicated by the AUROC standard deviation, found with LOPO CV, a result that is related to different count of CASEs identified per patient.On the other hand, methods based on testing the detection model on new patients, like LOPO CV, are more similar to the regular clinical practice, where a detection model based on machine learning should be trained with previously admitted patients and used (i.e tested) on a newly admitted patient.In addition to this, AUROC values found using 10-fold CV, which are due to overfitting resulting from the use of CASEs from the same patient in both the training and test set, are only slightly higher compared to those obtained with LOPO CV, giving further validity for the use of a model based on LOPO CV in the other experiments included in this study.
In the second machine learning experiment we found out that the highest AUROC values were found with features extracted from up to the first 15 s of each CASE, irrespectively of the detection model and the cross-validation method that was used.It must be noted, though, that AUROC values were quite similar by including only features up to the first 10 s of each CASE.This result is particularly significant for the clinical practice, as all definitions present in literature require a minimum of 10 s of a cessation of breathing to define apnea of prematurity [7][8][9][10][11] .Our detection model therefore not only detects apneas based on vital signs but predicts the occurrence of an upcoming true CA alarm.
The investigation of feature relevance performed for LR with elastic net penalty, features from all 4 time windows and LOPO CV showed higher median log odds ratios for features extracted from the respiration RRE compared to all the other features.This result is expected as a CA is likely diagnosed by observations of the CI, with consequences in HR and SpO 2 appearing just at a later stage.Nonetheless, features extracted from the RR-intervals also provided a significant contribution to the detection model whereas features extracted from the CRC and SpO 2 were found to be in general less relevant.In contrast to recent literature [6] , in our study the motion signal (calculated in our study as SII) shows only a modest contribution, which may be explained by the fact that we focused our attention on apneas with a central origin, whereas patient movement is associated with apneas of an obstructive origin [52] , and by the fact that the SII is extracted from the ECG signal, a signal on which the patient monitor applies different methods to prevent motion artifacts [32] .Future studies could therefore investigate whether wavelet-based algorithms, used for instance by Zuzarte et al. [6] , are more suitable than SII to contribute to the detection of apnea of prematurity.
In our third machine learning experiment no difference was found by comparing AUROC values for LOS versus control patients, indicating the possibility of applying LR with elastic net penalty for CA detection irrespectively of the patient group.We also noticed high AUROC values for 5 patients experiencing a low number of CAs.With an in-depth analysis focused on the interpretation of this result, we found out that CA records for patients with few CAs used as a test set generally presented high probability values, irrespectively of the machine learning model that was used.As soon as high threshold values (low FPR values) were trespassed, an increase in TPR and consequently in the AUROC values became immediately noticeable.We speculate that this result might have a clinical reason, since we noticed a very regular breathing pattern in this group of patient.Because of this reason, a disruption in this breathing pattern could be easily recognized.Patients with several CAs often showed instead a less regular breathing pattern, which could have led to more rejections being wrongly identified as CAs and at the same time made CAs harder to be detected.
In our fourth machine learning experiment, a FPR = 0.3 was selected as an optimal threshold for our best performing detection model.This choice was performed since in the medical field it is preferred to correctly detect as many true events as possible (i.e., high recall), even at the cost of having a lower precision and triggering therefore more false alarms [9] .Recent studies have indicated a precision equal to 0.35 for the current monitoring systems available in clinical practice (i.e., up to 65% of the apnea alarms currently found in NICUs are false) [17 , 19 , 18] .Using our detection model with a FPR = 0.3 we improved the precision to 0.42, reducing the number of false alarms by 66% compared to the application of the optimized algorithm for central apnea detection only, an algorithm derived from the one originally described by Lee et al. which showed a lower number of false positives compared to what is obtained using the current monitoring techniques [17] .
Percentages of correct detections (recall) were further examined in the fifth machine learning experiment and were found to be particularly high for CAs followed by a decrease in HR (80 bpm < HR ≤ 100 bpm), a bradycardia and both a bradycardia and a desaturation, which are considered the most concerning CAs for the well-being of the infants [2 , 14] .Our results are in line with the findings of Williamson et al., who showed promising results with an AUROC equal to 0.79 while applying a different machine learning algorithm (i.e., quadratic classifier) [20] .Lower percentages were instead found when CAs were followed by a decrease in SpO 2 , a result that can be related to the low relevance found for features extracted from this signal.An explanation for this result was given by Di Fiore et al., who indicated that false desaturation alarms can be reduced by using long averaging time, equal to at least 16 s [2] , which is nowadays typically used in most clinical practices.Exceeding the first 15 s of CASE for feature extraction might therefore be one solution to improve detection of CAs followed by a decrease in saturation.On the other hand, since 15 s of cessation of breathing are sufficient to completely define apnea of prematurity [9][10][11] , the inclusion of this requirement would have prevented us from the possibility of performing an actual prediction of CAs and was therefore not considered.
Another solution that can be explored in future studies is the definition of new features capable to estimate possible future decrease in SpO 2 , including for instance features computed in the frequency domain which proved to be useful for diagnosing obstructive sleep apnea syndrome in adults [53] .Finally, it is important to note that the percentages of correct detections were lower in case the clinical experts disagreed during the first round of annotations (recall equal to 60.1% instead of 82.9%, found in case of agreement).
This study has some noteworthy limitations.First of all, a limited sample size of patients and CASEs, which was selected considering the time availability for clinicians to provide annotations in different rounds, as this annotation process -even though guided by a GUI -was still a very time-consuming task for the clinical experts.This was also the reason why annotations were made by no more than two clinical experts.Second, in this study detection models were only used retrospectively.Also, the two step approach, with first detection of cessations of breathing and then CAs is slow for implementation.Using a full dataset including all 30 slong time windows, independent of the presence of a CASE, would be of interest.However, this would result in a highly unbalanced dataset.Third, the inclusion of more refined features from different signals, including the SpO 2 , and the inclusion of other features to quantify body motion as well as the gestational age, postnatal days and birth weight could also improve the detection.Fourth, validation on data coming from a different dataset or another hospital was not investigated in this study.Since there is currently no public dataset available that includes annotations for CAs in premature infants as well as all the signals included in this study, future studies could include data from a different dataset to provide a further indication on the validity of the current results.However, we consider our study to be valid as a first feasibility study to verify our methodology before a larger-scale study including more data from more patients and different datasets.Finally, only machine learning models characterized by an easier transfer of knowledge regarding the results and that could cover regression, clustering and tree-based algorithms were included in this study.Future studies, more focused on the technical aspects, could however be considered to expand on the selected models for instance by including deep neural networks.

Conclusion
In this study different detection models based on machine learning were applied in order to detect CAs among CASEs.We show that CAs can be automatically detected correctly while reducing the number of false alarms, using a detection model based on well-known machine learning algorithms.A threshold for the FPR in the mean receiver-operating-characteristic curve equal to 0.3 led to a percentage of 78.2% correct detections for all CAs and 93.4% and 95.2% for the most critical and clinically relevant CAs, namely those followed by bradycardia or bradycardia and desaturation.

Declaration of Competing Interest
I declare that I participated in the design, execution and analysis of the paper with the title 'Central Apnea Detection in Premature Infants using Machine Learning' together with the all the colleagues listed here, that I have seen and approved the final version and that it has neither been published elsewhere.I also declare that I have no conflict of interest.
considering a network including a total of M edges, the i -th edge connects two nodes with degree of α i and β i at their ends [47 , 49] Features extracted from the SpO

Fig. 5 .
Fig. 5. Feature relevance for the logistic regression (LR) with elastic net penalty fed with features from time windows [-30 s, 0 s), [-25 s, 5 s), [-20 s, 10 s) and [-15 s, 15 s) and trained using leave-one-patient-out cross-validation (LOPO CV).Feature relevance was investigated by extracting the coefficients, indicating the log odds ratios.Median log odds ratios were then computed and were ranked considering their absolute values to represent the twenty most relevant features.

Fig. 6 .
Fig. 6.AUROC values computed for each patient in the test set with the detection model based on logistic regression (LR) with elastic net penalty using features from windows [-30 s, 0 s), [-25 s, 5 s), [-20 s, 10 s) and [-15 s, 15 s) with the leave-one-patient-out cross-validation (LOPO CV).AUROC values are represented in blue for control patients and in red for LOS patients.

Fig. 7 .
Fig. 7.Percentages of correct detections (recall) performed by logistic regression (LR) with elastic net penalty trained using leave-one-patient-out cross-validation (LOPO CV) on CASEs present in the test set when a FPR = 0.3 was selected.In particular, in (A) all central apneas (CAs), CAs followed by a decrease in heart rate (HR) or oxygen saturation (SpO 2 ), in (B) CAs with different lengths and in (C) central apnea-suspected events (CASEs) agreed and disagreed upon by clinical experts during the first round of annotations are represented.
[33]uted by defining as breath cycle the time from the trough of one breath to the trough of the next.After computing the maximum cross-correlation value for adjacent breaths, this value is divided by the maximum of the energy of either breath alone to normalize the maximum cross-correlation value.The maximum cross-correlation values, for all pairs of adjacent breaths, are then averaged[33] Standard deviation of the SpO 2 47Slope of the SpO 2 Slope of the linear fitting of all the SpO 2 points