Detecting bursts in the EEG of very and extremely premature infants using a multi-feature approach

Highlights • Machine learning approach enables accurate detection of bursts in preterm EEG.• Features of amplitude and spectral shape capture discriminating information.• Improves reliability of estimates of inter-burst intervals.


Introduction
Preterm birth is the single largest risk factor for perinatal mortality and morbidity, accounting for over 1 million deaths every year [1] . The immature brain of the preterm infant is especially vulnerable and often the source of long-term health problems. The electroencephalogram (EEG) can help identify at-risk infants by providing continuous cot-side monitoring of brain activity in the neonatal intensive care unit (NICU). The EEG, however, requires interpretation by specialist staff which often makes it impractical to provide continuous reporting for all infants. Automated EEG analysis could overcome this limitation and provide the clinician with relevant information, in real time, to guide treatment during critical care.
Early preterm EEG exhibits an intermittent or discontinuous pattern ( tracé discontinu ) consisting of low-voltage activity, known as inter-bursts, followed by short-duration higher-voltage activity, known as bursts or spontaneous activity transients [2] . This pattern differs to the burst-suppression pattern found in the EEG of adults and full-term infants, a pattern associated with severe brain injury or coma [3] . In contrast, the discontinuous pattern is indicative of normal, healthy neurological development for the preterm infant. An important first stage for any automated analysis of preterm EEG is to distinguish between bursts and interbursts. Simple features of this bursting pattern, such as maximum inter-burst duration, relate to neurological development and are associated with neurological delay [4][5][6][7] . Segmentation of the EEG into bursts and inter-bursts is an essential first-stage for more advanced automated analysis; for example to predict neurodevelopmental outcome [8] , detect changes in sleep states [9] , or assess changes in maturation [7] .
Here, we propose to assess multiple amplitude and frequency features separately and then combine these features in a classifier. This approach has been applied to detecting burst-suppression patterns in full-term EEG [20,21] . Based on training from a large database of preterm EEG, machine learning algorithms can infer the best combination rules. We apply a feature selection procedure, that maximises relevancy and minimises redundancy, thus retaining only necessary features. Unlike existing methods, which either operate on 1 specific channel [17] or all channels simultaneously [6,18] , channels are processed independently as bursts can be focal or multi-focal and not always generalised across all channels. For example, in asynchronous activity bursts will not occur simultaneously across hemispheres [22] . For performance testing, feature sets and all parameters are estimated using strata of cross-validations to avoid overlap between training and testing data.

Acquiring and annotating the EEG
EEG data were collected from the NICU of the Cork University Maternity Hospital, Ireland, during the period 2009-2011. Data collection was approved by the Cork Research Ethics Committee of Cork Teaching Hospitals, Ireland. Informed and written parental consent was obtained before EEG recording.
EEG was recorded with the NicoletOne EEG system (CareFusion Co., San Diego, USA) using 11 electrodes according to the international 10-20 system of electrode configuration over the frontal, central, temporal, and occipital regions, a reference electrode at Fz, and a ground electrode behind the left ear. EEGs were recorded within 72 h of birth with a sampling frequency of 256 Hz. Infants with reported severe brain injuries, determined by cranial ultrasound scans within the first week of life, were not included.
Ten-minute segments with minimal artefact were selected from 36 EEG records (one segment per infant). These 10 min segments were, on average, 14 h post-birth (range: 3-41 h). Gestational age ranged from 23.4 weeks to 29.7 weeks with a mean of 27.4 weeks.
Two clinical physiologists (RO Lloyd and RM Goulding) annotated all EEG segments for bursts and inter-bursts. Bursts were defined as any preterm EEG activity not explicitly categorised as inter-bursts. Therefore the annotations included long-duration bursts ( > 20 s) which some classification systems would label as continuous activity [4] . We chose not to distinguish between bursts and continuous activity because the difference between continuous and discontinuous activity is not clearly defined for infants with gestational age less than 32 weeks [4] . Example annotations are in Fig. 1 .
EEG was analysed using the bipolar montage F4-C4, C4-O2, F3-C3, C3-O1, T4-C4, C4-Cz, Cz-C3, and C3-T3. EEG channels were annotated separately to develop a channel independent detector. As bursts do not always occur synchronously across all channels, a single channel was extracted for review to avoid annotation bias caused by the simultaneous display of multiple channels. One channel per infant was annotated and channel selection was alternated over all EEG records to avoid a channel bias. For example, F4-C4 was used for the first EEG, C4-O2 was used for the second, and so on. For all 36 EEGs, each channel was selected a median of 4.5 (range: 3-6) times.
Annotations differed between the two reviewers, as the example in Fig. 1 highlights. A consensus annotation, including only the burst or inter-burst periods where both reviewers agreed, was used for training and testing the classifier.

Amplitude features.
Discrete EEG signal x ( n ) was bandpass filtered using a 5th-order Butterworth filter into the i th frequency band ( i = 1 , 2 , 3 , 4 ) to produce x i ( n ). These filters implement the forward-backwards procedure to produce a zero-phase filter. We calculated signal envelope a i ( n ) of x i ( n ) as where z i ( n ) is the analytic associate of x i ( n ); H represents the Hilbert transform and j represents the imaginary unit of the complex-valued z i ( n ).

Spectral features.
Multiple features are used to quantify spectral characteristics. Relative spectral power for the i th band is estimated as where X ( k ) is the discrete Fourier transform (DFT) of length-N x ( n ), P total is the total spectral power over the 0.5-30 Hz range, and notation k ∈ i represents summation over the i th frequency band. To quantify spectral shape, we fit the line to the log-log spectrum Y ( k ) and then use slope c 2 and measureof-fit r 2 , defined as as features. This process has some similarity to a multifractal approach [24] but differs in the EEG frequency-band selection and summary measures. Mean frequency is calculated using the periodic-mean frequency estimate, with mod 2 π representing the modulus function in 2 π , f s the sampling frequency, and X i ( k ) is the DFT of x i ( n ). Instantaneous frequency is calculated using the central-finite difference estimate, with phase function φ i (n ) = arg [ z i (n )] , where z i ( n ) is the analytic signal described in (1) . We also include fractal dimension because of its association with spectral shape [25] . The Higuchi method first estimates curve length for scale value k as over m = 1 , 2 , . . . , k, using the entire frequency range 0.5-30 Hz for x ( n ). Curve length L ( k ), at scale k , is then computed as the mean value of L m ( k ) over all m values. This process is iterated for different scale values k . If the process is self-similar and stationary then where D is the fractal dimension. The slope of a line fit to the points (log k , log L ( k )) provides an estimate of −D [25] .

Frequency-weighted energy features.
Recent detection methods apply features of frequency-weighted energy measures [6,[17][18][19] . These measures produce an instantaneous estimate of signal energy that is dependent on both amplitude and frequency [19] . Palmu et al. used the absolute value of the nonlinear energy operator (NLEO) with a moving average window, defined as [17,23] and Koolen et al. used the line-length measure [18] Although line-length was presented as a measure of fractal dimension [18] , it better fits the definition of a frequency-weighted energy measure. Relating line length in (9) to curve length in (7) , Because the intercept is independent of the slope, line length has no apparent relation to fractal dimension.
Both NLEO and line-length measures are not included in the feature set. Instead, we use the envelope-derivative operator which has similar properties to the NLEO but is non-negative [19] . The operator for discrete signal x ( n ) is defined as [19] , where the discrete Hilbert transform h ( n ) is defined as IDFT represents the inverse DFT and sgn represents the sign function. NLEO and line length methods are compared with the proposed detector and are implemented according to published specifications [17,18] : EEG is bandpass filtered (0.5-10 Hz for NLEO and 1-20 Hz for line length) and a moving average filter is applied to the output of the operator (1.5 s for NLEO and 1 s for line length). The bandpass filtering uses a 1st-order Butterworth filter for the highpass component and a 6th-order elliptic filter for the low-pass component [17] . The envelope--derivative operator is implemented with the same NLEO specifications (0.5-10 Hz pre-processing filter and 1.5 moving-average post-processing filter).
Although nonlinear functions, we present the frequency response of a single sinusoidal input in Fig. 2 (d) for these frequencyweighting energy methods. For this diagram, the methods are implemented without the post-processing moving-average filter. The NLEO and envelope-derivative operator are implemented according to O'Toole et al. [19] ; for the line-length, only the forwarddifference com ponent of the method is implemented, as the frequency response for x (n + 1) − x (n ) is known but unclear for Each frequency response includes the previously described pre-processing filters and are normalised within the 0-30 Hz region.

Short-time analysis of features.
For all features, except the frequency-weighted energy measures, EEG is down-sampled to 64 Hz. For these exceptions (NLEO, line-length, and envelopederivative operator) the higher sampling rate (256 Hz) is used instead because of the known sensitivity to sampling frequency [6,19] . Once calculated, the feature itself is then down-sampled to 64 Hz to ensure uniformity of sampling across all features.
Features are estimated within a short-time window, shifted in time with a 75% overlap, as detailed in Table 1 . Spectral features use a 2 s window to include low-frequency activity at 0.5 Hz; amplitude and fractal dimension features use a 1 s window to allow for faster non-stationary activity.
Features with asymmetric or heavy-tailed distributions are transformed using the natural log. Log-transformed features include line-length, NLEO, envelope-derivative operator, envelope, and spectral-power features. All features are then normalised to zscores.

Feature selection and classification
Feature selection was implementing using the maximumrelevance-minimum-redundancy (mRMR) approach [26] . This method includes both a filter and wrapper stage. The filter stage, which is independent of the classifier, uses mutual information to find a feature subset that maximises relevance and minimises redundancy. The wrapper stage uses backwards elimination to rank feature subsets based on classifier performance. The reduced feature set from the filter stage allows implementation of the more sophisticated backwards-elimination procedure with a realisable computational load.
Next, features were combined using a support vector machine (SVM). We selected an SVM because of its successful application in other newborn EEG methods [21,27] . SVMs can use different kernels to generate different decision boundaries [28] . In initial testing we found no significant improvement for the radial basis function over the linear kernel and thus implemented the linear kernel. The linear-kernel SVM can be expressed as the linear regression equation where  . The detector generates features from the EEG and combines them in the SVM (boxes 1-2). A threshold (box 3) then converts the continuous SVM output to a binary output that indicates bursts or inter-bursts. As part of the testing stage, features of the temporal distribution of the bursts, for example maximum IBIs, are estimated from this binary output (box 4) and compared with features derived from the human experts' annotations. The detector is developed using the consensus annotation and tested using multiple metrics: AUC, sensitivity and specificity, Cohen's kappa, and absolute difference between features of the burst annotations. Key: SVM, support vector machine; AUC, area under the receiver operator characteristic; IB, inter-burst; IBI, inter-burst interval.
To produce a binary output indicating either bursts or interbursts, a threshold is applied to the discriminating function D [ x ( n )] in (11) . We implement both the static threshold T thres = 0 and the infant-dependent (adaptive) threshold T thres = mean { D [ x (n )] } [6,18] .
We set lower limits on the duration of bursts and inter-bursts to remove short-duration segments. These limits are estimated from the reviewers' annotations by selecting the 2.5th percentile of burst (and inter-burst) duration.

Analysis of detection performance
The detector is developed using the consensus annotation; for testing, both consensus and individual annotations were used. Fig. 3 gives an overview of this process.
Both the individual features and the detector are assessed using the area under the receiver operator characteristic (AUC), generated from a time-based assessment on a sample-by-sample basis. AUC measures detection performance with values ranging from 0 to 1 where 0.5 representing random chance. Bursts were labelled as the signal-of-interest: true positives implies correct detection of bursts. For the individual features, an AUC was generated for each infant ( n = 36 ) and features were deemed statistically significant ( p < 0.05) when the 95% CI of the AUC excluded 0.5.
Performance of the detector was assessed within a crossvalidation procedure, thus avoiding over-fitting and reducing bias in testing error. Feature selection was implemented in a nested (inner) cross-validation for each outer training fold, as described in Ref. [29] . Both inner and outer cross-validation folds used a leaveone-out scheme, with testing on each left-out EEG record (one record per infant).
All parameters, including feature z -score parameters and SVM weights, were estimated within the training set and then applied to the testing set. Lower duration limits for bursts and inter-bursts were estimated from the reviewers' annotations in the outer crossvalidation. Features were generated from the EEG first. For feature selection and SVM training, only 1/500th of the training data (every 500th sample of the generated features) were used. This reduction in training data gave a good compromise between providing a representative distribution of values for both bursts and interbursts and computational efficiency during the training stage. For the testing stage, all available data was used.
AUC values for the NLEO and line-length methods [17,18] were compared to the AUC (cross-validation testing results) for the proposed detector. In addition to the time-based assessment, we also include an event-based assessment for sensitivity-specificity. The event-based assessment quantifies detection performance independent of burst and inter-burst duration, defining a true positive when detecting more than 75% of the burst duration.
Inter-rater agreement between the two human experts is quantified using Cohen's kappa statistic ( κ) with the two annotations ( Fig. 3 ). Bias and prevalence terms are reported with the κ statistic to better estimate agreement: prevalence quantifies the difference in the proportion of bursts to inter-bursts and bias quantifies the difference in the proportion of agreed bursts and inter-bursts. To assess the detector's performance relative to inter-rater agreement, the detector is compared to each annotation separately using κ.
Three measures are calculated on the detector's binary output: median inter-burst interval; maximum inter-burst interval; and burst-to-inter-burst ratio, the percentage of time the EEG is annotated as a burst per EEG record. These features represent important summary measures of preterm EEG as markers of normal maturation [4][5][6][7]9,23] . These three features were also calculated using the reviewer's annotation; absolute differences were calculated between the two annotations and the detection method, as indicated in Fig. 3 .
Pair-wise comparisons use the Wilcoxon signed-rank test and include the median difference with a 95% CI. CIs are generated using the bootstrap method with 10 0 0 iterations. P -values are reported with sample size n ; in most instances n = 36 , the number of EEG records and infants in the study. When comparing the proposed detector to existing methods, we require p < 0.05 and at least a 1% improvement in performance to link statistical significance to engineering significance.
Finally, the detector was trained on all EEG records to generate a prototype burst detector suitable for validation on independent data. Matlab and Octave code for this detector (version 0.1.1) is provided in the Supplementary Material and updates are available at https://github.com/otoolej/burst _ detector .
Sensitivity-specificity using the time-based assessment for both static and adaptive thresholds is plotted in Fig. 6 (a). Sensitivity is higher for the static threshold, with median (95% CI) difference between the static and adaptive thresholds of 17.9% (11.5-24.6%). But specificity is lower for the static threshold, with a difference between thresholds of −5.0% ( −7.6% to −3.3%). Both differences are significant: p < 0.001, n = 36 . A similar picture emerges for the event-based assessment in Fig. 6 (b): median (95% CI) difference between the static-adaptive thresholds is 1.2% (0.0-7.4%) for sensitivity and −8.7% ( −11.5 to −3.4%) for specificity, with p < 0.001 ( n = 36 ) for both comparisons. Table 4 shows inter-rater agreement together with the agreement between the detector and two reviewers, using AUC and κ as measures of agreement. Whereas the consensus annotation is used to train and test the detector, with results in Table 3, Table 4 presents testing results using the full individual annotations (see Fig. 3 ) and compares with inter-rater (reviewer) agreement. Fig. 1 shows examples of EEG segments comparing the two annotations to the detection method. Fig. 7 shows differences in estimates of median and maximum inter-burst intervals and burst-to-inter-burst ratio, based on the annotations of the human experts and the proposed detector. Differences between detector and the two human experts is significantly lower than differences between human experts in 3 out of the 6 comparisons.
And lastly, we assessed processing speed for the proposed and existing methods. All methods were implement in Matlab (Release 2013a, The Mathworks Inc., Massachusetts, United States) on a desktop computer with a 2.8 GHz Intel Xeon processor and 8 GB of RAM. We used 2 h of EEG with 8-channels sampled at 256 Hz and processed each channel separately. The computational time was 64 s for the proposed method, 47 s for the NLEO method, and 1 s for the line-length method. Although slower than the singlefeature methods, the proposed method is capable of processing EEG in real time.

Discussion
The proposed method outperforms existing methods for detecting bursts in preterm EEG, with improvements of approximately 4-5% in AUC ( p < 0.001; n = 36 ) over the frequency-weighted energy methods [17,18] . Unlike these existing methods, the proposed method combines different f eatures of amplitude and spectral content with a frequency-weighted energy measure. The crossvalidation testing results-median AUC of 0.989 and sensitivityspecificity of 95.8-94.4%-show that the detector is capable of operating with a high-level of accuracy. The proposed method is also capable of analysing EEG in real time, with un-optimised code processing 2-h of 8-channel EEG in just over 1 min.
The 0.5-3 Hz envelope feature outperforms all frequencyweighted energy measures, with a difference in AUC of 2-4% ( Fig. 5 ). An increase in low frequency amplitude is known to be associated with burst activity [2] . In contrast, the frequency-weighted energy measures suppress content within this band [ Fig. 2 (d)]. Although the frequency responses in Fig. 2 (d) will differ for multicomponent signals, their similarity for mono-components suggests that the pre-processing bandpass filters may be the most influential discriminating factor.
Most of the significant spectral features (5/7) are specific to the 15-30 Hz band ( Fig. 5 ). And almost all spectral features (11/12) in frequency bands < 15 Hz performed poorly ( p > 0.05). This suggests that frequency-weighted energy measures, which all operate < 20 Hz, rely heavily on amplitude and not on spectral characteristics. Yet the feature set always ( Table 2 ) included the envelopederivative operator, implying that there is value in including a feature which multiplies frequency by amplitude. In addition, both amplitude and frequency features were frequently ( > 90%) included by the feature selection process ( Table 2 ). Future work could develop features to further exploit spectral differences. For example, the burst-to-inter-burst spectral ratio in Fig. 2 (c) could be applied in a spectral density correlator [30] . This type of matched filter correlates a received signal (EEG PSD) with a template (PSD estimate of bursts).
The two threshold methods, static and adaptive, produced similar results: better sensitivity with the static threshold and better specificity with the adaptive threshold. The static threshold may be a more robust approach however, as the adaptive threshold will fail in continuous or inactive EEG and will hinder a real-time implementation because of the required time-lag involved in threshold estimation.
Agreement between the detector and reviewer annotations was moderate ( κ = 0 . 65 and 0.72) with broad CIs, similar to agreement between the reviewers ( κ = 0 . 60 ). The seemingly high sens. spec. sens. spec.   Plots show absolute differences for the 3 features between reviewers' annotations (A1 and A2) and proposed detection method (using the static threshold). Pair-wise comparisons: either not significant (ns), * for p < 0.05, or * * * for p < 0.001 using the Wilcoxon signed-rank test.
performance for the detector with the consensus annotation (AUC of 0.99, Table 3 ) drops to 0.84 and 0.85 when tested on the two annotations separately, reflecting the level of inter-rater agreement between human experts ( Table 4 ). Our findings are consistent with known agreement rates: Palmu et al. reported rates of between 81% and 86% [23] and Murphy et al. reported rates of 71% with kappa values between 0.53 and 0.66 [6] , although both studies included three, not two, annotations. This moderate inter-rater agreement highlights the inconsistencies in annotating bursts in preterm EEG and will limit the efficacy of any machine learning approach.
There is a clear advantage to an automated approach for the estimation of summary statistics of the burst annotation compared to visual interpretations ( Fig. 7 ). Visual annotations will vary because of only moderate inter-rater agreement. The objectivity of the algorithm will decrease variability within these measures and therefore increase the reliability of preterm EEG analysis.
This study has several limitations. The proposed method was developed on EEG from infants with gestational ages ranging from 23 to 30 weeks, thus we are uncertain of how the method will perform for infants older than 30 weeks. The EEG data was largely artefact free, representative of a realistic sample of EEG used for visual analysis by clinical physiologists. For recordings with major artefacts, it may be necessary to include a pre-processing artefact detection system to assess the quality of the EEG [31,32] . Although we have compared the method to existing detection methods, a fair comparison is difficult as methods were developed on different channel montages, with single channel or multi-channel implementations, and with different underlying definitions of bursts and inter-bursts [6,17,23] . Nonetheless, our results indicate that the multi-feature approach, with data-driven combination rules, better captures the complexity of the burst waveform compared to the single-feature approach. Although our method was developed on a larger EEG data set of preterm infants ( n = 36 ) compared to existing methods ( n = 18 and n = 16 [17,18] ) the proposed method requires validation on a large, independent data set.

Conclusions
An important stage for the automated analysis of preterm EEG is to distinguish between bursts and inter bursts. We show that using a combination of features improves detection performance over existing methods. We also show that automated methods of detection improve the reliability of estimates of the median interburst interval and the burst-to-inter-burst ratio. Improving burst detection will improve downstream analysis of preterm EEG such as tracking maturation and predicting neurodevelopmental delay [7,9] .