Single-channel qEEG characteristics distinguish delirium from no delirium, but not postoperative from non-postoperative delirium

• Our random forest quantitative EEG (qEEG) classifier could classify delirium versus no delirium with an AUC of 0.76


Introduction
Delirium is a neuropsychiatric syndrome characterized by disturbances of attention, awareness, and other cognitive functions (American Psychiatric Association, 2013). Increasing our knowledge of delirium is of great importance, since the condition affects one in every four to five hospital patients (Gibb et al., 2020) and is associated with prolonged hospitalization (Salluh et al., 2015), long-term cognitive decline (Austin et al., 2019; Goldberg et al., 2020; Tsui et al., 2022), institutionalization and mortality, as well as increased costs (Inouye et al., 2014; Leslie et al., 2008).
Delirium can be categorized in different ways (Wilson et al., 2020), but one distinction that is often made is that between postoperative and non-postoperative delirium (Palanca et al., 2017). In delirium research, elective surgery patients are often studied because they are well suited to prospective study designs (Boord et al., 2020; Wiegand et al., 2022). However, various factors in postoperative delirium are not uniform across the entire delirium population. For instance, the use of opioids and sedatives, two types of drugs commonly prescribed in surgical patients, increases delirium incidence (Sanson et al., 2018). Furthermore, differences in electroencephalography (EEG) changes between postoperative and sepsis-related delirium have been reported. Triphasic waves have been reported in sepsis-related delirium but not in postoperative delirium, while mixed alpha and fast oscillations have been reported in postoperative delirium but not in sepsis-related delirium (Palanca et al., 2017). These differences raise the question of whether findings from delirium research in the postoperative setting can be generalized to other types of delirium or should be interpreted separately.
EEG provides a unique combination of accessibility and high temporal resolution, and it has been used for decades in delirium research (Engel and Romano, 1959). We previously showed that single-channel EEG (Fp2-Pz) could adequately detect delirium using only one minute of recorded data in a postoperative cohort, based on a limited number of qEEG changes (Ditzel et al., 2022; Numan et al., 2019). Here, we wanted to focus on expanding our knowledge of qEEG changes in delirium. Additionally, we aimed to explore whether qEEG features differ between postoperative and non-postoperative delirium.
To facilitate the comparison of a greater number of qEEG features than in prior delirium EEG studies (Van Der Kooi et al., 2015), machine learning (ML) techniques can be used. ML is typically better suited than traditional statistical methods for analyzing large numbers of potentially non-predictive variables, as demonstrated in genetic research (Libbrecht and Noble, 2015). Random Forest (RF) is a type of supervised ML that is seen as a leading algorithm for classification (Deo, 2015), because it has been shown to be robust and relatively insensitive to overfitting (Breiman, 2001; Hosseini et al., 2021). In this exploratory study design, we used RF to allow a great variety of qEEG characteristics and their interactions to be included in the comparison. We used binary classification for two-sample testing to study possible group differences (Friedman, 2003). We aimed to improve the understanding of delirium, first by exploring a variety of qEEG features related to delirium and second by comparing qEEG changes of postoperative and non-postoperative delirium patients.

Setting and study populations
This project was part of the DeltaStudy, a cross-sectional, multicenter study in Intensive Care Units (ICUs) and non-ICU wards between May 2019 and December 2020 in ten hospitals in the Netherlands (University Medical Center (UMC) Utrecht, Diakonessenhuis Utrecht, Radboud UMC, Isala Zwolle, Isala Meppel, Tergooi Medical Center, Franciscus Gasthuis & Vlietland, Onze Lieve Vrouwe Gasthuis, Amphia, Medisch Spectrum Twente). The study was conducted in accordance with the ethical principles that originated from the 2013 version of the Declaration of Helsinki (World Medical Association, 2013), was approved by the local ethical committee of UMC Utrecht (17857), and was registered at ClinicalTrials.gov (NCT03966274). This manuscript adheres to the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) guidelines (Collins et al., 2015).
Patients were included when admitted on the days that the researchers visited the hospital. Inclusion criteria for the ICU were an expected stay of at least 24 h and a minimum age of 18. Patients admitted to a general ward were included if they had an expected stay of at least 48 h and a minimum age of 60. Exclusion criteria in both groups were: primary brain surgery or any type of brain injury in the six preceding weeks, admission for primary neurological or neurosurgical disease, known dementia, use of lithium, presence of an intracranial metal plate or device, a language barrier or deafness, and a Richmond Agitation-Sedation Scale (RASS) score below −2. Patients who showed severe agitation disturbing the EEG measurement were also excluded. Additionally, patients with partly or completely missing delirium assessment or EEG data were excluded.

Clinical assessment of delirium
A trained researcher administered the Delirium Interview, a test battery targeting the DSM-5 delirium criteria that takes about ten minutes and is designed to serve as a reference standard for studies on delirium assessment tools, with a sensitivity of 89% (95% Confidence Interval (CI) 72%-98%) and a specificity of 82% (95% CI 71%-90%) (Ditzel et al., 2023). The results of the Delirium Interview were combined with information extracted from the electronic health record covering 24 h before and 12 h after the measurement. The diagnosis of delirium was based on a majority vote of three independent delirium experts, who were asked to assess delirium at the moment of the measurement. These experts were blinded to the RF classification results. The panel of experts comprised seventeen clinicians, primarily geriatricians and psychiatrists, who had an average of 17 years of clinical experience (standard deviation (SD) 6.3 years). Each of these clinicians encountered approximately 10 delirious patients per week.

EEG recordings, pre-processing, and analysis
Directly after the delirium assessment, the same researcher performed an EEG measurement. The researcher was blinded to the definitive outcome of the delirium assessment while measuring the EEG. A four-minute, resting-state, eyes-closed EEG was recorded with a single Fp2-Pz lead, previously selected (Van Der Kooi et al., 2015) and validated (Ditzel et al., 2022; Numan et al., 2019) for delirium monitoring. Specially developed self-adhesive patches fixed with Ten20 electrode paste were used. EEGs were recorded in the patient's hospital room, without displaying the recorded signal to the researcher. EEGs predominantly containing noise were excluded based on visual inspection. Analysis was performed with R version 4.1.0. The raw EEG data were first band-pass filtered from 1-25 Hz to largely eliminate electromyography artifacts (Whitham et al., 2007) and signal drift. Next, the data were filtered to delta (1-4 Hz), widened delta (1-6 Hz), theta (4-8 Hz), alpha (8-13 Hz), and beta (13-25 Hz) bands. Then, the data were cut into eight-second epochs, after which automated artifact rejection took place. We applied automated artifact rejection because we preferred objectivity and reproducibility over sophistication, especially since manual epoch selection in single-channel EEG can be prone to subjectivity. This automated rejection consisted of excluding epochs in which the signal exceeded five times the SD of the signal amplitude. To ensure a balanced representation for each patient in the analysis, we analyzed an equal number of epochs for each participant. This approach was chosen to avoid a potential decrease in feature reliability for patients with fewer artifact-free epochs. The number of epochs to be included per patient was chosen to maximize data availability while minimizing patient exclusion. The first artifact-free epochs were selected to reduce selection bias (van Diessen et al., 2015).
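The epoching and rejection steps described above can be illustrated with a minimal sketch. It is written in Python with the standard library only, as a stand-in for the authors' R code, which is not reproduced here. The function name select_epochs, the sampling rate argument fs, and the choice to compute the SD over the whole filtered recording are assumptions made for illustration; the text does not specify them.

```python
import statistics


def select_epochs(signal, fs, epoch_s=8, sd_factor=5.0, n_keep=8):
    """Cut a filtered single-channel signal into epochs and keep the first clean ones.

    Hypothetical helper: `fs` (sampling rate in Hz) is not stated in the text.
    """
    epoch_len = int(epoch_s * fs)
    mean = statistics.fmean(signal)
    # Assumption: the rejection threshold uses the SD of the whole recording.
    threshold = sd_factor * statistics.pstdev(signal)
    clean = []
    for start in range(0, len(signal) - epoch_len + 1, epoch_len):
        epoch = signal[start:start + epoch_len]
        # Reject the epoch if any sample deviates more than 5 x SD from the mean.
        if max(abs(x - mean) for x in epoch) <= threshold:
            clean.append(epoch)
    # Too few clean epochs: the participant would be excluded (returns None).
    if len(clean) < n_keep:
        return None
    # Keep the first artifact-free epochs, so every participant contributes equally.
    return clean[:n_keep]
```

Taking the first clean epochs, rather than the cleanest ones, mirrors the stated aim of reducing selection bias.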
A custom R script was used for qEEG feature extraction. The choice of features was based on previous delirium and epilepsy EEG literature (Chao and Shen, 2003; Cohen, 2008; Komsta and Novomestky, 2022; Luque et al., 2009; Salimpour and Anderson, 2019; Snaedal et al., 2012; Zhu et al., 2014). For explorative reasons, we used additional statistical measures suitable for quantifying characteristics of time series data, which were not specific to qEEG analysis. Since there was minimal risk of overfitting in the RF model, we were able to include a large number of features, even when they were not specifically predictive for the diagnosis of any clinical syndrome according to the literature (Boord et al., 2020). For the different frequency bands named previously, we calculated the absolute power, relative power, four slow/fast power ratios, peak frequency, signal autocorrelation, entropy, horizontal visibility graph analysis, skewness, kurtosis, variance and cross-frequency coupling. See Appendix A for an extensive description of these features and the code used to calculate them. Before analysis, feature values were averaged over the eight epochs per subject and added to a feature matrix with features × subjects.
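As an illustration of two of the listed time-series features, the sketch below computes a lagged autocorrelation and the skewness per epoch and averages them per subject. It is a Python stand-in, not the code from Appendix A; the exact definitions used by the authors may differ, and the function names and the fs argument are hypothetical.

```python
import statistics


def autocorrelation(x, lag):
    """Autocorrelation of x with a copy of itself shifted by `lag` samples."""
    m = statistics.fmean(x)
    denom = sum((v - m) ** 2 for v in x)
    num = sum((x[i] - m) * (x[i + lag] - m) for i in range(len(x) - lag))
    return num / denom


def skewness(x):
    """Third standardized moment (population form)."""
    m = statistics.fmean(x)
    sd = statistics.pstdev(x)
    return sum(((v - m) / sd) ** 3 for v in x) / len(x)


def subject_features(epochs, fs, shift_s=0.25):
    """Average per-epoch feature values over all epochs of one subject."""
    lag = round(shift_s * fs)  # time shift in seconds converted to samples
    return {
        "autocorr": statistics.fmean([autocorrelation(e, lag) for e in epochs]),
        "skewness": statistics.fmean([skewness(e) for e in epochs]),
    }
```

Each subject then contributes one value per feature, which corresponds to one column of the features × subjects matrix.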

Random forest classification
RF classification was used to train models for two comparisons: (1) delirium versus no delirium, and (2) postoperative delirium versus non-postoperative delirium. Within the delirium group, patients who had undergone any type of surgery during their current hospital stay were considered postoperative. The remaining patients were considered non-postoperative. Models were based on the qEEG features mentioned previously. In R, RF was implemented with the randomForest function in the randomForest package. Decision trees were trained using under-sampling to prevent skewed model performance in favor of the majority group. We used 1000 decision trees instead of the standard 500 (Breiman, 2001) to correct for the smaller sample per decision tree. The number of features considered per tree split equaled the square root of the total number of features, which amounts to approximately ten features per split.
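The under-sampling idea and the features-per-split arithmetic can be sketched as follows. This is a Python illustration under assumptions, not the randomForest configuration used in the study: the helper balanced_sample is hypothetical, drawing with replacement is assumed, and the feature count of 98 is taken from the Bonferroni factor reported in the statistical analysis.

```python
import math
import random


def balanced_sample(labels, rng):
    """Draw, with replacement, equally many subject indices from each class.

    Each tree would be grown on such a sample, so the majority class
    cannot dominate training.
    """
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)
    n_minority = min(len(members) for members in by_class.values())
    sample = []
    for members in by_class.values():
        sample.extend(rng.choices(members, k=n_minority))
    return sample


n_features = 98  # total number of qEEG features (from the Bonferroni factor)
features_per_split = math.sqrt(n_features)  # about 9.9, i.e. roughly ten
```

Because each tree sees only twice the minority-class count, growing more trees is a plausible way to compensate for the smaller per-tree sample, as the text notes.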
The second classifier, postoperative versus non-postoperative delirium, was trained solely on patients positive for delirium (N = 129). To account for the smaller number of subjects available to train the second classifier, we performed a sensitivity analysis. Classification of delirium versus no delirium was also performed on a randomly selected patient subset with the same sample size as the delirium subgroup (N = 129). For this analysis, we kept the incidence of delirium in the subset equal to the fraction of non-postoperative delirium in the original cohort (i.e., 32/129 or 25%), which was similar to the delirium incidence in the original sample (i.e., 28%). This was done to compare the performance of the two classifiers irrespective of differences in training dataset size and incidence of predicted classes (Figueroa et al., 2012).
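A minimal sketch of how such a subset could be drawn is given below, assuming simple random sampling without replacement within each class. The function draw_subset is hypothetical and written in Python for illustration only.

```python
import random


def draw_subset(labels, n_pos, n_neg, rng):
    """Randomly pick n_pos delirium and n_neg non-delirium subject indices."""
    pos = [i for i, lab in enumerate(labels) if lab == 1]
    neg = [i for i, lab in enumerate(labels) if lab == 0]
    return sorted(rng.sample(pos, n_pos) + rng.sample(neg, n_neg))


# Full cohort: 129 patients with delirium, 327 without (456 in total).
cohort = [1] * 129 + [0] * 327
# Subset of 129 with 32 delirium cases, matching the 32/129 class ratio.
subset = draw_subset(cohort, n_pos=32, n_neg=97, rng=random.Random(0))
```

The subset then matches the second classifier in both sample size and minority-class fraction, which is what makes the two AUCs comparable.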

Evaluation of random forest classification
Validation of the RF models, both for the main models and for the sensitivity analysis, was performed on unseen data, using 10-fold cross-validation. With this method, subjects are first randomly divided into ten subsets or folds without replacement. Model training was conducted using 90% of the dataset, containing nine subsets, with the remaining 10% reserved for outcome prediction. This procedure was iteratively carried out ten times, each time employing a distinct validation subset (Burman, 1989). Next, by combining predicted outcomes for all ten folds, we were able to calculate model performance measures for the entire sample, while still only taking into account performance on unseen data. Use of cross-validation in this way allowed for validation on unseen data, minimizing overfitting, while not requiring a separate testing dataset. This method is illustrated in Fig. 1.
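The pooling of out-of-fold predictions can be sketched generically. The Python example below is an illustration, not the authors' R implementation; fit and predict are hypothetical placeholders for any classifier, and the fold assignment shown is one simple way to split subjects without replacement.

```python
import random


def make_folds(n_subjects, k, rng):
    """Shuffle subject indices and deal them into k disjoint folds."""
    idx = list(range(n_subjects))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validated_scores(X, y, fit, predict, k=10, seed=0):
    """Return one out-of-fold prediction per subject.

    `fit(X_train, y_train)` returns a model; `predict(model, x)` scores one
    subject. Both are placeholders supplied by the caller.
    """
    scores = [None] * len(y)
    for test_idx in make_folds(len(y), k, random.Random(seed)):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        # Every subject is scored by a model that never saw that subject.
        for i in test_idx:
            scores[i] = predict(model, X[i])
    return scores
```

Performance measures are then computed once on the pooled scores, so each subject contributes exactly one prediction made on unseen data.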
We evaluated RF performance with the area under the receiver operating characteristic curve (AUC) and the accuracy, defined as the number of correct classifications divided by the total number of subjects. The presented sensitivity, specificity, and positive and negative predictive values (PPV, NPV) were calculated at the optimal operating point on the Receiver Operating Characteristic (ROC) curve, as determined by the random forest algorithm. Lastly, the relative feature mean decrease in accuracy (MDC) was used to display feature importance, as it describes the loss of model accuracy when removing a particular feature during training of the model. Absolute MDC was transformed to relative MDC by dividing by the largest absolute value, scaling the measure to a range of 0 to 1.
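These performance measures can be written out in a short sketch. The Python code below is illustrative only. In particular, the text does not state how the optimal operating point was chosen, so the use of Youden's J (sensitivity + specificity − 1) here is an assumption, and all function names are hypothetical. Both classes are assumed to be present in the labels.

```python
def auc(scores, labels):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def _ratio(num, den):
    return num / den if den else float("nan")


def performance(scores, labels, threshold):
    """Sensitivity, specificity, PPV, NPV and accuracy at one threshold."""
    tp = fp = tn = fn = 0
    for s, lab in zip(scores, labels):
        predicted = s >= threshold
        if predicted and lab == 1:
            tp += 1
        elif predicted:
            fp += 1
        elif lab == 1:
            fn += 1
        else:
            tn += 1
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
    }


def youden_threshold(scores, labels):
    """Assumed operating point: the threshold maximizing Youden's J."""
    def j(t):
        m = performance(scores, labels, t)
        return m["sensitivity"] + m["specificity"] - 1
    return max(sorted(set(scores)), key=j)


def relative_mdc(mdc):
    """Scale absolute MDC values by the largest absolute value."""
    largest = max(abs(v) for v in mdc.values())
    return {name: v / largest for name, v in mdc.items()}
```

With this scaling the most important feature always has a relative MDC of 1, which makes importance profiles comparable between the two classifiers.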

Statistical analysis
For both classifiers, we applied statistical testing to the 10% of qEEG features most predictive (based on the MDC) for that comparison to test the univariate differences within our groups (Kent et al., 2020). Based on Shapiro-Wilk testing, normally distributed features were compared between two groups with independent t-tests, while non-normally distributed features were compared using Mann-Whitney U tests. We used a strict Bonferroni correction for multiple testing, based on the total number of features. Since we multiplied the p-values by 98, the total number of features, double dipping through preselection of features before statistical testing played no role. A corrected p-value of < 0.05 was considered to indicate statistical significance.
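The correction itself is simple arithmetic, shown here as a small Python sketch. The function name and the example p-values are hypothetical; only the factor of 98 and the 0.05 threshold come from the text.

```python
def bonferroni(p_values, n_tests=98):
    """Multiply each raw p-value by the number of tests, capped at 1."""
    return [min(1.0, p * n_tests) for p in p_values]


# Hypothetical raw p-values, for illustration only.
raw = [0.0001, 0.001, 0.02]
adjusted = bonferroni(raw)
significant = [p < 0.05 for p in adjusted]
```

Correcting for all 98 features, rather than only the ten that were tested, is what makes the preselection step harmless: a raw p-value must be below roughly 0.0005 to remain significant.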

Participants
In total, 615 patients were enrolled, of whom 286 were in the ICU and 329 were in non-ICU departments (see the participant flowchart in Fig. 2). We excluded 73 patients because of Electromagnetic Compatibility (EMC) disturbances in the EEG, which were caused by artifacts induced by the electromagnetic environment in some hospital rooms. These issues were resolved through hardware changes within the first year of patient enrollment. During demographic data extraction, we excluded 48 patients because they did not meet all inclusion criteria on closer inspection. Another 22 patients were excluded because not all the required data were available. Lastly, fifteen patients were excluded because of poor EEG quality based on initial visual inspection. Artifact rejection of the remaining cohort excluded 10.63% of available epochs, with an average of 2.95 epochs per participant. Data availability allowed us to include the first eight epochs per participant, while only excluding one patient because of insufficient epochs after automated artifact rejection. Therefore, the final dataset consisted of 456 patients, of whom 129 had delirium. Within the delirium population, 32 patients were non-postoperative, and 97 were postoperative. Patient characteristics are shown in Table 1.
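The participant flow can be checked with a few lines of arithmetic. The Python snippet below only restates the counts given above; the variable names are chosen for illustration.

```python
enrolled = 286 + 329  # ICU plus non-ICU patients

exclusions = {
    "EMC disturbance in the EEG": 73,
    "inclusion criteria not met": 48,
    "required data unavailable": 22,
    "poor EEG quality on visual inspection": 15,
    "too few artifact-free epochs": 1,
}

final_n = enrolled - sum(exclusions.values())
delirium_n = 32 + 97  # non-postoperative plus postoperative delirium
delirium_incidence = delirium_n / final_n  # about 0.28
```

The result agrees with the reported final sample of 456 patients and the delirium incidence of 28%.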

Classification of delirium
It is important to note that the performances in this section and section 3.3 were calculated by combining the performance of the ten different models trained during 10-fold cross-validation on unseen data, as explained in the methods section. Using the whole sample (N = 456) as model training data, delirium classification with RF resulted in an AUC of 0.76 (95% CI 0.71-0.80). Sensitivity was 0.77 (95% CI 0.72-0.82) and specificity was 0.63 (95% CI 0.55-0.72), both calculated at the optimal operating point on the ROC curve. Complete classification performance measures are shown in Table 2. The corresponding confusion matrix can be found in Appendix B, Table S2a.

Classification of postoperative delirium
Using the sample of patients with delirium (N = 129), the classification of postoperative versus non-postoperative delirium resulted in model performance similar to chance level, with an AUC of 0.50 (95% CI 0.38-0.61). Sensitivity was 0.48 (95% CI 0.30-0.66). Specificity was 0.43 (95% CI 0.33-0.54). Complete classification performance measures are shown in Table 2. The corresponding confusion matrix can be found in Appendix B, Table S2b.

Classification of delirium sensitivity analysis
RF training of the delirium classifier in a randomly selected cohort of 129 patients resulted in an AUC of 0.77 (95% CI 0.68-0.85).

Most predictive features
The ten most important qEEG features based on MDC for both the classifier for delirium and the classifier for postoperative delirium are displayed in radar charts (Fig. 3). Based on MDC, the first model (delirium versus no delirium) depends for as much as 25% on the two most important qEEG features (theta peak frequency and relative alpha power), as the model lost 25% of its accuracy when these features were removed. The other eight most important features are displayed in Fig. 3. The most predictive features showed that theta peak frequency and relative alpha power were lower in delirium, while relative widened delta power and theta/alpha ratio were higher in delirium. Additionally, two variants of beta autocorrelation were higher in delirium, while two variants of alpha and theta autocorrelation were lower in delirium. Lastly, alpha-beta cross-frequency coupling (CFC) and general skewness were lower in delirium.
After strict Bonferroni correction, there was a statistically significant difference between patients with delirium and without delirium for all of these features except for alpha autocorrelation with a time shift of 0.25 s and alpha-beta amplitude envelope cross-frequency coupling. The MDC values and p-values for these features are presented in Table 3.
None of the ten best classifying qEEG features for postoperative delirium showed a statistically significant difference compared to non-postoperative delirium. There was some overlap in the most important features of the two classifiers: beta autocorrelation with a shorter time shift of 0.25 s, alpha autocorrelation with a longer time shift of 0.5 s, theta autocorrelation with the same time shift of 0.125 s, skewness and relative widened delta power. See Appendix C, Fig. S1 for boxplots of the most predictive quantitative EEG features for both classifiers.

Discussion
In this multicenter study we used RF to classify delirium based on single-channel, resting-state EEG, resulting in an AUC of 0.76 (95% CI 0.71-0.80). The RF model was based on increased slow-wave activity in delirium (lower theta peak frequency, lower relative alpha power, higher relative widened delta power and higher theta/alpha ratio) and on a different autocorrelation in delirium (lower in the alpha and theta band and higher in the beta band with relatively longer time shifts). When classifying postoperative versus non-postoperative delirium, model performance was similar to chance level, with an AUC of 0.50 (95% CI 0.38-0.61). In other words, the same method that could accurately predict the presence of delirium using EEG, even after conducting a sensitivity analysis that corrected for differences in sample size, could not classify whether delirium patients had surgery before developing delirium.
We found an AUC of 0.76 when classifying delirium with RF, which matches or slightly exceeds the performance of delirium classifiers in recent literature. For example, the outcome exceeds the AUCs of 0.70 (Mulkey et al., 2023; van Sleuwen et al., 2021), as well as an AUC of 0.71 (Yamanashi et al., 2022). Furthermore, it is consistent with an AUC of 0.74 (Urdanibia-Centelles et al., 2021), and an AUC of 0.78 (Ditzel et al., 2022). The current study includes the largest sample size and the most heterogeneous population, and used a relatively extensive delirium reference standard. Furthermore, the current study used an electrode selection that was previously selected (Van Der Kooi et al., 2015) and independently validated (Numan et al., 2019).
Our finding that the EEG in delirium had a lower autocorrelation in the alpha and theta bands aligns with van der Kooi et al. (2014), who found that in delirium the EEG shows decreased complexity, calculated as the repetitiveness of the signal when analyzing the EEG window by window. The current study investigated autocorrelation in more detail than van der Kooi et al. investigated their measure of signal complexity, by using autocorrelation with different time shifts separately for each frequency band. Skewness and cross-frequency coupling have not been linked to delirium before, indicating possible new research targets in the expanding body of research on delirium EEG monitoring. Although previous studies suggested possible differences in EEG characteristics associated with postoperative and sepsis-related delirium (Palanca et al., 2017), to the best of our knowledge, this is the first study to directly compare qEEG features of postoperative delirium and non-postoperative delirium.
In previous studies, the method for detecting slow-wave activity in single-channel EEG (Fp2-Pz) was primarily developed in a postoperative delirium population. Therefore, the validity of this monitoring method has remained relatively uncertain when applied in a non-postoperative delirium population (Ditzel et al., 2022; Hut et al., 2021; Numan et al., 2019; Van Der Kooi et al., 2015). In the current study, the EEG of postoperative and non-postoperative delirium patients could not be distinguished. An important implication of this finding is that both postoperative and non-postoperative delirium can probably be monitored using the same EEG methods. Single-channel EEG (Fp2-Pz) can, therefore, likely be utilized in both patient populations. On a larger scale, these findings may suggest that delirium is one entity, whether it develops postoperatively or not.
An important strength of this study is that our EEG pipeline and RF method classified delirium robustly, independent of the training dataset size. As the delirium classifier was validated in a smaller sample during the sensitivity analysis, our negative finding for the second objective became more robust. Also, incorporating a large number of qEEG features maximized the amount of information gained from the single-channel measurements used. It is important to note, however, that there were differences in patient characteristics, notably ICU admission being skewed towards delirium patients. Since we used supervised ML, this skewness may inadvertently play a role in the classification. It should, however, be mentioned that the current population also presents a clinically representative sample of patients. Another possible limitation of this study is that only single-channel EEG was recorded. We could therefore not include any connectivity measures in the model, as was done in previous studies when classifying delirium with EEG (Numan et al., 2017; Van Dellen et al., 2014). Finally, it is known that EEG studies face limitations due to movement artifacts. Agitated, hyperactive delirium patients in particular can have extensive artifacts hindering analysis, thus requiring exclusion (Boord et al., 2020). This selection could potentially result in a lower-than-average representation of patients with hyperactive delirium in our sample. Further research could explore additional delirium etiologies, as we solely focused on comparing postoperative and non-postoperative delirium etiological groups. Additionally, we recommend external validation in an independent dataset with a large sample size to confirm or refute the findings of this exploratory study (Kent et al., 2020).

Conclusions
Using RF classification, delirium could be distinguished from no delirium, based on single-channel qEEG features, with relatively high performance compared to previous literature. The analysis used yielded possible new qEEG markers for delirium, such as skewness and cross-frequency coupling. When comparing postoperative to non-postoperative delirium, RF performance was similar to chance level, meaning that we have not found evidence that single-channel qEEG features differ between these groups. Our results indicate that single-channel EEG can be used to classify delirium, but we found no evidence for a distinct EEG profile for postoperative delirium compared to delirium with other precipitating factors. These findings may suggest that delirium is one entity, whether it develops postoperatively or not.

Sponsor's Role
This work was supported by European Union Horizon 2020 [grant number 820555]. The sponsor had no role in the study design, data collection, data analysis, data interpretation, writing of the report, or the decision to submit for publication.

Declaration of competing interests
Arjen JC Slooter is a non-salaried advisor for Prolira, a start-up company that has developed an EEG-based delirium monitor. Any (future) profits from EEG-based delirium monitoring will be used for future scientific research only. None of the other authors reports any conflicts of interest. Prolira had no role in the study design, data analysis, data interpretation, or the decision to submit for publication.

Fig. 1. Ten-fold cross-validation method. To evaluate performance of the random forest models, ten-fold cross-validation was used. The dataset (training set) was divided into ten subsets or folds of the same size. Then, the random forest algorithms were trained ten times on 90% of the data (training fold), in an alternating fashion, without replacement of patients between training sets. Each time, the remaining testing fold provided independent performance measures. The resulting performance measures (P1-P10) were then combined, resulting in performance measures for the entire training set (ΣP).

Fig. 2. Participant flowchart. In total, 615 patients were enrolled, of whom 286 were in the Intensive Care Unit (ICU) and 329 were in non-ICU departments. The problem of electromagnetic compatibility disturbance in the electroencephalogram was resolved within the first year of enrollment. N: number of participants.

Fig. 3. Quantitative electroencephalography profiles of the ten most important features of the random forest classifiers, as based on mean decrease in accuracy. A: Radar chart of the most important delirium versus no delirium quantitative electroencephalography (qEEG) features. B: Radar chart of the most important postoperative delirium versus non-postoperative delirium qEEG features. Median values are presented, with axis limits representing the total interquartile range (25th-75th percentile). * p < 0.05, ** p < 0.001, AE: amplitude envelope, CFC: cross-frequency coupling, HVG: horizontal visibility graph, MD: mean degree, MDC: mean decrease in accuracy, POD: postoperative delirium, rel.: relative, s: seconds, TS: time series.

Table 1
Patient characteristics (N = 456). Data are presented as number with percentage (%) or mean with standard deviation. Registered antipsychotics are haloperidol, olanzapine, quetiapine and clozapine. Registered alpha-2 agonists are clonidine and dexmedetomidine. Registered benzodiazepines are midazolam, lorazepam, temazepam, oxazepam and diazepam. Registered opioids are morphine, fentanyl, remifentanil, sufentanil, tramadol, piritramide and oxycodone. Medication data are presented if the medication was administered to at least 10% of the cohort within the 24-hour period preceding the measurement. ICU: intensive care unit, N: number of participants, postop.: postoperative, SD: standard deviation, TIA: transient ischemic attack.

Table 2
Classification outcomes of the random forest models trained to predict delirium presence and postoperative versus non-postoperative delirium.

Table 3
Mean decrease in accuracy values and p-values of the most predictive quantitative electroencephalography features of the delirium classifier.
For both the delirium and postoperative delirium (POD) classifiers, the ten most important quantitative electroencephalography (qEEG) features are presented, as based on relative mean decrease in accuracy (MDC). Adj.: adjusted, AE: amplitude envelope, CFC: cross-frequency coupling, HVG: horizontal visibility graph, MD: mean degree, rel.: relative, s: seconds, TS: time series.