Efficient use of clinical EEG data for deep learning in epilepsy

OBJECTIVE
Automating detection of Interictal Epileptiform Discharges (IEDs) in electroencephalogram (EEG) recordings can reduce the time spent on visual analysis for the diagnosis of epilepsy. Deep learning has shown potential for this purpose, but the scarcity of expert-annotated data creates a bottleneck in the process.


METHODS
We used EEGs from 50 patients with focal epilepsy, 49 patients with generalized epilepsy (IEDs were visually labeled by experts) and 67 controls. The data was filtered, downsampled and cut into two-second epochs. We increased the number of input samples containing IEDs through temporal shifting and the use of different montages. A VGG C convolutional neural network was trained to detect IEDs.


RESULTS
Using the dataset with more samples, we reduced the false positive rate from 2.11 to 0.73 detections per minute at the intersection of sensitivity and specificity. Sensitivity increased from 63% to 96% at 99% specificity. The model became less sensitive to the position of the IED within the epoch and to the montage.


CONCLUSIONS
Temporal shifting and the use of different EEG montages improve the performance of deep neural networks in IED detection.


SIGNIFICANCE
Dataset augmentation can reduce the need for expert annotation, facilitating the training of neural networks, potentially leading to a fundamental shift in EEG analysis.


Introduction
Interictal epileptiform discharges (IEDs) are transient patterns that reflect an increased likelihood of epileptic seizures (Pillai and Sperling, 2006; Smith, 2005). IEDs are present in about half of the routine EEG recordings of epilepsy patients, rising to 80% in sleep recordings (Halford, 2009). Visual analysis of EEG signals by experts is currently the gold standard in IED detection (Lodder et al., 2014), but this approach entails several disadvantages. The learning curve is long, review times are significant and specialized personnel are required. Furthermore, intra- and inter-rater variability leads to error rates of up to 25% (Benbadis and Lin, 2008).
Automating IED detection can reduce the resources spent on visual analysis, the time to diagnosis and the misdiagnosis rate. Several approaches have been developed for this purpose. Most are based on 'pre-chosen' features, which may limit the algorithm's ability to learn how to detect these transients and could explain the moderate performance of these algorithms. More recently, end-to-end deep learning approaches have been used (Tjepkema-Cloostermans et al., 2018; Lourenço et al., 2020; Johansen et al., 2016; Jing et al., 2019), which can learn their own representation of the feature space from raw data (LeCun et al., 2015).
Convolutional neural networks (CNNs) have been able to accurately detect IEDs and several approaches have been explored. A 3-layer CNN trained on raw data achieved an area under the Receiver Operating Characteristic curve (AUC) of 0.95 (Johansen et al., 2016). A two-dimensional CNN led to an AUC of 0.94, with 47.4% sensitivity at 98.0% specificity (Tjepkema-Cloostermans et al., 2018). A known CNN architecture from the literature (VGG) was also used for IED detection, achieving an AUC of 0.96 (Lourenço et al., 2020). Another architecture (SpikeNet) was able to outperform commercially available software (Persyst 13 (Scheuer et al., 2017)) in IED detection (Jing et al., 2019). The authors were also able to train the network to classify whole EEGs with an AUC of 0.85.
Given the high number of trainable parameters, large volumes of data are needed to train deep neural networks appropriately. For supervised learning, this data must be labeled so that the network has a way of assessing its errors and iteratively improving its performance, getting closer to the gold standard (i.e. the labels provided by experts) (Taylor and Nitschke, 2017; Perez and Wang, 2017). This means that every IED in an EEG recording should be labeled by experts (preferably several, to establish consensus, given the high variability), which limits the availability of such data. In turn, this creates a bottleneck in the development of generalizable algorithms of this type. Increasing the number of input samples using augmentation techniques or temporal shifting can potentially improve performance.
Data augmentation techniques use existing samples to create new ones, aiming to improve the accuracy or robustness of a classifier. When working with images, cropping, padding, flipping and other transformations are often used (Taylor and Nitschke, 2017; Perez and Wang, 2017). However, given that EEGs are time-series, many of these approaches are not applicable, as they would interfere with the temporal component of the signal. The spatial component of the EEG (i.e. the channels) is also relevant in visual analysis and possibly in the training of the classifier. Since experts can use several montages to identify IEDs, we hypothesize that a transformation of the channel order could contribute to a more robust classifier.
Temporal shifting, i.e. shifting the acquisition window of the samples in time, increases the number of samples available for training and can be applied to EEG signals without loss of temporal continuity, adding novel temporal context to the IEDs.
We build on previous work (Lourenço et al., 2020) and explore the impact of applying temporal shifting and using data in different montages to increase the number of IED samples used in the training process of a convolutional neural network.

EEG data and pre-processing
We used EEG data from 166 patients, randomly selected from the digital database of the Medisch Spectrum Twente, in the Netherlands (Lourenço et al., 2020). All EEGs were obtained as part of routine care, and anonymized before analysis. As EEG is part of routine care, the Medical Ethical Committee Twente waived the need for informed consent for continuous EEG monitoring. Interictal EEGs (with IEDs) from patients with focal (50 patients) and generalized (49 patients) epilepsy were included, along with normal EEGs from 67 healthy controls. The IEDs were visually labeled by experts (MvP and MTC).
EEG data was filtered in the 0.5-30 Hz range and downsampled to 125 Hz, aiming to reduce artefacts and data dimensionality. The signals were split into 2 s non-overlapping epochs. These steps were implemented in Matlab R2019a (The MathWorks, Inc., Natick, MA).
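The preprocessing above was implemented in Matlab; an equivalent sketch in Python is shown below. The filter order and the zero-phase Butterworth design are assumptions for illustration, as the paper does not specify the filter characteristics.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess(raw, fs):
    """Band-pass filter (0.5-30 Hz) and downsample to 125 Hz.

    raw: (channels, samples) array at sampling rate fs (a multiple of 125 Hz).
    """
    # Zero-phase 4th-order Butterworth band-pass (order is an assumption)
    b, a = butter(4, [0.5, 30.0], btype="bandpass", fs=fs)
    filtered = filtfilt(b, a, raw, axis=-1)
    # Keep every k-th sample; the 30 Hz low-pass already suppresses
    # content above the new Nyquist frequency (62.5 Hz)
    k = int(fs // 125)
    return filtered[..., ::k]

def make_epochs(signal, fs=125, epoch_s=2.0):
    """Split a (channels, samples) record into non-overlapping 2 s epochs."""
    n = int(fs * epoch_s)
    n_epochs = signal.shape[-1] // n
    trimmed = signal[..., : n_epochs * n]
    return trimmed.reshape(signal.shape[0], n_epochs, n).swapaxes(0, 1)
```

For a 10 s, 18-channel recording at 250 Hz this yields five epochs of shape (18, 250).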

Dataset creation
Four different datasets were created using the original and augmented data (see Table 1). Set A comprised the aforementioned data, using the longitudinal bipolar montage (DB). Set B included the same data in three different montages: longitudinal bipolar, source derivation (SD) and common average (G19). Since the last two montages include 19 channels instead of the 18 present in the DB montage, the last channel of the SD and G19 samples was discarded. This led to a threefold multiplication of the number of samples (since the same transformation was applied to IED and non-IED samples). Fig. 1 shows an example of an epoch in the three montages. New samples containing IEDs were generated by shifting the time window of the epoch by 0.5 s, 1 s and 1.5 s, as can be seen in Fig. 2. This resulted in a fourfold multiplication of the number of samples with IEDs, which were used in Set C. To create Set D, data was time-shifted in each montage (DB, SD and G19).
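The two augmentations can be sketched in Python as follows. The common-average re-referencing corresponds to the G19 montage; the shift direction, boundary handling and function names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def common_average(signal):
    # Re-reference each channel to the average of all channels (G19-style)
    return signal - signal.mean(axis=0, keepdims=True)

def shifted_ied_epochs(record, ied_starts, fs=125, epoch_s=2.0,
                       shifts_s=(0.0, 0.5, 1.0, 1.5)):
    """Cut one epoch per shift for every labeled IED (fourfold, as in Set C).

    record: (channels, samples) continuous EEG; ied_starts: sample indices
    where the expert labels begin. Boundary handling is an assumption.
    """
    n = int(fs * epoch_s)
    out = []
    for start in ied_starts:
        for s in shifts_s:
            begin = start - int(s * fs)          # move the window back in time
            if 0 <= begin and begin + n <= record.shape[1]:
                out.append(record[:, begin:begin + n])
    return np.stack(out)
```

Applying both transformations, as in Set D, multiplies the number of IED samples twelvefold relative to Set A.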
Data was randomized and an 80/20 split into a training/validation and test set was applied. All epochs from a particular patient were used either for training or testing. Fivefold cross-validation was applied on the training/validation set.
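A patient-wise split of this kind can be sketched as below; the grouping procedure beyond the stated constraint (no patient appears on both sides of the split) is an assumption.

```python
import random
from collections import defaultdict

def patient_wise_split(epoch_patient_ids, test_fraction=0.2, seed=0):
    """80/20 split at the patient level: all epochs of a given patient
    end up on the same side of the split."""
    by_patient = defaultdict(list)
    for idx, pid in enumerate(epoch_patient_ids):
        by_patient[pid].append(idx)
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)
    n_test = max(1, round(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])
    train_idx = [i for p in patients[n_test:] for i in by_patient[p]]
    test_idx = [i for p in test_patients for i in by_patient[p]]
    return train_idx, test_idx
```

Splitting at the patient level prevents leakage of patient-specific EEG characteristics from the training set into the test set.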

Deep learning models
A VGG C convolutional neural network (see Supplementary Material, Supplementary Fig. S1) was implemented in Python 3.4 using Keras 2.1.2, Tensorflow 1.4.0 and a CUDA-enabled NVIDIA GPU (GTX-1080), running on CentOS 7. Stochastic optimization was performed using an Adam optimizer (Kingma and Ba, 2014) with a learning rate of 2 × 10⁻⁵, β₁ = 0.91, β₂ = 0.999 and ε = 10⁻⁸. A sparse categorical cross entropy function was employed to estimate the loss. A batch size of 64 and class weights of 100:1 were used (100 corresponding to the positive class, i.e. samples with IEDs) for sets A through C. Technical details can be found in the Supplementary Material (Technical aspects regarding the implementation of the algorithm).
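In Keras, such a weighting is typically passed as `class_weight` to `model.fit`; the effective per-batch objective can be sketched in plain numpy. The helper name and two-class logit layout below are illustrative, not from the paper's code.

```python
import numpy as np

def weighted_sparse_crossentropy(y_true, logits, weights=(1.0, 100.0)):
    """Sparse categorical cross-entropy with per-class weights
    (weights[1] = 100 for the IED class, matching the 100:1 setting).

    y_true: (N,) integer labels in {0, 1}; logits: (N, 2) unnormalized scores.
    """
    # Numerically stable softmax
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    # Negative log-likelihood of the true class, scaled by its class weight
    nll = -np.log(probs[np.arange(len(y_true)), y_true])
    w = np.asarray(weights)[y_true]
    return float((w * nll).mean())
```

With uniform logits, a misrepresented IED sample contributes 100 times the loss of a background sample, counteracting the heavy class imbalance.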

Performance evaluation
Model performance was evaluated based on the average Receiver Operating Characteristic (ROC) curves, built for all cross-validation sets using 1001 threshold discretizations. The corresponding area under the curve (AUC) and 95% Confidence Intervals (CIs) were calculated. Sensitivity, specificity and false positive detection rates per minute were assessed at the intersection of sensitivity and specificity and at 99% specificity. These routines were implemented in Matlab R2019a (The MathWorks, Inc., Natick, MA).
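An equivalent evaluation can be sketched in Python (the paper's routines were implemented in Matlab; the function name and return layout below are illustrative):

```python
import numpy as np

def roc_metrics(labels, scores, n_thresholds=1001):
    """Sensitivity/specificity over a 1001-point threshold grid.

    Returns the sensitivity-specificity intersection, the best sensitivity
    at >= 99% specificity, and the trapezoidal AUC.
    """
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    thresholds = np.linspace(0.0, 1.0, n_thresholds)
    sens = np.array([(scores[labels] >= t).mean() for t in thresholds])
    spec = np.array([(scores[~labels] < t).mean() for t in thresholds])
    # Operating point where sensitivity and specificity are closest
    i = np.argmin(np.abs(sens - spec))
    # Best sensitivity among thresholds reaching 99% specificity
    at_99 = spec >= 0.99
    sens_at_99_spec = sens[at_99].max() if at_99.any() else 0.0
    # Trapezoidal area under the ROC curve (sensitivity vs. 1 - specificity)
    fpr = 1.0 - spec
    order = np.argsort(fpr)
    f, s = fpr[order], sens[order]
    auc = float(np.sum(np.diff(f) * (s[1:] + s[:-1]) / 2.0))
    return float(sens[i]), float(spec[i]), float(sens_at_99_spec), auc
```

Note that with a finite threshold grid the intersection may not be exact, as observed for Set B; the closest operating point is reported instead.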

Occlusion technique
Occlusion is a network visualization technique used to assess the relative importance of each part of the input sample in the network's classification (Zeiler and Fergus, 2014). Typically, the sample is divided into small sections, each of which is iteratively occluded. The difference between the prediction for the full sample and the prediction for the sample with the occluded section is calculated. When plotting this difference, warmer colours indicate larger difference values, i.e. higher importance in classification. More details can be found in (Lourenço et al., 2020).
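The procedure can be sketched as follows; the section length and the zero fill value are assumptions for illustration, and `predict` stands in for the trained network's probability output.

```python
import numpy as np

def occlusion_map(predict, sample, section=25):
    """Importance of each temporal section of an epoch: the drop in the
    model's IED probability when that section is zeroed out.

    predict: callable mapping a (channels, samples) array to a scalar score.
    """
    baseline = predict(sample)
    n_sections = sample.shape[1] // section
    importance = np.zeros(n_sections)
    for i in range(n_sections):
        occluded = sample.copy()
        occluded[:, i * section:(i + 1) * section] = 0.0  # occlude one section
        importance[i] = baseline - predict(occluded)
    return importance  # plot with a warm colormap: larger = more important
```

A section whose occlusion barely changes the prediction receives an importance near zero; sections the model relies on produce large drops.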

Results
The VGG C architecture was trained with the four different datasets. With the original dataset (Set A), the intersection between sensitivity and specificity occurred at 93% with a false positive rate of 2.11 (0.85-3.38) false detections per minute. With data from different montages (Set B), using 1001 thresholds, it was not possible to completely match sensitivity and specificity. The closest values for these parameters were 90% and 93%, respectively. Using time-shifting (Set C), the intersection was at 96% and combining the two techniques (Set D), it was possible to achieve an intersection at 97%. The corresponding false positive rate for Set D was 0.73 (0.00-1.54) false detections per minute.
At 99% specificity, the sensitivity improved from 63% for the original dataset (Set A) to 96% for the augmented dataset (Set D, using both time shifting and different montages). In Table 2, we summarize performance for the various datasets. Fig. 3 shows the ROC curve for the models trained with Set A and Set D, with the corresponding areas under the curve (AUC) of 0.95 and 0.98.
Focusing on the classification of EEGs from individual patients of the test set, performance could be assessed on a per-patient basis (Supplementary Tables S2 and S3). Figs. 4 and 5 illustrate the results of occlusion on a sample containing a focal IED, before and after temporal shifting by one second, classified by the models trained with Sets A and D. Fig. 4 shows that, before shifting, there is an isolated red patch, associated with a correct classification by the model trained with Set A, whereas the second panel shows a more diffuse red area and a consequent misclassification of the sample.
In Fig. 5, the same sample shown in Fig. 4 was both time-shifted and re-referenced three times, followed by classification by the model trained on Set D. It is possible to see isolated red patches on all panels, regardless of channel order or the position of the IED, and all versions of the sample were correctly classified by the network.

Discussion
We studied the effect of temporal shifting and the use of different montages to increase the number of samples used to train a VGG C network for IED detection.
Datasets B through D, which included data obtained by shifting the acquisition window or changing the montage (or both), led to a decrease in the false positive rate at the intersection of sensitivity and specificity when compared to the model trained with Set A. This shows that more training samples lead to an improvement of the model's performance, in accordance with previous findings (Taylor and Nitschke, 2017; Perez and Wang, 2017). Furthermore, the models trained with Sets C and D also led to a higher sensitivity at 99% specificity when compared to Set A.
The model trained with Set C (time-shifted data) leads to a lower false positive rate (1.15 versus 1.91 false detections per minute) and a higher sensitivity at 99% specificity (81.11% versus 74.20%) when compared to the one trained with different montages (Set B). This is likely due to the difference in difficulty of the task itself. Since Set B contains samples with different channel orders, the model needs to learn to 'see' what IEDs look like in each of them. With Set C, this is not necessary, as all the epochs are in the DB montage and only the position of the IEDs changes.
The model trained with Set D was able to outperform the other three models in false positive rate, intersection of sensitivity and specificity, and sensitivity at 99% specificity (cf. Table 2). While the AUC value for this model was also higher (0.98) (cf. Fig. 3), this improvement was less pronounced than in the previously mentioned variables, since the AUC of the model trained with Set A was already quite high (0.95).
The improvements of the model trained with Set D were also shown in the classification of EEGs from individual patients. This model achieved a higher average specificity when classifying EEG samples without IEDs and it detected all the IEDs present in 17 out of the 20 epilepsy patients of the test set, while the model trained with Set A was only able to reach 100% sensitivity in 5 patients. Considering patients with focal and generalized epilepsy separately, the average sensitivity of IED detection increased for both types of epilepsy. Focal IEDs were detected with 35% more sensitivity, while the increase for generalized discharges was 15%. This is most likely due to the higher subtlety and complexity of focal discharges, which are typically more difficult to recognize. While the model trained with Set A was already quite satisfactory at detecting generalized discharges (sensitivity of 83.74%), it struggled with focal IEDs (sensitivity of 54.74%). This meant that there was more room for improvement, as noted by the results obtained when training the model with Set D.

Table 2. Average sensitivity (Sens), specificity (Spec) and false positives per minute (FP/min) for the VGG trained with the different datasets, with 95% Confidence Intervals. The results at the intersection of sensitivity and specificity are shown, along with the corresponding false positive (FP) rate, followed by the sensitivity at 99% specificity for each set. Set A: original dataset; Set B: augmented dataset using three montages; Set C: augmented dataset using temporal shifting; Set D: augmented dataset using both different montages and temporal shifting.

C. da Silva Lourenço, M.C. Tjepkema-Cloostermans and Michel J.A.M. van Putten, Clinical Neurophysiology 132 (2021) 1234-1240
Our results are also on par with other CNN-based approaches described in the literature. Reported AUCs of 0.95 (Johansen et al., 2016) and 0.94, obtained using a subset of Set A for training (Tjepkema-Cloostermans et al., 2018), do not surpass the 0.98 obtained with the VGG trained on Set D. Persyst 13 is a commercially available software package for IED detection (Scheuer et al., 2017). This algorithm achieved 43.9% sensitivity and 1.65 false detections per minute. Despite the use of a vastly different dataset, our networks (in particular the one trained with Set D) far surpass this sensitivity and also lead to lower false positive rates.
The occlusion technique was used to elucidate which parts of the input are apparently relevant in the network's classification process. As shown in Fig. 4, it is possible to see that the model trained with Set A had a bias towards the position of the IED within the sample, as it was able to correctly identify the transient in the beginning of the sample but not after it was shifted in time, positioning the IED in the last second of the epoch. This is due to the way the experts labeled the IEDs: they identified the transient and tended to start its label roughly 0.5 s before the IED itself. Therefore, many of the training samples had this configuration, leading the network to learn this bias from the experts.
By shifting the data in time, we create samples that do not follow this trend, increasing the variety of the training data. Retraining the VGG with these samples rendered the network insensitive to the location of the IED, making the model more invariant and increasing its discriminative power. Adding re-referenced samples further improved the flexibility of the network by forcing it to 'look' at how IEDs are represented in different channel orders.
Combining these two methods and creating Set D led to a larger increase in the number of samples (there was a twelve-fold increase in samples containing IEDs when compared to Set A) but also in the variety of the samples, as two different approaches were used. This model was able to eliminate the position bias seen in the model trained with Set A. Furthermore, it was possible to confirm that the network learned to identify IED patterns in the three montages used in training, regardless of their position (cf Fig. 5).
This improved version of the model is more suitable for clinical application, given its generalization ability and satisfactory classification performance. After training with more varied samples, obtained through temporal shifting and re-referencing, the network is able to identify IEDs in any of the three montages used for training, regardless of their position within the sample. Such a model can be used as an assistive tool for IED detection by clinicians. Using a graphical interface such as an EEG viewer, signals can be fed to the algorithm and the output can be sorted according to the probability of an epoch containing an IED. Showing the samples with the highest probability first, the expert can go through as many as necessary to assess whether the patterns are relevant and sufficient to diagnose the patient.
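The triage step of this workflow, sorting epochs by predicted IED probability so that the reviewer sees the most suspicious segments first, can be sketched in a few lines (the function name is illustrative; the viewer integration itself is beyond the scope of this sketch):

```python
def rank_epochs_for_review(epoch_ids, ied_probabilities):
    """Order epoch identifiers by descending IED probability so that the
    most suspicious segments are presented to the reviewer first."""
    ranked = sorted(zip(epoch_ids, ied_probabilities),
                    key=lambda pair: pair[1], reverse=True)
    return [epoch_id for epoch_id, _ in ranked]
```

The reviewer can then stop once consecutive epochs fall below a probability of clinical interest, instead of scanning the whole recording.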
This would lead to a significant decrease in the time needed to analyse EEGs from prospective epilepsy patients, as experts would be able to 'jump' to potentially relevant segments instead of scanning the whole signal. Logically, the relevance of this type of tool increases when longer EEG recordings are considered, since the analysis of these signals is proportionally more time consuming. Furthermore, situations where the marking of spikes is necessary over an entire recording, such as EEG analysis in presurgical evaluation for epilepsy surgery, can also benefit from this algorithm. In these cases, after applying the neural network, the clinician would only need to review the classification, which would also lead to workload reduction.
Since the aim is to use the neural network as an assistive tool, it would still be up to the clinician to judge whether the detected spikes are relevant and correspond to IEDs. This is important since it is possible that the EEG recording includes 'epileptiform variants' (i.e. patterns that look like IEDs but are not significant for the diagnosis, such as wicket waves or small sharp spikes) that can be disregarded by the human reviewer.

Conclusion
We show that increasing the number of samples in the training set of a neural network through time-shifting and the use of different montages increases the discriminative power for IED detection and makes the model more generalizable. Combining these two strategies lowered the false positive rate and increased sensitivity when compared to the separate use of time-shifting and re-referencing. Furthermore, it was possible to eliminate a bias in the algorithm towards the location of the IED, as shown by occlusion. Using these techniques to multiply the available samples severalfold can reduce the bottleneck created by the scarcity of expert-annotated data.

Declaration of Competing Interest
M.J.A.M. van Putten is co-founder of Clinical Science Systems, a supplier of EEG systems for Medisch Spectrum Twente. Clinical Science Systems offered no funding and was not involved in the design, execution, analysis, interpretation or publication of the study. The remaining authors have disclosed that they do not have any conflicts of interest.