Deep learning for automated epileptiform discharge detection from scalp EEG: A systematic review

Duong Nhu; Mubeen Janmohamed; Ana Antonic-Baker; Piero Perucca; Terence J O’Brien; Amanda K Gilligan; Patrick Kwan; Chang Wei Tan; Levin Kuhlmann

doi:10.1088/1741-2552/ac9644

1. Introduction

1.1. Automated epilepsy monitoring and diagnosis

1.1.1. Machine learning for epilepsy monitoring and diagnosis

Electroencephalography (EEG) is used for a range of clinical indications. These include assessing the presence of interictal epileptiform discharges (IEDs), the characteristic biomarkers of epilepsy. IEDs manifest as spikes or sharp waves, often associated with slow-frequency waveforms that disrupt the normal background [1]. Their durations range from 20 ms up to 10 s [2]. IED detection is time-consuming and challenging because it requires visual review of EEG signals. Automated IED detection dates back to the 1970s, when Gotman and Gloor [3] published the first work analyzing properties, or features, of interictal epileptic activity such as duration, sharpness, and amplitude, and applied a threshold to separate epileptic from background activity. Since then, more complex features [4, 5] and classification approaches have been developed, especially with the development of machine learning [6, 7]. Machine learning (ML) methods have been widely applied to automate epilepsy diagnosis and seizure detection using EEG, with good performance in some settings [8–11]. As a result, such automated methods have been employed in clinical practice [5, 12, 13]. Automated IED detection (figure 1) involves classifying sequential temporal windows or segments of the EEG signal as containing IEDs or normal activity (without IEDs). Predefined features, characterizing distinct aspects of the EEG, are derived from these windows of data and passed to a classifier such as a support vector machine [14], K-nearest neighbor classifier [15], decision tree [16], or artificial neural network [17] to discriminate IEDs from artifacts and background signals. These features can be grouped into three domains: time, frequency, and wavelet [5]. A limitation of many of these approaches is that they were only tested on small datasets (⩽50 patients) and the features applied were often hand-picked. This is likely to result in poor generalizability when the methods are applied to new data from other clinical centers. Improving the generalizability of automated IED methods is an important area of research to increase the clinical acceptance of these methods. An example of software with approval from the United States Food and Drug Administration (FDA) for automated IED detection is Persyst [18], developed by Persyst Corporation. Persyst has been shown to have comparable performance to skilled senior EEG technologists [19] and to be non-inferior to epilepsy-trained clinicians in a study on EEGs of electrical status epilepticus patients recorded during sleep [20]. This method has been found to approach the level of human IED detection performance [21]. Further work is needed to achieve or outperform human-level performance while maintaining generalizability. Along these lines, promising work suggests that human-level performance can be exceeded using deep learning (DL) [22, 23]. As such, this review specifically focuses on the work that has been done applying DL to automated IED detection.

**Figure 1.** Automated IED detection workflow.

1.1.2. Learning from raw data with deep learning

Training on larger datasets improves generalizability but requires a more extensive set of features [24, 25], which makes feature selection cost-intensive. DL methods address this by automatically learning latent features from raw high-dimensional data, optimizing a stack of multiple layers of mathematical functions with thousands to millions of parameters. This complexity comes at the cost of high computational expense. DL has been around for many years, with a breakthrough application of convolutional neural networks (CNNs) to handwritten digit recognition in 1989 [26]. While a CNN's structure mimics the function of the receptive fields in the human visual system, another method, the long short-term memory network, inspired by the human brain's short-term and long-term memory mechanisms, was introduced in 1997 [27]. However, due to the lack of hardware-efficient implementations and the limited computational power available at the time, DL was overlooked by researchers [25]. In the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012), a submission successfully applied a GPU-efficient implementation of CNNs (AlexNet) [28] to classify millions of images. Since then, many DL frameworks have been developed (e.g. TensorFlow in 2015 [29] and PyTorch in 2016 [30]) and widely adopted thanks to their ease of use and efficient execution on both CPU and GPU. As computational resources become cheaper and more accessible, DL methods have emerged as powerful computational methods, approaching human performance in various tasks (e.g. speech processing, image classification, and text analysis) [31–33].

A recent systematic review of the broad applications of DL to EEG analysis [34] reported a rapid increase in the number of papers with promising results across different clinical domains from 2012, especially seizure detection and prediction [35, 36]. These studies analyzed raw EEG signals directly with minimal denoising steps (figure 1). In light of this success, researchers have also experimented with DL on automated IED detection [23, 37, 38]. A recent narrative review of machine learning for IED detection showed that DL methods on raw EEG signals outperformed traditional ML approaches [4]. Given the typical duration of IEDs, short data analysis windows used in DL are either 0.5 s, 1 s, or 2 s [23, 37, 39]. The most common choice of DL architecture among existing works is CNN.

1.2. Objective and organization of this review

Given the rise and success of DL and machine learning applications in general, this systematic review summarizes the steps required to design DL for automated IED detection and provides an insightful overview of the field's current state via an in-depth analysis of methods and performance in recent studies. Existing commercial tools for automated IED detection have shown comparable performance to neurologists in different practice settings [19, 21]. Given these promising results, our review aims to better the understanding of emerging automated software and algorithms in clinical settings. We first describe data collection and analysis methods. Next, we present the systematic characterization of the reviewed papers and critical findings. Based on the results, we formulate recommendations for future research.

2. Methods

The systematic review method used in this study followed a predetermined protocol, registered on PROSPERO (ID: CRD42021257117) and the Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) guidelines [40]. This study was conducted with approval from the Alfred Health Ethics Committee (Project No.: 745/19).

2.1. Search strategy and selection criteria

We searched for peer-reviewed original journal and conference proceedings articles published in the English language between 2012 and 2022 in four databases: PubMed, Scopus, IEEE Xplore, and Embase. The same search strategy was used across all databases, which included the following terms and conditions: (EEG OR electroencephalogram) AND (IED OR 'epileptiform discharges' OR 'spike detection') AND ('deep neural network' OR 'convolutional neural network' OR CNN OR 'long short-term memory' OR LSTM OR 'recurrent neural network' OR RNN OR autoencoder OR gan OR 'generative adversarial network' OR VAE OR 'variational autoencoder' OR 'deep learning'). Appendix S1 contains complete queries for all databases. The databases were last queried on 3 August 2022.

Our review only included studies on the methodological approach of DL to IED detection from scalp EEG recordings from patients with focal epilepsy, generalized epilepsy, or both. We excluded all reviews and work focusing solely on software. The title was first assessed to evaluate the eligibility of a study. If the title alone did not settle eligibility, the abstract was checked against the inclusion and exclusion criteria. The full text of selected studies was read for data extraction. Studies found not to meet the criteria during this process were rejected. Two reviewers (D N and M J) screened each study independently. Disagreements on study inclusion and exclusion were resolved by discussion.

2.2. Data extraction

Data extracted from selected studies were grouped into five categories: (a) data properties, (b) preprocessing methods, (c) design of DL architectures, (d) evaluation metrics and results, and (e) reproducibility. Table 1 summarizes descriptions of items in each category. DL requires a large amount of data, and its performance is driven by data quality. The review of an EEG recording is time-consuming, and so is annotating IEDs. An assessment of data properties, including data size, types of epilepsy, and recording settings, would help better judge the performance results. To understand how noise and long-term EEG recordings are treated, we extracted information on preprocessing methods, including artifact removal, segmentation, and normalization. We also summarized trends from the choice of the neural network architecture and weight regularizations to optimizations. Different machine learning and clinical perspectives of evaluation metrics were extracted and compared. In addition, reproducibility was also assessed as it indicates whether provided details were sufficient to replicate the work for benchmarking in future studies.

Table 1. Descriptions of items in each category extracted from included studies.

Category	Item	Description	Note
Data properties	EEG type	Type of collected EEG recordings (e.g. routine, LTM-VEM).	If a study did not mention this, we assumed it included routine and prolonged EEG. We also assumed epilepsy type to be both focal and generalized if it was not mentioned.
	No. of epileptic and normal EEG	The number of epileptic and normal EEG recordings used in the experiments.
	Number of annotated IEDs	The number of annotated IEDs from collected EEG recordings. As DL requires a large amount of data, this plays a crucial part in assessing the quality of the work.
	Number of clinical centers	The number of clinical centers the EEG sets were acquired from. Evaluation of multiple centers would demonstrate the generalizability of the models.
Preprocessing methods	Artifact removal	Whether any artifact removal methods were applied.
	Window methods	Whether a windowing method is applied and how long the duration is.
	Normalization	Whether any data normalization methods were used (e.g. z-score, min-max).
	Montage and channel selection	The usage of any specific montage and which channels are selected for the computation.
	Data augmentation	Methods of data augmentation to introduce noise and new 'artificial' data examples without having to collect more data.
Design of DL architectures	Architecture	The choice of DL network.
	No. of layers	The number of neural layers in the architecture design.
	Usage of pooling layers	Which types of pooling layers were used.
	Normalization layers	Set of normalization layers used in the models.
	Optimizer	Which optimization methods were used to optimize training parameters.
Evaluation and results	Performance metrics	Set of performance metrics.	If a metric could be inferred from other mentioned metrics, we would include it in the report.
	Cross-validation methods	Set of validation methods (e.g. k-fold).
Reproducibility	Availability of data	Whether the dataset is public or private.
	Availability of source code	Whether the source code is shared publicly.
	DL architecture details	Whether details of the DL architecture provided are enough to replicate the work.

2.3. Quality appraisal

To assess the quality of studies, we used the Critical Appraisal Skills Programme (CASP) tool containing 12 questions to make sense of a Diagnostic Test Study [41]. The tool poses three broad questions: 'Are the results of the study valid?', 'What are the results?', and 'Will the results help locally?'. Two reviewers (D N and M J) assessed each study independently. Any conflicts among answers to these questions were resolved by discussion.

3. Results

3.1. Search and screening results

The electronic search of four databases yielded 66 studies (figure 2). We excluded 32 studies during the title and abstract screen because they focused on intracranial EEG (iEEG), were reviews, addressed seizure detection or prediction, or used non-deep machine learning or traditional statistical methods. We assessed the full text of the remaining 34 studies and excluded a further 11 studies as they focused on either iEEG or seizure detection despite mentioning IEDs. In total, 23 studies were included in our systematic review. Among these, the earliest study was published in 2016 [42]. The number of studies applying DL to IED detection has risen only since 2018 (figure 3). Tables S4 and S5 summarize the extracted data from these studies.

**Figure 3.** Timeline of included studies from 2012 until we last searched.

3.2. Results of quality appraisal with quality assessment

The CASP checklist and quality assessment results are shown in tables S2 and S3, respectively. In section A, 'Yes' was given to all questions across studies except for question 5, meaning that the results were considered valid. For question 5, 'Is the disease status of the tested population clearly described?', we answered 'Yes' only for the 13 studies that reported the type of epilepsy. In section B, 'Confident' was given to question 8, 'How sure are we about the results? Consequences and cost of alternatives performed?', across all studies. In section C, we answered 'cannot tell' to question 9, 'Can the results be applied to your patients/the population of interest?', and question 10, 'Can the test be applied to your patient or population of interest?', for four studies focusing on benign rolandic epilepsy with centrotemporal spikes (BECTS) [43–46], as these are only found in children. In IED detection, sensitivity is the proportion of annotated IED events that the model detects, while precision is the proportion of the events suggested by the model, all of which the neurologists would have to review, that are true IEDs. For question 11, 'Were all outcomes important to the individual or population considered?', we answered 'Yes' to seven studies [37, 39, 44, 47–50] that included one of F1, AUCPR, or false positive rate per minute, and 'No' to the rest. F1 conveys the balance of precision and sensitivity (it is their harmonic mean). AUCPR measures the area under the precision-recall curve across different probability thresholds. The false positive rate per minute indicates how many falsely classified samples per minute the clinicians would have to review. Question 12 asks 'What would be the impact of using this test on your patients/population?'. As only three studies [51–53] included a public dataset of IEDs, we answered 'Speed up review time' for these and 'Uncertain as no public dataset was evaluated' for other studies. Testing on a public dataset is essential for benchmarking and reproducibility in DL research: it ensures that results are replicable and allows direct comparisons among studies. The Discussion section provides further information on answers to questions 11 and 12.
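
The window-level metrics discussed under question 11 can be illustrated with a short sketch. The function name `detection_metrics` and the example counts are illustrative assumptions and are not taken from any included study.

```python
def detection_metrics(tp, fp, fn, recording_minutes):
    """Precision, sensitivity, F1 and false positives per minute.

    tp, fp, fn are hypothetical counts of true positive, false positive
    and false negative detections; recording_minutes is the total
    duration of the reviewed EEG in minutes.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + sensitivity
    # F1 is the harmonic mean of precision and sensitivity.
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "precision": precision,
        "sensitivity": sensitivity,
        "f1": f1,
        "fp_per_min": fp / recording_minutes,
    }


# Made-up example: 80 correct detections, 20 false alarms, 20 missed IEDs
# in a 40 min recording.
metrics = detection_metrics(tp=80, fp=20, fn=20, recording_minutes=40)
```

With these made-up counts, precision, sensitivity, and F1 are all 0.8, and the reviewer would face 0.5 false detections per minute.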

3.3. Data properties

DL typically requires large data to achieve high pattern classification accuracy. To summarize the data size in each study, we extracted three categories: type of epilepsy and EEG recordings, number of EEG recordings and annotated IEDs, and number of clinical centers.

3.3.1. Type of epilepsy and EEG recordings

If a study did not explicitly mention the type of epilepsy, we assumed it included both focal and generalized epilepsies (9 studies [23, 42, 48, 50–52, 54–56]) (figure 4). There were two studies focusing only on idiopathic generalized epilepsy [38, 39]. Detecting IEDs from patients with focal epilepsy was the main goal of three studies [43, 44, 54], with four of them focusing on EEGs of patients with BECTS [43–46].

**Figure 4.** Number of studies by type of EEG recording settings.

Across the included studies, six EEG recording settings were used: routine, prolonged, long-term video-EEG (LTM-EEG), ambulatory, ICU, and EEG-fMRI (figure 4). A routine EEG recording typically takes 20–30 min. Other EEG types have longer durations (up to 3 d or longer) and a higher chance of capturing IEDs than routine EEG. LTM-EEG combines continuous in-patient EEG and video monitoring over several days to identify and characterize interictal and ictal abnormalities for diagnostic, prognostic, and surgical localization purposes. One study [37] recorded EEG inside an fMRI scanner; we still included it because the training of the DL model only involved scalp EEG signals. While the most popular data type was routine EEG (N = 9), eight studies [23, 45, 46, 49, 51, 53, 54, 57] combined at least two different EEG recording types.

3.3.2. Number of EEG recordings and annotated IEDs

The median number of total EEGs among studies is 166 (IQR: 110–518). The median number of EEGs with IEDs is 156 (IQR: 26–344). Twelve studies [23, 38, 47, 48, 50, 52, 54–59] included normal EEG recordings (without IEDs) (median: 106; IQR: 67–496). Only two studies had more than 1000 recordings in total [23, 52]. Moreover, we looked at the ratio of EEGs with and without IEDs to assess the balance of datasets. While most studies had balanced sets of EEGs, one study had a highly imbalanced ratio of EEGs with IEDs to EEGs without IEDs, 1:10 [23].

Deep learning models for IED detection are trained on window segments from whole EEG recordings. These windows are classified as with or without IEDs. Therefore, the number of annotated IEDs is crucial to the training process. We only included numbers in the studies and ignored datasets whose statistics were not mentioned. The median of annotated IEDs is 11 631 (IQR: 2663–16 402). There were 11 studies containing more than 10 000 annotated IEDs [23, 44–47, 49, 51–53, 55, 60]. The highest number of IEDs is 19 057 [52].

3.3.3. Number of clinical centers

Training DL models on data from a single center may limit the generalizability of models when applied in alternative settings/hospitals, given differences in equipment and protocols. Only six studies [38, 47–49, 52, 58] used multi-center data (median: 3 centers), with a maximum of six centers [52].

3.3.4. Overview of datasets

Deep learning requires large and well-annotated datasets, often referred to as big data, for high performance [25, 61]. However, there is no indication of how large the dataset should be. To obtain an overall picture of big data in IED detection, we summarized the largest public and private datasets with at least 10 000 annotated IEDs in table 2. We recommend that readers refer to table S4 for details of datasets in all included studies. The Temple University Hospital (TUH) EEG corpus [51], released in 2016, is the largest known public EEG corpus, consisting of approximately 23 000 EEG recordings and 31 000 annotated IEDs. Temple University Events (TUEV), a subset of this corpus, contains labels for the different kinds of IEDs and artifacts (19 057 annotated IEDs) and was used in only one study [51]. Apart from this, Temple University Epilepsy (TUEP) is another subset, used for whole-recording classification and patient classification [52]. This dataset has 100 subjects with epilepsy and 100 subjects without epilepsy. TUEV and TUEP were divided into training and evaluation sets such that no patient contributed data to both sets.

Table 2. Summary of the largest public and private datasets.

Study/dataset	Total number of EEG recordings	Number of IEDs	Number of clinical centers	Is it public?	Description
Temple University Events [51]	518	19 057	1	Yes	Public dataset with the largest number of IEDs
Temple University Epilepsy [62]	561	Not available	1	Yes	Public dataset with the largest number of EEG recordings
Jing et al [23]	9571	13 262	1	No	Private data set with the highest number of EEG recordings
Thomas et al [52]	2729	18 164 (only the number for the MGH dataset was provided)	6	No	Dataset with the largest number of clinical centers
Thomas et al [55]	156	18 164	1	No
Wei et al [49]	15	17 586	1	No	Private BECTS dataset with the largest number of IEDs
Thomas et al [48]	820	14 541	3	No
Prasanth et al [47]	828	14 170	4	No

Among private datasets in studies, the largest EEG dataset reported was from the Massachusetts General Hospital (MGH) [23], consisting of 9571 EEG recordings. The private dataset with the highest number of annotated IEDs was acquired from the same hospital, with 18 164 annotations [52, 55].

3.4. Preprocessing methods

3.4.1. Artifact removal

EEG is usually contaminated with artifacts. These can be ocular, muscular, electrode, or power line noise, and occasionally resemble an IED on visual review. Removing artifacts before training any predictive model might help achieve better accuracy. All studies applied artifact removal methods in their preprocessing step. The most common methods were band-pass filtering and applying montages. The high-pass cutoff frequencies chosen were 0.5 Hz (N = 10) and 1 Hz (N = 5). Muscle artifacts often occur at high frequencies and can be removed with low-pass filtering. Low-pass filtering at 70 Hz was used in one paper [39]. Another solution was to restrict the frequency range to the delta, theta, alpha, and beta sub-bands. For this, upper cutoffs of 30 Hz (N = 2), 32 Hz (N = 1), and 35 Hz (N = 3) were used. Power line noise occurs at 50 Hz in Europe and 60 Hz in the United States of America. Notch filtering at these frequencies can remove it [48] (N = 3). Low-pass filters at 50 Hz (N = 2) and 49 Hz (N = 1) were also employed to remove power line noise. Apart from band-pass filtering, the PureEEG algorithm [63] was used in one study [58] to remove artifacts based on a stochastic, spatio-temporal model using a linear minimum mean square error estimator.

There were five choices of montage in studies: temporal central parasagittal (TCP), common average (CA), longitudinal bipolar (LB), source derivation (SD), and Laplacian. There were 12 studies [38, 48, 50–59] employing at least one montage in the experiment. CA and LB were combined in one study [23]. On the other hand, LB and Laplacian were studied separately and showed comparable results [54]. TCP [64] was the main montage in the TUH corpus [62] and is a combination of longitudinal and transverse montages focusing on focal regions of the scalp. Nhu et al [59] studied longitudinal and transverse montages and showed that the combination of these two had the best performance.
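
As an illustration of how two of these montages re-reference the signal, the sketch below derives a common average montage and one chain of a longitudinal bipolar montage. The dictionary layout, the function names, and the choice of the left temporal chain (10–20 electrode names) are assumptions made for this example, not the implementation of any included study.

```python
def common_average(eeg):
    """Re-reference each channel to the mean of all channels.

    eeg: dict mapping channel name -> list of samples (equal lengths).
    """
    names = list(eeg)
    n_samples = len(eeg[names[0]])
    mean = [sum(eeg[ch][t] for ch in names) / len(names)
            for t in range(n_samples)]
    return {ch: [eeg[ch][t] - mean[t] for t in range(n_samples)]
            for ch in names}


# One chain of a longitudinal bipolar montage (left temporal, 10-20 names).
LEFT_TEMPORAL_CHAIN = [("Fp1", "F7"), ("F7", "T3"), ("T3", "T5"), ("T5", "O1")]


def bipolar(eeg, pairs):
    """Derive bipolar channels as differences between electrode pairs."""
    return {f"{a}-{b}": [x - y for x, y in zip(eeg[a], eeg[b])]
            for a, b in pairs}


# Toy two-sample recording with made-up voltages.
eeg = {"Fp1": [1.0, 2.0], "F7": [3.0, 4.0], "T3": [5.0, 6.0],
       "T5": [7.0, 8.0], "O1": [9.0, 10.0]}
ca = common_average(eeg)
lb = bipolar(eeg, LEFT_TEMPORAL_CHAIN)
```

Both derivations cancel activity that is common to the electrodes involved, which is one reason montages are listed here alongside filtering as an artifact-reduction step.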

3.4.2. Window methods

Processing the whole EEG recording at once is resource-intensive. A solution to this is to segment the recording into smaller windows. These can overlap or not overlap. Segmenting the data into windows also assists the time-resolved detection of events in the data. Based on the typical duration of IEDs, the choices for window size were 0.1 s (N = 1), 0.5 s (N = 4), 1 s (N = 6), and 2 s (N = 9).
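
The segmentation step can be sketched as follows. The function name `sliding_windows`, the 256 Hz sampling rate, and the 50% overlap are illustrative assumptions; only the window durations come from the studies summarized above.

```python
def sliding_windows(n_samples, fs, window_s, overlap=0.0):
    """Return (start, stop) sample indices of fixed-length windows.

    n_samples: length of the recording in samples
    fs:        sampling rate in Hz
    window_s:  window duration in seconds (e.g. 0.5, 1 or 2)
    overlap:   fraction of each window shared with the next, in [0, 1)
    """
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    size = int(round(window_s * fs))
    step = max(1, int(round(size * (1.0 - overlap))))
    # Trailing samples that do not fill a whole window are dropped.
    return [(start, start + size)
            for start in range(0, n_samples - size + 1, step)]


# A 10 s recording at an assumed 256 Hz, cut into 2 s windows.
non_overlapping = sliding_windows(2560, fs=256, window_s=2.0)
half_overlapping = sliding_windows(2560, fs=256, window_s=2.0, overlap=0.5)
```

In this toy case the non-overlapping scheme yields 5 windows and the half-overlapping scheme yields 9, so overlap increases the number of training windows without collecting more data.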

3.4.3. Channel selection

Channel selection focuses the analysis on a subset of the recorded channels and reduces the computational resources required for analysis. EEG experts select the subset based on the recording settings, historical data, and experience [65]. The 10–20 system was the most popular recording setting (N = 22). A 25-electrode array was used in one study [58]. We did not observe any notable channel selection methods apart from the exclusion of the two ear electrodes in 10 studies [23, 38, 42, 47, 48, 52, 54–56, 59].

3.4.4. Input normalization

Normalization of input data might reduce the training time and improve the performance of the classification system [66]. Two methods were used: z-score normalization (N = 6) [38, 39, 45, 46, 53, 59] and maxima normalization (N = 1) [49]. z-score normalization subtracts the mean from the inputs and divides the result by the standard deviation, giving inputs with zero mean and unit standard deviation. Maxima normalization rescales the input to between 0 and 1 by subtracting the minimum value and dividing the result by the maximum value. These were applied to the windows before the training of the DL model.
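
A minimal sketch of per-window normalization is given below. The function names are hypothetical, and `min_max` implements the standard min-max form (dividing by the range), which is an assumption and may differ in detail from the maxima normalization used in [49].

```python
from statistics import mean, pstdev


def zscore(window):
    """Rescale a window of samples to zero mean and unit standard deviation."""
    mu = mean(window)
    sigma = pstdev(window)
    if sigma == 0:
        # A flat window carries no amplitude information.
        return [0.0 for _ in window]
    return [(x - mu) / sigma for x in window]


def min_max(window):
    """Rescale a window of samples to the range [0, 1]."""
    lo, hi = min(window), max(window)
    if hi == lo:
        return [0.0 for _ in window]
    return [(x - lo) / (hi - lo) for x in window]


z = zscore([1.0, 2.0, 3.0])
scaled = min_max([2.0, 4.0, 6.0])  # [0.0, 0.5, 1.0]
```

Normalizing each window separately removes amplitude differences between patients and recordings, at the price of discarding absolute voltage, which is one of the features used in classical IED detection.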

3.4.5. Data augmentation

Data augmentation is a popular technique in DL, especially in image classification, to tackle overfitting [67] by introducing noise and transforming the data to increase the data size. Six studies [23, 37, 38, 49, 50, 58] used data augmentation on multi-channel data. Among these, two studies [37, 38] reordered channels by their Pearson correlation with a randomly chosen reference channel so that the classification system would learn to be invariant to the order of input channels. Another method is to create jittered signals by randomly translating windows with IEDs by 0.1 s at the beginning and end of a window [23]. Unsupervised Data Augmentation (UDA) [68] was also used in one study to randomly perturb amplitude levels and the order of channels [58]. Synthetic Minority Oversampling Technique (SMOTE) [69] was also employed to synthesize IED samples by interpolating between neighboring samples and was shown to improve performance in two studies [44, 57].
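
The interpolation idea behind SMOTE can be sketched in a few lines. This is a simplified variant written for illustration: it always interpolates toward the single nearest neighbor, whereas SMOTE [69] chooses among several nearest neighbors. The function name `smote_like` and the toy feature vectors are assumptions.

```python
import random


def smote_like(samples, n_new, rng=None):
    """Create synthetic minority samples by interpolation.

    samples: list of equal-length feature vectors from the minority class
    n_new:   number of synthetic samples to generate
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to interpolate")
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(samples)
        # Nearest other sample by squared Euclidean distance.
        neighbor = min(
            (s for s in samples if s is not base),
            key=lambda s: sum((a - b) ** 2 for a, b in zip(base, s)),
        )
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbor)])
    return synthetic


# Toy minority class with three 2-D feature vectors.
minority = [[0.0, 0.0], [1.0, 1.0], [10.0, 10.0]]
new_samples = smote_like(minority, n_new=5)
```

Every synthetic sample lies on the line segment between two real minority samples, so the method adds variety without leaving the region already occupied by the minority class.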

3.5. Design of DL architectures

3.5.1. Architecture

There were six DL architectures in the included studies: convolutional neural network (CNN), long short-term memory (LSTM), combinations of CNN and LSTM or bidirectional LSTM, combinations of CNN and gated recurrent unit (GRU), graph convolutional network (GCN), and a hybrid method. All of them used supervised learning, as they involved training on labels annotated by experts. The frequencies of these variants across studies are depicted in figure 5, which includes overlap as several studies explored multiple architectures.

**Figure 5.** Number of studies using different DL architectures. N is the number of studies including the overlap if a study explored more than one architecture.

CNNs have shown great performance in computer vision and were the most popular method, appearing in 14 studies. A CNN is designed to process grid-like data by optimizing filters or kernels with the convolution operation. In other words, the network applies matrix multiplications on partial data iteratively, as illustrated in figure 6. In IED detection, the EEG signal is considered as a multivariate or univariate time series (EEG channels as features) and passed to a one dimensional (1D) CNN (N = 10) or two dimensional (2D) CNN (N = 9). In the traditional 2D CNN in image classification, the inputs are 3D matrices (width × height × features), and the kernels slide along the width and height dimensions. Temporal or 1D CNN uses causal convolutions and dilations [70], sliding along the time dimension and extracting temporal features from time series. These two types of CNN achieved comparable performance [47]. In terms of 2D CNN, seven studies [23, 37, 39, 47, 49, 54, 55] used convolution strides with a first dimension of 1, which means they only learned features channel by channel. When the inputs are single-channel, the maximum probability output across all channels is used as the output of the epoch or window. This approach was applied to outputs from 1D CNN in two papers [48, 52]. The 2D CNNs proposed in three studies [50, 56, 58] used a kernel size of 3 × 3. Jing et al employed a 2D CNN which started with a temporal or 1D convolutional layer followed by 2D convolutional layers with increasing filter sizes, from 8 to 128, as the network became deeper [23].
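
Two of the operations described above, sliding a kernel along the time dimension and taking the maximum probability across channels, can be sketched without a DL framework. The function names, kernel values, and probabilities are illustrative assumptions; in a trained network the kernel weights are learned rather than fixed.

```python
def conv1d(signal, kernel):
    """Valid 1D convolution as used in DL frameworks (cross-correlation).

    The kernel slides along the time axis with stride 1 and no padding.
    """
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def window_probability(channel_probs):
    """Aggregate per-channel IED probabilities into one window-level output."""
    return max(channel_probs)


# A hand-picked difference kernel applied to a toy single-channel signal.
features = conv1d([1, 2, 3, 4], [1, 0, -1])

# Made-up per-channel outputs of a single-channel classifier.
p_window = window_probability([0.1, 0.9, 0.3])
```

The maximum over channels flags a window as soon as any one channel looks epileptiform, which suits focal discharges that appear on only a few electrodes.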

**Figure 6.** 1D CNN and 2D CNN for IED detection.

Existing 2D CNN architectures from computer vision were also employed: Fast Region-based Convolutional Network (Fast R-CNN) [58] and VGG [50, 56]. The 1D residual network from time series classification (ResNet-TSC) was tested in one study [38]. ResNet-TSC [71] stacks multiple convolutional layers with residual (skip) connections so that gradients do not vanish as the network becomes deeper. Fast R-CNN [72] learns local and global features from different regions of image-like inputs. VGG [73], developed by the Visual Geometry Group at the University of Oxford, consists of 16–19 weight layers.

LSTM was explored in five studies [43, 44, 46, 53, 54]. LSTM mimics the behavior of long and short-term memory in the human brain by choosing which information from previous time steps to carry to the next steps with the help of a forget gating mechanism. The GRU cell is similar to the LSTM cell but has fewer parameters, as it has no output gate. LSTM and GRU were applied to whole windows in which the voltage values of the channels at each time step were used as features (figure 7). Bidirectional LSTM was used in one study [44] to extract features of a sequence in both forward and backward directions. CNNs were also combined with LSTM or GRU by stacking convolutional layers with an LSTM cell [54] or a GRU cell [49, 57]. However, GRU and LSTM suffer from high computational costs and exploding gradient problems [74]. This drawback explains the dominance of CNN in automated IED detection, especially with rectified linear unit (ReLU) activation. ReLU allows for faster convergence by setting negative inputs to 0, which introduces sparsity and avoids the gradient issues of the activation functions in LSTM.

**Figure 7.** LSTM and GRU for IED detection.

In the GCN work [59], a montage was viewed as an undirected graph in which each electrode was a node and the linkage between a pair of electrodes was an edge. A stack of Chebyshev convolutional layers [75] was then applied to estimate the Fourier transform of the graph signals. This particular GCN architecture also applied an embedding layer consisting of a 1D convolutional layer to extract temporal features from each electrode. The global features of the graph were computed as the sum of features from all electrodes.

A hybrid method, combining stochastic processes with DL, was tested with the Temple University EEG Events dataset [51]. This method first extracted a sequence of features: discrete Fourier transform, energy, discrete cosine transform, mel-frequency cepstral coefficients [76], and derivatives. These features were then further processed in three steps: a hidden Markov model for sequential decoding of EEG events, a stacked denoising autoencoder for temporal and spatial analysis, and finally, a statistical language model for final classification.

3.5.2. Number of layers

Deep learning networks consist of stacks of layers. The number of layers and hidden units is often referred to as depth and width, respectively. Carefully balancing these would lead to an increase in performance [77]. We counted all layers, including pooling, dropout, batch normalization, and final classification layers. The number of layers in included studies ranged from 4 to 31 layers (median: 9; IQR: 5–21). Only one study mentioned the number of hidden units of LSTM layers (200 units) [44]. In terms of convolutional layers, the number of filters, starting from 16, 32, or 64, increased linearly with the depth.

3.5.3. Pooling, normalization, and dropout layers

Pooling, normalization, and dropout layers are popular in CNNs. Pooling layers reduce the size of inputs. There were two types of pooling layers seen in six studies [23, 37, 38, 55, 56, 58], max-pooling and strided convolution. While max-pooling keeps maximum values of the feature dimensions within non-overlapping moving kernels, strided convolution learns which features to keep by weighting them. When these were applied, the dimensions were reduced by half, and the number of convolution filters was subsequently doubled. Dropout layers [78] help increase the generalizability of the models by randomly setting hidden units (neurons within the layers that are not input or output layers of the network) to zero during the training process and were applied before the last fully connected layer in four studies [37, 38, 48, 52]. There were two studies [52, 56] applying dropout after each LSTM layer. Spatial dropout [79] was used in one study [59], which randomly zeroed features of some timesteps. Training DL models on the entire dataset is computationally expensive. As such, iterating through mini-batches of samples is a strategy for faster training speed. To overcome the statistical variance in batches, four studies [23, 38, 57, 60] used batch normalization to learn how to re-center and re-scale the distribution of inputs to internal layers within every batch [80]. Layer normalization was used along with a graph convolutional network in one study [59] to normalize the features of each electrode.
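
Max-pooling and batch normalization can be sketched for a single feature as follows. The function names and default values are assumptions; in practice `gamma` and `beta` are learned during training, and DL frameworks additionally track running statistics for use at inference time, which this sketch omits.

```python
def max_pool1d(x, size=2):
    """Keep the maximum of each non-overlapping block of `size` values."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Re-center and re-scale one feature across a mini-batch.

    batch: values of a single feature, one per sample in the mini-batch
    gamma, beta: scale and shift (learned in a real network)
    """
    mu = sum(batch) / len(batch)
    var = sum((v - mu) ** 2 for v in batch) / len(batch)
    return [gamma * (v - mu) / (var + eps) ** 0.5 + beta for v in batch]


pooled = max_pool1d([1, 3, 2, 5, 4, 0])  # [3, 5, 4]
normed = batch_norm([1.0, 2.0, 3.0])
```

With a block size of 2, pooling halves the length of the feature map, which matches the halving of dimensions described above.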

3.5.4. Tackling dataset imbalance

It is common in included studies for windows without IEDs to significantly outnumber windows with IEDs. The ratio of windows with IEDs to windows without IEDs could be up to 1:1000 [47]. The most popular strategy to address this is oversampling the minority class (i.e. the IED windows) or balancing the mini-batch by sampling the same number of samples from each class in every batch. One study implemented focal loss [38], which applies a modulating term to the original cross-entropy loss, preventing the vast number of negative (i.e. non-IED) samples from forcing the classifier always to predict the negative class. Learning a projection of the inputs into another space has been widely examined in the DL community. Following this, one study [37] utilized the triplet loss [32] to maximize the distance between the projections of IED and normal samples.
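The effect of the modulating term and of balanced mini-batches can be sketched as follows. This is a hedged illustration under stated assumptions, not the code of the cited studies: the focusing parameter `gamma = 2.0`, the function names, and the toy index lists are all hypothetical.

```python
import math
import random


def cross_entropy(p, y):
    """Binary cross-entropy for one window; p is the predicted IED probability in (0, 1)."""
    p_t = p if y == 1 else 1.0 - p  # probability assigned to the true class
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0):
    """Cross-entropy scaled by (1 - p_t)^gamma, which shrinks the loss of easy samples."""
    p_t = p if y == 1 else 1.0 - p
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def balanced_batch(ied_indices, normal_indices, batch_size, rng):
    """Draw the same number of IED and non-IED windows for one mini-batch."""
    half = batch_size // 2
    ied = [rng.choice(ied_indices) for _ in range(half)]  # with replacement: oversampling
    normal = rng.sample(normal_indices, half)             # without replacement
    return ied + normal


# An easy negative (true label 0, predicted IED probability 0.1) is strongly down-weighted,
# while a hard positive (true label 1, predicted IED probability 0.1) keeps most of its loss.
easy_ratio = focal_loss(0.1, 0) / cross_entropy(0.1, 0)
hard_ratio = focal_loss(0.1, 1) / cross_entropy(0.1, 1)

batch = balanced_batch([0, 1], list(range(2, 2002)), batch_size=8, rng=random.Random(0))
```

In this toy case the easy negative retains about 1% of its cross-entropy loss and the hard positive about 81%, so the many confidently classified non-IED windows contribute little to the gradient.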

3.5.5. Optimization

In IED detection, the cross-entropy loss measures the difference between the probability distributions of the predicted and true labels (in this case, a window is labeled as either IED or normal). The value of the loss function increases when the two distributions diverge from each other. Minimizing the loss is equivalent to reducing the error between the predicted and true labels. It is achieved by optimizing the neural network weights with gradient methods via backpropagation on the training data. The choice of optimizer affects the training time and performance. Only nine studies mentioned the selection of the optimizer. The traditional stochastic gradient descent (SGD) [81] was implemented in two studies [39, 42]. The most popular choice (N = 7) was the adaptive moment estimation optimizer (Adam) [82]. Adam uses adaptive learning rates to deal with sparse gradients and non-stationary objectives, allowing faster convergence.
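The difference between the two optimizers can be made concrete with a single-weight sketch. The update rules follow the standard published formulations. The hyperparameter values are commonly used defaults assumed here for illustration and are not taken from the included studies.

```python
import math


def sgd_step(w, grad, lr=1e-3):
    """Plain gradient descent: the step is proportional to the gradient."""
    return w - lr * grad


def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single weight; t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad         # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad  # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# First step from w = 1.0 with a small and a large gradient.
w_small, _, _ = adam_step(1.0, grad=0.5, m=0.0, v=0.0, t=1)
w_large, _, _ = adam_step(1.0, grad=50.0, m=0.0, v=0.0, t=1)
w_sgd_small = sgd_step(1.0, grad=0.5)
w_sgd_large = sgd_step(1.0, grad=50.0)
```

On the first step, Adam moves the weight by approximately the learning rate regardless of the gradient magnitude, whereas the SGD step grows a hundredfold with the larger gradient. This per-weight normalization is what the term adaptive learning rate refers to.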

3.6. Evaluation metrics and results

This section extracted evaluation metrics for three categories: IED detection, EEG classification, and patient classification (defined below). Only three studies [51–53] included a public dataset on IED, making direct comparisons among studies challenging.

3.6.1. IED detection performance

IED detection is the task of classifying a window of EEG signal as either containing or not containing an IED. A variety of metrics were reported among studies (figure 8). The most common metric was AUC, the area under the receiver operating characteristic curve, which plots the true positive rate (TPR) against the false-positive rate (FPR) at different thresholds of the probability outputs, where TPR = TP/(TP + FN) and FPR = FP/(FP + TN). True positives (TP) are the correctly classified windows containing IEDs, and false positives (FP) are the windows without IEDs incorrectly classified as containing IEDs. True negatives (TN) are the correctly classified windows without IEDs, and false negatives (FN) are the windows containing IEDs incorrectly classified as not containing them. The median AUC score was 0.94 (IQR: 0.94–0.96), and the highest AUC was 0.99 [47]. Figure 9 compares mean AUC scores by DL architecture. Sensitivity was another popular metric, measuring how many annotated IEDs were detected (median: 0.91; IQR: 0.89–0.93). By contrast, specificity measures the performance in classifying windows without IEDs, i.e. the true negatives (median: 0.93; IQR: 0.89–0.99). Balanced accuracy (BAC) is the average of sensitivity and specificity (median: 0.84; IQR: 0.77–0.93). We also plotted the boxplots of the values of the most common metrics in figure 10. From the boxplots, we identified four outliers: the GCN of Nhu et al [59], the hybrid model of Golmohammadi [51], and the 1D CNN and 1D CNN-LSTM of Tjepkema-Cloostermans et al [54]. The GCN of Nhu et al had the lowest F1 and sensitivity scores. The hybrid model of Golmohammadi had the lowest specificity and BAC scores. The AUC scores of the 1D CNN and 1D CNN-LSTM of Tjepkema-Cloostermans et al were the lowest. In addition, the average AUC score across all studies with multiple clinical centers was 0.93 [52].
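As a minimal sketch of the window-level metrics defined above, the following plain-Python functions compute the confusion counts, sensitivity, specificity, BAC, and AUC. The labels and scores are toy values invented for illustration. AUC is computed through its rank interpretation (the probability that a randomly chosen IED window receives a higher score than a randomly chosen non-IED window), which is equivalent to the area under the TPR versus FPR curve.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels, where 1 means the window contains an IED."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison; ties count as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: two IED windows and four non-IED windows (hypothetical scores).
y_true = [1, 1, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # threshold of 0.5 is an assumption

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
sensitivity = tp / (tp + fn)           # TPR
specificity = tn / (tn + fp)           # 1 - FPR
bac = (sensitivity + specificity) / 2  # balanced accuracy
auc_score = auc(y_true, scores)
```

Sensitivity, specificity, and BAC depend on the chosen threshold, whereas AUC summarizes performance across all thresholds.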

**Figure 8.** Number of studies by metrics. N is the number of studies.

**Figure 9.** Mean AUC by architecture. N is the number of studies reporting AUC for a given architecture.

**Figure 10.** Boxplots of common metrics among existing studies.

False positives per minute (FP/minute) were measured in eight studies [37, 39, 44, 47–49, 57, 59] with the lowest value of 0.23 [39]. Precision is the proportion of the events an algorithm labels as IEDs that are true detections, i.e. TP/(TP + FP). The F1 score conveys the balance of sensitivity and precision and was used in three studies [44, 47, 49]. Another metric measuring the relationship between these is AUPRC, which computes the area under the precision-recall curve [47, 48].
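A short sketch of these detection-oriented metrics is given below. The counts and the recording length are hypothetical values chosen for illustration, and the function names are not taken from any included study.

```python
def precision(tp, fp):
    """Fraction of detections that are true IEDs."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and sensitivity, written in terms of counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0


def false_positives_per_minute(fp, recording_seconds):
    """Number of false detections normalized by the recording duration."""
    return fp / (recording_seconds / 60.0)


# Hypothetical 20 min recording with 100 annotated IEDs.
tp, fp, fn = 90, 30, 10
prec = precision(tp, fp)
f1 = f1_score(tp, fp, fn)
fp_rate = false_positives_per_minute(fp, recording_seconds=20 * 60)
```

In this example the detector finds 90% of the annotated IEDs, but one in four detections is false, which corresponds to 1.5 false detections for every minute of EEG the clinician reviews.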

Cross-validation (CV) is an evaluation method that divides the dataset into different subsets for training and testing. CV across multiple subsets gives an overview of the generalizability of models. K-fold cross-validation, employed in six studies [23, 38, 43, 47, 48, 52], splits the data into k groups, each of which is used in turn for testing. The number of folds was 3, 5, or 10. Leave-One-Patient-Out testing is another approach in which the data of each patient are held out in turn for testing while the model is trained on the remaining patients; it is useful when the number of patients is relatively small. When the number of centers is large enough, Leave-One-Institution-Out, in which the data from one center are held out in turn, might be used to test the generalizability of the models across different clinical settings [52]. The study that included the most clinical centers (six) [52] reported a mean Leave-One-Institution-Out AUC of 0.83 and a mean Leave-One-Subject-Out AUC of 0.82. It used the same CNN design as Prasanth et al [47]. These AUC scores are lower than the median; however, other studies did not perform either of the two cross-validation strategies.
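Leave-One-Patient-Out and Leave-One-Institution-Out differ only in the grouping variable, so both can be sketched with one generator. The function name and the center labels below are hypothetical.

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices) for each distinct group.

    `groups` holds one label per window, e.g. a patient ID or a clinical center.
    """
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, test


# Six windows recorded at three hypothetical centers.
centers = ["A", "A", "B", "C", "C", "C"]
folds = list(leave_one_group_out(centers))
```

Because every window of the held-out group is excluded from training, the test score reflects performance on an unseen patient or center rather than on unseen windows from a familiar source.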

3.6.2. EEG and patient classification performance

Whole EEG classification involves categorizing an EEG recording as normal or epileptic and usually leverages the automated detection of IEDs. Similarly, patient classification distinguishes patients with epilepsy from those without epilepsy. There were two approaches to extracting features for an EEG recording. A common method is to consider an EEG recording epileptic if it contains detected spikes. The other approach is to compute detected spikes per minute as a feature [52] and choose a threshold to maximize the performance. Features across all EEG recordings of a patient are then used for patient classification [52]. Only five studies [23, 38, 47, 52, 59] evaluated the performance of whole EEG classification, and two studies reported patient classification; the best AUC scores were 0.85 [23] and 0.90 [52], respectively.
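The second approach can be sketched as follows, assuming non-overlapping windows of fixed duration so that the recording length equals the number of windows times the window duration. The probability threshold of 0.5, the rate threshold, and the function names are illustrative assumptions and are not values reported in [52].

```python
def spikes_per_minute(window_probs, window_seconds, prob_threshold=0.5):
    """Detected IED windows per minute of recording (non-overlapping windows assumed)."""
    detections = sum(1 for p in window_probs if p >= prob_threshold)
    minutes = len(window_probs) * window_seconds / 60.0
    return detections / minutes


def classify_eeg(window_probs, window_seconds, rate_threshold, prob_threshold=0.5):
    """Label a recording as epileptic when its detection rate reaches the threshold."""
    return spikes_per_minute(window_probs, window_seconds, prob_threshold) >= rate_threshold


# Hypothetical 2 min recording split into 1 s windows, 6 of which look like IEDs.
probs = [0.9] * 6 + [0.1] * 114
rate = spikes_per_minute(probs, window_seconds=1.0)
is_epileptic = classify_eeg(probs, window_seconds=1.0, rate_threshold=1.0)
```

In practice the rate threshold would be tuned on training data, as described above, rather than fixed in advance.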

3.7. Reproducibility

Details of the CNN-based studies were sufficient to replicate. By contrast, studies either partly provided or completely excluded the design of LSTM models (i.e. number of hidden units, number of layers, etc). In addition, none of the studies shared the source code. Another perspective of reproducibility is the availability of the datasets. The majority of the datasets were private, with only 3 studies [51–53] experimenting with a public dataset.

4. Discussion

4.1. Data properties

The majority of studies included both types of epilepsy, with routine EEG as the most popular recording setting. Details of the labeling protocol were missing in the majority of studies. The labels of the TUEV dataset can be found in the original study [62], where one channel or a set of channels was annotated with one of three labels: spike and sharp wave, generalized periodic epileptiform discharges, or periodic lateralized epileptiform discharges. Nhu et al [59] estimated the start and end of an annotation at onset or just preceding the earliest spike/slow-wave or rhythmic change and after the end of the estimated spike/wave rhythmic activity. This lack of information makes it challenging to compare the data quality and explain the difference in performance on different datasets. Moreover, there was no indication of what constitutes sufficient data. We observed no statistically significant correlation between the number of annotated IEDs and AUC scores of CNN models (Pearson correlation = −0.41, p = 0.22).

4.2. Design of DL architectures

CNN was the most popular choice of DL methods with various CNN architectures. A crucial factor in the design of DL is the depth of the model. The greater the depth is, the more complex the model is. Figure 11 depicts the relationship between the AUC score and the number of layers and parameters from CNN models in studies. The number of parameters for each study was estimated from the provided number of filters, kernel size, numbers of neurons in the fully connected layers, and numbers of layers. The results of the calculation are provided in table S5. There was no correlation between the network depth of CNN and AUC scores. The highest AUC score of 0.99 was achieved with only seven layers [47], while the deepest network of 31 layers only achieved an AUC of 0.91 [38]. The networks with smaller numbers of parameters can achieve comparable performance to the larger ones. The VGG architecture has 138 million parameters and had an AUC score of 0.95 [50], which was similar to that of the smaller network with 7000 parameters [48]. These comparisons might be affected by the lack of evaluation of data from multiple clinical centers and preprocessing or optimization methods. Nevertheless, as these models would be deployed at clinical centers, the computational costs should be considered. We hope this analysis will make computer scientists rethink the network design and the efficiency of the DL models.
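The kind of parameter estimate described above can be sketched from the standard formulas for convolutional and fully connected layers. The layer sizes below describe a hypothetical small network and are not taken from table S5 or from any included study.

```python
def conv1d_params(in_channels, out_channels, kernel_size, bias=True):
    """Weights of a 1D convolution: one kernel per (input, output) channel pair, plus biases."""
    return (in_channels * kernel_size + (1 if bias else 0)) * out_channels


def dense_params(in_features, out_features, bias=True):
    """Weights of a fully connected layer, plus one bias per output unit."""
    return (in_features + (1 if bias else 0)) * out_features


# Hypothetical network: 19 EEG channels -> 16 filters -> 32 filters -> 2 output classes.
layer_counts = [
    conv1d_params(19, 16, kernel_size=5),  # (19 * 5 + 1) * 16 = 1536
    conv1d_params(16, 32, kernel_size=5),  # (16 * 5 + 1) * 32 = 2592
    dense_params(32, 2),                   # (32 + 1) * 2 = 66
]
total_params = sum(layer_counts)
```

Pooling and dropout layers add no trainable parameters, so they increase the layer count without changing this total.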

**Figure 11.** AUC of IED detection versus the number of DL layers and number of parameters from CNN models in studies.

4.3. Preprocessing methods

Window durations of 1 s and 2 s were the most common choices. Bandpass filtering was the most popular choice for artifact removal; however, only one study analyzed the misclassified windows and showed that these were primarily ocular artifacts [38]. Further investigation of misclassified windows might help identify additional EEG preprocessing steps that are required. Data augmentations have been extensively examined in the DL field and have improved the generalizability of models [67, 68]. Lourenco et al [56] employed a 2D CNN architecture, VGG, and achieved a sensitivity score of 0.79. Wei et al [49] trained the same architecture on the same dataset but with data augmentation and achieved a significantly higher sensitivity of 0.99. The fast R-CNN method [46], which used Unsupervised Data Augmentation with perturbations of noise and amplitude and hemispheric flipping, had a lower sensitivity of 0.89. However, in general, augmentations of EEG signals did not show performance improvement for IED detection, as the mean AUC scores of the studies with and without these were the same (mean AUC = 0.94). Further investigations are still needed to confirm this and find an effective set of augmentation methods for EEG signals.
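The perturbations named above (noise, amplitude changes, and hemispheric flipping) can be illustrated with a simple sketch. It does not reproduce the augmentation pipeline of any included study. The dictionary layout of a window, the list of left and right electrode pairs, and the parameter values are assumptions made for illustration.

```python
import random

# Hypothetical subset of left/right electrode pairs from the 10-20 system.
LEFT_RIGHT_PAIRS = [("Fp1", "Fp2"), ("F3", "F4"), ("C3", "C4"), ("P3", "P4"), ("O1", "O2")]


def scale_amplitude(window, factor):
    """Multiply every sample of every channel by a constant factor."""
    return {ch: [x * factor for x in samples] for ch, samples in window.items()}


def add_noise(window, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation `sigma` to every sample."""
    return {ch: [x + rng.gauss(0.0, sigma) for x in samples] for ch, samples in window.items()}


def flip_hemispheres(window, pairs=LEFT_RIGHT_PAIRS):
    """Swap the signals of left and right electrodes; midline channels are unchanged."""
    flipped = dict(window)
    for left, right in pairs:
        if left in window and right in window:
            flipped[left], flipped[right] = window[right], window[left]
    return flipped


# Toy window: channel name -> samples (hypothetical values).
window = {"F3": [1.0, 2.0], "F4": [3.0, 4.0], "Cz": [5.0, 6.0]}
scaled = scale_amplitude(window, 1.5)
noisy = add_noise(window, sigma=0.1, rng=random.Random(0))
flipped = flip_hemispheres(window)
```

Each function returns a new window and leaves the input untouched, so several augmented copies can be generated from one annotated window.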

4.4. Evaluation and metrics

Deep learning for IED detection has yielded promising results with high AUC scores in both focal and generalized epilepsy. However, public datasets as reference datasets were not the main focus of studies. Performance was reported inconsistently, with variations in the means of measurement, such as the average metrics of k-fold cross-validation or the metrics on an independent test set. These make direct comparisons among studies challenging. In addition, the AUC score of window-based classification in the study with the largest number of clinical centers (six centers) was 0.83 [52], which was significantly lower than the median reported AUC of 0.94. This decrease in performance indicates that testing on multiple datasets acquired from different clinical centers is needed to evaluate the model's generalizability.

An important measure of clinical usefulness is the number of false positives clinicians need to review while maintaining high sensitivity. Excessive false positives would make the EEG review tedious and more time-consuming for clinicians, given the need to navigate and review all suggestions by the model. As the window-based dataset is highly imbalanced (the ratio of windows with IEDs to normal windows could be up to 1:1000), the denominator of the FPR in the calculation of AUC will be dominated by the true negatives (TN). A model might be good at classifying normal windows and have a high sensitivity, resulting in a high AUC score, but still have a high ratio of FP compared to the number of correctly predicted IED events (true positives, TP). In practice, a clinician needs to inspect every event that an algorithm flags as an IED. For example, if fewer than 50% of these detections are true detections, a clinician will waste time trying to find events useful for diagnosis. AUC alone does not reflect this and might provide an inaccurate perspective of performance. Instead, precision could be used in addition to AUC to measure the percentage of detections that are true IED events. Useful metrics that embody the balance of precision and sensitivity are F1, AUPRC, and FP/minute when sensitivity is high. Limited reporting of these clinically relevant metrics was noted in the majority of studies.
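The argument above can be checked with a small calculation. The sensitivity of 0.90 and specificity of 0.99 are hypothetical operating points chosen for illustration, and the prevalence corresponds to the 1:1000 ratio mentioned in the text.

```python
def precision_from_rates(sensitivity, specificity, prevalence):
    """Expected precision given the fraction of windows that contain IEDs."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# One IED window per 1000 normal windows.
prevalence = 1.0 / 1001.0
fpr = 1.0 - 0.99
expected_precision = precision_from_rates(0.90, 0.99, prevalence)

# The same classifier on a balanced dataset, for comparison.
balanced_precision = precision_from_rates(0.90, 0.99, 0.5)
```

With a false-positive rate of only 1%, the expected precision is about 0.08 under the 1:1000 imbalance, so roughly 11 of every 12 detections would be false, compared with a precision of about 0.99 on balanced data. The ROC curve of the classifier is the same in both settings, which is why AUC alone cannot reveal the difference.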

4.5. Reproducibility

With the growth of different DL frameworks, source code sharing plays an important role in benchmarking and extending the existing work to new datasets in the DL research community. Despite the missing information on optimizer choice, details of the neural networks in included studies are sufficient to replicate in any DL framework; however, none of the included studies made their source code available for external testing. Moreover, the majority of studies used private datasets. As shown above, there is no guarantee that the same results would be achieved in another dataset.

4.6. Benchmark or gold standard

We have shown a wide variety of preprocessing methods and designs of DL architecture. Identifying a benchmark or gold standard, if there is one, will help the field advance in the right direction. In the machine learning and DL community, benchmarks are typically performed on public datasets. This ensures the actual results are replicable without the effect of statistical variances among datasets and allows for direct comparisons. For example, in time series classification, the UCR archive [83], consisting of 128 public time-series datasets, has been widely used as the benchmark for state-of-the-art models [84, 85]. Similarly, the performance on the ImageNet dataset [28, 86] has been an active benchmark for image classification. The studies experimenting on the public TUEV dataset might be considered benchmarks [51, 53]. We also observed that the performance dropped when data from multiple clinical centers were involved. Therefore, no single benchmark or gold standard can be clearly identified among existing work. This raises a question for the research community on public access to existing private datasets of IEDs. Opening up more data across clinical centers will help to ensure the generalizability of automated IED detection methods.

5. Recommendations

Our systematic review showed that DL could automatically extract useful features from EEG for IED detection with promising performance. We also identified the limitations of existing studies. Based on our data analysis, we propose the following recommendations for future directions in the field of automated IED detection:

To improve the reproducibility of future research, we encourage sharing source code publicly to help researchers with different levels of technical skills study and reproduce the results.
IED annotating protocols used in a study should be clearly described, including the location of annotation markers relative to the start and end of discharges and the duration of annotations. These details would help future researchers and clinicians understand the properties of the data and design a consistent annotation process.
Future studies should include data from multiple clinical centers with the Leave-One-Institution-Out cross-validation strategy. This will increase the diversity of data and provide stronger evidence for the generalizability of the models.
As the TUEV dataset from Temple University is the only public dataset observed among included studies, we suggest that this serves as the standard dataset for benchmarking models until other public datasets become available. This would align with other fields of DL where researchers have been actively benchmarking proposed models on public datasets [28, 83, 87].
F1, false positives per minute, and AUPRC should be reported along with AUC to assess performance in a more clinically relevant and complete manner.
Future work should study misclassified windows to identify the limitations of proposed methods.

6. Limitations

Our systematic review does not provide an overview of the standard format of IED annotation as the included studies did not provide this information. Due to the lack of studies on specific conditions or waveforms, we did not compare the performance in different epilepsy and IED types. Meta-analysis on the reported performance metrics was omitted as the measurements varied, and the most common metric was AUC which had limitations as mentioned above.

7. Conclusions

In this systematic review, we highlighted the current trends and successes and identified limitations of DL for automated IED detection. Based on our analysis, we provided recommendations for future directions to advance the field. This paper also aims to serve as a guide for clinical, computer science, and engineering researchers interested in DL and epilepsy.

Acknowledgments

D N is supported by the Graduate Research Industry Scholarship (GRIP) at Monash University, Australia. P P is supported by the National Health and Medical Research Council (APP1163708), the Epilepsy Foundation, The University of Melbourne, Monash University, Brain Australia, Norman Beischer Medical Research Foundation, and the Weary Dunlop Medical Research Foundation. P K is supported by a Medical Research Future Fund Practitioner Fellowship (MRF1136427). M J is supported by the Monash RTP Stipend Scholarship. L K is supported by the National Health and Medical Research Council (GNT1183119 and GNT1160815) and the Epilepsy Foundation of America.

Data availability statement

The data that support the findings of this study are available upon reasonable request from the authors.

Conflict of interest

Outside the submitted work, PP has received speaker honoraria or consultancy fees to his institution from Chiesi, Eisai, LivaNova, Novartis, Sun Pharma, Supernus, and UCB Pharma. He is an Associate Editor for Epilepsia Open.

Ethical publication statement

We confirm that we have read the Journal's position on issues involved in ethical publication and affirm that this report is consistent with those guidelines.