Machine Learning for detection of Interictal Epileptiform Discharges

The electroencephalogram (EEG) is a fundamental tool in the diagnosis and classification of epilepsy. In particular, Interictal Epileptiform Discharges (IEDs) reflect an increased likelihood of seizures and are routinely assessed by visual analysis of the EEG. Visual assessment is, however, time consuming and prone to subjectivity, leading to a high misdiagnosis rate and motivating the development of automated approaches. Research towards automating IED detection started 45 years ago. Approaches range from mimetic methods to deep learning techniques. We review different approaches to IED detection, discussing their performance and limitations. Traditional machine learning and deep learning methods have yielded the best results so far and their application in the field is still growing. Standardization of datasets and outcome measures is necessary to compare models more objectively and decide which should be implemented in a clinical setting.


Introduction
Electroencephalograms (EEGs) are widely used in the diagnosis and classification of epilepsy, a brain disease that entails a predisposition for epileptic seizures (Fisher et al., 2014; Lopes da Silva et al., 2003; Tatum et al., 2018). However, most seizures occur sporadically and at unknown times, so the availability of ictal EEGs is scarce. Alternatively, interictal EEGs are used in diagnosis. Half of the routine EEG recordings from epilepsy patients include interictal epileptiform discharges (IEDs), with this number rising to 80% in sleep recordings (de Curtis et al., 2012; Tatum et al., 2018).
IEDs are epileptiform transients that indicate an increased likelihood of seizures (de Curtis et al., 2012), appearing mainly as spikes and sharp waves. Several examples of IEDs are shown in Figure 1.

Figure 1
The gold standard for IED detection is visual analysis by experts (Lodder et al., 2014). This approach has several drawbacks, including a long learning curve and extensive analysis time, especially for long recordings. Further, human error, subjectivity, and intra- and inter-observer variability result in misdiagnosis rates of up to 30%, leading to lack of treatment or to prescription of medication with potentially harmful side effects (Lodder et al., 2014; Nowack, 1997).
This motivates the development of computer-assisted IED detection with algorithms that match or outperform experts, reducing the time and resources spent on visual analysis, as well as the misdiagnosis rate. Automating IED detection is not trivial due to the oversimplified textbook definitions of these transients, the similarities between IEDs and other transients, and variations in discharge morphology between patients (Wilson and Emerson, 2002).
Still, extensive research has been carried out aiming to automate the detection of IEDs. We review the evolution of the types of algorithms in the literature, their performance and their limitations. We used the search terms 'automated interictal epileptiform discharge detection' and 'automated interictal spike detection' in Scopus. Papers written in English, based on human scalp EEG and describing an algorithm aimed at detecting IEDs, dating from the beginning of this field (1976) up to March 2020, were selected.

EEG Data
Several EEG datasets including samples with IEDs have been used for the development and evaluation of automated IED detection. Some datasets are unequivocally too small to encompass the diversity of IED shapes necessary for a robust, generalizable algorithm.
Examples of this are approaches based on data from one patient (Latka et al., 2003;Tzallas et al., 2006b).
Although most authors used private datasets, impeding direct comparison of results across publications, some works were based on the same publicly available EEGs (or a subset thereof) (Abibullaev et al., 2010; Güler and Ubeyli, 2005, 2007; Güler et al., 2005; Guo et al., 2011; Iscan et al., 2011; Martinez-Vargas et al., 2011; Orhan et al., 2011; Ubeyli, 2008; Übeyli, 2009a, 2009b, 2010; Wang et al., 2011). The five-class dataset, made available by Andrzejak et al. (Andrzejak et al., 2001), consists of 100 single-channel EEG segments without discernible artefacts. Classes A and B (extracranial) were recorded from 5 controls with eyes open and closed; classes C and D (intracranial) were recorded from 5 epilepsy patients in seizure-free intervals, in the epileptogenic zone (D) and the hippocampal formation of the opposite side of the brain (C); class E contained seizure activity from those patients. While this dataset is useful for comparison, it still entails limitations such as the small number of patients and the absence of extracranial recordings from epilepsy patients.
Another relevant characteristic of the datasets used for training and testing the algorithms is the expert opinion used for labeling the IEDs. Since identifying these discharges is a subjective task prone to variability, it is crucial to have enough experts agreeing on the presence of an IED for it to become part of the dataset. Inter-rater agreement (i.e., how often experts agree with one another) can be estimated by calculating the percentage of agreement or with more sophisticated metrics such as Cohen's kappa (Cohen, 1960).
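As an illustration, both agreement measures can be computed directly from two raters' binary annotations. This is a minimal sketch with hypothetical labels, not code from any of the reviewed studies:

```python
from collections import Counter

def percent_agreement(r1, r2):
    """Fraction of samples on which two raters give the same label."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def cohens_kappa(r1, r2):
    """Cohen's kappa: observed agreement corrected for chance agreement
    (Cohen, 1960)."""
    n = len(r1)
    p_observed = percent_agreement(r1, r2)
    # Chance agreement: product of each rater's marginal label frequencies.
    c1, c2 = Counter(r1), Counter(r2)
    p_chance = sum(c1[label] * c2[label] for label in c1) / n ** 2
    return (p_observed - p_chance) / (1 - p_chance)

# Two hypothetical raters labelling ten EEG epochs (1 = IED, 0 = non-IED).
rater_a = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
rater_b = [1, 0, 0, 0, 0, 0, 0, 1, 0, 1]
```

Here the raters agree on 8 of 10 epochs (80% agreement), but kappa is lower (about 0.52) because much of that agreement is expected by chance given how rarely each rater marks an IED.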
The number of experts annotating the EEGs in the reviewed studies ranges from one (Chaibi et al., 2015; Dingle et al., 1993; El-Gohary et al., 2008; Gabor and Seyal, 1992; Horak et al., 2015; Ko and Chung, 2000; Sankar and Natour, 1992; Webber et al., 1994) to eleven (Halford et al., 2013). While the opinion of one expert is highly relevant, a single subjective classification might not be enough for the correct identification of all the IEDs in a recording. Thus, most authors opt for including the opinion of two or three experts.
The way in which these opinions are taken into account is also pertinent. One option is to 'accept' an IED only if all experts agree on its presence (i.e., if consensus is reached) (Gotman and Wang, 1991, 1992; Inan and Kuntalp, 2007; Kurth et al., 2000; Kutlu et al., 2009; Pietilä et al., 1994). This eliminates doubtful patterns but can disregard some IEDs, classifying them as normal samples in the dataset. Instead of consensus, an agreement threshold can also be used: for instance, considering all IEDs identified by at least 7 of 8 experts (Guedes de Oliveira et al., 1983). Assigning weights according to the expertise of each clinician (Hostetler et al., 1992) or calculating the mean of the probability assigned to each sample by each expert (Wilson et al., 1999) are other ways of integrating multiple opinions.

Performance Metrics
To assess model performance, the numbers of true and false positives and negatives (i.e., correctly and incorrectly identified IEDs and non-IEDs) are essential. The numbers of true and false positives are particularly relevant, since they show how many IEDs are detected as well as the number of false detections, which should be low in comparison to the correctly detected IEDs.
Several metrics can then be derived, including sensitivity (defined as the number of true positives over true positives and false negatives), specificity (true negatives over true negatives and false positives) and accuracy (true positives and true negatives over all samples).
An IED detector should have both a high sensitivity (to detect as many IEDs as possible) and a high specificity (to correctly identify non-IEDs). Either alone will not suffice: low sensitivity with high specificity yields very few detections, while high sensitivity with low specificity leads to a high number of false positives. An alternative metric to specificity is the false positive rate, defined either as 1 - specificity or as the number of false positives per hour.
Accuracy is often reported too, defined as sensitivity * P(a) + specificity * (1 - P(a)), with P(a) the prior likelihood of an IED. Note that this metric alone does not provide enough information about classifier performance if sensitivity and specificity are not reported as well, as accuracy generally depends on the prior likelihood P(a). The only exception occurs in the unlikely scenario that sensitivity equals specificity, in which case accuracy = sensitivity = specificity. Further, for unbalanced datasets (which is typically the case in IED detection, as non-IEDs are far more abundant), high accuracy often reflects a high number of true negatives, providing little to no information on IED detection itself. Thus, while accuracy is reported in many studies (Benlamri et al., 1997; Güler and Ubeyli, 2005; Güler et al., 2005; Nigam and Graupe, 2004; Srinivasan et al., 2005; Tzallas et al., 2006a; Ubeyli, 2008; Übeyli, 2008, 2009a, 2009b; Wang et al., 2011), often as the only metric, it is not suitable for performance assessment or for comparison of different methods, as the prior likelihood of IEDs typically differs between the EEG datasets used in various studies.
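The relationship between these metrics, and the way accuracy is dominated by true negatives on an unbalanced set, can be made concrete with hypothetical confusion counts (the numbers below are illustrative, not taken from any reviewed study):

```python
def sensitivity(tp, fn):
    """True positive rate: detected IEDs over all IEDs."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: rejected non-IEDs over all non-IEDs."""
    return tn / (tn + fp)

def accuracy(tp, tn, fp, fn):
    """Correct decisions over all samples."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical counts from an unbalanced test set: 50 IEDs, 950 non-IEDs.
tp, fn = 40, 10    # 40 of the 50 IEDs detected
tn, fp = 930, 20   # 930 of the 950 non-IEDs correctly rejected

sens = sensitivity(tp, fn)       # 0.80
spec = specificity(tn, fp)       # about 0.979
acc = accuracy(tp, tn, fp, fn)   # 0.97, dominated by the true negatives

# Equivalent form used in the text: acc = sens*P(a) + spec*(1 - P(a)),
# with P(a) the prior likelihood of an IED.
p_a = (tp + fn) / (tp + tn + fp + fn)   # 0.05
assert abs(acc - (sens * p_a + spec * (1 - p_a))) < 1e-12
```

Despite missing one in five IEDs, the detector scores 97% accuracy, which illustrates why accuracy alone says little about IED detection.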

Developed Methods
Over the last 40 years, several approaches to automate IED detection have been developed.
Table 1 provides a comprehensive overview of the automated IED detection methods developed between 1976 and 2020. Supplementary Table S1 provides detailed information on each study.
All approaches can be broadly divided into four categories. In the first group (I), users select and extract characteristics of the EEG which are used to determine the probability of a sample including an IED. Examples of such characteristics include the maximum spectral frequency, the amplitude of a transient, or the similarity to a template; classification is based on passing a particular user-defined threshold. Methods in the second category (II) use explicitly defined features, as in the first group, but the classifiers 'learn' from the input using statistical modeling, without a predefined classification rule.
Expert systems (III) in IED detection are applied to the output of an algorithm from the first two categories. These methods cannot be used as a standalone approach, as they do not take raw data or features as input, but they can refine the classification of the preceding algorithms. The fourth category (IV) comprises deep learning methods, which learn their own features directly from the raw data.

Mimetic Methods
Mimetic methods were the first approach to automated IED detection. These techniques aim to emulate the visual analysis process and thus resort mostly to morphological characteristics of the EEG, since these are the ones used by experts when searching for IEDs (Gotman, 1982; Gotman and Gloor, 1976; Gotman and Wang, 1991, 1992; Guedes de Oliveira et al., 1983; Hostetler et al., 1992).
The transient is divided into segments (i.e., the sections between two amplitude extrema), as shown in Figure 2. A minimum difference between segments is often imposed to reduce the impact of noise (third panel, Figure 2). Waves are defined as the set of two adjacent segments in opposite directions, with half-waves being the segments on either side of the maximum amplitude. From the half-waves, features such as relative amplitude, duration, slope and sharpness are extracted (Benlamri et al., 1997; Gotman, 1982; Gotman and Gloor, 1976; Guedes de Oliveira et al., 1983; Nigam and Graupe, 2004; Srinivasan et al., 2005; Tzallas et al., 2006a).
After extracting these features, physiologically acceptable values are used as thresholds to decide whether a sample contains an IED. These are typically the same thresholds used in visual analysis.
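A simplified sketch of this half-wave feature extraction is given below. It locates the amplitude extrema, treats each section between consecutive extrema as a segment, and computes amplitude, duration and slope per segment; as a simplification of the noise-reduction step in Figure 2, segments below a minimum amplitude are skipped rather than merged:

```python
def extrema_indices(signal):
    """Indices of local turning points (sign change of the slope),
    including the endpoints of the epoch."""
    idx = [0]
    for i in range(1, len(signal) - 1):
        if (signal[i] - signal[i - 1]) * (signal[i + 1] - signal[i]) < 0:
            idx.append(i)
    idx.append(len(signal) - 1)
    return idx

def half_wave_features(signal, fs, min_amplitude=0.0):
    """Amplitude, duration and slope of each segment between extrema.
    Segments smaller than min_amplitude are skipped to reduce noise
    (a simplification of the merging used in mimetic methods)."""
    feats = []
    ext = extrema_indices(signal)
    for a, b in zip(ext, ext[1:]):
        amp = abs(signal[b] - signal[a])
        if amp < min_amplitude:
            continue
        dur = (b - a) / fs
        feats.append({"amplitude": amp, "duration": dur, "slope": amp / dur})
    return feats
```

A mimetic detector would then compare each segment's amplitude, duration and sharpness against physiologically motivated thresholds.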

Figure 2
The first published method was based on the amplitude, duration and 'sharpness' of the half-waves (Gotman and Gloor, 1976). This algorithm was later updated to detect IEDs in different states (active wakefulness, quiet wakefulness, desynchronized EEG, phasic EEG and slow EEG), increasing generalization (Gotman and Wang, 1991, 1992). Another mimetic algorithm was partially implemented in hardware to support on-line detection, which was relevant progress at the date of publication in the 1980s (Guedes de Oliveira et al., 1983).
In most of these studies, the separation of training and test sets was not properly reported. Furthermore, sensitivity and specificity were not the chosen metrics for performance assessment.
Mimetic methods appear to have limited ability to distinguish IEDs from transients and artefacts with similar morphology, leading to a high number of false positives. While some form of artefact rejection was used in several studies (Gotman and Gloor, 1976; Gotman and Wang, 1991; Guedes de Oliveira et al., 1983), non-artefactual phenomena (i.e., physiological transients) still led to misclassifications. One study reported an average of 41 true and 39 false detections in 6 h of recordings, showcasing the high proportion of false positives (Gotman, 1982). Therefore, this type of approach lost popularity and gave way to other types of methods.
Similar to mimetic methods, a threshold value was then applied to these wavelet-derived features to decide whether a sample contained an IED.
Still, there was a lack of studies adequately assessing the algorithms' performance. While one publication reported 91.7% sensitivity at 89.3% specificity, there was no separation between training and test sets, making these results unreliable (Indiradevi et al., 2008). Furthermore, choosing a suitable wavelet for EEG decomposition and computing an appropriate threshold was not trivial, and other types of algorithms started being developed to circumvent these shortcomings associated with parameter and threshold choices.

Morphological Filtering
Morphological filtering is an alternative strategy used for IED detection. In these approaches, a structuring element is combined with the signal using a set operator (intersection or union, among others) (Juozapavicius et al., 2011; Xu et al., 2007; Xu et al., 2006). This changes the shape of the signal, separating the spikes from the background. An amplitude threshold can then be applied to the difference between the original signal and the background, aiming to detect IEDs.
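The idea can be sketched with the basic grayscale morphological operators, where intersection and union over a sliding structuring element reduce to local minimum (erosion) and maximum (dilation). This is an illustrative toy example, not the specific operator combinations used in the cited studies:

```python
def erode(signal, width):
    """Grayscale erosion: local minimum over a flat structuring element."""
    half = width // 2
    return [min(signal[max(0, i - half):i + half + 1])
            for i in range(len(signal))]

def dilate(signal, width):
    """Grayscale dilation: local maximum over a flat structuring element."""
    half = width // 2
    return [max(signal[max(0, i - half):i + half + 1])
            for i in range(len(signal))]

def opening(signal, width):
    """Erosion followed by dilation: suppresses peaks narrower than
    the structuring element, leaving an estimate of the background."""
    return dilate(erode(signal, width), width)

# A flat background with one narrow 'spike': the opening estimates the
# background, and the residue isolates the spike for amplitude thresholding.
sig = [0, 0, 0, 5, 0, 0, 0]
background = opening(sig, 3)
residue = [s - b for s, b in zip(sig, background)]
```

The narrow peak is removed from the background estimate, so the residue contains only the spike, to which a threshold can then be applied.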
Xu et al. developed a method based on the combination of different morphological operators (Xu et al., 2007). While it performed better than the standard filter types typically used in morphological operations, the presence of artefacts with morphological characteristics similar to IEDs led to a large number of false detections in all of the authors' approaches.
Juozapavicius et al. reached a similar conclusion. While their algorithm detected IEDs, it also identified other spikes (e.g., physiological spikes during sleep), as well as false positives in noisy recordings (Juozapavicius et al., 2011). This suggested that, while more complex, morphological filtering suffered from the same drawbacks as mimetic methods, due to the similarity of IEDs to other transients and artefacts.

Template Matching
While these methods are more complex than simple feature thresholding, they still include a thresholding step, since one needs to specify how small the distance, or how large the correlation, between a candidate pattern and a template must be for the pattern to be considered an IED. The threshold value has a high impact on IED recognition: a lower similarity threshold will lead to more epochs being detected as IEDs, and hence to more false positives (and vice versa).
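A minimal template matching sketch, using the normalized (Pearson) correlation between a candidate epoch and each template in the pool, with a hand-picked threshold; both the toy template and the threshold value are illustrative assumptions:

```python
import math

def normalized_correlation(x, y):
    """Pearson correlation between an epoch and a template of equal length."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x)
                    * sum((b - my) ** 2 for b in y))
    return num / den if den else 0.0

def matches_any_template(epoch, templates, threshold=0.8):
    """Flag the epoch as an IED candidate if it correlates strongly
    enough with at least one template in the pool."""
    return any(normalized_correlation(epoch, t) >= threshold
               for t in templates)
```

Because the correlation is normalized, an epoch matching a template's shape at a different amplitude still matches, while a featureless epoch does not; lowering `threshold` admits more (and noisier) candidates, exactly the trade-off described above.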
The template pools used in these approaches must be large and representative enough to capture the inter-patient variability of IED patterns.To account for this, authors tended to use several multi-channel templates (El-Gohary et al., 2008;Ji et al., 2011a).Other ways of increasing robustness included using additional templates for the background patterns of the EEG (Thomas et al., 2017) and increasing the template pool by finding epochs similar to the existing templates (Lodder et al., 2013).
Self-adapting systems were also proposed to improve template matching (Lodder and van Putten, 2014). In this work, the system started by detecting possible IEDs based on the available templates and showing the prospective detections to the clinicians. If a detection was correct, the IED could be used as a template, increasing the pool of available templates. Using 241 IEDs, it was possible to detect a third of the transients after one iteration with the clinicians (in which the clinicians confirm correct classifications). This number rose to 95% after 15 iterations, showing the impact of the expert input. However, as the authors point out, the system was optimized in terms of sensitivity and there was a trade-off with specificity, leading to a higher number of false positives, ranging from 0.24 to 6.6 per minute.
While this self-adapting system can contribute to the creation of larger template pools, the new templates will always be morphologically related to the previous ones (since the system detected the epochs through an existing template). Thus, if any type of morphology is lacking in the original template pool, it will be very difficult for the system to bridge that gap on its own, constituting a limitation of this method.
Several template matching approaches were hindered by the use of very small datasets, which could not capture the morphological diversity of IEDs (El-Gohary et al., 2008; Ji et al., 2011a; Nonclercq et al., 2012). For instance, one algorithm trained on data from 6 patients yielded 97% sensitivity at 78% specificity when there were clear IEDs, but performance dropped to 31% sensitivity at 33% specificity in recordings where the discharges were less clear (i.e., closer to the clinical reality) (Pietilä et al., 1994).

Pre-defined Features and Machine Learning (II)
Machine learning (ML) methods take in pre-determined features and use them for classification. They include traditional machine learning classifiers (section 4.2.1) and small artificial neural networks (section 4.2.2). Unlike the previous approaches, these algorithms learn from the input features instead of being completely defined a priori. This strategy aims to reduce the number of user-defined parameters and typically uses a large number of characteristics of the EEG for classification.

Traditional Machine Learning Classifiers
Traditional machine learning classifiers are a set of algorithms based on concepts such as point neighborhood, clustering or class division that adapt their parameters according to the input features, learning from them to classify new data.
K-nearest neighbors (KNN) is an example of an ML classifier proposed for IED detection (Guo et al., 2011; Iscan et al., 2011; Martinez-Vargas et al., 2011; Zhou et al., 2013). After feature extraction, for each sample the algorithm finds a certain number (k) of training samples that are closest to it in the feature space: its k nearest neighbors. A voting scheme then leads to classification: the sample is classified as 'having an IED' if most of its neighbors have an IED.
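The KNN voting scheme can be written in a few lines. The 2-D feature vectors below are hypothetical (e.g. a normalized amplitude and sharpness per epoch), chosen only to illustrate the neighborhood vote:

```python
import math
from collections import Counter

def knn_classify(sample, train_feats, train_labels, k=3):
    """Label a sample by majority vote among its k nearest training points
    in feature space (Euclidean distance)."""
    dists = sorted(
        (math.dist(sample, f), label)
        for f, label in zip(train_feats, train_labels)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-D feature vectors per epoch; label 1 = IED, 0 = non-IED.
feats = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15),
         (0.9, 0.8), (0.85, 0.9), (0.8, 0.85)]
labels = [0, 0, 0, 1, 1, 1]
```

A new epoch near the IED cluster, e.g. `knn_classify((0.88, 0.82), feats, labels)`, is voted into class 1 by its three nearest neighbors.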
Clustering methods are another ML approach, of which k-means is the most common (Inan and Kuntalp, 2007; Nonclercq et al., 2012; Orhan et al., 2011). This algorithm partitions the training data into a number (k) of clusters. Each new sample is then assigned to the cluster with the closest center. More complex clustering techniques have also been applied to IED detection.
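A minimal k-means sketch follows; the deterministic initialization from the first k points is an assumption for reproducibility (real implementations usually initialize randomly or with k-means++):

```python
import math

def kmeans(points, k, iters=10):
    """Minimal k-means: alternates nearest-center assignment and
    center updates, returning the final cluster centers."""
    centers = [list(p) for p in points[:k]]  # deterministic init (assumption)
    for _ in range(iters):
        # Assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[j].append(p)
        # Move each center to the mean of its assigned points.
        for j, cl in enumerate(clusters):
            if cl:
                centers[j] = [sum(coord) / len(cl) for coord in zip(*cl)]
    return centers

def assign(point, centers):
    """A new sample is given the label of the closest cluster center."""
    return min(range(len(centers)),
               key=lambda c: math.dist(point, centers[c]))
```

With two well-separated groups of feature vectors, the two centers settle on the group means, and new samples are assigned to the nearest center.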
When comparing five different types of clustering (k-means, k-medoids, fuzzy C-means, agglomerative clustering and affinity propagation), a study showed that affinity propagation yielded the best results (Thomas et al., 2017). This method does not require prior specification of the number of clusters. Instead, this is determined as the algorithm runs, since it finds 'exemplars': training samples representative of clusters.
Another widely used ML classifier is the support vector machine (SVM) (Acir and Güzeliş, 2004; Acir et al., 2005; Bagheri et al., 2018; Chavakula et al., 2013; Güler and Ubeyli, 2007; Iscan et al., 2011; Kelleher et al., 2010; Lima et al., 2010; Thomas et al., 2018). Given the training data, SVMs build a barrier (hyperplane) between the classes, maximizing the margin between them. A new sample is classified based on which side of this 'fence' it falls in feature space. Authors have compared the performance of the SVM to that of a multilayer perceptron (MLP, described in section 4.2.2) on the same dataset, concluding that the SVM shows superior accuracy in IED detection (Derya Ubeyli, 2008; Güler and Ubeyli, 2007).
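Once trained, the SVM's decision reduces to checking which side of the hyperplane a sample lies on. In this sketch the weights and bias are hand-picked for illustration; in practice they would come from margin-maximizing training on labelled feature vectors:

```python
def svm_decide(x, w, b):
    """Linear SVM decision rule: the sign of w.x + b tells which side of
    the hyperplane w.x + b = 0 the sample lies on (+1 = IED, -1 = non-IED)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Hand-picked hyperplane x1 + x2 = 1 for illustration (assumption);
# real w and b result from solving the margin-maximization problem.
w, b = [1.0, 1.0], -1.0
```

For example, `svm_decide([0.9, 0.8], w, b)` returns +1 (above the line) and `svm_decide([0.1, 0.2], w, b)` returns -1.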
One study compared the results of a traditional SVM and one of its variants, the least-squares SVM (LS-SVM), to other ML classifiers (Iscan et al., 2011). The LS-SVM reconfigures the problem so that it is solved using a set of linear equations instead of a quadratic program. Ultimately, the LS-SVM surpassed all other ML classifiers tested on this dataset. Using a cascade of SVMs rather than a single one also led to increased precision and sensitivity (Bagheri et al., 2018). This was congruent with the previously mentioned study where affinity propagation beat k-means, leading to the conclusion that updated versions of traditional ML algorithms can improve IED detection (Thomas et al., 2017).
Comparing these algorithms is not trivial, as several authors only provided qualitative results (Horak et al., 2015; Lima et al., 2010) or did not separate training and test sets (e.g. Kelleher et al., 2010; Orhan et al., 2011). One study reported an area under the receiver operating characteristic curve (AUC) of 0.93 for a patient-specific classifier, suggesting that this performance would drop when applied to another patient and showcasing the need for external validation (Kelleher et al., 2010). Furthermore, several studies reported accuracy as the only performance metric (e.g. Güler and Ubeyli, 2007; Martinez-Vargas et al., 2011; Ubeyli, 2008). Accuracy ranged from 75% to 99% but, as explained in section 3, this alone is not indicative of a good IED detector.

Small Artificial Neural Networks
Artificial neural networks (ANNs) are architectures of artificial neurons organized in layers, which have been used for IED detection since 1992 (Gabor and Seyal, 1992). They are similar to traditional ML classifiers in their ability to learn from data, but ANNs present a specific type of architecture and mathematical framework that distinguishes them from other ML approaches. This section focuses on ANNs with a shallow/small structure that use features as input (i.e., that are not used as end-to-end methods, as opposed to the ANNs discussed in section 4.4).
Perceptrons are the building blocks of neural networks: the output of a perceptron is linked to the input of the next layer's perceptrons, creating a network. Even within MLPs there is architectural variability: the number of layers, the number of nodes and the activation function, among others, are differentiating parameters.
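The layer-to-layer chaining of perceptrons can be sketched as a forward pass. The sigmoid activation and the hand-picked weights below are illustrative assumptions; trained networks learn such weights from data:

```python
import math

def sigmoid(z):
    """Classic sigmoid activation, mapping any real input to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def perceptron(inputs, weights, bias):
    """A single perceptron: weighted sum of inputs through an activation."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def mlp_forward(inputs, layers):
    """Forward pass through an MLP: each layer is a list of (weights, bias)
    pairs, and the outputs of one layer feed the perceptrons of the next."""
    activations = inputs
    for layer in layers:
        activations = [perceptron(activations, w, b) for w, b in layer]
    return activations

# A toy 2-input network: one hidden layer of two perceptrons, one output.
out = mlp_forward([0.5, -1.0], [
    [([1.0, -1.0], 0.0), ([0.5, 0.5], 0.1)],  # hidden layer (weights assumed)
    [([1.0, 1.0], -1.0)],                      # output layer
])
```

The single output value lies in (0, 1) and can be thresholded to decide whether the input feature vector represents an IED.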

Figure 4
Aside from MLPs, other feed-forward ANN architectures have been used to automate IED detection (James et al., 1999; Kurth et al., 2000; Nigam and Graupe, 2004; Song and Zhang, 2013; Ubeyli, 2008; Übeyli, 2008, 2010; Wilson et al., 1999). Several authors compared the performance of MLPs and their own custom architectures (Güler et al., 2005; Ubeyli, 2008; Übeyli, 2008). Custom architectures outperformed the MLP in all of these studies, suggesting that more complex algorithms are necessary to fully capture and process the information given as input.
Recurrent neural networks (RNNs) differ from feed-forward networks by allowing information to travel both forward and backward in the network through feedback loops. These have also been an option for IED detection, reaching similar performances (Güler et al., 2005; Sezer et al., 2012; Srinivasan et al., 2005; Übeyli, 2009a).
Some authors combined ANNs with traditional ML classifiers (Acir et al., 2005; Inan and Kuntalp, 2007; Thomas et al., 2018). One study used a neural network as a feature extractor, the output of which was fed to an SVM that performed EEG-level classification (Thomas et al., 2018). Another study used two perceptrons to pre-classify EEG samples into definite IEDs, definite non-IEDs and possible IEDs (Acir et al., 2005). An SVM was then used to separate the epochs in the third group. This study used data from 19 patients for training and 10 for testing, reaching 89.1% sensitivity at 85.9% specificity.
Furthermore, not separating the training and test sets leads to unreliable results, since overtraining is one of the main possible drawbacks of ANNs, given their high number of trainable parameters. These parameters can be overly fitted to a training set and thus fail to generalize; running an overtrained model on an independent set would lead to a much lower performance. For instance, it is not likely that 100% sensitivity and specificity could be reached on an independent set (Sezer et al., 2012). Studies that performed a proper train/test split and reported relevant metrics showed that it is possible to overcome the issue of overtraining: with different morphological features and ANNs, sensitivity ranged from 55.3% to 89.9% at approximately 99% specificity (James et al., 1999; Wilson et al., 1999). While these results represent an improvement over previously described approaches, they mostly fell short of human performance (Wilson et al., 1999).

Expert Systems (III)
Expert systems (ESs) are fundamentally different from the previously described approaches, as they can only be used in conjunction with one of the classifiers described in sections 4.1 and 4.2. ESs for IED detection take the results of previous methods as input and refine the decision (accepting or discarding the detections) by incorporating spatial and/or temporal context or previous knowledge given to the system. This is highly dependent on the human expert, since clinicians decide what information is relevant in this final classification step.
Mimetic methods have been used in conjunction with ESs (Benlamri et al., 1997; Black et al., 2000; Dingle et al., 1993). One approach applied an ES after thresholding morphological features, incorporating spatial context to reject artefacts (e.g., muscle spikes, eye blinks, electrode movement). A combination of spatial and temporal context was then used to approve the candidate patterns (Dingle et al., 1993).
ESs have also been coupled with ANNs (Argoud et al., 2006; Park et al., 1998; Tzallas et al., 2006a). One study used an ANN to pre-classify EEG segments into categories (e.g., spikes, muscle activity, eye blinks or sharp alpha activity), which were then classified by the ES using contextual information regarding synchronicity and channel adjacency (Tzallas et al., 2006b).
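A single ES rule of this kind can be sketched as a post-filter on an upstream classifier's detections. The montage adjacency, time window and detections below are hypothetical, chosen only to illustrate how spatial context refines a decision:

```python
def accept_detection(candidate_channel, candidate_time, detections,
                     adjacent, window=0.05):
    """Rule-based post-filter: keep a candidate IED only if a detection
    also occurs on an adjacent channel within a small time window,
    reflecting the spatial field expected of a genuine discharge."""
    return any(
        ch in adjacent[candidate_channel] and abs(t - candidate_time) <= window
        for ch, t in detections
    )

# Hypothetical montage adjacency and upstream detections (channel, time in s).
adjacent = {"F3": ["F7", "C3"], "F7": ["F3"], "C3": ["F3"], "O1": []}
detections = [("F3", 12.31), ("C3", 12.33), ("O1", 47.10)]
```

Here the F3 candidate at 12.31 s is kept (C3 fires 20 ms later on an adjacent electrode), while the isolated O1 detection is rejected as a likely artefact.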
Studies using ESs suffer the same limitations as the previous ones regarding the lack of separation of training and test sets (Dingle et al., 1993; Park et al., 1998), as well as improper performance metrics (Benlamri et al., 1997; Tzallas et al., 2006a). Furthermore, since ESs must be coupled with other classifiers, judging their performance in isolation is impossible. While they may aid several other algorithms as a final step, they cannot be implemented in the clinic on their own. Therefore, their contribution to classification should be studied on a case-by-case basis, comparing the algorithm with and without the ES.

Deep Learning (IV)
Deep learning methods encompass neural networks with a large number of layers (i.e., depth), distinguishing them from the smaller ANNs discussed in section 4.2.2. Typically, these networks are used as end-to-end methods: they take in raw data, learn their own representation of the feature space and classify new samples. This constitutes an important difference from the methods in the previous sections, which need pre-defined features. Making the models feature-independent increases the likelihood that the algorithm will capture more information from the EEG signal, as it is not forced to focus on a limited number of user-defined characteristics.
Before the rise of deep learning in computer science, authors compared the use of raw and parameterized data for IED detection (Ko and Chung, 2000; Webber et al., 1994). These studies showed below-random performance with raw data, leading the authors to conclude that it was impossible to use ANNs as end-to-end methods. However, with the development of novel architectures and the availability of computing power, the paradigm changed.
Convolutional neural networks (CNNs) are a type of ANN often used in deep learning approaches. The core building blocks of CNNs are convolutional layers, which extract information using filters that are iteratively convolved with the input. The use of CNNs in IED detection is growing (Jing et al., 2020; Johansen et al., 2016; Lourenço et al., 2019; Tjepkema-Cloostermans et al., 2018), as these have shown potential in fields such as image analysis, but also in health-related applications. Figure 5 shows an example of a CNN architecture that was initially developed for image analysis and has been adapted for IED detection (Lourenço et al., 2019).

Figure 5
Applying CNNs to raw EEGs, it was possible to achieve areas under the receiver operating characteristic curve (AUCs) higher than 0.9, indicative of algorithms with very good performance (Jing et al., 2020; Johansen et al., 2016; Lourenço et al., 2019; Tjepkema-Cloostermans et al., 2018). Jing et al. (Jing et al., 2020) tested their CNN against commercially available software (Persyst 13, which uses extracted features and feed-forward NN rules (Scheuer et al., 2017)) and achieved an AUC 0.1 higher than this program on the same test set.
While these performances are impressive, the increased complexity of the methods (in terms of the number of trainable parameters) reduces the explainability of the decisions. CNNs are often called 'black boxes', as it is not easy to understand what goes on inside the model: the information that the network retrieves from the input, and its relative importance for the final classification, is not explicit. However, steps have been taken to increase the transparency of these techniques.
Tjepkema-Cloostermans et al. (Tjepkema-Cloostermans et al., 2018) showed that two-dimensional (2D) CNNs worked better than one-dimensional (1D) CNNs for IED detection when the input was given as a matrix of several channels over time. The difference is that 2D-CNNs have two-dimensional filters while 1D-CNNs have vector filters. This means that the 2D-CNN is able to capture information from several channels in each application of a filter kernel, while the 1D-CNN restricts its information retrieval to one channel. Since the 2D-CNNs yielded better performances, this implies that using more spatial context (in this case, more than one channel per application of the filter kernel) is of added benefit for classification.
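The difference between the two filter types can be illustrated with minimal 'valid'-mode convolutions (following deep learning convention, these are cross-correlations, i.e. the kernel is not flipped); the toy signals and kernels are arbitrary:

```python
def conv1d(signal, kernel):
    """1-D convolution (valid mode): the vector kernel slides along a
    single channel, so each output mixes samples from one channel only."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

def conv2d(channels, kernel):
    """2-D convolution (valid mode) over a channels x time matrix: each
    application of the kernel mixes information from several channels,
    providing the spatial context a 1-D filter lacks."""
    kh, kw = len(kernel), len(kernel[0])
    rows, cols = len(channels), len(channels[0])
    return [
        [
            sum(kernel[a][b] * channels[r + a][c + b]
                for a in range(kh) for b in range(kw))
            for c in range(cols - kw + 1)
        ]
        for r in range(rows - kh + 1)
    ]
```

A 1-D edge-like kernel such as `[1, -1]` only sees one channel's samples, whereas a 2x2 kernel combines two adjacent channels at two adjacent time points in every output value.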
Through the application of network visualization techniques, da Silva Lourenço et al. were able to show that the CNN was detecting IEDs in the samples (see Figure 6) (Lourenço et al., 2019).
This indicated that the network was classifying the samples based on the same visual features an expert would use, rather than on complementary information.

Figure 6
The number of studies applying deep networks to IED detection is still small. However, the field is developing fast, and healthcare applications of deep networks are increasing.
Furthermore, the high sensitivity and specificity values obtained by several authors suggest that we should expect more of these approaches in the near future.

General Considerations
Despite the efforts towards automating IED detection, clinical use is still scarce. Previously, this was mostly due to the sub-par performance of the methods compared to the gold standard. Currently, however, the main hurdles are the small, mutually incomparable datasets and inconsistent performance measures.
Most authors used different datasets to train their algorithms, hindering result comparison.
There is a lack of public standardized datasets with a large number of EEGs from patients in different age ranges, reviewed by a significant number of experts. This is mostly because expert opinion is expensive, and the organizations that are able to assemble such a dataset are typically not willing to share it. To change this situation, state-supported studies should be funded to allow the creation of this type of dataset and the subsequent homogenization of model evaluation.
Additionally, while some studies reported sensitivity and specificity, as well as false positive rates, several studies did not present these values (e.g., Boos et al., 2010; Guedes de Oliveira et al., 1983; Hostetler et al., 1992; Ko and Chung, 2000). Approaches reporting qualitative results also lack the information needed to objectively assess the method and compare it to others (Calvagno et al., 2000; Ji et al., 2011a; Juozapavicius et al., 2011; Suresh and Balasubramanyam, 2013; Vijayalakshmi and Abhishek, 2010).
In fact, out of all the approaches reviewed in this paper, only six fulfilled two basic requirements: separating the training and test sets and reporting sensitivity and specificity or false positive rate (Acir and Güzeliş, 2004; James et al., 1999; Lodder et al., 2013; Lourenço et al., 2019; Tjepkema-Cloostermans et al., 2018; Wilson et al., 1999). Still, even these papers were not directly comparable due to differences in the datasets.
From a more global perspective, however, there have been significant developments in the automation of IED detection over the past forty years. Authors went from trying to mimic the behavior of experts to creating more complex and diverse approaches. Figure 7 provides a chronological overview of the first use of each type of method.

Figure 7
False detections associated with morphological features were circumvented by using other characteristics, and the development of fields such as machine learning broadened the range of algorithms that could be applied to IED detection. Furthermore, more data became available, and the computing power needed to process such data became accessible, allowing the training of more generalizable algorithms. As time progressed, there was a shift from older techniques, with moderate results, to newer approaches: machine learning, with better results, and deep learning, with impressive outcomes.

Conclusion
The methods used for automated IED detection have evolved over the past 45 years, with a resulting increase in performance. While the diversity of datasets and performance measures has hindered direct comparison of the studies, machine and deep learning techniques have become increasingly pervasive over the past years. The results of these approaches, which seem to surpass those of other methods, further contribute to the growing interest in this line of research.
It is expected that the upward trend of machine and deep learning in IED detection will continue. It is also likely that these algorithms will become an integral part of the assessment of clinical EEG, reducing variability, subjectivity and time to diagnosis.

Legends
Figure 2 - A) Electroencephalographic (EEG) signal. B) Segments of the EEG signal, defined as the section between two extrema of amplitude. C) Sequences of the EEG signal, obtained by imposing a minimum difference between segments to reduce the impact of noise. This allows the algorithm to 'see' the wave between A, F and I (Gotman and Gloor, 1976).
Figure 3 - Example of a template (in red). To find matches, characteristics such as correlation coefficients, amplitude and variance are compared between the candidate electroencephalogram (EEG) and the template. By doing this across the EEG signal, it is possible to detect interictal epileptiform discharges, since these patterns will be the most similar to the templates. For example, the EEG segment highlighted in blue is much more similar to the template than the one highlighted in purple (Lodder et al., 2013).
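The template-matching idea in the legend above can be sketched as a sliding-window correlation (a simplified toy illustration, not the method of Lodder et al.; the spike shape, threshold, and helper name are assumptions):

```python
import numpy as np

def template_match(eeg, template, threshold=0.9):
    """Slide the template over the signal and flag window positions
    whose Pearson correlation with the template exceeds the threshold."""
    n = len(template)
    hits = []
    for start in range(len(eeg) - n + 1):
        window = eeg[start:start + n]
        if window.std() == 0:
            continue  # constant window: correlation is undefined
        r = np.corrcoef(window, template)[0, 1]
        if r > threshold:
            hits.append(start)
    return hits

# Toy example: a spike-like template embedded in a flat signal.
template = np.array([0.0, 1.0, 5.0, 1.0, 0.0])
eeg = np.concatenate([np.zeros(5), template, np.zeros(5)])
print(template_match(eeg, template))  # [5]: the spike is found at index 5
```

A real detector would also compare amplitude and variance, as the legend notes, rather than relying on correlation alone.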
Figure 4 - Basic structure of a perceptron. The inputs, x_0 to x_n, are combined with varying weights (w_0 to w_n). This results in a weighted sum that is then passed through a step function (or a more complex activation function), predicting 1 if the result is above a certain threshold and 0 otherwise.
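The perceptron described in the legend reduces to a few lines of Python (a generic textbook sketch; the AND-like weights and threshold are an arbitrary example, not taken from any of the reviewed studies):

```python
def perceptron(inputs, weights, threshold=0.0):
    """Weighted sum of the inputs passed through a step function:
    predict 1 if the sum exceeds the threshold, 0 otherwise."""
    s = sum(x * w for x, w in zip(inputs, weights))
    return 1 if s > threshold else 0

# Example: two inputs whose weights make the unit behave like a logical AND.
print(perceptron([1, 1], [0.6, 0.6], threshold=1.0))  # 1
print(perceptron([1, 0], [0.6, 0.6], threshold=1.0))  # 0
```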
Figure 5 - Example of a convolutional neural network architecture that has been adapted for interictal epileptiform discharge detection (Lourenço et al., 2019). This network includes several blocks of convolutional and pooling layers, ending with fully connected layers.
Figure 6 - Probability heatmap obtained with the occlusion visualization technique on an electroencephalogram sample with a focal interictal epileptiform discharge (IED) (Lourenço et al., 2019). Warmer colors indicate higher importance for classification, showing that the convolutional network detected the IED shape, revealing some information about the network's decision process.
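The occlusion technique behind this heatmap can be sketched in a few lines (a toy version: the `predict` stand-in below is a hypothetical peak detector, not the trained CNN from Lourenço et al., and the window size is an assumption):

```python
import numpy as np

def occlusion_map(sample, predict, patch=10):
    """Occlude successive time windows of the sample and record the
    model's output probability for each occluded copy; a sharp drop
    marks a region the model relies on for its decision."""
    probs = np.empty(len(sample) - patch + 1)
    for start in range(len(probs)):
        occluded = sample.copy()
        occluded[start:start + patch] = 0.0  # mask this window
        probs[start] = predict(occluded)
    return probs

# Toy stand-in for a trained classifier: here, just the peak amplitude.
sample = np.zeros(30)
sample[15] = 1.0  # pretend this is the IED peak
probs = occlusion_map(sample, lambda x: float(x.max()))
print(int(np.argmin(probs)))  # 6: the first window that hides the peak
```

Mapping the drop in probability back onto the signal yields a heatmap like the one in Figure 6.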
Figure 7 - Chronological overview of the first report of each type of automated approach for interictal epileptiform discharge detection. The first paper using mimetic methods was published in 1976, followed by artificial neural networks (ANNs), template matching, expert systems (ESs), thresholding of frequency features, machine learning (ML), morphological filtering and deep learning (DL).

Table 1 -
Approaches to automated IED detection.More details are shown in Supplementary