Less Parameterization Inception-Based End to End CNN Model for EEG Seizure Detection

Many deep-learning-based seizure detection algorithms have achieved good classification performance and usually outperform traditional machine-learning-based algorithms. However, hand-engineered features increase the computational complexity and may be ineffective for classification. Therefore, this paper proposes a novel end-to-end deep-learning model comprising an inception module and a residual module to analyze the original EEG signals at multiple scales and realize seizure detection without feature extraction. Experiments were conducted and evaluated on the Bonn dataset and the CHB-MIT dataset. In the subject-dependent experiments, our model achieved an average F1-score of 69.34% on the CHB-MIT dataset. In the subject-independent experiments, our method achieved an average accuracy of 99.04% on the Bonn dataset and an average F1-score of 37.31% on the CHB-MIT dataset. A series of analyses confirmed that our proposed model has better classification performance and lower computational complexity than existing end-to-end seizure detection models.


I. INTRODUCTION
Epilepsy is a chronic neurological disease affecting about 50 million people worldwide [1]. Abnormal discharge of brain cells during epileptic seizures causes symptoms such as convulsions, fainting, loss of consciousness, and behavior changes. These symptoms may not only cause physical damage to patients but may also cause psychological problems in the long run [2]. Most patients' epilepsy symptoms can be effectively improved with antiepileptic drugs. However, about 20-40% of patients with severe epilepsy may need surgical treatment, and an accurate diagnosis of epilepsy is a necessary preoperative evaluation [3].
Electroencephalography (EEG), a method for recording brain electrical activity, is often used in diagnosing brain diseases [4]. The traditional epilepsy diagnosis method is manual detection, in which neurologists visually inspect EEG signals. This method is time-consuming, and the accuracy of diagnosis highly depends on the experience and ability of the diagnostician. Long-term EEG diagnosis is more prone to human negligence, resulting in misdiagnosis or delayed diagnosis that aggravates the condition [5].
The associate editor coordinating the review of this manuscript and approving it for publication was Larbi Boubchir.
In recent years, many automated seizure detection technologies have been proposed for epilepsy diagnosis assistance systems [6], [7] and seizure alarm systems [8], [9]. The epilepsy diagnostic assistance system can increase the accuracy of epilepsy diagnosis and shorten the diagnosis time. The seizure alarm systems can immediately notify relevant units when a seizure occurs. Patients can quickly obtain treatment and the sudden unexpected death in epilepsy (SUDEP) can be reduced [10].
Seizure detection is regarded as a classification problem. The related algorithms can be roughly divided into two categories: traditional machine learning and deep learning. Algorithms based on traditional machine learning generally include pre-processing, feature extraction, feature selection, a classifier, and post-processing. For example, Chen et al. [6] used the discrete wavelet transform (DWT) and calculated several types of entropy (Shannon entropy, sample entropy, fuzzy entropy, etc.) as features. Then, analysis of variance (ANOVA) was used to rank and select features. Finally, the least squares support vector machine (LS-SVM) was used as a classifier to identify whether the input signal was a seizure. Olokodana et al. [10] used a DWT soft-threshold technique to filter out signal noise and then calculated the fractal dimension, Hjorth parameters, and singular value decomposition entropy as signal features. Finally, the Kriging model was applied for classification. Tiwari et al. [11] used the difference of Gaussians (DoG) to identify the key positions in the EEG signal and then calculated the local binary pattern (LBP) as a feature. Finally, a support vector machine (SVM) was used as a classifier.
Deep learning techniques have been widely applied in many non-stationary signal processing tasks. For example, Lopac et al. [12] used Cohen's class time-frequency representations and convolutional neural network (CNN) models to detect gravitational waves. Arias-Vergara et al. [13] used a CNN to detect speech deficits and classify phone attributes based on convolutional gated recurrent units (CGRU). Sakib et al. [14] used a CNN model to analyze raw electrocardiogram (ECG) signals for arrhythmia detection on an embedded system. In recent years, many seizure detection algorithms have also been based on deep learning models, such as long short-term memory (LSTM) [15], gated recurrent units (GRU) [16], and CNN [17], [18], [19], [20], [21], [22], [23]. The CNN is the most commonly used model architecture. Typically, the EEG is converted into two-dimensional feature images in the pre-processing and then input into a CNN for classification. For instance, Rashed-Al-Mahfuz et al. [17] used the short-time Fourier transform (STFT) and the continuous wavelet transform (CWT) to convert EEG into a two-dimensional time-frequency map and then used pre-trained VGG16 [24] and ResNet50 [25] for classification. Lai et al. [18] used a bandpass filter to obtain the 80-250Hz and 250-500Hz signals in EEG and then identified the candidate regions by calculating the short-time energy. Finally, CWT was used to convert the EEG of the candidate regions into time-frequency images, which were fed into a shallow CNN with a depth of 5 layers for classification.
However, hand-engineered feature representations may ignore temporal patterns of the original EEG signals and may be ineffective for classification [19]. Therefore, several studies have designed end-to-end models that directly use the original EEG signal as the model input. For instance, Acharya et al. [20] analyzed raw EEG signals using a 13-layer 1D-CNN. Wang et al. [21] extracted high-level representations from EEG through two convolution blocks with different kernel sizes and produced the output with a fully connected layer. Thuwajit et al. [22] used a multiscale architecture to analyze different ranges of time-domain information of EEG. Duan et al. [23] used the Siamese network architecture to train an embedding module. Through the embedding module, the EEG can be converted to the embedding space, where the classification is performed.
The main contributions of this paper are summarized as follows:
• We propose a novel end-to-end deep learning model to detect seizures based on the ResNet and Inception Net [26]. Unlike the previous end-to-end models that use different scale information independently, we use the inception module to integrate multi-scale features effectively.
• Besides using traditional metrics accuracy, sensitivity and specificity for seizure detection evaluation, we use the metric F1-score to measure and compare model performance. Experimental results show the F1-score is a more helpful metric than others, especially on the class imbalanced datasets.
• We conduct experiments and analyses on two datasets to show that our proposed model has promising classification performance and lower computational complexity than existing end-to-end models.
The rest of this paper is organized as follows. Section II describes the details of our proposed method. Section III specifies the experimental setups, including the dataset description, classification tasks, baseline methods, and evaluation metrics. Section IV presents experimental results for model performance evaluation and comparisons. Section V introduces the findings for further analysis and discussion. Conclusions with future directions are finally drawn in Section VI.

A. PROPOSED MODEL ARCHITECTURE
This paper uses the inception module, residual module, and global average pooling as feature extractors to generate feature vectors from EEG. Two fully connected layers serve as a classifier that outputs whether the EEG is inter-ictal or seizure. The overall model architecture is shown in Fig. 1.
The EEG signals first pass through three inception and residual modules for feature extraction. We then perform global average pooling and obtain the output using the fully connected layers. The inception module allows the model to perform multi-scale analysis and integration. The residual module effectively maintains gradient flow during the model training phase. Finally, the fully connected layers classify the features extracted by the inception and residual modules.

1) INCEPTION MODULE
Many seizure detection studies have achieved better classification performance through multi-scale feature extraction [19], [21], [22]. Multi-scale feature extraction enables the model to analyze EEG signals in different ranges, obtaining comprehensive features with local and global information.
We use the inception module to extract multi-scale features through convolutional layers with different convolution kernel sizes. Compared with previous multi-scale feature extraction methods, the inception module integrates features at different scales and reduces the number of model parameters [27]. Our inception module comprises three parallel convolutional blocks and a convolutional layer with a kernel size of one. Each convolutional block is composed of a convolutional layer (stride of 2, "same" padding, and 16 kernels), batch normalization, and a leaky ReLU activation function. The convolution kernel sizes of the three blocks are 3, 7, and 11, respectively. The outputs of the three parallel convolutional blocks are concatenated and then passed through the convolutional layer (kernel size of one and 16 kernels).
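As a rough check on the parameter savings, the sketch below counts the convolution weights of one inception module as described above and computes its output length. The input size (18 channels, 1024 samples) is a hypothetical example, the kernel-size-1 layer is assumed to use stride 1, and batch-normalization parameters are left out; none of these choices come from the paper.

```python
import math


def conv1d_params(in_ch, out_ch, kernel, bias=True):
    """Number of weights (plus optional biases) in a 1-D convolution layer."""
    return in_ch * kernel * out_ch + (out_ch if bias else 0)


def same_padding_out_len(length, stride):
    """Output length of a strided convolution with 'same' padding."""
    return math.ceil(length / stride)


def inception_module(in_ch, length, kernels=(3, 7, 11), filters=16, stride=2):
    """Return (conv parameter count, output channels, output length)."""
    # Three parallel branches, one per kernel size, all reading the same input.
    branches = sum(conv1d_params(in_ch, filters, k) for k in kernels)
    # The branch outputs are concatenated along the channel axis.
    merged_ch = filters * len(kernels)
    # Kernel-size-1 layer that fuses the scales (stride 1 is an assumption).
    pointwise = conv1d_params(merged_ch, filters, 1)
    return branches + pointwise, filters, same_padding_out_len(length, stride)


# Hypothetical input: 18 EEG channels, 1024 samples per segment.
params, ch, out_len = inception_module(in_ch=18, length=1024)
print(params, ch, out_len)
```

The kernel-size-1 layer maps the 48 concatenated channels back to 16, so the next module sees the same channel count regardless of how many scales were merged.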

2) RESIDUAL MODULE
The shortcut connection architecture of the residual module effectively maintains gradient flow during training to avoid the problem of exploding or vanishing gradients [25]. In addition, the skip connection mitigates the information loss after the signal passes through the convolutional layers so that the extracted features retain the original signal information [28].
Our residual module is composed of three stacked convolution blocks. We use the output of the third convolution block in the inception module as the input of our residual module. Each convolution block comprises a convolution layer with a kernel size of three, batch normalization, and leaky ReLU.
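A minimal single-channel sketch of the idea follows: three stacked convolution blocks whose output is added back to the module input through an identity shortcut. Batch normalization is omitted, and the kernel weights, the leaky ReLU slope of 0.01, and the point at which the shortcut is added are assumptions for illustration rather than the configuration of the proposed model.

```python
def leaky_relu(values, alpha=0.01):
    """Leaky ReLU: keep positive values, scale the others by alpha."""
    return [v if v > 0 else alpha * v for v in values]


def conv1d_same(signal, kernel):
    """Single-channel 1-D convolution with zero 'same' padding and stride 1.

    Assumes an odd kernel length, such as the kernel size of three used here.
    """
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(w * padded[i + j] for j, w in enumerate(kernel))
        for i in range(len(signal))
    ]


def residual_module(signal, kernels):
    """Stacked conv blocks (conv + leaky ReLU) plus an identity shortcut."""
    out = list(signal)
    for kernel in kernels:
        out = leaky_relu(conv1d_same(out, kernel))
    # Shortcut connection: add the untouched input to the block output.
    return [x + y for x, y in zip(signal, out)]


# Hypothetical input and identity kernels, chosen only to make the result easy to read.
out = residual_module([1.0, -2.0, 3.0, 0.5], [[0.0, 1.0, 0.0]] * 3)
print(out)
```

With all-zero kernels the module returns its input unchanged, which is the property that lets the original signal information survive the convolution stack.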

3) CLASSIFIER
We use the feature vectors generated by the inception and residual modules to train a classifier comprising two fully connected layers for seizure detection. The first layer has four units with a ReLU activation function. The second layer has two units with a softmax activation function.
During the training phase, if an EEG record is annotated as a seizure, its class is assigned as 1, and 0 otherwise. To prevent abnormal stopping caused by unstable loss, we apply the early stopping mechanism only when the epoch exceeds a threshold (i.e., I_min). The training loss function is categorical cross-entropy. The optimizer is Adam (epsilon=0.1, clipnorm=1), and the learning rate follows cosine decay with restarts [29]. We summarize the model parameters in Table 1. To classify an instance during the testing phase, we use the class with the higher probability as the model output.
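The decision rule, the loss, and the gated early stopping described above can be sketched as follows. The function names, the patience value, and the choice of validation loss as the monitored quantity are assumptions made for illustration; the text only states that early stopping becomes active once the epoch exceeds I_min.

```python
import math


def softmax(logits):
    """Numerically stable softmax over the two output units."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(probs, label):
    """Loss for one sample; label is 0 (non-seizure) or 1 (seizure)."""
    return -math.log(probs[label])


def predict(probs):
    """Return the class with the higher probability."""
    return max(range(len(probs)), key=probs.__getitem__)


def should_stop(epoch, val_losses, i_min, patience):
    """Early stopping that may only trigger after epoch i_min.

    Stops when the last `patience` validation losses fail to improve on the
    best loss recorded before them (patience is a hypothetical parameter).
    """
    if epoch <= i_min or len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before


probs = softmax([1.0, 3.0])
print(predict(probs), categorical_cross_entropy(probs, 1))
```

Gating on `i_min` means that a noisy loss in the first epochs cannot end training prematurely, which matches the stated motivation for the threshold.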

III. EXPERIMENT

A. DATABASE
We use the CHB-MIT and Bonn datasets to verify the model performance. The CHB-MIT dataset contains long-term EEG record files with time characteristics. In addition, its imbalanced class distribution may naturally reflect real-life seizure detection challenges. The Bonn dataset contains segmented EEG data, which allows fair performance comparisons by removing the bias caused by different pre-processing. Details are introduced as follows.

1) CHB-MIT DATASET
The CHB-MIT dataset, a scalp EEG (sEEG) dataset from Children's Hospital Boston, was sampled at 256Hz with 16-bit resolution [30]. It is a long-term EEG dataset that is commonly used to evaluate seizure detection algorithms. The entire dataset contains EEG recordings of 24 epileptic patients aged 3-22 years, with a total of 198 epileptic events and 644 edf files. The edf files are each about 1 hour long.
In this paper, according to the baseline studies [22], the following preprocessing is carried out: overlaps between each segment. 3. Use a low-pass filter to remove noise above 64 Hz. Files chb12_27.edf, chb12_28.edf, and chb12_29.edf were excluded from the experiment because they did not contain any common channels.

2) BONN DATASET
The Bonn dataset, an epilepsy dataset provided by the University of Bonn, was sampled at 173.61Hz with 12-bit resolution [31]. The dataset consists of 500 records, and each record is a single channel with 4096 sampling points, lasting about 23.6 seconds. The records are divided into five classes, A to E. Class A and Class B signals were collected from the sEEG of five healthy awake subjects. Class C, Class D, and Class E signals were collected from the intracranial EEG (iEEG) of five epileptic patients.
Class A is the signal recorded with the eyes open. Class B is the signal recorded with the eyes closed. Class C and Class D are inter-ictal signals. Class D was measured in the epileptogenic zone. Class C was measured in the hippocampal formation of the hemisphere opposite the epileptic seizure region. Class E is the signal during an epileptic seizure.

B. CLASSIFICATION TASK

1) EXPERIMENT I-SUBJECT-DEPENDENT
The subject-dependent experiment of this paper is validated on the CHB-MIT dataset using leave-one-record-out cross-validation. First, one record file is reserved as the testing set, and the remaining files are used as the training set. Then, 20% of the training set data are randomly sampled as a validation set.
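The split can be sketched as a leave-one-group-out generator. The function name, the sample representation, and the record names below are hypothetical; grouping by record file gives the protocol described here, and grouping by subject ID instead would give a leave-one-subject-out split.

```python
import random


def leave_one_group_out(samples, group_of, val_fraction=0.2, seed=0):
    """Yield (held_out_group, train, val, test) once per group.

    `group_of` maps a sample to its group key, e.g. its record file name.
    """
    rng = random.Random(seed)
    groups = sorted({group_of(s) for s in samples})
    for held_out in groups:
        test = [s for s in samples if group_of(s) == held_out]
        rest = [s for s in samples if group_of(s) != held_out]
        # Randomly move val_fraction of the remaining data to validation.
        rng.shuffle(rest)
        n_val = int(len(rest) * val_fraction)
        yield held_out, rest[n_val:], rest[:n_val], test


# Hypothetical data: 3 record files with 5 segments each.
samples = [(rec, i) for rec in ("rec_a", "rec_b", "rec_c") for i in range(5)]
for held_out, train, val, test in leave_one_group_out(samples, lambda s: s[0]):
    print(held_out, len(train), len(val), len(test))
```

Because the validation samples are drawn at random from the training records, segments of one record can appear in both the training and validation sets, while the held-out record stays fully unseen.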

2) EXPERIMENT II-SUBJECT-INDEPENDENT
The subject-independent experiment of this paper is validated on the CHB-MIT dataset and the Bonn dataset. For the CHB-MIT dataset, we use leave-one-subject-out cross-validation. First, one subject is reserved as the testing set, and the remaining subjects are used as the training set. Then, 20% of the training set data are randomly sampled as a validation set.
For the Bonn dataset, we use ten-fold cross-validation, and the eight tasks in Table 2 were performed to evaluate the model. Case1, Case2, and Case5 compare the EEG of healthy subjects with seizure EEG. Case3, Case4, and Case6 compare the inter-ictal EEG of epileptic patients with their EEG during seizures. Case7 and Case8 contain EEG signals from healthy subjects and people with epilepsy. In these cases, the epileptic EEG, non-epileptic EEG, and healthy EEG are compared.
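A plain ten-fold split over record indices might look like the sketch below. It assumes an unstratified random split and a hypothetical two-class task with 200 records; whether the folds in the experiments were stratified is not stated.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hypothetical two-class task with 200 records.
folds = list(k_fold_indices(200, k=10))
print(len(folds), len(folds[0][0]), len(folds[0][1]))
```

Every record is tested exactly once, and the reported score is then the average over the ten folds.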

C. BASELINE METHOD
We compare the proposed method with one traditional machine learning algorithm and three end-to-end deep learning models to show its classification performance. The following is a brief introduction to each method.

1) FBCSP+SVM
The filter bank common spatial pattern (FBCSP) divides the signals into different frequency bands through multiple filter banks and then multiplies the EEG signals by the projection matrix obtained with the CSP algorithm to maximize the difference between the categories of EEG signals. Finally, the most discriminative part is extracted as the feature. The bandpass filter bands of the filter banks used in this paper are 4-8Hz, 8-16Hz, 16-32Hz, and 32-55Hz. The kernel function of the SVM is "rbf".
VOLUME 11, 2023

2) EEGNET-8,2
EEGNet [38] is a convolutional neural network with a compact architecture formed by stacking three convolutional layers. EEGNet makes each layer of the network learn different features by using different convolution modes and kernel sizes. The first layer uses 2D convolution to learn frequency filters, and the second layer uses depthwise convolution to learn frequency-specific spatial filters. The third layer uses separable convolution to learn the integration and optimization of time-frequency and spatial features. Finally, a fully connected layer produces the model output. This paper uses EEGNet-8,2, which has 8 frequency filters and 2 spatial filters, as the baseline model.

3) STACKED 1D-CNN
Stacked 1D-CNN [21] is an end-to-end model, which uses two parallel convolutional blocks to extract features from the EEG and then outputs them through a fully connected layer. Two convolution blocks are formed by repeated stacking of convolutional layers, Batch Normalization, and Max Pooling three times, and the number of convolutional kernels is 32, 64, and 128, respectively. The kernel size of the convolutional layer in convolution block 1 is 3. In convolution block 2, the kernel size of the first two convolutional layers is 5, and the kernel size of the last convolutional layer is 3. Features extracted from two convolution blocks are used as the input of the classifier after a Global Average Pooling. The classifier consists of two fully connected layers, whose unit number is 128 and 2, respectively.

4) EEGWAVENET
EEGWaveNet [22] is an end-to-end model consisting of a multiscale convolution module, a spatial-temporal feature extractor, and a classifier. The multiscale convolution module uses a 6-layer depthwise convolution (kernel size=2, strides=2), so the length of the signal is halved after each layer. The feature maps generated by the depthwise convolution from layer 2 to layer 6 are used as the output of the multiscale convolution module. The spatial-temporal feature extractor stacks a convolutional layer (kernel size=4, strides=2) and batch normalization three times and then performs global average pooling. It outputs a 32-dimensional feature vector at each scale, and the feature vectors of the 5 scales are concatenated into a 160-dimensional feature vector as the classifier input. The classifier consists of three fully connected layers with 64, 32, and 2 units, respectively.
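The arithmetic behind the five scales and the 160-dimensional classifier input can be checked with a few lines. The 1024-sample input length is a hypothetical example, and the convolution is assumed to use no padding.

```python
def multiscale_lengths(length, n_layers=6, stride=2):
    """Signal length after each depthwise layer (kernel size 2, stride 2)."""
    lengths = []
    for _ in range(n_layers):
        length //= stride  # an unpadded kernel-2, stride-2 convolution halves the length
        lengths.append(length)
    return lengths


lengths = multiscale_lengths(1024)    # hypothetical 1024-sample input
used = lengths[1:]                    # layers 2 to 6 feed the feature extractor
feature_dim = 32 * len(used)          # one 32-dimensional vector per scale
print(lengths, feature_dim)
```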

D. EVALUATION METRICS
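In this paper, Accuracy, Sensitivity, Specificity, and F1-score were used to evaluate the classification performance of the model. Accuracy reflects the discrimination ability of the model over the entire data. With TP, TN, FP, and FN denoting the numbers of true positives, true negatives, false positives, and false negatives, the metrics are calculated as follows.

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \tag{2}$$

$$\mathrm{Specificity} = \frac{TN}{TN + FP} \tag{3}$$

F1-score is the harmonic mean of Precision and Sensitivity, where Precision = TP / (TP + FP). For imbalanced datasets, F1-score can more accurately reflect the classification performance of the model. Its calculation method is shown in Eq. (4).

$$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Sensitivity}}{\mathrm{Precision} + \mathrm{Sensitivity}} \tag{4}$$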
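The sketch below computes the four metrics from confusion-matrix counts and applies them to a classifier that labels every sample as negative. The counts are hypothetical, chosen so that about 98.4% of the samples are negative, and returning 0 when a denominator is zero is a convention assumed here.

```python
def seizure_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, and F1-score from confusion counts."""

    def ratio(num, den):
        # Assumed convention: an undefined ratio is reported as 0.
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
    }


# Hypothetical test set: 100 seizure and 6300 normal samples, all predicted negative.
all_negative = seizure_metrics(tp=0, tn=6300, fp=0, fn=100)
print(all_negative)
```

Accuracy and specificity stay high for this useless classifier, while sensitivity and F1-score drop to zero because not a single seizure sample is detected.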
Many seizure detection studies used accuracy, sensitivity, and specificity as evaluation metrics for performance comparisons [16], [23], [36], [39]. However, accuracy and specificity may be biased by class imbalance. For example, about 98.4% of the CHB-MIT dataset are negative samples. A classifier that predicts all testing instances as the negative class will achieve a high accuracy of 98.4% and a perfect specificity of 100%. Therefore, the F1-score, which is the harmonic mean of recall (i.e., sensitivity) and precision, is usually more helpful than accuracy, especially for data with an uneven class distribution.
Table 3 shows the average classification performance of different algorithms in the subject-dependent experiment. Compared with the previous end-to-end models, the proposed method has the best performance in all performance metrics, and its number of model parameters exceeds only that of EEGNet-8,2 [38]. FBCSP+SVM [34] has the highest accuracy and specificity, but its F1-score is the lowest among the compared methods. The sensitivity achieved by FBCSP+SVM was even clearly lower (about 10%) than that of the other methods, revealing that seizure cases cannot be effectively detected. This confirms that accuracy and specificity are not proper metrics on class-imbalanced data like the CHB-MIT dataset. The model structure of EEGNet-8,2 is the most concise and has the fewest parameters, but the overly simple model structure also causes its classification performance to be slightly lower than that of the other models.
Fig. 2 shows the model prediction results for the record file chb05_22.edf, which are, in sequence, the seizure probability output by the model, the result after the moving average, the result after thresholding, and the real label. Fig. 2(a) shows the results of the entire file. It can be seen that even if the model produces false detections, they can be effectively suppressed by the moving average and thresholding. Fig. 2(b) zooms in on the seizure event region, where the gray background marks the seizure period. The detection errors for this seizure event were only one false positive at the onset of the seizure and one false negative during the seizure. The F1-scores of chb04, chb06, chb08, chb13, chb14, chb16, chb23, and chb24 are lower than 60%. The characteristics of each patient's data and the reasons for the training difficulty will be discussed in the following section.
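The post-processing mentioned for Fig. 2 can be sketched as a moving average followed by a threshold. The window length, the threshold value, and the use of a trailing (causal) window are assumptions, since the text does not specify them.

```python
def moving_average(values, window):
    """Trailing moving average; early outputs average the samples seen so far."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def detect(probabilities, window=3, threshold=0.5):
    """Smooth the per-segment seizure probabilities, then binarize them."""
    return [1 if p >= threshold else 0 for p in moving_average(probabilities, window)]


# Hypothetical model outputs: one isolated spike, then a sustained seizure.
probs = [0.1, 0.7, 0.1, 0.1, 0.8, 0.9, 0.9, 0.9, 0.2, 0.1]
print(detect(probs))
```

The isolated spike at the second position never pushes the smoothed value over the threshold, so it is suppressed, at the cost of a short delay at the onset and offset of the sustained event.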

B. RESULTS OF SUBJECT-INDEPENDENT EXPERIMENTS
We also conducted the subject-independent experiment on both the CHB-MIT and Bonn datasets. Table 4 shows the average classification performance of different algorithms in the subject-independent experiments. Compared with the previous end-to-end models, the proposed method has the best Accuracy, Specificity, and F1-score. Its Sensitivity is second only to that of Stacked 1D-CNN [21]. Although Stacked 1D-CNN has the best Sensitivity, its F1-score is lower than that of the proposed method, which means that Stacked 1D-CNN produces more false positives. FBCSP+SVM has the same problem as in the subject-dependent experiment, and its Sensitivity is 10-20% lower than that of the other methods.
Table 5 shows the experimental results of different algorithms on the Bonn dataset. According to Table 5, the proposed method still achieves excellent classification performance on datasets with a small amount of data, with an average accuracy of 99.04%. The proposed method outperforms [21], [22], [38], [40], [41], and [42]. Compared with [17], [43], [44], and [45], the classification accuracy of the proposed method is slightly worse, but [17], [43], and [45] did not report the accuracy of Case 8, and the difference between Class AB and Class CD cannot be clearly distinguished in their feature representations. Furthermore, those methods use hand-engineered feature representations that may be ineffective for classification [19].
The experiments and results on the CHB-MIT dataset are closely similar to those of the related study conducted by Thuwajit et al. [22]. The lower F1-score was mainly caused by the extremely imbalanced class distribution. The F1-scores of the previous studies [10], [19] were higher than 80% on class-balanced data such as the Bonn dataset. We also analyzed the F1-scores in the experiments on the Bonn dataset. Our proposed method achieved a Macro-averaging

V. DISCUSSION
In this section, we discuss the causes of false positives and missed detections. They can be mainly divided into noise, extreme class imbalance, and heterogeneity.

A. NOISE
We found that the subjects chb04, chb08, chb13, chb23, and chb24 in the CHB-MIT dataset had long-duration, high-amplitude, multi-band noise. The noise appears in multiple channels at the same time, and it is difficult to filter out through pre-processing because the noise frequency of each subject is different. Such noise can cause the model to generate a large number of false positives. The short-time Fourier transform (STFT) results and the model output for the normal record file (chb04_28.edf) are shown in Fig. 3 (a), which presents, in sequence, the STFT result, the seizure probability output by the model, the result after the moving average, and the real label. The abnormal record file (chb04_05.edf) is shown in Fig. 3 (b). In Fig. 3 (b), high-energy noise appears at about 3.5Hz, 8Hz, 11Hz, 15Hz, etc. A large number of false positives appear in the model output at the same positions as the high-energy noise.

B. EXTREME CLASS IMBALANCE
In the experiments of this paper, CHB-MIT is an imbalanced dataset with a class ratio of about 1:63 (Seizure: Normal). For a few subjects, such as chb04, chb06, and chb14, the number of normal samples was more than 100 times the number of seizure samples, which caused the model to ignore seizure samples during training. The subject chb16 has only 24 seizure samples, which made the model unable to learn more general features.

C. HETEROGENEITY
EEG classification of epilepsy has a high degree of heterogeneity within the same category. The EEG of the same subject varies with the subject's current physical state when no seizure is occurring and with the symptoms of epilepsy when a seizure occurs. The same class of EEG from different subjects also varies with their physiological characteristics. Fig. 4 (a) and (b) show the distribution, after dimension reduction by the t-SNE algorithm [46], of the feature vectors produced by the proposed model for the chb11 training data in the subject-dependent experiment. Fig. 4 (c) and (d) show the corresponding t-SNE distribution of the feature vectors of chb10, chb18, and chb24 from the training data in the subject-independent experiment (with chb01 as the test data). Fig. 4 (a) and (c) are labeled according to data categories, and Fig. 4 (b) and (d) are labeled according to data sources. We can find that although the feature vectors extracted by the model can clearly distinguish seizure and non-seizure states, they also make the source of the data easy to identify. Fig. 4 (b) shows that the samples from the chb11_82.edf file are mainly distributed on the left side of the figure, and the samples from chb11_92.edf are mainly distributed on the right side. Fig. 4 (d) shows that the samples from chb18 are mainly distributed on the upper side of the figure, the samples from chb10 on the left side, and the samples from chb24 on the right side. According to the results in Fig. 4, the model cannot eliminate the differences caused by different seizure events or different subjects when generating feature vectors and therefore cannot obtain more normalized EEG features.

VI. CONCLUSION
We propose a novel end-to-end deep learning model based on the inception and residual modules for seizure detection. With an end-to-end architecture, we do not need domain expert knowledge to design feature engineering. Compared to the previous models, our method integrates information at different scales through the inception module for feature extraction. In addition, the residual module maintains gradient flow effectively during the model training phase. We conducted subject-dependent and subject-independent experiments on the CHB-MIT and Bonn datasets. The previous methods, including FBCSP+SVM, EEGNet-8,2, Stacked 1D-CNN, and EEGWaveNet, were used to compare model performance. The experimental results show that our proposed model has the best classification performance on the CHB-MIT dataset. The macro-averaging F1-scores of the subject-dependent and subject-independent experiments are 69.34% and 37.31%, respectively. Our method has a macro-averaging accuracy of 99.04% on the Bonn dataset. Compared with the previous seizure detection methods that need a feature engineering process, our approach uses the original EEG signal as the model input, resulting in fewer parameters and lower computational complexity. The advantages of our proposed method are summarized as follows:
• It uses EEG signals directly to avoid highly complex feature extraction and the ineffectiveness risk of hand-engineered features.
• It requires fewer parameters and fewer computing resources.
• It integrates information at different scales in feature extraction through the inception module to achieve better classification performance.
As the error cases we analyzed and discussed show, our model is limited in three aspects that cause performance degradation: noise, class imbalance, and heterogeneity. Therefore, future work will investigate several directions. First, the denoising autoencoder [47] may help reduce noise. Second, for the class imbalance problem, deep metric learning [23] could be used to transform the classification problem into a similarity problem during model training. Finally, domain-adversarial learning [48], which trains domain classifiers against a feature extractor, may help address the heterogeneity limitation.