Hybrid Convolutional Neural Network for Localization of Epileptic Focus Based on iEEG

Epileptic focus localization by analysing intracranial electroencephalogram (iEEG) plays a critical role in successful surgical therapy of resection of the epileptogenic lesion. However, manual analysis and classification of the iEEG signal by clinicians are arduous and time-consuming and excessively depend on the experience. Due to individual differences of patients, the iEEG signal from different patients usually shows very diverse features even if the features belong to the same class. Accordingly, automatic detection of epileptic focus is required to improve the accuracy and to shorten the time for treatment. In this paper, we propose a novel feature fusion-based iEEG classification method, a deep learning model termed Time-Frequency Hybrid Network (TF-HybridNet), in which short-time Fourier transform (STFT) and 1d convolution layers are performed on the input iEEG in parallel to extract features of the time-frequency domain and feature maps. And then, the time-frequency features and feature maps are fused and fed to a 2d convolutional neural network (CNN). We used the Bern-Barcelona iEEG dataset for evaluating the performance of TF-HybridNet, and the experimental results show that our approach is able to differentiate the focal from nonfocal iEEG signal with an average classification accuracy of 94.3% and demonstrates an improved accuracy rate compared to the model using only STFT or one-dimensional convolutional layers as feature extraction.


Introduction
Epilepsy is a chronic disease of the brain, and it is characterized by recurrent and unpredictable seizures, which are brief episodes of involuntary movements and even accompanied by transient loss of consciousness [1]. Currently, with approximately 50 million epilepsy sufferers worldwide according to the World Health Organization (WHO), epilepsy is one of the most common neurological diseases globally, making it a significant challenge for healthcare and social services [1]. The cause of seizure episodes is excessive electrical discharging in a group of brain neurons. Electroencephalography is therefore a commonly used method to measure brain activity through the recording of electrical activity and has been widely used for the diagnosis and treatment of varied neurological conditions such as brain death, epilepsy, Alzheimer's, and coma [2]. Considering it is uncertain that symptoms will pres-ent in the iEEG signal at all times, and iEEG should be monitored and recorded in the long term. During this process, huge amounts of data are generated and experienced neurological experts subsequently analyse abnormalities in brain activities via visual inspection. This task is time-consuming that could lead to a serious delay of days or even weeks of treatment. In recent years, various automatic diagnostic methods have been proposed to assist neurologists by accelerating the interpreting process, thereby reducing workload [3,4]. Current methods mainly focused on tasks such as seizure detection, seizure prediction, and seizure type classification. Statistics show that up to 70% of patients could be successfully treated with the proper use of antiepileptic drugs (AEDs) [1]. Nevertheless, for patients who respond poorly to drug treatments, resection surgery for epileptogenic tissues might be one of the most promising treatments in controlling epileptic seizures. Hence, it is crucial to determine the seizure area in surgical therapy, and there is a very strong demand for the automatic detection of epileptic focus localization. iEEG is recorded directly from the cerebral cortex, and iEEG signals recorded from the epileptogenic area are more stationary and less random than iEEG signals recorded from the normal area [5]. This nature makes it enable to be used for identification of location effectively. The task of seizure focus localization is not largely developed owing to factors such as the complex nature of the task and rare clinical datasets [6]. The mainly used dataset is the publicly available Bern-Barcelona iEEG dataset, which was collected by Andrzejak et al. at the Department of Neurology of the University of Bern [7].
In recent years, various automatic focus detection methods through the classifying iEEG signal into focal and nonfocal have been proposed [8]. Most of those are usually divided into three main steps, preprocessing, feature extraction, and classification. In the preprocessing step, various filtration or normalization is applied to the raw signal. In the feature extraction step, to extract the most discriminative features, commonly used methods include empirical mode decomposition (EMD) [9], entropy, and time-frequency analysis methods like Hilbert-Huang transform, Fourier transform (FT) [10], STFT [11], and wavelet transforms (WT) [12,13]. Particularly, STFT has been established that it is suitable for iEEG signal processing by extracting time-frequency domain features [14]. In the classification step, support vector machines (SVM) [9], logistic regression (LR) [15], and K-Nearest Neighbor (KNN) method [16] are usually be used. With the rapid development of deep learning models, automatic feature-based approaches have been successfully applied to classification problems [17]; in particular, CNN is regarded as one of the most successful and widely used deep learning models. In our previous research work [18,19], two individual feature extraction methods, the Time-Frequency Convolutional Neural Network (TFCNN) and the Mixed-CNN, were proposed and have proven to be effective. The main contribution of this paper is that we propose TF-HybridNet, a deep learning model to diversify features by combining time-frequency analysis and learnable automatic feature extraction methods. We compare the TF-HybridNet accuracy with the TFCNN and Mixed-CNN, and experiments show that the multifeature extraction method produces higher iEEG classification accuracy compared to the individual feature extraction method. The proposed framework does not only have strong feature learning capabilities but also have adaptive iEEG features for higher classification performance without much human intervention.
The rest of the article is organized as follows: Section 2 describes the dataset used in the experiment and the method of CNN and STFT. Section 3 describes a comparison of the architecture of three deep learning models proposed. The experimental results are presented in Section 4, and the last is the conclusion of this paper.

Materials and Methods
In this section, we firstly introduce the Bern-Barcelona iEEG dataset. Then, we discuss the STFT, a powerful generalpurpose tool for time-frequency domain feature extraction. The working principles used in this paper are described at last, including a 1d convolutional layer, 2d convolutional layer, and various neural network components. This collection of working principles and components provides an essential basis for the three deep learning models proposed in Section 3.

Dataset.
The Bern-Barcelona iEEG dataset was recorded from five patients suffering from long-standing drugresistant temporal lobe epilepsy which were candidates for surgery. The signal recorded from the focus region (lesion) was labeled as the focal signal; otherwise, the signal was labeled as the nonfocal signal. The dataset contains 3750 focal iEEG signal pairs and 3750 nonfocal iEEG signal pairs. Each pair of iEEG signal from adjacent channels was sampled for 20 seconds at a frequency of 512 Hz and was band-pass filtered between 0.5 and 150 Hz with a fourth-order Butterworth filter. The iEEG signal recorded during the seizure and three hours after the last seizure was excluded to guarantee to discard the seizure iEEG signal. An example of the focal and nonfocal iEEG signal is shown in Figure 1, respectively.

Short-Time Fourier Transform.
On account of the instability of the iEEG signal, it is extremely difficult to extract the key features by some commonly used time-frequency analysis methods such as Fourier transform [20]. The STFT, as a Fourier-related transform, is used to equally divide the raw signal into shorter segments of length by a window function which is nonzero for only a short period of time, so that the segments of the signal are approximately stationary. The Fourier transform of the shorter segments is computed as the window function is sliding along the time axis, obtaining the spectrum, a 2-dimensional representation of the signal. Hence, it is demonstrated that the time-frequency domain features extracted by STFT are suitable for classifying iEEG signal of epilepsy [20]. For a determined signal xðtÞ, the time-frequency domain at each time point can be obtained by where wðtÞ is the Hann window function centered around zero. Examples of the spectrogram of the iEEG signal (focal and nonfocal) are shown in Figure 2.

Convolutional Neural
Network. CNN is a subset of deep learning which has recently been successfully used in numerous tasks in different research fields of images and time series classification (TSC), such as biomedical imaging, iEEG/Electrocardiography (ECG) signal, and motion sensor data and speech. The CNN model consists of an input and an output layer, as well as multiple hidden layers, and the early layers following input layers are general convolutional layers. Convolution is a mathematical operation that is used to extract the feature map by sliding the convolution kernel over the input data, which helps extract particular features. For an input vector f with length n and a kernel g with length m, f * g is the convolution of f and g and is defined as 2.3.2. Two-Dimensional Convolutional Layer. For 2d convolution, just as 1d convolution, we slide the 2d kernel over each pixel of the input image and then multiply the corresponding entries of the input image and kernel; the sum of the results will be the value of the feature map. The activation map is obtained by computing the dot product of the input file and  3 Neural Plasticity the filter. Then, after additive bias and a nonlinear map by activation functions, feature maps of the convolutional layer are output to feed to the next layer in the CNN model.

Pooling Layer.
After the convolution layer, feature maps are usually passed to the pooling layer and different from the convolution operation; pooling has no parameters. In the pooling layer, the feature maps are separated into many rectangle regions, and then, each region feature is obtained. It enables downsampling each feature map independently to reduce the dimensionality, lower the calculation complexity, and prevent overfitting. Various pooling operations, for instance, max pooling operation, select only the maximum value in the pooling window, while mean pooling obtains the mean value of the pooling window.

Batch Normalization
Layer. The batch normalization layer is applied to normalize the output feature map obtained from the previous layer by subtracting the batch mean and dividing by batch standard deviation, to fight the internal covariate shift problem and increase the stability of neural networks. For the input x obtained from the previous layer, the batch normalization layer first calculates the mean μ B and variance σ 2 B of a minibatch B of size m by equations (3) and (4). Then, normalized values x i are calculated as equation (5) where ε is a constant added to the minibatch variance for numerical stability. Finally, the x i are shifted and scaled as equation (6) that the parameters γ and β are to be learned.
2.3.5. Fully Connected Layer. In the fully connected layer, all the 2d feature maps from the upper layer are represented by a one-dimensional feature vector as the input of this layer. In this paper, the output is obtained by doing dot products between the feature vector and learnable weight vector, adding learnable bias and then responding to the activation function.

Neural Network Architecture
Conventional CNNs are hierarchical architectures based on an alternation of convolutional layers with pooling layers and batch normalization layers and followed by a fully connected layer.

Time-Frequency Convolutional Neural
Network. In our previous research, we proposed an architecture that combines time-frequency analysis and a two-dimensional convolutional neural network. A TFCNN network consists of an STFT layer, five subsequent stages, five FC stages, a dropout layer, and a final output layer, which is illustrated in Figure 3(a).
In that architecture, the iEEG signal is firstly transformed by the STFT layer to extract local features individually based on the local correlation among the time-frequency domain. Then, discriminative features that are built by connecting the local features are learned, and classification is performed by the TFCNN. The specific training process is as follows: the time-frequency spectrogram with size 257 × 101 is firstly convoluted by using a 3 × 3 filter by sliding with stride 1 and set 10 channels to feature map, and each feature map has the same size as the input spectrogram. Then, batch normalization (BN) and max pooling operation are successively implemented in the batch normalization layer and max pooling layer. And these two steps are repeated 5 times, except that the size of input and output is decided by the former layer, and channels of the feature map increase exponentially.

Mixed Convolutional Neural Network.
In the previous TFCNN architecture, before feeding into the neural network, the signal needs to perform extraction and selection of features manually. The most used time-frequency analysis method like STFT has the capability to extract local information at a onetime scale determined by a single filter, limiting the flexibility of the model. To address this problem, considering a convolution can be seen as applying and sliding a filter over the time series; instead of the STFT, we use 1d convolution layers in the earlier layers. It is easier to optimize the parameter configuration when each layer is treated independently, and it also enables using different input feature maps or receptive field sizes. A Mixed-CNN consists of eight convolution stages, five FC stages, a dropout layer, and a final output layer, which is illustrated in Figure 3(b). From stages 1 to 3, each stage begins with an 8 × 1 1d convolution layer with a stride of 2, which is then followed by the BN layer and 3 × 1 max pooling layer also with a stride of 2. The size of the output feature map of stage 3 is 159 × 256. The feature maps from 1d convolution layers are reshaped and then successively fed to subsequent five 2d convolution stages and fully connected layer to perform further feature extraction and classification.

Time-Frequency Hybrid Network.
Inspired by the performances of the previous two models, we propose a hybrid model combining time-frequency analysis and Mixed-CNN. As shown in the architecture of TF-HybridNet in Figure 3(c), before feeding into 2d convolution layers to perform further feature extraction, the spectrogram obtained from STFT and the feature map obtained from 1d convolutional layers are both adjusted into a size of 160 × 257 and are stacked together in sequence depthwise. The remaining parts of the model are almost the same as Mixed-CNN.

Results and Discussion
The proposed models were implemented on a workstation with 12 Intel Core i7 3.50 GHz (5930 K), a GeForce RTX 2080 Ti graphics processing unit (GPU), and 128 GB random-access memory (RAM) using the Python programming language on the TensorFlow framework. The 5-fold cross-validation and he 10-fold cross-validation are used in this paper. In 5-fold cross-validation, 60% of the dataset is used as the training set

Neural Plasticity
and 20% is used as a validation set while the remaining 20% is used as the test set. And in 10-fold cross-validation, the distribution proportion is set to 80%, 10%, and 10%. It requires a lot of computational overhead to use one iteration of the full training set to perform each epoch; therefore, in each epoch of the training, the training set is randomly divided into 100, 120, and 200 batches separately in TFCNN, Mixed-CNN, and TF-HybridNet, which are fed into the network in turn.
The training performance of the model was monitored during the training stage until getting the best accuracy on the training set with minimum train loss. And we validate the networks by using a validation set after each epoch of training. The accuracy of the validation set across classification by the different models is shown in Figure 4. It can be seen that there is no overfitting problem in the three models; during the training period, the validation accuracy is steady at the end of the training. Finally, for performance evaluation  The abbreviations represent true positive (TP), false positive (FP), true negative (TN), and false negative (FN).
Confusion matrices and performance measures through 5fold/10-fold cross-validation obtained for TFCNN, Mixed-CNN, and TF-HybridNet are shown in Table 1. Our results show that the developed TF-HybridNet model performed better than the other two models both during the training and testing periods. And considering that the result of 10-fold crossvalidation is better than that of 5-fold cross-validation, the performance could be improved with more numbers of iEEG data. Compared with other published state-of-the-art methods shown in Table 1, the proposed TF-HybridNet managed to obtain 94.3% accuracy. And the advantage of this method is that it is less signal preprocessing for feature extraction and selection.

Conclusions
Since the manual visual inspection of iEEG is a timeconsuming process, automation of the detection of epileptic focus by an effective classifier will have the potential to solve the delay issue in treatment. In the case of signal feature extraction, STFT may be able to extract some but not all specific features of the time-frequency domain and the convolutional layer is similarly able to extract partial latent features of iEEG. In addition, iEEG signal from different patients usually shows very diverse features due to individual differences of patients, even if the features belong to the same class. It leads to that each feature extraction methods usually obtain different results on different datasets in the signal processing field.
To solve these problems, we propose to adopt a feature fusion-based iEEG classification method which can make up for the shortage of traditional feature extraction and deep learning techniques. In this paper, we present and compare the performance of three different models (TFCNN, Mixed-CNN, and TF-HybridNet) for iEEG signal classification as the focal and nonfocal iEEG signal. Among the three models, the TF-HybridNet model performs the best result both 5-fold and 10-fold. Even though this proposed model could not 7 Neural Plasticity yield the best classification performance as compared to the published works shown in Table 2, the proposed TF-HybridNet model still managed to obtain 94.3% accuracy. This shows that the TF-HybridNet is effective with much efficiency and timesaving to assist neurological clinicians to detect the focal epileptic seizure area.

Conflicts of Interest
The authors declare that they have no conflicts of interest.