Automatic Seizure Detection and Prediction Based on Brain Connectivity Features and a CNNs Meet Transformers Classifier

(1) Background: Epilepsy is a neurological disorder that causes repeated seizures. Since electroencephalogram (EEG) patterns differ in different states (inter-ictal, pre-ictal, and ictal), a seizure can be detected and predicted by extracting various features. However, the brain connectivity network, a two-dimensional feature, has rarely been studied. We aim to investigate its effectiveness for seizure detection and prediction. (2) Methods: Two time-window lengths, five frequency bands, and five connectivity measures were used to extract image-like features, which were fed into a support vector machine (SVM) for the subject-specific model (SSM) and a convolutional-neural-networks-meet-transformers (CMT) classifier for the subject-independent model (SIM) and cross-subject model (CSM). Finally, feature selection and efficiency analyses were conducted. (3) Results: The classification results on the CHB-MIT dataset showed that a long window yielded better performance. The best detection accuracies of the SSM, SIM, and CSM were 100.00%, 99.98%, and 99.27%, respectively; the highest prediction accuracies were 99.72%, 99.38%, and 86.17%, respectively. In addition, Pearson correlation coefficient and phase locking value connectivity in the β and γ bands showed good performance and high efficiency. (4) Conclusions: The proposed brain connectivity features showed good reliability and practical value for automatic seizure detection and prediction, which is expected to support the development of portable real-time monitoring equipment.


Introduction
Epilepsy is a common neurological disorder worldwide, causing a huge burden to patients and their families. It is a transient brain dysfunction caused by sudden abnormal and hyper-synchronous discharges of neurons [1]. The neuronal discharge pattern of epilepsy generally goes through three stages: the inter-ictal, pre-ictal, and ictal phases [2]. Seizure detection refers to identifying the ictal phase, which is time-consuming for clinicians as it involves visually examining electroencephalogram (EEG) changes; a seizure detection model can improve detection efficiency and accuracy. Approximately 30% of patients have intractable epilepsy [3], and the unpredictability of seizure recurrence leads to a serious psychosomatic impact. Therefore, a seizure prediction model is required to identify the pre-ictal phase and thus detect impending seizures [4]. The seizure prediction horizon (SPH) represents the time range within which seizures can be predicted in advance. The strategies for training and validating seizure detection and prediction models fall into three categories: the subject-specific model (SSM), the subject-independent model (SIM), and the cross-subject model (CSM). Considering both classification performance and efficiency in practical applications, we adopt SVM as the classifier for the SSM and the CMT network for the SIM and CSM.

Proposed Framework
The workflow is illustrated in Figure 1. First, the multichannel raw EEG data were preprocessed, including bandpass filtering, epileptic period division (inter-ictal, pre-ictal, and ictal), and segmentation. Second, five physiological frequency bands were extracted, and five connectivity measures were adopted for each band, including three FC and two EC measures. Third, all connection matrices were combined into one large matrix as an image-like feature. We chose SVM as the classifier for the SSM and CMT for the SIM and CSM. Finally, the performance of the models was evaluated, the optimal features (connectivity measures and frequency bands) were selected, and their efficiency in practical applications was analyzed.

EEG Datasets
We adopted the free online dataset CHB-MIT [22], which includes 23 scalp EEG sets collected from 23 people with epilepsy. Details are presented in Table 1. ID 21 and ID 01 are from the same subject with an interval of 1.5 years, but the two recordings were treated as two different cases here due to the long interval [23]. The sampling rate was 256 Hz, and experts manually marked the start and end times of the seizures. There were different montages in CHB-MIT, including Montage A (23 channels), as shown in Figure 2a, and Montage B (28 channels: Montage A plus 5 "virtual" signal channels). Most of the recordings adopted Montage A or its extensions. In Figure 2a, the EEG channels used a bipolar reference to estimate the potential differences between two adjacent electrodes. This bipolar montage can offer better artifact rejection and sharper spatial localization than referential montages [24,25]. Moreover, because bipolar EEG is not a potential itself but the derivative of the potential, it largely avoids the volume conduction problem [24,26].
Figure 2. (a) The EEG montage. Most EEG signals were recorded using the international 10-20 electrode system, and two electrodes (FT9 and FT10) were based on the 10-10 electrode system. The EEG channels adopting the bipolar montage are listed on the right, where each electrode's voltage is compared with that of an adjacent electrode to form a chain of electrodes. The bipolar montage can offer better artifact rejection than referential montages, and it is free of the volume conduction problem [24-26]. (b) The definition of the different EEG states. The ictal phase was extracted according to the experts' manual marks. The interval from 15 to 30 min before the onset of each seizure was defined as the pre-ictal period, so the SPH here was 15-30 min [27]. The inter-ictal state lay within the interval between half an hour after the end of a seizure and the onset of the next pre-ictal state [28].

Preprocessing
All extra channels were removed from all extended montages of Montage A. For example, the five "virtual" signal channels in Montage B were deleted. The channel order was rearranged to match Montage A, and channel 23 was deleted because it is identical to channel 15. In this manner, we obtained 22-channel EEG data from 24 cases. In a pre-experiment, we designed two band-pass filters, at 0.5-40 Hz and at 0.5-100 Hz, to process the EEG recordings and conducted the classification tasks. The results showed that the 0.5-100 Hz filter contributed more to seizure detection, while the 0.5-40 Hz filter worked better for seizure prediction. More details and discussion are given in the Supplementary Material. Therefore, for seizure detection, a Butterworth filter at 0.5-100 Hz was applied to capture the high-frequency features of the ictal phase, together with a notch filter at 60 Hz to remove powerline noise. For seizure prediction, a Butterworth filter at 0.5-40 Hz was used to remove high-frequency noise and keep the useful information of the pre-ictal phase [29].
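The two filtering pipelines above can be sketched with SciPy. This is a minimal sketch, assuming a 4th-order Butterworth design and a notch quality factor of 30; the text specifies only the pass bands and the 60 Hz notch frequency.

```python
import numpy as np
from scipy.signal import butter, iirnotch, filtfilt

FS = 256  # CHB-MIT sampling rate (Hz)

def detection_filter(eeg, fs=FS, order=4):
    """Band-pass 0.5-100 Hz plus a 60 Hz notch, as used for seizure detection.
    Filter order and notch Q are assumptions not stated in the text."""
    b, a = butter(order, [0.5, 100.0], btype="bandpass", fs=fs)
    out = filtfilt(b, a, eeg, axis=-1)           # zero-phase band-pass
    bn, an = iirnotch(60.0, Q=30.0, fs=fs)
    return filtfilt(bn, an, out, axis=-1)        # remove powerline noise

def prediction_filter(eeg, fs=FS, order=4):
    """Band-pass 0.5-40 Hz, as used for seizure prediction."""
    b, a = butter(order, [0.5, 40.0], btype="bandpass", fs=fs)
    return filtfilt(b, a, eeg, axis=-1)

# Example: filter 10 s of 22-channel noise
x = np.random.randn(22, 10 * FS)
y = detection_filter(x)
```

Zero-phase filtering (`filtfilt`) avoids the phase distortion that would otherwise bias the phase-based connectivity measures computed later.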
Next, the different EEG states [27,28] were extracted, as shown in Figure 2b. Subsequently, two non-overlapping rectangular windows with different lengths (1 s and 8 s) were applied to segment the EEG signals. It has been reported that long windows yield higher classification accuracy, whereas short windows allow faster feature calculation [6]. Finally, two datasets were prepared: one containing ictal and inter-ictal EEG segments for seizure detection, and the other containing pre-ictal and inter-ictal segments for seizure prediction. In seizure detection, because the duration of a seizure event was much shorter than that of the inter-ictal periods, we made the number of inter-ictal segments consistent with that of the ictal state to balance the classes. The number of ictal segments is listed in Table 1; the total number of 1-s segments of each category was 11,051, and the total number of 8-s segments was 1321. In seizure prediction, only the pre-ictal and inter-ictal phases were required, so the number of segments was not greatly limited. Therefore, we prepared 800 1-s segments and 100 8-s segments of each class for each subject, resulting in a total of 19,200 1-s segments and 2400 8-s segments of each category.
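The non-overlapping windowing step can be sketched as a NumPy reshape; the channel count and window lengths follow the text, and a trailing partial window is simply discarded (an assumption, since the text does not say how remainders are handled).

```python
import numpy as np

def segment(eeg, fs=256, win_sec=8):
    """Split a (channels, samples) recording into non-overlapping
    rectangular windows of win_sec seconds; a trailing partial
    window is discarded."""
    win = win_sec * fs
    n_seg = eeg.shape[1] // win
    trimmed = eeg[:, : n_seg * win]
    # (channels, n_seg * win) -> (n_seg, channels, win)
    return trimmed.reshape(eeg.shape[0], n_seg, win).transpose(1, 0, 2)

x = np.random.randn(22, 256 * 100)   # 100 s of 22-channel EEG
segs_1s = segment(x, win_sec=1)      # 100 one-second segments
segs_8s = segment(x, win_sec=8)      # 12 eight-second segments
```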
Table 1. Details of the cases in the CHB-MIT dataset.

ID   Gender ¹   Age (Years)   Channels   Seizures ²   Ictal 1-s Segments   Ictal 8-s Segments
1    Female     11            23         7            442                  53
2    Male       11            23         3            172                  21
3    Female     14            23         7            402                  47
4    Male       22            23         4            378                  45
5    Female     7             23         5            558                  68
6    Female     1.5           23         10           153                  14
7    Female     14.5          23         3            325                  39
8    Male       3.5           23         5            919                  113
9    Female     10            23         4            276                  32
10   Male       3             23         7            447                  53
11   Female     12            28         3            806                  100
12   Female     2             28         27           989                  112
13   Female     3             28         10           440                  51
14   Female     9             28         8            169                  20
15   Male       16            38         20           2012                 252
16   Female     7             28         8            69                   8
17   Female     12            28         3            293                  36
18   Female     18            28         6            317                  36
19   Female     19            28         3            236                  28
20   Female     6             28         8            294                  32
21   Female     13            28         4            199                  24
22   Female     9             28         3            204                  25
23   Female     6             28         7            424                  49
24   Unknown    Unknown       23         15           527                  63
Sum                                      180          11,051               1321

¹ An evident gender bias (male/female = 5/18) exists in the CHB-MIT dataset. ² The number of seizures was counted only for Montage A and its extensions.


Feature Extraction
To extract rich features, a bandpass filter was employed to decompose each EEG segment into five physiological frequency bands [30]: δ (0.5-4 Hz), θ (4-8 Hz), α (8-13 Hz), β (13-30 Hz), and γ (30-40 Hz). There are three types of brain connectivity [12,13]: (1) structural connectivity, the anatomical connections between brain neurons; (2) functional connectivity (FC), a statistical interdependence between different neuronal activities, which is undirected; and (3) effective connectivity (EC), the causal effect of one neural region on another, which is directed. Only FC and EC were calculated in each band because structural connectivity cannot be obtained from EEG data. We selected measures based on different mathematical assumptions to obtain various adjacency matrices of a brain network.

Pearson Correlation Coefficient (PCC)
First, we assume x_p(t) (p = 1, 2, ..., 22) is the EEG signal in the p-th channel. As a fast and simple measure, the PCC estimates the linear correlation between two signals x_p(t) and x_q(t) in the time domain, ranging from −1 to 1. The formula is shown in Equation (1):

PCC(p, q) = E[(x_p(t) − μ_p)(x_q(t) − μ_q)] / (σ_p σ_q), (1)

where E is the mathematical expectation, μ_p and μ_q are the mean values of x_p(t) and x_q(t), respectively, and σ_p and σ_q are the standard deviations of x_p(t) and x_q(t), respectively.
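A minimal sketch of the PCC adjacency matrix for one segment: `np.corrcoef` evaluates exactly the expectation/standard-deviation form of Equation (1) for every channel pair at once.

```python
import numpy as np

def pcc_matrix(seg):
    """seg: (channels, samples) EEG segment.
    Returns the symmetric 22 x 22 PCC adjacency matrix of Equation (1)."""
    return np.corrcoef(seg)

rng = np.random.default_rng(0)
seg = rng.standard_normal((22, 2048))   # one 8-s segment at 256 Hz
C = pcc_matrix(seg)                     # values in [-1, 1], diagonal = 1
```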

Phase Locking Value (PLV)
Based on phase synchronization, PLV [31] measures FC by calculating the instantaneous phase synchronization strength between two signals. It assumes that two bioelectrical signals with asynchronous amplitudes may nevertheless be synchronized in phase, showing a simultaneous (or fixed-delay) rise and fall of the two phases. The instantaneous phase of x_p(t) is given by Equation (2):

φ_p(t) = arctan( x̃_p(t) / x_p(t) ), (2)

where x̃_p(t) represents the Hilbert transform of x_p(t), which is defined in Equation (3):

x̃_p(t) = (1/π) PV ∫ x_p(τ) / (t − τ) dτ, (3)

where PV is the Cauchy principal value. The PLV value between two signals is then calculated according to Equation (4):

PLV(p, q) = (1/N) | Σ_{n=1}^{N} exp(j[φ_p(nΔt) − φ_q(nΔt)]) |, (4)

where Δt is the sampling period and N denotes the number of sampling points of the signal. The PLV values range from 0 to 1.
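Equations (2)-(4) can be sketched with SciPy's analytic-signal Hilbert transform; all channel pairs are computed at once with one complex matrix product.

```python
import numpy as np
from scipy.signal import hilbert

def plv_matrix(seg):
    """seg: (channels, samples). Instantaneous phase via the Hilbert
    transform (Equations (2) and (3)), then the pairwise PLV of
    Equation (4)."""
    phase = np.angle(hilbert(seg, axis=-1))   # phi_p(t), one row per channel
    n = seg.shape[1]
    z = np.exp(1j * phase)
    # PLV(p,q) = |mean_t exp(j(phi_p - phi_q))|, for all pairs at once
    return np.abs(z @ z.conj().T) / n

rng = np.random.default_rng(1)
seg = rng.standard_normal((22, 2048))
P = plv_matrix(seg)   # values in [0, 1], diagonal = 1
```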

Mutual Information (MI)
Based on information theory, MI evaluates the information dependency between two random variables. It represents the amount of information about one signal contained in another. When we use X_p = {x_p(k)}, k = 1, 2, ..., N as the random-variable form of x_p(t), the calculation of its entropy H(X_p) is given in Equation (5):

H(X_p) = −Σ p(x_p) log p(x_p). (5)

H(X_q|X_p) and H(X_p, X_q) indicate the conditional entropy and joint entropy between X_p and X_q, which are defined in Equations (6) and (7):

H(X_q|X_p) = −Σ p(x_p, x_q) log p(x_q|x_p), (6)
H(X_p, X_q) = −Σ p(x_p, x_q) log p(x_p, x_q). (7)

The MI value between X_p and X_q is then given in Equation (8):

MI(X_p, X_q) = H(X_p) + H(X_q) − H(X_p, X_q). (8)
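A histogram-based estimate of Equation (8) can be sketched as below; the bin count is an assumption, since the text does not state the estimator's discretization.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (nats) of a discrete distribution."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mutual_information(x, y, bins=16):
    """Binned estimate of MI(X_p, X_q) = H(X_p) + H(X_q) - H(X_p, X_q)
    (Equations (5)-(8)). The bin count is an illustrative assumption."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    return entropy(px) + entropy(py) - entropy(pxy.ravel())

rng = np.random.default_rng(2)
x = rng.standard_normal(2048)
mi_dependent = mutual_information(x, x + 0.1 * rng.standard_normal(2048))
mi_independent = mutual_information(x, rng.standard_normal(2048))
```

As expected, the strongly coupled pair yields a much larger MI than the independent pair (which only shows the small positive bias inherent to binned estimators).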

Granger Causality (GC)
GC [32] has an intuitive assumption: if the historical information of x_p(t) can contribute to predicting the future changes of x_q(t), then x_p(t) is considered a Granger cause of x_q(t). First, the linear autoregressive (AR) models corresponding to x_p(t) and x_q(t) are constructed according to Equations (9) and (10):

x_p(t) = Σ_{τ=1}^{d} A_{11,τ} x_p(t − τ) + η_p(t), (9)
x_q(t) = Σ_{τ=1}^{d} A_{22,τ} x_q(t − τ) + η_q(t), (10)

where d is the order of the AR models and A_{11,τ} and A_{22,τ} are the AR coefficients. The noise terms η_p(t) and η_q(t) have zero mean, and their variances Σ1 = var(η_p(t)) and Σ2 = var(η_q(t)) are uncorrelated. The bivariate AR model is defined in Equations (11) and (12):

x_p(t) = Σ_{τ=1}^{d} A_{11,τ} x_p(t − τ) + Σ_{τ=1}^{d} A_{12,τ} x_q(t − τ) + e_p(t), (11)
x_q(t) = Σ_{τ=1}^{d} A_{21,τ} x_p(t − τ) + Σ_{τ=1}^{d} A_{22,τ} x_q(t − τ) + e_q(t), (12)

where A_{12,τ} and A_{21,τ} denote the cross-correlation coefficients. The noise terms e_p(t) and e_q(t) have zero mean, with variances Σ_pp = var(e_p(t)) and Σ_qq = var(e_q(t)) and covariance Σ_pq = cov(e_p(t), e_q(t)). The joint covariance matrix is given in Equation (13):

Σ = [ Σ_pp  Σ_pq ; Σ_qp  Σ_qq ]. (13)

The overall interdependency between x_p(t) and x_q(t) is described in Equation (14):

F_{p,q} = ln( Σ1 · Σ2 / |Σ| ), (14)

where |·| denotes the determinant of the enclosed matrix. When the two time series are independent, F_{p,q} reaches 0. F_{p,q} decomposes into three terms: (1) the GC value from x_q(t) to x_p(t); (2) the GC value from x_p(t) to x_q(t); and (3) the instantaneous causality between the two signals. We usually use the first two terms to measure the GC values in different directions.
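A minimal least-squares sketch of the directional GC value, computed as the log ratio of the restricted (Equation (9)) to the full bivariate (Equation (11)) residual variance; the AR order and the toy coupled system are illustrative assumptions.

```python
import numpy as np

def ar_residual_var(target, sources, d):
    """Regress target(t) on d lags of each source series (least squares)
    and return the residual variance."""
    n = len(target)
    rows = []
    for s in sources:
        rows += [s[d - tau : n - tau] for tau in range(1, d + 1)]
    X = np.column_stack(rows)
    y = target[d:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.var(y - X @ coef)

def granger(x_p, x_q, d=5):
    """GC value from x_q to x_p: ln(Sigma1 / Sigma_pp), comparing the
    restricted model of Equation (9) with the bivariate model of
    Equation (11)."""
    restricted = ar_residual_var(x_p, [x_p], d)
    full = ar_residual_var(x_p, [x_p, x_q], d)
    return np.log(restricted / full)

# Toy system where x drives y with a one-sample delay
rng = np.random.default_rng(3)
x = rng.standard_normal(4000)
y = np.zeros(4000)
y[1:] = 0.8 * x[:-1]
y += 0.1 * rng.standard_normal(4000)
gc_x_to_y = granger(y, x)   # large: x's past predicts y
gc_y_to_x = granger(x, y)   # near zero: y's past does not predict x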

Transfer Entropy (TE)
TE has a similar assumption to GC but derives from information entropy: if the historical information of the random process X_p helps reduce the uncertainty (entropy) of the random process X_q, then X_p is considered a cause of X_q. The calculation of the TE value from X_p to X_q is shown in Equation (15):

TE(X_p → X_q) = Σ p(x_q^k, x_q^{k−1:k−d}, x_p^{k−1:k−d}) log[ p(x_q^k | x_q^{k−1:k−d}, x_p^{k−1:k−d}) / p(x_q^k | x_q^{k−1:k−d}) ], (15)

where d is the time lag deciding the length of the historical information, and X_p^{k−1:k−d} and X_q^{k−1:k−d} represent the historical information of X_p and X_q. When the two random processes are independent, the TE value equals 0.
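Equation (15) can be sketched with binned entropies, expanding the conditional entropies into joint entropies; the bin count and the lag d = 1 are illustrative assumptions.

```python
import numpy as np

def transfer_entropy(x_p, x_q, bins=8, d=1):
    """Binned estimate of TE from X_p to X_q (Equation (15)) with lag d = 1,
    via TE = H(Xq_k | Xq_past) - H(Xq_k | Xq_past, Xp_past), expanded into
    joint entropies. Bin count and lag are assumptions."""
    def H(*series):
        joint, _ = np.histogramdd(np.column_stack(series), bins=bins)
        p = joint.ravel() / joint.sum()
        p = p[p > 0]
        return -np.sum(p * np.log(p))
    xq_k, xq_past, xp_past = x_q[d:], x_q[:-d], x_p[:-d]
    return (H(xq_k, xq_past) - H(xq_past)
            - H(xq_k, xq_past, xp_past) + H(xq_past, xp_past))

# Same toy directionality check as for GC: x drives y
rng = np.random.default_rng(4)
x = rng.standard_normal(5000)
y = np.zeros(5000)
y[1:] = 0.9 * x[:-1] + 0.2 * rng.standard_normal(4999)
te_x_to_y = transfer_entropy(x, y)
te_y_to_x = transfer_entropy(y, x)
```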

Connectivity Features Arrangement
For a 22-channel EEG segment in a specific frequency band b, connectivity measure m was calculated for each pair of channels, resulting in a 22 × 22 brain connectivity adjacency matrix C(m, b) = (c_ij) (channels i, j ∈ {1, 2, 3, ..., 22}; measure m ∈ {PCC, PLV, MI, GC, TE}; band b ∈ {δ, θ, α, β, γ}). For an FC measure, the element c_ij represented an undirected connectivity between the EEG signals in the i-th and j-th channels. For an EC measure, c_ij denoted a directed connectivity from the i-th to the j-th channel. Finally, 25 (5 bands × 5 measures) adjacency matrices were extracted. Referring to [33], we arranged these matrices into a brain connectivity feature image (110 × 110), as shown in Figure 3a. The differences between most feature elements were not obvious, which was mitigated by normalizing all the connectivity values corresponding to each measure m, as shown in Figure 3b. After this normalization, the feature diversity was more obvious for classifier learning. The details of the feature dataset are shown in Table 2.
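The arrangement step can be sketched as tiling the 25 matrices into one 110 × 110 image, min-max normalizing all values belonging to each measure together. The tile order (one measure per row of tiles, one band per column) is an assumption; the paper's Figure 3 fixes the exact layout.

```python
import numpy as np

MEASURES = ["PCC", "PLV", "MI", "GC", "TE"]
BANDS = ["delta", "theta", "alpha", "beta", "gamma"]

def feature_image(matrices):
    """matrices[m][b] is a 22 x 22 adjacency matrix for measure m, band b.
    Tiles the 25 matrices into a 110 x 110 image, normalizing all values
    of each measure jointly so every tile lies in [0, 1]."""
    img = np.zeros((110, 110))
    for i, m in enumerate(MEASURES):
        stack = np.stack([matrices[m][b] for b in BANDS])
        lo, hi = stack.min(), stack.max()
        stack = (stack - lo) / (hi - lo + 1e-12)   # per-measure min-max
        for j in range(len(BANDS)):
            img[22*i:22*(i+1), 22*j:22*(j+1)] = stack[j]
    return img

rng = np.random.default_rng(5)
mats = {m: {b: rng.random((22, 22)) for b in BANDS} for m in MEASURES}
img = feature_image(mats)
```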

Support Vector Machine (SVM)
SVM [34] was selected as a powerful classifier that effectively performs nonlinear classification of high-dimensional features. The SVM is designed to find the hyperplane farthest from the boundaries of the different sample classes. The kernel function plays a very important role in SVM: it solves nonlinear problems by replacing the inner product operation in a high-dimensional feature space, avoiding the complexity of high-dimensional operations. Common kernel functions include linear, polynomial, and radial basis function (RBF) kernels. Due to its strong nonlinear mapping ability, an RBF kernel was selected here. In addition, SVM has two additional parameters: the penalty factor C and the kernel parameter g. The former represents the tolerance of the error, and the latter implicitly determines the distribution of the data mapped to the new feature space. A grid search determined the optimal parameters of the SVM.
The range of the grid search for parameter C was set as [0.001, 0.01, 0.1, 1, 10, 100, 1000], while the search range for parameter g was [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]. The decision function of the SVM is given in Equation (16):

f(x) = sgn( Σ_{i=1}^{n} a_i y_i K(x_i, x) + b ), (16)

where f(x) is the predicted label, n is the number of training samples, a_i represents a Lagrange multiplier, y_i is the label of the i-th sample, x denotes an input feature vector, x_i indicates the i-th sample, b is a bias term, and K(x_i, x) is the RBF kernel, as shown in Equation (17):

K(x_i, x) = exp(−g ||x_i − x||²), (17)

where g is a kernel parameter controlling the RBF's radial action range.
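The grid search described above can be sketched with scikit-learn; the search grid follows the text, while the toy two-class data merely stands in for the real PCA-reduced connectivity features.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Search grid from the text: C and g (sklearn's "gamma")
param_grid = {
    "C": [0.001, 0.01, 0.1, 1, 10, 100, 1000],
    "gamma": [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
}

rng = np.random.default_rng(6)
# Toy stand-in for the real features: two shifted Gaussian classes
X = np.vstack([rng.standard_normal((60, 30)),
               rng.standard_normal((60, 30)) + 1.5])
y = np.array([0] * 60 + [1] * 60)

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
best = search.best_params_   # cross-validated optimum of (C, g)
```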

CNNs Meet Transformers (CMT)
CMT [21] is a hybrid network relying on CNN and transformer components to extract local and global information, respectively. The architecture of the CMT network is shown in Table S3 in the Supplementary Material. It first employs a convolutional stem, which uses multiple 3 × 3 convolution stacks for downsampling and detailed features. Its main body consists of a four-stage transformer. Each stage is formed by stacks of CMT blocks, each of which includes a local perception unit (LPU), a lightweight multi-head self-attention (LMHSA) module, and an inverted residual feed-forward network (IRFFN). The LPU is defined in Equation (18), which introduces a shortcut for stable training:

LPU(X) = DWConv(X) + X, (18)

where X ∈ R^{H×W×d} (H × W is the resolution of the input of the current stage and d indicates the feature dimension), and DWConv(·) denotes depth-wise convolution. The lightweight attention in the second module is shown in Equation (19):

LightweightAttention(Q, K, V) = Softmax( Q K′^T / √d_k + B ) V′, (19)

where Q ∈ R^{n×d_k}, K ∈ R^{n×d_k}, and V ∈ R^{n×d_v} represent the query, key, and value in the original self-attention module. The notation n = H × W is the number of patches, and d_k and d_v denote the query (key) and value dimensions, respectively. K′ = DWConv(K) ∈ R^{(n/k²)×d_k} and V′ = DWConv(V) ∈ R^{(n/k²)×d_v} are obtained as lightweight features, and the relative position bias B ∈ R^{n×(n/k²)} is randomly initialized and learnable. Finally, h LightweightAttention functions form the h "heads" of LMHSA, resulting in a final n × d sequence. Compared with the FFN in traditional transformers, the IRFFN given in Equation (20) applies a depth-wise convolution to extract local information with negligible computational cost.
IRFFN(X) = Conv(F(Conv(X))), (20)

where Conv(·) denotes a traditional convolution and F(X) = DWConv(X) + X. The information passed through a CMT block is therefore given in Equations (21)-(23):

Y_i = LPU(X_{i−1}), (21)
Z_i = LMHSA(LN(Y_i)) + Y_i, (22)
X_i = IRFFN(LN(Z_i)) + Z_i, (23)

where Y_i, Z_i, and X_i denote the output of the LPU, LMHSA, and IRFFN in the i-th block, respectively, and LN(·) represents layer normalization. Finally, CMT ends with a global average pooling layer, a projection layer, and a classification layer. According to experimental experience, sigmoid rather than softmax was used here. We selected cross-entropy loss and the L2-norm as the loss function and the regularization term, respectively.

Performance Evaluation Metrics
This study adopted three widely used evaluation criteria for seizure detection and prediction: accuracy (ACC), sensitivity (Sen), and specificity (Spe). ACC represents the percentage of correct period detection, Sen represents the percentage of correctly identified interest periods, and Spe represents the percentage of correctly recognized inter-ictal EEG. For seizure detection, the ictal phase is the interest period, whereas for seizure prediction, the pre-ictal period is the interest period. The performance evaluation criteria are calculated as shown in Equations (24)-(26):

ACC = (TP + TN) / (TP + TN + FP + FN) × 100%, (24)
Sen = TP / (TP + FN) × 100%, (25)
Spe = TN / (TN + FP) × 100%, (26)

where TP denotes the number of samples correctly identified as the interest period, FN represents the number of interest-period samples incorrectly identified as inter-ictal, TN refers to the number of samples correctly identified as the inter-ictal period, and FP is the number of inter-ictal samples incorrectly identified as the interest period.
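Equations (24)-(26) can be sketched directly from the four confusion-matrix counts; the example numbers are purely illustrative.

```python
def evaluate(tp, fn, tn, fp):
    """Equations (24)-(26): accuracy, sensitivity, specificity (in %)."""
    acc = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    sen = 100.0 * tp / (tp + fn)
    spe = 100.0 * tn / (tn + fp)
    return acc, sen, spe

# Illustrative: 95 ictal segments caught, 5 missed; 98 inter-ictal correct,
# 2 false alarms
acc, sen, spe = evaluate(tp=95, fn=5, tn=98, fp=2)
```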

1. Subject-Specific Model (SSM)
Figure 4a shows the SSM training and verification processes. Since the training and validation data for the SSM both come from the same individual, the available samples were limited. Owing to its advantage in small-sample classification, SVM was adopted to build the SSM. However, as the SVM cannot directly process image input, the brain connectivity matrix (110 × 110) was flattened into a 12,100-dimensional feature vector, whose dimensionality was too high to suit the SVM. Therefore, principal component analysis (PCA) [35] was used to extract the low-dimensional main components as the feature vector input of the SVM. Based on practical experience, we selected the first 100 principal components of the 1-s features and the first 30 principal components of the 8-s features as the input features. Five-fold cross-validation was used to provide an unbiased evaluation of performance and avoid overfitting; the results were averaged across folds to verify a single model. Finally, these evaluations were averaged across subjects to obtain the average performance of the SSM.
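The flatten → PCA → SVM pipeline with five-fold cross-validation can be sketched with scikit-learn; the toy data (100 flattened 110 × 110 images per class, 30 principal components as in the 8-s setting) merely stands in for one subject's real features.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(7)
# Toy stand-in: two classes of flattened 110 x 110 connectivity images
X = np.vstack([rng.standard_normal((100, 12100)),
               rng.standard_normal((100, 12100)) + 0.1])
y = np.array([0] * 100 + [1] * 100)

# Flattened image -> first 30 principal components -> RBF SVM,
# evaluated with five-fold cross-validation as in the text
model = make_pipeline(PCA(n_components=30), SVC(kernel="rbf"))
scores = cross_val_score(model, X, y, cv=5)
mean_acc = scores.mean()   # fold-averaged accuracy for this "subject"
```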


2. Subject-Independent Model (SIM)
Figure 4b outlines how the SIM was trained and verified. First, the EEG segments of all the subjects were collected to form a dataset. The ratio of training data to validation data was set to 8:2. In the training process, we set the initial learning rate to 0.0001 (decayed every five epochs), the regularization coefficient to 0.001, and the batch size to 8. Finally, the trained CMT used the validation set to verify the performance of the SIM.

3. Cross-Subject Model (CSM)
Figure 4c shows the CSM's leave-one-subject-out (LOSO) validation. One subject was left out as the validation data, and the remaining subjects formed the training data. CMT was used to establish the CSM. The evaluations were averaged across subjects to obtain the average performance of the CSM. We set the learning rate to 0.0001 (decayed every five epochs), the regularization coefficient to 0.01, and the batch size to 8.

Feature Selection and Efficiency Analysis
To determine the optimal connectivity measures and frequency bands, we evaluated the feature separability between different epileptic phases, as shown in Figure 5a. First, the features corresponding to a specific measure or band were extracted. Since the distance between feature vectors in a high-dimensional space is difficult to measure accurately, t-distributed stochastic neighbor embedding (t-SNE) [36] was adopted to map each high-dimensional feature to a point in a two-dimensional plane. It uses a heavy-tailed distribution (the Student t-distribution) to convert distances into probability scores in low dimensions, so that points with greater similarity in the high-dimensional space are mapped to points with smaller distances in the low-dimensional space, and vice versa. Next, according to Equation (27), a silhouette coefficient (SC), positively related to the degree of feature separation, was calculated for the clusters formed by the two classes of feature points on the plane:

SC = (1/N) Σ_{i=1}^{N} sc(i), (27)

where N is the number of samples and sc(i) is the silhouette coefficient of the i-th sample, given in Equation (28):

sc(i) = (b(i) − a(i)) / max(a(i), b(i)), (28)

where b(i) and a(i) represent the average distance between the i-th sample and the other samples with a different label and with the same label, respectively. The SC corresponding to each measure was calculated for each subject. A Shapiro-Wilk test showed that the SC values failed to conform to a (nearly) normal distribution, and a Spearman correlation test detected significant correlations (p < 0.01) among the five measures. Therefore, a non-parametric multiple-paired test, namely the Friedman test, was conducted on the SC values of the five measures, and the p-values were corrected by the Finner correction.
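The t-SNE embedding and silhouette scoring can be sketched with scikit-learn; the perplexity and the toy two-class feature set are assumptions, and `silhouette_score` implements Equations (27) and (28) on the 2-D embedding.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(8)
# Toy stand-in for one subject's per-measure features (two epileptic states)
X = np.vstack([rng.standard_normal((80, 484)),
               rng.standard_normal((80, 484)) + 0.3])
labels = np.array([0] * 80 + [1] * 80)

# Map to a 2-D plane with t-SNE, then score cluster separability with the
# silhouette coefficient of Equations (27) and (28)
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
sc = silhouette_score(emb, labels)   # larger = better separated classes
```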
After selecting the optimal connectivity measures, the SC was computed for each frequency band using only those measures. Due to the non-normal distribution and the significant correlations among the five bands, the same statistical method was applied for optimal band selection. After the above feature selection, we analyzed the efficiency of the optimal features. In general, a practical method in real application scenarios must be highly efficient in terms of time and storage to achieve online real-time detection and prediction. Therefore, the time and storage costs were calculated as efficiency metrics. All computations were performed on a standard desktop computer with an Intel Core i7-10700 CPU @ 2.90 GHz and 16 GB RAM. Given that the training time of the model largely depends on the training sample size, which was not considered in this study, we focused on the time spent on preprocessing, feature extraction, and model output. The storage cost was specific to the local side.

Classification Results
Table 3 presents the classification results for the proposed method. All evaluation criteria for the SSM and SIM were greater than 95%. In the 8-s window, the ACC, Sen, and Spe of the SSM and SIM exceeded 99%. For seizure detection, the CSM obtained satisfactory results, with each metric over 96%. However, the performance of the CSM decreased significantly in seizure prediction: in the 1-s and 8-s windows, the seizure prediction ACC reached 74.78% and 86.17%, respectively.

Feature Selection
The statistical results are displayed in Figure 5b. For seizure detection, PCC and PLV had significantly higher SC than GC (both windows: p < 0.001), MI (1 s window: p < 0.01; 8 s window: p < 0.05), and TE (1 s window: p < 0.001; 8 s window: p < 0.01). MI showed better feature separability than GC (both windows: p < 0.01). For seizure prediction, PCC and PLV had larger SC than GC (both windows: p < 0.001), MI (8 s window: p < 0.05), and TE (1 s window: p < 0.001). GC obtained lower SC than MI (1 s window: p < 0.001) and TE (8 s window: p < 0.001). Based on these results, the features calculated by PCC and PLV showed optimal separability. Therefore, we extracted the PCC and PLV connection matrices (44 × 22) in each frequency band for SC comparison. For seizure detection in the 1 s window, the γ band had higher SC than δ (p = 0.001), θ (p = 0.001), and α (p = 0.030), and β showed better feature separability than δ (p = 0.003) and θ (p = 0.004). For seizure detection in the 8 s window, the β band produced higher SC values than δ (p = 0.039), θ (p = 0.028), and α (p = 0.028), while the γ band showed better feature separability than θ (p = 0.028) and α (p = 0.039). For seizure prediction in the 1 s window, β and γ had significantly higher SC than the δ (p < 0.001), θ (p < 0.001), and α (p < 0.05) bands. There was no difference among the frequency bands for seizure prediction in the 8 s window. Based on these results, we recommend a feature selection scheme: PCC and PLV connectivity features in the β and γ bands. The classification ACC, time cost, and storage cost of the original and selected features are compared in Table 4. After removing the other three measures and three bands, this selection scheme reduced the time and storage costs by more than 70%, while the ACC decreased by less than 3.50%. The time cost includes the time spent on preprocessing the raw EEG, feature extraction, and model output.
The storage cost was calculated for the local side: the original and selected features occupied 62.5 KB and 10 KB, respectively. At only 17 KB, the SVM-based SSM could be embedded on the local side, which must store both the feature matrix and the model. In contrast, the CMT-based SIM and CSM, at 101 MB, were unsuitable for local loading. Given that they can be applied to different subjects, SIM and CSM could be deployed on the cloud side for online computing; the local side would then be responsible only for extracting and uploading the feature matrix. Therefore, the local storage cost of SIM or CSM includes only the feature matrix.
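The band/measure comparison above rests on a separability criterion compared with a non-parametric paired test. The following sketch assumes a Fisher-style per-feature criterion and a Wilcoxon signed-rank test; the paper's exact SC definition and test statistic are not reproduced here, and the per-subject SC values are synthetic.

```python
import numpy as np
from scipy.stats import wilcoxon

def separability_criterion(feat_a, feat_b):
    """Fisher-style separability per feature: squared mean difference over
    summed within-class variance. Inputs: (n_samples, n_features) per class."""
    mu_a, mu_b = feat_a.mean(0), feat_b.mean(0)
    var_a, var_b = feat_a.var(0), feat_b.var(0)
    return (mu_a - mu_b) ** 2 / (var_a + var_b + 1e-12)

rng = np.random.default_rng(0)
# SC of each feature between two hypothetical classes (e.g., ictal vs. inter-ictal)
feat_ictal = rng.standard_normal((50, 10)) + 1.0
feat_inter = rng.standard_normal((50, 10))
sc = separability_criterion(feat_ictal, feat_inter)

# Paired non-parametric comparison of per-subject SC values from two measures
# (e.g., PCC vs. GC), appropriate when the SC values are non-normally distributed
sc_pcc = rng.random(24) + 0.5   # hypothetical SC values for 24 subjects
sc_gc = rng.random(24)
stat, p = wilcoxon(sc_pcc, sc_gc)
```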

Training and Validation Strategy Comparison
In the 1 s window, there were obvious differences in performance among the training and validation strategies. From the perspective of data distribution, EEG patterns from the same individual should be relatively similar, while the distributions across individuals diverge significantly, a phenomenon known as cross-subject heterogeneity [37]. In this case, SSM should outperform both SIM and CSM. However, in this study, SSM and SIM reached similar accuracy. For one thing, the training and validation sets of SIM were likely to come from the same subjects, which mitigated cross-subject diversity to some extent. For another, SIM was built on CMT, which has a stronger fitting ability than the SVM-based SSM. Nonetheless, even CMT's fitting ability failed to overcome the pronounced heterogeneity problem of CSM in seizure prediction. Although SIM can work on different subjects and achieved higher accuracy than CSM, it has limitations in real application scenarios. Since SIM was trained and validated on the same subject group, it cannot work well on a never-seen subject:
to deploy it on a new subject, her/his EEG data must first be collected and added to the dataset for retraining. Ideally, a model should still work on brand-new subjects without retraining. Therefore, CSM has the highest practical value among the three models. Recently, Giuseppe et al. [38] applied LOSO validation to classify EEG signals of psychogenic nonepileptic seizures (PNES) versus healthy subjects, obtaining good accuracy. However, most studies on seizure detection and prediction have not adopted LOSO validation; we therefore suggest using CSM's training and validation strategy in this field to evaluate models under realistic application scenarios.
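The LOSO (leave-one-subject-out) strategy underlying CSM can be sketched with scikit-learn's `LeaveOneGroupOut` splitter. The data here are synthetic; the subject count, feature dimensionality, and SVM classifier are illustrative assumptions, not the study's exact configuration.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import SVC

# Hypothetical data: 60 feature vectors from 6 subjects (10 windows each)
rng = np.random.default_rng(1)
X = rng.standard_normal((60, 484))      # e.g., flattened 22x22 connectivity matrices
y = np.tile([0, 1], 30)                  # inter-ictal vs. ictal labels
groups = np.repeat(np.arange(6), 10)     # subject ID for each window

# Each fold holds out one entire subject, so the model is always
# evaluated on a "never-seen" individual
logo = LeaveOneGroupOut()
accs = []
for train_idx, test_idx in logo.split(X, y, groups):
    clf = SVC(kernel="rbf").fit(X[train_idx], y[train_idx])
    accs.append(clf.score(X[test_idx], y[test_idx]))
```

Grouping by subject ID is what distinguishes CSM from SIM: a random split would let windows from the same subject leak into both training and validation sets.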

Window Length Comparison
The long window performed better than the short window, which is consistent with the literature [6]. A possible reason is that brain connectivity estimates become noisier on short segments because fewer sample points are available [39]. Another plausible explanation is that some connectivity measures require additional samples to detect a phase shift in a particular band. Features extracted from longer segments are therefore more likely to be stable and to carry useful information. However, long windows have practical limitations. Without known period labels, a randomly intercepted 8 s segment has a high chance of spanning two different periods (e.g., inter-ictal and pre-ictal states), which greatly degrades classification accuracy. In addition, a long window dilutes the features of short-duration seizures, leading to missed detections. Therefore, we suggest selecting an appropriate window length according to the specific requirements.
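The window-length trade-off above starts from a simple segmentation step, sketched below. The sampling rate and channel count are illustrative assumptions; the study's 1 s and 8 s windows are shown, non-overlapping by default.

```python
import numpy as np

def segment_eeg(eeg, fs, win_s, step_s=None):
    """Split a (channels, samples) recording into fixed-length windows.
    Non-overlapping by default (step_s defaults to win_s)."""
    step = int((step_s or win_s) * fs)
    win = int(win_s * fs)
    starts = range(0, eeg.shape[1] - win + 1, step)
    return np.stack([eeg[:, s:s + win] for s in starts])

fs = 256
eeg = np.random.randn(22, 60 * fs)   # one minute of hypothetical 22-channel EEG
short = segment_eeg(eeg, fs, 1)      # 1 s windows -> (60, 22, 256)
long = segment_eeg(eeg, fs, 8)       # 8 s windows -> (7, 22, 2048)
```

Note how one minute yields 60 short windows but only 7 long ones: the long window averages over more samples per feature, at the cost of coarser temporal resolution around state transitions.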

Comparison with Previous Studies
A comparison of seizure detection results is displayed in Table 5. The classification performances of our SSM and SIM in the 1 s window were better than previous results based on short windows (1~2 s). With the 8 s window, our SSM, SIM, and CSM results all reached satisfactory values of over 99%, higher than other studies. These comparisons show that brain connectivity features are effective for seizure detection. Similarly, brain connectivity has recently been proven a promising feature for classifying rest-EEG data of PNES patients versus healthy control subjects [40].
A comparison of seizure prediction results is presented in Table 6. The difficulty of seizure prediction is sensitive to the SPH, which makes the SPH a key factor for comparison. In both the 1 s and 8 s windows, our results reached the state of the art: every evaluation criterion of SSM and SIM was over 96%, higher than other reported results. In particular, the CMT-based SIM reached an accuracy of over 96% in both window lengths, indicating that CMT succeeded in learning useful features from the brain networks. However, the CMT-based CSM did not reach satisfactory performance. Careful examination revealed that the ACC varied substantially across individuals, ranging from 60% to 100%, which resulted from the large heterogeneity of individual EEG data.

Connectivity Measure Comparison
The results in Section 3.2 indicated that the two EC methods performed worse than the three FC measures. Since related studies have shown obvious differences between effective connectivity networks in different EEG states [14,15], we believe the main reason EC was suboptimal here is that the selected GC and TE methods could not accurately measure the true effective connectivity of EEG data from the CHB-MIT dataset. Although GC is a widely used measure of causality, it is known to be highly sensitive to noise, a problem that becomes particularly acute with noisy electrophysiological recordings such as scalp EEG [48]. Increased noise is likely to flip the GC direction and produce spurious connections. Furthermore, if a third, unmeasured random process affects the two signals simultaneously, the measurement accuracy of TE is severely degraded [49]. Therefore, we do not recommend the traditional GC and TE methods for extracting effective connectivity for seizure detection and prediction. Although MI showed relatively good separability, it required a long computational time.
In contrast, PCC and PLV are recommended candidates with good performance, easy operation, and fast computation; they provide connectivity information on the brain network from the time and frequency domains, respectively.
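Both recommended measures are straightforward to compute, which underlies their speed advantage. The sketch below shows standard formulations: PCC as a channel-by-channel correlation matrix and PLV from Hilbert-transform instantaneous phases. The window size and channel count are illustrative assumptions.

```python
import numpy as np
from scipy.signal import hilbert

def pcc_matrix(seg):
    """Pearson correlation between every channel pair (time domain)."""
    return np.corrcoef(seg)

def plv_matrix(seg):
    """Phase locking value between every channel pair.
    Uses instantaneous phase from the analytic (Hilbert) signal."""
    phase = np.angle(hilbert(seg, axis=1))           # (channels, samples)
    diff = phase[:, None, :] - phase[None, :, :]     # pairwise phase differences
    return np.abs(np.exp(1j * diff).mean(axis=2))    # mean resultant length

seg = np.random.randn(22, 2048)   # one hypothetical 8 s window at 256 Hz
pcc = pcc_matrix(seg)
plv = plv_matrix(seg)
```

Both yield a symmetric 22 × 22 matrix per window, the image-like input described for the classifiers. In practice PLV is computed on band-filtered signals so that the instantaneous phase is well defined.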

Frequency Band Comparison
The results in Section 3.2 indicated that features calculated in the β and γ bands were more effective, which is consistent with the following evidence. From a pathological perspective, observations of human and animal epilepsy models have shown a relationship between the epileptogenicity of neuronal tissue and its tendency to produce rapid oscillations during seizures; the frequency of these oscillations gradually increases during the transition from the pre-ictal to the ictal state [50]. From clinical observations, removing brain areas with rapid discharges has a positive impact on surgical prognosis [51]. Blanco et al. [52] analyzed the Fourier spectral entropy of EEG signals within the pre-ictal period on the Freiburg dataset. They found that high-frequency spectral entropy increased significantly in the pre-ictal period compared with the inter-ictal phase, suggesting that the abnormal information transmission generated during the discharge of neuronal clusters in the focal area may shift signal energy from the low-frequency to the high-frequency bands. Other researchers have also shown experimentally that β and γ oscillations can serve as biomarkers of seizures [53][54][55]. Therefore, we suggest the β and γ bands as effective frequency bands for seizure detection and prediction.
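Extracting the recommended bands before computing connectivity amounts to band-pass filtering each channel. A minimal sketch with a zero-phase Butterworth filter follows; the band edges (13–30 Hz for β, 30–45 Hz for γ) and filter order are common conventions assumed here, not necessarily the study's exact settings.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def bandpass(eeg, fs, low, high, order=4):
    """Zero-phase Butterworth band-pass filter along the time axis."""
    b, a = butter(order, [low, high], btype="bandpass", fs=fs)
    return filtfilt(b, a, eeg, axis=-1)

fs = 256
eeg = np.random.randn(22, 8 * fs)   # hypothetical 8 s, 22-channel window
beta = bandpass(eeg, fs, 13, 30)    # β band
gamma = bandpass(eeg, fs, 30, 45)   # γ band, capped below 50 Hz line noise
```

Zero-phase filtering (`filtfilt`) avoids introducing phase distortion, which matters when phase-based measures such as PLV are computed on the filtered signal.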

Limitations
There were three limitations. (1) Although the bipolar montage alleviates volume conduction, the problem is too pronounced to be eliminated completely. Moreover, when EEG recordings adopt a referential (or unipolar) montage, brain connectivity measures such as PLV should be treated cautiously because they are sensitive to volume conduction. (2) The cross-subject model did not perform satisfactorily in seizure prediction, indicating that the heterogeneity was not well mitigated. In future work, other network architectures, such as graph neural networks, will be explored to address this problem. (3) The small sample size of CHB-MIT might limit the performance of deep learning networks. Therefore, more clinical data will be collected at our cooperating hospital for future classification studies.

Conclusions
The main contributions of this work are as follows: (1) functional and effective connectivity features proposed for seizure detection and prediction: PCC, PLV, GC, MI, and TE; (2) a comprehensive evaluation of the effectiveness of brain connectivity features under different classification tasks (seizure detection and prediction), window lengths (1 s and 8 s), and model training and validation strategies (SSM, SIM, and CSM); (3) classifiers suited to the different training and validation strategies: SVM for SSM, and CMT for SIM and CSM; (4) an optimal selection of frequency bands and connectivity measures: PCC and PLV features in the β and γ bands; (5) classification accuracy and time-storage efficiency analyses demonstrating the method's practicality in clinical applications.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/brainsci13050820/s1. Table S1: Classification results with filtering at 0.5-40 Hz; Table S2: Classification results with filtering at 0.5-100 Hz; Table S3: The architecture of the CMT network. The feature image was resized to the input resolution of 160 × 160. The output size corresponds to the input resolution. Convolutions and CMT blocks are shown in brackets with the number of stacked blocks. Hi and ki are the number of heads and reduction rates in the LMHSA of stage i, respectively; Ri denotes the expansion ratio in the IRFFN of stage i. References [21,52] are cited in the Supplementary Materials.

Conflicts of Interest:
The authors declare no conflict of interest.