Source-Free Domain Adaptation (SFDA) for Privacy-Preserving Seizure Subtype Classification

Electroencephalogram (EEG) based seizure subtype classification is very important in clinical diagnostics. Source-free domain adaptation (SFDA) uses a pre-trained source model, instead of the source data, for privacy-preserving transfer learning. SFDA is useful in seizure subtype classification, as it protects the privacy of the source patients while reducing the amount of labeled calibration data needed for a new patient. This paper introduces semi-supervised transfer boosting (SS-TrBoosting), a boosting-based SFDA approach for seizure subtype classification. We further extend it to unsupervised transfer boosting (U-TrBoosting) for unsupervised SFDA, i.e., the new patient does not need any labeled EEG data. Experiments on three public seizure datasets demonstrated that SS-TrBoosting and U-TrBoosting outperformed multiple classical and state-of-the-art machine learning approaches in cross-dataset/cross-patient seizure subtype classification.


I. INTRODUCTION
Epilepsy, which affects millions of people worldwide, is one of the most common neurological diseases [1]. Electroencephalogram (EEG) is the gold standard in clinical epilepsy diagnostics. Unfortunately, distinguishing epileptic fragments from long EEG sequences is labor-intensive and time-consuming. Thus, automatic epilepsy classification is important in improving treatment efficiency [2], [3].
Both traditional machine learning and deep learning [4], [5] have been used in automatic epilepsy classification. The former usually includes the following steps [6]: 1) Data pre-processing, which removes EEG artifacts, e.g., muscle movements and electrical noise, by band-pass filtering and detrending. 2) Feature extraction, which manually extracts EEG features according to expert knowledge. 3) Classifier training, which trains a machine learning algorithm, e.g., support vector machine (SVM), logistic regression, or multi-layer perceptron (MLP), to classify the EEG signals.
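The three-step pipeline above can be sketched as follows. The sampling rate, filter band, toy features, and logistic-regression classifier here are illustrative assumptions, not the exact choices used in this paper:

```python
import numpy as np
from scipy.signal import butter, filtfilt, detrend
from sklearn.linear_model import LogisticRegression

def preprocess(eeg, fs=256.0, band=(0.5, 40.0)):
    """Step 1: band-pass filter and detrend one EEG fragment (channels x samples)."""
    b, a = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="band")
    return detrend(filtfilt(b, a, eeg, axis=-1), axis=-1)

def extract_features(eeg):
    """Step 2: toy per-channel statistics; real pipelines use richer expert features."""
    return np.concatenate([eeg.mean(axis=-1), eeg.std(axis=-1),
                           np.abs(np.diff(eeg, axis=-1)).mean(axis=-1)])

# Step 3: train a classifier on the extracted features (synthetic data for illustration)
rng = np.random.default_rng(0)
X = np.stack([extract_features(preprocess(rng.standard_normal((2, 512))))
              for _ in range(40)])
y = rng.integers(0, 2, size=40)
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(X.shape, clf.score(X, y))
```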
To reduce the amount of training data, labeled EEG data from some existing subjects (source domains, or source subjects) are frequently used to help train a classifier for a new subject (target domain, or target subject). However, due to significant individual differences, blindly borrowing data from other subjects may result in negative transfer [7], i.e., hurting the classification performance for the new subject.
Domain adaptation (DA) [8], which aims to reduce the distribution discrepancy between the source and target domains, is useful when there are insufficient labeled target data, and may help cope with individual differences between patients. Source-free domain adaptation (SFDA) [9], which uses only a pre-trained source model (instead of the source data) and some target data for DA, is very useful in protecting the source data privacy and avoiding heavy data transmission costs. This paper considers two SFDA scenarios, semi-supervised and unsupervised, in seizure subtype classification. Our main contributions include: 1) We extended semi-supervised transfer boosting (SS-TrBoosting) [10] to unsupervised TrBoosting (U-TrBoosting), which performs unsupervised SFDA, and verified their effectiveness on two public seizure subtype classification datasets. 2) To the best of our knowledge, we are the first to consider SFDA in cross-patient/cross-dataset seizure subtype classification.

The rest of this paper is organized as follows. Section II briefly reviews related works. Section III describes the details of SS-TrBoosting and U-TrBoosting. Section IV presents the 41 manually extracted seizure features and the experimental results. Section V draws conclusions.

II. RELATED WORKS
This section introduces related works on EEG-based seizure classification, SFDA, and ensemble learning.

A. EEG-Based Seizure Detection and Subtype Classification
Seizure detection aims at distinguishing seizure onset fragments from normal EEG signals [6]. Seizure subtype classification further classifies these onset fragments into different seizure subtypes [11], e.g., generalized seizures (absence and tonic-clonic seizures), focal seizures (simple partial seizures and complex partial seizures), and generalized and focal seizures, for targeted treatments [2].
Many manual features have been proposed for traditional classifiers [6]. For example, Tian et al. [4] extracted temporal, frequency and time-frequency domain features and used them in Naive Bayes Classifier, Decision Tree, SVM, k-Nearest Neighbor Classifier, and Fuzzy System, for seizure classification.
Manual feature extraction can be avoided in deep learning. Li et al. [5] proposed channel-embedding spectral-temporal squeeze-and-excitation network (CE-stSENet), a slim network inspired by the Squeeze-and-Excitation Block [12], for both seizure detection and subtype classification. Peng et al. [3] proposed a temporal information enhanced EEGNet (TIE-EEGNet) for cross-patient seizure subtype classification.

B. SFDA
SFDA uses some target data to adapt a pre-trained source model to the target domain.
Model adaptation [13] uses generative adversarial networks [14] to generate target-style source data and trains a domain discriminator to predict the domain label. It also uses clustering-based constraints to capture the local structural information.
Source hypothesis transfer (SHOT) [9] combines information maximization, clustering, and self-training. At each epoch, SHOT first uses clustering to update the pseudo-labels of the unlabeled data, and then uses self-training to minimize the information maximization loss. SHOT++ [15] further combines SHOT with MixMatch [16]. It first assigns and fixes the pseudo-labels of unlabeled target data with low entropy (i.e., high confidence), and then converts the SFDA problem into a semi-supervised learning problem, which can be solved by MixMatch.
Neighborhood reciprocity clustering (NRC) [17] captures the intrinsic neighborhood structure of the target data by using a clustering hypothesis based loss, and uses a self-regularization loss to mitigate the negative impact of noisy neighbors.

C. Ensemble Learning
Ensemble learning [18] constructs multiple base learners to predict the output. Two common strategies are Bagging and Boosting.
Boosting updates the sample weights according to the negative gradient [19] or classification error [20] in each iteration. Given N training samples {(x_n, y_n)}_{n=1}^N, where y_n ∈ R^J and J is the number of classes, Boosting generates K base learners to construct an ensemble classifier

F_K(x) = Σ_{k=1}^K f_k(x), (1)

which predicts the output. LogitBoost [21] is a popular implementation of Boosting. It uses Newton's method to decompose the classification problem into multiple regression problems, and solves them by linear regression. Suppose LogitBoost has generated an ensemble classifier F_{k−1} with k − 1 base learners. Then, LogitBoost trains the k-th base learner f_k in three steps: 1) Use F_{k−1} to generate each sample's prediction probability p_n ∈ R^J. 2) Calculate each sample's weight w_n^j and pseudo-label ỹ_n^j for each Class j:

w_n^j = p_n^j (1 − p_n^j), (2)
ỹ_n^j = (y_n^j − p_n^j) / (p_n^j (1 − p_n^j)). (3)

3) Construct J temporary weighted regression datasets {(x_n, ỹ_n^j, w_n^j)}_{n=1}^N (j ∈ [1, J]) to train the J regression models of f_k.

Bagging and boosting fine-tuning (BBF) [22] is a general fine-tuning framework, integrating LogitBoost, Bagging [23], and the broad learning system [24]. Adaptive semi-supervised ensemble (ASSEMBLE) [25] is a semi-supervised learning approach that assigns pseudo-labels and calculates the weights of unlabeled data at each iteration. TrAdaBoost [26] extends AdaBoost [20] to DA; it assumes the source data misclassified by previous learners are unrelated to the target domain, and hence reduces their weights.
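The three LogitBoost steps above can be sketched in a minimal two-class form. This is an illustrative sketch on synthetic data; depth-1 regression trees stand in for the linear regression solvers, and the clipping threshold of 4 follows the original LogitBoost paper:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def logitboost_fit(X, y, K=20):
    """Two-class LogitBoost: Newton steps turn classification into
    weighted least-squares regression problems."""
    N = len(y)
    F = np.zeros(N)
    learners = []
    for _ in range(K):
        p = 1.0 / (1.0 + np.exp(-2.0 * F))     # current class probability
        w = np.clip(p * (1 - p), 1e-6, None)   # Newton weights
        z = np.clip((y - p) / w, -4, 4)        # clipped working response
        f = DecisionTreeRegressor(max_depth=1).fit(X, z, sample_weight=w)
        F += 0.5 * f.predict(X)                # add the new base learner
        learners.append(f)
    return learners

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
learners = logitboost_fit(X, y)
F = 0.5 * sum(f.predict(X) for f in learners)
acc = ((F > 0) == y.astype(bool)).mean()
print(acc)
```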

III. METHODOLOGY
This section introduces our proposed SS-TrBoosting and U-TrBoosting, as shown in Fig. 1. The Python code is available at https://github.com/zhaochangming/TrBoosting.

A. Problem Definition
Unsupervised SFDA (U-SFDA) and semi-supervised SFDA (SS-SFDA) are considered in this paper.
Unsupervised SFDA includes a source model f_s, e.g., a logistic regression or neural network classifier, and N unlabeled target data D^U = {x_n^U}_{n=1}^N. Semi-supervised SFDA additionally includes a small amount of labeled target data D^T = {(x_n^T, y_n^T)}_{n=1}^{N_T}, where y_n^T ∈ R^J and N_T is the number of labeled target data. When the source model is a deep neural network, which integrates a feature extractor and a classifier, the last layer of the deep neural network is viewed as the classifier f_s.
To increase the stability of the training process, especially when the target data are insufficient, SS-TrBoosting first uses a virtual source domain generation approach to generate some virtual source data:
1) Randomly generate virtual label vectors for each class:

ŷ_{c,n} = α_n e_c + (1 − α_n) e_r, n ∈ [1, N_L],

in which ŷ_{c,n} is the generated label of Class c, e_c ∈ R^J is the one-hot vector of Class c, r is a randomly selected class except c, α_n ∈ (0, 1) is the mix weight of Class c, randomly drawn from a distribution controlled by the hyper-parameter β, N_L is the number of generated labels for each class, and J is the number of classes. Then, SS-TrBoosting combines all virtual label vectors to generate Ŷ^S = [ŷ_{1,1}, …, ŷ_{1,N_L}, ŷ_{2,1}, …, ŷ_{2,N_L}, …, ŷ_{J,1}, …, ŷ_{J,N_L}]⊺ ∈ R^{(J×N_L)×J}.
2) Calculate the virtual source data by using Ŷ^S and f_s:

X̂^S = Ŷ^S θ† ∈ R^{(J×N_L)×d},

where d is the feature dimensionality, θ ∈ R^{d×J} is the linear weight of the classifier of a deep neural network or a logistic regression model, and θ† is the pseudo-inverse of θ.
3) Align the mean and standard deviation of X̂^S and X^U, and generate the one-hot coding of Ŷ^S:

X̂^S ← (X̂^S − μ_S) / σ_S · σ_U + μ_U, Y^S = OneHot(argmax(Ŷ^S)),

where μ_S (μ_U) and σ_S (σ_U) are the mean and standard deviation of X̂^S (X^U). Let D^S denote the generated virtual source data.

Then, SS-TrBoosting normalizes f_s to get the normalized initial model f_norm, which ensures the outputs of the initial model and the base learners have the same order of magnitude, and generates K fine-tuning blocks, each of which consists of two base learners, to enhance f_norm:

F_{2K}(x) = f_norm(x) + lr · Σ_{k=1}^{2K} f_k(x),

where F_{2K} is an ensemble classifier with 2K base learners (excluding f_norm), f_k is the k-th base learner, and lr is the learning rate. Inspired by LogitBoost, SS-TrBoosting generates F_{2K} via an iterative process. Assume SS-TrBoosting has generated k − 1 fine-tuning blocks, i.e., the current ensemble classifier is F_{2k−2}, and seeks to train the k-th fine-tuning block, which consists of the (2k−1)-th base learner f_{2k−1} and the 2k-th base learner f_{2k}.
SS-TrBoosting decomposes the semi-supervised SFDA problem into a supervised SFDA problem and a semi-supervised learning problem: 1) For the supervised SFDA problem, SS-TrBoosting first merges D^T and D^S to generate the training set, and uses F_{2k−2} to calculate the prediction probability p_n of each sample. Then, SS-TrBoosting generates J temporary datasets to train f_{2k−1} according to LogitBoost, and adds it to F_{2k−2} to generate F_{2k−1}. 2) For the semi-supervised learning problem, SS-TrBoosting first merges D^T and D^U to generate the training set, and uses F_{2k−1} to calculate the prediction probability p_n of each sample. Inspired by ASSEMBLE [25], SS-TrBoosting uses p_n to assign the pseudo-label ŷ_n^U of each unlabeled target sample. Then, SS-TrBoosting calculates the sample weight w_n and pseudo-label ỹ_n of each sample by using (2) and (3), respectively. Finally, SS-TrBoosting generates J temporary datasets to train f_{2k}, and adds it to F_{2k−1} to generate F_{2k}.
Let p_n ∈ R^J be the source model's prediction probability of the n-th sample, and ŷ_n = argmax(p_n) be its predicted class. U-TrBoosting first calculates the entropy e_n of each unlabeled target sample:

e_n = − Σ_{j=1}^J p_n^j log p_n^j.

Next, U-TrBoosting uses e_n to sort {(x_n^U, ŷ_n, e_n) | n ∈ [1, N]} in ascending order, and generates an index set I^j for each predicted Class j. Then, U-TrBoosting collects the index set I_l^j of the top-γ (γ ∈ [0, 1]) portion of I^j, and constructs labeled target data D^T and new unlabeled target data D^U:

D^T = {(x_n^U, ŷ_n) | n ∈ I_l}, D^U = {x_n^U | n ∉ I_l},

where I_l = I_l^1 ∪ I_l^2 ∪ ··· ∪ I_l^J. Finally, U-TrBoosting combines D^U and D^T to transform the unsupervised SFDA problem into a semi-supervised SFDA problem, which can be solved by SS-TrBoosting.
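The entropy-based selection above can be sketched as follows; the softmax probabilities here are synthetic, and the per-class truncation is one straightforward reading of the top-γ rule:

```python
import numpy as np

def select_confident(probs, gamma=0.3):
    """Keep the lowest-entropy top-gamma portion of each predicted class as
    pseudo-labeled data; the rest remain unlabeled."""
    eps = 1e-12
    entropy = -(probs * np.log(probs + eps)).sum(axis=1)
    pred = probs.argmax(axis=1)
    labeled_idx = []
    for j in range(probs.shape[1]):
        idx_j = np.where(pred == j)[0]
        idx_j = idx_j[np.argsort(entropy[idx_j])]       # ascending entropy
        labeled_idx.extend(idx_j[: int(gamma * len(idx_j))])
    labeled_idx = np.array(sorted(labeled_idx))
    unlabeled_idx = np.setdiff1d(np.arange(len(probs)), labeled_idx)
    return labeled_idx, pred[labeled_idx], unlabeled_idx

rng = np.random.default_rng(0)
logits = rng.standard_normal((100, 4))
probs = np.exp(logits) / np.exp(logits).sum(1, keepdims=True)
lab, pseudo, unlab = select_confident(probs, gamma=0.3)
print(len(lab), len(unlab))
```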
Inspired by the broad learning system [24], which randomly generates and fixes the input weights of the hidden layer, both U-TrBoosting and SS-TrBoosting generate a random feature mapping (RFM) h_k in each iteration to introduce feature randomness into the training process and speed up the training:

h_k(x) = ZScore(x⊺ W_k) = (x⊺ W_k − μ_k) / σ_k,

where W_k ∈ R^{d×ns} is a random matrix, ns is the node size, ZScore is z-score normalization, and μ_k and σ_k are the mean and standard deviation vectors of X^U W_k, respectively. TrBoosting uses h_k to transform x before feeding it into each regression model f_k^j (j ∈ [1, J]) of f_k, and trains f_k^j by ridge regression [27]. The asymptotic complexity of training f_k^j thus decreases from O(N·d² + d³) to O(N·d·ns + N·ns² + ns³), where N·d² + d³ and N·ns² + ns³ are the costs of ridge regression, and N·d·ns is the cost of building h_k. Fig. 2 shows the diagram of training the k-th fine-tuning block in U-TrBoosting.
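The RFM can be sketched as a fixed random projection followed by z-score normalization, with the statistics computed once on the unlabeled target features (synthetic data here; 41 matches the paper's manual feature count but the node size is arbitrary):

```python
import numpy as np

def make_rfm(X_U, ns=64, rng=None):
    """Build h_k: a fixed random projection W plus z-score normalization,
    with mean/std estimated on the unlabeled target features X_U."""
    rng = rng or np.random.default_rng(0)
    d = X_U.shape[1]
    W = rng.standard_normal((d, ns))
    Z = X_U @ W
    mu, sigma = Z.mean(0), Z.std(0) + 1e-12
    return lambda X: (X @ W - mu) / sigma

rng = np.random.default_rng(0)
X_U = rng.standard_normal((200, 41))     # e.g., 41 manual features
h = make_rfm(X_U, ns=64, rng=rng)
H = h(X_U)
print(H.shape)
```

Training the ridge regressors on H ∈ R^{N×ns} instead of X ∈ R^{N×d} is what yields the complexity reduction stated above.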

Algorithm 1 U-TrBoosting
Input: D^U = {x_n^U}_{n=1}^N, N unlabeled target samples; K, the number of fine-tuning blocks; J, the number of classes (J > 2); f_s, the source model; ns, the node size; lr, the learning rate; bs, the batch size; ξ, the magnitude of noise; γ, the portion of unlabeled target data to assign pseudo-labels.
Output: F_{2K}, an ensemble classifier.
/* Generate the pseudo-labels of the unlabeled target data */
{ŷ_n^U = OneHot(argmax(F_{2k−1}(x_n^U))) | n ∈ [1, N_U]};
Construct the nonlinear feature mapping h_{2k};
Generate the semi-supervised training set {(x_n, y_n)} by (20);
for j = 1 : J do
  for n = 1 : N_T + N_U do
    Compute p_n^j = softmax_j(F_{2k−1}(x_n)) if n ≤ N_T, or p_n^j = softmax_j(F_{2k−2}(x_n) + lr × f_{2k−1}(x_n + ε_{n−N_T})) if n > N_T, and clip it by using (18);

Similar to LogitBoost [21], SS-TrBoosting and U-TrBoosting also clip the pseudo-label ỹ_n^j to increase its robustness to noise. Similar to consistency regularization [28], which improves the model's robustness to noise by adding perturbations to the unlabeled samples, SS-TrBoosting and U-TrBoosting generate and add Gaussian noise ε_n to each unlabeled target sample, where ξ controls the magnitude of the noise. Then, the augmented unlabeled target data are used to calculate the prediction probability. Algorithm 1 shows the pseudo-code of U-TrBoosting, where the pseudo-code of balanced sampling (BS) [22] and Norm [22] is given in Algorithms 2 and 3, respectively.
BS generates a class-balanced batch of samples to train the base learner in each iteration, which helps remedy class imbalance. To balance the number of labeled target and source data, or labeled and unlabeled target data, BS first uses weighted sampling with replacement to obtain two class-balanced batches of samples from them, and then merges them.
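One class-balanced batch from BS can be sketched as follows; drawing bs // J samples per class is an assumption of this sketch, and the labels/weights are synthetic:

```python
import numpy as np

def balanced_sample(weights, labels, J, bs, rng=None):
    """Weighted sampling with replacement, drawing bs // J samples from each
    class so the resulting batch is class-balanced."""
    rng = rng or np.random.default_rng(0)
    idx = []
    per_class = max(bs // J, 1)
    for j in range(J):
        cand = np.where(labels == j)[0]
        if len(cand) == 0:
            continue                       # class absent: skip
        p = weights[cand] / weights[cand].sum()
        idx.extend(rng.choice(cand, size=per_class, replace=True, p=p))
    return np.array(idx)

rng = np.random.default_rng(0)
labels = rng.integers(0, 4, size=500)      # imbalanced in general
weights = rng.uniform(0.1, 1.0, size=500)
batch = balanced_sample(weights, labels, J=4, bs=64, rng=rng)
counts = np.bincount(labels[batch], minlength=4)
print(counts)
```

Merging two such batches (e.g., one from the labeled target data and one from the virtual source data) gives the combined index sets used to train each base learner.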

Algorithm 2 Balanced Sampling (BS) for J-Class Classification [22]
Input: {(w_n, y_n)}_{n=1}^N, N sample weights and labels; J, the number of classes (J > 2); bs, the batch size; j, the positive class, default 1.
Output: I_batch, the index set of the data in one batch.
Let W^S, W^T, and W^U be the sample weights and labels of the source data, the labeled target data, and the unlabeled target data, respectively. In each fine-tuning block, SS-TrBoosting and U-TrBoosting generate different batch index sets to train f_{2k−1} and f_{2k}, respectively: 1) For the supervised SFDA problem, the batch index set for Class j is

I_batch^j = BS(W^T, J, bs, j) ∪ BS(W^S, J, bs, j), (21)

where bs is the batch size. 2) For the semi-supervised learning problem, the batch index set for Class j is

I_batch^j = BS(W^T, J, bs, j) ∪ BS(W^U, J, bs, j). (22)

Norm scales the output magnitudes of the base learners to have unit L2-norm, which helps improve the SFDA performance.
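One plausible reading of Norm, scaling each J-dimensional output vector to unit L2-norm, can be sketched as (the per-sample granularity is an assumption of this sketch):

```python
import numpy as np

def l2_normalize(scores, eps=1e-12):
    """Scale each row (one J-dimensional output vector) to unit L2 norm."""
    return scores / (np.linalg.norm(scores, axis=1, keepdims=True) + eps)

scores = np.array([[3.0, 4.0], [0.0, 2.0]])
result = l2_normalize(scores)
print(result)  # rows [0.6, 0.8] and [0.0, 1.0]
```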

IV. EXPERIMENTS
This section introduces our experimental settings and results.

A. Datasets and Preprocessing
Three public seizure datasets were used in our experiments. The Bonn dataset [29] contains five subsets of 100 single-channel fragments collected from five healthy people and five epilepsy patients, as shown in Table I. We performed three-class classification to distinguish Normal (Z and O), Inter-Ictal (N and F), and Ictal (S) states. This dataset had been denoised, so we used it directly, without any further preprocessing. All reported results on the Bonn dataset were the average of 10 repeated experiments. For each experiment, we first randomly divided the data into a training set and a test set of the same size, and then randomly selected 25% of the training set to identify the best hyper-parameters for the baselines. Since the patient IDs of Bonn were unavailable, the test set was used to simulate the target data from new patients.
The CHSZ [3] and TUSZ (V1.5.2) [30] datasets were used in seizure subtype classification. CHSZ is a small dataset with seizure EEGs recorded from 27 children and infants. TUSZ contains 68 patients of all ages. For DA, we selected four common seizure types (focal seizures, absence seizures, tonic seizures, and tonic-clonic seizures) of the two datasets, as shown in Table II. As in [3], we first applied 50 Hz notch filtering, 0-64 Hz low-pass filtering, and detrending to remove EEG artifacts, and then performed channel re-referencing to generate 20 channels. Finally, we used a 4-second non-overlapping sliding window on the processed EEG records of each seizure event to generate samples. Ten repeats of 3-fold cross-patient validations were conducted, and the average results are reported. For each experiment, we randomly selected 25% of the training set to identify the best hyper-parameters for the baselines.
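The non-overlapping windowing step can be sketched as follows; the 250 Hz sampling rate and the dropping of the trailing remainder are assumptions of this sketch:

```python
import numpy as np

def segment(record, fs=250, win_sec=4.0):
    """Cut one seizure-event record (channels x samples) into non-overlapping
    4-second windows; the trailing remainder is dropped."""
    win = int(fs * win_sec)
    n = record.shape[1] // win
    return np.stack([record[:, i * win:(i + 1) * win] for i in range(n)])

rng = np.random.default_rng(0)
record = rng.standard_normal((20, 11 * 250))   # 20 channels, 11 s at 250 Hz
windows = segment(record)
print(windows.shape)
```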

B. Manual Feature Extraction
We extracted 41 manual features for seizure classification, as shown in Table III. The Python code is available at https://github.com/rmpeng/Epilepsy-Seizure-Detection.
SHOT [9], SHOT++ [15], and NRC [17] were used as SFDA baselines. To use them in SS-SFDA, we added a supervised loss on the labeled target data.

Fig. 3 shows the structure of the MLP used in our experiments. Table IV shows the hyper-parameters of the eight baseline algorithms, SS-TrBoosting, and U-TrBoosting in our experiments.
The best parameter combinations of GBDT, SVM and RLR were selected by grid-search. The parameter optimization procedures of EEGNet, CE-stSENet, TIE-EEGNet and MLP followed [3], which used early-stopping on the validation set. The optimizers, learning_rate, and max_epoch of SHOT, SHOT++ and NRC followed [3]. Other parameters of them followed their original papers, respectively. The base learner of ASSEMBLE used the same parameters as the source model.

E. Performance Measures
For CHSZ and TUSZ, we report event-level classification results, as in [3]: for each event, the classification probabilities of all fragments were aggregated to generate its prediction. For Bonn, we report sample-level classification results, since the event information of Bonn was unavailable. We then calculated the balanced classification accuracy (BCA) and the macro F1 score as performance measures.

Table V compares the performance of different manual features on the Bonn, CHSZ, and TUSZ datasets. Combining temporal, spectral, time-frequency, and nonlinear features achieved the best average BCA, suggesting that these features are complementary. Time-frequency features achieved a higher average BCA than temporal, spectral, and nonlinear features, suggesting that decomposing EEG into multiple frequency bands and extracting temporal features from each band may be more suitable for seizure classification.
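The two performance measures can be computed with standard scikit-learn calls (BCA is the unweighted mean of per-class recalls, i.e., balanced accuracy; the labels below are a toy example):

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, f1_score

y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 0, 1, 1, 2, 0])

bca = balanced_accuracy_score(y_true, y_pred)    # mean of per-class recalls
f1m = f1_score(y_true, y_pred, average="macro")  # unweighted mean of per-class F1
print(round(bca, 4), round(f1m, 4))  # 0.75 and ~0.7389
```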

G. SS-TrBoosting for Cross-Dataset Seizure Subtype Classification
Table VI compares SS-TrBoosting with seven baselines on "CHSZ→TUSZ" and "TUSZ→CHSZ" in semi-supervised SFDA, where only one event per class in the target domain was labeled. Observe that: 1) SS-TrBoosting achieved the best BCA and F1 in both tasks. 2) SVM and RLR achieved higher BCAs than MLP in both tasks. GBDT achieved a higher BCA than MLP on "TUSZ→CHSZ". These results suggested that traditional machine learning models may be more suitable for seizure subtype classification than MLP when the target data were unavailable. 3) SHOT, SHOT++ and NRC achieved higher BCAs than MLP on "CHSZ→TUSZ", but lower BCAs on "TUSZ→CHSZ", suggesting that SFDA approaches may result in negative transfer [7] when the target data were insufficient. 4) ASSEMBLE achieved lower BCAs than its source model RLR, suggesting that directly using a semi-supervised boosting-based approach may also result in negative transfer.
SHOT, SHOT++, NRC and ASSEMBLE used the model predictions of the current iteration to generate the pseudo-labels to guide the training of the next iteration. Therefore, the accuracy of the predictions on the unlabeled target data impacts the accuracy of the pseudo-labels in the next iteration. To further analyze why SS-TrBoosting outperformed them, we compared their average BCAs as the number of boosting iterations increased from zero to 50, as shown in Fig. 4. Observe that: 1) SS-TrBoosting's BCA increased steadily and then converged, indicating its robustness.
2) SHOT's BCA decreased on "TUSZ→CHSZ", as the pseudo-labels generated by clustering may be incorrect. 3) SHOT++'s BCA decreased and oscillated in both tasks, suggesting the use of MixMatch in SHOT++ may cause negative transfer when the unlabeled target data were class-imbalanced, since MixMatch does not explicitly consider class-imbalance. Moreover, SHOT++ increased the weight of the loss calculated on the unlabeled data as the iteration went on, which may lead to oscillations when the target data were insufficient. Since SHOT++ was trained on the basis of SHOT, its initial BCA equaled the final BCA of SHOT. 4) NRC's BCA first increased and then oscillated, suggesting that capturing only the intrinsic neighborhood structure may not be enough when the target data have class-imbalance. 5) ASSEMBLE's BCA first decreased and then increased, suggesting that it cannot effectively utilize the source model, since it only used its first base learner to fit the source model's predictions.
H. U-TrBoosting for Cross-Patient Seizure Classification

Table VII compares U-TrBoosting with 10 baselines on Bonn, CHSZ and TUSZ in unsupervised SFDA for cross-patient seizure classification. Observe that: 1) U-TrBoosting achieved the best BCA and F1 on all datasets. 2) EEGNet, CE-stSENet and TIE-EEGNet performed worse than GBDT, SVM, RLR and MLP in most cases (except CE-stSENet on TUSZ), demonstrating the effectiveness of the manually extracted features. 3) SHOT, SHOT++ and NRC had lower BCAs than MLP, suggesting that they may cause negative transfer if the target data were insufficient and/or class-imbalanced. These results demonstrated the effectiveness of SS-TrBoosting and U-TrBoosting in semi-supervised and unsupervised SFDA for seizure subtype classification, respectively.

I. Additional Analyses
To further investigate the robustness of SS-TrBoosting and U-TrBoosting, we performed experiments to study how their performance changes with different hyper-parameters. 1) Virtual Source Data: Figs. 5(a) and 5(b) show the average BCAs of SS-TrBoosting and U-TrBoosting with and without virtual source data, respectively, as the number of boosting iterations increased from zero to 50. Generally, as the number of boosting iterations increased, the performance of SS-TrBoosting and U-TrBoosting first increased and then converged to higher BCAs with fewer oscillations when virtual source data were used.
SS-TrBoosting and U-TrBoosting assign pseudo-labels for unlabeled target data in each iteration; however, incorrect pseudo-labels may result in training instability. Since the virtual source data's labels are fixed, they help retain the source domain knowledge and enhance the stability of the training process, especially when the target data are insufficient.
To further analyze the effect of β, which controls the distribution of the randomly generated virtual label, Figs. 6(a) and 6(b) show the average BCAs of SS-TrBoosting and U-TrBoosting, respectively, as β increased from 0.5 to 0.95. Generally, SS-TrBoosting and U-TrBoosting had stable performance w.r.t. different β values.
2) RFM: RFM almost always improved the performance of SS-TrBoosting and U-TrBoosting. Recall that Section III-C also showed that RFM sped up the training process. Thus, RFM is very beneficial to SS-TrBoosting and U-TrBoosting.
3) γ, the Portion of Unlabeled Target Data to Assign Pseudo-Labels: Fig. 8(a) shows the average BCAs of U-TrBoosting as γ increased from 0.1 to 0.9. Fig. 8(b) shows the average BCAs of MLP, whose predictions were used to assign the pseudo-labels.
MLP had relatively high BCA when γ was small (e.g., γ ∈ [0, 0.3]), and hence the performance of U-TrBoosting gradually increased with γ in this range. As γ further increased, the BCA of MLP rapidly decreased, and hence the BCA of U-TrBoosting also slightly decreased.

V. CONCLUSION
This paper proposed SS-TrBoosting for privacy-preserving semi-supervised seizure subtype classification, i.e., the source EEG data are not accessible during domain adaptation, to protect the privacy of the source subjects. We also extended SS-TrBoosting to U-TrBoosting, which assigns pseudo-labels to the high-confidence unlabeled target samples for privacy-preserving unsupervised SFDA. Experiments on three public seizure datasets (Bonn, CHSZ, and TUSZ) demonstrated the effectiveness of SS-TrBoosting and U-TrBoosting, as well as our 41 manually extracted features, in seizure classification.
Though we mainly considered a privacy-preserving scenario in which the source model, instead of the source data, is available, it should be noted that when the source data are available, SS-TrBoosting and U-TrBoosting can still be used, by replacing the virtual source data with the real ones.
Despite its outstanding performance, U-TrBoosting still has limitations, e.g., its model size increases with the number of boosting iterations, and it suffers from incorrect pseudo-labels. Our future research will integrate U-TrBoosting with Backfitting [41] to cyclically optimize a fixed number of base learners, and with the clustering [9] or neighborhood [17] hypothesis to improve the robustness to pseudo-labels.