Federated Motor Imagery Classification for Privacy-Preserving Brain-Computer Interfaces

Training an accurate classifier for EEG-based brain-computer interfaces (BCIs) requires EEG data from a large number of users, whereas protecting their data privacy is a critical consideration. Federated learning (FL) is a promising solution to this challenge. This paper proposes Federated classification with local Batch-specific batch normalization and Sharpness-aware minimization (FedBS) for privacy protection in EEG-based motor imagery (MI) classification. FedBS utilizes local batch-specific batch normalization to reduce data discrepancies among different clients, and a sharpness-aware minimization optimizer in local training to improve model generalization. Experiments on three public MI datasets using three popular deep learning models demonstrated that FedBS outperformed six state-of-the-art FL approaches. Remarkably, it also outperformed centralized training, which does not consider privacy protection at all. In summary, FedBS protects user EEG data privacy, enabling multiple BCI users to participate in large-scale machine learning model training, which in turn improves BCI decoding accuracy.


I. INTRODUCTION
EEG data from many subjects are usually needed to train an MI classifier with good generalization [3]. However, recent studies have found that EEG-based BCIs are subject to privacy threats [4], e.g., EEG data could leak users' private information including personal preference, health status, mental state, and so on. Due to legal regulations and user concerns, privacy-preserving machine learning for BCIs becomes a necessity.
Federated learning (FL) [5], [6] is a promising solution. For privacy-preserving BCIs, FL works as follows [illustrated using our proposed Federated classification with local Batch-specific batch normalization and Sharpness-aware minimization (FedBS) in Fig. 1]: a central server, which has no access to the local clients' private EEG data, maintains a global model and sends it to the local clients for updating; each client updates the global model parameters based on its own EEG data and sends them to the server for aggregation. In this way, a global model can be trained without sharing EEG data between the server and the clients, or among the clients. FL protects user privacy by preventing other devices from accessing raw data stored on the local client, thus avoiding the privacy risks of centralized datasets.
FedAvg [7] is one of the most popular FL approaches. To reduce the communication overhead, FedAvg performs multiple stochastic gradient descent updates on some chosen clients in each communication round and then aggregates the models on the server, until convergence. FedBN [8] uses local batch normalization (BN) [9] and excludes the BN layer parameters of all client models in communication, to alleviate client shifts; however, FedBN does not yield a complete classifier at the server, and is unable to adapt to previously unseen client data distributions.
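The FedAvg server step is simply a sample-size-weighted average of the client parameters. The following minimal numpy sketch illustrates it (function and variable names are our own, for illustration only):

```python
import numpy as np

def fedavg_aggregate(client_weights, client_sizes):
    """FedAvg server step: average client parameters, weighted by the
    number of local training samples each client holds.

    client_weights: one list of numpy arrays (the model's layers) per client
    client_sizes:   number of local training samples per client
    """
    total = float(sum(client_sizes))
    coeffs = [n / total for n in client_sizes]
    return [
        sum(c * layers[i] for c, layers in zip(coeffs, client_weights))
        for i in range(len(client_weights[0]))
    ]
```

For example, two clients holding 1 and 3 samples contribute to the aggregate with weights 0.25 and 0.75, respectively.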
FedBS improves the BN layer in FedAvg, ensuring each client's BN layers remain private during training, while the server has a complete model during the testing phase. It calculates batch-specific statistics for BN to mitigate feature shift across different clients. Furthermore, FedBS introduces the sharpness-aware minimization (SAM) optimizer [10] into the local training of the clients, encouraging their models to converge to flatter minima and hence enhancing the generalization. Experiments using three popular deep learning models on three MI datasets demonstrated FedBS's superior performance over six state-of-the-art FL approaches, even surpassing centralized training which does not consider privacy protection.
The remainder of this paper is organized as follows. Section II introduces related works on privacy-preserving learning and FL. Section III proposes FedBS. Section IV shows the experimental results on three MI datasets. Section V presents some discussions and points out future research directions. Finally, Section VI draws conclusions.

II. BACKGROUND INFORMATION ON PRIVACY-PRESERVING MACHINE LEARNING
This section introduces background knowledge and related works on privacy-preserving machine learning and FL, which will be used in the next section.

A. Privacy Protection
Privacy protection is especially important in BCI applications. Developing a commercial BCI system may require collaboration among multiple organizations, e.g., hospitals, universities, and/or companies. If raw EEG data are transferred directly during this process, then private information, e.g., physical/emotional states, may be leaked [4]. With increasing privacy protection requirements from both the governments (e.g., the European General Data Protection Regulation, the American Data Privacy and Protection Act, and the Personal Information Protection Law of China) and the end-users, privacy protection becomes necessary.
There are four popular privacy protection strategies [11], [12]: cryptography, perturbations, source-free domain adaptation, and FL. Cryptography utilizes encryption techniques like homomorphic encryption [13] and secure multi-party computation [14] to protect data privacy. Perturbation techniques such as differential privacy [15] add noise to or alter the original data while maintaining their utility for downstream tasks. The other two strategies are introduced in more detail next.

B. Source-Free Domain Adaptation
Source-free domain adaptation [16], a subcategory of transfer learning [17], considers adapting to a target domain (test subject) whose data distribution is shifted from the source domain (training subject). In contrast to traditional transfer learning, source-free transfer learning performs adaptation without accessing the source domain data, usually in the form of transferring a trained source model, to ensure source data privacy protection.
For EEG-based BCIs, Xia et al. [18] proposed augmentation-based source-free domain adaptation for cross-subject MI classification. Zhang et al. proposed lightweight source-free transfer [19], and further studied multi-source decentralized transfer [20], for privacy-preserving BCIs.

C. Federated Learning
FL aims to build a global model from private data located at multiple sites, without access to the raw data.
FedAvg, a simple yet widely used FL approach, has difficulty in handling local data heterogeneity. To tackle this problem, FedProx [21] introduces a proximal term into the local objective function of the clients to reduce the discrepancy between local and global models. SCAFFOLD [22] employs control variables to correct the "client drift" during local updates. Hsu et al. [23] introduced the concept of momentum into server-side aggregation to mitigate distribution discrepancies. Some works have integrated optimization techniques with federated learning. Reddi et al. [24] introduced federated versions of adaptive optimizers (Adagrad, Adam, and Yogi), which improve federated learning convergence speed and performance. Jin et al. [25] proposed FedDA, a momentum-decoupling adaptive optimization approach, ensuring convergence in federated learning.
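As a concrete illustration of FedProx's idea, each client minimizes its local task loss plus a proximal term (µ/2)‖w − w_global‖² that penalizes drift from the current global model. A minimal sketch, with parameters represented as lists of numpy arrays (names are ours, not FedProx's reference implementation):

```python
import numpy as np

def fedprox_local_loss(task_loss, w, w_global, mu=1.0):
    """FedProx client objective: local task loss plus a proximal term
    penalizing the squared L2 distance between the local parameters w
    and the global model parameters w_global."""
    prox = 0.5 * mu * sum(
        np.sum((wl - wg) ** 2) for wl, wg in zip(w, w_global)
    )
    return task_loss + prox
```

A larger µ keeps the local update closer to the global model, which stabilizes training under heterogeneous client data at the cost of slower local progress.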
Limited studies have been carried out on FL for BCIs, and their primary goal was not privacy protection. Ju et al. [26] introduced a federated transfer learning framework, which uses single-trial covariance matrices and domain adaptation techniques to extract shared discriminative information from EEG data of multiple subjects; however, their approach may compromise user data privacy and does not work with Euclidean space features and models. Liu et al. [27] split the classifier into a local module and a global module to merge knowledge from different EEG datasets; however, they mainly focused on personalized FL, with limited attention to user data privacy protection.
In summary, a generic FL approach for BCIs, which is applicable to Euclidean space features, achieves user data privacy protection, and boosts the decoding performance simultaneously, is still lacking. Compatibility with Euclidean space features means that the approach can work with diverse network architectures commonly used in EEG data analysis, and can be seamlessly integrated with various data preprocessing, data augmentation, and machine learning algorithms.

III. FEDBS
This section introduces the details of our proposed FedBS approach.

A. Definitions and Notations
Table I summarizes the main notations used in this paper.
Assume there are K subjects as clients. The k-th client has n_k EEG trials D_k = {(X_i, y_i)}_{i=1}^{n_k} and a local classifier with parameters w_t^k, where X_i ∈ R^{C×T} (C is the number of EEG channels, and T the number of sampling points), y_i ∈ {1, ..., N_c} (N_c is the number of classes), and t ∈ {1, ..., N_t} (N_t is the maximum number of communication rounds) respectively represent the i-th trial, its corresponding label, and the communication round index. Assume also there is a server with n_s EEG trials D_s = {(X_i, y_i)}_{i=1}^{n_s} from an unknown subject for test. The goal is to train a classifier without exposing the raw client data, and subsequently evaluate its performance on the test subject at the server.

B. Overview of FedBS
FedBS consists of a server and several clients, as illustrated in Fig. 2. Algorithm 1 shows the pseudo-code. The Python code is available at https://github.com/TianwangJia/FedBS.
Each client holds data from an individual subject and trains a local classifier on it. The server handles model distribution and aggregation, with BN layers added only during aggregation. We utilize local batch-specific BN to reduce feature shift across different clients, as described in Section III-C. Furthermore, we introduce an SAM optimizer for client model training, as outlined in Section III-D.

C. Local Batch-Specific BN
Inspired by FedBN [8], we localize the BN layer parameters and improve BN's FL training to accommodate generalization to test scenarios:
• Clients upload all parameters, including those of the BN layers;
• The server aggregates all model parameters, but distributes the model parameters without those of the BN layers.
This approach ensures localized BN parameters on each client for better adaptation to its specific data distribution, while also providing the server with a complete model structure and model parameters for on-demand tests.
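The upload/aggregate/distribute rule above can be sketched as follows, with models represented as flat name-to-array dictionaries. A plain unweighted average is used for simplicity, and detecting BN parameters by a "bn" substring in the name is purely an illustrative assumption:

```python
import numpy as np

def server_round(global_model, client_models, is_bn=lambda name: "bn" in name):
    """One FedBS-style server round: aggregate ALL parameters (so the server
    keeps a complete model), but broadcast only the non-BN parameters,
    leaving each client's BN layers local."""
    # aggregation: average over clients, BN parameters included
    for name in global_model:
        global_model[name] = np.mean([m[name] for m in client_models], axis=0)
    # distribution: overwrite only the non-BN parameters on each client
    for m in client_models:
        for name in global_model:
            if not is_bn(name):
                m[name] = global_model[name].copy()
    return global_model, client_models
```

After each round, the server holds a complete, testable model (including aggregated BN parameters), while every client retains BN parameters fitted to its own subject.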
To better accommodate the EEG data distribution discrepancies from different subjects, FedBS further calculates batch-specific statistics of BN.
As pointed out in [9], a BN layer has four sets of parameters: µ, σ, γ, and β. µ and σ are respectively the mean and standard deviation of the features, and γ and β are the learnable scale and shift parameters. Note again that our proposed FedBS yields a complete classifier at the server, and is able to adapt to previously unseen client data distributions, whereas FedBN [8] is primarily for personalized FL scenarios.
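For intuition, batch-specific BN normalizes each batch with its own µ and σ instead of running averages, then applies γ and β. A minimal numpy sketch over 2-D features (real EEG models apply this per channel of convolutional feature maps; the function name is ours):

```python
import numpy as np

def batch_specific_bn(x, gamma, beta, eps=1e-5):
    """Batch-specific BN: normalize x with THIS batch's mean/variance
    (no running statistics), then apply the learnable scale/shift.
    x: (batch, features) array."""
    mu = x.mean(axis=0)    # per-feature batch mean
    var = x.var(axis=0)    # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```

Because the statistics come from the current batch, the same layer automatically adapts to whichever subject's data it is fed, at training and at test time alike.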

D. SAM Optimizer
To boost model generalization, inspired by [28] and [29], we employ the SAM optimizer [10] in client model training.
As pointed out in [28], client shift leads each local model to focus on its own biased data, deviating from the global optimum. SAM helps each local model converge to a flatter optimum in the loss landscape. When aggregated, the global model could be closer to the global optimum and have better generalization.
Specifically, SAM minimizes the following loss function:

min_w L_SAM(w) + (λ/2)‖w‖₂²,

where L_SAM(w) ≜ max_{‖ϵ‖_p ≤ ρ} L(w + ϵ) is the SAM loss, in which L(w) is the cross-entropy loss of the original parameters, L(w + ϵ) is the cross-entropy loss when the parameters are perturbed by ϵ, ρ ≥ 0 is a hyperparameter that governs the range of ϵ, and p ∈ [1, ∞] (p = 2 is often used [10]). (λ/2)‖w‖₂² is a regularization term. L_SAM(w) above can be re-expressed as:

L_SAM(w) = L(w) + [ max_{‖ϵ‖_p ≤ ρ} L(w + ϵ) − L(w) ].

The first term is the cross-entropy loss, whereas the second is the sharpness of the cross-entropy loss. Thus, SAM simultaneously minimizes both the cross-entropy loss (classification loss) and its sharpness, leading to better generalization.
To solve the above minimax problem, we first find the ϵ*(w) that maximizes L(w + ϵ):

ϵ*(w) ≜ arg max_{‖ϵ‖_p ≤ ρ} L(w + ϵ) ≈ arg max_{‖ϵ‖_p ≤ ρ} ϵᵀ ∇_w L(w),

where the approximation comes from a first-order Taylor expansion of L(w + ϵ) around w. When the L₂ norm is used, this yields:

ϵ*(w) = ρ ∇_w L(w) / ‖∇_w L(w)‖₂.

Substituting ϵ*(w) into the SAM loss and neglecting the higher-order terms, we have:

∇_w L_SAM(w) ≈ ∇_w L(w) |_{w + ϵ*(w)}.

Let w_t be the classifier parameters after t steps of model update. Then, SAM involves two steps:

w_t^adv = w_t + ρ ∇_w L(w_t) / ‖∇_w L(w_t)‖₂,
w_{t+1} = w_t − η ∇_w L(w) |_{w = w_t^adv},

where η is the learning rate. So, SAM first performs gradient ascent to find the model parameters that maximize the loss function within a specified range, and then performs gradient descent on the original model parameters.
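The two-step update above can be sketched in a few lines of numpy; the toy quadratic loss and the hyperparameter values below are illustrative only (the paper's experiments use ρ = 0.1 inside deep network training):

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One SAM update: ascend to the worst-case neighbor within radius rho
    (L2 norm), then apply the descent step to the ORIGINAL weights."""
    g = grad_fn(w)                               # gradient at w_t
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # epsilon*(w) for p = 2
    g_sam = grad_fn(w + eps)                     # gradient at w_t + epsilon*
    return w - lr * g_sam                        # descent from w_t

# toy example: L(w) = 0.5 * ||w||^2, so grad L(w) = w
w = np.array([1.0, -2.0])
for _ in range(200):
    w = sam_step(w, lambda v: v)
```

Note that each SAM step costs two gradient evaluations, roughly doubling the per-batch computation compared to plain stochastic gradient descent.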

IV. EXPERIMENTS AND RESULTS
We performed MI classification experiments on three EEG datasets to validate the effectiveness of FedBS.

A. Datasets and Preprocessing
Three EEG-based MI datasets, summarized in Table II, were used in our experiments. The theoretical chance levels for the three datasets are 25%, 50%, and 50%, respectively [30].
The three datasets used similar data collection procedures. A subject sat in front of a computer screen. Each trial began with a fixation cross and a brief warning tone. Following that, a visual cue (e.g., an arrow) appeared for several seconds, during which the subject performed a specific MI task. Then, there was a brief rest period. EEG signals were recorded throughout the entire experiment.
All datasets were downloaded and 8-30 Hz bandpass filtered using MOABB [34]. MI2 was also downsampled to 250 Hz, to be consistent with the other two datasets. We extracted EEG trials in [0, 4] s for MI1, and [0, 5] s for MI2 and MI3, after each task stimulus.
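A hedged scipy sketch of the same preprocessing steps (MOABB performs the filtering internally; the filter order and the assumption that the original sampling rate is an integer multiple of 250 Hz are ours):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess(eeg, fs, fs_new=250, band=(8.0, 30.0)):
    """Bandpass filter a (channels, samples) EEG trial to 8-30 Hz,
    then decimate to fs_new by simple subsampling."""
    b, a = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="band")
    filtered = filtfilt(b, a, eeg, axis=-1)  # zero-phase filtering
    step = int(fs // fs_new)                 # assumes fs % fs_new == 0
    return filtered[:, ::step]
```

Subsampling after the 8-30 Hz bandpass is safe here because all remaining signal content lies well below the new Nyquist frequency of 125 Hz.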

B. Baseline Algorithms
We compared FedBS with a centralized training approach and six existing FL approaches:
1) Centralized training (CT), which pools data from all subjects together for training, without any privacy protection.
2) FedAvg [7], the most widely used FL algorithm. Each client performs multiple stochastic gradient descent update steps in a single communication round to balance the communication cost and the accuracy.
3) FedProx [21], which introduces a proximal term into the client objective function to reduce the discrepancy between local and global models.
4) SCAFFOLD [22], which employs control variables to correct the 'client drift' during local updates.
5) MOON [35], which leverages the similarity between model representations in local training.
6) FedFA [36], which performs federated feature augmentation.
7) GA [37], which introduces a fairness objective measured by the variance of the generalization gaps among different source domains, and optimizes it through dynamic aggregation weight adjustments. Note that GA requires all clients to participate in every communication round.

C. Experiment Settings and Hyperparameters
We used leave-one-subject-out cross-validation in performance evaluation.Six repeats with different random seeds were performed, and the average results are reported.
Specifically, in CT, one subject was designated as the test subject, and all others' data were combined for model training. In all FL approaches, the number of clients equaled the total number of subjects minus one. Each client represented one training subject, and the server used the remaining subject's data for test. During each communication round, half of the clients (rounded down) were randomly chosen, and each performed two local training epochs.
A stochastic gradient descent optimizer with weight decay 0.0001 and momentum 0.9 was used in all approaches. CT was trained for 200 epochs, whereas all others were trained for 200 global communication rounds. On MI1 and MI3, CT used a batch size of 64 and all others used a batch size of 32; on MI2, since each subject had fewer samples, the batch sizes were 32 and 16, respectively. EEGNet and DeepConvNet used a learning rate of 0.005, whereas ShallowConvNet used 0.0001. The test batch size was fixed at 8.
FedBS introduces only one additional hyperparameter, which was set to ρ = 0.1. Other hyperparameters of the different FL approaches were fine-tuned within a small range based on their respective literature: µ = 1.0 for FedProx, µ = 1.0 and τ = 0.5 for MOON, p = 0.5 for FedFA, and d = 0.05 for GA.

D. Cross-Subject Classification Performance
The cross-subject classification accuracies without EA (Euclidean alignment) are shown in Table III. The detailed results on the three MI datasets with EA are shown in Tables IV-VI, respectively. The best performance in each column is marked in bold, and the second-best underlined. We can observe that:
1) FedBS almost always achieved the best performance. Remarkably, it almost always outperformed CT, which did not consider privacy protection at all. In other words, our proposed FedBS simultaneously achieved data privacy protection and classification accuracy improvement.
2) EA boosted the performance of all approaches, including FedBS. On average, when EA was used, FedBS outperformed CT by 1.97%, and the second-best FL approach by 3.08%.
To evaluate whether FedBS outperformed the other approaches significantly, we first calculated the p-values of paired t-tests, and then adjusted them by the Benjamini-Hochberg false discovery rate correction [42]. The results are shown in Table VII, where p-values smaller than 0.05 are marked in bold. It is evident that the performance improvements of FedBS over the others were almost always statistically significant. Further details of the paired t-tests are provided in the Supplementary Materials.

E. Ablation Studies
We performed ablation studies to confirm the effectiveness of the two components (local batch-specific BN, and SAM) of FedBS. We also performed paired t-tests on the ablation study results, and adjusted the p-values using the Benjamini-Hochberg false discovery rate correction. The results are shown in Table VIII, where ⋆ indicates that the adjusted p-value of the paired t-test between FedBS and a variant is less than 0.05. Further details of the paired t-tests are provided in the Supplementary Materials.
Table VIII shows that every individual strategy was effective, and their combination achieved the best performance in all scenarios. Specifically, using only the local batch-specific BN, the three models improved 4.35%, 7.01%, and 1.23%, respectively, over FedAvg. Using only the SAM, the three models improved 3.12%, 1.90%, and 1.33%, respectively, over FedAvg. When the two strategies were combined, the three models improved 5.09%, 7.99%, and 2.09%, respectively, over FedAvg.
To emphasize again, local batch-specific BN enhances generalization by making the feature distributions more uniform and adaptable to new subjects, and SAM improves generalization by introducing perturbations during gradient computation.

F. Effect of FL Parameters
Figs. 3 and 4 show the performance of different FL approaches with varying client selection weight P and number of local computation epochs E, respectively, when EEGNet was used as the backbone. Note that for easier understanding, the horizontal axis of Fig. 3 is P · K, the number of selected clients.
FedBS always outperformed the other FL approaches, regardless of P and E. Particularly, Fig. 3 shows that FedBS's performance gradually increased with P, whereas the other approaches did not have a consistent pattern.
In practice, one must carefully select appropriate P and E to trade off the communication cost and the classification accuracy. However, no matter which P and E were used, our proposed FedBS always had the best performance. The Generalized Discrimination Value (Table IX), ranging in [−1, 0], quantifies the separability of neural network features, with lower values indicating better separability; the Manhattan distance was used in its calculation, as it is more suitable for high-dimensional data [45]. Clearly, FedBS's features were more separable. This is because FedBS utilizes local batch-specific BN to align samples from different subjects, reducing the distribution disparities and improving the classification performance on new subjects.

H. Effect of Test Batch Size
As FedBS calculates the BN statistics for each batch, the batch size also impacts the test results. Fig. 6 shows the performance of FedBS under different test batch sizes, when EEGNet was used. The performance increased with the test batch size, but plateaued once the batch size reached 4.
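This batch-size effect has a simple statistical reading: the per-batch BN mean and variance are noisy estimates whose error shrinks roughly as 1/√(batch size), so very small test batches normalize with unreliable statistics. An illustrative simulation (the numbers are synthetic, not EEG):

```python
import numpy as np

rng = np.random.default_rng(0)
pop = rng.normal(2.0, 3.0, size=100000)  # values of one BN feature channel

# variability of the estimated batch mean for several test batch sizes
stds = []
for bs in (1, 2, 4, 8, 32):
    means = pop[: bs * 1000].reshape(1000, bs).mean(axis=1)
    stds.append(means.std())
# stds shrinks roughly as 3 / sqrt(bs): already much smaller by bs = 4
```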

V. DISCUSSION AND FUTURE RESEARCH
The experimental results in the previous section demonstrated that both the local batch-specific BN and the SAM optimizer introduced by FedBS effectively enhance the cross-subject decoding accuracy. In FL, each local batch contains data from an individual subject, so local batch-specific BN performs subject-specific normalization. Models trained individually on subjects in FL are biased and sensitive to perturbations due to limited training data, making them suitable for the SAM optimizer. However, these conditions do not hold in CT; so, how to extend local batch-specific BN and the SAM optimizer to CT requires further research.
Our current work also has some limitations, e.g., FedBS was only validated on the classic MI-based BCIs, whereas there are many other popular BCI paradigms; and, FedBS only considers the simple homogeneous EEG classification scenario, where the number and locations of EEG channels are the same for all subjects. To expand the applicability of FedBS, our future research will:
1) Extend FedBS to other BCI paradigms, e.g., steady-state visual evoked potentials, P300 event-related potentials, and affective BCIs [46].
2) Extend FedBS to more challenging and more flexible heterogeneous BCI applications, where the number and/or locations of EEG channels differ across subjects.

Fig. 2. Overview of FedBS. Each client represents an individual subject. 'Batch-specific BN' means the BN layer statistics are computed independently for each batch.

Fig. 3. Average classification accuracies of different FL approaches w.r.t. P · K, the number of selected clients. GA was not included because it requires the participation of all clients in each round of training. (a) MI1; (b) MI2; and, (c) MI3.

Fig. 5 shows t-SNE [43] visualizations of features extracted by CT, FedAvg, and FedBS from test Subject 1 of the MI2 dataset. Table IX presents the average Generalized Discrimination Values [44] calculated on features extracted by CT, FedAvg, and FedBS from each test subject in the three datasets.

TABLE III. AVERAGE CROSS-SUBJECT CLASSIFICATION ACCURACIES (%) WITHOUT EA. THE BEST ACCURACY IN EACH COLUMN IS MARKED IN BOLD, AND THE SECOND BEST BY AN UNDERLINE.

TABLE IV. CROSS-SUBJECT CLASSIFICATION ACCURACIES (%) WITH EA ON MI1. THE BEST ACCURACY IN EACH COLUMN IS MARKED IN BOLD, AND THE SECOND BEST BY AN UNDERLINE.

TABLE V. CROSS-SUBJECT CLASSIFICATION ACCURACIES (%) WITH EA ON MI2. THE BEST ACCURACY IN EACH COLUMN IS MARKED IN BOLD, AND THE SECOND BEST BY AN UNDERLINE.

TABLE VI. CROSS-SUBJECT CLASSIFICATION ACCURACIES (%) WITH EA ON MI3. THE BEST ACCURACY IN EACH COLUMN IS MARKED IN BOLD, AND THE SECOND BEST BY AN UNDERLINE.

TABLE VII. ADJUSTED p-VALUES BETWEEN FEDBS AND THE OTHER APPROACHES. p-VALUES LESS THAN 0.05 ARE MARKED IN BOLD.

TABLE VIII. AVERAGE CLASSIFICATION ACCURACIES (%) IN THE ABLATION STUDIES. THE BEST ACCURACY IN EACH COLUMN IS MARKED IN BOLD. ADJUSTED p-VALUES OF THE PAIRED t-TESTS BETWEEN FEDBS (WITH LOCAL BATCH-SPECIFIC BN AND SAM) AND THE OTHERS LESS THAN 0.05 ARE INDICATED WITH ⋆. NOTE THAT THE 'AVG.' COLUMN DID NOT PARTICIPATE IN THE PAIRED t-TESTS.

G. Visualization

TABLE IX. THE AVERAGE GENERALIZED DISCRIMINATION VALUES CALCULATED ON FEATURES EXTRACTED BY CT, FEDAVG, AND FEDBS FROM EACH TEST SUBJECT IN THE THREE DATASETS. THE BEST VALUE IN EACH COLUMN IS MARKED IN BOLD.

Fig. 6. Classification accuracies of FedBS under different test batch sizes.