Reduce the User Burden of Multiuser Myoelectric Interface via Few-Shot Domain Adaptation

Due to physiological and anatomical variations across users, myoelectric interfaces trained on data from multiple users do not readily adapt to the unique hand movement patterns of a new user. Most current work requires the new user to provide one or more trials per gesture (dozens to hundreds of samples) and applies domain adaptation methods to calibrate the model and achieve promising movement recognition performance. However, the user burden associated with time-consuming electromyography signal acquisition and annotation is a key factor hindering the practical application of myoelectric control. As shown in this work, once the number of calibration samples is reduced, the performance of previous cross-user myoelectric interfaces degrades because there are not enough statistics to characterize the distributions. In this paper, a few-shot supervised domain adaptation (FSSDA) framework is proposed to address this issue. It aligns the distributions of different domains by calculating the distribution distances of point-wise surrogates. Specifically, we introduce a positive-negative pair distance loss to find a shared embedding subspace in which each scarce sample from the new user is closer to the positive samples and farther from the negative samples of multiple users. Thus, FSSDA allows every target domain sample to be paired with all source domain samples and optimizes the feature distance between each target domain sample and the source domain samples within the same batch, instead of directly estimating the data distribution of the target domain. The proposed method is validated on two high-density EMG datasets, achieving average recognition accuracies of 97.59% and 82.78% with only 5 samples per gesture. In addition, FSSDA remains effective even when only one sample per gesture is provided. The experimental results show that FSSDA greatly reduces the user burden and further facilitates the development of myoelectric pattern recognition techniques.


I. INTRODUCTION
Hand movement recognition based on surface electromyogram (sEMG) signals is a practical technique since it is easy to use and non-invasive [1]. The myoelectric interface enables interaction between the user and external devices and helps both healthy and disabled people with their everyday activities. Such external devices include powered prostheses [2], rehabilitation robots [3] and Virtual Reality (VR) gaming interfaces [4]. As a key technology in myoelectric control, myoelectric pattern recognition allows dexterous movement of multiple degrees of freedom [5]. Typically, after collecting electromyogram (EMG) data from the user using surface electrodes, the hand motion recognition model can be configured using myoelectric pattern recognition techniques. Following a guided training protocol, high gesture recognition accuracy can be achieved in a controlled laboratory environment.
However, there still exists a gap between the research studies and clinical implementations due to non-ideal factors, such as heavy user burden and the variability in the characteristics of sEMG signals [6]. The new user is required to perform the calibration process to train the classifier before using the prosthetic device, which is time-consuming and may take several days in practice. Additionally, the EMG signal also has a user-dependent nature, a factor that leads to differences in the signal measured when different users perform the same movement, despite the electrodes being worn in the same position [7]. This property is attributed to physiological and anatomical variations across users, e.g. fiber composition, skin resistance, muscle geometry, and fat content [8].
Many efforts have been devoted to designing a cross-user model that is grounded in the background data of other users and needs minimal effort to accommodate changes in the EMG signal characteristics of the new user [9], [10], [11]. Khushaba [7] sought to find the correlation between the pooled users and the new user. All users' features were projected to a unified-style space associated with the expert set via canonical correlation analysis (CCA). The mapping from the new user to the expert set was learned with one trial of each gesture. This framework can overcome individual differences, achieving accuracies above 83% across multiple users. On this basis, Xue et al. [12] proposed a framework called CCA-OT (canonical correlation analysis-optimal transport) to further reduce the discrepancy in data distribution between the new user and other users via optimal transport. They achieved an 8.49% improvement over the CCA-only framework on thirteen Chinese sign language hand movements performed by ten intact-limb subjects.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In recent years, with the development of deep learning techniques, convolutional neural networks (CNNs) have been widely used in gesture recognition [13], [14], [15]. Atzori et al. verified that a CNN with a simple architecture can achieve better classification performance than conventional methods [16]. However, applying CNNs directly to cross-user gesture recognition is not very effective. Therefore, domain adaptation methods such as multistream Adaptive Batch Normalization (MS-AdaBN) [17], supportive model selection [18] and adversarial training [11] have been combined with CNNs to improve cross-user performance. Domain adaptation (DA) is a technique that uses labeled data from an associated source domain to solve a new task in the target domain [19]. DA is divided into supervised DA (SDA) and unsupervised DA (UDA) according to whether the target domain data is labeled or not. For supervised DA, Kim et al. [18] first selected supportive CNN classifiers pretrained on several subjects, and then fine-tuned the classifiers with the first trial per gesture of the new subject. The final decision was the class most commonly predicted by the supportive CNN classifiers. Campbell et al. [11] combined adaptive batch normalization and domain adversarial training with a CNN, which outperformed CCA in the single-repetition calibration cross-user setting. While the above studies have achieved great success on cross-user gesture recognition, they still require the new subject to provide one or more trials per gesture, each held for 3-5 s, to adapt the model. As the number of gestures increases, the signal collection time grows accordingly. What is worse, because sEMG is non-stationary and sensitive, the classifiers, as well as the prostheses, need to be recalibrated every day to achieve optimal performance, which may become a heavy usage burden for the end-user [20].
Recently, Zhang et al. [21] attempted to cope with the cross-user gesture recognition problem using a UDA method and achieved promising performance. They proposed a method to continuously update the parameters of the classifier by incorporating a self-guided adaptive sampling strategy for online learning. After several iterations, the performance of the classifier gradually stabilizes, realizing a plug-and-play gestural interface. However, in the first few iterations, the classifier still needs a few hundred samples from the new user to align the marginal distribution of the two domains. Therefore, the end-user has to put up with the poor performance of the model at the beginning of its use. Typically each trial is divided into dozens to hundreds of samples by the sliding window, so the usage burden on the new user would be further reduced if only several samples per gesture were used to complete the calibration. Previous methods learned the differences between the data distributions of the new user and multiple users by calculating distances and similarities between distributions, and these are difficult to estimate with as few as one sample per gesture. The need for a cross-user gesture recognition model that relies on less training data while attaining high performance has not been met.
In the few-shot scenario, i.e. when the number of samples from the new user is small, aligning the distributions of the source domain and target domain becomes a challenge. To tackle this problem, inspired by [22], we introduce a few-shot supervised domain adaptation (FSSDA) framework based on a positive-negative pair distance loss (P-N pair distance loss). This loss aims to learn a shared feature embedding in which each sample of the new user is closer to the positive samples and farther from the negative samples of the source domain. Our proposed model can adequately train on each scarce target domain sample by pairing it with the source domain samples within the same minibatch, which facilitates the classification of the new user's samples. The strength of the point-wise surrogates approach is that it allows even a single labeled target domain sample to be paired with all source domain samples, efficiently aligning the data distributions across domains. In this study, we explore whether the proposed multiuser myoelectric interface can rapidly adapt to the new user using background data from other users and significantly reduce the usage burden. This work is therefore of interest for promoting the widespread commercial and clinical use of myoelectric control systems.
The rest of the paper is organized as follows. Section II presents the datasets and details the proposed framework FSSDA. In Section III, the design of experiments and results are included. An in-depth discussion is presented in Section IV. Finally, Section V concludes our work.

A. Data Preparation
High-density surface electromyogram (HD-sEMG) is measured by a large two-dimensional array with close-spaced electrodes, which increases the possibility of extracting spatial information due to the increased density and coverage of the electrodes [23]. In the field of gesture recognition, combining it with CNN will achieve better performance [14], [15], [24]. Therefore, we conducted our experiments on two high-density EMG datasets. The first was collected by our group [21] and the second was the CapgMyo database collected by another research group [17].
1) Dataset-I: The first dataset contains nine intact-limb subjects (referred to as M1-to-M9), including five males and four females, all right-handed and aged between 24 and 35 years. All subjects were asked to execute six gestures, which included little finger extension, middle finger extension, index finger extension, extension of both index and middle fingers, extension of the last three fingers, and wrist extension. Each gesture was performed 10 times and one repetition of a gesture is called a trial. One trial lasted for about 5s at a comfortable medium-force level followed by 3s of rest to avoid muscle fatigue. HD-sEMG signal was collected using 100 mono-polar channels which were arranged in a 10 × 10 grid placed on the surface of the extensor carpi ulnaris and the extensor digitorum with a sampling rate of 1,000 Hz. The signal was then processed by a two-stage amplifier with a total gain of 60 dB and filtered by a 20-500 Hz bandpass filter. More details on signal acquisition and electrode configuration were provided in [21].
2) Dataset-II: In addition to testing on our dataset, we further used the CapgMyo DB-c dataset collected and first used by Du et al. [17]. EMG signals were recorded from ten healthy subjects (referred to as N1-to-N10) ranging in age from 23 to 26 years with an 8 × 16 electrode array. The array consists of 8 acquisition modules, each containing an 8 × 2 matrix of differential electrodes. The first module was placed on the extensor digitorum communis muscle and the other modules were arranged equidistantly around the surface of the right forearm. A 16-bit analogue-to-digital converter was used to sample the signal at 1,000 Hz. The signal was then transferred to the PC via Wi-Fi and filtered by a band-pass filter between 20 and 380 Hz. Twelve classes of finger movements relevant to the activities of daily living were performed during the experiments. Each gesture was held for 3 s to 10 s and repeated 10 times. The subjects rested for 7 s after completing each trial to avoid muscle fatigue. To match the labels precisely, only the static part of the movement (about 1 s) was used. In the experiment, the processed data from N1-N9 were used since the data of N10 could not be downloaded from the website provided in the paper [17].

B. Data Preprocessing
For DATASET-I, in accordance with the typical settings described in the literature [5], [25], the sEMG signal stream was split into a series of overlapping windows (window length: 256 ms, step size: 128 ms). For DATASET-II, following the earlier work on that database [14], [17], the window length and the step size were set to 150 ms and 50 ms respectively. Then the windows of the resting segments were discarded using an amplitude threshold [26]. The threshold was empirically set to the mean plus three times the standard deviation of the baseline signal averaged over all channels. Thus, the EMG signal of one trial was split into approximately 40 samples for DATASET-I and 20 samples for DATASET-II. For each analysis window of the active segment, three features were empirically extracted for each channel: the waveform length from the four time-domain feature set [27], and f1 and f6 from the time-dependent power spectrum descriptors feature set [28]. As a result, the analysis windows of the two datasets are converted into 10 × 10 × 3 and 8 × 16 × 3 feature matrices respectively. These feature matrices are treated as feature images, facilitating the training of CNN models [15], [16].
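The windowing step above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names are hypothetical, the threshold helper assumes a population standard deviation over a one-dimensional baseline, and only the waveform length feature is shown (f1 and f6 from [28] are omitted).

```python
def sliding_windows(n_samples, fs_hz, window_ms, step_ms):
    """Return (start, stop) sample indices of overlapping analysis windows."""
    win = int(round(window_ms * fs_hz / 1000))
    step = int(round(step_ms * fs_hz / 1000))
    return [(s, s + win) for s in range(0, n_samples - win + 1, step)]


def waveform_length(x):
    """Waveform length: sum of absolute differences between consecutive samples."""
    return sum(abs(b - a) for a, b in zip(x, x[1:]))


def rest_threshold(baseline):
    """Mean plus three standard deviations of a baseline (rest) amplitude series.

    Assumption: population standard deviation; the paper does not specify this.
    """
    n = len(baseline)
    mean = sum(baseline) / n
    std = (sum((v - mean) ** 2 for v in baseline) / n) ** 0.5
    return mean + 3 * std


# A 5 s trial at 1,000 Hz with 256 ms windows and a 128 ms step (DATASET-I settings)
windows_i = sliding_windows(5000, 1000, 256, 128)
# A 1 s static segment at 1,000 Hz with 150 ms windows and a 50 ms step (DATASET-II settings)
windows_ii = sliding_windows(1000, 1000, 150, 50)
print(len(windows_i), len(windows_ii))
```

With these settings the sketch yields 38 and 18 windows before rest windows are discarded, in line with the approximate counts of 40 and 20 samples per trial quoted above.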

C. SDA With Scarce Target Domain Samples
Suppose that we have two domains: the source domain $D_s = \{(x_i^s, y_i^s)\}_{i=1}^{M}$ (the training set made of pairs representing the samples of existing users) and the target domain $D_t = \{x_j^t\}_{j=1}^{N}$ (representing samples of the new user). The features $x_i^s \in X_s$ and $x_j^t \in X_t$ are realizations of the random variables $X_s$ and $X_t$ respectively, while the labels $y_i^s \in Y_s$ and $y_j^t \in Y_t$ are realizations of the random variables $Y_s$ and $Y_t$ respectively. In our problem, the feature space and the label space of the two domains are the same. It is assumed that there is a covariate shift [29] between $X_s$ and $X_t$ and thus a difference between the probability distributions $p(X_s)$ and $p(X_t)$. SDA is used to train a robust classifier $f : X \to Y$ capable of classifying the unlabeled target samples $D_{tu}$ by using the labeled data from $D_s$ and the labeled target samples $D_{tl}$. In this work, our particular concern is the case in which only several labeled samples per class of the target domain are available, or even only one sample per class.

D. P-N Pair Distance Loss
Typically, $f$ can be decomposed into two functions, i.e. $f = h \circ k$, where $k : X \to Z$ is a mapping from the input space $X$ to an embedding feature space $Z$, and $h : Z \to Y$ is a function for making predictions. For each sample $x_i^t$ in the labeled target set $D_{tl}$, we can find a positive sample $x_j^s$ of the same class in the source domain and a negative sample $x_k^s$ of a different class in the source domain. As shown in Fig. 1, to make the distance between the target domain sample and the positive sample (P-pair distance) as small as possible and the distance between it and the negative sample (N-pair distance) as large as possible, the positive and negative pair distance loss (P-N pair distance loss) $L_{P-N}$ is defined in (1), where $[\cdot]_+ = \max(\cdot, 0)$, $\|\cdot\|_F$ denotes the Frobenius norm, $n_t$ and $n_s$ are the numbers of target domain and source domain samples in a batch, and $\alpha$ is the margin of separability between the P-N pairs in the embedding space. Note that, if the selection of samples from the two domains is not considered, the P-N pair distance loss reduces to the well-known triplet loss [30].
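As a rough illustration of a loss matching the description above, the following sketch assumes a triplet-style hinge on squared Euclidean distances, averaged over all (target, positive, negative) triples in a batch. The squared distance, the averaging, and the function names are assumptions made for illustration and may differ from the exact form of (1).

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def pn_pair_distance_loss(z_t, y_t, z_s, y_s, alpha):
    """Hinge on (P-pair distance - N-pair distance + alpha) over all triples.

    z_t, z_s: lists of target / source embeddings k(x) in the same batch.
    y_t, y_s: the corresponding class labels.
    Assumption: the result is averaged over the number of triples.
    """
    total, count = 0.0, 0
    for zt, yt in zip(z_t, y_t):
        positives = [zs for zs, ys in zip(z_s, y_s) if ys == yt]
        negatives = [zs for zs, ys in zip(z_s, y_s) if ys != yt]
        for zp in positives:
            for zn in negatives:
                total += max(sq_dist(zt, zp) - sq_dist(zt, zn) + alpha, 0.0)
                count += 1
    return total / count if count else 0.0


# One target sample of class 0; the positive is near and the negative is far: zero loss
well_separated = pn_pair_distance_loss([[0, 0]], [0], [[0, 1], [3, 0]], [0, 1], alpha=0.3)
# Same target, but the negative is nearer than the positive: the margin is violated
violated = pn_pair_distance_loss([[0, 0]], [0], [[2, 0], [1, 0]], [0, 1], alpha=0.3)
print(well_separated, violated)
```

The loss is zero whenever every negative is farther from the target sample than every positive by at least the margin, so only violating triples contribute gradients.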

E. Description of FSSDA Framework
Fig. 2 shows our proposed FSSDA method for cross-user gesture recognition based on CNN. For DATASET-I, the CNN architecture was designed based on LeNet-5 [31], which has been widely used in gesture recognition and has shown good performance [15], [21], [32], [33]; the resulting network structure is similar to that of [21]. Each scarce target domain sample is paired with all source domain samples, resulting in $M \times N_l$ pairs. This ensures that the total number of samples in the source and target domain is equal, which facilitates the design of the batch sampler. The sampler then draws the same number of samples from existing users and the new user in each batch. The training architecture has two streams, one for the source domain and the other for the target domain. Each stream contains two CNN blocks and two fully connected (FC) layers. Each CNN block consists of a convolutional layer to extract features, a batch normalization layer to accelerate model convergence and a maximum pooling layer to reduce feature dimensionality while retaining the more important information. The two convolutional blocks have 64 and 128 filters respectively, with a kernel size of 3 × 3 and padding = 1. The following two FC layers contain 128 and 64 units respectively. All of these layers share the same weights across the two streams and can be seen as the embedding function $k$; only the last FC layer in the target domain stream is not shared. A rectified linear unit follows each convolutional layer and FC layer to avoid vanishing gradients and enhance the non-linear fitting capability of the model [34]. A separate FC layer containing 6 units serves as the classification layer $h$ for the target domain. Additionally, a dropout layer with a probability of 0.5 is placed before the final FC layer to reduce overfitting during training. The output of $h$ is the predicted probabilities of the gesture categories.
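The balanced batch sampler mentioned above is not specified in detail. The sketch below shows one way it could work, assuming that source indices are shuffled once per epoch and that the scarce target samples are drawn with replacement so that every batch holds equal numbers from both domains. The function name and the sampling-with-replacement choice are assumptions.

```python
import random


def balanced_batches(n_source, n_target, batch_size, seed=0):
    """Yield (source_indices, target_indices) with batch_size // 2 indices each.

    Assumption: the few target samples are reused (sampled with replacement)
    so that each one is eventually paired with many source samples.
    """
    rng = random.Random(seed)
    half = batch_size // 2
    source_order = list(range(n_source))
    rng.shuffle(source_order)
    for start in range(0, n_source - half + 1, half):
        source_idx = source_order[start:start + half]
        target_idx = [rng.randrange(n_target) for _ in source_idx]
        yield source_idx, target_idx


# Example: 100 source samples, 30 labeled target samples (6 gestures x 5), batch size 32
batches = list(balanced_batches(n_source=100, n_target=30, batch_size=32))
print(len(batches), len(batches[0][0]), len(batches[0][1]))
```

Within each batch, every target embedding can then be compared against all source embeddings of the same and of different classes when the pair distances are computed.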
The overall loss of FSSDA combines the P-N pair distance loss $L_{P-N}$, defined in (1), with the classification loss $L_C$ calculated on the target domain. In this experiment, $L_C$ is the cross-entropy (CE) loss, in which $n_t$ is the number of target samples in a batch, $C$ is the number of gesture categories, $y_{ic}^t$ is a binary indicator that equals 1 if the sample $x_i^t$ truly belongs to class $c$ and 0 otherwise, and $f(x_i^t)_c$ is the predicted probability that the sample $x_i^t$ belongs to class $c$. At each training iteration, the samples of both domains enter the network at the same time and pass through the same layers, so that the weights of these layers are shared. The P-N pair distance loss is calculated after the second FC layer of both domains. The classification layer for the target domain stream is defined separately and is used to calculate the CE loss. Finally, the total loss is calculated and the parameters of the model are updated by gradient back-propagation. For DATASET-II, the same network structure as [17] is adopted for both the source domain and target domain streams, except that the source domain stream does not contain a final classification layer. The other parameter settings are the same for both networks, except that the batch size is empirically set to 32 and 64 for the two networks, respectively. Within each batch, the number of samples from the two domains is equal (half of the batch size). We use the Adam optimizer (β1 = 0.9, β2 = 0.99, weight_decay = 0.0001) [35] and train for 20 epochs in all experiments. The learning rate is initially set to 0.0001 and is divided by 10 after the 8th and 16th epochs. The models are implemented in PyTorch 1.12.0, and all experiments are conducted on four GeForce RTX 2080 Ti GPUs.
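The two quantities described above that are fully determined by the text, the cross-entropy on target samples and the step learning-rate schedule, can be sketched as follows. The averaging over $n_t$ in the cross-entropy is an assumption (the standard mean-reduced form), and the function names are illustrative.

```python
import math


def cross_entropy(probs, labels):
    """Mean cross-entropy over a batch of target samples.

    probs[i][c] is the predicted probability f(x_i)_c; labels[i] is the true class,
    so the one-hot indicator y_ic selects a single log-probability per sample.
    """
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)


def learning_rate(epoch, base_lr=1e-4, milestones=(8, 16)):
    """Learning rate used during a 1-based epoch: divided by 10 after each milestone."""
    drops = sum(1 for m in milestones if epoch > m)
    return base_lr / (10 ** drops)


ce_uniform = cross_entropy([[0.5, 0.5]], [0])
schedule = [learning_rate(e) for e in range(1, 21)]
print(ce_uniform, schedule[7], schedule[8], schedule[16])
```

Under this reading, epochs 1 to 8 use 0.0001, epochs 9 to 16 use 0.00001, and epochs 17 to 20 use 0.000001.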

F. Performance Evaluation and Statistics Analysis
As with other works in multi-user gesture recognition [7], [11], [21], the leave-one-subject-out cross-validation scheme was used to test our proposed framework, which means that one subject was randomly selected as the new user from all subjects and the rest of subjects were utilized as the training set until each user was tested once as a new user. For all subjects, the first trial of each gesture was utilized as the calibration set and the rest of trials were utilized as the testing set or training set depending on whether it was from a new user or background users. The performance of the classifier is measured with classification accuracy, which is defined as the ratio between the number of samples correctly classified and the total number of samples for the new user in each cross-validation.
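The evaluation protocol above can be sketched as a simple generator over subjects. The helper names are hypothetical, and trial index 0 stands for the first trial of each gesture.

```python
def leave_one_subject_out(subjects):
    """Yield (new_user, background_users) so that each subject is the new user once."""
    for new_user in subjects:
        yield new_user, [s for s in subjects if s != new_user]


def trial_role(trial_index, is_new_user):
    """Role of a trial: the first trial calibrates; the rest test (new user) or train."""
    if trial_index == 0:
        return "calibration"
    return "test" if is_new_user else "train"


def accuracy(predicted, actual):
    """Fraction of the new user's samples that are classified correctly."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


folds = list(leave_one_subject_out([f"M{i}" for i in range(1, 10)]))
print(len(folds), folds[0][0], folds[0][1])
```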
Several cross-user gesture recognition methods were employed for comparison. (1) LDA (linear discriminant analysis) [36]: a classifier that is easy to implement and computationally efficient. (2) CNN: the same network structure as the target domain stream in the proposed method, without any domain adaptation operations. (3) CCA [7]: the classic CCA-based cross-user gesture recognition framework. (4) CCA-OT [12]: this framework builds on CCA and further reduces the discrepancy in data distribution between the source and target domains by optimal transport. (5) Finetune [15]: a technique that uses a few samples of the new user to fine-tune the parameters of the FC layer of a model pretrained on other users. (6) CCSA [22]: an approach designed for domain adaptation with few samples, included because it closely matches the problem addressed here. It uses classification and contrastive semantic alignment losses to align the distributions of the two domains. (7) MS-AdaBN [17]: the first approach to calibrate the cross-user gesture recognition model using UDA. (8) SGDA (self-guided domain adaptation) [21]: a UDA approach with a self-guided adaptive sampling strategy. To make a fair comparison, we used the networks adopted in the papers that proposed the respective datasets [17], [21]. The training conditions for the different methods are shown in Table I. For the deep learning methods, other parameters (e.g. the learning rate, the optimizer) are set to be the same as for FSSDA. For the traditional methods, the parameters are set according to the original work.
Additionally, statistical significance is assessed using a one-way repeated-measures analysis of variance (ANOVA) test with a significance level of 0.05. All statistical analyses were conducted using SPSS (v.24.0, SPSS Inc. Chicago, IL, USA).

A. Classification Performance of FSSDA on Two EMG Datasets
In the few-shot learning scenario, each new user provides 5 samples of each gesture (1/8 of one trial for DATASET-I, 1/4 of one trial for DATASET-II). It should be noted that LDA and CNN are trained only on the calibration samples of the new user, as we have found that this works better than using other users' data or a combination of both. For CCA and CCA-OT, due to the small number of calibration samples, these samples were replicated to facilitate the transformation of features into a unified-style space. The other experimental conditions for the seven methods were set equal to make a fair comparison. Fig. 3 and Fig. 4 show the classification accuracies achieved by the seven methods on the two datasets. The accuracies of the nine subjects in DATASET-I using our proposed method (FSSDA) and LDA, CNN, Finetune, CCA, CCA-OT, and CCSA were 97.59 ± 3.10%, 90.39 ± 8.24%, 90.83 ± 5.36%, 84.13 ± 13.16%, 84.50 ± 6.90%, 89.73 ± 5.77%, and 94.75 ± 6.02%, respectively. On DATASET-II, the proposed method (FSSDA), LDA, CNN, Finetune, CCA, CCA-OT, and CCSA yielded accuracies of 82.78 ± 5.17%, 73.29 ± 7.87%, 72.80 ± 6.60%, 49.41 ± 3.38%, 52.15 ± 6.97%, 56.45 ± 7.18%, and 76.60 ± 7.70%, respectively. The results of the comparison with the other two UDA approaches are presented in Table I. The ANOVA revealed that the proposed method performed significantly better for cross-user gesture recognition than all other comparison methods on both datasets (p < 0.05 for any comparison).

... of the same and different categories in the source domain. Note that, over the tested range of α from 0.01 to 0.5, the accuracy gradually grows with α and reaches its best at α = 0.3. However, when α is larger (e.g. α = 0.5), the accuracy starts to drop, because larger margins increase the difficulty of feature learning. For DATASET-I, we set α to 0.3. The experimental results on DATASET-II show a similar trend, and we selected the optimal α = 0.05 for it.

C. Experiments on Different Numbers of Calibration Samples
The impact of the number of calibration samples on the classification accuracy of different methods was also analyzed. A given number of samples was randomly selected from the first trial of each gesture performed by the new user. The process was repeated 3 times and the averaged recognition accuracy was calculated for each approach. Fig. 6 and Fig. 7 show the averaged recognition accuracy of the different methods on both datasets as the number of samples per gesture grows. Note that when only one sample per gesture is provided, the number of training samples equals the number of classes and the LDA classifier cannot be trained, so its curve starts at 2 samples per gesture. It can be observed that the performance of LDA and CNN is poor when the number of calibration samples is small, and that with increasing numbers of calibration samples, the final results of the three methods are almost equal. Additionally, FSSDA still works well compared to the other two methods even when only one sample is provided for each gesture, achieving 95.72% and 59.46% accuracy on the two datasets respectively.

Fig. 8 shows the visualization of the embedding features using t-distributed stochastic neighbor embedding (t-SNE) [37]. Fig. 8(a) and Fig. 8(b) show the high-dimensional features extracted from the two domains embedded into a two-dimensional space before using FSSDA. It can be observed that the distribution of samples in the same category varies considerably across users, and some of the samples in different categories even overlap significantly with each other. After applying FSSDA, the distributions of samples in the source (Fig. 8(c)) and target domain (Fig. 8(d)) are close, with samples from the same category close to each other and samples from different categories of different users far away from each other, showing good separability and thus facilitating the classification of the gesture tasks for the new user.
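The repeated random selection of calibration samples described in Section C can be sketched as below. The `evaluate` callback, which would train a model on the chosen calibration samples and return its accuracy, is a hypothetical placeholder, and using the repetition index as the random seed is an assumption.

```python
import random


def pick_calibration(first_trial_windows, k, seed):
    """Randomly pick k windows per gesture from the new user's first trial.

    first_trial_windows maps each gesture label to a list of window identifiers.
    """
    rng = random.Random(seed)
    return {g: rng.sample(w, k) for g, w in first_trial_windows.items()}


def averaged_accuracy(evaluate, first_trial_windows, k, repeats=3):
    """Average the accuracy over several random draws of the calibration set."""
    scores = [evaluate(pick_calibration(first_trial_windows, k, seed))
              for seed in range(repeats)]
    return sum(scores) / len(scores)


# Example: 6 gestures with about 40 windows in the first trial, 5 samples per gesture
windows = {gesture: list(range(40)) for gesture in range(6)}
picked = pick_calibration(windows, k=5, seed=0)
print({g: len(v) for g, v in picked.items()})
```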

E. Computational Complexity Analysis
The number of parameters (Params), floating point operations (FLOPs) and the running time of FSSDA on the two datasets are presented in Table II. The model used for DATASET-II has more convolutional blocks, resulting in a larger number of parameters than the model used for DATASET-I. We measure the running time by feeding the testing samples of the subjects from the two datasets into the models one by one and taking the average processing time of one sample as the running time of FSSDA. Although the running time of the model used for DATASET-II is slightly longer than that of the model used for DATASET-I, both are far below the time requirement (300 ms) of real-time gesture recognition in myoelectric control systems [36].
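The per-sample running time described above can be measured with a short timing loop. This is a generic sketch: `predict` stands in for a forward pass of the trained model and is a hypothetical placeholder.

```python
import time

REAL_TIME_LIMIT_MS = 300  # time requirement for real-time myoelectric control [36]


def mean_latency_ms(predict, samples):
    """Average wall-clock time in milliseconds to process one sample at a time."""
    start = time.perf_counter()
    for sample in samples:
        predict(sample)
    elapsed = time.perf_counter() - start
    return elapsed * 1000 / len(samples)


# A trivial stand-in for the model, used only to show the measurement itself
latency = mean_latency_ms(lambda sample: sum(sample), [[0.0] * 300] * 100)
print(latency < REAL_TIME_LIMIT_MS)
```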

IV. DISCUSSIONS
Even with the rapid development of EMG pattern recognition technology in recent years, severe user burden remains a significant barrier to the development of cross-user gesture recognition technology. Due to the non-stationarity and variability of the EMG signal, classifiers trained with other users' data perform poorly when applied directly to the new user, as has been demonstrated in the literature [11], [21]. The difficulty of this problem lies in adapting the cross-user model to learn the user-specific nature of the new user with minimal training data, while taking advantage of the similarity of patterns between different users. In this paper, a cross-user gesture recognition framework, FSSDA, based on the P-N pair distance loss is proposed to alleviate the usage burden of the end-user by requiring only several samples per gesture. First, we pair the scarce samples of the new user with the sufficient samples of the background users and feed them into the network together, so that each batch contains the same number of samples from both domains. Then, the P-N pair distance loss is employed to find a shared embedding space in which each sample of the new user is placed close to the positive samples and far from the negative samples in the source domain. Additionally, the CE loss is calculated on the samples of the new user to build a gesture classification model better suited to that user. By leveraging the distribution distances of point-wise surrogates for the new user and other users, FSSDA achieves stable adaptation, reducing the sEMG recording time by a factor of four or more without sacrificing gesture recognition accuracy.
As demonstrated in Fig. 3 and Fig. 4, in the few-shot scenario, some previous methods did not provide a satisfactory solution for cross-user gesture recognition. The recognition accuracy of Finetune, CCA and CCA-OT is even lower than that of LDA trained with only a few samples from the new user, dropping by 6.26%, 5.89% and 0.66% on the first dataset and by 23.88%, 21.14% and 16.84% on the second dataset, respectively. CNN achieved recognition rates similar to LDA in this case. In contrast, Finetune achieved the worst results, probably because the number of samples of the new user was too small to allow the network to learn the differences in movement patterns between the new user and other users. Although CCA-OT further reduces the discrepancy between the distributions of the two domains on top of CCA, it does not exceed the strong baseline of LDA. Because the statistics of the new user's data distribution are estimated inaccurately, negative transfer occurs, i.e. knowledge learned from background users in the source domain has a negative effect on the prediction of the new user's samples in the target domain. In this case, training an LDA or CNN with only calibration samples produces better results than a cross-user model with negative transfer, although it is prone to overfitting. Thus, the above approaches lack sufficient data from the target domain for learning the differences between distributions and do not make full use of the scarce samples of the new user, resulting in a cross-user model that is not well suited to the new user. In contrast, CCSA achieves better results than these approaches, with improvements of 4.36% and 3.31% over LDA on the two datasets respectively. It employs a contrastive loss to constrain the relationship between the scarce target samples and the abundant source domain samples for the separation and alignment of the distributions.
However, the recognition rate achieved by CCSA is still lower than that of FSSDA because it requires the P-pair distance to be 0, which can lead to model collapse, i.e. all samples of the same category produce the same embedding features, which harms the feature representation ability. Instead, the P-N pair distance loss only requires that, for each sample in the target domain, the distance to samples of a different category in the source domain is larger than the distance to samples of the same category in the source domain. This ensures the distinguishability of the embedding features and avoids the learning difficulty caused by overly strong constraints.
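The accuracy gaps relative to LDA quoted in this discussion can be recomputed directly from the mean accuracies reported in the results for the 5-samples-per-gesture setting. The dictionary layout below is only an illustration; the numbers are those given in the text.

```python
# Mean accuracies (%) with 5 calibration samples per gesture, as reported in the results
LDA = {"DATASET-I": 90.39, "DATASET-II": 73.29}
OTHERS = {
    "Finetune": {"DATASET-I": 84.13, "DATASET-II": 49.41},
    "CCA": {"DATASET-I": 84.50, "DATASET-II": 52.15},
    "CCA-OT": {"DATASET-I": 89.73, "DATASET-II": 56.45},
    "CCSA": {"DATASET-I": 94.75, "DATASET-II": 76.60},
    "FSSDA": {"DATASET-I": 97.59, "DATASET-II": 82.78},
}


def gap_to_lda(method, dataset):
    """Difference in mean accuracy (percentage points) between a method and LDA."""
    return round(OTHERS[method][dataset] - LDA[dataset], 2)


for method in OTHERS:
    print(method, gap_to_lda(method, "DATASET-I"), gap_to_lda(method, "DATASET-II"))
```

Negative values mark the methods that fall below the LDA baseline (Finetune, CCA and CCA-OT), while CCSA and FSSDA show positive gaps on both datasets.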
A noteworthy point is that our group has recently used unsupervised domain adaptation to address the cross-user problem [21], implementing a plug-and-play myoelectric interface. This approach designs an adaptive domain adaptation model that uses an adaptive sampling strategy to gradually align the marginal distribution across users, continuously updating the model until the new user is satisfied. Although this approach does not require the new user to provide gesture labels, it still requires several hundred samples to update the model parameters (to ensure gesture diversity) and achieve optimal performance, whereas we only need five samples of each gesture. As shown in Table I, the final recognition accuracies of this method on the two datasets are 90.41 ± 14.44% and 55.27 ± 5.54%, which are still poorer than our results of 97.59 ± 3.10% and 82.78 ± 5.17%. This suggests that, for the second dataset with poorly discriminated gesture samples, learning the marginal distributions is difficult. With a few labeled samples, SDA achieves better cross-user gesture recognition performance than UDA. Fig. 6 and Fig. 7 also show the effect of the number of calibration samples on two classifiers and FSSDA for the two datasets. The trend of the FSSDA curves on the two datasets is similar, except that the curve on DATASET-II changes more markedly. This is because in DATASET-I, the number of gesture categories is small and the samples are well differentiated, so promising recognition accuracies are achieved even when only one calibration sample per gesture is provided. In DATASET-II, the number of gesture categories is large and the samples are poorly differentiated. Therefore, the recognition accuracy is lower when there is one calibration sample per gesture. With each additional calibration sample provided, the network learns more accurate classification boundaries and the recognition accuracy improves.
Nevertheless, when the number of calibration samples is small, the results of all methods on DATASET-II are not very promising. Although the proposed method achieves the best results, it shows no obvious improvement in recognition rate when only two or three calibration samples are provided. This can be attributed to the difficulty of adequately aligning the distributions of the two domains when the input information is limited. It is worth noting, however, that although the curves for the two datasets initially grow at different rates, both stabilize once five calibration samples per gesture are provided. Furthermore, the results demonstrate that increasing the number of calibration samples gradually improves the robustness of the cross-user gesture recognition method, but it also places an increasing calibration burden on the new user. Time-consuming and frequent recalibration is one of the reasons for the high rejection rate of prostheses based on myoelectric pattern recognition. Our method achieves stable and good performance with as few as five calibration samples per gesture on both datasets, which confirms that FSSDA reduces the dependence of cross-user model performance on the number of calibration samples. FSSDA achieves the best classification results with a short signal collection time, which may have significant implications for the design of prosthetic devices with a low user burden.
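To make the few-shot calibration setting concrete, the following sketch draws k calibration samples per gesture from the labeled recordings of a new user. It only illustrates the sampling protocol: the function name `sample_k_shot`, the uniform random selection, and the `seed` parameter are assumptions, and the actual selection procedure used in the experiments may differ.

```python
import numpy as np


def sample_k_shot(labels, k, seed=0):
    """Return indices of k randomly chosen samples for each gesture label.

    Hypothetical helper: uniform sampling without replacement is an assumption.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    chosen = []
    for gesture in np.unique(labels):
        idx = np.flatnonzero(labels == gesture)
        if idx.size < k:
            raise ValueError(f"gesture {gesture} has fewer than {k} samples")
        # Draw k distinct samples of this gesture.
        chosen.append(rng.choice(idx, size=k, replace=False))
    return np.concatenate(chosen)
```

The remaining samples of the new user can then serve as the test set, so that accuracy can be reported as a function of k.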
The t-SNE visualization in Fig. 8 further validates the effectiveness of FSSDA. The data distribution of the new user always differs from that of existing users. By exploiting the positive and negative pair relationships between samples in the two domains, the samples in the target domain are brought closer to source-domain samples of the same category and pushed further from source-domain samples of a different category. We can intuitively observe that the decision boundaries between different gesture patterns in the two domains become similar after applying FSSDA, indicating that the model learns the inherent correlation patterns between the new user and other users. This is consistent with the hypothesis that the EMG patterns of specific muscle activations are similar across users. Overall, the proposed method leverages minimal calibration data from the new user and combines it with the additional sources of variability provided by other users to build a cross-user gesture recognition model, which is superior to previous cross-subject techniques and may be a viable option for improving EMG control based on pattern recognition.
Finally, the current study still has several limitations. The experiments did not account for non-ideal factors encountered in practice, such as electrode shift [5] and limb position [38], which may lead to performance degradation in myoelectric control. Another issue is that, although FSSDA reduces the number of calibration samples required from the new user, it still requires samples for every gesture. If it were possible to calibrate all gestures using only a small subset of gestures, or to avoid retraining altogether as in cross-day gesture recognition [39], [40], the burden on the end user would be reduced further. This will be the subject of our future work.

V. CONCLUSION
In this paper, we propose FSSDA, a novel multiuser myoelectric interface. The framework generalizes well after observing very few samples per class from the new user. A P-N pair distance loss is proposed to perform domain adaptation between the sufficient samples from the source domain and the scarce samples from the target domain, despite the mismatch in distribution between them. We found that using point-wise surrogates of distribution distances to align distributions across users is very effective when the number of calibration samples is small, even with only one sample per gesture. FSSDA substantially reduces the usage burden, cutting the sEMG signal recording time by a factor of more than four. The results on both HD-sEMG datasets verify the strong performance of FSSDA in comparison with prior cross-user models in the few-shot scenario. This work has important implications for reducing the usage burden, improving acceptance, and broadening the base of potential users of EMG gesture recognition, which may facilitate the widespread use of EMG control systems in consumer and industrial applications.