Robust Long-Term Hand Grasp Recognition With Raw Electromyographic Signals Using Multidimensional Uncertainty-Aware Models

Hand grasp recognition with surface electromyography (sEMG) has been used as a possible natural strategy to control hand prosthetics. However, effectively performing activities of daily living relies significantly on the long-term robustness of such recognition, which remains challenging because of easily confused classes and several sources of variability. We hypothesise that this challenge can be addressed by introducing uncertainty-aware models, because the rejection of uncertain movements has previously been demonstrated to improve the reliability of sEMG-based hand gesture recognition. With a particular focus on a very challenging benchmark dataset (NinaPro Database 6), we propose a novel end-to-end uncertainty-aware model, an evidential convolutional neural network (ECNN), which can generate multidimensional uncertainties, including vacuity and dissonance, for robust long-term hand grasp recognition. To avoid heuristically determining the optimal rejection threshold, we examine the performance of misclassification detection in the validation set. Extensive comparisons of accuracy under the non-rejection and rejection schemes are conducted when classifying 8 hand grasps (including rest) over 8 subjects across the proposed models. The proposed ECNN is shown to improve recognition performance, achieving an accuracy of 51.44% without the rejection option and 83.51% under the rejection scheme with multidimensional uncertainties, significantly improving on the current state-of-the-art (SoA) by 3.71% and 13.88% in relative terms, respectively. Furthermore, its overall rejection-capable recognition accuracy remains stable, with only a small accuracy degradation after the last data acquisition over 3 days. These results show the potential for designing a reliable classifier that yields accurate and robust recognition performance.


I. INTRODUCTION
Hand gesture recognition (HGR) with surface electromyography (sEMG), which reveals neuromuscular activities by collecting electrical signals from muscles through non-invasive electrodes, has been widely acknowledged as a natural way to express intuitive intention, thus providing a control command for human-machine interaction (HMI) [1], [2]. As such, designing an accurate, robust, and reliable hand gesture classifier can help build a solid bridge between computers and humans. This is extremely valuable for transradial amputee users in controlling prosthetic limbs, allowing them to perform activities of daily living (ADL) through hand movements, including various hand grasps. Furthermore, as an independent module in HMI, sEMG-based HGR has often been studied offline with publicly available datasets [3], [4], [5].
Our previous work [35] investigated the reliability (i.e., the quality of uncertainty measures) of the uncertainty-aware models that were proposed for the recognition of finger movement with raw sEMG. Furthermore, we showed that a classifier can be considered more reliable if it knows what it does not know. The promising results encouraged us to pursue further research on the long-term robustness of hand grasp recognition with the same design of uncertainty-aware models. In this study, we focus on the very challenging NinaPro Database 6, as it 1) was first released for repeatability analysis of hand grasp recognition with sEMG collected in data acquisitions across days; 2) was found to require more research compared to other benchmark datasets; and 3) was ideal for validating the long-term robustness of the proposed uncertainty-aware models on sEMG-based hand grasp recognition. We first design an end-to-end 3D CNN as a baseline model for the specific task of classifying hand grasps. An uncertainty-aware model, that is, an evidential CNN (ECNN), is then proposed by integrating it with evidential deep learning (EDL). Although the potential of ECNN can be investigated by reliability analysis without determining an optimal rejection threshold, one may be eager to know how the performance of HGR can be improved under rejection schemes in practical use. Therefore, the primary objective of this study is to present a practical way to determine a rejection threshold using validation sets only and to provide a comprehensive analysis of the long-term performance of rejection-capable hand grasp recognition with raw sEMG. Furthermore, ECNN can generate multidimensional uncertainties such as vacuity and dissonance as a result of the nature of EDL [35]. A secondary objective of this study is to investigate its potential to improve the rejection-capable performance with multidimensional uncertainties by comparing it with a single uncertainty.

II. RELATED WORK
The Non-Invasive Adaptive Hand Prosthetics Database 6 (NinaPro DB6) [28] is an undervalued benchmark dataset for testing the repeatability of hand grasp recognition, where repeatability was defined as the variation in repeated measurements made on consecutive days by the same subject under identical conditions. For each of the 10 intact healthy subjects, 10 sessions were recorded in the morning (AM) and afternoon (PM) on 5 days, where each session involved 12 repetitions of 7 hand grasps. Each subject was required to sit in front of a table with the forearm leaning on it, grasp an object for approximately 4 seconds in each repetition, and rest for approximately 4 seconds afterward. Muscle activity was measured with 14 Delsys Trigno Wireless electrodes at a sampling frequency of 2 kHz, attached to the upper half of the forearm in two equally spaced rows.
This section outlines two key findings from the summary of the research work on NinaPro DB6, which is presented in Table I. The first is that NinaPro DB6 has been underexplored in the literature due to the difficulty of improving its recognition accuracy, as evidenced by the generally poor recognition accuracy reported. As evaluated by Chang et al. [36], the signal quality of this dataset was acceptable. Although some signals suffer from low signal-to-noise ratio (SNR) values and incorrect labelling, we argue that the poor accuracy was mainly due to highly confused hand grasps, together with several large sources of variability. In addition to the variability between steady and transient states and the temporal variability, at least two more sources must be taken into account in NinaPro DB6, namely the variability between data acquisitions and the variability between objects for each hand grasp. Note that electrodes were not required to be attached at exactly the same position in each data acquisition, and two objects were used in turn when each individual performed the same hand grasp. The second finding is that most studies reported intra-session accuracy in NinaPro DB6 to validate their proposed algorithms. We argue that the focus on NinaPro DB6 should be on improving the inter-session recognition accuracy only. Recall that this database was first released for the analysis of repeatability in hand grasp recognition with sEMG, with the aim of improving the long-term robustness of hand prostheses control systems. More importantly, Table I shows that the previous SoA inter-session accuracy was 49.6%, achieved using a Temporal CNN (TCN) with multi-session training [27]. Recently, Bao et al. [17] proposed a new confidence estimate for a CNN to improve the long-term robustness of recognition and reported an SoA inter-session between-day accuracy of 73.33% under the rejection scheme.

III. MULTIDIMENSIONAL UNCERTAINTY-AWARE MODELS
The proposed multidimensional uncertainty-aware models are shown in Fig. 1. We first proposed a simple but efficient end-to-end 3D convolutional neural network for sEMG-based hand grasp recognition as a baseline model. To improve its long-term robustness, we then constructed three multidimensional uncertainty-aware models, each with slightly different learning preferences. In this work, two dimensions of uncertainty were mainly considered: vacuity and dissonance. Vacuity (also known as belief vacuity) is due to insufficient or unreliable information received from sources, while dissonance (also known as belief dissonance) reflects the situation where a model simultaneously holds contradicting beliefs about a given prediction, which is usually caused by valid but conflicting evidence derived from the model output [37].

A. Baseline Model
We proposed an end-to-end 3D CNN for this specific hand grasp recognition task as a baseline. Note that the classifiers in this task are required to output predictive probabilities of 8 classes (including the rest posture). To learn characteristics that take into account spatial and temporal information, the 14 channels of sEMG signals were arranged in a 2 × 7 matrix. As such, the size of each frame of the raw sEMG signals is 400 × 2 × 7. Taking into account the trade-off between model efficiency and training computation cost, the network comprises three convolutional layers followed by three fully connected layers for high-level reasoning. In this article, we refer to this baseline model simply as CNN.

B. Uncertainty-Aware Models
We constructed uncertainty-aware models by integrating the baseline model with evidential deep learning (EDL) [38]. For the sake of consistency, they will be referred to as Evidential Convolutional Networks (ECNN), the same as in our previous study [35]. EDL, based on the framework of Subjective Logic (SL), was originally proposed to explicitly train a model that makes a prediction along with its quantified uncertainty, i.e., vacuity in the context of SL, indicating whether there is sufficient evidence to support model predictions [38]. Owing to the SL framework, ECNN is also capable of inferring dissonance, which is another type of explainable uncertainty derived from conflicting evidence [37]. Both can be regarded as evidential uncertainty, and more details are presented later in Sec. III-C.
In SL, a multinomial opinion over a domain $Y$ of $K$ classes can be written as a triplet $\omega_Y = (b_Y, u_Y, a_Y)$, where $b_Y$ refers to a belief mass distribution over $Y$; $u_Y$ is the uncertainty mass that expresses the vacuity of evidence; and $a_Y$ represents a base rate distribution over $Y$, which is known as prior probability in classic Bayesian theory. Note that each element of $a_Y$ equals $1/K$ if no additional information is provided. More importantly, its additivity requirement can be seen as follows.
$$u_Y + \sum_{k=1}^{K} b_k = 1$$
It is clear that $u_Y$ refers to vacuity because it decreases as the total belief mass increases. For example, when $u_Y$ reaches its upper limit, that is, 1, it means that no belief mass can be found in any of the classes. By replacing the softmax layer with an activation layer such as ReLU, EDL terms the nonnegative output the evidence vector $e$, expressing the amount of evidence collected to support the classification of samples into a specific class. The belief mass distribution $b_Y$ can then be calculated by normalising $e$, as
$$b_k = \frac{e_k}{S}, \qquad u_Y = \frac{K}{S}, \qquad S = \sum_{k=1}^{K} (e_k + 1)$$
Furthermore, the Dirichlet distribution of order $K$ with parameter vector $\alpha$ can be calculated from the observed evidence as $\alpha = e + 1$. Therefore, the predicted probability for each class is the expectation of the corresponding Dirichlet distribution, i.e.,
$$p_k = \frac{\alpha_k}{S}$$
In summary, the main difference between CNN and ECNN is that ECNN turns the values of the output units into observed evidence, thus helping the model form a multinomial opinion for the recognition task and further enriching the uncertainty representation.
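The mapping from network output to a multinomial opinion can be illustrated with a minimal sketch. It assumes the standard EDL formulation of [38] (belief masses as evidence divided by the Dirichlet strength, vacuity as K divided by that strength); the function name `opinion_from_evidence` is hypothetical and is not taken from the authors' code.

```python
def opinion_from_evidence(evidence):
    """Turn a non-negative evidence vector into (belief, vacuity, expected probability).

    Assumes the standard EDL mapping: alpha = e + 1, S = sum(alpha),
    b_k = e_k / S, u = K / S, p_k = alpha_k / S.
    """
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative (e.g. a ReLU output)")
    k = len(evidence)
    alpha = [e + 1.0 for e in evidence]        # Dirichlet parameters
    strength = sum(alpha)                      # Dirichlet strength S
    belief = [e / strength for e in evidence]  # belief mass per class
    vacuity = k / strength                     # uncertainty mass u_Y
    prob = [a / strength for a in alpha]       # expectation of the Dirichlet
    return belief, vacuity, prob


# No evidence at all: vacuity is 1 and the prediction is uniform over 8 classes.
belief, vacuity, prob = opinion_from_evidence([0.0] * 8)

# Twelve units of evidence for class 0 only: S = 20, so vacuity drops to 0.4.
belief, vacuity, prob = opinion_from_evidence([12.0] + [0.0] * 7)
```

In both cases the belief masses and the vacuity sum to one, which is the additivity requirement of the opinion.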
Previously, we investigated the reliability of three variants of ECNN, which employed different loss functions for training. To further explore their long-term robustness, we followed the same procedure to build three multidimensional uncertainty-aware models, i.e., ECNN-A, ECNN-B, and ECNN-C. Given a sample $i$, let $y_i$ be the one-hot encoding of its ground-truth class, with $y_{ij} = 1$ and $y_{im} = 0$ for all $m \neq j$, where $j$ and $m$ are class labels. The loss function used for ECNN-A is given in (4), where $p_{ij}$ is the predicted probability that the sample $i$ is classified as the ground truth class $j$. This loss is used to obtain more evidence when the samples are predicted correctly and to remove excessive misleading evidence to avoid misclassification. To encourage the ECNN to generate relatively high vacuity when it is likely to make wrong predictions due to outliers, a Kullback-Leibler (KL) divergence term is incorporated into (4) as a regularisation term, weighted by a coefficient $\lambda$ as in (5), to help reduce the total evidence of a misclassified sample to zero. With respect to ECNN-B, the impact of the KL divergence term is expected to increase gradually. This is controlled by introducing a hyperparameter, the annealing step $s$, i.e., $\lambda = \min(1.0, t/s)$, where $t$ stands for the current training epoch number. Unlike in ECNN-B, $\lambda$ in (5) is a constant and is treated directly as a hyperparameter when training ECNN-C.
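The difference between the ECNN-B and ECNN-C weighting of the KL term can be sketched as follows. The annealing rule is the one stated in the text; the additive form `base_loss + lam * kl_term` is an assumption based on the usual EDL regularised loss, and both function names are hypothetical.

```python
def annealed_kl_weight(epoch, annealing_step):
    """ECNN-B style weight: lambda = min(1.0, t / s), growing linearly with the epoch."""
    return min(1.0, epoch / annealing_step)


def regularised_loss(base_loss, kl_term, lam):
    """Assumed additive form of the regularised loss: base loss plus weighted KL term."""
    return base_loss + lam * kl_term


# ECNN-B: the KL term is phased in over the first `annealing_step` epochs.
weights_b = [annealed_kl_weight(t, annealing_step=10) for t in (0, 5, 10, 20)]

# ECNN-C: lambda is a fixed hyperparameter (0.3 here is an arbitrary illustrative value).
loss_c = regularised_loss(base_loss=1.2, kl_term=0.5, lam=0.3)
```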

C. Multidimensional Uncertainty
As introduced previously, ECNN can generate evidential uncertainty, including vacuity and dissonance. Vacuity denotes uncertainty due to insufficient evidence or knowledge and can be calculated by
$$u_{vac} = \frac{K}{\sum_{k=1}^{K} (e_k + 1)}$$
where $K$ is the number of classes and $e_Y = (e_1, \dots, e_K)$ is the vector of observed evidence for each class. Dissonance represents the uncertainty due to conflicting evidence, derived from a sufficient amount of conflicting evidence by comparing each pair of singleton belief masses [39]:
$$u_{diss} = \sum_{j=1}^{K} \frac{b_j \sum_{m \neq j} b_m \, \mathrm{Bal}(b_m, b_j)}{\sum_{m \neq j} b_m}$$
where $K$ is the number of classes and $\mathrm{Bal}(b_j, b_m)$ represents the relative mass balance between a pair of belief masses $b_j$ and $b_m$ for the sample $i$, which equals 0 when $b_j + b_m = 0$ and $1 - \frac{|b_j - b_m|}{b_j + b_m}$ otherwise. Furthermore, two uncertainty measures, the entropy of the probability vector and the maximum posterior probability, have often been used as traditional confidence scores [17], [35], [40] and are included for comparison. They are denoted by $u_{nEntropy}$ and $u_{nnmp}$ and calculated in (8), representing normalised entropy and normalised negative maximum probability, respectively:
$$u_{nEntropy} = -\frac{\sum_{k=1}^{K} p_k \ln p_k}{\ln(K)}$$
where $p$ is the predicted probability vector for the $K$ classes.
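As a rough illustration of how dissonance and normalised entropy behave, the sketch below implements the relative mass balance as defined in the text and the usual subjective-logic dissonance sum. It is a minimal sketch, not the authors' implementation; the function names are hypothetical, and the convention that a class with no competing belief mass contributes zero is an assumption.

```python
import math


def balance(b_j, b_m):
    """Relative mass balance: 0 if both masses are zero, else 1 - |b_j - b_m| / (b_j + b_m)."""
    total = b_j + b_m
    if total == 0:
        return 0.0
    return 1.0 - abs(b_j - b_m) / total


def dissonance(belief):
    """Sum over classes of b_j weighted by how balanced it is against the other masses."""
    diss = 0.0
    for j, b_j in enumerate(belief):
        others = [b_m for m, b_m in enumerate(belief) if m != j]
        denom = sum(others)
        if denom == 0:
            continue  # assumption: no competing belief mass means no conflict
        diss += b_j * sum(b_m * balance(b_m, b_j) for b_m in others) / denom
    return diss


def normalised_entropy(prob):
    """Shannon entropy of the probability vector divided by ln(K)."""
    return -sum(p * math.log(p) for p in prob if p > 0) / math.log(len(prob))


# Two classes holding equal belief conflict maximally; a single class does not.
conflicting = dissonance([0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
unanimous = dissonance([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
```

The example shows why dissonance complements vacuity: both belief vectors above have zero vacuity, yet only the first one signals that the evidence is contradictory.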

IV. EXPERIMENTS
The versions of PyTorch and Python used in this study are 1.10.2 and 3.9.7, respectively. The experimental pipeline consisted of data loading, data segmentation, hyperparameter determination, model training, and model testing. Due to the stochastic nature of deep learning algorithms, all models were trained 10 times with the determined hyperparameters. The source code for this study is available on GitHub (https://github.com/YuzhouLin/3dECNN-NinaProDB6).

A. Data Preprocessing
To satisfy the constraints of real-time control of prosthetic limbs, a sliding window of 250 ms (< 300 ms [41]) was selected with an increment of 25 ms. The overlap was thus set as high as 90% to increase the decision density and reduce the prediction delay. For fairness, the rest movement was treated identically to the other hand grasps. Specifically, each trial of 'rest' was taken after its first 0.5 s to avoid incorrect labelling and was limited to about 1/7 of the duration of the preceding hand grasp to keep the data balanced across all classes. Note that the signals in NinaPro DB6 had already been preprocessed with a bandpass filter (20-450 Hz) and a notch filter (50 Hz). Therefore, no additional signal preprocessing was required, and each segmented sEMG window was taken directly as a model input. Furthermore, the sEMG signals used in training and testing included both steady and transient states.
Since the main objective of this study was to investigate the long-term performance of sEMG-based hand grasp recognition, the data collected on day 1 and day 2 for each individual were used as training and validation sets, while the rest were used as test sets. To capture more generalised features, the last two cycles of each acquisition were used as validation sets. Specifically, the 11th and 12th cycles recorded in the AM and PM sessions of the first two days were used as validation sets. The ratio of training data to validation data was 5 : 1. For consistency, the 2nd and 9th subjects were excluded from this study for the reasons listed below.
• Only two labels of sEMG signals were found instead of 8 on the morning of day 2 for subject 2. This may be caused by label noise or a mistake in data acquisition.
• Only 13 valid channels were found instead of 14 on the morning of day 1 for subject 9. Specifically, only zeros were recorded on channel 7 (that is, the 8th channel).
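The segmentation step can be sketched as below. This is a minimal illustration rather than the authors' code: the helper name `sliding_windows` is hypothetical, and the sample counts assume the 2 kHz sampling rate quoted in Sec. II with no resampling (a different effective rate would change them).

```python
def sliding_windows(n_samples, fs_hz=2000, window_ms=250, step_ms=25):
    """Return (start, stop) sample indices of every full window in a recording.

    Assumes a constant sampling rate; trailing samples that do not fill a
    whole window are dropped.
    """
    win = int(round(fs_hz * window_ms / 1000))   # window length in samples
    step = int(round(fs_hz * step_ms / 1000))    # increment in samples
    return [(start, start + win) for start in range(0, n_samples - win + 1, step)]


# A 4 s grasp at 2 kHz (8000 samples) yields overlapping windows 25 ms apart.
windows = sliding_windows(8000)
overlap = 1 - 25 / 250  # fraction of each window shared with the next one
```

Each index pair would then be used to slice the 14-channel recording before the channels are arranged into the 2 × 7 layout used by the models.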

B. Model Training
To reduce the computational burden, we used the Tree-structured Parzen Estimator (TPE) [42], [43], which is one of the SoA hyperparameter optimisation (HPO) algorithms, to search for optimised hyperparameters. Taking advantage of sequential model-based global optimisation algorithms [42], [44], the candidate values of each hyperparameter could be determined based on the previous search results because the hyperparameters were organised into a tree-like space. The hyperparameters for each model were determined using Optuna [45], which is a powerful HPO framework. Each HPO search stopped after running 5 study trials, where each trial refers to one evaluation of an objective function that minimises the validation loss. In addition, early stopping was used to avoid overfitting in each training run. The model stopped training when no improvement was found on the validation set after waiting for 10 epochs or when the training epoch reached 200. Furthermore, we used Adam as the optimiser and the ReduceLROnPlateau schedule, with a patience of 5 epochs and a decay factor of 0.8, to decrease the learning rate during training. The search result for HPO is shown in Table II.
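The stopping rule described above (a patience of 10 epochs and a cap of 200 epochs) can be sketched in plain Python. The per-epoch validation losses are passed in as a list purely for illustration; in a real run they would come from the training loop, and the function name is hypothetical.

```python
def early_stopping_epoch(val_losses, patience=10, max_epochs=200):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once `patience` epochs have passed without a new best
    validation loss, or when `max_epochs` is reached.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return min(len(val_losses), max_epochs), best_epoch


# Hypothetical loss curve: improves for three epochs, then plateaus.
plateau = [1.0, 0.9, 0.8] + [0.85] * 50
stop_epoch, best_epoch = early_stopping_epoch(plateau)
```

In PyTorch, the learning-rate schedule mentioned in the text corresponds to `torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.8, patience=5)`, stepped with the validation loss after each epoch.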

C. Rejection-Capable Performance Evaluation
Given an uncertainty threshold δ, a prediction is considered uncertain when its quantified uncertainty is greater than or equal to δ and certain otherwise. By determining δ, a model is capable of making only confident predictions by rejecting the uncertain ones. Setting δ = 1 simply corresponds to standard recognition, since no rejections will be made. Here, the rejection threshold of a model for each individual in each run was determined by validating its performance in misclassification detection on the validation set. In more detail, the rejection threshold that achieved the best F β -Score was selected, where β was chosen as 2 in this study, which means that recall was considered twice as important as precision. As shown in Fig. 2, the purpose of this setting was to maximise TAR as much as possible on the test sets with the determined rejection threshold. When the numbers of TP and TN remain the same, a higher recall can only be achieved with fewer FNs, thus indirectly producing a higher TAR. Furthermore, to avoid setting extremely low thresholds so that no active predictions would be made in the testing phase, the determined threshold had to be checked first on the validation sets. If the number of active predictions was less than 10% of the total number of validation samples in each acquisition, the threshold with the second-best F β -Score was selected instead, and so on.
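The threshold search can be sketched as follows. This is an illustrative reading of the procedure, not the authors' implementation: it assumes that a misclassified sample is the positive class and that a rejection counts as a positive detection (Fig. 2 is not reproduced here), and the fallback to δ = 1 when no candidate leaves enough accepted predictions is an added assumption. The function names and candidate grid are hypothetical.

```python
def f_beta(tp, fp, fn, beta=2.0):
    """F-beta score from counts; beta = 2 weights recall twice as much as precision."""
    denom = (1 + beta**2) * tp + beta**2 * fn + fp
    return (1 + beta**2) * tp / denom if denom else 0.0


def select_threshold(uncertainty, correct, candidates, beta=2.0, min_accept=0.10):
    """Pick the candidate delta with the best F-beta that keeps enough accepted samples.

    uncertainty: per-sample uncertainty scores on the validation set
    correct:     per-sample booleans, True if the prediction was right
    """
    n = len(uncertainty)
    scored = []
    for delta in candidates:
        rejected = [u >= delta for u in uncertainty]
        tp = sum(r and not c for r, c in zip(rejected, correct))       # wrong and rejected
        fp = sum(r and c for r, c in zip(rejected, correct))           # right but rejected
        fn = sum(not r and not c for r, c in zip(rejected, correct))   # wrong but accepted
        scored.append((f_beta(tp, fp, fn, beta), delta, n - sum(rejected)))
    # Walk down from the best score until the acceptance constraint is met.
    for score, delta, accepted in sorted(scored, key=lambda t: t[0], reverse=True):
        if accepted >= min_accept * n:
            return delta, score
    return 1.0, 0.0  # assumption: fall back to standard recognition (no rejection)


# Toy validation set: delta = 0.3 scores best but accepts only 1 of 20 samples,
# so the second-best candidate is chosen instead.
u = [0.05, 0.7] + [0.5] * 3 + [0.9] * 15
c = [True, True] + [False] * 18
delta, score = select_threshold(u, c, candidates=[0.3, 0.6])
```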

V. RESULTS
In all experiments, the results are reported as an average of 10 runs, and the Wilcoxon signed-rank test was used to compare the performance of CNN (baseline model) and the ECNN variants. The difference between two models was considered significant when the p-value was below 0.05, that is, when the null hypothesis, which assumes that two related paired samples come from the same distribution, was rejected. The calculations of all the evaluation metrics used in this section can be found in Fig. 2.

A. Recognition Performance - No Rejections
To investigate the long-term robustness of sEMG-based hand grasp recognition with the proposed uncertainty-aware models, their recognition accuracies are first presented for the case without a rejection option. Recall that the proposed CNN was considered a baseline model so that its performance could be compared with the ECNN variants. Table III shows that the recognition accuracy of all models increases slightly (about 1.7%) on day 4 but decreases greatly (about 6.9%) on day 5. The best average accuracy, 51.4%, was observed for ECNN-A. Although the difference between ECNN-A and CNN on each individual day was not statistically significant, ECNN-A outperformed CNN on average with a small but statistically significant difference of 0.6%.

B. Recognition Performance - With Rejections
To better investigate the rejection-capable performance of sEMG-based hand grasp recognition for all proposed models, we employed three evaluation metrics: the true acceptance rate (TAR), the true rejection rate (TRR), and the rejection rate (RR). The rejection-capable performances of the models with different uncertainty scores are summarised in Table IV. Recall that TAR can be considered the recognition accuracy under the rejection scheme. First, the same commonly used uncertainty estimate was applied for a fair comparison between the CNN and ECNN variants. The ECNN variants significantly outperformed CNN with $u_{nEntropy}$ or $u_{nnmp}$ by presenting a higher TAR. ECNN-B achieved the highest TAR of 76.77%, which was 7.07% higher than CNN when using $u_{nnmp}$ as the uncertainty estimate. With evidential uncertainty, ECNN-B also achieved the best performance with $u_{vac}$ compared to the other ECNN variants. Note that although the difference between ECNN-B and ECNN-A in TAR was not statistically significant, ECNN-B made fewer incorrect rejections, as evidenced by an 8.56% higher TRR and a 4.35% lower RR. When using $u_{diss}$ as the uncertainty score, ECNN-B achieved the lowest RR, which was only about 50%. With more predictions accepted, it produced the lowest but still acceptable TAR of 65.83%. Special attention was paid to the use of multidimensional uncertainties (that is, predictions were accepted only when their $u_{vac}$ and $u_{diss}$ were both smaller than their respective predetermined thresholds). In summary, ECNN-A achieved the highest TAR (83.51%), while ECNN-B obtained the best overall performance by presenting a lower TAR (80.32%) but with 4.79% more active predictions.
To investigate the repeatability of sEMG-based hand grasp recognition with the proposed ECNN models with regard to multidimensional uncertainty, a grouped violin plot was drawn to visualise their TAR differences across data acquisitions over days. A violin plot [46] is a hybrid of a box plot and a kernel density plot, which can be used to compare the distributions of several groups. Fig. 3 shows that ECNN-C consistently achieved lower performance on each acquisition across 3 days compared with the other two ECNN variants. In other words, both ECNN-A and ECNN-B obtained robust long-term sEMG-based hand grasp recognition performance, and ECNN-A was slightly more robust than ECNN-B, as the wider sections of its violins shifted less over time. Note that a wider section of the violin plot represents a higher probability density of the model attaining the given TAR. More details are given by the confusion matrices of ECNN-A and ECNN-B, which are presented in Fig. 4 and discussed in Sec. VI.
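The joint acceptance rule and the three metrics can be sketched as follows. The TAR and RR definitions follow the text (TAR as accuracy over accepted predictions, RR as the share of rejected predictions); the TRR definition used here, the share of rejected predictions that were indeed wrong, is an assumption because Fig. 2 is not reproduced. The function name and the toy values are hypothetical.

```python
def rejection_metrics(u_vac, u_diss, correct, delta_vac, delta_diss):
    """Return (TAR, TRR, RR) when a prediction is accepted only if both
    vacuity and dissonance fall below their respective thresholds."""
    accepted = [v < delta_vac and d < delta_diss for v, d in zip(u_vac, u_diss)]
    n = len(correct)
    n_acc = sum(accepted)
    n_rej = n - n_acc
    true_accepts = sum(a and c for a, c in zip(accepted, correct))
    true_rejects = sum(not a and not c for a, c in zip(accepted, correct))
    tar = true_accepts / n_acc if n_acc else 0.0  # 0% TAR when nothing is accepted
    trr = true_rejects / n_rej if n_rej else 0.0  # assumed definition of TRR
    rr = n_rej / n
    return tar, trr, rr


# Five toy predictions: one rejected for high dissonance, one for high vacuity.
tar, trr, rr = rejection_metrics(
    u_vac=[0.1, 0.2, 0.9, 0.3, 0.2],
    u_diss=[0.1, 0.8, 0.1, 0.2, 0.2],
    correct=[True, False, False, True, False],
    delta_vac=0.5,
    delta_diss=0.5,
)
```

Using two thresholds lets the model reject a sample that has plenty of evidence (low vacuity) but contradictory evidence (high dissonance), which a single score would accept.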

C. Comparison With SoA
The comparison of the inter-session cross-day recognition accuracy with previous studies, under both the non-rejection and rejection schemes, is presented in Table V. Our proposed uncertainty-aware models outperformed the SoA models under all conditions. The best performance was achieved by ECNN-A, which significantly improves the recognition accuracy by 3.71% and 13.88% in relative terms under the non-rejection and rejection schemes, respectively.
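The reported gains appear to be relative rather than absolute improvements. Assuming the SoA baselines in Table V are the 49.6% [27] and 73.33% [17] figures quoted in Sec. II, the arithmetic works out as:

```latex
\frac{51.44 - 49.6}{49.6} = \frac{1.84}{49.6} \approx 3.71\%,
\qquad
\frac{83.51 - 73.33}{73.33} = \frac{10.18}{73.33} \approx 13.88\%
```

In absolute terms, these correspond to gains of 1.84 and 10.18 percentage points, respectively.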

VI. DISCUSSION
NinaPro DB6 is a valuable benchmark dataset that deserves more investigation because it was built to be challenging by including variability from data acquisitions over days and was aimed at improving the long-term robustness of hand prostheses control systems. It is worth noting that electrodes were not required to be placed at exactly the same locations during each data acquisition, which approaches real-life scenarios and raises the challenge of maintaining the long-term robustness of sEMG-based hand grasp recognition. In this study, we proposed a 3D uncertainty-aware model (ECNN) to address this challenge by rejecting doubtful predictions with multidimensional uncertainties. The potential of ECNN has been explored with three training strategies implemented by slightly different loss functions. The main implication of our results is that a CNN can be easily turned into an uncertainty-aware model by integrating it with evidential deep learning. Regardless of the training strategy used, all ECNN variants statistically significantly outperformed CNN under the rejection scheme, even though the overall mean recognition accuracy of CNN was found to be 3.80% and 0.7% higher than those of ECNN-B and ECNN-C when there is no rejection option. It is worth noting that ECNN-B achieved the best TAR and TRR with a common uncertainty score, $u_{nEntropy}$ or $u_{nnmp}$, although it had the lowest standard accuracy under the non-rejection scheme. This suggests that model selection based on standard recognition accuracy is not sufficient, especially when rejection is introduced. Following our previously proposed reliability analysis [35], the mean reliability of ECNN-B in terms of the normalised area under the precision-recall curve was calculated as 58.81% and 59.88% with $u_{nEntropy}$ and $u_{nnmp}$, which was 4.16% and 5.50% higher than CNN. This matches the comparison between CNN and ECNN-B with respect to rejection-capable recognition performance, indicating that ECNN-B is more reliable than CNN. This finding is consistent with previous research showing that reliability analysis can be considered a useful supplementary measure for studying sEMG-based hand gesture recognition.
The outstanding rejection-capable hand grasp recognition performance achieved by ECNN variants with multidimensional uncertainties confirms that evidential uncertainties are not only understandable but also useful in practical terms. By rejecting predictions with either insufficient or conflicting evidence, the ECNN variants are able to produce robust long-term hand grasp recognition. This implies that model reliability is associated with model robustness. However, it may be quite difficult to determine which model yields the best rejection-capable performance, as this is actually a tradeoff problem. Ideally, if a model has a standard accuracy of 52% and knows exactly what it does not know, it will obtain 100% TAR and 100% TRR with 48% RR by an optimal rejection threshold under the rejection scheme. A comprehensive analysis is required to consider the efficiency of an uncertainty-aware model by making a comparison between the ideal and practical recognition performances. For example, ECNN-B can be considered more efficient than ECNN-A because it obtained an improvement of 54.46% in recognition accuracy by making 59.67% more rejections, while ECNN-A obtained an improvement of 60.60% in recognition accuracy by making 69.65% more rejections compared to the ideal situation.
One concern shown in Fig. 3 was that ECNN-C achieved 0% TAR on some runs because no accepted predictions were found. This indicates that the approach used in this paper to determine the rejection threshold may sometimes yield a very strict threshold. We are confident that ECNN-C can achieve better rejection-capable recognition with appropriate rejection thresholds. Conversely, rejection-capable performance suffers from the limitation that it is highly dependent on optimal rejection thresholds. Fig. 4 shows the confusion matrices of the predictions made by ECNN-A and ECNN-B under the non-rejection and rejection schemes. The first finding from this figure is that it is difficult to use rejections to improve the recognition accuracy of classes that were confused with others. For example, it can be observed that G2 and G4 were difficult to classify correctly. When the highest number of rejections (about 93% RR) was made in the samples labelled G4, only a limited improvement in TAR was achieved by ECNN-A and ECNN-B. The recognition accuracy of G4 even decreased by approximately 50% after making approximately 93% rejections in ECNN-B. Together with the relatively poor rejection-capable performance achieved by the ECNN variants with $u_{diss}$ shown in Table IV, this reveals a limitation in the training of the ECNN variants. The current loss functions used in this paper did not directly penalise training samples with high dissonance because doing so may affect the normal convergence of training CNNs. A boosting algorithm may be a possible solution to this problem [47]. The second finding is that there is a positive correlation between baseline performance and rejection-capable performance. For example, ECNN-B obtained the highest classification accuracy on 'Rest', 86.07% without rejection and 95.21% with rejection. This example also shows that, not surprisingly, 'Rest' is the easiest class. Leaving the easiest and hardest classes ('Rest', G2 and G4) aside, improvements of about 20% in the remaining hand grasps could be obtained by rejecting uncertain predictions with multidimensional uncertainties for both ECNN-A and ECNN-B. This implies that, when designing an uncertainty-aware model integrated with evidential deep learning, one can focus on improving the straightforward baseline performance first without being limited by the need to investigate its rejection-capable performance.

VII. CONCLUSION
This study uses a very challenging benchmark dataset, NinaPro DB6, to demonstrate the potential of designing an uncertainty-aware model to improve the long-term robustness of hand grasp recognition with raw sEMG signals. The proposed ECNN allows us to reject predictions with high evidential uncertainty arising from either insufficient or conflicting evidence. When there is no rejection option, it outperformed the existing SoA with a significant relative improvement of 3.71% in recognition accuracy. More importantly, it achieved a recognition accuracy as high as 83.51% under the rejection scheme, a relative improvement of 13.88% over the current SoA. Furthermore, its long-term robustness has been verified by a high (> 85%) rejection-capable recognition accuracy on each of 3 days, with only a small degradation observed in the afternoon of day 5. This encourages us to extend the investigation to amputee subjects with the aim of improving real-time sEMG-based control systems.