Improving Human Activity Recognition With Wearable Sensors Through BEE: Leveraging Early Exit and Gradient Boosting

Early exiting has recently provided an effective solution for accelerating activity inference by attaching internal classifiers to a deep neural network. It allows easy activity samples to be predicted at shallow layers, without executing deeper ones, offering notable adaptiveness in the accuracy-speed trade-off under varying resource demands. However, most prior works optimize all the classifiers equally on all types of activity data. As a result, deeper classifiers only see hard samples during the test phase, which renders the model suboptimal due to the training-test data distribution mismatch. This issue has rarely been explored in the context of activity recognition. In this paper, to close the gap, we propose to organize all these classifiers as a dynamic-depth network and jointly optimize them in a gradient-boosting-like manner. Specifically, gradient rescaling is employed to bound the gradients of parameters at different depths, which makes the training procedure more stable. In particular, we perform prediction reweighting to emphasize the current deep classifier while weakening the ensemble of its previous classifiers, so as to relieve the shortage of training data at deeper classifiers. Comprehensive experiments on multiple HAR benchmarks, including UCI-HAR, PAMAP2, UniMiB-SHAR, and USC-HAD, verify that our approach achieves state-of-the-art accuracy and speed. A real implementation is measured on an ARM-based mobile device.


I. INTRODUCTION

A. Background
SENSOR technology has recently undergone rapid advancement, marked by several key performance indicators, including high accuracy, miniaturized size, and low manufacturing cost, which have paved the way for the integration of heterogeneous inertial sensors such as accelerometers and gyroscopes into a wide range of portable Internet-of-Things (IoT) and wearable devices, including smartphones and smartwatches. Human Activity Recognition (HAR) refers to the procedure of employing machine learning algorithms and sensor data from various wearable devices to detect and classify human activities such as running, walking, cooking, and falling, and it has become an active research area in the ubiquitous computing community. As previously summarized in several surveys [1], [2], [3], [4], according to the deployment type of sensors, HAR can be roughly divided into two main categories: ambient sensor-based HAR (e.g., images or videos obtained from cameras, Wi-Fi signals) and wearable sensor-based HAR. In the former case, ambient sensors are utilized to capture the interaction between humans and the environment, and are often integrated into users' smart environments. However, such ambient sensors may not work when a person is beyond their coverage range. In addition, video-based HAR raises severe privacy concerns. In the latter case, on-body sensors are usually worn by the users (e.g., an accelerometer and gyroscope embedded in a smartphone), and have become a popular solution for HAR due to their ubiquity, unobtrusiveness, ease of use, and low cost. For example, smartphones have become an indispensable part of our daily lives and can be carried everywhere. As a consequence, sensor-based HAR has been an emerging topic in the ubiquitous computing community, and can essentially be seen as a multichannel time-series classification problem. A few recent studies have shown that physiological signals are often spatio-temporally dependent [5], [6], [7],
[8]. By analyzing raw sensor data recorded from multiple sensors attached to different body positions, HAR can offer useful and smart assistive services to improve people's quality of life across diverse application domains such as sports tracking, elderly care, smart homes, healthcare, patient rehabilitation, human behavior analysis, and Human-Computer Interaction (HCI) [3], [4]. For instance, activity recognition algorithms are highly relevant to neural systems and rehabilitation engineering, allowing wearable robots to dynamically adjust their control mode by leveraging machine learning models to make predictions based on signals from body-worn sensors [9], [10].
Traditional machine learning methods usually require prior domain knowledge to design features manually, which is a lengthy, trial-and-error process. In contrast, deep learning strategies have the advantage of extracting intricate features automatically from raw sensor data, without the need for laborious or time-consuming human intervention in manual feature extraction. During the past decade, deep neural networks have played a dominant role in ubiquitous HAR computing, owing to their automatic feature representation power [3], [4]. For example, Zeng et al. [11] were the first to apply CNNs to multichannel sensor time-series classification through automatic extraction of intricate activity features, showing promising results in activity recognition and significantly outperforming traditional Machine Learning (ML) algorithms that rely on manual feature design. Ignatov [12] proposed a novel CNN architecture that effectively combines automatic CNN features with shallow hand-crafted features, thereby helping to preserve the global structure of sensor signals for real-time activity recognition. Ordóñez and Roggen [13] introduced a network called DeepConvLSTM that incorporates the advantages of recurrent structures, combining CNN and Long Short-Term Memory (LSTM) units to simultaneously extract local and global activity features. Ma et al. [14] integrated attention modules into Gated Recurrent Units (GRU) to propose an attention-based DNN called AttnSense, which targets multi-modal HAR tasks and excels at improving the understandability of deep models' behavior. Despite the tremendous success of deep neural networks in various activity recognition tasks [15], [16], their high computational cost often prevents them from being deployed on resource-limited mobile devices such as smartphones.
Therefore, both the potential computational budget limitation and the practical inference time have to be considered in real-world HAR tasks [17]. How to accelerate activity inference of deep learning models has become a research hotspot. Several popular solutions already exist, such as tensor decomposition, network pruning, weight quantization, lightweight model design, and knowledge distillation [18], [19]. For example, Zhou et al. [20] proposed a hybrid attention-based multi-sensor pruning and feature selection network, named HAP-DNN, which aims at reducing redundancy in multi-sensory and multi-channel information through pruning. Zhu et al. [21] introduced an efficient CNN called Mobile RadarNet using one-dimensional depth-wise and point-wise convolutions, which is specifically designed for human activity classification based on micro-Doppler features. In particular, dynamic neural networks provide another promising solution for accelerating activity inference, and have gained considerable research attention due to their favorable adaptiveness [22], [23].

B. Current Challenge
In practice, the computational budget changes dynamically in the real world. Computational resources may be predominantly occupied by other applications, and the battery level of an edge device might drastically drop during intensive use. Therefore, under varying resource demands, dynamic neural networks are especially practical and useful, as they can find more appropriate accuracy-speed trade-off points for on-device activity inference. A typical dynamic inference scheme is early exiting (EE) [24], [25], [26], [27], which is realized by attaching multiple internal classifiers to a given deep backbone network, hence offering multiple exit points. For example, Daghero et al. [28] presented an adaptive and quantized DNN for HAR deployed on microcontroller devices. Lattanzi et al. [29] systematically investigated how HAR can benefit from such early-exiting behavior, reaffirming its usefulness and efficacy for activity recognition. Once the prediction confidence of an intermediate classifier satisfies a certain criterion (i.e., it is larger than a given threshold), the forward inference is terminated early, and the computational cost of executing deeper layers is saved. Such early-exiting behavior can substantially accelerate activity inference while preserving accuracy: shallow classifiers are capable of handling easy activity samples and avoiding the redundant computation of deeper layers, while deeper classifiers can make reliable predictions for hard inputs through later exits.
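As a concrete illustration, the confidence-gated inference described above can be sketched in a few lines of Python (the logits and the 0.9 threshold are hypothetical; a real system would use the softmax outputs of the attached internal classifiers):

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def early_exit_predict(per_exit_logits, threshold=0.9):
    """Stop at the first classifier whose top softmax probability
    exceeds the threshold; the last classifier always answers."""
    for i, logits in enumerate(per_exit_logits):
        probs = softmax(logits)
        conf = max(probs)
        if conf >= threshold or i == len(per_exit_logits) - 1:
            return i, probs.index(conf)

# An "easy" sample: the shallow classifier is already confident.
easy = [[4.0, 0.1, 0.2], [5.0, 0.1, 0.1]]
# A "hard" sample: only the deeper classifier is confident.
hard = [[1.0, 0.9, 0.8], [4.5, 0.2, 0.1]]
print(early_exit_predict(easy))  # exits at classifier 0
print(early_exit_predict(hard))  # falls through to classifier 1
```

In this toy run the easy sample terminates at the first exit, saving the deeper computation, while the hard sample falls through to the final classifier.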
However, most existing early-exiting works typically train all the classifiers equally over all activity samples, no matter whether they are easy or hard [22], [23]. This potentially causes a data distribution gap between the training and test phases, because not all classifiers will see all types of data at test time.
As illustrated in Fig. 1, when the computational budget is tight, easy activity samples rarely reach deep classifiers, so the deeper classifiers will mostly see hard activity samples (e.g., "falls"). In this way, more easy activity samples are used to optimize shallower classifiers, while deeper classifiers only attend to the fewer hard activity examples and fail to handle easy inputs (i.e., they never see easy activity samples during the training phase). As a result, a training-test data distribution mismatch arises [22], [30], [31], which potentially leads to suboptimal performance. This training-test mismatch issue is very challenging but has rarely been explored in the context of HAR.

C. Research Motivation
To mitigate this training-test mismatch problem, one should seek to reweight all the classifiers while they are organized as a dynamic-depth network. Intuitively, since the whole model with multiple exits can be naturally perceived as a sum of multiple weak classifiers, an intrinsic property of such an early-exiting procedure is that a deeper classifier can offer a beneficial supplement to the ensemble of its previous weak, shallower classifiers by correcting their classification mistakes, which shares a similar spirit with the well-known gradient-boosting strategy [31], [32], [33]. Therefore, inspired by ensemble learning, a simple solution is to iteratively minimize the total loss by training every weak classifier in a gradient-boosting-like manner, incorporating the important notion of sample hardness [30], [31].
However, naïvely training all the classifiers in a gradient-boosting manner does not work well, due to two major contradictions. On one hand, the standard gradient-boosting training pipeline iteratively trains multiple classifiers with early exits: one has to first train a shallow classifier until convergence, and then subsequently train a deeper classifier while fixing the prediction heads of all previous shallow classifiers. Because all these classifiers share the same backbone network, deep classifiers partially share numerous layers with shallower ones and are not independent of each other [24], [32]. Such a sequential procedure might hurt the performance of all previous shallower classifiers if one fixes the parameters shared with shallower layers while only tuning the non-shared parameters of deeper layers. In addition, such an iterative training strategy is very time-consuming. On the other hand, deeper classifiers often have a more powerful representation capacity because of their larger model size, thereby requiring more challenging hard samples. This contradicts the fact that the amount of valid "hard" training samples decreases as the network goes deeper [22], [30]. An ideal optimization strategy should encourage deep classifiers to emphasize hard activity samples by refining or highlighting their loss. Motivated by these observations, this paper mainly targets resolving the two major contradictions to improve the accuracy-latency trade-off of dynamic early exiting in ubiquitous HAR environments.

D. Main Contributions
In this paper, to mitigate the training-test data distribution mismatch problem, we present a generic boosting early-exiting strategy named BEE, which improves early-exiting human activity recognition by incorporating the notion of example hardness and exploits prediction reweighting to train all the classifiers unequally in a boosting-like manner. To resolve the two major contradictions, we modify the naïve gradient-boosting training pipeline in two ways. First, instead of sequential optimization, we employ a joint optimization strategy that organizes all the classifiers as a dynamic-depth network and optimizes them on each training batch. Given a batch of training data, after forwarding it through the whole model to obtain the N prediction results, we aggregate the training losses from all N predictions and perform one back-propagation step. In particular, to effectively train such dynamic neural networks with multiple exits, we perform a gradient-rescaling operation that multiplies the gradients of parameters at different depths by a scalar, which rescales the magnitude of gradients along the backward-propagation path and keeps them at a constant scale across different branches, thereby reducing gradient variance and stabilizing the training process. Second, when trained by the above procedure, we empirically show that the proportion of effective training data tends to shrink as the network goes deeper. Intuitively, the current deep classifier might not be well learned if shallow classifiers always contribute too much during the training phase, leading to suboptimal performance. To resolve this conflict, a prediction-reweighting scheme is employed that rescales the ensemble of all previous shallower classifiers by multiplying their output by a temperature coefficient before it serves as an ensemble member of the current deep classifier, in order to relieve the shortage of training data at deeper classifiers and weaken the prediction of all ensembled shallower classifiers. Fig. 2 presents an overview of the proposed BEE. In summary, our main contributions are three-fold:
1) To our knowledge, this is the first work that targets closing the training-test data distribution gap from the perspective of boosting early-exiting HAR, where all the classifiers are organized as a dynamic-depth network with multiple exits and trained in a boosting-like manner.
2) We modify the conventional sequential boosting pipeline by jointly optimizing multiple classifiers. To make the training procedure more stable, gradient rescaling is introduced. In particular, we perform prediction reweighting to emphasize deep classifiers, so as to mitigate the training-test data mismatch issue.
3) We perform extensive experiments on four popular HAR benchmarks, including UCI-HAR, PAMAP2, UniMiB-SHAR, and USC-HAD, which show that the proposed BEE significantly outperforms existing state-of-the-art early-exiting baselines in terms of accuracy-cost trade-offs. A practical activity inference is implemented on an ARM-based mobile device.
To the best of our knowledge, most existing works primarily focus on developing more sophisticated multi-exit structures but often ignore the training-test mismatch problem [34]. In contrast, the proposed BEE is the first work that focuses on closing the gap between training and test data distributions by improving the training pipeline of dynamic early-exiting networks for activity recognition in a gradient-boosting manner.

II. MODEL

A. Overview
Our whole model consists of two main components: the backbone network and multiple classifiers. The backbone network, composed of N blocks, is designed to maintain generality while allowing flexibility. Each block may comprise one or more convolutional layers, followed by a classifier responsible for generating prediction results. It is important to note that the backbone is not confined to a specific network structure and can be adapted to other popular architectures. In other words, the proposed Boosting Early Exit (BEE) can serve as a modification to existing general network structures. The standard dynamic early-exiting procedure operates as follows [29], [32], [35]: an input sample is forwarded to the first classifier for an initial prediction. If the prediction confidence exceeds a given threshold, the forward inference is stopped early to save computational resources. Otherwise, the sample proceeds to the next classifier, which refines the prediction using previously extracted features, aiming for higher accuracy. This procedure is repeated until a classifier achieves a sufficiently confident prediction or the final classifier is reached. In fact, our boosting early exiting does not rely on the aforementioned sequential training procedure. Instead, it employs a joint optimization over all the classifiers. For a given model with N classifiers, the overall loss can be defined as:

L = Σ_{n=1}^{N} ω_n · ℓ(F_n(x), y),

where x and y represent an input sample and its true class label, respectively, ℓ denotes the classification loss, F_n(x) corresponds to the prediction of the n-th classifier, and ω_n denotes its contribution to the overall loss. In this loss formulation, all classifiers treat each training sample in the same way. However, during inference, the model often exits early for easy inputs, so deeper classifiers see only challenging hard activity samples. This results in a potential data distribution discrepancy between the training and test phases, rendering the whole model suboptimal [22], [30],
[35]. Intuitively, a deep classifier should complement all previous shallow classifiers by iteratively correcting their mistakes. Inspired by gradient boosting, we train the multiple classifiers in a similar manner. Specifically, the final output of each classifier is a linear combination of the output of the current classifier and the outputs of all previous classifiers, as detailed in Fig. 2. However, unlike gradient boosting, we directly optimize the parameters θ_n of the neural network using mini-batch gradient descent. We refer to this approach as "Boosting Early Exit". Concretely,

F_n(x) = F_{n-1}(x) + f_n(x), with F_0(x) = 0,

where f_n(x) denotes the weak prediction before being integrated into the final boosting prediction F_n(x) of classifier n. It is worth noting that, due to the shared backbone, F_{n-1} partially shares a subset of parameters with F_n. Following prior literature [22], [32], shallower classifiers should not be influenced by deeper classifiers; therefore, gradient updates should be restricted from flowing through F_{n-1}.
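The boosting prediction F_n = F_{n-1} + f_n can be sketched numerically as follows (plain Python with made-up weak logits; in an actual PyTorch implementation, the previous ensemble would be detached so gradients do not flow into shallower prediction heads):

```python
def boosted_predictions(weak_outputs):
    """Accumulate F_n = F_{n-1} + f_n with F_0 = 0.
    In PyTorch, F_prev would be detached here (stop_grad) so the loss
    of classifier n does not update shallower prediction heads."""
    ensembles = []
    F_prev = [0.0] * len(weak_outputs[0])
    for f_n in weak_outputs:
        F_n = [p + w for p, w in zip(F_prev, f_n)]
        ensembles.append(F_n)
        F_prev = F_n  # conceptually stop_grad(F_n)
    return ensembles

# Made-up weak logits from three classifiers for one sample.
weak = [[1.0, -1.0], [0.5, 0.5], [-0.2, 1.0]]
F = boosted_predictions(weak)
# F[2] is the running sum of all three weak predictions.
```

Each classifier thus only needs to learn a residual correction on top of the (frozen) ensemble of its predecessors.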

B. Training Details
1) Joint Optimization: Conventional early exiting trains the N classifiers sequentially, e.g., training classifier f_n to convergence, freezing it, and then proceeding to train the next classifier f_{n+1}. However, when all classifiers share the same backbone network, training f_{n+1} in this manner hurts the performance of all classifiers shallower than f_{n+1} [24]. Therefore, we adopt a joint optimization on each batch of training data. Specifically, given a batch of training data, we obtain N prediction outputs after forwarding it through the whole model. We then aggregate the N training losses and perform a single back-propagation step to update the parameters. The overall loss is defined as:

L = Σ_{n=1}^{N} ω_n · ℓ(stop_grad(F_{n-1}(x)) + f_n(x), y),

where ω_n is the loss weight and stop_grad restricts the flow of gradients through F_{n-1}. The full procedure is summarized in Algorithm 1.

Algorithm 1: Training of BEE
Input: Sensor data x and label y; ensemble weights e_n and loss weights l_n; prediction temperatures t_n.
Output: Trained model M.
F_0 = 0
for each batch in all training batches T do
    forward the batch through all N classifiers, aggregate the weighted losses, and update the parameters with one back-propagation step
end for
return M
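A forward-only numeric sketch of this overall loss (illustrative NumPy code; the logits, labels, and weights are made up, and stop_grad is a no-op in a pure forward pass, whereas in PyTorch it would be `.detach()`; the `temperature` argument anticipates the prediction reweighting of Section II-B3, with temperature = 1 giving the plain boosted loss):

```python
import numpy as np

def cross_entropy(logits, label):
    """Softmax cross-entropy for a single sample."""
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def bee_loss(weak_logits, label, loss_weights, temperature=1.0):
    """Overall loss: sum_n omega_n * CE(F_n, y), where
    F_n = temperature * stop_grad(F_{n-1}) + f_n and F_0 = 0.
    stop_grad is a no-op in this forward-only sketch; in PyTorch it
    would be F_prev.detach()."""
    F_prev = np.zeros_like(weak_logits[0])
    total = 0.0
    for f_n, w_n in zip(weak_logits, loss_weights):
        F_n = temperature * F_prev + f_n
        total += w_n * cross_entropy(F_n, label)
        F_prev = F_n
    return total

# Made-up weak logits from three classifiers for one 3-class sample.
weak = [np.array([0.2, 0.1, -0.1]),
        np.array([1.0, -0.5, 0.0]),
        np.array([2.0, -1.0, -0.5])]
loss = bee_loss(weak, label=0, loss_weights=[1.0, 1.0, 1.0])
```

With temperature = 0 the ensemble term vanishes and the loss reduces to the sum of the individual classifier losses, i.e., the standard joint EE objective.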
2) Gradient Rescaling: In fact, different classifiers might exhibit varying predictive power and require different training intensities [27], [35]. Similar to previous work [36], we employ gradient rescaling to balance the gradients of classifiers at varying depths. As illustrated in Fig. 3, we rescale the gradients of different classifiers using the ensemble weights e_n and loss weights l_n. The ensemble weights e_n control the weight of the output of the current classifier n when it participates in the output of classifier n+1. The loss weights l_n, as previously described, control the proportion of the total loss attributed to each classifier's loss. Through this operation, the gradients of the overall loss L with respect to block b_n are kept at a constant scale across different branches. The effectiveness of gradient rescaling is ablated later.
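As a simplified illustration of why rescaling is needed (a toy model, not the exact e_n/l_n scheme described above): in a multi-exit network, a shallow block receives gradient contributions from every downstream classifier, so without rescaling its gradient magnitude grows with the number of classifiers that use it:

```python
def block_grad_norm(n_users, per_loss_grad=1.0, rescale=False):
    """Toy model of gradient accumulation at a shared block:
    each of the n_users classifier losses contributes an
    equal-magnitude gradient to the block."""
    g = per_loss_grad * n_users
    if rescale:
        g /= n_users  # keep a constant scale across depths
    return g

# In a 4-exit network, block 1 is used by 4 classifiers, block 4 by 1.
raw = [block_grad_norm(k) for k in (4, 3, 2, 1)]
scaled = [block_grad_norm(k, rescale=True) for k in (4, 3, 2, 1)]
# Without rescaling, shallower blocks receive much larger gradients.
```

Rescaling equalizes the gradient magnitudes across blocks, which is what stabilizes joint training of the multi-exit network.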
3) Prediction Reweighting: Under the above training procedure, we empirically find that deeper classifiers do not demonstrate significant performance gains over shallower ones, which can be attributed to insufficient training data. In fact, as the network goes deeper, the amount of effective training data at each classifier shrinks. We therefore multiply the ensemble of all previous classifiers by a temperature coefficient t_n before it is combined with the current weak prediction:

F_n(x) = t_n · F_{n-1}(x) + f_n(x),

where f_n denotes the intermediate weak prediction of classifier n. This technique is referred to as "prediction reweighting". With prediction reweighting, the prediction of the shallow classifiers F_{n-1} is weakened before being incorporated into the ensemble output F_n. We set all the temperatures t to 0.5; a detailed ablation study is provided in Section III-D. The detailed training process is shown in Algorithm 1.
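A small worked example of the effect (hypothetical logits): for a sample that the shallow ensemble already classifies confidently, shrinking F_{n-1} by t = 0.5 enlarges the residual loss seen by the deep classifier, so the sample still provides a useful training signal:

```python
import math

def ce(logits, label):
    """Softmax cross-entropy for a single sample."""
    m = max(logits)
    s = sum(math.exp(z - m) for z in logits)
    return -(logits[label] - m - math.log(s))

# Shallow ensemble is already confident on this sample; the deep weak
# classifier f_n has learned nothing yet (all-zero logits).
F_prev = [4.0, -2.0]
f_n = [0.0, 0.0]

loss_plain = ce([a + b for a, b in zip(F_prev, f_n)], 0)             # t = 1
loss_reweighted = ce([0.5 * a + b for a, b in zip(F_prev, f_n)], 0)  # t = 0.5
# With t = 0.5 the residual loss at the deep classifier is larger,
# so this "easy" sample still yields a useful training signal for f_n.
```

In other words, reweighting keeps "easy" samples effective for optimizing deep classifiers instead of letting the shallow ensemble absorb nearly all of the loss.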

III. EXPERIMENTS

A. Experimental Setup
In this subsection, we first introduce the benchmark datasets and the implementation details used for performance evaluation.
2) Implementation Details: To evaluate the effectiveness of the proposed method, we employ the popular CNN and ResNet architectures as our backbones. We use a static 5-layer network without adding any other modules or training tricks. For fair comparisons, we follow the same training-validation-test splitting protocols as [12], [39], [44], and [42].

TABLE II: MAIN RESULTS UNDER ANYTIME MODE IN TERMS OF ACCURACY (%)
Specific details are described as follows. UCI-HAR [12]: 70% of the data from all volunteers is chosen to generate the training set, while the remaining 30% is held out for validation and testing. PAMAP2 [44]: To maintain consistency, we assign the first and second runs of participant 6 to the test set, and the first and second runs of participant 5 to the validation set; the remaining data is left as the training set. UniMiB-SHAR [39]: Following the original dataset's configuration, we employ leave-one-subject-out cross-validation for our evaluation. USC-HAD [42]: In line with recent HAR literature, subjects 1-10 are chosen to generate the training set, while subjects 11 and 12 constitute the validation set, and subjects 13 and 14 constitute the test set. To ensure sufficiently high accuracy at the initial classifier, no early-exiting point is placed immediately after the first convolutional layer. To update model parameters, we use the Stochastic Gradient Descent (SGD) optimizer with a momentum of 0.5 and a weight decay of 0.01. All models are trained for 300 epochs with an initial learning rate of 0.005, which is decayed by a factor of 0.5 every 50 epochs. The batch size is consistently set to 64. Regarding hardware, the entire experiment is conducted using the PyTorch deep learning library on a server equipped with an NVIDIA GeForce RTX 3090 GPU. To ensure robustness, we repeat the training procedure 10 times and report the averaged results with standard deviations.
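The optimization recipe above corresponds roughly to the following PyTorch configuration (a sketch with a stand-in model; the real multi-exit backbone, data loading, and BEE loss computation are omitted):

```python
import torch

model = torch.nn.Linear(9, 6)  # stand-in for the multi-exit backbone

optimizer = torch.optim.SGD(model.parameters(), lr=0.005,
                            momentum=0.5, weight_decay=0.01)
# Decay the learning rate by a factor of 0.5 every 50 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.5)

for epoch in range(300):
    # ... iterate over the training batches (batch size 64),
    #     compute the BEE loss, and call optimizer.step() ...
    scheduler.step()
```

`StepLR` with `gamma=0.5` and `step_size=50` reproduces the stated halving schedule over the 300 training epochs.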
3) Compared Methods: Ma et al. [14] introduce a framework called AttnSense that integrates an attention mechanism into a CNN and a GRU network; this framework is designed to capture dependencies of sensing signals in both the spatial and temporal domains. Singh et al. [48] propose a deep neural network architecture that not only learns spatiotemporal dependencies from multiple sensor modalities, but also identifies and captures prominent time intervals through a self-attention mechanism. Based on the Federated Averaging method and a specially designed perceptive extraction network (PEN), Xiao et al. [41] introduce a federated learning system called HARFLS, which allows each user to tackle its own activity classification task collectively. Khaertdinov et al. [43] present a deep triplet network that combines LSTM networks with various triplet loss functions and batch sampling techniques for activity classification. Li et al. [47] propose a federated representation learning framework named Meta-HAR, where a signal embedding network is meta-learned federatively; the learned signal representations are then fed into personalized classification networks for activity recognition at each end user. Ignatov [12] develops a user-independent deep learning approach that integrates a CNN with traditional handcrafted statistical features for online human activity classification. Zeng et al. [44] embed two attention mechanisms, temporal attention and sensor attention, into LSTM backbones for HAR. Saeed et al. [45] present a novel self-supervised feature learning approach for sensory data that requires no semantic labels. Jiang and Yin [42] transform accelerometer and gyroscope signal sequences into two-dimensional activity images, enabling CNNs to automatically learn optimal features for HAR without relying on manually designed features.

B. Main Results
In this subsection, following previous works [26], [27], we report the main results under the anytime prediction and budgeted batch modes. Then, two exemplar illustrations, including confusion analyses and sample distributions, are provided to showcase the effectiveness and usefulness of our proposed approach.
1) Anytime Prediction: Built on both backbones (CNN and ResNet), we compare our proposed boosting early-exiting (BEE) with the conventional early-exiting (EE) baseline under the anytime prediction mode [23], where all activity samples are forced to exit at the same classifier, regardless of prediction hardness. Additionally, we compare it with numerous state-of-the-art methods from the previous literature [12], [14], [39], [41], [42], [43], [44], [45], [46], [47], [48]. TABLE II summarizes the main results on UCI-HAR, PAMAP2, UniMiB-SHAR, and USC-HAD. Through multiple exits under the same computational overhead (an indirect cost metric, i.e., FLOPs), our method consistently surpasses the conventional EE baseline by a large margin. Interestingly, we find that the performance gain becomes even more significant as the networks go deeper. At the last classifier, the accuracy boosts are even larger than 1% on UCI-HAR and UniMiB-SHAR. This can be attributed to insufficient training data for deeper classifiers in the conventional EE training pipeline. Our proposed BEE effectively mitigates this issue by performing gradient rescaling and prediction reweighting.
2) Budgeted Batch Classification: Next, we compare our proposed BEE with the conventional EE training baseline and two static backbones (i.e., CNN and ResNet) under the budgeted batch classification mode [27]. In contrast to the anytime prediction mode, this is a more common scenario where some hard activity samples need more processing time (i.e., exit later) than easy ones. In other words, for a given computational budget, the dynamic early-exiting network receives a batch of activity samples for dynamic inference, and each sample may exit at a different classifier according to its hardness. Following the protocol described by Yang et al. [27], one can empirically determine the confidence thresholds of all the classifiers on the held-out validation set (also see TABLE I), so as to maximize classification accuracy without exceeding the allocated computational resources. We highlight that our algorithm does not solely focus on improving accuracy. Instead, it seeks to effectively accelerate activity inference while preserving high accuracy, striking a better accuracy-latency trade-off for real-time activity recognition. To verify the effectiveness and efficiency of our proposed approach, we plot the accuracy-cost curves in Fig. 4.
It can be seen that our proposed BEE approach consistently improves the accuracy-computation trade-off of the vanilla EE baseline. On one hand, under the same computational budget, our approach consistently surpasses the compared baseline by roughly 1.0%, 0.8%, 1.5%, and 1.9% on the UCI-HAR, PAMAP2, UniMiB-SHAR, and USC-HAD datasets, respectively. On the other hand, at the same accuracy, our proposed BEE performs much faster than the compared EE baseline while allowing better adaptiveness. For instance, as shown in the figure, our approach takes much less computational cost to achieve almost the same accuracy as the original static backbone network, leading to around 2.8×, 2.0×, 2.2×, and 2.0× speedups on the four HAR benchmark datasets. That is to say, in addition to better flexibility and adaptiveness, our algorithm can provide more prompt and timely feedback to optimize the user experience of online activity classification systems, without sacrificing overall performance. As can be seen in the embedded pie charts, our proposed BEE allows more easy activity samples to be predicted at early exits (e.g., exit-1 and exit-2).
3) Confusion Analysis and Sample Distribution: For a more detailed analysis, we compute the confusion matrices of CNN-EE and CNN-BEE under the maximum-budget scenario (using only the final classifier exit) on the UCI-HAR dataset. As shown in Fig. 5, the most significant confusion exists between "standing" and "sitting", as their activity waveforms exhibit similarities; these can be considered "hard" samples. The results demonstrate that the proposed BEE significantly improves classification accuracy on "hard" samples compared to conventional EE. In contrast, "lying down" is correctly classified by both EE and BEE, indicating that this type of sample is easily recognized by the model and can be considered an "easy" sample. Furthermore, we visualize the distribution of sample exits. We use three sets of exit thresholds, τ1 = {0.6, 0.6, 0.6}, τ2 = {0.85, 0.85, 0.85}, and τ3 = {0.95, 0.95, 0.95}, to simulate scenarios with low, medium, and high budgets; the results are presented in Fig. 6. It is evident that "easy" samples tend to exit at the first classifier in all scenarios, while "hard" samples tend to exit at deeper classifiers as the resource budget increases. The results validate that the proposed method dynamically allocates computational resources based on sample difficulty, aligning with our intuitive understanding.
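The threshold-driven exit distribution can be simulated with a small sketch (the per-exit confidences below are hypothetical, and the three internal exits plus final classifier mirror the setup above):

```python
def exit_index(confidences, thresholds):
    """Index of the first exit whose confidence clears its threshold;
    samples that clear none fall through to the final classifier."""
    for i, (c, tau) in enumerate(zip(confidences, thresholds)):
        if c >= tau:
            return i
    return len(confidences)  # final classifier

# Hypothetical per-sample confidences at the three internal exits.
samples = [[0.97, 0.99, 0.99],   # "easy": confident immediately
           [0.70, 0.90, 0.96],   # medium
           [0.40, 0.55, 0.80]]   # "hard"
for tau in (0.6, 0.85, 0.95):    # low / medium / high budget
    dist = [exit_index(s, [tau] * 3) for s in samples]
    print(tau, dist)  # higher thresholds push samples toward deeper exits
```

Even in this toy setting, the easy sample always exits first, while raising the thresholds (i.e., granting a larger budget) pushes the medium and hard samples toward deeper exits, matching the trend in Fig. 6.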

C. Discussion
In this subsection, we explain the necessity of the proposed BEE by analyzing the source of performance improvement on UCI-HAR dataset.Furthermore, to highlight the significance of prediction-reweighting, we conduct an in-depth analysis by comparing the number of activity samples correctly classified at each exit point.In comparison, at Exit 1, BEE is relatively weak compared to EE while classifying "hard" activity classes such as "sitting" and "standing".Instead, at Exit 4, BEE outperforms conventional EE across all action categories, especially for the "hard" activity class, e.g., "standing".This supports our claim that in our design early classifier tends to prioritize the classification of "easy" activity classes, while later classifier concentrates more on "hard" ones, highlighting the role of prediction reweighting.could be well trained due to insufficient training data, which is in well line with Fig. 7a.One potential reason is that deeper classifiers have a larger model size while maintaining stronger feature representation power, but this also indicates that they need more challenging training samples, instead of only learning "easy" residuals left by shallower classifiers [32].To mitigate this phenomenon, we introduce BEE, which employs two strategies denoted as 'gradient-rescaling' and 'prediction-reweighting' to emphasize deep classifiers through weakening the ensemble of its previous shallow classifiers.Furthermore, an in-depth analysis is conducted on the classification accuracy of all activity classes at each classifier exit.All experiments use the UCI-HAR dataset, utilizing CNN-BEE and CNN-EE models under an anytime setting, where all samples are forced to exit through the same exit.As illustrated in Fig. 
8, the difference in the number of correctly classified samples between CNN-BEE and CNN-EE is presented. It is evident that BEE does not uniformly surpass conventional EE in classification accuracy across different classes at various classifier exits, although it demonstrates superior overall classification accuracy across all activity classes. At Exit 1, BEE identifies 40 more correct samples in the "easy" activity class "laying" compared to EE, but it shows noticeably weaker performance on "hard" activity classes such as "sitting" and "standing". As described earlier, "hard" samples exhibit lower confidence at the first classifier exit, while "easy" samples exhibit higher confidence, allowing hard samples to undergo additional computation rather than exiting at the first classifier. At Exit 4, BEE outperforms conventional EE in classifying all categories, particularly the "hard" activity class "standing", which aligns with the requirements of adaptive inference. The analysis reveals a clear trend in the proposed approach: early classifier exits prioritize the classification of "easy" activity classes, while later exits focus on "hard" ones. Typically, in conventional EE networks, the predictions at each classifier exit are independent, equivalent to a series of independent models with varying depths. In contrast, training an EE network with multiple classifier exits in a boosting paradigm allows the exits to complement and cooperate with one another. This cooperative mechanism across multiple classifier exits is the source of the performance improvement observed in BEE over conventional EE.

D. Ablation Studies
In this subsection, we perform detailed ablation studies to analyze the effect of our two major hyper-parameters: the prediction-reweighting temperature and the batch size. Moreover, we explore the impact of the gradient-rescaling module by comparing the original model with gradient-rescaling against its variant without gradient-rescaling.
1) Prediction-Reweighting: As aforementioned, prediction-reweighting plays a dominant role in providing more valid training data for deeper classifiers. Therefore, we vary the temperature t between 0 and 1 to explore how prediction-reweighting affects the boosting performance on the UCI-HAR dataset. The results are shown in Fig. 9a. When t is equal to 1, deep classifiers fail to be emphasized, since the ensemble of their previous classifiers is not weakened; in this case, our proposed BEE performs the worst among all the cases, because there are not sufficient training data to regulate the deep classifiers. On the other hand, when t is set to 0, our BEE essentially degenerates into the standard EE without incorporating an ensemble of all previous classifiers, which is significantly inferior to the case of t = 0.5. Without loss of generality, we consistently set the temperature values of all classifiers to 0.5 throughout this paper.
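The role of the temperature can be made concrete with a small sketch, assuming logit-level ensembling of the form F_n = t · F_{n−1} + f_n; the logit values below are hypothetical, and this is a simplification of the paper's deep classifiers.

```python
def boosted_prediction(per_exit_logits, t):
    """Ensemble per-exit logits as F_n = t * F_{n-1} + f_n.
    t = 0 degenerates to standard EE (only the current exit counts);
    t = 1 keeps the full, unweakened ensemble of previous exits."""
    F = [0.0] * len(per_exit_logits[0])
    for f in per_exit_logits:
        F = [t * F_c + f_c for F_c, f_c in zip(F, f)]
    return F

logits = [[2.0, 0.0], [0.5, 1.5], [0.0, 3.0]]  # three exits, two classes
print(boosted_prediction(logits, 0.0))  # [0.0, 3.0]: last exit alone
print(boosted_prediction(logits, 1.0))  # [2.5, 4.5]: plain sum of all exits
print(boosted_prediction(logits, 0.5))  # earlier exits down-weighted
```

An intermediate t keeps the complementary signal from shallow exits while leaving the deeper classifier a meaningful residual to learn, which matches the observed optimum around t = 0.5.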
2) Batch Size: Intuitively, the batch size should play a crucial role in such a boosting early-exiting training pipeline.
During training, if the batch size is too small, the shallower classifiers may fluctuate too much, which affects subsequent deeper classifiers; in this case, the deep classifiers cannot be well learned. Therefore, a sufficiently large batch size is more beneficial for stabilizing the training process. We study how the batch size affects the final performance on the PAMAP2 dataset. As shown in Fig. 9b, we observe a non-monotonic trend as the batch size increases, where both too-small and too-large batch sizes lead to worse performance. In fact, an overly large batch size also hurts the generalization ability of the trained model. Therefore, as a trade-off between generalization and stability, we employ a medium batch size of 64 to train the models.
3) Gradient-Rescaling and Gradient-Stopping: To evaluate the effectiveness of gradient-rescaling, we further train the boosting dynamic network with and without gradient-rescaling on the UCI-HAR dataset, and plot the accuracy curves under both settings. As shown in Fig. 9c, the model trained with gradient-rescaling (red curve) consistently performs better than its counterpart without gradient-rescaling (blue curve), which is consistent with our previous observation that gradient-rescaling helps to stabilize the training procedure while boosting activity classification accuracy. Moreover, as illustrated in Algorithm 1, we always prevent the gradient flow of back-propagation along the F_{n−1} branch, whose primary aim is to disentangle deeper losses from shallower classifiers and avoid hurting model performance. For comparison, we also train the boosting dynamic network while allowing gradients to flow along F_{n−1}. As shown in Fig. 9c, we observe a drastic accuracy drop (green curve) due to multiple classifiers sharing the same backbone, indicating the necessity of the gradient-stopping mechanism.
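The effect of gradient-stopping can be illustrated with a toy scalar example; this is our own minimal sketch (scalar classifiers f_n(x) = w_n·x with a squared loss and hand-derived gradients), not the paper's network.

```python
# Toy setup: two scalar "classifiers" f1 = w1*x and f2 = w2*x,
# ensembled as F2 = t * F1 + f2 with F1 = w1*x, squared loss on F2.
x, y, w1, w2, t = 1.0, 1.0, 0.5, 0.2, 0.5

F1 = w1 * x
F2 = t * F1 + w2 * x
err = F2 - y  # L2 = err**2

# Without gradient-stopping, the deeper loss leaks into w1
# through F1 (chain rule: dL2/dw1 = 2*err * t * x):
grad_w1_leaky = 2.0 * err * t * x
# With gradient-stopping, F1 is treated as a constant inside L2,
# so the deeper loss contributes nothing to w1:
grad_w1_stopped = 0.0

print(grad_w1_leaky)    # non-zero: deeper loss perturbs the shallow exit
print(grad_w1_stopped)  # zero: shallow exit trained only by its own loss
```

In a framework like PyTorch the stopped variant corresponds to detaching F_{n−1} from the computation graph before it enters the deeper ensemble, so each shallow classifier is updated only by its own loss.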
In this way, we explain why the optimal hyper-parameters, namely the prediction-reweighting temperature and the batch size, are consistently set to 0.5 and 64 throughout the paper, while validating the necessity of the gradient-rescaling module.

E. Generalization Analysis 1) Integration With Other Deep Learning Architectures:
In this part, we evaluate the generalization ability of our proposed BEE on other representative backbone networks. We apply the proposed BEE to the widely-employed DeepConvLSTM

TABLE III INTEGRATION WITH OTHER BACKBONE ARCHITECTURES
backbone network to validate its generalizability on the UniMiB-SHAR dataset. Following previous works [13], the DeepConvLSTM architecture is mainly comprised of four convolutional layers followed by two LSTM units, where intermediate classifiers are inserted after each convolutional layer. To preserve the accuracy of the initial classifier, no early-exiting point is applied after the first convolutional layer. The comparison results are shown in TABLE III. It can be seen that our proposed BEE consistently outperforms EE across all early-exiting points at identical computational cost. The experimental results suggest that the performance improvement induced by our BEE can still be obtained in combination with a wide range of architectures, including convolutional, residual, and other hybrid backbones, exhibiting its excellent generalization in the context of HAR.
2) Generalizing Our Approach to Predict Freezing of Gait in Patients With Parkinson's Disease: In previous sections, we evaluated the performance of our model on four commonly employed benchmark datasets, which mainly focus on young and healthy persons. Since the number of elderly people aged 60 and over has been growing rapidly, health strategies are required to help elderly people maintain their daily living, prevent diseases, or support rehabilitation [49], which is otherwise very costly to individuals and the healthcare system (WHO 2002). Therefore, we further evaluate our methodology on data captured with wearable sensors from elderly and diseased people under realistic hospital conditions. The Daphnet Freezing of Gait (FoG) dataset [50] is utilized for our evaluation; it was purposely devised to benchmark machine learning methods for recognizing freezing of gait in patients with Parkinson's disease (PD). It was collaboratively collected by the Laboratory for Gait and Neurodynamics, Tel Aviv Sourasky Medical Center, Israel, and the Wearable Computing Laboratory, ETH Zurich, Switzerland, and includes data from three female and seven male PD patients with wearable accelerometers attached to their legs and hip. To observe possible FoG occurrences, each patient was asked to perform three kinds of activities: straight-line walking, walking with numerous turns, and finally a more realistic activity of daily living (ADL), where patients went into different rooms to open doors, fetch coffee, etc.
TABLE IV summarizes basic clinical characteristics of these patients, including gender, age, disease duration, and H&Y (Hoehn & Yahr) scale. In the OFF state, all the patients could walk without assistance, although relatively large gait variability could be observed among them. According to the patients' self-descriptions of when FoG occurred most frequently, two patients were in the ON state while the other eight were in the OFF state during the FoG-provoking sessions. As seen in the table, the 4th and 10th patients did not show any FoG episodes during the data collection process. After noise filtering and data normalization, the gait time series were segmented with a one-second sliding window and an overlap rate of 50% between adjacent windows. To ensure fair comparisons, we only compare our approach with works that have already been evaluated on the Daphnet dataset under similar experimental settings. TABLE V summarizes the comparison results. The two competing methods proposed by Orphanidou et al. [51] and Kleanthous et al. [52] mainly rely on time-frequency domain features extracted from an acceleration signal window, which are then manually selected and fed into a traditional machine learning classifier for recognizing the gait classes of interest. The other competing approach presented by Elziaat et al.
[53] is a deep learning-based model, where the Short-Time Fourier Transform (STFT) is first exploited to generate a spectrogram image from an acceleration signal window, and a CNN-based network is then applied to automatically learn activity features from the generated spectrogram images. Finally, the learned features are fed into a traditional SVM classifier with a radial basis function (RBF) kernel for activity classification. Based on the comparison results, our proposed approach significantly outperforms all three compared methods in terms of average accuracy. At the same time, our approach has an additional advantage: it is dynamic and allows faster activity inference through early-exit behavior. Overall, our approach is universal and can generalize well to other rehabilitation-related tasks, supporting elderly or diseased people by detecting their activity patterns and intervening when a disease might affect their movements.
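The one-second sliding window with 50% overlap used in the preprocessing above can be sketched as follows; the window and step sizes here are illustrative, since the actual sample counts depend on the sensor's sampling rate.

```python
def sliding_windows(series, window, step):
    """Segment a time series into fixed-length, overlapping windows."""
    return [series[i:i + window]
            for i in range(0, len(series) - window + 1, step)]

# A 50% overlap means the step is half the window length.
signal = list(range(10))
windows = sliding_windows(signal, window=4, step=2)
print(windows)  # [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

For a sensor sampled at, say, 64 Hz, a one-second window with 50% overlap would correspond to `window=64, step=32`.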

F. Actual Implementation
FLOPs is an indirect metric that does not directly translate into direct metrics such as inference speed or latency. Therefore, in this subsection, we adopt a direct metric (i.e., inference latency) rather than FLOPs to evaluate how our proposed approach improves the accuracy-latency trade-off on a specific hardware platform. In fact, practical on-device latency is a crucial aspect of recognizing human activities. Deep neural networks now prevail in a wide range of HAR applications; due to their high computational costs, the ability to accelerate them within a given latency constraint is of paramount importance for classification accuracy on resource-limited wearable devices such as mobile phones. For instance, detecting falls in elderly people requires a short response time. High latency would cause the activity recognition system's feedback to lag behind the current user actions, thereby degrading the overall quality of the real-world user experience. The importance of latency can hardly be overemphasized, since a user easily notices when a mobile app is slow to respond.
One main benefit of our proposed BEE is that it enables the trained model to dynamically vary its computing budget during inference. We provide a real use case where the model is deployed to an ARM-based device, a Raspberry Pi 4B, which is well compatible with the popular PyTorch deep learning toolkit. We benchmark the practical activity inference speed of our boosting early-exiting models under both anytime and budgeted batch modes. We randomly choose 400 activity samples from the UCI-HAR validation set for performance evaluation, with the test batch size set to one. Following the previous standard working pipeline [27], we first train the backbone network with multiple exits through boosting, and then load the trained PyTorch models to perform activity inference. To this end, an application program is written in Python. Specifically, we utilize the "time" library to estimate inference time: a timer is launched (i.e., 'starting time') when the deployed model begins to read sensor data, and is stopped once a confident prediction is returned (i.e., 'ending time'). The elapsed inference time is then obtained by a simple subtraction. Fig. 10 presents the main user interface under anytime mode, where red color highlights the prediction results, exhibiting varying inference speeds at different exiting points. Under budgeted batch mode, we plot the inference time curves and confusion matrices in Fig.
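The timing procedure can be sketched as follows; `run_inference` is a hypothetical stand-in for the deployed model's forward pass, not the authors' actual application code.

```python
import time

def run_inference(sample):
    """Placeholder for reading sensor data and running the
    early-exiting forward pass on one sample (hypothetical)."""
    return "WALKING"  # hypothetical prediction

def timed_prediction(sample):
    start = time.perf_counter()            # 'starting time'
    prediction = run_inference(sample)     # predict with early exit
    elapsed = time.perf_counter() - start  # 'ending time' minus start
    return prediction, elapsed

pred, latency = timed_prediction(sample=None)
print(pred, f"{latency * 1000:.3f} ms")
```

We use `time.perf_counter` rather than `time.time` in this sketch because it provides a monotonic, high-resolution clock suited to measuring short intervals.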
11. While changing the budget level from high to low (e.g., a common case where other background apps occupy more of the available computing resources), we measure the practical inference delays of all 400 test activity samples under three pre-defined budgets, low, medium, and high, with confidence thresholds [0.65, 0.65, 0.65], [0.85, 0.85, 0.85], and [0.95, 0.95, 0.95], respectively. Here the blue solid lines indicate the practical inference time, the red dashed lines indicate the average inference time, and the green dashed lines indicate the inference time of the static CNN. In the top panel (i.e., high budget), the practical inference delays are always below 6 ms; in the middle panel (i.e., medium budget), they approach 8 ms; in the bottom panel (i.e., low budget), they are mostly over 10 ms. Likewise, as the budget level decreases, the average inference delays increase significantly. The experimental results confirm that our proposed BEE can adaptively accelerate activity inference by tuning its confidence thresholds online, without leading to pronounced confusion, as indicated by the confusion matrices. Overall, given the same budget limitation, our BEE provides a better accuracy-latency trade-off through its boosting mechanism compared to the vanilla EE baseline and the static model, with no need for retraining or offloading/downloading new models.

IV. CONCLUSION
In this paper, we focus on resolving the training-test data distribution mismatch problem from the new perspective of boosting early-exiting HAR, which is premised on the fact that deeper classifiers often see only the hard activity samples during the inference phase. To this end, we introduce a new training strategy called BEE, which organizes multiple classifiers with early exits as a dynamic-depth neural network and trains them in a gradient-boosting-like manner. Different from the conventional boosting pipeline, we employ a joint optimization to update all the classifiers on every training batch. To train the model effectively, a gradient-rescaling operation is performed to rescale the gradients of the classifiers at different depths, which makes the training procedure more stable. To close the training-test data distribution gap, we introduce a prediction-reweighting strategy that increases the effective amount of training samples for a given deeper classifier by weakening the prediction of the ensemble of all its previous shallower classifiers. Our proposed BEE obtains superior performance on four popular HAR benchmarks in both anytime and budgeted-batch prediction settings, and can be naturally integrated with a variety of existing HAR works to tackle the potential challenges of long-tailed distributions or class imbalance in future work.

A. Broad Impact
Since body-worn sensors are well suited to collecting data on activity patterns over long time periods, the need for such automatic activity classification systems arises in a broad range of application domains, from health-related studies to pervasive computing scenarios. To date, there is high demand for exploring the links between levels of physical activity and common diseases, and activity recognition systems provide a more reliable approach to quantifying physical activity levels (also see [54]). For example, activity recognition systems may be utilized to provide feedback about how an individual adheres to her/his daily or weekly personal fitness plans. They have also been shown to be valid for assessing physical activity levels in Parkinson's patients. In stroke, accelerometer-based systems have been applied to identify real-world upper extremity movement, which can then help derive treatment outcomes. With an ageing population, more elderly persons are living alone, inevitably increasing the incidence of falls; accurately detecting a fall and automatically calling for help would clearly be of high benefit. Besides health-related applications, activity recognition plays a crucial role in ubiquitous computing scenarios. In this paper, our approach mainly contributes towards improving the resource consumption and computational efficiency of applying expensive deep learning models to activity recognition. Different from most prior works, it takes a new direction by avoiding the training-test data mismatch to decrease the overall FLOPs required for making activity predictions. The performance improvements in accelerating inference induced by our BEE allow real-time activity recognition algorithms to be deployed on edge devices and mobile hardware, and reduce the total resource usage of computation-intensive deep learning models on resource-constrained mobile platforms.

B. Potential Limitations and Future Study
In this paper, our approach mainly focuses on avoiding the train-test mismatch, which leads to a better overall accuracy-cost trade-off for activity inference. However, our dynamic early-exiting strategy is still based on the confidence scores at every individual exit. Although such a model can adaptively produce different outputs at varying levels of computational resources, the quality of its confidence measures has largely been ignored. As a consequence, most existing state-of-the-art dynamic neural networks are typically miscalibrated or overconfident (also see [55]), and fail to offer reliable confidence information. Recently, uncertainty quantification has been of interest to probabilistic (or Bayesian) methods in the machine learning community, attracting increasing attention in various computer vision tasks. Despite remarkable progress, it has been seldom explored for dynamic early-exit HAR with a focus on confidence metrics. How to fix dynamic neural networks for reliable HAR while considering the overconfidence or calibration of confidence estimates remains a challenging problem. We leave this as our future work.

Fig. 2 .
Fig. 2. Detailed scheme of the proposed Boost Early Exit. A joint optimization is employed to train all the classifiers in a boosting-like manner, where each classifier is viewed as an ensemble of the current deep classifier with all previous shallow classifiers.

Fig. 3 .
Fig. 3. An overview of the gradient-rescaling operation.

Algorithm 1 Training Pipeline for Boost Early Exit
Input: Model M with parameters θ_n for the N classifiers; sensor data x and labels y; ensemble weights e_n, loss weights l_n, and prediction temperatures t_n.
Output: Trained model M.
F_0 = 0
for t in all training batches T do
    L = 0
    for n in all classifiers N do
        F′ = stop_grad(e_n · F_{n−1}(x_t))
        L_n = loss_function(t_n · F′ + f_n(x_t), y_t)
        L = L + l_n · L_n
    end for
    [ξ_1, . . ., ξ_N] = compute_gradients(L, [θ_1, . . ., θ_N])
    apply_gradients([θ_1, . . ., θ_N], [ξ_1, . . ., ξ_N])
end for
return M

As networks go deeper, deep classifiers have a larger model size, hence risking overfitting. Since the final prediction of each classifier can be seen as an ensemble of the current classifier's prediction and the predictions of all previous classifiers, an overemphasis on previous predictions during training would weaken the prediction of the current classifier. To reconcile this conflict, we multiply the previous boosting prediction F_{n−1} of the (n − 1)-th classifier by a temperature parameter t_n before incorporating it into the later ensemble F_n:
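Algorithm 1 can be sketched as a toy, self-contained instantiation. Here each "classifier" is a single scalar weight with f_n(x) = w_n·x, a squared loss, and hand-derived gradients; this is our own simplification for illustration (the actual method uses deep classifiers, cross-entropy, and autograd).

```python
def bee_step(ws, x, y, e, lw, t, lr):
    """One joint update of Algorithm 1 on a single (x, y) pair.
    ws: weights of the N scalar classifiers; e, lw, t: ensemble,
    loss, and temperature weights per classifier; lr: learning rate."""
    grads = [0.0] * len(ws)
    F_prev = 0.0
    for n, w in enumerate(ws):
        F_det = e[n] * F_prev             # stop_grad: constant w.r.t. ws
        F_n = t[n] * F_det + w * x        # boosted prediction of exit n
        err = F_n - y                     # squared loss L_n = err**2
        grads[n] = lw[n] * 2.0 * err * x  # dL_n/dw_n (F_det is constant)
        F_prev = F_n
    return [w - lr * g for w, g in zip(ws, grads)]

def total_loss(ws, x, y, e, lw, t):
    """The joint objective L = sum_n lw[n] * L_n."""
    L, F_prev = 0.0, 0.0
    for n, w in enumerate(ws):
        F_n = t[n] * e[n] * F_prev + w * x
        L += lw[n] * (F_n - y) ** 2
        F_prev = F_n
    return L

ws, x, y = [0.0, 0.0], 1.0, 1.0
e, lw, t, lr = [1.0, 1.0], [1.0, 1.0], [0.5, 0.5], 0.1
before = total_loss(ws, x, y, e, lw, t)
ws = bee_step(ws, x, y, e, lw, t, lr)
after = total_loss(ws, x, y, e, lw, t)
print(before, after)  # the joint boosted loss decreases after one step
```

Note how the stop-gradient makes each classifier's gradient depend only on its own residual error, mirroring the disentanglement of deeper losses from shallower classifiers described above.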

Fig. 6 .
Fig. 6. The output distributions from classifiers at different layers. τ1, τ2, and τ3 refer to the three threshold settings mentioned before.

Fig. 7 .
Fig. 7. Percentage of valid training data and loss magnitude of every classifier. Left: conventional EE. Right: our BEE.

Fig. 8 .
Fig. 8. The difference in the number of correctly classified samples between BEE and EE across different action categories. A positive bar (above zero) indicates that BEE correctly classifies more samples, whereas a negative bar (below zero) indicates that EE correctly classifies more samples.
Fig. 7d and 7b depict the training loss and the amount of valid training samples for each classifier of BEE. It can be seen that the proposed method effectively alleviates the problems of relatively low loss magnitude and the lack of valid training data for deeper classifiers.