On-Device Personalization for Human Activity Recognition on STM32

Human activity recognition (HAR) is one of the most interesting applications for machine learning models running on low-cost and low-power devices, such as microcontrollers (MCUs). MCUs are typically dedicated to performing inference on the data they acquire, and any form of model training or update is delegated to external resources. We consider this mainstream paradigm a severe limitation, especially when privacy concerns prevent data sharing and, in turn, model personalization, which is widely recognized as beneficial in HAR. In this letter, we present a HAR solution in which MCUs directly fine-tune a deep learning model using locally acquired data. In particular, we enable training functionalities for 1-D convolutional neural networks (CNNs) on STM32 microcontrollers and provide a software tool to estimate the memory and computational resources required to accomplish model personalization.


I. INTRODUCTION
Recently, we have witnessed a broad diffusion of Internet of Things (IoT) devices equipped with tiny microcontroller units (MCUs) and an increasing interest in tiny machine learning (TinyML) [1] research to leverage machine learning models on low-power devices. Human activity recognition (HAR) is among the most frequently addressed problems in the TinyML domain [2]. The mainstream paradigm in HAR consists of classifying, e.g., by a 1-D convolutional neural network (CNN), segments of inertial measurement unit (IMU) recordings, which are easy to gather in wearable devices mounting accelerometers and gyroscopes. Model training is performed exclusively on a server using large amounts of annotated data and large computational resources. Once trained, the model is possibly optimized, e.g., by distillation, quantization, or pruning [2], and then deployed to the MCU, which is only in charge of inference. Not surprisingly, the vast majority of TinyML solutions support only inference on MCUs. Examples include TensorFlow Lite Micro [3] from Google and X-CUBE-AI [4] from STMicroelectronics.
On-device learning (ODL) solutions for MCUs are scarce in the literature, and a few frameworks expose training functionalities only for dense layers [5]. This is at odds with modern CNNs, which favor convolutional layers. Moreover, TinyTL [6] and Train++ [7] are purely algorithmic studies and do not provide any implementation that can be used in HAR. The only framework that specifically addresses CNN training on MCUs is [8], which introduces an algorithm-system co-design that combines quantization and sparse-update techniques to enable the training of an image classifier on STM32 MCUs with minimal memory requirements (256 KB). However, none of these frameworks, including [8], has been used to investigate model personalization in HAR by fine-tuning all the layers of a CNN directly on the MCU. We believe this is a relevant problem for HAR, where model personalization is often key to compensating for the heterogeneity among subjects, which results in very different signals for the same activity. Moreover, performing model personalization directly on the device is key to avoiding the sharing of confidential data.
In this letter, we perform, for the first time, model personalization for HAR directly on an STM32 microcontroller. To this end, we implemented both a software framework that enables fine-tuning all the layers of a 1-D CNN on the MCU and a tool to estimate the memory footprint and the computational resources required for the personalization. Our framework can also be used to address other problems in HAR, namely: 1) enabling continuous learning to counteract concept drift and 2) enabling federated learning (FL) mechanisms [9] in a fully distributed manner.
Our experiments, performed on real-world HAR datasets, confirm that model personalization in HAR is very beneficial and that it is possible to retrain 1-D CNNs while satisfying the strict computational and memory constraints of the STM32L496ZG MCU. In addition, we demonstrate a tradeoff between model accuracy and computational requirements: a full retraining of the model is beneficial in terms of F1-score, but it also has a greater impact on memory and energy consumption. This analysis provides insights on when to schedule model personalization on the device.

II. PROBLEM DEFINITION
We address HAR as a multiclass classification problem, where the input s ∈ R^{n×z} is a segment of z samples from n time series acquired by inertial sensors, and the output y is a label corresponding to a human activity. We assume that a general training set TR of data from different users is provided to train a classifier C, which associates each s with a label ŷ = C(s). State-of-the-art classifiers for HAR [2] are 1-D CNNs. The general classifier C may poorly recognize activities from users not present in TR. Therefore, model personalization for a user i consists of fine-tuning the parameters θ_0 of the general classifier C using a local training set TR_i containing data from user i. In particular, our goal is to train a local classifier C_i with parameters θ_i directly on an MCU, using θ_0 as initialization and TR_i as the training set.
III. FRAMEWORK FOR HAR PERSONALIZATION
Fig. 1 illustrates our framework for personalizing HAR models on STM32 MCUs. The framework is developed in the C programming language and is composed of three modules.
1) Network: Instantiates the local classifier C_i from the architecture specifications and the initial parameters θ_0, which can be imported from a pretrained HAR classifier C or randomly set. The output of this module is the classifier itself, which can be fed into the evaluation or training modules.
2) Training: HAR personalization is performed by a few iterations of backpropagation. The goal of the backpropagation algorithm is to find parameters θ* that minimize a loss function L on the local training set TR_i (see Section III-A for details). This module implements backpropagation via gradient descent through three submodules: the Orchestrator, the Forward, and the Backward submodules. The Orchestrator governs the iterative training procedure by alternately invoking the Forward and the Backward submodules, and it allows specifying training hyperparameters, such as the number of epochs, the learning rate, and the batch size. The Forward submodule performs the forward pass of backpropagation by implementing the forward expressions reported in Table I for the most popular layers of a 1-D CNN. During the forward pass, the values of the activations in the neurons of C_i need to be stored. The Backward submodule performs the backward pass of backpropagation, implementing the backward expressions reported in Table I.
3) Evaluation: This module performs inference and, during training, computes L(θ), namely, the value of the loss function corresponding to parameters θ. We also use this module to assess the effectiveness of personalization by comparing the accuracy of C and C_i.
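The interplay between the Orchestrator, Forward, and Backward submodules can be sketched as follows. This is a minimal illustration and not the framework's actual code: a one-neuron regression model with a squared-error loss stands in for the 1-D CNN, and the names `toy_model_t`, `hparams_t`, `forward`, `backward`, and `orchestrate` are hypothetical. The gradient is summed over the samples of each batch before a single SGD update is applied.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the classifier: a = w*x + b with squared-error loss.
 * All names are illustrative; they are not the framework's API. */
typedef struct {
    float w, b;      /* parameters theta */
    float gw, gb;    /* accumulated gradients dL/dw and dL/db */
    float x, a;      /* input and activation stored by the forward pass */
} toy_model_t;

typedef struct {
    int epochs;
    size_t batch_size;
    float lr;
} hparams_t;

/* Forward submodule: computes and stores the activation, returns the loss. */
float forward(toy_model_t *m, float x, float target)
{
    m->x = x;
    m->a = m->w * x + m->b;
    float err = m->a - target;
    return 0.5f * err * err;
}

/* Backward submodule: adds this sample's gradient to the accumulators. */
void backward(toy_model_t *m, float target)
{
    float dL_da = m->a - target;   /* derivative of 0.5 * (a - t)^2 */
    m->gw += dL_da * m->x;
    m->gb += dL_da;
}

/* Orchestrator: alternates forward and backward passes over each batch,
 * then applies one SGD step. Returns the mean loss seen in the last epoch. */
float orchestrate(toy_model_t *m, const float *xs, const float *ts,
                  size_t n, hparams_t hp)
{
    assert(hp.batch_size > 0 && hp.epochs > 0 && n > 0);
    float epoch_loss = 0.0f;
    for (int e = 0; e < hp.epochs; ++e) {
        epoch_loss = 0.0f;
        for (size_t start = 0; start < n; start += hp.batch_size) {
            size_t end = start + hp.batch_size < n ? start + hp.batch_size : n;
            m->gw = 0.0f;
            m->gb = 0.0f;
            for (size_t i = start; i < end; ++i) {
                epoch_loss += forward(m, xs[i], ts[i]);
                backward(m, ts[i]);
            }
            float scale = hp.lr / (float)(end - start);
            m->w -= scale * m->gw;   /* parameter update */
            m->b -= scale * m->gb;
        }
        epoch_loss /= (float)n;
    }
    return epoch_loss;
}
```

In this sketch the hyperparameters exposed by the Orchestrator (epochs, learning rate, batch size) are the only knobs; replacing the toy `forward` and `backward` with layer-by-layer passes would leave the control flow unchanged.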

A. Gradients Computation
We illustrate the implementation of backpropagation in the training module. As in the TensorFlow API, we decompose the backpropagation step (a forward pass followed by a backward pass) into a sequence of operations performed layer by layer and combined using the chain rule of derivatives. Each layer is associated with a layer function f and parameters θ = [w, b]. Starting from the layer's input x and the parameters θ, the function f computes the value of the output activation a, which becomes the input x of the subsequent layer.
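During the backward pass, starting from the last layer, we compute the gradient of the loss L with respect to the network parameters w, b, and the so-called downstream error (∂L/∂x) as

\[
\frac{\partial L}{\partial w} = \sum_{i=1}^{M} \frac{\partial L}{\partial a_i}\frac{\partial f_i}{\partial w}, \qquad
\frac{\partial L}{\partial b} = \sum_{i=1}^{M} \frac{\partial L}{\partial a_i}\frac{\partial f_i}{\partial b}, \qquad
\frac{\partial L}{\partial x} = \sum_{i=1}^{M} \frac{\partial L}{\partial a_i}\frac{\partial f_i}{\partial x} \tag{1}
\]

where f_i and a_i are, respectively, the ith component of the layer function f and of the unrolled a tensor, and M is the number of units of the layer. The downstream error is then passed back to the previous layer as (∂L/∂a). Table I reports the explicit expressions for the forward and backward pass of the most popular layers of a CNN, namely, Conv1D, Dense, AvgPool1D, GlobalAvgPool1D, and Flatten. By combining those expressions, it is possible to derive the overall forward pass and to compute the loss L and its derivative.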
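For a Dense layer without activation, a = Wx + b, the three gradients described above take a particularly simple form. The sketch below illustrates them in C under assumed conventions: W is stored row-major with M rows and N columns, and the function names `dense_forward` and `dense_backward` are hypothetical, not taken from the framework.

```c
#include <assert.h>
#include <stddef.h>

/* Dense layer a = W x + b, with W stored row-major as M x N.
 * Function names and memory layout are assumptions made for illustration. */

/* Forward pass: a[i] = b[i] + sum_n W[i][n] * x[n]. */
void dense_forward(const float *W, const float *b, const float *x,
                   float *a, size_t M, size_t N)
{
    assert(W && b && x && a);
    for (size_t i = 0; i < M; ++i) {
        float acc = b[i];
        for (size_t n = 0; n < N; ++n)
            acc += W[i * N + n] * x[n];
        a[i] = acc;
    }
}

/* Backward pass. dL_da is the error received from the next layer.
 * dL_dW and dL_db are accumulated (+=) so a batch can be summed in place;
 * dL_dx is overwritten and becomes dL_da of the previous layer. */
void dense_backward(const float *W, const float *x, const float *dL_da,
                    float *dL_dW, float *dL_db, float *dL_dx,
                    size_t M, size_t N)
{
    assert(W && x && dL_da && dL_dW && dL_db && dL_dx);
    for (size_t n = 0; n < N; ++n)
        dL_dx[n] = 0.0f;
    for (size_t i = 0; i < M; ++i) {
        dL_db[i] += dL_da[i];                      /* df_i/db_i = 1 */
        for (size_t n = 0; n < N; ++n) {
            dL_dW[i * N + n] += dL_da[i] * x[n];   /* df_i/dW_in = x_n */
            dL_dx[n] += dL_da[i] * W[i * N + n];   /* df_i/dx_n = W_in */
        }
    }
}
```

Accumulating into `dL_dW` and `dL_db` rather than overwriting them is a design choice of this sketch: the caller zeroes the buffers once per batch and the per-sample gradients add up in place.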
As an example, we illustrate the computation of a Conv1D layer with C channels, N input units, F filters, and kernel size K. The (j, m)th element of the output a is defined as

\[
a_{j,m} = f_{j,m}(x; w, b) = \sum_{c=1}^{C}\sum_{k=1}^{K} w_{j,c,k}\, x_{c,\,m+k-1} + b_j \tag{2}
\]

where j ∈ {1, . . ., F}, m ∈ {1, . . ., M}, and

\[
M = N - K + 1.
\]

Note that in (2) we treat convolution as cross-correlation, as in TensorFlow, a customary practice when filters are being learned. According to the chain rule, we have

\[
\frac{\partial L}{\partial x_{c,n}} = \sum_{j=1}^{F}\sum_{m=1}^{M} \frac{\partial L}{\partial a_{j,m}}\,\frac{\partial a_{j,m}}{\partial x_{c,n}}. \tag{3}
\]

In particular, since (∂L/∂a) is passed from the next layer during backpropagation, we just have to compute the local gradients of the layer function with respect to the input. From (2), ∂a_{j,m}/∂x_{c,n} equals w_{j,c,k} when n = m + k − 1 and is zero otherwise, hence

\[
\frac{\partial L}{\partial x_{c,n}} = \sum_{j=1}^{F}\sum_{k=1}^{K} \frac{\partial L}{\partial a_{j,\,n-k+1}}\, w_{j,c,k}. \tag{4}
\]

In (4), the activation index n − k + 1 ranges from 2 − K to N, for a total of N + K − 1 terms for each output channel j, whereas (∂L/∂a) has size F × (N − K + 1). If K > 1, we apply zero padding to (∂L/∂a) by adding F · (K − 1) zeros to both sides of (∂L/∂a) along its second dimension. The index k in (4) has opposite signs in the two terms of the convolution (−k in (∂L/∂a) and +k in w), thus we obtain a flipped kernel. The final result is expressed as

\[
\frac{\partial L}{\partial x} = \mathrm{conv}\!\left(\mathrm{pad}\!\left(\frac{\partial L}{\partial a}\right),\ \mathrm{flip}(w)\right) \tag{5}
\]

where conv still represents cross-correlation. Finally, to compute the gradient of a batch of input segments, we resort to gradient accumulation. This keeps the memory footprint fixed during training, regardless of the batch size.
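The Conv1D forward pass and the downstream error described above can be sketched in C as follows. The sketch uses zero-based indices and assumed row-major layouts; the function names `conv1d_forward` and `conv1d_backward_input` are hypothetical. Instead of materializing a padded copy of (∂L/∂a), the backward routine skips out-of-range terms, which is equivalent to zero padding with K − 1 zeros per side.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed row-major layouts (illustrative, zero-based indices):
 *   x[C][N], w[F][C][K], b[F], a[F][M] with M = N - K + 1. */

/* Forward pass as cross-correlation:
 *   a[j][m] = b[j] + sum_{c,k} w[j][c][k] * x[c][m + k]. */
void conv1d_forward(const float *x, const float *w, const float *b, float *a,
                    size_t C, size_t N, size_t F, size_t K)
{
    assert(K >= 1 && N >= K);
    size_t M = N - K + 1;
    for (size_t j = 0; j < F; ++j)
        for (size_t m = 0; m < M; ++m) {
            float acc = b[j];
            for (size_t c = 0; c < C; ++c)
                for (size_t k = 0; k < K; ++k)
                    acc += w[(j * C + c) * K + k] * x[c * N + m + k];
            a[j * M + m] = acc;
        }
}

/* Downstream error:
 *   dL_dx[c][n] = sum_{j,k} dL_da[j][n - k] * w[j][c][k].
 * Terms whose index n - k falls outside [0, M) are skipped, which is
 * equivalent to padding dL_da with K - 1 zeros on each side. Reading w
 * with +k while dL_da is read with -k is what flips the kernel. */
void conv1d_backward_input(const float *dL_da, const float *w, float *dL_dx,
                           size_t C, size_t N, size_t F, size_t K)
{
    assert(K >= 1 && N >= K);
    size_t M = N - K + 1;
    for (size_t c = 0; c < C; ++c)
        for (size_t n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (size_t j = 0; j < F; ++j)
                for (size_t k = 0; k < K; ++k) {
                    if (n < k || n - k >= M)
                        continue;   /* zero-padded region */
                    acc += dL_da[j * M + (n - k)] * w[(j * C + c) * K + k];
                }
            dL_dx[c * N + n] = acc;
        }
}
```

Skipping the padded terms avoids allocating an F × (N + K − 1) buffer, which matters on a memory-constrained MCU; whether the framework itself pads explicitly or implicitly is not stated in the text.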

B. Estimating Resources
Our framework is equipped with a tool that estimates: 1) the memory footprint and 2) the CPU load required to personalize a HAR classifier on an STM32 MCU. This tool is valuable during prototyping to properly size embedded ML applications to the MCU capabilities.
Our tool computes the memory footprint as the amount of memory required by model personalization. Model training requires storing training samples, network parameters, activations, gradients, and errors computed at each layer during backpropagation. In particular, all the activations and the downstream errors (∂L/∂a) of the layers that we want to train must be stored in memory to be used during the backward pass to compute gradients. Our tool can estimate a priori the memory usage of training by multiplying the total number of saved variables by their bit precision. The tool also estimates the CPU load by counting the total number of operations performed by the backpropagation algorithm in the forward and backward passes, as well as in the parameter update phase. The number and the type of operations performed depend on the type of layers and are derived from an analysis of the expressions in Table I.
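The counting logic described above can be approximated with a few lines of C. This is a rough sketch under stated assumptions, not the authors' tool: it counts parameters for every layer, and gradients, activations, and errors only for the layers being trained; it ignores transient buffers; and it counts only forward multiply-accumulate operations. The type `layer_info_t` and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Per-layer bookkeeping (hypothetical structure, for illustration only). */
typedef struct {
    size_t n_params;       /* weights + biases */
    size_t n_activations;  /* output activations of the layer */
    int trainable;         /* nonzero if the layer is retrained */
} layer_info_t;

/* Memory estimate: number of stored values times their size in bytes.
 * Assumption: frozen layers contribute only their parameters. */
size_t estimate_training_bytes(const layer_info_t *layers, size_t n_layers,
                               size_t n_input_values, size_t bytes_per_value)
{
    assert(layers != NULL && bytes_per_value > 0);
    size_t values = n_input_values;                   /* one input segment */
    for (size_t i = 0; i < n_layers; ++i) {
        values += layers[i].n_params;                 /* weights and biases */
        if (layers[i].trainable) {
            values += layers[i].n_params;             /* gradients */
            values += 2 * layers[i].n_activations;    /* activations + errors */
        }
    }
    return values * bytes_per_value;
}

/* Multiply-accumulate count of a Conv1D forward pass (no padding, stride 1). */
size_t conv1d_forward_macs(size_t C, size_t N, size_t F, size_t K)
{
    assert(K >= 1 && N >= K);
    return F * (N - K + 1) * C * K;
}

/* Multiply-accumulate count of a Dense forward pass with N inputs, M units. */
size_t dense_forward_macs(size_t N, size_t M)
{
    return N * M;
}
```

With 32-bit floats, `bytes_per_value` would be 4; a quantized model would use a smaller value, which is how bit precision enters the estimate.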

IV. EXPERIMENTS AND RESULTS
Our experiments are meant to assess the benefits of model personalization in HAR. In particular, in the considered settings we show that: 1) training on user-specific data improves the accuracy of a pretrained model and 2) personalization of a pretrained model outperforms a classifier trained only on the target user. Finally, we show that enabling the full retraining of the classifier on the MCU is beneficial with respect to transfer learning (TL) in terms of accuracy, although it requires more computational resources.
Our target MCU is the STM32L496ZG, an ultralow-power MCU produced by STMicroelectronics with an ARM Cortex-M4 core and 320 KB of SRAM. We flash the firmware using the NUCLEO-L496ZG development board, which is equipped with this MCU.

A. Datasets and Architecture
The general dataset TR we consider is the wireless data mining (WISDM) dataset [10], which is used to pretrain a general classifier C. The local training sets TR_i come from the ST dataset, which was collected in STM facilities using a SensorTile.box [11]. Their characteristics are: 1) WISDM Dataset: 36 users, six activities (walking, jogging, ascending stairs, descending stairs, sitting, and standing), and sampling frequency 20 Hz; 2) ST Dataset: three users, three activities (walking, ascending stairs, and descending stairs), and sampling frequency 27 Hz. We notice that the activities in the ST dataset are a subset of those in WISDM. Both datasets were collected using tri-axial accelerometers, but with different sensors, thus the ST dataset has been resampled to 20 Hz. We underline that, after the resampling, the two datasets have approximately the same number of samples for each user (around 30k). The classifier takes as input from 1 to 5 s of recording, which corresponds to the following input sizes: (20, 3), (40, 3), (60, 3), (80, 3), and (100, 3), where 3 is the number of axes of the accelerometers.
We adopt a 1-D CNN made of four blocks: 1) Conv1D with F = 32 filters and kernel size K = 3 with ReLU and AvgPool1D; 2) Conv1D with F = 64 filters and kernel size K = 3 with ReLU, AvgPool1D, and GlobalAvgPool1D; 3) Dense with M = 50 units and ReLU; and 4) Dense with M = 6 units and Softmax. The total number of parameters θ to train is around 10 000. We select the SGD optimizer with a learning rate of 0.01 and a batch size of 32. Since we face a multiclass classification task, we adopt the categorical cross-entropy as the loss function.
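The parameter count of this architecture can be checked with a short calculation. The sketch below assumes three input channels (one per accelerometer axis) and that GlobalAvgPool1D yields one value per filter, so the first Dense layer has 64 inputs; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Conv1D: one K-tap kernel per (filter, input channel) pair, plus F biases. */
size_t conv1d_params(size_t C, size_t F, size_t K)
{
    assert(C > 0 && F > 0 && K > 0);
    return F * C * K + F;
}

/* Dense: N x M weights plus M biases. */
size_t dense_params(size_t N, size_t M)
{
    assert(N > 0 && M > 0);
    return N * M + M;
}

/* Trainable parameters of the four-block 1-D CNN described above. */
size_t har_cnn_params(void)
{
    size_t total = 0;
    total += conv1d_params(3, 32, 3);    /* block 1: 320 */
    total += conv1d_params(32, 64, 3);   /* block 2: 6208 */
    total += dense_params(64, 50);       /* block 3: 3250 */
    total += dense_params(50, 6);        /* block 4: 306 */
    return total;                        /* 10084 */
}
```

The total of 10 084 is consistent with the figure of around 10 000 parameters, and it does not depend on the input length because the pooling layers have no parameters and GlobalAvgPool1D fixes the size of the Dense input.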

B. Model Personalization
First, we show that, in the considered settings, model personalization improves the performance of a pretrained model. The first experiment is entirely conducted on the WISDM dataset using a leave-one-subject-out (LOSO) approach. For each user i = 1, . . ., 36, we define a training set (TR_i) and a test set (TS_i) and pretrain a general classifier C on the other 35 users. We personalize each local classifier C_i by retraining all the layers of C (Full personalization) using TR_i. As a comparison, we consider TL, which can be pursued by standard TinyML frameworks [5] and retrains only the last two dense layers of C. For each user i, we assess C_i on TS_i, and we show in Table II the F1-score averaged over all the users. We also report the performance of the general classifier C, denoted as No Pers. Both personalization approaches improve the performance of C, and Full personalization always reaches the highest F1-score, for all the input sizes. This confirms that enabling the retraining of all the network layers is highly beneficial in the HAR scenario.
We also assess the benefits of model personalization on each user of the ST dataset for both TL and Full personalization. As is customary for improving convergence in fine-tuning, we first retrain the last dense layer for two epochs while freezing all the others. We then retrain all the network's layers (Full personalization).

C. Computational Resources' Evaluation
Here, we assess the computational resources required to perform model personalization on the MCU. First, we use our tool to estimate the memory footprint of both TL and Full personalization. As reported in Table III, the input size has a great impact on the memory footprint: the smallest input size (20, 3) requires about half of the memory required by the largest size (100, 3). However, all the tested cases are within the memory limitations of our selected device, since they use less than 320 KB.
Table III also reports the time required to process a batch of 32 segments for each input size, for both Full personalization and TL. Also in this case, we observe that increasing the input size results in a longer execution time. In particular, (100, 3) requires more than 5 times the time required by an input size of (20, 3). However, as shown in Table II, an input size of (20, 3) is enough to reach a very high accuracy.
Finally, we use the X-NUCLEO-LPM01A power shield to measure the average power required to process a batch of 32 segments. The power shield is attached to the development board to measure the current absorbed during the execution of the training. Since the supply voltage is 3.3 V, we can easily derive the power consumed. We note that the power is the same for both TL and Full personalization for any input size. However, the energy (power × time) required to process a single batch is higher for Full personalization, since it scales linearly with the time. This analysis is valuable because it gives an estimate of the computational resources required during training, which makes it possible to schedule model personalization depending on the power availability. For example, Full personalization could be performed only when the device battery is recharging, while TL could be run when the device relies on its own battery. Finally, our framework can be used to adjust the input size and select the number of training epochs for the chosen MCU, to reach the best tradeoff between the accuracy of the model and the usage of resources.
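The energy computation described above is a direct product of supply voltage, measured current, and processing time. The helper below is a minimal sketch with a hypothetical name and units chosen for convenience (mA and seconds, giving millijoules); the values used to exercise it would be placeholders, not measurements from this work.

```c
#include <assert.h>

/* Energy for one batch, E = V * I * t.
 * With V in volts, I in milliamperes, and t in seconds, E is in millijoules. */
double batch_energy_mj(double supply_v, double current_ma, double time_s)
{
    assert(supply_v >= 0.0 && current_ma >= 0.0 && time_s >= 0.0);
    return supply_v * current_ma * time_s;
}
```

Because the measured power is the same for TL and Full personalization, the ratio between their energies per batch equals the ratio between their processing times.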

V. CONCLUSION
We present a HAR solution to fine-tune and personalize a deep learning model directly on an STM32 MCU using locally acquired data. In particular, we develop a framework to retrain 1-D CNNs while satisfying the strict computational and memory constraints of the STM32L496ZG MCU. Our experiments show that the full personalization of the CNN achieves a better accuracy than TL, which is what existing frameworks allow, although it requires more energy.
Future work concerns extending our framework to support more layers and different optimization strategies. We will also adapt the framework to be used on microprocessors like those of the MP1 series from STMicroelectronics, which can be equipped with a small GPU.
© 2024 The Authors. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

TABLE I
FORWARD AND BACKWARD EXPRESSIONS FOR LAYERS IMPLEMENTED IN THE TRAINING MODULE. N AND M ARE THE WIDTH OF THE PREVIOUS AND THE CURRENT LAYER, RESPECTIVELY, F IS THE NUMBER OF FILTERS OF THE CURRENT LAYER, K IS THE KERNEL SIZE, C IS THE NUMBER OF INPUT CHANNELS, AND p IS THE STRIDE OF THE AVGPOOL1D LAYER. THE BACKWARD PASSES OF WEIGHTS AND BIASES ARE COMPUTED ONLY FOR Dense AND Conv1D, SINCE THE OTHER CONSIDERED LAYERS DO NOT TRAIN THESE PARAMETERS.

Fig. 1. Proposed framework for HAR personalization.
Alternatively, we train only the last two dense layers (TL). Moreover, we train user-specific classifiers from the data of each user (denoted as No Pretrain), starting from a random initialization. Table II reports the F1-scores averaged over the three users of the ST dataset. We note that, while TL achieves lower or comparable performance with respect to the No Pretrain classifier, Full personalization always achieves the highest F1-score, independently of the chosen input size. This confirms that enabling the retraining of all the network's layers is highly beneficial even when personalization is performed on data from a different dataset, which is common in HAR scenarios.