Wearable on-device deep learning system for hand gesture recognition based on FPGA accelerator



Introduction
Gesture recognition has attracted much attention across different fields due to its applications in healthcare, rehabilitation, assistance for people who are deaf or speech-impaired, etc. [1]. During a field trip to a Chinese provincial medical institute, we learned that doctors normally find it hard to measure the effectiveness of rehabilitation follow-up treatment after patients leave the hospital. We were therefore commissioned by this institute to design an offline wearable device for real-time gesture recognition to help with remote diagnosis. In the field of hand gesture recognition, we laid the groundwork for the design of an edge device that is able to perform on-device AI computing. In other words, with our efficient hardware, robust software, and improved AI algorithms, we are able to process data and complete the recognition process locally, rather than in the cloud or on a PC.
In the field of gesture recognition, image processing has been a popular and well-established approach [2,3], and the use of artificial intelligence (AI) has dramatically improved the accuracy and robustness of image-based gesture recognition [4,5]. However, it usually requires a powerful processor, which severely limits the portability of the device. Moreover, image processing can be disturbed by multiple factors, including lighting and skin color, which significantly raises training costs and may raise ethical issues. Furthermore, image processing cannot efficiently determine the speed of a gesture, which makes it hard to apply in medical rehabilitation scenarios such as ours.
In terms of sensors for gesture recognition, the surface electromyography (sEMG) sensor [6], the highly flexible fabric strain sensor [7], and many other sensors are limited by their higher cost and power consumption. We therefore decided to use inertial measurement unit (IMU) sensors. With low cost, low power consumption, and high performance, the IMU is an excellent solution for recording the acceleration and angular velocity of gestures.
To date, in existing gesture recognition systems to our knowledge, gesture data are collected and then sent to a remote processing system with higher computing power, which significantly increases the costs.
In this paper, a prototype of a low-cost wearable computing system for gesture recognition is proposed, which uses IMU sensors to collect data and implements the gesture recognition functions locally, without relaying anything to other devices with higher computing power. As far as we know, it is probably the most effective system to date that simultaneously achieves low power, low latency, and high accuracy in the field of gesture recognition.
The contributions of this study are as follows: (1) A highly integrated wearable edge device that recognizes hand gestures locally, rather than uploading data to the cloud as in previous studies; (2) A low-power field-programmable gate array (FPGA) accelerator deployed via a serial-parallel hybrid method to reduce resource consumption and improve computational efficiency; (3) A software-hardware co-design based on the Cortex-M0 IP core that improves the system's generalization ability and can serve as a reference for other scenarios; (4) A new activation function for the NN+CNN structure, which improves the recognition accuracy; (5) Open-source data for experiments that will be continuously updated.
The rest of the paper is organized as follows. Section 2 introduces the existing gesture recognition methods and AI implementations in wearable designs. Section 3 explains in detail the architecture of this software-hardware co-design system and the proposed algorithms for the accelerator. Section 4 presents the main part of our work, including the de-noising process, the serial-parallel hybrid method, and the resource-efficient scheduling of calculations in the FPGA accelerator. Section 5 presents the datasets, the experimental results, and the comparisons of resource consumption and performance between different systems-on-chip, including MCUs, SBCs, and desktop processors. Finally, Section 6 concludes our work.

Related work
Some common gesture recognition methods and AI implementations in wearable devices are reviewed in the following two subsections.

Gesture recognition methods
The main methods for gesture recognition are image (video) processing and sensor data processing. Visual interest point [8][9][10], a classic and effective means of gesture recognition, is based on the color, shape, or motion. Neural network algorithms, including three-dimensional (3-D) CNNs, hybrid deep learning, and long short-term memory (LSTM), are widely used in image processing for gesture recognition [11][12][13]. These approaches effectively reduce the effects of various illuminations, views, and self-structural characteristics. Depth maps have been used to enhance the effect of a large number of small articulations in the hand [14,15]. The IMU or similar sensors are widely used, for instance, in studies where gloves have been used to measure acceleration, finger knuckle angle, and spatial position of hands to recognize gestures [6,16]. In such applications, sensors, including IMU, resistance strain gauge, image sensors, and acceleration sensors are distributed on the back of the hand, forefinger, and the middle finger. Typically, IMU or acceleration sensors are integrated into the peripherals of many wearable devices [17][18][19]. Previous studies have reported that the values collected by the acceleration sensor are combined with the gestures dictionary, and then feature extraction is used to convert the problem to the classification problem, thus realizing the gesture recognition [20,21]. However, all the proposed methods above used personal computers (PCs) or more powerful devices to process data, which seems redundant and inconvenient for mobile uses such as follow-up treatment for rehabilitation. In contrast, our device can achieve data collection, data processing, and gesture recognition totally offline. Meanwhile, it's worth noting that although RNN and LSTM generally have better performance on time series data, and have been implemented on FPGA in several studies [22][23][24], they do not contribute to our research objective. 
These networks are usually implemented with High-Level Synthesis (HLS) rather than a hardware description language, which makes resource and power consumption harder to control, reduces the system's portability, and adversely affects future Application-Specific Integrated Circuit (ASIC) deployment. In our case, the modest improvement in accuracy does not outweigh these drawbacks.

AI implementations in wearable devices
Researchers in [25] designed a CNN accelerator on a heterogeneous Cortex-M3/FPGA architecture. The accelerator consists of a convolution circuit and a pooling circuit. This CNN accelerator uses 4901 LUTs without hardware multipliers, and a throughput of 6.54 GOPS was achieved. A different study used a very low-power NN accelerator on a Microsemi IGLOO FPGA [26], with a multilayer perceptron NN architecture for classification. The shortest accelerator latency was 1.99 μs and the lowest power consumption was 34.91 mW under different coupling modes. Similar findings were drawn in [27], which proposed a recurrent neural network (RNN) model for IoT (internet of things) devices with an end-to-end authentication system based on breathing acoustics. The advances of these systems in achieving lower power and lower resource consumption are fundamental to our work. By combining a CNN with a multilayer perceptron NN and improving the activation function, we lowered the resource consumption and improved the recognition accuracy.

System design
The structure of the IMU sensor gesture recognition system designed in this paper is shown in Figure 1. The system consists of an FPGA that implements the Cortex-M0 IP core and a glove with IMU sensors driven by the Cortex-M0. The data are first sent to the acceleration module through different coupling modes, and the data processing is then accelerated by the Cortex-M0. We made two sets of devices, one for the left hand and one for the right hand. The actual right-handed glove prototype with six IMU sensors is shown in Figure 2. Five sensors are placed on the second joint of each finger (thumb, index, middle, ring, and little finger), and one sensor is placed on the back of the hand. The sensors are driven by the Cortex-M0, and the data are simultaneously read out and returned to the system for gesture classification. The gestures we sought to recognize are the gestures for the numbers from one to ten, as demonstrated in Figure 3.

Accelerator architecture
A schematic diagram of the accelerator is shown in Figure 4, including the preprocessing module (PM), feature extraction module (FEM), and classification module (CM).
After the Cortex-M0 drives the IMU, each of the six sensors collects 2-D angle data, giving 12 dimensions in total. The data are sent to the PM after Kalman filtering in the IMU module. The interference caused by external factors, such as slight movements during gesture changes, is reduced by smoothing de-noising based on the wavelet transform (WT). The sliding-window and bottom-up (SWAB) algorithm is subsequently used to segment and extract the effective data. The extracted effective data are input into the FEM, where the CNN extracts the eigenvalues. The extracted eigenvalues then undergo feature rescaling by the Cortex-M0 and are input into the CM for classification by the multilayer perceptron neural network. Finally, the recognized gesture results are sent to the Cortex-M0 as the output. Compared with traditional gesture recognition methods or neural network classification on the MCU platform, the PM introduced in our approach effectively reduces both the noise in the downstream processing and the amount of data processed. Additionally, once trained, the system does not need expert supervision to classify the input data properly, as the CNN is introduced to extract hidden features from the data collected by the IMU sensors [20].

Proposed algorithm
As previously mentioned, effective gesture classification requires extracting various gesture features from the original data. Instead of extracting features by hand, which depends heavily on expert supervision in the training process [28], we adopted a CNN in the system to complete the process automatically. The original signal is preprocessed to enhance feature extraction, since the quality of classification is directly affected by the extracted features.
The PM comprises two stages, namely wavelet de-noising and SWAB (the hardware implementation is explained later in section 4.1). During data collection, fatigue or unconscious trembling of the user may introduce noise into the collected gesture data. Because noise has a large impact at the feature extraction stage and lowers the learning rate, de-noising can effectively improve the accuracy of the final model [29][30][31]. Hence, WT is used for de-noising and filtering [32]. The wavelet transform of the square-integrable signal x(t) is expressed in Eq 1, where b is the time shift, a is the scale factor, and ψ(t) is the basic wavelet. The wavelet basis function is obtained by shifting and stretching the parent wavelet, as shown in Eq 2. After passing through the WT, the signals distributed in each layer are uncorrelated, which makes de-noising possible. Once the thresholds of the signals at the various scales are determined, the thresholds can be applied quantitatively. The corresponding threshold formula is shown in Eq 3.
In Eq 3, γ is the noise power spectrum parameter, median( ) is the median of its input, j is the decomposition scale, and the remaining terms are the wavelet coefficients in the jth layer and the length of the input data. The threshold is the criterion for deciding whether to keep or zero the wavelet coefficients. Under high resolution, if the signal amplitude is lower than the threshold value, the wavelet coefficient is set to zero; otherwise, the original wavelet coefficient is retained.
To facilitate quantification, the calculation on the FPGA is executed using the fixed threshold formulas shown in Eqs 4 and 5. After de-noising, the signals are obtained by reconstruction.
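The keep-or-zero rule described above can be illustrated with a minimal Python sketch. It assumes a one-level Haar decomposition and a caller-supplied threshold purely for illustration; the wavelet, the decomposition depth, and the threshold formula of Eq 3 used in the actual system are not reproduced here, and all function names are hypothetical.

```python
import math


def haar_forward(signal):
    """One-level Haar transform (assumed wavelet); len(signal) must be even."""
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Reconstruct the signal from one level of Haar coefficients."""
    s = 1.0 / math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) * s)
        out.append((a - d) * s)
    return out


def hard_threshold(coeffs, threshold):
    """Zero every coefficient whose magnitude is below the threshold."""
    return [c if abs(c) >= threshold else 0.0 for c in coeffs]


def denoise(signal, threshold):
    """Decompose, threshold the detail coefficients, and reconstruct."""
    approx, detail = haar_forward(signal)
    return haar_inverse(approx, hard_threshold(detail, threshold))
```

With a threshold of zero the signal is reconstructed unchanged, while a larger threshold removes small sample-to-sample jitter and keeps large transitions.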
The process of data collection by the IMU glove includes the preparation stage, the movement stage, and the completion stage. The signal values are stable in the preparation stage, change in the movement stage, and stabilize again in the completion stage. Some of the data collected in the preparation and completion stages are not essential for sample identification, and such data may severely degrade the performance of the subsequent module. Therefore, the filtered data are extracted a second time to minimize unnecessary data. Here, this is achieved using the improved SWAB algorithm, which combines the sliding-window (SW) and bottom-up (B-up) algorithms [33]. First, the sum of squares of the 2-D signal values collected by the six IMU sensors is taken as the change measurement, as expressed in Eq 6. Figure 5a demonstrates the segmentation of valid data in the traditional SWAB algorithm. This algorithm yields feature data of different lengths in different signal channels. However, as the next stage requires the same data length, irrelevant and non-contextual data have to be filled into the shorter segments, which greatly degrades the recognition results. Figure 5b presents the segmentation in the improved SWAB algorithm. After up-sampling the different signal channels, a unified length of the valid feature signal is determined, and the starting and ending points are passed to the next module. There are three major advantages: (a) the feature data lengths are the same after segmentation, (b) the computation required by the filling operation is reduced, and (c) the original context of the feature data is preserved.
Signals that can serve as the basis for gesture recognition are segmented by setting appropriate thresholds and then passed through the PM. The corresponding waveform is shown in Figure 6.
Figure 6. Signal waveform: (a) raw signal, (b) after wavelet de-noising, (c) after SWAB.
Then, in the FEM, a CNN model was constructed to extract gesture features (the hardware implementation is explained in section 4.2). After the signals passed through the PM, the resulting data were used as the input of the CNN. The model was trained on these data, and the model parameters were saved after training. Finally, the collected gesture data were fed into the trained model to extract gesture features for output. A schematic diagram of the CNN structure is shown in Figure 7. This design is composed of three convolutional layers with 4, 6, and 6 convolutional kernels, respectively, similar to the numbers reported in previous studies. The size of the convolutional kernel was fixed at 3 × 3, the step size was 1, and max-pooling with a fixed window size of 2 × 2 was used. The output feature dimension of the fully connected layer was 4, and the ReLU (rectified linear unit) was selected as the activation function. The CNN parameters in each layer are shown in Table 1. The 2-D gesture angle data collected by the six sensors, with a size of 50 × 12, were input into the network. To prevent the loss of boundary data during convolution, the boundary is padded with a numerical value of 180. A total of 1200 sets of original data and 1200 sets of enhanced data from 10 participants of different genders and ages are used for training. The data are described in more detail in section 5, and the original data are publicly available on GitHub and will be updated continuously (the link is provided in SUPPLEMENT). Based on the signal waveforms, the differences between the signals of different participants correspond to differences in frequency or intensity. Numerically, the differences lie in the continuity and the magnitude of the data. Hence, methods frequently used for image dataset expansion, such as scaling, cropping, translation, and adding Gaussian noise [34], are also effective for the data in this paper.
For the FEM and the post-stage CM, the input data were normalized to improve the convergence speed of training [35]. The mean normalization formula [36] was used to map the FEM and CM inputs to the interval [−1, 1], as shown in Eq 7.
The multilayer perceptron NN framework is well suited for classification in the CM (the hardware implementation is explained in section 4.3). The input and output of the NN were set according to the CNN output, as shown in Figure 8. The four inputs are connected to the FEM output, and the final output corresponds to the recognized gesture number. Next, some parameters of the NN were optimized and different frameworks were designed, each of which was tested 20 times using different datasets. This analysis revealed that increasing the number of layers or changing the number of neurons in each layer had minimal influence on the accuracy [26]. Hence, two hidden layers with eight neurons each were selected.
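The idea of passing one shared pair of start and end points to the next module can be sketched as follows. This is a simple threshold-based stand-in rather than the SWAB algorithm itself, and it assumes that the change measurement is the sum of squared sample-to-sample differences over all channels; Eq 6 is not reproduced here, so that choice, the threshold, and the function names are illustrative assumptions.

```python
def change_measure(frames):
    """Sum of squared differences between consecutive samples, over all channels.

    frames is a list of samples; each sample is a list of channel values
    (12 channels in the glove described in the text).
    """
    return [
        sum((cur - prev) ** 2 for cur, prev in zip(frames[t], frames[t - 1]))
        for t in range(1, len(frames))
    ]


def unified_segment(frames, threshold):
    """Return one (start, end) index pair shared by every channel, or None."""
    measure = change_measure(frames)
    active = [t + 1 for t, m in enumerate(measure) if m > threshold]
    if not active:
        return None
    # Include the last stable sample before the movement begins.
    return active[0] - 1, active[-1]


def cut(frames, threshold):
    """Keep only the samples inside the shared movement window."""
    bounds = unified_segment(frames, threshold)
    if bounds is None:
        return []
    start, end = bounds
    return frames[start:end + 1]
```

Because a single window is applied to all channels, every channel keeps the same length and no filler samples are needed.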
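As a rough illustration of how the padding value and the window sizes affect the feature-map dimensions, the following pure-Python sketch applies one 3 × 3 convolution with stride 1 and constant padding of 180, followed by ReLU and 2 × 2 max-pooling, to a 50 × 12 input. The input values and the kernel are made up, and the exact layer ordering of Table 1 is not assumed.

```python
def pad_constant(grid, value):
    """Surround a 2-D grid with a one-cell border filled with `value`."""
    width = len(grid[0])
    border = [value] * (width + 2)
    return [border[:]] + [[value] + list(row) + [value] for row in grid] + [border[:]]


def conv3x3_same(grid, kernel, pad_value=180.0):
    """3x3 convolution, stride 1, constant padding, followed by ReLU."""
    padded = pad_constant(grid, pad_value)
    out = []
    for r in range(len(grid)):
        out_row = []
        for c in range(len(grid[0])):
            acc = 0.0
            for i in range(3):
                for j in range(3):
                    acc += padded[r + i][c + j] * kernel[i][j]
            out_row.append(max(acc, 0.0))  # ReLU
        out.append(out_row)
    return out


def maxpool2x2(grid):
    """Non-overlapping 2x2 max-pooling."""
    return [
        [max(grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1])
         for c in range(0, len(grid[0]) - 1, 2)]
        for r in range(0, len(grid) - 1, 2)
    ]


# Made-up 50 x 12 input and an identity kernel, for shape checking only.
sample = [[float((r + c) % 7) for c in range(12)] for r in range(50)]
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
features = conv3x3_same(sample, identity)   # stays 50 x 12
pooled = maxpool2x2(features)               # becomes 25 x 6
```

The padding keeps the convolution output at 50 × 12, and each pooling stage halves both dimensions.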
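A minimal sketch of mean normalization is given below. It assumes the common form in which the mean is subtracted and the result is divided by the range (maximum minus minimum), which keeps every value inside [−1, 1]; Eq 7 is not reproduced here, so the exact form used in the system may differ.

```python
def mean_normalize(values):
    """Map values to [-1, 1] via (x - mean) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant input carries no spread; map it to zeros.
        return [0.0 for _ in values]
    mean = sum(values) / len(values)
    return [(v - mean) / (hi - lo) for v in values]
```

For example, the angles 0, 90, and 180 map to −0.5, 0, and 0.5.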

A new activation function was introduced, and different activation functions were tested in different layers. Although the traditional ReLU activation function has sparse activation characteristics, sparse activation weakens the fitting of nonlinear data when the network has only a few layers. Moreover, the response boundary of the ReLU activation function is linear, and ReLU has a lower fitting ability than the Tanh activation function. Thus, ReTanh, a novel activation function, was designed based on the characteristics of the Tanh and ReLU activation functions, as shown in Eq 8. The function image is shown in Figure 9. The test accuracy for the different activation functions used in different layers is shown in Table 2. Note that the last output layer must use ReLU as the activation function. The results show that using the ReLU activation function in the first layer performs poorly. This is because, first, the unilateral activation is a linear function, making it hard to fit complex features with few layers. Second, the output range of the data is not bounded on both sides, resulting in uncontrollable data magnitude and non-convergence in training. Hence, we chose Tanh, ReTanh, and ReLU as the activation functions.

Accelerator implementation
The analysis of the various NN operation modes in the FEM and CM found that the data between network layers are interrelated and that the calculation in a layer depends only on the output of the previous layer. Thus, each operation unit within a layer is independent. Currently, most neural network operations are based on the central processing unit (CPU). However, the serial computing mode of the CPU cannot fully exploit the advantages of a parallel computing network. The development of deep learning has gradually expanded the network structure. For a CPU-based neural network, accelerating the operation by increasing the clock frequency alone would inevitably increase power consumption, while the acceleration effect would not be ideal. Thus, the recently proposed accelerators based on the FPGA, GPU (graphics processing unit), and ASIC [37][38][39] should be used to improve NN performance. The FPGA-based accelerator is more attractive due to its good performance, high energy efficiency, short development cycle, and reconfigurability. The accelerator design can be based on the FPGA hardware and the Cortex-M0 software algorithm introduced in section 3 to minimize resource consumption.
In the following subsections, we explain how low power, low latency, and high accuracy are achieved simultaneously on this edge device: (a) how we de-noise the data to improve recognition accuracy, (b) how we combine the serial and parallel methods to reduce resource consumption and improve calculation efficiency, and (c) how we schedule the order of calculations to make the best use of the limited resources.

PM implementation
The introduction of PM in our approach effectively reduces noise in the downstream processing and the amount of data processed, and improves the recognition efficiency of subsequent models. The core of the PM lies in the implementation of the wavelet de-noising, while the SWAB algorithm can be easily implemented as previously described in section 3. The wavelet de-noising is based on the quadrature mirror filter banks. The formulas for calculating the filter coefficient are shown in Eqs 9-12 [40,41].
To facilitate the FPGA calculation, s is set to 0.483, a[0] to 1.732, and a[1] to −0.268 in the equations. Thus, the above structure can be realized directly on the FPGA, as shown in Figure 10. The signals are divided into two parts by the filter banks in Figure 10a: one part with the high-frequency coefficients and the other with the low-frequency coefficients. The noise is reduced by changing the weight of the high-frequency coefficients.
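One possible reading of these constants, offered as an assumption because Eqs 9-12 are not reproduced here, is that they factor the four-tap Daubechies (D4) low-pass filter as s·[1, a[0], −a[0]·a[1], a[1]]. The sketch below checks this numerically against the exact D4 coefficients and builds the matching high-pass filter with the usual quadrature mirror relation; the function names and the periodic boundary handling are illustrative choices.

```python
import math

S, A0, A1 = 0.483, 1.732, -0.268  # constants quoted in the text


def lowpass_from_constants(s=S, a0=A0, a1=A1):
    """Assumed factorisation of the low-pass taps: s * [1, a0, -a0*a1, a1]."""
    return [s, s * a0, -s * a0 * a1, s * a1]


def highpass_from_lowpass(h):
    """Quadrature mirror relation: g[k] = (-1)^k * h[N-1-k]."""
    n = len(h)
    return [((-1) ** k) * h[n - 1 - k] for k in range(n)]


def exact_d4():
    """Closed-form Daubechies-4 low-pass coefficients."""
    r3, d = math.sqrt(3.0), 4.0 * math.sqrt(2.0)
    return [(1 + r3) / d, (3 + r3) / d, (3 - r3) / d, (1 - r3) / d]


def analyze(signal, h, g):
    """One decomposition level with periodic extension; len(signal) must be even."""
    n = len(signal)
    low, high = [], []
    for i in range(0, n, 2):
        low.append(sum(h[k] * signal[(i + k) % n] for k in range(len(h))))
        high.append(sum(g[k] * signal[(i + k) % n] for k in range(len(g))))
    return low, high
```

Under this reading, a constant input produces high-frequency coefficients that are close to zero, so scaling down the high-frequency branch mainly suppresses rapid fluctuations.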

FEM implementation
The main operations in the FEM are the convolutions and max-pooling that constitute the CNN. This part introduces the FPGA hardware implementation of the CNN. For the convolution operation, the overall idea is serial input with serial-parallel hybrid calculation. For max-pooling, serial input and parallel calculation are adopted. In this way, resource consumption is reduced and calculation efficiency is greatly improved.
Since line buffers are commonly used in image processing [42,43], we propose register-based line buffers to implement the convolution over a 3 × 3 window, as shown in Figure 11a. The weight value for the current convolution is read from the corresponding preset address in the read-only memory (ROM), which allows up to six convolutions to be executed simultaneously. The pooled sampling is realized in a similar way to the convolution. The sample values are obtained through three comparators operating on the locally cached serial data, as shown in Figure 11b. These two processes ensure that the data transmission, the window movement, and the performance of each module are matched. This design both saves data cache capacity and minimizes the consumption of on-chip resources. The implementation of the fully connected layer is similar to the implementation of the NN in the subsequent CM and is discussed further in the next section.
Figure 11. Block diagram of the FEM: (a) convolution operation, (b) max-pooling.
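The line-buffer idea can be mimicked in software as follows. This is only a behavioral sketch in Python, not the register-level design of Figure 11: two row-length FIFOs hold the previous two rows, so that each newly arriving sample completes one column of a 3 × 3 window. The function names are hypothetical.

```python
from collections import deque


def stream_windows(pixels, width):
    """Yield (row, col, window) for every full 3x3 window of a raster-ordered stream.

    (row, col) is the top-left corner of the window in the original image.
    """
    prev_row = deque([0] * width, maxlen=width)   # the last `width` samples
    prev_row2 = deque([0] * width, maxlen=width)  # the `width` samples before those
    window = [[0, 0, 0] for _ in range(3)]
    for index, pixel in enumerate(pixels):
        row, col = divmod(index, width)
        above = prev_row[0]    # same column, one row up
        above2 = prev_row2[0]  # same column, two rows up
        # Shift the window one column to the left and insert the new column.
        for r, value in enumerate((above2, above, pixel)):
            window[r] = [window[r][1], window[r][2], value]
        prev_row2.append(above)
        prev_row.append(pixel)
        if row >= 2 and col >= 2:
            yield row - 2, col - 2, [line[:] for line in window]


def convolve_stream(pixels, width, kernel):
    """Apply a 3x3 kernel to each window produced from the serial stream."""
    return [
        sum(window[i][j] * kernel[i][j] for i in range(3) for j in range(3))
        for _, _, window in stream_windows(pixels, width)
    ]
```

Only two rows of samples are buffered at any time, which is why this arrangement needs far less storage than caching the whole input.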

CM implementation
The CM is a typical multilayer perceptron with multiple interconnected neurons. Because the resources are quite limited, we make the best use of them by dividing and combining the order of the calculations on the eight multipliers; in short, we adopted a pipeline-multiplexing structure. This structure is consistent with the fully connected module in the FEM. Whether the Tanh or the ReTanh function is used in the first stage, the underlying Tanh function cannot be implemented directly on the FPGA. Here, we used smoothing interpolation combined with the lookup-table method to fit the activation function [44]. For a negative input to the Tanh function, its absolute value is substituted into the positive interval for calculation, and the result for the negative interval is the opposite of the result for the positive interval. This operation saves FPGA resources. The function relationship can be fitted by a polynomial of degree n−1.

The corresponding linear form of a polynomial with n terms is shown in Eq 13. The m experimental points, indexed by i = 1, 2, 3, …, m, are substituted into Eq 13 to give Eq 14, and the polynomial is then fitted by the least-squares method.
Solving this equation yields the polynomial coefficients. Here, the positive interval is divided into [0,1], (1,2], (2,3], (3,4], and (4,+∞). The corresponding final fitting functions are listed in Table 3. The corresponding absolute error was calculated using MATLAB. The relationship between the fitting function and the original function is drawn by sampling, as shown in Figure 12. The data fit best in the interval [0,3]. Moreover, the error is larger when there are more data points. Nevertheless, this does not affect the results, since the FPGA hardware calculation is implemented by shifting with 2 decimal places.
Table 3 (rows recoverable from the source). Absolute error of the fitting functions: (1,2]: 0.0039; (2,3]: 0.0055; (3,4]: 0.0101; (4,+∞): fitted by the constant 1, no error listed.
Figure 12. Fitting effect of these formulas.
In this paper, eight multipliers are used in the hidden layer, and the processing schedule is demonstrated in Figure 13. The weights and biases are loaded from external storage into the corresponding neurons for calculation. The data from each module of the previous level are input serially, while the entire neural network still has a pipeline structure. For the eight multipliers in the hidden layer, the operations on the four inputs of each neuron can be conducted simultaneously, except for the three multipliers needed by the activation functions in the first two layers. To complete the calculations of a whole neuron, multiple operation periods are needed in the hidden and output layers. A cache is therefore used to store data between layers. It is also worth noting that the whole operation in the first hidden layer is completed after eight cycles. For the second layer, it takes eight cycles to complete the operation in each neuron. Thereafter, all the multipliers are used to calculate the activation functions, which is completed in three cycles. The whole process takes 21 cycles.
Figure 13. Scheduling of the NN processing.
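The fitting procedure can be sketched in Python as follows. This is an illustration under stated assumptions, not the coefficients of Table 3: a degree-2 polynomial is fitted by least squares on each of the four unit intervals using 51 evenly spaced samples, the output saturates at 1 above 4, and negative inputs reuse the positive table through odd symmetry. The degree, the sample count, and the function names are all illustrative choices.

```python
import math


def fit_polynomial(xs, ys, degree):
    """Least-squares fit by solving the normal equations (X^T X) a = X^T y."""
    n = degree + 1
    ata = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    aty = [sum((x ** i) * y for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(ata[r][col]))
        ata[col], ata[pivot] = ata[pivot], ata[col]
        aty[col], aty[pivot] = aty[pivot], aty[col]
        for r in range(col + 1, n):
            factor = ata[r][col] / ata[col][col]
            for c in range(col, n):
                ata[r][c] -= factor * ata[col][c]
            aty[r] -= factor * aty[col]
    # Back substitution; coeffs[p] multiplies x**p.
    coeffs = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(ata[r][c] * coeffs[c] for c in range(r + 1, n))
        coeffs[r] = (aty[r] - tail) / ata[r][r]
    return coeffs


def build_table(degree=2, samples=50):
    """One coefficient list per interval [0,1], (1,2], (2,3], (3,4]."""
    table = []
    for lo in range(4):
        xs = [lo + k / samples for k in range(samples + 1)]
        table.append(fit_polynomial(xs, [math.tanh(x) for x in xs], degree))
    return table


TABLE = build_table()


def tanh_approx(x):
    """Piecewise-polynomial tanh using odd symmetry and saturation above 4."""
    ax = abs(x)
    if ax > 4.0:
        y = 1.0
    else:
        index = min(max(math.ceil(ax) - 1, 0), 3)
        y = sum(c * ax ** p for p, c in enumerate(TABLE[index]))
    return -y if x < 0 else y
```

Only the positive half of the table has to be stored, which mirrors the resource saving described above for the negative interval.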

Experimental results and performance analysis
In this section, experiments and discussions are presented as evidence supporting the advantages of the proposed system. In the following subsections, we first describe the datasets used for experimentation and then compare our algorithm with earlier algorithms that can also be implemented on MCU-level platforms [45]. Finally, we evaluate resource consumption, time performance, and power consumption on different hardware development platforms.

System datasets
The algorithm proposed for the accelerator was first implemented and tested in Python to validate the concept with our database [46]. This dataset contains: (a) 1200 sets of original anonymous data of 10 different gestures, contributed by 10 volunteers, and (b) 1200 sets of enhanced data obtained by translating, adding noise to, and scaling the relative position and value of the original data.
For the original data, each data group consists of 12 dimensions collected by 6 sensors (2 dimensions for each sensor). We labeled the data by strictly following the file-naming rule a-b-c-d.txt, where a represents the position of the sensor node, b represents the name of the gesture, c represents the dimension of the sensor data, and d represents the collection count. The age and gender distribution of the volunteers and their finger-moving conditions (including moving stability and whether one is left- or right-handed) are shown in Tables 4 and 5.
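A file name that follows this rule can be split into its four fields with a few lines of Python. The sketch below keeps every field as a string, because the value formats used in the dataset are not specified here; the example file name in the usage note is hypothetical.

```python
from typing import NamedTuple


class SampleName(NamedTuple):
    sensor_position: str  # field a
    gesture: str          # field b
    dimension: str        # field c
    collection: str       # field d


def parse_sample_name(filename):
    """Split an a-b-c-d.txt file name into its four fields."""
    if not filename.endswith(".txt"):
        raise ValueError("expected a .txt file: " + filename)
    parts = filename[:-len(".txt")].split("-")
    if len(parts) != 4:
        raise ValueError("expected four '-'-separated fields: " + filename)
    return SampleName(*parts)
```

For a hypothetical file named thumb-five-1-3.txt, the parser returns the sensor position "thumb", the gesture "five", the dimension "1", and the collection count "3".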

Comparison of classification algorithms
In this paper, KNN and SVM are selected as the classification algorithms to compare with our CNN+NN model, as they conform to our objective of deploying an offline identification algorithm on an MCU-level development platform and then on an ASIC. In past studies, these two supervised multi-class classifiers have usually been designed in a hardware description language and then implemented on ASICs [47][48][49].
We divided the raw data into two groups, training data and testing data, at a ratio of 2:1 for the three algorithms. We then pinched, cropped, and added stochastic noise to the raw data to obtain 1200 sets of enhanced data, which were again split into training and testing data at a ratio of 2:1. Both the raw data and the enhanced data were tested for recognition accuracy.
The final results are shown in Table 6. On the raw data, the accuracies of the three algorithms are close. However, when the data are enhanced in various ways, CNN+NN achieves a clearly higher accuracy of 95.12%, which demonstrates its strong robustness, generalization ability, and suitability for real application scenarios.
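A 2:1 split of this kind can be sketched as below. The text does not state whether the samples were shuffled or stratified by gesture, so the seeded shuffle here is an assumption made only to keep the example reproducible.

```python
import random


def split_two_to_one(samples, seed=0):
    """Shuffle with a fixed seed and split into training and testing sets at 2:1."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = (2 * len(shuffled)) // 3
    return shuffled[:cut], shuffled[cut:]


# 1200 sets give 800 training sets and 400 testing sets.
train, test = split_two_to_one(range(1200))
```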

Evaluation of hardware implementation
The performance of a hardware accelerator for the MCU can be evaluated in three categories: overall hardware utilization, time performance, and power consumption.
Here, we chose Intel's DE10-Lite development board as the hardware development platform, with the 10M50DAF484C7G as the main chip. For low power consumption, this chip uses a 55 nm CMOS process and provides 49760 LUTs, 1638 Kbit of M9K memory, 144 18 × 18 multipliers, and 4 PLLs. Table 7 shows the resource consumption of this design. It should be noted that the 144 18 × 18 multipliers in the MAX10 are recognized as 288 9 × 9 multipliers in the Quartus software. We chose a desktop processor, two mobile application processors, and an STM32 MCU as the evaluation objects, and ran a similar algorithm on the Intel Core i7-9750H, Rockchip RK3399 Pro, Raspberry Pi 3B+, and STM32F407. Multiple groups of detection data were input into the FM to measure the operation time, and the average values were recorded. The data for each platform are listed in Table 8. According to Table 8, the proposed system performs significantly better than other MCUs in this specific scenario, as the M4 and M3 cannot even complete the calculations. Moreover, as an MCU, the proposed system outperforms some of the SBCs. For example, its performance is more than twice that of the Cortex-A53 (which is usually used in the Raspberry Pi) and is close to that of the high-performance Cortex-A72.
Despite this performance, the results in Table 7 show that the resource consumption of the proposed system is only on par with that of the M3. This accomplishment is of great importance for the design of ASICs and offline high-performance computing.
The coupling modes of the accelerator are shown in Figure 14. Direct coupling to the AHB bus was compared with coupling through a peripheral interface. Because the Cortex-M0 was ported to the FPGA as an IP core, only the SPI (Serial Peripheral Interface) and UART (Universal Asynchronous Receiver Transmitter) peripheral interfaces are discussed. The test comparison revealed that the delay is lowest when the accelerator is directly coupled to the Cortex-M0, which is determined by the coupling mode. For the SPI and UART, however, the main delay occurs during data transmission in the respective communication protocol. On the other hand, the accelerator becomes more universal when it is attached through a peripheral interface.
Low power consumption and low resource consumption are critical factors in designing an accelerator for the MCU. Since the neural network system architecture is tailored to an application-specific scenario, it is difficult to find the same architecture for comparison. Hence, the power consumption of the CNN and NN, the core parts of the accelerator, is compared with that of the other parts (see Table 9). To achieve low power consumption, we adopted methods that minimize power usage in the design, including the common gated clock and Gray code encoding for the weight address bits. Based on these data, the individual operating frequency of the acceleration module can reach 96 MHz after timing analysis and constraint. This suggests that a shorter delay can be obtained when the accelerator is combined with a buffer through peripheral coupling. An ultra-low-power Microsemi IGLOO FPGA with only 0.21 mW of static power consumption has been used previously [26]. However, the static power consumption of the MAX10 used here exceeds 90 mW. Due to this restriction, the power consumption of the design was also analyzed for the Cyclone 10LP chip series using the Power Analyzer Tool of Quartus. Cyclone 10LP is Intel's low-power FPGA, with a static power consumption of only 30 mW. Our proposed design, simulated in Quartus, achieves a lower power consumption index on this chip. Therefore, this series of chips should be considered for further testing and applications.

Conclusion
In this paper, we propose a low-cost wearable edge device with a neural network accelerator based on IMU sensors for gesture recognition. Firstly, the prototype glove is designed to collect data, process data, and complete large-volume data calculations locally and offline. Secondly, with the pre-stage processing module and the serial-parallel hybrid method, the device achieves low power and low latency. As an MCU-class device with MCU-level consumption, its performance is eight times that of the existing high-performance MCU, and it outperforms some SBCs. Thirdly, the whole system is a software-hardware co-design that is potentially transferable to other scenarios. Moreover, a new activation function was designed for the multilayer perceptron neural network to improve recognition accuracy, and the CNN extracts the features for classification automatically, without expert supervision. Finally, all the data are open-sourced and will be continuously updated for further use by other researchers.
Our work promotes embedded systems and accessible edge computing models in the field of hand rehabilitation. However, we recognize that our framework has four core limitations. The first is related to the number and type of hand gestures we tested. Since our intention was to make a technical prototype for locally recognizing medical and healing exercises for hands, we used only ten hand gestures for numbers instead of the actual movements that hospitals currently use. We will continue to work with the medical institute on future development. The second is that the device is still undesirably large, and it is not comfortable enough for long-term wear or in-the-wild use. The third is that the power consumption has not reached its minimum, because we did not use the FPGA devices with the lowest power consumption. Last but not least, further power analysis and design verification with ASIC tools are needed.
Future studies may focus on the following three directions: (a) improvement in resource utilization and calculation performance, (b) ASIC implementation of the accelerator with the Cortex-M0 processor core via EDA (Electronic Design Automation) tool optimization, and (c) more specific and targeted recognition and evaluation schemes for different medical scenarios, such as sign language interpretation, finger movement rehabilitation after stroke, and wrist movement evaluation after fracture. Ultimately, we hope our work can be a good reference for improving device accessibility, usability, and versatility for different groups of users.