Deep Residual Bidir-LSTM for Human Activity Recognition Using Wearable Sensors

Human activity recognition (HAR) has become a popular research topic because of its wide application. With the development of deep learning, new ideas have appeared to address HAR problems. Here, a deep network architecture using residual bidirectional long short-term memory (LSTM) cells is proposed. The new network has two advantages. First, a bidirectional connection concatenates the positive time direction (forward state) and the negative time direction (backward state). Second, residual connections between stacked cells act as highways for gradients, passing underlying information directly to the upper layer and effectively avoiding the gradient vanishing problem. The proposed network thus shows improvements on both the temporal dimension (using bidirectional cells) and the spatial dimension (deeply stacked residual connections), aiming to enhance the recognition rate. When tested with the Opportunity data set and the public domain UCI data set, the accuracy was increased by 4.78% and 3.68%, respectively, compared with previously reported results. Finally, the confusion matrix of the public domain UCI data set was analyzed.


Introduction
In real life, many problems can be described as time series problems, and human activity recognition (HAR) is one of them.
HAR is of value in both theoretical research and actual practice. It can be used widely, including in health monitoring [1] [2], smart homes [3] [4], and human-computer interaction [5] [6]. LSTM cells are a good choice for solving HAR problems. Unlike traditional algorithms, LSTM can capture relationships in data on the temporal dimension without having to mix the time steps together as a 1D convolutional neural network (CNN) would do. As more of what is commonly called "big data" emerges, LSTM architectures can offer great performance and many potential applications.
More specifically, HAR is the process of obtaining action data with sensors; it symbolizes the action information and then allows understanding and extraction of the motion characteristics, which is what activity recognition refers to. Because of the spatial complexity and temporal divergence of behavior, there is no unified recognition method. A public domain benchmark of HAR has been introduced, and different methods of recognition have been analyzed [7].
The results showed that the K-Nearest Neighbor (KNN) algorithm outperforms other algorithms in most recognition tasks. Support Vector Machine (SVM) is another outstanding algorithm. A Multi-Class Hardware-Friendly Support Vector Machine (MC-HF-SVM), which uses fixed-point arithmetic for HAR instead of the typical floating-point arithmetic, has been proposed for sensor data [8]. Unlike the manual filtering features in previous algorithms, a systematic feature learning method that combines feature extraction with CNN training has also been proposed [9]. Subsequently, DeepConvLSTM networks outperformed previous algorithms in the Opportunity Challenge by an average of 4% of the F1 score [10]; the effects of parameters on the final result were also analyzed.
Although researchers have made great strides in HAR, room for improvement remains.
In recent years, deep learning has shown applicability to many fields, such as image processing [11] [12], speech recognition [13] [14] [15], and natural language processing [16] [17]. In ILSVRC 2012, AlexNet [18], proposed by Alex Krizhevsky, took first place; since then, deep learning has been considered applicable to real problems, which it has solved with impressive accuracy. Indeed, deep learning has become a popular area for scientists and engineers.
Another event that drew considerable attention was the 2016 "man-machine match of the century," which ended with AlphaGo's victory.
This event also demonstrated that deep learning, based on big data, is a feasible way to approach non-deterministic polynomial problems.
LSTM cells, which were first proposed by Hochreiter and Schmidhuber in 1997 [19], are variants of recurrent neural network (RNN) cells. Their special inner gates allow for consistently better performance than plain RNNs on time series.
Compared with the structures of other networks, such as CNNs, restricted Boltzmann machines (RBMs), and auto-encoders (AEs), the structure of the LSTM renders it especially good at solving problems involving time series, such as those related to natural language processing, speech recognition, and weather prediction, because its design enables gradients to flow through time readily.
The remainder of this paper is organized as follows. Section 2 presents the baseline LSTM, Bidir-LSTM, and residual networks. In Section 3, we provide an explicit introduction to the preprocessing in HAR and describe Deep-Res-Bidir-LSTM. Several experiments were then performed with HAR benchmarks: the public domain UCI data set and the Opportunity data set. We compare the recognition accuracy of our algorithm with that of other algorithms. Finally, we summarize the research and discuss our future work.

Baseline LSTM
LSTM [19] is an extension of recurrent neural networks. Due to its special architecture, which combats the vanishing and exploding gradient problems, it is good at handling time series problems up to a certain depth. In Figure 1, the input set is defined as a sequence indexed from time step 0.
In theory, an RNN can estimate the output at the current time based on past information. However, Bengio [20] found that RNNs could remember information for only a short time because of the vanishing and exploding gradient problems. When back propagation is used with a deep network, gradients vanish rapidly unless preventative measures that permit gradients to flow deeply are taken. Compared with the simple input concatenation and activation used in RNNs, LSTM has a particular structure for remembering information for a longer time: an input gate and a forget gate control how to overwrite the information by comparing the inner memory with the new information arriving, which enables gradients to flow through time easily.
As shown in Figure 2, the input gate, the forget gate, and the output gate of LSTM are designed to control what information should be forgotten, remembered, and updated. Gating is a method to selectively pass the information that is needed. It consists of a sigmoid function and an element-wise multiplication. The output of the sigmoid lies in [0, 1], so the multiplication either lets information flow or blocks it; thus, it is considered good practice to initialize these gates to a value of 1, or close to 1, so as to not impair training at the beginning.
In the LSTM cell, the parameters at moment t are computed in four steps. First, there is a need to forget old information, which involves the forget gate. The next step is to determine what new information needs to be kept in memory; this is done with an input gate. From that, it is possible to update the old cell state. Finally, it should be decided which information should be output to the layer above with an output gate.
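The four steps can be illustrated with a minimal sketch of one LSTM time step. It assumes the widely used textbook formulation (sigmoid gates, a tanh candidate, gates computed from the concatenation of the previous hidden state and the current input), which may differ in detail from the equations of this paper; the function names and the `params` layout are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(weights, bias, vec):
    """Return weights @ vec + bias for a list-of-rows weight matrix."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step (assumed textbook form, not the paper's exact equations).

    params maps each gate name to a (weights, bias) pair.
    """
    z = h_prev + x_t  # list concatenation: [h_{t-1}, x_t]
    f_t = [sigmoid(a) for a in affine(*params["forget"], z)]       # what to forget
    i_t = [sigmoid(a) for a in affine(*params["input"], z)]        # what to store
    g_t = [math.tanh(a) for a in affine(*params["candidate"], z)]  # new candidate
    o_t = [sigmoid(a) for a in affine(*params["output"], z)]       # what to emit
    # Update the old cell state, then expose a gated view of it.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t
```

With all weights and biases at zero, every gate outputs 0.5 and the candidate is 0, so the cell state is simply halved; this makes the role of each gate easy to check by hand.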

Bidirectional LSTM
In real life, human trajectories are continuous. Baseline LSTM cells can predict the current status based only on former information. It is clear that some important information may not be captured properly by the cell if it runs in only one direction.
The improvement in bidirectional LSTM is that the current output is not only related to previous information but also to subsequent information. For example, it is appropriate to predict a missing word based on context.
Bidirectional LSTM [32] is made up of two LSTM cells, one for each time direction, and the output is determined by the two together. Our bidirectional LSTM cell differs slightly from the usual definition. We concatenate the forward and backward hidden states at each time step and then halve the number of features with a ReLU fully connected hidden layer: $h_t = \mathrm{ReLU}(W \cdot \mathrm{concat}(\overrightarrow{h}_t, \overleftarrow{h}_t) + b)$, where $\mathrm{concat}()$ means concatenating sequences and $W$ and $b$ denote the weights and bias of the fully connected layer.
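The concatenate-then-halve step described above can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the recurrent cell is replaced by a toy `running_sum` step so the example stays short, and all function names are hypothetical.

```python
def relu(z):
    return max(0.0, z)

def run_direction(step_fn, sequence, state):
    """Run a recurrent step function over a sequence and collect its outputs."""
    outputs = []
    for x_t in sequence:
        out, state = step_fn(x_t, state)
        outputs.append(out)
    return outputs

def bidirectional(step_fw, step_bw, sequence, init_fw, init_bw):
    """Concatenate forward and backward outputs at every time step."""
    forward = run_direction(step_fw, sequence, init_fw)
    # The backward cell reads the reversed sequence; its outputs are reversed
    # again so that index t lines up with the forward outputs.
    backward = run_direction(step_bw, sequence[::-1], init_bw)[::-1]
    return [f + b for f, b in zip(forward, backward)]

def halve_features(concat_t, weights, bias):
    """ReLU fully connected layer mapping 2n concatenated features to n."""
    return [relu(sum(w * v for w, v in zip(row, concat_t)) + b)
            for row, b in zip(weights, bias)]

def running_sum(x_t, state):
    """Toy stand-in for an LSTM cell: the state is the sum of inputs so far."""
    new_state = [s + x for s, x in zip(state, x_t)]
    return new_state, new_state

sequence = [[1.0], [2.0], [3.0]]
merged = bidirectional(running_sum, running_sum, sequence, [0.0], [0.0])
reduced = [halve_features(m, [[1.0, 1.0]], [0.0]) for m in merged]
```

Here each time step sees the inputs before it (forward sum) and after it (backward sum), and the dense layer brings the two features back down to one.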

Residual Network
The MSRA team built a 152-layer network, about eight times as deep as the VGG network [21]. Owing to its excellent performance, the team took first place in the 2015 ILSVRC competition with a clear advantage in image classification, localization, and detection.
As networks deepen, the research emphasis shifts to how to overcome obstructions to the transmission of information and gradients. The MSRA team addressed this with residual networks. The main idea is that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping.
An important advantage of residual networks is that they are much easier to train, because the addition operator lets gradients pass through the layers more directly, bypassing layers that would otherwise have been restrictive. This enables both better training and a deeper network, because residual connections do not impede gradients and still contribute to refining the output of a highway layer composed of such residual connections. On top of a collection of residual connections is a bottleneck where the next layers stop being residual and where batch normalization is generally applied to normalize and restrict the usage of the feature space represented by the layer [22]. A residual connection adds the input $x$ of the bypassed layers $F$ to their output: $y = F(x) + x$.
In the code implementation, indexing in the configuration file starts at 1 rather than 0 because we included the count of the first layer that acts as a basis before the residual cells. The same counting rule applies for indicating how many blocks of residual highway layers are stacked one on top of the other.
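The addition operator at the heart of a residual connection can be sketched in a few lines. The helper names below are hypothetical, and the bypassed layers are replaced by simple scaling functions purely for illustration.

```python
def residual_block(x, layers):
    """Apply `layers` in order, then add the block input back: y = F(x) + x."""
    out = x
    for layer in layers:
        out = layer(out)
    if len(out) != len(x):
        raise ValueError("residual addition needs matching feature sizes")
    return [o + xi for o, xi in zip(out, x)]

def scale(factor):
    """Toy layer used only for this illustration."""
    return lambda vec: [factor * v for v in vec]

y = residual_block([1.0, -2.0], [scale(0.5), scale(0.5)])
```

Because the input is added back unchanged, the block passes its input through even when the bypassed layers contribute nothing, which is why gradients are not impeded.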

Process Pipeline for HAR
The process pipeline of HAR can be divided into three parts: preprocessing, training, and testing. In our case, testing was performed in parallel with training.
First, testers performed activities of daily living while wearing sensors, and the gathered information formed the raw data set. Missing data were filled in, and the data were then normalized to a mean of zero and a standard deviation of 0.5; we then reshaped the data to fit the designed network, with windows of 128 time steps. The data were split into training and testing data sets.
Second, a training data tensor was fed into the designed network so that it could output a prediction. The difference between the predicted value and the real value was measured with a sigmoid cross entropy loss with L2 weight decay, and the errors were back-propagated through the network layer by layer with the Adam optimizer [31]. We could then adjust the hyper-parameters of the network, such as the learning rate and the L2 weight decay, to reduce the difference.
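As a rough illustration of the loss named above, the sketch below combines a sigmoid cross entropy term with an L2 penalty on the weights. The numerically stable logit form and the penalty convention (`decay` times the sum of squared weights, with no factor of one half) are assumptions; the function names are hypothetical.

```python
import math

def sigmoid_cross_entropy(logits, targets):
    """Mean sigmoid cross entropy, computed from logits in a stable form."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Equivalent to -y*log(s) - (1-y)*log(1-s) with s = sigmoid(z).
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

def l2_penalty(weights, decay):
    """Weight decay term: decay * sum of squared weights (assumed convention)."""
    return decay * sum(w * w for w in weights)

def total_loss(logits, targets, weights, decay):
    return sigmoid_cross_entropy(logits, targets) + l2_penalty(weights, decay)
```

The `decay` multiplier is the hyper-parameter that trades data fit against weight size; the optimizer then minimizes the sum of both terms.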
Finally, during testing, we fed the testing tensor into the neural network without affecting the learned parameters, so as to not corrupt the test. Testing did not affect the training and did not change the results.
Predictions obtained from the neural network were compared with the real values. The accuracy and the F1 score were calculated frequently throughout learning, by running the tests, and again at the end.
Both the best in-training metrics and the final metrics obtained were kept for comparison.

Architecture of Deep Res-Bidir-LSTM Network
Considering the networks in Section 2, we propose the Deep-Res-Bidir-LSTM to deal with HAR. Although residual connections have been used with CNNs [21], the method also applies to LSTM.
Similar to building blocks, we can select modules and combine them to build a network based on our mission. The input of HAR should be a time series, and the basic structure of the LSTM guarantees that it can preserve the characteristics on the temporal dimension.
Additionally, a large network can be optimized correctly for a problem given sufficient regularization, such as L2 weight decay and dropout; without regularization, it overfits and performs poorly on the test set. Complexity is good, but only if countered with regularization. Too many layers and cells per layer increase the computational complexity and waste computational resources. When the layer number and cell number reach a certain scale, the recognition accuracy plateaus instead of increasing; when more depth is added, regularization is needed to avoid overfitting while still improving accuracy.
Our deep LSTM neural network is limited in terms of how many data points it can access: it has access to only 128 time steps when making its predictions. Especially when deepened, the next forward/backward duo will see output from the other pass "in advance," because, logically, a backward pass for our bidirectional LSTM reverses the input and the output before the concatenation. Thus, the Bidir-LSTM has the same input and output shape as the baseline LSTM but, through depth, at a given time step, it has access to more information in advance because of the backward passes.
In general, gradient vanishing is a widespread problem for deep networks. In Figure 5, the information flows in the horizontal direction (temporal dimension) and in the vertical direction (depth dimension). Apart from the input and output layers, there are 2 hidden layers that have residual connections inside (hence called "residual layers"). Moreover, each residual layer contains 2 bidirectional layers. The network in Figure 5 thus has a 2×2 architecture, which can also be thought of as 8 LSTM cells in total. In our network, the activation function is unified as ReLU, because it outperforms tanh in deep networks by countering gradient vanishing.
Although the output is a tensor for a given time window, T, the time axis is collapsed by the neural network. That is, we need only the last element of the output and can discard the others. Thus, only the gradient from the prediction at the last time step is applied. This also makes one LSTM cell unnecessary: the uppermost backward LSTM in the bidirectional pass. This is not of great concern because TensorFlow, the library we use, should evaluate what to compute and what not to compute. Additionally, the training data set should be shuffled during the training process. The state of the neural network is reset at each new window for each new prediction.

Tricks for Optimization
With our Deep-Res-Bidir-LSTM for HAR, the accuracy during testing can be much worse than that during training. Overfitting is likely to occur, and balancing the regularization hyper-parameters becomes difficult because they are so numerous. The L2 norm of the weights is added to the loss function as weight decay.
Also, dropout is applied between each layer on the depth axis. Gradient clipping is applied as well: it rescales the gradient according to its norm, where $g$ is the gradient and $\|g\|$ is the normed gradient.
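Gradient clipping by norm, to which the symbols for the gradient and its norm refer, can be sketched as follows. This is the common clip-by-norm rule and is assumed here rather than taken from the paper; the `max_norm` threshold and the function name are hypothetical.

```python
import math

def clip_by_norm(grad, max_norm):
    """Rescale `grad` so that its L2 norm does not exceed `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)      # small gradients pass through unchanged
    factor = max_norm / norm   # shrink, keeping the direction
    return [g * factor for g in grad]

clipped = clip_by_norm([3.0, 4.0], 1.0)  # the input norm is 5, so it is scaled by 1/5
```

Only the length of the gradient changes, never its direction, which limits overshooting without biasing the update.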
Batch normalization [23] can also be useful in training residual connections. The fundamental idea of batch normalization is that layers are normalized by mean and variance such that they have a mean of zero and a standard deviation of 1 over the whole batch; one rescaling factor then multiplies the whole batch, and one bias is added. The result is thus normalized and then scaled and offset in a linear fashion. The scaling multiplier $\gamma$ and the offset parameter $\beta$ are learned to rescale inputs in a custom way, and $\gamma$ can be initialized to 1, as is commonly done. The formula can be defined as:
$$\mathrm{BN}(x) = \gamma \, \frac{x - \mu}{\sigma} + \beta$$
where $\mu$ and $\sigma$ are the mean and the standard deviation over the batch.
We added many tricks to the network to provide better results. Generally, the L2 norm for weight decay and dropout are used to prevent overfitting, and gradient clipping and batch normalization are used to prevent gradient vanishing or explosion as well as overshooting the learning updates.
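Batch normalization for a single feature can be sketched as below. The small constant `eps`, added for numerical safety, is an assumption not mentioned in the text, and the function name is hypothetical.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a batch, then scale by gamma and shift by beta."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
rescaled = batch_norm([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=3.0)
```

With the default `gamma=1.0` and `beta=0.0`, the output has a mean close to zero and a standard deviation close to 1; the learned multiplier and offset then move it to whatever scale suits the next layer.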

Experiments
We tested the Deep-Res-Bidir-LSTM network with the public domain UCI data set [24] and the Opportunity data set [7]. Then, we compared it with the outcomes of other methods and analyzed the results. The computer used for testing had an i7 CPU with 8 GB of RAM and an NVIDIA GTX 960m GPU with 640 CUDA cores and 4 GB of memory. The GPU and the CPU were used alternately, depending on the size of the neural network, which sometimes exceeded the memory available on the graphics card during training.

Data Sets
The research objects of recognition were activities in daily life.
Opportunity data set. The Opportunity data set for HAR from wearable, object, and ambient sensors is a data set devised to benchmark HAR algorithms.
The data set includes activities from four subjects, each with six recorded runs. For each subject, the first five runs consist of activities of daily living, characterized by the natural execution of daily activities. The sixth run was a "drill" run, in which users executed a scripted sequence of activities. The activities of the user in the scenario were annotated on different levels. Notably, 17 mid-level gesture classes were identified and used for our predictions; together with the common "NULL" class, this gives a total of 18 classes. The NULL class rendered the data set highly unbalanced; thus, following previous research [10], we used a weighted F1 score. In total, 242 features from body-worn sensors, object sensors, and ambient sensors were provided for each sample; time stamps in milliseconds, starting from zero at a sampling rate of 30 Hz, were also provided. Many of those 242 features are not useful for HAR; thus, we used only 113 features, as in DeepConvLSTM [10]. Because wireless sensors were used to transfer data, some data may be missing. We used linear interpolation to fill in the missing data.
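Filling gaps in one sensor channel by linear interpolation can be sketched as follows. Missing samples are represented by `None`, and gaps at the start or end of the series copy the nearest known value; both choices, like the function name, are assumptions made for this illustration.

```python
def fill_missing(series):
    """Linearly interpolate None entries between their nearest known neighbors."""
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        raise ValueError("series has no known samples")
    filled = list(series)
    for i, v in enumerate(series):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:       # gap at the start: copy the first known value
            filled[i] = series[right]
        elif right is None:    # gap at the end: copy the last known value
            filled[i] = series[left]
        else:                  # interior gap: straight line between neighbors
            w = (i - left) / (right - left)
            filled[i] = series[left] + w * (series[right] - series[left])
    return filled

channel = fill_missing([1.0, None, None, 4.0])
```

Each channel would be processed independently, since sensors drop samples at different times.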
Also, the data were provided with a custom scale and with different value ranges and resolutions for each feature; values sometimes differed by orders of magnitude depending on the cell used. Our architecture used mean and variance (standard deviation) normalization on the z-score scale with a standard deviation of 0.5.
Such a small standard deviation is often useful in deep learning [30]. The normalization was defined as follows:
$$x' = \frac{x - \mu}{2\sigma}$$
where $\mu$ is the mean value and $\sigma$ is the standard deviation.
As with DeepConvLSTM, we used a subset of the data from subjects 1 to 3 as the training data set and the remainder of this subset as the test data: runs 3 and 4 of subjects 2 and 3, for a total of 4 test runs. To obtain comparable results, we did not use the data from subject 4. To summarize, we used only a subset of the full data set to simulate the conditions of the competition, using 113 sensor channels and classifying the 17 categories of output (and the NULL class). Our LSTM's inner representation was always reset to 0 between series.
We used mean and variance normalization rather than min-to-max rescaling.
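A minimal sketch of mean and variance normalization to a standard deviation of 0.5 is given below. It computes the statistics from the values it is given; in practice they would presumably be estimated on the training data and reused for the test data, which is an assumption. The function name is hypothetical.

```python
import math

def normalize_channel(values, target_std=0.5):
    """Shift one channel to mean 0 and rescale it to `target_std`."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0.0:
        return [0.0 for _ in values]  # constant channel: nothing to scale
    return [target_std * (v - mean) / std for v in values]

scaled = normalize_channel([2.0, 4.0, 6.0, 8.0])
```

Unlike min-to-max rescaling, this keeps the result robust to a single extreme reading, because the scale depends on the spread of all samples rather than on the two extremes.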
Because of class imbalance in the Opportunity data set, we used the F1 score as a measurement of recognition. The F1 score is the harmonic mean of precision and recall; it ranges between 0 and 1. For a dichotomous problem, the F1 score can be defined as follows:
$$F_1 = 2 \cdot \frac{\mathrm{prec} \cdot \mathrm{recall}}{\mathrm{prec} + \mathrm{recall}}$$
where prec and recall indicate precision and recall, respectively.
However, we needed a multi-class classification in this paper. So, the F1 score was defined as follows:
$$F_1 = \sum_{c} 2 \cdot \frac{N_c}{N_{total}} \cdot \frac{\mathrm{prec}_c \cdot \mathrm{recall}_c}{\mathrm{prec}_c + \mathrm{recall}_c}$$
where $N_c$ is the sample count of class $c$, and $N_{total}$ is the total sample count of the data set.
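The class-weighted F1 score can be computed as in the sketch below, where each class contributes its own F1 score weighted by its share of the samples. The function name is hypothetical, and a class with no true positives is assumed to contribute zero.

```python
def weighted_f1(y_true, y_pred):
    """F1 score averaged over classes, weighted by each class's sample count."""
    total = len(y_true)
    score = 0.0
    for c in set(y_true):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        if tp == 0:
            continue  # precision or recall is zero, so this class adds nothing
        prec = tp / (tp + fp)
        recall = tp / (tp + fn)
        n_c = tp + fn  # number of true samples of class c
        score += (n_c / total) * 2 * prec * recall / (prec + recall)
    return score

example = weighted_f1([0, 0, 0, 1], [0, 0, 1, 1])
```

In this example, class 0 has an F1 score of 0.8 with weight 3/4 and class 1 has an F1 score of 2/3 with weight 1/4, so a frequent class such as NULL dominates the total in proportion to its size.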

Hyper-parameters Setting
The hyper-parameters in the Deep-Res-Bidir-LSTM network affect the final result. Generally used methods of tuning parameters include experimental methods, grid searches [26], genetic algorithm (GA) [27], and particle swarm optimization (PSO).
Figure 6. Sliding window. The ground truth represents labels for the classification.
As shown in Figure 6, for data with n channels, each sample consisted of a window of length T. The window was then moved by a step size to form the next sample; we used 50% overlap, which added some redundancy during training and testing.
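The sliding window segmentation can be sketched as follows. Labeling each window with the label of its last sample is an assumption made for this illustration, as are the function and parameter names.

```python
def sliding_windows(samples, labels, window, overlap=0.5):
    """Cut a recording into fixed-length windows that overlap by `overlap`."""
    step = max(1, int(window * (1 - overlap)))  # step size between windows
    windows, window_labels = [], []
    for start in range(0, len(samples) - window + 1, step):
        windows.append(samples[start:start + window])
        # Assumed labeling rule: a window takes the label of its last sample.
        window_labels.append(labels[start + window - 1])
    return windows, window_labels

recording = list(range(10))  # stand-in for 10 time steps of sensor rows
segments, segment_labels = sliding_windows(recording, recording, window=4)
```

With a window of 4 and 50% overlap, the step size is 2, so consecutive windows share half of their samples.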
Repeating the operation above yielded a data set suitable for training.
MultiClass Hardware Friendly SVM, or MC-HF-SVM, was proposed by Davide Anguita [8]. It preserves the life of a smartphone battery better than the conventional floating-point-based formulation while maintaining comparable system accuracy levels.
The performances of Bidir-LSTM and Res-LSTM were almost the same; both were better than the baseline LSTM because they are good at information transmission.

Results
Bidir-LSTM can get information in both forward and backward passes, and Res-LSTM uses a highway to transmit information directly. Among the algorithms in Table 1, Deep-Res-Bidir-LSTM achieved the best F1 score, 93.54%, because of both residual connections and bidirectional cells.
Comparing accuracy and F1 scores, the two columns are almost the same for each model. We randomly selected a batch while training, and a complete calculation was almost able to traverse the entire data set.

Conclusions
In this paper, the significance of HAR research is analyzed, and an overview of emerging methods in the field is provided. LSTM neural networks have been used in many innovations in natural language processing, speech recognition, and weather prediction. This technology was adapted to the HAR task. We proposed the novel framework of the Deep-Res-Bidir-LSTM network.
This deep network can enhance learning ability for faster learning in early training. The proposed network also guarantees the validity of information transmission through residual connections (on the depth dimension) and bidirectional cells (on the temporal dimension). In our experiments, the proposed network was able to improve the accuracy, by 4.78%, for the public domain UCI data set and increase the F1 score, by 3.68%, for the Opportunity data set in comparison with previous work.
We also found that window size was a key parameter. Too small a window did not guarantee continuity of information, and too large a window caused classification errors. Usually, 500 ms to 5000 ms will be appropriate for the window size. During model training, the architecture of the network, such as the layers and the cells in each layer, should be determined first, followed by the optimization of hyper-parameters, such as learning rate and the L2 weight decay multiplier. The values of hyper-parameters should be determined according to the specific architecture. For example, 28 cells were sufficient for the public domain UCI data set, but 128 cells were better for the Opportunity data set because it has more features and labels and, thus, increased overall complexity.
Future work should explore a more efficient way to tune parameters.
Although the grid search was workable, the searching range must be changed manually, and the values are always fixed. It will be important to find an adaptive way to automatically adjust the searching process. Indeed, focusing on time series prediction problems has value; problems such as stock prediction and trajectory prediction may be explored.