Adapted Long Short-Term Memory (LSTM) for Concurrent Human Activity Recognition

Abstract: Deep learning methods offer a broad spectrum of efficient algorithms for recognizing or predicting an output from a sequence of inputs. Recent long short-term memory (LSTM) algorithms perform well on many such tasks, but they remain of limited effectiveness when detecting sequences of complex human activity. In this work, we adapted the LSTM algorithm into a synchronous algorithm (sync-LSTM), enabling the model to take multiple parallel input sequences and produce multiple parallel synchronized output sequences. The proposed method is implemented for simultaneous human activity recognition (HAR) using heterogeneous sensor data in a smart home. HAR assists artificial intelligence in providing services to users according to their preferences. The sync-LSTM algorithm improves learning performance and shows potential for real-world applications in complex HAR, such as concurrent activity recognition, with higher accuracy and satisfactory computational complexity. The adapted algorithm for HAR is also applicable in the fields of ambient assistive living, healthcare, robotics, pervasive computing, and astronomy. Extensive experimental evaluation with publicly available datasets demonstrates the competitive recognition capabilities of our approach. For concurrent activity recognition, our proposed method shows an accuracy of more than 97%.

A complete recognition process includes sensor data acquisition, pre-processing, feature extraction, and classification (the model), where the feature extraction and classifier algorithms play a crucial role in achieving high recognition accuracy [3]. Manual feature extraction or selection is often laborious and arbitrary and lacks generalizability [4]. Instead of using manually produced features, we applied automatic feature extraction or selection [5,6]. Recent activity recognition studies have often focused on single individuals or single activities rather than complex, concurrent activities. Our method addresses the challenge of concurrent activity detection.
The need to structure information in order to process large amounts of data is becoming a pervasive problem and is attracting considerable research interest. Automatic sequence classification methods based on convolutional neural networks (CNN) [7] and recurrent neural networks (RNN) [8] are used to structure and process these sequences with a set of high-level representations. Deep learning methods such as long short-term memory (LSTM) [9] and bi-directional LSTM (BLSTM) [10] are used in various domains and tasks, including image processing [11], language recognition [12], and sentence processing [13].
Despite its popularity, LSTM cannot take multiple inputs of synchronous data and cannot directly produce multiple outputs when used as a classifier. In [14], a multiple-stream neural network called parallel LSTM is presented, which can process synchronous data streams but produces only a single output. In this paper, we present an improved architecture for a synchronous neural network called synchronous LSTM (sync-LSTM), which processes multiple data streams simultaneously to detect concurrent human activity. Two or more activities running simultaneously in parallel are called concurrent activities. Most importantly, we detect complex human activity using the improved deep learning platform. Multi-binary classifiers can detect parallel activity [15,16], but this approach fails when there are many activities. The hidden Markov model (HMM) [17], conditional random field [18], and various other machine learning approaches [19] and probability inference algorithms [20] are widely used for parallel activity detection. However, they cannot handle a large number of spatio-temporal data sequences. To our knowledge, the method proposed here is the first to address these issues.
Two requirements for human activity recognition drive this work: improving recognition accuracy and developing reliable algorithms to solve complex activity recognition problems. Therefore, our method promises to address the needs of activity recognition with heterogeneous sensors, primarily by improving performance over existing approaches. The highlights of our method are as follows.
• We present the improved LSTM known as sync-LSTM to recognize concurrent human activity.
• The presented concept is an adapted version of LSTM that supports parallel input and parallel output.
• The proposed approach can structure and learn spatio-temporal features directly and automatically from the raw sensor data, without the need for manual feature extraction or selection for concurrent activity recognition.
• This framework can likely be applied to different recognition platforms with different sensor modalities in different domains.
• The results obtained by our proposed method show that it outperforms existing methods on recognition of concurrent activity.
Activity recognition relies on a combination of different sensors: wearable, external, or both. We chose external sensors for the user's comfort. We evaluate and compare the performance of our proposed approach using fully annotated real-world datasets generated by Kasteren and the Center for Advanced Studies in Adaptive Systems (CASAS). The rest of this paper is organized as follows. Related work on activity recognition and LSTM is described in Section 2. Section 3 illustrates our proposed method. The experiment setup, analysis, and evaluation are provided in Section 4. Finally, Section 5 concludes the paper.

Related Work
Early research trained shallow classifiers on sequences collected by a single sensor. Sensor-specific constraints strongly limited such single-sensor systems, as the data from one sensor was inherently insufficient for complex activity recognition. Therefore, multi-sensor heterogeneous systems were designed to address these challenges. Significantly, our system can work with multiple and heterogeneous sensors. A naive Bayes classifier has been used for activity recognition with the longest sequences of sensor data [21]. An incremental learning approach called the dynamic Bayesian network has been tested to detect differences in activities by rebuilding previously learned models [22]. Early generative deep learning methods used restricted Boltzmann machines to derive task-independent feature representations [23,24]. More complex models such as CNN have been successfully used for challenging HAR tasks [25]. Similarly, other established algorithms, such as the multilayer perceptron [26], support vector machine [27], decision tree [28], and an updated HMM [29], have been used to classify some types of activities.
Discriminative models rely on the independence assumption, and their parameters are learned by optimizing the conditional likelihood rather than the joint likelihood [30]. Audio [31] and video [32,33] activity recognition methods are also widely used and are well suited to healthcare and remote monitoring. However, audio-visual methods raise privacy issues and are complex and intrusive. In recent years, many deep learning methods have been built on CNN [34,35], RNN [36], and LSTM. The enhanced RNN and LSTM are widely applicable in language modeling, hand gesture recognition [37], machine translation [38], sound recognition [39], video analysis [40], and image captioning [41]. As mentioned earlier, they lag behind in parallel processing. Additionally, these algorithms are rarely used in activity recognition. Moreover, a significant amount of research is concerned with single or regular activity recognition, and few researchers address complex activity recognition such as concurrent activity recognition. Activity recognition with a CNN-LSTM [42] structure sheds light on concurrent activity detection with multimodal sensor data, but it suffers from data redundancy.

Materials and Method
The HAR process consists of four phases: data acquisition, preprocessing, feature extraction, and training/testing, as shown in Fig. 1. In this work, concurrent activity is recognized by using sync-LSTM. Because LSTM acquires the feature vectors automatically, the experimental process reduces to preprocessing and training/testing.
The sensor data is pre-processed by applying filters and overlapping sliding windows of 128 time steps each. Our model operates with two fully connected layers and two LSTM layers of 64 units each, as shown in Fig. 2. We train the model to predict the real value by setting hyperparameters such as the cross-entropy loss, learning rate, weight decay, and optimizer. A k-fold cross-validation technique is performed to improve and validate the output before testing. During testing, test sensor data is supplied without affecting the learned parameters. The detections obtained during testing are then compared with the real values, and the performance is measured using accuracy and the F1 score. We present the sync-LSTM based on a standard LSTM and describe its implementation for concurrent activity recognition. The main feature of our approach is that it can accept multiple inputs and produce multiple outputs, because the LSTM algorithm is extended to handle a parallel structure and detect concurrent activity. The input sequences are independent and mapped into homogeneous subsets. A single-mode LSTM with concatenated input sequences is theoretically not suitable for mapping the heterogeneous input representation of parallel sequences.
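The windowing step described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `sliding_windows` and the 50% overlap (a step of 64) are assumptions, since the text gives the window length of 128 time steps but not the overlap.

```python
import numpy as np

def sliding_windows(signal, window=128, step=64):
    """Split a (time, channels) array into overlapping windows.

    Returns an array of shape (num_windows, window, channels).
    Trailing samples that do not fill a whole window are dropped.
    The step of 64 (50% overlap) is an assumed value.
    """
    signal = np.asarray(signal)
    if signal.shape[0] < window:
        # Not enough samples for even one window.
        return np.empty((0, window) + signal.shape[1:], dtype=signal.dtype)
    starts = range(0, signal.shape[0] - window + 1, step)
    return np.stack([signal[s:s + window] for s in starts])

# Hypothetical example: 512 time steps from 3 sensor channels.
demo = np.arange(512 * 3, dtype=float).reshape(512, 3)
windows = sliding_windows(demo)
print(windows.shape)
```

Each window then becomes one input sample for the network; a smaller step yields more, more strongly overlapping, samples.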

[Figure: block labels recoverable from the diagram are "Feature Extraction", "Sync-LSTM Algorithm", and "Recognized Activity".]

Standard LSTM
RNNs feed the output of their neurons back as input when classifying the current data. This ability helps an RNN find patterns with long-term dependencies. However, the vanishing gradient problem still occurs. LSTM was introduced to solve this problem, and it outperforms RNN in finding long-term dependencies. Fig. 3a shows the internal architecture of a standard LSTM. The cell in this LSTM unit of the network consists of an input gate i(t), a forget gate f(t), an output gate o(t), and a memory cell c(t) that stores the information and potentially updates the output over some periods.
The LSTM also has peephole connections from the inner cell to the gates of the same cell, which help it learn the timing of the outputs. Multiplicative gate equations govern how each cell and gate propagates forward.
x(t) is the input sequence. $w_{xi}$, $w_{xf}$, and $w_{xo}$ are the input weights of the sequence associated with the input gate, the forget gate, and the output gate, respectively. c(t) represents the memory cell at the corresponding time, and b represents the bias of the corresponding gate or cell. The hidden layer h(t) processes the input x(t) and provides the output y(t) at time t, as shown in Fig. 3a. σ is the logistic sigmoid activation function, which bounds its value to the interval (0, 1) and is expressed as $\sigma(x) = \frac{1}{1 + e^{-x}}$. tanh is the hyperbolic tangent activation function.
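For reference, the symbols defined above match the widely used peephole LSTM forward pass, which can be written as below. This is the standard textbook formulation and may differ in detail from the authors' equations; the recurrent weights $w_{hi}, w_{hf}, w_{ho}, w_{hc}$, the peephole weights $w_{ci}, w_{cf}, w_{co}$, and the cell input weight $w_{xc}$ are names assumed here for illustration, and $\odot$ denotes element-wise multiplication.

```latex
\begin{aligned}
i(t) &= \sigma\big(w_{xi}\,x(t) + w_{hi}\,h(t-1) + w_{ci} \odot c(t-1) + b_i\big) \\
f(t) &= \sigma\big(w_{xf}\,x(t) + w_{hf}\,h(t-1) + w_{cf} \odot c(t-1) + b_f\big) \\
c(t) &= f(t) \odot c(t-1) + i(t) \odot \tanh\big(w_{xc}\,x(t) + w_{hc}\,h(t-1) + b_c\big) \\
o(t) &= \sigma\big(w_{xo}\,x(t) + w_{ho}\,h(t-1) + w_{co} \odot c(t) + b_o\big) \\
h(t) &= o(t) \odot \tanh\big(c(t)\big)
\end{aligned}
```

The input and forget gates look at the previous cell state $c(t-1)$ through their peepholes, while the output gate looks at the freshly updated $c(t)$.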

Adapted LSTM: Sync-LSTM
Let a normal input sequence be $x \in \mathbb{R}^{S \times E \times I \times V \times L}$, where S and E are the start and end times, I is the sensor ID, V is the sensor value, and L is the location. Sync-LSTM takes the input of samples $x_1$ [...]. This paradigm can accelerate model training and provide a viable solution to facilitate the development and operation of AI applications. Sync-LSTM forwards multiple activity sequences to detect and encode hidden internal structures between parallel hidden activity sequences. Processing a signal with long time steps in a single standard LSTM neuron is time-consuming. Therefore, we use multiple LSTM units in parallel to process different parts of the information. Fig. 4 shows the unfolded architecture of the sync-LSTM network. It consists of the input layer, the parallel LSTM, the fully connected layers, and the outputs. The outputs at the final time step of each LSTM unit are summarized as n × h, where h is the number of hidden neurons of each LSTM unit. The LSTM layers adjust their internal state after each time step. The size of the weight, bias, cell, and hidden layers is set to 128. The final number of parameters depends on the number of activity classes in the classification.
Algorithm 1: Concurrent activity recognition
initialize the inputs
for each cell do
    for t = 0 to n do
        calculate the gate values (input gates, ...)
        recognize the concurrent activity
    end for
end for
update the weights
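The parallel structure described in this section can be illustrated with a small NumPy forward pass. This is a simplified sketch under stated assumptions, not the authors' implementation: it reads n as the number of parallel LSTM units, uses one LSTM unit per stream without peepholes, a single fully connected layer, and independent sigmoid outputs so that several activities can be active at once. All function names and the toy dimensions are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_lstm(input_dim, hidden, rng):
    """Random parameters for one LSTM unit; the four gate blocks are stacked row-wise."""
    return {
        "W": rng.normal(0.0, 0.1, (4 * hidden, input_dim)),  # input weights
        "U": rng.normal(0.0, 0.1, (4 * hidden, hidden)),     # recurrent weights
        "b": np.zeros(4 * hidden),
    }

def lstm_last_hidden(x, p):
    """Run one LSTM unit over x of shape (time, input_dim); return the final hidden state."""
    hidden = p["U"].shape[1]
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for x_t in x:
        z = p["W"] @ x_t + p["U"] @ h + p["b"]
        i, f, o, g = np.split(z, 4)  # input, forget, output gates and cell candidate
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h

def sync_lstm_forward(streams, lstm_params, W_fc, b_fc):
    """Process n parallel streams; return the (n, h) final states and per-activity probabilities."""
    finals = np.stack([lstm_last_hidden(x, p) for x, p in zip(streams, lstm_params)])
    logits = W_fc @ finals.reshape(-1) + b_fc
    return finals, sigmoid(logits)

# Toy dimensions chosen for illustration only.
rng = np.random.default_rng(0)
n_streams, steps, input_dim, hidden, n_activities = 3, 128, 5, 64, 8
streams = [rng.normal(size=(steps, input_dim)) for _ in range(n_streams)]
lstm_params = [init_lstm(input_dim, hidden, rng) for _ in range(n_streams)]
W_fc = rng.normal(0.0, 0.1, (n_activities, n_streams * hidden))
b_fc = np.zeros(n_activities)
finals, probs = sync_lstm_forward(streams, lstm_params, W_fc, b_fc)
print(finals.shape, probs.shape)
```

Thresholding `probs` (for example at 0.5) would mark every activity considered active in the window, which is one simple way to express concurrent outputs.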

Experimental Results and Analysis
This section presents detailed results for both the training and recognition phases of the model. Several design parameters, such as the input data, the number of sync-LSTM layers, and the number of activities, are assigned and processed. The training dataset is used to fit the classifier, estimate and evaluate individual activities, select the best model parameters, and tune the hyperparameters. The proposed model is then trained, and its results are compared with the outputs of existing models. For the experimental analysis of the proposed approach, the Kasteren and Kyoto 3 datasets are considered. The selected datasets have some limitations, such as the activity instances of different groups, the number of residents in each house, the same activity being performed in different ways, and the limited amount of learning data. Algorithm 1 presents the pseudo-code for concurrent activity recognition.

Experimental Configuration and Training
The proposed neural network is implemented with the TensorFlow and scikit-learn libraries. Linear interpolation is used to fill in missing data, and the data is normalized to zero mean with a standard deviation of 0.5. The sensor data is pre-processed and sampled in overlapping sliding windows with a fixed width of 200 ms and a window size ranging from 0.25 s to 7 s. The data is sampled in a single window. The proposed method is trained and tested using the TensorFlow-GPU 1.13.1 library. The experiments run on a computer with an i7 CPU, 16 GB of RAM, and a GTX Titan GPU, using CUDA 9.0 and the cuDNN 7.0 library. Both the CPU and GPU are used to avoid exceeding the memory limit during training. The dataset is split into 70% for training, 10% for validation, and the remaining 20% for testing. We use k-fold cross-validation (CV) with k = 10, known as 10-fold cross-validation, to validate the model. The accuracy is averaged across all 10 folds, and the error is reported as the cross-validation error rate.
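The preprocessing and validation steps named above can be sketched in plain NumPy. This is an illustrative sketch, not the authors' pipeline (which the text says uses TensorFlow and scikit-learn): the helper names are hypothetical, the series is assumed to contain at least one known value, and the fold shuffling seed is arbitrary.

```python
import numpy as np

def fill_missing_linear(values):
    """Fill NaN entries of a 1-D series by linear interpolation over the index.

    Leading or trailing NaNs take the nearest known value (np.interp behaviour).
    """
    values = np.asarray(values, dtype=float)
    idx = np.arange(values.size)
    known = ~np.isnan(values)
    return np.interp(idx, idx[known], values[known])

def normalize(values, target_std=0.5):
    """Shift to zero mean and rescale to the requested standard deviation."""
    values = np.asarray(values, dtype=float)
    std = values.std()
    if std == 0:
        return values - values.mean()
    return (values - values.mean()) / std * target_std

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, val_idx) pairs for shuffled k-fold cross-validation."""
    order = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(order, k)
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, folds[i]

# Hypothetical sensor series with gaps.
raw = np.array([1.0, np.nan, 3.0, np.nan, np.nan, 6.0])
filled = fill_missing_linear(raw)
scaled = normalize(filled)
splits = list(k_fold_indices(25, k=10))
```

Averaging a model's validation accuracy over the ten `(train, val)` pairs gives the cross-validated accuracy; one minus that average is the cross-validation error rate.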
During training, the dropout rate is set to 0.5, which randomly deactivates neurons in each hidden layer to counter overfitting. The training loss is minimized by random initialization and optimization of the training parameters. Two loss terms, cross-entropy and an L2 penalty (referred to here as L2 normalization), are combined to avoid overfitting across epochs and keep the model stable.
In the loss, m is the number of samples per batch, and Γ is the weighting parameter. $\hat{y}_t^l$ is the predicted output, and $y_t^l$ is the label from the dataset. L2 normalization limits the trainable weighting parameter to smaller values, which avoids overfitting.
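The loss equation itself is not reproduced in the text above. One common form that is consistent with the definitions given is the batch-averaged cross-entropy plus a squared-weight penalty, $L = -\frac{1}{m}\sum y \log \hat{y} + \lambda \lVert \Gamma \rVert_2^2$. The sketch below implements that assumed form; the coefficient `l2_coeff` (λ) is hypothetical, and treating Γ as the set of trainable weights is an interpretation rather than a statement from the paper.

```python
import numpy as np

def cross_entropy_l2(y_true, y_pred, weights, l2_coeff=1e-4, eps=1e-12):
    """Mean categorical cross-entropy over a batch plus an L2 weight penalty.

    y_true:   one-hot labels, shape (m, classes)
    y_pred:   predicted probabilities, shape (m, classes)
    weights:  list of trainable weight arrays (assumed to play the role of Gamma)
    l2_coeff: hypothetical penalty strength, not given in the text
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0)  # avoid log(0)
    m = y_true.shape[0]
    ce = -np.sum(y_true * np.log(y_pred)) / m
    l2 = l2_coeff * sum(np.sum(w ** 2) for w in weights)
    return ce + l2

# Toy batch of two samples and two classes.
labels = np.array([[1, 0], [0, 1]])
preds = np.array([[0.5, 0.5], [0.5, 0.5]])
loss_no_penalty = cross_entropy_l2(labels, preds, [np.ones((2, 2))], l2_coeff=0.0)
loss_with_penalty = cross_entropy_l2(labels, preds, [np.ones((2, 2))], l2_coeff=0.1)
```

The penalty grows with the squared magnitude of the weights, so minimizing the total loss pushes the weights toward smaller values.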
We tune the hyperparameters shown in Tab. 1, in particular the learning rate and the L2 weight decay, to reduce the error and thus approach optimal performance. We train the model with learning rates of 0.005 and 0.006 for the Kyoto 3 and Kasteren House-B datasets, respectively, with a batch size of 100. Learning starts at 0.001 for all data. Training runs for more than 12,000 epochs and stops once the outputs are stable. The Adam optimizer is an adaptive moment estimator that maintains independent adaptive learning rates for different parameters, which makes training more stable. The input dimension is set to 128. Gradient clipping is set to 4 and 5 to cap the gradient magnitude and keep it consistent with the normalization. Because the batch size of 100 samples is smaller than the training set, this procedure is mini-batch gradient descent.
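The interaction between gradient clipping and the Adam update can be sketched as follows. This is a minimal NumPy illustration, not the authors' TensorFlow configuration: clipping by global norm with a threshold of 5 is an assumption (the text gives the thresholds 4 and 5 but not the clipping type), and the Adam constants other than the starting learning rate of 0.001 are the optimizer's usual defaults.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together so their joint L2 norm is at most max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads], total

def adam_step(params, grads, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update in place and return the updated parameters."""
    state["t"] += 1
    t = state["t"]
    for k, (p, g) in enumerate(zip(params, grads)):
        state["m"][k] = beta1 * state["m"][k] + (1 - beta1) * g       # first moment
        state["v"][k] = beta2 * state["v"][k] + (1 - beta2) * g ** 2  # second moment
        m_hat = state["m"][k] / (1 - beta1 ** t)  # bias correction
        v_hat = state["v"][k] / (1 - beta2 ** t)
        p -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return params

# Toy parameters and an exaggerated gradient with global norm 50.
params = [np.array([1.0, -2.0]), np.array([[0.5]])]
grads = [np.array([30.0, -40.0]), np.array([[0.0]])]
state = {"t": 0,
         "m": [np.zeros_like(p) for p in params],
         "v": [np.zeros_like(p) for p in params]}
clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
adam_step(params, clipped, state, lr=0.001)
```

Because Adam divides by the running root-mean-square of the gradient, the first step moves each parameter by roughly the learning rate regardless of gradient scale; clipping mainly protects the moment estimates from occasional very large gradients.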

Datasets
Our proposed method is evaluated using the Kasteren [43] and CASAS [44] datasets. An overview of the datasets is shown in Tab. 2. Both datasets were gathered in an apartment containing either four or two rooms. Seventy-six sensors are deployed in Kyoto 3, whereas 23 sensors are deployed in House B. Of the Kyoto 3 sensors, 51 were used for motion detection, 12 were used as cabinet sensors, 5 were used for cooking element detection, 3 were used as temperature sensors, and the remainder were used as a medicine container sensor, pot sensor, phone book sensor, water sensor, burner sensor, or phone sensor. The four residents in the Kyoto 3 dataset performed eight different activities over 15 days. There are 178 activity instances recorded in Kyoto 3. The single resident in the Kasteren house performed 13 different types of activities over 14 days.
The sensors register a change of state according to the actions of the occupant. Radio-frequency identification, a wireless sensor network, a pressure sensor, a reed switch, mercury contacts, passive infrared (PIR) sensors, float sensors, and temperature sensors are used to record the data for the Kasteren house. The schematic diagram of the sensor deployment is shown in Fig. 5.

Evaluation Metrics
The confusion matrix, accuracy, F1-score, and training time are used to evaluate the performance of the model. A confusion matrix summarizes the performance of the approach: the rows represent the predicted class and the columns the actual class, or vice versa. It also gives information about the errors made. Generally, human activity recognition methods are evaluated according to their recognition accuracy. The accuracy is calculated from the confusion matrix, from which Precision and Recall are also derived.
Precision is the proportion of correctly recognized instances out of all instances recognized as the activity. Recall is the proportion of correctly recognized instances out of the total occurrences of the activity. The F-score is the harmonic mean of Precision and Recall; its value lies in [0, 1], and values closer to 1 indicate better performance.
In these metrics, tp, tn, fp, and fn are the numbers of true positives, true negatives, false positives, and false negatives, respectively. A true positive is an instance of the activity that is correctly detected, while a false positive is a detection of the activity where it did not occur. A false negative is an instance of the activity that is not detected, whereas a true negative reflects the correct non-detection of the activity in the negative instances.
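The metrics above can be computed directly from a confusion matrix using the standard definitions: Precision = tp / (tp + fp), Recall = tp / (tp + fn), F1 = 2 · Precision · Recall / (Precision + Recall), and Accuracy = correctly classified instances / all instances. The sketch below assumes rows hold the actual class and columns the predicted class; the function name and the toy matrix are illustrative only.

```python
import numpy as np

def per_class_metrics(confusion):
    """Precision, recall, and F1 per class plus overall accuracy.

    confusion[i, j] counts instances whose actual class is i and predicted class is j.
    Classes with no predictions or no occurrences get a score of 0.
    """
    confusion = np.asarray(confusion, dtype=float)
    tp = np.diag(confusion)
    fp = confusion.sum(axis=0) - tp  # predicted as the class, but actually another
    fn = confusion.sum(axis=1) - tp  # actually the class, but predicted as another
    precision = np.divide(tp, tp + fp, out=np.zeros_like(tp), where=(tp + fp) > 0)
    recall = np.divide(tp, tp + fn, out=np.zeros_like(tp), where=(tp + fn) > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    accuracy = tp.sum() / confusion.sum()
    return precision, recall, f1, accuracy

# Toy two-class confusion matrix.
cm = np.array([[8, 2],
               [1, 9]])
precision, recall, f1, accuracy = per_class_metrics(cm)
```

Averaging the per-class values gives macro-averaged precision, recall, and F1, which is one common way to report a single figure per dataset.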

Recognition Analysis
In this section, a possible implementation of the platform for human activity detection in smart homes is explained. All activities are localized based on the dataset. The activities that occur most frequently at the same time are considered the predominant activities in a smart home environment. Fig. 6 presents the graphical confusion matrix for Kyoto 3. According to the confusion matrix, the fill medication dispenser activity was correctly detected with 98% accuracy; the remaining 2% error arises because a phone call may be made while other activities are being performed. Watch DVD also has a recognition accuracy of 98%, although the fill medication dispenser, answer the phone, and choose outfit activities cause 1.3%, 1.2%, and 0.4% confusion, respectively, since they can all be performed at the same location. The activities water plants, answer the phone, prepare birthday card, prepare soup, clean, and choose outfit have recognition accuracies of 98%, 96%, 97%, 99%, 98%, and 98%, respectively. Answer the phone has the lowest recognition accuracy of all the activities, and it is confused with all other activities because a phone call can be made during any of them and shares most sensor values with them. Its co-occurrence with other activities is correspondingly higher. The Kasteren and CASAS data contain a small number of records and instances; therefore, the actual distribution is easy to find and train on. The recognition accuracy could increase if the proposed method were trained with a larger amount of data. The confusion matrix for House-B is shown in Fig. 7. The number of activity instances is relatively small, so errors occur relatively rarely, and recognizing the concurrent activity with the highest true positive value results in 96.90% accuracy.
Drinking is the activity in House-B that most readily occurs concurrently with preparing dinner, preparing breakfast, dinner, relaxing, using dishwasher, and leaving house, with an accuracy of more than 96.90%. House-B achieved a precision, recall, and F1-score of 97.94%, 97.00%, and 0.97, respectively. Brushing teeth is also detected simultaneously with toileting. Activities such as brushing teeth, showering, and toileting are occasionally confused with one another because they share the same location. However, these errors are comparatively small. The using dishwasher activity shows concurrency with activities such as preparing dinner, preparing breakfast, dinner, and breakfast, and it is detected concurrently with other activities with an accuracy of more than 97.935%. All of these concurrently recognized activities have high detection accuracies. Sleeping is a stand-alone activity: it cannot appear simultaneously with the other listed activities. It is sometimes confused with the dressing activity, as both are performed in the bedroom. The House-B results alone are insufficient to fully establish the experimental concept: although the accuracy is high, more datasets would be needed to determine the actual recognition performance of the proposed method. Nevertheless, the results indicate that our proposed approach for concurrent human activity recognition is feasible. Figs. 8a and 8b present the accuracy and loss of the training and testing procedures for Kyoto 3 and House-B, respectively. In the graphs, the gap between the training and testing accuracy is comparatively small, indicating the effectiveness of the model. The gap between training and test loss is also very narrow, which suggests that dropout is beneficial and makes the model resistant to overfitting.

Performance Comparison
The average accuracy was found to be 97.374%, and the average error was 0.1637. The performance of the proposed approach was compared with existing frameworks that use CNN, LSTM, and Bi-LSTM algorithms for recognition by measuring the average precision, recall, F-score, and accuracy, which are shown in Tab. 4. The accuracy of our method is more than 97%, and the F1-score is more than 0.97. Bi-LSTM achieves accuracy competitive with our method, but it can only process two inputs simultaneously. Cross-validation is performed before the test to validate the input sequences, which contributes to the higher test accuracy. The mean and standard deviation of the accuracy and error using 10-fold cross-validation are shown in Tab. 3. This analysis shows that the proposed method can detect concurrent activities with higher accuracy than the existing approaches.

Computational Complexity
The computational complexity depends on the number of weights and is given as O(W), where W is the number of weights. W depends on the number of output units, the number of memory cell units, the size of the memory, and the number of hidden units. It is also affected by the number of units connected to the forward neurons, memory cells, gate units, and hidden units. The computational complexity per time step does not depend on the length of the input sequence. Although using an LSTM framework increases time complexity, our approach has an acceptable computation time. Bi-LSTM has lower complexity, but it can handle only two parallel activity detection processes and must apply delay or other chaining functions to recognize more than two activities.
This chaining causes the system to wait for a complete process, which increases the computational complexity. Fig. 9 shows the computational time for testing our method compared to existing frameworks, such as CNN, LSTM, and Bi-LSTM, for the Kasteren house and Kyoto 3. The CNN processes many hidden and pooling layers because it segments each activity into sub-activities, which leads to higher computational complexity. The computation time of our method is slightly higher but remains satisfactory in practice.
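To make the O(W) argument concrete, the number of weights in an LSTM layer can be counted from its dimensions. The count below uses the standard LSTM parameterization (four blocks, each with an input matrix, a recurrent matrix, and a bias); the function names, the single dense output layer, and the example dimensions are illustrative assumptions and are not taken from the paper's configuration.

```python
def lstm_weight_count(input_dim, hidden, peepholes=False):
    """Number of trainable weights in one LSTM layer.

    Four blocks (input gate, forget gate, output gate, cell candidate) each hold
    an input matrix, a recurrent matrix, and a bias vector. Peephole connections
    add one diagonal weight vector per gate.
    """
    count = 4 * (hidden * input_dim + hidden * hidden + hidden)
    if peepholes:
        count += 3 * hidden
    return count

def sync_lstm_weight_count(n_streams, input_dim, hidden, n_outputs):
    """Weights of n parallel LSTM units plus one dense output layer (a simplification)."""
    lstm = n_streams * lstm_weight_count(input_dim, hidden)
    dense = n_streams * hidden * n_outputs + n_outputs
    return lstm + dense

# Hypothetical dimensions for illustration.
single = lstm_weight_count(input_dim=5, hidden=64)
single_peephole = lstm_weight_count(input_dim=5, hidden=64, peepholes=True)
total = sync_lstm_weight_count(n_streams=3, input_dim=5, hidden=64, n_outputs=8)
print(single, single_peephole, total)
```

None of these counts involve the sequence length, which is why the per-step cost is independent of it; the total cost of processing a window still grows linearly with the number of time steps.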

Conclusion
The framework presented in this paper shows that sync-LSTM can lead to a feasible solution for detecting concurrent human activities in the smart home scenario. This claim is supported by a comprehensive comparison with recently utilized activity recognition techniques, such as CNN, LSTM, and bi-LSTM. LSTM can work with single data sequences and bi-LSTM with a maximum of two data sequences, but our sync-LSTM can accept multiple inputs and generate multiple outputs, synchronized and in parallel. That is, sync-LSTM would be an effective solution for any environment or domain that requires concurrent processing and recognition. Many approaches focus on single and regular activity detection, and few have tried to detect complex activity. Starting from the standard LSTM formulation, we have developed a more efficient LSTM-based approach to recognize complex human activity.
However, accuracy, processing complexity, and complex activity recognition are still significant challenges in human activity recognition. The proposed method has an F1 score of more than 0.97, along with an accuracy of more than 97%. This supports its effectiveness for concurrent human activity detection with successful training and testing. Nevertheless, the accuracy is limited by error factors such as same-location errors, sensor distance, noise interference, and limited data. Even the best-performing model faces several real-time challenges across different datasets. Parameters such as the number of activities, type of sensors, sensor distribution, number of occupants, and duration of test periods also affect the performance. The window size plays a significant role in performance, as a small window may not contain all of the information and a large window may lead to resident detection errors. The proposed method processes parallel data, which is beneficial and consistent with the setting of highly imbalanced datasets. Therefore, data augmentation techniques are not required.
In addition, sync-LSTM can automatically extract spatio-temporal information, reducing the time-consuming effort of data pre-processing and manual feature extraction. External sensors were used instead of wearables and camera or video sensors to avoid unnecessary burden and protect the privacy of the resident. In the future, more complex activities, such as interleaved activity, will be recognized by improving and updating the proposed method. Furthermore, we can take advantage of cloud computing by using Google Colab and Amazon Web Services. These technologies provide the opportunity to use their servers and also to experiment with tensor processing units. Using these techniques and technologies will also reduce the computation time for faster and better performance. We will also explore a transfer learning approach for this model in other domains, environments, and sectors on big data and cloud infrastructures. In summary, our proposed method (i.e., a sync-LSTM-based model that provides parallel and synchronized recognition and prediction) compares favorably with its competitors.

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.