Industry 4.0-Oriented Deep Learning Models for Human Activity Recognition

According to the Industry 4.0 vision, humans in a smart factory should be equipped with formidable and seamless communication capabilities and integrated into a cyber-physical system (CPS) that can be utilized to monitor and recognize human activity via artificial intelligence (e.g., deep learning). Recent advances in the accuracy of deep learning have contributed significantly to solving human activity recognition problems, but it remains necessary to develop high-performance deep learning models that provide greater accuracy. In this paper, three models, long short-term memory (LSTM), convolutional neural network (CNN), and combined CNN-LSTM, are proposed for the classification of human activities. These models are applied to a dataset collected from 36 persons engaged in 6 classes of activities: downstairs, jogging, sitting, standing, upstairs, and walking. The proposed models are trained using the TensorFlow framework with a hyper-parameter tuning method to achieve high accuracy. Experimentally, confusion matrices and receiver operating characteristic (ROC) curves are used to assess the performance of the proposed models. The results illustrate that the hybrid CNN-LSTM model performs better than either the LSTM or the CNN in the classification of human activities. The CNN-LSTM model provides the best performance, with a testing accuracy of 97.76%, followed by the LSTM with a testing accuracy of 96.61%, while the CNN shows the lowest testing accuracy of 94.51%. The testing loss rates for the LSTM, CNN, and CNN-LSTM are 0.236, 0.232, and 0.167, respectively, while the precision, recall, F1-Measure, and area under the ROC curve (AUC) for the CNN-LSTM are 97.75%, 97.77%, 97.76%, and 100%, respectively.


I. INTRODUCTION
In the domain of Industry 4.0, the interaction between humans/workers in terms of their activities and the physical environment has changed substantially, but remains crucial for the synergetic integration of the intelligent manufacturing assets [1], [2]. Particularly in smart factories, recognizing and classifying human activities helps to evaluate human performance and thus their overall efficiency in production systems. From this perspective, in the era of Industry 4.0, artificial intelligence (AI) plays an important role in recognizing and evaluating human activities [3], [4]. In the last decade, human activity recognition has become a popular topic for research due to its importance in many fields, such as healthcare, sports, and fitness [5]-[10], interactive gaming, human-computer interaction, remote monitoring systems, and smart manufacturing [11].
The associate editor coordinating the review of this manuscript and approving it for publication was Yudong Zhang.
In other applications, wearable accelerometers are used to measure human activity for remote communication between patients and hospitals [12]. However, the low accuracy of these accelerometers is a challenging problem yet to be fully overcome [13], [14]. Many traditional machine learning approaches have been proposed for the accurate recognition of human activity [15]-[18], but they do not always achieve a satisfactory level of accuracy [19], [20].
Deep learning models such as recurrent neural networks (RNNs), long short-term memory networks (LSTMs), and convolutional neural networks (CNNs) present effective solutions to overcome the problem of low accuracy. These models are already useful in such fields as speech recognition [21], image processing [22], and language modelling [23], and are applicable for recognizing human activities.
In the literature, several deep learning models have been introduced for the classification of human activities. In [24], Pienaar and Malekian proposed a design of an LSTM-RNN model for daily life activities. The authors utilized a regularization method to improve the computations needed to process the large WISDM dataset [25] and achieved an overall accuracy of 94%. However, the performance of the developed model was evaluated using only two evaluation metrics, namely the confusion matrix and the learning curve. In [26], Hammarela et al. introduced a bi-directional LSTM model using inertial sensors to classify a large number of human activities. This model was applied to the Opportunity dataset [27] and had an F1-Measure of 92.7%. Cruciani et al. [28] presented a CNN model for human activity recognition that was tested on the UCI-HAR dataset available in [29] and achieved an accuracy of 91.98%. This work was evaluated using a variety of evaluation metrics. In [30], Xia et al. proposed an LSTM-CNN model for human activity recognition. This model was also applied to the large WISDM dataset [25] and achieved a maximum accuracy of 95.85%. However, the computational time consumed by the training phase was considerable. Ordonez and Roggen [31] presented a model with a relatively simple architecture to recognize human activities. This model was a ConvLSTM combination based on seven inertial measurement units (IMUs) and twelve accelerometers. It classified five activities using the Skoda dataset [32] and achieved an F1-Measure of 95.8%.
Alani et al. [33] proposed LSTM, CNN, and CNN-LSTM models to classify imbalanced data for human activity recognition. These models were applied to the SPHERE dataset [34] and achieved accuracies of 92.98%, 93.55%, and 93.67%, respectively. This work dealt with twenty different human activities, but the performance evaluation of the models was limited to a single metric. In [35], Alzantot et al. applied an LSTM model to recognize human activities, which was used to distinguish between synthesized and real data. However, its high number of training parameters led to a complex architecture, and the accuracy achieved was quite low.
Researchers [36]-[39] have presented LSTM and CNN deep learning models to recognize human activities in daily living. In [36], Alsheikh et al. achieved an accuracy of 86.6%, but did not describe which dataset was used for the test. Shakya et al. [37] implemented RNN and CNN models on the Actitracker and Shoaib SA datasets, which were divided randomly, and achieved accuracies of 81.74% and 92.22%, respectively. An LSTM architecture was presented in [38], which achieved an accuracy of 92.1% on an unspecified test dataset split. Mekruksavanich et al. [39] also applied an LSTM model, achieving an accuracy of 96.2% and an F1-Measure of 96.3%.
Agarwal et al. [40] introduced an RNN-LSTM model to recognize human activity, using the WISDM dataset. The authors utilized only two response metrics for the performance evaluation of the model, and achieved an accuracy of 95.78%. Cipolla et al. [41] proposed an LSTM model to classify human activities using the SPHERE dataset. The developed model showed a robust ability to deal with unbalanced classes. The model was applied to five activities and achieved a classification accuracy of 83.2%. Zhao et al. [42] implemented a bi-directional LSTM model to identify human activities, for which a number of sensors were used to collect the datasets. The main drawback of this model was the long time required for training, and thus it was not well suited to real-time applications.
CNNs have gained a lot of attention over the years and are often used in applications of image classification [43], text analysis [44], and natural language processing [45]. Xu et al. [46] trained a CNN model on a randomly chosen 70% of a dataset, and then evaluated the model on the remaining 30%, achieving an accuracy of 91.97%. In [47], Ignatov applied a CNN model for human activity recognition. The accuracy reached 93.32% and 90.42% for the training and testing datasets, respectively. Huang et al. [48] introduced an architecture of two cascaded CNNs and used a cross-validation method to achieve an F1-Measure of 84.6%.
Looking at the reviewed literature, one can argue that although a number of research studies have attempted to develop accurate models for human activity recognition, there is still a wide margin for improvement. In this context, three deep learning models, namely LSTM, CNN, and CNN-LSTM, are proposed in this paper to predict human activities, with the overarching aim of increasing the classification accuracy of the proposed models by introducing a hyper-parameter tuning method. In this regard, the main objective of applying the proposed deep learning models is achieving high accuracy in recognizing human activities. Therefore, a k-fold cross-validation technique is implemented to reach a high testing accuracy. The models are trained and tested on the dataset available from the Wireless Sensor Data Mining (WISDM) Lab [49].
The main contributions of the paper are the following:
• Implementing the LSTM, CNN, and CNN-LSTM models to classify human activities;
• Achieving maximum testing accuracy of the models with a hyper-parameter tuning method on a central processing unit (CPU);
• Evaluating the performance of the three proposed models using different evaluation metrics with the WISDM dataset. Besides, the performance of the LSTM was evaluated using the DaLiAc dataset, which entails a large number of human activities;
• Enhancing and validating the performance of the proposed deep learning models using the k-fold cross-validation technique.
The remainder of the paper is organized as follows: Section II introduces the theoretical background concerning the LSTM and CNN. Section III presents the proposed deep learning models. Section IV describes the evaluation metrics utilized with the proposed models. The experimental results are illustrated in Section V. Section VI discusses the results. Finally, conclusions from the work are drawn in Section VII.
VOLUME 9, 2021

II. THEORETICAL BACKGROUND
A. LONG SHORT-TERM MEMORY (LSTM)
As a special type of recurrent neural network (RNN), LSTM is a popular deep learning approach [50]. RNNs suffer from the vanishing gradient problem [51], [52], which LSTM was devised to solve. LSTM is suitable for processing time sequences. LSTM layers include memory blocks recurrently connected in a memory cell. Figure 1 presents the architecture of an LSTM unit, which consists of a memory cell and three gates: a forget gate, an input gate, and an output gate [53]. The memory cell remembers values over arbitrary time intervals. The three gates accept or reject information passing through the cell. In Figure 1, the forget gate decides which information will be remembered from the previous cell state $C_{t-1}$. This decision is taken via a sigmoid activation function ($\sigma$). The output of this sigmoid is f(t). If this output has a value of 1, the data will be passed into the model; if the output value is 0, the data will not be passed through the model. The inputs of this sigmoid are the current input $x_t$ and the previous hidden state $h_{t-1}$. The input gate decides what new information will be stored in the current cell state $C_t$.
The input gate has a sigmoid activation function to update the cell state. This function has a range from 0 to 1. The output of this sigmoid is i(t), which is multiplied by the output of a tanh activation function, the candidate cell state $\hat{c}(t)$. The result of this multiplication is added to the cell state to obtain the current cell state $C_t$. Finally, the output gate presents the information to the next cell. It also has a sigmoid activation function to determine which parts of the current cell state are output; the cell state is processed by a tanh to obtain values between −1 and 1. Then, these values are multiplied by the output o(t) of the sigmoid to obtain $h_t$, the hidden state output of the current cell. The forget gate, input gate, and output gate are computed using Equations (1), (2), and (3), respectively. The $\hat{c}(t)$, $C_t$, and $h_t$ parameters are estimated from Equations (4), (5), and (6), respectively. Here, $W_f$, $W_i$, and $W_o$ are the weights for the forget gate, input gate, and output gate, respectively. These weights are learned in the training process. $b_f$, $b_i$, and $b_o$ are the bias parameters for the forget gate, input gate, and output gate, respectively.
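Equations (1) to (6) are referenced above but are not reproduced in this text. In the widely used formulation of the LSTM unit they take the form sketched below. The concatenated input $[h_{t-1}, x_t]$, the element-wise product $\odot$, and the candidate-state parameters $W_c$ and $b_c$ are assumptions taken from that standard formulation, not symbols defined in this section.

```latex
\begin{align}
f(t) &= \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \tag{1} \\
i(t) &= \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \tag{2} \\
o(t) &= \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) \tag{3} \\
\hat{c}(t) &= \tanh\left(W_c \cdot [h_{t-1}, x_t] + b_c\right) \tag{4} \\
C_t &= f(t) \odot C_{t-1} + i(t) \odot \hat{c}(t) \tag{5} \\
h_t &= o(t) \odot \tanh\left(C_t\right) \tag{6}
\end{align}
```

Read this way, Equation (5) makes the roles of the gates explicit: f(t) scales how much of the previous cell state is kept, and i(t) scales how much of the candidate state is written.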

B. CONVOLUTIONAL NEURAL NETWORK (CNN)
A CNN is a feed-forward neural network (FNN) that includes an input layer, an output layer, and hidden layers. These hidden layers are represented by convolutional layers combined with max-pooling layers. The convolutional layers are the most significant layers in a CNN. In these layers, a configuration of filters is used to smooth input signals and generate feature maps of a dataset. These maps are activated as a result of a convolution operation with a kernel over the dataset. Figure 2 shows an example of a one-dimensional convolutional neural network architecture. The input layer receives the feature signals $x_1, \ldots, x_t, x_{t+1}$, which are represented by $X_t$. These signals are connected to convolutional layers with a kernel size K. Multiple convolutional layers help to extract features from the input data at greater levels of abstraction. Max-pooling layers placed after the convolutional layers enhance the extracted feature signals by reducing their dimension using a pooling function. The feature map extraction is calculated using Equation (7), where $C_z$ is the feature map of a convolutional layer, $X_t$ is the input feature map, $\sigma$ is the sigmoid activation function, * is the convolutional operator, and $W_{t,z}$ represents the weight vector, which connects the t-th input signal to the z-th feature signal.
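The convolution and max-pooling operations described above can be illustrated with a minimal pure-Python sketch. It is not the implementation used in this work: the input signal and the filter are hypothetical values, no padding is assumed, the sliding product is computed without flipping the kernel (as deep learning frameworks usually do), and a ReLU is used as the activation, as in the models of Section III, rather than the sigmoid named with Equation (7).

```python
def conv1d_valid(signal, kernel, stride=1):
    """Slide `kernel` over `signal` without padding and return the feature map."""
    k = len(kernel)
    return [
        sum(signal[start + j] * kernel[j] for j in range(k))
        for start in range(0, len(signal) - k + 1, stride)
    ]


def relu(values):
    """Set negative activations to zero."""
    return [max(0.0, v) for v in values]


def max_pool1d(values, pool_size, stride):
    """Keep the largest value of each pooling window, reducing the length."""
    return [
        max(values[start:start + pool_size])
        for start in range(0, len(values) - pool_size + 1, stride)
    ]


# Hypothetical one-channel signal and a hypothetical three-tap filter.
signal = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0, -1.0]
kernel = [1.0, 0.0, -1.0]

feature_map = relu(conv1d_valid(signal, kernel))
pooled = max_pool1d(feature_map, pool_size=2, stride=2)
print(feature_map)  # six values: length 8 minus kernel size 3 plus 1
print(pooled)       # three values after pooling with size 2 and stride 2
```

In a trained CNN the kernel values are learned weights and each layer applies many such filters in parallel, one per feature map.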

III. PROPOSED MODELS
The LSTM, CNN, and combined CNN-LSTM models are employed here for classifying human activities. The proposed models are selected because of their proven robust performance. In particular, the LSTM is found to be suitable for processing time series data [54], the CNN is suitable for processing spatial data [55], and the CNN-LSTM is suitable for processing both temporal and spatial data. Specifically, LSTM architectures have been successfully used with time series [56]. These architectures are memory extensions of the RNN, with the advantage that the LSTM avoids the vanishing gradient problem [57]. CNN architectures have been applied to time series to extract significant patterns by reducing noise [58], with the advantage that CNN structures allow the learning of complex input features [59]. Finally, LSTM architectures were combined with CNN architectures [60] to exploit the advantages of both architectures and extract both temporal and spatial features.

A. LONG SHORT-TERM MEMORY (LSTM) MODEL
The proposed LSTM model architecture is shown in Figure 3. This architecture is implemented using the TensorFlow framework. It consists of an input layer, two LSTM layers, and an output layer. The input layer has the shape (number of samples, 90 time steps, 3 features). These features are $a_x$, $a_y$, and $a_z$, the accelerations along the x-, y-, and z-axes, respectively. The two LSTM layers are utilized to extract the temporal features in the data sequence; they are stacked to add depth to the model and increase its stability and accuracy. Each LSTM layer has 32 hidden units [61] and uses a rectified linear unit (ReLU) activation function to increase the robustness of the model. The output layer consists of 6 neurons with a softmax activation function to obtain the classes. In this model, the batch size is set to 64, the number of training epochs to 50, and the learning rate to 0.0025. The learning rate determines the step size of the weight updates and thus how fast the model learns. In addition, the optimizer is the Adam optimization algorithm, which searches for the weights that minimize the error and maximize the training accuracy [62]. Furthermore, a regularization technique, based on a cross-entropy loss function, is implemented to prevent the model from over-fitting [63].
Figure 4 illustrates the proposed CNN model architecture.
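To give a sense of the size of the LSTM model of Section III-A, the following sketch counts its trainable parameters from the layer sizes stated there (3 input features, two LSTM layers of 32 units, 6 output neurons). It is an estimate under assumptions, not a figure reported in this work: it assumes the standard LSTM cell with one bias vector per gate and no peephole connections, and an output layer fed only by the last hidden state of the second LSTM layer. The helper names are hypothetical.

```python
def lstm_parameters(input_dim, units):
    """Parameters of a standard LSTM layer (assumed: biases, no peepholes).

    Each of the three gates and the candidate state has input weights,
    recurrent weights, and a bias vector, hence the factor of four.
    """
    return 4 * (units * (input_dim + units) + units)


def dense_parameters(input_dim, units):
    """Parameters of a fully connected layer: weights plus biases."""
    return input_dim * units + units


first_lstm = lstm_parameters(input_dim=3, units=32)    # fed by a_x, a_y, a_z
second_lstm = lstm_parameters(input_dim=32, units=32)  # fed by the first LSTM layer
output_layer = dense_parameters(input_dim=32, units=6) # six activity classes

total = first_lstm + second_lstm + output_layer
print(first_lstm, second_lstm, output_layer, total)
```

Under these assumptions the model is small, which is consistent with training on a CPU as described in the contributions.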

B. CONVOLUTIONAL NEURAL NETWORK (CNN) MODEL
The model consists of an input layer, a one-dimensional (1-D) convolutional layer, a 1-D max-pooling layer, another 1-D convolutional layer, a flatten layer, a dense layer, and an output layer. Firstly, the input layer receives three channels of acceleration data, $a_x$, $a_y$, and $a_z$. The input shape of the channels has a width of 90, a height of 1, and 3 features. Secondly, the 1-D convolutional layer is applied to each input channel separately with a ReLU activation function. It is utilized to extract spatial features. This layer has 64 filters, a kernel size of 5, and a stride of 1. Thirdly, a max-pooling layer is utilized to reduce the complexity of the convolutional output by performing a down-sampling operation. The max-pooling layer has a pool size of 5 and a stride of 2. Fourthly, the second 1-D convolutional layer has a configuration of 32 filters, a kernel size of 5, and a stride of 1, and uses a ReLU function. This layer is added to enable the model to detect higher-level features that are missed by the first convolutional layer. Fifthly, a flatten layer is used to flatten the output of the convolutional layer. Sixthly, the dense fully connected layer is configured with 6 neurons and a ReLU activation function. Finally, the output layer uses a softmax activation function, which reduces the output to six activity classes. In this model, a cross-entropy loss function is used to measure the error between the predicted and true values, and the optimizer is Adam. This model is trained using a batch size of 64 for 50 training epochs, with a learning rate of 0.0025.
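The effect of the kernel sizes, pool size, and strides listed above on the length of the signal can be traced with a short sketch. The padding mode is not stated in the text, so the sketch assumes no padding ("valid" convolution and pooling); with a different padding the lengths would differ. The function name is hypothetical.

```python
def output_length(length, window, stride):
    """Length after a 1-D convolution or pooling window, assuming no padding."""
    return (length - window) // stride + 1


time_steps = 90                                                # input width
after_conv1 = output_length(time_steps, window=5, stride=1)    # 64 filters
after_pool = output_length(after_conv1, window=5, stride=2)    # still 64 channels
after_conv2 = output_length(after_pool, window=5, stride=1)    # 32 filters
flattened = after_conv2 * 32                                   # input to the dense layer

print(after_conv1, after_pool, after_conv2, flattened)
```

The flatten layer therefore hands a single vector per sample to the 6-neuron dense layer, and the softmax output layer maps it to the six activity classes.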

C. CNN-LSTM MODEL
The proposed CNN-LSTM architecture is presented in Figure 5. It consists of an input layer, two stacked 1-D convolutional layers, two subsequent LSTM layers, a dense layer, and an output layer. The input layer has three channels of acceleration data, $a_x$, $a_y$, and $a_z$. Its input shape has 90 time steps, a height of 1, and 3 features. The two convolutional layers are configured with 32 filters, a kernel size of 5, and a stride of 1. They are followed by the two stacked LSTM layers; 32 hidden units are used for each LSTM. In addition, a dense layer with 6 neurons is applied to pass the classes. All layers use a ReLU nonlinear activation function. Finally, the output layer has 6 neurons to classify the six classes of human activities and uses a softmax activation function to distinguish these classes. In this model, the hyper-parameters are configured to 50 training epochs, a batch size of 64, and a learning rate of 0.0025. The Adam optimizer is utilized to optimize this model, and the loss function is the cross-entropy.
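All three models end in a softmax output layer trained with a cross-entropy loss. The sketch below shows how these two functions turn the six raw output values of a network into class probabilities and a loss value. The logits are hypothetical numbers chosen for illustration, and the class order follows the order in which the activities are listed in this paper.

```python
import math

ACTIVITIES = ["downstairs", "jogging", "sitting", "standing", "upstairs", "walking"]


def softmax(logits):
    """Convert raw scores to probabilities that sum to one."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def cross_entropy(probabilities, true_index):
    """Negative log-probability assigned to the true class."""
    return -math.log(probabilities[true_index])


# Hypothetical raw outputs of the last layer for one 90-step window.
logits = [0.2, 3.1, -1.0, -0.5, 0.4, 1.2]
probabilities = softmax(logits)
predicted = ACTIVITIES[probabilities.index(max(probabilities))]

loss_if_jogging = cross_entropy(probabilities, ACTIVITIES.index("jogging"))
loss_if_sitting = cross_entropy(probabilities, ACTIVITIES.index("sitting"))
print(predicted, round(loss_if_jogging, 3), round(loss_if_sitting, 3))
```

The loss is small when the true class receives a high probability and grows as that probability falls, which is the quantity the Adam optimizer reduces during training.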

D. DATASET DESCRIPTION
The dataset used for the classification is from the Wireless Sensor Data Mining (WISDM) Lab [49], with 1,098,207 samples of various activities including downstairs, jogging, sitting, standing, upstairs, and walking. The sample percentages for these activities are 9.1%, 31.2%, 5.5%, 4.4%, 11.2%, and 38.6%, respectively. The dataset was collected from 36 persons using an Android mobile phone with a built-in accelerometer, placed in a front trouser pocket. The dataset readings are taken with a sampling rate of 20 Hz. The dataset has 6 attributes: user, activity, time, and x-, y-, and z-accelerations. The dataset is divided into a training set (80%) and a test set (20%), with the test set used to evaluate the proposed models.
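The models of Section III take windows of 90 time steps as input, which at 20 Hz corresponds to 4.5 s of movement. The sketch below shows one common way to cut a labelled recording into such windows. The step between windows and the rule for labelling a window are not given in the text, so the step of 20 readings and the majority-vote label are assumptions, and the recording itself is synthetic.

```python
from collections import Counter

SAMPLING_RATE_HZ = 20
WINDOW = 90  # time steps per model input


def segment(readings, labels, window=WINDOW, step=20):
    """Cut a recording into fixed-length windows.

    `step` (assumed) is the offset between consecutive windows; each window
    is labelled with its most frequent activity (assumed rule).
    """
    windows, window_labels = [], []
    for start in range(0, len(readings) - window + 1, step):
        windows.append(readings[start:start + window])
        majority = Counter(labels[start:start + window]).most_common(1)[0][0]
        window_labels.append(majority)
    return windows, window_labels


# Synthetic recording: 200 readings of (a_x, a_y, a_z), walking then jogging.
readings = [(0.1 * i, 0.0, 9.8) for i in range(200)]
labels = ["walking"] * 100 + ["jogging"] * 100

windows, window_labels = segment(readings, labels)
window_seconds = WINDOW / SAMPLING_RATE_HZ
print(len(windows), window_seconds, window_labels)
```

Each element of `windows` has the shape (90 time steps, 3 features) expected by the input layers described above.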
The WISDM dataset is chosen because it contains a considerable variety of daily life activities and provides sufficient data, 1,098,207 samples, for training the proposed models. The dataset is split into 80% for training and 20% for testing using the Scikit-learn framework. A probability sampling method is utilized for a random distribution of the dataset, which offers a robust training process for the models and thus enhances their performance [64]. Figure 6 demonstrates a sample of the acceleration data collected from a single person over a period of time. In this figure, the three raw acceleration signals $a_x$, $a_y$, and $a_z$ are illustrated in terms of the gravitational acceleration (g). The blue trace represents the acceleration in the x-direction, $a_x$, while the green and orange traces represent $a_y$ and $a_z$, the accelerations in the y- and z-directions, respectively. These signals are observed for ...
... FP means that a model wrongly predicts the positive class, and FN means that a model wrongly predicts the negative class. The confusion matrix is also used to determine the ability of a model to classify multiple classes accurately, using the testing dataset to compare each predicted class with the true class.
The first metric, Accuracy, represents the ratio of correct predictions to the total predictions of the testing dataset. It is calculated using Equation (8). The second metric, Precision, is the ratio of correct predictions of a specific class activity to the total predictions of the same class in the testing dataset. Precision is mathematically represented in Equation (9). The third metric is the Recall, the ratio of accurately classified positives of a specific class to the total number of true class activities in the test dataset. Recall is also known as Sensitivity and can be calculated using Equation (10). The fourth metric is the F1-Measure, which is the harmonic mean of the precision and recall, i.e., twice their product divided by their sum, see Equation (11). It is also known as the balanced F1-Score. It takes both false positives (FP) and false negatives (FN) into consideration. Thus, it can be a more informative evaluation metric than the accuracy, particularly when the classes are imbalanced. The worst possible value of all four metrics is 0%, and the best is 100%.
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{10}$$
The area under the ROC curve (AUC) is also used as an evaluation metric for the models. It is defined as the integral of the true positive rate (TPR) with respect to the false positive rate (FPR). TPR, FPR, and AUC are described in Equations (12), (13), and (14). The AUC value is always between 0 and 1. A high value of AUC implies that a model is capable of distinguishing between classes of human activity. So, the model with the largest area under the ROC curve is the best model for a classification. The TPR is the ratio of the number of true positives (TP) to the sum of true positives and false negatives (TP + FN). The FPR is the ratio of the number of false positives (FP) to the sum of false positives and true negatives (FP + TN). FPR is also known as 1-Specificity.
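The four metrics defined above can be read directly off a confusion matrix. The following sketch computes the overall accuracy and the per-class precision, recall, and F1-Measure, assuming that rows hold the true classes and columns the predicted classes. The 3-class matrix is a hypothetical example, not a result from this work.

```python
def classification_metrics(confusion):
    """Return the accuracy and a (precision, recall, F1) tuple per class.

    `confusion[i][j]` is the number of samples of true class i predicted as j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    accuracy = sum(confusion[c][c] for c in range(n_classes)) / total

    report = []
    for c in range(n_classes):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                               # missed samples of class c
        fp = sum(confusion[r][c] for r in range(n_classes)) - tp  # wrongly assigned to class c
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report.append((precision, recall, f1))
    return accuracy, report


# Hypothetical confusion matrix for three classes.
confusion = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 0, 10],
]
accuracy, report = classification_metrics(confusion)
print(accuracy)
for precision, recall, f1 in report:
    print(round(precision, 3), round(recall, 3), round(f1, 3))
```

In the multi-class setting each class is treated in turn as the positive class, with all other classes together forming the negative class.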
The utilized evaluation metrics ''accuracy, precision, recall, F1-Measure, and area under the curve'' are chosen due to their effectiveness in analyzing and evaluating the performance of the proposed models [66].
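As an illustration of the AUC, the sketch below builds an ROC curve for one class by sweeping a decision threshold over predicted scores and then integrates the TPR over the FPR with the trapezoidal rule. The scores and labels are hypothetical, and the sketch assumes that no two scores are equal; library routines handle ties more carefully.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points for a binary problem with distinct scores.

    `labels` holds 1 for the positive class and 0 for the negative class.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:  # lowering the threshold admits one sample at a time
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal integration of TPR with respect to FPR."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical scores for one activity class against all others.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
auc = area_under_curve(points)
print(round(auc, 4))
```

An AUC of 1 means that every positive sample is scored above every negative sample, while 0.5 corresponds to a random ranking.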

V. EXPERIMENTAL RESULTS
Experimentally, the proposed models are implemented using the Python programming language in a Jupyter Notebook environment. The models are executed on a personal computer with the Microsoft Windows 10 operating system, an Intel Core i3 processor with 4 CPUs at 2.2 GHz, and 4 GB of RAM. Figure 7 shows the training and testing accuracy of the models. Table 1 illustrates the training and testing loss rates of the models. For the LSTM, the training and testing loss rates are 0.219 and 0.236, respectively, for the CNN they are 0.181 and 0.232, respectively, and for the CNN-LSTM they are 0.154 and 0.167, respectively, showing that the CNN-LSTM achieves the lowest loss rates for training and testing.
Figure 8 shows the two curves for training and test accuracy of the LSTM model. The training curves are also called learning curves. The orange curve is the training accuracy, which continuously changes and reaches a maximum value of 96.98% after 50 epochs. The blue curve represents the test accuracy, which starts from 86.06% and reaches 96.61% after 50 epochs. Figure 9 shows the loss rate curves of the LSTM model based on the training and test sets. The loss rate continuously decreases with an increasing number of epochs. In the case of the training set, the loss rate reaches 0.219 after 50 epochs, while the loss rate for the test set starts at 0.612 and declines to 0.236.
Figure 10 shows the training and testing accuracy for the CNN over 50 epochs. The training (orange) curve starts at 79.93% and reaches 95.79%. Similarly, the test (blue) curve starts at 80.56% and increases to 94.51%. Figure 11 represents the training and testing loss for the CNN. The training loss starts from 0.585 and decreases to 0.181 after 50 epochs, while the test loss curve starts at 0.577 and decreases to 0.232.
Figure 12 illustrates the training and testing accuracy for the CNN-LSTM. The training (orange) curve starts at 86.37% and reaches 98.23% after 50 epochs.
The test (blue) curve starts from 86.70% and rises to 97.76%.
Figure 14 depicts the confusion matrix obtained from the LSTM model. The model predicts a class label, the ''predicted label'', which is compared to the true label in the test dataset. The diagonal values of the matrix indicate the correct classifications, while the values above and below the diagonal illustrate the errors that occurred. Here, the confusion matrix shows 884, 3465, 587, 480, 1013, and 4181 true positives for the six activities: downstairs, jogging, sitting, standing, upstairs, and walking, respectively. Figure 15 shows the confusion matrix of the CNN model on the test dataset, in which 10379 instances are correctly classified. The predicted activity matched the true class for downstairs, jogging, sitting, standing, upstairs, and walking in 817, 3422, 585, 456, 1034, and 4065 cases, respectively. The confusion matrix for the CNN-LSTM model is presented in Figure 16. It is clear from this figure that the CNN-LSTM model performs very well, with few error values occurring below and above the diagonal of the matrix.
Table 2 summarizes the evaluation metrics in terms of the precision, recall, and F1-Measure of the three models, to evaluate their relative performance on the WISDM dataset. The precision, recall, and F1-Measure for the LSTM are 96.57%, 96.61%, and 96.57%, respectively. For the CNN model, they are 94.83%, 94.51%, and 94.61%, respectively, while the CNN-LSTM achieved 97.75%, 97.77%, and 97.76%, respectively. Figure 17 shows a comparison of these metrics for the proposed models.
ROC curves are shown in Figures 18, 19, and 20. These curves are probability curves used to evaluate the models and show the true positive rate (TPR) on the ordinate axis and the false positive rate (FPR) on the abscissa. Figure 18 illustrates the ROC curves for the LSTM model for each activity. Downstairs and upstairs activities have AUC values of 0.99, and jogging, sitting, standing, and walking activities achieve an AUC of 1.0. Figure 19 demonstrates the ROC curves per class for the CNN model. In this figure, the highest value is that of the standing class, which achieves an AUC of 1.0, and the lowest is that of the upstairs class, which achieves an AUC of 0.86. The CNN model achieved an AUC of 0.91 for downstairs and 0.99 for each of jogging, sitting, and walking. Figure 20 illustrates the ROC curves for the CNN-LSTM, and we see that the AUC values for all activities are 1.00. This result indicates that the CNN-LSTM model achieves better performance than the LSTM and CNN models.
Figure 21 illustrates the precision-recall curves for the LSTM model for each activity. Downstairs ''class 0'', jogging ''class 1'', sitting ''class 2'', standing ''class 3'', upstairs ''class 4'', and walking ''class 5'' achieve area values of 0.938, 0.999, 0.997, 0.995, 0.956, and 0.998, respectively. The area under the micro-average precision-recall curve is 0.994. The micro-average is obtained by pooling the predictions of all six classes (from class 0 to class 5) into a single curve; it therefore weights the classes by their number of samples and differs from the simple mean of the six per-class areas, which is about 0.981. Figure 22 demonstrates the precision-recall curves per class for the CNN model. In this figure, the highest value is that of the jogging activity ''class 1'', which achieved an area of 0.991, and the lowest is that of the upstairs activity ''class 4'', which achieved an area of 0.387. The CNN model achieved an area of 0.612 for the downstairs activity ''class 0'' and 0.975 for the sitting activity ''class 2''. Also, this model has areas of 0.981 and 0.986 for the standing activity ''class 3'' and the walking activity ''class 5'', respectively, and the average area for the precision-recall curve is 0.886. Figure 23 shows the precision-recall curves for the CNN-LSTM, in which the area values for class 0, class 1, class 2, class 3, class 4, and class 5 are 0.972, 0.999, 0.998, 0.987, 0.972, and 0.999, respectively. The area under the micro-average precision-recall curve is 0.996. This value means that the CNN-LSTM model achieves better performance than both the LSTM and CNN models.
Because ROC and precision-recall curves are comprehensive metrics, which take all statistical values ''TP, TN, FP, and FN'' into account, they are used in the evaluation of the proposed models. In addition, both curves enable an easy and immediate visual diagnosis of the models' behavior. Both curves are summarized by the area under them, where a greater area means a better classifier, and the areas under the ROC curves are used to compare the usefulness of the models.
Table 3 shows the performance of the proposed models using the k-fold cross-validation technique. In this technique, the dataset is divided into 5 equal partitions (4 partitions are utilized for training and one partition for validation), where k denotes the number of partitions and equals 5 in this case. The proposed models are trained 5 times, each time using a different partitioning of the dataset. ... ±0.50%, ±0.41%, and ±0.32%, respectively. Therefore, the proposed CNN-LSTM model has the best performance, with the minimum standard deviation. Moreover, the k-fold cross-validation technique improved the models' performance and avoided bias in the performance results via a convenient division of the training and testing datasets.
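The 5-fold procedure described above can be sketched as follows: the sample indices are shuffled once, cut into five partitions, and each partition serves once as the validation set while the other four are used for training. The per-fold accuracies at the end are hypothetical placeholders for the scores a trained model would return; they are not results of this work, and the function name and seed are assumptions.

```python
import random
import statistics


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (training, validation) index lists for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k near-equal partitions
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


splits = list(k_fold_indices(n_samples=100, k=5))
for training, validation in splits:
    print(len(training), len(validation))

# Hypothetical accuracies, one per fold, used to show the mean ± std summary.
fold_accuracies = [0.970, 0.980, 0.975, 0.972, 0.981]
mean_accuracy = statistics.mean(fold_accuracies)
std_accuracy = statistics.stdev(fold_accuracies)
print(f"{mean_accuracy:.4f} ± {std_accuracy:.4f}")
```

Reporting the mean together with the standard deviation over the folds shows how sensitive a model is to the particular division of the data.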

VI. DISCUSSION
In Figure 7, the proposed models are compared in terms of their accuracy. The results demonstrate that the proposed CNN-LSTM has the greatest accuracy and the lowest loss rate for training and testing. Results from the ROC curves and confusion matrices demonstrate that accurate classification of human activities can be achieved with the proposed model. Figures 8, 10, and 12 present the performance of the proposed models in terms of the accuracy of training and testing using the WISDM dataset. Figures 14, 15, and 16 show the distributions of errors across the classes. Figure 17 presents a comparison of the three models in terms of precision, recall, and F1-Measure. The micro-average precision-recall curve has an area of 99.6% for the CNN-LSTM model. Also, the AUC is 100% for the CNN-LSTM model for all six activities, a better performance than either the LSTM or CNN models. Further, from Table 3, the k-fold cross-validation technique is utilized to verify the performance results for the proposed models in terms of the mean accuracy, precision, recall, and F1-Measure at k = 5.
Finally, Table 4 presents a comparison of the accuracy between this work and previously published works. We see that the accuracy achieved by the CNN-LSTM model, 97.76%, is better than that of the models reported in [24], [26], [28], [30], [31]. The better performance of the proposed models can be attributed to the fine-tuning of the hyper-parameters of the models, which include the number of training epochs, loss and activation functions, learning rate, batch size, dropout rate, optimizer type, and number of neurons for the utilized layers in the proposed models.
In particular, for the CNN, when the batch size, learning rate, and training epochs number were set to 32, 0.0025, and 10, respectively, and using a softmax activation function, the CNN model achieved a testing accuracy of 93.87%, but in case the batch size was changed to 12, the learning rate was adjusted to 0.00, the training epochs were reconfigured to 5, and the same activation function was used, the CNN model' testing accuracy reached 92.3%.
Also, for the LSTM and the CNN-LSTM, when the batch size, learning rate, and number of training epochs were set to 128, 0.0050, and 10, respectively, with a sigmoid activation function, the two models achieved testing accuracies of 87.47% and 86.3%, respectively. When the batch size was changed to 256, the learning rate to 0.001, and the number of training epochs to 5, with the same activation function, the testing accuracies of the two models reached 88.44% and 89%, respectively. Thus, the optimization and proper setting of these parameters significantly enhance the results. Moreover, the best performance of the three models was achieved with a batch size of 64, 50 training epochs, and a learning rate of 0.0025, using the Adam optimization algorithm and a regularization technique based on a cross-entropy loss function.
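To make the roles of the loss function, the learning rate, and the optimizer concrete, the following minimal sketch computes a softmax cross-entropy loss and performs one Adam update on a single scalar parameter. It is illustrative only and is not the TensorFlow code used in this work. The batch size, number of epochs, and learning rate are the values reported above; the names `BEST_SETTINGS`, `cross_entropy`, and `adam_step`, and the Adam constants `beta1`, `beta2`, and `eps`, are assumptions (the commonly published defaults), since the text does not state them.

```python
import math
from typing import List, Tuple

# Best-performing settings reported in the text; the dictionary itself is hypothetical.
BEST_SETTINGS = {
    "batch_size": 64,
    "epochs": 50,
    "learning_rate": 0.0025,
    "optimizer": "adam",
    "loss": "cross_entropy",
}


def cross_entropy(logits: List[float], true_class: int) -> float:
    """Softmax cross-entropy for one sample, in a numerically stable form.

    Equivalent to -log(softmax(logits)[true_class]).
    """
    shift = max(logits)
    log_sum_exp = math.log(sum(math.exp(z - shift) for z in logits))
    return log_sum_exp - (logits[true_class] - shift)


def adam_step(
    param: float,
    grad: float,
    m: float,
    v: float,
    t: int,
    lr: float = BEST_SETTINGS["learning_rate"],
    beta1: float = 0.9,    # assumed default, not stated in the text
    beta2: float = 0.999,  # assumed default, not stated in the text
    eps: float = 1e-8,     # assumed default, not stated in the text
) -> Tuple[float, float, float]:
    """One Adam update of a scalar parameter at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad         # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad * grad  # second-moment estimate
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v
```

With six equally scored classes, the loss equals ln 6, which is the value expected from an untrained classifier on the six activities. On the first Adam step the update magnitude is approximately the learning rate itself, which is one reason the learning rate has such a direct effect on training behavior.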
The hyper-parameters of the proposed models are tuned using the GridSearchCV method, which automatically searches a specified set of candidate values for the combination that gives the best performance of the proposed models.

Additionally, the proposed LSTM model is applied to the Daily Life Activities (DaLiAc) dataset [67] to compare its performance more broadly with models previously reported in the literature. Figure 24 shows the normalized confusion matrix for the LSTM model using the DaLiAc dataset. Table 5 presents a classification report of the LSTM in terms of precision, recall, and F1-Measure. For the DaLiAc dataset, the testing accuracy achieved by the LSTM is 99.17%, while an accuracy of 98.9% was reported when a CNN model was trained on the DaLiAc dataset [68]. So, one can argue that the LSTM model proposed in this paper has the highest accuracy compared to the state-of-the-art methods reported in [68]- [70].
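The exhaustive search that underlies grid-search tuning of hyper-parameters can be sketched as follows. This is a hand-rolled, standard-library stand-in written for illustration; it is not scikit-learn's GridSearchCV and not the code used in this work. The function `grid_search`, the grid values, and `toy_score` are hypothetical. In practice, `score_fn` would train a model with the given settings and return its cross-validated accuracy.

```python
from itertools import product
from typing import Any, Callable, Dict, Sequence, Tuple


def grid_search(
    param_grid: Dict[str, Sequence[Any]],
    score_fn: Callable[[Dict[str, Any]], float],
) -> Tuple[Dict[str, Any], float]:
    """Evaluate every combination in param_grid and return the best one.

    score_fn maps a parameter dictionary to a score where higher is better.
    """
    names = list(param_grid)
    best_params: Dict[str, Any] = {}
    best_score = float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Illustrative grid built from values mentioned in the text;
# the grid actually searched in this work is not given.
PARAM_GRID = {
    "batch_size": [32, 64, 128, 256],
    "learning_rate": [0.001, 0.0025, 0.005],
    "epochs": [5, 10, 50],
}


def toy_score(params: Dict[str, Any]) -> float:
    """Placeholder score, NOT a model of real accuracy.

    It only provides a deterministic function to search over, peaking at
    batch_size=64, learning_rate=0.0025, epochs=50.
    """
    return -(
        abs(params["batch_size"] - 64) / 64
        + abs(params["learning_rate"] - 0.0025) / 0.0025
        + abs(params["epochs"] - 50) / 50
    )


best_params, best_score = grid_search(PARAM_GRID, toy_score)
```

The cost of this approach grows with the product of the list lengths (here 4 x 3 x 3 = 36 evaluations), each of which would involve a full training run when a real model is scored.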

VII. CONCLUSION
This paper has presented an implementation of three deep learning models for the recognition of daily human activities. In the experiments conducted to test the proposed models, the LSTM, CNN, and CNN-LSTM achieved testing accuracies of 96.61%, 94.51%, and 97.76%, respectively. The hyper-parameters of the three models, i.e., the loss and activation functions, the optimizer type, learning rate, number of training epochs, dropout rate, batch size, and number of neurons for the utilized layers, are found to significantly affect the accuracy of the proposed models, and the high performance achieved is highly correlated with the optimal settings applied.
Additionally, the performance of the three models is evaluated in terms of accuracy, precision, recall, and F1-Measure using the k-fold cross-validation technique.
Therefore, in the context of Industry 4.0, these models can accurately recognize human activities in a smart factory environment. In the future, the proposed models can be applied to more datasets, and their training performance can be compared in a graphics processing unit (GPU) environment. One could also build new deep learning models, such as the gated recurrent unit (GRU).
The dataset used and the code developed are available online at: https://www.kaggle.com/drsaeedmohsen/wisdm dataset2021

AHMED ELKASEER is currently a Senior Research Associate at the Institute for Automation and Applied Informatics (IAI), Karlsruhe Institute of Technology (KIT), Germany. He received the Ph.D. degree from the Cardiff School of Engineering, Cardiff University, U.K., in 2011. He has more than 15 years' experience of research in advanced manufacturing technologies. He has been working on different EC and EPSRC funded research projects. His work entails performing experimental and laboratory work, modeling, simulation and optimization-based studies of mechanical, and EDM and laser processing of advanced materials on conventional and micro scales, with a recent emphasis on additive and smart manufacturing and Industry 4.0 applications. His studies have led to several publications in the area of conventional and advanced micro and nano-manufacturing technologies. He has been invited to serve as an editorial board member and a reviewer for a number of journals, and to act as a scientific committee chair and a program committee member for a number of international conferences.

STEFFEN G. SCHOLZ is currently the Head of the Research Team 'Process optimization, Information management and Applications,' part of the Institute for Automation and Applied Informatics (IAI), Karlsruhe Institute of Technology. He is also an Honorary Professor at Swansea University, U.K., and an Adjunct Professor at the Vellore Institute of Technology, India. He is also the Principal Investigator in the Helmholtz funded long-term programs 'Digital System Integration' and 'Printed Materials and Systems.' He has more than 20 years of experience in the field of system integration and automation, sustainable flexible production, polymer micro- and nano-replication, process optimization and control, with a recent emphasis on additive manufacturing and Industry 4.0 applications. In addition to pursuing and leading research, he has been very active with knowledge transfer to industry. He has been involved in over 30 national and international projects, and he has won in excess of 20M EUR in research grants, in which he has acted as a co-ordinator and/or a principal investigator. His academic output includes more than 150 technical articles and five books. He is a chair of different international conferences within the scope of advanced and sustainable manufacturing technologies.

VOLUME 9, 2021