Two-Stage Human Activity Recognition Using 2D-ConvNet



I. Introduction
Recognition of human activities has been an active research field in computer vision for more than two decades, and researchers are still working in this domain because no fully reliable human activity recognition system is yet available. Still images provide less information for action recognition than videos. Videos supply temporal information as an additional cue, which is an important indicator for action recognition. A large number of different activities may be correctly identified from the motion component found in videos.
Action recognition is a key component of many applications such as automatic video surveillance [2]-[6], object detection and tracking [7], video retrieval [8], etc. Other applications are closely connected to activity and action recognition, such as human motion analysis [9]-[15], analysis of dynamic scene activities [16], classification of human actions [17], and understanding human behavior [18]. Human activity recognition comprises various steps, which define the features that represent low-level activities. The activities of interest and their details may vary depending on the application. For example, over the last few years the Automatic Teller Machine (ATM) has become one of the prime facilities for cash dispensing, cash withdrawal, balance enquiries, etc. For this reason the ATM has become an unsafe site. Addressing ATM security requires an intelligent video surveillance system that not only captures the scene information at the time of an abnormality, but also recognizes abnormal single-limb and multi-limb human activities, so that the system can warn the security-in-charge in real time and corrective action can be taken when the abnormal activity happens.
A basic model of human activity recognition in video frame sequences consists mainly of two levels. In the first level, handcrafted features are extracted from the raw input data, and in the second level a classifier model is built on these features. Some of the most frequently used feature detectors for human activity recognition include Histogram of Optical Flow (HOF), Spatial-Temporal Interest Points (STIP), Histogram of Oriented Gradients (HOG), and dense trajectories [43]. However, extracting these features is a difficult and time-consuming process, and it is challenging to know which kind of feature is relevant because feature selection varies from problem to problem. Therefore a deep learning based model is proposed and discussed in the following sections to reduce the dependence on handcrafted features and the complexity of this process.
Recently, deep learning has emerged as a family of learning models built on deep architectures that produce high-level abstractions of data. A deep learning model is a systematic arrangement of multiple layers that are organized for automatic learning of features. Every layer in the deep architecture receives the output of the previous layer and applies a nonlinear transformation, so that the input data are transformed progressively from low-level features into higher-level features. The most common types of deep learning models are convolutional neural networks, recurrent neural networks, autoencoders, deep belief networks, etc. For labeled input data the deep learning model is trained by supervised learning, and for unlabeled data it is trained by unsupervised learning. Owing to its outstanding performance in areas such as bio-signal recognition, gesture recognition, computer vision, and bioinformatics, deep learning can also be deployed for human activity recognition.

II. Related Work
Action recognition in video frames, image sequences, or still images has become an active research area over the past several years. Since it is not possible to discuss the complete literature on action recognition, the focus here is on action recognition in (a) RGB video frames and (b) depth frames using deep neural networks. With the increasing demand for computer vision, recent research has shifted towards applying convolutional neural networks (CNNs) to activity recognition because they are able to learn spatio-temporal information [20], [21], [22] from videos. Li et al. [23] used a CNN for recognition of human activity captured in a smartphone dataset. Activity recognition is not the only field where CNNs have achieved strong results; they also perform well in areas such as human facial recognition [24], image recognition [25]-[27], and human pose estimation [28].
The first work on the MSR3DAction dataset was given by L. Xia et al. in [29]: a histogram of 3D joints (HOJ3D) is computed, reprojected using linear discriminant analysis, and clustered into k visual words, and the temporal evolution of these visual words is modeled using a hidden Markov model (HMM). L. Xia et al. validated their approach on the MSR UTKinect-Action 3D joints dataset and achieved 90.92% accuracy. The recently proposed work in [30] used CNNs, Long Short-Term Memory (LSTM) units, and a temporal-wise attention model for action recognition. Learning visual features with CNNs [25] has shown benefits over hand-crafted features for recognition in still images [31]-[33]; moreover, it overcomes the limitations of the manual feature extraction process. Modified CNNs for recognition of activities in video frames have been suggested in various approaches [20], [22], [26], [34]-[39]. A couple of methods used single video frames with spatial features [20], [26], [34]. Multi-channel inputs to 2-dimensional convolutional neural networks have also been used in [20], [26], [37], [39]. In [26] the authors separated the temporal and spatial parts of the video using optical flow computed from RGB frames. Each part was fed into a separate deep convolutional neural network to learn spatial and temporal features of the appearance and movement of objects in a frame. Moreover, activity recognition can be performed not only from spatio-temporal information: state-of-the-art video representations in [40]-[42] also use dense point trajectories. The first such work was given in [43], which samples dense points in each frame and tracks them using displacement information from a dense optical flow field. The method proposed in [43] was validated on four benchmark datasets: KTH, YouTube, Hollywood2, and UCF-sports. An improved trajectory-based approach was given by the motion-based histogram [44], which is calculated on the vertical and horizontal components of optical flow [50].
However, recent CNN approaches use a 2D-convolution architecture on video, which allows learning shift-invariant representations of the image scene. 2D convolution is unable to incorporate the temporal depth of the video, which varies with time and is an important cue for activity recognition from the beginning to the end of an activity. 3D convolution addresses this issue by incorporating the spatio-temporal information of videos and provides a natural extension of 2D convolution. A different way to deal with spatio-temporal features is suggested in [45], where the authors factorized the original 3D convolution into 2D spatial convolution in the lower layers followed by one-dimensional temporal convolution in the upper layers. The proposed framework was tested on two benchmark datasets (UCF-101 and HMDB-51) and outperformed existing CNN-based methods.
Liu et al. [46] proposed an approach for human activity recognition that combines RGB and depth data using a coupled hidden conditional random field, rather than a plain conditional random field (CRF), because a CRF cannot capture intermediate hidden states. Liu et al. evaluated their method on three datasets, DHA, UTKinect-Action 3D, and TJU, obtaining 95.9%, 92%, and 92.5% accuracies, respectively. Zhao et al. [47] proposed a multi-model architecture using a 3D spatio-temporal CNN and SVM on raw depth sequences, depth motion maps, and human 3D skeleton body joints for recognition of human actions. The approach was validated on the UTKinect-Action 3D and MSR-Action 3D datasets and gave 97.29% and 94.15% accuracies, respectively. Vemulapalli et al. [48] modeled the 3D geometry of different human body parts using translation and rotation in 3D space and tested the approach on the UTKinect-Action 3D dataset, achieving 97.08% accuracy. Siirtola et al. [51] discussed a personal human activity recognition model using incremental learning and smartphone sensors. The authors of [51] also discussed how human activity recognition systems have changed since 2012 and how such methods can be used in healthcare applications. Jalal et al. [52] suggested a novel framework for human behavior modeling using 3D human body posture captured from a Kinect sensor, based on clue parameters. In [52] the authors extracted human silhouettes from noisy data and tracked body joint values by considering spatio-temporal information, motion information, and frame differentiation. Angular direction, invariant, and spatio-temporal velocity features were then extracted in order to find the clue parameters. Finally, the clue parameters were mapped into code-words and human behavior was recognized using an advanced Hidden Markov Model (HMM). The approach was tested on three benchmark depth datasets, IM-DailyDepthActivity, MSRDailyActivity3D, and MSRAction3D, and obtained 68.4%, 91.2%, and 92.9% accuracies, respectively.
Simonyan et al. [26] used two convolutional neural networks for two streams, one for spatial features and another for temporal features, and validated the proposed system on two benchmark datasets, UCF-101 and HMDB-51. A limitation of this method [26] is that the complete dataset was applied to both the spatial and temporal streams at once, covering all 101 classes of UCF-101 and all 51 classes of HMDB-51; it achieved 88% and 59.4% accuracies, respectively. There may be scope to improve recognition accuracy by applying our two-stage methodology. Similarly, our proposed approach can be applied to any other challenging large-scale action recognition dataset.

III. Motivation
In our proposed work, we use a novel multi-stage classification framework based on the color information retrieved from an RGB camera in an intelligent video surveillance system. Specifically, we use two-stage classification. The advantage of two-stage classification is that a large, complex problem can be handled better in real time, since it is difficult to train a network when the number of classes is high.
Therefore in this approach we divide the classes into two categories, handled at successive levels. In two-stage classification the first stage is also known as coarse-level classification and the second stage is called fine-level classification. The two-stage classification is shown in Fig. 1. Wang et al. [30] used a CNN-LSTM based attention model for human action recognition on the UCF-101 dataset and achieved 84.10% accuracy. Similarly, Karpathy et al. [20] used a CNN on a sports dataset for sport action recognition and obtained 63.3% accuracy. Both Wang et al. [30] and Karpathy et al. [20] performed classification over the complete dataset in a single step, which is a limitation of the work in [20] and [30]. Applying our proposed two-stage approach to the datasets used in [30] and [20] might yield better results. Our approach divides the classes into multiple levels and reduces the loss first at the initial level and then at subsequent levels. By splitting the dataset in this multi-level manner, we achieved good accuracy compared with the state-of-the-art literature.
Our proposed method may be applied to large-scale human action datasets such as UCF-101 [22], [34], [36]-[37] and HMDB-51 [26], [38], [39], [41]-[42] by dividing the whole dataset into a downward hierarchy of levels, which leads to a reduction in losses at each level. Our proposed method is applied and validated on the UTKinect-Action 3D dataset, giving promising results compared with previous literature. In this work, a 2D convolutional neural network is used owing to its excellent performance on object detection, activity classification, and recognition. It can automatically learn features from the input video data. The work recently published in [19] used a stack of 3D-CNNs on spatio-temporal activation re-projection (STAR-Net) using RGB information. Most previous approaches have utilized hand-crafted features, whose extraction is time consuming. Therefore this work proposes a two-stage DNN-based model to recognize human activities. In the first stage we divide all activity types into two categories based on human limbs: single-limb and multi-limb activities. In the second stage two 2D convolutional neural networks are used for activity recognition.
For the experiments, the UTKinect-Action3D dataset has been used, which contains the following ten human activity classes: wave hands, pull, walk, sit down, push, stand up, throw, pick up, carry, and clap hands. Based on our proposed approach we have assigned sit down, stand up, pull, push, and throw to the single-limb category, and walk, pick up, carry, wave hands, and clap hands to the multi-limb category. Table I shows some state-of-the-art techniques with the datasets used, their accuracies, the number of classes, and their applications.
In summary, the main contributions of this paper are as follows:
• This paper proposes a two-stage activity recognition framework for RGB information. In the first stage human activities are distinguished based on single-limb and multi-limb categories.
• In the second stage two Deep-CNN models are used to recognize the separated single-limb and multi-limb human activities.
The paper is arranged as follows: Section II contains the related work on activity recognition in RGB videos, depth data, and skeleton information; Section III contains the motivation of the proposed work; Section IV describes our suggested method for activity recognition; experimental results and discussion are given in Sections V and VI, respectively; and finally the conclusion and future work are described in Section VII.

IV. Proposed Approach
To recognize human single-limb and multi-limb activity sequences, this paper proposes a novel framework based on the color information retrieved from an RGB camera in an intelligent video surveillance system. The proposed work proceeds in two stages. In the first stage the input data are preprocessed by resizing to a new scale; histogram of oriented gradients (HOG) features are then extracted from all input frames, and the activities are classified into two categories, single-limb and multi-limb, using a random forest (RF) classifier. In the second stage, a separate 2D convolutional neural network is applied to each category to recognize the actual activity type within the single-limb or multi-limb category. The flow diagram of the proposed two-stage human activity recognition is depicted in Fig. 2.
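The two-stage control flow described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: `two_stage_predict` and the three toy callables are hypothetical stand-ins for the trained Random Forest and the two 2D-ConvNets, and the class grouping is the one given in Section III.

```python
from typing import Callable, Sequence, Tuple

# Grouping of the ten UTKinect-Action3D classes as described in Section III.
SINGLE_LIMB = ("sit down", "stand up", "pull", "push", "throw")
MULTI_LIMB = ("walk", "pick up", "carry", "wave hands", "clap hands")


def two_stage_predict(
    frame: Sequence[float],
    coarse: Callable[[Sequence[float]], str],
    fine_single: Callable[[Sequence[float]], str],
    fine_multi: Callable[[Sequence[float]], str],
) -> Tuple[str, str]:
    """Route a frame through the coarse stage, then the matching fine stage."""
    category = coarse(frame)  # expected to be "single" or "multi"
    if category == "single":
        return category, fine_single(frame)
    if category == "multi":
        return category, fine_multi(frame)
    raise ValueError(f"unknown coarse category: {category!r}")


# Toy stand-ins (hypothetical); real models would be trained classifiers.
def toy_coarse(frame):
    return "single" if sum(frame) < 1.0 else "multi"


def toy_fine_single(frame):
    return SINGLE_LIMB[0]


def toy_fine_multi(frame):
    return MULTI_LIMB[0]


result = two_stage_predict([0.2, 0.3], toy_coarse, toy_fine_single, toy_fine_multi)
print(result)  # ('single', 'sit down')
```

Keeping the two fine-stage models behind separate callables mirrors the design choice in the text: each second-stage network only has to separate five classes instead of ten.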

A. Preprocessing and Feature Extraction
Resizing: The RGB frames collected from the MSR UTKinect Action 3D dataset have a size of 480x640. The proposed approach reads all single-limb and multi-limb activity sequences and resizes the raw color frames to a new size (80x120) using the OpenCV library.
Histogram of Oriented Gradients: HOG features are widely used in motion-based applications and activity recognition systems. The gradients are calculated over the preprocessed image. We divide the image into cells of 8x8 pixels. The gradient values of each cell are accumulated into a 9-bin histogram. To estimate the feature value, a block size of [16,16] is taken. A [16,16] block contains 4 cell histograms, which form a one-dimensional vector of size 36. A detection window of size [16,16] is moved over the image with a stride of [8,8], giving 14 horizontal and 9 vertical positions, for a total of 126 positions. To calculate the final feature vector of an entire image, all 126 block vectors of size 36 are concatenated into one large vector of dimension 4536 (126 * 36). The final feature vector F can be defined using (1),
$$F = [d_1, d_2, d_3, \ldots, d_k] \tag{1}$$
where $d_1, d_2, d_3, \ldots, d_k$ are the dimensions and $k = 4536$.
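The feature length quoted above can be checked with a few lines of arithmetic. The sketch below assumes the 80x120 frame is 80 pixels high and 120 pixels wide, and that blocks are placed densely with the stated stride; the helper name `hog_feature_length` is introduced here purely for illustration.

```python
def hog_feature_length(height, width, cell=8, block=16, stride=8, bins=9):
    """Return (blocks_x, blocks_y, feature_length) for a dense HOG layout."""
    blocks_x = (width - block) // stride + 1   # horizontal block positions
    blocks_y = (height - block) // stride + 1  # vertical block positions
    cells_per_block = (block // cell) ** 2     # 2 x 2 = 4 cells per block
    per_block = cells_per_block * bins         # 4 * 9 = 36 values per block
    return blocks_x, blocks_y, blocks_x * blocks_y * per_block


blocks_x, blocks_y, length = hog_feature_length(height=80, width=120)
print(blocks_x, blocks_y, length)  # 14 9 4536
```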

B. Initial Classification using Random Forest
Random Forest is an ensemble-based classifier proposed by Breiman [1] in 2001. RF trains many decision trees on bootstrap samples of the training data and aggregates their votes (bagging), which increases accuracy over a single tree. The basic principle of Random Forest is that it takes its decision based on de-correlated decision trees. It can be used for multi-class classification. An RF model is non-parametric in nature. For an ensemble of classifiers C1(θ), C2(θ), ..., Cn(θ), and a training set drawn at random from the distribution of the random vector (P, Q), the margin function can be defined as in (2),
$$mg(P, Q) = \mathrm{av}_k\, I\big(C_k(P) = Q\big) - \max_{i \neq Q} \mathrm{av}_k\, I\big(C_k(P) = i\big) \tag{2}$$
where $I(\cdot)$ is the indicator function and $\mathrm{av}_k$ denotes the average over the classifiers in the ensemble. The margin measures the extent to which the average number of votes at (P, Q) for the right class exceeds the average vote for any other class. In our work, we have used the RF classifier to distinguish two classes, i.e. single-limb or multi-limb activity. A HOG feature vector has been computed for each video frame in an activity, and this final feature vector (F) of dimension 4536 has been used to train the classifier.
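As a small worked example of the margin described above, the sketch below computes the vote share for the true class minus the largest vote share among the other classes. The function name `margin` and the vote counts are illustrative assumptions, not values from the experiments.

```python
from collections import Counter


def margin(votes, true_label, labels):
    """Vote share of the true class minus the best share among other classes."""
    n = len(votes)
    counts = Counter(votes)  # missing labels count as zero votes
    share_true = counts[true_label] / n
    share_other = max(counts[c] / n for c in labels if c != true_label)
    return share_true - share_other


# Hypothetical ensemble of 10 trees voting on one frame.
votes = ["single"] * 7 + ["multi"] * 3
m = margin(votes, "single", ["single", "multi"])
print(round(m, 2))  # 0.4
```

A positive margin means the ensemble votes for the right class; the larger the margin, the more confident the vote.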

C. 2D-CNN Classifier Based Activity Recognition
A 2D-CNN or ConvNet is a type of deep neural network often used for video analysis, recognition, and image classification. ConvNet is a biologically inspired deep network whose connectivity pattern between neurons resembles that of the animal visual cortex.
A CNN is a sequence of layers in which every layer transforms an input volume of activations into another form. Three main layer types are used to build a ConvNet model: the convolution layer, the pooling (sub-sampling) layer, and the fully connected layer. These layers are stacked one after another to build the ConvNet architecture. A CNN requires minimal preprocessing compared to other classification algorithms, and automatic feature detection is its considerable advantage. In this work, we have used 2D-ConvNet classifiers to recognize activities that belong to the single-limb and multi-limb classes. Two 2D-ConvNet models have therefore been trained using the categorical cross-entropy (CCE) objective function. The architecture of the 2D-ConvNets for single-limb and multi-limb activities is given in Fig. 3.
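To make the three layer types concrete, the following pure-Python sketch chains one convolution, a ReLU nonlinearity, and 2x2 max pooling, then flattens the result as the input to a fully connected layer. It is not the architecture of Fig. 3: the 6x6 toy image, the 3x3 kernel, and the use of ReLU are assumptions chosen only to show how the activation volume changes shape from layer to layer.

```python
def conv2d_valid(image, kernel):
    """Valid 2D cross-correlation, the operation a convolution layer applies."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(fmap):
    """Element-wise nonlinearity applied after the convolution."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool2x2(fmap):
    """Non-overlapping 2x2 max pooling (an odd trailing row/column is dropped)."""
    return [
        [max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
         for c in range(0, len(fmap[0]) - 1, 2)]
        for r in range(0, len(fmap) - 1, 2)
    ]


image = [[float((r + c) % 3) for c in range(6)] for r in range(6)]  # toy 6x6 frame
kernel = [[1.0, 0.0, -1.0]] * 3                                     # toy 3x3 filter
features = max_pool2x2(relu(conv2d_valid(image, kernel)))           # 6x6 -> 4x4 -> 2x2
flat = [v for row in features for v in row]  # vector fed to a fully connected layer
print(len(features), len(features[0]), len(flat))  # 2 2 4
```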

CCE based 2D-ConvNet:
The network has C output values, one for each class of activity sequence. The categorical cross-entropy (CCE) for C classes is defined in (4),
$$CCE = -\frac{1}{N} \sum_{j=1}^{N} \sum_{k=1}^{C} \mathbb{1}_{y_j \in C_k} \log \mathrm{Prob}_{model}[y_j \in C_k] \tag{4}$$
where N represents the number of samples. Each sample belongs to one category k (the total number of categories is C). $\mathbb{1}_{y_j \in C_k}$ is the indicator function of the jth sample belonging to the kth category, and $\log \mathrm{Prob}_{model}[y_j \in C_k]$ is the log of the predicted probability of the jth sample belonging to the kth category.
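A minimal numeric sketch of the categorical cross-entropy follows. It assumes the loss is averaged over the N samples and that labels are given as integer class indices, so the indicator function simply selects the predicted probability of the true class; the function name, the `eps` clamp, and the toy probabilities are illustrative assumptions.

```python
import math


def categorical_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean categorical cross-entropy over N samples with integer labels."""
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        # The indicator keeps only the term for the true class of this sample.
        total -= math.log(max(probs[label], eps))  # eps guards against log(0)
    return total / len(y_true)


# Two toy samples over C = 3 classes (made-up probabilities).
y_true = [0, 2]
y_prob = [[0.5, 0.25, 0.25], [0.25, 0.25, 0.5]]
loss = categorical_cross_entropy(y_true, y_prob)
print(round(loss, 4))  # 0.6931
```

Both samples assign probability 0.5 to the correct class, so the loss equals log 2; a perfect prediction would give a loss of zero.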

V. Experimental Work
First, the dataset used in this study is described. The results are then evaluated by splitting the dataset into an 80% training set and a 20% test set using the train-test split method. For training and validation of our framework, we used the public UTKinect-Action dataset. The model was trained on an Intel Core i5 7th Gen 2.4 GHz processor with 8 GB of RAM and a 2 GB NVIDIA 940MX GPU, running the Ubuntu 16.04 LTS (Linux) operating system.

A. UTKinect-Action3D Dataset
To validate our proposed framework we used a dataset containing 10 types of indoor human activity sequences. The dataset was captured using a stationary Kinect sensor with the Kinect for Windows SDK Beta version at a frame rate of 30 fps. The Kinect sensor has an effective capture range of about 4 to 11 feet. The dataset contains 10 activities, each performed twice by 10 different persons (9 male and 1 female; one person is left-handed and the rest are right-handed), for a total of 199 activity sequences. The label of the carry activity performed the second time is not given, hence the frames for this activity cannot be identified. The 10 activity sequences are wave hands, pull, walk, sit down, push, stand up, throw, pick up, carry, and clap hands. All these activity sequences are provided in three different formats: RGB frames, depth frames, and skeleton information. All actions in the dataset take place in an indoor scenario. The number of frames for each activity ranges from 5 to 120. The resolution of the RGB and depth images is 480x640. The total number of frames is 5869 for the 199 activity sequences. The proposed framework performs activity recognition only on RGB frame sequences; the other formats are mentioned only to describe the dataset. A few RGB frames and their corresponding depth images are shown in Fig. 4.

B. Activity-Type Recognition using Random Forest
To recognize single-limb and multi-limb types of activities we trained the random forest classifier from the Scikit-Learn Python library. The classification was performed while varying the number of decision trees (n_estimators) from 1 to 100. An accuracy of 99.92% was recorded for the classification of single-limb and multi-limb activity types at n_estimators = 46. The confusion matrix in Fig. 5 shows that all samples are classified correctly except one sample from the single-limb class.

C. Activity Recognition using 2D-ConvNet
Two 2D-ConvNet classifiers have been trained, one for single-limb and one for multi-limb activities, using the output of the initial Random Forest classification with the help of the feature vector F. For single-limb activities, the network has been trained with the categorical cross-entropy objective function with a learning rate of 10^-3 and a decay of 5x10^-6. The architecture of the 2D-ConvNet is given in Fig. 3; the learning curve of the network for single-limb activities is shown in Fig. 6(a) and that for multi-limb activities in Fig. 6(b).
It can be seen from the learning curve for single-limb activities in Fig. 6(a) that, after 146 epochs, there is no further change in the validation results. This network has therefore been selected as the best network. An accuracy of 97.9% has been recorded in the recognition of single-limb activities, as shown in Fig. 7(a). The confusion matrix for the classification of single-limb activities is given in Fig. 8(a). Recognition performance for each activity class is shown in Fig. 9(a), where accuracies vary from 89% to 100%; 100 percent accuracy has been recorded for the pull, stand up, and throw activities. The sit down activity also has approximately 100 percent recognition.
The learning curve for the multi-limb activities is shown in Fig. 6(b). The best network is found after 42 epochs, because there is no further change in the validation results beyond that point. An accuracy of 98% has been recorded in the recognition of multi-limb activities, as given in Fig. 7(b). The confusion matrix of multi-limb activities is given in Fig. 8(b). Recognition performance for every individual class of activities is depicted in Fig. 9(b). A 100 percent recognition rate is achieved for the clap hands activity.
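Selecting the best network from a learning curve can be sketched as below. The text does not state which validation quantity was monitored, so this example assumes validation loss; the helper `best_epoch` and the loss history are hypothetical and only illustrate picking the first epoch after which the curve stops improving.

```python
def best_epoch(val_losses, min_delta=0.0):
    """Return (1-based epoch, loss) of the first epoch reaching the lowest loss."""
    best_idx, best_val = 0, val_losses[0]
    for idx, val in enumerate(val_losses):
        # Only a strict improvement larger than min_delta replaces the best epoch.
        if val < best_val - min_delta:
            best_idx, best_val = idx, val
    return best_idx + 1, best_val


# Made-up validation losses: the curve flattens after the fourth epoch.
history = [0.90, 0.61, 0.47, 0.45, 0.45, 0.46]
epoch, loss = best_epoch(history)
print(epoch, loss)  # 4 0.45
```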

VI. Results and Discussions
This section presents and discusses the experimental results. In this paper we adopted two-stage classification, in which results are obtained in two phases called coarse-level and fine-level classification. The first-stage classification divides the activity types into two categories, single-limb and multi-limb. The second-stage classification recognizes the actual type of activity within each category. Finally, a random train-test split has been used to validate our proposed approach.

A. First Stage Classification Results
In the first-level classification, human activity types are categorized into two types, single-limb and multi-limb activities. This classification eases the process of recognition at the second level. In this phase classification has been performed by a Random Forest classifier while varying the number of trees in the forest. The confusion matrix in Fig. 5 shows that all test samples have been classified correctly into single-limb and multi-limb classes except for one sample, giving a classification accuracy of 99.92%.

B. Second Stage Classification Results
In the second stage, the actual recognition of individual activities takes place. Recognition has been performed for both the single-limb and multi-limb categories. The overall recognition accuracy for single-limb activities is 97.9%. The throw, stand up, and pull classes have been recognized 100 percent correctly, and for the remaining two activities, push and sit down, the obtained accuracies are 89% and 99%, respectively. The overall recognition accuracy for multi-limb activities is 98%: the clap hands activity is recognized with 100% accuracy, and the remaining classes carry, pick up, walk, and wave hands have 96%, 93%, 98%, and 98% accuracies, respectively. Fig. 10 also shows that the overall accuracy of the system is 97.88%, which is better than some of the previous state-of-the-art results.
The proposed approach improves on existing methods in that a two-stage strategy divides a problem with many classes into sub-problems, whose individual classes are recognized at the next level down; real-time recognition is difficult when a large number of classes is considered at once. Hence, performance on problems with many classes may be improved by dividing the classes into two or more levels. The recognition performance of multi-level classification depends on the training losses incurred at each level. Therefore we try to minimize the training loss at each level of classification. For example, the authors of [49] performed a classification task on an EEG signals dataset by considering the whole dataset at once and obtained 39.34% accuracy; with a two-stage strategy (coarse-to-fine classification) they obtained 85.20% accuracy at the coarse level and 65.03% at the fine level, for a reported combined accuracy of 57.11%, an improvement of 17.77 percentage points over single-stage classification.

C. Error Analysis
The proposed human activity recognition system works in two stages. In the first stage, the system classifies single-limb and multi-limb activities with an accuracy of 99.92%; only one sample, whose actual class was single-limb, was predicted as multi-limb.
In the next stage, when the individual activities within each category are recognized, 11% of the push activity in the single-limb category was erroneously recognized as pull, owing to the similarity of the captured video frames during these two activities. In addition, 1% of the sit down activity was recognized as stand up, owing to the high degree of similarity between the two activities.
In the multi-limb category, an error of 4% was found for the carry activity, with 1% misclassified as pick up and 2% as walk, because the carry activity is to some extent a combination of the walk and pick up activities. Similarly, a misclassification error of 7% was found for the pick up activity, with 5% of instances predicted as carry and 2% as walk, because in the pick up video frames the subject is carrying some goods and some pick up frames are similar to the walk activity. An error of 2% was found for the walk activity, which the model predicted as carry because both activities contain motion information. Finally, a 2% misclassification was found for the wave hands activity, with 1% predicted as walk and the remaining 1% as another class, owing to the similarity between them.
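Per-class error rates of this kind are read off the rows of a confusion matrix. The sketch below shows the computation on a made-up two-class matrix whose counts were chosen to mirror the 11% push-to-pull confusion; the counts, the class subset, and the helper names are illustrative assumptions, not the values behind Fig. 8.

```python
def per_class_accuracy(labels, matrix):
    """Diagonal of the confusion matrix divided by each row total (rows = true class)."""
    return {
        label: row[i] / sum(row)
        for i, (label, row) in enumerate(zip(labels, matrix))
    }


def overall_accuracy(matrix):
    """Correctly classified samples divided by all samples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


labels = ["push", "pull"]
matrix = [
    [89, 11],   # 100 hypothetical "push" frames, 11 predicted as "pull"
    [0, 100],   # 100 hypothetical "pull" frames, all predicted correctly
]
acc = per_class_accuracy(labels, matrix)
print(acc["push"], acc["pull"], overall_accuracy(matrix))  # 0.89 1.0 0.945
```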

D. Comparison with State-of-art
We compared the classification accuracy of our proposed system with the approaches reported in previous work. The comparison is shown in Table II. Our method achieved the highest accuracy among the methods given in Table II. In [29] the authors represented human posture with a histogram of 3D joints (HOJ3D) as a novel descriptor and obtained 90.92% classification accuracy, while our two-stage strategy produces 97.88% on the same dataset. Liu et al. [46] used both RGB and depth information of human activities, fused this information with a coupled hidden conditional random field model, and obtained 92% accuracy. Zhao et al. [47] fused raw depth sequences, depth motion maps, and RGB information and applied 3DSTCNN with SVM for human action recognition. The approach in [47] produces 97.29% accuracy on the UTKinect-Action 3D dataset and 94.15% accuracy on the MSR-Action 3D dataset. Vemulapalli et al. [48] used the 3D geometry of different body parts, based on translation and rotation in 3D space, and obtained 97.08% accuracy on the UTKinect-Action 3D dataset. Our proposed approach therefore compares favorably on the UTKinect-Action 3D dataset with the methods given in Table II.
Method                            Accuracy
Xia et al. [29]                   90.92%
Liu et al. (2015) [46]            92.00%
Zhao et al. (2019) [47]           97.29%
Vemulapalli et al. (2014) [48]    97.1%
Proposed Approach                 97.88%
Based on the comparison with the previous state-of-the-art results in Table II, our proposed approach has the following advantages:
• The multi-stage method eases the classification task by replacing the large training loss of one complex problem with lower training losses at the different levels.
• Complexity of the system may be reduced by using multiple stages.
• Better recognition using initial and subsequent levels.
• Suitable for human-computer interaction applications.
Although our proposed multi-stage strategy gives good results compared to the state-of-the-art results in Table II, it also has a limitation: the system produces good results only if the classification accuracy at the initial stage is high.
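This dependence on the first stage can be illustrated with a rough estimate. The sketch assumes that a sample is recognized correctly only if both stages classify it correctly and that the two stages err independently, so the end-to-end accuracy is approximated by the product of the stage accuracies. This is a simplification introduced here, not the procedure used to obtain the reported 97.88%; the 0.90 first-stage value is a hypothetical degraded case.

```python
def end_to_end_accuracy(coarse_acc, fine_acc):
    """Approximate accuracy when a sample must pass both stages (independent errors)."""
    return coarse_acc * fine_acc


# Stage accuracies reported in the paper: 99.92% coarse, 97.9% / 98% fine.
single = end_to_end_accuracy(0.9992, 0.979)
multi = end_to_end_accuracy(0.9992, 0.98)

# Hypothetical weaker first stage with the same second-stage network.
degraded = end_to_end_accuracy(0.90, 0.979)

print(round(single, 4), round(multi, 4), round(degraded, 4))  # 0.9782 0.9792 0.8811
```

Under this assumption the first-stage accuracy acts as a ceiling: no second-stage network can recover samples that were routed to the wrong category.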

VII. Conclusion
This paper presents a novel framework to recognize human single-limb and multi-limb activities using video frames. This framework facilitates the analysis of human limb activities in real time. The recognition process is carried out in two stages.
First, a Random Forest classifier has been used to separate input activities into two classes, single-limb and multi-limb. In the second stage, two 2D convolutional neural network classifiers have been trained to recognize the separated activities using a sequence classification based approach. The UTKinect-Action dataset of 199 activity sequences has been used to evaluate the proposed framework. An accuracy of 99.92% was achieved using the Random Forest classifier. An overall accuracy of 97.88% has been recorded by our system for both types of activity classes. The major components of the proposed approach are real-time operation, computation of HOG features, and classification. The experimental results show the advantage of deep convolutional neural networks in activity recognition. This work also demonstrates the advantages of using RGB information to recognize human activity types. In future work, depth frames and skeleton joint data may be combined with RGB information to form a larger body of data and yield a more robust approach for human activity recognition.

Fig. 10. Comparative performance analysis between recognition rates of single-limb and multi-limb human activities along with complete system performance.

TABLE I. Some State-of-the-Art Approaches with Their Datasets and Accuracy

TABLE II. Performance Comparison with Other Methods on UTKinect-Action3D Dataset