Human Activity Recognition Using Gaussian Mixture Hidden Conditional Random Fields

In healthcare, analyzing patients' activities is an important source of information for delivering better services and managing illness. Most human activity recognition (HAR) systems depend heavily on the recognition module/stage. Our focus on the recognition stage is motivated by the lack of improvement in the learning method. In this study, we propose the use of hidden conditional random fields (HCRFs) for the human activity recognition problem. Moreover, we contend that the existing HCRF model is limited by independence assumptions, which may reduce classification accuracy. Therefore, we use a new algorithm that relaxes this assumption, allowing our model to use full-covariance distributions. We also show that our method has much lower computational complexity than existing methods. For the experiments, we used four publicly available standard datasets. We used a 10-fold cross-validation scheme to train, assess, and compare the proposed model with the conditional learning method, the hidden Markov model (HMM), and the existing HCRF model, which can only use diagonal-covariance Gaussian distributions. In the experiments, the proposed model showed a substantial improvement in classification accuracy, with p value ≤0.2.


Introduction
In real-life environments, the analysis of human activities plays a significant role in many applications. These include vision-based human/object detection and recognition in areas such as tracking and detection [1,2], computer engineering [3], physical sciences [4], health-related issues, natural sciences, and industrial and academic areas [5]. Several authors [6][7][8][9][10][11] recognized human activities in indoor environments using different methodologies. However, their systems relied on stable environments, such as fixed camera and lighting settings, and most of the activities were performed according to an instructor's directions. Similarly, the authors of [10][12][13][14] proposed different methods to recognize daily human activities in outdoor environments. However, most of the datasets they used have a static background, which is a common drawback of these systems. Different sensors were also utilized by the authors of [15][16][17] to classify indoor and outdoor human activities.
Moreover, in telemedicine and healthcare, the role of human activity recognition (HAR) can be illustrated by the case of physically disabled persons. A patient paralyzed on one side of the body by a stroke is unable to walk, and one way to recover is through daily exercises. Doctors normally recommend daily exercises (activities) to stroke patients to improve their health. A HAR system can be trained to identify the activities performed by stroke patients, so that doctors can easily monitor the improvement in the patients' health.
There are four modules in a typical HAR system: preprocessing (segmentation), feature extraction, feature selection, and recognition, as shown in Figure 1. Most existing works [18][19][20][21][22][23] focused on feature extraction and selection; however, very little work has been done on the recognition module. Some studies exploited conventional techniques [24][25][26][27][28]. Among them, the HMM is one of the best candidates for activity recognition; however, the HMM is generative in nature and less precise than its counterpart, the HCRF model [29]. Our focus on the recognition stage is motivated by the lack of improvement in the learning method. Therefore, we make the following contributions:
(i) The existing HCRF model is limited by independence assumptions, which may reduce classification accuracy. Therefore, the first objective of this study is to propose a recognition model with a new algorithm that relaxes this assumption, allowing our model to use full-covariance distributions.
(ii) Another objective of the work is to show that our method has much lower computational complexity than existing methods. In the training phase, our goal is to find parameters that maximize the conditional probability of the training data. We therefore use the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) method to search for the optimal point. However, instead of repeating the forward and backward algorithms to compute the gradients as others did [30], we run the forward and backward algorithms only when calculating the conditional probability and then reuse the result to compute the gradients. As a result, the computation time is significantly reduced.
(iii) A comprehensive set of experiments yielded a weighted average classification rate of 97%, which is an improvement over the state-of-the-art methods.
The rest of the paper is organized as follows: Section 2 presents related works and their limitations. Section 3 describes the proposed recognition model and its advantages. Section 4 describes the experimental setup for the proposed model on four datasets. Based on this setup, a series of experiments is presented in Section 5. Finally, Section 6 concludes the paper with some future directions.

Related Works
In a typical HAR system, various recent segmentation methods are used in the preprocessing module to extract the human body from the activity frame. This process helps to improve the performance of the activity recognition system. Accordingly, the authors of [31][32][33][34][35][36] utilized recent methods to segment the human body from the video frames. Similarly, for feature extraction, various recent methodologies have been employed to help the classifiers accurately classify human activities (see the workflow in Figure 1) [37][38][39][40][41][42]. They showed good performance on different datasets, and most of them achieved an average accuracy between 70 and 90%.
Regarding recognition, researchers have proposed diverse systems that exploit various classifiers such as the Gaussian mixture model (GMM) [43,44], artificial neural network (ANN) [45,46], and support vector machine (SVM) [47][48][49][50]. These classifiers were principally employed for frame-based classification. In contrast, in many HAR systems [37,51,52], the well-known hidden Markov model (HMM) has been used extensively for sequence-based classification. For frame-level features, HMMs have an advantage over vector-based classifiers like SVM, GMM, and ANN because they handle sequential data effectively. However, the Markovian property implied in the traditional HMM assumes that the current state is a function of the past state only. This implies that the labels of two adjacent states in the observation sequence are assumed to occur in succession. In practice, this assumption is often not satisfied. Besides, the generative nature of the HMM and the independence assumptions between observations and states also limit its performance [29]. To overcome these limitations, the maximum entropy Markov model (MEMM) was proposed, which performs comparatively better than the HMM [53]. However, the MEMM suffers from the well-known disadvantage termed the "label bias problem".
Computational Intelligence and Neuroscience
Two generalized models of MEMM, known as conditional random fields (CRFs) [29] and HCRF [54], were developed to fix the "label bias problem" [29]. To learn the hidden structure of sequential data, the HCRF extends the CRF with hidden states. However, in both models, the per-state normalization is replaced with global normalization, permitting weighted scores, which in turn results in larger parameter spaces than those of the HMM and MEMM.
For example, when a CRF is applied in a HAR system, the observed frames of a video are represented by the feature vector U, the resulting label V, and the unknown state label K.
Suppose that in an image labeling problem the original labels are K, the image features are U, and the model parameter is Λ; then the posterior probability post(K | U; Λ) maximized by the CRF is given as
where the normalization factor is
Some issues in HCRF implementation are reviewed and analyzed in the following description. The posterior probability of the CRF in (1) is replaced in an HCRF model by post(K | U; Λ), which is the sum of exponentials of latent functions over all expected labels L, as given below
The above equations guarantee the sum-to-one rule of the conditional probability. V′ is a possible label for the series of frames, L = l_1, l_2, ..., l_T is a series of hidden states l_i, i = 1, 2, ..., T, and in equations (1) and (2) these take values from 1 to Q (the number of states). Λ is the parameter vector, and f(V, L, U) is a feature vector that determines which parameters the model learns. The feature vector therefore determines the form of the existing HCRF model. For example, the following choices yield an HCRF with a Markov constraint and a Gaussian distribution at every state:
where each v ∈ V is an expected label and each u ∈ U is an observation vector. The per-component square of the observation vector v at state t (i.e., v_t) is given as
It can be seen that, for a certain set of parameters Λ, the HCRF sum is similar to the hidden Markov model. For instance, with the abovementioned feature vector, if we choose
where b in (6) is the prior distribution of a Gaussian HMM and C in (7) is the transition matrix, then the numerator of the conditional probability can be written as
In the above equation, N represents the Gaussian distribution. Equation (11) shows that the conditional probability of U given V is computed as in a Gaussian HMM with prior distribution b and transition matrix C.
Moreover, the authors of [30] proposed a more general form of the HCRF model that handles complex distributions using a linear combination of Gaussian density functions, which is given as
In equation (12), M indicates the number of components in the Gaussian mixture.
Many works based on the abovementioned HCRF have reported good performance [55,56]; however, most of them did not consider the limitations of the model. It is clear from the aforementioned equations that the existing model employs diagonal-covariance Gaussian distributions, which means that the variables (columns of u_i, i = 1, 2, ..., N) are presumed to be pairwise independent. On the other hand, equations (8)-(10) suggest that, with a specific set of values, each state observation density will converge to a Gaussian. Unfortunately, no training method has yet been designed to guarantee this convergence, and these assumptions might decrease accuracy. Therefore, we propose an improved version of the HCRF technique that can explicitly employ a full-covariance Gaussian mixture in the feature function. The proposed model retains the benefits of the hidden conditional random field model while addressing the drawbacks of the previous method.
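The effect of the diagonal-covariance assumption can be illustrated with a small sketch. The code below is only an illustration under assumed toy values (a 2-D feature space, unit variances, correlation 0.8); the function name log_gauss_2d is hypothetical and is not part of the proposed model.

```python
import math

def log_gauss_2d(x, mean, cov):
    """Log-density of a 2-D Gaussian with covariance [[a, b], [b, c]]."""
    (a, b), (_, c) = cov
    det = a * c - b * b
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Quadratic form d^T cov^{-1} d using the closed-form 2x2 inverse.
    quad = (c * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    return -0.5 * (2.0 * math.log(2.0 * math.pi) + math.log(det) + quad)

mean = (0.0, 0.0)
full_cov = ((1.0, 0.8), (0.8, 1.0))  # assumed: strongly correlated features
diag_cov = ((1.0, 0.0), (0.0, 1.0))  # same variances, correlation dropped

along = (1.0, 1.0)     # agrees with the positive correlation
against = (1.0, -1.0)  # contradicts the positive correlation

# Difference in log-density between the two points under each model.
diag_gap = log_gauss_2d(along, mean, diag_cov) - log_gauss_2d(against, mean, diag_cov)
full_gap = log_gauss_2d(along, mean, full_cov) - log_gauss_2d(against, mean, full_cov)
```

Under the diagonal model the two points are indistinguishable (diag_gap is zero), whereas the full-covariance model assigns a clearly higher density to the point that agrees with the correlation.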

Proposed Methodology
3.1. Feature Extraction. In our previous work, we utilized the symlet wavelet [37] for extracting various features from the activity frames. There are a number of reasons why the symlet wavelet produces relatively better classification results. These include its capability to extract the salient information from the activity frames in terms of frequency and its support for characteristics of grayscale images such as orthogonality, biorthogonality, and reverse biorthogonality. For a given support size, the symlet has the highest number of vanishing moments and the least asymmetry.

Proposed Hidden Conditional Random Fields (HCRFs) Model.
As described earlier, the current Gaussian mixture HCRF model cannot use full-covariance distributions and does not guarantee the convergence of its parameters to the specific values for which the conditional probability is modeled as a mixture of normal density functions.
To address these limitations, we explicitly involve a mixture of Gaussian distributions in the feature functions as illustrated in the following forms:
then,
where N represents the number of density functions, Γ carries the relevant information of the entire set of observations, D indicates the dimension of the observation, and Γ^Obs_{l,m} is the mixing weight of the m-th component with mean μ_{l,m} and covariance matrix Σ_{l,m}.
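For reference, the standard textbook form of a full-covariance Gaussian mixture density is sketched below. This is a generic form, not a reproduction of the numbered equations of this paper; the normalized weights w_{l,m} are an assumed notation, and their exact relation to the weights Γ^Obs_{l,m} is not specified here.

```latex
\[
\mathcal{N}(u;\,\mu_{l,m},\Sigma_{l,m}) =
\frac{1}{(2\pi)^{D/2}\,\lvert\Sigma_{l,m}\rvert^{1/2}}
\exp\!\Big(-\tfrac{1}{2}\,(u-\mu_{l,m})^{\top}\Sigma_{l,m}^{-1}(u-\mu_{l,m})\Big)
\]
\[
p(u \mid l) = \sum_{m=1}^{M} w_{l,m}\,\mathcal{N}(u;\,\mu_{l,m},\Sigma_{l,m}),
\qquad w_{l,m}\ge 0,\quad \sum_{m=1}^{M} w_{l,m}=1
\]
```

When Σ_{l,m} is diagonal, the exponent reduces to a sum of per-dimension terms; the off-diagonal entries are what allow the density to capture correlations between feature dimensions.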
As indicated in equation (14), by changing parameters such as Γ, μ, and Σ, we can build a combination of normal densities. The resultant conditional probability can be written as
therefore,
The forward and backward algorithms are used to calculate the conditional probability based on equations (19) and (20), which can be written as
To maximize the conditional probability of the training data, we first focus on calculating the parameters (Λ, Γ, μ, and Σ). In the proposed approach, the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) method is used to search for the optimum point. Unlike other models [30], the forward and backward algorithms are run only to compute the conditional probability, and their results are reused to find the gradients. This significantly reduces the computation time.
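The idea of running the forward and backward passes once and reusing their results can be sketched as follows. This is a minimal illustration for a single label with generic nonnegative potentials, not the authors' implementation; the names forward_backward and brute_force_Z and all numeric values are hypothetical.

```python
from itertools import product

def forward_backward(prior, trans, obs):
    """Run one forward and one backward pass over unnormalized potentials.

    prior[q], trans[p][q], and obs[t][q] are nonnegative scores.
    Returns (Z, gamma, xi): the partition sum, per-frame state marginals,
    and pairwise marginals for consecutive frames.
    """
    T, Q = len(obs), len(prior)
    alpha = [[0.0] * Q for _ in range(T)]
    beta = [[1.0] * Q for _ in range(T)]
    for q in range(Q):
        alpha[0][q] = prior[q] * obs[0][q]
    for t in range(1, T):
        for q in range(Q):
            alpha[t][q] = obs[t][q] * sum(alpha[t - 1][p] * trans[p][q] for p in range(Q))
    for t in range(T - 2, -1, -1):
        for p in range(Q):
            beta[t][p] = sum(trans[p][q] * obs[t + 1][q] * beta[t + 1][q] for q in range(Q))
    Z = sum(alpha[T - 1])
    gamma = [[alpha[t][q] * beta[t][q] / Z for q in range(Q)] for t in range(T)]
    xi = [[[alpha[t][p] * trans[p][q] * obs[t + 1][q] * beta[t + 1][q] / Z
            for q in range(Q)] for p in range(Q)] for t in range(T - 1)]
    return Z, gamma, xi

def brute_force_Z(prior, trans, obs):
    """Sum the path score over every state sequence (for checking only)."""
    T, Q = len(obs), len(prior)
    total = 0.0
    for path in product(range(Q), repeat=T):
        score = prior[path[0]] * obs[0][path[0]]
        for t in range(1, T):
            score *= trans[path[t - 1]][path[t]] * obs[t][path[t]]
        total += score
    return total

# Assumed toy potentials: 2 states, 3 frames.
prior = [0.6, 0.4]
trans = [[0.7, 0.3], [0.2, 0.8]]
obs = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
Z, gamma, xi = forward_backward(prior, trans, obs)

# Expected transition counts are weighted sums of the cached xi values,
# so no further forward or backward pass is needed to obtain them.
expected_transitions = [[sum(xi[t][p][q] for t in range(len(xi))) for q in range(2)]
                        for p in range(2)]
```

The expectations that appear in the gradients (state occupancies, transition counts, and weighted observation statistics) can all be accumulated from gamma and xi, which is why a single forward and backward pass per sequence suffices in this sketch.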
At the observation level, we incorporated the full-covariance matrix in the feature function as shown in (16). Equation (17) may be used to obtain the normal distribution, which is further elaborated in the following equations:
The d Score function is a gradient function for a variable of the prior probability vector:
The d Score function is a gradient function for a Gaussian mixture weight variable. Here, a function V(t) can be determined as
The d Score function is a gradient function for the Gaussian distribution mean:
The d Score function is a gradient function for the covariance of the Gaussian distributions.
Equations (24)-(27) presented above describe an analytical method for calculating the gradients with respect to the feature function, the mean and covariance of the Gaussian distributions, and the prior probability vector, the transition probability vector, and the observation probability vector obtained from the existing HCRF.
In our model, the recognition of a variety of real-time activities is divided into two steps: a training step and an inference step. In the first step, data with known labels are input to train the hidden conditional random field model. In the inference step, the inputs to be estimated are classified based on the parameters determined in the training phase.
If an activity frame is the input in the training step, then, in the preprocessing step, the applied distinctive lighting effects are reduced for detecting and extracting faces from the activity frames. Then, the movable features are extracted from the various facial parts to create the feature vector. After that, the resulting feature vector serves as the input to the full-covariance Gaussian mixture hidden conditional random field model of the proposed recognition model.
As mentioned earlier, the feature gradients are generally determined by the L-BFGS approach in the training phase of the HCRF model. However, in the existing gradient calculation technique, the forward and backward algorithms are called repeatedly, which requires an exceptionally high computational time and thus reduces the computational speed. Another analytical approach has been formulated that reduces the number of calls to the forward and backward algorithms using five gradient functions determined by equations (24)

Datasets Used.
In this work, we employed four open-source standard action datasets, namely, the Weizmann action dataset [57], KTH action dataset [58], UCF sports dataset [59], and IXMAS action dataset [60], to validate the performance of the proposed HCRF model. All the datasets are described below.

Weizmann Action Dataset.
This dataset consists of 10 actions such as bending, running, walking, skipping, place jumping, side movement, jumping forward, two hand waving, and one hand waving, performed by a total of 9 subjects. It comprises 90 video clips with an average of 15 frames per clip, where the frame size is 144 × 180.

KTH Action Dataset.
The KTH dataset for activity recognition comprises 25 subjects who performed 6 activities (running, walking, boxing, jogging, handclapping, and hand waving) in four distinct scenarios. A total of 2391 sequences were recorded with a static camera against a homogeneous background, with a frame size of 160 × 120.

UCF Sports Dataset.
In this dataset, there were 182 videos, which were evaluated by an n-fold cross-validation rule. This dataset was taken from different sports activities on broadcast television channels. Some of the videos had high intraclass similarities. This dataset was also collected using a static camera. It covers 9 activities like running, diving, lifting, golf swinging, skating, kicking, walking,

IXMAS Action Dataset.
The IXMAS (INRIA Xmas motion acquisition sequences) dataset comprises 13 activity classes performed by 11 actors, each 3 times. Every actor chose a free orientation and position. The dataset provides annotated silhouettes for each person. For our experiments, we selected only 8 action classes: walk, cross arms, punch, turn around, sit down, wave, get up, and kick. IXMAS is a multiview dataset for view-invariant human activity recognition, where each frame has a size of 390 × 291. This dataset contains major occlusions that may cause misclassification; therefore, we utilized global histogram equalization [61] to address the occlusion issue.
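Global histogram equalization can be sketched generically as follows. This is a textbook-style illustration on a tiny assumed grayscale image, not necessarily the exact procedure of [61]; the function name equalize_histogram is hypothetical.

```python
def equalize_histogram(image, levels=256):
    """Global histogram equalization for a 2-D list of integer gray levels."""
    flat = [p for row in image for p in row]
    hist = [0] * levels
    for p in flat:
        hist[p] += 1
    # Cumulative distribution of gray levels over the whole image.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    n = len(flat)
    if n == cdf_min:  # constant image: nothing to stretch
        return [row[:] for row in image]
    scale = (levels - 1) / (n - cdf_min)
    # Map each pixel through the rescaled cumulative distribution.
    return [[round((cdf[p] - cdf_min) * scale) for p in row] for row in image]

# Assumed toy input: a dark, low-contrast 3 x 3 image.
dark = [[10, 10, 12], [12, 14, 14], [10, 12, 14]]
bright = equalize_histogram(dark)
```

The mapping stretches the narrow range of input intensities across the full gray scale while preserving their order.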

Setup.
For a comprehensive validation, we carried out the following set of experiments using Matlab.
(i) The first experiment was conducted on each dataset separately to show the performance of the proposed model. In this experiment, we employed a 10-fold cross-validation rule, which means that data from 9 subjects were used for training, while the data from one subject were used for testing. The procedure was repeated 10 times so that each subject's data were used for both training and testing.
(ii) The second experiment was conducted without the proposed recognition model on all four datasets to show the importance of the developed model. For this purpose, we used existing well-known classifiers (SVM, ANN, HMM, and the existing HCRF [30]) as the recognition model instead of the proposed HCRF model.
(iii) The third experiment was conducted to compare the performance of the proposed approach with the state-of-the-art methods.
(iv) In the last experiment, the computational complexity of the proposed HCRF model was compared with that of the forward/backward algorithms.
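The subject-wise splitting described in experiment (i) can be sketched as follows. This is a minimal illustration in Python rather than the Matlab code used in the experiments; the function name subject_folds and the toy data (10 subjects with 3 clips each) are assumptions.

```python
def subject_folds(subject_ids, n_folds):
    """Yield (train_idx, test_idx) so that no subject appears in both sets."""
    subjects = sorted(set(subject_ids))
    for k in range(n_folds):
        held_out = set(subjects[k::n_folds])  # round-robin assignment of subjects
        test_idx = [i for i, s in enumerate(subject_ids) if s in held_out]
        train_idx = [i for i, s in enumerate(subject_ids) if s not in held_out]
        yield train_idx, test_idx

# Hypothetical toy data: 10 subjects, 3 clips each.
clip_subjects = [s for s in range(10) for _ in range(3)]
folds = list(subject_folds(clip_subjects, n_folds=10))
```

Splitting by subject rather than by clip keeps all clips of the held-out subject out of the training set, so each fold tests on a person the model has not seen.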

Results and Discussion
5.1. First Experiment. As described before, this experiment validates the performance of the proposed recognition model on each individual dataset. The overall results are shown in Tables 1 (Weizmann dataset), 2 (KTH dataset), 3 (UCF sports dataset), and 4 (IXMAS dataset). As observed from Tables 1-4, the proposed recognition model consistently obtained high recognition rates on each dataset. This result shows the robustness of the proposed model: it performed well not only on one dataset but across multiple spontaneous datasets.

Second Experiment.
As described before, the second experiment was conducted without the proposed recognition model on all four datasets to show the importance of the proposed model. For this purpose, we used existing well-known classifiers (SVM, ANN, HMM, and the existing HCRF [30]) as the recognition model instead of the proposed HCRF model. The results show that when the proposed HCRF model was replaced with ANN, SVM, HMM, or the existing HCRF [30], the system failed to achieve higher recognition rates. The better performance of the proposed HCRF model can be seen in Tables 1-4, which show that the proposed HCRF model effectively fixes the drawbacks of the HMM and the existing HCRF, which have been used extensively for sequential HAR.
Table 7: Classification results of the proposed system on UCF sports dataset (A) using ANN, (B) using SVM, (C) using HMM, and (D) using existing HCRF [30], while removing the proposed HCRF model (unit: %).

Third Experiment.
In this experiment, a comparative analysis was made between the state-of-the-art methods and the proposed model. All of these approaches were implemented following the instructions provided in their respective articles. A 10-fold cross-validation rule was employed on each dataset as explained in Section 4. The average classification results of the existing methods and the proposed method across the different datasets are summarized in Table 9. Table 9 shows that the proposed method performed significantly better than the existing state-of-the-art methods. Therefore, the proposed method accurately and robustly recognizes human activities using different video data.

Fourth Experiment.
In this experiment, we present the computational complexity, which is also one of the contributions of this paper. The implementations of the previous HCRF available in the literature calculate the gradients by repeating the forward and backward algorithms, while the proposed HCRF model executes them only once and caches the outcomes for later use. From (21) and (22), it is clear that the forward or backward algorithm has a complexity of O(TQ^2 M), where T represents the input sequence length, Q represents the number of states, and M indicates the number of mixtures. The proposed HCRF model, however, requires a total complexity of O(TM) to calculate the gradients, as can be seen from (22)-(29). Figure 3 shows a comparison of the execution time when the gradients are computed by the forward (or backward) algorithm and by our proposed method. The computational time was measured by running Matlab R2013a on an Intel® Pentium® Core™ i7-6700 (3.4 GHz) with 16 GB of RAM.
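The gap between the two complexity orders can be illustrated with a small worked example. The functions below simply evaluate the stated orders O(TQ^2 M) and O(TM) while ignoring constant factors; the function names and the values T = 100, Q = 4, and M = 3 are assumptions chosen only for illustration.

```python
def forward_pass_cost(T, Q, M):
    """Operation count of order T * Q^2 * M for one forward (or backward) pass."""
    return T * Q * Q * M

def cached_gradient_cost(T, M):
    """Operation count of order T * M when cached results are reused."""
    return T * M

T, Q, M = 100, 4, 3  # assumed sequence length, number of states, number of mixtures
repeated = forward_pass_cost(T, Q, M)
cached = cached_gradient_cost(T, M)
ratio = repeated / cached
```

Under these assumptions, the ratio of the two counts equals Q^2, so the saving per gradient evaluation grows quadratically with the number of hidden states.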

Conclusion
In healthcare and telemedicine, the role of human activity recognition (HAR) can be illustrated by the case of physically disabled persons. A patient with half of the body critically affected by paralysis is unable to perform daily exercises. Doctors recommend specific activities to improve such patients' health. For this purpose, doctors need a HAR system through which they can monitor the patients' daily routines (activities) on a regular basis. The accuracy of most HAR systems depends on the recognition module. For the feature extraction and selection modules, we used existing well-known methods, while for the recognition module, we proposed the use of an HCRF model that can approximate a complex distribution using a mixture of Gaussian density functions. The proposed model was assessed on four publicly available standard action datasets. Our experiments show that the proposed model with full-covariance Gaussian density functions achieved a significant improvement in accuracy over the existing state-of-the-art methods. Furthermore, we showed that this improvement is statistically significant, with a p value ≤0.2 in the comparison. Similarly, the complexity analysis indicates that the proposed computational method strongly decreases the execution time of the hidden conditional random field model. The ultimate goal of this study is to deploy the proposed model on smartphones. Currently, the proposed model uses a full-covariance matrix; however, this might be time consuming, especially on smartphones. Using a lightweight classifier such as K-nearest neighbor (K-NN) could be one possible solution, but K-NN is very sensitive to environmental factors (like noise). Therefore, in the future, we will investigate ways to reduce the computation time while sustaining the same recognition rate when the model is deployed on smartphones in real environments.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare there are no conflicts of interest.