Action recognition using vague division DMMs

This study presents a novel human action recognition method based on sequences of depth maps, which provide additional body shape and motion information for action recognition. First, the authors divide each depth sequence into a number of sub-sequences of uniform length. By controlling a vague boundary (VB), they construct a VB-sequence consisting of an original sub-sequence and its adjacent frames. Then, each depth frame in a VB-sequence is projected onto three orthogonal Cartesian planes, and the absolute differences between consecutive projected maps are accumulated to form a depth motion map (DMM) that describes the dynamics of the VB-sequence. Finally, the DMMs of all the VB-sequences in one video sequence are concatenated to describe an action; collectively, this representation is called the vague boundary division of depth model (VBDDM). For classification, they apply the robust probabilistic collaborative representation classifier. The recognition results on the MSR Action Pairs, MSR Gesture 3D, MSR Action3D, and UTD-MHAD datasets indicate superior performance of their method over most existing methods.


Introduction
Human action recognition is an active branch of computer vision and pattern recognition, with many potential applications in human-machine interaction, including video analysis, surveillance systems, and robotics. In the past, research has mainly focused on learning and recognising actions from image sequences taken by visible light cameras [1][2][3]. This type of data source has inherent limitations, e.g. it is sensitive to colour and illumination changes, occlusions, and background clutter. With the recent advent of the cost-effective Kinect, depth cameras have received a great deal of attention from researchers. At present, there are two major categories of features extracted from depth map sequences: skeleton joint points and the original depth maps. Recently, many recognition methods based on estimated three-dimensional (3D) skeleton joints have been presented [4][5][6]. However, these methods need reliable skeleton data, and the estimation cannot guarantee that the skeleton joint points are always correct. Compared with 3D skeleton joint points, the original depth maps provide not only additional body shape information but also motion information to distinguish actions. Therefore, the original depth maps may be a better choice for the recognition problem.
In this paper, we focus on recognising human actions by using the original depth map sequences. A new descriptor, the vague boundary division of depth model (VBDDM), is presented and shown to be an effective way to address the human action recognition problem. Depth motion maps (DMMs) of an entire video [7] may lose the temporal information of an action [e.g. the actions 'pick up' and 'put down' in the Action Pairs dataset share similar motions and shapes, yet the co-occurrence of the object shape and the hand motion appears in a different spatiotemporal order (see Fig. 1)]. To further model the temporal variations, in our method we divide each depth sequence into a number of sub-sequences of uniform length. By controlling the VB, we construct a VB-sequence which consists of an original sub-sequence and its adjacent frames. Then, the DMMs of each VB-sequence in three orientations are calculated and resized to a fixed size. Different from [7, 8], we do not calculate histograms of oriented gradients (HOGs) [9] of the DMMs to encode a descriptor; pixel values are used directly as the feature to reduce the computational complexity of feature extraction. Finally, we concatenate the descriptors of all the VB-sequences in one video sequence to obtain the VBDDM descriptor. For classification, we use the robust probabilistic collaborative representation classifier (R-ProCRC) proposed by Yang et al. [10], which shows superior performance to many popular classifiers. Experiments on the MSR Action3D [11], MSR Action Pairs [12], MSR Gesture 3D [13], and UTD-MHAD [14] datasets demonstrate that the proposed method is more robust and achieves better recognition accuracy than most existing methods.
The rest of this paper is organised as follows. In Section 2, related work is presented. In Section 3, the details of generating VBDDM feature descriptors are given. The principle of the R-ProCRC classifier is presented in Section 4. A variety of experimental results and discussion are presented in Section 5. Conclusions are finally given in Section 6.

Related work
Owing to the wide range of applications for action recognition, researchers have been actively studying this topic and have achieved promising results. Some excellent surveys have been published based on conventional RGB cameras. These approaches mainly focus on the detection and representation of space-time volumes such as spatiotemporal features and trajectories. For example, in [15], spatiotemporal interest points coupled with a support vector machine (SVM) classifier were used to achieve human action recognition. Scale-invariant feature transform (SIFT) feature trajectories modelled in a hierarchy of three abstraction levels were used to recognise actions in video sequences in [16]. Motion energy images and motion-history images were introduced in [17] as motion templates to model the spatial and temporal characteristics of human actions in videos. However, a major shortcoming of these colour-based or intensity-based methods is the sensitivity of recognition to illumination variations, limiting the recognition robustness.
With the development of computing ability and the improvement of sensor techniques, a large number of methods have been proposed to address action recognition using various data modalities from depth sensors. Depth maps and skeletal joints are the two most commonly used data modalities in this area. The work by Wang et al. [11] utilised both skeleton and point cloud information. They proposed to extract local occupancy pattern (LOP) features at each joint and learn an actionlet ensemble model to represent each action and to capture intra-class invariance. Luo et al. [18] proposed a new discriminative dictionary learning algorithm (DL-GSGC) that incorporates both group sparsity and geometry constraints to better represent skeleton features. Xia et al. [5] proposed a feature called the histogram of 3D joint locations using modified spherical coordinates. Skeleton-based features are easier to extract and require less computation and memory than depth map-based features; however, they can hardly work when the human body is only partly in view, and the skeleton estimation is unreliable or can fail when the person touches the background or is not in an upright position.
In the area of using original depth data, Oreifej and Liu [12] proposed to describe the depth sequence using a histogram capturing the distribution of surface normal orientations in the 4D space of time, depth, and spatial coordinates (HON4D). Yang et al. [7] proposed to project depth frames onto three orthogonal Cartesian planes to characterise the motion of an action, and the HOGs computed from the DMMs were then used as feature descriptors. Chen et al. [19] also employed DMMs to capture motion cues and used local binary patterns (LBPs) to obtain a compact feature representation; two types of fusion, feature-level and decision-level, were considered. Yang and Tian [20] proposed to subdivide a depth video into a set of space-time grids and then adopted a novel scheme of aggregating low-level polynomials into a super normal vector. Cai and Zhang computed DMMs and depth static maps (DSMs) to obtain the temporal pyramid of depth model (TPDM) [8], and a spatial pyramid HOG (SPHOG) was then computed from the TPDM to represent an action.
Chen and Liu [21] divided the depth sequence into temporally overlapping depth segments, which are used to generate three DMMs. Multiple frame lengths of the depth segments are utilised to cope with speed variations in actions, and the LBP descriptor is then exploited to characterise local rotation-invariant texture information in those patches.

Proposed method
A depth map captures 3D structure and shape information. However, a human action consists of a sequence of postures and varies with time; some actions even share similar postures yet differ in the order in which the postures appear. To solve this problem, in this work we propose a new model named VBDDM. The framework of VBDDM for human action recognition on depth maps is illustrated in Fig. 2.

VB division
To model the temporal variation of a human action, we first divide the depth map sequence into a number of sub-sequences. Given a depth sequence X = {x_1, x_2, …, x_N} with N frames, we divide it into DIV sub-sequences of equal length; in this situation, the last segment may be slightly shorter or longer so that all segments together cover the whole depth sequence. With this division strategy, each sub-sequence contains the characteristics of a different action phase, and encoding features from the divided sub-sequences can better capture how the depth information changes over time. However, not all actions performed by different subjects follow the same regular rhythm. Human action recognition is one of the most complex recognition problems, mainly because human actions are strongly influenced by culture, personal character, and emotional state. The speed of the same action performed by different subjects, or even by the same subject at different times, can differ. For a video capture system with a fixed sampling rate, simply dividing the depth map sequence into a number of sub-sequences may neglect these potential speed variations.
To address this problem, we introduce a VB division method. As shown in Fig. 2, by controlling the VB we construct a VB-sequence which consists of an original sub-sequence and its adjacent frames. The jth VB-sequence of the kth action video is obtained by extending the jth sub-sequence with neighbouring frames on both sides, where α is the VB factor which controls the number of frames shared between adjacent sub-sequences. By considering the forward and backward depth information simultaneously, each VB-sequence is more robust to variations in action speed. In this way, the action depth sample X is converted into a set of VB-sequences {P_j}, j = 1, …, DIV.
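As an illustration of this division step, the following minimal Python sketch splits a frame sequence into DIV sub-sequences and extends each of them by a vague boundary. The exact rounding of the boundary width (here ⌊αL⌋ frames on each side, clipped at the sequence ends) is an assumption, since the text only states that α controls the number of shared frames.

```python
import numpy as np

def vague_boundary_division(frames, div, alpha):
    """Split a depth sequence into DIV VB-sequences.

    frames : list or array of depth maps (T x H x W)
    div    : number of sub-sequences (DIV)
    alpha  : vague boundary factor controlling how many neighbouring
             frames each sub-sequence borrows from its neighbours
    """
    T = len(frames)
    L = T // div                      # nominal sub-sequence length
    extra = int(round(alpha * L))     # frames borrowed on each side
    vb_sequences = []
    for j in range(div):
        start = j * L
        # the last segment absorbs the remaining frames so the whole
        # sequence is covered even when T is not divisible by DIV
        end = T if j == div - 1 else (j + 1) * L
        # extend by the vague boundary, clipped at the sequence ends
        vb_start = max(0, start - extra)
        vb_end = min(T, end + extra)
        vb_sequences.append(frames[vb_start:vb_end])
    return vb_sequences
```

For example, vague_boundary_division(depth_frames, div=3, alpha=0.2) returns three overlapping VB-sequences whose neighbouring parts are shared.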

VBDDM descriptor
Inspired by the work in [22], we calculate a DMM for each VB-sequence. A depth map is a 2D image taken by a depth camera, in which each pixel indicates the distance from the camera to the surface of the object. Unlike a full 3D model, the 3D information of unseen surfaces is missing in a depth map; nevertheless, a depth map contains rich information for understanding the object, and it is easy to convert a depth map into a point cloud. Thus, using point clouds, all the depth maps in the action depth sample X are converted from 2D images into 3D space. Then, we project each point cloud derived from a depth map onto three views: front view X_{i,f}, side view X_{i,s}, and top view X_{i,t}. The motion energy of each frame is obtained by calculating the absolute difference between two consecutive projected maps without thresholding, the same as in [22]. The DMM is then obtained by stacking the motion energy across a VB-sequence:

DMM_{j,v} = Σ_{i ∈ P_j} |X_{i+1,v} − X_{i,v}|,   v ∈ {f, s, t}

where DMM_{j,v} denotes the DMM of the jth VB-sequence derived from view v. Note that for the first/last VB-sequence, we only consider the backward/forward neighbouring frames. The DMM utilises the differences between adjacent frames; as shown in Fig. 3, the accumulation of these differences reveals the trajectory of the human body.

Since DMMs calculated from different action video sequences may have different sizes, bicubic interpolation is applied to resize all DMM_{j,v} derived from the same projection view to a fixed size in order to reduce the intra-class variability. The size of the front view DMM_{j,f} is set to m_f × n_f, the size of the side view DMM_{j,s} to m_s × n_s, and the size of the top view DMM_{j,t} to m_t × n_t. Examples of DMMs are shown in Fig. 3. All the DMMs are normalised between 0 and 1 to avoid large pixel values dominating the feature set. The descriptor of the jth VB-sequence is then obtained by concatenating the vectorised normalised DMMs of the three views.

Different from [7, 22], we calculate DMMs only from the VB-sequences, not from the whole sequence. Cai and Zhang [8] calculated DMMs on sub-actions, which is similar to our approach; however, the method in [8] also computes DMMs and DSMs on the whole video sequence to form a spatial-temporal pyramid model, which is computationally expensive. Furthermore, the method in [8] divides the action video into sub-actions with definite boundaries, whereas most actions are continuous, so information in the time domain is lost wherever the action is divided. In our method, therefore, every VB-sequence shares information with its adjacent sequences. Unlike [7, 8], which compute HOG features from the motion maps, we concatenate the descriptors of all the VB-sequences directly to obtain the VBDDM descriptor. To keep the classification task efficient, we perform principal component analysis (PCA) to reduce the dimensionality, retaining 99% of the variance. Thus, we obtain a compact VBDDM descriptor.
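To make the construction of the DMMs concrete, the sketch below (Python with NumPy and OpenCV) follows the description above: each depth frame is projected onto the front, side, and top planes, the absolute differences of consecutive projected maps are accumulated, and each DMM is resized with bicubic interpolation and normalised to [0, 1]. The binary occupancy form of the side and top projections and the quantisation of depth into `depth_bins` levels are simplifying assumptions, not details taken from the paper.

```python
import numpy as np
import cv2  # OpenCV, used here only for bicubic resizing

def project_views(depth, depth_bins=256):
    """Project one depth frame onto the front (x-y), side (y-z) and top (z-x) planes.

    depth : 2-D array (H x W) of depth values quantised to [0, depth_bins).
    The front view is the depth map itself; the side and top views are
    occupancy maps of the corresponding projections (a simplification).
    """
    H, W = depth.shape
    front = depth.astype(np.float32)
    side = np.zeros((H, depth_bins), np.float32)
    top = np.zeros((depth_bins, W), np.float32)
    ys, xs = np.nonzero(depth)              # zero depth = background
    zs = depth[ys, xs].astype(int)
    side[ys, zs] = 1.0
    top[zs, xs] = 1.0
    return front, side, top

def dmms_of_vb_sequence(frames, out_sizes):
    """Accumulate |differences| of consecutive projected maps (no threshold),
    then resize each DMM to its fixed size and normalise it to [0, 1].

    frames    : VB-sequence of depth maps (needs at least two frames)
    out_sizes : [(m_f, n_f), (m_s, n_s), (m_t, n_t)] target sizes per view
    """
    dmms, prev = None, None
    for depth in frames:
        views = project_views(depth)
        if prev is not None:
            diffs = [np.abs(v - p) for v, p in zip(views, prev)]
            dmms = diffs if dmms is None else [d + a for d, a in zip(dmms, diffs)]
        prev = views
    out = []
    for dmm, (m, n) in zip(dmms, out_sizes):
        resized = cv2.resize(dmm, (n, m), interpolation=cv2.INTER_CUBIC)
        lo, hi = resized.min(), resized.max()
        out.append((resized - lo) / (hi - lo + 1e-12))
    return out  # [DMM_f, DMM_s, DMM_t] for this VB-sequence
```

In the paper the depth frames are first converted to point clouds before projection; the occupancy maps above approximate the same side and top views directly from the depth image.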

Fig. 3 Example of DMMs
Table 1 Comparison of recognition accuracy (%) on the MSR Action Pairs dataset
Chen et al. [22]: 50.6
Skeleton + LOP + pyramid [11]: 82.2
HON4D + Ddisc [12]: 96.7
Ours: 97.2


Classification

For classification, we adopt the probabilistic collaborative representation based classifier (ProCRC) proposed in [10], which employs a ProCR framework to jointly maximise the probability that a test sample belongs to each class. ProCRC effectively makes use of the training samples from all classes to deduce the class label of a test sample. It possesses a clear probabilistic interpretation and is very efficient to solve. Therefore, we use this newly proposed classifier to solve our action recognition problem.
According to the preceding part of this paper, we have a collection of training samples from C classes, H = [H_1, …, H_i, …, H_C], where H_i is the data matrix of the ith class and each column of H_i is a VBDDM feature calculated from a training sample. We view H as the data matrix of an expanded class, denote by l_H the label set of all candidate classes in H, and denote by S the linear subspace collaboratively spanned by all the samples in H. Each test sample y in the collaborative subspace S can then be represented as a linear combination of the samples in H, y = Hb, where b is the representation vector. Following [10], the optimal representation vector, denoted b*, is calculated as

b* = arg min_b { ||y − Hb||_2^2 + λ||b||_2^2 + (γ/C) Σ_{c=1}^{C} ||Hb − H_c b_c||_2^2 }     (4)

where b_c is the sub-vector of b associated with class c, and λ and γ are regularisation parameters. In visual classification, partial corruption or occlusion often degrades the performance, and it is well known that the robustness of classification tasks can be enhanced by using the l_1-norm to characterise the loss function [10]. Therefore, (4) can be extended to its robust version, R-ProCRC:

b* = arg min_b { ||y − Hb||_1 + λ||b||_2^2 + (γ/C) Σ_{c=1}^{C} ||Hb − H_c b_c||_2^2 }     (5)

Although the R-ProCRC model is convex, it has no closed-form solution; Yang et al. [10] adopted an iteratively reweighted least squares (IRLS) algorithm to compute b. In each iteration the coefficient vector is updated by solving a weighted least-squares problem

b* = ( H^T W_H H + λI + (γ/C) Σ_{c=1}^{C} (H − H S_c)^T (H − H S_c) )^{-1} H^T W_H y     (6)

where S_c is a diagonal selection matrix that keeps only the columns of class c (so that H S_c b = H_c b_c), W_H is a diagonal weighting matrix, and H(i, :) denotes the ith row of H. The diagonal entries of W_H are computed from the current residual as

W_H(i, i) = 1 / ( |y_i − H(i, :) b*| + ε )     (7)

with ε a small positive constant. With the optimal solution vector b* estimated from the training samples, the probability that a test sample y belongs to class c can be computed as

P(l_y = c | y) ∝ exp{ −( ||y − Hb*||_1 + λ||b*||_2^2 + ||Hb* − H_c b_c*||_2^2 ) }     (8)

Note that ||y − Hb*||_1 + λ||b*||_2^2 is the same for all classes, so this part can be omitted when computing P(l_y = c | y). The classification rule is therefore l_y = arg max_c P(l_y = c | y), which is equivalent to assigning y to the class that minimises ||Hb* − H_c b_c*||_2^2.

Table 3 The three subsets of actions of the MSR Action3D dataset (action indices in parentheses)
AS1: horizontal arm wave (2), hammer (3), forward punch (5), high throw (6), hand clap (10), bend (13), tennis serve (18), pickup & throw (20)
AS2: high arm wave (1), hand catch (4), draw x (7), draw tick (8), draw circle (9), two hand wave (11), side boxing (12), forward kick (14)
AS3: high throw (6), forward kick (14), side kick (15), jogging (16), tennis swing (17), tennis serve (18), golf swing (19), pickup & throw (20)

Table 2 Comparison of recognition accuracy (%) on the MSR Gesture 3D dataset
Chen et al. [22]: 90.7
HON4D + Ddisc [12]: 92.5
Yang and Tian [20]: 94.7
Chen and Liu [21]: 98.
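To make the classification step concrete, here is a minimal NumPy sketch of the R-ProCRC decision process described above: an IRLS loop for the l_1 data-fidelity term followed by the per-class residual rule. The parameter values, the initialisation with the l_2 solution, and the fixed iteration count are illustrative assumptions rather than the exact procedure of [10].

```python
import numpy as np

def r_procrc(H, labels, y, lam=1e-3, gamma=1e-3, iters=10, eps=1e-6):
    """Sketch of R-ProCRC: solve
        min_b ||y - Hb||_1 + lam*||b||_2^2 + (gamma/C) * sum_c ||Hb - H_c b_c||_2^2
    by IRLS, then assign y to the class minimising ||Hb - H_c b_c||_2^2.

    H      : d x n matrix whose columns are training features
    labels : length-n array of class labels for the columns of H
    y      : length-d test feature
    """
    d, n = H.shape
    classes = np.unique(labels)
    C = len(classes)
    # precompute the class-consistency term  sum_c (H - H S_c)^T (H - H S_c)
    M = np.zeros((n, n))
    for c in classes:
        Hc = H.copy()
        Hc[:, labels != c] = 0.0          # H S_c keeps only class-c columns
        D = H - Hc
        M += D.T @ D
    M *= gamma / C

    A = H.T @ H + lam * np.eye(n) + M
    b = np.linalg.solve(A, H.T @ y)       # plain (l2) ProCRC initialisation
    for _ in range(iters):
        r = y - H @ b
        w = 1.0 / (np.abs(r) + eps)       # IRLS weights for the l1 data term
        HW = H.T * w                      # = H^T diag(w)
        b = np.linalg.solve(HW @ H + lam * np.eye(n) + M, HW @ y)

    # classification: terms common to all classes are dropped
    scores = []
    for c in classes:
        bc = np.where(labels == c, b, 0.0)    # keep only class-c coefficients
        scores.append(np.sum((H @ b - H @ bc) ** 2))
    return classes[int(np.argmin(scores))]
```

Setting iters=0 recovers the plain l_2 ProCRC solution of (4), which already illustrates the collaborative term shared by all classes.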

Experiment and discussion
We extensively evaluated the proposed method on four standard 3D activity datasets: MSR Action Pairs [12], MSR Gesture 3D [13], MSR Action3D [11], and UTD-MHAD [14]. For each dataset, we perform experiments to analyse the influence of the parameter α. In all experiments, each action sequence is divided into DIV VB-sequences, where the value of DIV differs between datasets. Three DMMs on the orthogonal Cartesian planes are then calculated for every VB-sequence, so we obtain DIV × 3 DMM_v maps for one complete action sequence. We also compare the SVM [23], l1-norm CRC [24], l2-norm CRC [22], and R-ProCRC classifiers to show the superiority of R-ProCRC.
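As an illustration only, the following sketch ties the earlier helper functions together into the experimental pipeline: VB division, per-VB-sequence DMMs, concatenation, PCA retaining 99% of the variance, and R-ProCRC classification. The function names, the DMM target sizes, and the data-loading placeholders (train_videos, train_labels, test_videos) are hypothetical and not taken from the paper.

```python
import numpy as np
from sklearn.decomposition import PCA

def video_to_vbddm(depth_frames, div, alpha, out_sizes):
    """Concatenate the DIV x 3 normalised DMMs of one depth video."""
    parts = []
    for seq in vague_boundary_division(depth_frames, div, alpha):
        for dmm in dmms_of_vb_sequence(seq, out_sizes):
            parts.append(dmm.ravel())
    return np.concatenate(parts)

# illustrative fixed DMM sizes for the front, side and top views
out_sizes = [(100, 50), (100, 80), (80, 50)]

# train_videos / train_labels / test_videos are placeholders for dataset loading
X_train = np.stack([video_to_vbddm(v, div=3, alpha=0.8, out_sizes=out_sizes)
                    for v in train_videos])
pca = PCA(n_components=0.99, svd_solver='full').fit(X_train)  # keep 99% variance
H = pca.transform(X_train).T                                  # columns = samples

predictions = [r_procrc(H, np.asarray(train_labels),
                        pca.transform(video_to_vbddm(v, 3, 0.8, out_sizes)[None, :])[0])
               for v in test_videos]
```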

MSR Action Pairs 3D dataset
The MSR Action Pairs 3D dataset [12] is a paired-activity dataset of depth sequences captured by a depth camera. It contains 12 activities performed by 10 subjects, with each subject performing each activity three times. This dataset was collected to investigate how the temporal order affects activity recognition. The same evaluation setup as in [12] is used in our experiments. Before extracting features, we remove the first and last four frames of each sequence, in which the subjects are mostly stationary. We then set α = 0.05 for this dataset and divide each sequence into six sub-sequences. The first five actors are used for testing and the rest for training. A detailed comparison with other approaches is given in Table 1. We compare our method with three methods on this dataset. First, we compute motion maps for the whole action sequence, vectorise them as features, and use the l2-norm CRC [22] to train on them. Second, we compare with the skeleton-based pair-wise features and LOP features described in [11], enhanced by applying a temporal pyramid, which reach an accuracy of 82.2%. Finally, we compare our method with HON4D [12]; although we achieve only a 0.5% improvement, our method can be considered to reach a relatively better result.
Confusion matrices are shown in Fig. 4. Fig. 4a presents the confusion matrix of the method in [22], which extracts DMM features from the whole video sequence, and Fig. 4b shows the confusion matrix of the proposed method. Many methods utilise DMMs to extract features, and many of them compute the DMMs from the whole video (e.g. [7, 22]), which may lose temporal information. As Fig. 1 shows, if 'pick up' and 'put down' are projected onto one DMM_v in a specific direction, they become highly similar; this is clearly reflected in the confusion matrix in Fig. 4a, where the paired actions are the major cause of confusion. The proposed VBDDM descriptor captures the temporal information effectively and performs much better: most paired actions are separated from each other with 100% accuracy.

MSR hand gesture dataset
The Gesture 3D dataset [13] is a hand gesture dataset of depth sequences captured by a depth camera. It contains a set of dynamic gestures defined by American Sign Language. There are 12 gestures in the dataset: 'bathroom', 'blue', 'finish', 'green', 'hungry', 'milk', 'past', 'pig', 'store', 'where', 'j', and 'z'. Sequences of depth maps sampled from 'milk' are shown in Fig. 5. In total, the dataset contains 333 depth sequences and is considered challenging mainly because of self-occlusion issues.
We follow the experimental setup in [12] and obtain the accuracies reported in Table 2. Note that, before the experiment on this dataset, we removed the all-zero frames in these sequences: when a video sequence is divided into three parts, the many all-zero frames prevent the DMM_v from being calculated correctly. We also removed the first three frames, which are almost static. In this experiment, DIV = 3 and α = 0.2.
As shown in Table 2, our method obtains a relatively high accuracy on this dataset. Among the compared methods, Chen and Liu [21] achieve the best performance. In [21], a different strategy of temporally overlapping depth segments was proposed, which adopts three temporal levels, and the LBP descriptor was then exploited to characterise local rotation-invariant texture information in those patches. In our method, all the depth map sequences are divided into DIV sub-sequences. Our method uses only one temporal level and does not need to calculate the LBP descriptor, which leads to relatively lower computational complexity.

MSR Action3D dataset
The MSR Action3D dataset [11] is a public dataset of depth map sequences captured by an RGB-D camera. It includes 20 action categories performed by 10 subjects facing the camera during performance. Each action was performed two or three times by each subject, and the depth maps have a resolution of 320 × 240. To facilitate a fair comparison, we follow the same experimental settings as [8, 22] and split the 20 categories into three subsets, as listed in Table 3. For each subset, there are three different tests, i.e. Test One (One), Test Two (Two), and the Cross Subject Test (Cr-Sub). In Test One, 1/3 of the samples are used for training and the rest for testing; in Test Two, 2/3 of the samples are used for training and the rest for testing; in the Cross Subject Test, half of the subjects are used for training and the other half for testing. In all three subsets, we set DIV = 3 and α = 0.8, and the fixed sizes of the DMMs are the same as in [22]. We then compare our method with other methods on the three subsets, and the overall accuracies are provided for each test, as shown in Table 4. Chen et al. [22] directly use the pixels of the DMMs (DMM_v) calculated from a whole video sequence. The method in [8] builds a spatiotemporal pyramid model, in which DMM_v and DSMs are calculated and HOG [9] is then used to represent these maps. Chen et al. [19] employed DMMs to capture motion cues and used LBPs to obtain a compact feature representation; in Table 4, we compare with the feature-level fusion approach (DMM-LBP-FF), which has a higher average accuracy than the decision-level fusion approach (DMM-LBP-DF). TPDM-SPHOG [8] is an effective feature and has reached state-of-the-art performance on the Action3D dataset; however, its disadvantage is its high computational complexity. In our method, we only need to calculate the DMMs (DMM_v) of the sub-sequences, and the pixels of DMM_v (not HOG features) are used directly to form the feature. As shown in Table 4, our method obtains comparable performance with these methods and is sometimes superior to them.
The confusion matrices of action set 1 (AS1), AS2, and AS3 for the Cross Subject Test are shown in Fig. 6. The proposed method has a relatively lower accuracy on AS2 than on the other two subsets. The most confused actions are hand catch (4), draw x (7), draw tick (8), and draw circle (9). For these actions, only the movement of the arms is critical for classification, yet the DMMs capture the variation of the whole depth map; the movement of the body and of unrelated regions such as the trunk introduces a large amount of noise, so the DMMs of these actions are hard to distinguish. In Fig. 6c, the only two confused actions are tennis swing (17) and tennis serve (18), which are highly similar and therefore also hard to distinguish.

UTD-MHAD dataset
This dataset [14] consists of four temporally synchronised data modalities: RGB videos, depth videos, skeleton positions, and inertial signals, captured by a Kinect camera and a wearable inertial sensor for a comprehensive set of 27 human actions performed by 8 subjects (4 female and 4 male). We only use the depth videos here. Each subject repeated each action four times. After removing three corrupted sequences, the dataset includes 861 data sequences. The 27 actions are listed in Table 5. This dataset is larger than the other three datasets mentioned above in terms of both the number of actions and the number of samples.
In the experiments, the data of subjects 1, 3, 5, and 7 were used for training and the data of subjects 2, 4, 6, and 8 for testing, the same as in [14]. We set DIV = 3 and α = 0.05. As UTD-MHAD is a newly published dataset, there are fewer works on it compared with the other three datasets. We compare our method with [22] on the depth videos; details are shown in Table 6. Although our method performs much better than [22], there is still considerable room for improvement on this dataset.

Discussion
The influence of the VB factor α is presented in Fig. 7. As can be seen in Fig. 7a, the variation of the parameter α has only a small effect (<5%) on the classification performance, and our method achieves the best performance when α = 0.8. In Fig. 7b, the influence of the parameter α on the Gesture 3D dataset is relatively stable, with the best performance reached at α = 0.2. On the Action Pairs and UTD-MHAD datasets the accuracy shows a downward trend as α increases, and the best result is obtained at α = 0.05. This experiment shows that the VB factor α helps to improve the recognition performance compared with the results when α = 0. Clearly, the optimal α changes across actions and datasets, so how to choose the optimal α remains a valuable and challenging question for future work.

Fig. 8 Comparison of recognition rates (%) using different classifiers on MSR Action3D in the Cross Subject Test
Fig. 9 Comparison of recognition rates (%) using VBDDM + CRC and conventional DMM + CRC: (a) Cross Subject Test on the MSR Action3D dataset; (b) Action Pairs, Gesture 3D, and UTD-MHAD datasets
We also compare the recognition rates obtained with different classifiers; the results are shown in Fig. 8. The l2-regularised collaborative representation classifier (L2-CRC) was introduced in [22] and has been shown to perform better than l1-CRC [24] and SVM [23]. As exhibited in the figure, in our experiments the R-ProCRC classifier shows stronger discrimination than the other three classifiers on all three subsets of the Action3D dataset.
To demonstrate how much of the performance improvement is attributable to the proposed descriptor, VBDDM with CRC is compared with conventional DMM with CRC [22] on all datasets in Fig. 9. The proposed VBDDM descriptor yields an improvement of about 45% on the Action Pairs dataset and 3-15% on the other three datasets.

Conclusion
In this paper, a computationally efficient DMM-based human action recognition method using R-ProCRC was proposed. We divide a whole video sequence into several sub-sequences, and the DMMs on three orthogonal Cartesian planes calculated from the VB-sequences are used to generate the VBDDM descriptor. In addition, the use of R-ProCRC was shown to outperform l2-CRC, which itself has been shown to perform better than l1-CRC and SVM. The recognition results on the MSR Action3D, MSR Action Pairs, MSR Gesture 3D, and UTD-MHAD datasets indicate superior performance of our method over most existing methods.