Human motion correction and representation method for a moving camera

Abstract: Motion estimation is a basic issue for many computer vision tasks, such as human–computer interaction, moving object detection and intelligent robots. In many practical scenes, object movement is accompanied by camera motion, and motion descriptors computed directly from optical flow are then inaccurate and have low discrimination power. To this end, this study proposes a novel motion correction method and a novel motion feature descriptor, the motion difference histogram (MDH), for recognising human action. Motion estimation results are corrected by background motion estimation, and MDH encodes the motion difference between the background and the objects. Experimental results on video shot with a moving camera show that the proposed motion correction method is effective and that the recognition accuracy of MDH exceeds that of state-of-the-art motion descriptors.


Introduction
Motion estimation and recognition is the foundation of many computer vision tasks, especially object motion analysis in visible-light video. It is widely used in applications such as human–machine interaction, video surveillance, event retrieval and intelligent vehicles. In many practical scenes, object movement is accompanied by camera motion, so recognising human motion from a moving camera is an active research topic in human–computer interaction [1, 2] and computer vision [3, 4]. Approaches to human action recognition involve motion estimation/representation, object detection and trajectories. In most of these video analysis tasks, the motion feature is a popular low-level vision feature and plays an important role. However, in real scenes, the movement of both the camera and the objects introduces error into motion estimation, reducing the discrimination power of the motion descriptor.
For motion recognition in complex scenes, especially with a moving camera, how to model camera motion is still an open issue. Wang and Schmid [5] estimated camera motion by matching feature points between frames and used the motion boundary histogram (MBH) to represent motion. Unfortunately, there is no clean solution to this problem. Towards this end, we propose a novel correction method for motion estimation results and a novel motion descriptor, the motion difference histogram (MDH), which regards the background motion as camera motion.
To estimate motion and compute MDH, the dense optical flow of the video is extracted via the Lucas–Kanade (LK) algorithm [6]. The dominant (maximising) component of the flow distribution is regarded as camera motion, and the real motion is the relative motion between the optical flow and the camera motion. Finally, the histogram of the orientation of the real motion is computed as MDH.
To verify the accuracy of motion correction and the discrimination power of MDH, we use the conventional bag-of-words (BOW) model to represent the motion. The video is regarded as a set of spatio-temporal interest points (STIPs) detected by the 3D Harris algorithm [7]. MDH is used as the motion representation of each STIP, and a visual-word vocabulary of the actions is constructed. Finally, the motion is represented as a visual-word feature, and a support vector machine (SVM) classifier is trained for motion recognition. Fig. 1 shows the motion recognition strategy of the BOW model. In this work, we focus on motion estimation and representation; the BOW model serves only as a simple pattern recognition framework for evaluating the motion descriptor.
The contributions of our work are threefold: (i) We propose a specific approach to estimate background/camera motion, and the human motion is corrected by the difference between the optical flow and the camera motion.
(ii) We propose a novel motion descriptor for discriminative action representation. The experimental results show that the discrimination power of MDH is better than that of state-of-the-art motion descriptors.
(iii) The proposed motion descriptor is sufficiently general for other off-the-shelf vision tasks. More robust models could replace the BOW recognition model; here, we place greater emphasis on the accuracy of motion estimation and motion representation.
The remainder of this paper is organised as follows. Section 2 reviews related works. Section 3 describes the proposed method. Section 4 presents and discusses our experimental results. Finally, Section 5 concludes the paper.

Related works
The main approaches to motion estimation from camera footage involve optical flow, frame/background differencing and object tracking. The optical flow method [8] calculates the motion between two frames based on the optical flow constraint equation, which assumes the motion remains constant over a very short time. Frame differencing requires a good, robust background model, while object tracking depends on an accurate object detector and tracker. However, under camera motion, the motion estimates of these methods are inaccurate. In this work, we propose a new correction method to recover the real motion from the optical-flow estimate.
To verify the accuracy of the proposed motion correction method, a motion descriptor based on the corrected motion is calculated at each STIP to recognise human motion. Many studies in the literature indicate that STIPs are widely used in human action recognition owing to their robustness and good performance. In this study, we also focus on STIPs and discuss their motion descriptors. Generally, two descriptor types are used to represent motion: absolute and relative motion descriptors. The absolute motion descriptor is computed directly from the optical flow, such as the histogram of the orientation of optical flow (HOF) [9]. This approach is simple but inaccurate owing to background motion, especially camera motion. The relative motion descriptor receives more attention because of its good performance in human action recognition. Frequently used relative motion descriptors include MBH [5] and internal motion histograms (IMHcd).
In this study, we also consider relative motion and propose a novel descriptor named MDH. In contrast to these descriptors, MDH estimates the camera motion by maximising the statistical distribution of the optical flow. The real motion of each pixel is obtained by subtracting the camera motion from the optical flow.
To verify the discrimination power and effectiveness of MDH, we use the BOW model to construct the action representation based on the motion descriptor, and an SVM classifier is trained to recognise actions. BOW is widely used in many vision tasks. Wang et al. [10] used the K-means algorithm to create visual words, and each action is expressed as a word sequence. Niebles et al. [11] used an unsupervised learning algorithm to create a visual-word codebook and recognised actions via the probabilistic latent semantic analysis (pLSA) or latent Dirichlet allocation (LDA) algorithm. In this study, the emphasis is on the effectiveness of motion correction and MDH, which is demonstrated by comparison with MBH and IMHcd.

Proposed motion correction method and motion descriptor
We describe the proposed motion correction method and motion descriptor for STIP as follows.

Motion correction method
To calculate precise motion from a moving camera, it is necessary to eliminate the influence of camera motion. Towards this end, we assume that the background motion is caused by camera motion and regard the background motion as the camera motion; the relative motion is then a good solution. Firstly, the optical flow I is computed based on a pyramidal frame structure. Under camera motion, the optical flow I is the sum of the object motion I_r and the camera motion I_c, and these motion vectors can be decomposed into the horizontal and vertical directions (x and y directions) as follows:

I_x = I_rx + I_cx, I_y = I_ry + I_cy (1)

where I_rx indicates the object motion in the x-direction, I_ry the object motion in the y-direction, I_cx the camera motion in the x-direction and I_cy the camera motion in the y-direction.
Within one image, the camera motion is the same for all points. The object motion vector is estimated by solving for I_rx and I_ry; the key is how to estimate the camera motion. However, estimating camera motion directly from video data is still a challenging problem in computer vision. In this work, the background motion is estimated by analysing the optical flow of dense interest points, and this background motion is regarded as the camera motion when computing the object motion.
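The correction step itself is a per-pixel subtraction. A minimal sketch in NumPy, assuming the flow field and the estimated camera-motion vector are already available (array shapes and function name are our choices, not from the paper):

```python
import numpy as np

def correct_motion(flow, camera_motion):
    """Subtract the estimated camera (background) motion from the
    optical-flow field to obtain the object motion, I_r = I - I_c.

    flow          : (H, W, 2) array of per-pixel flow (x, y components)
    camera_motion : length-2 vector, the flow shared by the background
    """
    return flow - np.asarray(camera_motion, dtype=float).reshape(1, 1, 2)

# Toy example: the camera pans right by 3 px/frame; one pixel belongs to
# an object that additionally moves up by 2 px/frame.
flow = np.tile(np.array([3.0, 0.0]), (4, 4, 1))
flow[1, 1] = [3.0, -2.0]                 # pan + the object's own motion
object_motion = correct_motion(flow, [3.0, 0.0])
```

After correction, background pixels have (near-)zero motion and only genuinely moving pixels retain a non-zero vector, which is what the descriptor later encodes.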
To compute the background motion, the local interest points of the image are extracted first. In this work, we use the Harris corner detector and extract the optical flow of the detected interest points via the LK algorithm. Some examples are shown in Figs. 2a and b.
The optical flow of these points is decomposed into the x and y directions, and each component is divided into ten intervals. The distribution of points over these intervals is then accumulated, so each interval value indicates the number of points falling into it. Examples are shown in Fig. 2c. The maximum of the histogram is regarded as the background motion (camera motion), because the overwhelming majority of the moving points are caused by camera motion and their movement patterns are consistent. The background motion pattern is shown in Fig. 2d. The relative motion can then be estimated as

I_rx = I_x − I_cx, I_ry = I_y − I_cy (2)
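The histogram-maximisation step can be sketched as follows. This is our reconstruction under the paper's stated settings (ten intervals per axis); the function name and the bin-centre convention are assumptions:

```python
import numpy as np

def estimate_camera_motion(flow_vectors, n_bins=10):
    """Estimate background/camera motion as the mode of the flow
    distribution at the interest points, per axis.

    flow_vectors : (N, 2) optical flow at the Harris interest points
    Returns the centre of the most populated bin in x and in y.
    """
    motion = []
    for axis in range(2):                        # x component, then y
        vals = flow_vectors[:, axis]
        hist, edges = np.histogram(vals, bins=n_bins)
        peak = np.argmax(hist)                   # bin with the most points
        motion.append(0.5 * (edges[peak] + edges[peak + 1]))
    return np.array(motion)

# Nine background points drifting with the camera, one object point:
pts = np.array([[3.0, 0.0]] * 9 + [[0.0, -2.0]])
cam = estimate_camera_motion(pts)                # close to (3, 0)
```

Because the background dominates the point set, the peak bin tracks the camera motion even when a minority of points belongs to moving objects.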

Motion descriptor and recognition method
After relative-motion estimation, to evaluate the effectiveness of motion correction and motion representation, we use the relative motion feature to recognise human motion. A new descriptor, the motion difference histogram (MDH), is computed in the spatio-temporal domain of each STIP. The domain is divided into 3 × 3 × 2 cells, and the histogram of the orientation of the relative motion is computed in each cell. The angles from 0° to 360° are divided into nine intervals. Finally, by concatenating the histograms of these cells, the dimension of MDH is 3 × 3 × 2 × 9 = 162. The computational process of MDH is shown in Fig. 3.
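A minimal sketch of this cell-and-histogram layout, reconstructed from the description above (unweighted orientation counts; whether the paper weights bins by flow magnitude is not stated, so plain counts are an assumption):

```python
import numpy as np

def mdh_descriptor(rel_flow, n_cells=(3, 3, 2), n_bins=9):
    """Sketch of MDH: split the spatio-temporal patch of corrected
    (relative) flow around a STIP into 3x3x2 cells and build a 9-bin
    orientation histogram (0-360 degrees) per cell.

    rel_flow : (H, W, T, 2) relative-flow patch around the STIP
    Returns a concatenated (3*3*2*9,) = 162-dim descriptor.
    """
    H, W, T, _ = rel_flow.shape
    ys = np.array_split(np.arange(H), n_cells[0])
    xs = np.array_split(np.arange(W), n_cells[1])
    ts = np.array_split(np.arange(T), n_cells[2])
    desc = []
    for yi in ys:
        for xi in xs:
            for ti in ts:
                cell = rel_flow[np.ix_(yi, xi, ti)]
                # orientation of each flow vector, mapped into [0, 360)
                ang = np.degrees(np.arctan2(cell[..., 1],
                                            cell[..., 0])) % 360
                hist, _ = np.histogram(ang, bins=n_bins, range=(0, 360))
                desc.append(hist)
    return np.concatenate(desc).astype(float)

patch = np.random.default_rng(1).normal(size=(12, 12, 6, 2))
d = mdh_descriptor(patch)                        # 162-dim descriptor
```

Each of the 18 cells contributes 9 counts, so every flow vector in the patch is counted exactly once somewhere in the 162-dim vector.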
To recognise human action, the video is represented as a histogram feature over a visual-word dictionary. To create the dictionary, we run the K-means algorithm for each category on the STIP motion descriptors; the dictionary length in each category is k, and the final dictionary is formed by combining the per-category codebooks. After computing the video feature, an SVM classifier is trained for action recognition. In this work, the RBF kernel is used to train and predict with the SVM:

K(H_i, H_j) = exp(−‖H_i − H_j‖² / (2σ²))

where H_i and H_j are the video features (visual-word histograms), and σ² is estimated by cross-validation.
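The video-level feature and the kernel can be sketched in a few lines. This assumes a codebook already produced by K-means and uses the standard RBF form; normalising the word histogram is our choice, not stated in the paper:

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Assign each STIP descriptor to its nearest visual word and build
    the normalised word histogram used as the video-level feature.

    descriptors : (N, D) MDH descriptors of one video
    codebook    : (V, D) visual-word centres from K-means
    """
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)                # index of nearest codeword
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

def rbf_kernel(Hi, Hj, sigma=1.0):
    """RBF kernel between two visual-word histograms; sigma^2 would be
    chosen by cross-validation, as in the text."""
    return np.exp(-np.sum((Hi - Hj) ** 2) / (2.0 * sigma ** 2))

codebook = np.array([[0.0, 0.0], [10.0, 10.0]])
desc = np.array([[0.1, 0.0], [9.5, 10.0], [10.0, 9.0]])
h = bow_histogram(desc, codebook)            # word counts [1, 2] -> [1/3, 2/3]
```

In practice the precomputed kernel matrix over all training histograms would be handed to an SVM solver; the sketch only covers the feature and kernel definitions.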

Dataset and parameter setting
In this study, we discuss the motion correction and motion descriptor method in moving-camera scenes. The accuracy and effectiveness of the proposed motion correction and descriptor method are verified on a human motion recognition challenge. The method is evaluated on the YouTube dataset [12], which contains 11 actions (C = 11): 'basketball shooting', 'biking/cycling', 'diving', 'golf swinging', 'horseback riding', 'soccer juggling', 'swinging', 'tennis swinging', 'trampoline jumping', 'volleyball spiking' and 'walking with a dog'. All of the videos in this dataset are collected from the YouTube website. The dataset is challenging owing to its large variations in camera motion, object appearance, object pose, object scale, viewpoint, background clutter and illumination conditions. Each action has 25 subjects (S = 25), each captured in more than 4 different environments (E ≥ 4), for a total of 1599 videos. Fig. 4 presents some examples from the YouTube dataset.

Performance evaluation of human motion recognition
To verify the accuracy of motion correction and the discriminative power of the proposed descriptor, we compared MDH with MBH, HOF [9] and IMHcd on the YouTube dataset. In our experiment, we used 25-fold leave-one-out cross-validation to measure the performance of the proposed method. In each round, one subject is selected as the testing data (N_test = C × E), and the remaining videos form the training data, for a total of C × E × (S − 1). To create the dictionary, the cluster number is set to k = 20. The reported accuracy is the average over 25 rounds. The comparison results are shown in Table 1, where bold values indicate the proposed method and the best results. From the comparison, we find that the improvement of MDH over HOF and IMHcd is more than 2%, and MDH is also better than MBH. Moreover, in human motion recognition, an appearance feature combined with a motion feature typically performs better, so Table 1 also compares motion features combined with appearance features.
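The leave-one-subject-out split sizes quoted above can be checked directly; taking E at its stated minimum of 4 (the per-subject environment count varies in the actual dataset) gives:

```python
# Split arithmetic for the leave-one-subject-out protocol described in
# the text: C action classes, S subjects, E environments per subject.
C, S, E = 11, 25, 4

n_test = C * E                # videos of the single held-out subject
n_train = C * E * (S - 1)     # videos of the remaining 24 subjects
```

So each of the 25 rounds tests on 44 videos and trains on 1056, and the reported accuracy averages over the 25 rounds.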
In Table 1, HOG (histogram of oriented gradients) is the appearance feature, and HNF denotes the HOG feature combined with HOF. Similarly, HOGNMDH denotes the HOG feature combined with MDH. From the results, HOGNMDH outperforms HNF by more than 2%.
As mentioned in Section 3.2, recognition performance is sensitive to the cluster number. In the experiment, we therefore vary the cluster number k from 20 to 150 and compare the performance of the HNF and HOGNMDH features. The experimental results are shown in Table 2 and Fig. 5.
From Table 2 and Fig. 5, we find that the accuracy of the HNF feature at k = 100 and k = 150 is 58.23% in both cases, whereas the accuracy of the HOGNMDH feature is 61.42 and 65.81%, respectively. The improvement from MDH at k = 100 and k = 150 is thus 3.19 and 7.89%, respectively, which further verifies the effectiveness of motion correction. At the same time, the HNF feature shows almost no improvement as the cluster number k increases from 100 to 150. Finally, the confusion matrices of the HNF and HOGNMDH features are shown in Figs. 6a and b.

Conclusions
In this study, we propose a novel motion correction method and a motion descriptor called MDH. In MDH, the camera motion is estimated, and the relative motion is computed as the difference between the optical flow and the camera motion. To verify the effectiveness of the proposed motion correction method, MDH is built to recognise human motion. Experimental comparisons with other relative motion descriptors show that the proposed descriptor is effective for motion description under camera movement, and the motion correction method is useful for estimating real motion in moving-camera scenes. MDH is general enough for other action recognition approaches and other vision tasks. In the future, we will use a more robust and discriminative action recognition approach to achieve better performance.

Acknowledgments
The work was supported by the National Natural Science Foundation of China (no. 61502182) and the Natural Science Foundation of Fujian Province of China (nos. 2014J01249 and 2015J01253).