
Image and Vision Computing

Volume 32, Issue 9, September 2014, Pages 616-628

Motion boundary based sampling and 3D co-occurrence descriptors for action recognition

https://doi.org/10.1016/j.imavis.2014.06.011

Highlights

  • A motion boundary based sampling strategy is proposed for dense trajectories.

  • A set of 3D co-occurrence descriptors is developed to describe cuboids.

  • Two decomposition strategies are presented to further improve performance.

  • We achieve state-of-the-art results on several human action datasets.

Abstract

Recent studies have witnessed the success of Bag-of-Features (BoF) frameworks for video-based human action recognition. The detection and description of local interest regions are two fundamental problems in the BoF framework. In this paper, we propose a motion boundary based sampling strategy and spatial-temporal (3D) co-occurrence descriptors for action video representation and recognition. Our sampling strategy is partly inspired by the recent success of dense trajectory (DT) based features [Wang et al., 2013] for action recognition. Compared with DT, we densely sample spatial-temporal cuboids along motion boundaries, which greatly reduces the number of valid trajectories while preserving discriminative power. Moreover, we develop a set of 3D co-occurrence descriptors which take the spatial-temporal context within local cuboids into account and deliver rich information for recognition. Furthermore, we decompose each 3D co-occurrence descriptor at the pixel level and the bin level and integrate the decomposed components with a multi-channel framework, which improves the performance significantly. To evaluate the proposed methods, we conduct extensive experiments on three benchmarks: KTH, YouTube and HMDB51. The results show that our sampling strategy significantly reduces the computational cost of point tracking without degrading performance. Meanwhile, we achieve performance superior to the state-of-the-art methods, reporting 95.6% on KTH, 87.6% on YouTube and 51.8% on HMDB51.

Introduction

Automatic recognition of human actions in videos has been an active research area in recent years due to its wide range of potential applications, such as smart video surveillance, video indexing, and human–computer interaction. Although various approaches have been proposed and significant progress has been made, action recognition remains a challenging task due to the high dimensionality and complexity of video data, large intra-class variations, clutter, occlusion and other fundamental difficulties [2].

A fundamental problem in action recognition is how to represent an action video. The approaches for action video representation can be roughly divided into five categories: (1) dynamic model based approaches which apply statistical sequential models such as HMM and Bayesian network to describe the temporal states of actions [3], [4]; (2) human pose based approaches which utilize pose structure information [5], [6]; (3) global action template based approaches which construct global templates to capture appearance and motion information of the whole motion body [7], [8], [9]; (4) local feature based approaches which mainly extract spatial-temporal cuboids [10], [11], [12], [13], [14], [15], [16], [17] or motion parts [18], [19]; and (5) supervised feature learning based methods which learn the representation by hierarchical networks or other models [20], [21], [22], [23].

Among the state-of-the-art methods, the representation of local spatial-temporal features within the Bag-of-Features (BoF) framework [24] is perhaps the most popular and successful for action recognition. Local features are usually obtained by cuboid detectors and descriptors. Laptev [25] developed the space-time interest point (STIP) detector by extending the Harris detector to the 3D domain. Dollár et al. [10] detected space-time salient points by applying 2D spatial Gaussian and 1D temporal Gabor filters. Willems et al. [26] utilized the Hessian matrix to extract scale-invariant spatial-temporal interest points in videos. Wang et al. [14] densely sampled cuboids at regular positions and scales. Well-known descriptors include HOG/HOF [11], Cuboids [10], HOG3D [13], 3D-SIFT [27], and so on.

Recently, Wang et al. [15] proposed dense trajectories for sampling spatial-temporal interest points and introduced a novel descriptor named the motion boundary histogram (MBH) for action recognition. The motion boundary is defined by the gradient magnitude of optical flow and was initially introduced in the context of human detection [28]. Extensive experiments on nine popular human action datasets have demonstrated the excellent performance of this approach [1]. Despite its great power, the DT based representation is expensive in memory and computation due to the large number of densely sampled trajectories.

In this paper, we first develop a motion boundary based sampling strategy named DT-MB to reduce the computation and storage cost of the previous DT based method. We start from patches densely sampled on a regular grid in each frame. Meanwhile, the motion boundary (Fig. 1) is derived from optical flow and a binary mask is estimated from it. We then remove sampled regions that have little overlap with the foreground of the mask. The central points of the remaining patches are refined by averaging the locations of the foreground pixels within each patch. Our DT-MB is partly motivated by the fact that trajectories on the motion boundary are the most meaningful ones, which is also implied by the superior performance of the MBH descriptor [1]. Using our sampling method, the number of DTs can be sharply reduced without hurting the performance.
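As a rough illustration of this sampling step (not the authors' implementation), the sketch below estimates a motion-boundary mask from dense optical flow, keeps only the grid patches that sufficiently overlap the mask, and refines each kept point to the foreground centroid of its patch. The Farneback flow, the threshold, the grid step and the overlap ratio are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np
import cv2


def motion_boundary_mask(prev_gray, curr_gray, thresh=0.5):
    """Estimate a binary motion-boundary mask from dense optical flow.

    The motion boundary is taken as the gradient magnitude of the two
    flow components (cf. the MBH descriptor); pixels above `thresh`
    (an illustrative value) are treated as foreground.
    """
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag = np.zeros(prev_gray.shape, dtype=np.float32)
    for c in range(2):  # x- and y-components of the flow
        gx = cv2.Sobel(flow[..., c], cv2.CV_32F, 1, 0)
        gy = cv2.Sobel(flow[..., c], cv2.CV_32F, 0, 1)
        mag = np.maximum(mag, np.sqrt(gx ** 2 + gy ** 2))
    return mag > thresh


def sample_points_on_motion_boundary(mask, step=5, min_overlap=0.1):
    """Densely sample grid patches, keep only those overlapping the
    motion-boundary foreground, and refine each kept point to the
    centroid of the foreground pixels inside its patch."""
    h, w = mask.shape
    points = []
    for y in range(0, h - step, step):
        for x in range(0, w - step, step):
            patch = mask[y:y + step, x:x + step]
            if patch.mean() < min_overlap:   # too little motion boundary
                continue
            ys, xs = np.nonzero(patch)       # foreground pixels in patch
            points.append((x + xs.mean(), y + ys.mean()))
    return np.array(points, dtype=np.float32)
```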

In addition, to further enhance the discriminative power of the DT based representation, we propose a set of spatial-temporal (3D) co-occurrence descriptors to describe the local appearance and motion information along trajectories. This is partly inspired by the success of co-occurrence features in the image domain [29], [30], [31]. In [30], a descriptor based on co-occurrence HOG (CoHOG) is presented for human detection. In [29], the gray-level co-occurrence matrix (GLCM) is introduced to extract textural features for image classification. Our motivation is that spatial-temporal co-occurrence features, which capture the fine-grained local context of motion and appearance in videos, can provide important cues for action recognition. The proposed descriptors comprise 3D-CoHOG, 3D-CoHOF and 3D-CoMBH. We find that (1) 3D-CoHOG depicts the more complex spatial structure of a patch and its appearance changes over time; (2) 3D-CoHOF conveys complex motion structure and changes of motion direction; and (3) 3D-CoMBH captures the complex gradient structure of optical flow and changes of the flow gradient orientations. Furthermore, we thoroughly exploit two types of multi-channel pipelines for these descriptors, namely a pixel level pipeline and a bin level pipeline. Taking 3D-CoHOG in a given trajectory-aligned cuboid as an example, we set several offsets along the horizontal, vertical and temporal axes for each point, and the co-occurrence matrices of all offsets are vectorized and concatenated to form the 3D co-occurrence descriptor. For the pixel level multi-channel scheme of 3D-CoHOG, we vectorize the co-occurrence matrix of each offset, model each one individually with the BoF pipeline, and then combine all the BoF pipelines with a multi-channel kernel SVM. For the bin level scheme, we split the co-occurrence matrices of each offset into several channels by their co-occurrence bins. The idea of using multi-channels for 3D co-occurrence descriptors is partly inspired by the fact that MBHx and MBHy perform differently [1], and such complementarities can be better exploited in a multi-channel way as shown in [32].
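To make the construction concrete, the following sketch builds the co-occurrence matrices underlying a 3D-CoHOG-style descriptor for one trajectory-aligned cuboid: per-pixel gradient orientations are quantized into bins and, for each (dt, dy, dx) offset, a bin-pair co-occurrence matrix is accumulated. The bin count, the example offset set and the L1 normalization are illustrative assumptions, not the paper's exact parameters.

```python
import numpy as np


def orientation_bins(cuboid, n_bins=8):
    """Quantize the per-pixel spatial gradient orientation of a cuboid
    (T x H x W array) into `n_bins` discrete labels."""
    gy, gx = np.gradient(cuboid.astype(np.float32), axis=(1, 2))
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)
    return np.minimum((ang / (2 * np.pi) * n_bins).astype(int), n_bins - 1)


def co_occurrence_3d(labels, offsets, n_bins=8):
    """For each (dt, dy, dx) offset, count how often bin i at a pixel
    co-occurs with bin j at the offset position; returns one
    n_bins x n_bins matrix per offset."""
    T, H, W = labels.shape
    mats = []
    for dt, dy, dx in offsets:
        a = labels[max(0, -dt):T - max(0, dt),
                   max(0, -dy):H - max(0, dy),
                   max(0, -dx):W - max(0, dx)]
        b = labels[max(0, dt):T - max(0, -dt),
                   max(0, dy):H - max(0, -dy),
                   max(0, dx):W - max(0, -dx)]
        m = np.zeros((n_bins, n_bins), dtype=np.float32)
        np.add.at(m, (a.ravel(), b.ravel()), 1)
        mats.append(m / max(m.sum(), 1))  # L1-normalize each matrix
    return mats


# Example offsets along the horizontal, vertical and temporal axes
# (illustrative choice): right neighbor, lower neighbor, next frame.
example_offsets = [(0, 0, 1), (0, 1, 0), (1, 0, 0)]
```

Under this view, the pixel level channels would keep one vectorized matrix per offset as a separate BoF channel, while the bin level channels would further split each matrix by its co-occurrence bins before encoding.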

To evaluate our sampling strategy and the proposed descriptors, we perform action classification with a standard BoF framework and a kernel SVM classifier [11] on three widely-used datasets, namely KTH [33], YouTube [34] and HMDB51 [35]. Our framework is illustrated in Fig. 1. We compare our DT-MB sampling strategy with the original dense trajectory sampling [1] in terms of computation and memory cost, and we evaluate the improvement of our new descriptors over the original HOG, HOF and MBH [1]. Furthermore, we provide a theoretical analysis of the advantages of using co-occurrence features.
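For reference, a minimal sketch of the multi-channel kernel classification stage is given below, assuming the common multi-channel chi-square kernel formulation used with BoF histograms (e.g., in the spirit of [11]). The per-channel normalization by the mean pairwise distance and the scikit-learn usage shown in the comments are assumptions for illustration, not the paper's exact setup.

```python
import numpy as np


def chi2_distance(h1, h2, eps=1e-10):
    """Chi-square distance between two L1-normalized BoF histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))


def multichannel_kernel(X_channels, Y_channels):
    """Multi-channel chi-square kernel of the form
    K(x, y) = exp(-sum_c D_c(x, y) / A_c),
    where D_c is the chi-square distance for channel c and A_c is the
    mean pairwise distance within that channel (estimated here on the
    Y side, i.e. the training histograms). Each element of X_channels
    and Y_channels is an (n_samples, vocab_size) histogram matrix."""
    n, m = X_channels[0].shape[0], Y_channels[0].shape[0]
    total = np.zeros((n, m))
    for Xc, Yc in zip(X_channels, Y_channels):
        D = np.array([[chi2_distance(x, y) for y in Yc] for x in Xc])
        A = max(np.mean([chi2_distance(a, b) for a in Yc for b in Yc]), 1e-10)
        total += D / A
    return np.exp(-total)


# Illustrative usage with a precomputed-kernel SVM:
# from sklearn.svm import SVC
# K_train = multichannel_kernel(train_channels, train_channels)
# clf = SVC(kernel="precomputed").fit(K_train, train_labels)
# K_test = multichannel_kernel(test_channels, train_channels)
# predictions = clf.predict(K_test)
```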

The main contributions of this paper are summarized as follows:

  • 1) We develop a motion boundary based sampling strategy to reduce the number of dense trajectories, which saves memory and computation without degrading performance;

  • 2) We propose a set of 3D co-occurrence descriptors, namely 3D-CoHOG, 3D-CoHOF and 3D-CoMBH, which depict the spatial-temporal contextual information within local cuboids;

  • 3) We present two decomposition strategies for the 3D co-occurrence descriptors (pixel level and bin level) and integrate the decomposed components with a multi-channel framework, which further improves the performance;

  • 4) We achieve state-of-the-art results on several widely-used human action datasets.

It is worth noting that our new descriptors are independent of the spatial-temporal cuboid detector (e.g., DT [1], STIP [25], dense cuboids [14]). Though we mainly discuss the proposed descriptors with dense trajectories, one can easily use them with other detectors as well. The analysis and results presented here extend our preliminary work in BMVC 2013 [36]: we develop more general spatial-temporal co-occurrence descriptors and further improve the performance by exploiting their multi-channel versions. We also provide an information-theoretic analysis to validate the advantages of using co-occurrence descriptors.

The rest of this paper is organized as follows. In Section 2, we give a brief review of the dense trajectory based method and present our DT-MB method in detail. In Section 3, we present our 3D co-occurrence descriptors. The two decomposition strategies for the 3D co-occurrence descriptors are presented in Section 4. Section 5 shows the experimental results and gives a comprehensive comparison of each individual descriptor. We conclude our work in Section 6.

Section snippets

Dense trajectories on motion boundary

In this section, we first give a brief review of the dense trajectory method [1] and explain the advantage of DT from the viewpoint of the human visual fixation system. Then, we present our new motion boundary based sampling strategy in detail.

3D co-occurrence descriptors

Generally, strong correlations exist among spatial-temporal neighborhoods of pixels. Traditional HOG, HOF and MBH descriptors are histograms accumulated pixel-wise, which ignore the correlations between pairs of pixels. To jointly encode the spatial-temporal correlations of pixels, we present 3D co-occurrence descriptors consisting of 3D-CoHOG, 3D-CoHOF and 3D-CoMBH.

Multi-channels of 3D co-occurrence descriptors

In this section, we first present the multi-channel scheme at the pixel level for 3D co-occurrence descriptors. Then, we revisit our previous spatial-temporal context descriptors. Finally, we give the details of the bin level multi-channel scheme.

Experiments

We evaluate the performance of the proposed methods on three popular human action datasets, namely KTH [33], YouTube [34] and HMDB51 [35]. In this section, we first give a brief introduction to these datasets, then compare the performance and complexity of DT and DT-MB, and finally give a comprehensive comparison between our descriptors and other descriptors.

Conclusion

This paper first introduced a new dense sampling strategy (i.e., DT-MB) for dense trajectories. This scheme constrains the sampled points to the motion boundary, which significantly reduces memory and time cost without degrading performance. Another important contribution is a set of 3D co-occurrence descriptors, namely 3D-CoHOG, 3D-CoHOF and 3D-CoMBH, which depict the spatial-temporal contextual information within local cuboids. We also exploit these 3D-Co descriptors by using two decomposition strategies (pixel level and bin level) and integrating the decomposed components within a multi-channel framework.

Acknowledgments

This work is partly supported by the construct program of the key discipline in Hunan province, Natural Science Foundation of China (91320101, 60972111), Shenzhen Basic Research Program (JC201005270350A, JCYJ20120903092050890, JCYJ20120617114614438), 100 Talents Program of CAS, and Guangdong Innovative Research Team Program (201001D0104648280).

References (46)

  • R. Poppe

    A survey on vision-based human action recognition

    Image Vis. Comput.

    (2010)
  • H. Wang et al.

    Dense trajectories and motion boundary descriptors for action recognition

    IJCV

    (2013)
  • J. Yamato et al.

    Recognizing human action in time-sequential images using hidden Markov model

  • T. Starner et al.

    Real-time American sign language recognition from video using hidden Markov models

  • F. Lv et al.

    Single view human action recognition using key pose matching and Viterbi path searching

  • A. Yao et al.

    Does human action recognition benefit from pose estimation?

  • A.F. Bobick et al.

    The recognition of human movement using temporal templates

    TPAMI

    (2001)
  • M. Blank et al.

    Actions as space-time shapes

  • S. Sadanand et al.

    Action bank: a high-level representation of activity in video

  • P. Dollár et al.

    Behavior recognition via sparse spatio-temporal features

  • I. Laptev et al.

    Learning realistic human actions from movies

  • J.C. Niebles et al.

    Unsupervised learning of human action categories using spatial-temporal words

  • A. Klaser et al.

    A spatio-temporal descriptor based on 3D-gradients

  • H. Wang et al.

    Evaluation of local spatio-temporal features for action recognition

  • H. Wang et al.

    Action recognition by dense trajectories

  • X. Wang et al.

    A comparative study of encoding, pooling and normalization methods for action recognition

  • S. Feng et al.

    Sampling strategies for real-time action recognition

  • L. Wang et al.

    Motionlets: mid-level 3D parts for human motion recognition

  • A. Jain et al.

    Representing videos using mid-level discriminative patches

  • G.W. Taylor et al.

    Convolutional learning of spatio-temporal features

  • M. Baccouche et al.

    Sequential deep learning for human action recognition

  • Q.V. Le

    Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis

  • S. Ji et al.

    3D convolutional neural networks for human action recognition

    TPAMI

    (2013)

    This paper has been recommended for acceptance by Ahmed Elgammal.
