Action recognition using restricted dense trajectories

This paper presents an action recognition algorithm using restricted dense trajectories (RDT). In the feature extraction step, restricted dense trajectories are obtained by tracking refined points in the optical flow field, which removes most meaningless trajectories while preserving discriminative power. We then extract a new set of descriptors to capture the appearance and motion information of the trajectories. In the encoding step, we improve VLAD by assigning each descriptor to its K nearest visual words and employing these words as basis vectors to linearly approximate the descriptor under the minimum squared error criterion. Experimental results show that the proposed algorithm obtains state-of-the-art results.


Introduction
Action recognition is a very active area in the field of computer vision [1], owing to its broad range of applications, such as video surveillance, human-computer interaction, motion analysis and content-based video retrieval. Even though a large number of approaches have been proposed, it remains very challenging.
Building an efficient and discriminative representation of actions is the key step in an action recognition system. Wang et al. [2] tracked densely sampled points using optical flow fields at each scale separately and computed local descriptors within the trajectory-aligned volume. Furthermore, they proposed improved dense trajectories (IDT) [3], which provided state-of-the-art results on various datasets. However, a large number of feature points fall into the background area that is irrelevant to the action, resulting in a substantial number of redundant trajectories. Peng et al. [4] used a motion-boundary-based dense sampling strategy. Mukher et al. [5] introduced a 3D representation of IDT. In our work, we utilize an expanded human rectangular box to refine the dense points. Then, relative trajectory and spatio-temporal co-occurrence descriptors are extracted based on the refined dense trajectories, called restricted dense trajectories (RDT).
In recent years, VLAD [6] has become a widely used encoding method. However, it is difficult for VLAD to describe the distribution information of descriptors. Picard et al. [7] accumulated the second-order tensor product and proposed VLAT. Peng et al. [8] computed two kinds of high-order statistics as complementary information. Unlike their work, we assign each descriptor to its K nearest visual words and employ these words as basis vectors to linearly approximate it under the minimum squared error criterion. Using the resulting linear coefficients as weights, we compute VLAD over the multiple nearest words.

Restricted dense trajectories
We first use the deformable part-based model to detect humans in each frame. The expanded human rectangular box is then obtained by enlarging the detected box to 2 times its width and 1.5 times its height, as shown in Figure 1. Given the expanded human rectangular box, feature points within it are retained and the others are removed. We then track the remaining feature points to obtain the restricted dense trajectories.
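The box expansion and point filtering above can be sketched as follows. The 2x width and 1.5x height scale factors follow the text; the helper names, the expansion about the box center, and the clamping to frame bounds are our own assumptions.

```python
import numpy as np

def expand_box(x, y, w, h, frame_w, frame_h, scale_w=2.0, scale_h=1.5):
    """Expand a detected human box (x, y, w, h) about its center by the
    given scale factors, clamped to the frame boundaries.
    Returns (x0, y0, x1, y1) corner coordinates."""
    cx, cy = x + w / 2.0, y + h / 2.0
    new_w, new_h = w * scale_w, h * scale_h
    x0 = max(0.0, cx - new_w / 2.0)
    y0 = max(0.0, cy - new_h / 2.0)
    x1 = min(float(frame_w), cx + new_w / 2.0)
    y1 = min(float(frame_h), cy + new_h / 2.0)
    return x0, y0, x1, y1

def restrict_points(points, box):
    """Keep only the densely sampled points that fall inside the
    expanded box; the rest are discarded before tracking."""
    x0, y0, x1, y1 = box
    points = np.asarray(points, dtype=float)
    mask = ((points[:, 0] >= x0) & (points[:, 0] <= x1) &
            (points[:, 1] >= y0) & (points[:, 1] <= y1))
    return points[mask]
```

The retained points would then be handed to the usual median-filtered optical-flow tracker to form the restricted dense trajectories.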

Relative trajectory descriptor
We extract the geometric center P_t^c of the expanded human rectangular box and track it to obtain its trajectory. Given the trajectories of P_t^c and the feature points, the relative trajectory descriptor is defined in Equation (1).
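Since Equation (1) is not reproduced here, a rough sketch of the idea: the descriptor expresses the tracked point's positions relative to the tracked center P_t^c. The normalization by the total displacement magnitude is an assumption borrowed from the standard trajectory-shape descriptor, not the paper's exact formula.

```python
import numpy as np

def relative_trajectory(point_traj, center_traj, normalize=True):
    """Sketch of a relative trajectory descriptor: subtract the center's
    trajectory from the point's trajectory, then (optionally) normalize
    by the sum of per-frame magnitudes and flatten into a vector."""
    rel = np.asarray(point_traj, float) - np.asarray(center_traj, float)
    if normalize:
        norm = np.sum(np.linalg.norm(rel, axis=1))
        if norm > 0:
            rel = rel / norm
    return rel.ravel()
```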

Spatio-temporal co-occurrence descriptors
We extract our spatio-temporal co-occurrence descriptors in the cuboids aligned with each RDT. The size of each cuboid is N×N×L pixels. For an image block of size W×H and an offset (∆x, ∆y), the co-occurrence matrix is defined as Equation (2):

C(s, t) = Σ_{(i,j)} M(i, j) · M(i+∆x, j+∆y) · [Θ(i, j) = s] · [Θ(i+∆x, j+∆y) = t]    (2)

where s and t are quantized orientation labels of the gradient or optical flow, and M(i, j) and Θ(i, j) are respectively the magnitude and orientation of the gradient or optical flow at position (i, j).
Spatial co-occurrence HOG (SCoHOG) counts the gradient orientations of pixel pairs at a fixed offset within a single frame, and uses the resulting co-occurrence matrix as the image representation. SCoHOF counts the optical flow orientations of pixel pairs, and SCoMBH counts the optical flow gradient orientations of pixel pairs. We use two offsets (i.e., (2, 0) and (0, 2)) and quantize the orientation into 8 bins, with an additional bin for HOF.
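A minimal sketch of the co-occurrence counting in Equation (2). The magnitude-product weighting and the restriction to non-negative offsets are our assumptions; the indicator conditions on the quantized orientation labels follow the definition above.

```python
import numpy as np

def cooccurrence_matrix(mag, ori, offset, n_bins=8):
    """Co-occurrence matrix over pixel pairs at a fixed offset
    (di, dj). mag holds magnitudes; ori holds quantized orientation
    labels in 0..n_bins-1. Each valid pair (i, j), (i+di, j+dj)
    contributes the product of its two magnitudes to bin (s, t)."""
    di, dj = offset
    H, W = ori.shape
    C = np.zeros((n_bins, n_bins))
    for i in range(H - di):
        for j in range(W - dj):
            s, t = ori[i, j], ori[i + di, j + dj]
            C[s, t] += mag[i, j] * mag[i + di, j + dj]
    return C
```

Applying this with gradient maps gives SCoHOG, with optical flow maps SCoHOF, and with flow-gradient maps SCoMBH.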
Temporal co-occurrence HOG (TCoHOG) counts the gradient orientations of pixel pairs over time along the trajectory. The calculation of TCoHOF and TCoMBH is similar to that of the corresponding spatial features, except that their co-occurrence units are extracted along the temporal dimension. TCoHOF conveys changes in motion orientation, while TCoMBH reflects alterations in the flow field. The offset of the temporal co-occurrence units is set to ∆t = 2.

Improved VLAD encoding method
For a descriptor x_i, its K nearest visual words in the dictionary D are found to construct a sub-dictionary D_i = {d_1, d_2, ..., d_K}. These K words are employed as basis vectors to linearly approximate x_i under the minimum squared error criterion. Using the resulting linear coefficients as weights, we compute VLAD on the K nearest words. The steps of our improved VLAD (IVLAD) are as follows:
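Since the algorithm box is not reproduced here, a minimal sketch of the IVLAD procedure described above, assuming unconstrained least-squares for the coefficients and the usual power plus L2 normalization at the end:

```python
import numpy as np

def ivlad_encode(X, D, K=5):
    """Improved VLAD sketch. For each descriptor x: find its K nearest
    visual words, solve for the least-squares coefficients over that
    sub-dictionary, and use each coefficient to weight the residual
    accumulated at the corresponding word."""
    n_words, dim = D.shape
    V = np.zeros((n_words, dim))
    for x in X:
        dists = np.linalg.norm(D - x, axis=1)
        nn = np.argsort(dists)[:K]            # K nearest visual words
        sub = D[nn]                           # sub-dictionary D_i = {d_1, ..., d_K}
        # coefficients w minimizing ||sub.T @ w - x||^2
        w, *_ = np.linalg.lstsq(sub.T, x, rcond=None)
        for k, idx in enumerate(nn):
            V[idx] += w[k] * (x - D[idx])     # weighted residual
    V = V.ravel()
    V = np.sign(V) * np.sqrt(np.abs(V))       # power normalization
    norm = np.linalg.norm(V)
    return V / norm if norm > 0 else V
```

With K = 1 and unit weights this reduces to standard VLAD; the soft assignment over K words is what lets the encoding capture more of the local descriptor distribution.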

Comparison to IDT descriptor
In this section, we compare the RDT descriptors to the original trajectory descriptors using the standard BoVW framework, as shown in Table 1. Dictionaries are constructed for each type of descriptor separately, and the number of visual words in every dictionary is fixed to 4000. The relative trajectory descriptor improves performance on all three datasets compared with the trajectory shape descriptor. The improvements on YouTube and HMDB51 are 3.5% and 4.3% respectively, which indicates the validity of the trajectory's spatial position. Since they introduce spatio-temporal co-occurrence information, our two combinations of descriptors outperform the original combination on all three datasets, which demonstrates that they can encode deeper-level appearance and motion information.

Table 2 presents the comparison of our IVLAD with FV and VLAD. We employ two feature extraction schemes for each encoding method: the original IDT and our RDT. The dictionary size for FV and IVLAD is set to 256, as is common in the literature. For VLAD, we evaluate dictionary sizes of both 256 and 512 visual words to make the comparison fairer. The experimental results verify the importance of distribution information for enhancing action recognition. We can also see that our restricted dense trajectory descriptors yield better results than the dense trajectory descriptors on all three datasets. The best results, integrating RDT with IVLAD, are 98.1%, 89.9% and 70.5%, achieving the state of the art.

Conclusion
Our restricted dense trajectories remove most of the meaningless trajectories and capture spatio-temporal contextual information, providing an efficient and discriminative representation of actions. Regarding the encoding method, we introduce an improved VLAD to encode the distribution information of descriptors. The experimental results demonstrate the effectiveness of our approach.