1 Introduction

Human action recognition has been an area of keen interest to computer vision researchers for the past few decades. The scope of the problem keeps expanding, and researchers have proposed solutions to tackle scale, appearance, illumination and orientation variations, as well as occlusions. Action recognition spans a wide spectrum of applications such as autonomous video surveillance, detection of abnormal events, analysis of human behavior, video retrieval, and human-computer interaction. The aim of our work is to perform real-time action recognition for a large-scale surveillance system. The algorithm primarily targets real-time fixed-camera surveillance and closed-circuit TV applications.

A large body of research has been reported to date on recognizing human actions in the pixel domain. However, most of those algorithms achieve recognition speeds of at most 25 fps. Applications such as surveillance, which demand low memory utilization and high speed, cannot employ those techniques owing to their high computational load. Analyzing videos in the compressed domain, on the other hand, offers real-time performance. The latest video compression standards such as HEVC [23] and H.264/AVC [27] provide better-compressed videos with reduced bandwidth utilization and low storage demand, making them ideal for real-time applications. Compressed domain parameters such as motion vectors (MV) indicate how motion was compensated while exploiting temporal frame redundancy during encoding. Additionally, the Quantization Parameter (QP) plays a major role in regulating the bit rate: QP values can be adjusted to maintain a constant bit rate within the allowed channel bandwidth. Hence, real-time encoders rely heavily on varying the QP values to control the trade-off between video quality and compression.

In this work, we propose a novel algorithm for action recognition utilizing cues from Quantization Parameters and Motion Vectors in the H.264/AVC compressed video sequence. The spatial gradient of the QP is used to form a QP Gradient Image (QGI), which provides a vital clue regarding where motion occurs and how far the action spreads. This information, along with the magnitude and orientation of the motion vectors, is utilized for feature extraction. The paper is structured as follows. Related work on action recognition in the pixel and compressed domains is presented in Section 2. The proposed approach is discussed in detail in Section 3. Section 4 presents the experimental results and discussion. Concluding remarks and future work are stated in Section 5.

2 Related work

Human action recognition has been one of the major research areas in computer vision for many years. In this section we briefly describe a few approaches from the pixel domain and the compressed domain.

Pixel domain approaches

Pixel domain approaches can be broadly classified into three categories based on the way the action is represented: image/template, spatio-temporal, and interest point based representations. This section presents a brief review of pixel domain approaches from these categories; more details on action recognition techniques can be found in the survey papers by Poppe [19] and Weinland et al. [26].

Image/template based representation

Bobick et al. [6] used the motion energy image (MEI), which indicates the location of motion, along with the motion history image (MHI), which indicates the silhouette motion over time, for representing actions. Hu moments extracted from the MEI and MHI are used for classifying the actions.

Spatio-temporal based representation

Yu et al. [30] proposed an approach which utilizes local space-time volumes as a discriminative codebook and captures the structural information of actions using a pyramidal spatio-temporal relationship match (PSRM) for classification. Li et al. [15] presented a solution based on luminance field manifold trajectory analysis and achieved recognition performance comparable to that of state-of-the-art techniques. Amiri et al. [1] extracted features from spatio-temporal video patches and used a Support Vector Machine (SVM) to distinguish different actions; they evaluated their approach on the KTH and IXMAS datasets and showed that it outperforms state-of-the-art techniques. Chin et al. [16] explored the self-similarities among frames of the video using instance-specific features and class-consistent features to capture within-class similarities. Sadek et al. [20] presented a method for action recognition using a finite set of features derived directly from difference frames of an action snippet. Xu et al. [31] extracted features by forcing similar local features in different positions on the image grid to be assigned to different visual words, and then used a latent topic model to classify the videos. Gorelick et al. [10] introduced space-time action shapes for representing actions; various space-time features were extracted from the properties of the solution to the Poisson equation and used to recognize the actions. Efros et al. [9] modeled the relative movement of various object parts by computing optical flow between person-centered image windows. The object motion is described by the positive and negative components of the vertical and horizontal flow vectors, and a set of pre-classified actions is used as a dictionary for recognizing query actions in a k-nearest neighbor framework.

Interest point based representation

Laptev [14] used space-time interest points for modeling actions in a bag-of-words framework, where the interest points are described by histograms of oriented (spatial) gradients (HOG) and histograms of optical flow (HOF). Heng [25] considered various interest point based approaches and evaluated their performance on 25 action classes over three diverse datasets. Wu et al. [28] presented an alternative local feature named the Heat Kernel Structural Descriptor (HKSD), based on the heat diffusion process of the cuboid around the interest points, and employed an SVM for action recognition. This approach provides better robustness to viewpoint changes and background noise.

Compressed domain approaches

Compressed domain approaches, on the other hand, are more feasible for real-time applications owing to their faster performance. However, comparatively little research has been done on recognizing human actions in compressed videos. Compressed domain approaches can be further classified into motion vector based, DCT based, and combined approaches.

Motion vector based approaches

Ozer et al. [18] proposed a hierarchical approach for human activity detection and recognition in MPEG compressed sequences. Body parts were segmented out and Principal Component Analysis was performed on the segmented motion vectors prior to classification. However, the performance of the algorithm depends solely on the temporal duration of the activities. Another notable work was put forward by Babu et al. [2] in the MPEG compressed domain: features extracted from motion vectors were fed to a Hidden Markov Model (HMM) for classification. Seven actions were each trained with a distinct HMM and the recognition accuracy was found to be more than 90 %. Babu et al. [3] later proposed a method using the motion flow history (MFH) and motion history image in the MPEG compressed domain. Various features were extracted from the static MHI and MFH images: histograms of the horizontal and vertical components of the MVs were utilized to form a projected-1D feature, and a 2D polar feature was developed using the histogram of the magnitude and orientation of the MVs. The extracted features were used to train different types of classifiers, including KNN, neural network, SVM and Bayes classifiers, for recognizing the set of seven human actions.

DCT based approaches

Later, Liu et al. [15] recognized actions by creating an eigen-space representation of human silhouettes obtained from AC-DCT coefficients. However, the method used both compressed and uncompressed domain parameters: low-resolution compressed domain data was connected with high-level semantics in the spatial domain to achieve real-time performance. Frames with specific postures were stored and the global activity of the human body was estimated. This information was then used as an input in the pixel domain for gesture/action recognition. The first step retrieved candidate frames in the compressed domain where people are present; the required region was then extracted and analyzed in the pixel domain for posture recognition.

DCT and MV based approaches

Yeo et al. [8] proposed a high-speed algorithm for action recognition and localization in MPEG streams, based on computing a motion correlation measure that utilizes differences in motion orientation and magnitude. The approach computes a coarse estimate and a confidence map of the optical flow using MVs and DCT coefficients. However, the algorithm cannot handle scale variations, and once the optical flow is formed the approach is equivalent to a pixel-based algorithm; hence its computational complexity is comparable to that of pixel domain approaches.

In this paper, we propose a novel algorithm for activity recognition using information from QPs, MB partition types, and MVs in H.264/AVC compressed video. The gradient of the QP over space is utilized to form the QP Gradient Image (QGI), which provides a vital clue regarding where motion occurs and how far the action spreads. This information, along with the magnitude and orientation of the motion vectors, is utilized for feature extraction. To the best of our knowledge, this is the first work on action recognition in the H.264/AVC compressed domain.

3 The proposed approach

An overview of the proposed approach is shown in Fig. 1. First, the QP delta (the difference in QP values of adjacent macroblocks) and the MV values are extracted by partially decoding the input H.264 bit-stream. The QGI is then formed from the QP delta values and is used to generate the QGI projection profile and spread features. In addition, the motion accumulated projection profile and the histogram of magnitude-weighted orientation of motion vectors are formed as features from the extracted MVs. All the individual features, after weighting, are combined to form the final feature vector. Classification is then performed using Support Vector Machines (SVM).

Fig. 1 Overview of the Proposed Approach

3.1 Action representation

The action is represented in the following ways: (i) QP Gradient Image and (ii) Motion Accumulation. Features extracted from these representations are used for classifying the actions.

QP gradient image (QGI)

In the H.264/AVC standard, each macroblock has its own QP value. The QP values change minimally across macroblocks (in raster-scan order) when the information content is low, whereas significant shifts in QP values are observed when the information level is high. The higher the change in QP value, the higher the gradient, which is a direct measure of action occurrence in the respective macroblock pair. The MBs in the action area will have non-zero QP-delta and are classified as action-MBs. Spurious MBs with no action information are removed by connected component analysis. This step is repeated for all P frames and provides an initial action segmentation with MB-level precision.
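
As an illustration, a minimal sketch of this stage is given below, assuming the per-macroblock QP-delta values of one P frame are already available as a 2-D array (the function and parameter names are ours); connected component analysis is done with scipy.ndimage.

```python
import numpy as np
from scipy import ndimage

def action_mbs(qp_delta, min_blob=2):
    """Mark MBs with non-zero QP-delta as action-MBs and drop spurious
    isolated blobs via connected component analysis.

    qp_delta : 2-D array (frame_height/16 x frame_width/16) of QP-delta
               values for one P frame (assumed to be already extracted).
    min_blob : smallest connected component kept (an assumed threshold).
    """
    grad = np.abs(qp_delta)
    mask = grad > 0                              # non-zero QP-delta => action-MB
    labels, n = ndimage.label(mask)              # connected components of action-MBs
    for lbl in range(1, n + 1):
        if (labels == lbl).sum() < min_blob:     # spurious blob with no real action
            mask[labels == lbl] = False
    return grad * mask                           # QP gradient restricted to action-MBs
```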

A weighted accumulation of the filtered action-MBs [24] over time is then performed to incorporate the temporal information. The frames are accumulated in the forward and backward temporal directions. Even in real-time applications, a small buffer of future frames is available, which justifies the forward accumulation. The forward (F_a) and backward (B_a) accumulation factors have been selected empirically as 2 frames each.

The action-edge accumulated output for the current frame, P_acc(T), is given by:

$$ P_{acc}(T)=\sum_{k=0}^{B_a} w(T-k)\, P(T-k)+\sum_{k=1}^{F_a} w(T+k)\, P(T+k) $$
(1)

where P(T − k) and P(T + k) denote the output of the initial QP gradient stage for a frame that is k frames away from the current frame P(T) in the backward and forward temporal directions respectively. Similarly, w(T − k) and w(T + k) denote the corresponding weights, which decrease linearly on either side of the current frame (T) as shown in Table 1. To generate the final QGI for a group of frames, the action-edge MBs of each frame are again accumulated temporally. Specifically, for a group of k frames starting from frame t, the QGI (Fig. 2) is defined as:

Table 1 Temporal distribution of weights
Table 2 Per-video confusion matrix (with scale variations) on KTH
Table 3 KTH Database Speed Results
Fig. 2 QP Gradient Image of the actions on the Weizmann dataset [5]

$$ QGI(t, k)=\sum_{j=0}^{k-1} P_{acc}(t+j) $$
(2)

The QGI value at macroblock location [l, m] is denoted QGI(t, k, l, m).
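
For concreteness, the temporal accumulation of Eqs. (1) and (2) can be sketched as follows. The exact weight values are those of Table 1, so the linearly decreasing weights used here (1.0, 0.75, 0.5) are only an assumption.

```python
import numpy as np

def accumulate(frames, t, Ba=2, Fa=2, w=(1.0, 0.75, 0.5)):
    """Weighted backward/forward accumulation around frame t (Eq. 1).
    frames : list of 2-D outputs P(.) of the QP-gradient stage.
    w      : weights for temporal offsets 0, 1, 2 (assumed values)."""
    acc = np.zeros_like(frames[t], dtype=float)
    for k in range(Ba + 1):                       # current and backward frames
        if t - k >= 0:
            acc += w[k] * frames[t - k]
    for k in range(1, Fa + 1):                    # forward frames
        if t + k < len(frames):
            acc += w[k] * frames[t + k]
    return acc

def qgi(frames, t, k):
    """QGI for a group of k frames starting at frame t (Eq. 2)."""
    return sum(accumulate(frames, t + j) for j in range(k))
```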

Motion accumulation

The magnitude and orientation of the motion vector of a sub-macroblock (4 × 4 block in H.264/AVC) at location [l, m] in frame t, with horizontal and vertical components u and v respectively, are given by:

$$ M_t(l, m)=\sqrt{u^2+v^2} $$
(3)
$$ A_t(l, m)=\operatorname{atan}\left(\frac{v}{\left|u\right|+\varepsilon}\right) $$
(4)

where atan(x) returns the inverse tangent (arctangent) of x in radians and ε is a small constant used only for numerical stability. For real x, atan(x) lies in the range [−π/2, π/2]. We use the absolute value of the horizontal component of the motion vector to attain left-right symmetry of actions. The magnitude of the motion vector of each sub-macroblock is accumulated temporally over k frames as given by:

$$ MA_t(l, m)=\sum_{j=0}^{k-1} M_{t+j}(l, m) $$
(5)

The accumulated motion is used to form the horizontal and vertical projection profiles (Section 3.2.2) that represent each action.
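
A sketch of Eqs. (3)–(5), assuming the per-block MV components on the 4 × 4 sub-macroblock grid of each frame have been extracted (array names are ours):

```python
import numpy as np

def mv_magnitude_orientation(mv_u, mv_v, eps=1e-6):
    """Per-block MV magnitude and orientation (Eqs. 3 and 4).
    Using |u| gives left-right symmetry of actions."""
    mag = np.sqrt(mv_u.astype(float) ** 2 + mv_v.astype(float) ** 2)
    ang = np.arctan(mv_v / (np.abs(mv_u) + eps))   # orientation in [-pi/2, pi/2]
    return mag, ang

def motion_accumulation(magnitudes):
    """Temporal accumulation of MV magnitude over k frames (Eq. 5).
    magnitudes : list of k magnitude maps for consecutive frames."""
    return np.sum(magnitudes, axis=0)
```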

3.2 Feature extraction

3.2.1 QGI features

QGI spread feature

It can be observed from the QGIs in Fig. 2 that the horizontal spread (HS) and vertical spread (VS) of an action carry discriminative information. The VS is larger for actions like jack and pjump. For hand waving actions the spread is localized to the upper torso, so the VS is smaller. Actions wave1 and wave2 can be distinguished by HS even though their VS values are comparable. Actions like bend, walk and run have comparable VS, but bend can be distinguished from run or walk by also adding HS as a feature. The spread feature can thus distinguish between actions like bend, jack, pjump, wave1 and wave2; however, the confusion between skip, side, run, walk and jump remains, as their spreads in both dimensions are comparable. The spread is normalized by the maximum of the frame height and width (in MBs) before being added to the final feature vector.
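
A minimal sketch of the spread feature, assuming the QGI is available as a 2-D array over the MB grid (the bounding-box reading of "spread" is our interpretation of the description above):

```python
import numpy as np

def qgi_spread(qgi_map):
    """Horizontal (HS) and vertical (VS) spread of the action in the QGI,
    normalized by the maximum of the frame height and width in MBs."""
    rows, cols = np.nonzero(qgi_map)
    if rows.size == 0:                      # no action-MBs in this group
        return 0.0, 0.0
    vs = rows.max() - rows.min() + 1        # vertical extent in MBs
    hs = cols.max() - cols.min() + 1        # horizontal extent in MBs
    norm = max(qgi_map.shape)               # max(frame height, width) in MBs
    return hs / norm, vs / norm
```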

QGI projection profile (QGIP) feature

The horizontal and vertical QGI projection profiles (Fig. 3) of each action group of k frames starting from frame t, for a video with frame width w and frame height h (in pixels), are given by:

Fig. 3 Projection Profiles for the action ‘Both hands waving (wave2)’

$$ QGIP_{horiz}(t, k)=\sum_{j=1}^{w/16} QGI(t, k, l+j, m) $$
(6)
$$ QGIP_{vert}(t, k)=\sum_{i=1}^{h/16} QGI(t, k, l, m+i) $$
(7)
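
A sketch of Eqs. (6) and (7) on a QGI stored as a 2-D numpy array; which array axis corresponds to the horizontal profile is our assumption:

```python
def qgi_projections(qgi_map):
    """QGI projection profiles (Eqs. 6 and 7): each profile sums the QGI
    along one dimension of the MB grid."""
    horiz = qgi_map.sum(axis=1)   # sum over MB columns -> one value per row
    vert = qgi_map.sum(axis=0)    # sum over MB rows    -> one value per column
    return horiz, vert
```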

3.2.2 Motion accumulated projection profile (MAP) feature

The motion-accumulated projection profiles are given by:

$$ MAP_{horiz}(t, k)=\sum_{j=1}^{w/4} MA_t(l+j, m) $$
(8)
$$ MAP_{vert}(t, k)=\sum_{i=1}^{h/4} MA_t(l, m+i) $$
(9)

The horizontal and vertical profiles, after sub-sampling and normalization, are concatenated to form the projection feature. In our experiments we kept only every fourth sample of each MAP profile to minimize the feature vector length. The projection feature (Fig. 3) captures where the motion occurred and to what extent.
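
A corresponding sketch for the MAP feature (Eqs. 8 and 9), including the sub-sampling and normalization step; the use of L2 normalization is an assumption, since the paper does not state which norm is used:

```python
import numpy as np

def map_feature(motion_acc, step=4):
    """Motion-accumulated projection profiles (Eqs. 8 and 9), keeping every
    fourth sample and normalizing before concatenation."""
    horiz = motion_acc.sum(axis=1)[::step]
    vert = motion_acc.sum(axis=0)[::step]
    feat = np.concatenate([horiz, vert]).astype(float)
    n = np.linalg.norm(feat)
    return feat / n if n > 0 else feat
```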

3.2.3 Histogram of oriented motion vector

Each action has a characteristic set of frequently occurring motion vectors, which may differ in magnitude but are similar in orientation. For example, hand-waving actions have a typical set of repeating MV orientations in the waving area. A histogram can be used to exploit this property. If raw orientation histograms are employed, actions like running and walking will be confused, since their MV orientations are more or less the same. In such cases, the speed of the action needs to be utilized: the magnitude of the MVs is larger for actions performed at higher speeds. Hence, the histogram of oriented motion vectors weighted by magnitude is used as a feature [4].

$$ HOM(u)=\sum_{b(A_t(l, m))=u} M_t(l, m) $$
(10)

where HOM is the histogram of the oriented motion vectors, A_t(l, m) and M_t(l, m) are the orientation and magnitude of the MV at location [l, m] in frame t, and b(A) maps the angle A to the corresponding histogram bin.
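
A sketch of the HOM feature (Eq. 10), using 15 bins as in Section 4.2; normalizing the histogram is our assumption:

```python
import numpy as np

def hom_feature(mag, ang, bins=15):
    """Magnitude-weighted histogram of MV orientations (Eq. 10).
    Orientations lie in [-pi/2, pi/2] by construction of Eq. 4."""
    edges = np.linspace(-np.pi / 2, np.pi / 2, bins + 1)
    hist, _ = np.histogram(ang.ravel(), bins=edges, weights=mag.ravel())
    s = hist.sum()
    return hist / s if s > 0 else hist
```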

The outline of the algorithm is given below:

Algorithm: Action Recognition

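The algorithm listing appears as a figure in the published version; the following is only a reconstruction from the description above, assembled from the hypothetical helper functions sketched earlier in Section 3 (all names are ours, and how the per-frame HOM histograms are aggregated over a group is our assumption).

```python
import numpy as np

def recognize_action(qp_delta_frames, mv_frames, svm_model,
                     weights=(0.48, 0.19, 0.63, 0.67)):
    """End-to-end sketch of the pipeline in Fig. 1.
    qp_delta_frames : list of per-frame QP-delta maps (partial decoding)
    mv_frames       : list of per-frame (u, v) MV fields on the 4x4 grid
    svm_model       : a trained classifier with a predict() method
    """
    k = len(qp_delta_frames)

    # 1. QP gradient stage, temporal accumulation and QGI formation
    grads = [action_mbs(d) for d in qp_delta_frames]
    g = qgi(grads, 0, k)

    # 2. QGI features: spread and projection profiles
    hs, vs = qgi_spread(g)
    qh, qv = qgi_projections(g)

    # 3. MV features: motion-accumulated profiles and HOM histogram
    mags, angs = zip(*(mv_magnitude_orientation(u, v) for u, v in mv_frames))
    map_feat = map_feature(motion_accumulation(list(mags)))
    hom = sum(hom_feature(m, a) for m, a in zip(mags, angs))

    # 4. Weighted concatenation of the individual features
    feat = np.concatenate([weights[0] * np.concatenate([qh, qv]),
                           weights[1] * np.array([hs, vs]),
                           weights[2] * map_feat,
                           weights[3] * hom])

    # 5. SVM classification of the group of frames
    return svm_model.predict(feat.reshape(1, -1))[0]
```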

4 Experiments, results and discussion

The compressed domain parameters were extracted using the JM15.1 [12] software. The experiments were conducted on Linux on a single core of an Intel® Xeon® X5675 CPU @ 3.07 GHz. Videos were encoded using the x264 [29] software with the following settings: Baseline profile, GOP length 30, single slice per frame, one reference frame, and a frame rate of 25 fps. Surveillance camera encoders predominantly use the Baseline profile, in which B-frames are not used; since B-frames are not used, this profile is ideal for network cameras and video encoders due to its low latency [11]. Even though few H.264 encoders are freely available, we also evaluated the proposed approach on the JM15.1 [12] encoder and found its performance to be consistent with that of x264.

The proposed approach was evaluated on two benchmark action datasets: Weizmann [5] and KTH [21]. Even though large datasets like UCF101 [22] and HMDB51 [13], each with more than 50 action classes, are available, the classification accuracies reported on those databases are quite low even for pixel-domain approaches, and classification using only compressed domain cues would be even more difficult. Also, Weizmann and KTH conform to the surveillance-type scenario, since camera motion is negligible. Hence, we limited our analysis to the KTH and Weizmann datasets.

4.1 Leave-one-out full-fold cross validation strategy using Support Vector Machines (SVM)

Leave-one-out full-fold cross validation (LOOFCV) was employed to evaluate the performance of the proposed algorithm. In LOOFCV, on a database with n subjects, n−1 subjects are used to build the action model and the remaining subject is used for testing. Classification was performed using libsvm 3.14 [7]. We used an RBF (radial basis function) kernel with 132 support vectors for the SVM, with default parameters.
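
A sketch of this protocol is given below. scikit-learn's SVC (RBF kernel, default parameters) is used here only as a stand-in for libsvm 3.14; the per-video feature matrix, labels and subject identifiers are assumed inputs.

```python
import numpy as np
from sklearn.svm import SVC

def loofcv_accuracy(features, labels, subjects):
    """Leave-one-subject-out cross validation as described above."""
    correct = 0
    for s in np.unique(subjects):
        train, test = subjects != s, subjects == s
        clf = SVC(kernel='rbf')                  # RBF kernel, default C and gamma
        clf.fit(features[train], labels[train])
        correct += int((clf.predict(features[test]) == labels[test]).sum())
    return correct / len(labels)
```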

4.1.1 KTH dataset results

The KTH dataset [21] contains six types of human actions, viz. boxing, hand waving, hand clapping, walking, jogging and running. The actions are performed by 25 subjects in four different scenarios: outdoors (s1), outdoors with scale variation (s2), outdoors with different clothes (s3) and indoors (s4), as shown in Fig. 4. There are a total of 599 video sequences taken over homogeneous backgrounds with a frame rate of 25 fps and a spatial resolution of 160 × 120 pixels.

Fig. 4 KTH Actions (image courtesy: Laptev et al. [21])

Classification accuracy

The proposed algorithm achieves 85 % classification accuracy on the KTH dataset. Since [8] cannot handle scale variations, the outdoor videos with scale changes (s2) were excluded from their analysis. Hence, for a fair comparison, the performance of the proposed approach was also evaluated by leaving out the videos with scale changes; without scale change, the proposed approach achieves 90.4 % accuracy. Unlike [8], our algorithm is robust to scale variation as well. The comparison in Table 3 shows that the proposed algorithm outperforms [8]. The confusion matrix is shown in Table 2. The jog action is confused with run and walk, since they have similar motion vector patterns; many pixel domain approaches show similar confusion between these actions. The results on KTH show that our approach is efficient and can deal with scale and appearance variations, and that the algorithm is robust to both indoor and outdoor testing conditions.

Classification speed

KTH has a total of 289,716 frames. We obtained a mean classification speed of 3,489 fps with a standard deviation of 5 fps. The whole KTH dataset was classified by LOOFCV in less than 90 seconds; the time taken for feature extraction, loading of MVs and QPs, and SVM classification was included in this evaluation. The classification speed was not reported by [8], so a comparison with the fastest pixel-domain algorithm reported on KTH is also shown in Table 3. None of the pixel domain approaches has reported recognition speeds above 25 fps; the proposed algorithm is hence more than 100 times faster than the fastest existing action recognition algorithms.

4.1.2 Weizmann dataset results

The Weizmann dataset [5] consists of 10 different actions, viz. walk, run, jump, gallop sideways, bend, one-hand waving, two-hands waving, jump in place, jumping jack and skip (Fig. 5). The actions are performed by 9 subjects, giving a total of 93 video sequences.

Classification accuracy

We achieved an overall classification accuracy of 85.56 %. However, the action skip was confused with run, side and jump. When classification was performed without skip, the accuracy improved to 93.83 %; the corresponding confusion matrix is shown in Table 4. Classification accuracy is higher when classification is performed at the video level. However, we have also shown the effect of dividing each video into smaller groups and classifying each group in Tables 7 and 8; grouping is done with 50 % overlap.

Fig. 5 Weizmann Actions (image courtesy: Liu et al. [17])

Table 4 Confusion matrix on Weizmann (excluding action ‘skip’)
Table 5 Weizmann Dataset Classification Speed Comparison
Table 6 Performance of the proposed approach with encoders x264, JM15.1 and addition of Gaussian noise
Table 7 KTH (group-based result)
Table 8 Weizmann (group-based result)

Classification speed

The Weizmann dataset has a total of 5,532 frames. The classification speed achieved is 2,186 fps with a standard deviation of 7 fps, and the whole dataset was classified in 2.5 seconds. The proposed approach is compared with the fastest (pixel-domain) approach reported for Weizmann in Table 5.

We have evaluated the proposed approach for two open source encoders: x264 and JM15.1. In order to emulate a generic encoder, the proposed approach was also evaluated after adding Gaussian noise with variance proportional to the variance of the MVs in the dynamic region of the compressed video sequences, as shown in Table 6. We further evaluated the performance of the proposed approach by varying the percentage of motion vectors corrupted in the dynamic region of the compressed video sequences of the KTH dataset, as shown in Fig. 6.
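
The noise injection can be sketched as follows; the proportionality constant and the fraction of corrupted MVs are experiment parameters, so the defaults used here are placeholders.

```python
import numpy as np

def corrupt_mvs(mv_u, mv_v, dyn_mask, scale=1.0, frac=1.0, seed=None):
    """Add Gaussian noise to a fraction of the MVs in the dynamic region,
    with variance proportional to the MV variance in that region."""
    rng = np.random.default_rng(seed)
    var = scale * np.var(np.concatenate([mv_u[dyn_mask], mv_v[dyn_mask]]))
    noisy_u = mv_u.astype(float)
    noisy_v = mv_v.astype(float)
    idx = np.flatnonzero(dyn_mask)                       # dynamic-region blocks
    pick = rng.choice(idx, size=int(frac * idx.size), replace=False)
    noise = rng.normal(0.0, np.sqrt(var), size=(2, pick.size))
    noisy_u.ravel()[pick] += noise[0]
    noisy_v.ravel()[pick] += noise[1]
    return noisy_u, noisy_v
```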

Fig. 6 Effect of Gaussian noise on the performance of the system for the KTH dataset

4.2 Feature-wise classification

The classification performance of each individual feature was also analyzed. On the Weizmann dataset, the QGIP, QGI spread, MAP and HOM features have individual classification accuracies (SVM) of 47.7 %, 18.8 %, 63.3 % and 66.7 % respectively, whereas on the KTH dataset they offer 52.8 %, 29.3 %, 56.67 % and 67.33 %. Detailed results are shown in Fig. 7. We fixed B_a and F_a as 2 each and the number of histogram bins as 15 for this experiment. Based on this observation, we fixed the weights (w1, w2, w3, w4 in Fig. 1) as 0.48, 0.19, 0.63 and 0.67 respectively in the final feature vector (for Weizmann). Classification accuracy is 1 to 2 % higher when the weighted feature vector is employed, as opposed to direct concatenation of the features. We also evaluated the proposed approach by discarding the forward frames and considering only the backward frames for the QP accumulation; since the overall contribution of the QP features is smaller, the accuracy falls by only 3–4 %.
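
The chosen weights appear to follow each feature's individual accuracy on the Weizmann dataset; this correspondence is our observation, not an explicit rule stated in the text.

```python
# Individual Weizmann accuracies (in %) of the four features, from above.
accuracies = {'QGIP': 47.7, 'spread': 18.8, 'MAP': 63.3, 'HOM': 66.7}

# Rounding accuracy/100 reproduces the reported weights w1..w4.
weights = {name: round(acc / 100, 2) for name, acc in accuracies.items()}
print(weights)   # {'QGIP': 0.48, 'spread': 0.19, 'MAP': 0.63, 'HOM': 0.67}
```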

Fig. 7 Feature-wise Classification Performance on (1) KTH and (2) Weizmann Datasets

5 Conclusion and future work

In this paper, we have presented a novel high-speed action recognition algorithm in the H.264/AVC compressed domain. The proposed algorithm is more than 100 times faster than existing state-of-the-art pixel-domain approaches and can classify QCIF-resolution videos at more than 2,000 fps. The proposed approach can be utilized as a fast initial stage in a hierarchical action recognition system, where highly accurate pixel domain approaches are later employed for fine recognition, so that the overall system attains accurate and real-time recognition. It can also be utilized for stored video libraries, where speeds much faster than real time are required. Currently, our work cannot deal with occluded actions or videos containing more than one action; we plan to make our approach robust to these scenarios in future work.