
1 Introduction

Convolutional neural networks (CNNs) [17] were originally designed to represent the static appearance of visual scenes. However, they are limited when the underlying structure is characterized by sequential and temporal relations. In particular, since recognizing human behavior in a video requires both spatial appearance and temporal motion as important cues, many previous studies have utilized modalities that capture motion information, such as optical flow [33] and RGBdiff (the temporal difference between consecutive RGB frames) [33]. Methods based on two-stream architectures [7, 21, 33] and 3D convolutions [2, 28] that utilize these input modalities achieve state-of-the-art performance in action recognition. However, even though optical flow is a widely used modality that provides short-term temporal information, it takes a long time to compute. Likewise, 3D-kernel-based methods such as 3D ConvNets impose a heavy computational burden and high memory requirements.

Fig. 1. Some examples of action classes in the three action recognition datasets: Jester (top), Something-Something (middle), and UCF101 (bottom). Top: left ‘Sliding Two Fingers Down’, right ‘Sliding Two Fingers Up’; middle: left ‘Dropping something in front of something’, right ‘Removing something, revealing something behind’; bottom: left ‘TableTennisShot’, right ‘Billiards’. Because of the ambiguity of symmetric class pairs, static images alone are not enough to recognize the correct label without sequential information in the former two datasets. In contrast, for the UCF101 frames in the bottom row, the action class can be recognized from the spatial context (e.g. background and objects) of a single image.

In our view, most previous labeled action recognition datasets, such as UCF101 [24], HMDB51 [16], Sports-1M [13] and THUMOS [12], provide highly abstract concepts of human behavior. Therefore, their classes can mostly be recognized without the help of temporal relations between sequential frames. For example, ‘Billiards’ and ‘TableTennisShot’ in UCF101 are easily recognizable from a single frame, as shown in the third row of Fig. 1. Unlike these datasets, Jester [1] and Something-Something [8] capture more detailed physical aspects of actions and scenes. Appearance information is of very limited use for classifying actions in these datasets. Likewise, the visual objects in the scenes, which mainly provide shape information, are less important for recognizing actions. In particular, the Something-Something dataset has little correlation between the object and the action class, as its name implies. The first two rows of Fig. 1 show examples from these datasets. As shown in Fig. 1, it is difficult to determine the action class from a single image. Moreover, even when multiple images are given, the action class can change depending on their temporal order, so conventional static feature extractors are easily confused. Therefore, the ability to extract temporal relationships between consecutive frames is essential for classifying human behavior in these datasets.

To address these issues, we introduce a unified model named the Motion Feature Network (MFNet). MFNet contains specially designed motion blocks which represent spatio-temporal relationships using only RGB frames. Because it extracts temporal information from RGB alone, it avoids the pre-computation time that existing optical-flow-based approaches typically need to compute optical flow. Also, because MFNet is based on a 2D CNN architecture, it has fewer parameters than its 3D counterparts.

We perform experiments to verify our model’s ability to extract spatio-temporal features on two publicly available action recognition datasets in which each video label is closely related to the sequential relationships among frames. MFNet trained using only RGB frames significantly outperforms previous methods. Thus, MFNet is a good solution for action classification in videos whose labels depend on the sequential relationships of detailed physical entities. We also conduct ablation studies to understand the properties of MFNet in more detail.

The rest of this paper is organized as follows. Some related works for action recognition tasks are discussed in Sect. 2. Then in Sect. 3, we introduce our proposed MFNet architecture in detail. After that, experimental results with ablation studies are presented and analyzed in Sect. 4. Finally, the paper is concluded in Sect. 5.

2 Related Works

With the great success of CNNs on various computer vision tasks, a growing number of studies have tried to utilize deeply learned features for action recognition in video datasets. In particular, since consecutive input frames carry sequential context, temporal information as well as spatial information is an important cue for classification. There have been several approaches to extracting these spatio-temporal features for action recognition problems.

One popular way to learn spatio-temporal features is to apply 3D convolution and 3D pooling hierarchically [6, 9, 28, 29, 36]. In this approach, consecutive frames of a video clip are usually stacked and fed into the network. 3D convolutions have enough capacity to encode spatio-temporal information from densely sampled frames but are inefficient in terms of computational cost. Furthermore, the number of parameters to be optimized is relatively large compared to other approaches, which makes such models difficult to train on small datasets such as UCF101 [24] and HMDB51 [15]. To overcome these issues, Carreira et al. [2] introduced a new large dataset named Kinetics [14], which facilitates training 3D models. They also suggest inflating 2D convolution filters into 3D ones to bootstrap parameters from pre-trained ImageNet [4] models, achieving state-of-the-art performance in action recognition tasks.

Another popular approach is the two-stream-based method proposed by Simonyan et al. [22]. It encodes two modalities, the raw pixels of an image and the optical flow extracted from two consecutive raw image frames, and predicts action classes by averaging the predictions from a single RGB frame and from a stack of externally computed optical flow frames. A large number of follow-up studies [18, 32, 35] have been proposed to improve action recognition performance based on the two-stream framework [7, 21, 33]. As an extension of the two-stream method, Wang et al. [33] proposed the temporal segment network (TSN). It samples image frames and optical flow frames from different time segments over the entire video sequence instead of from short snippets, and trains on RGB frames and optical flow frames independently. At inference time, it accumulates the results to predict the activity class. While this brings a significant improvement over traditional methods [3, 30, 31], it still relies on pre-computed optical flow, which is computationally expensive.

To replace hand-crafted optical flow, some works feed optical-flow-like frames as inputs to the convolutional networks [33, 36]. Another line of work uses optical flow only as ground truth during the training phase [20, 38]: a network is trained to reconstruct optical flow images from raw images, and the estimated optical flow is provided to the action recognition network. Recently, Sun et al. [26] proposed optical-flow-guided features, which extract a motion representation from two sets of features of adjacent frames by separately applying temporal subtraction (temporal features) and Sobel filters (spatial features). Our proposed method is highly related to this work. The difference is that we feed spatial and temporal features forward within a unified network instead of keeping the two kinds of features separate, which makes it possible to train the proposed MFNet in an end-to-end manner.

3 Model

In this section, we first introduce the overall architecture of the proposed MFNet and then give a detailed description of the ‘motion filter’ and ‘motion block’ which constitute MFNet. We provide several instantiations of the motion filter and the motion block to explain the intuition behind them.

Fig. 2. The overall architecture of MFNet. The proposed network is composed of appearance blocks and motion blocks which encode spatial and temporal information. A motion block takes two consecutive feature maps from the respective appearance blocks and extracts spatio-temporal information with the proposed fixed motion filters. The accumulated feature maps from the appearance blocks and motion blocks are used as an input to the next layer. This figure shows the case of \(K=7\).

3.1 Motion Feature Network

The proposed architecture of MFNet is illustrated in Fig. 2. We construct our architecture based on the temporal segment network (TSN) [33], which works on a sequence of K snippets sampled from the entire video. Our network is composed of two major components. One is the appearance block, which encodes spatial information; this can be any of the architectures used in image classification tasks, and in our experiments we use ResNet [10] as the backbone network for the appearance blocks. The other component is the motion block, which encodes temporal information. To model the motion representation, it takes as inputs two consecutive feature maps of the corresponding consecutive frames from the same hierarchy (see Footnote 1) and extracts temporal information using a set of fixed motion filters, which are described in the next subsection. The extracted spatial and temporal features in each hierarchy should be properly propagated to the next hierarchy. To fully utilize both types of information, we provide several schemes to accumulate them for the next hierarchy. A high-level sketch of this data flow is given below.
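The following minimal sketch (our own reconstruction, not the reference implementation; `MFNetSketch`, the block interfaces, and the tensor shapes are assumptions) illustrates how the K snippets could flow through shared appearance blocks and per-stage motion blocks:

```python
import torch.nn as nn

class MFNetSketch(nn.Module):
    def __init__(self, appearance_blocks, motion_blocks, classifier):
        super().__init__()
        self.appearance_blocks = nn.ModuleList(appearance_blocks)  # shared 2D CNN stages
        self.motion_blocks = nn.ModuleList(motion_blocks)          # one per stage except the last
        self.classifier = classifier                                # global pooling + fc layer

    def forward(self, snippets):
        # snippets: (N, K, C, H, W) -- one RGB frame sampled from each of the K segments
        n, k = snippets.shape[:2]
        x = snippets.flatten(0, 1)                     # process all N*K snippets as one batch
        for app, mot in zip(self.appearance_blocks, self.motion_blocks):
            x = app(x)                                 # appearance (spatial) features
            f = x.view(n, k, *x.shape[1:])             # regroup by video: (N, K, C', H', W')
            x = mot(f).flatten(0, 1)                   # fuse appearance + motion for the next stage
        scores = self.classifier(x)                    # per-snippet class scores
        return scores.view(n, k, -1)                   # TSN-style consensus (averaging) is applied later
```

The motion block interface assumed here is detailed in Sects. 3.2 and 3.3.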

3.2 Motion Representation

To capture a motion representation, one of the most common approaches in action recognition is to use optical flow as an input to a CNN. Despite its important role in action recognition tasks, optical flow is computationally expensive to obtain in practice. To replace the role of optical flow in extracting temporal features, we propose motion filters, which have a close relationship with optical flow.

Approximation of Optical Flow. To approximate optical flow hierarchically at the feature level, we propose a modular structure named the motion filter. The brightness constancy constraint of optical flow is typically defined as follows:

$$\begin{aligned} {I}(x + \varDelta x, y + \varDelta y, t + \varDelta t) = {I}(x,y,t), \end{aligned}$$
(1)

where \(I(x,y,t)\) denotes the pixel value at location \((x,y)\) of the frame at time t. Here, \(\varDelta {x}\) and \(\varDelta {y}\) denote the spatial displacements along the horizontal and vertical axes, respectively. The optical flow \((\varDelta x, \varDelta y)\) that satisfies (1) is calculated between two consecutive image frames at times t and \(t+\varDelta {t}\) at every location of the image.

Originally, solving an optical flow problem means finding the optimal solution \((\varDelta {x}^{*},\varDelta {y}^{*})\) through an optimization technique. However, it is hard to solve (1) directly without additional constraints such as spatial or temporal smoothness assumptions. Moreover, obtaining a dense (pixel-wise) optical flow takes considerable time.

In this paper, the primary goal is to find temporal features derived from optical flow that help classify actions, rather than to find the optimal optical flow solution. Thus, we extend (1) to feature space by replacing the image \(I(x,y,t)\) with the corresponding feature maps \(F(x,y,t)\) and define residual features R as follows:

$$\begin{aligned} R_l(x,y,\varDelta {t}) =F_l(x+\varDelta {x},y+\varDelta {y},t+\varDelta {t}) - F_l(x,y,t), \end{aligned}$$
(2)

where \(\textit{l}\) denotes the index of the layer or hierarchy and \({F}_l\) is the l-th feature map from the base network. \(R_l\) is the residual feature produced by two features from the same layer l. Given \(\varDelta {x}\) and \(\varDelta {y}\), the residual features R can easily be calculated by subtracting the two adjacent features at times t and \(t+\varDelta {t}\). If the optical flow constraint holds at the feature level, R should have a low absolute intensity at the true displacement. Since searching for the lowest absolute value at each location of the feature map is simple but time-consuming, we define a set of predefined fixed directions \(\mathbb {D} = \{(\varDelta {x}, \varDelta {y})\}\) to restrict the search space. For convenience, in our implementation, we restrict \(\varDelta {x}, \varDelta {y}\in {\{0,\pm 1\}}\) and \(\left| \varDelta {x}\right| +\left| \varDelta {y}\right| \le 1\). Shifting by one pixel along each spatial dimension in image space captures a small amount of optical flow (i.e. small movement), whereas a one-pixel shift in the feature space at a higher hierarchy of a CNN captures a larger optical flow (i.e. large movement) because it corresponds to a larger receptive field. A minimal sketch of this restricted shift-and-subtract operation is shown below.
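The sketch below (assumed names and zero padding at the border; the subtraction order \(F_t - G^{\delta }_{t+\varDelta t}\) matches the motion block definition in Sect. 3.3) enumerates the restricted direction set and computes one residual map per direction:

```python
import torch
import torch.nn.functional as F

# |dx| + |dy| <= 1 with dx, dy in {0, +1, -1}: five predefined directions, including (0, 0)
DIRECTIONS = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]

def shift(feat: torch.Tensor, dx: int, dy: int) -> torch.Tensor:
    """Shift a (N, C, H, W) feature map by (dx, dy) pixels, zero-padding the border."""
    padded = F.pad(feat, (1, 1, 1, 1))
    _, _, h, w = feat.shape
    return padded[:, :, 1 + dy:1 + dy + h, 1 + dx:1 + dx + w]

def residual_features(f_t: torch.Tensor, f_next: torch.Tensor):
    """One residual map per direction: F_t minus the shifted F_{t+dt}."""
    return [f_t - shift(f_next, dx, dy) for dx, dy in DIRECTIONS]
```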

Fig. 3. Motion filter. The motion filter generates spatio-temporal features from two consecutive feature maps. The feature map at time \(t+\varDelta {t}\) is shifted along a predefined set of fixed directions, and each shifted map is subtracted from the feature map at time t. By concatenating the features from all directions, the motion filter can represent spatio-temporal information.

Motion Filter. The motion filter is a modular structure that operates on two feature maps extracted by shared networks from two consecutive input frames. As shown in Fig. 3, the motion filter takes the features \(F_l(t)\) and \(F_l(t\,+\,\varDelta {t})\) at times t and \(t\,+\,\varDelta {t}\) as inputs. The predefined set of directions \(\mathbb {D}\) is applied only to the features at time \({t+\varDelta {t}}\), as illustrated in Fig. 3. We follow the shift operation proposed in [34], which moves each channel of its input tensor in a different spatial direction \(\delta \triangleq (\varDelta {{x}}, \varDelta {{y}}) \in \mathbb {D} \). This can alternatively be done with a widely used depth-wise convolution whose kernel size is determined by the maximum values of \(\varDelta {x}\) and \(\varDelta {y}\) in \(\mathbb {D}\). For example, under our condition \(\varDelta x, \varDelta y \in \{0, \pm 1\}\), it can be implemented with \(3 \times 3\) kernels as shown in Fig. 3. Formally, the shift operation can be formulated as:

$$\begin{aligned} {G}_{k,l,m}^{\delta }&= \sum _{i,j} {K}_{i,j}^{\delta } F_{k+\hat{i},l+\hat{j},m}, \end{aligned}$$
(3)
$$\begin{aligned} {K}_{i,j}^{\delta }&= {\left\{ \begin{array}{ll} 1 &{} \quad \text {if } i=\varDelta {x} \text { and } j=\varDelta {y},\\ 0 &{} \quad \text {otherwise.} \end{array}\right. } \end{aligned}$$
(4)

Here, the subscripts indicate the indices of a matrix or tensor, \(\delta \triangleq (\varDelta x, \varDelta y) \in \mathbb {D}\) is a displacement vector, \(F \in {\mathbb {R}^{W\times H \times C}}\) is the input tensor, and \(\hat{i}=i-\lfloor W/2 \rfloor \), \(\hat{j}=j-\lfloor H/2 \rfloor \) are the re-centered spatial indices (\(\lfloor \cdot \rfloor \) is the floor operation). The indices k, l and i, j run over the spatial dimensions and m is the channel-wise index. We obtain a set \(\mathbb {G}\) = {\(G^{\delta }_{t+\varDelta t} | {\delta } \in \mathbb {D} \)}, where \(G^{\delta }_{t+\varDelta t}\) represents the feature map at time \(t+\varDelta {t}\) shifted by \(\delta \). Each of them is then subtracted from \(F_t\) (see Footnote 2). Because the concatenated feature map is constructed by temporal subtraction on top of the spatially shifted features, it contains spatio-temporal information suitable for action recognition. As mentioned in Sect. 2, this is quite different from the optical-flow-guided features in [26], which use two types of feature maps obtained by temporal subtraction and spatial Sobel filters. It is also distinct from the ‘subtractive correlation layer’ in [5] in both implementation and goal: the subtractive correlation layer is utilized to find correspondences for better reconstruction, whereas the proposed motion filter aims to encode directional information between two feature maps via learnable parameters. A sketch of the depth-wise convolution implementation is given below.
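One possible realization of the fixed-kernel depth-wise convolution variant is sketched here (a sketch under our own assumptions about module names, zero padding, and freezing of the kernels; it is not the authors' code):

```python
import torch
import torch.nn as nn

class FixedShift(nn.Module):
    """Shift operation of Eqs. (3)-(4) as a depth-wise conv with a fixed one-hot 3x3 kernel."""
    def __init__(self, channels: int, dx: int, dy: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1,
                              groups=channels, bias=False)
        kernel = torch.zeros(channels, 1, 3, 3)
        kernel[:, 0, 1 + dy, 1 + dx] = 1.0        # K_{i,j} = 1 only at the displacement (dx, dy)
        self.conv.weight.data.copy_(kernel)
        self.conv.weight.requires_grad_(False)    # the motion filter kernels are fixed, not learned

    def forward(self, f_next):
        return self.conv(f_next)                  # G^delta for the feature map at time t + dt

class MotionFilter(nn.Module):
    """Concatenates F_t - G^delta_{t+dt} over all predefined directions in D."""
    def __init__(self, channels: int,
                 directions=((0, 0), (1, 0), (-1, 0), (0, 1), (0, -1))):
        super().__init__()
        self.shifts = nn.ModuleList(FixedShift(channels, dx, dy) for dx, dy in directions)

    def forward(self, f_t, f_next):
        return torch.cat([f_t - s(f_next) for s in self.shifts], dim=1)  # (N, S*C, H, W)
```

Each `FixedShift` holds a one-hot \(3\times 3\) kernel, so the depth-wise convolution reproduces the shift of Eqs. (3)-(4) with a standard, GPU-friendly operation.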

3.3 Motion Block

Fig. 4. Two ways to aggregate spatial and temporal information from the appearance block and the motion filter.

As mentioned above, the motion filter is a modular structure which can be adopted at any intermediate layer between two appearance blocks consecutive in time. In order to propagate spatio-temporal information properly, we provide several building blocks. Inspired by the recent success of the residual block used in residual networks (ResNet) on many challenging image recognition tasks, we develop a new building block, named the motion block, to propagate spatio-temporal information between two adjacent appearance blocks into deeper layers.

Element-Wise Sum. A simple and direct way to aggregate the two different types of information is the element-wise sum operation. As illustrated in Fig. 4(a), the set of motion features \(R_{t}^\delta \triangleq F_t - G_{t+\varDelta t}^{\delta } \in \mathbb {R}^{W\times {H}\times {C}}\), \(\delta \in \mathbb {D}\), generated by the motion filter is concatenated along the channel dimension to produce a tensor \({M}_{t} = [R^{\delta _1}_t | R^{\delta _2}_t | \cdots | R^{\delta _S}_t] \in {\mathbb {R}^{W\times {H}\times {N}}}\), where \([\cdot |\cdot ]\) denotes the concatenation operation, \(N = S \times C\), and S is the number of predefined directions in \(\mathbb {D}\). This tensor is further compressed by \(1 \times 1\) convolution filters to produce an output \(\hat{M}_t\) with the same dimension as \(F_{t}\). Finally, the features from the appearance block \(F_{t}\) and those from the motion filters \(\hat{M}_t\) are summed to produce the input to the next hierarchy.

Concatenation. Another popular way to combine the appearance and motion features is the concatenation operation. In this case, the motion features \(M_{t}\) mentioned above are directly concatenated with the appearance features \(F_{t}\), as depicted in Fig. 4(b). A set of \(1 \times 1\) convolution filters is then exploited to encode spatial and temporal information after the concatenation. The \(1 \times 1\) convolution reduces the channel dimension as desired, and it also implicitly encodes spatio-temporal features that capture the relationship between the two types of features: appearance and motion. A sketch of both fusion schemes is given below.
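Both fusion schemes could be wrapped in a single module as sketched here (assumed layer ordering; `MotionFilter` refers to the earlier sketch, and normalization layers are omitted):

```python
import torch
import torch.nn as nn

class MotionBlockFusion(nn.Module):
    """Fuses appearance features F_t with motion features M_t (Fig. 4); mode is 'sum' or 'concat'."""
    def __init__(self, channels: int, num_directions: int = 5, mode: str = "sum"):
        super().__init__()
        self.mode = mode
        self.motion = MotionFilter(channels)                  # fixed shift-and-subtract filters (see above)
        n = num_directions * channels                         # N = S * C motion channels
        in_ch = n if mode == "sum" else channels + n
        self.fuse = nn.Conv2d(in_ch, channels, kernel_size=1, bias=False)  # 1x1 compression

    def forward(self, f_t, f_next):
        m_t = self.motion(f_t, f_next)                        # concatenated motion features M_t
        if self.mode == "sum":
            return f_t + self.fuse(m_t)                       # element-wise sum (Fig. 4a)
        return self.fuse(torch.cat([f_t, m_t], dim=1))        # concatenation (Fig. 4b)
```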

4 Experiments

In this section, the proposed MFNet is applied to action recognition problems and its experimental results are compared with those of other action recognition methods. As datasets, Jester [1] and Something-Something [8] are used because, as mentioned in Sect. 1, their classes cannot easily be recognized from a single frame. They are therefore well suited for observing the effectiveness of the proposed motion blocks. We also perform comprehensive ablation studies to demonstrate the effectiveness of MFNets.

4.1 Experiment Setup

To conduct comprehensive ablation studies on video classification tasks with motion blocks, we first describe our base network framework.

Base Network Framework. We select the TSN framework [33] as the base network architecture for training MFNet. TSN is an effective and efficient video processing framework for action recognition tasks. It samples a sequence of frames from an entire video and aggregates the individual predictions into a video-level score. The TSN framework is therefore well suited for our motion blocks, because each block directly extracts the temporal relationships between adjacent snippets in a batch manner.

In this paper, we mainly choose ResNet [10] as the base network for extracting spatial feature maps. For the sake of clarity, we divide it into six stages. Each stage has a number of stacked residual blocks, and each block is composed of several convolutional and batch normalization [11] layers with rectified linear units (ReLU) [19] for non-linearity. The final stage consists of a global pooling layer and a classifier. Our base network differs from the original ResNet only in that it contains the max pooling layer in the first stage; otherwise it is the same as the conventional ResNet. The backbone network can be replaced by any other network architecture, and our motion blocks can be inserted into the network in the same way regardless of the type of network used.

Motion Blocks. To form MFNet, we insert our motion blocks into the base network. When using ResNet, each motion block is located right after the last residual block of every stage except the last stage (the global pooling and classification layers). MFNet then automatically learns to represent spatio-temporal information from consecutive frames, leading the conventional base CNN to extract richer information that combines both appearance and motion features. We also add a \(1 \times 1\) convolution before each motion block to reduce the number of channels. Throughout the paper, we reduce the number of input channels to the motion block by a factor of 16 with this \(1\times 1\) convolutional layer, and we add a batch normalization layer after it to adjust the scale to that of the features in the backbone network. A sketch of this wiring is shown below.
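A possible wiring of one backbone stage with its motion block is sketched below (the placement of the \(1\times 1\) reduction and batch normalization follows the description above; the choice of the element-wise sum variant and the reuse of the `MotionFilter` sketch from Sect. 3.2 are assumptions):

```python
import torch.nn as nn

class StageWithMotion(nn.Module):
    """One ResNet stage followed by a motion block with 1x1 channel reduction (factor 16) and BN."""
    def __init__(self, stage: nn.Module, channels: int, reduction: int = 16, num_directions: int = 5):
        super().__init__()
        self.stage = stage                                     # stack of residual blocks, shared in time
        reduced = channels // reduction
        self.reduce = nn.Sequential(
            nn.Conv2d(channels, reduced, kernel_size=1, bias=False),
            nn.BatchNorm2d(reduced),                           # rescale to match backbone feature statistics
        )
        self.motion = MotionFilter(reduced)                    # fixed shift-and-subtract filters (Sect. 3.2)
        self.fuse = nn.Conv2d(num_directions * reduced, channels, kernel_size=1, bias=False)

    def forward(self, f_t, f_next):
        f_t, f_next = self.stage(f_t), self.stage(f_next)      # appearance features with shared weights
        m_t = self.motion(self.reduce(f_t), self.reduce(f_next))
        return f_t + self.fuse(m_t)                            # element-wise sum variant of the motion block
```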

Training. The Jester and Something-Something datasets provide RGB images extracted from the videos at 12 frames per second with a height of 100 pixels. To augment the training samples, we use random cropping with scale jittering. The width and height of a cropped image are determined by multiplying the shorter side of the image by a scale randomly selected from the set \(\{ 1.0, 0.875, 0.75, 0.625 \}\). The cropped image is then resized to \(112\times 112\), because the width of the original images is relatively small compared to that of other datasets. Note that we do not apply random horizontal flipping to the cropped images of the Jester dataset, because some classes form symmetric pairs, such as ‘Swiping Left’ and ‘Swiping Right’, or ‘Sliding Two Fingers Left’ and ‘Sliding Two Fingers Right’. A sketch of this cropping procedure follows.
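The crop itself could look like the following sketch (the corner-sampling policy and whether the same crop is shared across all frames of a video are our assumptions):

```python
import random
from PIL import Image

SCALES = [1.0, 0.875, 0.75, 0.625]

def scale_jitter_crop(img: Image.Image, out_size: int = 112) -> Image.Image:
    """Square crop whose side is the shorter image side times a randomly chosen scale."""
    w, h = img.size
    side = int(min(w, h) * random.choice(SCALES))
    x = random.randint(0, w - side)          # random crop position
    y = random.randint(0, h - side)
    # Note: no random horizontal flip for Jester, since left/right classes form symmetric pairs.
    return img.crop((x, y, x + side, y + side)).resize((out_size, out_size), Image.BILINEAR)
```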

Since the motion block extracts temporal motion features from adjacent feature maps, the interval between sampled frames is a very important hyper-parameter. We first trained our model with a fixed-interval sampling strategy; however, in our experiments this led to worse results than the random sampling strategy of [33]. With random intervals, the network is forced to learn from frames separated by various gaps. Interestingly, we obtain better performance on the Jester and Something-Something datasets with this diversity of temporal sampling intervals. A sketch of the segment-based random sampling is given below.
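A minimal sketch of the segment-based random sampling we refer to (the exact index arithmetic of [33] may differ):

```python
import random

def sample_training_indices(num_frames: int, k: int):
    """TSN-style sampling: split the video into k segments and pick one random frame from each."""
    seg = num_frames / k
    return [min(num_frames - 1, int(seg * i + random.uniform(0, seg))) for i in range(k)]
```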

Table 1. Top-1 and top-5 classification accuracies for different networks with different numbers of training segments (3, 5, 7). The compared networks are the TSN baseline, the MFNet concatenation version (MFNet-C), and the MFNet element-wise sum version (MFNet-S), evaluated on the Jester and Something-Something validation sets. All models use ResNet-50 as the backbone network and are trained from scratch.

We use the stochastic gradient descent algorithm to learn the network parameters. The batch size is set to 128, the momentum to 0.9, and the weight decay to 0.0005. All MFNets are trained from scratch, and we train our models with batch normalization layers [11]. The learning rate is initialized to 0.01 and decreased by a factor of 0.1 every 50 epochs. Training stops after 120 epochs. To mitigate over-fitting, we apply dropout [25] after the global pooling layer with a dropout ratio of 0.5. To speed up training, we employ a multi-GPU data-parallel strategy with 4 NVIDIA TITAN-X GPUs. The corresponding optimization setup is sketched below.
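The optimization setup could be configured as follows (a sketch; `build_optimization` and the surrounding training loop are not part of the reference code):

```python
import torch
import torch.nn as nn

def build_optimization(model: nn.Module):
    """SGD with momentum 0.9, weight decay 5e-4, and lr 0.01 decayed by 0.1 every 50 epochs."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                                momentum=0.9, weight_decay=5e-4)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1)
    return optimizer, scheduler   # training runs for 120 epochs with a batch size of 128
```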

Inference. We select 10 equidistant frames without a random shift and test our models on these sampled frames, rescaled to \(112\times 112\). We then aggregate the separate predictions from each frame and average them before softmax normalization to obtain the final prediction, as sketched below.
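A sketch of this inference procedure (the model interface follows the high-level sketch in Sect. 3.1 and is an assumption):

```python
import torch

def equidistant_indices(num_frames: int, k: int = 10) -> torch.Tensor:
    """k equally spaced frame indices, without a random shift."""
    return torch.linspace(0, num_frames - 1, steps=k).long()

@torch.no_grad()
def video_prediction(model, frames: torch.Tensor) -> torch.Tensor:
    # frames: (T, C, 112, 112) preprocessed frames of one video; `model` is assumed to map a
    # (1, k, C, H, W) clip to per-snippet class scores of shape (1, k, num_classes)
    clip = frames[equidistant_indices(frames.shape[0])].unsqueeze(0)
    scores = model(clip)
    return torch.softmax(scores.mean(dim=1), dim=-1)   # average per-frame scores, then softmax
```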

4.2 Experimental Results

The Jester dataset [1] is a crowd-acted video dataset for generic human hand gesture recognition. It consists of 118,562 videos for training, 14,787 videos for validation, and 14,743 videos for testing. Something-Something [8] is also a crowd-acted, densely labeled video dataset of basic human interactions with daily objects. It contains 86,017 videos for training, 11,522 videos for validation, and 10,960 videos for testing. The two datasets pose action classification tasks involving 27 and 174 human action categories, respectively. We report validation results of our models on the validation sets, and test results from the official leaderboards (see Footnotes 3 and 4).

Evaluation on the Number of Segments. Due to the nature of MFNet, the number of segments K used in training is one of the important parameters. Table 1 compares different models while changing the number of segments from 3 to 7 under the same evaluation strategy. We observe that as the number of segments increases, the performance of all models increases. MFNet-C50 (the MFNet concatenation version with ResNet-50 as the backbone network) with 7 segments performs far better than the same network with 3 segments: \(96.1\%\) vs. \(90.4\%\) on Jester and \(37.3\%\) vs. \(17.4\%\) on Something-Something, respectively. The trend is the same for MFNet-S50, the element-wise sum version. Also, unlike the TSN baseline, MFNets show a significant performance improvement as the number of segments increases from 3 to 5.

Table 2. Top-1 and top-5 classification accuracies for different depths of MFNet's base network. ResNet [10] is used as the base network. The values are on the Jester and Something-Something validation sets. All models are trained from scratch with 10 segments.

These improvements imply that increasing K reduces the interval between sampled frames, which allows our model to extract richer information. Interestingly, MFNet-S achieves slightly higher top-1 accuracy (\(0.2\%\) to \(0.6\%\)) than MFNet-C on the Jester dataset, while MFNet-C shows better performance (\(0.2\%\) to \(2.8\%\)) than MFNet-S on the Something-Something dataset. On the other hand, because the TSN baseline is trained from scratch, its performance is worse than expected. It appears that the TSN spatial model without pre-training barely generates any action-related visual features on the Something-Something dataset.

Table 3. Comparison of the top-1 and top-5 validation results of various methods on the Jester and Something-Something datasets. K denotes the number of training segments. The results of other models are taken from their respective papers.
Table 4. Selected test results on the Jester and Something-Something datasets from the official leaderboards. Since the test results are continuously updated, results that are not reported or whose description is missing are excluded; the complete lists are available on the official public leaderboards. Our results are based on ResNet-101 with \(K=10\), trained from scratch. For submissions, we use the same evaluation strategies as in the validation mode.

Comparisons of Network Depths.

Table 2 compares the performance as the depth of MFNet's backbone network changes. The table shows that MFNet-C with ResNet-18 achieves performance comparable to that of the 101-layer ResNet while using almost \(76\%\) fewer parameters (11.68M vs. 50.23M). It is generally known that deeper CNNs can express richer features [10, 23, 27]. However, most of the videos in the Jester dataset contain very similar human appearances, so the static visual entities are only weakly related to the action classes; therefore, the network depth does not appear to have a significant effect on performance. On Something-Something, accuracy also saturates. A likely explanation is that generalization is difficult without weights pre-trained on other large-scale datasets such as ImageNet [4] and Kinetics [14].

Comparisons with the State-of-the-Art.

Table 3 shows the top-1 and top-5 results on the validation sets. Our models outperform Pre-3D CNN + Avg [8] and the MultiScale TRN [37]. Because Jester and Something-Something are recently released datasets in the action recognition field, we also report the test results on the official leaderboards of each dataset for comparison with previous studies. Table 4 shows that MFNet achieves performance comparable to the state-of-the-art methods, with \(96.22\%\) and \(37.48\%\) top-1 accuracy on the Jester and Something-Something test sets, respectively, on the official leaderboards. Note that we do not introduce any other modalities, ensemble methods, or initialization weights pre-trained on large-scale datasets such as ImageNet [4] and Kinetics [14]; we only utilize the officially provided RGB images as the input for our final results. Thus, even without 3D ConvNets and additional complex testing strategies, our method provides competitive performance on the Jester and Something-Something datasets.

Fig. 5. Confusion matrices of the TSN baseline and our proposed MFNet on the Jester dataset. The figure is best viewed in electronic form.

Fig. 6. Validation accuracies of models trained with different numbers of segments K, while varying the number of validation segments from 2 to 25. The x-axis represents the number of segments at inference time and the y-axis is the validation accuracy of MFNet-C50 trained with different K.

4.3 Analysis on the Behavior of MFNet

Confusion Matrix. We analyze the effectiveness of MFNet in comparison with the baseline. Figure 5 shows the confusion matrices of the TSN baseline (left) and MFNet (right) on the Jester dataset; class numbers and the corresponding class names are listed below the matrices. Figure 5 suggests that the baseline model confuses each action class with its counterpart class, i.e., it has trouble classifying temporally symmetric action pairs such as (‘Swiping Left’, ‘Swiping Right’) and (‘Two Finger Down’, ‘Two Finger Up’).

The baseline predicts an action class by simply averaging the results of the sampled frames. Consequently, without optical flow information, it may fail to distinguish some temporally symmetric action pairs. Specifically, the baseline attains 62.38% accuracy on the ‘Rolling Hand Forward’ class, with 35.7% of its samples misclassified as ‘Rolling Hand Backward’. In contrast, our MFNet shows a significant improvement over the baseline model, as shown in Fig. 5 (right): in our experiments it achieves 94.62% accuracy on ‘Rolling Hand Forward’, with only 4.2% identified as ‘Rolling Hand Backward’. This demonstrates the ability of MFNet to capture motion representations.

Varying the Number of Segments in the Validation Phase. We evaluate models with different numbers of frames in the inference phase. Figure 6 shows the experimental results of MFNet-C50 on the Jester (left) and Something-Something (right) datasets. As discussed in Sect. 4.2, the number of training segments K is a crucial parameter for performance. Overall performance across all numbers of validation segments is superior for large K (7). Meanwhile, the optimal number of validation segments differs for each K; interestingly, it does not coincide with K but is slightly larger than K. Using more segments reduces the frame interval, which allows the model to extract more precise spatio-temporal features and improves performance. However, this effect does not last if the numbers of segments in the training and validation phases differ too much.

5 Conclusions

In this paper, we present MFNet, a unified network containing appearance blocks and motion blocks which can represent both spatial and temporal information for action recognition problems. In particular, we propose the motion filter, which outputs motion features by applying the shift operation with a fixed set of predefined directional filters and subtracting the resulting feature maps from the feature maps of the preceding frame. This module can be attached to any existing CNN-based network at a small additional cost. We evaluate our model on two datasets, Jester and Something-Something, and obtain results that outperform existing ones by training the network from scratch in an end-to-end manner. We also perform comprehensive ablation studies and an analysis of the behavior of MFNet to show the effectiveness of our method. In the future, we will validate our network on large-scale action recognition datasets and further investigate the usefulness of the proposed motion block.