Learn to cycle: Time-consistent feature discovery for action recognition

Generalizing over temporal variations is a prerequisite for effective action recognition in videos. Despite significant advances in deep neural networks, it remains a challenge to focus on short-term discriminative motions in relation to the overall performance of an action. We address this challenge by allowing some flexibility in discovering relevant spatio-temporal features. We introduce Squeeze and Recursion Temporal Gates (SRTG), an approach that favors inputs whose activations are similar up to temporal variations. We implement this idea with a novel CNN block that uses an LSTM to encapsulate feature dynamics, in conjunction with a temporal gate that evaluates the consistency of the discovered dynamics and the modeled features. We show consistent improvements when using SRTG blocks, with only a minimal increase in the number of GFLOPs. On Kinetics-700, we perform on par with current state-of-the-art models, and we outperform these models on HACS, Moments in Time, UCF-101 and HMDB-51.


Introduction
Action recognition in videos is an active field of research. A major challenge is the vast variation in the temporal execution of actions [9,24]. In deep neural networks, temporal motion has primarily been modeled either through the inclusion of optical flow as a separate input stream [21] or through the use of 3D convolutions [13]. The latter have shown consistent improvements in state-of-the-art models [2,3,6,5].
3D convolution kernels in convolutional neural networks (3D-CNNs) take into account fixed-sized temporal regions. Kernels in early layers have small receptive fields that primarily focus on simple patterns such as texture and linear movement. Later layers have significantly greater receptive fields that are capable of modeling complex spatio-temporal patterns. Through this hierarchical dependency, the relations between discriminative short-term motions within the larger motion patterns are only established in the very last network layers. Consequently, when training a 3D-CNN, the learned features might include incidental correlations instead of consistent temporal patterns. Thus, there appears to be room for improvement in the discovery of discriminative spatio-temporal features.
To improve this discovery process, we propose a method named Squeeze and Recursion Temporal Gates (SRTG), which aims at extracting features that are consistent in the temporal domain. Instead of relying on a fixed-size window, our approach relates specific short-term activations to the overall motion in the video, as shown in Figure 1. We introduce a novel block that uses an LSTM [10] to encapsulate feature dynamics, and a temporal gate to decide whether these discovered dynamics are consistent with the modeled features. The novel block can be added to a wide range of existing architectures with minimal computational overhead. We discuss the advancements in the modeling of time in action recognition in Section 2. A detailed description of the main methodology is provided in Section 3. The experimental setup and results are presented in Section 4, and we conclude in Section 5.

Related Work
We discuss how temporal information is represented in CNNs, in particular using 3D convolutions.
Time representation in CNNs. Apart from the hand-coded calculation of optical flow [21], the predominant method for representing spatio-temporal information in CNNs is the use of 3D convolutions. These convolutions process motion information jointly with spatial information [13]. Because the spatial and temporal dimensions of videos are strongly connected, this has led to great improvements especially for deeper 3D-CNN models [2,8]. Recent work additionally targets the efficient incorporation of temporal information at different time scales through the use of separate pathways [3,6].
3D convolution variants. A large body of work has focused on reducing the computational requirements of 3D convolutions. Most of these attempts are targeted towards the decoupling of temporal information, for example as pseudo and (2+1)D 3D convolutions [19,27]. Others have proposed a decoupling of horizontal and vertical motions [25].
Information fusion of spatio-temporal activations. Squeeze and Excitation [12], Gather and Excite [11] and Point-wise Spatial Attention [33] consider self-attention in convolutional blocks for image-based input. In the video domain, self-attention has been implemented by Long et al. [16] using clustering, to integrate local patterns with different attention units. Others have studied the use of non-local operations that capture long-range temporal dependencies through different distances [29]. Wang et al. [28] proposed to filter feature responses with activations decoupled to branches for appearance and spatial relations. Qiu et al. [20] have extended the idea of creating separate pathways for general features that can be updated through network block activations.
While these methods have shown increased generalization performance, they do not address the discovery of local spatio-temporal features across large time sequences. As activations are constrained by the spatio-temporal locality of their receptive fields, they cannot effectively account for extended temporal variations of actions in terms of their general motion and time of execution. Instead of attempting to map the locality of features to each of the frame-wise activations, our work combines the locally-learned spatio-temporal features with their temporal variations across the duration of the video sequence.

Squeeze and Recursion Temporal Gates
In this section, we introduce Squeeze and Recursion Temporal Gates (SRTG) blocks, and the possible configurations for their use in CNNs. We denote the layer input as a stack of T frames a ∈ R^(C×T×H×W), with C the number of channels, T the number of frames, and H and W the spatial dimensions of the video. The backbone blocks to which SRTG is applied also include residual connections, where the final accumulated activations are the sum of the previous block's activations a^[l−1] and the currently computed features z^[l], denoted as a^[l] = z^[l] + a^[l−1], with block index l.

Squeeze and Recursion
Squeeze and Recursion blocks can be built on top of any spatio-temporal activation map a^[l] = g(z^[l]), for any activation function g(·) applied to a volume of features z^[l], as shown in Figure 2(a). This process is similar to Squeeze and Excitation [12]. For each block, the activation maps are sub-sampled in both spatial dimensions to create a vectorized representation of the volume's features across time. Each element of the vector squeezes the intensity values of one channel in one frame into a single average value. This process encapsulates the average temporal attention over the discovered features.
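As a minimal illustration of the squeeze step, the spatial sub-sampling can be sketched as a global average pool over the two spatial dimensions of a (C × T × H × W) activation volume; the function name and sizes below are illustrative, not the paper's implementation:

```python
import numpy as np

def spatial_squeeze(a):
    """Average-pool a (C, T, H, W) activation volume over its spatial
    dimensions, yielding one average intensity per channel per frame."""
    return a.mean(axis=(2, 3))  # -> (C, T)

# Example: 64 channels, 16 frames, 56x56 spatial resolution
a = np.random.rand(64, 16, 56, 56)
v = spatial_squeeze(a)
print(v.shape)  # (64, 16)
```

The resulting (C, T) vector sequence is what the recurrent cells below operate on.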
Recurrent cells. The importance of each feature in the temporal attention feature vector is decided by an LSTM subnetwork. Through the sequential chain structure of recurrent cells, the overall features that are generally informative for entire video sequences can be discovered. We briefly describe the inner workings of the LSTM sub-network [10] and how the importance of each feature for the entire video is learned, as depicted in Figure 3.
To focus on salient patterns, low-intensity activations are discarded in the first operation of the recurrent cell, at the forget gate layer. A decision f(t) is made given the input pool(a^[l])(t) and the informative features from the previous frame h(t−1):

f(t) = σ(W_f · [h(t−1), pool(a^[l])(t)] + b_f)

The features that are to be stored are decided by the product of the sigmoidal (σ) input gate layer i(t) and the vector of candidate values C̃(t), computed as:

i(t) = σ(W_i · [h(t−1), pool(a^[l])(t)] + b_i)
C̃(t) = tanh(W_C · [h(t−1), pool(a^[l])(t)] + b_C)

Figure 3: Overview of the LSTM-chained cells used for the discovery of globally informative local features. Each input corresponds to a temporal activation map and produces a feature vector of the same size as the input.
The previous cell state C(t−1) is then updated based on the forget and input gates, in order to ignore features that are not consistent across time and to determine the update weight. The new cell state C(t) is calculated as:

C(t) = f(t) ⊙ C(t−1) + i(t) ⊙ C̃(t)

The output of the recurrent cell h(t) is given by the current cell state C(t), the previous hidden state h(t−1) and the current input pool(a^[l])(t) as:

o(t) = σ(W_o · [h(t−1), pool(a^[l])(t)] + b_o)
h(t) = o(t) ⊙ tanh(C(t))

The hidden states are then stacked again to re-create a coherent sequence of filtered spatio-temporal feature intensities ã^[l]. This new attention vector considers previous cell states, thus creating a generalized vector based on the feature intensity across time.
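These gate equations can be written down directly. The following is a minimal NumPy sketch of a single recurrent step, with the four gate weight matrices stacked into one matrix W; the dimensions and names are illustrative, not taken from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step on a pooled activation vector x_t (size D), with
    hidden size H. W stacks the forget/input/candidate/output weights
    as a (4H, H + D) matrix; b holds the (4H,) biases."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0:H])                # forget gate f(t)
    i = sigmoid(z[H:2 * H])            # input gate i(t)
    c_tilde = np.tanh(z[2 * H:3 * H])  # candidate values C~(t)
    o = sigmoid(z[3 * H:4 * H])        # output gate o(t)
    c = f * c_prev + i * c_tilde       # new cell state C(t)
    h = o * np.tanh(c)                 # hidden state h(t)
    return h, c
```

Iterating this step over the T pooled frames yields the sequence of hidden states that is stacked back into the filtered attention vector.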

Temporal Gates for cyclic consistency
Cyclic consistency. To evaluate the similarity between two temporal volumes, cyclic consistency has been widely used [4,30]. The technique is based on a one-to-one mapping of frames from two time sequences, schematically summarized in Figure 4. Each of the two feature spaces can be considered an embedding space. Two embedding spaces are cycle-consistent if and only if each point at time t in embedding space A has a minimum-distance point in embedding space B that is also at time t, and, equivalently, each point at time t in embedding space B has a minimum-distance point in embedding space A at time t. As shown in Figure 4, when points do not cycle back to the same temporal location, they do not exhibit cyclic consistency. In this case, a temporal cyclic error occurs.
By having points that can cycle back to themselves, a similarity baseline between embedding spaces can be established. Although individual features of the two spaces may differ, the spaces should demonstrate an overall similarity as long as their alignment in terms of cyclic consistency is the same. Therefore, comparing volumes by their cyclic consistency is a suitable measure to account for (temporal) variations.

Soft nearest neighbor distance. The main challenge in creating a coherent similarity measure between two embeddings is to deal with the vast embedding spaces, as well as to discover the "nearest" point in an adjacent embedding. The idea of soft matches for projected points in embeddings [7] is based on finding the closest point in an embedding space through the weighted sum of all possible matches, and then selecting the closest actual observation.
To find the soft nearest neighbor of an activation a_A(t) in embedding space B, the Euclidean distances between a_A(t) and all points in B are calculated (see Figure 5). Each frame is considered a separate instance for which we want to find the minimum-distance point in the adjacent embedding space. We weight the similarity of each frame a_B(k) in embedding space B to activation a_A(t) using a softmax activation over the exponentiated distances between activation pairs:

α_k = exp(−‖a_A(t) − a_B(k)‖²) / Σ_j exp(−‖a_A(t) − a_B(j)‖²)

The softmax activation produces a distribution of similarities centered on the frame with the minimum distance from activation a_A(t), yielding the soft nearest neighbor as the weighted sum ã_B(t) = Σ_k α_k a_B(k). Based on this discovered soft match, the frame that cycles back to embedding space A can then be computed. This allows the discovery of the frame a^(B→A)(t) that is most closely related to the initially considered frame a_A(t), achieved by minimizing the L2 distance from the found soft match:

a^(B→A)(t) = argmin_{a_A(k)} ‖ã_B(t) − a_A(k)‖²

We define a point as consistent if and only if the initial temporal location t matches precisely the temporal location of the point cycled back from embedding space B, i.e., a^(B→A)(t) = a_A(t), ∀t ∈ {1, ..., T}.

Temporal gates. The temporal activation vector encapsulates the average feature attention over time. However, it does not enforce a precise similarity to the local spatio-temporal activations. We therefore compute cyclic consistency between the pooled activations pool(a^[l]) and the output of the recurrent cells ã^[l]. In this context, cyclic consistency is used as a gating mechanism that only fuses the recurrent cell hidden states with the unpooled versions of the activations when the two volumes are temporally cycle-consistent. This condition ensures that only time-consistent information is added back to the network, as shown for the active states in Figure 2(a).
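A simplified sketch of this cycle-consistency check, treating each embedding as a (T × D) matrix of per-frame features; this illustrates the mechanism under our own simplifications and is not the paper's exact gate, which compares the pooled activations with the LSTM outputs:

```python
import numpy as np

def soft_nn(q, B):
    """Soft nearest neighbor of query frame q in embedding B (T, D):
    a softmax over negative squared distances weights all frames of B."""
    d = ((B - q) ** 2).sum(axis=1)
    w = np.exp(-d) / np.exp(-d).sum()
    return (w[:, None] * B).sum(axis=0)

def cycle_consistent(A, B):
    """True iff every frame A[t] cycles back to index t: its soft match
    in B must be closest to A[t] among all frames of A."""
    for t in range(A.shape[0]):
        soft_match = soft_nn(A[t], B)
        back = ((A - soft_match) ** 2).sum(axis=1).argmin()
        if back != t:
            return False
    return True

A = np.arange(4.0)[:, None] * 3          # four well-separated frames
print(cycle_consistent(A, A.copy()))      # True: identical embeddings
print(cycle_consistent(A, A + 8.0))       # False: shifted embedding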

SRTG block variants
Cyclic consistency can be considered in different parts of a convolution block, and we investigate six different configurations for constructing an SRTG block. In each case, the principle of fusing global and local information remains the same; the configurations only differ in the relative locations of the SRTG gate and the LSTM input. All configurations are shown in Figure 2(b). Similar to networks with residual connections, we consider Simple blocks with two convolution operations and Bottleneck blocks with three convolution operations. Not all SRTG configurations apply to the Simple blocks.
Start. SRTG is the very first process in the block to ensure that all operations will be based on both global and local information. The configuration can be used in both Simple and Bottleneck residual blocks.
Top. Activations of the first convolution are used by the LSTM, with fused features being used by the final convolution. This is specific to Bottleneck blocks.
Mid. SRTG is added at the middle of Simple blocks and after the second convolution at Bottleneck blocks.
End. Local and global features are fused at the end of the final convolution, before the addition of the residual connection. This is only used in Bottleneck blocks.
Res. The SRTG block can also be applied to the residual connection. This transforms the residual connection to further include global spatio-temporal features and to combine those with the convolutional activations for either Simple or Bottleneck blocks.
Final. SRTG is added at the end of the residual block, which allows for the activations to be calculated jointly with their representations across time on the entire video. This can be used in both Simple and Bottleneck blocks.
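Putting the pieces together, the Final placement can be sketched as follows, with the convolution path, the LSTM and the consistency test passed in as stand-in callables. The multiplicative fusion of the hidden states with the unpooled activations is our assumption of one plausible fusion form, not the paper's verified implementation:

```python
import numpy as np

def srtg_final(x, conv_path, lstm, consistent):
    """Sketch of the 'Final' SRTG configuration: the gate operates on
    the whole residual output a[l] = z[l] + a[l-1]."""
    a = conv_path(x) + x               # residual block output, (C, T, H, W)
    v = a.mean(axis=(2, 3))            # squeeze: pooled intensities, (C, T)
    h = lstm(v)                        # recursion: filtered intensities, (C, T)
    if consistent(v, h):               # temporal gate via cyclic consistency
        a = a * h[:, :, None, None]    # fuse global weights (assumed form)
    return a

# Identity-style stubs: with a zero conv path, an all-ones LSTM output and
# an always-open gate, the block reduces to a pass-through.
x = np.random.rand(8, 16, 7, 7)
out = srtg_final(x, lambda t: np.zeros_like(t), lambda v: np.ones_like(v),
                 lambda v, h: True)
print(np.allclose(out, x))  # True
```

The other configurations differ only in which intermediate activation plays the role of `a` here.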

Experiments and Results
We evaluate our approach on five action recognition benchmark datasets (Section 4.1). We perform experiments with ResNet backbones of various depths. Each network uses either 3D convolutions (r3d) or (2+1)D convolutions (r(2+1)d).

Datasets
We use five action recognition datasets for our experiments: Human Action Clips and Segments (HACS, [32]) includes approximately 500K clips of 200 classes. Clips are 60-frame segments extracted from 50K unique videos.
Moments in Time (MiT, [18]) is one of the largest video datasets of human actions and activities. It includes 339 classes with approximately 800K, 3-second clips.
UCF-101 [22] includes 101 classes and 13K clips that vary between 2 and 14 seconds in duration.
We additionally evaluate on Kinetics-700 (K-700), with approximately 650K clips of 700 classes, and on HMDB-51, with approximately 7K clips of 51 classes.

Experimental settings
Training was performed with random sub-sampling of 16 frames, resized to 224 × 224. We adopted a multigrid training scheme [31] with an initial learning rate of 0.1, halved at each cycle. We used an SGD optimizer with a weight decay of 1e-6 and a step-wise learning rate reduction. All tested SRTG blocks incorporate stacked dual LSTMs (2 layers). For HACS, K-700 and MiT, we use the train/test splits suggested by the authors, and we report on split 1 for UCF-101 and HMDB-51.

Comparison of SRTG block configurations
We compare the different SRTG block configurations with a 34-layer r3d and r(2+1)d. ResNet-34 contains Simple blocks with two convolution layers instead of Bottleneck blocks with three convolution layers. We therefore only evaluate the Start, Mid, Res and Final configurations. The results, summarized in Table 1, are obtained on HACS by training from scratch. All SRTG blocks perform better than their vanilla counterparts. This demonstrates the merits of our more flexible treatment of the temporal dimension. The effect appears to be stronger when the filtering is applied later in the block. Indeed, the best performing SRTG configuration, Final, achieves a top-1 accuracy improvement of 3.781% for 3D and 4.686% for (2+1)D convolution blocks.

Comparison of network architectures
To better understand the merits of our method, we compare a number of network architectures with and without SRTG (Final configuration). We summarize the performance on all five benchmark datasets in Table 2. The top part of the table contains the results for state-of-the-art networks, including I3D [2], which is based on an Inception-v1 network. The remaining evaluated architectures use ResNet backbones. Temporal Shift Module (TSM, [15]) and Multi-Fiber networks (MF, [3]) use a r3d-50 backbone, while Channel-Separated Convolutions (ir-CSN, [26]) and SlowFast networks (SF, [6]) are based on r3d-101 backbones. We further include a 50-layer SlowFast network for an additional comparison of lower-capacity models. We have used the trained networks from the respective authors' repositories. These trained models are typically pre-trained on other datasets. Missing values are due to the lack of a trained model. Any deviations from previously reported performances are due to the use of multigrid [31] with a base cycle batch size of 32.
The second and third parts of Table 2 summarize the performances of ResNets with various depths and 3D or (2+1)D convolutions, with and without SRTG, respectively. Models for HACS are trained from scratch. The weights of models for K-700 and MiT are initialized based on those from the pre-trained HACS model. For UCF-101 and HMDB-51, we fine-tune the HACS pre-trained models. Missing values are due to time constraints. We will add these in the final version of the paper.
For the state-of-the-art architectures, the use of larger and deeper models provides accuracy improvements. This is in line with the general trend for action recognition using CNNs with architectures that are either deeper or more complex. Models implemented with (2+1)D convolution blocks perform somewhat better than their counterparts with 3D convolutions. These differences are modest, however, and not consistent across datasets.
As shown in Table 2, adding SRTG blocks to any architecture consistently improves performance. Table 3 shows pairwise comparisons of the performance on the three largest benchmark datasets for networks with and without SRTG. When using SRTG blocks, the improvements are in the range of 1.2-4.7% for HACS, 2.8-4.4% for K-700 and 2.1-3.7% for MiT. For smaller networks, we observe larger gains. The use of timeconsistent features obtained through our method appears to improve the generalization ability of 3D-CNNs.
The r3d and r(2+1)d networks with SRTG perform on par with the current state-of-the-art architectures. The r3d-101 outperforms the current state-of-the-art on HACS, MiT, UCF-101 and HMDB-51. For MiT, we achieve a top-1 accuracy of 33.564%, which largely surpasses the other tested architectures. The (2+1)D variant further outperforms current architectures on HACS with 84.326% top-1 accuracy. We also note a performance on Kinetics-700 that is comparable to the best performing SlowFast r3d-101 model. While the SlowFast network achieves better top-1 accuracy, a r(2+1)d-101 network with SRTG blocks has higher top-5 accuracy. This similar performance is remarkable given the relatively low complexity of the SRTG r3d-101 and r(2+1)d-101 models. SlowFast is built on a dual-network configuration with two sub-parts responsible for long-term and short-term movements. The SlowFast network therefore includes a significantly larger number of operations than a r3d-101 or r(2+1)d network with plug-in SRTG blocks. We analyze the computational cost of the SRTG block in Section 4.5.
Finally, we observe that the performance gain with SRTG is substantial for the two smaller datasets, UCF-101 and HMDB-51. Especially for UCF-101, action recognition accuracy is close to saturation. Still, the already competitive performance of the ResNet-101 models on UCF-101 increases by 1.569% and 1.778% for the 3D and (2+1)D convolution variants, respectively. This further demonstrates that SRTG helps to select features that contain less noise and generalize better, even when less training data is available.

Analysis of computational overhead
The SRTG block can be added to a large range of 3D-CNN architectures. It leverages the small computational cost of LSTMs compared to 3D convolutions, which enables us to increase the number of parameters without a significant increase in the number of GFLOPs. This also corresponds to a small additional memory usage compared to baseline models, on both forward and backward passes. We present the number of multiply-accumulate operations (MACs) used by the r3d and r(2+1)d architectures with and without SRTG in Figure 6, with respect to the corresponding accuracies. Multiply-accumulate operations [17] are products of two numbers added to an accumulator; for convolutions, they correspond to the accumulated sums of the dot products between the kernel weights and the input regions. The additional computational overhead for models that include the proposed block is approximately 0.15% of the total number of operations in the vanilla networks. This constitutes a negligible increase compared to the performance gains, making SRTG a lightweight block that can easily be used on top of existing networks.
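The scale of this overhead can be checked with back-of-envelope MAC counts for a single late-stage layer; the channel, frame and spatial sizes below are illustrative assumptions, not the paper's measured configuration:

```python
# Illustrative late-stage ResNet sizes: C channels, T frames, HxW feature map.
C, T, H, W = 512, 16, 7, 7

# A 3x3x3 convolution (C -> C channels) performs its kernel dot product at
# every spatio-temporal position of the feature map.
conv_macs = (3 * 3 * 3 * C * C) * (T * H * W)

# The LSTM instead runs on spatially pooled (C,)-vectors, once per frame,
# with four gates of input size 2C (hidden state concatenated with input).
lstm_macs = (4 * (2 * C) * C) * T

print(lstm_macs / conv_macs)  # well under 1% of the convolution's cost
```

Because the LSTM never touches the H × W spatial positions, its cost stays negligible relative to even a single convolution layer.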

Evaluating feature transferability
A common practice when training CNNs is to use transfer learning with a pre-trained network. To evaluate the performance of the SRTG block after transfer learning, we pre-train on several datasets and fine-tune on the smaller datasets UCF-101 and HMDB-51. Through this, we can further eliminate biases related to the pre-training datasets and compare the accuracies achieved with respect to the SRTG blocks.
As shown in Table 4, the accuracy rates remain fairly consistent across the pre-training datasets. This consistency is due to the large sizes of these datasets, as well as the overall robustness of the proposed method. The average offset between each of the pre-trained models is 0.71% for UCF-101 and 0.47% for HMDB-51. These are only minor changes in accuracy, which further demonstrates that the improvements observed are due to the inclusion of SRTG blocks in the network.

Conclusions
We have introduced a novel Squeeze and Recursion Temporal Gates (SRTG) block that can be added to a large range of CNN architectures to create time-consistent features. The SRTG block uses an LSTM to capture multi-frame feature dynamics, and a temporal gate to evaluate the cyclic consistency between the discovered dynamics and the modeled features. SRTG blocks add a negligible computational overhead (0.03-0.4 GFLOPs), which keeps both forward and backward passes efficient. Adding our proposed SRTG blocks to ResNet backbones with 3D or (2+1)D convolutions consistently leads to performance gains. We obtain results that are on par with, and in most cases outperform, the current state-of-the-art on action recognition datasets including Kinetics-700 and Moments in Time. For HACS, we obtain a state-of-the-art performance of 84.3%. Our combined experiments demonstrate the generalization ability of the discovered time-consistent features.

Acknowledgments
This publication is supported by the Netherlands Organization for Scientific Research (NWO) with a TOP-C2 grant for Automatic recognition of bodily interactions (ARBITER).