Abstract

With the explosive growth of Internet video data, the demand for accurate large-scale video classification and management is increasing. In real-world deployment, the balance between effectiveness and timeliness must be fully considered. Industrial deployments generally adopt video classification algorithms built on the temporal segment network, which classify video actions from features of sampled frames. However, such coarse feature descriptions cause semantic deviation. In this paper, we propose a novel method, based on image dense features and internal salient detail description, to enhance the generalization and discrimination of the feature description. Specifically, a location information layer encoding spatial-temporal geometric relationships is added to effectively characterize the local features of the convolution layer. Moreover, a multimodal feature graph network is introduced to effectively improve the generalization ability of feature fusion. Extensive experiments show that the proposed method effectively improves results on two commonly used benchmarks (Kinetics-400 and Kinetics-600).

1. Introduction

Video classification is the task of automatically identifying the category of an input video. The key challenges are not only to serialize and understand the spatiotemporal relationships in the frames but also to abstract the most important information and predict the category from the extracted representation. In the last decade, video classification networks have achieved great success, and a series of excellent methods [15] have been proposed. At the same time, many benchmark datasets have been built [7, 15].

In image recognition, the two spatial dimensions of width and height are processed through convolution operations. Many experiments [8, 9] show that images have isotropic characteristics such as translation invariance under a first-order approximation. Correspondingly, the object motion contained in a video exhibits the same directional consistency [10]. Therefore, frame sampling is an effective approach for video classification [14] and is a commonly used solution in industry.

Dual-stream video classification methods [1, 3, 11, 12] are favored by industry in practical applications. These methods model RGB and optical flow data separately and concatenate the features for video classification. However, they rely on the complete features of video frames [11, 12, 13] and ignore the varying importance of features at different spatiotemporal positions.

As shown in Figure 1, during penguin diving, the penguin's limbs change as the action progresses over time, and a spatial correlation between the motion scene and the action is also present. This correlation is indicated by the red-dashed arrows in Figure 2: both between adjacent grids of the current frame and between temporally adjacent frames, there is correlation of actions and consistency of the background.

On the whole, dual-stream video classification methods have two important problems: (1) frame-level features describe global information, lack key-region feature mining, and insufficiently model action subjects that occupy a relatively small proportion of the image; (2) optical flow and image features are often learned separately, and the lack of interaction between the features prevents them from being fused effectively. Therefore, there is still considerable room for improvement.

To solve these problems, this paper proposes a video classification method based on image dense features and internal salient detail description. The proposed method has the following advantages: (1) we propose to find the key motion features; by incorporating a spatiotemporal location coding feature, the key motion areas can be found more easily; (2) with the help of multimodal feature fusion through a cross-attention mechanism, RGB features and optical flow features can fully interact with each other.

2. Related Work

At present, video classification includes methods based on spatiotemporal representation and methods based on video stream recognition. In methods based on spatiotemporal representation, actions are expressed as changes of spatiotemporal objects in time and space, and key features can be filtered and captured by spatiotemporal directional modeling [13, 14]. To take the spatiotemporal relationship into account during feature extraction, 3D convolution [6, 15, 16] extends 2D spatial image models [17–20] to the spatial-temporal domain. Among them, the I3D network [15] uses a dual-stream CNN with inflated 3D convolutions on RGB and optical flow sequences and has achieved good results in video classification. Other methods focus on sequential temporal modeling. A common approach is CNN+LSTM [1, 12, 21–23], which uses a CNN to extract frame features and an LSTM to integrate temporal features. Considering that processing spatial and temporal information separately can effectively improve computational performance, some schemes decompose the convolution into separate two-dimensional spatial and one-dimensional temporal filters [24–27]. Spatiotemporal joint feature filtering and fusion, including separable spatiotemporal feature modeling, can quickly grasp the main features and effectively model the key information of video actions; it is also a popular implementation and deployment approach in industry. However, its disadvantages are also obvious: frame-level features describe global information, lack key-region feature mining, and insufficiently model action subjects that occupy a relatively small proportion of the image.

In the early stage, classic handcrafted features based on optical flow were designed, such as optical flow histograms [28], motion boundary histograms [29], and trajectory descriptors [30]. Before the popularization of deep learning, these methods showed good results in action recognition. In the context of deep neural networks, the dual-stream method [11] effectively uses optical flow to obtain key action information by treating optical flow as another input modality of the video. This idea has also been effectively verified in many other methods [1, 12, 24], and some progress has been made. In the TSN method [1], the video is divided into multiple segments, and RGB frames and optical flow images are randomly sampled in different time periods to extract information for video action recognition. However, since the optical flow and image features are often learned separately and lack interaction, the two sources cannot be fused effectively, so there is still considerable room for improvement.

With the development of visual attention methods, they have been widely used in video understanding tasks. For video summarization, a dynamic and static visual attention method is proposed in [31] to optimize the recognition effect, and a global and local multihead attention method is adopted in [32]. GSE-GCN [33] proposes a Granularity-Scalable Ego-Graph Convolutional Network for obtaining a more satisfying summary, and [34] uses static and motion features with parallel attention to improve video summarization results. For video classification, ViS4mer [35] uses a multiscale temporal S4 decoder for subsequent long-range temporal inference. MViT [36] proposes window attention and pooling attention operations to compute local information and aggregate it. To evaluate the movements of infants in video, [37] uses a spatiotemporal attention selection mechanism. To handle the differences between features, BA-Transformer [38] applies different attention operations to different feature channel groups.

Recently, video classification methods based on transformers have made great progress. The Video Transformer Network [39] adds temporal attention on top of a pretrained ViT model [40] and achieves good performance. TimeSformer [41] studies five structures for spatiotemporal modeling and proposes a spatiotemporal attention mechanism based on factorization; the experimental results show that the algorithm achieves a good tradeoff between speed and performance. Based on an image classification structure, the Video Swin Transformer [42] adds the time dimension and achieves good results. ViViT [43] discusses four different ways to realize spatiotemporal attention on the basis of ViT [40]. To reduce the computation of the model, the TokShift transformer [44] proposes a pure convolution-free video classification algorithm. Multiscale ViT [45] is a multiscale vision transformer for video classification; to reduce computation, spatiotemporal modeling is carried out through the attention mechanism, and this method achieves good results.

Because the algorithm in this paper improves upon methods based on TSN [1], which are commonly used in industry, transformer-based video classification methods are not compared in this paper.

3. Video Classification Based on Salient Detail Description of Video Images

This paper proposes a video classification method based on salient detail description of video images (VCM-SDD). The flow of the proposed method is shown in Figure 3; VCM-SDD includes two serial feature fusion stages. The first is spatiotemporal consistency modeling, which processes the RGB image and the optical flow image separately and finds the salient details in the time domain and the spatial domain within a single modality. The second is multimodal feature fusion, which achieves feature interaction by using the RGB image features and the optical flow image features as query and key/value items for each other. The outputs of these two stages are fused for the final classification.
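As a rough orientation, the sketch below wires the two stages together in PyTorch. It is a minimal illustration under our own assumptions (a placeholder stage-1 module, the built-in multi-head attention for stage 2, and 2048-dimensional features over 49 grid tokens), not the authors' released implementation.

```python
import torch
import torch.nn as nn

class VCMSDDSkeleton(nn.Module):
    """Illustrative two-stage skeleton of VCM-SDD (not the official code).
    Stage 1: per-modality spatiotemporal consistency modeling over grid tokens.
    Stage 2: cross-modal fusion in which RGB and flow tokens attend to each other."""

    def __init__(self, dim=2048, num_classes=400, num_heads=8):
        super().__init__()
        # Placeholder for the stage-1 module of Section 3.1: any (B, N, dim) -> (B, N, dim) map.
        self.stc = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.ReLU())
        # Stage 2 (Section 3.2): each modality queries the other.
        self.rgb_to_flow = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.flow_to_rgb = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, rgb_tokens, flow_tokens):
        # rgb_tokens / flow_tokens: (B, N, dim) grid features from the CNN backbones
        rgb = self.stc(rgb_tokens)
        flow = self.stc(flow_tokens)
        rgb_fused, _ = self.rgb_to_flow(rgb, flow, flow)    # RGB queries flow
        flow_fused, _ = self.flow_to_rgb(flow, rgb, rgb)    # flow queries RGB
        pooled = torch.cat([rgb_fused.mean(1), flow_fused.mean(1)], dim=-1)
        return self.classifier(pooled)                      # per-segment class scores

# Example with random 7 x 7 grid tokens from the two modalities
scores = VCMSDDSkeleton()(torch.randn(2, 49, 2048), torch.randn(2, 49, 2048))
```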

3.1. Spatiotemporal Consistency Modeling

Many experiments [8, 9] show that images have isotropic characteristics such as translation invariance under a first-order approximation. At the same time, image features based on nonoverlapping grids can describe the characteristics of the moving subject in more detail. Whether between adjacent grids of the current frame or between temporally close frames, there is correlation of actions and consistency of the background in which the action occurs. Therefore, it is feasible to mine the salient details of actions under spatiotemporal consistency modeling. We represent a video as $V = \{S_1, S_2, \dots, S_K\}$; that is, the whole video is first decomposed into $K$ segments, where each subsegment $S_i \in \mathbb{R}^{T \times C \times W \times H}$, $T$ is the length of the subsegment, $C$ represents the number of image channels, and $W$ and $H$ represent the width and height of the image, respectively. Image appearance (RGB) features and motion optical flow features are extracted from each subsegment of the video.
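As a minimal sketch of the segment-then-sample step just described, the helper below splits a video into K equal segments and draws evenly spaced frame indices from each; the function name and the uniform sampling policy are our own illustrative assumptions in the spirit of TSN-style sampling.

```python
import numpy as np

def sample_segment_indices(num_frames, num_segments, frames_per_segment):
    """Split a video of `num_frames` frames into `num_segments` equal spans and
    pick `frames_per_segment` evenly spaced frame indices inside each span."""
    indices = []
    bounds = np.linspace(0, num_frames, num_segments + 1, dtype=int)
    for start, end in zip(bounds[:-1], bounds[1:]):
        span = max(end - start, 1)
        picks = start + np.linspace(0, span - 1, frames_per_segment).astype(int)
        indices.append(picks)
    return np.stack(indices)  # shape: (num_segments, frames_per_segment)

# Example: 16 segments with 6 frames each, as in the experiments
idx = sample_segment_indices(num_frames=300, num_segments=16, frames_per_segment=6)
```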

In the RGB image grid feature extraction, the last convolution layer is selected to obtain a two-dimensional image feature description:

$$F_l^{rgb} = f_l(I),$$

where $l$ represents the serial number of the convolution layer, $f_l$ represents the convolution feature map of the $l$-th layer in the network, $I$ is the input image, and $F_l^{rgb}$ represents the $l$-th layer convolution feature of the extracted RGB image. The final output dimension is $M \times M \times D$, where $M \times M$ represents the resolution of the feature map and $D$ represents the number of feature map channels.

In the optical flow image grid feature extraction, the same network as for the image feature extraction is used, and the feature is specifically expressed as

$$F_l^{flow} = f_l(I^{flow}),$$

where $l$ represents the serial number of the convolution layer, $f_l$ represents the convolution feature mapping of the $l$-th layer in the network, $I^{flow}$ is the input optical flow image, and $F_l^{flow}$ represents the $l$-th layer convolution feature of the extracted optical flow image. The final output dimension is likewise $M \times M \times D$, where $M \times M$ represents the resolution of the feature map and $D$ represents the number of feature map channels.
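For illustration, grid features from the last convolution stage of a ResNet-101 can be obtained by truncating the torchvision model before global pooling, as sketched below. The 224 x 224 input size, the resulting 7 x 7 x 2048 grid, and feeding the flow stack as an image-like tensor are assumptions for this sketch rather than details taken from the paper.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet101

# Truncate ResNet-101 just before global average pooling to keep the spatial grid
# of the last convolution stage (stride-32 feature map).
backbone = resnet101(weights=None)  # or weights="IMAGENET1K_V1" for ImageNet pretraining
grid_extractor = nn.Sequential(*list(backbone.children())[:-2])

frames = torch.randn(4, 3, 224, 224)           # a batch of RGB (or 3-channel flow) frames
feat = grid_extractor(frames)                  # (4, 2048, 7, 7)
grid_tokens = feat.flatten(2).transpose(1, 2)  # (4, 49, 2048): one token per grid cell
```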

In order to better integrate the relative position information of visual features, relative position information derived from the grid geometry is added. The bounding box of a region can be expressed as $(x, y, w, h)$, where $x$, $y$, $w$, and $h$ represent the central coordinates of the grid and its width and height; therefore, for any two grids $i$ and $j$, their geometric relationship is expressed as a 4-dimensional vector. Since grid cells share the same width $w$ and height $h$, the last two terms of the geometric relationship vector can be removed, and the vector simplifies to a 2-dimensional form.

It can be seen that the farther the distance between two grids, the smaller the corresponding geometric relationship value. The relationship between different grids is shown in Figure 4. Considering that inter-frame features adjacent in time have closer representations, we propose a geometric measure of the temporal relationship:

In the above formula, $t_i$ and $t_j$ respectively represent the temporal positions of the two features, and $F$ is the number of frames extracted from the current video clip. It can be seen from the formula that the closer two features are in time, the greater the value. The geometric relationship measurement after spatial-temporal combination can finally be expressed as

To prevent an abnormal logarithm operation when two features share the same spatial or temporal position, we add an offset term to each feature. Next, we build the feature similarity matrix according to the spatial-temporal geometric relationship:

In the above formula, a learnable feature mapping matrix and a learnable similarity mapping matrix are used; their dimensions are set as hyperparameters in the experiments. The specific method is to expand the geometric relationship vector into a high-dimensional space through the mapping matrix and to carry out the similarity mapping after ReLU activation to obtain the similarity eigenvalue.
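The mapping-then-similarity step can be sketched as a small two-layer module: the pairwise geometric vector is embedded into a higher-dimensional space, passed through ReLU, and mapped to a scalar similarity. The 3-dimensional geometric input and the hidden width of 64 are placeholder assumptions, since the exact dimensions are not reproduced above.

```python
import torch
import torch.nn as nn

class GeometricSimilarity(nn.Module):
    """Maps a pairwise geometric-relation vector to a scalar similarity:
    high-dimensional embedding -> ReLU -> learnable similarity mapping."""
    def __init__(self, geo_dim=3, hidden_dim=64):
        super().__init__()
        self.embed = nn.Linear(geo_dim, hidden_dim)   # learnable feature mapping matrix
        self.score = nn.Linear(hidden_dim, 1)         # learnable similarity mapping matrix

    def forward(self, geo):
        # geo: (N, N, geo_dim) pairwise geometric-relation vectors
        return self.score(torch.relu(self.embed(geo))).squeeze(-1)  # (N, N)

# Example: a similarity matrix over 49 grid cells from placeholder geometric vectors
sim = GeometricSimilarity()(torch.rand(49, 49, 3))
```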

The similarity eigenvalue proposed in this paper can be regarded as a spatiotemporally constrained cross-attention mechanism: the consistency and difference between regions are measured to realize feature fusion between regions. To avoid introducing semantic noise, we create a geometric alignment graph $G = (V, E)$. The grid features extracted over time are represented as independent nodes $v_i \in V$. Along the dimensions of time, width, and height, when the distance between two nodes is less than or equal to 2, the grid nodes are connected to form the edge set $E$. According to the above rules, we construct an undirected graph. From the geometric position relationship information of equation (7), the weight matrix $W$ can be obtained and normalized:

Here, $\tilde{w}_{ij}$ represents the normalized similarity between node $i$ and node $j$, $v_i$ represents the grid feature node, and $\mathcal{N}(i)$ represents all neighbor nodes adjacent to node $i$. After the above similarity operation, the feature output after RGB spatial-temporal fusion can be expressed as

$$\hat{v}_i = \sum_{j \in \mathcal{N}(i)} \tilde{w}_{ij}\, v_j.$$

Through the above operations, the RGB image grid features and the optical flow image grid features obtain their respective spatial-temporal fusion features, denoted $\hat{F}^{rgb}$ and $\hat{F}^{flow}$, respectively. This operation extracts the salient visual information in the image and helps to improve video classification; the following experiments also prove its effectiveness.
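A minimal sketch of the geometric alignment graph and the neighbor aggregation described above: nodes are grid features indexed by (time, row, column), edges connect nodes whose distance along each axis is at most 2, and each node's output is a weighted sum over its neighborhood. The softmax-style weight normalization used here is an assumption; the paper only states that the weights are normalized.

```python
import torch

def aggregate_over_graph(feats, coords, sim, max_dist=2):
    """feats: (N, D) grid-node features; coords: (N, 3) integer (t, y, x) positions;
    sim: (N, N) similarity scores. Each node aggregates neighbors whose |dt|, |dy|,
    and |dx| are all <= max_dist, with weights normalized over the neighborhood."""
    diff = (coords[:, None, :] - coords[None, :, :]).abs()
    adjacency = (diff <= max_dist).all(dim=-1)               # (N, N) boolean edge set
    weights = sim.masked_fill(~adjacency, float("-inf"))
    weights = torch.softmax(weights, dim=-1)                 # normalize per neighborhood
    return weights @ feats                                   # (N, D) fused node features

# Example: 6 frames x 7 x 7 grid = 294 nodes with 2048-d features
t, y, x = torch.meshgrid(torch.arange(6), torch.arange(7), torch.arange(7), indexing="ij")
coords = torch.stack([t.flatten(), y.flatten(), x.flatten()], dim=-1)
fused = aggregate_over_graph(torch.randn(294, 2048), coords, torch.randn(294, 294))
```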

3.2. Multimodal Feature Fusion

This paper improves on TSN [1] and TSM [2], which belong to the segment-based video classification methods commonly used in industry. In these methods, the optical flow and image features are learned separately, and the lack of interaction between the features makes it impossible to fuse them effectively. To solve this problem, this paper performs feature fusion modeling with a cross-attention mechanism in which the two modalities serve as retrieval (query) items and key/value items for each other. At this stage, the RGB image grid features and the optical flow image grid features alternately act as the retrieval items and the key/value items. For the RGB image features, fusion with the optical flow features via cross attention can be expressed as

For the optical flow image features, fusion with the RGB features via cross attention can be expressed as

In the above two equations, the high feature dimension increases the amount of computation, which leads to long inference and training times. In this paper, a method similar to that in the transformer [46] is used to split and group the features: features with large dimensions are divided into multiple parts and computed with multihead attention.
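The grouping idea can be sketched as follows: the D-dimensional grid features are split into several heads, attention is computed independently per head, and the heads are concatenated back. The head count and the omission of the learned query/key/value projections are simplifications for illustration.

```python
import torch

def multihead_cross_attention(q_feat, kv_feat, num_heads=8):
    """Hedged sketch of grouped (multi-head) cross attention: the feature dimension
    is split into `num_heads` groups, attention is computed per group, and the
    groups are concatenated back. Projection matrices are omitted for brevity."""
    B, N, D = q_feat.shape
    d = D // num_heads
    # Split the feature dimension into heads: (B, heads, N, d)
    q = q_feat.view(B, N, num_heads, d).transpose(1, 2)
    k = kv_feat.view(B, N, num_heads, d).transpose(1, 2)
    v = k
    attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)  # (B, heads, N, N)
    out = attn @ v                                                    # (B, heads, N, d)
    return out.transpose(1, 2).reshape(B, N, D)

# RGB tokens query the optical-flow tokens, and vice versa
rgb, flow = torch.randn(2, 49, 2048), torch.randn(2, 49, 2048)
rgb_fused = multihead_cross_attention(rgb, flow)
flow_fused = multihead_cross_attention(flow, rgb)
```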

3.3. Classification Result Fusion

In TSN [1], at the video result prediction stage, all segment recognition networks share model parameters, and the learned model performs frame-level evaluation like a normal image network. The details are as follows:

where the two prediction terms denote the results on the RGB frame and on the optical flow frames extracted from each segment of the video. Here, one frame is sampled per segment to obtain a single-segment inference result, and the final classification result is obtained by multisegment averaging.

Different from the above fusion results, here the multimodal fusion feature is used:

where the averaging operator means that the extracted multiframe features are averaged. The average is used here because the action is continuous across multiple frames, and measuring the feature values comprehensively with the average effectively reflects the salient detail description. The two fused terms represent the fusion features obtained by taking the RGB image and the optical flow image as the retrieval items, respectively.
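A small sketch of this score fusion: per-frame class scores inside each segment are averaged first, and the segment-level scores are then averaged across segments to produce the video-level prediction. The tensor layout and function name are our own assumptions.

```python
import torch

def fuse_scores(segment_scores):
    """segment_scores: (K, F, num_classes) class scores for F frames in each of K segments.
    Averages over frames within a segment, then over segments, following the
    multi-segment averaging scheme described above."""
    per_segment = segment_scores.mean(dim=1)   # (K, num_classes): frame-level average
    video_score = per_segment.mean(dim=0)      # (num_classes,): multi-segment average
    return video_score.argmax().item(), video_score

# Example: 16 segments, 6 frames each, 400 Kinetics classes
pred_class, scores = fuse_scores(torch.randn(16, 6, 400))
```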

4. Experiments

4.1. Model and Parameter Setting

The model used in this paper is shown in Figure 5. First, image features and optical flow features are extracted with ResNet101. Then, spatiotemporal information is added through spatiotemporal consistency modeling, and modal interaction information is obtained through multimodal feature fusion. Video classification is carried out after this feature enhancement. The numbers in Figure 5 show the shape information of the features.

To explore the impact of the number of video segments on the classification results, 8 or 16 segments per video are sampled evenly along the time axis during testing. For each segment, 6 frames are evenly extracted for feature extraction, and finally the results are fused by the averaging scheme described in Section 3.3.

Table 1 shows the parameter settings and hardware information used in the experiments. When extracting RGB image and optical flow features, similar to the TSN and TSM methods, the images are first scaled to 256 × 256 and the network inputs are then obtained by center cropping. To facilitate comparison, this paper uses the Top1 and Top5 evaluation metrics commonly used in video classification.
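For reference, a typical torchvision preprocessing pipeline matching this description is shown below; the 224 x 224 crop size and the ImageNet normalization statistics are common defaults assumed here, since the exact crop size is not stated above.

```python
from torchvision import transforms

# Assumed preprocessing: resize to 256 x 256, center-crop to 224 x 224 (typical for
# ResNet backbones), then normalize with ImageNet statistics.
preprocess = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```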

4.2. Experiment Dataset

To evaluate the effectiveness of the proposed algorithm, we conducted experiments on the Kinetics datasets [15]. Kinetics is a large-scale, high-quality dataset of YouTube video URLs with human action annotations, released by DeepMind to support research on machine learning for video understanding; it has two versions with different numbers of categories. Kinetics-400 contains about 260K video clips, including 240K training videos and 20K validation videos, covering 400 types of human actions with at least 400 video clips per action. Each clip lasts about 10 seconds and is labeled with one action category. All clips are manually annotated in multiple rounds, and each clip comes from a separate YouTube video. The actions cover a wide range of human-object interactions, such as playing musical instruments, and human-human interactions, such as shaking hands and hugging. Kinetics-600 contains 420K YouTube videos, including 392K training videos and 30K validation videos, with a total of 600 categories; each category has at least 600 videos, and each video lasts about 10 seconds. Kinetics is also the base dataset of the international human action classification competition organized by ActivityNet.

4.3. Comparison and Discussion

To verify the effectiveness of the algorithm, ablation experiments are carried out with and without spatiotemporal consistency and multimodal feature fusion. Table 2 shows the corresponding results. The experiments use a ResNet101 model pretrained on ImageNet as the feature extractor for both RGB images and optical flow images. The video is divided into 16 segments, and 6 frames are extracted from each segment.

VCM-SDD No_STC No_MFF means that neither spatial-temporal consistency nor multimodal feature fusion is included in model training and inference; as expected, it gives the worst results.

VCM-SDD STC No_MFF means that spatial-temporal consistency is added in training and inference but multimodal fusion is not. Compared with the variant without either component, Top1 and Top5 increase from 71.7 and 90.9 to 73.9 and 91.4, respectively, absolute gains of 2.2 and 0.5 points. This shows the effectiveness of the spatiotemporal consistency module.

VCM-SDD No_STC MFF indicates that spatial-temporal consistency is not used in training and inference but multimodal fusion is added. Compared with the variant that uses only spatial-temporal consistency, Top1 and Top5 increase from 73.9 and 91.4 to 78.5 and 93.6, respectively, absolute gains of 4.6 and 2.2 points. This indicates that multimodal fusion is more critical to the improvement of classification performance.

VCM-SDD STC MFF indicates that both spatial-temporal consistency and multimodal feature fusion are used. The Top1 and Top5 accuracies are 80.1 and 94.4, respectively, the highest in the whole ablation study, which demonstrates the effectiveness of the proposed algorithm.

Compared with the benchmark algorithms TSN and TSM, VCM-SDD STC MFF makes significant progress on Kinetics-400, mainly due to spatial-temporal consistency and multimodal feature fusion. Without these two operations, the result of VCM-SDD No_STC No_MFF is similar to TSN but worse than TSM, mainly because the temporal and spatial correlations of the features are not considered. When the spatial-temporal consistency operation is added, the temporal and spatial relationships between features are strengthened, and the VCM-SDD STC No_MFF result is better than TSN and slightly worse than TSM. When multimodal fusion is added, cross-modal modeling is carried out between the different features, the generalization performance is strengthened, and the VCM-SDD No_STC MFF result is better than both TSN and TSM, showing that modal interaction plays a positive role in the video classification algorithm. Compared with either operation alone, combining the two operations further improves the video classification results, which proves that the spatial-temporal consistency and multimodal fusion operations are effective. As shown in Table 3, the experiments on Kinetics-600 also confirm the effectiveness of the proposed algorithm.

4.4. Comparison of Experimental Results of Different Methods

Table 3 shows the comparison results of different methods on Kinetics-400; R101_NP denotes the result without loading the pretrained model. It can be seen that dividing the video into 16 segments works better. Without ImageNet pretraining, the Top1 accuracy with 16 segments is 79.3, versus 77.4 with 8 segments, an increase of 1.9 percentage points; Top5 also improves by 0.8 points. With ImageNet pretraining, the Top1 accuracy with 16 segments is 80.1, versus 78.5 with 8 segments, an increase of 1.6 percentage points, and Top5 also improves by 0.9 points.

Our algorithm divides the video into 16 segments. With ImageNet pretraining, Top1 and Top5 reach 80.1 and 94.4, respectively, which are 8.8 and 3.9 percentage points higher than the TSN algorithm, 5.0 and 2.6 percentage points higher than the TSM algorithm, and 0.3 and 0.5 percentage points higher than the better-performing SlowFast algorithm. The experimental results show that combining spatial-temporal consistency and multimodal fusion brings a clear improvement to the image-based two-stream recognition method.

To compare computational cost, this paper also reports the GFLOPs of each algorithm. Compared with the baselines TSN and TSM, the computation of our algorithm lies between the two, and its computational efficiency meets deployment requirements. Compared with SlowFast, the computation of our algorithm is significantly lower.

Table 4 shows the comparison results of different methods on the Kinetics-600 dataset, again with ResNet101 as the backbone. To explore the influence of the number of video segments on the classification results, the video is again divided into 8 or 16 segments. Dividing into 16 segments is also better here. Without ImageNet pretraining, the Top1 accuracy with 16 segments is 81.3, versus 79.6 with 8 segments, an increase of 1.7 percentage points; Top5 also improves by 0.7 points. With ImageNet pretraining, the Top1 accuracy with 16 segments is 81.9, versus 80.4 with 8 segments, an increase of 1.5 percentage points; Top5 also improves by 0.4 points. Our algorithm divides the video into 16 segments. With ImageNet pretraining, Top1 and Top5 reach 81.9 and 95.1, respectively, which are 10.2 and 4.5 percentage points higher than the TSN algorithm, 5.3 and 3.0 percentage points higher than the TSM algorithm, and 0.1 and 0.2 percentage points higher than the better-performing SlowFast algorithm. The experimental results on Kinetics-600 also show that combining spatial-temporal consistency and multimodal fusion improves the image-based two-stream recognition method.

5. Conclusion

This paper presents a method based on image dense features and internal salient detail description, which enhances the generalization and discriminability of the feature description and improves video classification performance. A location information layer encoding spatial-temporal geometric relationships is added to effectively characterize the local features of the convolution layer and to enhance the visual representation and detail description of the local grid features. A multimodal feature graph network interaction modeling mechanism is introduced to effectively improve the generalization ability of feature fusion. Results on the two datasets verify the effectiveness of the proposed method. At the same time, the proposed algorithm still has room for improvement. First, this paper only models grid features, whereas we find that bounding-box features of the moving subject have better expressive power. Second, we only fuse the different modal features at a late stage of modeling; in future work, we will consider integrating modal fusion into the whole modeling process.

Data Availability

The data included in this paper are available without any restriction.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the Natural Science Foundation of China (No. 61379106) and National Key R&D Plan of China (2019YF0301800).