3-Dimensional Bag of Visual Words Framework on Action Recognition

Human motion recognition plays a crucial role in video analysis. However, a given video may contain a variety of noises, such as an unstable background and redundant actions that are unrelated to the key actions, and these noises pose a great challenge to human motion recognition. To address this problem, we propose a new method based on the 3-Dimensional (3D) Bag of Visual Words (BoVW) framework. Our method consists of two parts. The first part is the video action feature extractor, which identifies key actions by analyzing action features. The second part is the video action encoder, in which we use a deep 3D CNN pre-trained model to obtain expressive coding information from the action characteristics of a given video. A classifier with subnetwork nodes performs the final classification. Extensive experiments demonstrate that our method is highly effective for complex video analysis, achieving state-of-the-art performance on the UCF101 (85.3%) and HMDB51 (54.5%) datasets.


Introduction
Video action recognition is a basic building block in various applications such as video retrieval, natural human-machine interaction, video surveillance, and digital entertainment [Liu, Su, Nie et al. (2017); Herath, Harandi and Porikli (2017); Song, Yu, Zhao et al. (2019)]. In action recognition, there are two important and complementary aspects: appearance and dynamics. Video often involves complicating factors such as camera motion, scale change, and viewpoint change, so whether an action recognition system can extract and exploit the relevant feature information is the key to its performance. However, extracting features effectively is not easy. The question of how to design a network structure that handles these factors while retaining discriminative information therefore becomes crucial. Recurrent neural networks (RNNs) have recently been applied to action recognition with success [Donahue, Anne, Guadarrama et al. (2015); Li, Qiu, Yao et al. (2016)]. Existing RNN-based 3D action recognition methods are mainly used for temporal modeling of long-term context information, representing dynamics based on motion.
However, in the spatial domain there are also strong dependencies between nodes. For 3D action recognition tasks, the spatial configuration of nodes in video frames can be highly discriminative. Liu et al. [Liu, Shahroudy and Xu (2016)] proposed a spatial-temporal long short-term memory (ST-LSTM) network and achieved good performance. The BoVW framework has also recently been used in motion recognition with good results. This framework includes two parts: a feature extractor and a classifier. Most BoVW models adopt Fisher vectors of improved dense trajectories [Wang and Schmid (2013); Wang, Qiao and Tang (2016); Fernando, Gavves, Oramas et al. (2017)] or CNN features [Wang, Xiong, Wang et al. (2016); Wang, Qiao and Tang (2015)] together with a classifier such as a support vector machine (SVM), and achieve reliable results on pre-segmented video datasets such as UCF-101 [Soomro, Zamir and Shah (2012)] and HMDB-51 [Kuehne, Jhuang, Garrote et al. (2011)].

In the action recognition field, CNNs with 3D kernels have recently proved more effective than CNNs with two-dimensional (2D) kernels [Carreira and Zisserman (2017)], and 3D CNNs are now used for accurate action recognition. However, 2D models remain strongly competitive on video data: even well-designed 3D models [Tran, Bourdev, Fergus et al. (2015); Varol, Laptev and Schmid (2016)] could not surpass 2D CNNs that combine stacked optical flow with RGB images [Simonyan and Zisserman (2014)], mainly because video datasets are usually small in scale, which prevents the large number of parameters in 3D CNNs from being optimized. In addition, 3D CNNs could previously only be trained on video datasets from scratch, while 2D CNNs can be pre-trained on ImageNet. Recently, Carreira et al. [Carreira and Zisserman (2017)] trained 3D CNNs on the Kinetics dataset and greatly boosted performance, which also makes it possible to use a 3D pre-trained model. Thus, we can now perform action recognition with a model pre-trained on the Kinetics dataset.

In this paper we propose a BoVW framework. Our network contains two parts, a feature extractor and a classifier (Fig. 1). We use a 3D residual network (ResNet) [He, Zhang, Ren et al. (2016)] pre-trained model as our feature extractor, and for the classifier we propose a single-layer feedforward network with subnetwork nodes (SLFN). We tested ResNet models of different depths, from shallower to deeper, on the UCF-101 and HMDB-51 datasets to ascertain which structure performs best, and we also tested the feasibility of the pre-trained model. We optimized our SLFN parameters to achieve better performance, and we evaluated the proposed approach in terms of both accuracy and time consumption. Other classifiers, such as SVM, can also be used in the method, and the proposed method can be applied to any type of video. The proposed framework is summarized in the following sections.

Network structure
The BoVW structure plays a powerful role in image recognition. The general idea of BoVW is to reduce the dimensions of an image or video and encode it into a set of features. Features comprise key points and descriptors. The key points are the "salient points" of the image, and they remain the same even if the image is rotated, shrunk, or expanded; a descriptor is a description of a key point. We can therefore represent each image in terms of the frequency of its features, and by virtue of this feature frequency we can predict the category of another image. To explore whether this structure is effective in the field of action recognition, we built our network (see Fig. 1 below) and tested its performance.
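To make the encoding step concrete, the following minimal sketch builds a visual vocabulary by clustering local descriptors and represents each sample as a word-frequency histogram. It illustrates the generic BoVW idea described above, not our specific pipeline: the random stand-in descriptors, the vocabulary size of 64, and the use of scikit-learn's KMeans are all assumptions made for illustration.

```python
# A minimal BoVW sketch: learn a codebook from local descriptors, then
# encode each sample as a normalized visual-word frequency histogram.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-in for descriptors pooled from a training set: (n_descriptors, dim).
train_descriptors = rng.normal(size=(5000, 128))

# 1) Learn a visual vocabulary (codebook) by clustering the descriptors.
n_words = 64
codebook = KMeans(n_clusters=n_words, n_init=10, random_state=0).fit(train_descriptors)

def bovw_histogram(descriptors: np.ndarray) -> np.ndarray:
    """Encode one image/video as the normalized frequency of its visual words."""
    words = codebook.predict(descriptors)                # nearest codeword per descriptor
    hist = np.bincount(words, minlength=n_words).astype(float)
    return hist / max(hist.sum(), 1.0)                   # L1-normalize

# 2) Encode a new sample from its local descriptors.
sample = bovw_histogram(rng.normal(size=(300, 128)))
print(sample.shape)  # (64,): a fixed-length representation, ready for a classifier
```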

3D ResNet
For the extractor, we focused on 3D CNNs, which have begun to outperform 2D CNNs on large-scale video datasets. Recently, Hara et al. [Hara, Kataoka and Satoh (2018)] conducted a series of experiments using 3D ResNets of different depths. They also compared the performance of models trained from scratch against pre-trained models, and their experiments showed that the ResNet-101 pre-trained model performed best. Based on the experimental results provided by Hara et al. [Hara, Kataoka and Satoh (2018)], we chose the 3D ResNet pre-trained model as our feature extractor. To ensure the accuracy of the results, we re-evaluated all experiments and also tested the performance of 3D ResNets of different depths. The results of our experiments are shown in the following section.

Feature extraction from 3D ResNet models
As mentioned in the previous subsection, we consider the 3D ResNet models, which were previously trained on the Kinetics dataset with impressive results. ResNet is a deep CNN model that consists of several convolution layers but only one fully connected layer, with an average-pooling layer placed after the convolution layers. As shown in Fig. 2, we extract deep features from the average-pooling layer.
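As an illustration of how such deep features can be read out of the average-pooling layer, the sketch below registers a forward hook in PyTorch. It uses torchvision's Kinetics-pretrained r3d_18 as a stand-in for the 3D ResNet-101 used in this work; the model choice and clip shape are assumptions, but the hook mechanics carry over to deeper 3D ResNets.

```python
# A minimal sketch of extracting deep features from the average-pooling
# layer of a Kinetics-pretrained 3D ResNet via a forward hook.
import torch
from torchvision.models.video import r3d_18

model = r3d_18(weights="KINETICS400_V1").eval()

features = {}
def grab(_module, _inputs, output):
    # Output of avgpool is (N, C, 1, 1, 1); flatten to an (N, C) feature vector.
    features["avgpool"] = torch.flatten(output, start_dim=1)

model.avgpool.register_forward_hook(grab)

# A dummy clip: batch of 1, RGB, 16 frames, 112x112 (the shape r3d_18 expects).
clip = torch.randn(1, 3, 16, 112, 112)
with torch.no_grad():
    model(clip)

print(features["avgpool"].shape)  # torch.Size([1, 512]) for r3d_18
```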

Single-layer classifier with sub-network nodes
In video data processing, time and computation costs are often relatively high. Compared to traditional classifiers such as SVM and the extreme learning machine (ELM), our proposed classifier greatly reduces both costs, and its iterative training tends to achieve better results. Fig. 3 shows the structure of our classifier.

Figure 3: Our proposed classifier
Given the encoded feature data, we use the proposed SLFN on the encoded-dimension data to classify actions with $L$ subnetwork nodes. The training data and the corresponding target labels are split into chunks of size $n$ and sent to the network sequentially. First, the chunk $(X_0, T_0)$ is used for the initial training of the network. In the following, $e$ denotes the residual network error, and $a$ and $\beta$ denote the input weight and output weight, which are updated in every iteration.

Let $H_0$ be the hidden-layer output for $X_0$, and let $K_0 = H_0^{\mathrm{T}} H_0 + I \times \frac{1}{C}$, where $I$ is the identity matrix and $C$ is a pre-defined regularization constant. Now we can write Eq. (1) in the following way:

$\beta^{(0)} = K_0^{-1} H_0^{\mathrm{T}} T_0$  (2)

After the initial training, we update the weights in a sequential manner with the next batch of training samples $(X_1, T_1)$. Combining $H_0$ and $H_1$, together with their corresponding residual errors $e_0$ and $e_1$, we can theoretically obtain the following:

$K_1 = K_0 + H_1^{\mathrm{T}} H_1$  (3)
$\beta^{(1)} = K_1^{-1}\left(H_0^{\mathrm{T}} T_0 + H_1^{\mathrm{T}} T_1\right)$  (4)
$H_0^{\mathrm{T}} T_0 = K_0 \beta^{(0)} = \left(K_1 - H_1^{\mathrm{T}} H_1\right)\beta^{(0)}$  (5)

According to Eqs. (3)-(5), we derive

$\beta^{(1)} = \beta^{(0)} + K_1^{-1} H_1^{\mathrm{T}}\left(T_1 - H_1 \beta^{(0)}\right)$  (6)

We can generalize Eq. (6) to the $(k+1)$-th chunk:

$K_{k+1} = K_k + H_{k+1}^{\mathrm{T}} H_{k+1}, \quad \beta^{(k+1)} = \beta^{(k)} + K_{k+1}^{-1} H_{k+1}^{\mathrm{T}}\left(T_{k+1} - H_{k+1}\beta^{(k)}\right)$  (7)

Thus, instead of recomputing $\beta$ for each epoch, we can reuse the previous estimate and update it with each new chunk of encoded data. Suppose one chunk of the data is sent for initial training and the remaining data is treated as the sequential training set; with TOTAL_EPOCHS training epochs and a sequential batch size of BATCH_SIZE, the network is trained by applying Eq. (2) to the initial chunk and Eq. (7) to every subsequent batch. The performance of the proposed algorithm depends on both the number of hidden neurons $m$ (the encoding dimension) and the pre-defined constant $C$; a proper value of $C$ is selected by the trial-and-error method [Huang, Zhou, Ding et al. (2012)]. This Online Sequential-Subnetwork is then applied to the encoded-dimension data to classify actions with $L$ subnetwork nodes.
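The update rule above can be implemented in a few lines. The following NumPy sketch follows Eqs. (2) and (7) as reconstructed here, with a fixed random hidden mapping standing in for the calculated subnetwork-node input weights, which are not reproduced; all dimensions and hyperparameters are illustrative.

```python
# A minimal NumPy sketch of the online sequential least-squares update of
# Eqs. (2)-(7): an initial solve on the first chunk, then recursive updates.
import numpy as np

rng = np.random.default_rng(0)
d, m, n_classes, C = 512, 256, 10, 2.0   # feature dim, hidden neurons, classes, regularizer

A = rng.normal(size=(d, m))              # input weights (random stand-in)
hidden = lambda X: np.tanh(X @ A)        # hidden-layer output H

def init_train(X0, T0):
    """Initial chunk, Eq. (2): K0 = H0^T H0 + I/C,  beta = K0^{-1} H0^T T0."""
    H0 = hidden(X0)
    K = H0.T @ H0 + np.eye(m) / C
    beta = np.linalg.solve(K, H0.T @ T0)
    return K, beta

def seq_update(K, beta, Xk, Tk):
    """Sequential chunk, Eq. (7): fold (Xk, Tk) into K, correct beta by the residual."""
    Hk = hidden(Xk)
    K = K + Hk.T @ Hk
    beta = beta + np.linalg.solve(K, Hk.T @ (Tk - Hk @ beta))
    return K, beta

# Toy run: one initial chunk of 200 samples, then sequential batches of 100.
X, labels = rng.normal(size=(1000, d)), rng.integers(0, n_classes, 1000)
T = np.eye(n_classes)[labels]            # one-hot targets
K, beta = init_train(X[:200], T[:200])
for i in range(200, 1000, 100):
    K, beta = seq_update(K, beta, X[i:i+100], T[i:i+100])

pred = hidden(X).dot(beta).argmax(axis=1)
print("train accuracy:", (pred == labels).mean())
```

Because each update only requires the new batch and the running matrix $K$, training cost grows linearly with the data rather than requiring a full re-solve per epoch, which is what makes the classifier attractive for large video datasets.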

Datasets
The HMDB-51 [Kuehne, Jhuang and Garrote (2011)] and UCF-101 [Soomro, Zamir and Shah (2012)] datasets are currently the most widely used benchmarks in the field of action recognition. UCF-101 contains a total of 13,320 videos covering 101 action categories. The videos feature diverse actions, very different camera movements, and often cluttered backgrounds, making it one of the most challenging datasets available. The videos in each of the 101 action categories are divided into 25 groups, where each group consists of 4-7 videos of an action. Videos from the same group may share common features, such as a similar background or viewpoint. The HMDB-51 dataset contains 6,849 clips divided into 51 action categories, each containing a minimum of 101 clips. Fig. 4 shows sample video frames from the datasets used to evaluate our method's performance.

Figure 4: The samples of video frames from UCF-101 and HMDB-51 datasets
In the training and testing process, it is very important to keep videos belonging to the same group separate. Since the videos within a group all come from a single long video, placing videos from the same group in both the training set and the testing set can artificially inflate performance, so each of these datasets provides official train/test split files. We evaluated our network based on these split files and report the average performance, as illustrated by the check below.
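As an illustration of this group separation, the sketch below parses the official UCF-101 split files (where file names such as v_ApplyEyeMakeup_g08_c01.avi encode the group as gXX) and checks that no group appears in both the train and test lists of a split; the split-file directory name is an assumption.

```python
# A minimal sketch verifying group-disjoint train/test splits for UCF-101.
from pathlib import Path
import re

def groups(path: Path) -> set[tuple[str, str]]:
    """Return the set of (class, group) pairs listed in one split file."""
    out = set()
    for line in path.read_text().splitlines():
        if not line.strip():
            continue
        name = line.split()[0]                 # e.g. ApplyEyeMakeup/v_..._g08_c01.avi
        cls = name.split("/")[0]
        group = re.search(r"_g(\d+)_", name).group(1)
        out.add((cls, group))
    return out

split_dir = Path("ucfTrainTestlist")           # assumed location of the split files
for k in (1, 2, 3):
    train = groups(split_dir / f"trainlist0{k}.txt")
    test = groups(split_dir / f"testlist0{k}.txt")
    assert not (train & test), f"split {k} leaks groups between train and test"
    print(f"split {k}: {len(train)} train groups, {len(test)} test groups, no overlap")
```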

Environment setting
To test our proposed method, we compared it against 3D ResNets of different depths and against other classifiers, as well as against other state-of-the-art methods. To find the best-performing extractor, we compared pre-trained 3D ResNets of different depths with the same networks learned from scratch. We compared the classifier with SVM, ELM, K-nearest neighbors (KNN) and random forest (RF) [Breiman (2001)]. For SVM and ELM, the regularization parameter was selected from $\{2^{-10}, 2^{-9}, \ldots, 2^{10}\}$ in each experiment. For KNN we set K to 3 and used the auto algorithm; for RF we used 100 estimators. For our proposed classifier, the regularization parameter $C$ was selected from $C \in \{2^{-4}, \ldots, 2^{8}\}$. All experiments on the two datasets were conducted on a machine with an NVIDIA GTX-1080Ti GPU.
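For reference, the following sketch reproduces these baseline settings with scikit-learn equivalents. The SVM kernel is an assumption (the text does not specify one), ELM has no scikit-learn implementation and is omitted, and X_train/y_train/X_test/y_test stand for the extracted deep features and labels.

```python
# A minimal sketch of the baseline classifier settings described above.
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

svm = GridSearchCV(                       # regularization searched over 2^-10 .. 2^10
    SVC(kernel="linear"),                 # kernel choice is an assumption
    param_grid={"C": [2.0 ** p for p in range(-10, 11)]},
    cv=3,
)
knn = KNeighborsClassifier(n_neighbors=3, algorithm="auto")   # K = 3, auto algorithm
rf = RandomForestClassifier(n_estimators=100, random_state=0) # 100 estimators

# Usage (X_train, y_train, X_test, y_test are deep features + labels, not shown):
# for clf in (svm, knn, rf):
#     clf.fit(X_train, y_train)
#     print(type(clf).__name__, clf.score(X_test, y_test))
```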

Extractor evaluation
According to a previous study [Hara, Kataoka and Satoh (2018)], 3D ResNets trained from scratch on UCF-101 and HMDB-51 do not achieve high accuracy, whereas Kinetics pre-trained models work well. In this section, aiming to find the optimal feature extractor, we reproduced those experiments. We trained 3D ResNets of different depths from scratch on the UCF-101 and HMDB-51 datasets, and we also fine-tuned Kinetics pre-trained 3D ResNet models. To make the comparison fair, we used train/test split file 1, a batch size of 32, and 50 training epochs. The performances are shown in Tab. 1. As Tab. 1 shows, the Kinetics pre-trained models perform significantly better than models learned from scratch, and 3D ResNet-101 performs best both when learning from scratch and with transfer learning. This indicates that the 3D ResNet-101 pre-trained model can learn discriminative features more accurately and in less time than the other models. We therefore chose it as our extractor for the later experiments.
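A minimal sketch of this transfer-learning setup is shown below, using torchvision's Kinetics-pretrained r3d_18 as a lightweight stand-in for the deeper 3D ResNets compared here; the SGD optimizer and learning rate are assumptions, while the batch size of 32 and the 50 epochs follow the text.

```python
# A minimal fine-tuning sketch: replace the Kinetics classification head
# with a UCF-101 head and train on clips from the target dataset.
import torch
from torch import nn
from torchvision.models.video import r3d_18

model = r3d_18(weights="KINETICS400_V1")
model.fc = nn.Linear(model.fc.in_features, 101)   # 101 classes for UCF-101

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def train_epoch(loader):
    model.train()
    for clips, labels in loader:                  # clips: (32, 3, 16, 112, 112)
        optimizer.zero_grad()
        loss = criterion(model(clips), labels)
        loss.backward()
        optimizer.step()

# for epoch in range(50):                         # 50 epochs, as in the text
#     train_epoch(train_loader)                   # train_loader: an assumed DataLoader
```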

Classifier evaluation
After the experiment above, we chose the 3D ResNet-101 pre-trained model as our extractor. In this section we evaluate our proposed classifier's performance against SVM, ELM, KNN, and RF. For accuracy and fairness, we first trained the extractor and extracted deep features; the classifiers then recognized the actions from these features. Tab. 2 compares our single-layer network with these learning algorithms; the accuracy of our method is clearly higher than that of the other classifiers. We also evaluated time consumption, with the comparison shown in Tab. 3, which indicates that our classifier trains significantly faster than SVM and at a speed similar to ELM. Moreover, our classifier supports iterative training, and the batch size can be set according to the situation, which is especially important in video processing and for especially large video datasets.

Framework evaluation
The above experiments show that the Kinetics dataset can be used effectively to pre-train our network. Comparisons with other state-of-the-art architectures are shown in Tab. 4. As can be seen from Tab. 4, our method achieved higher accuracies than 3D ResNet-101, 3D ResNeXt-101 [Hara, Kataoka and Satoh (2018)], C3D [Tran, Bourdev, Fergus et al. (2015)], and P3D [Qiu, Yao and Mei (2017)]. We can also observe that two-stream I3D [Carreira and Zisserman (2017)], which is pre-trained on the Kinetics dataset, achieves the best accuracy. We believe that combining the two-stream architecture with our framework could further improve on the accuracy of two-stream I3D.

Conclusion
In this paper, we tested various CNN architectures with spatio-temporal three-dimensional convolution kernels on current video datasets. From the experimental results, the following conclusions can be drawn: (1) The BoVW structure is effective for video processing.
(2) Pre-training 3D CNNs on the Kinetics dataset is effective, as it provides sufficient data to optimize the 3D CNN.
(3) Instead of using randomized input weights, we can build a classifier whose weights are obtained by calculation and that approaches the steepest-descent solution iteratively, without having to configure a learning rate.

Funding Statement:
The author(s) received no specific funding for this study.