Hidden Two-Stream Collaborative Learning Network for Action Recognition

Two-stream convolutional neural networks perform very well in video action recognition. The core idea is to train one model on frames sampled from the video and another on optical flow images pre-extracted from those frames, and then to fuse the outputs of the two models. However, the reliance on pre-extracted optical flow limits the efficiency of action recognition, and the temporal and spatial streams are fused only at the end, so that when one stream fails the final prediction can fail even though the other stream succeeds. We propose a novel hidden two-stream collaborative (HTSC) learning network that hides the optical flow extraction step inside the network and greatly speeds up action recognition. Building on the two-stream method, the collaborative learning model captures the interaction between temporal and spatial features to improve recognition accuracy. Our method achieves a good balance between efficiency and accuracy on large-scale video action recognition datasets.


Introduction
Understanding video content, for example recognizing actions, is an important part of computer vision. Compared with images, which carry only static information, videos carry additional temporal information. Karpathy et al. [Karpathy, Toderici, Shetty et al. (2014)] fed stacked video frames into a network for action recognition, but this approach performed worse than hand-crafted features. Simonyan et al. proposed two-stream convolutional networks, a major step forward for deep learning in action recognition. The two-stream network has two parts: one processes RGB images and the other processes optical flow images. The model is then trained on the extracted features, and the trained model classifies the actions. Although the two-stream method performs well, it relies on extracting optical flow from the video in advance and then learning optical flow features for action recognition, which reduces the efficiency of the whole network. To solve this problem, a variety of methods have been proposed to capture motion information without optical flow, such as recurrent neural networks (RNN) [Du, Wang and Wang (2015)] and 3D CNNs [Qiu, Yao and Mei (2017); Tran, Wang, Torresani et al. (2018)]. New representations of motion information have also been proposed, such as motion vectors [Zhang, Wang, Wang et al. (2016); Wang, Long, Wang et al. (2017)], RGB image differences and warped optical flow fields [Wang, Xiong, Wang et al. (2016)]. However, traditional optical flow remains more effective for human action recognition than these representations. Recently, Zhu et al. [Zhu, Lan, Newsam et al. (2018)] used a CNN to learn optical flow, implicitly generating motion information for action recognition; this avoids expensive computation and massive storage and speeds up the whole task. This method addresses the problem more directly. However, because it only performs weighted fusion, it cannot capture the interaction between spatial and temporal features, so one stream may succeed while the other fails, which degrades overall recognition performance.
Therefore, the spatial-temporal integration strategy is particularly important, and many existing methods [Karpathy, Toderici, Shetty et al. (2014); Ji, Xu, Yang et al. (2012); Tran, Bourdev, Fergus et al. (2015)] build integration strategies by exploiting the advantages of convolution. Compared with earlier methods based on Fisher Vectors [Perronnin, Sánchez and Mensink (2010)], HOF [Laptev, Marszałek, Schmid et al. (2008)], and dense trajectory features [Wang and Schmid (2013); Wang, Qiao and Tang (2015)], these methods, which apply CNNs directly to video action recognition, show no decisive advantage. Although CNNs have achieved outstanding performance in image analysis tasks, they cannot make full use of spatial-temporal features in video understanding tasks. To make fuller use of these features, some recent methods [Wang, Xiong, Wang et al. (2016)] use a standard CNN stream to capture appearance information and feed optical flow images into a second CNN to extract features that contain motion information. However, we have found that in these models one stream often fails while the other is still correct, and the video is misclassified. Therefore, the weighted average fusion strategy of the original two-stream method cannot fully exploit the appearance and motion information of the video.
On the contrary, we believe that the appearance and motion information in a video should reinforce each other. Recently, two-stream collaborative learning with spatial-temporal attention for video classification (TCLSTA) [Peng, Zhao and Zhang (2018)] applied spatial-temporal attention to static and motion features in order to distinguish the contributions of different regions within a static frame, and of different frames within the sequence, to the final recognition result. TCLSTA then uses the discriminative static and motion features produced by the attention model to mutually enhance representation learning, and optimizes the fusion weights of frames and optical flow for video classification. However, TCLSTA relies on extracting optical flow from the video in advance and then learning optical flow features for action recognition, which greatly reduces the efficiency of the entire system. To address these two limitations and balance speed and accuracy, we propose a hidden two-stream collaborative learning method for action recognition that does not store pre-computed optical flow. It improves the efficiency of the whole network, captures the interaction between spatial and temporal features, and improves recognition accuracy. Overall, the paper makes two contributions:
1. We propose a novel framework, HTSC, for action recognition (Section 3) that does not precompute optical flow, avoiding expensive computation and massive storage and improving the efficiency of the entire network.
2. Our method extracts motion features directly from the frame sequence and lets the spatial and temporal features guide each other, improving recognition accuracy.

Related work
Understanding video content, such as recognizing actions, is an important part of computer vision, and video-based human action recognition has made great progress in recent years. Before deep learning was applied to this field, hand-crafted features such as improved dense trajectories (IDT) [Wang and Schmid (2013)] gave the best accuracy, stability and reliability, but the algorithm is slow. Convolutional neural networks (CNN) [Karpathy, Toderici, Shetty et al. (2014); Zhu, Lan, Newsam et al. (2018)] are usually several orders of magnitude faster than IDT. Deep CNNs have gained popularity in recent years [Tang, Yang, Zhou et al. (2015); Wang, Gao, Yin et al. (2018); Laptev, Marszałek, Schmid et al. (2008); Wang and Schmid (2013)]. Many works have designed deeper CNNs in order to apply them more effectively to action recognition tasks [Zhang, Wang, Wang et al. (2016); Ng, Hausknecht, Vijayanarasimhan et al. (2015); Qiu, Yao and Mei (2017); Wang, Qiao and Tang (2015); Peng, Zhao and Zhang (2018); Carreira and Zisserman (2017); Gui and Zeng (2019); Zhang, Jin, Sun et al. (2018)]. For example, several feature fusion strategies have been explored on the Sports-1M dataset [Tran, Bourdev, Fergus et al. (2015)]. At the same time, the two-stream method uses two CNNs for video action recognition, one for static images and the other for optical flow. The two streams are finally merged to capture the static appearance information and motion information [Ng, Hausknecht, Vijayanarasimhan et al. (2015)] in the video.
The method of Tran et al. [Tran, Bourdev, Fergus et al. (2015)] uses 3D convolution kernels to extract features from a series of dense RGB frames. The temporal segment network (TSN) [Wang, Xiong, Wang et al. (2016)] first decomposes the video into static frames and optical flow images, then samples them and uses two CNNs to extract features containing the static and motion information of the video. To better capture motion information, the method of Ng et al. [Ng, Hausknecht, Vijayanarasimhan et al. (2015)] first uses a CNN to extract features from static frames and then uses a long short-term memory (LSTM) model to explore the relationships between frames. Recently, the I3D network [Carreira and Zisserman (2017)] used a two-stream CNN with inflated 3D convolutions on dense RGB and optical flow sequences to achieve state-of-the-art performance on the Kinetics dataset [Qiu, Yao and Mei (2017)]. A major disadvantage of the two-stream method is that it cannot model long videos, because it only extracts temporal context from consecutive video frames. To solve this problem, TSN divides the video into K segments, randomly selects a snippet from each segment, applies the two-stream method to these snippets, and finally integrates the features extracted from them. However, the original two-stream method has two major disadvantages. First, the optical flow must be extracted from the video in advance before the optical flow features are learned for action recognition, which greatly reduces the efficiency of the entire system. Second, the spatial CNN and temporal CNN are independent of each other: only a simple weighted average fusion is performed to obtain the final prediction, so the method cannot learn subtle spatial-temporal relations. To explore better fusion strategies, Laptev et al. [Laptev, Marszałek, Schmid et al. (2008)] compared multiple CNN connection methods, but none of them makes good use of the static and motion information of the video. Recently, TCLSTA designed a static-motion collaborative learning model in which the spatial and temporal features enhance each other, and optimized the fusion weights of frames and optical flow. However, it relies on extracting optical flow from the video in advance and then learning optical flow features for action recognition, which greatly reduces the efficiency of the entire system. The hidden two-stream collaborative learning method proposed in this paper does not need to extract the optical flow in advance, which greatly improves network efficiency, and at the same time it better captures the spatial-temporal interaction, balancing speed and accuracy.
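The segment-based sampling attributed to TSN above (divide the video into K segments and draw one snippet at random from each) can be illustrated with a short sketch. The function name `sample_snippet_indices`, the `snippet_len` parameter and the handling of leftover frames are assumptions made for illustration, not details taken from the TSN implementation.

```python
import random


def sample_snippet_indices(num_frames, num_segments, snippet_len=1, rng=None):
    """Pick one snippet start index from each of `num_segments` equal segments.

    Hypothetical helper: frames left over after the equal split are ignored.
    """
    rng = rng or random.Random()
    if num_frames < num_segments * snippet_len:
        raise ValueError("video too short for the requested segments")
    seg_len = num_frames // num_segments
    starts = []
    for k in range(num_segments):
        seg_start = k * seg_len
        # Last start index that keeps the whole snippet inside segment k.
        last_start = seg_start + seg_len - snippet_len
        starts.append(rng.randint(seg_start, last_start))
    return starts


# Example: 3 segments over a 90-frame video, 5-frame snippets.
indices = sample_snippet_indices(90, 3, snippet_len=5, rng=random.Random(0))
```

Because each snippet is drawn from a different segment, the sampled snippets cover the whole video at a cost that depends on K rather than on the video length.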

Method
This section describes our proposed hidden two-stream collaborative learning method in detail. It includes two models: the hidden two-stream model and the collaborative learning model. Our idea is shown in Fig. 2. Section 3.1 introduces the hidden two-stream model. We first decompose the video into a sequence of frames and send them to the spatial stream CNN and the hidden temporal CNN. The hidden temporal CNN obtains the motion features and the spatial stream CNN obtains the spatial features. Sections 3.2 and 3.3 introduce the collaborative learning model, which uses a collaborative learning network to optimize the spatial and temporal features. The adaptive weighted learning model then learns the fusion weights of each video category adaptively and produces the final prediction.

Hidden two-stream model
Given the frame sequence of a video, we want to learn not only the static appearance features but also the motion information, which together serve as the basis for judging the action category. The spatial stream of the two-stream network can effectively recognize actions from static images, so our spatial stream network adopts the same setting as the two-stream network and is used to capture the static appearance information of the images. FlowNet [Dosovitskiy, Fischer, Ilg et al. (2015)] shows that optical flow can be estimated by a CNN. We therefore use a CNN to learn the optical flow information of the frame sequence that contributes to the human action recognition task.

Figure 2: The proposed HTSC method consists of two parts: (1) the hidden two-stream model learns spatial (static) features and temporal (motion) features from the frame sequence; (2) the collaborative learning model uses the complementarity of motion information and static information to optimize the motion features and static features, which improves the accuracy of action recognition.

Spatial stream
Static appearance features (colors, lighting, textures, contours, etc.) are a useful clue because some actions are closely related to specific objects and scenes. The input of our spatial stream ConvNet is a static video frame, from which actions can be recognized effectively. In fact, action classification from static frames (the spatial stream) is quite competitive on its own. Because of the outstanding performance of CNNs in image recognition tasks, we pre-train our spatial stream network using a recent large-scale image recognition method [Perronnin, Sánchez and Mensink (2010)].

Temporal stream
Although some actions can be recognized from a single frame, others depend on motion information. Therefore, the temporal stream of the original two-stream network takes optical flow images as input. The original two-stream network needs to compute the optical flow images from the video in advance using methods such as TV-L1. The information contained in the optical flow images is useful for the action recognition task. However, extracting the optical flow in advance is slow, and storing the optical flow images requires additional storage space. We treat optical flow prediction as an image reconstruction problem [Jason, Harley and Derpanis (2016); Zeng, Dai, Li et al. (2018)]. We use the hidden temporal stream to learn the optical flow information of the frame sequence that contributes to our task and to generate effective optical flow between adjacent frames.
Taking adjacent frames $f_1$ and $f_2$ as inputs, if the predicted optical flow and $f_2$ can reconstruct $f_1$, the network has learned the motion information. Our temporal stream is divided into two parts: optical flow estimation and feature extraction. The network details are given in Section 4.2. We compute losses at multiple scales of the optical flow network. Three loss functions [Zhu, Lan, Newsam et al. (2018)] are adopted to generate optical flow of higher quality, which can be written as follows.
Standard pixel reconstruction error function:
$$L_P = \frac{1}{mn} \sum_{g=1}^{m} \sum_{h=1}^{n} F\big(f_1(g, h) - f_2(g + V^x_{g,h},\, h + V^y_{g,h})\big) \quad (1)$$
where $L_P$ denotes the standard pixel reconstruction error function, $V^x_{g,h}$ and $V^y_{g,h}$ are the estimated optical flow in the horizontal and vertical directions at pixel $(g, h)$, and $m$ and $n$ are the height and width of frames $f_1$ and $f_2$. To reduce the effect of outliers, we adopt the penalty [Lai, Huang, Ahuja et al. (2017)] (a variant of the L1 loss, first used as a loss function in LapSRN)
$$F(x) = (x^2 + \varepsilon^2)^{\alpha} \quad (2)$$
Smoothness loss function:
$$L_{sm} = F(\nabla_x V^x) + F(\nabla_y V^x) + F(\nabla_x V^y) + F(\nabla_y V^y) \quad (3)$$
where $L_{sm}$ is the smoothness loss function, which addresses the aperture problem that leads to blurring when estimating motion in non-textured regions. $\nabla_x V^x$ and $\nabla_y V^x$ are the gradients of the predicted optical flow field $V^x$ in each direction. Analogously, $\nabla_x V^y$ and $\nabla_y V^y$ are the gradients of the optical flow field $V^y$. $F(x)$ is the same as in Eq. (1).
Structural similarity (SSIM) loss function [Wang, Bovik, Sheikh et al. (2004)], which helps us learn the structure of frames:
$$\mathrm{SSIM}(f_{B_1}, f_{B_2}) = \frac{(2\mu_1\mu_2 + c_1)(2\sigma_{12} + c_2)}{(\mu_1^2 + \mu_2^2 + c_1)(\sigma_1^2 + \sigma_2^2 + c_2)} \quad (4)$$
where $f_{B_1}$ and $f_{B_2}$ are local blocks of frames $f_1$ and $f_2$, respectively, and we set the block size to 8 × 8. $\mu_1$ and $\mu_2$ are the mean values of the image blocks $f_{B_1}$ and $f_{B_2}$, $\sigma_1^2$ and $\sigma_2^2$ are the variances of the two image blocks, $\sigma_{12}$ is their covariance, and $c_1$ and $c_2$ are two constants used to stabilize the division. In the experiments, we set them to 0.0001 and 0.001, respectively.
To compare the similarity between frame $f_1$ and its reconstruction $f_1'$, we design the loss function
$$L_{ssim} = \frac{1}{I} \sum_{i=1}^{I} \big(1 - \mathrm{SSIM}(f_{B_i}, f'_{B_i})\big) \quad (5)$$
where $I$ is the number of local blocks we can extract from the image, $i$ is the index of the local block, and $f_{B_i}$ and $f'_{B_i}$ are the $i$-th local blocks of $f_1$ and $f_1'$.
The total loss is
$$L = \sum_{s} \lambda_s L_s \quad (6)$$
where $\lambda_s$ is the parameter that regulates the loss at each scale $s$ [Zhang, Yin, Yang et al. (2017)]. The loss $L_s$ at each scale is a weighted sum of the previous three loss functions. The feature extraction part is similar to the CNN structure of the spatial stream. Before sending the estimated optical flow to the CNN that extracts features, we normalize it to the range 0 to 255. This normalization is important for good temporal stream performance. Finally, the temporal stream extracts features containing optical flow information.
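The three loss terms described above can be sketched in NumPy for a single scale and a single-channel frame pair. This is a minimal illustration, not the training code: the `eps` and `alpha` defaults are assumed values (the text does not state them), a nearest-neighbour lookup stands in for the differentiable warping a real network would need, the convention that `flow_x` shifts columns and `flow_y` shifts rows is an assumption, and SSIM is evaluated on non-overlapping 8 × 8 blocks with the constants 0.0001 and 0.001 given in the text.

```python
import numpy as np


def charbonnier(x, eps=1e-3, alpha=0.5):
    """Penalty F(x) = (x^2 + eps^2)^alpha; eps and alpha are assumed defaults."""
    return (x ** 2 + eps ** 2) ** alpha


def warp_nearest(frame2, flow_x, flow_y):
    """Reconstruct frame 1 by sampling frame 2 at the flow-displaced pixels.

    Assumed convention: flow_x shifts columns, flow_y shifts rows.
    """
    rows, cols = frame2.shape
    g, h = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    src_r = np.clip(np.rint(g + flow_y).astype(int), 0, rows - 1)
    src_c = np.clip(np.rint(h + flow_x).astype(int), 0, cols - 1)
    return frame2[src_r, src_c]


def pixel_loss(frame1, frame2, flow_x, flow_y):
    """Mean penalized difference between frame 1 and its reconstruction."""
    recon = warp_nearest(frame2, flow_x, flow_y)
    return float(charbonnier(frame1 - recon).mean())


def smoothness_loss(flow_x, flow_y):
    """Penalize flow gradients in both directions, averaged over pixels."""
    total = 0.0
    for flow in (flow_x, flow_y):
        total += charbonnier(np.diff(flow, axis=0)).mean()
        total += charbonnier(np.diff(flow, axis=1)).mean()
    return float(total)


def ssim_block(a, b, c1=1e-4, c2=1e-3):
    """SSIM between two equally sized image blocks."""
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    num = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    den = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return num / den


def ssim_loss(frame1, recon, block=8):
    """Average of (1 - SSIM) over non-overlapping blocks (frame >= block size)."""
    rows, cols = frame1.shape
    vals = []
    for r in range(0, rows - block + 1, block):
        for c in range(0, cols - block + 1, block):
            a = frame1[r:r + block, c:c + block]
            b = recon[r:r + block, c:c + block]
            vals.append(1.0 - ssim_block(a, b))
    return float(np.mean(vals))
```

A per-scale loss would then be a weighted sum of `pixel_loss`, `smoothness_loss` and `ssim_loss`; the weights themselves are not given in the text and would have to be chosen separately.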

Collaborative learning model
For the two-stream method and its derivatives, we carefully observe the recognition process and find that the spatial stream and the temporal stream are trained and tested independently, and only the scores of the two streams are fused at the end. The disadvantage is that when one stream fails and the other succeeds, the overall network can still fail. On the contrary, we want the temporal stream (motion information) and the spatial stream (static information) not only to be merged at the end but also to promote each other during the process. To capture the interaction between spatial (static) and temporal (motion) information, our collaborative learning model uses a symmetrical structure in which the static and motion features guide and optimize each other.

Algorithm formula
At time t, frame features are used to optimize the optical flow features:
$$H = \tanh\big(W_x X^t + (W_g g_{t-1})\mathbf{1}^T\big) \quad (7)$$
where $W_x$ and $W_g$ are weight parameters, $\mathbf{1}$ is a vector whose entries are all 1, $X^t = [x^t_1, x^t_2, \dots, x^t_n]$ represents the temporal features (optical flow features), and $g_{t-1}$ is the video feature aggregated from the video frame features at time t-1.
We calculate the optical flow optimization coefficient by Eq. (8), and we combine the optical flow features output by the previous model as the video feature. $w_h$ is also a weight parameter. Next, we use the optical flow features to optimize the frame features. The frame features are expressed as $X^s = [x^s_1, x^s_2, \dots, x^s_n]$. In general, the inputs of our module are the frame features and optical flow features extracted by the previous model, and the outputs are the optimized frame features and optimized optical flow features.

Algorithm steps
Step 1. Define the optimization coefficient of the spatial (frame) features.
Step 3. Use the spatial (frame) features to calculate the video feature by Eq. (9).
Step 4. Use this video feature to calculate the optimization coefficient of the temporal (optical flow) features by Eqs. (7) and (8), and obtain the optimized temporal (optical flow) features.
Step 5. Use the temporal (optical flow) features to calculate the video feature by Eq. (9).
Step 6. Use this video feature to optimize the spatial (frame) features and obtain the optimization coefficient of the spatial (frame) features.
Step 7. Iterate until the loss function converges.
Step 8. Stop and return the optimized frame features and the optimized optical flow features.
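The alternating procedure in the steps above can be sketched as follows. Several parts are assumptions rather than the paper's definitions: the coefficient is taken to be a softmax over the columns of H, the video feature is taken to be the coefficient-weighted sum of the features, the frame coefficients start out uniform, and a fixed number of rounds replaces the convergence test on the training loss. The names `guided_coefficients`, `aggregate`, `collaborate` and the `params` layout are hypothetical, and the weights would be learned in practice rather than supplied by hand.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - z.max())
    return e / e.sum()


def guided_coefficients(features, guide, W_f, W_g, w_h):
    """Coefficients for `features` (d x n) guided by the vector `guide` (d,)."""
    # H = tanh(W_f X + (W_g g) 1^T): the guide term is broadcast over columns.
    H = np.tanh(W_f @ features + (W_g @ guide)[:, None])
    return softmax(w_h @ H)  # one coefficient per feature column


def aggregate(features, coeffs):
    """Video feature as the coefficient-weighted sum of the feature columns."""
    return features @ coeffs


def collaborate(frames, flows, params, rounds=3):
    """Alternate between the two streams for a fixed number of rounds."""
    if rounds < 1:
        raise ValueError("rounds must be at least 1")
    n = frames.shape[1]
    a_frame = np.full(n, 1.0 / n)  # assumed uniform starting coefficients
    for _ in range(rounds):
        g_frame = aggregate(frames, a_frame)        # video feature from frames
        a_flow = guided_coefficients(flows, g_frame, *params["flow"])
        g_flow = aggregate(flows, a_flow)           # video feature from flow
        a_frame = guided_coefficients(frames, g_flow, *params["frame"])
    return a_frame, a_flow
```

The structure is symmetrical: the same `guided_coefficients` function serves both directions, with a separate parameter set for each stream.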

Adaptive weighted learning
Since we have obtained the predicted scores (static and motion) of each stream, we could simply fuse the scores of the two streams to predict the action category. However, spatial (static) and temporal (motion) information contribute differently to different action categories. Some classes, such as "blow dry hair" and "pommel horse", involve no obvious movement, so they should be recognized primarily from static frames. Other classes, such as "diving" and "sky diving", involve obvious motion, and motion information is important for classifying them. Therefore, the fusion weights of the spatial and temporal streams are learned adaptively for each class.
We express the network prediction score as $S_i = [s_{i,1}, s_{i,2}] \in \mathbb{R}^{2 \times M}$, where the index $i$ refers to the category in the dataset and $M$ is the number of action categories in the dataset. $s_{i,1}$ and $s_{i,2}$ are the scores of the first stream and the second stream. For class $i$, we denote the weight of the first stream as $w_i^1$ and the weight of the second stream as $w_i^2$, so that $W_i = [w_i^1, w_i^2]$ is the two-stream fusion weight. The fusion weights are learned separately for each category, subject to a constraint on their sum. Eq. (10) is our objective function, and $c$ is set to $5 \times 10^{-3}$. In the definition of $R$, $n_i$ denotes the number of samples of category $i$ in the dataset, and $A = [0, \dots, 0, 1, 0, \dots, 0] \in \mathbb{R}^{M \times 1}$ is a vector in which only the $i$-th element is 1 and all other elements are 0. Maximizing $R$ means maximizing the product of the fused score vector and $A$, that is, the fused score of the correct category, which also means minimizing the fused scores of the other categories $k$ ($k \neq i$). $R$ and $N$ account for the positive and negative samples, respectively, and two parameters balance the weights of the positive and negative samples. We can then transform our objective function, and the fusion weights are finally calculated by linear programming [Li, Liu, Wang et al. (2019); Reddy and Shah (2013)]. During testing, the softmax layer outputs of the two streams are fused as in Eq. (14), and the final classification result is the category with the highest fusion score (the arg max in Eq. (15)).
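The test-time fusion can be illustrated with a small sketch. It assumes that the learned weights are stored as a 2 × M array whose two entries sum to 1 for every class, and that the fused score of a class is the weighted sum of the two stream scores for that class; the function name `fuse_and_classify` and the example numbers are hypothetical and are not results from the paper.

```python
import numpy as np


def fuse_and_classify(spatial_scores, temporal_scores, weights):
    """Fuse per-class scores with per-class stream weights, then take arg max.

    weights has shape (2, M): row 0 weights the spatial stream and
    row 1 weights the temporal stream for each of the M classes.
    """
    scores = np.stack([spatial_scores, temporal_scores])  # shape (2, M)
    fused = (weights * scores).sum(axis=0)                 # shape (M,)
    return int(np.argmax(fused)), fused


# Toy softmax outputs for three classes (made-up numbers).
spatial = np.array([0.7, 0.3, 0.0])
temporal = np.array([0.2, 0.7, 0.1])

# Equal weights reproduce plain late fusion.
equal = np.full((2, 3), 0.5)
# Class-specific weights: class 0 trusts appearance, class 1 trusts motion.
adaptive = np.array([[0.9, 0.2, 0.5],
                     [0.1, 0.8, 0.5]])

late_label, _ = fuse_and_classify(spatial, temporal, equal)
adaptive_label, fused = fuse_and_classify(spatial, temporal, adaptive)
```

In this toy case equal weighting selects class 1, while the class-specific weights select class 0, which shows how per-class weights can change the decision when one stream is more reliable for a given category.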

Datasets
We select three widely used action recognition datasets, UCF101 [Soomro, Zamir and Shah (2012)], HMDB51 [Kuehne, Jhuang, Garrote et al. (2011)], and THUMOS14 [Idrees, Zamir, Jiang et al. (2017)], to validate our HTSC method. UCF101 is collected from real-world videos clipped from YouTube and consists of 13,320 videos in 101 human action categories. HMDB51 contains 6,849 samples in 51 categories extracted from various sources (online videos and films). THUMOS14 is a large video dataset for action recognition and detection that includes long untrimmed videos and has 101 action classes. We use 13,320 videos for training and 1,010 videos for validation, and we test the performance of our network on 1,574 videos.
As shown in Fig. 2, we use the spatial stream of the two-stream network to extract spatial features. The convolution kernel in each convolution layer is represented as (W×H). The number K denotes the lineage of "Blocks" in Tab. 1. For the spatial stream, the input size is 224×224×3 and the output feature map size is 1×1×4096. During training, we adopt the VGG16 model pre-trained on ImageNet, and we set the number of outputs of the fully connected classification layer to the number of classes in the corresponding dataset. For the hidden temporal stream, the detailed network structure is shown in Tab. 2. The input is a frame sequence, and the optical flow is estimated by the CNN. After normalization, the optical flow is fed directly to the feature extraction network. Because the optical flow images are not stored, our method is much faster than the two-stage methods, in which writing and reading the optical flow images takes almost three times longer than all the other steps.
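The normalization of the estimated optical flow to the range 0 to 255 before feature extraction can be sketched as below. The clipping bound of 20 pixels and the linear mapping are assumptions borrowed from common practice for storing optical flow as images; the text does not specify how the normalization is done, and `flow_to_uint8` is a hypothetical helper name.

```python
import numpy as np


def flow_to_uint8(flow, bound=20.0):
    """Clip flow to [-bound, bound] and map it linearly onto [0, 255].

    `bound` is an assumed value, not one stated in the paper.
    """
    clipped = np.clip(flow, -bound, bound)
    scaled = (clipped + bound) * (255.0 / (2.0 * bound))
    return np.rint(scaled).astype(np.uint8)


# Displacements beyond the bound saturate at 0 or 255.
example = flow_to_uint8(np.array([-30.0, -20.0, 0.0, 20.0, 30.0]))
```

With this mapping, zero motion lands near the middle of the range, so the feature extraction network sees inputs with the same value range as ordinary 8-bit images.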

Collaborative learning model
The detailed network structure of our collaborative learning module is shown in Fig. 3. The module consists of two parts. The first part is the collaborative learning layer, which takes the output features of the hidden two-stream model as its input and optimizes the static and motion features. As in the hidden two-stream model, we use N hidden units in the two softmax layers, where N is the number of categories in the corresponding dataset. The second part is the adaptive weighted learning model, whose details are given in Section 3. We use cross-validation for model selection and set the parameter c of Eq. (13) to 5 × 10^-3. To optimize our collaborative learning network, we adopt the cross-entropy loss as our loss function. Finally, we predict the action category of the video by Eq. (15).

Comparison with state-of-the-art methods
Our HTSC method is tested on two trimmed video datasets and one untrimmed video dataset. Our experimental results are compared with the latest competitive methods in Tab. 3. For the HMDB51 dataset, early works [Diba, Pazandeh and Van Gool (2016); Kantorov and Laptev (2014); Gui and Zeng (2019)] used hand-crafted features as video representations; their performance is limited and far worse than that of our proposed method. Some methods [Tran, Bourdev, Fergus et al. (2015); Diba, Pazandeh and Van Gool (2016)] use 3D convolution features as the video representation, which is fast. However, they have a high computational cost and obtain lower accuracy than the two-stream methods and their derivatives. Other methods, for example that of Karpathy et al. [Karpathy, Toderici, Shetty et al. (2014)], adopted two kinds of CNN to model static and motion information and obtained higher accuracy than conventional action recognition methods [Cai, Wang, Peng et al. (2014); Kantorov and Laptev (2014); Gui and Zeng (2019)], but the improvements are limited because of the simple fusion strategy. Therefore, some researchers [Wang, Xiong, Wang et al. (2016); Feichtenhofer, Pinz and Zisserman (2016)] employ more complicated feature fusion strategies to combine static and motion information and obtain higher accuracy than the method of [Karpathy, Toderici, Shetty et al. (2014)]. But all these methods [Wang, Xiong, Wang et al. (2016); Feichtenhofer, Pinz and Zisserman (2016)] need to extract optical flow in advance, which reduces the efficiency of the whole network. Moreover, the spatial and temporal features extracted from a video are not independent of each other. On the contrary, they are highly complementary. Our method achieves good results among the most advanced methods, with an increase of 1.3% over the best of these compared methods. This result occurs because our method allows the two types of information in the video (static and motion) to learn from and optimize each other.
The accuracy of TCLSTA is slightly higher than that of our method because it not only exploits the complementarity of spatial and temporal features but also applies attention to them before collaborative learning. However, the spatial-temporal attention model in TCLSTA requires three-stage training and a deeper residual network, which increases the cost of training. Compared with TCLSTA, our hidden two-stream method trains the model end-to-end and extracts optical flow information implicitly, which saves the storage space required by pre-extracted optical flow images. Therefore, our efficiency is higher than that of TCLSTA (see the Efficiency assessment section). The comparison on the UCF101 dataset is also shown in Tab. 3. The result trends for the HMDB51 and THUMOS14 datasets are similar to those for UCF101.

Ablation experiments
To verify the effectiveness of each component of the proposed HTSC method, we design the following ablation experiments; Tab. 4 shows the results. Our method includes two streams: the spatial stream and the hidden temporal stream. First, we use the spatial stream and the hidden temporal stream separately to predict the categories of videos. Then, we fuse the two streams. We find that the accuracy of two-stream fusion is 6.2% higher than that of a single stream on the UCF101 dataset, indicating that the spatial stream and the hidden temporal stream are complementary. Next, we add the collaborative learning network (CLN) to the hidden two-stream network. We find that "hidden two-stream+CLN" achieves better classification accuracy than the hidden two-stream network without the collaborative learning model. This shows that the CLN promotes mutual learning of static and motion features and uses their correlation to further improve the accuracy of action recognition. Finally, we add the adaptive weighted learning (AWL) model to "hidden two-stream+CLN". Compared with the network without AWL, the accuracy is further improved, which demonstrates the effectiveness of adaptive weighted learning. The accuracy of late fusion is lower than that of the adaptive weighted learning model because late fusion cannot distinguish the importance of static and motion information for different categories, whereas adaptive weighted learning can distinguish their different significance for different semantic classes, which improves the accuracy of the entire network.

Efficiency assessment
To evaluate the efficiency of the HTSC method, we measure the test speed on the HMDB51 dataset. The test process of the hand-crafted feature methods [Cai, Wang, Peng et al. (2014); Kantorov and Laptev (2014); Gui and Zeng (2019)] includes local feature extraction, feature coding and classification. Local feature extraction accounts for much of the computational cost and results in low efficiency, so we do not compare our method with them. The results are compared with some existing deep learning methods in Tab. 5. Both the efficiency and the accuracy of our method are better than those of Simonyan et al. and Feichtenhofer et al. [Feichtenhofer, Pinz and Zisserman (2016)]. In addition, unlike the two-stream methods and their derivatives, we do not need to extract optical flow images in advance. Our method is faster than TCLSTA because our hidden two-stream model is trained end-to-end and extracts optical flow information implicitly, saving the storage space required for pre-extracted optical flow images. In general, although the efficiency is slightly lower than that of Zhu et al. [Zhu, Lan, Newsam et al. (2018)], our proposed method achieves clearly higher accuracy than the other methods, as shown in Tab. 5.

Tab. 5: Test speed on the HMDB51 dataset
Method                                   Speed
...                                      99.7
TCLSTA [Peng, Zhao and Zhang (2018)]     89.5
HTS [Zhu, Lan, Newsam et al. (2018)]     120.4
Ours                                     115.1

Conclusions
This paper proposes a hidden two-stream collaborative learning network for human action recognition, which consists of a hidden two-stream model and a collaborative model. Conventional action recognition methods need to extract the optical flow of the video in advance to capture motion information. In contrast, the hidden two-stream model uses a CNN to capture the relationships between video frames, which improves the efficiency of the whole network and saves storage space. The collaborative model takes the spatial (static frame) features and temporal (motion) features extracted by the hidden two-stream model and lets them enhance each other's representation to improve the accuracy of action recognition. Experiments on three widely used video classification datasets show the effectiveness of our method.