Video event classification based on two-stage neural network

Multimedia Tools and Applications

Abstract

A video stream is a sequence of static frames and can be described as a 3D signal carrying both spatial and temporal cues. Handling these two cues simultaneously has long been a key problem in video analysis. This work proposes a two-stage neural network for video event classification. Instead of directly connecting an RNN to CNNs, a two-stage strategy is employed: the first stage transfers pre-learned object knowledge to video content through selected anchors in a supervised manner. Under this strategy, the frame sequence is converted into anchor points by mean-max pooling and then classified by the transferred CNNs. The second stage incorporates temporal information by exploiting the RNN's 'deep in time' ability. The transferred CNNs combined with the RNN thus handle spatial and temporal information at the same time, and the whole network is trained end to end except that the transferred CNNs' parameters are kept fixed. In particular, one- and two-layer LSTM and GRU variants are adopted to mitigate the vanishing and exploding gradient problems. Experiments on three in-the-wild datasets show that the proposed two-stage network delivers performance comparable to other state-of-the-art approaches, demonstrating its effectiveness for video event classification.
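
To make the architecture concrete, the PyTorch sketch below wires the two stages together: a frozen ImageNet-pretrained CNN supplies transferred spatial features, mean-max pooling condenses the frame sequence into anchor points, and an LSTM or GRU classifies over time. This is a minimal illustration rather than the authors' implementation: the ResNet-18 backbone, the pooling window, the 0.5/0.5 mean-max weighting, and the name TwoStageVideoClassifier are all assumptions.

    import torch
    import torch.nn as nn
    from torchvision import models

    class TwoStageVideoClassifier(nn.Module):
        def __init__(self, num_classes, hidden_size=512, num_rnn_layers=1,
                     rnn_type="lstm"):
            super().__init__()
            # Stage 1: an ImageNet-pretrained CNN transfers object knowledge;
            # its parameters are kept frozen, as the abstract describes.
            backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
            self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop FC head
            for p in self.cnn.parameters():
                p.requires_grad = False
            feat_dim = backbone.fc.in_features  # 512 for ResNet-18

            # Stage 2: an RNN exploits its 'deep in time' ability; LSTM/GRU
            # gating mitigates vanishing and exploding gradients.
            rnn_cls = nn.LSTM if rnn_type == "lstm" else nn.GRU
            self.rnn = rnn_cls(feat_dim, hidden_size, num_layers=num_rnn_layers,
                               batch_first=True)
            self.head = nn.Linear(hidden_size, num_classes)

        @staticmethod
        def mean_max_pool(feats, window):
            # Condense each temporal window of frame features into one anchor
            # point by averaging mean- and max-pooled responses (one plausible
            # reading of mean-max pooling; the equal weighting is an assumption).
            b, t, d = feats.shape
            t = (t // window) * window  # drop frames that do not fill a window
            feats = feats[:, :t].reshape(b, t // window, window, d)
            return 0.5 * (feats.mean(dim=2) + feats.amax(dim=2))

        def forward(self, frames, window=4):
            # frames: (batch, time, 3, H, W)
            b, t = frames.shape[:2]
            with torch.no_grad():  # the transferred CNN stays fixed
                feats = self.cnn(frames.flatten(0, 1)).flatten(1)  # (b*t, feat_dim)
            feats = feats.view(b, t, -1)
            anchors = self.mean_max_pool(feats, window)  # (b, t // window, feat_dim)
            out, _ = self.rnn(anchors)
            return self.head(out[:, -1])  # classify from the final hidden state

    # Usage: logits for a batch of 2 clips, each 16 frames at 112x112.
    model = TwoStageVideoClassifier(num_classes=20)
    logits = model(torch.randn(2, 16, 3, 112, 112))  # shape (2, 20)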

Acknowledgements

This work was supported by the National Science Foundation of China (61976060, 61571147, 61401113) and by a Project of the Educational Commission of Guangdong Province of China (2018KCXTD019).

Author information

Corresponding author

Correspondence to Xuezhi Xiang.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zhang, L., Xiang, X. Video event classification based on two-stage neural network. Multimed Tools Appl 79, 21471–21486 (2020). https://doi.org/10.1007/s11042-019-08457-5
