Abstract
Although recurrent neural networks (RNNs) are widely used to process temporal or sequential data, they have received relatively little attention in current video action recognition. This work therefore models the long-term spatio-temporal information of a video with a variant of the RNN, namely the higher-order RNN. Specifically, we propose a novel long-term spatio-temporal network (LSN) for this task, whose core integrates newly constructed high-order ConvLSTM (HO-ConvLSTM) modules with conventional 2D convolutional blocks. Each HO-ConvLSTM module consists of an accumulated temporary state (ATS) module and a standard ConvLSTM module: the ATS module accumulates several previous hidden states into a single temporary state, which then enters the standard ConvLSTM and, together with the current input, determines the output. The HO-ConvLSTM module can be inserted into different stages of a 2D convolutional neural network (CNN) in a plug-and-play manner, thereby characterizing long-term temporal evolution at various spatial resolutions. Experimental results on three commonly used video benchmarks demonstrate that the proposed LSN achieves performance competitive with representative models.
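To make the described mechanism concrete, the following is a minimal PyTorch sketch of the idea in the abstract: an ATS step collapses the last few hidden states into one temporary state, which a standard ConvLSTM cell then combines with the current input. The class names, the 1x1-convolution form of accumulation, and the history length of three are illustrative assumptions, not the authors' exact formulation.

```python
# Minimal sketch of an HO-ConvLSTM cell as described in the abstract.
# The ATS accumulation (a 1x1 conv over stacked hidden states) and the
# default order of 3 are assumptions made for illustration only.
import torch
import torch.nn as nn


class ConvLSTMCell(nn.Module):
    """Standard ConvLSTM cell (Shi et al., 2015)."""

    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2
        # One convolution produces the input, forget, output and candidate gates.
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               4 * hidden_channels, kernel_size, padding=padding)

    def forward(self, x, h, c):
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c


class HOConvLSTMCell(nn.Module):
    """Higher-order ConvLSTM: an ATS step accumulates the `order` most recent
    hidden states into one temporary state before the ConvLSTM update."""

    def __init__(self, in_channels, hidden_channels, order=3):
        super().__init__()
        self.order = order
        # ATS module: a learnable 1x1 conv over the stacked hidden-state history
        # (an assumed, simple choice of accumulation).
        self.ats = nn.Conv2d(order * hidden_channels, hidden_channels, kernel_size=1)
        self.cell = ConvLSTMCell(in_channels, hidden_channels)

    def forward(self, x, hidden_history, c):
        # hidden_history: list of the `order` most recent hidden states,
        # each of shape (B, hidden_channels, H, W).
        h_tmp = self.ats(torch.cat(hidden_history, dim=1))
        h, c = self.cell(x, h_tmp, c)
        return h, c
```

In an LSN-style network, such a cell would sit after a 2D convolutional stage and run over that stage's per-frame feature maps, which is what allows the same mechanism to be applied at several spatial resolutions.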
Acknowledgment
This work was partially supported by the National Natural Science Foundation of China (61972062, 61902220), the Young and Middle-aged Talents Program of the National Civil Affairs Commission, and the University-Industry Collaborative Education Program (201902029013).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Wang, Z., Dong, W., Zhang, B., Zhang, J. (2022). LSN: Long-Term Spatio-Temporal Network for Video Recognition. In: Wang, Y., Zhu, G., Han, Q., Wang, H., Song, X., Lu, Z. (eds) Data Science. ICPCSEE 2022. Communications in Computer and Information Science, vol 1628. Springer, Singapore. https://doi.org/10.1007/978-981-19-5194-7_24
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-5193-0
Online ISBN: 978-981-19-5194-7