
LSN: Long-Term Spatio-Temporal Network for Video Recognition

Conference paper

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1628)

Abstract

Although recurrent neural networks (RNNs) are widely used to process temporal or sequential data, they have received relatively little attention in current video action recognition applications. This work therefore attempts to model the long-term spatio-temporal information of a video with a variant of the RNN, i.e., the higher-order RNN. We propose a novel long-term spatio-temporal network (LSN) for this video task, whose core integrates newly constructed high-order ConvLSTM (HO-ConvLSTM) modules with traditional 2D convolutional blocks. Specifically, each HO-ConvLSTM module consists of an accumulated temporary state (ATS) module and a standard ConvLSTM module: the ATS module accumulates several previous hidden states into a single temporary state, which enters the standard ConvLSTM together with the current input to determine the output. The HO-ConvLSTM module can be inserted into different stages of a 2D convolutional neural network (CNN) in a plug-and-play manner, thus characterizing the long-term temporal evolution at various spatial resolutions. Experimental results on three commonly used video benchmarks demonstrate that the proposed LSN achieves performance competitive with representative models.
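The following is a minimal PyTorch sketch of the HO-ConvLSTM idea as described in the abstract. It is an illustration, not the authors' implementation: the gating, kernel sizes, the number of accumulated states (`order`), and the exact form of the ATS module (here a 1x1 convolution over the concatenated previous hidden states) are all assumptions; only the overall structure, several past hidden states accumulated into one temporary state that enters a standard ConvLSTM together with the current input, follows the abstract.

```python
import torch
import torch.nn as nn


class ConvLSTMCell(nn.Module):
    # Standard ConvLSTM cell: one convolution over [input, hidden]
    # produces the input, forget, output, and candidate-cell gates.
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, h, c):
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c


class HOConvLSTM(nn.Module):
    # HO-ConvLSTM sketch: an ATS step accumulates the last `order` hidden
    # states into one temporary state, which replaces the single previous
    # hidden state fed to the standard ConvLSTM cell.
    def __init__(self, in_ch, hid_ch, order=3, k=3):
        super().__init__()
        self.order = order
        self.hid_ch = hid_ch
        # Assumed ATS form: a 1x1 convolution over the concatenation of the
        # `order` most recent hidden states (the paper may combine them differently).
        self.ats = nn.Conv2d(order * hid_ch, hid_ch, kernel_size=1)
        self.cell = ConvLSTMCell(in_ch, hid_ch, k)

    def forward(self, frames):
        # frames: (B, T, C, H, W) feature maps from a 2D CNN stage.
        B, T, _, H, W = frames.shape
        h = frames.new_zeros(B, self.hid_ch, H, W)
        c = torch.zeros_like(h)
        history = [h] * self.order            # buffer of previous hidden states
        outputs = []
        for t in range(T):
            temp = self.ats(torch.cat(history[-self.order:], dim=1))  # ATS step
            h, c = self.cell(frames[:, t], temp, c)
            history.append(h)
            outputs.append(h)
        return torch.stack(outputs, dim=1)    # (B, T, hid_ch, H, W)


if __name__ == "__main__":
    clip = torch.randn(2, 8, 64, 32, 32)      # two 8-frame clips of 64-channel features
    out = HOConvLSTM(in_ch=64, hid_ch=64)(clip)
    print(out.shape)                          # torch.Size([2, 8, 64, 32, 32])
```

Because the module consumes and emits feature maps of the same spatial size, it can sit between 2D convolutional stages, which is how the abstract's plug-and-play insertion at various spatial resolutions would work.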



Acknowledgment

This work was partially supported by the National Natural Science Foundation of China (61972062, 61902220), the Young and Middle-aged Talents Program of the National Civil Affairs Commission, and the University-Industry Collaborative Education Program (201902029013).

Author information


Correspondence to Bingbing Zhang or Jianxin Zhang.



Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Wang, Z., Dong, W., Zhang, B., Zhang, J. (2022). LSN: Long-Term Spatio-Temporal Network for Video Recognition. In: Wang, Y., Zhu, G., Han, Q., Wang, H., Song, X., Lu, Z. (eds) Data Science. ICPCSEE 2022. Communications in Computer and Information Science, vol 1628. Springer, Singapore. https://doi.org/10.1007/978-981-19-5194-7_24


  • DOI: https://doi.org/10.1007/978-981-19-5194-7_24


  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-19-5193-0

  • Online ISBN: 978-981-19-5194-7
