
TCVM: Temporal Contrasting Video Montage Framework for Self-supervised Video Representation Learning

  • Conference paper
Computer Vision – ACCV 2022 (ACCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13842)


Abstract

Extracting appropriate temporal differences and ignoring irrelevant backgrounds are two important perspectives for preserving sufficient motion information in video representations, which underpin applications such as driver behavior monitoring and driver fatigue detection. In this paper, we propose a unified contrastive learning framework called Temporal Contrasting Video Montage (TCVM) to learn action-specific motion patterns, which can be applied in a plug-and-play way. On the one hand, the Temporal Contrasting (TC) module is designed to guarantee appropriate temporal differences between frames; it works in a high-level feature space to capture entangled temporal information. On the other hand, the Video Montage (VM) module is devised to alleviate the effect of video backgrounds; it produces similar temporal motion variations across different positive samples by implicitly mixing up the backgrounds of different videos. Experimental results show that TCVM reaches promising performance on both a large action recognition dataset (i.e., Something-Something V2) and smaller datasets (i.e., UCF101 and HMDB51).
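
To make the two ideas above concrete, the sketch below is a minimal PyTorch illustration of how a montage-style positive pair and a contrastive objective could be wired together. It is not the authors' implementation: the function names (video_montage, info_nce_loss), the blending weight alpha, the use of a temporal average as a stand-in for a static background, and the plain batch-level InfoNCE loss (rather than the paper's frame-level temporal contrasting) are all assumptions made for this example.

import torch
import torch.nn.functional as F


def video_montage(clip_a, clip_b, alpha=0.3):
    # Blend a static version of clip_b into clip_a.
    # clip_a, clip_b: (C, T, H, W). The temporal average of clip_b blurs out its
    # motion and serves as a rough stand-in for a static foreign background, so the
    # montage keeps clip_a's motion under a borrowed background.
    static_bg = clip_b.mean(dim=1, keepdim=True)       # (C, 1, H, W)
    return (1.0 - alpha) * clip_a + alpha * static_bg  # broadcast over T


def info_nce_loss(feats_q, feats_k, temperature=0.07):
    # Standard InfoNCE over clip-level features (N, D): the i-th rows of the two
    # views are positives, every other pair in the batch is a negative.
    q = F.normalize(feats_q, dim=1)
    k = F.normalize(feats_k, dim=1)
    logits = q @ k.t() / temperature                   # (N, N) cosine similarities
    labels = torch.arange(q.size(0), device=q.device)  # positives on the diagonal
    return F.cross_entropy(logits, labels)


if __name__ == "__main__":
    # Toy usage: random "videos" and a linear projection standing in for a 3D backbone.
    clips = torch.randn(8, 3, 8, 64, 64)               # (N, C, T, H, W)
    partners = clips[torch.randperm(8)]                # backgrounds borrowed from other videos
    view1 = torch.stack([video_montage(a, b) for a, b in zip(clips, partners)])
    view2 = torch.stack([video_montage(a, b) for a, b in zip(clips, partners.flip(0))])
    encoder = torch.nn.Linear(3 * 8 * 64 * 64, 128)    # placeholder encoder
    loss = info_nce_loss(encoder(view1.flatten(1)), encoder(view2.flatten(1)))
    print(loss.item())

Because both views share the same motion but borrow backgrounds from other videos, an encoder trained with such a loss is pushed to match clips by their motion rather than by background appearance, which is the intuition the abstract attributes to the VM module.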

F. Tian and J. Fan—Equal Contribution.

This work was done when Fengrui Tian and Jiawei Fan were interns at Megvii Research.



Acknowledgements

This work was supported by the National Key Research and Development Program of China under Grant No. 2020AAA0108100, the National Natural Science Foundation of China under Grants No. 61971343, 62088102, and 62073257, and the Key Research and Development Program of Shaanxi Province of China under Grant No. 2022GY-076.

Author information


Corresponding author

Correspondence to Shaoyi Du.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (zip 2494 KB)


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Tian, F., Fan, J., Yu, X., Du, S., Song, M., Zhao, Y. (2023). TCVM: Temporal Contrasting Video Montage Framework for Self-supervised Video Representation Learning. In: Wang, L., Gall, J., Chin, TJ., Sato, I., Chellappa, R. (eds) Computer Vision – ACCV 2022. ACCV 2022. Lecture Notes in Computer Science, vol 13842. Springer, Cham. https://doi.org/10.1007/978-3-031-26284-5_32


  • DOI: https://doi.org/10.1007/978-3-031-26284-5_32

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-26283-8

  • Online ISBN: 978-3-031-26284-5

  • eBook Packages: Computer Science (R0)
