TCVM: Temporal Contrasting Video Montage Framework for Self-supervised Video Representation Learning

Tian, Fengrui; Fan, Jiawei; Yu, Xie; Du, Shaoyi; Song, Meina; Zhao, Yu

doi:10.1007/978-3-031-26284-5_32

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13842)

Included in the following conference series:

Asian Conference on Computer Vision

Abstract

Extracting appropriate temporal differences and ignoring irrelevant backgrounds are both important for preserving sufficient motion information in video representations, for example in applications such as driver behavior monitoring and driver fatigue detection. In this paper, we propose a unified contrastive learning framework called Temporal Contrasting Video Montage (TCVM) to learn action-specific motion patterns, which can be implemented in a plug-and-play way. On the one hand, the Temporal Contrasting (TC) module is designed to guarantee an appropriate temporal difference between frames. It uses a high-level feature space to capture entangled temporal information. On the other hand, the Video Montage (VM) module is devised to alleviate the effect of the video background. It demonstrates similar temporal motion variances in different positive samples by implicitly mixing up the backgrounds of different videos. Experimental results show that TCVM achieves promising performance on both a large action recognition dataset (i.e., Something-Something v2) and small datasets (i.e., UCF101 and HMDB51).
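As a rough illustration of the two ideas named in the abstract (contrastive learning and mixing the backgrounds of different videos), the following NumPy sketch may help. It is not the authors' implementation: the abstract does not specify how the TC or VM module is computed. The function names `mix_backgrounds` and `info_nce`, the blending weight `lam`, the `temperature` value, and the use of a temporal mean as a "background" are all assumptions made for illustration only.

```python
import numpy as np


def mix_backgrounds(clip_a, clip_b, lam=0.7):
    """Blend every frame of clip_a with the temporal mean of clip_b.

    Hypothetical stand-in for background mixing (an assumption, not TCVM's
    VM module): the temporal mean of clip_b is constant over time, so the
    frame-to-frame differences of the result are those of clip_a scaled
    by lam.
    """
    static_b = clip_b.mean(axis=0, keepdims=True)  # shape (1, H, W, C)
    return lam * clip_a + (1.0 - lam) * static_b


def info_nce(query, keys, positive_index, temperature=0.1):
    """Standard InfoNCE loss for one query vector against rows of `keys`."""
    q = query / np.linalg.norm(query)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = k @ q / temperature          # cosine similarities / temperature
    logits = logits - logits.max()        # subtract max for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[positive_index]


rng = np.random.default_rng(0)

# Two toy clips with layout (T, H, W, C); the values are random placeholders.
clip_a = rng.random((8, 16, 16, 3))
clip_b = rng.random((8, 16, 16, 3))
mixed = mix_backgrounds(clip_a, clip_b, lam=0.7)

# Temporal differences of the mixed clip versus the original clip.
diff_a = np.diff(clip_a, axis=0)
diff_mixed = np.diff(mixed, axis=0)

# Toy embeddings: the query is identical to key 0, so key 0 is the positive.
keys = rng.standard_normal((4, 32))
loss_pos = info_nce(keys[0], keys, positive_index=0)
loss_neg = info_nce(keys[0], keys, positive_index=1)
```

In this toy setting the mixed clip keeps the motion of `clip_a` (its temporal differences are simply scaled by `lam`) while its static content changes, which is the intuition behind treating such clips as positives in a contrastive loss.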

F. Tian and J. Fan—Equal Contribution.

This work was done when Fengrui Tian and Jiawei Fan were interns at Megvii Research.

Acknowledgements

This work was supported by the National Key Research and Development Program of China under Grant No. 2020AAA0108100, the National Natural Science Foundation of China under Grant Nos. 61971343, 62088102, and 62073257, and the Key Research and Development Program of Shaanxi Province of China under Grant No. 2022GY-076.

Author information

Authors and Affiliations

Xi’an Jiaotong University, Xi’an, China
Fengrui Tian & Shaoyi Du
Beijing University of Posts and Telecommunications, Beijing, China
Jiawei Fan, Xie Yu & Meina Song
Harbin Institute of Technology, Harbin, China
Yu Zhao

Corresponding author

Correspondence to Shaoyi Du.

Editor information

Editors and Affiliations

University of Wollongong, Wollongong, NSW, Australia
Lei Wang
University of Bonn, Bonn, Germany
Juergen Gall
University of Adelaide, Adelaide, SA, Australia
Tat-Jun Chin
National Institute of Informatics, Tokyo, Japan
Imari Sato
Johns Hopkins University, Baltimore, MD, USA
Rama Chellappa

About this paper

Cite this paper

Tian, F., Fan, J., Yu, X., Du, S., Song, M., Zhao, Y. (2023). TCVM: Temporal Contrasting Video Montage Framework for Self-supervised Video Representation Learning. In: Wang, L., Gall, J., Chin, TJ., Sato, I., Chellappa, R. (eds) Computer Vision – ACCV 2022. ACCV 2022. Lecture Notes in Computer Science, vol 13842. Springer, Cham. https://doi.org/10.1007/978-3-031-26284-5_32

DOI: https://doi.org/10.1007/978-3-031-26284-5_32
Published: 23 February 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-26283-8
Online ISBN: 978-3-031-26284-5
eBook Packages: Computer Science, Computer Science (R0)
