Bi-calibration Networks for Weakly-Supervised Video Representation Learning

Published in: International Journal of Computer Vision

Abstract

Leveraging large volumes of web videos paired with queries (short phrases used to search for the videos) or surrounding text (longer textual descriptions, e.g., video titles) offers an economical and extensible alternative to supervised video representation learning. Nevertheless, modeling such weak visual-textual connections is non-trivial due to query polysemy (i.e., a query may carry many possible meanings) and text isomorphism (i.e., different texts may share the same syntactic structure). In this paper, we introduce a new design of mutual calibration between query and text to achieve more reliable visual-textual supervision for video representation learning. Specifically, we present Bi-Calibration Networks (BCN), which couple two calibrations to learn the correction from text to query and vice versa. Technically, BCN clusters all the titles of the videos retrieved by an identical query and takes the centroid of each cluster as a text prototype, while all the queries constitute the query set. The representation learning of BCN is then formulated as video classification over text prototypes and queries, with text-to-query and query-to-text calibrations. A selection scheme is also devised to balance the two calibrations. Two large-scale web video datasets paired with queries and titles, named YOVO-3M and YOVO-10M, are newly collected for weakly-supervised video feature learning. The video features of BCN with a ResNet backbone learnt on YOVO-3M (3M YouTube videos) obtain superior results under the linear protocol on action recognition. More remarkably, BCN trained on the larger YOVO-10M (10M YouTube videos) with further fine-tuning leads to a 1.3% gain in top-1 accuracy on the Kinetics-400 dataset over the state-of-the-art TAda2D method with ImageNet pre-training. Source code and datasets are available at https://github.com/FuchenUSTC/BCN.
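
Since the abstract describes the text-prototype construction only at a high level, the sketch below illustrates one plausible realization under stated assumptions: the titles of the videos retrieved by the same query are embedded, clustered, and the cluster centroids are kept as text prototypes. The helper name build_text_prototypes, the per-query cluster count, and the random toy embeddings are illustrative assumptions, not the released BCN code.

```python
# Minimal sketch of per-query title clustering into text prototypes.
# Assumption: title_embeddings maps each query string to an (N_q x D) array of
# title embeddings produced by any text encoder of choice.
import numpy as np
from sklearn.cluster import KMeans

def build_text_prototypes(title_embeddings, n_clusters=5, seed=0):
    """Cluster each query's title embeddings; centroids serve as text prototypes."""
    prototypes = {}   # query -> (K x D) centroid matrix
    assignments = {}  # query -> per-title cluster index (pseudo text label)
    for query, embs in title_embeddings.items():
        k = min(n_clusters, len(embs))  # guard against queries with few videos
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(embs)
        prototypes[query] = km.cluster_centers_
        assignments[query] = km.labels_
    return prototypes, assignments

# Toy usage: two queries with random 32-d title embeddings.
rng = np.random.default_rng(0)
titles = {"play guitar": rng.normal(size=(40, 32)),
          "bass fishing": rng.normal(size=(25, 32))}
protos, labels = build_text_prototypes(titles, n_clusters=3)
print({q: p.shape for q, p in protos.items()})  # e.g. {'play guitar': (3, 32), ...}
```

In the paper's formulation, each video then carries two supervisory signals, its query and its text prototype, and the network is trained to classify over both label sets while the text-to-query and query-to-text calibrations correct each other.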

Data Availability

The newly created web video datasets YOVO-3M and YOVO-10M are available at https://github.com/FuchenUSTC/BCN/tree/master/datasets. One caveat of the two proposed datasets is that a small number of videos may be missing or taken down from YouTube, potentially affecting data availability in the future. The video data that support the downstream task evaluation of this research are available in the Kinetics-400 (Carreira and Zisserman, 2017) (https://www.deepmind.com/open-source/kinetics), UCF101 (Soomro et al., 2012) (https://www.crcv.ucf.edu/data/UCF101.php), HMDB51 (Kuehne et al., 2011) (https://serre-lab.clps.brown.edu/resource), and Something-Something V1 and V2 (Goyal et al., 2017) (https://developer.qualcomm.com/software/ai-datasets) datasets.
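
Because some videos may disappear from YouTube over time, it can be useful to check which IDs in the dataset metadata still resolve before downloading. The snippet below is a small, hypothetical helper, not part of the released dataset tooling; it assumes YouTube's public oEmbed endpoint as a lightweight availability probe and uses a placeholder ID list.

```python
# Hypothetical availability check for dataset video IDs via YouTube's oEmbed endpoint.
import requests

def is_video_available(video_id: str, timeout: float = 5.0) -> bool:
    """Return True if the oEmbed endpoint still resolves the given video ID."""
    params = {"url": f"https://www.youtube.com/watch?v={video_id}", "format": "json"}
    try:
        resp = requests.get("https://www.youtube.com/oembed", params=params, timeout=timeout)
        return resp.status_code == 200
    except requests.RequestException:
        return False  # treat network errors as unavailable, for a conservative estimate

# Usage: filter IDs read from the dataset metadata down to those that still resolve.
video_ids = ["dQw4w9WgXcQ"]  # placeholder; replace with IDs from the YOVO metadata files
available = [v for v in video_ids if is_video_available(v)]
print(f"{len(available)}/{len(video_ids)} videos still available")
```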

References

  • Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B., & Vijayanarasimhan, S. (2016). YouTube-8M: A Large-Scale Video Classification Benchmark. arXiv preprint arXiv:1609.08675.

  • Agrawal, P., Carreira, J., & Malik, J. (2015). Learning to see by moving. In ICCV.

  • Alwassel, H., Mahajan, D., Korbar, B., Torresani, L., Ghanem, B., & Tran, D. (2020). Self-supervised learning by cross-modal audio-video clustering. In NeurIPS.

  • Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lucic, M., & Schmid, C. (2021). ViViT: A video vision transformer. In ICCV.

  • Avila, S., Thome, N., Cord, M., Valle, E., & de A. Araújo, A. (2013). Pooling in image representation: The visual codeword point of view. Computer Vision and Image Understanding.

  • Benaim, S., Ephrat, A., Lang, O., Mosseri, I., Freeman, W. T., Rubinstein, M., Irani, M., & Dekel, T. (2020). SpeedNet: Learning the speediness in videos. In CVPR.

  • Berg, T. L., & Forsyth, D. A. (2006). Animals on the web. In CVPR.

  • Bertasius, G., Wang, H., & Torresani, L. (2021). Is space-time attention all you need for video understanding? In ICML.

  • Cai, Q., Wang, Y., Pan, Y., Yao, T., & Mei, T. (2020). Joint contrastive learning with infinite possibilities. In NeurIPS.

  • Carreira, J., & Zisserman, A. (2017). Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR.

  • Carreira, J., Noland, E., Hillier, C., & Zisserman, A. (2019). A short note on the Kinetics-700 human action dataset. arXiv preprint arXiv:1907.06987.

  • Damen, D., Doughty, H., Farinella, G. M., Fidler, S., Furnari, A., Kazakos, E., Moltisanti, D., Munro, J., Perrett, T., Price, W., & Wray, M. (2018). Scaling egocentric vision: The EPIC-KITCHENS Dataset. In ECCV.

  • Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In CVPR.

  • Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.

  • Diba, A., Sharma, V., & Gool, L. V. (2017). Deep temporal linear encoding networks. In CVPR.

  • Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR.

  • Duan, H., Zhao, Y., Xiong, Y., Liu, W., & Lin, D. (2020). Omni-sourced Webly-supervised learning for video recognition. In ECCV.

  • Dwibedi, D., Aytar, Y., Tompson, J., Sermanet, P., & Zisserman, A. (2019). Temporal cycle-consistency learning. In CVPR.

  • Fan, H., Xiong, B., Mangalam, K., Li, Y., Yan, Z., Malik, J., & Feichtenhofer, C. (2021). Multiscale vision transformers. In ICCV.

  • Feichtenhofer, C., Pinz, A., & Zisserman, A. (2016). Convolutional two-stream network fusion for video action recognition. In CVPR.

  • Feichtenhofer, C., Fan, H., Malik, J., & He, K. (2019). SlowFast networks for video recognition. In ICCV.

  • Feichtenhofer, C., Fan, H., Xiong, B., Girshick, R., & He, K. (2021). A large-scale study on unsupervised spatiotemporal representation learning. In CVPR.

  • Fernando, B., Bilen, H., Gavves, E., & Gould, S. (2017). Self-supervised video representation learning with odd-one-out networks. In CVPR.

  • Finn, C., Goodfellow, I., & Levine, S. (2016). Unsupervised learning for physical interaction through video prediction. In NIPS.

  • Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., & Crawford, K. (2021). Datasheets for datasets. Communications of the ACM.

  • Ghadiyaram, D., Feiszli, M., Tran, D., Yan, X., Wang, H., & Mahajan, D. (2019). Large-scale weakly-supervised pre-training for video action recognition. In CVPR.

  • Ghanem, B., Niebles, J. C., Snoek, C., Heilbron, F. C., Alwassel, H., Escorcia, V., Krishna, R., Buch, S., & Dao, C. D. (2018). The ActivityNet large-scale activity recognition challenge 2018 summary. arXiv preprint arXiv:1808.03766.

  • Gionis, A., Indyk, P., & Motwani, R. (1999). Similarity search in high dimensions via hashing. In VLDB.

  • Girshick, R. (2015). Fast R-CNN. In ICCV.

  • Goyal, R., Kahou, S. E., Michalski, V., Materzynska, J., Westphal, S., Kim, H., Haenel, V., Fruend, I., Yianilos, P., Mueller-Freitag, M., Hoppe, F., Thurau, C., Bax, I., & Memisevic, R. (2017). The "something something" video database for learning and evaluating visual common sense. In ICCV.

  • Han, T., Xie, W., & Zisserman, A. (2020). Self-supervised co-training for video representation learning. In NeurIPS.

  • Hara, K., Kataoka, H., & Satoh, Y. (2018). Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In CVPR.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR.

  • He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. (2020). Momentum contrast for unsupervised visual representation learning. In CVPR.

  • Huang, Z., Zhang, S., Pan, L., Qing, Z., Tang, M., Liu, Z., & Ang Jr., M. H. (2022). TAda! Temporally-adaptive convolutions for video understanding. In ICLR.

  • Ji, S., Xu, W., Yang, M., & Yu, K. (2012). 3D convolutional neural networks for human action recognition. IEEE Transactions on PAMI, 35(1), 221–231.

  • Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R. B., Guadarrama, S., & Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. In ACM MM.

  • Jiang, B., Wang, M., Gan, W., Wu, W., & Yan, J. (2019). STM: SpatioTemporal and motion encoding for action recognition. In ICCV.

  • Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014). Large-scale video classification with convolutional neural networks. In CVPR.

  • Kong, Y., & Fu, Y. (2022). Human action recognition and prediction: A survey. International Journal of Computer Vision, 130, 1366–1401.

  • Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., & Serre, T. (2011). HMDB: A large video database for human motion recognition. In ICCV.

  • Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR.

  • Li, D., Qiu, Z., Pan, Y., Yao, T., Li, H., & Mei, T. (2021a). Representing videos as discriminative sub-graphs for action recognition. In CVPR.

  • Li, J., Zhou, P., Xiong, C., & Hoi, S. (2021b). Prototypical contrastive learning of unsupervised representations. In ICLR.

  • Li, R., Zhang, Y., Qiu, Z., Yao, T., Liu, D., & Mei, T. (2021c). Motion-focused contrastive learning of video representations. In ICCV.

  • Li, T., & Wang, L. (2020). Learning spatiotemporal features via video and text pair discrimination. arXiv preprint arXiv:2001.05691.

  • Li, X., Wang, Y., Zhou, Z., & Qiao, Y. (2020a). SmallBigNet: Integrating core and contextual views for video classification. In CVPR.

  • Li, Y., Ji, B., Shi, X., Zhang, J., Kang, B., & Wang, L. (2020b). TEA: Temporal excitation and aggregation for action recognition. In CVPR.

  • Li, Y., Yao, T., Pan, Y., & Mei, T. (2022). Contextual transformer networks for visual recognition. IEEE Transactions on PAMI.

  • Lin, J., Gan, C., & Han, S. (2019). TSM: Temporal shift module for efficient video understanding. In ICCV.

  • Lin, Y., Guo, X., & Lu, Y. (2021). Self-supervised video representation learning with meta-contrastive network. In ICCV.

  • Liu, X., Lee, J. Y., & Jin, H. (2019). Learning video representations from correspondence proposals. In CVPR.

  • Liu, Z., Yeh, R. A., Tang, X., Liu, Y., & Agarwala, A. (2017). Video frame synthesis using deep voxel flow. In ICCV.

  • Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., & Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV.

  • Liu, Z., Ning, J., Cao, Y., Wei, Y., Zhang, Z., Lin, S., Hu, H. (2022). Video swin transformer. In ECCV.

  • Long, F., Yao, T., Qiu, Z., Tian, X., Luo, J., & Mei, T. (2019). Gaussian temporal awareness networks for action localization. In CVPR.

  • Long, F., Yao, T., Qiu, Z., Tian, X., Mei, T., & Luo, J. (2020). Coarse-to-fine localization of temporal action proposals. IEEE Transactions on Multimedia, 22(6), 1577–1590.

  • Long, F., Qiu, Z., Pan, Y., Yao, T., Luo, J., & Mei, T. (2022a). Stand-alone inter-frame attention in video models. In CVPR.

  • Long, F., Qiu, Z., Pan, Y., Yao, T., Ngo, C. W., & Mei, T. (2022b). Dynamic temporal filtering in video models. In ECCV.

  • Luo, Z., Peng, B., Huang, D. A., Alahi, A., & Fei-Fei, L. (2017). Unsupervised learning of long-term motion dynamics for videos. In CVPR.

  • Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9, 2579–2605.

  • Mettes, P., Thong, W., & Snoek, G. G. M. (2021). Object priors for classifying and localizing unseen actions. International Journal of Computer Vision, 129, 1954–1971.

  • Miech, A., Zhukov, D., Alayrac, J. B., Tapaswi, M., Laptev, I., & Sivic, J. (2019). HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips. In ICCV.

  • Miech, A., Alayrac, J. B., Smaira, L., Laptev, I., Sivic, J., & Zisserman, A. (2020). End-to-end learning of visual representations from uncurated instructional videos. In CVPR.

  • Misra, I., Zitnick, C. L., & Hebert, M. (2016). Shuffle and learn: Unsupervised learning using temporal order verification. In ECCV.

  • Mobahi, H., Collobert, R., & Weston, J. (2009). Deep learning from temporal coherence in video. In ICML.

  • Monfort, M., Andonian, A., Zhou, B., Ramakrishnan, K., Bargal, S. A., Yan, T., Brown, L., Fan, Q., Gutfreund, D., Vondrick, C., & Oliva, A. (2019). Moments in time dataset: One million videos for event understanding. IEEE Transactions on PAMI, 42(2), 502–508.

  • Neimark, D., Bar, O., Zohar, M., & Asselmann, D. (2021). Video transformer network. In ICCV workshop.

  • Ng, J. Y. H., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R., & Toderici, G. (2015). Beyond short snippets: Deep networks for video classification. In CVPR.

  • Oord, A., Li, Y., & Vinyals, O. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.

  • Pan, Y., Li, Y., Luo, J., Xu, J., Yao, T., & Mei, T. (2022). Auto-captions on GIF: A large-scale video-sentence dataset for vision-language pre-training. In ACM multimedia.

  • Pathak, D., Girshick, R., Dollar, P., Darrell, T., & Hariharan, B. (2017). Learning features by watching objects move. In CVPR.

  • Qian, R., Meng, T., Gong, B., Yang, M. H., Wang, H., Belongie, S., & Cui, Y. (2021). Spatiotemporal contrastive video representation learning. In CVPR.

  • Qiu, Z., Yao, T., & Mei, T. (2017). Learning spatio-temporal representation with pseudo-3D residual networks. In ICCV.

  • Qiu, Z., Yao, T., Ngo, C. W., Tian, X., & Mei, T. (2019). Learning spatio-temporal representation with local and global diffusion. In CVPR.

  • Qiu, Z., Yao, T., Ngo, C. W., Zhang, X. P., Wu, D., & Mei, T. (2021). Boosting video representation learning with multi-faceted integration. In CVPR.

  • Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning transferable visual models from natural language supervision. In ICML.

  • Saenko, K., & Darrell, T. (2008). Unsupervised learning of visual sense models for polysemous words. In NIPS.

  • Schroff, F., Criminisi, A., & Zisserman, A. (2007). Harvesting image databases from the web. In ICCV.

  • Simonyan, K., & Zisserman, A. (2014). Two-stream convolutional networks for action recognition in videos. In NIPS.

  • Soomro, K., Zamir, A. R., & Shah, M. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. Technical Report CRCV-TR-12-01.

  • Srivastava, N., Mansimov, E., & Salakhutdinov, R. (2015). Unsupervised learning of video representations using LSTMs. In ICML.

  • Stroud, J. C., Ross, D. A., Sun, C., Deng, J., Sukthankar, R., & Schmid, C. (2020). Learning video representations from textual web supervision. arXiv preprint arXiv:2007.14937.

  • Sultani, W., Chen, C., & Shah, M. (2018). Real-world anomaly detection in surveillance videos. In CVPR.

  • Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B, 63(2), 411–423.

  • Tran, D., Bourdev, L., Fergus, R., Torresani, L., & Paluri, M. (2015). Learning spatiotemporal features with 3D convolutional networks. In ICCV.

  • Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., & Paluri, M. (2018). A closer look at spatiotemporal convolutions for action recognition. In CVPR.

  • Wang, H., Klaser, A., Schmid, C., & Liu, C. L. (2013). Dense trajectories and motion boundary descriptors for action recognition. International Journal of Computer Vision, 103(1), 60–79.

  • Wang, H., Tran, D., Torresani, L., & Feiszli, M. (2020). Video modeling with correlation networks. In CVPR.

  • Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., & Gool, L. V. (2016). Temporal segment networks: Towards good practices for deep action recognition. In ECCV.

  • Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., & Gool, L. (2018). Temporal segment networks for action recognition in videos. IEEE Transactions on PAMI, 41(11), 2740–2755.

  • Wang, L., Tong, Z., Ji, B., & Wu, G. (2021a). TDN: Temporal difference networks for efficient action recognition. In CVPR.

  • Wang, M., Xing, J., & Liu, Y. (2021b). ActionCLIP: A new paradigm for video action recognition. arXiv preprint arXiv:2109.08472.

  • Wang, X., & Gupta, A. (2015). Unsupervised learning of visual representations using videos. In ICCV.

  • Wang, X., & Gupta, A. (2018). Videos as space-time region graphs. In ECCV.

  • Wang, X., Girshick, R., Gupta, A., & He, K. (2018b). Non-local neural networks. In CVPR.

  • Wang, X., Jabri, A., & Efros, A. A. (2019). Learning correspondence from the cycle-consistency of time. In CVPR.

  • Wang, Z., She, Q., & Smolic, A. (2021c). ACTION-net: Multipath excitation for action recognition. In CVPR.

  • Wei, C., Fan, H., Xie, S., Wu, C. Y., Yuille, A., & Feichtenhofer, C. (2021). Masked feature prediction for self-supervised visual pre-training. arXiv preprint arXiv:2112.09133.

  • Wei, D., Lim, J., Zisserman, A., & Freeman, W. T. (2018). Learning and using the arrow of time. In CVPR.

  • Wu, X., Wang, R., Hou, J., Lin, H., & Luo, J. (2021). Spatial–temporal relation reasoning for action prediction in videos. International Journal of Computer Vision, 129, 1484–1505.

  • Xie, S., Sun, C., Huang, J., Tu, Z., & Murphy, K. (2018). Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In ECCV.

  • Xu, D., Xiao, J., Zhao, Z., Shao, J., Xie, D., & Zhuang, Y. (2019). Self-supervised spatiotemporal learning via video clip order prediction. In CVPR.

  • Yan, S., Xiong, X., Arnab, A., Lu, Z., Zhang, M., Sun, C., & Schmid, C. (2022). Multiview transformers for video recognition. In CVPR.

  • Yang, C., Xu, Y., Dai, B., & Zhou, B. (2020a). Video representation learning with visual tempo consistency. arXiv preprint arXiv:2006.15489.

  • Yang, C., Xu, Y., Shi, J., Dai, B., & Zhou, B. (2020b). Temporal pyramid network for action recognition. In CVPR.

  • Yang, J., Feng, L., Chen, W., Yan, X., Zheng, H., Luo, P., & Zhang, W. (2020c). Webly supervised image classification with self-contained confidence. In ECCV.

  • Yao, T., Zhang, Y., Qiu, Z., Pan, Y., & Mei, T. (2021). SeCo: Exploring sequence supervision for unsupervised representation learning. In AAAI.

  • Yao, T., Li, Y., Pan, Y., Wang, Y., Zhang, X. P., & Mei, T. (2022a). Dual vision transformer. arXiv preprint arXiv:2207.04976.

  • Yao, T., Pan, Y., Li, Y., Ngo, C. W., & Mei, T. (2022b). Wave-ViT: Unifying wavelet and transformers for visual representation learning. In ECCV.

  • Zabih, R., & Woodfill, J. (1994). Non-parametric local transforms for computing visual correspondence. In ECCV.

  • Zhang, Y., Li, X., Liu, C., Shuai, B., Zhu, Y., Brattoli, B., Chen, H., Marsic, I., & Tighe, J. (2021). VidTr: Video transformer without convolutions. In ICCV.

  • Zhao, Y., Xiong, Y., Wang, L., Wu, Z., Tang, X., & Lin, D. (2021). Temporal action detection with structured segment networks. International Journal of Computer Vision, 128, 74–95.

Author information

Corresponding author

Correspondence to Ting Yao.

Additional information

Communicated by Dima Damen.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was supported by the National Key R&D Program of China under Grant No. 2020AAA0108600.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Long, F., Yao, T., Qiu, Z. et al. Bi-calibration Networks for Weakly-Supervised Video Representation Learning. Int J Comput Vis 131, 1704–1721 (2023). https://doi.org/10.1007/s11263-023-01779-w
