Abstract
Social videos are now pervasive on the Web. They are characterized by rich accompanying contextual information, which describes the video content and thus greatly facilitates video search and browsing. Such context data, for example tags, are generally produced for the whole video, with no temporal indication of when they actually appear in it. However, many tags describe only parts of the video content. Tag localization, the process of assigning tags to the relevant underlying video segments or frames, is therefore attracting increasing research interest, and a benchmark dataset for the fair evaluation of tag localization algorithms is highly desirable. In this paper, we describe and release a dataset called DUT-WEBV, which contains 1550 videos collected from YouTube by issuing 31 concepts as queries. These concepts cover a wide range of semantic aspects, including scenes like “mountain”, events like “flood”, objects like “cows”, sites like “gas station”, and activities like “handshaking”, and so pose considerable challenges for the tag (i.e., concept) localization task. For each video retrieved for a tag, we carefully annotate the time durations during which the tag appears in the video. Besides the videos themselves, contextual information such as thumbnail images, titles, and categories is also provided. Together with this benchmark dataset, we present a baseline for tag localization based on a multiple-instance learning approach. Finally, we discuss some open research issues for tag localization in web videos.
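To illustrate how per-tag duration annotations of this kind could be used to score a localization result, the following is a minimal sketch. It assumes that annotations and predictions are both given as lists of (start, end) intervals in seconds and that quality is measured by duration-based precision and recall. The interval format, the function names, and the choice of metric are assumptions made for illustration; the abstract does not specify the dataset's file format or the paper's evaluation protocol.

```python
from typing import List, Tuple

# Hypothetical representation: (start_sec, end_sec) for one tag in one video.
Interval = Tuple[float, float]


def merge(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping intervals so durations are not counted twice."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_length(intervals: List[Interval]) -> float:
    """Total duration covered by a list of non-overlapping intervals."""
    return sum(end - start for start, end in intervals)


def overlap_length(a: List[Interval], b: List[Interval]) -> float:
    """Duration covered by both interval lists."""
    a, b = merge(a), merge(b)
    return sum(
        max(0.0, min(a_end, b_end) - max(a_start, b_start))
        for a_start, a_end in a
        for b_start, b_end in b
    )


def precision_recall(
    predicted: List[Interval], annotated: List[Interval]
) -> Tuple[float, float]:
    """Duration-based precision and recall (an assumed metric, not the paper's)."""
    hit = overlap_length(predicted, annotated)
    pred_len = total_length(merge(predicted))
    gt_len = total_length(merge(annotated))
    precision = hit / pred_len if pred_len > 0 else 0.0
    recall = hit / gt_len if gt_len > 0 else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up example: the tag is annotated in two segments, one is partly found.
    annotated = [(10.0, 20.0), (40.0, 50.0)]
    predicted = [(15.0, 25.0)]
    print(precision_recall(predicted, annotated))
```

In the made-up example, 5 of the 10 predicted seconds fall inside an annotated segment, and 5 of the 20 annotated seconds are recovered. A frame-level or shot-level variant would follow the same pattern with frame or shot indices in place of seconds.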
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Li, H., Yi, L., Guan, Y., Zhang, H. (2013). DUT-WEBV: A Benchmark Dataset for Performance Evaluation of Tag Localization for Web Video. In: Li, S., et al. Advances in Multimedia Modeling. Lecture Notes in Computer Science, vol 7733. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35728-2_29
DOI: https://doi.org/10.1007/978-3-642-35728-2_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35727-5
Online ISBN: 978-3-642-35728-2
eBook Packages: Computer Science (R0)