Correlative multilabel video annotation with temporal kernels

Authors:
Guo-Jun Qi, University of Science and Technology of China, Anhui, China
Xian-Sheng Hua, Microsoft Corporation, Beijing, China
Yong Rui, Microsoft Corporation, Beijing, China
Jinhui Tang, University of Science and Technology of China, Anhui, China
Tao Mei, Microsoft Corporation, Beijing, China
Meng Wang, University of Science and Technology of China, Anhui, China
Hong-Jiang Zhang, Microsoft Corporation, Beijing, China

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 5, Issue 1, Article No. 3, pp. 1–27. https://doi.org/10.1145/1404880.1404883

Published: 30 October 2008

Abstract

Automatic video annotation is an important component of semantic-level video browsing, search, and navigation, and it has received much attention in recent years. Research on this topic has evolved through two paradigms. In the first paradigm, each concept is annotated individually by a pre-trained binary classifier. However, this approach ignores the rich correlations among video concepts and achieves only limited success. The second paradigm evolved from the first: its methods add an extra step on top of the individual classifiers to fuse the multiple concept detections. However, the performance of these methods can be degraded by errors that propagate from the first step into the fusion step. In this article, a third paradigm for video annotation is proposed to address these problems. The proposed Correlative Multilabel (CML) method annotates the concepts and models the correlations between them simultaneously, in a single step, and so benefits from the complementary information carried by different labels. Furthermore, because video clips are composed of temporally ordered frame sequences, we extend the proposed method to exploit the rich temporal information in videos. Specifically, a temporal kernel is incorporated into the CML method, based on the discriminative information between Hidden Markov Models (HMMs) learned from the videos. We compare the proposed approach with state-of-the-art approaches from the first and second paradigms on the widely used TRECVID data set. As will be shown, the proposed method achieves superior performance.
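
The idea of a kernel built from the discrepancy between per-clip generative models can be illustrated with a small sketch. The code below is not the article's temporal kernel: it assumes a symmetric Kullback-Leibler divergence as the discrepancy measure (the abstract does not specify one), and it replaces the HMMs with a single diagonal Gaussian per clip, which ignores frame order entirely. The names `fit_diag_gaussian`, `kl_diag_gaussian`, `divergence_kernel`, and the bandwidth `gamma` are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence, Tuple

# A clip is a sequence of frames; each frame is a feature vector.
Clip = Sequence[Sequence[float]]
Gaussian = Tuple[List[float], List[float]]  # (means, variances)


def fit_diag_gaussian(clip: Clip, eps: float = 1e-6) -> Gaussian:
    """Fit a diagonal Gaussian to a clip's frames.

    Assumption: a stand-in for an HMM; it discards temporal order.
    """
    n = len(clip)
    dims = len(clip[0])
    means = [sum(frame[d] for frame in clip) / n for d in range(dims)]
    # eps keeps every variance strictly positive.
    variances = [
        sum((frame[d] - means[d]) ** 2 for frame in clip) / n + eps
        for d in range(dims)
    ]
    return means, variances


def kl_diag_gaussian(p: Gaussian, q: Gaussian) -> float:
    """Closed-form KL(p || q) between two diagonal Gaussians."""
    (mu_p, var_p), (mu_q, var_q) = p, q
    return 0.5 * sum(
        math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
        for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q)
    )


def divergence_kernel(clip_a: Clip, clip_b: Clip, gamma: float = 0.1) -> float:
    """Similarity exp(-gamma * symmetric KL) between two clips' models."""
    p, q = fit_diag_gaussian(clip_a), fit_diag_gaussian(clip_b)
    sym_kl = kl_diag_gaussian(p, q) + kl_diag_gaussian(q, p)
    return math.exp(-gamma * sym_kl)


if __name__ == "__main__":
    clip_a = [[0.0, 1.0], [1.0, 2.0], [2.0, 0.0]]
    clip_b = [[5.0, 5.0], [6.0, 7.0], [7.0, 6.0]]
    print(divergence_kernel(clip_a, clip_a))  # identical clips
    print(divergence_kernel(clip_a, clip_b))  # dissimilar clips
```

The resulting similarity is symmetric and lies in (0, 1], but an exponentiated divergence of this kind is not guaranteed to be positive semidefinite, so using it inside a kernel machine may require a correction step. Capturing temporal structure, as the article intends, would require replacing the Gaussian with a sequence model such as an HMM and an appropriate divergence approximation.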


Index Terms

Correlative multilabel video annotation with temporal kernels
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision tasks
        Video summarization
2. Information systems
  1. Information retrieval
    1. Document representation
    2. Search engine architectures and scalability
      1. Search engine indexing

Recommendations

Correlative multi-label video annotation
MM '07: Proceedings of the 15th ACM international conference on Multimedia

Automatically annotating concepts for video is a key to semantic-level video browsing, search and navigation. The research on this topic evolved through two paradigms. The first paradigm used binary classification to detect each individual concept in a ...
Semi-supervised multi-instance multi-label learning for video annotation task
MM '12: Proceedings of the 20th ACM international conference on Multimedia

Traditional approaches for automatic video annotation usually represent one video clip with a flat feature vector, neglecting the fact that video data contain natural structures. It is also noteworthy that a video clip is often relevant to multiple ...
Automatic video annotation by semi-supervised learning with kernel density estimation
MM '06: Proceedings of the 14th ACM international conference on Multimedia

Insufficiency of labeled training data is a major obstacle for automatically annotating large-scale video databases with semantic concepts. Existing semi-supervised learning algorithms based on parametric models try to tackle this issue by incorporating ...

Reviews

Reviewer: Sebastien Lefevre

Annotation of multimedia data is a very topical yet very challenging problem. Indeed, Web sites such as YouTube store terabytes or even petabytes of video data. To successfully enable user navigation or retrieval in these huge databases, some automatic processes are mandatory. Among them, automatic video annotation seeks to label each video with some predefined concepts, which will then be sought by the end user. In this paper, the authors propose a new paradigm for video annotation in order to deal more effectively with multi-label annotation (since a video usually concerns several topics). Instead of using a set of binary classifiers related to each individual concept, or merging these binary classifiers in a post-processing process called context-based conceptual fusion, they introduce an integrated multi-label approach that explicitly models both concepts themselves and concepts' interactions, to avoid relying on premature decisions made by binary classifiers. Their experiments, performed on the Text Retrieval Conference Video Retrieval Evaluation (TRECVID) dataset, show the relevance of this approach, but also point to the need for efficient algorithms, since the proposed solution is still very far from real time. To deliver automatic solutions to the market, research efforts should now focus not only on the reliability of automatic annotation solutions, but also (and perhaps more) on the efficiency of these systems. This issue should be tackled more often by research teams in the field of multimedia indexing and retrieval.

Online Computing Reviews Service

Published in

ACM Transactions on Multimedia Computing, Communications, and Applications Volume 5, Issue 1
October 2008
201 pages
ISSN:1551-6857
EISSN:1551-6865
DOI:10.1145/1404880

Copyright © 2008 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 30 October 2008
- Accepted: 1 July 2008
- Revised: 1 May 2008
- Received: 1 January 2008
Author Tags
Video annotation
concept correlation
multilabeling
temporal kernel
Qualifiers
- research-article
- Research
- Refereed
Article Metrics
- Total Citations: 38
- Total Downloads: 7,238
- Downloads (Last 12 months): 17
- Downloads (Last 6 weeks): 2