Correlative multilabel video annotation with temporal kernels


Abstract

Automatic video annotation is an important ingredient for semantic-level video browsing, search, and navigation, and it has attracted much attention in recent years. This research has evolved through two paradigms. In the first paradigm, each concept is annotated individually by a pre-trained binary classifier. This approach ignores the rich correlations among video concepts and therefore achieves only limited success. The methods in the second paradigm, which evolved from the first, add an extra step on top of the individual classifiers to fuse the multiple concept detections. However, their performance can be degraded by errors propagated from the first step into the fusion step. In this article, a third paradigm of video annotation is proposed to address these problems: the Correlative Multilabel (CML) method annotates the concepts and models the correlations between them simultaneously, in a single step, and thus benefits from the complementary information shared among different labels. Furthermore, since video clips are temporally ordered frame sequences, we extend the proposed method to exploit the rich temporal information in the videos. Specifically, a temporal kernel, based on the discriminative information between Hidden Markov Models (HMMs) learned from the videos, is incorporated into the CML method. We compare the proposed approach with state-of-the-art approaches from the first and second paradigms on the widely used TRECVID data set and show that it achieves superior performance.
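
The temporal kernel is described above only at a high level. The sketch below is a hedged illustration, not the authors' construction: it fits one Gaussian HMM per clip, approximates a symmetrized KL divergence between two clip HMMs by Monte Carlo sampling, and maps that divergence into an RBF-style kernel value. The hmmlearn-based helpers, parameter values, and the Monte Carlo approximation are all assumptions made for illustration.

    # Hedged sketch of an HMM-based temporal kernel between two video clips.
    # Assumptions: per-frame features are already extracted, hmmlearn is available,
    # and a Monte Carlo estimate stands in for whatever divergence the paper uses.
    import numpy as np
    from hmmlearn.hmm import GaussianHMM

    def fit_clip_hmm(frame_features, n_states=4, seed=0):
        """Fit a Gaussian HMM to one clip's (n_frames, n_dims) feature sequence."""
        model = GaussianHMM(n_components=n_states, covariance_type="diag",
                            n_iter=50, random_state=seed)
        model.fit(frame_features)
        return model

    def approx_kl(hmm_p, hmm_q, n_samples=2000, seed=0):
        """Monte Carlo estimate of KL(P || Q) per sample: E_P[log p(x) - log q(x)]."""
        X, _ = hmm_p.sample(n_samples, random_state=seed)
        return (hmm_p.score(X) - hmm_q.score(X)) / n_samples

    def temporal_kernel(hmm_a, hmm_b, gamma=0.1):
        """RBF-style kernel on a symmetrized KL 'distance' between two clip HMMs."""
        d = 0.5 * (approx_kl(hmm_a, hmm_b) + approx_kl(hmm_b, hmm_a))
        return np.exp(-gamma * max(d, 0.0))  # clip tiny negative Monte Carlo estimates

    # Usage with synthetic stand-in features (hypothetical data).
    rng = np.random.default_rng(0)
    clip_a = rng.normal(size=(120, 8))   # 120 frames, 8-dimensional features
    clip_b = rng.normal(size=(90, 8))
    print(temporal_kernel(fit_clip_hmm(clip_a), fit_clip_hmm(clip_b)))

A kernel matrix built this way over training clips could then be handed to any kernel-based multilabel learner; the choice of divergence and the RBF mapping here are illustrative design choices, not those reported in the article.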


        Reviews

        Sebastien Lefevre

        Annotation of multimedia data is a highly topical yet challenging problem. Indeed, Web sites such as YouTube store terabytes or even petabytes of video data. To enable user navigation or retrieval in these huge databases, automatic processing is mandatory. Among such processes, automatic video annotation seeks to label each video with predefined concepts, which can then be searched for by the end user. In this paper, the authors propose a new paradigm for video annotation that deals more effectively with multi-label annotation (since a video usually concerns several topics). Instead of using a set of binary classifiers, one per concept, or merging these binary classifiers in a post-processing step called context-based conceptual fusion, they introduce an integrated multi-label approach that explicitly models both the concepts themselves and their interactions, avoiding reliance on premature decisions made by binary classifiers. Their experiments, performed on the Text Retrieval Conference Video Retrieval Evaluation (TRECVID) dataset, show the relevance of this approach, but also point to the need for efficient algorithms, since the proposed solution is still very far from real time. To deliver automatic solutions to the market, research efforts should now focus not only on the reliability of automatic annotation but also, and perhaps more, on the efficiency of these systems. This issue should be tackled more often by research teams in the field of multimedia indexing and retrieval.
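
        The integrated multi-label idea summarized above can be made concrete with a small, purely illustrative sketch (assumed notation, not the paper's actual model): a binary label vector over K concepts is scored with per-concept terms plus pairwise correlation terms, and prediction selects the jointly highest-scoring vector. The concept names, scores, and exhaustive search below are hypothetical.

            # Hedged sketch of correlative multilabel scoring: unary (per-concept) terms
            # plus pairwise (concept-correlation) terms, with exhaustive prediction over
            # all 2^K label vectors (feasible only for small K; illustration only).
            import itertools
            import numpy as np

            def score(y, unary, pairwise):
                """unary[k]: evidence for concept k; pairwise[j, k]: reward when j and k co-occur."""
                y = np.asarray(y, dtype=float)
                return float(unary @ y + 0.5 * y @ pairwise @ y)

            def predict(unary, pairwise):
                """Return the jointly best binary label vector over all assignments."""
                K = len(unary)
                best = max(itertools.product([0, 1], repeat=K),
                           key=lambda y: score(y, unary, pairwise))
                return np.array(best)

            # Hypothetical example with three concepts: "outdoor", "sports", "studio".
            unary = np.array([0.8, -0.1, -0.6])        # per-concept classifier evidence
            pairwise = np.array([[0.0, 0.7, -0.9],     # outdoor and sports tend to co-occur;
                                 [0.7, 0.0, -0.5],     # outdoor and studio rarely do
                                 [-0.9, -0.5, 0.0]])
            print(predict(unary, pairwise))            # [1 1 0]: sports outdoors, not in a studio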


        • Published in

          ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 5, Issue 1
          October 2008
          201 pages
          ISSN: 1551-6857
          EISSN: 1551-6865
          DOI: 10.1145/1404880

          Copyright © 2008 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 30 October 2008
          • Accepted: 1 July 2008
          • Revised: 1 May 2008
          • Received: 1 January 2008
          Published in TOMM Volume 5, Issue 1


          Qualifiers

          • research-article
          • Research
          • Refereed
