Abstract
Automatic video annotation is an important ingredient for semantic-level video browsing, search, and navigation, and it has received much attention in recent years. Research on this topic has evolved through two paradigms. In the first paradigm, each concept is annotated individually by a pre-trained binary classifier. However, this approach ignores the rich information shared between video concepts and achieves only limited success. The second paradigm, which evolved from the first, adds an extra step on top of the individual classifiers to fuse the multiple concept detections. However, the performance of these methods can be degraded when errors made in the first step propagate to the second, fusion step. In this article, a third paradigm for video annotation is proposed to address these problems. The proposed Correlative Multilabel (CML) method simultaneously annotates the concepts and models the correlations between them in a single step, and thereby benefits from the complementary information carried by different labels. Furthermore, since video clips are composed of temporally ordered frame sequences, we extend the proposed method to exploit the rich temporal information in videos. Specifically, a temporal kernel is incorporated into the CML method based on the discriminative information between Hidden Markov Models (HMMs) that are learned from the videos. We compare the proposed approach with state-of-the-art approaches from the first and second paradigms on the widely used TRECVID data set, and the proposed method achieves superior performance.
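To make the contrast between per-concept classification and one-step joint annotation concrete, here is a minimal sketch in Python. It is not the CML formulation itself, which the abstract does not spell out. It assumes a generic joint scoring function with one linear term per label plus pairwise label-interaction terms, and it finds the best label vector by brute force. The names `joint_score`, `annotate`, `w_unary`, and `w_pair` are hypothetical, and the weights are made-up toy values.

```python
from itertools import product


def joint_score(x, y, w_unary, w_pair):
    """Score a full label vector y (entries in {-1, +1}) for features x.

    Assumed form (illustrative only): per-label linear terms plus
    pairwise terms that reward or penalize label combinations.
    """
    k = len(y)
    unary = sum(
        y[p] * sum(w * xi for w, xi in zip(w_unary[p], x)) for p in range(k)
    )
    pair = sum(
        w_pair[p][q] * y[p] * y[q] for p in range(k) for q in range(p + 1, k)
    )
    return unary + pair


def annotate(x, w_unary, w_pair):
    """Return the label vector with the highest joint score.

    Brute force over all 2^k label vectors; only viable for small k.
    """
    k = len(w_unary)
    return max(
        product((-1, 1), repeat=k),
        key=lambda y: joint_score(x, y, w_unary, w_pair),
    )


# Toy example: label 0 is strongly supported by the features,
# label 1 is weakly rejected when considered on its own.
x = [1.0, 0.0]
w_unary = [[2.0, 0.0], [-0.1, 0.0]]
no_corr = [[0.0, 0.0], [0.0, 0.0]]
pos_corr = [[0.0, 1.0], [0.0, 0.0]]  # labels 0 and 1 tend to co-occur

independent = annotate(x, w_unary, no_corr)
correlated = annotate(x, w_unary, pos_corr)
print(independent, correlated)
```

With the pairwise weights set to zero, each label is decided on its own evidence. With a positive weight between the two labels, the confident detection of the first label pulls the weakly rejected second label to positive, which is the kind of compensation between labels that the abstract describes.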
Index Terms
- Correlative multilabel video annotation with temporal kernels