Abstract
This paper addresses the automatic image annotation problem and its application to multi-modal image retrieval. The contribution of our work is three-fold. (1) We propose a probabilistic semantic model in which the visual features and the textual words are connected via a hidden layer that constitutes the semantic concepts to be discovered, so as to explicitly exploit the synergy among the modalities. (2) The association of visual features and textual words is determined in a Bayesian framework, so that the confidence of the association can be provided. (3) We report an extensive evaluation of the prototype system based on the model on a large-scale, visually and semantically diverse image collection crawled from the Web. In the proposed probabilistic model, a hidden concept layer that connects the visual feature layer and the word layer is discovered by fitting a generative model to the training images and annotation words through an Expectation-Maximization (EM) based iterative learning procedure. The evaluation of the prototype system for multi-modal image retrieval, on 17,000 images and 7,736 annotation words automatically extracted from crawled Web pages, indicates that the proposed semantic model and the developed Bayesian framework are superior to a state-of-the-art peer system in the literature.
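To make the idea of a hidden concept layer fitted by EM more concrete, the following is a minimal sketch rather than the model of the paper. The abstract does not give the model's equations, so the sketch substitutes a generic aspect model (PLSA-style) over discrete co-occurrence counts: it assumes the visual features have been quantized into a small set of "visual tokens", and it assumes a symmetric parameterization P(v, w) = Σ_z P(z) P(v|z) P(w|z). The function names `fit_aspect_model` and `word_posterior`, the smoothing constant, and the count-matrix input are all hypothetical choices made for illustration.

```python
import random


def normalize(values):
    """Scale a list of non-negative numbers so that it sums to 1."""
    total = sum(values)
    return [v / total for v in values]


def fit_aspect_model(counts, num_concepts, iterations=100, seed=0):
    """Fit P(z), P(v|z), P(w|z) by EM on a visual-token/word count matrix.

    counts[v][w] is how often visual token v co-occurs with word w.
    This is an illustrative stand-in, not the authors' formulation.
    """
    rng = random.Random(seed)
    num_tokens, num_words = len(counts), len(counts[0])
    k_range = range(num_concepts)

    # Random positive initialization breaks the symmetry between concepts.
    p_z = [1.0 / num_concepts] * num_concepts
    p_v_z = [normalize([rng.random() + 0.1 for _ in range(num_tokens)])
             for _ in k_range]
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(num_words)])
             for _ in k_range]

    eps = 1e-12  # assumed smoothing constant; avoids division by zero
    for _ in range(iterations):
        acc_z = [0.0] * num_concepts
        acc_v = [[0.0] * num_tokens for _ in k_range]
        acc_w = [[0.0] * num_words for _ in k_range]
        for v in range(num_tokens):
            for w in range(num_words):
                n = counts[v][w]
                if n == 0:
                    continue
                # E-step: P(z | v, w) is proportional to P(z) P(v|z) P(w|z).
                joint = [p_z[k] * p_v_z[k][v] * p_w_z[k][w] for k in k_range]
                total = sum(joint)
                if total == 0.0:
                    continue
                for k in k_range:
                    weighted = n * joint[k] / total
                    acc_z[k] += weighted
                    acc_v[k][v] += weighted
                    acc_w[k][w] += weighted
        # M-step: re-estimate each distribution from the expected counts.
        p_z = normalize([x + eps for x in acc_z])
        p_v_z = [normalize([x + eps for x in row]) for row in acc_v]
        p_w_z = [normalize([x + eps for x in row]) for row in acc_w]
    return p_z, p_v_z, p_w_z


def word_posterior(v, p_z, p_v_z, p_w_z):
    """Return P(w | v) = sum_z P(w|z) P(z|v) for every word w.

    P(z | v) follows from Bayes' rule: proportional to P(z) P(v|z).
    """
    k_range = range(len(p_z))
    p_z_given_v = normalize([p_z[k] * p_v_z[k][v] for k in k_range])
    num_words = len(p_w_z[0])
    return [sum(p_w_z[k][w] * p_z_given_v[k] for k in k_range)
            for w in range(num_words)]
```

Given a fitted model, `word_posterior(v, p_z, p_v_z, p_w_z)` returns a probability for each candidate word, so ranking the words by that value yields an annotation together with a confidence score. This is the role the abstract assigns to the Bayesian framework, although the paper's own inference procedure may differ from this sketch.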
Cite this article
Zhang, R., Zhang, Z., Li, M. et al. A probabilistic semantic model for image annotation and multi-modal image retrieval. Multimedia Systems 12, 27–33 (2006). https://doi.org/10.1007/s00530-006-0025-1