Abstract
We present a framework for information retrieval that combines document models and query models using a probabilistic ranking function based on Bayesian decision theory. The framework suggests an operational retrieval model that extends recent developments in the language modeling approach to information retrieval. A language model for each document is estimated, as well as a language model for each query, and the retrieval problem is cast in terms of risk minimization. The query language model can be exploited to model user preferences, the context of a query, synonymy, and word senses. While recent work has incorporated word translation models for this purpose, we introduce a new method using Markov chains defined on a set of documents to estimate the query models. The Markov chain method has connections to algorithms from link analysis and social networks. The new approach is evaluated on TREC collections and compared to the basic language modeling approach and to vector space models with query expansion using Rocchio. Significant improvements are obtained over standard query expansion methods for strong baseline TF-IDF systems, with the greatest improvements attained for short queries on Web data.
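To make the idea of comparing a query language model with document language models concrete, here is a minimal sketch. It is not the paper's implementation. It assumes that the loss is the KL divergence from the query model to each document model, which is one common instantiation in the language modeling literature and is not stated in the abstract. The Dirichlet smoothing, the add-one background model, the function names, and the toy documents are all illustrative assumptions. The Markov chain estimate of the query model is not shown; a plain maximum-likelihood query model stands in for it.

```python
import math
from collections import Counter


def document_model(tokens, vocab, background, mu):
    """Dirichlet-smoothed unigram model p(w | d) over the whole vocabulary.

    Assumption: Dirichlet smoothing with prior weight mu; the abstract does
    not say which smoothing method is used.
    """
    counts = Counter(tokens)
    n = len(tokens)
    return {w: (counts[w] + mu * background[w]) / (n + mu) for w in vocab}


def kl_divergence(p, q):
    """KL(p || q) for distributions stored as dicts; q must cover p's support."""
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)


def rank(query_tokens, docs, mu=10.0):
    """Return document ids ordered by increasing KL(query model || doc model)."""
    all_tokens = [t for tokens in docs.values() for t in tokens]
    vocab = set(all_tokens) | set(query_tokens)

    # Background (collection) model with add-one smoothing, so every
    # vocabulary word, including unseen query words, has nonzero probability.
    coll = Counter(all_tokens)
    total = len(all_tokens) + len(vocab)
    background = {w: (coll[w] + 1) / total for w in vocab}

    # Hypothetical stand-in for the query model: maximum-likelihood estimate
    # from the query text alone (no expansion).
    q_counts = Counter(query_tokens)
    q_model = {w: c / len(query_tokens) for w, c in q_counts.items()}

    scores = {
        doc_id: kl_divergence(q_model, document_model(tokens, vocab, background, mu))
        for doc_id, tokens in docs.items()
    }
    # Lower divergence means lower expected loss, so it ranks first.
    return sorted(scores, key=scores.get)


# Toy collection, invented for illustration only.
docs = {
    "d1": "markov chain query model".split(),
    "d2": "vector space model tf idf".split(),
    "d3": "cooking pasta recipe".split(),
}
ranking = rank("markov query".split(), docs)
```

In this sketch a richer query model, for example one that spreads probability onto related words, would slot in where `q_model` is built without changing the ranking step.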
Document language models, query models, and risk minimization for information retrieval
SIGIR '01: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval