column

Document Language Models, Query Models, and Risk Minimization for Information Retrieval

Published: 02 August 2017

Abstract

We present a framework for information retrieval that combines document models and query models using a probabilistic ranking function based on Bayesian decision theory. The framework suggests an operational retrieval model that extends recent developments in the language modeling approach to information retrieval. A language model for each document is estimated, as well as a language model for each query, and the retrieval problem is cast in terms of risk minimization. The query language model can be exploited to model user preferences, the context of a query, synonymy, and word senses. While recent work has incorporated word translation models for this purpose, we introduce a new method using Markov chains defined on a set of documents to estimate the query models. The Markov chain method has connections to algorithms from link analysis and social networks. The new approach is evaluated on TREC collections and compared to the basic language modeling approach and vector space models together with query expansion using Rocchio. Significant improvements are obtained over standard query expansion methods for strong baseline TF-IDF systems, with the greatest improvements attained for short queries on Web data.
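The pipeline the abstract describes (estimate a language model per document, estimate a query model, then rank by comparing the two) can be illustrated with a small Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the abstract: the function names are hypothetical, documents are smoothed with Jelinek-Mercer interpolation, ranking uses negative KL divergence between query and document models as one plausible loss under risk minimization, and the Markov chain is reduced to a single word-to-document-to-word step with a uniform document prior.

```python
import math
from collections import Counter


def unigram_model(tokens):
    """Maximum-likelihood unigram distribution over a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def smoothed_doc_model(doc_tokens, collection_model, lam=0.5):
    """Jelinek-Mercer smoothing (assumed): mix document and collection models."""
    doc_model = unigram_model(doc_tokens)
    return {w: (1 - lam) * doc_model.get(w, 0.0) + lam * p
            for w, p in collection_model.items()}


def expand_query(query_model, docs, alpha=0.5):
    """One word -> document -> word random-walk step (simplified stand-in
    for a Markov chain query model), interpolated with the original query."""
    doc_models = [unigram_model(d) for d in docs]
    walk = Counter()
    for w, pw in query_model.items():
        # p(d | w) is proportional to p(w | d) under a uniform document prior.
        weights = [m.get(w, 0.0) for m in doc_models]
        z = sum(weights)
        if z == 0:
            continue  # query word appears in no document
        for m, wt in zip(doc_models, weights):
            for v, pv in m.items():
                walk[v] += pw * (wt / z) * pv
    total = sum(walk.values())
    if total == 0:
        return dict(query_model)
    words = set(query_model) | set(walk)
    return {w: alpha * query_model.get(w, 0.0) + (1 - alpha) * walk[w] / total
            for w in words}


def kl_score(query_model, doc_model):
    """Negative KL divergence from query model to document model.
    Words with zero document probability are skipped (an assumption)."""
    score = 0.0
    for w, qw in query_model.items():
        dw = doc_model.get(w, 0.0)
        if qw > 0 and dw > 0:
            score -= qw * math.log(qw / dw)
    return score


def rank(query_tokens, docs, lam=0.5, alpha=0.5):
    """Return document indices ordered from best to worst score."""
    collection_model = unigram_model([t for d in docs for t in d])
    query_model = expand_query(unigram_model(query_tokens), docs, alpha)
    scores = [kl_score(query_model, smoothed_doc_model(d, collection_model, lam))
              for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    toy_docs = [["star", "wars", "film"],
                ["star", "galaxy", "telescope"],
                ["stock", "market", "price"]]
    print(rank(["galaxy"], toy_docs))
```

In this toy setup the one-word query "galaxy" picks up weight on "star" and "telescope" from the document that contains it, so a document sharing only "star" can be separated from an unrelated one. That is the kind of effect on short queries the abstract points to, though this sketch makes no claim about the paper's actual estimator or results.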


Published in

ACM SIGIR Forum, Volume 51, Issue 2
SIGIR Test-of-Time Awardees 1978-2001
July 2017
276 pages
ISSN: 0163-5840
DOI: 10.1145/3130348
Editors: Donna Harman, Diane Kelly

Copyright © 2017. Copyright is held by the owner/author(s).

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery, New York, NY, United States
