DOI: 10.1145/2124295.2124343
Research Article

IR system evaluation using nugget-based test collections

Published: 08 February 2012

ABSTRACT

The development of information retrieval systems such as search engines relies on good test collections, including assessments of retrieved content. The widely employed Cranfield paradigm dictates that the information relevant to a topic be encoded at the level of documents, thereby requiring effectively complete document relevance assessments. As this is no longer practical for modern corpora, numerous problems arise, including scalability, reusability, and applicability. We propose a new method for relevance assessment based on relevant information, not relevant documents. Once the relevant 'nuggets' are collected, our matching method can assess any document for relevance with high accuracy, and so any retrieved list of documents can be assessed for performance. In this paper we analyze the performance of the matching function by examining specific cases and by comparing it with other methods. We then show how these inferred relevance assessments can be used to perform IR system evaluation, and we discuss reusability and scalability in particular. Our main contribution is a methodology for producing test collections that are highly accurate, more complete, scalable, and reusable, and that can be generated with an amount of effort similar to that of existing methods, with great potential for future applications.
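
The abstract describes the methodology only at a high level: collect relevant nuggets for each topic, then use a matching function to infer the relevance of any document, so that any ranked list can be evaluated. The sketch below illustrates that idea with a simple token-overlap matcher; the tokenizer, overlap score, threshold, and precision@k computation are illustrative assumptions, not the matching function or measures used in the paper.

    # A minimal sketch of nugget-based relevance inference, assuming a simple
    # token-overlap matcher. The scoring function, threshold, and metric below
    # are illustrative stand-ins, not the matching function from the paper.
    import re


    def tokenize(text):
        """Lowercase the text and split it into a set of alphanumeric tokens."""
        return set(re.findall(r"[a-z0-9]+", text.lower()))


    def nugget_match_score(nugget, document_tokens):
        """Fraction of the nugget's tokens that appear in the document
        (an assumed overlap measure used here for illustration)."""
        nugget_tokens = tokenize(nugget)
        if not nugget_tokens:
            return 0.0
        return len(nugget_tokens & document_tokens) / len(nugget_tokens)


    def infer_relevance(document, nuggets, threshold=0.8):
        """Infer a binary judgment: a document counts as relevant if at least
        one nugget is (approximately) contained in it."""
        document_tokens = tokenize(document)
        return any(nugget_match_score(n, document_tokens) >= threshold
                   for n in nuggets)


    def precision_at_k(ranked_docs, nuggets, k=10):
        """Score a retrieved list against the nugget set by computing
        precision@k over the inferred judgments."""
        judged = [infer_relevance(doc, nuggets) for doc in ranked_docs[:k]]
        return sum(judged) / max(len(judged), 1)

Because judgments are produced on demand from the nugget set rather than read from a fixed pool of judged documents, a retrieved list from a system that never contributed to the original assessments can still be scored, which is the reusability property the abstract emphasizes.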

    • Published in

      WSDM '12: Proceedings of the Fifth ACM International Conference on Web Search and Data Mining
      February 2012, 792 pages
      ISBN: 9781450307475
      DOI: 10.1145/2124295

      Copyright © 2012 ACM

      Publisher: Association for Computing Machinery, New York, NY, United States

      Acceptance Rates

      Overall acceptance rate: 498 of 2,863 submissions, 17%
