ABSTRACT
The development of information retrieval systems such as search engines relies on good test collections, including assessments of retrieved content. The widely employed Cranfield paradigm dictates that the information relevant to a topic be encoded at the level of documents, thereby requiring effectively complete document relevance assessments. As this is no longer practical for modern corpora, numerous problems arise, including those of scalability, reusability, and applicability. We propose a new method for relevance assessment based on relevant information rather than relevant documents. Once the relevant 'nuggets' are collected, our matching method can assess any document for relevance with high accuracy, and therefore any retrieved list of documents can be assessed for performance. In this paper we analyze the behavior of the matching function by examining specific cases and by comparing it with other methods. We then show how these inferred relevance assessments can be used to perform IR system evaluation, with particular attention to reusability and scalability. Our main contribution is a methodology for producing test collections that are highly accurate, more complete, scalable, and reusable, that can be generated with effort comparable to existing methods, and that have great potential for future applications.
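The abstract does not specify the matching function itself, so the following is only a minimal illustrative sketch of the general idea of inferring document relevance from collected nuggets. The word-overlap similarity, the 0.6 threshold, and the helper names (`nugget_match_score`, `infer_relevance`) are assumptions introduced here for illustration, not the authors' method.

```python
# Illustrative sketch only (assumed scoring rule and threshold, not the
# paper's matching function): a document is judged relevant to a topic
# if at least one of the topic's relevant "nuggets" is mostly covered
# by the document's text.

import re


def _tokens(text):
    """Lowercase word tokens for a rough bag-of-words comparison."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def nugget_match_score(nugget, document):
    """Fraction of the nugget's tokens that also appear in the document."""
    nugget_toks = _tokens(nugget)
    if not nugget_toks:
        return 0.0
    return len(nugget_toks & _tokens(document)) / len(nugget_toks)


def infer_relevance(nuggets, document, threshold=0.6):
    """Judge a document relevant if any nugget is sufficiently covered."""
    return max(nugget_match_score(n, document) for n in nuggets) >= threshold


# Usage: once nuggets for a topic are collected, any retrieved list can be
# assessed without further manual judging (hypothetical example data).
nuggets = ["the Cranfield paradigm requires complete document judgments"]
ranked_list = [
    "Document text mentioning the Cranfield paradigm and complete document judgments",
    "An unrelated document about web spam filtering",
]
inferred_qrels = [infer_relevance(nuggets, doc) for doc in ranked_list]
print(inferred_qrels)  # [True, False]
```

In this sketch the inferred judgments play the role of qrels, so any ranked list, including one from a system that contributed no documents to the nugget collection, can be scored, which is the reusability property the abstract emphasizes.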