ABSTRACT
There is great interest in developing effectiveness measures that model user behavior in order to better capture the utility of a system to its users. These measures are often formulated as a sum, over ranks, of the product of a discount function of rank and a gain function mapping relevance assessments to numeric utility values. We develop a conceptual framework for analyzing such effectiveness measures, classifying members of this broad family into four distinct sub-families, each of which reflects a different notion of system utility. Within this framework we can hypothesize about the properties such a measure should have and test those hypotheses against user and system data. Along the way we present a collection of novel results about specific measures and the relationships between them.
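The "sum over the product of a discount function of ranks and a gain function" form described above can be sketched as follows. This is a minimal illustration, not the paper's own formulation; the particular gain (exponential) and discount (log-harmonic) choices shown are the standard DCG instance of the family, used here only as an example.

```python
import math


def graded_effectiveness(rels, gain, discount):
    """Generic measure of the family: sum over ranks r of
    discount(r) * gain(relevance of the document at rank r)."""
    return sum(discount(r) * gain(rel) for r, rel in enumerate(rels, start=1))


# DCG as one instance: exponential gain, logarithmic rank discount.
# Relevance grades for a hypothetical ranked list of four documents.
dcg = graded_effectiveness(
    [3, 2, 0, 1],
    gain=lambda rel: 2 ** rel - 1,
    discount=lambda r: 1.0 / math.log2(r + 1),
)
```

Swapping in a different discount (e.g. the geometric persistence discount of rank-biased precision) or a different gain yields other members of the same family, which is what makes the family a natural unit of analysis.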
Index Terms
- System effectiveness, user models, and user utility: a conceptual framework for investigation