ABSTRACT
The nDCG measure has proven to be a popular measure of retrieval effectiveness that exploits graded relevance judgments. However, many different instantiations of nDCG exist, depending on the largely arbitrary choice of (1) the gain function, which dictates the relative value of documents of different relevance grades, and (2) the discount function, which weights gain values by the rank at which they appear. In this work we discuss how to empirically derive gain and discount functions that optimize the efficiency or stability of nDCG. First, we describe a variance decomposition analysis framework and an optimization procedure used to find the efficiency- or stability-optimal gain and discount functions. We then use TREC data sets to compare the optimal gain and discount functions to those that have appeared in the IR literature with respect to (a) the efficiency of the evaluation, (b) the induced ranking of systems, and (c) the discriminative power of the resulting nDCG measure.
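To make the two free parameters concrete, the following minimal Python sketch (ours, not code from the paper) parameterizes nDCG@k by arbitrary gain and discount functions. The defaults are the exponential gain 2^rel − 1 and logarithmic discount 1/log2(rank + 1) popularized in the learning-to-rank literature by Burges et al.; Järvelin and Kekäläinen's original formulation instead uses a linear gain, illustrating how different instantiations of the "same" measure arise.

```python
import math

def ndcg_at_k(run_grades, all_grades, k,
              gain=lambda rel: 2 ** rel - 1,
              discount=lambda rank: 1.0 / math.log2(rank + 1)):
    """nDCG@k parameterized by a gain and a discount function.

    run_grades: relevance grades of the retrieved documents, in rank order.
    all_grades: grades of all judged documents for the query, used to
                construct the ideal ranking that normalizes DCG.
    """
    def dcg(grades):
        return sum(gain(g) * discount(r)
                   for r, g in enumerate(grades[:k], start=1))

    ideal = dcg(sorted(all_grades, reverse=True))
    return dcg(run_grades) / ideal if ideal > 0 else 0.0

# Two instantiations of the measure can score the same run differently:
grades = [1, 2, 0, 1]          # grades of the ranked list (0-2 scale)
judged = [2, 2, 1, 1, 0, 0]    # all judged grades for the query
print(ndcg_at_k(grades, judged, k=4))                        # exponential gain
print(ndcg_at_k(grades, judged, k=4, gain=lambda rel: rel))  # linear gain
```

The stability criterion can likewise be made concrete with a generalizability-theory style variance decomposition in the spirit of Brennan's framework. The sketch below is an assumption about the general approach, not the paper's exact procedure: it decomposes a systems × topics matrix of nDCG scores into a system component (signal) and a residual component (noise), and scores a candidate gain/discount pair by the fraction of variance attributable to true system differences.

```python
import numpy as np

def stability(scores, n_topics_gen=None):
    """Generalizability ("stability") coefficient of an effectiveness measure.

    scores: (n_systems x n_topics) matrix of per-topic scores, e.g. nDCG
    values computed with a candidate gain/discount pair. Uses a two-way
    crossed system x topic decomposition; with one observation per cell,
    the system-topic interaction is confounded with residual error.
    """
    n_s, n_t = scores.shape
    n_topics_gen = n_topics_gen or n_t
    grand = scores.mean()
    sys_means = scores.mean(axis=1)
    topic_means = scores.mean(axis=0)
    ms_sys = n_t * ((sys_means - grand) ** 2).sum() / (n_s - 1)
    ms_err = (((scores - sys_means[:, None] - topic_means[None, :] + grand) ** 2).sum()
              / ((n_s - 1) * (n_t - 1)))
    var_sys = max((ms_sys - ms_err) / n_t, 0.0)  # true system-variance component
    # Fraction of score variance due to real system differences when each
    # system's score is averaged over n_topics_gen topics.
    return var_sys / (var_sys + ms_err / n_topics_gen)
```

An optimization procedure can then search over parameterized families of gain and discount functions for the pair that maximizes this coefficient (stability) or, equivalently, minimizes the number of topics needed to reach a target coefficient (efficiency).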