Abstract
Relevance assessments are the cornerstone of Information Retrieval evaluation. Yet, there is only a limited understanding of how assessment disagreement influences the reliability of the evaluation in terms of system rankings. In this paper, we examine the role of assessor type (expert vs. layperson), payment level (paid vs. unpaid), query variations, and relevance dimensions (topicality and understandability), and their influence on system evaluation in the presence of disagreements across assessments obtained in the different settings. The analysis is carried out in the context of the CLEF 2015 eHealth Task 2 collection and shows that disagreements between assessors belonging to the same group have little impact on evaluation. It also shows, however, that assessment disagreement across settings has a major impact on evaluation when topical relevance is considered, but no impact when understandability assessments are considered.
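The methodological question behind the abstract, whether disagreement between two sets of assessments changes which systems come out on top, is commonly quantified with an agreement statistic over the judgments and a rank correlation over the resulting system scores. Below is a minimal sketch of that idea; the qrels, runs, and P@1 metric are hypothetical stand-ins for illustration only (the paper itself uses the CLEF 2015 eHealth Task 2 collection and its own measures), and it assumes scipy and scikit-learn are available.

```python
from scipy.stats import kendalltau
from sklearn.metrics import cohen_kappa_score

# Hypothetical binary judgments from two assessor groups
# over the same (query, document) pairs.
qrels_expert = {("q1", "d1"): 1, ("q1", "d2"): 0, ("q2", "d3"): 1, ("q2", "d4"): 0}
qrels_lay    = {("q1", "d1"): 0, ("q1", "d2"): 1, ("q2", "d3"): 1, ("q2", "d4"): 0}

# Inter-assessor agreement on the raw judgments.
pairs = sorted(qrels_expert)
print("Assessor agreement (Cohen's kappa):",
      cohen_kappa_score([qrels_expert[p] for p in pairs],
                        [qrels_lay[p] for p in pairs]))

def precision_at_1(run, qrels):
    """Mean P@1 of a run (query -> ranked doc list) under a qrels set."""
    return sum(qrels.get((q, docs[0]), 0) for q, docs in run.items()) / len(run)

# Hypothetical runs from three systems.
runs = {
    "sysA": {"q1": ["d1", "d2"], "q2": ["d3", "d4"]},
    "sysB": {"q1": ["d2", "d1"], "q2": ["d4", "d3"]},
    "sysC": {"q1": ["d1", "d2"], "q2": ["d4", "d3"]},
}

# Score every system under each qrels set, then compare the induced
# system rankings: a high tau means the evaluation outcome is stable
# despite assessor disagreement.
scores_expert = [precision_at_1(run, qrels_expert) for run in runs.values()]
scores_lay    = [precision_at_1(run, qrels_lay) for run in runs.values()]
tau, _ = kendalltau(scores_expert, scores_lay)
print("System-ranking stability (Kendall's tau):", tau)
```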
Acknowledgements
This work has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 644753 (KConnect), and from the Austrian Science Fund (FWF) projects P25905-N23 (ADmIRE) and I1094-N23 (MUCKE).
Copyright information
© 2016 Springer International Publishing Switzerland
Cite this paper
Palotti, J., Zuccon, G., Bernhardt, J., Hanbury, A., Goeuriot, L. (2016). Assessors Agreement: A Case Study Across Assessor Type, Payment Levels, Query Variations and Relevance Dimensions. In: Fuhr, N., et al. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2016. Lecture Notes in Computer Science, vol. 9822. Springer, Cham. https://doi.org/10.1007/978-3-319-44564-9_4
Print ISBN: 978-3-319-44563-2
Online ISBN: 978-3-319-44564-9