Abstract
Over the last three years we conducted several information retrieval evaluation series with more than 180 LIS students, who made relevance assessments on the output of three specific retrieval services. In this study we focus not on the retrieval performance of our system but on the relevance assessments themselves and on inter-assessor reliability. To quantify agreement we apply Fleiss’ Kappa and Krippendorff’s Alpha. Comparing the two statistical measures, Kappa values were 0.37 on average and Alpha values 0.15. We then use the two agreement measures to drop overly unreliable assessments from our data set. Comparing the unfiltered and the filtered data sets, we observe a root mean square error between 0.02 and 0.12. We see this as a clear indicator that assessor disagreement affects the reliability of retrieval evaluations, and we suggest either not working with unfiltered results or clearly documenting the disagreement rates.
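As a rough illustration of the methodology the abstract describes, the following R sketch computes the two agreement coefficients with the irr package and R environment cited in the reference list. The ratings matrix, the number of documents and assessors, and the evaluation scores used for the RMSE comparison are invented for illustration only; they are not the paper's data.

# Hypothetical example: 3 assessors judge 8 documents as relevant (1) or not (0).
library(irr)  # provides kappam.fleiss() and kripp.alpha()

ratings <- matrix(c(1, 1, 0,
                    1, 1, 1,
                    0, 0, 1,
                    1, 0, 0,
                    0, 0, 0,
                    1, 1, 1,
                    0, 1, 0,
                    1, 1, 1),
                  ncol = 3, byrow = TRUE)  # rows = documents, cols = assessors

# Fleiss' Kappa expects subjects in rows and raters in columns.
kappa <- kappam.fleiss(ratings)

# Krippendorff's Alpha expects raters in rows, hence the transpose;
# the relevance scale is treated as nominal here.
alpha <- kripp.alpha(t(ratings), method = "nominal")

kappa$value  # chance-corrected agreement across all assessors
alpha$value  # reliability coefficient for the same judgments

# Root mean square error between evaluation scores computed from the
# unfiltered and the filtered assessment sets (values are made up).
unfiltered <- c(0.42, 0.55, 0.61)
filtered   <- c(0.40, 0.50, 0.70)
sqrt(mean((unfiltered - filtered)^2))

For interpretation, the scale of Landis and Koch cited below would classify an average Kappa of 0.37 as no more than “fair” agreement, which is consistent with the paper’s recommendation to document disagreement rates.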
References
Alonso, O., Schenkel, R., Theobald, M.: Crowdsourcing Assessments for XML Ranked Retrieval. In: Gurrin, C., He, Y., Kazai, G., Kruschwitz, U., Little, S., Roelleke, T., Rüger, S., van Rijsbergen, K. (eds.) ECIR 2010. LNCS, vol. 5993, pp. 602–606. Springer, Heidelberg (2010)
Arguello, J., Diaz, F., Callan, J., Carterette, B.: A Methodology for Evaluating Aggregated Search Results. In: Clough, P., Foley, C., Gurrin, C., Jones, G.J.F., Kraaij, W., Lee, H., Murdock, V. (eds.) ECIR 2011. LNCS, vol. 6611, pp. 141–152. Springer, Heidelberg (2011)
Artstein, R., Poesio, M.: Inter-coder agreement for computational linguistics. Comput. Linguist. 34(4), 555–596 (2008)
Bailey, P., Craswell, N., Soboroff, I., Thomas, P., de Vries, A.P., Yilmaz, E.: Relevance assessment: are judges exchangeable and does it matter. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 667–674. ACM, New York (2008)
Bermingham, A., Smeaton, A.F.: A study of inter-annotator agreement for opinion retrieval. In: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 784–785. ACM, New York (2009)
Borlund, P.: The concept of relevance in IR. Journal of the American Society for Information Science and Technology 54(10), 913–925 (2003)
Fleiss, J.L.: Measuring nominal scale agreement among many raters. Psychological Bulletin 76(5), 378–382 (1971)
Gamer, M., Lemon, J., Fellows, I., Singh, P.: irr: Various Coefficients of Interrater Reliability and Agreement. R package (2010)
Greve, W., Wentura, D.: Wissenschaftliche Beobachtung: eine Einführung. Beltz, PsychologieVerlagsUnion, Weinheim (1997)
Krippendorff, K.: Computing Krippendorff’s Alpha-Reliability (2011), http://repository.upenn.edu/asc_papers/43
Krippendorff, K.: Reliability in Content Analysis: Some Common Misconceptions and Recommendations. Human Communication Research 30(3), 411–433 (2004)
Landis, J.R., Koch, G.G.: The measurement of observer agreement for categorical data. Biometrics 33(1), 159–174 (1977)
Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008)
Mayr, P., Mutschke, P., Petras, V., Schaer, P., Sure, Y.: Applying Science Models for Search. In: 12. Internationales Symposium für Informationswissenschaft (ISI) (2011)
Mutschke, P., Mayr, P., Schaer, P., Sure, Y.: Science models as value-added services for scholarly information systems. Scientometrics 89(1), 349–364 (2011)
Piwowarski, B., Trotman, A., Lalmas, M.: Sound and complete relevance assessment for XML retrieval. ACM Trans. Inf. Syst. 27(1), 1:1–1:37 (2008)
R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2011)
Schaer, P., Mayr, P., Mutschke, P.: Implications of Inter-Rater Agreement on a Student Information Retrieval Evaluation. In: Atzmüller, M., Benz, D., Hotho, A., Stumme, G. (eds.) Proceedings of LWA 2010 Workshop-Woche: Lernen, Wissen & Adaptivitaet, Kassel, Germany (2010)
Song, R., Guo, Q., Zhang, R., Xin, G., Wen, J.-R., Yu, Y., Hon, H.-W.: Select-the-Best-Ones: A new way to judge relative relevance. Inf. Process. Manage. 47(1), 37–52 (2011)
Voorhees, E.M.: Topic set size redux. In: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 806–807. ACM, New York (2009)
Voorhees, E.M.: Variations in relevance judgments and the measurement of retrieval effectiveness. Inf. Process. Manage. 36(5), 697–716 (2000)
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Schaer, P. (2012). Better than Their Reputation? On the Reliability of Relevance Assessments with Students. In: Catarci, T., Forner, P., Hiemstra, D., Peñas, A., Santucci, G. (eds) Information Access Evaluation. Multilinguality, Multimodality, and Visual Analytics. CLEF 2012. Lecture Notes in Computer Science, vol 7488. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33247-0_14
DOI: https://doi.org/10.1007/978-3-642-33247-0_14