Abstract
Over the last three years we conducted several information retrieval evaluation series with more than 180 LIS students, who made relevance assessments on the output of three specific retrieval services. In this study we focus not on the retrieval performance of our system but on the relevance assessments themselves and on inter-assessor reliability. To quantify agreement we apply Fleiss’ Kappa and Krippendorff’s Alpha. Comparing the two statistical measures, Kappa values were 0.37 on average and Alpha values 0.15. We then use the two agreement measures to drop overly unreliable assessments from our data set. Comparing the unfiltered and the filtered data sets, we observe a root mean square error between 0.02 and 0.12. We see this as a clear indicator that assessor disagreement affects the reliability of retrieval evaluations, and we suggest either not working with unfiltered results or clearly documenting the disagreement rates.
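As a rough illustration of the methodology the abstract describes, the following R sketch computes the two agreement coefficients with the irr package and R environment cited in the reference list. The ratings matrix, the number of documents and assessors, and the evaluation scores used for the RMSE comparison are invented for illustration only; they are not the paper's data.

# Hypothetical example: 3 assessors judge 8 documents as relevant (1) or not (0).
library(irr)  # provides kappam.fleiss() and kripp.alpha()

ratings <- matrix(c(1, 1, 0,
                    1, 1, 1,
                    0, 0, 1,
                    1, 0, 0,
                    0, 0, 0,
                    1, 1, 1,
                    0, 1, 0,
                    1, 1, 1),
                  ncol = 3, byrow = TRUE)  # rows = documents, cols = assessors

# Fleiss' Kappa expects subjects in rows and raters in columns.
kappa <- kappam.fleiss(ratings)

# Krippendorff's Alpha expects raters in rows, hence the transpose;
# the relevance scale is treated as nominal here.
alpha <- kripp.alpha(t(ratings), method = "nominal")

kappa$value  # chance-corrected agreement across all assessors
alpha$value  # reliability coefficient for the same judgments

# Root mean square error between evaluation scores computed from the
# unfiltered and the filtered assessment sets (values are made up).
unfiltered <- c(0.42, 0.55, 0.61)
filtered   <- c(0.40, 0.50, 0.70)
sqrt(mean((unfiltered - filtered)^2))

For interpretation, the scale of Landis and Koch cited below would classify an average Kappa of 0.37 as no more than “fair” agreement, which is consistent with the paper’s recommendation to document disagreement rates.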
References
Alonso, O., Schenkel, R., Theobald, M.: Crowdsourcing Assessments for XML Ranked Retrieval. In: Gurrin, C., He, Y., Kazai, G., Kruschwitz, U., Little, S., Roelleke, T., Rüger, S., van Rijsbergen, K. (eds.) ECIR 2010. LNCS, vol. 5993, pp. 602–606. Springer, Heidelberg (2010)
Arguello, J., Diaz, F., Callan, J., Carterette, B.: A Methodology for Evaluating Aggregated Search Results. In: Clough, P., Foley, C., Gurrin, C., Jones, G.J.F., Kraaij, W., Lee, H., Murdock, V. (eds.) ECIR 2011. LNCS, vol. 6611, pp. 141–152. Springer, Heidelberg (2011)
Artstein, R., Poesio, M.: Inter-coder agreement for computational linguistics. Comput. Linguist. 34(4), 555–596 (2008)
Bailey, P., Craswell, N., Soboroff, I., Thomas, P., de Vries, A.P., Yilmaz, E.: Relevance assessment: are judges exchangeable and does it matter. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 667–674. ACM, New York (2008)
Bermingham, A., Smeaton, A.F.: A study of inter-annotator agreement for opinion retrieval. In: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 784–785. ACM, New York (2009)
Borlund, P.: The concept of relevance in IR. Journal of the American Society for Information Science and Technology 54(10), 913–925 (2003)
Fleiss, J.L.: Measuring nominal scale agreement among many raters. Psychological Bulletin 76(5), 378–382 (1971)
Gamer, M., Lemon, J., Fellows, I., Singh, P.: irr: Various Coefficients of Interrater Reliability and Agreement. R package (2010)
Greve, W., Wentura, D.: Wissenschaftliche Beobachtung: eine Einführung. Beltz, PsychologieVerlagsUnion, Weinheim (1997)
Krippendorff, K.: Computing Krippendorff’s Alpha-Reliability (2011), http://repository.upenn.edu/asc_papers/43
Krippendorff, K.: Reliability in Content Analysis: Some Common Misconceptions and Recommendations. Human Communication Research 30(3), 411–433 (2004)
Landis, J.R., Koch, G.G.: The measurement of observer agreement for categorical data. Biometrics 33(1), 159–174 (1977)
Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008)
Mayr, P., Mutschke, P., Petras, V., Schaer, P., Sure, Y.: Applying Science Models for Search. In: 12. Internationales Symposium für Informationswissenschaft (ISI) (2011)
Mutschke, P., Mayr, P., Schaer, P., Sure, Y.: Science models as value-added services for scholarly information systems. Scientometrics 89(1), 349–364 (2011)
Piwowarski, B., Trotman, A., Lalmas, M.: Sound and complete relevance assessment for XML retrieval. ACM Trans. Inf. Syst. 27(1), 1:1–1:37 (2008)
R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2011)
Schaer, P., Mayr, P., Mutschke, P.: Implications of Inter-Rater Agreement on a Student Information Retrieval Evaluation. In: Atzmüller, M., Benz, D., Hotho, A., Stumme, G. (eds.) Proceedings of LWA 2010 Workshop-Woche: Lernen, Wissen & Adaptivitaet, Kassel, Germany (2010)
Song, R., Guo, Q., Zhang, R., Xin, G., Wen, J.-R., Yu, Y., Hon, H.-W.: Select-the-Best-Ones: A new way to judge relative relevance. Inf. Process. Manage. 47(1), 37–52 (2011)
Voorhees, E.M.: Topic set size redux. In: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 806–807. ACM, New York (2009)
Voorhees, E.M.: Variations in relevance judgments and the measurement of retrieval effectiveness. Inf. Process. Manage. 36(5), 697–716 (2000)
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Schaer, P. (2012). Better than Their Reputation? On the Reliability of Relevance Assessments with Students. In: Catarci, T., Forner, P., Hiemstra, D., Peñas, A., Santucci, G. (eds) Information Access Evaluation. Multilinguality, Multimodality, and Visual Analytics. CLEF 2012. Lecture Notes in Computer Science, vol 7488. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33247-0_14
DOI: https://doi.org/10.1007/978-3-642-33247-0_14