Better than Their Reputation? On the Reliability of Relevance Assessments with Students

  • Conference paper
Information Access Evaluation. Multilinguality, Multimodality, and Visual Analytics (CLEF 2012)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7488)

Abstract

During the last three years we conducted several information retrieval evaluation series with more than 180 LIS students, who made relevance assessments on the outcomes of three specific retrieval services. In this study we do not focus on the retrieval performance of our system but on the relevance assessments and the inter-assessor reliability. To quantify agreement we apply Fleiss’ Kappa and Krippendorff’s Alpha. Comparing the two statistical measures, Kappa values averaged 0.37 and Alpha values 0.15. We use the two agreement measures to drop overly unreliable assessments from our data set. Computing the differences between the unfiltered and the filtered data sets yields a root mean square error between 0.02 and 0.12. We see this as a clear indicator that disagreement affects the reliability of retrieval evaluations, and we suggest either not working with unfiltered results or clearly documenting the disagreement rates.
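For orientation only, the sketch below shows how agreement and error values of the kind reported in the abstract can be computed. It uses the standard textbook formula for Fleiss’ Kappa plus a plain root mean square error; the toy relevance counts and score vectors are hypothetical examples, not the paper’s data, and the snippet is not the authors’ actual evaluation pipeline (which, per the paper’s references, relied on R and the irr package).

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' Kappa for a matrix of shape (n_items, n_categories), where
    counts[i, j] is the number of assessors who judged item i as category j.
    Assumes every item was judged by the same number of assessors."""
    counts = np.asarray(counts, dtype=float)
    n_items, _ = counts.shape
    n_raters = counts[0].sum()
    # Observed agreement per item, averaged over all items.
    p_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()
    # Expected chance agreement from the marginal category proportions.
    p_j = counts.sum(axis=0) / (n_items * n_raters)
    p_e = np.square(p_j).sum()
    return (p_bar - p_e) / (1.0 - p_e)

def rmse(a, b):
    """Root mean square error between two vectors of evaluation scores."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Hypothetical counts: 5 documents, 4 student assessors, binary relevance.
counts = [[4, 0], [2, 2], [3, 1], [0, 4], [1, 3]]
print(round(fleiss_kappa(counts), 2))        # 0.33 for this toy matrix

# Hypothetical per-topic scores before and after dropping unreliable assessments.
unfiltered = [0.52, 0.47, 0.61, 0.38]
filtered = [0.50, 0.55, 0.58, 0.41]
print(round(rmse(unfiltered, filtered), 3))  # ~0.046
```

Krippendorff’s Alpha is harder to reproduce in a few lines, since it also handles missing judgments and non-nominal scales, so in practice one would rely on an established implementation rather than extending the sketch above.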

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Schaer, P. (2012). Better than Their Reputation? On the Reliability of Relevance Assessments with Students. In: Catarci, T., Forner, P., Hiemstra, D., Peñas, A., Santucci, G. (eds) Information Access Evaluation. Multilinguality, Multimodality, and Visual Analytics. CLEF 2012. Lecture Notes in Computer Science, vol 7488. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33247-0_14

  • DOI: https://doi.org/10.1007/978-3-642-33247-0_14

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-33246-3

  • Online ISBN: 978-3-642-33247-0

  • eBook Packages: Computer Science (R0)
