Multiple Testing for IR and Recommendation System Experiments

Ihemelandu, Ngozi; Ekstrand, Michael D.

doi:10.1007/978-3-031-56063-7_37

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14610)

Included in the following conference series:

European Conference on Information Retrieval

Abstract

While there has been significant research on statistical techniques for comparing two information retrieval (IR) systems, many IR experiments test more than two systems. This can lead to inflated false discoveries due to the multiple-comparison problem (MCP). A few IR studies have investigated multiple comparison procedures; these studies mostly use TREC data and control the familywise error rate. In this study, we extend their investigation to include recommendation system evaluation data as well as multiple comparison procedures that control the false discovery rate (FDR).
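
To make the distinction in the abstract concrete, the following is a minimal sketch, not the authors' implementation, of two familywise-error-rate procedures (Bonferroni and Holm) and one FDR procedure (Benjamini-Hochberg), written in plain Python. The function names, the significance level of 0.05, and the ten p-values are illustrative assumptions and do not come from the paper. The first function also assumes independent tests, which is a simplification.

```python
def familywise_error_rate(m, alpha=0.05):
    """Chance of at least one false positive across m tests.

    Assumes the m tests are independent and all null hypotheses are true.
    """
    return 1.0 - (1.0 - alpha) ** m


def bonferroni(pvals, alpha=0.05):
    """Reject H_i when p_i <= alpha / m (controls the familywise error rate)."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def holm(pvals, alpha=0.05):
    """Holm step-down procedure (controls the familywise error rate)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p
    reject = [False] * m
    for rank, i in enumerate(order):  # rank = 0 for the smallest p-value
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # stop at the first p-value that fails its threshold
    return reject


def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0  # largest rank k with p_(k) <= (k / m) * alpha
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            k_max = rank
    reject = [False] * m
    for i in order[:k_max]:  # reject the k_max smallest p-values
        reject[i] = True
    return reject


if __name__ == "__main__":
    # Hypothetical p-values from ten pairwise system comparisons.
    pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print("FWER for 10 independent tests:", round(familywise_error_rate(10), 4))
    print("unadjusted rejections:", sum(p <= 0.05 for p in pvals))
    print("Bonferroni rejections:", sum(bonferroni(pvals)))
    print("Holm rejections:", sum(holm(pvals)))
    print("Benjamini-Hochberg rejections:", sum(benjamini_hochberg(pvals)))
```

On these made-up p-values, the unadjusted tests reject five hypotheses, Bonferroni and Holm each reject one, and Benjamini-Hochberg rejects two. This reflects the usual trade-off: FDR control tolerates a bounded proportion of false discoveries in exchange for more power than familywise control.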

Partly supported by the National Science Foundation under Grant 17-51278.

Author information

Authors and Affiliations

Boise State University, Boise, ID, 83725, USA
Ngozi Ihemelandu
Department of Information Science, Drexel University, Philadelphia, PA, 19104, USA
Michael D. Ekstrand

Corresponding author

Correspondence to Ngozi Ihemelandu.

Editor information

Editors and Affiliations

Georgetown University, Washington, DC, USA
Nazli Goharian
University of Pisa, Pisa, Italy
Nicola Tonellotto
King's College London, London, UK
Yulan He
University College London, London, UK
Aldo Lipani
University of Glasgow, Glasgow, UK
Graham McDonald
University of Glasgow, Glasgow, UK
Craig Macdonald
University of Glasgow, Glasgow, UK
Iadh Ounis

About this paper

Cite this paper

Ihemelandu, N., Ekstrand, M.D. (2024). Multiple Testing for IR and Recommendation System Experiments. In: Goharian, N., et al. Advances in Information Retrieval. ECIR 2024. Lecture Notes in Computer Science, vol 14610. Springer, Cham. https://doi.org/10.1007/978-3-031-56063-7_37

DOI: https://doi.org/10.1007/978-3-031-56063-7_37
Published: 23 March 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-56062-0
Online ISBN: 978-3-031-56063-7
eBook Packages: Computer Science, Computer Science (R0)
