
Validating Synthetic Usage Data in Living Lab Environments

Published: 6 March 2024

Abstract

Evaluating retrieval performance without editorial relevance judgments is challenging; user interactions can serve as relevance signals instead. Living labs offer small-scale platforms a way to validate information retrieval systems with real users. If enough user interaction data are available, click models can be parameterized from historical sessions and used to evaluate systems before users are exposed to experimental rankings. However, interaction data are sparse in living labs, and little is known about how click models can be validated for reliable user simulations when click data are available only in moderate amounts.
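
To make the idea concrete, here is a minimal sketch (not the authors' released code) of how a simple click model can be parameterized from historical sessions and then used to generate synthetic clicks on a new ranking. The session format, function names, and the choice of a rank-based click-through-rate model are illustrative assumptions.

# Minimal sketch of parameterizing a simple click model from logged sessions.
# Assumptions: each logged session is a list of 0/1 click indicators, one per
# rank of the result list shown to the user; the model is a rank-based
# click-through-rate model, i.e., it estimates P(click | rank).
from collections import defaultdict
import random

def fit_rctr(sessions, max_rank=10):
    """Estimate P(click | rank) by pooling clicks over all logged sessions."""
    clicks = defaultdict(int)
    impressions = defaultdict(int)
    for session in sessions:
        for rank, clicked in enumerate(session[:max_rank]):
            impressions[rank] += 1
            clicks[rank] += clicked
    return {rank: clicks[rank] / impressions[rank] for rank in impressions}

def simulate_clicks(click_probs, ranking, rng):
    """Sample synthetic clicks for a (possibly unseen) ranking of documents."""
    return [doc for rank, doc in enumerate(ranking)
            if rng.random() < click_probs.get(rank, 0.0)]

# Hypothetical usage: three logged sessions over a three-item result list.
model = fit_rctr([[1, 0, 0], [0, 1, 0], [1, 0, 1]])
synthetic_session = simulate_clicks(model, ["d1", "d2", "d3"], random.Random(7))

Once fitted, such a model can replay arbitrarily many synthetic sessions against a candidate ranking, which is what makes pre-deployment evaluation possible in a data-sparse living lab.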

This work introduces an evaluation approach for validating synthetic usage data generated by click models in data-sparse human-in-the-loop environments like living labs. We ground our methodology in the click model's estimates of a system ranking compared to a reference ranking whose relative performance is known. Our experiments compare different click models with respect to their reliability and robustness as more session log data become available. In our setup, simple click models can reliably determine the relative system performance with as few as 20 logged sessions for 50 queries. More complex click models require more session data for reliable estimates, but they are the better choice in simulated interleaving experiments once enough session data are available. While it is easy for click models to distinguish between diverse systems, it is harder to reproduce the ranking of systems that use the same retrieval algorithm with different interpolation weights. Our setup is entirely open, and we share the code to reproduce the experiments.
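
As an illustration of the simulated interleaving experiments mentioned above, the sketch below runs team-draft interleaving between two systems' rankings and lets synthetic clicks decide which system wins each session. It is again an assumption-laden illustration rather than the authors' implementation, and it reuses simulate_clicks and the fitted model from the previous sketch; the document IDs are hypothetical.

# Sketch of a simulated team-draft interleaving experiment. Synthetic clicks
# are credited to the system that contributed the clicked document, and wins
# are counted over many simulated sessions.
import random

def team_draft_interleave(ranking_a, ranking_b, rng):
    """Merge two rankings round by round: a coin flip decides which system
    picks first, and each pick is that system's best not-yet-used document."""
    interleaved, team = [], {}

    def pick(label, ranking):
        for doc in ranking:
            if doc not in team:
                interleaved.append(doc)
                team[doc] = label
                return True
        return False

    while True:
        order = [("A", ranking_a), ("B", ranking_b)]
        rng.shuffle(order)
        picked = [pick(label, ranking) for label, ranking in order]
        if not any(picked):
            break
    return interleaved, team

def simulated_interleaving(ranking_a, ranking_b, click_probs, n_sessions=1000):
    """Count per-session wins: the system whose documents attract more
    synthetic clicks wins the session; equal credit counts as a tie."""
    rng = random.Random(42)
    wins = {"A": 0, "B": 0, "tie": 0}
    for _ in range(n_sessions):
        interleaved, team = team_draft_interleave(ranking_a, ranking_b, rng)
        credit = {"A": 0, "B": 0}
        for doc in simulate_clicks(click_probs, interleaved, rng):
            credit[team[doc]] += 1
        if credit["A"] == credit["B"]:
            wins["tie"] += 1
        else:
            wins["A" if credit["A"] > credit["B"] else "B"] += 1
    return wins

# Hypothetical usage with the model fitted in the previous sketch:
outcome = simulated_interleaving(["d1", "d2", "d3"], ["d3", "d1", "d4"], model)

Comparing the simulated win counts against a reference ranking with known relative performance is the kind of validation check the paper's methodology is built around.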


Published in

Journal of Data and Information Quality, Volume 16, Issue 1
March 2024, 187 pages
ISSN: 1936-1955
EISSN: 1936-1963
DOI: 10.1145/3613486

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 6 March 2024
• Online AM: 24 September 2023
• Accepted: 30 August 2023
• Revised: 21 July 2023
• Received: 14 March 2023
