
Validating Synthetic Usage Data in Living Lab Environments

Published: 6 March 2024

Abstract

Evaluating retrieval performance without editorial relevance judgments is challenging; user interactions can serve as relevance signals instead. Living labs offer small-scale platforms a way to validate information retrieval systems with real users. If enough user interaction data are available, click models can be parameterized from historical sessions and used to evaluate systems before users are exposed to experimental rankings. However, interaction data are sparse in living labs, and little is known about how click models can be validated for reliable user simulations when click data are available only in moderate amounts.
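
To make the idea concrete, here is a minimal sketch (not the authors' released code) of how a simple click model can be parameterized from historical sessions and then used to generate synthetic clicks on a new ranking. The session format, function names, and the choice of a rank-based click-through-rate model are illustrative assumptions.

# Minimal sketch of parameterizing a simple click model from logged sessions.
# Assumptions: each logged session is a list of 0/1 click indicators, one per
# rank of the result list shown to the user; the model is a rank-based
# click-through-rate model, i.e., it estimates P(click | rank).
from collections import defaultdict
import random

def fit_rctr(sessions, max_rank=10):
    """Estimate P(click | rank) by pooling clicks over all logged sessions."""
    clicks = defaultdict(int)
    impressions = defaultdict(int)
    for session in sessions:
        for rank, clicked in enumerate(session[:max_rank]):
            impressions[rank] += 1
            clicks[rank] += clicked
    return {rank: clicks[rank] / impressions[rank] for rank in impressions}

def simulate_clicks(click_probs, ranking, rng):
    """Sample synthetic clicks for a (possibly unseen) ranking of documents."""
    return [doc for rank, doc in enumerate(ranking)
            if rng.random() < click_probs.get(rank, 0.0)]

# Hypothetical usage: three logged sessions over a three-item result list.
model = fit_rctr([[1, 0, 0], [0, 1, 0], [1, 0, 1]])
synthetic_session = simulate_clicks(model, ["d1", "d2", "d3"], random.Random(7))

Once fitted, such a model can replay arbitrarily many synthetic sessions against a candidate ranking, which is what makes pre-deployment evaluation possible in a data-sparse living lab.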

This work introduces an evaluation approach for validating synthetic usage data generated by click models in data-sparse human-in-the-loop environments like living labs. We ground our methodology in the click model's estimates of a system ranking compared to a reference ranking whose relative performance is known. Our experiments compare different click models with respect to their reliability and robustness as more session log data become available. In our setup, simple click models can reliably determine the relative system performance with as few as 20 logged sessions for 50 queries. More complex click models require more session data for reliable estimates, but they are the better choice in simulated interleaving experiments once enough session data are available. While it is easy for click models to distinguish between diverse systems, it is harder to reproduce the ranking of systems that use the same retrieval algorithm with different interpolation weights. Our setup is entirely open, and we share the code to reproduce the experiments.
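
As an illustration of the simulated interleaving experiments mentioned above, the sketch below runs team-draft interleaving between two systems' rankings and lets synthetic clicks decide which system wins each session. It is again an assumption-laden illustration rather than the authors' implementation, and it reuses simulate_clicks and the fitted model from the previous sketch; the document IDs are hypothetical.

# Sketch of a simulated team-draft interleaving experiment. Synthetic clicks
# are credited to the system that contributed the clicked document, and wins
# are counted over many simulated sessions.
import random

def team_draft_interleave(ranking_a, ranking_b, rng):
    """Merge two rankings round by round: a coin flip decides which system
    picks first, and each pick is that system's best not-yet-used document."""
    interleaved, team = [], {}

    def pick(label, ranking):
        for doc in ranking:
            if doc not in team:
                interleaved.append(doc)
                team[doc] = label
                return True
        return False

    while True:
        order = [("A", ranking_a), ("B", ranking_b)]
        rng.shuffle(order)
        picked = [pick(label, ranking) for label, ranking in order]
        if not any(picked):
            break
    return interleaved, team

def simulated_interleaving(ranking_a, ranking_b, click_probs, n_sessions=1000):
    """Count per-session wins: the system whose documents attract more
    synthetic clicks wins the session; equal credit counts as a tie."""
    rng = random.Random(42)
    wins = {"A": 0, "B": 0, "tie": 0}
    for _ in range(n_sessions):
        interleaved, team = team_draft_interleave(ranking_a, ranking_b, rng)
        credit = {"A": 0, "B": 0}
        for doc in simulate_clicks(click_probs, interleaved, rng):
            credit[team[doc]] += 1
        if credit["A"] == credit["B"]:
            wins["tie"] += 1
        else:
            wins["A" if credit["A"] > credit["B"] else "B"] += 1
    return wins

# Hypothetical usage with the model fitted in the previous sketch:
outcome = simulated_interleaving(["d1", "d2", "d3"], ["d3", "d1", "d4"], model)

Comparing the simulated win counts against a reference ranking with known relative performance is the kind of validation check the paper's methodology is built around.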


Published in

Journal of Data and Information Quality, Volume 16, Issue 1
March 2024, 187 pages
ISSN: 1936-1955
EISSN: 1936-1963
DOI: 10.1145/3613486

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 6 March 2024
• Online AM: 24 September 2023
• Accepted: 30 August 2023
• Revised: 21 July 2023
• Received: 14 March 2023
