Abstract
Evaluating retrieval performance without editorial relevance judgments is challenging, but user interactions can serve as relevance signals instead. Living labs offer small-scale platforms a way to validate information retrieval systems with real users. If enough user interaction data are available, click models can be parameterized from historical sessions to evaluate systems before exposing users to experimental rankings. However, interaction data in living labs are sparse, and little is known about how click models can be validated for reliable user simulations when click data are available only in moderate amounts.
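As a rough illustration of what "parameterizing a click model from historical sessions" can look like, the following is a minimal Python sketch of a rank-based click-through-rate estimate. The function name fit_rank_ctr, the encoding of a session as a list of per-rank click booleans, and the add-one smoothing are all illustrative assumptions, not the paper's implementation; more complex click models additionally condition on document identity and user satisfaction.

```python
from collections import defaultdict

def fit_rank_ctr(sessions, max_rank=10):
    """Estimate rank-dependent click probabilities from logged sessions.

    Each session is a list of booleans: session[r] is True if the
    document shown at rank r was clicked. This is the simplest kind of
    click model (a per-rank CTR); it ignores which document was shown.
    """
    clicks = defaultdict(int)
    impressions = defaultdict(int)
    for session in sessions:
        for rank, clicked in enumerate(session[:max_rank]):
            impressions[rank] += 1
            clicks[rank] += int(clicked)
    # Click probability per rank, with a small prior to avoid zeros
    # when only a handful of sessions are available.
    return {r: (clicks[r] + 1) / (impressions[r] + 2) for r in impressions}

# Example: two logged sessions over a three-item ranking.
ctr = fit_rank_ctr([[True, False, False], [False, True, False]])
```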
This work introduces an evaluation approach for validating synthetic usage data generated by click models in data-sparse human-in-the-loop environments such as living labs. We ground our methodology in the click model's estimates for a system ranking compared to a reference ranking whose relative performance is known. Our experiments compare different click models in terms of their reliability and robustness as more session log data become available. In our setup, simple click models can reliably determine relative system performance with as few as 20 logged sessions for 50 queries. More complex click models require more session data for reliable estimates, but they are the better choice in simulated interleaving experiments once enough session data are available. Finally, while it is easy for click models to distinguish between diverse systems, it is harder to reproduce the ranking of systems that share the same retrieval algorithm but differ only in their interpolation weights. Our setup is entirely open, and we share the code to reproduce the experiments.
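To make the evaluation idea concrete, here is a hedged sketch of how synthetic sessions drawn from such a model could be used to compare two systems on the same query. Everything here is an assumption for illustration: simulate_session, compare_systems, and the attractiveness values 0.7 and 0.1 are toy choices, and the paper's actual experiments use interleaving with properly trained click models rather than this simplified per-query click-count comparison.

```python
import random

def simulate_session(relevances, exam_probs, attract=0.7, noise=0.1):
    """Simulate clicks for one ranked list under a position-based model:
    a document at rank r is clicked if it is examined (rank-dependent
    probability from exam_probs, e.g., the output of fit_rank_ctr above)
    and found attractive (relevance-dependent, with toy parameters)."""
    return [
        random.random() < exam_probs.get(rank, 0.0) * (attract if rel > 0 else noise)
        for rank, rel in enumerate(relevances)
    ]

def compare_systems(run_a, run_b, exam_probs, n_sessions=20):
    """Count per-session click wins for two rankings of the same query,
    mirroring the idea of estimating relative system performance from
    synthetic usage data (20 sessions, as in the setup described above)."""
    wins = {"A": 0, "B": 0, "tie": 0}
    for _ in range(n_sessions):
        a = sum(simulate_session(run_a, exam_probs))
        b = sum(simulate_session(run_b, exam_probs))
        wins["A" if a > b else "B" if b > a else "tie"] += 1
    return wins

# Example: system A ranks the only relevant document first, system B last.
print(compare_systems([1, 0, 0], [0, 0, 1], ctr))
```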