Abstract
Knowledge and information systems are usually evaluated by labeling the relevance of results returned for a sample of user queries. In practical systems such as search engines, this measurement must be repeated continuously, for example daily or weekly. This creates a trade-off between (a) representativeness of the query sample with respect to the product's current query traffic; (b) labeling cost: if the query sample is kept unchanged, results stay similar and their labels can be reused; and (c) overfitting caused by continuous use of the same query sample. In this paper, we explicitly formulate this trade-off, propose two new variants of simple and weighted random sampling, stable and semi-stable, and show that they outperform existing approaches in continuous-usage settings, including monitoring and debugging a search engine or comparing ranker candidates.
Author information
Contributions
N.A. wrote the main manuscript text. N.A., D.C., and N.K. worked on the experiments (which led to the tables and figures). All authors contributed to the method ideas and reviewed the manuscript.
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to disclose. The datasets generated and/or analyzed during the current study are not publicly available due to the commercial nature of Web search engine (Microsoft Bing) logs.
Additional information
Nikita Astrakhantsev: Currently a machine learning engineer at Dropbox.
About this article
Cite this article
Astrakhantsev, N., Chittajallu, D.R., Kaushal, N. et al. Stable and semi-stable sampling approaches for continuously used samples. Knowl Inf Syst 65, 3251–3271 (2023). https://doi.org/10.1007/s10115-022-01806-1
DOI: https://doi.org/10.1007/s10115-022-01806-1