Stable and semi-stable sampling approaches for continuously used samples

  • Regular Paper
  • Knowledge and Information Systems

Abstract

Knowledge and information systems are usually evaluated by labeling the relevance of the results returned for a sample of user queries. In practical systems such as search engines, this measurement must be repeated continuously, for example daily or weekly. This creates a trade-off among (a) the representativeness of the query sample with respect to the product's current query traffic; (b) labeling cost, since keeping the same query sample yields similar results whose labels can be reused; and (c) overfitting caused by continued use of the same query sample. In this paper, we explicitly formulate this trade-off, propose two new variants of simple and weighted random sampling, stable and semi-stable, and show that they outperform existing approaches in continuous-usage settings, such as monitoring or debugging a search engine and comparing ranker candidates.
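To make the idea of a sample that stays mostly the same across refreshes concrete, the sketch below shows one common way to get that behavior: give each query a deterministic pseudo-random key derived from a hash, then keep the queries with the best keys. This is an illustrative assumption, not the construction used in the paper, whose details are not given on this page. The function names (`stable_key`, `stable_sample`, `stable_weighted_sample`), the `seed` parameter, and the toy query strings are all hypothetical. The weighted variant uses the well-known key form u ** (1 / w) for weighted sampling without replacement, with u taken from the hash instead of a fresh random draw.

```python
import hashlib


def stable_key(query: str, seed: str = "demo-seed") -> float:
    """Map a query to a deterministic pseudo-random number in (0, 1]."""
    digest = hashlib.sha256(f"{seed}:{query}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer and scale it.
    value = int.from_bytes(digest[:8], "big")
    return (value + 1) / 2**64


def stable_sample(queries, n, seed="demo-seed"):
    """Unweighted sample: keep the n distinct queries with the smallest keys."""
    ranked = sorted(set(queries), key=lambda q: stable_key(q, seed))
    return set(ranked[:n])


def stable_weighted_sample(weights, n, seed="demo-seed"):
    """Weighted sample: key = u ** (1 / w), keep the n largest keys.

    `weights` maps each query to a positive weight (for example, its
    traffic count); queries with non-positive weight are skipped.
    """
    keyed = {
        q: stable_key(q, seed) ** (1.0 / w)
        for q, w in weights.items()
        if w > 0
    }
    ranked = sorted(keyed, key=keyed.get, reverse=True)
    return set(ranked[:n])


if __name__ == "__main__":
    # Toy populations: the second refresh adds new queries to the first.
    day1 = [f"q{i}" for i in range(200)]
    day2 = day1 + [f"new{i}" for i in range(50)]
    s1 = stable_sample(day1, 20)
    s2 = stable_sample(day2, 20)
    print("queries kept across refreshes:", len(s1 & s2))
```

Because a query's key never changes, a query that appears in both populations is either kept or displaced only by newly arrived queries with better keys, so labels for the retained queries can be reused. Changing `seed` produces an entirely different sample, which is one simple lever against overfitting to a fixed query set.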



Author information

Contributions

N.A. wrote the main manuscript text. N.A., D.C., and N.K. worked on the experiments (which led to the tables and figures). All authors contributed ideas for the methods and reviewed the manuscript.

Corresponding author

Correspondence to Nikita Astrakhantsev.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest to disclose. The datasets generated and/or analyzed during the current study are not publicly available due to the commercial nature of the Web search engine (Microsoft Bing) logs.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Nikita Astrakhantsev: Currently a machine learning engineer at Dropbox.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Astrakhantsev, N., Chittajallu, D.R., Kaushal, N. et al. Stable and semi-stable sampling approaches for continuously used samples. Knowl Inf Syst 65, 3251–3271 (2023). https://doi.org/10.1007/s10115-022-01806-1


  • DOI: https://doi.org/10.1007/s10115-022-01806-1
