Stable and semi-stable sampling approaches for continuously used samples

Astrakhantsev, Nikita; Chittajallu, Deepak Roy; Kaushal, Nabeel; Mokeev, Vladislav

doi:10.1007/s10115-022-01806-1

Regular Paper
Published: 03 April 2023

Volume 65, pages 3251–3271, (2023)

Knowledge and Information Systems



Abstract

Knowledge and information systems are usually evaluated by labeling the relevance of results for a sample of user queries. In practical systems such as search engines, this measurement must be performed continuously, for example daily or weekly. This creates a trade-off between (a) representativeness of the query sample with respect to the product's current query traffic; (b) labeling cost, since keeping the same query sample yields similar results whose labels can be reused; and (c) overfitting caused by continuous use of the same query sample. In this paper, we explicitly formulate this trade-off, propose two new variants (stable and semi-stable) of simple and weighted random sampling, and show that they outperform existing approaches in continuous-usage settings, including monitoring and debugging a search engine or comparing ranker candidates.



Author information

Authors and Affiliations

Microsoft, Bellevue, USA
Nikita Astrakhantsev, Deepak Roy Chittajallu, Nabeel Kaushal & Vladislav Mokeev


Contributions

N.A. wrote the main manuscript text. N.A., D.C. and N.K. worked on the experiments (which led to the tables and figures). All authors contributed ideas for the methods and reviewed the manuscript.

Corresponding author

Correspondence to Nikita Astrakhantsev.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest to disclose. The datasets generated and/or analyzed during the current study are not publicly available due to the commercial nature of the Web search engine (Microsoft Bing) logs.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Nikita Astrakhantsev: Currently a machine learning engineer at Dropbox.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Astrakhantsev, N., Chittajallu, D.R., Kaushal, N. et al. Stable and semi-stable sampling approaches for continuously used samples. Knowl Inf Syst 65, 3251–3271 (2023). https://doi.org/10.1007/s10115-022-01806-1

Received: 11 September 2022
Revised: 02 December 2022
Accepted: 12 December 2022
Published: 03 April 2023
Issue Date: August 2023
DOI: https://doi.org/10.1007/s10115-022-01806-1
