Fitting Data on a Grain of Rice

  • Conference paper
Algorithmic Aspects of Cloud Computing (ALGOCLOUD 2023)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 14053))

Abstract

Coresets are among the most successful compression paradigms. For clustering, a coreset B of a point set A preserves the clustering cost for any candidate solution C. In general, we are interested in finding a B that is as small as possible. In this overview, we will survey techniques for constructing coresets for clustering problems, their applications, and potential future directions.
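To make the definition concrete, the following is a minimal sketch of what "preserves the clustering cost" can mean for one candidate solution. It assumes the usual formalization, which the abstract does not spell out: B carries weights, and its weighted cost should lie within a factor \((1\pm\varepsilon)\) of the cost of A. The function names `clustering_cost` and `relative_error` are hypothetical, and the uniform sample used for B is only a placeholder to exercise the check; it is not one of the constructions surveyed here and does not give a coreset guarantee in general.

```python
import math
import random

def clustering_cost(points, weights, centers, z=2):
    """Weighted (k, z)-clustering cost: sum over p of w(p) * min_c dist(p, c)^z."""
    total = 0.0
    for p, w in zip(points, weights):
        nearest = min(math.dist(p, c) for c in centers)
        total += w * nearest ** z
    return total

def relative_error(A, B, w_B, C, z=2):
    """Relative gap between the cost of A and the weighted cost of B for one solution C."""
    full = clustering_cost(A, [1.0] * len(A), C, z)
    small = clustering_cost(B, w_B, C, z)
    return abs(full - small) / full

# Illustration only: a uniform sample with weights n/m (an assumption, not a
# guaranteed coreset), evaluated against a single random candidate solution.
rng = random.Random(0)
A = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(5000)]
m = 250
B = rng.sample(A, m)
w_B = [len(A) / m] * m
C = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(3)]
print(relative_error(A, B, w_B, C))
```

A true coreset has to keep this error below \(\varepsilon\) for every candidate solution C simultaneously, not just for the single C tried in the sketch.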

Notes

  1. Faster algorithms are possible, but not discussed here.

  2. \(\tilde{O}(x)\) hides factors of \(\text{polylog}(x)\).

Acknowledgements

The author acknowledges the support of the Independent Research Fund Denmark (DFF) under a Sapere Aude Research Leader grant No. 1051-00106B.

Author information

Corresponding author

Correspondence to Chris Schwiegelshohn.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Schwiegelshohn, C. (2024). Fitting Data on a Grain of Rice. In: Chatzigiannakis, I., Karydis, I. (eds) Algorithmic Aspects of Cloud Computing. ALGOCLOUD 2023. Lecture Notes in Computer Science, vol 14053. Springer, Cham. https://doi.org/10.1007/978-3-031-49361-4_13

  • DOI: https://doi.org/10.1007/978-3-031-49361-4_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-49360-7

  • Online ISBN: 978-3-031-49361-4

  • eBook Packages: Computer Science (R0)
