Abstract
Coresets are among the most successful compression paradigms. For clustering, a coreset B of a point set A preserves the clustering cost of A for every candidate solution C. In general, we are interested in finding a B that is as small as possible. In this overview, we survey techniques for constructing coresets for clustering problems, their applications, and potential future directions.
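To make the coreset property concrete, the following is a minimal, hypothetical illustration (not one of the constructions surveyed in the chapter): it builds a weighted subset B of a point set A by uniform sampling with weights |A|/|B|, and checks that the weighted k-means cost on B approximates the cost on A for one candidate solution C. Uniform sampling only behaves this well on benign, balanced inputs; the point of the coreset literature is to obtain such guarantees for *every* candidate C on *every* input.

```python
import random

def cost(points, centers, weights=None):
    """(Weighted) k-means cost: sum of squared distances to the nearest center."""
    if weights is None:
        weights = [1.0] * len(points)
    total = 0.0
    for p, w in zip(points, weights):
        total += w * min((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers)
    return total

random.seed(0)
# A: 2000 points drawn around two well-separated cluster centers in the plane.
A = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(1000)] + \
    [(random.gauss(10, 1), random.gauss(10, 1)) for _ in range(1000)]

# B: a uniform sample of m points, each reweighted by |A|/m so that
# the weighted cost on B estimates the cost on A.
m = 200
B = random.sample(A, m)
w = [len(A) / m] * m

C = [(0.0, 0.0), (10.0, 10.0)]  # one candidate solution
ratio = cost(B, C, w) / cost(A, C)
print(f"cost ratio on candidate C: {ratio:.3f}")  # close to 1 on this easy instance
```

A true ε-coreset would guarantee a ratio in [1−ε, 1+ε] simultaneously for all center sets C, which uniform sampling alone does not provide in general.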
Notes
- 1. Faster algorithms are possible, but not discussed here.
- 2. \(\tilde{O}(x)\) hides \(\text{polylog}(x)\) factors.
Acknowledgements
The author acknowledges the support of the Independent Research Fund Denmark (DFF) under a Sapere Aude Research Leader grant No. 1051-00106B.
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Schwiegelshohn, C. (2024). Fitting Data on a Grain of Rice. In: Chatzigiannakis, I., Karydis, I. (eds) Algorithmic Aspects of Cloud Computing. ALGOCLOUD 2023. Lecture Notes in Computer Science, vol 14053. Springer, Cham. https://doi.org/10.1007/978-3-031-49361-4_13
Print ISBN: 978-3-031-49360-7
Online ISBN: 978-3-031-49361-4