research-article

CARMI: a cache-aware learned index with a cost-based construction algorithm

Published: 01 July 2022

Abstract

Learned indexes, which use machine learning models to replace traditional index structures, have shown promising results in recent studies. However, existing learned indexes exhibit a performance gap between synthetic and real-world datasets, which keeps them far from practical use.

In this paper, we identify the main cause of this problem: existing approaches ignore the importance of data partitioning during model training. We therefore apply data partitioning explicitly to index construction and propose CARMI, a new efficient and updatable cache-aware RMI framework. Specifically, we introduce entropy as a metric to quantify and characterize how effectively tree nodes in learned indexes partition data, and we propose a novel cost model, laying a new theoretical foundation for future research. Based on this cost model, CARMI uses a hybrid construction algorithm to determine tree structures and model types automatically under various datasets and workloads, without any manual tuning. Furthermore, since memory accesses limit the performance of RMIs, CARMI also adopts a new cache-aware design that exploits the characteristics of the CPU cache to reduce the number of memory accesses. Our experimental study shows that CARMI outperforms the baselines, achieving average speedups of 2.2X over B+ Tree and 1.9X over ALEX while using only about 0.77X the memory of B+ Tree. On the SOSD platform, CARMI outperforms all baselines, with an average speedup of 1.2X over the nearest competitor, RMI, which was carefully tuned for each dataset in advance.
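To make the entropy metric concrete, the following is a minimal sketch and not the paper's implementation. It assumes the entropy of a tree node is the Shannon entropy of the fraction of keys routed to each child, and it uses a hypothetical linear root model (`linear_partition`) to assign keys to children. The function names and the toy datasets are illustrative assumptions.

```python
import math
from collections import Counter


def partition_entropy(child_ids):
    """Shannon entropy (in bits) of how keys are spread over child nodes.

    Assumption: p_i is the fraction of keys routed to child i, and the
    entropy is -sum(p_i * log2(p_i)) over the non-empty children.
    """
    n = len(child_ids)
    if n == 0:
        return 0.0
    counts = Counter(child_ids)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def linear_partition(keys, num_children):
    """Hypothetical root model: scale keys linearly into [0, num_children)."""
    lo, hi = min(keys), max(keys)
    span = (hi - lo) or 1  # avoid division by zero when all keys are equal
    return [
        min(num_children - 1, int((k - lo) / span * num_children))
        for k in keys
    ]


# Evenly spaced keys: each of the 4 children receives the same share.
uniform_keys = list(range(16))
# Skewed keys: a single outlier stretches the key range.
skewed_keys = [0, 1, 2, 3, 4, 5, 6, 1000]

uniform_entropy = partition_entropy(linear_partition(uniform_keys, 4))
skewed_entropy = partition_entropy(linear_partition(skewed_keys, 4))
print(uniform_entropy, skewed_entropy)
```

Under these assumptions, the evenly spaced keys reach the maximum of log2(4) = 2 bits, while the skewed keys score much lower because almost every key lands in the same child. A low score of this kind indicates a node whose model does little to narrow down the search.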

Published in

    Proceedings of the VLDB Endowment, Volume 15, Issue 11
    July 2022
    980 pages
    ISSN: 2150-8097

    Publisher

    VLDB Endowment
