research-article

CARMI: a cache-aware learned index with a cost-based construction algorithm

Authors:
Jiaoyi Zhang, Tsinghua University, Beijing, China
Yihan Gao, Tsinghua University, Beijing, China

Proceedings of the VLDB Endowment, Volume 15, Issue 11, pp. 2679–2691. https://doi.org/10.14778/3551793.3551823

Published: 01 July 2022


Abstract

Learned indexes, which use machine learning models to replace traditional index structures, have shown promising results in recent studies. However, existing learned indexes exhibit a performance gap between synthetic and real-world datasets, making them far from practical indexes.

In this paper, we identify that ignoring the importance of data partitioning during model training is the main reason for this problem. Thus, we explicitly apply data partitioning to index construction and propose a new efficient and updatable cache-aware RMI framework, called CARMI. Specifically, we introduce entropy as a metric to quantify and characterize the effectiveness of data partitioning of tree nodes in learned indexes and propose a novel cost model, laying a new theoretical foundation for future research. Then, based on our novel cost model, CARMI can automatically determine tree structures and model types under various datasets and workloads by a hybrid construction algorithm without any manual tuning. Furthermore, since memory accesses limit the performance of RMIs, a new cache-aware design is also applied in CARMI, which makes full use of the characteristics of the CPU cache to effectively reduce the number of memory accesses. Our experimental study shows that CARMI performs better than baselines, achieving an average of 2.2X/1.9X speedup compared to B+ Tree/ALEX, while using only about 0.77X memory space of B+ Tree. On the SOSD platform, CARMI outperforms all baselines, with an average speedup of 1.2X over the nearest competitor RMI, which has been carefully tuned for each dataset in advance.
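
The abstract does not give the formula for the entropy metric, so the following is only a minimal sketch of the general idea, assuming the metric is the Shannon entropy of how a node's model distributes keys over its children. The function name `partition_entropy`, the min-max linear model, and the two toy key sets are illustrative assumptions, not the authors' definitions.

```python
import math
from collections import Counter


def partition_entropy(keys, num_children):
    """Shannon entropy (in bits) of the key counts per child.

    Assumption: the node uses a simple min-max linear model to map each
    key to one of `num_children` children. This is not CARMI's actual model.
    """
    lo, hi = min(keys), max(keys)
    span = (hi - lo) or 1  # avoid division by zero when all keys are equal
    counts = Counter(
        min(num_children - 1, int((k - lo) / span * num_children))
        for k in keys
    )
    n = len(keys)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


# Evenly spread keys fill all four children equally.
uniform_keys = list(range(1024))
# Heavily duplicated keys pile up in the first child.
skewed_keys = [0] * 1000 + list(range(1, 25))

uniform_entropy = partition_entropy(uniform_keys, 4)
skewed_entropy = partition_entropy(skewed_keys, 4)
```

Under this assumed definition, a node that splits its keys evenly reaches the maximum entropy of log2(number of children), while a node that sends almost every key to one child scores close to zero. That gap is the sense in which entropy can quantify how effective a partitioning step is.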


Published in
Proceedings of the VLDB Endowment, Volume 15, Issue 11, July 2022, 980 pages
ISSN: 2150-8097
Editors: Fatma Özcan (Google), Juliana Freire (New York University), Xuemin Lin (University of New South Wales)
Publisher: VLDB Endowment