Abstract
Learned indexes, which use machine learning models to replace traditional index structures, have shown promising results in recent studies. However, existing learned indexes exhibit a performance gap between synthetic and real-world datasets, making them far from practical indexes.
In this paper, we identify that ignoring the importance of data partitioning during model training is the main reason for this problem. Thus, we explicitly apply data partitioning to index construction and propose a new efficient and updatable cache-aware RMI framework, called CARMI. Specifically, we introduce entropy as a metric to quantify and characterize the effectiveness of data partitioning of tree nodes in learned indexes and propose a novel cost model, laying a new theoretical foundation for future research. Then, based on our novel cost model, CARMI can automatically determine tree structures and model types under various datasets and workloads by a hybrid construction algorithm without any manual tuning. Furthermore, since memory accesses limit the performance of RMIs, a new cache-aware design is also applied in CARMI, which makes full use of the characteristics of the CPU cache to effectively reduce the number of memory accesses. Our experimental study shows that CARMI performs better than baselines, achieving an average of 2.2X/1.9X speedup compared to B+ Tree/ALEX, while using only about 0.77X memory space of B+ Tree. On the SOSD platform, CARMI outperforms all baselines, with an average speedup of 1.2X over the nearest competitor RMI, which has been carefully tuned for each dataset in advance.
- [n.d.]. https://github.com/JiaoYiZhang/learned_index.Google Scholar
- [n.d.]. https://registry.opendata.aws/osm/.Google Scholar
- [n.d.]. https://panthema.net/2007/stx-btree/.Google Scholar
- [n.d.]. https://www.kaggle.com/ucffool/amazon-sales-rank-data-for-print-and-kindle-books.Google Scholar
- [n.d.]. http://dumps.wikimedia.org.Google Scholar
- Rudolf Bayer. 1972. Symmetric binary B-trees: Data structure and maintenance algorithms. Acta informatica 1, 4 (1972), 290--306.Google Scholar
- R Bayer and E McCreight. 1970. ORGANIZATION AND MAINTENANCE OF LARGE. (1970).Google Scholar
- Rudolf Bayer and Edward McCreight. 2002. Organization and maintenance of large ordered indexes. In Software pioneers. Springer, 245--262.Google Scholar
- Brian Beavis and Ian Dobbs. 1990. Optimisation and stability theory for economic analysis. Cambridge university press.Google Scholar
- Joan Boyar and Kim S Larsen. 1994. Efficient rebalancing of chromatic search trees. J. Comput. System Sci. 49, 3 (1994), 667--682.Google ScholarDigital Library
- Shimin Chen, Phillip B Gibbons, and Todd C Mowry. 2001. Improving index performance through prefetching. ACM SIGMOD Record 30, 2 (2001), 235--246.Google ScholarDigital Library
- Brian F Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. 2010. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM symposium on Cloud computing. 143--154.Google ScholarDigital Library
- Jialin Ding, Umar Farooq Minhas, Jia Yu, Chi Wang, Jaeyoung Do, Yinan Li, Hantian Zhang, Badrish Chandramouli, Johannes Gehrke, Donald Kossmann, et al. 2020. ALEX: an updatable adaptive learned index. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 969--984.Google ScholarDigital Library
- Jialin Ding, Vikram Nathan, Mohammad Alizadeh, and Tim Kraska. 2020. Tsunami: A learned multi-dimensional index for correlated data and skewed workloads. arXiv preprint arXiv:2006.13282 (2020).Google Scholar
- Paolo Ferragina and Giorgio Vinciguerra. 2020. The PGM-index: a fully-dynamic compressed learned index with provable worst-case bounds. Proceedings of the VLDB Endowment 13, 8 (2020), 1162--1175.Google ScholarDigital Library
- Alex Galakatos, Michael Markovitch, Carsten Binnig, Rodrigo Fonseca, and Tim Kraska. 2019. Fiting-tree: A data-aware index structure. In Proceedings of the 2019 International Conference on Management of Data. 1189--1206.Google ScholarDigital Library
- Abdullah Gani, Aisha Siddiqa, Shahaboddin Shamshirband, and Fariza Hanum. 2016. A survey on indexing techniques for big data: taxonomy and performance evaluation. Knowledge and information systems 46, 2 (2016), 241--284.Google Scholar
- Robert M Gray. 2011. Entropy and information theory. Springer Science & Business Media.Google Scholar
- Changkyu Kim, JatinC hhugani, Nadathur Satish, Eric Sedlar, Anthony D Nguyen, Tim Kaldewey, Victor W Lee, Scott A Brandt, and Pradeep Dubey. 2010. FAST: fast architecture sensitive tree search on modern CPUs and GPUs. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. 339--350.Google ScholarDigital Library
- Andreas Kipf, Thomas Kipf, Bernhard Radke, Viktor Leis, Peter Boncz, and Alfons Kemper. 2018. Learned cardinalities: Estimating correlated joins with deep learning. arXiv preprint arXiv:1809.00677 (2018).Google Scholar
- Andreas Kipf, Ryan Marcus, Alexander van Renen, Mihail Stoian, Alfons Kemper, Tim Kraska, and Thomas Neumann. 2019. SOSD: A Benchmark for Learned Indexes. NeurIPS Workshop on Machine Learning for Systems (2019).Google Scholar
- Andreas Kipf, Ryan Marcus, Alexander van Renen, Mihail Stoian, Alfons Kemper, Tim Kraska, and Thomas Neumann. 2020. RadixSpline: a single-pass learned index. In Proceedings of the Third International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, aiDM@SIGMOD 2020, Portland, Oregon, USA, June 19, 2020. 5:1--5:5. Google ScholarDigital Library
- Tim Kraska, Mohammad Alizadeh, Alex Beutel, Ed H Chi, Jialin Ding, Ani Kristo, Guillaume Leclerc, Samuel Madden, Hongzi Mao, and Vikram Nathan. 2019. Sagedb: A learned database system. (2019).Google Scholar
- Tim Kraska, Alex Beutel, Ed H Chi, Jeffrey Dean, and Neoklis Polyzotis. 2018. The case for learned index structures. In Proceedings of the 2018 International Conference on Management of Data. 489--504.Google ScholarDigital Library
- Sanjay Krishnan, Zongheng Yang, Ken Goldberg, Joseph Hellerstein, and Ion Stoica. 2018. Learning to optimize join queries with deep reinforcement learning. arXiv preprint arXiv:1808.03196 (2018).Google Scholar
- Martijn HR Lankhorst, Bas WSMM Ketelaars, and Robertus AM Wolters. 2005. Low-cost and nanoscale non-volatile memory concept for future silicon chips. Nature materials 4, 4 (2005), 347--352.Google Scholar
- Tobin J Lehman and Michael J Carey. 1985. A study of index structures for main memory database management systems. Technical Report. University of Wisconsin-Madison Department of Computer Sciences.Google Scholar
- Viktor Leis, Alfons Kemper, and Thomas Neumann. 2013. The adaptive radix tree: ARTful indexing for main-memory databases. In 2013 IEEE 29th International Conference on Data Engineering (ICDE). IEEE, 38--49.Google ScholarDigital Library
- Pengfei Li, Hua Lu, Qian Zheng, Long Yang, and Gang Pan. 2020. LISA: A learned index structure for spatial data. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 2119--2133.Google ScholarDigital Library
- Yandong Mao, Eddie Kohler, and Robert Tappan Morris. 2012. Cache craftiness for fast multicore key-value storage. In Proceedings of the 7th ACM european conference on Computer Systems. 183--196.Google ScholarDigital Library
- Ryan Marcus, Andreas Kipf, Alexander van Renen, Mihail Stoian, Sanchit Misra, Alfons Kemper, Thomas Neumann, and Tim Kraska. 2020. Benchmarking Learned Indexes. Proc. VLDB Endow. 14, 1 (2020), 1--13.Google ScholarDigital Library
- Ryan Marcus and Olga Papaemmanouil. 2018. Towards a hands-free query optimizer through deep learning. arXiv preprint arXiv:1809.10212 (2018).Google Scholar
- Ryan Marcus, Emily Zhang, and Tim Kraska. 2020. Cdfshop: Exploring and optimizing learned index structures. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 2789--2792.Google ScholarDigital Library
- Michael Mitzenmacher. 2018. A model for learned bloom filters and related structures. arXiv preprint arXiv:1802.00884 (2018).Google Scholar
- Vikram Nathan, Jialin Ding, Mohammad Alizadeh, and Tim Kraska. 2020. Learning multi-dimensional indexes. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 985--1000.Google ScholarDigital Library
- Jennifer Ortiz, Magdalena Balazinska, Johannes Gehrke, and S Sathiya Keerthi. 2018. Learning state representations for query optimization with deep reinforcement learning. In Proceedings of the Second Workshop on Data Management for End-To-End Machine Learning. 1--4.Google ScholarDigital Library
- Varun Pandey, Alexander van Renen, Andreas Kipf, Ibrahim Sabek, Jialin Ding, and Alfons Kemper. 2020. The case for learned spatial indexes. arXiv preprint arXiv:2008.10349 (2020).Google Scholar
- Jianzhong Qi, Guanli Liu, Christian S Jensen, and Lars Kulik. 2020. Effectively learning spatial indices. Proceedings of the VLDB Endowment 13, 12 (2020), 2341--2354.Google ScholarDigital Library
- Jun Rao and Kenneth A. Ross. 1999. Cache Conscious Indexing for Decision-Support in Main Memory. In Proceedings of the 25th International Conference on Very Large Data Bases (VLDB '99). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 78--89.Google ScholarDigital Library
- Jun Rao and Kenneth A Ross. 2000. Making B+-trees cache conscious in main memory. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data. 475--486.Google ScholarDigital Library
- Mihail Stoian, Andreas Kipf, Ryan Marcus, and Tim Kraska. 2021. PLEX: Towards Practical Learned Indexing. arXiv preprint arXiv:2108.05117 (2021).Google Scholar
- Peter Van Sandt, Yannis Chronis, and Jignesh M Patel. 2019. Efficiently Searching In-Memory Sorted Arrays: Revenge of the Interpolation Search?. In Proceedings of the 2019 International Conference on Management of Data. 36--53.Google ScholarDigital Library
- Haixin Wang, Xiaoyi Fu, Jianliang Xu, and Hua Lu. 2019. Learned index for spatial queries. In 2019 20th IEEE International Conference on Mobile Data Management (MDM). IEEE, 569--574.Google ScholarCross Ref
- Wei Wang, Meihui Zhang, Gang Chen, HV Jagadish, Beng Chin Ooi, and Kian-Lee Tan. 2016. Database meets deep learning: Challenges and opportunities. ACM SIGMOD Record 45, 2 (2016), 17--22.Google ScholarDigital Library
- Jiacheng Wu, Yong Zhang, and Shimin Chen. 2021. Updatable Learned Index with Precise Positions. Proceedings of the VLDB Endowment 14, 8 (2021), 1276--1288.Google ScholarDigital Library
- Huanchen Zhang, David G Andersen, Andrew Pavlo, Michael Kaminsky, Lin Ma, and Rui Shen. 2016. Reducing the storage overhead of main-memory OLTP databases with hybrid indexes. In Proceedings of the 2016 International Conference on Management of Data. 1567--1581.Google ScholarDigital Library
- Jiaoyi Zhang and Yihan Gao. 2021. CARMI: A Cache-Aware Learned Index with a Cost-based Construction Algorithm. arXiv:2103.00858 [cs.DB]Google Scholar
Recommendations
Design and Optimization of Large Size and Low Overhead Off-Chip Caches
Large off-chip L3 caches can significantly improve the performance of memory-intensive applications. However, conventional L3 SRAM caches are facing two issues as those applications require increasingly large caches. First, an SRAM cache has a limited ...
TLB Improvements for Chip Multiprocessors: Inter-Core Cooperative Prefetchers and Shared Last-Level TLBs
Translation Lookaside Buffers (TLBs) are critical to overall system performance. Much past research has addressed uniprocessor TLBs, lowering access times and miss rates. However, as Chip MultiProcessors (CMPs) become ubiquitous, TLB design and ...
The hyperdyadic index and generalized indexing and query with PIQUE
SSDBM '15: Proceedings of the 27th International Conference on Scientific and Statistical Database ManagementMany scientists rely on indexing and query to identify trends and anomalies within extreme-scale scientific data. Compressed bitmap indexing (e.g., FastBit) is the go-to indexing method for many scientific datasets and query workloads. Recently, the ...
Comments