Abstract
Targeting in-memory one-dimensional search keys, we propose a novel DIstribution-driven Learned Index tree (DILI), in which each node uses a concise and computation-efficient linear regression model. An internal node's key range is divided equally among its child nodes, so that a key search enjoys perfect model prediction accuracy when finding the relevant leaf node. A leaf node uses machine learning models to generate a searchable data layout and thus accurately predicts the data record position for a key. To construct DILI, we first build a bottom-up tree with linear regression models according to global and local key distributions. Using the bottom-up tree, we then build DILI in a top-down manner, individualizing the fanouts of internal nodes according to local distributions. DILI strikes a good balance between the number of leaf nodes and the height of the tree, two critical factors in key search time. Moreover, we design flexible algorithms for DILI to efficiently insert and delete keys and to automatically adjust the tree structure when necessary. Extensive experimental results show that DILI outperforms the state-of-the-art alternatives on different kinds of workloads.
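To illustrate why equal division gives an internal node perfect prediction accuracy, the following is a minimal sketch, not the authors' implementation. It assumes integer keys and a single internal node with a fixed fanout; the names `EqualWidthNode`, `child_index`, `bulk_load`, and `contains` are hypothetical. Each leaf is stood in for by a plain sorted list searched with binary search, which is a simplification and not DILI's learned leaf layout.

```python
from bisect import bisect_left


class EqualWidthNode:
    """Hypothetical internal node: key range [lb, ub) split into equal-width child ranges."""

    def __init__(self, lb, ub, fanout):
        self.lb = lb
        self.ub = ub
        self.fanout = fanout
        # Stand-in leaves: each child is a sorted list of keys (an assumption,
        # not the learned leaf layout described in the abstract).
        self.children = [[] for _ in range(fanout)]

    def child_index(self, key):
        # Linear model with slope fanout / (ub - lb) and intercept -lb * slope,
        # evaluated in integer arithmetic so boundary keys are not misrouted
        # by floating-point rounding.
        idx = (key - self.lb) * self.fanout // (self.ub - self.lb)
        # Clamp keys outside [lb, ub) to the first or last child.
        return min(max(idx, 0), self.fanout - 1)

    def bulk_load(self, sorted_keys):
        # Appending in sorted order keeps every child list sorted.
        for key in sorted_keys:
            self.children[self.child_index(key)].append(key)

    def contains(self, key):
        # The model picks the child directly; no search among children is needed.
        leaf = self.children[self.child_index(key)]
        pos = bisect_left(leaf, key)
        return pos < len(leaf) and leaf[pos] == key


node = EqualWidthNode(lb=0, ub=100, fanout=4)
node.bulk_load([3, 10, 25, 26, 49, 50, 74, 75, 99])
print(node.children)
print(node.contains(49), node.contains(48))
```

Because every child covers a range of the same width, the child holding a key follows from one multiplication and one division, so routing never needs a correction step. The trade-off the abstract points to is the choice of fanout per node, which this sketch fixes by hand.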