research-article

DILI: A Distribution-Driven Learned Index

Authors:
Pengfei Li, Alibaba Group, China
Hua Lu, Roskilde University, Denmark
Rong Zhu, Alibaba Group, China
Bolin Ding, Alibaba Group, China
Long Yang, Peking University, China
Gang Pan, Zhejiang University, China

Proceedings of the VLDB Endowment, Volume 16, Issue 9, pp. 2212–2224
https://doi.org/10.14778/3598581.3598593

Published: 01 May 2023

Abstract

Targeting in-memory one-dimensional search keys, we propose a novel DIstribution-driven Learned Index tree (DILI), in which each node uses a concise and computation-efficient linear regression model. An internal node's key range is divided equally among its child nodes, so that a key search enjoys perfect model prediction accuracy when finding the relevant leaf node. A leaf node uses machine learning models to generate a searchable data layout and thus accurately predicts the data record position for a key. To construct DILI, we first build a bottom-up tree with linear regression models according to global and local key distributions. Using the bottom-up tree, we then build DILI in a top-down manner, individualizing the fanouts of internal nodes according to local distributions. DILI strikes a good balance between the number of leaf nodes and the height of the tree, two critical factors in key search time. Moreover, we design flexible algorithms for DILI to efficiently insert and delete keys and to automatically adjust the tree structure when necessary. Extensive experimental results show that DILI outperforms the state-of-the-art alternatives on different kinds of workloads.
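To illustrate why equal division gives an internal node perfect routing accuracy, here is a minimal Python sketch. It is not the authors' implementation: the class name `InternalNode`, its fields, and the use of exact integer arithmetic in place of a floating-point slope and intercept are all assumptions made for this example. The idea shown is only that, when a key range [lb, ub) is split into `fanout` equal-width children, a linear function of the key identifies the correct child without any search.

```python
from dataclasses import dataclass


@dataclass
class InternalNode:
    """Hypothetical internal node whose key range is split into equal-width children."""
    lb: int      # inclusive lower bound of the node's key range
    ub: int      # exclusive upper bound of the node's key range
    fanout: int  # number of equal-width child ranges

    def child_index(self, key: int) -> int:
        # Linear model: idx = floor((key - lb) * fanout / (ub - lb)).
        # Integer arithmetic is an assumption of this sketch; it keeps the
        # result exact for integer keys.
        idx = (key - self.lb) * self.fanout // (self.ub - self.lb)
        # Clamp so that out-of-range keys map to the first or last child.
        return min(max(idx, 0), self.fanout - 1)


node = InternalNode(lb=0, ub=100, fanout=4)
print([node.child_index(k) for k in (0, 24, 25, 99)])  # [0, 0, 1, 3]
```

Because the child boundaries are determined entirely by `lb`, `ub`, and `fanout`, the prediction involves no error bound and needs no local correction step at internal nodes. How the fanout is chosen per node from the local key distribution is the subject of the paper and is not modeled here.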



Published in
Proceedings of the VLDB Endowment, Volume 16, Issue 9 (May 2023), 330 pages
ISSN: 2150-8097
Editors: Georgia Koutrika (Athena Research Center), Jun Yang (Duke University)
Publisher: VLDB Endowment
Badges: Artifacts Available / v1.1