Artifacts Available / v1.1

DILI: A Distribution-Driven Learned Index

Published: 01 May 2023

Abstract

Targeting in-memory one-dimensional search keys, we propose a novel DIstribution-driven Learned Index tree (DILI), where a concise and computation-efficient linear regression model is used for each node. An internal node's key range is equally divided by its child nodes such that a key search enjoys perfect model prediction accuracy to find the relevant leaf node. A leaf node uses machine learning models to generate a searchable data layout and thus accurately predicts the data record position for a key. To construct DILI, we first build a bottom-up tree with linear regression models according to global and local key distributions. Using the bottom-up tree, we build DILI in a top-down manner, individualizing the fanouts for internal nodes according to local distributions. DILI strikes a good balance between the number of leaf nodes and the height of the tree, two critical factors in key search time. Moreover, we design flexible algorithms for DILI to efficiently insert and delete keys and automatically adjust the tree structure when necessary. Extensive experimental results show that DILI outperforms the state-of-the-art alternatives on different kinds of workloads.
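
To illustrate why equal division of an internal node's key range gives exact routing, the following is a minimal Python sketch, not the authors' implementation. The class `InternalNode` and its fields `lower`, `upper`, and `fanout` are hypothetical names chosen for this example. It assumes a half-open key range `[lower, upper)` split into `fanout` children of equal width, so that a single linear function of the key yields the child index with no follow-up search.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class InternalNode:
    """Hypothetical internal node whose key range is split into equal-width children."""
    lower: float   # inclusive lower bound of the node's key range
    upper: float   # exclusive upper bound of the node's key range
    fanout: int    # number of equal-width child ranges

    def child_index(self, key: float) -> int:
        # Linear model: index = slope * (key - lower), truncated to an integer.
        # Because every child covers the same width, the model is exact for
        # any key inside [lower, upper).
        slope = self.fanout / (self.upper - self.lower)
        idx = int(slope * (key - self.lower))
        # Clamp keys that fall outside the range (an assumption of this sketch).
        return min(max(idx, 0), self.fanout - 1)

    def child_range(self, i: int) -> Tuple[float, float]:
        # Key range [start, end) covered by the i-th child.
        width = (self.upper - self.lower) / self.fanout
        return (self.lower + i * width, self.lower + (i + 1) * width)


if __name__ == "__main__":
    node = InternalNode(lower=0.0, upper=128.0, fanout=4)
    for key in (0.0, 31.0, 32.0, 127.0):
        i = node.child_index(key)
        print(key, "-> child", i, "covering", node.child_range(i))
```

In this sketch, routing costs one multiplication, one subtraction, and one truncation per level, which is why the height of the tree and the per-node fanout are the quantities worth tuning. How DILI actually chooses fanouts and lays out leaf nodes is described in the paper and is not reproduced here.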

