High Performance Construction of RecSplit Based Minimal Perfect Hash Functions

Authors: Dominik Bez, Florian Kurpicz, Hans-Peter Lehmann, Peter Sanders




File

LIPIcs.ESA.2023.19.pdf
  • Filesize: 0.89 MB
  • 16 pages

Author Details

Dominik Bez
  • Karlsruhe Institute of Technology, Germany
Florian Kurpicz
  • Karlsruhe Institute of Technology, Germany
Hans-Peter Lehmann
  • Karlsruhe Institute of Technology, Germany
Peter Sanders
  • Karlsruhe Institute of Technology, Germany

Acknowledgements

This paper is based on and has text overlaps with Dominik Bez's Master’s thesis [Bez, 2022]. We refer readers to that thesis for a detailed evaluation of the effects of low-level decisions, such as the choice between similar SIMD instructions.

Cite As

Dominik Bez, Florian Kurpicz, Hans-Peter Lehmann, and Peter Sanders. High Performance Construction of RecSplit Based Minimal Perfect Hash Functions. In 31st Annual European Symposium on Algorithms (ESA 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 274, pp. 19:1-19:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)
https://doi.org/10.4230/LIPIcs.ESA.2023.19

Abstract

A minimal perfect hash function (MPHF) bijectively maps a set S of objects to the first |S| integers. It can be used as a building block in databases and data compression. RecSplit [Esposito et al., ALENEX'20] is currently the most space-efficient practical minimal perfect hash function. It relies heavily on trying out hash functions in a brute-force way. We introduce rotation fitting, a new technique that makes the search more efficient by drastically reducing the number of hash functions tried. Additionally, we greatly improve the construction time of RecSplit by harnessing parallelism on the level of bits, vectors, cores, and GPUs. In combination, the resulting improvements yield speedups of up to 239 on an 8-core CPU and up to 5438 using a GPU. The original single-threaded RecSplit implementation needs 1.5 hours to construct an MPHF for 5 million objects with 1.56 bits per object. On the GPU, we achieve the same space usage in just 5 seconds. Given that the speedups are larger than the increase in energy consumption, our implementation is more energy efficient than the original implementation.
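To make the brute-force search mentioned in the abstract concrete, the following minimal Python sketch tries seeds one after another until a seeded hash maps a small key set bijectively onto the slots 0..m-1. It is an illustration only and not the authors' implementation: the helper names `seeded_hash` and `find_bijective_seed`, the use of BLAKE2b as the hash, and the `max_tries` limit are all assumptions made for this example. The occupied slots are collected in an integer bitmask, so the bijectivity check is a single comparison.

```python
import hashlib


def seeded_hash(key: str, seed: int, m: int) -> int:
    """Hypothetical seeded hash: map `key` to a slot in [0, m)."""
    digest = hashlib.blake2b(
        key.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little") % m


def find_bijective_seed(keys: list[str], max_tries: int = 1_000_000) -> int:
    """Try seeds 0, 1, 2, ... until the keys hit every slot exactly once.

    Assumes the keys are distinct. Each seed succeeds with probability
    m!/m^m, so this is only practical for small key sets.
    """
    m = len(keys)
    full_mask = (1 << m) - 1  # one bit per slot, all set
    for seed in range(max_tries):
        mask = 0
        for key in keys:
            mask |= 1 << seeded_hash(key, seed, m)
        if mask == full_mask:  # m keys cover m slots: a bijection
            return seed
    raise RuntimeError("no bijective seed found within max_tries")


if __name__ == "__main__":
    keys = ["apple", "banana", "cherry", "date", "elder", "fig", "grape", "honey"]
    seed = find_bijective_seed(keys)
    print(seed, sorted(seeded_hash(k, seed, len(keys)) for k in keys))
```

For 8 keys a random seed is bijective with probability 8!/8^8, roughly 0.24%, so a few hundred tries are expected on average. That success probability shrinks quickly as the set grows, which is why reducing the number of tried hash functions matters.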

Subject Classification

ACM Subject Classification
  • Theory of computation → Data compression
  • Information systems → Point lookups
Keywords
  • compressed data structure
  • parallel perfect hashing
  • bit parallelism
  • GPU
  • SIMD
  • parallel computing
  • vector instructions


References

  1. Michael Axtmann, Sascha Witt, Daniel Ferizovic, and Peter Sanders. Engineering in-place (shared-memory) sorting algorithms. ACM Trans. Parallel Comput., 9(1):2:1-2:62, 2022. URL: https://doi.org/10.1145/3505286.
  2. Djamal Belazzougui, Fabiano C. Botelho, and Martin Dietzfelbinger. Hash, displace, and compress. In ESA, volume 5757 of Lecture Notes in Computer Science, pages 682-693. Springer, 2009. URL: https://doi.org/10.1007/978-3-642-04128-0_61.
  3. Michael A. Bender, Martin Farach-Colton, Mayank Goswami, Rob Johnson, Samuel McCauley, and Shikha Singh. Bloom filters, adaptivity, and the dictionary problem. In FOCS, pages 182-193. IEEE Computer Society, 2018. URL: https://doi.org/10.1109/FOCS.2018.00026.
  4. Dominik Bez. Perfect hash function generation on the GPU with RecSplit. Master’s thesis, Karlsruhe Institute of Technology (KIT), 2022. URL: https://doi.org/10.5445/IR/1000152719.
  5. Dominik Bez, Florian Kurpicz, Hans-Peter Lehmann, and Peter Sanders. High performance construction of RecSplit based minimal perfect hash functions, 2022. URL: https://doi.org/arXiv:2212.09562.
  6. Fabiano C. Botelho, Rasmus Pagh, and Nivio Ziviani. Perfect hashing for data management applications. CoRR, abs/cs/0702159, 2007. URL: https://arxiv.org/abs/0702159.
  7. Fabiano C. Botelho, Rasmus Pagh, and Nivio Ziviani. Simple and space-efficient minimal perfect hash functions. In WADS, volume 4619 of Lecture Notes in Computer Science, pages 139-150. Springer, 2007. URL: https://doi.org/10.1007/978-3-540-73951-7_13.
  8. Fabiano C. Botelho, Rasmus Pagh, and Nivio Ziviani. Practical perfect hashing in nearly optimal space. Inf. Syst., 38(1):108-131, 2013. URL: https://doi.org/10.1016/J.IS.2012.06.002.
  9. Jarrod A. Chapman, Isaac Ho, Sirisha Sunkara, Shujun Luo, Gary P. Schroth, and Daniel S. Rokhsar. Meraculous: De novo genome assembly with short paired-end reads. PLOS ONE, 6(8):1-13, August 2011. URL: https://doi.org/10.1371/journal.pone.0023501.
  10. David Richard Clark. Compact Pat Trees. PhD thesis, University of Waterloo, Canada, 1996.
  11. Zbigniew J. Czech, George Havas, and Bohdan S. Majewski. An optimal algorithm for generating minimal perfect hash functions. Inf. Process. Lett., 43(5):257-264, 1992. URL: https://doi.org/10.1016/0020-0190(92)90220-P.
  12. Peter C. Dillinger, Lorenz Hübschle-Schneider, Peter Sanders, and Stefan Walzer. Fast succinct retrieval and approximate membership using ribbon. In SEA, volume 233 of LIPIcs, pages 4:1-4:20. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2022. URL: https://doi.org/10.4230/LIPICS.SEA.2022.4.
  13. Peter Elias. Efficient storage and retrieval by content and address of static files. J. ACM, 21(2):246-260, 1974. URL: https://doi.org/10.1145/321812.321820.
  14. Emmanuel Esposito, Thomas Mueller Graf, and Sebastiano Vigna. RecSplit: Minimal perfect hashing via recursive splitting. In ALENEX, pages 175-185. SIAM, 2020. URL: https://doi.org/10.1137/1.9781611976007.14.
  15. Bin Fan, David G. Andersen, Michael Kaminsky, and Michael Mitzenmacher. Cuckoo filter: Practically better than Bloom. In CoNEXT, pages 75-88. ACM, 2014. URL: https://doi.org/10.1145/2674005.2674994.
  16. Robert Mario Fano. On the number of bits required to implement an associative memory. Technical report, MIT, Computer Structures Group, 1971. Project MAC, Memorandum 61.
  17. Michael J. Flynn. Some computer organizations and their effectiveness. IEEE Trans. Computers, 21(9):948-960, 1972. URL: https://doi.org/10.1109/TC.1972.5009071.
  18. Agner Fog. C++ vector class library. http://www.agner.org/optimize/vectorclass.pdf, 2013.
  19. Dimitris Fotakis, Rasmus Pagh, Peter Sanders, and Paul G. Spirakis. Space efficient hash tables with worst case constant access time. Theory Comput. Syst., 38(2):229-248, 2005. URL: https://doi.org/10.1007/S00224-004-1195-X.
  20. Edward A. Fox, Qi Fan Chen, and Lenwood S. Heath. A faster algorithm for constructing minimal perfect hash functions. In SIGIR, pages 266-273. ACM, 1992. URL: https://doi.org/10.1145/133160.133209.
  21. Michael L. Fredman, János Komlós, and Endre Szemerédi. Storing a sparse table with O(1) worst case access time. J. ACM, 31(3):538-544, 1984. URL: https://doi.org/10.1145/828.1884.
  22. GpuRecSplit - GitHub. https://github.com/ByteHamster/GpuRecSplit, 2023.
  23. MPHF-Experiments - GitHub. https://github.com/ByteHamster/MPHF-Experiments, 2023.
  24. Solomon W. Golomb. Run-length encodings (corresp.). IEEE Trans. Inf. Theory, 12(3):399-401, 1966. URL: https://doi.org/10.1109/TIT.1966.1053907.
  25. Intel. Advanced vector extensions programming reference. https://www.intel.com/content/dam/develop/external/us/en/documents/36945, 2011.
  26. Intel. AVX-512 instructions. https://www.intel.com/content/www/us/en/developer/articles/technical/intel-avx-512-instructions.html, 2013.
  27. Guy Jacobson. Space-efficient static trees and graphs. In FOCS, pages 549-554. IEEE Computer Society, 1989. URL: https://doi.org/10.1109/SFCS.1989.63533.
  28. Florian Kurpicz. Engineering compact data structures for rank and select queries on bit vectors. In SPIRE, volume 13617 of Lecture Notes in Computer Science, pages 257-272. Springer, 2022. URL: https://doi.org/10.1007/978-3-031-20643-6_19.
  29. Sylvain Lefebvre and Hugues Hoppe. Perfect spatial hashing. ACM Trans. Graph., 25(3):579-588, 2006. URL: https://doi.org/10.1145/1141911.1141926.
  30. Hans-Peter Lehmann, Peter Sanders, and Stefan Walzer. SicHash - small irregular cuckoo tables for perfect hashing. In ALENEX, pages 176-189. SIAM, 2023. URL: https://doi.org/10.1137/1.9781611977561.CH15.
  31. Antoine Limasset, Guillaume Rizk, Rayan Chikhi, and Pierre Peterlongo. Fast and scalable minimal perfect hashing for massive key sets. In SEA, volume 75 of LIPIcs, pages 25:1-25:16. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2017. URL: https://doi.org/10.4230/LIPICS.SEA.2017.25.
  32. Ingo Müller, Peter Sanders, Robert Schulze, and Wei Zhou. Retrieval and perfect hashing using fingerprinting. In SEA, volume 8504 of Lecture Notes in Computer Science, pages 138-149. Springer, 2014. URL: https://doi.org/10.1007/978-3-319-07959-2_12.
  33. Nvidia. Nvidia Ampere GA102 GPU architecture. https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf, 2020.
  34. Nvidia. CUDA C++ programming guide. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html, 2022.
  35. Rasmus Pagh and Flemming Friche Rodler. Cuckoo hashing. J. Algorithms, 51(2):122-144, 2004. URL: https://doi.org/10.1016/j.jalgor.2003.12.002.
  36. Giulio Ermanno Pibiri and Roberto Trani. Parallel and external-memory construction of minimal perfect hash functions with PTHash. CoRR, abs/2106.02350, 2021. URL: https://arxiv.org/abs/2106.02350.
  37. Giulio Ermanno Pibiri and Roberto Trani. PTHash: Revisiting FCH minimal perfect hashing. In SIGIR, pages 1339-1348. ACM, 2021. URL: https://doi.org/10.1145/3404835.3462849.
  38. Robert F. Rice. Some practical universal noiseless coding techniques. Jet Propulsion Laboratory, JPL Publication, 1979.
  39. Sebastiano Vigna. Broadword implementation of rank/select queries. In WEA, volume 5038 of Lecture Notes in Computer Science, pages 154-168. Springer, 2008. URL: https://doi.org/10.1007/978-3-540-68552-4_12.
  40. Sean A. Weaver and Marijn Heule. Constructing minimal perfect hash functions using SAT technology. In AAAI, pages 1668-1675. AAAI Press, 2020.