Engineering a Textbook Approach to Index Massive String Dictionaries

  • Conference paper
String Processing and Information Retrieval (SPIRE 2023)

Abstract

We study the problem of engineering space-time efficient indexes that support membership and lexicographic (rank) queries on very large static dictionaries of strings.

Our solution is based on a very simple approach that consists of decoupling string storage and string indexing by means of a blockwise compression of the sorted dictionary strings (to be stored in external memory) and a succinct implementation of a Patricia trie (to be stored in internal memory) built on the first string of each block.
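
The two-level layout described above can be illustrated with a minimal sketch. It is not the authors' implementation: a sorted Python list searched with `bisect` stands in for the succinct Patricia trie, `zlib` stands in for the blockwise compressor, and an in-memory list stands in for external memory. The class name `TwoLevelIndex` and the parameter `block_size` are hypothetical, and the sketch assumes distinct, sorted input strings that contain no newline character.

```python
import bisect
import zlib


class TwoLevelIndex:
    """Sketch: compressed blocks of sorted strings plus an index of block heads."""

    def __init__(self, sorted_strings, block_size=4):
        # Hypothetical parameter: number of strings stored in each block.
        self.block_size = block_size
        self.heads = []   # first string of each block (stand-in for the Patricia trie)
        self.blocks = []  # compressed blocks (stand-in for external memory)
        for i in range(0, len(sorted_strings), block_size):
            chunk = sorted_strings[i:i + block_size]
            self.heads.append(chunk[0])
            self.blocks.append(zlib.compress("\n".join(chunk).encode("utf-8")))

    def _load_block(self, b):
        # Decompress a single block; only one block is touched per query.
        return zlib.decompress(self.blocks[b]).decode("utf-8").split("\n")

    def _find_block(self, query):
        # Index of the last block whose first string is <= query, or -1.
        return bisect.bisect_right(self.heads, query) - 1

    def rank(self, query):
        """Number of dictionary strings lexicographically <= query."""
        b = self._find_block(query)
        if b < 0:
            return 0
        block = self._load_block(b)
        # Every block before b is full, so it contributes block_size strings.
        return b * self.block_size + bisect.bisect_right(block, query)

    def contains(self, query):
        """Membership query: True if query is stored in the dictionary."""
        b = self._find_block(query)
        if b < 0:
            return False
        block = self._load_block(b)
        i = bisect.bisect_left(block, query)
        return i < len(block) and block[i] == query


words = sorted(["apple", "banana", "cherry", "date", "fig",
                "grape", "kiwi", "lemon", "mango"])
index = TwoLevelIndex(words, block_size=4)
print(index.contains("kiwi"), index.rank("coconut"))
```

Each query searches the small in-memory structure once and then decompresses exactly one block, which is the decoupling of string indexing from string storage that the abstract describes.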

Our experimental evaluation on two new datasets, which are at least one order of magnitude larger than the ones used in the literature, shows that (i) the state-of-the-art compressed string dictionaries (such as FST, PDT, CoCo-trie) do not provide significant benefits if used in an indexing setting compared to Patricia tries, and (ii) our two-level approach enables the indexing of 3.5 billion strings taking 273 GB in less than 200 MB of internal memory, which is available on any commodity machine, while still guaranteeing comparable or faster query performance than those offered by array-based solutions used in modern storage systems, such as RocksDB, thus possibly influencing their future designs.
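
To put the figures in (ii) in perspective, the short calculation below derives per-string quantities from the three numbers quoted in the abstract. It assumes decimal units (1 GB = 10^9 bytes, 1 MB = 10^6 bytes) and treats 200 MB as an upper bound, so the results are only rough estimates; the variable names are illustrative.

```python
# Figures quoted in the abstract; decimal units are an assumption.
num_strings = 3.5e9
dictionary_bytes = 273e9
index_bytes_upper_bound = 200e6  # "less than 200 MB"

# Average space taken by one dictionary string.
avg_string_bytes = dictionary_bytes / num_strings
# Internal-memory budget of the index, expressed per dictionary string.
index_bits_per_string = index_bytes_upper_bound * 8 / num_strings
# How much larger the dictionary is than its in-memory index.
size_ratio = dictionary_bytes / index_bytes_upper_bound

print(f"average string size: about {avg_string_bytes:.0f} bytes")
print(f"index budget: under {index_bits_per_string:.2f} bits per string")
print(f"dictionary is over {size_ratio:.0f}x larger than the index")
```

Under these assumptions the index spends less than half a bit of internal memory per string of roughly 78 bytes, which is feasible only because the in-memory structure covers one string per block rather than every string.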

Acknowledgements

We thank Antonio Boffa for executing some tests on the CoCo-trie, and the Green Data Centre at the University of Pisa for machines and technical support. We also thank Roberto Di Cosmo, Valentin Lorentz, Stefano Zacchiroli, and the Software Heritage team for providing us with the Filenames dataset. This work was made possible by Software Heritage, the great library of source code: https://www.softwareheritage.org.

This work has been supported by the European Union – Horizon 2020 Program under the scheme “INFRAIA-01-2018-2019 – Integrating Activities for Advanced Communities”, Grant Agreement n. 871042, “SoBigData++: European Integrated Infrastructure for Social Mining and Big Data Analytics” http://www.sobigdata.eu, by the NextGenerationEU – National Recovery and Resilience Plan (Piano Nazionale di Ripresa e Resilienza, PNRR) – Project: “SoBigData.it - Strengthening the Italian RI for Social Mining and Big Data Analytics” – Prot. IR0000013 – Avviso n. 3264 del 28/12/2021, by the spoke “FutureHPC & BigData” of the ICSC – Centro Nazionale di Ricerca in High-Performance Computing, Big Data and Quantum Computing funded by European Union – NextGenerationEU – PNRR, and by the Italian Ministry of University and Research “Progetti di Rilevante Interesse Nazionale” project: “Multicriteria data structures and algorithms” (grant n. 2017WR7SHH).

Author information

Corresponding author

Correspondence to Giorgio Vinciguerra.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Ferragina, P., Rotundo, M., Vinciguerra, G. (2023). Engineering a Textbook Approach to Index Massive String Dictionaries. In: Nardini, F.M., Pisanti, N., Venturini, R. (eds) String Processing and Information Retrieval. SPIRE 2023. Lecture Notes in Computer Science, vol 14240. Springer, Cham. https://doi.org/10.1007/978-3-031-43980-3_16

  • DOI: https://doi.org/10.1007/978-3-031-43980-3_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-43979-7

  • Online ISBN: 978-3-031-43980-3

  • eBook Packages: Computer Science (R0)
