Engineering a Textbook Approach to Index Massive String Dictionaries

Ferragina, Paolo; Rotundo, Mariagiovanna; Vinciguerra, Giorgio

doi:10.1007/978-3-031-43980-3_16

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14240)

Included in the following conference series:

International Symposium on String Processing and Information Retrieval

Abstract

We study the problem of engineering space-time efficient indexes that support membership and lexicographic (rank) queries on very large static dictionaries of strings.

Our solution is based on a very simple approach that consists of decoupling string storage and string indexing by means of a blockwise compression of the sorted dictionary strings (to be stored in external memory) and a succinct implementation of a Patricia trie (to be stored in internal memory) built on the first string of each block.

Our experimental evaluation on two new datasets, which are at least one order of magnitude larger than the ones used in the literature, shows that (i) the state-of-the-art compressed string dictionaries (such as FST, PDT, CoCo-trie) do not provide significant benefits if used in an indexing setting compared to Patricia tries, and (ii) our two-level approach enables the indexing of 3.5 billion strings taking 273 GB in less than 200 MB of internal memory, which is available on any commodity machine, while still guaranteeing comparable or faster query performance than those offered by array-based solutions used in modern storage systems, such as RocksDB, thus possibly influencing their future designs.
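
The two-level design described in the abstract can be illustrated with a minimal sketch, under several loudly stated assumptions. Front coding is used here as one possible blockwise compression (the abstract does not name the scheme), a plain sorted list searched with binary search stands in for the succinct Patricia trie, and ordinary Python lists stand in for external memory. All function names (`build`, `decode_block`, `rank`, `member`) are hypothetical and are not taken from the authors' implementation.

```python
from bisect import bisect_right


def lcp(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i


def build(sorted_strings, block_size):
    """Split the sorted dictionary into blocks of block_size strings.

    Returns (heads, blocks): heads[i] is the first string of block i
    (the part that would be indexed in internal memory), and blocks[i]
    front-codes the remaining strings of block i as (lcp, suffix) pairs
    (the part that would live in external memory).
    """
    heads, blocks = [], []
    for start in range(0, len(sorted_strings), block_size):
        chunk = sorted_strings[start:start + block_size]
        heads.append(chunk[0])
        encoded, prev = [], chunk[0]
        for s in chunk[1:]:
            k = lcp(prev, s)
            encoded.append((k, s[k:]))  # keep only what differs from prev
            prev = s
        blocks.append(encoded)
    return heads, blocks


def decode_block(head, encoded):
    """Rebuild the strings of one block from its head and front-coded tail."""
    out = [head]
    for k, suffix in encoded:
        out.append(out[-1][:k] + suffix)
    return out


def rank(heads, blocks, block_size, query):
    """Number of dictionary strings lexicographically <= query."""
    # Level 1: locate the candidate block using only the block heads.
    # (Assumption: binary search replaces the Patricia trie of the paper.)
    b = bisect_right(heads, query) - 1
    if b < 0:
        return 0  # query precedes every string in the dictionary
    # Level 2: decode a single block and search inside it.
    strings = decode_block(heads[b], blocks[b])
    return b * block_size + bisect_right(strings, query)


def member(heads, blocks, query):
    """True if query is one of the dictionary strings."""
    b = bisect_right(heads, query) - 1
    if b < 0:
        return False
    return query in decode_block(heads[b], blocks[b])
```

In this sketch a query touches the in-memory list of block heads and then decodes exactly one block, which mirrors the decoupling of string indexing from string storage. Replacing the list of heads with a succinct Patricia trie, as the abstract proposes, changes the first level only.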

Acknowledgements

We thank Antonio Boffa for executing some tests on the CoCo-trie, and the Green Data Centre at the University of Pisa for machines and technical support. We also thank Roberto Di Cosmo, Valentin Lorentz, Stefano Zacchiroli, and the Software Heritage team for providing us with the Filenames dataset. This work was made possible by Software Heritage, the great library of source code: https://www.softwareheritage.org.

This work has been supported by the European Union – Horizon 2020 Program under the scheme “INFRAIA-01-2018-2019 – Integrating Activities for Advanced Communities”, Grant Agreement n. 871042, “SoBigData++: European Integrated Infrastructure for Social Mining and Big Data Analytics” http://www.sobigdata.eu, by the NextGenerationEU – National Recovery and Resilience Plan (Piano Nazionale di Ripresa e Resilienza, PNRR) – Project: “SoBigData.it - Strengthening the Italian RI for Social Mining and Big Data Analytics” – Prot. IR0000013 – Avviso n. 3264 del 28/12/2021, by the spoke “FutureHPC & BigData” of the ICSC – Centro Nazionale di Ricerca in High-Performance Computing, Big Data and Quantum Computing funded by European Union – NextGenerationEU – PNRR, by the Italian Ministry of University and Research “Progetti di Rilevante Interesse Nazionale” project: “Multicriteria data structures and algorithms” (grant n. 2017WR7SHH).

Author information

Authors and Affiliations

Department of Computer Science, University of Pisa, Pisa, Italy
Paolo Ferragina, Mariagiovanna Rotundo & Giorgio Vinciguerra

Corresponding author

Correspondence to Giorgio Vinciguerra.

Editor information

Editors and Affiliations

ISTI-CNR, Pisa, Italy
Franco Maria Nardini
University of Pisa, Pisa, Italy
Nadia Pisanti
University of Pisa, Pisa, Italy
Rossano Venturini

Cite this paper

Ferragina, P., Rotundo, M., Vinciguerra, G. (2023). Engineering a Textbook Approach to Index Massive String Dictionaries. In: Nardini, F.M., Pisanti, N., Venturini, R. (eds) String Processing and Information Retrieval. SPIRE 2023. Lecture Notes in Computer Science, vol 14240. Springer, Cham. https://doi.org/10.1007/978-3-031-43980-3_16

DOI: https://doi.org/10.1007/978-3-031-43980-3_16
Published: 20 September 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43979-7
Online ISBN: 978-3-031-43980-3
eBook Packages: Computer Science, Computer Science (R0)
