Abstract
We study the problem of engineering space-time efficient indexes that support membership and lexicographic (rank) queries on very large static dictionaries of strings.
Our solution is based on a very simple approach that consists of decoupling string storage and string indexing by means of a blockwise compression of the sorted dictionary strings (to be stored in external memory) and a succinct implementation of a Patricia trie (to be stored in internal memory) built on the first string of each block.
Our experimental evaluation on two new datasets, which are at least one order of magnitude larger than the ones used in the literature, shows that (i) the state-of-the-art compressed string dictionaries (such as FST, PDT, CoCo-trie) do not provide significant benefits over Patricia tries when used in an indexing setting, and (ii) our two-level approach indexes 3.5 billion strings taking 273 GB in less than 200 MB of internal memory, an amount available on any commodity machine, while still guaranteeing query performance comparable to or faster than that of the array-based solutions used in modern storage systems such as RocksDB, thus possibly influencing their future designs.
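The two-level layout described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class name TwoLevelIndex, its methods, and the block_size parameter are hypothetical. A plain sorted list of first strings stands in for the succinct Patricia trie, and uncompressed in-memory lists stand in for the compressed blocks kept in external memory. The input is assumed to be sorted and free of duplicates.

```python
from bisect import bisect_left, bisect_right


class TwoLevelIndex:
    """Toy two-level string dictionary (hypothetical names throughout)."""

    def __init__(self, sorted_strings, block_size=4):
        self.block_size = block_size
        # Level 2: consecutive blocks of the sorted strings. In the paper's
        # setting these would be compressed and stored in external memory.
        self.blocks = [sorted_strings[i:i + block_size]
                       for i in range(0, len(sorted_strings), block_size)]
        # Level 1: the first string of each block. A sorted list is used here
        # as a stand-in for the succinct Patricia trie held in internal memory.
        self.heads = [block[0] for block in self.blocks]

    def _find_block(self, s):
        # Index of the last block whose first string is <= s, or -1 if none.
        return bisect_right(self.heads, s) - 1

    def rank(self, s):
        """Number of dictionary strings lexicographically smaller than s."""
        b = self._find_block(s)
        if b < 0:
            return 0
        # Every earlier block is full and holds only strings smaller than s.
        return b * self.block_size + bisect_left(self.blocks[b], s)

    def contains(self, s):
        """Membership query: a single block is inspected."""
        b = self._find_block(s)
        if b < 0:
            return False
        block = self.blocks[b]
        i = bisect_left(block, s)
        return i < len(block) and block[i] == s
```

In this sketch each query touches the first-level index and then exactly one block, which mirrors the decoupling of string indexing from string storage: only the first strings need to reside in internal memory.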
Acknowledgements
We thank Antonio Boffa for executing some tests on the CoCo-trie, and the Green Data Centre at the University of Pisa for machines and technical support. We also thank Roberto Di Cosmo, Valentin Lorentz, Stefano Zacchiroli, and the Software Heritage team for providing us with the Filenames dataset. This work was made possible by Software Heritage, the great library of source code: https://www.softwareheritage.org.
This work has been supported by the European Union – Horizon 2020 Program under the scheme “INFRAIA-01-2018-2019 – Integrating Activities for Advanced Communities”, Grant Agreement n. 871042, “SoBigData++: European Integrated Infrastructure for Social Mining and Big Data Analytics” http://www.sobigdata.eu, by the NextGenerationEU – National Recovery and Resilience Plan (Piano Nazionale di Ripresa e Resilienza, PNRR) – Project: “SoBigData.it - Strengthening the Italian RI for Social Mining and Big Data Analytics” – Prot. IR0000013 – Avviso n. 3264 del 28/12/2021, by the spoke “FutureHPC & BigData” of the ICSC – Centro Nazionale di Ricerca in High-Performance Computing, Big Data and Quantum Computing funded by European Union – NextGenerationEU – PNRR, and by the Italian Ministry of University and Research “Progetti di Rilevante Interesse Nazionale” project: “Multicriteria data structures and algorithms” (grant n. 2017WR7SHH).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Ferragina, P., Rotundo, M., Vinciguerra, G. (2023). Engineering a Textbook Approach to Index Massive String Dictionaries. In: Nardini, F.M., Pisanti, N., Venturini, R. (eds) String Processing and Information Retrieval. SPIRE 2023. Lecture Notes in Computer Science, vol 14240. Springer, Cham. https://doi.org/10.1007/978-3-031-43980-3_16
DOI: https://doi.org/10.1007/978-3-031-43980-3_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43979-7
Online ISBN: 978-3-031-43980-3
eBook Packages: Computer Science (R0)