Grammar Index by Induced Suffix Sorting

Akagi, Tooru; Köppl, Dominik; Nakashima, Yuto; Inenaga, Shunsuke; Bannai, Hideo; Takeda, Masayuki

doi:10.1007/978-3-030-86692-1_8

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12944)

Included in the following conference series:

International Symposium on String Processing and Information Retrieval

Abstract

We propose a new compressed text index built upon a grammar compression based on induced suffix sorting [Nunes et al., DCC’18]. We show that this grammar exhibits a locality-sensitive parsing property, which allows us to specify, given a pattern P, certain substrings of P, called cores, that are similarly parsed in the text grammar whenever their occurrences can be extended to occurrences of P. Using the cores, given a pattern of length m, we can locate all its \(\text{occ}\) occurrences in a text T of length n within \(\mathcal{O}(m \lg |\mathcal{S}| + \text{occ}_C \lg |\mathcal{S}| \lg n + \text{occ})\) time, where \(\mathcal{S}\) is the set of all characters and non-terminals, and \(\text{occ}_C\) is the number of occurrences of a chosen core C of P in the right-hand sides of all production rules of the grammar of T. Our grammar index requires \(\mathcal{O}(g)\) words of space and can be built in \(\mathcal{O}(n)\) time using \(\mathcal{O}(g)\) working space, where g is the sum of the lengths of the right-hand sides of all production rules. Our practical evaluation indicates that the proposed index excels at locating long patterns in highly repetitive texts. Our implementation is available at https://github.com/TooruAkagi/GCIS_Index.
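
As a rough illustration of the parsing that such a grammar builds on, the following Python sketch performs a single round: it classifies suffixes as S-type or L-type, cuts the text at the LMS positions, and replaces each distinct factor by a non-terminal. It is a simplified sketch under stated assumptions, not the authors' implementation (for that, see the repository linked above). The function names `lms_positions` and `one_round` are hypothetical, no sentinel is appended, factors do not overlap, and non-terminals are numbered by first occurrence rather than by rank.

```python
def lms_positions(text):
    """Start positions of LMS suffixes of `text`, without a sentinel.

    Assumption: the last suffix is treated as L-type, which mirrors the
    classification obtained when a smallest sentinel is appended.
    """
    n = len(text)
    if n == 0:
        return []
    # is_s[i] is True when suffix i is S-type (smaller than suffix i + 1).
    is_s = [False] * n
    for i in range(n - 2, -1, -1):
        if text[i] < text[i + 1]:
            is_s[i] = True
        elif text[i] == text[i + 1]:
            is_s[i] = is_s[i + 1]
    # An LMS position is an S-type position preceded by an L-type position.
    return [i for i in range(1, n) if is_s[i] and not is_s[i - 1]]


def one_round(text):
    """Cut `text` at its LMS positions and name each distinct factor.

    Returns the reduced sequence of non-terminal ids and the rule table
    mapping each id to the factor it derives.
    """
    cuts = [0] + lms_positions(text) + [len(text)]
    factors = [text[a:b] for a, b in zip(cuts, cuts[1:]) if a < b]
    ids = {}       # factor -> non-terminal id (by first occurrence)
    reduced = []
    for factor in factors:
        if factor not in ids:
            ids[factor] = len(ids)
        reduced.append(ids[factor])
    rules = {nt: factor for factor, nt in ids.items()}
    return reduced, rules


if __name__ == "__main__":
    reduced, rules = one_round("mississippi")
    print(reduced, rules)
```

For the text `mississippi`, the LMS positions are 1, 4, and 7, so the factors are `m`, `iss`, `iss`, and `ippi`; the two equal factors share one non-terminal. A full grammar would repeat such a round on the reduced sequence.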

Notes

1. SAIS relies on a slightly different order on the LMS substrings, called LMS-order. It differs from the lexicographic order when comparing two LMS substrings of which one is a prefix of the other: in that case, the LMS-order gives the longer string the smaller rank.
2. GCIS-\(\mathsf{nep}\) stands for GCIS with non-terminals encoded plainly.
3. See https://github.com/mpetri/FM-Index, https://github.com/tkbtkysms/esp-index-I, and https://github.com/nicolaprezza/r-index, respectively.
4. To save space, we renamed the datasets commoncrawl.ascii.txt and einstein.de.txt to commoncrawl and einstein.de, respectively.
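
The LMS-order described in note 1 can be sketched as a comparison function. The following Python snippet is only an illustration of that rule: the name `lms_order_cmp` is hypothetical, and the toy strings in the usage lines are chosen to show the prefix case, not taken from an actual LMS factorization.

```python
from functools import cmp_to_key


def lms_order_cmp(a, b):
    """Compare two strings in LMS-order.

    Same as lexicographic order, except that when one string is a proper
    prefix of the other, the longer string gets the smaller rank.
    """
    if a == b:
        return 0
    if a.startswith(b):   # b is a proper prefix of a, so a ranks first
        return -1
    if b.startswith(a):   # a is a proper prefix of b, so b ranks first
        return 1
    return -1 if a < b else 1


if __name__ == "__main__":
    words = ["ab", "abc", "aa"]
    print(sorted(words))                                   # lexicographic
    print(sorted(words, key=cmp_to_key(lms_order_cmp)))    # LMS-order
```

With these toy strings, lexicographic sorting puts `ab` before `abc`, whereas the LMS-order puts `abc` before `ab`; the position of `aa` is the same in both.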

References

Buchsbaum, A.L., Kaplan, H., Rogers, A., Westbrook, J.R.: Linear-time pointer-machine algorithms for least common ancestors, MST verification, and dominators. In: Proceedings of the STOC, pp. 279–288 (1998)
Burrows, M., Wheeler, D.J.: A block sorting lossless data compression algorithm. Technical report 124, Digital Equipment Corporation, Palo Alto, California (1994)
Christiansen, A.R., Ettienne, M.B., Kociumaka, T., Navarro, G., Prezza, N.: Optimal-time dictionary-compressed indexes. ACM Trans. Algorithms 17(1), 8:1–8:39 (2021)
Claude, F., Fariña, A., Martínez-Prieto, M.A., Navarro, G.: Universal indexes for highly repetitive document collections. Inf. Syst. 61, 1–23 (2016)
Claude, F., Navarro, G.: Self-indexed grammar-based compression. Fundam. Inform. 111(3), 313–337 (2011)
Claude, F., Navarro, G.: Improved grammar-based compressed indexes. In: Calderón-Benavides, L., González-Caro, C., Chávez, E., Ziviani, N. (eds.) SPIRE 2012. LNCS, vol. 7608, pp. 180–192. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-34109-0_19
Cormode, G., Muthukrishnan, S.: The string edit distance matching problem with moves. ACM Trans. Algorithms 3(1), 2:1–2:19 (2007)
Díaz-Domínguez, D., Navarro, G.: A grammar compressor for collections of reads with applications to the construction of the BWT. In: Proceedings of the DCC, pp. 83–92 (2021)
Dinklage, P., Fischer, J., Köppl, D., Löbel, M., Sadakane, K.: Compression with the tudocomp framework. In: Proceedings of the SEA. LIPIcs, vol. 75, pp. 13:1–13:22 (2017)
Du, C.F., Mousavi, H., Schaeffer, L., Shallit, J.O.: Decision algorithms for Fibonacci-automatic words, with applications to pattern avoidance. CoRR abs/1406.0670 (2014). http://arxiv.org/abs/1406.0670
Elias, P.: Efficient storage and retrieval by content and address of static files. J. ACM 21(2), 246–260 (1974)
Elias, P.: Universal codeword sets and representations of the integers. IEEE Trans. Inf. Theory 21(2), 194–203 (1975)
Farach-Colton, M., Ferragina, P., Muthukrishnan, S.: On the sorting-complexity of suffix tree construction. J. ACM 47(6), 987–1011 (2000)
Ferragina, P., González, R., Navarro, G., Venturini, R.: Compressed text indexes: from theory to practice. ACM J. Exp. Algorithmics 13, 1.12:1–1.123:1 (2008)
Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: Proceedings of the FOCS, pp. 390–398 (2000)
Fischer, J., I, T., Köppl, D.: Deterministic sparse suffix sorting in the restore model. ACM Trans. Algorithms 16(4), 50:1–50:53 (2020)
Gagie, T., I, T., Manzini, G., Navarro, G., Sakamoto, H., Takabatake, Y.: Rpair: Rescaling RePair with Rsync. CoRR abs/1906.00809 (2019)
Gagie, T., Navarro, G., Prezza, N.: Optimal-time text indexing in BWT-runs bounded space. In: Proceedings of the SODA, pp. 1459–1477 (2018)
Gusfield, D.: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge (1997)
Kieffer, J.C., Yang, E.: Grammar-based codes: a new class of universal lossless source codes. IEEE Trans. Inf. Theory 46(3), 737–754 (2000)
Knuth, D.E., Morris, J.H., Pratt, V.R.: Fast pattern matching in strings. SIAM J. Comput. 6(2), 323–350 (1977)
Larsson, N.J., Moffat, A.: Offline dictionary-based compression. In: Proceedings of the DCC, pp. 296–305 (1999)
Manber, U., Myers, E.W.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)
Maruyama, S., Nakahara, M., Kishiue, N., Sakamoto, H.: ESP-index: a compressed index based on edit-sensitive parsing. J. Discret. Algorithms 18, 100–112 (2013)
Mehlhorn, K., Sundar, R., Uhrig, C.: Maintaining dynamic sequences under equality tests in polylogarithmic time. Algorithmica 17(2), 183–198 (1997)
Nishimoto, T., I, T., Inenaga, S., Bannai, H., Takeda, M.: Dynamic index and LZ factorization in compressed space. Discret. Appl. Math. 274, 116–129 (2020)
Nong, G., Zhang, S., Chan, W.H.: Two efficient algorithms for linear time suffix array construction. IEEE Trans. Comput. 60(10), 1471–1484 (2011)
Nunes, D.S.N., Louza, F.A., Gog, S., Ayala-Rincón, M., Navarro, G.: Grammar compression by induced suffix sorting (2020)
Nunes, D.S.N., da Louza, F.A., Gog, S., Ayala-Rincón, M., Navarro, G.: A grammar compression algorithm based on induced suffix sorting. In: Proceedings of the DCC, pp. 42–51 (2018)
Sahinalp, S.C., Vishkin, U.: Efficient approximate and dynamic matching of patterns using a labeling paradigm (extended abstract). In: Proceedings of the FOCS, pp. 320–328 (1996)
Takabatake, Y., Nakashima, K., Kuboyama, T., Tabei, Y., Sakamoto, H.: siEDM: an efficient string index and search algorithm for edit distance with moves. Algorithms 9(2), 26:1–26:18 (2016)
Takabatake, Y., Tabei, Y., Sakamoto, H.: Improved ESP-index: a practical self-index for highly repetitive texts. In: Gudmundsson, J., Katajainen, J. (eds.) SEA 2014. LNCS, vol. 8504, pp. 338–350. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-07959-2_29
Tsuruta, K., Köppl, D., Nakashima, Y., Inenaga, S., Bannai, H., Takeda, M.: Grammar-compressed self-index with Lyndon words. IPSJ TOM 13(2), 84–92 (2020)
Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 23(3), 337–343 (1977)

Acknowledgements

This work was supported by JSPS KAKENHI grant numbers JP21K17701 (DK), JP21K17705 (YN), JP20H04141 (HB), JP18H04098 (MT), and JST PRESTO grant number JPMJPR1922 (SI).

Author information

Authors and Affiliations

Department of Informatics, Kyushu University, Fukuoka, Japan
Tooru Akagi, Yuto Nakashima, Shunsuke Inenaga & Masayuki Takeda
M&D Data Science Center, Tokyo Medical and Dental University, Tokyo, Japan
Dominik Köppl & Hideo Bannai
PRESTO, Japan Science and Technology Agency, Kawaguchi, Japan
Shunsuke Inenaga

Corresponding author

Correspondence to Dominik Köppl.

Editor information

Editors and Affiliations

Université de Rouen Normandie, Mont-St-Aignan, France
Thierry Lecroq
CNRS, CRIStAL, Villeneuve d'Ascq, France
Hélène Touzet

About this paper

Cite this paper

Akagi, T., Köppl, D., Nakashima, Y., Inenaga, S., Bannai, H., Takeda, M. (2021). Grammar Index by Induced Suffix Sorting. In: Lecroq, T., Touzet, H. (eds) String Processing and Information Retrieval. SPIRE 2021. Lecture Notes in Computer Science, vol 12944. Springer, Cham. https://doi.org/10.1007/978-3-030-86692-1_8

DOI: https://doi.org/10.1007/978-3-030-86692-1_8
Published: 27 September 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86691-4
Online ISBN: 978-3-030-86692-1
eBook Packages: Computer Science, Computer Science (R0)