Grammar Index by Induced Suffix Sorting

  • Conference paper
String Processing and Information Retrieval (SPIRE 2021)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12944)

Abstract

We propose a new compressed text index built upon a grammar compression based on induced suffix sorting [Nunes et al., DCC’18]. We show that this grammar exhibits a locality-sensitive parsing property: given a pattern P, we can specify certain substrings of P, called cores, whose occurrences are similarly parsed in the text grammar whenever they are extensible to occurrences of P. Using the cores, given a pattern of length m, we can locate all its \(\text{occ}\) occurrences in a text T of length n within \(\mathcal{O}(m \lg |\mathcal{S}| + \text{occ}_C \lg |\mathcal{S}| \lg n + \text{occ})\) time, where \(\mathcal{S}\) is the set of all characters and non-terminals, and \(\text{occ}_C\) is the number of occurrences of a chosen core C of P in the right-hand sides of all production rules of the grammar of T. Our grammar index requires \(\mathcal{O}(g)\) words of space and can be built in \(\mathcal{O}(n)\) time using \(\mathcal{O}(g)\) working space, where g is the sum of the lengths of the right-hand sides of all production rules. Our experiments indicate that the proposed index excels at locating long patterns in highly repetitive texts. Our implementation is available at https://github.com/TooruAkagi/GCIS_Index.
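
The grammar referred to in the abstract is derived from induced suffix sorting, which factorizes a text at its LMS (leftmost S-type) positions. The following is a minimal sketch of that factorization only, using the standard S/L type definitions from the suffix sorting literature; it is not the authors' implementation, and the function names `classify_types`, `lms_positions`, and `lms_substrings` are hypothetical. It assumes the text ends with a unique sentinel `$` that is smaller than every other character.

```python
from typing import List


def classify_types(text: str) -> List[str]:
    """Label each position 'S' or 'L' (the final sentinel is S-type by convention)."""
    n = len(text)
    types = ["S"] * n
    # Scan right to left: a position is L-type if its suffix is larger than the next suffix.
    for i in range(n - 2, -1, -1):
        if text[i] > text[i + 1]:
            types[i] = "L"
        elif text[i] == text[i + 1]:
            types[i] = types[i + 1]  # equal characters inherit the type to the right
    return types


def lms_positions(text: str) -> List[int]:
    """Positions that are S-type and directly preceded by an L-type position."""
    types = classify_types(text)
    return [i for i in range(1, len(text)) if types[i] == "S" and types[i - 1] == "L"]


def lms_substrings(text: str) -> List[str]:
    """Substrings spanning consecutive LMS positions (both ends inclusive), plus the sentinel."""
    pos = lms_positions(text)
    if not pos:
        return []
    spans = [text[p:q + 1] for p, q in zip(pos, pos[1:])]
    return spans + [text[pos[-1]:]]


if __name__ == "__main__":
    print(classify_types("banana$"))
    print(lms_positions("banana$"))
    print(lms_substrings("banana$"))
```

In this sketch, a grammar compressor of the kind cited in the abstract would go on to replace each distinct LMS substring by a non-terminal and recurse on the shortened text; that step is omitted here because its details are not given on this page.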

Notes

  1. SAIS uses a slightly different order on the LMS substrings, called the LMS-order. It differs from the lexicographic order only when comparing two LMS substrings where one is a prefix of the other. In that case, the LMS-order gives the longer string the smaller rank.

  2. GCIS-\(\mathsf{nep}\) stands for GCIS with non-terminals encoded plainly.

  3. See https://github.com/mpetri/FM-Index, https://github.com/tkbtkysms/esp-index-I, and https://github.com/nicolaprezza/r-index, respectively.

  4. To save space, we renamed the datasets commoncrawl.ascii.txt and einstein.de.txt to commoncrawl and einstein.de, respectively.
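
The LMS-order described in Note 1 can be illustrated with a small comparator. This is a sketch under the assumption that the order is exactly as the note states (lexicographic, except that a string ranks before any of its proper prefixes); the name `lms_order_cmp` is hypothetical, and arbitrary short strings stand in for LMS substrings.

```python
from functools import cmp_to_key


def lms_order_cmp(a: str, b: str) -> int:
    """Compare two strings in the LMS-order as described in Note 1."""
    if a == b:
        return 0
    if a.startswith(b):
        return -1  # a properly extends b, so the longer string a ranks smaller
    if b.startswith(a):
        return 1   # b properly extends a, so the longer string b ranks smaller
    # Neither is a prefix of the other: fall back to the lexicographic order.
    return -1 if a < b else 1


if __name__ == "__main__":
    words = ["ab", "abc", "aa"]
    print(sorted(words))                                  # lexicographic order
    print(sorted(words, key=cmp_to_key(lms_order_cmp)))   # LMS-order
```

One way to read this rule is that each string behaves as if it were terminated by a symbol larger than every character, which is why the comparator yields a consistent total order.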

References

  1. Buchsbaum, A.L., Kaplan, H., Rogers, A., Westbrook, J.R.: Linear-time pointer-machine algorithms for least common ancestors, MST verification, and dominators. In: Proceedings of the STOC, pp. 279–288 (1998)

  2. Burrows, M., Wheeler, D.J.: A block sorting lossless data compression algorithm. Technical report 124, Digital Equipment Corporation, Palo Alto, California (1994)

  3. Christiansen, A.R., Ettienne, M.B., Kociumaka, T., Navarro, G., Prezza, N.: Optimal-time dictionary-compressed indexes. ACM Trans. Algorithms 17(1), 8:1–8:39 (2021)

  4. Claude, F., Fariña, A., Martínez-Prieto, M.A., Navarro, G.: Universal indexes for highly repetitive document collections. Inf. Syst. 61, 1–23 (2016)

  5. Claude, F., Navarro, G.: Self-indexed grammar-based compression. Fundam. Inform. 111(3), 313–337 (2011)

  6. Claude, F., Navarro, G.: Improved grammar-based compressed indexes. In: Calderón-Benavides, L., González-Caro, C., Chávez, E., Ziviani, N. (eds.) SPIRE 2012. LNCS, vol. 7608, pp. 180–192. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-34109-0_19

  7. Cormode, G., Muthukrishnan, S.: The string edit distance matching problem with moves. ACM Trans. Algorithms 3(1), 2:1–2:19 (2007)

  8. Díaz-Domínguez, D., Navarro, G.: A grammar compressor for collections of reads with applications to the construction of the BWT. In: Proceedings of the DCC, pp. 83–92 (2021)

  9. Dinklage, P., Fischer, J., Köppl, D., Löbel, M., Sadakane, K.: Compression with the tudocomp framework. In: Proceedings of the SEA. LIPIcs, vol. 75, pp. 13:1–13:22 (2017)

  10. Du, C.F., Mousavi, H., Schaeffer, L., Shallit, J.O.: Decision algorithms for Fibonacci-automatic words, with applications to pattern avoidance. CoRR abs/1406.0670 (2014). http://arxiv.org/abs/1406.0670

  11. Elias, P.: Efficient storage and retrieval by content and address of static files. J. ACM 21(2), 246–260 (1974)

  12. Elias, P.: Universal codeword sets and representations of the integers. IEEE Trans. Inf. Theory 21(2), 194–203 (1975)

  13. Farach-Colton, M., Ferragina, P., Muthukrishnan, S.: On the sorting-complexity of suffix tree construction. J. ACM 47(6), 987–1011 (2000)

  14. Ferragina, P., González, R., Navarro, G., Venturini, R.: Compressed text indexes: from theory to practice. ACM J. Exp. Algorithmics 13, 1.12:1-1.123:1 (2008)

  15. Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: Proceedings of the FOCS, pp. 390–398 (2000)

  16. Fischer, J., I, T., Köppl, D.: Deterministic sparse suffix sorting in the restore model. ACM Trans. Algorithms 16(4), 50:1-50:53 (2020)

  17. Gagie, T., I, T., Manzini, G., Navarro, G., Sakamoto, H., Takabatake, Y.: Rpair: Rescaling RePair with Rsync. CoRR abs/1906.00809 (2019)

  18. Gagie, T., Navarro, G., Prezza, N.: Optimal-time text indexing in BWT-runs bounded space. In: Proceedings of the SODA, pp. 1459–1477 (2018)

  19. Gusfield, D.: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge (1997)

  20. Kieffer, J.C., Yang, E.: Grammar-based codes: a new class of universal lossless source codes. IEEE Trans. Inf. Theory 46(3), 737–754 (2000)

  21. Knuth, D.E., Morris, J.H., Pratt, V.R.: Fast pattern matching in strings. SIAM J. Comput. 6(2), 323–350 (1977)

  22. Larsson, N.J., Moffat, A.: Offline dictionary-based compression. In: Proceedings of the DCC, pp. 296–305 (1999)

  23. Manber, U., Myers, E.W.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)

  24. Maruyama, S., Nakahara, M., Kishiue, N., Sakamoto, H.: ESP-index: a compressed index based on edit-sensitive parsing. J. Discret. Algorithms 18, 100–112 (2013)

  25. Mehlhorn, K., Sundar, R., Uhrig, C.: Maintaining dynamic sequences under equality tests in polylogarithmic time. Algorithmica 17(2), 183–198 (1997)

  26. Nishimoto, T., I, T., Inenaga, S., Bannai, H., Takeda, M.: Dynamic index and LZ factorization in compressed space. Discret. Appl. Math. 274, 116–129 (2020)

  27. Nong, G., Zhang, S., Chan, W.H.: Two efficient algorithms for linear time suffix array construction. IEEE Trans. Comput. 60(10), 1471–1484 (2011)

  28. Nunes, D.S.N., Louza, F.A., Gog, S., Ayala-Rincón, M., Navarro, G.: Grammar compression by induced suffix sorting (2020)

  29. Nunes, D.S.N., da Louza, F.A., Gog, S., Ayala-Rincón, M., Navarro, G.: A grammar compression algorithm based on induced suffix sorting. In: Proceedings of the DCC, pp. 42–51 (2018)

  30. Sahinalp, S.C., Vishkin, U.: Efficient approximate and dynamic matching of patterns using a labeling paradigm (extended abstract). In: Proceedings of the FOCS, pp. 320–328 (1996)

  31. Takabatake, Y., Nakashima, K., Kuboyama, T., Tabei, Y., Sakamoto, H.: siEDM: an efficient string index and search algorithm for edit distance with moves. Algorithms 9(2), 26:1–26:18 (2016)

  32. Takabatake, Y., Tabei, Y., Sakamoto, H.: Improved ESP-index: a practical self-index for highly repetitive texts. In: Gudmundsson, J., Katajainen, J. (eds.) SEA 2014. LNCS, vol. 8504, pp. 338–350. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-07959-2_29

  33. Tsuruta, K., Köppl, D., Nakashima, Y., Inenaga, S., Bannai, H., Takeda, M.: Grammar-compressed self-index with Lyndon words. IPSJ TOM 13(2), 84–92 (2020)

  34. Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 23(3), 337–343 (1977)

Acknowledgements

This work was supported by JSPS KAKENHI grant numbers JP21K17701 (DK), JP21K17705 (YN), JP20H04141 (HB), JP18H04098 (MT), and JST PRESTO grant number JPMJPR1922 (SI).

Author information

Corresponding author

Correspondence to Dominik Köppl.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Akagi, T., Köppl, D., Nakashima, Y., Inenaga, S., Bannai, H., Takeda, M. (2021). Grammar Index by Induced Suffix Sorting. In: Lecroq, T., Touzet, H. (eds) String Processing and Information Retrieval. SPIRE 2021. Lecture Notes in Computer Science, vol 12944. Springer, Cham. https://doi.org/10.1007/978-3-030-86692-1_8

  • DOI: https://doi.org/10.1007/978-3-030-86692-1_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86691-4

  • Online ISBN: 978-3-030-86692-1

  • eBook Packages: Computer Science, Computer Science (R0)
