Gapped Indexing for Consecutive Occurrences

Algorithmica

Abstract

The classic string indexing problem is to preprocess a string \(S\) of length \(n\) into a compact data structure that supports efficient pattern matching queries. Typical queries include existential queries (decide if the pattern occurs in \(S\)), reporting queries (return all positions where the pattern occurs), and counting queries (return the number of occurrences of the pattern). In this paper we consider a variant of string indexing, where the goal is to compactly represent the string such that given two patterns \(P_1\) and \(P_2\) and a gap range \([\alpha, \beta]\) we can quickly find the consecutive occurrences of \(P_1\) and \(P_2\) with distance in \([\alpha, \beta]\), i.e., pairs of subsequent occurrences with distance within the range. We present data structures that use linear space and query time \(\widetilde{O}(|P_1|+|P_2|+n^{2/3})\) for existence and counting and \(\widetilde{O}(|P_1|+|P_2|+n^{2/3}\,\mathrm{occ}^{1/3})\) for reporting, where \(\mathrm{occ}\) is the size of the output. We complement this with a conditional lower bound based on the set intersection problem showing that any solution using \(\widetilde{O}(n)\) space must use \(\widetilde{\Omega}(|P_1| + |P_2| + \sqrt{n})\) query time. To obtain our results we develop new techniques and ideas of independent interest including a new suffix tree decomposition and hardness of a variant of the set intersection problem.
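To make the query concrete, the following is a minimal brute-force sketch in Python. It is not the data structure of the paper: it builds no index and simply scans \(S\) at query time. It assumes the usual definition that a consecutive occurrence is a pair \((i, j)\) with \(i < j\), where \(P_1\) occurs at \(i\), \(P_2\) occurs at \(j\), no occurrence of either pattern starts strictly between them, and the distance is \(j - i\). The function names are illustrative only.

```python
def occurrences(s, p):
    """Return all start positions of p in s (overlapping matches allowed)."""
    positions = []
    start = s.find(p)
    while start != -1:
        positions.append(start)
        start = s.find(p, start + 1)
    return positions


def consecutive_occurrences(s, p1, p2, alpha, beta):
    """Report pairs (i, j): p1 at i, p2 at j, i < j, no occurrence of
    either pattern strictly between i and j, and alpha <= j - i <= beta.

    Brute force for illustration; assumes non-empty patterns.
    """
    occ1 = set(occurrences(s, p1))
    occ2 = set(occurrences(s, p2))
    pairs = []
    # Most recent p1 occurrence that has not yet been followed by any
    # occurrence of p1 or p2; None if there is no such occurrence.
    last_p1 = None
    for pos in sorted(occ1 | occ2):
        if pos in occ2 and last_p1 is not None:
            if alpha <= pos - last_p1 <= beta:
                pairs.append((last_p1, pos))
        # Every occurrence ends the previous candidate; a p1 occurrence
        # starts a new one.
        last_p1 = pos if pos in occ1 else None
    return pairs


s = "ab_a__b_aa_b"
all_pairs = consecutive_occurrences(s, "a", "b", 0, len(s))
in_range = consecutive_occurrences(s, "a", "b", 2, 3)
print(all_pairs)       # reporting query over all distances
print(in_range)        # reporting query for the gap range [2, 3]
print(len(in_range))   # counting query
print(bool(in_range))  # existence query
```

In the sample string, the occurrence of `a` at position 8 does not pair with the `b` at position 11, because another `a` occurs at position 9 in between. The three query types from the abstract correspond to returning the list, its length, or whether it is non-empty.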

Notes

  1. \({\widetilde{O}}\) and \(\widetilde{\Omega }\) ignore polylogarithmic factors.

References

  1. Alstrup, S., Holm, J., de Lichtenberg, K., Thorup, M.: Minimizing diameters of dynamic trees. In: Proceedings of the 24th ICALP, pp. 270–280 (1997)

  2. Alstrup, S., Holm, J., Thorup, M.: Maintaining center and median in dynamic trees. In: Proceedings of the 7th SWAT, pp. 46–56 (2000)

  3. Alstrup, S., Rauhe, T.: Improved labeling scheme for ancestor queries. In: Proceedings of the 13th SODA, pp. 947–953 (2002)

  4. Amir, A., Chan, T.M., Lewenstein, M., Lewenstein, N.: On hardness of jumbled indexing. In: Proceedings of the 41st ICALP, pp. 114–125 (2014)

  5. Amir, A., Kopelowitz, T., Levy, A., Pettie, S., Porat, E., Shalom, B.R.: Mind the gap: essentially optimal algorithms for online dictionary matching with one gap. In: Proceedings of the 27th ISAAC, pp. 12:1–12:12 (2016)

  6. Apostolico, A., Pizzi, C., Satta, G.: Optimal discovery of subword associations in strings. In: Proceedings of the 7th DS, pp. 270–277 (2004)

  7. Apostolico, A., Pizzi, C., Ukkonen, E.: Efficient algorithms for the discovery of gapped factors. Algorithms Mol. Biol. 6, 5 (2011)

  8. Apostolico, A., Satta, G.: Discovering subword associations in strings in time linear in the output size. J. Discrete Algorithms 7(2), 227–238 (2009)

  9. Bader, J., Gog, S., Petri, M.: Practical variable length gap pattern matching. In: Proceedings of the 15th SEA, pp. 1–16 (2016)

  10. Bille, P., Gørtz, I.L.: The tree inclusion problem: in linear space and faster. ACM Trans. Algorithms 7(3), 1–47 (2011)

  11. Bille, P., Gørtz, I.L.: Substring range reporting. Algorithmica 69(2), 384–396 (2014)

  12. Bille, P., Gørtz, I.L., Pedersen, M.R., Rotenberg, E., Steiner, T.A.: String indexing for top-\(k\) close consecutive occurrences. In: Proceedings of the 40th FSTTCS, pp. 14:1–14:17 (2020)

  13. Bille, P., Gørtz, I.L., Pedersen, M.R., Steiner, T.A.: Gapped indexing for consecutive occurrences. In: Proceedings of the 32nd CPM, pp. 10:1–10:19 (2021)

  14. Bille, P., Gørtz, I.L., Vildhøj, H.W., Vind, S.: String indexing for patterns with wildcards. Theory Comput. Syst. 55(1), 41–60 (2014)

  15. Bille, P., Gørtz, I.L., Vildhøj, H.W., Wind, D.K.: String matching with variable length gaps. Theor. Comput. Sci. 443 (2012). Announced at SPIRE (2010)

  16. Biswas, S., Ganguly, A., Shah, R., Thankachan, S.V.: Ranked document retrieval for multiple patterns. Theor. Comput. Sci. 746, 98–111 (2018)

  17. Bucher, P., Bairoch, A.: A generalized profile syntax for biomolecular sequence motifs and its function in automatic sequence interpretation. In: Proceedings of the 2nd ISMB, pp. 53–61 (1994)

  18. Cáceres, M., Puglisi, S.J., Zhukova, B.: Fast indexes for gapped pattern matching. In: Proceedings of the 46th SOFSEM, pp. 493–504 (2020)

  19. Cohen, H., Porat, E.: Fast set intersection and two-patterns matching. Theor. Comput. Sci. 411(40–42), 3795–3800 (2010)

  20. Ferragina, P., Koudas, N., Muthukrishnan, S., Srivastava, D.: Two-dimensional substring indexing. J. Comput. Syst. Sci. 66(4), 763–774 (2003)

  21. Frederickson, G.N.: Ambivalent data structures for dynamic 2-edge-connectivity and \(k\) smallest spanning trees. SIAM J. Comput. 26(2), 484–538 (1997)

  22. Fredman, M.L., Komlós, J., Szemerédi, E.: Storing a sparse table with \(O(1)\) worst case access time. J. ACM 31(3), 538–544 (1984)

  23. Fredriksson, K., Grabowski, S.: Efficient algorithms for pattern matching with general gaps, character classes, and transposition invariance. Inf. Retr. 11(4), 335–357 (2008)

  24. Goldstein, I., Kopelowitz, T., Lewenstein, M., Porat, E.: Conditional lower bounds for space/time tradeoffs. In: Proceedings of the 15th WADS, pp. 421–436. Springer (2017)

  25. Haapasalo, T., Silvasti, P., Sippu, S., Soisalon-Soininen, E.: Online dictionary matching with variable-length gaps. In: Proceedings of the 10th SEA, pp. 76–87 (2011)

  26. Hofmann, K., Bucher, P., Falquet, L., Bairoch, A.: The PROSITE database, its status in 1999. Nucleic Acids Res. 27(1), 215–219 (1999)

  27. Hon, W., Patil, M., Shah, R., Thankachan, S.V., Vitter, J.S.: Indexes for document retrieval with relevance. In: Space-Efficient Data Structures, Streams, and Algorithms—Papers in Honor of J. Ian Munro on the Occasion of His 66th Birthday, pp. 351–362 (2013)

  28. Hon, W., Thankachan, S.V., Shah, R., Vitter, J.S.: Faster compressed top-k document retrieval. In: Proceedings of the 23rd DCC, pp. 341–350 (2013)

  29. Hon, W.K., Patil, M., Shah, R., Wu, S.B.: Efficient index for retrieving top-k most frequent documents. J. Discrete Algorithms 8(4), 402–417 (2010)

  30. Hon, W.K., Shah, R., Thankachan, S.V., Vitter, J.S.: Space-efficient frameworks for top-k string retrieval. J. ACM 61(2), 1–36 (2014). Announced at 50th FOCS

  31. Iliopoulos, C.S., Rahman, M.S.: Indexing factors with gaps. Algorithmica 55(1), 60–70 (2009)

  32. Keller, O., Kopelowitz, T., Lewenstein, M.: Range non-overlapping indexing and successive list indexing. In: Proceedings of the 11th WADS, pp. 625–636 (2007)

  33. Kopelowitz, T., Krauthgamer, R.: Color-distance oracles and snippets. In: Grossi, R., Lewenstein, M. (Eds.) Proceedings of the 27th CPM, pp. 24:1–24:10 (2016)

  34. Kopelowitz, T., Pettie, S., Porat, E.: Higher lower bounds from the 3SUM conjecture. In: Proceedings of the 27th SODA, pp. 1272–1287 (2016)

  35. Larsen, K.G., Munro, J.I., Nielsen, J.S., Thankachan, S.V.: On hardness of several string indexing problems. Theor. Comput. Sci. 582, 74–82 (2015)

  36. Lewenstein, M.: Indexing with gaps. In: Proceedings of the 18th SPIRE, pp. 135–143 (2011)

  37. Mehldau, G., Myers, G.: A system for pattern matching applications on biosequences. Bioinformatics 9(3), 299–314 (1993)

  38. Munro, J.I., Navarro, G., Nielsen, J.S., Shah, R., Thankachan, S.V.: Top-k term-proximity in succinct space. Algorithmica 78(2), 379–393 (2017). Announced at 25th ISAAC

  39. Munro, J.I., Navarro, G., Shah, R., Thankachan, S.V.: Ranked document selection. Theor. Comput. Sci. 812, 149–159 (2020)

  40. Myers, E.W.: Approximate matching of network expressions with spacers. J. Comput. Biol. 3(1), 33–51 (1992)

  41. Navarro, G.: Spaces, trees, and colors: the algorithmic landscape of document retrieval on sequences. ACM Comput. Surv. 46(4), 1–47 (2014)

  42. Navarro, G., Nekrich, Y.: Time-optimal top-k document retrieval. SIAM J. Comput. 46(1), 80–113 (2017). Announced at 23rd SODA

  43. Navarro, G., Raffinot, M.: Fast and simple character classes and bounded gaps pattern matching, with applications to protein searching. J. Comput. Biol. 10(6), 903–923 (2003)

  44. Navarro, G., Thankachan, S.V.: New space/time tradeoffs for top-k document retrieval on sequences. Theor. Comput. Sci. 542, 83–97 (2014). Announced at 20th SPIRE

  45. Navarro, G., Thankachan, S.V.: Reporting consecutive substring occurrences under bounded gap constraints. Theor. Comput. Sci. 638, 108–111 (2016). Announced at 26th CPM

  46. Nekrich, Y., Navarro, G.: Sorted range reporting. In: Proceedings of the 13th SWAT, pp. 271–282 (2012)

  47. Shah, R., Sheng, C., Thankachan, S.V., Vitter, J.S.: Top-k document retrieval in external memory. In: Proceedings of the 21st ESA, pp. 803–814 (2013)

  48. Tsur, D.: Top-k document retrieval in optimal space. Inf. Process. Lett. 113(12), 440–443 (2013)

  49. Weiner, P.: Linear pattern matching algorithms. In: Proceedings of the 14th FOCS, pp. 1–11 (1973)

  50. Willard, D.E.: Log-logarithmic worst-case range queries are possible in space \(\Theta(n)\). Inf. Process. Lett. 17(2), 81–84 (1983). https://doi.org/10.1016/0020-0190(83)90075-3

  51. Zhou, G.: Two-dimensional range successor in optimal time and almost linear space. Inf. Process. Lett. 116(2), 171–174 (2016)

Author information

Corresponding author

Correspondence to Teresa Anna Steiner.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

A preliminary version of this paper appeared at CPM 2021 [13]. P. Bille, I. L. Gørtz, and M. R. Pedersen were supported by the Danish Research Council Grant DFF-8021-002498.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Bille, P., Gørtz, I.L., Pedersen, M.R. et al. Gapped Indexing for Consecutive Occurrences. Algorithmica 85, 879–901 (2023). https://doi.org/10.1007/s00453-022-01051-6

  • DOI: https://doi.org/10.1007/s00453-022-01051-6
