A Lempel-Ziv Text Index on Secondary Storage

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4580)

Abstract

Full-text searching consists in locating the occurrences of a given pattern P[1..m] in a text T[1..u], both sequences over an alphabet of size σ. In this paper we define a new index for full-text searching on secondary storage, based on the Lempel-Ziv compression algorithm and requiring $8uH_k + o(u \log \sigma)$ bits of space, where $H_k$ denotes the k-th order empirical entropy of T, for any $k = o(\log_\sigma u)$. Our experimental results show that our index is significantly smaller than any other practical secondary-memory data structure: 1.4–2.3 times the text size including the text, which means 39%–65% of the size of traditional indexes like String B-trees [Ferragina and Grossi, JACM 1999]. In exchange, our index requires more disk accesses to locate the pattern occurrences. Our index is able to report up to 600 occurrences per disk access, for a disk page of 32 kilobytes. If we only need to count pattern occurrences, the space can be reduced to about 1.04–1.68 times the text size, requiring about 20–60 disk accesses, depending on the pattern length.
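To make the quantities in the space bound concrete, the following Python sketch computes the k-th order empirical entropy of a small text and the leading term 8uH_k. It assumes the standard definition of empirical entropy: for every length-k context, take the zero-order entropy of the symbols that follow it, and average these weighted by how often each context occurs. The function names are hypothetical and chosen for illustration only; this is not the authors' code, and it ignores the o(u log σ) term.

```python
from collections import Counter, defaultdict
from math import log2


def zero_order_entropy(counts, total):
    """Zero-order empirical entropy, in bits per symbol, of a symbol multiset."""
    return sum((c / total) * log2(total / c) for c in counts.values())


def empirical_entropy(text, k):
    """k-th order empirical entropy H_k of `text`, in bits per symbol.

    Assumed (standard) definition: group each symbol by the k symbols that
    precede it, then average the zero-order entropies of the groups,
    weighted by group size and normalised by the text length u.
    """
    u = len(text)
    if u == 0:
        return 0.0
    if k == 0:
        return zero_order_entropy(Counter(text), u)
    followers = defaultdict(Counter)
    for i in range(u - k):
        followers[text[i:i + k]][text[i + k]] += 1
    total_bits = 0.0
    for counts in followers.values():
        n = sum(counts.values())
        total_bits += n * zero_order_entropy(counts, n)
    return total_bits / u


def leading_space_term_bits(text, k):
    """Leading term 8 * u * H_k of the space bound (lower-order term omitted)."""
    return 8 * len(text) * empirical_entropy(text, k)
```

For the text "abab", H_0 is 1 bit per symbol while H_1 is 0, because each symbol is fully determined by the one before it. The leading term therefore drops from 32 bits at k = 0 to 0 bits at k = 1, which illustrates why higher-order entropy bounds can be much smaller on repetitive texts.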

Supported in part by CONICYT PhD Fellowship Program (first author) and Fondecyt Grant 1-050493 (second author).

References

  1. Apostolico, A.: The myriad virtues of subword trees. In: Combinatorial Algorithms on Words. NATO ISI Series, pp. 85–96. Springer, Heidelberg (1985)

  2. Kurtz, S.: Reducing the space requirement of suffix trees. Softw. Pract. Exper. 29(13), 1149–1171 (1999)

  3. Manzini, G.: An analysis of the Burrows-Wheeler transform. JACM 48(3), 407–430 (2001)

  4. Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Computing Surveys (to appear)

  5. Ferragina, P., Manzini, G.: Indexing compressed texts. JACM 52(4), 552–581 (2005)

  6. Moura, E., Navarro, G., Ziviani, N., Baeza-Yates, R.: Fast and flexible word searching on compressed text. ACM TOIS 18(2), 113–139 (2000)

  7. Ferragina, P., Grossi, R.: The String B-tree: a new data structure for string search in external memory and its applications. JACM 46(2), 236–280 (1999)

  8. Ferragina, P., Grossi, R.: Fast string searching in secondary storage: theoretical developments and experimental results. In: Proc. SODA, pp. 373–382 (1996)

  9. Clark, D., Munro, J.I.: Efficient suffix trees on secondary storage. In: Proc. SODA, pp. 383–391 (1996)

  10. Mäkinen, V., Navarro, G., Sadakane, K.: Advantages of backward searching — efficient secondary memory and distributed implementation of compressed suffix arrays. In: Proc. ISAAC, pp. 681–692 (2004)

  11. Sadakane, K.: Succinct representations of lcp information and improvements in the compressed suffix arrays. In: Proc. SODA, pp. 225–232 (2002)

  12. Navarro, G.: Indexing text using the Ziv-Lempel trie. J. of Discrete Algorithms 2(1), 87–114 (2004)

  13. Ziv, J., Lempel, A.: Compression of individual sequences via variable-rate coding. IEEE TIT 24(5), 530–536 (1978)

  14. Kosaraju, R., Manzini, G.: Compression of low entropy strings with Lempel-Ziv algorithms. SIAM J.Comp. 29(3), 893–911 (1999)

  15. Arroyuelo, D., Navarro, G., Sadakane, K.: Reducing the space requirement of LZ-index. In: Proc. CPM, pp. 319–330 (2006)

  16. Arroyuelo, D., Navarro, G.: Space-efficient construction of LZ-index. In: Proc. ISAAC, pp. 1143–1152 (2005)

  17. Munro, I., Raman, V.: Succinct representation of balanced parentheses and static trees. SIAM J.Comp. 31(3), 762–776 (2001)

  18. Munro, I.: Tables. In: Chandru, V., Vinay, V. (eds.) Foundations of Software Technology and Theoretical Computer Science. LNCS, vol. 1180, pp. 37–42. Springer, Heidelberg (1996)

  19. Arroyuelo, D., Navarro, G.: A Lempel-Ziv text index on secondary storage. Technical Report TR/DCC-2004, -4, Dept. of Computer Science, Universidad de Chile (2007), ftp://ftp.dcc.uchile.cl/pub/users/gnavarro/lzidisk.ps.gz

  20. Morrison, D.R.: Patricia – practical algorithm to retrieve information coded in alphanumeric. JACM 15(4), 514–534 (1968)

  21. Harman, D.: Overview of the third text REtrieval conference. In: Proc. Third Text REtrieval Conference (TREC-3), NIST Special Publication 500-207 (1995)

  22. Baeza-Yates, R., Barbosa, E.F., Ziviani, N.: Hierarchies of indices for text searching. Inf. Systems 21(6), 497–514 (1996)

  23. Manber, U., Myers, G.: Suffix arrays: A new method for on-line string searches. SIAM J. Comp. 22(5), 935–948 (1993)

  24. González, R., Navarro, G.: Compressed text indexes with fast locate. In: Proc. of CPM’07. LNCS (to appear, 2007)

Editor information

Bin Ma, Kaizhong Zhang

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Arroyuelo, D., Navarro, G. (2007). A Lempel-Ziv Text Index on Secondary Storage. In: Ma, B., Zhang, K. (eds) Combinatorial Pattern Matching. CPM 2007. Lecture Notes in Computer Science, vol 4580. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73437-6_11

  • DOI: https://doi.org/10.1007/978-3-540-73437-6_11

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-73436-9

  • Online ISBN: 978-3-540-73437-6

  • eBook Packages: Computer Science, Computer Science (R0)
