Compressed Indexes for Aligned Pattern Matching

  • Conference paper
String Processing and Information Retrieval (SPIRE 2011)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7024)

Abstract

In many kinds of data, such as protein sequences, the primary sequence is associated with secondary structure labels [6]. Such data can be treated as two sequences aligned character by character. Many DNA and RNA sequences likewise involve linkages that are aligned within the same strand or across different strands. In this paper, we consider the most natural characterization of aligned string data.

The aligned pattern matching problem is to index two input texts \(T_1[1 \ldots n]\) and \(T_2[1 \ldots n]\), each having \(n\) characters taken from an alphabet \(\Sigma\) of size \(\sigma = \mathrm{polylog}(n)\), such that the following query can be answered efficiently: given two query patterns \(P_1\) and \(P_2\), find all text positions \(i\) such that \(P_1\) matches \(T_1[i \ldots (i + |P_1| - 1)]\) and \(P_2\) matches \(T_2[i \ldots (i + |P_2| - 1)]\). Our objective is to design a compressed-space index for this problem, and we obtain the following main results. When the query patterns are sufficiently long (\(|P_1|, |P_2| > \alpha = \Theta(\log^{2+2\epsilon} n)\), where \(\epsilon > 0\)), we can design an index that takes \(nH_k + nH'_k + o(n\log\sigma)\) bits of space and \(O(|P_1| + |P_2| + \log^{4+4\epsilon} n + t)\) query time, where \(H_k\) and \(H'_k\) denote the empirical \(k\)th-order entropy (\(k = o(\log_\sigma n)\)) of \(T_1\) and \(T_2\) respectively, and \(t\) is the number of outputs. Further, we show that designing a compressed/succinct space index with poly-logarithmic query time that works for query patterns of all lengths is at least as hard as designing a linear-space index for 3-dimensional orthogonal range reporting with poly-logarithmic query time. However, we introduce another compressed index that takes \(nH_k + nH'_k + O(n) + o(n\log\sigma)\) bits of space and has a query time of \(O(|P_1|+|P_2|+\sqrt{nt}\log^{2+\epsilon} n)\), which works without any restriction on the length of the patterns.
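To make the query semantics concrete, the following is a minimal sketch of a naive linear scan that answers the aligned query directly from the definition. It is not the compressed index described in the abstract; it is only an unindexed baseline, and the function name `aligned_matches` and its parameter names are hypothetical choices made for illustration.

```python
from typing import List


def aligned_matches(t1: str, t2: str, p1: str, p2: str) -> List[int]:
    """Naive baseline (hypothetical helper, not the paper's index).

    Returns every 1-based position i such that p1 occurs in t1 starting
    at i and p2 occurs in t2 starting at the same position i.
    """
    if len(t1) != len(t2):
        raise ValueError("T1 and T2 must have the same length n")

    n = len(t1)
    hits: List[int] = []

    # Both patterns must fit inside the texts when started at position i.
    last_start = n - max(len(p1), len(p2))

    for i in range(last_start + 1):
        # str.startswith(prefix, start) compares the prefix at offset i.
        if t1.startswith(p1, i) and t2.startswith(p2, i):
            hits.append(i + 1)  # report 1-based positions, as in the text

    return hits
```

For example, with the made-up texts `"ACDACD"` and `"HHEHHE"`, calling `aligned_matches("ACDACD", "HHEHHE", "AC", "HH")` reports positions 1 and 4. The scan takes time proportional to n times the pattern lengths in the worst case, which is the cost an index for this problem aims to avoid.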

This work is supported in part by Taiwan NSC Grant 99-2221-E-007-123 (W. Hon) and US NSF Grant CCF–1017623 (R. Shah).

References

  1. Alstrup, S., Brodal, G.S., Rauhe, T.: New data structures for orthogonal range searching. In: FOCS, pp. 198–207 (2000)

  2. Chazelle, B.: Lower bounds for orthogonal range searching: I. The reporting case. JACM 37, 200–212 (1990)

  3. Bender, M.A., Farach-Colton, M.: The LCA Problem Revisited. In: Gonnet, G.H., Viola, A. (eds.) LATIN 2000. LNCS, vol. 1776, pp. 88–94. Springer, Heidelberg (2000)

  4. Burrows, M., Wheeler, D.J.: A Block-Sorting Lossless Data Compression Algorithm. Technical Report 124, Digital Equipment Corporation, Palo Alto, CA, USA (1994)

  5. Chien, Y.F., Hon, W.-K., Shah, R., Vitter, J.S.: Geometric Burrows-Wheeler Transform: Linking Range Searching and Text Indexing. In: DCC 2008, pp. 252–261 (2008)

  6. Eltabakh, M.Y., Hon, W.-K., Shah, R., Aref, W.G., Vitter, J.S.: The SBC-tree: an index for run-length compressed sequences. In: EDBT, pp. 523–534 (2008)

  7. Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: Compressed representations of sequences and full-text indexes. TALG 3(2) (2007)

  8. Ferragina, P., Manzini, G.: Indexing Compressed Text. JACM 52(4), 552–581 (2005); A preliminary version appears in FOCS 2000

  9. Grossi, R., Vitter, J.S.: Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching. SIAM Journal on Computing 35(2), 378–407 (2005); A preliminary version appears in STOC 2000

  10. Grossi, R., Gupta, A., Vitter, J.S.: High-Order Entropy-Compressed Text Indexes. In: SODA, pp. 841–850 (2003)

  11. Hon, W.-K., Shah, R., Vitter, J.S.: Ordered Pattern Matching: Towards Full-Text Retrieval. Technical Report TR-06-008, Purdue University (March 2006)

  12. Hon, W.-K., Shah, R., Vitter, J.S.: Space-Efficient Framework for Top-k String Retrieval Problems. In: FOCS 2009, pp. 713–722 (2009)

  13. Hon, W.-K., Shah, R., Thankachan, S.V., Vitter, J.S.: String Retrieval for Multi-pattern Queries. In: Chavez, E., Lonardi, S. (eds.) SPIRE 2010. LNCS, vol. 6393, pp. 55–66. Springer, Heidelberg (2010)

  14. Manber, U., Myers, G.: Suffix Arrays: A New Method for On-Line String Searches. SIAM Journal on Computing 22(5), 935–948 (1993)

  15. Munro, J.I., Raman, V.: Succinct Representation of Balanced Parentheses and Static Trees. SICOMP 31(3), 762–776 (2001)

  16. Navarro, G., Mäkinen, V.: Compressed Full-Text Indexes. ACM Computing Surveys 39(1) (2007)

  17. Raman, R., Raman, V., Rao, S.S.: Succinct Indexable Dictionaries with Applications to Encoding k-ary Trees, Prefix Sums and Multisets. TALG 3(4) (2007)

  18. Sadakane, K.: Compressed Suffix Trees with Full Functionality. In: TCS, pp. 589–607 (2007)

  19. Weiner, P.: Linear Pattern Matching Algorithms. In: Proc. Switching and Automata Theory, pp. 1–11 (1973)

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Thankachan, S.V. (2011). Compressed Indexes for Aligned Pattern Matching. In: Grossi, R., Sebastiani, F., Silvestri, F. (eds) String Processing and Information Retrieval. SPIRE 2011. Lecture Notes in Computer Science, vol 7024. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24583-1_40

  • DOI: https://doi.org/10.1007/978-3-642-24583-1_40

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-24582-4

  • Online ISBN: 978-3-642-24583-1

  • eBook Packages: Computer Science, Computer Science (R0)