Abstract
In many settings, such as protein analysis, a primary protein sequence is associated with secondary structure labels [6]. Such data can be treated as two sequences aligned character by character. Many DNA and RNA sequences likewise involve linkages that are aligned within the same strand or across different strands. In this paper, we consider the most natural characterization of aligned string data.
The aligned pattern matching problem is to index two input texts \(T_1[1...n]\) and \(T_2[1...n]\), each consisting of n characters drawn from an alphabet Σ of size σ = polylog(n), so that the following query can be answered efficiently: given two query patterns \(P_1\) and \(P_2\), find all text positions i such that \(P_1\) matches \(T_1[i...(i+|P_1|-1)]\) and \(P_2\) matches \(T_2[i...(i+|P_2|-1)]\). Our objective is to design a compressed-space index for this problem, and we obtain the following main results. When the query patterns are sufficiently long (\(|P_1|, |P_2| > \alpha = \Theta(\log^{2+2\epsilon} n)\), where \(\epsilon > 0\)), we design an index that takes \(nH'_k + nH''_k + o(n\log\sigma)\) bits of space and answers queries in \(O(|P_1|+|P_2|+\log^{4+4\epsilon} n+t)\) time, where \(H'_k\) and \(H''_k\) denote the empirical kth-order entropies (\(k = o(\log_\sigma n)\)) of \(T_1\) and \(T_2\) respectively, and t is the number of outputs. Further, we show that designing a compressed or succinct index with poly-logarithmic query time that works for query patterns of all lengths is at least as hard as designing a linear-space index for 3-dimensional orthogonal range reporting with poly-logarithmic query time. However, we introduce another compressed index that takes \(nH'_k + nH''_k + O(n) + o(n\log\sigma)\) bits of space and answers queries in \(O(|P_1|+|P_2|+\sqrt{nt}\log^{2+\epsilon} n)\) time, without any restriction on the length of the patterns.
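To make the query definition concrete, the following is a minimal sketch of a naive baseline that scans the texts directly. It is not the compressed index described above and makes no attempt to match its space or time bounds; the function name `aligned_matches` and the sample strings are illustrative assumptions, not taken from the paper.

```python
def aligned_matches(t1, t2, p1, p2):
    """Naive baseline (illustrative only, not the paper's index).

    Returns all 1-based positions i such that p1 occurs in t1 starting
    at i and p2 occurs in t2 starting at the same position i.
    """
    if len(t1) != len(t2):
        raise ValueError("the two texts must have the same length")
    hits = []
    i = t1.find(p1)  # 0-based start of the first occurrence of p1, or -1
    while i != -1:
        # startswith with an offset checks p2 against t2 at position i;
        # it is False when p2 would run past the end of t2.
        if t2.startswith(p2, i):
            hits.append(i + 1)  # report 1-based positions, as in the text
        i = t1.find(p1, i + 1)  # next (possibly overlapping) occurrence
    return hits


if __name__ == "__main__":
    # Hypothetical aligned texts of equal length.
    print(aligned_matches("ababab", "xyxyxx", "ab", "xy"))
    print(aligned_matches("ababab", "xyxyxx", "b", "yx"))
```

The two patterns may have different lengths; only their starting position must coincide. This scan inspects every occurrence of \(P_1\) in \(T_1\), so its cost grows with the text length rather than with the number of reported positions, which is the gap an index is meant to close.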
This work is supported in part by Taiwan NSC Grant 99-2221-E-007-123 (W. Hon) and US NSF Grant CCF–1017623 (R. Shah).
References
Alstrup, S., Brodal, G.S., Rauhe, T.: New data structures for orthogonal range searching. In: FOCS, pp. 198–207 (2000)
Chazelle, B.: Lower bounds for orthogonal range searching: I. The reporting case. JACM 37(2), 200–212 (1990)
Bender, M.A., Farach-Colton, M.: The LCA Problem Revisited. In: Gonnet, G.H., Viola, A. (eds.) LATIN 2000. LNCS, vol. 1776, pp. 88–94. Springer, Heidelberg (2000)
Burrows, M., Wheeler, D.J.: A Block-Sorting Lossless Data Compression Algorithm. Technical Report 124, Digital Equipment Corporation, Palo Alto, CA, USA (1994)
Chien, Y.F., Hon, W.-K., Shah, R., Vitter, J.S.: Geometric Burrows-Wheeler Transform: Linking Range Searching and Text Indexing. In: DCC 2008, pp. 252–261 (2008)
Eltabakh, M.Y., Hon, W.-K., Shah, R., Aref, W.G., Vitter, J.S.: The SBC-tree: an index for run-length compressed sequences. In: EDBT, pp. 523–534 (2008)
Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: Compressed representations of sequences and full-text indexes. TALG 3(2) (2007)
Ferragina, P., Manzini, G.: Indexing Compressed Text. JACM 52(4), 552–581 (2005); A preliminary version appears in FOCS 2000
Grossi, R., Vitter, J.S.: Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching. SIAM Journal on Computing 35(2), 378–407 (2005); A preliminary version appears in STOC 2000
Grossi, R., Gupta, A., Vitter, J.S.: High-Order Entropy-Compressed Text Indexes. In: SODA, pp. 841–850 (2003)
Hon, W.-K., Shah, R., Vitter, J.S.: Ordered Pattern Matching: Towards Full-Text Retrieval. Technical Report TR-06-008, Purdue University (March 2006)
Hon, W.-K., Shah, R., Vitter, J.S.: Space-Efficient Framework for Top-k String Retrieval Problems. In: FOCS 2009, pp. 713–722 (2009)
Hon, W.-K., Shah, R., Thankachan, S.V., Vitter, J.S.: String Retrieval for Multi-pattern Queries. In: Chavez, E., Lonardi, S. (eds.) SPIRE 2010. LNCS, vol. 6393, pp. 55–66. Springer, Heidelberg (2010)
Manber, U., Myers, G.: Suffix Arrays: A New Method for On-Line String Searches. SIAM Journal on Computing 22(5), 935–948 (1993)
Munro, J.I., Raman, V.: Succinct Representation of Balanced Parentheses and Static Trees. SICOMP 31(3), 762–776 (2001)
Navarro, G., Mäkinen, V.: Compressed Full-Text Indexes. ACM Computing Surveys 39(1) (2007)
Raman, R., Raman, V., Rao, S.S.: Succinct Indexable Dictionaries with Applications to Encoding k-ary Trees, Prefix Sums and Multisets. TALG 3(4) (2007)
Sadakane, K.: Compressed Suffix Trees with Full Functionality. Theory of Computing Systems 41(4), 589–607 (2007)
Weiner, P.: Linear Pattern Matching Algorithms. In: Proc. Switching and Automata Theory, pp. 1–11 (1973)
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Thankachan, S.V. (2011). Compressed Indexes for Aligned Pattern Matching. In: Grossi, R., Sebastiani, F., Silvestri, F. (eds) String Processing and Information Retrieval. SPIRE 2011. Lecture Notes in Computer Science, vol 7024. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24583-1_40
DOI: https://doi.org/10.1007/978-3-642-24583-1_40
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24582-4
Online ISBN: 978-3-642-24583-1