String Retrieval for Multi-pattern Queries

Hon, Wing-Kai; Shah, Rahul; Thankachan, Sharma V.; Vitter, Jeffrey Scott

doi:10.1007/978-3-642-16321-0_6


Conference paper


Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6393)

Abstract

Given a collection \(\mathcal D\) of string documents \(\{d_1,d_2,\ldots,d_{|\mathcal D|}\}\) of total length n, which may be preprocessed, a fundamental task is to retrieve the most relevant documents for a given query. The query consists of a set of m patterns \(\{P_1, P_2, \ldots, P_m\}\). To measure the relevance of a document with respect to the query patterns, we may define a score, such as the number of occurrences of these patterns in the document, or the proximity of the given patterns within the document. To control the size of the output, we may also specify a threshold (or a parameter K), so that our task is to report all the documents that match the query with a score above the threshold (or, respectively, the K documents with the highest scores).
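To make the problem statement concrete, the following is a minimal brute-force sketch of the two query types (threshold and top-K). It is not the data structure of the paper: it scans every document at query time and uses no preprocessing. The function names, the choice of score (total number of occurrences, counted only when every pattern occurs at least once), and the tie-breaking rule are assumptions made for illustration.

```python
import heapq


def count_occurrences(text, pattern):
    """Count possibly overlapping occurrences of pattern in text."""
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def score(document, patterns):
    """Illustrative score (an assumption): total occurrences of all
    patterns, or 0 if some pattern does not occur in the document."""
    counts = [count_occurrences(document, p) for p in patterns]
    return sum(counts) if all(counts) else 0


def report_above_threshold(documents, patterns, threshold):
    """Indices of all documents whose score exceeds the threshold."""
    return [i for i, d in enumerate(documents)
            if score(d, patterns) > threshold]


def report_top_k(documents, patterns, k):
    """Indices of the k highest-scoring matching documents.
    Ties are broken in favour of the smaller index (an assumption)."""
    scored = [(score(d, patterns), i) for i, d in enumerate(documents)]
    matching = [s for s in scored if s[0] > 0]
    best = heapq.nlargest(k, matching, key=lambda s: (s[0], -s[1]))
    return [i for _, i in best]


documents = ["abracadabra", "banana", "cabana", "abcabc"]
patterns = ["ab", "a"]
above = report_above_threshold(documents, patterns, 3)
top2 = report_top_k(documents, patterns, 2)
```

Each query here costs time proportional to the total length n of the collection (times the number of patterns), which is the baseline that an indexed solution aims to beat.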

When the documents are strings (without word boundaries), the traditional inverted-index-based solutions may not be applicable. The single-pattern retrieval case has been solved well by [14,9]. For two or more patterns, the only non-trivial solution for proximity search and common document listing was given by [14], and it takes \(\tilde{O}(n^{3/2})\) space. In this paper, we give the first linear-space (and partly succinct) data structures that can answer multi-pattern queries in \(O(\sum |P_i|) + \tilde {O}(t^{1/m} n^{1-1/m})\) time, where t is the number of output occurrences. In the particular case of two patterns, we achieve the bound of \(O(|P_1| + |P_2| + \sqrt{nt}\log^2 n)\). We also show space-time trade-offs for our data structures. Our approach is based on a novel data structure called the weight-balanced wavelet tree, which may be of independent interest.
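As a quick check of how the two stated bounds relate, the general term can be specialised to m = 2. This assumes the usual convention that \(\tilde{O}(\cdot)\) hides polylogarithmic factors in n; the abstract does not spell this out.

\[
t^{1/m}\, n^{1-1/m} \;=\; n \left(\frac{t}{n}\right)^{1/m},
\qquad
t^{1/m}\, n^{1-1/m}\Big|_{m=2} \;=\; t^{1/2}\, n^{1/2} \;=\; \sqrt{nt}.
\]

Under that convention the two-pattern bound \(O(|P_1| + |P_2| + \sqrt{nt}\log^2 n)\) is the m = 2 instance of the general bound with the hidden factor made explicit as \(\log^2 n\). The first identity also shows that the output-dependent term is below n whenever t < n. For example, with \(n = 10^6\) and \(t = 10^2\), \(\sqrt{nt} = 10^4\).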

This work is supported in part by Taiwan NSC Grant 96-2221-E-007-082 and US NSF Grants CCF-1017623 and CCF-0621457.


References

1. Bender, M.A., Farach-Colton, M.: The LCA Problem Revisited. In: Gonnet, G.H., Viola, A. (eds.) LATIN 2000. LNCS, vol. 1776, pp. 88–94. Springer, Heidelberg (2000)
2. Brin, S., Page, L.: The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks 30(1-7), 107–117 (1998)
3. Cohen, H., Porat, E.: Fast Set Intersection and Two Patterns Matching. In: LATIN (2010)
4. Ferragina, P., Giancarlo, R., Manzini, G.: The Myriad Virtues of Wavelet Trees. Inf. and Comp. 207(8), 849–866 (2009)
5. Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: Compressed Representations of Sequences and Full-Text Indexes. ACM Transactions on Algorithms (TALG) 3(2) (2007)
6. Grossi, R., Gupta, A., Vitter, J.S.: High-Order Entropy-Compressed Text Indexes. In: SODA, pp. 841–850 (2003)
7. Grossi, R., Vitter, J.S.: Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching. SICOMP 35(2), 378–407 (2005)
8. Hon, W.K., Shah, R., Vitter, J.S.: Ordered Pattern Matching: Towards Full-Text Retrieval. Tech Report TR-06-008, Dept. of CS, Purdue University (2006)
9. Hon, W.K., Shah, R., Vitter, J.S.: Space-Efficient Framework for Top-k String Retrieval Problems. In: FOCS, pp. 713–722 (2009)
10. Mäkinen, V., Navarro, G.: Rank and Select Revisited and Extended. TCS 387(3), 332–347 (2007)
11. Manber, U., Myers, G.: Suffix Arrays: A New Method for On-Line String Searches. SICOMP 22(5), 935–948 (1993)
12. Matias, Y., Muthukrishnan, S., Sahinalp, S.C., Ziv, J.: Augmenting Suffix Trees, with Applications. In: Bilardi, G., Pietracaprina, A., Italiano, G.F., Pucci, G. (eds.) ESA 1998. LNCS, vol. 1461, pp. 67–78. Springer, Heidelberg (1998)
13. Munro, J.I., Raman, V.: Succinct Representation of Balanced Parentheses and Static Trees. SICOMP 31(3), 762–776 (2001)
14. Muthukrishnan, S.: Efficient Algorithms for Document Retrieval Problems. In: SODA, pp. 657–666 (2002)
15. Raman, R., Raman, V., Rao, S.S.: Succinct Indexable Dictionaries with Applications to Encoding k-ary Trees, Prefix Sums and Multisets. TALG 3(4) (2007)
16. Sadakane, K.: Compressed Suffix Trees with Full Functionality. TCS, 589–607 (2007)
17. Sadakane, K.: Succinct Data Structures for Flexible Text Retrieval Systems. JDA 5(1), 12–22 (2007)
18. Välimäki, N., Mäkinen, V.: Space-Efficient Algorithms for Document Retrieval. In: Ma, B., Zhang, K. (eds.) CPM 2007. LNCS, vol. 4580, pp. 205–215. Springer, Heidelberg (2007)
19. Weiner, P.: Linear Pattern Matching Algorithms. In: Proc. Switching and Automata Theory, pp. 1–11 (1973)
20. Wu, S.B., Hon, W.K., Shah, R.: Efficient Index for Retrieving Top-k Most Frequent Documents. In: Karlgren, J., Tarhio, J., Hyyrö, H. (eds.) SPIRE 2009. LNCS, vol. 5721, pp. 182–193. Springer, Heidelberg (2009)
21. Yu, C.C., Hon, W.K., Wang, B.F.: Efficient Data Structures for the Orthogonal Range Successor Problem. In: Ngo, H.Q. (ed.) COCOON 2009. LNCS, vol. 5609, pp. 96–105. Springer, Heidelberg (2009)


Author information

Authors and Affiliations

Department of CS, National Tsing Hua University, Taiwan
Wing-Kai Hon
Department of CS, Louisiana State University, USA
Rahul Shah & Sharma V. Thankachan
Department of EECS, The University of Kansas, USA
Jeffrey Scott Vitter


Editor information

Editors and Affiliations

School of Physics and Mathematics, Edificio "B", Universidad Michoacana, Ciudad Universitaria, 5800, Morelia, Mich., Mexico
Edgar Chavez
Dept. of Computer Science and Engineering, University of California, 92521, Riverside, CA, USA
Stefano Lonardi


Cite this paper

Hon, WK., Shah, R., Thankachan, S.V., Vitter, J.S. (2010). String Retrieval for Multi-pattern Queries. In: Chavez, E., Lonardi, S. (eds) String Processing and Information Retrieval. SPIRE 2010. Lecture Notes in Computer Science, vol 6393. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-16321-0_6


DOI: https://doi.org/10.1007/978-3-642-16321-0_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-16320-3
Online ISBN: 978-3-642-16321-0
eBook Packages: Computer Science, Computer Science (R0)
