Faster Approximate String Matching

R. Baeza-Yates and G. Navarro

doi:10.1007/PL00009253


Published: February 1999

Volume 23, pages 127–158, (1999)

Algorithmica


Abstract.

We present a new algorithm for on-line approximate string matching. The algorithm is based on the simulation of a nondeterministic finite automaton built from the pattern and using the text as input. This simulation uses bit operations on a RAM with word length w = Ω(log n) bits, where n is the text size. This is essentially the model used in Wu and Manber's work, although we improve the search time by packing the automaton states differently. The running time achieved is O(n) for small patterns (i.e., whenever mk = O(log n)), where m is the pattern length and k < m is the number of allowed errors. This is in contrast with the result of Wu and Manber, which is O(kn) for m = O(log n). Longer patterns can be processed by partitioning the automaton into many machine words, at O((mk/w) n) search cost. We allow generalizations in the pattern, such as classes of characters, gaps, and others, at essentially the same search cost.
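For orientation, the following is a minimal sketch of the row-wise bit-parallel simulation of Wu and Manber, the O(kn) baseline the abstract compares against. It is not the state packing introduced in this paper. The function name `wu_manber_search` is hypothetical, and Python integers stand in for machine words, so the restriction that the pattern fit in one word is not enforced here.

```python
def wu_manber_search(pattern: str, text: str, k: int) -> list[int]:
    """Return 0-based end positions in `text` where `pattern` matches
    with at most k errors (insertions, deletions, substitutions).

    Row-wise bit-parallel NFA simulation in the style of Wu and Manber:
    one bit vector per error level, k + 1 vectors updated per text character.
    """
    m = len(pattern)
    if m == 0 or k < 0:
        raise ValueError("pattern must be non-empty and k non-negative")

    # masks[c] has bit i set iff pattern[i] == c.
    masks: dict[str, int] = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << i)

    full = (1 << m) - 1      # keeps vectors to m bits
    accept = 1 << (m - 1)    # bit of the final NFA state

    # rows[d] bit i set: pattern[0..i] matches a suffix of the text read
    # so far with at most d errors. Initially, a prefix of length <= d
    # matches the empty text by deleting its characters.
    rows = [((1 << d) - 1) & full for d in range(k + 1)]

    ends = []
    for pos, ch in enumerate(text):
        b = masks.get(ch, 0)
        old_prev = rows[0]                      # old value of rows[d - 1]
        rows[0] = ((rows[0] << 1) | 1) & b & full
        for d in range(1, k + 1):
            old_cur = rows[d]
            rows[d] = (
                (((old_cur << 1) | 1) & b)      # character matches
                | old_prev                      # extra character in the text
                | ((old_prev << 1) | 1)         # substituted character
                | ((rows[d - 1] << 1) | 1)      # pattern character skipped
            ) & full
            old_prev = old_cur
        if rows[k] & accept:
            ends.append(pos)
    return ends
```

For example, `wu_manber_search("abc", "abd", 1)` reports the end positions 1 and 2: "ab" matches with one deleted pattern character and "abd" with one substitution. The inner loop over the k + 1 rows is what makes this baseline cost O(kn) per search.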

We then explore other novel techniques to cope with longer patterns. We show how to partition the pattern into short subpatterns which can be searched with fewer errors using the simple automaton, to obtain an average cost close to \( O(\sqrt{mk/w}\, n) \). Moreover, we allow the superimposition of many subpatterns in a single automaton, obtaining near \( O(\sqrt{mk/(\sigma w)}\, n) \) average complexity (σ is the alphabet size).
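The pattern-partitioning idea can be illustrated with a small filter-and-verify sketch. It relies on the standard observation that if a pattern split into j pieces occurs with at most k errors, at least one piece must occur with at most ⌊k/j⌋ errors. The sketch below is an assumption-laden illustration rather than the paper's method: it uses plain dynamic programming where the paper uses its bit-parallel automaton, and the names `sellers_ends` and `partition_search` are hypothetical.

```python
def sellers_ends(pattern: str, text: str, k: int) -> list[int]:
    """0-based end positions where `pattern` matches with <= k errors,
    computed column by column with the classic dynamic programming."""
    m = len(pattern)
    col = list(range(m + 1))       # distances against the empty text
    ends = []
    for pos, ch in enumerate(text):
        prev_diag = col[0]         # col[0] stays 0: a match may start anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            new = min(prev_diag + cost, col[i] + 1, col[i - 1] + 1)
            prev_diag = col[i]
            col[i] = new
        if col[m] <= k:
            ends.append(pos)
    return ends


def partition_search(pattern: str, text: str, k: int, j: int) -> list[int]:
    """Search each of j pieces with floor(k / j) errors, then verify the
    text area around every piece match with the full pattern and k errors."""
    m = len(pattern)
    if not 1 <= j <= m:
        raise ValueError("need 1 <= j <= len(pattern)")
    bounds = [i * m // j for i in range(j + 1)]   # piece boundaries
    k_piece = k // j
    found = set()
    for start, stop in zip(bounds, bounds[1:]):
        for e in sellers_ends(pattern[start:stop], text, k_piece):
            # A full occurrence containing this piece match lies in here:
            # at most stop + k characters up to e, and at most
            # (m - stop) + k characters after it.
            lo = max(0, e + 1 - stop - k)
            hi = min(len(text), e + 1 + (m - stop) + k)
            for w in sellers_ends(pattern, text[lo:hi], k):
                found.add(lo + w)
    return sorted(found)
```

With `partition_search("survey", "surgery", 2, 3)` the pieces "su", "rv" and "ey" are searched exactly, and only the area around the match of "su" is verified. The benefit depends on piece matches being rare, which is why the choice of j matters in the analysis.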

We perform a complete analysis of all the techniques and show how to combine them in an optimal form, also obtaining new tighter bounds for the probability of an approximate occurrence in random text. Finally, we show experimental results comparing our algorithms against previous work. These experiments show that our algorithms are among the fastest for typical text searching, being the fastest in some cases. Although we aim mainly at text searching, we believe that our ideas can be successfully applied to other areas such as computational biology.


Author information

Authors and Affiliations

Department of Computer Science, University of Chile, Blanco Encalada 2120, Santiago, Chile. rbaeza@dcc.uchile.cl, gnavarro@dcc.uchile.cl
R. Baeza-Yates and G. Navarro


Additional information

Received November 22, 1996; revised October 13 and December 5, 1997.


About this article

Cite this article

Baeza-Yates, R., Navarro, G. Faster Approximate String Matching. Algorithmica 23, 127–158 (1999). https://doi.org/10.1007/PL00009253


Issue Date: February 1999
DOI: https://doi.org/10.1007/PL00009253

Key words. Text searching allowing errors, Bit-parallelism.
