Abstract
We have recently shown that q-gram filters based on gapped q-grams instead of the usual contiguous q-grams can provide orders of magnitude faster and/or more efficient filtering for the Hamming distance. In this paper, we extend the results to the Levenshtein distance, which is more problematic for gapped q-grams because an insertion or deletion in a gap affects a q-gram while a replacement does not. To keep this effect under control, we concentrate on gapped q-grams with just one gap. We demonstrate with experiments that the resulting filters provide a significant improvement over the contiguous q-gram filters. We also develop new techniques for dealing with complex q-gram filters.
Supported by the DFG ‘Initiative Bioinformatik’ grant BIZ 4/1-1.
Partially supported by the Future and Emerging Technologies programme of the EU under contract number IST-1999-14186 (ALCOM-FT).
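The abstract's remark that an insertion or deletion in a gap affects a q-gram while a replacement does not can be illustrated with a small sketch. This is not the authors' implementation: the function name `one_gapped_qgrams`, the shape parameters `(q1, gap, q2)`, and the example strings are assumptions chosen for illustration only.

```python
def one_gapped_qgrams(s, q1, gap, q2):
    """List the one-gapped q-grams of s, one per start position.

    Hypothetical shape: q1 sampled characters, then `gap` skipped
    characters, then q2 sampled characters.
    """
    span = q1 + gap + q2
    return [s[i:i + q1] + s[i + q1 + gap:i + span]
            for i in range(len(s) - span + 1)]


shape = (3, 2, 3)        # illustrative shape: 3 characters, gap of 2, 3 characters

original = "ACGTTGCA"
replaced = "ACGATGCA"    # replacement inside the gap (index 3: T -> A)
inserted = "ACGATTGCA"   # insertion inside the gap (A inserted at index 3)

target = one_gapped_qgrams(original, *shape)[0]

# A replacement in the gap leaves the sampled characters untouched.
survives_replacement = target in one_gapped_qgrams(replaced, *shape)

# An insertion in the gap shifts the second block, so the q-gram is lost.
survives_insertion = target in one_gapped_qgrams(inserted, *shape)
```

With these inputs the q-gram survives the replacement but not the insertion, which is the asymmetry that makes the Levenshtein distance harder for gapped q-grams than the Hamming distance.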
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Burkhardt, S., Kärkkäinen, J. (2002). One-Gapped q-Gram Filters for Levenshtein Distance. In: Apostolico, A., Takeda, M. (eds) Combinatorial Pattern Matching. CPM 2002. Lecture Notes in Computer Science, vol 2373. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45452-7_19
DOI: https://doi.org/10.1007/3-540-45452-7_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43862-5
Online ISBN: 978-3-540-45452-6
eBook Packages: Springer Book Archive