Abstract
With recent improvements in sequencing throughput, huge numbers of genomes can now be sequenced rapidly. However, data analysis methods have not kept up, making it difficult to process the vast amounts of sequence data now available. Thus, there is a strong demand for faster sequence clustering algorithms. We developed a new fast DNA sequence clustering method called LCS-HIT, based on the popular CD-HIT program. The proposed method employs a novel filtering technique based on the longest common subsequence to identify similar sequence pairs. This filtering technique affords a considerable speed-up over CD-HIT without loss of sensitivity. For a dataset of two million DNA sequences, our method was about 7.1, 4.4, and 2.5 times faster than CD-HIT for sequences of 100, 150, and 400 bases, respectively.
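The abstract does not spell out how the filter is applied, so the following is only a minimal sketch of the general idea, not the LCS-HIT implementation. It relies on one fact: the matched positions of any alignment form a common subsequence, so the number of matches can never exceed the LCS length. A pair whose LCS is too short can therefore be discarded before alignment without losing any true hit. The LCS length is computed with a bit-parallel recurrence of the kind described by Crochemore et al. and Hyyrö. The function names, and the choice to measure identity against the shorter sequence, are assumptions made for illustration.

```python
def lcs_length(a: str, b: str) -> int:
    """Bit-parallel LCS length; Python ints serve as arbitrarily wide bit-vectors."""
    m = len(a)
    mask = (1 << m) - 1
    # match[c] has bit i set where a[i] == c
    match = {}
    for i, ch in enumerate(a):
        match[ch] = match.get(ch, 0) | (1 << i)
    v = mask  # all ones: no matches recorded yet
    for ch in b:
        u = v & match.get(ch, 0)
        # carry propagation clears one more bit whenever the LCS can grow
        v = ((v + u) | (v - u)) & mask
    # each zero bit in v corresponds to one unit of LCS length
    return m - bin(v).count("1")


def passes_lcs_filter(a: str, b: str, identity: float) -> bool:
    """Hypothetical pre-filter: keep the pair only if the LCS is long enough.

    Assumption: identity is the number of matches divided by the length of
    the shorter sequence. Since matches <= LCS, a pair that fails this test
    cannot reach the identity threshold in any alignment.
    """
    shorter = min(len(a), len(b))
    return lcs_length(a, b) >= identity * shorter


if __name__ == "__main__":
    x = "ACGTACGTAC"
    y = "ACGTACGTAA"
    print(lcs_length(x, y))
    print(passes_lcs_filter(x, y, 0.8))
    print(passes_lcs_filter("AAAAAAAAAA", "CCCCCCCCCC", 0.8))
```

Pairs that pass this test still need a full alignment to confirm their identity; the filter only rules out pairs that cannot qualify. In a compiled implementation the bit-vector would typically be split into machine words rather than held in a single big integer.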
References
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic Local Alignment Search Tool. Journal of Molecular Biology 215, 403–410 (1990)
Li, W., Jaroszewski, L., Godzik, A.: Clustering of Highly Homologous Sequences to Reduce the Size of Large Protein Databases. Bioinformatics (Oxford, England) 17, 282–283 (2001)
Li, W., Godzik, A.: Cd-hit: a Fast Program for Clustering and Comparing Large Sets of Protein or Nucleotide Sequences. Bioinformatics (Oxford, England) 22, 1658–1659 (2006)
Crochemore, M., Iliopoulos, C.S., Pinzon, Y.J., Reid, J.F.: A Fast and Practical Bit-vector Algorithm for the Longest Common Subsequence Problem. Information Processing Letters 80, 279–285 (2001)
Hyyro, H.: Bit-Parallel LCS-length Computation Revisited. In: Proc. 15th Australasian Workshop on Combinatorial Algorithms (AWOCA 2004), pp. 16–27 (2004)
Richter, D.C., Ott, F., Auch, A.F., Schmid, R., Huson, D.H.: MetaSim: a Sequencing Simulator for Genomics and Metagenomics. PloS One 3, e3373 (2008)
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Namiki, Y., Ishida, T., Akiyama, Y. (2012). Fast DNA Sequence Clustering Based on Longest Common Subsequence. In: Huang, DS., Gupta, P., Zhang, X., Premaratne, P. (eds) Emerging Intelligent Computing Technology and Applications. ICIC 2012. Communications in Computer and Information Science, vol 304. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31837-5_66
DOI: https://doi.org/10.1007/978-3-642-31837-5_66
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-31836-8
Online ISBN: 978-3-642-31837-5
eBook Packages: Computer Science, Computer Science (R0)