Fast DNA Sequence Clustering Based on Longest Common Subsequence

  • Conference paper
Emerging Intelligent Computing Technology and Applications (ICIC 2012)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 304)

Abstract

With recent improvements in sequencing throughput, huge numbers of genomes can now be sequenced rapidly. However, data analysis methods have not kept up, making it difficult to process the vast amounts of sequence data now available. Thus, there is a strong demand for a faster sequence clustering algorithm. We developed a new fast DNA sequence clustering method called LCS-HIT based on the popular CD-HIT program. The proposed method employs a novel filtering technique based on the longest common subsequence to identify similar sequence pairs. This filtering technique affords a considerable speed-up over CD-HIT without loss of sensitivity. For a dataset with two million DNA sequences, our method was about 7.1, 4.4 and 2.5 times faster than CD-HIT for 100, 150, and 400 bases, respectively.
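
The filtering idea in the abstract can be illustrated with a minimal sketch. It rests on one fact: the number of matching positions in any alignment of two sequences cannot exceed the length of their longest common subsequence (LCS), so a pair whose LCS is too short can be discarded without missing a true hit. The code below is not the LCS-HIT implementation. The function names `lcs_length` and `may_be_similar`, the plain dynamic-programming LCS, and the assumption that identity is measured against the shorter sequence's length are all illustrative choices.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, computed row by row."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string so the DP row stays short
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the best subsequence ending before both characters.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise carry over the better of skipping one character.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def may_be_similar(a: str, b: str, min_identity: float) -> bool:
    """Hypothetical filter: keep a pair only if its LCS is long enough.

    Assumes identity = matches / length of the shorter sequence. Because the
    LCS length is an upper bound on the number of matches, a pair rejected
    here cannot reach min_identity in a full alignment.
    """
    shorter = min(len(a), len(b))
    if shorter == 0:
        return False
    return lcs_length(a, b) >= min_identity * shorter
```

Pairs that pass such a filter would still need a full alignment to confirm their identity; the filter only rules out pairs that cannot qualify. A production tool would likely replace the quadratic-time loop with a faster LCS computation.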

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Namiki, Y., Ishida, T., Akiyama, Y. (2012). Fast DNA Sequence Clustering Based on Longest Common Subsequence. In: Huang, DS., Gupta, P., Zhang, X., Premaratne, P. (eds) Emerging Intelligent Computing Technology and Applications. ICIC 2012. Communications in Computer and Information Science, vol 304. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31837-5_66

  • DOI: https://doi.org/10.1007/978-3-642-31837-5_66

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-31836-8

  • Online ISBN: 978-3-642-31837-5

  • eBook Packages: Computer Science, Computer Science (R0)
