Fast DNA Sequence Clustering Based on Longest Common Subsequence

Namiki, Youhei; Ishida, Takashi; Akiyama, Yutaka

doi:10.1007/978-3-642-31837-5_66

Part of the book series: Communications in Computer and Information Science (CCIS, volume 304)

Included in the following conference series:

International Conference on Intelligent Computing

Abstract

With recent improvements in sequencing throughput, huge numbers of genomes can now be sequenced rapidly. However, data analysis methods have not kept pace, making it difficult to process the vast amounts of sequence data now available. Thus, there is strong demand for a faster sequence clustering algorithm. We developed a new fast DNA sequence clustering method called LCS-HIT, based on the popular CD-HIT program. The proposed method employs a novel filtering technique based on the longest common subsequence to identify similar sequence pairs. This filtering technique affords a considerable speed-up over CD-HIT without loss of sensitivity. For a dataset of two million DNA sequences, our method was about 7.1, 4.4, and 2.5 times faster than CD-HIT for sequence lengths of 100, 150, and 400 bases, respectively.
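
The abstract does not spell out how the filter works, so the following is only a minimal sketch of one way a longest-common-subsequence (LCS) filter could be built, not the LCS-HIT implementation. It relies on a general fact: the identical columns of any alignment form a common subsequence, so the LCS length is an upper bound on the number of matching positions. A pair whose LCS is too short can therefore be rejected before alignment. The LCS length is computed with a bit-parallel recurrence in the style of the bit-vector algorithms listed in the references. The function names `lcs_length` and `may_be_similar`, the default threshold of 0.9, and the choice to measure identity against the shorter sequence are assumptions made for illustration.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, computed bit-parallel."""
    m = len(a)
    if m == 0:
        return 0
    full = (1 << m) - 1  # m low bits set

    # masks[ch] has bit i set when a[i] == ch.
    masks = {}
    for i, ch in enumerate(a):
        masks[ch] = masks.get(ch, 0) | (1 << i)

    # v starts as all ones; each zero bit marks a row where the LCS grows.
    v = full
    for ch in b:
        u = v & masks.get(ch, 0)
        v = ((v + u) | (v - u)) & full

    # The LCS length equals the number of zero bits among the m low bits.
    return m - bin(v).count("1")


def may_be_similar(a: str, b: str, identity: float = 0.9) -> bool:
    """Hypothetical pre-filter: False means the pair cannot reach `identity`.

    Assumption: identity = matching positions / length of the shorter sequence.
    """
    shorter = min(len(a), len(b))
    return lcs_length(a, b) >= identity * shorter


if __name__ == "__main__":
    print(lcs_length("ACGT", "AGT"))
    print(may_be_similar("ACGTACGTAC", "ACGTACGTAC"))
    print(may_be_similar("AAAAAAAAAA", "CCCCCCCCCC"))
```

A filter of this kind never rejects a pair that would meet the threshold, which fits the abstract's claim of no loss of sensitivity. Pairs that pass would still need a full alignment to confirm their identity.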

References

Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic Local Alignment Search Tool. Journal of Molecular Biology 215, 403–410 (1990)
Li, W., Jaroszewski, L., Godzik, A.: Clustering of Highly Homologous Sequences to Reduce the Size of Large Protein Databases. Bioinformatics (Oxford, England) 17, 282–283 (2001)
Li, W., Godzik, A.: Cd-hit: a Fast Program for Clustering and Comparing Large Sets of Protein or Nucleotide Sequences. Bioinformatics (Oxford, England) 22, 1658–1659 (2006)
Crochemore, M., Iliopoulos, C.S., Pinzon, Y.J., Reid, J.F.: A Fast and Practical Bit-vector Algorithm for the Longest Common Subsequence Problem. Information Processing Letters 80, 279–285 (2001)
Hyyrö, H.: Bit-Parallel LCS-length Computation Revisited. In: Proc. 15th Australasian Workshop on Combinatorial Algorithms (AWOCA 2004), pp. 16–27 (2004)
Richter, D.C., Ott, F., Auch, A.F., Schmid, R., Huson, D.H.: MetaSim: a Sequencing Simulator for Genomics and Metagenomics. PLoS ONE 3, e3373 (2008)

Author information

Authors and Affiliations

Department of Computer Science, Graduate School of Information Science and Engineering, Tokyo Institute of Technology, Japan
Youhei Namiki, Takashi Ishida & Yutaka Akiyama

Editor information

Editors and Affiliations

Machine Learning and Systems Biology Laboratory, School of Electronics and Information Engineering, Tongji University, Shanghai, China
De-Shuang Huang
Department of Computer Science and Engineering, Indian Institute of Technology Kanpur, 208016, Kanpur, India
Phalguni Gupta
Department of Chemistry, University of Louisville, 2320 South Brook Street, 40292, Louisville, Kentucky, USA
Xiang Zhang
School of Electrical, Computer & Telecommunications Engineering, The University of Wollongong, 2522, North Wollongong, NSW, Australia
Prashan Premaratne

About this paper

Cite this paper

Namiki, Y., Ishida, T., Akiyama, Y. (2012). Fast DNA Sequence Clustering Based on Longest Common Subsequence. In: Huang, DS., Gupta, P., Zhang, X., Premaratne, P. (eds) Emerging Intelligent Computing Technology and Applications. ICIC 2012. Communications in Computer and Information Science, vol 304. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31837-5_66

DOI: https://doi.org/10.1007/978-3-642-31837-5_66
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-31836-8
Online ISBN: 978-3-642-31837-5
eBook Packages: Computer Science, Computer Science (R0)
