CRiSPy-CUDA: Computing Species Richness in 16S rRNA Pyrosequencing Datasets with CUDA

Zheng, Zejun; Nguyen, Thuy-Diem; Schmidt, Bertil

doi:10.1007/978-3-642-24855-9_4

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 7036)

Abstract

Pyrosequencing technologies are frequently used for sequencing the 16S rRNA marker gene for metagenomic studies of microbial communities. Computing a pairwise genetic distance matrix from the produced reads is an important but highly time-consuming task. In this paper, we present a parallelized tool (called CRiSPy) for scalable pairwise genetic distance matrix computation and clustering that is based on the processing pipeline of the popular ESPRIT software package. To achieve high computational efficiency, we have designed massively parallel CUDA algorithms for pairwise k-mer distance and pairwise genetic distance computation. We have also implemented a memory-efficient sparse matrix clustering program to process the distance matrix. On a single GPU, CRiSPy achieves speedups of around two orders of magnitude compared to the sequential ESPRIT program for both the time-consuming pairwise genetic distance module and the whole processing pipeline, thus making CRiSPy particularly suitable for high-throughput microbial studies.
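
To make the pairwise k-mer distance step more concrete, the following is a minimal sequential sketch in Python, not the CUDA implementation described in the paper. It assumes one common definition of k-mer distance (one minus the fraction of shared k-mers relative to the shorter read) and assumes the distance is used as a cheap filter that keeps only sufficiently similar pairs in a sparse structure. The function names, the choice of k = 6, and the threshold of 0.5 are illustrative assumptions and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a, b, k=6):
    """Assumed definition: 1 - shared k-mers / k-mers in the shorter read."""
    shorter = min(len(a), len(b))
    if shorter < k:
        raise ValueError("both reads must be at least k characters long")
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    # A k-mer occurring m times in a and n times in b is shared min(m, n) times.
    shared = sum(min(n, cb[kmer]) for kmer, n in ca.items())
    return 1.0 - shared / (shorter - k + 1)


def candidate_pairs(reads, k=6, threshold=0.5):
    """Return a sparse {(i, j): distance} map of pairs at or below threshold.

    The threshold is a hypothetical value chosen for illustration only.
    """
    sparse = {}
    for i, j in combinations(range(len(reads)), 2):
        d = kmer_distance(reads[i], reads[j], k)
        if d <= threshold:
            sparse[(i, j)] = d
    return sparse
```

Every pair is independent of every other pair, which is why this kind of computation lends itself to massively parallel execution; the loop over `combinations` here simply visits the pairs one at a time. Storing only the pairs that pass the threshold also hints at why a sparse representation keeps memory use manageable for large read sets.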

Author information

Authors and Affiliations

School of Computer Engineering, Nanyang Technological University, Singapore
Zejun Zheng & Thuy-Diem Nguyen
Institut für Informatik, Johannes Gutenberg University, Mainz, Germany
Bertil Schmidt

Editor information

Editors and Affiliations

Pattern Recognition Laboratory, Delft University of Technology, Mekelweg 4, 2628 CD, Delft, The Netherlands
Marco Loog, Marcel J. T. Reinders & Dick de Ridder
Netherlands Cancer Institute, Bioinformatics and Statistics, Plesmanlaan 121, 1066 CX, Amsterdam, The Netherlands
Lodewyk Wessels

About this paper

Cite this paper

Zheng, Z., Nguyen, TD., Schmidt, B. (2011). CRiSPy-CUDA: Computing Species Richness in 16S rRNA Pyrosequencing Datasets with CUDA. In: Loog, M., Wessels, L., Reinders, M.J.T., de Ridder, D. (eds) Pattern Recognition in Bioinformatics. PRIB 2011. Lecture Notes in Computer Science, vol 7036. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24855-9_4

DOI: https://doi.org/10.1007/978-3-642-24855-9_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24854-2
Online ISBN: 978-3-642-24855-9
eBook Packages: Computer Science, Computer Science (R0)
