Abstract
Pyrosequencing technologies are frequently used to sequence the 16S rRNA marker gene in metagenomic studies of microbial communities. Computing a pairwise genetic distance matrix from the resulting reads is an important but highly time-consuming task. In this paper, we present a parallelized tool (called CRiSPy) for scalable pairwise genetic distance matrix computation and clustering that is based on the processing pipeline of the popular ESPRIT software package. To achieve high computational efficiency, we have designed massively parallel CUDA algorithms for pairwise k-mer distance and pairwise genetic distance computation. We have also implemented a memory-efficient sparse matrix clustering program to process the distance matrix. On a single GPU, CRiSPy achieves speedups of around two orders of magnitude over the sequential ESPRIT program for both the time-consuming pairwise genetic distance module and the whole processing pipeline, making it particularly suitable for high-throughput microbial studies.
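To make the pairwise k-mer distance step concrete, the following is a minimal sequential sketch in Python. It is not the CUDA implementation described above. It assumes a common word-count definition of k-mer distance (one minus the number of shared k-mers divided by the number of k-mers in the shorter read); the function names and the default k = 6 are hypothetical choices made for illustration only.

```python
from collections import Counter
from itertools import combinations


def kmer_counts(seq, k):
    """Count every overlapping k-mer (substring of length k) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(x, y, k=6):
    """Assumed definition: 1 - (shared k-mers) / (k-mers in the shorter read)."""
    shorter = min(len(x), len(y))
    if shorter < k:
        raise ValueError("both sequences must contain at least k characters")
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    # Counter & Counter keeps, for each k-mer, the smaller of the two counts.
    shared = sum((cx & cy).values())
    return 1.0 - shared / (shorter - k + 1)


def pairwise_kmer_distances(reads, k=6):
    """Return {(i, j): distance} for every unordered pair of reads with i < j."""
    return {
        (i, j): kmer_distance(reads[i], reads[j], k)
        for i, j in combinations(range(len(reads)), 2)
    }
```

Each pair is computed independently of every other pair, and the number of pairs grows quadratically with the number of reads. That independence is what makes this kind of computation a natural candidate for a massively parallel implementation.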
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zheng, Z., Nguyen, TD., Schmidt, B. (2011). CRiSPy-CUDA: Computing Species Richness in 16S rRNA Pyrosequencing Datasets with CUDA. In: Loog, M., Wessels, L., Reinders, M.J.T., de Ridder, D. (eds) Pattern Recognition in Bioinformatics. PRIB 2011. Lecture Notes in Computer Science, vol 7036. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24855-9_4
DOI: https://doi.org/10.1007/978-3-642-24855-9_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24854-2
Online ISBN: 978-3-642-24855-9
eBook Packages: Computer Science, Computer Science (R0)