Estimating sequence similarity from read sets for clustering next-generation sequencing data

Ryšavý, Petr; Železný, Filip

doi:10.1007/s10618-018-0584-8

Published: 04 August 2018

Volume 33, pages 1–23, (2019)

Data Mining and Knowledge Discovery

Abstract

Computing mutual similarity of biological sequences such as DNA molecules is essential for significant biological tasks such as hierarchical clustering of genomes. Current sequencing technologies do not provide the content of entire biological sequences; rather they identify a large number of small substrings called reads, sampled at random places of the target sequence. To estimate similarity of two sequences from their read-set representations, one may try to reconstruct each one first from its read set, and then employ conventional (dis)similarity measures such as the edit distance on the assembled sequences. Due to the nature of data, sequence assembly often cannot provide a single putative sequence that matches the true DNA. Therefore, we propose instead to estimate the similarities directly from the read sets. Our approach is based on an adaptation of the Monge-Elkan similarity known from the field of databases, avoiding the sequence assembly step. For low-coverage (i.e. small) read set samples, it yields a better approximation of the true sequence similarities. This in turn results in better clustering in comparison to the first-assemble-then-cluster approach. Put differently, for a fixed estimation accuracy, our approach requires smaller read sets and thus entails reduced wet-lab costs.
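
The idea in the abstract, comparing two read sets directly instead of assembling them first, can be illustrated with a minimal sketch. It assumes the distance form of Monge-Elkan (each read is matched to its nearest read in the other set and the minima are averaged) with the Levenshtein distance between reads. The function names and the symmetrisation by averaging both directions are illustrative choices, not taken from the article, and the refinements the article describes are omitted.

```python
from typing import Callable, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at cost 0
            ))
        prev = curr
    return prev[-1]


def monge_elkan_distance(
    reads_a: Sequence[str],
    reads_b: Sequence[str],
    dist: Callable[[str, str], int] = levenshtein,
) -> float:
    """Average, over reads in reads_a, of the distance to the closest read in reads_b."""
    if not reads_a or not reads_b:
        raise ValueError("both read sets must be non-empty")
    return sum(min(dist(a, b) for b in reads_b) for a in reads_a) / len(reads_a)


def symmetric_monge_elkan(reads_a: Sequence[str], reads_b: Sequence[str]) -> float:
    """Hypothetical symmetrisation: the mean of the two directed distances."""
    return 0.5 * (monge_elkan_distance(reads_a, reads_b)
                  + monge_elkan_distance(reads_b, reads_a))


if __name__ == "__main__":
    reads_a = ["ACGT", "CGTA"]
    reads_b = ["ACGT", "GGTA"]
    print(symmetric_monge_elkan(reads_a, reads_b))
```

The sketch compares every pair of reads, so its cost grows with the product of the two read-set sizes times the cost of one edit-distance computation. It is meant only to make the definition concrete.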

Notes

Here we turn the Monge-Elkan similarity into a distance measure. Monge-Elkan is normally used as a similarity measure, with \(\max \) in place of \(\min \) and a similarity calculation in place of the distance calculation.
Strictly speaking, this reasoning is incorrect if read a is drawn from a place close to A’s margins, more precisely, if it starts fewer than t (\(t+l\), respectively) symbols from A’s left (right) margin, since then not all of the 2t shifts are possible. This effect is, however, negligible due to (2).
The idea of finding outliers in \(\mathsf {BM}\) was proposed by one of the reviewers of the paper and it turned out to work better than the original version by Ryšavý and Železný (2016).
The dynamic programming algorithm for calculating the Levenshtein distance (Levenshtein 1966) is commonly called the Wagner–Fischer algorithm (Wagner and Fischer 1974). In the context of the sequence alignment problem in bioinformatics, it is often called the Needleman–Wunsch algorithm (Needleman and Wunsch 1970).
The implementation and more detailed experimental results are available at https://github.com/petrrysavy/readsDAMI2017.
AF389115, AF389119, AY260942, AY260945, AY260949, AY260955, CY011131, CY011135, CY011143, HE584750, J02147, K00423 and outgroup AM050555. The genomes are available at http://www.ebi.ac.uk/ena/data/view/<accession>.
AB073912, X98292, AM050555, D13784, EU376394, FJ560719, GU076451, JN680353, JN998607, M14707, U06714, U46935, U66304, U81989, X05817, Y13051 and outgroup AY884005.
\((\alpha , l) \in \{0.1, 0.3, 0.5, 0.7, 1, 1.5, 2, 2.5, 3, 4, 5, 7, 10, 15, 20, 30,40,50,70,100\} \times \{3, 5, 10, 15, 20, 25, 30, 40, 50, 70, 100, 150,200,500\}\).
http://www.ebi.ac.uk/ena.
Accessions of the used read-sets are SRX036766, SRX036767, SRX036766, SRX036767, SRX036772, SRX036774, SRX036775, SRX036942, SRX036776, SRX036777, SRX036779, SRX036943, SRX036780, SRX036781, SRX036802, SRX036803, SRX036945.
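
The parameter grid listed in the notes above is a Cartesian product of 20 values of \(\alpha \) and 14 values of \(l\). As a rough sketch, it can be enumerated as follows. The reading of \(\alpha \) as coverage and \(l\) as read length, the helper `reads_needed`, and the formula "number of reads = coverage × sequence length / read length" are assumptions made for illustration and are not stated in this excerpt.

```python
from itertools import product

# Values copied from the grid in the notes.
ALPHAS = [0.1, 0.3, 0.5, 0.7, 1, 1.5, 2, 2.5, 3, 4,
          5, 7, 10, 15, 20, 30, 40, 50, 70, 100]
LENGTHS = [3, 5, 10, 15, 20, 25, 30, 40, 50, 70, 100, 150, 200, 500]


def parameter_grid():
    """All (alpha, l) pairs of the Cartesian product, alpha varying slowest."""
    return list(product(ALPHAS, LENGTHS))


def reads_needed(alpha: float, read_length: int, sequence_length: int) -> int:
    """Hypothetical helper: number of reads giving coverage alpha.

    Assumes coverage = n_reads * read_length / sequence_length; at least
    one read is always returned.
    """
    return max(1, round(alpha * sequence_length / read_length))


if __name__ == "__main__":
    grid = parameter_grid()
    print(len(grid))                   # number of (alpha, l) combinations
    print(reads_needed(2, 50, 1000))   # reads for coverage 2, l = 50, |A| = 1000
```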

References

1000 Genomes Project Consortium et al (2015) A global reference for human genetic variation. Nature 526(7571):68–74
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215(3):403–410
Bao E, Jiang T, Kaloshian I, Girke T (2011) SEED: efficient clustering of next-generation sequences. Bioinformatics 27(18):2502–2509
Blaisdell BE (1986) A measure of the similarity of sets of sequences not requiring sequence alignment. Proc Natl Acad Sci 83(14):5155–5159
Comin M, Leoni A, Schimd M (2015) Clustering of reads with alignment-free measures and quality values. Algorithms Mol Biol 10(1):4
Comin M, Schimd M (2014) Assembly-free genome comparison based on next-generation sequencing reads and variable length patterns. BMC Bioinformatics 15(9):S1
Comin M, Schimd M (2016) Fast comparison of genomic and meta-genomic reads with alignment-free measures based on quality values. BMC Med Genomics 9(1):36
Fowlkes EB, Mallows CL (1983) A method for comparing two hierarchical clusterings. J Am Stat Assoc 78(383):553–569
Goodwin S, McPherson JD, McCombie WR (2016) Coming of age: ten years of next-generation sequencing technologies. Nat Rev Genet 17:333–351
Haiminen N, Kuhn DN, Parida L, Rigoutsos I (2011) Evaluation of methods for de novo genome assembly from high-throughput sequencing reads reveals dependencies that affect the quality of the results. PLOS ONE 6(9):1–9
Hernandez D, François P, Farinelli L, Østerås M, Schrenzel J (2008) De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Genome Res 18(5):802–809
Huang W, Li L, Myers JR, Marth GT (2012) ART: a next-generation sequencing read simulator. Bioinformatics 28(4):593–594
Hubbard T, Barker D, Birney E, Cameron G, Chen Y et al (2002) The Ensembl genome database project. Nucl Acids Res 30(1):38–41
Jalovec K, Železný F (2014) Binary classification of metagenomic samples using discriminative DNA superstrings. In: MLSB 2014: 8th International workshop on machine learning in systems biology, pp 44–47
Kchouk M, Elloumi M (2016) A clustering approach for denovo assembly using next generation sequencing data. In: 2016 IEEE international conference on bioinformatics and biomedicine (BIBM), IEEE, pp 1909–1911
Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W et al (2001) Initial sequencing and analysis of the human genome. Nature 409(6822):860–921
Leinonen R, Akhtar R, Birney E, Bower L, Cerdeño-Tárraga A, Cheng Y, Cleland I, Faruque N, Goodgame N, Gibson R, Hoad G, Jang M, Pakseresht N, Plaister S, Radhakrishnan R, Reddy K, Sobhany S, Ten Hoopen P, Vaughan R, Zalunin V, Cochrane G (2011) The European Nucleotide Archive. Nucl Acids Res 39(suppl 1):D28–D31
Levenshtein VI (1966) Binary codes capable of correcting deletions, insertions, and reversals. Sov Phys Dokl 10(8):707
Malhotra R, Elleder D, Bao L, Hunter DR, Acharya R, Poss M (2014) Clustering pipeline for determining consensus sequences in targeted next-generation sequencing. arXiv preprint
Monge AE, Elkan CP (1996) The field matching problem: algorithms and applications. In: Proceedings of the second international conference on knowledge discovery and data mining, KDD’96, AAAI Press, pp 267–270
Navarro G (2001) A guided tour to approximate string matching. ACM Comput Surv 33(1):31–88
Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48(3):443–453
Nurk S, Bankevich A et al (2013) Assembling genomes and mini-metagenomes from highly chimeric reads. In: Deng M, Jiang R, Sun F, Zhang X (eds) 17th Annual international conference on research in computational molecular biology, RECOMB 2013, Beijing, China, April 7–10, 2013. Proceedings, Springer, Berlin, Heidelberg, pp 158–170
Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM (2016) Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol 17(1):132
Reinert G, Chew D, Sun F, Waterman MS (2009) Alignment-free sequence comparison (I): statistics and power. J Comput Biol 16(12):1615–1634
Ryšavý P, Železný F (2016) Estimating sequence similarity from read sets for clustering sequencing data. In: Boström H, Knobbe A, Soares C, Papapetrou P (eds) 15th International symposium on advances in intelligent data analysis XV, IDA 2016, Stockholm, Sweden, October 13–15, 2016, Proceedings, Springer International Publishing, Cham, pp 204–214
Saitou N, Nei M (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4(4):406–425
Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJM, Birol İ (2009) ABySS: a parallel assembler for short read sequence data. Genome Res 19(6):1117–1123
Sokal RR, Michener CD (1958) A statistical method for evaluating systematic relationships. Univ Kans Sci Bull 38:1409–1438
Song K, Ren J, Zhai Z, Liu X, Deng M, Sun F (2013) Alignment-free sequence comparison based on next-generation sequencing reads. J Comput Biol 20(2):64–79
Ukkonen E (1992) Approximate string-matching with \(q\)-grams and maximal matches. Theor Comput Sci 92(1):191–211
Wagner RA, Fischer MJ (1974) The string-to-string correction problem. J Assoc Comput Mach 21(1):168–173
Warren RL, Sutton GG, Jones SJM, Holt RA (2007) Assembling millions of short DNA sequences using SSAKE. Bioinformatics 23(4):500–501
Weitschek E, Santoni D, Fiscon G, De Cola MC, Bertolazzi P, Felici G (2014) Next generation sequencing reads comparison with an alignment-free distance. BMC Res Notes 7:869
Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Edgar R, Federhen S, Feolo M, Geer LY, Helmberg W, Kapustin Y, Khovayko O, Landsman D, Lipman DJ, Madden TL, Maglott DR, Miller V, Ostell J, Pruitt KD, Schuler GD, Shumway M, Sequeira E, Sherry ST, Sirotkin K, Souvorov A, Starchenko G, Tatusov RL, Tatusova TA, Wagner L, Yaschenko E (2008) Database resources of the National Center for Biotechnology Information. Nucl Acids Res 36(suppl 1):D13–D21
Yi H, Jin L (2013) Co-phylog: an assembly-free phylogenomic approach for closely related organisms. Nucl Acids Res 41(7):e75
Zerbino DR, Birney E (2008) Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 18(5):821–829
Železný F, Jalovec K, Tolar J (2014) Learning meets sequencing: a generality framework for read-sets. In: ILP 2014: 24th International conference on inductive logic programming, Late-Breaking Papers

Acknowledgements

The authors acknowledge the support of the OP VVV project CZ.02.1.01/0.0/0.0/16_019/0000765 “Research Center for Informatics”. Access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the programme “Projects of Large Research, Development, and Innovations Infrastructures” (CESNET LM2015042), is greatly appreciated.

Author information

Authors and Affiliations

Department of Computer Science, Faculty of Electrical Engineering, Czech Technical University in Prague, Prague, Czech Republic
Petr Ryšavý & Filip Železný

Corresponding author

Correspondence to Petr Ryšavý.

Additional information

Responsible editor: Pierre Baldi.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 267 KB)

About this article

Cite this article

Ryšavý, P., Železný, F. Estimating sequence similarity from read sets for clustering next-generation sequencing data. Data Min Knowl Disc 33, 1–23 (2019). https://doi.org/10.1007/s10618-018-0584-8

Received: 28 August 2017
Accepted: 16 July 2018
Published: 04 August 2018
Issue Date: 15 January 2019
DOI: https://doi.org/10.1007/s10618-018-0584-8
