RLBWT Tricks

Authors: Nathaniel K. Brown, Travis Gagie, Massimiliano Rossi






Author Details

Nathaniel K. Brown
  • Faculty of Computer Science, Dalhousie University, Halifax, Canada
Travis Gagie
  • Faculty of Computer Science, Dalhousie University, Halifax, Canada
Massimiliano Rossi
  • Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA

Acknowledgements

Many thanks to Omar Ahmed, Christina Boucher and Ben Langmead for discussions and assistance during our research, and to the anonymous reviewers for their insightful feedback.

Cite As

Nathaniel K. Brown, Travis Gagie, and Massimiliano Rossi. RLBWT Tricks. In 20th International Symposium on Experimental Algorithms (SEA 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 233, pp. 16:1-16:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)
https://doi.org/10.4230/LIPIcs.SEA.2022.16

Abstract

Until recently, most experts would probably have agreed we cannot backwards-step in constant time with a run-length compressed Burrows-Wheeler Transform (RLBWT), since doing so relies on rank queries on sparse bitvectors and those inherit lower bounds from predecessor queries. At ICALP '21, however, Nishimoto and Tabei described a new, simple and constant-time implementation. For a permutation π, it stores an O(r)-space table (where r is the number of positions i where either i = 0 or π(i + 1) ≠ π(i) + 1) that enables the computation of successive values of π(i) by table look-ups and linear scans. Nishimoto and Tabei showed how to increase the number of rows in the table to bound the length of the linear scans such that the query time for computing π(i) is constant while maintaining O(r)-space. In this paper we refine Nishimoto and Tabei's approach, including a time-space tradeoff, and experimentally evaluate different implementations, demonstrating the practicality of part of their result. We show that even without adding rows to the table, in practice we almost always scan only a few entries during queries. We propose a decomposition scheme of the permutation π corresponding to the LF-mapping that allows an improved compression of the data structure, while limiting the query time. We tested our implementation on real-world genomic datasets and found that without compression of the table, backward-stepping is drastically faster than with sparse bitvector implementations but, unfortunately, also uses drastically more space. After compression, backward-stepping is competitive both in time and space with the best existing implementations.
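The table look-up and linear scan described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `build_move_table` and `move` are hypothetical, the run boundaries are taken (as an assumption) to be the positions i where i = 0 or π(i) ≠ π(i − 1) + 1, and no extra rows are added, so the scan length is not bounded in the worst case.

```python
from bisect import bisect_right

def build_move_table(pi):
    """Build a run table for a permutation pi given as a list (hypothetical helper).

    Each row j describes one maximal run on which pi increases by exactly 1:
      starts[j] : first position of the run
      images[j] : pi(starts[j])
      dest[j]   : index of the run that contains images[j]
    """
    n = len(pi)
    # Assumed convention: a run starts at 0 or wherever pi breaks the +1 pattern.
    starts = [i for i in range(n) if i == 0 or pi[i] != pi[i - 1] + 1]
    images = [pi[s] for s in starts]
    # Binary search is used only at construction time, never during queries.
    dest = [bisect_right(starts, img) - 1 for img in images]
    return starts, images, dest

def move(table, i, j):
    """Given position i and the index j of the run containing it,
    return (pi(i), index of the run containing pi(i))."""
    starts, images, dest = table
    # Table look-up: pi is a shift by a constant inside a run.
    i2 = images[j] + (i - starts[j])
    # Linear scan: begin at the run holding the image of the run head
    # and walk forward until the run containing i2 is reached.
    j2 = dest[j]
    while j2 + 1 < len(starts) and starts[j2 + 1] <= i2:
        j2 += 1
    return i2, j2

# Toy permutation with three runs: [0,3), [3,5), [5,7).
pi = [2, 3, 4, 0, 1, 5, 6]
table = build_move_table(pi)

# Successive values of pi starting from position 0 in run 0.
i, j = 0, 0
for _ in range(5):
    i, j = move(table, i, j)
    print(i, j)
```

Because each call returns the run index along with the new position, the next step can start from a table row directly, which is what lets successive values of π be computed without rank or predecessor queries.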

Subject Classification

ACM Subject Classification
  • Theory of computation → Data compression
Keywords
  • Compressed String Indexes
  • Repetitive Text Collections
  • Burrows-Wheeler Transform


References

  1. Omar Ahmed, Massimiliano Rossi, Sam Kovaka, Michael C. Schatz, Travis Gagie, Christina Boucher, and Ben Langmead. Pan-genomic matching statistics for targeted nanopore sequencing. iScience, 24(6):102696, 2021.
  2. Hideo Bannai, Travis Gagie, and Tomohiro I. Refining the r-index. Theor. Comput. Sci., 812:96-108, 2020.
  3. Christina Boucher, Travis Gagie, Alan Kuhnle, Ben Langmead, Giovanni Manzini, and Taher Mun. Prefix-free parsing for building big BWTs. Algorithms Mol. Biol., 14(1):13:1-13:15, 2019.
  4. Michael Burrows and David J. Wheeler. A block-sorting lossless data compression algorithm. Technical Report 124, DEC, 1994.
  5. Paolo Ferragina and Giovanni Manzini. Indexing compressed text. J. ACM, 52(4):552-581, 2005.
  6. Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully functional suffix trees and optimal text searching in BWT-runs bounded space. J. ACM, 67(1):2:1-2:54, 2020.
  7. Simon Gog, Timo Beller, Alistair Moffat, and Matthias Petri. From theory to practice: Plug and play with succinct data structures. In 13th International Symposium on Experimental Algorithms (SEA), pages 326-337, 2014.
  8. Simon Gog, Juha Kärkkäinen, Dominik Kempa, Matthias Petri, and Simon J. Puglisi. Fixed block compression boosting in FM-indexes: Theory and practice. Algorithmica, 81(4):1370-1391, 2019.
  9. Peter W. Harrison et al. The COVID-19 Data Portal: accelerating SARS-CoV-2 and COVID-19 research through rapid open access data sharing. Nucleic Acids Research, 49(W1):W619-W623, 2021.
  10. Alan Kuhnle, Taher Mun, Christina Boucher, Travis Gagie, Ben Langmead, and Giovanni Manzini. Efficient construction of a complete index for pan-genomics read alignment. J. Comput. Biol., 27(4):500-513, 2020.
  11. Ben Langmead, Cole Trapnell, Mihai Pop, and Steven L. Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10(3):1-10, 2009.
  12. Heng Li and Richard Durbin. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinform., 25(14):1754-1760, 2009.
  13. Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki. Storage and retrieval of highly repetitive sequence collections. J. Comput. Biol., 17(3):281-308, 2010.
  14. Gonzalo Navarro. Compact Data Structures - A Practical Approach. Cambridge University Press, 2016.
  15. Takaaki Nishimoto and Yasuo Tabei. Optimal-time queries on BWT-runs compressed indexes. In 48th International Colloquium on Automata, Languages, and Programming (ICALP), pages 101:1-101:15, 2021.
  16. Massimiliano Rossi, Marco Oliva, Ben Langmead, Travis Gagie, and Christina Boucher. MONI: A pangenomic index for finding maximal exact matches. J. Comput. Biol., 29(2):169-187, 2022.
  17. The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature, 526(7571):68-74, 2015.