Subset Wavelet Trees

Authors Jarno N. Alanko, Elena Biagi , Simon J. Puglisi , Jaakko Vuohtoniemi



PDF
Thumbnail PDF

File

LIPIcs.SEA.2023.4.pdf
  • Filesize: 1.03 MB
  • 14 pages

Document Identifiers

Author Details

Jarno N. Alanko
  • Helsinki Institute for Information Technology (HIIT), Finland
  • Department of Computer Science, University of Helsinki, Finland
Elena Biagi
  • Department of Computer Science, University of Helsinki, Finland
Simon J. Puglisi
  • Helsinki Institute for Information Technology (HIIT), Finland
  • Department of Computer Science, University of Helsinki, Finland
Jaakko Vuohtoniemi
  • Department of Computer Science, University of Helsinki, Finland

Acknowledgements

A brief description of the subset wavelet tree first appeared in a technical report by the authors [Alanko et al., 2022].

Cite AsGet BibTex

Jarno N. Alanko, Elena Biagi, Simon J. Puglisi, and Jaakko Vuohtoniemi. Subset Wavelet Trees. In 21st International Symposium on Experimental Algorithms (SEA 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 265, pp. 4:1-4:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)
https://doi.org/10.4230/LIPIcs.SEA.2023.4

Abstract

Given an alphabet Σ of σ = |Σ| symbols, a degenerate (or indeterminate) string X is a sequence X = X[0],X[1]…, X[n-1] of n subsets of Σ. Since their introduction in the mid 70s, degenerate strings have been widely studied, with applications driven by their being a natural model for sequences in which there is a degree of uncertainty about the precise symbol at a given position, such as those arising in genomics and proteomics. In this paper we introduce a new data structural tool for degenerate strings, called the subset wavelet tree (SubsetWT). A SubsetWT supports two basic operations on degenerate strings: subset-rank(i,c), which returns the number of subsets up to the i-th subset in the degenerate string that contain the symbol c; and subset-select(i,c), which returns the index in the degenerate string of the i-th subset that contains symbol c. These queries are analogs of rank and select queries that have been widely studied for ordinary strings. Via experiments in a real genomics application in which degenerate strings are fundamental, we show that subset wavelet trees are practical data structures, and in particular offer an attractive space-time tradeoff. Along the way we investigate data structures for supporting (normal) rank queries on base-4 and base-3 sequences, which may be of independent interest. Our C++ implementations of the data structures are available at https://github.com/jnalanko/SubsetWT.

Subject Classification

ACM Subject Classification
  • Theory of computation → Data structures design and analysis
  • Theory of computation → Design and analysis of algorithms
Keywords
  • degenerate strings
  • compressed data structures
  • succinct data structures
  • string processing
  • data structures
  • efficient algorithms

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Karl R. Abrahamson. Generalized string matching. SIAM J. Comput., 16(6):1039-1051, 1987. Google Scholar
  2. H. Alamro, M. Alzamel, C.S. Iliopoulos, S. P. Pissis, and S. Watts. IUPACpal: efficient identification of inverted repeats in IUPAC-encoded dna sequences. BMC Bioinformatics, 22(51), 2021. Google Scholar
  3. Jarno N Alanko, Simon J Puglisi, and Jaakko Vuohtoniemi. Succinct k-mer sets using subset rank queries on the spectral Burrows-Wheeler transform. bioRxiv, 2022. Google Scholar
  4. J. Barbay, M. He, I. Munro, and S. Srinivasa Rao. Succinct indexes for strings, binary relations and multilabeled trees. ACM Transactions on Algorithms, 7(4):article 52, 2011. Google Scholar
  5. E. Cambouropoulos, T. Crawford, and C.S. Iliopoulos. Pattern processing in melodic sequences: Challenges, caveats and prospects. Computers and the Humanities, 35:9-21, 2001. Google Scholar
  6. Rayan Chikhi. A tale of optimizing the space taken by de Bruijn graphs. In Proc. 17th Conference on Computability in Europe (CiE), volume 12813 of LNCS, pages 120-134. Springer, 2021. Google Scholar
  7. Maxime Crochemore, Costas S. Iliopoulos, Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, and Tomasz Walen. Covering problems for partial words and for indeterminate strings. Theor. Comput. Sci., 698:25-39, 2017. Google Scholar
  8. P. Ferragina, G. Manzini, V. Mäkinen, and G. Navarro. Compressed representations of sequences and full-text indexes. ACM Transactions on Algorithms, 3(2):article 20, 2007. Google Scholar
  9. Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In 41st Annual Symposium on Foundations of Computer Science, FOCS 2000, 12-14 November 2000, Redondo Beach, California, USA, pages 390-398. IEEE Computer Society, 2000. Google Scholar
  10. Michael J. Fischer and Michael S. Paterson. String-matching and other products. Complexity of Computation, 7:113-125, 1974. Google Scholar
  11. Simon Gog, Timo Beller, Alistair Moffat, and Matthias Petri. From theory to practice: Plug and play with succinct data structures. In Proc. 13th International Symposium on Experimental Algorithms (SEA), LNCS 8504, pages 326-337. Springer, 2014. Google Scholar
  12. Simon Gog and Matthias Petri. Optimized succinct data structures for massive data. Softw. Pract. Exp., 44(11):1287-1314, 2014. Google Scholar
  13. A. Golynski, I. Munro, and S. Srinivasa Rao. Rank/select operations on large alphabets: a tool for text indexing. In Proc. 17th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 368-373, 2006. Google Scholar
  14. R. Grossi, A. Gupta, and J. S. Vitter. High-order entropy-compressed text indexes. In Proc. 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), !booktitle = "SODA", pages 841-850, 2003. Google Scholar
  15. Guillaume Holley and Páll Melsted. Bifrost: highly parallel construction and indexing of colored and compacted de Bruijn graphs. Genome biology, 21(1):1-20, 2020. Google Scholar
  16. Jan Holub, William F. Smyth, and Shu Wang. Fast pattern-matching on indeterminate strings. J. Discrete Algorithms, 6(1):37-50, 2008. Google Scholar
  17. Costas S. Iliopoulos and Jakub Radoszewski. Truly subquadratic-time extension queries and periodicity detection in strings with uncertainties. In Roberto Grossi and Moshe Lewenstein, editors, Proc. 27th Annual Symposium on Combinatorial Pattern Matching (CPM), volume 54 of LIPIcs, pages 8:1-8:12. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2016. Google Scholar
  18. Costas S. Iliopoulos, M. Sohel Rahman, Michal Vorácek, and Ladislav Vagner. The constrained longest common subsequence problem for degenerate strings. In Jan Holub and Jan Zdárek, editors, Proc. 12th International Conference on Implementation and Application of Automata (CIAA), LNCS 4783, pages 309-311. Springer, 2007. Google Scholar
  19. IUPAC-IUB Commission on Biochemical Nomenclature. Abbreviations and symbols for nucleic acids, polynucleotides, and their constituents. Biochemistry, 9(20):4022-4027, 1970. Google Scholar
  20. Ian B Jeffery, Anubhav Das, Eileen O’Herlihy, Simone Coughlan, Katryna Cisek, Michael Moore, Fintan Bradley, Tom Carty, Meenakshi Pradhan, Chinmay Dwibedi, et al. Differences in fecal microbiomes and metabolomes of people with vs without irritable bowel syndrome and bile acid malabsorption. Gastroenterology, 158(4):1016-1028, 2020. Google Scholar
  21. Mikhail Karasikov, Harun Mustafa, Daniel Danciu, Marc Zimmermann, Christopher Barber, Gunnar Rätsch, and André Kahles. Metagraph: Indexing and analysing nucleotide archives at petabase-scale. BioRxiv, 2020. Google Scholar
  22. Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi. Hybrid compression of bitvectors for the FM-index. In Proceedings of the Data Compression Conference (DCC), pages 302-311, 2014. Google Scholar
  23. Veli Mäkinen, Djamal Belazzougui, Fabio Cunial, and Alexandru I. Tomescu. Genome-Scale Algorithm Design: Biological Sequence Analysis in the Era of High-Throughput Sequencing. Cambridge University Press, 2015. Google Scholar
  24. J. Ian Munro. Tables. In Proc. 16th Conference on Foundations of Software Technology and Theoretical Computer Science, LNCS 1180, pages 37-42. Springer, 1996. Google Scholar
  25. G. Navarro. Wavelet trees for all. In Proc. 23rd Annual Symposium on Combinatorial Pattern Matching (CPM), LNCS 7354, pages 2-26, 2012. Google Scholar
  26. Gonzalo Navarro. Compact Data Structures - A Practical Approach. Cambridge University Press, 2016. Google Scholar
  27. Gonzalo Navarro and Eliana Providel. Fast, small, simple rank/select on bitmaps. In Experimental Algorithms: 11th International Symposium, SEA 2012, Bordeaux, France, June 7-9, 2012. Proceedings 11, pages 295-306. Springer, 2012. Google Scholar
  28. Rajeev Raman. Rank and select operations on bit strings. In Encyclopedia of Algorithms, pages 1772-1775. Springer, 2016. URL: https://doi.org/10.1007/978-1-4939-2864-4_332.
  29. Rajeev Raman, Venkatesh Raman, and S. Srinivasa Rao. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 233-242, 2002. Google Scholar
  30. Sun Wu and Udi Manber. Agrep – a fast approximate pattern-matching tool. In Proc. USENIX Winter 1992 Technical Conference, pages 153-162, 1992. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail