skip to main content
10.1145/2911451.2914802acmconferencesArticle/Chapter ViewAbstractPublication PagesirConference Proceedingsconference-collections
research-article

Succinct Data Structures in Information Retrieval: Theory and Practice

Published:07 July 2016Publication History

ABSTRACT

Succinct data structures are used today in many information retrieval applications, e.g., posting lists representation, language model representation, indexing (social) graphs, query auto-completion, document retrieval and indexing dictionary of strings, just to mention the most recent ones. These new kind of data structures mimic the operations of their classical counterparts within a comparable time complexity but require much less space. With the availability of several libraries for basic succinct structures - like SDSL, Succinct, Facebook?s Folly, and Sux - it is relatively easy to directly profit from advances in this field. In this tutorial we will introduce this field of research by presenting the most important succinct data structures to represent set of integers, set of points, trees, graphs and strings together with their most important applications to Information Retrieval problems. The introduction of the succinct data structures will be sustained with a practical session with programming handouts to solve. This will allow the attendees to directly experiment with implementations of these solutions on real datasets and understand the potential benefits they can bring on their own projects.

References

  1. J. Barbay. Succinct and compressed data structures for permutations and integer functions. In Encyclopedia of Algorithms. 2015.Google ScholarGoogle Scholar
  2. J. Barbay and J. I. Munro. Succinct encoding of permutations: Applications to text indexing. In Encyclopedia of Algorithms. 2008.Google ScholarGoogle ScholarCross RefCross Ref
  3. D. Benoit, E. Demaine, J. I. Munro, R. Raman, V.Raman, and S. S. Rao. Representing trees of higher degree. Algorithmica, 43(4):275--292, 2005.Google ScholarGoogle ScholarCross RefCross Ref
  4. P. Elias. Efficient storage and retrieval by content and address of static files. Journal of the ACM, 21:246--260, 1974. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. R. M. Fano. On the number of bits required to implement anassociative memory. Memorandum 61, Computer Structures Group, Project MAC, 1971.Google ScholarGoogle Scholar
  6. P. Ferragina, R. González, G. Navarro, and R. Venturini. Compressed text indexes: From theory to practice. ACM Journal of Experimental Algorithmics, 13, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. P. Ferragina and G. Manzini. Indexing compressed text. Journal of the ACM, 52(4):552--581, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. P. Ferragina, F. Piccinno, and R. Venturini. Compressed indexes for string-searching in labeled graphs. In Proceedings of the 24th International Conference on World Wide Web (WWW), pages --, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. P. Ferragina and S. S. Rao. Tree compression and indexing. In Encyclopedia of Algorithms. 2008.Google ScholarGoogle ScholarCross RefCross Ref
  10. P. Ferragina and R. Venturini. Indexing compressed text. In Encyclopedia of Database Systems, pages 1442--1448. 2009.Google ScholarGoogle ScholarCross RefCross Ref
  11. P. Ferragina and R. Venturini. The compressed permuterm index. ACM Transactions on Algorithms, 7(1):10, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. J. Fischer and V. Heun. Space-efficient preprocessing schemes for range minimum queries on static arrays. SIAM Journal on Computing, 40(2):465--492, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. L. Foschini, R. Grossi, A. Gupta, and J. S. Vitter. When indexing equals compression: Experiments with compressing suffix arrays and applications. ACM Transactions on Algorithms, 2(4):611--639, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. S. Gog, T. Beller, A. Moffat, and M. Petri. From theory to practice: Plug and play with succinct data structures. In Proceedings of the 13th International Symposium Experimental Algorithms (SEA), pages 326--337, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. S. Gog and G. Navarro. Improved single-term top-phk document retrieval. In Proceedings of the Seventeenth Workshop on Algorithm Engineering and Experiments, ALENEX, pages 24--32, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. S. Gog and M. Petri. Compact indexes for flexible top-k retrieval. In Combinatorial Pattern Matching - 26th Annual Symposium, CPM, pages 207--218, 2015.Google ScholarGoogle Scholar
  17. R. Grossi, A. Gupta, and J. S. Vitter. High-order entropy-compressed text indexes. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 841--850, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. R. Grossi and G. Ottaviano. Fast compressed tries through path decompositions. ACM Journal of Experimental Algorithmics, 19(1), 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. W. Hon, R. Shah, and J. S. Vitter. Space-efficient framework for top-k string retrieval problems. In 50th Annual IEEE Symposium on Foundations of Computer Science, FOCS, pages 713--722, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. W. Hon, R. Shah, and J. S. Vitter. Space-efficient framework for top-k string retrieval problems. In Proceedings of the 50th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 713--722, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. B. P. Hsu and G. Ottaviano. Space-efficient data structures for top-phk completion. In Proceedings of the 22nd International World Wide Web Conference (WWW), pages 583--594, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. R. Konow, G. Navarro, C. L. A. Clarke, and A. López-Ortiz. Faster and smaller inverted indices with treaps. In Proceedings of the 36th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pages 193--202, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. G. Navarro. Spaces, trees and colors: The algorithmic landscape of document retrieval on sequences. ACM Computing Surveys, 46(4):article 52, 2014. 47 pages. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. G. Navarro. Wavelet trees for all. Journal Discrete Algorithms, 25:2--20, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. G. Navarro and V. Mäkinen. Compressed full text indexes. ACM Computing Surveys, 39(1), 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. G. Navarro and Y. Nekrich. Top-phk document retrieval in optimal time and linear space. In Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, SODA, pages 1066--1077, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. D. Okanohara and K. Sadakane. Practical entropy-compressed rank/select dictionary. In Proceedings of the Nine Workshop on Algorithm Engineering and Experiments, ALENEX, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. G. Ottaviano, N. Tonellotto, and R. Venturini. Optimal space-time tradeoffs for inverted indexes. In Proceedings of the 8th Annual International ACM Conference on Web Search and Data Mining (WSDM), pages --, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. G. Ottaviano and R. Venturini. Partitioned elias-fano indexes. In Proceedings of the 37th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pages 273--282, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. M. Patil, S. V. Thankachan, R. Shah, W. Hon, J. S. Vitter, and S. Chandrasekaran. Inverted indexes for phrases and strings. In Proceeding of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR, pages 555--564, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. N. Rahman and R. Raman. Rank and select operations on binary strings. In Encyclopedia of Algorithms. 2008.Google ScholarGoogle ScholarCross RefCross Ref
  32. K. Sadakane. Succinct data structures for flexible text retrieval systems. J. Discrete Algorithms, 5(1):12--22, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. K. Sadakane and G. Navarro. Fully-functional succinct trees. In Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 134--149, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. E. Shareghi, M. Petri, G. Haffari, and T. Cohn. Compact, efficient and unlimited capacity: Language modeling with compressed suffix trees. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP, pages 2409--2418, 2015.Google ScholarGoogle ScholarCross RefCross Ref
  35. S. Vigna. Quasi-succinct indices. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining (WSDM), pages 83--92, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. S. Yata. Marisa trie, https://code.google.com/archive/p/marisa-trie/. 2011.Google ScholarGoogle Scholar

Index Terms

  1. Succinct Data Structures in Information Retrieval: Theory and Practice

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Conferences
        SIGIR '16: Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval
        July 2016
        1296 pages
        ISBN:9781450340694
        DOI:10.1145/2911451

        Copyright © 2016 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 7 July 2016

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article

        Acceptance Rates

        SIGIR '16 Paper Acceptance Rate62of341submissions,18%Overall Acceptance Rate792of3,983submissions,20%

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader