Information Systems

Volume 56, March 2016, Pages 73-108

Practical compressed string dictionaries

https://doi.org/10.1016/j.is.2015.08.008

Highlights

  • We address the problem of managing string dictionaries in compressed space.

  • We combine data structures and compression to propose several competitive solutions.

  • Our approaches usually outperform the state-of-the-art techniques on real-world dictionaries.

  • All our techniques are implemented and released in a C++ library hosted at GitHub.

Abstract

The need to store and query a set of strings – a string dictionary – arises in many kinds of applications. While classically these string dictionaries have accounted for a small share of the total space budget (e.g., in Natural Language Processing or when indexing text collections), recent applications in Web engines, Semantic Web (RDF) graphs, Bioinformatics, and many others handle very large string dictionaries, whose size is a significant fraction of the whole data. In these cases, string dictionary management is a scalability issue by itself. This paper focuses on the problem of managing large static string dictionaries in compressed main memory space. We revisit classical solutions for string dictionaries like hashing, tries, and front-coding, and improve them by using compression techniques. We also introduce some novel string dictionary representations built on top of recent advances in succinct data structures and full-text indexes. All these structures are empirically compared on a heterogeneous testbed formed by real-world string dictionaries. We show that the compressed representations may use as little as 5% of the original dictionary size, while supporting lookup operations within a few microseconds. These numbers outperform the state-of-the-art space/time tradeoffs in many cases. Furthermore, we enhance some representations to provide prefix- and substring-based searches, which also perform competitively. The results show that compressed string dictionaries are a useful building block for various data-intensive applications in different domains.

Introduction

A string dictionary is a data structure that maintains a set of strings. String dictionaries arise in classical scenarios like Natural Language (NL) processing, where finding the lexicon of a text corpus is the first step in analyzing it [56]. They also arise as a component of inverted indexes, when indexing NL text collections [79], [19], [6]. In both cases, the dictionary comprises all the different words used in the text collection. The dictionary implements a bijective function that maps strings to identifiers (IDs, generally integer values) and back. Thus, a string dictionary must provide, at least, two complementary operations: (i) string-to-ID, which locates the ID of a given string, and (ii) ID-to-string, which extracts the string identified by a given ID.
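
To make these two operations concrete, below is a minimal sketch of a static dictionary interface using the paper's locate/extract terminology; the class name and signatures are illustrative, not the actual libCSD API.

```cpp
#include <cstdint>
#include <string>

// Illustrative interface only; libCSD's actual classes differ.
class StringDictionary {
public:
    virtual ~StringDictionary() = default;

    // string-to-ID: returns the ID of the string, or 0 if it is not
    // in the dictionary (IDs are assumed to start at 1 here).
    virtual uint32_t locate(const std::string& s) const = 0;

    // ID-to-string: returns the string assigned to the given ID.
    virtual std::string extract(uint32_t id) const = 0;
};
```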

String dictionaries are a simple and effective tool for managing string data in a wide range of applications. Using dictionaries enables replacing (long, variable-length) strings by simple numbers (their IDs), which are more compact to represent and easier and more efficient to handle. A compact dictionary providing efficient mapping between strings and IDs saves storage space as well as processing and transmission costs in data-intensive applications. The growing volume of the datasets, however, has led to increasingly large dictionaries, whose management is becoming a scalability issue by itself. Their size becomes particularly important when the dictionary must reside in main memory to attain optimal performance.

This paper focuses on techniques to compress string dictionaries and the space/time tradeoffs they offer. We focus on static dictionaries, which do not change during execution. These are appropriate for the many applications whose dictionaries either are static or are rebuilt only sparingly. We revisit traditional techniques for managing string dictionaries and enhance them with data compression tools. We also design new structures that take advantage of more sophisticated compression methods, succinct data structures, and full-text indexes [62]. The resulting techniques enable large string dictionaries to be managed within compressed space in main memory. Different techniques excel in different application niches. The least space-consuming variants operate within microseconds while compressing the dictionary to as little as 5% of its original size.

The main contributions of this paper can be summarized as follows:

  1. We present, as far as we know, the most exhaustive study to date of the space/time efficiency of compressed string dictionary representations. This is not only a survey of traditional techniques: we also design novel variants that combine existing techniques with more sophisticated compression methods and data structures.

  2. We perform an exhaustive experimental tuning and comparison of all the variants we study, on a variety of real-world scenarios, providing a global picture of the current state of the art for string dictionaries. This results in clear recommendations on which structures to use depending on the application.

  3. Most of the techniques that stand out in the space/time tradeoff turn out to be combinations we designed and engineered between classical methods and more sophisticated compression techniques and data structures. These include combinations of binary search, hashing, and Front-Coding with grammar-based and optimized Hu-Tucker compression. In particular, uncovering the advantages of grammar compression for string dictionaries is an important finding.

  4. We create a C++ library, libCSD (Compressed String Dictionaries), implementing all the studied techniques. It is publicly available at https://github.com/migumar2/libCSD under the GNU LGPL license.

  5. We go beyond the basic string-to-ID and ID-to-string functionality and implement advanced searches for some of our techniques. These enable prefix-based searching for most methods (except the Hash ones) and substring searches for the FM-Index and XBW dictionaries.

The paper is organized as follows. Section 2 provides a general view of string dictionaries. We start by describing various real-world applications where large dictionaries must be efficiently handled, then define the notation used in the paper, and finally describe classical and modern techniques used to support string dictionaries, particularly in compressed space. Section 3 provides the minimal background in data compression necessary to understand the various families of compressed string dictionaries studied in this paper. Section 4 describes how we have applied those compression methods so that they perform efficiently for the dictionary operations. Sections 5 (compressed hashing dictionaries, Hash), 6 (Front-Coding: differentially encoded dictionaries), 7 (binary searchable Re-Pair, RPDAC), 8 (full-text dictionaries, FM-Index), and 9 (compressed trie dictionaries, XBW) each focus on one family of compressed string dictionaries. Section 10 provides a full experimental study of the performance of the described techniques on dictionaries coming from various real-world applications. The best-performing variants are then compared with the state of the art. We find several niches in which the new techniques dominate the space/time tradeoffs of classical methods. Finally, Section 11 concludes and describes some future work directions.

Section snippets

Applications

This section takes a short tour through various example applications where handling very large string dictionaries is a serious issue and compression could lead to considerable improvements.

NL Applications: This is the most classic application area of string dictionaries. Traditionally, the size of these dictionaries has not been a concern, because classical NL collections were carefully polished to avoid typos and other errors. On those collections, Heaps [44] formulated an empirical law
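
For reference, Heaps' law in its standard formulation (not quoted from the truncated snippet above) states that the vocabulary size V grows sublinearly with the text length n:

```latex
% Heaps' law (standard formulation): vocabulary size V as a function of text length n,
% with constants K and beta that depend on the collection.
V(n) = K\,n^{\beta}, \qquad 0 < \beta < 1
```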

Data compression and coding

Data compression [74] studies ways to encode data in less space than originally required. We consider compression of sequences and focus on lossless compression, which allows reconstructing the exact original sequence. We only cover the elements needed to follow the paper.

Statistical Compression: A way to compress a sequence is to exploit the variable frequencies of its symbols. By assigning shorter codewords to the most frequent symbols and replacing each symbol by its codeword,
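
Purely as an illustration of this idea, the sketch below builds a textbook binary Huffman code from symbol frequencies: the two least frequent subtrees are repeatedly merged, so frequent symbols receive short codewords. The Huffman and Hu-Tucker variants used later in the paper differ in details such as the code alphabet and order preservation; this code is illustrative only.

```cpp
#include <cstdint>
#include <map>
#include <queue>
#include <string>
#include <vector>

// A node of the Huffman tree; leaves carry a symbol.
struct Node {
    uint64_t freq;
    int sym;                 // -1 for internal nodes
    Node *left = nullptr, *right = nullptr;
};

// Recursively assigns codewords: '0' on left edges, '1' on right edges.
static void assign(const Node* n, std::string prefix,
                   std::map<int, std::string>& code) {
    if (!n->left && !n->right) { code[n->sym] = prefix.empty() ? "0" : prefix; return; }
    assign(n->left, prefix + '0', code);
    assign(n->right, prefix + '1', code);
}

// Builds a Huffman code from symbol frequencies.
std::map<int, std::string> huffman(const std::map<int, uint64_t>& freq) {
    auto cmp = [](const Node* a, const Node* b) { return a->freq > b->freq; };
    std::priority_queue<Node*, std::vector<Node*>, decltype(cmp)> pq(cmp);
    for (const auto& [s, f] : freq) pq.push(new Node{f, s});
    while (pq.size() > 1) {
        Node* a = pq.top(); pq.pop();
        Node* b = pq.top(); pq.pop();
        pq.push(new Node{a->freq + b->freq, -1, a, b});   // merge the two rarest
    }
    std::map<int, std::string> code;
    if (!pq.empty()) assign(pq.top(), "", code);
    return code;                 // tree nodes are deliberately leaked in this sketch
}
```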

Compressing the dictionary strings

To reduce space, we represent the strings of the dictionary, Tdict, in compressed form. We cannot use just any compression method, however; we must choose one that enables fast decompression and comparison of individual strings. We describe three methods that we will use in combination with the dictionary data structures. Their basics are described in Section 3. An issue is how to know where a compressed string si$ ends in the compressed Tdict. If we decompress si, we simply stop when we decompress
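
For concreteness, this is the uncompressed layout being referred to: the strings are concatenated into Tdict with a terminator (playing the role of $) after each one, so a string ends wherever the next terminator appears. The sketch below is illustrative, not libCSD code; handling the same question on the compressed Tdict is what this section addresses.

```cpp
#include <string>
#include <vector>

// Concatenates the dictionary strings, appending a terminator ('\0',
// playing the role of '$') after each one.
std::string build_tdict(const std::vector<std::string>& strings) {
    std::string tdict;
    for (const auto& s : strings) { tdict += s; tdict += '\0'; }
    return tdict;
}

// Extracts the string starting at a given offset: copy bytes until the
// terminator is found.
std::string extract_at(const std::string& tdict, size_t offset) {
    std::string out;
    while (tdict[offset] != '\0') out += tdict[offset++];
    return out;
}
```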

Compressed hashing dictionaries (Hash)

Hashing [23] is a folklore method to store a dictionary of any kind (not only strings). In our case, a hash function transforms a given string into an index in a hash table, where the corresponding value is to be inserted or sought. A collision arises when two different strings are mapped to the same array cell.

In this paper, we use closed hashing: if the cell corresponding to an element is occupied by another, one successively probes other cells until finding a free cell (for insertions and
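
A minimal sketch of closed hashing with linear probing over plain strings follows; the hash function, probe sequence, and table layout are illustrative, and the paper's Hash variants additionally keep the strings compressed.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Closed hashing (open addressing) with linear probing.
// Empty cells hold id == 0; the table must be kept partially empty.
struct ClosedHash {
    std::vector<std::string> keys;
    std::vector<uint32_t> ids;

    explicit ClosedHash(size_t capacity) : keys(capacity), ids(capacity, 0) {}

    static uint64_t hash(const std::string& s) {         // FNV-1a, 64-bit
        uint64_t h = 14695981039346656037ull;
        for (unsigned char c : s) { h ^= c; h *= 1099511628211ull; }
        return h;
    }

    void insert(const std::string& s, uint32_t id) {
        size_t i = hash(s) % keys.size();
        while (ids[i] != 0) i = (i + 1) % keys.size();    // probe the next cell
        keys[i] = s; ids[i] = id;
    }

    uint32_t locate(const std::string& s) const {
        size_t i = hash(s) % keys.size();
        while (ids[i] != 0) {                             // stop at the first free cell
            if (keys[i] == s) return ids[i];
            i = (i + 1) % keys.size();
        }
        return 0;                                         // not found
    }
};
```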

Front-Coding: differentially encoded dictionaries

Front-Coding [79] is a folklore compression technique for lexicographically sorted dictionaries; for example, it is used to compress the set of URLs in the WebGraph framework [12]. Front-Coding exploits the fact that consecutive entries are likely to share a common prefix, so each entry in the dictionary can be differentially encoded with respect to the preceding one. More precisely, each entry is represented using two values: an integer that encodes the length of the prefix it shares with the
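
A minimal sketch of plain Front-Coding over a lexicographically sorted list is shown below; bucketing and the compact encoding of the integers are omitted, and all names are illustrative.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// One differentially encoded entry: length of the prefix shared with the
// previous string, plus the remaining suffix.
struct FCEntry {
    uint32_t shared;
    std::string suffix;
};

// Encodes a lexicographically sorted list of strings with Front-Coding.
std::vector<FCEntry> fc_encode(const std::vector<std::string>& sorted) {
    std::vector<FCEntry> out;
    std::string prev;
    for (const auto& s : sorted) {
        uint32_t k = 0;
        while (k < prev.size() && k < s.size() && prev[k] == s[k]) ++k;
        out.push_back({k, s.substr(k)});
        prev = s;
    }
    return out;
}

// Decodes the i-th string by replaying the differences from the beginning
// (real implementations restart the decoding at bucket headers instead).
std::string fc_decode(const std::vector<FCEntry>& enc, size_t i) {
    std::string cur;
    for (size_t j = 0; j <= i; ++j)
        cur = cur.substr(0, enc[j].shared) + enc[j].suffix;
    return cur;
}
```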

Binary searchable Re-Pair (RPDAC)

If we remove the bitsequence B in Section 5, and instead sort Tdict in lexicographic order, we can still binary search S for p, using either bitsequence Y (Section 5.3) or DAC codes (Section 5.4). In this case, it is better to replace Huffman by Hu-Tucker compression, so that the strings can be lexicographically compared bytewise, without decompressing them (as done in Section 6).

This arrangement corresponds to applying compression on possibly the simplest data organization for a dictionary:
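
For intuition, here is that simplest organization in uncompressed form: the lexicographically sorted strings are concatenated, addressed through an offset array, and located by a plain binary search with bytewise comparisons. The struct and names are illustrative; the compressed variants store the strings in compressed form and compare them without full decompression.

```cpp
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Plain "sorted concatenation + offsets" dictionary.
struct SortedDict {
    std::string tdict;              // concatenation of sorted strings, '\0'-terminated
    std::vector<size_t> offset;     // offset[i] = start of the (i+1)-th string

    // Returns the 1-based ID of p, or 0 if absent, by binary search.
    uint32_t locate(const std::string& p) const {
        size_t lo = 0, hi = offset.size();
        while (lo < hi) {
            size_t mid = (lo + hi) / 2;
            int cmp = std::strcmp(tdict.c_str() + offset[mid], p.c_str());
            if (cmp == 0) return static_cast<uint32_t>(mid + 1);
            if (cmp < 0) lo = mid + 1; else hi = mid;
        }
        return 0;
    }
};
```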

Full-text dictionaries (FM-Index)

A full-text index is a data structure that, built on a text T[1,N] over an alphabet of size σ, supports fast search for patterns p in T, computing all the positions where p occurs. A self-index is a compressed full-text index that, in addition, contains enough information to efficiently reproduce any text substring [62]. A self-index can therefore replace the text.

Most self-indexes emulate a suffix array [55]. This structure is an array of integers A[1,N], so that A[i] represents the text suffix T
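
As a reminder of the emulated structure, here is a naive suffix array construction; it performs quadratic-time comparisons and is illustrative only, since self-indexes represent this information in compressed form rather than as an explicit integer array built this way.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Builds the suffix array of T: A[i] is the starting position (0-based)
// of the i-th lexicographically smallest suffix of T.
std::vector<size_t> suffix_array(const std::string& T) {
    std::vector<size_t> A(T.size());
    std::iota(A.begin(), A.end(), 0);                  // all suffix start positions
    std::sort(A.begin(), A.end(), [&T](size_t a, size_t b) {
        return T.compare(a, std::string::npos, T, b, std::string::npos) < 0;
    });
    return A;
}
```

Since the suffixes are sorted lexicographically, all occurrences of a pattern correspond to a contiguous range of A, which can be found by binary search.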

Compressed trie dictionaries (XBW)

A trie (or digital tree) [36], [50] is an edge-labeled tree that represents a set of strings, and thus a natural choice to represent a string dictionary. Each path in the trie, from the root to a leaf, represents a particular string, so those strings sharing a common prefix also share a common subpath from the root. The leaves are marked with the corresponding string IDs.

Our basic operations are easily solved on tries. For locate(p) we traverse the trie from the root, descending by the edges
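
A minimal pointer-based trie sketch of that traversal follows; it is illustrative only, as the XBW representation discussed in this section encodes the same tree succinctly instead of with pointers.

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <string>

// Plain pointer-based trie; end-of-string nodes store the string ID.
struct TrieNode {
    uint32_t id = 0;                                   // 0 = not a string end
    std::map<char, std::unique_ptr<TrieNode>> child;
};

void insert(TrieNode& root, const std::string& s, uint32_t id) {
    TrieNode* n = &root;
    for (char c : s) {
        auto& nxt = n->child[c];
        if (!nxt) nxt = std::make_unique<TrieNode>();
        n = nxt.get();
    }
    n->id = id;
}

// locate(p): descend from the root by the edges labeled with the characters of p.
uint32_t locate(const TrieNode& root, const std::string& p) {
    const TrieNode* n = &root;
    for (char c : p) {
        auto it = n->child.find(c);
        if (it == n->child.end()) return 0;            // p is not in the trie
        n = it->second.get();
    }
    return n->id;
}
```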

Experimental evaluation

This section analyzes the empirical performance of our techniques, in space and time, over dictionaries coming from various real-world scenarios. We first consider the basic operations of locate and extract, comparing our techniques in order to choose the most prominent ones, and then comparing those with other relevant approaches from the literature. Then, we consider the prefix and substring based operations on those dictionaries where those operations are useful in practice. At the end, we

Conclusions and future work

String dictionaries have traditionally been implemented using classical data structures such as sorted arrays, hashing, or tries. However, these solutions fall short of the new scalability challenges brought up by modern data-intensive applications. Managing string dictionaries in compressed form is becoming a key technique for handling the emerging large datasets within fast main memory.

This paper studies the problem of representing and managing string dictionaries from a

References (81)

  • Yasuhito Asano, Yuya Miyawaki, Takao Nishizeki, Efficient compression of Web graphs, In: Proceedings of the 14th Annual...
  • Ricardo Baeza-Yates et al., Modern Information Retrieval (2011)
  • Hannah Bast, Christian Worm Mortensen, Ingmar Weber, Output-sensitive autocompletion search, Inf. Retr. (2008)
  • Djamal Belazzougui, Fabiano C. Botelho, Martin Dietzfelbinger, Hash, displace, and compress, In: 17th Annual European...
  • David Benoit et al., Representing trees of higher degree, Algorithmica (2005)
  • Tim Berners-Lee, James Hendler, Ora Lassila, The Semantic Web, Scientific American,...
  • Paolo Boldi, Marco Rosa, Massimo Santini, Sebastiano Vigna, Layered label propagation: a multiresolution...
  • Paolo Boldi, Sebastiano Vigna, The WebGraph framework I: compression techniques, In: Proceedings of the 13th...
  • Leonid Boytsov, Indexing methods for approximate dictionary searching: comparative analysis, ACM J. Exp. Algorithmics (2011)
  • Nieves R. Brisaboa, Rodrigo Cánovas, Francisco Claude, Miguel A. Martínez-Prieto, Gonzalo Navarro, Compressed string...
  • Michael Burrows, David J. Wheeler, A Block-sorting Lossless Data Compression Algorithm, Technical Report, Digital...
  • Stefan Büttcher et al., Information Retrieval: Implementing and Evaluating Search Engines (2010)
  • Chris Callison-Burch, Collin Bannard, Josh Schroeder, Scaling phrase-based statistical machine translation to larger...
  • Moses Charikar et al., The smallest grammar problem, IEEE Trans. Inf. Theory (2005)
  • Francisco Claude, Antonio Fariña, Miguel A. Martínez-Prieto, Gonzalo Navarro, Compressed q-gram indexing for highly...
  • Thomas H. Cormen et al., Introduction to Algorithms (2001)
  • Debora Donato et al., Algorithms and experiments for the Webgraph, J. Graph Algorithms Appl. (2006)
  • Peter Elias, Efficient storage and retrieval by content and address of static files, J. ACM (1974)
  • Robert Fano, On the Number of Bits Required to Implement an Associative Memory, Memo 61, Computer Structures Group,...
  • Antonio Fariña et al., Word-based self-indexes for natural language text, ACM Trans. Inf. Syst. (2012)
  • Paolo Ferragina et al., Compressed text indexes: from theory to practice, J. Exp. Algorithmics (2009)
  • Paolo Ferragina et al., The string B-tree: a new data structure for string search in external memory and its applications, J. ACM (1999)
  • Paolo Ferragina, Roberto Grossi, Ankur Gupta, Rahul Shah, Jeffrey S. Vitter, On searching compressed string collections...
  • Paolo Ferragina, Fabrizio Luccio, Giovanni Manzini, S. Muthukrishnan, Structuring labeled trees for optimal...
  • Paolo Ferragina et al., Indexing compressed texts, J. ACM (2005)
  • Paolo Ferragina et al., Compressed representations of sequences and full-text indexes, ACM Trans. Algorithms (2007)
  • Paolo Ferragina et al., The compressed permuterm index, ACM Trans. Algorithms (2010)
  • Edward Fredkin, Trie memory, Commun. ACM (1960)
  • Michael L. Fredman et al., Storing a sparse table with O(1) worst case access time, J. ACM (1984)
  • Rodrigo González, Szymon Grabowski, Veli Mäkinen, Gonzalo Navarro, Practical implementation of rank and select...

    A preliminary version of this paper appeared in the Proceedings of the 10th International Symposium on Experimental Algorithms (SEA), 2011, pp. 136–147.

    1. Funded by the Spanish Ministry of Economy and Competitiveness: TIN2013-46238-C4-3-R, and ICT COST Action KEYSTONE (IC1302).

    2. Funded with basal funds FB0001, Conicyt, Chile.

    3. Funded in part by Fondecyt Iniciación 11130104.
