Practical compressed string dictionaries☆
Introduction
A string dictionary is a data structure that maintains a set of strings. It arises in classical scenarios like Natural Language (NL) processing, where finding the lexicon of a text corpus is the first step in analyzing it [56]. String dictionaries also arise as a component of inverted indexes, when indexing NL text collections [79], [19], [6]. In both cases, the dictionary comprises all the distinct words used in the text collection. The dictionary implements a bijective function that maps strings to identifiers (IDs, generally integer values) and back. Thus, a string dictionary must provide at least two complementary operations: (i) string-to-ID, which locates the ID of a given string, and (ii) ID-to-string, which extracts the string identified by a given ID.
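The two complementary operations can be sketched as follows. This is a minimal illustration in Python, not the paper's C++ libCSD API; the class and method names are our own (locate and extract follow the paper's terminology):

```python
# Minimal, uncompressed illustration of a string dictionary: a bijective
# mapping between distinct strings and dense integer IDs.
class StringDictionary:
    def __init__(self, strings):
        # Assign IDs in lexicographic order, a common convention.
        self.id_to_str = sorted(set(strings))
        self.str_to_id = {s: i for i, s in enumerate(self.id_to_str)}

    def locate(self, s):           # string-to-ID
        return self.str_to_id.get(s)

    def extract(self, i):          # ID-to-string
        return self.id_to_str[i]

d = StringDictionary(["cat", "dog", "cat", "ant"])
print(d.locate("cat"))   # 1  (IDs: ant=0, cat=1, dog=2)
print(d.extract(2))      # dog
```

The compressed structures studied in the paper provide exactly this interface while storing the strings in far less space than the plain Python dictionaries used here.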
String dictionaries are a simple and effective tool for managing string data in a wide range of applications. Using dictionaries enables replacing (long, variable-length) strings by simple numbers (their IDs), which are more compact to represent and easier and more efficient to handle. A compact dictionary providing an efficient mapping between strings and IDs saves storage, processing, and transmission costs in data-intensive applications. The growing volume of datasets, however, has led to increasingly large dictionaries, whose management is becoming a scalability issue in itself. Their size is particularly important when optimal performance must be attained under main-memory restrictions.
This paper focuses on techniques to compress string dictionaries and the space/time tradeoffs they offer. We focus on static dictionaries, which do not change during execution. These are appropriate for the many applications whose dictionaries either are static or are rebuilt only sparingly. We revisit traditional techniques for managing string dictionaries and enhance them with data compression tools. We also design new structures that take advantage of more sophisticated compression methods, succinct data structures, and full-text indexes [62]. The resulting techniques enable large string dictionaries to be managed within compressed space in main memory. Different techniques excel in different application niches. The least space-consuming variants operate within microseconds while compressing the dictionary to as little as 5% of its original size.
The main contributions of this paper can be summarized as follows:
- 1.
We present, as far as we know, the most exhaustive study to date of the space/time efficiency of compressed string dictionary representations. This is not only a survey of traditional techniques, but we also design novel variants based on combinations of existing techniques with more sophisticated compression methods and data structures.
- 2.
We perform an exhaustive experimental tuning and comparison of all the variants we study, on a variety of real-world scenarios, providing a global picture of the current state of the art for string dictionaries. This results in clear recommendations on which structures to use depending on the application.
- 3.
Most of the techniques that stand out in the space/time tradeoff turn out to be combinations we designed and engineered between classical methods and more sophisticated compression techniques and data structures. These include combinations of binary search, hashing, and Front-Coding with grammar-based and optimized Hu-Tucker compression. In particular, uncovering the advantages of grammar compression for string dictionaries is an important finding.
- 4.
We create a C++ library, libCSD (Compressed String Dictionaries), implementing all the studied techniques. It is publicly available at https://github.com/migumar2/libCSD under GNU LGPL license.
- 5.
We go beyond the basic string-to-ID and ID-to-string functionality and implement advanced searches for some of our techniques. These enable prefix-based searching for most methods (except Hash ones) and substring searches for the FM-Index and XBW dictionaries.
The paper is organized as follows. Section 2 provides a general view of string dictionaries. We start by describing various real-world applications where large dictionaries must be efficiently handled, then define the notation used in the paper, and finally describe classical and modern techniques used to support string dictionaries, particularly in compressed space. Section 3 provides the minimal background in data compression necessary to understand the various families of compressed string dictionaries studied in this paper. Section 4 describes how we have applied those compression methods so that they perform efficiently for the dictionary operations. Sections 5 (compressed hashing dictionaries, Hash), 6 (Front-Coding: differentially encoded dictionaries), 7 (binary searchable Re-Pair, RPDAC), 8 (full-text dictionaries, FM-Index), and 9 (compressed trie dictionaries, XBW) each focus on one of the families of compressed string dictionaries. Section 10 provides a full experimental study of the performance of the described techniques on dictionaries coming from various real-world applications. The best performing variants are then compared with the state of the art. We find several niches in which the new techniques dominate the space/time tradeoffs of classical methods. Finally, Section 11 concludes and describes some future work directions.
Applications
This section takes a short tour over various example applications where handling very large string dictionaries is a serious issue and compression could lead to considerable improvements.
NL Applications: This is the most classical application area of string dictionaries. Traditionally, the size of these dictionaries has not been a concern, because classical NL collections were carefully polished to avoid typos and other errors. On those collections, Heaps [44] formulated an empirical law
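Heaps' empirical law states that the vocabulary of a text of n words grows sublinearly, as V(n) = K · n^β, with β typically between 0.4 and 0.6. A tiny numeric illustration, with K and β set to purely illustrative values:

```python
# Heaps' law sketch: vocabulary size V(n) = K * n**beta grows sublinearly
# with the number of words n. K and beta below are illustrative, not fitted.
K, beta = 30.0, 0.5

for n in (10**4, 10**6, 10**8):
    print(n, round(K * n ** beta))
```

A 10,000-fold growth of the text thus yields only a 100-fold growth of the vocabulary under these parameters, which explains why dictionary size was traditionally not a concern on clean collections.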
Data compression and coding
Data compression [74] studies the way to encode data in less space than that originally required. We consider compression of sequences and focus on lossless compression, which allows reconstructing the exact original sequence. We only cover the elements needed to follow the paper.
Statistical Compression: A way to compress a sequence is to exploit the variable frequencies of its symbols. By assigning shorter codewords to the most frequent symbols and replacing each symbol by its codeword,
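The idea of assigning shorter codewords to more frequent symbols can be sketched with Huffman coding, the classical statistical method; this is an illustrative Python implementation, not the paper's engineered coder:

```python
import heapq
from collections import Counter

# Statistical (Huffman) compression sketch: repeatedly merge the two least
# frequent subtrees; symbols merged later (more frequent) get shorter codes.
def huffman_codes(text):
    heap = [(f, i, {c: ""}) for i, (c, f) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tick = len(heap)                     # unique tiebreaker for the heap
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {c: "0" + w for c, w in c1.items()}
        merged.update({c: "1" + w for c, w in c2.items()})
        heapq.heappush(heap, (f1 + f2, tick, merged))
        tick += 1
    return heap[0][2]

codes = huffman_codes("abracadabra")
# 'a' is the most frequent symbol, so its codeword is the shortest.
assert len(codes['a']) == min(len(w) for w in codes.values())
```

The resulting codewords form a prefix-free code, so a compressed sequence can be decoded unambiguously left to right.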
Compressing the dictionary strings
To reduce space, we represent the strings of the dictionary in compressed form. We cannot use just any compression method, however, but have to choose one that enables fast decompression and comparison of individual strings. We describe three methods we will use in combination with the dictionary data structures; their basics are described in Section 3. One issue is how to know where a compressed string ends in the compressed sequence. If we decompress si, we simply stop when we decompress
Compressed hashing dictionaries (Hash)
Hashing [23] is a folklore method to store a dictionary of any kind (not only strings). In our case, a hash function transforms a given string into an index in a hash table, where the corresponding value is to be inserted or sought. A collision arises when two different strings are mapped to the same array cell.
In this paper, we use closed hashing: if the cell corresponding to an element is occupied by another, one successively probes other cells until finding a free cell (for insertions and
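The probing scheme can be sketched as follows; this is an illustrative Python version of closed hashing with linear probing, not the paper's C++ implementation (which combines it with string compression):

```python
# Closed hashing sketch with linear probing: on a collision, probe the
# successive cells until finding the sought string or a free cell.
class ClosedHash:
    def __init__(self, capacity):
        self.table = [None] * capacity     # cells hold (string, id) pairs

    def _probe(self, s):
        i = hash(s) % len(self.table)
        while self.table[i] is not None and self.table[i][0] != s:
            i = (i + 1) % len(self.table)  # linear probing step
        return i

    def insert(self, s, ident):
        self.table[self._probe(s)] = (s, ident)

    def locate(self, s):
        cell = self.table[self._probe(s)]
        return cell[1] if cell else None

h = ClosedHash(7)                          # keep some cells free
for ident, s in enumerate(["ant", "bee", "cat"]):
    h.insert(s, ident)
assert h.locate("bee") == 1 and h.locate("dog") is None
```

Keeping the table larger than the number of keys guarantees that every probe sequence eventually reaches a free cell, so unsuccessful searches terminate.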
Front-Coding: differentially encoded dictionaries
Front-Coding [79] is a folklore compression technique for lexicographically sorted dictionaries, for example it is used to compress the set of URLs in the WebGraph framework [12]. Front-Coding exploits the fact that consecutive entries are likely to share a common prefix, so each entry in the dictionary can be differentially encoded with respect to the preceding one. More precisely, each entry is represented using two values: an integer that encodes the length of the prefix it shares with the
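The differential encoding just described can be sketched as follows (an illustrative Python version; the paper's variants additionally compress the suffixes and bucket the entries):

```python
# Front-Coding sketch: each entry of a lexicographically sorted dictionary
# is stored as (length of prefix shared with the previous entry, suffix).
def front_code(sorted_strings):
    out, prev = [], ""
    for s in sorted_strings:
        lcp = 0
        while lcp < min(len(prev), len(s)) and prev[lcp] == s[lcp]:
            lcp += 1
        out.append((lcp, s[lcp:]))
        prev = s
    return out

def front_decode(encoded):
    strings, prev = [], ""
    for lcp, suffix in encoded:
        prev = prev[:lcp] + suffix   # rebuild from the previous entry
        strings.append(prev)
    return strings

words = ["car", "care", "cart", "cat"]
enc = front_code(words)
print(enc)   # [(0, 'car'), (3, 'e'), (3, 't'), (2, 't')]
assert front_decode(enc) == words
```

Note that decoding an arbitrary entry requires rebuilding it from an earlier full string, which is why practical schemes restart the encoding at bucket boundaries.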
Binary searchable Re-Pair (RPDAC)
If we remove the bitsequence of Section 5 and instead sort the strings in lexicographic order, we can still binary search for p, using either bitsequences (Section 5.3) or DAC codes (Section 5.4). In this case, it is better to replace Huffman with Hu-Tucker compression, so that the strings can be lexicographically compared bytewise, without decompressing them (as done in Section 6).
This arrangement corresponds to applying compression on the possibly simplest data organization for a dictionary:
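That simplest organization, in uncompressed form, is a sorted concatenation of the strings plus one offset per string, with locate resolved by binary search; a minimal Python sketch (our own helper names, no compression applied):

```python
# Simplest sorted-dictionary layout: concatenate the lexicographically
# sorted strings (0-terminated) and keep the starting offset of each one.
def build(sorted_strings):
    text, offsets = "", []
    for s in sorted_strings:
        offsets.append(len(text))
        text += s + "\0"
    return text, offsets

def extract(text, offsets, ident):          # ID-to-string
    end = text.index("\0", offsets[ident])
    return text[offsets[ident]:end]

def locate(text, offsets, p):               # string-to-ID by binary search
    lo, hi = 0, len(offsets) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        s = extract(text, offsets, mid)
        if s == p:
            return mid
        if s < p:
            lo = mid + 1
        else:
            hi = mid - 1
    return None

text, offs = build(["ant", "bee", "cat"])
assert locate(text, offs, "bee") == 1
assert extract(text, offs, 2) == "cat"
```

RPDAC applies grammar compression on top of this layout; Hu-Tucker coding preserves the lexicographic order of the encoded strings, so the same binary search works directly on the compressed data.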
Full-text dictionaries (FM-Index)
A full-text index is a data structure that, built on a text over an alphabet of size σ, supports fast searches for a pattern p, computing all the positions where p occurs. A self-index is a compressed full-text index that, in addition, contains enough information to efficiently reproduce any text substring [62]. A self-index can therefore replace the text.
Most self-indexes emulate a suffix array [55]. This structure is an array A of integers, so that A[i] represents the text suffix
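A suffix array lists the starting positions of all suffixes of the text in lexicographic order, so every pattern occurrence corresponds to a contiguous range of suffixes; a naive illustrative construction in Python (real indexes build A in linear time and store it compressed):

```python
# Suffix array sketch: A holds the starting positions of all suffixes of T,
# sorted lexicographically. O(n^2 log n) construction, for illustration only.
def suffix_array(T):
    return sorted(range(len(T)), key=lambda i: T[i:])

T = "abracadabra$"        # '$' is a terminator smaller than any letter
A = suffix_array(T)

# Occurrences of a pattern p form a contiguous run in A; here we just scan.
p = "abra"
occ = [i for i in A if T[i:i + len(p)] == p]
print(sorted(occ))        # [0, 7]
```

Because the occurrences of p are contiguous in A, a self-index finds them with two binary searches over the suffix order rather than a scan.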
Compressed trie dictionaries (XBW)
A trie (or digital tree) [36], [50] is an edge-labeled tree that represents a set of strings, and thus a natural choice to represent a string dictionary. Each path in the trie, from the root to a leaf, represents a particular string, so those strings sharing a common prefix also share a common subpath from the root. The leaves are marked with the corresponding string IDs.
Our basic operations are easily solved on tries. For locate(p) we traverse the trie from the root, descending by the edges
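The trie traversal for locate can be sketched as follows (an illustrative pointer-based Python trie; the XBW representation of Section 9 encodes the same tree succinctly, without pointers):

```python
# Trie sketch: each root-to-leaf path spells one dictionary string; the node
# where a string ends stores its ID. locate(p) descends edge by edge.
class Trie:
    def __init__(self):
        self.children, self.ident = {}, None

    def insert(self, s, ident):
        node = self
        for c in s:
            node = node.children.setdefault(c, Trie())
        node.ident = ident

    def locate(self, p):
        node = self
        for c in p:
            node = node.children.get(c)
            if node is None:          # no edge labeled c: p is absent
                return None
        return node.ident

t = Trie()
for ident, s in enumerate(["car", "care", "cat"]):
    t.insert(s, ident)
assert t.locate("care") == 1 and t.locate("ca") is None
```

Strings sharing a prefix share the corresponding path ("car", "care", and "cat" share "ca" here), which is also what makes prefix-based searches natural on tries.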
Experimental evaluation
This section analyzes the empirical performance of our techniques, in space and time, over dictionaries coming from various real-world scenarios. We first consider the basic locate and extract operations, comparing our techniques in order to choose the most prominent ones, which we then compare with other relevant approaches from the literature. Then, we consider the prefix- and substring-based operations on those dictionaries where they are useful in practice. At the end, we
Conclusions and future work
String dictionaries have been traditionally implemented using classical data structures such as sorted arrays, hashing, or tries. However, these solutions fall short of the scalability challenges raised by modern data-intensive applications. Managing string dictionaries in compressed form is becoming a key technique for handling the emerging large datasets within fast main memory.
This paper studies the problem of representing and managing string dictionaries from a
References (81)
- et al., Practical perfect hashing in nearly optimal space, Inf. Syst. (2013)
- et al., DACs: bringing direct access to variable-length codes, Inf. Process. Manag. (2013)
- et al., Graph structure in the Web, Comput. Netw. (2000)
- et al., Binary RDF representation for publication and exchange, J. Web Semant. (2013)
- Application of Lempel–Ziv factorization to the approximation of grammar-based compression, Theor. Comput. Sci. (2003)
- A fully linear-time approximation algorithm for grammar-based compression, J. Discrete Algorithms (2005)
- Daniel J. Abadi, Samuel R. Madden, Miguel Ferreira, Integrating compression and execution in column-oriented database...
- et al., Graph compression by BFS, Algorithms (2009)
- Mario Arias, Javier D. Fernández, Miguel A. Martínez-Prieto, An empirical study of real-world SPARQL queries, In:...
- Julian Arz, Johannes Fischer, LZ-compressed string dictionaries, In: Proceedings of the Data Compression Conference...
- Modern Information Retrieval
- Christian Worm Mortensen, and Ingmar Weber, Output-sensitive autocompletion search, Inf. Retr.
- Representing trees of higher degree, Algorithmica
- Indexing methods for approximate dictionary searching: comparative analysis, ACM J. Exp. Algorithmics
- Information Retrieval: Implementing and Evaluating Search Engines
- The smallest grammar problem, IEEE Trans. Inf. Theory
- Introduction to Algorithms
- Algorithms and experiments for the Webgraph, J. Graph Algorithms Appl.
- Efficient storage and retrieval by content and address of static files, J. ACM
- Word-based self-indexes for natural language text, ACM Trans. Inf. Syst.
- Compressed text indexes: from theory to practice, J. Exp. Algorithmics
- The string B-tree: a new data structure for string search in external memory and its applications, J. ACM
- Indexing compressed texts, J. ACM
- Compressed representations of sequences and full-text indexes, ACM Trans. Algorithms
- The compressed permuterm index, ACM Trans. Algorithms
- Trie memory, Commun. ACM
- Storing a sparse table with O(1) worst case access time, J. ACM
- ☆
A preliminary version of this paper appeared in Proceedings of 10th International Symposium on Experimental Algorithms (SEA), 2011, pp. 136–147.
- 1
Funded by the Spanish Ministry of Economy and Competitiveness: TIN2013-46238-C4-3-R, and ICT COST Action KEYSTONE (IC1302).