Document Listing on Repetitive Collections with Guaranteed Performance

Navarro, Gonzalo

arXiv:1707.06374 (cs)

[Submitted on 20 Jul 2017 (v1), last revised 14 Nov 2018 (this version, v3)]

Abstract: We consider document listing on string collections, that is, finding in which strings a given pattern appears. In particular, we focus on repetitive collections: a collection of size $N$ over alphabet $[1,\sigma]$ is composed of $D$ copies of a string of size $n$, and $s$ edits are applied on ranges of copies. We introduce the first document listing index with size $\tilde{O}(n+s)$, precisely $O((n\log\sigma+s\log^2 N)\log D)$ bits, and with useful worst-case time guarantees: Given a pattern of length $m$, the index reports the $\mathit{ndoc}>0$ strings where it appears in time $O(m\log^{1+\epsilon} N \cdot \mathit{ndoc})$, for any constant $\epsilon>0$ (and tells in time $O(m\log N)$ if $\mathit{ndoc}=0$). Our technique is to augment a range data structure that is commonly used on grammar-based indexes, so that instead of retrieving all the pattern occurrences, it computes useful summaries on them. We show that the idea has independent interest: we introduce the first grammar-based index that, on a text $T[1,N]$ with a grammar of size $r$, uses $O(r\log N)$ bits and counts the number of occurrences of a pattern $P[1,m]$ in time $O(m^2 + m\log^{2+\epsilon} r)$, for any constant $\epsilon>0$. We also give the first index using $O(z\log(N/z)\log N)$ bits, where $T$ is parsed by Lempel-Ziv into $z$ phrases, counting occurrences in time $O(m\log^{2+\epsilon} N)$.
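
To make the problem statement concrete, the following is a minimal brute-force sketch of document listing on a small repetitive collection. It is not the index described in the abstract: it scans every string, so its cost grows with the full collection size $N$ rather than with $m$ and $\mathit{ndoc}$. The function names and the edit format (a single-character substitution applied to a range of copies) are assumptions made for illustration; the abstract does not specify which kinds of edits are allowed.

```python
def make_repetitive_collection(base, copies, edits):
    """Build `copies` copies of `base`, then apply each edit to a range of copies.

    Hypothetical edit format: (first_copy, last_copy, position, char), meaning
    "substitute `char` at `position` in copies first_copy..last_copy inclusive".
    Only substitutions are modelled here, as a simplifying assumption.
    """
    docs = [list(base) for _ in range(copies)]
    for first, last, pos, ch in edits:
        for i in range(first, last + 1):
            docs[i][pos] = ch
    return ["".join(d) for d in docs]


def list_documents(docs, pattern):
    """Brute-force document listing: indices of the strings containing `pattern`."""
    return [i for i, d in enumerate(docs) if pattern in d]


# D = 4 copies of a string of length n = 11, with s = 2 edits on ranges of copies.
docs = make_repetitive_collection(
    "abracadabra", 4, [(1, 2, 4, "x"), (3, 3, 0, "z")]
)

print(list_documents(docs, "cad"))   # copies where position 4 was not edited
print(list_documents(docs, "xad"))   # copies touched by the first edit
print(list_documents(docs, "qq"))    # pattern that occurs nowhere
```

A compressed index aims to answer the same queries while storing only the base string and the edits, instead of all $D$ copies explicitly.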

Comments:	Extended version of CPM'17 paper
Subjects:	Data Structures and Algorithms (cs.DS)
Cite as:	arXiv:1707.06374 [cs.DS]
	(or arXiv:1707.06374v3 [cs.DS] for this version)
	https://doi.org/10.48550/arXiv.1707.06374

Submission history

From: Gonzalo Navarro
[v1] Thu, 20 Jul 2017 05:01:22 UTC (82 KB)
[v2] Wed, 23 May 2018 22:33:45 UTC (41 KB)
[v3] Wed, 14 Nov 2018 17:28:02 UTC (45 KB)
