Self-Indexed Grammar-Based Compression

  • Authors:
  • Francisco Claude; Gonzalo Navarro

  • Affiliations:
  • (Correspd.) (Funded in part by NSERC Canada, Go-Bell Scholarships program and David R. Cheriton Graduate Scholarships program.) David R. Cheriton School of Computer Science, University of Waterloo ...; (Funded in part by Millennium Institute on Cell Dynamics and Biotechnology (ICDB), Grant ICM P05-001-F, Mideplan, Chile) Department of Computer Science, University of Chile, Chile. gnavarro@dcc.uc ...

  • Venue:
  • Fundamenta Informaticae
  • Year:
  • 2011

Abstract

Self-indexes aim at representing text collections in a compressed format that allows extracting arbitrary portions and also offers indexed searching on the collection. Current self-indexes are unable to fully exploit the redundancy of highly repetitive text collections that arise in several applications. Grammar-based compression is well suited to exploit such repetitiveness. We introduce the first grammar-based self-index. It builds on Straight-Line Programs (SLPs), a rather general kind of context-free grammar. If an SLP of n rules represents a text T[1, u], then an SLP-compressed representation of T requires 2n log₂ n bits. For that same SLP, our self-index takes O(n log n) + n log₂ u bits. It extracts any text substring of length m in time O((m + h) log n), and finds occ occurrences of a pattern string of length m in time O((m(m + h) + h occ) log n), where h is the height of the parse tree of the SLP. No previous grammar representation had achieved o(n) search time. As byproducts we introduce (i) a representation of SLPs that takes 2n log₂ n(1 + o(1)) bits and efficiently supports more operations than a plain array of rules; (ii) a representation for binary relations with labels supporting various extended queries; (iii) a generalization of our self-index to grammar compressors that reduce T to a sequence of terminals and nonterminals, such as Re-Pair and LZ78.
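
To make the notion of an SLP concrete, the following is a minimal sketch of the "plain array of rules" baseline mentioned in the abstract, not the authors' compressed self-index. It assumes each rule is either a single terminal character or a pair of earlier rule indices; the names `expansion_lengths` and `extract` are hypothetical and chosen only for illustration. Extraction descends the parse tree using precomputed expansion lengths, so the text is never fully decompressed.

```python
from typing import List, Tuple, Union

# Assumed encoding: a rule is a terminal character, or a pair (left, right)
# of indices of earlier rules. This is an illustrative plain-array SLP.
Rule = Union[str, Tuple[int, int]]


def expansion_lengths(rules: List[Rule]) -> List[int]:
    """Length of the string generated by each rule (rules refer only to earlier ones)."""
    lengths: List[int] = []
    for rule in rules:
        if isinstance(rule, str):
            lengths.append(1)
        else:
            left, right = rule
            lengths.append(lengths[left] + lengths[right])
    return lengths


def extract(rules: List[Rule], lengths: List[int], root: int,
            start: int, length: int) -> str:
    """Return T[start : start + length] (0-based) by walking the parse tree of `root`."""
    out: List[str] = []

    def visit(rule_id: int, lo: int, hi: int) -> None:
        # [lo, hi) is the wanted range inside the expansion of rule_id.
        if lo >= hi:
            return
        rule = rules[rule_id]
        if isinstance(rule, str):
            out.append(rule)
            return
        left, right = rule
        left_len = lengths[left]
        if lo < left_len:   # the range overlaps the left child
            visit(left, lo, min(hi, left_len))
        if hi > left_len:   # the range overlaps the right child
            visit(right, max(lo, left_len) - left_len, hi - left_len)

    visit(root, start, min(start + length, lengths[root]))
    return "".join(out)


# Five rules generating T = "ababab": a, b, ab, abab, ababab.
rules: List[Rule] = ["a", "b", (0, 1), (2, 2), (3, 2)]
lengths = expansion_lengths(rules)
print(extract(rules, lengths, root=4, start=1, length=4))  # baba
```

In this uncompressed form the walk touches only the two boundary paths of the parse tree plus the subtrees fully inside the requested range. The log n factor in the abstract's extraction bound would then be attributable to navigating the compressed representation, which this sketch does not model.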