Succinct suffix arrays based on run-length encoding

  • Authors:
  • Veli Mäkinen; Gonzalo Navarro

  • Affiliations:
  • AG Genominformatik, Technische Fakultät, Universität Bielefeld, Germany; Center for Web Research, Dept. of Computer Science, University of Chile

  • Venue:
  • CPM'05: Proceedings of the 16th Annual Conference on Combinatorial Pattern Matching
  • Year:
  • 2005

Abstract

A succinct full-text self-index is a data structure built on a text $T = t_1 t_2 \ldots t_n$ that takes little space (ideally close to that of the compressed text), permits efficient search for the occurrences of a pattern $P = p_1 p_2 \ldots p_m$ in $T$, and can reproduce any text substring, so the self-index replaces the text. Several remarkable self-indexes have been developed in recent years. They usually take $O(nH_0)$ or $O(nH_k)$ bits, where $H_k$ is the $k$th-order empirical entropy of $T$. The time to count how many times $P$ occurs in $T$ ranges from $O(m)$ to $O(m \log n)$. We present a new self-index, called the run-length FM-index (RLFM index), that counts the occurrences of $P$ in $T$ in $O(m)$ time when the alphabet size is $\sigma=O(\textrm{polylog}(n))$. The index requires $nH_k \log_2 \sigma + O(n)$ bits of space for small $k$. We then show how to implement the RLFM index in practice, and obtain in passing another implementation with different space-time tradeoffs. We empirically compare our implementations against the best existing implementations of other indexes and show that ours are the fastest among indexes taking less space than the text.
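To make the counting operation concrete, the following is a minimal Python sketch of FM-index-style backward search over the Burrows-Wheeler transform (BWT) of a text, together with a helper that lists the runs of equal symbols in the BWT. It is not the RLFM index of the paper: the function names, the `$` sentinel, and the naive suffix sorting and rank computation are all assumptions made for illustration, so the sketch has none of the space or time guarantees stated in the abstract. It only shows why counting takes one step per pattern symbol and what kind of runs a run-length-based index could exploit.

```python
from collections import Counter
from itertools import groupby


def bwt(text):
    """Return the BWT of text, which must end with a unique smallest sentinel."""
    # Naive suffix sorting; fine for a small illustration only.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT symbol is the character preceding each sorted suffix
    # (text[-1], the sentinel, for the suffix starting at position 0).
    return "".join(text[i - 1] for i in sa)


def bwt_runs(bwt_str):
    """Return the BWT as a list of (symbol, run_length) pairs."""
    return [(c, len(list(g))) for c, g in groupby(bwt_str)]


def count_occurrences(bwt_str, pattern):
    """Count occurrences of pattern using backward search on the BWT."""
    # C[c] = number of symbols in the text that are smaller than c.
    counts = Counter(bwt_str)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]

    def occ(c, i):
        # Naive rank: occurrences of c in bwt_str[:i].
        # A real index answers this in constant or near-constant time.
        return bwt_str[:i].count(c)

    lo, hi = 0, len(bwt_str)  # current suffix-array interval [lo, hi)
    for c in reversed(pattern):  # one step per pattern symbol
        if c not in C:
            return 0
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    b = bwt("abracadabra$")
    print(b)
    print(bwt_runs(b))
    print(count_occurrences(b, "abra"))
```

In this toy example the 12-symbol BWT collapses into 8 runs; on repetitive or highly compressible texts the number of runs tends to be much smaller than the text length, which is the kind of regularity a run-length representation can take advantage of.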