Stronger Lempel-Ziv Based Compressed Text Indexing

  • Authors:
  • Diego Arroyuelo;Gonzalo Navarro;Kunihiko Sadakane

  • Affiliations:
  • Yahoo! Research Latin America, Chile, Blanco Encalada 2120, Santiago, Chile;Universidad de Chile, Dept. of Computer Science, Blanco Encalada 2120, Santiago, Chile;National Institute of Informatics, Principles of Informatics Research Division, 2-1-2 Hitotsubashi, 101-8430, Chiyoda-ku, Tokyo, Japan

  • Venue:
  • Algorithmica
  • Year:
  • 2012

Quantified Score

Hi-index 0.00

Visualization

Abstract

Given a text T[1..u] over an alphabet of size σ, the full-text search problem consists in finding the occ occurrences of a given pattern P[1..m] in T. In indexed text searching we build an index on T to improve the search time, yet increasing the space requirement. The current trend in indexed text searching is that of compressed full-text self-indices, which replace the text with a more space-efficient representation of it, at the same time providing indexed access to the text. Thus, we can provide efficient access within compressed space. The Lempel-Ziv index (LZ-index) of Navarro is a compressed full-text self-index able to represent T using 4uH k (T)+o(ulog σ) bits of space, where H k (T) denotes the k-th order empirical entropy of T, for any k=o(log σ u). This space is about four times the compressed text size. The index can locate all the occ occurrences of a pattern P in T in O(m 3log σ+(m+occ)log u) worst-case time. Although this index has proven very competitive in practice, the O(m 3log σ) term can be excessive for long patterns. Also, the factor 4 in its space complexity makes it larger than other state-of-the-art alternatives. In this paper we present stronger Lempel-Ziv based indices (LZ-indices), improving the overall performance of the original LZ-index. We achieve indices requiring (2+ε)uH k (T)+o(ulog σ) bits of space, for any constant ε0, which makes them the smallest existing LZ-indices. We simultaneously improve the search time to O(m 2+(m+occ)log u), which makes our indices very competitive with state-of-the-art alternatives. Our indices support displaying any text substring of length ℓ in optimal O(ℓ/log σ u) time. In addition, we show how the space can be squeezed to (1+ε)uH k (T)+o(ulog σ) to obtain a structure with O(m 2) average search time for m≥2log σ u. Alternatively, the search time of LZ-indices can be improved to O((m+occ)log u) with (3+ε)uH k (T)+o(ulog σ) bits of space, which is much less than the space needed by other Lempel-Ziv-based indices achieving the same search time. Overall our indices stand out as a very attractive alternative for space-efficient indexed text searching.