Suffix arrays on words

  • Authors:
  • Paolo Ferragina;Johannes Fischer

  • Affiliations:
  • Dipartimento di Informatica, University of Pisa;Institut für Informatik, Ludwig-Maximilians-Universität München

  • Venue:
  • CPM'07 Proceedings of the 18th annual conference on Combinatorial Pattern Matching
  • Year:
  • 2007

Quantified Score

Hi-index 0.00

Visualization

Abstract

Surprisingly enough, it is not yet known how to build directly a suffix array that indexes just the k positions at word-boundaries of a text T[1, n], taking O(n) time and O(k) space in addition to T. We propose a class-note solution to this problem that achieves such optimal time and space bounds. Word-based versions of indexes achieving the same time/space bounds were already known for suffix trees [1, 2] and (compact) DAWGs [3,4]. Our solution inherits the simplicity and efficiency of suffix arrays, with respect to such other word-indexes, and thus it foresees applications in word-based approaches to data compression [5] and computational linguistics [6]. To support this, we have run a large set of experiments showing that word-based suffix arrays may be constructed twice as fast as their full-text counterparts, and with a working space as low as 20%. The space reduction of the final word-based suffix array impacts also in their query time (i.e. less random access binary-search steps!), being faster by a factor of up to 3.