New Perspectives on the Prefix Array

  • Authors:
  • W. F. Smyth; Shu Wang

  • Affiliations:
  • Algorithms Research Group, Department of Computing & Software, McMaster University, Hamilton, Canada L8S 4K1 and Digital Ecosystems & Business Intelligence Institute, Curtin University, Perth, Aus ...;Algorithms Research Group, Department of Computing & Software, McMaster University, Hamilton, Canada L8S 4K1

  • Venue:
  • SPIRE '08 Proceedings of the 15th International Symposium on String Processing and Information Retrieval
  • Year:
  • 2008

Abstract

In this paper we consider the prefix array π = π[1..n] of a string x = x[1..n] in which π[1] = 0 and, for i > 1, π[i] = k iff k is the largest integer such that x[1..k] = x[i..i+k-1]. The prefix array π is closely related to the border array β: an integer array β[1..n] such that β[i] = k iff the length of the longest border of x[1..i] is k. Border arrays or their variants are used in many string algorithms, and prefix arrays can be used directly for pattern-matching. It is well known that for regular strings π provides all the information that β does; we show, however, that for indeterminate strings (those containing entries that match a subset of the alphabet) π actually provides more information, in fact still enabling all the borders of every prefix of x to be specified. Since many of the entries of π are expected to be zeros, it is natural to represent π in compressed form using integer arrays POS[1..m] and LEN[1..m], where m is the number of nonzero entries in π and π[POS[j]] = LEN[j] iff the j-th nonzero entry in π occurs in position POS[j] and takes the value LEN[j]. The expected value of m is n/σ - 1, where σ is the alphabet size. The straightforward way of computing POS/LEN requires computing π first, and therefore requires O(n) extra space. We describe two Θ(n)-time algorithms, PL1 and PL2, to compute POS/LEN for regular strings using only 8m bytes of storage in addition to the n bytes required for x. PL1 requires about one-third the time of the standard border array algorithm MP on English-language strings; PL2 executes faster than MP on both English-language and highly periodic strings on {a, b}. For indeterminate strings, we describe an extension IPL of PL1 that computes POS/LEN in O(n²) worst-case time (though generally much faster), still using only 8m bytes of additional storage. For both regular and indeterminate strings, the compressed form of π can be used for efficient pattern-matching.
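
To make the definitions above concrete, the following Python sketch computes π for a regular string, compresses it into POS/LEN, and recovers the border array β from the compressed form. It is not PL1, PL2, or IPL, whose details the abstract does not give. It follows what the abstract calls the straightforward way: π is built in full first (here with the familiar linear-time windowed scan, often called the Z-algorithm) and then compressed, so it uses O(n) extra space. The function names `prefix_array`, `compress`, and `border_array_from_pos_len` are hypothetical, chosen for illustration. Lists are 0-based internally, while POS stores 1-based positions as in the abstract.

```python
def prefix_array(x):
    """Return pi as a 0-based list: pi[i] is the length of the longest
    substring of x starting at index i that equals a prefix of x.
    pi[0] is set to 0, mirroring the convention pi[1] = 0 (1-based)."""
    n = len(x)
    pi = [0] * n
    l = r = 0  # x[l:r] is the rightmost prefix match found so far
    for i in range(1, n):
        if i < r:
            # Reuse the value already known for the mirrored position.
            pi[i] = min(r - i, pi[i - l])
        # Extend the match by explicit letter comparisons.
        while i + pi[i] < n and x[pi[i]] == x[i + pi[i]]:
            pi[i] += 1
        if i + pi[i] > r:
            l, r = i, i + pi[i]
    return pi


def compress(pi):
    """Return (POS, LEN): 1-based positions and values of nonzero entries."""
    pos = [i + 1 for i, k in enumerate(pi) if k > 0]
    length = [k for k in pi if k > 0]
    return pos, length


def border_array_from_pos_len(n, pos, length):
    """Recover the border array (0-based list) of a regular string of
    length n from its compressed prefix array."""
    beta = [0] * n
    for p, k in zip(pos, length):
        i = p - 1  # convert the 1-based position to a 0-based index
        # A prefix match of length k at i gives x[0..j] a border of
        # length j - i + 1 for every j in i .. i + k - 1.
        for j in range(i + k - 1, i - 1, -1):
            if beta[j]:
                break  # an earlier position already gave a longer border
            beta[j] = j - i + 1
    return beta


x = "abaababa"
pi = prefix_array(x)
pos, length = compress(pi)
beta = border_array_from_pos_len(len(x), pos, length)
print(pi)           # [0, 0, 1, 3, 0, 3, 0, 1]
print(pos, length)  # [3, 4, 6, 8] [1, 3, 3, 1]
print(beta)         # [0, 0, 1, 1, 2, 3, 2, 3]
```

The last function illustrates the claim that, for regular strings, π carries all the information in β. Positions are processed in increasing order, so the first value written to beta[j] comes from the leftmost match covering j and is therefore the longest border. Each entry of beta is written at most once, which keeps the conversion linear in n.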