On the sorting-complexity of suffix tree construction

  • Authors:
  • Martin Farach-Colton;Paolo Ferragina;S. Muthukrishnan

  • Affiliations:
  • Rutgers Univ., Rutgers, NJ;Univ. of Pisa, Pisa, Italy;AT&T Shannon Labs, Florham Park, NJ

  • Venue:
  • Journal of the ACM (JACM)
  • Year:
  • 2000

Quantified Score

Hi-index 0.01

Visualization

Abstract

The suffix tree of a string is the fundamental data structure of combinatorial pattern matching. We present a recursive technique for building suffix trees that yields optimal algorithms in different computational models. Sorting is an inherent bottleneck in building suffix trees and our algorithms match the sorting lower bound. Specifically, we present the following results. (1) Weiner [1973], who introduced the data structure, gave an optimal 0(n)-time algorithm for building the suffix tree of an n-character string drawn from a constant-size alphabet. In the comparison model, there is a trivial &Ogr;(n log n)-time lower bound based on sorting, and Weiner's algorithm matches this bound. For integer alphabets, the fastest known algorithm is the O(n log n)time comparison-based algorithm, but no super-linear lower bound is known. Closing this gap is the main open question in stringology. We settle this open problem by giving a linear time reduction to sorting for building suffix trees. Since sorting is a lower-bound for building suffix trees, this algorithm is time-optimal in every alphabet mode. In particular, for an alphabet consisting of integers in a polynomial range we get the first known linear-time algorithm. (2) All previously known algorithms for building suffix trees exhibit a marked absence of locality of reference, and thus they tend to elicit many page faults (I/Os) when indexing very long strings. They are therefore unsuitable for building suffix trees in secondary storage devices, where I/Os dominate the overall computational cost. We give a linear-I/O reduction to sorting for suffix tree construction. Since sorting is a trivial I/O-lower bound for building suffix trees, our algorithm is I/O-optimal.