On the sorting-complexity of suffix tree construction

Authors:
Martin Farach-Colton;Paolo Ferragina;S. Muthukrishnan
Affiliations:
Rutgers Univ., Rutgers, NJ;Univ. of Pisa, Pisa, Italy;AT&T Shannon Labs, Florham Park, NJ
Venue:
Journal of the ACM (JACM)
Year:
2000

Abstract

The suffix tree of a string is the fundamental data structure of combinatorial pattern matching. We present a recursive technique for building suffix trees that yields optimal algorithms in different computational models. Sorting is an inherent bottleneck in building suffix trees, and our algorithms match the sorting lower bound. Specifically, we present the following results.

(1) Weiner [1973], who introduced the data structure, gave an optimal O(n)-time algorithm for building the suffix tree of an n-character string drawn from a constant-size alphabet. In the comparison model, there is a trivial Ω(n log n)-time lower bound based on sorting, and Weiner's algorithm matches this bound. For integer alphabets, the fastest known algorithm is the O(n log n)-time comparison-based algorithm, but no super-linear lower bound is known. Closing this gap is the main open question in stringology. We settle this open problem by giving a linear-time reduction to sorting for building suffix trees. Since sorting is a lower bound for building suffix trees, this algorithm is time-optimal in every alphabet model. In particular, for an alphabet consisting of integers in a polynomial range, we get the first known linear-time algorithm.

(2) All previously known algorithms for building suffix trees exhibit a marked absence of locality of reference, and thus they tend to elicit many page faults (I/Os) when indexing very long strings. They are therefore unsuitable for building suffix trees on secondary storage devices, where I/Os dominate the overall computational cost. We give a linear-I/O reduction to sorting for suffix tree construction. Since sorting is a trivial I/O lower bound for building suffix trees, our algorithm is I/O-optimal.
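
The claim that sorting is a lower bound for suffix tree construction can be illustrated with a small Python sketch. This is not the recursive construction of the paper. It uses naive suffix sorting as a stand-in for the left-to-right leaf order of a suffix tree, assuming the children of each node are kept in sorted order. The function names `suffix_array` and `sorted_distinct_symbols` are hypothetical and chosen only for this example.

```python
def suffix_array(text):
    """Return suffix start positions in lexicographic order.

    Naive approach: compares whole suffixes, so it is far slower than
    the linear-time bounds discussed in the abstract. It only serves to
    produce the same order as the leaves of an ordered suffix tree.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def sorted_distinct_symbols(text):
    """Read the sorted distinct symbols off the sorted suffixes.

    The first symbols of the suffixes, taken in sorted order, are
    non-decreasing, so dropping repeats yields the sorted alphabet
    of the input. An ordered suffix structure therefore already
    contains a sorted copy of the input symbols.
    """
    symbols = []
    for start in suffix_array(text):
        first = text[start]
        if not symbols or symbols[-1] != first:
            symbols.append(first)
    return symbols


if __name__ == "__main__":
    print(suffix_array("banana"))
    print(sorted_distinct_symbols("banana"))
```

Because the ordered suffixes reveal the sorted order of the symbols, any algorithm that builds such an ordered structure must do at least the work of sorting the alphabet symbols that occur in the string, which is the sense in which the abstract calls sorting an inherent bottleneck.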