Overcoming the Memory Bottleneck in Suffix Tree Construction

Authors:
Martin Farach;Paolo Ferragina;S. Muthukrishnan
Affiliations:
-;-;-
Venue:
FOCS '98 Proceedings of the 39th Annual Symposium on Foundations of Computer Science
Year:
1998

Citing 0
Cited 32

Compressed suffix arrays and suffix trees with applications to text indexing and string matching (extended abstract)

STOC '00 Proceedings of the thirty-second annual ACM symposium on Theory of computing
Efficient bundle sorting

SODA '00 Proceedings of the eleventh annual ACM-SIAM symposium on Discrete algorithms
On the sorting-complexity of suffix tree construction

Journal of the ACM (JACM)
External memory algorithms and data structures: dealing with massive data

ACM Computing Surveys (CSUR)
An experimental study of priority queues in external memory

Journal of Experimental Algorithmics (JEA)
Database indexing for large DNA and protein sequence collections

The VLDB Journal — The International Journal on Very Large Data Bases
A Database Index to Large Biological Sequences

Proceedings of the 27th International Conference on Very Large Data Bases
An Experimental Study of Priority Queues in External Memory

WAE '99 Proceedings of the 3rd International Workshop on Algorithm Engineering
LEDA-SM Extending LEDA to Secondary Memory

WAE '99 Proceedings of the 3rd International Workshop on Algorithm Engineering
Indexing Text with Approximate q-Grams

CPM '00 Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching
A New Indexing Method for Approximate String Matching

CPM '99 Proceedings of the 10th Annual Symposium on Combinatorial Pattern Matching
On Constructing Suffix Arrays in External Memory

ESA '99 Proceedings of the 7th Annual European Symposium on Algorithms
External Memory Data Structures

ESA '01 Proceedings of the 9th Annual European Symposium on Algorithms
External memory data structures

Handbook of massive data sets
External memory algorithms

Handbook of massive data sets
Accelerating Approximate Subsequence Search on Large Protein Sequence Databases

CSB '02 Proceedings of the IEEE Computer Society Conference on Bioinformatics
Towards Automatic Clustering of Protein Sequences

CSB '02 Proceedings of the IEEE Computer Society Conference on Bioinformatics
Engineering a Fast Online Persistent Suffix Tree Construction

ICDE '04 Proceedings of the 20th International Conference on Data Engineering
Constructing Suffix Tree for Gigabyte Sequences with Megabyte Memory

IEEE Transactions on Knowledge and Data Engineering
Linear work suffix array construction

Journal of the ACM (JACM)
Constructing large suffix trees on a computational grid

Journal of Parallel and Distributed Computing
Genome-scale disk-based suffix tree indexing

Proceedings of the 2007 ACM SIGMOD international conference on Management of data
Text document clustering based on frequent word meaning sequences

Data & Knowledge Engineering
External Memory Algorithms for String Problems

Fundamenta Informaticae - Workshop on Combinatorial Algorithms
B-tries for disk-based string management

The VLDB Journal — The International Journal on Very Large Data Bases
Serial and parallel methods for i/o efficient suffix tree construction

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Simple linear work suffix array construction

ICALP'03 Proceedings of the 30th international conference on Automata, languages and programming
I/O efficient algorithms for serial and parallel suffix tree construction

ACM Transactions on Database Systems (TODS)
Estimating the number of substring matches in long string databases

APWeb'05 Proceedings of the 7th Asia-Pacific web conference on Web Technologies Research and Development
Parallel construction of large suffix trees on a PC cluster

Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing
External Memory Algorithms for String Problems

Fundamenta Informaticae - Workshop on Combinatorial Algorithms
Personal bankruptcy prediction by mining credit card data

Expert Systems with Applications: An International Journal

Abstract

The suffix tree of a string is the fundamental data structure of string processing. Recent focus on massive data sets has sparked interest in overcoming the memory bottlenecks of known algorithms for building and using suffix trees.

Our main contribution is a new algorithm for suffix tree construction in which we choreograph almost all disk accesses to be via the sort and scan primitives. This algorithm achieves optimal results in a variety of sequential and parallel computational models. Two of our results are:

1) In the traditional external memory model, in which only the number of disk accesses is counted, we achieve an optimal algorithm, both for single and multiple disk cases. This is the first optimal algorithm known for either model.

2) Traditional disk page access counting does not differentiate between random page accesses and block transfers involving several consecutive pages. This difference is routinely exploited by expert programmers to get fast algorithms on real machines. We adopt a simple accounting scheme and show that our algorithm achieves the same optimal tradeoff for block versus random page accesses as the one we establish for sorting.
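
To give a feel for what "via the sort and scan primitives" means in terms of cost, the following is a rough sketch of the standard textbook I/O estimates for scanning and sorting in the external memory model. It is not taken from the paper: the names `scan_ios` and `sort_ios` and the parameters `n` (number of items), `memory` (items that fit in main memory), and `block` (items per disk block) are assumptions made for illustration, and constant factors are dropped.

```python
import math


def scan_ios(n: int, block: int) -> int:
    """Block transfers needed to read n items sequentially (assumed model)."""
    if block <= 0:
        raise ValueError("block must be positive")
    return math.ceil(n / block)


def sort_ios(n: int, memory: int, block: int) -> float:
    """Order-of-magnitude I/O estimate for external merge sort.

    Uses the textbook form (n/B) * log_{M/B}(n/B), constants dropped.
    This is an illustrative estimate, not a measured cost.
    """
    if block <= 0 or memory < 2 * block:
        raise ValueError("need memory >= 2 * block > 0")
    blocks = scan_ios(n, block)
    if blocks <= 1:
        return float(blocks)
    fan_in = memory / block  # number of blocks that fit in memory
    passes = max(1.0, math.log(blocks) / math.log(fan_in))
    return blocks * passes


if __name__ == "__main__":
    # Hypothetical sizes: 10^6 items, 10^4 items of memory, 100 items per block.
    print(scan_ios(10**6, 100), sort_ios(10**6, 10**4, 100))
```

Under these assumed bounds, an algorithm built almost entirely from sorts and scans inherits the sorting cost, which is the sense in which the abstract relates the construction to sorting.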