Practical suffix tree construction

Authors:
Sandeep Tata;Richard A. Hankins;Jignesh M. Patel
Affiliations:
University of Michigan, Ann Arbor, MI;University of Michigan, Ann Arbor, MI;University of Michigan, Ann Arbor, MI
Venue:
VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Year:
2004

Abstract

Large string datasets are common in a number of emerging text and biological database applications. Common queries over such datasets include both exact and approximate string matches. These queries can be evaluated very efficiently by using a suffix tree index on the string dataset. Although suffix trees can be constructed quickly in memory for small input datasets, constructing persistent trees for large datasets has been challenging. In this paper, we explore suffix tree construction algorithms over a wide spectrum of data sources and sizes. First, we show that on modern processors, a cache-efficient algorithm with O(n^2) complexity outperforms the popular O(n) Ukkonen algorithm, even for in-memory construction. For larger datasets, the disk I/O requirement quickly becomes the bottleneck in each algorithm's performance. To address this problem, we present a buffer management strategy for the O(n^2) algorithm, creating a new disk-based construction algorithm that scales to sizes much larger than have been previously described in the literature. Our approach far outperforms the best known disk-based construction algorithms.
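
To make the idea of a quadratic-time suffix tree construction concrete, here is a minimal in-memory sketch in Python. It is an illustration only and is not taken from the paper: it builds the tree top-down by repeatedly partitioning suffixes on their next character, which takes O(n^2) time in the worst case, and it then answers exact-match queries. The names `Node`, `build`, `find`, and the `$` terminator are assumptions made for this example; the paper's cache-aware layout and buffer management are not modeled.

```python
from typing import Dict, List, Tuple


class Node:
    """Suffix tree node; a leaf records where its suffix starts in the text."""

    def __init__(self) -> None:
        # First character of the edge label -> (label start, label length, child).
        self.children: Dict[str, Tuple[int, int, "Node"]] = {}
        self.suffix_start: int = -1  # Only meaningful on leaves.


def build(text: str, terminator: str = "$") -> Tuple[str, Node]:
    """Build a suffix tree for text + terminator; return (terminated text, root)."""
    if terminator in text:
        raise ValueError("terminator must not occur in the text")
    s = text + terminator
    root = Node()
    _expand(s, root, list(range(len(s))), 0)
    return s, root


def _expand(s: str, node: Node, suffixes: List[int], depth: int) -> None:
    """Attach children for suffixes that all share their first `depth` characters."""
    # Partition the suffixes by the character found at offset `depth`.
    groups: Dict[str, List[int]] = {}
    for start in suffixes:
        groups.setdefault(s[start + depth], []).append(start)

    for ch, group in groups.items():
        child = Node()
        if len(group) == 1:
            # A single suffix: the edge runs to the end of the string (a leaf).
            start = group[0]
            child.suffix_start = start
            node.children[ch] = (start + depth, len(s) - (start + depth), child)
            continue
        # Several suffixes: extend the edge while they all agree on the next
        # character. The unique terminator guarantees they eventually differ.
        lcp = 1
        while len({s[start + depth + lcp] for start in group}) == 1:
            lcp += 1
        node.children[ch] = (group[0] + depth, lcp, child)
        _expand(s, child, group, depth + lcp)


def _leaves(node: Node) -> List[int]:
    """Collect the suffix start positions stored below a node."""
    if not node.children:
        return [node.suffix_start]
    found: List[int] = []
    for _, _, child in node.children.values():
        found.extend(_leaves(child))
    return found


def find(s: str, root: Node, pattern: str) -> List[int]:
    """Return the sorted start positions of every exact occurrence of pattern."""
    node, i = root, 0
    while i < len(pattern):
        edge = node.children.get(pattern[i])
        if edge is None:
            return []
        start, length, child = edge
        k = min(length, len(pattern) - i)
        if s[start:start + k] != pattern[i:i + k]:
            return []
        i += k
        node = child
    return sorted(_leaves(node))


if __name__ == "__main__":
    s, root = build("banana")
    print(find(s, root, "ana"))  # [1, 3]
```

Each recursive step in this sketch only scans the group of suffixes it was handed, so work on one subtree never touches another. That locality is one plausible reason a quadratic method can compete with a linear-time one in practice, although the sketch itself makes no performance claims.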