ERA: efficient serial and parallel suffix tree construction for very long strings

Authors:
Essam Mansour; Amin Allam; Spiros Skiadopoulos; Panos Kalnis
Affiliations:
King Abdullah Univ. of Science and Technology; King Abdullah Univ. of Science and Technology; University of Peloponnese; King Abdullah Univ. of Science and Technology
Venue:
Proceedings of the VLDB Endowment
Year:
2011

Abstract

The suffix tree is a data structure for indexing strings. It is used in a variety of applications such as bioinformatics, time series analysis, clustering, text editing and data compression. However, when the string and the resulting suffix tree are too large to fit in main memory, most existing construction algorithms become very inefficient. This paper presents a disk-based suffix tree construction method, called Elastic Range (ERa), which works efficiently with very long strings that are much larger than the available memory. ERa partitions the tree construction process horizontally and vertically and minimizes I/Os by dynamically adjusting the horizontal partitions independently for each vertical partition, based on the evolving shape of the tree and the available memory. Where appropriate, ERa also groups vertical partitions together to amortize the I/O cost. We developed a serial version; a parallel version for shared-memory and shared-disk multi-core systems; and a parallel version for shared-nothing architectures. ERa indexes the entire human genome in 19 minutes on an ordinary desktop computer. For comparison, the fastest existing method needs 15 minutes using 1024 CPUs on an IBM BlueGene supercomputer.
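
The abstract does not spell out how the vertical partitions are formed or grouped, so the following is only an illustrative sketch under stated assumptions, not the authors' implementation. It assumes that a vertical partition is the set of suffixes sharing a common prefix, that a prefix is lengthened until its group holds at most `max_suffixes` suffixes (a hypothetical stand-in for the memory budget), and that small partitions are then packed together with a simple first-fit heuristic. The function names `vertical_partition` and `group_partitions` are made up for this example.

```python
from collections import Counter


def vertical_partition(text, max_suffixes):
    """Map each chosen prefix to the number of suffixes of `text` that start with it.

    Assumption (not from the abstract): a prefix is extended by one symbol
    whenever more than `max_suffixes` suffixes share it.
    """
    if max_suffixes < 1:
        raise ValueError("max_suffixes must be at least 1")
    # A unique terminal symbol guarantees that every suffix is distinct,
    # so prefix extension always terminates.
    if not text or text.count(text[-1]) != 1:
        raise ValueError("text must end with a unique terminal symbol")

    alphabet = sorted(set(text))
    pending = list(alphabet)  # candidate prefixes, all of the same length
    accepted = {}
    while pending:
        length = len(pending[0])
        # One scan of the text counts every substring of the current length.
        counts = Counter(text[i:i + length] for i in range(len(text) - length + 1))
        next_pending = []
        for prefix in pending:
            count = counts.get(prefix, 0)
            if count == 0:
                continue  # prefix never occurs, nothing to index
            if count <= max_suffixes:
                accepted[prefix] = count
            else:
                next_pending.extend(prefix + symbol for symbol in alphabet)
        pending = next_pending
    return accepted


def group_partitions(frequencies, max_suffixes):
    """Pack partitions into groups whose combined size stays within the budget.

    First-fit decreasing is an assumption chosen for brevity; the paper may
    use a different grouping rule.
    """
    groups = []
    for prefix, count in sorted(frequencies.items(), key=lambda item: -item[1]):
        for group in groups:
            if group["total"] + count <= max_suffixes:
                group["prefixes"].append(prefix)
                group["total"] += count
                break
        else:
            groups.append({"prefixes": [prefix], "total": count})
    return groups


if __name__ == "__main__":
    partitions = vertical_partition("banana$", max_suffixes=2)
    for group in group_partitions(partitions, max_suffixes=2):
        print(group["prefixes"], group["total"])
```

Each group could then be handled in one pass over the string, which is the sense in which grouping amortizes I/O in this sketch. Counting with in-memory slices is for readability only; a disk-based method would scan the string sequentially instead.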