Constructing Suffix Tree for Gigabyte Sequences with Megabyte Memory

Authors:
Ching-Fung Cheung;Jeffrey Xu Yu;Hongjun Lu
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2005

Citing 15
Cited 21

Fast algorithms for finding nearest common ancestors

SIAM Journal on Computing
On finding lowest common ancestors: simplification and parallelization

SIAM Journal on Computing
Introduction to algorithms

Introduction to algorithms
Genetic sequence data retrieval and manipulation based on generalized suffix trees

Genetic sequence data retrieval and manipulation based on generalized suffix trees
Algorithms on strings, trees, and sequences: computer science and computational biology

Algorithms on strings, trees, and sequences: computer science and computational biology
A Space-Economical Suffix Tree Construction Algorithm

Journal of the ACM (JACM)
Reducing the space requirement of suffix trees

Software—Practice & Experience
On the sorting-complexity of suffix tree construction

Journal of the ACM (JACM)
A guided tour to approximate string matching

ACM Computing Surveys (CSUR)
A Database Index to Large Biological Sequences

Proceedings of the 27th International Conference on Very Large Data Bases
Optimal Logarithmic Time Randomized Suffix Tree Construction

ICALP '96 Proceedings of the 23rd International Colloquium on Automata, Languages and Programming
The LCA Problem Revisited

LATIN '00 Proceedings of the 4th Latin American Symposium on Theoretical Informatics
Sparse Suffix Trees

COCOON '96 Proceedings of the Second Annual International Conference on Computing and Combinatorics
Optimal suffix tree construction with large alphabets

FOCS '97 Proceedings of the 38th Annual Symposium on Foundations of Computer Science
Overcoming the Memory Bottleneck in Suffix Tree Construction

FOCS '98 Proceedings of the 39th Annual Symposium on Foundations of Computer Science

Practical methods for constructing suffix trees

The VLDB Journal — The International Journal on Very Large Data Bases
A data structure for a sequence of string accesses in external memory

ACM Transactions on Algorithms (TALG)
Genome-scale disk-based suffix tree indexing

Proceedings of the 2007 ACM SIGMOD international conference on Management of data
Improving suffix array locality for fast pattern matching on disk

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
A new method for indexing genomes using on-disk suffix trees

Proceedings of the 17th ACM conference on Information and knowledge management
DRFP-tree: disk-resident frequent pattern tree

Applied Intelligence
Reducing Space Requirements for Disk Resident Suffix Arrays

DASFAA '09 Proceedings of the 14th International Conference on Database Systems for Advanced Applications
B-tries for disk-based string management

The VLDB Journal — The International Journal on Very Large Data Bases
Serial and parallel methods for i/o efficient suffix tree construction

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Engineering a compressed suffix tree implementation

Journal of Experimental Algorithmics (JEA)
Abstractions in Process Mining: A Taxonomy of Patterns

BPM '09 Proceedings of the 7th International Conference on Business Process Management
Space-economical partial gram indices for exact substring matching

Proceedings of the 18th ACM conference on Information and knowledge management
Indexing genomic sequences on the IBM Blue Gene

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Suffix tree construction algorithms on modern hardware

Proceedings of the 13th International Conference on Extending Database Technology
Engineering a compressed suffix tree implementation

WEA'07 Proceedings of the 6th international conference on Experimental algorithms
STNR: A suffix tree based noise resilient algorithm for periodicity detection in time series databases

Applied Intelligence
I/O efficient algorithms for serial and parallel suffix tree construction

ACM Transactions on Database Systems (TODS)
On-line suffix tree construction with reduced branching

Journal of Discrete Algorithms
Clustering near-identical sequences for fast homology search

RECOMB'06 Proceedings of the 10th annual international conference on Research in Computational Molecular Biology
Efficient techniques on retrieving bio-information for active U-healthcare

Personal and Ubiquitous Computing
Periodic pattern analysis of non-uniformly sampled stock market data

Intelligent Data Analysis

Abstract

Mammalian genomes are typically 3 Gbps (gigabase pairs) in size. The largest public DNA database, NCBI (National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov), contains more than 20 Gbps. Suffix trees are widely acknowledged as a data structure that efficiently supports exact and approximate sequence matching queries, as well as repetitive structure finding, when they can reside in main memory. However, it has proven difficult to handle long DNA sequences using suffix trees because of the so-called memory bottleneck problem. The most space-efficient main-memory suffix tree construction algorithm takes nine hours and 45 GB of memory to index the human genome [19]. In this paper, we show that suffix trees for long DNA sequences can be efficiently constructed on disk using a small, bounded amount of main memory and, therefore, that all existing algorithms based on suffix trees can be used to handle long DNA sequences that cannot be held in main memory. We adopt a two-phase strategy to construct a suffix tree on disk: 1) construct a disk-based suffix tree without suffix links and 2) rebuild the suffix links on the disk-based suffix tree once it has been constructed, if needed. We propose a new disk-based suffix tree construction algorithm, called DynaCluster, which shows O(n log n) experimental behavior for CPU cost and linear behavior for I/O cost. DynaCluster needs only 16 MB of main memory to construct suffix trees for DNA sequences of more than 200 Mbps, and it significantly outperforms existing disk-based suffix tree construction algorithms that use prepartitioning techniques, in terms of both construction cost and query processing cost. We conducted extensive performance studies and report our findings in this paper.
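
To make the data structure concrete, the following is a minimal in-memory sketch of a suffix tree built without suffix links and then used for an exact matching query. It is not DynaCluster and does nothing on disk: it uses naive suffix-by-suffix insertion, which takes quadratic time in the worst case, and assumes the text ends with a unique terminator "$". The names Node, build_suffix_tree, and find_occurrences are hypothetical and chosen only for this illustration.

```python
class Node:
    """A suffix tree node; the edge entering it is labeled text[start:end]."""

    def __init__(self, start=0, end=0):
        self.start = start
        self.end = end
        self.children = {}        # first character of edge label -> child Node
        self.suffix_index = None  # set only on leaves


def build_suffix_tree(text):
    """Naive construction: insert every suffix from the root, no suffix links."""
    # Assumption: a single "$" terminator, so no suffix is a prefix of another.
    assert text.endswith("$") and text.count("$") == 1
    n = len(text)
    root = Node()
    for i in range(n):
        node, pos = root, i
        while True:
            child = node.children.get(text[pos])
            if child is None:
                # No edge starts with this character: hang a new leaf here.
                leaf = Node(pos, n)
                leaf.suffix_index = i
                node.children[text[pos]] = leaf
                break
            # Walk down the edge while it agrees with the suffix.
            k = child.start
            while k < child.end and text[k] == text[pos]:
                k += 1
                pos += 1
            if k == child.end:
                node = child      # edge fully matched, continue below it
                continue
            # Mismatch inside the edge: split it and add a new leaf.
            mid = Node(child.start, k)
            node.children[text[child.start]] = mid
            child.start = k
            mid.children[text[k]] = child
            leaf = Node(pos, n)
            leaf.suffix_index = i
            mid.children[text[pos]] = leaf
            break
    return root


def find_occurrences(root, text, pattern):
    """Return the sorted start positions of pattern in text."""
    node, j = root, 0
    while j < len(pattern):
        child = node.children.get(pattern[j])
        if child is None:
            return []
        k = child.start
        while k < child.end and j < len(pattern):
            if text[k] != pattern[j]:
                return []
            k += 1
            j += 1
        node = child
    # Every leaf below the matched position is one occurrence.
    hits, stack = [], [node]
    while stack:
        cur = stack.pop()
        if cur.suffix_index is not None:
            hits.append(cur.suffix_index)
        stack.extend(cur.children.values())
    return sorted(hits)


text = "GATTACATTA$"
root = build_suffix_tree(text)
print(find_occurrences(root, text, "TTA"))  # [2, 7]
print(find_occurrences(root, text, "GG"))   # []
```

The query walks at most one root-to-node path and then enumerates the leaves below it, which is why exact matching stays cheap once the tree exists. The sketch keeps every node in main memory, which is exactly the cost the abstract identifies as the bottleneck for long DNA sequences.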