Genome-scale disk-based suffix tree indexing

Authors:
Benjarath Phoophakdee; Mohammed J. Zaki
Affiliations:
Rensselaer Polytechnic Institute, Troy, NY; Rensselaer Polytechnic Institute, Troy, NY
Venue:
Proceedings of the 2007 ACM SIGMOD international conference on Management of data
Year:
2007

Abstract

With the exponential growth of biological sequence databases, it has become critical to develop effective techniques for storing, querying, and analyzing these massive datasets. Suffix trees are widely used to solve many sequence-based problems, and they can be built in linear time and space, provided the resulting tree fits in main memory. To index larger sequences, several external suffix tree algorithms have been proposed in recent years. However, they suffer from several problems, such as susceptibility to data skew, inability to scale to genome-scale sequences, and the absence of suffix links, which are crucial in various suffix-tree-based algorithms. In this paper, we target DNA sequences and propose a novel disk-based suffix tree algorithm called TRELLIS, which effectively scales up to genome-scale sequences. Specifically, it can index the entire human genome using 2 GB of memory in about 4 hours, and can recover all its suffix links within 2 hours. TRELLIS was compared to various state-of-the-art persistent disk-based suffix tree construction algorithms and was shown to outperform the best previous methods in terms of both indexing time and querying time.
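To make concrete what a suffix-based index offers for sequence queries, here is a minimal sketch of a naive in-memory suffix trie over a short DNA string. It is not TRELLIS and not a linear-time construction: it inserts every suffix character by character, so it needs quadratic time and space and has no suffix links or disk layout. The names `Node`, `build_suffix_trie`, and `find_occurrences` are hypothetical and chosen only for illustration.

```python
class Node:
    """One trie node (illustrative; not the paper's data structure)."""
    __slots__ = ("children", "positions")

    def __init__(self):
        self.children = {}   # next character -> child Node
        self.positions = []  # start offsets of suffixes that pass through this node


def build_suffix_trie(text):
    """Insert every suffix of text. Quadratic time and space; for illustration only."""
    root = Node()
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.children.setdefault(ch, Node())
            node.positions.append(start)
    return root


def find_occurrences(root, pattern):
    """Return the sorted start offsets of pattern in the indexed text.

    An empty pattern returns an empty list in this sketch.
    """
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return []
    return sorted(node.positions)


if __name__ == "__main__":
    index = build_suffix_trie("GATTACA")
    print(find_occurrences(index, "A"))
    print(find_occurrences(index, "TA"))
    print(find_occurrences(index, "GG"))
```

Once the index exists, a query walks at most one node per pattern character, independent of the text length. That property is what motivates building such indexes for very long sequences, and the quadratic footprint of this toy version shows why compact, disk-aware constructions are needed at genome scale.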