Engineering a Fast Online Persistent Suffix Tree Construction

Authors:
Srikanta J. Bedathur; Jayant R. Haritsa
Venue:
ICDE '04 Proceedings of the 20th International Conference on Data Engineering
Year:
2004

Abstract

Online persistent suffix tree construction has been considered impractical due to its excessive I/O costs. However, prior studies have not taken into account the effects of the buffer management policy and the internal node structure of the suffix tree on the I/O behavior of construction and of subsequent retrievals over the tree. In this paper, we study these two issues in detail in the context of large genomic DNA and protein sequences. In particular, we make the following contributions: (i) a novel, low-overhead buffering policy called TOP-Q, which improves the on-disk behavior of suffix tree construction and subsequent retrievals, and (ii) empirical evidence that the space-efficient linked-list representation of suffix tree nodes provides significantly inferior performance when compared to the array representation. These results demonstrate that a careful choice of implementation strategies can make online persistent suffix tree construction considerably more scalable, in terms of the length of sequences indexed with a fixed memory budget, than currently perceived.
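
The contrast between the two node representations named in contribution (ii) can be illustrated with a minimal sketch. This is not the authors' implementation: the class names, the five-symbol alphabet, and the "nodes touched" counter are all assumptions made for illustration. The counter only stands in for the idea that each extra node visited may cost an extra page access when the tree lives on disk.

```python
ALPHABET = "ACGT$"  # assumed: DNA symbols plus a terminator
CODE = {ch: i for i, ch in enumerate(ALPHABET)}


class ArrayNode:
    """Hypothetical node with one child slot per alphabet symbol."""

    def __init__(self):
        self.children = [None] * len(ALPHABET)

    def add_child(self, symbol, node):
        self.children[CODE[symbol]] = node

    def find_child(self, symbol):
        # Direct slot lookup: only the parent node is touched.
        return self.children[CODE[symbol]], 1


class ListNode:
    """Hypothetical node with first-child and next-sibling pointers only."""

    def __init__(self, symbol=None):
        self.symbol = symbol
        self.first_child = None
        self.next_sibling = None

    def add_child(self, symbol, node):
        # Prepend the new child to the sibling chain.
        node.symbol = symbol
        node.next_sibling = self.first_child
        self.first_child = node

    def find_child(self, symbol):
        # Walk the sibling chain; each hop touches one more node.
        touched = 1  # the parent itself
        child = self.first_child
        while child is not None:
            touched += 1
            if child.symbol == symbol:
                return child, touched
            child = child.next_sibling
        return None, touched
```

In this sketch the array node spends a slot on every symbol, including absent ones, but locates a child in a single access. The linked-list node stores only the children that exist, but a lookup may walk the whole sibling chain. The abstract reports the array representation as the better performer empirically; the sketch shows only one plausible reason and makes no claim about the measured results.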