Engineering a compressed suffix tree implementation

Authors:
Niko Välimäki;Wolfgang Gerlach;Kashyap Dixit;Veli Mäkinen
Affiliations:
Department of Computer Science, University of Helsinki, Finland;Technische Fakultät, Universität Bielefeld, Germany;Department of Computer Science and Engineering, Indian Institute of Technology, Kanpur, India;Department of Computer Science, University of Helsinki, Finland
Venue:
WEA'07 Proceedings of the 6th international conference on Experimental algorithms
Year:
2007

Citing 18

Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing
Algorithms on strings, trees, and sequences: computer science and computational biology
Space efficient suffix trees. Journal of Algorithms
High-order entropy-compressed text indexes. SODA '03 Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms
The LCA Problem Revisited. LATIN '00 Proceedings of the 4th Latin American Symposium on Theoretical Informatics
Tables. Proceedings of the 16th Conference on Foundations of Software Technology and Theoretical Computer Science
Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications. CPM '01 Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching
Space-Economical Algorithms for Finding Maximal Unique Matches. CPM '02 Proceedings of the 13th Annual Symposium on Combinatorial Pattern Matching
Breaking a Time-and-Space Barrier in Constructing Full-Text Indices. FOCS '03 Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science
Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms - SPIRE 2002
Indexing text using the Ziv-Lempel trie. Journal of Discrete Algorithms - SPIRE 2002
Constructing Suffix Tree for Gigabyte Sequences with Megabyte Memory. IEEE Transactions on Knowledge and Data Engineering
Indexing compressed text. Journal of the ACM (JACM)
Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching. SIAM Journal on Computing
Succinct suffix arrays based on run-length encoding. Nordic Journal of Computing
Compressed full-text indexes. ACM Computing Surveys (CSUR)
Compressed Suffix Trees with Full Functionality. Theory of Computing Systems
Dynamic entropy-compressed sequences and full-text indexes. CPM'06 Proceedings of the 17th Annual conference on Combinatorial Pattern Matching

Cited 4

Transforming MPI source code based on communication patterns. Future Generation Computer Systems
A Compressed Enhanced Suffix Array Supporting Fast String Matching. SPIRE '09 Proceedings of the 16th International Symposium on String Processing and Information Retrieval
Fully compressed suffix trees. ACM Transactions on Algorithms (TALG)
Practical compressed suffix trees. SEA'10 Proceedings of the 9th international conference on Experimental Algorithms

Abstract

The suffix tree is one of the most important data structures in string algorithms and biological sequence analysis. Unfortunately, when it comes to implementing those algorithms and applying them to real genomic sequences, main memory size often becomes the bottleneck. This is easily explained by the fact that while a DNA sequence of length n from alphabet Σ = {A, C, G, T} can be stored in n log |Σ| = 2n bits, its suffix tree occupies O(n log n) bits. In practice, the size difference easily reaches a factor of 50. We report on an implementation of the compressed suffix tree very recently proposed by Sadakane (Theory of Computing Systems, in press). The compressed suffix tree occupies space proportional to the text size, i.e., O(n log |Σ|) bits, and supports all typical suffix tree operations with at most a log n factor slowdown. Our experiments show that, e.g., on a 10 MB DNA sequence, the compressed suffix tree takes 10% of the space of a normal suffix tree. At the same time, a representative algorithm is slowed down by a factor of 30. Our implementation follows the original proposal in spirit, but some internal parts are tailored towards practical implementation. Our construction algorithm has time requirement O(n log n log |Σ|) and uses nearly the same space during construction as the final structure occupies: on the 10 MB DNA sequence, the maximum space usage during construction is only 1.4 times the final product size.
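
The n log |Σ| = 2n bit figure for DNA can be illustrated with a minimal sketch that packs four symbols into each byte. This is not taken from the paper's implementation: the function names and the particular assignment of 2-bit codes to A, C, G, T are assumptions made here for illustration only.

```python
# Illustrative sketch only; names and the code assignment are hypothetical.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}   # arbitrary but fixed 2-bit codes
BASE = "ACGT"                              # inverse mapping, indexed by code


def pack_dna(seq: str) -> bytes:
    """Pack a DNA string into 2 bits per symbol (4 symbols per byte)."""
    out = bytearray((len(seq) + 3) // 4)   # ceil(n / 4) bytes
    for i, ch in enumerate(seq):
        out[i // 4] |= CODE[ch] << (2 * (i % 4))
    return bytes(out)


def unpack_dna(packed: bytes, n: int) -> str:
    """Recover the first n symbols from a packed buffer."""
    return "".join(
        BASE[(packed[i // 4] >> (2 * (i % 4))) & 3] for i in range(n)
    )


def packed_bits(n: int, sigma: int = 4) -> int:
    """Bits needed for n symbols at ceil(log2(sigma)) bits each."""
    return n * max(1, (sigma - 1).bit_length())


if __name__ == "__main__":
    s = "GATTACA"
    p = pack_dna(s)
    print(len(p), "bytes for", len(s), "symbols")
    print(unpack_dna(p, len(s)) == s)
    print(packed_bits(len(s)), "bits")
```

The text length n has to be stored alongside the packed buffer, because the last byte may be only partly used.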