Engineering a compressed suffix tree implementation

Authors:
N. Välimäki;V. Mäkinen;W. Gerlach;K. Dixit
Affiliations:
University of Helsinki, Helsinki, Finland;University of Helsinki, Helsinki, Finland;Bielefeld University, AG Genominformatik, Bielefeld;IIT Kanpur, New Delhi, India
Venue:
Journal of Experimental Algorithmics (JEA)
Year:
2010

Citing 26:
Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing
Algorithms on strings, trees, and sequences: computer science and computational biology
Compact pat trees
A Space-Economical Suffix Tree Construction Algorithm. Journal of the ACM (JACM)
Space efficient suffix trees. Journal of Algorithms
An analysis of the Burrows-Wheeler transform. Journal of the ACM (JACM)
High-order entropy-compressed text indexes. SODA '03 Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms
The LCA Problem Revisited. LATIN '00 Proceedings of the 4th Latin American Symposium on Theoretical Informatics
Tables. Proceedings of the 16th Conference on Foundations of Software Technology and Theoretical Computer Science
Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications. CPM '01 Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching
Space-Economical Algorithms for Finding Maximal Unique Matches. CPM '02 Proceedings of the 13th Annual Symposium on Combinatorial Pattern Matching
Breaking a Time-and-Space Barrier in Constructing Full-Text Indices. FOCS '03 Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science
Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms - SPIRE 2002
Indexing text using the Ziv-Lempel trie. Journal of Discrete Algorithms - SPIRE 2002
Dictionary matching and indexing with errors and don't cares. STOC '04 Proceedings of the thirty-sixth annual ACM symposium on Theory of computing
Constructing Suffix Tree for Gigabyte Sequences with Megabyte Memory. IEEE Transactions on Knowledge and Data Engineering
Indexing compressed text. Journal of the ACM (JACM)
Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching. SIAM Journal on Computing
Succinct suffix arrays based on run-length encoding. Nordic Journal of Computing
Compressed full-text indexes. ACM Computing Surveys (CSUR)
Rank and select revisited and extended. Theoretical Computer Science
Compressed Suffix Trees with Full Functionality. Theory of Computing Systems
Dynamic entropy-compressed sequences and full-text indexes. ACM Transactions on Algorithms (TALG)
Space-efficient static trees and graphs. SFCS '89 Proceedings of the 30th Annual Symposium on Foundations of Computer Science
Linear pattern matching algorithms. SWAT '73 Proceedings of the 14th Annual Symposium on Switching and Automata Theory
Obtaining provably good performance from suffix trees in secondary storage. CPM'06 Proceedings of the 17th Annual conference on Combinatorial Pattern Matching

Cited 7:
CST++. SPIRE'10 Proceedings of the 17th international conference on String processing and information retrieval
Inverted indexes for phrases and strings. Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Scalable detection of frequent substrings by grammar-based compression. DS'11 Proceedings of the 14th international conference on Discovery science
Efficient Maximal Repeat Finding Using the Burrows-Wheeler Transform and Wavelet Tree. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Exploiting SIMD instructions in current processors to improve classical string algorithms. ADBIS'12 Proceedings of the 16th East European conference on Advances in Databases and Information Systems
RCSI: scalable similarity search in thousand(s) of genomes. Proceedings of the VLDB Endowment
A Compressed Suffix Tree Based Implementation With Low Peak Memory Usage. Electronic Notes in Theoretical Computer Science (ENTCS)

Abstract

The suffix tree is one of the most important data structures in string algorithms and biological sequence analysis. Unfortunately, when these algorithms are implemented and applied to real genomic sequences, main memory size often becomes the bottleneck. The reason is simple: a DNA sequence of length n over the alphabet Σ = {A,C,G,T} can be stored in n log |Σ| = 2n bits, whereas its suffix tree occupies O(n log n) bits. In practice, the size difference easily reaches a factor of 50. We report on an implementation of the compressed suffix tree very recently proposed by Sadakane (2007). The compressed suffix tree occupies space proportional to the text size, that is, O(n log |Σ|) bits, and supports all typical suffix tree operations with at most a log n factor slowdown. Our experiments show that, for example, on a 10 MB DNA sequence, the compressed suffix tree takes 10% of the space of the normal suffix tree, while a representative algorithm is slowed down by a factor of 30. Our implementation follows the original proposal in spirit, but some internal parts are tailored toward practical implementation. Our construction algorithm runs in O(n log n log |Σ|) time and uses nearly the same space as the final structure while constructing it: on the 10 MB DNA sequence, the maximum space usage during construction is only 1.5 times the size of the final structure. As by-products, we develop a method to create the Succinct Suffix Array directly from the Burrows-Wheeler transform and a space-efficient version of the suffixes-insertion algorithm to build the balanced parentheses representation of the suffix tree from LCP information.
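
To make the last by-product concrete, here is a minimal Python sketch of deriving a balanced parentheses encoding of a suffix tree from LCP values alone. It is a naive stack-based illustration, not the space-efficient algorithm of the paper: it keeps whole subtrees as strings in memory, and the suffix array and LCP array are built by brute force. The function names `suffix_array_and_lcp` and `balanced_parentheses` are hypothetical, and the text is assumed to end with a unique terminator (here `$`) that sorts before every other symbol.

```python
LEAF = float("inf")  # leaves are treated as deeper than any LCP value


def suffix_array_and_lcp(text):
    """Brute-force suffix array and LCP array (illustration only)."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    lcp = [0] * len(sa)  # lcp[k] = common prefix length of suffixes k-1 and k
    for k in range(1, len(sa)):
        a, b = text[sa[k - 1]:], text[sa[k]:]
        h = 0
        while h < min(len(a), len(b)) and a[h] == b[h]:
            h += 1
        lcp[k] = h
    return sa, lcp


def balanced_parentheses(lcp):
    """Insert suffixes in lexicographic order; each node becomes '(' ... ')'."""
    # Each stack entry is [string depth, encodings of children closed so far].
    stack = [[0, []]]  # the root has string depth 0
    for h in lcp:
        # Close every node on the rightmost path that is deeper than h.
        while stack[-1][0] > h:
            _, parts = stack.pop()
            closed = "(" + "".join(parts) + ")"
            if stack[-1][0] < h:
                # No node of depth h exists yet: create one adopting the subtree.
                stack.append([h, [closed]])
            else:
                stack[-1][1].append(closed)
        stack.append([LEAF, []])  # the new leaf for the current suffix
    # Close whatever remains on the rightmost path.
    while len(stack) > 1:
        _, parts = stack.pop()
        stack[-1][1].append("(" + "".join(parts) + ")")
    return "(" + "".join(stack[0][1]) + ")"


sa, lcp = suffix_array_and_lcp("banana$")
print(lcp)                        # [0, 0, 1, 3, 0, 0, 2]
print(balanced_parentheses(lcp))  # (()(()(()()))()(()()))
```

For "banana$" the tree has 7 leaves, the root, and 3 internal nodes, so the encoding uses 2 × 11 = 22 parentheses, that is, 2 bits per node if each parenthesis is stored as a single bit.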