Succinct suffix arrays based on run-length encoding

Authors:
Veli Mäkinen;Gonzalo Navarro
Affiliations:
Dept. of Computer Science, University of Helsinki, Finland;Dept. of Computer Science, University of Chile, Blanco Encalada, Santiago, Chile
Venue:
Nordic Journal of Computing
Year:
2005

Abstract

A succinct full-text self-index is a data structure built on a text T = t_1 t_2 ... t_n, which takes little space (ideally close to that of the compressed text), permits efficient search for the occurrences of a pattern P = p_1 p_2 ... p_m in T, and is able to reproduce any text substring, so the self-index replaces the text.

Several remarkable self-indexes have been developed in recent years. Many of those take space proportional to nH_0 or nH_k bits, where H_k is the kth order empirical entropy of T. The time to count how many times P occurs in T ranges from O(m) to O(m log n).

In this paper we present a new self-index, called RLFM index for "run-length FM-index", that counts the occurrences of P in T in O(m) time when the alphabet size is σ = O(polylog(n)). The RLFM index requires nH_k log σ + O(n) bits of space, for any k ≤ α log_σ n and constant 0 < α < 1. Previous indexes that achieve O(m) counting time either require more than nH_0 bits of space or require that σ = O(1). We also show that the RLFM index can be enhanced to locate occurrences in the text and display text substrings in time independent of σ.

In addition, we prove a close relationship between the kth order entropy of the text and some regularities that show up in its suffix array and in the Burrows-Wheeler transform of T. This relationship is of independent interest and permits bounding the space occupancy of the RLFM index, as well as that of other existing compressed indexes.

Finally, we present some practical considerations for implementing the RLFM index. We empirically compare our index against the best existing implementations and show that it is practical and competitive against those. In passing, we obtain a competitive implementation of an existing theoretical proposal that can be seen as a simplified RLFM index, and explore other practical ideas such as Huffman-shaped wavelet trees.
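
The two ingredients the abstract combines, runs in the Burrows-Wheeler transform and pattern counting in m steps, can be illustrated with a small uncompressed sketch. This is not the RLFM index itself: it stores the transform as a plain string, builds it by naive suffix sorting, and answers rank queries by scanning, so each step costs O(n) rather than O(1). The function names (bwt, count_runs, count_occurrences) and the choice of "\0" as end marker are assumptions made for this illustration, and the text is assumed not to contain "\0".

```python
from collections import Counter


def bwt(text, end="\0"):
    """Burrows-Wheeler transform via naive suffix sorting.

    Assumes `end` does not occur in `text` and sorts below every symbol.
    """
    s = text + end
    suffix_array = sorted(range(len(s)), key=lambda i: s[i:])
    # The BWT symbol for each suffix is the character preceding it
    # (s[-1] wraps around to the end marker for the suffix at 0).
    return "".join(s[i - 1] for i in suffix_array)


def count_runs(s):
    """Number of maximal runs of equal symbols in s."""
    return sum(1 for i in range(len(s)) if i == 0 or s[i] != s[i - 1])


def count_occurrences(bwt_str, pattern):
    """Count occurrences of pattern using backward search over the BWT.

    One iteration per pattern symbol; rank is computed by scanning here,
    whereas a compressed index would answer it in constant time.
    """
    counts = Counter(bwt_str)
    smaller = {}  # smaller[c] = number of symbols in the text less than c
    total = 0
    for c in sorted(counts):
        smaller[c] = total
        total += counts[c]

    sp, ep = 0, len(bwt_str)  # half-open range of suffix array rows
    for c in reversed(pattern):
        if c not in smaller:
            return 0
        sp = smaller[c] + bwt_str[:sp].count(c)
        ep = smaller[c] + bwt_str[:ep].count(c)
        if sp >= ep:
            return 0
    return ep - sp


if __name__ == "__main__":
    transformed = bwt("abracadabra")
    print(count_runs(transformed), "runs in a BWT of length", len(transformed))
    print(count_occurrences(transformed, "abra"), "occurrences of 'abra'")
```

On repetitive texts the number of runs is typically much smaller than the text length, which is the regularity a run-length based index would exploit; the sketch only makes that quantity visible and does not reproduce the paper's space or time bounds.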