Advantages of backward searching — efficient secondary memory and distributed implementation of compressed suffix arrays

Authors:
Veli Mäkinen;Gonzalo Navarro;Kunihiko Sadakane
Affiliations:
Dept of Computer Science, Univ of Helsinki, Finland;Center for Web Research, Dept of Computer Science, Univ of Chile, Chile;Dept of Computer Science and Communication Engineering, Kyushu Univ., Japan
Venue:
ISAAC'04 Proceedings of the 15th international conference on Algorithms and Computation
Year:
2004

Citing 17
Cited 11

A bridging model for parallel computation

Communications of the ACM
Suffix arrays: a new method for on-line string searches

SIAM Journal on Computing
Compact pat trees

Compact pat trees
A Space-Economical Suffix Tree Construction Algorithm

Journal of the ACM (JACM)
Compressed suffix arrays and suffix trees with applications to text indexing and string matching (extended abstract)

STOC '00 Proceedings of the thirty-second annual ACM symposium on Theory of computing
An analysis of the Burrows-Wheeler transform

Journal of the ACM (JACM)
Time-space trade-offs for compressed suffix arrays

Information Processing Letters
Succinct representations of lcp information and improvements in the compressed suffix arrays

SODA '02 Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms
High-order entropy-compressed text indexes

SODA '03 Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms
Compressed Text Databases with Efficient Query Algorithms Based on the Compressed Suffix Array

ISAAC '00 Proceedings of the 11th International Conference on Algorithms and Computation
Tables

Proceedings of the 16th Conference on Foundations of Software Technology and Theoretical Computer Science
Opportunistic data structures with applications

FOCS '00 Proceedings of the 41st Annual Symposium on Foundations of Computer Science
Succinct static data structures

Succinct static data structures
Compact suffix array: a space-efficient full-text index

Fundamenta Informaticae - Special issue on computing patterns in strings
When indexing equals compression: experiments with compressing suffix arrays and applications

SODA '04 Proceedings of the fifteenth annual ACM-SIAM symposium on Discrete algorithms
Indexing text using the Ziv-Lempel trie

Journal of Discrete Algorithms - SPIRE 2002
Linear pattern matching algorithms

SWAT '73 Proceedings of the 14th Annual Symposium on Switching and Automata Theory (swat 1973)

Indexing compressed text

Journal of the ACM (JACM)
Succinct suffix arrays based on run-length encoding

Nordic Journal of Computing
Compressed full-text indexes

ACM Computing Surveys (CSUR)
Compressed representations of sequences and full-text indexes

ACM Transactions on Algorithms (TALG)
Algorithms and data structures for external memory

Foundations and Trends® in Theoretical Computer Science
Run-Length Compressed Indexes Are Superior for Highly Repetitive Sequence Collections

SPIRE '08 Proceedings of the 15th International Symposium on String Processing and Information Retrieval
Compression, indexing, and retrieval for massive string data

CPM'10 Proceedings of the 21st annual conference on Combinatorial pattern matching
Parallel and distributed compressed indexes

CPM'10 Proceedings of the 21st annual conference on Combinatorial pattern matching
Inverted files versus suffix arrays for locating patterns in primary memory

SPIRE'06 Proceedings of the 13th international conference on String Processing and Information Retrieval
A Lempel-Ziv text index on secondary storage

CPM'07 Proceedings of the 18th annual conference on Combinatorial Pattern Matching
Compressed text indexes with fast locate

CPM'07 Proceedings of the 18th annual conference on Combinatorial Pattern Matching

Abstract

One of the most relevant succinct suffix array proposals in the literature is the Compressed Suffix Array (CSA) of Sadakane [ISAAC 2000]. The CSA needs n(H_0 + O(log log σ)) bits of space, where n is the text size, σ is the alphabet size, and H_0 is the zero-order entropy of the text. The number of occurrences of a pattern of length m can be computed in O(m log n) time. Most notably, the CSA does not need the text separately available to operate. The CSA simulates a binary search over the suffix array, where the query is compared against text substrings. These are extracted from the same CSA by following irregular access patterns over the structure. Sadakane [SODA 2002] has proposed using backward searching on the CSA in a similar fashion to the FM-index of Ferragina and Manzini [FOCS 2000]. He has shown that the CSA can be searched in O(m) time whenever σ = O(polylog(n)). In this paper we consider some other consequences of backward searching applied to the CSA. The most remarkable one is that we do not need, unlike all previous proposals, any complicated sub-linear structures based on the four-Russians technique (such as constant-time rank and select queries on bit arrays). We show that sampling and compression are enough to achieve O(m log n) query time using less space than the original structure. It is also possible to trade structure space for search time. Furthermore, the regular access pattern of backward searching permits an efficient secondary memory implementation, so that the search can be done with O(m log_B n) disk accesses, where B is the disk block size. Finally, it permits a distributed implementation with optimal speedup and negligible communication effort.
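
To make the idea of backward searching concrete, the following is a minimal sketch in Python, not the paper's implementation. It assumes a plain, uncompressed Ψ array built naively from a sorted list of suffixes, whereas the paper works with a sampled and compressed representation. The function names `build_psi` and `count_occurrences`, and the use of `"\0"` as a terminator, are assumptions made for illustration. The sketch processes the pattern from its last character to its first and, at each step, uses one binary search pair over the Ψ values of the suffixes starting with the current character, which gives O(m log n) counting time.

```python
from bisect import bisect_left


def build_psi(text):
    """Naively build Psi and the per-character suffix ranges for text + terminator.

    Illustrative only: sorting whole suffixes is far too slow and too large
    for real inputs, and a real CSA stores Psi in compressed form.
    """
    t = text + "\0"  # assumed terminator, smaller than every text character
    n = len(t)
    sa = sorted(range(n), key=lambda i: t[i:])  # suffix array
    rank = [0] * n
    for r, pos in enumerate(sa):
        rank[pos] = r
    # psi[i] is the suffix-array row of the suffix starting one position later.
    psi = [rank[(sa[i] + 1) % n] for i in range(n)]
    # ranges[c] is the half-open interval of rows whose suffix starts with c.
    ranges = {}
    for r, pos in enumerate(sa):
        c = t[pos]
        lo, _ = ranges.get(c, (r, r))
        ranges[c] = (lo, r + 1)
    return psi, ranges


def count_occurrences(pattern, psi, ranges):
    """Count occurrences of pattern by backward search over Psi."""
    sp, ep = 0, len(psi)  # half-open row interval matching the processed suffix
    for c in reversed(pattern):
        if c not in ranges:
            return 0
        lo, hi = ranges[c]
        # Psi is increasing inside [lo, hi), so the rows i with
        # sp <= psi[i] < ep form a contiguous block found by binary search.
        sp, ep = bisect_left(psi, sp, lo, hi), bisect_left(psi, ep, lo, hi)
        if sp >= ep:
            return 0
    return ep - sp
```

For example, after `psi, ranges = build_psi("abracadabra")`, calling `count_occurrences("abra", psi, ranges)` returns 2. Each step only touches the Ψ values of one character's range, which is the kind of regular access pattern the abstract refers to.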