Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching

Authors:
Roberto Grossi; Jeffrey Scott Vitter
Venue:
SIAM Journal on Computing
Year:
2005

Abstract

The proliferation of online text, such as that found on the World Wide Web and in online databases, motivates the need for space-efficient text indexing methods that support fast string searching. We model this scenario as follows: Consider a text $T$ consisting of $n$ symbols drawn from a fixed alphabet $\Sigma$. The text $T$ can be represented in $n \lg |\Sigma|$ bits by encoding each symbol with $\lg |\Sigma|$ bits. The goal is to support fast online queries for searching any string pattern $P$ of $m$ symbols, with $T$ being fully scanned only once, namely, when the index is created at preprocessing time. The text indexing schemes published in the literature are greedy in terms of space usage: they require $\Omega(n \lg n)$ additional bits of space in the worst case. For example, in the standard unit-cost RAM, suffix trees and suffix arrays need $\Omega(n)$ memory words, each of $\Omega(\lg n)$ bits. These indexes are larger than the text itself by a multiplicative factor of $\Omega(\lg_{|\Sigma|} n)$, which is significant when $\Sigma$ is of constant size, such as in \textsc{ascii} or \textsc{unicode}. On the other hand, these indexes support fast searching, either in $O(m \lg |\Sigma|)$ time or in $O(m + \lg n)$ time, plus an output-sensitive cost $O(\mathit{occ})$ for listing the $\mathit{occ}$ pattern occurrences. We present a new text index that is based upon compressed representations of suffix arrays and suffix trees. It achieves a fast $O(m / \lg_{|\Sigma|} n + \lg_{|\Sigma|}^{\epsilon} n)$ search time in the worst case, for any constant $0
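
To make the space comparison in the abstract concrete, the following is a minimal sketch of a plain, uncompressed suffix array, not the compressed index the paper proposes. The function names (build_suffix_array, find_occurrences, space_in_bits) are hypothetical and chosen only for illustration. Construction here is the naive sort of all suffixes, and the search is a simple binary search costing $O(m \lg n)$ character comparisons; the $O(m + \lg n)$ bound quoted above needs additional longest-common-prefix information that this sketch omits.

```python
import math


def build_suffix_array(text: str) -> list[int]:
    """Naive construction: sort suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """Return all start positions of pattern in text, in increasing order."""
    m = len(pattern)

    # Lower bound: first suffix whose m-symbol prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-symbol prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Every suffix in sa[start:lo] begins with pattern.
    return sorted(sa[start:lo])


def space_in_bits(n: int, sigma: int) -> tuple[int, int]:
    """Bits for the raw text versus bits for a plain suffix array."""
    text_bits = n * math.ceil(math.log2(sigma))  # n lg|Sigma|
    sa_bits = n * math.ceil(math.log2(n))        # n entries of lg n bits each
    return text_bits, sa_bits


if __name__ == "__main__":
    text = "abracadabra"
    sa = build_suffix_array(text)
    print(find_occurrences(text, sa, "abra"))
    print(space_in_bits(2**20, 4))
```

For a text of $n = 2^{20}$ symbols over a 4-letter alphabet, space_in_bits reports 2 bits per symbol for the text and 20 bits per entry for the suffix array, a factor of $\lg_{|\Sigma|} n = 10$, which is the overhead the abstract describes.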