High-order entropy-compressed text indexes

Authors:
Roberto Grossi;Ankur Gupta;Jeffrey Scott Vitter
Affiliations:
Università di Pisa, Pisa;Center for Geometric and Biological Computing, Durham, NC;Purdue University, West Lafayette, IN
Venue:
SODA '03 Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms
Year:
2003

Abstract

We present a novel implementation of compressed suffix arrays exhibiting new tradeoffs between search time and space occupancy for a given text (or sequence) of n symbols over an alphabet Σ, where each symbol is encoded by lg |Σ| bits. We show that compressed suffix arrays use just nH_h + O(n lg lg n / lg_|Σ| n) bits, while retaining full text indexing functionalities, such as searching any pattern sequence of length m in O(m lg |Σ| + polylog(n)) time. The term H_h ≤ lg |Σ| denotes the hth-order empirical entropy of the text, which means that our index is nearly optimal in space apart from lower-order terms, achieving asymptotically the empirical entropy of the text (with a multiplicative constant 1). If the text is highly compressible so that H_h = o(1) and the alphabet size is small, we obtain a text index with o(m) search time that requires only o(n) bits. Further results and tradeoffs are reported in the paper.
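
To make the hth-order empirical entropy concrete, the following is a minimal Python sketch that computes it directly from its usual definition: group every symbol by the h symbols preceding it, take the zeroth-order entropy of each group, and average over the text length. The function name `empirical_entropy` is hypothetical and does not come from the paper, and the choice of preceding (rather than following) symbols as context is an assumption, since both conventions appear in the literature. The sketch only measures the entropy bound; it does not build a compressed suffix array.

```python
from collections import Counter, defaultdict
from math import log2


def empirical_entropy(text: str, h: int = 0) -> float:
    """Return the h-th order empirical entropy of text, in bits per symbol.

    Illustrative helper (not from the paper). Assumes the context of a
    symbol is the h symbols that precede it; the first h symbols, which
    have no full context, contribute nothing.
    """
    n = len(text)
    if n == 0:
        return 0.0

    # Map each length-h context to the counts of the symbols that follow it.
    followers = defaultdict(Counter)
    for i in range(h, n):
        followers[text[i - h:i]][text[i]] += 1

    # Sum, over all contexts, (number of followers) * (their zeroth-order entropy).
    total_bits = 0.0
    for counts in followers.values():
        size = sum(counts.values())
        total_bits -= sum(c * log2(c / size) for c in counts.values())

    return total_bits / n


if __name__ == "__main__":
    sample = "abracadabra"
    alphabet_bits = log2(len(set(sample)))  # lg |Σ| for this sample
    for order in range(3):
        print(order, empirical_entropy(sample, order), alphabet_bits)
```

On a strictly alternating text such as "abababab", every symbol is determined by its predecessor, so the first-order value drops to zero while the zeroth-order value stays at one bit per symbol. This is the kind of text for which the nH_h term in the space bound becomes small.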