Compressed text indexes: From theory to practice

Authors:
Paolo Ferragina; Rodrigo González; Gonzalo Navarro; Rossano Venturini
Affiliations:
University of Pisa; University of Chile; University of Chile; University of Pisa
Venue:
Journal of Experimental Algorithmics (JEA)
Year:
2009

Abstract

A compressed full-text self-index represents a text in compressed form and still answers queries efficiently. This is a significant advance over the (full-)text indexing techniques of the previous decade, whose indexes required several times the size of the text. Although it is relatively new, this algorithmic technology has matured to the point where theoretical research is giving way to practical developments. Nonetheless, putting it into practice requires significant programming skills, a deep engineering effort, and a strong algorithmic background to dig into the research results. To date, only isolated implementations and focused comparisons of compressed indexes have been reported, and they lacked a common API, which prevented their reuse or deployment within other applications. The goal of this article is to fill this gap. First, we present the existing implementations of compressed indexes from a practitioner's point of view. Second, we introduce the Pizza&Chili site, which offers tuned implementations and a standardized API for the most successful compressed full-text self-indexes, together with effective test-beds and scripts for their automatic validation and testing. Third, we show the results of our extensive experiments on these codes, with the aim of demonstrating the practical relevance of this novel algorithmic technology.
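
To make the notion of a self-index concrete, the following is a minimal sketch in the style of the FM-index family: it counts and locates pattern occurrences by backward search over the Burrows-Wheeler transform of the text. It is an illustration under stated assumptions, not the code or the API offered by the Pizza&Chili site. The names ToyFMIndex, count, and locate are hypothetical, the suffix array is built by naive sorting, and nothing is actually compressed (a real index would replace the plain tables with compressed rank structures and a sampled suffix array).

```python
class ToyFMIndex:
    """Uncompressed toy illustration of a self-index (hypothetical class)."""

    def __init__(self, text, terminator="\0"):
        # Assumption: the terminator does not occur in the text and
        # sorts before every other character.
        assert terminator not in text
        s = text + terminator
        n = len(s)
        # Suffix array by naive sorting (acceptable for a toy example only).
        self.sa = sorted(range(n), key=lambda i: s[i:])
        # Burrows-Wheeler transform: the character preceding each sorted suffix.
        self.bwt = [s[i - 1] for i in self.sa]
        # C[c] = number of characters in s that are strictly smaller than c.
        self.C = {}
        total = 0
        for c in sorted(set(s)):
            self.C[c] = total
            total += s.count(c)
        # occ[c][i] = number of occurrences of c in bwt[:i].
        self.occ = {c: [0] * (n + 1) for c in self.C}
        for i, b in enumerate(self.bwt):
            for c in self.C:
                self.occ[c][i + 1] = self.occ[c][i] + (1 if b == c else 0)

    def _range(self, pattern):
        """Backward search: suffix array interval [lo, hi) of the pattern."""
        lo, hi = 0, len(self.bwt)
        for c in reversed(pattern):
            if c not in self.C:
                return 0, 0
            lo = self.C[c] + self.occ[c][lo]
            hi = self.C[c] + self.occ[c][hi]
            if lo >= hi:
                return 0, 0
        return lo, hi

    def count(self, pattern):
        """Number of occurrences of pattern in the text."""
        lo, hi = self._range(pattern)
        return hi - lo

    def locate(self, pattern):
        """Sorted starting positions of pattern in the text."""
        lo, hi = self._range(pattern)
        return sorted(self.sa[lo:hi])


idx = ToyFMIndex("abracadabra")
print(idx.count("abra"))   # 2
print(idx.locate("abra"))  # [0, 7]
```

Note that count never touches the original text: the pattern is matched against the transformed representation alone, which is what allows a self-index to replace the text rather than accompany it.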