An experimental study of an opportunistic index

Authors:
Paolo Ferragina;Giovanni Manzini
Affiliations:
Dipartimento di Informatica, Università di Pisa, Italy;Dipartimento di Scienze e Tecnologie Avanzate, Università del Piemonte Orientale, Alessandria, Italy and IMC-CNR, Pisa, Italy
Venue:
SODA '01 Proceedings of the twelfth annual ACM-SIAM symposium on Discrete algorithms
Year:
2001

Citing 11
Cited 39

A locally adaptive data compression scheme

Communications of the ACM
Bonsai: a compact representation of trees

Software—Practice & Experience
Suffix arrays: a new method for on-line string searches

SIAM Journal on Computing
Arithmetic coding for data compression

Communications of the ACM
Let sleeping files lie: pattern matching in Z-compressed files

Journal of Computer and System Sciences
The art of computer programming, volume 3: (2nd ed.) sorting and searching

The art of computer programming, volume 3: (2nd ed.) sorting and searching
Fast searching on compressed text allowing errors

Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
Fast algorithms for sorting and searching strings

SODA '97 Proceedings of the eighth annual ACM-SIAM symposium on Discrete algorithms
Managing gigabytes (2nd ed.): compressing and indexing documents and images

Managing gigabytes (2nd ed.): compressing and indexing documents and images
Compressed suffix arrays and suffix trees with applications to text indexing and string matching (extended abstract)

STOC '00 Proceedings of the thirty-second annual ACM symposium on Theory of computing
Opportunistic data structures with applications

FOCS '00 Proceedings of the 41st Annual Symposium on Foundations of Computer Science

An analysis of the Burrows-Wheeler transform

Journal of the ACM (JACM)
High-order entropy-compressed text indexes

SODA '03 Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms
Efficient Index Structures for String Databases

Proceedings of the 27th International Conference on Very Large Data Bases
A Fast Index for Semistructured Data

Proceedings of the 27th International Conference on Very Large Data Bases
Optimal Exact String Matching Based on Suffix Arrays

SPIRE 2002 Proceedings of the 9th International Symposium on String Processing and Information Retrieval
Indexing Text Using the Ziv-Lempel Trie

SPIRE 2002 Proceedings of the 9th International Symposium on String Processing and Information Retrieval
Trade Off Between Compression and Search Times in Compact Suffix Array

ALENEX '01 Revised Papers from the Third International Workshop on Algorithm Engineering and Experimentation
A Compressed Breadth-First Search for Satisfiability

ALENEX '02 Revised Papers from the 4th International Workshop on Algorithm Engineering and Experiments
Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications

CPM '01 Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching
Engineering a Lightweight Suffix Array Construction Algorithm

ESA '02 Proceedings of the 10th Annual European Symposium on Algorithms
Searching BWT Compressed Text with the Boyer-Moore Algorithm and Binary Search

DCC '02 Proceedings of the Data Compression Conference
Compact suffix array: a space-efficient full-text index

Fundamenta Informaticae - Special issue on computing patterns in strings
When indexing equals compression: experiments with compressing suffix arrays and applications

SODA '04 Proceedings of the fifteenth annual ACM-SIAM symposium on Discrete algorithms
Replacing suffix trees with enhanced suffix arrays

Journal of Discrete Algorithms - SPIRE 2002
Indexing text using the Ziv-Lempel trie

Journal of Discrete Algorithms - SPIRE 2002
New text indexing functionalities of the compressed suffix arrays

Journal of Algorithms
Resolution cannot polynomially simulate compressed-BFS

Annals of Mathematics and Artificial Intelligence
A categorization theorem on suffix arrays with applications to space efficient text indexes

SODA '05 Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms
Pattern Matching in LZW Compressed Files

IEEE Transactions on Computers
Suffix arrays: what are they good for?

ADC '06 Proceedings of the 17th Australasian Database Conference - Volume 49
Succinct suffix arrays based on run-length encoding

Nordic Journal of Computing
When indexing equals compression: Experiments with compressing suffix arrays and applications

ACM Transactions on Algorithms (TALG)
Compressed full-text indexes

ACM Computing Surveys (CSUR)
Engineering succinct DOM

EDBT '08 Proceedings of the 11th international conference on Extending database technology: Advances in database technology
Implementing the LZ-index: Theory versus practice

Journal of Experimental Algorithmics (JEA)
Compressed text indexes: From theory to practice

Journal of Experimental Algorithmics (JEA)
Cell probe lower bounds for succinct data structures

SODA '09 Proceedings of the twentieth Annual ACM-SIAM Symposium on Discrete Algorithms
Dependability Improvement for PPM Compressed Data by Using Compression Pattern Matching

IEICE - Transactions on Information and Systems
Efficient construction of FM-index using overlapping block processing for large scale texts

ECIR'07 Proceedings of the 29th European conference on IR research
An experimental study of compressed indexing and local alignments of DNA

COCOA'07 Proceedings of the 1st international conference on Combinatorial optimization and applications
Indexing similar DNA sequences

AAIM'10 Proceedings of the 6th international conference on Algorithmic aspects in information and management
UASMAs (universal automated SNP mapping algorithms): a set of algorithms to instantaneously map SNPs in real time to aid functional SNP discovery

Proceedings of the VLDB Endowment
Space-efficient construction of LZ-index

ISAAC'05 Proceedings of the 16th international conference on Algorithms and Computation
A new compressed suffix tree supporting fast search and its construction algorithm using optimal working space

CPM'05 Proceedings of the 16th annual conference on Combinatorial Pattern Matching
Succinct suffix arrays based on run-length encoding

CPM'05 Proceedings of the 16th annual conference on Combinatorial Pattern Matching
Efficient implementation of rank and select functions for succinct representation

WEA'05 Proceedings of the 4th international conference on Experimental and Efficient Algorithms
Compact Suffix Array — A Space-Efficient Full-Text Index

Fundamenta Informaticae - Computing Patterns in Strings
Succinct multibit tree: compact representation of multibit trees by using succinct data structures in chemical fingerprint searches

WABI'12 Proceedings of the 12th international conference on Algorithms in Bioinformatics
Hardware acceleration of genetic sequence alignment

ARC'13 Proceedings of the 9th international conference on Reconfigurable Computing: architectures, tools, and applications

Abstract

The size of electronic data is currently growing at a faster rate than computer memory and disk storage capacities. For this reason compression is always an attractive choice, if not a mandatory one. However, space is not the only resource to be optimized when managing large data collections; in fact, data turn out to be useful only when properly indexed to support search operations that efficiently extract the user-requested information.

Approaches that combine compression and indexing techniques are receiving more and more attention. A first step towards the design of a compressed full-text index achieving guaranteed performance in the worst case was recently taken in [10]. This index combines the compression algorithm proposed by Burrows and Wheeler [5] with the suffix array data structure [16]. The index is opportunistic in that it takes advantage of the compressibility of the input data by decreasing the space occupancy at no significant asymptotic slowdown in the query performance.

In this paper we present an implementation of this index, the FM-index, and perform an extensive set of experiments on various text collections. The experiments show that our index is compact (its space occupancy is close to that achieved by the best known compressors), that it is fast in counting the number of pattern occurrences, and that the cost of retrieving the occurrences is reasonable when they are few (i.e., in the case of a selective query). In addition, our experiments show that the FM-index is flexible, in that it is possible to trade space occupancy for search time by choosing the amount of auxiliary information stored in it.
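
To make the idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation, of counting pattern occurrences by backward search over the Burrows-Wheeler transform. The class name ToyFMIndex, the step parameter, and the checkpoint layout are assumptions introduced here for illustration only. The toy keeps the BWT and the suffix array uncompressed, whereas the index studied in the paper stores them in compressed form.

```python
class ToyFMIndex:
    """Illustrative, uncompressed toy index (hypothetical names throughout)."""

    def __init__(self, text, step=4):
        if "\0" in text:
            raise ValueError("text must not contain the sentinel character")
        if step < 1:
            raise ValueError("step must be at least 1")
        s = text + "\0"  # sentinel, smaller than every other character
        # Naive suffix sorting; fine for a toy, far too slow for real texts.
        self.sa = sorted(range(len(s)), key=lambda i: s[i:])
        # BWT: the character preceding each suffix in sorted order.
        self.bwt = "".join(s[i - 1] for i in self.sa)
        self.step = step

        # C[c] = number of characters in s that are strictly smaller than c.
        self.C = {}
        total = 0
        for c in sorted(set(self.bwt)):
            self.C[c] = total
            total += self.bwt.count(c)

        # Auxiliary information: per-character counts stored every `step`
        # positions. A smaller step uses more space but needs less scanning.
        self.checkpoints = []
        counts = dict.fromkeys(self.C, 0)
        for i, ch in enumerate(self.bwt):
            if i % step == 0:
                self.checkpoints.append(dict(counts))
            counts[ch] += 1

    def occ(self, c, i):
        """Number of occurrences of c in bwt[:i]."""
        block = min(i // self.step, len(self.checkpoints) - 1)
        start = block * self.step
        return self.checkpoints[block][c] + self.bwt.count(c, start, i)

    def _range(self, pattern):
        """Half-open suffix array range of suffixes prefixed by pattern."""
        lo, hi = 0, len(self.bwt)
        for c in reversed(pattern):  # backward search: last character first
            if c not in self.C:
                return 0, 0
            lo = self.C[c] + self.occ(c, lo)
            hi = self.C[c] + self.occ(c, hi)
            if lo >= hi:
                return 0, 0
        return lo, hi

    def count(self, pattern):
        """Number of occurrences of a non-empty pattern in the text."""
        lo, hi = self._range(pattern)
        return hi - lo

    def locate(self, pattern):
        """Sorted starting positions of a non-empty pattern in the text."""
        lo, hi = self._range(pattern)
        # The toy reads positions from the full suffix array directly.
        return sorted(self.sa[lo:hi])


if __name__ == "__main__":
    index = ToyFMIndex("mississippi", step=4)
    print(index.count("ssi"), index.locate("ssi"))
```

In this sketch, counting touches only the BWT and the checkpoints, one step per pattern character, while locating needs position information for every occurrence. The step parameter mimics, in a simplified way, the choice of how much auxiliary information to store: the answers are the same for any step, only the space and scanning cost change.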