Indexing compressed text

Authors:
Paolo Ferragina;Giovanni Manzini
Affiliations:
Università di Pisa, Pisa, Italy;Università del Piemonte Orientale, Alessandria, Italy
Venue:
Journal of the ACM (JACM)
Year:
2005

Abstract

We design two compressed data structures for the full-text indexing problem that support efficient substring searches using roughly the space required for storing the text in compressed form.

Our first compressed data structure retrieves the $occ$ occurrences of a pattern $P[1,p]$ within a text $T[1,n]$ in $O(p + occ \log^{1+\epsilon} n)$ time for any chosen $\epsilon$, $0 < \epsilon < 1$. This data structure uses at most $5nH_k(T) + o(n)$ bits of storage, where $H_k(T)$ is the $k$th order empirical entropy of $T$. The space usage is $\Theta(n)$ bits in the worst case and $o(n)$ bits for compressible texts. This data structure exploits the relationship between suffix arrays and the Burrows-Wheeler Transform, and can be regarded as a compressed suffix array.

Our second compressed data structure achieves $O(p + occ)$ query time using $O(nH_k(T) \log^{\epsilon} n) + o(n)$ bits of storage for any chosen $\epsilon$, $0 < \epsilon < 1$. Therefore, it provides optimal output-sensitive query time using $o(n \log n)$ bits in the worst case. This second data structure builds upon the first one and exploits the interplay between two compressors: the Burrows-Wheeler Transform and the LZ78 algorithm.
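The relationship between suffix arrays and the Burrows-Wheeler Transform mentioned above can be illustrated with a small, uncompressed sketch. It builds a naive suffix array, derives the BWT from it, and counts and locates pattern occurrences by scanning the pattern right to left over the BWT. The function names, the `"\0"` sentinel, and the linear-scan `occ` helper are assumptions made for illustration; the sketch does not attempt the compressed representation or the time and space bounds stated in the abstract.

```python
from collections import Counter

SENTINEL = "\0"  # assumed to be absent from the text and smaller than every symbol


def suffix_array(text):
    """Naive suffix array: suffix start positions in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_suffix_array(text, sa):
    """BWT[i] is the symbol that cyclically precedes the i-th smallest suffix."""
    return "".join(text[i - 1] for i in sa)


def backward_search(text, pattern):
    """Return (sa, lo, hi): suffixes sa[lo:hi] are exactly those prefixed by pattern."""
    t = text + SENTINEL
    sa = suffix_array(t)
    bwt = bwt_from_suffix_array(t, sa)

    # C[c] = number of symbols in t that are strictly smaller than c.
    C, total = {}, 0
    counts = Counter(t)
    for c in sorted(counts):
        C[c] = total
        total += counts[c]

    def occ(c, i):
        # Occurrences of c in bwt[:i]; a linear scan stands in for a rank structure.
        return bwt.count(c, 0, i)

    lo, hi = 0, len(t)  # half-open range of suffix-array rows
    for c in reversed(pattern):
        if c not in C:
            return sa, 0, 0
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:
            return sa, 0, 0
    return sa, lo, hi


def count_occurrences(text, pattern):
    _, lo, hi = backward_search(text, pattern)
    return hi - lo


def locate_occurrences(text, pattern):
    sa, lo, hi = backward_search(text, pattern)
    return sorted(sa[lo:hi])  # 0-based start positions in text
```

For example, `count_occurrences("mississippi", "ssi")` yields 2 and `locate_occurrences("mississippi", "ssi")` yields the 0-based positions `[2, 5]`. Each pattern symbol triggers two `occ` calls here, each a linear scan; a practical index would replace those scans with a precomputed rank structure over the BWT.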