Fast relative Lempel-Ziv self-index for similar sequences

Authors:
Huy Hoang Do;Jesper Jansson;Kunihiko Sadakane;Wing-Kin Sung
Affiliations:
National University of Singapore, COM 1, Singapore;Ochanomizu University, Tokyo, Japan;National Institute of Informatics, Tokyo, Japan;National University of Singapore, COM 1, Singapore
Venue:
FAW-AAIM'12 Proceedings of the 6th international Frontiers in Algorithmics, and Proceedings of the 8th international conference on Algorithmic Aspects in Information and Management
Year:
2012

Abstract

Recent advances in biotechnology and web technology are generating huge collections of similar strings. People now face the problem of storing them compactly while supporting fast pattern searching. One compression scheme called relative Lempel-Ziv compression uses textual substitutions from a reference text as follows: Given a (large) set S of strings, represent each string in S as a concatenation of substrings from a reference string R. This basic scheme gives a good compression ratio when every string in S is similar to R, but does not provide any pattern searching functionality. Here, we describe a new data structure that supports fast pattern searching.
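
To make the basic scheme concrete, the following is a minimal Python sketch of a greedy relative Lempel-Ziv factorization of one string against a reference R. It illustrates only the compression step described above, not the self-index proposed in the paper. The names `rlz_factorize` and `rlz_decode` are hypothetical, the literal fallback for characters absent from R is an assumption, and the naive substring search is chosen for readability rather than speed.

```python
from typing import List, Tuple, Union

# A factor is either (start position in R, length) or a single literal
# character. The literal case is an assumption of this sketch, used when
# a character of the input does not occur anywhere in R.
Factor = Union[Tuple[int, int], str]


def rlz_factorize(s: str, r: str) -> List[Factor]:
    """Greedily split s into the longest substrings that occur in r."""
    factors: List[Factor] = []
    i = 0
    while i < len(s):
        pos, length = -1, 0
        # Extend the current match one character at a time
        # (naive search; a suffix array or FM-index would be faster).
        while i + length < len(s):
            p = r.find(s[i:i + length + 1])
            if p < 0:
                break
            pos, length = p, length + 1
        if length == 0:
            factors.append(s[i])  # character not present in r
            i += 1
        else:
            factors.append((pos, length))
            i += length
    return factors


def rlz_decode(factors: List[Factor], r: str) -> str:
    """Rebuild the original string from its factors and the reference."""
    parts = []
    for f in factors:
        if isinstance(f, str):
            parts.append(f)
        else:
            pos, length = f
            parts.append(r[pos:pos + length])
    return "".join(parts)


if __name__ == "__main__":
    reference = "ACGTACGT"
    sequence = "ACGTTCGT"
    encoded = rlz_factorize(sequence, reference)
    print(encoded)
    print(rlz_decode(encoded, reference) == sequence)
```

When a string is similar to R, it decomposes into a few long factors, which is where the good compression ratio comes from. Searching for a pattern in this representation is harder because an occurrence may span a factor boundary, which is the gap the paper's data structure addresses.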