Relative Lempel-Ziv compression of genomes for large-scale storage and retrieval

Authors:
Shanika Kuruppu; Simon J. Puglisi; Justin Zobel
Affiliations:
National ICT Australia, Department of Computer Science & Software Engineering, University of Melbourne; School of Computer Science and Information Technology, Royal Melbourne Institute of Technology, Australia; National ICT Australia, Department of Computer Science & Software Engineering, University of Melbourne
Venue:
SPIRE'10 Proceedings of the 17th international conference on String processing and information retrieval
Year:
2010

References cited by this paper (6)

A compression algorithm for DNA sequences and its applications in genome comparison
RECOMB '00 Proceedings of the fourth annual international conference on Computational molecular biology

A Guaranteed Compression Scheme for Repetitive DNA Sequences
DCC '96 Proceedings of the Conference on Data Compression

Compression of Biological Sequences by Greedy Off-Line Textual Substitution
DCC '00 Proceedings of the Conference on Data Compression

Compressed full-text indexes
ACM Computing Surveys (CSUR)

A Simple Statistical Algorithm for Biological Sequence Compression
DCC '07 Proceedings of the 2007 Data Compression Conference

Human genomes as email attachments
Bioinformatics

Papers citing this paper (10)

Sample selection for dictionary-based corpus compression
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval

Self-indexing based on LZ77
CPM'11 Proceedings of the 22nd annual conference on Combinatorial pattern matching

Reference sequence construction for relative compression of genomes
SPIRE'11 Proceedings of the 18th international conference on String processing and information retrieval

Iterative Dictionary Construction for Compression of Large DNA Data Sets
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)

Relative Lempel-Ziv factorization for efficient storage and retrieval of web collections
Proceedings of the VLDB Endowment

A faster grammar-based self-index
LATA'12 Proceedings of the 6th international conference on Language and Automata Theory and Applications

Fast relative Lempel-Ziv self-index for similar sequences
FAW-AAIM'12 Proceedings of the 6th international Frontiers in Algorithmics, and Proceedings of the 8th international conference on Algorithmic Aspects in Information and Management

Optimized relative Lempel-Ziv compression of genomes
ACSC '11 Proceedings of the Thirty-Fourth Australasian Computer Science Conference - Volume 113

On compressing and indexing repetitive sequences
Theoretical Computer Science

FRESCO: Referential Compression of Highly Similar Sequences
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)

Abstract

Self-indexes, data structures that simultaneously provide fast search of and access to compressed text, are promising for genomic data, but in their usual form they cannot exploit the high level of replication present in a collection of related genomes. Our 'RLZ' approach is to store a self-index for a base sequence and then compress every other sequence as an LZ77 encoding relative to the base. For a collection of $r$ sequences totaling $N$ bases, with a total of $s$ point mutations from a base sequence of length $n$, this representation requires just $nH_k(T) + s\log n + s\log\frac{N}{s} + O(s)$ bits. At the cost of negligible extra space, access to $\ell$ consecutive symbols requires $O(\ell + \log n)$ time. Our experiments show that, for example, RLZ can represent individual human genomes in around 0.1 bits per base while supporting rapid access and using relatively little memory.
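
The idea of encoding a sequence relative to a base can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names `rlz_factorize` and `rlz_decode` are hypothetical, the factor format (position, length, mismatch character) is an assumed LZ77-style triple, and a naive quadratic longest-match search stands in for the self-index over the base sequence.

```python
def rlz_factorize(reference: str, target: str) -> list:
    """Greedily parse `target` into (position, length, next_char) factors.

    Each factor copies `length` symbols starting at `position` in
    `reference`, then appends one literal symbol `next_char`
    (empty if the target ends exactly at the end of the copy).
    """
    factors = []
    i = 0
    while i < len(target):
        best_pos, best_len = 0, 0
        # Naive longest-match search over every start in the reference;
        # a real system would query an index of the reference instead.
        for start in range(len(reference)):
            length = 0
            while (start + length < len(reference)
                   and i + length < len(target)
                   and reference[start + length] == target[i + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = start, length
        i += best_len
        # The symbol that broke the match (e.g. a point mutation), if any.
        next_char = target[i] if i < len(target) else ""
        factors.append((best_pos, best_len, next_char))
        if next_char:
            i += 1
    return factors


def rlz_decode(reference: str, factors: list) -> str:
    """Rebuild the target by copying from the reference and adding literals."""
    return "".join(reference[p:p + n] + c for p, n, c in factors)


# Toy example: the target differs from the reference by one point mutation.
reference = "ACGTACGTTGCA"
target = "ACGTACCTTGCA"
factors = rlz_factorize(reference, target)
restored = rlz_decode(reference, factors)
```

In this toy case a single point mutation splits the target into two factors, which matches the intuition behind the space bound: the cost of storing a sequence grows with the number of mutations rather than with its length.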