Optimized relative Lempel-Ziv compression of genomes

Authors:
Shanika Kuruppu;Simon J. Puglisi;Justin Zobel
Affiliations:
The University of Melbourne, Parkville, Victoria;RMIT University, Melbourne, Victoria;The University of Melbourne, Parkville, Victoria
Venue:
ACSC '11 Proceedings of the Thirty-Fourth Australasian Computer Science Conference - Volume 113
Year:
2011

Abstract

High-throughput sequencing technologies make it possible to rapidly acquire large numbers of individual genomes, which, for a given organism, vary only slightly from one to another. Such large, highly repetitive sequence collections pose a unique challenge for compression. In previous work we described the RLZ algorithm, which greedily parses each genome into factors. Each factor is represented as a (position, length) pair that identifies the corresponding material in a reference genome. RLZ provides effective compression in a single pass over the collection, and the final compressed representation allows rapid random access to arbitrary substrings. In this paper we explore several improvements to the RLZ algorithm. We find that simple non-greedy parsings can significantly improve compression performance, and we discover a strong correlation between the starting positions of long factors and their positions in the reference. This property is computationally inexpensive to detect and can be exploited to improve compression by nearly 50% compared to the original RLZ encoding, while simultaneously providing faster decompression.
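The greedy parsing described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`longest_match`, `rlz_parse`, `rlz_decode`) are hypothetical, the longest-match search is a naive scan rather than an index over the reference, and the handling of symbols absent from the reference (emitting the symbol with length 0) is an assumption made only for this example.

```python
from typing import List, Tuple, Union

# A factor is (position_in_reference, length). As an assumption of this
# sketch, a symbol that never occurs in the reference is emitted as
# (symbol, 0).
Factor = Tuple[Union[int, str], int]


def longest_match(reference: str, sequence: str, start: int) -> Tuple[int, int]:
    """Return (position, length) of the longest prefix of sequence[start:]
    that occurs somewhere in reference; length is 0 if there is none."""
    best_pos, best_len = -1, 0
    length = 1
    while start + length <= len(sequence):
        # Naive search; a practical implementation would use an index
        # over the reference instead of repeated str.find calls.
        pos = reference.find(sequence[start:start + length])
        if pos == -1:
            break
        best_pos, best_len = pos, length
        length += 1
    return best_pos, best_len


def rlz_parse(reference: str, sequence: str) -> List[Factor]:
    """Greedily factorize sequence relative to reference."""
    factors: List[Factor] = []
    i = 0
    while i < len(sequence):
        pos, length = longest_match(reference, sequence, i)
        if length == 0:
            factors.append((sequence[i], 0))  # literal symbol
            i += 1
        else:
            factors.append((pos, length))
            i += length
    return factors


def rlz_decode(reference: str, factors: List[Factor]) -> str:
    """Rebuild the sequence by copying each factor from the reference."""
    parts = []
    for pos, length in factors:
        if length == 0:
            parts.append(pos)  # pos holds the literal symbol
        else:
            parts.append(reference[pos:pos + length])
    return "".join(parts)
```

For example, with the reference `"ACGTACGTTTGA"` and the sequence `"ACGTTTGANACGT"`, `rlz_parse` yields `[(4, 8), ("N", 0), (0, 4)]`, and `rlz_decode` applied to those factors reproduces the sequence. Because each factor is an independent copy from the reference, any substring can be recovered by decoding only the factors that overlap it, which is the basis of the random access the abstract mentions.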