Reference sequence construction for relative compression of genomes

Authors:
Shanika Kuruppu;Simon J. Puglisi;Justin Zobel
Affiliations:
Department of Computer Science & Software Engineering, University of Melbourne, Australia;Department of Informatics, King's College London, United Kingdom;Department of Computer Science & Software Engineering, University of Melbourne, Australia
Venue:
SPIRE'11 Proceedings of the 18th international conference on String processing and information retrieval
Year:
2011

References (9)

A new challenge for compression algorithms: genetic sequences. Information Processing and Management: an International Journal - Special issue: data compression
Data Compression Using Long Common Strings. DCC '99 Proceedings of the Conference on Data Compression
Offline Dictionary-Based Compression. DCC '99 Proceedings of the Conference on Data Compression
A Simple Statistical Algorithm for Biological Sequence Compression. DCC '07 Proceedings of the 2007 Data Compression Conference
Data structures and compression algorithms for genomic sequence data. Bioinformatics
Relative Lempel-Ziv compression of genomes for large-scale storage and retrieval. SPIRE'10 Proceedings of the 17th international conference on String processing and information retrieval
Self-indexing based on LZ77. CPM'11 Proceedings of the 22nd annual conference on Combinatorial pattern matching
Iterative Dictionary Construction for Compression of Large DNA Data Sets. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Optimized relative Lempel-Ziv compression of genomes. ACSC '11 Proceedings of the Thirty-Fourth Australasian Computer Science Conference - Volume 113

Cited by (1)

Fast relative Lempel-Ziv self-index for similar sequences. FAW-AAIM'12 Proceedings of the 6th international Frontiers in Algorithmics, and Proceedings of the 8th international conference on Algorithmic Aspects in Information and Management

Abstract

Relative compression, where a set of similar strings is compressed with respect to a reference string, is an effective method of compressing DNA datasets containing multiple similar sequences. Moreover, it supports rapid random access to the underlying data. The main difficulty of relative compression is selecting an appropriate reference sequence. In this paper, we explore using the dictionaries of repeats generated by the COMRAD, RE-PAIR and DNA-X algorithms as reference sequences for relative compression. We show that this technique improves compression and allows more general repetitive datasets to be compressed using relative compression.
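
To illustrate what compressing a string "with respect to a reference string" can look like, here is a minimal sketch of a greedy relative Lempel-Ziv style factorization. It is not the authors' implementation: the function names (`longest_match`, `rlz_encode`, `rlz_decode`), the factor layout, and the naive quadratic search are all assumptions made for illustration, and a practical system would typically search the reference with an index such as a suffix array.

```python
from typing import List, Tuple

# Hypothetical factor layout: (position in reference, match length, literal).
# A length of 0 means "emit the literal character instead of copying".
Factor = Tuple[int, int, str]


def longest_match(reference: str, target: str, start: int) -> Tuple[int, int]:
    """Return (position, length) of the longest prefix of target[start:]
    that occurs in reference. Naive scan over every reference position."""
    best_pos, best_len = 0, 0
    for pos in range(len(reference)):
        length = 0
        while (pos + length < len(reference)
               and start + length < len(target)
               and reference[pos + length] == target[start + length]):
            length += 1
        if length > best_len:
            best_pos, best_len = pos, length
    return best_pos, best_len


def rlz_encode(reference: str, target: str) -> List[Factor]:
    """Greedily split target into substrings copied from reference,
    falling back to a literal when a character is absent from it."""
    factors: List[Factor] = []
    i = 0
    while i < len(target):
        pos, length = longest_match(reference, target, i)
        if length == 0:
            factors.append((0, 0, target[i]))
            i += 1
        else:
            factors.append((pos, length, ""))
            i += length
    return factors


def rlz_decode(reference: str, factors: List[Factor]) -> str:
    """Rebuild the target by copying each factor out of the reference."""
    parts = []
    for pos, length, literal in factors:
        parts.append(literal if length == 0 else reference[pos:pos + length])
    return "".join(parts)
```

In this sketch, a target that shares long substrings with the reference is represented by few factors, which is why the choice of reference matters: the more of the collection's repeated content the reference contains, the fewer and longer the factors are likely to be.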