Iterative Dictionary Construction for Compression of Large DNA Data Sets

  • Authors:
  • Shanika Kuruppu; Bryan Beresford-Smith; Thomas Conway; Justin Zobel

  • Affiliations:
  • The University of Melbourne, Parkville; National ICT Australia, Parkville; National ICT Australia, Parkville; The University of Melbourne, Parkville

  • Venue:
  • IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
  • Year:
  • 2012

Abstract

Genomic repositories increasingly include individual as well as reference sequences, which tend to share long identical and near-identical strings of nucleotides. However, the sequential processing used by most compression algorithms, and the volumes of data involved, mean that these long-range repetitions go undetected. An order-insensitive, disk-based dictionary construction method can detect this repeated content and use it to compress collections of sequences. We explore a dictionary construction method that improves repeat identification in large DNA data sets. Comrad, our adaptation of an existing disk-based method, identifies exact repeated content in collections of sequences, exploiting similarities both within and across the input sequences. Comrad compresses the data over multiple passes; although this multipass processing is expensive, it allows Comrad to compress large data sets within reasonable time and space. Comrad also supports random access to individual sequences and subsequences without decompressing the whole data set. Comrad has no competitor in terms of the size of data sets it can compress (extending to many hundreds of gigabytes), and even for smaller data sets its results are competitive with alternatives; for example, 39 S. cerevisiae genomes compressed to 0.25 bits per base.
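
The abstract describes the method only at a high level. As a rough illustration of the general idea of multipass dictionary construction, the Python sketch below is a toy model, not Comrad's actual algorithm: the function names, parameters, fixed gram length, and greedy replacement strategy are all assumptions made for illustration. Each pass counts repeated substrings across the whole collection (order-insensitively), assigns a fresh dictionary symbol to every substring seen often enough, and rewrites the sequences, so later passes can assemble longer repeats out of earlier symbols.

```python
from collections import Counter

def build_dictionary(sequences, k=4, min_freq=2, max_passes=8):
    """Toy multipass dictionary construction (illustration only; not
    Comrad's actual algorithm). Each pass counts length-k token grams
    across the whole collection, gives every gram seen >= min_freq
    times a fresh nonterminal symbol, and rewrites the sequences."""
    dictionary = {}                      # nonterminal -> tuple of tokens
    next_id = 0
    seqs = [list(s) for s in sequences]  # tokens start as single bases
    for _ in range(max_passes):
        counts = Counter()
        for seq in seqs:
            for i in range(len(seq) - k + 1):
                counts[tuple(seq[i:i + k])] += 1
        frequent = {g for g, c in counts.items() if c >= min_freq}
        if not frequent:
            break                        # no repeated content left
        symbols = {}
        for gram in frequent:            # assign fresh nonterminals
            symbols[gram] = ('N', next_id)
            dictionary[('N', next_id)] = gram
            next_id += 1
        rewritten = []
        for seq in seqs:                 # greedy non-overlapping rewrite
            out, i = [], 0
            while i < len(seq):
                gram = tuple(seq[i:i + k])
                if gram in symbols:
                    out.append(symbols[gram])
                    i += k
                else:
                    out.append(seq[i])
                    i += 1
            rewritten.append(out)
        seqs = rewritten
    return dictionary, seqs

def expand(token, dictionary):
    """Recursively expand one token back to its underlying bases,
    touching only the dictionary entries that token depends on."""
    if token in dictionary:
        return ''.join(expand(t, dictionary) for t in dictionary[token])
    return token

if __name__ == "__main__":
    d, enc = build_dictionary(["ACGTACGTACGTACGT", "TTACGTACGTACGTAA"])
    # Round-trip: expanding every token of a sequence recovers the input.
    print([''.join(expand(t, d) for t in seq) for seq in enc])
```

Because each compressed sequence is just a string of symbols plus a shared dictionary, a single sequence (or a slice of one) can be expanded without touching the rest of the collection, which loosely mirrors the random-access property the abstract claims for Comrad.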