FRESCO: Referential Compression of Highly Similar Sequences

Authors:
Sebastian Wandelt; Ulf Leser
Affiliations:
Humboldt-University of Berlin, Berlin; Humboldt-University of Berlin, Berlin
Venue:
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Year:
2013

Abstract

In many applications, sets of similar texts or sequences are highly important. Prominent examples are revision histories of documents and genomic sequences. Modern high-throughput sequencing technologies generate DNA sequences at an ever-increasing rate. As the experimental time and cost of producing DNA sequences fall, the computational requirements for analyzing and storing them rise steeply. Compression is a key technology for dealing with this challenge. Recently, referential compression schemes, which store only the differences between a to-be-compressed input and a known reference sequence, have gained considerable interest in this field. In this paper, we propose a general open-source framework for compressing large amounts of biological sequence data, called Framework for REferential Sequence COmpression (FRESCO). Our basic compression algorithm is shown to be one to two orders of magnitude faster than comparable related work, while achieving similar compression ratios. We also propose two techniques that further increase compression ratios while retaining the advantage in speed: 1) selecting a good reference sequence; and 2) rewriting a reference sequence to allow for better compression. In addition, we propose a new way of further boosting compression ratios by applying referential compression to already referentially compressed files (second-order compression). This technique allows for compression ratios far beyond the state of the art, for instance, 4,000:1 and higher for human genomes. We evaluate our algorithms on a large data set from three different species (more than 1,000 genomes, more than 3 TB) and on a collection of versions of Wikipedia pages. Our results show that real-time compression of highly similar sequences at high compression ratios is possible on modern hardware.
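To make the idea of referential compression concrete, the following is a minimal sketch of storing an input as references into a known sequence plus raw literals for the differences. It is not FRESCO's actual algorithm or data format: the k-mer index, the greedy longest-match strategy, the entry representation, and all function names (`build_index`, `compress`, `decompress`) are assumptions chosen purely for illustration.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical entry format: a match is (reference_start, length),
# anything that cannot be matched is kept as a raw literal string.
Entry = Union[Tuple[int, int], str]


def build_index(reference: str, k: int) -> Dict[str, List[int]]:
    """Map every k-mer of the reference to the positions where it starts."""
    index: Dict[str, List[int]] = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)
    return index


def compress(text: str, reference: str, k: int = 4) -> List[Entry]:
    """Greedily encode `text` as matches against `reference` plus literals."""
    index = build_index(reference, k)
    entries: List[Entry] = []
    literal: List[str] = []
    pos = 0
    while pos < len(text):
        best_start, best_len = -1, 0
        # Try every reference position that shares the next k-mer.
        for start in index.get(text[pos:pos + k], []):
            length = 0
            while (pos + length < len(text)
                   and start + length < len(reference)
                   and text[pos + length] == reference[start + length]):
                length += 1
            if length > best_len:
                best_start, best_len = start, length
        if best_len >= k:
            if literal:  # flush pending unmatched characters first
                entries.append("".join(literal))
                literal = []
            entries.append((best_start, best_len))
            pos += best_len
        else:
            literal.append(text[pos])
            pos += 1
    if literal:
        entries.append("".join(literal))
    return entries


def decompress(entries: List[Entry], reference: str) -> str:
    """Rebuild the original text from the entries and the same reference."""
    parts: List[str] = []
    for entry in entries:
        if isinstance(entry, tuple):
            start, length = entry
            parts.append(reference[start:start + length])
        else:
            parts.append(entry)
    return "".join(parts)


if __name__ == "__main__":
    reference = "ACGTACGTTTGACCA"
    text = "ACGTACGATTGACCA"  # differs from the reference by one substitution
    entries = compress(text, reference)
    print(entries)
    print(decompress(entries, reference) == text)
```

In this toy example the single substitution splits the input into two long matches around one literal character, so the encoded form holds three small entries instead of fifteen characters. The same intuition suggests why the choice of reference matters: the more closely the reference resembles the input, the fewer and longer the matches become.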