In many applications, sets of similar texts or sequences are of high importance. Prominent examples are revision histories of documents and genomic sequences. Modern high-throughput sequencing technologies generate DNA sequences at an ever-increasing rate. While the experimental time and cost needed to produce DNA sequences are decreasing, the computational requirements for analyzing and storing those sequences are increasing steeply. Compression is a key technology for dealing with this challenge. Recently, referential compression schemes, which store only the differences between a to-be-compressed input and a known reference sequence, have gained considerable interest in this field. In this paper, we propose a general open-source framework for compressing large amounts of biological sequence data, called Framework for REferential Sequence COmpression (FRESCO). Our basic compression algorithm is shown to be one to two orders of magnitude faster than comparable related work, while achieving similar compression ratios. We also propose two techniques that further increase compression ratios while retaining the advantage in speed: 1) selecting a good reference sequence; and 2) rewriting a reference sequence to allow for better compression. In addition, we propose a new way of further boosting compression ratios by applying referential compression to already referentially compressed files (second-order compression). This technique achieves compression ratios well beyond the state of the art, for instance, 4,000:1 and higher for human genomes. We evaluate our algorithms on a large data set from three different species (more than 1,000 genomes, more than 3 TB) and on a collection of versions of Wikipedia pages. Our results show that real-time compression of highly similar sequences at high compression ratios is possible on modern hardware.
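To make the idea of referential compression concrete, the following is a minimal Python sketch that encodes an input as a list of matches against a reference plus literal characters for the differences. It is an illustration only and not FRESCO's actual algorithm or file format: the seed length `K`, the function names `build_index`, `compress`, and `decompress`, and the `("M", pos, len)` / `("L", char)` entry layout are all assumptions made for this example.

```python
from collections import defaultdict

K = 4  # hypothetical seed length; a real tool would use a much larger value


def build_index(reference, k=K):
    """Map every k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def compress(text, reference, k=K):
    """Greedily encode text as reference matches and literal characters."""
    index = build_index(reference, k)
    entries = []
    i = 0
    while i < len(text):
        best_pos, best_len = -1, 0
        # Try every reference position that shares the next k-mer with the input.
        for pos in index.get(text[i:i + k], ()):
            length = 0
            while (i + length < len(text)
                   and pos + length < len(reference)
                   and text[i + length] == reference[pos + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = pos, length
        if best_len >= k:
            # Store only a pointer into the reference, not the characters.
            entries.append(("M", best_pos, best_len))
            i += best_len
        else:
            # No usable match: store the differing character itself.
            entries.append(("L", text[i]))
            i += 1
    return entries


def decompress(entries, reference):
    """Rebuild the original text from the entries and the same reference."""
    out = []
    for entry in entries:
        if entry[0] == "M":
            _, pos, length = entry
            out.append(reference[pos:pos + length])
        else:
            out.append(entry[1])
    return "".join(out)


reference = "ACGTACGTTTGACCA"
text = "ACGTACGATTGACCA"  # differs from the reference in one position
entries = compress(text, reference)
print(entries)  # [('M', 0, 7), ('L', 'A'), ('M', 8, 7)]
assert decompress(entries, reference) == text
```

In this toy case a 15-character input with a single substitution reduces to three entries. Because the output is itself a sequence, the same kind of encoder could in principle be applied again to the entry lists of several similar inputs, which is the intuition behind the second-order compression mentioned above; the sketch does not implement that step.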