Efficient decoding of prefix codes
Communications of the ACM
A compression algorithm for DNA sequences and its applications in genome comparison
RECOMB '00 Proceedings of the fourth annual international conference on Computational molecular biology
General-purpose compression for efficient retrieval
Journal of the American Society for Information Science and Technology
A Guaranteed Compression Scheme for Repetitive DNA Sequences
DCC '96 Proceedings of the Conference on Data Compression
Significantly Lower Entropy Estimates for Natural DNA Sequences
DCC '97 Proceedings of the Conference on Data Compression
Offline Dictionary-Based Compression
DCC '99 Proceedings of the Conference on Data Compression
Compression of Biological Sequences by Greedy Off-Line Textual Substitution
DCC '00 Proceedings of the Conference on Data Compression
An efficient normalized maximum likelihood algorithm for DNA sequence compression
ACM Transactions on Information Systems (TOIS)
A Simple Statistical Algorithm for Biological Sequence Compression
DCC '07 Proceedings of the 2007 Data Compression Conference
Run-Length Compressed Indexes Are Superior for Highly Repetitive Sequence Collections
SPIRE '08 Proceedings of the 15th International Symposium on String Processing and Information Retrieval
Human genomes as email attachments
Bioinformatics
Storage and Retrieval of Individual Genomes
RECOMB 2'09 Proceedings of the 13th Annual International Conference on Research in Computational Molecular Biology
Self-indexed Text Compression Using Straight-Line Programs
MFCS '09 Proceedings of the 34th International Symposium on Mathematical Foundations of Computer Science 2009
LZ77-Like Compression with Fast Random Access
DCC '10 Proceedings of the 2010 Data Compression Conference
Compressed q-Gram Indexing for Highly Repetitive Biological Sequences
BIBE '10 Proceedings of the 2010 IEEE International Conference on Bioinformatics and Bioengineering
Relative Lempel-Ziv compression of genomes for large-scale storage and retrieval
SPIRE'10 Proceedings of the 17th international conference on String processing and information retrieval
DNA compression challenge revisited: a dynamic programming approach
CPM'05 Proceedings of the 16th annual conference on Combinatorial Pattern Matching
IEEE Transactions on Information Theory
Optimized relative Lempel-Ziv compression of genomes
ACSC '11 Proceedings of the Thirty-Fourth Australasian Computer Science Conference - Volume 113
Reference sequence construction for relative compression of genomes
SPIRE'11 Proceedings of the 18th international conference on String processing and information retrieval
Practical compression for multi-alignment genomic files
ACSC '13 Proceedings of the Thirty-Sixth Australasian Computer Science Conference - Volume 135
FRESCO: Referential Compression of Highly Similar Sequences
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Hi-index | 0.00 |
Genomic repositories increasingly include individual as well as reference sequences, which tend to share long identical and near-identical strings of nucleotides. However, the sequential processing used by most compression algorithms, and the volumes of data involved, mean that these long-range repetitions are not detected. An order-insensitive, disk-based dictionary construction method can detect this repeated content and use it to compress collections of sequences. We explore a dictionary construction method that improves repeat identification in large DNA data sets. Our adaptation, Comrad, of an existing disk-based method identifies exact repeated content in collections of sequences with similarities within and across the set of input sequences. Comrad compresses the data over multiple passes, which is an expensive process, but allows Comrad to compress large data sets within reasonable time and space. Comrad allows for random access to individual sequences and subsequences without decompressing the whole data set. Comrad has no competitor in terms of the size of data sets that it can compress (extending to many hundreds of gigabytes) and, even for smaller data sets, the results are competitive compared to alternatives; as an example, 39 S. cerevisiae genomes compressed to 0.25 bits per base.