Summary: The amount of genomic sequence data being generated and made available through public databases continues to increase at an ever-expanding rate. Downloading, copying, sharing and manipulating these large datasets are becoming difficult and time consuming for researchers. We need to consider using advanced compression techniques as part of a standard data format for genomic data. The inherent structure of genome data allows for more efficient lossless compression than can be obtained through the use of generic compression programs. We apply a series of techniques to James Watson's genome that in combination reduce it to a mere 4 MB, small enough to be sent as an email attachment.

Availability: Our algorithms are implemented in C++ and are freely available from http://www.ics.uci.edu/~xhx/project/DNAzip.

Contact: chenli@ics.uci.edu; xhx@ics.uci.edu

Supplementary information: Supplementary data are available at Bioinformatics online.