Summary: Large volumes of data generated by high-throughput sequencing instruments present non-trivial challenges in data storage, content access and transfer. We present G-SQZ, a Huffman coding-based representation scheme specific to sequencing reads that compresses data without altering its relative order. G-SQZ achieved 65% to 81% compression on benchmark datasets, and it allows selective access without scanning and decoding from the start. This article focuses on describing the underlying encoding scheme and its software implementation; the more theoretical problem of optimal compression is out of scope. The immediate practical benefits include reduced infrastructure and informatics costs in managing and analyzing large sequencing data.

Availability: http://public.tgen.org/sqz. Academic/non-profit: source code is available at no cost under a non-open-source license on request from the website; binaries are available for direct download at no cost. For-profit: submit a request for a for-profit license through the website.

Contact: wtembe@tgen.org
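The following is a minimal sketch of the general idea summarized above, not the G-SQZ implementation or its file format. It assumes that each symbol is a (base, quality) pair, that codes are built with textbook Huffman coding, and that selective access is provided by a hypothetical per-read table of bit offsets. The function names `build_codes`, `encode_reads` and `decode_read` are illustrative only, and bits are kept in a Python string for readability rather than packed into bytes.

```python
import heapq
from collections import Counter
from itertools import count


def build_codes(symbols):
    """Build a prefix-free Huffman code {symbol: bitstring} from a symbol stream."""
    freq = Counter(symbols)
    if len(freq) == 1:
        # A single distinct symbol still needs a one-bit code.
        return {sym: "0" for sym in freq}
    tie = count()  # tie-breaker so the heap never has to compare dicts
    heap = [(n, next(tie), {sym: ""}) for sym, n in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        # Merge the two rarest subtrees, extending their codes by one bit.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (n1 + n2, next(tie), merged))
    return heap[0][2]


def encode_reads(reads, codes):
    """Encode reads in their original order; also return each read's bit offset."""
    chunks, offsets, pos = [], [], 0
    for bases, quals in reads:
        offsets.append(pos)  # assumed index that enables selective access
        for sym in zip(bases, quals):
            chunks.append(codes[sym])
            pos += len(codes[sym])
    return "".join(chunks), offsets


def decode_read(bitstream, start, length, codes):
    """Decode one read of `length` symbols starting at bit offset `start`."""
    lookup = {code: sym for sym, code in codes.items()}
    bases, quals, current, i = [], [], "", start
    while len(bases) < length:
        current += bitstream[i]
        i += 1
        if current in lookup:  # prefix-free, so the first match is the symbol
            base, qual = lookup[current]
            bases.append(base)
            quals.append(qual)
            current = ""
    return "".join(bases), "".join(quals)


if __name__ == "__main__":
    # Toy reads as (bases, qualities); real input would come from a FASTQ file.
    reads = [("ACGTN", "IIII#"), ("GGCAT", "I#III"), ("ACGTA", "IIIII")]
    codes = build_codes(sym for b, q in reads for sym in zip(b, q))
    bits, offsets = encode_reads(reads, codes)
    # Jump straight to the third read without decoding the first two.
    print(decode_read(bits, offsets[2], len(reads[2][0]), codes))
```

In this sketch the read length is passed in by the caller; a real container would need to store read lengths and the code table in a header. Pairing each base with its quality value lets frequent combinations receive short codes, which is one plausible way a reads-specific Huffman scheme could outperform coding bases and qualities independently.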