G-SQZ

Authors:
Waibhav Tembe; James Lowey; Edward Suh
Venue:
Bioinformatics
Year:
2010

Cited by 4:

A New Efficient Data Structure for Storage and Retrieval of Multiple Biosequences
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)

KungFQ: A Simple and Powerful Approach to Compress fastq Files
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)

High-Throughput Compression of FASTQ Data with SeqDB
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)

Practical compression for multi-alignment genomic files
ACSC '13 Proceedings of the Thirty-Sixth Australasian Computer Science Conference - Volume 135

Abstract

Summary: Large volumes of data generated by high-throughput sequencing instruments present non-trivial challenges in data storage, content access and transfer. We present G-SQZ, a Huffman coding-based representation scheme specific to sequencing reads that compresses data without altering the relative order. G-SQZ achieved 65% to 81% compression on benchmark datasets, and it allows selective access without scanning and decoding from the start. This article focuses on describing the underlying encoding scheme and its software implementation; the more theoretical problem of optimal compression is out of scope. The immediate practical benefits include reduced infrastructure and informatics costs in managing and analyzing large sequencing data.

Availability: http://public.tgen.org/sqz. Academic/non-profit: the source code is available at no cost under a non-open-source license on request from the website; the binary is available for direct download at no cost. For-profit: submit a request for a for-profit license via the website.

Contact: wtembe@tgen.org
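To make the idea of Huffman coding applied to sequencing reads concrete, the following is a minimal sketch and not the G-SQZ implementation. It assumes that each symbol is a (base, quality character) pair taken from a read, which the abstract does not state, and it represents codes as Python strings of "0" and "1" for readability rather than packed bits. The function names `build_huffman_codes`, `encode_read` and `decode_read` are hypothetical.

```python
import heapq
from collections import Counter
from itertools import count


def build_huffman_codes(symbols):
    """Map each distinct symbol to a prefix-free bit string (as a str)."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:
        # A lone symbol still needs a code of at least one bit.
        return {next(iter(freq)): "0"}
    tiebreak = count()  # unique counter so the heap never compares dicts
    heap = [(weight, next(tiebreak), {sym: ""}) for sym, weight in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing their codes.
        w1, _, codes1 = heapq.heappop(heap)
        w2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]


def encode_read(bases, quals, codes):
    """Encode one read as a bit string, one code per (base, quality) pair."""
    return "".join(codes[pair] for pair in zip(bases, quals))


def decode_read(bits, codes):
    """Invert encode_read; relies on the codes being prefix-free."""
    inverse = {code: sym for sym, code in codes.items()}
    bases, quals, current = [], [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            base, qual = inverse[current]
            bases.append(base)
            quals.append(qual)
            current = ""
    return "".join(bases), "".join(quals)


if __name__ == "__main__":
    # Toy reads (assumed data, for illustration only).
    reads = [("ACGTNACG", "IIIH#IIG"), ("AAAAACGT", "IIIIIIII")]
    pairs = [p for bases, quals in reads for p in zip(bases, quals)]
    codes = build_huffman_codes(pairs)
    for bases, quals in reads:
        bits = encode_read(bases, quals, codes)
        assert decode_read(bits, codes) == (bases, quals)
        print(len(bits), "bits for", len(bases), "base/quality pairs")
```

Because frequent pairs receive shorter codes, reads dominated by a few base and quality combinations need fewer bits per position than a fixed-width encoding. Encoding each read independently, as in this sketch, also keeps the reads in their original order.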