A Simple Statistical Algorithm for Biological Sequence Compression

Authors:
Minh Duc Cao;Trevor I. Dix;Lloyd Allison;Chris Mears
Affiliations:
Monash University, Australia;Monash University, Australia;Monash University, Australia;Monash University, Australia
Venue:
DCC '07 Proceedings of the 2007 Data Compression Conference
Year:
2007

Cited by 13:

Compressing proteomes: the relevance of medium range correlations
EURASIP Journal on Bioinformatics and Systems Biology

Computing Substitution Matrices for Genomic Comparative Analysis
PAKDD '09 Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining

A Distance Measure for Genome Phylogenetic Analysis
AI '09 Proceedings of the 22nd Australasian Joint Conference on Advances in Artificial Intelligence

Compression of whole genome alignments
IEEE Transactions on Information Theory - Special issue on information theory in molecular biology and neuroscience

Relative Lempel-Ziv compression of genomes for large-scale storage and retrieval
SPIRE'10 Proceedings of the 17th international conference on String processing and information retrieval

Reference sequence construction for relative compression of genomes
SPIRE'11 Proceedings of the 18th international conference on String processing and information retrieval

Iterative Dictionary Construction for Compression of Large DNA Data Sets
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)

Complexity profiles of DNA sequences using finite-context models
USAB'11 Proceedings of the 7th conference on Workgroup Human-Computer Interaction and Usability Engineering of the Austrian Computer Society: Information Quality in e-Health

Fast relative Lempel-Ziv self-index for similar sequences
FAW-AAIM'12 Proceedings of the 6th international Frontiers in Algorithmics, and Proceedings of the 8th international conference on Algorithmic Aspects in Information and Management

Compression of whole genome alignments using a mixture of finite-context models
ICIAR'12 Proceedings of the 9th international conference on Image Analysis and Recognition - Volume Part I

Optimized relative Lempel-Ziv compression of genomes
ACSC '11 Proceedings of the Thirty-Fourth Australasian Computer Science Conference - Volume 113

Practical compression for multi-alignment genomic files
ACSC '13 Proceedings of the Thirty-Sixth Australasian Computer Science Conference - Volume 135

FRESCO: Referential Compression of Highly Similar Sequences
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)

Abstract

This paper introduces a novel algorithm for biological sequence compression that makes use of both statistical properties and repetition within sequences. A panel of experts is maintained to estimate the probability distribution of the next symbol in the sequence to be encoded. Expert probabilities are combined to obtain the final distribution. The resulting information sequence provides insight for further study of the biological sequence. Each symbol is then encoded by arithmetic coding. Experiments show that our algorithm outperforms existing compressors on typical DNA and protein sequence datasets while maintaining a practical running time.
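
The expert-panel idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which experts are used or how their probabilities are combined, so the code assumes hypothetical adaptive order-k context experts (`OrderKExpert`) and a Bayesian-style weighting in which each expert's weight is multiplied by the probability it gave the symbol that actually occurred. Experts that exploit repeats are left out, and arithmetic coding is not performed; the sum of the per-symbol information values only approximates the length an arithmetic coder would produce.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: DNA alphabet only


class OrderKExpert:
    """Hypothetical expert: adaptive order-k context model with add-one smoothing."""

    def __init__(self, k):
        self.k = k
        # Every context starts with a count of 1 per symbol, so no probability is zero.
        self.counts = defaultdict(lambda: {s: 1 for s in ALPHABET})

    def _context(self, history):
        return history[-self.k:] if self.k else ""

    def predict(self, history):
        c = self.counts[self._context(history)]
        total = sum(c.values())
        return {s: c[s] / total for s in ALPHABET}

    def update(self, history, symbol):
        self.counts[self._context(history)][symbol] += 1


def information_sequence(seq, experts):
    """Return the information content (in bits) of each symbol under the blended model."""
    weights = [1.0 / len(experts)] * len(experts)
    info = []
    for i, sym in enumerate(seq):
        history = seq[:i]
        preds = [e.predict(history) for e in experts]
        # Blend the expert distributions using the current weights.
        mix = {s: sum(w * p[s] for w, p in zip(weights, preds)) for s in ALPHABET}
        info.append(-math.log2(mix[sym]))
        # Assumed weighting rule: reward experts that predicted the symbol well.
        weights = [w * p[sym] for w, p in zip(weights, preds)]
        norm = sum(weights)
        weights = [w / norm for w in weights]
        for e in experts:
            e.update(history, sym)
    return info


if __name__ == "__main__":
    bits = information_sequence("ACGT" * 20, [OrderKExpert(0), OrderKExpert(2)])
    print(f"total: {sum(bits):.1f} bits for {len(bits)} symbols")
```

In this sketch the per-symbol values returned by `information_sequence` play the role of the information sequence mentioned in the abstract: regions the panel predicts well cost few bits, while surprising regions cost close to 2 bits per base.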