This paper introduces a novel algorithm for biological sequence compression that exploits both the statistical properties of sequences and the repetition within them. A panel of experts is maintained, each estimating the probability distribution of the next symbol in the sequence to be encoded. The experts' probabilities are combined to obtain the final distribution, and each symbol is then encoded by arithmetic coding. The resulting information sequence provides insight for further study of the biological sequence. Experiments show that our algorithm outperforms existing compressors on typical DNA and protein sequence datasets while maintaining a practical running time.
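The expert-panel idea can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not say what the experts are or how they are combined, so the sketch assumes fixed-order context-count experts and a Bayesian reweighting of the panel after each symbol. The names `ContextExpert` and `information_sequence`, the orders `(0, 2, 4)`, and the add-one prior are all hypothetical. The arithmetic coder itself is omitted; the value `-log2 p` recorded for each symbol is the ideal code length that an arithmetic coder would approach, and the list of these values is treated here as the information sequence.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


class ContextExpert:
    """Hypothetical expert: predicts the next symbol from counts
    observed after the previous `order` symbols."""

    def __init__(self, order):
        self.order = order
        # Assumption: add-one prior so no symbol ever gets probability 0.
        self.counts = defaultdict(lambda: [1.0] * len(ALPHABET))

    def _context(self, history):
        return history[-self.order:] if self.order else ""

    def predict(self, history):
        counts = self.counts[self._context(history)]
        total = sum(counts)
        return [c / total for c in counts]

    def update(self, history, symbol):
        self.counts[self._context(history)][ALPHABET.index(symbol)] += 1.0


def information_sequence(seq, orders=(0, 2, 4)):
    """Return the per-symbol cost in bits under the combined panel."""
    experts = [ContextExpert(k) for k in orders]
    weights = [1.0 / len(experts)] * len(experts)
    max_order = max(orders)
    info = []
    for i, symbol in enumerate(seq):
        history = seq[max(0, i - max_order):i]
        idx = ALPHABET.index(symbol)
        preds = [e.predict(history) for e in experts]
        # Combine the expert distributions into one final distribution.
        mixed = [
            sum(w * p[j] for w, p in zip(weights, preds))
            for j in range(len(ALPHABET))
        ]
        info.append(-math.log2(mixed[idx]))
        # Assumed combination rule: reweight each expert by how well
        # it predicted the symbol that actually occurred.
        weights = [w * p[idx] for w, p in zip(weights, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
        for e in experts:
            e.update(history, symbol)
    return info


if __name__ == "__main__":
    seq = "ACGT" * 50
    bits = information_sequence(seq)
    print(f"{sum(bits):.1f} bits for {len(seq)} symbols")
```

On a highly repetitive input the total cost falls well below the 2 bits per symbol of a plain four-letter encoding, because the higher-order experts quickly gain weight. Plotting the returned list along the sequence would show where the input is more or less predictable.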