Non-repetitive DNA sequence compression using memoization

Authors:
K. G. Srinivasa;M. Jagadish;K. R. Venugopal;L. M. Patnaik
Affiliations:
Data Mining Laboratory, M S Ramaiah Institute of Technology, Bangalore;Software Engineer, MindTree Consulting, Bangalore;Professor, University of Visvesvaraya College of Engineering, Bangalore University, Bangalore;Professor, Microprocessor Application Laboratory, Indian Institute of Science, Bangalore
Venue:
ISBMDA'06 Proceedings of the 7th international conference on Biological and Medical Data Analysis
Year:
2006

Abstract

With an increasing number of DNA sequences being discovered, the problem of storing and using genomic databases has become vital. Since DNA sequences consist of only four letters, two bits are sufficient to store each base. Many algorithms proposed in the recent past push the bits-per-base figure below this two-bit baseline. They exploit the subtle patterns in DNA, along with statistical inferences, to increase the compression ratio. From the compression perspective, an entire DNA sequence can be considered to be made of two types of subsequences: repetitive and non-repetitive. The repetitive parts are compressed using dictionary-based schemes, and the non-repetitive parts are usually compressed using general text compression schemes. In this paper, we present a memoization-based encoding scheme for non-repetitive DNA sequences. This scheme is incorporated into DNAPack, a DNA-specific compression algorithm. The results show that our method performs noticeably better than other techniques of its kind.
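
As an illustration of the two-bits-per-base baseline mentioned in the abstract, the following is a minimal sketch in Python. It is not the memoization scheme of the paper, whose details the abstract does not give. The mapping in `CODE` and the function names `pack` and `unpack` are assumptions chosen for the example; any fixed assignment of the four bases to two-bit codes would serve.

```python
# Assumed mapping of bases to 2-bit codes (illustrative, not from the paper).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"


def pack(seq: str) -> bytes:
    """Pack a string over {A, C, G, T} into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # Left-align a short final chunk so the unused low bits are zero.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, n: int) -> str:
    """Recover the first n bases from data produced by pack()."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASE[(byte >> shift) & 3])
    # Padding bits decode as 'A', so the true length n is needed to trim them.
    return "".join(bases[:n])
```

The sequence length has to be stored alongside the packed bytes, because the zero padding in the last byte is indistinguishable from trailing `A` bases. A compressor aiming below two bits per base would replace this fixed-width code with a model that exploits the patterns in the sequence.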