On enumerating the DNA sequences

Authors:
M. Oǧuzhan Külekci
Affiliations:
TUBITAK - BILGEM - UEKAE, Kocaeli, Turkey
Venue:
Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine
Year:
2012

Abstract

DNA sequences are mostly represented in raw text format, although 2-bit compressed representations are preferred either to save space or to speed up some processing through the computer's intrinsic bitwise operations, e.g., in search applications. Studies aiming to compress DNA sequences have deployed various encoding schemes such as Lempel-Ziv type factorizations, arithmetic/Huffman codes, context-free grammars, and many ad hoc heuristics. Besides saving space, compression techniques are of use in many other areas, such as information-theoretic sequence analysis, clustering, and indexing. Unlike its counterparts, enumerative encoding of DNA sequences has not received much attention to date. The lack of publicly available libraries for enumeration might be one reason for this, as one can find many freely available libraries for arithmetic coding but not for enumerative coding. With this motivation, this study sheds light on how to enumerate a given DNA sequence containing a, c, g, and t occurrences of the corresponding bases, and presents a general-purpose C++ software library, which may be considered in various applications including, but not limited to, hashing, indexing, filtering, and searching of DNA sequences, as well as new compression schemes and information-theoretic analyses. The proposed technique represents an input sequence via a tuple, where the first number, ParID, specifies the Parikh vector and the second number, PermID, identifies the correct permutation of the bases. We present a combinatorial approach to count and order the distinct frequency vectors for an input block, and use simple lexicographical ordering for the permutation.
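
To make the (ParID, PermID) idea concrete, here is a minimal Python sketch. It is not the paper's C++ library or its API. All function names (`parikh`, `par_id`, `perm_id`, `perm_from_id`) are hypothetical. PermID is computed as the lexicographic rank among all arrangements that share a Parikh vector, in line with the abstract. The abstract does not say how the frequency vectors are ordered, so the ParID ordering used here (lexicographic over the count vector for a fixed block length) is purely an assumption.

```python
from math import comb, factorial

BASES = "acgt"  # assumed alphabet order: a < c < g < t

def parikh(seq):
    """Return the Parikh vector: counts of a, c, g, t in seq."""
    return [seq.count(b) for b in BASES]

def multinomial(counts):
    """Number of distinct arrangements of a multiset with these counts."""
    total = factorial(sum(counts))
    for k in counts:
        total //= factorial(k)
    return total

def par_id(counts):
    """Rank of a count vector among all vectors with the same sum.

    Assumption: vectors are ordered lexicographically; the paper's
    actual ordering may differ.
    """
    rank, rem = 0, sum(counts)
    for i, c in enumerate(counts[:-1]):
        parts_left = len(counts) - 1 - i
        for v in range(c):
            # vectors whose i-th count is v < c come earlier
            rank += comb(rem - v + parts_left - 1, parts_left - 1)
        rem -= c
    return rank

def perm_id(seq):
    """Lexicographic rank of seq among sequences with its Parikh vector."""
    counts = parikh(seq)
    rank = 0
    for ch in seq:
        idx = BASES.index(ch)
        for smaller in range(idx):
            if counts[smaller] > 0:
                # count arrangements that place a smaller base here
                counts[smaller] -= 1
                rank += multinomial(counts)
                counts[smaller] += 1
        counts[idx] -= 1
    return rank

def perm_from_id(counts, rank):
    """Inverse of perm_id: rebuild the sequence from counts and rank."""
    counts = list(counts)
    out = []
    for _ in range(sum(counts)):
        for idx, base in enumerate(BASES):
            if counts[idx] == 0:
                continue
            counts[idx] -= 1
            block = multinomial(counts)
            if rank < block:
                out.append(base)
                break
            rank -= block
            counts[idx] += 1
    return "".join(out)
```

For example, `perm_id("tgca")` is 23, the last of the 4! = 24 arrangements of one a, c, g, and t, and `perm_from_id(parikh(s), perm_id(s))` returns `s` for any sequence `s` over the four bases.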