On enumerating the DNA sequences

  • Authors:
  • M. Oǧuzhan Külekci

  • Affiliations:
  • TUBITAK - BILGEM - UEKAE, Kocaeli, Turkey

  • Venue:
  • Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine
  • Year:
  • 2012

Abstract

DNA sequences are mostly stored in raw text format, while 2-bit compressed representations are preferred either to save space or to speed up processing via the computer's intrinsic bitwise operations, e.g., in search applications. Studies aiming to compress DNA sequences have deployed various encoding schemes such as Lempel-Ziv type factorizations, arithmetic/Huffman codes, context-free grammars, and many ad hoc heuristics. Besides saving space, compression techniques are useful in many other areas, such as information-theoretic sequence analysis, clustering, and indexing. Unlike its counterparts, enumerative encoding of DNA sequences has not received much attention to date. The lack of publicly available libraries for enumeration might be one reason for this: many freely available libraries exist for arithmetic coding, but not for enumerative coding. With this motivation, this study sheds light on how to enumerate a given DNA sequence containing a, c, g, and t occurrences of the corresponding bases, and presents a general-purpose C++ software library, which may be considered in various applications including, but not limited to, hashing, indexing, filtering, and searching of DNA sequences, as well as new compression schemes and information-theoretic analyses. The proposed technique represents an input sequence via a tuple of two numbers, where the first number, ParID, specifies the Parikh vector and the second number, PermID, identifies the correct permutation of the bases. We present a combinatorial approach to count and order the distinct frequency vectors for an input block, and use simple lexicographical ordering for the permutation.
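To make the (ParID, PermID) idea concrete, the following is a minimal C++ sketch, not the library described in the abstract. Every name in it (`encode`, `parikhRank`, `permutationRank`, `baseIndex`, `Code`) is hypothetical. It assumes the base order a < c < g < t, and it assumes that frequency vectors of equal total length are ranked lexicographically; the abstract does not state the ordering the paper actually uses for ParID. Blocks are limited to 31 bases so that all counts fit in 64-bit integers.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

using u64 = std::uint64_t;
using Parikh = std::array<u64, 4>;  // counts of a, c, g, t

// Binomial coefficient C(n, k); every intermediate value is itself an integer.
u64 binom(u64 n, u64 k) {
    if (k > n) return 0;
    if (k > n - k) k = n - k;
    u64 r = 1;
    for (u64 i = 1; i <= k; ++i) r = r * (n - k + i) / i;
    return r;
}

// Number of distinct arrangements of a multiset with the given base counts.
u64 multinomial(const Parikh& cnt) {
    u64 total = 0, ways = 1;
    for (u64 c : cnt) {
        total += c;
        ways *= binom(total, c);
    }
    return ways;
}

// Hypothetical helper: index of a base under the assumed order a < c < g < t.
int baseIndex(char ch) {
    switch (ch) {
        case 'a': case 'A': return 0;
        case 'c': case 'C': return 1;
        case 'g': case 'G': return 2;
        case 't': case 'T': return 3;
        default: throw std::invalid_argument("not a DNA base");
    }
}

struct Code { u64 parId; u64 permId; };

// Rank of (a, c, g, t) among all count vectors with the same sum n,
// listed lexicographically (an assumed order, not necessarily the paper's).
u64 parikhRank(const Parikh& cnt) {
    u64 n = cnt[0] + cnt[1] + cnt[2] + cnt[3];
    u64 rank = 0;
    for (u64 a = 0; a < cnt[0]; ++a) rank += binom(n - a + 2, 2);  // smaller a
    u64 rest = n - cnt[0];
    for (u64 c = 0; c < cnt[1]; ++c) rank += rest - c + 1;  // same a, smaller c
    return rank + cnt[2];                                   // same a and c, smaller g
}

// Lexicographic rank of seq among all permutations of its own multiset.
u64 permutationRank(const std::string& seq, Parikh cnt) {
    u64 rank = 0;
    for (char ch : seq) {
        int cur = baseIndex(ch);
        for (int s = 0; s < cur; ++s) {
            if (cnt[s] == 0) continue;
            --cnt[s];                  // place a smaller base here ...
            rank += multinomial(cnt);  // ... and count all completions
            ++cnt[s];
        }
        --cnt[cur];  // consume the base that actually appears
    }
    return rank;
}

Code encode(const std::string& seq) {
    assert(seq.size() <= 31 && "longer blocks overflow 64-bit arithmetic");
    Parikh cnt{};
    for (char ch : seq) ++cnt[baseIndex(ch)];
    return {parikhRank(cnt), permutationRank(seq, cnt)};
}
```

Under these assumptions, `encode("gaca")` yields ParID 29 (the rank of the vector (2, 1, 1, 0) among the 35 vectors of length 4) and PermID 10 (the rank of "gaca" among the 12 arrangements of two a's, one c, and one g). Decoding would invert both rankings and additionally needs the block length, which this sketch does not store.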