Computational inference of difficult word boundaries in DNA languages

  • Authors:
  • Guy Tsafnat;Paul Setzermann;Sally R. Partridge;Dominik Grimm

  • Affiliations:
  • University of New South Wales, Sydney, Australia;University of New South Wales, Sydney, Australia;University of Sydney, Sydney, Australia;University of New South Wales, Sydney, Australia

  • Venue:
  • Proceedings of the 4th International Symposium on Applied Sciences in Biomedical and Communication Technologies
  • Year:
  • 2011

Abstract

Many applications in molecular and systems biology exploit similarities between DNA and languages to make predictions about cell function. This approach gives structure to an otherwise monotonous sequence of nucleotides. However, one of the major differences between DNA sequences and text lies in how semantic units (e.g. words) are delimited. Whereas words and sentences in natural languages are separated by spaces and punctuation, no such markers exist in DNA. Some semantic units in DNA (e.g. genes) can be identified relatively easily and with relatively high accuracy. Other units depend on molecular mechanisms that are less well understood and are therefore harder to identify accurately. In this paper we discuss three machine learning methods to elucidate the boundaries of such difficult units: heuristic approaches use hypothesised models of the mechanism to identify word boundaries; supervised machine learning methods generalise labelled examples of word boundaries into a model that can then be used to detect these boundaries; and unsupervised machine learning methods infer a model from unlabelled data. As an example, we use a bacterial transposable element called ISEcp1 that moves DNA segments of variable length. We assess the accuracy of each of these methods using rediscovery experiments. We demonstrate the power of the methods by examining 9 instances of DNA segments associated with ISEcp1 that lack known boundaries. We identify 6 units that include genes that confer resistance to clinically important antibiotics.
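
To make the idea of a heuristic boundary search concrete, the sketch below scans a sequence for windows that approximately match a short boundary motif and reports where each candidate unit would end. It is only an illustration and not the method used in the paper: the motif, the toy sequence, the mismatch threshold, and the function names `hamming` and `candidate_boundaries` are all hypothetical, and the strings are not real ISEcp1 sequence.

```python
def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def candidate_boundaries(seq: str, motif: str, max_mismatches: int = 1):
    """Return (end_position, mismatches) for every window of `seq` that is
    within `max_mismatches` of `motif`. `end_position` is the index just
    past the matching window, i.e. a candidate unit boundary."""
    k = len(motif)
    hits = []
    for i in range(len(seq) - k + 1):
        d = hamming(seq[i:i + k], motif)
        if d <= max_mismatches:
            hits.append((i + k, d))
    return hits


# Hypothetical toy data (assumption): an exact copy of the motif followed by
# a degenerate copy with one substitution.
MOTIF = "ACGTTGCA"
SEQ = "TTTT" + "ACGTTGCA" + "GGGGGG" + "ACGTAGCA" + "CC"

hits = candidate_boundaries(SEQ, MOTIF)
print(hits)
```

Allowing mismatches is the point of the sketch: when boundary signals are degenerate, an exact search misses them, while a looser threshold yields several candidates that then have to be ranked or filtered by some further model.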