Characterising DNA/RNA signals with crisp hypermotifs: a case study on core promoters

  • Authors: Carey Pridgeon; David Corne

  • Affiliations: Department of Computer Science, University of Exeter, Exeter, UK; School of MACS, Heriot-Watt University, Edinburgh, UK

  • Venue: EvoBIO'07 Proceedings of the 5th European conference on Evolutionary computation, machine learning and data mining in bioinformatics
  • Year: 2007

Abstract

A common way to characterise important, conserved signals in nucleotide sequences, such as transcription factor binding sites, is to use consensus sequences or consensus patterns. A well-known example is the "TATA-box" commonly found in eukaryotic core promoters. Such patterns are valuable because they offer insight into basic molecular biology processes and can support reasoning about the understanding, design and control of these processes. However, such patterns are rarely accurate; they represent only a very approximate characterisation of the signal under study. At the opposite extreme, we may characterise such a signal with a neural network, a high-order Markov model, or a similar model. These have better sensitivity and specificity but are unreadable, and are consequently unhelpful for conveying an understanding of the underlying molecular biology processes that could support insight or reasoning. We describe a simple pattern language, called crisp hypermotifs (CHMs), that yields highly readable patterns that can support understanding and reasoning, yet achieve greater sensitivity and specificity than the commonly used approaches to crisply characterising a signal. We use evolutionary computation to discover high-performance CHMs from data. We argue that CHMs should be used in place of classical consensus motifs, and we justify this with examples derived from a large dataset of mammalian core promoters. In particular, we provide CHM alternatives to the well-known core promoter TATA-box and Initiator patterns that have better sensitivity and specificity than their classical counterparts.
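
To make the notions of a classical consensus pattern and of its sensitivity and specificity concrete, here is a minimal Python sketch. It does not implement CHMs, whose syntax the abstract does not specify; it only illustrates the classical baseline the abstract compares against. The consensus string TATAWAAR is one commonly cited form of the TATA-box and is used purely as an illustration, the six sequences are invented toy data rather than real promoters, and the helper names consensus_to_regex and sensitivity_specificity are hypothetical.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def consensus_to_regex(consensus):
    """Compile an IUPAC consensus string into a regular expression."""
    return re.compile("".join(IUPAC[c] for c in consensus.upper()))


def sensitivity_specificity(pattern, positives, negatives):
    """Return (sensitivity, specificity) of a pattern used as a classifier.

    A sequence is predicted positive if the pattern occurs anywhere in it.
    """
    true_pos = sum(1 for seq in positives if pattern.search(seq))
    true_neg = sum(1 for seq in negatives if not pattern.search(seq))
    return true_pos / len(positives), true_neg / len(negatives)


# Assumption: TATAWAAR as an illustrative TATA-box consensus.
tata = consensus_to_regex("TATAWAAR")

# Invented toy data: "positives" stand in for sequences that carry the
# signal, "negatives" for background sequences.
positives = ["GGCTATAAAAGGCC", "CCTATATAAGCGT", "GCGCGTATTAAGCC"]
negatives = ["GGCCGGCCGGCCGG", "ACGTACGTACGTAC", "CCTATAAAAGTTGG"]

sens, spec = sensitivity_specificity(tata, positives, negatives)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

On this toy data the consensus misses one signal-bearing sequence (a variant with TATT in place of TATA) and fires on one background sequence, which is the kind of approximate behaviour the abstract attributes to classical consensus patterns.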