Discovering cis-regulatory modules by optimizing barbecues

  • Authors:
  • Axel Mosig;Türker Bıyıkoğlu;Sonja J. Prohaska;Peter F. Stadler

  • Affiliations:
  • CAS-MPG Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, 320 Yue Yang Road, 200031 Shanghai, China and Max-Planck Institute for Mathematics in the Sciences ...;Işık University, Kumbaba Mevkii Şile, 34980 Istanbul, Turkey and Max-Planck Institute for Mathematics in the Sciences, Inselstrasse 22, D-04103 Leipzig, Germany;Bioinformatics Group, Department of Computer Science, University of Leipzig, Härtelstr. 16-18, D-04107 Leipzig, Germany and Interdisciplinary Center for Bioinformatics, University of Leipzig, ...;Bioinformatics Group, Department of Computer Science, University of Leipzig, Härtelstr. 16-18, D-04107 Leipzig, Germany and Interdisciplinary Center for Bioinformatics, University of Leipzig, ...

  • Venue:
  • Discrete Applied Mathematics
  • Year:
  • 2009

Abstract

Gene expression in eukaryotic cells is regulated by a complex network of interactions, in which transcription factors and their binding sites on the genomic DNA play a determining role. As transcription factors rarely, if ever, act in isolation, binding sites of interacting factors are typically arranged in close proximity forming so-called cis-regulatory modules. Even when the individual binding sites are known, module discovery remains a hard combinatorial problem, which we formalize here as the Best Barbecue Problem. It asks for simultaneously stabbing a maximum number of differently colored intervals from K arrangements of colored intervals. This geometric problem turns out to be an elementary, yet previously unstudied combinatorial optimization problem of detecting common edges in a family of hypergraphs, a decision version of which we show here to be NP-complete. Due to its relevance in biological applications, we propose algorithmic variations that are suitable for the analysis of real data sets comprising either many sequences or many binding sites. Being based on set systems induced by interval arrangements, our problem setting generalizes to discovering patterns of co-localized itemsets in non-sequential objects that consist of corresponding arrangements or induce set systems of co-localized items. In fact, our optimization problem is a generalization of the popular concept of frequent itemset mining.
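To make the problem statement concrete, here is a minimal brute-force sketch in Python. It assumes one particular reading of the abstract: each of the K arrangements is a list of closed, colored intervals on a line, one stabbing point is chosen per arrangement, and the goal is the largest set of colors stabbed in every arrangement. The function names `stabbed_color_sets` and `best_barbecue` and the `(start, end, color)` tuple layout are hypothetical, chosen only for illustration. This is not the authors' algorithm. It enumerates every combination of candidate stabbing points, so its running time grows exponentially with K, in line with the NP-completeness of the decision version.

```python
from itertools import product


def stabbed_color_sets(arrangement):
    """Return the distinct color sets obtainable by stabbing one arrangement.

    `arrangement` is a list of (start, end, color) tuples describing closed
    intervals (an assumed layout). Every maximal set of simultaneously
    stabbed intervals is hit at some interval's start point, so only start
    points are tried as stabbing positions.
    """
    color_sets = set()
    for point, _, _ in arrangement:
        colors = frozenset(c for s, e, c in arrangement if s <= point <= e)
        color_sets.add(colors)
    return color_sets


def best_barbecue(arrangements):
    """Brute-force search for a largest color set stabbed in all arrangements."""
    if not arrangements:
        return frozenset()
    candidates = [stabbed_color_sets(a) for a in arrangements]
    best = frozenset()
    # Try one candidate color set per arrangement and keep the largest
    # intersection; this loop is exponential in the number of arrangements.
    for choice in product(*candidates):
        common = frozenset.intersection(*choice)
        if len(common) > len(best):
            best = common
    return best


if __name__ == "__main__":
    # Two toy "sequences"; colors stand in for transcription factors.
    seq1 = [(0, 5, "A"), (3, 8, "B"), (4, 9, "C"), (20, 25, "D")]
    seq2 = [(0, 2, "A"), (1, 6, "B"), (10, 12, "C"), (11, 13, "D")]
    print(sorted(best_barbecue([seq1, seq2])))
```

In the toy input, colors A and B overlap in both sequences, while C overlaps them only in the first, so the search settles on A and B. Seen this way, each arrangement induces a set system of co-localized colors, which is how the problem relates to frequent itemset mining as the abstract notes.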