MoTeX: A word-based HPC tool for MoTif eXtraction

  • Authors: Solon P. Pissis; Alexandros Stamatakis; Pavlos Pavlidis

  • Affiliations: Florida Museum of Natural History, University of Florida, USA & Heidelberg Institute for Theoretical Studies, Germany; Heidelberg Institute for Theoretical Studies, Germany; Foundation for Research and Technology -- Hellas Institute of Molecular Biology and Biotechnology, Greece

  • Venue: Proceedings of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics
  • Year: 2013

Abstract

Motivation: Identifying repeated factors that occur in a string of letters or common factors that occur in a set of strings represents an important task in computer science and biology. Such patterns are called motifs, and the process of identifying them is called motif extraction. In biology, motifs may correspond to functional elements in DNA, RNA, or protein molecules. Motifs may also correspond to whole loci whose sequences are highly similar because of recent duplication (e.g., transposable elements or recently duplicated genes). A DNA motif is a nucleic acid sequence that has a specific biological function, for instance encoding the DNA binding sites for a regulatory protein (transcription factor).

Results: In this article, we introduce MoTeX, the first high-performance computing (HPC) tool for MoTif eXtraction from large-scale datasets. It uses state-of-the-art algorithms for solving the fixed-length approximate string matching problem. MoTeX comes in three flavors: a standard CPU version; an OpenMP-based version; and an MPI-based version. We show that MoTeX produces similar and partially identical results to current state-of-the-art tools with respect to accuracy as quantified by statistical significance measures. Moreover, we show that it matches or outperforms competing tools in terms of runtime efficiency. The MPI-based version of MoTeX requires only one hour to process all human genes on 1056 processors, while current sequential programmes require more than two months for this task.

Availability: http://www.exelixis-lab.org/motex (open-source code)
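
To make the notion of word-based motif extraction via fixed-length approximate string matching more concrete, the following is a minimal brute-force sketch in Python. It is not MoTeX's algorithm or interface: the function names, the parameters `k` (motif length), `e` (maximum number of mismatches) and `q` (minimum fraction of sequences that must contain the motif), and the use of Hamming distance are all assumptions made for illustration only.

```python
from typing import Dict, List


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(word: str, seq: str, e: int) -> bool:
    """True if some factor of seq with the same length as word
    differs from word in at most e positions."""
    k = len(word)
    return any(hamming(word, seq[i:i + k]) <= e
               for i in range(len(seq) - k + 1))


def extract_motifs(seqs: List[str], k: int, e: int, q: float) -> Dict[str, int]:
    """Hypothetical helper: return every length-k factor of the input that
    occurs, with at most e mismatches, in at least a fraction q of the
    sequences, mapped to the number of sequences in which it occurs."""
    # Candidate motifs are the length-k words that appear in the input.
    candidates = {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}
    motifs = {}
    for word in candidates:
        support = sum(occurs_within(word, s, e) for s in seqs)
        if support >= q * len(seqs):
            motifs[word] = support
    return motifs


# Toy input, invented for this example.
seqs = ["ACGTACGT", "TTACGAAC", "GGACCTAA"]
motifs = extract_motifs(seqs, k=4, e=1, q=1.0)
```

This naive version compares every candidate word against every window of every sequence, so its cost grows quickly with input size. It only illustrates the problem statement; the abstract's runtime claims rest on more efficient matching algorithms and on the OpenMP and MPI parallelisation, neither of which is reproduced here.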