MoTeX: A word-based HPC tool for MoTif eXtraction

Authors:
Solon P. Pissis;Alexandros Stamatakis;Pavlos Pavlidis
Affiliations:
Florida Museum of Natural History, University of Florida, USA & Heidelberg Institute for Theoretical Studies, Germany;Heidelberg Institute for Theoretical Studies, Germany;Foundation for Research and Technology -- Hellas Institute of Molecular Biology and Biotechnology, Greece
Venue:
Proceedings of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics
Year:
2013

Abstract

Motivation: Identifying repeated factors that occur in a string of letters, or common factors that occur in a set of strings, is an important task in computer science and biology. Such patterns are called motifs, and the process of identifying them is called motif extraction. In biology, motifs may correspond to functional elements in DNA, RNA, or protein molecules. Motifs may also correspond to whole loci whose sequences are highly similar because of recent duplication (e.g., transposable elements or recently duplicated genes). A DNA motif is a nucleic acid sequence that has a specific biological function, for instance encoding the DNA binding sites for a regulatory protein (transcription factor).

Results: In this article, we introduce MoTeX, the first high-performance computing (HPC) tool for MoTif eXtraction from large-scale datasets. It uses state-of-the-art algorithms for solving the fixed-length approximate string matching problem. MoTeX comes in three flavors: a standard CPU version, an OpenMP-based version, and an MPI-based version. We show that MoTeX produces results that are similar, and in part identical, to those of current state-of-the-art tools with respect to accuracy as quantified by statistical significance measures. Moreover, we show that it matches or outperforms competing tools in terms of runtime efficiency. The MPI-based version of MoTeX requires only one hour to process all human genes on 1056 processors, while current sequential programs require more than two months for this task.

Availability: http://www.exelixis-lab.org/motex (open-source code)
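To make the notion of motif extraction via fixed-length approximate string matching concrete, the following is a minimal brute-force sketch. It is not MoTeX's algorithm or interface, which the abstract does not specify. It assumes mismatches are counted with Hamming distance, that candidate motifs are restricted to words that literally occur in the input, and that a word counts as a motif when it occurs approximately in at least a given number of sequences. The names `extract_motifs`, `max_mismatches`, and `quorum` are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(word, text, max_mismatches):
    """True if some window of len(word) in text is within max_mismatches of word."""
    length = len(word)
    return any(
        hamming(word, text[i:i + length]) <= max_mismatches
        for i in range(len(text) - length + 1)
    )


def extract_motifs(sequences, length, max_mismatches, quorum):
    """Return {word: support} for fixed-length words occurring, with at most
    max_mismatches mismatches, in at least `quorum` of the input sequences.

    Assumption: candidates are only the words present in the input, which is
    a simplification chosen for illustration.
    """
    candidates = {
        s[i:i + length]
        for s in sequences
        for i in range(len(s) - length + 1)
    }
    motifs = {}
    for word in sorted(candidates):
        # Support = number of sequences containing an approximate occurrence.
        support = sum(occurs_within(word, s, max_mismatches) for s in sequences)
        if support >= quorum:
            motifs[word] = support
    return motifs


example = extract_motifs(["AAAA", "AAAT", "CCCC"], length=3, max_mismatches=1, quorum=2)
```

Here `example` is `{"AAA": 2, "AAT": 2}`: both words occur with at most one mismatch in the first two sequences, while `CCC` is supported only by the third. This exhaustive comparison is quadratic in the input size, which is the kind of cost that motivates efficient and parallel algorithms for large-scale datasets.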