Motivation: Identifying repeated factors that occur in a string of letters or common factors that occur in a set of strings represents an important task in computer science and biology. Such patterns are called motifs, and the process of identifying them is called motif extraction. In biology, motifs may correspond to functional elements in DNA, RNA, or protein molecules. Motifs may also correspond to whole loci whose sequences are highly similar because of recent duplication (e.g., transposable elements or recently duplicated genes). A DNA motif is a nucleic acid sequence that has a specific biological function, for instance encoding the DNA binding sites for a regulatory protein (transcription factor).

Results: In this article, we introduce MoTeX, the first high-performance computing (HPC) tool for MoTif eXtraction from large-scale datasets. It uses state-of-the-art algorithms for solving the fixed-length approximate string matching problem. MoTeX comes in three flavors: a standard CPU version; an OpenMP-based version; and an MPI-based version. We show that MoTeX produces similar and partially identical results to current state-of-the-art tools with respect to accuracy as quantified by statistical significance measures. Moreover, we show that it matches or outperforms competing tools in terms of runtime efficiency. The MPI-based version of MoTeX requires only one hour to process all human genes on 1056 processors, while current sequential programmes require more than two months for this task.

Availability: http://www.exelixis-lab.org/motex (open-source code)
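To make the notion of fixed-length approximate string matching concrete, the following is a minimal brute-force sketch of common-motif extraction under Hamming distance. It is not MoTeX's algorithm or interface: the function names, the `quorum` parameter (the minimum number of sequences a motif must occur in), and the restriction of candidates to factors that appear exactly in the input are all assumptions made for illustration.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs_with_mismatches(motif, text, k):
    """True if some length-len(motif) factor of text is within k mismatches of motif."""
    m = len(motif)
    return any(
        hamming(motif, text[i:i + m]) <= k
        for i in range(len(text) - m + 1)
    )


def extract_motifs(sequences, length, max_mismatches, quorum):
    """Map each candidate motif to the number of sequences it occurs in
    (with at most max_mismatches mismatches), keeping those that reach quorum.

    Assumption: candidates are limited to factors present in the input,
    a simplification; a valid motif need not occur exactly anywhere.
    """
    candidates = {
        s[i:i + length]
        for s in sequences
        for i in range(len(s) - length + 1)
    }
    result = {}
    for candidate in sorted(candidates):
        support = sum(
            occurs_with_mismatches(candidate, s, max_mismatches)
            for s in sequences
        )
        if support >= quorum:
            result[candidate] = support
    return result
```

Calling `extract_motifs(sequences, 4, 1, 3)` on three short DNA strings returns every 4-letter factor that matches all three sequences with at most one mismatch. The running time grows with the product of the number of candidates and the total input length, which is why a tool aimed at genome-scale data would need faster matching algorithms and parallelism rather than this direct approach.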