Parallel motif extraction from very long sequences

Authors:
Majed Sahli;Essam Mansour;Panos Kalnis
Affiliations:
King Abdullah University of Science & Technology, Thuwal, Saudi Arabia;Qatar Computing Research Institute (QCRI), Doha, Qatar;King Abdullah University of Science & Technology, Thuwal, Saudi Arabia
Venue:
Proceedings of the 22nd ACM International Conference on Information & Knowledge Management
Year:
2013

Abstract

Motifs are frequent patterns used to identify biological functionality in genomic sequences, periodicity in time series, or user trends in web logs. Whereas much existing work focuses on collections of many short sequences, modern applications require mining motifs in a single very long sequence (on the order of several gigabytes). For this case, there exist statistical approaches that are fast but inaccurate, and combinatorial methods that are sound and complete. Unfortunately, existing combinatorial methods are serial and very slow. Consequently, they are limited to very short sequences (a few megabytes), small alphabets (typically 4 symbols for DNA sequences), and restricted types of motifs. This paper presents ACME, a combinatorial method for extracting motifs from a single very long sequence. ACME arranges the search space in contiguous blocks that take advantage of the cache hierarchy in modern architectures, and achieves almost an order of magnitude performance gain in serial execution. It also decomposes the search space in a smart way that allows scalability to thousands of processors with more than 90% speedup. ACME is the only method that: (i) scales to gigabyte-long sequences; (ii) handles large alphabets; (iii) supports interesting types of motifs with minimal additional cost; and (iv) is optimized for a variety of architectures such as multi-core systems, clusters in the cloud, and supercomputers. ACME reduces the extraction time for an exact-length query from 4 hours to 7 minutes on a typical workstation, handles sequences 3 orders of magnitude longer, and scales up to 16,384 cores on a supercomputer.
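
To make the idea of a decomposable combinatorial search space concrete, here is a minimal Python sketch. It is not ACME's algorithm: it handles only exact (not approximate) motifs of one fixed length, counts occurrences by plain string scanning rather than with an index, and ignores cache layout. All function names and parameters (`prefix_tasks`, `extract_in_subtree`, `extract_motifs`, `min_freq`, `prefix_len`) are hypothetical and chosen for illustration. What it does show is that splitting candidate motifs by a fixed-length prefix yields sub-trees that can be searched independently, which is the property that makes such a search parallelizable.

```python
from itertools import product


def prefix_tasks(alphabet, prefix_len):
    """Split the candidate space into one independent sub-tree per prefix."""
    return ["".join(p) for p in product(alphabet, repeat=prefix_len)]


def count_occurrences(seq, pattern):
    """Count exact occurrences of pattern in seq, overlaps included."""
    count = 0
    start = seq.find(pattern)
    while start != -1:
        count += 1
        start = seq.find(pattern, start + 1)
    return count


def extract_in_subtree(seq, alphabet, prefix, length, min_freq):
    """Depth-first search below one prefix, pruning infrequent candidates."""
    motifs = {}
    stack = [prefix]
    while stack:
        cand = stack.pop()
        freq = count_occurrences(seq, cand)
        if freq < min_freq:
            # Extending a pattern can never increase its frequency,
            # so the whole sub-tree below cand can be skipped.
            continue
        if len(cand) == length:
            motifs[cand] = freq
        else:
            stack.extend(cand + symbol for symbol in alphabet)
    return motifs


def extract_motifs(seq, length, min_freq, prefix_len=1):
    """Return {motif: frequency} for exact motifs of the given length.

    Assumes 1 <= prefix_len <= length.
    """
    alphabet = sorted(set(seq))
    motifs = {}
    # Each prefix is an independent task; here the tasks run serially,
    # but they share no state and could be handed to separate workers.
    for prefix in prefix_tasks(alphabet, prefix_len):
        motifs.update(extract_in_subtree(seq, alphabet, prefix, length, min_freq))
    return motifs


if __name__ == "__main__":
    print(extract_motifs("ACGTACGTAC", length=2, min_freq=3))
```

A larger `prefix_len` produces more, smaller tasks, which in a parallel setting would trade scheduling overhead for better load balance; how ACME itself chooses its decomposition is not stated in the abstract.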