Motifs are frequent patterns used to identify biological functionality in genomic sequences, periodicity in time series, or user trends in web logs. In contrast to much existing work, which focuses on collections of many short sequences, modern applications require mining motifs in one very long sequence (i.e., on the order of several gigabytes). For this case, there exist statistical approaches that are fast but inaccurate, and combinatorial methods that are sound and complete. Unfortunately, existing combinatorial methods are serial and very slow. Consequently, they are limited to very short sequences (i.e., a few megabytes), small alphabets (typically 4 symbols for DNA sequences), and restricted types of motifs. This paper presents ACME, a combinatorial method for extracting motifs from a single very long sequence. ACME arranges the search space in contiguous blocks that take advantage of the cache hierarchy in modern architectures, and achieves almost an order of magnitude performance gain in serial execution. It also decomposes the search space in a way that allows scalability to thousands of processors with more than 90% speedup. ACME is the only method that: (i) scales to gigabyte-long sequences; (ii) handles large alphabets; (iii) supports interesting types of motifs with minimal additional cost; and (iv) is optimized for a variety of architectures such as multi-core systems, clusters in the cloud, and supercomputers. ACME reduces the extraction time for an exact-length query from 4 hours to 7 minutes on a typical workstation, handles sequences 3 orders of magnitude longer, and scales up to 16,384 cores on a supercomputer.
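To make the idea of a combinatorial, exact-length motif query concrete, the following is a minimal sketch and not ACME's implementation. It assumes a motif is a string of fixed length that occurs at least `min_freq` times in the sequence, where an occurrence may differ from the candidate in at most `max_dist` positions (Hamming distance). That definition, along with every function and parameter name below, is an assumption made for illustration. The search space is split by candidate prefix so that each prefix forms an independent sub-task, which is one simple way such a space could be handed to separate workers.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def count_occurrences(seq, candidate, max_dist):
    """Count windows of seq within max_dist mismatches of candidate."""
    k = len(candidate)
    return sum(
        1
        for i in range(len(seq) - k + 1)
        if hamming(seq[i:i + k], candidate) <= max_dist
    )


def search_prefix(seq, alphabet, prefix, length, max_dist, min_freq):
    """Exhaustively check every candidate that starts with prefix.

    Hypothetical helper: one call corresponds to one independent
    sub-task of the overall search space.
    """
    found = {}
    for tail in product(alphabet, repeat=length - len(prefix)):
        candidate = prefix + "".join(tail)
        freq = count_occurrences(seq, candidate, max_dist)
        if freq >= min_freq:
            found[candidate] = freq
    return found


def extract_motifs(seq, alphabet, length, max_dist, min_freq, prefix_len=1):
    """Enumerate all candidates of the given length, grouped by prefix.

    Every candidate is checked, so no qualifying motif is missed, at
    the cost of time exponential in the motif length.
    """
    motifs = {}
    for p in product(alphabet, repeat=prefix_len):
        # Each prefix is independent of the others and could be
        # assigned to a different worker; here they run serially.
        motifs.update(
            search_prefix(seq, alphabet, "".join(p), length, max_dist, min_freq)
        )
    return motifs


if __name__ == "__main__":
    print(extract_motifs("ACGTACGTACGA", "ACGT", length=4, max_dist=0, min_freq=2))
```

This brute-force version rescans the whole sequence for every candidate, so it is only practical for tiny inputs. It is meant to show the shape of the problem and of a prefix-based decomposition, not the blocked, cache-aware layout the abstract describes.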