Parallel algorithms for mining frequent structural motifs in scientific data

Authors:
Chao Wang; Srinivasan Parthasarathy
Affiliations:
The Ohio State University; The Ohio State University
Venue:
Proceedings of the 18th annual international conference on Supercomputing
Year:
2004

Abstract

Discovering important substructures in molecules is a significant data mining problem. The basic motivation is that the structure of a molecule plays a role in its biochemical function. There is interest in finding important, often recurrent, substructures both within a single molecule and across a class of molecules.

Recently, we developed a general-purpose suite of algorithms, the MotifMiner Toolkit, that can mine for structural motifs in a wide range of biomolecular datasets. While the algorithms have proven extremely useful in identifying novel substructures, they are quite time consuming. There are two reasons for this: (i) the algorithms inherently suffer from the curse of subgraph isomorphism; and (ii) handling noise effects (e.g., in protein structure data) results in a significant slowdown.

To address this problem, in this paper we propose parallelization strategies for these algorithms in a cluster environment. We identify key optimizations that handle load imbalance, scheduling, and communication overheads. Results show that the optimizations are quite effective and that we obtain good speedup on moderate-sized clusters.