Discovering important substructures in molecules is a key data mining problem. The basic motivation is that a molecule's structure plays a role in its biochemical function. There is interest in finding important, often recurrent, substructures both within a single molecule and across a class of molecules.

Recently, we developed a general-purpose suite of algorithms, the MotifMiner Toolkit, that can mine for structural motifs in a wide range of biomolecular datasets. While the algorithms have proven extremely useful in identifying novel substructures, they are quite time-consuming, for two reasons: (i) the algorithms inherently suffer from the curse of subgraph isomorphism; and (ii) handling noise effects (e.g., in protein structure data) results in a significant slowdown.

To address this problem, in this paper we propose parallelization strategies for these algorithms in a cluster environment. We identify key optimizations that address load imbalance, scheduling, and communication overheads. Results show that the optimizations are quite effective and that we obtain good speedup on moderate-sized clusters.