Parallel algorithms for mining frequent structural motifs in scientific data

  • Authors:
  • Chao Wang;Srinivasan Parthasarathy

  • Affiliations:
  • The Ohio State University;The Ohio State University

  • Venue:
  • Proceedings of the 18th annual international conference on Supercomputing
  • Year:
  • 2004

Quantified Score

Hi-index 0.00

Visualization

Abstract

Discovery of important substructures from molecules is an important data mining problem. The basic motivation is that the structure of a molecule has a role to play in its biochemical function. There is interest in finding important, often recurrent, substructures both within a single molecule and across a class of molecules.Recently, we have developed a general purpose suite of algorithms -- the MotifMiner Toolkit -- that can mine for structural motifs in a wide area of biomolecular datasets. While the algorithms have proven to be extremely useful in their ability to identify novel substructures, the algorithms themselves are quite time consuming. There are two reasons for this: i) inherently the algorithm suffers from the curse of subgraph isomorphism; and ii) handling noise effects (e.g. protein structure data) results in a significant slowdown.To address this problem in this paper we propose parallelization strategies in a cluster environment for the above algorithms. We identify key optimizations that handle load imbalance, scheduling, and communication overheads. Results show that the optimizations are quite effective and that we are able to obtain good speedup on moderate sized clusters.