Reengineering High-throughput Molecular Datasets for Scalable Clustering Using MapReduce

  • Authors:
  • Trilce Estrada;Boyu Zhang;Michela Taufer;Pietro Cicotti;Roger Armen

  • Affiliations:
  • -;-;-;-;-

  • Venue:
  • HPCC '12 Proceedings of the 2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems
  • Year:
  • 2012

Quantified Score

Hi-index 0.00

Visualization

Abstract

We propose a linear clustering approach for large datasets of molecular geometries produced by high-throughput molecular dynamics simulations (e.g., protein folding and protein-ligand docking simulations). To this scope, we transform each three-dimensional (3D) molecular conformation into a single point in the 3D space reducing the space complexity while still encoding the molecular similarities and geometries. We assign an identifier to each single 3D point mapping a docked ligand, generate a tree from the whole space, and apply a tree-based clustering on the reduced conformation space that identifies most dense hyperspaces. We adapt our method for MapReduce and implement it in Hadoop. The load-balancing, fault-tolerance, and scalability in MapReduce allows screening of very large conformation datasets not approachable with traditional clustering methods. We analyze results for datasets with different concentrations of optimal solutions, and draw conclusions about the limitations and usability of our method. The advantages of this approach make it attractive for complex applications in real-world high-throughput molecular simulations.