NIMBLE: a toolkit for the implementation of parallel data mining and machine learning algorithms on mapreduce

  • Authors:
  • Amol Ghoting;Prabhanjan Kambadur;Edwin Pednault;Ramakrishnan Kannan

  • Affiliations:
  • IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA;IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA;IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA;IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA

  • Venue:
  • Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
  • Year:
  • 2011

Quantified Score

Hi-index 0.00

Visualization

Abstract

In the last decade, advances in data collection and storage technologies have led to an increased interest in designing and implementing large-scale parallel algorithms for machine learning and data mining (ML-DM). Existing programming paradigms for expressing large-scale parallelism such as MapReduce (MR) and the Message Passing Interface (MPI) have been the de facto choices for implementing these ML-DM algorithms. The MR programming paradigm has been of particular interest as it gracefully handles large datasets and has built-in resilience against failures. However, the existing parallel programming paradigms are too low-level and ill-suited for implementing ML-DM algorithms. To address this deficiency, we present NIMBLE, a portable infrastructure that has been specifically designed to enable the rapid implementation of parallel ML-DM algorithms. The infrastructure allows one to compose parallel ML-DM algorithms using reusable (serial and parallel) building blocks that can be efficiently executed using MR and other parallel programming models; it currently runs on top of Hadoop, which is an open-source MR implementation. We show how NIMBLE can be used to realize scalable implementations of ML-DM algorithms and present a performance evaluation.