NIMBLE: a toolkit for the implementation of parallel data mining and machine learning algorithms on mapreduce

Authors:
Amol Ghoting;Prabhanjan Kambadur;Edwin Pednault;Ramakrishnan Kannan
Affiliations:
IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA;IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA;IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA;IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA
Venue:
Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2011

Abstract

In the last decade, advances in data collection and storage technologies have led to an increased interest in designing and implementing large-scale parallel algorithms for machine learning and data mining (ML-DM). Existing programming paradigms for expressing large-scale parallelism such as MapReduce (MR) and the Message Passing Interface (MPI) have been the de facto choices for implementing these ML-DM algorithms. The MR programming paradigm has been of particular interest as it gracefully handles large datasets and has built-in resilience against failures. However, the existing parallel programming paradigms are too low-level and ill-suited for implementing ML-DM algorithms. To address this deficiency, we present NIMBLE, a portable infrastructure that has been specifically designed to enable the rapid implementation of parallel ML-DM algorithms. The infrastructure allows one to compose parallel ML-DM algorithms using reusable (serial and parallel) building blocks that can be efficiently executed using MR and other parallel programming models; it currently runs on top of Hadoop, which is an open-source MR implementation. We show how NIMBLE can be used to realize scalable implementations of ML-DM algorithms and present a performance evaluation.
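
As a rough illustration of composing an algorithm from reusable building blocks, the following minimal sketch counts item support with one map step and one reduce step, then filters serially. It is not NIMBLE's API and does not run on Hadoop: `run_map_reduce`, `emit_items`, `sum_values`, `count_item_support`, and `frequent_items` are hypothetical names, and the "job" is an in-memory stand-in for a single MapReduce pass.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, Iterator, List, Tuple

KV = Tuple[Any, Any]


def run_map_reduce(
    records: Iterable[Any],
    mapper: Callable[[Any], Iterator[KV]],
    reducer: Callable[[Any, List[Any]], Any],
) -> Dict[Any, Any]:
    """Hypothetical in-memory stand-in for one MapReduce job (not Hadoop)."""
    groups: Dict[Any, List[Any]] = defaultdict(list)
    for record in records:
        # Map phase: each record emits zero or more (key, value) pairs.
        for key, value in mapper(record):
            groups[key].append(value)  # Shuffle: group values by key.
    # Reduce phase: collapse each key's values into a single result.
    return {key: reducer(key, values) for key, values in groups.items()}


# Reusable building blocks (assumed names, for illustration only).
def emit_items(transaction: List[str]) -> Iterator[KV]:
    """Map block: emit (item, 1) once per distinct item in a transaction."""
    for item in set(transaction):
        yield item, 1


def sum_values(key: Any, values: List[int]) -> int:
    """Reduce block: sum the counts collected for one key."""
    return sum(values)


def count_item_support(transactions: Iterable[List[str]]) -> Dict[str, int]:
    """Parallelizable block composed from the map and reduce blocks above."""
    return run_map_reduce(transactions, emit_items, sum_values)


def frequent_items(
    transactions: Iterable[List[str]], min_support: int
) -> Dict[str, int]:
    """Composite algorithm: a parallel counting step plus a serial filter."""
    support = count_item_support(transactions)
    return {item: n for item, n in support.items() if n >= min_support}


transactions = [["milk", "bread"], ["milk", "eggs"], ["bread", "milk", "eggs"]]
print(frequent_items(transactions, min_support=2))
```

The point of the sketch is the separation of concerns: the map and reduce blocks know nothing about how they are executed, so the same blocks could in principle be handed to a different execution backend without changing the algorithm that composes them.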