Mining association rules between sets of items in large databases
SIGMOD '93 Proceedings of the 1993 ACM SIGMOD international conference on Management of data
Machine Learning
IEEE Transactions on Knowledge and Data Engineering
A general framework for accurate and fast regression by data summarization in random decision trees
Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
Dryad: distributed data-parallel programs from sequential building blocks
Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
Fast mining of distance-based outliers in high-dimensional datasets
Data Mining and Knowledge Discovery
Pig latin: a not-so-foreign language for data processing
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Pfp: parallel fp-growth for query recommendation
Proceedings of the 2008 ACM conference on Recommender systems
PFunc: modern task parallelism for modern high performance computing
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
PLANET: massively parallel learning of tree ensembles with MapReduce
Proceedings of the VLDB Endowment
OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
LIBSVM: A library for support vector machines
ACM Transactions on Intelligent Systems and Technology (TIST)
Large-scale distributed non-negative sparse coding and sparse dictionary learning
Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining
BC-PDM: data mining, social network analysis and text mining system based on cloud computing
Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining
Unexpected challenges in large scale machine learning
Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications
PARMA: a parallel randomized algorithm for approximate association rules mining in MapReduce
Proceedings of the 21st ACM international conference on Information and knowledge management
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
The family of mapreduce and large-scale data processing systems
ACM Computing Surveys (CSUR)
Trends and outlook for the massive-scale analytics stack
IBM Journal of Research and Development
Hi-index | 0.00 |
In the last decade, advances in data collection and storage technologies have led to an increased interest in designing and implementing large-scale parallel algorithms for machine learning and data mining (ML-DM). Existing programming paradigms for expressing large-scale parallelism such as MapReduce (MR) and the Message Passing Interface (MPI) have been the de facto choices for implementing these ML-DM algorithms. The MR programming paradigm has been of particular interest as it gracefully handles large datasets and has built-in resilience against failures. However, the existing parallel programming paradigms are too low-level and ill-suited for implementing ML-DM algorithms. To address this deficiency, we present NIMBLE, a portable infrastructure that has been specifically designed to enable the rapid implementation of parallel ML-DM algorithms. The infrastructure allows one to compose parallel ML-DM algorithms using reusable (serial and parallel) building blocks that can be efficiently executed using MR and other parallel programming models; it currently runs on top of Hadoop, which is an open-source MR implementation. We show how NIMBLE can be used to realize scalable implementations of ML-DM algorithms and present a performance evaluation.