Machine learning tasks often need to be performed on large volumes of data. These tasks have applications in fields such as pattern recognition, data mining, bioinformatics, and recommendation systems. Here we evaluate the performance of four clustering algorithms and two classification algorithms supported by Mahout within two different cloud runtimes, Hadoop and Granules. Our benchmarks use the same Mahout backend code, ensuring a fair comparison. The differences between these implementations stem from how the Hadoop and Granules runtimes (1) support and manage the lifecycle of individual computations, and (2) orchestrate the exchange of data between different stages of the computational pipeline during successive iterations of the clustering algorithm. We include an analysis of our results for each of these algorithms in a distributed setting, as well as a discussion of measures for failure recovery.