On the performance of high dimensional data clustering and classification algorithms

Authors:
Kathleen Ericson;Shrideep Pallickara
Venue:
Future Generation Computer Systems
Year:
2013

Abstract

Machine learning tasks often need to be performed on large volumes of data. These tasks have applications in fields such as pattern recognition, data mining, bioinformatics, and recommendation systems. Here we evaluate the performance of four clustering algorithms and two classification algorithms supported by Mahout within two different cloud runtimes, Hadoop and Granules. Our benchmarks use the same Mahout backend code in both runtimes, ensuring a fair comparison. The differences between these implementations stem from how the Hadoop and Granules runtimes (1) support and manage the lifecycle of individual computations, and (2) orchestrate the exchange of data between different stages of the computational pipeline during successive iterations of the clustering algorithm. We include an analysis of our results for each of these algorithms in a distributed setting, as well as a discussion of measures for failure recovery.