SystemML: Declarative machine learning on MapReduce

Authors:
Amol Ghoting;Rajasekar Krishnamurthy;Edwin Pednault;Berthold Reinwald;Vikas Sindhwani;Shirish Tatikonda;Yuanyuan Tian;Shivakumar Vaithyanathan
Affiliations:
IBM Watson Research Center, USA;IBM Almaden Research Center, USA;IBM Watson Research Center, USA;IBM Almaden Research Center, USA;IBM Watson Research Center, USA;IBM Almaden Research Center, USA;IBM Almaden Research Center, USA;IBM Almaden Research Center, USA
Venue:
ICDE '11 Proceedings of the 2011 IEEE 27th International Conference on Data Engineering
Year:
2011

Cited by (24):

Learning-based entity resolution with MapReduce. Proceedings of the third international workshop on Cloud data management
Matrix chain multiplication via multi-way join algorithms in MapReduce. Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication
Large-scale machine learning at Twitter. SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Large-scale distributed non-negative sparse coding and sparse dictionary learning. Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining
The MADlib analytics library: or MAD skills, the SQL. Proceedings of the VLDB Endowment
M3R: increased performance for in-memory Hadoop jobs. Proceedings of the VLDB Endowment
Sparkler: supporting large-scale matrix factorization. Proceedings of the 16th International Conference on Extending Database Technology
Minimal MapReduce algorithms. Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Cumulon: optimizing statistical data analysis in the cloud. Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Simulation of database-valued Markov chains using SimSQL. Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Scaling big data mining infrastructure: the Twitter experience. ACM SIGKDD Explorations Newsletter
Big graph mining: algorithms and discoveries. ACM SIGKDD Explorations Newsletter
Upper and lower bounds on the cost of a map-reduce computation. Proceedings of the VLDB Endowment
Distributed data management using MapReduce. ACM Computing Surveys (CSUR)
Distributed matrix factorization with MapReduce using a series of broadcast-joins. Proceedings of the 7th ACM conference on Recommender systems
The family of MapReduce and large-scale data processing systems. ACM Computing Surveys (CSUR)
CG_Hadoop: computational geometry in MapReduce. Proceedings of the 21st ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems
Compiling machine learning algorithms with SystemML. Proceedings of the 4th annual Symposium on Cloud Computing
Next generation data analytics at IBM research. Proceedings of the VLDB Endowment
A demonstration of SpatialHadoop: an efficient MapReduce framework for spatial data. Proceedings of the VLDB Endowment
Speeding-up codon analysis on the cloud with local MapReduce aggregation. Information Sciences: an International Journal
Exploiting inter-operation parallelism for matrix chain multiplication using MapReduce. The Journal of Supercomputing
Understanding system design for big data workloads. IBM Journal of Research and Development
A platform for eXtreme analytics. IBM Journal of Research and Development

Abstract

MapReduce is emerging as a generic parallel programming paradigm for large clusters of machines. This trend, combined with the growing need to run machine learning (ML) algorithms on massive datasets, has led to increased interest in implementing ML algorithms on MapReduce. However, the cost of implementing a large class of ML algorithms as low-level MapReduce jobs on varying data and machine cluster sizes can be prohibitive. In this paper, we propose SystemML, in which ML algorithms are expressed in a higher-level language and are compiled and executed in a MapReduce environment. This higher-level language exposes several constructs, including linear algebra primitives, that constitute key building blocks for a broad class of supervised and unsupervised ML algorithms. The algorithms expressed in SystemML are compiled and optimized into a set of MapReduce jobs that can run on a cluster of machines. We describe and empirically evaluate a number of optimization strategies for efficiently executing these algorithms on Hadoop, an open-source MapReduce implementation. We report an extensive performance evaluation of three ML algorithms on varying data and cluster sizes.
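
To make the idea of running a linear algebra primitive as a MapReduce job more concrete, the Python sketch below multiplies two matrices block by block using a map phase and a reduce phase. It is an illustration only, under stated assumptions: it is not SystemML's language, its compiler output, or any strategy described in the paper, and it runs in memory rather than on Hadoop. All names (split_blocks, map_phase, reduce_phase, block_matmul) are hypothetical and chosen for this example.

```python
from collections import defaultdict


def split_blocks(matrix, block):
    """Cut a dense list-of-lists matrix into {(block_row, block_col): tile}."""
    blocks = {}
    for i in range(0, len(matrix), block):
        for j in range(0, len(matrix[0]), block):
            tile = [row[j:j + block] for row in matrix[i:i + block]]
            blocks[(i // block, j // block)] = tile
    return blocks


def tile_multiply(a, b):
    """Plain dense product of two small tiles."""
    inner = len(b)
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def tile_add(a, b):
    """Element-wise sum of two tiles with the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def map_phase(a_blocks, b_blocks):
    """Emit ((i, j), partial_tile) for every matching pair A[i,k], B[k,j]."""
    for (i, k), a_tile in a_blocks.items():
        for (k2, j), b_tile in b_blocks.items():
            if k == k2:
                yield (i, j), tile_multiply(a_tile, b_tile)


def reduce_phase(pairs):
    """Group partial tiles by output block key and sum them."""
    grouped = defaultdict(list)
    for key, tile in pairs:
        grouped[key].append(tile)
    result = {}
    for key, tiles in grouped.items():
        total = tiles[0]
        for tile in tiles[1:]:
            total = tile_add(total, tile)
        result[key] = total
    return result


def block_matmul(a, b, block=2):
    """Simulate one map/reduce round that computes A x B as output blocks."""
    return reduce_phase(map_phase(split_blocks(a, block), split_blocks(b, block)))


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6]]
    B = [[7, 8], [9, 10], [11, 12]]
    print(block_matmul(A, B, block=2))
```

In this sketch the block size is the main tuning knob: larger tiles mean fewer key/value pairs to shuffle but more work per map call. This is the kind of trade-off that, per the abstract, a system would need to handle across varying data and cluster sizes, although the specific choices here are assumptions made for illustration.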