Cumulon: optimizing statistical data analysis in the cloud

Authors: Botong Huang; Shivnath Babu; Jun Yang
Affiliations: Duke University, Durham, NC, USA (all authors)
Venue: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Year: 2013


Abstract

We present Cumulon, a system designed to help users rapidly develop and intelligently deploy matrix-based big-data analysis programs in the cloud. Cumulon features a flexible execution model and new operators especially suited for such workloads. We show how to implement Cumulon on top of Hadoop/HDFS while avoiding the limitations of MapReduce, and we demonstrate Cumulon's performance advantages over existing Hadoop-based systems for statistical data analysis. To support intelligent deployment in the cloud under time and budget constraints, Cumulon goes beyond database-style optimization: it automatically chooses not only physical operators and their parameters, but also hardware provisioning and configuration settings. We apply a suite of benchmarking, simulation, modeling, and search techniques to support effective cost-based optimization over this rich space of deployment plans.
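
To make the idea of cost-based search over deployment plans concrete, the following is a minimal Python sketch, not Cumulon's actual optimizer. It enumerates hypothetical (machine type, cluster size) pairs, estimates running time with a toy Amdahl-style model, and returns the cheapest plan that meets a deadline. Everything here is an assumption made for illustration: the names `MachineType`, `Plan`, `estimate_hours`, and `cheapest_plan`, the prices and speeds a caller would supply, the 5% serial fraction, and the whole-hour billing rule. The abstract indicates that the real system derives its estimates from benchmarking, simulation, and modeling rather than from a closed-form formula like this one.

```python
from dataclasses import dataclass
from itertools import product
from math import ceil
from typing import Iterable, Optional


@dataclass(frozen=True)
class MachineType:
    """Hypothetical machine offering; price and speed are made-up inputs."""
    name: str
    hourly_price: float  # dollars per machine-hour
    speed: float         # work units processed per hour by one machine


@dataclass(frozen=True)
class Plan:
    """One candidate deployment plan with its estimated time and cost."""
    machine: MachineType
    num_machines: int
    est_hours: float
    est_cost: float


def estimate_hours(work_units: float, machine: MachineType, n: int,
                   serial_fraction: float = 0.05) -> float:
    """Toy time model (assumption): a fixed serial fraction of the job
    does not speed up as machines are added; the rest scales linearly."""
    single_machine_hours = work_units / machine.speed
    return single_machine_hours * (serial_fraction + (1 - serial_fraction) / n)


def cheapest_plan(work_units: float,
                  machines: Iterable[MachineType],
                  cluster_sizes: Iterable[int],
                  deadline_hours: float) -> Optional[Plan]:
    """Exhaustively search the plan space and return the cheapest plan
    whose estimated time meets the deadline, or None if none does."""
    best: Optional[Plan] = None
    for machine, n in product(list(machines), list(cluster_sizes)):
        hours = estimate_hours(work_units, machine, n)
        if hours > deadline_hours:
            continue  # violates the time constraint
        # Assumed billing rule: each machine is charged per started hour.
        cost = ceil(hours) * n * machine.hourly_price
        if best is None or cost < best.est_cost:
            best = Plan(machine, n, hours, cost)
    return best
```

A caller would pass a list of candidate machine types, a list of cluster sizes, and a deadline, then inspect the returned `Plan`. Swapping the constraint (minimize time subject to a budget) only changes the feasibility check and the comparison. A fuller search would also range over physical operators and configuration settings, which this sketch omits.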