Mammoth: autonomic data processing framework for scientific state-transition applications

Authors:
Xin Yang;Ze Yu;Min Li;Xiaolin Li
Affiliations:
University of Florida;University of Florida;University of Florida;University of Florida
Venue:
Proceedings of the 2013 ACM Cloud and Autonomic Computing Conference
Year:
2013

Abstract

Scientific computing is becoming increasingly data-intensive, and more high-impact discoveries rely on efficient processing of big scientific data. Popular MapReduce frameworks such as Hadoop offer an alternative to conventional solutions (e.g., MPI or OpenMP). However, they perform only moderately well on state-transition applications. There are three key challenges: (1) these applications generate inflated intermediate data that may saturate the network; (2) they may incur substantial synchronization overhead if not managed well; (3) dynamically evolving scientific phenomena result in heterogeneous data distributions, causing significant computation skew. In this paper, we propose Mammoth, an autonomic parallel data processing framework for scientific state-transition applications. Mammoth features a MapReduce-style programming model that is familiar to users. To address these challenges, it is further enhanced with a series of optimizations that parallelize the computation automatically and efficiently. We evaluate Mammoth using a weather prediction application with real-world datasets. The experimental evaluation demonstrates that Mammoth is competitive with the MPI-based solution and at least 30% faster than the optimized Hadoop-based solution.