ParaTimer: a progress indicator for MapReduce DAGs

Authors:
Kristi Morton;Magdalena Balazinska;Dan Grossman
Affiliations:
University of Washington, Seattle, WA, USA;University of Washington, Seattle, WA, USA;University of Washington, Seattle, WA, USA
Venue:
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Year:
2010

Abstract

Time-oriented progress estimation for parallel queries is a challenging problem that has received only limited attention. In this paper, we present ParaTimer, a new type of time-remaining indicator for parallel queries. Of the several parallel data processing systems that exist, ParaTimer targets environments where declarative queries are translated into ensembles of MapReduce jobs. ParaTimer builds on previous techniques and makes two key contributions. First, it estimates the progress of queries that translate into directed acyclic graphs (DAGs) of MapReduce jobs, where jobs on different paths can execute concurrently, unlike prior work that considered only sequences of jobs. For such queries, we use a new type of critical-path-based progress-estimation approach. Second, ParaTimer handles a variety of real systems challenges such as failures and data skew. To handle unexpected changes in query execution times due to changing runtime conditions, ParaTimer provides users with not just one but a set of time-remaining estimates, each corresponding to a different, carefully selected scenario. We implement our estimator in the Pig system and evaluate its performance through experiments on a real, small-scale cluster.
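To make the critical-path idea concrete, the following is a minimal sketch and not ParaTimer's actual estimator, whose details the abstract does not give. It assumes each job already has an estimate of its remaining work in seconds and that jobs on different paths run fully in parallel. The function name `time_remaining`, the job names, the timings, and the 30-second failure penalty are all hypothetical. The sketch takes the time remaining for a query to be the longest path through the DAG of jobs, and produces one estimate per scenario rather than a single number.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def time_remaining(remaining, deps):
    """Longest-path (critical-path) estimate over a DAG of jobs.

    remaining: job -> estimated seconds of work left (hypothetical input)
    deps:      job -> set of jobs that must finish before it can start
    """
    # Include every job in the graph, even those with no dependencies.
    graph = {job: deps.get(job, set()) for job in remaining}
    finish = {}
    for job in TopologicalSorter(graph).static_order():
        # A job starts once its slowest predecessor has finished.
        start = max((finish[d] for d in graph[job]), default=0.0)
        finish[job] = start + remaining[job]
    return max(finish.values(), default=0.0)


# Hypothetical query plan: A and B run concurrently; C needs both; D needs A.
deps = {"C": {"A", "B"}, "D": {"A"}}
remaining = {"A": 30, "B": 50, "C": 20, "D": 45}

# One estimate per scenario (assumed scenarios, for illustration only).
scenarios = {
    "no failures": remaining,
    "failure in B": {**remaining, "B": remaining["B"] + 30},  # assumed rerun cost
}
estimates = {name: time_remaining(r, deps) for name, r in scenarios.items()}
print(estimates)
```

In the failure-free scenario the critical path is A then D, so the estimate is lower than the sum of all remaining work; delaying B shifts the critical path to B then C, which is why a single number can mislead when runtime conditions change.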