Recently, Hadoop, an open-source implementation of MapReduce, has become very popular thanks to its simple programming model and its built-in support for distributed computing and fault tolerance. Although Hadoop automatically reschedules failed tasks, it cannot deal with tasks that are merely slow, even though such tasks drag down the performance of the whole job. In this work, we design a novel garbage collection technique that identifies and collects such "garbage" (slow) tasks. We address three research questions. First, does collecting (shutting down) garbage tasks reduce the total job completion time and resource cost? Second, when is it most efficient to invoke the Garbage Collector? Third, how can garbage tasks be identified, and what are the major factors causing a task to slow down? The proposed Garbage Collector is evaluated on Amazon EC2 using two metrics: (i) the completion time of a single job, and (ii) resource cost. Empirical results with the TeraSort benchmark show that collecting garbage tasks reduces job completion time by 16% and resource cost by 27%. The results also show that the Garbage Collector must be invoked before the job is 40% complete; beyond that point, the cost of re-executing slow tasks becomes high, and it is better to leave them running until the end of the job. Finally, our results show that CPU utilization is a good indicator of slow tasks.
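The abstract's two main findings, that the Garbage Collector should only run before the job is 40% complete and that low CPU utilization flags a slow task, combine into a simple decision rule. The sketch below illustrates that rule; the specific CPU-utilization threshold is an illustrative assumption, not a value reported in the paper.

```python
def should_collect(job_progress: float, task_cpu_utilization: float,
                   cpu_threshold: float = 0.2) -> bool:
    """Decide whether a slow ("garbage") task should be collected,
    i.e. shut down and rescheduled on another node.

    job_progress          -- fraction of the job completed, in [0, 1]
    task_cpu_utilization  -- the task's CPU utilization, in [0, 1]
    cpu_threshold         -- assumed cutoff below which a task counts
                             as slow (hypothetical value, not from
                             the paper)
    """
    if job_progress >= 0.40:
        # Past 40% completion, re-executing a slow task costs more
        # than letting it run to the end of the job.
        return False
    # Before that point, low CPU utilization identifies the task
    # as "garbage" and it is worth collecting.
    return task_cpu_utilization < cpu_threshold
```

For example, a task at 10% CPU utilization would be collected while the job is 20% complete, but left alone once the job passes the 40% mark.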