MapReduce "garbage" collection

  • Authors: Shady Khalifa; Tianbin Jiang; Patrick Martin

  • Affiliations: Queen's University, ON, Canada; Queen's University, ON, Canada; Queen's University, ON, Canada

  • Venue: CASCON '13 Proceedings of the 2013 Conference of the Center for Advanced Studies on Collaborative Research
  • Year: 2013

Abstract

Hadoop, an open source implementation of MapReduce, has recently become very popular thanks to characteristics such as its simple programming syntax and its support for distributed computing and fault tolerance. Although Hadoop can automatically reschedule failed tasks, it has no mechanism for dealing with tasks that perform poorly. Managing such tasks is vital because they lower the performance of the whole job. In this work, we therefore design a novel garbage collection technique that identifies and collects "garbage" tasks. We address three research questions. First, does collecting (shutting down) garbage (slow) tasks reduce the total job completion time and resource cost? Second, when is it most efficient to invoke the Garbage Collector? Third, how can garbage (slow) tasks be identified, and what are the major factors that cause a task to slow down? The proposed Garbage Collector is evaluated on Amazon EC2 using two metrics: (i) the completion time of a single job, and (ii) resource cost. Empirical results using the TeraSort benchmark show that collecting garbage tasks reduces job completion time by 16% and resource cost by 27%. The results also show that the Garbage Collector needs to be invoked before the job is 40% complete; beyond that point, it is better to leave the slow tasks running until the end of the job, because the cost of re-executing them becomes high. Finally, our results show that CPU utilization is a good indicator of slow tasks.
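
The decision rule described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. Only the 40% invocation cutoff comes from the abstract; the `Task` record, the `select_garbage_tasks` function, the CPU threshold value, and the assumption that low CPU utilization marks a slow task are all hypothetical and chosen for illustration, since the abstract gives neither a threshold nor a direction.

```python
from dataclasses import dataclass
from typing import List

# From the abstract: collection pays off only before the job is 40% complete.
PROGRESS_CUTOFF = 0.40
# Hypothetical threshold: the abstract does not give a value or a direction.
CPU_THRESHOLD = 0.20


@dataclass
class Task:
    """Hypothetical task record; not a Hadoop API type."""
    task_id: str
    cpu_utilization: float  # fraction in [0.0, 1.0]
    finished: bool = False


def select_garbage_tasks(
    tasks: List[Task],
    job_progress: float,
    cpu_threshold: float = CPU_THRESHOLD,
    progress_cutoff: float = PROGRESS_CUTOFF,
) -> List[str]:
    """Return the ids of running tasks to collect (shut down and re-execute).

    Past the progress cutoff nothing is collected, because re-executing
    slow tasks late in the job is reported to cost more than it saves.
    """
    if job_progress >= progress_cutoff:
        return []
    # Assumption: low CPU utilization is treated as the sign of a slow task.
    return [
        t.task_id
        for t in tasks
        if not t.finished and t.cpu_utilization < cpu_threshold
    ]


tasks = [
    Task("t1", cpu_utilization=0.90),
    Task("t2", cpu_utilization=0.05),
    Task("t3", cpu_utilization=0.10, finished=True),
]
early = select_garbage_tasks(tasks, job_progress=0.25)
late = select_garbage_tasks(tasks, job_progress=0.60)
```

With these sample values, only the running low-utilization task is selected at 25% progress, and no task is selected at 60% progress because the job is past the cutoff.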