The design and implementation of a log-structured file system
ACM Transactions on Computer Systems (TOCS)
MapReduce: simplified data processing on large clusters
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling
Proceedings of the 5th European conference on Computer systems
An Analysis of Traces from a Production MapReduce Cluster
CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
NSDI'10 Proceedings of the 7th USENIX conference on Networked systems design and implementation
The Hadoop Distributed File System
MSST '10 Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST)
Mesos: a platform for fine-grained resource sharing in the data center
Proceedings of the 8th USENIX conference on Networked systems design and implementation
Dominant resource fairness: fair allocation of multiple resource types
Proceedings of the 8th USENIX conference on Networked systems design and implementation
A platform for scalable one-pass analytics using MapReduce
Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Hadoop acceleration through network levitated merge
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Tarazu: optimizing MapReduce on heterogeneous clusters
ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing
NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
PACMan: coordinated memory caching for parallel jobs
NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
Delay tails in MapReduce scheduling
Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE joint international conference on Measurement and Modeling of Computer Systems
Sailfish: a framework for large scale data processing
Proceedings of the Third ACM Symposium on Cloud Computing
Themis: an I/O-efficient MapReduce
Proceedings of the Third ACM Symposium on Cloud Computing
Balancing reducer skew in MapReduce workloads using progressive sampling
Proceedings of the Third ACM Symposium on Cloud Computing
Interference and locality-aware task scheduling for MapReduce applications in virtual clusters
Proceedings of the 22nd international symposium on High-performance parallel and distributed computing
JVM-Bypass for Efficient Hadoop Shuffling
IPDPS '13 Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing
Hi-index | 0.00 |
Hadoop is a widely adopted open source implementation of MapReduce programming model for big data processing. It represents system resources as available map and reduce slots and assigns them to various tasks. This execution model gives little regard to the need of cross-task coordination on the use of shared system resources on a compute node, which results in task interference. In addition, the existing Hadoop merge algorithm can cause excessive I/O. In this study, we undertake an effort to address both issues. Accordingly, we have designed a cross-task coordination framework called CooMR for efficient data management in MapReduce programs. CooMR consists of three component schemes including cross-task opportunistic memory sharing and log-structured I/O consolidation, which are designed to facilitate task coordination, and the key-based in-situ merge (KISM) algorithm which is designed to enable the sorting/merging of Hadoop intermediate data without actually moving the pairs. Our evaluation demonstrates that CooMR is able to increase task coordination, improve system resource utilization, and significantly speed up the execution time of MapReduce programs.