MapReduce: simplified data processing on large clusters
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Quincy: fair scheduling for distributed computing clusters
Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
NapSAC: design and implementation of a power-proportional web cluster
Proceedings of the first ACM SIGCOMM workshop on Green networking
Improving MapReduce performance in heterogeneous environments
OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
Dremel: interactive analysis of web-scale datasets
Proceedings of the VLDB Endowment
Reining in the outliers in map-reduce clusters using Mantri
OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Scarlett: coping with skewed content popularity in mapreduce clusters
Proceedings of the sixth conference on Computer systems
Power management of online data-intensive services
Proceedings of the 38th annual international symposium on Computer architecture
Benefits and limitations of tapping into stored energy for datacenters
Proceedings of the 38th annual international symposium on Computer architecture
Better never than late: meeting deadlines in datacenter networks
Proceedings of the ACM SIGCOMM 2011 conference
Warehouse-Scale Computing: Entering the Teenage Decade
Proceedings of the 38th annual international symposium on Computer architecture
Energy efficiency for large-scale MapReduce workloads with significant interactive analysis
Proceedings of the 7th ACM european conference on Computer Systems
Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing
NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
PACMan: coordinated memory caching for parallel jobs
NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
More is less: reducing latency via redundancy
Proceedings of the 11th ACM Workshop on Hot Topics in Networks
Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles
ACM SIGOPS 24th Symposium on Operating Systems Principles
Sparrow: distributed, low latency scheduling
Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles
Proceedings of the ninth ACM conference on Emerging networking experiments and technologies
PIKACHU: how to rebalance load in optimizing mapreduce on heterogeneous clusters
USENIX ATC'13 Proceedings of the 2013 USENIX conference on Annual Technical Conference
Hi-index | 0.00 |
Despite prior research on outlier mitigation, our analysis of jobs from the Facebook cluster shows that outliers still occur, especially in small jobs. Small jobs are particularly sensitive to long-running outlier tasks because of their interactive nature. Outlier mitigation strategies rely on comparing different tasks of the same job and launching speculative copies for the slower tasks. However, small jobs execute all their tasks simultaneously, thereby not providing sufficient time to observe and compare tasks. Building on the observation that clusters are underutilized, we take speculation to its logical extreme--run full clones of jobs to mitigate the effect of outliers. The heavy-tail distribution of job sizes implies that we can impact most jobs without using much resources. Trace-driven simulations show that average completion time of all the small jobs improves by 47% using cloning, at the cost of just 3% extra resources.