On the efficacy, efficiency and emergent behavior of task replication in large distributed systems

Authors:
Walfredo Cirne; Francisco Brasileiro; Daniel Paranhos; Luís Fabrício W. Góes; William Voorsluys
Affiliations:
Universidade Federal de Campina Grande, Departamento de Sistemas e Computação, Brazil; Universidade Federal de Campina Grande, Departamento de Sistemas e Computação, Brazil; Universidade Federal de Campina Grande, Departamento de Sistemas e Computação, Brazil; Pontifícia Universidade Católica de Minas Gerais, Instituto de Informática, Brazil; Universidade Federal de Campina Grande, Departamento de Sistemas e Computação, Brazil
Venue:
Parallel Computing
Year:
2007

Abstract

Large distributed systems challenge traditional schedulers, because such schedulers take as input how long each task will take to complete on each resource, and this information is often hard to determine a priori. Task replication has been applied in a variety of scenarios as a way to circumvent this problem. Task replication consists of dispatching multiple replicas of a task and using the result from the first replica to finish. Replication schedulers (i.e., schedulers that employ task replication) are able to achieve good performance even in the absence of information on tasks and resources. They are also less complex than traditional schedulers, which makes them better suited to large distributed systems. On the other hand, replication schedulers waste cycles on the replicas that are not the first to finish. Moreover, this extra consumption of resources raises severe concerns about the system-wide performance of a distributed system with multiple, competing replication schedulers. This paper presents a comprehensive study of task replication, comparing replication schedulers against traditional information-based schedulers, and establishing their efficacy (the performance delivered to the application), efficiency (the amount of resources wasted), and emergent behavior (the system-wide behavior of a system with multiple replication schedulers). We also introduce a simple access control strategy that can be implemented locally by each resource and greatly improves the overall performance of a system in which multiple replication schedulers compete for resources.
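
The trade-off the abstract describes, between the first replica's finish time and the cycles spent on the losing replicas, can be illustrated with a small sketch. This is not the paper's simulator or scheduler. It assumes that all replicas start at the same instant and that the losing replicas are cancelled the moment the first one finishes. The function name `replicate` and its `runtimes` parameter are hypothetical and introduced only for illustration.

```python
from typing import Sequence, Tuple


def replicate(runtimes: Sequence[float]) -> Tuple[float, float]:
    """Model one task dispatched as several simultaneous replicas.

    `runtimes` holds the time the task would need on each resource that
    receives a replica (hypothetical input; a replication scheduler does
    not know these values in advance).

    Assumptions of this sketch: every replica starts at time 0, and the
    remaining replicas are cancelled as soon as the first one finishes.

    Returns (finish_time, wasted_cycles).
    """
    if not runtimes:
        raise ValueError("at least one replica is required")

    # The task completes when the fastest replica completes.
    finish_time = min(runtimes)

    # Every replica runs until that moment, so total consumption is
    # finish_time per replica; only one replica's work is useful.
    consumed = finish_time * len(runtimes)
    wasted_cycles = consumed - finish_time
    return finish_time, wasted_cycles


if __name__ == "__main__":
    finish, wasted = replicate([30.0, 12.0, 45.0])
    print(f"finish time: {finish}, wasted cycles: {wasted}")
```

With three replicas whose runtimes would be 30, 12, and 45 time units, the task finishes after 12 units regardless of which resource turned out to be fast, at the cost of 24 units of work on the two cancelled replicas. A scheduler that dispatched a single copy without runtime information could have waited 30 or 45 units instead.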