Solving the straggler problem with bounded staleness

Authors:
James Cipar;Qirong Ho;Jin Kyu Kim;Seunghak Lee;Gregory R. Ganger;Garth Gibson;Kimberly Keeton;Eric Xing
Affiliations:
Carnegie Mellon University;Carnegie Mellon University;Carnegie Mellon University;Carnegie Mellon University;Carnegie Mellon University;Carnegie Mellon University;HP Labs;Carnegie Mellon University
Venue:
HotOS'13 Proceedings of the 14th USENIX conference on Hot Topics in Operating Systems
Year:
2013

Abstract

Many important applications fall into the broad class of iterative convergent algorithms. Parallel implementations of these algorithms are naturally expressed using the Bulk Synchronous Parallel (BSP) model of computation. However, implementations using BSP are plagued by the straggler problem, where a transient slowdown of any single thread can delay all other threads. This paper presents the Stale Synchronous Parallel (SSP) model as a generalization of BSP that preserves many of its advantages while avoiding the straggler problem. Algorithms using SSP can execute efficiently even when some threads are significantly delayed.
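
To make the contrast between BSP and SSP concrete, the following is a minimal sketch of a toy timing model, not the paper's implementation. It assumes one particular reading of bounded staleness: a worker may begin iteration i only after every worker has finished iteration i - s - 1, where s is the staleness bound, so s = 0 reduces to a BSP barrier after every iteration. The function name `makespan`, the `durations` layout, and the example numbers are all hypothetical and chosen only for illustration.

```python
def makespan(durations, staleness):
    """Total time for all workers to finish, under a toy bounded-staleness rule.

    durations[w][i] is the time worker w spends computing iteration i.
    Assumed rule (illustrative): iteration i may start only once every worker
    has finished iteration i - staleness - 1. staleness = 0 behaves like BSP.
    """
    n_workers = len(durations)
    n_iters = len(durations[0])
    finish = [[0.0] * n_iters for _ in range(n_workers)]

    for i in range(n_iters):
        gate = i - staleness - 1  # iteration everyone must have completed
        if gate >= 0:
            barrier = max(finish[v][gate] for v in range(n_workers))
        else:
            barrier = 0.0  # no constraint yet: nobody can be too far ahead
        for w in range(n_workers):
            own_previous = finish[w][i - 1] if i > 0 else 0.0
            # Start when both the worker itself and the staleness gate allow.
            finish[w][i] = max(own_previous, barrier) + durations[w][i]

    return max(row[-1] for row in finish)


# Two workers, four iterations; each worker is a transient straggler once.
durations = [
    [3, 1, 1, 1],  # worker 0 is slow on iteration 0
    [1, 1, 3, 1],  # worker 1 is slow on iteration 2
]

for s in (0, 1, 2):
    print(f"staleness={s}: makespan={makespan(durations, s)}")
# staleness=0: makespan=8.0
# staleness=1: makespan=7.0
# staleness=2: makespan=6.0
```

In this toy example each worker has 6 units of work in total. With s = 0 every transient slowdown stalls the other worker, giving a makespan of 8; with s = 2 the two slowdowns overlap and the makespan falls to 6. The model only captures timing. It says nothing about how reading stale values affects convergence, which is the other half of the trade-off.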