We found that interactive services at Bing have highly variable datacenter-side processing latencies because their processing consists of many sequential stages, parallelization across tens to thousands of servers, and aggregation of responses across the network. To improve the tail latency of such services, we use a few building blocks: reissuing laggards elsewhere in the cluster, new policies that return incomplete results, and speeding up laggards by giving them more resources. Combining these building blocks to reduce overall latency is non-trivial because, for the same amount of resource (e.g., number of reissues), different stages improve their latency by different amounts. We present Kwiken, a framework that takes an end-to-end view of latency improvements and costs. It decomposes the problem of minimizing latency over a general processing DAG into a manageable optimization over individual stages. Through simulations with production traces, we show sizable gains: the 99th percentile of latency improves by over 50% when just 0.1% of responses are allowed to return partial results, and by over 40% for 25% of the services when just 5% extra resources are used for reissues.
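To illustrate why reissuing laggards helps tail latency, here is a small Monte Carlo sketch (our own illustration, not code from the paper): each request's service time is drawn from an assumed long-tailed distribution, and a request still outstanding after a timeout is duplicated, finishing when either copy does. The distribution parameters and timeout below are hypothetical.

```python
# Hypothetical sketch of one tail-latency building block: reissue a request
# on another server if its first copy has not finished within a timeout.
import random

def percentile(samples, p):
    """Nearest-rank p-th percentile of a list of latency samples."""
    s = sorted(samples)
    return s[min(len(s) - 1, int(p / 100.0 * len(s)))]

def sample_latency(rng):
    # Assumed long-tailed service time: 99% of requests are fast
    # (mean 1 unit), 1% land on a slow path (50+ units).
    if rng.random() < 0.99:
        return rng.expovariate(1.0)
    return 50.0 + rng.expovariate(0.1)

def simulate(n, timeout, rng):
    base, reissued = [], []
    for _ in range(n):
        first = sample_latency(rng)
        base.append(first)
        if first > timeout:
            # First copy is a laggard: issue a duplicate at `timeout`;
            # the request completes when either copy finishes.
            second = timeout + sample_latency(rng)
            reissued.append(min(first, second))
        else:
            reissued.append(first)
    return base, reissued

rng = random.Random(42)
base, reissued = simulate(100_000, timeout=10.0, rng=rng)
print("p99 without reissue: %.1f" % percentile(base, 99))
print("p99 with reissue:    %.1f" % percentile(reissued, 99))
```

Under these assumptions only the few percent of requests that exceed the timeout are duplicated, so the extra resource cost stays small while the 99th percentile drops sharply, which mirrors the trade-off between reissue budget and tail latency that the abstract describes.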