Ricochet: lateral error correction for time-critical multicast

  • Authors:
  • Mahesh Balakrishnan;Ken Birman;Amar Phanishayee;Stefan Pleisch

  • Affiliations:
  • Cornell University;Cornell University;Carnegie Mellon University;Cornell University

  • Venue:
  • NSDI '07: Proceedings of the 4th USENIX Conference on Networked Systems Design & Implementation
  • Year:
  • 2007

Abstract

Ricochet is a low-latency reliable multicast protocol designed for time-critical clustered applications. It uses IP Multicast to transmit data and recovers from packet loss at end-hosts using Lateral Error Correction (LEC), a novel repair mechanism in which XORs are exchanged between receivers and combined across overlapping groups. In datacenters and clusters, application needs frequently dictate large numbers of fine-grained overlapping multicast groups. Existing multicast reliability schemes scale poorly in such settings: their packet recovery latency depends inversely on the data rate within a single group, so the lower the data rate, the longer it takes to recover lost packets. LEC is insensitive to the rate of data in any one group and allows each node to split its bandwidth among hundreds to thousands of fine-grained multicast groups without sacrificing timely packet recovery. As a result, Ricochet provides developers with a scalable, reliable and fast multicast primitive to layer under high-level abstractions such as publish-subscribe, group communication and replicated service/object infrastructures. We evaluate Ricochet on a 64-node cluster with up to 1024 groups per node: under various loss rates, it recovers almost all packets using LEC in tens of milliseconds and the remainder with reactive traffic within 200 milliseconds.
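
The abstract does not spell out how an XOR exchanged between receivers repairs a loss, so the following is a minimal sketch of the general idea, not Ricochet's actual implementation. It assumes equal-length payloads and identifies packets by a hypothetical (group, sequence number) pair; the names xor_bytes, make_repair and recover are invented for illustration. One receiver XORs packets it has received, possibly from several groups, and sends the result to a peer; the peer can rebuild a packet only if it is missing exactly one of the packets covered.

```python
from __future__ import annotations
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("payloads must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def make_repair(received: dict, ids: list) -> tuple:
    """Build a repair: the covered packet ids plus the XOR of their payloads.

    `received` maps a hypothetical (group, seq) id to its payload.
    """
    payload = reduce(xor_bytes, (received[i] for i in ids))
    return list(ids), payload


def recover(received: dict, repair: tuple):
    """Return (id, payload) of the single missing packet, or None.

    A single XOR can repair only one loss among the packets it covers.
    """
    ids, payload = repair
    missing = [i for i in ids if i not in received]
    if len(missing) != 1:
        return None
    # XOR out every packet we already hold; what remains is the lost one.
    for i in ids:
        if i in received:
            payload = xor_bytes(payload, received[i])
    return missing[0], payload


# Receiver A holds packets from two overlapping groups (illustrative data).
a_packets = {
    ("g1", 7): b"alpha---",
    ("g2", 3): b"bravo---",
    ("g2", 4): b"charlie-",
}
repair = make_repair(a_packets, list(a_packets))

# Receiver B lost ("g2", 3) but has the other two packets.
b_packets = {k: v for k, v in a_packets.items() if k != ("g2", 3)}
result = recover(b_packets, repair)
```

In this sketch the repair mixes packets from groups g1 and g2, which illustrates why recovery need not wait for more traffic in any single group. The policy for choosing which packets to combine and which peers to send repairs to is not described in the abstract and is left out here.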