Discretized streams: fault-tolerant streaming computation at scale

  • Authors:
  • Matei Zaharia;Tathagata Das;Haoyuan Li;Timothy Hunter;Scott Shenker;Ion Stoica

  • Affiliations:
  • University of California, Berkeley;University of California, Berkeley;University of California, Berkeley;University of California, Berkeley;University of California, Berkeley;University of California, Berkeley

  • Venue:
  • Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles
  • Year:
  • 2013

Quantified Score

Hi-index 0.00

Visualization

Abstract

Many "big data" applications must act on data in real time. Running these applications at ever-larger scales requires parallel platforms that automatically handle faults and stragglers. Unfortunately, current distributed stream processing models provide fault recovery in an expensive manner, requiring hot replication or long recovery times, and do not handle stragglers. We propose a new processing model, discretized streams (D-Streams), that overcomes these challenges. D-Streams enable a parallel recovery mechanism that improves efficiency over traditional replication and backup schemes, and tolerates stragglers. We show that they support a rich set of operators while attaining high per-node throughput similar to single-node systems, linear scaling to 100 nodes, sub-second latency, and sub-second fault recovery. Finally, D-Streams can easily be composed with batch and interactive query models like MapReduce, enabling rich applications that combine these modes. We implement D-Streams in a system called Spark Streaming.