Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing

  • Authors:
  • Matei Zaharia;Mosharaf Chowdhury;Tathagata Das;Ankur Dave;Justin Ma;Murphy McCauley;Michael J. Franklin;Scott Shenker;Ion Stoica

  • Affiliations:
  • University of California, Berkeley;University of California, Berkeley;University of California, Berkeley;University of California, Berkeley;University of California, Berkeley;University of California, Berkeley;University of California, Berkeley;University of California, Berkeley;University of California, Berkeley

  • Venue:
  • NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
  • Year:
  • 2012

Quantified Score

Hi-index 0.00

Visualization

Abstract

We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools. In both cases, keeping data in memory can improve performance by an order of magnitude. To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state. However, we show that RDDs are expressive enough to capture a wide class of computations, including recent specialized programming models for iterative jobs, such as Pregel, and new applications that these models do not capture. We have implemented RDDs in a system called Spark, which we evaluate through a variety of user applications and benchmarks.