Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing

Authors:
Matei Zaharia;Mosharaf Chowdhury;Tathagata Das;Ankur Dave;Justin Ma;Murphy McCauley;Michael J. Franklin;Scott Shenker;Ion Stoica
Affiliations:
University of California, Berkeley;University of California, Berkeley;University of California, Berkeley;University of California, Berkeley;University of California, Berkeley;University of California, Berkeley;University of California, Berkeley;University of California, Berkeley;University of California, Berkeley
Venue:
NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
Year:
2012

Abstract

We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools. In both cases, keeping data in memory can improve performance by an order of magnitude. To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state. However, we show that RDDs are expressive enough to capture a wide class of computations, including recent specialized programming models for iterative jobs, such as Pregel, and new applications that these models do not capture. We have implemented RDDs in a system called Spark, which we evaluate through a variety of user applications and benchmarks.
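
The idea of coarse-grained transformations can be illustrated with a small, single-process sketch. This is not Spark's API or implementation. The class ToyRDD and its methods from_list, map, filter, persist, collect, and drop_cache are hypothetical names chosen for this illustration. The sketch assumes that each dataset is immutable and records only the function that derives it from its parent, so a lost in-memory copy can be rebuilt by re-running that function instead of being restored from a replica.

```python
from typing import Callable, List, Optional


class ToyRDD:
    """Toy immutable dataset defined by how it is computed (hypothetical, not Spark)."""

    def __init__(self, compute: Callable[[], List]):
        self._compute = compute              # lineage: recipe that rebuilds the data
        self._persisted = False              # whether to keep results in memory
        self._cache: Optional[List] = None   # in-memory copy, may be lost

    @classmethod
    def from_list(cls, items) -> "ToyRDD":
        data = list(items)                   # stands in for data in stable storage
        return cls(lambda: list(data))

    def map(self, f) -> "ToyRDD":
        # Coarse-grained: one function applied to every element of the parent.
        return ToyRDD(lambda: [f(x) for x in self.collect()])

    def filter(self, pred) -> "ToyRDD":
        return ToyRDD(lambda: [x for x in self.collect() if pred(x)])

    def persist(self) -> "ToyRDD":
        self._persisted = True
        return self

    def collect(self) -> List:
        if self._cache is not None:
            return list(self._cache)         # reuse the in-memory copy
        result = self._compute()             # otherwise recompute from lineage
        if self._persisted:
            self._cache = result
        return list(result)

    def drop_cache(self) -> None:
        """Simulate losing the in-memory copy, for example after a node failure."""
        self._cache = None


base = ToyRDD.from_list(range(1, 6))
squares = base.map(lambda x: x * x).persist()
evens = squares.filter(lambda x: x % 2 == 0)

first = evens.collect()      # computes squares and caches them
squares.drop_cache()         # the cached data is lost
second = evens.collect()     # squares is rebuilt from its lineage
```

In this sketch both calls to collect return the same result, because the dropped data is recomputed from base. A real system would additionally partition the data across machines and recompute only the lost partitions, which this toy does not attempt.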