Spark: Cluster Computing with Working Sets

  • Authors:
  • Matei Zaharia; Mosharaf Chowdhury; Michael J. Franklin; Scott Shenker; Ion Stoica

  • Affiliations:
  • University of California, Berkeley (all authors)

  • Venue:
  • HotCloud'10: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing
  • Year:
  • 2010

Abstract

MapReduce and its variants have been highly successful in implementing large-scale data-intensive applications on commodity clusters. However, most of these systems are built around an acyclic data flow model that is not suitable for other popular applications. This paper focuses on one such class of applications: those that reuse a working set of data across multiple parallel operations. This includes many iterative machine learning algorithms, as well as interactive data analysis tools. We propose a new framework called Spark that supports these applications while retaining the scalability and fault tolerance of MapReduce. To achieve these goals, Spark introduces an abstraction called resilient distributed datasets (RDDs). An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.
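
The working-set pattern the abstract describes is concrete enough to sketch. Below is a minimal, illustrative Scala example in the spirit of the paper's iterative machine learning use case (logistic regression by gradient descent), written against the modern Apache Spark API rather than the 2010 prototype; the file path, feature dimension, step size, and iteration count are placeholders, not taken from the paper.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch of the working-set pattern: load and parse a dataset once,
// cache it as an RDD, then reuse it across multiple parallel passes.
object WorkingSetSketch {
  def main(args: Array[String]): Unit = {
    // local[*] runs the job in-process; on a cluster you would set a real master.
    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-reuse").setMaster("local[*]"))

    // Parse once; cache() keeps the partitions in memory so each
    // iteration below reads them without re-scanning the input file.
    // Input format assumed: "label f1 f2 ... fn" per line (illustrative).
    val points = sc.textFile("hdfs://data/points.txt")
      .map { line =>
        val cols = line.split(' ').map(_.toDouble)
        (cols.head, cols.tail) // (label, features)
      }
      .cache()

    var w = Array.fill(10)(0.0) // weight vector; dimension is a placeholder
    for (_ <- 1 to 20) {
      // Each pass is one parallel operation over the cached working set:
      // compute the logistic-loss gradient and sum it across partitions.
      val grad = points
        .map { case (y, x) =>
          val dot = w.zip(x).map { case (wi, xi) => wi * xi }.sum
          val scale = (1.0 / (1.0 + math.exp(-y * dot)) - 1.0) * y
          x.map(_ * scale)
        }
        .reduce((a, b) => a.zip(b).map { case (p, q) => p + q })
      // Fixed step size, purely illustrative.
      w = w.zip(grad).map { case (wi, gi) => wi - 0.1 * gi }
    }

    println(s"final weights: ${w.mkString(", ")}")
    sc.stop()
  }
}
```

The call to cache() is what distinguishes this from a chain of acyclic MapReduce jobs: the parsed points stay in memory across all twenty passes instead of being re-read and re-parsed from stable storage on every iteration, which is the source of the iterative speedups the abstract reports.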