Shark: fast data analysis using coarse-grained distributed memory

Authors:
Cliff Engle;Antonio Lupher;Reynold Xin;Matei Zaharia;Michael J. Franklin;Scott Shenker;Ion Stoica
Affiliations:
University of California Berkeley, Berkeley, CA, USA;University of California Berkeley, Berkeley, CA, USA;University of California Berkeley, Berkeley, CA, USA;University of California Berkeley, Berkeley, CA, USA;University of California Berkeley, Berkeley, CA, USA;University of California Berkeley, Berkeley, CA, USA;University of California Berkeley, Berkeley, CA, USA
Venue:
SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Year:
2012

Citing 3

A comparison of approaches to large-scale data analysis
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data

Disk-locality in datacenter computing considered irrelevant
HotOS'13 Proceedings of the 13th USENIX conference on Hot topics in operating systems

Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing
NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation

Cited 5

Blink and it's done: interactive queries on very large data
Proceedings of the VLDB Endowment

BlinkDB: queries with bounded errors and bounded response times on very large data
Proceedings of the 8th ACM European Conference on Computer Systems

jVerbs: ultra-low latency for data center applications
Proceedings of the 4th annual Symposium on Cloud Computing

PonIC: using stratosphere to speed up pig analytics
Euro-Par'13 Proceedings of the 19th international conference on Parallel Processing

Scuba: diving into data at facebook
Proceedings of the VLDB Endowment

Abstract

Shark is a research data analysis system built on a novel coarse-grained distributed shared-memory abstraction. Shark marries query processing with deep data analysis, providing a unified system for easy data manipulation using SQL and pushing sophisticated analysis closer to data. It scales to thousands of nodes in a fault-tolerant manner. Shark can answer queries 40X faster than Apache Hive and run machine learning programs 25X faster than MapReduce programs in Apache Hadoop on large datasets.