Shark: SQL and rich analytics at scale

Authors:
Reynold S. Xin;Josh Rosen;Matei Zaharia;Michael J. Franklin;Scott Shenker;Ion Stoica
Affiliations:
UC Berkeley, Berkeley, CA, USA;UC Berkeley, Berkeley, CA, USA;UC Berkeley, Berkeley, CA, USA;UC Berkeley, Berkeley, CA, USA;UC Berkeley, Berkeley, CA, USA;UC Berkeley, Berkeley, CA, USA
Venue:
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Year:
2013

Citing 30
Cited 7

Efficient mid-query re-optimization of sub-optimal query execution plans

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Cost-based query scrambling for initial delays

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Eddies: continuously adaptive query processing

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
C-store: a column-oriented DBMS

VLDB '05 Proceedings of the 31st international conference on Very large data bases
MapReduce: simplified data processing on large clusters

OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Dryad: distributed data-parallel programs from sequential building blocks

Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
SCOPE: easy and efficient parallel processing of massive data sets

Proceedings of the VLDB Endowment
A comparison of approaches to large-scale data analysis

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Distributed data-parallel computing using a high-level programming language

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
MapReduce and parallel DBMSs: friends or foes?

Communications of the ACM - Amir Pnueli: Ahead of His Time
Quincy: fair scheduling for distributed computing clusters

Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
MAD skills: new analysis practices for big data

Proceedings of the VLDB Endowment
HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads

Proceedings of the VLDB Endowment
Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling

Proceedings of the 5th European conference on Computer systems
Towards automatic optimization of MapReduce programs

Proceedings of the 1st ACM symposium on Cloud computing
Pregel: a system for large-scale graph processing

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
HaLoop: efficient iterative data processing on large clusters

Proceedings of the VLDB Endowment
Dremel: interactive analysis of web-scale datasets

Proceedings of the VLDB Endowment
Cheetah: a high performance, custom data warehouse on top of MapReduce

Proceedings of the VLDB Endowment
ASTERIX: towards a scalable, semistructured data platform for evolving-world models

Distributed and Parallel Databases
Mesos: a platform for fine-grained resource sharing in the data center

Proceedings of the 8th USENIX conference on Networked systems design and implementation
Hyracks: A flexible and extensible foundation for data-intensive computing

ICDE '11 Proceedings of the 2011 IEEE 27th International Conference on Data Engineering
Distributed GraphLab: a framework for machine learning and data mining in the cloud

Proceedings of the VLDB Endowment
SkewTune: mitigating skew in mapreduce applications

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Towards a unified architecture for in-RDBMS analytics

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing

NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
PACMan: coordinated memory caching for parallel jobs

NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
Re-optimizing data-parallel computing

NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
Processing a trillion cells per mouse click

Proceedings of the VLDB Endowment
The case for tiny tasks in compute clusters

HotOS'13 Proceedings of the 14th USENIX conference on Hot Topics in Operating Systems

DeepSea: self-adaptive data partitioning and replication in scalable distributed data systems

Proceedings of the 2013 Sigmod/PODS Ph.D. symposium on PhD symposium
Can we analyze big data inside a DBMS?

Proceedings of the sixteenth international workshop on Data warehousing and OLAP
Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles

ACM SIGOPS 24th Symposium on Operating Systems Principles
Sparrow: distributed, low latency scheduling

Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles
Piranha: optimizing short jobs in Hadoop

Proceedings of the VLDB Endowment
Understanding insights into the basic structure and essential issues of table placement methods in clusters

Proceedings of the VLDB Endowment
Semantic-based QoS management in cloud systems: Current status and future challenges

Future Generation Computer Systems

Quantified Score

Hi-index	0.00

Visualization

Abstract

Shark is a new data analysis system that marries query processing with complex analytics on large clusters. It leverages a novel distributed memory abstraction to provide a unified engine that can run SQL queries and sophisticated analytics functions (e.g. iterative machine learning) at scale, and efficiently recovers from failures mid-query. This allows Shark to run SQL queries up to 100X faster than Apache Hive, and machine learning programs more than 100X faster than Hadoop. Unlike previous systems, Shark shows that it is possible to achieve these speedups while retaining a MapReduce-like execution engine, and the fine-grained fault tolerance properties that such engine provides. It extends such an engine in several ways, including column-oriented in-memory storage and dynamic mid-query replanning, to effectively execute SQL. The result is a system that matches the speedups reported for MPP analytic databases over MapReduce, while offering fault tolerance properties and complex analytics capabilities that they lack.