Dremel: interactive analysis of web-scale datasets

Authors:
Sergey Melnik;Andrey Gubarev;Jing Jing Long;Geoffrey Romer;Shiva Shivakumar;Matt Tolton;Theo Vassilakis
Affiliations:
Google, Inc.;Google, Inc.;Google, Inc.;Google, Inc.;Google, Inc.;Google, Inc.;Google, Inc.
Venue:
Proceedings of the VLDB Endowment
Year:
2010

Abstract

Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google. In this paper, we describe the architecture and implementation of Dremel, and explain how it complements MapReduce-based computing. We present a novel columnar storage representation for nested records and discuss experiments on few-thousand node instances of the system.
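
As a rough illustration of how a multi-level execution tree can combine with a columnar layout for an aggregation query, the following minimal Python sketch may help. It is not Dremel's implementation: the function names (`leaf_partial`, `merge`, `run_tree`), the `fanout` parameter, and the sample data are all hypothetical, tablets are modeled as plain dictionaries of flat columns, and the nested-record encoding that the paper presents is not covered here.

```python
from collections import Counter


def leaf_partial(tablet, group_col, value_col):
    """Leaf level: scan only the two needed columns of one tablet
    and return a partial sum per group key."""
    partial = Counter()
    for key, value in zip(tablet[group_col], tablet[value_col]):
        partial[key] += value
    return partial


def merge(partials):
    """Intermediate/root level: add up partial sums from child servers."""
    total = Counter()
    for p in partials:
        total.update(p)  # Counter.update adds values key by key
    return total


def run_tree(tablets, group_col, value_col, fanout=2):
    """Evaluate a SUM ... GROUP BY over all tablets, merging partial
    results level by level until a single result remains."""
    level = [leaf_partial(t, group_col, value_col) for t in tablets]
    while len(level) > 1:
        level = [merge(level[i:i + fanout]) for i in range(0, len(level), fanout)]
    return dict(level[0]) if level else {}


# Hypothetical sample data: each tablet stores every column as its own list,
# so the "url" column is never touched by the query below.
tablets = [
    {"country": ["us", "de", "us"], "clicks": [3, 1, 2], "url": ["a", "b", "c"]},
    {"country": ["de", "fr"], "clicks": [4, 5], "url": ["d", "e"]},
    {"country": ["us"], "clicks": [10], "url": ["f"]},
]
result = run_tree(tablets, "country", "clicks")
```

In this sketch each leaf reads only the columns the query names, and each upper level sees only small per-group partial sums rather than raw rows, which is the general idea behind pairing the two techniques.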