Blink and it's done: interactive queries on very large data

Authors:
Sameer Agarwal;Anand P. Iyer;Aurojit Panda;Samuel Madden;Barzan Mozafari;Ion Stoica
Affiliations:
UC Berkeley;UC Berkeley;UC Berkeley;MIT CSAIL;MIT CSAIL;UC Berkeley
Venue:
Proceedings of the VLDB Endowment
Year:
2012


Abstract

In this demonstration, we present BlinkDB, a massively parallel, sampling-based approximate query processing framework for running interactive queries on large volumes of data. The key observation in BlinkDB is that one can make reasonable decisions in the absence of perfect answers. BlinkDB extends the Hive/HDFS stack and can handle the same set of SPJA (selection, projection, join and aggregate) queries as supported by these systems. BlinkDB provides real-time answers along with statistical error guarantees, and can scale to petabytes of data and thousands of machines in a fault-tolerant manner. Our experiments using the TPC-H benchmark and on an anonymized real-world video content distribution workload from Conviva Inc. show that BlinkDB can execute a wide range of queries up to 150x faster than Hive on MapReduce and 10-150x faster than Shark (Hive on Spark) over tens of terabytes of data stored across 100 machines, all with an error of 2-10%.
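
The general idea of answering an aggregate query from a sample, with an error bound attached, can be illustrated with a small sketch. This is not BlinkDB's implementation: it assumes a plain uniform random sample and a normal-approximation (central limit theorem) 95% confidence interval, and the function name `approximate_mean` and the synthetic `session_times` column are hypothetical, introduced only for illustration.

```python
import math
import random
import statistics


def approximate_mean(values, sample_size, z=1.96, seed=0):
    """Estimate the mean of `values` from a uniform random sample.

    Returns (estimate, margin), where estimate +/- margin is an approximate
    95% confidence interval under the central limit theorem (z = 1.96).
    Assumption: uniform sampling without replacement, not the sampling
    strategy of any particular system.
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)  # uniform, without replacement
    estimate = statistics.fmean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, z * std_error


if __name__ == "__main__":
    # Hypothetical data standing in for one column of a large table.
    data_rng = random.Random(42)
    session_times = [data_rng.expovariate(1 / 30.0) for _ in range(1_000_000)]
    estimate, margin = approximate_mean(session_times, sample_size=10_000)
    print(f"AVG(session_time) ~ {estimate:.2f} +/- {margin:.2f} (95% confidence)")
```

In this sketch only 1% of the rows are read, which is where the speedup of a sampling-based approach would come from; the price is that the answer is an estimate with a stated margin rather than an exact value.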