BlinkDB: queries with bounded errors and bounded response times on very large data

Authors:
Sameer Agarwal;Barzan Mozafari;Aurojit Panda;Henry Milner;Samuel Madden;Ion Stoica
Affiliations:
University of California, Berkeley;Massachusetts Institute of Technology;University of California, Berkeley;University of California, Berkeley;Massachusetts Institute of Technology;Conviva Inc. and University of California, Berkeley
Venue:
Proceedings of the 8th ACM European Conference on Computer Systems
Year:
2013

Citing 17
Cited 4

Online aggregation

SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data
Approximate computation of multidimensional aggregates of sparse data using wavelets

SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Join synopses for approximate query answering

SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
The Aqua approximate query answering system

SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Congressional samples for approximate answering of group-by queries

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Approximate Query Processing: Taming the TeraBytes

Proceedings of the 27th International Conference on Very Large Data Bases
PROMISE: Predicting Query Behavior to Enable Predictive Caching Strategies for OLAP Systems

DaWaK 2000 Proceedings of the Second International Conference on Data Warehousing and Knowledge Discovery
Optimized stratified sampling for approximate query processing

ACM Transactions on Database Systems (TODS)
Hive: a warehousing solution over a map-reduce framework

Proceedings of the VLDB Endowment
MapReduce online

NSDI'10 Proceedings of the 7th USENIX conference on Networked systems design and implementation
Improving MapReduce performance in heterogeneous environments

OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
Reining in the outliers in map-reduce clusters using Mantri

OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Dremel: interactive analysis of web-scale datasets

Communications of the ACM
Optimal random sampling from distributed streams revisited

DISC'11 Proceedings of the 25th international conference on Distributed computing
Shark: fast data analysis using coarse-grained distributed memory

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing

NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation
Re-optimizing data-parallel computing

NSDI'12 Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation

Natjam: design and evaluation of eviction policies for supporting priorities and deadlines in mapreduce clusters

Proceedings of the 4th annual Symposium on Cloud Computing
Scalable progressive analytics on big data in the cloud

Proceedings of the VLDB Endowment
Aggregation and degradation in JetStream: streaming analytics in the wide area

NSDI'14 Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation
GRASS: trimming stragglers in approximation analytics

NSDI'14 Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation

Quantified Score

Hi-index	0.00

Visualization

Abstract

In this paper, we present BlinkDB, a massively parallel, approximate query engine for running interactive SQL queries on large volumes of data. BlinkDB allows users to trade-off query accuracy for response time, enabling interactive queries over massive data by running queries on data samples and presenting results annotated with meaningful error bars. To achieve this, BlinkDB uses two key ideas: (1) an adaptive optimization framework that builds and maintains a set of multi-dimensional stratified samples from original data over time, and (2) a dynamic sample selection strategy that selects an appropriately sized sample based on a query's accuracy or response time requirements. We evaluate BlinkDB against the well-known TPC-H benchmarks and a real-world analytic workload derived from Conviva Inc., a company that manages video distribution over the Internet. Our experiments on a 100 node cluster show that BlinkDB can answer queries on up to 17 TBs of data in less than 2 seconds (over 200 x faster than Hive), within an error of 2-10%.