Predictable performance and high query concurrency for data analytics

Authors:
George Candea;Neoklis Polyzotis;Radek Vingralek
Affiliations:
EPFL, Lausanne, Switzerland;University of California, Santa Cruz, USA;Google Inc., Mountain View, Santa Clara, USA
Venue:
The VLDB Journal — The International Journal on Very Large Data Bases
Year:
2011

Abstract

Conventional data warehouses employ the query-at-a-time model, which maps each query to a distinct physical plan. When several queries execute concurrently, this model introduces contention and thrashing, because the physical plans, unaware of each other, compete for access to the underlying I/O and computation resources. As a result, while modern systems can efficiently optimize and evaluate a single complex data analysis query, their performance degrades significantly and can become highly erratic when multiple complex queries run at the same time. In this paper we present Cjoin, a new design that substantially improves throughput in large-scale data analytics systems that process many concurrent join queries. In contrast to the conventional query-at-a-time model, our approach employs a single physical plan that shares I/O, computation, and tuple storage across all in-flight join queries. We use an "always on" pipeline of non-blocking operators, managed by a controller that continuously examines the current query mix and optimizes the pipeline on the fly. Our design enables data analytics engines to scale gracefully to large data sets, provide predictable execution times, and reduce contention. We implemented Cjoin as an extension to the PostgreSQL DBMS. This prototype outperforms conventional commercial systems by an order of magnitude for tens to hundreds of concurrent queries.
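
To make the idea of one shared plan concrete, the following toy sketch shows a single pass over a fact table that serves several queries at once by tagging each row with a bitmask of the queries it satisfies. It is an illustration of shared-scan evaluation in general, not the Cjoin implementation: the `Query` class, the `shared_scan` function, the `store_id` and `amount` columns, and the sample data are all assumptions invented for this example, and the abstract does not specify how Cjoin's pipeline, controller, or operators are actually built.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


@dataclass
class Query:
    """Hypothetical star-join query: a predicate on the dimension row
    and a running SUM over the fact column 'amount'."""
    name: str
    dim_predicate: Callable[[Row], bool]
    total: float = 0.0


def shared_scan(fact_rows: Iterable[Row],
                dim_table: Dict[int, Row],
                queries: List[Query]) -> Dict[str, float]:
    # Evaluate every query's dimension predicate once per dimension row and
    # record the outcome as a bitmask (bit i is set if query i accepts the row).
    dim_bits: Dict[int, int] = {}
    for key, dim_row in dim_table.items():
        bits = 0
        for i, q in enumerate(queries):
            if q.dim_predicate(dim_row):
                bits |= 1 << i
        dim_bits[key] = bits

    # One pass over the fact table feeds all queries; no query rescans the data.
    for fact in fact_rows:
        bits = dim_bits.get(fact["store_id"], 0)
        if bits == 0:
            continue  # no in-flight query needs this row
        for i, q in enumerate(queries):
            if bits & (1 << i):
                q.total += fact["amount"]

    return {q.name: q.total for q in queries}


# Assumed sample data, for illustration only.
dim_table = {1: {"region": "EU"}, 2: {"region": "US"}, 3: {"region": "EU"}}
fact_rows = [
    {"store_id": 1, "amount": 10.0},
    {"store_id": 2, "amount": 5.0},
    {"store_id": 3, "amount": 2.5},
    {"store_id": 4, "amount": 99.0},  # no matching dimension row, so it is dropped
]
queries = [
    Query("eu_sales", lambda d: d["region"] == "EU"),
    Query("all_sales", lambda d: True),
]
result = shared_scan(fact_rows, dim_table, queries)
print(result)  # {'eu_sales': 12.5, 'all_sales': 17.5}
```

In this sketch the cost of reading the fact table is paid once regardless of how many queries are registered, which is the intuition behind sharing I/O and computation across in-flight queries. A real engine would also have to admit and retire queries while the scan is running, which this example does not attempt.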