Adaptively parallelizing distributed range queries

  • Authors: Ymir Vigfusson (Cornell University); Adam Silberstein, Brian F. Cooper, Rodrigo Fonseca (Yahoo! Research)
  • Venue: Proceedings of the VLDB Endowment
  • Year: 2009

Abstract

We consider the problem of how to best parallelize range queries in a massive scale distributed database. In traditional systems the focus has been on maximizing parallelism, for example by laying out data to achieve the highest throughput. However, in a massive scale database such as our PNUTS system [11] or BigTable [10], maximizing parallelism is not necessarily the best strategy: the system has more than enough servers to saturate a single client by returning results faster than the client can consume them, and when there are multiple concurrent queries, maximizing parallelism for all of them will cause disk contention, reducing everybody's performance. How can we find the right parallelism level for each query in order to achieve high, consistent throughput for all queries? We propose an adaptive approach with two aspects. First, we adaptively determine the ideal parallelism for a single query execution, which is the minimum number of parallel scanning servers needed to satisfy the client, depending on query selectivity, client load, client-server bandwidth, and so on. Second, we adaptively schedule which servers will be assigned to different query executions, to minimize disk contention on servers and ensure that all queries receive good performance. Our scheduler can be tuned based on different policies, such as favoring short versus long queries or high versus low priority queries. An experimental study demonstrates the effectiveness of our techniques in the PNUTS system.
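The abstract defines the ideal parallelism of a query as the minimum number of parallel scanning servers needed to saturate the client, given query selectivity, client load, and client-server bandwidth. A minimal sketch of that calculation, with all function names, parameters, and rates being illustrative assumptions rather than the paper's actual method:

```python
import math

def ideal_parallelism(client_consume_rate, per_server_scan_rate,
                      query_selectivity, client_server_bandwidth):
    """Smallest server count whose combined delivery rate saturates the client.

    Rates are in results/second; selectivity is the fraction of scanned
    rows that match the query. All inputs here are hypothetical knobs
    standing in for the factors the abstract lists.
    """
    # Rate at which one server delivers *matching* results to the client,
    # capped by the network path between server and client.
    effective_rate = min(per_server_scan_rate * query_selectivity,
                         client_server_bandwidth)
    if effective_rate <= 0:
        raise ValueError("each server must deliver results at a positive rate")
    # Adding servers beyond this point cannot raise client throughput;
    # per the abstract, it only adds disk contention.
    return max(1, math.ceil(client_consume_rate / effective_rate))

# A client consuming 100 results/s, served by servers scanning 50 rows/s
# with 50% selectivity over a 40 results/s link, needs 4 servers.
print(ideal_parallelism(100.0, 50.0, 0.5, 40.0))  # → 4
```

The scheduling side described next (assigning servers across concurrent queries to limit disk contention) would build on this per-query minimum rather than always fanning out to all servers.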