Adaptively parallelizing distributed range queries

  • Authors: Ymir Vigfusson (Cornell University); Adam Silberstein, Brian F. Cooper, Rodrigo Fonseca (Yahoo! Research)
  • Venue: Proceedings of the VLDB Endowment
  • Year: 2009

Abstract

We consider the problem of how to best parallelize range queries in a massive scale distributed database. In traditional systems the focus has been on maximizing parallelism, for example by laying out data to achieve the highest throughput. However, in a massive scale database such as our PNUTS system [11] or BigTable [10], maximizing parallelism is not necessarily the best strategy: the system has more than enough servers to saturate a single client by returning results faster than the client can consume them, and when there are multiple concurrent queries, maximizing parallelism for all of them will cause disk contention, reducing everybody's performance. How can we find the right parallelism level for each query in order to achieve high, consistent throughput for all queries? We propose an adaptive approach with two aspects. First, we adaptively determine the ideal parallelism for a single query execution, which is the minimum number of parallel scanning servers needed to satisfy the client, depending on query selectivity, client load, client-server bandwidth, and so on. Second, we adaptively schedule which servers will be assigned to different query executions, to minimize disk contention on servers and ensure that all queries receive good performance. Our scheduler can be tuned based on different policies, such as favoring short versus long queries or high versus low priority queries. An experimental study demonstrates the effectiveness of our techniques in the PNUTS system.
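The abstract defines the ideal parallelism of a query as the minimum number of parallel scanning servers needed to saturate the client, given query selectivity, client load, and client-server bandwidth. A minimal sketch of that calculation, with all function names, parameters, and rates being illustrative assumptions rather than the paper's actual method:

```python
import math

def ideal_parallelism(client_consume_rate, per_server_scan_rate,
                      query_selectivity, client_server_bandwidth):
    """Smallest server count whose combined delivery rate saturates the client.

    Rates are in results/second; selectivity is the fraction of scanned
    rows that match the query. All inputs here are hypothetical knobs
    standing in for the factors the abstract lists.
    """
    # Rate at which one server delivers *matching* results to the client,
    # capped by the network path between server and client.
    effective_rate = min(per_server_scan_rate * query_selectivity,
                         client_server_bandwidth)
    if effective_rate <= 0:
        raise ValueError("each server must deliver results at a positive rate")
    # Adding servers beyond this point cannot raise client throughput;
    # per the abstract, it only adds disk contention.
    return max(1, math.ceil(client_consume_rate / effective_rate))

# A client consuming 100 results/s, served by servers scanning 50 rows/s
# with 50% selectivity over a 40 results/s link, needs 4 servers.
print(ideal_parallelism(100.0, 50.0, 0.5, 40.0))  # → 4
```

The scheduling side described next (assigning servers across concurrent queries to limit disk contention) would build on this per-query minimum rather than always fanning out to all servers.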