Efficient parallel processing of range queries through replicated declustering

  • Authors:
  • Hakan Ferhatosmanoglu;Ali Şaman Tosun;Guadalupe Canahuate;Aravind Ramachandran

  • Affiliations:
  • Department of Computer Science and Engineering, The Ohio State University, Columbus 43210;Department of Computer Science, University of Texas, San Antonio 78249;Department of Computer Science and Engineering, The Ohio State University, Columbus 43210;Microsoft Corporation, Redmond 98052

  • Venue:
  • Distributed and Parallel Databases
  • Year:
  • 2006

Abstract

Data declustering over parallel servers is a common technique for minimizing I/O in data-intensive applications. It distributes data among several disks so that query retrieval can be parallelized, which improves performance. We focus on optimizing access to large spatial data and on the most common type of query on such data, the range query. An optimal declustering scheme is one in which the processing of every range query is balanced uniformly among the available disks. It has been shown that single-copy declustering schemes are non-optimal for range queries. In this paper, we combine replication with parallel disk declustering for efficient processing of range queries. Replication is already widely used in database applications for purposes such as load balancing, fault tolerance, and data availability. We develop theoretical foundations for replicated declustering and propose a class of replicated declustering schemes, periodic allocations, which are shown to be strictly optimal for a number of disks. We propose a framework for replicated declustering that uses a limited amount of replication, and we provide extensions for applying it to real data, including arbitrary grids and large numbers of disks. The framework also provides an effective indexing scheme that enables fast identification of the data of interest on parallel servers. In addition to optimal processing of single queries, we show that the framework is effective for parallel processing of multiple queries. We present experimental results comparing the proposed replication scheme with other techniques, for both single and multiple queries, on synthetic and real data sets.
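The load-balancing idea in the abstract can be illustrated with a small sketch. It is not the paper's framework. It assumes that a periodic allocation maps grid bucket (i, j) to disk (a*i + b*j) mod N, it picks the coefficients and the query arbitrarily, and it uses a plain bipartite-matching search as a stand-in for whatever retrieval algorithm the paper uses. All function names are hypothetical.

```python
from collections import Counter
from math import ceil


def periodic_disk(i, j, a, b, n_disks):
    """Disk of bucket (i, j), assuming the form (a*i + b*j) mod n_disks."""
    return (a * i + b * j) % n_disks


def query_buckets(i0, j0, rows, cols):
    """Grid buckets covered by a rows x cols range query anchored at (i0, j0)."""
    return [(i, j) for i in range(i0, i0 + rows) for j in range(j0, j0 + cols)]


def optimal_cost(buckets, n_disks):
    """Lower bound: buckets spread perfectly evenly over the disks."""
    return ceil(len(buckets) / n_disks)


def single_copy_cost(buckets, a, b, n_disks):
    """Retrieval cost with one copy: the load of the busiest disk."""
    loads = Counter(periodic_disk(i, j, a, b, n_disks) for i, j in buckets)
    return max(loads.values())


def feasible(choices, n_disks, cap):
    """True if every bucket can be read from one of its candidate disks
    while no disk serves more than `cap` buckets (augmenting-path matching)."""
    assigned = [[] for _ in range(n_disks)]  # bucket indices served by each disk

    def place(bucket, seen):
        for disk in choices[bucket]:
            if disk in seen:
                continue
            seen.add(disk)
            if len(assigned[disk]) < cap:
                assigned[disk].append(bucket)
                return True
            # Disk is full: try to move one of its buckets to another disk.
            for slot, other in enumerate(assigned[disk]):
                if place(other, seen):
                    assigned[disk][slot] = bucket
                    return True
        return False

    return all(place(bucket, set()) for bucket in range(len(choices)))


def replicated_cost(buckets, copies, n_disks):
    """Smallest per-disk load achievable when each bucket has one replica
    per (a, b) pair in `copies` (illustrative coefficients, not the paper's)."""
    choices = [[periodic_disk(i, j, a, b, n_disks) for a, b in copies]
               for i, j in buckets]
    cap = optimal_cost(buckets, n_disks)
    while not feasible(choices, n_disks, cap):
        cap += 1
    return cap


if __name__ == "__main__":
    n = 4
    q = query_buckets(0, 0, 2, 2)
    print("optimal:", optimal_cost(q, n))
    print("single copy:", single_copy_cost(q, 1, 1, n))
    print("two copies:", replicated_cost(q, [(1, 1), (1, 3)], n))
```

For this one 2 x 2 query on 4 disks, the single copy places two buckets on the same disk, while the second replica lets each bucket be read from a different disk. This shows the effect on a single query only and says nothing about optimality over all range queries.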