Cluster-and-conquer: hierarchical multi-metric query processing in large-scale database federations

  • Authors:
  • Di Wang; Murali Mani; Elke A. Rundensteiner

  • Affiliations:
  • Worcester Polytechnic Institute, Worcester, MA;University of Michigan, Flint, Flint, MI;Worcester Polytechnic Institute, Worcester, MA

  • Venue:
  • Proceedings of the Fourteenth International Database Engineering & Applications Symposium
  • Year:
  • 2010

Abstract

The federated database architecture was introduced to preserve the autonomy of individual data sources while still accomplishing federated tasks for diverse applications, from traditional enterprises to the computational sciences. We identify two challenging problems of query optimization in large-scale database federation systems. First, the run-time conditions of data sources have a profound effect on the performance of database federations, yet the distributed environment makes it prohibitively expensive for the optimizer to gather rapidly fluctuating run-time conditions from remote data sources. Second, large-scale database federation systems are often widely distributed and built on heterogeneous networks, so utilizing network resources efficiently is increasingly important for query scheduling. In this paper, we propose to exploit the clustered hierarchical structure of database federations to solve these two problems. Our Cluster-and-Conquer strategy coordinates hierarchical clusters of data sources to optimize and process queries cooperatively. Within each cluster, we employ an I/O-bound cost model, since run-time conditions are accessible there with relatively little delay. Among clusters, we instead use a network-bound cost model to capture network heterogeneity and optimize query plans for efficient network utilization. An experimental study on a prototype database federation system with real-world network settings shows the effectiveness of our Cluster-and-Conquer strategy for scheduling data-intensive queries and demonstrates its performance benefits over existing state-of-the-art solutions.
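
The two-level idea in the abstract can be illustrated with a small sketch. This is not the authors' cost model or algorithm: the class names, fields, cost formulas, and constants below (`Source`, `Cluster`, `io_cost`, `network_cost`, `schedule`, `ms_per_page`) are all hypothetical and chosen only to show the division of labor. A coordinator ranks clusters using network information alone, and the chosen cluster then picks a source using run-time I/O load that is assumed to be cheap to observe locally.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Source:
    name: str
    pages_to_scan: int   # hypothetical: pages the query would read at this source
    io_load: float       # hypothetical run-time load factor, 0.0 <= io_load < 1.0


@dataclass
class Cluster:
    name: str
    bandwidth_mbps: float  # hypothetical bandwidth of the link to the coordinator
    latency_ms: float      # hypothetical one-way latency of that link
    sources: List[Source] = field(default_factory=list)


def io_cost(src: Source, ms_per_page: float = 0.1) -> float:
    """Assumed I/O-bound estimate: scan time inflated by the current load."""
    return src.pages_to_scan * ms_per_page / (1.0 - src.io_load)


def network_cost(cluster: Cluster, result_mb: float) -> float:
    """Assumed network-bound estimate in ms: latency plus transfer time."""
    transfer_ms = result_mb * 8.0 / cluster.bandwidth_mbps * 1000.0
    return cluster.latency_ms + transfer_ms


def schedule(clusters: List[Cluster], result_mb: float) -> Tuple[str, str]:
    """Pick a cluster by network cost, then a source inside it by I/O cost."""
    # Among clusters: only network characteristics are consulted.
    chosen = min(clusters, key=lambda c: network_cost(c, result_mb))
    # Within the chosen cluster: run-time load is assumed to be locally visible.
    best = min(chosen.sources, key=io_cost)
    return chosen.name, best.name
```

In this sketch the coordinator never reads `io_load` from a remote cluster, which mirrors the abstract's point that gathering fluctuating run-time conditions across the federation is expensive. A real optimizer would compare whole query plans rather than single sources.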