Load distribution of analytical query workloads for database cluster architectures

Authors:
Thomas Phan; Wen-Syan Li
Affiliations:
Yahoo!, Inc., Sunnyvale, CA; IBM Almaden Research Center, San Jose, CA
Venue:
EDBT '08 Proceedings of the 11th international conference on Extending database technology: Advances in database technology
Year:
2008

Abstract

Enterprises may have multiple database systems spread across the organization for redundancy or for serving different applications. In such systems, query workloads can be distributed across different servers for better performance. A materialized view, or Materialized Query Table (MQT), is an auxiliary table with pre-computed data that can be used to significantly improve the performance of a database query. In this paper, we propose a framework for coordinating the execution of OLAP query workloads across a database cluster with a shared-nothing architecture. Such coordination is complex since we need to consider (1) the time to build the MQTs, (2) the impact of the MQTs on query execution, (3) whether the MQTs fit within the available disk space, (4) server computation power, and (5) the effectiveness of the scheduling and placement algorithms in deriving a combination of configurations that completes the workload in the shortest time. We frame the problem as a combinatorial problem whose solution space is exponential in the number of queries, MQTs, and servers. We provide a stochastic search heuristic that finds a near-optimal mapping of queries to servers and MQTs to servers within an arbitrarily bounded time, and we compare our solution with an exhaustive search and three standard greedy algorithms. Our search implementation produced schedules within 9% of the optimum found through exhaustive search and produced better solutions than typical greedy algorithms for both TPC-H and synthetic benchmarks under a variety of experiments. For a key trial where disk space is limited, it produced 15% better results than the next best competitor, corresponding to an absolute wall-clock advantage of over 10 hours.
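
To make the problem statement concrete, the following is a minimal sketch of the kind of search the abstract describes, not the authors' implementation. The abstract does not specify the cost model or the heuristic, so everything here is an assumption: the toy numbers, the `makespan` cost function (MQT build time plus query time, scaled by server speed, with infeasible disk usage scored as infinity), and the use of a simple random-mutation hill climber as a stand-in for the paper's stochastic search. The exhaustive enumeration is included only because the toy instance is small enough to allow it.

```python
import itertools
import random

# Hypothetical toy instance; every number below is made up for illustration.
QUERY_COST = [40.0, 25.0, 30.0, 15.0]   # baseline run time of each query
MQT_BUILD = [10.0, 8.0]                 # time to build each MQT
MQT_SIZE = [5, 4]                       # disk space each MQT occupies
# (query, mqt) -> multiplier applied to the query's run time if that MQT is local
SPEEDUP = {(0, 0): 0.25, (2, 0): 0.5, (1, 1): 0.4}
SERVER_SPEED = [1.0, 2.0]               # relative computation power
SERVER_DISK = [5, 6]                    # disk budget per server


def makespan(assign, placement):
    """Time until the slowest server finishes (assumed cost model).

    assign[q] is the server that runs query q;
    placement[s] is the set of MQTs built on server s.
    """
    finish_times = []
    for s, speed in enumerate(SERVER_SPEED):
        mqts = placement[s]
        if sum(MQT_SIZE[m] for m in mqts) > SERVER_DISK[s]:
            return float("inf")         # violates the disk space limit
        work = sum(MQT_BUILD[m] for m in mqts)
        for q, server in enumerate(assign):
            if server == s:
                factor = min([SPEEDUP.get((q, m), 1.0) for m in mqts], default=1.0)
                work += QUERY_COST[q] * factor
        finish_times.append(work / speed)
    return max(finish_times)


def all_solutions():
    """Enumerate every query-to-server and MQT-to-server mapping."""
    n_servers, n_mqts = len(SERVER_SPEED), len(MQT_SIZE)
    subsets = [frozenset(c) for r in range(n_mqts + 1)
               for c in itertools.combinations(range(n_mqts), r)]
    for assign in itertools.product(range(n_servers), repeat=len(QUERY_COST)):
        for placement in itertools.product(subsets, repeat=n_servers):
            yield assign, placement


def stochastic_search(iterations=2000, seed=0):
    """Random-mutation hill climber with a fixed iteration budget."""
    rng = random.Random(seed)
    n_servers, n_queries, n_mqts = len(SERVER_SPEED), len(QUERY_COST), len(MQT_SIZE)
    assign = [rng.randrange(n_servers) for _ in range(n_queries)]
    placement = [frozenset() for _ in range(n_servers)]  # no MQTs: always feasible
    best = makespan(assign, placement)
    for _ in range(iterations):
        cand_assign, cand_place = list(assign), list(placement)
        if rng.random() < 0.5:
            # Move one query to a randomly chosen server.
            cand_assign[rng.randrange(n_queries)] = rng.randrange(n_servers)
        else:
            # Toggle one MQT on one server.
            s = rng.randrange(n_servers)
            cand_place[s] = cand_place[s] ^ {rng.randrange(n_mqts)}
        cost = makespan(cand_assign, cand_place)
        if cost <= best:                # accept improvements and sideways moves
            assign, placement, best = cand_assign, cand_place, cost
    return best, assign, placement


if __name__ == "__main__":
    optimal = min(makespan(a, p) for a, p in all_solutions())
    found, assign, placement = stochastic_search()
    print("exhaustive optimum:", optimal)
    print("stochastic search :", found, assign, [sorted(p) for p in placement])
```

Under this assumed model the number of candidate solutions is the number of servers raised to the number of queries, multiplied by two raised to the product of MQTs and servers, which is why exhaustive enumeration is only practical for very small instances and a bounded-time heuristic becomes attractive.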