Algorithms for the database layout problem

Authors:
Gagan Aggarwal;Tomás Feder;Rajeev Motwani;Rina Panigrahy;An Zhu
Affiliations:
Computer Science Department, Stanford University, Stanford, CA;Computer Science Department, Stanford University, Stanford, CA;Computer Science Department, Stanford University, Stanford, CA;Computer Science Department, Stanford University, Stanford, CA;Computer Science Department, Stanford University, Stanford, CA
Venue:
ICDT'05 Proceedings of the 10th international conference on Database Theory
Year:
2005



Abstract

We present a formal analysis of the database layout problem, i.e., the problem of determining how database objects such as tables and indexes are assigned to disk drives. Optimizing this layout has a direct impact on the I/O performance of the entire system. The traditional approach of striping each object across all available disk drives is aimed at optimizing I/O parallelism; however, it is suboptimal when queries co-access two or more database objects, e.g., during a merge join of two tables, due to the increase in random disk seeks. We adopt an existing model, which takes into account both the benefit of I/O parallelism and the overhead due to random disk accesses, in the context of a query workload which includes co-access of database objects. The resulting optimization problem is intractable in general and we employ techniques from approximation algorithms to present provable performance guarantees. We show that while optimally exploiting I/O parallelism alone suggests uniformly striping data objects (even for heterogeneous files and disks), optimizing random disk access alone would assign each data object to a single disk drive. This confirms the intuition that the two effects are in tension with each other. We provide approximation algorithms in an attempt to optimize the trade-off between the two effects. We show that our algorithm achieves the best possible approximation ratio.
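The tension described in the abstract can be illustrated with a small toy cost model. This is only a sketch under assumptions that are not from the paper: the abstract does not specify the model it adopts, so the function `query_cost`, the parameters `transfer_rate` and `seek_penalty`, and the rule that a disk pays a seek overhead proportional to the data it reads whenever it holds parts of two or more co-accessed objects are all hypothetical. Disks are assumed to work in parallel, so a query finishes when its slowest disk finishes.

```python
def query_cost(layout, sizes, query, transfer_rate=1.0, seek_penalty=0.5):
    """Toy cost of one query under a given layout (hypothetical model).

    layout: dict mapping object name -> {disk id: fraction of the object on that disk}
    sizes:  dict mapping object name -> object size
    query:  list of object names that the query co-accesses
    """
    # Every disk that holds a piece of any object touched by the query.
    disks = {d for obj in query for d in layout[obj]}
    per_disk_cost = []
    for d in disks:
        # Objects of this query that have data on disk d.
        objs_here = [o for o in query if layout[o].get(d, 0.0) > 0.0]
        # Sequential transfer time for the data this disk must read.
        transfer = sum(sizes[o] * layout[o][d] for o in objs_here) / transfer_rate
        # Assumed seek overhead: charged only when the disk interleaves
        # reads of two or more co-accessed objects.
        seeks = seek_penalty * transfer if len(objs_here) > 1 else 0.0
        per_disk_cost.append(transfer + seeks)
    # Disks run in parallel, so the slowest disk determines the query time.
    return max(per_disk_cost)


sizes = {"A": 100.0, "B": 100.0}
striped = {"A": {0: 0.5, 1: 0.5}, "B": {0: 0.5, 1: 0.5}}  # both tables on both disks
separated = {"A": {0: 1.0}, "B": {1: 1.0}}                # one table per disk

scan_a = ["A"]           # query reading a single table
join_ab = ["A", "B"]     # query co-accessing both tables, e.g. a merge join

print(query_cost(striped, sizes, scan_a), query_cost(separated, sizes, scan_a))
print(query_cost(striped, sizes, join_ab), query_cost(separated, sizes, join_ab))
```

Under these assumed parameters, striping is cheaper for the single-table scan (both disks share the work), while the separated layout is cheaper for the co-access query (no disk interleaves two objects). A workload mixing both kinds of queries therefore has no layout that is best for every query, which is the trade-off the abstract refers to.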