AutoPart: Automating Schema Design for Large Scientific Databases Using Data Partitioning

  • Authors:
  • Stratos Papadomanolakis;Anastassia Ailamaki

  • Affiliations:
  • Carnegie Mellon University;Carnegie Mellon University

  • Venue:
  • SSDBM '04 Proceedings of the 16th International Conference on Scientific and Statistical Database Management
  • Year:
  • 2004

Abstract

Database applications that use multi-terabyte datasets are becoming increasingly important for scientific fields such as astronomy and biology. Scientific databases are particularly suited for the application of automated physical design techniques, because of their data volume and the complexity of the scientific workloads. Current automated physical design tools focus on the selection of indexes and materialized views. In large-scale scientific databases, however, the data volume and the continuous insertion of new data allow for only limited indexes and materialized views. By contrast, data partitioning does not replicate data, thereby reducing space requirements and minimizing update overhead. In this paper we present AutoPart, an algorithm that automatically partitions database tables to optimize sequential access assuming prior knowledge of a representative workload. The resulting schema is indexed using a fraction of the space required for indexing the original schema. To evaluate AutoPart we built an automated schema design tool that interfaces to commercial database systems. We experiment with AutoPart in the context of the Sloan Digital Sky Survey database, a real-world astronomical database, running on SQL Server 2000. Our experiments demonstrate the benefits of partitioning for large-scale systems: Partitioning alone improves query execution performance by a factor of two on average. Combined with indexes, the new schema also outperforms the indexed original schema by 20% (for queries) and a factor of five (for updates), while using only half the original index space.
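
As a rough illustration of why partitioning can help sequential access for a known workload, the following minimal sketch compares the bytes scanned per row under an unpartitioned table and under a column-wise split. It is not the AutoPart algorithm: the column names, byte widths, workload, choice of fragments, and the cost model (a query reads every fragment that holds at least one column it references, with join and key overheads ignored) are all assumptions made for this example.

```python
from typing import Dict, FrozenSet, List

# Hypothetical column widths in bytes (illustrative only, not the SDSS schema).
COLUMN_WIDTHS: Dict[str, int] = {
    "obj_id": 8, "ra": 8, "dec": 8, "mag_r": 4, "mag_g": 4, "flags": 8,
}

# Hypothetical representative workload: the set of columns each query touches.
WORKLOAD: List[FrozenSet[str]] = [
    frozenset({"ra", "dec"}),
    frozenset({"mag_r", "mag_g"}),
    frozenset({"ra", "dec", "mag_r"}),
]

# Unpartitioned schema: one fragment holding every column.
ORIGINAL: List[FrozenSet[str]] = [frozenset(COLUMN_WIDTHS)]

# One hand-picked column partitioning; each column lives in exactly one fragment.
PARTITIONED: List[FrozenSet[str]] = [
    frozenset({"obj_id"}),
    frozenset({"ra", "dec"}),
    frozenset({"mag_r", "mag_g"}),
    frozenset({"flags"}),
]


def scan_cost(query_columns: FrozenSet[str],
              fragments: List[FrozenSet[str]],
              widths: Dict[str, int]) -> int:
    """Bytes read per row: each fragment that shares a column with the
    query is scanned in full (simplified model, no join or key overhead)."""
    cost = 0
    for fragment in fragments:
        if fragment & query_columns:
            cost += sum(widths[column] for column in fragment)
    return cost


def workload_cost(workload: List[FrozenSet[str]],
                  fragments: List[FrozenSet[str]],
                  widths: Dict[str, int]) -> int:
    """Total bytes read per row across all queries in the workload."""
    return sum(scan_cost(query, fragments, widths) for query in workload)


original_cost = workload_cost(WORKLOAD, ORIGINAL, COLUMN_WIDTHS)
partitioned_cost = workload_cost(WORKLOAD, PARTITIONED, COLUMN_WIDTHS)
print(original_cost, partitioned_cost)
```

In this toy setting the partitioned layout reads fewer bytes per row because queries skip fragments they do not reference. A real design tool would also have to account for the cost of recombining fragments and would search over candidate partitionings rather than use a fixed one.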