Scientific data services: a high-performance I/O system with array semantics

  • Authors:
  • Kesheng Wu;Surendra Byna;Doron Rotem;Arie Shoshani

  • Affiliations:
  • Lawrence Berkeley National Lab, Berkeley, CA, USA;Lawrence Berkeley National Lab, Berkeley, CA, USA;Lawrence Berkeley National Lab, Berkeley, CA, USA;Lawrence Berkeley National Lab, Berkeley, CA, USA

  • Venue:
  • Proceedings of the first annual workshop on High performance computing meets databases
  • Year:
  • 2011

Abstract

As high-performance computing approaches exascale, the existing I/O system design is having trouble keeping pace in both performance and scalability. We propose to address this challenge by adopting database principles and techniques in parallel I/O systems. First, we propose to adopt an array data model because many scientific applications represent their data in arrays. This strategy follows a cardinal principle from database research, which separates the logical view from the physical layout of data. This high-level data model gives the underlying implementation more freedom to optimize the physical layout and to choose the most effective way of accessing the data. For example, knowing that a set of write operations is working on a single multi-dimensional array makes it possible to keep the subarrays in a log structure during the write operations and reassemble them later into another physical layout as resources permit. While maintaining the high-level view, the storage system could compress the user data to reduce the physical storage requirement, collocate data records that are frequently used together, or replicate data to increase availability and fault-tolerance. Additionally, the system could generate secondary data structures such as database indexes and summary statistics. We expect the proposed Scientific Data Services approach to create a "live" storage system that dynamically adjusts to user demands and evolves with the massively parallel storage hardware.
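
The log-structured write example in the abstract can be illustrated with a small sketch. The code below is not the authors' implementation; the class name `LoggedArray` and its methods `write`, `get`, and `reassemble` are hypothetical names chosen for illustration, and the array is restricted to two dimensions and held in memory for brevity. Subarray writes are appended to a log without touching any final layout, reads resolve against the log so the logical array view never changes, and a later pass reassembles the log into a row-major layout.

```python
from typing import List, Tuple

Block = List[List[float]]  # a 2-D subarray stored as a list of equal-length rows


class LoggedArray:
    """Hypothetical 2-D array whose subarray writes are appended to a log."""

    def __init__(self, rows: int, cols: int, fill: float = 0.0) -> None:
        self.rows, self.cols, self.fill = rows, cols, fill
        self.log: List[Tuple[int, int, Block]] = []  # entries: (row0, col0, block)
        self.dense: List[float] = []                 # row-major layout, built on demand

    def write(self, row0: int, col0: int, block: Block) -> None:
        """Append a subarray to the log; no existing data is rewritten."""
        if not block or not block[0] or any(len(r) != len(block[0]) for r in block):
            raise ValueError("block must be a non-empty rectangular subarray")
        height, width = len(block), len(block[0])
        if row0 < 0 or col0 < 0 or row0 + height > self.rows or col0 + width > self.cols:
            raise IndexError("subarray falls outside the array bounds")
        self.log.append((row0, col0, [list(r) for r in block]))

    def get(self, r: int, c: int) -> float:
        """Read one element through the logical view, whatever the physical layout."""
        if not (0 <= r < self.rows and 0 <= c < self.cols):
            raise IndexError("index outside the array bounds")
        # The most recent log entry covering (r, c) wins.
        for row0, col0, block in reversed(self.log):
            if row0 <= r < row0 + len(block) and col0 <= c < col0 + len(block[0]):
                return block[r - row0][c - col0]
        if self.dense:
            return self.dense[r * self.cols + c]
        return self.fill

    def reassemble(self) -> None:
        """Replay the log in write order into a row-major layout, then drop the log."""
        if not self.dense:
            self.dense = [self.fill] * (self.rows * self.cols)
        for row0, col0, block in self.log:
            for i, row in enumerate(block):
                start = (row0 + i) * self.cols + col0
                self.dense[start:start + len(row)] = row
        self.log.clear()
```

Because callers only ever use `get`, the values they observe are the same before and after `reassemble`; only the physical layout changes. A real system would keep the log on storage and reassemble as resources permit, which this in-memory sketch does not attempt to model.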