Integrating parallel file I/O and database support for high-performance scientific data management

Authors:
Jaechun No;Rajeev Thakur;Alok Choudhary
Affiliations:
Math. and Computer Science Division, Argonne National Laboratory, Argonne, IL;Math. and Computer Science Division, Argonne National Laboratory, Argonne, IL;Dept. of Electrical and Computer Eng., Northwestern University, Evanston, IL
Venue:
Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Year:
2000

Abstract

Many scientific applications have large I/O requirements, in terms of both the size of data and the number of files or data sets. Management, storage, efficient access, and analysis of these data present an extremely challenging task. Traditionally, two different solutions are used for this problem: file I/O or databases. File I/O can provide high performance but is tedious to use with large numbers of files and large and complex data sets. Databases can be convenient, flexible, and powerful but do not perform and scale well for parallel supercomputing applications. We have developed a software system, called Scientific Data Manager (SDM), which aims to combine the good features of both file I/O and databases. SDM provides a high-level API to the user and, internally, uses a parallel file system to store real data and a database to store application-related metadata. SDM takes advantage of various I/O optimizations available in MPI-IO, such as collective I/O and noncontiguous requests, in a manner that is transparent to the user. As a result, users can write and retrieve data with the performance of parallel file I/O, without having to bother with the details of actually performing file I/O.

In this paper, we describe the design and implementation of SDM. With the help of two parallel application templates, ASTRO3D and an Euler solver, we illustrate how some of the design criteria affect performance.
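
The split described in the abstract, with bulk data in files and descriptive metadata in a database behind one high-level interface, can be illustrated with a small sketch. This is not SDM's API. The class `TinyDataManager`, its `write` and `read` methods, the table layout, and the dataset names are all hypothetical. Ordinary serial file I/O and SQLite stand in for the parallel file system, MPI-IO, and the database that SDM uses.

```python
import os
import sqlite3
import struct


class TinyDataManager:
    """Hypothetical sketch: raw values in a flat file, metadata in SQLite."""

    def __init__(self, data_path, db_path=":memory:"):
        self.data_path = data_path
        # Make sure the data file exists so its size can be queried.
        open(data_path, "ab").close()
        self.db = sqlite3.connect(db_path)
        # One row per (dataset, timestep): where the values live in the file.
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS datasets ("
            "name TEXT, timestep INTEGER, offset INTEGER, count INTEGER, "
            "PRIMARY KEY (name, timestep))"
        )

    def write(self, name, timestep, values):
        """Append float64 values to the file and record their location."""
        offset = os.path.getsize(self.data_path)
        payload = struct.pack(f"<{len(values)}d", *values)
        with open(self.data_path, "ab") as f:
            f.write(payload)
        self.db.execute(
            "INSERT INTO datasets VALUES (?, ?, ?, ?)",
            (name, timestep, offset, len(values)),
        )
        self.db.commit()

    def read(self, name, timestep):
        """Look up the location in the database, then read from the file."""
        row = self.db.execute(
            "SELECT offset, count FROM datasets "
            "WHERE name = ? AND timestep = ?",
            (name, timestep),
        ).fetchone()
        if row is None:
            raise KeyError((name, timestep))
        offset, count = row
        with open(self.data_path, "rb") as f:
            f.seek(offset)
            raw = f.read(count * 8)  # 8 bytes per float64
        return list(struct.unpack(f"<{count}d", raw))
```

A caller asks for a dataset by name and timestep and never handles file offsets directly, which is the kind of convenience the abstract attributes to the high-level API. In a parallel setting, the file accesses in `write` and `read` would presumably be replaced by collective MPI-IO calls, which this sketch does not attempt.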