Six degrees of scientific data: reading patterns for extreme scale science IO

  • Authors and affiliations:
  • Jay Lofstead (Sandia National Laboratories, Albuquerque, NM, USA)
  • Milo Polte (Carnegie Mellon University, Pittsburgh, PA, USA)
  • Garth Gibson (Carnegie Mellon University, Pittsburgh, PA, USA)
  • Scott Klasky (Oak Ridge National Laboratory, Oak Ridge, TN, USA)
  • Karsten Schwan (Georgia Institute of Technology, Atlanta, GA, USA)
  • Ron Oldfield (Sandia National Laboratories, Albuquerque, NM, USA)
  • Matthew Wolf (Georgia Institute of Technology, Atlanta, GA, USA)
  • Qing Liu (Oak Ridge National Laboratory, Oak Ridge, TN, USA)

  • Venue:
  • Proceedings of the 20th International Symposium on High Performance Distributed Computing
  • Year:
  • 2011

Abstract

Petascale science simulations generate tens of TBs of application data per day, much of it devoted to their checkpoint/restart fault tolerance mechanisms. Previous work demonstrated the importance of carefully managing such output to prevent application slowdown caused by IO blocking and resource contention that degrades simulation performance, and to fully exploit the IO bandwidth available to the petascale machine. This paper takes a further step in understanding and managing extreme-scale IO. Specifically, its evaluations seek to understand how to efficiently read data for subsequent data analysis, visualization, checkpoint restart after a failure, and other read-intensive operations. In their entirety, these actions support the 'end-to-end' needs of scientists, enabling the scientific processes being undertaken. Contributions include the following. First, working with application scientists, we define 'read' benchmarks that capture the common read patterns used by analysis codes. Second, these read patterns are used to evaluate different IO techniques at scale, to understand how alternative data sizes and organizations affect the performance seen by end users. Third, defining the novel notion of a 'data district' to characterize how data is organized for reads, we experimentally compare the read performance of the ADIOS middleware's log-based BP format to that of the logically contiguous NetCDF and HDF5 formats commonly used by analysis tools. Measurements assess the performance seen across patterns and with different data sizes, organizations, and read process counts. Outcomes demonstrate that high end-to-end IO performance requires data organizations that offer flexibility in data layout and placement on parallel storage targets, including ways that trade off the performance of data writes against that of reads.
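To make the notion of a read pattern concrete, the sketch below shows one pattern analysis codes commonly issue against logically contiguous files: extracting a single 2D plane from a 3D variable via an HDF5 hyperslab selection. This is an illustrative sketch only, not the paper's benchmark code; the file name ("restart.h5") and dataset name ("/temperature") are hypothetical.

    /* Illustrative sketch (hypothetical file and dataset names):
     * read one XY plane out of a 3D dataset with an HDF5 hyperslab
     * selection, a typical planar read pattern for analysis codes. */
    #include <hdf5.h>
    #include <stdlib.h>

    int main(void)
    {
        hid_t file   = H5Fopen("restart.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
        hid_t dset   = H5Dopen2(file, "/temperature", H5P_DEFAULT);
        hid_t fspace = H5Dget_space(dset);

        /* Query the dataset extents (assumed rank 3: Z, Y, X). */
        hsize_t dims[3];
        H5Sget_simple_extent_dims(fspace, dims, NULL);

        /* Select the single XY plane at z = dims[0] / 2. */
        hsize_t start[3] = { dims[0] / 2, 0, 0 };
        hsize_t count[3] = { 1, dims[1], dims[2] };
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);

        /* Memory space and buffer sized to hold exactly one plane. */
        hid_t   mspace = H5Screate_simple(3, count, NULL);
        double *plane  = malloc(sizeof(double) * count[1] * count[2]);

        H5Dread(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, plane);

        free(plane);
        H5Sclose(mspace);
        H5Sclose(fspace);
        H5Dclose(dset);
        H5Fclose(file);
        return 0;
    }

With a contiguous NetCDF/HDF5 layout such a plane may map to a single large strided read, whereas against a log-based layout like ADIOS BP the same request is served from the per-writer data chunks touched by the plane; comparing these behaviors across patterns is the kind of tradeoff the paper's measurements quantify.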