Taming parallel I/O complexity with auto-tuning

Authors:
Babak Behzad;Huong Vu Thanh Luu;Joseph Huchette;Surendra Byna;Prabhat;Ruth Aydt;Quincey Koziol;Marc Snir
Affiliations:
University of Illinois at Urbana-Champaign;University of Illinois at Urbana-Champaign;Rice University;Lawrence Berkeley National Laboratory;Lawrence Berkeley National Laboratory;The HDF Group;The HDF Group;University of Illinois at Urbana-Champaign
Venue:
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Year:
2013

Abstract

We present an auto-tuning system that optimizes the I/O performance of HDF5 applications and demonstrate its value across platforms, applications, and scales. The system uses a genetic algorithm to search a large space of tunable parameters and identify effective settings at all layers of the parallel I/O stack. The auto-tuning system applies these settings transparently by dynamically intercepting HDF5 calls. To validate the system, we applied it to three I/O benchmarks (VPIC, VORPAL, and GCRM) that replicate the I/O activity of their respective applications. We tested it with different weak-scaling configurations (128, 2048, and 4096 CPU cores) that generate 30 GB to 1 TB of data, and executed these configurations on diverse HPC platforms (Cray XE6, IBM BG/P, and a Dell cluster). In all cases, the auto-tuning framework identified parameter settings that substantially improved write performance over the default system settings. Across the test configurations, we consistently observed I/O write speedups between 2x and 100x.
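
The genetic-algorithm search described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the parameter names and candidate values in SPACE, the synthetic cost function, and the population settings are hypothetical and are not taken from the paper. A real tuner would measure the write time of an actual benchmark run instead of calling synthetic_write_time, and this sketch does not attempt the HDF5 call interception that the authors use to apply settings.

```python
import random

# Hypothetical search space. Names and values are illustrative assumptions,
# not the parameters or ranges reported by the authors.
SPACE = {
    "stripe_count": [4, 8, 16, 32, 64, 128],
    "stripe_size_mb": [1, 2, 4, 8, 16, 32],
    "cb_nodes": [1, 2, 4, 8, 16, 32],
    "cb_buffer_size_mb": [1, 4, 16, 64, 128],
    "alignment_kb": [1, 64, 1024],
}
KEYS = list(SPACE)


def synthetic_write_time(cfg):
    """Made-up cost model in seconds; stands in for running a benchmark."""
    t = 100.0 / cfg["stripe_count"]
    t += abs(cfg["stripe_size_mb"] - 8) * 0.5
    t += 20.0 / cfg["cb_nodes"]
    t += abs(cfg["cb_buffer_size_mb"] - 16) * 0.05
    t += 0.0 if cfg["alignment_kb"] == 1024 else 2.0
    return t


def random_config(rng):
    """Draw one configuration uniformly from the search space."""
    return {k: rng.choice(SPACE[k]) for k in KEYS}


def crossover(a, b, rng):
    """Uniform crossover: each parameter comes from one of the two parents."""
    return {k: (a[k] if rng.random() < 0.5 else b[k]) for k in KEYS}


def mutate(cfg, rng, rate=0.2):
    """Resample each parameter with probability `rate`."""
    return {k: (rng.choice(SPACE[k]) if rng.random() < rate else v)
            for k, v in cfg.items()}


def tune(generations=15, pop_size=12, elite=2, seed=0):
    """Run the search; return the best configuration and per-generation best times."""
    rng = random.Random(seed)
    pop = [random_config(rng) for _ in range(pop_size)]
    best_per_gen = []
    for _ in range(generations):
        pop.sort(key=synthetic_write_time)
        best_per_gen.append(synthetic_write_time(pop[0]))
        parents = pop[: pop_size // 2]  # fitter half is eligible to breed
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - elite)
        ]
        pop = pop[:elite] + children  # elitism: carry the best forward unchanged
    pop.sort(key=synthetic_write_time)
    return pop[0], best_per_gen


best, history = tune()
print("best configuration:", best)
print("estimated write time (s):", round(synthetic_write_time(best), 2))
```

Because the top configurations are carried over unchanged, the best time recorded in each generation never gets worse. With a real benchmark as the cost function, each evaluation is a full I/O run, so the population size and generation count directly bound the tuning cost.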