SRC: Damaris - using dedicated I/O cores for scalable post-petascale HPC simulations

  • Authors:
  • Matthieu Dorier

  • Affiliations:
  • ENS Cachan Brittany - IRISA, Rennes, France

  • Venue:
  • Proceedings of the International Conference on Supercomputing
  • Year:
  • 2011

Abstract

As we enter the post-petascale era, scientific applications running on large-scale platforms generate increasingly large amounts of data for checkpointing or offline visualization, which puts current storage systems under heavy pressure. Unfortunately, I/O scalability rapidly falls behind the increasing computation power available, thereby reducing the scalability of overall application performance. We consider the common case of large-scale simulations that alternate between computation phases and I/O phases. Two main approaches have been used to handle these I/O phases: 1) each process writes an individual file, leading to a very large number of files from which it is hard to retrieve scientific insight; 2) processes synchronize and use collective I/O to write to the same shared file. In both cases, because of mandatory communications between processes during the computation phase, all processes enter the I/O phase at the same time, which leads to heavy access contention and extreme performance variability. Previous research efforts have focused on improving each layer of the I/O stack separately: at the highest level, scientific data formats like HDF5 make it possible to keep a high degree of semantics within files while leveraging MPI-IO optimizations. Parallel file systems like GPFS or PVFS are also the subject of optimization efforts, as they usually represent the main bottleneck of this I/O stack. As a step forward, we introduce Damaris (Dedicated Adaptable Middleware for Application Resources Inline Steering), an approach targeting large-scale multicore SMP supercomputers. The main idea is to dedicate one or a few cores on each node to I/O and data processing in order to provide an efficient, scalable-by-design, in-compute-node data processing service. Damaris takes into account user-provided information related to the application, the file system and the intended use of the datasets to better schedule data transfers and processing. It can also respond to visualization tools, allowing in situ visualization without impacting the simulation. We tested our implementation of Damaris as an I/O backend for the CM1 atmospheric model, one of the applications intended to run on the next-generation Blue Waters supercomputer at NCSA. CM1 is a typical MPI application, originally writing one file per process at each checkpoint using HDF5. Deployed on 1024 cores on BluePrint, the Blue Waters interim system at NCSA with GPFS as the underlying file system, this approach induces up to 10 seconds of overhead in checkpointing phases occurring every 2 minutes, with high variability in the time spent by each process to write its data (from 1 to 10 seconds). Using one dedicated I/O core in each 16-core SMP node, we completely remove this overhead. Moreover, the time freed up on the dedicated core enables a better compression level, reducing both the number of files produced (by a factor of 16) and the total data size. Experiments conducted on the French Grid'5000 testbed with PVFS as the underlying file system and a 24-core-per-node cluster highlight the benefit of our approach, which allows communication and computation to overlap in a context involving high network contention at multiple levels.
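
To make the core idea concrete, the following is a minimal, hypothetical sketch of the dedicated-I/O-core pattern described in the abstract, written in C with MPI. It is not the Damaris API: Damaris hands data to the dedicated core through shared memory and schedules its behavior from a user-provided configuration, whereas this sketch simply groups the ranks of each node with the MPI-3 MPI_Comm_split_type call and sends checkpoints to node-local rank 0 over MPI messages. All sizes and names (FIELD_SIZE, N_STEPS) are illustrative.

```c
/* Hypothetical sketch of the dedicated-I/O-core pattern: on each SMP node,
 * node-local rank 0 acts as the I/O core while the remaining ranks compute.
 * This is NOT the Damaris API; Damaris passes data through shared memory,
 * whereas this sketch uses plain MPI messages for brevity. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define FIELD_SIZE 1024   /* elements per checkpointed field (illustrative) */
#define N_STEPS    4      /* number of compute/checkpoint cycles            */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Group the ranks that share a node (MPI-3 call, used here for brevity). */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    int node_rank, node_size;
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);

    double *field = malloc(FIELD_SIZE * sizeof(double));

    if (node_rank == 0) {
        /* Dedicated I/O core: collect fields from the compute ranks and write
         * them, so that the write overlaps the next compute phase elsewhere. */
        for (int step = 0; step < N_STEPS; step++) {
            for (int src = 1; src < node_size; src++) {
                MPI_Recv(field, FIELD_SIZE, MPI_DOUBLE, src, step,
                         node_comm, MPI_STATUS_IGNORE);
                /* ... aggregate, compress and write one file per node ... */
            }
            printf("I/O core (world rank %d): step %d written\n",
                   world_rank, step);
        }
    } else {
        /* Compute cores: hand the checkpoint to the I/O core and keep going. */
        for (int step = 0; step < N_STEPS; step++) {
            for (int i = 0; i < FIELD_SIZE; i++)       /* compute phase */
                field[i] = step + 0.001 * i;
            MPI_Send(field, FIELD_SIZE, MPI_DOUBLE, 0, step, node_comm);
        }
    }

    free(field);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```

In a real deployment, the compute ranks would place their data in a shared-memory segment (or use non-blocking sends) so that they return to computation immediately; that is what lets the checkpointing work on the dedicated core overlap the next computation phase, as the abstract describes.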