In-system solid state storage is expected to be an important component of the I/O subsystem on the first exascale platforms, as it has the potential to reduce DRAM requirements, to increase system reliability, and to smooth I/O loads. This paper describes the design of a prototype, integrated in-system storage architecture that we are developing to serve the diverse needs of high performance computing. Our container abstraction will provide lightweight management of in-system storage devices, as well as methods to access containers remotely and to transfer them within the storage hierarchy. We are also working on a storage hierarchy abstraction API to provide portable HPC I/O software with the critical information on the configuration of the system on which it is running. As currently available large-scale HPC systems lack in-system storage, we are developing a solid state storage simulator backed by DRAM. We are integrating these efforts around an I/O-intensive workload provided by the scalable checkpoint/restart (SCR) library. We expect our efforts to reduce the overheads of checkpointing and data movement across the system and thus to improve the scalability and reliability of HPC applications.