Functional Partitioning to Optimize End-to-End Performance on Many-core Architectures

Authors:
Min Li; Sudharshan S. Vazhkudai; Ali R. Butt; Fei Meng; Xiaosong Ma; Youngjae Kim; Christian Engelmann; Galen Shipman
Venue:
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Year:
2010

Abstract

Scaling computations on emerging massive-core supercomputers is a daunting task, and system I/O capabilities that lag significantly behind compute further degrade applications' end-to-end performance. The I/O bottleneck often negates the potential performance benefits of assigning additional compute cores to an application. In this paper, we address this issue via a novel functional partitioning (FP) runtime environment that allocates cores to specific application tasks (checkpointing, de-duplication, and scientific data format transformation) so that the deluge of cores can be brought to bear on the entire gamut of application activities. The focus is on using the extra cores to support HPC application I/O activities and on leveraging solid-state disks in this context. For example, our evaluation shows that using FP to dedicate 1 core on an oct-core machine to checkpointing and its assist tasks can improve the overall execution time of a FLASH benchmark on 80 and 160 cores by 43.95% and 41.34%, respectively.
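
The core-allocation idea in the abstract can be illustrated with a minimal sketch. This is not the paper's FP runtime: the function name `partition_cores` and its parameters are hypothetical, and the choice to reserve the highest-numbered core IDs for helper tasks is an assumption made only for illustration.

```python
from typing import List, Tuple


def partition_cores(cores_per_node: int, helper_cores: int) -> Tuple[List[int], List[int]]:
    """Split a node's core IDs into compute cores and helper cores.

    Hypothetical helper, not part of any FP API. Helper cores would run
    assist tasks such as checkpointing; the remaining cores run the
    application itself.
    """
    if not 0 <= helper_cores < cores_per_node:
        raise ValueError("helper_cores must leave at least one compute core")
    split = cores_per_node - helper_cores
    core_ids = list(range(cores_per_node))
    # Assumption: the last `helper_cores` IDs are reserved for assist tasks.
    return core_ids[:split], core_ids[split:]


# The abstract's example: 1 of 8 cores on an oct-core machine set aside.
compute, helpers = partition_cores(cores_per_node=8, helper_cores=1)
print(compute)  # [0, 1, 2, 3, 4, 5, 6]
print(helpers)  # [7]
```

In a real deployment, the resulting ID lists would typically feed a CPU-affinity mechanism so that application and assist processes stay on their assigned cores; how the paper's runtime does this is not stated in the abstract.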