Programming heterogeneous clusters with accelerators using object-based programming

Authors:
David M. Kunzman; Laxmikant V. Kalé
Affiliations:
University of Illinois (UIUC), Department of Computer Science, 201 N. Goodwin Ave, Urbana, IL 61801, USA (corresponding author; Tel.: +1 217 333 5827; Fax: +1 217 244 6306; E-mail: kunzman2@illinois.edu); Department of Computer Science, University of Illinois, Urbana, IL, USA
Venue:
Scientific Programming
Year:
2011



Abstract

Heterogeneous clusters that include accelerators have become more common in high performance computing because of the high GFlop/s rates such clusters can achieve. However, heterogeneous clusters are typically considered hard to program because they usually require programmers to interleave architecture-specific code with application code. We have extended the Charm++ programming model and runtime system to support heterogeneous clusters (with host cores that differ in their architecture) that include accelerators. We are currently focusing on clusters that include commodity processors, Cell processors, and Larrabee devices. Code developed with our extensions is portable between various homogeneous and heterogeneous clusters that may or may not include accelerators. Using a simple example molecular dynamics (MD) code, we demonstrate our programming model extensions and runtime system modifications on a heterogeneous cluster composed of Xeon and Cell processors. Even though the example MD program contains no architecture-specific code, it successfully makes use of three core types, each with a different ISA (Xeon, PPE, SPE), three SIMD instruction extensions (SSE, AltiVec/VMX, and the SPE's SIMD instructions), and two memory models (cache hierarchies and scratchpad memories) in a single execution. Our programming model extensions abstract away hardware complexities, while our runtime system modifications automatically adjust application data to account for architectural differences between the various cores.
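
As a rough illustration of what adjusting application data for architectural differences can involve, the sketch below handles one such difference: byte order. x86 Xeon cores are little-endian, while the Cell's PPE and SPE cores are big-endian, so a buffer of doubles written by one must be byte-swapped before the other can read it. This is not the Charm++ runtime's code or API. The names `ByteOrder`, `hostByteOrder`, `adjustDoubles`, and `readDouble` are hypothetical, and the sketch assumes the payload holds only packed IEEE 754 64-bit doubles.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical tag describing the byte order of the core that wrote a message.
enum class ByteOrder { Little, Big };

// Detect the byte order of the core running this code by inspecting
// the first byte in memory of a 16-bit integer holding the value 1.
inline ByteOrder hostByteOrder() {
    const std::uint16_t probe = 1;
    unsigned char first = 0;
    std::memcpy(&first, &probe, 1);
    return first == 1 ? ByteOrder::Little : ByteOrder::Big;
}

// Convert a payload of packed 64-bit doubles from the sender's byte order
// to the host's byte order, in place. Assumption: the payload contains
// nothing but doubles, so every 8-byte group can be reversed blindly.
inline void adjustDoubles(std::vector<unsigned char>& payload, ByteOrder sender) {
    assert(payload.size() % sizeof(std::uint64_t) == 0);
    if (sender == hostByteOrder()) {
        return;  // same byte order on both sides: nothing to adjust
    }
    for (std::size_t off = 0; off < payload.size(); off += sizeof(std::uint64_t)) {
        auto first = payload.begin() + static_cast<std::ptrdiff_t>(off);
        std::reverse(first, first + static_cast<std::ptrdiff_t>(sizeof(std::uint64_t)));
    }
}

// Read the index-th double from a payload that is already in host byte order.
inline double readDouble(const std::vector<unsigned char>& payload, std::size_t index) {
    double value = 0.0;
    std::memcpy(&value, payload.data() + index * sizeof value, sizeof value);
    return value;
}
```

In this sketch the receiver calls `adjustDoubles(payload, senderOrder)` once on arrival and then reads values with `readDouble`. Application code never branches on the architecture, which is the kind of separation the abstract attributes to the runtime system.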