OpenMP: An Industry-Standard API for Shared-Memory Programming
IEEE Computational Science & Engineering
Optimizing Compiler for the CELL Processor
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
MPI Microtask for programming the cell broadband engineTM processor
IBM Systems Journal
HeteroMPI: Towards a message-passing library for heterogeneous networks of computers
Journal of Parallel and Distributed Computing
Introduction to the cell multiprocessor
IBM Journal of Research and Development - POWER5 and packaging
Sequoia: programming the memory hierarchy
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Sequoia: programming the memory hierarchy
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Larrabee: a many-core x86 architecture for visual computing
ACM SIGGRAPH 2008 papers
IBM Journal of Research and Development
Entering the petaflop era: the architecture and performance of Roadrunner
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Hierarchical Task-Based Programming With StarSs
International Journal of High Performance Computing Applications
Exploiting Locality on the Cell/B.E. through Bypassing
SAMOS '09 Proceedings of the 9th International Workshop on Embedded Computer Systems: Architectures, Modeling, and Simulation
Exploiting the Cell/BE Architecture with the StarPU Unified Runtime System
SAMOS '09 Proceedings of the 9th International Workshop on Embedded Computer Systems: Architectures, Modeling, and Simulation
An Extension of the StarSs Programming Model for Platforms with Multiple GPUs
Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
Towards a framework for abstracting accelerators in parallel applications: experience with cell
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Scaling applications to massively parallel machines using Projections performance analysis tool
Future Generation Computer Systems
Vc: A C++ library for explicit vectorization
Software—Practice & Experience
Hi-index | 0.00 |
Heterogeneous clusters that include accelerators have become more common in the realm of high performance computing because of the high GFlop/s rates such clusters are capable of achieving. However, heterogeneous clusters are typically considered hard to program as they usually require programmers to interleave architecture-specific code within application code. We have extended the Charm++ programming model and runtime system to support heterogeneous clusters (with host cores that differ in their architecture) that include accelerators. We are currently focusing on clusters that include commodity processors, Cell processors, and Larrabee devices. When our extensions are used to develop code, the resulting code is portable between various homogeneous and heterogeneous clusters that may or may not include accelerators. Using a simple example molecular dynamics (MD) code, we demonstrate our programming model extensions and runtime system modifications on a heterogeneous cluster comprised of Xeon and Cell processors. Even though there is no architecture-specific code in the example MD program, it is able to successfully make use of three core types, each with a different ISA (Xeon, PPE, SPE), three SIMD instruction extensions (SSE, AltiVec/VMX and the SPE's SIMD instructions), and two memory models (cache hierarchies and scratchpad memories) in a single execution. Our programming model extensions abstract away hardware complexities while our runtime system modifications automatically adjust application data to account for architectural differences between the various cores.