The Cell Broadband Engine is a heterogeneous multicore processor for high-performance computing and gaming. Its architecture allows for an impressive peak performance but, at the same time, makes it very hard to write efficient code. The programmer must simultaneously exploit SIMD instructions, coordinate parallel execution of the slave processors, overlap DMA memory traffic with computation, keep data properly aligned in memory, and explicitly manage the very small on-chip memory buffers of the slave processors, all of which leads to very complex code. In this work, we adopt the skeleton programming approach to abstract from much of the complexity of Cell programming while maintaining high performance. The abstraction is achieved through a library of parallel generic building blocks, called BlockLib. Macro-based generative programming is used to reduce the overhead of genericity in skeleton functions and to control code size expansion. We demonstrate the use of the library with a parallel ODE solver application. Our experimental results show that BlockLib code achieves performance close to that of hand-written code and even outperforms the native IBM BLAS library in cases where several slave processors are used.
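To illustrate what macro-based generation of skeleton functions can look like, the following is a minimal sketch in plain C. It is not BlockLib's actual API: the macro names `DEFINE_MAP2` and `DEFINE_REDUCE`, the operand names `a` and `b`, and the generated functions are all hypothetical, chosen for illustration only. The sketch runs sequentially on a host CPU and leaves out everything Cell-specific (SIMD, DMA, local-store management, and work distribution across slave processors). It only shows how a macro can paste the user's operation directly into the loop body, so that the generic skeleton costs no function-pointer call per element.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical generator macro (an assumption, not BlockLib's API).
 * Expands to a specialized element-wise "map" over two input arrays.
 * The operation `expr` is pasted into the loop body; inside `expr`,
 * `a` and `b` denote the current elements of x and y. */
#define DEFINE_MAP2(name, type, expr)                                   \
    static void name(const type *x, const type *y, type *out, size_t n) \
    {                                                                   \
        assert(n == 0 || (x && y && out));                              \
        for (size_t i = 0; i < n; ++i) {                                \
            const type a = x[i];                                        \
            const type b = y[i];                                        \
            out[i] = (expr);                                            \
        }                                                               \
    }

/* Hypothetical generator macro for a "reduce" skeleton.
 * Inside `expr`, `a` is the running accumulator and `b` the next element.
 * Returns `init` unchanged when n == 0. */
#define DEFINE_REDUCE(name, type, init, expr)                           \
    static type name(const type *x, size_t n)                           \
    {                                                                   \
        assert(n == 0 || x);                                            \
        type acc = (init);                                              \
        for (size_t i = 0; i < n; ++i) {                                \
            const type a = acc;                                         \
            const type b = x[i];                                        \
            acc = (expr);                                               \
        }                                                               \
        return acc;                                                     \
    }

/* Each instantiation generates one specialized function. */
DEFINE_MAP2(vec_add, float, a + b)
DEFINE_MAP2(vec_mul, float, a * b)
DEFINE_REDUCE(vec_sum, float, 0.0f, a + b)

/* Dot product composed from the generated skeleton instances.
 * `tmp` is caller-provided scratch space of at least n elements. */
static float dot(const float *x, const float *y, float *tmp, size_t n)
{
    vec_mul(x, y, tmp, n);
    return vec_sum(tmp, n);
}
```

Each instantiation produces a separate function, so code size grows with the number of distinct operations rather than with the number of call sites. This is one plausible reading of how a macro-based approach can trade genericity overhead against code size expansion; the actual mechanism used by the library may differ.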