Scalability and locality of extrapolation methods for distributed-memory architectures
Euro-Par'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part II
Optimized dense matrix multiplication on a many-core architecture
Euro-Par'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part II
Locality optimization of stencil applications using data dependency graphs
LCPC'10 Proceedings of the 23rd international conference on Languages and compilers for parallel computing
Data layout transformation for stencil computations on short-vector SIMD architectures
CC'11/ETAPS'11 Proceedings of the 20th international conference on Compiler construction: part of the joint European conferences on theory and practice of software
Proceedings of the 9th conference on Computing Frontiers
A stencil compiler for short-vector SIMD architectures
Proceedings of the 27th international ACM conference on International conference on supercomputing
Hi-index | 0.00 |
This paper reports a study of mapping the Finite Difference Time Domain (FDTD) application to the IBM Cyclops-64 (C64) many-core chip architecture [1]. C64 is chosen for this study as it represents the current trend in computer architecture to develop a class of many-core architectures with distinct features e.g. software manageable on-chip memory hierarchy (vs. a hardware-managed data cache), high on-chip bandwidth, fine grain multithreading and synchronization, among others. Major results of our study include: 1. A good mapping of FDTD can effectively exploit the on-chip parallelism of C64-like architectures and show good performance and scalability. 2. Such performance improvement is derived by employing a number of code optimization techniques such as time skewing and split tiling that judiciously exploit the architecture features described in (1). 3. High performance requires maximum reuse of on-chip memory, which is obtained by tiling with non conventional tile shapes. 4. Such code optimization techniques we used in (2) and tiling such as the one used in (3) should be implementable within a reasonable compilation framework, opening a new set of possibilities for compiler optimizations.