Mapping the FDTD Application to Many-Core Chip Architectures

Authors:
Daniel A. Orozco;Guang R. Gao
Affiliations:
-;-
Venue:
ICPP '09 Proceedings of the 2009 International Conference on Parallel Processing
Year:
2009

Citing 0
Cited 6

Scalability and locality of extrapolation methods for distributed-memory architectures

Euro-Par'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part II
Optimized dense matrix multiplication on a many-core architecture

Euro-Par'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part II
Locality optimization of stencil applications using data dependency graphs

LCPC'10 Proceedings of the 23rd international conference on Languages and compilers for parallel computing
Data layout transformation for stencil computations on short-vector SIMD architectures

CC'11/ETAPS'11 Proceedings of the 20th international conference on Compiler construction: part of the joint European conferences on theory and practice of software
Dynamic percolation: a case of study on the shortcomings of traditional optimization in many-core architectures

Proceedings of the 9th conference on Computing Frontiers
A stencil compiler for short-vector SIMD architectures

Proceedings of the 27th international ACM conference on International conference on supercomputing

Quantified Score

Hi-index	0.00

Visualization

Abstract

This paper reports a study of mapping the Finite Difference Time Domain (FDTD) application to the IBM Cyclops-64 (C64) many-core chip architecture [1]. C64 is chosen for this study as it represents the current trend in computer architecture to develop a class of many-core architectures with distinct features e.g. software manageable on-chip memory hierarchy (vs. a hardware-managed data cache), high on-chip bandwidth, fine grain multithreading and synchronization, among others. Major results of our study include: 1. A good mapping of FDTD can effectively exploit the on-chip parallelism of C64-like architectures and show good performance and scalability. 2. Such performance improvement is derived by employing a number of code optimization techniques such as time skewing and split tiling that judiciously exploit the architecture features described in (1). 3. High performance requires maximum reuse of on-chip memory, which is obtained by tiling with non conventional tile shapes. 4. Such code optimization techniques we used in (2) and tiling such as the one used in (3) should be implementable within a reasonable compilation framework, opening a new set of possibilities for compiler optimizations.