DBDB: optimizing DMATransfer for the cell be architecture

Authors:
Tao Liu;Haibo Lin;Tong Chen;John Kevin O'Brien;Ling Shao
Affiliations:
IBM China Research Laboratory, Beijing, China;IBM China Research Laboratory, Beijing, China;IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA;IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA;IBM China Research Laboratory, Beijing, China
Venue:
Proceedings of the 23rd international conference on Supercomputing
Year:
2009

Citing 21
Cited 6

Fine-grain access control for distributed shared memory

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
A Loop Transformation Algorithm for Communication Overlapping

International Journal of Parallel Programming - Special issue on international symposium on high performance computing 1997, part I
A synthesis of memory mechanisms for distributed architectures

ICS '01 Proceedings of the 15th international conference on Supercomputing
Cool-cache for hot multimedia

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Removing the overhead from software-based shared memory

Proceedings of the 2001 ACM/IEEE conference on Supercomputing
Scratchpad memory: design alternative for cache on-chip memory in embedded systems

Proceedings of the tenth international symposium on Hardware/software codesign
CRL: High Performance All-Software Distributed Shared Memory

CRL: High Performance All-Software Distributed Shared Memory
Communication Optimizations for Fine-Grained UPC Applications

Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
HUNTing the Overlap

Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Chip multiprocessing and the cell broadband engine

Proceedings of the 3rd conference on Computing frontiers
Using advanced compiler technology to exploit the performance of the Cell Broadband EngineTM architecture

IBM Systems Journal
Cell Multiprocessor Communication Network: Built for Speed

IEEE Micro
Sequoia: programming the memory hierarchy

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Sequoia: programming the memory hierarchy

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Compilation for explicitly managed memory hierarchies

Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming
Prefetching irregular references for software cache on cell

Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization
Orchestrating data transfer for the cell/B.E. processor

Proceedings of the 22nd annual international conference on Supercomputing
Hybrid access-specific software cache techniques for the cell BE architecture

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
COMIC: a coherent shared memory interface for cell be

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
A comparison of programming models for multiprocessors with explicitly managed memory hierarchies

Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Optimizing the use of static buffers for DMA on a CELL chip

LCPC'06 Proceedings of the 19th international conference on Languages and compilers for parallel computing

An OpenCL framework for heterogeneous multicores with local memory

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
DMATiler: revisiting loop tiling for direct memory access

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Strider: Runtime Support for Optimizing Strided Data Accesses on Multi-Cores with Explicitly Managed Memories

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
TL-DAE: thread-level decoupled access/execution for OpenMP on the cyclops-64 many-core processor

LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing
Tagged procedure calls (TPC): efficient runtime support for task-based parallelism on the cell processor

HiPEAC'10 Proceedings of the 5th international conference on High Performance Embedded Architectures and Compilers
DMA-circular: an enhanced high level programmable DMA controller for optimized management of on-chip local memories

Proceedings of the 9th conference on Computing Frontiers

Quantified Score

Hi-index	0.00

Visualization

Abstract

In heterogeneous multi-core systems, such as the Cell BE or certain embedded systems, the accelerator core has its own fast local memory without hardware supported coherence. It is software's responsibility to dynamically transfer the working set when the total data set is too large to fit in the local memory. The data can be transferred through a software controlled cache which maintains correctness and exploits reuse among references, especially when complicated aliasing or data dependence exists. However, the software controlled cache introduces the extra overhead of cache lookup. In this paper we present the design and implementation of a Direct Blocking Data Buffer (DBDB) which combines compiler analysis and runtime management to optimize local memory utilization. We use compile time analysis to identify regular references in a loop body, block the innermost loop according to the access patterns and available local memory space, insert DMA operations for the blocked loop, and substitute references to local buffers. The runtime is responsible for allocating local memory for DBDB, especially for disambiguating aliased memory accesses which could not be resolved at compile time. We further optimize noncontiguous references by taking advantage of the DMA-list feature provided by the Cell BE. A practical performance model is presented to guide the DMA transfer scheme selection among single-DMA, multi-DMA and DMA-list. We have implemented DBDB in the IBM XL C/C++ for Multicore Acceleration for Linux, and have conducted experiments with selected test cases from the NAS OpenMP and SPEC benchmarks. The results show that our method performs well compared with traditional software cache approach. We have observed a speedup of up to 5.3x and 4x in average.