In heterogeneous multi-core systems, such as the Cell/B.E. or certain embedded systems, each accelerator core has its own fast local memory, with no hardware-supported coherence between the local and global memories. It is software's responsibility to dynamically transfer the working set into local memory when the total data set is too large to fit there. The data can be transferred through either a software-controlled cache or a direct buffer. A software cache maintains correctness and exploits reuse among references, especially in the presence of complicated aliasing or data dependences, but it introduces the extra overhead of cache lookup on every access. Direct buffering, on the other hand, is fast but is limited by the compiler's ability to disambiguate memory references. It is therefore desirable to use both methods judiciously, for irregular and regular accesses respectively. However, when a datum resides in both the software cache and a direct buffer, coherence problems arise. In this paper, we propose a solution that combines compile-time analysis with runtime maintenance to address this coherence issue. Compiler analysis guarantees that there is no software-cache access within the local live range of a direct buffer, and runtime support updates values between the software cache and the buffer at the buffer's entry and exit. Further, we present a global data-flow analysis that eliminates redundant coherence maintenance, and we overlap computation with DMA transfers to reduce runtime overhead. We have implemented this method in our Single Source Compiler for Cell and evaluated it with the NAS OpenMP benchmarks. The results show that our method maintains correctness while preserving most of the opportunities for direct buffering. Execution performance improves by more than 3x compared to approaches using only the software cache. Furthermore, compile-time analysis eliminates 90% of the runtime updates, yielding a further 20% performance improvement.