Hybrid access-specific software cache techniques for the Cell BE architecture

Authors:
Marc Gonzàlez;Nikola Vujic;Xavier Martorell;Eduard Ayguadé;Alexandre E. Eichenberger;Tong Chen;Zehra Sura;Tao Zhang;Kevin O'Brien;Kathryn O'Brien
Affiliations:
Barcelona Supercomputing Center, Barcelona, Spain;Barcelona Supercomputing Center, Barcelona, Spain;Barcelona Supercomputing Center, Barcelona, Spain;Barcelona Supercomputing Center, Barcelona, Spain;T.J. Watson IBM Research Center, Yorktown Heights, NY, USA;T.J. Watson IBM Research Center, Yorktown Heights, NY, USA;T.J. Watson IBM Research Center, Yorktown Heights, NY, USA;T.J. Watson IBM Research Center, Yorktown Heights, NY, USA;T.J. Watson IBM Research Center, Yorktown Heights, NY, USA;T.J. Watson IBM Research Center, Yorktown Heights, NY, USA
Venue:
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Year:
2008

Abstract

Difficulty of programming is one of the main impediments to the broad acceptance of multi-core systems with no hardware support for transparent data transfer between local and global memories. A software cache is a robust approach to providing the user with a transparent view of the memory architecture, but this software approach can suffer from poor performance. In this paper, we propose a hierarchical, hybrid software-cache architecture that classifies memory accesses at compile time into two classes: high-locality and irregular. Our approach then steers the memory references toward one of two specific cache structures, each optimized for its respective access pattern. The specific cache structures are designed so that high-level compiler optimizations can aggressively unroll loops, reorder cache references, and/or transform surrounding loops so as to practically eliminate the software cache overhead in the innermost loop. Performance evaluation indicates that the improvements due to the optimized software-cache structures, combined with the proposed code optimizations, translate into speedup factors of 3.5 to 8.4 compared to a traditional software cache approach. As a result, we demonstrate that the Cell BE processor can be a competitive alternative to a modern server-class multi-core such as the IBM Power5 processor for a set of parallel NAS applications.
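
The abstract does not describe the two cache structures in detail, so the following is only a toy model of the general idea, not the authors' design. It assumes a hypothetical `GlobalMemory` that counts line transfers (a stand-in for DMA), a generic direct-mapped software cache that pays a lookup on every access (the irregular path), and a hypothetical `HighLocalityBuffer` whose lookup is hoisted out of the inner loop (the high-locality path). All class names, function names, and sizes are illustrative assumptions, and the choice of path is made by hand here rather than by a compiler.

```python
LINE_WORDS = 4  # words per cache line (assumed value for illustration)


class GlobalMemory:
    """Stand-in for global memory; counts line transfers as a DMA proxy."""

    def __init__(self, size):
        self.data = list(range(size))  # word at address a holds the value a
        self.transfers = 0

    def fetch_line(self, line_no):
        self.transfers += 1
        base = line_no * LINE_WORDS
        return self.data[base:base + LINE_WORDS]


class DirectMappedCache:
    """Generic software cache: every single access pays a tag check."""

    def __init__(self, mem, num_sets):
        self.mem = mem
        self.num_sets = num_sets
        self.tags = [None] * num_sets
        self.lines = [None] * num_sets
        self.lookups = 0

    def read(self, addr):
        self.lookups += 1
        line_no, offset = divmod(addr, LINE_WORDS)
        s = line_no % self.num_sets
        if self.tags[s] != line_no:  # miss: bring the line in
            self.lines[s] = self.mem.fetch_line(line_no)
            self.tags[s] = line_no
        return self.lines[s][offset]


class HighLocalityBuffer:
    """Hypothetical structure: one lookup per line, then direct reads."""

    def __init__(self, mem):
        self.mem = mem
        self.lookups = 0

    def get_line(self, line_no):
        self.lookups += 1
        return self.mem.fetch_line(line_no)


def sum_sequential(mem, n):
    """High-locality path; n is assumed to be a multiple of LINE_WORDS."""
    buf = HighLocalityBuffer(mem)
    total = 0
    for line_no in range(n // LINE_WORDS):
        line = buf.get_line(line_no)  # cache code runs once per line
        for word in line:             # inner loop contains no cache code
            total += word
    return total, buf.lookups


def sum_indirect(mem, indices):
    """Irregular path: each reference goes through the generic cache."""
    cache = DirectMappedCache(mem, num_sets=8)
    total = 0
    for i in indices:
        total += cache.read(i)
    return total, cache.lookups
```

In this model, summing 64 sequential words costs 16 lookups on the high-locality path instead of 64, while an indirect access stream such as `[3, 40, 3, 17]` still pays one lookup per reference. That difference in per-access overhead is the effect the abstract attributes to steering each class of reference to a structure suited to it.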