A Novel Asynchronous Software Cache Implementation for the Cell-BE Processor

Authors:
Jairo Balart;Marc Gonzalez;Xavier Martorell;Eduard Ayguade;Zehra Sura;Tong Chen;Tao Zhang;Kevin O'Brien;Kathryn O'Brien
Affiliations:
Barcelona Supercomputing Center (BSC), Technical University of Catalunya (UPC);Barcelona Supercomputing Center (BSC), Technical University of Catalunya (UPC);Barcelona Supercomputing Center (BSC), Technical University of Catalunya (UPC);Barcelona Supercomputing Center (BSC), Technical University of Catalunya (UPC);IBM TJ Watson Research Center;IBM TJ Watson Research Center;IBM TJ Watson Research Center;IBM TJ Watson Research Center;IBM TJ Watson Research Center
Venue:
Languages and Compilers for Parallel Computing
Year:
2007

Abstract

This paper describes the implementation of a runtime library for asynchronous communication on the Cell BE processor. The library provides several services that allow the compiler to generate code that maximizes the opportunities for overlapping communication and computation. It is organized as a software cache, and its main services are mechanisms for data lookup, data placement and replacement, data write-back, memory synchronization, and address translation. The implementation guarantees that all of these services can be fully decoupled when handling memory references, which gives the compiler the freedom to arrange the generated code so that computation overlaps with communication as much as possible. The paper also describes the mechanism needed to overlap the communication caused by write-back operations with actual computation, and it presents the compiler's basic algorithms and optimizations for code generation. The system is evaluated by measuring bandwidth and global updates ratios with two benchmarks from the HPCC benchmark suite: Stream and Random Access.
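
To make the idea of decoupled cache services concrete, the following is a minimal host-side sketch in C, not the library described in the paper. All names (sc_lookup, sc_place, sc_sync, sc_translate, sc_write_back, sc_flush, sc_mark_dirty, sc_demo_sum), the 128-byte line size, and the four-line direct-mapped organization are assumptions made for illustration. Asynchronous transfers are only modeled: "issuing" a fetch marks the line as pending, and the copy happens when the line is synchronized. Write-back is performed immediately in this model, whereas the paper describes overlapping it with computation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 128u  /* assumed line size (hypothetical) */
#define NUM_LINES  4u    /* tiny direct-mapped cache, for illustration only */

typedef struct {
    uint32_t  words[LINE_BYTES / 4]; /* local copy of one line */
    uintptr_t tag;                   /* global address of the line's first byte */
    int valid, dirty, pending;       /* pending: fetch issued but not completed */
} sc_line;

typedef struct { sc_line lines[NUM_LINES]; unsigned hits, misses; } sc_cache;

static uintptr_t line_base(uintptr_t g) { return g - g % LINE_BYTES; }
static unsigned  slot_of(uintptr_t base) { return (unsigned)((base / LINE_BYTES) % NUM_LINES); }

void sc_init(sc_cache *c) { memset(c, 0, sizeof *c); }

/* Write-back: copy a dirty line to global memory (synchronous in this model). */
void sc_write_back(sc_line *l) {
    if (l->valid && l->dirty) { memcpy((void *)l->tag, l->words, LINE_BYTES); l->dirty = 0; }
}

/* Lookup: return the slot holding gaddr, or -1 on a miss. Moves no data. */
int sc_lookup(sc_cache *c, uintptr_t gaddr) {
    uintptr_t base = line_base(gaddr);
    unsigned idx = slot_of(base);
    if (c->lines[idx].valid && c->lines[idx].tag == base) { c->hits++; return (int)idx; }
    c->misses++;
    return -1;
}

/* Placement/replacement: evict the victim and issue the fetch without waiting. */
int sc_place(sc_cache *c, uintptr_t gaddr) {
    uintptr_t base = line_base(gaddr);
    sc_line *l = &c->lines[slot_of(base)];
    sc_write_back(l);
    l->tag = base; l->valid = 1; l->dirty = 0; l->pending = 1;
    return (int)slot_of(base);
}

/* Synchronization: complete the pending transfer of one slot, if any. */
void sc_sync(sc_cache *c, int idx) {
    sc_line *l = &c->lines[idx];
    if (l->pending) { memcpy(l->words, (const void *)l->tag, LINE_BYTES); l->pending = 0; }
}

/* Address translation: map a global address to its local copy. */
void *sc_translate(sc_cache *c, int idx, uintptr_t gaddr) {
    sc_line *l = &c->lines[idx];
    assert(l->valid && !l->pending && line_base(gaddr) == l->tag);
    return (unsigned char *)l->words + (gaddr - l->tag);
}

void sc_mark_dirty(sc_cache *c, int idx) { c->lines[idx].dirty = 1; }

void sc_flush(sc_cache *c) {
    for (unsigned i = 0; i < NUM_LINES; i++) sc_write_back(&c->lines[i]);
}

static int sc_acquire(sc_cache *c, uintptr_t gaddr) {
    int idx = sc_lookup(c, gaddr);
    return idx >= 0 ? idx : sc_place(c, gaddr);
}

/* Sum n values, issuing the fetch for the next line before computing on the
   current one. Assumes global is LINE_BYTES-aligned and covers whole lines. */
long sc_demo_sum(sc_cache *c, const int32_t *global, size_t n) {
    const size_t per_line = LINE_BYTES / sizeof(int32_t);
    long sum = 0;
    int cur = sc_acquire(c, (uintptr_t)global);
    for (size_t i = 0; i < n; i += per_line) {
        int next = -1;
        if (i + per_line < n)  /* issue the next transfer early */
            next = sc_acquire(c, (uintptr_t)(global + i + per_line));
        sc_sync(c, cur);       /* wait only for the line needed now */
        const int32_t *local = (const int32_t *)sc_translate(c, cur, (uintptr_t)(global + i));
        for (size_t j = 0; j < per_line && i + j < n; j++) sum += local[j];
        cur = next;
    }
    return sum;
}
```

Because lookup, placement, synchronization, and translation are separate calls, the caller (in the paper's setting, compiler-generated code) is free to place the synchronization as late as possible. In sc_demo_sum the fetch for line i+1 is issued before the loop over line i runs, which is the kind of overlap the abstract refers to. The sketch requires the global buffer to be aligned to LINE_BYTES and to span whole lines, since each transfer copies a full line.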