Using advanced compiler technology to exploit the performance of the Cell Broadband Engine™ architecture

Authors:
A. E. Eichenberger; J. K. O'Brien; K. M. O'Brien; P. Wu; T. Chen; P. H. Oden; D. A. Prener; J. C. Shepherd; B. So; Z. Sura; A. Wang; T. Zhang; P. Zhao; M. K. Gschwind; R. Archambault; Y. Gao; R. Koo
Venue:
IBM Systems Journal
Year:
2006


Abstract

The continuing importance of game applications and other numerically intensive workloads has generated an upsurge in novel computer architectures tailored for such functionality. Game applications feature highly parallel code for functions such as game physics, which have high computation and memory requirements, and scalar code for functions such as game artificial intelligence, for which fast response times and a full-featured programming environment are critical. The Cell Broadband Engine™ architecture targets such applications, providing both flexibility and high performance by utilizing a 64-bit multithreaded PowerPC® processor element (PPE) with two levels of globally coherent cache and eight synergistic processor elements (SPEs), each consisting of a processor designed for streaming workloads, a local memory, and a globally coherent DMA (direct memory access) engine. Growth in processor complexity is driving a parallel need for sophisticated compiler technology. In this paper, we present a variety of compiler techniques designed to exploit the performance potential of the SPEs and to enable the multilevel heterogeneous parallelism found in the Cell Broadband Engine architecture. Our goal in developing this compiler has been to enhance programmability while continuing to provide high performance. We review the Cell Broadband Engine architecture and present the results of our compiler techniques, including SPE optimization, automatic code generation, single source parallelization, and partitioning.