Optimizing Compiler for the CELL Processor

Authors:
Alexandre E. Eichenberger;Kathryn O'Brien;Kevin O'Brien;Peng Wu;Tong Chen;Peter H. Oden;Daniel A. Prener;Janice C. Shepherd;Byoungro So;Zehra Sura;Amy Wang;Tao Zhang;Peng Zhao;Michael Gschwind
Affiliations:
IBM T.J. Watson Research Center Yorktown Heights, New York, USA.;IBM T.J. Watson Research Center Yorktown Heights, New York, USA.;IBM T.J. Watson Research Center Yorktown Heights, New York, USA.;IBM T.J. Watson Research Center Yorktown Heights, New York, USA.;IBM T.J. Watson Research Center Yorktown Heights, New York, USA.;IBM T.J. Watson Research Center Yorktown Heights, New York, USA.;IBM T.J. Watson Research Center Yorktown Heights, New York, USA.;IBM T.J. Watson Research Center Yorktown Heights, New York, USA.;IBM T.J. Watson Research Center Yorktown Heights, New York, USA.;IBM T.J. Watson Research Center Yorktown Heights, New York, USA.;IBM T.J. Watson Research Center Yorktown Heights, New York, USA.;College of Computing Georgia Tech, USA.;IBM Toronto Laboratory Markham, Ontario, Canada.;IBM Toronto Laboratory Markham, Ontario, Canada.
Venue:
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Year:
2005


Abstract

Developed for multimedia and game applications, as well as other numerically intensive workloads, the CELL processor supports both highly parallel codes, which have high computation and memory requirements, and scalar codes, which require fast response time and a full-featured programming environment. This first-generation CELL processor implements on a single chip a Power Architecture processor with two levels of cache and eight attached streaming processors, each with its own local memory and a globally coherent DMA engine. In addition to processor-level parallelism, each processing element has a Single Instruction Multiple Data (SIMD) unit that can process from two double-precision floating-point values up to 16 byte-sized values per instruction. This paper describes, in the context of a research prototype, several compiler techniques that aim to automatically generate high-quality code across the wide range of heterogeneous parallelism available on the CELL processor. The techniques include compiler-supported branch prediction, compiler-assisted instruction fetch, generation of scalar code on SIMD units, automatic generation of SIMD code, and data and code partitioning across the multiple processor elements in the system. Results indicate that significant speedup can be achieved with a high level of support from the compiler.
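
The SIMD figures in the abstract (two double-precision values or 16 bytes per instruction) are both consistent with a 16-byte register. The sketch below is an illustration only, not code from the paper: it derives the lane count per element type from that width and estimates how many SIMD instructions one elementwise pass over an array would need. The names `SIMD_REGISTER_BYTES`, `ELEMENT_BYTES`, `lanes`, and `simd_instruction_count` are hypothetical, the element sizes are the usual C-like ones, and the estimate ignores alignment, loads and stores, and loop overhead.

```python
# Width of one SIMD register in bytes, inferred from the abstract:
# 2 doubles x 8 bytes = 16 bytes = 16 one-byte elements.
SIMD_REGISTER_BYTES = 16

# Assumed element sizes in bytes (illustrative names, usual C-like sizes).
ELEMENT_BYTES = {"byte": 1, "halfword": 2, "word": 4, "float": 4, "double": 8}


def lanes(element_type: str) -> int:
    """Number of elements of this type that fit in one SIMD register."""
    return SIMD_REGISTER_BYTES // ELEMENT_BYTES[element_type]


def simd_instruction_count(n_elements: int, element_type: str) -> int:
    """SIMD instructions for one elementwise pass (ceiling division).

    A partially filled final register still costs a full instruction.
    """
    width = lanes(element_type)
    return (n_elements + width - 1) // width


if __name__ == "__main__":
    for name in ELEMENT_BYTES:
        print(name, lanes(name), simd_instruction_count(1000, name))
```

Under these assumptions, the ideal speedup of a SIMD loop over its scalar form is bounded by the lane count, which is why narrower data types stand to gain more from automatic SIMD code generation.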