The Cell Broadband Engine: exploiting multiple levels of parallelism in a chip multiprocessor

Authors:
Michael Gschwind
Affiliations:
IBM T.J. Watson Research Center, Yorktown Heights, NY
Venue:
International Journal of Parallel Programming
Year:
2007

Abstract

As CMOS feature sizes continue to shrink and traditional microarchitectural methods for delivering high performance (e.g., deep pipelining) become too expensive and power-hungry, chip multiprocessors (CMPs) have become an exciting new direction by which system designers can deliver increased performance. Exploiting parallelism in such designs is the key to high performance, and we find that parallelism must be exploited at multiple levels of the system: thread-level parallelism alone, which has become popular in many designs, fails to exploit all the levels of parallelism available in many workloads for CMP systems. We describe the Cell Broadband Engine and the multiple levels at which its architecture exploits parallelism: data-level, instruction-level, thread-level, memory-level, and compute-transfer parallelism. By taking advantage of opportunities at all levels of the system, this CMP revolutionizes parallel architectures to deliver previously unattained levels of single-chip performance. We describe how the heterogeneous cores make it possible to achieve this performance by parallelizing and offloading computation-intensive application code onto the Synergistic Processor Element (SPE) cores using a heterogeneous thread model. We also give an example of scheduling code to tolerate memory latency using software pipelining techniques in the SPE.
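
The compute-transfer parallelism and software pipelining mentioned in the abstract can be illustrated with a small sketch. This is not code from the paper: it is a minimal Python simulation that assumes a single background worker standing in for an asynchronous DMA engine. The names `fetch_block`, `compute`, and `pipelined_sum_of_squares` are hypothetical and chosen only for illustration; real SPE code would issue DMA commands to its memory flow controller rather than use threads.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_block(memory, start, size):
    """Hypothetical stand-in for an asynchronous transfer into local store."""
    return list(memory[start:start + size])


def compute(block):
    """Hypothetical compute kernel operating on data already fetched."""
    return sum(x * x for x in block)


def pipelined_sum_of_squares(memory, block_size):
    """Sum of squares over memory, fetching block i+1 while computing on block i."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    total = 0
    # One worker plays the role of the transfer engine (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = None  # transfer issued in the previous iteration
        for start in range(0, len(memory), block_size):
            # Issue the next transfer before consuming the previous one,
            # so the transfer can proceed while compute() runs.
            upcoming = dma.submit(fetch_block, memory, start, block_size)
            if pending is not None:
                total += compute(pending.result())
            pending = upcoming
        # Epilogue: drain the last outstanding transfer.
        if pending is not None:
            total += compute(pending.result())
    return total


if __name__ == "__main__":
    print(pipelined_sum_of_squares(list(range(10)), block_size=4))
```

The loop has the usual software-pipelined shape: a prologue that only issues a transfer, a steady state that issues one transfer and consumes the previous one, and an epilogue that drains the final block. At most two blocks are live at any time, which corresponds to double buffering.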