Chip multiprocessing and the Cell Broadband Engine

Authors:
Michael Gschwind
Affiliations:
IBM T.J. Watson Research Center, Yorktown Heights, NY
Venue:
Proceedings of the 3rd conference on Computing frontiers
Year:
2006

Abstract

Chip multiprocessing has become an exciting new direction for system designers to deliver increased performance by exploiting CMOS scaling. We discuss key design decisions facing the system architect of a chip multiprocessor and describe how these choices were made in the design of the Cell Broadband Engine.

An important decision is whether to base system performance on thread-level parallelism alone, or to complement thread-level parallelism with other forms of parallelism. Depending on workload characteristics, providing parallelism at the processor core level may increase overall system efficiency.

Parallelism is also key to utilizing available memory bandwidth more efficiently, by overlapping and interleaving multiple accesses to system memory. By interleaving the access streams of multiple threads, memory-level parallelism can be increased to allow better memory interface utilization. In addition, compute-transfer parallelism (CTP) offers a new form of parallelism in which memory transfers are initiated under software control without stalling the requesting thread.

We describe how the Cell Broadband Engine™ uses parallelism at all levels of the system abstraction to deliver a quantum leap in application performance, and how the Cell Synergistic Memory Flow engine exploits compute-transfer parallelism by providing efficient block transfer capabilities.
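
The idea of compute-transfer parallelism described in the abstract can be illustrated with a small double-buffering sketch. This is not code from the paper and does not use the Cell SDK: a single-worker thread pool merely stands in for an asynchronous block-transfer engine, and the names `fetch_block` and `double_buffered_sum` are hypothetical. The point it tries to show is that the next block transfer is started under software control before the current block is processed, so the computing thread waits only when it actually needs data.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_block(memory, start, size):
    """Stand-in for an asynchronous block transfer (hypothetical helper)."""
    return memory[start:start + size]


def double_buffered_sum(memory, block_size):
    """Sum `memory` block by block, overlapping each fetch with compute."""
    total = 0
    # Assumption: one background worker plays the role of the transfer engine.
    with ThreadPoolExecutor(max_workers=1) as transfer_engine:
        # Start the first transfer before any computation begins.
        pending = transfer_engine.submit(fetch_block, memory, 0, block_size)
        for start in range(block_size, len(memory) + block_size, block_size):
            current = pending.result()  # wait only for the block needed now
            if start < len(memory):
                # Initiate the next transfer, then compute while it is in flight.
                pending = transfer_engine.submit(
                    fetch_block, memory, start, block_size
                )
            total += sum(current)
    return total


if __name__ == "__main__":
    print(double_buffered_sum(list(range(10)), 4))
```

On real hardware the overlap would come from a dedicated transfer unit rather than a thread; the sketch only mirrors the control structure (issue transfer, compute, then wait).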