Cell Multiprocessor Communication Network: Built for Speed

Authors:
Michael Kistler; Michael Perrone; Fabrizio Petrini
Affiliations:
IBM Austin Research Laboratory; IBM TJ Watson Research Center; Pacific Northwest National Laboratory
Venue:
IEEE Micro
Year:
2006

Citing 7

Hitting the memory wall: implications of the obvious

ACM SIGARCH Computer Architecture News
Optimizing pipelines for power and performance

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Power Efficient Processor Architecture and The Cell Processor

HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
QsNetII: Defining High-Performance Network Design

IEEE Micro
Application of full-system simulation in exploratory system design and development

IBM Journal of Research and Development
Introduction to the cell multiprocessor

IBM Journal of Research and Development - POWER5 and packaging
The future of CMOS technology

IBM Journal of Research and Development

Cited 106

Dynamic multigrain parallelization on the cell broadband engine

Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming
On the Design of a Photonic Network-on-Chip

NOCS '07 Proceedings of the First International Symposium on Networks-on-Chip
An Open Source Environment for Cell Broadband Engine System Software

Computer
Towards a Java multiprocessor

JTRES '07 Proceedings of the 5th international workshop on Java technologies for real-time and embedded systems
Runtime scheduling of dynamic parallelism on accelerator-based multi-core systems

Parallel Computing
High performance combinatorial algorithm design on the Cell Broadband Engine processor

Parallel Computing
Characterizing the Cell EIB On-Chip Network

IEEE Micro
Executing stream joins on the cell processor

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
CellSort: high performance sorting on the cell processor

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Parallelization schemes for memory optimization on the cell processor: a case study of image processing algorithm

MEDEA '07 Proceedings of the 2007 workshop on MEmory performance: DEaling with Applications, systems and architecture
Hardware-aware analysis and optimization of stable fluids

Proceedings of the 2008 symposium on Interactive 3D graphics and games
Prefetching irregular references for software cache on cell

Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization
Orchestrating data transfer for the cell/B.E. processor

Proceedings of the 22nd annual international conference on Supercomputing
Supporting OpenMP on Cell

IWOMP '07 Proceedings of the 3rd international workshop on OpenMP: A Practical Programming Model for the Multi-Core Era
Toward Human Arm Attention and Recognition

Neural Information Processing
A Constraint Programming Approach for Allocation and Scheduling on the CELL Broadband Engine

CP '08 Proceedings of the 14th international conference on Principles and Practice of Constraint Programming
A Novel Asynchronous Software Cache Implementation for the Cell-BE Processor

Languages and Compilers for Parallel Computing
Hybrid access-specific software cache techniques for the cell BE architecture

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
COMIC: a coherent shared memory interface for cell be

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Accelerating BLASTP on the Cell Broadband Engine

PRIB '08 Proceedings of the Third IAPR International Conference on Pattern Recognition in Bioinformatics
Supporting OpenMP on cell

International Journal of Parallel Programming
SPENK: adding another level of parallelism on the cell broadband engine

IFMT '08 Proceedings of the 1st international forum on Next-generation multicore/manycore technologies
An efficient in-place 3D transpose for multicore processors with software managed memory hierarchy

IFMT '08 Proceedings of the 1st international forum on Next-generation multicore/manycore technologies
Automatic Pre-Fetch and Modulo Scheduling Transformations for the Cell BE Architecture

Languages and Compilers for Parallel Computing
Implementation and performance modeling of deterministic particle transport (Sweep3D) on the IBM Cell/B.E.

Scientific Programming - High Performance Computing with the Cell Broadband Engine
3D seismic imaging through reverse-time migration on homogeneous and heterogeneous multi-core processors

Scientific Programming - High Performance Computing with the Cell Broadband Engine
High performance protein sequence database scanning on the Cell Broadband Engine

Scientific Programming - High Performance Computing with the Cell Broadband Engine
Building high-resolution sky images using the Cell/B.E.

Scientific Programming - High Performance Computing with the Cell Broadband Engine
Computing discrete transforms on the Cell Broadband Engine

Parallel Computing
CellJoin: a parallel stream join operator for the cell processor

The VLDB Journal — The International Journal on Very Large Data Bases
Celling SHIM: compiling deterministic concurrency to a heterogeneous multicore

Proceedings of the 2009 ACM symposium on Applied Computing
Scheduling dynamic parallelism on accelerators

Proceedings of the 6th ACM conference on Computing frontiers
Evaluating multi-core platforms for HPC data-intensive kernels

Proceedings of the 6th ACM conference on Computing frontiers
High-performance regular expression scanning on the Cell/B.E. processor

Proceedings of the 23rd international conference on Supercomputing
Computer generation of fast fourier transforms for the cell broadband engine

Proceedings of the 23rd international conference on Supercomputing
DBDB: optimizing DMATransfer for the cell be architecture

Proceedings of the 23rd international conference on Supercomputing
Efficient high performance collective communication for the cell blade

Proceedings of the 23rd international conference on Supercomputing
Time-predictable computer architecture

EURASIP Journal on Embedded Systems - FPGA supercomputing platforms, architectures, and techniques for accelerating computationally complex algorithms
Fast and Efficient Synchronization and Communication Collective Primitives for Dual Cell-Based Blades

Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
Performance balancing: software-based on-chip memory management for effective CMP executions

Proceedings of the 10th workshop on MEmory performance: DEaling with Applications, systems and architecture
The multikernel: a new OS architecture for scalable multicore systems

Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
No cache-coherence: a single-cycle ring interconnection for multi-core L1-NUCA sharing on 3D chips

Proceedings of the 46th Annual Design Automation Conference
Modeling advanced collective communication algorithms on cell-based systems

Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Buffer sharing in CSP-like programs

MEMOCODE'09 Proceedings of the 7th IEEE/ACM international conference on Formal Methods and Models for Codesign
Prototype design of cluster-based homogeneous multiprocessor system-on-chip

ASID'09 Proceedings of the 3rd international conference on Anti-Counterfeiting, security, and identification in communication
Mesh-of-trees and alternative interconnection networks for single-chip parallelism

IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Chameleon: Virtualizing idle acceleration cores of a heterogeneous multicore processor for caching and prefetching

ACM Transactions on Architecture and Code Optimization (TACO)
An enhancer of memory and network for applications with large-capacity data and non-continuous data accessing

The Journal of Supercomputing
Optimizing the use of static buffers for DMA on a CELL chip

LCPC'06 Proceedings of the 19th international conference on Languages and compilers for parallel computing
FFTC: fastest Fourier transform for the IBM cell broadband engine

HiPC'07 Proceedings of the 14th international conference on High performance computing
CG-Cell: an NPB benchmark implementation on cell broadband engine

ICDCN'08 Proceedings of the 9th international conference on Distributed computing and networking
Multi-stage benders decomposition for optimizing multicore architectures

CPAIOR'08 Proceedings of the 5th international conference on Integration of AI and OR techniques in constraint programming for combinatorial optimization problems
A real-time Java chip-multiprocessor

ACM Transactions on Embedded Computing Systems (TECS)
Communication-aware heuristics for run-time task mapping on NoC-based MPSoC platforms

Journal of Systems Architecture: the EUROMICRO Journal
Back Suction: Service Guarantees for Latency-Sensitive On-chip Networks

NOCS '10 Proceedings of the 2010 Fourth ACM/IEEE International Symposium on Networks-on-Chip
Network interface design based on mutual interface definition

International Journal of High Performance Systems Architecture
The reverse-acceleration model for programming petascale hybrid systems

IBM Journal of Research and Development
ATAC: a 1000-core cache-coherent processor with on-chip optical network

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Recursion-driven parallel code generation for multi-core platforms

Proceedings of the Conference on Design, Automation and Test in Europe
A link arbitration scheme for quality of service in a latency-optimized network-on-chip

Proceedings of the Conference on Design, Automation and Test in Europe
Buffer sharing in rendezvous programs

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems - Special section on the ACM IEEE international conference on formal methods and models for codesign (MEMOCODE) 2009
Weighted random oblivious routing on torus networks

Proceedings of the 5th ACM/IEEE Symposium on Architectures for Networking and Communications Systems
Exploring a Novel Gathering Method for Finite Element Codes on the Cell/B.E. Architecture

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Efficient throughput-guarantees for latency-sensitive networks-on-chip

Proceedings of the 2010 Asia and South Pacific Design Automation Conference
An analytical network performance model for SIMD processor CSX600 interconnects

Journal of Systems Architecture: the EUROMICRO Journal
Throughput-Effective On-Chip Networks for Manycore Accelerators

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
A portable, efficient inter-core communication scheme for embedded multicore platforms

Journal of Systems Architecture: the EUROMICRO Journal
Hierarchical circuit-switched NoC for multicore video processing

Microprocessors & Microsystems
Improved scalability by using hardware-aware thread affinities

Facing the multicore-challenge
Parallelization schemes for memory optimization on the cell processor: a case study on the Harris corner detector

Transactions on high-performance embedded architectures and compilers III
Improved scalability by using hardware-aware thread affinities

Facing the multicore-challenge
Single-port and multi-port collective communication operations on single and dual Cell BE processor systems

International Journal of Communication Networks and Distributed Systems
A 98 GMACs/W 32-core vector processor in 65nm CMOS

Proceedings of the 17th IEEE/ACM international symposium on Low-power electronics and design
Cat-tail dma: efficient image data transport for multicore embedded mobile systems

Journal of Mobile Multimedia
A hybrid strategy for mapping multiple throughput-constrained applications on MPSoCs

CASES '11 Proceedings of the 14th international conference on Compilers, architectures and synthesis for embedded systems
Optimizing explicit data transfers for data parallel applications on the cell architecture

ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
TL-DAE: thread-level decoupled access/execution for OpenMP on the cyclops-64 many-core processor

LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing
Adaptive and speculative memory consistency support for multi-core architectures with on-chip local memories

LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing
Exploration of 3D grid caching strategies for ray-shooting

Journal of Real-Time Image Processing
Model-driven adaptation of double-precision matrix multiplication to the Cell processor architecture

Parallel Computing
Remote store programming: a memory model for embedded multicore

HiPEAC'10 Proceedings of the 5th international conference on High Performance Embedded Architectures and Compilers
Performance impact of task mapping on the cell BE multicore processor

ISCA'10 Proceedings of the 2010 international conference on Computer Architecture
Analysis of gravitational wave signals on heterogeneous architectures

PARA'10 Proceedings of the 10th international conference on Applied Parallel and Scientific Computing - Volume 2
A dynamically reconfigurable communication architecture for multicore embedded systems

Journal of Systems Architecture: the EUROMICRO Journal
Networks on chips: structure and design methodologies

Journal of Electrical and Computer Engineering - Special issue on Networks-on-Chip: Architectures, Design Methodologies, and Case Studies
A metric for layout-friendly microarchitecture optimization in high-level synthesis

Proceedings of the 49th Annual Design Automation Conference
Adapting wave-front algorithms to efficiently utilize systems with deep communication hierarchies

Parallel Computing
Communication and memory architecture design of application-specific high-end multiprocessors

VLSI Design
Microwave tomography for breast cancer detection on Cell broadband engine processors

Journal of Parallel and Distributed Computing
MCEmu: A Framework for Software Development and Performance Analysis of Multicore Systems

ACM Transactions on Design Automation of Electronic Systems (TODAES)
Increasing the efficiency of the DaCS programming model for heterogeneous systems

PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part I
A transpose-free in-place SIMD optimized FFT

ACM Transactions on Architecture and Code Optimization (TACO)
Scalable communication architectures for massively parallel hardware multi-processors

Journal of Parallel and Distributed Computing
A real-time, energy-efficient system software suite for heterogeneous multicore platforms

Proceedings of the eighth IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
Hardware-software coherence protocol for the coexistence of caches and local memories

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Scheduling streaming applications on a complex multicore platform

Concurrency and Computation: Practice & Experience
Accelerating throughput-aware runtime mapping for heterogeneous MPSoCs

ACM Transactions on Design Automation of Electronic Systems (TODAES) - Special section on adaptive power management for energy and temperature-aware computing systems
Parallelization strategies for the points of interests algorithm on the cell processor

ISPA'07 Proceedings of the 5th international conference on Parallel and Distributed Processing and Applications
SSDM: smart stack data management for software managed multicores (SMMs)

Proceedings of the 50th Annual Design Automation Conference
CADSE: communication aware design space exploration for efficient run-time MPSoC management

Frontiers of Computer Science: Selected Publications from Chinese Universities
Designing on-chip networks for throughput accelerators

ACM Transactions on Architecture and Code Optimization (TACO)
Scheduling of synchronous data flow models onto scratchpad memory-based embedded processors

ACM Transactions on Embedded Computing Systems (TECS) - Special Section on ESTIMedia'10
Flexible filters in stream programs

ACM Transactions on Embedded Computing Systems (TECS)
Optimizing two-dimensional DMA transfers for scratchpad Based MPSoCs platforms

Microprocessors & Microsystems
Design of massively parallel hardware multi-processors for highly-demanding embedded applications

Microprocessors & Microsystems
Direct distributed memory access for CMPs

Journal of Parallel and Distributed Computing

Abstract

Multicore designs promise various power-performance and area-performance benefits. But inadequate design of the on-chip communication network can deprive applications of these benefits. To illuminate this important point in multicore processor design, the authors analyze the Cell processor's communication network, using a series of benchmarks involving DMA traffic patterns and synchronization protocols.