Power Efficient Processor Architecture and The Cell Processor

Authors:
H. Peter Hofstee
Affiliations:
IBM Server & Technology Group
Venue:
HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
Year:
2005

Citing 0
Cited 125

Improving superword level parallelism support in modern compilers

CODES+ISSS '05 Proceedings of the 3rd IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
Fast and fair: data-stream quality of service

Proceedings of the 2005 international conference on Compilers, architectures and synthesis for embedded systems
Stream Programming on General-Purpose Processors

Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Performance characteristics of MAUI: an intelligent memory system architecture

Proceedings of the 2005 workshop on Memory system performance
Data and Computation Transformations for Brook Streaming Applications on Multiprocessors

Proceedings of the International Symposium on Code Generation and Optimization
Chip multiprocessing and the cell broadband engine

Proceedings of the 3rd conference on Computing frontiers
Synergistic Processing in Cell's Multicore Architecture

IEEE Micro
Optimizing compiler for shared-memory multiple SIMD architecture

Proceedings of the 2006 ACM SIGPLAN/SIGBED conference on Language, compilers, and tool support for embedded systems
Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Proceedings of the 33rd annual international symposium on Computer Architecture
MPI Microtask for programming the cell broadband engine™ processor

IBM Systems Journal
Systems research challenges: a scale-out perspective

IBM Journal of Research and Development
Introduction to the cell multiprocessor

IBM Journal of Research and Development - POWER5 and packaging
Cell Multiprocessor Communication Network: Built for Speed

IEEE Micro
Avoiding conversion and rearrangement overhead in SIMD architectures

International Journal of Parallel Programming
Stall cycle redistribution in a transparent fetch pipeline

Proceedings of the 2006 international symposium on Low power electronics and design
Exploiting coarse-grained task, data, and pipeline parallelism in stream programs

Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
A defect tolerant self-organizing nanoscale SIMD architecture

Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Integrated scratchpad memory optimization and task scheduling for MPSoC architectures

CASES '06 Proceedings of the 2006 international conference on Compilers, architecture and synthesis for embedded systems
Enabling real-time physics simulation in future interactive entertainment

Proceedings of the 2006 ACM SIGGRAPH symposium on Videogames
Physical aware frequency selection for dynamic thermal management in multi-core systems

Proceedings of the 2006 IEEE/ACM international conference on Computer-aided design
Rotary router: an efficient architecture for CMP interconnection networks

Proceedings of the 34th annual international symposium on Computer architecture
ParallAX: an architecture for real-time physics

Proceedings of the 34th annual international symposium on Computer architecture
On Characterizing Performance of the Cell Broadband Engine Element Interconnect Bus

NOCS '07 Proceedings of the First International Symposium on Networks-on-Chip
A self-organizing defect tolerant SIMD architecture

ACM Journal on Emerging Technologies in Computing Systems (JETC)
Microprocessors in the era of terascale integration

Proceedings of the conference on Design, automation and test in Europe
Executing irregular scientific applications on stream architectures

Proceedings of the 21st annual international conference on Supercomputing
Interconnects in the third dimension: design challenges for 3D ICs

Proceedings of the 44th annual Design Automation Conference
Towards a Java multiprocessor

JTRES '07 Proceedings of the 5th international workshop on Java technologies for real-time and embedded systems
Beyond gaming: programming the PLAYSTATION®3 cell architecture for cost-effective parallel processing

CODES+ISSS '07 Proceedings of the 5th IEEE/ACM international conference on Hardware/software codesign and system synthesis
A New Era of Performance Evaluation

Computer
The cell broadband engine: exploiting multiple levels of parallelism in a chip multiprocessor

International Journal of Parallel Programming
Streamware: programming general-purpose multicore processors using streams

Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
The worst-case execution-time problem—overview of methods and survey of tools

ACM Transactions on Embedded Computing Systems (TECS)
Hierarchical memory system design for a heterogeneous multi-core processor

Proceedings of the 2008 ACM symposium on Applied computing
DMA-based prefetching for I/O-intensive workloads on the cell architecture

Proceedings of the 5th conference on Computing frontiers
FPGA-based prototype of a PRAM-on-chip processor

Proceedings of the 5th conference on Computing frontiers
A modular 3D processor for flexible product design and technology migration

Proceedings of the 5th conference on Computing frontiers
Asynchronous control of modules activity in integrated systems for reducing peak temperatures

Integration, the VLSI Journal
A low-power cache scheme for embedded computing

Journal of Embedded Computing - Issues in embedded single-chip multicore architectures
RC-SIMD: Reconfigurable communication SIMD architecture for image processing applications

Journal of Embedded Computing - Issues in embedded single-chip multicore architectures
Orchestrating the execution of stream programs on multicore platforms

Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation
A lightweight streaming layer for multicore execution

ACM SIGARCH Computer Architecture News
Exact and approximate task assignment algorithms for pipelined software synthesis

Proceedings of the conference on Design, automation and test in Europe
Radioastronomy Image Synthesis on the Cell/B.E.

Euro-Par '08 Proceedings of the 14th international Euro-Par conference on Parallel Processing
Automatic Pre-Fetch and Modulo Scheduling Transformations for the Cell BE Architecture

Languages and Compilers for Parallel Computing
Deriving Efficient Data Movement from Decoupled Access/Execute Specifications

HiPEAC '09 Proceedings of the 4th International Conference on High Performance Embedded Architectures and Compilers
Parallel LDPC Decoding on the Cell/B.E. Processor

HiPEAC '09 Proceedings of the 4th International Conference on High Performance Embedded Architectures and Compilers
Available task-level parallelism on the Cell BE

Scientific Programming - High Performance Computing with the Cell Broadband Engine
GViM: GPU-accelerated virtual machines

Proceedings of the 3rd ACM Workshop on System-level Virtualization for High Performance Computing
Compile-Time and Run-Time Issues in an Auto-Parallelisation System for the Cell BE Processor

Euro-Par 2008 Workshops - Parallel Processing
High-performance regular expression scanning on the Cell/B.E. processor

Proceedings of the 23rd international conference on Supercomputing
Efficient high performance collective communication for the cell blade

Proceedings of the 23rd international conference on Supercomputing
Parallelizing sequential applications on commodity hardware using a low-cost software transactional memory

Proceedings of the 2009 ACM SIGPLAN conference on Programming language design and implementation
Time-predictable computer architecture

EURASIP Journal on Embedded Systems - FPGA supercomputing platforms, architectures, and techniques for accelerating computationally complex algorithms
Leakage-Aware Multiprocessor Scheduling

Journal of Signal Processing Systems
Vector Symmetry Reduction

Electronic Notes in Theoretical Computer Science (ENTCS)
Combining Coarse-Grained Software Pipelining with DVS for Scheduling Real-Time Periodic Dependent Tasks on Multi-Core Embedded Systems

Journal of Signal Processing Systems
Mapping stream programs onto heterogeneous multiprocessor systems

CASES '09 Proceedings of the 2009 international conference on Compilers, architecture, and synthesis for embedded systems
Efficient program scheduling for heterogeneous multi-core processors

Proceedings of the 46th Annual Design Automation Conference
Allocation wall: a limiting factor of Java applications on emerging multi-core platforms

Proceedings of the 24th ACM SIGPLAN conference on Object oriented programming systems languages and applications
Fool me twice: Exploring and exploiting error tolerance in physics-based animation

ACM Transactions on Graphics (TOG)
An adaptative game loop architecture with automatic distribution of tasks between CPU and GPU

Computers in Entertainment (CIE) - SPECIAL ISSUE: Games
Algorithm/architecture co-exploration of visual computing on emergent platforms: overview and future prospects

IEEE Transactions on Circuits and Systems for Video Technology
TRaX: a multicore hardware architecture for real-time ray tracing

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Design and implementation of a graphical user interface for stream-based distributed computing

PDCN '08 Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Networks
An analytical model to exploit memory task scheduling

Proceedings of the 2010 Workshop on Interaction between Compilers and Computer Architecture
Elastic computing: a framework for transparent, portable, and adaptive multi-core heterogeneous computing

Proceedings of the ACM SIGPLAN/SIGBED 2010 conference on Languages, compilers, and tools for embedded systems
Data pipeline optimization for shared memory multiple-SIMD architecture

LCPC'06 Proceedings of the 19th international conference on Languages and compilers for parallel computing
Minimizing communication in rate-optimal software pipelining for stream programs

Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization
Trade-offs between voltage scaling and processor shutdown for low-energy embedded multiprocessors

SAMOS'07 Proceedings of the 7th international conference on Embedded computer systems: architectures, modeling, and simulation
Real-time motion tracking using the CELL BE

NTMS'09 Proceedings of the 3rd international conference on New technologies, mobility and security
New challenges of parallel job scheduling

JSSPP'07 Proceedings of the 13th international conference on Job scheduling strategies for parallel processing
Buffer-space efficient and deadlock-free scheduling of stream applications on multi-core architectures

Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures
A real-time Java chip-multiprocessor

ACM Transactions on Embedded Computing Systems (TECS)
Tale in the multi-core era: is java still competitive to host SIP applications?

ICC'09 Proceedings of the 2009 IEEE international conference on Communications
Reducing write activities on non-volatile memories in embedded CMPs via data migration and recomputation

Proceedings of the 47th Design Automation Conference
Integrated execution: a programming model for accelerators

IBM Journal of Research and Development
MapReduce for the cell broadband engine architecture

IBM Journal of Research and Development
A parallel computing approach for tracking of neuronal fibers

IBM Journal of Research and Development
Twin peaks: a software platform for heterogeneous computing on general-purpose and graphics processors

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Task superscalar: using processors as functional units

HotPar'10 Proceedings of the 2nd USENIX conference on Hot topics in parallelism
Hera-JVM: a runtime system for heterogeneous multi-core architectures

Proceedings of the ACM international conference on Object oriented programming systems languages and applications
Fast software AES encryption

FSE'10 Proceedings of the 17th international conference on Fast software encryption
Federation: Boosting per-thread performance of throughput-oriented manycore architectures

ACM Transactions on Architecture and Code Optimization (TACO)
Performance analysis of the SHA-3 candidates on exotic multi-core architectures

CHES'10 Proceedings of the 12th international conference on Cryptographic hardware and embedded systems
Montgomery multiplication on the cell

PPAM'09 Proceedings of the 8th international conference on Parallel processing and applied mathematics: Part I
Exploring a Novel Gathering Method for Finite Element Codes on the Cell/B.E. Architecture

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Extending the cell SPE with energy efficient branch prediction

EuroPar'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I
High-performance modular multiplication on the cell processor

WAIFI'10 Proceedings of the Third international conference on Arithmetic of finite fields
Leakage-aware multiprocessor scheduling for low power

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Run-time reconfiguration of communication in SIMD architectures

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Memory Latency Reduction via Thread Throttling

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Landing stencil code on Godson-T

Journal of Computer Science and Technology
Orchestration by approximation: mapping stream programs onto multicore architectures

Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
A generic parallel processing model for facilitating data mining and integration

Parallel Computing
Acceleration of acoustic emission signal processing algorithms using CUDA standard

Computer Standards & Interfaces
Region-based parallelization of irregular reductions on explicitly managed memory hierarchies

The Journal of Supercomputing
FPGA vs. multi-core CPUs vs. GPUs: hands-on experience with a sorting application

Facing the multicore-challenge
The impact of diverse memory architectures on multicore consumer software: an industrial perspective from the video games domain

Proceedings of the 2011 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness
Resource-constrained multiprocessor synthesis for floating-point applications on FPGAs

ACM Transactions on Design Automation of Electronic Systems (TODAES)
Programming heterogeneous multicore systems using threading building blocks

Euro-Par 2010 Proceedings of the 2010 conference on Parallel processing
Automatic analysis of DMA races using model checking and k-induction

Formal Methods in System Design
Branch penalty reduction on IBM cell SPUs via software branch hinting

CODES+ISSS '11 Proceedings of the seventh IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
A high performance heterogeneous architecture and its optimization design

HPCC'06 Proceedings of the Second international conference on High Performance Computing and Communications
A case for dual-mapping one-way caches

ARCS'06 Proceedings of the 19th international conference on Architecture of Computing Systems
Automatic data distribution for improving data locality on the cell BE architecture

LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing
Solving a 112-bit prime elliptic curve discrete logarithm problem on game consoles using sloppy reduction

International Journal of Applied Cryptography
ECC2K-130 on cell CPUs

AFRICACRYPT'10 Proceedings of the Third international conference on Cryptology in Africa
Tiled multi-core stream architecture

Transactions on High-Performance Embedded Architectures and Compilers IV
Buffer sizing for self-timed stream programs on heterogeneous distributed memory multiprocessors

HiPEAC'10 Proceedings of the 5th international conference on High Performance Embedded Architectures and Compilers
Tagged procedure calls (TPC): efficient runtime support for task-based parallelism on the cell processor

HiPEAC'10 Proceedings of the 5th international conference on High Performance Embedded Architectures and Compilers
Offload – automating code migration to heterogeneous multicore systems

HiPEAC'10 Proceedings of the 5th international conference on High Performance Embedded Architectures and Compilers
Automatic analysis of scratch-pad memory code for heterogeneous multicore processors

TACAS'10 Proceedings of the 16th international conference on Tools and Algorithms for the Construction and Analysis of Systems
Improving coherence protocol reactiveness by trading bandwidth for latency

Proceedings of the 9th conference on Computing Frontiers
Profile-guided deployment of stream programs on multicores

Proceedings of the 13th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, Tools and Theory for Embedded Systems
StreamPI: a stream-parallel programming extension for object-oriented programming languages

The Journal of Supercomputing
Hardware acceleration in the IBM PowerEN processor: architecture and performance

Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Algorithms and architectures for 2D discrete wavelet transform

The Journal of Supercomputing
Write activity reduction on non-volatile main memories for embedded chip multiprocessors

ACM Transactions on Embedded Computing Systems (TECS)
High performance and low power design techniques for ASIC and custom in nanometer technologies

Proceedings of the 2013 ACM international symposium on physical design
Efficient Loop Scheduling for Chip Multiprocessors with Non-Volatile Main Memory

Journal of Signal Processing Systems
Automatic parallelization of canonical loops

Science of Computer Programming
Asymmetrical topology and entropy-based heterogeneous link for many-core massive data communication

Cluster Computing

Abstract

This paper provides a background and rationale for some of the architecture and design decisions in the Cell processor, a processor optimized for compute-intensive and broadband rich media applications, jointly developed by Sony Group, Toshiba, and IBM.