The potential of the Cell processor for scientific computing

Authors:
Samuel Williams; John Shalf; Leonid Oliker; Shoaib Kamil; Parry Husbands; Katherine Yelick
Affiliations:
Lawrence Berkeley National Laboratory, Berkeley, CA (all authors)
Venue:
Proceedings of the 3rd conference on Computing frontiers
Year:
2006


Abstract

The slowing pace of commodity microprocessor performance improvements, combined with ever-increasing chip power demands, has become of utmost concern to computational scientists. As a result, the high-performance computing community is examining alternative architectures that address the limitations of modern cache-based designs. In this work, we examine the potential of using the forthcoming STI Cell processor as a building block for future high-end computing systems. Our work contains several novel contributions. First, we introduce a performance model for Cell and apply it to several key scientific computing kernels: dense matrix multiply, sparse matrix-vector multiply, stencil computations, and 1D/2D FFTs. The difficulty of programming Cell, which requires assembly-level intrinsics for the best performance, makes this model useful as an initial step in algorithm design and evaluation. Next, we validate the accuracy of our model by comparing results against published hardware results, as well as our own implementations on the Cell full system simulator. Additionally, we compare Cell performance to benchmarks run on leading superscalar (AMD Opteron), VLIW (Intel Itanium2), and vector (Cray X1E) architectures. Our work also explores several different mappings of the kernels and demonstrates a simple and effective programming model for Cell's unique architecture. Finally, we propose modest microarchitectural modifications that could significantly increase the efficiency of double-precision calculations. Overall results demonstrate the tremendous potential of the Cell architecture for scientific computations in terms of both raw performance and power efficiency.