Scientific Computing Kernels on the Cell Processor

Authors:
Samuel Williams;John Shalf;Leonid Oliker;Shoaib Kamil;Parry Husbands;Katherine Yelick
Affiliations:
Lawrence Berkeley National Laboratory, Berkeley, CA;Lawrence Berkeley National Laboratory, Berkeley, CA;Lawrence Berkeley National Laboratory, Berkeley, CA;Lawrence Berkeley National Laboratory, Berkeley, CA;Lawrence Berkeley National Laboratory, Berkeley, CA;Lawrence Berkeley National Laboratory, Berkeley, CA
Venue:
International Journal of Parallel Programming
Year:
2007

Abstract

In this work, we examine the potential of using the recently released STI Cell processor as a building block for future high-end scientific computing systems. Our work contains several novel contributions. First, we introduce a performance model for Cell and apply it to several key numerical kernels: dense matrix multiply, sparse matrix-vector multiply, stencil computations, and 1D/2D FFTs. Next, we validate our model by comparing results against published hardware data, as well as our own Cell blade implementations. Additionally, we compare Cell performance to benchmarks run on leading superscalar (AMD Opteron), VLIW (Intel Itanium2), and vector (Cray X1E) architectures. Our work also explores several different kernel implementations and demonstrates a simple and effective programming model for Cell's unique architecture. Finally, we propose modest microarchitectural modifications that could significantly increase the efficiency of double-precision calculations. Overall results demonstrate the tremendous potential of the Cell architecture for scientific computations in terms of both raw performance and power efficiency.
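
For readers unfamiliar with the kernels named above, the following is a minimal sketch of one of them, sparse matrix-vector multiply. It assumes the compressed sparse row (CSR) layout, which the abstract does not specify, and it is a plain reference loop rather than the authors' Cell implementation. The function name `spmv_csr` and the sample matrix are hypothetical and chosen only for illustration.

```python
from typing import List


def spmv_csr(row_ptr: List[int], col_idx: List[int],
             values: List[float], x: List[float]) -> List[float]:
    """Compute y = A * x for a matrix A stored in CSR form (assumed layout)."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Nonzeros of row i occupy positions row_ptr[i] .. row_ptr[i + 1] - 1.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # The indirect access x[col_idx[k]] is what makes this kernel
            # irregular in its memory access pattern.
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# Illustrative 3x3 matrix (hypothetical data):
# A = [[2, 0, 1],
#      [0, 3, 0],
#      [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [2.0, 1.0, 3.0, 4.0, 5.0]
x = [1.0, 2.0, 3.0]
y = spmv_csr(row_ptr, col_idx, values, x)
```

Each nonzero contributes one multiply and one add while requiring a value, a column index, and a source-vector element to be fetched. Because of this low ratio of arithmetic to data movement, kernels of this kind tend to be limited by memory bandwidth rather than by peak floating-point rate.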