In this work, we examine the potential of using the recently released STI Cell processor as a building block for future high-end scientific computing systems. Our work contains several novel contributions. First, we introduce a performance model for Cell and apply it to several key numerical kernels: dense matrix multiply, sparse matrix-vector multiply, stencil computations, and 1D/2D FFTs. Next, we validate our model by comparing its results against published hardware data, as well as against our own Cell blade implementations. Additionally, we compare Cell performance to benchmarks run on leading superscalar (AMD Opteron), VLIW (Intel Itanium2), and vector (Cray X1E) architectures. Our work also explores several different kernel implementations and demonstrates a simple and effective programming model for Cell's unique architecture. Finally, we propose modest microarchitectural modifications that could significantly increase the efficiency of double-precision calculations. Overall, the results demonstrate the tremendous potential of the Cell architecture for scientific computations in terms of both raw performance and power efficiency.