The slowing pace of commodity microprocessor performance improvements, combined with ever-increasing chip power demands, has become a major concern to computational scientists. As a result, the high-performance computing community is examining alternative architectures that address the limitations of modern cache-based designs. In this work, we examine the potential of using the forthcoming STI Cell processor as a building block for future high-end computing systems. Our work contains several novel contributions. First, we introduce a performance model for Cell and apply it to several key scientific computing kernels: dense matrix multiply, sparse matrix-vector multiply, stencil computations, and 1D/2D FFTs. Because programming Cell is difficult and requires assembly-level intrinsics for the best performance, this model is useful as an initial step in algorithm design and evaluation. Next, we validate the accuracy of our model by comparing its predictions against published hardware results, as well as against our own implementations on the Cell full-system simulator. Additionally, we compare Cell performance to benchmarks run on leading superscalar (AMD Opteron), VLIW (Intel Itanium2), and vector (Cray X1E) architectures. Our work also explores several different mappings of the kernels and demonstrates a simple and effective programming model for Cell's unique architecture. Finally, we propose modest microarchitectural modifications that could significantly increase the efficiency of double-precision calculations. Overall, the results demonstrate the tremendous potential of the Cell architecture for scientific computations in terms of both raw performance and power efficiency.