Optimization of BLAS on the Cell processor

Authors:
Vaibhav Saxena; Prashant Agrawal; Yogish Sabharwal; Vijay K. Garg; Vimitha A. Kuruvilla; John A. Gunnels
Affiliations:
IBM India Research Lab, New Delhi; IBM India Research Lab, New Delhi; IBM India Research Lab, New Delhi; IBM India Research Lab, New Delhi; IBM India STG Engineering Labs, Bangalore; IBM T. J. Watson Research Center, Yorktown Heights, NY
Venue:
HiPC'08 Proceedings of the 15th international conference on High performance computing
Year:
2008

Abstract

The unique architecture of the heterogeneous multicore Cell processor offers great potential for high performance computing. It offers features such as high memory bandwidth using DMA, user-managed local stores and SIMD architecture. In this paper, we present strategies for leveraging these features to develop a high performance BLAS library. We propose techniques to partition and distribute data across SPEs for handling DMA efficiently. We show that suitable preprocessing of data leads to significant performance improvements when the data is unaligned. In addition, we use a combination of two kernels, a specialized high performance kernel for the more frequently occurring cases and a generic kernel for handling boundary cases, to obtain better performance. Using these techniques for double precision, we obtain up to 70-80% of peak performance for different memory bandwidth bound level 1 and 2 routines and up to 80-90% for computation bound level 3 routines.
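
The two-kernel idea and the partitioning of data across SPEs can be illustrated with a small sketch. This is not the authors' implementation: it uses a level 1 AXPY operation (y = alpha*x + y) in plain Python, and the names `partition`, `axpy_fast`, `axpy_generic`, the block width `BLOCK = 4`, and the default of 8 workers are all assumptions chosen for illustration. The sequential loop over chunks merely stands in for SPEs working in parallel, and DMA transfers and alignment preprocessing are not modeled.

```python
# Illustrative sketch only: all names and the block size are assumptions,
# not taken from the paper.
BLOCK = 4  # hypothetical unroll width standing in for a SIMD-friendly block


def partition(n, workers):
    """Split range(n) into `workers` contiguous (start, end) chunks of near-equal size."""
    base, extra = divmod(n, workers)
    chunks, start = [], 0
    for w in range(workers):
        # The first `extra` workers take one additional element each.
        end = start + base + (1 if w < extra else 0)
        chunks.append((start, end))
        start = end
    return chunks


def axpy_fast(alpha, x, y, start, end):
    """Specialized kernel: only valid when (end - start) is a multiple of BLOCK."""
    assert (end - start) % BLOCK == 0
    for i in range(start, end, BLOCK):
        # Manually unrolled by BLOCK as a stand-in for vectorized code.
        y[i] += alpha * x[i]
        y[i + 1] += alpha * x[i + 1]
        y[i + 2] += alpha * x[i + 2]
        y[i + 3] += alpha * x[i + 3]


def axpy_generic(alpha, x, y, start, end):
    """Generic kernel: handles any range, used here for boundary elements."""
    for i in range(start, end):
        y[i] += alpha * x[i]


def axpy(alpha, x, y, workers=8):
    """Compute y = alpha*x + y in place, chunk by chunk, and return y."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    for start, end in partition(len(x), workers):
        # The bulk of each chunk goes to the fast kernel, the tail to the generic one.
        bulk_end = start + ((end - start) // BLOCK) * BLOCK
        axpy_fast(alpha, x, y, start, bulk_end)
        axpy_generic(alpha, x, y, bulk_end, end)
    return y
```

The design point is that the fast kernel can assume a fixed, convenient shape and avoid per-element checks, while the generic kernel absorbs whatever does not fit that shape.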