Optimization of BLAS on the Cell processor

  • Authors:
  • Vaibhav Saxena;Prashant Agrawal;Yogish Sabharwal;Vijay K. Garg;Vimitha A. Kuruvilla;John A. Gunnels

  • Affiliations:
  • IBM India Research Lab, New Delhi;IBM India Research Lab, New Delhi;IBM India Research Lab, New Delhi;IBM India Research Lab, New Delhi;IBM India STG Engineering Labs, Bangalore;IBM T. J. Watson Research Center, Yorktown Heights, NY

  • Venue:
  • HiPC'08 Proceedings of the 15th international conference on High performance computing
  • Year:
  • 2008

Abstract

The unique architecture of the heterogeneous multicore Cell processor offers great potential for high performance computing. It offers features such as high memory bandwidth using DMA, user-managed local stores and SIMD architecture. In this paper, we present strategies for leveraging these features to develop a high performance BLAS library. We propose techniques to partition and distribute data across SPEs for handling DMA efficiently. We show that suitable preprocessing of data leads to significant performance improvements when the data is unaligned. In addition, we use a combination of two kernels, a specialized high performance kernel for the more frequently occurring cases and a generic kernel for handling boundary cases, to obtain better performance. Using these techniques for double precision, we obtain up to 70-80% of peak performance for different memory bandwidth bound level 1 and 2 routines and up to 80-90% for computation bound level 3 routines.
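
The two-kernel idea mentioned in the abstract can be illustrated with a minimal, portable C sketch. This is not the authors' SPE code: the block size `BS`, the function names (`kernel_fast`, `kernel_generic`, `dgemm_two_kernels`) and the row-major layout are all assumptions made for illustration, and the sketch uses no DMA or SIMD intrinsics. It only shows the dispatch pattern: full blocks go to a kernel with fixed loop bounds, while partial blocks at the matrix edges go to a generic kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical block size; a real SPE kernel would choose it to fit the local store. */
#define BS 4

/* Specialized kernel: assumes a full BS x BS x BS block, so every loop
   bound is a compile-time constant that the compiler can unroll. */
static void kernel_fast(const double *a, const double *b, double *c,
                        size_t lda, size_t ldb, size_t ldc)
{
    for (size_t i = 0; i < BS; i++)
        for (size_t k = 0; k < BS; k++) {
            double aik = a[i * lda + k];
            for (size_t j = 0; j < BS; j++)
                c[i * ldc + j] += aik * b[k * ldb + j];
        }
}

/* Generic kernel: handles any mb x nb x kb block, used for boundary cases. */
static void kernel_generic(const double *a, const double *b, double *c,
                           size_t lda, size_t ldb, size_t ldc,
                           size_t mb, size_t nb, size_t kb)
{
    for (size_t i = 0; i < mb; i++)
        for (size_t k = 0; k < kb; k++) {
            double aik = a[i * lda + k];
            for (size_t j = 0; j < nb; j++)
                c[i * ldc + j] += aik * b[k * ldb + j];
        }
}

static size_t min_sz(size_t x, size_t y) { return x < y ? x : y; }

/* C (m x n) += A (m x k) * B (k x n), all matrices row-major and dense.
   Full blocks are dispatched to kernel_fast, edge blocks to kernel_generic. */
void dgemm_two_kernels(size_t m, size_t n, size_t k,
                       const double *a, const double *b, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i0 = 0; i0 < m; i0 += BS) {
        size_t mb = min_sz(BS, m - i0);
        for (size_t j0 = 0; j0 < n; j0 += BS) {
            size_t nb = min_sz(BS, n - j0);
            for (size_t k0 = 0; k0 < k; k0 += BS) {
                size_t kb = min_sz(BS, k - k0);
                const double *pa = a + i0 * k + k0;
                const double *pb = b + k0 * n + j0;
                double *pc = c + i0 * n + j0;
                if (mb == BS && nb == BS && kb == BS)
                    kernel_fast(pa, pb, pc, k, n, n);
                else
                    kernel_generic(pa, pb, pc, k, n, n, mb, nb, kb);
            }
        }
    }
}
```

When the matrix dimensions are multiples of `BS`, every block takes the fast path; otherwise only the edge blocks pay the cost of the generic kernel's variable loop bounds.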