Optimization of dense matrix multiplication on IBM Cyclops-64: challenges and experiences

  • Authors:
  • Ziang Hu;Juan del Cuvillo;Weirong Zhu;Guang R. Gao

  • Affiliations:
  • Department of Electrical and Computer Engineering, University of Delaware, Newark, Delaware, U.S.A.;Department of Electrical and Computer Engineering, University of Delaware, Newark, Delaware, U.S.A.;Department of Electrical and Computer Engineering, University of Delaware, Newark, Delaware, U.S.A.;Department of Electrical and Computer Engineering, University of Delaware, Newark, Delaware, U.S.A.

  • Venue:
  • Euro-Par'06 Proceedings of the 12th international conference on Parallel Processing
  • Year:
  • 2006

Abstract

This paper presents a study of performance optimization of dense matrix multiplication on the IBM Cyclops-64 (C64) chip architecture. Although much has been published on how to optimize dense matrix applications on shared-memory architectures with multi-level caches, little has been reported on the applicability of the existing methods to the new generation of multi-core architectures like C64. For such architectures, a more economical use of on-chip storage resources appears to discourage the use of caches, while providing tremendous on-chip memory bandwidth per storage area. This paper presents an in-depth case study of a collection of well-known optimization methods and tries to re-engineer them to address the new challenges and opportunities provided by this emerging class of multi-core chip architectures. Our study demonstrates that efficiently exploiting the memory hierarchy is the key to achieving good performance. The main contributions of this paper include: (a) identifying a set of key optimizations for C64-like architectures, and (b) exploring a practical order of the optimizations, which yields good performance for applications like matrix multiplication.
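
As a rough illustration of what "exploiting the memory hierarchy" can mean for matrix multiplication, the sketch below shows generic loop tiling (blocking), one of the well-known optimizations in this area. It is not the paper's C64-specific scheme, which the abstract does not detail. The function names `matmul_naive` and `matmul_tiled` and the `tile` parameter are hypothetical, and plain Python lists stand in for whatever fast on-chip storage a real implementation would target.

```python
def matmul_naive(a, b):
    """Reference triple loop: c[i][j] = sum over k of a[i][k] * b[k][j]."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, tile=2):
    """Blocked multiply (illustrative only, not the paper's method).

    The three outer loops walk over tile-sized sub-blocks; the three
    inner loops do all the work for one combination of blocks, so each
    block is reused many times while it is (conceptually) resident in
    fast memory. `tile` is an assumed tuning parameter.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, tile):
        for k0 in range(0, m, tile):
            for j0 in range(0, p, tile):
                # min(...) handles edge blocks when sizes are not
                # multiples of the tile size.
                for i in range(i0, min(i0 + tile, n)):
                    for k in range(k0, min(k0 + tile, m)):
                        aik = a[i][k]  # reused across the whole j loop
                        for j in range(j0, min(j0 + tile, p)):
                            c[i][j] += aik * b[k][j]
    return c
```

Both functions compute the same product; the tiled version only changes the order in which elements are visited. On a real machine the tile size would be chosen so that the working blocks fit in the fastest level of storage available.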