BSArc: blacksmith streaming architecture for HPC accelerators

  • Authors:
  • Muhammad Shafiq;Miquel Pericas;Nacho Navarro;Eduard Ayguade

  • Affiliations:
  • Barcelona Supercomputing Center and Universitat Politecnica de Catalunya, Barcelona, Spain;Tokyo Institute of Technology, Tokyo, Japan;Barcelona Supercomputing Center and Universitat Politecnica de Catalunya, Barcelona, Spain;Barcelona Supercomputing Center and Universitat Politecnica de Catalunya, Barcelona, Spain

  • Venue:
  • Proceedings of the 9th conference on Computing Frontiers
  • Year:
  • 2012

Quantified Score

Hi-index 0.00

Visualization

Abstract

The current trend in high performance computing (HPC) systems is to deploy parallel computers equipped with general purpose multi-core processors and possibly multi-core streaming accelerators. However, the performance of these multi-cores is often constrained by the limited external bandwidth or by badly matching data access patterns. The latter reduces the size of useful data during memory transactions. A change in the application algorithm can improve the memory accesses but a hardware support mechanism for an application specific data arrangement in the memory hierarchy can significantly boost the performance for many application domains. In this work, we present a conceptual computing architecture named BSArc (Blacksmith Streaming Architecture). BSArc introduces a forging front-end to efficiently distribute data to a large set of simple streaming processors in the back-end. We apply this concept to a SIMT execution model and present a design space exploration in the context of a GPU-like streaming architecture with a reconfigurable application specific front-end. These design space explorations are carried out on a streaming architectural simulator that models BSArc. We evaluate the performance advantages for the BSArc design against a standard L2 cache in a GPU-like device. In our evaluations we use three application kernels: 2D-FFT, Matrix-Matrix Multiplication and 3D-Stencil. The results show that employing an application specific arrangement of data on these kernels achieves an average speedup of 2.3× compared to a standard cache for a GPU-like streaming device.