BSArc: blacksmith streaming architecture for HPC accelerators

Authors:
Muhammad Shafiq;Miquel Pericas;Nacho Navarro;Eduard Ayguade
Affiliations:
Barcelona Supercomputing Center and Universitat Politecnica de Catalunya, Barcelona, Spain;Tokyo Institute of Technology, Tokyo, Japan;Barcelona Supercomputing Center and Universitat Politecnica de Catalunya, Barcelona, Spain;Barcelona Supercomputing Center and Universitat Politecnica de Catalunya, Barcelona, Spain
Venue:
Proceedings of the 9th conference on Computing Frontiers
Year:
2012

Abstract

The current trend in high performance computing (HPC) systems is to deploy parallel computers equipped with general-purpose multi-core processors and, in many cases, multi-core streaming accelerators. However, the performance of these multi-cores is often constrained by limited external bandwidth or by data access patterns that are poorly matched to the memory system. The latter reduces the amount of useful data delivered per memory transaction. A change in the application algorithm can improve memory accesses, but a hardware mechanism that supports application-specific data arrangement in the memory hierarchy can significantly boost performance in many application domains. In this work, we present a conceptual computing architecture named BSArc (Blacksmith Streaming Architecture). BSArc introduces a forging front-end that efficiently distributes data to a large set of simple streaming processors in the back-end. We apply this concept to a SIMT execution model and present a design space exploration in the context of a GPU-like streaming architecture with a reconfigurable, application-specific front-end. The exploration is carried out on a streaming architectural simulator that models BSArc. We evaluate the performance of the BSArc design against a standard L2 cache in a GPU-like device using three application kernels: 2D-FFT, Matrix-Matrix Multiplication and 3D-Stencil. The results show that an application-specific arrangement of data for these kernels achieves an average speedup of 2.3× over a standard cache in a GPU-like streaming device.
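
To illustrate the abstract's point that a poorly matched access pattern shrinks the useful data per memory transaction, the following minimal Python sketch counts how many fetched bytes are actually consumed when a group of threads reads elements at a fixed stride. The function name, the 128-byte transaction size, the 4-byte element size and the 32-thread group are illustrative assumptions and are not taken from the paper or its simulator.

```python
def useful_fraction(stride_elems, elem_bytes=4, line_bytes=128, n_threads=32):
    """Fraction of fetched bytes that are actually used.

    Hypothetical model: each of n_threads reads one element, consecutive
    threads are stride_elems elements apart, and memory is fetched in
    aligned transactions of line_bytes (all values are assumptions).
    """
    # Byte address requested by each thread.
    addresses = [t * stride_elems * elem_bytes for t in range(n_threads)]
    # Distinct aligned transactions needed to serve all requests.
    lines_touched = {addr // line_bytes for addr in addresses}
    useful = len(set(addresses)) * elem_bytes
    fetched = len(lines_touched) * line_bytes
    return useful / fetched


if __name__ == "__main__":
    for stride in (1, 2, 8, 32):
        print(f"stride {stride:2d}: {useful_fraction(stride):.3f} of fetched bytes used")
```

Under these assumptions, unit stride uses every fetched byte, while a stride of 32 elements uses only 4 of every 128 bytes fetched. That kind of loss is what an application-specific data arrangement in the memory hierarchy aims to avoid.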