The effect of reconfigurable units in superscalar processors
FPGA '01 Proceedings of the 2001 ACM/SIGDA ninth international symposium on Field programmable gate arrays
Garp: a MIPS processor with a reconfigurable coprocessor
FCCM '97 Proceedings of the 5th IEEE Symposium on FPGA-Based Custom Computing Machines
The chimaera reconfigurable functional unit
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
The MOLEN Polymorphic Processor
IEEE Transactions on Computers
An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness
Proceedings of the 36th annual international symposium on Computer architecture
An adaptive performance modeling tool for GPU architectures
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
An integrated GPU power and performance model
Proceedings of the 37th annual international symposium on Computer architecture
Assessing Accelerator-Based HPC Reverse Time Migration
IEEE Transactions on Parallel and Distributed Systems
FEM: A Step Towards a Common Memory Layout for FPGA Based Accelerators
FPL '10 Proceedings of the 2010 International Conference on Field Programmable Logic and Applications
CuMAPz: a tool to analyze memory access patterns in CUDA
Proceedings of the 48th Design Automation Conference
TARCAD: A template architecture for reconfigurable accelerator designs
SASP '11 Proceedings of the 2011 IEEE 9th Symposium on Application Specific Processors
GROPHECY: GPU performance projection from CPU code skeletons
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Hi-index | 0.00 |
The current trend in high performance computing (HPC) systems is to deploy parallel computers equipped with general purpose multi-core processors and possibly multi-core streaming accelerators. However, the performance of these multi-cores is often constrained by the limited external bandwidth or by badly matching data access patterns. The latter reduces the size of useful data during memory transactions. A change in the application algorithm can improve the memory accesses but a hardware support mechanism for an application specific data arrangement in the memory hierarchy can significantly boost the performance for many application domains. In this work, we present a conceptual computing architecture named BSArc (Blacksmith Streaming Architecture). BSArc introduces a forging front-end to efficiently distribute data to a large set of simple streaming processors in the back-end. We apply this concept to a SIMT execution model and present a design space exploration in the context of a GPU-like streaming architecture with a reconfigurable application specific front-end. These design space explorations are carried out on a streaming architectural simulator that models BSArc. We evaluate the performance advantages for the BSArc design against a standard L2 cache in a GPU-like device. In our evaluations we use three application kernels: 2D-FFT, Matrix-Matrix Multiplication and 3D-Stencil. The results show that employing an application specific arrangement of data on these kernels achieves an average speedup of 2.3× compared to a standard cache for a GPU-like streaming device.