A code-based analytical approach for using separate device coprocessors in computing systems

Authors:
Volker Hampel; Grigori Goronzy; Erik Maehle
Affiliations:
Institute of Computer Engineering, University of Lübeck, Lübeck, Germany (all authors)
Venue:
ARCS'11: Proceedings of the 24th International Conference on Architecture of Computing Systems
Year:
2011



Abstract

Special hardware accelerators such as FPGAs and GPUs are commonly added to a computing system as separate devices. Consequently, the accelerator and the host system do not share a common memory, so offloading data to the additional hardware incurs a communication penalty. Combining a program's source code with execution profiling, we perform an analysis that uses arithmetic intensity as a cost function to identify the parts of the program that are most worthwhile to offload to the accelerator. The basic principles of this analysis are introduced and tested with a sample application. Its concrete results are discussed and evaluated against the performance of an FPGA-based and a GPU-based implementation.
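
The abstract does not spell out how the cost function is computed. As a rough illustration only, the sketch below assumes the common definition of arithmetic intensity as arithmetic operations per byte transferred between host and accelerator, and ranks candidate code regions by it. The `Region` type, the function names, the region names, and all numbers are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A candidate code region; every field is a hypothetical profiling output."""
    name: str
    operations: int   # arithmetic operations executed in the region
    bytes_moved: int  # bytes sent to plus received from the accelerator


def arithmetic_intensity(region: Region) -> float:
    """Operations per transferred byte (assumed definition, not the paper's)."""
    if region.bytes_moved == 0:
        return float("inf")  # nothing to transfer, so no communication penalty
    return region.operations / region.bytes_moved


def rank_offload_candidates(regions):
    """Sort regions from highest to lowest arithmetic intensity."""
    return sorted(regions, key=arithmetic_intensity, reverse=True)


if __name__ == "__main__":
    # Made-up counts standing in for source analysis plus execution profiling.
    profile = [
        Region("vector_add", operations=1_000_000, bytes_moved=12_000_000),
        Region("matrix_multiply", operations=50_000_000, bytes_moved=2_000_000),
        Region("reduce_sum", operations=1_000_000, bytes_moved=4_000_000),
    ]
    for region in rank_offload_candidates(profile):
        print(f"{region.name}: {arithmetic_intensity(region):.3f} ops/byte")
```

Under this assumed definition, a region that performs many operations on each transferred byte amortizes the communication penalty better, so it appears near the top of the ranking. A real analysis would also need to account for how well the region maps to the specific accelerator.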