Embedded systems increasingly combine multi-core processors with heterogeneous resources such as graphics processing units (GPUs) and field-programmable gate arrays (FPGAs). However, the application design complexity these systems impose, stemming from parallel programming and device-specific challenges, often leaves performance potential untapped. Developers targeting such systems must currently determine how to parallelize computation, create a device-specialized implementation for each heterogeneous resource, and then decide how to apportion work among the resources. In this paper, we present the RACECAR heuristic, which automates the optimization of applications for multi-core heterogeneous systems by exploring implementation alternatives spanning different algorithms, parallelization strategies, and work distributions. Experimental results show that RACECAR-specialized implementations can effectively incorporate provided implementations and parallelize computation across multiple cores, GPUs, and FPGAs, improving performance by an average of 47x over a CPU, whereas the fastest provided implementations average only 33x.
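To make the work-distribution step concrete, the following is a minimal sketch (not the authors' RACECAR implementation) of one piece of such an exploration: given estimated per-device throughputs, enumerate candidate fractional splits of the work between two devices and keep the split with the lowest predicted completion time. The device names and throughput numbers are hypothetical.

```python
# Illustrative sketch: pick a work split between two devices that run
# concurrently, minimizing the predicted completion time (the max of the
# two devices' finish times). Throughputs are assumed estimates in
# work-units per second; real systems would also model transfer overheads.

def predicted_time(work, throughput):
    """Seconds for a device to process `work` units at `throughput` units/s."""
    return work / throughput if work else 0.0

def best_split(total_work, throughputs, steps=20):
    """Search fractional splits of `total_work` across exactly two devices.

    `throughputs` maps a device name to its estimated units/s.
    Returns (split_dict, predicted_completion_time) for the best split found.
    """
    (dev_a, tp_a), (dev_b, tp_b) = throughputs.items()
    best = None
    for i in range(steps + 1):
        frac = i / steps
        work_a = total_work * frac
        work_b = total_work - work_a
        # Devices execute in parallel, so completion time is the slower one.
        t = max(predicted_time(work_a, tp_a), predicted_time(work_b, tp_b))
        if best is None or t < best[1]:
            best = ({dev_a: work_a, dev_b: work_b}, t)
    return best

# Hypothetical estimates: a GPU nine times faster than the CPU.
split, t = best_split(1000.0, {"cpu": 10.0, "gpu": 90.0})
print(split, t)  # most of the work lands on the faster GPU
```

A full heuristic in the spirit of the abstract would layer this kind of search under choices of algorithm and parallelization strategy, and extend it to more than two devices.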