Embedded systems increasingly combine multi-core processors with heterogeneous resources such as graphics processing units (GPUs) and field-programmable gate arrays (FPGAs). However, the application design complexity these systems impose, stemming from parallel programming and device-specific challenges, often leaves performance potential untapped. Developers targeting such systems must currently determine how to parallelize computation, create a device-specialized implementation for each heterogeneous resource, and then decide how to apportion work among the resources. In this paper, we present the RACECAR heuristic, which automates the optimization of applications for multi-core heterogeneous systems by exploring implementation alternatives spanning different algorithms, parallelization strategies, and work distributions. Experimental results show that RACECAR-specialized implementations can effectively incorporate provided implementations and parallelize computation across multiple cores, GPUs, and FPGAs, improving performance by an average of 47x over a CPU, whereas the fastest provided implementations average only 33x.
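To make the work-distribution step concrete, the following is a minimal sketch (not the authors' RACECAR implementation) of one piece of such an exploration: given estimated per-device throughputs, enumerate candidate fractional splits of the work between two devices and keep the split with the lowest predicted completion time. The device names and throughput numbers are hypothetical.

```python
# Illustrative sketch: pick a work split between two devices that run
# concurrently, minimizing the predicted completion time (the max of the
# two devices' finish times). Throughputs are assumed estimates in
# work-units per second; real systems would also model transfer overheads.

def predicted_time(work, throughput):
    """Seconds for a device to process `work` units at `throughput` units/s."""
    return work / throughput if work else 0.0

def best_split(total_work, throughputs, steps=20):
    """Search fractional splits of `total_work` across exactly two devices.

    `throughputs` maps a device name to its estimated units/s.
    Returns (split_dict, predicted_completion_time) for the best split found.
    """
    (dev_a, tp_a), (dev_b, tp_b) = throughputs.items()
    best = None
    for i in range(steps + 1):
        frac = i / steps
        work_a = total_work * frac
        work_b = total_work - work_a
        # Devices execute in parallel, so completion time is the slower one.
        t = max(predicted_time(work_a, tp_a), predicted_time(work_b, tp_b))
        if best is None or t < best[1]:
            best = ({dev_a: work_a, dev_b: work_b}, t)
    return best

# Hypothetical estimates: a GPU nine times faster than the CPU.
split, t = best_split(1000.0, {"cpu": 10.0, "gpu": 90.0})
print(split, t)  # most of the work lands on the faster GPU
```

A full heuristic in the spirit of the abstract would layer this kind of search under choices of algorithm and parallelization strategy, and extend it to more than two devices.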