Thread Warping: Dynamic and Transparent Synthesis of Thread Accelerators

Authors:
Greg Stitt; Frank Vahid
Affiliations:
University of Florida; University of California, Riverside
Venue:
ACM Transactions on Design Automation of Electronic Systems (TODAES)
Year:
2011

Abstract

We introduce thread warping, a dynamic optimization technique that customizes multicore architectures to a given application by dynamically synthesizing threads into custom accelerator circuits on FPGAs (Field-Programmable Gate Arrays). Thread warping builds upon previous dynamic synthesis techniques for single-threaded applications, enabling dynamic architectural adaptation to different amounts of thread-level parallelism, while also exploiting parallelism within each thread to further improve performance. Furthermore, thread warping maintains the important separation of function from architecture, enabling portability of applications to architectures with different quantities of microprocessors and FPGAs, an advantage not shared by static compilation/synthesis approaches. We introduce an approach consisting of CAD tools and operating system support that enables thread warping on potentially any microprocessor/FPGA architecture. We evaluate thread warping using a simulator for high-performance computing systems with different interconnections in addition to multicore embedded systems having between 4 and 64 ARM11 microprocessors. On average, thread warping achieved approximately 3x speedup compared to a high-performance quad-core Intel Xeon and 109x compared to an embedded system consisting of 4 ARM11 cores, with a size cost approximately equal to 36 ARM11 cores.
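As a rough illustration of why moving threads onto accelerators can shorten a multithreaded workload, the sketch below models a system with some cores and some accelerator slots and assigns each thread to whichever resource would finish it earliest. This is a toy model only, not the CAD or operating system flow the abstract refers to: the function name `makespan`, the earliest-finish-time policy, the uniform `accel_speedup` factor, and all numbers are assumptions made for illustration.

```python
def makespan(thread_times, n_cores, n_accels=0, accel_speedup=1.0):
    """Toy model: time to finish all threads on cores plus accelerator slots.

    thread_times  -- software run time of each thread on a single core
    n_cores       -- number of microprocessor cores
    n_accels      -- number of accelerator slots (hypothetical parameter)
    accel_speedup -- assumed uniform speedup of an accelerator over one core
    """
    # Relative speed of every resource: cores run at 1.0, accelerators faster.
    speeds = [1.0] * n_cores + [accel_speedup] * n_accels
    free_at = [0.0] * len(speeds)  # time at which each resource becomes idle

    for t in thread_times:
        # Pick the resource on which this thread would finish earliest.
        best = min(range(len(speeds)), key=lambda r: free_at[r] + t / speeds[r])
        free_at[best] += t / speeds[best]

    return max(free_at)


# Hypothetical workload: 8 identical threads, 10 time units each in software.
threads = [10.0] * 8
software_only = makespan(threads, n_cores=4)
with_accels = makespan(threads, n_cores=4, n_accels=4, accel_speedup=5.0)
print(software_only, with_accels, software_only / with_accels)
```

With these made-up inputs the model places every thread on an accelerator slot, so the overall gain equals the assumed per-thread speedup. Real gains depend on synthesis time, memory bandwidth, and FPGA capacity, none of which this sketch models.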