PICO-NPA: High-Level Synthesis of Nonprogrammable Hardware Accelerators

Authors:
Robert Schreiber;Shail Aditya;Scott Mahlke;Vinod Kathail;B. Ramakrishna Rau;Darren Cronquist;Mukund Sivaraman
Affiliations:
Hewlett-Packard Laboratories, Palo Alto, California 94304-1126, USA;Hewlett-Packard Laboratories, Palo Alto, California 94304-1126, USA;Hewlett-Packard Laboratories, Palo Alto, California 94304-1126, USA;Hewlett-Packard Laboratories, Palo Alto, California 94304-1126, USA;Hewlett-Packard Laboratories, Palo Alto, California 94304-1126, USA;Hewlett-Packard Laboratories, Palo Alto, California 94304-1126, USA;Hewlett-Packard Laboratories, Palo Alto, California 94304-1126, USA
Venue:
Journal of VLSI Signal Processing Systems
Year:
2002

Abstract

The PICO-NPA system automatically synthesizes nonprogrammable accelerators (NPAs) to be used as co-processors for functions expressed as loop nests in C. The NPAs it generates consist of a synchronous array of one or more customized processor datapaths, their controller, local memory, and interfaces. The user, or a design space exploration tool that is part of the full PICO system, identifies within the application a loop nest to be implemented as an NPA, and indicates the performance required of the NPA by specifying the number of processors and the number of machine cycles that each processor uses per iteration of the inner loop. PICO-NPA emits synthesizable HDL that defines the accelerator at the register transfer level (RTL). The system also modifies the user's application software to make use of the generated accelerator.

The main objective of PICO-NPA is to reduce design cost and time without significantly reducing design quality. Design of an NPA and its support software typically requires one or two weeks using PICO-NPA, which is a many-fold improvement over the industry norm. In addition, PICO-NPA can readily generate a wide range of implementations with scalable performance from a single specification. In an experimental comparison of NPAs of equivalent throughput, PICO-NPA designs are slightly more costly than hand-designed accelerators.

Logic synthesis and place-and-route have been performed successfully on PICO-NPA designs, which have achieved high clock rates.
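
The performance specification described above (a number of processors plus a number of cycles per inner-loop iteration) can be illustrated with a rough back-of-the-envelope estimate. The sketch below is not part of PICO-NPA; the function name `estimated_cycles`, its parameters, and the loop-nest sizes are hypothetical, and the model assumes iterations divide evenly across processors while ignoring pipeline fill and drain, memory stalls, and host communication.

```python
from math import prod


def estimated_cycles(trip_counts, num_processors, cycles_per_iteration):
    """Idealized cycle estimate for a loop nest on an array of processors.

    trip_counts          -- iteration count of each loop in the nest (hypothetical input)
    num_processors       -- processors in the synchronous array
    cycles_per_iteration -- cycles each processor spends per inner-loop iteration

    Assumption: work is spread evenly over the processors and all
    start-up, drain, and memory effects are ignored.
    """
    if num_processors < 1 or cycles_per_iteration < 1:
        raise ValueError("processor count and cycles per iteration must be >= 1")
    total_iterations = prod(trip_counts)
    # Ceiling division: the busiest processor's share of the iterations.
    per_processor = -(-total_iterations // num_processors)
    return per_processor * cycles_per_iteration


# Hypothetical 2-deep nest: 1024 outer iterations, 16 inner iterations.
nest = (1024, 16)
single = estimated_cycles(nest, num_processors=1, cycles_per_iteration=1)
quad = estimated_cycles(nest, num_processors=4, cycles_per_iteration=2)
```

Under these assumptions, varying the two parameters trades hardware for throughput from the same loop nest, which is the kind of scaling from a single specification that the abstract mentions.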