Partitioning and Mapping Algorithms into Fixed Size Systolic Arrays
IEEE Transactions on Computers
A design methodology for synthesizing parallel algorithms and architectures
Journal of Parallel and Distributed Computing
On the problem of optimizing data transfers for complex memory systems
ICS '88 Proceedings of the 2nd international conference on Supercomputing
POPL '88 Proceedings of the 15th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
A practical algorithm for exact array dependence analysis
Communications of the ACM
SUIF: an infrastructure for research on parallelizing and optimizing compilers
ACM SIGPLAN Notices
Constructing and exploiting linear schedules with prescribed parallelism
ACM Transactions on Design Automation of Electronic Systems (TODAES)
Memory Access Optimization and RAM Inference for Pipeline Vectorization
FPL '99 Proceedings of the 9th International Workshop on Field-Programmable Logic and Applications
Automatic Architectural Synthesis of VLIW and EPIC Processors
Proceedings of the 12th international symposium on System synthesis
Bitwidth cognizant architecture synthesis of custom hardware accelerators
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Cycle-time aware architecture synthesis of custom hardware accelerators
CASES '02 Proceedings of the 2002 international conference on Compilers, architecture, and synthesis for embedded systems
Data remapping for design space optimization of embedded memory systems
ACM Transactions on Embedded Computing Systems (TECS)
ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
A scheduling algorithm for optimization and early planning in high-level synthesis
ACM Transactions on Design Automation of Electronic Systems (TODAES)
Cost Sensitive Modulo Scheduling in a Loop Accelerator Synthesis System
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Proceedings of the conference on Design, automation and test in Europe: Proceedings
Area and delay estimation for FPGA implementation of coarse-grained reconfigurable architectures
Proceedings of the 2006 ACM SIGPLAN/SIGBED conference on Language, compilers, and tool support for embedded systems
Streamroller:: automatic synthesis of prescribed throughput accelerator pipelines
CODES+ISSS '06 Proceedings of the 4th international conference on Hardware/software codesign and system synthesis
Increasing hardware efficiency with multifunction loop accelerators
CODES+ISSS '06 Proceedings of the 4th international conference on Hardware/software codesign and system synthesis
Proceedings of the 17th ACM Great Lakes symposium on VLSI
Automatic mapping of nested loops to FPGAS
Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming
Journal of VLSI Signal Processing Systems
The Journal of Supercomputing
Speedups in embedded systems with a high-performance coprocessor datapath
ACM Transactions on Design Automation of Electronic Systems (TODAES)
Modulo scheduling for highly customized datapaths to increase hardware reusability
Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization
A domain specific interconnect for reconfigurable computing
Proceedings of the 2008 ACM SIGPLAN-SIGBED conference on Languages, compilers, and tools for embedded systems
Journal of Signal Processing Systems - Special Issue: Embedded computing systems for DSP
VEAL: Virtualized Execution Accelerator for Loops
ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
C-based design flow: a case study on G.729A for voice over internet protocol (VoIP)
Proceedings of the 45th annual Design Automation Conference
Behavior-level observability don't-cares and application to low-power behavioral synthesis
Proceedings of the 14th ACM/IEEE international symposium on Low power electronics and design
Squashing microcode stores to size in embedded systems while delivering rapid microcode accesses
CODES+ISSS '09 Proceedings of the 7th IEEE/ACM international conference on Hardware/software codesign and system synthesis
Path-based scheduling in a hardware compiler
Proceedings of the Conference on Design, Automation and Test in Europe
Architecture exploration for efficient data transfer and storage in data-parallel applications
EuroPar'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I
Co-synthesis of FPGA-based application-specific floating point simd accelerators
Proceedings of the 19th ACM/SIGDA international symposium on Field programmable gate arrays
Microprocessors & Microsystems
Dynamic memory access management for high-performance DSP applications using high-level synthesis
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Massively parallel programming models used as hardware description languages: the OpenCL case
Proceedings of the International Conference on Computer-Aided Design
Trimaran: an infrastructure for research in instruction-level parallelism
LCPC'04 Proceedings of the 17th international conference on Languages and Compilers for High Performance Computing
Bundled execution of recurring traces for energy-efficient general purpose processing
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
From software to accelerators with LegUp high-level synthesis
Proceedings of the 2013 International Conference on Compilers, Architectures and Synthesis for Embedded Systems
SDC-based modulo scheduling for pipeline synthesis
Proceedings of the International Conference on Computer-Aided Design
Studying the code compression design space - A synthesis approach
Journal of Systems Architecture: the EUROMICRO Journal
Hi-index | 0.00 |
The PICO-NPA system automatically synthesizes nonprogrammable accelerators (NPAs) to be used as co-processors for functions expressed as loop nests in C. The NPAs it generates consist of a synchronous array of one or more customized processor datapaths, their controller, local memory, and interfaces. The user, or a design space exploration tool that is a part of the full PICO system, identifies within the application a loop nest to be implemented as an NPA, and indicates the performance required of the NPA by specifying the number of processors and the number of machine cycles that each processor uses per iteration of the inner loop. PICO-NPA emits synthesizable HDL that defines the accelerator at the register transfer level (RTL). The system also modifies the user's application software to make use of the generated accelerator.The main objective of PICO-NPA is to reduce design cost and time, without significantly reducing design quality. Design of an NPA and its support software typically requires one or two weeks using PICO-NPA, which is a many-fold improvement over the industry norm. In addition, PICO-NPA can readily generate a wide-range of implementations with scalable performance from a single specification. In experimental comparison of NPAs of equivalent throughput, PICO-NPA designs are slightly more costly than hand-designed accelerators.Logic synthesis and place-and-route have been performed successfully on PICO-NPA designs, which have achieved high clock rates.