A high-performance microarchitecture with hardware-programmable functional units
MICRO 27 Proceedings of the 27th annual international symposium on Microarchitecture
MediaBench: a tool for evaluating and synthesizing multimedia and communicatons systems
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
A tool for partitioning and pipelined scheduling of hardware-software systems
Proceedings of the 11th international symposium on System synthesis
Profiling in the ASP codesign environment
Journal of Systems Architecture: the EUROMICRO Journal
Hardware-Software Cosynthesis for Microcontrollers
IEEE Design & Test
Dynamic hardware/software partitioning: a first approach
Proceedings of the 40th annual Design Automation Conference
Control Speculation in Multithreaded Processors through Dynamic Loop Detection
HPCA '98 Proceedings of the 4th International Symposium on High-Performance Computer Architecture
SPARK: A High-Lev l Synthesis Framework For Applying Parallelizing Compiler Transformations
VLSID '03 Proceedings of the 16th International Conference on VLSI Design
A Processor-Coprocessor Architecture for High End Video Applications
ICASSP '97 Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '97) -Volume 1 - Volume 1
PEAS-III: An ASIP Design Environment
ICCD '00 Proceedings of the 2000 IEEE International Conference on Computer Design: VLSI in Computers & Processors
Computer Architecture: A Quantitative Approach
Computer Architecture: A Quantitative Approach
Characteristics of Loop Unrolling Effect: Software Pipelining and Memory Latency Hiding
IWIA '01 Proceedings of the Innovative Architecture for Future Generation High-Performance Processors and Systems (IWIA'01)
The chimaera reconfigurable functional unit
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
INSIDE: INstruction Selection/Identification & Design Exploration for Extensible Processors
Proceedings of the 2003 IEEE/ACM international conference on Computer-aided design
Rapid Embedded Hardware/Software System Generation
VLSID '05 Proceedings of the 18th International Conference on VLSI Design held jointly with 4th International Conference on Embedded Systems Design
An efficient architecture for JPEG2000 coprocessor
IEEE Transactions on Consumer Electronics
Specification and analysis of timing constraints for embedded systems
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Custom-instruction synthesis for extensible-processor platforms
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Proceedings of the 17th ACM Great Lakes symposium on VLSI
The Journal of Supercomputing
Speedups in embedded systems with a high-performance coprocessor datapath
ACM Transactions on Design Automation of Electronic Systems (TODAES)
Journal of Signal Processing Systems - Special Issue: Embedded computing systems for DSP
The input-aware dynamic adaptation of area and performance for reconfigurable accelerator
Proceedings of the ACM/SIGDA international symposium on Field programmable gate arrays
Rapid, low-power loop execution in a network of functional units
Proceedings of the 17th Panhellenic Conference on Informatics
Hi-index | 0.00 |
In this paper, we show a novel approach to accelerate loops by tightly coupling a coprocessor to an ASIP. Latency hiding is used to exploit the parallelism available in this architecture. To illustrate the advantages of this approach, we investigate a JPEG encoding algorithm and accelerate one of its loop by implementing it in a coprocessor. We contrast the acceleration by implementing the critical segment as two different coprocessors and a set of customized instructions. The two different coprocessor approaches are: a high-level synthesis (HLS) approach; and a custom coprocessor approach. The HLS approach provides a faster method of generating coprocessors. We show that a loop performance improvement of 2.57x is achieved using the custom coprocessor approach, compared to 1.58x for the HLS approach and 1.33x for the customized instruction approach compared with just the main processor. Respective energy savings within the loop are 57%, 28% and 19%.