URPR—An extension of URCR for software pipelining
MICRO 19 Proceedings of the 19th annual workshop on Microprogramming
PLDI '88 Proceedings of the ACM SIGPLAN 1988 conference on Programming Language design and Implementation
Two-dimensional signal and image processing
Two-dimensional signal and image processing
EURO-DAC '96/EURO-VHDL '96 Proceedings of the conference on European design automation
Data and memory optimization techniques for embedded systems
ACM Transactions on Design Automation of Electronic Systems (TODAES)
An efficient technique for exploring register file size in ASIP synthesis
CASES '02 Proceedings of the 2002 international conference on Compilers, architecture, and synthesis for embedded systems
Compiler-directed customization of ASIP cores
Proceedings of the tenth international symposium on Hardware/software codesign
Profiling tools for hardware/software partitioning of embedded applications
Proceedings of the 2003 ACM SIGPLAN conference on Language, compiler, and tool for embedded systems
Pipelined Fast 2-D DCT Architecture for JPEG Image Compression
Proceedings of the 14th symposium on Integrated circuits and systems design
Automatic generation of application specific processors
Proceedings of the 2003 international conference on Compilers, architecture and synthesis for embedded systems
Application Specific Instruction Set Processors: Redefining Hardware-Software Boundary
VLSID '04 Proceedings of the 17th International Conference on VLSI Design
Characterizing embedded applications for instruction-set extensible processors
Proceedings of the 41st annual Design Automation Conference
A Scalable Application-Specific Processor Synthesis Methodology
Proceedings of the 2003 IEEE/ACM international conference on Computer-aided design
Rapid Configuration and Instruction Selection for an ASIP: A Case Study
DATE '03 Proceedings of the conference on Design, Automation and Test in Europe - Volume 1
Performance and Area Modeling of Complete FPGA Designs in the Presence of Loop Transformations
IEEE Transactions on Computers
Instruction set extension with shadow registers for configurable processors
Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays
Fine-grained application source code profiling for ASIP design
Proceedings of the 42nd annual Design Automation Conference
An Architecture Framework for Transparent Instruction Set Customization in Embedded Processors
Proceedings of the 32nd annual international symposium on Computer Architecture
Motion adaptive interpolation with horizontal motion detection for deinterlacing
IEEE Transactions on Consumer Electronics
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Exact and approximate algorithms for the extension of embedded processor instruction sets
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Two versions of architectures for dynamic implied addressing mode
Journal of Systems Architecture: the EUROMICRO Journal
Loop acceleration exploration for ASIP architecture
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Hi-index | 0.00 |
Application-specific instruction-set processors (ASIPs) provide a good alternative for video processing acceleration, but the productivity gap implied by such a new technology may prevent leveraging it fully. Video processing SoCs need flexibility that is not available in pure hardware architectures, while pure software solutions do not meet video processing performance constraints. Thus, ASIP design could offer a good tradeoff between performance and flexibility. Video processing algorithms are often characterized by intrinsic parallelism that can be accelerated by ASIP specialized instructions. In this paper, we propose a new approach for exploiting sequences of tightly coupled specialized instructions in ASIP design applicable to video processing. Our approach, which avoids costly data communications by applying data grouping and data reuse, consists of accelerating an algorithm's critical loops by transforming them according to a new intermediate representation. This representation is optimized and loop parallelism possibilities are also explored. This approach has been applied to video processing algorithms such as the ELA deinterlacer and the 2D-DCT. Experimental results show speedups up to 18 (on the considered applications, while the hardware overhead in terms of additional logic gates was found to be between 18 and 59%.