ATOM: a system for building customized program analysis tools
PLDI '94 Proceedings of the ACM SIGPLAN 1994 conference on Programming language design and implementation
IEEE Transactions on Computers
Automatic application-specific instruction-set extensions under microarchitectural constraints
Proceedings of the 40th annual Design Automation Conference
Automatic Topology-Based Identification of Instruction-Set Extensions for Embedded Processors
Proceedings of the conference on Design, automation and test in Europe
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Design Methodology for a Tightly Coupled VLIW/Reconfigurable Matrix Architecture: A Case Study
Proceedings of the conference on Design, automation and test in Europe - Volume 2
Flexible architectures for engineering successful SOCs
Proceedings of the 41st annual Design Automation Conference
Fast Cycle-accurate Behavioral Simulation for Pipelined Processors Using Early Pipeline Evaluation
Proceedings of the 2003 IEEE/ACM international conference on Computer-aided design
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Heterogeneous associative cache for multimedia applications
IMSA'07 IASTED European Conference on Proceedings of the IASTED European Conference: internet and multimedia systems and applications
Heterogeneous associative cache for multimedia applications
EurolMSA '07 Proceedings of the Third IASTED European Conference on Internet and Multimedia Systems and Applications
Chip multiprocessor based on data-driven multithreading model
International Journal of High Performance Systems Architecture
A fault tolerant NoC architecture using quad-spare mesh topology and dynamic reconfiguration
Journal of Systems Architecture: the EUROMICRO Journal
Hi-index | 0.00 |
High performance microprocessors are designed with general-purpose applications in mind. When it comes to embedded applications, these architectures typically perform control-intensive tasks in a System-on-Chip (SoC) design. But they are significantly inefficient for data-intensive tasks such as video encoding/decoding. Although configurable processors fill this gap by complementing the existing functional units with instruction extensions, their performance lags behind the needs of real-time embedded tasks. In this paper, we evaluate the performance potential of a dataflow processor for H.264 video decoding. We first profile the H.264 application to capture the amount of data traffic among modules. We use this information to guide the placement of H.264 modules in the WaveScalar dataflow architecture. A simulated annealing based placement algorithm produces the final placement aiming to optimize the communication costs between the modules in the dataflow architecture. In addition to outperforming contemporary embedded and customized processors, our simulated annealing guided design shows a speedup of 13% in execution time over the original WaveScalar architecture. With our dataflow design methodology, emerging embedded applications requiring several GOPS to meet real-time constraints can be drafted within a reasonable amount of design time.