Special Issue: Compilers for Parallel Computing (CPC 2010)
Concurrency and Computation: Practice & Experience
To sustain growing multimedia workloads, modern digital signal processing (DSP) processors are commonly equipped with subword instructions that accelerate signal processing. Besides subword instructions, the functional units of very long instruction word (VLIW) DSP processors can also be employed to process multiple data streams in parallel. However, because of power and area concerns, many embedded VLIW DSP processors adopt distributed register files, which reduce the number of read/write ports and wire connections by privatizing register files for clusters and even for individual functional units. This distributed design presents great challenges to compilers in distributing single instruction, multiple data (SIMD) workloads to functional units. In this paper, we address the problem of supporting SIMD parallelism on VLIW DSP processors with subword instructions and distributed register files. Industrial practice has adopted intrinsics that enable developers to utilize hardware resources and compete with hand-coded assembly in performance. However, providing such a solution for VLIW DSP processors with distributed register files remains an open issue. In this work, we provide SIMD intrinsics that allow programmers to write highly optimized code by following given programming guides. In addition, an enhanced register allocation scheme and data replication optimizations are devised to enable efficient code generation. In our experiments, the DSPstone benchmark and a set of H.264 kernels are used to evaluate the proposed programming and optimization schemes. The results show that combining SIMD intrinsics with compiler optimizations yields remarkable performance improvements: speedups of 2.9 and 3.5 for the DSPstone and H.264 kernels, respectively. Copyright © 2011 John Wiley & Sons, Ltd.
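As a rough illustration of what a subword instruction computes, the sketch below emulates a lane-wise add on two 16-bit values packed into one 32-bit word. It is not the intrinsic set or the compiler scheme described in the abstract: the names `pack2x16`, `unpack2x16`, and `add2x16` are hypothetical, and the 2x16-bit layout is an assumption chosen only to keep the example small.

```python
# Hypothetical emulation of a 2 x 16-bit subword add (not the paper's intrinsics).
LANE_BITS = 16
LANE_MASK = 0xFFFF
HIGH_BITS = 0x80008000  # top bit of each 16-bit lane
LOW_BITS = 0x7FFF7FFF   # lower 15 bits of each lane


def pack2x16(hi, lo):
    """Pack two 16-bit values into one 32-bit word (hi in the upper lane)."""
    return ((hi & LANE_MASK) << LANE_BITS) | (lo & LANE_MASK)


def unpack2x16(word):
    """Split a 32-bit word back into its (hi, lo) 16-bit lanes."""
    return (word >> LANE_BITS) & LANE_MASK, word & LANE_MASK


def add2x16(a, b):
    """Lane-wise wrapping add: each 16-bit lane is summed independently."""
    # Add the lower 15 bits of each lane; the carry out of bit 14 lands in
    # bit 15 of the same lane, so nothing crosses the lane boundary.
    partial = (a & LOW_BITS) + (b & LOW_BITS)
    # Fold in the top bit of each lane with XOR, which drops the carry out
    # of the lane and gives wrap-around (modulo 2**16) behaviour.
    return (partial ^ ((a ^ b) & HIGH_BITS)) & 0xFFFFFFFF
```

For example, `unpack2x16(add2x16(pack2x16(1, 0xFFFF), pack2x16(1, 1)))` gives `(2, 0)`: the lower lane wraps around without disturbing the upper lane, which is the behaviour a single hardware subword add would provide in one operation.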