Compiler supports for VLIW DSP processors with SIMD intrinsics

Authors:
Chi-Bang Kuan; Jenq Kuen Lee
Affiliations:
Department of Computer Science, National Tsing Hua University, Hsinchu 30013, Taiwan; Department of Computer Science, National Tsing Hua University, Hsinchu 30013, Taiwan
Venue:
Concurrency and Computation: Practice & Experience
Year:
2012

Abstract

To sustain growing multimedia workloads, modern digital signal processing (DSP) processors are commonly equipped with subword instructions to accelerate signal processing. Besides subword instructions, the functional units of very long instruction word (VLIW) DSP processors can also be employed to process multiple data streams in parallel. However, because of power and area concerns, many embedded VLIW DSP processors adopt distributed register files, which reduce read/write ports and wiring by privatizing register files for clusters and even for individual functional units. This distributed design presents great challenges to compilers in distributing single instruction, multiple data (SIMD) workloads to functional units. In this paper, we address the issue of supporting SIMD parallelism on VLIW DSP processors with subword instructions and distributed register files. Current industrial practice has adopted intrinsics, which enable developers to utilize hardware resources and compete with hand-coded assembly in performance. However, providing such a solution for VLIW DSP processors with distributed register files remains an open issue. In this work, we provide SIMD intrinsics that allow programmers to write highly optimized code by following the given programming guides. In addition, an enhanced register allocation scheme and data replication optimizations are devised to enable efficient code generation. In our experiments, the DSPstone benchmark and a set of H.264 kernels are used to evaluate the proposed programming and optimization schemes. The results show that combining SIMD intrinsics with compiler optimizations yields remarkable performance improvements: speedups of 2.9 and 3.5 for DSPstone and H.264 kernels, respectively. Copyright © 2011 John Wiley & Sons, Ltd.
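
As a rough illustration of what a subword instruction computes, the sketch below emulates a packed addition of two 16-bit lanes held in one 32-bit word, using portable C. The names add2x16 and pack2x16 are hypothetical and chosen only for this example; they are not the intrinsics proposed in the paper, and a real DSP would perform the operation in a single instruction rather than with masks.

```c
#include <stdint.h>

/* Hypothetical helper (not an intrinsic from the paper): packs two
   16-bit values into one 32-bit word, hi in the upper lane. */
static uint32_t pack2x16(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)hi << 16) | lo;
}

/* Hypothetical helper (not an intrinsic from the paper): adds the two
   16-bit lanes of a and b independently, wrapping around within each
   lane, as a subword add instruction would. */
static uint32_t add2x16(uint32_t a, uint32_t b)
{
    /* Add the low 15 bits of each lane; with the top bit of each lane
       cleared, a carry can never cross into the neighbouring lane. */
    uint32_t low = (a & 0x7FFF7FFFu) + (b & 0x7FFF7FFFu);

    /* Fold the top bit of each lane back in with XOR, which adds it
       modulo 2 and discards the carry out of the lane. */
    return low ^ ((a ^ b) & 0x80008000u);
}
```

For instance, add2x16(pack2x16(1000, 0xFFFF), pack2x16(234, 1)) gives 1234 in the upper lane and 0 in the lower lane: the lower lane wraps around without disturbing the upper one, which is the lane isolation that subword instructions provide in hardware.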