ISPA'07 Proceedings of the 5th international conference on Parallel and Distributed Processing and Applications
Multimedia extensions (MME) are architectural extensions to general-purpose processors that boost the performance of multimedia workloads. Today, in-line assembly code, intrinsic functions, and library routines are the most common means of programming these extensions. A promising alternative is to use vectorization technology to generate MME instructions automatically from programs written in standard high-level languages. However, despite the early success of automatic vectorization for traditional vector supercomputers, state-of-the-art vectorizing compilers for multimedia extensions have yet to demonstrate their effectiveness, especially on multimedia workloads. In this paper, we present an empirical study of the vectorization of media processing programs for multimedia extensions. Our study identifies several new issues that traditional vectorizers do not handle. These issues arise partly from the unique features of MME architectures and partly from the characteristics of media processing applications. We propose several techniques to address some of these issues and assess their effectiveness by manually applying them to a set of multimedia programs. We also find that further optimizations after vectorization are essential to benefit from multimedia extensions. In our experiments, 23 of 34 core procedures from the Berkeley Multimedia Workload (BMW) were manually vectorized, and 14 procedures achieved speedups of 1.10 to 3.39 on a Pentium 4 processor.
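To make the contrast between hand-written intrinsics and compiler-generated MME code concrete, the following C sketch shows one loop in both forms. It is an illustration only and is not taken from the paper: the function names `sat_add_scalar` and `sat_add_simd` are hypothetical, the saturating 8-bit add is just one plausible example of a media-processing idiom, and the SSE2 path assumes an x86 target (it falls back to scalar code elsewhere).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2 intrinsics; assumed available on x86 targets */
#endif

/* Scalar form, as a programmer would write it in plain C.
 * A vectorizer would have to recognize the clamp as a saturating add. */
void sat_add_scalar(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    assert(dst != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++) {
        unsigned sum = (unsigned)a[i] + (unsigned)b[i];
        dst[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}

/* Hand-vectorized form (hypothetical helper): 16 bytes per iteration
 * when SSE2 is available, followed by a scalar loop for the remainder. */
void sat_add_simd(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    assert(dst != NULL && a != NULL && b != NULL);
    size_t i = 0;
#if defined(__SSE2__)
    for (; i + 16 <= n; i += 16) {
        /* Unaligned loads and stores, since nothing is known about alignment. */
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_adds_epu8(va, vb));
    }
#endif
    /* Remainder elements, or the whole array when SSE2 is not available. */
    for (; i < n; i++) {
        unsigned sum = (unsigned)a[i] + (unsigned)b[i];
        dst[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}
```

Both functions should produce identical results for any length `n`. The second version hints at why idiom recognition and alignment handling matter when a compiler has to derive such code from the first version on its own.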