Partitioned register files for VLIWs: a preliminary analysis of tradeoffs
MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
MediaBench: a tool for evaluating and synthesizing multimedia and communicatons systems
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Exploiting superword level parallelism with multimedia instruction sets
PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
Optimizing compilers for modern architectures: a dependence-based approach
Optimizing compilers for modern architectures: a dependence-based approach
Graph-partitioning based instruction scheduling for clustered processors
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Compiler-Controlled Caching in Superword Register Files for Multimedia Extension Architectures
Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Increasing and Detecting Memory Address Congruence
Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Region-based hierarchical operation partitioning for multicluster processors
PLDI '03 Proceedings of the ACM SIGPLAN 2003 conference on Programming language design and implementation
PACT '00 Proceedings of the 2000 International Conference on Parallel Architectures and Compilation Techniques
Bottlenecks in Multimedia Processing with SIMD Style Extensions and Architectural Enhancements
IEEE Transactions on Computers
Processor Acceleration Through Automated Instruction Set Customization
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
The Vector-Thread Architecture
Proceedings of the 31st annual international symposium on Computer architecture
Microarchitectural techniques for power gating of execution units
Proceedings of the 2004 international symposium on Low power electronics and design
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
An Architecture Framework for Transparent Instruction Set Customization in Embedded Processors
Proceedings of the 32nd annual international symposium on Computer Architecture
SODA: A Low-power Architecture For Software Radio
Proceedings of the 33rd annual international symposium on Computer Architecture
Modulo graph embedding: mapping applications onto coarse-grained reconfigurable architectures
CASES '06 Proceedings of the 2006 international conference on Compilers, architecture and synthesis for embedded systems
Scalable subgraph mapping for acyclic computation accelerators
CASES '06 Proceedings of the 2006 international conference on Compilers, architecture and synthesis for embedded systems
Vector processing as an enabler for software-defined radio in handheld devices
EURASIP Journal on Applied Signal Processing
Outer-loop vectorization: revisited for short SIMD architectures
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Edge-centric modulo scheduling for coarse-grained reconfigurable architectures
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
From SODA to scotch: The evolution of a wireless baseband processor
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
AnySP: anytime anywhere anyway signal processing
Proceedings of the 36th annual international symposium on Computer architecture
Dynamic power gating with quality guarantees
Proceedings of the 14th ACM/IEEE international symposium on Low power electronics and design
Proceedings of the 16th ACM/IEEE international symposium on Low power electronics and design
Efficient Selection of Vector Instructions Using Dynamic Programming
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
A case for guarded power gating for multi-core processors
HPCA '11 Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture
Vapor SIMD: Auto-vectorize once, run everywhere
CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization
Breaking SIMD shackles with an exposed flexible microarchitecture and the access execute PDG
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Hi-index | 0.00 |
Single-instruction multiple-data (SIMD) accelerators provide an energy-efficient platform to scale the performance of mobile systems while still retaining post-programmability. The central challenge is translating the parallel resources of the SIMD hardware into real application performance. In scientific applications, automatic vectorization techniques have proven quite effective at extracting large levels of data-level parallelism (DLP). However, vectorization is often much less effective for media applications due to low trip count loops, complex control flow, and non-uniform execution behavior. As a result, SIMD lanes remain idle due to insufficient DLP. To attack this problem, this paper proposes a new vectorization pass called SIMD Defragmenter to uncover hidden DLP that lurks below the surface in the form of instruction-level parallelism (ILP). The difficulty is managing the data packing/unpacking overhead that can easily exceed the benefits gained through SIMD execution. The SIMD degragmenter overcomes this problem by identifying groups of compatible instructions (subgraphs) that can be executed in parallel across the SIMD lanes. By SIMDizing in bulk at the subgraph level, packing/unpacking overhead is minimized. On a 16-lane SIMD processor, experimental results show that SIMD defragmentation achieves a mean 1.6x speedup over traditional loop vectorization and a 31% gain over prior research approaches for converting ILP to DLP.