Increasing focus on multimedia applications has prompted the addition of multimedia extensions to most existing general-purpose microprocessors. This added functionality comes primarily with the addition of short SIMD instructions. Unfortunately, access to these instructions is limited to in-line assembly and library calls. Generally, it has been assumed that vector compilers provide the most promising means of exploiting multimedia instructions. Although vectorization technology is well understood, it is inherently complex and fragile. In addition, it is incapable of locating SIMD-style parallelism within a basic block.

In this paper we introduce the concept of Superword Level Parallelism (SLP), a novel way of viewing parallelism in multimedia and scientific applications. We believe SLP is fundamentally different from the loop-level parallelism exploited by traditional vector processing, and therefore demands a new method of extracting it. We have developed a simple and robust compiler for detecting SLP that targets basic blocks rather than loop nests. As with techniques designed to extract ILP, ours is able to exploit parallelism both across loop iterations and within basic blocks. The result is an algorithm that provides excellent performance in several application domains. In our experiments, dynamic instruction counts were reduced by 46%. Speedups ranged from 1.24 to 6.70.
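To make the idea of parallelism within a basic block concrete, the following is a minimal sketch in C and not the compiler algorithm described in the paper. It assumes a hypothetical four-lane `vec4` type with helper functions (`vec4_load`, `vec4_store`, `vec4_add`, `vec4_mul`) standing in for a real short-SIMD register and its instructions. All of these names, and the two `scale_add_*` functions, are illustrative assumptions. The scalar version contains four isomorphic statements on adjacent memory locations with no loop; the packed version performs the same work with one operation per superword.

```c
#include <stddef.h>

/* Hypothetical 4-lane "superword"; stands in for a short SIMD register. */
typedef struct { float lane[4]; } vec4;

/* Hypothetical helpers modelling packed load, store, add and multiply. */
static vec4 vec4_load(const float *p) {
    vec4 v;
    for (size_t i = 0; i < 4; i++) v.lane[i] = p[i];
    return v;
}

static void vec4_store(float *p, vec4 v) {
    for (size_t i = 0; i < 4; i++) p[i] = v.lane[i];
}

static vec4 vec4_add(vec4 x, vec4 y) {
    vec4 r;
    for (size_t i = 0; i < 4; i++) r.lane[i] = x.lane[i] + y.lane[i];
    return r;
}

static vec4 vec4_mul(vec4 x, vec4 y) {
    vec4 r;
    for (size_t i = 0; i < 4; i++) r.lane[i] = x.lane[i] * y.lane[i];
    return r;
}

/* Scalar basic block: four isomorphic statements touching adjacent
 * elements. There is no loop here for a loop vectorizer to analyse. */
void scale_add_scalar(float *a, const float *b, const float *c, float k) {
    a[0] = b[0] + k * c[0];
    a[1] = b[1] + k * c[1];
    a[2] = b[2] + k * c[2];
    a[3] = b[3] + k * c[3];
}

/* Packed form: the four statements become one load per operand, one
 * multiply, one add and one store, each acting on a whole superword. */
void scale_add_packed(float *a, const float *b, const float *c, float k) {
    vec4 kk = { { k, k, k, k } };           /* scalar k replicated per lane */
    vec4 prod = vec4_mul(kk, vec4_load(c)); /* k * c[0..3] */
    vec4_store(a, vec4_add(vec4_load(b), prod));
}
```

Both functions compute the same four results. On real hardware the helpers would map to the target's multimedia instructions, and a compiler would also have to account for alignment and for the cost of packing and unpacking operands, which this sketch ignores.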