Compiler-Controlled Caching in Superword Register Files for Multimedia Extension Architectures

Authors:
Jaewook Shin;Jacqueline Chame;Mary W. Hall
Affiliations:
-;-;-
Venue:
Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Year:
2002

Citing 22
Cited 21

More iteration space tiling

Proceedings of the 1989 ACM/IEEE conference on Supercomputing
A data locality optimizing algorithm

PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
To copy or not to copy: a compile-time technique for assessing when data copying should be used to eliminate cache conflicts

Proceedings of the 1993 ACM/IEEE conference on Supercomputing
Improving locality and parallelism in nested loops

Improving locality and parallelism in nested loops
Scalar replacement in the presence of conditional control flow

Software—Practice & Experience
Compiler optimizations for improving data locality

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Tile size selection using cache organization and data layout

PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Influence of cross-interferences on blocked loops: a case study with matrix-vector multiply

ACM Transactions on Programming Languages and Systems (TOPLAS)
Cache miss equations: an analytical representation of cache misses

ICS '97 Proceedings of the 11th international conference on Supercomputing
Precise miss analysis for program transformations with caches of arbitrary associativity

Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Performance of image and video processing with general-purpose processors and media ISA extensions

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Microservers: a new memory semantics for massively parallel computing

ICS '99 Proceedings of the 13th international conference on Supercomputing
A tile selection algorithm for data locality and cache interference

ICS '99 Proceedings of the 13th international conference on Supercomputing
Mapping irregular applications to DIVA, a PIM-based data-intensive architecture

SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
Exploiting superword level parallelism with multimedia instruction sets

PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
A vectorizing compiler for multimedia extensions

International Journal of Parallel Programming - Special issue on instruction-level parallelism and parallelizing compilation, Part 1
Maximizing Multiprocessor Performance with the SUIF Compiler

Computer
Subword Parallelism with MAX-2

IEEE Micro
On Estimating and Enhancing Cache Effectiveness

Proceedings of the Fourth International Workshop on Languages and Compilers for Parallel Computing
Increasing and Detecting Memory Address Congruence

Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
A Comparison of Compiler Tiling Algorithms

CC '99 Proceedings of the 8th International Conference on Compiler Construction, Held as Part of the European Joint Conferences on the Theory and Practice of Software, ETAPS'99
T0 Engineering Data

T0 Engineering Data

Increasing and Detecting Memory Address Congruence

Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Vectorizing for a SIMdD DSP architecture

Proceedings of the 2003 international conference on Compilers, architecture and synthesis for embedded systems
Vectorization for SIMD architectures with alignment constraints

Proceedings of the ACM SIGPLAN 2004 conference on Programming language design and implementation
Superword-Level Parallelism in the Presence of Control Flow

Proceedings of the international symposium on Code generation and optimization
Instruction combining for coalescing memory accesses using global code motion

MSP '04 Proceedings of the 2004 workshop on Memory system performance
Improving superword level parallelism support in modern compilers

CODES+ISSS '05 Proceedings of the 3rd IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
An integrated simdization framework using virtual vectors

Proceedings of the 19th annual international conference on Supercomputing
Exploiting Vector Parallelism in Software Pipelined Loops

Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Multi-platform Auto-vectorization

Proceedings of the International Symposium on Code Generation and Optimization
Auto-vectorization of interleaved data for SIMD

Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation
Compiling for an indirect vector register architecture

Proceedings of the 5th conference on Computing frontiers
Exploiting SIMD Parallelism with the CGiS Compiler Framework

Languages and Compilers for Parallel Computing
Outer-loop vectorization: revisited for short SIMD architectures

Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Automatic parallelization for graphics processing units

PPPJ '09 Proceedings of the 7th International Conference on Principles and Practice of Programming in Java
Dependence-based code generation for a CELL processor

LCPC'06 Proceedings of the 19th international conference on Languages and compilers for parallel computing
Extending a C-like language for portable SIMD programming

Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
SIMD defragmenter: efficient ILP realization on data-parallel architectures

ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
Vapor SIMD: Auto-vectorize once, run everywhere

CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization
A compiler framework for extracting superword level parallelism

Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation
Polyhedral parallel code generation for CUDA

ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
Parallel execution of Java loops on Graphics Processing Units

Science of Computer Programming

Quantified Score

Hi-index	0.00

Visualization

Abstract

In this paper, we describe an algorithm and implementation of locality optimizations for architectures with instruction sets such as Intel's SSE and Motorola's AltiVec that support operations on superwords, i.e., aggregate objects consisting of several machine words. We treat the large superword register file as a compiler-controlled cache, thus avoiding unnecessary memory accesses by exploiting reuse in superword registers. This research is distinguished from previous work on exploiting reuse in scalar registers because it considers not only temporal but also spatial reuse. As compared to optimizations to exploit reuse in cache, the compiler must also manage replacement, and thus, explicitly name registers in the generated code. We describe an implementation of our approach integrated with a compiler that exploits superword-level parallelism (SLP). We present a set of results derived automatically on 4 multimedia kernels and 2 scientific benchmarks. Our results show speedups ranging from 1.3 to 2.8X on the 6 programs as compared to using SLP alone, and we eliminate the majority of memory accesses.