The Combined Effectiveness of Unimodular Transformations, Tiling, and Software Prefetching

Authors:
Rafael H. Saavedra-Barrera;Weihua Mao;Daeyeon Park;Jacqueline Chame;Sungdo Moon
Affiliations:
-;-;-;-;-
Venue:
IPPS '96 Proceedings of the 10th International Parallel Processing Symposium
Year:
1996

Citing 26
Cited 7

Automatic translation of FORTRAN programs to vector form

ACM Transactions on Programming Languages and Systems (TOPLAS)
Synchronization, Coherence, and Event Ordering in Multiprocessors

Computer
MASA: a multithreaded processor architecture for parallel symbolic computing

ISCA '88 Proceedings of the 15th Annual International Symposium on Computer architecture
Supernode partitioning

POPL '88 Proceedings of the 15th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
More iteration space tiling

Proceedings of the 1989 ACM/IEEE conference on Supercomputing
Improving register allocation for subscripted variables

PLDI '90 Proceedings of the ACM SIGPLAN 1990 conference on Programming language design and implementation
The cache performance and optimizations of blocked algorithms

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
LAPACK: a portable linear algebra library for high-performance computers

Proceedings of the 1990 ACM/IEEE conference on Supercomputing
Tolerating latency through software-controlled prefetching in shared-memory multiprocessors

Journal of Parallel and Distributed Computing - Special issue on shared-memory multiprocessors
Design and evaluation of a compiler algorithm for prefetching

ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
To copy or not to copy: a compile-time technique for assessing when data copying should be used to eliminate cache conflicts

Proceedings of the 1993 ACM/IEEE conference on Supercomputing
Weak ordering—a new definition

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Memory consistency and event ordering in scalable shared-memory multiprocessors

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
APRIL: a processor architecture for multiprocessing

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Dependence Analysis for Supercomputing

Dependence Analysis for Supercomputing
Solving Linear Systems on Vector and Shared Memory Computers

Solving Linear Systems on Vector and Shared Memory Computers
A Loop Transformation Theory and an Algorithm to Maximize Parallelism

IEEE Transactions on Parallel and Distributed Systems
On Estimating and Enhancing Cache Effectiveness

Proceedings of the Fourth International Workshop on Languages and Compilers for Parallel Computing
Iteration Space Tiling for Memory Hierarchies

Proceedings of the Third SIAM Conference on Parallel Processing for Scientific Computing
Improving the performance of virtual memory computers.

Improving the performance of virtual memory computers.
Dependence analysis for subscripted variables and its application to program transformations

Dependence analysis for subscripted variables and its application to program transformations
Optimizing supercompilers for supercomputers

Optimizing supercompilers for supercomputers
Software methods for improvement of cache performance on supercomputer applications

Software methods for improvement of cache performance on supercomputer applications
Data Coherence Problem in a Multicache System

IEEE Transactions on Computers
A New Solution to Coherence Problems in Multicache Systems

IEEE Transactions on Computers
Cache system design in the tightly coupled multiprocessor system

AFIPS '76 Proceedings of the June 7-10, 1976, national computer conference and exposition

Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology

ICS '97 Proceedings of the 11th international conference on Supercomputing
An integer linear programming approach for optimizing cache locality

ICS '99 Proceedings of the 13th international conference on Supercomputing
Evaluating the impact of memory system performance on software prefetching and locality optimizations

ICS '01 Proceedings of the 15th international conference on Supercomputing
Static and Dynamic Locality Optimizations Using Integer Linear Programming

IEEE Transactions on Parallel and Distributed Systems
Combining Loop Fusion with Prefetching on Shared-memory Multiprocessors

ICPP '97 Proceedings of the international Conference on Parallel Processing
Merging, sorting and matrix operations on the SOME-bus multiprocessor architecture

Future Generation Computer Systems - Special issue: Advanced services for clusters and internet computing
SAC: a functional array language for efficient multi-threaded execution

International Journal of Parallel Programming

Quantified Score

Hi-index	0.01

Visualization

Abstract

Unimodular transformations, tiling, and software prefetching are loop optimizations known to be effective in increasing parallelism, reducing cache miss rates, and eliminating processor stall time. Although these optimizations individually are quite effective, there is the expectation that even better improvements can be obtained by combining them together. In this paper we show that indeed this is the case when unimodular transformations are combined with either tiling or software prefetching. However, our results also show that although combining tiling with prefetching tends to improve the performance of tiling alone, it is also the case that in some situations tiling can degrade the cache performance of software prefetching. The reasons for this unexpected behavior are three fold: 1) tiling introduces interference misses inside the localized space which are difficult to characterize with current techniques based on locality analysis; 2) prefetch predicates are computed using only estimates on the amount of capacity misses, so the latency induced by cache interference is not completely covered; and 3) tiling limits the maximum amount of latency that can be masked with prefetching.