Tolerating latency in multiprocessors through compiler-inserted prefetching

Authors:
Todd C. Mowry
Affiliations:
Carnegie Mellon Univ., Pittsburgh, PA
Venue:
ACM Transactions on Computer Systems (TOCS)
Year:
1998

Abstract

The large latency of memory accesses in large-scale shared-memory multiprocessors is a key obstacle to achieving high processor utilization. Software-controlled prefetching is a technique for tolerating memory latency by explicitly executing instructions to move data close to the processor before the data are actually needed. To minimize the burden on the programmer, compiler support is needed to automatically insert prefetch instructions into the code. A key challenge when inserting prefetches is ensuring that the overheads of prefetching do not outweigh the benefits. While previous studies have demonstrated the effectiveness of hand-inserted prefetching in multiprocessor applications, the benefit of compiler-inserted prefetching in practice has remained an open question. This article proposes and evaluates a new compiler algorithm for inserting prefetches into multiprocessor code. The proposed algorithm attempts to minimize overheads by only issuing prefetches for references that are predicted to suffer cache misses. The algorithm can prefetch both dense-matrix and sparse-matrix codes, thus covering a large fraction of scientific applications. We have implemented our algorithm in the SUIF (Stanford University Intermediate Format) optimizing compiler. The results of our detailed architectural simulations demonstrate that compiler-inserted prefetching can improve the speed of some parallel applications by as much as a factor of two.
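
To make the idea in the abstract concrete, the following is a minimal hand-written sketch of software-controlled prefetching in C. It is not the article's algorithm or SUIF output; it only illustrates the kind of code a prefetching compiler might aim to produce for a simple dense loop. The `__builtin_prefetch` intrinsic is specific to GCC and Clang, and the names `sum_plain`, `sum_prefetched`, `LINE_DOUBLES`, and `PREFETCH_AHEAD`, along with their values, are assumptions chosen for illustration. Issuing one prefetch per cache line instead of one per element loosely mirrors the abstract's goal of prefetching only the references expected to miss.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tuning parameters, not taken from the article. */
#define LINE_DOUBLES   8   /* assumes 64-byte cache lines and 8-byte doubles */
#define PREFETCH_AHEAD 64  /* elements to run ahead; must be tuned per machine */

/* Baseline loop: each cache miss stalls the processor when it occurs. */
double sum_plain(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Same computation with explicit prefetches (GCC/Clang builtin). */
double sum_prefetched(const double *a, size_t n)
{
    assert(a != NULL || n == 0);
    double s = 0.0;
    size_t i = 0;

    /* Main loop, unrolled by one cache line so that only one prefetch
     * is issued per line rather than one per element. */
    for (; i + LINE_DOUBLES <= n; i += LINE_DOUBLES) {
        /* Guard keeps the address inside the array; the first
         * PREFETCH_AHEAD elements are simply not prefetched here. */
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0, 3); /* read, keep in cache */
        for (size_t j = 0; j < LINE_DOUBLES; j++)
            s += a[i + j];
    }

    /* Remainder loop for the final partial line. */
    for (; i < n; i++)
        s += a[i];
    return s;
}
```

Both functions add the elements in the same order, so they return identical results; any difference would show up only in running time, and whether the prefetched version is actually faster depends on the memory system and on the chosen prefetch distance.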