Tackling cache-line stealing effects using run-time adaptation

Authors:
Stéphane Zuckerman; William Jalby
Affiliations:
University of Versailles Saint-Quentin-en-Yvelines, France (both authors)
Venue:
LCPC'10: Proceedings of the 23rd International Conference on Languages and Compilers for Parallel Computing
Year:
2010

Abstract

Multicore processors are now found in mainstream systems as well as in supercomputers. They usually embed prefetching facilities to hide memory stalls. While very useful in general, such mechanisms can actually hamper performance in some cases, as with cache-line stealing. This paper characterizes and quantifies cache-line stealing, and shows that it can induce large slowdowns, reaching almost 65%. Several solutions are examined, ranging from deactivation of hardware prefetching to array reshaping. These solutions yield speedups of between 10% and 65% in the best cases. In order to apply these transformations only where they are relevant, we use run-time measurements and adaptive methods to generate code wrappers that are used only when prefetching hurts performance.
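The run-time adaptation idea in the abstract can be illustrated with a minimal C sketch. This is not the authors' implementation: the names `kernel_fn`, `sum_baseline`, `sum_blocked`, `pick_fastest` and `adaptive_sum` are hypothetical, the two kernels are simple stand-ins for an original and a transformed code version (for example, one using a reshaped array), and the 64-byte cache line is an assumption. The sketch only shows the general pattern of timing candidate versions on a sample and routing later calls through a wrapper to whichever measured fastest.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical kernel signature: every variant must compute the same result. */
typedef double (*kernel_fn)(const double *data, size_t n);

/* Variant A: plain forward traversal (stand-in for the original code). */
static double sum_baseline(const double *data, size_t n) {
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i)
        acc += data[i];
    return acc;
}

/* Variant B: traversal in blocks of 8 doubles (64 bytes, an assumed
 * cache-line size); stand-in for a transformed version of the code. */
static double sum_blocked(const double *data, size_t n) {
    enum { BLOCK = 8 };
    double acc = 0.0;
    size_t i = 0;
    for (; i + BLOCK <= n; i += BLOCK) {
        double partial = 0.0;
        for (size_t j = 0; j < BLOCK; ++j)
            partial += data[i + j];
        acc += partial;
    }
    for (; i < n; ++i)          /* leftover elements */
        acc += data[i];
    return acc;
}

/* Run-time measurement: CPU time in seconds for `reps` calls of one variant. */
static double time_kernel(kernel_fn fn, const double *data, size_t n, int reps) {
    volatile double sink = 0.0;  /* keeps the calls from being optimized away */
    clock_t start = clock();
    for (int r = 0; r < reps; ++r)
        sink += fn(data, n);
    clock_t stop = clock();
    (void)sink;
    return (double)(stop - start) / CLOCKS_PER_SEC;
}

/* Return the index of the variant with the smallest measured time. */
static size_t pick_fastest(const kernel_fn *variants, size_t n_variants,
                           const double *sample, size_t n, int reps) {
    assert(n_variants > 0);
    size_t best = 0;
    double best_time = time_kernel(variants[0], sample, n, reps);
    for (size_t v = 1; v < n_variants; ++v) {
        double t = time_kernel(variants[v], sample, n, reps);
        if (t < best_time) {
            best_time = t;
            best = v;
        }
    }
    return best;
}

/* Variant chosen by the first call to adaptive_sum; NULL until then. */
static kernel_fn selected_kernel = NULL;

/* Wrapper: measures once on the first call, then reuses the selection. */
static double adaptive_sum(const double *data, size_t n) {
    static const kernel_fn variants[] = { sum_baseline, sum_blocked };
    if (selected_kernel == NULL)
        selected_kernel = variants[pick_fastest(variants, 2, data, n, 16)];
    return selected_kernel(data, n);
}
```

A caller would simply invoke `adaptive_sum(buffer, length)`; which variant gets selected depends on the machine and on timer resolution, so the sketch makes no claim about which one wins. A real deployment would probably use a finer-grained timer or hardware counters and would re-measure when the workload changes.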