Reducing false sharing on shared memory multiprocessors through compile time data transformations

Authors:
Tor E. Jeremiassen;Susan J. Eggers
Affiliations:
AT&T Bell Laboratories, 600 Mountain Ave., Murray Hill, New Jersey;Department of Computer Science and Engineering, University of Washington, Seattle, Washington
Venue:
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Year:
1995



Abstract

We have developed compiler algorithms that analyze explicitly parallel programs and restructure their shared data to reduce the number of false sharing misses. The algorithms analyze per-process shared data accesses, pinpoint the data structures that are susceptible to false sharing and choose an appropriate transformation to reduce it. The transformations either group data that is accessed by the same processor or separate individual data items that are shared. This paper evaluates that technique. We show through simulation that our analysis successfully identifies the data structures that are responsible for most false sharing misses, and then transforms them without unduly decreasing spatial locality. The reduction in false sharing positively impacts both execution time and program scalability when executed on a KSR2. Both factors combine to increase the maximum achievable speedup for all programs, more than doubling it for several. Despite being able to only approximate actual inter-processor memory accesses, the compiler-directed transformations always outperform programmer efforts to eliminate false sharing.
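
The two kinds of transformation named in the abstract (grouping data accessed by the same processor, and separating individual shared items) can be illustrated with a small layout model. This is a sketch for intuition only, not the paper's algorithm: the 64-byte block size, the 8-byte item size, the helper names, and the assumption that each item is accessed only by its owning processor are all hypothetical choices made for this example. The model counts cache blocks that hold items belonging to more than one processor, which is where false sharing can occur.

```python
BLOCK_SIZE = 64  # bytes per cache block (assumed for illustration)
ELEM_SIZE = 8    # bytes per data item (assumed for illustration)
PER_BLOCK = BLOCK_SIZE // ELEM_SIZE


def blocks_with_false_sharing(layout, per_block=PER_BLOCK):
    """Count cache blocks that hold items owned by more than one processor.

    `layout` lists the owning processor of each item slot in memory order;
    None marks a padding slot that no processor touches.
    """
    count = 0
    for start in range(0, len(layout), per_block):
        owners = {o for o in layout[start:start + per_block] if o is not None}
        if len(owners) > 1:
            count += 1
    return count


def interleaved(num_procs, items_per_proc):
    """Untransformed layout: item i is owned by processor i % num_procs."""
    return [i % num_procs for i in range(num_procs * items_per_proc)]


def grouped(num_procs, items_per_proc, per_block=PER_BLOCK):
    """Group each processor's items together, padded to a block boundary."""
    layout = []
    for p in range(num_procs):
        layout.extend([p] * items_per_proc)
        layout.extend([None] * (-len(layout) % per_block))
    return layout


def separated(num_procs, per_block=PER_BLOCK):
    """Give each processor's single item a cache block of its own."""
    layout = []
    for p in range(num_procs):
        layout.append(p)
        layout.extend([None] * (per_block - 1))
    return layout


if __name__ == "__main__":
    print("interleaved:", blocks_with_false_sharing(interleaved(4, 6)))
    print("grouped:    ", blocks_with_false_sharing(grouped(4, 6)))
    print("unpadded:   ", blocks_with_false_sharing([0, 1, 2, 3]))
    print("separated:  ", blocks_with_false_sharing(separated(4)))
```

In this model both transformed layouts occupy more memory than the originals (32 slots instead of 24 for the grouped case), which hints at why the abstract stresses transforming data "without unduly decreasing spatial locality".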