Reducing False Sharing and Improving Spatial Locality in a Unified Compilation Framework

Authors:
Mahmut Kandemir;Alok Choudhary;J. Ramanujam;Prith Banerjee
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
2003

Abstract

The performance of applications on large shared-memory multiprocessors with coherent caches depends on the interaction between the granularity of data sharing, the size of the coherence unit, and the spatial locality exhibited by the applications, in addition to the amount of parallelism in the applications. Large coherence units are helpful in exploiting spatial locality, but worsen the effects of false sharing. A mathematical framework that allows a clean description of the relationship between spatial locality and false sharing is derived in this paper. First, a technique to identify a severe form of multiple-writer false sharing is presented. The importance of the interaction between optimization techniques aimed at enhancing locality and the techniques oriented toward reducing false sharing is then demonstrated. Given the conflicting requirements, a compiler-based approach to this problem holds promise. This paper investigates the use of data transformations in addressing spatial locality and false sharing, and derives an approach that balances the impact of the two. Experimental results demonstrate that such a balanced approach outperforms those approaches that consider only one of these two issues. On an eight-processor SGI/Cray Origin 2000 multiprocessor, our approach brings an additional 9 percent improvement over a powerful locality optimization technique that uses both loop and data transformations. Also, the presented approach obtains an additional 19 percent improvement over an optimization technique that is oriented specifically toward reducing false sharing. This study also reveals that, in addition to reducing synchronization costs and improving the memory subsystem performance, obtaining large granularity parallelism is helpful in balancing the effects of enhancing locality and reducing false sharing, rendering them compatible.
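
To make the tension between coherence-unit size and false sharing concrete, the following is a minimal illustrative sketch, not the framework derived in the paper. It assumes a toy model: an n x n array, rows assigned cyclically to processors, and every element written only by its owning processor, so any cache line with more than one writer is falsely shared. The names `line_of` and `falsely_shared_lines`, the cyclic row assignment, and the layout labels "row" and "col" are all hypothetical choices made for this illustration.

```python
def line_of(i, j, n, layout, line_elems):
    """Map element (i, j) of an n x n array to a cache-line index.

    layout is "row" (row-major) or "col" (column-major); line_elems is
    the number of array elements per coherence unit (assumed model).
    """
    offset = i * n + j if layout == "row" else j * n + i
    return offset // line_elems


def falsely_shared_lines(n, procs, layout, line_elems):
    """Count cache lines written by more than one processor.

    Assumption: row i is written only by processor i % procs, so no
    element is truly shared and every multi-writer line is false sharing.
    """
    writers = {}
    for i in range(n):
        for j in range(n):
            line = line_of(i, j, n, layout, line_elems)
            writers.setdefault(line, set()).add(i % procs)
    return sum(1 for w in writers.values() if len(w) > 1)


if __name__ == "__main__":
    # Small lines, row-major: each line lies inside one row.
    print(falsely_shared_lines(8, 4, "row", 4))
    # Larger lines, row-major: a line now spans two rows with different owners.
    print(falsely_shared_lines(8, 4, "row", 16))
    # Small lines, column-major: each line spans rows owned by different processors.
    print(falsely_shared_lines(8, 4, "col", 4))
```

In this toy model, a data transformation (switching between the two layouts) or a change in coherence-unit size alters the number of multi-writer lines without changing the computation, which is the kind of interaction the abstract describes. The model ignores read sharing, timing, and the locality of the inner-loop traversal.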