False Sharing and Spatial Locality in Multiprocessor Caches

Authors:
J. Torrellas; H. S. Lam; J. L. Hennessy
Venue:
IEEE Transactions on Computers
Year:
1994

Abstract

The performance of the data cache in shared-memory multiprocessors has been shown to be different from that in uniprocessors. In particular, cache miss rates in multiprocessors do not show the sharp drop typical of uniprocessors when the size of the cache block increases. The resulting high cache miss rate is a cause of concern, since it can significantly limit the performance of multiprocessors. Some researchers have speculated that this effect is due to false sharing, the coherence transactions that result when different processors update different words of the same cache block in an interleaved fashion. While the analysis of six applications in the paper confirms that false sharing has a significant impact on the miss rate, the measurements also show that poor spatial locality among accesses to shared data has an even larger impact. To mitigate false sharing and to enhance spatial locality, we optimize the layout of shared data in cache blocks in a programmer-transparent manner. We show that this approach can reduce the number of misses on shared data by about 10% on average.
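
The false-sharing definition in the abstract can be illustrated with a small sketch. This is not the paper's simulator or its layout optimization. It is a toy model that assumes infinite per-processor caches, a simple write-invalidate protocol, and a trace made only of writes. The names `count_misses`, `interleaved_updates`, `BLOCK_WORDS`, and `ROUNDS` are hypothetical and chosen for illustration. Two processors each update their own counter in interleaved fashion, first with both counters in one cache block and then with each counter in its own block.

```python
def count_misses(trace, block_words):
    """Count misses in a toy write-invalidate model with infinite caches.

    trace: iterable of (processor_id, word_address) write accesses.
    block_words: cache block size, in words.
    """
    holders = {}  # block number -> set of processors holding a valid copy
    misses = 0
    for proc, addr in trace:
        block = addr // block_words
        if proc not in holders.get(block, set()):
            misses += 1
        # A write invalidates every other processor's copy of the block.
        holders[block] = {proc}
    return misses


def interleaved_updates(addr_of, rounds):
    """Each processor updates its own word, round-robin, `rounds` times."""
    return [(p, addr_of[p]) for _ in range(rounds) for p in sorted(addr_of)]


BLOCK_WORDS = 4   # assumed block size for this sketch
ROUNDS = 100      # assumed number of update rounds

# Layout 1: both counters share block 0 (different words, same block).
packed = {0: 0, 1: 1}
# Layout 2: each counter is placed in its own block.
padded = {0: 0, 1: BLOCK_WORDS}

packed_misses = count_misses(interleaved_updates(packed, ROUNDS), BLOCK_WORDS)
padded_misses = count_misses(interleaved_updates(padded, ROUNDS), BLOCK_WORDS)
print(packed_misses, padded_misses)  # prints "200 2"
```

In the packed layout every one of the 200 writes misses, because the other processor's write to a different word has just invalidated the block. In the padded layout only the two cold misses remain. Real programs sit between these extremes, and separating data this way can cost spatial locality, which the abstract identifies as the larger effect.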