Line (block) size choice for CPU cache memories
IEEE Transactions on Computers
Hierarchical cache/bus architecture for shared memory multiprocessors
ISCA '87 Proceedings of the 14th annual international symposium on Computer architecture
Multiprocessor cache design considerations
ISCA '87 Proceedings of the 14th annual international symposium on Computer architecture
Portable programs for parallel processors
Portable programs for parallel processors
Performance tradeoffs in cache design
ISCA '88 Proceedings of the 15th Annual International Symposium on Computer architecture
The Wisconsin multicube: a new large-scale cache-coherent multiprocessor
ISCA '88 Proceedings of the 15th Annual International Symposium on Computer architecture
Memory-reference characteristics of multiprocessor applications under MACH
SIGMETRICS '88 Proceedings of the 1988 ACM SIGMETRICS conference on Measurement and modeling of computer systems
MIPS RISC architecture
Analysis of cache invalidation patterns in multiprocessors
ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
The effect of sharing on the cache and bus performance of parallel programs
ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
POPL '88 Proceedings of the 15th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Characterization of parallelism and deadlocks in distributed digital logic simulation
DAC '89 Proceedings of the 26th ACM/IEEE Design Automation Conference
Multi-level shared caching techniques for scalability in VMP-M/C
ISCA '89 Proceedings of the 16th annual international symposium on Computer architecture
Computer architecture: a quantitative approach
Computer architecture: a quantitative approach
A data locality optimizing algorithm
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
LocusRoute: a parallel global router for standard cells
DAC '88 Proceedings of the 25th ACM/IEEE Design Automation Conference
The directory-based cache coherence protocol for the DASH multiprocessor
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Issues related to MIMD shared-memory computers: the NYU ultracomputer approach
ISCA '85 Proceedings of the 12th annual international symposium on Computer architecture
ACM Computing Surveys (CSUR)
A low-overhead coherence solution for multiprocessors with private cache memories
ISCA '84 Proceedings of the 11th annual international symposium on Computer architecture
Aspects of Cache Memory and Instruction
Aspects of Cache Memory and Instruction
Compiler transformations for high-performance computing
ACM Computing Surveys (CSUR)
Effective cache prefetching on bus-based multiprocessors
ACM Transactions on Computer Systems (TOCS)
Reducing false sharing on shared memory multiprocessors through compile time data transformations
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
CAT—caching address tags: a technique for reducing area cost of on-chip caches
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Unified compilation techniques for shared and distributed address space machines
ICS '95 Proceedings of the 9th international conference on Supercomputing
Evaluating the impact of advanced memory systems on compiler-parallelized codes
PACT '95 Proceedings of the IFIP WG10.3 working conference on Parallel architectures and compilation techniques
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Characterizing the Memory Behavior of Compiler-Parallelized Applications
IEEE Transactions on Parallel and Distributed Systems
WSC '96 Proceedings of the 28th conference on Winter simulation
Performance debugging shared memory parallel programs using run-time dependence analysis
SIGMETRICS '97 Proceedings of the 1997 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
A compiler algorithm for optimizing locality in loop nests
ICS '97 Proceedings of the 11th international conference on Supercomputing
The interaction of parallel programming constructs and coherence protocols
PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
Minimizing Area Cost of On-Chip Cache Memories by Caching Address Tags
IEEE Transactions on Computers
An Efficient Solution to the Cache Thrashing Problem Caused by True Data Sharing
IEEE Transactions on Computers
Evaluating the Effect of Coherence Protocols on the Performance of Parallel Programming Constructs
International Journal of Parallel Programming
Improving Cache Locality by a Combination of Loop and Data Transformations
IEEE Transactions on Computers - Special issue on cache memory and related problems
A Linear Algebra Framework for Automatic Determination of Optimal Data Layouts
IEEE Transactions on Parallel and Distributed Systems
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Scal-Tool: pinpointing and quantifying scalability bottlenecks in DSM multiprocessors
SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
Cacheminer: A Runtime Approach to Exploit Cache Locality on SMP
IEEE Transactions on Parallel and Distributed Systems
IEEE Transactions on Parallel and Distributed Systems
Hoard: a scalable memory allocator for multithreaded applications
ACM SIGPLAN Notices
Hoard: a scalable memory allocator for multithreaded applications
ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
Static and Dynamic Locality Optimizations Using Integer Linear Programming
IEEE Transactions on Parallel and Distributed Systems
Adaptive Sector Grouping to Reduce False Sharing in Distributed RAID
Cluster Computing
Hidden Costs in Avoiding False Sharing in Software DSMs
HiPC '01 Proceedings of the 8th International Conference on High Performance Computing
Enhancing Software DSM for Compiler-Parallelized Applications
IPPS '97 Proceedings of the 11th International Symposium on Parallel Processing
Minerva: An Adaptive Subblock Coherence Protocol for Improved SMP Performance
ISHPC '02 Proceedings of the 4th International Symposium on High Performance Computing
Boosting the Performance of Three-Tier Web Servers Deploying SMP Architecture
Revised Papers from the NETWORKING 2002 Workshops on Web Engineering and Peer-to-Peer Computing
Reducing False Sharing and Improving Spatial Locality in a Unified Compilation Framework
IEEE Transactions on Parallel and Distributed Systems
Fine-grain design space exploration for a cartographic SoC multiprocessor
ACM SIGARCH Computer Architecture News
An effective full-map directory scheme for the sectored caches
HPC-ASIA '97 Proceedings of the High-Performance Computing on the Information Superhighway, HPC-Asia '97
Analysis of Shared Memory Misses and Reference Patterns
ICCD '00 Proceedings of the 2000 IEEE International Conference on Computer Design: VLSI in Computers & Processors
Scalable lock-free dynamic memory allocation
Proceedings of the ACM SIGPLAN 2004 conference on Programming language design and implementation
Performance analysis of methods that overcome false sharing effects in software DSMs
Journal of Parallel and Distributed Computing
EMPS: An Environment for Memory Performance Studies
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 10 - Volume 11
Journal of Parallel and Distributed Computing
Quantifying Locality In The Memory Access Patterns of HPC Applications
SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
International Journal of Parallel Programming
Whole-program optimization of global variable layout
Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Unbounded page-based transactional memory
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
SAC: off-the-shelf support for data-parallelism on multicores
Proceedings of the 2007 workshop on Declarative aspects of multicore programming
Structure Layout Optimization for Multithreaded Programs
Proceedings of the International Symposium on Code Generation and Optimization
Characteristics of workloads used in high performance and technical computing
Proceedings of the 21st annual international conference on Supercomputing
Speeding-up multiprocessors running DBMS workloads through coherence protocols
International Journal of High Performance Computing and Networking
MPC: A Unified Parallel Runtime for Clusters of NUMA Machines
Euro-Par '08 Proceedings of the 14th international Euro-Par conference on Parallel Processing
A Methodology to Characterize Critical Section Bottlenecks in DSM Multiprocessors
Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
Algorithms for memory hierarchies: advanced lectures
Algorithms for memory hierarchies: advanced lectures
On the design and implementation of a portable DSM system for low-cost multicomputers
ICCSA'03 Proceedings of the 2003 international conference on Computational science and its applications: PartI
PacketShader: a GPU-accelerated software router
Proceedings of the ACM SIGCOMM 2010 conference
Adaptive conflict unit size for distributed optimistic synchronization
EuroPar'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I
Tackling cache-line stealing effects using run-time adaptation
LCPC'10 Proceedings of the 23rd international conference on Languages and compilers for parallel computing
Transactional conflict decoupling and value prediction
Proceedings of the international conference on Supercomputing
Automatic OpenCL device characterization: guiding optimized kernel design
Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part II
HW/SW co-design of parallel systems
Proceedings of the International Conference on Computer-Aided Design
Efficiently enabling conventional block sizes for very large die-stacked DRAM caches
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Latencies of conflicting writes on contemporary multicore architectures
PaCT'07 Proceedings of the 9th international conference on Parallel Computing Technologies
Native actors: a scalable software platform for distributed, heterogeneous environments
Proceedings of the 2013 workshop on Programming based on actors, agents, and decentralized control
Hi-index | 14.99 |
The performance of the data cache in shared-memory multiprocessors has been shown to be different from that in uniprocessors. In particular, cache miss rates in multiprocessors do not show the sharp drop typical of uniprocessors when the size of the cache block increases. The resulting high cache miss rate is a cause of concern, since it can significantly limit the performance of multiprocessors. Some researchers have speculated that this effect is due to false sharing, the coherence transactions that result when different processors update different words of the same cache block in an interleaved fashion. While the analysis of six applications in the paper confirms that false sharing has a significant impact on the miss rate, the measurements also show that poor spatial locality among accesses to shared data has an even larger impact. To mitigate false sharing and to enhance spatial locality, we optimize the layout of shared data in cache blocks in a programmer-transparent manner. We show that this approach can reduce the number of misses on shared data by about 10% on average.