The effect of sharing on the cache and bus performance of parallel programs

Authors:
S. J. Eggers;R. H. Katz
Affiliations:
Depattment of Computer Science, FR-35, University of Washington, Seattle, WA;Computer Science Division, Dept. of Electrical Engineering & Computer Science, University of California, Berkeley, California
Venue:
ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
Year:
1989

Citing 19
Cited 54

Cache coherence protocols: evaluation using a multiprocessor simulation model

ACM Transactions on Computer Systems (TOCS)
Design Decisions in SPUR

Computer
An in-cache address translation mechanism

ISCA '86 Proceedings of the 13th annual international symposium on Computer architecture
Line (block) size choice for CPU cache memories

IEEE Transactions on Computers
Cache memory optimization to reduce processor/memory traffic

Advances in VLSI and Computer Systems
Logic verification algorithms and their parallel implementation

DAC '87 Proceedings of the 24th ACM/IEEE Design Automation Conference
Cache performance of operating system and multiprogramming workloads

ACM Transactions on Computer Systems (TOCS)
Multiprocessor cache analysis using ATUM

ISCA '88 Proceedings of the 15th Annual International Symposium on Computer architecture
Performance tradeoffs in cache design

ISCA '88 Proceedings of the 15th Annual International Symposium on Computer architecture
A characterization of sharing in parallel programs and its application to coherency protocol evaluation

ISCA '88 Proceedings of the 15th Annual International Symposium on Computer architecture
The VMP multiprocessor: initial experience, refinements, and performance evaluation

ISCA '88 Proceedings of the 15th Annual International Symposium on Computer architecture
Memory-reference characteristics of multiprocessor applications under MACH

SIGMETRICS '88 Proceedings of the 1988 ACM SIGMETRICS conference on Measurement and modeling of computer systems
Lifetime analysis of dynamically allocated objects

POPL '88 Proceedings of the 15th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Reduced instruction set computers

Communications of the ACM - Special section on computer architecture
Cache evaluation and the impact of workload choice

ISCA '85 Proceedings of the 12th annual international symposium on Computer architecture
Implementing a cache consistency protocol

ISCA '85 Proceedings of the 12th annual international symposium on Computer architecture
Cache memory performance in a unix enviroment

ACM SIGARCH Computer Architecture News
Aspects of Cache Memory and Instruction

Aspects of Cache Memory and Instruction
SPUR Memory System Architecture

SPUR Memory System Architecture

Evaluating the performance of four snooping cache coherency protocols

ISCA '89 Proceedings of the 16th annual international symposium on Computer architecture
Cache considerations for multiprocessor programmers

Communications of the ACM
Directory-Based Cache Coherence in Large-Scale Multiprocessors

Computer
Quick and easy cache performance analysis

ACM SIGARCH Computer Architecture News
Techniques for efficient inline tracing on a shared-memory multiprocessor

SIGMETRICS '90 Proceedings of the 1990 ACM SIGMETRICS conference on Measurement and modeling of computer systems
Analysis of critical architectural and programming parameters in a hierarchical

SIGMETRICS '90 Proceedings of the 1990 ACM SIGMETRICS conference on Measurement and modeling of computer systems
Munin: distributed shared memory based on type-specific memory coherence

PPOPP '90 Proceedings of the second ACM SIGPLAN symposium on Principles & practice of parallel programming
A software coherence scheme with the assistance of directories

ICS '91 Proceedings of the 5th international conference on Supercomputing
Determination of mass properties of polygonal CSG objects in parallel

SMA '91 Proceedings of the first ACM symposium on Solid modeling foundations and CAD/CAM applications
Comparison of hardware and software cache coherence schemes

ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
Implementation and performance of Munin

SOSP '91 Proceedings of the thirteenth ACM symposium on Operating systems principles
Hardware-assisted replay of multiprocessor programs

PADD '91 Proceedings of the 1991 ACM/ONR workshop on Parallel and distributed debugging
On reconfigurable on-chip data caches

MICRO 24 Proceedings of the 24th annual international symposium on Microarchitecture
Simplicity Versus Accuracy in a Model of Cache Coherency Overhead

IEEE Transactions on Computers
Comparison and analysis of software and directory coherence schemes

Proceedings of the 1991 ACM/IEEE conference on Supercomputing
A performance study of memory consistency models

ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Towards a shared-memory massively parallel multiprocessor

ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Adjustable block size coherent caches

ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Cache Invalidation Patterns in Shared-Memory Multiprocessors

IEEE Transactions on Computers
A performance evaluation of optimal hybrid cache coherency protocols

ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Cooperative shared memory: software and hardware for scalable multiprocessor

ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Cache coherence in large-scale shared-memory multiprocessors: issues and comparisons

ACM Computing Surveys (CSUR)
Performance evaluation of hybrid hardware and software distributed shared memory protocols

ICS '94 Proceedings of the 8th international conference on Supercomputing
An approach to scalability study of shared memory parallel systems

SIGMETRICS '94 Proceedings of the 1994 ACM SIGMETRICS conference on Measurement and modeling of computer systems
Impact of sharing-based thread placement on multithreaded architectures

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Decoupled sectored caches: conciliating low tag implementation cost

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Fine-grain access control for distributed shared memory

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Implications of hierarchical N-body methods for multiprocessor architectures

ACM Transactions on Computer Systems (TOCS)
Data and computation transformations for multiprocessors

PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
The SPLASH-2 programs: characterization and methodological considerations

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
CAT—caching address tags: a technique for reducing area cost of on-chip caches

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Evaluating the impact of advanced memory systems on compiler-parallelized codes

PACT '95 Proceedings of the IFIP WG10.3 working conference on Parallel architectures and compilation techniques
Don't use the page number, but a pointer to it

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Compiler-directed page coloring for multiprocessors

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Characterizing the Memory Behavior of Compiler-Parallelized Applications

IEEE Transactions on Parallel and Distributed Systems
Decoupled Sectored Caches

IEEE Transactions on Computers
Tradeoffs between false sharing and aggregation in software distributed shared memory

PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
Minimizing Area Cost of On-Chip Cache Memories by Caching Address Tags

IEEE Transactions on Computers
Hardware Approaches to Cache Coherence in Shared-Memory Multiprocessors Part 2

IEEE Micro
False Sharing and Spatial Locality in Multiprocessor Caches

IEEE Transactions on Computers
Compile-Time Partitioning of Iterative Parallel Loops to Reduce Cache Coherency Traffic

IEEE Transactions on Parallel and Distributed Systems
An Analysis of Cache Performance for a Hypercube Multicomputer

IEEE Transactions on Parallel and Distributed Systems
The Impact of Parallel Loop Scheduling Strategies on Prefetching in a Shared Memory Multiprocessor

IEEE Transactions on Parallel and Distributed Systems
Sequential Hardware Prefetching in Shared-Memory Multiprocessors

IEEE Transactions on Parallel and Distributed Systems
False Sharing Elimination by Selection of Runtime Scheduling Parameters

ICPP '97 Proceedings of the international Conference on Parallel Processing
Minerva: An Adaptive Subblock Coherence Protocol for Improved SMP Performance

ISHPC '02 Proceedings of the 4th International Symposium on High Performance Computing
Reducing False Sharing and Improving Spatial Locality in a Unified Compilation Framework

IEEE Transactions on Parallel and Distributed Systems
Improving server software support for simultaneous multithreaded processors

Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming
A Shared-bus Control Mechanism and a Cache Coherence Protocol for a High-performance On-chip Multiprocessor

HPCA '96 Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture
Evaluation of cache consistency algorithm performance

MASCOTS '96 Proceedings of the 4th International Workshop on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems
Analysis of Shared Memory Misses and Reference Patterns

ICCD '00 Proceedings of the 2000 IEEE International Conference on Computer Design: VLSI in Computers & Processors
Can High Bandwidth and Latency Justify Large Cache Blocks in Scalable Multiprocessors?

ICPP '94 Proceedings of the 1994 International Conference on Parallel Processing - Volume 01
Speeding-up multiprocessors running DBMS workloads through coherence protocols

International Journal of High Performance Computing and Networking
Transactional memory

Journal of Parallel and Distributed Computing

Quantified Score

Hi-index	0.03

Visualization

Abstract

Bus bandwidth ultimately limits the performance, and therefore the scale, of bus-based, shared memory multiprocessors. Previous studies have extrapolated from uniprocessor measurements and simulations to estimate the performance of these machines. In this study, we use traces of parallel programs to evaluate the cache and bus performance of shared memory multiprocessors, in which coherency is maintained by a write-invalidate protocol. In particular, we analyze the effect of sharing overhead on cache miss ratio and bus utilization.Our studies show that parallel programs incur substantially higher miss ratios and bus utilization than comparable uniprocessor programs. The sharing component of these metrics proportionally increases with both cache and block size, and for some cache configurations determines both their magnitude and trend. The amount of overhead depends on the memory reference pattern to the shared data. Programs that exhibit good per-processor-locality perform better than those with fine-grain-sharing. This suggests that parallel software writers and better compiler technology can improve program performance through better memory organization of shared data.