Measuring memory hierarchy performance of cache-coherent multiprocessors using micro benchmarks

Authors:
Cristina Hristea;Daniel Lenoski;John Keen
Affiliations:
Massachusetts Institute of Technology, Cambridge, MA;Silicon Graphics, Inc.;Silicon Graphics, Inc.
Venue:
SC '97 Proceedings of the 1997 ACM/IEEE conference on Supercomputing
Year:
1997

Abstract

Even with today's large caches, the increasing performance gap between processors and memory systems imposes a memory bottleneck on many important scientific and commercial applications. In shared-memory multiprocessors, this bottleneck is intensified by contention and by the effects of cache coherence. Under heavy memory contention, memory latency may increase twofold or threefold. Nonetheless, as more sophisticated techniques are used to hide latency and increase bandwidth, measuring memory performance has become increasingly difficult. Earlier, simpler measurement methods can overestimate uniprocessor memory latency and underestimate bandwidth by tens of percent. This paper introduces a micro benchmark suite that measures memory hierarchy performance in light of both uniprocessor optimizations and the contention and coherence effects of multiprocessors. The benchmark suite has been used to improve the memory system performance of the SGI Origin multiprocessor.
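To make the idea of a latency micro benchmark concrete, here is a minimal sketch of a generic dependent-load ("pointer-chasing") loop in C. It is not taken from the paper's suite, and the function names `build_cycle`, `chase`, and `ns_per_load` are hypothetical. The sketch assumes that chaining each load to the result of the previous one prevents loads from overlapping, and that a random visiting order reduces the help a simple stride prefetcher can give. It covers only the single-processor case and does not model contention or coherence traffic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Link n slots into one random cycle (Sattolo's algorithm), so that
 * repeatedly following next[i] visits every slot exactly once before
 * returning to the start. rand() is used for brevity; its range may be
 * small on some platforms, which weakens the randomness but still
 * yields a single cycle. */
static void build_cycle(size_t *next, size_t n, unsigned seed)
{
    assert(n >= 2);
    for (size_t i = 0; i < n; i++)
        next[i] = i;
    srand(seed);
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;   /* j in [0, i), never i itself */
        size_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
}

/* Follow the chain for `steps` loads. Each load address depends on the
 * value returned by the previous load, so the loads are serialized. */
static size_t chase(const size_t *next, size_t start, size_t steps)
{
    size_t p = start;
    for (size_t s = 0; s < steps; s++)
        p = next[p];
    return p;
}

/* Average time per dependent load, in nanoseconds, over a working set
 * of n_slots * sizeof(size_t) bytes. Uses CPU time from clock(), which
 * is coarse but portable. The final index is written to *sink so the
 * compiler cannot discard the loop. Returns -1.0 if allocation fails. */
static double ns_per_load(size_t n_slots, size_t steps, size_t *sink)
{
    size_t *next = malloc(n_slots * sizeof *next);
    if (next == NULL)
        return -1.0;
    build_cycle(next, n_slots, 1u);

    clock_t t0 = clock();
    *sink = chase(next, 0, steps);
    clock_t t1 = clock();

    free(next);
    return (double)(t1 - t0) / CLOCKS_PER_SEC * 1e9 / (double)steps;
}
```

Calling `ns_per_load` with working-set sizes that bracket each cache capacity would be expected to show a step in the time per load at each level of the hierarchy. The sizes at which the steps appear are machine-specific and are not given in the abstract. A more careful version would pad each slot to a full cache line and pin the thread to one processor.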