MemSpy: analyzing memory system bottlenecks in programs

Authors:
Margaret Martonosi;Anoop Gupta;Thomas Anderson
Affiliations:
Computer Systems Laboratory, Stanford University, CA;Computer Systems Laboratory, Stanford University, CA;Computer Science Division, Univ. of California, Berkeley, CA
Venue:
SIGMETRICS '92/PERFORMANCE '92 Proceedings of the 1992 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Year:
1992

Abstract

To cope with the increasing difference between processor and main memory speeds, modern computer systems use deep memory hierarchies. In the presence of such hierarchies, the performance attained by an application is largely determined by its memory reference behavior: if most references hit in the cache, performance is significantly higher than if most references have to go to main memory. Frequently, the programmer can restructure the data or code to achieve better memory reference behavior. Unfortunately, most existing performance debugging tools do not assist the programmer in this component of the overall performance tuning task.

This paper describes MemSpy, a prototype tool that helps programmers identify and fix memory bottlenecks in both sequential and parallel programs. A key aspect of MemSpy is that it introduces the notion of data-oriented, in addition to code-oriented, performance tuning. Thus, for both source-level code objects and data objects, MemSpy provides information such as cache miss rates, causes of cache misses, and, in multiprocessors, information on cache invalidations and local versus remote memory misses. MemSpy also introduces a concise matrix presentation that lets programmers view code-oriented and data-oriented statistics at the same time. This paper presents design and implementation issues for MemSpy and gives a detailed case study using MemSpy to tune a parallel sparse matrix application. It shows how MemSpy helps pinpoint memory system bottlenecks, such as poor spatial locality and interference among data structures, and suggests paths for improvement.
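The idea of viewing code-oriented and data-oriented statistics together can be illustrated with a small sketch. The code below is not MemSpy's implementation: it assumes a toy direct-mapped cache, a hand-built address trace, and hypothetical names (`simulate`, `print_matrix`, procedures `copy` and `sum`, arrays `A` and `B`). It attributes each reference and miss to a (procedure, data object) pair and prints the result as a matrix with procedures as rows and data objects as columns.

```python
from collections import defaultdict

LINE_SIZE = 16   # bytes per cache line (assumed for this toy model)
NUM_LINES = 4    # lines in the toy direct-mapped cache (assumed)


def simulate(trace, data_ranges):
    """Replay a trace of (procedure, address) pairs through a direct-mapped cache.

    data_ranges maps a data object name to its (start, end) address range,
    with end exclusive. Returns {(procedure, data_object): [references, misses]}.
    """
    cache = [None] * NUM_LINES            # block number held by each cache line
    stats = defaultdict(lambda: [0, 0])
    for proc, addr in trace:
        # Map the address back to the data object that contains it.
        obj = next((name for name, (lo, hi) in data_ranges.items()
                    if lo <= addr < hi), "<other>")
        block = addr // LINE_SIZE
        index = block % NUM_LINES
        cell = stats[(proc, obj)]
        cell[0] += 1                      # count the reference
        if cache[index] != block:         # miss: fetch the block into the line
            cell[1] += 1
            cache[index] = block
    return dict(stats)


def print_matrix(stats):
    """Print misses/references with procedures as rows, data objects as columns."""
    procs = sorted({p for p, _ in stats})
    objs = sorted({o for _, o in stats})
    print("procedure".ljust(12) + "".join(o.rjust(10) for o in objs))
    for p in procs:
        cells = []
        for o in objs:
            refs, misses = stats.get((p, o), (0, 0))
            cells.append((f"{misses}/{refs}" if refs else "-").rjust(10))
        print(p.ljust(12) + "".join(cells))


# Two 64-byte arrays placed exactly one cache size apart, so A[i] and B[i]
# always map to the same cache line and evict each other.
ranges = {"A": (0, 64), "B": (64, 128)}
trace = []
for i in range(16):                       # "copy" alternates between A and B
    trace.append(("copy", 0 + 4 * i))
    trace.append(("copy", 64 + 4 * i))
for i in range(16):                       # "sum" walks A sequentially
    trace.append(("sum", 0 + 4 * i))

stats = simulate(trace, ranges)
print_matrix(stats)
```

In this toy trace, every reference made by `copy` misses because the two arrays interfere in the cache, while `sum` misses only once per cache line of `A`. Reading along a row shows which data objects hurt a given procedure, and reading down a column shows which procedures suffer on a given data object, which is the kind of combined view the abstract describes.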