Effectiveness of trace sampling for performance debugging tools

Authors:
Margaret Martonosi;Anoop Gupta;Thomas Anderson
Affiliations:
Stanford Univ., Stanford, CA;Stanford Univ., Stanford, CA;Univ. of California, Berkeley
Venue:
SIGMETRICS '93 Proceedings of the 1993 ACM SIGMETRICS conference on Measurement and modeling of computer systems
Year:
1993

References (10)

[1] Accurate Low-Cost Methods for Performance Evaluation of Cache Memory Systems. IEEE Transactions on Computers.
[2] A tool to aid in the design, implementation, and understanding of matrix algorithms for parallel processors. Journal of Parallel and Distributed Computing - Special issue: software tools for parallel programming and visualization.
[3] High-performance computer architecture (2nd ed.).
[4] Quartz: a tool for tuning parallel program performance. SIGMETRICS '90 Proceedings of the 1990 ACM SIGMETRICS conference on Measurement and modeling of computer systems.
[5] The cache performance and optimizations of blocked algorithms. ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems.
[6] A model for estimating trace-sample miss ratios. SIGMETRICS '91 Proceedings of the 1991 ACM SIGMETRICS conference on Measurement and modeling of computer systems.
[7] Computer Technology and Architecture: An Evolving Interaction. Computer.
[8] Performance debugging shared memory multiprocessor programs with MTOOL. Proceedings of the 1991 ACM/IEEE conference on Supercomputing.
[9] SPLASH: Stanford parallel applications for shared-memory. ACM SIGARCH Computer Architecture News.
[10] MemSpy: analyzing memory system bottlenecks in programs. SIGMETRICS '92/PERFORMANCE '92 Proceedings of the 1992 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems.

Cited by (20)

Optimal allocation of on-chip memory for multiple-API operating systems. ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture.
Trap-driven simulation with Tapeworm II. ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems.
Active memory: a new abstraction for memory-system simulation. Proceedings of the 1995 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems.
Integrating performance monitoring and communication in parallel computers. Proceedings of the 1996 ACM SIGMETRICS international conference on Measurement and modeling of computer systems.
Trap-driven memory simulation with Tapeworm II. ACM Transactions on Modeling and Computer Simulation (TOMACS).
Active memory: a new abstraction for memory system simulation. ACM Transactions on Modeling and Computer Simulation (TOMACS).
Trace-driven memory simulation: a survey. ACM Computing Surveys (CSUR).
Branch Prediction, Instruction-Window Size, and Cache Size: Performance Trade-Offs and Simulation Techniques. IEEE Transactions on Computers.
Automatic Accurate Live Memory Analysis for Garbage-Collected Languages. OM '01 Proceedings of the 2001 ACM SIGPLAN workshop on Optimization of middleware and distributed systems.
Tuning Memory Performance of Sequential and Parallel Programs. Computer.
Optimized Live Heap Bound Analysis. VMCAI 2003 Proceedings of the 4th International Conference on Verification, Model Checking, and Abstract Interpretation.
Trace-Driven Memory Simulation: A Survey. Performance Evaluation: Origins and Directions.
DiST: a simple, reliable and scalable method to significantly reduce processor architecture simulation time. SIGMETRICS '03 Proceedings of the 2003 ACM SIGMETRICS international conference on Measurement and modeling of computer systems.
A fast and accurate framework to analyze and optimize cache memory behavior. ACM Transactions on Programming Languages and Systems (TOPLAS).
Efficient simulation of trace samples on parallel machines. Parallel Computing.
Cluster miss prediction with prefetch on miss for embedded CPU instruction caches. Proceedings of the 2004 international conference on Compilers, architecture, and synthesis for embedded systems.
Data Centric Cache Measurement on the Intel Itanium 2 Processor. Proceedings of the 2004 ACM/IEEE conference on Supercomputing.
Using Dynamic Tracing Sampling to Measure Long Running Programs. SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing.
Optimal sample length for efficient cache simulation. Journal of Systems Architecture: the EUROMICRO Journal.
Discovery of locality-improving refactorings by reuse path analysis. HPCC'06 Proceedings of the Second international conference on High Performance Computing and Communications.

Abstract

Recently there has been a surge of interest in developing performance debugging tools to help programmers tune their applications for better memory performance [2, 4, 10]. These tools vary both in the detail of feedback provided to the user and in the run-time overhead of using them. MemSpy [10] is a simulation-based tool that gives programmers detailed statistics on the memory system behavior of applications. It provides information on the frequency and causes of cache misses, and presents it in terms of source-level data and code objects with which the programmer is familiar. However, using MemSpy increases a program's execution time by roughly 10- to 40-fold. This overhead is generally acceptable for applications with execution times of several minutes or less, but it can be inconvenient when tuning applications with very long execution times.

This paper examines the use of trace sampling techniques to reduce the execution time overhead of tools like MemSpy. When simulating one tenth of the references, we find that MemSpy's execution time overhead is improved by a factor of 4 to 6. That is, the execution time when using MemSpy is generally 3 to 8 times the normal execution time. With this improved performance, we observe only small errors in the performance statistics reported by MemSpy. On moderate-sized caches of 16KB to 128KB, simulating as few as one tenth of the references (in samples of 0.5M references each) allows us to estimate the program's actual cache miss rate with an absolute error no greater than 0.3% on our five benchmarks. These errors are quite tolerable within the context of performance debugging. With larger caches we can also obtain good accuracy by using longer sample lengths. We conclude that, used with care, trace sampling is a powerful technique that makes possible performance debugging tools which provide both detailed memory statistics and low execution time overheads.
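
The trace-sampling idea described in the abstract can be illustrated with a small Python sketch. It is not MemSpy's implementation: the direct-mapped cache model, the synthetic address trace, and every name (`DirectMappedCache`, `sampled_miss_rate`, `sample_len`, `period`) are assumptions made for illustration. The sketch simulates only the first `sample_len` references of every `period` references and compares the estimated miss rate with the one from simulating the full trace. As a further simplifying assumption, the cache contents left by one sample are carried over unchanged into the next.

```python
import random


class DirectMappedCache:
    """Hypothetical direct-mapped cache model that only tracks hits and misses."""

    def __init__(self, cache_bytes=16 * 1024, line_bytes=32):
        self.line_bytes = line_bytes
        self.num_lines = cache_bytes // line_bytes
        self.blocks = [None] * self.num_lines  # block number held by each line

    def access(self, address):
        """Return True on a miss, False on a hit, and update the cache."""
        block = address // self.line_bytes
        index = block % self.num_lines
        if self.blocks[index] == block:
            return False
        self.blocks[index] = block
        return True


def full_miss_rate(trace, **cache_args):
    """Miss rate obtained by simulating every reference in the trace."""
    cache = DirectMappedCache(**cache_args)
    misses = sum(cache.access(address) for address in trace)
    return misses / len(trace)


def sampled_miss_rate(trace, sample_len, period, **cache_args):
    """Estimate the miss rate from the first sample_len references of each period.

    Assumption: cache state is simply kept from one sample to the next,
    so references in the skipped gaps never touch the cache.
    """
    cache = DirectMappedCache(**cache_args)
    misses = 0
    simulated = 0
    for start in range(0, len(trace), period):
        for address in trace[start:start + sample_len]:
            misses += cache.access(address)
            simulated += 1
    return misses / simulated, simulated


def synthetic_trace(n, seed=0):
    """Made-up trace: 90% of references hit an 8 KB hot region, 10% a 1 MB region."""
    rng = random.Random(seed)
    hot_bytes = 8 * 1024
    cold_bytes = 1 << 20
    trace = []
    for _ in range(n):
        if rng.random() < 0.9:
            trace.append(rng.randrange(hot_bytes))
        else:
            trace.append(hot_bytes + rng.randrange(cold_bytes))
    return trace


if __name__ == "__main__":
    trace = synthetic_trace(200_000)
    exact = full_miss_rate(trace)
    # Simulate one tenth of the references: 2,000 out of every 20,000.
    estimate, simulated = sampled_miss_rate(trace, sample_len=2_000, period=20_000)
    print(f"full trace miss rate:    {exact:.4f}")
    print(f"sampled miss rate:       {estimate:.4f} ({simulated} references simulated)")
    print(f"absolute error:          {abs(estimate - exact):.4f}")
```

The sample length and sampling ratio here are scaled down from the 0.5M-reference samples mentioned in the abstract so the sketch runs quickly. The error it reports on a synthetic trace says nothing about the accuracy figures reported in the paper.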