Accurate Low-Cost Methods for Performance Evaluation of Cache Memory Systems
IEEE Transactions on Computers
Journal of Parallel and Distributed Computing - Special issue: software tools for parallel programming and visualization
High-performance computer architecture (2nd ed.)
High-performance computer architecture (2nd ed.)
Quartz: a tool for tuning parallel program performance
SIGMETRICS '90 Proceedings of the 1990 ACM SIGMETRICS conference on Measurement and modeling of computer systems
The cache performance and optimizations of blocked algorithms
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
A model for estimating trace-sample miss ratios
SIGMETRICS '91 Proceedings of the 1991 ACM SIGMETRICS conference on Measurement and modeling of computer systems
Performance debugging shared memory multiprocessor programs with MTOOL
Proceedings of the 1991 ACM/IEEE conference on Supercomputing
SPLASH: Stanford parallel applications for shared-memory
ACM SIGARCH Computer Architecture News
MemSpy: analyzing memory system bottlenecks in programs
SIGMETRICS '92/PERFORMANCE '92 Proceedings of the 1992 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Optimal allocation of on-chip memory for multiple-API operating systems
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Trap-driven simulation with Tapeworm II
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Active memory: a new abstraction for memory-system simulation
Proceedings of the 1995 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Integrating performance monitoring and communication in parallel computers
Proceedings of the 1996 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Trap-driven memory simulation with Tapeworm II
ACM Transactions on Modeling and Computer Simulation (TOMACS)
Active memory: a new abstraction for memory system simulation
ACM Transactions on Modeling and Computer Simulation (TOMACS)
Trace-driven memory simulation: a survey
ACM Computing Surveys (CSUR)
IEEE Transactions on Computers
Automatic Accurate Live Memory Analysis for Garbage-Collected Languages
OM '01 Proceedings of the 2001 ACM SIGPLAN workshop on Optimization of middleware and distributed systems
Optimized Live Heap Bound Analysis
VMCAI 2003 Proceedings of the 4th International Conference on Verification, Model Checking, and Abstract Interpretation
Trace-Driven Memory Simulation: A Survey
Performance Evaluation: Origins and Directions
SIGMETRICS '03 Proceedings of the 2003 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
A fast and accurate framework to analyze and optimize cache memory behavior
ACM Transactions on Programming Languages and Systems (TOPLAS)
Efficient simulation of trace samples on parallel machines
Parallel Computing
Cluster miss prediction with prefetch on miss for embedded CPU instruction caches
Proceedings of the 2004 international conference on Compilers, architecture, and synthesis for embedded systems
Data Centric Cache Measurement on the Intel ltanium 2 Processor
Proceedings of the 2004 ACM/IEEE conference on Supercomputing
Using Dynamic Tracing Sampling to Measure Long Running Programs
SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
Optimal sample length for efficient cache simulation
Journal of Systems Architecture: the EUROMICRO Journal
Discovery of locality-improving refactorings by reuse path analysis
HPCC'06 Proceedings of the Second international conference on High Performance Computing and Communications
Hi-index | 0.00 |
Recently there has been a surge of interest in developing performance debugging tools to help programmers tune their applications for better memory performance [2, 4, 10]. These tools vary both in the detail of feedback provided to the user, and in the run-time overbead of using them. MemSpy [10] is a simulation-based tool which gives programmers detailed statistics on the memory system behavior of applications. It provides information on the frequency and causes of cache misses, and presents it in terms of source-level data and code objects with which the programmer is familiar. However, using MemSpy increases a program's execution time by roughly 10 to 40 fold. This overhead is generally acceptable for applications with execution times of several minutes or less, but it can be inconvenient when tuning applications with very long execution times.This paper examines the use of trace sampling techniques to reduce the execution time overhead of tools like MemSpy. When simulating one tenth of the references, we find that MemSpy's execution time overhead is improved by a factor of 4 to 6. That is, the execution time when using MemSpy is generally within a factor of 3 to 8 times the normal exwution time. With this improved performance, we observe only small errors in the performance statistics reported by MemSpy. On moderate sized caches of 16KB to 128KB, simulating as few as one tenth of the references (in samples of 0.5M references each) allows us to estimate the program's actual cache miss rate with an absolute error no greater than 0.3% on our five benchmarks. These errors are quite tolerable within the context of performance bugging. With larger caches we can also obtain good accuracy by using longer sample lengths. We conclude that, used with care, trace sampling is a powerful technique that makes possible performance debugging tools which provide both detailed memory statistics and low execution time overheads.