Portable programs for parallel processors
Portable programs for parallel processors
Memory-reference characteristics of multiprocessor applications under MACH
SIGMETRICS '88 Proceedings of the 1988 ACM SIGMETRICS conference on Measurement and modeling of computer systems
ACM Transactions on Mathematical Software (TOMS)
Non-intrusive and interactive profiling in parasight
PPEALS '88 Proceedings of the ACM/SIGPLAN conference on Parallel programming: experience with applications, languages and systems
Journal of Parallel and Distributed Computing - Special issue: software tools for parallel programming and visualization
Quartz: a tool for tuning parallel program performance
SIGMETRICS '90 Proceedings of the 1990 ACM SIGMETRICS conference on Measurement and modeling of computer systems
The cache performance and optimizations of blocked algorithms
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Performance debugging shared memory multiprocessor programs with MTOOL
Proceedings of the 1991 ACM/IEEE conference on Supercomputing
The DASH prototype: implementation and performance
ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
SPLASH: Stanford parallel applications for shared-memory
SPLASH: Stanford parallel applications for shared-memory
Parallel ICCG on a hierarchical memory multiprocessor - addressing the triangular solve bottleneck
Parallel ICCG on a hierarchical memory multiprocessor - addressing the triangular solve bottleneck
A Memory Allocation Profiler for C and Lisp Programs
A Memory Allocation Profiler for C and Lisp Programs
Tools for the development of application-specific virtual memory management
OOPSLA '93 Proceedings of the eighth annual conference on Object-oriented programming systems, languages, and applications
Effectiveness of trace sampling for performance debugging tools
SIGMETRICS '93 Proceedings of the 1993 ACM SIGMETRICS conference on Measurement and modeling of computer systems
Performance debugging using parallel performance predicates
PADD '93 Proceedings of the 1993 ACM/ONR workshop on Parallel and distributed debugging
Normalized performance indices for message passing parallel programs
ICS '94 Proceedings of the 8th international conference on Supercomputing
Architectural support for performance tuning: a case study on the SPARCcenter 2000
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Trap-driven simulation with Tapeworm II
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Real-time volume rendering on shared memory multiprocessors using the shear-warp factorization
PRS '95 Proceedings of the IEEE symposium on Parallel rendering
Proceedings of the 1995 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Predicting application behavior in large scale shared-memory multiprocessors
Supercomputing '95 Proceedings of the 1995 ACM/IEEE conference on Supercomputing
Evaluating the impact of advanced memory systems on compiler-parallelized codes
PACT '95 Proceedings of the IFIP WG10.3 working conference on Parallel architectures and compilation techniques
The influence of caches on the performance of heaps
Journal of Experimental Algorithmics (JEA)
Mapping performance data for high-level and data views of parallel program performance
ICS '96 Proceedings of the 10th international conference on Supercomputing
An online computation of critical path profiling
SPDT '96 Proceedings of the SIGMETRICS symposium on Parallel and distributed tools
Trap-driven memory simulation with Tapeworm II
ACM Transactions on Modeling and Computer Simulation (TOMACS)
Active memory: a new abstraction for memory system simulation
ACM Transactions on Modeling and Computer Simulation (TOMACS)
Using the SimOS machine simulator to study complex computer systems
ACM Transactions on Modeling and Computer Simulation (TOMACS)
Characterizing the Memory Behavior of Compiler-Parallelized Applications
IEEE Transactions on Parallel and Distributed Systems
Trace-driven memory simulation: a survey
ACM Computing Surveys (CSUR)
Predictability of load/store instruction latencies
MICRO 26 Proceedings of the 26th annual international symposium on Microarchitecture
Cache miss equations: an analytical representation of cache misses
ICS '97 Proceedings of the 11th international conference on Supercomputing
Shared-memory performance profiling
PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
Searching for the sorting record: experiences in tuning NOW-Sort
SPDT '98 Proceedings of the SIGMETRICS symposium on Parallel and distributed tools
Precise miss analysis for program transformations with caches of arbitrary associativity
Proceedings of the eighth international conference on Architectural support for programming languages and operating systems
Critical Path Profiling of Message Passing and Shared-Memory Programs
IEEE Transactions on Parallel and Distributed Systems
Cache conscious programming in undergraduate computer science
SIGCSE '99 The proceedings of the thirtieth SIGCSE technical symposium on Computer science education
An Application-Driven Study of Parallel System Overheads and Network Bandwidth Requirements
IEEE Transactions on Parallel and Distributed Systems
Cache miss equations: a compiler framework for analyzing and tuning memory behavior
ACM Transactions on Programming Languages and Systems (TOPLAS)
Automated cache optimizations using CME driven diagnosis
Proceedings of the 14th international conference on Supercomputing
Using hardware performance monitors to isolate memory bottlenecks
Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Tools for application-oriented performance tuning
ICS '01 Proceedings of the 15th international conference on Supercomputing
Exact analysis of the cache behavior of nested loops
Proceedings of the ACM SIGPLAN 2001 conference on Programming language design and implementation
A Tool to Help Tune where Computation Is Performed
IEEE Transactions on Software Engineering
Parallel performance prediction using lost cycles analysis
Proceedings of the 1994 ACM/IEEE conference on Supercomputing
Analysis of a Parallel Volume Rendering System Based on the Shear-Warp Factorization
IEEE Transactions on Visualization and Computer Graphics
Computer
IEEE Transactions on Parallel and Distributed Systems
A Blocked All-Pairs Shortest-Path Algorithm
SWAT '00 Proceedings of the 7th Scandinavian Workshop on Algorithm Theory
SIP: Performance Tuning through Source Code Interdependence
Euro-Par '02 Proceedings of the 8th International Euro-Par Conference on Parallel Processing
Trace-Driven Memory Simulation: A Survey
Performance Evaluation: Origins and Directions
SIGMA: a simulator infrastructure to guide memory analysis
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Performance visualization for distributed shared memory systems
Virtual shared memory for distributed architectures
A fast and accurate framework to analyze and optimize cache memory behavior
ACM Transactions on Programming Languages and Systems (TOPLAS)
Efficient and Accurate Analytical Modeling of Whole-Program Data Cache Behavior
IEEE Transactions on Computers
A blocked all-pairs shortest-paths algorithm
Journal of Experimental Algorithmics (JEA)
Detailed cache coherence characterization for OpenMP benchmarks
Proceedings of the 18th annual international conference on Supercomputing
Processor/Memory Co-Exploration on Multiple Abstraction Levels
DATE '03 Proceedings of the conference on Design, Automation and Test in Europe - Volume 1
Data Centric Cache Measurement on the Intel ltanium 2 Processor
Proceedings of the 2004 ACM/IEEE conference on Supercomputing
Memory Profiling using Hardware Counters
Proceedings of the 2003 ACM/IEEE conference on Supercomputing
EMPS: An Environment for Memory Performance Studies
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 10 - Volume 11
Fast data-locality profiling of native execution
SIGMETRICS '05 Proceedings of the 2005 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
A hybrid hardware/software approach to efficiently determine cache coherence Bottlenecks
Proceedings of the 19th annual international conference on Supercomputing
TAPE: a transactional application profiling environment
Proceedings of the 19th annual international conference on Supercomputing
Decomposing memory performance: data structures and phases
Proceedings of the 5th international symposium on Memory management
Analysis of cache-coherence bottlenecks with hybrid hardware/software techniques
ACM Transactions on Architecture and Code Optimization (TACO)
I/O system performance debugging using model-driven anomaly characterization
FAST'05 Proceedings of the 4th conference on USENIX Conference on File and Storage Technologies - Volume 4
Source-Code-Correlated Cache Coherence Characterization of OpenMP Benchmarks
IEEE Transactions on Parallel and Distributed Systems
Memory behavior of an X11 window system
WTEC'94 Proceedings of the USENIX Winter 1994 Technical Conference on USENIX Winter 1994 Technical Conference
Locating cache performance bottlenecks using data profiling
Proceedings of the 5th European conference on Computer systems
DeFT: Design space exploration for on-the-fly detection of coherence misses
ACM Transactions on Architecture and Code Optimization (TACO)
QUAD: a memory access pattern analyser
ARC'10 Proceedings of the 6th international conference on Reconfigurable Computing: architectures, Tools and Applications
A tool to display array access patterns in OpenMP programs
PARA'04 Proceedings of the 7th international conference on Applied Parallel Computing: state of the Art in Scientific Computing
Cache optimizations for iterative numerical codes aware of hardware prefetching
PARA'04 Proceedings of the 7th international conference on Applied Parallel Computing: state of the Art in Scientific Computing
Collecting and exploiting cache-reuse metrics
ICCS'05 Proceedings of the 5th international conference on Computational Science - Volume Part II
Pinpointing data locality problems using data-centric analysis
CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization
A data-centric profiler for parallel programs
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization
A scalable and near-optimal representation of access schemes for memory management
ACM Transactions on Architecture and Code Optimization (TACO)
Hi-index | 0.01 |
To cope with the increasing difference between processor and main memory speeds, modern computer systems use deep memory hierarchies. In the presence of such hierarchies, the performance attained by an application is largely determined by its memory reference behavior—if most references hit in the cache, the performance is significantly higher than if most references have to go to main memory. Frequently, it is possible for the programmer to restructure the data or code to achieve better memory reference behavior. Unfortunately, most existing performance debugging tools do not assist the programmer in this component of the overall performance tuning task.This paper describes MemSpy, a prototype tool that helps programmers identify and fix memory bottlenecks in both sequential and parallel programs. A key aspect of MemSpy is that it introduces the notion of data oriented, in addition to code oriented, performance tuning. Thus, for both source level code objects and data objects, MemSpy provides information such as cache miss rates, causes of cache misses, and in multiprocessors, information on cache invalidations and local versus remote memory misses. MemSpy also introduces a concise matrix presentation to allow programmers to view both code and data oriented statistics at the same time. This paper presents design and implementation issues for MemSpy, and gives a detailed case study using MemSpy to tune a parallel sparse matrix application. It shows how MemSpy helps pinpoint memory system bottlenecks, such as poor spatial locality and interference among data structures, and suggests paths for improvement.