It is difficult to manually identify opportunities for enhancing data locality. To address this problem, we extended the HPCToolkit performance tools to support data-centric profiling of scalable parallel programs. Our tool uses hardware counters to directly measure memory access latency and attributes latency metrics to both variables and instructions. Different hardware counters provide insight into different aspects of data locality (or lack thereof). Unlike prior tools for data-centric analysis, our tool employs scalable measurement, analysis, and presentation methods that enable it to analyze the memory access behavior of scalable parallel programs with low runtime and space overhead. We demonstrate the utility of HPCToolkit's new data-centric analysis capabilities with case studies of five well-known benchmarks. In each benchmark, we identify performance bottlenecks caused by poor data locality and demonstrate non-trivial performance optimizations enabled by this guidance.
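The idea of attributing latency metrics to both variables and instructions can be illustrated with a minimal sketch. This is not HPCToolkit's implementation: the allocation map, the sample tuples, and the `attribute` function below are all hypothetical, and the sketch assumes each hardware-counter sample carries an instruction address, the data address it touched, and a measured latency in cycles.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical allocation map: (start_address, size_in_bytes, variable_name).
ALLOCATIONS = [
    (0x1000, 0x400, "grid"),
    (0x2000, 0x100, "weights"),
]

# Hypothetical samples: (instruction_address, data_address, latency_cycles).
SAMPLES = [
    (0x400A10, 0x1008, 300),
    (0x400A10, 0x13F8, 250),
    (0x400B24, 0x2010, 40),
    (0x400C00, 0x9000, 12),  # falls outside every tracked allocation
]


def attribute(samples, allocations):
    """Sum sampled latency per variable and per instruction."""
    ranges = sorted(allocations)
    starts = [start for start, _, _ in ranges]
    by_variable = defaultdict(int)
    by_instruction = defaultdict(int)
    for pc, addr, latency in samples:
        # Code-centric view: charge the latency to the sampled instruction.
        by_instruction[pc] += latency
        # Data-centric view: find the last allocation starting at or below addr.
        i = bisect_right(starts, addr) - 1
        name = "<unknown>"
        if i >= 0:
            start, size, var = ranges[i]
            if addr < start + size:
                name = var
        by_variable[name] += latency
    return dict(by_variable), dict(by_instruction)


by_variable, by_instruction = attribute(SAMPLES, ALLOCATIONS)
```

In this toy input, most of the sampled latency lands on the hypothetical `grid` variable, which is the kind of signal a data-centric view is meant to surface. The same total, seen per instruction, would only point at the code location and not at the data structure responsible.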