In modern computer architectures, access latency varies considerably across levels of the memory hierarchy. Consequently, applications whose data access patterns reuse little data in the fast levels of the hierarchy incur additional delays. To improve the performance of complex, data-intensive applications, developers need tools that help them understand the causes of poor memory hierarchy utilization. While most performance tools associate metrics with functions or statements, in this paper we explore data-centric analyses that associate metrics not only with data accesses but also with the data objects themselves. Our contributions are threefold. First, we propose several refinements to existing data-centric techniques that enable accurate, low-overhead measurement. Second, we combine data-centric analysis with call path profiling; this combination relates inefficient access patterns back to data objects across complete dynamic call chains. Third, we develop a graphical user interface that presents our analysis results in multiple views, which helps users identify problematic accesses and data structures. We demonstrate the utility of our approach by showing how our tool identifies problematic data access patterns in several HPC applications and a pair of SPEC CPU2006 benchmarks.
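As an illustration of the general idea of data-centric attribution combined with call paths, the following minimal sketch maps sampled memory accesses to the data objects whose address ranges contain them and accumulates latency per object and per (object, call path) pair. It is not the tool's implementation: the record layouts, the names ALLOCATIONS, SAMPLES, and attribute, and all addresses and latencies are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical allocation records: (start_address, size_in_bytes, object_name).
ALLOCATIONS = [
    (0x1000, 0x400, "grid"),
    (0x2000, 0x100, "index"),
]

# Hypothetical access samples: (effective_address, latency_cycles, call_path).
SAMPLES = [
    (0x1010, 300, ("main", "solve", "sweep")),
    (0x1200, 250, ("main", "solve", "sweep")),
    (0x2008, 12, ("main", "lookup")),
    (0x9000, 40, ("main", "lookup")),  # lies outside every tracked object
]


def attribute(samples, allocations):
    """Sum sampled latency per data object and per (object, call path)."""
    ranges = sorted(allocations)  # sort by start address
    starts = [start for start, _, _ in ranges]
    per_object = defaultdict(int)
    per_object_path = defaultdict(int)
    for address, latency, path in samples:
        # Find the last allocation that starts at or before the address.
        i = bisect_right(starts, address) - 1
        name = "<unknown>"
        if i >= 0:
            start, size, candidate = ranges[i]
            if address < start + size:  # address falls inside the object
                name = candidate
        per_object[name] += latency
        per_object_path[(name, path)] += latency
    return dict(per_object), dict(per_object_path)


per_object, per_object_path = attribute(SAMPLES, ALLOCATIONS)
for name, cycles in sorted(per_object.items(), key=lambda kv: -kv[1]):
    print(name, cycles)
```

Keying the second table on both the object and the full call path is what lets a costly object be traced back to the calling contexts that touch it, rather than only to the function containing the access.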