Locating cache performance bottlenecks using data profiling

Authors:
Aleksey Pesterev;Nickolai Zeldovich;Robert T. Morris
Affiliations:
Massachusetts Institute of Technology, Cambridge, MA, USA;Massachusetts Institute of Technology, Cambridge, MA, USA;Massachusetts Institute of Technology, Cambridge, MA, USA
Venue:
Proceedings of the 5th European conference on Computer systems
Year:
2010

Citing 16
Cited 12

Quartz: a tool for tuning parallel program performance

SIGMETRICS '90 Proceedings of the 1990 ACM SIGMETRICS conference on Measurement and modeling of computer systems
MemSpy: analyzing memory system bottlenecks in programs

SIGMETRICS '92/PERFORMANCE '92 Proceedings of the 1992 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Memory system performance of UNIX on CC-NUMA multiprocessors

Proceedings of the 1995 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Integrating performance monitoring and communication in parallel computers

Proceedings of the 1996 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Continuous profiling: where have all the cycles gone?

ACM Transactions on Computer Systems (TOCS)
ProfileMe: hardware support for instruction-level profiling on out-of-order processors

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Computer architecture: a quantitative approach

Computer architecture: a quantitative approach
Cache Profiling and the SPEC Benchmarks: A Case Study

Computer
DBMSs on a Modern Processor: Where Does Time Go?

VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Gprof: A call graph execution profiler

SIGPLAN '82 Proceedings of the 1982 SIGPLAN symposium on Compiler construction
Detailed cache simulation for detecting bottleneck, miss reason and optimization potentialities

valuetools '06 Proceedings of the 1st international conference on Performance evaluation methodolgies and tools
Valgrind: a framework for heavyweight dynamic binary instrumentation

Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
RouteBricks: exploiting parallelism to scale software routers

Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
MCC-DB: minimizing cache conflicts in multi-core processors for databases

Proceedings of the VLDB Endowment
Reinventing scheduling for multicore systems

HotOS'09 Proceedings of the 12th conference on Hot topics in operating systems
Corey: an operating system for many cores

OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation

An analysis of Linux scalability to many cores

OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
Directly characterizing cross core interference through contention synthesis

Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers
S2E: a platform for in-vivo multi-path analysis of software systems

Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
SHERIFF: precise detection and automatic mitigation of false sharing

Proceedings of the 2011 ACM international conference on Object oriented programming systems languages and applications
The S2E Platform: Design, Implementation, and Applications

ACM Transactions on Computer Systems (TOCS) - Special Issue APLOS 2011
Improving network connection locality on multicore systems

Proceedings of the 7th ACM european conference on Computer Systems
MemProf: a memory profiler for NUMA multicore systems

USENIX ATC'12 Proceedings of the 2012 USENIX conference on Annual Technical Conference
Survey of scheduling techniques for addressing shared resources in multicore processors

ACM Computing Surveys (CSUR)
Production-run software failure diagnosis via hardware performance counters

Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
Traffic management: a holistic approach to memory placement on NUMA systems

Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
Whose cache line is it anyway?: operating system support for live detection and repair of false sharing

Proceedings of the 8th ACM European Conference on Computer Systems
PREDATOR: predictive false sharing detection

Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming

Quantified Score

Hi-index	0.00

Visualization

Abstract

Effective use of CPU data caches is critical to good performance, but poor cache use patterns are often hard to spot using existing execution profiling tools. Typical profilers attribute costs to specific code locations. The costs due to frequent cache misses on a given piece of data, however, may be spread over instructions throughout the application. The resulting individually small costs at a large number of instructions can easily appear insignificant in a code profiler's output. DProf helps programmers understand cache miss costs by attributing misses to data types instead of code. Associating cache misses with data helps programmers locate data structures that experience misses in many places in the application's code. DProf introduces a number of new views of cache miss data, including a data profile, which reports the data types with the most cache misses, and a data flow graph, which summarizes how objects of a given type are accessed throughout their lifetime, and which accesses incur expensive cross-CPU cache loads. We present two case studies of using DProf to find and fix cache performance bottlenecks in Linux. The improvements provide a 16-57% throughput improvement on a range of memcached and Apache workloads.