Effective use of CPU data caches is critical to good performance, but poor cache use patterns are often hard to spot with existing execution profiling tools. Typical profilers attribute costs to specific code locations. The cost of frequent cache misses on a given piece of data, however, may be spread over instructions throughout the application, and the resulting individually small costs at a large number of instructions can easily appear insignificant in a code profiler's output. DProf helps programmers understand cache miss costs by attributing misses to data types instead of code. Associating cache misses with data helps programmers locate data structures that experience misses in many places in the application's code. DProf introduces several new views of cache miss data, including a data profile, which reports the data types with the most cache misses, and a data flow graph, which summarizes how objects of a given type are accessed throughout their lifetime and which accesses incur expensive cross-CPU cache loads. We present two case studies of using DProf to find and fix cache performance bottlenecks in Linux. The resulting fixes improve throughput by 16-57% on a range of memcached and Apache workloads.
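The difference between a code-centric and a data-centric view of cache misses can be illustrated with a small sketch. This is not DProf's implementation: the sample records, function names, and type names below are hypothetical, and the code only shows how grouping the same miss samples by data type rather than by code location can surface a bottleneck that a code profile hides.

```python
from collections import Counter

# Hypothetical cache-miss samples: (code location, data type, miss count).
# All names and numbers are invented for illustration; they are not
# measurements from DProf or from Linux.
SAMPLES = [
    ("rx_path",     "packet_buf", 40),
    ("tx_path",     "packet_buf", 35),
    ("route_pkt",   "packet_buf", 30),
    ("queue_pkt",   "packet_buf", 30),
    ("checksum",    "packet_buf", 25),
    ("hash_lookup", "hash_entry", 60),
    ("slab_alloc",  "slab_meta",  50),
]


def profile(samples, key_index):
    """Sum miss counts grouped by one field of each sample, largest first."""
    totals = Counter()
    for sample in samples:
        totals[sample[key_index]] += sample[2]
    return totals.most_common()


# Code-centric view: misses attributed to the function that incurred them.
code_profile = profile(SAMPLES, 0)

# Data-centric view: the same misses attributed to the data type accessed.
data_profile = profile(SAMPLES, 1)

total_misses = sum(sample[2] for sample in SAMPLES)

if __name__ == "__main__":
    print("By code location:")
    for name, misses in code_profile:
        print(f"  {name:12s} {misses:4d}  ({misses / total_misses:.0%})")
    print("By data type:")
    for name, misses in data_profile:
        print(f"  {name:12s} {misses:4d}  ({misses / total_misses:.0%})")
```

In this made-up data, no single function touching `packet_buf` accounts for more misses than `hash_lookup`, so a code profile ranks `hash_lookup` first. Grouped by type, `packet_buf` accounts for more than half of all misses, which is the kind of pattern a data profile is meant to expose.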