Identifying the sources of cache misses in Java programs without relying on hardware counters

Authors:
Hiroshi Inoue;Toshio Nakatani
Affiliations:
IBM Research - Tokyo, Tokyo, Japan;IBM Research - Tokyo, Tokyo, Japan
Venue:
Proceedings of the 2012 international symposium on Memory Management
Year:
2012

Abstract

Cache miss stalls are one of the major sources of performance bottlenecks for multicore processors. A Hardware Performance Monitor (HPM) in the processor is useful for locating the cache misses, but is rarely used in the real world for various reasons. It would be better to find a simple approach to locate the sources of cache misses and apply runtime optimizations without relying on an HPM. This paper shows that pointer dereferencing in hot loops is a major source of cache misses in Java programs. Based on this observation, we devised a new approach to identify the instructions and objects that cause frequent cache misses. Our heuristic technique effectively identifies the majority of the cache misses in typical Java programs by matching the hot loops to simple idiomatic code patterns. On average, our technique selected only 2.8% of the load and store instructions generated by the JIT compiler and these instructions accounted for 47% of the L1D cache misses and 49% of the L2 cache misses caused by the JIT-compiled code. To prove the effectiveness of our technique in compiler optimizations, we prototyped object placement optimizations, which align objects in cache lines or collocate paired objects in the same cache line to reduce cache misses. For comparison, we also implemented the same optimizations based on the accurate information obtained from the HPM. Our results showed that our heuristic approach was as effective as the HPM-based approach and achieved comparable performance improvements in the SPECjbb2005 and SPECpower_ssj2008 benchmark programs.
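As a rough illustration of the ideas in the abstract, the sketch below shows the kind of hot loop in which a pointer dereference is a likely cache-miss site, together with the arithmetic behind aligning an object to a cache line. It is not the authors' implementation: the abstract does not list the idiomatic code patterns the heuristic matches, so the class `PointerChaseSketch`, the `Order` type, and the helper names are hypothetical, and the cache line size is passed as a parameter because the abstract does not state one.

```java
public class PointerChaseSketch {

    // Hypothetical heap object; each instance may live far from its neighbors in the array.
    public static final class Order {
        final long amount;

        Order(long amount) {
            this.amount = amount;
        }
    }

    // Hot loop: the load of orders[i] walks the array sequentially, but reading
    // orders[i].amount dereferences a pointer to a separately allocated object.
    // That dereference is the kind of load the abstract describes as a major miss source.
    public static long sumAmounts(Order[] orders) {
        long total = 0;
        for (int i = 0; i < orders.length; i++) {
            total += orders[i].amount;
        }
        return total;
    }

    // True if an object of `size` bytes starting at `address` spans two or more cache lines.
    public static boolean straddlesLine(long address, int size, int lineBytes) {
        return (address / lineBytes) != ((address + size - 1) / lineBytes);
    }

    // Smallest multiple of `lineBytes` that is >= `address`: where an aligned object would start.
    public static long alignUp(long address, int lineBytes) {
        return (address + lineBytes - 1) / lineBytes * lineBytes;
    }

    public static void main(String[] args) {
        Order[] orders = { new Order(1), new Order(2), new Order(3) };
        System.out.println("sum = " + sumAmounts(orders));

        int lineBytes = 128; // assumed line size, for illustration only
        long address = 120;  // made-up address
        int size = 16;       // made-up object size
        System.out.println("straddles = " + straddlesLine(address, size, lineBytes));
        System.out.println("aligned start = " + alignUp(address, lineBytes));
    }
}
```

In a real JVM the placement decision is made by the allocator or garbage collector rather than by application code; the helpers here only show why a 16-byte object at a made-up address of 120 would touch two 128-byte lines, while the same object moved to address 128 touches one.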