Taming hardware event samples for FDO compilation

Authors:
Dehao Chen;Neil Vachharajani;Robert Hundt;Shih-wei Liao;Vinodha Ramasamy;Paul Yuan;Wenguang Chen;Weimin Zheng
Affiliations:
Tsinghua University, Beijing, China;Google, Mountain View, CA, USA;Google, Mountain View, CA, USA;Google, Mountain View, CA, USA;-;Peking University, Beijing, China;Tsinghua University, Beijing, China;Tsinghua University, Beijing, China
Venue:
Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization
Year:
2010



Abstract

Feedback-directed optimization (FDO) is effective in improving application runtime performance, but it has not been widely adopted because of the tedious dual-compilation model, the difficulty of generating representative training data sets, and the high runtime overhead of profile collection. The use of hardware-event sampling to generate estimated edge profiles overcomes these drawbacks. Yet hardware event samples are typically not precise at the instruction or basic-block granularity. These inaccuracies lead to missed performance when compared to instrumentation-based FDO. In this paper, we use multiple hardware event profiles and supervised learning techniques to generate heuristics for improved precision of basic-block-level sample profiles, and to further improve the smoothing algorithms used to construct edge profiles. We demonstrate that sampling-based FDO can achieve an average of 78% of the performance gains obtained using instrumentation-based exact edge profiles for SPEC2000 benchmarks, matching or beating instrumentation-based FDO in many cases. The overhead of collection is only 0.74% on average, while compiler-based instrumentation incurs 6.8%-53.5% overhead (and 10x overhead on an industrial web search application), and dynamic instrumentation incurs 28.6%-1639.2% overhead.