PiPA: Pipelined profiling and analysis on multicore systems

Authors:
Qin Zhao; Ioana Cutcutache; Weng-Fai Wong
Affiliations:
Massachusetts Institute of Technology; Duke-NUS Graduate Medical School; National University of Singapore
Venue:
ACM Transactions on Architecture and Code Optimization (TACO)
Year:
2010

Abstract

Profiling and online analysis are important tasks in program understanding and feedback-directed optimization. However, fine-grained profiling and online analysis tend to slow down the application severely. To cope with the slowdown, one may have to terminate the process early or resort to sampling. The former tends to distort the result because of warm-up effects. The latter risks missing important effects because sampling may be turned off while those effects occur. A promising approach is to use the parallel processing capabilities of the now ubiquitous multicore processors to speed up the profiling and analysis process. In this article, we present Pipelined Profiling and Analysis (PiPA), a novel technique for parallelizing dynamic program profiling and analysis by taking advantage of multicore systems. In essence, the application under examination is profiled using a dynamic instrumentation tool. Optimized instrumentation code writes the profile information to buffers in a succinct format, which we call the REP format. This lightweight trace compression minimizes the processing overhead imposed on the application whenever a buffer is full. Another thread recovers the required information from the REP buffer. The recovered full profile is then divided up and passed to multiple threads for further analysis. To achieve the best performance, the entire system has to be well balanced. We have implemented prototypes of PiPA using two dynamic instrumentation systems, namely DynamoRIO and Pin, thereby demonstrating its portability. Our experiments show that PiPA speeds up the overall profiling and analysis tasks significantly. Compared to the more than 100× slowdown of Cachegrind and the 32× slowdown of Pin dcache, we achieved a mere 10.2× slowdown on an 8-core system. In this article, we also describe the insights we gained in obtaining the balance needed for PiPA to perform optimally.
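
The three-stage structure described in the abstract (compact profiling into buffers, recovery of the full profile, and parallel analysis) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the article: the compact record (a basic-block id plus a base address) is only a hypothetical stand-in for the REP format, `BLOCK_OFFSETS`, `BUFFER_SIZE`, and `NUM_ANALYZERS` are invented names and values, and the analysis (counting accesses per 64-byte line) was chosen because it is order-independent and therefore easy to split across threads.

```python
import queue
import threading
from collections import Counter

# Hypothetical static table (not the real REP format): for each basic-block
# id, the memory-access offsets relative to a base address.
BLOCK_OFFSETS = {0: [0, 8, 16], 1: [4], 2: [0, 64]}

BUFFER_SIZE = 4      # compact records per buffer (assumed, tiny for the demo)
NUM_ANALYZERS = 3    # number of parallel analysis threads (assumed)
DONE = None          # sentinel marking the end of a stream


def profile(trace, compact_q):
    """Stage 1: stand-in for the instrumented application."""
    buf = []
    for block_id, base in trace:
        buf.append((block_id, base))   # compact record, not full addresses
        if len(buf) == BUFFER_SIZE:
            compact_q.put(buf)         # hand off the full buffer
            buf = []
    if buf:
        compact_q.put(buf)             # flush the last partial buffer
    compact_q.put(DONE)


def recover(compact_q, analysis_qs):
    """Stage 2: expand compact records into full addresses and distribute."""
    n = 0
    while (buf := compact_q.get()) is not DONE:
        addrs = [base + off
                 for block_id, base in buf
                 for off in BLOCK_OFFSETS[block_id]]
        analysis_qs[n % len(analysis_qs)].put(addrs)   # round-robin split
        n += 1
    for q in analysis_qs:
        q.put(DONE)


def analyze(in_q, partials):
    """Stage 3: one analysis thread; counts accesses per 64-byte line."""
    counts = Counter()
    while (addrs := in_q.get()) is not DONE:
        counts.update(a // 64 for a in addrs)
    partials.append(counts)


def run_pipeline(trace):
    """Run all stages concurrently and merge the per-thread results."""
    compact_q = queue.Queue(maxsize=8)
    analysis_qs = [queue.Queue(maxsize=8) for _ in range(NUM_ANALYZERS)]
    partials = []
    threads = [
        threading.Thread(target=profile, args=(trace, compact_q)),
        threading.Thread(target=recover, args=(compact_q, analysis_qs)),
    ]
    threads += [threading.Thread(target=analyze, args=(q, partials))
                for q in analysis_qs]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partials, Counter())
```

Calling `run_pipeline([(i % 3, 128 * i) for i in range(50)])` returns a `Counter` mapping line numbers to access counts. The bounded queues mean a slow stage stalls the stages feeding it, which is one way to see why the abstract stresses that the whole system must be well balanced. The sketch shows structure only: CPython threads do not run Python bytecode in parallel, so it would not reproduce the reported speedups, and an order-dependent analysis such as cache simulation would need a more careful partitioning than round-robin.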