Fast loop-level data dependence profiling

Authors:
Hongtao Yu;Zhiyuan Li
Affiliations:
Purdue University, West Lafayette, IN, USA;Purdue University, West Lafayette, IN, USA
Venue:
Proceedings of the 26th ACM international conference on Supercomputing
Year:
2012




Abstract

Execution-driven data dependence profiling has gained significant interest as a tool to compensate for the weaknesses of static data dependence analysis. Although such profiling is valid only for the specific inputs used, its results can be applied in many ways to program parallelization. Unfortunately, traditional hash-based dependence profiling can consume a great deal of memory and machine time, which severely limits its practical use. In this paper, we propose new compiler-based techniques for fast loop-level data dependence profiling. First, using type consistency and alias information, our compiler embeds memory tags into the data structures of the original program so that memory addresses can be compared efficiently for dependence testing. This approach avoids the bytewise hashing overhead of conventional profiling methods. Second, we prove that a partial dependence graph obtained from profiling is sufficient for loop-level reordering transformations and parallelization. Such a partial dependence graph can be obtained very quickly, without exhaustively enumerating all dependence edges. Third, our compiler partitions the profiling task into independent slices. These slices can be profiled in parallel, producing subgraphs that the compiler then combines automatically into the complete data dependence graph. Experiments show that these techniques significantly reduce memory use and shorten profiling time (by an order of magnitude for several SPEC2006 benchmarks). Benchmarks that were too big to profile at all loop levels with previous methods can now be profiled fully within several hours.
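
To make the baseline concrete, the following is a minimal Python sketch of the conventional hash-based approach that the abstract contrasts against, not the paper's tag-embedding technique. It assumes a hypothetical trace format of (iteration, statement, kind, address) tuples and keeps a hash table keyed by address. The second function illustrates the general idea behind slicing: if the accesses are partitioned by a function of the address (here a hypothetical `slice_of` callback standing in for compiler alias information), each slice can be profiled independently and the resulting subgraphs unioned. All names are illustrative assumptions.

```python
from collections import defaultdict


def profile_loop(trace):
    """Conventional hash-based profiling of loop-carried dependences.

    trace: iterable of (iteration, stmt, kind, addr), kind is "R" or "W".
    Returns a set of (source_stmt, sink_stmt, dep_type) edges.
    """
    last_write = {}                        # addr -> (iteration, stmt)
    reads_since_write = defaultdict(list)  # addr -> [(iteration, stmt)]
    edges = set()
    for it, stmt, kind, addr in trace:
        writer = last_write.get(addr)
        if kind == "R":
            # Flow dependence: value written in an earlier iteration.
            if writer is not None and writer[0] < it:
                edges.add((writer[1], stmt, "flow"))
            reads_since_write[addr].append((it, stmt))
        else:
            # Output dependence: overwrite of an earlier iteration's write.
            if writer is not None and writer[0] < it:
                edges.add((writer[1], stmt, "output"))
            # Anti dependence: overwrite of a location read earlier.
            for r_it, r_stmt in reads_since_write[addr]:
                if r_it < it:
                    edges.add((r_stmt, stmt, "anti"))
            reads_since_write[addr] = []
            last_write[addr] = (it, stmt)
    return edges


def profile_in_slices(trace, slice_of):
    """Profile each address slice separately, then union the subgraphs.

    slice_of(addr) is a hypothetical partitioning function. Because a
    dependence always links two accesses to the same address, slices
    defined by address never share an edge.
    """
    slices = defaultdict(list)
    for event in trace:
        slices[slice_of(event[3])].append(event)
    merged = set()
    for events in slices.values():  # each slice could run as its own job
        merged |= profile_loop(events)
    return merged


def example_trace(n=4):
    """Trace of: for i in 1..n-1: a[i] = a[i-1] + b[i]  (statement S1)."""
    trace = []
    for i in range(1, n):
        trace.append((i, "S1", "R", ("a", i - 1)))
        trace.append((i, "S1", "R", ("b", i)))
        trace.append((i, "S1", "W", ("a", i)))
    return trace


if __name__ == "__main__":
    whole = profile_loop(example_trace())
    sliced = profile_in_slices(example_trace(), lambda addr: addr[0])
    print(whole)
    print(whole == sliced)
```

The per-address hash table is the source of the memory and time cost the abstract mentions: every access performs a lookup, and the table grows with the number of distinct addresses touched. In the sketch the slices are processed one after another, but they share no state, so each could be handed to a separate profiling run.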