Beyond reuse distance analysis: Dynamic analysis for characterization of data locality potential

Authors:
Naznin Fauzia;Venmugil Elango;Mahesh Ravishankar;J. Ramanujam;Fabrice Rastello;Atanas Rountev;Louis-Noël Pouchet;P. Sadayappan
Affiliations:
The Ohio State University, Columbus OH, USA;The Ohio State University, Columbus OH, USA;The Ohio State University, Columbus OH, USA;Louisiana State University, Baton Rouge LA, USA;INRIA COMPSYS/ENS Lyon, Lyon cedex, France;The Ohio State University, Columbus OH, USA;University of California Los Angeles, Los Angeles CA, USA;The Ohio State University, Columbus OH, USA
Venue:
ACM Transactions on Architecture and Code Optimization (TACO)
Year:
2013

Abstract

Emerging computer architectures will feature a drastically lower ratio of memory bandwidth to peak processing rate (bytes/flop), as highlighted by recent studies on Exascale architectural trends. Further, flops are getting cheaper, while the energy cost of data movement is increasingly dominant. Understanding and characterizing the data locality properties of computations is therefore critical to guiding efforts to enhance data locality. Reuse distance analysis of memory address traces is a valuable tool for characterizing the data locality of programs. A single reuse distance analysis can be used to estimate the number of cache misses in a fully associative LRU cache of any size, thereby providing estimates of the minimum bandwidth required at each level of the memory hierarchy to avoid being bandwidth bound. However, such an analysis holds only for the particular execution order that produced the trace. It cannot estimate the potential improvement in data locality from dependence-preserving transformations that change the execution schedule of the operations in the computation. In this article, we develop a novel dynamic analysis approach to characterize the inherent locality properties of a computation and thereby assess the potential for data locality enhancement via dependence-preserving transformations. The execution trace of a code is analyzed to extract a computational directed acyclic graph (CDAG) of the data dependences. The CDAG is then partitioned into convex subsets, and the convex partitioning is used to reorder the operations in the execution trace to enhance data locality. The approach enables us to go beyond reuse distance analysis of a single, specific execution order when characterizing the data locality properties of a computation. It can play a valuable role in identifying promising code regions for manual transformation, as well as in assessing the effectiveness of compiler transformations for data locality enhancement. We demonstrate the effectiveness of the approach using a number of benchmarks, including case studies where the potential shown by the analysis is exploited to achieve lower data movement costs and better performance.
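
The limitation described in the abstract can be illustrated with a small sketch. This is not the authors' tool: it uses a simple quadratic-time LRU stack to compute reuse distances, a toy two-phase computation, and a hand-picked reordering in place of the paper's convex partitioning of the CDAG. All function names (`reuse_distances`, `lru_misses`, `make_ops`, `trace_of`, `respects_dependences`) are hypothetical and chosen for illustration only.

```python
from collections import Counter


def reuse_distances(trace):
    """Histogram of reuse distances for a sequence of addresses.

    The reuse distance of an access is the number of distinct addresses
    touched since the previous access to the same address. First-time
    accesses are recorded under the key None (infinite distance).
    """
    stack = []  # LRU stack, most recently used address last
    hist = Counter()
    for addr in trace:
        if addr in stack:
            pos = stack.index(addr)
            hist[len(stack) - 1 - pos] += 1
            stack.pop(pos)
        else:
            hist[None] += 1
        stack.append(addr)
    return dict(hist)


def lru_misses(hist, cache_size):
    """Misses in a fully associative LRU cache holding cache_size addresses."""
    return sum(n for d, n in hist.items() if d is None or d >= cache_size)


def make_ops(n):
    """Toy computation: b[i] = f(a[i]) for all i, then c[i] = g(b[i]) for all i."""
    producers = [(f"p{i}", [f"a{i}"], f"b{i}") for i in range(n)]
    consumers = [(f"q{i}", [f"b{i}"], f"c{i}") for i in range(n)]
    return producers + consumers


def trace_of(ops):
    """Address trace of a schedule: each op reads its inputs, then writes its output."""
    return [addr for _, inputs, output in ops for addr in (*inputs, output)]


def respects_dependences(ops):
    """True if every value computed by some op is produced before it is read.

    Assumes each address is written at most once, so only flow
    dependences (the CDAG edges) need checking.
    """
    produced = {output for _, _, output in ops}
    written = set()
    for _, inputs, output in ops:
        if any(x in produced and x not in written for x in inputs):
            return False
        written.add(output)
    return True


original = make_ops(4)
# Hand-picked reordering (an assumption, not the paper's heuristic):
# run each consumer right after its producer.
reordered = sorted(original, key=lambda op: int(op[0][1:]))

for name, ops in (("original", original), ("reordered", reordered)):
    hist = reuse_distances(trace_of(ops))
    print(name, respects_dependences(ops), hist, lru_misses(hist, cache_size=4))
```

In this toy case both schedules respect the same dependences, yet each reuse of `b[i]` has distance 6 in the original order and distance 0 in the reordered one. For a 4-address cache, the single histogram of the original trace predicts 16 misses, while the reordered trace incurs only the 12 compulsory ones. A reuse distance analysis of the original trace alone cannot reveal that gap.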