Automatically enhancing locality for tree traversals with traversal splicing

Authors:
Youngjoon Jo; Milind Kulkarni
Affiliations:
Purdue University, West Lafayette, IN, USA (both authors)
Venue:
Proceedings of the ACM international conference on Object oriented programming systems languages and applications
Year:
2012




Abstract

Generally applicable techniques for improving temporal locality in irregular programs, which operate over pointer-based data structures such as trees and graphs, are scarce. Focusing on a subset of irregular programs, namely tree traversal algorithms such as Barnes-Hut and nearest neighbor, previous work has proposed point blocking, a technique analogous to loop tiling in regular programs, to improve locality. However, point blocking is highly dependent on point sorting, a technique that reorders points so that consecutive points have similar traversals. Performing this a priori sort requires an understanding of the semantics of the algorithm and hence highly application-specific techniques. In this work, we propose traversal splicing, a new, general, automatic locality optimization for irregular tree traversal codes that is less sensitive to point order and hence can deliver substantially better performance, even in the absence of semantic information. For six benchmark algorithms, we show that traversal splicing can deliver single-thread speedups of up to 9.147× (geometric mean: 3.095×) over baseline implementations, and up to 4.752× (geometric mean: 2.079×) over point-blocked implementations. Further, we show that in many cases, automatically applying traversal splicing to a baseline implementation yields performance that is better than that of carefully hand-optimized implementations.
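To make the point-blocking idea concrete, here is a minimal Python sketch. It is not the authors' implementation and does not show traversal splicing itself. It assumes a toy problem (counting stored 1-D values within a radius of each query point), and every name in it (`Node`, `build`, `count_near`, `count_near_blocked`, `run_blocked`) is hypothetical. The baseline walks the whole tree once per point. The blocked version moves a block of points through the tree together, so each node is touched once per block rather than once per point, much as loop tiling reuses a tile of data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    lo: float                       # smallest value stored in this subtree
    hi: float                       # largest value stored in this subtree
    value: Optional[float] = None   # set only on leaves
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(sorted_values: List[float]) -> Node:
    """Build a balanced binary tree over a non-empty, sorted list (toy helper)."""
    if len(sorted_values) == 1:
        v = sorted_values[0]
        return Node(v, v, value=v)
    mid = len(sorted_values) // 2
    left, right = build(sorted_values[:mid]), build(sorted_values[mid:])
    return Node(left.lo, right.hi, left=left, right=right)


def pruned(node: Node, q: float, r: float) -> bool:
    """True if no value under node can lie within distance r of q."""
    return q + r < node.lo or q - r > node.hi


def count_near(node: Node, q: float, r: float) -> int:
    """Baseline: one full recursive traversal per query point."""
    if pruned(node, q, r):
        return 0
    if node.value is not None:      # unpruned leaf, so its value is in range
        return 1
    return count_near(node.left, q, r) + count_near(node.right, q, r)


def count_near_blocked(node: Node, block: List[int], queries: List[float],
                       r: float, counts: List[int]) -> None:
    """Point-blocked: a block of query indices visits each node together."""
    # Keep only the points whose traversal is not truncated at this node.
    live = [i for i in block if not pruned(node, queries[i], r)]
    if not live:
        return
    if node.value is not None:
        for i in live:
            counts[i] += 1
        return
    count_near_blocked(node.left, live, queries, r, counts)
    count_near_blocked(node.right, live, queries, r, counts)


def run_blocked(root: Node, queries: List[float], r: float,
                block_size: int) -> List[int]:
    """Split the queries into fixed-size blocks and traverse block by block."""
    counts = [0] * len(queries)
    for start in range(0, len(queries), block_size):
        block = list(range(start, min(start + block_size, len(queries))))
        count_near_blocked(root, block, queries, r, counts)
    return counts


root = build(sorted([1.0, 2.0, 3.0, 5.0, 8.0, 13.0]))
queries = [2.0, 6.0, 20.0]
baseline = [count_near(root, q, 1.5) for q in queries]
blocked = run_blocked(root, queries, 1.5, block_size=2)
```

In this sketch, a block whose points take dissimilar paths shrinks quickly as it descends, so few points share each node visit. That is the sensitivity to point order the abstract describes, and the reason point blocking leans on a prior point sort.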