Vectorization of tree traversals
Journal of Computational Physics
Journal of Computational Physics
Parallel hierarchical N-body methods
Parallel hierarchical N-body methods
Journal of Parallel and Distributed Computing
Is it a tree, a DAG, or a cyclic graph? A shape analysis for heap-directed pointers in C
POPL '96 Proceedings of the 23rd ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Rendering complex scenes with memory-coherent ray tracing
Proceedings of the 24th annual conference on Computer graphics and interactive techniques
Commutativity analysis: a new analysis technique for parallelizing compilers
ACM Transactions on Programming Languages and Systems (TOPLAS)
The implementation of the Cilk-5 multithreaded language
PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
Using generational garbage collection to implement cache-conscious data placement
Proceedings of the 1st international symposium on Memory management
Cache-conscious structure layout
Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
Cache-conscious structure definition
Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
Optimizing compilers for modern architectures: a dependence-based approach
Optimizing compilers for modern architectures: a dependence-based approach
Parametric shape analysis via 3-valued logic
ACM Transactions on Programming Languages and Systems (TOPLAS)
Distribution-Independent Hierarchical Algorithmsfor the N-body Problem
The Journal of Supercomputing
International Journal of Parallel Programming
Parallelizing Programs with Recursive Data Structures
IEEE Transactions on Parallel and Distributed Systems
A Data Parallel Formulation of the Barnes-Hut Method for N -Body Simulations
PARA '00 Proceedings of the 5th International Workshop on Applied Parallel Computing, New Paradigms for HPC in Industry and Academia
Improving Cache Behavior of Dynamically Allocated Data Structures
PACT '98 Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques
Automatic pool allocation: improving performance by controlling data structure layout in the heap
Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
Lightcuts: a scalable approach to illumination
ACM SIGGRAPH 2005 Papers
Exploiting Locality for Irregular Scientific Codes
IEEE Transactions on Parallel and Distributed Systems
Optimistic parallelism requires abstractions
Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
The jastadd extensible java compiler
Proceedings of the 22nd annual ACM SIGPLAN conference on Object-oriented programming systems and applications
On fast Construction of SAH-based Bounding Volume Hierarchies
RT '07 Proceedings of the 2007 IEEE Symposium on Interactive Ray Tracing
RT '07 Proceedings of the 2007 IEEE Symposium on Interactive Ray Tracing
A scalable auto-tuning framework for compiler optimization
IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel&Distributed Processing
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Automatically enhancing locality for tree traversals with traversal splicing
Proceedings of the ACM international conference on Object oriented programming systems languages and applications
Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
General transformations for GPU execution of tree traversals
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Automatic vectorization of tree traversals
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Hi-index | 0.00 |
While there has been decades of work on developing automatic, locality-enhancing transformations for regular programs that operate over dense matrices and arrays, there has been little investigation of such transformations for irregular programs, which operate over pointer-based data structures such as graphs, trees and lists. In this paper, we argue that, for a class of irregular applications we call traversal codes, there exists substantial data reuse and hence opportunity for locality exploitation. We develop a novel optimization called point blocking, inspired by the classic tiling loop transformation, and show that it can substantially enhance temporal locality in traversal codes. We then present a transformation and optimization framework called TreeTiler that automatically detects opportunities for applying point blocking and applies the transformation. TreeTiler uses autotuning techniques to determine appropriate parameters for the transformation. For a series of traversal algorithms drawn from real-world applications, we show that TreeTiler is able to deliver performance improvements of up to 245% over an optimized (but non-transformed) parallel baseline, and in several cases, significantly better scalability.