Atomic instructions are a key ingredient of codes that operate on irregular data structures such as trees and graphs. Atomics are known to be expensive, especially on massively parallel GPUs, and they are often on the critical path of a program. In this paper, we present two high-level methods to eliminate atomics in irregular programs. The first method uses synchronous processing with barriers; we illustrate how it avoids atomics even when the threads' memory accesses conflict with each other. The second method exploits algebraic properties of algorithms to elide atomics. Specifically, we focus on three key properties, monotonicity, idempotency, and associativity, and show how each of them enables an atomic-free implementation. We illustrate the generality of the two methods by applying them to five irregular graph applications: breadth-first search, single-source shortest paths computation, Delaunay mesh refinement, pointer analysis, and survey propagation. We show that both methods provide substantial speedup in each case on different GPUs.
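To make the idea concrete, the following is a minimal sketch of how barriers and idempotency might combine in breadth-first search. It is a sequential Python emulation, not the paper's GPU code: the function name `bfs_level_synchronous`, the adjacency-list layout, and the `UNVISITED` marker are all assumptions made for illustration. Each pass of the loop over `u` stands in for one thread per node, and the end of each pass stands in for a barrier. Within a pass, every thread that reaches an unvisited neighbor stores the same value, `current + 1`, so conflicting writes are idempotent and a plain store would suffice where an atomic might otherwise be used.

```python
UNVISITED = -1  # hypothetical marker for nodes not yet reached


def bfs_level_synchronous(adj, source):
    """Level-synchronous BFS over adjacency lists; returns each node's level."""
    level = [UNVISITED] * len(adj)
    level[source] = 0
    current = 0
    changed = True
    while changed:
        changed = False
        # Each iteration over u emulates one thread working on node u.
        for u in range(len(adj)):
            if level[u] != current:
                continue  # only the current frontier is active in this pass
            for v in adj[u]:
                if level[v] == UNVISITED:
                    # Plain (non-atomic) store: every conflicting writer in
                    # this pass stores the same value, so the order of the
                    # writes does not matter.
                    level[v] = current + 1
                    # The flag is idempotent too: all writers store True.
                    changed = True
        # End of the pass emulates a barrier: no thread starts on the next
        # level until all stores for this level are complete.
        current += 1
    return level


if __name__ == "__main__":
    graph = [[1, 2], [3], [3], [4], [], []]
    print(bfs_level_synchronous(graph, 0))
```

In the example graph, nodes 1 and 2 both try to label node 3 in the same pass. Either write alone yields the correct level, which is why no atomic update is required in this sketch.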