Morph algorithms on GPUs

Authors:
Rupesh Nasre;Martin Burtscher;Keshav Pingali
Affiliations:
The University of Texas, Austin, USA;Texas State University, San Marcos, USA;The University of Texas, Austin, USA
Venue:
Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
Year:
2013

Citing 26
Cited 3

Triangle: Engineering a 2D Quality Mesh Generator and Delaunay Triangulator

FCRC '96/WACG '96 Selected papers from the Workshop on Applied Computational Geormetry, Towards Geometric Engineering
Survey propagation: An algorithm for satisfiability

Random Structures & Algorithms
A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L

SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
Threshold values of random K-SAT from the cavity method

Random Structures & Algorithms
Designing Multithreaded Algorithms for Breadth-First Search and st-connectivity on the Cray MTA-2

ICPP '06 Proceedings of the 2006 International Conference on Parallel Processing
Optimistic parallelism requires abstractions

Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
Three-dimensional delaunay refinement for multi-core processors

Proceedings of the 22nd annual international conference on Supercomputing
How much parallelism is there in irregular applications?

Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Fast minimum spanning tree for large graphs on the GPU

Proceedings of the Conference on High Performance Graphics 2009
Structure-driven optimizations for amorphous data-parallel programs

Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Accelerating large graph algorithms on the GPU using CUDA

HiPC'07 Proceedings of the 14th international conference on High performance computing
An effective GPU implementation of breadth-first search

Proceedings of the 47th Design Automation Conference
Effective out-of-core parallel delaunay mesh refinement using off-the-shelf software

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
EigenCFA: accelerating flow analysis with GPUs

Proceedings of the 38th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Accelerating CUDA graph algorithms at maximum warp

Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
On-the-fly elimination of dynamic irregularities for GPU computing

Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
Fully Generalized Two-Dimensional Constrained Delaunay Mesh Refinement

SIAM Journal on Scientific Computing
The tao of parallelism in algorithms

Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation
Computing Strongly Connected Components in Parallel on CUDA

IPDPS '11 Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium
Efficient Parallel Graph Exploration on Multi-Core CPU and GPU

PACT '11 Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques
A GPU implementation of inclusion-based points-to analysis

Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Scalable GPU graph traversal

Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Scalable parallel minimum spanning forest computation

Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Computing 2D constrained Delaunay triangulation using the GPU

I3D '12 Proceedings of the ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games
Parallel replication-based points-to analysis

CC'12 Proceedings of the 21st international conference on Compiler Construction
A yoke of oxen and a thousand chickens for heavy lifting graph processing

Proceedings of the 21st international conference on Parallel architectures and compilation techniques

Atomic-free irregular computations on GPUs

Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units
Energy efficient GPU transactional memory via space-time optimizations

Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Time- and space-efficient flow-sensitive points-to analysis

ACM Transactions on Architecture and Code Optimization (TACO)

Quantified Score

Hi-index	0.00

Visualization

Abstract

There is growing interest in using GPUs to accelerate graph algorithms such as breadth-first search, computing page-ranks, and finding shortest paths. However, these algorithms do not modify the graph structure, so their implementation is relatively easy compared to general graph algorithms like mesh generation and refinement, which morph the underlying graph in non-trivial ways by adding and removing nodes and edges. We know relatively little about how to implement morph algorithms efficiently on GPUs. In this paper, we present and study four morph algorithms: (i) a computational geometry algorithm called Delaunay Mesh Refinement (DMR), (ii) an approximate SAT solver called Survey Propagation (SP), (iii) a compiler analysis called Points-To Analysis (PTA), and (iv) Boruvka's Minimum Spanning Tree algorithm (MST). Each of these algorithms modifies the graph data structure in different ways and thus poses interesting challenges. We overcome these challenges using algorithmic and GPU-specific optimizations. We propose efficient techniques to perform concurrent subgraph addition, subgraph deletion, conflict detection and several optimizations to improve the scalability of morph algorithms. For an input mesh with 10 million triangles, our DMR code achieves an 80x speedup over the highly optimized serial Triangle program and a 2.3x speedup over a multicore implementation running with 48 threads. Our SP code is 3x faster than a multicore implementation with 48 threads on an input with 1 million literals. The PTA implementation is able to analyze six SPEC 2000 benchmark programs in just 74 milliseconds, achieving a geometric mean speedup of 9.3x over a 48-thread multicore version. Our MST code is slower than a multicore version with 48 threads for sparse graphs but significantly faster for denser graphs. This work provides several insights into how other morph algorithms can be efficiently implemented on GPUs.