Graphs are powerful data representations favored in many computational domains. Modern GPUs have recently shown promising results in accelerating computationally challenging graph problems, but their performance suffers heavily when the graph structure is highly irregular, as most real-world graphs tend to be. In this study, we first observe that the poor performance is caused by work imbalance and is an artifact of a discrepancy between the GPU programming model and the underlying GPU architecture. We then propose a novel virtual warp-centric programming method that exposes the traits of underlying GPU architectures to users. Our method significantly improves the performance of applications with heavily imbalanced workloads, and enables trade-offs between workload imbalance and ALU underutilization for fine-tuning the performance. Our evaluation reveals that our method exhibits up to 9x speedup over previous GPU algorithms and 12x over single-threaded CPU execution on irregular graphs. When properly configured, it also yields up to 30% improvement over previous GPU algorithms on regular graphs. In addition to performance gains on graph algorithms, our programming method achieves 1.3x to 15.1x speedup on a set of GPU benchmark applications. Our study also confirms that the performance gap between GPUs and other multi-threaded CPU graph implementations is primarily due to the large difference in memory bandwidth.
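
The trade-off between workload imbalance and ALU underutilization can be illustrated with a toy cost model. The sketch below is not the paper's implementation: it assumes a 32-lane physical warp that runs in lockstep, assumes each virtual warp of `virtual_warp_size` lanes cooperatively scans the neighbor list of one vertex, and counts only lockstep steps (memory effects are ignored). The names `simd_steps` and `alu_utilization` and the two degree lists are hypothetical, chosen for illustration.

```python
from math import ceil

WARP_SIZE = 32  # assumed physical warp width (lanes that execute in lockstep)


def simd_steps(degrees, virtual_warp_size):
    """Count lockstep steps one physical warp needs to visit every edge.

    Toy model: the physical warp is split into WARP_SIZE // virtual_warp_size
    virtual warps. Each virtual warp takes one vertex and scans its neighbor
    list virtual_warp_size entries per step. All virtual warps in the batch
    wait for the slowest one before the next batch of vertices starts.
    """
    if WARP_SIZE % virtual_warp_size != 0:
        raise ValueError("virtual warp size must divide the physical warp size")
    vertices_per_batch = WARP_SIZE // virtual_warp_size
    steps = 0
    for start in range(0, len(degrees), vertices_per_batch):
        batch = degrees[start:start + vertices_per_batch]
        steps += max(ceil(d / virtual_warp_size) for d in batch)
    return steps


def alu_utilization(degrees, virtual_warp_size):
    """Fraction of lane-steps that do useful edge work in the toy model."""
    steps = simd_steps(degrees, virtual_warp_size)
    return sum(degrees) / (steps * WARP_SIZE) if steps else 0.0


# Hypothetical inputs: one hub vertex among low-degree vertices vs. uniform degrees.
irregular = [1000] + [2] * 31
regular = [4] * 32

for k in (1, 4, 8, 16, 32):
    print(k, simd_steps(irregular, k), simd_steps(regular, k),
          round(alu_utilization(regular, k), 3))
```

Under these assumptions, a virtual warp size of 1 (one thread per vertex) needs 1000 steps on the irregular input because 31 lanes idle while one lane walks the hub's neighbor list, whereas a size of 32 needs 63. On the regular input the ranking reverses: size 1 needs 4 steps at full utilization, while size 32 needs 32 steps with most lanes idle. Intermediate sizes sit between the two extremes, which is the tuning knob the abstract describes.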