Understanding the performance of concurrent data structures on graphics processors

Authors:
Daniel Cederman; Bapi Chatterjee; Philippas Tsigas
Affiliations:
Chalmers University of Technology, Sweden; Chalmers University of Technology, Sweden; Chalmers University of Technology, Sweden
Venue:
Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
Year:
2012

Abstract

In this paper we revisit the design of concurrent data structures, specifically queues, and examine their performance portability in the move from conventional CPUs to graphics processors. We have looked at both lock-based and lock-free algorithms and have, for comparison, implemented and optimized the same algorithms on both graphics processors and multi-core CPUs. Particular attention has been paid to the differences between the older Tesla and the newer Fermi and Kepler architectures in this context. We provide a comprehensive evaluation and analysis of our implementations on all examined platforms. Our results indicate that the queues are in general performance portable, but that platform-specific optimizations can further increase performance. The Fermi and Kepler GPUs, with optimized atomic operations, are observed to provide excellent scalability for both lock-based and lock-free queues.