In this paper we revisit the design of concurrent data structures, specifically queues, and examine their performance portability in the move from conventional CPUs to graphics processors. We consider both lock-based and lock-free algorithms and, for comparison, have implemented and optimized the same algorithms on both graphics processors and multi-core CPUs. We pay particular attention to the differences between the older Tesla and the newer Fermi and Kepler architectures in this context. We provide a comprehensive evaluation and analysis of our implementations on all examined platforms. Our results indicate that the queues are in general performance portable, but that platform-specific optimizations can further increase performance. The Fermi and Kepler GPUs, with optimized atomic operations, are observed to provide excellent scalability for both lock-based and lock-free queues.