Transactional memory: architectural support for lock-free data structures
ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Fast discovery of association rules
Advances in knowledge discovery and data mining
Transactional Memory Coherence and Consistency
Proceedings of the 31st annual international symposium on Computer architecture
Bulk Disambiguation of Speculative Threads in Multiprocessors
Proceedings of the 33rd annual international symposium on Computer Architecture
JudoSTM: A Dynamic Binary-Rewriting Approach to Software Transactional Memory
PACT '07 Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques
LogTM-SE: Decoupling Hardware Transactional Memory from Caches
HPCA '07 Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture
Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Scalable Parallel Programming with CUDA
Queue - GPU Computing
Flexible Decoupled Transactional Memory Support
ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
NOrec: streamlining STM by abolishing ownership records
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Transactional Memory, 2nd Edition
Transactional Memory, 2nd Edition
RMS-TM: a comprehensive benchmark suite for transactional memory systems
Proceedings of the 2nd ACM/SPEC International Conference on Performance engineering
SoC-TM: integrated HW/SW support for transactional memory programming on embedded MPSoCs
CODES+ISSS '11 Proceedings of the seventh IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Hardware transactional memory for GPU architectures
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
BulkSMT: Designing SMT processors for atomic-block execution
HPCA '12 Proceedings of the 2012 IEEE 18th International Symposium on High-Performance Computer Architecture
Shared memory multiplexing: a novel way to improve GPGPU throughput
Proceedings of the 21st international conference on Parallel architectures and compilation techniques
A yoke of oxen and a thousand chickens for heavy lifting graph processing
Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Towards a software transactional memory for graphics processors
EG PGV'10 Proceedings of the 10th Eurographics conference on Parallel Graphics and Visualization
Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
Cache-Conscious Wavefront Scheduling
MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
GPUWattch: enabling energy optimizations in GPGPUs
Proceedings of the 40th Annual International Symposium on Computer Architecture
Cache coherence for GPU architectures
HPCA '13 Proceedings of the 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA)
Hi-index | 0.00 |
Many applications with regular parallelism have been shown to benefit from using Graphics Processing Units (GPUs). However, employing GPUs for applications with irregular parallelism tends to be a risky process, involving significant effort from the programmer. One major, non-trivial effort/risk is to expose the available parallelism in the application as 1000s of concurrent threads without introducing data races or deadlocks via fine-grained data synchronization. To reduce this effort, prior work has proposed supporting transactional memory on GPU architectures. One hardware proposal, Kilo TM, can scale to 1000s of concurrent transaction. However, performance and energy overhead of Kilo TM may deter GPU vendors from incorporating it into future designs. In this paper, we analyze the performance and energy efficiency of Kilo TM and propose two enhancements: (1) Warp-level transaction management allows transactions within a warp to be managed as a group. This aggregates protocol messages to reduce communication overhead and captures spatial locality from multiple transactions to increase memory subsystem utility. (2) Temporal conflict detection uses globally synchronized timers to detect conflicts in read-only transactions with low overhead. Our evaluation shows that combining the two enhancements in combination can improve the overall performance and energy efficiency of Kilo TM by 65% and 34% respectively. Kilo TM with the above two enhancements achieves 66% of the performance of fine-grained locking with 34% energy overhead.