Hardware transactional memory for GPU architectures

Authors:
Wilson W. L. Fung;Inderpreet Singh;Andrew Brownsword;Tor M. Aamodt
Affiliations:
University of British Columbia;University of British Columbia;University of British Columbia;University of British Columbia
Venue:
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Year:
2011

Abstract

Graphics processing units (GPUs) are designed to efficiently exploit thread-level parallelism (TLP), multiplexing the execution of thousands of concurrent threads on a relatively small set of single-instruction, multiple-thread (SIMT) cores to hide various long-latency operations. While threads within a CUDA block/OpenCL workgroup can communicate efficiently through an intra-core scratchpad memory, threads in different blocks can only communicate via global memory accesses. Programmers wishing to exploit such communication must account for the data races that may occur when multiple threads modify the same memory location. Recent GPUs provide a form of inter-block communication through atomic operations on single 32-bit/64-bit words. Although fine-grained locks can be constructed from these atomic operations, synchronization using locks is prone to deadlock. In this paper, we propose to solve these problems by extending GPUs to support transactional memory (TM). Major challenges include supporting thousands of concurrent transactions and committing non-conflicting transactions in parallel. We propose KILO TM, a novel hardware TM design for GPUs that scales to thousands of concurrent transactions. Because it cannot rely on cache coherence hardware, it uses word-level, value-based conflict detection to avoid broadcast communication and reduce on-chip storage overhead. It employs speculative validation using a novel Bloom filter organization to increase transaction commit parallelism. For a set of TM-enhanced GPU applications, KILO TM captures 59% of the performance of fine-grained locking and is on average 128x faster than executing all transactions serially, for an estimated hardware area overhead of 0.5% of a commercial GPU.
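To make the idea of word-level, value-based conflict detection more concrete, here is a minimal software sketch in Python. It is not the KILO TM hardware design, which the abstract does not describe in detail. The `Transaction` class, its method names, and the use of a plain list as "global memory" are all assumptions made for illustration, and the commit step is assumed to run atomically (here simply because the example is single-threaded). The sketch shows only the general principle: a transaction logs the value of each word it reads, buffers its writes, and at commit time re-reads those words and aborts if any value has changed.

```python
class Transaction:
    """Hypothetical software model of value-based conflict detection."""

    def __init__(self, memory):
        self.memory = memory   # shared "global memory": a list of words
        self.read_log = {}     # address -> value observed on first read
        self.write_log = {}    # address -> buffered (not yet visible) value

    def read(self, addr):
        # Reads of words this transaction has written see the buffered value.
        if addr in self.write_log:
            return self.write_log[addr]
        # Otherwise record the value seen so it can be re-checked at commit.
        if addr not in self.read_log:
            self.read_log[addr] = self.memory[addr]
        return self.read_log[addr]

    def write(self, addr, value):
        # Writes stay private until the transaction commits.
        self.write_log[addr] = value

    def try_commit(self):
        # Validation: every word read must still hold the logged value.
        # (Assumed to execute atomically with the write-back below.)
        for addr, seen in self.read_log.items():
            if self.memory[addr] != seen:
                return False   # conflict detected: caller should retry
        # No conflict: make the buffered writes visible.
        for addr, value in self.write_log.items():
            self.memory[addr] = value
        return True


def demo():
    memory = [100, 50]
    t1, t2 = Transaction(memory), Transaction(memory)
    # Both transactions read word 0 before either one commits.
    t1.write(0, t1.read(0) - 10)
    t1.write(1, t1.read(1) + 10)
    t2.write(0, t2.read(0) - 20)
    first = t1.try_commit()    # succeeds: nothing it read has changed
    second = t2.try_commit()   # fails: word 0 no longer holds 100
    return first, second, memory
```

Because validation compares values rather than tracking which thread owns each word, no ownership metadata or broadcast is needed in this sketch. The trade-off, in this simplified form, is that a word changed and then restored to its original value by other transactions would go unnoticed.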