PixelFlow: high-speed rendering using image composition
SIGGRAPH '92 Proceedings of the 19th annual conference on Computer graphics and interactive techniques
Transactional memory: architectural support for lock-free data structures
ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Fast discovery of association rules
Advances in knowledge discovery and data mining
Merging and transformation of raster images for cartoon animation
SIGGRAPH '81 Proceedings of the 8th annual conference on Computer graphics and interactive techniques
Chap - a SIMD graphics processor
SIGGRAPH '84 Proceedings of the 11th annual conference on Computer graphics and interactive techniques
Bulk Disambiguation of Speculative Threads in Multiprocessors
Proceedings of the 33rd annual international symposium on Computer Architecture
Computer
Architectural Support for Software Transactional Memory
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Making the fast case common and the uncommon case simple in unbounded transactional memory
Proceedings of the 34th annual international symposium on Computer architecture
An effective hybrid transactional memory system with strong isolation guarantees
Proceedings of the 34th annual international symposium on Computer architecture
Performance pathologies in hardware transactional memory
Proceedings of the 34th annual international symposium on Computer architecture
JudoSTM: A Dynamic Binary-Rewriting Approach to Software Transactional Memory
PACT '07 Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques
LogTM-SE: Decoupling Hardware Transactional Memory from Caches
HPCA '07 Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture
A Scalable, Non-blocking Approach to Transactional Memory
HPCA '07 Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture
Implementing Signatures for Transactional Memory
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
On the correctness of transactional memory
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Larrabee: a many-core x86 architecture for visual computing
ACM SIGGRAPH 2008 papers
Scalable Parallel Programming with CUDA
Queue - GPU Computing
Adaptive transaction scheduling for transactional memory systems
Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures
RingSTM: scalable transactions with a single atomic instruction
Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures
TokenTM: Efficient Execution of Large Transactions with Hardware Transactional Memory
ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Flexible Decoupled Transactional Memory Support
ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Early experience with a commercial hardware transactional memory implementation
Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
Dynamic warp formation: Efficient MIMD control flow on SIMD graphics hardware
ACM Transactions on Architecture and Code Optimization (TACO)
EazyHTM: eager-lazy hardware transactional memory
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
NOrec: streamlining STM by abolishing ownership records
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
An efficient software transactional memory using commit-time invalidation
Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization
WAYPOINT: scaling coherence to thousand-core architectures
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
SPACE: sharing pattern-based directory coherence for multicore scalability
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Discovering and understanding performance bottlenecks in transactional applications
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Transactional Memory, 2nd Edition
Transactional Memory, 2nd Edition
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
ASF: AMD64 Extension for Lock-Free Data Structures and Transactional Memory
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Hardware acceleration of transactional memory on commodity systems
Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
RMS-TM: a comprehensive benchmark suite for transactional memory systems
Proceedings of the 2nd ACM/SPEC International Conference on Performance engineering
Transactional conflict decoupling and value prediction
Proceedings of the international conference on Supercomputing
Cuckoo directory: A scalable directory for many-core systems
HPCA '11 Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture
Bloom Filter Guided Transaction Scheduling
HPCA '11 Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture
Time-Out bloom filter: a new sampling method for recording more flows
ICOIN'06 Proceedings of the 2006 international conference on Information Networking: advances in Data Communications and Wireless Networks
Towards a software transactional memory for graphics processors
EG PGV'10 Proceedings of the 10th Eurographics conference on Parallel Graphics and Visualization
Paragon: collaborative speculative loop execution on GPU and CPU
Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units
GPUDet: a deterministic GPU architecture
Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance
Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
Exploring memory consistency for massively-threaded throughput-oriented processors
Proceedings of the 40th Annual International Symposium on Computer Architecture
Orchestrated scheduling and prefetching for GPGPUs
Proceedings of the 40th Annual International Symposium on Computer Architecture
Neither more nor less: optimizing thread-level parallelism for GPGPUs
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Energy efficient GPU transactional memory via space-time optimizations
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Software Transactional Memory for GPU Architectures
Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization
Efficient execution of speculative threads and transactions with hardware transactional memory
Future Generation Computer Systems
Leveraging GPUs using cooperative loop speculation
ACM Transactions on Architecture and Code Optimization (TACO)
Hi-index | 0.00 |
Graphics processor units (GPUs) are designed to efficiently exploit thread level parallelism (TLP), multiplexing execution of 1000s of concurrent threads on a relatively smaller set of single-instruction, multiple-thread (SIMT) cores to hide various long latency operations. While threads within a CUDA block/OpenCL workgroup can communicate efficiently through an intra-core scratchpad memory, threads in different blocks can only communicate via global memory accesses. Programmers wishing to exploit such communication have to consider data-races that may occur when multiple threads modify the same memory location. Recent GPUs provide a form of inter-block communication through atomic operations for single 32-bit/64-bit words. Although fine-grained locks can be constructed from these atomic operations, synchronization using locks is prone to deadlock. In this paper, we propose to solve these problems by extending GPUs to support transactional memory (TM). Major challenges include supporting 1000s of concurrent transactions and committing non-conflicting transactions in parallel. We propose KILO TM, a novel hardware TM design for GPUs that scales to 1000s of concurrent transactions. Without cache coherency hardware to depend on, it uses word-level, value-based conflict detection to avoid broadcast communication and reduce on-chip storage overhead. It employs speculative validation using a novel bloom filter organization to increase transaction commit parallelism. For a set of TM-enhanced GPU applications, KILO TM captures 59% of the performance of fine-grained locking, and is on average 128x faster than executing all transactions serially, for an estimated hardware area overhead of 0.5% of a commercial GPU.