GPUDet: a deterministic GPU architecture

  • Authors:
  • Hadi Jooybar; Wilson W.L. Fung; Mike O'Connor; Joseph Devietti; Tor M. Aamodt

  • Affiliations:
  • University of British Columbia, Vancouver, BC, Canada; University of British Columbia, Vancouver, BC, Canada; AMD, Austin, TX, USA; University of Washington, Seattle, WA, USA; University of British Columbia, Vancouver, BC, Canada

  • Venue:
  • Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems
  • Year:
  • 2013

Abstract

Nondeterminism is a key challenge in developing multithreaded applications. Even with the same input, each execution of a multithreaded program may produce a different output. This behavior complicates debugging and limits one's ability to test for correctness. This lack of reproducibility is aggravated on massively parallel architectures such as graphics processing units (GPUs), which run thousands of concurrent threads. We believe providing a deterministic environment to ease debugging and testing of GPU applications is essential to enable a broader class of software to use GPUs. Many hardware and software techniques have been proposed for providing determinism on general-purpose multi-core processors. However, these techniques are designed for small numbers of threads; scaling them to thousands of threads on a GPU is a major challenge. This paper proposes a scalable hardware mechanism, GPUDet, to provide determinism in GPU architectures. In this paper we characterize the existing deterministic and nondeterministic aspects of current GPU execution models, and we use these observations to inform GPUDet's design. For example, GPUDet leverages the inherent determinism of the SIMD hardware in GPUs to provide determinism within a wavefront at no cost. GPUDet also exploits the Z-Buffer Unit, an existing GPU hardware unit for graphics rendering, to allow parallel out-of-order memory writes to produce a deterministic output. Other optimizations in GPUDet include deterministic parallel execution of atomic operations and a workgroup-aware algorithm that eliminates unnecessary global synchronizations. Our simulation results indicate that GPUDet incurs only a 2X slowdown on average over a baseline nondeterministic architecture, with runtime overheads as low as 4% for compute-bound applications, despite running GPU kernels with thousands of threads. We also characterize the sources of overhead for deterministic execution on GPUs to provide insights for further optimizations.
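
As a minimal illustration of the nondeterminism the abstract describes (this sketch is not from the paper), the CUDA program below has thousands of threads accumulate a fixed input array into a single float with atomicAdd. The hardware may interleave the atomic updates in a different order on each run, and because floating-point addition is not associative, the printed sum can vary across runs even though the input never changes. A deterministic architecture such as GPUDet would make the commit order, and hence the output, repeatable.

```cuda
// Hypothetical example of run-to-run nondeterminism on a GPU; not code from GPUDet.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void race_sum(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(out, in[i]);   // commit order of atomics is not fixed by the hardware
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f / (i + 1);   // identical input every run

    *out = 0.0f;
    race_sum<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("sum = %.10f\n", *out);   // may differ slightly from one run to the next

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```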