GRace: a low-overhead mechanism for detecting data races in GPU programs

Authors:
Mai Zheng; Vignesh T. Ravi; Feng Qin; Gagan Agrawal
Affiliations:
The Ohio State University, Columbus, OH, USA;The Ohio State University, Columbus, OH, USA;The Ohio State University, Columbus, OH, USA;The Ohio State University, Columbus, OH, USA
Venue:
Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
Year:
2011

Abstract

In recent years, GPUs have emerged as an extremely cost-effective means of achieving high performance. Many application developers, including those with no prior parallel programming experience, are now trying to scale their applications using GPUs. While languages like CUDA and OpenCL have eased GPU programming for non-graphical applications, they are still explicitly parallel languages. All parallel programmers, particularly novices, need tools that can help ensure the correctness of their programs. As in any multithreaded environment, data races on GPUs can severely affect program reliability. Thus, tool support for detecting race conditions can significantly benefit GPU application developers. Existing approaches for detecting data races on CPUs or GPUs have one or more of the following limitations: 1) being ill-suited for handling non-lock synchronization primitives on GPUs; 2) lacking scalability due to the state explosion problem; 3) reporting many false positives because of simplified modeling; and/or 4) incurring prohibitive runtime and space overhead. In this paper, we propose GRace, a new mechanism for detecting races in GPU programs that combines static analysis with a carefully designed dynamic checker for logging and analyzing information at runtime. Our design utilizes the GPU's memory hierarchy to log runtime data accesses efficiently. To improve performance, GRace leverages static analysis to reduce the number of statements that need to be instrumented. Additionally, by exploiting knowledge of thread scheduling and the execution model of the underlying GPUs, GRace can accurately detect data races with no false positives reported. Based on the above idea, we have built a prototype of GRace with two schemes, i.e., GRace-stmt and GRace-addr, for NVIDIA GPUs. Both schemes are integrated with the same static analysis.
We have evaluated GRace-stmt and GRace-addr with three data race bugs in three GPU kernel functions and have also compared them with the existing approach, referred to as B-tool. Our experimental results show that both schemes of GRace are effective in detecting all evaluated cases with no false positives, whereas B-tool reports many false positives for one evaluated case. On the one hand, GRace-addr incurs low runtime overhead, i.e., 22-116%, and low space overhead, i.e., 9-18 MB, for the evaluated kernels. On the other hand, GRace-stmt offers more help in diagnosing data races, at the cost of larger overhead.
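
The abstract describes a dynamic checker that logs data accesses at runtime and analyzes them for races. The sketch below illustrates only that general idea, simulated on the host in Python. It is not GRace's actual algorithm: it ignores warp scheduling, the GPU memory hierarchy, and the static analysis, and every name in it (`Access`, `find_races`, the statement labels) is hypothetical. It assumes the log covers a single interval between two barriers, and it treats two accesses as conflicting when they come from different threads, touch the same address, and at least one is a write.

```python
from collections import defaultdict
from typing import NamedTuple


class Access(NamedTuple):
    """One logged memory access (all fields are illustrative assumptions)."""
    thread_id: int   # logical index of the thread that made the access
    address: int     # memory address that was touched
    is_write: bool   # True for a store, False for a load
    stmt: str        # label of the source statement, for diagnosis only


def find_races(log):
    """Return conflicting access pairs within one barrier-delimited interval.

    Two accesses conflict when they come from different threads, touch the
    same address, and at least one of them is a write.
    """
    # Group the logged accesses by the address they touch.
    by_address = defaultdict(list)
    for access in log:
        by_address[access.address].append(access)

    races = []
    for address, accesses in sorted(by_address.items()):
        # Compare every unordered pair of accesses to this address.
        for i, first in enumerate(accesses):
            for second in accesses[i + 1:]:
                different_threads = first.thread_id != second.thread_id
                some_write = first.is_write or second.is_write
                if different_threads and some_write:
                    races.append((address, first, second))
    return races


# Thread 0 writes address 0x10 while thread 1 reads it: a conflict.
# Both threads only read address 0x20: no conflict.
example_log = [
    Access(0, 0x10, True, "s1"),
    Access(1, 0x10, False, "s2"),
    Access(0, 0x20, False, "s3"),
    Access(1, 0x20, False, "s3"),
]

for address, first, second in find_races(example_log):
    print(
        f"race at {address:#x}: {first.stmt} (thread {first.thread_id}) "
        f"vs {second.stmt} (thread {second.thread_id})"
    )
```

The two report styles in the sketch loosely mirror the trade-off the abstract mentions: keeping only addresses in the log is cheaper, while also keeping statement labels gives more diagnostic detail at a higher logging cost. How GRace-addr and GRace-stmt actually realize this is described in the paper itself, not here.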