Race recording for multithreaded deterministic replay using multiprocessor hardware

Authors:
Mark D. Hill; Rastislav Bodik; Min Xu
Affiliations:
The University of Wisconsin - Madison; The University of Wisconsin - Madison; The University of Wisconsin - Madison
Venue:
Race recording for multithreaded deterministic replay using multiprocessor hardware
Year:
2006

Abstract

Multithreaded deterministic replay has important applications in cyclic debugging, fault tolerance, intrusion analysis, and more. Memory race recording is a key technology for multithreaded deterministic replay. This dissertation proposes a new race recording algorithm and describes a novel implementation of a race recorder based on multiprocessor cache coherence mechanisms. Together, the algorithm and the implementation make the new race recorder significantly more efficient and less expensive than existing memory race recorders. Notably, the recorder simultaneously achieves several desired features: (1) long recording, by reducing the recorder log size to around one byte per thousand instructions; (2) always-on recording, by reducing the runtime overhead to less than 2%; (3) inexpensive recording, by reducing the timestamp memory size (which is different from the log size) to approximately 24 kilobytes per processor; (4) broad applicability, by supporting programs with data races and by supporting multiprocessor systems with both the Sequential Consistency and the Total Store Order (TSO) memory consistency models.

Our improvements stem from several ideas: (1) a method of creating artificial dependencies that allows reduction and compression in the log, yet still allows parallel replay; (2) a method of approximating timestamps that allows significant reduction in the chip area cost; (3) a method of hardware coherence piggybacking that enables race recording with extremely low runtime overhead, yet still supports race recording for programs with data races; (4) a method of order-value-hybrid recording that supports race recording on multiprocessor systems with the TSO memory consistency model.

We evaluate the recorder with full-system simulation of a chip multiprocessor (CMP) system running commercial workloads. Our results indicate that the recorder can be always-on and that the log size is around one byte per thousand instructions (55 to 180 KB per second for each 2 GHz processor).
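
The two log-size figures in the abstract (about one byte per thousand instructions, and 55 to 180 KB per second per 2 GHz processor) can be related with simple arithmetic. The Python sketch below is illustrative only and is not taken from the dissertation. The function names are hypothetical, 1 KB is assumed to mean 1000 bytes, and "around one byte" is treated as exactly one byte. The implied instructions-per-cycle values and the 1 GB log budget are assumptions introduced for the example, not reported results.

```python
# Back-of-the-envelope check of the abstract's log-size figures.
# Assumptions (not from the source): 1 KB = 1000 bytes, exactly
# 1 byte of log per 1000 instructions, and a hypothetical 1 GB log budget.

def log_bytes_per_second(clock_hz, ipc, bytes_per_kilo_instr=1.0):
    """Log growth rate for one processor retiring `ipc` instructions per cycle."""
    instr_per_second = clock_hz * ipc
    return instr_per_second / 1000.0 * bytes_per_kilo_instr

def implied_ipc(log_bytes_per_sec, clock_hz, bytes_per_kilo_instr=1.0):
    """Instructions per cycle implied by a given log growth rate."""
    instr_per_second = log_bytes_per_sec * 1000.0 / bytes_per_kilo_instr
    return instr_per_second / clock_hz

def recording_hours(log_budget_bytes, log_bytes_per_sec):
    """How long one processor can be recorded before the budget is used up."""
    return log_budget_bytes / log_bytes_per_sec / 3600.0

if __name__ == "__main__":
    clock_hz = 2e9  # 2 GHz, as stated in the abstract
    for rate in (55_000, 180_000):  # 55 KB/s and 180 KB/s
        ipc = implied_ipc(rate, clock_hz)
        hours = recording_hours(1e9, rate)  # hypothetical 1 GB budget
        print(f"{rate / 1000:.0f} KB/s: implied IPC {ipc:.4f}, "
              f"about {hours:.2f} h per GB of log")
```

Under these assumptions, the quoted rates correspond to roughly 55 to 180 million instructions per second per processor, and a gigabyte of log would cover on the order of hours of execution per processor, which is consistent with the "long recording" claim.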