Architecting a chunk-based memory race recorder in modern CMPs

Authors:
Gilles Pokam;Cristiano Pereira;Klaus Danne;Rolf Kassa;Ali-Reza Adl-Tabatabai
Affiliations:
Intel Corporation;Intel Corporation;Intel Corporation;Intel Corporation;Intel Corporation
Venue:
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Year:
2009

Citing 19
Cited 9

Debugging concurrent processes: a case study

PLDI '88 Proceedings of the ACM SIGPLAN 1988 conference on Programming Language design and Implementation
Introduction to algorithms

Introduction to algorithms
Logical Time: Capturing Causality in Distributed Systems

Computer
Efficient Hardware Hashing Functions for High Performance Computers

IEEE Transactions on Computers
Time, clocks, and the ordering of events in a distributed system

Communications of the ACM
Space/time trade-offs in hash coding with allowable errors

Communications of the ACM
Supporting nondeterministic execution in fault-tolerant systems

FTCS '96 Proceedings of the The Twenty-Sixth Annual International Symposium on Fault-Tolerant Computing (FTCS '96)
A "flight data recorder" for enabling full-system multiprocessor deterministic replay

Proceedings of the 30th annual international symposium on Computer architecture
Pin: building customized program analysis tools with dynamic instrumentation

Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
BugNet: Continuously Recording Program Execution for Deterministic Replay Debugging

Proceedings of the 32nd annual international symposium on Computer Architecture
Bulk Disambiguation of Speculative Threads in Multiprocessors

Proceedings of the 33rd annual international symposium on Computer Architecture
Recording shared memory dependencies using strata

Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Race recording for multithreaded deterministic replay using multiprocessor hardware

Race recording for multithreaded deterministic replay using multiprocessor hardware
Cheap Hardware Parallelism Implies Cheap Security

FDTC '07 Proceedings of the Workshop on Fault Diagnosis and Tolerance in Cryptography
Implementing Signatures for Transactional Memory

Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Rerun: Exploiting Episodes for Lightweight Memory Race Recording

ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
DeLorean: Recording and Deterministically Replaying Shared-Memory Multiprocessor Execution Ef?ciently

ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Capo: a software-hardware interface for practical deterministic multiprocessor replay

Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
Kendo: efficient deterministic multithreading in software

Proceedings of the 14th international conference on Architectural support for programming languages and operating systems

ParaLog: enabling and accelerating online parallel monitoring of multithreaded applications

Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Understanding bloom filter intersection for lazy address-set disambiguation

Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures
Karma: scalable deterministic record-replay

Proceedings of the international conference on Supercomputing
CoreRacer: a practical memory race recorder for multicore x86 TSO processors

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
CCTR: An efficient point-to-point memory race recorder implemented in chunks

Microprocessors & Microsystems
Cyrus: unintrusive application-level record-replay for replay parallelism

Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
An efficient deterministic record-replay with separate dependencies

Computers and Electrical Engineering
QuickRec: prototyping an intel architecture extension for record and replay of multithreaded programs

Proceedings of the 40th Annual International Symposium on Computer Architecture
RelaxReplay: record and replay for relaxed-consistency multiprocessors

Proceedings of the 19th international conference on Architectural support for programming languages and operating systems

Quantified Score

Hi-index	0.00

Visualization

Abstract

Prior work on HW support for memory race recording piggybacks time stamps on coherence messages and logs the outcome of memory races using point-to-point or chunk-based approaches. These memory race recorder (MRR) techniques are effective, but they require modifications to the cache coherence protocol that can hurt performance. In addition, prior work has mostly focused on directory coherence and considered only CMP systems with single-level cache hierarchies. Most modern CMP systems shipped today, however, implement snoop coherence and feature multilevel cache hierarchies. To be practical, a MRR must target CMPs with multilevel caches, mitigate the coherence overhead due to piggybacking, and emphasize on replay speed to broaden applicability of deterministic replay. This paper contributes three new solutions for making chunk-based MRR practical for modern CMPs. We show that MRR interactions with a cache hierarchy can degrade performance and present a novel mechanism that mitigates this degradation. We propose new mechanisms for snoop-based caches that eliminate coherence traffic overhead due to piggybacking. We finally propose new techniques for improving replay speed and introduce a novel framework for evaluating the replay speed potential of MRR designs.