InvisiFence: performance-transparent memory ordering in conventional multiprocessors

Authors:
Colin Blundell; Milo M.K. Martin; Thomas F. Wenisch
Affiliations:
University of Pennsylvania, Philadelphia, PA, USA; University of Pennsylvania, Philadelphia, PA, USA; University of Michigan, Ann Arbor, MI, USA
Venue:
Proceedings of the 36th annual international symposium on Computer architecture
Year:
2009

Abstract

A multiprocessor's memory consistency model imposes ordering constraints among loads, stores, atomic operations, and memory fences. Even for consistency models that relax ordering among loads and stores, ordering constraints still induce significant performance penalties due to atomic operations and memory ordering fences. Several prior proposals reduce the performance penalty of strongly ordered models using post-retirement speculation, but these designs either (1) maintain speculative state at a per-store granularity, causing storage requirements to grow proportionally to speculation depth, or (2) employ distributed global commit arbitration using unconventional chunk-based invalidation mechanisms. In this paper we propose InvisiFence, an approach for implementing memory ordering based on post-retirement speculation that avoids these concerns. InvisiFence leverages minimalistic mechanisms for post-retirement speculation proposed in other contexts to (1) track speculative state efficiently at block-granularity with dedicated storage requirements independent of speculation depth, (2) provide fast commit by avoiding explicit commit arbitration, and (3) operate under a conventional invalidation-based cache coherence protocol. InvisiFence supports both modes of operation found in prior work: speculating only when necessary to minimize the risk of rollback-inducing violations or speculating continuously to decouple consistency enforcement from the processor core. Overall, InvisiFence requires approximately one kilobyte of additional state to transform a conventional multiprocessor into one that provides performance-transparent memory ordering, fences, and atomic operations.
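
The block-granularity tracking described in the abstract can be illustrated with a toy software model. This is only a sketch under stated assumptions, not the paper's hardware design: the class `ToySpeculativeCache`, its method names, the 64-byte block size, and the use of Python dictionaries and sets in place of cache arrays and per-block bits are all hypothetical. It shows three ideas from the abstract: tracking state grows with the number of blocks touched rather than the number of speculative stores, commit is a purely local clear with no arbitration, and an ordinary incoming invalidation to a tracked block is what triggers rollback.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 64  # bytes per cache block (assumed value, not from the paper)


@dataclass
class ToySpeculativeCache:
    """Hypothetical toy model of block-granularity speculation tracking."""
    memory: dict = field(default_factory=dict)       # committed values, by address
    spec_values: dict = field(default_factory=dict)  # stand-in for speculative data held in cache blocks
    spec_read: set = field(default_factory=set)      # blocks with a "speculatively read" mark
    spec_written: set = field(default_factory=set)   # blocks with a "speculatively written" mark
    speculating: bool = False

    @staticmethod
    def block_of(addr: int) -> int:
        return addr // BLOCK_SIZE

    def begin_speculation(self) -> None:
        self.speculating = True

    def load(self, addr: int) -> int:
        if self.speculating:
            self.spec_read.add(self.block_of(addr))
            if addr in self.spec_values:
                return self.spec_values[addr]
        return self.memory.get(addr, 0)

    def store(self, addr: int, value: int) -> None:
        if self.speculating:
            # One mark per block, however many stores hit that block.
            self.spec_written.add(self.block_of(addr))
            self.spec_values[addr] = value
        else:
            self.memory[addr] = value

    def commit(self) -> None:
        # Local commit: make speculative data visible and clear all marks.
        self.memory.update(self.spec_values)
        self._clear()

    def external_invalidate(self, addr: int) -> bool:
        """Model an incoming coherence invalidation; return True on rollback."""
        block = self.block_of(addr)
        if self.speculating and (block in self.spec_read or block in self.spec_written):
            self._clear()  # discard speculative data, keep committed state
            return True
        return False

    def _clear(self) -> None:
        self.spec_values.clear()
        self.spec_read.clear()
        self.spec_written.clear()
        self.speculating = False


cache = ToySpeculativeCache()
cache.begin_speculation()
for i in range(1000):
    cache.store(8, i)            # 1000 speculative stores to one block
marked_blocks = len(cache.spec_written)
rolled_back = cache.external_invalidate(8)
```

In this model, a thousand speculative stores to the same block leave a single marked block, and a conflicting invalidation discards the speculative data while leaving committed memory untouched.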