TSO_ATOMICITY: efficient hardware primitive for TSO-preserving region optimizations

Authors:
Cheng Wang;Youfeng Wu
Affiliations:
Intel Labs, Santa Clara, CA, USA;Intel Labs, Santa Clara, CA, USA
Venue:
Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
Year:
2013

Abstract

Program optimizations based on data dependences may not preserve a program's memory consistency. Previous work leverages a hardware ATOMICITY primitive to restrict thread interleaving and thereby preserve sequential consistency in region optimizations. However, the ATOMICITY primitive is overly restrictive on thread interleaving when optimizing real-world applications developed for the popular Total-Store-Ordering (TSO) memory consistency model, which is weaker than sequential consistency. In this paper, we present a novel hardware TSO_ATOMICITY primitive, which restricts thread interleaving less than the ATOMICITY primitive and thus permits more efficient program execution, yet still preserves TSO memory consistency in all region optimizations. Furthermore, the TSO_ATOMICITY primitive requires architecture support similar to that of the ATOMICITY primitive and can be implemented with only slight changes to an existing ATOMICITY primitive implementation. Our experimental results show that, in a state-of-the-art dynamic binary optimization system on a large set of workloads, the ATOMICITY primitive improves performance by only 4% on average. The TSO_ATOMICITY primitive reduces the overhead associated with the ATOMICITY primitive and improves performance by 12% on average.
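
The claim that TSO is weaker than sequential consistency can be illustrated with the classic store-buffering litmus test. The sketch below is a toy model, not the paper's mechanism: it models neither ATOMICITY nor TSO_ATOMICITY, and the names `PROGRAM` and `outcomes` are illustrative assumptions. It enumerates every interleaving of two threads, once with stores going directly to memory (sequential consistency) and once with per-thread FIFO store buffers (a common operational view of TSO).

```python
# Store-buffering litmus test: each thread stores to one variable,
# then loads the other. Both variables start at 0.
PROGRAM = (
    (("store", "x", 1), ("load", "y", "r0")),  # thread 0
    (("store", "y", 1), ("load", "x", "r1")),  # thread 1
)


def outcomes(use_store_buffers):
    """Return every (r0, r1) result reachable over all interleavings.

    use_store_buffers=False: stores hit memory at once (SC).
    use_store_buffers=True: stores enter a per-thread FIFO buffer and
    drain to memory at an arbitrary later point (toy TSO model).
    """
    results = set()

    def step(pcs, mem, bufs, regs):
        finished = all(pc == len(PROGRAM[t]) for t, pc in enumerate(pcs))
        if finished and not any(bufs):
            results.add((regs["r0"], regs["r1"]))
            return
        for t in range(len(PROGRAM)):
            # Choice 1: drain thread t's oldest buffered store to memory.
            if bufs[t]:
                addr, val = bufs[t][0]
                new_bufs = list(bufs)
                new_bufs[t] = bufs[t][1:]
                step(pcs, {**mem, addr: val}, tuple(new_bufs), regs)
            # Choice 2: execute thread t's next instruction.
            if pcs[t] < len(PROGRAM[t]):
                op, addr, arg = PROGRAM[t][pcs[t]]
                new_pcs = list(pcs)
                new_pcs[t] += 1
                new_pcs = tuple(new_pcs)
                if op == "store" and use_store_buffers:
                    new_bufs = list(bufs)
                    new_bufs[t] = bufs[t] + ((addr, arg),)
                    step(new_pcs, mem, tuple(new_bufs), regs)
                elif op == "store":
                    step(new_pcs, {**mem, addr: arg}, bufs, regs)
                else:
                    # A load sees the thread's own youngest buffered store
                    # to that address, otherwise the value in memory.
                    own = [v for a, v in bufs[t] if a == addr]
                    val = own[-1] if own else mem[addr]
                    step(new_pcs, mem, bufs, {**regs, arg: val})

    step((0, 0), {"x": 0, "y": 0}, ((), ()), {})
    return results


if __name__ == "__main__":
    sc = outcomes(use_store_buffers=False)
    tso = outcomes(use_store_buffers=True)
    print("SC outcomes: ", sorted(sc))
    print("TSO-only outcomes:", sorted(tso - sc))
```

In this model the only extra outcome under store buffering is `r0 == r1 == 0`, where both loads execute before either buffered store becomes visible. A region optimization that must preserve TSO may therefore tolerate interleavings that a primitive enforcing sequential consistency would forbid.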