Efficient and correct execution of parallel programs that share memory
ACM Transactions on Programming Languages and Systems (TOPLAS)
Compiling programs with user parallelism
Selected papers of the second workshop on Languages and compilers for parallel computing
Detecting data races on weak memory systems
ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
Transactional memory: architectural support for lock-free data structures
ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Designing memory consistency models for shared-memory multiprocessors
Designing memory consistency models for shared-memory multiprocessors
The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Source-level debugging of scalar optimized code
PLDI '96 Proceedings of the ACM SIGPLAN 1996 conference on Programming language design and implementation
Proceedings of the ninth annual ACM symposium on Parallel algorithms and architectures
Generating representative Web workloads for network and server performance evaluation
SIGMETRICS '98/PERFORMANCE '98 Proceedings of the 1998 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
A system-level specification framework for I/O architectures
Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures
Weak ordering—a new definition
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Type-based race detection for Java
PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
Pointer and escape analysis for multithreaded programs
PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
A parameterized type system for race-free Java programs
OOPSLA '01 Proceedings of the 16th ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications
Ownership types for safe programming: preventing data races and deadlocks
OOPSLA '02 Proceedings of the 17th ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications
Speculative Sequential Consistency with Little Custom Storage
Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
ReEnact: using thread-level speculation mechanisms to debug data races in multithreaded codes
Proceedings of the 30th annual international symposium on Computer architecture
Stack allocation and synchronization optimizations for Java using escape analysis
ACM Transactions on Programming Languages and Systems (TOPLAS)
LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation
Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Transactional Memory Coherence and Consistency
Proceedings of the 31st annual international symposium on Computer architecture
Programming with transactional coherence and consistency (TCC)
ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
Proceedings of the 32nd ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Compiler techniques for high performance sequentially consistent java programs
Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming
Making Sequential Consistency Practical in Titanium
SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
Mechanisms for store-wait-free multiprocessors
Proceedings of the 34th annual international symposium on Computer architecture
BulkSC: bulk enforcement of sequential consistency
Proceedings of the 34th annual international symposium on Computer architecture
Automatically classifying benign and harmful data races using replay analysis
Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
Goldilocks: a race and transaction-aware java runtime
Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs
IEEE Transactions on Computers
Execution replay of multiprocessor virtual machines
Proceedings of the fourth ACM SIGPLAN/SIGOPS international conference on Virtual execution environments
Foundations of the C++ concurrency memory model
Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation
The PARSEC benchmark suite: characterization and architectural implications
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
FastTrack: efficient and precise dynamic race detection
Proceedings of the 2009 ACM SIGPLAN conference on Programming language design and implementation
Reactive NUCA: near-optimal block placement and replication in distributed caches
Proceedings of the 36th annual international symposium on Computer architecture
InvisiFence: performance-transparent memory ordering in conventional multiprocessors
Proceedings of the 36th annual international symposium on Computer architecture
SigRace: signature-based data race detection
Proceedings of the 36th annual international symposium on Computer architecture
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
DRFX: a simple and efficient memory model for concurrent programming languages
PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
Proceedings of the 37th annual international symposium on Computer architecture
Subspace snooping: filtering snoops with operating system support
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Efficient sequential consistency using conditional fences
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Transactional Memory, 2nd Edition
Transactional Memory, 2nd Edition
Efficient processor support for DRFx, a memory model with exceptions
Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
A case for an SC-preserving compiler
Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation
Increasing the effectiveness of directory caches by deactivating coherence for private memory blocks
Proceedings of the 38th annual international symposium on Computer architecture
Efficient sequential consistency via conflict ordering
ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
TSO_ATOMICITY: efficient hardware primitive for TSO-preserving region optimizations
Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
Fast RMWs for TSO: semantics and implementation
Proceedings of the 34th ACM SIGPLAN conference on Programming language design and implementation
Proceedings of the 27th international ACM conference on International conference on supercomputing
WeeFence: toward making fences free in TSO
Proceedings of the 40th Annual International Symposium on Computer Architecture
Fence-free work stealing on bounded TSO processors
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Hi-index | 0.00 |
Sequential consistency (SC) is arguably the most intuitive behavior for a shared-memory multithreaded program. It is widely accepted that language-level SC could significantly improve programmability of a multiprocessor system. However, efficiently supporting end-to-end SC remains a challenge as it requires that both compiler and hardware optimizations preserve SC semantics. While a recent study has shown that a compiler can preserve SC semantics for a small performance cost, an efficient and complexity-effective SC hardware remains elusive. Past hardware solutions relied on aggressive speculation techniques, which has not yet been realized in a practical implementation. This paper exploits the observation that hardware need not enforce any memory model constraints on accesses to thread-local and shared read-only locations. A processor can easily determine a large fraction of these safe accesses with assistance from static compiler analysis and the hardware memory management unit. We discuss a low-complexity hardware design that exploits this information to reduce the overhead in ensuring SC. Our design employs an additional unordered store buffer for fast-tracking thread-local stores and allowing later memory accesses to proceed without a memory ordering related stall. Our experimental study shows that the cost of guaranteeing end-to-end SC is only 6.2% on average when compared to a system with TSO hardware executing a stock compiler's output.